R-bloggers: R news and tutorials contributed by hundreds of R bloggers

Debug Java in Bio7

Wed, 06/27/2018 - 14:33

(This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers)

27.06.2018

In Bio7, R and Java code can be easily combined on the Bio7 platform: for instance, you can create graphical user interfaces in Java (with SWT or JavaFX), transfer image pixel/selection data from ImageJ, and finally do the analysis in R code/scripts called from within Java (an easy-to-use API on top of Rserve is available in Bio7; see, e.g., the Groovy examples here).

However, sometimes it is necessary to debug the Java code that is part of such combined work. In Bio7 (which is based on the Eclipse RCP) the very powerful Java debugging tools from Eclipse are available by default.

In addition, Bio7 makes it possible to debug dynamically compiled Java classes in the same Bio7 process (using Java like a scripting language is a special feature of Bio7).

The video at the bottom of this page demonstrates the use of a remote debugging connection on the same computer (localhost connection) to debug dynamically compiled/executed Java code.

Debugging a Java process which has already started (Bio7 in this case) is only possible if Bio7 is launched beforehand from a shell with the following Java arguments, which open a debug server connection:

Windows (started from the current directory = Bio7 installation directory):
Bio7.exe -vmargs -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000

Mac (given the full path to the Bio7 executable):
/Applications/Bio7.app/Contents/MacOS/Bio7 -vmargs -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000

Linux (started from the current directory = Bio7 installation directory):
./Bio7 -vmargs -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000

Furthermore, the compiler debug information option has to be enabled in the Bio7 Java preferences (for the dynamic compilation process), and a separate debug connection has to be started for this Java class (with the available Eclipse debug configuration dialog).

In the video below I (hopefully) demonstrate how you can (graphically) debug a simple cellular automaton simulation, such as the Game of Life, within a running Bio7 (Java) process.

" data-embed-play="Play VideoThis video will be embedded from Youtube. The apply."> Please activate JavaScript to view this video.
Video-Link:
https://www.youtube.com/watch?v=vKAGBGuQKlI var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));


Game of Friendship Paradox

Wed, 06/27/2018 - 09:13

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

In the introduction to my course next week, I will (briefly) mention networks, and I wanted to provide some illustration of the Friendship Paradox. In the Network of Thrones work (discussed in Beveridge and Shan (2016)), there is a dataset with the network of characters in Game of Thrones. The word “friend” might be an abuse of language here, but let’s continue to call connected nodes “friends”. The friendship paradox states that

People on average have fewer friends than their friends

This was discussed in Feld (1991) for instance, or Zuckerman & Jost (2001). Let’s try to see what it means here. First, let us get a copy of the dataset

download.file("https://www.macalester.edu/~abeverid/data/stormofswords.csv", "got.csv")
GoT = read.csv("got.csv")
library(networkD3)
simpleNetwork(GoT[, 1:2])

Because it is difficult for me to incorporate a d3js script in the blog, I will illustrate the ideas with a more basic graph.

Consider a vertex \(v\) in the undirected graph \(G=(V,E)\) (with classical graph notations), and let \(d(v)\) denote the number of edges touching it (i.e. \(v\) has \(d(v)\) friends). The average number of friends of a random person in the graph is
\[ \mu = \frac{1}{|V|}\sum_{v\in V} d(v) = \frac{2|E|}{|V|}. \]
The average number of friends that a typical friend has is
\[ \frac{1}{|V|}\sum_{v\in V} \left(\frac{1}{d(v)}\sum_{v'\in E_v} d(v')\right). \]
But
\[ \sum_{v\in V} \left(\frac{1}{d(v)}\sum_{v'\in E_v} d(v')\right)=\sum_{\{v,v'\} \in E} \left(\frac{d(v')}{d(v)}+\frac{d(v)}{d(v')}\right) \geq \sum_{\{v,v'\} \in E} 2 = \sum_{v\in V} d(v), \]
since \(x/y + y/x \geq 2\) for positive \(x\) and \(y\). Thus, on average, a friend has at least as many friends as a randomly chosen person.
Note that this can be related to the variance decomposition \(\mathbb{E}[D^2] = \mathbb{E}[D]^2 + \text{Var}(D)\), i.e. \(\mathbb{E}[D^2]/\mathbb{E}[D] \geq \mathbb{E}[D]\) (Jensen inequality). But let us get back to our network. The list of nodes is

M = rbind(as.matrix(GoT[, 1:2]), as.matrix(GoT[, 2:1]))
nodes = unique(M[, 1])

and for each of them, we can get the list of friends, and the number of friends

friends = function(x) as.character(M[which(M[, 1] == x), 2])
nb_friends = Vectorize(function(x) length(friends(x)))

as well as the number of friends friends have, and the average number of friends

friends_of_friends = function(y) (Vectorize(function(x) length(friends(x)))(friends(y)))
nb_friends_of_friends = Vectorize(function(x) mean(friends_of_friends(x)))

We can look at the density of the number of friends, for a random node,

Nb = nb_friends(nodes)
Nb2 = nb_friends_of_friends(nodes)
hist(Nb, breaks = 0:40, col = rgb(1, 0, 0, .2), border = "white", probability = TRUE)
hist(Nb2, breaks = 0:40, col = rgb(0, 0, 1, .2), border = "white", probability = TRUE, add = TRUE)
lines(density(Nb), col = "red", lwd = 2)
lines(density(Nb2), col = "blue", lwd = 2)


and we can also compute the averages, just to check

mean(Nb)
## [1] 6.579439
mean(Nb2)
## [1] 13.94243

So, indeed, people on average have fewer friends than their friends.
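As a small side check (my own addition, not part of the original post), the variance-decomposition remark above can be verified directly from the degree vector Nb: the mean degree of a randomly chosen friend (a random end of a random edge) is the size-biased mean \(\mathbb{E}[D^2]/\mathbb{E}[D] = \mu + \text{Var}(D)/\mu\), which can never fall below \(\mu\).

mu   <- mean(Nb)
pvar <- mean(Nb^2) - mean(Nb)^2   # population variance of the degrees
mean(Nb^2) / mean(Nb)             # mean degree of a randomly chosen friend (size-biased)
mu + pvar / mu                    # the same number, via the variance decomposition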



One R Package a Day

Wed, 06/27/2018 - 08:00

(This article was first published on stevenmortimer.com, and kindly contributed to R-bloggers)

Discover one R package a day by following the @RLangPackage account on Twitter! Inspired by One R Tip Day (@RLangTip) I created a Twitter bot that will feature one package with its description and GitHub URL each day. The R programming language (referred to as #rstats on Twitter) now has over 12,600 packages. Tweeting one each day would take more than 34 years to get through them all and that doesn’t even consider the rapid growth of new packages as shown in this post.

The purpose of the @RLangPackage Twitter account is to feature the diversity of packages that R has to offer. Most users have heard of ggplot2 and dplyr, but there are thousands of other great packages waiting to be discovered. In order to highlight those I took a list of all packages on CRAN and filtered it down to ones with anywhere between 5,000 and 1,000,000 CRAN downloads. I then took a subset of that with anywhere between 10 and 1,000 stars on GitHub. I decided to focus on these mildly popular packages with source code hosted on GitHub so that I can embed the link in the tweet and promote packages with code that people have already started to explore. After following those subsetting rules, one package is selected at random after narrowing to a list of packages that haven’t already been tweeted. I am hosting the bot on Heroku and using Heroku Scheduler to send out a tweet each day at 10:30am UTC or 6:30am Eastern Time. Below the credits and resources is the sloppily written Python that’s currently being hosted on Heroku and executed.
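For readers who prefer R, here is a rough sketch of the same selection rules (mine, not the bot's actual code, which is the Python script at the end of this post); it assumes the same CSV of package metadata that the script loads, with columns name, description, stars, downloads and github_url.

url <- "https://raw.githubusercontent.com/StevenMMortimer/one-r-package-a-day/master/r-package-star-download-data.csv"
pkgs <- read.csv(url, stringsAsFactors = FALSE)
# keep the "mildly popular" middle ground described above
candidates <- subset(pkgs, !is.na(downloads) & downloads >= 5000 & downloads <= 1000000 &
                           !is.na(stars) & stars >= 10 & stars <= 1000)
# pick one at random (the bot additionally excludes packages it has already tweeted)
candidates[sample(nrow(candidates), 1), c("name", "github_url")]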

Credits

I found a couple of different blog posts on creating a Twitter bot with R, namely here and here, but they didn’t involve deploying on Heroku or another cloud service. However, there were plenty of Python-based Twitter bot tutorials, so I took full advantage of them and went the Python route. Below is the list of resources I considered while figuring out how to deploy the app, which Twitter package to use, and some basic Python syntax which, embarrassingly, I should know by now. I think most users will gravitate towards the tweepy Python library, but I found that it throws an error when using it with Conda 3.6. The issue is documented on GitHub: https://github.com/tweepy/tweepy/issues/894.

Resources

Full Script

# script.py
from os import environ
from os.path import join, dirname
from dotenv import load_dotenv
from re import sub
import pandas
from TwitterAPI import TwitterAPI, TwitterPager

# create .env file path
dotenv_path = join(dirname(__file__), '.env')
# load file from the path
load_dotenv(dotenv_path)

if __name__ == "__main__":

    # connect to api
    api = TwitterAPI(consumer_key=environ['TWITTER_CONSUMER_KEY'],
                     consumer_secret=environ['TWITTER_CONSUMER_SECRET'],
                     access_token_key=environ['TWITTER_ACCESS_TOKEN'],
                     access_token_secret=environ['TWITTER_ACCESS_TOKEN_SECRET'])

    # scrape all prior tweets to check which packages I've already tweeted about
    SCREEN_NAME = 'RLangPackage'
    pager = TwitterPager(api, 'statuses/user_timeline',
                         {'screen_name': SCREEN_NAME, 'count': 100})

    # parse out the package name that occurs before the hyphen at the beginning
    previous_pks = []
    for item in pager.get_iterator(wait=3.5):
        if 'text' in item:
            this_pkg = sub("^(\w+) - (.*)", "\\1", item['text'])
            previous_pks.append(this_pkg)

    # convert the package names to a dataframe
    prev_df = pandas.DataFrame({'name': previous_pks})
    prev_df.set_index('name')

    # load the data I've compiled on R packages
    url = "https://raw.githubusercontent.com/StevenMMortimer/one-r-package-a-day/master/r-package-star-download-data.csv"
    all_df = pandas.read_csv(url)
    all_df.set_index('name')

    # do an "anti join" to throw away previously tweeted rows
    all_df = pandas.merge(all_df, prev_df, how='outer', indicator=True)
    all_df = all_df[all_df['_merge'] == 'left_only']

    # focus on packages in middle ground of downloads and stars
    filtered_df = all_df[all_df['stars'].notnull()]
    filtered_df = filtered_df[filtered_df['stars'].between(10, 1000)]
    filtered_df = filtered_df[filtered_df['downloads'].notnull()]
    filtered_df = filtered_df[filtered_df['downloads'].between(5000, 1000000)]

    # randomly select one of the remaining rows
    selected_pkg = filtered_df.sample(1)

    # pull out the name and description to see if we need to
    # truncate because of Twitter's 280 character limit
    prepped_name = selected_pkg.iloc[0]['name']
    prepped_desc = sub('\s+', ' ', selected_pkg.iloc[0]['description'])
    name_len = len(prepped_name)
    desc_len = len(prepped_desc)

    # 280 minus 3 for " - ", then minus 23 because links are counted as such,
    # then minus 9 for the " #rstats " hashtag
    if desc_len <= (280-3-23-9-name_len):
        prepped_desc = prepped_desc[0:desc_len]
    else:
        prepped_desc = prepped_desc[0:(280-6-23-9-name_len)] + "..."

    # cobble together the tweet text
    TWEET_TEXT = prepped_name + " - " + prepped_desc + " #rstats " + selected_pkg.iloc[0]['github_url']
    print(TWEET_TEXT)

    # tweet it out to the world!
    r = api.request('statuses/update', {'status': TWEET_TEXT})
    print('SUCCESS' if r.status_code == 200 else 'PROBLEM: ' + r.text)


Text-to-speech with R

Wed, 06/27/2018 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Computers started talking to us! They do this with so called Text-to-Speech (TTS) systems. With neural nets, deep learning and lots of training data, these systems have gotten a whole lot better in recent years. In some cases, they are so good that you can’t distinguish between human and machine voice.

In one of our recent codecentric.AI videos, we compared different Text-to-Speech systems (the video is in German, though – but the text snippets and their voice recordings we show in the video are a mix of German and English). In this video, we had a small contest between Polly, Alexa, Siri and co. to find out who best speaks different tongue twisters.

Here, I want to find out what’s possible with R and Text-to-Speech packages.

How does TTS work?

Challenges for good TTS systems are the complexity of the human language: we intone words differently, depending on where they are in a sentence, what we want to convey with that sentence, how our mood is, and so on. AI-based TTS systems can take phonemes and intonation into account.

There are different ways to artificially produce speech. A very important method is Unit Selection synthesis. With this method, text is first normalized and divided into smaller entities that represent sentences, syllables, words, phonemes, etc. The structure (e.g. the pronunciation) of these entities is then learned in context. We call this part Natural Language Processing (NLP). Usually, these learned segments are stored in a database (either as human voice recordings or synthetically generated) that can be searched to find suitable speech parts (Unit Selection). This search is often done with decision trees, neural nets or Hidden-Markov-Models.

If the speech has been generated by a computer, this is called formant synthesis. It offers more flexibility because the collection of words isn’t limited to what has been pre-recorded by a human. Even imaginary or new words can easily be produced and the voices can be readily exchanged. Until recently, this synthetic voice did not sound anything like a human recorded voice; you could definitely hear that it was “fake”. Most of the TTS systems today still suffer from this, but this is in the process of changing: there are already a few artificial TTS systems that do sound very human.

What TTS systems are there?

We already find TTS systems in many digital devices, like computers, smart phones, etc. Most of the “big players” offer TTS-as-a-service, but there are also many “smaller” and free programs for TTS. Many can be downloaded as software or used from a web browser or as an API. Here is an incomplete list:

Text-to-Speech in R

The only package for TTS I found was Rtts. It doesn’t seem very comprehensive but it does the job of converting text to speech. The only API that works right now is ITRI (http://tts.itri.org.tw). And it only supports English and Chinese.

Let’s try it out!

library(Rtts)
## Loading required package: RCurl
## Loading required package: bitops

Here, I’ll be using a quote from DOUGLAS ADAMS’ THE HITCHHIKER’S GUIDE TO THE GALAXY:

content <- "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."

The main TTS function is tts_ITRI() and I’m going to loop over the different voice options.

speakers = c("Bruce", "Theresa", "Angela", "MCHEN_Bruce", "MCHEN_Joddess", "ENG_Bob", "ENG_Alice", "ENG_Tracy") lapply(speakers, function(x) tts_ITRI(content, speaker = x, destfile = paste0("audio_tts_", x, ".mp3")))

I uploaded the results to Soundcloud for you to hear: audio-tts-bruce, audio-tts-theresa, audio-tts-angela, audio-tts-mchen-bruce, audio-tts-mchen-joddess, audio-tts-eng-bob, audio-tts-eng-alice, audio-tts-eng-tracy.

As you can hear, it sounds quite wonky. There are many better alternatives out there, but most of them aren’t free and/or can’t be used (as easily) from R. Noam Ross tried IBM Watson’s TTS API in this post, which would be a very good solution. Or you could access the Google Cloud API from within R.

The most convenient solution for me was to use eSpeak from the command line. The output sounds relatively good, it is free and offers many languages and voices with lots of parameters to tweak. This is how you would produce audio from text with eSpeak:

  • English US
espeak -v english-us -s 150 -w '/Users/shiringlander/Documents/Github/audio_tts_espeak_en_us.wav' "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."
  • just for fun: English Scottish
espeak -v en-scottish -s 150 -w '/Users/shiringlander/Documents/Github/audio_tts_espeak_en-scottish.wav' "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."
  • even funnier: German
espeak -v german -s 150 -w '/Users/shiringlander/Documents/Github/audio_tts_espeak_german.wav' "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."
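If you would rather stay inside R, a minimal wrapper around the same command line is easy to write (my own sketch; it assumes eSpeak is installed and on the PATH):

tts_espeak <- function(text, voice = "english-us", speed = 150, outfile = "tts_output.wav") {
  # shell out to eSpeak: -v selects the voice, -s the speaking speed, -w the output WAV file
  system2("espeak", args = c("-v", voice, "-s", speed, "-w", shQuote(outfile), shQuote(text)))
  invisible(outfile)
}
tts_espeak(content)   # 'content' is the Douglas Adams quote defined above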

The playlist contains all audio files I generated in this post.

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] Rtts_0.3.3      RCurl_1.95-4.10 bitops_1.0-6
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17    bookdown_0.7    digest_0.6.15   rprojroot_1.3-2
##  [5] backports_1.1.2 magrittr_1.5    evaluate_0.10.1 blogdown_0.6
##  [9] stringi_1.2.3   rmarkdown_1.10  tools_3.5.0     stringr_1.3.1
## [13] xfun_0.2        yaml_2.1.19     compiler_3.5.0  htmltools_0.3.6
## [17] knitr_1.20


Bayesian GANs [#2]

Wed, 06/27/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

As an illustration of the lack of convergence of the Gibbs sampler applied to the two “conditionals” defined in the Bayesian GANs paper discussed yesterday, I took the simplest possible example of a Normal mean generative model (one parameter) with a logistic discriminator (one parameter) and implemented the scheme (during an ISBA 2018 session), with flat priors on both parameters and a Normal random walk as the Metropolis-Hastings proposal. As expected, since there is no stationary distribution associated with the Markov chain, the simulated chains do not exhibit a stationary pattern.

They eventually reach an overflow error or a trapping state as the log-likelihood gets arbitrarily close to zero (red curve).
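For concreteness, here is a rough reconstruction of the kind of alternating Metropolis-within-Gibbs scheme described above (my own sketch, not the code behind the plots; the exact form of the two unnormalised “conditionals” is my reading of the Bayesian GAN setup, with flat priors and a one-parameter logistic discriminator \(D_w(x)=1/(1+e^{-wx})\)):

set.seed(42)
x <- rnorm(100, mean = 2)                    # "real" data from a Normal mean model

# unnormalised log-"conditionals": the discriminator should score real data high
# and fakes low; the generator should make its fakes score high
logp_d <- function(w, xfake)
  sum(plogis(w * x, log.p = TRUE)) + sum(plogis(w * xfake, lower.tail = FALSE, log.p = TRUE))
logp_g <- function(mu, w, z) sum(plogis(w * (mu + z), log.p = TRUE))

niter <- 1e4; mu <- 0; w <- 0.1; chain <- matrix(NA, niter, 2)
for (t in 1:niter) {
  z <- rnorm(length(x))                      # fresh generator noise each sweep
  # random-walk Metropolis update of the discriminator parameter w
  wp <- w + rnorm(1, sd = 0.3)
  if (log(runif(1)) < logp_d(wp, mu + z) - logp_d(w, mu + z)) w <- wp
  # random-walk Metropolis update of the generator parameter mu
  mup <- mu + rnorm(1, sd = 0.3)
  if (log(runif(1)) < logp_g(mup, w, z) - logp_g(mu, w, z)) mu <- mup
  chain[t, ] <- c(mu, w)
}
matplot(chain, type = "l", lty = 1, xlab = "iteration", ylab = "parameter value")

The interest is in the trace plots: since the two conditionals are not compatible with any joint distribution, there is no stationary behaviour to converge to.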

Too bad I missed the talk by Shakir Mohammed yesterday, being stuck on the Edinburgh by-pass at rush hour!, as I would have loved to hear his views about this rather essential issue…



New R package flatxml: working with XML files as R dataframes

Tue, 06/26/2018 - 23:20

(This article was first published on Topics in R, and kindly contributed to R-bloggers)

The world is flat

The new R package flatxml provides functions to easily deal with XML files. When parsing an XML document, fxml_importXMLFlat produces a special dataframe that is ‘flat’ by its very nature but contains all necessary information about the hierarchical structure of the underlying XML document (for details on the dataframe see the reference for the fxml_importXMLFlat function). flatxml offers a set of functions to work with this dataframe.

Apart from representing the XML document in a dataframe structure, there is yet another way in which flatxml relates to dataframes: the fxml_toDataFrame function can be used to extract data from an XML document into a dataframe, e.g. to work on the data with statistical functions. Because in this case there is no need to represent the XML document structure as such (it’s all about the data contained in the document), there is no representation of the hierarchical structure of the document any more; it’s just a normal dataframe.

Each XML element, for example <tag attribute="some value">Here is some text</tag>, has certain characteristics that can be accessed via the flatxml interface functions, after an XML document has been imported with fxml_importXMLFlat. These characteristics are:

  • value: The (text) value of the element, “Here is some text” in the example above
  • attributes: The XML attributes of the element, here the attribute attribute with its value “some value” in the example above
  • children: The elements on the next lower hierarchical level
  • parent: The element of the next higher hierarchical level, i.e. the element to which the current element is a child
  • siblings: The elements on the same hierarchical level as the current element

Structure of the flatxml interface

The flatxml interface to access these characteristics follows a simple logic: for each of the characteristics there are typically three functions available:

  • fxml_has…(): Determines if the current XML element has (at least one instance of) the characteristic
  • fxml_num…(): Returns the number of instances of the respective characteristic for the current XML element (e.g. the number of child elements)
  • fxml_get…(): Returns (the IDs of) the respective characteristics of the current XML element (e.g. the children of the current element)
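A minimal usage sketch (mine, not from the announcement; I am assuming fxml_importXMLFlat() accepts a path to an XML file, as the package description suggests):

library(flatxml)
# write a tiny XML document to a temporary file
xml_file <- tempfile(fileext = ".xml")
writeLines('<root><tag attribute="some value">Here is some text</tag></root>', xml_file)
# import it as a 'flat' dataframe; the hierarchy is kept in dedicated columns
doc <- fxml_importXMLFlat(xml_file)
head(doc)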

Learn more

For more information on the flatxml package please go to http://www.zuckarelli.de/flatxml/index.html.



May 2018: “Top 40” New Packages

Tue, 06/26/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

While looking over the 215 or so new packages that made it to CRAN in May, I was delighted to find several packages devoted to subjects a little bit out of the ordinary; for instance, bioacoustics analyzes audio recordings, freegroup looks at some abstract mathematics, RQEntangle computes quantum entanglement, stemmatology analyzes textual or musical traditions, and treedater estimates clock rates for evolutionary models. I take this as evidence that R is expanding beyond its traditional strongholds of statistics and finance as people in other fields with serious analytic and computational requirements become familiar with the language. And, when I see a package from a philologist and scholar of “Ancient and Medieval Worlds”, I am persuaded to think that R is making a unique contribution to computational literacy.

Below are my “Top 40” package picks for May 2018, organized into the following categories: Computational Methods, Data, Data Science, Finance, Mathematics, Music, Science, Statistics, Time Series, Utilities and Visualization.

Computational Methods

dqrng v0.0.4: Provides fast random number generators with good statistical properties, including the 64-bit variant of the Mersenne-Twister, pcg64, and Xoroshiro128 and Xoroshiro256. There is an Introduction and a vignette on Parallel Usage.

optimParallel v0.7-2: Provides a parallel version of the gradient-based stats::optim() function. The vignette is informative.
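A minimal sketch of how this is typically used (my example, based on the package's stated interface, which mirrors stats::optim(); treat the details as an assumption rather than gospel):

library(optimParallel)
library(parallel)
cl <- makeCluster(2)        # a small cluster of worker processes
setDefaultCluster(cl)
# same call pattern as optim(); function and gradient evaluations are parallelized
optimParallel(par = c(1, 1), fn = function(x) sum((x - c(3, 5))^2))
stopCluster(cl)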

Data

childesr v0.1.0: Implements an interface to CHILDES, an open repository for transcripts of parent-child interaction. There is a vignette.

PetfindeR v1.1.3: Is a wrapper of the Petfinder API that implements methods for interacting with and extracting data from the Petfinder database, one of the largest online, searchable databases of adoptable animals and animal welfare organizations across North America. See the Getting Started Guide: part1 and part2.

Data Science

catch v1.0: Provides functions to perform classification and variable selection on high-dimensional tensors (multi-dimensional arrays) after adjusting for additional covariates (scalar or vectors), using the CATCH model of Pan, Mai and Zhang (2018).

SemiSupervised v1.0: Implements several safe graph-based semi-supervised learning algorithms. For technical details, refer to Culp and Ryan (2013), Ryan and Culp (2015) and the package vignette.

spFSR v1.0.0: Offers functions to perform feature selection and ranking via simultaneous perturbation stochastic approximation (SPSA-FSR) based on works by Aksakalli and Malekipirbazari (2015) and Yenice et al. (2018). See the Introduction.

Finance

PortfolioAnalytics v1.1.0: Provides functions for portfolio analysis, including numerical methods for portfolio optimization. There is an Introduction and vignettes on Optimization, Custom Moment and Objective Functions, and Portfolio Optimization with CVaR Budgets.

sparseIndexTracking v0.1.0: Provides functions to compute sparse portfolios for financial index tracking, i.e., joint selection of a subset of the assets that compose the index and computation of their relative weights (capital allocation) based on the paper Benidis et al. (2018). The vignette shows how to design a portfolio to track an index.

Mathematics

freegroup v1.0: Provides functions to manipulate elements of the free group, including inversion, multiplication by a scalar, the group-theoretic power operation, and Tietze forms. See the vignette for details.

ODEsensitivity v1.1.1: Provides functions to perform sensitivity analysis for ordinary differential equation (ODE) models using the ODEnetwork package, which simulates a network of second-order ODEs. See Weber et al. (2018) (https://eldorado.tu-dortmund.de/handle/2003/36875) for details, and the vignette to get started.

Music

stemmatology v0.3.1: Allows users to explore and analyze the genealogy of textual or musical traditions from their variants, with various stemmatological methods, mainly the disagreement-based algorithms suggested by Camps and Cafiero (2015). The vignette provides details.

Science

bioacoustics v0.1.2: Provides functions to analyze audio recordings and automatically extract animal vocalizations. Contains all the necessary tools to process audio recordings of various formats (e.g., WAV, WAC, MP3, ZC), filter noisy files, display audio signals, and detect and extract automatically acoustic features for further analysis such as classification. The vignette provides an example.

epiphy v0.3.4: Provides a toolbox for analyzing plant disease epidemics and a common framework for plant disease intensity data recorded over time and/or space. There is a vignette on Definitions and Relationships between Parameters and another with Examples.

treedater v0.2.0: Offers functions for estimating times of common ancestry and molecular clock rates of evolution using a variety of evolutionary models. See Volz and Frost (2017). The vignette provides an example using the Influenza H3N2 virus.

RQEntangle v0.1.0: Provides functions to compute the Schmidt decomposition of bipartite quantum systems, discrete or continuous, and their respective entanglement metrics. See Ekert and Knight (1995) and the vignettes Entanglement in Coupled Harmonics and Entanglement between Two Coupled Two-Level Systems for details.

Statistics

glmmboot v0.1.2: Provides two functions to perform bootstrap resampling for most models that update() works for. BootGlmm() performs block resampling if random effects are present, and case resampling if not; BootCI() converts output from bootstrap model runs into confidence intervals and p-values. See Humphrey and Swingley (2018) for the details and the vignette to get started.

glmmEP v1.0-1: Allows users to perform generalized linear mixed model analysis via expectation propagation. In this version, the random effects can be any reasonable dimension; however, only probit mixed models with one level of nesting are supported. See the methodology in Hall et al. (2018), and the user manual in the vignette.

groupedSurv v1.0.1: Provides Rcpp-based functions to compute the efficient score statistics for grouped time-to-event data (Prentice and Gloeckler (1978)), with the optional inclusion of baseline covariates. The vignette gives an example.

modeldb v0.1.0: Provides functions to fit models inside databases with dplyr backends. There are vignettes showing how to implement linear regression and kmeans models.

MLZ v0.1.1: Provides estimation functions and diagnostic tools for mean length-based total mortality estimators based on Gedamke and Hoenig (2006). There is a vignette.

NFWdist v0.1.0: Provides density, distribution function, quantile function, and random generation for the 3D Navarro, Frenk & White (NFW) profile. For details see Robotham & Howlett (2018) and the vignette.

survBootOutliers v1.0: Offers three new concordance-based methods for outlier detection in a survival context. The methodology is described in two papers by Pinto J., Carvalho A. and Vinga S.: paper1 and paper2. The vignette provides an introduction.

vinereg v0.3.0: Implements D-vine quantile regression models with parametric or non-parametric pair-copulas. See Kraus and Czado (2017) and Schallhorn et al. (2017). There is a vignette showing how to use the package, and another covering an Analysis of bike rental data.

Time Series

ASSA v1.0: Provides functions to model and decompose time series into principal components using singular spectrum analysis. See de Carvalho and Rua (2017) and de Carvalho et al (2012).

DTWBI v1.0: Provides functions to impute large gaps within time series based on Dynamic Time Warping methods. See Phan et al. (2017). The website has examples.

permutes v0.1: Uses permutation testing (Maris & Oostenveld (2007)) to help determine optimal windows for analyzing densely-sampled time series. The vignette provides details.

Utilities

bench v1.0.1: Provides tools to benchmark and analyze execution times for R expressions.
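A quick sketch of the core function (my example): bench::mark() times each expression, checks that their results agree, and reports memory allocations alongside the timings.

library(bench)
bench::mark(
  vectorised = sum((1:1e5)^2),
  vapply     = sum(vapply(1:1e5, function(i) i^2, numeric(1)))
)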

conflicted v0.1.0: Provides an alternative to R’s default conflict-resolution strategy for R packages.
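For example (my sketch): with conflicted attached, an ambiguous call such as filter() errors instead of silently picking a winner, and you declare explicitly which package should win.

library(conflicted)
library(dplyr)                     # dplyr::filter() now clashes with stats::filter()
conflict_prefer("filter", "dplyr") # declare the winner once, up front
mtcars %>% filter(cyl == 4)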

diffdf v1.0.0: Provides functions for comparing two data frames. The vignette describes how to use the package.

pkgdown v1.1.0: Provides functions to generate a website from a source package by converting your documentation, vignettes, README, and more to HTML. See the vignette.

rtika v0.1.8: Provides functions to extract text or metadata from over a thousand file types, using Apache Tika to produce either plain-text or structured XHTML content. See the vignette.

tabulizer v0.2.2: Provides bindings for the Tabula Java library, which can extract tables from PDF documents. The tabulizerjars package provides versioned .jar files. There is an Introduction.

shinytest v1.3.0: Enables automated testing of Shiny applications, using a headless browser.

vcr v0.1.0: A port of the Ruby gem of the same name, vcr enables users to record test suite HTTP requests and replay them during future runs. There is an Introduction and vignettes on Configuration and Request Matching.

Visualization

c3 v0.2.0: Implements a wrapper (htmlwidget) for the C3.js charting library that includes all types of C3.js plots, enabling interactive web-based charts to be embedded in R Markdown documents and Shiny applications. The vignette shows basic usage.

chromoMap v0.1: Provides interactive, configurable graphics visualizations of human chromosomes, allowing users to map chromosome elements (like genes, SNPs, etc.) on the chromosome plot, and introduces a special plot, the “chromosome heatmap”, which enables visualizing data associated with chromosome elements. See the vignette.

ExPanDaR v0.2.0: Provides a shiny-based front end and a set of functions for exploratory panel data analysis. Run as a web-based app, it enables users to assess the robustness of empirical evidence without providing them access to the underlying data. There is a vignette on using the package and another on panel data exploration.

ggdistribute v1.0.1: Extends ggplot2 for plotting posterior or other types of unimodal distributions that require overlaying information about a distribution’s intervals. The vignette provides examples.

r2d3 v0.2.2: Provides a suite of tools for using the D3 library to produce dynamic, interactive data visualizations. There are vignettes on Advanced Rendering with Callbacks, R to D3 Data Conversion, CSS & JavaScript Dependencies, Learning D3, Package Development, and D3 Visualization Options.




Shiny 1.1.0: Scaling Shiny with async

Tue, 06/26/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

This is a significant release for Shiny, with a major new feature that was nearly a year in the making: support for asynchronous operations!

Without this capability, when Shiny performs long-running calculations or tasks on behalf of one user, it stalls progress for all other Shiny users that are connected to the same process. Therefore, Shiny apps that feature long-running calculations or tasks have generally been deployed using many R processes, each serving a small number of users; this works, but is not the most efficient approach. Such applications now have an important new tool in the toolbox to improve performance under load.

Shiny async is implemented via integration with the future and promises packages. These two packages are used together:

  1. Use future to perform long-running operations in a worker process that runs in the background, leaving Shiny processes free to serve other users in the meantime. This yields much better responsiveness under load, and much more predictable latency.
  2. Use promises to handle the result of each long-running background operation back in the Shiny process, where additional processing can occur, such as further data manipulation, or displaying to the user via a reactive output.

If your app has a small number of severe performance bottlenecks, you can use this technique to get massively better responsiveness under load. For example, if the httr::GET call in this server function takes 30 seconds to complete:

server <- function(input, output, session) {
  r <- reactive({
    httr::GET(url) %>%
      httr::content("parsed")
  })
  output$plot <- renderPlot({
    r() %>%
      ggplot(aes(speed, dist)) + geom_point()
  })
}

then the entire R process is stalled for those 30 seconds.

We can rewrite it asynchronously like this:

library(promises)
library(future)
plan(multisession)

server <- function(input, output, session) {
  r <- reactive({
    future(httr::GET(url)) %...>%
      httr::content("parsed")
  })
  output$plot <- renderPlot({
    r() %...>% {
      ggplot(., aes(speed, dist)) + geom_point()
    }
  })
}

Even if the httr::GET(url) takes 30 seconds, the r reactive executes almost instantly, and returns control to the caller. The code inside future(...) is executed in a different R process that runs in the background, and whenever its result becomes available (i.e. in 30 seconds), the right-hand side of %...>% will be executed with that result. (%...>% is called a “promise pipe”; it works similarly to a magrittr pipe that knows how to wait for and “unwrap” promises.)

If the original (synchronous) code appeared in a Shiny app, then during that 30 seconds, the R process is stuck dealing with the download and can’t respond to any requests being made by other users. But with the async version, the R process only needs to kick off the operation, and then is free to service other requests. This means other users will only have to wait milliseconds, not minutes, for the app to respond.

Case study

We’ve created a detailed case study that walks through the async conversion of a realistic example app. This app processes low-level logging data from RStudio’s CRAN mirrors, to let us explore the heaviest downloaders for each day.

To load test this example app, we launched 50 sessions of simulated load, with a 5 second delay between each launch, and directed this traffic to a single R process. We then rewrote the app to use futures and promises, and reran the load test with this async version. (The tools we used to perform the load testing are not yet publicly available, but you can refer to Sean Lopp’s talk at rstudio::conf 2018 for a preview.)

Under these conditions, the finished async version displays significantly lower (mean) response times than the original. In the table below, “HTTP traffic” refers to requests that are made during page load time, and “reactive processing” refers to the time between the browser sending a reactive input value and the server returning updated reactive outputs.

Response type         Original    Async      Delta
HTTP traffic          605 ms      139 ms     -77%
Reactive processing   10.7 sec    3.48 sec   -67%

Learn more

Visit the promises website to learn more, or watch my recent webinar on Shiny async.

See the full changelog for Shiny v1.1.0.

Related packages

Over the last year, we created or enhanced several other packages to support async Shiny:

  • The promises package (released 2018-04-13) mentioned above provides the actual API you’ll use to do async programming in R. We implemented this as a separate package so that other parts of the R community, not just Shiny users, can take advantage of these techniques. The promises package was inspired by the basic ideas of JavaScript promises, but also has significantly improved syntax and extensibility to make it work well with R and Shiny. Currently, promises is most useful when used with the future package by Henrik Bengtsson.
  • later (released 2017-06-25) adds a low-level feature to R that is critical to async programming: the ability to schedule R code to be executed in the future, within the same R process. You can do all sorts of cool stuff on top of this, as some people are discovering (see the small sketch after this list).
  • httpuv (1.4.0 released 2018-04-19) has long been the HTTP web server that Shiny, and most other web frameworks for R, sit on top of. Version 1.4.0 adds support for asynchronous handling of HTTP requests, and also adds a dedicated I/O-handling thread for greatly improved performance under load.
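Here is the small sketch promised above for later (my example): scheduling a callback for a couple of seconds in the future without blocking the current session.

library(later)
# run the callback roughly two seconds from now, on the main R process
later(function() message("this prints roughly two seconds from now"), delay = 2)
message("this prints immediately")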

In the coming weeks, you can also expect updates for async compatibility to htmlwidgets, plotly, and DT. Most other HTML widgets will automatically become async compatible once htmlwidgets is updated.



Models are about what changes, and what doesn’t

Tue, 06/26/2018 - 02:00

(This article was first published on R on mages' blog, and kindly contributed to R-bloggers)

Following on from last week’s post on Principled Bayesian Workflow I want to reflect on how to motivate a model.
The purpose of most models is to understand change, and yet, considering what doesn’t change and should be kept constant can be equally important.
I will go through a couple of models in this post to illustrate this idea. The purpose of the model I want to build today is to predict how much ice cream is sold for different temperatures \((t)\).



Re-referencing factor levels to estimate standard errors when there is interaction turns out to be a really simple solution

Tue, 06/26/2018 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Maybe this should be filed under topics that are so obvious that it is not worth writing about. But, I hate to let a good simulation just sit on my computer. I was recently working on a paper investigating the relationship of emotion knowledge (EK) in very young kids with academic performance a year or two later. The idea is that kids who are more emotionally intelligent might be better prepared to learn. My collaborator suspected that the relationship between EK and academics would be different for immigrant and non-immigrant children, so we agreed that this would be a key focus of the analysis.

In model terms, we would describe the relationship for each student \(i\) as:

\[ T_i = \beta_0 + \beta_1 I_i + \beta_2 EK_i + \beta_3 I_i \times EK_i + \epsilon_i,\]
where \(T\) is the academic outcome, \(I\) is an indicator for immigrant status (either 0 or 1), and \(EK\) is a continuous measure of emotion knowledge. By including the \(I \times EK\) interaction term, we allow for the possibility that the effect of emotion knowledge will be different for immigrants. In particular, if we code \(I=0\) for non-immigrant kids and \(I=1\) for immigrant kids, \(\beta_2\) represents the relationship of EK and academic performance for non-immigrant kids, and \(\beta_2 + \beta_3\) is the relationship for immigrant kids. In this case, non-immigrant kids are the reference category.

Here’s the data generation:

library(simstudy)
library(broom)

set.seed(87265145)

def <- defData(varname = "I", formula = .4, dist = "binary")
def <- defData(def, varname = "EK", formula = "0 + 0.5*I", variance = 4)
def <- defData(def, varname = "T", formula = "10 + 2*I + 0.5*EK + 1.5*I*EK", variance = 4)

dT <- genData(250, def)
genFactor(dT, "I", labels = c("not Imm", "Imm"))

##       id I         EK         T      fI
##   1:   1 1 -1.9655562  5.481254     Imm
##   2:   2 1  0.9230118 16.140710     Imm
##   3:   3 0 -2.5315312  9.443148 not Imm
##   4:   4 1  0.9103722 15.691873     Imm
##   5:   5 0 -0.2126550  9.524948 not Imm
##  ---
## 246: 246 0 -1.2727195  7.546245 not Imm
## 247: 247 0 -1.2025184  6.658869 not Imm
## 248: 248 0 -1.7555451 11.027569 not Imm
## 249: 249 0  2.2967681 10.439577 not Imm
## 250: 250 1 -0.3056299 11.673933     Imm

Let’s say our primary interest in this exploration is point estimates of \(\beta_2\) and \(\beta_2 + \beta_3\), along with 95% confidence intervals of the estimates. (We could have just as easily reported \(\beta_3\), but we decided the point estimates would be more intuitive to understand.) The point estimates are quite straightforward: we can estimate them directly from the estimates of \(\beta_2\) and \(\beta_3\). And the standard error (and confidence interval) for \(\beta_2\) can be read directly off of the model output table. But what about the standard error for the relationship of EK and academic performance for the immigrant kids? How do we handle that?

I’ve always done this the cumbersome way, using this definition:

\[
\begin{aligned}
se_{\beta_2 + \beta_3} &= [Var(\beta_2 + \beta_3)]^\frac{1}{2} \\
&=[Var(\beta_2) + Var(\beta_3) + 2 \times Cov(\beta_2,\beta_3)]^\frac{1}{2}
\end{aligned}
\]

In R, this is relatively easy (though maybe not super convenient) to do manually, by extracting the information from the estimated parameter variance-covariance matrix.

First, fit a linear model with an interaction term:

lm1 <- lm(T ~ fI*EK, data = dT)
tidy(lm1)
##          term  estimate  std.error statistic       p.value
## 1 (Intercept) 10.161842 0.16205385 62.706574 2.651774e-153
## 2       fIImm  1.616929 0.26419189  6.120281  3.661090e-09
## 3          EK  0.461628 0.09252734  4.989098  1.147653e-06
## 4    fIImm:EK  1.603680 0.13960763 11.487049  9.808529e-25

The estimate for the relationship of EK and academic performance for non-immigrant kids is 0.46 (se = 0.093). And the point estimate for the relationship for immigrant kids is \(2.06 = 0.46 + 1.60\).

The standard error can be calculated from the variance-covariance matrix that is derived from the linear model:

vcov(lm1)
##              (Intercept)        fIImm           EK     fIImm:EK
## (Intercept)  0.026261449 -0.026261449 -0.000611899  0.000611899
## fIImm       -0.026261449  0.069797354  0.000611899 -0.006838297
## EK          -0.000611899  0.000611899  0.008561309 -0.008561309
## fIImm:EK     0.000611899 -0.006838297 -0.008561309  0.019490291

\[Var(\beta_2+\beta_3) = 0.0086 + 0.0195 + 2\times(-.0086) = 0.0109\]

The standard error of the estimate is \(\sqrt{0.0109} = 0.105\).
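In code, that is just a couple of lines once you have the variance-covariance matrix above (my sketch, using the coefficient names from the vcov(lm1) output):

V <- vcov(lm1)
sqrt(V["EK", "EK"] + V["fIImm:EK", "fIImm:EK"] + 2 * V["EK", "fIImm:EK"])
# about 0.1045, i.e. the 0.105 quoted above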

So?

OK – so maybe that isn’t really all that interesting. Why am I even talking about this? Well, in the actual study, we have a fair amount of missing data. In some cases we don’t have an EK measure, and in others we don’t have an outcome measure. And since the missingness is on the order of 15% to 20%, we decided to use multiple imputation. We used the mice package in R to impute the data, and we pooled the model estimates from the completed data sets to get our final estimates. mice is a fantastic package, but one thing that it does not supply is some sort of pooled variance-covariance matrix. Looking for a relatively quick solution, I decided to use bootstrap methods to estimate the confidence intervals.

(“Relatively quick” is itself a relative term, since bootstrapping and imputing together is not exactly a quick process – maybe something to work on. I was also not fitting standard linear models but mixed effect models. Needless to say, it took a bit of computing time to get my estimates.)

Seeking credit (and maybe some sympathy) for all of my hard work, I mentioned this laborious process to my collaborator. She told me that you can easily estimate the group specific effects merely by changing the reference group and refitting the model. I could see right away that the point estimates would be fine, but surely the standard errors would not be estimated correctly? Of course, a few simulations ensued.

First, I just changed the reference group so that \(\beta_2\) would be measuring the relationship of EK and academic performance for immigrant kids, and \(\beta_2 + \beta_3\) would represent the relationship for the non-immigrant kids. Here are the levels before the change:

head(dT$fI)
## [1] Imm     Imm     not Imm Imm     not Imm not Imm
## Levels: not Imm Imm

And after:

dT$fI <- relevel(dT$fI, ref = "Imm")
head(dT$fI)
## [1] Imm     Imm     not Imm Imm     not Imm not Imm
## Levels: Imm not Imm

And the model:

lm2 <- lm(T ~ fI*EK, data = dT)
tidy(lm2)
##           term  estimate std.error  statistic       p.value
## 1  (Intercept) 11.778770 0.2086526  56.451588 8.367177e-143
## 2    fInot Imm -1.616929 0.2641919  -6.120281  3.661090e-09
## 3           EK  2.065308 0.1045418  19.755813  1.112574e-52
## 4 fInot Imm:EK -1.603680 0.1396076 -11.487049  9.808529e-25

The estimate for this new \(\beta_2\) is 2.07 (se=0.105), pretty much aligned with our estimate that required a little more work. While this is not a proof by any means, I did do variations on this simulation (adding other covariates, changing the strength of association, changing sample size, changing variation, etc.) and both approaches seem to be equivalent. I even created 10000 samples to see if the coverage rates of the 95% confidence intervals were correct. They were. My collaborator was right. And I felt a little embarrassed, because it seems like something I should have known.

But …

Would this still work with missing data? Surely, things would go awry in the pooling process. So, I did one last simulation, generating the same data, but then added missingness. I imputed the missing data, fit the models, and pooled the results (including pooled 95% confidence intervals). And then I looked at the coverage rates.

First I added some missingness into the data

defM <- defMiss(varname = "EK", formula = "0.05 + 0.10*I", logit.link = FALSE)
defM <- defMiss(defM, varname = "T", formula = "0.05 + 0.05*I", logit.link = FALSE)
defM
##    varname       formula logit.link baseline monotonic
## 1:      EK 0.05 + 0.10*I      FALSE    FALSE     FALSE
## 2:       T 0.05 + 0.05*I      FALSE    FALSE     FALSE

And then I generated 500 data sets, imputed the data, and fit the models. In each iteration, I stored the final model results for both models (one where the reference group is non-immigrant and the other where the reference group is immigrant).

library(mice)

nonRes <- list()
immRes <- list()

set.seed(3298348)

for (i in 1:500) {

  dT <- genData(250, def)
  dT <- genFactor(dT, "I", labels = c("non Imm", "Imm"), prefix = "non")
  dT$immI <- relevel(dT$nonI, ref = "Imm")

  # generate a missing data matrix
  missMat <- genMiss(dtName = dT, missDefs = defM, idvars = "id")

  # create observed data set
  dtObs <- genObs(dT, missMat, idvars = "id")
  dtObs <- dtObs[, .(I, EK, nonI, immI, T)]

  # impute the missing data (create 20 data sets for each iteration)
  dtImp <- mice(data = dtObs, method = 'cart', m = 20, printFlag = FALSE)

  # non-immigrant is the reference group
  estImp <- with(dtImp, lm(T ~ nonI*EK))
  lm1 <- summary(pool(estImp), conf.int = TRUE)

  dt1 <- as.data.table(lm1)
  dt1[, term := rownames(lm1)]
  setnames(dt1, c("2.5 %", "97.5 %"), c("conf.low", "conf.high"))
  dt1[, iter := i]

  nonRes[[i]] <- dt1

  # immigrant is the reference group
  estImp <- with(dtImp, lm(T ~ immI*EK))
  lm2 <- summary(pool(estImp), conf.int = TRUE)

  dt2 <- as.data.table(lm2)
  dt2[, term := rownames(lm2)]
  setnames(dt2, c("2.5 %", "97.5 %"), c("conf.low", "conf.high"))
  dt2[, iter := i]

  immRes[[i]] <- dt2
}

nonRes <- rbindlist(nonRes)
immRes <- rbindlist(immRes)

The proportion of confidence intervals that contain the true values is pretty close to 95% for both estimates:

mean(nonRes[term == "EK", conf.low < 0.5 & conf.high > 0.5])
## [1] 0.958
mean(immRes[term == "EK", conf.low < 2.0 & conf.high > 2.0])
## [1] 0.948

And the estimates of the mean and standard deviations are also pretty good:

nonRes[term == "EK", .(mean = round(mean(estimate),3), obs.SD = round(sd(estimate),3), avgEst.SD = round(sqrt(mean(std.error^2)),3))] ## mean obs.SD avgEst.SD ## 1: 0.503 0.086 0.088 immRes[term == "EK", .(mean = round(mean(estimate),3), obs.SD = round(sd(estimate),3), avgEst.SD = round(sqrt(mean(std.error^2)),3))] ## mean obs.SD avgEst.SD ## 1: 1.952 0.117 0.124

Because I like to include at least one visual in a post, here is a plot of the 95% confidence intervals, with the CIs not covering the true values colored blue:

The re-reference approach seems to work quite well (in the context of this simulation, at least). My guess is the hours of bootstrapping may have been unnecessary, though I haven’t fully tested all of this out in the context of clustered data. My guess is it will turn out OK in that case as well.

Appendix: ggplot code

nonEK <- nonRes[term == "EK", .(iter, ref = "Non-immigrant", estimate, conf.low, conf.high,
                                cover = (conf.low < 0.5 & conf.high > 0.5))]
immEK <- immRes[term == "EK", .(iter, ref = "Immigrant", estimate, conf.low, conf.high,
                                cover = (conf.low < 2 & conf.high > 2))]
EK <- rbindlist(list(nonEK, immEK))

vline <- data.table(xint = c(.5, 2), ref = c("Non-immigrant", "Immigrant"))

ggplot(data = EK, aes(x = conf.low, xend = conf.high, y = iter, yend = iter)) +
  geom_segment(aes(color = cover)) +
  geom_vline(data = vline, aes(xintercept = xint), lty = 3) +
  facet_grid(. ~ ref, scales = "free") +
  theme(panel.grid = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none") +
  scale_color_manual(values = c("#5c81ba", "grey75")) +
  scale_x_continuous(expand = c(.1, 0), name = "95% CI")


Running Python inside the RStudio Server IDE

Mon, 06/25/2018 - 18:39

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

A great many R users will have to run some Python code from time to time, and this short video from our Head of Data Engineering, Mark Sellors, outlines one approach you can take that doesn’t mean leaving the RStudio IDE.



Demography simulations by @ellis2013nz

Mon, 06/25/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Total fertility rate

This is my second blog post while familiarising myself with the arithmetic of demography.

In my earlier post I looked at how life expectancy is calculated. “Period life expectancy at birth” as estimated by national statistical offices is actually best thought of as a summary of death rates by age group in a particular year; the calculation is effectively “what would be the life expectancy of someone born today, if all through their life the death rates in each age bracket are the same as they are today.” That is (for quite defensible reasons), no effort is made to forecast changes in the death rates; so in a time of decreasing death rates as we have been lucky to be living through for a while now, the actual life expectancy of various cohorts as observed once they’re all dead has been more than that reported in official statistics.
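To make that arithmetic concrete, here is a minimal sketch in R of the period calculation, using invented age-specific death rates rather than real data: survivorship is built up bracket by bracket and the person-years are summed.

# Sketch of period life expectancy at birth from a single year's
# age-specific death rates (the qx values below are invented).
age_width <- c(1, 4, rep(5, 19))                  # widths of the age brackets
qx <- c(0.02, 0.005, rep(0.004, 9), 0.01, 0.03,   # probability of dying within
        0.06, 0.12, 0.20, 0.30, 0.45, 0.60,       # each bracket, given survival
        0.80, 1.00)                               # to its start
lx <- head(cumprod(c(1, 1 - qx)), -1)             # proportion alive at bracket start
Lx <- age_width * (lx + c(lx[-1], 0)) / 2         # approximate person-years lived
sum(Lx)                                           # period life expectancy at birth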

It turns out that a similar calculation is done with total fertility rate, which is the average number of children born to a woman over her lifetime. It has the same problem as life expectancy. For historical periods with sufficient data, we can look at any cohort of women and calculate the average number of children directly. But at any point in time, we don't know what is going to happen to future birth rates, and hence the total fertility rate of women alive today is unknown.

The solution adopted by demographers is the same for total fertility rate as for life expectancy. To quote Wikipedia:

total period fertility rate (TPFR) of a population is the average number of children that would be born to a woman over her lifetime if: she were to experience the exact current age-specific fertility rates (ASFRs) through her lifetime; and she were to survive from birth through the end of her reproductive life.

In other words, just like period life expectancy is a summary of today’s death rates per age group, period fertility rate is a summary of today’s rates of giving birth per age group.
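In code, the period calculation is nothing more than summing this year's age-specific fertility rates over the reproductive age brackets; here is a small sketch with made-up rates (illustration only):

# Invented age-specific fertility rates (births per woman per year)
# for five-year brackets from 15-19 up to 45-49 (illustration only)
asfr <- c("15-19" = 0.020, "20-24" = 0.100, "25-29" = 0.140,
          "30-34" = 0.110, "35-39" = 0.050, "40-44" = 0.015, "45-49" = 0.002)
tpfr <- sum(asfr * 5)   # each bracket spans five years of a woman's life
tpfr                    # about 2.2 children per woman with these rates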

Converting total fertility rate into probability of giving birth this year

Why do I care? It’s because the ultimate purpose of both of these blog posts (previously hidden, dear reader) is to build a simple simulation model with a small number of user-specified starting parameters, and total fertility rate is a logical choice for one of those parameters. It’s easy to understand (anyone can intuit that 2.1 live children per woman is going to be roughly the replacement rate) and unlike crude birth rates per 100,000 which means different things in countries with different age profiles, it is self-standardising.

My simulation model needs a vector of probabilities for each woman in my population giving birth in a particular year, and to get those probabilities I need a way of converting the total fertility rate into a probability that any woman, given her age, will have a live birth this year. In effect, I need to reverse the calculation demographers do when they estimate the fertility rate in the first place, but whereas they are aggregating detailed data (birth rates per age group) into a single aggregate number I am doing the reverse.

This means I need to make some additional assumptions. Simplest would be to allocate the same probability to any woman of reproductive age, but this is obviously unrealistic. I need a curve that gives a higher chance of giving birth in the 20s and 30s but is non-zero before and after that peak, and that adds up to an expected value of (say) 2.1 over the whole lifetime. Obviously I could estimate that curve with real data, but I don't have any to hand, and I wanted something that would be easy to scale up and down to different fertility rates.

Now what’s a curve with definite bounds at the beginning and end, that’s guaranteed to have its area add up to some known constant? Any bounded probability distribution will do the job, and a beta distribution in particular is perfect. Just scale it from [0, 1] to whatever we deem the beginning and ending ages of fertility, choose some values of alpha and beta we like, and multiply the densities by the fertility rate. So for an average rate of 2.2 we can get something like this:

This is the approach I use in my model.

The simulation

The idea of the simulation is that you set your assumption-controlling levers at particular points, hit run, wait a minute, and then look at some standard results. Here’s a screenshot from the end product (click on it to go to the actual tool):

The assumptions you can control include:

  • average children per woman at the beginning of the simulation period
  • male and female infant mortality rates at the beginning of the simulation period
  • death rates other than in the first year, as a multiple of death rates in France in 2015
  • the ratio of male to female births (around 107 boys are born for every 100 girls, varying by country)
  • the size and magnitude of drift upwards and downwards in fertility and death rates over the period.

Having run the simulation, you get two sets of output to look at. The screenshot above shows standard results on things like population size, average age, and crude birth and death rates. The screenshot below shows population distribution by age group. This variable is usually shown in demographic pyramids but I’ve opted for straightforward statistical density charts here, because I actually think they are better to read.

With the particular settings used for the model in the screenshots, I’d chosen fairly high infant mortality, death and fertility rates, similar to those in developing countries in the twentieth century. By having all those things trend downwards, we can see the resulting steady increase in life expectancy, and an increasing but aging population for several centuries until a crucial point is reached where crude birth rates are less than death rates and the population starts to decrease.

In fact, playing around with those assumptions for this sort of result was exactly why I'd built this model. I wanted to understand (for example) how changes in life expectancy can seemingly get out of sync with crude death rates for a time, during a period when the age composition of a population is changing. If you're interested in this sort of thing, have a play!

Shining up Shiny

As always, the source code is on GitHub. There are a few small bits of polish of possible interest, including:

  • non-default fonts in a ggplot2 image in a Shiny app (surprisingly tricky, with the best approach I know of being the use of renderImage as set out in this Stack Overflow question and answer, in combination with the showtext R package);
  • loader animations – essential during a simulation like this, where the user needs to wait for some seconds if not minutes – courtesy of the shinycssloaders package;
  • custom fonts for the web page itself via CSS stored in the www subfolder.

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

Data Science For Business Tutorial: Using Machine Learning With LIME To Understand Employee Churn

Mon, 06/25/2018 - 06:45

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

Data science tools are getting better and better, which is improving the predictive performance of machine learning models in business. With new, high-performance tools like H2O for automated machine learning and Keras for deep learning, the performance of models is increasing tremendously. There's one catch: complex models are unexplainable… that is, until LIME came along! LIME, which stands for Local Interpretable Model-agnostic Explanations, has opened the doors to black-box (complex, high-performance, but unexplainable) models in business applications!

Explanations are MORE CRITICAL to the business than PERFORMANCE. Think about it. What good is a high performance model that predicts employee attrition if we can’t tell what features are causing people to quit? We need explanations to improve business decision making. Not just performance.

In this Machine Learning Tutorial, Brad Boehmke, Director of Data Science at 84.51°, shows us how to use LIME for machine learning interpretability on a Human Resources Employee Turnover Problem, specifically showing the value of developing interpretability visualizations. He shows us options for Global Importance and compares it to LIME for Local Importance. We use the machine learning R packages h2o, caret, and ranger in the tutorial, showcasing how to use lime for local explanations. Let's get started!

LIME: A Secret Weapon For ROI-Driven Data Science

Introduction by Matt Dancho, Founder of Business Science

Business success depends on the ability of managers, process stakeholders, and key decision makers to make the right decisions, often using data to understand what's going on. This is where machine learning can help. Machine learning can analyze vast amounts of data, creating highly predictive models that tell managers key information, such as how likely someone is to leave an organization. However, machine learning alone is not enough. Business leaders require explanations so they can determine adjustments that will improve results. These explanations require a different tool: LIME. Let's find out why LIME is truly a secret weapon for ROI-driven data science!

In the HR Employee Attrition example discussed in this article, the machine learning model predicts the probability of someone leaving the company. This probability is then converted to a prediction of either leave or stay through a process called Binary Classification. However, this alone doesn't achieve the main objective, which is to make better decisions. It only tells us whether someone is a high flight risk (i.e. has a high attrition probability).

Employee Attrition: Machine Learning Predicts Which Employees Are Likely To Leave

How do we change decision making and therefore improve? It comes down to levers and probability. Machine learning tells us which employees are highest risk and therefore highest probability. We can home in on these individuals, but we need a different tool to understand why an individual is likely to leave. This is where LIME comes into play. LIME uncovers the levers, or features, we can control to make business improvements.

LIME: Uncovers Levers or Features We Can Control

In our HR Employee Attrition Example, LIME detects “Over Time” (lever) as a key feature that supports employee turnover. We can control the “Over Time” feature by implementing a “limited-overtime” or “no-overtime” policy.

Analyzing A Policy Change: Targeting Employees With Over Time

Toggling the "OverTime" feature to "No" enables calculating an expected value, or benefit, of reducing overtime by implementing a new OT policy. For the individual employee, an expected savings results. When applied to the entire organization, this process of adjusting levers can result in impactful policy changes that save the organization millions per year and generate ROI.

Adjusting The Over Time Results In Expected Savings
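As a rough sketch of the calculation being described (everything here is hypothetical: h2o_model stands in for a fitted H2O classifier, employee for a one-row data frame describing the person being scored, and the cost of one departure is an assumed figure, not one from this analysis):

# Hypothetical illustration of valuing the "no overtime" lever
employee_no_ot <- employee
employee_no_ot$OverTime <- factor("No", levels = levels(employee$OverTime))

p_before <- as.data.frame(h2o.predict(h2o_model, as.h2o(employee)))$Yes
p_after  <- as.data.frame(h2o.predict(h2o_model, as.h2o(employee_no_ot)))$Yes

attrition_cost   <- 20000                       # assumed cost of one departure
expected_savings <- (p_before - p_after) * attrition_cost
expected_savings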

Interested in Learning LIME While Solving A Real-World Churn Problem?

If you want to solve this real-world employee churn problem developing models with H2O Automated Machine Learning, using LIME For Black-Box ML Model Explanation, and analyzing the impact of a policy change through optimization and sensitivity analysis, get started today with Data Science For Business (DS4B 201 / HR 201). You’ll learn ROI-Driven Data Science, implementing the tools (H2O + LIME) and our data science framework (BSPF) under my guidance (Matt Dancho, Instructor and Founder of Business Science) in our new, self-paced course part of the Business Science University virtual data science workshop.

Learning Trajectory

Now that we have a flavor for what LIME does, let’s get on with learning how to use it! In this machine learning tutorial, you will learn:

In fact, one of the coolest things you’ll learn is how to create a visualization that compares multiple H2O modeling algorithms that examine employee attrition. This is akin to getting different perspectives for how each of the models view the problem:

  • Random Forest (RF)
  • Generalized Linear Regression (GLM)
  • Gradient Boosted Machine (GBM).

Comparing LIME results of different H2O modeling algorithms

About The Author

This MACHINE LEARNING TUTORIAL comes from Brad Boehmke, Director of Data Science at 84.51°, where he and his team develop algorithmic processes, solutions, and tools that enable 84.51° and its analysts to efficiently extract insights from data and provide solution alternatives to decision-makers. Brad is not only a talented data scientist, he's also an adjunct professor at the University of Cincinnati, Wake Forest, and the Air Force Institute of Technology. Most importantly, he's an active contributor to the data science community and enjoys giving back via advanced machine learning education available at the UC Business Analytics R Programming Guide!

Machine Learning Tutorial: Visualizing Machine Learning Models with LIME

By Brad Boehmke, Director of Data Science at 84.51°

Machine learning (ML) models are often considered “black boxes” due to their complex inner-workings. More advanced ML models such as random forests, gradient boosting machines (GBM), artificial neural networks (ANN), among others are typically more accurate for predicting nonlinear, faint, or rare phenomena. Unfortunately, more accuracy often comes at the expense of interpretability, and interpretability is crucial for business adoption, model documentation, regulatory oversight, and human acceptance and trust. Luckily, several advancements have been made to aid in interpreting ML models.

Moreover, it’s often important to understand the ML model that you’ve trained on a global scale, and also to zoom into local regions of your data or your predictions and derive local explanations. Global interpretations help us understand the inputs and their entire modeled relationship with the prediction target, but global interpretations can be highly approximate in some cases. Local interpretations help us understand model predictions for a single row of data or a group of similar rows.

This post demonstrates how to use the lime package to perform local interpretations of ML models. This will not focus on the theoretical and mathematical underpinnings but, rather, on the practical application of using lime.[1]

Libraries

This tutorial leverages the following packages.

# required packages
# install vip from github repo: devtools::install_github("koalaverse/vip")
library(lime)       # ML local interpretation
library(vip)        # ML global interpretation
library(pdp)        # ML global interpretation
library(ggplot2)    # visualization pkg leveraged by above packages
library(caret)      # ML model building
library(h2o)        # ML model building

# other useful packages
library(tidyverse)  # Use tibble, dplyr
library(rsample)    # Get HR Data via rsample::attrition
library(gridExtra)  # Plot multiple lime plots on one graph

# initialize h2o
h2o.init()

## Connection successful!
##
## R is connected to the H2O cluster:
##     H2O cluster uptime:         3 minutes 17 seconds
##     H2O cluster timezone:       America/New_York
##     H2O data parsing timezone:  UTC
##     H2O cluster version:        3.19.0.4208
##     H2O cluster version age:    4 months and 6 days !!!
##     H2O cluster name:           H2O_started_from_R_mdancho_zbl980
##     H2O cluster total nodes:    1
##     H2O cluster total memory:   3.50 GB
##     H2O cluster total cores:    4
##     H2O cluster allowed cores:  4
##     H2O cluster healthy:        TRUE
##     H2O Connection ip:          localhost
##     H2O Connection port:        54321
##     H2O Connection proxy:       NA
##     H2O Internal Security:      FALSE
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4
##     R Version:                  R version 3.4.4 (2018-03-15)

h2o.no_progress()

To demonstrate model visualization techniques we’ll use the employee attrition data that has been included in the rsample package. This demonstrates a binary classification problem (“Yes” vs. “No”) but the same process that you’ll observe can be used for a regression problem. Note: I force ordered factors to be unordered as h2o does not support ordered categorical variables.

For this exemplar I retain most of the observations in the training data sets and retain 5 observations in the local_obs set. These 5 observations are going to be treated as new observations that we wish to understand why the particular predicted response was made.

# create data sets
df <- rsample::attrition %>%
  mutate_if(is.ordered, factor, ordered = FALSE) %>%
  mutate(Attrition = factor(Attrition, levels = c("Yes", "No")))

index <- 1:5
train_obs <- df[-index, ]
local_obs <- df[index, ]

# create h2o objects for modeling
y <- "Attrition"
x <- setdiff(names(train_obs), y)
train_obs.h2o <- as.h2o(train_obs)
local_obs.h2o <- as.h2o(local_obs)

We will explore how to visualize a few of the more popular machine learning algorithms and packages in R. For brevity I train default models and do not emphasize hyperparameter tuning. The following produces:

  • Random forest model using ranger via the caret package
  • Random forest model using h2o
  • Elastic net model using h2o
  • GBM model using h2o
  • Random forest model using ranger directly
# Create Random Forest model with ranger via caret
fit.caret <- train(
  Attrition ~ .,
  data = train_obs,
  method = 'ranger',
  trControl = trainControl(method = "cv", number = 5, classProbs = TRUE),
  tuneLength = 1,
  importance = 'impurity'
)

# create h2o models
h2o_rf  <- h2o.randomForest(x, y, training_frame = train_obs.h2o)
h2o_glm <- h2o.glm(x, y, training_frame = train_obs.h2o, family = "binomial")
h2o_gbm <- h2o.gbm(x, y, training_frame = train_obs.h2o)

# ranger model --> model type not built in to LIME
fit.ranger <- ranger::ranger(
  Attrition ~ .,
  data = train_obs,
  importance = 'impurity',
  probability = TRUE
)

Global Interpretation

The most common ways of obtaining global interpretation is through:

  • variable importance measures
  • partial dependence plots

Variable importance quantifies the global contribution of each input variable to the predictions of a machine learning model. Variable importance measures rarely give insight into the average direction that a variable affects a response function. They simply state the magnitude of a variable’s relationship with the response as compared to other variables used in the model. For example, the ranger random forest model identified monthly income, overtime, and age as the top 3 variables impacting the objective function.

vip(fit.ranger) + ggtitle("ranger: RF")

After the most globally relevant variables have been identified, the next step is to attempt to understand how the response variable changes based on these variables. For this we can use partial dependence plots (PDPs) and individual conditional expectation (ICE) curves. These techniques plot the change in the predicted value as specified feature(s) vary over their marginal distribution. Consequently, we can gain some understanding of how the response variable changes across the distribution of a particular variable, but this still only provides a global view of that relationship across all observed data.

For example, if we plot the PDP of the monthly income variable we see that the probability of an employee attriting decreases, on average, as their monthly income approaches $5,000 and then remains relatively flat.

# built-in PDP support in H2O
h2o.partialPlot(h2o_rf, data = train_obs.h2o, cols = "MonthlyIncome")

## PartialDependence: Partial Dependence Plot of model DRF_model_R_1529928665020_106 on column 'MonthlyIncome'
##    MonthlyIncome mean_response stddev_response
## 1    1009.000000      0.229433        0.229043
## 2    2008.473684      0.216096        0.234960
## 3    3007.947368      0.167686        0.234022
## 4    4007.421053      0.161024        0.231287
## 5    5006.894737      0.157775        0.231151
## 6    6006.368421      0.156628        0.231455
## 7    7005.842105      0.157734        0.230267
## 8    8005.315789      0.160137        0.229286
## 9    9004.789474      0.164437        0.229305
## 10  10004.263158      0.169652        0.227902
## 11  11003.736842      0.165502        0.226240
## 12  12003.210526      0.165297        0.225529
## 13  13002.684211      0.165051        0.225335
## 14  14002.157895      0.165051        0.225335
## 15  15001.631579      0.164983        0.225316
## 16  16001.105263      0.165051        0.225019
## 17  17000.578947      0.165556        0.224890
## 18  18000.052632      0.165556        0.224890
## 19  18999.526316      0.166498        0.224726
## 20  19999.000000      0.182348        0.219882

We can gain further insight by using centered ICE curves, which can help draw out further details. For example, the following ICE curves show a similar trend line as the PDP above, but by centering we identify the decrease as monthly income approaches $5,000, followed by an increase in the probability of attriting once an employee's monthly income approaches $20,000. Furthermore, we see some turbulence in the flatlined region between $5K and $20K, which suggests there are certain salary regions where the probability of attriting changes.

fit.ranger %>%
  pdp::partial(pred.var = "MonthlyIncome", grid.resolution = 25, ice = TRUE) %>%
  autoplot(rug = TRUE, train = train_obs, alpha = 0.1, center = TRUE)

These visualizations help us to understand our model from a global perspective: identifying the variables with the largest overall impact and the typical influence of a feature on the response variable across all observations. However, what these do not help us understand is, given a new observation, what were the most influential variables that determined the predicted outcome. Say we obtain information on an employee that makes about $10,000 per month and we need to assess their probability of leaving the firm. Although monthly income is the most important variable in our model, it may not be the most influential variable driving this employee to leave. To retain the employee, leadership needs to understand what variables are most influential for that specific employee. This is where lime can help.

Local Interpretation

Local Interpretable Model-agnostic Explanations (LIME) is a visualization technique that helps explain individual predictions. As the name implies, it is model agnostic so it can be applied to any supervised regression or classification model. Behind the workings of LIME lies the assumption that every complex model is linear on a local scale and asserting that it is possible to fit a simple model around a single observation that will mimic how the global model behaves at that locality. The simple model can then be used to explain the predictions of the more complex model locally.

The generalized algorithm LIME applies is:

  1. Given an observation, permute it to create replicated feature data with slight value modifications.
  2. Compute similarity distance measure between original observation and permuted observations.
  3. Apply selected machine learning model to predict outcomes of permuted data.
  4. Select m number of features to best describe predicted outcomes.
  5. Fit a simple model to the permuted data, explaining the complex model outcome with the m features from the permuted data, weighted by similarity to the original observation.
  6. Use the resulting feature weights to explain local behavior.

Each of these steps will be discussed in further detail as we proceed.
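As a piece of intuition only, here is a stripped-down sketch of steps 2, 5, and 6 for numeric data, using a kernel-weighted ridge regression via glmnet. The lime package does much more than this (binning, categorical recoding, feature selection), and the object names here (perms, obs, complex_pred) are assumptions made for the sketch.

# Toy version of LIME's weighted local surrogate:
#   perms        - numeric matrix of permuted observations (step 1, done elsewhere)
#   obs          - the original observation as a numeric vector
#   complex_pred - the complex model's predictions for the permutations (step 3)
library(glmnet)

lime_sketch <- function(obs, perms, complex_pred,
                        kernel_width = 0.75 * sqrt(ncol(perms))) {
  # step 2: distance to the original observation, turned into a similarity weight
  d <- sqrt(rowSums(sweep(perms, 2, obs)^2))
  w <- exp(-d^2 / kernel_width^2)

  # step 5: weighted ridge regression as the simple, interpretable model
  fit <- glmnet(perms, complex_pred, weights = w, alpha = 0, lambda = 0.01)

  # step 6: the coefficients are the local feature weights
  coef(fit)
}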

lime::lime

The application of the LIME algorithm via the lime package is split into two operations: lime::lime() and lime::explain(). The lime::lime() function creates an "explainer" object, which is just a list that contains the machine learning model and the feature distributions for the training data. The feature distributions it contains include distribution statistics for each categorical variable level and for each continuous variable split into n bins (default is 4 bins). These feature attributes will be used to permute data.

The following creates our lime::lime() explainer object; I change the number of bins for the continuous variables to 5.

explainer_caret <- lime::lime(train_obs, fit.caret, n_bins = 5)

class(explainer_caret)
## [1] "data_frame_explainer" "explainer"
## [3] "list"

summary(explainer_caret)
##                      Length Class  Mode
## model                24     train  list
## bin_continuous        1     -none- logical
## n_bins                1     -none- numeric
## quantile_bins         1     -none- logical
## use_density           1     -none- logical
## feature_type         31     -none- character
## bin_cuts             31     -none- list
## feature_distribution 31     -none- list

lime::explain

Once we created our lime objects, we can now perform the generalized LIME algorithm using the lime::explain() function. This function has several options, each providing flexibility in how we perform the generalized algorithm mentioned above.

  • x: Contains the one or more single observations you want to create local explanations for. In our case, this includes the 5 observations that I included in the local_obs data frame. Relates to algorithm step 1.
  • explainer: takes the explainer object created by lime::lime(), which will be used to create permuted data. Permutations are sampled from the variable distributions created by the lime::lime() explainer object. Relates to algorithm step 1.
  • n_permutations: The number of permutations to create for each observation in x (default is 5,000 for tabular data). Relates to algorithm step 1.
  • dist_fun: The distance function to use. The default is Gower’s distance but can also use euclidean, manhattan, or any other distance function allowed by ?dist(). To compute similarity distance of permuted observations, categorical features will be recoded based on whether or not they are equal to the actual observation. If continuous features are binned (the default) these features will be recoded based on whether they are in the same bin as the observation. Using the recoded data the distance to the original observation is then calculated based on a user-chosen distance measure. Relates to algorithm step 2.
  • kernel_width: To convert the distance measure to a similarity value, an exponential kernel of a user defined width (defaults to 0.75 times the square root of the number of features) is used. Smaller values restrict the size of the local region. Relates to algorithm step 2.
  • n_features: The number of features to best describe predicted outcomes. Relates to algorithm step 4.
  • feature_select: To select the best n features, lime can use forward selection, ridge regression, lasso, or a tree to select the features. In this example I apply a ridge regression model and select the m features with highest absolute weights. Relates to algorithm step 4.

For classification models we also have two additional features we care about and one of these two arguments must be given:

  • labels: Which label do we want to explain? In this example, I want to explain the probability of an observation to attrit (“Yes”).
  • n_labels: The number of labels to explain. With this data I could select n_labels = 2 to explain the probability of “Yes” and “No” responses.
explanation_caret <- lime::explain(
  x              = local_obs,
  explainer      = explainer_caret,
  n_permutations = 5000,
  dist_fun       = "gower",
  kernel_width   = .75,
  n_features     = 5,
  feature_select = "highest_weights",
  labels         = "Yes"
)

The explain() function above first creates permutations, then calculates similarities, followed by selecting the m features. Lastly, explain() will then fit a model (algorithm steps 5 & 6). lime applies a ridge regression model with the weighted permuted observations as the simple model. [2] If the model is a regressor, the simple model will predict the output of the complex model directly. If the complex model is a classifier, the simple model will predict the probability of the chosen class(es).

The explain() output is a data frame containing different information on the simple model predictions. Most importantly, for each observation in local_obs it contains the simple model fit (model_r2) and the weighted importance (feature_weight) for each important feature (feature_desc) that best describes the local relationship.

tibble::glimpse(explanation_caret)
## Observations: 25
## Variables: 13
## $ model_type       "classification", "classification", "cla...
## $ case             "1", "1", "1", "1", "1", "2", "2", "2", ...
## $ label            "Yes", "Yes", "Yes", "Yes", "Yes", "Yes"...
## $ label_prob       0.216, 0.216, 0.216, 0.216, 0.216, 0.070...
## $ model_r2         0.5517840, 0.5517840, 0.5517840, 0.55178...
## $ model_intercept  0.1498396, 0.1498396, 0.1498396, 0.14983...
## $ model_prediction 0.3233514, 0.3233514, 0.3233514, 0.32335...
## $ feature          "OverTime", "MaritalStatus", "BusinessTr...
## $ feature_value    Yes, Single, Travel_Rarely, Sales, Very_...
## $ feature_weight   0.14996805, 0.05656074, -0.03929651, 0.0...
## $ feature_desc     "OverTime = Yes", "MaritalStatus = Singl...
## $ data             [[41, Yes, Travel_Rarely, 1102, Sales, ...
## $ prediction       [[0.216, 0.784], [0.216, 0.784], [0.216...

Visualizing results

However, the simplest approach to interpreting the results is to visualize them. There are several plotting functions provided by lime, but for tabular data we are only concerned with two. The most important of these is plot_features(). This will create a visualization containing an individual plot for each observation (case 1, 2, …, n) in our local_obs data frame. Since we specified labels = "Yes" in the explain() function, it will provide the probability of each observation attriting. And since we specified n_features = 5, it will plot the 5 most influential variables that best explain the linear model in that observation's local region, along with whether each variable supports (increases the probability) or contradicts (decreases the probability) the prediction. It also provides us with the model fit for each model ("Explanation Fit: XX"), which allows us to see how well that model explains the local region.

Consequently, we can infer that case 3 has the highest likelihood of attriting out of the 5 observations, and the 3 variables that appear to be driving this high probability are working overtime, being single, and working as a lab tech.

plot_features(explanation_caret)

The other plot we can create is a heatmap showing how the different variables selected across all the observations influence each case. We use the plot_explanations() function for this. This plot becomes useful if you are trying to find common features that influence all observations, or if you are performing this analysis across many observations, which makes plot_features() difficult to read.

plot_explanations(explanation_caret)

Tuning LIME

As you saw in the above plot_features() plot, the output provides the model fit. In this case the best simple model fit for the given local regions was R^2 = 0.59 for case 3. Considering there are several knobs we can turn when performing the LIME algorithm, we can treat these as tuning parameters to try to find the best-fit model for the local region. This helps to maximize the amount of trust we can have in the local region explanation.

As an example, the following changes the distance function to use the manhattan distance algorithm, we increase the kernel width substantially to create a larger local region, and we change our feature selection approach to a LARS lasso model. The result is a fairly substantial increase in our explanation fits.

# tune LIME algorithm
explanation_caret <- lime::explain(
  x              = local_obs,
  explainer      = explainer_caret,
  n_permutations = 5000,
  dist_fun       = "manhattan",
  kernel_width   = 3,
  n_features     = 5,
  feature_select = "lasso_path",
  labels         = "Yes"
)

plot_features(explanation_caret)

Supported vs Non-supported models

Currently, lime supports supervised models produced in caret, mlr, xgboost, h2o, keras, and MASS::lda. Consequently, any supervised models created with these packages will function just fine with lime.

explainer_h2o_rf  <- lime(train_obs, h2o_rf,  n_bins = 5)
explainer_h2o_glm <- lime(train_obs, h2o_glm, n_bins = 5)
explainer_h2o_gbm <- lime(train_obs, h2o_gbm, n_bins = 5)

explanation_rf <- lime::explain(local_obs, explainer_h2o_rf, n_features = 5,
                                labels = "Yes", kernel_width = .1,
                                feature_select = "highest_weights")
explanation_glm <- lime::explain(local_obs, explainer_h2o_glm, n_features = 5,
                                 labels = "Yes", kernel_width = .1,
                                 feature_select = "highest_weights")
explanation_gbm <- lime::explain(local_obs, explainer_h2o_gbm, n_features = 5,
                                 labels = "Yes", kernel_width = .1,
                                 feature_select = "highest_weights")

p1 <- plot_features(explanation_rf,  ncol = 1) + ggtitle("rf")
p2 <- plot_features(explanation_glm, ncol = 1) + ggtitle("glm")
p3 <- plot_features(explanation_gbm, ncol = 1) + ggtitle("gbm")
gridExtra::grid.arrange(p1, p2, p3, nrow = 1)

However, any models that do not have built in support will produce an error. For example, the model we created directly with ranger is not supported and produces an error.

explainer_ranger <- lime(train, fit.ranger, n_bins = 5)
## Error in UseMethod("lime", x): no applicable method for 'lime' applied to an object of class "function"

We can work with this pretty easily by building two functions that make lime compatible with an unsupported package. First, we need to create a model_type() function that specifies what type of model this unsupported package is using. model_type() is a lime-specific function; we just need to create a ranger-specific method. We do this by taking the class name of our ranger object, creating a model_type.ranger method, and simply returning the type of model ("classification" in this example).

# get the model class
class(fit.ranger)
## [1] "ranger"

# need to create custom model_type function
model_type.ranger <- function(x, ...) {
  # Function tells lime() what model type we are dealing with
  # 'classification', 'regression', 'survival', 'clustering', 'multilabel', etc
  return("classification")
}

model_type(fit.ranger)
## [1] "classification"

We then need to create a predict_model() method for ranger as well. The output for this function should be a data frame. For a regression problem it should produce a single column data frame with the predicted response and for a classification problem it should create a column containing the probabilities for each categorical class (binary “Yes” “No” in this example).

# need to create custom predict_model function
predict_model.ranger <- function(x, newdata, ...) {
  # Function performs prediction and returns data frame with Response
  pred <- predict(x, newdata)
  return(as.data.frame(pred$predictions))
}

predict_model(fit.ranger, newdata = local_obs)
##         Yes        No
## 1 0.2915452 0.7084548
## 2 0.0701619 0.9298381
## 3 0.4406563 0.5593437
## 4 0.3465294 0.6534706
## 5 0.2525397 0.7474603

Now that we have those methods developed and in our global environment, we can run our lime functions and produce our outputs.[3]

explainer_ranger <- lime(train_obs, fit.ranger, n_bins = 5)

explanation_ranger <- lime::explain(local_obs, explainer_ranger, n_features = 5,
                                    n_labels = 2, kernel_width = .1)

plot_features(explanation_ranger, ncol = 2) + ggtitle("ranger")

Learning More

At Business Science, we’ve been using the lime package with clients to help explain our machine learning models – It’s been our secret weapon. Our primary use cases are with h2o and keras, both of which are supported in lime. In fact, we actually built the h2o integration to gain the beneifts of LIME with stacked ensembles, deep learning, and other black-box algorithms. We’ve used it with clients to help them detect which employees should be considered for executive promotion. We’ve even provided previous real-world business problem / machine learning tutorials:

In fact, those that want to learn lime while solving a real world data science problem can get started today with our new course: Data Science For Business (DS4B 201)

Additional Resources

LIME provides a great, model-agnostic approach to assessing local interpretation of predictions. To learn more I would start with the following resources:

Announcements

Matt was recently on Episode 165 of the SuperDataScience Podcast. In his second appearance (he was also on Episode 109, where he spoke about the transition to data science), he talks about giving back to the data science community in the form of education, open source software, and blogging!

Business Science University

If you are looking to take the next step and learn Data Science For Business (DS4B), Business Science University is for you! Our goal is to empower data scientists through teaching the tools and techniques we implement every day. You’ll learn:

  • TO SOLVE A REAL WORLD CHURN PROBLEM: Employee Turnover!
  • Data Science Framework: Business Science Problem Framework
  • Tidy Eval
  • H2O Automated Machine Learning
  • LIME Feature Explanations
  • Sensitivity Analysis
  • Tying data science to financial improvement (ROI-Driven Data Science)
Special Autographed “Deep Learning With R” Giveaway!!!

One lucky student who enrolls in June will receive an autographed copy of Deep Learning With R, signed by JJ Allaire, Founder of RStudio and DLwR co-author.

DS4B Virtual Workshop: Predicting Employee Attrition

Did you know that an organization that loses 200 high performing employees per year is essentially losing $15M/year in lost productivity? Many organizations don’t realize this because it’s an indirect cost. It goes unnoticed. What if you could use data science to predict and explain turnover in a way that managers could make better decisions and executives would see results? You will learn the tools to do so in our Virtual Workshop. Here’s an example of a Shiny app you will create.


Get 15% OFF in June!

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in HR 301

Our first Data Science For Business Virtual Workshop teaches you how to solve this employee attrition problem in four courses that are fully integrated:

The Virtual Workshop is intended for intermediate and advanced R users. It’s code intensive (like these articles), but also teaches you fundamentals of data science consulting including CRISP-DM and the Business Science Problem Framework. The content bridges the gap between data science and the business, making you even more effective and improving your organization in the process.


Get 15% OFF in June!

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:

Footnotes
  1. To this end, you are encouraged to read through the article that introduced the lime framework as well as the additional resources linked to from the original Python repository

  2. If you’ve never applied a weighted ridge regression model you can see some details on its application in the glmnet vignette 

  3. If you are curious as to why simply creating the model_type.ranger and predict_model.ranger methods and hosting them in your global environment causes the lime functions to work then I suggest you read chapter 6 of Advanced R


To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.

Switching to blogdown, Netlify and Travis

Mon, 06/25/2018 - 02:00

(This article was first published on Lorenz Walthert, and kindly contributed to R-bloggers)

The problem

Some time ago, I started a blog. I actually did not post a whole lot of stuff though. I was using Jekyll, but my set-up was rather brittle and there were a few problems:
I could not use R Markdown directly. I always had to run knitr manually to get plain Markdown from my R Markdown files and then use them as input for Jekyll. Since the majority of my posts involved at least some R code, this was not a very elegant thing to do.


To leave a comment for the author, please follow the link and comment on their blog: Lorenz Walthert.

Deep Learning for Time Series Forecasting: Predicting Sunspot Frequency with Keras

Mon, 06/25/2018 - 02:00

(This article was first published on TensorFlow for R, and kindly contributed to R-bloggers)

Forecasting sunspots with deep learning

In this post we will examine making time series predictions using the sunspots dataset that ships with base R. Sunspots are dark spots on the sun, associated with lower temperature. Here’s an image from NASA showing the solar phenomenon.

Source: https://www.nasa.gov/content/goddard/largest-sunspot-of-solar-cycle

 

We’re using the monthly version of the dataset, sunspots.month (there is a yearly version, too). It contains 265 years worth of data (from 1749 through 2013) on the number of sunspots per month.

Forecasting this dataset is challenging because of high short term variability as well as long-term irregularities evident in the cycles. For example, maximum amplitudes reached by the low frequency cycle differ a lot, as does the number of high frequency cycle steps needed to reach that maximum low frequency cycle height.

Our post will focus on two dominant aspects: how to apply deep learning to time series forecasting, and how to properly apply cross validation in this domain. For the latter, we will use the rsample package, which allows us to do resampling on time series data. As to the former, our goal is not to reach utmost performance but to show the general course of action when using recurrent neural networks to model this kind of data.

Recurrent neural networks

When our data has a sequential structure, it is recurrent neural networks (RNNs) we use to model it.

As of today, among RNNs, the best established architectures are the GRU (Gated Recurrent Unit) and the LSTM (Long Short Term Memory). For today, let’s not zoom in on what makes them special, but on what they have in common with the most stripped-down RNN: the basic recurrence structure.

In contrast to the prototype of a neural network, often called Multilayer Perceptron (MLP), the RNN has a state that is carried on over time. This is nicely seen in this diagram from Goodfellow et al., a.k.a. the “bible of deep learning”:

Source: http://www.deeplearningbook.org

At each time, the state is a combination of the current input and the previous hidden state. This is reminiscent of autoregressive models, but with neural networks, there has to be some point where we halt the dependence.

That’s because in order to determine the weights, we keep calculating how our loss changes as the input changes. Now if the input we have to consider, at an arbitrary timestep, ranges back indefinitely – then we will not be able to calculate all those gradients. In practice, then, our hidden state will, at every iteration, be carried forward through a fixed number of steps.

We’ll come back to that as soon as we’ve loaded and pre-processed the data.

Setup, pre-processing, and exploration Libraries

Here, first, are the libraries needed for this tutorial.

# Core Tidyverse
library(tidyverse)
library(glue)
library(forcats)

# Time Series
library(timetk)
library(tidyquant)
library(tibbletime)

# Visualization
library(cowplot)

# Preprocessing
library(recipes)

# Sampling / Accuracy
library(rsample)
library(yardstick)

# Modeling
library(keras)
library(tfruns)

If you have not previously run Keras in R, you will need to install Keras using the install_keras() function.

# Install Keras if you have not installed before
install_keras()

Data

sunspot.month is a ts class (not tidy), so we’ll convert to a tidy data set using the tk_tbl() function from timetk. We use this instead of as.tibble() from tibble to automatically preserve the time series index as a zoo yearmon index. Last, we’ll convert the zoo index to date using lubridate::as_date() (loaded with tidyquant) and then change to a tbl_time object to make time series operations easier.

sun_spots <- datasets::sunspot.month %>%
  tk_tbl() %>%
  mutate(index = as_date(index)) %>%
  as_tbl_time(index = index)

sun_spots
# A time tibble: 3,177 x 2
# Index: index
   index      value
 1 1749-01-01  58
 2 1749-02-01  62.6
 3 1749-03-01  70
 4 1749-04-01  55.7
 5 1749-05-01  85
 6 1749-06-01  83.5
 7 1749-07-01  94.8
 8 1749-08-01  66.3
 9 1749-09-01  75.9
10 1749-10-01  75.5
# ... with 3,167 more rows

Exploratory data analysis

The time series is long (265 years!). We can visualize the time series both in full, and zoomed in on the first 10 years to get a feel for the series.

Visualizing sunspot data with cowplot

We’ll make two ggplots and combine them using cowplot::plot_grid(). Note that for the zoomed in plot, we make use of tibbletime::time_filter(), which is an easy way to perform time-based filtering.

p1 <- sun_spots %>%
  ggplot(aes(index, value)) +
  geom_point(color = palette_light()[[1]], alpha = 0.5) +
  theme_tq() +
  labs(
    title = "From 1749 to 2013 (Full Data Set)"
  )

p2 <- sun_spots %>%
  filter_time("start" ~ "1800") %>%
  ggplot(aes(index, value)) +
  geom_line(color = palette_light()[[1]], alpha = 0.5) +
  geom_point(color = palette_light()[[1]]) +
  geom_smooth(method = "loess", span = 0.2, se = FALSE) +
  theme_tq() +
  labs(
    title = "1749 to 1759 (Zoomed In To Show Changes over the Year)",
    caption = "datasets::sunspot.month"
  )

p_title <- ggdraw() +
  draw_label("Sunspots", size = 18, fontface = "bold",
             colour = palette_light()[[1]])

plot_grid(p_title, p1, p2, ncol = 1, rel_heights = c(0.1, 1, 1))

Backtesting: time series cross validation

When doing cross validation on sequential data, the time dependencies on preceding samples must be preserved. We can create a cross validation sampling plan by offsetting the window used to select sequential sub-samples. In essence, we’re creatively dealing with the fact that there’s no future test data available by creating multiple synthetic “futures” – a process often, esp. in finance, called “backtesting”.

As mentioned in the introduction, the rsample package includes facilities for backtesting on time series. The vignette, "Time Series Analysis Example", describes a procedure that uses the rolling_origin() function to create samples designed for time series cross validation. We'll use this approach.

Developing a backtesting strategy

The sampling plan we create uses 100 years (initial = 12 x 100 samples) for the training set and 50 years (assess = 12 x 50) for the testing (validation) set. We select a skip span of about 22 years (skip = 12 x 22 - 1) to approximately evenly distribute the samples into 6 sets that span the entire 265 years of sunspots history. Last, we select cumulative = FALSE to allow the origin to shift, which ensures that models on more recent data are not given an unfair advantage (more observations) over those operating on less recent data. The returned tibble, rolling_origin_resamples, contains the resampling splits.

periods_train <- 12 * 100
periods_test  <- 12 * 50
skip_span     <- 12 * 22 - 1

rolling_origin_resamples <- rolling_origin(
  sun_spots,
  initial    = periods_train,
  assess     = periods_test,
  cumulative = FALSE,
  skip       = skip_span
)

rolling_origin_resamples
# Rolling origin forecast resampling
# A tibble: 6 x 2
  splits id
1        Slice1
2        Slice2
3        Slice3
4        Slice4
5        Slice5
6        Slice6

Visualizing the backtesting strategy

We can visualize the resamples with two custom functions. The first, plot_split(), plots one of the resampling splits using ggplot2. Note that an expand_y_axis argument is added to expand the date range to the full sun_spots dataset date range. This will become useful when we visualize all plots together.

# Plotting function for a single split
plot_split <- function(split, expand_y_axis = TRUE,
                       alpha = 1, size = 1, base_size = 14) {

  # Manipulate data
  train_tbl <- training(split) %>%
    add_column(key = "training")

  test_tbl <- testing(split) %>%
    add_column(key = "testing")

  data_manipulated <- bind_rows(train_tbl, test_tbl) %>%
    as_tbl_time(index = index) %>%
    mutate(key = fct_relevel(key, "training", "testing"))

  # Collect attributes
  train_time_summary <- train_tbl %>%
    tk_index() %>%
    tk_get_timeseries_summary()

  test_time_summary <- test_tbl %>%
    tk_index() %>%
    tk_get_timeseries_summary()

  # Visualize
  g <- data_manipulated %>%
    ggplot(aes(x = index, y = value, color = key)) +
    geom_line(size = size, alpha = alpha) +
    theme_tq(base_size = base_size) +
    scale_color_tq() +
    labs(
      title    = glue("Split: {split$id}"),
      subtitle = glue("{train_time_summary$start} to {test_time_summary$end}"),
      y = "", x = ""
    ) +
    theme(legend.position = "none")

  if (expand_y_axis) {

    sun_spots_time_summary <- sun_spots %>%
      tk_index() %>%
      tk_get_timeseries_summary()

    g <- g +
      scale_x_date(limits = c(sun_spots_time_summary$start,
                              sun_spots_time_summary$end))
  }

  return(g)
}

The plot_split() function takes one split (in this case Slice1) and returns a visual of the sampling strategy. We expand the axis to the range of the full dataset using expand_y_axis = TRUE.

rolling_origin_resamples$splits[[1]] %>% plot_split(expand_y_axis = TRUE) + theme(legend.position = "bottom")

The second function, plot_sampling_plan(), scales the plot_split() function to all of the samples using purrr and cowplot.

# Plotting function that scales to all splits
plot_sampling_plan <- function(sampling_tbl, expand_y_axis = TRUE,
                               ncol = 3, alpha = 1, size = 1, base_size = 14,
                               title = "Sampling Plan") {

  # Map plot_split() to sampling_tbl
  sampling_tbl_with_plots <- sampling_tbl %>%
    mutate(gg_plots = map(splits, plot_split,
                          expand_y_axis = expand_y_axis,
                          alpha = alpha, base_size = base_size))

  # Make plots with cowplot
  plot_list <- sampling_tbl_with_plots$gg_plots

  p_temp <- plot_list[[1]] + theme(legend.position = "bottom")
  legend <- get_legend(p_temp)

  p_body <- plot_grid(plotlist = plot_list, ncol = ncol)

  p_title <- ggdraw() +
    draw_label(title, size = 14, fontface = "bold",
               colour = palette_light()[[1]])

  g <- plot_grid(p_title, p_body, legend,
                 ncol = 1, rel_heights = c(0.05, 1, 0.05))

  return(g)
}

We can now visualize the entire backtesting strategy with plot_sampling_plan(). We can see how the sampling plan shifts the sampling window with each progressive slice of the train/test splits.

rolling_origin_resamples %>%
  plot_sampling_plan(expand_y_axis = T, ncol = 3, alpha = 1, size = 1,
                     base_size = 10,
                     title = "Backtesting Strategy: Rolling Origin Sampling Plan")

And, we can set expand_y_axis = FALSE to zoom in on the samples.

rolling_origin_resamples %>%
  plot_sampling_plan(expand_y_axis = F, ncol = 3, alpha = 1, size = 1,
                     base_size = 10,
                     title = "Backtesting Strategy: Zoomed In")

We’ll use this backtesting strategy (6 samples from one time series each with 50/10 split in years and a ~20 year offset) when testing the veracity of the LSTM model on the sunspots dataset.

The LSTM model

To begin, we’ll develop an LSTM model on a single sample from the backtesting strategy, namely, the most recent slice. We’ll then apply the model to all samples to investigate modeling performance.

example_split <- rolling_origin_resamples$splits[[6]]
example_split_id <- rolling_origin_resamples$id[[6]]

We can reuse the plot_split() function to visualize the split. Set expand_y_axis = FALSE to zoom in on the subsample.

plot_split(example_split, expand_y_axis = FALSE, size = 0.5) +
  theme(legend.position = "bottom") +
  ggtitle(glue("Split: {example_split_id}"))

Data setup

To aid hyperparameter tuning, besides the training set we also need a validation set. For example, we will use a callback, callback_early_stopping, that stops training when no significant improvement is seen on the validation set (what's considered significant is up to you).
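For instance, such a callback might be set up along these lines (the patience and min_delta values are arbitrary illustrations, not the settings used later):

# Stop training once validation loss has failed to improve by at least
# min_delta for `patience` consecutive epochs (values are illustrative)
callback_early_stopping(
  monitor   = "val_loss",
  min_delta = 1e-4,
  patience  = 10
)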

We will dedicate two thirds of the analysis set to training and one third to validation.

df_trn <- analysis(example_split)[1:800, , drop = FALSE]
df_val <- analysis(example_split)[801:1200, , drop = FALSE]
df_tst <- assessment(example_split)

First, let’s combine the training and testing data sets into a single data set with a column key that specifies where they came from (either “training” or “testing)”. Note that the tbl_time object will need to have the index respecified during the bind_rows() step, but this issue should be corrected in dplyr soon.

df <- bind_rows(
  df_trn %>% add_column(key = "training"),
  df_val %>% add_column(key = "validation"),
  df_tst %>% add_column(key = "testing")
) %>%
  as_tbl_time(index = index)

df
# A time tibble: 1,800 x 3
# Index: index
   index      value key
 1 1849-06-01  81.1 training
 2 1849-07-01  78   training
 3 1849-08-01  67.7 training
 4 1849-09-01  93.7 training
 5 1849-10-01  71.5 training
 6 1849-11-01  99   training
 7 1849-12-01  97   training
 8 1850-01-01  78   training
 9 1850-02-01  89.4 training
10 1850-03-01  82.6 training
# ... with 1,790 more rows

Preprocessing with recipes

The LSTM algorithm will usually work better if the input data has been centered and scaled. We can conveniently accomplish this using the recipes package. In addition to step_center and step_scale, we're using step_sqrt to reduce variance and remove outliers. The actual transformations are executed when we bake the data according to the recipe:

rec_obj <- recipe(value ~ ., df) %>%
  step_sqrt(value) %>%
  step_center(value) %>%
  step_scale(value) %>%
  prep()

df_processed_tbl <- bake(rec_obj, df)

df_processed_tbl
# A tibble: 1,800 x 3
   index      value key
 1 1849-06-01 0.714 training
 2 1849-07-01 0.660 training
 3 1849-08-01 0.473 training
 4 1849-09-01 0.922 training
 5 1849-10-01 0.544 training
 6 1849-11-01 1.01  training
 7 1849-12-01 0.974 training
 8 1850-01-01 0.660 training
 9 1850-02-01 0.852 training
10 1850-03-01 0.739 training
# ... with 1,790 more rows

Next, let’s capture the original center and scale so we can invert the steps after modeling. The square root step can then simply be undone by squaring the back-transformed data.

center_history <- rec_obj$steps[[2]]$means["value"]
scale_history  <- rec_obj$steps[[3]]$sds["value"]

c("center" = center_history, "scale" = scale_history)
# center.value  scale.value
#     6.694468     3.238935

Reshaping the data

Keras LSTM expects the input as well as the target data to be in a specific shape. The input has to be a 3-d array of size num_samples, num_timesteps, num_features.

Here, num_samples is the number of observations in the set. This will get fed to the model in portions of batch_size. The second dimension, num_timesteps, is the length of the hidden state we were talking about above. Finally, the third dimension is the number of predictors we’re using. For univariate time series, this is 1.

How long should we choose the hidden state to be? This generally depends on the dataset and our goal. If we did one-step-ahead forecasts, thus forecasting the following month only, our main concern would be choosing a state length that allows the network to learn any patterns present in the data.

Now say we wanted to forecast 12 months instead, as does SILSO, the World Data Center for the production, preservation and dissemination of the international sunspot number. The way we can do this, with Keras, is by wiring the LSTM hidden states to sets of consecutive outputs of the same length. Thus, if we want to produce predictions for 12 months, our LSTM should have a hidden state length of 12.

These 12 time steps will then get wired to 12 linear predictor units using a time_distributed() wrapper. That wrapper’s task is to apply the same calculation (i.e., the same weight matrix) to every state input it receives.

Now, what’s the target array’s format supposed to be? As we’re forecasting several timesteps here, the target data again needs to be 3-dimensional. Dimension 1 again is the batch dimension, dimension 2 again corresponds to the number of timesteps (the forecasted ones), and dimension 3 is the size of the wrapped layer. In our case, the wrapped layer is a layer_dense() of a single unit, as we want exactly one prediction per point in time.

So, let's reshape the data. The main action here is creating the sliding windows of 12 steps of input, followed by 12 steps of output each. This is easiest to understand with a shorter and simpler example. Say our input were the numbers from 1 to 10, and our chosen sequence length (state size) were 4. This is how we would want our training input to look:

1,2,3,4 2,3,4,5 3,4,5,6

And our target data, correspondingly:

5,6,7,8 6,7,8,9 7,8,9,10
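In R, those rows can be produced by looping over the possible starting positions. Here is a tiny self-contained sketch for the toy sequence (a window of 4 inputs followed by 4 targets):

toy <- 1:10
win <- 4  # length of both the input and the target window

# one row per possible starting position
windows <- t(sapply(1:(length(toy) - 2 * win + 1),
                    function(i) toy[i:(i + 2 * win - 1)]))

windows[, 1:win]                # inputs:  1 2 3 4 / 2 3 4 5 / 3 4 5 6
windows[, (win + 1):(2 * win)]  # targets: 5 6 7 8 / 6 7 8 9 / 7 8 9 10

The build_matrix() helper defined below does exactly this, just parameterized by the overall window length.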

We’ll define a short function that does this reshaping on a given dataset. Then finally, we add the third axis that is formally needed (even though that axis is of size 1 in our case).

# these variables are being defined just because of the order in which
# we present things in this post (first the data, then the model)
# they will be superseded by FLAGS$n_timesteps, FLAGS$batch_size and n_predictions
# in the following snippet
n_timesteps <- 12
n_predictions <- n_timesteps
batch_size <- 10

# functions used
build_matrix <- function(tseries, overall_timesteps) {
  t(sapply(1:(length(tseries) - overall_timesteps + 1), function(x)
    tseries[x:(x + overall_timesteps - 1)]))
}

reshape_X_3d <- function(X) {
  dim(X) <- c(dim(X)[1], dim(X)[2], 1)
  X
}

# extract values from data frame
train_vals <- df_processed_tbl %>%
  filter(key == "training") %>%
  select(value) %>%
  pull()
valid_vals <- df_processed_tbl %>%
  filter(key == "validation") %>%
  select(value) %>%
  pull()
test_vals <- df_processed_tbl %>%
  filter(key == "testing") %>%
  select(value) %>%
  pull()

# build the windowed matrices
train_matrix <- build_matrix(train_vals, n_timesteps + n_predictions)
valid_matrix <- build_matrix(valid_vals, n_timesteps + n_predictions)
test_matrix <- build_matrix(test_vals, n_timesteps + n_predictions)

# separate matrices into training and testing parts
# also, discard last batch if there are fewer than batch_size samples
# (a purely technical requirement)
X_train <- train_matrix[, 1:n_timesteps]
y_train <- train_matrix[, (n_timesteps + 1):(n_timesteps * 2)]
X_train <- X_train[1:(nrow(X_train) %/% batch_size * batch_size), ]
y_train <- y_train[1:(nrow(y_train) %/% batch_size * batch_size), ]

X_valid <- valid_matrix[, 1:n_timesteps]
y_valid <- valid_matrix[, (n_timesteps + 1):(n_timesteps * 2)]
X_valid <- X_valid[1:(nrow(X_valid) %/% batch_size * batch_size), ]
y_valid <- y_valid[1:(nrow(y_valid) %/% batch_size * batch_size), ]

X_test <- test_matrix[, 1:n_timesteps]
y_test <- test_matrix[, (n_timesteps + 1):(n_timesteps * 2)]
X_test <- X_test[1:(nrow(X_test) %/% batch_size * batch_size), ]
y_test <- y_test[1:(nrow(y_test) %/% batch_size * batch_size), ]

# add on the required third axis
X_train <- reshape_X_3d(X_train)
X_valid <- reshape_X_3d(X_valid)
X_test <- reshape_X_3d(X_test)

y_train <- reshape_X_3d(y_train)
y_valid <- reshape_X_3d(y_valid)
y_test <- reshape_X_3d(y_test)

Building the LSTM model

Now that we have our data in the required form, let’s finally build the model. As always in deep learning, an important, and often time-consuming, part of the job is tuning hyperparameters. To keep this post self-contained, and considering this is primarily a tutorial on how to use LSTM in R, let’s assume the following settings were found after extensive experimentation (in reality experimentation did take place, but not to a degree that performance couldn’t possibly be improved).

Instead of hard-coding the hyperparameters, we'll use tfruns to set up an environment where we can easily perform grid search.

We’ll quickly comment on what these parameters do but mainly leave those topics to further posts.

FLAGS <- flags(
  # There is a so-called "stateful LSTM" in Keras. While LSTM is stateful per se,
  # this adds a further tweak where the hidden states get initialized with values
  # from the item at same position in the previous batch.
  # This is helpful just under specific circumstances, or if you want to create an
  # "infinite stream" of states, in which case you'd use 1 as the batch size.
  # Below, we show how the code would have to be changed to use this, but it won't be further
  # discussed here.
  flag_boolean("stateful", FALSE),
  # Should we use several layers of LSTM?
  # Again, just included for completeness, it did not yield any superior performance on this task.
  # This will actually stack exactly one additional layer of LSTM units.
  flag_boolean("stack_layers", FALSE),
  # number of samples fed to the model in one go
  flag_integer("batch_size", 10),
  # size of the hidden state, equals size of predictions
  flag_integer("n_timesteps", 12),
  # how many epochs to train for
  flag_integer("n_epochs", 100),
  # fraction of the units to drop for the linear transformation of the inputs
  flag_numeric("dropout", 0.2),
  # fraction of the units to drop for the linear transformation of the recurrent state
  flag_numeric("recurrent_dropout", 0.2),
  # loss function. Found to work better for this specific case than mean squared error
  flag_string("loss", "logcosh"),
  # optimizer = stochastic gradient descent. Seemed to work better than adam or rmsprop here
  # (as indicated by limited testing)
  flag_string("optimizer_type", "sgd"),
  # size of the LSTM layer
  flag_integer("n_units", 128),
  # learning rate
  flag_numeric("lr", 0.003),
  # momentum, an additional parameter to the SGD optimizer
  flag_numeric("momentum", 0.9),
  # parameter to the early stopping callback
  flag_integer("patience", 10)
)

# the number of predictions we'll make equals the length of the hidden state
n_predictions <- FLAGS$n_timesteps
# how many features = predictors we have
n_features <- 1

# just in case we wanted to try different optimizers, we could add here
optimizer <- switch(FLAGS$optimizer_type,
                    sgd = optimizer_sgd(lr = FLAGS$lr, momentum = FLAGS$momentum))

# callbacks to be passed to the fit() function
# We just use one here: we may stop before n_epochs if the loss on the validation set
# does not decrease (by a configurable amount, over a configurable time)
callbacks <- list(
  callback_early_stopping(patience = FLAGS$patience)
)
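With the flags in place, a grid search could later be launched from a separate script via tfruns (a sketch; it assumes the training code above lives in a file called train.R):

library(tfruns)

# each run re-executes train.R with the corresponding flags overridden
runs <- tuning_run(
  "train.R",
  flags = list(
    dropout = c(0.1, 0.2, 0.3),
    lr      = c(0.001, 0.003)
  )
)

# inspect the results, e.g. to find the run with the lowest validation loss
head(runs)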

After all these preparations, the code for constructing and training the model is rather short! Let’s first quickly view the “long version”, that would allow you to test stacking several LSTMs or use a stateful LSTM, then go through the final short version (that does neither) and comment on it.

This, just for reference, is the complete code.

model <- keras_model_sequential() model %>% layer_lstm( units = FLAGS$n_units, batch_input_shape = c(FLAGS$batch_size, FLAGS$n_timesteps, n_features), dropout = FLAGS$dropout, recurrent_dropout = FLAGS$recurrent_dropout, return_sequences = TRUE ) if (FLAGS$stack_layers) { model %>% layer_lstm( units = FLAGS$n_units, dropout = FLAGS$dropout, recurrent_dropout = FLAGS$recurrent_dropout, return_sequences = TRUE ) } model %>% time_distributed(layer_dense(units = 1)) model %>% compile( loss = FLAGS$loss, optimizer = optimizer, metrics = list("mean_squared_error") ) if (!FLAGS$stateful) { model %>% fit( x = X_train, y = y_train, validation_data = list(X_valid, y_valid), batch_size = FLAGS$batch_size, epochs = FLAGS$n_epochs, callbacks = callbacks ) } else { for (i in 1:n_epochs) { model %>% fit( x = X_train, y = y_train, validation_data = list(X_valid, y_valid), callbacks = callbacks, batch_size = FLAGS$batch_size, epochs = 1, shuffle = FALSE ) model %>% reset_states() } } if (FLAGS$stateful) model %>% reset_states()

Now let's step through the simpler, yet equally well (or better) performing configuration below.

# create the model
model <- keras_model_sequential()

# add layers
# we have just two, the LSTM and the time_distributed
model %>%
  layer_lstm(
    units = FLAGS$n_units,
    # the first layer in a model needs to know the shape of the input data
    batch_input_shape = c(FLAGS$batch_size, FLAGS$n_timesteps, n_features),
    dropout = FLAGS$dropout,
    recurrent_dropout = FLAGS$recurrent_dropout,
    # by default, an LSTM just returns the final state
    return_sequences = TRUE
  ) %>%
  time_distributed(layer_dense(units = 1))

model %>% compile(
  loss = FLAGS$loss,
  optimizer = optimizer,
  # in addition to the loss, Keras will inform us about current MSE while training
  metrics = list("mean_squared_error")
)

history <- model %>% fit(
  x = X_train,
  y = y_train,
  validation_data = list(X_valid, y_valid),
  batch_size = FLAGS$batch_size,
  epochs = FLAGS$n_epochs,
  callbacks = callbacks
)

As we see, training was stopped after ~55 epochs as validation loss did not decrease any more. We also see that performance on the validation set is way worse than performance on the training set – normally indicating overfitting.

This topic too, we’ll leave to a separate discussion another time, but interestingly regularization using higher values of dropout and recurrent_dropout (combined with increasing model capacity) did not yield better generalization performance. This is probably related to the characteristics of this specific time series we mentioned in the introduction.

plot(history, metrics = "loss")

Now let’s see how well the model was able to capture the characteristics of the training set.

pred_train <- model %>%
  predict(X_train, batch_size = FLAGS$batch_size) %>%
  .[, , 1]

# Retransform values to original scale
pred_train <- (pred_train * scale_history + center_history) ^2
compare_train <- df %>% filter(key == "training")

# build a dataframe that has both actual and predicted values
for (i in 1:nrow(pred_train)) {
  varname <- paste0("pred_train", i)
  compare_train <- mutate(compare_train, !!varname := c(
    rep(NA, FLAGS$n_timesteps + i - 1),
    pred_train[i, ],
    rep(NA, nrow(compare_train) - FLAGS$n_timesteps * 2 - i + 1)
  ))
}

We compute the average RMSE over all sequences of predictions.

coln <- colnames(compare_train)[4:ncol(compare_train)]
cols <- map(coln, quo(sym(.)))

rsme_train <- map_dbl(cols, function(col)
  rmse(
    compare_train,
    truth = value,
    estimate = !!col,
    na.rm = TRUE
  )) %>% mean()

rsme_train
21.01495

How do these predictions really look? As a visualization of all predicted sequences would look pretty crowded, we arbitrarily pick start points at regular intervals.

ggplot(compare_train, aes(x = index, y = value)) + geom_line() + geom_line(aes(y = pred_train1), color = "cyan") + geom_line(aes(y = pred_train50), color = "red") + geom_line(aes(y = pred_train100), color = "green") + geom_line(aes(y = pred_train150), color = "violet") + geom_line(aes(y = pred_train200), color = "cyan") + geom_line(aes(y = pred_train250), color = "red") + geom_line(aes(y = pred_train300), color = "red") + geom_line(aes(y = pred_train350), color = "green") + geom_line(aes(y = pred_train400), color = "cyan") + geom_line(aes(y = pred_train450), color = "red") + geom_line(aes(y = pred_train500), color = "green") + geom_line(aes(y = pred_train550), color = "violet") + geom_line(aes(y = pred_train600), color = "cyan") + geom_line(aes(y = pred_train650), color = "red") + geom_line(aes(y = pred_train700), color = "red") + geom_line(aes(y = pred_train750), color = "green") + ggtitle("Predictions on the training set")

This looks pretty good. From the validation loss, we don’t quite expect the same from the test set, though.

Let’s see.

pred_test <- model %>% predict(X_test, batch_size = FLAGS$batch_size) %>% .[, , 1] # Retransform values to original scale pred_test <- (pred_test * scale_history + center_history) ^2 pred_test[1:10, 1:5] %>% print() compare_test <- df %>% filter(key == "testing") # build a dataframe that has both actual and predicted values for (i in 1:nrow(pred_test)) { varname <- paste0("pred_test", i) compare_test <- mutate(compare_test,!!varname := c( rep(NA, FLAGS$n_timesteps + i - 1), pred_test[i,], rep(NA, nrow(compare_test) - FLAGS$n_timesteps * 2 - i + 1) )) } compare_test %>% write_csv(str_replace(model_path, ".hdf5", ".test.csv")) compare_test[FLAGS$n_timesteps:(FLAGS$n_timesteps + 10), c(2, 4:8)] %>% print() coln <- colnames(compare_test)[4:ncol(compare_test)] cols <- map(coln, quo(sym(.))) rsme_test <- map_dbl(cols, function(col) rmse( compare_test, truth = value, estimate = !!col, na.rm = TRUE )) %>% mean() rsme_test 31.31616 ggplot(compare_test, aes(x = index, y = value)) + geom_line() + geom_line(aes(y = pred_test1), color = "cyan") + geom_line(aes(y = pred_test50), color = "red") + geom_line(aes(y = pred_test100), color = "green") + geom_line(aes(y = pred_test150), color = "violet") + geom_line(aes(y = pred_test200), color = "cyan") + geom_line(aes(y = pred_test250), color = "red") + geom_line(aes(y = pred_test300), color = "green") + geom_line(aes(y = pred_test350), color = "cyan") + geom_line(aes(y = pred_test400), color = "red") + geom_line(aes(y = pred_test450), color = "green") + geom_line(aes(y = pred_test500), color = "cyan") + geom_line(aes(y = pred_test550), color = "violet") + ggtitle("Predictions on test set")

That’s not as good as on the training set, but not bad either, given this time series is quite challenging.

Having defined and run our model on a manually chosen example split, let’s now revert to our overall re-sampling frame.

Backtesting the model on all splits

To obtain predictions on all splits, we move the above code into a function and apply it to all splits. First, here's the function. It returns a list of two dataframes, one each for the training and test sets, containing the model's predictions together with the actual values.

obtain_predictions <- function(split) { df_trn <- analysis(split)[1:800, , drop = FALSE] df_val <- analysis(split)[801:1200, , drop = FALSE] df_tst <- assessment(split) df <- bind_rows( df_trn %>% add_column(key = "training"), df_val %>% add_column(key = "validation"), df_tst %>% add_column(key = "testing") ) %>% as_tbl_time(index = index) rec_obj <- recipe(value ~ ., df) %>% step_sqrt(value) %>% step_center(value) %>% step_scale(value) %>% prep() df_processed_tbl <- bake(rec_obj, df) center_history <- rec_obj$steps[[2]]$means["value"] scale_history <- rec_obj$steps[[3]]$sds["value"] FLAGS <- flags( flag_boolean("stateful", FALSE), flag_boolean("stack_layers", FALSE), flag_integer("batch_size", 10), flag_integer("n_timesteps", 12), flag_integer("n_epochs", 100), flag_numeric("dropout", 0.2), flag_numeric("recurrent_dropout", 0.2), flag_string("loss", "logcosh"), flag_string("optimizer_type", "sgd"), flag_integer("n_units", 128), flag_numeric("lr", 0.003), flag_numeric("momentum", 0.9), flag_integer("patience", 10) ) n_predictions <- FLAGS$n_timesteps n_features <- 1 optimizer <- switch(FLAGS$optimizer_type, sgd = optimizer_sgd(lr = FLAGS$lr, momentum = FLAGS$momentum)) callbacks <- list( callback_early_stopping(patience = FLAGS$patience) ) train_vals <- df_processed_tbl %>% filter(key == "training") %>% select(value) %>% pull() valid_vals <- df_processed_tbl %>% filter(key == "validation") %>% select(value) %>% pull() test_vals <- df_processed_tbl %>% filter(key == "testing") %>% select(value) %>% pull() train_matrix <- build_matrix(train_vals, FLAGS$n_timesteps + n_predictions) valid_matrix <- build_matrix(valid_vals, FLAGS$n_timesteps + n_predictions) test_matrix <- build_matrix(test_vals, FLAGS$n_timesteps + n_predictions) X_train <- train_matrix[, 1:FLAGS$n_timesteps] y_train <- train_matrix[, (FLAGS$n_timesteps + 1):(FLAGS$n_timesteps * 2)] X_train <- X_train[1:(nrow(X_train) %/% FLAGS$batch_size * FLAGS$batch_size),] y_train <- y_train[1:(nrow(y_train) %/% FLAGS$batch_size * FLAGS$batch_size),] X_valid <- valid_matrix[, 1:FLAGS$n_timesteps] y_valid <- valid_matrix[, (FLAGS$n_timesteps + 1):(FLAGS$n_timesteps * 2)] X_valid <- X_valid[1:(nrow(X_valid) %/% FLAGS$batch_size * FLAGS$batch_size),] y_valid <- y_valid[1:(nrow(y_valid) %/% FLAGS$batch_size * FLAGS$batch_size),] X_test <- test_matrix[, 1:FLAGS$n_timesteps] y_test <- test_matrix[, (FLAGS$n_timesteps + 1):(FLAGS$n_timesteps * 2)] X_test <- X_test[1:(nrow(X_test) %/% FLAGS$batch_size * FLAGS$batch_size),] y_test <- y_test[1:(nrow(y_test) %/% FLAGS$batch_size * FLAGS$batch_size),] X_train <- reshape_X_3d(X_train) X_valid <- reshape_X_3d(X_valid) X_test <- reshape_X_3d(X_test) y_train <- reshape_X_3d(y_train) y_valid <- reshape_X_3d(y_valid) y_test <- reshape_X_3d(y_test) model <- keras_model_sequential() model %>% layer_lstm( units = FLAGS$n_units, batch_input_shape = c(FLAGS$batch_size, FLAGS$n_timesteps, n_features), dropout = FLAGS$dropout, recurrent_dropout = FLAGS$recurrent_dropout, return_sequences = TRUE ) %>% time_distributed(layer_dense(units = 1)) model %>% compile( loss = FLAGS$loss, optimizer = optimizer, metrics = list("mean_squared_error") ) model %>% fit( x = X_train, y = y_train, validation_data = list(X_valid, y_valid), batch_size = FLAGS$batch_size, epochs = FLAGS$n_epochs, callbacks = callbacks ) pred_train <- model %>% predict(X_train, batch_size = FLAGS$batch_size) %>% .[, , 1] # Retransform values pred_train <- (pred_train * scale_history + center_history) ^ 2 compare_train <- df %>% filter(key == "training") 
for (i in 1:nrow(pred_train)) { varname <- paste0("pred_train", i) compare_train <- mutate(compare_train, !!varname := c( rep(NA, FLAGS$n_timesteps + i - 1), pred_train[i, ], rep(NA, nrow(compare_train) - FLAGS$n_timesteps * 2 - i + 1) )) } pred_test <- model %>% predict(X_test, batch_size = FLAGS$batch_size) %>% .[, , 1] # Retransform values pred_test <- (pred_test * scale_history + center_history) ^ 2 compare_test <- df %>% filter(key == "testing") for (i in 1:nrow(pred_test)) { varname <- paste0("pred_test", i) compare_test <- mutate(compare_test, !!varname := c( rep(NA, FLAGS$n_timesteps + i - 1), pred_test[i, ], rep(NA, nrow(compare_test) - FLAGS$n_timesteps * 2 - i + 1) )) } list(train = compare_train, test = compare_test) }

Mapping the function over all splits yields a list of predictions.

all_split_preds <- rolling_origin_resamples %>% mutate(predict = map(splits, obtain_predictions))

Calculate RMSE on all splits:

calc_rmse <- function(df) { coln <- colnames(df)[4:ncol(df)] cols <- map(coln, quo(sym(.))) map_dbl(cols, function(col) rmse( df, truth = value, estimate = !!col, na.rm = TRUE )) %>% mean() } all_split_preds <- all_split_preds %>% unnest(predict) all_split_preds_train <- all_split_preds[seq(1, 11, by = 2), ] all_split_preds_test <- all_split_preds[seq(2, 12, by = 2), ] all_split_rmses_train <- all_split_preds_train %>% mutate(rmse = map_dbl(predict, calc_rmse)) %>% select(id, rmse) all_split_rmses_test <- all_split_preds_test %>% mutate(rmse = map_dbl(predict, calc_rmse)) %>% select(id, rmse)

How does it look? Here's the RMSE on the training and test sets for the six splits.

all_split_rmses_train # A tibble: 6 x 2 id rmse 1 Slice1 22.2 2 Slice2 20.9 3 Slice3 18.8 4 Slice4 23.5 5 Slice5 22.1 6 Slice6 21.1 all_split_rmses_test # A tibble: 6 x 2 id rmse 1 Slice1 21.6 2 Slice2 20.6 3 Slice3 21.3 4 Slice4 31.4 5 Slice5 35.2 6 Slice6 31.4

Looking at these numbers, we see something interesting: Generalization performance is much better for the first three slices of the time series than for the latter ones. This confirms our impression, stated above, that there seems to be some hidden development going on, rendering forecasting more difficult.

And here are visualizations of the predictions on the respective training and test sets.

First, the training sets:

plot_train <- function(slice, name) { ggplot(slice, aes(x = index, y = value)) + geom_line() + geom_line(aes(y = pred_train1), color = "cyan") + geom_line(aes(y = pred_train50), color = "red") + geom_line(aes(y = pred_train100), color = "green") + geom_line(aes(y = pred_train150), color = "violet") + geom_line(aes(y = pred_train200), color = "cyan") + geom_line(aes(y = pred_train250), color = "red") + geom_line(aes(y = pred_train300), color = "red") + geom_line(aes(y = pred_train350), color = "green") + geom_line(aes(y = pred_train400), color = "cyan") + geom_line(aes(y = pred_train450), color = "red") + geom_line(aes(y = pred_train500), color = "green") + geom_line(aes(y = pred_train550), color = "violet") + geom_line(aes(y = pred_train600), color = "cyan") + geom_line(aes(y = pred_train650), color = "red") + geom_line(aes(y = pred_train700), color = "red") + geom_line(aes(y = pred_train750), color = "green") + ggtitle(name) } train_plots <- map2(all_split_preds_train$predict, all_split_preds_train$id, plot_train) p_body_train <- plot_grid(plotlist = train_plots, ncol = 3) p_title_train <- ggdraw() + draw_label("Backtested Predictions: Training Sets", size = 18, fontface = "bold") plot_grid(p_title_train, p_body_train, ncol = 1, rel_heights = c(0.05, 1, 0.05))

And the test sets:

plot_test <- function(slice, name) { ggplot(slice, aes(x = index, y = value)) + geom_line() + geom_line(aes(y = pred_test1), color = "cyan") + geom_line(aes(y = pred_test50), color = "red") + geom_line(aes(y = pred_test100), color = "green") + geom_line(aes(y = pred_test150), color = "violet") + geom_line(aes(y = pred_test200), color = "cyan") + geom_line(aes(y = pred_test250), color = "red") + geom_line(aes(y = pred_test300), color = "green") + geom_line(aes(y = pred_test350), color = "cyan") + geom_line(aes(y = pred_test400), color = "red") + geom_line(aes(y = pred_test450), color = "green") + geom_line(aes(y = pred_test500), color = "cyan") + geom_line(aes(y = pred_test550), color = "violet") + ggtitle(name) } test_plots <- map2(all_split_preds_test$predict, all_split_preds_test$id, plot_test) p_body_test <- plot_grid(plotlist = test_plots, ncol = 3) p_title_test <- ggdraw() + draw_label("Backtested Predictions: Test Sets", size = 18, fontface = "bold") plot_grid(p_title_test, p_body_test, ncol = 1, rel_heights = c(0.05, 1, 0.05))

This has been a long post, and necessarily will have left a lot of questions open, first and foremost: How do we obtain good settings for the hyperparameters (learning rate, number of epochs, dropout)? How do we choose the length of the hidden state? Or even, can we have an intuition how well LSTM will perform on a given dataset (with its specific characteristics)? We will tackle questions like the above in upcoming posts.


To leave a comment for the author, please follow the link and comment on their blog: TensorFlow for R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Why R 2018 Winners

Mon, 06/25/2018 - 02:00

(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers)

So it’s here… After lots of entries (147 to be precise), we can finally announce the winner of the WhyR 2018 Competition! But first, we have to tell you quickly about how we picked the winner.

How we did it

So it really wasn’t that hard. We held the questionnaire on typeform. Conveniently, my colleague has created a package rtypeform which is an interface to the typeform API. You can install this from CRAN in the usual way

install.packages("rtypeform")
library("rtypeform")

The following code will give you a data frame called q containing a column for each question contained in the WhyR questionnaire.

typeforms = get_all_typeforms()$content
uid = typeforms[typeforms$name == "WhyR", "uid"]
q = get_questionnaire(uid)$completed

Now, obviously, we can't give you access to this, so I've hidden the Jumping Rivers API key. But given you have your own API key assigned to the variable typeform_api, this code will work (remember to change the questionnaire name!). With that, a bit of dplyr and the classic sample(), we can generate a winner for the competition.
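Something along these lines would do it (a sketch only; the column names in q are assumptions, since we can't show the real data):

library("dplyr")

winner <- q %>%
  distinct(name, email) %>%  # hypothetical columns: one row per entrant
  sample_n(1)                # draw one winner at random

winner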

The winners

Drum roll please… the winner is…

Agnieszka Fronczyk!

Congratulations Agnieszka! We’ll see you in Wroclaw!

Commiserations to all our unlucky participants. However, we are sponsoring more data science events to come so keep an eye out here for more competitions!


To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

#19: Intel MKL in Debian / Ubuntu follow-up

Sun, 06/24/2018 - 23:41

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the (very brief) nineteenth post in the ruefully recalcitrant R reflections series of posts, or R4 for short.

About two months ago, in the most recent post in the series, #18, we provided a short tutorial about how to add the Intel Math Kernel Library to a Debian or Ubuntu system thanks to the wonderful apt tool — and the prepackaged binaries by Intel. This made for a simple, reproducible, scriptable, and even reversible (!!) solution—which a few people seem to have appreciated. Good.

In the meantime, more good things happened. Debian maintainer Mo Zhou had posted this ‘intent-to-package’ bug report leading to this git repo on salsa and this set of packages currently in the ‘NEW’ package queue.

So stay tuned: "soon" (for various definitions of "soon") we should be able to get the MKL onto Debian systems directly via apt, without needing Intel's repo. And in a release or two, Ubuntu should catch up. The fastest multithreaded BLAS and LAPACK for everybody, well-integrated and packaged. That said, it is still a monstrously large package so I mostly stick with the (truly open source rather than just 'gratis') OpenBLAS but hey, choice is good. And yes, technically these packages are 'outside' of Debian in the non-free section but they will be visible to almost all default configurations.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Statistics Sunday: Converting Between Effect Sizes for Meta-Analysis

Sun, 06/24/2018 - 15:00

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Converting Between Effect Sizes

I'm currently working on my promised video on mixed effects meta-analysis, and was planning on covering this particular topic in that video – converting between effect sizes. But I decided to do this as a separate post that I can reference in the video, which I hope to post next week.

As a brief refresher, meta-analysis is aimed at estimating the true effect (or effects) in an area of study by combining findings from multiple studies on that topic. Effect sizes, the most frequently used being Cohen’s d, Pearson’s r, and log odds ratio, are estimated from information in study reports and presentations. There’s a lot of variation in how clearly reports and documents describe the findings and the information given to estimate the study’s overall effect. But when you conduct the meta-analysis, whether using fixed, random, or mixed effects analysis, you need to use only one type of effect size. That means that, sometimes, studies will give you a different type of effect size than you plan to use. Fortunately, there are ways to convert between effect sizes and use different types of statistical information to generate your estimates.

First up, converting between those key effect sizes. In the meta-analysis I performed in grad school, I examined the effect of pretrial publicity on guilt. There are two ways guilt was frequently operationalized in the studies: as a guilty/not guilty verdict or as a continuous guilt rating. For those outcomes, we would likely use, respectively, log odds ratio and Cohen’s d. The escalc function in the metafor package can compute log odds ratio for guilty/not guilty counts, and Cohen’s d for mean and standard deviation of the guilt ratings. But studies may use different types of information when presenting their results, so you may not be able to simply compute those effect sizes.

For instance, a study using verdict may present a chi-square and one of its effect sizes, Cramer’s V, which is very similar to a correlation coefficient. How can I convert that into log odds ratio?

To convert from one effect size to the other, you need to follow a prescribed path, which can be seen in the diagram below. What this diagram tells you is which effect sizes you can convert between directly: you can directly convert between log odds ratio and Cohen’s d, and between Cohen’s d and Pearson’s r. If you wanted to convert between Pearson’s r and log odds ratio, you’ll first need to convert to Cohen’s d. You’ll need to do the same thing for variance – compute it for the native effect size metric, then convert that to the new effect size metric.

Let’s start by setting up functions that will convert between our effect sizes for us, beginning with Cohen’s d and log odds ratio. Then we’ll demonstrate with some real data.

#Convert log odds ratio to d
ltod <- function(lor) {
d = lor * (sqrt(3)/pi)
return(d)
}
vltovd <- function(vl) {
vd = vl * (3/pi^2)
return(vd)
}

#Convert d to log odds ratio
dtol <- function(d) {
lor = d*(pi/sqrt(3))
return(lor)
}
vdtovl <- function(vd) {
vl = vd*(pi^2/3)
return(vl)
}

You’ll notice a mathematical symmetry in these equations – the numerators and denominators switch between the equations. Now let’s set up equations to r and d. These equations are slightly more complex and will require a few additional arguments. For instance, converting the variance of r to variance of d requires both the variance of r and r itself. Converting from d to r requires group sample sizes, referred to as n1 and n2.

#Convert r to d
rtod <- function(r) {
d = (2*r)/(sqrt(1-r^2))
return(d)
}
vrtovd <- function(vr,r) {
vd = (4*vr)/(1-r^2)^3
return(vd)
}

#Convert d to r
dtor <- function(n1,n2,d) {
a = (n1+n2)^2/(n1*n2)
r = d/(sqrt(d^2+a))
return(r)
}
vdtovr <- function(n1,n2,vd,d) {
a = (n1+n2)^2/(n1*n2)
vr = a^2*vd/(d^2+a)^3
return(vr)
}

Remember that the metafor package can compute effect sizes and variances for you, so you might want to run escalc() on the native effect sizes so that you have the estimates and variances you need to run these functions. But if you ever find yourself having to compute those variances by hand, here are the equations, which we'll use in the next step.

vard <- function(n1,n2,d) {
vd = ((n1+n2)/(n1*n2)) + (d^2/(2*(n1+n2)))
return(vd)
}

varr <- function(r,n) {
vr = (1-r^2)^2/(n-1)
return(vr)
}

varlor <- function(a,b,c,d) {
vl = (1/a)+(1/b)+(1/c)+(1/d)
return(vl)
}

One of the studies I included in my meta-analysis gave Cramer’s V. It had a sample size of 42, with 21 people in each group. I’d like to convert that effect size to log odds ratio. Here’s how I could do it.

cramerv <- 0.67
studyd <- rtod(cramerv)
studyvr <- varr(0.67,42)
studyvd <- vrtovd(studyvr,cramerv)
dtol(studyd)
## [1] 3.274001
vdtovl(studyvd)
## [1] 0.5824038

I can now include this study in my meta-analysis of log odds ratios.

What if my study gives different information? For instance, it might have given me a chi-square or a t-value. This online effect size calculator, created by David Wilson, coauthor of Practical Meta-Analysis, can compute effect sizes for you from many different types of information. In fact, spoiler alert: I used an earlier version of this calculator extensively for my meta-analysis. Note that this calculator returns odds ratios, so you’ll need to convert those values into a log odds ratio.
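That last conversion is just a log transform. For example, if the calculator reported an odds ratio of 3 with a 95% confidence interval of (1.5, 6) (made-up numbers, purely for illustration):

or <- 3
lor <- log(or)
lor

# the variance can be recovered from the confidence limits on the log scale:
# SE = (log(upper) - log(lower)) / (2 * 1.96), and variance = SE^2
se_lor <- (log(6) - log(1.5)) / (2 * 1.96)
vl <- se_lor^2
vl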


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Let R/Python send messages when the algorithms are done training

Sun, 06/24/2018 - 11:53

(This article was first published on Rbloggers – The Analytics Lab, and kindly contributed to R-bloggers)

As Data Scientists, we often train complex algorithms in order to tackle certain business problems and generate value. These algorithms, however, can take a while to train. Sometimes they take a couple of hours, hours which I’m not going to spend just sitting and waiting. But regularly checking whether the training is done, is also not the most efficient way.

Now I started to use Telegram to send me notifications from R and Python to let me know when training is done. Furthermore, I’m also using it for example to send me notifications when pipelines / ETLs fail, which allows me to repair them as soon as they fail.

It’s really easy, so I thought I’ll share my code!

First, after you've installed Telegram, search for the BotFather, which is a bot from the app itself. When you text /newbot and follow the instructions, it will create your first bot and give you a token. Copy this!

Next step is to find the id to send messages to. Find your bot in Telegram and say something. Then, go to your browser and open https://api.telegram.org/bot<your_bot_token>/getUpdates (paste the token you just copied right after "bot"), where the response should show you your chat id.
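If you prefer to stay in R for this step, the same call can be made with httr (a sketch; it assumes your token is stored in the variable bot_token and that you've already sent your bot a message):

library(httr)
library(jsonlite)

# ask the Bot API for recent updates and dig out the chat id
resp <- GET(paste0("https://api.telegram.org/bot", bot_token, "/getUpdates"))
updates <- fromJSON(content(resp, as = "text"))
updates$result$message$chat$id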

Finally install the necessary packages for R [install.packages("telegram")] and / or Python [pip install telegram]. And you're ready!

For R, use the following function:

send_telegram_message <- function(text, chat_id, bot_token){
  require(telegram)
  bot <- TGBot$new(token = bot_token)
  bot$sendMessage(text = text, chat_id = chat_id)
}
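A typical usage pattern then looks like this (a sketch: the id, token and train_model() are placeholders for your own values and your own long-running training call):

chat_id   <- "123456789"      # your chat id
bot_token <- "YOUR_BOT_TOKEN" # the token from the BotFather

result <- tryCatch({
  fit <- train_model()        # placeholder for a long-running training job
  send_telegram_message("Training finished!", chat_id, bot_token)
  fit
}, error = function(e) {
  # notify immediately if the job fails, e.g. in a pipeline / ETL context
  send_telegram_message(paste("Training failed:", conditionMessage(e)), chat_id, bot_token)
  NULL
})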

And this one for Python:

def send_telegram_message(text, chat_id, bot_token):
import telegram
bot = telegram.Bot(token=bot_token)
bot.send_message(chat_id=chat_id, text = text )


To leave a comment for the author, please follow the link and comment on their blog: Rbloggers – The Analytics Lab. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Forecasting my weight with R

Sun, 06/24/2018 - 02:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

I’ve been measuring my weight almost daily for almost 2 years now; I actually started earlier, but
not as consistently. The goal of this blog post is to get re-acquaiented with time series; I haven’t
had the opportunity to work with time series for a long time now and I have seen that quite a few
packages that deal with time series have been released on CRAN. In this blog post, I will explore
my weight measurements using some functions from the {tsibble} and {tibbletime} packages,
and then do some predictions with the {forecast} package.

First, let’s load the needed packages, read in the data and convert it to a tsibble:

library("tidyverse")
library("readr")
library("forecast")
library("tsibble")
library("tibbletime")
library("mice")

weight <- read_csv("https://gist.githubusercontent.com/b-rodrigues/ea60679135f8dbed448ccf66a216811f/raw/18b469f3b0720f76ce5ee2715d0f9574b615f170/gistfile1.txt") %>%
  as_tsibble()
## Parsed with column specification:
## cols(
##   Date = col_date(format = ""),
##   Poids = col_double()
## )
## The `index` is `Date`.

You can read more about {tsibble} here. Here, I use {tsibble} mostly
for the next step, which is using the function fill_na() on the tsibble. fill_na() turns
implicit missing values into explicit missing values. These are implicit missing values:

        Date Poids
1 2013-01-01 84.10
2 2013-01-04 85.60

and this is the same view, but with explicit missing values:

        Date Poids
1 2013-01-01 84.10
2 2013-01-02    NA
3 2013-01-03    NA
4 2013-01-04 85.60

This is useful to do, because I want to impute the missing values using the {mice} package.
Let’s do this:

weight <- weight %>% fill_na()

imp_weight <- mice(data = weight) %>%
  mice::complete("long")
##
##  iter imp variable
##   1   1  Poids
##   1   2  Poids
##   1   3  Poids
##   1   4  Poids
##   1   5  Poids
##   2   1  Poids
##   2   2  Poids
##   2   3  Poids
##   2   4  Poids
##   2   5  Poids
##   3   1  Poids
##   3   2  Poids
##   3   3  Poids
##   3   4  Poids
##   3   5  Poids
##   4   1  Poids
##   4   2  Poids
##   4   3  Poids
##   4   4  Poids
##   4   5  Poids
##   5   1  Poids
##   5   2  Poids
##   5   3  Poids
##   5   4  Poids
##   5   5  Poids

Let’s take a look at imp_weight:

head(imp_weight) ## .imp .id Date Poids ## 1 1 1 2013-10-28 84.1 ## 2 1 2 2013-10-29 84.4 ## 3 1 3 2013-10-30 83.5 ## 4 1 4 2013-10-31 84.1 ## 5 1 5 2013-11-01 85.6 ## 6 1 6 2013-11-02 85.2

Let’s select the relevant data. I filter from the 11th of July 2016, which is where I started
weighing myself almost every day, to the 31st of May 2018. I want to predict my weight for the
month of June (you might think of the month of June 2018 as the test data, and the rest as training
data):

imp_weight_train <- imp_weight %>% filter(Date >= "2016-07-11", Date <= "2018-05-31")

In the next lines, I create a column called imputation which is simply the same as the column
.imp but of character class, remove unneeded columns and rename some other columns (“Poids” is
French for weight):

imp_weight_train <- imp_weight_train %>% mutate(imputation = as.character(.imp)) %>% select(-.id, -.imp) %>% rename(date = Date) %>% rename(weight = Poids)

Let’s take a look at the data:

ggplot(imp_weight_train, aes(date, weight, colour = imputation)) + geom_line() + theme(legend.position = "bottom")

This plot gives some info, but it might be better to smooth the lines. This is possible by
computing a rolling mean. For this I will use the rollify() function of the {tibbletime} package:

mean_roll_5 <- rollify(mean, window = 5)
mean_roll_10 <- rollify(mean, window = 10)
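To see what such a rollified function returns, here is a quick toy example; the first window - 1 values should come back as NA, since a full window is not yet available:

mean_roll_5(1:10)
## [1] NA NA NA NA  3  4  5  6  7  8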

rollify() can be seen as an adverb, pretty much like purrr::safely(); it is a higher-order function that literally "rollifies" a function, in this case mean(), which means that rollifying the mean creates a function that returns the rolling mean. The window argument lets you decide how smooth you want the curve to be: the higher, the smoother. However, you will lose some observations at the start of each series. Let's use these functions to add the rolling means to the data frame:

imp_weight_train <- imp_weight_train %>% group_by(imputation) %>% mutate(roll_5 = mean_roll_5(weight), roll_10 = mean_roll_10(weight))

Now, let’s plot these new curves:

ggplot(imp_weight_train, aes(date, roll_5, colour = imputation)) + geom_line() + theme(legend.position = "bottom") ## Warning: Removed 20 rows containing missing values (geom_path).

ggplot(imp_weight_train, aes(date, roll_10, colour = imputation)) + geom_line() + theme(legend.position = "bottom") ## Warning: Removed 45 rows containing missing values (geom_path).

That’s easier to read, isn’t it?

Now, I will use the auto.arima() function to train a model on the data to forecast my weight for
the month of June. However, my data, imp_weight_train is a list of datasets. auto.arima() does
not take a data frame as an argument, much less so a list of datasets. I’ll create a wrapper around
auto.arima() that works on a dataset, and then map it to the list of datasets:

auto.arima.df <- function(data, y, ...){
  y <- enquo(y)
  yts <- data %>%
    pull(!!y) %>%
    as.ts()
  auto.arima(yts, ...)
}

auto.arima.df() takes a data frame as argument, and then y, which is the column that contains the
univariate time series. This column then gets pulled out of the data frame, converted to a time
series object with as.ts(), and then passed down to auto.arima(). I can now use this function
on my list of data sets. The first step is to nest the data:

nested_data <- imp_weight_train %>% group_by(imputation) %>% nest()

Let’s take a look at nested_data:

nested_data ## # A tibble: 5 x 2 ## imputation data ## ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5

nested_data is a tibble with a column called data, which is a so-called list-column. Each
element of data is itself a tibble. This is a useful structure, because now I can map auto.arima.df()
to the data frame:

models <- nested_data %>% mutate(model = map(data, auto.arima.df, y = weight))

This trick can be a bit difficult to follow the first time you see it. The idea is the following:
nested_data is a tibble. Thus, I can add a column to it using mutate(). So far so good.
Now that I am “inside” the mutate call, I can use purrr::map(). Why? purrr::map() takes a list
and then a function as arguments. Remember that data is a list column; you can see it above,
the type of the column data is list. So data is a list, and thus can be used inside purrr::map().
Great. Now, what is inside data? tibbles, where inside each of them is a column
called weight. This is the column that contains my univariate time series I want to model. Let’s
take a look at models:

models ## # A tibble: 5 x 3 ## imputation data model ## ## 1 1 ## 2 2 ## 3 3 ## 4 4 ## 5 5

models is a tibble with a column called model, where each element is a model of type ARIMA.

Adding forecasts is based on the same trick as above, and we use the forecast() function:

forecasts <- models %>% mutate(predictions = map(model, forecast, h = 24)) %>% mutate(predictions = map(predictions, as_tibble)) %>% pull(predictions)

I forecast 24 days (I am writing this on the 24th of June), and convert the predictions to tibbles,
and then pull only the predictions tibble:

forecasts ## [[1]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * ## 1 71.5 70.7 72.3 70.2 72.8 ## 2 71.5 70.7 72.4 70.3 72.8 ## 3 71.5 70.6 72.3 70.1 72.8 ## 4 71.5 70.6 72.4 70.1 72.9 ## 5 71.4 70.5 72.4 70.0 72.9 ## 6 71.5 70.5 72.4 70.0 72.9 ## 7 71.4 70.5 72.4 69.9 72.9 ## 8 71.4 70.4 72.4 69.9 72.9 ## 9 71.4 70.4 72.4 69.9 72.9 ## 10 71.4 70.4 72.4 69.8 73.0 ## # ... with 14 more rows ## ## [[2]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * ## 1 71.6 70.8 72.3 70.3 72.8 ## 2 71.6 70.8 72.5 70.3 72.9 ## 3 71.5 70.6 72.4 70.2 72.9 ## 4 71.5 70.6 72.5 70.1 72.9 ## 5 71.5 70.5 72.5 70.0 73.0 ## 6 71.5 70.5 72.5 70.0 73.0 ## 7 71.5 70.5 72.5 69.9 73.0 ## 8 71.5 70.4 72.5 69.9 73.1 ## 9 71.5 70.4 72.5 69.8 73.1 ## 10 71.4 70.3 72.6 69.7 73.1 ## # ... with 14 more rows ## ## [[3]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * ## 1 71.6 70.8 72.4 70.4 72.8 ## 2 71.5 70.7 72.4 70.2 72.8 ## 3 71.5 70.6 72.4 70.2 72.9 ## 4 71.5 70.6 72.4 70.1 72.9 ## 5 71.5 70.5 72.4 70.0 72.9 ## 6 71.5 70.5 72.4 70.0 73.0 ## 7 71.5 70.5 72.5 69.9 73.0 ## 8 71.4 70.4 72.5 69.9 73.0 ## 9 71.4 70.4 72.5 69.8 73.0 ## 10 71.4 70.4 72.5 69.8 73.1 ## # ... with 14 more rows ## ## [[4]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * ## 1 71.5 70.8 72.3 70.3 72.8 ## 2 71.5 70.7 72.4 70.3 72.8 ## 3 71.5 70.7 72.4 70.2 72.8 ## 4 71.5 70.6 72.4 70.1 72.9 ## 5 71.5 70.6 72.4 70.1 72.9 ## 6 71.5 70.5 72.5 70.0 73.0 ## 7 71.5 70.5 72.5 69.9 73.0 ## 8 71.5 70.4 72.5 69.9 73.0 ## 9 71.4 70.4 72.5 69.8 73.1 ## 10 71.4 70.3 72.5 69.8 73.1 ## # ... with 14 more rows ## ## [[5]] ## # A tibble: 24 x 5 ## `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95` ## * ## 1 71.5 70.8 72.3 70.3 72.8 ## 2 71.5 70.7 72.4 70.3 72.8 ## 3 71.5 70.7 72.4 70.2 72.8 ## 4 71.5 70.6 72.4 70.1 72.9 ## 5 71.5 70.6 72.4 70.1 72.9 ## 6 71.5 70.5 72.4 70.0 73.0 ## 7 71.5 70.5 72.5 69.9 73.0 ## 8 71.5 70.4 72.5 69.9 73.0 ## 9 71.4 70.4 72.5 69.8 73.1 ## 10 71.4 70.3 72.5 69.8 73.1 ## # ... with 14 more rows

So forecasts is a list of tibbles, each containing a forecast. Remember that I have 5 tibbles, because I imputed the data 5 times. I will merge this list of data sets together into one, but before that I need to add a column that indexes the forecasts:

forecasts <- map2(.x = forecasts, .y = as.character(seq(1, 5)), ~mutate(.x, id = .y)) %>%
  bind_rows() %>%
  select(-c(`Lo 80`, `Hi 80`))

colnames(forecasts) <- c("point_forecast", "low_95", "hi_95", "id")

Let’s take a look again at forecasts:

forecasts ## # A tibble: 120 x 4 ## point_forecast low_95 hi_95 id ## ## 1 71.5 70.2 72.8 1 ## 2 71.5 70.3 72.8 1 ## 3 71.5 70.1 72.8 1 ## 4 71.5 70.1 72.9 1 ## 5 71.4 70.0 72.9 1 ## 6 71.5 70.0 72.9 1 ## 7 71.4 69.9 72.9 1 ## 8 71.4 69.9 72.9 1 ## 9 71.4 69.9 72.9 1 ## 10 71.4 69.8 73.0 1 ## # ... with 110 more rows

I now select the true values for the month of June. I also imputed this data, but here I will
simply keep the average of the imputations:

weight_june <- imp_weight %>% filter(Date >= "2018-06-01") %>% select(-.id) %>% group_by(Date) %>% summarise(true_weight = mean(Poids)) %>% rename(date = Date)

Let’s take a look at weight_june:

weight_june ## # A tibble: 24 x 2 ## date true_weight ## ## 1 2018-06-01 71.8 ## 2 2018-06-02 70.8 ## 3 2018-06-03 71.2 ## 4 2018-06-04 71.4 ## 5 2018-06-05 70.9 ## 6 2018-06-06 70.8 ## 7 2018-06-07 70.5 ## 8 2018-06-08 70.1 ## 9 2018-06-09 70.3 ## 10 2018-06-10 71.0 ## # ... with 14 more rows

Let’s repeat weight_june 5 times, and add the index 1 to 5. Why? Because I want to merge the
true data with the forecasts, and having the data in this form makes things easier:

weight_june <- modify(list_along(1:5), ~`<-`(., weight_june)) %>% map2(.y = as.character(seq(1, 5)), ~mutate(.x, id = .y)) %>% bind_rows()

The first line:

modify(list_along(1:5), ~`<-`(., weight_june))

looks quite complicated, but you will see that it is not, once we break it apart. modify()
modifies a list. The list to modify is list_along(1:5), which create a list of NULLs:

list_along(1:5)
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
##
## [[5]]
## NULL

The second argument of modify() is either a function or a formula. I created the following
formula:

~`<-`(., weight_june)

We all know the function <-(), but are not used to see it that way. But consider the following:

a <- 3
`<-`(a, 3)

These two formulations are equivalent. So these lines fill the empty element of the list of NULLs
with the data frame weight_june. Then I add the id column and then bind the rows together: bind_rows().
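As an aside, the same list of five copies could be built more directly with map() alone; the modify()/list_along() route above is just one way of doing it:

# five copies of weight_june in a list, ready for the same map2()/bind_rows() steps
weight_june_copies <- map(1:5, ~ weight_june)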

Let’s bind the columns of weight_june and forecasts and take a look at it:

forecasts <- bind_cols(weight_june, forecasts) %>% select(-id1) forecasts ## # A tibble: 120 x 6 ## date true_weight id point_forecast low_95 hi_95 ## ## 1 2018-06-01 71.8 1 71.5 70.2 72.8 ## 2 2018-06-02 70.8 1 71.5 70.3 72.8 ## 3 2018-06-03 71.2 1 71.5 70.1 72.8 ## 4 2018-06-04 71.4 1 71.5 70.1 72.9 ## 5 2018-06-05 70.9 1 71.4 70.0 72.9 ## 6 2018-06-06 70.8 1 71.5 70.0 72.9 ## 7 2018-06-07 70.5 1 71.4 69.9 72.9 ## 8 2018-06-08 70.1 1 71.4 69.9 72.9 ## 9 2018-06-09 70.3 1 71.4 69.9 72.9 ## 10 2018-06-10 71.0 1 71.4 69.8 73.0 ## # ... with 110 more rows

Now, for the last plot:

ggplot(forecasts, aes(x = date, colour = id)) + geom_line(aes(y = true_weight), size = 2) + geom_line(aes(y = hi_95)) + geom_line(aes(y = low_95)) + theme(legend.position = "bottom")

The true data fall within all the confidence intervals, but I am a bit surprised by the intervals, especially the upper confidence intervals; they are all well above 72kg, whereas my true weight has been fluctuating around 71kg for quite some months now. I think I have to refresh my memory on time series, because I am certainly missing something!

If you found this blog post useful, you might want to follow me on twitter
for blog post updates.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
