The tidy caret interface in R
(This article was first published on poissonisfish, and kindly contributed to Rbloggers)
Can human experience and chemoinformatics unravel the chemistry of smell? We are about to find out in the new world of caret [source]Among most popular offtheshelf machine learning packages available to R, caret ought to stand out for its consistency. It reaches out to a wide range of dependencies that deploy and support model building using a uniform, simple syntax. I have been using caret extensively for the past three years, with a precious partial least squares (PLS) tutorial in my track record.
A couple of years ago, the creator and maintainer of caret Max Kuhn joined RStudio where he has contributing new packages to the ongoing tidyparanoia – the supporting recipes, yardstick, rsample and many other packages that are part of the tidyverse paradigm and I knew little about. As it happens, caret is now best used with some of these. As an aspiring data scientist with fashionable hexstickers on my laptop cover and a tendency to start any sentence with ‘big data’, I set to learn tidyverse and going Super Mario using pipes (%>%, Ctrl + Shift + M).
Overall, I found the ‘tidy’ approach quite enjoyable and efficient. Besides streamlining model design in a sequential manner, recipes and the accompanying ‘tidy’ packages fix a common problem in the statistical learning realm that curtails predictive accuracy, data leakage. You can now generate a ‘recipe’ that lays out the plan for all the feature engineering (or data preprocessing, depending on how many hexstickers you have) and execute it separately for validation and resampling splits (i.e. split first, process later), thus providing a more robust validation scheme. It also enables recycling any previous computations afterwards.
I drew a lot of inspiration from a webinar and a presentation from Max himself to come up with this tutorial. Max is now handson writing a new book, entitled Feature Engineering and Selection: A Practical Approach for Predictive Models that will include applications of caret supported by recipes. I very much look forward to read it considering he did a fantastic job with Applied Predictive Modeling.
The aim here is to test this new caret framework on the DREAM Olfaction Prediction Challenge, a crowdsourced initiative launched in January 2015. This challenge demonstrated predictive models can identify perceptual attributes (e.g. coffeelike, spicy and sweet notes) in a collection of fragrant compounds from physicochemical features (e.g. atom types). Results and conclusions were condensed into a research article [1] that reports an impressive predictive accuracy for a large number of models and teams, over various perceptive attributes. Perhaps more impressive yet, is the fact it anticipates the very scarce knowledge of human odour receptors and their molecular activation mechanisms. Our task, more concretely, will be to predict populationlevel pleasantness of hundreds of compounds from physicochemical features, using the training set of the competition.
Finally, a short announcement. Many people have been raising reproducibility issues with previous entries, due to either package archiving (e.g. GenABEL, for genetic mapping) or crossincompatibility (e.g. caret, for PLS). There were easy fixes for the most part, but I have nevertheless decided to start packing scripts together with sessionInfo() reports in unique GitHub repositories. The bundle for this tutorial is available under https://github.com/monogenea/olfactionPrediction. I hope this will help track down the exact package versions and system configurations I use, so that anyone can anytime fully reproduce these results. No hexstickers required!
NOTE that most code chunks containing pipes are corrupted. PLEASE refer to the materials from the repo.
The DREAM Olfaction Prediction ChallengeThe DREAM Olfaction Prediction Challenge training set consists of 338 compounds characterised by 4,884 physicochemical features (the design matrix), and profiled by 49 subjects with respect to 19 semantic descriptors, such as ‘flower’, ‘sweet’ and ‘garlic’, together with intensity and pleasantness (all perceptual attributes). Two different dilutions were used per compound. The perceptual attributes were given scores in the 0100 range. The two major goals of the competition were to use the physicochemical features in modelling i) the perceptual attributes at the subjectlevel and ii) both the mean and standard deviation of the perceptual attributes at the populationlevel. How ingenious it was to model the standard deviation, the variability in how something tastes to different people! In the end, the organisers garnered important insights about the biochemical basis of flavour, from teams all over the world. The resulting models were additionally used to reverseengineer nonexistent compounds from target sensory profiles – an economically exciting prospect.
According to the publication, the topfive best predicted perceptual attributes at the populationlevel (i.e. mean values across all 49 subjects) were ‘garlic’, ‘fish’, ‘intensity’, pleasantness and ‘sweet’ (cf. Fig. 3A in [1]), all exhibiting average prediction correlations of , far greater than the average subjectlevel predictions (cf. Fig. 2D in [1]). This should come as no surprise since averaging over subjects is more likely to flesh out universal mechanisms. This averaging also helps stabilising the subjectlevel variance – no subject will necessarily utilise the full 0100 scoring range; some will tend to use narrow intervals (low variance) while some will use the whole range instead (high variance).
Since pleasantness is one of the most general olfactory properties, I propose modelling the populationlevel median pleasantness of all compounds, from all physicochemical features. The median, as opposed to the mean, is less sensitive to outliers and an interesting departure from the original approach we can later compare against.
Let’s get started with RWe can start off by loading all the necessary packages and pulling the data directly from the DREAM Olfaction Prediction GitHub repository. More specifically, we will create a directory called data and import both the profiled perceptual attributes (TrainSet.txt) and the physicochemical features that comprise our design matrix (molecular_descriptors_data.txt). Because we are tidy people, we will use readr to parse these as tabseparated values (TSV). I also suggest rewriting the column name Compound Identifier in the sensory profiles into CID, to match with that from the design matrix molFeats.
# Mon Oct 29 13:17:55 2018  ## DREAM olfaction prediction challenge library(caret) library(rsample) library(tidyverse) library(recipes) library(magrittr) library(doMC) # Create directory and download files dir.create("data/") ghurl < "https://github.com/dreamolfaction/olfactionprediction/raw/master/data/" download.file(paste0(ghurl, "TrainSet.txt"), destfile = "data/trainSet.txt") download.file(paste0(ghurl, "molecular_descriptors_data.txt"), destfile = "data/molFeats.txt") # Read files with readr, select least diluted compounds responses % rename(CID = `Compound Identifier`) molFeats < read_tsv("data/molFeats.txt") # molecular featuresNext, we filter compounds that are common to both datasets and reorder them accordingly. The profiled perceptual attributes span a total of 49 subjects, two different dilutions and some replications (cf. Fig. 1A in [1]). Determine the median pleasantness (i.e. VALENCE/PLEASANTNESS) per compound, across all subjects and dilutions while ignoring missingness with na.rm = T. The last line will ensure the order of the compounds is identical between this new dataset and the design matrix, so that we can subsequently introduce populationlevel pleasantness as a new column termed Y in the new design matrix X. We no longer need CID so we can discard it.
Regarding missingness, I found maxima of 0.006% and 23% NA content over all columns and rows, respectively. A couple of compounds could be excluded but I would move on.
# Determine intersection of compounds in features and responses commonMols % dplyr::summarise(pleasantness = median(`VALENCE/PLEASANTNESS`, na.rm = T)) all(medianPlsnt$CID == molFeats$CID) # TRUE  rownames match # Concatenate predictors (molFeats) and population pleasantness X % select(CID)In examining the structure of the design matrix, you will see there are many skewed binary variables. In the most extreme case, if a binary predictor is either all zeros or all ones throughout it is said to be a zerovariance predictor that if anything, will harm the model training process. There is still the risk of having nearzerovariance predictors, which can be controlled based on various criteria (e.g. number of samples, size of splits). Of course, this can also impact quantitative, continuous variables. Here we will use nearZeroVar from caret to identify faulty predictors that are either zerovariance or display values whose frequency exceeds a predefined threshold. Having a 5x repeated 5fold crossvalidation in mind, I suggest setting freqCut = 4, which will rule out i) binary variables with , and ii) continuous variables with . See ?nearZeroVar for more information. From 4,870 features we are left with 2,554 – a massive reduction by almost a half.
# Filter nzv X < X[,nearZeroVar(X, freqCut = 4)] # == 80/20Now we will use rsample to define the traintest and crossvalidation splits. Please use set.seed as it is, to ensure you will generate the same partitions and arrive to the same results. Here I have allocated 10% of the data to the test set; by additionally requesting stratification based on the target Y, we are sure to have a representative subset. We can next pass the new objects to training and testing to effectively split the design matrix into train and test sets, respectively. The following steps consist of setting up a 5x repeated 5fold crossvalidation for the training set. Use vfold_cv and convert the output to a caretlike object via rsample2caret. Next, initialise a trainControl object and overwrite index and indexOut using the corresponding elements in the newly converted vfold_cv output.
# Split train/test with rsample set.seed(100) initSplit < initial_split(X, prop = .9, strata = "Y") trainSet < training(initSplit) testSet < testing(initSplit) # Create 5fold crossvalidation, convert to caret class set.seed(100) myFolds % rsample2caret() ctrl < trainControl(method = "cv", selectionFunction = "oneSE") ctrl$index < myFolds$index ctrl$indexOut < myFolds$indexOutPrior to modelling, I will create two reference variables – binVars, to identify binary variables, and missingVars, to identify any variables containing missing values. These will help with i) excluding binary variables from meancentering and unitvariance scaling, and ii) restricting meanbased imputation to variables that do contain missing values, respectively.
# binary vars binVars < which(sapply(X, function(x){all(x %in% 0:1)})) missingVars < which(apply(X, 2, function(k){any(is.na(k))}))Recipes are very simple in design. The call to recipe is first given the data it is supposed to be applied on, as well as a lmstyle formula. The recipes package contains a family of functions with a step_... prefix, involved in encodings, date conversions, filters, imputation, normalisation, dimension reduction and a lot more. These can be linked using pipes, forming a logic sequence of operations that starts with the recipe call. Supporting functions such as all_predictors, all_outcomes and all_nominal delineate specific groups of variables at any stage in the sequence. One can also use the names of the variables, akin to my usage of binVars and missingVars.
I propose writing a basic recipe myRep that does the following:
 YeoJohnson [2] transformation of all quantitative predictors;
 Meancenter all quantitative predictors;
 Unitvariance scale all quantitative predictors;
 Impute missing values with the mean of the incident variables.
In brief, the YeoJohnson procedure transforms variables to be distributed as Gaussianlike as possible. Before delving into the models let us tweak the recipe and assign it to pcaRep, conduct a principal component analysis and examine how different compounds distribute along the first two principal components. Colour them based on their populationlevel pleasantness – red for very pleasant, blue for very unpleasant and the gradient in between, via colGrad.
# simple PCA, plot pcaRec % step_pca(all_predictors()) myPCA % juice() colGrad < trainSet$Y/100 # add color plot(myPCA$PC1, myPCA$PC2, col = rgb(1  colGrad, 0, colGrad,.5), pch = 16, xlab = "PC1", ylab = "PC2") legend("topleft", pch = 16, col = rgb(c(0,.5,1), 0, c(1,.5,0), alpha = .5), legend = c("Pleasant", "Neutral", "Unpleasant"), bty = "n")The compounds do not seem to cluster into groups, nor do they clearly separate with respect to pleasantness.
Note that pcaRep itself is just a recipe on wait. Except for recipes passed to caret::train, to process and extract the data as instructed you need to either ‘bake’ or ‘juice’ the recipe. The difference between the two is that ‘juicing’ outputs the data associated to the recipe (e.g. the training set), whereas ‘baking’ can be applied to process any other dataset. ‘Baking’ is done with bake, provided the recipe was trained using prep. I hope this explains why I used juice above!
Next we have the model training. I propose training five regression models – knearest neighbours (KNN), elastic net (ENET), support vector machine with a radial basis function kernel (SVM), random forests (RF), extreme gradient boosting (XGB) and Quinlan’s Cubist (CUB) – aiming at minimising the root mean squared error (RMSE). Note I am using tuneGrid and tuneLength interchangeably. I rather supply predefined hyperparameters to simple models I am more familiar with, and run the rest with a number of defaults via tuneLength.
Parallelise the computations if possible. With macOS, I can use registerDoMC from the doMC package to harness multithreading of the training procedure. If you are running a different OS, please use a different library if necessary. To the best of my knowledge, doMC will also work in Linux but Windows users might need to use the doParallel package instead.
# Train doMC::registerDoMC(10) knnMod < train(myRec, data = trainSet, method = "knn", tuneGrid = data.frame(k = seq(5, 25, by = 4)), trControl = ctrl) enetMod < train(myRec, data = trainSet, method = "glmnet", tuneGrid = expand.grid(alpha = seq(0, 1, length.out = 5), lambda = seq(.5, 2, length.out = 5)), trControl = ctrl) svmMod < train(myRec, data = trainSet, method = "svmRadial", tuneLength = 8, trControl = ctrl) rfMod < train(myRec, data = trainSet, method = "ranger", tuneLength = 8, num.trees = 5000, trControl = ctrl) xgbMod < train(myRec, data = trainSet, method = "xgbTree", tuneLength = 8, trControl = ctrl) cubMod < train(myRec, data = trainSet, method = "cubist", tuneLength = 8, trControl = ctrl) modelList < list("KNN" = knnMod, "ENET" = enetMod, "SVM" = svmMod, "RF" = rfMod, "XGB" = xgbMod, "CUB" = cubMod)Once the training is over, we can compare the optimal five crossvalidated models based on their RMSE across all resamples, using some bwplot magic sponsored by lattice. In my case the runtime was unusually long (>2 hours) compared to previous releases.
bwplot(resamples(modelList), metric = "RMSE")The crossvalidated RSME does not vary considerably across the six optimal models. To conclude, I propose creating a model ensemble by simply taking the average predictions of all six optimal models on the test set, to compare observed and predicted populationlevel pleasantness in this holdout subset of compounds.
# Validate on test set with ensemble allPreds < sapply(modelList, predict, newdata = testSet) ensemblePred < rowSums(allPreds) / length(modelList) # Plot predicted vs. observed; create PNG plot(ensemblePred, testSet$Y, xlim = c(0,100), ylim = c(0,100), xlab = "Predicted", ylab = "Observed", pch = 16, col = rgb(0, 0, 0, .25)) abline(a=0, b=1) writeLines(capture.output(sessionInfo()), "sessionInfo")The ensemble fit on the test set is not terrible – slightly underfit, predicting the poorest in the two extremes of the pleasantness range. All in all, it attained a prediction correlation of , which is slightly larger than the mean reported (cf. Fig. 3A in [1]). However, note that there are only 31 compounds in the test set. A model like this could be easily moved into production stage or repurposed to solve a molecular structure from a sensory profile of interest, as mentioned earlier.
WrapupThe revamped caret framework is still under development, but it seems to offer
 A substantially expanded, unified syntax for models and utils. Keep an eye on textrecipes, an upcoming complementary package for processing textual data;
 A sequential streamlining of extraction, processing and modelling, enabling recycling of previous computations;
 Executingaftersplitting, thereby precluding leaky validation strategies.
On the other hand, I hope to see improvements in
 Decoupling from proprietary frameworks in the likes of tidyverse, although there are alternatives;
 The runtime. I suspect the recipes themselves are the culprit, not the models. Future updates will certainly fix this.
Regarding the DREAM Olfaction Prediction Challenge, we were able to predict populationlevel perceived pleasantness from physicochemical features with an accuracy as good as that achieved, in average, by the competing teams. Our model ensemble seems underfit, judging from the narrow range of predicted values. Maybe we could be less restrictive about the nearzerovariance predictors and use a repeated 3fold crossvalidation.
I hope you had as much fun as I did in dissecting a fascinating problem while unveiling a bit of the new caret. You can use this code as a template for exploring different recipes and models. If you encounter any problems or have any questions do not hesitate to post a comment underneath – we all learn from each other!
References[1] Keller, Andreas et al. (2017). Predicting human olfactory perception from chemical features of odor molecules. Science, 355(6327), 820826.
[2] Yeo, InKwon and Johnson, Richard (2000). A new family of power transformations to improve normality or symmetry. Biometrika, 87(4), 954959.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: poissonisfish. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Report from the Enterprise Applications of the R Language conference
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
Mango Solutions presented the EARL conference in Seattle this November and I was fortunate enough to have time to attend. During the conference I took notes on the presentations, which I’ll pass along to you.
Thoughts on the future of R in industryThe EARL conference started with a panel discussion on R. Moderated by Doug Ashton from Mango Solutions, the panel included Julia Silge with Stack Overflow, David Smith with Microsoft, and Joe Cheng with Rstudio.
Topics ranged from the tidyverse, R certification, R as a general purpose language, Shiny, and the future of R. I captured some interesting quotes from the panel members:
Regarding R certification, David Smith points out that certifications are important for big companies trying to filter employment applications. He mentions that certification is a minimum bar for some HR departments. Julia Silge mentions that Stack Overflow has found certifications to be less influential in the hiring process.
R as a general purpose language: Joe Cheng feels that R is useful for more than just statistics, but that Rstudio isn’t interested in developing general purpose tools. There was discussion around Python as the “second best” language for a lot of applications, and an agreement that R should remain focused on data science.
Most interesting was the discussion regarding the future of R. Julia Silge points out that Stack Overflow data shows R growing fast year over year — at about the same rate as Python. There are a lot of new users and packages need to take that into account.
I learned more about Natural Language ProcessingTim Oldfield introduces this conference as NOT a codebased conference. However, Julia Silge doesn't hesitate to illustrate her discussion with appropriate code. And seriously, how would you discuss natural language processing without using the language of the trade?
I won't get into the terms (TFIDF) and technology (tidytext) of Mz Silge's presentation. I will mention she does a great job of summarizing how and why to perform text mining. Like all good tech, you can easily scratch the surface of text mining in fifteen minutes. A thorough understanding requires years of hard research. If you'd like an introduction to her work, take a look at her paper She Giggles, He Gallops – analyzing gender tropes in film.
I gained an understanding of machinelearningDavid Smith with Microsoft presented a session on neural nets, machine learning and transfer learning. More than just a theoretical chat, David illustrated the concepts with working tools. I’ve read quite a bit about machine learning – but this illustration really brings it home. Oh — and it’s pretty damn funny. ( David posted this on a blog entry here )
I learned to (sort of) love ExcelEina Ooka found herself forced to incorporate Excel with her data workflow. Now — we all have opinions about Excel in data science — but Eina points out that for multidisciplinary data science teams, it’s good for data storage, modeling, and reports. There are issues about reproducibility and source control and for that, R is a good solution. But Eina summarizes that excel is still useful. Not all projects can move away from it.
Data science teams without structured, intentional collaboration leak knowledge and waste resourcesStephanie Kirmer provided reallife experience and lessons learned on Data Science teams. Her themes included collaboration, version control, reproducibility, institutional knowledge, and other concerns. She has accomplished this with the consistent use of R packages.
One of her most interesting concepts was using packages to capture institutional knowledge. Documenting procedures with a function, contained in a standardized package provides stability and a common tool. This package then becomes an onboarding tool for new hires and a knowledge repository for departing staff.
I learned risk can be studied and quantifiedRisk is the chance that something bad will happen. David Severski with Starbucks revealed some of the tools used by the coffee giant, specifically OpenFAIR (Open Factor Analysis of Information Risk) and EvaluatoR (an R package for risk evaluation). Dave points out that R is an elegant language for data tasks. It has an ecosystem of tools for simulations and statistics, making risk evaluation a plugandplay process.
Personally, I don’t have call for risk evaluation. But it’s interesting to get a quick peek into the language and concerns of this specialty.
I was reminded of the Science in Data ScienceMark Gallivan reminds us about the science in data science. He researched the effect of Zika on travel by searching twitter for #babymoon. With that data, he crossreferences on the geolocation of the tweet to draw conclusions of the impact of Zika on travel plans of pregnant women. This is one of those useful presentations on the nuts and bolts of research.
I gained knowledge for nonR conversationsOn November 7th I attended the EARL (Enterprise Applications of the R Language) conference in Seattle. Two days later I attended Orycon, the annual science fiction convention in Portland, Oregon. After every conference I attend, I do a private postmortem. I ask myself if the conference was worthwhile, if I’d attend again, and if my time was wellspent.
EARL is a deep dive into the application of the R language. Attendees are assumed to have deep understanding of R, statistics, and a domain knowledge; the quintessential definition of data science.
Orycon is a gathering of science fiction fans. It includes a crowd of cosplayers imitating the latest Star Wars characters — but I’ll ignore them for this discussion. To be clear, the people that appreciate science fiction are deeply involved in science fact.
As a result of attending EARL, I’m better prepared to understand the talent and experience slightly under the radar at Orycon. I already knew the methods the experts used to perform and document their research. Thanks to David Smith’s “not hotdog” I understand machine learning at an operational level, so can skip over that discussion — or correct bad science from a misinformed espouser of pseudofact.
Mark is an author, educator, and writer and teaches about R and Raspberry Pi at LinkedIn Learning.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Rbloggers weekly – top R posts from last week (20181104 till 20181110)
Most liked R posts from last week, sorted based on the number of likes they got on twitter, enjoy:
 New R Cheatsheet: Data Science Workflow with R (295 likes)
 Coding Regression trees in 150 lines of R code (90 likes)
 ‘How do neural nets learn?’ A step by step explanation using the H2O Deep Learning algorithm. (88 likes)
 Detailed introduction of “myprettyreport” R package (83 likes)
 How to Sort Data in R (76 likes)
 Explore Your Dataset in R (75 likes)
 Tesseract 4 is here! State of the art OCR in R! (74 likes)
 “A Guide to Working With Census Data in R” is now Complete! (70 likes)
 4 ways to be more efficient using RStudio’s Code Snippets, with 11 ready to use examples (69 likes)
 Show R shiny app in WordPress (67 likes)
 Integrating R and Telegram (57 likes)
 Working with US Census Data in R (51 likes)
 R shiny stock analysis (48 likes)
 How to Use Simulated Data to Check Choice Model Experimental Designs Using R (47 likes)
 anytime – dates in R (46 likes)
 Practical statistics books for software engineers (43 likes)
 Onearm Bayesian Adaptive Trial Simulation Code (42 likes)
 Tmobile uses R for Customer Service AI (40 likes)
 Image segmentation based on Superpixels and Clustering (40 likes)
 Thrice: Sentiment Analysis – Emotions in Lyrics! (39 likes)
 Coding Gradient boosted machines in 100 lines of code (36 likes)
 The “probability to win” is hard to estimate… (34 likes)
 Source and List: Organizing R Shiny Apps (34 likes)
 RStudio 1.2 Preview – New Features in RStudio Pro (33 likes)
 R tip: Make Your Results Clear with sigr (33 likes)
 Adding different annotation to each facet in ggplot (31 likes)
 Indatabase xgboost predictions with R (27 likes)
 Where to live in Japan: XKCDthemed climate plots and maps! (27 likes)
 Exploring Models with lime (25 likes)
 Building a Repository of Alpinebased Docker Images for R, Part I (24 likes)
 Visualize the World Cup with R! Part 1: Recreating Goals with ggsoccer and ggplot2 (22 likes)
 Escaping the macOS 10.14 (Mojave) Filesystem Sandbox with R / RStudio (21 likes)
 The Gamification Of Fitbit: How an API Provided the Next Level of tRaining (20 likes)
 Management accounting and controlling in R (20 likes)
 Data Science With R Course Series – Week 8 (20 likes)
 Thrice: Breaking Down The Lyrics WordbyWord! (17 likes)
 Exploring Japan’s Postwar Economic Miracle with gganimate, tweenr, & highcharter! (16 likes)
 My first R package building experience: Reflections from creating bulletchartr! (16 likes)
 Can we predict the crawling of the GoogleBot? (16 likes)
 AzureR: R packages to control Azure services (15 likes)
Using a genetic algorithm for the hyperparameter optimization of a SARIMA model
(This article was first published on Econometrics and Free Software, and kindly contributed to Rbloggers)
IntroductionIn this blog post, I’ll use the data that I cleaned in a previous
blog post, which you can download
here. If you want to follow along,
download the monthly data. In my last blog post
I showed how to perform a grid search the “tidy” way. As an example, I looked for the right
hyperparameters of a SARIMA model. However, the goal of the post was not hyperparameter optimization
per se, so I did not bother with tuning the hyperparameters on a validation set, and used the test
set for both validation of the hyperparameters and testing the forecast. Of course, this is not great
because doing this might lead to overfitting the hyperparameters to the test set. So in this blog post
I split my data into trainig, validation and testing sets and use a genetic algorithm to look
for the hyperparameters. Again, this is not the most optimal way to go about this problem, since
the {forecast} package contains the very useful auto.arima() function. I just wanted to see
what kind of solution a genetic algorithm would return, and also try different cost functions.
If you’re interested, read on!
Let’s first load some libraries and define some helper functions (the helper functions were explained
in the previous blog posts):
Now, let’s load the data:
avia_clean_monthly < read_csv("https://raw.githubusercontent.com/brodrigues/avia_par_lu/master/avia_clean_monthy.csv") ## Parsed with column specification: ## cols( ## destination = col_character(), ## date = col_date(format = ""), ## passengers = col_integer() ## )Let’s split the data into a train set, a validation set and a test set:
avia_clean_train < avia_clean_monthly %>% select(date, passengers) %>% filter(year(date) < 2013) %>% group_by(date) %>% summarise(total_passengers = sum(passengers)) %>% pull(total_passengers) %>% ts(., frequency = 12, start = c(2005, 1)) avia_clean_validation < avia_clean_monthly %>% select(date, passengers) %>% filter(between(year(date), 2013, 2016)) %>% group_by(date) %>% summarise(total_passengers = sum(passengers)) %>% pull(total_passengers) %>% ts(., frequency = 12, start = c(2013, 1)) avia_clean_test < avia_clean_monthly %>% select(date, passengers) %>% filter(year(date) >= 2016) %>% group_by(date) %>% summarise(total_passengers = sum(passengers)) %>% pull(total_passengers) %>% ts(., frequency = 12, start = c(2016, 1)) logged_test_data < ihs(avia_clean_test) logged_validation_data < ihs(avia_clean_validation) logged_train_data < ihs(avia_clean_train)I will train the models on data from 2005 to 2012, look for the hyperparameters on data from 2013
to 2016 and test the accuracy on data from 2016 to March 2018. For this kind of exercise, the ideal
situation would be to perform crossvalidation. Doing this with timeseries data is not obvious
because of the autocorrelation between observations, which would be broken by sampling independently
which is required by CV. Also, if for example you do leaveoneout CV,
you would end up trying to predict a point in, say, 2017, with data
from 2018, which does not make sense. So you should be careful about that. {forecast} is able
to perform CV for time series and scikitlearn, the
Python package, is able to perform
crossvalidation of time series data
too. I will not do it in this blog post and simply focus on the genetic algorithm part.
Let’s start by defining the cost function to minimize. I’ll try several, in the first one I will
minimize the RMSE:
If arima() is not able to estimate a model for the given parameters, I force it to return NULL,
and in that case force the cost function to return a very high cost. If a model was successfully estimated,
then I compute the RMSE.
Let’s also take a look at what auto.arima() says:
starting_model < auto.arima(logged_train_data) summary(starting_model) ## Series: logged_train_data ## ARIMA(1,0,2)(2,1,0)[12] ## ## Coefficients: ## ar1 ma1 ma2 sar1 sar2 ## 0.9754 0.7872 0.2091 0.7285 0.4413 ## s.e. 0.0261 0.1228 0.1213 0.1063 0.1150 ## ## sigma^2 estimated as 0.004514: log likelihood=105.61 ## AIC=199.22 AICc=198.13 BIC=184.64 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE ## Training set 0.008398036 0.06095102 0.03882593 0.07009285 0.3339574 ## MASE ACF1 ## Training set 0.4425794 0.02073886Let’s compute the cost at this vector of parameters:
cost_function_rmse(c(1, 0, 2, 2, 1, 0), train_data = logged_train_data, validation_data = logged_validation_data, forecast_periods = 65) ## [1] 0.1731473Ok, now let’s start with optimizing the hyperparameters. Let’s help the genetic algorithm a little
bit by defining where it should perform the search:
This matrix constraints the first parameter to lie between 0 and 3, the second one between 0 and 2,
and so on.
Let’s call the genoud() function from the {rgenoud} package, and use 8 cores:
cl < makePSOCKcluster(8) clusterExport(cl, c('logged_train_data', 'logged_validation_data')) tic < Sys.time() auto_arima_rmse < genoud(cost_function_rmse, nvars = 6, data.type.int = TRUE, starting.values = c(1, 0, 2, 2, 1, 0), # < from auto.arima Domains = domains, cluster = cl, train_data = logged_train_data, validation_data = logged_validation_data, forecast_periods = length(logged_validation_data), hard.generation.limit = TRUE) toc_rmse < Sys.time()  ticmakePSOCKcluster() is a function from the {parallel} package. I must also export the global
variables logged_train_data or logged_validation_data. If I don’t do that, the workers called
by genoud() will not know about these variables and an error will be returned. The option
data.type.int = TRUE force the algorithm to look only for integers, and hard.generation.limit = TRUE
forces the algorithm to stop after 100 generations.
The process took 7 minutes, which is faster than doing the grid search.
What was the solution found?
Let’s train the model using the arima() function at these parameters:
best_model_rmse < arima(logged_train_data, order = auto_arima_rmse$par[1:3], season = list(order = auto_arima_rmse$par[4:6], period = 12), method = "ML") summary(best_model_rmse) ## ## Call: ## arima(x = logged_train_data, order = auto_arima_rmse$par[1:3], seasonal = list(order = auto_arima_rmse$par[4:6], ## period = 12), method = "ML") ## ## Coefficients: ## ar1 ar2 ar3 ma1 sar1 sma1 ## 0.6999 0.4541 0.0476 0.9454 0.4996 0.9846 ## s.e. 0.1421 0.1612 0.1405 0.1554 0.1140 0.2193 ## ## sigma^2 estimated as 0.006247: log likelihood = 57.34, aic = 100.67 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE ## Training set 0.0006142355 0.06759545 0.04198561 0.005408262 0.3600483 ## MASE ACF1 ## Training set 0.4386693 0.008298546Let’s extract the forecasts:
best_model_rmse_forecast < forecast::forecast(best_model_rmse, h = 65) best_model_rmse_forecast < to_tibble(best_model_rmse_forecast) ## Joining, by = "date" ## Joining, by = "date" starting_model_forecast < forecast(starting_model, h = 65) starting_model_forecast < to_tibble(starting_model_forecast) ## Joining, by = "date" ## Joining, by = "date"and plot the forecast to see how it looks:
avia_clean_monthly %>% group_by(date) %>% summarise(total = sum(passengers)) %>% mutate(total_ihs = ihs(total)) %>% ggplot() + ggtitle("Minimization of RMSE") + geom_line(aes(y = total_ihs, x = date), colour = "#82518c") + scale_x_date(date_breaks = "1 year", date_labels = "%m%Y") + geom_ribbon(data = best_model_rmse_forecast, aes(x = date, ymin = lower95, ymax = upper95), fill = "#666018", alpha = 0.2) + geom_line(data = best_model_rmse_forecast, aes(x = date, y = point_estimate), linetype = 2, colour = "#8e9d98") + geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95), fill = "#98431e", alpha = 0.2) + geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate), linetype = 2, colour = "#a53031") + theme_blog()The yellowish line and confidence intervals come from minimizing the genetic algorithm, and the
redish from auto.arima(). Interesting; the point estimate is very precise, but the confidence
intervals are very wide. Low bias, high variance.
Now, let’s try with another cost function, where I minimize the BIC, similar to the auto.arima() function:
cost_function_bic < function(param, train_data, validation_data, forecast_periods){ order < param[1:3] season < c(param[4:6], 12) model < purrr::possibly(arima, otherwise = NULL)(x = train_data, order = order, seasonal = season, method = "ML") if(is.null(model)){ return(9999999) } else { BIC(model) } }Let’s take a look at the cost at the parameter values returned by auto.arima():
cost_function_bic(c(1, 0, 2, 2, 1, 0), train_data = logged_train_data, validation_data = logged_validation_data, forecast_periods = 65) ## [1] 184.6397Let the genetic algorithm run again:
cl < makePSOCKcluster(8) clusterExport(cl, c('logged_train_data', 'logged_validation_data')) tic < Sys.time() auto_arima_bic < genoud(cost_function_bic, nvars = 6, data.type.int = TRUE, starting.values = c(1, 0, 2, 2, 1, 0), # < from auto.arima Domains = domains, cluster = cl, train_data = logged_train_data, validation_data = logged_validation_data, forecast_periods = length(logged_validation_data), hard.generation.limit = TRUE) toc_bic < Sys.time()  ticThis time, it took 6 minutes, a bit slower than before. Let’s take a look at the solution:
auto_arima_bic ## $value ## [1] 201.0656 ## ## $par ## [1] 0 1 1 1 0 1 ## ## $gradients ## [1] NA NA NA NA NA NA ## ## $generations ## [1] 12 ## ## $peakgeneration ## [1] 1 ## ## $popsize ## [1] 1000 ## ## $operators ## [1] 122 125 125 125 125 126 125 126 0Let’s train the model at these parameters:
best_model_bic < arima(logged_train_data, order = auto_arima_bic$par[1:3], season = list(order = auto_arima_bic$par[4:6], period = 12), method = "ML") summary(best_model_bic) ## ## Call: ## arima(x = logged_train_data, order = auto_arima_bic$par[1:3], seasonal = list(order = auto_arima_bic$par[4:6], ## period = 12), method = "ML") ## ## Coefficients: ## ma1 sar1 sma1 ## 0.6225 0.9968 0.832 ## s.e. 0.0835 0.0075 0.187 ## ## sigma^2 estimated as 0.004145: log likelihood = 109.64, aic = 211.28 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE ## Training set 0.003710982 0.06405303 0.04358164 0.02873561 0.3753513 ## MASE ACF1 ## Training set 0.4553447 0.03450603And let’s plot the results:
best_model_bic_forecast < forecast::forecast(best_model_bic, h = 65) best_model_bic_forecast < to_tibble(best_model_bic_forecast) ## Joining, by = "date" ## Joining, by = "date" avia_clean_monthly %>% group_by(date) %>% summarise(total = sum(passengers)) %>% mutate(total_ihs = ihs(total)) %>% ggplot() + ggtitle("Minimization of BIC") + geom_line(aes(y = total_ihs, x = date), colour = "#82518c") + scale_x_date(date_breaks = "1 year", date_labels = "%m%Y") + geom_ribbon(data = best_model_bic_forecast, aes(x = date, ymin = lower95, ymax = upper95), fill = "#5160a0", alpha = 0.2) + geom_line(data = best_model_bic_forecast, aes(x = date, y = point_estimate), linetype = 2, colour = "#208480") + geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95), fill = "#98431e", alpha = 0.2) + geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate), linetype = 2, colour = "#a53031") + theme_blog()The solutions are very close, both in terms of point estimates and confidence intervals. Bias
increased, but variance lowered… This gives me an idea! What if I minimize the RMSE, while
keeping the number of parameters low, as a kind of regularization? This is somewhat what minimising
BIC does, but let’s try to do it a more “naive” approach:
This is also similar to what auto.arima() does; by default, the max.order argument in auto.arima()
is set to 5, and is the sum of p + q + P + Q. So I’ll try something similar.
Let’s take a look at the cost at the parameter values returned by auto.arima():
cost_function_rmse_low_k(c(1, 0, 2, 2, 1, 0), train_data = logged_train_data, validation_data = logged_validation_data, forecast_periods = 65, max.order = 5) ## [1] 0.1731473Let’s see what will happen:
cl < makePSOCKcluster(8) clusterExport(cl, c('logged_train_data', 'logged_validation_data')) tic < Sys.time() auto_arima_rmse_low_k < genoud(cost_function_rmse_low_k, nvars = 6, data.type.int = TRUE, starting.values = c(1, 0, 2, 2, 1, 0), # < from auto.arima max.order = 5, Domains = domains, cluster = cl, train_data = logged_train_data, validation_data = logged_validation_data, forecast_periods = length(logged_validation_data), hard.generation.limit = TRUE) toc_rmse_low_k < Sys.time()  ticIt took 1 minute to train this one, quite fast! Let’s take a look:
auto_arima_rmse_low_k ## $value ## [1] 0.002503478 ## ## $par ## [1] 1 2 0 3 1 0 ## ## $gradients ## [1] NA NA NA NA NA NA ## ## $generations ## [1] 11 ## ## $peakgeneration ## [1] 1 ## ## $popsize ## [1] 1000 ## ## $operators ## [1] 122 125 125 125 125 126 125 126 0And let’s plot it:
best_model_rmse_low_k < arima(logged_train_data, order = auto_arima_rmse_low_k$par[1:3], season = list(order = auto_arima_rmse_low_k$par[4:6], period = 12), method = "ML") summary(best_model_rmse_low_k) ## ## Call: ## arima(x = logged_train_data, order = auto_arima_rmse_low_k$par[1:3], seasonal = list(order = auto_arima_rmse_low_k$par[4:6], ## period = 12), method = "ML") ## ## Coefficients: ## ar1 sar1 sar2 sar3 ## 0.6468 0.7478 0.5263 0.1143 ## s.e. 0.0846 0.1171 0.1473 0.1446 ## ## sigma^2 estimated as 0.01186: log likelihood = 57.88, aic = 105.76 ## ## Training set error measures: ## ME RMSE MAE MPE MAPE ## Training set 0.0005953302 0.1006917 0.06165919 0.003720452 0.5291736 ## MASE ACF1 ## Training set 0.6442205 0.3706693 best_model_rmse_low_k_forecast < forecast::forecast(best_model_rmse_low_k, h = 65) best_model_rmse_low_k_forecast < to_tibble(best_model_rmse_low_k_forecast) ## Joining, by = "date" ## Joining, by = "date" avia_clean_monthly %>% group_by(date) %>% summarise(total = sum(passengers)) %>% mutate(total_ihs = ihs(total)) %>% ggplot() + ggtitle("Minimization of RMSE + low k") + geom_line(aes(y = total_ihs, x = date), colour = "#82518c") + scale_x_date(date_breaks = "1 year", date_labels = "%m%Y") + geom_ribbon(data = best_model_rmse_low_k_forecast, aes(x = date, ymin = lower95, ymax = upper95), fill = "#5160a0", alpha = 0.2) + geom_line(data = best_model_rmse_low_k_forecast, aes(x = date, y = point_estimate), linetype = 2, colour = "#208480") + geom_ribbon(data = starting_model_forecast, aes(x = date, ymin = lower95, ymax = upper95), fill = "#98431e", alpha = 0.2) + geom_line(data = starting_model_forecast, aes(x = date, y = point_estimate), linetype = 2, colour = "#a53031") + theme_blog()Looks like this was not the right strategy. There might be a better cost function than what I have
tried, but looks like minimizing the BIC is the way to go.
Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates or
buy me an espresso.
To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
In case you missed it: October 2018 roundup
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
In case you missed them, here are some articles from October of particular interest to R users.
Peter Provost ports some 80'sera BASIC programs for kids to R.
In a podcast for Fringe FM, I discuss the ethics of AI, Microsoft and Open Source, and the R Community.
Roundup of AI, Machine Learning and Data Science news from October 2018.
In this episode of "Guy in a Cube", R is used to visualize Anscombe's Quartet via Power BI.
Di Cook suggests using computer vision to automate statistical model assessment for machine learning in the 2018 Belz Lecture.
R provides the analysis behind a frontpage story on bridge safety in the Baltimore Sun.
Tomas Kalibera describes the big impacts of a small tweak to the logical comparison operators in R.
The Economist is now using R to calculate its famous "Big Mac Index".
Behindthescenes details of how R gets built on Windows, from a presentation by Jeroen Ooms.
The R Consortium has accepted another round of grant applications for R community projects.
A list of upcoming R conferences.
A recap of AI, Machine Learning and Data Science announcements from the Microsoft Ignite conference.
And some general interest stories (not necessarily related to R):
 A lesson on diversity: the Parable of the Polygons
 The story behind the baseball scene in The Naked Gun
 Public Key Cryptography, as explained by IKEA
As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Revolutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Quoting in R
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
Many R users appear to be big fans of "code capturing" or "non standard evaluation" (NSE) interfaces. In this note we will discuss quoting and nonquoting interfaces in R.
The above terms are simply talking about interfaces where a name to be used is captured from the source code the user typed, and thus does not need quote marks. For example:
d < data.frame(x = 1) d$x ## [1] 1Notice both during data.frame creation and column access: the column name is given without quotes and also accessed without quotes.
This differs from using a standard value oriented interface as in the following:
d[["x"]] ## [1] 1A natural reason for R users to look for automatic quoting is: it helps make working with columns in data.frames (R‘s primary data analysis structure) look much like working with variables in the environment. Without the quotes a column name looks very much like a variable name. And thinking of columns as variables is a useful mindset.
Another place implicit quoting shows up is with R‘s "combine" operator where one can write either of the following.
c(a = "b") ## a ## "b" c("a" = "b") ## a ## "b"The wrapr package brings in a new function: qc() or "quoting c()" that gives a very powerful and convenient way to elide quotes.
library(wrapr) qc(a = b) ## a ## "b"Notice quotes are not required on either side of the name assignment. Again, eliding quotes is not that big a deal, and not to everyone’s taste. For example I have never seen a Python user feel they are missing anything because they write "{"a" : "b"}" to construct their own named dictionary structure.
That being said, qc() is a very convenient and consistent notation if you do want to work in an NSE style.
For example, if it ever bothered you that dplyr join takes the join column names as a character vector you can use qc() to instead write:
dplyr::full_join( iris, iris, by = qc(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species))(Actually I very much like that the join takes the columns as a vector, as it is much easier to program over.) I feel the qc() grouping of the columns makes it easier for a reader to see which arguments are the column set than a use of ... would. Please take, as an example, the following dplyr::group_by():
library(dplyr) starwars %>% group_by(homeworld, species, add = FALSE) %>% summarize(mass = mean(mass, na.rm = TRUE)) ## # A tibble: 58 x 3 ## # Groups: homeworld [?] ## homeworld species mass ## ## 1 Alderaan Human 64 ## 2 Aleen Minor Aleena 15 ## 3 Bespin Human 79 ## 4 Bestine IV Human 110 ## 5 Cato Neimoidia Neimodian 90 ## 6 Cerea Cerean 82 ## 7 Champala Chagrian NaN ## 8 Chandrila Human NaN ## 9 Concord Dawn Human 79 ## 10 Corellia Human 78.5 ## # ... with 48 more rowsWhen coming back to such code later, I find the following notation to be easier to read:
library(seplyr) starwars %>% group_by_se(qc(homeworld, species), add = FALSE) %>% summarize(mass = mean(mass, na.rm = TRUE)) ## # A tibble: 58 x 3 ## # Groups: homeworld [?] ## homeworld species mass ## ## 1 Alderaan Human 64 ## 2 Aleen Minor Aleena 15 ## 3 Bespin Human 79 ## 4 Bestine IV Human 110 ## 5 Cato Neimoidia Neimodian 90 ## 6 Cerea Cerean 82 ## 7 Champala Chagrian NaN ## 8 Chandrila Human NaN ## 9 Concord Dawn Human 79 ## 10 Corellia Human 78.5 ## # ... with 48 more rowsIn the above we can clearly see which arguments to the grouping command are intended to be column names, and which are not.
qc() is a powerful NSE tool that annotates and contains where we are expecting quoting behavior. Some possible applications include examples such as the following.
# install many packages install.packages(qc(testthat, knitr, rmarkdown, R.rsp)) # select columns iris[, qc(Petal.Length, Petal.Width, Species)] # control a forloop for(col in qc(Petal.Length, Petal.Width)) { iris[[col]] < sqrt(iris[[col]]) } # control a vapply vapply(qc(Petal.Length, Petal.Width), function(col) { sum(is.na(iris[[col]])) }, numeric(1))The idea is: with qc() the user can switch name capturing notation at will, with no priorarrangement needed in the functions or packages used. Also the parenthesis in qc() make for more legible code: a reader can see which arguments are being quoted and taken as a group.
As of wrapr 1.7.0 qc() incorporates bquote() functionality. bquote() is R‘s builtin quasiquotation facility. It was added to R in August of 2003 by Thomas Lumley, and doesn’t get as much attention as it deserves.
A quoting tool such as qc() becomes a quasiquoting tool if we add a notation that signals we do not wish to quote. In R the standard notation for this is ".()" (Lisp uses a backtick, and the rlang package uses "!!"). The bquote()enabled version of qc() lets us write code such as the following.
extra_column = "Species" qc(Petal.Length, Petal.Width, extra_column) ## [1] "Petal.Length" "Petal.Width" "extra_column" qc(Petal.Length, Petal.Width, .(extra_column)) ## [1] "Petal.Length" "Petal.Width" "Species"Notice it is unambiguous what is going on above. The first qc() quotes all of its arguments into strings. The second works much the same, with the exception of names marked with .(). This ability to "break out" or turn off quoting is convenient if we are working with a combination of values we wish to type in directly and others we wish to take from variables.
qc() allows substitution on the lefthand sides of assignments, if we use the alternate := notation for assignment (a convention put forward by data.table, and later adopted by dplyr).
library(wrapr) left_name = "a" right_value = "b" qc(.(left_name) := .(right_value)) ## a ## "b"The wrapr package also exports an implementation for :=. So one could also write:
library(wrapr) left_name := right_value ## a ## "b"The hope is that the qc() and := operators are well behaved enough to commute in the sense the following two statements should return the same value.
library(wrapr) qc(a := b, c := d) ## a c ## "b" "d" qc(a, c) := qc(b, d) ## a c ## "b" "d"The idea is: when there is a symmetry it is often evidence you are using the right concepts.
In conclusion: the goal of wrapr::qc() is to put a very regular and controllable quoting facility directly into the hands of the R user. This allows the R user to treat just about any R function or package as if the function or package itself implemented argument quoting and quasiquotation capabilities.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A deep dive into glmnet: standardize
(This article was first published on R – Statistical Odds & Ends, and kindly contributed to Rbloggers)
I’m writing a series of posts on various function options of the glmnet function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.
In this post, we will focus on the standardize option.
For reference, here is the full signature of the glmnet function:
glmnet(x, y, family=c("gaussian","binomial","poisson","multinomial","cox","mgaussian"), weights, offset=NULL, alpha = 1, nlambda = 100, lambda.min.ratio = ifelse(nobsUnless otherwise stated, will denote the number of observations, will denote the number of features, and fit will denote the output/result of the glmnet call. The data matrix is denoted by and the response is denoted by .
standardize
When standardize = TRUE (default), columns of the data matrix x are standardized, i.e. each column of x has mean 0 and standard deviation 1. More specifically, we have that for each ,
, and .
Why might we want to do this? Standardizing our features before model fitting is common practice in statistical learning. This is because if our features are on vastly different scales, the features with larger scales will tend to dominate the action. (One instance where we might not want to standardize our features is if they are already all measured along the same scale, e.g. meters or kilograms.)
Notice that the standardization here is slightly different from that offered by the scale function: scale(x, center = TRUE, scale = TRUE) gives the standardization
, and .
We verify this with a small data example. Generate data according to the following code:
n < 100; p < 5; true_p < 2 set.seed(950) X < matrix(rnorm(n * p), nrow = n) beta < matrix(c(rep(1, true_p), rep(0, p  true_p)), ncol = 1) y < X %*% beta + 3 * rnorm(n)Create a version of the data matrix which has standardized columns:
X_centered < apply(X, 2, function(x) x  mean(x)) Xs < apply(X_centered, 2, function(x) x / sqrt(sum(x^2) / n))Next, we run glmnet on Xs and y with both possible options for standardize:
library(glmnet) fit < glmnet(Xs, y, standardize = TRUE) fit2 < glmnet(Xs, y, standardize = FALSE)We can check that we get the same fit in both cases (modulo numerical precision):
sum(fit$lambda != fit2$lambda) # 0 max(abs(fit$beta  fit2$beta)) # 6.661338e16The documentation notes that the coefficients returned are on the original scale. Let’s confirm that with our small data set. Run glmnet with the original data matrix and standardize = TRUE:
fit3 < glmnet(X, y, standardize = TRUE)For each column , our standardized variables are , where and are the mean and standard deviation of column respectively. If and represent the model coefficients of fit2 and fit3 respectively, then we should have
i.e. we should have and for . The code below checks that this is indeed the case (modulo numerical precision):
# get column means and SDs X_mean < colMeans(X) X_sd < apply(X_centered, 2, function(x) sqrt(sum(x^2) / n)) # check difference for intercepts fit2_int < coefficients(fit2)[1,] fit3_int < coefficients(fit3)[1,] temp < fit2_int  colSums(diag(X_mean / X_sd) %*% fit2$beta) max(abs(temp  fit3_int)) # 1.110223e16 # check difference for feature coefficients temp < diag(1 / X_sd) %*% fit2$beta max(abs(temp  fit3$beta)) # 1.110223e15The discussion above has been for the standardization of x. What about standardization for y? The documentation notes that when family = "gaussian", y is automatically standardized, and the coefficients are unstandardized at the end of the procedure.
More concretely, let the mean and standard deviation of be denoted by and respectively. If running glmnet on standardized y gives intercept and coefficients , then glmnet on unstandardized y will give intercept and coefficients .
Again, this can be verified empirically:
# get mean and SD of y y_mean < mean(y) y_sd < sqrt(sum((y  y_mean)^2) / n) # fit model with standardized y fit4 < glmnet(X, (y  y_mean) / y_sd, standardize = TRUE) # check difference for intercepts fit4_int < coefficients(fit4)[1,] temp < fit4_int * y_sd + y_mean max(abs(temp  fit3_int)) # 1.110223e16 # check difference for feature coefficients max(abs(y_sd * fit4$beta  fit3$beta)) # 8.881784e16 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
GoldMining Week 11 (2018)
(This article was first published on R – Fantasy Football Analytics, and kindly contributed to Rbloggers)
Week 11 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!
The post GoldMining Week 11 (2018) appeared first on Fantasy Football Analytics.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Make Beautiful Tables with the Formattable Package
(This article was first published on RBloggers – Displayr, and kindly contributed to Rbloggers)
I love the formattable package, but I always struggle to remember its syntax. A quick Google search reveals that I’m not alone in this struggle. This post is intended as a reminder for myself of how the package works – and hopefully you’ll find it useful too!
Default formattable exampleThe table below is an R data frame (you can turn most things into a data frame using as.data.frame(x), where x is whatever you are converting). It’s by no means as bad as most R tables, but clearly it is not good enough to be shared with others.
If we give this table (called prevalence) to formattable, it does a great job just using defaults.
library(formattable) formattable(prevalence) Column alignmentWe can control column alignment using the align parameter. In the example below, I set the first column to leftaligned, and the remaining columns are rightaligned.
formattable(prevalence, align = c("l", rep("r", NCOL(prevalence)  1))) Simple column formattingThe main reason people love formattable is the formatting of columns. Below, the first column has been changed to grey, color bars have been added to Average, and the last column has been formatted as percentages. Note that the variable names are surrounded by backticks (the key above your Tab on Englishlanguage keyboards), not single quotation marks.
prevalence[, "Improvement"] = prevalence[, "Improvement"] / 100 formattable(prevalence, align = c("l",rep("r", NCOL(prevalence)  1)), list(`Indicator Name` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")), `Average` = color_bar("#FA614B"), `Improvement` = percent)) More customized column formattingIn the example above, prior to using formattable I divided the last column by 100, as formattable‘s percent function assumes the inputs are decimals. However, we can perform transformations within formattable. In the code below, I divide by 100 and I also color the values as red or green depending on their value. Note that in the bottom two lines, we define x as being the value by placing it to the left of the ~ and then use it in the function to the right (it is a lambda function, to use some jargon).
formattable(prevalence, align = c("l",rep("r", NCOL(prevalence)  1)), list(`Indicator Name` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")), `Average` = color_bar("#FA614B"), `Improvement` = formatter("span", x ~ percent(x / 100), style = x ~ style(color = ifelse(x&amp;amp;lt; 0, "red", "green")))))Below I extend this even further, replacing the percentages with ticks, crosses, and words.
formattable(prevalence, align = c("l",rep("r", NCOL(prevalence)  1)), list(`Indicator Name` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")), `Average` = color_bar("#FA614B"), `Improvement` = formatter("span", x ~ icontext(ifelse(x < 0, "ok", "remove"), ifelse(< 0, "Yes", "No")), style = x ~ style(color = ifelse(x &amp;amp;lt; 0, "red", "green"))))) Controlling the width of the barsIn the table below I have used the standard color bar, which scales the bars so that the bar lengths are proportional to the values being displayed. However, IQ cannot really be 0, so arguably the bars are misleading.
The fix to this problem is to provide a function that has a more appropriate mapping between the values and the length of the bars. In the code below, I create a function that returns a 0 for the lowest value (70), and a 1 for the highest value (150).
unit.scale = function(x) (x  min(x)) / (max(x)  min(x)) formattable(iq.data, align = c("l","r"), list(`IQ` = color_bar("#FA614B66", fun = unit.scale))) Formatting areas (ranges of cells)It is possible to also set the shading of ranges of cells, rather than just individual columns. In the example below, I’ve created a heatmap using two shades of green. In this case I have specified the area using just the columns, but row can also be supplied as well as or in place of col.
formattable(prevalence, align = c("l",rep("r", NCOL(prevalence)  1)), list( `Indicator Name` = formatter("span", style = ~ style(color = "grey",font.weight = "bold")), area(col = 2:7) ~ color_tile("#DeF7E9", "#71CA97")))In this next example, I first format all the cells to be percentages, and then apply the color shading to the year columns. I have to wrap percent in another function, as percent only works on a single column of numbers.
library(formattable) formattable(prevalence, align = c("l",rep("r", NCOL(prevalence)  1)), list(`Indicator Name` = formatter("span", style = ~ style(color = "grey",font.weight = "bold")), area(col = 2:7) ~ function(x) percent(x / 100, digits = 0), area(col = 2:7) ~ color_tile("#DeF7E9", "#71CA97"))) Custom formattersYou can also write your own functions for controlling formatting. In the example below, rather than use formattable‘s inbuilt color_tile (as done in the previous example), I’ve instead customized it, controlling the padding, border radios, and font color.
custom_color_tile &amp;amp;amp;amp;amp;amp;lt; function (...) { formatter("span", style = function(x) style(display = "block", padding = "0 4px", `color` = "white", `borderradius` = "4px", `backgroundcolor` = csscolor(gradient(as.numeric(x), ...)))) } formattable(prevalence, align = "r", list( `Indicator Name` = formatter("span", style = ~ style(color = "grey",font.weight = "bold")), area(col = 2:9) ~ function(x) percent(x / 100, digits = 0), area(col = 2:7) ~ custom_color_tile("#B1CBEB", "#3E7DCC"))) Arrows (and hiding columns)Sometimes it is useful to use arrows to show statistical significance. While formattable is not super flexible in this regard, it can do a good job nonetheless. The first step is to create a table where in addition to the data to be displayed, we also have a column containing zscores.
In the code below I first hide the column called z (z = FALSE), add arrows for zscores of less than 1.96 and greater than 1.96, and make z scores of greater than 0 green and less than 0 red.
formattable(prev.sig, list(z = FALSE, `2016` = formatter("span", style = ~ style(color = ifelse(`2016` > `2015`, "green", "red")), ~ icontext(sapply(`z`, function(x) if (x < 1.96) "arrowdown" else if (x > 1.96) "arrowup" else ""), `2016`))))A problem with this table is that the arrows are to the left of the numbers and are not lined up neatly. The only way I have figured out to avoid this is to put the arrows in a separate column, as shown here:
This is done by:
 Replacing z with an invisible character ( ).
 Replacing the values with the arrows.
Adding sparklines to tables
The sparklines package can be used to create sparklines:
library(sparkline) sparkline(c(1,2,7,6,5), type = "bar", barColor = "green")We can also include them in formattable tables.
The way that we do this is by converting the sparkline into text (character(htmltools::as.tags), and then (in the last two lines), telling the formattable HTML widget that it also contains sparklines.
library(sparkline) library(formattable) df = data.frame("Type" = c("bar", "line", "bullet", "pie", "tristate", "discrete"), Sparkline = c(as.character(htmltools::as.tags(sparkline(c(1,2,7,6,5), type = "bar"))), as.character(htmltools::as.tags(sparkline(c(1,2,7,6,5), type = "line"))), as.character(htmltools::as.tags(sparkline(c(1,2,7,6,5), type = "bullet"))), as.character(htmltools::as.tags(sparkline(c(1,2,7,6,5), type = "pie"))), as.character(htmltools::as.tags(sparkline(c(1,0,1,1,1,1,0,2), type = "tristate"))), as.character(htmltools::as.tags(sparkline(c(1,2,7,6,5), type = "discrete"))))) out = as.htmlwidget(formattable(df)) out$dependencies = c(out$dependencies, htmlwidgets:::widget_dependencies("sparkline", "sparkline")) out Putting it all togetherIn this final example, I combine many of the different ideas I’ve discussed into one table.
library(formattable) library(sparkline) prevalence$` ` = c(4.1, .3, .5, 1.4) prevalence$`2012` = apply(prevalence[, 2:7], 1, FUN = function(x) as.character(htmltools::as.tags(sparkline(as.numeric(x), type = "line")))) names(prevalence)[3] = "  " new.prevalance = prevalence[, c(1, 2, 3, 7, 10)] out = as.htmlwidget(formattable(new.prevalance, align = c("l",rep("r", NCOL(prevalence)  1)), list(`Indicator Name` = formatter("span", style = ~ style(color = "grey", font.weight = "bold")), " " = formatter("span", style = ~ style(color = ifelse(`2016` > `2011`, "green", "red")), ~ icontext(sapply(` `, function(x) if (x < 1.96) "arrowdown" else if (x> 1.96) "arrowup" else "")))))) out$dependencies = c(out$dependencies, htmlwidgets:::widget_dependencies("sparkline", "sparkline")) out View all the source code and play around with these examples yourselfI’ve created all the examples in this post in a live Displayr document, so you can look at the code and play around with it yourself. Click here to view the code and tables discussed in this post.
AcknowledgementsThe main example and many of the ideas in this post are from LITTLE MISS DATA, although I’ve reworked the code quite significantly. The hack for getting sparklines into the tables comes from HTML widget guru Kent Russell. Bert Wassink provided the trick for having a blank column name. My colleague Justin helped me a lot with this post.
View and modify the code in all these examples here!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RBloggers – Displayr. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Rcpp now used by 1500 CRAN packages
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
Right now Rcpp stands at 1500 reversedependencies on CRAN. The graph is on the left depicts the growth of Rcpp usage (as measured by Depends, Imports and LinkingTo, but excluding Suggests) over time. What an amazing few days this has been as we also just marked the tenth anniversary and the big one dot oh release.
Rcpp cleared 300 packages in November 2014. It passed 400 packages in June 2015 (when I only tweeted about it), 500 packages in late October 2015, 600 packages in March 2016, 700 packages last July 2016, 800 packages last October 2016, 900 packages early January 2017,
1000 packages in April 2017, and 1250 packages in November 2018. The chart extends to the very beginning via manually compiled data from CRANberries and checked with crandb. The next part uses manually saved entries. The core (and by far largest) part of the data set was generated semiautomatically via a short script appending updates to a small filebased backend. A list of packages using Rcpp is kept on this page.
Also displayed in the graph is the relative proportion of CRAN packages using Rcpp. The four percent hurdle was cleared just before useR! 2014 where I showed a similar graph (as two distinct graphs) in my invited talk. We passed five percent in December of 2014, six percent July of 2015, seven percent just before Christmas 2015, eight percent last summer, nine percent midDecember 2016, cracked ten percent in the summer of 2017 and eleven percent this year. We are currently at 11.199 percent or just over one in nine packages. There is more detail in the chart: how CRAN seems to be pushing back more and removing more aggressively (which my CRANberries tracks but not in as much detail as it could), how the growth of Rcpp seems to be slowing somewhat outright and even more so as a proportion of CRAN – just like every growth curve should, eventually. But we leave all that for another time.
1500 user packages is pretty mindboggling. We can use the progression of CRAN itself compiled by Henrik in a series of posts and emails to the main development mailing list. Not that long ago CRAN itself did not have 1500 packages, and here we are at almost 13400 with Rcpp at 11.2% and still growing (albeit slightly more slowly). Amazeballs.
This puts a whole lot of responsibility on us in the Rcpp team as we continue to keep Rcpp as performant and reliable as it has been.
And with that, and as always, a very big Thank You! to all users and contributors of Rcpp for help, suggestions, bug reports, documentation or, of course, code.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Discourse Network Analysis: Undertaking Literature Reviews in R
(This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to Rbloggers)
Literature reviews are the cornerstone of science. Keeping abreast of developments within any given field of enquiry has become increasingly difficult given the enormous amounts of new research. Databases and search technology have made finding relevant literature easy but, keeping a coherent overview of the discourse within a field of enquiry is an ever more encompassing task.
Scholars have proposed many approaches to analysing literature, which can be placed along a continuum from traditional narrative methods to systematic analytic syntheses of text using machine learning. Traditional reviews are biased because they rely entirely on the interpretation of the researcher. Analytical approaches follow a process that is more like scientific experimentation. These systematic methods are reproducible in the way literature is searched and collated but still rely on subjective interpretation.
Machine learning provides new methods to analyse large swaths of text. Although these methods sound exciting, these methods are incapable of providing insight. Machine learning cannot interpret a text; it can only summarise and structure a corpus. Machine learning still requires human interpretation to make sense of the information.
This article introduces a mixedmethod technique for reviewing literature, combining qualitative and quantitative methods. I used this method to analyse literature published by the International Water Association as part of my dissertation into water utility marketing. You can read the code below, or download it from GitHub. Detailed infromation about the methodology is available through FigShare.
A literature review with RQDAThe purpose of this review was to ascertain the relevance of marketing theory to the discourse of literature in water management. This analysis uses a sample of 244 journal abstracts, each of which was coded with the RQDA library. This library provides functionality for qualitative data analysis. RQDA provides a graphical user interface to mark sections of text and assign them to a code, as shown below.
Marking topics in an abstract with RQDA.You can load a corpus of text into RQDA and mark each of the texts with a series of codes. The texts and the codes are stored in an SQLite database, which can be easily queried for further analysis.
I used a marketing dictionary to assess the abstracts from journals published by the International Water Association from the perspective of marketing. This phase resulted in a database with 244 abstracts and their associated coding.
Discourse Network AnalysisOnce all abstracts are coded, we can start analysing the internal structure of the IWA literature. First, let’s have a look at the occurrence of the topics identified for the corpus of abstracts.
The first lines in this snippet call the tidyverse and RQDA libraries and open the abstracts database. The
getCodingTablefunction provides a data frame with each of the marked topics and their location. This function allows us to visualise the occurrence of the topics in the literature.
library(tidyverse) library(RQDA) ## Open project openProject("IWA_Abstracts.rqda", updateGUI = TRUE) ## Visualise codes getCodingTable() %>% group_by(codename) %>% count() %>% arrange(n) %>% ungroup() %>% mutate(codename = factor(codename, levels = codename)) %>% ggplot(aes(codename, n)) + geom_col() + coord_flip() + xlab("Code name") + ylab("Occurence") Frequencies of topics in IWA literature.This bar chart tells us that the literature is preoccupied with asset management and the quality of the product (water) or the service (customer perception). This insight is interesting, but not very enlightening information. We can use discourse network analysis to find a deeper structure in the literature.
Discourse Network AnalysisWe can view each abstract with two or more topics as a network where each topic is connected. The example below shows four abstracts with two or more codes and their internal networks.
Examples of complete networks for four abstracts.The union of these four networks forms a more extensive network that allows us to analyse the structure of the corpus of literature, shown below.
Union of networks and community detection.We can create a network of topics with the igraph package. The first step is to create a DocumentTermMatrix. This matrix counts how often a topic occurs within each abstract. From this matrix, we can create a graph by transforming it into an Adjacency Matrix. This matrix describes the graph which can be visualised and analysed. For more detailed information about this method, refer to my dissertation.
library(igraph) library(reshape2) dtm < getCodingTable()[,c(5, 4)] %>% mutate(freq = 1) %>% acast(filename ~ codename, sum) adj < crossprod(dtm) g < graph.adjacency(adj, weighted = T, mode = "undirected") g < simplify(g) ## Network Graphs V(g)$name < gsub(" ", "\n", V(g)$name) par(mar = rep(0, 4)) plot(g, layout = layout.fruchterman.reingold, vertex.label.cex = 1, vertex.size = degree(g), vertex.label.color = "black", vertex.frame.color = "white", vertex.color = "Dodgerblue", edge.width = E(g)$weight * 1, edge.color = "darkgray" ) The network of topics in IWA literature.In this graph, each node is a topic in the literature, and each edge implies that a topic is used in the same abstract. This graph uses the FruchtermanReingold algorithm to position each of the nodes, with the most connected topic in the centre.
The last step is to identify the structure of this graph using community detection. A community is a group of nodes that are more connected with each other than with nodes outside the community.
set.seed(123) comms < spinglass.community(g, spins = 100) par(mar = rep(0, 4)) plot(comms, g, layout = layout.fruchterman.reingold, vertex.label.cex = .7, vertex.size = degree(g), vertex.label.color = "black", vertex.frame.color = NA, edge.color = "black", vertex.label.family = "sanserif", mark.border = NA ) Community detection in IWA literature.We have now succeeded to convert a corpus of 244 journal abstracts to a parsimonious overview of four communities of topics. This analysis resulted in greater insight into how marketing theory applies to water management, which was used to structure a book about water utility marketing.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data – The Lucid Manager. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Searching for the optimal hyperparameters of an ARIMA model in parallel: the tidy gridsearch approach
(This article was first published on Econometrics and Free Software, and kindly contributed to Rbloggers)
IntroductionIn this blog post, I’ll use the data that I cleaned in a previous
blog post, which you can download
here. If you want to follow along,
download the monthly data.
In the previous blog post, I used the auto.arima() function to very quickly get a “goodenough”
model to predict future monthly total passengers flying from LuxAirport. “Goodenough” models can
be all you need in a lot of situations, but perhaps you’d like to have a better model. I will show
here how you can get a better model by searching through a grid of hyperparameters.
This blog post was partially inspired by: https://drsimonj.svbtle.com/gridsearchinthetidyverse
The problemSARIMA models have a lot of hyperparameters, 7 in total! Three trend hyperparameters, p, d, q,
same as for an ARIMA model, and four seasonal hyperparameters, P, D, Q, S. The traditional way t
o search for these hyperparameters is the socalled BoxJenkins method. You can read about it
here. This method was described
in a 1970 book, Time series analysis: Forecasting and control by Box and Jenkins. The method
requires that you first prepare the data by logging it and differencing it, in order to make the
time series stationary. You then need to analyze ACF and PACF plots, in order to determine the
right amount of lags… It take some time, but this method made sense in a time were computing
power was very expensive. Today, we can simply let our computer search through thousands of models,
check memes on the internet, and come back to the best fit. This blog post is for you, the busy
data scientist meme connoisseurs who cannot waste time with theory and other such useless time drains,
when there are literally thousands of new memes being created and shared every day. Every second counts.
To determine what model is best, I will do pseudo outofsample forecasting and compute the RMSE
for each model. I will then choose the model that has the lowest RMSE.
Let’s first load some libraries:
library(tidyverse) library(forecast) library(lubridate) library(furrr) library(tsibble) library(brotools) ihs < function(x){ log(x + sqrt(x**2 + 1)) }Now, let’s load the data:
avia_clean_monthly < read_csv("https://raw.githubusercontent.com/brodrigues/avia_par_lu/master/avia_clean_monthy.csv") ## Parsed with column specification: ## cols( ## destination = col_character(), ## date = col_date(format = ""), ## passengers = col_integer() ## )Let’s split the data into a training set and into a testing set:
avia_clean_train < avia_clean_monthly %>% select(date, passengers) %>% filter(year(date) < 2015) %>% group_by(date) %>% summarise(total_passengers = sum(passengers)) %>% pull(total_passengers) %>% ts(., frequency = 12, start = c(2005, 1)) avia_clean_test < avia_clean_monthly %>% select(date, passengers) %>% filter(year(date) >= 2015) %>% group_by(date) %>% summarise(total_passengers = sum(passengers)) %>% pull(total_passengers) %>% ts(., frequency = 12, start = c(2015, 1)) logged_train_data < ihs(avia_clean_train) logged_test_data < ihs(avia_clean_test)I also define a helper function:
to_tibble < function(forecast_object){ point_estimate < forecast_object$mean %>% as_tsibble() %>% rename(point_estimate = value, date = index) upper < forecast_object$upper %>% as_tsibble() %>% spread(key, value) %>% rename(date = index, upper80 = `80%`, upper95 = `95%`) lower < forecast_object$lower %>% as_tsibble() %>% spread(key, value) %>% rename(date = index, lower80 = `80%`, lower95 = `95%`) reduce(list(point_estimate, upper, lower), full_join) }This function takes a forecast object as argument, and returns a nice tibble. This will be useful
later, and is based on the code I already used in my previous
blog post.
Now, let’s take a closer look at the arima() function:
ARIMA Modelling of Time Series Description Fit an ARIMA model to a univariate time series. Usage arima(x, order = c(0L, 0L, 0L), seasonal = list(order = c(0L, 0L, 0L), period = NA), xreg = NULL, include.mean = TRUE, transform.pars = TRUE, fixed = NULL, init = NULL, method = c("CSSML", "ML", "CSS"), n.cond, SSinit = c("Gardner1980", "Rossignol2011"), optim.method = "BFGS", optim.control = list(), kappa = 1e6)The user is supposed to enter the hyperparameters as two lists, one called order for p, d, q
and one called seasonal for P, D, Q, S. So what we need is to define these lists:
I first start with order_list. This list has 3 elements, “p”, “d” and “q”. Each element is a
sequence from 0 to 3 (2 in the case of “d”). When I pass this list to purrr::cross() I get the
product set of the starting list, so in this case a list of 4*3*4 = 48 elements. However, this
list looks pretty bad:
I would like to have something like this instead:
[[1]] p d q 0 0 0 [[2]] p d q 1 0 0 [[3]] p d q 2 0 0This is possible with the last line, map(lift(c)). There’s a lot going on in this very small
line of code. First of all, there’s map(). map() iterates over lists, and applies a function,
in this case lift(c). purrr::lift() is a very interesting function that lifts the domain of
definition of a function from one type of input to another. The function whose input I am lifting
is c(). So now, c() can take a list instead of a vector. Compare the following:
So order_list is exactly what I wanted:
head(order_list) ## [[1]] ## p d q ## 0 0 0 ## ## [[2]] ## p d q ## 1 0 0 ## ## [[3]] ## p d q ## 2 0 0 ## ## [[4]] ## p d q ## 3 0 0 ## ## [[5]] ## p d q ## 0 1 0 ## ## [[6]] ## p d q ## 1 1 0I do the same for season_list:
season_list < list("P" = seq(0, 3), "D" = seq(0, 2), "Q" = seq(0, 3), "period" = 12) %>% cross() %>% map(lift(c))I now coerce these two lists of vectors to tibbles:
orderdf < tibble("order" = order_list) seasondf < tibble("season" = season_list)And I can now finally create the grid of hyperparameters:
hyper_parameters_df < crossing(orderdf, seasondf) nrows < nrow(hyper_parameters_df) head(hyper_parameters_df) ## # A tibble: 6 x 2 ## order season ## ## 1 ## 2 ## 3 ## 4 ## 5 ## 6The hyper_parameters_df data frame has 2304 rows, meaning, I will now estimate 2304
models, and will do so in parallel. Let’s just take a quick look at the internals of hyper_parameters_df:
So in the order column, the vector 0, 0, 0 is repeated as many times as there are combinations
of P, D, Q, S for season. Same for all the other vectors of the order column.
Because training these models might take some time, I will use the fantastic {furrr} package
by Davis Vaughan to train the arima() function in parallel.
For this, I first define 8 workers:
And then I run the code:
tic < Sys.time() models_df < hyper_parameters_df %>% mutate(models = future_map2(.x = order, .y = season, ~possibly(arima, otherwise = NULL)(x = logged_train_data, order = .x, seasonal = .y))) running_time < Sys.time()  ticI use future_map2(), which is just like map2() but running in parallel.
I add a new column to the data called models, which will contain the models trained over all the
different combinations of order and season. The models are trained on the logged_train_data.
Training the 2304 models took 18 minutes, which is
plenty of time to browse the latest memes, but still quick enough that it justifies the whole approach.
Let’s take a look at the models_df object:
As you can see, the models column contains all the trained models. The model on the first row,
was trained with the hyperparameters of row 1, and so on. But, our work is not over! We now need
to find the best model. First, I add a new column to the tibble, which contains the forecast. From
the forecast, I extract the point estimate:
You have to be familiar with a forecast object to understand the last line: a forecast object
is a list with certain elements, the point estimates, the confidence intervals, and so on. To get
the point estimates, I have to extract the “mean” element from the list. Hence the weird ~.$mean.
Then I need to add a new listcolumn, where each element is the vector of true values, meaning the data
from 2015 to 2018. Because I have to add it as a list of size 2304, I do that with purrr::rerun():
It is then easy to compute the RMSE, which I add as a column to the original data:
... %>% mutate(true_value = rerun(nrows, logged_test_data)) %>% mutate(rmse = map2_dbl(point_forecast, true_value, ~sqrt(mean((.x  .y) ** 2))))The whole workflow is here:
models_df < models_df %>% mutate(forecast = map(models, ~possibly(forecast, otherwise = NULL)(., h = 39))) %>% mutate(point_forecast = map(forecast, ~.$`mean`)) %>% mutate(true_value = rerun(nrows, logged_test_data)) %>% mutate(rmse = map2_dbl(point_forecast, true_value, ~sqrt(mean((.x  .y) ** 2))))This is how models_df looks now:
head(models_df) ## # A tibble: 6 x 7 ## order season models forecast point_forecast true_value rmse ## ## 1 0.525 ## 2 0.236 ## 3 0.235 ## 4 0.217 ## 5 0.190 ## 6 0.174Now, I can finally select the best performing model. I select the model with minimum RMSE:
best_model < models_df %>% filter(rmse == min(rmse, na.rm = TRUE))And save the forecast into a new variable, as a tibble, using my to_tibble() function:
(best_model_forecast < to_tibble(best_model$forecast[[1]])) ## Joining, by = "date" ## Joining, by = "date" ## # A tsibble: 39 x 6 [1M] ## date point_estimate upper80 upper95 lower80 lower95 ## ## 1 2015 Jan 11.9 12.1 12.1 11.8 11.7 ## 2 2015 Feb 11.9 12.0 12.1 11.7 11.6 ## 3 2015 Mar 12.1 12.3 12.3 11.9 11.9 ## 4 2015 Apr 12.2 12.3 12.4 12.0 11.9 ## 5 2015 May 12.2 12.4 12.5 12.1 12.0 ## 6 2015 Jun 12.3 12.4 12.5 12.1 12.0 ## 7 2015 Jul 12.2 12.3 12.4 12.0 11.9 ## 8 2015 Aug 12.3 12.5 12.6 12.2 12.1 ## 9 2015 Sep 12.3 12.5 12.6 12.2 12.1 ## 10 2015 Oct 12.2 12.4 12.5 12.1 12.0 ## # ... with 29 more rowsAnd now, I can plot it:
avia_clean_monthly %>% group_by(date) %>% summarise(total = sum(passengers)) %>% mutate(total_ihs = ihs(total)) %>% ggplot() + ggtitle("Logged data") + geom_line(aes(y = total_ihs, x = date), colour = "#82518c") + scale_x_date(date_breaks = "1 year", date_labels = "%m%Y") + geom_ribbon(data = best_model_forecast, aes(x = date, ymin = lower95, ymax = upper95), fill = "#666018", alpha = 0.2) + geom_line(data = best_model_forecast, aes(x = date, y = point_estimate), linetype = 2, colour = "#8e9d98") + theme_blog()Compared to the previous blog post, the
dotted line now seems to follow the true line even better!
Now, I am not saying that you should always do a gridsearch whenever you have a problem like this
one. In the case of univariate time series, I am still doubtful that a gridsearch like this is really
necessary. However, it makes for a good exercise and a good illustration of the power of {furrr}.
Hope you enjoyed! If you found this blog post useful, you might want to follow
me on twitter for blog post updates.
To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
More on Bias Corrected Standard Deviation Estimates
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
This note is just a quick followup to our last note on correcting the bias in estimated standard deviations for binomial experiments.
For normal deviates there is, of course, a well know scaling correction that returns an unbiased estimate for observed standard deviations.
It (from the same source):
… provides an example where imposing the requirement for unbiased estimation might be seen as just adding inconvenience, with no real benefit.
Let’s make a quick plot comparing the naive estimate of standard deviation (“forgetting to use n1 in the denominator”) and the Bessel corrected estimate (the squareroot of the Bessel corrected variance). It is well known that the naive estimate is biaseddown and underestimates both the variance and standard deviation. The Bessel correction deliberately inflates the variance estimate to get the expected value right (i.e., to remove the bias). However, as we can see in the following graph: for the standard deviation the correction is too much. The squareroot of the Bessel corrected variance is systematically an overestimate of the standard deviation.
We can show this graphically as follows.
The above graph is portraying, for different sample sizes (n), the ratio of the expected values of the various estimates to the true value of the standard deviation (for observations from an i.i.d. normal random source). So: an unbiased estimate would lie on the line y=1.
Notice the Bessel corrected is further away from the true value of the standard deviation than the naive estimate was (just biased in the opposite direction). So from the standarddeviation point of view the Bessel correction isn’t really better than the naive estimate.
All work is shared here.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
More Sandwiches, Anyone?
(This article was first published on Econometrics Beat: Dave Giles' Blog, and kindly contributed to Rbloggers)
Consider this my Good Deed for the Day!A retweet from a colleague whom I follow on Twitter brought an important paper to my attention. I thought I’d share it more widely.
The paper is titled, “Smallsample methods for clusterrobust variance estimation and hypothesis testing in fixed effect models”, by James Pustejovski (@jepusto) and Beth Tipton (@statstipton). It appears in The Journal of Business and Economic Statistics. You can tell right away, from its title, that this paper is going to be a mustread for empirical economists. And note the words, “Smallsample” in the title – that sounds interesting. Here’s a compilation of Beth’s six tweets:“Econ friends, @jepusto and I have a new paper out that we would love to share. It’s about clustering your standard errors (more below).
Any suggestions for how to get these methods out to economists given that we aren’t NBER?
Summary: Our paper provides smallsample adjustments to cluster robust variance estimation (CRVE). It can be used with panel data, experimental data, and regression. You can implement the method in a Stata macro called REG_SANDWICH and an R package called clubSandwich.
Why do you need this? Regular CRVE doesn’t do so well, even with as many as 100 clusters (!). In fact, CRVE only gives you appropriate Type I error when your covariates are balanced.
What did we do? We extended the biasrobust linearization method (BRL) by Bell & McCaffrey in three ways: (1) in addition to a ttest, there is now an Ftest; (2) We can handle the inclusion of fixed effects; (3) You get the same results whether you use FE or absorption.
How does it work? The adjustment inflates the standard errors a small bit. But more importantly, it provides Satterthwaitetype degrees of freedom that are more appropriate. The result is a test we call the ‘Approximate Hotelling’s Tsquared’ (AHT) test.
We’d love to share the work broadly, so if you have ideas, please let us know. Thanks!”
I’ve added the links to the R and Stata software in the quote above.There are also some nice R vignettes available:
 Clusterrobust variance estimation with clubSandwich
 Metaanalysis with clusterrobust variance estimation
 Clusterrobust standard errors and hypothesis tests in panel data models
© 2018, David E. Giles var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Econometrics Beat: Dave Giles' Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Use GitHub Vulnerability Alerts to Keep Users of Your R Packages Safe
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
Despite their now inherent evil status, GitHub has some tools other repository aggregators do not. One such tool is the free vulnerability alert service which will scan repositories for outdated+vulnerable dependencies.
Now, “R” is nowhere near a firstclass citizen in the internet writ large, including software development tooling (e.g. the TravisCI and GitLab continuous integration recipes are community maintained vs a firstclass/supported offering). This also means that GitHub’s service will never check for nor alert when a pure R package has security issues, mostly due to the fact that there’s only a teensy few of us who even bother to check packages for issues once in a while and there’s no real way to report said issues into the CVE process easily (though I guess I could given that my $DAYJOB is an official CVE issuer), so the integrity & safety of the R package ecosystem is still in the “trust me, everything’s !!” state. Given that, any extra way to keep even some R packages less insecure is great.
So, right now you’re thinking “you clickbaited us with a title that your lede just said isn’t possible…WTHeck?!.
It’s true that GitHub does not consider R a firstclass citizen, but it does support Java and:
available.packages() %>% dplyr::as_data_frame() %>% tidyr::separate_rows(Imports, sep=",[[:space:]]*") %>% # we really just tidyr::separate_rows(Depends, sep=",[[:space:]]*") %>% # need these two tidyr::separate_rows(Suggests, sep=",[[:space:]]*") %>% tidyr::separate_rows(Enhances, sep=",[[:space:]]*") %>% dplyr::select(Package, Imports, Depends) %>% filter( grepl("rJava", Imports)  grepl("rJava", "Depends")  grepl("Suggests", Imports)  grepl("Enhances", "Depends") ) %>% dplyr::distinct(Package) %>% dplyr::summarise(total_pkgs_using_rjava = n()) ## # A tibble: 1 x 1 ## total_pkgs_using_rjava ## ## 1 66according to there are 66 CRAN packages that require rJava, seven of which explicitly provide only JARs (a compressed directory tree of supporting Java classes). There are more CRANunpublished rJavabased projects on GitLab & GitHub, but it’s likely that publicfacing rJava packages that include or depend on public JARdependent projects still number less than ~200. Given the now >13K packages in CRAN, this is a tiny subset but with the sorry state of R security, anything is better than nothing.
Having said that, one big reason (IMO) for the lack of Javawrapped CRAN or “devtools”only released rJavadependent packages it that it’s 2018 and you still have better odds of winning a Vegasjackpot than you do getting rJava to work on your workstation in less than 4 tries and especially after an OS upgrade. That’s sad since there are many wonderful, solid and useful Java libraries that would be superhandy for many workflows yet most of us (I’m including myself) packagewriters prefer to spin wheels to get C++ or Rust libraries working with R than try to make it easier for regular R users to tap into that rich Java ecosystem.
But, I digress.
For the handful of us that do write and use rJavabased packages, we can better serve our userbase by deliberately putting those R+Java repos on GitHub. Now, I hear you. They’re evil and by doing this one of the most evil corporations on the planet can make money with your metadata (and, possibly just blatantly steal your code for use inproduct without credit) but I’ll give that up on a casebycase basis to make it easier to keep users safe.
Why will this enhance safety? Go take a look at one of my nonCRAN rJavabacked packages: pdfbox. It has this awesome “inyourface” security warning banner:
The vulnerability is CVE201811797 which is baseline computed to be “high severity” with a the following specific weakness: In Apache PDFBox 1.8.0 to 1.8.15 and 2.0.0RC1 to 2.0.11, a carefully crafted PDF file can trigger an extremely long running computation when parsing the page tree.. So, it’s a process denial of service vulnerability. You’ll also note I haven’t updated the JARs yet (mostly since it’s not a codeexecution vulnerability).
I knew about this 28 days ago (I’ve been incredibly busy and there’s alot of blather required to talk about it, hence the delay in blogging) thanks to the GitHub service and will resolve it when I get some free time over the Thanksgiving break. I received an alert for this, there are hooks for security alerts (so one can autocreate an issue) and there’s a warning for users and any of them could file an issue to let me know it’s superimportant to them that I get it fixed (or they could be superawesome and file a PR :).
FINThe TLDR is (first) a note — to package authors — who use rJava to bite the GitHub bullet and take advantage of this free service; and, (second) — to users — to encourage use of this service by authors of packages you use and to keep a watchful eye out for any security alerts for code you depend on to get things done.
A (perhaps) third and final note is for all of us to be to continually mindful about the safety & integrity of the R package ecosystem and do what we can to keep moving it forward.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
anytime 0.3.3
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A new minor cleanup release of the anytime package arrived on CRAN overnight. This is the fourteenth release, and follows the 0.3.2 release a good week ago.
anytime is a very focused package aiming to do just one thing really well: to convert anything in integer, numeric, character, factor, ordered, … format to either POSIXct or Date objects – and to do so without requiring a format string. See the anytime page, or the GitHub README.md for a few examples.
This release really adds the nice new vignette as a vignette—there was a gotcha in the 0.3.2 release—and updates some core documentation in the README.md to correctly show anydata() on input such as 20160101 (which was an improvement made starting with the 0.3.0 release).
Changes in anytime version 0.3.3 (20181113)
Vignette build quirkyness on Windows resolved so vignette reinstated.

Documentation updated showing correct use of anydate (and not anytime) on input like ‘2016010’ following the 0.3.0 release heuristic change.

Set #define for Boost to make compilation more quiet.
Courtesy of CRANberries, there is a comparison to the previous release. More information is on the anytime page.
For questions or comments use the issue tracker off the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Gazing into the Abyss of PHacking: HARKing vs. Optional Stopping
(This article was first published on R – Nicebread, and kindly contributed to Rbloggers)
by Angelika Stefan & Felix Schönbrodt
Almost all researchers have experienced the tingling feeling of suspense that arises right before they take a look at longawaited data: Will they support their favored hypothesis? Will they yield interesting or even groundbreaking results? In a perfect world (especially one without publication bias), the cause of this suspense should be nothing else but scientific curiosity. However, the world, and specifically the incentive system in science, is not perfect. A lot of pressure rests on researchers to produce statistically significant results. For many researchers, statistical significance is the cornerstone of their academic career, so nonsignificant results in an important study can not only question their scientific convictions but also crash their hopes of professional promotion. (Although, fortunately things are changing for the better).
Now, what does a researcher do confronted with messy, nonsignificant results? According to several muchcited studies (for example John et al., 2012; Simmons et al., 2011), a common reaction is to start sampling again (and again, and again, …) in the hope that a somewhat larger sample size can boost significance. Another reaction is to wildly conduct hypothesis tests on the existing sample until at least one of them becomes significant (see for example: Simmons et al., 2011; Kerr, 1998 ). These practices, along with some others, are commonly known as phacking, because they are designed to drag the famous pvalue right below the mark of .05 which usually indicates statistical significance. Undisputedly, phacking works (for a demonstration try out the phacker app). The two questions we want to answer in this blog post are: How does it work and why is that bad for science?
As many people may have heard, phacking works because it exploits a process called alpha error accumulation which is covered in most introductory statistics classes (but also easily forgotten again). Basically, alpha error accumulation means that as one conducts more and more hypothesis tests, the probability increases that one makes a wrong test decision at least once. Specifically, this wrong test decision is a false positive decision or alpha error, which means that you proclaim the existence of an effect although, in fact, there is none. Speaking in statistical terms, an alpha error occurs when the test yields a significant result although the null hypothesis (“There is no effect”) is true in the population. This means that phacking leads to the publication of an increased rate of false positive results, that is, studies that claim to have found an effect although, in fact, their result is just due to the randomness of the data. Such studies will never replicate.
At this point, the blog post could be over. PHacking exploits alpha error accumulation and fosters the publication of false positive results which is bad for science. However, we want to take a closer look at how bad it really is. In fact, some phacking techniques are worse than others (or, if you like the unscrupulous science villain perspective: some phacking techniques work better than others). As a showcase, we want to introduce two researchers: The HARKer takes existing data and conducts multiple independent hypothesis tests (based on multiple uncorrelated variables in the data set) with the goal to publish the ones that become significant. For example, the HARKer tests for each possible correlation in a large data set whether it differs significantly from zero. On the other hand, the Accumulator uses optional stopping. This means that he collects data for a single research question test in a sequential manner until either statistical significance or a maximum sample size is reached. For simplicity, we assume that they use neither other phacking techniques nor other questionable research practices.
The HARKer’s phacking strategyLet us start with the HARKer: Since the conducted hypothesis tests in our defined scenario are essentially independent, the situation can be seen as a problem of multiple testing. This means, it is comparatively easy to determine the exact probability that the HARKer will end up with at least one falsepositive result given a certain number of hypothesis tests. Assuming no effects in the population (for example, no correlation between the variables), one can picture the situation as a decision tree: At each branch level stands a hypothesis test which can either result in a nonsignificant result with 95% probability or in a (spurious) significant result with 5% probability, which is the level.
No matter how many hypothesis tests the HARKer conducts, there will only be one condition in the allnull scenario where no error occurs, that is, where all hypothesis tests yield nonsignificant results. The probability that this occurs can be calculated by , with being the number of conducted hypothesis tests. The probability that at least one of the hypothesis tests is significant is the probability of the complementary event, that is . For example, when the HARKer computes hypothesis with an alpha level of , the overall probability to obtain at least one false positive result is . Of course, the formula can be adjusted for other suggested alpha levels, such as or . We show this general formula in the Rcode chunk below.
[cc lang=”rsplus” escaped=”true”]
# The problem of HARKing
harker < function(x, alpha){1(1alpha)^x}
[/cc]
The Accumulator has a different tactic: Instead of conducting multiple hypothesis tests on different variables of one data set, he repeatedly conducts the same hypothesis test on the same variables in a growing sample. Starting with a minimum sample size, the Accumulator looks at the results of the analysis – if these are significant, data collection is stopped, if not, more data is collected until (finally) the results become significant, or a maximum sample size is reached. This strategy is also called Optional Stopping. Of course, the more often a researcher peeks at the data, the higher is the probability to obtain a false positive result at least once. However, this overall probability is not the same as the one obtained through HARKing. The reason is that the hypothesis tests are not independent in this case. Why is that? The same hypothesis test is repeatedly conducted on only slightly different data. In fact, the data that were used in in the first hypothesis test are used in every single of the subsequent hypothesis tests so that there is a spillover effect of the first test to every other hypothesis test in the set. Imagine, your initial sample contains an outlier: This outlier will affect the test results in any other test. With multiple testing, in contrast, the outlier will affect only the test in question but none of the other tests in the set.
So does this dependency make optional stopping more or less effective than HARKing? Of course, people have been wondering about this for quite a while. A paper by Armitage et al. (1969) demonstrates error accumulation in optional stopping for three different tests. We can replicate their results for the ztest with a small simulation (a more flexible simulation can be found at the end of the blog post): We start by drawing a large number of samples (iter) with the maximum sample size (n.max) from the null hypothesis. Then we conduct a sequential testing procedure on each of the samples, starting with a minimum sample size (n.min) and going up in several steps (step) up to the maximum sample size. The probability to obtain a significant result at least once up to a certain step can be estimated through the percentage of iterations that end up with at least one significant result at that point.
[cc lang=”rsplus” escaped=”true”]
# The Problem of optional stopping
accumulator < function(n.min, n.max, step, alpha=0.05, iter=10000){
# Determine places of peeks
peeks < seq(n.min, n.max, by=step)
# Initialize result matrix (nonsequential)
res < matrix(NA, ncol=length(peeks), nrow=iter)
colnames(res) < peeks
# Conduct sequential testing (always until n.max, with peeks at predetermined places)
for(i in 1:iter){
sample < rnorm(n.max, 0, 1)
res[i,] < sapply(peeks, FUN=function(x){sum(sample[1:x])/sqrt(x)})
}
# Create matrix: Which tests are significant?
signif < abs(res) >= qnorm(1alpha)
# Create matrix: Has there been at least one significant test in the trial?
seq.signif < matrix(NA, ncol=length(peeks), nrow=iter)
for(i in 1:iter){
for(j in 1:ncol(signif)){
seq.signif[i,j] < TRUE %in% signif[i, 1:j]
}
}
# Determine the sequential alpha error probability for the sequential tests
seq.alpha < apply(seq.signif, MARGIN = 2, function(x){sum(x)/iter})
# Return a list of individual test pvalues, sequential significance and sequential alpha error probability
return(list(seq.alpha = seq.alpha))
}
[/cc]
For example, the researcher conducts a twosided onesample ztest with an overall level of .05 in a sequential way. He starts with 10 observations, then always adds another 10 if the result is not significant, up to 100 observations at maximum. This means, he has 10 chances to peek at the data and end the data collection if the hypothesis test is significant. Using our simulation function, we can determine the probability to have obtained at least one false positive result at any of these steps:
[cc lang=”rsplus” escaped=”true”]
set.seed(1234567)
res.optstopp < accumulator(n.min=10, n.max=100, step=10, alpha=0.025, iter=10000)
print(res.optstopp[[1]])
[/cc]
[cc]
[1] 0.0492 0.0824 0.1075 0.1278 0.1431 0.1574 0.1701 0.1788 0.1873 0.1949
[/cc]
We can see that with one single evaluation, the false positive rate is at the nominal 5%. However, when more inbetween tests are calculated, the false positive rate rises to roughly 20% with ten peeks. This means that even if there is no effect at all in the population, the researcher would have stopped data collection with a signficant result in 20% of the cases.
A comparison of the HARKer’s and the Accumulator’s strategyLet’s compare the false positive rates of HARKing and optional stopping: Since the researcher in our example above conducts one to ten dependent hypothesis tests, we can compare this to a situation where a HARKer conducts one to ten independent hypothesis tests. The figure below shows the results of both phacking strategies:
[cc lang=”rsplus” escaped=”true”]
# HARKing False Positive Rates
HARKs < harker(1:10, alpha=0.05)
[/cc]
We can see that HARKing produces higher false positive rates than optional stopping with the same number of tests. This can be explained through the dependency on the first sample in the case of optional stopping: Given that the null hypothesis is true, this sample is not very likely to show extreme effects in any direction (however, there is a small probability that it does). Every extension of this sample has to “overcome” this property not only by being extreme in itself but also by being extreme enough to shift the test on the overall sample from nonsignificance to significance. In contrast, every sample in the multiple testing case only needs to be extreme in itself. Note, however, that false positive rates in optional stopping are not only dependent on the number of interim peeks, but also on the size of the initial sample and on the step size (how many observations are added between two peeks?). The difference between multiple testing and optional stopping which you see in the figure above is therefore only valid for this specific case. Going back to the two researchers from our example, we can say that the HARKer has a better chance to come up with significant results than the Accumulator, if both do the same number of hypothesis tests.
Practice HARKing and Optional Stopping yourselfYou can use the interactive phacker app to experience the efficiency of both phacking strategies yourself: You can increase the number of dependent variables and see whether one of them gets significant (HARing), or you can got to the “Now: phack!” tab and increase your sample until you obtain significance. Note that the DVs in the phacker app are not completely independent as in our example above, but rather correlate with r = .2, assuming that the DVs to some extent measure at least related constructs.
ConclusionTo conclude, we have shown how two phacking techniques work and why their application is bad for science. We found out that phacking techniques based on multiple testing typically end up with higher rates of false positive results than phacking techniques based on optional stopping, if we assume the same number of hypothesis tests. We want to stress that this does not mean that naive optional stopping is okay (or even okayish) in frequentist statistics, even if it does have a certain appeal. For those who want to do guiltfree optional stopping, there are ways to control for the error accumulation in the frequentist framework (see for example Wald, 1945, Chow & Chang, 2008, Lakens, 2014) and sequential Bayesian hypothesis tests (see for example our paper on sequential hypothesis testing with Bayes factors or Rouder, 2014).
Alternative Simulation Code (also including onesided tests and ttests)[cc lang=”rsplus” escaped=”true”]
sim.optstopping < function(n.min, n.max, step, alpha=0.05, test=”z.test”, alternative=”two.sided”, iter=10000){
match.arg(test, choices=c(“t.test”, “z.test”))
match.arg(alternative, choices=c(“two.sided”, “directional”))
# Determine places of peek
peeks < seq(n.min, n.max, by=step)
# Initialize result matrix (nonsequential)
res < matrix(NA, ncol=length(peeks), nrow=iter)
colnames(res) < peeks
# Conduct sequential testing (always until n.max, with peeks at predetermined places)
for(i in 1:iter){
sample < rnorm(n.max, 0, 1)
if(test==”t.test”){res[i,] < sapply(peeks, FUN=function(x){mean(sample[1:x])/sd(sample[1:x])*sqrt(x)})}
if(test==”z.test”){res[i,] < sapply(peeks, FUN=function(x){sum(sample[1:x])/sqrt(x)})}
}
# Create matrix: Which tests are significant?
if(test==”z.test”){
ifelse(alternative==”two.sided”, signif < abs(res) >= qnorm(1alpha), signif < res <= qnorm(alpha))
}else if (test==”t.test”){
n < matrix(rep(peeks, iter), nrow=iter, byrow=T)
ifelse(alternative==”two.sided”, signif < abs(res) >= qt(1alpha, df=n1), signif < res <= qt(alpha, df=n1))
}
# Create matrix: Has there been at least one significant test in the trial?
seq.signif < matrix(NA, ncol=length(peeks), nrow=iter)
for(i in 1:iter){
for(j in 1:ncol(signif)){
seq.signif[i,j] < TRUE %in% signif[i, 1:j]
}
}
# Determine the sequential alpha error probability for the sequential tests
seq.alpha < apply(seq.signif, MARGIN = 2, function(x){sum(x)/iter})
# Return a list of individual test pvalues, sequential significance and sequential alpha error probability
return(list(p.values = res,
seq.significance = seq.signif,
seq.alpha = seq.alpha))
}
[/cc]
The post Gazing into the Abyss of PHacking: HARKing vs. Optional Stopping appeared first on Nicebread.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Nicebread. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Windows Clipboard Access with R
(This article was first published on R – RBAR, and kindly contributed to Rbloggers)
The windows clipboard is a quick way to get data in and out of R. How can we exploit this feature to accomplish our basic data exploration needs and when might its use be inappropriate? Read on.
We won't be using it in this post, but you can see the contents of the Windows clipboard in R using the readClipboard() command. Going through the documentation you'll note a variety of formats that can be read. For our purposes, we are looking at moving text based data into R from HTML tables or spreadsheets that happen to be open on our desktop.
HTML TablesHTML tables are easy to copy into R via the windows clipboard if you're using Chrome or Firefox. To demonstrate, copy the simple HTML table shown below by highlighting the text and pressing CTRL + C on your keyboard (or whatever copy method works best for you.)
Items Inventory Apple 5 Kiwi 7 Meat Balls 253Next, grab the windows clipboard content and “paste” it into an R data frame using the following Rcode:
1 2 3 4 My_HTML_Data < read.table(file = "clipboard", header = T, sep = "\t")Notice that the file argument of the read.table command is calling out to the windows “clipboard” (line 2).
Note 1: I've gotten this to work with Mozilla Firefox and Google Chrome but Not Microsoft Edge or Internet Explorer.
Note 2: You copied all the data, but you might have gotten a warning that looks like this:
Warning message:
In read.table(file = “clipboard”, header = T, sep = “\t”) :
incomplete final line found by readTableHeader on ‘clipboard’
There seems to be a hidden character at the end of the HTML table. If you miss it, you get the error mentioned above. Digging deeper into the warning with the readClipboard command, you get something that looks like this:
1 2 readClipboard() ## [1] "Items \tInventory" "Apple \t5" "Kiwi \t7" "Meat Balls \t253 "If you get the hidden character, you don't get the error. Running the readClipboard command show there is an extra empty item in the list (see line 3):
1 2 3 readClipboard() ##[1] "Items \tInventory" "Apple \t5" "Kiwi \t7" "Meat Balls \t253" ##[5] ""Either way, all the data ends up in your target data frame.
SpreadsheetsI can think of numerous times when I've had multiple small tables in the same spreadsheet as shown below.
In these example tables, we see data related to notable serial killers. The left table provides the Name and Sex of a given serial killer. The right table provides their year of Birth and Death as well as the number of victims. Both tables are in the same spreadsheet.
Often, I want to merge or perform a quick analysis on tables like these in R but saving each table to a csv first is a nuisance. To get around this problem, we can leverage the Window's clipboard using the following code:
1 2 3 4 5 6 7 8 9 10 11 #Copy the Left table from Spread Sheet Left_Table < read.table(file = "clipboard", header = T, sep = "\t") #Copy the Right table from Spread Sheet Right_Table < read.table(file = "clipboard", header = T, sep = "\t")Note: Between each read.table command, you will need to copy the appropriate data from the target spreadsheet to the clipboard. Once you have the tables in R, you can use the write.csv command to write them to disk.
Copy Data from R to ClipboardIn the previous section, we copied two tables containing information about serial killers into R. In this section, we are going to merge those tables together and copy the result back to our spreadsheet. The code to merge and copy to clipboard is shown below:
1 2 3 4 Merged_TBLs < merge(Left_Table, Right_Table, by="KEY", all.x = T) write.table(Merged_TBLs, file="clipboard", row.names = F)Here is a screenshot showing the merged table pasted back into our spreadsheet software:
Note: after pasting the table back into the spreadsheet, a texttocolumn process was required to put the data back into the individual columns.
Summary and Usage NotesCopying and pasting data to and from the Window's clipboard to R is quick and easy. All we need to do is use the standard read.table and write.table syntax and set the file argument equal to “clipboard”. Despite the ease of moving data around in this manner, we are straying away from reproducible data processing. For example, it would be hard for me to reproduce your result if your R code referenced the clipboard. With this in mind, get your source data into a csv file if your quick glace at HTML or spreadsheet data becomes more than a 30 minute affair or you want to share you analysis. Finally, if your running a OS X or X11 based windowing systems check out the clipr package.
Other Articles at rbar.net: XmR Plots with ggQC
The post Windows Clipboard Access with R appeared first on RBAR.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – RBAR. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A Mathematician’s Perspective on Topological Data Analysis and R
(This article was first published on R Views, and kindly contributed to Rbloggers)
A few years ago, when I first became aware of Topological Data Analysis (TDA), I was really excited by the possibility that the elegant theorems of Algebraic Topology could provide some new insights into the practical problems of data analysis. But time has passed, and the sober assessment of Larry Wasserman seems to describe where things stand.
TDA is an exciting area and is full of interesting ideas. But so far, it has had little impact on data analysis.
Nevertheless, TDA researchers have been quietly working the problem and at least some of them are using R (see below). Since I first read Professor Wasserman’s paper, I have been very keen on getting the perspective of a TDA researcher. So, I am delighted to present the following interview with Noah Giansiracusa, Algebraic Geometer, TDA researcher and coauthor of a recent JCGS paper on a new visualization tool for persistent homology.
Q: Hello Dr. Giansiracusa. Thank you for making time for us at R Views. How did you get interested in TDA?
While doing a postdoc in pure mathematics (algebraic geometry, specifically) I, probably like many people, could not escape a feeling that crept up from time to time—particularly during the more challenging moments of research frustration—that perhaps the efforts I was putting into proving abstract theorems might have been better spent working in a more practical, applied realm of mathematics. However, pragmatic considerations made me apprehensive at that point in my career to take a sudden departure, for I finally felt like I was gaining some momentum in algebraic geometry, developing a nice network of supportive colleagues, etc., and also that I would very soon be on the tenure track job market and I knew that if I was hired (a big “if”!) it would be for a subject I had actually published in, not one I was merely curious about or had recently moved into. But around this same time I kept hearing about an exciting but possibly overhyped topic called topological data analysis: TDA. It really seemed to be in the air at the time (this was about five years ago).
Q: Why do you think TDA took off?
I can only speak for myself, but I think there were two big reasons that TDA generated so much buzz among mathematicians at the early stages.
First, it was then, and still is impossible to escape the media blitz on “big data” and the “data revolution” and related sentiments. This is felt strongly within academic circles (our deans would love us all to be working in data it seems!) but also in the mainstream press. Yet, I think pure mathematicians often felt somewhat on the periphery of this revolution: we knew that the modern developments in data and deep learning and artificial intelligence would not be possible without the rigorous foundations our mathematical ancestors had laid, but we also knew that most of the theorems we are currently proving would in all likelihood play absolutely zero role in any of the contemporary story. TDA provided a hope for relevance, that in the end the pure mathematician would come out of the shadows of obscurity and strike a data science victory proving our ineluctable relevance and superiority in all things technical—and this hope quickly turned to hype.
I think I and many other pure mathematicians were rooting for TDA, to show the world that our work has value. We tired of telling stories of how mathematicians invented differential geometry before Einstein used it in relativity and your GPS would not be possible without this. We needed a more fresh, decisive victory in the intellectual landscape; number theory used in cryptography is great, but still too specialized: TDA had the promise of bringing us into the (big) data revolution. And so we hoped, and we hyped.
And second, from a very practical perspective, I simply did not have time to retrain myself in applied math, the usual form of applied math based heavily on differential equations, modeling, numerical analysis, etc. But TDA seemed to offer a chance to gently transition to data science mathematical relevance—instead of starting from scratch, pure mathematicians such as myself would simply need to add one more chapter to our background in topics like algebraic topology and then we’d be ready to go and could brand ourselves as useful! And if academia didn’t work out, Google would surely rather open the doors of employment to a TDA expert than to a traditional algebraic topologist (or algebraic geometer, in my case).
I think these are two of the main things that brought TDA so much early attention before it really had many realworld successes under its belt, and they are certainly what brought me to it; well, that and also an innocent curiosity to understand what TDA really is, how it works, and whether or not it does what it claims.
Q: So how did you get started?
I first dipped my toes in the TDA waters by agreeing to do a reading course with a couple undergraduates interested in the topic; then I led an undergraduate/master’s level course where we studied the basics of persistent homology, downloaded some data sets, and played around. We chose to use R for that since there are many data sets readily available, and also because we wanted to do some simple experiments like sampling points from nice objects like a sphere but then adding noise, so we knew we wanted to have a lot of statistical functions available to us and R had that plus TDA packages already. While doing this I grew to quite like R and so have stuck with it ever since. In fact, I’m now using it also on a (nonTDA) project to analyze Supreme Court voting patterns from a computational geometry perspective.
Q: Do you think TDA might become a practical tool for statisticians?
First of all, I think this is absolutely the correct way to phrase the question! A few years ago TDA seemed to have almost an adversarial nature to it, that topologists were going to do what data scientists were doing but better because fancy math and smart people were involved. So the question at the time seemed to be whether TDA would supplant other forms of data science, and this was a very unfortunate way to view things.
But, it was easy to entirely discredit TDA by saying that it makes no practical difference whether your data has the homology of a Klein bottle, or there were no realworld examples where TDA had outperformed machine learning. This type of dialogue was missing the point. As your question suggests, TDA should be viewed as a tool to be added to the quiver of data science arrows, rather than an entirely new weapon.
In fact, while this clarification moves the dialogue in a healthy direction (TDA and machine learning should work together, rather than compete with each other!) I think there’s still one further step we should take here: TDA is not really a welldefined entity. For instance, when I see topics like random decision forests, it looks very much like topology to me! (Any graph, of which a tree is an example, is a 1dimensional simplicial complex, and actually if you look under the hood, the standard Rips complex approach in TDA builds its higher dimensional simplicial complexes out of a 1dimensional simplicial complex, so both random forests and TDA—and most of network theory—are really rooted in the common world of graph theory.)
Another example: the 0dimensional barcode for the Rips flavor of TDA encodes essentially the same information as hierarchical clustering. All I’m saying here is that there’s more traditional data science in TDA than one might first imagine, and there’s more topology in traditional data science than one might realize. I think this is healthy, to recognize connections like these—it helps one see a continuum of intellectual development here rather than a discrete jump from ordinary data science to fancy topological data science.
That’s a longwinded way of saying that you phrased the question well. The (less longwinded) answer to the question is: Yes! Once one sees TDA as one more tool for extracting structure and statistics from data, it is much easier to imagine it being absorbed into the mainstream. It need not outperform all previous methods or revolutionize data science, it merely needs to be, exactly as you worded it, a practical tool. Data science is replete with tools that apply in some settings and not others, work better with some data than others, reveal relevant information sometimes more than others, and TDA (whatever it is exactly) fits right into this. There certainly will be some branches of TDA that gain more traction over the years than others, but I am absolutely convinced that at least some of the methods used in TDA will be absorbed into statistical learning just as things like random decision trees and SVMs (both of which have very topological/geometric flavors to them!) have. This does not mean that every statistician needs to learn TDA, just as not every statistician needs to learn all the latest methods in deep learning.
Q: Where do you think TDA has had the most success?
Over the past few years I think the biggest strides TDA has made have been in terms of better interweaving it with other methods and disciplines—so big topics with lots of progress but still room for more have included confidence intervals, distributions of barcodes, feature selection and kernel methods in persistent homology. These are all exciting topics and healthy for the longterm development of TDA.
I think, perhaps controversially, the next step might actually be to rid ourselves of the label TDA. For one thing, TDA is very geometric and not just topological (which is to say: distances matter!). But the bigger issue for me is that we should refer to the actual tools being used (mapper, persistent homology in its various flavors, etc.) rather than lump them arbitrarily together under this common label. It took many years for statisticians to jump on the machine learning bandwagon, and part of what prevented them from doing so sooner was language; the field of statistical learning essentially translates machine learning into more familiar statistical terminology and reveals that it is just another branch of the same discipline. I suspect something similar will take place with TDA… er, I should say, with these recent topological and geometric data analysis tools.
Q: Do you think focusing on the kinds of concrete problems faced when trying to apply topological and algebraic ideas to data analysis will turn out to be a productive means of motivating mathematical research?
Yes, absolutely—and this is also a great question and a healthy way to look at things! Pure mathematicians have no particular agenda or preconceived notion of what they should and should not be studying: pure mathematics, broadly speaking, is logical exploration and development of structure and symmetry. The more intricate a structure appears to be, and the more connected to other structures we have studied, the more interested we tend to be in it. But that really is pretty much all we need to be interested—and to be happy.
So TDA provides a whole range of new questions we can ask, and new structures we can uncover, and inevitably many of these will tie back to earlier areas of pure mathematics in fascinating ways—all the while, throughout these explorations pure mathematicians likely will end up laying foundations that help provide a stable scaffolding for the adventurous data practitioners who jump into methodology before the full mathematical landscape has been revealed. So TDA absolutely will lead to new, important mathematical research: important both because we’ll uncover beautiful structures and connections, and important also because it will provide some certainty to the applied methods build on top of this.
Q: More specifically, what role might the R language play in facilitating the practice or teaching of mathematics?
Broadly speaking, I think many teachers—especially in pure mathematics—undervalue the importance of computer programming skills, though this is starting to change as pure mathematicians increasingly enjoy experimentation as a way of exploring small examples, honing intuition, and finding evidence for conjectures. While the idea of theoremproof mathematics is certainly the staple of our discipline, it’s not the only way to understand mathematical structure. In fact, students often find mathematical material resonates with them much more strongly if they uncover a pattern by experimenting on a computer rather than just being fed it through lecture or textbooks. Concretely, if students play with something like the distribution of prime numbers, they might get excited to see the fascinating interplay between randomness and structure that emerges, and that can better prepare them to appreciate formally learning the prime number theorem in a classroom. So as things like TDA emerge, the number of pure mathematics topics that can be explored on a computer increases, and I think that’s a great thing.
Q: Where does R fit in?
Well, much of the mathematical exploration I’m referring to here is symbolic—so very precise and algebraic flavor—and R certainly has no limitations working precisely, but it’s not the main goal of the language so one likely would use a computer algebra system instead. But, one exciting thing TDA does help us see is that there’s a marvelous interface between the symbolic and numerical worlds (here represented by the topology and the statistics, respectively) and I think this is great for both teaching and research. The more common ground we find between topics that previously seemed quite disparate, the more chance we have of building meaningful interdisciplinary collaborations, the more perspectives we can provide our students to motivate and study something, and the more we see unity within mathematics. My favorite manifestation of this is that TDA is the study of the topology of discrete spaces—but discrete spaces have no nontrivial topology! What’s really going on then is that data gives us a discrete glimpse into a continuous, highly structured world, and TDA aims to restore the geometric structure lost due to sampling. In doing so one cannot, and should not, avoid statistics, so pure mathematics is brought meaningfully in contact with statistics and I absolutely love that. This means the R language finds a role in pure math where it may previously not have: topology with noise, algebra with uncertainty.
Thank you again! I think your ideas are going to inspire some R Views readers.
Editors note: here are some R packages for doing TDA:
* TDA contains tools for the statistical analysis of persistent homology and for density clustering.
* TDAmapper enables TDA using Discrete Morse Theory.
* TDAstats offers a tool set for TDA including for calculating persistent homology in a VietorisRips complex.
* pterrace builds on TDA and offers a new multiscale and parameter free summary plot for studying topological signals.
_____='https://rviews.rstudio.com/2018/11/14/amathematiciansperspectiveontopologicaldataanalysisandr/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Rdew Valley: Optimizing Farming with R
(This article was first published on schochastics, and kindly contributed to Rbloggers)
I recently picked up a copy of my favorite game Stardew Valley again.
If you don’t know the game, I can highly recommend it! You inherit a pixel farm and
you are in charge of everything. Crops, animals, fishing, mining and never forget to socialize.
My plan was to shut off work for at least a few hours while playing. But at one point
you inevitably start optimizing your farm. In most cases, the layout of crops.
Aaaaaaaand that’s how you turn a farming game into an optimization problem you try to solve with R.
If you start a standard farm, this is what you can work with.
The first step is to get the layout into R. Luckily, I found a
reddit post,
where someone posted the layout as an excel file. I converted the file into a matrix and
a long data frame. The data frame is for plotting and the matrix for the actual optimization.
The green tiles are the ones we can use for farming. In theory, we could plant
3414 crops there. But the tiles are totally unprotected against the nasty
crows that would eat all your precious crops. We need scarecrows to protect them.
When you plant your crops, you should make sure that your crops fall into the range of
a placed scarecrow. No crow will ever touch anything there.
A single scarecrow can protect 248 tiles. And now we have our first optimization problem:
Where should we place scarecrows on
the farm to maximize the protected area?
Now of course we could just blindly put them on the map until everything is covered.
But this would a) reduce the number of tiles we can use for growing crops and b)
result in a lot of unnecessary overlap between scarecrow ranges. So the goal is to maximize the
covered area and minimize the overlap of scarecrows. This is not an easy task though.
If you want to place one crow somewhere on the farm, you have 3414 possibilities
to do so. For two crows 5,825,991 and for three 6,626,093,764. So it is definitely not a viable option to try out all
possible placements, especially when the number of crows increases. We will not be able
to get the optimal solution, but we can try to come as close as possible.
There are certainly several heuristics we could use
but I decided to apply simulated annealing.
The general idea is very simple. In each iteration, try out a new solution. If it is
better then the old one, keep it. If it is worse, accept it with a certain probability.
This second step is very important, since it allows us to get out of local maxima.
The probability to accept worse solution decreases over time, since we assume that we
come closer to the “true” maximum. New solutions are of course not chosen randomly, but constructed from
previous solutions. In our case, I decided to implement four potential alterations of the current solution:
 add a scarecrow on a random unoccupied tile
 delete a random scarecrow
 teleport a scarecrow to a random unoccupied tile
 move a scarecrow to random neighboring tile
This is the theory, now to the code.
R codeThe code of the helper functions (add/delete/move/teleport scarecrows) can be found at the end of this post.
We can use the init_farm() function to get an initial scarecrow layout
farm < farm_max < init_farm(farm_blank,no = 10) opt_val < opt_val_max < opt_func(farm)This initial solution covers 820 tiles. Now onto the optimization.
#initialize variables temp < 100 max_crow < 35 min_crow < 3 steps < c("a","d","m") #annealing while(temp>=0.1){ for(i in 1:1000){ ncrow < nrow(locate_crows(farm,farm_blank)) farm_old < farm t < sample(steps,1) if(t=="a"){ if(ncrowmin_crow){ farm < del_crow(farm,farm_blank) } else{ farm < add_crow(farm) } } else if(t=="m"){ p < exp(temp100) if(runif(1)<=p){ farm < teleport_crow(farm,farm_blank) } else{ farm < move_crow(farm,farm_blank) } } #evaluate farm_val < opt_func(farm) if(farm_val>opt_val){ opt_val < farm_val if(opt_val>opt_val_max){ opt_val_max < opt_val farm_max < farm print(c(temp,opt_val_max)) } } else{ p < exp((farm_valopt_val)/temp) if(runif(1)<=p){ opt_val < farm_val } else{ farm < farm_old } } } temp < temp*0.99 }Note that this will run for a while (R and loops, you know). If you want to optimize the runtime,
implementing it in C++ with Rcpp might do the trick.
Below is the best layout found during the optimization.
crows < as_tibble(locate_crows(farm_max,farm_blank)) idx < which(farm_max>=1,arr.ind = T) crow_area < tibble(x=idx[,1],y=idx[,2],val=farm_max[which(farm_max>=1)]) ggplot(map)+geom_tile(aes(x,y,fill=val))+ scale_fill_manual(values=c("c"="#458B00", "w"="#1874CD","h"="#8B0000","b"="#B0B0B0","e"="#CD2626","x"="black"))+ geom_tile(aes(x=x,y=y),fill=c("#104E8B"),alpha=0.25,data=crow_area[crow_area$val==1,])+ geom_tile(aes(x=x,y=y),fill=c("#104E8B"),alpha=0.35,data=crow_area[crow_area$val>1,])+ geom_point(aes(x=row,y=col),data=crows,col="#CD9B1D")+ theme(legend.position = "none", axis.text = element_blank(), axis.ticks = element_blank(), axis.title = element_blank(), axis.line = element_blank(), panel.background = element_blank(), panel.grid = element_blank())The number of placed scarecrows is 18. 3119 tiles
are uniquely protected and 88 are protected by more than one scarecrow.
That leaves 207 tiles outside of the crow ranges.
So we can now safely plant 3207 crops without crows attacking them.
But wait, there is more! We also need to water them…
I am sure nobody wants to water those crops manually, so we need to place sprinkler.
To optimize their placement, we do exactly the same as for scarecrows. Our optimization
goal is to maximize the number of protected and watered tiles.
This is the final layout. The dark blue tiles are the ones that are watered and protected by a scarecrow.
light blue tiles are only watered, dark green tiles only protected and light green ones are neither.
This layout gives us 3144 watered tiles with 180 sprinkler and
2998 of those are protected by a scarecrow.
If you run this code by yourself, you will notice that you might get different (even better!)
results. Simulated annealing is not deterministic, hence it produces different nearly
optimal solutions in each run.
To leave a comment for the author, please follow the link and comment on their blog: schochastics. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...