Forecasting in NYC: 2527 June 2018
(This article was first published on R on Rob J Hyndman, and kindly contributed to Rbloggers)
In late June, I will be in New York to teach my 3day workshop on Forecasting using R. Tickets are available at Eventbrite.
This is the first time I’ve taught this workshop in the US, having previously run it in the Netherlands and Australia. It will be based on the 2nd edition of my book “Forecasting: Principles and Practice” with George Athanasopoulos. All participants will get a print version of the book.
To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Mapping earthquakes off the Irish Coast using the leaflet package in R
(This article was first published on Environmental Science and Data Analytics, and kindly contributed to Rbloggers)
Ireland is a typically stable region from a seismic activity perspective as it is distant from major plate boundaries where subduction and seafloor spreading occur. However, in reading the following article, I was surprised to discover that earthquakes occur quite frequently both off the Irish coast and within the country itself. Most of these events occur as a result of tension and pressure release in the crustal rock.
The Irish National Seismic Network (INSN) maintains a dataset on 144 local earthquakes dating back to February 1980. The variables in the dataset are event date, event time, latitude, longitude, magnitude, region and subjective intensity (with levels such as “felt”, “feltIII”, “feltIV”).
Distribution of local earthquake magnitudesThe median magnitude recorded since 1980 is 1.5 on the Richter Scale. The largest intensity recorded was magnitude 5.4 on the 19th July 1984 in the Lleyn Peninsula region of Wales. This event occurred near the village of Llanaelhaearn and is the location of the biggest earthquake in the past 50 years.
Average waiting time between eventsHow frequently do seismic events occur around Ireland and its coast? It turns out that the median time between events is just 25 days! The greatest interval (1,335 days) is between an event recorded in New Ross, Co. Waterford on the 19th April 2002 and a magnitude 2.8 in the Irish Sea on the 14th December 2005. The distribution of waiting times is plotted below with outliers included and removed.
Geospatial mapping of earthquake distributionThe leaflet package for R was used to map the data. This wonderful package allows one to create excellent interactive data visualisations. The map below is a static .png with no interactivity but it shows the distribution well.
With the interactive version, each data point can be investigated by clicking on it to bring up a box containing additional information about the event. Zooming in and out is also possible with the interactive version as is changing the base map layer and other aesthetics.
Packages usedH. Wickham. ggplot2: Elegant Graphics for Data Analysis. SpringerVerlag New York, 2009.
Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller (2017). dplyr: A Grammar of Data Manipulation. R package version 0.7.4. https://CRAN.Rproject.org/package=dplyr
Hadley Wickham and Jennifer Bryan (2017). readxl: Read Excel Files. R package version
1.0.0. https://CRAN.Rproject.org/package=readxl
Joe Cheng, Bhaskar Karambelkar and Yihui Xie (2017). leaflet: Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library. R package version 1.1.0. https://CRAN.Rproject.org/package=leaflet
Ramnath Vaidyanathan, Yihui Xie, JJ Allaire, Joe Cheng and Kenton Russell (2018).
htmlwidgets: HTML Widgets for R. R package version 1.0.
https://CRAN.Rproject.org/package=htmlwidgets
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Environmental Science and Data Analytics. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Predatory Journals and R
(This article was first published on Marcelo S. Perlin, and kindly contributed to Rbloggers)
–
My paper about the penetration of predatory journals in Brazil, Is predatory publishing a real threat? Evidence from a large database study, just got published in Scientometrics!. The working paper version is available in SSRN.
This is a nice example of a dataintensive scientific work cycle, from gathering data to reporting results. Everything was done in R, using web scrapping algorithms, parallel processing, tidyverse packages and more. This was a special project for me, given its implications in science making in Brazil. It took me nearly one year to produce and execute the whole code. It is also a nice case of the capabilities of package ggplot2 in producing publicationready figures. As a side output, our database of predatory journals is available as a shiny app.
More details about the study itself is available in the paper. Our abstract is as follows:
Using a database of potential, possible, or probable predatory scholarly openaccess journals, the objective of this research is to study the penetration of predatory publications in the Brazilian academic system and the profile of authors in a crosssection empirical study. Based on a massive amount of publications from Brazilian researchers of all disciplines during the 2000–2015 period, we were able to analyze the extent of predatory publications using an econometric modeling. Descriptive statistics indicate that predatory publications represent a small overall proportion, but grew exponentially in the last 5 years. Departing from prior studies, our analysis shows that experienced researchers with a high number of nonindexed publications and PhD obtained locally are more likely to publish in predatory journals. Further analysis shows that once a journal regarded as predatory is listed in the local ranking system, the Qualis, it starts to receive more publications than nonpredatory ones.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Marcelo S. Perlin. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Sketchnotes from TWiML&AI #121: Reproducibility and the Philosophy of Data with Clare Gollnick
(This article was first published on Shirin's playgRound, and kindly contributed to Rbloggers)
These are my sketchnotes for Sam Charrington’s podcast This Week in Machine Learning and AI about Reproducibility and the Philosophy of Data with Clare Gollnick:
You can listen to the podcast here.
In this episode, i’m joined by Clare Gollnick, CTO of Terbium Labs, to discuss her thoughts on the “reproducibility crisis” currently haunting the scientific landscape. For a little background, a “Nature” survey in 2016 showed that more than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments. Clare gives us her take on the situation, and how it applies to data science, along with some great nuggets about the philosophy of data and a few interesting use cases as well. We also cover her thoughts on Bayesian vs Frequentist techniques and while we’re at it, the Vim vs Emacs debate. No, actually I’m just kidding on that last one. But this was indeed a very fun conversation that I think you’ll enjoy! https://twimlai.com/twimltalk121reproducibilityphilosophydataclaregollnick/
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Update: Can we predict flu outcome with Machine Learning in R?
(This article was first published on Shirin's playgRound, and kindly contributed to Rbloggers)
Since I migrated my blog from Github Pages to blogdown and Netlify, I wanted to start migrating (most of) my old posts too – and use that opportunity to update them and make sure the code still works.
Here I am updating my very first machine learning post from 27 Nov 2016: Can we predict flu deaths with Machine Learning and R?. Changes are marked as bold comments.
The main changes I made are:
 using the tidyverse more consistently throughout the analysis
 focusing on comparing multiple imputations from the mice package, rather than comparing different algorithms
 using purrr, map(), nest() and unnest() to model and predict the machine learning algorithm over the different imputed datasets
Among the many nice R packages containing data collections is the outbreaks package. It contains a dataset on epidemics and among them is data from the 2013 outbreak of influenza A H7N9 in China as analysed by Kucharski et al. (2014):
A. Kucharski, H. Mills, A. Pinsent, C. Fraser, M. Van Kerkhove, C. A. Donnelly, and S. Riley. 2014. Distinguishing between reservoir exposure and humantohuman transmission for emerging pathogens using case onset data. PLOS Currents Outbreaks. Mar 7, edition 1. doi: 10.1371/currents.outbreaks.e1473d9bfc99d080ca242139a06c455f.
A. Kucharski, H. Mills, A. Pinsent, C. Fraser, M. Van Kerkhove, C. A. Donnelly, and S. Riley. 2014. Data from: Distinguishing between reservoir exposure and humantohuman transmission for emerging pathogens using case onset data. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.2g43n.
I will be using their data as an example to show how to use Machine Learning algorithms for predicting disease outcome.
library(outbreaks) library(tidyverse) library(plyr) library(mice) library(caret) library(purrr) The dataThe dataset contains case ID, date of onset, date of hospitalization, date of outcome, gender, age, province and of course outcome: Death or Recovery.
PreprocessingChange: variable names (i.e. column names) have been renamed, dots have been replaced with underscores, letters are all lower case now.
Change: I am using the tidyverse notation more consistently.
First, I’m doing some preprocessing, including:
 renaming missing data as NA
 adding an ID column
 setting column types
 gathering date columns
 changing factor names of dates (to make them look nicer in plots) and of province (to combine provinces with few cases)
I’m also
 adding a third gender level for unknown gender
For plotting, I am defining a custom ggplot2 theme:
my_theme < function(base_size = 12, base_family = "sans"){ theme_minimal(base_size = base_size, base_family = base_family) + theme( axis.text = element_text(size = 12), axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5), axis.title = element_text(size = 14), panel.grid.major = element_line(color = "grey"), panel.grid.minor = element_blank(), panel.background = element_rect(fill = "aliceblue"), strip.background = element_rect(fill = "lightgrey", color = "grey", size = 1), strip.text = element_text(face = "bold", size = 12, color = "black"), legend.position = "bottom", legend.justification = "top", legend.box = "horizontal", legend.box.background = element_rect(colour = "grey50"), legend.background = element_blank(), panel.border = element_rect(color = "grey", fill = NA, size = 0.5) ) }And use that theme to visualize the data:
ggplot(data = fluH7N9_china_2013_gather, aes(x = Date, y = age, fill = outcome)) + stat_density2d(aes(alpha = ..level..), geom = "polygon") + geom_jitter(aes(color = outcome, shape = gender), size = 1.5) + geom_rug(aes(color = outcome)) + scale_y_continuous(limits = c(0, 90)) + labs( fill = "Outcome", color = "Outcome", alpha = "Level", shape = "Gender", x = "Date in 2013", y = "Age", title = "2013 Influenza A H7N9 cases in China", subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)", caption = "" ) + facet_grid(Group ~ province) + my_theme() + scale_shape_manual(values = c(15, 16, 17)) + scale_color_brewer(palette="Set1", na.value = "grey50") + scale_fill_brewer(palette="Set1") ggplot(data = fluH7N9_china_2013_gather, aes(x = Date, y = age, color = outcome)) + geom_point(aes(color = outcome, shape = gender), size = 1.5, alpha = 0.6) + geom_path(aes(group = case_id)) + facet_wrap( ~ province, ncol = 2) + my_theme() + scale_shape_manual(values = c(15, 16, 17)) + scale_color_brewer(palette="Set1", na.value = "grey50") + scale_fill_brewer(palette="Set1") + labs( color = "Outcome", shape = "Gender", x = "Date in 2013", y = "Age", title = "2013 Influenza A H7N9 cases in China", subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)", caption = "\nTime from onset of flu to outcome." ) FeaturesIn machine learningspeak features are what we call the variables used for model training. Using the right features dramatically influences the accuracy and success of your model.
For this example, I am keeping age, but I am also generating new features from the date information and converting gender and province into numerical values.
dataset < fluH7N9_china_2013 %>% mutate(hospital = as.factor(ifelse(is.na(date_of_hospitalisation), 0, 1)), gender_f = as.factor(ifelse(gender == "f", 1, 0)), province_Jiangsu = as.factor(ifelse(province == "Jiangsu", 1, 0)), province_Shanghai = as.factor(ifelse(province == "Shanghai", 1, 0)), province_Zhejiang = as.factor(ifelse(province == "Zhejiang", 1, 0)), province_other = as.factor(ifelse(province == "Zhejiang"  province == "Jiangsu"  province == "Shanghai", 0, 1)), days_onset_to_outcome = as.numeric(as.character(gsub(" days", "", as.Date(as.character(date_of_outcome), format = "%Y%m%d")  as.Date(as.character(date_of_onset), format = "%Y%m%d")))), days_onset_to_hospital = as.numeric(as.character(gsub(" days", "", as.Date(as.character(date_of_hospitalisation), format = "%Y%m%d")  as.Date(as.character(date_of_onset), format = "%Y%m%d")))), age = age, early_onset = as.factor(ifelse(date_of_onset < summary(fluH7N9_china_2013$date_of_onset)[[3]], 1, 0)), early_outcome = as.factor(ifelse(date_of_outcome < summary(fluH7N9_china_2013$date_of_outcome)[[3]], 1, 0))) %>% subset(select = c(2:4, 6, 8)) rownames(dataset) < dataset$case_id dataset[, 2] < as.numeric(as.matrix(dataset[, 2])) head(dataset) ## case_id outcome age hospital gender_f province_Jiangsu province_Shanghai ## 1 1 Death 87 0 0 0 1 ## 2 2 Death 27 1 0 0 1 ## 3 3 Death 35 1 1 0 0 ## 4 4 45 1 1 1 0 ## 5 5 Recover 48 1 1 1 0 ## 6 6 Death 32 1 1 1 0 ## province_Zhejiang province_other days_onset_to_outcome ## 1 0 0 13 ## 2 0 0 11 ## 3 0 1 31 ## 4 0 0 NA ## 5 0 0 57 ## 6 0 0 36 ## days_onset_to_hospital early_onset early_outcome ## 1 NA 1 1 ## 2 4 1 1 ## 3 10 1 1 ## 4 8 1 NA ## 5 11 1 0 ## 6 7 1 1 summary(dataset$outcome) ## Death Recover NA's ## 32 47 57 Imputing missing valuesI am using the mice package for imputing missing values
Note: Since publishing this blogpost I learned that the idea behind using mice is to compare different imputations to see how stable they are, instead of picking one imputed set as fixed for the remainder of the analysis. Therefore, I changed the focus of this post a little bit: in the old post I compared many different algorithms and their outcome; in this updated version I am only showing the Random Forest algorithm and focus on comparing the different imputed datasets. I am ignoring feature importance and feature plots because nothing changed compared to the old post.
 md.pattern() shows the pattern of missingness in the data:
 mice() generates the imputations
 by default, mice() calculates five (m = 5) imputed data sets
 we can combine them all in one output with the complete("long") function
 I did not want to impute missing values in the outcome column, so I have to merge it back in with the imputed data
Let’s compare the distributions of the five different imputed datasets:
datasets_complete %>% gather(x, y, age:early_outcome) %>% ggplot(aes(x = y, fill = .imp, color = .imp)) + facet_wrap(~ x, ncol = 3, scales = "free") + geom_density(alpha = 0.4) + scale_fill_brewer(palette="Set1", na.value = "grey50") + scale_color_brewer(palette="Set1", na.value = "grey50") + my_theme() Test, train and validation data setsNow, we can go ahead with machine learning!
The dataset contains a few missing values in the outcome column; those will be the test set used for final predictions (see the old blog post for this).
train_index < which(is.na(datasets_complete$outcome)) train_data < datasets_complete[train_index, ] test_data < datasets_complete[train_index, 2]The remainder of the data will be used for modeling. Here, I am splitting the data into 70% training and 30% test data.
Because I want to model each imputed dataset separately, I am using the nest() and map() functions.
set.seed(42) val_data < train_data %>% group_by(.imp) %>% nest() %>% mutate(val_index = map(data, ~ createDataPartition(.$outcome, p = 0.7, list = FALSE)), val_train_data = map2(data, val_index, ~ .x[.y, ]), val_test_data = map2(data, val_index, ~ .x[.y, ])) Machine Learning algorithms Random ForestTo make the code tidier, I am first defining the modeling function with the parameters I want.
model_function < function(df) { caret::train(outcome ~ ., data = df, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3, verboseIter = FALSE)) }Next, I am using the nested tibble from before to map() the model function, predict the outcome and calculate confusion matrices.
set.seed(42) val_data_model < val_data %>% mutate(model = map(val_train_data, ~ model_function(.x)), predict = map2(model, val_test_data, ~ data.frame(prediction = predict(.x, .y[, 2]))), predict_prob = map2(model, val_test_data, ~ data.frame(outcome = .y[, 2], prediction = predict(.x, .y[, 2], type = "prob"))), confusion_matrix = map2(val_test_data, predict, ~ confusionMatrix(.x$outcome, .y$prediction)), confusion_matrix_tbl = map(confusion_matrix, ~ as.tibble(.x$table))) Comparing accuracy of modelsTo compare how the different imputations did, I am plotting
 the confusion matrices:
 and the prediction probabilities for correct and wrong predictions:
Hope, you found that example interesting and helpful!
sessionInfo() ## R version 3.4.3 (20171130) ## Platform: x86_64appledarwin15.6.0 (64bit) ## Running under: macOS High Sierra 10.13.4 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib ## ## locale: ## [1] de_DE.UTF8/de_DE.UTF8/de_DE.UTF8/C/de_DE.UTF8/de_DE.UTF8 ## ## attached base packages: ## [1] methods stats graphics grDevices utils datasets base ## ## other attached packages: ## [1] bindrcpp_0.2 caret_6.078 mice_2.46.0 ## [4] lattice_0.2035 plyr_1.8.4 forcats_0.3.0 ## [7] stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4 ## [10] readr_1.1.1 tidyr_0.8.0 tibble_1.4.2 ## [13] ggplot2_2.2.1.9000 tidyverse_1.2.1 outbreaks_1.3.0 ## ## loaded via a namespace (and not attached): ## [1] nlme_3.1131.1 lubridate_1.7.3 dimRed_0.1.0 ## [4] RColorBrewer_1.12 httr_1.3.1 rprojroot_1.32 ## [7] tools_3.4.3 backports_1.1.2 R6_2.2.2 ## [10] rpart_4.113 lazyeval_0.2.1 colorspace_1.32 ## [13] nnet_7.312 withr_2.1.1.9000 tidyselect_0.2.4 ## [16] mnormt_1.55 compiler_3.4.3 cli_1.0.0 ## [19] rvest_0.3.2 xml2_1.2.0 labeling_0.3 ## [22] bookdown_0.7 scales_0.5.0.9000 sfsmisc_1.11 ## [25] DEoptimR_1.08 psych_1.7.8 robustbase_0.928 ## [28] randomForest_4.612 digest_0.6.15 foreign_0.869 ## [31] rmarkdown_1.8 pkgconfig_2.0.1 htmltools_0.3.6 ## [34] rlang_0.2.0.9000 readxl_1.0.0 ddalpha_1.3.1.1 ## [37] rstudioapi_0.7 bindr_0.1 jsonlite_1.5 ## [40] ModelMetrics_1.1.0 magrittr_1.5 Matrix_1.212 ## [43] Rcpp_0.12.15 munsell_0.4.3 stringi_1.1.6 ## [46] yaml_2.1.17 MASS_7.349 recipes_0.1.2 ## [49] grid_3.4.3 parallel_3.4.3 crayon_1.3.4 ## [52] haven_1.1.1 splines_3.4.3 hms_0.4.1 ## [55] knitr_1.20 pillar_1.2.1 reshape2_1.4.3 ## [58] codetools_0.215 stats4_3.4.3 CVST_0.21 ## [61] glue_1.2.0 evaluate_0.10.1 blogdown_0.5 ## [64] modelr_0.1.1 foreach_1.4.4 cellranger_1.1.0 ## [67] gtable_0.2.0 kernlab_0.925 assertthat_0.2.0 ## [70] DRR_0.0.3 xfun_0.1 gower_0.1.2 ## [73] prodlim_1.6.1 broom_0.4.3 e1071_1.68 ## [76] class_7.314 survival_2.413 viridisLite_0.3.0 ## [79] timeDate_3043.102 RcppRoll_0.2.2 iterators_1.0.9 ## [82] lava_1.6 ipred_0.96 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Fitting GAMs with brms: part 1
(This article was first published on From the Bottom of the Heap  R, and kindly contributed to Rbloggers)
Regular readers will know that I have a somewhat unhealthy relationship with GAMs and the mgcv package. I use these models all the time in my research but recently we’ve been hitting the limits of the range of models that mgcv can fit. So I’ve been looking into alternative ways to fit the GAMs I want to fit but which can handle the kinds of data or distributions that have been cropping up in our work. The brms package (Bürkner, 2017) is an excellent resource for modellers, providing a highlevel R front end to a vast array of model types, all fitted using Stan. brms is the perfect package to go beyond the limits of mgcv because brms even uses the smooth functions provided by mgcv, making the transition easier. In this post I take a look at how to fit a simple GAM in brms and compare it with the same model fitted using mgcv.
In this post we’ll use the following packages. If you don’t know schoenberg, it’s a package I’m writing to provide ggplot versions of plots that can be produced by mgcv from fitted GAM objects. schoenberg is in early development, but it currentl works well enough to plot the models we fit here. If you’ve never come across this package before, you can install it from Github using devtools::install_github(‘gavinsimpson/schoenberg’)
## packages library('mgcv') library('brms') library('ggplot2') library('schoenberg') theme_set(theme_bw())To illustrate brms GAMfittgin chops, we’ll use the mcycle data set that comes with the MASS package. It contains a set of measurements of the acceleration force on a rider’s head during a simulated motorcycle collision and the time, in milliseconds, post collision. The data are loaded using data() and we take a look at the first few rows
## load the example data mcycle data(mcycle, package = 'MASS') ## show data head(mcycle) times accel 1 2.4 0.0 2 2.6 1.3 3 3.2 2.7 4 3.6 0.0 5 4.0 2.7 6 6.2 2.7The aim is to model the acceleration force (accel) as a function of time post collision (times). The plot below shows the data.
ggplot(mcycle, aes(x = times, y = accel)) + geom_point() + labs(x = "Miliseconds post impact", y = "Acceleration (g)", title = "Simulated Motorcycle Accident", subtitle = "Measurements of head acceleration")We’ll model acceleration as a smooth function of time using a GAM and the default thin plate regression spline basis. This can be done using the gam() function in mgcv ad for comparison with the fully bayesian model we’ll fit shortly, we use `method = “REML” to estimate the smoothness parameter for the spline in mixed model form and REML
m1 < gam(accel ~ s(times), data = mcycle, method = "REML") summary(m1) Family: gaussian Link function: identity Formula: accel ~ s(times) Parametric coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 25.546 1.951 13.09 <2e16 ***  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Approximate significance of smooth terms: edf Ref.df F pvalue s(times) 8.625 8.958 53.4 <2e16 ***  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Rsq.(adj) = 0.783 Deviance explained = 79.7% REML = 616.14 Scale est. = 506.35 n = 133As we can see from the model summary, the estimated smooth uses about 8.5 effective degrees of freedom and in the test of zero effect, the null hypothesis is strongly rejected. The fitted spline explains about 80% of the variance or deviance in the data.
To plot the fitted smooth we could use the plot() method provided by mgcv, but this uses base graphics. Instead we can use the draw() method from schoenberg, which can currently handle most of the univariate smooths in mgcv plus 2d tensor product smooths
draw(m1)The equivalent model can be estimated using a fullybayesian approach via the brm() function in the brms package. In fact, brm() will use the smooth specification functions from mgcv, making our lives much easier. The major difference though is that you can’t use te() or ti() smooths in brm() models; you need to use t2() tensor product smooths instead. This is because the smooths in the model are going to be treated as random effects and the model estimated as a GLMM, which exploits the duality of splines as random effects. In this representation, the wiggly parts of the spline basis are treated as a random effect and their associated variance parameter controls the degree of wiggliness of the fitted spline. The perfectly smooth parts of the basis are treated as a fixed effect. In this form, the GAM can be estimated using standard GLMM software; it’s what allows the gamm4() function for example to fit GAMMs using the lme4 package for example. This is also the reason why we can’t use te() or ti() smooths; those smooths do not have nicely separable penalties which means they can’t be written in the form required to be fitted using typical mixed model software.
The brm() version of the GAM is fitted using the code below. Note that I have changed a few things from their default values as
 the model required more than the default number of MCMC samples — iter = 4000,
 the samples needed thinning to deal with some strong autocorrelation in the Markov chains — thin = 10,
 the adapt.delta parameter, a tuning parameter in he NUTS sampler for Hamiltonian Monte Carlo, potentially needed raising — there was a warning about a potential divergent transition but I should have looked to see if it was one or not; instead I just increased the tuning parameter to 0.99,
 four chains fitted by default but I wanted these to be fitted using 4 CPU cores,
 seed sets the internal random number generator seed, which allows reproducibility of models, and
 for this post I didn’t want to print out the progress of the sampler — refresh = 0 — typically you won’t want to do this so you can see how the sampling is progressing.
The rest of the model is pretty similar to the gam() version we fitted earlier. The main difference is that I use the bf() function to create a special brms formula specifying the model. You don’t actually need to do this for such a simple model, but in a later post we’ll use this to fit distributional GAMs. Note that I’m leaving all the priors in the model at the default values. I’ll look at defining priors in a later post; for now I’m just going to use the default priors that brm() uses
m2 < brm(bf(accel ~ s(times)), data = mcycle, family = gaussian(), cores = 4, seed = 17, iter = 4000, warmup = 1000, thin = 10, refresh = 0, control = list(adapt_delta = 0.99)) Compiling the C++ model Start samplingOnce the model has finished compiling and sampling we can output the model summary
summary(m2) Family: gaussian Links: mu = identity; sigma = identity Formula: accel ~ s(times) Data: mcycle (Number of observations: 133) Samples: 4 chains, each with iter = 4000; warmup = 1000; thin = 10; total postwarmup samples = 1200 ICs: LOO = NA; WAIC = NA; R2 = NA Smooth Terms: Estimate Est.Error l95% CI u95% CI Eff.Sample Rhat sds(stimes_1) 722.44 198.12 450.17 1150.27 1180 1.00 PopulationLevel Effects: Estimate Est.Error l95% CI u95% CI Eff.Sample Rhat Intercept 25.54 2.02 29.66 21.50 1200 1.00 stimes_1 16.10 38.20 61.46 90.91 1171 1.00 Family Specific Parameters: Estimate Est.Error l95% CI u95% CI Eff.Sample Rhat sigma 22.78 1.47 19.94 25.68 1200 1.00 Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence, Rhat = 1).This outputs details of the model fitted plus parameter estimates (as posterior means), standard errors, (by default) 95% credible intervals and two other diagnostics:
 Eff.Sample is the effective sample size of the posterior samples in the model, and
 Rhat is the potential scale reduction factor or GelmanRubin diagnostic and is a measure of how well the chains have converged and ideally should be equal to 1.
The summary includes two entries for the smooth of times:
 sds(stimes_1) is the variance parameter, which has the effect of controlling the wiggliness of the smooth — the larger this value the more wiggly the smooth. We can see that the credible interval doesn’t include 0 so there is evidence that a smooth is required over and above a linear parametric effect of times, details of which are given next,
 stimes_1 is the fixed effect part of the spline, which is the linear function that is perfectly smooth.
The final parameter table includes information on the variance of the data about the conditional mean of the response.
How does this model compare with the one fitted using gam()? We can use the gam.vcomp() function to compute the variance component representation of the smooth estimated via gam(). To make it comparable with the value shown for the brms model, we don’t undo the rescaling of the penalty matrix that gam() performs to help with numeric stability during model fitting.
gam.vcomp(m1, rescale = FALSE) Standard deviations and 0.95 confidence intervals: std.dev lower upper s(times) 807.88726 480.66162 1357.88215 scale 22.50229 19.85734 25.49954 Rank: 2/2This gives a posterior mean of 807.89 with 95% confidence interval of 480.66–1357.88, which compares well with posterior mean and credible interval of the brm() version of 722.44 (450.17 – 1150.27).
The marginal_smooths() function is used to extract the marginal effect of the spline.
msms < marginal_smooths(m2)This function extracts enough information about the estimated spline to plot it using the plot() method
plot(msms)Given the similarity in the variance components of the two models it is not surprising the two estimated smooth also look similar. The marginal_smooths() is effectively the equivalent of the plot() method for mgcvbased GAMs.
There’s a lot that we can and should do to check the model fit. For now, we’ll look at two posterior predictive check plots that brms, via the bayesplot package (Gabry and Mahr, 2018), makes very easy to produce using the pp_check() function.
pp_check(m2) Using 10 posterior samples for ppc type 'dens_overlay' by default.The default produces a density plot overlay of the original response values (the thick black line) with 10 draws from the posterior distribution of the model. If the model is a good fit to the data, samples of data sampled from it at the observed values of the covariate(s) should be similar to one another.
Another type of posterior predictive check plot is the empirical cumulative distribution function of the observations and random draws from the model posterior, which we can produce with type = “ecdf_overlay”
pp_check(m2, type = "ecdf_overlay") Using 10 posterior samples for ppc type 'ecdf_overlay' by default.Both plots show significant deviations between the the posterior simulations and the observed data. The poor posterior predictive check results are in large part due to the nonconstant variance of the acceleration data conditional upon the covariate. Both models assumed that the observation are distributed Gaussian with means equal to the fitted values (estimated expectation of the response) with the same variance (^2). The observations appear to have different variances, which we can model with a distributional model, which allow all parameters of the distribution of the response to be modelled with linear predictors. We’ll take a look at these models in a future post.
ReferencesBürkner, P.C. (2017). brms: An R package for bayesian multilevel models using Stan. Journal of Statistical Software 80, 1–28. doi:10.18637/jss.v080.i01.
Gabry, J., and Mahr, T. (2018). Bayesplot: Plotting for bayesian models. Available at: https://CRAN.Rproject.org/package=bayesplot.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: From the Bottom of the Heap  R. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
R/Finance 2018 Registration
(This article was first published on FOSS Trading, and kindly contributed to Rbloggers)
This year marks the 10th anniversary of the R/Finance Conference! As in prior years, we expect more than 250 attendees from around the world. R users from industry, academia, and government will joining 50+ presenters covering all areas of finance with R. The conference will take place on June 1st and 2nd, at UIC in Chicago.
You can find registration information on the conference website, or you can go directly to the Cvent registration page.
Note that registration fees will increase by 50% at the end of early registration on May 21, 2018.
We are very excited about keynote presentations by JJ Allaire, Li Deng, and Norm Matloff. The conference agenda (currently) includes 18 full presentations and 33 shorter “lightning talks”. As in previous years, several (optional) preconference seminars are offered on Friday morning. We’re still working on the agenda, but we have another great lineup of speakers this year!
There is also an (optional) conference dinner at Wyndham Grand Chicago Riverfront in the 39th Floor Penthouse Ballroom and Terrace. Situated directly on the riverfront, it is a perfect venue to continue conversations while dining and drinking.
We would to thank our 2018 Sponsors for the continued support enabling us to host such an exciting conference:
UIC Liautaud Master of Science in Finance
Microsoft
R Consortium
RStudio
William Blair
Citadel
Quasardb
On behalf of the committee and sponsors, we look forward to seeing you in Chicago!
Gib Bassett, Peter Carl, Dirk Eddelbuettel, Brian Peterson, Dale Rosenthal, Jeffrey Ryan, Joshua Ulrich
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: FOSS Trading. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Packaging Shiny applications: A deep dive
(This article was first published on Mango Solutions, and kindly contributed to Rbloggers)
(Or, how to write a Shiny app.R file that only contains a single line of code)
Mark Sellors, Head of Data EngineeringThis post is long overdue. The information contained herein has been built up over years of deploying and hosting Shiny apps, particularly in production environments, and mainly where those Shiny apps are very large and contain a lot of code.
Last year, during some of my conference talks, I told the story of Mango’s early adoption of Shiny and how it wasn’t always an easy path to production for us. In this post I’d like to fill in some of the technical background and provide some information about Shiny app publishing and packaging that is hopefully useful to a wider audience.
I’ve figured out some of this for myself, but the most pivotal piece of information came from Shiny creator, Joe Cheng. Joe told me some time ago, that all you really need in an app.R file is a function that returns a Shiny application object. When he told me this, I was heavily embedded in the publication side and I didn’t immediately understand the implications.
Over time though I came to understand the power and flexibility that this model provides and, to a large extent, that’s what this post is about.
What is Shiny?Hopefully if you’re reading this you already know, but Shiny is a web application framework for Ri. It allows R users to develop powerful web applications entirely in R without having to understand HTML, CSS and JavaScript. It also allows us to embed the statistical power of R directly into those web applications.
Shiny apps generally consist of either a ui.R and a server.R (containing user interface and serverside logic respectively) or a single app.R which contains both.
Why package a Shiny app anyway?
If your app is small enough to fit comfortably in a single file, then packaging your application is unlikely to be worth it. As with any R script though, when it gets too large to be comfortably worked with as a single file, it can be useful to break it up into discrete components.
Publishing a packaged app will be more difficult, but to some extent that will depend on the infrastructure you have available to you.
Pros of packagingPackaging is one of the many great features of the R language. Packages are fairly straightforward, quick to create and you can build them with a host of useful features like builtin documentation and unit tests.
They also integrate really nicely into Continuous Integration (CI) pipelines and are supported by tools like Travis. You can also get test coverage reports using things like codecov.io.
They’re also really easy to share. Even if you don’t publish your package to CRAN, you can still share it on GitHub and have people install it with devtools, or build the package and share that around, or publish the package on a CRANlike system within your organisation’s firewall.
Cons of packagingBefore you get all excited and start to package your Shiny applications, you should be aware that — depending on your publishing environment — packaging a Shiny application may make it difficult or even impossible to publish to a system like Shiny Server or RStudio Connect, without first unpacking it again.
A little bit of Mango historyThis is where Mango were in the early days of our Shiny use. We had a significant disconnect between our data scientists writing the Shiny apps and the IT team tasked with supporting the infrastructure they used. This was before we’d committed to having an engineering team that could sit in the middle and provide a bridge between the two.
When our data scientists would write apps that got a little large or that they wanted robust tests and documentation for, they would stick them in packages and send them over to me to publish to our original Shiny Server. The problem was: R packages didn’t really mean anything to me at the time. I knew how to install them, but that was about as far as it went. I knew from the Shiny docs that a Shiny app needs certain files (server.R and ui.R or an app.R) file, but that wasn’t what I got, so I’d send it back to the data science team and tell them that I needed those files or I wouldn’t be able to publish it.
More than once I got back a response along the lines of, “but you just need to load it up and then do runApp()”. But, that’s just not how Shiny Server works. Over time, we’ve evolved a set of best practices around when and how to package a Shiny application.
The first step was taking the leap into understanding Shiny and R packages better. It was here that I started to work in the space between data science and IT.
How to package a Shiny applicationIf you’ve seen the simple app you get when you choose to create a new Shiny application in RStudio, you’ll be familiar with the basic structure of a Shiny application. You need to have a UI object and a server function.
If you have a look inside the UI object you’ll see that it contains the html that will be used for building your user interface. It’s not everything that will get served to the user when they access the web application — some of that is added by the Shiny framework when it runs the application — but it covers off the elements you’ve defined yourself.
The server function defines the serverside logic that will be executed for your application. This includes code to handle your inputs and produce outputs in response.
The great thing about Shiny is that you can create something awesome quite quickly, but once you’ve mastered the basics, the only limit is your imagination.
For our purposes here, we’re going to stick with the ‘geyser’ application that RStudio gives you when you click to create a new Shiny Web Application. If you open up RStudio, and create a new Shiny app — choosing the single file app.R version — you’ll be able to see what we’re talking about. The small size of the geyser app makes it ideal for further study.
If you look through the code you’ll see that there are essentially three components: the UI object, the server function, and the shinyApp() function that actually runs the app.
Building an R package of just those three components is a case of breaking them out into the constituent parts and inserting them into a blank package structure. We have a version of this up on GitHub that you can check out if you want.
The directory layout of the demo project looks like this:
 DESCRIPTION  LICENSE  NAMESPACE  R   launchApp.R   shinyAppServer.R  ` shinyAppUI.R  README.md  inst  ` shinyApp  ` app.R  man   launchApp.Rd   shinyAppServer.Rd  ` shinyAppUI.Rd ` shinyAppDemo.RprojOnce the app has been adapted to sit within the standard R package structure we’re almost done. The UI object and server function don’t really need to be exported, and we’ve just put a really thin wrapper function around shinyApp() — I’ve called it launchApp() — which we’ll actually use to launch the app. If you install the package from GitHub with devtools, you can see it in action.
library(shinyAppDemo) launchApp()This will start the Shiny application running locally.
The approach outlined here also works fine with Shiny Modules, either in the same package, or called from a separate package.
And that’s almost it! The only thing remaining is how we might deploy this app to Shiny server (including Shiny Server Pro) or RStudio Connect.
Publishing your packaged Shiny appWe already know that Shiny Server and RStudio Connect expect either a ui.R and a server.R or an app.R file. We’re running our application out of a package with none of this, so we won’t be able to publish it until we fix this problem.
The solution we’ve arrived at is to create a directory called ‘shinyApp’ inside the inst directory of the package. For those of you who are new to R packaging, the contents of the ‘inst’ directory are essentially ignored during the package build process, so it’s an ideal place to put little extras like this.
The name ‘shinyApp’ was chosen for consistency with Shiny Server which uses a ‘shinyApps’ directory if a user is allowed to serve applications from their home directory.
Inside this directory we create a single ‘app.R’ file with the following line in it:
shinyAppDemo::launchApp()And that really is it. This one file will allow us to publish our packaged application under some circumstances, which we’ll discuss shortly.
Here’s where having a packaged Shiny app can get tricky, so we’re going to talk you through the options and do what we can to point out the pitfalls.
Shiny Server and Shiny Server ProPerhaps surprisingly — given that Shiny Server is the oldest method of Shiny app publication — it’s also the easiest one to use with these sorts of packaged Shiny apps. There are basically two ways to publish on Shiny Server. From your home directory on the server — also known as selfpublishing — or publishing from a central location, usually the directory ‘/srv/shinyserver’.
The central benefit of this approach is the ability to update the application just by installing a newer version of the package. Sadly though, it’s not always an easy approach to take.
Apps served from home directory (AKA selfpublishing)The first publication method is from a users’ home directory. This is generally used in conjunction with RStudio Server. In the selfpublishing model, Shiny Server (and Pro) expect apps to be found in a directory called ‘ShinyApps’, within the users home directory. This means that if we install a Shiny app in a package the final location of the app directory will be inside the installed package, not in the ShinyApps directory. In order to work around this, we create a link from where the app is expected to be, to where it actually is within the installed package structure.
So in the example of our package, we’d do something like this in a terminal session:
# make sure we’re in our home directory cd # change into the shinyApps directory cd shinyApps # create a link from our app directory inside the package ln s /home/sellorm/R/x86_64pclinuxgnulibrary/3.4/shinyAppDemo/shinyApp ./testAppNote: The path you will find your libraries in will differ from the above. Check by running .libPaths()[1] and then dir(.libPaths()[1]) to see if that’s where your packages are installed.
Once this is done, the app should be available at ‘http://:3838//’ and can be updated by updating the installed version of the package. Update the package and the updates will be published via Shiny Server straight away.
Apps Server from a central location (usually /srv/shinyserver)This is essentially the same as above, but the task of publishing the application generally falls to an administrator of some sort.
Since they would have to transfer files to the server and log in anyway, it shouldn’t be too much of an additional burden to install a package while they’re there. Especially if that makes life easier from then on.
The admin would need to transfer the package to the server, install it and then create a link — just like in the example above — from the expected location, to the installed location.
The great thing with this approach is that when updates are due to be installed the admin only has to update the installed package and not any other files.
RStudio ConnectConnect is the next generation Shiny Server. In terms of features and performance, it’s far superior to its predecessor. One of the best features is the ability to push Shiny app code directly from the RStudio IDE. For the vast majority of users, this is a huge productivity boost, since you no longer have to wait for an administrator to publish your app for you.
Since publishing doesn’t require anyone to directly log into the server as part of the publishing process, there aren’t really any straightforward opportunities to install a custom package. This means that, in general, publishing a packaged shiny application isn’t really possible.
There’s only one real workaround for this situation that I’m aware of. If you have an internal CRANlike repository for your custom packages, you should be able to use that to update Connect, with a little work.
You’d need to have your dev environment and Connect hooked up to the same repo. The updated app package needs to be available in that repo and installed in your dev environment. Then, you could publish and then update the single line app.R for each successive package version you publish.
Connect uses packrat under the hood, so when you publish the app.R the packrat manifest will also be sent to the server. Connect will use the manifest to decide which packages are required to run your app. If you’re using a custom package this would get picked up and installed or updated during deployment.
shinyapps.ioIt’s not currently possible to publish a packaged application to shinyapps.io. You’d need to make sure your app followed the accepted conventions for creating Shiny apps and only uses files, rather than any custom packages.
ConclusionPackaging Shiny apps can be a real productivity boon for you and your team. In situations where you can integrate that process into other processes, such as automatically running your unit tests or automated publishing it can also help you adopt devopsstyle workflows.
However, in some instances, the practice can actually make things worse and really slow you down. It’s essential to understand what the publishing workflow is in your organisation before embarking on any significant Shiny packaging project as this will help steer you towards the best course of action.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
eRum Competition Winners
(This article was first published on R on The Jumping Rivers Blog, and kindly contributed to Rbloggers)
The results of the eRum competition are in! Before we announce the winners we would like to thank everyone who entered. It has been a pleasure to look at all of the ideas on show.
The Main CompetitionThe winner of the main competition is Lukasz Janiszewski. Lukasz provided a fantastic visualisation of the locations of each R user/ladies group and all R conferences. You can see his app here. If you want to view his code, you are able to do so in this GitHub repo. The code is contained in the directory erum_jr and the data preparation can be seen in budap.R.
Lukasz made 3 csv files contained the information about the R user, R ladies and R conferences. Using the help of an Rbloggers post, he was able to add the geospatial information to those csv files. Finally, he scraped each meetup page for information on the Rladies groups. Using all of this information, he was able to make an informative, visually appealing dashboard with shiny.
Lukasz will now be jetting off to Budapest, to eRum 2018!
The Secondary CompetitionThe winner of the secondary competition is Jenny Snape. Jenny provided an excellent script to parse the current .Rmd files and extract the conference and group urls & locations. The script can be found in this GitHub gist. Jenny has written a few words to summarise her script…
“The files on github can be read into R as character vectors (where each line is a element of the vector) using the R readLines() function.
From this character vector, we need to extract the country, the group name and url. This can be done by recognising that each line containing a country starts with a ‘##’ and each line containing the group name and url starts with a ’*‘. Therefore we can use these ’tags’ to cycle through each element of the character vector and pull out vectors containing the countries, the cities and the urls of the R groups. These vectors can then be cleaned and joined together into a data frame.
I wrote these steps into a function that accepted each R group character vector as an input and returned the final data frame. As one of the data sets contained just R Ladies groups, I fed this in as an argument and returned it as a column in the final data frame in order to differentiate between the different group types. I also returned a variable based on the character vector input in order to differentiate between the different world continents.
Running this function on each of the character vectors creates separate data sets which can then be all joined together. This creates a final dataset containing all the information on each R group: the type of group, the url, the city and the region."
As well as this, Jenny provided us with a fantastic shiny dashboard, summarising the data.
Jenny has now received a free copy of Efficient R Prgoramming!
Once again, thank you to all who entered and well done to our winners, Lukasz and Jenny!
What next?We’re in the process of converting Jenny’s & Lukasz’s hard work into a nice dashboard that will be magically updated via our list of useR groups and conferences. It should be ready in a few days.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Two day workshop: Flexible programming of MCMC and other methods for hierarchical and Bayesian models
(This article was first published on R – NIMBLE, and kindly contributed to Rbloggers)
We’ll be giving a two day workshop at the 43rd Annual Summer Institute of Applied Statistics at Brigham Young University (BYU) in Utah, June 1920, 2018.
Abstract is below, and registration and logistics information can be found here.
This workshop provides a handson introduction to using, programming, and sharing Bayesian and hierarchical modeling algorithms using NIMBLE (rnimble.org). In addition to learning the NIMBLE system, users will develop handson experience with various computational methods. NIMBLE is an Rbased system that allows one to fit models specified using BUGS/JAGS syntax but with much more flexibility in defining the statistical model and the algorithm to be used on the model. Users operate from within R, but NIMBLE generates C++ code for models and algorithms for fast computation. I will open with an overview of creating a hierarchical model and fitting the model using a basic MCMC, similarly to how one can use WinBUGS, JAGS, and Stan. I will then discuss how NIMBLE allows the user to modify the MCMC – changing samplers and specifying blocking of parameters. Next I will show how to extend the BUGS syntax with userdefined distributions and functions that provide flexibility in specifying a statistical model of interest. With this background we can then explore the NIMBLE programming system, which allows one to write new algorithms not already provided by NIMBLE, including new MCMC samplers, using a subset of the R language. I will then provide examples of nonMCMC algorithms that have been programmed in NIMBLE and how algorithms can be combined together, using the example of a particle filter embedded within an MCMC. We will see new functionality in NIMBLE that allows one to fit Bayesian nonparametric models and spatial models. I will close with a discussion of how NIMBLE enables sharing of new methods and reproducibility of research. The workshop will include a number of breakout periods for participants to use and program MCMC and other methods, either on example problems or problems provided by participants. In addition, participants will see NIMBLE’s flexibility in action in several real problems.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Hackathon – Hack the city 2018
(This article was first published on R blog  Quantide  R training & consulting, and kindly contributed to Rbloggers)
Hello, RUsers!
Have you ever joined an hackathon? If so, you surely know how fun and stimulating these events are. If not… now’s your chance!
Quantide is collaborating with Hack the City 2018, the 4th edition of the Southern Switzerland’s hackathon. Aside from being partners, we’re actively part of it: we have members in the mentors’ group and as part of the technical commission of the jury. We also give out a special mention for for the best data science project, developed with open source technology and preferably using R as programming language. It’s the first time that R makes its entry in this hackathon: if you work with R, here’s your chance!
Hack the city is a great occasion to show off your talent and your ideas, in team or as an individual. Programmers, graphic designers, project designers and makers collaborate to build something in just 48 hours. You can also join the competition from your own home, as long as at least one of your teammates is in Lugano! At the end of the hackathon, a jury will give out great prizes to the best project: respectively 5000, 2500 and 100o CHF for first, second and third prizes. There are other special mentions and scholarships, as we have mentioned before.
The last minutes tickets are available now, so if you want to grab this occasion, you should do it fast! Hack the City 2018 will take place in Lugano, from 27th to 29th April. Hope to see you there!
The post Hackathon – Hack the city 2018 appeared first on Quantide – R training & consulting.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R blog  Quantide  R training & consulting. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A Recession Before 2020 Is Likely; On the Distribution of Time Between Recessions
(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to Rbloggers)
I recently saw a Reddit thread in r/PoliticalDiscussion asking the question “If the economy is still booming 2020, how should the Democratic address this?” This gets to an issue that’s been on my mind since at least 2016, maybe even 2014: when will the current period of economic growth end?
For some context, the Great Recession, as economists colloquially call the recession beginning in 2007 and punctuated with the 2008 financial crisis, ended officially in June 2009; it was then the economy resumed growth. As of this writing, that was about eight years, ten months ago. The longest previous period between recessions was the time between the early 1990s recession and the recession in the early 2000s that coincided with the collapse of the dotcom bubble; that period was ten years, and the only period longer than the present period between recessions.
There is growing optimism in the economy, most noticeably amongst consumers, and we are finally seeing wages increase in the United States after years of stagnation. Donald Trump and Republicans point to the economy as a reason to vote for Republicans in November (and yet Donald Trump is still historically unpopular and Democrats have a strong chance of capturing the House, and a fair chance at the Senate). Followers of the American economy are starting to ask, “How long can this last?”
In 2016, I was thinking about this issue in relation to the election. I wanted Hillary Clinton to win, but at the same time I feared that a Clinton win would be a shortterm gain, longterm loss for Democrats. One reason why is I believe there’s a strong chance of a recession within the next few years.
The 2008 financial crisis was a dramatic event, yet the DoddFrank reforms and other policy responses, in my opinion, did not go far enough to address the problems unearthed by the financial crisis. Toobigtofail institutions are now a part of law (though the policy jargon is systemically important financial institution, or SIFI). In fact, the scandal surrounding HSBC’s support of money laundering and the Justice Departments weak response suggested bankers may be toobigtojail! Many of the financial products and practices that caused the financial crisis are still legal; the fundamentals that produced the crisis have not changed. Barack Obama and the Democrats (and the Republicans, certainly) failed to break the political back of the bankers.
While I did not think Bernie Sanders’ reforms would necessarily make the American economy better, I thought he would put the fear of God back into the financial sector, and that alone could help keep risky behavior in check. Donald Trump, for all his populist rhetoric, has not demonstrated he’s going to put that fear in them. In fact, the Republicans passed a bill that’s a gift to corporations and top earners. The legacy of the 2008 financial crisis is that the financial sector can make grossly risky bets in the good “get government off our back!” times, but will have their losses covered by taxpayers in the “we need government help!” times. Recessions and financial crises are a part of the process of expropriating taxpayers. (I wrote other articles about this topic: see this article and this article, as well as this paper I wrote for an undergraduate class.)
Given all this, there’s good reason to believe that nothing has changed about the American economy that would change the likelihood of a financial crisis. Since it has been so long since the last one, it’s time to start expecting one, and whoever holds the Presidency will be blamed.
Right now that’s Donald Trump and the Republicans. And I don’t need to tell you that given Trump’s popularity in good economic times is historically low, a recession before the 2020 election would lead to a Republican rout, with few survivors.
And in a Census year, too!
So what is the probability of a recession? The rest of this article will focus on finding a statistical model for duration between elections and using that model to estimate the probability of a recession.
A recent article in the magazine Significance entitled “The Weibull distribution” describes the Weibull distribution, a common and expressive probability distribution (and one I recently taught in my statistics class). This distribution is used to model a lot of phenomena, including survival times, the time until a system fails or how long a patient diagnosed with a disease survives. Time until recession sounds like a “survival time”, so perhaps the Weibull distribution can be used to model it.
First, I’m going to be doing some bootstrapping, so here’s the seed for replicability:
set.seed(4182018)The dataset below, obtained from this Wikipedia article, contains the time between recessions in the United States. I look only at recessions since the Great Depression, considering this to be the “modern” economic era for the United States. The sample size is necessarily small, at 13 observations.
recessions < c( 4+ 2/12, 6+ 8/12, 3+ 1/12, 3+ 9/12, 3+ 3/12, 2+ 0/12, 8+10/12, 3+ 0/12, 4+10/12, 1+ 0/12, 7+ 8/12, 10+ 0/12, 6+ 1/12) hist(recessions) plot(density(recessions))The fitdistrplus allows for estimating the parameters of statistical distributions using the usual statistical techniques. (I found J.Stat.Soft article useful for learning about the package.) I load it below and look at an initial plot to get a sense of appropriate distributions.
suppressPackageStartupMessages(library(fitdistrplus)) descdist(recessions, boot = 1000) ## summary statistics ##  ## min: 1 max: 10 ## median: 4.166667 ## mean: 4.948718 ## estimated sd: 2.71943 ## estimated skewness: 0.51865 ## estimated kurtosis: 2.349399The recessions dataset is platykurtic though rightskewed, a surprising result. However, that’s not enough to deter me from attempting to use the Weibull distribution to model time between recessions. (I should mention here that I am assuming essentially that I’m assuming that time between recessions since the Great Depression are independent and identically distributed. This is not obvious or uncontroversial, but I doubt this could be credibly disproven or that assuming dependence would improve the model.) Let’s fit parameters.
fw < fitdist(recessions, "weibull") summary(fw) ## Fitting of the distribution ' weibull ' by maximum likelihood ## Parameters : ## estimate Std. Error ## shape 2.001576 0.4393137 ## scale 5.597367 0.8179352 ## Loglikelihood: 30.12135 AIC: 64.2427 BIC: 65.3726 ## Correlation matrix: ## shape scale ## shape 1.0000000 0.3172753 ## scale 0.3172753 1.0000000 plot(seq(0, 15, length.out = 1000), dweibull(seq(0, 15, length.out = 1000), shape = fw$estimate["shape"], scale = fw$estimate["scale"]), col = "blue", type = "l", xlab = "Duration", ylab = "Density", main = "Weibull distribution applied to recession duration") lines(density(recessions)) plot(fw)The plots above suggest the fitted Weibull distribution describe the observed distribution; the QQ plot, PP plot, and the estimated density function all fit well with a Weibull distribution. I also compared the AIC values of the fitted Weibull distribution to two other close candidates, the gamma and lognormal distributions; the Weibull distribution provides the best fit according to the AIC criterion, being twice as reasonable as the lognormal distribution, although only slightly better than the gamma distribution (which is not surprising, given that the two distributions are similar). Due to the interpretations that come with the Weibull distribution and the statistical evidence, I believe it provides the better fit and should be used.
Based on the form of the distribution and the estimated parameters we can find a point estimate for the probability of a recession both before the 2018 midterm election and before the 2020 presidential election. That is, if is the time between recessions, we can estimate
t_0)" title="P(T \leq t + t_0  T > t_0)" class="latex" />
alpha < fw$estimate["shape"] beta < fw$estimate["scale"] recession_prob_wei < function(delta, passed, shape, scale) { # Computes the probability of a recession within the next delta years given # passed years # # args: # delta: a number representing time to next recession # passed: a number representing time since last recession # shape: the shape parameter of the Weibull distribution # scale: the scale parameter of the Weibull distribution if (delta < 0  passed < 0) { stop("Both delta and passed must be nonnegative") } return(1  pweibull(passed + delta, shape = shape, scale = scale, lower.tail = FALSE) / pweibull(passed, shape = shape, scale = scale, lower.tail = FALSE)) } # Recession prob. before 2018 election point estimate recession_prob_wei(6/12, 8+10/12, shape = alpha, scale = beta) ## [1] 0.252013 # Before 2020 election recession_prob_wei(2+6/12, 8+10/12, shape = alpha, scale = beta) ## [1] 0.8005031Judging by the point estimates, there’s a 25% chance of a recession before the 2018 midterm election and an 80% chance of a recession before the 2020 election.
The code below finds bootstrapped 95% confidence intervals for these numbers.
suppressPackageStartupMessages(library(boot)) recession_prob_wei_bootci < function(data, delta, passed, conf = .95, R = 1000) { # Computes bootstrapped CI for the probability a recession will occur before # a certain time given some time has passed # # args: # data: A numeric vector containing recession data # delta: A nonnegative real number representing maximum time till recession # passed: A nonnegative real number representing time since last recession # conf: A real number between 0 and 1; the confidence level # R: A positive integer for the number of bootstrap replicates bootobj < boot(recessions, R = R, statistic = function(data, indices) { d < data[indices] params < fitdist(d, "weibull")$estimate return(recession_prob_wei(delta, passed, shape = params["shape"], scale = params["scale"])) }) boot.ci(bootobj, type = "perc", conf = conf) } # Bootstrapped 95% CI for probability of recession before 2018 election recession_prob_wei_bootci(recessions, 6/12, 8+10/12, R = 10000) ## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS ## Based on 10000 bootstrap replicates ## ## CALL : ## boot.ci(boot.out = bootobj, conf = conf, type = "perc") ## ## Intervals : ## Level Percentile ## 95% ( 0.1691, 0.6174 ) ## Calculations and Intervals on Original Scale # Bootstrapped 95% CI for probability of recession before 2020 election recession_prob_wei_bootci(recessions, 2+6/12, 8+10/12, R = 10000) ## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS ## Based on 10000 bootstrap replicates ## ## CALL : ## boot.ci(boot.out = bootobj, conf = conf, type = "perc") ## ## Intervals : ## Level Percentile ## 95% ( 0.6299, 0.9974 ) ## Calculations and Intervals on Original ScaleThese CIs suggest that while the probability of a recession before the 2018 midterm is very uncertain (it could plausibly be between 17% and 62%), my hunch about 2020 has validity; even the lower bound of that CI suggests a recession before 2020 is likely, and the upper bound is almostcertainty.
How bad could it be? That’s hard to say. However, these odds make the Republican tax bill and its trilliondollar deficits look even more irresponsible; that money will be needed to deal with a potential recession’s fallout.
As bad as 2018 looks for Republicans, it could look like a cakewalk compared to 2020.
(And despite the seemingly jubilant tone, this suggests I may have trouble finding a job in the upcoming years.)
I have created a video course published by Packt Publishing entitled Data Acqusition and Manipulation with Python, the second volume in a fourvolume set of video courses entitled, Taming Data with Python; Excelling as a Data Analyst. This course covers more advanced Pandas topics such as reading in datasets in different formats and from databases, aggregation, and data wrangling. The course then transitions to cover getting data in “messy” formats from Web documents via web scraping. The course covers web scraping using BeautifulSoup, Selenium, and Scrapy. If you are starting out using Python for data analysis or know someone who is, please consider buying my course or at least spreading the word about it. You can buy the course directly or purchase a subscription to Mapt and watch it there.
If you like my blog and would like to support it, spread the word (if not get a copy yourself)! Also, stay tuned for future courses I publish with Packt at the Video Courses section of my site.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Saddling up; getting on the hoRse for the first time
(This article was first published on Mango Solutions, and kindly contributed to Rbloggers)
Laura Swales, Marketing and Events Assistant
This year at Mango we’re proudly sponsoring the Bath Cats & Dogs
Home. To start our fundraising
for them, we decided to run a sweepstake on the Grand National. We asked
for £2 per horse, which would go to the cats and dogs home and the
winner was promised a bottle of wine for their charitable efforts.
Working in a Data Science company I knew that I couldn’t simply pick
names out of a hat for the sweepstake, ‘That’s not truly random!’ they
would cry. So in my hour of need, I turned to our two university
placement students Owen and Caroline to help me randomise the names in
R.
To use an appropriate horsebased metaphor, I would class myself as a
‘nonstarter’ in R – I’m not even near the actual race! My knowledge is
practically nonexistent (‘Do you just type alot of random letters?’)
and up until this blog I didn’t even have RStudio on my laptop.
We began by creating a list of the people who had entered the
sweepstake. With some people betting on more than one horse their name
was entered as many times as needed to correlate to how many bets they
had laid down.
I had now associated all the names with the object called people_list.
Next I created an object that contained numbers 140 to represent each
horse.
With the two sets of values ready to go, I wanted to display them in a
table format to make it easier to match names and numbers.
Now the data appeared in a table, but had not been randomised. To do
this I used the sample function to jumble up the people_list names.
Success! I had a list of numbers (140) representing the horses and a
randomly jumbled up list of those taking part in the sweepstake.
At the time of writing (In RMarkdown!), unfortunately fate had randomly
selected me the favourite to win. As you can imagine, this is something
that will not make you popular in the office.
I hope you enjoyed my first attempt in R. I will definitely use it again
to randomise our next sweepstake, though under intense supervision. I
can still hear the cries of ‘FIX!’ around the office. It’s always an
awkward moment when you win your own sweepstake…
Despite the controversy, it was fun to try out R in an accessible way
and it helped me understand some of the basic functions available.
Perhaps I’ll sit in on the next LondonR
workshop and learn some more!
If you’d like to find out more about the Bath Cats & Dogs Home please
visit here.
To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Upcoming speaking engagments
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
I have a couple of public appearances coming up soon.
 The East Bay R Language Beginners Group: Preparing Datasets – The Ugly Truth & Some Solutions, Tuesday, May 1, 2018 at Robert Half Technologies, 1999 Harrison Street, Oakland, CA, 94612.
 Official May 2018 BARUG Meeting: rquery: a Query Generator for Working With SQL Data, Tuesday, May 8, 2018 at Intuit, Building 20
2600 Marine Way · Mountain View, CA.
vtreat
Preparing Datasets – The Ugly Truth & Some Solutions is a great idea of Jim Porzak’s. Jim will speak on problems one is likely to encounter in trying to use real world data for predictive modeling and then I will speak on how the vtreat package helps address these issues. vtreat systematizes a number of routine domain independent data repairs and preparations, leaving you more time to work on important domain specific issues (plus it has citable documentation, helping make your methodology section smaller).
vtreat is the best way to prepare messy real world data for predictive modeling.
rqueryrquery: a Query Generator for Working With SQL Data
is an introduction to the rquery query generator system. rquery is a new R package that builds “pipeable SQL” and includes a number of very powerful data operators and analyses. It includes a number of very neat features, including query pipeline diagrams.
We think rquery (plus cdata) is going to be the best way (easiest to learn, most expressive, easiest to maintain, and most performant) method to use R to manipulate data at scale (SQL databases and Spark).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
15 Jobs for R users from around the world (20180419)
Just visit this link and post a new R job to the R community.
You can post a job for free (and there are also “featured job” options available for extra exposure).
Current R jobsJob seekers: please follow the links below to learn more and apply for your R job of interest:
Featured Jobs
 FullTime
 Associate, Data Analytics National Network for Safe Communities (NNSC>) – Posted by nnsc
 New York New York, United States
 18 Apr 2018

 FullTime
 Senior Research Associate Kellogg Research Support – Posted by lorbos
 Evanston Illinois, United States
 17 Apr 2018

 FullTime
 Empirical Research Fellow Kellogg School of Management (Northwestern University)– Posted by lorbos
 Evanston Illinois, United States
 17 Apr 2018

 Freelance
 Dashboard/Visualization developer in R/Shiny International Finance Corporation – Posted by Oleksiy
 Anywhere
 9 Apr 2018

 PartTime
 Data Scientist Alchemy – Posted by Andreas Voniatis
 Anywhere
 7 Apr 2018

 FullTime
 Product Owner (Data Science) (m/f) Civey GmbH – Posted by Civey
 Berlin Berlin, Germany
 6 Apr 2018

 FullTime
 R Engineer (Data Science) (m/f) Civey GmbH – Posted by Civey
 Berlin Berlin, Germany
 6 Apr 2018

 FullTime
 Solutions Engineer RStudio – Posted by nwstephens
 Anywhere
 3 Apr 2018

 FullTime
 Associate, Data Analytics National Network for Safe Communities (NNSC>) – Posted by nnsc
 New York New York, United States
 18 Apr 2018

 FullTime
 Senior Research Associate Kellogg Research Support – Posted by lorbos
 Evanston Illinois, United States
 17 Apr 2018

 FullTime
 Empirical Research Fellow Kellogg School of Management (Northwestern University)– Posted by lorbos
 Evanston Illinois, United States
 17 Apr 2018

 FullTime
 Discrete Choice Modeler @ Chicago, Illinois, U.S. Resource Systems Group – Posted by patricia.holland@rsginc.com
 Chicago Illinois, United States
 13 Apr 2018

 FullTime
 R&D Database Developer @ Toronto, Canada Crescendo Technology Ltd – Posted by Crescendo
 Toronto Ontario, Canada
 12 Apr 2018

 FullTime
 Applied Statistics Position @ Florida U.S. New College of Florida – Posted by Jobs@New College
 Sarasota Florida, United States
 9 Apr 2018

 Freelance
 Dashboard/Visualization developer in R/Shiny International Finance Corporation – Posted by Oleksiy
 Anywhere
 9 Apr 2018

 FullTime
 PhD Fellowship @ Wallace University, US Ubydul Haque – Posted by Ubydul
 Khon Kaen Chang Wat Khon Kaen, Thailand
 9 Apr 2018

 PartTime
 Data Scientist Alchemy – Posted by Andreas Voniatis
 Anywhere
 7 Apr 2018

 FullTime
 Product Owner (Data Science) (m/f) Civey GmbH – Posted by Civey
 Berlin Berlin, Germany
 6 Apr 2018

 FullTime
 R Engineer (Data Science) (m/f) Civey GmbH – Posted by Civey
 Berlin Berlin, Germany
 6 Apr 2018

 FullTime
 Solutions Engineer RStudio – Posted by nwstephens
 Anywhere
 3 Apr 2018

 FullTime
 PostDoctoral Fellow – Data Scientist @ CIMMYT (eastern Africa) International Maize and Wheat Improvement Center – Posted by cimmytjobs
 Marsabit County Kenya
 27 Mar 2018

 Freelance
 Statistician / R Developer – for Academic Statistical Research Academic Research – Posted by empiricus
 Anywhere
 27 Mar 2018

 FullTime
 Graduate Data Scientist JamieAi – Posted by JamieAi
 Anywhere
 27 Mar 2018
In Rusers.com you can see all the R jobs that are currently available.
Rusers Resumes
Rusers also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.
(you may also look at previous R jobs posts ).
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));Adding Zero Catches
(This article was first published on fishR Blog, and kindly contributed to Rbloggers)
IntroductionMuch of my work is with undergraduates who are first learning to analyze fisheries data. A common “learning opportunity” occurs when students are asked to compute the mean catch (or CPE), along with a standard deviation (SD), across multiple gear sets for each species. The learning opportunity occurs because some species will invariably not be caught in some gear sets. When the students summarize the number of fish caught for each species in each gear set those species not caught in a particular gear set will not “appear” in their data. Thus, when calculating the mean, the student will get the correct numerator (sum of catch across all gear sets) but not denominator (they use number of catches summed rather than total number of gear sets), which inflates (overestimates) the mean catch and (usually) deflates (underestimates) the SD of catches. Once confronted with this issue, they easily realize how to correct the mean calculation, but calculating the standard deviation is still an issue. These problems are exacerbated when using software to compute these summary statistics across many individual gear sets.
In software, the “trick” is to add a zero for each species not caught in a specific gear set that was caught in at least one of all of the gear sets. For example, if Bluegill were caught in at least one gear set but not in the third gear set, then a zero must be added as the catch of Bluegill in the third gear set. The addZeroCatch() function in the FSA package was an attempt to efficiently add these zeroes. This function has proven useful over the years, but I have become dissatisfied with its clunkiness. Additionally, I recently became aware of the complete() function in the tidyr package which holds promise for handling the same task. In this post, I explore the use of complete() for handling this issue.
This post requires the dplyr and tidyr packages. It also uses FSA behind the scenes.
library(FSA) # for mapvalues() library(dplyr) # for %>%, group_by(), summarize(), mutate(), right_join() library(tidyr) # for complete(), nesting() Example 1 – Very Simple DataIn this first example, the data consists of species and length recorded for each captured fish organized by the gear set identification number (ID) and held in the fishdat data.frame.
head(fishdat) ## ID species tl ## 1 1 BLG 148 ## 2 1 BLG 153 ## 3 1 BLG 147 ## 4 1 BLG 149 ## 5 1 BLG 144 ## 6 1 BLG 145The catch of each species in each gear set may be found using group_by() and summarize() with n(). Note that as.data.frame() is used simply to remove the tibble structure returned by group_by().1
catch < fishdat %>% group_by(ID,species) %>% summarize(num=n()) %>% as.data.frame() catch ## ID species num ## 1 1 BLG 10 ## 2 1 LMB 5 ## 3 1 YEP 5 ## 4 2 LMB 9 ## 5 2 YEP 7 ## 6 3 BLG 12 ## 7 3 YEP 7 ## 8 4 BLG 1 ## 9 4 LMB 11 ## 10 4 YEP 11 ## 11 5 BLG 9From this it is seen that three species (“BLG”, “LMB”, and “YEP”) were captured in all nets, but that “BLG” were not captured in “ID=2”, “LMB” were not captured in “ID=3”, and “LMB” and “YEP” were not captured in “ID=5”. The sample size, mean, and SD of catches per species from these data may be found by again using group_by() and summarize(). However, note that these calculations are INCORRECT because they do not include the zero catches of “BLG” in “ID=2”, “LMB” in “ID=3”, and “LMB” and “YEP” in “ID=5”. The problem is most evident in the sample sizes, which should be five (gear sets) for each species.
## Example of INCORRECT summaries because not using zeroes catch %>% group_by(species) %>% summarize(n=n(),mn=mean(num),sd=sd(num)) %>% as.data.frame() ## species n mn sd ## 1 BLG 4 8.000000 4.830459 ## 2 LMB 3 8.333333 3.055050 ## 3 YEP 4 7.500000 2.516611The complete() function can be used to add rows to a data.frame for variables (or combinations of variables) that should be present in the data.frame (relative to other values that are present) but are not. The complete() function takes a data.frame as its first argument (but will be “piped” in below with %>%) and the variable or variables that will be used to identify which items are missing. For example, with these data, a zero should be added to num for missing combinations defined by ID and species.
## Example of default complete ... see below to add zeroes, not NAs catch %>% complete(ID,species) %>% as.data.frame() ## ID species num ## 1 1 BLG 10 ## 2 1 LMB 5 ## 3 1 YEP 5 ## 4 2 BLG NA ## 5 2 LMB 9 ## 6 2 YEP 7 ## 7 3 BLG 12 ## 8 3 LMB NA ## 9 3 YEP 7 ## 10 4 BLG 1 ## 11 4 LMB 11 ## 12 4 YEP 11 ## 13 5 BLG 9 ## 14 5 LMB NA ## 15 5 YEP NAFrom this result, it is seen that complete() added a row for “BLG” in “ID=2”, “LMB” in “ID=3”, and “LMB” and “YEP” in “ID=5”, as we had hoped. However, complete() adds NAs by default. The value to add can be changed with fill=, which takes a list that includes the name of the variable to which the NAs were added (num in this case) set equal to the value to be added (0 in this case).
catch < catch %>% complete(ID,species,fill=list(num=0)) %>% as.data.frame() catch ## ID species num ## 1 1 BLG 10 ## 2 1 LMB 5 ## 3 1 YEP 5 ## 4 2 BLG 0 ## 5 2 LMB 9 ## 6 2 YEP 7 ## 7 3 BLG 12 ## 8 3 LMB 0 ## 9 3 YEP 7 ## 10 4 BLG 1 ## 11 4 LMB 11 ## 12 4 YEP 11 ## 13 5 BLG 9 ## 14 5 LMB 0 ## 15 5 YEP 0These correct catch data can then be summarized as above to show the correct sample size, mean, and SD of catches per species.
catch %>% group_by(species) %>% summarize(n=n(),mn=mean(num),sd=sd(num)) %>% as.data.frame() ## species n mn sd ## 1 BLG 5 6.4 5.504544 ## 2 LMB 5 5.0 5.049752 ## 3 YEP 5 6.0 4.000000 Example 2 – Multiple Values to Receive ZeroesSuppose that the fish data included a column that indicates whether the fish was marked and returned to the waterbody or not.
head(fishdat2) ## ID species tl marked ## 1 1 BLG 148 YES ## 2 1 BLG 153 YES ## 3 1 BLG 147 no ## 4 1 BLG 149 no ## 5 1 BLG 144 no ## 6 1 BLG 145 noThe catch and number of fish marked and returned per gear set ID and species may again be computed with group_by() and summarize(). Note, however, the use of ifelse() to use a 1 if the fish was marked and a 0 if it was not. Summing these values returns the number of fish that were marked. Giving this data.frame to complete() as before will add zeroes for both the num and nmarked variables as long as both are included in the list given to fill=.
catch2 < fishdat2 %>% group_by(ID,species) %>% summarize(num=n(), nmarked=sum(ifelse(marked=="YES",1,0))) %>% complete(ID,species,fill=list(num=0,nmarked=0)) %>% as.data.frame() catch2 ## ID species num nmarked ## 1 1 BLG 10 2 ## 2 1 LMB 5 2 ## 3 1 YEP 5 3 ## 4 2 BLG 0 0 ## 5 2 LMB 9 6 ## 6 2 YEP 7 2 ## 7 3 BLG 12 4 ## 8 3 LMB 0 0 ## 9 3 YEP 7 6 ## 10 4 BLG 1 0 ## 11 4 LMB 11 5 ## 12 4 YEP 11 8 ## 13 5 BLG 9 4 ## 14 5 LMB 0 0 ## 15 5 YEP 0 0 Example 3 – More Information that Does Not Get ZeroesSuppose that a data.frame called geardat contains information specific to each gear set.
geardat ## ID mon year lake run effort ## 1 1 May 2018 round 1 1.34 ## 2 2 May 2018 round 2 1.87 ## 3 3 May 2018 round 3 1.56 ## 4 4 May 2018 twin 1 0.92 ## 5 5 May 2018 twin 2 0.67And, for the purposes of this example, let’s suppose that we have summarized catch data WITHOUT the zeroes having been added.
catch3 < fishdat2 %>% group_by(ID,species) %>% summarize(num=n(), nmarked=sum(ifelse(marked=="YES",1,0))) %>% as.data.frame() catch3 ## ID species num nmarked ## 1 1 BLG 10 2 ## 2 1 LMB 5 2 ## 3 1 YEP 5 3 ## 4 2 LMB 9 6 ## 5 2 YEP 7 2 ## 6 3 BLG 12 4 ## 7 3 YEP 7 6 ## 8 4 BLG 1 0 ## 9 4 LMB 11 5 ## 10 4 YEP 11 8 ## 11 5 BLG 9 4Finally, suppose that these summarized catch data are joined with the gear data such that the gear set specific information is shown with each catch.
catch3 < right_join(geardat,catch3,by="ID") catch3 ## ID mon year lake run effort species num nmarked ## 1 1 May 2018 round 1 1.34 BLG 10 2 ## 2 1 May 2018 round 1 1.34 LMB 5 2 ## 3 1 May 2018 round 1 1.34 YEP 5 3 ## 4 2 May 2018 round 2 1.87 LMB 9 6 ## 5 2 May 2018 round 2 1.87 YEP 7 2 ## 6 3 May 2018 round 3 1.56 BLG 12 4 ## 7 3 May 2018 round 3 1.56 YEP 7 6 ## 8 4 May 2018 twin 1 0.92 BLG 1 0 ## 9 4 May 2018 twin 1 0.92 LMB 11 5 ## 10 4 May 2018 twin 1 0.92 YEP 11 8 ## 11 5 May 2018 twin 2 0.67 BLG 9 4These data simulate what might be seen from a flat database.
With these data, zeroes still need to be added as defined by missing combinations of ID and species. However, if only these two variables are included in complete() then NAs, zeroes, or something else (if we define that) will be added mon, year, lake, run, and effort, which is not desired. These five variables are “nested” with the ID variable (i.e., if you know ID then you know these variables) and should be treated as a group. Nesting of variables can be handled in complete() by including these variables within nesting().
catch3 %>% complete(nesting(ID,mon,year,lake,run,effort),species, fill=list(num=0,nmarked=0)) %>% as.data.frame() ## ID mon year lake run effort species num nmarked ## 1 1 May 2018 round 1 1.34 BLG 10 2 ## 2 1 May 2018 round 1 1.34 LMB 5 2 ## 3 1 May 2018 round 1 1.34 YEP 5 3 ## 4 2 May 2018 round 2 1.87 BLG 0 0 ## 5 2 May 2018 round 2 1.87 LMB 9 6 ## 6 2 May 2018 round 2 1.87 YEP 7 2 ## 7 3 May 2018 round 3 1.56 BLG 12 4 ## 8 3 May 2018 round 3 1.56 LMB 0 0 ## 9 3 May 2018 round 3 1.56 YEP 7 6 ## 10 4 May 2018 twin 1 0.92 BLG 1 0 ## 11 4 May 2018 twin 1 0.92 LMB 11 5 ## 12 4 May 2018 twin 1 0.92 YEP 11 8 ## 13 5 May 2018 twin 2 0.67 BLG 9 4 ## 14 5 May 2018 twin 2 0.67 LMB 0 0 ## 15 5 May 2018 twin 2 0.67 YEP 0 0It is possible to have nesting with species as well. Suppose, for example, that the scientific name for the species was included in the original fishdata2 that was summarized (using a combination of the examples from above, but not shown here) to catch4.
catch4 ## ID mon year lake run effort species spsci num nmarked ## 1 1 May 2018 round 1 1.34 BLG Lepomis macrochirus 10 2 ## 2 1 May 2018 round 1 1.34 LMB Micropterus dolomieu 5 2 ## 3 1 May 2018 round 1 1.34 YEP Perca flavescens 5 3 ## 4 2 May 2018 round 2 1.87 LMB Micropterus dolomieu 9 6 ## 5 2 May 2018 round 2 1.87 YEP Perca flavescens 7 2 ## 6 3 May 2018 round 3 1.56 BLG Lepomis macrochirus 12 4 ## 7 3 May 2018 round 3 1.56 YEP Perca flavescens 7 6 ## 8 4 May 2018 twin 1 0.92 BLG Lepomis macrochirus 1 0 ## 9 4 May 2018 twin 1 0.92 LMB Micropterus dolomieu 11 5 ## 10 4 May 2018 twin 1 0.92 YEP Perca flavescens 11 8 ## 11 5 May 2018 twin 2 0.67 BLG Lepomis macrochirus 9 4The zeroes are then added to this data.frame making sure to note the nesting of species and spsci.
catch4 %>% complete(nesting(ID,mon,year,lake,run,effort), nesting(species,spsci), fill=list(num=0,nmarked=0)) %>% as.data.frame() ## ID mon year lake run effort species spsci num nmarked ## 1 1 May 2018 round 1 1.34 BLG Lepomis macrochirus 10 2 ## 2 1 May 2018 round 1 1.34 LMB Micropterus dolomieu 5 2 ## 3 1 May 2018 round 1 1.34 YEP Perca flavescens 5 3 ## 4 2 May 2018 round 2 1.87 BLG Lepomis macrochirus 0 0 ## 5 2 May 2018 round 2 1.87 LMB Micropterus dolomieu 9 6 ## 6 2 May 2018 round 2 1.87 YEP Perca flavescens 7 2 ## 7 3 May 2018 round 3 1.56 BLG Lepomis macrochirus 12 4 ## 8 3 May 2018 round 3 1.56 LMB Micropterus dolomieu 0 0 ## 9 3 May 2018 round 3 1.56 YEP Perca flavescens 7 6 ## 10 4 May 2018 twin 1 0.92 BLG Lepomis macrochirus 1 0 ## 11 4 May 2018 twin 1 0.92 LMB Micropterus dolomieu 11 5 ## 12 4 May 2018 twin 1 0.92 YEP Perca flavescens 11 8 ## 13 5 May 2018 twin 2 0.67 BLG Lepomis macrochirus 9 4 ## 14 5 May 2018 twin 2 0.67 LMB Micropterus dolomieu 0 0 ## 15 5 May 2018 twin 2 0.67 YEP Perca flavescens 0 0 CommentsThis is my first explorations with complete() and it looks promising for this task of adding zeroes to data frames of catch by gear set for gear sets in which a species was not caught. I will be curious to hear what others think of this function and how it might fit in their workflow.
Footnotes
I find the tibble structure to be annoying with simple data.frames like this. Thus, I usually use as.data.frame() to remove it.
To leave a comment for the author, please follow the link and comment on their blog: fishR Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
rename phylogeny tip labels in treeio
(This article was first published on R on Guangchuang YU, and kindly contributed to Rbloggers)
I don’t know whether ‘rename taxa’ is a common task or not. It seems not a good idea to rename taxa in Newick tree text, since it may introduce problems when mapping the original sequence alignment to the tree.
If you just want to show different or additional information when plotting the tree, it is fine and easy to do it using ggtree:
require(treeio) ## Loading required package: treeio require(ggtree) ## Loading required package: ggtree ## ggtree v1.11.6 For help: https://guangchuangyu.github.io/software/ggtree ## ## If you use ggtree in published research, please cite: ## Guangchuang Yu, David Smith, Huachen Zhu, Yi Guan, Tommy TsanYuk Lam. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 2017, 8(1):2836, doi:10.1111/2041210X.12628 tr < read.tree(text = "((a,(b,c)),d);") genus < c("Gorilla", "Pan", "Homo", "Pongo") species < c("gorilla", "spp.", "sapiens", "pygmaeus") geo < c("Africa", "Africa", "World", "Asia") d < data.frame(label = tr$tip.label, genus = genus, species = species, geo = geo) d ## label genus species geo ## 1 a Gorilla gorilla Africa ## 2 b Pan spp. Africa ## 3 c Homo sapiens World ## 4 d Pongo pygmaeus Asia ggtree(tr) %<+% d + xlim(NA, 5) + geom_tiplab(aes(label=paste0('italic(', genus, ')~bolditalic(', species, ')~', geo)), parse=T) ## Warning: package 'bindrcpp' was built under R version 3.4.4However, it is also possible to rename taxa of the tree object (either treedata or phylo) in treeio:
tr2 = rename_taxa(tr, d, label, genus) write.tree(tr2) ## [1] "((Gorilla,(Pan,Homo)),Pongo);" d2 = dplyr::mutate(d, newlab = paste(genus, species, sep='')) d2 ## label genus species geo newlab ## 1 a Gorilla gorilla Africa Gorillagorilla ## 2 b Pan spp. Africa Panspp. ## 3 c Homo sapiens World Homosapiens ## 4 d Pongo pygmaeus Asia Pongopygmaeus tr3 = rename_taxa(tr, d2, label, newlab) write.tree(tr3) ## [1] "((Gorillagorilla,(Panspp.,Homosapiens)),Pongopygmaeus);"If the input tree object is a treedata instance, you can use write.beast to export the tree with associated data to a BEAST compatible NEXUS file.
Vignettes var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R on Guangchuang YU. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Approximations of Pi: A Random Walk though the Beauty of Pi
(This article was first published on The Devil is in the Data, and kindly contributed to Rbloggers)
Various methods to approximate and visualise the digits of pi using the R computing language for statistics. The apparent randomness of the decimal expression of Pi is a source for beautiful visualisations. Continue reading →
The post Approximations of Pi: A Random Walk though the Beauty of Pi appeared first on The Devil is in the Data.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Arrow and beyond: Collaborating on next generation tools for open source data science
(This article was first published on RStudio Blog, and kindly contributed to Rbloggers)
Two years ago, Wes McKinney and Hadley Wickham got together to discuss some of the systems challenges facing the Python and R communities. Data science teams inevitably work with multiple languages and systems, so it’s critical that data flow seamlessly and efficiently between these environments. Wes and Hadley wanted to explore opportunities to collaborate on tools for improving interoperability between Python, R, and external compute and storage systems. This discussion led to the creation of the feather file format, a very fast ondisk format for storing data frames that can be read and written to by multiple languages.
Feather was a successful project, and has made it easier for thousands of data scientists and data engineers to collaborate across language boundaries. In this post, we want to update you on how we think about crosslanguage collaboration, and share some exciting new plans.
Beyond filebased interoperabilityFilebased interoperability is a great first step, but is fundamentally clunky: to communicate between R and Python running on the same computer, you have to save out from one and load into the other. What if there were some way to share data in memory without having to copy objects or round trip to disk?
You may have experienced a taste of this if you’ve tried the reticulate package. It makes it possible to use Python objects and functions from R. But reticulate is focused on solving only one part of the problem, for R and Python. It doesn’t help pass data from R to Julia, or Julia to Python, or Python to Apache Spark. What if there were some way to share data between multiple languages without having to write a translation layer between every pair of languages? That challenge is the inspiration for the Apache Arrow project, which defines a standardized, language independent, columnar memory format for analytics and data science.
A new data science runtimeThe Apache Arrow project has been making great progress, so we can now start to think about what could be built on top of that foundation. Modern hardware platforms provide huge opportunities for optimization (cache pipelining, CPU parallelism, GPUs, etc.), which should allow us to use a laptop to interactively analyze 100GB datasets. We should also be getting dramatically better performance when building models and visualizing data on smaller datasets.
We think that the time has come to build a modern data science runtime environment that takes advantage of the computational advances of the last 20 years, and can be used from many languages (in the same way that Project Jupyter has built an interactive data science environment that supports many languages). We don’t think that it makes sense to build this type of infrastructure for a single language, as there are too many difficult problems, and we need diverse viewpoints to solve them. Wes has been thinking and talking publicly about shared infrastructure for data science for some time, and recently RStudio and Wes have been talking about what we could do to begin making this a reality.
These discussions have culminated in a plan to work closely together on building a new data science runtime powered by Apache Arrow. What might this new runtime look like? Here are some of the things currently envisioned:

A core set of C++ shared libraries with bindings for each host language

Runtime inmemory format based on the Arrow columnar format, with auxiliary data structures that can be described by composing Arrow data structures

Reusable operator “kernel” containing functions utilizing Arrow format as input and output. This includes pandasstyle array functions, as well as SQLstyle relational operations (joins, aggregations, etc.)

Multithreaded graph dataflowbased execution engine for efficient evaluation of lazy data frame expressions created in the hostlanguage

Subgraph compilation using LLVM; optimization of common operator patterns

Support for userdefined operators and function kernels

Comprehensive interoperability with existing data representations (e.g., data frames in R, pandas / NumPy in Python)

New frontend interfaces for host languages (e.g., dplyr and other “tidy” front ends for R, evolution of pandas for Python)
When you consider the scope and potential impact of the project, it’s hopefully easy to see why language communities need to come together around making it happen rather than work in their own silos.
Ursa LabsToday, Wes has announced Ursa Labs, an independent opensource development lab that will serve as the focal point for the development of a new crosslanguage data science runtime powered by Apache Arrow. Ursa Labs isn’t a startup company and won’t have its own employees. Instead, a variety of individuals and organizations will contribute to the effort.
RStudio will serve as a host organization for Ursa Labs, providing operational support and infrastructure (e.g., help with hiring, DevOps, QA, etc.) which will enable Wes and others to dedicate 100% of their time and energy to creating great opensource software.
Hadley will be a key technical advisor to Ursa, and collaborate with Wes on the design and implementation of the data science runtime. Hadley and his team will also build a dplyr back end, as well as other tidy interfaces to the new system.
It might sound strange to hear that Wes, who is so closely associated with Python, will be working with RStudio. It might also sound strange that RStudio will be investing in tools that are useful for R and Python users alike. Aren’t these languages and tools out to succeed at each other’s expense? That’s not how we see it. Rather, we are inspired to work together by the common desire to make the languages our users love more successful. Languages are vocabularies for interacting with computation, and like human vocabularies, are rich and varied. We succeed as tool builders by understanding the users that embrace our languages, and by building tools perfectly suited to their needs. That’s what Wes, Hadley, and RStudio have been doing for many years, and we think everyone will be better off if we do it together!
We are tremendously excited to see the fruits of this work, and to continue R’s tradition of providing fluent and powerful interfaces to stateoftheart computational environments. Check out the Ursa Labs website for additional details and to find out how you can get involved with the project!
.title { fontsize: 1.0em; } var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Yet Another Caret Workshop
(This article was first published on Computational Social Science, and kindly contributed to Rbloggers)
IntroYesterday I gave a workshop on applied predictive modelling with caret at the 1st LSE Computational Social Science hackathon. Organiser privileges. I put together some introductory code and started a simple GitHub repo for the participants, so I thought I’d share it here as well. This is not supposed to cover all aspects of caret (plus there is already this), but more of a starterpack for those who might be migrating from Python or another machine learning library like mlr. I have also saved the environment as caret.rdata, so that the participants can load it up during the workshop (insert harrowing experience about live coding) and follow through—that’s included in the repo too if you rather have a test run first.
The DataLet’s start by creating some synthetic data using caret. The twoClassSim generates a dataset suitable for binaryoutcomes:
dat < twoClassSim(n = 1000, #number of rows linearVars = 2, #linearly important variables noiseVars = 5, #uncorrelated irrelevant variables corrVars = 2, #correlated irrelevant variables mislabel = .01) #percentage possibly mislabeled colnames(dat) ## [1] "TwoFactor1" "TwoFactor2" "Linear1" "Linear2" "Nonlinear1" ## [6] "Nonlinear2" "Nonlinear3" "Noise1" "Noise2" "Noise3" ## [11] "Noise4" "Noise5" "Corr1" "Corr2" "Class"The above chunk simulates a dataframe with 1000 rows containing 15 variables:
 Class: Binary outcome (Class)
 TwoFactor: Correlated multivariate normal predictors (TwoFactor1, TwoFactor2)
 Nonlinear: Uncorrelated random uniform predictors (NonLinear1, …, Nonlinear3)
 Linear: (Optional) uncorrelated standard normal predictors (Linear1, Linear2)
 Noise: (Optional) uncorrelated standard normal predictors (Noise1, … , Noise5)
 Correlated: (Optional) correlated multivariate normal predictors (Corr1, Corr2)
We can take a closer look at the variables using two packages: skimr and xray. Both have functions that provide a snapshot of your covariates in an easytounderstand output:
skim(dat) #unintended ## Skim summary statistics ## n obs: 1000 ## n variables: 15 ## ## Variable type: factor ## variable missing complete n n_unique top_counts ordered ## Class 0 1000 1000 2 Cla: 567, Cla: 433, NA: 0 FALSE ## ## Variable type: numeric ## variable missing complete n mean sd p0 p25 median ## Corr1 0 1000 1000 0.078 1 3.19 0.76 0.11 ## Corr2 0 1000 1000 0.016 1 2.91 0.71 0.014 ## Linear1 0 1000 1000 0.016 1 3.29 0.68 0.054 ## Linear2 0 1000 1000 0.023 1.02 3.31 0.66 0.053 ## Noise1 0 1000 1000 0.022 1 3.46 0.71 0.023 ## Noise2 0 1000 1000 0.048 0.99 3.52 0.74 0.042 ## Noise3 0 1000 1000 0.016 0.97 3.11 0.64 0.014 ## Noise4 0 1000 1000 0.015 1.02 3.48 0.71 0.028 ## Noise5 0 1000 1000 0.036 0.94 3.03 0.59 0.072 ## Nonlinear1 0 1000 1000 0.0086 0.58 1 0.49 0.038 ## Nonlinear2 0 1000 1000 0.49 0.29 7.8e05 0.25 0.5 ## Nonlinear3 0 1000 1000 0.51 0.29 5e04 0.26 0.52 ## TwoFactor1 0 1000 1000 0.086 1.36 4.28 0.96 0.085 ## TwoFactor2 0 1000 1000 0.042 1.39 4.63 1.03 0.0095 ## p75 p100 hist ## 0.58 3.58 ▁▂▆▇▇▂▁▁ ## 0.67 3.57 ▁▂▅▇▆▂▁▁ ## 0.71 3.02 ▁▁▃▇▇▅▂▁ ## 0.69 3.88 ▁▂▅▇▇▂▁▁ ## 0.64 3.22 ▁▁▃▇▇▅▁▁ ## 0.61 3.39 ▁▁▅▇▇▅▁▁ ## 0.64 3.77 ▁▁▅▇▅▂▁▁ ## 0.67 3.17 ▁▁▃▇▇▅▁▁ ## 0.71 3.1 ▁▁▅▇▇▅▁▁ ## 0.5 1 ▇▇▇▇▇▇▇▇ ## 0.73 1 ▇▇▇▇▇▇▇▆ ## 0.76 1 ▇▇▇▆▇▇▇▇ ## 0.81 3.83 ▁▂▃▇▇▅▂▁ ## 0.89 4.79 ▁▁▅▇▇▃▁▁You should also try out xray::anomalies(dat) and see which output you prefer. Because our data is synthetic, we have these nice bell curves and normal distributions that are harder to locate in the wild.
Let’s split the data into train/test using an index of row numbers:
index < createDataPartition(y = dat$Class, p = .7, list = FALSE) training < dat[index, ] test < dat[index, ]First, we supply the outcome variable y so that caret can take it into account when creating the split (in terms of classbalance). We use 70% of the data for training and hold out the remaining 30% for testing later. We want a vector instead of a list so we convey this to R by overriding the default behaviour. The actual splitting happens when we subset using the index we just created; the selected row numbers generate the training data whereas the rest goes to the test (using negative indexing).
trainControlThe magic of caret happens in the control arguments. Default arguments tend to cater to regression problems; given our focus on classification, I only briefly mention the former here:
reg.ctrl < trainControl(method = "repeatedcv", number = 10, repeats = 5, allowParallel = TRUE)We now have a trainControl object that will signal a tenk fold (repeated 5 times; so 50 resamples in total) to the train function. Classification controls require several more arguments:
cls.ctrl < trainControl(method = "repeatedcv", #boot, cv, LOOCV, timeslice OR adaptive etc. number = 10, repeats = 5, classProbs = TRUE, summaryFunction = twoClassSummary, savePredictions = "final", allowParallel = TRUE)There is a good variety of crossvalidation methods you can choose in caret, which I will not cover here. classProbs computes class probabilities for each resample. We need to set the summary function to twoClassSummary for binary classification. Finally, we set save predictions to TRUE—note that this is not a classificationspecific argument; we didn’t have it in the regression controls because we won’t be covering them here in detail.
For future reference, there are several other useful arguments that you can call within trainControl. For example, you can evoke subsampling using sampling if you have classimbalance. You can set seeds for each resample for perfect reproducibility. You can also define your own indices (index) for resampling purposes.
Model FittingWe’ll start with a placeholder regression example for completeness. You should always set the seed before calling train. caret accepts the formula interface if you supply the data later. Below, we arbitrarily select one of the linear variables as the outcome, and fit the rest of the variables as predictors using the dot indicator:
set.seed(1895) lm.fit < train(Linear1 ~ ., data = training, trControl = reg.ctrl, method = "lm") lm.fit ## Linear Regression ## ## 701 samples ## 14 predictor ## ## No preprocessing ## Resampling: CrossValidated (10 fold, repeated 5 times) ## Summary of sample sizes: 630, 632, 632, 630, 631, 630, ... ## Resampling results: ## ## RMSE Rsquared MAE ## 0.9995116 0.01280926 0.8039185 ## ## Tuning parameter 'intercept' was held constant at a value of TRUEProbably not the most amazing \(R^2\) value you have ever seen, but that’s alright. Note that calling the model fit displays the most crucial information in a succinct way.
Let’s move on to a classification algorithm. It’s good practice to start with a logistic regression and take it from there. In R, logistic regression is in the glm framework and can be specified by calling family = "binomial". We set the performance metric to ROC as the default metric is accuracy. Accuracy tends to be unreliable when you have classimbalance. A classic example is having one positive outcome and 99 negative outcomes; any lazy algorithm predicting all zeros would be 99% accurate (but it would be uninformative as a result). The Receiver Operating Characteristic—a proud member of the good ol’ WWII school of naming things—provides a better performance metric by taking into account the rate of true positives and true negatives. Finally, we apply some preprocessing to our data by passing a set of strings: we drop nearzero variance variables, as well as centring and scaling all covariates:
set.seed(1895) glm.fit < train(Class ~ ., data = training, trControl = cls.ctrl, method = "glm", family = "binomial", metric = "ROC", preProcess = c("nzv", "center", "scale"))For reference, you can also vectorise your x and y if you find it easier to read:
y < training$Class predictors < training[,which(colnames(training) != "Class")] #Same logit fit set.seed(1895) glm.fit < train(x = predictors, y = y, trControl = cls.ctrl, method = "glm", family = "binomial", metric = "ROC", preProcess = c("nzv", "center", "scale")) glm.fit ## Generalized Linear Model ## ## 701 samples ## 14 predictor ## 2 classes: 'Class1', 'Class2' ## ## Preprocessing: centered (14), scaled (14) ## Resampling: CrossValidated (10 fold, repeated 5 times) ## Summary of sample sizes: 630, 630, 630, 631, 631, 631, ... ## Resampling results: ## ## ROC Sens Spec ## 0.8639053 0.8565385 0.7206237You can quickly find out which variables contribute to predictive accuracy:
varImp(glm.fit) ## glm variable importance ## ## Overall ## TwoFactor1 100.000 ## TwoFactor2 97.267 ## Linear2 73.712 ## Nonlinear1 25.964 ## Linear1 12.900 ## Nonlinear2 12.261 ## Corr1 8.758 ## Noise2 7.424 ## Corr2 6.116 ## Nonlinear3 5.798 ## Noise1 5.371 ## Noise3 4.306 ## Noise5 2.668 ## Noise4 0.000 plot(varImp(glm.fit))Let’s fit a couple of other models before moving on. One common choice would be the elastic net. Elastic net relies on L1 and L2 regularisations and it’s basically a mix of both: the former shrinks some variable coefficients to zero (so that they are dropped out; i.e. feature selection/dimensionality reduction), whereas the latter penalises coefficient size. In R, it has two hyperparameters that can be tuned: alpha and lambda. Alpha controls the type of regression; 0 representing Ridge and 1 denoting LASSO (Least Absolute Shrinkage and Selector Operator). Lambda, on the other hand, determines the penalty amount. Note that the expand.grid function actually just creates a dataset with two columns called alpha and lambda, which are then used for the model fit based on the valuepairs in each row.
set.seed(1895) glmnet.fit < train(x = predictors, y = y, trControl = cls.ctrl, method = "glmnet", metric = "ROC", preProcess = c("nzv", "center", "scale"), tuneGrid = expand.grid(alpha = 0:1, lambda = seq(0.0001, 1, length = 20)))Because it has tuneable parameters, we can visualise their performance by calling plot on the model fit:
plot(glmnet.fit)where the two colours denote the alpha level and the dots are the specified lambda values.
Finally, let’s fit a Random Forest using the ranger package, which is a fast C++ implementation of the original algorithm in R:
set.seed(1895) rf.fit < train(Class ~ ., data = training, trControl = cls.ctrl, method = "ranger", metric = "ROC", preProcess = c("nzv", "center", "scale")) confusionMatrix(rf.fit) ## CrossValidated (10 fold, repeated 5 times) Confusion Matrix ## ## (entries are percentual average cell counts across resamples) ## ## Reference ## Prediction Class1 Class2 ## Class1 51.2 7.2 ## Class2 5.4 36.1 ## ## Accuracy (average) : 0.8739This is all good, as we are fitting different algorithms using a simple interface without needing to memorise the idiosyncrasies of each package. However, we are still writing lots of redundant code. caretList provides this functionality:
set.seed(1895) models < caretList(Class ~ ., data = training, trControl = cls.ctrl, metric = "ROC", tuneList = list(logit = caretModelSpec(method = "glm", family = "binomial"), elasticnet = caretModelSpec(method = "glmnet", tuneGrid = expand.grid(alpha = 0:1, lambda = seq(0.0001, 1, length = 20))), rf = caretModelSpec(method = "ranger")), preProcess = c("nzv", "center", "scale"))We basically just merged the first three model fits into a single call using tuneList, which requires a list of model specifications. If we want to predict using unseen data, we can now get predictions from all three models:
models.preds < lapply(models, predict, newdata = test) #add type = "prob" for class probabilities models.preds < data.frame(models.preds) head(models.preds, 10) ## logit elasticnet rf ## 1 Class1 Class1 Class1 ## 2 Class2 Class2 Class2 ## 3 Class1 Class1 Class2 ## 4 Class2 Class2 Class1 ## 5 Class1 Class1 Class2 ## 6 Class2 Class2 Class2 ## 7 Class2 Class2 Class2 ## 8 Class1 Class1 Class2 ## 9 Class1 Class1 Class1 ## 10 Class1 Class1 Class1The resamples function collects all the resampling data from all models and allows you to easily assess insample performance metrics:
bwplot(resamples(models)) #try dotplot as wellAveraged over all resamples, the Random Forest algorithm has the highest ROC value, however the whiskers overlap in all three categories—perhaps a larger number of resamples are needed for significant separation. It also outperforms other two algorithms when it comes to detecting true positives and true negatives. Note that often the results will not be this clear; it’s common for an algorithm to do really well in one area and perform terribly in the other.
We could also create a simple linear ensemble using the three model fits. You can check whether the model predictions are linearly correlated:
modelCor(resamples(models)) ## logit elasticnet rf ## logit 1.0000000 0.9996123 0.1135033 ## elasticnet 0.9996123 1.0000000 0.1141322 ## rf 0.1135033 0.1141322 1.0000000Seems like logit and elastic net predictions are more or less identical, meaning one of them is redundant:
xyplot(resamples(models))And now for the ensemble:
set.seed(1895) greedy_ensemble < caretEnsemble(models, metric = "ROC", trControl = cls.ctrl) summary(greedy_ensemble) ## The following models were ensembled: logit, elasticnet, rf ## They were weighted: ## 4.9237 44.2633 42.2176 8.0329 ## The resulting ROC is: 0.9517 ## The fit for each individual model on the ROC is: ## method ROC ROCSD ## logit 0.8618236 0.03919733 ## elasticnet 0.8614454 0.03923875 ## rf 0.9469568 0.02237265The ROC of the ensemble (0.9517) is higher than any individual model, however the Random Forest algorithm by itself provides similar levels of accuracy (0.9469).
Feature SelectionAs an extra, I’ll briefly cover several feature selection wrapper functions that are available in caret.
Recursive Feature EliminationRFE works by passing a vector of subsets consisting of different number of variables to be used in model fitting. For example, because we only have 14 variables in our dataset (excluding the outcome), we can try all numbers from one to 14. With all three feature selection algorithms, we will need to change the summary function to twoClassSummary for classification purposes.
subsets < c(1:length(training)) lrFuncs$summary < twoClassSummaryWe need to pass an additional control function specifically for the RFE. We select the linear regression wrapper (lrFuncs), and choose bootstrapped crossvalidation (25). After that, we call the rfe function:
rfe.ctrl = rfeControl(functions = lrFuncs, method = "boot", number = 25, allowParallel = TRUE, verbose = TRUE) set.seed(1895) rfe < rfe(x = predictors, y = y, sizes = subsets, metric = "ROC", rfeControl = rfe.ctrl) rfe ## ## Recursive feature selection ## ## Outer resampling method: Bootstrapped (25 reps) ## ## Resampling performance over subset size: ## ## Variables ROC Sens Spec ROCSD SensSD SpecSD Selected ## 1 0.6383 0.8529 0.4462 0.03546 0.03912 0.05763 ## 2 0.8123 0.8351 0.6355 0.03603 0.04063 0.04609 ## 3 0.8606 0.8723 0.6905 0.01693 0.03575 0.04339 ## 4 0.8610 0.8634 0.6958 0.01955 0.03571 0.04740 * ## 5 0.8596 0.8595 0.7011 0.01893 0.03560 0.04360 ## 6 0.8582 0.8574 0.7010 0.01925 0.03567 0.04386 ## 7 0.8565 0.8565 0.6983 0.01996 0.03595 0.04805 ## 8 0.8565 0.8557 0.7017 0.01907 0.03565 0.04726 ## 9 0.8561 0.8590 0.6993 0.01988 0.03484 0.04806 ## 10 0.8560 0.8569 0.7052 0.02019 0.03384 0.04739 ## 11 0.8560 0.8592 0.7048 0.02003 0.03353 0.04939 ## 12 0.8561 0.8558 0.7047 0.02014 0.03430 0.04612 ## 13 0.8561 0.8561 0.7051 0.02015 0.03413 0.04721 ## 14 0.8562 0.8555 0.7047 0.01999 0.03407 0.04688 ## ## The top 4 variables (out of 4): ## TwoFactor1, TwoFactor2, Linear2, Nonlinear1We see that the model with four variables (indicated with an asterisk) resulted in the highest ROC value and thus selected as the best model. The top variables are also displayed at the end. One approach would be fitting a model on the test data using these four variables only.
Simulated AnnealingSA works by starting with a certain number of variables and introducing small changes to them along the way. If the change results in an `upgrade’ (i.e. higher predictive accuracy), the initial candidate is abandoned in favour of the new solution. Unlike RFE, which is greedy—meaning, it only assesses the subset sizes once and moves on forever—SA can be programmed to go back and try again if it doesn’t find an improvement within a certain number of iterations (below we set this limit to 5). We can also pass which performance metrics to be used for both the internal and external processes, as well as defining the amount that will be held out (20%):
caretSA$fitness_extern < twoClassSummary safs.ctrl = safsControl(functions = caretSA, method = "boot", number = 10, metric = c(internal = "ROC", external = "ROC"), maximize = c(internal = TRUE, external = TRUE), holdout = .2, improve = 5, allowParallel = TRUE, verbose = TRUE)We can then fit the algorithm by calling safs:
sa < safs(x = predictors, y = y, iters = 10, method = "glm", family = "binomial", metric = "ROC", trControl = cls.ctrl, safsControl = safs.ctrl) ## + final SA ## 1 0.4597138 (3) ## 2 0.4597138>0.4663124 (3+1, 75.0%) * ## 3 0.4663124>0.4626936 (41, 75.0%) 0.9769874 A ## 4 0.4663124>0.6563141 (4+0, 60.0%) * ## 5 0.6563141>0.657809 (41, 75.0%) * ## 6 0.657809>0.7425797 (3+1, 75.0%) * ## 7 0.7425797>0.7398033 (4+1, 80.0%) 0.9741674 A ## 8 0.7425797>0.7417647 (4+0, 60.0%) 0.991258 A ## 9 0.7425797>0.7407108 (4+1, 50.0%) 0.9776039 A ## 10 0.7425797>0.7427323 (4+0, 60.0%) * ## + final modelCalling the object returns an informative summary of the whole process:
sa ## ## Simulated Annealing Feature Selection ## ## 701 samples ## 14 predictors ## 2 classes: 'Class1', 'Class2' ## ## Maximum search iterations: 10 ## Restart after 5 iterations without improvement (0.3 restarts on average) ## ## Internal performance values: ROC, Sens, Spec ## Subset selection driven to maximize internal ROC ## ## External performance values: ROC, Sens, Spec ## Best iteration chose by maximizing external ROC ## External resampling method: Bootstrapped (10 reps) ## Subsampling for internal fitness calculation: 20% ## ## During resampling: ## * the top 5 selected variables (out of a possible 14): ## Linear2 (70%), TwoFactor1 (70%), TwoFactor2 (70%), Nonlinear1 (60%), Corr1 (40%) ## * on average, 5.6 variables were selected (min = 3, max = 8) ## ## In the final search using the entire training set: ## * 4 features selected at iteration 10 including: ## TwoFactor1, Linear2, Noise3, Noise4 ## * external performance at this iteration is ## ## ROC Sens Spec ## 0.7539 0.8022 0.5891 Genetic AlgorithmLast but not least, we will cover genetic algorithms. Here, variables are put through pressures similar to that of natural selection. We keep the iteration and population sizes really, really low as the code chunks are only supposed to give you a working example of the process. These algorithms fit a lot of models, so always start with a small value and gradually increase the number of iterations/generations.
caretGA$fitness_extern < twoClassSummary gafs.ctrl = gafsControl(functions = caretGA, method = "boot", number = 10, metric = c(internal = "ROC", external = "ROC"), maximize = c(internal = TRUE, external = TRUE), holdout = .2, allowParallel = TRUE, genParallel = TRUE, verbose = TRUE) set.seed(1895) ga < gafs(x = predictors, y = y, iters = 5, popSize = 2, elite = 0, differences = TRUE, method = "glm", family = "binomial", metric = "ROC", trControl = cls.ctrl, gafsControl = gafs.ctrl) ## + final GA ## 1 0.8197197 (12) ## 2 0.8197197>0.820586 (12>12, 100.0%) * ## 3 0.820586>0.8211813 (12>12, 100.0%) * ## 4 0.8211813>0.8218709 (12>12, 100.0%) * ## 5 0.8218709>0.8215376 (12>12, 100.0%) ## + final modelSimilar to the previous algorithms, calling the final object provides a summary:
ga ## ## Genetic Algorithm Feature Selection ## ## 701 samples ## 14 predictors ## 2 classes: 'Class1', 'Class2' ## ## Maximum generations: 5 ## Population per generation: 2 ## Crossover probability: 0.8 ## Mutation probability: 0.1 ## Elitism: 0 ## ## Internal performance values: ROC, Sens, Spec ## Subset selection driven to maximize internal ROC ## ## External performance values: ROC, Sens, Spec ## Best iteration chose by maximizing external ROC ## External resampling method: Bootstrapped (10 reps) ## Subsampling for internal fitness calculation: 20% ## ## During resampling: ## * the top 5 selected variables (out of a possible 14): ## Linear2 (100%), Noise3 (100%), Nonlinear3 (100%), TwoFactor2 (100%), Corr2 (90%) ## * on average, 12.4 variables were selected (min = 10, max = 14) ## ## In the final search using the entire training set: ## * 12 features selected at iteration 1 including: ## TwoFactor1, TwoFactor2, Linear1, Nonlinear1, Nonlinear2 ... ## * external performance at this iteration is ## ## ROC Sens Spec ## 0.8475 0.8481 0.7055Given the small number of iterations used for SA and GA, we can’t really judge the quality of their results. However, running the algorithms for hundreds or thousands of iterations is not necessarily the best option either. As these algorithms focus on maximising insample ROC, given enough iterations, they will perfectly learn the specific noise of your dataset and will not generalise to unseen data (i.e. overfitting). As always, aim to leverage your domain knowledge and gradually increase the number of iterations until you see a divergence between training and test validation results.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Computational Social Science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...