
RStudio:addins part 1. – code reproducibility testing

Sat, 05/05/2018 - 16:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Contents
  1. Introduction
  2. Prerequisites and recommendations
  3. Step 1 – Creating a package
  4. Step 2 – Writing the first functions
  5. Step 3 – Setting up an addin
  6. Step 4 – Updating our DESCRIPTION and NAMESPACE
  7. What is next – Always paying our (technical) debts
  8. Wrapping up
  9. TL;DR – Just give me the package
  10. References
Introduction

This is the first post in the RStudio:addins series. The aim of the series is to walk the readers through creating an R package that will contain functionality for integrating useful addins into the RStudio IDE. At the end of this first article, your RStudio will be 1 useful addin richer.

The addin we will create in this article will let us run the script currently open in RStudio in R's vanilla mode via a keyboard shortcut, and then open a file with the script's output in RStudio.

This is useful for testing whether your script is reproducible by users that do not have the same start-up options as you (e.g. preloaded environment, site file, etc.), making it a good tool to test your scripts before sharing them.
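As a rough illustration (assuming a Unix-like R installation; the exact command is assembled programmatically by the makeCmd() function we will write in Step 2), the command executed by the addin will look something like this:

# Illustrative only: build the kind of command the addin will run
cmd <- paste(
  shQuote(file.path(R.home("bin"), "Rscript")), # e.g. /usr/lib/R/bin/Rscript
  shQuote("--vanilla"),                         # run without profile, site and environment files
  shQuote("my_script.R"),                       # the script currently open in RStudio
  ">", shQuote("output.txt")                    # redirect the script's output to a file
)
cmd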

If you want to get straight to the code, you can find it at https://gitlab.com/jozefhajnala/jhaddins.git

Prerequisites and recommendations

To make the most use of the series, you will need the following:

  1. R, ideally version 3.4.3 or more recent, 64bit
  2. RStudio IDE, ideally version 1.1.383 or more recent
  3. Also recommended:
  • git, for version control
  • TortoiseGit, convenient shell interface to git for those using Windows, with pretty icons and all
  4. Recommended R packages (install with install.packages("packagename"), via RStudio's Packages tab, or with the snippet below):
  • devtools – makes your development life easier
  • testthat – provides a framework for unit testing integrated into RStudio
  • roxygen2 – makes code documentation easy
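If you prefer the console over the Packages tab, the recommended packages can be installed in one go:

# Install the recommended helper packages from CRAN
install.packages(c("devtools", "testthat", "roxygen2"))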
Step 1 – Creating a package
  1. Use devtools::create to create a package (note that we will update more DESCRIPTION fields later; you can also choose any path you like, and it will be reflected in the name of the package)
devtools::create(
  path = "jhaddins",
  description = list("License" = "GPL-3")
)
  2. In RStudio or elsewhere navigate to the jhaddins folder and open the project jhaddins.Rproj (or the name of your project if you chose a different path)

  3. Run the first check and install the package

devtools::check()   # Ctrl+Shift+E or Check button on RStudio's build tab
devtools::install() # Ctrl+Shift+B or Install button on RStudio's build tab
  4. Optionally, initialize git for version control
devtools::use_git()

Step 2 – Writing the first functions

We will now write some functions into a file called makeCmd.R that will let us run the desired functionality:

  1. makeCmd to create a command executable via system or shell, with defaults set up for executing an R file specified by path
makeCmd <- function(path,
                    command = "Rscript",
                    opts = "--vanilla",
                    outputFile = NULL,
                    suffix = NULL,
                    addRhome = TRUE) {
  if (Sys.info()["sysname"] == "Windows") {
    qType <- "cmd2"
  } else {
    qType <- "sh"
  }
  if (isTRUE(addRhome)) {
    command <- file.path(R.home("bin"), command)
  }
  cmd <- paste(
    shQuote(command, type = qType),
    shQuote(opts, type = qType),
    shQuote(path, type = qType)
  )
  if (!is.null(outputFile)) {
    cmd <- paste(cmd, ">", shQuote(outputFile))
  }
  if (!is.null(suffix)) {
    cmd <- paste(cmd, suffix)
  }
  cmd
}
  2. executeCmd to execute a command
executeCmd <- function(cmd, intern = FALSE) {
  sysName <- Sys.info()["sysname"]
  stopifnot(
    is.character(cmd),
    length(cmd) == 1,
    sysName %in% c("Windows", "Linux")
  )
  if (sysName == "Windows") {
    shell(cmd, intern = intern)
  } else {
    system(cmd, intern = intern)
  }
}
  3. replaceTilde for Linux purposes
replaceTilde <- function(path) {
  if (substr(path, 1, 1) == "~") {
    path <- sub("~", Sys.getenv("HOME"), path, fixed = TRUE)
  }
  file.path(path)
}
  4. And finally the function which will be used for the addin execution – runCurrentRscript to retrieve the path to the currently active file in RStudio, run it, write the output to a file output.txt and open the file with the output.
runCurrentRscript <- function(
  path = replaceTilde(rstudioapi::getActiveDocumentContext()[["path"]]),
  outputFile = "output.txt") {
  cmd <- makeCmd(path, outputFile = outputFile)
  executeCmd(cmd)
  if (!is.null(outputFile) && file.exists(outputFile)) {
    rstudioapi::navigateToFile(outputFile)
  }
}

Step 3 – Setting up an addin

Now that we have all our functions ready, all we have to do is create a file addins.dcf under the inst/rstudio folder of our package. We specify the Name of the addin, write a nice Description of what it does and, most importantly, specify the Binding to the function we want to call:

creating addins.dcf under inst/rstudio

Name: runCurrentRscript
Description: Executes the currently open R script file via Rscript with --vanilla option
Binding: runCurrentRscript
Interactive: false

Now we can rebuild and install our package and in RStudio’s menu navigate to Tools -> Addins -> Browse Addins..., and there it is – our first addin. For the best experience, we can click the Keyboard Shortcuts... button and assign a keyboard shortcut to our addin for easy use.

setting an RStudio addin keyboard shortcut

Now just open an R script, hit our shortcut and voilà, our script gets executed via Rscript in vanilla mode.

Step 4 – Updating our DESCRIPTION and NAMESPACE

As our last steps, we should

  1. Update our DESCRIPTION file, listing rstudioapi under Imports, as our package needs it to work:
Package: jhaddins
Title: JH's RStudio Addins
Version: 0.0.0.9000
Authors@R: person("Jozef", "Hajnala", email = "jozef.hajnala@gmail.com", role = c("aut", "cre"))
Description: Useful addins to make RStudio even better.
Depends: R (>= 3.0.1)
Imports: rstudioapi (>= 0.7)
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
  2. Update our NAMESPACE by importing the functions from other packages that we are using, namely:
importFrom(rstudioapi, navigateToFile)
importFrom(rstudioapi, getActiveDocumentContext)

Now we can finally rebuild and install our package again and run a CHECK to see that we have no errors, warnings or notes telling us something is wrong. Make sure to use document = FALSE for now, since we have not documented our functions with roxygen2 yet.

devtools::install()              # Ctrl+Shift+B or Install button on RStudio's build tab
devtools::check(document = FALSE) # Ctrl+Shift+E or Check button on RStudio's build tab

What is next – Always paying our (technical) debts

In the next post of the series, we will pay our debt of

  • missing documentation for our functions, which will let us generate updates to our NAMESPACE automatically and give us nice help files so that we can read about our functions using ? (a minimal sketch is shown below)
  • and unit tests to help us sleep better knowing that our functions get tested!
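As a teaser, roxygen2 documentation boils down to a comment block placed right above the function definition, roughly like the sketch below (the tag wording is illustrative only, not the final documentation we will write in the next post):

#' Run the currently open R script via Rscript --vanilla
#'
#' @param path path to the R script to be executed
#' @param outputFile name of the file to which the script's output is written
#'
#' @importFrom rstudioapi getActiveDocumentContext navigateToFile
#' @export
runCurrentRscript <- function(path, outputFile = "output.txt") {
  # body as written in Step 2
}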
Wrapping up

We can quickly create an RStudio addin by:

  1. Creating an R package
  2. Writing a function in that package
  3. Creating an addins.dcf file in the inst/rstudio folder of our package
TL;DR – Just give me the package

Use git clone from: https://gitlab.com/jozefhajnala/jhaddins.git

References var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));


A Different Way To Think About Drawdown — Geometric Calmar Ratio

Fri, 05/04/2018 - 20:40

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will discuss the idea of the geometric Calmar ratio — a way to modify the Calmar ratio to account for compounding returns.

So, one thing that recently had me sort of annoyed in terms of my interpretation of the Calmar ratio is this: essentially, the way I interpret it is that it’s a back of the envelope measure of how many years it takes you to recover from the worst loss. That is, if a strategy makes 10% a year (on average), and has a loss of 10%, well, intuition serves that from that point on, on average, it’ll take about a year to make up that loss–that is, a Calmar ratio of 1. Put another way, it means that on average, a strategy will make money at the end of 252 trading days.

But, that isn’t really the case in all circumstances. If an investment manager is looking to create a small, meager return for their clients, and is looking to make somewhere between 5-10%, then sure, the Calmar ratio approximation and interpretation makes sense in that context. Or, it makes sense in the context of “every year, we withdraw all profits and deposit to make up for any losses”. But in the context of a hedge fund trying to create large, market-beating returns for its investors, those hedge funds can have fairly substantial drawdowns.

Citadel – one of the gold standards of the hedge fund industry – had a drawdown of more than 50% during the financial crisis, and of course, there was at least one fund that blew up in the storm-in-a-teacup volatility spike on Feb. 5 (https://www.reuters.com/article/us-usa-fund-volatility/exclusive-ljm-partners-shutting-its-doors-after-vol-mageddon-losses-in-u-s-stocks-idUSKCN1GC29H). In other words, if those guys were professionals, what does that make me? Or if I'm an amateur, what does that make them?

In any case, in order to recover from such losses, it’s clear that a strategy would need to make back a lot more than what it lost. Lose 25%? 33% is the high water mark. Lose 33%? 50% to get back to even. Lose 50%? 100%. Beyond that? You get the idea.
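The arithmetic behind those numbers is simply the gain required to climb back to the high-water mark after a given drawdown:

# Gain required to recover from a drawdown of dd (expressed as a fraction)
to_highwater <- function(dd) 1 / (1 - dd) - 1
to_highwater(c(0.25, 1/3, 0.5))
# [1] 0.3333333 0.5000000 1.0000000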

In order to capture this dynamic, we should write a new Calmar ratio to express this idea.

So here’s a function to compute the geometric calmar ratio:

require(PerformanceAnalytics)

geomCalmar <- function(r) {
  rAnn <- Return.annualized(r)
  maxDD <- maxDrawdown(r)
  toHighwater <- 1/(1 - maxDD) - 1
  out <- rAnn/toHighwater
  return(out)
}

So, let's compare how some symbols stack up. We'll take a high-volatility name (AMZN), the good old S&P 500 (SPY), and a very low volatility instrument (SHY).

require(quantmod) # for getSymbols() and Ad()

getSymbols(c('AMZN', 'SPY', 'SHY'), from = '1990-01-01')
rets <- na.omit(cbind(Return.calculate(Ad(AMZN)),
                      Return.calculate(Ad(SPY)),
                      Return.calculate(Ad(SHY))))
compare <- rbind(table.AnnualizedReturns(rets),
                 maxDrawdown(rets),
                 CalmarRatio(rets),
                 geomCalmar(rets))
rownames(compare)[6] <- "Geometric Calmar"
compare

The returns start from July 31, 2002. Here are the statistics.

                          AMZN.Adjusted SPY.Adjusted SHY.Adjusted
Annualized Return             0.3450000   0.09110000   0.01940000
Annualized Std Dev            0.4046000   0.18630000   0.01420000
Annualized Sharpe (Rf=0%)     0.8528000   0.48860000   1.36040000
Worst Drawdown                0.6525491   0.55189461   0.02231459
Calmar Ratio                  0.5287649   0.16498652   0.86861760
Geometric Calmar              0.1837198   0.07393135   0.84923475

For my own proprietary volatility trading strategy, a strategy which has a Calmar above 2 (interpretation: finger in the air means that you make a new equity high every six months in the worst case scenario), here are the statistics:

> CalmarRatio(stratRetsAggressive[[2]]['2011::'])
                 Close
Calmar Ratio  3.448497

> geomCalmar(stratRetsAggressive[[2]]['2011::'])
                      Close
Annualized Return  2.588094

Because losses compound, the geometric Calmar ratio will always be lower than the standard Calmar ratio – exactly what we should expect given the geometric nature of compounding returns.

I hope this gives readers some food for thought about re-evaluating the Calmar ratio.

Thanks for reading.

NOTES: registration for R/Finance 2018 is open. As usual, I’ll be giving a lightning talk, this time on volatility trading.

I am currently contracting and seeking networking opportunities, along with information about prospective full-time roles starting in July. Those interested in my skill set can feel free to reach out to me on LinkedIn here.



tidyposterior slides

Fri, 05/04/2018 - 18:10

(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

tidyposterior is an R package for comparing models based on their resampling statistics. There are a few case studies on the webpage to illustrate the process.

I gave a talk at the Open Data Science Conference (ODSC) yesterday. A pdf of the slides is here and the animated gif above is here.



Do you really need a multilevel model? A preview of powerlmm 0.4.0

Fri, 05/04/2018 - 14:55

(This article was first published on R Psychologist - R, and kindly contributed to R-bloggers)

In this post I will show some of the new simulation features that will be available in powerlmm 0.4.0. You can already install the dev version from GitHub.

# GitHub
devtools::install_github("rpsychologist/powerlmm")

The revamped simulation functions offer 3 major new features:

  • Compare multiple model formulas, including OLS models (no random effects).
  • Evaluate a “forward” or “backward” model selection strategy using LRT.
  • Apply a data transformation during the simulation; this makes it possible to compare e.g. an ANCOVA versus a 2-level LMM, or to write your own custom MNAR or MAR missing-data function.

I will cover these new functions in two examples.

library(powerlmm)
nsim <- 5000
cores <- 16

Example 1: Do I really need a LMM? 2-lvl LMM versus ANCOVA

This example will show both the new data_transform argument and the new support for multiple formulas, including formulas without random effects. Each formula is fit to the same data set (or a transformed version of it) during the simulation. Let's assume we'll do a trial where we measure patients for 11 weeks, from baseline to week 10. We can analyze such data in many ways; as an example, we will compare 3 popular models:

  • t-test: group differences at posttest
  • ANCOVA: group differences at posttest adjusting for pretest values
  • LMM: group difference in change over time using a 2-level linear-mixed effects model with a random intercept and slope

To make the LMM more similar to the ANCOVA we also fit a constrained model where we assume there is no group difference at baseline (which there isn't). Let's set up the models and run the simulation; we will try different amounts of random slope variance relative to the within-subjects variance (var_ratio), with or without dropout (30 % at posttest).

p <- study_parameters(n1 = 11,
                      n2 = 40, # per treatment
                      icc_pre_subject = 0.5,
                      cor_subject = -0.5,
                      var_ratio = c(0, 0.01, 0.02, 0.03),
                      dropout = c(0, dropout_weibull(0.3, 1)),
                      effect_size = cohend(c(0, 0.8)))

# Formulas --------------------------------------------------------------------
# OLS (t-test)
f_PT <- sim_formula("y ~ treatment",
                    test = "treatment",
                    data_transform = transform_to_posttest)

# ANCOVA
f_PT_pre <- sim_formula("y ~ treatment + pretest",
                        test = "treatment",
                        data_transform = transform_to_posttest)

# analyze as 2-level longitudinal
f_LMM <- sim_formula("y ~ time * treatment + (1 + time | subject)")

# constrain treatment differences at pretest
f_LMM_c <- sim_formula("y ~ time + time:treatment + (1 + time | subject)")

# combine formulas
f <- sim_formula_compare("posttest" = f_PT,
                         "ANCOVA" = f_PT_pre,
                         "LMM" = f_LMM,
                         "LMM_c" = f_LMM_c)

# Run sim ---------------------------------------------------------------------
res <- simulate(p,
                formula = f,
                nsim = nsim,
                cores = cores,
                satterthwaite = TRUE,
                batch_progress = FALSE)

Let’s summarize the results for the treatment effect.

# need to specify what parameter estimates the treatment effect.
tests <- list("posttest" = "treatment",
              "ANCOVA" = "treatment",
              "LMM" = "time:treatment",
              "LMM_c" = "time:treatment")

x <- summary(res, para = tests)

# Only print the first 5
print(head(x, 5), add_cols = c("var_ratio"))
## Model: 'All' | Type: 'fixed' | Parameter(s): 'treatment', 'time:treatment'
## ---
##   model var_ratio   M_est theta M_se SD_est Power Power_bw Power_satt
##  ANCOVA      0.00 -0.0441     0  2.7    2.7 0.053    0.049      0.049
##  ANCOVA      0.01  0.0083     0  3.1    3.1 0.052    0.048      0.048
##  ANCOVA      0.02  0.0153     0  3.6    3.5 0.051    0.046      0.046
##  ANCOVA      0.03 -0.0822     0  4.0    4.0 0.047    0.043      0.043
##  ANCOVA      0.00 11.3523     0  2.7    2.8 0.982    0.981      0.981
## ---
## nsim: 5000 | 24 columns not shown

Since we have 16 × 4 model results, it is probably better to plot the results.

library(ggplot2)
library(viridis)

# re-order
x$model <- factor(x$model, levels = c("posttest", "ANCOVA", "LMM", "LMM_c"))

# Set custom limits per facet
d_limits <- data.frame(effect_size = c(0),
                       Power_satt = c(0.025, 0.075),
                       var_ratio = 0,
                       model = 0,
                       dropout = 0)

# Set hlines per facet
d_hline <- data.frame(effect_size = c(0, 0.8),
                      x = c(0.05, 0.8))

# Plot
ggplot(x, aes(model, Power_satt,
              group = interaction(var_ratio, dropout),
              color = factor(var_ratio),
              linetype = factor(dropout))) +
  geom_hline(data = d_hline, aes(yintercept = x), linetype = "dotted") +
  geom_point() +
  geom_blank(data = d_limits) +
  geom_line() +
  labs(y = "Power (Satterthwaite)",
       linetype = "Dropout",
       color = "Variance ratio",
       title = "Power and Type I error") +
  facet_grid(dropout ~ effect_size, scales = "free", labeller = "label_both") +
  coord_flip() +
  theme_minimal() +
  scale_color_viridis_d()

We can see that for complete data the ANCOVA is generally more powerful than the standard LMM as heterogeneity increases ("variance ratio"), whereas the constrained LMM is more powerful still. The difference between the ANCOVA and the t-test ("posttest") also decreases with increasing heterogeneity in change over time, since this leads to a weaker correlation between pretest and posttest. For the scenarios with missing data, the LMMs are more powerful, as would be expected – the cross-sectional methods have 30 % of the observations missing, whereas the LMMs can include the available responses up until dropout occurs.

It is worth noting that the 2-lvl LMM with 11 repeated measures is not always more powerful than a “cross-sectional” t-test at posttest. Obviously, this is a limited example, but it demonstrates that it is a mistake to base sample size calculations on a t-test, when a LMM is planned, with the motivation that “a LMM will have more power” (I see such motivations quite often).

Example 2 Do I really need a multilevel model? Using LRT to perform model selection

A common strategy when analyzing (longitudinal) data is to build the model in a data-driven fashion: start with a random intercept model, then add a random slope, perform a likelihood ratio test (LRT) and keep the random slope if it is significant, and so on. We can investigate how well such a strategy works using sim_formula_compare. We'll define our model formulas, starting with a 2-level random intercept model and working up to a 3-level random intercept and slope model. The true model is a 3-level model with only a random slope. Let's first set up the models

p <- study_parameters(n1 = 11,
                      n2 = 40,
                      n3 = 3,
                      icc_pre_subject = 0.5,
                      cor_subject = -0.5,
                      icc_slope = 0.05,
                      var_ratio = 0.03)

f0 <- sim_formula("y ~ time * treatment + (1 | subject)")
f1 <- sim_formula("y ~ time * treatment + (1 + time | subject)")
f2 <- sim_formula("y ~ time * treatment + (1 + time | subject) + (0 + time | cluster)")
f3 <- sim_formula("y ~ time * treatment + (1 + time | subject) + (1 + time | cluster)")

f <- sim_formula_compare("subject-intercept" = f0,
                         "subject-slope" = f1,
                         "cluster-slope" = f2,
                         "cluster-intercept" = f3)

Then we run the simulation; the four model formulas in f will be fit to each data set.

res <- simulate(p, formula = f, nsim = nsim, satterthwaite = TRUE, cores = cores, CI = FALSE)

During each simulation the REML log likelihood is saved for each model, so we can perform the model selection in the summary method as a post-processing step. Since REML is used, it is assumed that the fixed effects are the same and that we compare a "full model" to a "reduced model". Let's try a forward selection strategy, where we start with the first model and compare it to the next; if the comparison is significant we continue testing models, else we stop. The summary function performs this model comparison for each of the nsim simulations and returns the results from the "winning" model in each simulation replication.

x <- summary(res, model_selection = "FW", LRT_alpha = 0.05)
x
## Model: model_selection
##
## Random effects
##
##          parameter   M_est  theta est_rel_bias prop_zero is_NA
##  subject_intercept 100.000 100.00      0.00029      0.00     0
##      subject_slope   2.900   2.80      0.00710      0.00     0
##              error 100.000 100.00     -0.00017      0.00     0
##        cor_subject  -0.490  -0.50     -0.01000      0.00     0
##      cluster_slope   0.270   0.15      0.82000      0.00     0
##  cluster_intercept   7.900   0.00      7.90000      0.00     0
##        cor_cluster  -0.081   0.00     -0.08100      0.67     0
##
## Fixed effects
##
##       parameter   M_est theta M_se SD_est Power Power_bw Power_satt
##     (Intercept)  0.0160     0 1.10   1.00 0.050        .          .
##            time -0.0045     0 0.25   0.28 0.130        .          .
##       treatment  0.0160     0 1.50   1.50 0.049        .          .
##  time:treatment -0.0024     0 0.36   0.40 0.130    0.048       0.12
## ---
## Number of simulations: 5000 | alpha: 0.05
## Time points (n1): 11
## Subjects per cluster (n2 x n3): 40 x 3 (treatment)
##                                 40 x 3 (control)
## Total number of subjects: 240
## ---
## Results based on LRT model comparisons, using direction: FW (alpha = 0.05)
## Model selected (proportion)
## cluster-intercept     cluster-slope     subject-slope
##            0.0054            0.4360            0.5586

The point of the model selection algorithm is to mimic a type of data-driven model selection that is quite common. We see that this strategy does not lead to nominal Type I errors in this scenario. The cluster level is left out of the model too often, leading to Type I errors around 12 %. However, it is fairly common to increase the LRT's alpha level to try to improve this strategy. Let's try a range of alpha levels to see the impact.

alphas <- seq(0.01, 0.5, length.out = 50)
x <- vapply(alphas, function(a) {
  type1 <- summary(res, model_selection = "FW", LRT_alpha = a)
  type1$summary$model_selection$FE$Power_satt[4]
}, numeric(1))

d <- data.frame(LRT_alpha = alphas, type1 = x)

ggplot(d, aes(LRT_alpha, type1)) +
  geom_line() +
  geom_hline(yintercept = 0.05, linetype = "dotted") +
  labs(y = "Type I error (time:treatment)",
       title = "Impact of LRT alpha level for model selection") +
  theme_minimal()

The figure shows that the LRT alpha level needs to be very liberal to keep Type I errors for the treatment effect close to the 5 % level.

We can also see the results from each of the four models. Here we will just look at the time:treatment effect.

x1 <- summary(res, para = "time:treatment")
x1
## Model: summary
##
## Fixed effects: 'time:treatment'
##
##              model   M_est theta M_se SD_est Power Power_bw Power_satt
##  subject-intercept -0.0024     0 0.14    0.4 0.500    0.330      0.500
##      subject-slope -0.0024     0 0.25    0.4 0.220    0.080      0.220
##      cluster-slope -0.0024     0 0.39    0.4 0.088    0.028      0.054
##  cluster-intercept -0.0024     0 0.40    0.4 0.082    0.026      0.043
## ---
## Number of simulations: 5000 | alpha: 0.05
## Time points (n1): 11
## Subjects per cluster (n2 x n3): 40 x 3 (treatment)
##                                 40 x 3 (control)
## Total number of subjects: 240

We see that the 2-lvl random intercept model leads to substantially inflated Type I errors (0.495). The 2-level model that also adds a random slope is somewhat better but still not good (Type I errors = 0.215). The correct 3-level model that accounts for the third level using a random slope has close to nominal Type I errors (0.054). The full 3-level model that adds an unnecessary random intercept is somewhat conservative (Type I errors = 0.043).

When choosing a strategy, Type I error is not the only factor we want to minimize; power is also important. So let's see how power is affected.

# See if power is impacted
p1 <- update(p, effect_size = cohend(0.8))
res_power <- simulate(p1,
                      formula = f,
                      nsim = nsim,
                      satterthwaite = TRUE,
                      cores = cores,
                      CI = FALSE)

# we can also summarize a single fixed effect for convenience
x <- summary(res_power, model_selection = "FW", LRT_alpha = 0.05, para = "time:treatment")
print(x, verbose = FALSE)
## Model: summary
##
## Fixed effects: 'time:treatment'
##
##            model M_est theta M_se SD_est Power Power_bw Power_satt
##  model_selection   1.1   1.1 0.36   0.39  0.82     0.61       0.69
## ---

x1 <- summary(res_power, para = "time:treatment")
print(x1, verbose = FALSE)
## Model: summary
##
## Fixed effects: 'time:treatment'
##
##              model M_est theta M_se SD_est Power Power_bw Power_satt
##  subject-intercept   1.1   1.1 0.14   0.39  0.98     0.97       0.98
##      subject-slope   1.1   1.1 0.25   0.39  0.95     0.86       0.94
##      cluster-slope   1.1   1.1 0.39   0.39  0.80     0.55       0.63
##  cluster-intercept   1.1   1.1 0.40   0.39  0.78     0.52       0.55
## ---

We can note that power for the treatment effect based on LRT model selection is only slightly higher than for the correct 3-level model. If we balance this slight increase in power against the noticeable increase in Type I errors, it might be reasonable to conclude that for these data we should always fit the 3-level model. Lastly, the overspecified 3-level model that includes an unnecessary random intercept loses some power.

Summary

The improved simulation method in powerlmm makes it really convenient to plan and evaluate the analysis of longitudinal treatment trials with a possible third level of clustering (therapists, schools, groups, etc). The support for data transforms and single level (lm()) models also enables a lot of custom models to be fit.

Feedback, bugs, etc

I appreciate all types of feedback, e.g. typos, bugs, inconsistencies, feature requests, etc. Open an issue on github.com/rpsychologist/powerlmm/issues, comment on this post, or contact me at rpsychologist.com/about.



greybox package for R

Fri, 05/04/2018 - 14:22

(This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)

I am delighted to announce a new package on CRAN. It is called "greybox". I know what my American friends will say as soon as they see the name – they will claim that there is a typo, and that it should be "a" instead of "e". But in fact no mistake was made – I used the British spelling for the name, and I totally understand that at some point I might regret this…

So, what is “greybox”? Wikipedia tells us that grey box is a model that “combines a partial theoretical structure with data to complete the model”. This means that almost any statistical model can be considered as a grey box, thus making the package potentially quite flexible and versatile.

But why do we need a new package on CRAN?

First, there were several functions in the smooth package that did not belong there, and there are several functions in the TStools package that can be united under the topic of model building. They focus on multivariate regression analysis rather than on state-space models, time series smoothing or anything else. It would make more sense to find them their own home package. An example of such a function is ro() – Rolling Origin – a function that Yves and I wrote in 2016 on our way to the International Symposium on Forecasting. Arguably this function can be used not only for assessing the accuracy of forecasting models, but also for variable / model selection.

Second, in one of my side projects, I needed to work more on multivariate regressions, and I had several ideas I wanted to test. One of those is creating a combined multivariate regression from several models using information criteria weights. The existing implementations did not satisfy me, so I ended up writing a function combiner() that does that. In addition, our research together with Yves Sagaert indicates that there is a nice solution to the fat regression problem (when the number of parameters is higher than the number of observations) using information criteria. Uploading those functions to smooth did not sound right, but having greybox helps a lot. There are other ideas that I have in mind, and they don't fit in the other packages.

Finally, I could not find satisfactory (from my point of view) packages on CRAN that would focus on multivariate model building and forecasting – the usual focus is on analysis instead (including time series analysis). The other thing is the obsession of many packages with p-values and hypotheses testing, which was yet another motivator for me to develop a package that would be completely hypotheses-free (at the 95% level). As a result, if you work with the functions from greybox, you might notice that they produce confidence intervals instead of p-values (because I find them more informative and useful). Finally, I needed good instruments for promotional modelling for several projects, and it was easier to implement them myself than to compile them from different functions from different packages.

Keeping that in mind, it makes sense to briefly discuss what is already available there. I've already discussed how the xregExpander() and stepwise() functions work in one of the previous posts, and these functions are now available in greybox instead of smooth. However, I have not covered either combiner() or ro() yet. While combiner() is still under construction and works only for normal cases (fat regression can be solved, but not 100% efficiently), ro() has worked efficiently for several years already. So I created a detailed vignette explaining what rolling origin is, how the function works and how to use it. If you are interested in finding out more, check it out on CRAN.

As a wrap-up, the greybox package is focused on model building and forecasting, and from now on it will be periodically updated.

As a final note, I plan to do the following in greybox in future releases:

  1. Move the nemenyi() function from TStools to greybox;

  2. Develop functions for promotional modelling;
  3. Write a function for multiple correlation coefficients (will be used for multicollinearity analysis);
  4. Implement variables selection based on rolling origin evaluation;
  5. Stepwise regression and combinations of models, based on Laplace and the other distributions;
  6. AICc for Laplace and the other distributions;
  7. Solve fat regression problem via combination of regression models (sounds crazy, right?);
  8. xregTransformer – non-linear transformation of the provided xreg variables;

  9. Other cool stuff.

If you have any thoughts on what to implement, leave a comment – I will consider your idea.



Survey books, courses and tools by @ellis2013nz

Fri, 05/04/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Surveys everywhere!

Surveys are everywhere, and thinking about them in different ways has dominated my work so far this year. What’s right and wrong with them, how to improve them, do we need them, can we replace this one or that one with administrative data, why do the media (and nearly everyone else) so misunderstand and misinterpret them… those sorts of questions.

Well, they work (mostly), and they’re not going anywhere fast. To take just the most obvious example, have the (unfair) complaints about the pollsters letting down the pundits in the UK and USA in 2016 led to a collapse in the polling industry? No. A quick glance at the Five Thirty Eight data download page shows that there have been more than 1,500 polls with a combined sample size of 2.3 million Americans in the past 18 months just to help people judge “Are Democrats or Republicans Winning the Race for Congress?”

In the world of official statistics they’re still indispensable in most – perhaps all – countries, despite the rise and promise of administrative data. It would be difficult to measure under-employment rates, firm-level innovation rates, or smoking prevalence from administrative data, to pick three examples at random.

I love the thought and rigour that has gone into designing and analysing surveys, by academics, national statistical offices and (to a certain extent) market researchers. The unifying concept is total survey error, as seen in this diagram. This gives us categories like measurement error, coverage error and adjustment error, without which it's difficult even to think about whether our estimation process is working and how to improve it. The concepts evolved through the second half of the twentieth century and matured in the 1980s and 1990s; I don't think any single author can (or does) claim ownership of the whole idea.

It’s a simple but brilliant organising framework for design, analysis and just talking about surveys, and it’s one that should be paid more attention by those working with data outside the survey field too. Even with survey professionals there’s far too little attention to sources of error other than the easily-quantified sampling error (inevitably the only acknowledgement of uncertainty quoted in reporting of opinion polls, for example); but in much analysis and discussion in the context of today’s data revolution one sees not even that. I commend the diagram on the right and its many variants to any serious analyst’s attention.

So back to this blog… I thought I’d note down good books on this I’ve read recently and some other stuff.

Books

My university training emphasised the analysis and sample design aspects of surveys and didn’t cover in any detail issues such as questionnaire development and testing, interviewer effects, mode effects and the like. Can’t do everything. I suspect these are bits of craft more often taught on the job, although there’s no reason for this. I might be wrong about it – certainly there’s good university research going on in these areas.

In my reading this year I emphasised books about what feel to me to be the messier aspects of survey quality and measurement and non-response error. These include subtle changes of wording and how mode effects are changing as people change the way they relate to strangers asking them questions (whether face to face, on the phone, or online).

Paul Biemer and others (edited), Total Survey Error in Practice, 2017

I gave this five stars. It’s a great (and very in-depth – not for the faint-hearted) collection of applications and theorising on survey quality. Covers all aspects – design, details of question wording, data collection in various modes (old and new), evaluation and analysis. Very much rooted in the real world of today, and today’s data problems including new data sources beyond the classic survey.

Don Dillman and others, Internet, Phone, Mail, and Mixed-Mode Surveys: The Tailored Design Method, 2014.

Another five star entry for me. This is the latest edition of the bible for putting these survey things together (with an emphasis on questionnaires and their delivery rather than sample design) and it’s a very thorough, well evidenced setting out of what works, what doesn’t, and how we know.

Floyd Fowler, Survey Research Methods, 2013

Four stars. Nice overview, much more introductory than the other two. Would suit a busy researcher who needs good understanding of the range of the issues rather than a survey specialist.

Massive Open Online Course

I had the pleasure this year of checking out the Coursera Survey Data Collection and Analytics specialization (a collection of seven courses) with an eye to being able to recommend a general introduction for analysts to the design, management, delivery and analysis of surveys. The courses are provided via the Coursera platform, by Maryland and Michigan universities. I rate the overall specialization as four or five stars, and have rated the individual courses as three, four or five stars on Coursera. In my search for affordable online courses in this area, this was the only one that covered the material in both the range and the depth that I was looking for. Other courses available are either too superficial, or focus on just one aspect of surveys, or more commonly both.

The content is spot on, but the delivery can be heavy going – it’s more like a traditional university course than a lot of online courses. That’s a strength in part as well as an issue of course.

The courses cover all aspects of survey management, from a framework in which to plan research, through questionnaire design, the technical aspects of data collection via various modes, processing (weighting, imputation, etc), and analysis. The courses are rigorous and backed up with the latest research into survey effectiveness, including aspects such as the importance of interviewer effects in inflating the variance of estimates; the relative effectiveness of different modes of delivery in different circumstances; and modern software for analysis (Stata and R examples are provided in the later courses, but this is not a course in Stata or R).

There are no pre-requisites for the course but it would be helpful to have some background in university level statistics; otherwise some of the material in courses 4, 5 and 6 on sample design, imputation and weighting, and analysis would be a difficult introduction to those topics (but not impossible). Overall the courses are rated “intermediate” and there is a fair amount of both material and rigour (in-lecture quizzes, assessed quizzes, and assignments). A student new to the topics might take five or six months to complete the whole specialisation, with 2 – 6 hours of effort per week.

I strongly recommend the first six courses (and optionally the seventh “capstone” project) for anyone working hands-on on an important survey, particularly one that needs to meet standards of official statistics or of rigorous research. I would recommend the first course (“Framework for Data Collection and Analysis”) for managers of people working on surveys, and for people who need to use surveys in their work and need an introduction to concepts such as “Total Survey Error” and a quality framework for surveys. Other individual courses may be of interest to people with specific needs (eg questionnaire design, sampling issues).

For someone aiming to be a hands-on analyst of surveys in the depth covered in these courses, I would recommend in parallel building general skills in R. The best way to do this will depend on learning style but I recommend either the "data analyst with R" stream on Data Camp or working through the free book R for Data Science. These do not cover survey-specific issues but would build the background skills needed to make the most of the material in courses 5 and 6 of the survey specialization. Data Camp in particular provides an excellent platform for step by step learning and practice of technical skills; these would need to be placed in context by other training or support.

Tools Survey data collection platforms

The market of suppliers and solutions for professional questionnaire development and delivery is surprisingly crowded. There are in fact many good quality outfits that provide good tools for developing and testing questionnaires, delivery (whether Computer-Assisted Personal Interviewing with accompanying tablet software, Computer-Assisted Telephone Interviewing, or web-based) and managing invitations and the like. Those that I've looked at all looked to be good quality, and mostly offered to look after data management in the cloud during collection too.

The surprise for me in this space was the World Bank's very high quality offering – Survey Solutions. It's very good, very powerful (uses C# for more complex things but very simple to get the knack of), with great control over survey logic, validation and so on. It's definitely up there with or better than the commercial competition, and free for developing in. It is growing fast, with a rapid enhancement development cycle, satisfactory video-based training aimed at the different users (interviewers, designers, etc), and a vibrant user community. For some customers (eg member governments) the Bank might host data management during collection too. I definitely recommend looking at Survey Solutions if you're doing a serious survey, particularly CAPI in difficult conditions.

Processing, analysis and sample design

For analysis of complex surveys, Thomas Lumley’s brilliantly effective and comprehensive R survey package is the day to day workhorse which meets virtually all my survey-specific needs. That is, on top of the usual R environment and packages I use for data management and graphics of course. His book on complex surveys is my handbook for the analytical stage and I hope there will be a new edition soon.

There are other excellent relevant R packages too; see the CRAN official statistics and survey methodology task view.

Dealing with surveys is also a strength of Stata.

The philosophy of complex survey analysis in Stata is very similar to survey in R. One specifies the data, design (primary sample units, strata, and so on), then uses specialist analysis functions that do things like means, totals, proportions, models in the appropriate way given that design. This abstracts nearly all the complexity away from the analyst and leads to rapid development and readable code.
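To make the parallel concrete, here is a minimal sketch using the survey package's bundled api school data (the design variables below come from that example data set, not from any survey discussed in this post):

library(survey)
data(api) # example school data shipped with the survey package

# Declare the design once: strata, weights, finite population correction
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Specialist functions then respect that design automatically
svymean(~api00, dstrat)              # design-based mean and standard error
svytotal(~enroll, dstrat)            # design-based population total
svyglm(api00 ~ ell + meals, dstrat)  # design-based regression model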

Q is a nice product for analysis of surveys aimed at people who aren’t specialist statisticians. It’s got much more grunt (better graphics and statistics) than the other cross-tabs tools I’ve seen aimed at market researchers, but is still fairly simple to use. It handles complexities like multiple response questions (‘pick as many as apply…’) elegantly and can do statistical inference that appropriately takes the weights into account (not other aspects of sample design though, I think). For data transform, reshape or graphics that get too much for the point-and-click, the user can write R (nice) or JavaScript (yuck – this isn’t what JavaScript was designed for…). The R connection isn’t brilliant – the data and analysis are sent to a distant server and analysed there before coming back to the main Q client, so if you were going to do much of that you’d be better off just using R. Q is pricey per user compared to R or Stata.

If used out of the box, SPSS will interpret weights as frequencies which leads to wildly inappropriate statistical inference. It’s common in market research to adjust for this by scaling weights to add to the sample size rather than the population but this is very crude, only very partially fixes the problem, and makes it much harder to produce population estimates. There’s a “complex survey” module available that adds in the necessary functionality but it costs extra on top of base SPSS. I’ve never seen it in use.
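For reference, the crude adjustment mentioned above is nothing more than rescaling the weights so they sum to the sample size rather than the population (w below is a hypothetical weight vector, purely for illustration):

# Rescale weights to sum to the sample size instead of the population total.
# This stops frequency-weighting software from inflating the apparent n,
# but it does nothing to capture the design's effect on standard errors.
w <- c(120, 80, 200, 150)          # hypothetical population weights
w_scaled <- w * length(w) / sum(w)
sum(w_scaled)                      # equals the sample size, 4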

I don't recommend SPSS for this or any purpose. There's a nice open source competitor in this space, jamovi, built on R but adding SPSS-like GUI control; at the moment, however, it can't handle complex surveys at all. This may change, because anything available in R is in principle simple to add to jamovi.

SAS can handle complex surveys appropriately of course. It comes with a price tag, and generally less functionality than available for much less cost in R and Stata.

Microsoft Power BI is not for serious statistical inference, but it is an effective exploratory tool and a great dashboarding tool. In my most recent post a month ago I showed how to work around the way it doesn't deal naturally with weighted data.

None of the above is comprehensive, just some of the survey-related things I’ve been thinking about recently.

That’s all for today. If I were assigning exercises it would be – go have another look at the total survey error diagram, and think about how it applies to other data analysis!



About Risks and Side-Effects… Consult your Purrr-Macist

Fri, 05/04/2018 - 08:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

Capture errors, warnings and messages, but keep your list operations going

In a recent post about text mining, I discussed some solutions to webscraping the contents of our STATWORX blog using the purrr package. However, while preparing the next episode of my series on text mining, I remembered a little gimmick that I found quite helpful along the way. Thus, a little detour: How do I capture side-effects and errors when I perform operations on lists with purrr rather than using a loop?

First of all, a quick motivating example: Imagine we will build a parser for the blog section of our STATWORX website. However, we have no idea how many entries were posted in the meantime (the hive of data scientists in our office is usually pretty quick at writing these things). Thus, we need such a function to be robust in the sense that it can endure the cruelties of "404 – Not found" error messages and still continue parsing after running into an error.

How could this possibly() work?

So let's use some beautiful purrr adverbs to fearlessly record all our outputs, errors and warnings, rather than stopping and asking the user to handle side-effects the exact moment errors turn up. These adverbs are reminiscent of try(); however, they are a little more convenient for operations on lists.
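For comparison, here is roughly what we would have to write by hand with base R to keep a list operation going after an error (a minimal sketch, not taken from the original post):

# Base-R equivalent of what possibly() wraps for us:
# swallow errors element-wise and substitute a default value
base_results <- lapply(list(1, 2, "a", 6), function(value) {
  tryCatch(value / 2, error = function(e) NULL)
})
str(base_results) # the "a" element comes back as NULL instead of stopping the loop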

Let's consider a more complex motivating example first, but no worries – there are more obvious examples to help explain the nitty-gritties further down this page. The R code below illustrates our use of possibly() for later use with purrr::map(). First, let us have a look at what we tried to achieve with our function. More specifically, what happens between the curly braces below: our robust_parse() function will simply parse HTML webpages for other links, using URLs that we provide it with. In this case, we simply use paste0() to create a vector of links to our blog overview pages, extract the weblinks from each of these pages using XML::xpathSApply(), pipe these weblinks into a data_frame and clean our results from duplicates – there are various overview pages that group our blogs by category – using dplyr::filter() and dplyr::distinct().

robust_parse <- possibly(function(value){
  htmlParse(paste0("http://www.statworx.com/de/blog/page/", value, "/")) %>%
    xpathSApply(., "//a/@href") %>%
    data_frame(.) %>%
    filter(., grepl("/blog", .)) %>%
    filter(., !grepl("/blog/$|/blog/page/|/data-science/|/statistik/", .)) %>%
    distinct()
}, otherwise = NULL)

Second, let us inspect how we employ possibly() in this context. possibly() expects from us a function to modify, as well as the argument otherwise, which states what it is supposed to do when things go south. In this case, we want NULL as an output value. Another popular choice would be NA, signaling that somewhere, we have not produced a string as intended. However, in our example we are happy with NULL, since we only want to parse the pages that exist and do not require a specific listing of pages that do not exist (or what happened when we did not find a page).

webpages <- map_df(0:100, ~robust_parse(.)) %>% unlist
webpages
.1 "https://www.statworx.com/de/blog/strsplit-but-keeping-the-delimiter/"
.2 "https://www.statworx.com/de/blog/data-science-in-python-vorstellung-von-nuetzlichen-datenstrukturen-teil-1/"
.3 "https://www.statworx.com/de/blog/burglr-stealing-code-from-the-web/"
.4 "https://www.statworx.com/de/blog/regularized-greedy-forest-the-scottish-play-act-i/"
...

Third, we use our new function robust_parse() to operate on a vector or list of integers from 0 to 100 (possible numbers of subpages we want to parse) and have a quick look at the beautiful links we extracted. Just as a reminder, below you find the code to extract and clean the contents of the individual pages, using another map_df()-based loop – which is the focus of another post.

tidy_statworx_blogs <- map_df(webpages, ~read_html(.) %>%
                                htmlParse(., asText = TRUE) %>%
                                xpathSApply(., "//p", xmlValue) %>%
                                paste(., collapse = "\n") %>%
                                gsub("\n", "", .) %>%
                                data_frame(text = .) %>%
                                unnest_tokens(word, text) %>%
                                anti_join(data_frame(word = stopwords("de"))) %>%
                                anti_join(data_frame(word = stopwords("en"))) %>%
                                mutate(author = .$word[2]))

However, we actually want to go back to our purrr helpers and see what they can do for us. To be more specific, rather than helpers, these are actually called adverbs, since we use them to modify the behavior of a function (i.e. a verb). Our current robust_parse() function does not produce entries when the loop does not successfully find a webpage to parse for links. Consider the situation where you intend to keep track of unsuccessful operations and of errors that arise along the way. Instead of further exploring purrr adverbs using the above code, let us look at a much easier example to illustrate the possible contexts in which using purrr adverbs might help you out.

A much easier example: Try dividing a character string by 2

Suppose there is an element in our list where our amazing division powers are useless: We are going to try to divide all the elements in our list by 2 – but this time, we want purrr to note where the function i_divide_things resists dividing particular elements for us. Again, the otherwise argument helps us define our output in situations that are beyond the scope of our function.

i_divide_things <- possibly(function(value){ value / 2},
                            otherwise = "I won't divide this for you.")

# Let's try our new function
> purrr::map(list(1, 2, "a", 6), ~ i_divide_things(.))
[[1]]
[1] 0.5

[[2]]
[1] 1

[[3]]
[1] "I won't divide this for you."

[[4]]
[1] 3

However, consider the case where "something did not work out" might not suffice and you want to keep track of possible errors as well as warnings while still retaining the entire output. A job for safely(): As illustrated below, wrapping our function in safely() helps us output a nested list. For each element of the input, the output provides two components – $result and $error. For all iterations where a list element is numeric, $result includes a numeric output and an empty (= NULL) error element. Only for the third list element – where our function stumbled over a character input – do we capture an error message, as well as the result we defined using otherwise.

i_divide_things <- safely(function(value){ value / 2},
                          otherwise = "This did not quite work out.")

purrr::map(list(1, 2, "a", 6), ~ i_divide_things(.))
[[1]]
[[1]]$result
[1] 0.5

[[1]]$error
NULL

[[2]]
[[2]]$result
[1] 1

[[2]]$error
NULL

[[3]]
[[3]]$result
[1] "This did not quite work out."

[[3]]$error
<simpleError in value/2: non-numeric argument to binary operator>

[[4]]
[[4]]$result
[1] 3

[[4]]$error
NULL

In the example above, we have only been revealing our errors once we have looped over all elements of our list, by inspecting the output list. However, safely() also has the quiet argument – by default set to TRUE. If we set this to FALSE, we receive our errors the very moment they occur.
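A minimal sketch of that variant – the same toy function as above, just wrapped with quiet = FALSE so the error is reported the moment it happens while the $result / $error structure of the output stays the same:

i_divide_loudly <- safely(function(value){ value / 2},
                          otherwise = "This did not quite work out.",
                          quiet = FALSE)

# The error for the character element is now printed immediately,
# in addition to being stored in the $error component of the output.
out <- purrr::map(list(1, "a"), ~ i_divide_loudly(.))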

Now, we want to have a quick look at quietly(). We will define a warning, a message and print an output. This is to illustrate where purrr saves the individual components that our function returns. For each element of our input the returned list provides four components:

  • $result again returns the result of our operation
  • $output returns the output that would be printed in the console.
  • $warnings and $messages return the strings that we defined.
i_divide_things <- purrr::quietly(function(value){
  if(is.numeric(value) == TRUE) {
    print(value / 2)
  } else{
    warning("Can't be done. Printing this instead.")
    message("Why would you even try dividing this?")
    print(value)
  }
})

purrr::map(list(1, "a", 6), ~i_divide_things(.))
[[1]]
[[1]]$result
[1] 0.5

[[1]]$output
[1] "[1] 0.5"

[[1]]$warnings
character(0)

[[1]]$messages
character(0)

[[2]]
[[2]]$result
[1] "a"

[[2]]$output
[1] "[1] \"a\""

[[2]]$warnings
[1] "Can't be done. Printing this instead."

[[2]]$messages
[1] "Why would you even try dividing this?\n"

[[3]]
[[3]]$result
[1] 3

[[3]]$output
[1] "[1] 3"

[[3]]$warnings
character(0)

[[3]]$messages
character(0)

Last, there is auto_browse(), which allows us to trigger the RStudio browser for debugging and brings the user to the approximate location of the error. This case is illustrated in the screenshot below.

i_divide_things <- purrr::auto_browse(function(value){
  print(value / 2)
})

purrr::map(list(1, "a", 6), ~i_divide_things(.))

Splendid - this was a quick wrap-up of how to wrap your functions for handling side-effects in your operations on lists using adverbs of purrr. Happy wrapping everyone!

About the author

David Schlepps

David is a member of the Data Science team and is interested in R and Markdown. In his free time, he enjoys playing the guitar.

The post About Risks and Side-Effects… Consult your Purrr-Macist first appeared on STATWORX.



MLE with General Optimization Functions in R

Fri, 05/04/2018 - 05:45

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In my previous post (https://statcompute.wordpress.com/2018/02/25/mle-in-r/), I showed how to estimate the MLE based on the log-likelihood function with a general-purpose optimization algorithm such as optim(), and that such optimizers are more flexible and efficient than the wrappers provided in statistical packages.

A benchmark comparison is given below showing the use of other general optimizers commonly available in R, including optim(), nlm(), nlminb(), and ucminf(). Since these optimizers are normally designed to minimize the objective function, we need to add a minus (-) sign to the log-likelihood function that we want to maximize, as shown in the minLL() function below. In addition, in order to speed up the optimization process, we can suppress the hessian in the function call. If the hessian is indeed required to calculate standard errors of estimated parameters, it can be computed afterwards by calling the hessian() function in the numDeriv package.

As shown in the benchmark result, although ucminf() is the most efficient optimization function, turning on its hessian option can increase the computing time by 70%. In addition, the second-fastest function, nlminb(), has no built-in option to output the hessian. Therefore, it can sometimes be preferable to estimate the model parameters first and then calculate the hessian afterwards for analysis purposes, as demonstrated below.

df <- read.csv("Downloads/credit_count.txt")

### DEFINE THE OBJECTIVE FUNCTION ###
minLL <- function(par) {
  mu <- exp(par[1] + par[2] * df$AGE + par[3] * df$ACADMOS +
            par[4] * df$MINORDRG + par[5] * df$OWNRENT)
  return(ll <- -sum(log(exp(-mu) * (mu ^ df$MAJORDRG) / factorial(df$MAJORDRG))))
}

### BENCHMARKING ###
import::from("rbenchmark", "benchmark")
benchmark(replications = 10, order = "elapsed", relative = "elapsed",
          columns = c("test", "replications", "elapsed", "relative"),
  optim   = {optim(par = rep(0, 5), fn = minLL, hessian = F)},
  nlm     = {nlm(f = minLL, p = rep(0, 5), hessian = F)},
  nlminb  = {nlminb(start = rep(0, 5), objective = minLL)},
  ucminf  = {ucminf::ucminf(par = rep(0, 5), fn = minLL, hessian = 0)},
  hessian = {ucminf::ucminf(par = rep(0, 5), fn = minLL, hessian = 1)}
)
#      test replications elapsed relative
# 4  ucminf           10   4.044    1.000
# 3  nlminb           10   6.444    1.593
# 5 hessian           10   6.973    1.724
# 2     nlm           10   8.292    2.050
# 1   optim           10  12.027    2.974

### HOW TO CALCULATE THE HESSIAN ###
fit <- nlminb(start = rep(0, 5), objective = minLL)
import::from("numDeriv", "hessian")
std <- sqrt(diag(solve(hessian(minLL, fit$par))))
est <- data.frame(beta = fit$par, stder = std, z_values = fit$par / std)
#           beta        stder  z_values
# 1 -1.379324501 0.0438155970 -31.480217
# 2  0.010394876 0.0013645030   7.618068
# 3  0.001532188 0.0001956843   7.829894
# 4  0.461129515 0.0068557359  67.261856
# 5 -0.199393808 0.0283222704  -7.040177

It is worth mentioning that, although these general optimizers are fast, they are less user-friendly than wrappers in statistical packages, such as mle or maxLik. For instance, we have to calculate AIC or BIC based on the log likelihood function or p-values based on Z-scores.
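As a minimal sketch of that extra bookkeeping, assuming the df, minLL, fit and est objects from the chunk above (the formulas are the standard AIC/BIC definitions and two-sided normal p-values; nothing here is specific to the original post):

n <- nrow(df)             # number of observations
k <- length(fit$par)      # number of estimated parameters
logLik <- -fit$objective  # nlminb minimised -logLik, so flip the sign

aic <- 2 * k - 2 * logLik
bic <- log(n) * k - 2 * logLik

# two-sided p-values from the Z-scores computed earlier
p_values <- 2 * pnorm(-abs(est$z_values))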



Seventeen Minutes From Tweet To Package

Fri, 05/04/2018 - 04:19

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

Earlier today, @noamross posted to Twitter:

#rstats #lazyweb What's the R/httr/curl equivalent of

curl -F "file=@somefile.html" https://t.co/abbugLz9ZW

— Noam Ross (@noamross) May 3, 2018

The answer was a 1:1 “file upload” curl to httr translation:

httr::POST(
  url    = "https://file.io",
  encode = "multipart",
  body   = list(file = httr::upload_file("/some/path/to/file"))
)

but I wanted to do more than that since Noam took 20 minutes out of his day this week (with no advance warning) to speak to my intro-to-stats class about his work and R.

The Twitter request was (ultimately) a question on how to use R to post content to https://file.io. They have a really simple API, and the timespan from Noam’s request to the initial commit of a fully functional package was roughly 17 minutes. The end product included the ability to post files, strings and R data (something that seemed like a good thing to add).
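To give a feel for how simple that API is, here is a rough sketch of posting in-memory content with plain httr (this is not the package's internal code; the field name file mirrors the curl example above, and the temporary file is only for illustration):

library(httr)

tmp <- tempfile(fileext = ".txt")
writeLines("hello from R", tmp)

res <- POST(
  url    = "https://file.io",
  encode = "multipart",
  body   = list(file = upload_file(tmp))
)
content(res)  # a list with success, key, link and expiry fields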

Not too long after came a v0.1.0 release complete with tests and passing CRAN checks on all platforms.

Noam also suggested I do a screencast:

You should do a screencast of your process for this.

— Noam Ross (@noamross) May 3, 2018

I don’t normally do screencasts but had some conference call time so folks can follow along at home:

That’s not the best screencast in the world, but it’s very representative of the workflow I used. A great deal of boilerplate package machinations is accomplished with this bash script.

I wasn’t happy with the hurried function name choices I made nor was I thrilled with the package title, description, tests and basic docs, so I revamped all those into another release. That took a while, mostly due to constantly triggering API warnings about being rate-limited.

So, if you have a file of 5 GB or less, a character vector or in-memory R data you'd like to ephemerally share with others, take the fileio package for a spin:

devtools::install_github("hrbrmstr/fileio") fileio::fi_post_text("TWFrZSBzdXJlIHRvIEAgbWUgb24gVHdpdHRlciBpZiB5b3UgZGVjb2RlIHRoaXM=") ## success key link expiry ## 1 TRUE n18ZSB https://file.io/n18ZSB 14 days

(bonus points if you can figure out what that seemingly random sequence of characters says).



Bayesian Inference with Backfitting MCMC

Thu, 05/03/2018 - 06:25

(This article was first published on R – Stable Markets, and kindly contributed to R-bloggers)

Previous posts in this series on MCMC samplers for Bayesian inference (in order of publication):

  • Bayesian Simple Linear Regression with Gibbs Sampling in R
  • Blocked Gibbs Sampling in R for Bayesian Multiple Linear Regression
  • Metropolis-in-Gibbs Sampling and Runtime Analysis with Profviz
  • Speeding up Metropolis-Hastings with Rcpp

All code for this (and previous) posts is in … Continue reading Bayesian Inference with Backfitting MCMC →



Qualitative Data Science: Using RQDA to analyse interviews

Thu, 05/03/2018 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Qualitative data science sounds like a contradiction in terms. Data scientists generally solve problems using numerical solutions. Even the analysis of text is reduced to a numerical problem using Markov chains, topic analysis, sentiment analysis and other mathematical tools.

Scientists and professionals consider numerical methods the gold standard of analysis. There is, however, a price to pay when relying on numbers alone. Numerical analysis reduces the complexity of the social world. When analysing people, numbers present an illusion of precision and accuracy. Giving primacy to quantitative research in the social sciences comes at a high price. The dynamics of reality are reduced to statistics, losing the narrative of the people that the research aims to understand.

Being both an engineer and a social scientist, I acknowledge the importance of both numerical and qualitative methods. My dissertation used a mixed-method approach to review the relationship between employee behaviour and customer perception in water utilities. This article introduces some aspects of qualitative data science with an example from my dissertation. In this article, I show how I analysed data from interviews using both quantitative and qualitative methods and demonstrate why qualitative data science is better to understand text than numerical methods. The most recent version of the code is available on my GitHub repository.

Qualitative Data Science

The often celebrated artificial intelligence of machine learning is impressive but does not come close to human intelligence and ability to understand the world. Many data scientists are working on automated text analysis to solve this issue (the topicmodels package is an example of such an attempt). These efforts are impressive but even the smartest text analysis algorithm is not able to derive meaning from text. To fully embrace all aspects of data science we need to be able to methodically undertake qualitative data analysis.

The capabilities of R in numerical analysis are impressive, but it can also assist with Qualitative Data Analysis (QDA). Huang Ronggui from Hong Kong developed the RQDA package to analyse texts in R. RQDA assists with qualitative data analysis using a GUI front-end to analyse collections of texts. The video below contains a complete course in using this software. Below the video, I share an example from my dissertation which compares qualitative and quantitative methods for analysing text.
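If you want to follow along, a minimal setup sketch (at the time of writing RQDA was available on CRAN and depends on RGtk2; check current availability before relying on this):

install.packages("RQDA")  # pulls in the GUI toolkit and SQLite backend
library(RQDA)
RQDA()  # launches the GUI used below to code the interview transcripts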

For my dissertation about water utility marketing, I interviewed seven people from various organisations. The purpose of these interviews was to learn about the value proposition for water utilities. The data consists of the transcripts of six interviews which I manually coded using RQDA. For reasons of agreed anonymity, I cannot provide the raw data file of the interviews through GitHub.

Numerical Text Analysis

Word clouds are a popular method for exploratory analysis of texts. The word cloud is created with the text mining (tm) and wordcloud packages. The transcribed interviews are converted to a text corpus (the native format for the tm package) and whitespace, punctuation and so on are removed. The code snippet below opens the RQDA file and extracts the texts from the database. RQDA stores all text in an SQLite database and the package provides a query command to extract data.

# Init
library(tidyverse)
library(RQDA)
library(tm)
library(wordcloud)
library(topicmodels)
library(igraph)

# Load data
openProject("Hydroinformatics/Macromarketing/stakeholders.rqda")

# Word frequency diagram
interviews <- RQDAQuery("SELECT file FROM source")
interviews$file <- apply(interviews, 1, function(x) gsub("…", "...", x))
interviews$file <- apply(interviews, 1, function(x) gsub("’", "", x))
interviews <- Corpus(VectorSource(interviews$file))
interviews <- tm_map(interviews, stripWhitespace)
interviews <- tm_map(interviews, content_transformer(tolower))
interviews <- tm_map(interviews, removeWords, stopwords("english"))
interviews <- tm_map(interviews, removePunctuation)
interviews <- tm_map(interviews, removeNumbers)
interviews <- tm_map(interviews, removeWords, c("interviewer", "interviewee"))

set.seed(1969)
wordcloud(interviews, min.freq = 10, max.words = 50, rot.per = 0.35,
          colors = brewer.pal(8, "Blues")[-1:-5])

This word cloud makes it clear that the interviews are about water businesses and customers, which is a pretty obvious statement. The interviews are also about the opinion of the interviewees (think). While the word cloud is aesthetically pleasing and provides a quick snapshot of the content of the texts, they cannot inform us about their meaning.

Topic modelling is a more advanced method to extract information from text by assessing the proximity of words to each other. The topicmodels package provides functions to perform this analysis. I am not an expert in this field and simply followed the basic steps using default settings with four topics.

# Topic Analysis
library(topicmodels)
dtm <- DocumentTermMatrix(interviews)
dtm <- removeSparseTerms(dtm, 0.99)
ldaOut <- LDA(dtm, k = 4)
terms(ldaOut, 6)

This code converts the corpus created earlier into a Document-Term Matrix, which is a matrix of words and documents (the interviews) and the frequency at which each of these words occurs. The LDA function applies Latent Dirichlet Allocation to the matrix to define the topics. The variable k defines the number of anticipated topics. An LDA is similar to clustering in multivariate data. The final output is a table with six words for each topic.

Topics extracted from interviews

Topic 1      Topic 2      Topic 3      Topic 4
water        water        customers    water
think        think        water        think
actually     inaudible    customer     companies
customer     people       think        yeah
businesses   service      business     issues
customers    businesses   service      don't

This table does not tell me much at all about what was discussed in the interviews. Perhaps that is down to the frequent use of words such as "water" and "think" – I did ask people for their opinion about water-related issues. To make this analysis more meaningful I could perhaps manually remove the words water, yeah, and so on. This introduces bias in the analysis and reduces the reliability of the topic analysis because I would be interfering with the text.

Numerical text analysis sees a text as a bag of words instead of a set of meaningful words. It seems that any automated text mining needs a lot of manual cleaning to derive anything meaningful. This excursion shows that automated text analysis is not a sure-fire way to analyse the meaning of a collection of words. After a lot of trial and error to try to make this work, I decided to go back to my roots of qualitative analysis using RQDA as my tool.

Qualitative Data Science Using RQDA

To use RQDA for qualitative data science, you first need to manually analyse each text and assign codes (topics) to parts of the text. The image below shows a question and answer and how it was coded. All marked text is blue and the codes are shown between markers. Coding a text is an iterative process that aims to extract meaning from a text. The advantage of this method compared to numerical analysis is that the researcher injects meaning into the analysis. The disadvantage is that the analysis will always be biased, which in the social sciences is unavoidable. My list of topics was based on words that appear in a marketing dictionary so that I analysed the interviews from that perspective.

Example of text coded with RQDA.

My first step was to look at the occurrence of codes (themes) in each of the interviews.

codings <- getCodingTable()[, 4:5]
categories <- RQDAQuery("SELECT filecat.name AS category, source.name AS filename
                         FROM treefile, filecat, source
                         WHERE treefile.catid = filecat.catid
                           AND treefile.fid = source.id
                           AND treefile.status = 1")
codings <- merge(codings, categories, all.y = TRUE)
head(codings)

# Open coding
reorder_size <- function(x) {
  factor(x, levels = names(sort(table(x))))
}

ggplot(codings, aes(reorder_size(codename), fill = category)) +
  geom_bar(stat = "count") +
  facet_grid(~filename) +
  coord_flip() +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  ylab("Code frequency in interviews") +
  xlab("Code")
ggsave("Hydroinformatics/Macromarketing/rqda_codes.png", width = 9, height = 11, dpi = 300)

The code uses an internal RQDA function getCodingTable to obtain the primary data. The RQDAQuery function provides more flexibility and can be used to build more complex queries of the data. You can view the structure of the RQDA database using the RQDATables function.
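For example, a hypothetical extra query that counts how often each code was applied; the coding table and its cid/status columns are assumptions based on the schema used in the queries above, so check RQDATables() before relying on them:

code_counts <- RQDAQuery("SELECT freecode.name AS code, COUNT(*) AS n
                          FROM coding, freecode
                          WHERE coding.cid = freecode.id AND coding.status = 1
                          GROUP BY freecode.name
                          ORDER BY n DESC")
head(code_counts)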

The occurrence of themes from six interviews.

This bar chart helps to explore the topics that interviewees discussed, but it does not help to understand how these topics relate to each other. This method provides a better view than the 'bag of words' approach because the text has been given meaning.

RQDA provides a facility to assign each code to a code category. This structure can be visualised using a network. The network is visualised using the igraph package and the graph shows how codes relate to each other.

Qualitative data analysis can create value from a text by interpreting it from a given perspective. This article is not even an introduction to the science and art of qualitative data science. I hope it invites you to explore RQDA and similar tools.

If you are interested in finding out more about how I used this analysis, then read my dissertation on customer service in water utilities.

# Axial coding
edges <- RQDAQuery("SELECT codecat.name, freecode.name
                    FROM codecat, freecode, treecode
                    WHERE codecat.catid = treecode.catid AND freecode.id = treecode.cid")
g <- graph_from_edgelist(as.matrix(edges), directed = FALSE)
V(g)$name <- gsub(" ", "\n", V(g)$name)
c <- spinglass.community(g)

par(mar = rep(0, 4))
set.seed(666)
plot(c, g, vertex.size = 10, vertex.color = NA, vertex.frame.color = NA,
     edge.width = E(g)$weight, layout = layout.drl)

The post Qualitative Data Science: Using RQDA to analyse interviews appeared first on The Devil is in the Data.



purrr Like a Kitten till the Lake Pipes RoaR

Wed, 05/02/2018 - 21:11

(This article was first published on R – FinEx.Co, and kindly contributed to R-bloggers)

I really should make a minimal effort to resist opening a data analysis blog post with Beach Boys’ lyrics, but this time the combination is too apt. We use the purrr package to show how to let your pipes roar in R.

The tidyverse GitHub site contains a simple example illustrating how well pipes and purrr work together. For more learning, try Jenny Bryan’s purrr tutorial.

First, load the tidyverse and the purrr package.

library(tidyverse)
library(purrr)
# devtools::install_github("jennybc/repurrrsive")
# library(repurrrsive)

If you want to be more adventurous, you can download Jenny's repurrrsive package from GitHub. That code is commented out in the chunk above.

You Don’t Know What I Got

The great thing about using pipes is you can tie together a series of steps to produce a single output of exactly what you want. In this case, we are going to start with the base R dataset airquality and apply a number of functions to come up with the adjusted R-squared value for the Ozone data for each month between May and September. Along the way we will run a linear regression on the data to generate the adjusted R-squared values.

Here is what the full process looks like, from beginning to end.

airquality %>%
  split(.$Month) %>%
  map(~ lm(Ozone ~ Temp, data = .)) %>%
  map(summary) %>%
  map_dbl('adj.r.squared')

     5      6      7      8      9 
0.2781 0.3676 0.5024 0.3307 0.6742

The problem, of course, is the output generated by the intermediate steps stays behind the scenes. For a beginner, this can be a bit confusing because it isn’t clear what is going on. So let’s break the full chunk into its constituent pieces so we can see what purrr is doing and how pipes tie the whole thing together.

Start at the beginning and take things step-by-step.

Run the airquality data set to see what we are dealing with. We can see the data is collected daily for five months. Ozone looks to be the target, so it is natural to wonder if there is any relationship between it and the other variables.

airquality Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 7 23 299 8.6 65 5 7 8 19 99 13.8 59 5 8 9 8 19 20.1 61 5 9 10 NA 194 8.6 69 5 10 11 7 NA 6.9 74 5 11 12 16 256 9.7 69 5 12 13 11 290 9.2 66 5 13 14 14 274 10.9 68 5 14 15 18 65 13.2 58 5 15 16 14 334 11.5 64 5 16 17 34 307 12.0 66 5 17 18 6 78 18.4 57 5 18 19 30 322 11.5 68 5 19 20 11 44 9.7 62 5 20 21 1 8 9.7 59 5 21 22 11 320 16.6 73 5 22 23 4 25 9.7 61 5 23 24 32 92 12.0 61 5 24 25 NA 66 16.6 57 5 25 26 NA 266 14.9 58 5 26 27 NA NA 8.0 57 5 27 28 23 13 12.0 67 5 28 29 45 252 14.9 81 5 29 30 115 223 5.7 79 5 30 31 37 279 7.4 76 5 31 32 NA 286 8.6 78 6 1 33 NA 287 9.7 74 6 2 34 NA 242 16.1 67 6 3 35 NA 186 9.2 84 6 4 36 NA 220 8.6 85 6 5 37 NA 264 14.3 79 6 6 38 29 127 9.7 82 6 7 39 NA 273 6.9 87 6 8 40 71 291 13.8 90 6 9 41 39 323 11.5 87 6 10 42 NA 259 10.9 93 6 11 43 NA 250 9.2 92 6 12 44 23 148 8.0 82 6 13 45 NA 332 13.8 80 6 14 46 NA 322 11.5 79 6 15 47 21 191 14.9 77 6 16 48 37 284 20.7 72 6 17 49 20 37 9.2 65 6 18 50 12 120 11.5 73 6 19 51 13 137 10.3 76 6 20 52 NA 150 6.3 77 6 21 53 NA 59 1.7 76 6 22 54 NA 91 4.6 76 6 23 55 NA 250 6.3 76 6 24 56 NA 135 8.0 75 6 25 57 NA 127 8.0 78 6 26 58 NA 47 10.3 73 6 27 59 NA 98 11.5 80 6 28 60 NA 31 14.9 77 6 29 61 NA 138 8.0 83 6 30 62 135 269 4.1 84 7 1 63 49 248 9.2 85 7 2 64 32 236 9.2 81 7 3 65 NA 101 10.9 84 7 4 66 64 175 4.6 83 7 5 67 40 314 10.9 83 7 6 68 77 276 5.1 88 7 7 69 97 267 6.3 92 7 8 70 97 272 5.7 92 7 9 71 85 175 7.4 89 7 10 72 NA 139 8.6 82 7 11 73 10 264 14.3 73 7 12 74 27 175 14.9 81 7 13 75 NA 291 14.9 91 7 14 76 7 48 14.3 80 7 15 77 48 260 6.9 81 7 16 78 35 274 10.3 82 7 17 79 61 285 6.3 84 7 18 80 79 187 5.1 87 7 19 81 63 220 11.5 85 7 20 82 16 7 6.9 74 7 21 83 NA 258 9.7 81 7 22 84 NA 295 11.5 82 7 23 85 80 294 8.6 86 7 24 86 108 223 8.0 85 7 25 87 20 81 8.6 82 7 26 88 52 82 12.0 86 7 27 89 82 213 7.4 88 7 28 90 50 275 7.4 86 7 29 91 64 253 7.4 83 7 30 92 59 254 9.2 81 7 31 93 39 83 6.9 81 8 1 94 9 24 13.8 81 8 2 95 16 77 7.4 82 8 3 96 78 NA 6.9 86 8 4 97 35 NA 7.4 85 8 5 98 66 NA 4.6 87 8 6 99 122 255 4.0 89 8 7 100 89 229 10.3 90 8 8 101 110 207 8.0 90 8 9 102 NA 222 8.6 92 8 10 103 NA 137 11.5 86 8 11 104 44 192 11.5 86 8 12 105 28 273 11.5 82 8 13 106 65 157 9.7 80 8 14 107 NA 64 11.5 79 8 15 108 22 71 10.3 77 8 16 109 59 51 6.3 79 8 17 110 23 115 7.4 76 8 18 111 31 244 10.9 78 8 19 112 44 190 10.3 78 8 20 113 21 259 15.5 77 8 21 114 9 36 14.3 72 8 22 115 NA 255 12.6 75 8 23 116 45 212 9.7 79 8 24 117 168 238 3.4 81 8 25 118 73 215 8.0 86 8 26 119 NA 153 5.7 88 8 27 120 76 203 9.7 97 8 28 121 118 225 2.3 94 8 29 122 84 237 6.3 96 8 30 123 85 188 6.3 94 8 31 124 96 167 6.9 91 9 1 125 78 197 5.1 92 9 2 126 73 183 2.8 93 9 3 127 91 189 4.6 93 9 4 128 47 95 7.4 87 9 5 129 32 92 15.5 84 9 6 130 20 252 10.9 80 9 7 131 23 220 10.3 78 9 8 132 21 230 10.9 75 9 9 133 24 259 9.7 73 9 10 134 44 236 14.9 81 9 11 135 21 259 15.5 76 9 12 136 28 238 6.3 77 9 13 137 9 24 10.9 71 9 14 138 13 112 11.5 71 9 15 139 46 237 6.9 78 9 16 140 18 224 13.8 67 9 17 141 13 27 10.3 76 9 18 142 24 238 10.3 68 9 19 143 16 201 8.0 82 9 20 144 13 238 12.6 64 9 21 145 23 14 9.2 71 9 22 146 36 139 10.3 81 9 23 147 7 49 10.3 69 9 24 148 14 20 16.6 63 9 25 149 30 193 6.9 70 9 26 150 NA 145 13.2 77 9 27 151 14 191 14.3 75 9 28 152 18 131 8.0 76 9 29 153 20 223 11.5 68 9 30

The first step is to break the data up by month, so we make use of base R‘s split() function. Notice all the data is grouped by month.

airquality %>% split(.$Month) $`5` Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 7 23 299 8.6 65 5 7 8 19 99 13.8 59 5 8 9 8 19 20.1 61 5 9 10 NA 194 8.6 69 5 10 11 7 NA 6.9 74 5 11 12 16 256 9.7 69 5 12 13 11 290 9.2 66 5 13 14 14 274 10.9 68 5 14 15 18 65 13.2 58 5 15 16 14 334 11.5 64 5 16 17 34 307 12.0 66 5 17 18 6 78 18.4 57 5 18 19 30 322 11.5 68 5 19 20 11 44 9.7 62 5 20 21 1 8 9.7 59 5 21 22 11 320 16.6 73 5 22 23 4 25 9.7 61 5 23 24 32 92 12.0 61 5 24 25 NA 66 16.6 57 5 25 26 NA 266 14.9 58 5 26 27 NA NA 8.0 57 5 27 28 23 13 12.0 67 5 28 29 45 252 14.9 81 5 29 30 115 223 5.7 79 5 30 31 37 279 7.4 76 5 31 $`6` Ozone Solar.R Wind Temp Month Day 32 NA 286 8.6 78 6 1 33 NA 287 9.7 74 6 2 34 NA 242 16.1 67 6 3 35 NA 186 9.2 84 6 4 36 NA 220 8.6 85 6 5 37 NA 264 14.3 79 6 6 38 29 127 9.7 82 6 7 39 NA 273 6.9 87 6 8 40 71 291 13.8 90 6 9 41 39 323 11.5 87 6 10 42 NA 259 10.9 93 6 11 43 NA 250 9.2 92 6 12 44 23 148 8.0 82 6 13 45 NA 332 13.8 80 6 14 46 NA 322 11.5 79 6 15 47 21 191 14.9 77 6 16 48 37 284 20.7 72 6 17 49 20 37 9.2 65 6 18 50 12 120 11.5 73 6 19 51 13 137 10.3 76 6 20 52 NA 150 6.3 77 6 21 53 NA 59 1.7 76 6 22 54 NA 91 4.6 76 6 23 55 NA 250 6.3 76 6 24 56 NA 135 8.0 75 6 25 57 NA 127 8.0 78 6 26 58 NA 47 10.3 73 6 27 59 NA 98 11.5 80 6 28 60 NA 31 14.9 77 6 29 61 NA 138 8.0 83 6 30 $`7` Ozone Solar.R Wind Temp Month Day 62 135 269 4.1 84 7 1 63 49 248 9.2 85 7 2 64 32 236 9.2 81 7 3 65 NA 101 10.9 84 7 4 66 64 175 4.6 83 7 5 67 40 314 10.9 83 7 6 68 77 276 5.1 88 7 7 69 97 267 6.3 92 7 8 70 97 272 5.7 92 7 9 71 85 175 7.4 89 7 10 72 NA 139 8.6 82 7 11 73 10 264 14.3 73 7 12 74 27 175 14.9 81 7 13 75 NA 291 14.9 91 7 14 76 7 48 14.3 80 7 15 77 48 260 6.9 81 7 16 78 35 274 10.3 82 7 17 79 61 285 6.3 84 7 18 80 79 187 5.1 87 7 19 81 63 220 11.5 85 7 20 82 16 7 6.9 74 7 21 83 NA 258 9.7 81 7 22 84 NA 295 11.5 82 7 23 85 80 294 8.6 86 7 24 86 108 223 8.0 85 7 25 87 20 81 8.6 82 7 26 88 52 82 12.0 86 7 27 89 82 213 7.4 88 7 28 90 50 275 7.4 86 7 29 91 64 253 7.4 83 7 30 92 59 254 9.2 81 7 31 $`8` Ozone Solar.R Wind Temp Month Day 93 39 83 6.9 81 8 1 94 9 24 13.8 81 8 2 95 16 77 7.4 82 8 3 96 78 NA 6.9 86 8 4 97 35 NA 7.4 85 8 5 98 66 NA 4.6 87 8 6 99 122 255 4.0 89 8 7 100 89 229 10.3 90 8 8 101 110 207 8.0 90 8 9 102 NA 222 8.6 92 8 10 103 NA 137 11.5 86 8 11 104 44 192 11.5 86 8 12 105 28 273 11.5 82 8 13 106 65 157 9.7 80 8 14 107 NA 64 11.5 79 8 15 108 22 71 10.3 77 8 16 109 59 51 6.3 79 8 17 110 23 115 7.4 76 8 18 111 31 244 10.9 78 8 19 112 44 190 10.3 78 8 20 113 21 259 15.5 77 8 21 114 9 36 14.3 72 8 22 115 NA 255 12.6 75 8 23 116 45 212 9.7 79 8 24 117 168 238 3.4 81 8 25 118 73 215 8.0 86 8 26 119 NA 153 5.7 88 8 27 120 76 203 9.7 97 8 28 121 118 225 2.3 94 8 29 122 84 237 6.3 96 8 30 123 85 188 6.3 94 8 31 $`9` Ozone Solar.R Wind Temp Month Day 124 96 167 6.9 91 9 1 125 78 197 5.1 92 9 2 126 73 183 2.8 93 9 3 127 91 189 4.6 93 9 4 128 47 95 7.4 87 9 5 129 32 92 15.5 84 9 6 130 20 252 10.9 80 9 7 131 23 220 10.3 78 9 8 132 21 230 10.9 75 9 9 133 24 259 9.7 73 9 10 134 44 236 14.9 81 9 11 135 21 259 15.5 76 9 12 136 28 238 6.3 77 9 13 137 9 24 10.9 71 9 14 138 13 112 11.5 71 9 15 139 46 237 6.9 78 9 16 140 18 224 13.8 67 9 17 141 13 27 10.3 76 9 18 142 24 238 10.3 68 9 19 143 16 201 8.0 82 9 20 144 13 238 12.6 64 9 21 145 23 14 9.2 71 9 22 146 36 139 10.3 81 9 23 147 7 49 10.3 69 9 24 148 14 20 16.6 63 9 25 149 30 193 6.9 70 9 26 150 NA 145 13.2 77 9 27 151 14 
191 14.3 75 9 28 152 18 131 8.0 76 9 29 153 20 223 11.5 68 9 30

In the second step, we apply the purrr map() function to the linear regression model we create with lm(). We want the adjusted R-squared value for each month for Ozone. I played around with the variables a bit to find which one illustrated the adjusted R-squared values best and settled on Temp, but you can choose any other besides Month and Day.

The map() command applies the lm() function to each monthly group and yields the typical output for each month. We now have five linear regression models, one for each month, but no adjusted R-squared values.

airquality %>%
  split(.$Month) %>%
  map(~ lm(Ozone ~ Temp, data = .))

$`5`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
    -102.16         1.88  

$`6`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
     -91.99         1.55  

$`7`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
    -372.92         5.15  

$`8`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
    -238.86         3.56  

$`9`

Call:
lm(formula = Ozone ~ Temp, data = .)

Coefficients:
(Intercept)         Temp  
    -149.35         2.35  

To generate the adjusted R-squared values, we need to map() the summary() command to each group. This is the third step. Again, the results are familiar. We have the typical summary of the linear model, one for each month.

airquality %>% split(.$Month) %>% map(~ lm(Ozone ~ Temp, data = .)) %>% map(summary) $`5` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max -30.32 -8.62 -2.41 5.32 68.26 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -102.159 38.750 -2.64 0.0145 * Temp 1.885 0.578 3.26 0.0033 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 18.9 on 24 degrees of freedom (5 observations deleted due to missingness) Multiple R-squared: 0.307, Adjusted R-squared: 0.278 F-statistic: 10.6 on 1 and 24 DF, p-value: 0.00331 $`6` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max -12.99 -9.34 -6.31 11.08 23.27 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -91.991 51.312 -1.79 0.116 Temp 1.552 0.653 2.38 0.049 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 14.5 on 7 degrees of freedom (21 observations deleted due to missingness) Multiple R-squared: 0.447, Adjusted R-squared: 0.368 F-statistic: 5.65 on 1 and 7 DF, p-value: 0.0491 $`7` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max -32.11 -14.52 -1.16 7.58 75.29 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -372.92 84.45 -4.42 0.00018 *** Temp 5.15 1.01 5.12 3e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 22.3 on 24 degrees of freedom (5 observations deleted due to missingness) Multiple R-squared: 0.522, Adjusted R-squared: 0.502 F-statistic: 26.2 on 1 and 24 DF, p-value: 3.05e-05 $`8` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max -40.42 -17.65 -8.07 9.97 118.58 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -238.861 82.023 -2.91 0.0076 ** Temp 3.559 0.974 3.65 0.0013 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 32.5 on 24 degrees of freedom (5 observations deleted due to missingness) Multiple R-squared: 0.357, Adjusted R-squared: 0.331 F-statistic: 13.4 on 1 and 24 DF, p-value: 0.00126 $`9` Call: lm(formula = Ozone ~ Temp, data = .) Residuals: Min 1Q Median 3Q Max -27.45 -8.59 -3.69 11.04 31.39 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -149.347 23.688 -6.30 9.5e-07 *** Temp 2.351 0.306 7.68 2.9e-08 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 13.8 on 27 degrees of freedom (1 observation deleted due to missingness) Multiple R-squared: 0.686, Adjusted R-squared: 0.674 F-statistic: 58.9 on 1 and 27 DF, p-value: 2.95e-08

She Blows em Outta the Water Like You Never Seen

Now things get interesting, because we can put it all together.

Step 4 involves using the specialized map_dbl() command to pull the adjusted R-squared values from each month’s linear model summary and output them in a single line.

How do we know the adjusted R-squared value is a double? Well, it looks like one, since it consists of a floating decimal. But we could guess, too. If we were to use the map_int() command we would get an error that tells us the value is a double. If we guessed character, we could use map_chr(). That would work, but we would have the output in character form, which isn't what we want.

So we can recognize its data form or we can figure it out through trial and error. Either way we end up at the same place, back where we started.
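To see the point about character output, here is a quick sketch using the same pipeline with map_chr() instead; the values come back as strings (and, depending on your purrr version, you may get a warning about coercing doubles to character):

airquality %>%
  split(.$Month) %>%
  map(~ lm(Ozone ~ Temp, data = .)) %>%
  map(summary) %>%
  map_chr('adj.r.squared')  # same numbers, but returned as character strings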

airquality %>%
  split(.$Month) %>%
  map(~ lm(Ozone ~ Temp, data = .)) %>%
  map(summary) %>%
  map_dbl('adj.r.squared')

     5      6      7      8      9 
0.2781 0.3676 0.5024 0.3307 0.6742

And if That Ain’t Enough to Make You Flip Your Lid

I don’t know what will.



EARL London Keynote Speaker announcement: Garrett Grolemund

Wed, 05/02/2018 - 17:03

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

We’re delighted to announce RStudio’s Garrett Grolemund as one of our Keynote Speakers at this year’s EARL London.

He will join Starcount’s Edwina Dunn and a whole host of brilliant speakers for the 5th EARL London Conference on 11-13 September at The Tower Hotel.

Garrett specialises in teaching people how to use R, which is why you’ll see his name on some brilliant resources, including video courses on Datacamp.com and O’Reilly media, his series of popular R cheat sheets distributed by RStudio, and as co-author of R for Data Science and Hands-On Programming with R. He also wrote the lubridate R package and works for RStudio as an advocate who trains engineers to do data science with R and the Tidyverse.

He earned his PhD in Statistics from Rice University in 2012 under the guidance of Hadley Wickham. Before that, he earned a Bachelor's degree in Psychology from Harvard University and briefly attended law school before wising up.

Garrett is one of the foremost promoters of Shiny, R Markdown, and the Tidyverse, so we’re really looking forward to his keynote.

Don’t miss out on early bird tickets

Early bird tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November



Statistical Sins: Is Your Classification Model Any Good?

Wed, 05/02/2018 - 14:27

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Prediction with Binomial Regression

April A to Z is complete! We now return to your regularly scheduled statistics blog posts! Today, I want to talk about an issue I touched on during A to Z: using regression to predict values and see how well your model is doing.

Specifically, I talked a couple of times about binomial regression (here and here), which is used to predict (read: recreate with a set of variables significantly related to) a binary outcome. The data example I used involved my dissertation data, and the binary outcome was the verdict: guilty or not guilty. A regression model returns the coefficients applied to the predictor variables to reproduce the outcome, and will highlight whether a predictor was significantly related to the outcome or not. But a big question you may be asking of your binomial model is: how well does it predict the outcome? Specifically, how can you examine whether your regression model is correctly classifying cases?

We’ll start by loading/setting up the data and rerunning the binomial regression with interactions.

dissertation<-read.delim("dissertation_data.txt",header=TRUE)
dissertation<-dissertation[,1:44]
predictors<-c("obguilt","reasdoubt","bettertolet","libertyvorder",
"jurevidence","guilt")
dissertation<-subset(dissertation, !is.na(libertyvorder))

dissertation[45:50]<-lapply(dissertation[predictors], function(x) {
y<-scale(x, center=TRUE, scale=TRUE)
}
)

pred_int<-'verdict ~ obguilt.1 + reasdoubt.1 + bettertolet.1 + libertyvorder.1 +
jurevidence.1 + guilt.1 + obguilt.1*guilt.1 + reasdoubt.1*guilt.1 +
bettertolet.1*guilt.1 + libertyvorder.1*guilt.1 + jurevidence.1*guilt.1'
model<-glm(pred_int, family="binomial", data=dissertation)
summary(model)
##
## Call:
## glm(formula = pred_int, family = "binomial", data = dissertation)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6101 -0.5432 -0.1289 0.6422 2.2805
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.47994 0.16264 -2.951 0.00317 **
## obguilt.1 0.25161 0.16158 1.557 0.11942
## reasdoubt.1 -0.09230 0.20037 -0.461 0.64507
## bettertolet.1 -0.22484 0.20340 -1.105 0.26899
## libertyvorder.1 0.05825 0.21517 0.271 0.78660
## jurevidence.1 0.07252 0.19376 0.374 0.70819
## guilt.1 2.31003 0.26867 8.598 < 2e-16 ***
## obguilt.1:guilt.1 0.14058 0.23411 0.600 0.54818
## reasdoubt.1:guilt.1 -0.61724 0.29693 -2.079 0.03764 *
## bettertolet.1:guilt.1 0.02579 0.30123 0.086 0.93178
## libertyvorder.1:guilt.1 -0.27492 0.29355 -0.937 0.34899
## jurevidence.1:guilt.1 0.27601 0.36181 0.763 0.44555
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.08 on 354 degrees of freedom
## Residual deviance: 300.66 on 343 degrees of freedom
## AIC: 324.66
##
## Number of Fisher Scoring iterations: 6

The predict function, which I introduced here, can also be used for the binomial model. Let’s have R generate predicted scores for everyone in the dissertation sample:

dissertation$predver<-predict(model)
dissertation$predver
## [1] 0.3907097456 -4.1351129605 2.1820478279 -2.8768390246 2.5804618523
## [6] 0.4244692909 2.3065468369 -2.7853434926 0.3504760502 -0.2747339639
## [11] -1.8506160725 -0.6956240161 -4.7860574839 -0.3875950731 -2.4955679446
## [16] -0.3941516951 -4.5831011509 1.6185480937 0.4971923298 4.1581842900
## [21] -0.6320531052 -4.8447046319 -2.3974890696 1.8566258698 0.0360685822
## [26] 2.2151040131 2.3477149003 -2.4493726369 -0.2253481404 -4.8899805287
## [31] 1.7789459288 -0.0978703861 -3.5541042186 -3.6009218603 0.1568318789
## [36] 3.7866003489 -0.6371816898 -0.7047761441 -0.7529742376 -0.0302759317
## [41] -0.1108055330 1.9751810033 0.2373614802 0.0424471071 -0.4018757856
## [46] 0.0530272726 -1.0763759980 0.0099577637 0.3128581222 1.4806679691
## [51] -1.7468626219 0.2998282372 -3.6359162016 -2.2200774510 0.3192366472
## [56] 3.0103216033 -2.0625775984 -6.0179845235 2.0300503627 2.3676828409
## [61] -2.8971753746 -3.2131490026 2.1349358889 3.0215336139 1.2436192890
## [66] 0.2885535375 0.2141821004 1.9480686936 0.0438751446 -1.9368013875
## [71] 0.2931258287 0.5319938265 0.0177643261 3.3724920900 0.0332949791
## [76] 2.5935500970 0.7571810150 0.7131757400 2.5411073339 2.8499853550
## [81] 2.8063291084 -0.4500738791 1.4700679077 -0.8659309719 0.0870492258
## [86] 0.5728074322 0.1476797509 2.4697257261 2.5935500970 -2.2200774510
## [91] -0.0941827753 1.3708676633 1.4345235392 -0.2407209578 2.4662700339
## [96] -1.9687731888 -6.7412580522 -0.0006224018 -4.4132951092 -2.8543032695
## [101] 1.2295635352 2.8194173530 0.1215689324 -3.8258079371 1.8959803882
## [106] -4.5578801595 2.3754402614 0.0826808026 1.5112359711 -3.5402060466
## [111] 0.2556657363 0.7054183194 1.4675797244 -2.3974890696 2.6955929822
## [116] -0.3123518919 -4.8431862346 -2.0132721372 0.4673405434 -2.3053405270
## [121] 1.9498822386 -0.5164183930 -1.8277820872 -0.0134750769 -2.3013547136
## [126] -0.2498730859 -4.4281010683 -0.0134750769 -0.2604532514 0.1476797509
## [131] -2.3392939519 -2.0625775984 -3.5541042186 1.5087477879 -4.6453051124
## [136] 2.0616474606 -3.2691362859 -7.3752231145 -1.6666447439 1.0532964013
## [141] -2.0625775984 -0.3355312717 2.2481601983 -2.2200774510 -4.3276959075
## [146] 0.8685972087 -0.7727065311 1.7511589809 -0.4774548995 0.0008056357
## [151] 1.7022334970 -0.4202625135 -0.2902646169 2.4409712692 0.0008056357
## [156] 0.0008056357 -3.6009218603 -0.8567788439 -0.4528474822 0.3517462520
## [161] 0.1307210605 -3.7843118182 -2.8419024763 -3.5191098774 -0.1460684795
## [166] 1.8809888141 2.8194173530 -2.4656469123 1.0589888029 0.1659840070
## [171] 1.4345235392 2.3676828409 1.5749534339 -0.1681557545 2.6406620359
## [176] 0.1476797509 -2.2135177411 1.9168260534 -3.4993205379 0.4557086940
## [181] -3.8136089417 -0.1121510987 -3.9772095600 1.3849234171 0.3504760502
## [186] 2.3807710856 -3.0667307601 2.3040586537 1.7599138086 -0.2083894500
## [191] 0.6844579761 -0.3552635652 -1.9459392035 -0.6075281598 -2.1663310490
## [196] 2.3676828409 -1.9205271122 -2.2334295071 -4.4265826710 -1.0117771483
## [201] -0.0161530548 -0.3072233074 -0.0161530548 -0.7451676752 -7.0351269313
## [206] 2.6406620359 -3.7523234832 -0.2498730859 2.0222929422 3.2886316225
## [211] -1.6221457956 2.4749949634 1.7570711677 0.0904873650 -4.7332807307
## [216] 0.1568318789 -0.0302759317 0.5127229828 1.3097316594 -6.9309218514
## [221] 0.0515992352 -0.4514194447 -0.2253481404 -4.7652690656 -0.4279866041
## [226] -4.4136563866 -3.7618312672 0.0156676181 -0.2590252139 2.6076058507
## [231] 1.6420333133 -3.9985172969 -6.2076483227 0.1632104039 0.1829426974
## [236] -4.7652690656 -4.4212844958 1.6001906117 0.8579971472 -3.8699110198
## [241] 0.3022779567 -0.1679979189 1.9421248181 0.6592738895 1.6132788564
## [246] -0.0366544567 -3.4818233673 -3.9422152187 -0.3473613776 0.4321933815
## [251] 0.7480288869 -0.2498730859 -1.9861068488 -2.2297920164 -0.7621263656
## [256] 1.2966434147 0.1632104039 0.2048721368 1.7789459288 0.4926393080
## [261] 0.4096285430 -1.7794744955 -2.5822853071 2.0413250624 -6.6574350219
## [266] -0.1277642235 -2.1972434657 -2.5075677545 -0.4482774141 -0.6943740757
## [271] -0.7821891015 6.3289445390 0.1568318789 0.1165981835 1.4781797859
## [276] -4.2287015488 -3.6157278195 -0.1511970641 -0.7047761441 2.0935344484
## [281] -3.8258079371 -4.4231102471 1.3097316594 3.4081542651 -0.4996175382
## [286] -2.0534397824 0.9783975145 -2.2562634924 3.7196170683 1.1110084017
## [291] 2.1661785291 -4.2138955896 1.9421248181 2.3065468369 -0.7139282722
## [296] -4.1431023472 -2.0854115837 2.9389399956 1.7711269214 -0.0302759317
## [301] -2.6458711124 0.5856241187 -0.1199576611 1.8566258698 -2.2383553905
## [306] 2.3807710856 -0.2838860920 3.1176953128 2.8499853550 2.8063291084
## [311] 0.0034011417 -0.4683781352 -3.0377484314 -1.3833686805 1.7764577456
## [316] 1.7842151661 3.4081542651 0.1165981835 -4.6988069009 -2.6013721641
## [321] 2.0616474606 -0.2498730859 -4.2207121622 4.1705330009 5.2103776377
## [326] -4.5406977837 -1.5080855068 -2.5232652805 -5.7259789038 2.5211393933
## [331] -0.3487069432 -2.5035573312 -2.2764097339 -5.8364854607 -1.8694684539
## [336] 1.3402996614 0.5728074322 0.3663267540 -0.1603491921 -2.1690805453
## [341] -1.4105339689 3.0768201201 -5.1065624241 -4.5966850670 -4.5498907729
## [346] -1.3078399029 -1.0882592824 0.3128581222 -0.3644156933 0.3100845191
## [351] 2.4774831467 -1.0763759980 2.2151040131 -0.0952748801 -4.6864864366

Now, remember that the outcome variable is not guilty (0) and guilty (1), so you might be wondering – what’s with these predicted values? Why aren’t they 0 or 1?

Binomial regression is used for nonlinear outcomes. Since the outcome is 0/1, it's nonlinear. But binomial regression is based on the general linear model. So how can we apply the general linear model to a nonlinear outcome? Answer: by transforming scores. Specifically, it transforms the outcome into log odds; the log transform makes the outcome variable behave somewhat linearly and symmetrically. The predicted outcome, then, is also on the log-odds scale.

ordvalues<-dissertation[order(dissertation$predver),]
ordvalues<-ordvalues[,51]
ordvalues<-data.frame(1:355,ordvalues)
colnames(ordvalues)<-c("number","predver")
library(ggplot2)
ggplot(data=ordvalues, aes(number,predver))+geom_smooth()
## `geom_smooth()` using method = 'loess'

Log odds ratios are great for analysis, but when trying to understand how well your model is predicting values, we want to convert them into a metric that’s easier to understand in isolation and when compared to the observed values. We can convert them into probabilities with the following equation:

dissertation$verdict_predicted<-exp(predict(model))/(1+exp(predict(model)))

This gives us a value ranging from 0 to 1, which is the probability that a particular person will select guilty. We can use this value in different ways to see how well our model is doing. Typically, we’ll divide at the 50% mark, so anyone with a probability of 0.5 or greater is predicted to select guilty, and anyone with a probability less than 0.5 would be predicted to select not guilty. We then compare this new variable with the observed results to see how well the model did.
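As an aside, the conversion in the chunk above can also be written with base R's built-in inverse-logit, or by asking predict() for probabilities directly; a quick sketch of the equivalent calls:

# equivalent ways to get predicted probabilities from the fitted model
dissertation$verdict_predicted <- plogis(predict(model))
dissertation$verdict_predicted <- predict(model, type = "response")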

dissertation$vpred_rounded<-round(dissertation$verdict_predicted,digits=0)
library(expss)
## Warning: package 'expss' was built under R version 3.4.4
dissertation<- apply_labels(dissertation,
verdict = "Actual Verdict",
verdict = c("Not Guilty" = 0,
"Guilty" = 1),
vpred_rounded = "Predicted Verdict",
vpred_rounded = c("Not Guilty" = 0,
"Guilty" = 1)
)
cro(dissertation$verdict,list(dissertation$vpred_rounded, total()))
                      Predicted Verdict
Actual Verdict        Not Guilty   Guilty   #Total
  Not Guilty                 152       39      191
  Guilty                      35      129      164
  #Total cases               187      168      355

One thing we could look at regarding this table, which when dealing with actual versus predicted categories is known as a confusion matrix, is how well the model did at correctly categorizing cases – which we get by adding together the number of people with both observed and predicted not guilty, and people with observed and predicted guilty, then dividing that sum by the total.

accuracy<-(152+129)/355
accuracy
## [1] 0.7915493

Our model correctly classified 79% of the cases. However, this is not the only way we can determine how well our model did. There are a variety of derivations you can make from the confusion matrix. But two you should definitely include when doing this kind of analysis are sensitivity and specificity. Sensitivity refers to the true positive rate, and specificity refers to the true negative rate.

When you’re working with confusion matrices, you’re often trying to diagnose or identify some condition, one that may be deemed positive or present, and the other that may be deemed negative or absent. These derivations are important because they look at how well your model identifies these different states. For instance, if most of my cases selected not guilty, I could get a high accuracy rate by simply predicting that everyone will select not guilty. But then my model lacks sensitivity – it only identifies negative cases (not guilty) and fails to identify any positive cases (guilty). If I were dealing with something even higher stakes, like whether a test result indicates the presence of a condition, I want to make certain my classification is sensitive to those positive cases. And vice versa, I could keep from missing any positive cases by just classifying everyone as positive, but then my model lacks specificity and I may subject people to treatment they don’t need (and that could be harmful).

Just like accuracy, sensitivity and specificity are easy to calculate. As I said above, I’ll consider not guilty to be negative and guilty to be positive. Sensitivity is simply the number of true positives (observed and predicted guilty) divided by the sum of true positives and false negatives (people who selected guilty but were classified as not guilty).

sensitivity<-129/164
sensitivity
## [1] 0.7865854

And specificity is the number of true negatives (observed and predicted not guilty) divided by the sum of true negatives and false positives (people who selected not guilty but were classified as guilty).

specificity<-152/191
specificity
## [1] 0.7958115

So the model correctly classifies 79% of the positive cases and 80% of the negative cases. The model could be improved, but it’s functioning equally well across positive and negative cases, which is good.
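Rather than typing the cell counts by hand, the same quantities can be pulled straight from a confusion matrix built with table(); a small sketch, assuming the variables created above (as.numeric() strips the expss value labels so that the 0/1 dimnames are predictable):

cm <- table(observed  = as.numeric(dissertation$verdict),
            predicted = as.numeric(dissertation$vpred_rounded))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # true positives / all observed guilty
specificity <- cm["0", "0"] / sum(cm["0", ])  # true negatives / all observed not guilty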

It should be pointed out that you can select any cutpoint you want for your probability variable. That is, if I want to be very conservative in identifying positive cases, I might want there to be a higher probability that it is a positive case before I classify it as such – perhaps I want to use a cutpoint like 75%. I can easily do that.

dissertation$vpred2[dissertation$verdict_predicted < 0.75]<-0
dissertation$vpred2[dissertation$verdict_predicted >= 0.75]<-1
dissertation<- apply_labels(dissertation,
vpred2 = "Predicted Verdict (0.75 cut)",
vpred2 = c("Not Guilty" = 0,
"Guilty" = 1)
)
cro(dissertation$verdict,list(dissertation$vpred2, total()))
                      Predicted Verdict (0.75 cut)
Actual Verdict        Not Guilty   Guilty   #Total
  Not Guilty                 177       14      191
  Guilty                      80       84      164
  #Total cases               257       98      355

accuracy2<-(177+84)/355
sensitivity2<-84/164
specificity2<-177/191
accuracy2
## [1] 0.7352113
sensitivity2
## [1] 0.5121951
specificity2
## [1] 0.9267016

Changing the cut score improves specificity but at the cost of sensitivity, which makes sense, because our model was predicting equally well (or poorly, depending on how you look at it) across positives and negatives. In this case, a different cut score won’t improve our model. We would need to go back and see if there are better variables to use for prediction. And to keep us from fishing around in our data, we’d probably want to use a training and testing set for such exploratory analysis.
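As a closing illustration of that last point, here is a minimal train/test sketch under assumed choices (a 70/30 split, the same pred_int formula, the 0.5 cutpoint, and an arbitrary seed); this is not from the original analysis:

set.seed(20180502)  # assumed seed, purely for reproducibility
n <- nrow(dissertation)
train_idx <- sample(n, size = round(0.7 * n))
train <- dissertation[train_idx, ]
test  <- dissertation[-train_idx, ]

# fit on the training set, evaluate on the held-out test set
train_model <- glm(pred_int, family = "binomial", data = train)
test_prob   <- predict(train_model, newdata = test, type = "response")
test_pred   <- as.numeric(test_prob >= 0.5)

mean(test_pred == test$verdict, na.rm = TRUE)  # out-of-sample accuracy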



Moving to blogdown

Wed, 05/02/2018 - 05:47

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

I’ve been in the process of transferring my blog (along with creating a personal website) to blogdown, which is hosted on Github Pages. The new blog, or rather, the continuation of this blog, will be at webbedfeet.github.io/posts, and it went live today.

I’ll be cross-posting here for a while, at least until Tal gets my new blog address included in R-Bloggers. I’m enjoying the RMarkdown blogging experience now, which is quite nice, and any code or analyses I want to include aren’t “lost in translation” as they were on WordPress. Since I live in R most of my days, it is also allowing a rather free flow of ideas onto the virtual page.

Hope you’ll come visit.

 



How efficient are multifactorial experiments?

Wed, 05/02/2018 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

I recently described why we might want to conduct a multi-factorial experiment, and I alluded to the fact that this approach can be quite efficient. It is efficient in the sense that it is possible to test simultaneously the impact of multiple interventions using an overall sample size that would be required to test a single intervention in a more traditional RCT. I demonstrate that here, first with a continuous outcome and then with a binary outcome.

In all of the examples that follow, I am assuming we are in an exploratory phase of research, so our alpha levels are relaxed a bit to \(\alpha = 0.10\). In addition, we make no adjustments for multiple testing. This might be justifiable, since we are not as concerned about making a Type 1 error (concluding an effect is real when there isn’t actually one). Because this is a screening exercise, the selected interventions will be re-evaluated. At the same time, we are setting desired power to be 90%. This way, if an effect really exists, we are more likely to select it for further review.

Two scenarios with a continuous outcome

To start, I have created two sets of underlying assumptions. In the first, the effects of the four interventions (labeled fac1, fac2, fac3, and fac4) are additive. (The factor variables are parameterized using effect-style notation, where the value -1 represents no intervention and 1 represents the intervention.) So, with no interventions the outcome is 0, and each successive intervention adds 0.8 to the observed outcome (on average), so that individuals exposed to all four factors will have an average outcome \(4 \times 0.8 = 3.2\).
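Before looking at the condition definitions below, here is a quick peek at what that effect-style coding looks like, using the genMultiFac function referenced later in the post; this assumes the GitHub version of simstudy that provides it, and that the first argument is the number of replications per combination:

library(simstudy)

# one replication of all 2^4 combinations of the four factors,
# coded -1 (no intervention) / 1 (intervention)
head(genMultiFac(1, nFactors = 4, coding = "effect",
                 colNames = paste0("fac", c(1:4))))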

cNoX <- defReadCond("DataMF/FacSumContNoX.csv")
cNoX

##                              condition formula variance   dist     link
## 1: (fac1 + fac2 + fac3 + fac4) == -4     0.0      9.3 normal identity
## 2: (fac1 + fac2 + fac3 + fac4) == -2     0.8      9.3 normal identity
## 3: (fac1 + fac2 + fac3 + fac4) ==  0     1.6      9.3 normal identity
## 4: (fac1 + fac2 + fac3 + fac4) ==  2     2.4      9.3 normal identity
## 5: (fac1 + fac2 + fac3 + fac4) ==  4     3.2      9.3 normal identity

In the second scenario, each successive exposure continues to add to the effect, but each additional intervention adds a little less. The first intervention adds 0.8, the second adds 0.6, the third adds 0.4, and the fourth adds 0.2. This is a form of interaction.

cX <- defReadCond("DataMF/FacSumContX.csv")
cX

##                              condition formula variance   dist     link
## 1: (fac1 + fac2 + fac3 + fac4) == -4     0.0      9.3 normal identity
## 2: (fac1 + fac2 + fac3 + fac4) == -2     0.8      9.3 normal identity
## 3: (fac1 + fac2 + fac3 + fac4) ==  0     1.4      9.3 normal identity
## 4: (fac1 + fac2 + fac3 + fac4) ==  2     1.8      9.3 normal identity
## 5: (fac1 + fac2 + fac3 + fac4) ==  4     2.0      9.3 normal identity

This is what a plot of the means might look like for each of the scenarios. The straight line represents the additive (non-interactive) scenario, and the bent line is the interaction scenario:

Sample size requirement for a single intervention compared to control

If we were to conduct a more traditional randomized experiment with two groups – treatment and control – we would need about 500 total subjects under the assumptions that we are using:

power.t.test(power = 0.90, delta = .8, sd = 3.05, sig.level = 0.10)

## 
##      Two-sample t test power calculation 
## 
##               n = 249.633
##           delta = 0.8
##              sd = 3.05
##       sig.level = 0.1
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

To take a look at the sample size requirements for a multi-factorial study, I’ve written this function that repeatedly samples data based on the definitions and fits the appropriate model, storing the results after each model estimation.

library(simstudy)

iterFunc <- function(dc, dt, seed = 464653, iter = 1000, binary = FALSE) {

  set.seed(seed)
  res <- list()

  for (i in 1:iter) {

    dx <- addCondition(dc, dt, "Y")

    if (binary == FALSE) {
      fit <- lm(Y ~ fac1*fac2*fac3*fac4, data = dx)
    } else {
      fit <- glm(Y ~ fac1*fac2*fac3*fac4, data = dx, family = binomial)
    }

    # A simple function to pull data from the fit
    res <- appendRes(res, fit)
  }

  return(res)
}
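The helper appendRes() is not shown in the post. Here is a minimal sketch of what it might look like, assuming all that is needed downstream are the p-values collected in a data.table named p (which is what the res$p[...] calls below expect):

library(data.table)

# Hypothetical stand-in for the author's appendRes(): stack the p-values of all
# model terms (except the intercept) into res$p, one row per simulated data set.
appendRes <- function(res, fit) {
  pvals <- coef(summary(fit))[-1, 4]     # 4th column holds the p-values (lm or glm)
  res$p <- rbindlist(list(res$p, as.data.table(t(pvals))))
  res
}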

And finally, here are the results for the sample size requirements based on no interaction across interventions. (I am using function genMultiFac to generate replications of all the combinations of four factors. This function is now part of simstudy, which is available on github, and will hopefully soon be up on CRAN.)

dt <- genMultiFac(32, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4)))
res <- iterFunc(cNoX, dt)

apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)

##  fac1  fac2  fac3  fac4 
## 0.894 0.895 0.905 0.902

A sample size of \(32 \times 16 = 512\) gives us the 90% power we are seeking. In case you don’t believe my simulation, we can compare it with the estimate provided by the MOST package, created by the Methodology Center at Penn State:

library(MOST)
FactorialPowerPlan(alpha = 0.10, model_order = 1, nfactors = 4, ntotal = 500,
                   sigma_y = 3.05, raw_main = 0.8)$power

## [1] "------------------------------------------------------------"
## [1] "FactorialPowerPlan Macro"
## [1] "The Methodology Center"
## [1] "(c) 2012 Pennsylvania State University"
## [1] "------------------------------------------------------------"
## [1] "Assumptions:"
## [1] "There are 4 dichotomous factors."
## [1] "There is independent random assignment."
## [1] "Analysis will be based on main effects only."
## [1] "Two-sided alpha: 0.10"
## [1] "Total number of participants: 500"
## [1] "Effect size as unstandardized difference in means: 0.80"
## [1] "Assumed standard deviation for the response variable is 3.05"
## [1] "Attempting to calculate the estimated power."
## [1] "------------------------------------------------------------"
## [1] "Results:"
## [1] "The calculated power is 0.9004"
## [1] 0.9004

Interaction

A major advantage of the multi-factorial experiment over the traditional RCT, of course, is that it allows us to investigate if the interventions interact in any interesting ways. However, in practice it may be difficult to generate sample sizes large enough to measure these interactions with much precision.

In the next pair of simulations, we see that even if we are only interested in exploring the main effects, underlying interaction reduces power. If there is actually interaction (as in the second scenario defined above), the original sample size of 500 may be inadequate to estimate the main effects:

dt <- genMultiFac(31, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4)))
res <- iterFunc(cX, dt)

apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)

##  fac1  fac2  fac3  fac4 
## 0.567 0.556 0.588 0.541

Here, a total sample of about 1300 does the trick:

dt <- genMultiFac(81, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4)))
res <- iterFunc(cX, dt)

apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)

##  fac1  fac2  fac3  fac4 
## 0.898 0.893 0.908 0.899

But this sample size is not adequate to estimate the actual second degree interaction terms:

apply(res$p[, .(`fac1:fac2`, `fac1:fac3`, `fac1:fac4`,
                `fac2:fac3`, `fac2:fac4`, `fac3:fac4`)] < 0.10, 2, mean)

## fac1:fac2 fac1:fac3 fac1:fac4 fac2:fac3 fac2:fac4 fac3:fac4 
##     0.144     0.148     0.163     0.175     0.138     0.165

You would actually need a sample size of about 32,000 to be adequately powered to estimate the interaction! Of course, this requirement is driven by the size of the interaction effects and the variation, so maybe this is a bit extreme:

dt <- genMultiFac(2000, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4)))
res <- iterFunc(cX, dt)

apply(res$p[, .(`fac1:fac2`, `fac1:fac3`, `fac1:fac4`,
                `fac2:fac3`, `fac2:fac4`, `fac3:fac4`)] < 0.10, 2, mean)

## fac1:fac2 fac1:fac3 fac1:fac4 fac2:fac3 fac2:fac4 fac3:fac4 
##     0.918     0.902     0.888     0.911     0.894     0.886

A binary outcome

The situation with the binary outcome is really no different than the continuous outcome, except for the fact that sample size requirements might be much more sensitive to the strength of underlying interaction.

Again, we have two scenarios – one with interaction and one without. When I talk about an additive (non-interaction) model in this context, the additivity is on the log-odds scale. This becomes apparent when looking at a plot.

I want to reiterate here that we have interaction when there are limits to how much marginal effect an additional intervention can have conditional on the presence of other interventions. In a recent project (one that motivated this pair of blog entries), we started with the assumption that a single intervention would have a 5 percentage point effect on the outcome (which was smoking cessation), but that a combination of all four interventions might only get a 10 percentage point reduction. This cap generates severe interaction, which dramatically affects sample size requirements, as we see below (using even less restrictive interaction assumptions).

No interaction:

##                              condition formula variance   dist     link
## 1: (fac1 + fac2 + fac3 + fac4) == -4    0.10       NA binary identity
## 2: (fac1 + fac2 + fac3 + fac4) == -2    0.18       NA binary identity
## 3: (fac1 + fac2 + fac3 + fac4) ==  0    0.30       NA binary identity
## 4: (fac1 + fac2 + fac3 + fac4) ==  2    0.46       NA binary identity
## 5: (fac1 + fac2 + fac3 + fac4) ==  4    0.63       NA binary identity
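As a quick numerical check (mine, not in the original post), the probabilities in this no-interaction table are indeed approximately additive on the log-odds scale:

# Log-odds rise by roughly 0.68 per additional intervention; the small wobble
# comes from the probabilities being rounded to two decimals.
p <- c(0.10, 0.18, 0.30, 0.46, 0.63)
round(diff(log(p / (1 - p))), 2)

## [1] 0.68 0.67 0.69 0.69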

Interaction:

##                              condition formula variance   dist     link
## 1: (fac1 + fac2 + fac3 + fac4) == -4    0.10       NA binary identity
## 2: (fac1 + fac2 + fac3 + fac4) == -2    0.18       NA binary identity
## 3: (fac1 + fac2 + fac3 + fac4) ==  0    0.24       NA binary identity
## 4: (fac1 + fac2 + fac3 + fac4) ==  2    0.28       NA binary identity
## 5: (fac1 + fac2 + fac3 + fac4) ==  4    0.30       NA binary identity

The plot highlights that additivity is on the log-odds scale only:

The sample size requirement for a treatment effect of 8 percentage points for a single intervention compared to control is about 640 total participants:

power.prop.test(power = 0.90, p1 = .10, p2 = .18, sig.level = 0.10)

## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 320.3361
##              p1 = 0.1
##              p2 = 0.18
##       sig.level = 0.1
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Simulation shows that the multi-factorial experiment requires only 500 participants, a pretty surprising reduction:

dt <- genMultiFac(31, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4)))
res <- iterFunc(bNoX, dt, binary = TRUE)

apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)

##  fac1  fac2  fac3  fac4 
## 0.889 0.910 0.916 0.901

But, if there is a cap on how much we can affect the outcome (i.e. there is underlying interaction), estimated power is considerably reduced:

dt <- genMultiFac(31, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4)))
res <- iterFunc(bX, dt, binary = TRUE)

apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)

##  fac1  fac2  fac3  fac4 
## 0.398 0.409 0.405 0.392

We need to increase the sample size to about \(125 \times 16 = 2000\) just to explore the main effects:

dt <- genMultiFac(125, nFactors = 4, coding = "effect", colNames = paste0("fac", c(1:4)))
res <- iterFunc(bX, dt, binary = TRUE)

apply(res$p[, .(fac1, fac2, fac3, fac4)] < 0.10, 2, mean)

##  fac1  fac2  fac3  fac4 
## 0.910 0.890 0.895 0.887

I think the biggest takeaway from all of this is that multi-factorial experiments are a super interesting option when exploring possible interventions or combinations of interventions, particularly when the outcome is continuous. However, the approach may be less feasible when the outcome is binary, as sample size requirements can quickly become prohibitive, depending on the number of factors and the extent of underlying interaction.


To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

Get your tracks from the Strava API and plot them on Leaflet maps

Tue, 05/01/2018 - 11:29

(This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

Here is some updated R code from my previous post. It doesn’t throw any warnings when importing tracks with and without heart rate information. Also, it is easier to distinguish types of tracks now (e.g., when you want to plot runs and rides separately). Another thing I changed: You get very basic information on the track when you click on it (currently the name of the track and the total length).

Have fun and leave a comment if you have any questions.

options(stringsAsFactors = F)

rm(list = ls())

library(httr)
library(rjson)
library(leaflet)
library(dplyr)

token <- ""   # paste your Strava API access token here

# Functions ---------------------------------------------------------------

get.coord.df.from.stream <- function (stream.obj) {
  data.frame(lat = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[1]]),
             lon = sapply(stream.obj[[1]]$data, USE.NAMES = F, FUN = function (x) x[[2]]))
}

get.stream.from.activity <- function (act.id, token) {
  stream <- GET("https://www.strava.com/",
                path = paste0("api/v3/activities/", act.id, "/streams/latlng"),
                query = list(access_token = token))
  content(stream)
}

get.activities2 <- function (token) {
  activities <- GET("https://www.strava.com/", path = "api/v3/activities",
                    query = list(access_token = token, per_page = 200))
  activities <- content(activities, "text")
  activities <- fromJSON(activities)
  res.df <- data.frame()
  for (a in activities) {
    values <- sapply(c("name", "distance", "moving_time", "elapsed_time", "total_elevation_gain",
                       "type", "id", "start_date_local",
                       "location_country", "average_speed", "max_speed", "has_heartrate", "elev_high",
                       "elev_low", "average_heartrate", "max_heartrate"), FUN = function (x) {
                         if (is.null(a[[x]])) { NA } else { a[[x]] }
                       })
    res.df <- rbind(res.df, values)
  }
  names(res.df) <- c("name", "distance", "moving_time", "elapsed_time", "total_elevation_gain",
                     "type", "id", "start_date_local",
                     "location_country", "average_speed", "max_speed", "has_heartrate", "elev_high",
                     "elev_low", "average_heartrate", "max_heartrate")
  res.df
}

get.multiple.streams <- function (act.ids, token) {
  res.list <- list()
  for (act.id.i in 1:length(act.ids)) {
    if (act.id.i %% 5 == 0) cat("Activity no.", act.id.i, "of", length(act.ids), "\n")
    stream <- get.stream.from.activity(act.ids[act.id.i], token)
    coord.df <- get.coord.df.from.stream(stream)
    res.list[[length(res.list) + 1]] <- list(act.id = act.ids[act.id.i],
                                             coords = coord.df)
  }
  res.list
}

activities <- get.activities2(token)

stream.list <- get.multiple.streams(activities$id, token)

# Leaflet -----------------------------------------------------------------

lons.range <- c(9.156572, 9.237580)
lats.range <- c(48.74085, 48.82079)

map <- leaflet() %>%
  addProviderTiles("OpenMapSurfer.Grayscale", # also nice: CartoDB.Positron, CartoDB.DarkMatterNoLabels
                   options = providerTileOptions(noWrap = T)) %>%
  fitBounds(lng1 = min(lons.range), lat1 = max(lats.range),
            lng2 = max(lons.range), lat2 = min(lats.range))

add.run <- function (act.id, color, act.name, act.dist, strlist = stream.list) {
  act.ind <- sapply(strlist, USE.NAMES = F, FUN = function (x) {
    x$act.id == act.id
  })
  act.from.list <- strlist[act.ind][[1]]
  map <<- addPolylines(map, lng = act.from.list$coords$lon,
                       lat = act.from.list$coords$lat,
                       color = color, opacity = 1/3, weight = 2,
                       popup = paste0(act.name, ", ", round(as.numeric(act.dist) / 1000, 2), " km"))
}

# plot all activities: runs in red, rides in blue, everything else in black
for (i in 1:nrow(activities)) {
  add.run(activities[i, "id"], ifelse(activities[i, "type"] == "Run", "red",
                                      ifelse(activities[i, "type"] == "Ride", "blue", "black")),
          activities[i, "name"], activities[i, "distance"])
}

map
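A small variation of my own (not from the original post): since add.run() updates map in place via <<-, you can rebuild the map object and loop over a filtered subset if you only want to draw one activity type, e.g. runs.

# Hypothetical variation: start from a fresh map and draw only the runs, in red.
map <- leaflet() %>%
  addProviderTiles("CartoDB.Positron",
                   options = providerTileOptions(noWrap = T)) %>%
  fitBounds(lng1 = min(lons.range), lat1 = max(lats.range),
            lng2 = max(lons.range), lat2 = min(lats.range))

runs <- activities[activities$type == "Run", ]
for (i in seq_len(nrow(runs))) {
  add.run(runs[i, "id"], "red", runs[i, "name"], runs[i, "distance"])
}

map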


To leave a comment for the author, please follow the link and comment on their blog: Rcrastinate.

RcppArmadillo 0.8.500.0

Tue, 05/01/2018 - 00:26

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

RcppArmadillo release 0.8.500.0, originally prepared and uploaded on April 21, has hit CRAN today (after having already been available via the RcppCore drat repo). A corresponding Debian release will be prepared as well. This RcppArmadillo release contains Armadillo release 8.500.0 with a number of rather nice changes (see below for details), and continues our normal bi-monthly CRAN release cycle.

Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 472 other packages on CRAN.
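As a tiny illustration of what that integration looks like in practice (my sketch, not part of the release announcement; the function name is made up), an Armadillo expression can be compiled and called from R in a few lines:

# A minimal sketch using Rcpp::cppFunction() with depends = "RcppArmadillo".
library(Rcpp)

cppFunction(depends = "RcppArmadillo", code = '
  arma::mat arma_crossprod(const arma::mat& X) {
    return X.t() * X;   // an Armadillo expression, returned to R as a matrix
  }')

X <- matrix(rnorm(20), nrow = 5)
all.equal(arma_crossprod(X), crossprod(X))  # should be TRUE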

A high-level summary of changes follows.

Changes in RcppArmadillo version 0.8.500.0 (2018-04-21)
  • Upgraded to Armadillo release 8.500 (Caffeine Raider)

    • faster handling of sparse matrices by kron() and repmat()

    • faster transpose of sparse matrices

    • faster element access in sparse matrices

    • faster row iterators for sparse matrices

    • faster handling of compound expressions by trace()

    • more efficient handling of aliasing in submatrix views

    • expanded normalise() to handle sparse matrices

    • expanded .transform() and .for_each() to handle sparse matrices

    • added reverse() for reversing order of elements

    • added repelem() for replicating elements

    • added roots() for finding the roots of a polynomial

  • Fewer LAPACK compile-time guards are used, new unit tests for underlying features have been added (Keith O’Hara in #211 addressing #207).

  • The configure check for LAPACK features has been updated accordingly (Keith O’Hara in #214 addressing #213).

  • The compile-time check for g++ is now more robust to minimal shell versions (#217 addressing #216).

  • Compiler tests were added for macOS (Keith O’Hara in #219).

Courtesy of CRANberries, there is a diffstat report relative to the previous release. More detailed information is on the RcppArmadillo page. Questions, comments, etc. should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

17 Jobs for R users from around the world (2018-04-30)

Mon, 04/30/2018 - 22:09
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
All New R Jobs
  1. Full-Time
    Lead Data Scientist @ Washington, District of Columbia, U.S. AFL-CIO – Posted by carterkalchik
    Washington District of Columbia, United States
    27 Apr 2018
  2. Full-Time
    Data Scientist R @ Medellín, Antioquia, Colombia IDATA S.A.S – Posted by vmhoyos
    Medellín Antioquia, Colombia
    27 Apr 2018
  3. Full-Time
    Data Scientist @ Annapolis, Maryland, United States The National Socio-Environmental Synthesis Center – Posted by sesync
    Annapolis Maryland, United States
    27 Apr 2018
  4. Freelance
    Freelance Project: R Libraries Install on Amazon AWS EC2 WS
    Anywhere
    27 Apr 2018
  5. Full-Time
    Associate, Data Analytics National Network for Safe Communities (NNSC) – Posted by nnsc
    New York New York, United States
    18 Apr 2018
  6. Full-Time
    Senior Research Associate Kellogg Research Support – Posted by lorbos
    Evanston Illinois, United States
    17 Apr 2018
  7. Full-Time
    Empirical Research Fellow Kellogg School of Management (Northwestern University) – Posted by lorbos
    Evanston Illinois, United States
    17 Apr 2018
  8. Full-Time
    Discrete Choice Modeler @ Chicago, Illinois, U.S. Resource Systems Group – Posted by patricia.holland@rsginc.com
    Chicago Illinois, United States
    13 Apr 2018
  9. Full-Time
    R&D Database Developer @ Toronto, Canada Crescendo Technology Ltd – Posted by Crescendo
    Toronto Ontario, Canada
    12 Apr 2018
  10. Full-Time
    Applied Statistics Position @ Florida U.S. New College of Florida – Posted by Jobs@New College
    Sarasota Florida, United States
    9 Apr 2018
  11. Full-Time
    PhD Fellowship @ Wallace University, US Ubydul Haque – Posted by Ubydul
    Khon Kaen Chang Wat Khon Kaen, Thailand
    9 Apr 2018
  12. Part-Time
    Data Scientist Alchemy – Posted by Andreas Voniatis
    Anywhere
    7 Apr 2018
  13. Full-Time
    Product Owner (Data Science) (m/f) Civey GmbH – Posted by Civey
    Berlin Berlin, Germany
    6 Apr 2018
  14. Full-Time
    Solutions Engineer RStudio – Posted by nwstephens
    Anywhere
    3 Apr 2018
  15. Full-Time
    Post-Doctoral Fellow – Data Scientist @ CIMMYT (eastern Africa) International Maize and Wheat Improvement Center – Posted by cimmyt-jobs
    Marsabit County Kenya
    27 Mar 2018
  16. Freelance
    Statistician / R Developer – for Academic Statistical Research Academic Research – Posted by empiricus
    Anywhere
    27 Mar 2018
  17. Full-Time
    Graduate Data Scientist JamieAi – Posted by JamieAi
    Anywhere
    27 Mar 2018

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(You may also look at previous R jobs posts.)


Microsoft R Open 3.4.4 now available

Mon, 04/30/2018 - 20:52

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

An update to Microsoft R Open (MRO) is now available for download on Windows, Mac and Linux. This release upgrades the R language engine to version 3.4.4, which addresses some minor issues with timezone detection and a few edge cases in certain statistics functions. As a maintenance release, it's backwards-compatible with scripts and packages from the prior release of MRO.

MRO 3.4.4 points to a fixed CRAN snapshot taken on April 1 2018, and you can see some highlights of new packages released since the prior version of MRO on the Spotlights page. As always, you can use the built-in checkpoint package to access packages from an earlier date (for reproducibility) or a later date (to access new and updated packages).
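For instance (a minimal sketch of my own, not from the announcement), pinning a script to the same snapshot date looks something like this:

# Uses the checkpoint package bundled with MRO; the date below matches the
# fixed CRAN snapshot mentioned above.
library(checkpoint)
checkpoint("2018-04-01")
# From here on, packages are installed from and loaded against that snapshot.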

Looking ahead, the next update based on R 3.5.0 has started the build and test process. Microsoft R Open 3.5.0 is scheduled for release on May 31.

We hope you find Microsoft R Open useful, and if you have any comments or questions please visit the Microsoft R Open forum. You can follow the development of Microsoft R Open at the MRO Github repository. To download Microsoft R Open, simply follow the link below.

MRAN: Download Microsoft R Open


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
