Binning Data in a Database
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
Roz King just wrote an interesting article on binning data (a common data analytics step) in a database. He compares a casebased approach (where the bin divisions are stuffed into code) with a join based approach. He shares code and timings.
Best of all: rquery gets some attention and turns out to be the dominant solution at all scales measured.
Here is an example timing (lower times better):
So please check the article out.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Some R Packages for ROC Curves
(This article was first published on R Views, and kindly contributed to Rbloggers)
In a recent post, I presented some of the theory underlying ROC curves, and outlined the history leading up to their present popularity for characterizing the performance of machine learning models. In this post, I describe how to search CRAN for packages to plot ROC curves, and highlight six useful packages.
Although I began with a few ideas about packages that I wanted to talk about, like ROCR and pROC, which I have found useful in the past, I decided to use Gábor Csárdi’s relatively new package pkgsearch to search through CRAN and see what’s out there. The package_search() function takes a text string as input and uses basic text mining techniques to search all of CRAN. The algorithm searches through package text fields, and produces a score for each package it finds that is weighted by the number of reverse dependencies and downloads.
library(tidyverse) # for data manipulation library(dlstats) # for package download stats library(pkgsearch) # for searching packagesAfter some trial and error, I settled on the following query, which includes a number of interesting ROCrelated packages.
rocPkg < pkg_search(query="ROC",size=200)Then, I narrowed down the field to 46 packages by filtering out orphaned packages and packages with a score less than 190.
rocPkgShort < rocPkg %>% filter(maintainer_name != "ORPHANED", score > 190) %>% select(score, package, downloads_last_month) %>% arrange(desc(downloads_last_month)) head(rocPkgShort) ## # A tibble: 6 x 3 ## score package downloads_last_month ## ## 1 690. ROCR 56356 ## 2 7938. pROC 39584 ## 3 1328. PRROC 9058 ## 4 833. sROC 4236 ## 5 266. hmeasure 1946 ## 6 1021. plotROC 1672To complete the selection process, I did the hard work of browsing the documentation for the packages to pick out what I thought would be generally useful to most data scientists. The following plot uses Guangchuang Yu’s dlstats package to look at the download history for the six packages I selected to profile.
library(dlstats) shortList < c("pROC","precrec","ROCit", "PRROC","ROCR","plotROC") downloads < cran_stats(shortList) ggplot(downloads, aes(end, downloads, group=package, color=package)) + geom_line() + geom_point(aes(shape=package)) + scale_y_continuous(trans = 'log2') ROCR – 2005ROCR has been around for almost 14 years, and has be a rocksolid workhorse for drawing ROC curves. I particularly like the way the performance() function has you set up calculation of the curve by entering the true positive rate, tpr, and false positive rate, fpr, parameters. Not only is this reassuringly transparent, it shows the flexibility to calculate nearly every performance measure for a binary classifier by entering the appropriate parameter. For example, to produce a precisionrecall curve, you would enter prec and rec. Although there is no vignette, the documentation of the package is very good.
The following code sets up and plots the default ROCR ROC curve using a synthetic data set that comes with the package. I will use this same data set throughout this post.
library(ROCR) ## Loading required package: gplots ## ## Attaching package: 'gplots' ## The following object is masked from 'package:stats': ## ## lowess # plot a ROC curve for a single prediction run # and color the curve according to cutoff. data(ROCR.simple) df < data.frame(ROCR.simple) pred < prediction(df$predictions, df$labels) perf < performance(pred,"tpr","fpr") plot(perf,colorize=TRUE) pROC – 2010It is clear from the downloads curve that pROC is also popular with data scientists. I like that it is pretty easy to get confidence intervals for the Area Under the Curve, AUC, on the plot.
library(pROC) ## Type 'citation("pROC")' for a citation. ## ## Attaching package: 'pROC' ## The following objects are masked from 'package:stats': ## ## cov, smooth, var pROC_obj < roc(df$labels,df$predictions, smoothed = TRUE, # arguments for ci ci=TRUE, ci.alpha=0.9, stratified=FALSE, # arguments for plot plot=TRUE, auc.polygon=TRUE, max.auc.polygon=TRUE, grid=TRUE, print.auc=TRUE, show.thres=TRUE) sens.ci < ci.se(pROC_obj) plot(sens.ci, type="shape", col="lightblue") ## Warning in plot.ci.se(sens.ci, type = "shape", col = "lightblue"): Low ## definition shape. plot(sens.ci, type="bars") PRROC – 2014Although not nearly as popular as ROCR and pROC, PRROC seems to be making a bit of a comeback lately. The terminology for the inputs is a bit eclectic, but once you figure that out the roc.curve() function plots a clean ROC curve with minimal fuss. PRROC is really set up to do precisionrecall curves as the vignette indicates.
library(PRROC) PRROC_obj < roc.curve(scores.class0 = df$predictions, weights.class0=df$labels, curve=TRUE) plot(PRROC_obj) plotROC – 2014plotROC is an excellent choice for drawing ROC curves with ggplot(). My guess is that it appears to enjoy only limited popularity because the documentation uses medical terminology like “disease status” and “markers”. Nevertheless, the documentation, which includes both a vignette and a Shiny application, is very good.
The package offers a number of featurerich ggplot() geoms that enable the production of elaborate plots. The following plot contains some styling, and includes Clopper and Pearson (1934) exact method confidence intervals.
library(plotROC) rocplot < ggplot(df, aes(m = predictions, d = labels))+ geom_roc(n.cuts=20,labels=FALSE) rocplot + style_roc(theme = theme_grey) + geom_rocci(fill="pink") precrec – 2015precrec is another library for plotting ROC and precisionrecall curves.
library(precrec) ## ## Attaching package: 'precrec' ## The following object is masked from 'package:pROC': ## ## auc precrec_obj < evalmod(scores = df$predictions, labels = df$labels) autoplot(precrec_obj)Parameter options for the evalmod() function make it easy to produce basic plots of various model features.
precrec_obj2 < evalmod(scores = df$predictions, labels = df$labels, mode="basic") autoplot(precrec_obj2) ROCit – 2019ROCit is a new package for plotting ROC curves and other binary classification visualizations that rocketed onto the scene in January, and is climbing quickly in popularity. I would never have discovered it if I had automatically filtered my original search by downloads. The default plot includes the location of the Yourden’s J Statistic.
library(ROCit) ## Warning: package 'ROCit' was built under R version 3.5.2 ROCit_obj < rocit(score=df$predictions,class=df$labels) plot(ROCit_obj)Several other visualizations are possible. The following plot shows the cumulative densities of the positive and negative responses. The KS statistic shows the maximum distance between the two curves.
ksplot(ROCit_obj)In this attempt to dig into CRAN and uncover some of the resources R contains for plotting ROC curves and other binary classifier visualizations, I have only scratched the surface. Moreover, I have deliberately ignored the many packages available for specialized applications, such as survivalROC for computing timedependent ROC curves from censored survival data, and cvAUC, which contains functions for evaluating crossvalidated AUC measures. Nevertheless, I hope that this little exercise will help you find what you are looking for.
_____='https://rviews.rstudio.com/2019/03/01/somerpackagesforroccurves/';
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R Views. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
htmlunitjars Updated to 2.34.0
(This article was first published on R – rud.is, and kindly contributed to Rbloggers)
The indev htmlunit package for javascript“enabled” webscraping without the need for Selenium, Splash or headless Chrome relies on the HtmlUnit library and said library just released version 2.34.0 with a wide array of changes that should make it possible to scrape more gnarly javascript“enabled” sites. The Chrome emulation is now also onpar with Chrome 72 series (my Chrome beta is at 73.0.3683.56 so it’s super close to very current).
In reality, the update was to the htmlunitjars package where the main project JAR and dependent JARs all received a refresh.
The README and tests were all rerun on both packages and Travis is happy.
If you’ve got a working rJava installation (aye, it’s 2019 and that’s still “a thing”) then you can just do:
install.packages(c("htmlunitjars", "htmlunit"), repos = "https://cinc.rud.is/")to get them installed and start playing with the DSL or work directly with the Java classes.
FINAs usual, use your preferred social coding site to log feature requests or problems.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RStudio Instructor Training
(This article was first published on RStudio Blog, and kindly contributed to Rbloggers)
We are pleased to announce the launch of RStudio’s instructor training and certification program. Its goal is to help people apply modern evidencebased teaching practices to teach data science using R and RStudio’s products, and to help people who need such training find the trainers they need.
Like the training programs for flight instructors, the ski patrol, and the Carpentries, ours distinguishes between instructors (who teach end users) and coaches (who teach instructors). This program focuses on instructors; anyone in the R community who wishes to become one is encouraged to apply by filling in this form. Candidates must be proficient in the technical skills and tools they wish to teach; they can demonstrate this when applying by pointing us at materials they have previously developed and/or sharing a short video with us (such as a screencast or a recording of a conference talk).
There are three steps to becoming certified:

Candidates must take part in a oneday training course on modern teaching methods similar to that offered at rstudio::conf 2019. (We will offer this course several times in the coming year.)

After completing that course, candidates must complete a onehour exam on the material, then prepare and deliver a demonstration lesson on a topic assigned to them.

Finally, in order to ensure that instructors are proficient with the technical content they will be teaching, they must complete a practical examination and deliver a demonstration for each subject they wish to be certified to teach.
Instructors must certify on a pertopic basis, just as pilots obtain ratings for different kinds of aircraft. We will initially offer certification on data analysis using the tidyverse and on Shiny. Other subjects, such as administering our professional products and using Connect, will be rolled out before the end of 2019.
As at present, we will advertise instructors on our web site, and instructors will be eligible for free teaching licenses to RStudio professional onpremises and cloud products for use in their training. They can teach in whatever format they want, from halfday workshops to multiweek courses, and charge whatever the market will bear, but will be required to make their materials available to RStudio for inspection (but not for use) if asked to do so. (And yes, there will be stickers…)
The training course will cost $500, and each examination will cost an additional $500; anyone who does not pass an exam can retry once at a later date for an additional $500. Once certified, instructors will be required to pay $50/year in membership dues, and to requalify every three years. Exemptions will be provided on a casebycase basis for those who might otherwise not be able to take part.
Finally, we will make the training materials we have developed available for instructors to use under Creative Commons licenses, and we will encourage them to collaborate on modifying and extending these materials and the curricula they develop themselves. Different instructors may wish to teach topics in a variety of ways to meet the needs of different audiences, and may choose to keep their materials private, but as with open source software, we believe that pooling effort will make everyone more effective.
If you would like to know more about this program, or would like to arrange training for staff at your company or institution, please contact certification@rstudio.com or fill in this form. If you are already listed on our trainers’ page, please keep an eye on your inbox for news as well.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
KDA–Robustness Results
(This article was first published on R – QuantStrat TradeR, and kindly contributed to Rbloggers)
This post will display some robustness results for KDA asset allocation.
Ultimately, the two canary instruments fare much better using the original filter weights in Defensive Asset Allocation than in other variants of the weights for the filter. While this isn’t as worrying (the filter most likely was created that way and paired with those instruments by design), what *is* somewhat more irritating is that the strategy is dependent upon the endofmonth phenomenon, meaning this strategy cannot be simply tranched throughout an entire trading month.
So first off, let’s review the code from last time:
# KDA asset allocation # KDA stands for Kipnis Defensive Adaptive (Asset Allocation). # compute strategy statistics stratStats < function(rets) { stats < rbind(table.AnnualizedReturns(rets), maxDrawdown(rets)) stats[5,] < stats[1,]/stats[4,] stats[6,] < stats[1,]/UlcerIndex(rets) rownames(stats)[4] < "Worst Drawdown" rownames(stats)[5] < "Calmar Ratio" rownames(stats)[6] < "Ulcer Performance Index" return(stats) } # required libraries require(quantmod) require(PerformanceAnalytics) require(tseries) # symbols symbols < c("SPY", "VGK", "EWJ", "EEM", "VNQ", "RWX", "IEF", "TLT", "DBC", "GLD", "VWO", "BND") # get data rets < list() for(i in 1:length(symbols)) { returns < Return.calculate(Ad(get(getSymbols(symbols[i], from = '19900101')))) colnames(returns) < symbols[i] rets[[i]] < returns } rets < na.omit(do.call(cbind, rets)) # algorithm KDA < function(rets, offset = 0, leverageFactor = 1.5, momWeights = c(12, 4, 2, 1)) { # get monthly endpoints, allow for offsetting ala AllocateSmartly/Newfound Research ep < endpoints(rets) + offset ep[ep < 1] < 1 ep[ep > nrow(rets)] < nrow(rets) ep < unique(ep) epDiff < diff(ep) if(last(epDiff)==1) { # if the last period only has one observation, remove it ep < ep[length(ep)] } # initialize vector holding zeroes for assets emptyVec < data.frame(t(rep(0, 10))) colnames(emptyVec) < symbols[1:10] allWts < list() # we will use the 13612F filter for(i in 1:(length(ep)12)) { # 12 assets for returns  2 of which are our crash protection assets retSubset < rets[c((ep[i]+1):ep[(i+12)]),] epSub < ep[i:(i+12)] sixMonths < rets[(epSub[7]+1):epSub[13],] threeMonths < rets[(epSub[10]+1):epSub[13],] oneMonth < rets[(epSub[12]+1):epSub[13],] # computer 13612 fast momentum moms < Return.cumulative(oneMonth) * momWeights[1] + Return.cumulative(threeMonths) * momWeights[2] + Return.cumulative(sixMonths) * momWeights[3] + Return.cumulative(retSubset) * momWeights[4] assetMoms < moms[,1:10] # Adaptive Asset Allocation investable universe cpMoms < moms[,11:12] # VWO and BND from Defensive Asset Allocation # find qualifying assets highRankAssets < rank(assetMoms) >= 6 # top 5 assets posReturnAssets < assetMoms > 0 # positive momentum assets selectedAssets < highRankAssets & posReturnAssets # intersection of the above # perform meanvariance/quadratic optimization investedAssets < emptyVec if(sum(selectedAssets)==0) { investedAssets < emptyVec } else if(sum(selectedAssets)==1) { investedAssets < emptyVec + selectedAssets } else { idx < which(selectedAssets) # use 13612 fast correlation average to match with momentum filter cors < (cor(oneMonth[,idx]) * momWeights[1] + cor(threeMonths[,idx]) * momWeights[2] + cor(sixMonths[,idx]) * momWeights[3] + cor(retSubset[,idx]) * momWeights[4])/sum(momWeights) vols < StdDev(oneMonth[,idx]) # use last month of data for volatility computation from AAA covs < t(vols) %*% vols * cors # do standard min vol optimization minVolRets < t(matrix(rep(1, sum(selectedAssets)))) minVolWt < portfolio.optim(x=minVolRets, covmat = covs)$pw names(minVolWt) < colnames(covs) investedAssets < emptyVec investedAssets[,selectedAssets] < minVolWt } # crash protection  between aggressive allocation and crash protection allocation pctAggressive < mean(cpMoms > 0) investedAssets < investedAssets * pctAggressive pctCp < 1pctAggressive # if IEF momentum is positive, invest all crash protection allocation into it # otherwise stay in cash for crash allocation if(assetMoms["IEF"] > 0) { investedAssets["IEF"] < investedAssets["IEF"] + pctCp } # leverage portfolio if desired in cases when both risk indicator assets have positive momentum if(pctAggressive == 1) { investedAssets = investedAssets * leverageFactor } # append to list of monthly allocations wts < xts(investedAssets, order.by=last(index(retSubset))) allWts[[i]] < wts } # put all weights together and compute cash allocation allWts < do.call(rbind, allWts) allWts$CASH < 1rowSums(allWts) # add cash returns to universe of investments investedRets < rets[,1:10] investedRets$CASH < 0 # compute portfolio returns out < Return.portfolio(R = investedRets, weights = allWts) return(list(allWts, out)) }So, the idea is that we take the basic Adaptive Asset Allocation algorithm, and wrap it in a canary universe from Defensive Asset Allocation (see previous post for links to both), which we use to control capital allocation, ranging from 0 to 1 (or beyond, in cases where leverage applies).
One of the ideas was to test out different permutations of the parameters belonging to the canary filter–a 1, 3, 6, 12 weighted filter focusing on the first month.
There are two interesting variants of this–equal weighting on the filter (both for momentum and the safety assets), and reversing the weights (that is, 1 * 1, 3 * 2, 6 * 4, 12 * 12). Here are the results of that experiment:
# different leverages KDA_100 < KDA(rets, leverageFactor = 1) KDA_EW < KDA(rets, leverageFactor = 1, momWeights = c(1,1,1,1)) KDA_rev < KDA(rets, leverageFactor = 1, momWeights = c(1, 2, 4, 12)) # KDA_150 < KDA(rets, leverageFactor = 1.5) # KDA_200 < KDA(rets, leverageFactor = 2) # compare compare < na.omit(cbind(KDA_100[[2]], KDA_EW[[2]], KDA_rev[[2]])) colnames(compare) < c("KDA_base", "KDA_EW", "KDA_rev") charts.PerformanceSummary(compare, colorset = c('black', 'purple', 'gold'), main = "KDA AA with various momentum weights") stratStats(compare) apply.yearly(compare, Return.cumulative)With the following results:
> stratStats(compare) KDA_base KDA_EW KDA_rev Annualized Return 0.10990000 0.0879000 0.0859000 Annualized Std Dev 0.09070000 0.0900000 0.0875000 Annualized Sharpe (Rf=0%) 1.21180000 0.9764000 0.9814000 Worst Drawdown 0.07920363 0.1360625 0.1500333 Calmar Ratio 1.38756275 0.6460266 0.5725396 Ulcer Performance Index 3.96188378 2.4331636 1.8267448 > apply.yearly(compare, Return.cumulative) KDA_base KDA_EW KDA_rev 20081231 0.15783690 0.101929228 0.08499664 20091231 0.18169281 0.004707164 0.02403531 20101231 0.17797930 0.283216782 0.27889530 20111230 0.17220203 0.161001680 0.03341651 20121231 0.13030215 0.081280035 0.09736187 20131231 0.12692163 0.120902015 0.09898799 20141231 0.04028492 0.047381890 0.06883301 20151231 0.01621646 0.005016891 0.01841095 20161230 0.01253209 0.020960805 0.01580218 20171229 0.15079063 0.148073455 0.18811112 20181231 0.06583962 0.029804042 0.04375225 20190220 0.01689700 0.003934044 0.00962020So, one mea culpa: after comparing AllocateSmartly, my initial code (which I’ve since edited, most likely owing to getting some logic mixed up when I added functionality to lag the day of month to trade) had some sort of bug in it which gave a slightly better than expected 2015 return. Nevertheless, the results are very similar. What is interesting to note is that in the raging bull market that was essentially from 2010 onwards, the equal weight and reverse weight filters don’t perform too badly, though the reverse weight filter has a massive drawdown in 2011, but in terms of capitalizing in awful markets, the original filter as designed by Keller and TrendXplorer works best, both in terms of making money during the recession, and does better near the market bottom in 2009.
Now that that’s out of the way, the more interesting question is how does the strategy work when not trading at the end of the month? Long story short, the best time to trade it is in the last week of the month. Once the new month rolls around, hands off. If you’re talking about tranching this strategy, then you have about a week’s time to get your positions in, so I’m not sure the actual dollar volume this strategy can manage, as it’s dependent on the monthend effect (I know that one of my former managers–a brilliant man, by all accounts–said that this phenomena no longer existed, but I feel these empirical results refute that assertion in this particular instance). Here are these results:
lagCompare < list() for(i in 1:21) { offRets < KDA(rets, leverageFactor = 1, offset = i) tmp < offRets[[2]] colnames(tmp) < paste0("Lag", i) lagCompare[[i]] < tmp } lagCompare < do.call(cbind, lagCompare) lagCompare < na.omit(cbind(KDA_100[[2]], lagCompare)) colnames(lagCompare)[1] < "Base" charts.PerformanceSummary(lagCompare, colorset=c("orange", rep("gray", 21))) stratStats(lagCompare)With the results:
> stratStats(lagCompare) Base Lag1 Lag2 Lag3 Lag4 Lag5 Lag6 Lag7 Lag8 Annualized Return 0.11230000 0.0584000 0.0524000 0.0589000 0.0319000 0.0319000 0.0698000 0.0790000 0.0912000 Annualized Std Dev 0.09100000 0.0919000 0.0926000 0.0945000 0.0975000 0.0957000 0.0943000 0.0934000 0.0923000 Annualized Sharpe (Rf=0%) 1.23480000 0.6357000 0.5654000 0.6229000 0.3270000 0.3328000 0.7405000 0.8460000 0.9879000 Worst Drawdown 0.07920363 0.1055243 0.1269207 0.1292193 0.1303246 0.1546962 0.1290020 0.1495558 0.1227749 Calmar Ratio 1.41786439 0.5534272 0.4128561 0.4558141 0.2447734 0.2062107 0.5410771 0.5282311 0.7428230 Ulcer Performance Index 4.03566328 1.4648618 1.1219982 1.2100649 0.4984094 0.5012318 1.3445786 1.4418132 2.3277271 Lag9 Lag10 Lag11 Lag12 Lag13 Lag14 Lag15 Lag16 Lag17 Annualized Return 0.0854000 0.0863000 0.0785000 0.0732000 0.0690000 0.0862000 0.0999000 0.0967000 0.1006000 Annualized Std Dev 0.0906000 0.0906000 0.0900000 0.0913000 0.0906000 0.0909000 0.0923000 0.0947000 0.0949000 Annualized Sharpe (Rf=0%) 0.9426000 0.9524000 0.8722000 0.8023000 0.7617000 0.9492000 1.0825000 1.0209000 1.0600000 Worst Drawdown 0.1278059 0.1189949 0.1197596 0.1112761 0.1294588 0.1498408 0.1224511 0.1290538 0.1274083 Calmar Ratio 0.6682006 0.7252411 0.6554796 0.6578231 0.5329880 0.5752771 0.8158357 0.7493000 0.7895878 Ulcer Performance Index 2.3120919 2.6415855 2.4441605 1.9248615 1.8096134 2.2378207 2.8753265 2.9092448 3.0703542 Lag18 Lag19 Lag20 Lag21 Annualized Return 0.097100 0.0921000 0.1047000 0.1019000 Annualized Std Dev 0.092900 0.0903000 0.0958000 0.0921000 Annualized Sharpe (Rf=0%) 1.044900 1.0205000 1.0936000 1.1064000 Worst Drawdown 0.100604 0.1032067 0.1161583 0.1517104 Calmar Ratio 0.965170 0.8923835 0.9013561 0.6716747 Ulcer Performance Index 3.263040 2.7159601 3.0758230 3.0414002Essentially, the trade at the very end of the month is the only one with a Calmar ratio above 1, though the Calmar ratio from lag15 to lag 21 is about .8 or higher, with a Sharpe ratio of 1 or higher. So, there’s definitely a window of when to trade, and when not to–namely, the lag 1 through 5 variations have the worst performances by no small margin. Therefore, I strongly suspect that the 13612 filter was designed around the idea of the endofmonth effect, or at least, not stresstested for different trading days within the month (and given that longerdated data is only monthly, this is understandable).
Nevertheless, I hope this does answer some people’s questions from the quant finance universe. I know that Corey Hoffstein of Think Newfound (and wow that blog is good from the perspective of properties of trendfollowing) loves diversifying across every bit of the process, though in this case, I do think there’s something to be said about “diworsification”.
In any case, I do think there are some future research venues for further research here.
Thanks for reading.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
You Need to Start Branding Your Graphs. Here’s How, with ggplot!
(This article was first published on Michael Toth's Blog, and kindly contributed to Rbloggers)
In today’s post I want to help you incorporate your company’s branding into your ggplot graphs. Why should you care about this? I’m glad you asked!
Have you ever seen a graph that looks like this? Of course you have! This is the default ggplot theme, and these graphs are everywhere. Now, look–I like the way this graph looks. The base ggplot theme is reasonable and the graph is clear. But it doesn’t in any way differentiate it from thousands of other, similarly designed graphs on the internet. That’s not good. You want your graphs to stand out! Take a look at this next graph:
I’m guessing, even before you saw the caption at the bottom, that you knew this was a graph from FiveThirtyEight, Nate Silver’s datadriven news service.
How about this one?
Again, this graph is immediately recognizable as coming from the Economist magazine. These two companies have done exceptional jobs of creating branded, differentiating styles to make their graphics immediately recognizable to anybody who sees them. In this post, I’m going to convince you why it’s important that you develop a branded style for your graphs at your own company, and then I’ll show you some quick steps to do it.
Now, I know, You might be thinking: branding and visual identity is something for the design and marketing teams to worry about. You’re a data scientist! You don’t have those skills, and, frankly, you have more important things to do. I sympathize, but I’m going to be honest with you: you need to get that idea out of your head. It’s easier than you think, and it’s part of your job! Or, at least, it should be. You see, when you start to make YOUR work fit in with YOUR COMPANY’S work, doors start to open for you. You create more for crossdepartmental and external facing opportunities when your work clearly matches the company brand. Maybe a graph you created can help the sales team put together a presentation to win a big client. Maybe the marketing team can use your work to help put together a press kit. These probably aren’t the core focus of your job, but the more you help people around your organization, the more respected you’ll be and the more opportunities you’ll have.
Over time, you will build a reputation and be recognized for your quality work because your work will be VISIBLE. Pretty soon, you will be the goto person when an executive needs a graph for a board presentation or an investor pitch. As you build more relationships throughout your company, you’ll be able to direct your focus to work that you enjoy, get involved in interesting projects, and advocate for projects of your own. And the best part? Most of this is completely passive! You’re already doing the work, these tips will just help you to make it more visible, more shareable, and more impactful!
Convinced? Okay, let’s get started. This is going to be one of the easiest highimpact changes you can make, so I hope you’re excited!
I’m going to show you how you can easily change the color palette of a graph and add your company’s logo to create a final, branded image that’s ready for publication.
To start, create a standard ggplot graph like you otherwise would:
Once you have your base graph put together, the next step is for you to change the colors to match your company’s color palette. I’m going to create a CocaCola branded graph for demonstration purposes, but you should substitute in your own company’s details here. If you don’t know what your company’s color palette is, ask somebody from your design or marketing teams to send it to you! Here, I’m using red, black, and gray to match CocaCola’s color palette.
Choosing individual colors like I’m doing here works for categorical graphs, but for continuous graphs you’ll need to do a bit more work to get a branded color scale. That’s a subject for another post, but check out the awesome Viz Palette tool by Elijah Meeks and Susie Lu for a sense of what’s possible. As you become more familiar, you can create custom ggplot themes and ggplot color paletes to make this process seamless, but I don’t want to get into all of that here, as it can be a bit overwhelming to learn all that at once.
Finally, let’s add your company’s logo to the graph for a complete, branded, and publicationready graph. Download a moderately high resolution logo and save it somewhere on your machine. The workhorse here is the grid.raster function, which can render an image on top of a preexisting image (in this case, your graph). The trick is to get the positioning and sizing right. This can be a bit confusing when you’re first starting with these image manipulations, so I’ll walk through them onebyone:
 x: this controls the xposition of where you place the logo. This should be a numeric value between 0 and 1, where 0 represents a position all the way on the left of the graph and 1 represents a position all the way on the right.
 y: this controls the yposition of where you place the logo. This should be a numeric value between 0 and 1, where 0 represents a position all the way on the bottom of the graph and 1 represents a position all the way on the top.
 just: this is a set of two values, the first corresponding to the horizontal justification and the second corresponding to the vertical justification. With x and y we chose a position on the grid to place the logo. just lets us choose how to justify the image at that location. Here, I’ve selected ‘left’ and ‘bottom’ justification, which means that the left bottom corner of the logo will be placed at the xy coordinates specified.
 width: this scales the logo down to a smaller size so it can be placed on the image. Here, I’m scaling the logo down to a 1inch size, but the size you’ll want will depend on the size of the graph you’ve created. Play around with different sizes until you find one that feels right.
And there we have it: a branded graph that would be suitable for a sales meeting, marketing presentation, or investor deck!
I help technology companies to leverage their data to produce branded, influential content to share with their clients. I work on everything from investor newsletters to blog posts to research papers. If you enjoy the work I do and are interested in working together, you can visit my consulting website or contact me at michael@michaeltothanalytics.com!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Michael Toth's Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Robust Regressions: Dealing with Outliers in R
(This article was first published on R Programming – DataScience+, and kindly contributed to Rbloggers)
Categories
Tags
It is often the case that a dataset contains significant outliers – or observations that are significantly out of range from the majority of other observations in our dataset. Let us see how we can use robust regressions to deal with this issue.
I described in another tutorial how we can run a linear regression in R. However, this does not account for the outliers in our data. So, how can we solve this?
PlotsA useful way of dealing with outliers is by running a robust regression, or a regression that adjusts the weights assigned to each observation in order to reduce the skew resulting from the outliers.
In this particular example, we will build a regression to analyse internet usage in megabytes across different observations. You will see that we have several outliers in this dataset. Specifically, we have three incidences where internet consumption is vastly higher than other observations in the dataset.
Let’s see how we can use a robust regression to mitigate for these outliers.
Firstly, let’s plot Cook’s distance and the QQ Plot:
Cook’s Distance QQ PlotWe can see that a plot of Cook’s distance shows clear outliers, and the QQ plot demonstrates the same (with a significant number of our observations not lying on the regression line).
When we get a summary of our data, we see that the maximum value for usage sharply exceeds the mean or median:
summary(mydata) age gender webpages Min. :19.00 Min. :0.0000 Min. : 10.00 1st Qu.:27.00 1st Qu.:0.0000 1st Qu.: 20.00 Median :37.00 Median :1.0000 Median : 25.00 Mean :37.94 Mean :0.5124 Mean : 29.64 3rd Qu.:48.75 3rd Qu.:1.0000 3rd Qu.: 32.00 Max. :60.00 Max. :1.0000 Max. :950.00 videohours income usage Min. : 0.0000 Min. : 0 Min. : 500 1st Qu.: 0.4106 1st Qu.: 3503 1st Qu.: 3607 Median : 1.7417 Median : 6362 Median : 9155 Mean : 4.0229 Mean : 6179 Mean : 11955 3rd Qu.: 4.7765 3rd Qu.: 8652 3rd Qu.: 19361 Max. :45.0000 Max. :12000 Max. :108954 OLS RegressionLet’s now run a standard OLS regression and see what we come up with.
#OLS Regression summary(ols < lm(usage ~ income + videohours + webpages + gender + age, data = mydata)) Call: lm(formula = usage ~ income + videohours + webpages + gender + age, data = mydata) Residuals: Min 1Q Median 3Q Max 10925 2559 479 2266 46030 Coefficients: Estimate Std. Error t value Pr(>t) (Intercept) 2.499e+03 4.893e+02 5.107 3.96e07 *** income 7.874e01 4.658e02 16.903 < 2e16 *** videohours 1.172e+03 3.010e+01 38.939 < 2e16 *** webpages 5.414e+01 3.341e+00 16.208 < 2e16 *** gender 1.227e+02 2.650e+02 0.463 0.643 age 8.781e+01 1.116e+01 7.871 9.50e15 ***  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 4113 on 960 degrees of freedom Multiple Rsquared: 0.8371, Adjusted Rsquared: 0.8363 Fstatistic: 986.8 on 5 and 960 DF, pvalue: < 2.2e16We see that along with the estimates, most of our observations are significant at the 5% level and the RSquared is reasonably high at 0.8371.
However, we need to bear in mind that this regression is not accounting for the fact that significant outliers exist in our dataset.
Cook’s DistanceA method we can use to determine outliers in our dataset is Cook’s distance. As a rule of thumb, if Cook’s distance is greater than 1, or if the distance in absolute terms is significantly greater than others in the dataset, then this is a good indication that we are dealing with an outlier.
#Compute Cooks Distance dist < cooks.distance(ols) dist<data.frame(dist) ]s < stdres(ols) a < cbind(mydata, dist, s) #Sort in order of standardized residuals sabs < abs(s) a < cbind(mydata, dist, s, sabs) asorted < a[order(sabs), ] asorted[1:10, ] age gender webpages videohours income usage dist 966 35 0 80 45.0000000 6700 108954 2.058817511 227 32 1 27 0.0000000 7830 24433 0.010733761 93 37 1 30 0.0000000 10744 27213 0.018822466 419 25 1 20 0.4952778 1767 17843 0.010178820 568 36 1 37 1.6666667 8360 25973 0.007652459 707 42 1 40 4.2347222 8527 29626 0.006085165 283 39 0 46 0.0000000 6244 22488 0.007099847 581 40 0 24 0.0000000 9746 23903 0.010620785 513 29 1 23 1.1111111 11398 25182 0.015107423 915 30 0 42 0.0000000 9455 22821 0.010412391 s sabs 966 11.687771 11.687771 227 4.048576 4.048576 93 4.026393 4.026393 419 3.707761 3.707761 568 3.627891 3.627891 707 3.583459 3.583459 283 3.448079 3.448079 581 3.393124 3.393124 513 3.353123 3.353123 915 3.162762 3.162762We are adding Cook’s distance and standardized residuals to our dataset. Observe that we have the highest Cook’s distance and the highest standaridized residual for the observation with the greatest internet usage.
Huber and Bisquare WeightsAt this point, we can now adjust the weights assigned to each observation to adjust our regression results accordingly.
Let’s see how we can do this using Huber and Bisquare weights.
Huber Weights #Huber Weights summary(rr.huber < rlm(usage ~ income + videohours + webpages + gender + age, data = mydata)) Call: rlm(formula = usage ~ income + videohours + webpages + gender + age, data = mydata) Residuals: Min 1Q Median 3Q Max 11605.7 2207.7 177.2 2430.9 49960.3 Coefficients: Value Std. Error t value (Intercept) 3131.2512 423.0516 7.4016 income 0.9059 0.0403 22.4945 videohours 1075.0703 26.0219 41.3140 webpages 57.1909 2.8880 19.8028 gender 173.4154 229.0994 0.7569 age 88.6238 9.6455 9.1881 Residual standard error: 3449 on 960 degrees of freedom huber < data.frame(usage = mydata$usage, resid = rr.huber$resid, weight = rr.huber$w) huber2 < huber[order(rr.huber$w), ] huber2[1:10, ] usage resid weight 966 108954 49960.32 0.09284397 227 24433 16264.20 0.28518764 93 27213 15789.65 0.29375795 419 17843 15655.03 0.29628629 707 29626 14643.43 0.31675392 568 25973 14605.87 0.31756653 283 22488 13875.58 0.33427984 581 23903 13287.62 0.34907331 513 25182 13080.99 0.35458486 105 19606 12478.59 0.37170941 Bisquare weighting #Bisquare weighting rr.bisquare < rlm(usage ~ income + videohours + webpages + gender + age, data=mydata, psi = psi.bisquare) summary(rr.bisquare) Call: rlm(formula = usage ~ income + videohours + webpages + gender + age, data = mydata, psi = psi.bisquare) Residuals: Min 1Q Median 3Q Max 11473.6 2177.8 119.7 2491.9 50000.1 Coefficients: Value Std. Error t value (Intercept) 3204.9051 426.9458 7.5066 income 0.8985 0.0406 22.1074 videohours 1077.1598 26.2615 41.0167 webpages 56.3637 2.9146 19.3384 gender 156.7096 231.2082 0.6778 age 90.2113 9.7343 9.2673 Residual standard error: 3434 on 960 degrees of freedom bisqr < data.frame(usage = mydata$usage, resid = rr.bisquare$resid, weight = rr.bisquare$w) bisqr2 < bisqr[order(rr.bisquare$w), ] bisqr2[1:10, ] usage resid weight 227 24433 16350.56 0.0000000000 966 108954 50000.09 0.0000000000 93 27213 15892.10 0.0005829225 419 17843 15700.87 0.0022639112 707 29626 14720.97 0.0264753021 568 25973 14694.59 0.0274546293 283 22488 13971.53 0.0604133298 581 23903 13389.67 0.0944155930 513 25182 13192.86 0.1072333000 105 19606 12536.39 0.1543038994In both of the above instances, observe that a much lower weight of 0.092 is assigned to observation 966 using Huber weights, and a weight of 0 is assigned to the same observation using Bisquare weighting.
In this regard, we are allowing the respective regressions to adjust the weights in a way that yields lesser importance to outliers in our model.
ConclusionIn this tutorial, you have learned how to:
 Screen for outliers using Cook’s distance and QQ Plots
 Why standard linear regressions do not necessarily adjust for outliers
 How to use weighting techniques to adjust for such anomalies
If you have any questions on anything I have covered in this tutorial, please leave a comment and I will do my best to address your query. You can also find a videobased tutorial on this topic here.
Dataset
Related Post
 Multilevel Modelling in R: Analysing Vendor Data
 Logistic Regression with Python using Titanic data
 Failure Pressure Prediction Using Machine Learning
 Machine learning logistic regression for credit modelling in R
 Commercial data analytics: An economic view on the data science methods
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
handlr: convert among citation formats
(This article was first published on rOpenSci  open tools for open science, and kindly contributed to Rbloggers)
Citations are a crucial piece of scholarly work. They hold metadata on each scholarly work, including what people were involved, what year the work was published, where it was published, and more. The links between citations facilitate insight into many questions about scholarly work.
Citations come in many different formats including BibTex, RIS, JATS, and many more. This is not to be confused with citation styles such as APA vs. MLA and so on.
Those that deal with or do research on citations often get citations in one format (e.g., BibTex), but they would like them in a different format (e.g., RIS). One recent tool that does a very nice job of this is bolognese from Martin Fenner of Datacite. bolognese is written in Ruby. I love the Ruby language, but it does not play nicely with R; thus it wasn’t an option to wrap or call bolognese from R.
handlr is a new R package modeled after bolognese.
The original motivation for starting handlr comes from this thread in the rorcid package, in which the citations retrieved from source A had mangled characters with accents, but source B gave unmangled characters but in the wrong format. Thus the need for a citation format converter.
handlr converts citations from one format to another. It currently supports reading the following formats:
And the following writers are supported:
handlr has not yet focused on performance, but we will do so in future versions.
Links:
 handlr source code: https://github.com/ropensci/handlr
 handlr on CRAN: https://cran.rstudio.com/web/packages/handlr/
Install the lastest from CRAN
install.packages("handlr")Some binaries are not up yet on CRAN – you can also install from GitHub.
There’s no compiled code though, so source install should work.
I somehow forgot to export the print.handl() function in the CRAN version, so
if you try this with the CRAN version you won’t get the compact output seen below.
Load handlr
library(handlr) The R6 approachThere’s a single R6 interface to all readers and writers
grab an example file that comes with the package
z < system.file("extdata/citeproc.json", package = "handlr")initialize the object
x < HandlrClient$new(x = z) x #> #> doi: #> ext: json #> format (guessed): citeproc #> path: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/handlr/extdata/citeproc.json #> string (abbrev.): noneread the file
x$read(format = "citeproc") x #> #> doi: #> ext: json #> format (guessed): citeproc #> path: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/handlr/extdata/citeproc.json #> string (abbrev.): noneinspect the parsed content
x$parsed #> #> from: citeproc #> many: FALSE #> count: 1 #> first 10 #> id/doi: https://doi.org/10.5438/4k3mnyvgwrite out to bibtex. by default does not write to a file; you can
also specify a file path.
If you prefer not to use the above approach, you can use the various
functions that start with he format (e.g., bibtex) followed by
_reader or _writer.
Here, we play with the bibtex format.
Get a sample file and use bibtex_reader() to read it in.
z < system.file('extdata/bibtex.bib', package = "handlr") bibtex_reader(x = z) #> #> from: bibtex #> many: FALSE #> count: 1 #> first 10 #> id/doi: https://doi.org/10.1142%2fs1363919602000495What this returns is a handl object, just a list with attributes.
The handl object is what we use as the internal representation that we
convert citations to and from.
Each reader and writer supports handling many citations at once. For all
formats, this means many citations in the same file.
 There’s still definitely some improvements that need to be made to various parts of citations in some of the formats. Do open an issue/let me know if you find anything off.
 Performance could be improved for sure
 Problems with very large files, e.g., ropensci/handlr#9
 Documentation, there is very little thus far
Get in touch if you have any handlr questions in the
issue tracker or the
rOpenSci discussion forum.
To leave a comment for the author, please follow the link and comment on their blog: rOpenSci  open tools for open science. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
“If You Were an R Function, What Function Would You Be?”
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
We’ve been getting some good uptake on our piping in R article announcement.
The article is necessarily a bit technical. But one of its key points comes from the observation that piping into names is a special opportunity to give general objects the following personality quiz: “If you were an R function, what function would you be?”
 Everything that exists is an object.
 Everything that happens is a function call.
John Chambers speaking on S and R
So our question is: can we add a meaningful association between the two deepest concepts in R objects (or references to them) and functions?
We think the answer is a resounding “yes!”
The following example (adapted from the paper) should help illustrate the idea.
Suppose we had simple linear model.
set.seed(2019) data_use < base::sample(c("train", "test"), nrow(mtcars), replace = TRUE) mtcars_train < mtcars[data_use == "train", , drop = FALSE] mtcars_test < mtcars[data_use == "test", , drop = FALSE] model < lm(mpg ~ disp + wt, data = mtcars_train)Now if “model” were an R function, what function would it be? One possible answer is: it would be predict.lm(). It would be nice if “model(mtcars_test)” meant “predict(model, data = mtcars_test)“. Or, if we accept the pipe notation “mtcars_test %.>% model” as an approximate substitute for (note: not an equivalent of) “model(mtcars_test)” we can make that happen.
The “%.>%” is the wrapr dot arrow pipe. It can be made to ask the question “If you were an R function, what function would you be?” as follows.
First a bit of preparation, we tell R‘s S3 class system how to answer the question.
apply_right.lm < function(pipe_left_arg, pipe_right_arg, pipe_environment, left_arg_name, pipe_string, right_arg_name) { predict(pipe_right_arg, newdata = pipe_left_arg) }And now we can treat any reference to an object of class “lm” as a pipe destination or function.
mtcars_test %.>% modelAnd we see our results.
# Mazda RX4 Mazda RX4 Wag Hornet 4 Drive Duster 360 Merc 280 # 23.606199 22.518582 20.477232 18.347774 20.062914 # Merc 280C Merc 450SE Cadillac Fleetwood Lincoln Continental Fiat 128 # 20.062914 16.723133 10.506642 9.836894 25.888019 # Dodge Challenger AMC Javelin Porsche 9142 Lotus Europa Ford Pantera L # 18.814401 19.261396 25.892974 28.719255 20.108134 # Maserati Bora # 18.703696Notice we didn’t have to alter model or wrap it in a function. This solution can be used again and again in many different circumstances.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
uRos2019: tutorials, keynote speakers, registration and call for papers!
(This article was first published on R – Mark van der Loo, and kindly contributed to Rbloggers)
The 7th use of R in Official Statistics conference is the event for all things R in the production and use of government statistics. The 7th installment of this conference will take place from 20 to 21 May 2019 at the National Institute of Statistics in Bucharest, Romania.
 Follow on Twitter: @uRosconf
We are very proud to announce that we have two excellent keynote speakers.
 Julie Josse will talk about her work on theory and tools related to imputation and inference in the presence of missing data.
 Giulio Barcaroli will talk about 12 years of using R at ISTAT, the Italian Statistical Office.
Full abstracts can be found here
TutorialsThe conference is preceded by three tutorials on Data Cleaning, Statistical Disclosure Control and Optimal Sampling Stratification.
Call for papersYes, abstracts and papers are welcomed until 12 April 2019! You can contribute by sending an abstract in any of the following topics (relating to official statistics):
Sampling and estimation  R in organization  Data cleaning  R in production: data analysis  Methods for official statistics  Shiny applications  Time series  Report and GUI programming  R in production: automation  Big data  Dissemination and visualization
Registration is openYou can now register by following instructions here.
Markdown with by wpgfm var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RTrainings in Hamburg – Register now!
(This article was first published on eoda english R news – Der Datenanalyse Blog von eoda, and kindly contributed to Rbloggers)
With more than 1,500 satisfied participants, eodas Rtrainings are the leading courses for the programming language in the Germanspeaking region. In May, 2019, we bring our popular courses „Introduction to R“ and „Introduction to Machine Learning with R“ to Hamburg again.
What you can look forward to? Our program at a glance:
May 14th – 15thIntroduction to R
With practical tips and exercises, this introductory course serves as a basis for the further use of R in individual applications. The aim of the course is to teach the logic and terminology of R to participants without profound and previous knowledge in order to develop the fundament for independent work.
May 16th – 17thIntroduction in Machine Learning with R
In this introductory course you will gain an insight into machine learning algorithms. Besides developing own models, you will also learn what challenges you will face and how to master them in order to develop artificial intelligence applications based on your data.
In this introduction, we will teach you the skills you need for implementing machine learning processes independently in R by using practical examples and exercises.
Notice: Both courses are held in German!
Would you like to become an expert of R this year? Then register by 14 April 2019 at:Follow.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: eoda english R news – Der Datenanalyse Blog von eoda. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Customers who bought…
(This article was first published on RBloggers – Learning Machines, and kindly contributed to Rbloggers)
One of the classic examples in data science (called data mining at the time) is the beer and diapers example: when a big supermarket chain started analyzing their sales data they encountered not only trivial patterns, like toothbrushes and toothpaste being bought together, but also quite strange combinations like beer and diapers. Now, the trivial ones are reassuring that the method works but what about the more extravagant ones? Does it mean that young parents are alcoholics? Or that instead of breastfeeding they give their babies beer? Obviously, they had to get to the bottom of this.
As it turned out in many cases they following happened: stressed out mummy sends young daddy to the supermarket because they had run out of diapers. Young daddy seizes the opportunity to not only buy the much needed diapers but also to stock up on some beer! So what the supermarket did was to place the beer directly on the way from the aisle with the diapers – the result was a significant boost in beer sales (for all the young daddies who might have forgotten what they really wanted when buying diapers…).
So, to reproduce this example in a simplified way have a look at the following code:
# some example data for items bought together (market baskets) Diapers < c(1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0) Baby_Oil < c(1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0) Ham < c(rep(0, 6), rep(1, 2), rep(0, 7)) Beer < c(rep(0, 3), 1, rep(0, 11)) (basket < cbind(Diapers, Baby_Oil, Ham, Beer)) ## Diapers Baby_Oil Ham Beer ## [1,] 1 1 0 0 ## [2,] 0 1 0 0 ## [3,] 1 1 0 0 ## [4,] 1 0 0 1 ## [5,] 0 0 0 0 ## [6,] 1 1 0 0 ## [7,] 1 0 1 0 ## [8,] 0 0 1 0 ## [9,] 1 1 0 0 ## [10,] 1 1 0 0 ## [11,] 0 1 0 0 ## [12,] 0 1 0 0 ## [13,] 1 1 0 0 ## [14,] 0 0 0 0 ## [15,] 0 0 0 0 # analysis of items bought together round(cor_basket < cor(basket), 2) # cor is the core of the method! (no pun intended) ## Diapers Baby_Oil Ham Beer ## Diapers 1.00 0.33 0.03 0.25 ## Baby_Oil 0.33 1.00 0.48 0.33 ## Ham 0.03 0.48 1.00 0.10 ## Beer 0.25 0.33 0.10 1.00 diag(cor_basket) < 0 # we don't want to recommend the same products to the customers who already bought them round(cor_basket, 2) ## Diapers Baby_Oil Ham Beer ## Diapers 0.00 0.33 0.03 0.25 ## Baby_Oil 0.33 0.00 0.48 0.33 ## Ham 0.03 0.48 0.00 0.10 ## Beer 0.25 0.33 0.10 0.00 # printing items bought together for (i in 1:ncol(cor_basket)) { col < cor_basket[ , i, drop = FALSE] col < col[order(col, decreasing = TRUE), , drop = FALSE] cat("Customers who bought", colnames(col), "also bought", rownames(col)[col > 0], "\n") } ## Customers who bought Diapers also bought Baby_Oil Beer ## Customers who bought Baby_Oil also bought Diapers ## Customers who bought Ham also bought ## Customers who bought Beer also bought DiapersWhat we are looking for is some kind of dependance pattern within the shopping baskets, in this case we use the good old correlation function. Traditionally other (dependance) measures are used, namely support, confidence and lift. We will come to that later on in this post.
Wikipedia offers the following fitting description of association rule learning:
Association rule learning is a rulebased machine learning method for discovering interesting relations between variables in large databases. It is intended to identify rules discovered in databases using some measures of interestingness.
For example, the rule found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat.
Such information can be used as the basis for decisions about marketing activities such as, e.g. promotional pricing or product placements.
In addition to the above example from market basket analysis association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics.
So, this is also the basis of popular functions on ecommerce sites (“customers who bought this item also bought…”) or movie streaming platforms (“customers who watched this film also watched…”).
A very good package for realworld datasets is the arules package (on CRAN). Have a look at the following code:
library(arules) ## Loading required package: Matrix ## ## Attaching package: 'arules' ## The following objects are masked from 'package:base': ## ## abbreviate, write data("Groceries") rules < apriori(Groceries, parameter = list(supp = 0.001, conf = 0.5)) ## Apriori ## ## Parameter specification: ## confidence minval smax arem aval originalSupport maxtime support minlen ## 0.5 0.1 1 none FALSE TRUE 5 0.001 1 ## maxlen target ext ## 10 rules FALSE ## ## Algorithmic control: ## filter tree heap memopt load sort verbose ## 0.1 TRUE TRUE FALSE TRUE 2 TRUE ## ## Absolute minimum support count: 9 ## ## set item appearances ...[0 item(s)] done [0.00s]. ## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s]. ## sorting and recoding items ... [157 item(s)] done [0.00s]. ## creating transaction tree ... done [0.00s]. ## checking subsets of size 1 2 3 4 5 6 done [0.02s]. ## writing ... [5668 rule(s)] done [0.00s]. ## creating S4 object ... done [0.00s]. rules_conf < arules::sort(rules, by = "confidence", decreasing = TRUE) inspect(head(rules_conf, 10)) ## lhs rhs support confidence lift count ## [1] {rice, ## sugar} => {whole milk} 0.001220132 1 3.913649 12 ## [2] {canned fish, ## hygiene articles} => {whole milk} 0.001118454 1 3.913649 11 ## [3] {root vegetables, ## butter, ## rice} => {whole milk} 0.001016777 1 3.913649 10 ## [4] {root vegetables, ## whipped/sour cream, ## flour} => {whole milk} 0.001728521 1 3.913649 17 ## [5] {butter, ## soft cheese, ## domestic eggs} => {whole milk} 0.001016777 1 3.913649 10 ## [6] {citrus fruit, ## root vegetables, ## soft cheese} => {other vegetables} 0.001016777 1 5.168156 10 ## [7] {pip fruit, ## butter, ## hygiene articles} => {whole milk} 0.001016777 1 3.913649 10 ## [8] {root vegetables, ## whipped/sour cream, ## hygiene articles} => {whole milk} 0.001016777 1 3.913649 10 ## [9] {pip fruit, ## root vegetables, ## hygiene articles} => {whole milk} 0.001016777 1 3.913649 10 ## [10] {cream cheese , ## domestic eggs, ## sugar} => {whole milk} 0.001118454 1 3.913649 1The algorithm used here is the so called Apriori algorithm. It ameliorates the problem with realworld datasets that when you want to test all combinations of all possible items you very soon run into performance problems – even with very fast computers – because there are just too many possibilities to be tested.
The Apriori algorithm aggressively prunes the possibilities to be tested by making use of the fact that if you are only interested in rules that are supported by a certain number of instances you can start with testing the support of individual items – which is easy to do – and work your way up to more complicated rules.
The trick is that you don’t test more complicated rules with items which don’t have enough support on the individual level: this is because if you don’t have enough instances on the individual level you don’t even have to look at more complicated combinations with those items included (which would be even more scarce). What sounds like an obvious point brings about huge time savings for realworld datasets which couldn’t be analyzed without this trick.
As mentioned above important concepts to assess the quality (or interestingness) of association rules are support, confidence and lift:
 Support of : percentage of X for all cases
 Confidence of : percentage of Y for all X
 Lift of : ratio of the observed support of X and Y to what would be expected if X and Y were independent
To understand these concepts better we are going to rebuild the examples given in the Wikipedia article in R. Have a look at the parts “Definition” and “Useful Concepts” of the article and after that at the following code, which should be selfexplanatory:
M < matrix(c(1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0), ncol = 5, byrow = TRUE) colnames(M) < c("milk", "bread", "butter", "beer", "diapers") M ## milk bread butter beer diapers ## [1,] 1 1 0 0 0 ## [2,] 0 0 1 0 0 ## [3,] 0 0 0 1 1 ## [4,] 1 1 1 0 0 ## [5,] 0 1 0 0 0 supp < function(X) { sum(rowSums(M[ , X, drop = FALSE]) == length(X)) / nrow(M) # "rowSums == length" mimics logical AND for the selected columns } conf < function(X, Y) { supp(c(X, Y)) / supp(X) # conf(X => Y) } lift < function(X, Y) { supp(c(X, Y)) / (supp(X) * supp(Y)) # lift(X => Y) } supp(c("beer", "diapers")) # percentage of X for all cases ## [1] 0.2 conf(c("butter", "bread"), "milk") # percentage of Y for all X ## [1] 1 lift(c("milk", "bread"), "butter") # ratio of the observed support of X and Y to what would be expected if X and Y were independent ## [1] 1.25You should conduct your own experiments by playing around with different item combinations so that you really understand the mechanics of those important concepts.
If all of those analyses are being done for perfecting your customer experience or just outright manipulation to lure you into buying stuff you don’t really need is obviously a matter of perspective…
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: RBloggers – Learning Machines. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Using Rstudio Jobs for training many models in parallel
(This article was first published on That’s so Random, and kindly contributed to Rbloggers)
Recently, Rstudio added the Jobs feature, which allows you to run R scripts in the background. Computations are done in a separate R session that is not interactive, but just runs the script. In the meantime your regular R session stays live so you can do other work while waiting for the Job to complete. Instead refreshing your Twitter for the 156th time, you can stay productive! (I am actually writing this blog in Rmarkdown while I am waiting for my results to come in.) The number of jobs you can spinup is not limited to one. As each new job is started at a different core, you can start as many jobs as your system got cores (although leaving one idle is a good idea for other processes, like your interactive R session).
Recently, I needed to train many Bayesian models on subsets of a large dataset. The subsets varied greatly in size, with most of the models needing a few minutes to train, but the ones trained on the larger subsets took up to half an hour. With just a single Job the whole thing would have lasted over fifteen hours. Luckily, I have a server at my disposal with many cores and a lot of RAM. I chose to use Jobs for running the model training in parallel for two reasons. First, as mentioned, it allows me to do other R work while waiting for the results to come in. Second, I train the models with rstan which allows for using multiple cores for each model, each chain gets its own core. So, we have parallelization within the parallel Jobs. I could only get this to work efficiently with Jobs. Packages for parallelization did not seem to be able to handle this parallel within parallel. (Disclaimer, I am no expert on parallelization, a more knowledgeable person might figure it out. I stopped digging once I got what I wanted with Jobs).
Here, I share the steps I took to get the whole thing running.
Create the Jobs scriptFirst you should capture everything that is within every Job in an R script. You could choose to import your current global environment when starting the job, be careful with this when starting multiple Jobs. If you have large objects in your environment it will be loaded to every Job you start and you might run out of memory. Importing the global environment also poses a threat for the reproducibility of the outcome, because the Job can only run when the required objects are loaded in the global. Rather, I suggest you make a script that is completely selfcontained. Everything that is needed within the Job, dependencies, loaded data, your own functions, should be in the script (or sourced by the script). To run the same Job multiple times on different data, include a parameter that differs over the Jobs within the script. More about that later.
Divide your workWithin each Job, the computations will run sequentially. If you want the parallel Jobs to finish up in approximately the same time, you should divide the work they have to do in approximately equal chunks. The number of Jobs you can runs simultaneously is limited by the number of cores you have available and the RAM on your system. The first is easy to detect, just run
parallel::detectCores()If you have 8 cores on your machine, you can spin up to 7 Jobs parallel if you have enough memory available. Figuring out how much memory your Job consumes is a bit trialanderror. If you run a task for the first time it might be a good to run a test Job and check on its memory use. If you are on a unix system you can see how much memory the Job consumes by using the top function on the command line. This will show all running processes and the percentage of the available RAM they use. If the test Job consumes about 15% of your available memory, you could spin up about 4 to 5 parallel Jobs, assuming that the amount of memory used is approximately equal for each part of the data. Always allow for some room for fluctuation. I don’t let the Jobs consume more than 75% of the available memory. If your Jobs are very volatile in their memory use you might even want to be more conservative, since it is one strike and your out. (I know absolutely nothing about Windows, if you are on Windows and not sure how to monitor the memory use you have to do some Googling I am afraid).
After you have decided how many jobs you can run in parallel it is time to split up the work. I needed to train a great number of models on datasets varying in size. Splitting up the subsets in equal number of models to train is a suboptimal strategy, because if several larger sets end up in the same partition you have to wait long on one of the Jobs while the others are long finished. I used the number of rows of each subset as an indication for the time it would take for the model trained on it to complete. With the following little helper function you can create approximately equal chunks of the data. x is a vector with id’s for every group on which a model is trained, w is a vector with weights (the number of rows), g is the number of desired groups, this is equal to the number of Jobs you’ll start.
assign_to_chunk < function(x, w, g) { stopifnot(length(x) == length(w), is.numeric(w), length(g) == 1, g %% 1 == 0) splits < (sum(w) / g) * (1:(g  1)) total_sizes < cumsum(w) end_points < sapply(splits, function(x) which.min(abs(total_sizes  x))) start_points < c(1, end_points + 1) end_points < c(end_points, length(x)) chunk_assignment < rep(1:g, end_points  start_points + 1) data.frame(x, chunk_assignment) }If you have a large dataset, like I did, it is a good idea to partition data physically into separate files. Otherwise each Job has to hold the complete data set in its memory, wasting RAM on data records it doesn’t need.
Getting the whole thing to runNow all that is left is setting your system on fire by using its full capacity. There are two ways you can get your Job to run on different data sets. The first is to give your Job script a parameter variable, of which you change the value before starting the next Job. Say, you have split your data into five parts, called filename_part1.csv up until filename_part5.csv. You could start you Job script then with
part < 1 read.csv(paste0("location_to_file/filename_part", part, ".csv"))Next, you start the Job manually in the Jobs pane in Rstudio. For the next part you change part to the value 2 and you start another Job. Until you have five Jobs started. This is a little tedious and error prone. A more elegant solution was proposed by Rstudio’s Jonathan McPherson, in which the rstudioapi::jobRunScript function is used to kickoff the Job. In a second script you update the part variable after starting each Job and you start the Job by using the function instead of the pane.
part < 1 rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE) part < 2 rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE) # ... part < 5 rstudioapi::jobRunScript(path = "path_to_Jobs_script", importEnv = TRUE)You now only have to run this second script to have all the Jobs started at once. This even more convenient if you have several variables value that vary over the Jobs. A downside of this approach is that you have to import the global environment in every one of them. As mentioned, you must then make sure the Job is reproducible and the total memory of your system is not exceeded.
A major thanks to the Rstudio team for providing this awesome functionality. It is a great productivity boost to me!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: That’s so Random. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Making thematic maps for Belgium
(This article was first published on bnosac :: open analytical helpers  bnosac :: open analytical helpers, and kindly contributed to Rbloggers)
For people from Belgium working in R with spatial data, you can find excellent workshop material on creating thematic maps for Belgium at https://workshop.mhermans.net/thematicmapsr/index.html by Maarten Hermans (researcher at the HIVA – Onderzoeksinstituut voor Arbeid en Samenleving – https://mhermans.net). The plots are heavily based on BelgiumMaps.Statbel – an R package from bnosac released 2 years ago (more info at http://www.bnosac.be/index.php/blog/55belgiummapsstatbelrpackagewithadministrativeboundariesofbelgium)var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: bnosac :: open analytical helpers  bnosac :: open analytical helpers. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppStreams 0.1.3: Keeping CRAN happy
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
Not unlike the Rblpapi release on Thursday and the RVowpalWabbit release on Friday (both of which dealt with the upcoming staged install), we now have another CRANrequested maintenance release. This time it is RcppStreams which got onto CRAN as of early this morning. RcppStreams brings the excellent Streamulus C++ template library for event stream processing to R.
Streamulus, written by Irit Katriel, uses very clever template metaprogramming (via Boost Fusion) to implement an embedded domainspecific event language created specifically for event stream processing.
This release provides suppresses issue reported by the UBSAN detector set up at Oxford; I simply no longer run the examples that triggered it as the errors came from very deep down inside Boost Proto and Fusion. As a more positive side effect, I updated the Rocker RDevel SAN/UBSAN Clang image and corresponding Docker container. So if you need SAN/UBSAN detection, that container may become your friend.
The NEWS file entries follows below:
Changes in version 0.1.3 (20190224) No longer run examples as they upset the UBSAN checks at CRAN
Courtesy of CRANberries, there is also a copy of the DESCRIPTION file for this initial release. More detailed information is on the RcppStreams page page and of course on the Streamulus page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in thirdparty forprofit settings.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Gartner’s 2019 Take on Data Science Software
(This article was first published on R – r4stats.com, and kindly contributed to Rbloggers)
I’ve just updated The Popularity of Data Science Software to reflect my take on Gartner’s 2019 report, Magic Quadrant for Data Science and Machine Learning Platforms. To save you the trouble of digging through all 40+ pages of my report, here’s just the updated section:
IT Research FirmsIT research firms study software products and corporate strategies. They survey customers regarding their satisfaction with the products and services and provide their analysis in reports that they sell to their clients. Each research firm has its own criteria for rating companies, so they don’t always agree. However, I find the detailed analysis that these reports contain extremely interesting reading. The reports exclude open source software that has no specific company backing, such as R, Python, or jamovi. Even open source projects that do have company backing, such as BlueSky Statistics, are excluded if they have yet to achieve sufficient market adoption. However, they do cover how company products integrate open source software into their proprietary ones.
While these reports are expensive, the companies that receive good ratings usually purchase copies to give away to potential customers. An Internet search of the report title will often reveal companies that are distributing them. On the date of this post, Datarobot is offering free copies.
Gartner, Inc. is one of the research firms that write such reports. Out of the roughly 100 companies selling data science software, Gartner selected 17 which offered “cohesive software.” That software performs a wide range of tasks including data importation, preparation, exploration, visualization, modeling, and deployment.
Gartner analysts rated the companies on their “completeness of vision” and their “ability to execute” that vision. Figure 3a shows the resulting “Magic Quadrant” plot for 2019, and 3b shows the plot for the previous year. Here I provide some commentary on their choices, briefly summarize their take, and compare this year’s report to last year’s. The main reports from both years contain far more detail than I cover here.
Figure 3a. Gartner Magic Quadrant for Data Science and Machine Learning Platforms from their 2019 report (plot done in November 2018, report released in 2019).
The Leaders quadrant is the place for companies whose vision is aligned with their customer’s needs and who have the resources to execute that vision. The further toward the upperright corner of the plot, the better the combined score.
 RapidMiner and KNIME reside in the best part of the Leaders quadrant this year and last. This year RapidMiner has the edge in ability to execute, while KNIME offers more vision. Both offer free and open source versions, but the companies differ quite a lot on how committed they are to the open source concept. KNIME’s desktop version is free and open source and the company says it will always be so. On the other hand, RapidMiner is limited by a cap on the amount of data that it can analyze (10,000 cases) and as they add new features, they usually come only via a commercial license with “difficulttonavigate pricing conditions.” These two offer very similar workflowstyle user interfaces and have the ability to integrate many open sources tools into their workflows, including R, Python, Spark, and H2O.
 Tibco moved from the Challengers quadrant last year to the Leaders this year. This is due to a number of factors, including the successful integration of all the tools they’ve purchased over the years, including Jaspersoft, Spotfire, Alpine Data, Streambase Systems, and Statistica.
 SAS declined from being solidly in the Leaders quadrant last year to barely being in it this year. This is due to a substantial decline in its ability to execute. Given SAS Institute’s billions in revenue, that certainly can’t be a financial limitation. It may be due to SAS’ more limited ability to integrate as wide a range of tools as other vendors have. The SAS language itself continues to be an important research tool among those doing complex mixedeffects linear models. Those models are among the very few that R often fails to solve.
The companies in the Visionaries Quadrant are those that have good future plans but which may not have the resources to execute that vision.
 Mathworks moved forward substantially in this quadrant due to MATLAB’s ability to handle unconventional data sources such as images, video, and the Internet of Things (IoT). It has also opened up more to open source deep learning projects.
 H2O.ai is also in the Visionaries quadrant. This is the company behind the open source H2O software, which is callable from many other packages or languages including R, Python, KNIME, and RapidMiner. While its own menubased interface is primitive, its integration into KNIME and RapidMiner makes it easy to use for noncoders. H2O’s strength is in modeling but it is lacking in data access and preparation, as well as model management.
 IBM dropped from the top of the Visionaries quadrant last year to the middle. The company has yet to fully integrate SPSS Statistics and SPSS Modeler into its Watson Studio. IBM has also had trouble getting Watson to deliver on its promises.
 Databricks improved both its vision and its ability to execute, but not enough to move out of the Visionaries quadrant. It has done well with its integration of opensource tools into its Apache Sparkbased system. However, it scored poorly in the predictability of costs.
 Datarobot is new to the Gartner report this year. As its name indicates, its strength is in the automation of machine learning, which broadens its potential user base. The company’s policy of assigning a data scientist to each new client gets them up and running quickly.
 Google’s position could be clarified by adding more dimensions to the plot. Its complex collection of a dozen products that work together is clearly aimed at software developers rather than data scientists or casual users. Simply figuring out what they all do and how they work together is a nontrivial task. In addition, the complete set runs only on Google’s cloud platform. Performance on big data is its forte, especially problems involving image or speech analysis/translation.
 Microsoft offers several products, but only its cloudonly Azure Machine Learning (AML) was comprehensive enough to meet Gartner’s inclusion criteria. Gartner gives it high marks for easeofuse, scalability, and strong partnerships. However, it is weak in automated modeling and AML’s relation to various other Microsoft components is overwhelming (same problem as Google’s toolset).
Figure 3b. Last year’s Gartner Magic Quadrant for Data Science and Machine Learning Platforms (January, 2018)
Those in the Challenger’s Quadrant have ample resources but less customer confidence in their future plans, or vision.
 Alteryx dropped slightly in vision from last year, just enough to drop it out of the Leaders quadrant. Its workflowbased user interface is very similar to that of KNIME and RapidMiner, and it too gets top marks in easeofuse. It also offers very strong data management capabilities, especially those that involve geographic data, spatial modeling, and mapping. It comes with geocoded datasets, saving its customers from having to buy it elsewhere and figuring out how to import it. However, it has fallen behind in cutting edge modeling methods such as deep learning, automodeling, and the Internet of Things.
 Dataiku strengthed its ability to execute significantly from last year. It added better scalability to its easeofuse and teamwork collaboration. However, it is also perceived as expensive with a “cumbersome pricing structure.”
Members of the Niche Players quadrant offer tools that are not as broadly applicable. These include Anaconda, Datawatch (includes the former Angoss), Domino, and SAP.
 Anaconda provides a useful distribution of Python and various data science libraries. They provide support and model management tools. The vast army of Python developers is its strength, but lack of stability in such a rapidly improving world can be frustrating to productionoriented organizations. This is a tool exclusively for experts in both programming and data science.
 Datawatch offers the tools it acquired recently by purchasing Angoss, and its set of “Knowledge” tools continues to get high marks on easeofuse and customer support. However, it’s weak in advanced methods and has yet to integrate the data management tools that Datawatch had before buying Angoss.
 Domino Data Labs offers tools aimed only at expert programmers and data scientists. It gets high marks for openness and ability to integrate open source and proprietary tools, but low marks for data access and prep, integrating models into daytoday operations, and customer support.
 SAP’s machine learning tools integrate into its main SAP Enterprise Resource Planning system, but its fragmented toolset is weak, and its customer satisfaction ratings are low.
To see many other ways to rate this type of software, see my ongoing article, The Popularity of Data Science Software. You may also be interested in my indepth reviews of pointandclick user interfaces to R. I invite you to subscribe to my blog or follow me on twitter where I announce new posts. Happy computing!
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Bolsonaro’s First Job Approval Ratings
(This article was first published on Daniel Marcelino's Blog, and kindly contributed to Rbloggers)
President Jair Bolsonaro’s job approval ratings average 39.5% during his first quarter in office so far (from January through late February). Compared to the former presidents, for which I’ve estimates, his quarterly job approval ratings are above the overall average for the inauguration term (31%). However, his ratings trail quarterly averages of the Workers’ Party presidents, Lula I and II, and Dilma I.
Quarter President % Approving    1987.2 José Sarney 9.0 1990.2 Fernando Collor 36.0 1992.3 Itamar Franco 18.0 1995.1 Fernando Henrique I 39.0 1999.1 Fernando Henrique II 21.0 2003.1 Lula da Silva I 42.5 2007.1 Lula da Silva II 48.0 2011.1 Dilma Rousseff I 51.5 2015.1 Dilma Rousseff II 14.7 2016.2 Michel Temer 19.4 2019.1 Jair Bolsonaro 39.5 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: Daniel Marcelino's Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Logistic regression in R using blorr package
(This article was first published on Rsquared Academy Blog, and kindly contributed to Rbloggers)
We are pleased to introduce the blorr package, a set of tools for building and
validating binary logistic regression models in R, designed keeping in mind
beginner/intermediate R users. The package includes:
 comprehensive regression output
 variable selection procedures
 bivariate analysis, model fit statistics and model validation tools
 various plots and underlying data
If you know how to build models using glm(), you will find blorr very
useful. Most of the functions use an object of class glm as input. So you
just need to build a model using glm() and then pass it onto the functions in
blorr. Once you have picked up enough knowledge of R, you can move on to
more intuitive approach offered by tidymodels etc. as they offer more
flexibility, which blorr does not.
blorr includes a shiny app which can be launched using
blr_launch_app()or try the live version here.
Read on to learn more about the features of blorr, or see the
blorr website for
detailed documentation on using the package.
To demonstrate the features of blorr, we will use the bank marketing data set.
The data is related with direct marketing campaigns of a Portuguese banking
institution. The marketing campaigns were based on phone calls. Often, more
than one contact to the same client was required, in order to access if the
product (bank term deposit) would be (‘yes’) or not (‘no’) subscribed. It
contains a random sample (~4k) of the original data set which can be found
at https://archive.ics.uci.edu/ml/datasets/bank+marketing.
Let us begin with careful bivariate analysis of each possible variable and the
outcome variable. We will use information value and likelihood ratio chi square
test for selecting the initial set of predictors for our model. The bivariate
analysis is currently avialable for categorical predictors only.
Weight of evidence (WoE) is used to assess the relative risk of di¤erent
attributes for a characteristic and as a means to transform characteristics
into variables. It is also a very useful tool for binning. The WoE for any
group with average odds is zero. A negative WoE indicates that the proportion
of defaults is higher for that attribute than the overall proportion and
indicates higher risk.
The information value is used to rank order variables in terms of their
predictive power. A high information value indicates a high ability to
discriminate. Values for the information value will always be positive and may
be above 3 when assessing highly predictive characteristics. Characteristics
with information values less than 0:10 are typically viewed as weak, while
values over 0.30 are sought after.
We can generate the weight of evidence and information value for multiple
variables using blr_woe_iv_stats().
blr_woe_iv() and blr_woe_iv_stats() are currently avialable for categorical
predictors only.
For the initial/ first cut model, all the independent variables are put into
the model. Our goal is to include a limited number of independent variables
(515) which are all significant, without sacrificing too much on the model
performance. The rationale behind notincluding too many variables is that the
model would be over fitted and would become unstable when tested on the
validation sample. The variable reduction is done using forward or backward
or stepwise variable selection procedures. We will use blr_step_aic_both()
to shortlist predictors for our model.
We can use bivariate analysis and stepwise selection procedure to shortlist
predictors and build the model using the glm(). The predictors used in the
below model are for illustration purposes and not necessarily shortlisted
from the bivariate analysis and variable selection procedures.
Use blr_regress() to generate comprehensive regression output. It accepts
either of the following
 model built using glm()
 model formula and data
Let us look at the output generated from blr_regress():
blr_regress(model) ##  Creating model overview. ##  Creating response profile. ##  Extracting maximum likelihood estimates. ##  Estimating concordant and discordant pairs. ## Model Overview ##  ## Data Set Resp Var Obs. Df. Model Df. Residual Convergence ##  ## data y 4521 4520 4498 TRUE ##  ## ## Response Summary ##  ## Outcome Frequency Outcome Frequency ##  ## 0 4004 1 517 ##  ## ## Maximum Likelihood Estimates ##  ## Parameter DF Estimate Std. Error z value Pr(>z) ##  ## (Intercept) 1 5.1347 0.3728 13.7729 0.0000 ## age 1 0.0096 0.0067 1.4299 0.1528 ## duration 1 0.0042 2e04 20.7853 0.0000 ## previous 1 0.0357 0.0392 0.9089 0.3634 ## housingno 1 0.7894 0.1232 6.4098 0.0000 ## defaultyes 1 0.8691 0.6919 1.2562 0.2091 ## loanno 1 0.6598 0.1945 3.3925 7e04 ## poutcomefailure 1 0.6085 0.2012 3.0248 0.0025 ## poutcomeother 1 1.1354 0.2700 4.2057 0.0000 ## poutcomesuccess 1 3.2481 0.2462 13.1913 0.0000 ## jobtechnician 1 0.2713 0.1806 1.5019 0.1331 ## jobentrepreneur 1 0.7041 0.3809 1.8486 0.0645 ## jobbluecollar 1 0.6132 0.1867 3.2851 0.0010 ## jobunknown 1 0.9932 0.8226 1.2073 0.2273 ## jobretired 1 0.3197 0.2729 1.1713 0.2415 ## jobadmin. 1 0.1120 0.2001 0.5599 0.5755 ## jobservices 1 0.1750 0.2265 0.7728 0.4397 ## jobselfemployed 1 0.1408 0.3009 0.4680 0.6398 ## jobunemployed 1 0.6581 0.3432 1.9174 0.0552 ## jobhousemaid 1 0.7456 0.3932 1.8963 0.0579 ## jobstudent 1 0.1927 0.3433 0.5613 0.5746 ## maritalsingle 1 0.5451 0.1387 3.9299 1e04 ## maritaldivorced 1 0.1989 0.1986 1.0012 0.3167 ##  ## ## Association of Predicted Probabilities and Observed Responses ##  ## % Concordant 0.8886 Somers' D 0.7773 ## % Discordant 0.1114 Gamma 0.7773 ## % Tied 0.0000 Taua 0.1575 ## Pairs 2070068 c 0.8886 ## If you want to examine the odds ratio estimates, set odd_conf_limit to TRUE.
The odds ratio estimates are not explicitly computed as we observed considerable
increase in computation time when dealing with large data sets.
Let us use the model formula and the data set to generate the above results.
blr_regress(y ~ age + duration + previous + housing + default + loan + poutcome + job + marital, data = bank_marketing) ##  Creating model overview. ##  Creating response profile. ##  Extracting maximum likelihood estimates. ##  Estimating concordant and discordant pairs. ## Model Overview ##  ## Data Set Resp Var Obs. Df. Model Df. Residual Convergence ##  ## data y 4521 4520 4498 TRUE ##  ## ## Response Summary ##  ## Outcome Frequency Outcome Frequency ##  ## 0 4004 1 517 ##  ## ## Maximum Likelihood Estimates ##  ## Parameter DF Estimate Std. Error z value Pr(>z) ##  ## (Intercept) 1 5.1347 0.3728 13.7729 0.0000 ## age 1 0.0096 0.0067 1.4299 0.1528 ## duration 1 0.0042 2e04 20.7853 0.0000 ## previous 1 0.0357 0.0392 0.9089 0.3634 ## housingno 1 0.7894 0.1232 6.4098 0.0000 ## defaultyes 1 0.8691 0.6919 1.2562 0.2091 ## loanno 1 0.6598 0.1945 3.3925 7e04 ## poutcomefailure 1 0.6085 0.2012 3.0248 0.0025 ## poutcomeother 1 1.1354 0.2700 4.2057 0.0000 ## poutcomesuccess 1 3.2481 0.2462 13.1913 0.0000 ## jobtechnician 1 0.2713 0.1806 1.5019 0.1331 ## jobentrepreneur 1 0.7041 0.3809 1.8486 0.0645 ## jobbluecollar 1 0.6132 0.1867 3.2851 0.0010 ## jobunknown 1 0.9932 0.8226 1.2073 0.2273 ## jobretired 1 0.3197 0.2729 1.1713 0.2415 ## jobadmin. 1 0.1120 0.2001 0.5599 0.5755 ## jobservices 1 0.1750 0.2265 0.7728 0.4397 ## jobselfemployed 1 0.1408 0.3009 0.4680 0.6398 ## jobunemployed 1 0.6581 0.3432 1.9174 0.0552 ## jobhousemaid 1 0.7456 0.3932 1.8963 0.0579 ## jobstudent 1 0.1927 0.3433 0.5613 0.5746 ## maritalsingle 1 0.5451 0.1387 3.9299 1e04 ## maritaldivorced 1 0.1989 0.1986 1.0012 0.3167 ##  ## ## Association of Predicted Probabilities and Observed Responses ##  ## % Concordant 0.8886 Somers' D 0.7773 ## % Discordant 0.1114 Gamma 0.7773 ## % Tied 0.0000 Taua 0.1575 ## Pairs 2070068 c 0.8886 ##  Model Fit StatisticsModel fit statistics are available to assess how well the model fits the data
and to compare two different models.The output includes likelihood ratio test,
AIC, BIC and a host of pseudo rsquared measures. You can read more about
pseudo rsquared at https://stats.idre.ucla.edu/other/multpkg/faq/general/faqwhatarepseudorsquareds/.
Hosmer and Lemeshow developed a goodnessoffit test for logistic regression
models with binary responses. The test involves dividing the data into
approximately ten groups of roughly equal size based on the percentiles of the
estimated probabilities. The observations are sorted in increasing order of
their estimated probability of having an even outcome. The discrepancies
between the observed and expected number of observations in these groups are
summarized by the Pearson chisquare statistic, which is then compared to
chisquare distribution with t degrees of freedom, where t is the number of
groups minus 2. Lower values of Goodnessoffit are preferred.
A lift curve is a graphical representation of the % of cumulative events
captured at a specific cutoff. The cutoff can be a particular decile or a
percentile. Similar, to rank ordering procedure, the data is in descending
order of the scores and is then grouped into deciles/percentiles. The
cumulative number of observations and events are then computed for each
decile/percentile. The lift curve is the created using the cumulative %
population as the xaxis and the cumulative percentage of events as the yaxis.
ROC curve is a graphical representation of the validity of cutoffs for a
logistic regression model. The ROC curve is plotted using the sensitivity and
specificity for all possible cutoffs, i.e., all the probability scores. The
graph is plotted using sensitivity on the yaxis and 1specificity on the
xaxis. Any point on the ROC curve represents a sensitivity X (1specificity)
measure corresponding to a cutoff. The area under the ROC curve is used as a
validation measure for the model – the bigger the area the better is the model.
The KS Statistic is again a measure of model efficiency, and it is created
using the lift curve. The lift curve is created to plot % events. If we also
plot % nonevents on the same scale, with % population at xaxis, we would get
another curve. The maximum distance between the lift curve for events and that
for nonevents is termed as KS. For a good model, KS should be big (>=0.3) and
should occur as close to the event rate as possible.
The decile lift chart displays the lift over the global mean event rate for
each decile. For a model with good discriminatory power, the top deciles should
have an event/conversion rate greater than the global mean.
If the model has good discriminatory power, the top deciles should have a higher
event/conversion rate compared to the bottom deciles.
The Lorenz curve is a simple graphic device which illustrates the degree of
inequality in the distribution of thevariable concerned. It is a visual
representation of inequality used to measure the discriminatory power of the
predictive model.
blorr can generate 22 plots for residual, influence and leverage diagnostics.
Influence Diagnostics blr_plot_diag_influence(model) Leverage Diagnostics blr_plot_diag_leverage(model) Fitted Values Diagnostics blr_plot_diag_fit(model) Learning MoreThe blorr website includes
comprehensive documentation on using the package, including the following
article
that covers various aspects of using blorr.
All feedback is welcome. Issues (bugs and feature
requests) can be posted to github tracker.
To leave a comment for the author, please follow the link and comment on their blog: Rsquared Academy Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
A example in causal inference designed to frustrate: an estimate pretty much guaranteed to be biased
(This article was first published on ouR data generation, and kindly contributed to Rbloggers)
I am putting together a brief lecture introducing causal inference for graduate students studying biostatistics. As part of this lecture, I thought it would be helpful to spend a little time describing directed acyclic graphs (DAGs), since they are an extremely helpful tool for communicating assumptions about the causal relationships underlying a researcher’s data.
The strength of DAGs is that they help us think how these underlying relationships in the data might lead to biases in causal effect estimation, and suggest ways to estimate causal effects that eliminate these biases. (For a real introduction to DAGs, you could take a look at this paper by Greenland, Pearl, and Robins or better yet take a look at Part I of this book on causal inference by Hernán and Robins.)
As part of this lecture, I plan on including a (frustrating) example that illustrates a scenario where it may in fact be impossible to get an unbiased estimate of the causal effect of interest based on the data that has been collected. I thought I would share this little example here.
The scenarioIn the graph below we are interested in the causal effect of \(A\) on an outcome \(Y\). We have also measured a covariate \(L\), thinking it might be related to some unmeasured confounder (in this case \(U_2\)). Furthermore, there is another unmeasured variable \(U_1\) unrelated to \(A\), but related to the measure \(L\) and outcome \(Y\). These relationships are captured in this DAG:
It may help to be a bit more concrete about what these variables might represent. Say we are conducting an epidemiological study focused on whether or not exercise between the age of 50 and 60 has an effect on hypertension after 60. (So, \(A\) is exercise and \(Y\) is a measure of hypertension.) We are concerned that there might be confounding by some latent (unmeasured) factor related to an individual’s conscientiousness about their health; those who are more conscientious may exercise more, but they will also do other things to improve their health. In this case, we are able to measure whether or not the individual has a healthy diet (\(L\)), and we hope that will address the issue of confounding. (Note we are making the assumption that conscientiousness is related to hypertension only through exercise or diet, probably not very realistic.)
But, it also turns out that an individual’s diet is also partly determined by where the individual lives; that is, characteristics of the area may play a role. Unfortunately, the location of the individual (or characteristics of the location) was not measured (\(U_1\)). These same characteristics also affect locationspecific hypertension levels.
Inspecting the original DAG, we see that \(U_2\) is indeed confounding the relationship between \(A\) and \(Y\). There is a backdoor path \(A \rightarrow U_2 \rightarrow L \rightarrow Y\) that needs to be blocked. We cannot just ignore this path. If we generate data and estimate the effect of \(A\) on \(Y\), we will see that the estimate is quite biased. First, we generate data based on the DAG, assuming \(L\), and \(A\) are binary, and \(Y\) is continuous (though this is by no means necessary):
d < defData(varname = "U1", formula = 0.5, dist = "binary") d < defData(d, varname = "U2", formula = 0.4, dist = "binary") d < defData(d, varname = "L", formula = "1.6 + 1 * U1 + 1 * U2", dist = "binary", link = "logit") d < defData(d, varname = "A", formula = "1.5 + 1.2 * U2", dist = "binary", link="logit") d < defData(d, varname = "Y", formula = "0 + 1 * U1 + 1 * L + 0.5 * A", variance = .5, dist = "normal") set.seed(20190226) dd < genData(2500, d) dd ## id U1 U2 L A Y ## 1: 1 0 1 1 1 1.13 ## 2: 2 0 0 1 0 1.31 ## 3: 3 1 0 0 0 1.20 ## 4: 4 0 1 1 0 1.04 ## 5: 5 0 0 0 0 0.67 ##  ## 2496: 2496 0 0 0 0 0.29 ## 2497: 2497 0 0 0 0 0.24 ## 2498: 2498 1 0 1 0 1.32 ## 2499: 2499 1 1 1 1 3.44 ## 2500: 2500 0 0 0 0 0.78And here is the unadjusted model. The effect of \(A\) is overestimated (the true effect is 0.5):
broom::tidy(lm(Y ~ A, data = dd)) ## # A tibble: 2 x 5 ## term estimate std.error statistic p.value ## ## 1 (Intercept) 0.826 0.0243 34.0 2.54e208 ## 2 A 0.570 0.0473 12.0 1.53e 32 Adjusting for a potential confounder that is also a colliderWhile we are not able to measure \(U_2\), we have observed \(L\). We might think we are OK. But, alas, we are not. If we control for diet (\(L\)), we are controlling a “collider”, which will open up an association between \(U_1\) and \(U_2\). (I wrote about this before here.)
The idea is that if I have a healthy diet but I am not particularly conscientious about my health, I probably live in an area encourages or provides access to better food. Therefore, conditioning on diet induces a (negative, in this case) correlation between location type and health conscientiousness. So, by controlling \(L\) we’ve created a backdoor path \(A \rightarrow U_2 \rightarrow U_1 \rightarrow Y\). Confounding remains, though it may be reduced considerably if the induced link between \(U_2\) and \(U_1\) is relatively weak.
broom::tidy(lm(Y ~ L+ A, data = dd)) ## # A tibble: 3 x 5 ## term estimate std.error statistic p.value ## ## 1 (Intercept) 0.402 0.0231 17.4 6.58e 64 ## 2 L 1.26 0.0356 35.6 2.10e224 ## 3 A 0.464 0.0386 12.0 2.46e 32 More systematic exploration of bias and variance of estimatesIf we repeatedly generate samples (this time of size 500), we get a much better picture of the consequences of using different models to estimate the causal effect. The function below generates the data (using the same definitions as before), and then estimating three different models: (1) no adjustment, (2) incorrect adjustment for \(L\), the confounder/collider, and (3) the correct adjustment of the unmeasured confounder \(U_2\), which should be unbiased. The function returns the three estimates of the causal effect of \(A\):
repFunc < function(n, def) { dd < genData(n, def) c1 < coef(lm(Y ~ A, data = dd))["A"] c2 < coef(lm(Y ~ L + A, data = dd))["A"] c3 < coef(lm(Y ~ U2 + A, data = dd))["A"] return(data.table(c1, c2, c3)) }This following code generates 2500 replications of the “experiment” and stores the final results in data.table rdd:
RNGkind("L'EcuyerCMRG") # to set seed for parallel process reps < parallel::mclapply(1:2500, function(x) repFunc(500, d), mc.set.seed = TRUE) rdd < rbindlist(reps) rdd[, rep := .I] rdd ## c1 c2 c3 rep ## 1: 0.46 0.45 0.40 1 ## 2: 0.56 0.45 0.41 2 ## 3: 0.59 0.46 0.50 3 ## 4: 0.74 0.68 0.61 4 ## 5: 0.45 0.43 0.41 5 ##  ## 2496: 0.42 0.42 0.37 2496 ## 2497: 0.57 0.54 0.53 2497 ## 2498: 0.56 0.49 0.51 2498 ## 2499: 0.53 0.45 0.43 2499 ## 2500: 0.73 0.63 0.69 2500 rdd[, .(mean(c1  0.5), mean(c2  0.5), mean(c30.5))] ## V1 V2 V3 ## 1: 0.062 0.015 0.0016 rdd[, .(var(c1), var(c2), var(c3))] ## V1 V2 V3 ## 1: 0.011 0.0074 0.012As expected, the first two models are biased, whereas the third is not. Under these parameter and distribution assumptions, the variance of the causal effect estimate is larger for the unbiased estimate than for the model that incorrectly adjusts for diet (\(L\)). So, we seem to have a bias/variance tradeoff. In other cases, where we have binary outcome \(Y\) or continuous exposures, this tradeoff may be more or less extreme.
Here, we end with a look at the estimates, with the dashed line indicated at the true causal effect of \(A\) on \(Y\):
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Four Dataviz Posters
(This article was first published on R on kieranhealy.org, and kindly contributed to Rbloggers)
I was asked for some examples of posters I’ve made using R and ggplot. Here are four. Some of these are done from start to finish in R, others involved some postprocessing in Illustrator, usually to adjust some typographical elements or add text in a sidebar. I’ve linked to a PDF of each one, along with a pointer to the original post about the graphic.
If you’re interested in learning more about how to making graphs and charts using R and ggplot, then by a staggering coincidence there’s a new visualization book out that can help you with that.
Vaccination Exemptions in California KindergartensVaccination exemptions poster.
A CoCitation Network for PhilosophyCoCitation in Philosophy.
Mortality in France, 18162016French mortality.
Visualizing the Baby BoomThe Baby Boom in the US and the UK.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...