Diffusion/Wiener Model Analysis with brms – Part II: Model Diagnostics and Model Fit
(This article was first published on Computational Psychology | Henrik Singmann, and kindly contributed to R-bloggers)
This is the considerably belated second part of my blog series on fitting diffusion models (or better, the 4-parameter Wiener model) with brms. The first part discusses how to set up the data and model. This second part is concerned with perhaps the most important steps in any model-based data analysis: model diagnostics and the assessment of model fit. Note that the code in this part is completely self-sufficient and can be run without running the code of part I.
Setup

First, we load quite a few packages that we will need along the way. Obviously brms, but also some of the packages from the tidyverse (i.e., dplyr, tidyr, tibble, and ggplot2). It took me a little time to jump on the tidyverse bandwagon, but now that I use it more and more I cannot deny its utility. If your data can be made ‘tidy’, the coherent set of tools offered by the tidyverse makes many seemingly complicated tasks pretty easy. A few examples of this are shown below. If you need more of an introduction, I highly recommend the awesome ‘R for Data Science’ book by Grolemund and Wickham, which they have made available for free! We also need gridExtra for combining plots and DescTools for the concordance correlation coefficient (CCC) used below.
library("brms")
library("dplyr")
library("tidyr")
library("tibble")    # for rownames_to_column
library("ggplot2")
library("gridExtra") # for grid.arrange
library("DescTools") # for CCC

As in part I, we need package rtdists for the data.
data(speed_acc, package = "rtdists")
speed_acc <- droplevels(speed_acc[!speed_acc$censor,]) # remove extreme RTs
speed_acc <- droplevels(speed_acc[speed_acc$frequency %in% c("high", "nw_high"),])
speed_acc$response2 <- as.numeric(speed_acc$response) - 1

I have uploaded the binary R data file containing the fitted model object as well as the generated posterior predictive distributions to GitHub, from which we can download them directly into R. Note that I needed to go via a temporary folder. If there is a way without that I would be happy to learn about it.
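As it happens, load() also accepts a connection, so one way to skip the temporary folder (my own suggestion, not something from the original post) is to pass url() connections to load() directly:

```r
# load() accepts a connection, so the temporary file can be skipped
# (untested suggestion; requires an internet connection):
load(url("https://singmann.github.io/files/brms_wiener_example_fit.rda"))
load(url("https://singmann.github.io/files/brms_wiener_example_predictions.rda"))
```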
tmp <- tempdir()
download.file("https://singmann.github.io/files/brms_wiener_example_fit.rda", 
              file.path(tmp, "brms_wiener_example_fit.rda"))
download.file("https://singmann.github.io/files/brms_wiener_example_predictions.rda", 
              file.path(tmp, "brms_wiener_example_predictions.rda"))
load(file.path(tmp, "brms_wiener_example_fit.rda"))
load(file.path(tmp, "brms_wiener_example_predictions.rda"))

Model Diagnostics

We already know from part I that there are a few divergent transitions. If this were a real analysis, we would not be satisfied with the current fit and would rerun brm with an increased adapt_delta in the hope that this removes the divergent transitions. The Stan warning guidelines clearly state that “the validity of the estimates is not guaranteed if there are post-warmup divergences”. However, it is unclear what the actual impact of the small number of divergent transitions (< 10) observed here is on the posterior. It is also unclear what one can do if adapt_delta cannot be increased any further and the model cannot be reparameterized. Should all fits with any divergent transitions be disregarded completely? I hope the Stan team provides more guidance on such questions in the future.
Coming back to our fit, as a first step in our model diagnostics we check the Rhat statistic as well as the number of effective samples. Specifically, we look at the parameters with the highest Rhat and the lowest number of effective samples.
tail(sort(rstan::summary(fit_wiener$fit)$summary[, "Rhat"]))
# sd_id__conditionaccuracy:frequencyhigh
# 1.00
# r_id__bs[15,conditionaccuracy]
# 1.00
# b_bias_conditionaccuracy
# 1.00
# cor_id__conditionspeed:frequencyhigh__ndt_conditionaccuracy
# 1.00
# sd_id__ndt_conditionspeed
# 1.00
# cor_id__conditionspeed:frequencynw_high__bs_conditionspeed
# 1.01

head(sort(rstan::summary(fit_wiener$fit)$summary[, "n_eff"]))
# lp__
# 462
# b_conditionaccuracy:frequencyhigh
# 588
# sd_id__ndt_conditionspeed
# 601
# sd_id__conditionspeed:frequencyhigh
# 646
# b_conditionspeed:frequencyhigh
# 695
# r_id[12,conditionaccuracy:frequencyhigh]
# 712

Both are unproblematic (i.e., Rhat < 1.05 and n_eff > 100) and suggest that the sampler has converged on the stationary distribution. If anyone has a similar one-liner to return the number of divergent transitions, I would be happy to learn about it.
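For what it is worth, here is one candidate for such a one-liner (my own suggestion, not part of the original post): rstan::get_sampler_params() returns one matrix of sampler diagnostics per chain, each with a divergent__ column, so summing that column over chains counts the post-warmup divergent transitions. Wrapped in a small helper so the summing logic is explicit:

```r
# Count post-warmup divergent transitions from a list of per-chain
# sampler-parameter matrices, as returned by
# rstan::get_sampler_params(fit, inc_warmup = FALSE).
count_divergences <- function(sampler_params) {
  sum(vapply(sampler_params,
             function(chain) sum(chain[, "divergent__"]),
             numeric(1)))
}

# usage (assuming fit_wiener$fit is the underlying stanfit object):
# count_divergences(
#   rstan::get_sampler_params(fit_wiener$fit, inc_warmup = FALSE))
```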
We also visually inspect the chain behavior of a few semi-randomly selected parameters.
pars <- parnames(fit_wiener)
pars_sel <- c(sample(pars[1:10], 3), sample(pars[-(1:10)], 3))
plot(fit_wiener, pars = pars_sel, N = 6, ask = FALSE, 
     exact_match = TRUE, newpage = TRUE, plot = TRUE)

This visual inspection confirms the earlier conclusion. For all parameters the posteriors look well behaved and the chains appear to mix well.
Finally, in the literature there are some discussions of parameter trade-offs for the diffusion model and related models. These trade-offs supposedly make fitting the diffusion model in a Bayesian setting particularly complicated. To investigate whether fitting the Wiener model with HMC as implemented in Stan (i.e., NUTS) also shows this pattern, we take a look at the joint posterior of the fixed effects of the main Wiener parameters for the accuracy condition. For this we use the stanfit method of the pairs function and set condition to "divergent__". This plots the few divergent transitions above the diagonal and the remaining samples below the diagonal.
pairs(fit_wiener$fit, pars = pars[c(1, 3, 5, 7, 9)], condition = "divergent__")

This plot shows some correlations, but nothing too dramatic. HMC appears to sample quite efficiently from the Wiener model.
Next we also take a look at the correlations across all parameters (not only the fixed effects).
posterior <- as.mcmc(fit_wiener, combine_chains = TRUE)
cor_posterior <- cor(posterior)
cor_posterior[lower.tri(cor_posterior, diag = TRUE)] <- NA
cor_long <- as.data.frame(as.table(cor_posterior))
cor_long <- na.omit(cor_long)
tail(cor_long[order(abs(cor_long$Freq)),], 10)
#                              Var1                         Var2   Freq
# 43432        b_ndt_conditionspeed  r_id__ndt[1,conditionspeed] -0.980
# 45972 r_id__ndt[4,conditionspeed] r_id__ndt[11,conditionspeed]  0.982
# 46972        b_ndt_conditionspeed r_id__ndt[16,conditionspeed] -0.982
# 44612        b_ndt_conditionspeed  r_id__ndt[6,conditionspeed] -0.983
# 46264        b_ndt_conditionspeed r_id__ndt[13,conditionspeed] -0.983
# 45320        b_ndt_conditionspeed  r_id__ndt[9,conditionspeed] -0.984
# 45556        b_ndt_conditionspeed r_id__ndt[10,conditionspeed] -0.985
# 46736        b_ndt_conditionspeed r_id__ndt[15,conditionspeed] -0.985
# 44140        b_ndt_conditionspeed  r_id__ndt[4,conditionspeed] -0.990
# 45792        b_ndt_conditionspeed r_id__ndt[11,conditionspeed] -0.991

This table lists the ten largest absolute correlations among the posteriors for all pairwise combinations of parameters. The value in the column Freq is, somewhat unintuitively, the observed correlation between the posteriors of the two parameters listed in the previous two columns. To create this table I used a trick from SO based on as.table, which is responsible for the correlation column being labeled Freq.
What the table shows is some extreme correlations for the individual-level deviations (the first index in the square brackets of the parameter names appears to be the participant number). Let us visualize these correlations as well.
pairs(fit_wiener$fit, 
      pars = c("b_ndt_conditionspeed", 
               "r_id__ndt[11,conditionspeed]", 
               "r_id__ndt[4,conditionspeed]"), 
      condition = "divergent__")

This plot shows that some of the individual-level parameters are not well estimated.
However, overall such extreme correlations are rather rare.
hist(cor_long$Freq, breaks = 40)

Overall, the model diagnostics do not show any particularly worrying behavior (with the exception of the divergent transitions). We have learned that a few of the individual-level estimates for some of the parameters are not very trustworthy. However, this does not disqualify the overall fit. The main takeaway is that we would need to be careful in interpreting the individual-level estimates. Thus, we assume the fit is okay and continue with the next step of the analysis.
Assessing Model Fit

We will now investigate the model fit; that is, whether the model provides an adequate description of the observed data. We will mostly do so via graphical checks. For this, we need to prepare the posterior predictive distribution and the data. As a first step, we combine the posterior predictive distributions with the data.
d_speed_acc <- as_tibble(cbind(speed_acc, as_tibble(t(pred_wiener))))
- Probability of giving an upper-boundary response (i.e., responding “nonword”).
- Median RTs for responses at the upper boundary.
- Median RTs for responses at the lower boundary.
We first calculate this for each sample of the posterior predictive distribution. We then summarize these three measures by calculating the median and some additional quantiles across the posterior predictive distribution. We calculate all of this in one step using a somewhat long combination of dplyr and tidyr magic.
d_speed_acc_agg <- d_speed_acc %>% 
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise_at(.vars = vars(starts_with("V")), 
               funs(prob.upper = mean(. > 0),
                    medrt.lower = median(abs(.[. < 0])),
                    medrt.upper = median(.[. > 0])
               )) %>% 
  ungroup %>% 
  gather("key", "value", -id, -condition, -frequency) %>% # remove grouping vars
  separate("key", c("rep", "measure"), sep = "_") %>% 
  spread(measure, value) %>% 
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise_at(.vars = vars(prob.upper, medrt.lower, medrt.upper), 
               .funs = funs(median = median(., na.rm = TRUE),
                            llll = quantile(., probs = 0.01,  na.rm = TRUE),
                            lll  = quantile(., probs = 0.025, na.rm = TRUE),
                            ll   = quantile(., probs = 0.1,   na.rm = TRUE),
                            l    = quantile(., probs = 0.25,  na.rm = TRUE),
                            h    = quantile(., probs = 0.75,  na.rm = TRUE),
                            hh   = quantile(., probs = 0.9,   na.rm = TRUE),
                            hhh  = quantile(., probs = 0.975, na.rm = TRUE),
                            hhhh = quantile(., probs = 0.99,  na.rm = TRUE)
               ))

Next, we calculate the three measures for the data as well and combine them with the results from the posterior predictive distribution in one data.frame using left_join.
speed_acc_agg <- speed_acc %>% 
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise(prob.upper = mean(response == "nonword"),
            medrt.upper = median(rt[response == "nonword"]),
            medrt.lower = median(rt[response == "word"])
  ) %>% 
  ungroup %>% 
  left_join(d_speed_acc_agg)

Aggregated Model Fit

The first important question is whether our model can adequately describe the overall patterns in the data aggregated across participants. For this we simply aggregate the results obtained in the previous step (i.e., the summaries of the posterior predictive distribution as well as the test statistics of the data) using mean.
d_speed_acc_agg2 <- speed_acc_agg %>% 
  group_by(condition, frequency) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  ungroup

We then use these summaries to plot the predictions (in grey and black) as well as the data (in red) for the three measures. The inner (thick) error bars show the 80% credibility intervals (CIs), the outer (thin) error bars show the 95% CIs. The black circle shows the median of the posterior predictive distribution.
new_x <- with(d_speed_acc_agg2, 
              paste(rep(levels(condition), each = 2), 
                    levels(frequency), sep = "\n"))

p1 <- ggplot(d_speed_acc_agg2, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = prob.upper_lll, ymax = prob.upper_hhh), col = "darkgrey") + 
  geom_linerange(aes(ymin = prob.upper_ll, ymax = prob.upper_hh), size = 2, col = "grey") + 
  geom_point(aes(y = prob.upper_median), shape = 1) +
  geom_point(aes(y = prob.upper), shape = 4, col = "red") +
  ggtitle("Response Probabilities") + 
  ylab("Probability of upper response") + xlab("") +
  scale_x_discrete(labels = new_x)

p2 <- ggplot(d_speed_acc_agg2, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = medrt.upper_lll, ymax = medrt.upper_hhh), col = "darkgrey") + 
  geom_linerange(aes(ymin = medrt.upper_ll, ymax = medrt.upper_hh), size = 2, col = "grey") + 
  geom_point(aes(y = medrt.upper_median), shape = 1) +
  geom_point(aes(y = medrt.upper), shape = 4, col = "red") +
  ggtitle("Median RTs upper") + 
  ylab("RT (s)") + xlab("") +
  scale_x_discrete(labels = new_x)

p3 <- ggplot(d_speed_acc_agg2, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = medrt.lower_lll, ymax = medrt.lower_hhh), col = "darkgrey") + 
  geom_linerange(aes(ymin = medrt.lower_ll, ymax = medrt.lower_hh), size = 2, col = "grey") + 
  geom_point(aes(y = medrt.lower_median), shape = 1) +
  geom_point(aes(y = medrt.lower), shape = 4, col = "red") +
  ggtitle("Median RTs lower") + 
  ylab("RT (s)") + xlab("") +
  scale_x_discrete(labels = new_x)

grid.arrange(p1, p2, p3, ncol = 2)
Inspection of the plots shows no dramatic misfit. Overall, the model appears able to describe the general patterns in the data. Only the response probabilities for words (i.e., frequency = high) appear to be estimated as too low: the red crosses appear to lie outside the 80% CIs and possibly also outside the 95% CIs.
The plots of the RTs show an interesting (but not surprising) pattern. The posterior predictive distributions for the rare responses (i.e., “word” responses for upper/nonword stimuli and “nonword” response to lower/word stimuli) are relatively wide. In contrast, the posterior predictive distributions for the common responses are relatively narrow. In each case, the observed median is inside the 80% CI and also quite near to the predicted median.
Individual-Level Fit

To investigate the pattern of predicted response probabilities further, we take a look at them on the individual level. We again plot the response probabilities in the same way as above, but separated by participant id.
ggplot(speed_acc_agg, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = prob.upper_lll, ymax = prob.upper_hhh), col = "darkgrey") + 
  geom_linerange(aes(ymin = prob.upper_ll, ymax = prob.upper_hh), size = 2, col = "grey") + 
  geom_point(aes(y = prob.upper_median), shape = 1) +
  geom_point(aes(y = prob.upper), shape = 4, col = "red") +
  facet_wrap(~id, ncol = 3) +
  ggtitle("Predicted (in grey) and observed (red) response probabilities by ID") + 
  ylab("Probability of upper response") + xlab("") +
  scale_x_discrete(labels = new_x)

This plot shows a similar pattern as the aggregated data. For none of the participants do we observe dramatic misfit. Furthermore, response probabilities for nonword stimuli appear to be predicted rather well. In contrast, response probabilities for word stimuli are overall predicted to be lower than observed. However, this misfit does not seem to be too strong.
As a next step we look at the coverage probabilities of our three measures across individuals. That is, we calculate, for each of the measures, for each cell of the design, and for each of the CIs (i.e., 50%, 80%, 95%, and 99%), the proportion of participants for which the observed test statistic falls into the corresponding CI.
speed_acc_agg %>% 
  mutate(prob.upper_99 = (prob.upper >= prob.upper_llll) & (prob.upper <= prob.upper_hhhh),
         prob.upper_95 = (prob.upper >= prob.upper_lll) & (prob.upper <= prob.upper_hhh),
         prob.upper_80 = (prob.upper >= prob.upper_ll) & (prob.upper <= prob.upper_hh),
         prob.upper_50 = (prob.upper >= prob.upper_l) & (prob.upper <= prob.upper_h),
         medrt.upper_99 = (medrt.upper > medrt.upper_llll) & (medrt.upper < medrt.upper_hhhh),
         medrt.upper_95 = (medrt.upper > medrt.upper_lll) & (medrt.upper < medrt.upper_hhh),
         medrt.upper_80 = (medrt.upper > medrt.upper_ll) & (medrt.upper < medrt.upper_hh),
         medrt.upper_50 = (medrt.upper > medrt.upper_l) & (medrt.upper < medrt.upper_h),
         medrt.lower_99 = (medrt.lower > medrt.lower_llll) & (medrt.lower < medrt.lower_hhhh),
         medrt.lower_95 = (medrt.lower > medrt.lower_lll) & (medrt.lower < medrt.lower_hhh),
         medrt.lower_80 = (medrt.lower > medrt.lower_ll) & (medrt.lower < medrt.lower_hh),
         medrt.lower_50 = (medrt.lower > medrt.lower_l) & (medrt.lower < medrt.lower_h)
  ) %>% 
  group_by(condition, frequency) %>% ## grouping factors without id
  summarise_at(vars(matches("\\d")), mean, na.rm = TRUE) %>% 
  gather("key", "mean", -condition, -frequency) %>% 
  separate("key", c("measure", "ci"), "_") %>% 
  spread(ci, mean) %>% 
  as.data.frame()
#    condition frequency     measure    50     80    95    99
# 1   accuracy      high medrt.lower 0.706 0.8824 0.882 1.000
# 2   accuracy      high medrt.upper 0.500 0.8333 1.000 1.000
# 3   accuracy      high  prob.upper 0.529 0.7059 0.765 0.882
# 4   accuracy   nw_high medrt.lower 0.500 0.8125 0.938 0.938
# 5   accuracy   nw_high medrt.upper 0.529 0.8235 1.000 1.000
# 6   accuracy   nw_high  prob.upper 0.529 0.8235 0.941 0.941
# 7      speed      high medrt.lower 0.471 0.8824 0.941 1.000
# 8      speed      high medrt.upper 0.706 0.9412 1.000 1.000
# 9      speed      high  prob.upper 0.000 0.0588 0.588 0.647
# 10     speed   nw_high medrt.lower 0.706 0.8824 0.941 0.941
# 11     speed   nw_high medrt.upper 0.471 0.7647 1.000 1.000
# 12     speed   nw_high  prob.upper 0.235 0.6471 0.941 1.000

As can be seen, for the RTs the coverage probability is generally in line with the width of the CIs or even above it. Furthermore, for the common response (i.e., upper for frequency = nw_high and lower for frequency = high), the coverage probability is 1 for the 99% CIs in all cases.
Unfortunately, for the response probabilities the coverage is not that great, especially in the speed condition and for the tighter CIs. However, for the wide CIs the coverage probabilities are at least acceptable. Overall, the results so far suggest that the model provides an adequate account. There are some misfits that should be kept in mind if one is interested in extending the model or fitting it to new data, but overall it provides a satisfactory account.
QQ-plots: RTs

The final approach for assessing the fit of the model is based on more quantiles of the RT distribution (so far we have only looked at the .5 quantile, the median). We will plot individual observed versus predicted (i.e., the mean of the posterior predictive distribution) quantiles across conditions. For this we first calculate the quantiles per sample of the posterior predictive distribution and then aggregate across samples. This is achieved via dplyr::summarise_at using a list column and tidyr::unnest to unstack the columns (see section 25.3 in “R for Data Science”). We then combine the aggregated predicted RT quantiles with the observed RT quantiles.
quantiles <- c(0.1, 0.25, 0.5, 0.75, 0.9)

pp2 <- d_speed_acc %>% 
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise_at(.vars = vars(starts_with("V")), 
               funs(lower = list(rownames_to_column(
                 data.frame(q = quantile(abs(.[. < 0]), probs = quantiles)))),
                 upper = list(rownames_to_column(
                   data.frame(q = quantile(.[. > 0], probs = quantiles))))
               )) %>% 
  ungroup %>% 
  gather("key", "value", -id, -condition, -frequency) %>% # remove grouping vars
  separate("key", c("rep", "boundary"), sep = "_") %>% 
  unnest(value) %>% 
  group_by(id, condition, frequency, boundary, rowname) %>% # grouping vars + new vars
  summarise(predicted = mean(q, na.rm = TRUE))

rt_pp <- speed_acc %>% 
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise(lower = list(rownames_to_column(
    data.frame(observed = quantile(rt[response == "word"], probs = quantiles)))),
    upper = list(rownames_to_column(
      data.frame(observed = quantile(rt[response == "nonword"], probs = quantiles))))
  ) %>% 
  ungroup %>% 
  gather("boundary", "value", -id, -condition, -frequency) %>% 
  unnest(value) %>% 
  left_join(pp2)

To evaluate the agreement between observed and predicted quantiles, we calculate for each cell and quantile the concordance correlation coefficient (CCC; e.g., Barchard, 2012, Psychological Methods). The CCC is a measure of absolute agreement between two values and is thus better suited than a simple correlation. It is scaled from -1 to 1, where 1 represents perfect agreement, 0 no relationship, and -1 a correlation of -1 with the same mean and variance of the two variables.
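For readers who want to see what is behind DescTools::CCC, the CCC (Lin, 1989) can be written out by hand. The following is my own illustrative sketch, which may differ from DescTools::CCC in small details (e.g., n versus n - 1 denominators):

```r
# Hand-rolled concordance correlation coefficient (Lin, 1989):
# penalizes both low correlation and differences in mean/variance.
ccc_manual <- function(x, y) {
  n <- length(x)
  sxy <- cov(x, y) * (n - 1) / n  # population covariance
  sx2 <- var(x) * (n - 1) / n     # population variances
  sy2 <- var(y) * (n - 1) / n
  2 * sxy / (sx2 + sy2 + (mean(x) - mean(y))^2)
}

# perfect agreement gives 1; a shifted copy is penalized:
ccc_manual(1:10, 1:10)      # -> 1
ccc_manual(1:10, 1:10 + 2)  # < 1 despite a correlation of 1
```

This makes explicit why the CCC is preferable here: a constant shift between observed and predicted quantiles lowers the CCC even though the ordinary correlation stays at 1.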
The following code produces QQ-plots for each condition and quantile, separately for responses at the upper and lower boundary. The value in the upper left of each plot gives the CCC measure of absolute agreement.
plot_text <- rt_pp %>% 
  group_by(condition, frequency, rowname, boundary) %>% 
  summarise(ccc = format(
    CCC(observed, predicted, na.rm = TRUE)$rho.c$est, 
    digits = 2))

p_upper <- rt_pp %>% 
  filter(boundary == "upper") %>% 
  ggplot(aes(x = observed, y = predicted)) + 
  geom_abline(slope = 1, intercept = 0) +
  geom_point() +
  facet_grid(condition + frequency ~ rowname) + 
  geom_text(data = plot_text[plot_text$boundary == "upper", ],
            aes(x = 0.5, y = 1.8, label = ccc), 
            parse = TRUE, inherit.aes = FALSE) +
  coord_fixed() +
  ggtitle("Upper responses") +
  theme_bw()

p_lower <- rt_pp %>% 
  filter(boundary == "lower") %>% 
  ggplot(aes(x = observed, y = predicted)) + 
  geom_abline(slope = 1, intercept = 0) +
  geom_point() +
  facet_grid(condition + frequency ~ rowname) + 
  geom_text(data = plot_text[plot_text$boundary == "lower", ],
            aes(x = 0.5, y = 1.6, label = ccc), 
            parse = TRUE, inherit.aes = FALSE) +
  coord_fixed() +
  ggtitle("Lower responses") +
  theme_bw()

grid.arrange(p_upper, p_lower, ncol = 1)

Results show that, overall, the fit is better for the accuracy than for the speed conditions. Furthermore, the fit is better for the common response (i.e., nw_high for upper and high for lower responses). This latter observation is again not too surprising.
When comparing the fit for the different quantiles it appears that at least the median (i.e., 50%) shows acceptable values for the common response. However, especially in the speed condition the account of the other quantiles is not great. Nevertheless, dramatic misfit is only observed for the rare responses.
One possibility for some of the low CCCs in the speed conditions may be the comparatively low variances in some of the cells. For example, for both speed cells with common responses (i.e., speed & nw_high for upper responses and speed & high for lower responses), visual inspection of the plot suggests an acceptable account while at the same time some CCC values are low (i.e., < .5). Only for the 90% quantile in the speed conditions (and somewhat less so for the 75% quantile) do we see some systematic deviations: the model predicts lower RTs than observed.
Taken together, the model appears to provide an at least acceptable account. The only slightly worrying patterns are (a) that the model predicts slightly better performance for the word stimuli than observed (i.e., a lower predicted rate of nonword responses to word stimuli than observed) and (b) that in the speed conditions the model predicts somewhat shorter RTs for the 75% and 90% quantiles than observed.
The next step will be to look at differences between parameters as a function of the speed-accuracy condition. However, this will be the topic of the third blog post. I am hopeful it will not take two months this time.
To leave a comment for the author, please follow the link and comment on their blog: Computational Psychology | Henrik Singmann. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Anscombe’s Quartet: 1980’s Edition
(This article was first published on Method Matters, and kindly contributed to R-bloggers)
In this post, I’ll describe a fun visualization of Anscombe’s quartet I whipped up recently.
If you aren’t familiar with Anscombe’s quartet, here’s a brief description from its Wikipedia entry: “Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. He described the article as being intended to counter the impression among statisticians that ‘numerical calculations are exact, but graphs are rough.’ “
In essence, there are 4 different datasets with quite different patterns in the data. Fitting a linear regression model through each dataset yields (nearly) identical regression coefficients, while graphing the data makes it clear that the underlying patterns are very different. What’s amazing to me is how these simple data sets (and accompanying graphs) make immediately intuitive the importance of data visualization, and drive home the point of how wellconstructed graphs can help the analyst understand the data he or she is working with.
The Anscombe data are included in base R, and these data (and R, of course!) are used to produce the plot that accompanies the Wikipedia entry on Anscombe’s quartet.
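As a quick sanity check of the "nearly identical statistics" claim, the four regressions can be fit in one go (a small sketch of mine, not from the original post):

```r
# Fit y_i ~ x_i for each of the four Anscombe data sets (built into R)
# and compare coefficients; all four come out near
# intercept 3.0 and slope 0.5.
coefs <- sapply(1:4, function(i) {
  coef(lm(reformulate(paste0("x", i), paste0("y", i)), data = anscombe))
})
colnames(coefs) <- paste0("dataset_", 1:4)
round(coefs, 2)
```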
Because the 1980’s are back, I decided to make a visualization of Anscombe’s quartet using the most totally-rad 1980’s graphing elements I could come up with. I was aided with the colors by a number of graphic design palettes with accompanying hex codes. I used the excellent showtext package for the 1980’s font, which comes from the Google font “Press Start 2P.” (Note: if you’re reproducing the graph at home, the fonts won’t show properly in RStudio. Run the code in the standalone R program and everything works like a charm.) I had to tweak a number of graphical parameters to get the layout right, but in the end I’m quite pleased with the result.
The Code

# showtext library to get 1980's font
# "Press Start 2P"
library(showtext)
# add the font from Google fonts
font.add.google("Press Start 2P", "start2p")
png("C:\\Directory\\Anscombe_80s.png",
width=11,height=8, units='in', res=600)
showtext.begin()
op <- par(las=1, mfrow=c(2,2), mar=1.5+c(4,4,.5,1), oma=c(0,0,5,0),
lab=c(6,6,7), cex.lab=12.0, cex.axis=5, mgp=c(3,1,0), bg = 'black',
col.axis = '#F2CC00', col.lab = '#A10EEC', family = 'start2p')
ff <- y ~ x
for(i in 1:4) {
ff[[2]] <- as.name(paste("y", i, sep=""))
ff[[3]] <- as.name(paste("x", i, sep=""))
lmi <- lm(ff, data = anscombe)
xl <- substitute(expression(x[i]), list(i=i))
yl <- substitute(expression(y[i]), list(i=i))
plot(ff, data=anscombe, col="#490E61", pch=21, cex=2.4, bg = "#32FA05",
xlim=c(3,19), ylim=c(3,13)
, xlab=eval(xl), ylab=yl,
family = 'start2p'
)
abline(lmi, col="#FA056F", lwd = 5)
axis(1, col = '#FA1505')
axis(2, col = '#FA1505')
box(col="#FA1505")
}
mtext("Anscombe's Quartet", outer = TRUE,
cex = 20, family = 'start2p', col="#FA1505")
showtext.end()
dev.off()
The Plot

Conclusion
In this post, I used data available in R to make a 1980’s-themed version of the Anscombe quartet graphs. The main visual elements I manipulated were the colors and the fonts. R’s wonderful and flexible plotting capabilities (here using base R!) made it very straightforward to edit every detail of the graph to achieve the desired retro-kitsch aesthetic.
OK, so maybe this isn’t the most serious use of R for data analysis and visualization. There are doubtless more important business cases and analytical problems to solve. Nevertheless, this was super fun to do. Data analysis (or data science, or whatever you’d like to call it) is a field in which there are countless ways to be creative with data. It’s not always easy to bring this type of creativity to every applied project, but this blog is a place where I can do any crazy thing I set my mind to and just have fun. Judging by that standard, I think this project was a success.
Coming Up Next
In the next post, I’ll do something a little bit different with data. Rather than doing data analysis, I’ll describe a project in which I used Python to manage and edit metadata (ID3 tags) in mp3 files. Stay tuned!
prrd 0.0.1: Parallel Running [of] Reverse Depends
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A new package is now on the ghrr drat. It was uploaded four days ago to CRAN but still lingers in the inspect/ state, along with a growing number of other packages. But as some new packages have come through, I am sure it will get processed eventually but in the meantime I figured I may as well make it available this way.
The idea of prrd is simple, and described in some more detail on its webpage and its GitHub repo. Reverse dependency checks are an important part of package development (provided you care about not breaking other packages, as CRAN asks you to) and are easily done in a (serial) loop. But these checks are also generally embarrassingly parallel, as there is little or no interdependency between them (besides maybe shared build dependencies).
So this package uses the liteq package by Gabor Csardi to set up all tests as tasks in a queue. This permits multiple independent queue runners to each work on a task at a time. Results are written back and summarized.
This already works pretty well, as evidenced by the following screenshot (running six parallel workers, arranged in a split byobu session).
See the aforementioned webpage and its repo for more details, and by all means give it a whirl.
For more questions or comments use the issue tracker off the GitHub repo.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
Discontinuing maintenance of jug
(This article was first published on FishyOperations, and kindly contributed to R-bloggers)
I’m starting off 2018 with dropping my involvement with jug. The reason for this being that I simply cannot find the time or interest anymore to maintain the code.
All of the open-source efforts I started had to do with a concrete need of mine in one way or another. Once that need is fulfilled by creating a solution for it, the next challenge becomes maintaining that solution.
In the best-case scenario, someone with the right set of skills continues the development into a mature project which becomes significantly better than the original solution. An example close to my heart is simmer (see http://r-simmer.org/), where Iñaki Ucar has taken over the development and improved the original enormously.
However, for jug the active developer community is currently too small to do a handover. Next to that, jug’s main competitor plumber (see website) has been able to create a significant community with highly active support.
Today I sort of have to admit defeat. My own time availability as co-founder of dataroots and my changing interests unfortunately no longer allow me to put in the time and effort required to do the necessary maintenance and do justice to the jug user base. Therefore I will be discontinuing the maintenance of jug. If anyone would be willing to take over, they will be more than welcome.
Undemocratic Democracies
(This article was first published on R – DiffusePrioR, and kindly contributed to Rbloggers)
I’ve always thought that it was ironic that most countries with the word “Democratic” in their name are exceptionally undemocratic in reality. So I was very interested in the following post on Reddit this week showing exactly this (https://www.reddit.com/r/dataisbeautiful/comments/7nkyek/countries_with_the_word_democratic_in_their/).
However, there are only 8 countries that have “democratic” in their name: People’s Democratic Republic of Algeria, Democratic Republic of the Congo, Democratic Republic of Timor-Leste, Federal Democratic Republic of Ethiopia, Lao People’s Democratic Republic, Democratic People’s Republic of Korea, Federal Democratic Republic of Nepal, and the Democratic Socialist Republic of Sri Lanka. This and the fact that I am back teaching next week got me thinking that this might be a nice example to show how two-sample t-tests work.
The 8 “democratic” countries have an overall democracy index score of 3.89 with a sample standard deviation of 2.174942. This contrasts with a mean and standard deviation of 5.602013 and 2.172326 for the remaining 159 countries in the Economist Intelligence Unit’s democracy index (https://en.wikipedia.org/wiki/Democracy_Index).
Let’s conduct a two-sample t-test that assumes equal variances to test whether this difference in means is statistically significant. The null hypothesis is that both population means are equal. This is a two-tailed test. The test statistic follows Gosset’s t-distribution with n1+n2−2 degrees of freedom (159+8−2=165). The test statistic is calculated as follows (formula here: https://en.wikipedia.org/wiki/Student%27s_t-test):
s = sqrt(((158*2.172326^2)+(7*2.174942^2))/(159+8-2))
t = (5.602013-3.89)/(s*(sqrt(1/159+1/8))) = 2.1749
which allows us to reject at the conventional alpha of 0.05. The p-value is ~0.03, meaning we would not be able to reject at the 1% level. Interestingly, if you do not make the equal-variance assumption (i.e., you use Welch’s test), you can no longer reject the null at the 5% level.
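The arithmetic above is easy to double-check in R (a small aside of mine, not from the original post): pt() gives the cumulative distribution function of the t-distribution, so doubling the lower tail of -|t| yields the two-tailed p-value.

```r
# pooled standard deviation under the equal-variance assumption
s <- sqrt(((159 - 1) * 2.172326^2 + (8 - 1) * 2.174942^2) / (159 + 8 - 2))
# test statistic for the difference in means
t_stat <- (5.602013 - 3.89) / (s * sqrt(1/159 + 1/8))
# two-tailed p-value with n1 + n2 - 2 = 165 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = 159 + 8 - 2)
round(t_stat, 4)  # ~2.1749
round(p_value, 2) # ~0.03
```

These numbers agree with the hand calculation and with the t.test() output further down.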
Hopefully, this example will be of interest to people teaching stats and econometrics for undergrads!
rm(list = ls())
other <- c(9.93, 9.5, 9.39, 9.26, 9.2, 9.15, 9.15, 9.09, 9.03, 9.01, 8.81, 8.8, 8.63, 8.41, 8.39, 8.36, 8.3, 8.28, 8.17, 7.99, 7.98, 7.98, 7.94, 7.92, 7.92, 7.88, 7.87, 7.86, 7.85, 7.85, 7.82, 7.81, 7.79, 7.78, 7.77, 7.65, 7.51, 7.47, 7.41, 7.39, 7.31, 7.29, 7.23, 7.13, 7.1, 7.01, 6.97, 6.96, 6.94, 6.9, 6.83, 6.77, 6.75, 6.75, 6.72, 6.67, 6.67, 6.65, 6.64, 6.62, 6.62, 6.59, 6.57, 6.54, 6.47, 6.42, 6.4, 6.38, 6.31, 6.27, 6.25, 6.21, 6.03, 6.01, 5.99, 5.93, 5.92, 5.92, 5.91, 5.81, 5.76, 5.73, 5.72, 5.7, 5.7, 5.67, 5.64, 5.63, 5.55, 5.33, 5.31, 5.26, 5.23, 5.07, 5.04, 4.93, 4.93, 4.92, 4.87, 4.86, 4.81, 4.77, 4.7, 4.68, 4.55, 4.5, 4.49, 4.33, 4.27, 4.2, 4.08, 4.02, 4.02, 3.96, 3.96, 3.96, 3.88, 3.85, 3.81, 3.74, 3.71, 3.54, 3.46, 3.46, 3.4, 3.38, 3.32, 3.31, 3.24, 3.18, 3.14, 3.14, 3.07, 3.06, 3.05, 3.04, 3.03, 2.91, 2.91, 2.83, 2.79, 2.75, 2.65, 2.55, 2.4, 2.37, 2.37, 2.34, 2.25, 2.06, 1.98, 1.95, 1.93, 1.89, 1.83, 1.7, 1.61, 1.5, 1.43)
demo <- c(7.24, 6.48, 4.86, 3.6, 3.56, 2.37, 1.93, 1.08)
mean(other) ; sd(other) ; length(other)
mean(demo) ; sd(demo) ; length(demo)
t.test(other, demo, var.equal = TRUE)
s <- sqrt(((158*2.172326^2)+(7*2.174942^2))/(159+8-2))
t <- (5.602013-3.89)/(s*(sqrt(1/159+1/8)))
t.test(other, demo, var.equal = FALSE)
library(ggplot2)
data1 <- data.frame(Score = other, Name = "No")
data2 <- data.frame(Score = demo, Name = "Yes")
data <- rbind(data1, data2)
ggplot(data, aes(Name, Score)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw() +
  xlab("Democratic in Name") +
  ylab("Democracy Score")
Rainbowing a set of pictures
(This article was first published on Maëlle, and kindly contributed to Rbloggers)
I’ve now done a few collages from R using magick: the faces of #rstats Twitter, We R-Ladies with Lucy D’Agostino McGowan, and a holiday card for R-Ladies. The faces of #rstats Twitter and holiday card collages were arranged at random, while the We R-Ladies one was a mosaic forming the R-Ladies logo. I got the idea to up my collage skills by trying to learn how to arrange pics by their main colour, like a rainbow. The verb rainbow doesn’t exist, and “rainbowing” doesn’t mean ordering by colour, but I didn’t let this stop me.
It was the occasion to grab some useful knowledge about colours, not useless for someone who did not even know about Pantone’s Colors of the Year a few weeks ago…
This post has nothing to do with Kesha’s new album. However, you can listen to it while reading since it’s so good, but maybe switch to something older from her when I use “$”.
Getting some pics to play with
The first pictures I tried to arrange were all the pictures ever posted by R-Ladies local chapters on their Twitter account. While it was fun to grab them all, it was not very interesting to play with them, as so many of them were pictures of screens. I therefore grabbed “nature” pictures from Pexels using the same method [as when creating the Bubblegum Puppies] following this Stack Overflow thread. I chose “nature” as a keyword because 1) it led to many hits and 2) it offered a good variety of colours.
library("rvest")
library("RSelenium")
library("magrittr")
rD <- rsDriver()
remDr <- rD[["client"]]
# open the webpage
remDr$navigate("https://www.pexels.com/search/nature/")
# scroll down
for(i in 1:130){
  remDr$executeScript(paste("scroll(0,", i*10000, ");"), args = list("dummy"))
  # be nice and wait
  Sys.sleep(1)
}
# https://www.pexels.com/faq/
page_content <- remDr$getPageSource()
remDr$close()
get_link_from_src <- function(node){
  xml2::xml_attrs(node)["src"] %>%
    as.character() %>%
    stringr::str_replace("\\?h.*", "")
}
xtract_pic_links <- function(source){
  css <- '.photo-item__img'
  read_html(source[[1]]) %>%
    html_nodes(css) %>%
    purrr::map_chr(get_link_from_src)
}
links <- xtract_pic_links(page_content)
links <- links[1:1400]
# save
dir.create("nature")
save_pic <- function(url){
  Sys.sleep(1)
  name <- stringr::str_replace(url, ".*\\/", "")
  try(magick::image_read(url) %>%
        magick::image_write(paste0("nature/", name)),
      silent = TRUE)
}
purrr::walk(links, save_pic)

Extracting the main colour and making pics size-compatible
In the following code, I extracted the main colour from each pic using Russel Dinnage’s method as presented in this blog post from David Smith. Before that I had to install two packages from Github, rblocks and rPlotter.
This code also serves another role: since I wanted to paste pics together at some point, I decided to make them all of the same dimensions by adding a border with magick. I had learnt how to do that when preparing the R-Ladies Global holiday card, but this time instead of using the same colour every time (R-Ladies’ official purple), I used the main colour I’d just extracted. The most important points for making a picture a square are knowing that magick::image_info gives you the height and width of an image… and somehow understanding geometry, which was embarrassingly a hurdle when I did that.
The code to extract colours didn’t work in a few cases, which I did not investigate a lot: I had downloaded more pics than I needed because I had experienced the issue when working with R-Ladies meetup pics, and had seen it was because of seemingly bicolor pics.
dir.create("formatted_pics")
format_image <- function(path){
  image <- magick::image_read(path)
  info <- magick::image_info(image)
  # find in which direction I need to add pixels
  # to make this a square
  direction <- ifelse(info$height > info$width, "height", "width")
  scale_number <- as.numeric(info[direction]/500)
  image <- magick::image_scale(image,
                               paste0(info["width"]/scale_number, "x",
                                      info["height"]/scale_number))
  newinfo <- magick::image_info(image)
  # colours
  colours <- try(rPlotter::extract_colours(path, num_col = 1), silent = TRUE)
  # one pic at least was problematic
  if(!is(colours, "try-error")){
    colour <- colours[1]
    image <- magick::image_border(image, colour,
                                  paste0((500 - newinfo$width)/2, "x",
                                         (500 - newinfo$height)/2))
    info <- magick::image_info(image)
    # odd numbers out!
    if(info$height/2 != floor(info$height/2)){
      image <- magick::image_crop(image, "0x500+0")
    }
    if(info$width/2 != floor(info$width/2)){
      image <- magick::image_crop(image, "500x0+0")
    }
    magick::image_write(image, stringr::str_replace(path, "nature", "formatted_pics"))
    tibble::tibble(path = path, colour = colour)
  }else{
    NULL
  }
}
pics_main_colours <- purrr::map_df(dir("nature", full.names = TRUE), format_image)
readr::write_csv(pics_main_colours, path = "pics_main_colours.csv")

And because I’m apparently a bad planner, I had to reduce pictures afterwards.
# we need smaller images
reduce_image <- function(path){
  magick::image_read(path) %>%
    magick::image_scale("50x50!") %>%
    magick::image_write(path)
}
purrr::walk(dir("formatted_pics", full.names = TRUE), reduce_image)

Preparing a function to order and paste pictures
This function has a collage part, which you might recognize from my "the faces of #rstats Twitter" blog post, and an "ordering pictures according to a variable" part that’s new and uses a bit of tidy eval… Maybe I’ll really learn tidy eval this year! pics_info needs to be a data.frame with the path to the pictures and, well, the variable one wants to use to order them.
library("rlang")
make_column <- function(i, files, no_rows){
  magick::image_read(files[(i*no_rows+1):((i+1)*no_rows)]) %>%
    magick::image_append(stack = TRUE) %>%
    magick::image_write(paste0("cols/", i, ".jpg"))
}
make_collage <- function(pics_info, no_rows, no_cols, ordering_col){
  pics_info <- dplyr::arrange(pics_info, !!!syms(ordering_col))
  pics_info <- pics_info[1:(no_rows*no_cols),]
  pics_info <- dplyr::mutate(pics_info, column = rep(1:no_cols, each = no_rows))
  pics_info <- dplyr::group_by(pics_info, column) %>%
    dplyr::arrange(!!!syms(ordering_col)) %>%
    dplyr::mutate(row = 1:no_rows) %>%
    dplyr::ungroup()
  pics_info <- dplyr::arrange(pics_info, column, row)
  dir.create("cols")
  purrr::walk(0:(no_cols - 1), make_column,
              files = pics_info$path, no_rows = no_rows)
  banner <- magick::image_read(dir("cols/", full.names = TRUE)) %>%
    magick::image_append(stack = FALSE)
  unlink("cols", recursive = TRUE)
  return(banner)
}

The function returns a magick object that one can then write to disk as a PNG for instance.
I first tested it using a random approach added to the data.frame created in the next section, and show the result here to give an idea of the variety of pictures. For many of them, however, the main colour that you can see in their border is greyish.
set.seed(42)
pics_info <- dplyr::mutate(pics_info,
                           random = sample(1:nrow(pics_info), nrow(pics_info)))
make_collage(pics_info, 19, 59, "random") %>%
  magick::image_write("data/2018-01-07-rainbowing-banner_random.png")

Testing a first (bad) approach: using hue
Once I had the main colour as a hex code, I had no idea how to order the colours, and thought a good idea would be to use hue, which is the main wavelength in a colour. Most observed colours are a mix of wavelengths, unless you’re using a laser for instance. To get hue from a colour identified by its hex code, one needs two functions: colorspace::hex2RGB and grDevices::rgb2hsv. The latter outputs hue, saturation and value. Hue is the main wavelength, saturation the amount of that wavelength in the colour, and value the amount of light in the colour. The smaller the saturation, the less representative the hue is of the main colour. Add to that the fact that the main colour can itself be only a little representative of your original picture… Ordering by hue isn’t perfect, but I tried it anyway.
# now work on getting the hue and value for all pics
# create a data.frame with path, hue, value
get_values <- function(path, pics_main_colours){
  print(path)
  # get main colour
  colour <- pics_main_colours$colour[pics_main_colours$path ==
                                       stringr::str_replace(path, "formatted_pics", "nature")]
  # translate it
  rgb <- colorspace::hex2RGB(colour)
  values <- grDevices::rgb2hsv(t(rgb@coords))
  tibble::tibble(path = path,
                 hue = values[1,1],
                 saturation = values[2,1],
                 value = values[3,1])
}
# all values
pics_info <- purrr::map_df(dir("formatted_pics", full.names = TRUE),
                           get_values, pics_main_colours)
make_collage(pics_info, 19, 59, "hue") %>%
  magick::image_write("banner_hue.png")

So this is not too pretty. Blue and green pictures seem to cluster together, but there are very dark pictures which we’d intuitively put aside.
So I thought a bit more and decided to first assign main colours to a reference colour and then order pictures based on this…
Choosing a better approach: RGB and distances
The first challenge was to choose reference colours which’d be rainbow slices. I could have looked up RGB codes corresponding to ROYGBIV (red, orange, yellow, green, blue, indigo and violet) but I had read about the xkcd color survey in this interesting post and therefore decided to use xkcd colors, available in R via the xkcdcolors package. I chose to use the 18 most common ones, and added black to that lot. It was no longer really a rainbow, I agree. The colors present in the pictures were ordered by hand by my husband, and I like his choices.
Then, to assign each pic to a reference colour via its main colour, I calculated the Euclidean distance between that colour and all reference colours to find the closest reference colour, using the RGB values.
library("xkcdcolors")
library("magrittr")
# version of colorspace::hex2RGB returning the coordinate matrix
hex2rgb <- function(hex){
  colorspace::hex2RGB(hex)@coords
}
# https://stackoverflow.com/questions/45328221/unnest-one-column-list-to-many-columns-in-tidyr
colors <- tibble::tibble(name = c(xcolors()[1:18], "black"),
                         hex = name2color(name),
                         rgb = purrr::map(hex, hex2rgb)) %>%
  dplyr::mutate(rgb = purrr::map(rgb, tibble::as_tibble)) %>%
  tidyr::unnest()
# for each colour I want the closest reference colour
find_closest_colour <- function(hex, colors){
  test <- tibble::tibble(hex = hex) %>%
    dplyr::mutate(rgb = purrr::map(hex, hex2rgb),
                  rgb = purrr::map(rgb, tibble::as_tibble)) %>%
    tidyr::unnest()
  distance <- stats::dist(rbind(test[, c("R", "G", "B")],
                                colors[, c("R", "G", "B")]))
  colors$name[which.min(as.matrix(distance)[,1][2:(nrow(colors) + 1)])]
}
imgs_col <- dplyr::mutate(pics_main_colours,
                          xkcd_col = purrr::map_chr(colour, find_closest_colour,
                                                    colors = colors))
readr::write_csv(imgs_col, path = "imgs_xkcd_col.csv")

Once I had this information about each pic, I could order the pictures, after having defined the order of the reference colours.
pics_info <- readr::read_csv("imgs_xkcd_col.csv")
pics_info <- dplyr::mutate(pics_info,
                           path = stringr::str_replace(path, "nature", "formatted_pics"))
pics_info <- dplyr::mutate(pics_info,
                           xkcd_col = factor(xkcd_col, ordered = TRUE,
                                             levels = c("black", "brown", "red", "magenta", "pink",
                                                        "lime green", "green", "dark green", "teal",
                                                        "light blue", "sky blue", "blue", "purple", "grey")))

This looks much better, but of course the initial set (and maybe the extraction method used as well) doesn’t provide enough colours to make this extremely pretty. I’m not sure how useful this end product is, but hey, I got to look at pretty landscapes full of colours from my grey rainy city, and learnt a lot along the way… Besides, maybe you will find a cool use case for some of the colour methods featured here and will tell me about it in the comments?
Dataviz Signup
(This article was first published on R on kieranhealy.org , and kindly contributed to Rbloggers)
Data Visualization for Social Science will be published later this year by Princeton University Press. You can read a near-complete draft of the book at socviz.co. If you would like to receive one (1) email when the book is available for pre-order, please fill out this very short form. The goal of the book is to introduce readers to the principles and practice of data visualization in a humane, reproducible, and up-to-date way.
New wrapr R pipeline feature: wrapr_applicable
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
The R package wrapr now has a neat new feature: “wrapr_applicable”.
This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery.
A small effort in making a package “wrapr aware” appears to have a fairly large payoff.
tint 0.0.5
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A maintenance release of the tint package arrived on CRAN earlier this week. Its name expands to “tint is not tufte” as the package offers a fresher take on the Tufte style for html and pdf presentations.
A screenshot of the pdf variant is below.
Similar to the RcppCNPy release this week, this is pure maintenance related to dependencies. CRAN noticed that processing these vignettes requires the mgcv package—as we use geom_smooth() in some example graphs. So that was altered to no longer require mgcv just for the vignette tests. We also had one pending older change related to jurassic pandoc versions on some CRAN architectures.
Changes in tint version 0.0.5 (2018-01-05)
Only run html rendering regression test on Linux or Windows as the pandoc versions on CRAN are too old elsewhere.

Vignette figures reworked so that the mgcv package is not required, avoiding a spurious dependency [CRAN request]
Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page.
For questions or comments use the issue tracker off the GitHub repo.
RcppCNPy 0.2.8
(This article was first published on Thinking inside the box , and kindly contributed to Rbloggers)
A minor maintenance release of the RcppCNPy package arrived on CRAN this week.
RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers.
There is no code change here. But to process the vignette we rely on knitr which sees Python here and (as of its most recent release) wants the (excellent !!) reticulate package. Which is of course overkill just to process a short pdf document, so we turned this off.
Changes in version 0.2.8 (2018-01-04): Vignette sets knitr option python.reticulate=FALSE to avoid another dependency just for the vignette [CRAN request]
CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.
RQuantLib 0.4.4 for Windows
(This article was first published on FOSS Trading, and kindly contributed to Rbloggers)
I’m pleased to announce that the RQuantLib Windows binaries are now up to 0.4.4! The RQuantLib prebuilt Windows binaries have been frozen on CRAN since 0.4.2, but now you can get version 0.4.4 binaries on Dirk’s ghrr drat repo.
Installation is as simple as:

drat::addRepo("ghrr")  # maybe use 'install.packages("drat")' first
install.packages("RQuantLib", type = "binary")

I will be able to create Windows binaries for future RQuantLib versions too, now that I have a Windows QuantLib build (version 1.11) to link against. Dirk and I plan to talk with CRAN about getting the new binaries hosted there. Regardless, they will always be available via the drat repo.
How to extract data from a PDF file with R
(This article was first published on Rposts.com, and kindly contributed to Rbloggers)
In this post, taken from the book R Data Mining by Andrea Cirillo, we’ll be looking at how to scrape PDF files using R. It’s a relatively straightforward way to look at text mining – but it can be challenging if you don’t know exactly what you’re doing.
Until January 15th, every single eBook and video by Packt is just $5! Start exploring some of Packt’s huge range of R titles here.
You may not be aware of this, but some organizations create something called a ‘customer card’ for every single customer they deal with. This is quite an informal document that contains some relevant information related to the customer, such as the industry and the date of foundation. Probably the most precious information contained within these cards is the comments they write down about the customers. Let me show you one of them:
My plan was the following—get the information from these cards and analyze it to discover whether some kind of common traits emerge.
As you may already know, at the moment this information is presented in an unstructured way; that is, we are dealing with unstructured data. Before trying to analyze this data, we will have to gather it in our analysis environment and give it some kind of structure.
Technically, what we are going to do here is called text mining, which generally refers to the activity of gaining knowledge from texts. The techniques we are going to employ are the following:
 Sentiment analysis
 Wordclouds
 N-gram analysis
 Network analysis
First of all, we need to get a list of the customer cards we were given by the commercial department. I have stored all of them within the ‘data’ folder in my workspace. Let’s use list.files() to get them:
file_vector <- list.files(path = "data")

Nice! We can inspect this by looking at the head of it, using the following command:
file_vector %>% head()
[1] "banking.xls" "Betasoloin.pdf" "Burl Whirl.pdf" "BUSINESSCENTER.pdf"
[5] "Buzzmaker.pdf" "BuzzSaw Publicity.pdf"

Uhm… not exactly what we need. I can see there are also .xls files. We can remove them using the grepl() function, which performs partial matches on strings, returning TRUE if the pattern required is found, or FALSE if not. We are going to set the following test here: give me TRUE if you find .pdf in the filename, and FALSE if not:
grepl(".pdf", file_vector)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[18] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[35] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[52] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[69] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[86] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[103] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
[120] TRUE

As you can see, the first match results in a FALSE, since it is related to the .xls file we saw before. We can now filter our list of files by simply passing these matching results to the list itself. More precisely, we will slice our list, selecting only those records where our grepl() call returns TRUE:
pdf_list <- file_vector[grepl(".pdf", file_vector)]

Did you understand [grepl(".pdf", file_vector)]? It is actually a way to access one or more indexes within a vector, which in our case are the indexes of the elements containing “.pdf”, exactly the same as we printed out before.
If you now look at the list, you will see that only PDF filenames are shown on it.
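A small aside of mine on the pattern itself (not from the book): in a regular expression the dot matches any single character, so ".pdf" would also match a filename like "notes-pdf.txt". Escaping the dot and anchoring at the end of the string makes the filter stricter:

```r
files <- c("banking.xls", "Betasoloin.pdf", "notes-pdf.txt")

# the unescaped dot matches any character, so "notes-pdf.txt" slips through
grepl(".pdf", files)
# [1] FALSE  TRUE  TRUE

# escaped dot plus end-of-string anchor keeps only real .pdf files
grepl("\\.pdf$", files)
# [1] FALSE  TRUE FALSE
```

With the files used in this post the simple pattern happens to work, so the code above is just a defensive refinement.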
Reading PDF files into R via pdf_text()
R comes with a really useful package that’s employed for tasks related to PDFs. This is named pdftools, and besides the pdf_text function we are going to employ here, it also contains other relevant functions that are used to get different kinds of information related to the PDF file into R.
For our purposes, it will be enough to get all of the textual information contained within each of the PDF files. First of all, let’s try this on a single document; we will try to scale it later on the whole set of documents. The only required argument to make pdf_text work is the path to the document. The object resulting from this application will be a character vector of length 1:
pdf_text("data/BUSINESSCENTER.pdf")
[1] "BUSINESSCENTER business profile\nInformation below are provided under non disclosure agreement. date of enquery: 12.05.2017\ndate of foundation: 1993-05-18\nindustry: Non-profit\nshare holders: Helene Wurm ; Meryl Savant ; Sydney Wadley\ncomments\nThis is one of our worst customer. It really often miss payments even if for just a couple of days. We have\nproblems finding useful contact persons. The only person we can have had occasion to deal with was the\nfiscal expert, since all other relevant person denied any kind of contact.\n 1\n"

If you compare this with the original PDF document, you can easily see that all of the information is available, even if it is definitely not ready to be analyzed. What do you think is the next step needed to make our data more useful?
We first need to split our string into lines in order to give our data a structure that is closer to the original one, that is, made of paragraphs. To split our string into separate records, we can use the strsplit() function, a base R function. It requires as arguments the string to be split and the token that decides where the string has to be split. If you now look at the string, you’ll notice that where we found the end of a line in the original document, for instance after the words business profile, we now find the \n token. This is commonly employed in text formats to mark the end of a line.
We will therefore use this token as a split argument:
pdf_text("data/BUSINESSCENTER.pdf") %>% strsplit(split = "\n")
[[1]]
[1] "BUSINESSCENTER business profile"
[2] "Information below are provided under non disclosure agreement. date of enquery: 12.05.2017"
[3] "date of foundation: 1993-05-18"
[4] "industry: Non-profit"
[5] "share holders: Helene Wurm ; Meryl Savant ; Sydney Wadley"
[6] "comments"
[7] "This is one of our worst customer. It really often miss payments even if for just a couple of days. We have"
[8] "problems finding useful contact persons. The only person we can have had occasion to deal with was the"
[9] "fiscal expert, since all other relevant person denied any kind of contact."
[10] " 1"

strsplit() returns a list with an element for each element of the character vector passed as argument; within each list element, there is a vector with the split string.
Isn’t that better? I definitely think it is. The last thing we need to do before actually doing text mining on our data is to apply those treatments to all of the PDF files and gather the results into a conveniently arranged data frame.
Iteratively extracting text from a set of documents with a for loop

What we want to do here is run through the list of files and, for each filename found there, run the pdf_text() function and then the strsplit() function to get an object similar to the one we have seen with our test. A convenient way to do this is by employing a ‘for’ loop. These structures basically say: repeat these instructions n times and then stop. Let me show you a typical ‘for’ loop:
for (i in 1:3){
  print(i)
}

If we run this, we obtain the following:
[1] 1
[1] 2
[1] 3

This means that the loop runs three times and therefore repeats the instructions included within the brackets three times. What is the only thing that seems to change every time? It is the value of i. This variable, which is usually called the counter, is basically what the loop employs to understand when to stop iterating. When the loop execution starts, the loop starts increasing the value of the counter, going from 1 to 3 in our example. The for loop repeats the instructions between the brackets for each element of the vector following the in clause in the for command. At each step, the value of the variable before in (i in this case) takes one value of the sequence from the vector itself.
The counter is also useful within the loop itself, and it is usually employed to iterate within an object in which some kind of manipulation is desired. Take, for instance, a vector defined like this:
vector <- c(1,2,3)

Imagine we want to increase the value of every element of the vector by 1. We can do this by employing a loop such as this:
for (i in 1:3){
  vector[i] <- vector[i] + 1
}

If you look closely at the loop, you’ll realize that the instruction needs to access the element of the vector with an index equal to i and increase this value by 1. The counter here is useful because it allows iterating over all of the vector’s elements, from 1 to 3.
Be aware that this is actually not a best practice, because loops tend to be quite computationally expensive; they should be employed only when no valid alternative is available. For instance, we can obtain the same result here by working directly on the whole vector, as follows:
vector_increased <- vector + 1

If you are interested in the topic of avoiding loops where they are not necessary, I can share with you some relevant material on this.
For our purposes, we are going to use a loop to go through the pdf_list object, applying the pdf_text() function and subsequently the strsplit() function to each element of this list:
corpus_raw <- data.frame("company" = c(), "text" = c())

for (i in 1:length(pdf_list)) {
  print(i)
  pdf_text(paste("data/", pdf_list[i], sep = "")) %>%
    strsplit("\n") -> document_text
  data.frame("company" = gsub(x = pdf_list[i], pattern = ".pdf", replacement = ""),
             "text" = document_text,
             stringsAsFactors = FALSE) -> document
  colnames(document) <- c("company", "text")
  corpus_raw <- rbind(corpus_raw, document)
}

Let’s get closer to the loop: we first have a call to the pdf_text() function, passing an element of pdf_list as an argument; the element is referenced by its position i in the list. Once we have done this, we can move on to apply the strsplit() function to the resulting string. We then define the document object, which contains two columns, in this way:
 company, which stores the name of the PDF without the .pdf token; this is the name of the company
 text, which stores the text resulting from the extraction
This document object is then appended to the corpus object, which we created previously, to store all of the text within the PDF.
Let’s have a look at the resulting data frame:
corpus_raw %>% head()

This is a well-structured object, ready for some text mining. Nevertheless, if we look closely at our PDF customer cards, we can see that there are three different kinds of information, and they should be handled differently:
 Repeated information, such as the confidentiality disclosure on the second line and the date of enquiry (12.05.2017)
 Structured attributes, for instance, date of foundation or industry
 Strictly unstructured data, which is in the comments paragraph
We are going to address these three kinds of data differently, removing the first group of irrelevant information; we therefore have to split our data frame accordingly into two smaller data frames. To do so, we will leverage the grepl() function once again, looking for the following tokens:
 12.05.2017: This denotes the line showing the non-disclosure agreement and the date of inquiry.
 business profile: This denotes the title of the document, containing the name of the company. We already have this information stored within the company column.
 comments: This is the name of the last paragraph.
 1: This represents the number of the page and is always the same on every card.
We can apply the filter function to our corpus_raw object here as follows:
corpus_raw %>%
  filter(!grepl("12.05.2017", text)) %>%
  filter(!grepl("business profile", text)) %>%
  filter(!grepl("comments", text)) %>%
  filter(!grepl("1", text)) -> corpus

Now that we have removed those useless lines, we can actually split the data frame into two sub-data frames, based on what is returned by the grepl() function when searching for the following tokens, which point to the structured attributes we discussed previously:
 date of foundation
 industry
 shareholders
We are going to create two different data frames here; one is called ‘information’ and the other is called ‘comments’:
corpus %>%
  filter(!grepl(c("date of foundation"), text)) %>%
  filter(!grepl(c("industry"), text)) %>%
  filter(!grepl(c("share holders"), text)) -> comments

corpus %>%
  filter(grepl(("date of foundation"), text) |
         grepl(("industry"), text) |
         grepl(("share holders"), text)) -> information

As you can see, the two treatments are nearly the opposite of each other: the first looks for lines showing none of the three tokens, while the second looks for records showing at least one of them.
Let’s inspect the results by employing the head function:
information %>% head()
comments %>% head()

Great! We are nearly done. We are now going to start analyzing the comments data frame, which reports all comments from our colleagues. The very last step needed to make this data frame ready for subsequent analyses is to tokenize it, which basically means separating it into different rows for all the words available within the text column. To do this, we are going to leverage the unnest_tokens() function, which basically splits each line into words and produces a new row for each word, having taken care to repeat the corresponding value within the other columns of the original data frame.
This function comes from the recent tidytext package by Julia Silge and David Robinson, which provides an organic framework of utilities for text mining tasks. It follows the tidyverse framework, which you should already know about if you are using the dplyr package.
Let’s see how to apply the unnest_tokens function to our data:
comments %>% unnest_tokens(word, text) -> comments_tidy

If we now look at the resulting data frame, we can see the following:
As you can see, we now have each word separated into a single record.
Thanks for reading! We hope you enjoyed this extract taken from R Data Mining.
Explore $5 R eBooks and videos from Packt until January 15th 2018.
To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Using nonstandard evaluation to simulate a register machine
(This article was first published on Higher Order Functions, and kindly contributed to Rbloggers)
I recently completed all 25 days of Advent of Code 2017, an annual
series of recreational programming puzzles. Each day describes a programming
puzzle and illustrates a handful of simple examples of the problem. The puzzle
then requires the participant to solve a much, much larger form of the problem.
For five or so of the puzzles, I used nonstandard evaluation to implement my
solution. As I previously wrote, nonstandard evaluation is a way of
bottling up magic spells (lines of R code) and changing how or where they are
cast (evaluated). I chose to use special evaluation not because it was the
easiest or most obvious solution but because I wanted to develop my skills with
computing on the language. In this post, I work through one of the examples
where I used nonstandard evaluation to write an interpreter for a simple
machine.
Day 8 requires us to simulate the state
of a register machine as it receives a series of instructions.
Each instruction consists of several parts: the register to modify, whether to
increase or decrease that register’s value, the amount by which to increase or
decrease it, and a condition. If the condition fails, skip the instruction
without modifying the register. The registers all start at 0. The instructions
look like this:
[…]
You might also encounter <= (less than or equal to) or != (not equal to).
However, the CPU doesn’t have the bandwidth to tell you what all the registers
are named, and leaves that to you to determine.
What is the largest value in any register after completing the
instructions in your puzzle input?
If I squint long enough at the register instructions, I can basically
see R code.
If we can set up a way to convert the machine instructions into R code, R will
handle the job of looking up values, modifying values and evaluating the logic
and if statements. In other words, if we can convert register instructions
into R code, the problem simplifies into something like running an R script.
And that’s a good strategy, because we have a lot of instructions to process.
Each user receives some unique (I think) input for each problem, and my problem
input contains 1,000 instructions.
Our strategy for simulating the register machine will have the following steps:
 Parsing a register instruction
 Creating an R expression from an instruction
 Evaluating an R expression inside of a register machine
 Changing the evaluation rules to adapt to the quirks of this problem
The instructions have a very simple grammar. Here is how I would tag the first
few lines of my problem input.
We can parse these lines using regular expressions. Regular expressions are an
incredibly powerful language for processing text using patternmatching rules. I
will walk through each of the regular expression patterns used to parse an
instruction.
To match the verbs, we can use the or operator |, so
(inc|dec) matches inc or dec. We can also match the six different
comparison operators using | too. In the code below, I put the patterns in
parentheses so that they will be treated as a single group.
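As a quick sanity check of the alternation operator (a minimal sketch with made-up input, not from the original post):

```r
# "(inc|dec)" matches either verb; the anchors keep it from
# matching mere substrings of longer words.
verbs <- c("inc", "dec", "mul")
grepl("^(inc|dec)$", verbs)
#> [1]  TRUE  TRUE FALSE
```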
A register name is just a sequence of letters. The special character \w
matches any word character; that is, it matches uppercase/lowercase letters,
digits and underscores. The (token)+ suffix matches 1 or more
repetitions of a token. Putting these two together, \w+ will match 1 or
more adjacent word characters. In principle, that pattern could match things
besides register names (like numbers), but the instruction format here is so
constrained that it’s not a problem.
Numbers are just integers, and sometimes they are negative. A number here
is an optional - plus some digits. The special character \d matches any
digit from 0 to 9, so \d+ matches 1 or more successive digits. We can use
the (token)? suffix to match an optional token. In our case,
-?\d+ will match a sequence of digits plus a leading hyphen if one is
present.
Each pattern in parentheses is a matching group, and the function str_match()
from the stringr package will return a matrix
with a column for each matched group. I also include an extra set of
parentheses around the condition in the if statement to also match the whole
condition as well as its parts.
We can package this step into a function that takes an instruction’s
text and returns a list with the labelled parts of that instruction.
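The function’s body is not shown in this extract; here is one plausible sketch built from the patterns described above, using stringr::str_match(). The field names in the returned list are my own choices, not necessarily the original ones.

```r
library(stringr)

# Parse one instruction such as "b inc 5 if a > 1" into its labelled parts.
# The extra parentheses around the condition capture it as a whole group.
parse_instruction <- function(text) {
  pattern <- "(\\w+) (inc|dec) (-?\\d+) if ((\\w+) (>=|<=|==|!=|>|<) (-?\\d+))"
  m <- str_match(text, pattern)
  list(
    target     = m[, 2],  # register to modify
    verb       = m[, 3],  # "inc" or "dec"
    amount     = m[, 4],  # signed integer, still as text
    condition  = m[, 5],  # the whole condition, e.g. "a > 1"
    cond_reg   = m[, 6],
    cond_op    = m[, 7],
    cond_value = m[, 8]
  )
}

parse_instruction("b inc 5 if a > 1")
```

Note that the two-character operators (>=, <=, ==, !=) are listed before > and < in the alternation, so they win when both could match.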
Next, we need to convert some strings into R code. We can do this with
rlang::parse_expr(). It takes a string and creates an R expression,
something I’ve described as a kind of bottled up magic spell: An
expression captures magic words (code) and allows us to manipulate or cast (evaluate)
them.
The format of the instructions is relatively straightforward. We can fill
in a template with the parts of the parsed line. Because inc/dec are just
addition and subtraction, we replace them with the appropriate math operations.
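That template step can be sketched as follows (the helper name and the shape of the parsed list p are my assumptions, not code from the original post):

```r
# Build R code from the parsed parts: inc becomes +, dec becomes -.
# The else branch returns the target unchanged when the condition fails.
create_r_instruction <- function(p) {
  op <- if (p$verb == "inc") "+" else "-"
  code <- sprintf("if (%s %s %s) %s %s %s else %s",
                  p$cond_reg, p$cond_op, p$cond_value,
                  p$target, op, p$amount, p$target)
  rlang::parse_expr(code)
}

create_r_instruction(list(target = "b", verb = "inc", amount = "5",
                          cond_reg = "a", cond_op = ">", cond_value = "1"))
#> if (a > 1) b + 5 else b
```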
We have to figure out where we want to evaluate the generated R code. We
create a register object to hold the values. The object will just be a list()
with some extra metadata. This object will be the location where the R code
is evaluated.
For now, we can initialize registers by using named arguments to the function.
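The constructor’s definition is not part of this extract; here is a minimal sketch consistent with the printed output below:

```r
# A list of register values plus a .metadata list, with a class added so a
# custom print method (not shown here) can display the machine via str().
create_register_machine <- function(...) {
  machine <- c(list(...), list(.metadata = list()))
  structure(machine, class = c("register_machine", "list"))
}
```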
create_register_machine(a = 0, b = 0)
#> List of 3
#>  $ a        : num 0
#>  $ b        : num 0
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

Evaluating code inside of the machine

So far, we have:
 A way to analyze register instructions and convert them into R code
 An object that holds register values
Now, we need to evaluate an expression inside of the register. We will use
tidy evaluation; the function eval_tidy() lets us evaluate an
R expression inside of a data object.
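A minimal illustration with made-up register values: names in the expression are resolved in the supplied data before the calling environment.

```r
library(rlang)

# a = 4 satisfies the condition, so the expression evaluates to b + 5.
registers <- list(a = 4, b = 7)
eval_tidy(quote(if (a > 1) b + 5 else b), data = registers)
#> [1] 12
```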
Now, we need to actually do something. We need to update the register machine
using the value from the evaluated instruction. Otherwise, the machine will just
read expressions and forget everything it’s read.
To update the machine, we have to determine the register to update. Fortunately,
our generated code always ends with an else branch that has the target
register.
If we can pull out that symbol after the else, we will have the name of
register to update in the machine. Because the code is so formulaic, we can just
extract the symbol directly using the code’s abstract syntax tree (AST).
pryr::call_tree() shows the AST for an expression.
We can extract elements from the tree like elements in a list by selecting
indices.
If we convert the symbol into a string, we can look it up in the register using
the usual list lookup syntax.
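A small check of that indexing on a quoted instruction: element 4 of an if call is the else branch, which in our generated code is always the bare target symbol.

```r
instruction <- quote(if (a > 1) b + 5 else b)
instruction[[4]]              # the else branch: a symbol
#> b
rlang::as_string(instruction[[4]])
#> [1] "b"
```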
We can also use list lookup syntax with assignment to modify the register.
r[[target]] <- rlang::eval_tidy(r_code, data = r)
r
#> List of 3
#>  $ a        : num 4
#>  $ b        : num 12
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

Let’s wrap these steps into a function.
eval_instruction <- function(register_machine, instruction) {
  target <- rlang::as_string(instruction[[4]])
  register_machine[[target]] <- rlang::eval_tidy(
    expr = instruction,
    data = register_machine)
  register_machine
}

create_register_machine(a = 2, b = 0) %>%
  eval_instruction(r_code)
#> List of 3
#>  $ a        : num 2
#>  $ b        : num 5
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

create_register_machine(a = 2, b = 0) %>%
  # For quick testing, we pass in quoted expressions
  eval_instruction(quote(if (a > 1) b - 100 else b)) %>% # Should run
  eval_instruction(quote(if (a < 1) b + 5 else b)) %>%   # Should not run
  eval_instruction(quote(if (a > 1) a + 10 else a))
#> List of 3
#>  $ a        : num 12
#>  $ b        : num -100
#>  $ .metadata: list()
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

Time for some extra nonstandard evaluation

The code so far only works if the machine already has registers that match the
registers in an instruction. Otherwise, we raise an error.
To solve the problem, we could study the 1,000 lines of input beforehand,
extract the register names, initialize them to 0 and then evaluate the
instructions.1 Or… or… we could procrastinate and only
initialize a register name to 0 when the machine encounters a name it doesn’t
recognize. If, for some reason, our machine received instructions
one at a time, like over a network connection, then the procrastinated approach
seems even more reasonable.
This latter strategy will involve some very nonstandard evaluation. I
emphasize the “very” because we are changing one of the fundamental rules of R
evaluation. R throws an error if you ask it to evaluate the name
of a variable that doesn’t exist. But here we are going to detect those missing
variables and set them to 0 before they get evaluated.
To find the brand-new register names, we can inspect the call tree and find the
names of the registers. We already know where the target is. The other place
where names show up is in the condition of the if statement.
We can define a helper function which checks for missing names—names that
yield NULL values when we try to retrieve them—and initializes them to 0.
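One way to sketch that helper (the function name is my own; the original definition is not shown in this extract):

```r
# Any name we cannot find in the machine is created as a register set to 0.
initialize_new_registers <- function(register_machine, names) {
  for (name in names) {
    if (is.null(register_machine[[name]])) {
      register_machine[[name]] <- 0
    }
  }
  register_machine
}
```

The names to check come from the instruction’s call tree: the target symbol after the else, plus the register symbol inside the condition.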
Finally, we update our evaluation function to do this step automatically. I’m
also going to add some code to record the value of the maximum register whenever
an instruction is evaluated because, ummm, that’s the whole point of the puzzle.
Let’s try four instructions from a clean slate.
create_register_machine() %>%
  # b gets 5
  eval_instruction(quote(if (d < 1) b + 5 else b)) %>%
  # c gets 10
  eval_instruction(quote(if (b > 1) c + 10 else c)) %>%
  # b gets 5 more
  eval_instruction(quote(if (a < 1) b + 5 else b))
#> List of 5
#>  $ .metadata:List of 1
#>   ..$ max: num 10
#>  $ b        : num 10
#>  $ d        : num 0
#>  $ c        : num 10
#>  $ a        : num 0
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

Now, for the moment of truth… Let’s process all 1,000 instructions.
r <- create_register_machine()

for (each_instruction in full_input) {
  parsed <- each_instruction %>%
    parse_instruction() %>%
    create_r_instruction()
  r <- eval_instruction(r, parsed)
}

r
#> List of 27
#>  $ .metadata:List of 1
#>   ..$ max: num 4832
#>  $ kd       : num 2334
#>  $ gm       : num 4239
#>  $ x        : num 345
#>  $ kjn      : num 1813
#>  $ ey       : num 209
#>  $ n        : num 764
#>  $ t        : num 2997
#>  $ f        : num 4468
#>  $ msg      : num 3906
#>  $ ic       : num 263
#>  $ zv       : num 599
#>  $ gub      : num 2025
#>  $ yp       : num 2530
#>  $ lyr      : num 2065
#>  $ j        : num 3619
#>  $ e        : num 4230
#>  $ riz      : num 863
#>  $ lzd      : num 4832
#>  $ ucy      : num 3947
#>  $ i        : num 3448
#>  $ omz      : num 3365
#>  $ djq      : num 392
#>  $ bxy      : num 1574
#>  $ tj       : num 1278
#>  $ y        : num 1521
#>  $ m        : num 2571
#>  - attr(*, "class")= chr [1:2] "register_machine" "list"

Tada! The maximum register value is 4,832. Problem solved!
And then the rules change

Advent of Code problems come in two parts, and we don’t learn the question
behind Part 2 until we complete Part 1. In this case, after submitting our
solution for Part 1, we receive the following requirement:
To be safe, the CPU also needs to know the highest value held in any
register during this process so that it can decide how much memory to allocate
to these operations.
Accounting for this twist requires a small change to the evaluation code. We
add another metadata variable to track the highest value ever stored in a
register.
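The bookkeeping can be sketched like this (the field and helper names are assumptions; the original code is not shown in this extract):

```r
# After each write, keep the running maximum over every value ever stored.
# max(NULL, x) is simply x, so the field initializes itself on first use.
update_max_ever <- function(register_machine, new_value) {
  old <- register_machine$.metadata$max_ever
  register_machine$.metadata$max_ever <- max(old, new_value)
  register_machine
}
```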
Admittedly, eval_instruction() is starting to get bloated. Conceptually, we
could probably break this function down into three functions: pre-evaluation
steps, evaluation, and post-evaluation steps.2 But this is good
enough for now.
We run the instructions again to get the updated metadata.
r <- create_register_machine()

for (each_instruction in full_input) {
  parsed <- each_instruction %>%
    parse_instruction() %>%
    create_r_instruction()
  r <- eval_instruction(r, parsed)
}

r$.metadata
#> $max
#> [1] 4832
#>
#> $max_ever
#> [1] 5443

And boom! Another problem solved.
eval(thoughts, envir = this_problem)

I like this kind of nonstandard evaluation approach for converting problems into
R code, but it’s mostly useful when the problem describes a series of
instructions that can be parsed and evaluated. For problems like this register
machine simulation, the nonstandard evaluation route is straightforward. But
it’s also a viable problemsolving strategy when the “machine” or the
“instructions” are subtler, as in
this problem about simulating “dance” moves.
Odds are, you’ll never have to write an interpreter for a toy machine or
language. Nevertheless, here are some R functions that we used for this
puzzle that are helpful in other contexts:
 stringr::str_match() to extract all the groups in a regular expression at once.
 rlang::parse_expr() to convert a string of text into an R expression.
 pryr::call_tree() to visualize an expression’s syntax tree and expression[[i]][[j]] to pluck out symbols from locations in a tree.
 rlang::as_string() to convert a symbol into a string.
 rlang::eval_tidy() to evaluate an expression inside of a data context.

 That actually would be pretty easy. Get a data frame with
purrr::map_df(full_input, parse_instruction). Find the unique register
names. Create a list of 0’s with those names. Use do.call() to call
create_register_machine() with that list. With no special evaluation
trickery, this approach is closer to the idea of “just running R code”. 
If all I did for a living was write code to simulate machines
or toy languages, I might try to formalize this custom evaluation process
with pre-evaluation and post-evaluation “hooks” that could be arguments
to a custom evaluation function. I’m just brainstorming though.
To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions.
StanCon is next week, Jan 10-12, 2018
(This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to Rbloggers)
It looks pretty cool!
Wednesday, Jan 10
Invited Talk: Predictive information criteria in hierarchical Bayesian models for clustered data. Sophia Rabe-Hesketh and Daniel Furr (U California, Berkeley) 10:40-11:30am
Does the New York City Police Department rely on quotas? Jonathan Auerbach (Columbia U) 11:30-11:50am
Bayesian estimation of mechanical elastic constants. Ben Bales, Brent Goodlet, Tresa Pollock, Linda Petzold (UC Santa Barbara) 11:50am-12:10pm
Joint longitudinal and time-to-event models via Stan. Sam Brilleman, Michael Crowther, Margarita Moreno-Betancur, Jacqueline Buros Novik, Rory Wolfe (Monash U, Columbia U) 12:10-12:30pm
Lunch 12:30-2:00pm
ScalaStan. Joe Wingbermuehle (Cibo Technologies) 2:00-2:20pm
A tutorial on Hidden Markov Models using Stan. Luis Damiano, Brian Peterson, Michael Weylandt 2:20-2:40pm
Student Ornstein-Uhlenbeck models served three ways (with applications for population dynamics data). Aaron Goodman (Stanford U) 2:40-3:00pm
SlicStan: a blockless Stan-like language. Maria I. Gorinova, Andrew D. Gordon, Charles Sutton (U of Edinburgh) 3:00-3:20pm
Break 3:20-4:00pm
Invited Talk: Talia Weiss (MIT) 4:00-4:50pm
Thursday, Jan 11
Invited Talk: Sean Taylor and Ben Letham (Facebook) 10:40-11:30am
NPCompare: a package for nonparametric density estimation and two populations comparison built on top of PyStan. Marco Inacio (U of São Paulo/UFSCar) 11:30-11:50am
Introducing idealstan, an R package for ideal point modeling with Stan. Robert Kubinec (U of Virginia) 11:50am-12:10pm
A brief history of Stan. Daniel Lee (Generable) 12:10-12:30pm
Lunch 12:30-1:30pm
Computing steady states with Stan’s nonlinear algebraic solver. Charles C. Margossian (Metrum, Columbia U) 1:30-1:50pm
Flexible modeling of Alzheimer’s disease progression with I-Splines. Arya A. Pourzanjani, Benjamin B. Bales, Linda R. Petzold, Michael Harrington (UC Santa Barbara) 1:50-2:10pm
Intrinsic Auto-Regressive (ICAR) Models for Spatial Data, Mitzi Morris (Columbia U) 2:10-2:30pm
Modeling/Data Session + Classes 2:30-4:10pm
Open session for consultations on modeling and data problems with Stan developers and modelers. 2:30-4:10pm
Session 3 of Intro to Stan 2:30-4:10pm
2:30-3:30pm Have I converged successfully? How to verify fit and diagnose fit problems, Bob Carpenter
What is new to Stan 3:30-4:10pm
Invited Talk: Manuel Rivas (Stanford U) 4:00-4:50pm
Friday, Jan 12
Invited Talk: Susan Holmes (Stanford U) 10:40-11:30am
Aggregate random coefficients logit — a generative approach. Jim Savage, Shoshana Vasserman 11:30-11:50am
The threshold test: Testing for racial bias in vehicle searches by police. Camelia Simoiu, Sam Corbett-Davies, Sharad Goel, Emma Pierson (Stanford U) 11:50am-12:10pm
Assessing the safety of Rosiglitazone for the treatment of type II diabetes. Konstantinos Vamvourellis, K. Kalogeropoulos, L. Phillips (London School of Economics and Political Science) 12:10-12:30pm
Lunch 12:30-1:30pm
Causal inference with the g-formula in Stan. Leah Comment (Harvard U) 1:30-1:50pm
Bayesian estimation of ETAS models with Rstan. Fausto Fabian Crespo Fernandez (Universidad San Francisco de Quito) 1:50-2:10pm
Invited Talk: Andrew Gelman (Columbia U) 2:10-3:00pm (virtual)
Classes/Tutorials
We have tutorials that start at the crack of 8am for those desiring further edification beyond the program—these do not run in parallel to the main session but do run parallel to each other:
Introduction to Stan: Know how to program? Know basic statistics? Curious about Bayesian analysis and Stan? This is the course for you. Hands on, focused and an excellent way to get started working in Stan. Two hours every day, 6 hours total. Jonah Sol Gabry.
Executive decision making the Bayesian way: This is for non-technical managers and technical folks who need to communicate with managers to learn the core of decision making under uncertainty. One hour every day. Jonathan Auerbach, Breck Baldwin, Eric Novik.
Advanced Hierarchical Models in Stan: The hard stuff. Very interactive, very intense. Topics vary by day. Ben Goodrich.
Model assessment, model selection and inference after model selection. Aki Vehtari.
How to develop for Stan at the C++ level: Overview of Stan C++ architecture and build/development process for contributors. Charles Christopher Margossian.
[edited by Aki: added Model selection tutorial]
The post StanCon is next week, Jan 10-12, 2018 appeared first on Statistical Modeling, Causal Inference, and Social Science.
To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science.
Big cdata News
(This article was first published on R – WinVector Blog, and kindly contributed to Rbloggers)
I have some big news about our R package cdata. We have greatly improved the calling interface and Nina Zumel has just written the definitive introduction to cdata.
cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago.
However, cdata is much more than that.
cdata provides a family of general transforms that include pivot/unpivot (or tidyr::spread/tidyr::gather) as easy special cases. Nina refused to write the article on it until we refactored the API to be even more teachable (and therefore more learnable, and more useful). After her redesign (adding the concepts of both concrete records and abstract records to the coordinatized data theory), the system teaches itself. It is actually hard to remember you are graphically specifying potentially involved, difficult, and confusing transforms (which do not remain confusing, as the graphical specification becomes its own documenting diagram!).
Don’t take my word for it; please check out Nina’s article: “Fluid data reshaping with cdata”.
Also, I will be presenting a lightning talk on cdata at the January 2018 BARUG Meetup! We think we have gotten the concepts and package refined and polished to the point where they can be mastered in 15 minutes with time to spare.
As a bonus, this new version of cdata is now available on CRAN (with the old method naming still supported), and also works at big data scale (for DBI-adaptable databases such as Spark and PostgreSQL, an uncommon feature for a full-featured pivot/unpivot system).
To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog.
simmer 3.6.5
(This article was first published on R – Enchufa2, and kindly contributed to Rbloggers)
The fifth update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. This release extends the attributes API by allowing users to set/get multiple attributes at once (a pretty straightforward change, as well as a useful one; I don’t know why it didn’t occur to me before…). Vectors as attributes and other data types are not supported yet, but they are on the roadmap.
This version also fixes some minor bugs (many thanks to the users of the simmer-devel mailing list for taking their simulations to edge cases, where these bugs arise), deprecates the onestep() function and provides the new stepn() instead. Since onestep() serves primarily for debugging purposes, the transition to the new one may go unnoticed. Finally, there is a new vignette about the Dining Philosophers Problem.
New features:
 set_attribute() (and set_global() by extension) can set multiple attributes at once by providing vectors of keys and values (or functions returning such keys and/or values). get_attribute() (and get_global() by extension) can retrieve multiple keys (#122).
 New stepn() method deprecates onestep() (e452975).
 Restore ostream after formatting (9ff11f8).
 Fix arrival cloning to copy attributes over to the clone (#118).
 Fix self-induced preemption through set_capacity() (#125).
 Update “Queueing Systems” vignette (a0409a0, 8f03f4f).
 Update “Advanced Trajectory Usage” vignette (4501927).
 Fix print methods to return the object invisibly (#128).
 New “Dining Philosophers Problem” vignette (ff6137e).
Article originally published in Enchufa2.es: simmer 3.6.5.
To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.
Divide and parallelize large data problems with Rcpp
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
by Błażej Moska, computer science student and data science intern
Got stuck with too large a dataset? R speed drives you mad? Divide, parallelize and go with Rcpp!
One of the frustrating moments when working with data is when you need results urgently, but your dataset is large enough to make that impossible. This happens often when we need to use an algorithm with high computational complexity. I will demonstrate it with an example I've been working on.
Suppose we have a large dataset consisting of association rules. For some reason we want to slim it down: whenever two rules' consequents are the same and one rule's antecedent is a subset of the other rule's antecedent, we want to keep the smaller rule (the probability of observing the smaller antecedent set is higher than that of the larger set). This is illustrated below:
{A,B,C}=>{D}
{E}=>{F}
{A,B}=>{D}
{A}=>{D}
How can we achieve that? For example, using the following pseudo-algorithm:
for i = 1 to n:
    for j = i+1 to n:
        if consequents[i] == consequents[j]:
            if antecedent[i] contains antecedent[j], flag rule i with 1
            else if antecedent[j] contains antecedent[i], flag rule j with 1
How many operations do we need to perform with this simple algorithm?
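As a concrete illustration, the pseudo-algorithm can be written out as follows (a sketch in Python; representing the rules as pairs of frozensets is my assumption, not from the original post):

```python
# Naive O(n^2) redundancy flagging: a rule is flagged with 1 when another
# rule with the same consequent has an antecedent that is a proper subset
# of its own antecedent. Representation and names are illustrative.
def flag_redundant(rules):
    n = len(rules)
    flags = [0] * n
    for i in range(n):
        ant_i, cons_i = rules[i]
        for j in range(i + 1, n):
            ant_j, cons_j = rules[j]
            if cons_i != cons_j:
                continue
            if ant_j < ant_i:        # ant_j is a proper subset: rule i redundant
                flags[i] = 1
            elif ant_i < ant_j:      # ant_i is a proper subset: rule j redundant
                flags[j] = 1
    return flags

# The four example rules from the text
rules = [
    (frozenset("ABC"), frozenset("D")),
    (frozenset("E"),   frozenset("F")),
    (frozenset("AB"),  frozenset("D")),
    (frozenset("A"),   frozenset("D")),
]
print(flag_redundant(rules))  # -> [1, 0, 1, 0]
```

Only {E}=>{F} and the minimal rule {A}=>{D} survive, matching the example above.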
For the first value of i we need to iterate \(n-1\) times, for the second \(n-2\) times, for the third \(n-3\), and so on, down to a single iteration for the last. This leads to (proof can be found here):
\[
\sum_{i=1}^{n-1}{i} = \frac{n(n-1)}{2}
\]
So the above algorithm has asymptotic complexity \(O(n^2)\): roughly speaking, the number of operations grows with the square of the size of the data. For a dataset containing around 1,300,000 records this becomes a serious issue. With R I was unable to perform the computation in reasonable time. Since a compiled language performs better with simple arithmetic operations, the second idea was to use Rcpp. Yes, it is faster, to some extent, but with such a large data frame I was still unable to get results in satisfying time. So are there any other options?
Yes, there are. If we take a look at our dataset, we can see that it can be split in such a way that each individual "chunk" consists of records with exactly the same consequent:
{A,B}=>{D}
{A}=>{D}
{C,G}=>{F}
{Y}=>{F}
After such a division I got 3300 chunks, so the average number of observations per chunk was around 400. The next step was to run the algorithm sequentially on each chunk. Since the algorithm has quadratic complexity, it is faster to process many small chunks than the whole dataset at once. While R failed again, Rcpp finally returned a result (after 5 minutes). But there is still room for improvement. Since the chunks can be processed independently, we can parallelize the computation using, for example, the foreach package (which I demonstrated in a previous article). While passing R functions to foreach is a simple task, parallelizing Rcpp is a little more time-consuming. We need the following steps:
 Create a .cpp file which includes all the functions needed.
 Create a package using Rcpp. This can be achieved using, for example:
Rcpp.package.skeleton("nameOfYourPackage", cpp_files = "directory_of_your_cpp_file")
 Install your Rcpp package from source:
install.packages("directory_of_your_rcpp_package", repos = NULL, type = "source")
 Load your library:
library(name_of_your_rcpp_package)
Now you can use your Rcpp function in foreach:
results = foreach(k = 1:length(len), .packages = c(name_of_your_package)) %dopar% { your_cpp_function(data) }
Even with foreach I waited forever for the R results, but Rcpp gave them in approximately 2.5 minutes. Not too bad!
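The divide-and-parallelize strategy itself is language-agnostic. Here is a sketch of the same idea in Python: group the rules by consequent, then flag redundant rules within each chunk using a process pool (all names and the frozenset representation are illustrative, not from the original post):

```python
from collections import defaultdict
from multiprocessing import Pool

def flag_chunk(chunk):
    """Flag redundant rules within one chunk (all rules share a consequent):
    a rule is redundant if some other antecedent is a proper subset of its own."""
    antecedents = [ant for ant, _ in chunk]
    return [int(any(other < ant for other in antecedents)) for ant in antecedents]

def flag_redundant_parallel(rules, processes=2):
    # Divide: group the rules by consequent, so each O(k^2) pass runs on
    # a small chunk instead of the whole dataset.
    groups = defaultdict(list)
    for ant, cons in rules:
        groups[cons].append((ant, cons))
    chunks = list(groups.values())
    # Conquer in parallel: chunks are independent, so a pool of worker
    # processes can flag them concurrently.
    with Pool(processes) as pool:
        flags_per_chunk = pool.map(flag_chunk, chunks)
    return dict(zip(groups.keys(), flags_per_chunk))

if __name__ == "__main__":
    rules = [
        (frozenset("AB"), frozenset("D")),
        (frozenset("A"),  frozenset("D")),
        (frozenset("CG"), frozenset("F")),
        (frozenset("Y"),  frozenset("F")),
    ]
    print(flag_redundant_parallel(rules))
```

With k observations per chunk the per-chunk cost is O(k^2), so 3300 chunks of ~400 rows are far cheaper than one pass over 1.3 million rows, and the chunks parallelize for free.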
 Here is the Rcpp code for the issub function
 Here is the R code that partitions the data and calls the issub function in parallel
Here are some conclusions. First, it's worth knowing more languages/tools than just R. Second, there is often an escape from the large-dataset trap. There is little chance that somebody will face exactly the same task as in the example above, but a much higher probability that someone will face a similar problem that can be solved in the same way.
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Die Markov Kette: Wie berechne ich die Gleichgewichtsverteilung? (The Markov chain: how do I compute the stationary distribution?)
(This article was first published on R-Programming – Statistik Service, and kindly contributed to R-bloggers)
In this article we introduce the concept of the Markov chain, illustrate its foundations, and present several possible fields of application in which targeted statistical programming of Markov chains can pay off.
A Markov chain is a stochastic process with a wide variety of applications in nature, technology and economics. Classic economic applications are the modelling of queues and exchange rates. In engineering there are also numerous fields of use, for example modelling the behaviour of dams and of cruise-control systems in motor vehicles, as well as the analytical evaluation of mobility algorithms such as the random walk. In the natural sciences this type of process is used in population dynamics to predict the population growth of humans and animals, and in a modified form for Brownian motion. Particularly relevant in practice is the statistical programming and simulation of the stationary distribution with the statistics software R, which is carried out below using an illustrative example. But first, the terms required for the calculation are explained.
What are a Markov chain and a stationary distribution?
There is a multitude of different stochastic processes, which are divided into various categories. A well-known type of stochastic process that is used particularly often in practice is the Markov chain, named after the mathematician Andrei Andreyevich Markov. It consists of a state space, an index set, an initial distribution and the transition probabilities.
The state space determines which states can be assumed. Here one distinguishes between a continuous state space, which contains uncountably many states, and a discrete state space, which contains at most countably many states. An important special case of the latter is a state space with finitely many states, on which we will concentrate.
A stochastic process changes its state over time. Analogously to the state space, the index set (which contains the points in time) is also either continuous or discrete. A continuous index set is appropriate, for example, for Brownian motion, because the molecules are in constant motion and can change their direction and speed within the smallest intervals of time. We, however, use the discrete index set of the natural numbers N0 = {0,1,2,3,4,5,…}, including the starting time 0, with an arbitrary time unit. Between two consecutive points in time the state thus remains constant.
The Markov chain has to start in some state. This can be a fixed or a randomly chosen state. The initial distribution determines with which probability the stochastic process starts in which state.
Not only the initial distribution is random, but also the further behaviour. The transition probabilities determine with which probability the process moves to which state. We consider the standard case of a first-order Markov chain, in which the transition probabilities depend exclusively on the current state. Moreover, the Markov chain is homogeneous, i.e. the transition probabilities are constant and do not change over time. The transition probabilities can therefore be visualized in a transition matrix.
The state distribution depends on the respective point in time. In some cases the state distribution (which contains the occupation probabilities of the states at a time n) converges to an equilibrium distribution, which is also called the stationary distribution. In the following sections you will learn, by means of an example, not only the criteria for existence and uniqueness of the stationary distribution, but also the analytical solution and how to carry out the statistical programming and simulation with the statistics software R.
Conditions for existence and uniqueness of the stationary distribution
The game character Pac-Man eats small pellets and cherries in a maze while being chased by ghosts. For simplicity, the game world in this example is a small 3×3 grid, and the monsters move purely at random. Each horizontally and vertically adjacent square is the ghost's next location with equal probability, with the exception of a secret passage between states 2 and 8. The transition graph shown below contains, as examples, the transition probabilities of states 1, 5, 6 and 8.
To provide variety, the starting location of the monsters is chosen at random, each square with the same probability. The initial vector on the state space Z = {1,2,3,4,5,6,7,8,9} is therefore the row vector π0 = 1/9 · (1,1,1,1,1,1,1,1,1). The i-th row and j-th column of the transition matrix P shown below contains the transition probability from the i-th to the j-th state. Entries with probability 0 have been removed for better readability:
At the beginning (at time 0) every state in this example is equally probable; the initial state distribution can be read directly from the initial vector. And what does the state distribution look like after one time unit? To answer this question, multiply the initial vector by the transition matrix P. For the second point in time, multiply the resulting row vector by the transition matrix P once more, and so on. In general, for time n you obtain the occupation probabilities of the individual states via the formula πn = π0 · P^n.
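This iteration is easy to carry out numerically; here is a minimal NumPy sketch on a generic two-state chain (an illustrative example, not the Pac-Man chain) showing how the state distribution πn settles down:

```python
import numpy as np

# Illustrative 2-state chain: the state distribution after n steps is
# pi_n = pi_0 * P^n, computed here by repeated vector-matrix products.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = np.array([0.5, 0.5])   # uniform start, like the example above
for n in range(50):
    pi = pi @ P             # one time step
print(pi)                   # converges to the stationary distribution (5/6, 1/6)
```

After a few dozen steps the vector no longer changes appreciably; solving pi = pi * P by hand for this chain indeed gives (5/6, 1/6).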
Under certain conditions the state distribution πn converges to the stationary distribution π. In our example with a finite state space, the Markov chain must be irreducible and aperiodic for this.
So what do the terms irreducibility and aperiodicity mean? For the type of Markov chain we consider, irreducibility holds if every state can be reached from every other state in finite time. Aperiodicity holds if the greatest common divisor of all return times to the original state is 1, and this must hold for every state. Let us check the two conditions:
Our Markov chain is irreducible, since the ghosts can move from any state to any state in finite time. Thanks to the secret passage, at most three state changes are needed for this. Because of the secret passage the Markov chain is also aperiodic: the monsters can move from any state to any state in both an even and an odd number of state changes, so the greatest common divisor of these return times is 1. Without the secret passage the Markov chain would be periodic, because a transition from an even state to an even state (or from an odd state to an odd state) would then only be possible in an even number of state changes, making the greatest common divisor 2.
The conditions for existence and uniqueness of the stationary distribution are therefore fulfilled. Because of irreducibility and aperiodicity there is exactly one stable stationary distribution, which the Markov chain assumes after an infinitely long time. That means the occupation probabilities of the individual states (almost) stop changing after a long time. For this reason the matrix powers converge as well. But how can you carry out the statistical programming and simulation of the stationary distribution with the statistics software R? You will find out in the following two sections of this article.
Mathematical calculation of the stationary distribution
As we have seen, a unique equilibrium distribution, also called the stationary distribution, exists. In this section you will learn how to calculate this distribution mathematically. This tells you with which probability the monsters stay in which states (or locations) in the long run. The stationary distribution of the Markov chain can be determined naively by plugging a large n into the equation πn = π0 · P^n, because the matrix powers converge due to irreducibility and aperiodicity, as we have already found out. To obtain an analytical solution, the linear system π = π · P has to be solved for π, under the constraint that the entries of π sum to 1, so that we actually obtain a probability distribution. Plugging the naive solution into this system then serves as a check.
The above system π = π · P without the constraint is equivalent to (P^T − I) · π^T = 0. The transition matrix is therefore transposed and the identity matrix subtracted. The sought vector of state probabilities is now a column vector. We thus have to solve a linear system which, including the constraint, has one equation more than the Markov chain has states. For a large number of states it is very tedious to compute the solution by hand, so we now carry out the statistical programming with the statistics software R. An alternative would be an MCMC simulation (MCMC stands for Markov chain Monte Carlo). A simulation is a sensible alternative if a stochastic process has, for example, so many states that the analytical calculation would be numerically too expensive. We dispense with it here, however, because our Markov chain has only 9 states.
How to compute the stationary distribution by statistical programming
First of all, we enter the transition matrix P into the statistics software R.
P = matrix(c(0, 1/2, 0, 1/2, 0, 0, 0, 0, 0,
             1/4, 0, 1/4, 0, 1/4, 0, 0, 1/4, 0,
             0, 1/2, 0, 0, 0, 1/2, 0, 0, 0,
             1/3, 0, 0, 0, 1/3, 0, 1/3, 0, 0,
             0, 1/4, 0, 1/4, 0, 1/4, 0, 1/4, 0,
             0, 0, 1/3, 0, 1/3, 0, 0, 0, 1/3,
             0, 0, 0, 1/2, 0, 0, 0, 1/2, 0,
             0, 1/4, 0, 0, 1/4, 0, 1/4, 0, 1/4,
             0, 0, 0, 0, 0, 1/2, 0, 1/2, 0),
           byrow = TRUE, ncol = 9)
P
A transition matrix contains the transition probabilities as entries, and these must take values between 0 and 1. Whether this holds can be checked for each entry of the matrix individually,
P >= 0 & P <= 1
or you can display at once whether all entries are plausible.
all(P >= 0 & P <= 1)
In addition, the transition probabilities leaving each state should sum to 1.
apply(P, MARGIN = 1, FUN = sum)
The transition matrix P is transposed and the identity matrix is subtracted.
P.T.minus.I = t(P) - diag(9)
Now it is important not to forget the constraint, since otherwise we obtain no result. The stationary distribution is a probability distribution, and as such the sum of the stationary distribution over all states must equal 1. We therefore append a row of ones to the matrix P.T.minus.I.
P.T.minus.I.mit.Nebenbedingung = rbind(P.T.minus.I, 1)
P.T.minus.I.mit.Nebenbedingung
On the other side of the system of equations stands the zero vector. Because of the constraint we have to append a one. To obtain a column vector, we use a matrix with one column as the data type.
Nullvektor = matrix(c(rep(0, 9), 1), ncol = 1)
Nullvektor
The linear system extended by the constraint is no longer square, but contains one more equation than it has variables. Therefore we cannot use the standard command solve; instead we use the command Solve (with a capital S) from the package limSolve, which we must first install. After installation we can load the package with library(limSolve).
We pass the matrix and the vector to the function and obtain:
Gleichgewichtsverteilung = Solve(P.T.minus.I.mit.Nebenbedingung, Nullvektor)
Gleichgewichtsverteilung
Finally, we check whether we have actually obtained a valid probability distribution:
Gleichgewichtsverteilung >= 0 & Gleichgewichtsverteilung <= 1
all(Gleichgewichtsverteilung >= 0 & Gleichgewichtsverteilung <= 1)
sum(Gleichgewichtsverteilung)
Solution of the stationary distribution
Solving the linear system with the statistics software R yields (values in percent): π = (7.7, 15.4, 7.7, 11.5, 15.4, 11.5, 7.7, 15.4, 7.7)
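The same computation can be cross-checked outside R. The following NumPy sketch solves the constrained system (P^T - I) π = 0 by least squares for the ghost's transition matrix:

```python
import numpy as np

# Transition matrix of the ghost's random walk on the 3x3 grid
# (states 1..9, row-major), including the secret passage between
# states 2 and 8; rows are taken from the matrix P above.
P = np.array([
    [0,   1/2, 0,   1/2, 0,   0,   0,   0,   0  ],
    [1/4, 0,   1/4, 0,   1/4, 0,   0,   1/4, 0  ],
    [0,   1/2, 0,   0,   0,   1/2, 0,   0,   0  ],
    [1/3, 0,   0,   0,   1/3, 0,   1/3, 0,   0  ],
    [0,   1/4, 0,   1/4, 0,   1/4, 0,   1/4, 0  ],
    [0,   0,   1/3, 0,   1/3, 0,   0,   0,   1/3],
    [0,   0,   0,   1/2, 0,   0,   0,   1/2, 0  ],
    [0,   1/4, 0,   0,   1/4, 0,   1/4, 0,   1/4],
    [0,   0,   0,   0,   0,   1/2, 0,   1/2, 0  ],
])

# Solve pi = pi * P subject to sum(pi) = 1, i.e. the overdetermined but
# consistent system (P^T - I) pi = 0 plus a row of ones, via least squares.
A = np.vstack([P.T - np.eye(9), np.ones(9)])
b = np.append(np.zeros(9), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.round(100 * pi, 1))
```

For a random walk on an undirected graph the stationary probability of a state is proportional to its degree; here the degrees are (2,4,2,3,4,3,2,4,2) with total 26, which reproduces the percentages above.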
The ghosts therefore spend the most time in the centre, less time at the edges, and the least time in the corners. An exception are the edge states 2 and 8, which, thanks to the secret passage, are visited on average just as often as the central square. The occupation probabilities of the states are proportional to the number of incoming arrows: the more one-step paths lead to a state, the more often that state is visited in the long run.
How do I use the Markov chain as a stochastic process in economics?
There are numerous applications of Markov chains in economics; one example is presented below. In stock trading one is often particularly interested in predicting how the stock market will develop. The situation on the stock exchange is divided into three main phases: bull market, bear market and stagnant market. The transition graph shown below contains the week-to-week transition probabilities between the three phases, where each phase always persists for at least one week. With the stationary distribution you can now calculate with which probability the stock market is in which state in the long run.
Applying the Markov chain methodology described above to this economic example yields the long-run result that the stock market spends 62.5% of the weeks in a bull market, 31.25% of the weeks in a bear market and 6.25% of the weeks in a stagnant market.
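The transition probabilities appear only in the (missing) figure; the matrix below is the classic bull/bear/stagnant example that reproduces exactly the quoted long-run shares, so we assume it here as a sketch:

```python
import numpy as np

# Assumed weekly transition matrix (rows: bull, bear, stagnant); this is
# the textbook example consistent with the long-run shares quoted above,
# not taken verbatim from the article's figure.
P = np.array([
    [0.90, 0.075, 0.025],
    [0.15, 0.80,  0.05 ],
    [0.25, 0.25,  0.50 ],
])

# Same recipe as before: solve (P^T - I) pi = 0 with sum(pi) = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.append(np.zeros(3), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)  # -> approximately [0.625, 0.3125, 0.0625]
```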
We hope that this article has brought you closer to the topic of Markov chains and the stationary distribution, and that you can use them in the future to solve mathematical problems or questions in a business context. Should you require professional support, the professional statisticians of Novustat are at your disposal for statistical programming or simulations.
The post "Die Markov Kette: Wie berechne ich die Gleichgewichtsverteilung?" first appeared on Statistik Service.
To leave a comment for the author, please follow the link and comment on their blog: R-Programming – Statistik Service.
Faster Than Excel: Painlessly Merge Data into Actuarial Loss Development Triangles with R
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
The objective of this post is to create a method which easily combines loss runs, or listings of insurance claims, into triangles. Using only Excel, the common method is to create links between the Excel files, which must be updated manually for each new evaluation. This is prone to human error and is time-consuming. Using a script to merge the files first and then create a triangle saves time and is more consistent.
Example of the conventional linked Excel triangle:
For accident year 2016, all the diagonal values need to be linked to a data file that was last evaluated in 2016. For 2015, the links go to the 2015 file; for 2014, the 2014 file, etc. This is time-consuming.
See Wikipedia for a definition of Loss Development Triangles and why they are useful.
Methods
I use the packages plyr, dplyr, tidyr, and lubridate to manipulate the data once it is loaded into R. The package readxl reads the Excel files from the Windows file directory. It is worth noting that all five of these packages are included in Hadley Wickham's tidyverse package. The package ChainLadder has a variety of tools for actuarial reserve analysis.
library(dplyr)
library(readxl)
library(plyr)
library(tidyr)
library(ChainLadder)
library(lubridate)
Step 1: Organize the Excel Files
Copy all of the Excel files into the same Windows folder, inside the working directory of the R project. It is important to have *only* the files that are going to be loaded into R in this folder.
On my computer, my file structure is as follows:
C:/Users/Sam/Desktop/[R project working directory]/[folder with excel claims listing files to aggregate]/[file 1, file 2, file 3, etc]
Using the command list.files(), R automatically looks in the working directory and returns a vector of all the file names which it sees. For example, R sees the data files in the current working directory:
wd_path = "C:/Users/Sam/Desktop/projects/Excel Aggregate Loop/excel files"
list.files(wd_path)
## [1] "claims listing 2011.xlsx" "claims listing 2012.xlsx"
## [3] "claims listing 2013.xlsx" "claims listing 2014.xlsx"
## [5] "claims listing 2015.xlsx" "claims listing 2016.xlsx"
R can then loop through each of these files and perform any action. If there were 100 files, or 1000 files, in this directory, this would still work.
file_names = list.files(wd_path)
file_paths = paste(wd_path, "/", file_names, sep = "")
head(read_excel(file_paths[4]) %>% select(3,4,5))
## # A tibble: 4 x 6
## name `license number` age `file year` `loss date` paid
##
## 1 jeff 3 23 2014 2011-01-01 400
## 2 sue 3 43 2014 2012-01-01 400
## 3 mark 2 55 2014 2013-01-01 200
## 4 sarah 1 100 2014 2014-01-01 500
In order to evaluate the age of the losses, we need to take into account when each loss was evaluated. This is accomplished by going into Excel and adding a column for "file year", which specifies the year of evaluation of the file. For instance, for the "claim listing 2013" file, all of the claims have "2013" in the "file year" column.
Step 2: Load the Data into R
Initialize a data frame which will store the aggregated loss run data from each of the Excel files. **The names of this data frame need to be the names of the Excel file columns which need to be aggregated.** For instance, these could be "reported", "Paid Loss", "Case Reserves", or "Incurred Loss". If the Excel files have different names for the same quantities (e.g., "Paid Loss" vs. "paid loss"), then they should be renamed within Excel first.
merged_data = as_data_frame(matrix(nrow = 0, ncol = 3))
names(merged_data) = c("file year", "loss date", "paid")
Someone once said "if you need to use a 'for' loop in R, then you are doing something wrong". Vectorized functions are faster and easier to implement. The function excel_file_extractor below takes in the file name of the Excel file and returns a data frame with only the columns which are selected.
excel_file_extractor = function(cur_file_name){
  read_excel(cur_file_name[1], sheet = "Sheet1", col_names = TRUE) %>%
    select(`file year`, `loss date`, paid) %>%
    rbind(merged_data)
}
Apply the function to all of the files in the folder that you created. Obviously, if you had 100 Excel files this would still work just as effectively.
From the plyr package, ldply takes in a list and returns a data frame. The way to remember this is by the ordering of the letters ("list"-"data frame"-"ply"). For example, if we wanted to read in a data frame and return a data frame, it would be ddply.
loss_run_data = ldply(file_paths, excel_file_extractor)
names(loss_run_data) = c("file_year", "loss_date", "paid losses")
The data now has only the columns that we selected, despite the fact that the loss run files had different columns in each of the files.
glimpse(loss_run_data)
## Variables: 3
## $ file_year 2011, 2012, 2012, 2013, 2013, 2013, 2014, 2014, 20...
## $ loss_date 2011-01-01, 2011-01-01, 2012-01-01, 2011-01-01, 2...
## $ paid losses 100, 200, 300, 300, 350, 100, 400, 400, 200, 500, ...
head(loss_run_data)
## file_year loss_date paid losses
## 1 2011 2011-01-01 100
## 2 2012 2011-01-01 200
## 3 2012 2012-01-01 300
## 4 2013 2011-01-01 300
## 5 2013 2012-01-01 350
## 6 2013 2013-01-01 100
Step 3: Create Development Triangles
Finally, once we have the loss runs combined, we just need to create a triangle. This is made easy by the `as.triangle` function from the `ChainLadder` package.
The only manual labor required in Excel was to go into each file and create the `file year` column, which is just the year of evaluation of each loss run file.
loss_run_data = loss_run_data %>%
  mutate(accident_year = as.numeric(year(ymd(loss_date))),
         maturity_in_months = (file_year - accident_year) * 12)
merged_triangle = as.triangle(loss_run_data,
                              dev = "maturity_in_months",
                              origin = "accident_year",
                              value = "paid losses")
merged_triangle
Within the package `ChainLadder` is a plot function, which shows the development of losses by accident year. Because these are arbitrary amounts, the plot is not realistic.
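For comparison, the same merge-then-pivot step can be sketched in Python with pandas, using the same toy sample as above (the column names here mirror the article's but are otherwise my own choices):

```python
import pandas as pd

# Merged loss-run data, mirroring the article's sample: the evaluation
# year of each file, the accident (loss) date, and paid losses.
loss_run = pd.DataFrame({
    "file_year": [2011, 2012, 2012, 2013, 2013, 2013],
    "loss_date": pd.to_datetime(
        ["2011-01-01", "2011-01-01", "2012-01-01",
         "2011-01-01", "2012-01-01", "2013-01-01"]),
    "paid": [100, 200, 300, 300, 350, 100],
})

# Derive accident year and maturity in months, as in the R code above.
loss_run["accident_year"] = loss_run["loss_date"].dt.year
loss_run["maturity"] = (loss_run["file_year"] - loss_run["accident_year"]) * 12

# Pivot into a development triangle: accident years as rows,
# maturity in months as columns, paid losses as values.
triangle = loss_run.pivot_table(
    index="accident_year", columns="maturity", values="paid", aggfunc="sum")
print(triangle)
```

Each evaluation file contributes one diagonal of the triangle, which is exactly the linkage that had to be maintained by hand in Excel.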
plot(merged_triangle,
     main = "Paid Losses vs Maturity by Accident Year",
     xlab = "Maturity in Months",
     ylab = "Paid Losses")
Conclusion
When it comes to aggregating Excel files, R can be faster and more consistent than linking together each of the Excel files, and once this script is set in place, modifications to the data can be made easily by editing the `excel_file_extractor` function.
Download the data and Rmd files for this article from my github account
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.
Deep Learning from first principles in Python, R and Octave – Part 1
(This article was first published on R – Giga thoughts …, and kindly contributed to Rbloggers)
“You don’t perceive objects as they are. You perceive them as you are.”
“Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”
“The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”
This is the first in a series of posts I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton, also on Coursera. In this post I implement Logistic Regression with a 2-layer Neural Network, i.e. a Neural Network that just has an input layer and an output layer, with no hidden layer. I am certain that any self-respecting Deep Learning practitioner would consider a Neural Network without hidden layers as no Neural Network at all!
This 2-layer network is implemented in the Python, R and Octave languages. I have included Octave in the mix, as Octave is a close cousin of Matlab. These implementations in Python, R and Octave are equivalent vectorized implementations, so if you are familiar with any one of the languages, you should be able to follow the corresponding code in the other two. You can download the R Markdown file and Octave code from DeepLearning Part 1.
To start with, Logistic Regression is performed using sklearn's logistic regression package on the cancer data set, also from sklearn. This is shown below.
1. Logistic Regression
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer
# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
# Call the Logistic Regression function
clf = LogisticRegression().fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96
To check on other classification algorithms, see my post Practical Machine Learning with R and Python – Part 2. You could also take a look at my book 'Practical Machine Learning with R and Python', available on Amazon, in which I implement the most popular Machine Learning algorithms in both R and Python.
2. Logistic Regression as a 2-layer Neural Network
In the following section Logistic Regression is implemented as a 2-layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in all three languages. This can be represented diagrammatically as below.
The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.
This simple 2 layer Neural Network is shown below
At the input layer there are 30 features \(x_1, \ldots, x_{30}\), with corresponding weights \(w_1, \ldots, w_{30}\) which are initialized to small random values. The linear output is
\[ z = w^{T}x + b \]
where 'b' is the bias term.
The activation function is the sigmoid function
\[ a = \sigma(z) = \frac{1}{1+e^{-z}} \]
The Loss, when the sigmoid function is used in the output layer, is given by
\[ L(a,y) = -\left(y \log a + (1-y)\log(1-a)\right) \qquad (1) \]
In the forward propagation cycle of the Neural Network, the output Z and the output of the activation function, the sigmoid function A, are first computed. Then, using the known target ‘y’ for the given features, the Loss is computed using equation (1) above.
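The forward pass just described can be sketched in a few lines of vectorized Python (a toy example with made-up data; the sigmoid and the loss follow equation (1) above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, X, Y):
    # Z = w^T X + b, one column per training example
    Z = w.T @ X + b
    A = sigmoid(Z)
    m = X.shape[1]
    # Loss from equation (1), averaged over the m examples
    loss = float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)
    return A, loss

# Made-up data: 3 features, 4 samples, weights initialized to zero
X = np.array([[0.5, -1.2, 0.3, 2.0],
              [1.0, 0.4, -0.7, 0.1],
              [-0.3, 0.8, 1.5, -0.6]])
Y = np.array([[0.0, 1.0, 1.0, 0.0]])
w = np.zeros((3, 1))
b = 0.0
A, loss = forward(w, b, X, Y)
# With zero weights every prediction is 0.5, so the loss is log(2) ≈ 0.6931
```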
Backward propagation

The backward propagation cycle determines how the Loss is impacted by small variations in the quantities of the previous layers, up to the input layer. In other words, backward propagation computes the changes in the weights at the input layer which will minimize the loss. Several cycles of gradient descent are performed along the path of steepest descent to find the local minimum; that is, the set of weights and biases at the input layer which results in the lowest loss is computed by gradient descent. In each cycle the weights at the input layer are decreased by the gradient scaled by a parameter known as the ‘learning rate’. Too big a learning rate can overshoot the local minimum, and too small a learning rate can take a long time to reach it. This is done over ‘m’ training examples.
Chain rule of differentiation

Let y = f(u)

and u = g(x), then

dy/dx = dy/du * du/dx

Derivative of sigmoid

Let σ(z) = 1/(1 + e^(-z)) = (1 + e^(-z))^(-1)

Using the chain rule of differentiation we get

dσ/dz = -(1 + e^(-z))^(-2) * (-e^(-z)) = e^(-z) / (1 + e^(-z))^2

Therefore

dσ/dz = σ(z)(1 - σ(z))   (2)
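Equation (2) can be checked numerically; this short Python snippet compares the analytic derivative σ(z)(1 - σ(z)) against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    # Analytic derivative from equation (2)
    s = sigmoid(z)
    return s * (1 - s)

# Central finite difference as an independent check
z = np.linspace(-3.0, 3.0, 7)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_err = float(np.max(np.abs(numeric - dsigmoid(z))))
# max_err is tiny: the two derivatives agree to better than 1e-8
```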
The 3 equations for the 2 layer Neural Network representation of Logistic Regression are

Z = w^T X + b   (a)

A = σ(Z)   (b)

L = -(Y log(A) + (1 - Y) log(1 - A))   (c)

The back propagation step requires the computation of dL/dw and dL/db. In the case of regression it would be dE/dw and dE/db, where E is the Mean Squared Error function.

Computing the derivatives for back propagation we have

dL/dA = -(Y/A - (1 - Y)/(1 - A))   (d)

because L = -(Y log(A) + (1 - Y) log(1 - A))

Also from equation (2) we get

dA/dZ = A(1 - A)   (e)

By chain rule

dL/dZ = dL/dA * dA/dZ

therefore substituting the results of (d) & (e) we get

dL/dZ = A - Y   (f)

Finally

dL/dw = dL/dZ * dZ/dw   (g)

dZ/dw = X   (h)

and from (f) we have dL/dZ = A - Y. Therefore (g) reduces to

dL/dw = X (A - Y)^T, which averaged over the m training examples gives dw = (1/m) X (A - Y)^T   (i)

Also

dL/db = dL/dZ * dZ/db, which gives db = (1/m) Σ (A - Y)   (j)

since dZ/db = 1.

Gradient descent updates the weights at the input layer and the corresponding bias by using the values of dL/dw and dL/db.
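Equations (f), (i) and (j) can likewise be verified numerically. The Python sketch below (made-up data) compares the analytic gradients dw = X(A - Y)^T / m and db = Σ(A - Y)/m against central finite differences of the loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b, X, Y):
    # Loss (c) averaged over the m training examples
    A = sigmoid(w.T @ X + b)
    m = X.shape[1]
    return float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)

# Made-up data: 3 features, 5 samples
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 5))
Y = (rng.random((1, 5)) > 0.5).astype(float)
w = 0.1 * rng.standard_normal((3, 1))
b = 0.2

# Analytic gradients from (f), (i) and (j)
m = X.shape[1]
A = sigmoid(w.T @ X + b)
dZ = A - Y
dw = X @ dZ.T / m
db = float(np.sum(dZ) / m)

# Central finite differences as an independent check
h = 1e-6
dw_num = np.zeros_like(w)
for i in range(w.shape[0]):
    wp, wm = w.copy(), w.copy()
    wp[i, 0] += h
    wm[i, 0] -= h
    dw_num[i, 0] = (loss_fn(wp, b, X, Y) - loss_fn(wm, b, X, Y)) / (2 * h)
db_num = (loss_fn(w, b + h, X, Y) - loss_fn(w, b - h, X, Y)) / (2 * h)
# dw_num and db_num match the analytic dw and db
```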
I found the computation graph representation in the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville very useful for visualizing and also computing the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below.
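Putting the derivation together, here is a minimal vectorized Python sketch of the 2 layer network (a reconstruction, not a verbatim listing from this post: it loads the cancer data directly from sklearn instead of a cancer.csv export, and mirrors the row normalization and hyperparameters of the R and Octave code below, so the exact accuracies may differ slightly from those runs):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def normalize(X):
    # Divide each sample (row) by its Euclidean norm, as the R/Octave code does
    return X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

def gradient_descent(w, b, X, Y, num_iterations, learning_rate):
    m = X.shape[1]
    for _ in range(num_iterations):
        # Forward pass: equations (a) and (b)
        A = sigmoid(w.T @ X + b)
        # Backward pass: gradients from (f), (i) and (j)
        dZ = A - Y
        dw = X @ dZ.T / m
        db = np.sum(dZ) / m
        # Gradient descent update
        w = w - learning_rate * dw
        b = b - learning_rate * db
    return w, b

def predict(w, b, X):
    # Threshold the sigmoid output at 0.5
    return (sigmoid(w.T @ X + b) > 0.5).astype(float)

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Shape the data as (features, numSamples) with targets as row vectors
Xtr, Xte = normalize(X_train).T, normalize(X_test).T
ytr, yte = y_train.reshape(1, -1), y_test.reshape(1, -1)

w, b = gradient_descent(np.zeros((Xtr.shape[0], 1)), 0.0, Xtr, ytr,
                        num_iterations=3000, learning_rate=0.75)
train_acc = 100 - np.mean(np.abs(predict(w, b, Xtr) - ytr)) * 100
test_acc = 100 - np.mean(np.abs(predict(w, b, Xte) - yte)) * 100
print(train_acc, test_acc)
```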
Note: It can be seen that the accuracy on the training and test set is 90.37% and 89.51% respectively. This is comparatively poorer than the 96% which the Logistic Regression of sklearn achieves! But this is mainly because of the absence of hidden layers, which are the real power of neural networks.
4. Neural Network for Logistic Regression – R code (vectorized)

source("RFunctions1.R")
# Define the sigmoid function
sigmoid <- function(z){
    a <- 1/(1 + exp(-z))
    a
}

# Compute the loss
computeLoss <- function(numTraining, Y, A){
    loss <- -1/numTraining * sum(Y*log(A) + (1-Y)*log(1-A))
    return(loss)
}

# Compute forward propagation
forwardPropagation <- function(w, b, X, Y){
    # Compute Z
    Z <- t(w) %*% X + b
    # Set the number of samples
    numTraining <- ncol(X)
    # Compute the activation function
    A <- sigmoid(Z)
    # Compute the loss
    loss <- computeLoss(numTraining, Y, A)
    # Compute the gradients dZ, dw and db
    dZ <- A - Y
    dw <- 1/numTraining * X %*% t(dZ)
    db <- 1/numTraining * sum(dZ)
    fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
    return(fwdProp)
}

# Perform one cycle of Gradient descent
gradientDescent <- function(w, b, X, Y, numIerations, learningRate){
    losses <- NULL
    idx <- NULL
    # Loop through the number of iterations
    for(i in 1:numIerations){
        fwdProp <- forwardPropagation(w, b, X, Y)
        # Get the derivatives
        dw <- fwdProp$dw
        db <- fwdProp$db
        # Perform gradient descent
        w <- w - learningRate*dw
        b <- b - learningRate*db
        l <- fwdProp$loss
        # Store the loss every 100 iterations
        if(i %% 100 == 0){
            idx <- c(idx, i)
            losses <- c(losses, l)
        }
    }
    # Return the weights and losses
    gradDescnt <- list("w"=w, "b"=b, "dw"=dw, "db"=db, "losses"=losses, "idx"=idx)
    return(gradDescnt)
}

# Compute the predicted value for input
predict <- function(w, b, X){
    m <- dim(X)[2]
    # Create a vector of 0's
    yPredicted <- matrix(rep(0, m), nrow=1, ncol=m)
    Z <- t(w) %*% X + b
    # Compute sigmoid
    A <- sigmoid(Z)
    for(i in 1:dim(A)[2]){
        # If A > 0.5 set value as 1, else set as 0
        if(A[1, i] > 0.5)
            yPredicted[1, i] <- 1
        else
            yPredicted[1, i] <- 0
    }
    return(yPredicted)
}

# Normalize the matrix
normalize <- function(x){
    # Compute the Frobenius norm of each row of the matrix
    n <- as.matrix(sqrt(rowSums(x^2)))
    # Sweep the rows by the norm. Note: the '1' means the sweep is performed on every row
    normalized <- sweep(x, 1, n, FUN="/")
    return(normalized)
}

# Run the 2 layer Neural Network on the cancer data set
# Read the data (from sklearn)
cancer <- read.csv("cancer.csv")
# Rename the target variable
names(cancer) <- c(seq(1, 30), "output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer, trainPercent=75, seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]
# Set the features
X_train <- train[, 1:30]
y_train <- train[, 31]
X_test <- test[, 1:30]
y_test <- test[, 31]
# Create a matrix of 0's with the number of features
w <- matrix(rep(0, dim(X_train)[2]))
b <- 0
X_train1 <- normalize(X_train)
X_train2 <- t(X_train1) # Normalize then transpose
y_train1 <- as.matrix(y_train)
y_train2 <- t(y_train1)
# Perform gradient descent
gradDescent <- gradientDescent(w, b, X_train2, y_train2, numIerations=3000, learningRate=0.77)
# Normalize X_test
X_test1 <- normalize(X_test)
# Transpose X_test so that we have a matrix as (features, numSamples)
X_test2 <- t(X_test1)
# Reshape y_test and take the transpose
y_test1 <- as.matrix(y_test)
y_test2 <- t(y_test1)
# Use the values of the weights generated from Gradient Descent
yPredictionTest <- predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain <- predict(gradDescent$w, gradDescent$b, X_train2)
sprintf("Train accuracy: %f", (100 - mean(abs(yPredictionTrain - y_train2)) * 100))
## [1] "Train accuracy: 90.845070"
sprintf("test accuracy: %f", (100 - mean(abs(yPredictionTest - y_test2)) * 100))
## [1] "test accuracy: 87.323944"
df <- data.frame(gradDescent$idx, gradDescent$losses)
names(df) <- c("iterations", "losses")
ggplot(df, aes(x=iterations, y=losses)) + geom_point() + geom_line(col="blue") +
    ggtitle("Gradient Descent - Losses vs No of Iterations") +
    xlab("No of iterations") + ylab("Losses")

5. Neural Network for Logistic Regression – Octave code (vectorized)
1;
# Define sigmoid function
function a = sigmoid(z)
a = 1 ./ (1 + exp(-z));
end
# Compute the loss
function loss=computeLoss(numtraining,Y,A)
loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
end
# Perform forward propagation
function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
% Compute Z
Z = w' * X + b;
numtraining = size(X)(1,2);
# Compute sigmoid
A = sigmoid(Z);
#Compute loss. Note this is element wise product
loss =computeLoss(numtraining,Y,A);
# Compute the gradients dZ, dw and db
dZ = A - Y;
dw = 1/numtraining* X * dZ';
db =1/numtraining*sum(dZ);
end
# Compute Gradient Descent
function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
#Initialize losses and idx
losses=[];
index=[];
# Loop through the number of iterations
for i=1:numIerations,
[loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
# Perform Gradient descent
w = w - learningRate*dw;
b = b - learningRate*db;
if(mod(i,100) ==0)
# Append index and loss
index = [index i];
losses = [losses loss];
endif
end
end
# Determine the predicted value for dataset
function yPredicted = predict(w,b,X)
m = size(X)(1,2);
yPredicted=zeros(1,m);
# Compute Z
Z = w' * X + b;
# Compute sigmoid
A = sigmoid(Z);
for i=1:size(X)(1,2),
# Set predicted as 1 if A >= 0.5
if(A(1,i) >= 0.5)
yPredicted(1,i)=1;
else
yPredicted(1,i)=0;
endif
end
end
# Normalize by dividing each value by the sum of squares
function normalized = normalize(x)
# Compute Frobenius norm. Square the elements, sum rows and then find square root
a = sqrt(sum(x .^ 2,2));
# Perform element wise division
normalized = x ./ a;
end
# Split into train and test sets
function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
# Create a random index
ix = randperm(length(dataset));
# Split into training
trainSize = floor(trainPercent/100 * length(dataset));
train=dataset(ix(1:trainSize),:);
# And test
test=dataset(ix(trainSize+1:length(dataset)),:);
X_train = train(:,1:30);
y_train = train(:,31);
X_test = test(:,1:30);
y_test = test(:,31);
end
cancer=csvread("cancer.csv");
[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
w=zeros(size(X_train)(1,2),1);
b=0;
X_train1=normalize(X_train);
X_train2=X_train1';
y_train1=y_train';
[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);
# Normalize X_test
X_test1=normalize(X_test);
#Transpose X_train so that we have a matrix as (features, numSamples)
X_test2=X_test1';
y_test1=y_test';
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(w1, b1, X_test2);
yPredictionTrain = predict(w1, b1, X_train2);
trainAccuracy = 100 - mean(abs(yPredictionTrain - y_train1))*100
testAccuracy = 100 - mean(abs(yPredictionTest - y_test1))*100
trainAccuracy = 90.845
testAccuracy = 89.510
graphics_toolkit('gnuplot')
plot(idx,losses);
title ('Gradient descent Cost vs No of iterations');
xlabel ("No of iterations");
ylabel ("Cost");
Conclusion
This post started with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is poorer than that of sklearn's highly optimized Logistic Regression. This is because the above neural network did not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers.
The Deep Learning journey has begun… Don’t miss the bus!
Stay tuned for more interesting posts in Deep Learning!!
References
1. Deep Learning Specialization
2. Neural Networks for Machine Learning
3. Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
4. Neural Networks: The mechanics of backpropagation
5. Machine Learning
Also see
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2. Simplifying Machine Learning: Bias, Variance, regularization and odd facts – Part 4
3. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon
4. Practical Machine Learning with R and Python – Part 4
5. Introducing QCSimulator: A 5qubit quantum computing simulator in R
6. A Bluemix recipe with MongoDB and Node.js
7. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)
To see all posts check Index of posts
To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….