R news and tutorials contributed by hundreds of R bloggers

### Diffusion/Wiener Model Analysis with brms – Part II: Model Diagnostics and Model Fit

Sun, 01/07/2018 - 21:08

(This article was first published on Computational Psychology - Henrik Singmann, and kindly contributed to R-bloggers)

This is the considerably belated second part of my blog series on fitting diffusion models (or better, the 4-parameter Wiener model) with brms. The first part discusses how to set up the data and model. This second part is concerned with perhaps the most important steps in any model-based data analysis: model diagnostics and the assessment of model fit. Note that the code in this part is completely self-sufficient and can be run without running the code of part I.

Setup

First, we load quite a few packages that we will need along the way. Obviously brms, but also some of the packages from the tidyverse (i.e., dplyr, tidyr, tibble, and ggplot2). It took me a little time to jump on the tidyverse bandwagon, but now that I use it more and more I cannot deny its utility. If your data can be made ‘tidy’, the coherent set of tools offered by the tidyverse makes many seemingly complicated tasks pretty easy. A few examples of this will be shown below. If you need more introduction, I highly recommend the awesome ‘R for Data Science’ book by Grolemund and Wickham, which they made available for free! We also need gridExtra for combining plots and DescTools for the concordance correlation coefficient (CCC) used below.

library("brms")
library("dplyr")
library("tidyr")
library("tibble")    # for rownames_to_column
library("ggplot2")
library("gridExtra") # for grid.arrange
library("DescTools") # for CCC

As in part I, we need package rtdists for the data.

data(speed_acc, package = "rtdists")
speed_acc <- droplevels(speed_acc[!speed_acc$censor, ]) # remove extreme RTs
speed_acc <- droplevels(speed_acc[speed_acc$frequency %in% c("high", "nw_high"), ])
speed_acc$response2 <- as.numeric(speed_acc$response) - 1

I have uploaded the binary R data file containing the fitted model object as well as the generated posterior predictive distributions to github, from which we can download them directly into R. Note that I needed to go the way via a temporary folder. If there is a way without that I would be happy to learn about it.
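For the record, the tempfile-based pattern looks roughly like this (the URL below is a placeholder, not the actual repository path; the object names follow the text above):

```r
# Sketch of downloading a binary .rda file from GitHub via a temporary
# file and loading it into the session. The URL is a placeholder.
tmp <- tempfile(fileext = ".rda")
download.file("https://github.com/<user>/<repo>/raw/master/fit_wiener.rda",
              tmp, mode = "wb")   # mode = "wb" matters for binary files
load(tmp)     # creates fit_wiener (and similarly pred_wiener) in the workspace
unlink(tmp)   # remove the temporary file again
```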

We already know from part I that there are a few divergent transitions. If this were a real analysis, we would therefore not be satisfied with the current fit and would try to rerun brm with an increased adapt_delta in the hope that this removes the divergent transitions. The Stan warning guidelines clearly state that “the validity of the estimates is not guaranteed if there are post-warmup divergences”. However, it is unclear what the actual impact of the small number of divergent transitions (< 10) observed here is on the posterior. It is also unclear what one can do if adapt_delta cannot be increased any further and the model cannot be reparameterized. Should all fits with any divergent transitions be completely disregarded? I hope the Stan team provides more guidance on such questions in the future.

Coming back to our fit, as a first step in our model diagnostics we check the R-hat statistic as well as the number of effective samples. Specifically, we look at the parameters with the largest R-hat values and the lowest numbers of effective samples.

tail(sort(rstan::summary(fit_wiener$fit)$summary[, "Rhat"]))
# sd_id__conditionaccuracy:frequencyhigh                      1.00
# r_id__bs[15,conditionaccuracy]                              1.00
# b_bias_conditionaccuracy                                    1.00
# cor_id__conditionspeed:frequencyhigh__ndt_conditionaccuracy 1.00
# sd_id__ndt_conditionspeed                                   1.00
# cor_id__conditionspeed:frequencynw_high__bs_conditionspeed  1.01

head(sort(rstan::summary(fit_wiener$fit)$summary[, "n_eff"]))
# lp__                                     462
# b_conditionaccuracy:frequencyhigh        588
# sd_id__ndt_conditionspeed                601
# sd_id__conditionspeed:frequencyhigh      646
# b_conditionspeed:frequencyhigh           695
# r_id[12,conditionaccuracy:frequencyhigh] 712

Both are unproblematic (i.e., R-hat < 1.05 and n_eff > 100) and suggest that the sampler has converged on the stationary distribution. If anyone has a similar one-liner to return the number of divergent transitions, I would be happy to learn about it.
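For what it is worth, one candidate one-liner (my own suggestion, not part of the original analysis) extracts the per-chain sampler diagnostics, which contain a divergent__ column, via rstan::get_sampler_params():

```r
# Count post-warmup divergent transitions across all chains; the
# "divergent__" column is 1 for a divergent transition, 0 otherwise.
sum(sapply(rstan::get_sampler_params(fit_wiener$fit, inc_warmup = FALSE),
           function(chain) chain[, "divergent__"]))
```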

We also visually inspect the chain behavior of a few semi-randomly selected parameters.

pars <- parnames(fit_wiener)
pars_sel <- c(sample(pars[1:10], 3), sample(pars[-(1:10)], 3))
plot(fit_wiener, pars = pars_sel, N = 6,
     ask = FALSE, exact_match = TRUE, newpage = TRUE, plot = TRUE)

This visual inspection confirms the earlier conclusion. For all parameters the posteriors look well-behaved and the chains appear to mix well.

Finally, in the literature there are some discussions about parameter trade-offs for the diffusion and related models. These trade-offs supposedly make fitting the diffusion model in a Bayesian setting particularly complicated. To investigate whether fitting the Wiener model with HMC as implemented in Stan (i.e., NUTS) also shows this pattern we take a look at the joint posterior of the fixed-effects of the main Wiener parameters for the accuracy condition. For this we use the stanfit method of the pairs function and set the condition to "divergent__". This plots the few divergent transitions above the diagonal and the remaining samples below the diagonal.

pairs(fit_wiener$fit, pars = pars[c(1, 3, 5, 7, 9)], condition = "divergent__")

This plot shows some correlations, but nothing too dramatic. HMC appears to sample quite efficiently from the Wiener model. Next, we also take a look at the correlations across all parameters (not only the fixed effects).

posterior <- as.mcmc(fit_wiener, combine_chains = TRUE)
cor_posterior <- cor(posterior)
cor_posterior[lower.tri(cor_posterior, diag = TRUE)] <- NA
cor_long <- as.data.frame(as.table(cor_posterior))
cor_long <- na.omit(cor_long)
tail(cor_long[order(abs(cor_long$Freq)), ], 10)
#                              Var1                         Var2   Freq
# 43432        b_ndt_conditionspeed  r_id__ndt[1,conditionspeed] -0.980
# 45972 r_id__ndt[4,conditionspeed] r_id__ndt[11,conditionspeed]  0.982
# 46972        b_ndt_conditionspeed r_id__ndt[16,conditionspeed] -0.982
# 44612        b_ndt_conditionspeed  r_id__ndt[6,conditionspeed] -0.983
# 46264        b_ndt_conditionspeed r_id__ndt[13,conditionspeed] -0.983
# 45320        b_ndt_conditionspeed  r_id__ndt[9,conditionspeed] -0.984
# 45556        b_ndt_conditionspeed r_id__ndt[10,conditionspeed] -0.985
# 46736        b_ndt_conditionspeed r_id__ndt[15,conditionspeed] -0.985
# 44140        b_ndt_conditionspeed  r_id__ndt[4,conditionspeed] -0.990
# 45792        b_ndt_conditionspeed r_id__ndt[11,conditionspeed] -0.991

This table lists the ten largest absolute correlations among the posteriors for all pairwise combinations of parameters. The value in column Freq is, somewhat unintuitively, the observed correlation among the posteriors of the two parameters listed in the previous two columns. To create this table I used a trick from SO involving as.table, which is responsible for the column containing the correlation values being labeled Freq.

What the table shows are some extreme correlations for the individual-level deviations (the first index in the square brackets of the parameter names appears to be the participant number). Let us visualize these correlations as well.

pairs(fit_wiener$fit,
      pars = c("b_ndt_conditionspeed",
               "r_id__ndt[11,conditionspeed]",
               "r_id__ndt[4,conditionspeed]"),
      condition = "divergent__")

This plot shows that some of the individual-level parameters are not well estimated. However, overall these extreme correlations appear rather rarely.

hist(cor_long$Freq, breaks = 40)

Overall the model diagnostics do not show any particularly worrying behavior (with the exception of the divergent transitions). We have learned that a few of the individual-level estimates for some of the parameters are not very trustworthy. However, this does not disqualify the overall fit. The main take away from this fact is that we would need to be careful in interpreting the individual-level estimates. Thus, we assume the fit is okay and continue with the next step of the analysis.

Assessing Model Fit

We will now investigate the model fit. That is, we will investigate whether the model provides an adequate description of the observed data. We will mostly do so via graphical checks. To do so, we need to prepare the posterior predictive distribution and the data. As a first step, we combine the posterior predictive distributions with the data.

d_speed_acc <- as_tibble(cbind(speed_acc, as_tibble(t(pred_wiener))))

Then we calculate three important measures (or test statistics T()) on the individual level for each cell of the design (i.e., combination of condition and frequency factors):

• Probability of giving an upper boundary response (i.e., respond “nonword”).
• Median RTs for responses to the upper boundary.
• Median RTs for the lower boundary.

We first calculate this for each sample of the posterior predictive distribution. We then summarize these three measures by calculating the median and some additional quantiles across the posterior predictive distribution. We calculate all of this in one step using a somewhat long combination of dplyr and tidyr magic.

d_speed_acc_agg <- d_speed_acc %>% 
  group_by(id, condition, frequency) %>%   # select grouping vars
  summarise_at(
    .vars = vars(starts_with("V")), 
    funs(prob.upper  = mean(. > 0),
         medrt.lower = median(abs(.[. < 0])),
         medrt.upper = median(.[. > 0])
    )) %>% 
  ungroup %>% 
  gather("key", "value", -id, -condition, -frequency) %>%  # remove grouping vars
  separate("key", c("rep", "measure"), sep = "_") %>% 
  spread(measure, value) %>% 
  group_by(id, condition, frequency) %>%   # select grouping vars
  summarise_at(
    .vars = vars(prob.upper, medrt.lower, medrt.upper), 
    .funs = funs(median = median(., na.rm = TRUE),
                 llll = quantile(., probs = 0.01,  na.rm = TRUE),
                 lll  = quantile(., probs = 0.025, na.rm = TRUE),
                 ll   = quantile(., probs = 0.1,   na.rm = TRUE),
                 l    = quantile(., probs = 0.25,  na.rm = TRUE),
                 h    = quantile(., probs = 0.75,  na.rm = TRUE),
                 hh   = quantile(., probs = 0.9,   na.rm = TRUE),
                 hhh  = quantile(., probs = 0.975, na.rm = TRUE),
                 hhhh = quantile(., probs = 0.99,  na.rm = TRUE)
    ))

Next, we calculate the three measures also for the data and combine it with the results from the posterior predictive distribution in one data.frame using left_join.

speed_acc_agg <- speed_acc %>% 
  group_by(id, condition, frequency) %>%   # select grouping vars
  summarise(prob.upper  = mean(response == "nonword"),
            medrt.upper = median(rt[response == "nonword"]),
            medrt.lower = median(rt[response == "word"])
  ) %>% 
  ungroup %>% 
  left_join(d_speed_acc_agg)

Aggregated Model-Fit

The first important question is whether our model can adequately describe the overall patterns in the data aggregated across participants. For this we simply aggregate the results obtained in the previous step (i.e., the summary results from the posterior predictive distribution as well as the test statistics from the data) using mean.

d_speed_acc_agg2 <- speed_acc_agg %>% 
  group_by(condition, frequency) %>% 
  summarise_if(is.numeric, mean, na.rm = TRUE) %>% 
  ungroup

We then use these summaries and plot predictions (in grey and black) as well as data (in red) for the three measures. The inner (fat) error bars show the 80% credibility intervals (CIs), the outer (thin) error bars show the 95% CIs. The black circle shows the median of the posterior predictive distributions.

new_x <- with(d_speed_acc_agg2, 
              paste(rep(levels(condition), each = 2), 
                    levels(frequency), sep = "\n"))

p1 <- ggplot(d_speed_acc_agg2, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = prob.upper_lll, ymax = prob.upper_hhh), 
                 col = "darkgrey") + 
  geom_linerange(aes(ymin = prob.upper_ll, ymax = prob.upper_hh), 
                 size = 2, col = "grey") + 
  geom_point(aes(y = prob.upper_median), shape = 1) +
  geom_point(aes(y = prob.upper), shape = 4, col = "red") +
  ggtitle("Response Probabilities") + 
  ylab("Probability of upper response") + xlab("") +
  scale_x_discrete(labels = new_x)

p2 <- ggplot(d_speed_acc_agg2, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = medrt.upper_lll, ymax = medrt.upper_hhh), 
                 col = "darkgrey") + 
  geom_linerange(aes(ymin = medrt.upper_ll, ymax = medrt.upper_hh), 
                 size = 2, col = "grey") + 
  geom_point(aes(y = medrt.upper_median), shape = 1) +
  geom_point(aes(y = medrt.upper), shape = 4, col = "red") +
  ggtitle("Median RTs upper") + 
  ylab("RT (s)") + xlab("") +
  scale_x_discrete(labels = new_x)

p3 <- ggplot(d_speed_acc_agg2, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = medrt.lower_lll, ymax = medrt.lower_hhh), 
                 col = "darkgrey") + 
  geom_linerange(aes(ymin = medrt.lower_ll, ymax = medrt.lower_hh), 
                 size = 2, col = "grey") + 
  geom_point(aes(y = medrt.lower_median), shape = 1) +
  geom_point(aes(y = medrt.lower), shape = 4, col = "red") +
  ggtitle("Median RTs lower") + 
  ylab("RT (s)") + xlab("") +
  scale_x_discrete(labels = new_x)

grid.arrange(p1, p2, p3, ncol = 2)

Inspection of the plots shows no dramatic misfit. Overall, the model appears to be able to describe the general patterns in the data. Only the response probabilities for words (i.e., frequency = high) appear to be estimated too low: the red x appears to fall outside the 80% CIs and possibly also outside the 95% CIs.

The plots of the RTs show an interesting (but not surprising) pattern. The posterior predictive distributions for the rare responses (i.e., “word” responses for upper/non-word stimuli and “nonword” response to lower/word stimuli) are relatively wide. In contrast, the posterior predictive distributions for the common responses are relatively narrow. In each case, the observed median is inside the 80% CI and also quite near to the predicted median.

Individual-Level Fit

To investigate the pattern of predicted response probabilities further, we take a look at them on the individual level. We again plot the response probabilities in the same way as above, but separated by participant id.

ggplot(speed_acc_agg, aes(x = condition:frequency)) +
  geom_linerange(aes(ymin = prob.upper_lll, ymax = prob.upper_hhh), 
                 col = "darkgrey") + 
  geom_linerange(aes(ymin = prob.upper_ll, ymax = prob.upper_hh), 
                 size = 2, col = "grey") + 
  geom_point(aes(y = prob.upper_median), shape = 1) +
  geom_point(aes(y = prob.upper), shape = 4, col = "red") +
  facet_wrap(~id, ncol = 3) +
  ggtitle("Predicted (in grey) and observed (red) response probabilities by ID") + 
  ylab("Probability of upper response") + xlab("") +
  scale_x_discrete(labels = new_x)

This plot shows a similar pattern as the aggregated data. For none of the participants do we observe dramatic misfit. Furthermore, response probabilities to non-word stimuli appear to be predicted rather well. In contrast, response probabilities for word-stimuli are overall predicted to be lower than observed. However, this misfit does not seem to be too strong.

As a next step we look at the coverage probabilities of our three measures across individuals. That is, we calculate for each of the measures, for each of the cells of the design, and for each of the CIs (i.e., 50%, 80%, 95%, and 99%), the proportion of participants for which the observed test statistics falls into the corresponding CI.

speed_acc_agg %>% 
  mutate(prob.upper_99 = (prob.upper >= prob.upper_llll) & 
           (prob.upper <= prob.upper_hhhh),
         prob.upper_95 = (prob.upper >= prob.upper_lll) & 
           (prob.upper <= prob.upper_hhh),
         prob.upper_80 = (prob.upper >= prob.upper_ll) & 
           (prob.upper <= prob.upper_hh),
         prob.upper_50 = (prob.upper >= prob.upper_l) & 
           (prob.upper <= prob.upper_h),
         medrt.upper_99 = (medrt.upper > medrt.upper_llll) & 
           (medrt.upper < medrt.upper_hhhh),
         medrt.upper_95 = (medrt.upper > medrt.upper_lll) & 
           (medrt.upper < medrt.upper_hhh),
         medrt.upper_80 = (medrt.upper > medrt.upper_ll) & 
           (medrt.upper < medrt.upper_hh),
         medrt.upper_50 = (medrt.upper > medrt.upper_l) & 
           (medrt.upper < medrt.upper_h),
         medrt.lower_99 = (medrt.lower > medrt.lower_llll) & 
           (medrt.lower < medrt.lower_hhhh),
         medrt.lower_95 = (medrt.lower > medrt.lower_lll) & 
           (medrt.lower < medrt.lower_hhh),
         medrt.lower_80 = (medrt.lower > medrt.lower_ll) & 
           (medrt.lower < medrt.lower_hh),
         medrt.lower_50 = (medrt.lower > medrt.lower_l) & 
           (medrt.lower < medrt.lower_h)
  ) %>% 
  group_by(condition, frequency) %>%  ## grouping factors without id
  summarise_at(vars(matches("\\d")), mean, na.rm = TRUE) %>% 
  gather("key", "mean", -condition, -frequency) %>% 
  separate("key", c("measure", "ci"), "_") %>% 
  spread(ci, mean) %>% 
  as.data.frame()
#    condition frequency     measure    50     80    95    99
# 1   accuracy      high medrt.lower 0.706 0.8824 0.882 1.000
# 2   accuracy      high medrt.upper 0.500 0.8333 1.000 1.000
# 3   accuracy      high  prob.upper 0.529 0.7059 0.765 0.882
# 4   accuracy   nw_high medrt.lower 0.500 0.8125 0.938 0.938
# 5   accuracy   nw_high medrt.upper 0.529 0.8235 1.000 1.000
# 6   accuracy   nw_high  prob.upper 0.529 0.8235 0.941 0.941
# 7      speed      high medrt.lower 0.471 0.8824 0.941 1.000
# 8      speed      high medrt.upper 0.706 0.9412 1.000 1.000
# 9      speed      high  prob.upper 0.000 0.0588 0.588 0.647
# 10     speed   nw_high medrt.lower 0.706 0.8824 0.941 0.941
# 11     speed   nw_high medrt.upper 0.471 0.7647 1.000 1.000
# 12     speed   nw_high  prob.upper 0.235 0.6471 0.941 1.000

As can be seen, for the RTs, the coverage probability is generally in line with the width of the CIs or even above it. Furthermore, for the common response (i.e., upper for frequency = nw_high and lower for frequency = high), the coverage probability is 1 for the 99% CIs in all cases.

Unfortunately, for the response probabilities the coverage is not that great, especially in the speed condition and for the tighter CIs. For the wide CIs, however, the coverage probabilities are at least acceptable. Overall, the results so far suggest that the model provides an adequate account. There are some misfits that should be kept in mind if one is interested in extending the model or fitting it to new data, but overall it provides a satisfactory account.

QQ-plots: RTs

The final approach for assessing the fit of the model is based on more quantiles of the RT distribution (so far we have only looked at the .5 quantile, the median). We will then plot individual observed versus predicted (i.e., mean from the posterior predictive distribution) quantiles across conditions. For this we first calculate the quantiles per sample from the posterior predictive distribution and then aggregate across the samples. This is achieved via dplyr::summarise_at using a list column and tidyr::unnest to unstack the columns (see section 25.3 in ‘R for Data Science’). We then combine the aggregated predicted RT quantiles with the observed RT quantiles.

quantiles <- c(0.1, 0.25, 0.5, 0.75, 0.9)

pp2 <- d_speed_acc %>% 
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise_at(
    .vars = vars(starts_with("V")), 
    funs(lower = list(rownames_to_column(
           data.frame(q = quantile(abs(.[. < 0]), probs = quantiles)))),
         upper = list(rownames_to_column(
           data.frame(q = quantile(.[. > 0], probs = quantiles))))
    )) %>% 
  ungroup %>% 
  gather("key", "value", -id, -condition, -frequency) %>%  # remove grouping vars
  separate("key", c("rep", "boundary"), sep = "_") %>% 
  unnest(value) %>% 
  group_by(id, condition, frequency, boundary, rowname) %>%  # grouping vars + new vars
  summarise(predicted = mean(q, na.rm = TRUE))

rt_pp <- speed_acc %>% 
  group_by(id, condition, frequency) %>%  # select grouping vars
  summarise(lower = list(rownames_to_column(
              data.frame(observed = quantile(rt[response == "word"], 
                                             probs = quantiles)))),
            upper = list(rownames_to_column(
              data.frame(observed = quantile(rt[response == "nonword"], 
                                             probs = quantiles))))
  ) %>% 
  ungroup %>% 
  gather("boundary", "value", -id, -condition, -frequency) %>% 
  unnest(value) %>% 
  left_join(pp2)

To evaluate the agreement between observed and predicted quantiles we calculate, for each cell and quantile, the concordance correlation coefficient (CCC; e.g., Barchard, 2012, Psych. Methods). The CCC is a measure of absolute agreement between two values and is thus better suited than a simple correlation. It is scaled from -1 to 1, where 1 represents perfect agreement, 0 no relationship, and -1 a correlation of -1 combined with identical means and variances of the two variables.
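To make the measure concrete, here is a minimal hand-rolled version of Lin's CCC (an illustration only; DescTools::CCC may differ slightly, e.g. in its variance denominator, so treat it as a sketch rather than a drop-in replacement):

```r
# CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
# Agreement is penalized both for imperfect correlation and for
# differences in location and scale between the two variables.
ccc <- function(x, y) {
  2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
}

set.seed(1)
x <- rnorm(100)
ccc(x, x)        # perfect agreement: exactly 1
ccc(x, x + 0.5)  # perfectly correlated but shifted: less than 1
```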

The following code produces QQ-plots for each condition and quantile separately for responses to the upper boundary and lower boundary. The value in the upper left of each plot gives the CCC measures of absolute agreement.

plot_text <- rt_pp %>% 
  group_by(condition, frequency, rowname, boundary) %>% 
  summarise(ccc = format(
    CCC(observed, predicted, na.rm = TRUE)$rho.c$est, 
    digits = 2))

p_upper <- rt_pp %>% 
  filter(boundary == "upper") %>% 
  ggplot(aes(x = observed, y = predicted)) +
  geom_abline(slope = 1, intercept = 0) +
  geom_point() +
  facet_grid(condition + frequency ~ rowname) + 
  geom_text(data = plot_text[plot_text$boundary == "upper", ],
            aes(x = 0.5, y = 1.8, label = ccc), 
            parse = TRUE, inherit.aes = FALSE) +
  coord_fixed() +
  ggtitle("Upper responses") +
  theme_bw()

p_lower <- rt_pp %>% 
  filter(boundary == "lower") %>% 
  ggplot(aes(x = observed, y = predicted)) +
  geom_abline(slope = 1, intercept = 0) +
  geom_point() +
  facet_grid(condition + frequency ~ rowname) + 
  geom_text(data = plot_text[plot_text$boundary == "lower", ],
            aes(x = 0.5, y = 1.6, label = ccc), 
            parse = TRUE, inherit.aes = FALSE) +
  coord_fixed() +
  ggtitle("Lower responses") +
  theme_bw()

grid.arrange(p_upper, p_lower, ncol = 1)

Results show that overall the fit is better for the accuracy than the speed conditions. Furthermore, fit is better for the common response (i.e., nw_high for upper and high for lower responses). This latter observation is again not too surprising.

When comparing the fit for the different quantiles it appears that at least the median (i.e., 50%) shows acceptable values for the common response. However, especially in the speed condition the account of the other quantiles is not great. Nevertheless, dramatic misfit is only observed for the rare responses.

One possibility for some of the low CCCs in the speed conditions may be the comparatively low variances in some of the cells. For example, for both common cells in the speed condition (i.e., speed & nw_high for upper responses and speed & high for lower responses), visual inspection of the plot suggests an acceptable account while at the same time some CCC values are low (i.e., < .5). Only for the 90% quantile in the speed conditions (and somewhat less so for the 75% quantile) do we see some systematic deviations: the model predicts lower RTs than observed.

Taken together, the model appears to provide an at least acceptable account. The only slightly worrying patterns are (a) that the model predicts slightly better performance for the word stimuli than observed (i.e., a lower predicted rate of non-word responses for word stimuli than observed) and (b) that in the speed conditions the model predicts somewhat shorter RTs for the 75% and 90% quantiles than observed.

The next step will be to look at differences between parameters as a function of the speed-accuracy condition. However, this will be the topic of the third blog post. I am hopeful it will not take two months this time.


To leave a comment for the author, please follow the link and comment on their blog: Computational Psychology - Henrik Singmann. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Anscombe’s Quartet: 1980’s Edition

Sun, 01/07/2018 - 19:57

(This article was first published on Method Matters, and kindly contributed to R-bloggers)

In this post, I’ll describe a fun visualization of Anscombe’s quartet I whipped up recently.

If you aren’t familiar with Anscombe’s quartet, here’s a brief description from its Wikipedia entry: “Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. He described the article as being intended to counter the impression among statisticians that ‘numerical calculations are exact, but graphs are rough.’”

In essence, there are 4 different datasets with quite different patterns in the data. Fitting a linear regression model through each dataset yields (nearly) identical regression coefficients, while graphing the data makes it clear that the underlying patterns are very different. What’s amazing to me is how these simple data sets (and accompanying graphs) make immediately intuitive the importance of data visualization, and drive home the point of how well-constructed graphs can help the analyst understand the data he or she is working with.

The Anscombe data are included in base R, and these data (and R, of course!) are used to produce the plot that accompanies the Wikipedia entry on Anscombe’s quartet.
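The "(nearly) identical coefficients" claim is easy to verify directly from the built-in data (a quick check using only base R):

```r
# Fit a linear regression to each of the four x/y pairs in the built-in
# anscombe data and collect the coefficients side by side.
sapply(1:4, function(i) {
  coef(lm(as.formula(paste0("y", i, " ~ x", i)), data = anscombe))
})
# Every column shows an intercept of ~3.00 and a slope of ~0.50.
```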

Because the 1980’s are back, I decided to make a visualization of Anscombe’s quartet using the like-most-totally-rad 1980’s graphing elements I could come up with. For the colors, I drew on a number of graphic design palettes with accompanying hex codes. I used the excellent showtext package for the 1980’s font, which comes from the Google font “Press Start 2P.” (Note: if you’re reproducing the graph at home, the fonts won’t show properly in RStudio. Run the code in a standalone R session and everything works like a charm.) I had to tweak a number of graphical parameters in order to get the layout right, but in the end I’m quite pleased with the result.

The Code

# showtext library to get 1980's font
# "Press Start 2P"
library(showtext)

png("C:\\Directory\\Anscombe_80s.png",
width=11,height=8, units='in', res=600)

showtext.begin()
op <- par(las=1, mfrow=c(2,2), mar=1.5+c(4,4,.5,1), oma=c(0,0,5,0),
lab=c(6,6,7), cex.lab=12.0, cex.axis=5, mgp=c(3,1,0), bg = 'black',
col.axis = '#F2CC00', col.lab = '#A10EEC', family = 'start2p')
ff <- y ~ x
for(i in 1:4) {
ff[[2]] <- as.name(paste("y", i, sep=""))
ff[[3]] <- as.name(paste("x", i, sep=""))
lmi <- lm(ff, data= anscombe)
xl <- substitute(expression(x[i]), list(i=i))
yl <- substitute(expression(y[i]), list(i=i))
plot(ff, data=anscombe, col="#490E61", pch=21, cex=2.4, bg = "#32FA05",
xlim=c(3,19), ylim=c(3,13)
, xlab=eval(xl), ylab=yl,
family = 'start2p'
)
abline(lmi, col="#FA056F", lwd = 5)
axis(1, col = '#FA1505')
axis(2, col = '#FA1505')
box(col="#FA1505")
}
mtext("Anscombe's Quartet", outer = TRUE,
cex = 20, family = 'start2p', col="#FA1505")
showtext.end()

dev.off()

The Plot

Conclusion

In this post, I used data available in R to make a 1980’s-themed version of the Anscombe quartet graphs. The main visual elements I manipulated were the colors and the fonts. R’s wonderful and flexible plotting capabilities (here using base R!) made it very straightforward to edit every detail of the graph to achieve the desired retro-kitsch aesthetic.

OK, so maybe this isn’t the most serious use of R for data analysis and visualization. There are doubtless more important business cases and analytical problems to solve. Nevertheless, this was super fun to do. Data analysis (or data science, or whatever you’d like to call it) is a field in which there are countless ways to be creative with data. It’s not always easy to bring this type of creativity to every applied project, but this blog is a place where I can do any crazy thing I set my mind to and just have fun. Judging by that standard, I think this project was a success.

Coming Up Next

In the next post, I’ll do something a little bit different with data. Rather than doing data analysis, I’ll describe a project in which I used Python to manage and edit meta-data (ID3 tags) in mp3 files. Stay tuned!


To leave a comment for the author, please follow the link and comment on their blog: Method Matters.

### prrd 0.0.1: Parallel Running [of] Reverse Depends

Sun, 01/07/2018 - 19:33

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new package is now on the ghrr drat. It was uploaded to CRAN four days ago but still lingers in the inspect/ state, along with a growing number of other packages. As some new packages have come through, I am sure it will get processed eventually, but in the meantime I figured I might as well make it available this way.

The idea of prrd is simple, and it is described in some more detail on its webpage and its GitHub repo. Reverse dependency checks are an important part of package development (provided you care about not breaking other packages, as CRAN asks you to), and they are easily done in a (serial) loop. But these checks are also generally embarrassingly parallel, as there is little or no interdependency between them (besides maybe shared build dependencies).

So this package uses the liteq package by Gabor Csardi to set up all tests to run as tasks in a queue. This permits multiple independent queue runners to each work on one task at a time. Results are written back and summarized.
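The underlying queue pattern from liteq can be sketched roughly as follows (the queue name and task payloads here are made up for illustration; see the prrd repo for the actual usage):

```r
library(liteq)

# Create a queue backed by a (temporary) SQLite file and add two tasks.
q <- ensure_queue("revdeps", db = tempfile())
publish(q, title = "pkgA", message = "check reverse dependency pkgA")
publish(q, title = "pkgB", message = "check reverse dependency pkgB")

# Each worker repeatedly takes one task, does the work, and acknowledges
# it; try_consume() returns NULL once the queue is empty.
while (!is.null(msg <- try_consume(q))) {
  # ... run the reverse-dependency check for msg$title here ...
  ack(msg)
}
```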

This already works pretty well, as evidenced by the following screenshot (running six parallel workers, arranged in a split byobu session).

See the aforementioned webpage and its repo for more details, and by all means give it a whirl.

For more questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

### Discontinuing maintenance of jug

Sun, 01/07/2018 - 14:00

(This article was first published on FishyOperations, and kindly contributed to R-bloggers)

I’m starting off 2018 by dropping my involvement with jug. The reason is that I simply can no longer find the time or interest to maintain the code.

All of the open-source efforts that I started had to do with a concrete need that I had in one way or another. Once that need is fulfilled by creating a solution for it, the next challenge becomes maintaining that solution.

In the best case scenario, someone with the right set of skills continues the development into a mature project which becomes significantly better than the original solution. An example close to my heart is simmer (see http://r-simmer.org/), where Iñaki Ucar has taken over the development and has improved the original enormously.

However, for jug the community of active developers is currently too small to do a handover. Next to that, jug’s main competitor, plumber (see website), has been able to create a significant community with highly active support.

Today I have to sort of admit defeat. My time availability as co-founder of dataroots and my changing interests unfortunately no longer allow me to put in the time and effort required to do the necessary maintenance and do justice to the jug user base. Therefore, I will be discontinuing the maintenance of jug. If anyone is willing to take over, they would be more than welcome.



### Undemocratic Democracies

Sun, 01/07/2018 - 13:16

(This article was first published on R – DiffusePrioR, and kindly contributed to R-bloggers)

I’ve always thought that it was ironic that most countries with the word “Democratic” in their name are exceptionally un-democratic in reality. So I was very interested in the following post on Reddit this week showing exactly this (https://www.reddit.com/r/dataisbeautiful/comments/7nkyek/countries_with_the_word_democratic_in_their/).

However, there are only 8 countries that have “democratic” in their name: People’s Democratic Republic of Algeria, Democratic Republic of the Congo, Democratic Republic of Timor-Leste, Federal Democratic Republic of Ethiopia, Lao People’s Democratic Republic, Democratic People’s Republic of Korea, Federal Democratic Republic of Nepal, and the Democratic Socialist Republic of Sri Lanka. This and the fact that I am back teaching next week got me thinking that this might be a nice example to show how two-sample t-tests work.

The 8 “democratic” countries have a mean democracy index score of 3.89 with a sample standard deviation of 2.174942. This contrasts with a mean and standard deviation of 5.602013 and 2.172326 for the remaining 159 countries in the Economist Intelligence Unit’s democracy index (https://en.wikipedia.org/wiki/Democracy_Index).

Let’s conduct a two-sample t-test that assumes equal variances to test whether this difference in means is statistically significant. The null hypothesis is that both sample means are equal. This is a two-tailed test. The test statistic follows Gosset’s t-distribution with n1+n2-2 degrees of freedom (159+8-2=165). The test statistic is calculated (formula here: https://en.wikipedia.org/wiki/Student%27s_t-test):

s = sqrt(((158*2.172326^2)+(7*2.174942^2))/(159+8-2))
t = (5.602013-3.89)/(s*(sqrt(1/159+1/8)))=2.1749

which allows us to reject the null at the conventional alpha of 0.05. The p-value is ~0.03, meaning we would not be able to reject at the 1% level. Interestingly, if you drop the equal variance assumption (Welch’s t-test), you can no longer reject the null at the 5% level.

Hopefully, this example will be of interest to people teaching stats and econometrics for undergrads!

rm(list = ls())

other <- c(9.93, 9.5, 9.39, 9.26, 9.2, 9.15, 9.15, 9.09, 9.03, 9.01, 8.81, 8.8,
           8.63, 8.41, 8.39, 8.36, 8.3, 8.28, 8.17, 7.99, 7.98, 7.98, 7.94, 7.92,
           7.92, 7.88, 7.87, 7.86, 7.85, 7.85, 7.82, 7.81, 7.79, 7.78, 7.77, 7.65,
           7.51, 7.47, 7.41, 7.39, 7.31, 7.29, 7.23, 7.13, 7.1, 7.01, 6.97, 6.96,
           6.94, 6.9, 6.83, 6.77, 6.75, 6.75, 6.72, 6.67, 6.67, 6.65, 6.64, 6.62,
           6.62, 6.59, 6.57, 6.54, 6.47, 6.42, 6.4, 6.38, 6.31, 6.27, 6.25, 6.21,
           6.03, 6.01, 5.99, 5.93, 5.92, 5.92, 5.91, 5.81, 5.76, 5.73, 5.72, 5.7,
           5.7, 5.67, 5.64, 5.63, 5.55, 5.33, 5.31, 5.26, 5.23, 5.07, 5.04, 4.93,
           4.93, 4.92, 4.87, 4.86, 4.81, 4.77, 4.7, 4.68, 4.55, 4.5, 4.49, 4.33,
           4.27, 4.2, 4.08, 4.02, 4.02, 3.96, 3.96, 3.96, 3.88, 3.85, 3.81, 3.74,
           3.71, 3.54, 3.46, 3.46, 3.4, 3.38, 3.32, 3.31, 3.24, 3.18, 3.14, 3.14,
           3.07, 3.06, 3.05, 3.04, 3.03, 2.91, 2.91, 2.83, 2.79, 2.75, 2.65, 2.55,
           2.4, 2.37, 2.37, 2.34, 2.25, 2.06, 1.98, 1.95, 1.93, 1.89, 1.83, 1.7,
           1.61, 1.5, 1.43)
demo <- c(7.24, 6.48, 4.86, 3.6, 3.56, 2.37, 1.93, 1.08)

mean(other); sd(other); length(other)
mean(demo); sd(demo); length(demo)

t.test(other, demo, var.equal = TRUE)

s <- sqrt(((158*2.172326^2) + (7*2.174942^2))/(159 + 8 - 2))
t <- (5.602013 - 3.89)/(s*(sqrt(1/159 + 1/8)))

t.test(other, demo, var.equal = FALSE)

library(ggplot2)
data1 <- data.frame(Score = other, Name = "No")
data2 <- data.frame(Score = demo, Name = "Yes")
data <- rbind(data1, data2)
ggplot(data, aes(Name, Score)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw() +
  xlab("Democratic in Name") +
  ylab("Democracy Score")



### Rainbowing a set of pictures

Sun, 01/07/2018 - 01:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I’ve now done a few collages from R using magick: the faces of #rstats Twitter, We R-Ladies with Lucy D’Agostino McGowan, and a holiday card for R-Ladies. The faces of #rstats Twitter and holiday card collages were arranged at random, while the We R-Ladies one was a mosaic forming the R-Ladies logo. I got the idea to up my collage skills by trying to learn how to arrange pics by their main colour, like a rainbow. The verb “to rainbow” doesn’t exist, and “rainbowing” doesn’t mean ordering by colour, but I didn’t let this stop me.

It was the occasion to grab some useful knowledge about colours, not useless for someone who did not even know about Pantone’s Colors of the Year a few weeks ago…

This post has nothing to do with Kesha’s new album. However, you can listen to it while reading since it’s so good, but maybe switch to something older from her when I use “$”.

Getting some pics to play with

The first pictures I tried to arrange were all the pictures ever posted by R-Ladies local chapters on their Twitter accounts. While it was fun to grab them all, it was not very interesting to play with them, as so many of them were pictures of screens. I therefore grabbed “nature” pictures from Pexels using the same method [as when creating the Bubblegum Puppies], following this Stack Overflow thread. I chose “nature” as a keyword because 1) it led to many hits and 2) it offered a good variety of colours.

library("rvest")
library("RSelenium")
library("magrittr")

rD <- rsDriver()
remDr <- rD[["client"]]
# open the webpage
remDr$navigate("https://www.pexels.com/search/nature/")
# scroll down
for(i in 1:130){
  remDr$executeScript(paste("scroll(0,", i*10000, ");"), args = list("dummy"))
  # be nice and wait
  Sys.sleep(1)
}
# https://www.pexels.com/faq/
page_content <- remDr$getPageSource()
remDr$close()

get_link_from_src <- function(node){
  xml2::xml_attrs(node)["src"] %>%
    as.character() %>%
    stringr::str_replace("\\?h.*", "")
}

xtract_pic_links <- function(source) {
  css <- '.photo-item__img'
  read_html(source[[1]]) %>%
    html_nodes(css) %>%
    purrr::map_chr(get_link_from_src)
}

links <- xtract_pic_links(page_content)
links <- links[1:1400]

# save
dir.create("nature")
save_pic <- function(url){
  Sys.sleep(1)
  name <- stringr::str_replace(url, ".*\\/", "")
  try(magick::image_read(url) %>%
        magick::image_write(paste0("nature/", name)),
      silent = TRUE)
}
purrr::walk(links, save_pic)

Extracting the main colour and making pics size-compatible

In the following code, I extracted the main colour from each pic using Russel Dinnage’s method as presented in this blog post from David Smith. Before that, I had to install two packages from GitHub, rblocks and rPlotter.
This code also serves another role: since I wanted to paste pics together at some point, I decided to make them all the same dimensions by adding a border with magick. I had learnt how to do that when preparing the R-Ladies Global holiday card, but this time, instead of using the same colour every time (R-Ladies’ official purple), I used the main colour I’d just extracted. The most important points to make a picture a square are to know that magick::image_info gives you the height and width of an image… and to somehow understand geometry, which was embarrassingly a hurdle when I did that. The code to extract colours didn’t work in a few cases which I did not investigate a lot: I had downloaded more pics than I needed because I had experienced the issue when working with R-Ladies meetups pics, and had seen it was because of seemingly bicolor pics.

dir.create("formatted_pics")

format_image <- function(path){
  image <- magick::image_read(path)
  info <- magick::image_info(image)
  # find in which direction I need to add pixels
  # to make this a square
  direction <- ifelse(info$height > info$width, "height", "width")
  scale_number <- as.numeric(info[direction]/500)
  image <- magick::image_scale(image, paste0(info["width"]/scale_number, "x",
                                             info["height"]/scale_number))
  newinfo <- magick::image_info(image)
  # colours
  colours <- try(rPlotter::extract_colours(path, num_col = 1), silent = TRUE)
  # one pic at least was problematic
  if(!is(colours, "try-error")){
    colour <- colours[1]
    image <- magick::image_border(image, colour,
                                  paste0((500 - newinfo$width)/2, "x",
                                         (500 - newinfo$height)/2))
    info <- magick::image_info(image)
    # odd numbers out!
    if(info$height/2 != floor(info$height/2)){
      image <- magick::image_crop(image, "0x500+0")
    }
    if(info$width/2 != floor(info$width/2)){
      image <- magick::image_crop(image, "500x0+0")
    }
    magick::image_write(image, stringr::str_replace(path, "nature", "formatted_pics"))
    tibble::tibble(path = path, colour = colour)
  }else{
    NULL
  }
}

pics_main_colours <- purrr::map_df(dir("nature", full.names = TRUE), format_image)
readr::write_csv(pics_main_colours, path = "pics_main_colours.csv")

And because I’m apparently a bad planner, I had to reduce pictures afterwards.

# we need smaller images
reduce_image <- function(path){
  magick::image_read(path) %>%
    magick::image_scale("50x50!") %>%
    magick::image_write(path)
}
purrr::walk(dir("formatted_pics", full.names = TRUE), reduce_image)

Preparing a function to order and paste pictures

This function has a collage part, which you might recognize from my faces of #rstats Twitter blog post, and an ordering-pictures-by-a-variable part that’s new and uses a bit of tidy eval… Maybe I’ll really learn tidy eval this year! pics_info needs to be a data.frame with the path to the pictures and, well, the variable one wants to use to order them.

library("rlang")

make_column <- function(i, files, no_rows){
  magick::image_read(files[(i*no_rows + 1):((i + 1)*no_rows)]) %>%
    magick::image_append(stack = TRUE) %>%
    magick::image_write(paste0("cols/", i, ".jpg"))
}

make_collage <- function(pics_info, no_rows, no_cols, ordering_col){
  pics_info <- dplyr::arrange(pics_info, !!!syms(ordering_col))
  pics_info <- pics_info[1:(no_rows*no_cols),]
  pics_info <- dplyr::mutate(pics_info, column = rep(1:no_cols, each = no_rows))
  pics_info <- dplyr::group_by(pics_info, column) %>%
    dplyr::arrange(!!!syms(ordering_col)) %>%
    dplyr::mutate(row = 1:no_rows) %>%
    dplyr::ungroup()
  pics_info <- dplyr::arrange(pics_info, column, row)
  dir.create("cols")
  purrr::walk(0:(no_cols - 1), make_column, files = pics_info$path, no_rows = no_rows)
  banner <- magick::image_read(dir("cols/", full.names = TRUE)) %>%
    magick::image_append(stack = FALSE)
  unlink("cols", recursive = TRUE)
  return(banner)
}

The function returns a magick object that one can then write to disk as a PNG for instance.

I first tested it using a random approach added to the data.frame created in the next section, and show the result here to give an idea of the variety of pictures. For many of them, however, the main colour that you can see in their border is greyish.

set.seed(42) pics_info <- dplyr::mutate(pics_info, random = sample(1:nrow(pics_info), nrow(pics_info))) make_collage(pics_info, 19, 59, "random") %>% magick::image_write("data/2018-01-07-rainbowing-banner_random.png")

Testing a first (bad) approach: using hue

Once I had the main colour as a hex code, I had no idea how to order the colours and thought a good idea would be to use hue, which is the dominant wavelength in a colour. Most observed colours are a mix of wavelengths, unless you’re using a laser for instance. To get the hue from a colour identified by its hex code, one needs two functions: colorspace::hex2RGB and grDevices::rgb2hsv. The latter outputs hue, saturation and value. Hue is the dominant wavelength, saturation the amount of that wavelength in the colour, and value the amount of light in the colour. The smaller the saturation, the less representative the hue is of the main colour. Add to that the fact that the main colour can itself be only a little representative of your original picture… Ordering by hue isn’t perfect, but I tried it anyway.

# now work on getting the hue and value for all pics
# create a data.frame with path, hue, value
get_values <- function(path, pics_main_colours){
  print(path)
  # get main colour
  colour <- pics_main_colours$colour[pics_main_colours$path ==
                                       stringr::str_replace(path, "formatted_pics", "nature")]
  # translate it
  rgb <- colorspace::hex2RGB(colour)
  values <- grDevices::rgb2hsv(t(rgb@coords))
  tibble::tibble(path = path,
                 hue = values[1,1],
                 saturation = values[2,1],
                 value = values[3,1])
}

# all values
pics_col <- purrr::map_df(dir("formatted_pics", full.names = TRUE),
                          get_values, pics_main_colours)

make_collage(pics_col, 19, 59, "hue") %>%
  magick::image_write("banner_hue.png")

So this is not too pretty. Blue and green pictures seem to cluster together but there are very dark pictures which we’d intuitively put aside.

So I thought a bit more and decided to first assign main colours to a reference colour and then order pictures based on this…

Choosing a better approach: RGB and distances

The first challenge was to choose reference colours which would be the rainbow’s slices. I could have looked up the RGB codes corresponding to ROYGBIV (red, orange, yellow, green, blue, indigo and violet), but I had read about the xkcd color survey in this interesting post and therefore decided to use xkcd colors, available in R via the xkcdcolors package. I chose to use the 18 most common ones, and added black to that lot. It was no longer really a rainbow, I agree. The colors present in the pictures were ordered by hand by my husband, and I like his choices.

Then, to assign each pic to a reference colour via its main colour, I calculated the Euclidean distance, using the RGB values, between that colour and all reference colours, in order to find the closest reference colour.

library("xkcdcolors")
library("magrittr")

# version of colorspace::hex2RGB returning a matrix of coordinates
hex2rgb <- function(hex){
  result <- colorspace::hex2RGB(hex)@coords
}

# https://stackoverflow.com/questions/45328221/unnest-one-column-list-to-many-columns-in-tidyr
colors <- tibble::tibble(name = c(xcolors()[1:18], "black"),
                         hex = name2color(name),
                         rgb = purrr::map(hex, hex2rgb)) %>%
  dplyr::mutate(rgb = purrr::map(rgb, tibble::as_tibble)) %>%
  tidyr::unnest()

# for each colour I want the closest one
find_closest_colour <- function(hex, colors){
  test <- tibble::tibble(hex = hex) %>%
    dplyr::mutate(rgb = purrr::map(hex, hex2rgb),
                  rgb = purrr::map(rgb, tibble::as_tibble)) %>%
    tidyr::unnest()
  distance <- stats::dist(rbind(test[, c("R", "G", "B")],
                                colors[, c("R", "G", "B")]))
  colors$name[which.min(as.matrix(distance)[,1][2:(nrow(colors) + 1)])]
}

imgs_col <- dplyr::mutate(pics_main_colours,
                          xkcd_col = purrr::map_chr(colour, find_closest_colour,
                                                    colors = colors))
readr::write_csv(imgs_col, path = "imgs_xkcd_col.csv")

Once I had this information about each pic, I could order the pictures, after having defined the order of the reference colours.

pics_info <- readr::read_csv("imgs_xkcd_col.csv")
pics_info <- dplyr::mutate(pics_info,
                           path = stringr::str_replace(path, "nature", "formatted_pics"))
pics_info <- dplyr::mutate(pics_info,
                           xkcd_col = factor(xkcd_col, ordered = TRUE,
                                             levels = c("black", "brown", "red", "magenta", "pink",
                                                        "lime green", "green", "dark green", "teal",
                                                        "light blue", "sky blue", "blue", "purple", "grey")))

This looks much better, but of course the initial set (and maybe the extraction method used as well) doesn’t provide enough colours to make this extremely pretty. I’m not sure how useful this end product is, but hey, I got to look at pretty landscapes full of colours from my grey rainy city, and learnt a lot along the way… Besides, maybe you will find a cool use case for some of the colour methods featured here and will tell me about it in the comments?
### Dataviz Signup

Sat, 01/06/2018 - 23:26

(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)

Data Visualization for Social Science will be published later this year by Princeton University Press. You can read a near-complete draft of the book at socviz.co. If you would like to receive one (1) email when the book is available for pre-order, please fill out this very short form. The goal of the book is to introduce readers to the principles and practice of data visualization in a humane, reproducible, and up-to-date way.
### New wrapr R pipeline feature: wrapr_applicable

Sat, 01/06/2018 - 18:35

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The R package wrapr now has a neat new feature: “wrapr_applicable”. This feature allows objects to declare a surrogate function to stand in for the object in wrapr pipelines. It is a powerful technique and allowed us to quickly implement a convenient new ad hoc query mode for rquery. A small effort in making a package “wrapr aware” appears to have a fairly large payoff.

### tint 0.0.5

Sat, 01/06/2018 - 14:50

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

A maintenance release of the tint package arrived on CRAN earlier this week. Its name expands from tint is not tufte, as the package offers a fresher take on the Tufte style for html and pdf presentations. A screenshot of the pdf variant is below.

Similar to the RcppCNPy release this week, this is pure maintenance related to dependencies. CRAN noticed that processing the vignettes requires the mgcv package, as we use geom_smooth() in some example graphs. So the vignettes were altered to no longer carry this requirement just for the vignette tests.
We also had one pending older change related to jurassic pandoc versions on some CRAN architectures.

Changes in tint version 0.0.5 (2018-01-05)

• Only run the html rendering regression test on Linux or Windows, as the pandoc versions on CRAN are too old elsewhere.

• Vignette figures reworked so that the mgcv package is not required, avoiding a spurious dependency [CRAN request]

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the tint page. For questions or comments use the issue tracker off the GitHub repo.

### RcppCNPy 0.2.8

Sat, 01/06/2018 - 14:44

(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)

A minor maintenance release of the RcppCNPy package arrived on CRAN this week. RcppCNPy provides R with read and write access to NumPy files thanks to the cnpy library by Carl Rogers. There is no code change here. But to process the vignette we rely on knitr, which sees Python here and (as of its most recent release) wants the (excellent!) reticulate package.
Which is of course overkill just to process a short pdf document, so we turned this off.

Changes in version 0.2.8 (2018-01-04)

• Vignette sets the knitr option python.reticulate=FALSE to avoid another dependency just for the vignette [CRAN request]

CRANberries also provides a diffstat report for the latest release. As always, feedback is welcome and the best place to start a discussion may be the GitHub issue tickets page.

### RQuantLib 0.4.4 for Windows

Fri, 01/05/2018 - 21:02

(This article was first published on FOSS Trading, and kindly contributed to R-bloggers)

I’m pleased to announce that the RQuantLib Windows binaries are now up to 0.4.4! The RQuantLib pre-built Windows binaries have been frozen on CRAN since 0.4.2, but now you can get version 0.4.4 binaries from Dirk’s ghrr drat repo. Installation is as simple as:

drat::addRepo("ghrr") # maybe use 'install.packages("drat")' first
install.packages("RQuantLib", type = "binary")

I will be able to create Windows binaries for future RQuantLib versions too, now that I have a Windows QuantLib build (version 1.11) to link against.
Dirk and I plan to talk with CRAN about getting the new binaries hosted there. Regardless, they will always be available via the drat repo.

### How to extract data from a PDF file with R

Fri, 01/05/2018 - 07:02

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

In this post, taken from the book R Data Mining by Andrea Cirillo, we’ll be looking at how to scrape PDF files using R. It’s a relatively straightforward way to look at text mining – but it can be challenging if you don’t know exactly what you’re doing. Until January 15th, every single eBook and video by Packt is just $5! Start exploring some of Packt’s huge range of R titles here.

You may not be aware of this, but some organizations create something called a ‘customer card’ for every single customer they deal with. This is quite an informal document that contains some relevant information related to the customer, such as the industry and the date of foundation. Probably the most precious information contained within these cards is the comments they write down about the customers. Let me show you one of them:
My plan was the following—get the information from these cards and analyze it to discover whether some kind of common traits emerge.

As you may already know, at the moment this information is presented in an unstructured way; that is, we are dealing with unstructured data. Before trying to analyze this data, we will have to gather it in our analysis environment and give it some kind of structure.

Technically, what we are going to do here is called text mining, which generally refers to the activity of gaining knowledge from texts. The techniques we are going to employ are the following:

• Sentiment analysis
• Wordclouds
• N-gram analysis
• Network analysis
Getting a list of documents in a folder

First of all, we need to get a list of the customer cards we were given by the commercial department. I have stored all of them within the ‘data’ folder of my workspace. Let’s use list.files() to get them:

file_vector <- list.files(path = "data")

Nice! We can inspect it by looking at its head, using the following command:

file_vector %>% head()
[1] "banking.xls"          "Betasoloin.pdf"       "Burl Whirl.pdf"       "BUSINESSCENTER.pdf"
[5] "Buzzmaker.pdf"        "BuzzSaw Publicity.pdf"

Uhm… not exactly what we need. I can see there are also .xls files. We can remove them using the grepl() function, which performs partial matches on strings, returning TRUE if the pattern is found, or FALSE if not. We are going to set up the following test here: give me TRUE if you find “.pdf” in the filename, and FALSE if not:

grepl(".pdf", file_vector)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[18] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[35] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[52] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[69] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[86] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[103] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
[120] TRUE
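As a quick illustration of how grepl() behaves, here is a tiny standalone example (with made-up filenames, not the actual customer cards):

```r
# grepl() returns one logical value per input string
files <- c("banking.xls", "Betasoloin.pdf", "notes.txt")
grepl(".pdf", files)
# [1] FALSE  TRUE FALSE
```

Note that the dot is a regex metacharacter, so the pattern ".pdf" would also match a name like "mypdfs.txt"; a stricter pattern would be "\\.pdf$", or you could pass fixed = TRUE.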

As you can see, the first match results in a FALSE since it is related to the .xls file we saw before. We can now filter our list of files by simply passing these matching results to the list itself. More precisely, we will slice our list, selecting only those records where our grepl() call returns TRUE:

pdf_list <- file_vector[grepl(".pdf", file_vector)]

Did you understand [grepl(".pdf", file_vector)]? It is actually a way to access one or more indexes within a vector; in our case, the positions where grepl() returned TRUE, exactly the same results as we printed out before.
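Logical indexing of this kind works on any vector; here is a minimal standalone example (with hypothetical values):

```r
v <- c(10, 20, 30, 40)
keep <- c(TRUE, FALSE, TRUE, FALSE)
v[keep]  # keeps only the elements where keep is TRUE
# [1] 10 30
```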

If you now look at the list, you will see that only PDF filenames are shown on it.

Reading PDF files into R via pdf_text()

R comes with a really useful package that’s employed for tasks related to PDFs. This is named pdftools, and besides the pdf_text function we are going to employ here, it also contains other relevant functions that are used to get different kinds of information about the PDF file into R.

For our purposes, it will be enough to get all of the textual information contained within each of the PDF files. First of all, let’s try this on a single document; we will scale it up to the whole set of documents later on. The only required argument to make pdf_text work is the path to the document. The object resulting from this call will be a character vector of length 1:

pdf_text("data/BUSINESSCENTER.pdf")
[1] "BUSINESSCENTER business profile\nInformation below are provided under non disclosure agreement. date of enquery: 12.05.2017\ndate of foundation: 1993-05-18\nindustry: Non-profit\nshare holders: Helene Wurm ; Meryl Savant ; Sydney Wadley\ncomments\nThis is one of our worst customer. It really often miss payments even if for just a couple of days. We have\nproblems finding useful contact persons. The only person we can have had occasion to deal with was the\nfiscal expert, since all other relevant person denied any kind of contact.\n 1\n"

If you compare this with the original PDF document you can easily see that all of the information is available even if it is definitely not ready to be analyzed. What do you think is the next step needed to make our data more useful?

We first need to split our string into lines in order to give our data a structure closer to the original one, that is, made of paragraphs. To split our string into separate records, we can use the strsplit() function, which is a base R function. As arguments, it requires the string to be split and the token that decides where the string has to be split. If you now look at the string, you’ll notice that where we found the end of a line in the original document, for instance after the words business profile, we now find the \n token. This is commonly employed in text formats to mark the end of a line.

We will therefore use this token as a split argument:

pdf_text("data/BUSINESSCENTER.pdf") %>% strsplit(split = "\n") [[1]] [1] "BUSINESSCENTER business profile" [2] "Information below are provided under non disclosure agreement. date of enquery: 12.05.2017" [3] "date of foundation: 1993-05-18" [4] "industry: Non-profit" [5] "share holders: Helene Wurm ; Meryl Savant ; Sydney Wadley" [6] "comments" [7] "This is one of our worst customer. It really often miss payments even if for just a couple of days. We have" [8] "problems finding useful contact persons. The only person we can have had occasion to deal with was the" [9] "fiscal expert, since all other relevant person denied any kind of contact." [10] " 1"

strsplit() returns a list with an element for each element of the character vector passed as argument; within each list element, there is a vector with the split string.
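To make that return shape concrete, here is a tiny base-R illustration with a made-up two-element character vector (not part of our PDF data):

```r
# strsplit() returns one list element per input string
parts <- strsplit(c("a\nb", "c\nd\ne"), split = "\n")
length(parts)    # 2: one list element per input string
parts[[1]]       # the split pieces of the first string: "a" "b"
parts[[2]]       # the split pieces of the second string: "c" "d" "e"
```

This is why the PDF example above accesses its result with [[1]]: we passed a single string in, so the split pieces live in the first (and only) list element.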

Isn’t that better? I definitely think it is. The last thing we need to do before actually doing text mining on our data is to apply those treatments to all of the PDF files and gather the results into a conveniently arranged data frame.

Iteratively extracting text from a set of documents with a for loop

What we want to do here is run through the list of files and, for each filename found there, run the pdf_text() function and then the strsplit() function to get an object similar to the one we have seen with our test. A convenient way to do this is with a ‘for’ loop. These structures basically tell R: repeat these instructions n times and then stop. Let me show you a typical ‘for’ loop:

for (i in 1:3){ print(i) }

If we run this, we obtain the following:

[1] 1 [1] 2 [1] 3

This means that the loop runs three times, repeating the instructions between the braces once per iteration. What is the only thing that changes each time? The value of i. This variable, usually called the counter, is what the loop uses to know when to stop iterating. The for loop repeats the instructions between the braces once for each element of the vector following the in clause; at each step, the variable before in (i in this case) takes the next value from that vector, going from 1 to 3 in our example.

The counter is also useful within the loop itself, and it is usually employed to iterate within an object in which some kind of manipulation is desired. Take, for instance, a vector defined like this:

vector <- c(1,2,3)

Imagine we want to increase the value of every element of the vector by 1. We can do this by employing a loop such as this:

for (i in 1:3){ vector[i] <- vector[i]+1 }

If you look closely at the loop, you’ll see that the instruction accesses the element of the vector at index i and increases that value by 1. The counter is useful here because it lets us iterate over all the elements of the vector, from position 1 to 3.

Be aware that this is actually not a best practice because loops tend to be quite computationally expensive, and they should be employed when no other valid alternative is available. For instance, we can obtain the same result here by working directly on the whole vector, as follows:

vector_increased <- vector +1

If you are interested in the topic of avoiding loops where they are not necessary, I can share with you some relevant material on this.
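As a small taste of that material, here is the same increment done three ways; the vapply() variant is my addition to show the general element-wise idiom, not something from the original text:

```r
vector <- c(1, 2, 3)

# 1. explicit loop: works, but verbose and slower on large vectors
for (i in seq_along(vector)) vector[i] <- vector[i] + 1

# 2. vectorized arithmetic: the idiomatic R solution
vector_increased <- c(1, 2, 3) + 1

# 3. applying an arbitrary function element-wise without writing a loop
vector_squared <- vapply(c(1, 2, 3), function(x) x^2, numeric(1))
```

Note the use of seq_along(vector) rather than 1:3 in the loop: it adapts automatically to the vector's length and behaves correctly even when the vector is empty.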

For our purposes, we are going to apply loops to go through the pdf_list object, and apply the pdf_text function and subsequently the strsplit() function to each element of this list:
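One caveat: pdf_list itself is not defined in this extract. Presumably it holds the names of the PDF files in the data folder, built with something like the following sketch (the folder name is an assumption based on the paths used above):

```r
# hypothetical reconstruction: collect the names of all PDF files in "data"
pdf_list <- list.files(path = "data", pattern = "\\.pdf$")
```

The anchored pattern "\\.pdf$" matches only filenames that end in .pdf, so stray files in the folder are ignored.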

corpus_raw <- data.frame("company" = c(),"text" = c()) for (i in 1:length(pdf_list)){ print(i) pdf_text(paste("data/", pdf_list[i],sep = "")) %>% strsplit("\n")-> document_text data.frame("company" = gsub(x =pdf_list[i],pattern = ".pdf", replacement = ""), "text" = document_text, stringsAsFactors = FALSE) -> document colnames(document) <- c("company", "text") corpus_raw <- rbind(corpus_raw,document) }

Let’s look more closely at the loop: we first call the pdf_text function, passing as an argument the element of pdf_list at position i. Once we have done this, we apply the strsplit() function to the resulting string. We then define the document object, which contains two columns:

• company, which stores the name of the PDF without the .pdf token; this is the name of the company
• text, which stores the text resulting from the extraction

This document object is then appended to the corpus_raw object, which we created before the loop, so that it accumulates the text of all the PDFs.
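One design note: calling rbind() inside the loop copies the growing data frame on every iteration, which gets slow for many documents. An alternative sketch, shown here with toy data in place of the real PDF text so it runs on its own, builds all the pieces first and binds them once:

```r
# toy stand-ins for the per-document data frames built inside the loop
make_piece <- function(name) {
  data.frame(company = name,
             text = paste("text of", name),
             stringsAsFactors = FALSE)
}
pieces <- lapply(c("A", "B", "C"), make_piece)

# bind all pieces in a single call instead of growing row by row
corpus_toy <- do.call(rbind, pieces)
nrow(corpus_toy)  # 3
```

The same do.call(rbind, lapply(...)) pattern would apply to the real loop body, with pdf_text() and strsplit() inside the helper function.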

Let’s have a look at the resulting data frame:

This is a well-structured object, ready for some text mining. Nevertheless, if we look closely at our PDF customer cards, we can see that there are three different kinds of information and they should be handled differently:

• Repeated information, such as the confidentiality disclosure on the second line and the date of enquiry (12.05.2017)
• Structured attributes, for instance, date of foundation or industry
• Strictly unstructured data, which is in the comments paragraph

We are going to address these three kinds of data differently, removing the first group of irrelevant information; we therefore have to split our data frame accordingly into two smaller data frames. To do so, we will leverage the grepl() function once again, looking for the following tokens:

• 12.05.2017: This denotes the line showing the non-disclosure agreement and the date of inquiry.
• business profile: This denotes the title of the document, containing the name of the company. We already have this information stored within the company column.
• comments: This is the name of the last paragraph.
• 1: This represents the number of the page and is always the same on every card.
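Before filtering, it is worth remembering that grepl() interprets its pattern as a regular expression, where . matches any single character; passing fixed = TRUE makes the match literal. A quick base-R illustration:

```r
# the dots in the pattern act as wildcards by default
grepl("12.05.2017", "date of enquery: 12x05y2017")                # TRUE
# fixed = TRUE matches the string literally
grepl("12.05.2017", "date of enquery: 12x05y2017", fixed = TRUE)  # FALSE
grepl("12.05.2017", "date of enquery: 12.05.2017", fixed = TRUE)  # TRUE
```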

We can apply the filter function to our corpus_raw object here as follows:

corpus_raw %>% filter(!grepl("12.05.2017", text, fixed = TRUE)) %>% filter(!grepl("business profile", text)) %>% filter(!grepl("comments", text)) %>% filter(!grepl("^\\s*1\\s*$", text)) -> corpus

Two small points about the patterns: fixed = TRUE treats "12.05.2017" literally (otherwise each . would match any character), and the anchored pattern "^\\s*1\\s*$" matches only lines consisting of the page number, so lines that merely contain the digit 1 (such as the date of foundation) are not dropped by mistake.

Now that we have removed those useless things, we can actually split the data frame into two sub-data frames based on what is returned by the grepl function when searching for the following tokens, which point to the structured attributes we discussed previously:

• date of foundation
• industry
• shareholders

• share holders

We are going to create two different data frames here; one is called ‘information’ and the other is called ‘comments’:

corpus %>% filter(!grepl(c("date of foundation"),text)) %>% filter(!grepl(c( "industry"),text)) %>% filter(!grepl(c( "share holders"),text)) -> comments corpus %>% filter(grepl(("date of foundation"),text)|grepl(( "industry"),text)|grepl(( "share holders"),text))-> information

As you can see, the two data treatments are nearly the opposite of each other, since the first looks for lines showing none of the three tokens while the other looks for records showing at least one of the tokens.
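Since every line must land in exactly one of the two data frames, a quick sanity check on a toy example (made up here, not from the book) is that the row counts add up:

```r
toy <- data.frame(
  text = c("date of foundation: 1993-05-18",
           "industry: Non-profit",
           "a free-text comment line"),
  stringsAsFactors = FALSE)

# TRUE where a line carries one of the structured-attribute tokens
keep <- grepl("date of foundation", toy$text) |
  grepl("industry", toy$text) |
  grepl("share holders", toy$text)

information_toy <- toy[keep, , drop = FALSE]
comments_toy <- toy[!keep, , drop = FALSE]
nrow(information_toy) + nrow(comments_toy) == nrow(toy)  # TRUE
```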

Let’s inspect the results by employing the head function:

Great! We are nearly done. We are now going to start analyzing the comments data frame, which reports all comments from our colleagues. The very last step needed to make this data frame ready for subsequent analyses is to tokenize it, which basically means separating it into different rows for all the words available within the text column. To do this, we are going to leverage the unnest_tokens function, which splits each line into words and produces a new row for each word, taking care to repeat the corresponding values in the other columns of the original data frame.

This function comes from the recent tidytext package by Julia Silge and David Robinson, which provides an organic framework of utilities for text mining tasks. It follows the tidyverse philosophy, which you should already know about if you are using the dplyr package.

Let’s see how to apply the unnest_tokens function to our data:
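The code is missing from this extract; presumably the call is along the lines of comments %>% unnest_tokens(word, text). Since unnest_tokens comes from tidytext, here is instead a base-R stand-in that mimics its effect on a toy row: lowercase the text, split it on whitespace, and repeat the other columns once per word (a simplified sketch, not the real implementation, which also strips punctuation):

```r
toy <- data.frame(company = "BUSINESSCENTER",
                  text = "This is one of our worst customer",
                  stringsAsFactors = FALSE)

# split into lowercase words, then repeat company once per word
words <- strsplit(tolower(toy$text), "\\s+")
tokenized <- data.frame(company = rep(toy$company, lengths(words)),
                        word = unlist(words),
                        stringsAsFactors = FALSE)
tokenized$word
#> [1] "this" "is" "one" "of" "our" "worst" "customer"
```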

If we now look at the resulting data frame, we can see the following:

As you can see, we now have each word separated into a single record.

Thanks for reading! We hope you enjoyed this extract taken from R Data Mining.

Explore $5 R eBooks and videos from Packt until January 15th 2018. To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Using nonstandard evaluation to simulate a register machine Fri, 01/05/2018 - 07:00 (This article was first published on Higher Order Functions, and kindly contributed to R-bloggers) I recently completed all 25 days of Advent of Code 2017, an annual series of recreational programming puzzles. Each day describes a programming puzzle and illustrates a handful of simple examples of the problem. The puzzle then requires the participant to solve a much, much larger form of the problem. For five or so of the puzzles, I used nonstandard evaluation to implement my solution. As I previously wrote, nonstandard evaluation is a way of bottling up magic spells (lines of R code) and changing how or where they are cast (evaluated). I chose to use special evaluation not because it was the easiest or most obvious solution but because I wanted to develop my skills with computing on the language. In this post, I work through one of the examples where I used nonstandard evaluation to write an interpreter for a simple machine. Puzzle description Day 8 requires us to simulate the state of a register machine as it receives a series of instructions.
Each instruction consists of several parts: the register to modify, whether to increase or decrease that register’s value, the amount by which to increase or decrease it, and a condition. If the condition fails, skip the instruction without modifying the register. The registers all start at 0. The instructions look like this: b inc 5 if a > 1 a inc 1 if b < 5 c dec -10 if a >= 1 c inc -20 if c == 10 […] You might also encounter <= (less than or equal to) or != (not equal to). However, the CPU doesn’t have the bandwidth to tell you what all the registers are named, and leaves that to you to determine. What is the largest value in any register after completing the instructions in your puzzle input? If I squint long enough at the register instructions, I can basically see R code. # b inc 5 if a > 1 b <- if (a > 1) b + 5 else b # a inc 1 if b < 5 a <- if (b < 5) a + 1 else a # c dec -10 if a >= 1 c <- if (a >= 1) c - -10 else c # c inc -20 if c == 10 c <- if (c == 10) c + -20 else c If we can set up a way to convert the machine instructions into R code, R will handle the job of looking up values, modifying values and evaluating the logic and if statements. In other words, if we can convert register instructions into R code, the problem simplifies into something like running an R script. And that’s a good strategy, because we have a lot of instructions to process. Each user receives some unique (I think) input for each problem, and my problem input contains 1,000 instructions. 
library(magrittr) full_input <- "https://raw.githubusercontent.com" %>% file.path("tjmahr", "adventofcode17", "master", "inst", "input08.txt") %>% readr::read_lines() length(full_input) #> [1] 1000 head(full_input) #> [1] "kd dec -37 if gm <= 9" "x dec -715 if kjn == 0" #> [3] "ey inc 249 if x < 722" "n dec 970 if t > 3" #> [5] "f dec -385 if msg > -3" "kd dec -456 if ic <= -8" Our strategy for simulating the register machine will have the following steps: • Parsing a register instruction • Creating an R expression from an instruction • Evaluating an R expression inside of a register machine • Changing the evaluation rules to adapt to the quirks of this problem Parsing the register instructions with regular expressions The instructions have a very simple grammar. Here is how I would tag the first few lines of my problem input. [target] [verb] [num1] if [s1] [op] [num2] kd dec -37 if gm <= 9 x dec -715 if kjn == 0 ey inc 249 if x < 722 n dec 970 if t > 3 f dec -385 if msg > -3 We can parse these lines using regular expressions. Regular expressions are an incredibly powerful language for processing text using pattern-matching rules. I will walk through each of the regular expression patterns used to parse an instruction. To match the verbs, we can use the or | operator, so (inc|dec) matches inc or dec. We can also match the six different comparison operators using | too. In the code below, I put the patterns in parentheses so that they will be treated as a single group. re_verb <- "(inc|dec)" re_op <- "(>|<|==|!=|>=|<=)" A register name is just a sequence of letters. The special character \w matches any word character; that is, it matches uppercase/lowercase letters, digits and underscores. The (token)+ suffix matches 1 or more repetitions of a token. Putting these two together, \w+ will match 1 or more adjacent word characters. 
That pattern in principle could match things besides register names (like numbers) but the instruction format here is so constrained that it’s not a problem. # We have to double the backslashes when using them in R strings re_name <- "(\\w+)" Numbers are just integers, and sometimes they are negative. A number here is an optional - plus some digits. The special character \d matches any digit from 0 to 9, so \d+ matches 1 or more successive digits. We can use the (token)? suffix to match an optional token. In our case, -?\d+ will match a sequence of digits and a leading hyphen if one is present. re_number <- "(-?\\d+)" Each pattern in parentheses is a matching group, and the function str_match() from the stringr package will return a matrix with a column for each matched group. I also include an extra set of parentheses around the condition in the if statement to match the whole condition as well as its parts. # Combine the sub-patterns together re <- sprintf("%s %s %s if (%s %s %s)", re_name, re_verb, re_number, re_name, re_op, re_number) re #> [1] "(\\w+) (inc|dec) (-?\\d+) if ((\\w+) (>|<|==|!=|>=|<=) (-?\\d+))" text <- "b inc 5 if a > 1" stringr::str_match(text, re) #> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] #> [1,] "b inc 5 if a > 1" "b" "inc" "5" "a > 1" "a" ">" "1" # Column 5 matches the subgroups in columns 6, 7 and 8 as a single # group because of the extra grouping parentheses after the if. We can package this step into a function that takes an instruction’s text and returns a list with the labelled parts of that instruction.
parse_instruction <- function(text) { stopifnot(length(text) == 1) re <- "(\\w+) (inc|dec) (-?\\d+) if ((\\w+) (>|<|==|!=|>=|<=) (-?\\d+))" text %>% stringr::str_match(re) %>% as.list() %>% setNames(c("instruction", "target", "verb", "num1", "cond", "s1", "op", "num2")) } str(parse_instruction(text)) #> List of 8 #>$ instruction: chr "b inc 5 if a > 1" #> $target : chr "b" #>$ verb : chr "inc" #> $num1 : chr "5" #>$ cond : chr "a > 1" #> $s1 : chr "a" #>$ op : chr ">" #> $num2 : chr "1" Creating R code Next, we need to convert some strings into R code. We can do this with rlang::parse_expr(). It takes a string and creates an R expression, something I’ve described as a kind of bottled up magic spell: an expression captures magic words (code) and lets us manipulate or cast (evaluate) them. code <- rlang::parse_expr("print('hello')") code #> print("hello") code <- rlang::parse_expr("if (a > 1) b + 5 else b") code #> if (a > 1) b + 5 else b The format of the instructions is relatively straightforward. We can fill in a template with the parts of the parsed line. Because inc/dec are just addition and subtraction, we replace them with the appropriate math operations. create_r_instruction <- function(parsed) { parsed$math <- if (parsed$verb == "inc") "+" else "-" code <- sprintf("if (%s) %s %s %s else %s", parsed$cond, parsed$target, parsed$math, parsed$num1, parsed$target) rlang::parse_expr(code) } r_code <- "b inc 5 if a > 1" %>% parse_instruction() %>% create_r_instruction() r_code #> if (a > 1) b + 5 else b Create the register machine

We have to figure out where we want to evaluate the generated R code. We
create a register object to hold the values. The object will just be a list()
with some extra metadata. This object will be the location where the R code
is evaluated.

create_register_machine <- function(...) { initial <- list(...) data <- c(initial, list(.metadata = list())) structure(data, class = c("register_machine", "list")) } # Give the machines a pretty print method print.register_machine <- function(x, ...) { utils::str(x, ...) invisible(x) } create_register_machine() #> List of 1 #> $.metadata: list() #> - attr(*, "class")= chr [1:2] "register_machine" "list" For now, we can initialize registers by using named arguments to the function. create_register_machine(a = 0, b = 0) #> List of 3 #>$ a : num 0 #> $b : num 0 #>$ .metadata: list() #> - attr(*, "class")= chr [1:2] "register_machine" "list" Evaluating code inside of the machine

So far, we have:

• A way to analyze register instructions and convert them into R code
• An object that holds register values

Now, we need to evaluate an expression inside of the register. We will use
tidy evaluation; the function eval_tidy() lets us evaluate an
R expression inside of a data object.

r_code #> if (a > 1) b + 5 else b # b + 5 r <- create_register_machine(a = 4, b = 7) rlang::eval_tidy(r_code, data = r) #> [1] 12 # just b r <- create_register_machine(a = 0, b = 7) rlang::eval_tidy(r_code, data = r) #> [1] 7

Now, we need to actually do something. We need to update the register machine
using the value from the evaluated instruction. Otherwise, the machine will just
compute values and immediately forget them.

To update the machine, we have to determine the register to update. Fortunately,
our generated code always ends with an else branch that has the target
register.

r_code #> if (a > 1) b + 5 else b

If we can pull out that symbol after the else, we will have the name of the
register to update in the machine. Because the code is so formulaic, we can just
extract the symbol directly using the code’s abstract syntax tree (AST).
pryr::call_tree() shows the AST for an expression.

pryr::call_tree(r_code) #> \- () #> \- if #> \- () #> \- > #> \- a #> \- 1 #> \- () #> \- + #> \- b #> \- 5 #> \- b

We can extract elements from the tree like elements in a list by selecting
indices.

# The numbers match one of the slashes at the first level of indentation r_code[[1]] #> if r_code[[2]] #> a > 1 # We can crawl down subtrees too r_code[[2]][[2]] #> a # But what we want is the last branch from the if level r_code[[4]] #> b

If we convert the symbol into a string, we can look it up in the register using
the usual list lookup syntax.

r <- create_register_machine(a = 4, b = 7) target <- rlang::as_string(r_code[[4]]) r[[target]] #> [1] 7

We can also use list lookup syntax with assignment to modify the register.

r[[target]] <- rlang::eval_tidy(r_code, data = r) r #> List of 3 #> $a : num 4 #>$ b : num 12 #> $.metadata: list() #> - attr(*, "class")= chr [1:2] "register_machine" "list" Let’s wrap these steps into a function. eval_instruction <- function(register_machine, instruction) { target <- rlang::as_string(instruction[[4]]) register_machine[[target]] <- rlang::eval_tidy( expr = instruction, data = register_machine) register_machine } create_register_machine(a = 2, b = 0) %>% eval_instruction(r_code) #> List of 3 #>$ a : num 2 #> $b : num 5 #>$ .metadata: list() #> - attr(*, "class")= chr [1:2] "register_machine" "list" create_register_machine(a = 2, b = 0) %>% # For quick testing, we pass in quoted expressions eval_instruction(quote(if (a > 1) b - 100 else b)) %>% # Should not run eval_instruction(quote(if (a < 1) b + 5 else b)) %>% # Should run eval_instruction(quote(if (a > 1) a + 10 else a)) #> List of 3 #> $a : num 12 #>$ b : num -100 #> $.metadata: list() #> - attr(*, "class")= chr [1:2] "register_machine" "list" Time for some extra nonstandard evaluation The code so far only works if the machine already has registers that match the registers in an instruction. Otherwise, we raise an error. create_register_machine() %>% eval_instruction(quote(if (a > 1) b - 100 else b)) #> Error in overscope_eval_next(overscope, expr): object 'a' not found # "Overscope" is the tidy evaluation term for the data context, so failing to # find the name in the data is failing to find the name in the overscope. To solve the problem, we could study the 1,000 lines of input beforehand, extract the register names, initialize them to 0 and then evaluate the instructions.1 Or… or… we could procrastinate and only initialize a register name to 0 when the machine encounters a name it doesn’t recognize. If, for some reason, our machine received instructions one at a time, like over a network connection, then the procrastinated approach seems even more reasonable. 
This latter strategy will involve some very nonstandard evaluation. I emphasize the “very” because we are changing one of the fundamental rules of R evaluation :smiling_imp:. R throws an error if you ask it to evaluate the name of a variable that doesn’t exist. But here we are going to detect those missing variables and set them to 0 before they get evaluated. To find the brand-new register names, we can inspect the call tree and find the names of the registers. We already know where the target is. The other place where names show up is in the condition of the if statement. pryr::call_tree(r_code) #> \- () #> \- if #> \- () #> \- > #> \- a #> \- 1 #> \- () #> \- + #> \- b #> \- 5 #> \- b extract_register_names <- function(instruction) { reg_target <- rlang::as_string(instruction[[4]]) reg_condition <- rlang::as_string(instruction[[2]][[2]]) list(target = reg_target, registers = unique(c(reg_target, reg_condition)) ) } extract_register_names(quote(if (a > 1) b - 100 else b)) %>% str() #> List of 2 #>$ target : chr "b" #> $registers: chr [1:2] "b" "a" # Just returns unique names extract_register_names(quote(if (b > 1) b - 100 else b)) %>% str() #> List of 2 #>$ target : chr "b" #> $registers: chr "b" We can define a helper function which checks for missing names—names that yield NULL values when we try to retrieve them—and initializes them to 0. initialize_new_registers <- function(register_machine, registers) { for (each_register in registers) { if (is.null(register_machine[[each_register]])) { register_machine[[each_register]] <- 0 } } register_machine } # Before r #> List of 3 #>$ a : num 4 #> $b : num 12 #>$ .metadata: list() #> - attr(*, "class")= chr [1:2] "register_machine" "list" initialize_new_registers(r, c("a", "b", "w", "a", "s", "j")) #> List of 6 #> $a : num 4 #>$ b : num 12 #> $.metadata: list() #>$ w : num 0 #> $s : num 0 #>$ j : num 0 #> - attr(*, "class")= chr [1:2] "register_machine" "list"

Finally, we update our evaluation function to do this step automatically. I’m
also going to add some code to record the value of the maximum register whenever
an instruction is evaluated because, ummm, that’s the whole point of the puzzle.

eval_instruction <- function(register_machine, instruction) { # Set any new registers to 0 registers <- extract_register_names(instruction) register_machine <- initialize_new_registers( register_machine, registers$registers) # Evaluate instruction register_machine[[registers$target]] <- rlang::eval_tidy( expr = instruction, data = register_machine) # Find the maximum value register_names <- setdiff(names(register_machine), ".metadata") current_max <- max(unlist(register_machine[register_names])) register_machine$.metadata$max <- current_max register_machine }

Let’s try four instructions from a clean slate.

create_register_machine() %>% # b gets 5 eval_instruction(quote(if (d < 1) b + 5 else b)) %>% # c gets 10 eval_instruction(quote(if (b > 1) c + 10 else c)) %>% # b gets 5 more eval_instruction(quote(if (a < 1) b + 5 else b)) #> List of 5 #> $.metadata:List of 1 #> ..$ max: num 10 #> $b : num 10 #>$ d : num 0 #> $c : num 10 #>$ a : num 0 #> - attr(*, "class")= chr [1:2] "register_machine" "list"

Now, for the moment of truth… Let’s process all 1,000 instructions.

r <- create_register_machine() for (each_instruction in full_input) { parsed <- each_instruction %>% parse_instruction() %>% create_r_instruction() r <- eval_instruction(r, parsed) } r #> List of 27 #> $.metadata:List of 1 #> ..$ max: num 4832 #> $kd : num -2334 #>$ gm : num -4239 #> $x : num -345 #>$ kjn : num -1813 #> $ey : num 209 #>$ n : num -764 #> $t : num 2997 #>$ f : num 4468 #> $msg : num -3906 #>$ ic : num -263 #> $zv : num -599 #>$ gub : num 2025 #> $yp : num -2530 #>$ lyr : num -2065 #> $j : num 3619 #>$ e : num -4230 #> $riz : num 863 #>$ lzd : num 4832 #> $ucy : num -3947 #>$ i : num 3448 #> $omz : num -3365 #>$ djq : num 392 #> $bxy : num 1574 #>$ tj : num 1278 #> $y : num 1521 #>$ m : num 2571 #> - attr(*, "class")= chr [1:2] "register_machine" "list"

:star: Ta-da! The maximum register value is 4,832. Problem solved!

And then the rules change

Advent of Code problems come in two parts, and we don’t learn the question
behind Part 2 until we complete Part 1. In this case, after submitting our
solution for Part 1, we receive the following requirement:

To be safe, the CPU also needs to know the highest value held in any
register during this process
so that it can decide how much memory to allocate
to these operations.

Accounting for this twist requires a small change to the evaluation code. We
add another metadata variable to track the highest value ever stored in a
register.

eval_instruction <- function(register_machine, instruction) { # Set any new registers to 0 registers <- extract_register_names(instruction) register_machine <- initialize_new_registers( register_machine, registers$registers) # Evaluate instruction register_machine[[registers$target]] <- rlang::eval_tidy( expr = instruction, data = register_machine) # Find the maximum value register_names <- setdiff(names(register_machine), ".metadata") current_max <- max(unlist(register_machine[register_names])) register_machine$.metadata$max <- current_max # Create the max-ever value if necessary if (is.null(register_machine$.metadata$max_ever)) { register_machine$.metadata$max_ever <- 0 } # Maybe update the max-ever value if (register_machine$.metadata$max_ever < current_max) { register_machine$.metadata$max_ever <- current_max } register_machine }

Admittedly, eval_instruction() is starting to get bloated. Conceptually, we
could probably break this function down into three functions: pre-evaluation
steps, evaluation, and post-evaluation steps.2 But this is good
enough for now.
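A hypothetical sketch of that decomposition (these helper names are my invention, not from the post):

```r
# pre-evaluation: e.g., make sure every register named in the
# instruction exists, initializing unseen ones to 0
pre_evaluate <- function(machine, registers) {
  for (reg in registers) {
    if (is.null(machine[[reg]])) machine[[reg]] <- 0
  }
  machine
}

# post-evaluation: e.g., refresh the metadata that tracks maxima
post_evaluate <- function(machine) {
  register_names <- setdiff(names(machine), ".metadata")
  machine$.metadata$max <- max(unlist(machine[register_names]))
  machine
}

# a toy machine in the same shape as the register_machine objects above
m <- list(a = 2, .metadata = list())
m <- pre_evaluate(m, c("a", "b"))
m <- post_evaluate(m)
m$.metadata$max  # 2 (registers: a = 2, b = 0)
```

The real eval_instruction() would then reduce to pre_evaluate, the eval_tidy() call, and post_evaluate in sequence.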

We run the instructions again to get the updated metadata.

r <- create_register_machine() for (each_instruction in full_input) { parsed <- each_instruction %>% parse_instruction() %>% create_r_instruction() r <- eval_instruction(r, parsed) } r$.metadata #>$max #> [1] 4832 #> #> $max_ever #> [1] 5443 :star2: And boom! Another problem solved. eval(thoughts, envir = this_problem) I like this kind of nonstandard evaluation approach for converting problems into R code, but it’s mostly useful when the problem describes a series of instructions that can be parsed and evaluated. For problems like this register machine simulation, the nonstandard evaluation route is straightforward. But it’s also a viable problem-solving strategy when the “machine” or the “instructions” are subtler, as in this problem about simulating “dance” moves. Odds are, you’ll never have to write an interpreter for a toy machine or language. Nevertheless, here are some R functions that we used for this puzzle that are helpful in other contexts: • stringr::str_match() to extract all the groups in a regular expression at once. • rlang::parse_expr() to convert a string of text into an R expression. • pryr::call_tree() to visualize an expression’s syntax tree and expression[[i]][[j]] to pluck out symbols from locations in a tree. • rlang::as_string() to convert a symbol into a string. • rlang::eval_tidy() to evaluate an expression inside of a data context. 1. That actually would be pretty easy. Get a dataframe with purrr::map_df(full_input, parse_instruction). Find the unique register names. Create a list of 0’s with those names. Use do.call() to call create_register_machine() with that list. With no special evaluation trickery, this approach is closer to the idea of “just running R code”. 2. If all I did for a living was write code to simulate machines or toy languages, I might try to formalize this custom evaluation process with pre-evaluation and post-evaluations “hooks” that could be arguments to a custom evaluation function. I’m just brainstorming though. 
To leave a comment for the author, please follow the link and comment on their blog: Higher Order Functions. ### StanCon is next week, Jan 10-12, 2018 Fri, 01/05/2018 - 04:00 (This article was first published on R – Statistical Modeling, Causal Inference, and Social Science, and kindly contributed to R-bloggers) It looks pretty cool! Wednesday, Jan 10 Invited Talk: Predictive information criteria in hierarchical Bayesian models for clustered data. Sophia Rabe-Hesketh and Daniel Furr (U California, Berkeley) 10:40-11:30am Does the New York City Police Department rely on quotas? Jonathan Auerbach (Columbia U) 11:30-11:50am Bayesian estimation of mechanical elastic constants. Ben Bales, Brent Goodlet, Tresa Pollock, Linda Petzold (UC Santa Barbara) 11:50am-12:10pm Joint longitudinal and time-to-event models via Stan. Sam Brilleman, Michael Crowther, Margarita Moreno-Betancur, Jacqueline Buros Novik, Rory Wolfe (Monash U, Columbia U) 12:10-12:30pm Lunch 12:30-2:00pm ScalaStan. Joe Wingbermuehle (Cibo Technologies) 2:00-2:20pm A tutorial on Hidden Markov Models using Stan. Luis Damiano, Brian Peterson, Michael Weylandt 2:20-2:40pm Student Ornstein-Uhlenbeck models served three ways (with applications for population dynamics data). Aaron Goodman (Stanford U) 2:40-3:00pm SlicStan: a blockless Stan-like language. Maria I. Gorinova, Andrew D.
Gordon, Charles Sutton (U of Edinburgh) 3:00-3:20pm Break 3:20-4:00pm Invited Talk: Talia Weiss (MIT) 4:00-4:50pm Thursday, Jan 11 Invited Talk: Sean Taylor and Ben Letham (Facebook) 10:40-11:30am NPCompare: a package for nonparametric density estimation and two populations comparison built on top of PyStan. Marco Inacio (U of São Paulo/UFSCar) 11:30-11:50am Introducing idealstan, an R package for ideal point modeling with Stan. Robert Kubinec (U of Virginia) 11:50am-12:10pm A brief history of Stan. Daniel Lee (Generable) 12:10-12:30pm Lunch 12:30-1:30pm Computing steady states with Stan’s nonlinear algebraic solver. Charles C. Margossian (Metrum, Columbia U) 1:30-1:50pm Flexible modeling of Alzheimer’s disease progression with I-Splines. Arya A. Pourzanjani, Benjamin B. Bales, Linda R. Petzold, Michael Harrington (UC Santa Barbara) 1:50-2:10pm Intrinsic Auto-Regressive (ICAR) Models for Spatial Data, Mitzi Morris (Columbia U) 2:10-2:30pm Modeling/Data Session + Classes 2:30-4:10pm Open session for consultations on modeling and data problems with Stan developers and modelers. 2:30-4:10pm Session 3 of Intro to Stan 2:30-4:10pm 2:30-3:30pm Have I converged successfully? How to verify fit and diagnose fit problems, Bob Carpenter What is new to Stan 3:30-4:10pm Invited Talk: Manuel Rivas (Stanford U) 4:00-4:50pm Friday, Jan 12 Invited Talk: Susan Holmes (Stanford U) 10:40-11:30am Aggregate random coefficients logit — a generative approach. Jim Savage, Shoshana Vasserman 11:30-11:50am The threshold test: Testing for racial bias in vehicle searches by police. Camelia Simoiu, Sam Corbett-Davies, Sharad Goel, Emma Pierson (Stanford U) 11:50am-12:10pm Assessing the safety of Rosiglitazone for the treatment of type II diabetes. Konstantinos Vamvourellis, K. Kalogeropoulos, L. Phillips (London School of Economics and Political Science) 12:10-12:30pm Lunch 12:30-1:30pm Causal inference with the g-formula in Stan. 
Leah Comment (Harvard U) 1:30-1:50pm Bayesian estimation of ETAS models with Rstan. Fausto Fabian Crespo Fernandez (Universidad San Francisco de Quito) 1:50-2:10pm Invited Talk: Andrew Gelman 2:10-3:00 (Columbia U) (virtual) Classes/Tutorials We have tutorials that start at the crack of 8am for those desiring further edification beyond the program—these do not run in parallel to the main session but do run parallel to each other: Introduction to Stan: Know how to program? Know basic statistics? Curious about Bayesian analysis and Stan? This is the course for you. Hands on, focused and an excellent way to get started working in Stan. Two hours every day, 6 hours total. Jonah Sol Gabry. Executive decision making the Bayesian way: This is for non-technical managers and technical folks who need to communicate with managers to learn the core of decision making under uncertainty. One hour every day. Jonathan Auerbach, Breck Baldwin, Eric Novik. Advanced Hierarchical Models in Stan: The hard stuff. Very interactive, very intense. Topics vary by day. Ben Goodrich. Model assessment, model selection and inference after model selection. Aki Vehtari. How to develop for Stan at the C++ level: Overview of Stan C++ architecture and build/development process for contributors. Charles Christopher Margossian. [edited by Aki: added Model selection tutorial] The post StanCon is next week, Jan 10-12, 2018 appeared first on Statistical Modeling, Causal Inference, and Social Science. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Modeling, Causal Inference, and Social Science. 
### Big cdata News Thu, 01/04/2018 - 20:04 (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers) I have some big news about our R package cdata. We have greatly improved the calling interface, and Nina Zumel has just written the definitive introduction to cdata. cdata is our general coordinatized data tool. It is what powers the deep learning performance graph (here demonstrated with R and Keras) that I announced a while ago. However, cdata is much more than that. cdata provides a family of general transforms that include pivot/unpivot (or tidyr::spread/tidyr::gather) as easy special cases. Nina refused to write the article on it until we refactored the API to be even more teachable (and therefore more learnable, and more useful). After her re-design (adding the concepts of both concrete records and abstract records to the coordinatized data theory) the system teaches itself. It is actually easy to forget you are graphically specifying potentially involved, difficult, and confusing transforms (which do not remain confusing, as the graphical specification becomes its own documenting diagram!). Don’t take my word for it; please check out Nina’s article: “Fluid data reshaping with cdata”. Also, I will be presenting a lightning talk on cdata at the January 2018 BARUG Meetup! We think we have gotten the concepts and package refined and polished to the point where they can be mastered in 15 minutes with time to spare.
As a bonus, this new version of cdata is now available on CRAN (with the old method naming still supported), and it also works at big-data scale (for DBI-adaptable databases such as Spark and PostgreSQL, an uncommon feature for a full-featured pivot/un-pivot system). To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. ### simmer 3.6.5 Thu, 01/04/2018 - 19:19 (This article was first published on R – Enchufa2, and kindly contributed to R-bloggers) The fifth update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. This release extends the attributes API by allowing users to set/get multiple attributes at once (a pretty straightforward change, as well as a useful one; I don’t know why it didn’t occur to me before…). Vectors as attributes and other data types are not supported yet, but they are on the roadmap. This version also fixes some minor bugs (many thanks to the users of the simmer-devel mailing list for taking their simulations to edge cases, where these bugs arise), deprecates the onestep() function and provides the new stepn() instead. Since onestep() serves primarily for debugging purposes, the transition to the new one may go unnoticed. Finally, there is a new vignette about the Dining Philosophers Problem.
New features: • set_attribute() (and set_global() by extension) can set multiple attributes at once by providing vectors of keys and values (or functions returning such keys and/or values). get_attribute() (and get_global() by extension) can retrieve multiple keys (#122). • New stepn() method deprecates onestep() (e452975). Minor changes and fixes: • Restore ostream after formatting (9ff11f8). • Fix arrival cloning to copy attributes over to the clone (#118). • Fix self-induced preemption through set_capacity() (#125). • Update “Queueing Systems” vignette (a0409a0, 8f03f4f). • Update “Advanced Trajectory Usage” vignette (4501927). • Fix print methods to return the object invisibly (#128). • New “Dining Philosophers Problem” vignette (ff6137e). Article originally published in Enchufa2.es: simmer 3.6.5. To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2. ### Divide and parallelize large data problems with Rcpp Thu, 01/04/2018 - 18:00 (This article was first published on Revolutions, and kindly contributed to R-bloggers) by Błażej Moska, computer science student and data science intern Got stuck with too large a dataset? R speed drives you mad? Divide, parallelize and go with Rcpp! One of the frustrating moments while working with data is when you need results urgently, but your dataset is large enough to make it impossible.
This happens often when we need to use an algorithm with high computational complexity. I will demonstrate it on an example I’ve been working with. Suppose we have a large dataset consisting of association rules. For some reason we want to slim it down. Whenever two rules’ consequents are the same and one rule’s antecedent is a subset of the second rule’s antecedent, we want to choose the smaller one (the probability of obtaining a smaller set is bigger than the probability of obtaining a bigger set). This is illustrated below: {A,B,C}=>{D} {E}=>{F} {A,B}=>{D} {A}=>{D} How can we achieve that? For example, using the pseudo-algorithm below:

For i = 1 to n:
    For j = i+1 to n:
        if consequents[i] = consequents[j]:
            # if antecedent[i] contains antecedent[j], flag rule i with 1, otherwise with 0
            # else, if antecedent[j] contains antecedent[i], flag rule j with 1, otherwise with 0

How many operations do we need to perform with this simple algorithm? For the first i we need to iterate $$n-1$$ times, for the second i $$n-2$$ times, for the third i $$n-3$$, and so on, finally reaching $$n-(n-1)$$. This leads to (proof can be found here): $\sum_{i=1}^{n-1}{i}= \frac{n(n-1)}{2}$ So the above has an asymptotic complexity of $$O(n^2)$$. It means, more or less, that the computational cost grows with the square of the size of the data. Well, for a dataset containing around 1,300,000 records this becomes a serious issue. With R I was unable to perform the computation in reasonable time. Since a compiled language performs better with simple arithmetic operations, the second idea was to use Rcpp. Yes, it is faster, to some extent — but with such a large dataframe I was still unable to get results in a satisfying time. So are there any other options? Yes, there are.
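The pseudo-algorithm above can be sketched directly in runnable form. This is an illustrative Python sketch (not the article's Rcpp implementation), with hypothetical rule data:

```python
# Quadratic pruning pass: among rules sharing a consequent, drop the rule
# whose antecedent is a proper superset of another rule's antecedent.
def prune_rules(rules):
    """rules: list of (antecedent, consequent) pairs, antecedents as frozensets."""
    n = len(rules)
    keep = [True] * n
    for i in range(n):
        for j in range(i + 1, n):
            (ai, ci), (aj, cj) = rules[i], rules[j]
            if ci != cj:
                continue
            if aj < ai:          # antecedent j is a proper subset of i
                keep[i] = False  # rule i is the bigger one: drop it
            elif ai < aj:
                keep[j] = False
    return [r for r, k in zip(rules, keep) if k]

rules = [
    (frozenset("ABC"), "D"),
    (frozenset("E"), "F"),
    (frozenset("AB"), "D"),
    (frozenset("A"), "D"),
]
print(prune_rules(rules))  # keeps {E}=>{F} and {A}=>{D}
```

The nested loops make the O(n²) cost visible: every pair of rules is compared exactly once.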
If we take a look at our dataset, we can see that it can be aggregated in such a way that each individual “chunk” consists of records with exactly the same consequent: {A,B}=>{D} {A}=>{D} {C,G}=>{F} {Y}=>{F} After such a division I got 3300 chunks, so the average number of observations per chunk was around 400. The next step was to rerun the algorithm sequentially on each chunk. Since our algorithm has quadratic complexity, it is faster to do it that way rather than on the whole dataset at once. While R failed again, Rcpp finally returned a result (after 5 minutes). But there is still room for improvement. Since our chunks can be processed independently, there is a possibility to perform the computation in parallel using, for example, the foreach package (which I demonstrated in a previous article). While passing R functions to foreach is a simple task, parallelizing Rcpp is a little more time-consuming. We need to do the steps below: 1. Create a .cpp file which includes all of the functions needed. 2. Create a package using Rcpp. This can be achieved using, for example: Rcpp.package.skeleton("nameOfYourPackage",cpp_files = "directory_of_your_cpp_file") 3. Install your Rcpp package from source: install.packages("directory_of_your_rcpp_package", repos=NULL, type="source") 4. Load your library: library(name_of_your_rcpp_package) Now you can use your Rcpp function in foreach: results=foreach(k=1:length(len), .packages=c(name_of_your_package)) %dopar% {your_cpp_function(data)} Even with foreach I waited forever for the R results, but Rcpp gave them in approximately 2.5 minutes. Not too bad! Here are some conclusions. Firstly, it’s worth knowing more languages/tools than just R. Secondly, there is often an escape from the large-dataset trap. There is little chance that somebody will do exactly the same task as mentioned in the example above, but a much higher probability that someone will face a similar problem, with a possibility to solve it in the same way.
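The chunking idea amounts to a group-by on the consequent followed by the quadratic pass within each group. Here is an illustrative Python sketch with hypothetical data (the article does this with R/Rcpp and parallelizes the chunks via foreach instead):

```python
from collections import defaultdict

def prune_chunk(antecedents):
    # Within one consequent group, drop any antecedent that has a proper
    # subset in the same group (i.e. keep the smaller rule).
    return [a for a in antecedents
            if not any(b < a for b in antecedents)]

def prune_by_chunks(rules):
    # Group rules by consequent; each chunk is independent of the others,
    # so the chunks could be handed to a process pool for parallelism.
    chunks = defaultdict(list)
    for antecedent, consequent in rules:
        chunks[consequent].append(antecedent)
    return {c: prune_chunk(ants) for c, ants in chunks.items()}

rules = [
    (frozenset("AB"), "D"), (frozenset("A"), "D"),
    (frozenset("CG"), "F"), (frozenset("Y"), "F"),
]
result = prune_by_chunks(rules)
```

With n rules split into k roughly equal chunks, each quadratic pass runs on about n/k records, so the total work drops from n² to roughly n²/k even before any parallelism is added.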
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. ### The Markov Chain: How Do I Compute the Stationary Distribution? Thu, 01/04/2018 - 15:51 (This article was first published on R-Programming – Statistik Service, and kindly contributed to R-bloggers) In this article we would like to introduce the concept of the Markov chain, illustrate its foundations, and point out several possible areas of application in which you can profit from targeted statistical programming of Markov chains. A Markov chain is a stochastic process with a wide variety of applications in nature, technology, and business. Classic business applications are the modeling of queues and exchange rates. In engineering there are likewise numerous fields of use, for example modeling the behavior of dams and of cruise-control systems in motor vehicles, as well as the analytical evaluation of mobility algorithms such as the random walk. In the natural sciences this process is used in population dynamics to predict the population growth of humans and animals, and in a modified form for Brownian motion.
Particularly relevant in practice is the statistical programming and simulation of the stationary distribution with the statistics software R, which is carried out below using an illustrative example. First, however, the terms required for the computation are explained. What are a Markov chain and a stationary distribution? There is a multitude of different stochastic processes, which are divided into various categories. A well-known type of stochastic process that is used particularly often in practice is the Markov chain, named after the mathematician Andrei Andreyevich Markov. It consists of a state space, an index set, a start distribution, and the transition probabilities. The state space determines which states can be taken. Here one distinguishes between a continuous state space, which contains uncountably many states, and a discrete state space, which contains at most countably many states. An important special case of the latter is a state space with finitely many states, on which we will concentrate. A stochastic process changes its state over the course of time. Analogously to the state space, the index set (which contains the points in time) is also classified as either continuous or discrete. A continuous index set comes into consideration, for example, for Brownian motion, because the molecules are in constant motion and can change their direction and speed within the smallest time intervals. We, however, use the discrete index set of the natural numbers N0={0,1,2,3,4,5,…}, including the start time 0, with an arbitrary time unit. Between two consecutive points in time the state thus remains constant. The Markov chain has to start in some state. This can be a fixed, pre-specified state or a randomly chosen one.
The start distribution determines with which probability the stochastic process starts in which state. Not only the start distribution is random; so is the further behavior. The transition probabilities determine with which probability the process moves to which state. We consider the standard case of a first-order Markov chain, in which the transition probabilities depend solely on the current state. Moreover, the Markov chain is homogeneous, meaning the transition probabilities are constant and do not change over time. The transition probabilities can therefore be displayed in a transition matrix. The state distribution depends on the respective point in time. In some cases the state distribution (which contains the occupation probabilities of the states at a time n) converges to an equilibrium distribution, which is also called the stationary distribution. In the following sections you will learn, by means of an example, not only the criteria for existence and uniqueness of the stationary distribution, but also the analytical solution and how to carry out the statistical programming and simulation with the statistics software R. Conditions for existence and uniqueness of the stationary distribution The game character Pac-Man eats small pellets and cherries in a maze while being chased by ghosts. For simplicity, the game world in this example is a small 3×3 grid, and the monsters move purely at random. Every horizontally and vertically adjacent square is the ghost's next location with equal probability, with the exception of a secret passage between states 2 and 8. The transition graph shown below contains, as examples, the transition probabilities of states 1, 5, 6, and 8.
To provide variety, the monsters' start location is chosen at random, with every square equally likely. The start vector on the state space Z = {1,2,3,4,5,6,7,8,9}, written as a row vector, is therefore π0 = 1/9 (1,1,1,1,1,1,1,1,1). The i-th row and j-th column of the transition matrix P shown below contains the transition probability from the i-th to the j-th state. Entries with probability 0 have been removed for better readability: At the beginning (at time 0) every state in this example is still equally likely; the state distribution at the start can be read directly from the start vector. And what does the state distribution look like after one time unit? To answer this question, multiply the start vector by the transition matrix P. For the second point in time, multiply the resulting row vector by the transition matrix P once more, and so on. In general, you obtain the occupation probabilities of the individual states at time n via the formula: πn=π0⋅Pn Under certain conditions the state distribution πn converges to the stationary distribution π. In our example with a finite state space, the Markov chain must be irreducible and aperiodic for this. So what do the terms irreducibility and aperiodicity mean? For the type of Markov chain we consider, irreducibility holds if every state can be reached from every other state in finite time. Aperiodicity holds if the greatest common divisor of all return times to the original state is 1, and this must hold for every state. Let us check the two conditions: our Markov chain is irreducible, since the ghosts can move from any state to any other state in finite time. Thanks to the secret passage, at most three state changes are needed for this.
The secret passage also makes the Markov chain aperiodic, because the monsters can move from any state to any state in both an even and an odd number of state changes, so the greatest common divisor of these return times is 1. Without the secret passage the Markov chain would be periodic, because a transition from an even-numbered to an even-numbered state, or from an odd-numbered to an odd-numbered state, would then only be possible in an even number of state changes, making the greatest common divisor 2. The conditions for existence and uniqueness of the stationary distribution are thus satisfied. Because of the irreducibility and aperiodicity there is exactly one stable stationary distribution, which the Markov chain attains after an infinitely long time. This means that after a long time the occupation probabilities of the individual states (almost) stop changing. For this reason the matrix powers converge as well. But how can you carry out the statistical programming and simulation of the stationary distribution with the statistics software R? You will find out in the following two sections of this article. Mathematical computation of the stationary distribution As we have seen, a unique equilibrium distribution, also called the stationary distribution, exists. In this section you will learn how to compute this distribution mathematically. This tells you with which probability the monsters stay, in the long run, in which states (or locations). The stationary distribution of the Markov chain as a stochastic process can be determined naively by plugging a large n into the equation πn=π0⋅Pn, because the matrix powers converge owing to irreducibility and aperiodicity, as we have already found out.
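The naive approach just described, plugging a large n into πn = π0·Pn, can be sketched in plain Python (an illustration only; the article itself works in R). The matrix below is the same 9-state transition matrix entered into R further down:

```python
# Transition matrix of the 9-state Pac-Man grid (states 1..9 row-major),
# including the secret passage between states 2 and 8.
P = [
    [0,   1/2, 0,   1/2, 0,   0,   0,   0,   0  ],
    [1/4, 0,   1/4, 0,   1/4, 0,   0,   1/4, 0  ],
    [0,   1/2, 0,   0,   0,   1/2, 0,   0,   0  ],
    [1/3, 0,   0,   0,   1/3, 0,   1/3, 0,   0  ],
    [0,   1/4, 0,   1/4, 0,   1/4, 0,   1/4, 0  ],
    [0,   0,   1/3, 0,   1/3, 0,   0,   0,   1/3],
    [0,   0,   0,   1/2, 0,   0,   0,   1/2, 0  ],
    [0,   1/4, 0,   0,   1/4, 0,   1/4, 0,   1/4],
    [0,   0,   0,   0,   0,   1/2, 0,   1/2, 0  ],
]

def step(pi, P):
    # One application of pi_{n+1} = pi_n * P (row vector times matrix).
    return [sum(pi[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

pi = [1 / 9] * 9          # uniform start distribution pi_0
for _ in range(1000):     # iterate: pi_n = pi_0 * P^n for a large n
    pi = step(pi, P)
print([round(p, 3) for p in pi])
```

For this random walk the limit is proportional to each square's number of exits (2, 4, 2, 3, 4, 3, 2, 4, 2, divided by their total 26), i.e. roughly 7.7% for corners, 15.4% for the center and the passage squares, and 11.5% for the remaining edge squares.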
To compute an analytical solution, the linear system π=π⋅P has to be solved for π, subject to the constraint that the entries of π sum to 1, so that we actually obtain a probability distribution. Plugging the naive solution into this system then serves as a check. The system π=π⋅P without the constraint is equivalent to (PT-I)⋅πT=0. The transition matrix is therefore transposed and the identity matrix subtracted. The vector of state probabilities we are looking for is now a column vector. We thus have to solve a linear system which, including the constraint, has one equation more than the Markov chain has states. With a large number of states it is very tedious to compute the solution by hand, so we now carry out the statistical programming with the statistics software R. An alternative would be an MCMC simulation (MCMC stands for Markov chain Monte Carlo). A simulation is a sensible alternative if a stochastic process has, for example, so many states that the analytical computation would be numerically too expensive. We dispense with it here, however, because our Markov chain has only 9 states. Computing the stationary distribution via statistical programming First we enter the transition matrix P into the statistics software R. P = matrix ( c ( 0 , 1/2 , 0 , 1/2 , 0 , 0 , 0 , 0 , 0 , 1/4 , 0 , 1/4 , 0 , 1/4 , 0 , 0 , 1/4 , 0 , 0 , 1/2 , 0 , 0 , 0 , 1/2 , 0 , 0 , 0 , 1/3 , 0 , 0 , 0 , 1/3 , 0 , 1/3 , 0 , 0 , 0 , 1/4 , 0 , 1/4 , 0 , 1/4 , 0 , 1/4 , 0 , 0 , 0 , 1/3 , 0 , 1/3 , 0 , 0 , 0 , 1/3 , 0 , 0 , 0 , 1/2 , 0 , 0 , 0 , 1/2 , 0 , 0 , 1/4 , 0 , 0 , 1/4 , 0 , 1/4 , 0 , 1/4 , 0 , 0 , 0 , 0 , 0 , 1/2 , 0 , 1/2 , 0 ) , byrow = T , ncol = 9) P A transition matrix contains the transition probabilities as entries, and these must take values between 0 and 1.
Whether that is the case can be checked for each entry of the matrix individually, P >= 0 & P <= 1 or one can display directly whether all entries are plausible. P >= 0 && P <= 1 In addition, the transition probabilities leaving each state should sum to 1. apply ( P , MARGIN = 1, FUN = sum ) The transition matrix P is transposed and the identity matrix subtracted. P.T.minus.I = t ( P ) - diag ( 9 ) Now it is important not to forget the constraint, since otherwise we get no result. The stationary distribution is a probability distribution, and as such the sum over all states of the stationary distribution must be 1. We therefore append a row of ones to the matrix P.T.minus.I. P.T.minus.I.mit.Nebenbedingung = rbind ( P.T.minus.I , 1 ) P.T.minus.I.mit.Nebenbedingung On the other side of the system stands the zero vector. Because of the constraint we must append a one. To obtain a column vector, we use a one-column matrix as the data type. Nullvektor = matrix ( c ( rep ( 0 , 9 ) , 1 ) , ncol = 1 ) Nullvektor The linear system extended by the constraint is no longer square; it contains one condition more than it has variables. We therefore cannot use the standard command solve; instead we use the command Solve (with a capital S) from the package limSolve, which we have to install first. After the installation we can load the package with library ( limSolve ).
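The same pipeline (transpose, subtract the identity matrix, append a normalization row, solve) can also be sketched in pure Python with exact fractions. This is an illustrative cross-check, not the article's code; the matrix is rebuilt from the grid's adjacency, which yields the same P as above, since a redundant balance equation can be replaced by the constraint:

```python
from fractions import Fraction as F

# Rebuild P from the grid's adjacency: edges of the 3x3 board plus the
# secret passage between states 2 and 8 (states numbered 1..9 row-major).
edges = [(1, 2), (2, 3), (4, 5), (5, 6), (7, 8), (8, 9),   # horizontal
         (1, 4), (2, 5), (3, 6), (4, 7), (5, 8), (6, 9),   # vertical
         (2, 8)]                                           # secret passage
adj = [[0] * 9 for _ in range(9)]
for i, j in edges:
    adj[i - 1][j - 1] = adj[j - 1][i - 1] = 1
deg = [sum(row) for row in adj]
P = [[F(adj[i][j], deg[i]) for j in range(9)] for i in range(9)]

# Constrained system: rows 1..8 of (P^T - I) pi = 0, plus sum(pi) = 1.
# One balance equation is redundant and is replaced by the constraint.
M = [[P[j][i] - (1 if i == j else 0) for j in range(9)] for i in range(8)]
M.append([F(1)] * 9)
b = [F(0)] * 8 + [F(1)]

# Small Gauss-Jordan elimination with exact fractions.
for col in range(9):
    piv = next(r for r in range(col, 9) if M[r][col] != 0)
    M[col], M[piv] = M[piv], M[col]
    b[col], b[piv] = b[piv], b[col]
    for r in range(9):
        if r != col and M[r][col] != 0:
            f = M[r][col] / M[col][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[col])]
            b[r] -= f * b[col]

pi = [b[r] / M[r][r] for r in range(9)]
print([float(x) for x in pi])  # proportional to the exit counts 2,4,2,3,4,3,2,4,2
```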
We pass the matrix and the vector into the function and obtain: Gleichgewichtsverteilung = Solve ( P.T.minus.I.mit.Nebenbedingung , Nullvektor ) Gleichgewichtsverteilung Finally, we check whether we have in fact obtained a valid probability distribution: Gleichgewichtsverteilung >= 0 & Gleichgewichtsverteilung <= 1 Gleichgewichtsverteilung >= 0 && Gleichgewichtsverteilung <= 1 sum ( Gleichgewichtsverteilung ) Solution for the stationary distribution The solution of the linear system with the statistics software R yields (in percent): π=(7.7,15.4,7.7,11.5,15.4,11.5,7.7,15.4,7.7) The ghosts therefore stay most often in the middle, less often at the edges, and most rarely in the corners. An exception are the edge states 2 and 8, which, thanks to the secret passage, are on average visited just as often as the central square. The occupation probabilities of the states are proportional to the number of incoming arrows: the more one-step paths lead to a state, the more often that state is visited in the long run. How do I use the Markov chain as a stochastic process in business? There are numerous applications of Markov chains in business; one example is presented below. In stock trading one is often particularly interested in predicting how the stock market will develop. The situation on the exchange is divided into three main phases: bull market, bear market, and stagnant market. The transition graph shown below contains the weekly transition probabilities between the three phases, where each phase always persists for at least one week. With the stationary distribution you can now compute with which probability the stock market is, in the long run, in which state.
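The transition graph for the three market phases is not reproduced in this text version. As an illustration only, assume the widely cited three-state example matrix below; these probabilities are an assumption on my part, not taken from the article's graph, but they reproduce the long-run shares the article quotes:

```python
# States: bull, bear, stagnant; rows hold the weekly transition
# probabilities (assumed example values, not from the article's graph).
P = [
    [0.90, 0.075, 0.025],
    [0.15, 0.80,  0.05],
    [0.25, 0.25,  0.50],
]
pi = [1 / 3, 1 / 3, 1 / 3]       # any start distribution works
for _ in range(200):             # iterate pi <- pi * P until it settles
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
print([round(p, 4) for p in pi])
```

With these assumed values the loop settles at (0.625, 0.3125, 0.0625), matching the 62.5% / 31.25% / 6.25% split quoted in the article.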
If the methodology of Markov chains described above is applied to this business example, the long-run result for the stock market is 62.5% of weeks in a bull market, 31.25% of weeks in a bear market, and 6.25% of weeks in a stagnant market. We hope this article has brought you closer to the topic of Markov chains and the stationary distribution, and that you will be able to use them in the future to solve mathematical problems or questions in a business context. Should you need professional support along the way, the professional statisticians of Novustat are available for statistical programming or simulations. The post Die Markov Kette: Wie berechne ich die Gleichgewichtsverteilung? first appeared on Statistik Service. To leave a comment for the author, please follow the link and comment on their blog: R-Programming – Statistik Service. ### Faster Than Excel: Painlessly Merge Data into Actuarial Loss Development Triangles with R Thu, 01/04/2018 - 15:00 (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers) The objective of this post is to create a method which easily combines loss runs, or listings of insurance claims, into triangles.
Using only Excel, the common method is to create links between the Excel files, which must be updated manually for each new evaluation. This is prone to human error and time-consuming. Using a script to merge the files first and then create a triangle saves time and is more consistent. Example of the conventional linked Excel triangle: for accident year 2016, all the diagonal values need to be linked to a data file that was last evaluated in 2016. For 2015, the links go to the 2015 file; for 2014, the 2014 file; etc. This is time-consuming. See Wikipedia for a definition of loss development triangles and why they are useful. Methods I use the packages plyr, dplyr, tidyr, and lubridate to manipulate the data once it is loaded into R. The package readxl reads the Excel files from the Windows file directory. It is worth noting that all five of these packages are included in Hadley Wickham’s tidyverse package. The package ChainLadder has a variety of tools for actuarial reserve analysis. library(dplyr) library(readxl) library(plyr) library(tidyr) library(ChainLadder) library(lubridate) Step 1: Organize the Excel Files Copy all of the Excel files into the same Windows folder, inside the working directory of the R project. It is important to have *only* the files that are going to be loaded into R in this folder. On my computer, my file structure is as follows: C:/Users/Sam/Desktop/[R project working directory]/[folder with excel claims listing files to aggregate]/[file 1, file 2, file 3, etc] Using the command list.files(), R automatically looks in the working directory and returns a vector of all the file names which it sees.
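For readers outside R, the same directory listing step can be sketched in Python; this is a self-contained demo that creates a throwaway folder with hypothetical file names rather than touching a real project directory:

```python
import os
import tempfile

# Demo: create a folder with a few files, then list them the way R's
# list.files() would (names only, sorted).
wd_path = tempfile.mkdtemp()
for name in ["claims listing 2011.xlsx", "claims listing 2012.xlsx"]:
    open(os.path.join(wd_path, name), "w").close()

file_names = sorted(os.listdir(wd_path))
file_paths = [os.path.join(wd_path, f) for f in file_names]
print(file_names)  # ['claims listing 2011.xlsx', 'claims listing 2012.xlsx']
```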
For example, R sees the data files in the current working directory: wd_path = "C:/Users/Sam/Desktop/projects/Excel Aggregate Loop/excel files" list.files(wd_path) ## [1] "claims listing 2011.xlsx" "claims listing 2012.xlsx" ## [3] "claims listing 2013.xlsx" "claims listing 2014.xlsx" ## [5] "claims listing 2015.xlsx" "claims listing 2016.xlsx" R can then loop through each of these files and perform any action. If there were 100 files, or 1000 files, in this directory, this would still work. file_names = list.files(wd_path) file_paths = paste(wd_path, "/", file_names, sep = "") head(read_excel(file_paths[4]) %>% select(-3,-4,-5)) ## # A tibble: 4 x 6 ## name license number age file year loss date paid ## ## 1 jeff 3 23 2014 2011-01-01 400 ## 2 sue 3 43 2014 2012-01-01 400 ## 3 mark 2 55 2014 2013-01-01 200 ## 4 sarah 1 100 2014 2014-01-01 500 In order to evaluate the age of the losses, we need to take into account when each loss was evaluated. This is accomplished by going into Excel and adding a column for “file year”, which specifies the year of evaluation of the file. For instance, in the “claims listing 2013” file, all of the claims have a “2013” in the “file year” column. Step 2: Load the Data into R Initialize a data frame which will store the aggregated loss run data from each of the Excel files. **The names of this data frame need to be the names of the Excel file columns which need to be aggregated.** For instance, these could be “reported”, “Paid Loss”, “Case Reserves”, or “Incurred Loss”. If the Excel files have different names for the same quantities (i.e., “Paid Loss” vs. “paid loss”), then they should be renamed within Excel first. merged_data = as_data_frame(matrix(nrow = 0, ncol = 3)) names(merged_data) = c("file year", "loss date", "paid") Someone once said “if you need to use a ‘for’ loop in R, then you are doing something wrong”. Vectorized functions are faster and easier to implement.
The function excel_file_extractor below takes in the file name of the Excel file and returns a data frame with only the columns which are selected. excel_file_extractor = function(cur_file_name){ read_excel(cur_file_name[1], sheet = "Sheet1", col_names = T) %>% select(`file year`, `loss date`, paid) %>% rbind(merged_data) } Apply the function to all of the files in the folder that you created. Obviously, if you had 100 Excel files this would still work just as effectively. From the plyr package, ldply takes in a list and returns a data frame. The way to remember this is by the ordering of the letters (“list”-“data frame”-“ply”). For example, if we wanted to read in a data frame and return a data frame, it would be ddply. loss_run_data = ldply(file_paths, excel_file_extractor) names(loss_run_data) = c("file_year", "loss_date", "paid losses") The data now has only the columns that we selected, despite the fact that the loss run files had different columns in each of the files. glimpse(loss_run_data) ## Variables: 3 ## $ file_year 2011, 2012, 2012, 2013, 2013, 2013, 2014, 2014, 20... ## $ loss_date 2011-01-01, 2011-01-01, 2012-01-01, 2011-01-01, 2... ## $ paid losses 100, 200, 300, 300, 350, 100, 400, 400, 200, 500, ... head(loss_run_data) ## file_year loss_date paid losses ## 1 2011 2011-01-01 100 ## 2 2012 2011-01-01 200 ## 3 2012 2012-01-01 300 ## 4 2013 2011-01-01 300 ## 5 2013 2012-01-01 350 ## 6 2013 2013-01-01 100 Step 3: Create Development Triangles

Finally, once we have the loss run combined, we just need to create a triangle. This is made easy by the as.triangle function from the ChainLadder package.

The only manual labor required in excel was to go into each file and create the file year column, which was just the year of evaluation of each loss run file.

loss_run_data = loss_run_data %>%
  mutate(accident_year = as.numeric(year(ymd(loss_date))),   # year() and ymd() are from lubridate
         maturity_in_months = (file_year - accident_year) * 12)

merged_triangle = as.triangle(loss_run_data,
                              dev = "maturity_in_months",
                              origin = "accident_year",
                              value = "paid losses")
merged_triangle
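The as.triangle step is essentially a pivot: accident year (origin) down the rows, maturity (development) across the columns. A rough pure-Python sketch of that reshaping, using the toy amounts from the loss run above:

```python
# Each record: (file_year, accident_year, paid);
# maturity = (file_year - accident_year) * 12, as in the R code.
records = [
    (2011, 2011, 100),
    (2012, 2011, 200), (2012, 2012, 300),
    (2013, 2011, 300), (2013, 2012, 350), (2013, 2013, 100),
]

# Pivot into a dict-of-dicts triangle: triangle[origin][maturity] = paid
triangle = {}
for file_year, accident_year, paid in records:
    maturity = (file_year - accident_year) * 12
    triangle.setdefault(accident_year, {})[maturity] = paid

print(triangle[2011])  # {0: 100, 12: 200, 24: 300}
```

Each accident-year row fills in one more development column per evaluation year, producing the familiar staircase shape of a run-off triangle.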

Within the package ChainLadder is a plot function, which shows the development of losses by accident year. Because these are arbitrary amounts, the plot is not realistic.

plot(merged_triangle, main = "Paid Losses vs Maturity by Accident Year", xlab = "Maturity in Months", ylab = "Paid Losses")

Conclusion

When it comes to aggregating excel files, R can be faster and more consistent than linking together each of the excel files, and once this script is in place, making modifications to the data can be done easily by editing the excel_file_extractor function.


To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Deep Learning from first principles in Python, R and Octave – Part 1

Thu, 01/04/2018 - 13:42

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

“You don’t perceive objects as they are. You perceive them as you are.”
“Your interpretation of physical objects has everything to do with the historical trajectory of your brain – and little to do with the objects themselves.”
“The brain generates its own reality, even before it receives information coming in from the eyes and the other senses. This is known as the internal model”

David Eagleman - The Brain: The Story of You

This is the first in a series of posts I intend to write on Deep Learning. This post is inspired by the Deep Learning Specialization by Prof Andrew Ng on Coursera and Neural Networks for Machine Learning by Prof Geoffrey Hinton, also on Coursera. In this post I implement Logistic Regression as a 2 layer Neural Network, i.e. a Neural Network that has just an input layer and an output layer, with no hidden layer. I am certain that any self-respecting Deep Learning/Neural Network practitioner would consider a Neural Network without hidden layers as no Neural Network at all!

This 2 layer network is implemented in Python, R and Octave. I have included Octave in the mix, as Octave is a close cousin of Matlab. The implementations in Python, R and Octave are equivalent vectorized implementations, so if you are familiar with any one of the languages, you should be able to follow the corresponding code in the other two. You can download the R Markdown file and the Octave code from DeepLearning - Part 1.

To start with, Logistic Regression is performed using sklearn’s logistic regression package for the cancer data set also from sklearn. This is shown below

1. Logistic Regression

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
# Fit the Logistic Regression classifier
clf = LogisticRegression().fit(X_train, y_train)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))

## Accuracy of Logistic regression classifier on training set: 0.96
## Accuracy of Logistic regression classifier on test set: 0.96

To check on other classification algorithms, see my post Practical Machine Learning with R and Python – Part 2. You could also take a look at my book, available on Amazon, in which I implement the most popular Machine Learning algorithms in both R and Python – 'Practical Machine Learning with R and Python'.

2. Logistic Regression as a 2 layer Neural Network

In the following section Logistic Regression is implemented as a 2 layer Neural Network in Python, R and Octave. The same cancer data set from sklearn will be used to train and test the Neural Network in Python, R and Octave. This can be represented diagrammatically as below

The cancer data set has 30 input features, and the target variable ‘output’ is either 0 or 1. Hence the sigmoid activation function will be used in the output layer for classification.

This simple 2 layer Neural Network is shown below
At the input layer there are 30 features, with corresponding weights w which are initialized to small random values. The output of the layer is

Z = w^T X + b

where 'b' is the bias term.

The activation function is the sigmoid function

a = sigmoid(Z) = 1 / (1 + e^(-Z))

The Loss, when the sigmoid function is used in the output layer, is given by

L = -1/m * Σ [ y*log(a) + (1-y)*log(1-a) ]    -(1)

where m is the number of training examples.

In the forward propagation cycle of the Neural Network, the output Z and the output of the activation function (the sigmoid function) are first computed. Then, using the known output 'y' for the given features, the 'Loss' is computed using equation (1) above.
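A minimal numeric sketch of this forward pass (in Python, with made-up weights and a single training example, so m = 1):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy forward pass: 3 features, one training example (hypothetical numbers)
w = [0.1, -0.2, 0.05]
b = 0.0
x = [1.0, 2.0, 3.0]
y = 1

Z = sum(wi * xi for wi, xi in zip(w, x)) + b            # Z = w'x + b
a = sigmoid(Z)                                          # activation
loss = -(y * math.log(a) + (1 - y) * math.log(1 - a))   # equation (1) with m = 1

print(round(Z, 3), round(a, 3), round(loss, 3))
```

Since a < 0.5 while y = 1, the example is misclassified and the loss is correspondingly large; gradient descent will push w in a direction that raises a.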

Backward propagation

The backward propagation cycle determines how the 'Loss' changes for small variations in the parameters, from the output layer back to the input layer. In other words, backward propagation computes the changes in the weights at the input layer which will minimize the loss. Several cycles of gradient descent are performed in the direction of steepest descent to find the local minimum: the set of weights and biases at the input layer which results in the lowest loss. At each step the weights are adjusted by an amount scaled by a parameter known as the 'learning rate'. Too big a 'learning rate' can overshoot the local minimum, and too small a 'learning rate' can take a long time to reach it. This is done over 'm' training examples.
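The effect of the learning rate is easy to demonstrate on a toy one-dimensional loss f(w) = (w - 3)^2, whose gradient is 2(w - 3). This is a hypothetical illustration, not part of the original derivation:

```python
def descend(lr, steps=100, w=0.0):
    # Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3
    for _ in range(steps):
        w = w - lr * 2 * (w - 3)
    return w

small = descend(lr=0.1)    # converges close to 3
too_big = descend(lr=1.1)  # each step overshoots the minimum and diverges

print(round(small, 4), abs(too_big - 3) > 1e6)
```

With lr = 0.1 the distance to the minimum shrinks by a factor 0.8 per step; with lr = 1.1 it grows by a factor 1.2 per step, so the iterates blow up.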

Chain rule of differentiation
Let y = f(u) and u = g(x); then

dy/dx = dy/du * du/dx

Derivative of sigmoid

Let a = 1 / (1 + e^(-Z)) and u = 1 + e^(-Z), so that a = 1/u.

Using the chain rule of differentiation we get

da/dZ = da/du * du/dZ = (-1/u^2) * (-e^(-Z)) = e^(-Z) / (1 + e^(-Z))^2

Therefore

da/dZ = a * (1 - a)    -(2)
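Equation (2) can be checked numerically: the finite-difference slope of the sigmoid should agree with a(1 - a). A quick Python sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.7
h = 1e-6
# Central finite difference approximates the derivative of sigmoid at z
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
a = sigmoid(z)
analytic = a * (1 - a)  # equation (2)

print(abs(numeric - analytic) < 1e-8)  # True
```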

The 3 equations for the 2 layer Neural Network representation of Logistic Regression are

Z = w^T X + b    -(a)
a = sigmoid(Z) = 1 / (1 + e^(-Z))    -(b)
L = -1/m * Σ [ y*log(a) + (1-y)*log(1-a) ]    -(c)

The back propagation step requires the computation of dL/da and dL/dZ. In the case of regression it would be dE/da and dE/dZ, where E is the Mean Squared Error function.
Computing the derivatives for back propagation we have

dL/da = -y/a + (1-y)/(1-a)    -(d)

because d(log a)/da = 1/a.
Also from equation (2) we get

da/dZ = a * (1 - a)    -(e)

By chain rule

dL/dZ = dL/da * da/dZ

therefore substituting the results of (d) & (e) we get

dL/dZ = a - y    -(f)

Finally

dL/dw = dL/dZ * dZ/dw    -(g)
dZ/dw = X    -(h)

and from (f) we have dL/dZ = a - y. Therefore (g) reduces to

dw = 1/m * X * (a - y)^T    -(i)

Also

db = 1/m * Σ (a - y)    -(j)

since dZ/db = 1.

Gradient descent then updates the weights at the input layer and the corresponding bias using the values of dw and db.
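The key results dL/dZ = a - y, and hence dw = 1/m * X * dZ^T, can be gradient-checked on toy data. The following Python sketch uses a single feature and made-up values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, xs, ys):
    # Cross-entropy loss over m single-feature examples, equation (c)
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        a = sigmoid(w * x + b)
        total += y * math.log(a) + (1 - y) * math.log(1 - a)
    return -total / m

xs = [0.5, -1.2, 2.0]   # hypothetical feature values
ys = [1, 0, 1]          # hypothetical labels
w, b = 0.3, -0.1

# Analytic gradient: dw = 1/m * sum(x * (a - y)), the scalar form of (i)
m = len(xs)
dw = sum(x * (sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / m

# Numeric gradient by central finite difference
h = 1e-6
dw_num = (loss(w + h, b, xs, ys) - loss(w - h, b, xs, ys)) / (2 * h)

print(abs(dw - dw_num) < 1e-6)  # True
```

The same check works for db with the perturbation applied to b instead of w, since dZ/db = 1.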

I found the computation graph representation in the book 'Deep Learning' by Ian Goodfellow, Yoshua Bengio and Aaron Courville very useful for visualizing and computing the backward propagation. For the 2 layer Neural Network of Logistic Regression the computation graph is shown below

Note: It can be seen that the accuracy on the training and test sets is 90.37% and 89.51% respectively. This is noticeably poorer than the 96% which sklearn's logistic regression achieves! But this is mainly because of the absence of hidden layers, which are the real power of neural networks.

4. Neural Network for Logistic Regression – R code (vectorized)

source("RFunctions-1.R")
# Define the sigmoid function
sigmoid <- function(z){
  a <- 1/(1 + exp(-z))
  a
}

# Compute the loss
computeLoss <- function(numTraining, Y, A){
  loss <- -1/numTraining * sum(Y*log(A) + (1-Y)*log(1-A))
  return(loss)
}

# Compute forward propagation
forwardPropagation <- function(w, b, X, Y){
  # Compute Z
  Z <- t(w) %*% X + b
  # Set the number of samples
  numTraining <- ncol(X)
  # Compute the activation function
  A <- sigmoid(Z)
  # Compute the loss
  loss <- computeLoss(numTraining, Y, A)
  # Compute the gradients dZ, dw and db
  dZ <- A - Y
  dw <- 1/numTraining * X %*% t(dZ)
  db <- 1/numTraining * sum(dZ)
  fwdProp <- list("loss" = loss, "dw" = dw, "db" = db)
  return(fwdProp)
}

# Perform gradient descent
gradientDescent <- function(w, b, X, Y, numIterations, learningRate){
  losses <- NULL
  idx <- NULL
  # Loop through the number of iterations
  for(i in 1:numIterations){
    fwdProp <- forwardPropagation(w, b, X, Y)
    # Get the derivatives
    dw <- fwdProp$dw
    db <- fwdProp$db
    # Perform the gradient descent update
    w <- w - learningRate*dw
    b <- b - learningRate*db
    l <- fwdProp$loss
    # Store the loss every 100 iterations
    if(i %% 100 == 0){
      idx <- c(idx, i)
      losses <- c(losses, l)
    }
  }
  # Return the weights and losses
  gradDescnt <- list("w" = w, "b" = b, "dw" = dw, "db" = db, "losses" = losses, "idx" = idx)
  return(gradDescnt)
}

# Compute the predicted value for input
predict <- function(w, b, X){
  m <- dim(X)[2]
  # Create a vector of 0's
  yPredicted <- matrix(rep(0, m), nrow = 1, ncol = m)
  Z <- t(w) %*% X + b
  # Compute sigmoid
  A <- sigmoid(Z)
  for(i in 1:dim(A)[2]){
    # If A > 0.5 set value as 1, else 0
    if(A[1, i] > 0.5)
      yPredicted[1, i] <- 1
    else
      yPredicted[1, i] <- 0
  }
  return(yPredicted)
}

# Normalize the matrix rows
normalize <- function(x){
  # Compute the norm of each row (square, sum rows, square root)
  n <- as.matrix(sqrt(rowSums(x^2)))
  # Sweep each row by its norm. Note the '1', which makes sweep operate on every row
  normalized <- sweep(x, 1, n, FUN = "/")
  return(normalized)
}

# Run the 2 layer Neural Network on the cancer data set
# Read the data (from sklearn)
cancer <- read.csv("cancer.csv")
# Rename the target variable
names(cancer) <- c(seq(1, 30), "output")
# Split as training and test sets
train_idx <- trainTestSplit(cancer, trainPercent = 75, seed = 5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]
# Set the features and the target
X_train <- train[, 1:30]
y_train <- train[, 31]
X_test <- test[, 1:30]
y_test <- test[, 31]
# Create a matrix of 0's with the number of features
w <- matrix(rep(0, dim(X_train)[2]))
b <- 0
X_train1 <- normalize(X_train)
X_train2 <- t(X_train1)
# Reshape then transpose
y_train1 <- as.matrix(y_train)
y_train2 <- t(y_train1)
# Perform gradient descent
gradDescent <- gradientDescent(w, b, X_train2, y_train2, numIterations = 3000, learningRate = 0.77)
# Normalize X_test
X_test1 <- normalize(X_test)
# Transpose X_test so that we have a matrix as (features, numSamples)
X_test2 <- t(X_test1)
# Reshape y_test and take the transpose
y_test1 <- as.matrix(y_test)
y_test2 <- t(y_test1)
# Use the weights generated by gradient descent to predict
yPredictionTest <- predict(gradDescent$w, gradDescent$b, X_test2)
yPredictionTrain <- predict(gradDescent$w, gradDescent$b, X_train2)
sprintf("Train accuracy: %f", (100 - mean(abs(yPredictionTrain - y_train2)) * 100))
## [1] "Train accuracy: 90.845070"
sprintf("test accuracy: %f", (100 - mean(abs(yPredictionTest - y_test)) * 100))
## [1] "test accuracy: 87.323944"
df <- data.frame(gradDescent$idx, gradDescent$losses)
names(df) <- c("iterations", "losses")
ggplot(df, aes(x = iterations, y = losses)) + geom_point() + geom_line(col = "blue") +
  ggtitle("Gradient Descent - Losses vs No of Iterations") +
  xlab("No of iterations") + ylab("Losses")

5. Neural Network for Logistic Regression – Octave code (vectorized)

1;
# Define sigmoid function
function a = sigmoid(z)
a = 1 ./ (1+ exp(-z));
end
# Compute the loss
function loss=computeLoss(numtraining,Y,A)
loss = -1/numtraining * sum((Y .* log(A)) + (1-Y) .* log(1-A));
end

# Perform forward propagation
function [loss,dw,db,dZ] = forwardPropagation(w,b,X,Y)
% Compute Z
Z = w' * X + b;
numtraining = size(X)(1,2);
# Compute sigmoid
A = sigmoid(Z);

#Compute loss. Note this is element wise product
loss =computeLoss(numtraining,Y,A);
# Compute the gradients dZ, dw and db
dZ = A-Y;
dw = 1/numtraining* X * dZ';
db =1/numtraining*sum(dZ);

end

function [w,b,dw,db,losses,index]=gradientDescent(w, b, X, Y, numIerations, learningRate)
#Initialize losses and idx
losses=[];
index=[];
# Loop through the number of iterations
for i=1:numIerations,
[loss,dw,db,dZ] = forwardPropagation(w,b,X,Y);
w = w - learningRate*dw;
b = b - learningRate*db;
if(mod(i,100) ==0)
# Append index and loss
index = [index i];
losses = [losses loss];
endif

end
end

# Determine the predicted value for dataset
function yPredicted = predict(w,b,X)
m = size(X)(1,2);
yPredicted=zeros(1,m);
# Compute Z
Z = w' * X + b;
# Compute sigmoid
A = sigmoid(Z);
for i=1:size(X)(1,2),
# Set predicted as 1 if A >= 0.5
if(A(1,i) >= 0.5)
yPredicted(1,i)=1;
else
yPredicted(1,i)=0;
endif
end
end

# Normalize each row by dividing by the square root of its sum of squares
function normalized = normalize(x)
# Compute Frobenius norm. Square the elements, sum rows and then find square root
a = sqrt(sum(x .^ 2,2));
# Perform element wise division
normalized = x ./ a;
end

# Split into train and test sets
function [X_train,y_train,X_test,y_test] = trainTestSplit(dataset,trainPercent)
# Create a random index
ix = randperm(length(dataset));
# Split into training
trainSize = floor(trainPercent/100 * length(dataset));
train=dataset(ix(1:trainSize),:);
# And test
test=dataset(ix(trainSize+1:length(dataset)),:);
X_train = train(:,1:30);
y_train = train(:,31);
X_test = test(:,1:30);
y_test = test(:,31);
end

# The cancer data set is assumed to have been loaded into the variable 'cancer'
[X_train,y_train,X_test,y_test] = trainTestSplit(cancer,75);
w=zeros(size(X_train)(1,2),1);
b=0;
X_train1=normalize(X_train);
X_train2=X_train1';
y_train1=y_train';
[w1,b1,dw,db,losses,idx]=gradientDescent(w, b, X_train2, y_train1, numIerations=3000, learningRate=0.75);
# Normalize X_test
X_test1=normalize(X_test);
#Transpose X_train so that we have a matrix as (features, numSamples)
X_test2=X_test1';
y_test1=y_test';
# Use the values of the weights generated from Gradient Descent
yPredictionTest = predict(w1, b1, X_test2);
yPredictionTrain = predict(w1, b1, X_train2);

trainAccuracy=100-mean(abs(yPredictionTrain - y_train1))*100
testAccuracy=100- mean(abs(yPredictionTest - y_test1))*100
trainAccuracy = 90.845
testAccuracy = 89.510
graphics_toolkit('gnuplot')
plot(idx,losses);
title ('Gradient descent- Cost vs No of iterations');
xlabel ("No of iterations");
ylabel ("Cost");

Conclusion
This post starts with a simple 2 layer Neural Network implementation of Logistic Regression. Clearly the performance of this simple Neural Network is comparatively poor to the highly optimized sklearn’s Logistic Regression. This is because the above neural network did not have any hidden layers. Deep Learning & Neural Networks achieve extraordinary performance because of the presence of deep hidden layers

The Deep Learning journey has begun… Don’t miss the bus!
Stay tuned for more interesting posts in Deep Learning!!

To see all posts check Index of posts


To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….