R news and tutorials contributed by (750) R bloggers

Cross-over study design with a major constraint


(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Every new study presents its own challenges. (I would have to say that one of the great things about being a biostatistician is the immense variety of research questions that I get to wrestle with.) Recently, I was approached by a group of researchers who wanted to evaluate an intervention. Actually, they had two, but the second one was a minor tweak added to the first. They were trying to figure out how to design the study to answer two questions: (1) is intervention $$A$$ better than doing nothing and (2) is $$A^+$$, the slightly augmented version of $$A$$, better than just $$A$$?

It was clear in this context (and it is certainly not usually the case) that exposure to $$A$$ on one day would have no effect on the outcome under $$A^+$$ the next day (or vice versa). That is, spillover risks were minimal. Given this, the study was an ideal candidate for a cross-over design, where each study participant would receive both versions of the intervention and the control. This design can be much more efficient than a traditional RCT, because we can control for variability across patients.

While a cross-over study is interesting and challenging in its own right, the researchers had a pretty serious constraint: they did not feel they could assign intervention $$A^+$$ until $$A$$ had been applied, which would be necessary in a proper cross-over design. So, we had to come up with something a little different.

This post takes a look at how to generate data for and analyze data from a more standard cross-over trial, and then presents the solution we came up with for the problem at hand.

Cross-over design with three exposures

If we are free to assign any intervention on any day, one possible randomization scheme involving three interventions could look like this:
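The scheme appears as an image in the original post. Reconstructed from the group assignments used in the simulation code below, it can be laid out as a small matrix (the labels are mine):

```r
# Treatment sequences: each row is a group, each column a day.
# Every group sees all three conditions; each condition appears
# exactly once on each day across the three groups.
scheme <- matrix(c("C",  "A",  "A+",
                   "A+", "C",  "A",
                   "A",  "A+", "C"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste("Group", 1:3), paste("Day", 1:3)))
scheme
```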

Key features of this scheme are: (1) all individuals are exposed to each intervention over three days, (2) on any given day, each intervention is applied to one group of participants (just in case the specific day has an impact on the outcome), and (3) not every permutation is included (for example, $$A$$ does not immediately precede $$Control$$ in any sequence), because the relative ordering of interventions in this case is assumed not to matter. (We might need to expand to six groups to rectify this.)

Data simulation

In this simulation, we will assume (1) that the outcome is slightly elevated on days two and three, (2) $$A$$ is an improvement over $$Control$$, (3) $$A^+$$ is an improvement over $$A$$, (4) there is strong correlation of outcomes within each individual, and (5) group membership has no bearing on the outcome.

First, I define the data, starting with the different sources of variation. I have specified a fairly high intra-class correlation coefficient (ICC), because it is reasonable to assume that there will be quite a bit of variation across individuals:

library(simstudy)
library(data.table)

vTotal = 1
vAcross <- iccRE(ICC = 0.5, varTotal = vTotal, "normal")
vWithin <- vTotal - vAcross

### Definitions

b <- defData(varname = "b", formula = 0, variance = vAcross, dist = "normal")

d <- defCondition(condition = "rxlab == 'C'",
  formula = "0 + b + (day == 2) * 0.5 + (day == 3) * 0.25",
  variance = vWithin, dist = "normal")

d <- defCondition(d, "rxlab == 'A'",
  formula = "0.4 + b + (day == 2) * 0.5 + (day == 3) * 0.25",
  variance = vWithin, dist = "normal")

d <- defCondition(d, "rxlab == 'A+'",
  formula = "1.0 + b + (day == 2) * 0.5 + (day == 3) * 0.25",
  variance = vWithin, dist = "normal")
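If I read simstudy's `iccRE` correctly, for a normally distributed outcome it simply inverts the ICC definition, $$ICC = vAcross / (vAcross + vWithin)$$, given the fixed total. A quick base-R check of that arithmetic (a sketch not using simstudy):

```r
# ICC = between-variance / total variance, so with total fixed at 1:
icc <- 0.5
vTotal <- 1
vAcross <- icc * vTotal      # 0.5, matching iccRE(ICC = 0.5, varTotal = 1, "normal")
vWithin <- vTotal - vAcross  # 0.5
vAcross / (vAcross + vWithin)  # recovers the ICC of 0.5
```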

Next, I generate the data, assigning three groups, each of which is tied to one of the three treatment sequences.

set.seed(39217)

db <- genData(240, b)
dd <- trtAssign(db, 3, grpName = "grp")
dd <- addPeriods(dd, 3)

dd[grp == 1, rxlab := c("C", "A", "A+")]
dd[grp == 2, rxlab := c("A+", "C", "A")]
dd[grp == 3, rxlab := c("A", "A+", "C")]

dd[, rxlab := factor(rxlab, levels = c("C", "A", "A+"))]
dd[, day := factor(period + 1)]

dd <- addCondition(d, dd, newvar = "Y")
dd

##      timeID         Y  id period grp         b rxlab day
##   1:      1 0.9015848   1      0   2 0.2664571    A+   1
##   2:      2 1.2125919   1      1   2 0.2664571     C   2
##   3:      3 0.7578572   1      2   2 0.2664571     A   3
##   4:      4 2.0157066   2      0   3 1.1638244     A   1
##   5:      5 2.4948799   2      1   3 1.1638244    A+   2
##  ---
## 716:    716 1.9617832 239      1   1 0.3340201     A   2
## 717:    717 1.9231570 239      2   1 0.3340201    A+   3
## 718:    718 1.0280355 240      0   3 1.4084395     A   1
## 719:    719 2.5021319 240      1   3 1.4084395    A+   2
## 720:    720 0.4610550 240      2   3 1.4084395     C   3

Here is a plot of the treatment averages each day for each of the three groups:

library(ggplot2)

dm <- dd[, .(Y = mean(Y)), keyby = .(grp, period, rxlab)]

ngrps <- nrow(dm[, .N, keyby = grp])
nperiods <- nrow(dm[, .N, keyby = period])

ggplot(data = dm, aes(y = Y, x = period + 1)) +
  geom_jitter(data = dd, aes(y = Y, x = period + 1),
    width = .05, height = 0, color = "grey70", size = 1) +
  geom_line(color = "grey50") +
  geom_point(aes(color = rxlab), size = 2.5) +
  scale_color_manual(values = c("#4477AA", "#DDCC77", "#CC6677")) +
  scale_x_continuous(name = "day", limits = c(0.9, nperiods + .1),
    breaks = c(1:nperiods)) +
  facet_grid(~ factor(grp, labels = paste("Group", 1:ngrps))) +
  theme(panel.grid = element_blank(), legend.title = element_blank())

Estimating the effects

To estimate the treatment effects, I will use this mixed effects linear regression model:

$Y_{it} = \alpha_0 + \gamma_{t} D_{it} + \beta_1 A_{it} + \beta_2 P_{it} + b_i + e_{it}$

where $$Y_{it}$$ is the outcome for individual $$i$$ on day $$t$$, $$t \in (1,2,3)$$. $$A_{it}$$ is an indicator for treatment $$A$$ in time $$t$$; likewise $$P_{it}$$ is an indicator for $$A^+$$. $$D_{it}$$ is an indicator that the outcome was recorded on day $$t$$. $$b_i$$ is an individual (latent) random effect, $$b_i \sim N(0, \sigma_b^2)$$. $$e_{it}$$ is the (also latent) observation-level noise term, $$e_{it} \sim N(0, \sigma_e^2)$$.

The parameter $$\alpha_0$$ is the mean outcome on day 1 under $$Control$$. The $$\gamma$$’s are the day-specific effects for days 2 and 3, with $$\gamma_1$$ fixed at 0. $$\beta_1$$ is the effect of $$A$$ (relative to $$Control$$) and $$\beta_2$$ is the effect of $$A^+$$. In this case, the researchers were primarily interested in $$\beta_1$$ and $$\beta_2 - \beta_1$$, which is the incremental change from $$A$$ to $$A^+$$.

library(lme4)

lmerfit <- lmer(Y ~ day + rxlab + (1 | id), data = dd)
rndTidy(lmerfit)

##                       term estimate std.error statistic    group
## 1:             (Intercept)    -0.14      0.08     -1.81    fixed
## 2:                    day2     0.63      0.06      9.82    fixed
## 3:                    day3     0.38      0.06      5.97    fixed
## 4:                  rxlabA     0.57      0.06      8.92    fixed
## 5:                 rxlabA+     0.98      0.06     15.35    fixed
## 6:       sd_(Intercept).id     0.74        NA        NA       id
## 7: sd_Observation.Residual     0.70        NA        NA Residual

As to why we would want to bother with this complex design if we could just randomize individuals to one of three treatment groups, this little example using a more standard parallel design might provide a hint:

def2 <- defDataAdd(varname = "Y",
  formula = "0 + (frx == 'A') * 0.4 + (frx == 'A+') * 1",
  variance = 1, dist = "normal")

dd <- genData(240)
dd <- trtAssign(dd, nTrt = 3, grpName = "rx")
dd <- genFactor(dd, "rx", labels = c("C", "A", "A+"), replace = TRUE)
dd <- addColumns(def2, dd)

lmfit <- lm(Y ~ frx, data = dd)
rndTidy(lmfit)

##           term estimate std.error statistic p.value
## 1: (Intercept)    -0.12      0.10     -1.15    0.25
## 2:        frxA     0.64      0.15      4.38    0.00
## 3:       frxA+     1.01      0.15      6.86    0.00

If we compare the standard error for the effect of $$A^+$$ in the two studies, the cross-over design is much more efficient (i.e. standard error is considerably smaller: 0.06 vs. 0.15). This really isn’t so surprising since we have collected a lot more data and modeled variation across individuals in the cross-over study.
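To put the gain in rough terms: since variance is the square of the standard error, the ratio of squared standard errors gives a back-of-the-envelope sense of the relative efficiency (a sketch using the two estimates above):

```r
se_crossover <- 0.06  # SE of the A+ effect in the cross-over fit
se_parallel  <- 0.15  # SE of the A+ effect in the parallel-design fit

# Variance ratio: the parallel design would need very roughly this many
# times the information to match the cross-over precision
(se_parallel / se_crossover)^2  # 6.25
```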

Constrained cross-over design

Unfortunately, the project was not at liberty to implement the three-way/three-day design just simulated. We came up with this approach that would provide some cross-over, but with an added day of treatment and measurement:
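This constrained scheme also appears as a figure in the original post. Reconstructed from the assignment code that follows (which labels the two groups 0 and 1), it can be laid out as a small matrix; note that in both sequences $$A^+$$ is only ever assigned after $$A$$, which satisfies the constraint:

```r
# Constrained four-day sequences: each row is a group, each column a day
scheme4 <- matrix(c("C", "C", "A",  "A+",
                    "C", "A", "A+", "A"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Group 0", "Group 1"), paste("Day", 1:4)))
scheme4
```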

The data generation is slightly modified, though the original definitions can still be used:

db <- genData(240, b)
dd <- trtAssign(db, 2, grpName = "grp")
dd <- addPeriods(dd, 4)

dd[grp == 0, rxlab := c("C", "C", "A", "A+")]
dd[grp == 1, rxlab := c("C", "A", "A+", "A")]

dd[, rxlab := factor(rxlab, levels = c("C", "A", "A+"))]
dd[, day := factor(period + 1)]

dd <- addCondition(d, dd, "Y")

The model estimates indicate slightly higher standard errors than in the pure cross-over design:

lmerfit <- lmer(Y ~ day + rxlab + (1 | id), data = dd)
rndTidy(lmerfit)

##                       term estimate std.error statistic    group
## 1:             (Intercept)     0.15      0.06      2.36    fixed
## 2:                    day2     0.48      0.08      6.02    fixed
## 3:                    day3     0.16      0.12      1.32    fixed
## 4:                    day4    -0.12      0.12     -1.02    fixed
## 5:                  rxlabA     0.46      0.10      4.70    fixed
## 6:                 rxlabA+     1.14      0.12      9.76    fixed
## 7:       sd_(Intercept).id     0.69        NA        NA       id
## 8: sd_Observation.Residual     0.68        NA        NA Residual

Here are the key parameters of interest (refit using package lmerTest to get the contrasts). The confidence intervals include the true values ($$\beta_1 = 0.4$$ and $$\beta_2 - \beta_1 = 0.6$$):

library(lmerTest)

lmerfit <- lmer(Y ~ day + rxlab + (1 | id), data = dd)

L <- matrix(c(0, 0, 0, 0,  1, 0,
              0, 0, 0, 0, -1, 1),
  nrow = 2, ncol = 6, byrow = TRUE)

con <- data.table(contest(lmerfit, L, confint = TRUE, joint = FALSE))
round(con[, .(Estimate, `Std. Error`, lower, upper)], 3)

##    Estimate Std. Error lower upper
## 1:    0.462      0.098 0.269 0.655
## 2:    0.673      0.062 0.551 0.795

Exploring bias

A single data set does not tell us if the proposed approach is indeed unbiased. Here, I generate 1000 data sets and fit the mixed effects model. In addition, I fit a model that ignores the day factor to see if it will induce bias (of course it will).

iter <- 1000
ests <- vector("list", iter)
xests <- vector("list", iter)

for (i in 1:iter) {

  db <- genData(240, b)
  dd <- trtAssign(db, 2, grpName = "grp")
  dd <- addPeriods(dd, 4)

  dd[grp == 0, rxlab := c("C", "C", "A", "A+")]
  dd[grp == 1, rxlab := c("C", "A", "A+", "A")]

  dd[, rxlab := factor(rxlab, levels = c("C", "A", "A+"))]
  dd[, day := factor(period + 1)]

  dd <- addCondition(d, dd, "Y")

  lmerfit <- lmer(Y ~ day + rxlab + (1 | id), data = dd)
  xlmerfit <- lmer(Y ~ rxlab + (1 | id), data = dd)

  ests[[i]] <- data.table(estA = fixef(lmerfit)[5],
    estAP = fixef(lmerfit)[6] - fixef(lmerfit)[5])
  xests[[i]] <- data.table(estA = fixef(xlmerfit)[2],
    estAP = fixef(xlmerfit)[3] - fixef(xlmerfit)[2])
}

ests <- rbindlist(ests)
xests <- rbindlist(xests)

The results for the correct model estimation indicate that there is no bias (and that the standard error estimates from the model fit above were correct):

ests[, .(A.est = round(mean(estA), 3), A.se = round(sd(estA), 3),
         AP.est = round(mean(estAP), 3), AP.se = round(sd(estAP), 3))]

##    A.est  A.se AP.est AP.se
## 1: 0.407 0.106  0.602  0.06

In contrast, the estimates that ignore the day or period effect are in fact biased (as predicted):

xests[, .(A.est = round(mean(estA), 3), A.se = round(sd(estA), 3),
          AP.est = round(mean(estAP), 3), AP.se = round(sd(estAP), 3))]

##    A.est  A.se AP.est AP.se
## 1: 0.489 0.053  0.474 0.057

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


Update on the R Consortium Census Working Group

Mon, 10/22/2018 - 19:54

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

It’s been a while since I shared any information about the R Consortium’s Census Working Group.

Today I’d like to share an update on three projects which the group is involved in.

The project which you might already be familiar with is the “Guide to Working with Census Data in R”. This is my survey paper that will cover the popular datasets that Census publishes, R packages that facilitate working with Census data, and resources to learn more about both. I am planning to complete this document in the next two weeks.

The project which launched this working group is my video course on using Choroplethr to visualize data from the American Community Survey (ACS). This course will be free and hosted on the Census Bureau’s website. The course has already been completed, and we’re in the final stages of getting approval from the bureau to publish the course on its website. I expect this to be completed by the end of the year.

A new project which we are considering taking on is a webinar series highlighting authors of R packages that work with Census data. This would give us a chance to create a library of educational material about using R to work with Census data.

The post Update on the R Consortium Census Working Group appeared first on AriLamstein.com.


To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.

Cassie Kozyrkov discusses decision making and decision intelligence!

Mon, 10/22/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Cassie Kozyrkov, Chief Decision Scientist at Google.

Here is the podcast link.

Introducing Cassie Kozyrkov

Hugo: Hi there, Cassie, and welcome to DataFramed.

Cassie: Hi. Thanks, Hugo. Honored to be here.

Hugo: It’s a great pleasure to have you on the show. And I’m really excited to have you here to talk about data science in general, decision making, decision science, and decision intelligence. But before that, I’d like to find out a bit about you. First I want to know what your colleagues would say that you do.

Cassie: Oh, goodness. Well, it depends on the colleague, I think. But I think the consensus would be they’d say that I have some expertise in applied data science, especially. And I help Google teams and our cloud customers apply machine learning effectively.

Applied Data Science

Hugo: Great. Could you tell me a bit about what applied data science means to you?

Cassie: Yeah, so when it comes to data science, well, let’s start with what data science means, and then we’ll take it deeper. So data science, to me, is that umbrella discipline that has underneath it statistical inference, machine learning, and analytics or data mining. And the difference between these three, for me, boils down not to the algorithms that are used in them, because if you’re smart, you can use any algorithm for any of them. Also not to the tools, but really boils down to the number of decisions under uncertainty that you want to make with them. With data mining, you really wanna get inspired. There are no specific decisions that you want to make yet, but you wanna see what your data inspires you to start thinking and dreaming about. Statistical inference is where you make a few really important decisions under uncertainty, and then machine learning and AI, they boil down to a recipe for repeated decision making. So many, many decisions under uncertainty.

Cassie: I actually see data science as a discipline of decision making, turning information into action. Now, where is this applied versus research side of things? Researchers focus more on enabling fundamental tools that other people will use to solve business problems. Whereas applied folks will go and find among the available tools, what they need to solve those problems. So I’m not focused on how do I develop a new neural network architecture. I’m focused more on, there seems to be this germ of an idea in a business leader. How do we bring that out, make it real, build the team that’s going to end up working on that, and then ensuring that the entire process from start to finish is well thought out and executed, and then at the end, there is a safe, reliable result.

What do you do?

Hugo: And I think framing this form of data science and applied data science as a subdiscipline of decision making is something we’re gonna unpack more and more in this conversation. So that really sets the scene very nicely. So you said that your colleagues would view you as an expert on applied data science and thinking about how to use machine learning effectively. Now, what do you actually do? Are they pretty on track with what you do?

Cassie: They’re close. But I think at the heart of what I care about, is this notion of type three error in statistics. For those that don’t remember your errors, let’s have a quick reminder. Type one is incorrectly rejecting a null hypothesis. Type two is incorrectly failing to reject a null hypothesis, and type three is correctly rejecting the wrong null hypothesis. Or, if you prefer a Bayesian statement on the same thing, it’s all the right math to solve entirely the wrong problem.

Hugo: Great. So could you give me an example of a type three error?

Cassie: Yeah. So it’s any rabbit hole that a data scientist goes down meticulously, carefully answering a question that just didn’t need to be answered. So maybe this’ll be a familiar thing to some of the data scientists listening, and I hope you don’t suffer from those too much, ’cause that’s a bit of a newbie gotcha, but it kind of goes like this: There you are, finished with most of your work for the week. It’s maybe 4:00 p.m. on a Friday, and you’re excited for a nice, free weekend, because that’s the whole point of being in industry and not being in academia anymore. Right? Just kidding. Anyways-

Hugo: You’re not kidding at all.

Cassie: I’m … Okay. I’m kidding some.

Hugo: Sure. A bit.

Cassie: Yeah. All right.

Hugo: Great. Go on, go on.

Cassie: Fine. I’m not kidding at all. So there you are, and you’re just ready to go home, and say a product manager comes up to you. And with this sense of urgency in their voice, wants to get a specific measurement from you. Or a specific question answered. And you think to yourself, "My goodness. But that is difficult. That’s gonna take me at least all weekend. And well into the nights as well. I’m gonna first have to figure out getting the data, then I have to sync up with data engineers. I’m gonna have to look up all these methods in the textbook. This is gonna be a difficult thing. But look, I am a great data scientist, and I can do this. And I can do this correctly and I can make sure that all the statistical assumptions are met. And come Monday morning, I’m going to deliver this thing perfectly." And so on Monday morning, you come on bloody knees, you lift this result up to that product manager. And they kind of poke their head and look at you, and go, "Oh. I didn’t even realize that that was what I was asking for."

Cassie: So there you were, meticulously, very correctly solving this problem, but rather uselessly as well. And it goes nowhere, the product manager doesn’t use it for anything, it just gets abandoned behind the sofa of lost results. So that’s a type three error.

Communication and Process

Hugo: And so how do you stop that happening? Presumably, on the Friday afternoon. Presumably involves communication. Right?

Cassie: Communication is one of those things, but process as well. So the data science team should know what the other stakeholders that they depend on are responsible for and what it looks like for those pieces of work to be completed correctly. So I like to talk about this wide versus deep approach to data science. So a rigorous approach versus a more shallow, inspiration-gathering approach. And the second one is always good as long as you don’t end up wasting your data on it, and hopefully you’ll prod me about that shortly. But as long as you have data allocated to inspiration, having a light, gentle look at it is always a good idea. Putting your eyes on that data and seeing what it inspires you to think about. It helps you frame your ideas.

Hugo: And so in that case, we’re thinking about some sort of rapid prototyping in order to-

Cassie: We’re thinking about something even more basic. We’re thinking about just plotting the thing. And that is separate from a very careful pursuit, rigorously, of a specific and important goal. So first, separating these two and saying the former, that broad, wide, shallow approach, that is always … that’s indicated for every patient. Let’s just put it like that. Doctor prescribes that always. As long as you have the data to spare for it, do it. But don’t take your results too seriously, and don’t do anything too meticulous.

Cassie: On the other hand, this more rigorous approach, this takes a lot of effort. And it takes a lot of effort not just from the data science team. And the context for that, how the question is being asked, what assumptions are palatable and so forth, that is actually the responsibility of the decision maker, the business leader. And they have to have done their part properly in order for that meticulous work to make sense. So if you’re going to go and do rigorous things, you need to make sure that that work was properly framed for you.

Hugo: Right. So in this case of the data scientist going and spending their weekend working on this problem doing essentially work that the product manager didn’t think they’d be doing, a way to solve that would’ve been doing some sort of rapid datavis, exploratory data analysis, and then having a conversation with the product manager about what they really wanted.

Cassie: I would say, actually, the other way around. Have a conversation with the product manager about what they really wanted first. And if what they want is something emotional, to get a feel for something, that needs to be unpacked a little more, and perhaps what they’re looking for is possible, perhaps it isn’t. Perhaps taking a look at the data generates the inspiration that is required for what they want. Perhaps they’re hoping that the data scientist is a genie in a magic lamp that grants wishes that can’t be granted. So talking to them and figuring out what they want should actually be the first step. But better even than that, would be an organization where there isn’t this adversarial relationship that assumes that the product manager doesn’t know their part. Better to staff the project with trained, skilled decision makers who know how to do their part, and the data scientist will simply check the requests coming in. And if the request has a certain characteristic, they will tend to go for no work or light work, and if the request has a different kind of characteristic, they will go and do things carefully, rigorously and meticulously at the level of rigor and complexity requested by that skilled business leader.

Hugo: I love it. So then we’re actually talking about specifically, with clarity, defining roles and defining a process around the work-

Cassie: Absolutely. So this can get pretty big and interesting, of how you arrange these teams and how you arrange these processes. In its lightest form, it can be a matter of who talks to whom in what order, but it can be much bigger than that.

Wasting Data

Hugo: Great. And this is something we’re gonna delve into later in the conversation, is common organizational models for this type of work. But before that, I want to prod you a bit, and something you mentioned earlier was the idea of wasting your data. And maybe you could tell me what you mean by that.

Cassie: Sure. Well, we all learn something rather straightforward and obvious-sounding in statistics and data science class. And then unfortunately, we end up forgetting it a little bit. And it’s something that we really shouldn’t forget. And that is that a data point can be used for inspiration, or rigor, but not both if you’re dealing with uncertainty, if you wanna go beyond your data. Because when you’re assessing, whether or not your opinion actually holds up in reality and in general, then you need to make sure that you check that opinion on something that you didn’t use to form the opinion. Because we, humans, are the sorts of creatures that find Elvis’s face in a piece of toast. And if we use the same piece of toast to get inspired to wonder whether toasts look like Elvis, and then to also answer whether toast does, in general, look like Elvis, we have a problem. You’re going to need to go to a different piece of toast.

Cassie: And so you can use data for inspiration or rigor, but not both. And so if you use all the data that you have for getting inspired, for figuring out what questions you even wanna ask, then you have no data left over to rigorously answer them.
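One mechanical way to honor the “inspiration or rigor, but not both” rule is a simple holdout split, fixed before any exploration happens. A generic sketch (the variable names here are illustrative, not from the podcast):

```r
set.seed(7)
d <- data.frame(x = rnorm(100), y = rnorm(100))

# Decide the split before looking at anything
explore_idx <- sample(nrow(d), nrow(d) / 2)
explore <- d[explore_idx, ]   # plot, poke, and form hypotheses here
confirm <- d[-explore_idx, ]  # stays untouched until a hypothesis is fixed
```

Any pattern spotted in `explore` is then tested, once, against `confirm`, which played no part in suggesting it.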

Hugo: And I think similar comparisons can be made to null hypothesis significance testing. Right? So for example, you’ll do exploratory data analysis, start to notice something, and then do a test there; because you were inspired in your null hypothesis and alternative hypothesis by the original data, you may actually be overfitting your model of the world to that dataset.

Cassie: Yeah, and I think that that kind of thing actually happens in practice in the real world, because of the way in which students are taught in class. So it makes sense in class for you to get a look at the conditions of a toy dataset, see what sorts of assumptions might or might not hold in that dataset, and then see what it looks like when you apply a particular method to that dataset. And that poor toy dataset would’ve been torn to shreds with respect to what you could actually reasonably learn from it, by all the thousands of times that the student and professor are torturing this poor little dataset. But that’s okay, because all you’re supposed to do in class is see how the math interacts with the data. But you get used to this idea that you’re first allowed to look and examine this dataset, and then you’re allowed to apply the algorithm or statistical test to it.

Cassie: In real life, though, you end up running into exactly this problem, where you invalidate your own conclusions by going through that process. You really should not use the same dataset for both purposes. You shouldn’t pick your statistical hypothesis and test it right then and there. I mean, think about it like this: Here you are, with x variable and y variable, nice scatter plot. And you take this little dataset, and you plot it and you see sort of the ghost of an upward, up-and-to-the-right lift in this little point cloud that you’ve just plotted. Well, you’ve just seen this, and so you ask yourself, "Well maybe I can put a straight line through it and see whether I statistically significantly have a positive correlation there." Congratulations, you are going to get the result that, yes, you do statistically significantly have a positive correlation, on account of having been inspired to ask the question in the first place by how these particular points fell onto your scatterplot. The conclusion you make might be entirely unrelated to reality. If you’re inspired to do that by this data set, go get another data set from the same process in physical reality, and make sure that your inspiration holds up, there.

Cassie: We humans, we do see patterns that are convenient, interesting to whatever we’re interested in, and might touch reality at no point at all.
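The trap described here can be sketched with a small simulation (illustrative, not from the podcast): under a true null, a one-sided correlation test run only on datasets whose scatterplots already "look" positive rejects roughly twice as often as the nominal 5%.

```r
set.seed(1)
sims <- t(replicate(2000, {
  x <- rnorm(30); y <- rnorm(30)  # truly unrelated variables
  c(r = cor(x, y),
    p = cor.test(x, y, alternative = "greater")$p.value)
}))
sims <- as.data.frame(sims)

mean(sims$p < 0.05)              # honest false-positive rate: ~0.05
mean(sims$p[sims$r > 0] < 0.05)  # rate when "inspired" by an upward cloud: ~0.10
```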

Hugo: There are several ways that we think about battling this endemic issue. One that you mentioned, of course, is after noticing things in exploratory analysis of your dataset and coming up with hypotheses, then going and gathering more data, generated by the same processes. Another, of course, is preregistering techniques before looking at any data initially. I was wondering if there are any other ways that you’ve thought about, or you think may be worth discussing to help with this challenge.

Cassie: The problem, really, is the psychological element of data analysis. How you’re looking for things. How your mind tricks you as you look for things. And the mathematical techniques that are supposed to help you do things like validate under extreme circumstances with cross validation, those are really easy to break. They don’t actually protect you from going about things the wrong way, psychologically.

Cassie: So what I suggest people do, when they start thinking about this is, if you were pitted against a data scientist who absolutely wants to lead you astray and trick you into all kinds of things, when you give them certain constraints on their process, can they still give you a bad result? Can they still mess with you? Can they still trick you? And most of those methods out there … In fact, I can’t think of one off the top of my head that wouldn’t be, most of them are susceptible to that kind of mucking about. And unfortunately, as a good data scientist, you are likely to trick yourself in that same manner.

Hugo: That’s very interesting, because I think it hints at the fact that, due to a lot of our cognitive and psychological biases, we don’t necessarily have good techniques. We need to develop processes, but we don’t necessarily have good techniques to deal with this, yet.

Cassie: When you talk about preregistration of a study, that is less of a technique and more of a statement that in these data, you’re not going to go and adjust your perspective and question. So you’re saying wherever your hypothesis comes from in advance of gathering and processing these data, it is now fixed. That’s kind of how it should be, anyway. So even when you ask the question and separate the two, you’re actually talking about two aspects of the same thing. If you want to form a hypothesis, go and explore data, but nuke that dataset from orbit if you’re going to go and do some rigorous process where you have the intention of taking yourself seriously. You should have your entire question, all the assumptions, all the code, even, submitted ideally before the data’s collected, but certainly before the data hits that code.

What is decision intelligence?

Hugo: So I’d like to move now and talk about decision intelligence, which you yourself work in as Chief Decision Scientist at Google Cloud. And I was wondering if you could frame for us what decision intelligence actually is, and how it differs from data science as a whole?

Cassie: So I like to think of decision intelligence as data science plus plus, augmented with the social and managerial sciences. And with a focus on solving real business problems and turning information into action. So it’s oriented on decision making. If we had to start again with designing a discipline like that, we would wanna ask every science out there what does that science have to say about how do we turn information into action? How do we actually do it, because of the sort of animal that we are? And if we want to build a reliable system or reliable result for a specific goal, how do we do that in a way that actually reaches that goal rather than takes a nasty detour along the way?

Cassie: So it’s very process-oriented. It’s very decision-oriented. But of course, a large chunk of that is applied data science.

Why data science plus plus?

Hugo: So can you tell me why data science plus plus? Why do we have two pluses there?

Cassie: Ah, that plus plus, as in upgraded with the next thing, I suppose. In the same way as, in syntax, you would have i++. That’s just some cuteness there, I suppose. But think of the upgrade like this: a data scientist is taught how to analyze survey data and how to think through a lot of the careful, mathematical stuff, how to deal with what happens if their data are continuous, if they’re categorical. What if the person was using some sliding scale, et cetera? How many questions? How do we correct for this many questions? That sort of thing. But what they’re not taught, not directly in their training, is how do you construct that survey? How do you make sure that that survey minimizes, say, response bias, which is where the user or participant simply lies to you and gives you the wrong answer to your question? And how do you think about what the original purpose of that survey is? Why are we doing this in the first place? And was a survey the right way to go about it? And how did we decide what was worth measuring? Those things are not typically taught to data scientists.

Cassie: So if data scientists want their work to be useful, then someone, whether it’s themselves or a teammate, who has the skills to think through that stuff, has to be participating.

Hugo: Right. And is it important, or is best case scenario the data scientist being involved in every step of the process from data collection, experimental design, question design through to actual decision making?

Cassie: That depends on what your budget is. Right? If you have infinite money, perhaps you might be able to hire one of the very, very rare unicorns who’ve actually thought about all of it, and who are skilled in all of it. There are not that many of those people. And if you intend to hire them, you are going to have to pay them. So intending to staff your projects in that manner, well, no wonder you’re going to be complaining about a talent shortage. So the reality of it is that you’re going to have to work with interdisciplinary teams. And also, even if you have someone who gets all of it, in a large scale project, there’s still more work than someone can do with the hours that there are in the day. And so why do you really need all these identical copies of the perfectly knowledgeable worker, if they’re going to have to work on different parts of the process, in any case? So the data scientist upskilling to perfection and then owning all of it, it’s a nice dream, but it doesn’t sound very practical.

Cassie: Instead, I imagine that they would be best suited to the part that they have spent the most time learning. And what they should really worry about more, is how the baton is passed from colleagues who are responsible for other parts of the process, and having the skills to check that that part was done well enough for their own work to be worthwhile. Because unfortunately, data science is right in the middle of the process. And it relies on the bookends. And if the bookends, like that decision making side and that product leadership side and that social science side, if that wasn’t done correctly, or if downstream, you have no way to put this into production reliably, and even if the prototype has beautiful math, it will be too much of a mess to actually use in practice. Then there is no point to the data scientist’s work. It all becomes a type three error.

Cassie: So they’re gonna be working with a very interdisciplinary team, probably. And they should focus on the parts where they can have the best impact.

Organizational Models

Hugo: Great. So in terms of decision making, I wanna know about these teams. I love that your response to my previous question was, "The reality of the situation is … " I wanna know more about reality, and I wanna know more about the practical nature of how data scientists and their work are included or embedded in decision making processes. So could you tell me a bit about the most common organizational models for how data scientists are included in this?

Cassie: Yeah, sure. An obvious way to do it is to collect a whole lot of data scientists and put them together into a centralized data science team, and that team tends to be guarded jealously by their data science director, who buffers them from the most egregious type three error requests, and makes sure that the rest of the organization uses them to a good purpose, or at least to the most impactful business purpose. And the junior data scientists in that structure, they don’t need to navigate the politics.

Cassie: There’s another model, which is simply embedding a data scientist in a large engineering team, and telling them to be useful.

Cassie: And there’s the decision support model. This is where you append the data scientist to a leader, and the data scientist helps that leader make decisions.

Cassie: And then, of course, there is the data scientist owning most of the process, especially the decision making. So here, data science is responsible for framing decision context, figuring out which questions are even worth asking, and then also taking ownership of answering them.

Hugo: So we have the pure data science team, the embedded in engineering, decision support, and data scientists as decision maker. And I think-

Cassie: The fifth will be the decision intelligence option, which is none of these.

Hugo: I look forward to discussing that. And it seems like, generally, this order goes from less decision making to more decision making on the part of the data scientist. Is that fair to say?

Cassie: Ah, fair enough.

Hugo: And so what are the pros and cons of being at different points along this spectrum?

Cassie: With a super centralized one, an obvious con is that if you are a small, scrappy organization, forget it. You are not going to be able to have this large data science org. Another con is that they tend to be put towards what the business already knows is worth doing properly and carefully. So in some sense, this is a pro. They’re going to be associated with the most delicate or high-value questions in the business. The con is that there’s little flexibility to help out the broader organization or to seize unusual opportunities, because there is this sort of single point through which all the requests come. And that tends to homogenize the requests a little bit. And that also means that individual data scientists will have very low contact with the decision function. That might be a pro for them. Maybe that’s a stressful thing for a junior data scientist. But it’s very hard for their work and their contribution to have visibility in this way.

Cassie: And all of this is really at the mercy of data science leadership. So if their data science director does not know what they’re doing, we’ll have a problem. And the industry is really suffering from a shortage of data science leaders. There are people who call themselves data science leaders or analytics managers, but these folks might not really know how to play the organizational politics. They might not have a good business sense. Or maybe they are primarily leaders, they have all those … that nose for impact, but they don’t understand how to make a data science team effective. So there can be some problems with that.

Cassie: The embedded in engineering: the pro is that you get to influence engineering. However, you end up doing a variety of tasks, which may or may not have anything to do with data science. Quite often, the engineering team doesn’t really know what sort of animal you are, doesn’t really know what you’re for, doesn’t know if you’re useful. They think of you as a sort of not-so-good programmer. "What’s wrong with you? And what is this stuff that you’re constantly fussing about on whiteboards?" You might not be seen as very useful, and you might find yourself taking on product management tasks that you might not want to do, that you didn’t think you were going to have to do, and that you didn’t train for. So you end up with non-specialist tasks and there’s no buffer against politics for you, there.

Hugo: And is this something that also happens as we move more towards data scientists who work in decision support and as decision makers, themselves?

Cassie: There’s some element of this, as well. With decision support, the leader, a good leader quickly figures out how to make you useful. So you don’t spend an awful lot of time sort of wafting about, figuring out how to even contribute in the first place. Now, it might be that your best contribution is nothing to do with complex methodology that you spent so many years in grad school studying, and your data science tasks might end up getting diluted with a whole host of other things you might be working on. But your value does tend to get better protected under this setup.

Hugo: And how about for a data scientist as an actual decision maker?

Cassie: So of course, the pro there is that you don’t have this loss in moving between the data science, engineering, and decision functions, because the data scientist owns all of those things. The con is that in order to do it, you need to really get several black belts. And if you don’t have them, you might think you’re being useful, but you might be doing more damage than good. So maybe you think you’re good at understanding business impact, but really, what you’re much better at is doing the math. And you end up pushing the organization down rabbit holes, bad rabbit holes at a much worse rate than they would’ve had without you. So you really do need these multiple black belts, and you need to understand that you have to train for these things separately. Because a standard training program just does not prepare you to be a two-in-one or three-in-one worker.

Cassie: So in practice, this is a rare animal.

Data Scientist as Decision Intelligence Operative

Hugo: And then, of course the fifth model that you mentioned, in passing, that I’d like to focus on now, is data scientist as decision intelligence operative. What happens here?

Cassie: So there will be some allocation of time and human resources toward the analytics or data mining side of data science. And so there will be an ongoing pulse check for the company. So there will just be this broad, light-touch analytics going on all the time, and whoever is best at that mode of working within data science will be doing that, and will be partially driven by what leadership wants, but will also be driven by the explore-over-exploit attitude.

Cassie: Then, if something else is going to be requested, there will be certain stages in the project lifecycle that have to be completed in order to get that work. So it’s sort of like a combination of those two models, where you get embedded in engineering, or you get embedded with decision making, but that match happens out of a centralized pool of labor, and it happens based on the project being framed in the required way. So for example, you might have the decision support framing where you need statistical assistance on a project. In order for that to happen, there have to be certain steps: if you’re going to go the frequentist way, selection of the default action, what the decision maker actually wants to do by default, an understanding of what it takes to convince them, what their metrics are; that sort of task falls to a social science function, there. What population they’re thinking about. What assumptions they’re willing to deal with. That will be someone from social science or from data science working with the decision maker to help them frame their decision context.

Cassie: And then once that is all ready, then you staff folks who can actually do the heavy lifting, the calculations, the data stuff to the project. And of course, you need to staff data engineering to that project as well. So when everyone comes together, they know what they are there for.

Hugo: And this actually speaks to kind of a broader challenge. I mean, we’ve discussed this previously, but this idea that a lot of people want to hire data scientists or do machine learning or state of the art deep learning or AI before they even know what questions they want to answer. Right?

Cassie: Yeah. So what you should do … Here’s my advice for everybody. If you don’t know what you want, think of your data as a big old pile of photographs in the attic. And think of analytics or data mining as the person or function that is going to go to that attic, and their opportunity to actually go to the attic and look at the data is going to be supported by data engineering. They’re gonna go to that attic, they’re going to upend that big box of photographs on the floor. They’re going to look at them, then they are going to summarize what they see there to the people waiting patiently in the house and ask those folks whether they are considering doing anything more with it. That kind of approach always makes sense. You never know what’s in this pile of photographs. And you’ll never know whether it’s worth doing anything serious with it. But also, because it’s a pile of photographs and you don’t know who took it, and for what purpose, you should never conclude anything beyond what is there.

Cassie: So we, as citizens, we already know how to think about a pile of photographs, or a photo you find on the side of the road. The only thing that you can reasonably say about it is, "Hey, this is what is here." Does that inspire me? Does it make me dream? Does it make me want to ask other questions about the world? Sure. Perhaps. But do I take any of that seriously? No, of course not. It’s some photograph, and data science is essentially Photoshop, as we all know, and we don’t know much about how that photo was taken or why. And we can’t really make serious decisions based on it. But taking a look always makes sense. As long as you continue to think about it reasonably, the same way as you would think about those photos. So that’s always good for every project. And if any team, any organization says, "I’d like to know a little more about my data. I’d like to get into mining my data, looking at my data, finding out what’s in there," that is always a good thing.

Cassie: But now, if you don’t actually control the quality of that data, you might end up doing very careful, rigorous things with it, and the photographs were all, I don’t know, blank. Right? There’s no point to it. Or maybe they were all taken in a way that’s entirely unreliable for the question that you want to answer, since you didn’t actually plan the data collection. If you look at the photographs that I take in my travels, you notice there’s all these super touristy landmarks. And yet somehow, I’m the only person pictured in the photograph at that landmark. You can’t conclude anything about how many people go to these landmarks based on my pile of photographs. But you can still take a look, as long as you don’t take them too seriously, and then you might start thinking about the sorts of things you might wanna do with them. And when you start figuring out what you might like to do, then you start planning the entire process as directed towards that goal. And then it makes sense to start thinking about hiring people who can do that extra stuff.

Why do so many organizations fail at using data science properly?

Hugo: So Cassie, given the variety of different models for embedding data scientists in the decision making process, I’m wondering why so many organizations fail at using data science to inform decision making properly and robustly.

Cassie: Well, this comes down to a problem of turning information into action and how decision makers are organized and trained to do that. So it may be a case of the decision maker actually doesn’t know what their own role in the process is, and they don’t know how to properly frame a decision context for a data science project that isn’t simply data mining and analytics, this wide and shallow approach. Without the decision maker taking control of the process, what always makes sense is a nice, shallow, wide data mining approach. Mine everything for inspiration, don’t take yourself too seriously. Don’t spend a whole lot of effort. And if you just stick to this, and you really, truly don’t take yourself more seriously than you’re supposed to, the biggest danger is overspending on personnel. Maybe you’ve ended up hiring a bunch of professors, and now you’ve used them for tasks that they consider to be far too easy given their intense training.

Cassie: But, what tends to happen is that decision makers don’t end up owning the dive into the careful, rigorous stuff properly. So maybe they just hire a bunch of data scientists and they leave them in a room, all on their own. They don’t give them any instructions, and then they are surprised when the only thing that comes out of that room is research white papers. Maybe there’s a case of all those folks pursuing research and rigor for its own sake, because that’s the most comfortable thing, their comfort mode from their research training, and those folks aren’t really qualified to diagnose what is useful for the business, and the decision function just leaves them alone.

Cassie: It might be a case of the entire organization not understanding that there’s a difference between inspiration and rigor, and how to use data for these things, and how much effort each one takes. So another failure is where you get the opposite. You end up using data for inspiration, and then you think that you’ve done something rigorous there, where you really haven’t. And you start taking those results much more seriously than you ought to. And you become over confident and you run headlong into a wall.

Cassie: Another problem that organizations have is that it’s very convenient to use the outputs of data science work as a way to bludgeon your fellow decision makers in a meeting. So everyone wants to argue and put forth their personal opinion about a thing that cannot be solved with data, really. It has to do maybe with the strategy of the organization, and instead of sticking with humility, owning what you don’t know, and using argument to discuss with your fellow decision makers what should be done next, you bring some inscrutable report that’s covered in equations and you say, "Because my magical data scientists have said, this is the truth." But, you know about statistical inference, you know that the question matters more than the answer does, almost. And if all you do is you bring an answer, well, it may or may not be an answer to the question that is being asked or assumed by everybody else. It’s like that Douglas Adams thing, where you just bring 42 to the meeting, and you say look at all these equations that have gotten us to 42. And because it says 42, I’m correct. Actually, it doesn’t make much sense. Takes a lot of effort. And it wastes a lot of time.

Cassie: And then there’s also an element of misguided and misdirected delegation of decision responsibilities. That’s where you have someone who wants to have decision responsibility and they want decision making to be done rigorously, but they want that for more decisions than they actually have time to deal with. And so they sort of fool themselves into thinking that they can be in that decision maker role without spending the time to actually frame decision context, think through assumptions, work with the data science teams and so forth. And so what ends up happening is that people junior to them end up usurping those roles and make the decision however they make it. Maybe they make it rigorously, maybe they don’t, and then spend all of the data science team’s effort in persuading or convincing this pretend decision maker that it’s actually their idea. Now, there’s an element of fuss there, which could just be avoided if decision responsibility were delegated appropriately. There’s no need for this usurping thing. If you don’t have the time to put in the effort that it takes, then hand over that decision to someone who does have that time, if they’re going to pursue it carefully, rigorously, and in this intense statistical way. Or, say, "We’re going to base it on inspiration. It’s going to be a light case of analytics and plotting, but we’re not going to allow ourselves to become more confident than our methods deserve."

Cassie: So really, most of that disconnect has to do either with the people hired, or with the decision makers themselves not knowing what their own role is, since they are the ones who kick off the whole process. It really does matter that they have the skills to do it.

Data Literacy

Hugo: And is there another disconnect with respect to how many people in an org can speak data, in the sense that data literacy and data fluency aren’t necessarily spread or distributed across organizations. I suppose my question is: In organizations you have seen, how is data literacy spread through the org, and how would you like to see this change?

Cassie: So I’m not gonna speak specifically to Google, here. I’m gonna speak much more generally, about all of us at once.

Cassie: In this world, data literacy is in a sorry state. At least from my perspective, I really wish that we were better at this stuff. And we are surprisingly good at thinking through photograph data. And we’re fairly reasonable, fairly reasonable … We still might do some silly things. But we’re fairly reasonable about that. And we’re fairly reasonable about laughing and saying, "Oh, ha-ha, just because it’s in a book doesn’t mean it’s true." But somehow, when it involves math and data, we start to pronounce data with a capital D, like it is some source of objective truth that’s entirely disconnected from the humans that decided to collect it in the first place, and made decisions over how they were gonna collect it and why. So data literacy is in a sorry state. And what I keep seeing in the world at large is that we lack the humility to say, "Well, if we had no one on the team who could have played this role, who had the skills to take on the decision maker’s part, then we shouldn’t take ourselves too seriously."

Cassie: Instead, what one sees out there in the wild is that there are these teams that are staffed with very meticulous mathematical minds and unskilled decision makers. So what I see missing in the world at large is teams with the humility to say, "Taking ourselves seriously actually takes work, and it takes skills. And if we lack those skills, we’re not going to be able to do it. The best that we can get from this is the same kind of thing that we get from looking at a pile of photos." And that’s actually still something. It’s amazing that we have the ability to take an SD card, which means nothing to you when it’s lying in the palm of your hand, and you plug it into your computer, and you use some visualization software, I don’t know, Microsoft Paint, or something, and now you can get inspired and see what’s there. That’s an incredibly powerful thing. That’s good for everybody. Everyone should be doing more of that on more data types.

Cassie: But not to assume that just any old data plus very complex mathematics can create something out of nothing. Certainty out of uncertainty, for example. A good decision process where the fundamental initial skills were lacking. I like to say inspiration is cheap, but rigor is expensive. And if you’re not willing to pay up, don’t expect that there is some magical formula that’s going to give it to you. Without that data literacy, please don’t be trying to do very complicated things.

Future of Data Science and Decision Science

Hugo: Right, so this is the present state of data science, decision making, decision intelligence and data literacy. What does the future of the intersection of data science and decision science look like to you?

Cassie: As we start to do more with data, I hope to see the quality of the decision skills kicking off and also guiding those projects growing. We can’t really afford to be automating things with data at scale and have that all based on bad decision making skills. That’s going to be a disaster for the company that’s doing it. So we’ll have to move towards taking those skills more seriously, and not treating them just as something that you have a flair or a talent for. But even though I imagine that, whether we learn it now or we learn it the hard way later, those skills are going to get better, they don’t have to be carried entirely by the people with decision responsibility currently delegated to them. There is another option.

Cassie: And that other option is to hire a helper who can do that rigorous thinking for you. The part of decision making that is a science can be done by a scientist, helping the decision maker who is owning the part that has to do with intuition and politics and so forth. So you can hire a helper to upgrade your skills if you don’t want to go and learn the stuff yourself. But I do think that on the whole, the future involves us taking that first bit much, much more seriously.

Call to Action

Hugo: So towards that future, my final question is: Do you have a final call to action for our listeners out there?

Cassie: Yeah, two. One is, it’s time to start shifting our focus away from only research, and more to a choice of whether you want to be doing research, or you want to be doing applied stuff. These are both equally valuable, important approaches. One of them is tremendously understaffed right now. I could argue both are and it’s a really exciting time if you want to get into that sphere, because that’s going to become more and more important as those general purpose techniques that the researchers make become more readily available for application. An analogy that I have for this is, researchers might be folks who build microwaves, new and better microwaves. Whereas applied folks think about innovating in the kitchen and recipes at scale. And I wanna point out that if you want to say, create McDonald’s, just because you don’t have to wire your own microwave doesn’t mean it’s easy. So this is an exciting time for a new area of investigation and new discipline.

Cassie: And the other thing I want to leave you with is, the world is generating more and more data. We really owe it to ourselves to make that data useful. Wasting all of our time and resources on type three error after type three error is a very sad state of affairs. So it’s really time for us to take this seriously, because we just have so much of it. Let’s do good, beneficial things with it.

Hugo: I love that, because it really brings this conversation full circle: you stated at the start of our conversation that a big part of what you do is to help teams avoid or lower the rate of type three errors in data science. We’ve come full circle, essentially. And that’s one of the calls to action, here. Right? That we all work together and use the data and our modeling techniques and question-asking capabilities to lower type three errors more and more.

Cassie: Yeah, and as I think back on our conversation, I think that a disservice that I did to decision intelligence as a whole, is that I really spoke to you a lot about data scientists. I spoke a lot about decision makers. I vaguely mentioned social scientists. But it’s a much, much more diverse game. And I really left out all the other people that should be involved. The engineers, the reliability folks, the ethicists, the designers. There is a lot of important work to be done in this space by a great variety of people. And I want to ask everyone who’s thinking about sneaking off right now because this doesn’t apply to them, to reconsider. Decision making is important for all of us. And if we are going to do this seriously and at scale, then there is a role for everyone to play if you have anything to say about turning information into action.

Hugo: I couldn’t agree more. And Cassie, it’s been such a pleasure having you on the show.

Cassie: Thank you so much.


To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Maximized Monte Carlo Testing with MCHT

Mon, 10/22/2018 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

I introduced MCHT two weeks ago and presented it as a package for Monte Carlo and bootstrap hypothesis testing. Last week, I delved into important technical details and showed how to make self-contained MCHTest objects that don’t suffer side effects from changes in the global namespace. In this article I show how to perform maximized Monte Carlo hypothesis testing using MCHT, as described in [1].

Maximized Monte Carlo Hypothesis Testing

The usual procedure for Monte Carlo hypothesis testing is:

1. Compute a test statistic for the data on which you wish to test a hypothesis
2. Generate random datasets like the one of interest but with the data generating process (DGP) being the one prescribed by the null hypothesis, and compute the test statistic on each of these datasets
3. Use the empirical distribution function of the simulated test statistics to compute the $p$-value of the test
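As a toy illustration of these three steps (my own minimal sketch, not code from MCHT; the choice of statistic, sample size, and null distribution here are arbitrary), consider testing that the data come from an Exponential(1) distribution, using the sample mean as the test statistic:

```r
# Minimal Monte Carlo test sketch: H0 says the data are Exponential(1).
set.seed(123)

mc_test <- function(x, N = 1000) {
  n <- length(x)
  stat <- mean(x)                                # step 1: observed statistic
  sims <- replicate(N, mean(rexp(n, rate = 1)))  # step 2: simulate under H0
  # step 3: two-sided p-value from the empirical distribution
  min(1, 2 * min(mean(sims <= stat), mean(sims >= stat)))
}

x <- rexp(20, rate = 1)  # data actually generated under H0
mc_test(x)               # a p-value in [0, 1]
```

Because the data are simulated under the same DGP the null hypothesis prescribes, repeating this many times would reject at roughly the nominal rate.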

Monte Carlo tests often make strong distributional assumptions, such as what distribution generated the dataset being tested, but when those assumptions hold, they are exact tests (see [2]). They are not as powerful as if we had the exact distribution of the test statistic under those assumptions, but their power increases with the number of simulated test statistics $N$ (see [3]), and given the power of modern computers, obtaining a large $N$ is usually not a problem. Thus Monte Carlo tests are attractive in small-sample situations where we do not want to rely on an asymptotic distribution for inference.

However, the procedure outlined above does not allow for nuisance parameters (parameters that are not the subject of interest but whose values are needed in order to conduct inference). In the introductory blog post, while one may view the population standard deviation as a nuisance parameter, the test statistic does not depend on it when the data follows a Gaussian distribution, so there was no need to worry about it. In the case when we switched to data following the exponential distribution, it was still not a problem, since its value was specified under the null hypothesis. Thus it was no longer a nuisance parameter.

That said, nuisance parameters can still appear when we need to perform inference. Suppose, for example, that our data follows a Weibull distribution, denoted by $\text{Weibull}(k, \lambda)$, with $k$ being the shape parameter and $\lambda$ the scale parameter. We want to test the set of hypotheses:

$$H_0: \lambda = \lambda_0 \quad \text{vs.} \quad H_A: \lambda \neq \lambda_0$$

We can use the generalized likelihood ratio test to form a test statistic (which I won’t repeat here, but it does appear in the code below). While Wilks’ theorem tells us about the asymptotic distribution of the test statistic, it says nothing about the exact distribution of the test statistic at a particular sample size, and it’s not a given that the test statistic is pivotal and thus independent of the value of nuisance parameters under the null hypothesis (the nuisance parameter, in this case, being the shape parameter $k$).
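For concreteness, here is one way the likelihood ratio statistic for this pair of hypotheses could be computed in R. This is my own sketch, not the package’s code; the starting values and search bounds are arbitrary assumptions:

```r
# -2 log likelihood ratio for H0: scale = lambda0 in Weibull(k, lambda),
# where the shape k is an unknown nuisance parameter.
glrt_weibull <- function(x, lambda0) {
  # unconstrained MLE over (shape, scale)
  nll_full <- function(p) -sum(dweibull(x, shape = p[1], scale = p[2], log = TRUE))
  fit_full <- optim(c(1, 1), nll_full, method = "L-BFGS-B",
                    lower = c(1e-8, 1e-8))
  # constrained MLE: scale fixed at lambda0, optimize over the shape only
  nll_null <- function(k) -sum(dweibull(x, shape = k, scale = lambda0, log = TRUE))
  fit_null <- optimize(nll_null, interval = c(1e-8, 100))
  2 * (fit_null$objective - fit_full$value)  # nonnegative up to numerical error
}

set.seed(123)
x <- rweibull(30, shape = 2, scale = 5)
glrt_weibull(x, lambda0 = 5)
```

Large values of this statistic are evidence against $H_0$; the open question the rest of this post addresses is what its exact null distribution is when $k$ is unknown.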

What then can we do? A bootstrap test would estimate the value of the nuisance parameter under the null hypothesis and use that estimate as the actual value when generating new, simulated test statistics. Bootstrap tests, however, are not exact tests (see [2]) and we’ve decided that we want a test with stronger guarantees.
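To make the bootstrap idea concrete, here is a sketch of the plug-in step it relies on, using only base R (the null scale of 4 and the use of the sample mean as a stand-in statistic are my own illustrative choices, not the post's actual likelihood-ratio statistic):

```r
set.seed(7)
x <- rweibull(10, shape = 2, scale = 4)   # observed data

# Estimate the nuisance shape parameter k under H0 (scale fixed at 4)
# by minimizing the negative Weibull log-likelihood in k.
nll <- function(k) -sum(dweibull(x, shape = k, scale = 4, log = TRUE))
shape_hat <- optimize(nll, interval = c(0.01, 100))$minimum

# Simulate new datasets plugging in the *estimated* shape -- this
# plug-in step is exactly what keeps the bootstrap test from being exact.
sim <- replicate(999, rweibull(length(x), shape = shape_hat, scale = 4))
```

Each simulated dataset would then yield a simulated test statistic, and the p-value would be computed as in an ordinary Monte Carlo test.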

[1] introduced the maximized Monte Carlo (MMC) test, which proceeds as described below:

1. Compute the test statistic from the data.
2. Generate N collections of random numbers, such as uniformly distributed random numbers, and use those random numbers to generate random copies of the test statistic that depend on the values of the nuisance parameters (notice that the random numbers are kept rather than discarded, so they can be reused for different nuisance-parameter values)
3. Use an optimization procedure ([1] suggested simulated annealing) to pick values for the nuisance parameters such that the p-value is maximized; the maximized p-value is the p-value of the test

[1] showed that this procedure yields p-values that, while not as precise as if we knew the values of the nuisance parameters that produced the data, are at least conservative, in the sense that they're no smaller than they should be (thus biasing our conclusions in favor of the null hypothesis). This is the best we can hope for in this context.

MMC is intuitive and compelling, and the theoretical guarantee gives us confidence in our conclusions, but it's not a panacea. First, the optimization procedure is costly in computation and time. Second (and, in my opinion, the biggest problem), the procedure may be too conservative. There's a strong chance that the procedure will find some values for the nuisance parameters that yield a large p-value, perhaps a combination not at all resembling the actual values of the nuisance parameters that produced the data. In short, MMC can be severely lacking in power. When it does reject the null hypothesis, the result is compelling, but a failure to reject is not convincing evidence in favor of the null.

MMC in MCHT

Creating an implementation of MMC in R was my original goal in developing MCHT, and all that needs to be done to perform MMC is pass a value to the nuisance_params parameter and an appropriate list to optim_control.

Let’s take the hypothesis test mentioned above and prepare to implement it using MCHT. I will be using fitdistrplus for maximum likelihood estimation, as required by the test statistic (see [4]).

library(MCHT)

## .------..------..------..------.
## |M.--. ||C.--. ||H.--. ||T.--. |
## | (\/) || :/\: || :/\: || :/\: |
## | :\/: || :\/: || (__) || (__) |
## | '--'M|| '--'C|| '--'H|| '--'T|
## ------'------'------'------'
## v. 0.1.0
## Type citation("MCHT") for citing this R package in publications

library(fitdistrplus)

registerDoParallel(detectCores())

# To be passed to test_stat
ts <- function(x, scale = 1) {
  fit_null <- coef(fitdist(x, "weibull", fix.arg = list("scale" = scale)))
  kt <- fit_null[["shape"]]
  l0 <- scale
  fit_all <- coef(fitdist(x, "weibull"))
  kh <- fit_all[["shape"]]
  lh <- fit_all[["scale"]]
  n <- length(x)

  # Test statistic, based on the negative-log-likelihood ratio
  suppressWarnings(n * ((kt - 1) * log(l0) - (kh - 1) * log(lh) -
    log(kt/kh) - log(lh/l0)) - (kt - kh) * sum(log(x)) +
    l0^(-kt) * sum(x^kt) - lh^(-kh) * sum(x^kh))
}

# To be passed to stat_gen; localize_functions will be TRUE
sg <- function(x, scale = 1, shape = 1) {
  x <- qweibull(x, shape = shape, scale = scale)
  test_stat(x, scale = scale)
}

The MCHTest() parameter nuisance_params accepts a character vector giving the names of nuisance parameters the distribution of the test statistic may depend upon, and those names must be among the arguments of the function passed to stat_gen; that function is expected to know how to handle those parameters. In this case, rand_gen will not be specified, since by default it generates uniformly distributed random variables. It's a well-known fact in probability that the inverse of the CDF of a random variable (the q functions in R) applied to a uniformly distributed random variable yields a random variable that follows the distribution prescribed by the CDF. Hence the use of qweibull() above, which is applied to datasets of uniformly distributed random variables that are effectively fixed when stat_gen is called. The test statistic will then be computed from data that follows the scale parameter prescribed by the null hypothesis but some set value of the shape parameter k.
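The inverse-CDF fact is easy to check on its own (a standalone sketch, not MCHT internals): the same uniform draws can be reused to produce Weibull data for any trial value of the shape parameter.

```r
set.seed(1)
u <- runif(1000)                         # fixed random numbers, as in step 2 of MMC
x1 <- qweibull(u, shape = 1, scale = 4)  # Weibull(1, 4) sample
x2 <- qweibull(u, shape = 2, scale = 4)  # same randomness, different nuisance value
mean(x1)  # should be near the theoretical mean, 4 * gamma(1 + 1/1) = 4
```

Because u is held fixed, the simulated test statistics vary only through the trial value of the nuisance parameter, which is what makes the maximization over that parameter well defined.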

The MCHTest object will then perform simulated annealing to choose the value of the nuisance parameter that maximizes the p-value under the null hypothesis for the given dataset. Simulated annealing is implemented in the GenSA() function provided by the GenSA package (see [5]). GenSA() needs a description of the set of parameter values over which to optimize, and there is no general method for choosing it, so MCHTest() requires that a list be passed to optim_control containing the parameters that will be passed to GenSA(). At minimum, this list must contain an upper and a lower element, each of which is a numeric vector with names exactly matching the character vector passed to nuisance_params; these vectors specify the space GenSA() will search to find the optima. Other elements can be passed to control GenSA(), and I highly recommend reading the function's documentation for more details.

There's an additional parameter of MCHTest(), threshold_pval, that matters to the optimization. GenSA() will take many steps to make sure it reaches a good optimal value, perhaps taking too long. The authors of GenSA recommend specifying an additional terminating condition to speed up the process. threshold_pval will alter the threshold.stop parameter in the control list of optim_control so that the algorithm terminates when the estimated p-value crosses threshold_pval's value. Effectively, this means that we know that whatever the true p-value of the test is, it's larger than that threshold, and if the threshold is chosen appropriately, we know that we should not reject the null hypothesis based on the results of this test.

While giving threshold_pval a number less than 1 can help terminate the algorithm in the case of not rejecting the null hypothesis, the algorithm can still run for a long time if the test will eventually return a statistically significant result. For this reason, I recommend that optim_control contain a list called control, and that this list have a max.time element telling the algorithm the maximum running time (in seconds) it should have.

With all this in mind, we create the MCHTest object below:

mc.wei.scale.test <- MCHTest(ts, sg, N = 1000, seed = 123,
                             test_params = "scale",
                             nuisance_params = "shape",
                             optim_control = list(
                               "lower" = c("shape" = 0),
                               "upper" = c("shape" = 100),
                               "control" = list("max.time" = 10)),
                             threshold_pval = .2,
                             localize_functions = TRUE)

mc.wei.scale.test(rweibull(10, 2, 2), scale = 4)

##
## Monte Carlo Test
##
## data: rweibull(10, 2, 2)
## S = 7.2983, p-value < 2.2e-16

Conclusion

The MMC procedure is interesting, and I don't think any package implements it to the level I have in MCHT. The power of the procedure itself concerns me, though. Fortunately, the package also supports bootstrap testing, which I will discuss next week.

References
1. J-M Dufour, Monte Carlo tests with nuisance parameters: A general approach to finite-sample inference and nonstandard asymptotics, Journal of Econometrics, vol. 133 no. 2 (2006) pp. 443-477
2. J. G. MacKinnon, Bootstrap hypothesis testing in Handbook of computational econometrics (2009) pp. 183-213
3. A. C. A. Hope, A simplified Monte Carlo test procedure, JRSSB, vol. 30 (1968) pp. 582-598
4. M. L. Delignette-Muller and C. Dutang, fitdistrplus: an R package for fitting distributions, J. Stat. Soft., vol. 64 no. 4 (2015)
5. Y. Xiang et. al., Generalized simulated annealing for global optimization: the GenSA package, R Journal, vol. 5 no. 1 (2013) pp. 13-28

Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!


To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Summer Intern Projects

Mon, 10/22/2018 - 16:59

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

This summer we had five interns participate in our internship program. Each intern was here for 10 weeks and worked closely with a mentor or team. Everyone jumped right in and contributed quickly. We are excited about the progress our interns made and wanted to share it with you here!

Fanny Chow (@Fannystats) worked with Max Kuhn on bootstrapping methods in the rsample package. Check out her blog post here.

Alex Hayes (@alexpghayes) worked on a major new release of the broom package, featuring more modern behavior, new tidiers, new documentation, and some refactoring of package internals. He wrote a blog post about his experience.

Dana Seidel (@dpseidel) worked with Hadley this summer preparing the scales 1.0.0 release and improving themes and secondary axis functionality in ggplot2. Read more about her summer projects on her blog and take a look at slides from her September 19th talk introducing the scales package to RLadies-SF.

Timothy Mastny (@timmastny) worked with Winston Chang and the Shiny team on Sass. Sass is a CSS compiler that will help make dynamic themes for Shiny apps and other R packages. He made this R package from the internship.

Irene Steves (@i_steves) worked with Jenny Bryan on The Tidies of March. Here’s a link to her blog post. Irene also presented at R-Ladies in Paris.

Thank you again to our interns and Hadley, Max, Jenny, Winston, and Dave!


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Data Science With R Course Series – Week 6

Mon, 10/22/2018 - 11:28

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

Welcome to week 6 of the Data Science with R Course Series.

This week’s curriculum is an extension of week 5, where we started analyzing model performance.

This week will cover the following:

• Learn techniques to communicate and measure model performance
• Learn how to analyze individual machine learning model metrics
• Learn how to visually compare multiple models based on metrics
• Learn how to communicate model performance to different stakeholders

Here is a recap of our trajectory and the course overview:

Recap: Data Science With R Course Series

You're in Week 6: H2O Model Performance. Here's our game-plan for the 10 articles in this series. We'll cover how to apply data science for business with R, following our systematic process.

Week 6: H2O Model Performance

Student Feedback

Performance Overview & Setup

The Performance Overview & Setup module summarizes this week’s course material and reviews the required code, such as extracted H2O machine learning models and utility script files.

H2O Performance For Binary Classification

Learn how to create an H2O performance object to retrieve model metrics, such as AUC and logloss. Each H2O auto-generated machine learning model can be extracted and saved to memory.

Once the machine learning models are saved, you will learn how to:

• Compare two competing metrics in binary classification
• Compare top performing models using the AUC metric

Performance Charts For Data Scientists

In Performance Charts For Data Scientists, you will build on your knowledge of measuring a single classifier's performance to create a plot that measures the performance of multiple models.

Performance charting is a great way for data scientists to evaluate model performance using different metrics.

Performance Charts For Business People

All your efforts to this point have been to create a high-performance model that accurately predicts employee churn. However, business people will not be receptive to the language of data science.

In this module, learn how to create a chart for business people to understand how much a model will improve results. The performance chart for business people will help communicate what can be done to solve the problem and understand how serious the problem is.

Ultimate Model Performance Comparison Dashboard

In this lecture, you will create a model metrics dashboard that combines the performance chart for data scientists and the performance chart for business people into one visualization. Combining the charts assists in evaluating the strengths and weaknesses of multiple models in one view.

You Need To Learn R For Business

To be efficient as a data scientist, you need to learn R. Take the course that has cut data science projects in half (see this testimonial from a leading data science consultant) and has progressed data scientists more than anything they have tried before. Over 10-weeks you learn what it has taken data scientists 10-years to learn:

• Our systematic data science for business framework
• R and H2O for Machine Learning
• How to produce Return-On-Investment from data science
• And much more.

Next Up

The next article in the Data Science With R Series covers Modeling Churn: Explaining Black-Box Models With LIME.

Week 7 will cover feature explanation using the R library LIME. Explaining predictions is important for a data scientist in order to communicate model results. Using LIME, you will learn how to explain your machine learning model and empower business people to make better decisions.

Week 7: Modeling Churn: Explaining Black-Box Models With LIME

New Course Coming Soon: Build A Shiny Web App!

You’re experiencing the magic of creating a high performance employee turnover risk prediction algorithm in DS4B 201-R. Why not put it to good use in an Interactive Web Dashboard?

In our new course, Build A Shiny Web App (DS4B 301-R), you’ll learn how to integrate the H2O model, LIME results, and recommendation algorithm building in the 201 course into an ML-Powered R + Shiny Web App!

Shiny Apps Course Coming in October 2018!!! Sign up for Business Science University Now!

Building an R + Shiny Web App, DS4B 301-R

Get Started Today!


The Mysterious Ellipsis: Tutorial

Mon, 10/22/2018 - 10:45

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

If you have any basic experience with R, you have probably noticed that R uses a three-dot ellipsis (...) to allow functions to take arguments that weren't pre-defined or hard-coded when the function was built. Even though R beginners are usually aware of this behavior, especially through some common functions that implement it (for example, paste()), they often do not use it enough in their own functions. In other cases, the ellipsis is just not used properly or not fully taken advantage of. In this tutorial we will go through some common mistakes in using the ellipsis feature, and some interesting options for fully utilizing it and the flexibility that it offers.

Choose lists over vectors
The most common mistake is trying to assign the ellipsis content to a vector rather than a list. Well, of course it’s not so much of a mistake if we’re expecting only a single data type from the ellipsis arguments, but this is often not the case and assigning the arguments to a vector rather than a list might cause problems when there’s a variety of data types.

So make sure you’re always unpacking the ellipsis content using the list() function rather than the c() function. As an example, try running this piece of code with both options:

my_ellipsis_function <- function(...) {
args <- list(...) # good
# args <- c(...) # bad
length(args)
}

my_ellipsis_function("Hello World", mtcars)

Combine the ellipsis with other arguments
Some tend to think that it’s not possible to use the ellipsis with other regular arguments. This is not the case, and the ellipsis-arguments shouldn’t be the only ones in your function. You can combine them with as many regular arguments as you wish.

my_ellipsis_function <- function(x, ...) {
print(paste("Class of regular-argument:", class(x)))
print(paste("Number of ellipsis-arguments:", length(list(...))))
}

my_ellipsis_function(x = "Hello World", mtcars, c(1:10), list("Abc", 123))

Don’t forget the names
In fact, the values of the arguments themselves are not the only information that is passed through the ellipsis-arguments. The names of the arguments (if specified) can also be used. For example:

my_ellipsis_function <- function(...) {
names(list(...))
}

my_ellipsis_function(some_number = 123, some_string = "abc", some_missing_value = NA)

Lastly, a somewhat advanced procedure is unpacking the ellipsis-arguments into local (or even global) function variables. There are all kinds of scenarios where this might be needed (for global variable assignment it might be more intuitive). One example of a need for local variables is when a function takes a regular argument that depends on a varying set of other variables. A use of glue::glue() within another function is a good example of that. The following code demonstrates how simple it is to perform this "unpacking":

my_ellipsis_function <- function(...) {
args <- list(...)

for(i in 1:length(args)) {
assign(x = names(args)[i], value = args[[i]])
}

ls() # show the available variables

# some other code and operations
# that use the ellipsis-arguments as "native" variables...
}

my_ellipsis_function(some_number = 123, some_string = "abc")

So whether you're an R beginner or not, don't forget to utilize this convenient feature when needed, and use it wisely.
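One more pattern worth a mention (my own addition, not from the tutorial above): forwarding the ellipsis to another function, so callers can reach that function's arguments without you redeclaring them.

```r
my_summary <- function(x, ...) {
  # anything in ... (e.g. na.rm, trim) is passed straight through to mean()
  mean(x, ...)
}

my_summary(c(1, 2, NA), na.rm = TRUE)  # 1.5
```

This keeps your function's signature short while still exposing the full flexibility of the functions it wraps.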


To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Residential Property Investment Visualization and Analysis Shiny App

Mon, 10/22/2018 - 03:55

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Introduction:

The purpose of this Shiny web app is to better advise clients in residential property investments in New York City by visualizing sales and rental market trends in major areas.

Check out the app HERE, and find the code HERE.

Data:

Sales and rental data are compiled from OneBlockOver by Streeteasy. The dataset used in this web app is combined using:

– Borough
– Neighborhood
– Date
– Condo Listing Median Price
– Condo Sale Median Price
– Rental Studio Median Price
– Rental One Bedroom Median Price
– Rental Two Bedroom Median Price
– Rental Three+ Bedroom Median Price

Boroughs in consideration are Manhattan, Brooklyn, Queens.
The date of the data point ranges from Jan 2014 to Aug 2018, aggregated monthly.

Boundary data of the neighborhoods are acquired from data.beta.nyc.

How to use this web app:

Start by clicking on the “Visualization” tab on the side menu to go to the visualization page. Beware that it takes a little time to load so don’t be startled by the error messages. Voilà!

First, choose a borough and neighborhood from the right side of the page.

Then, you will see that the neighborhood you selected will be highlighted on the map at the top of the page.

On the bottom left, there will be a brief summary of the market trend of property sales in your chosen neighborhood, whether it has been increasing or decreasing. And it will provide a rough estimate of the percentage change in property value and value prediction for 2019 and 2023 (5 years). You can also opt to see an interactive chart of Listing and Sale price over the years.

On the bottom right, there is a chart showing the rental prices for different properties types, whether it is a Studio, One Bedroom, or even Three+ Bedroom.

You will notice a slider and a selector below the location selectors; they are used to set a budget and a room count so that the tool can calculate a rough estimate of your mortgage payment, given a 20% down payment, 30-year term, and an annual rate of 5.04%.
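The mortgage estimate described above follows the standard amortization formula; here is a sketch (the app's actual code isn't shown, so the function and argument names here are my own):

```r
# Monthly payment on a fixed-rate mortgage: P * r / (1 - (1 + r)^-n),
# where P is the financed principal, r the monthly rate, n the payment count.
monthly_payment <- function(price, down = 0.20, annual_rate = 0.0504, years = 30) {
  principal <- price * (1 - down)   # 20% down payment
  r <- annual_rate / 12             # monthly interest rate
  n <- years * 12                   # number of monthly payments
  principal * r / (1 - (1 + r)^(-n))
}

monthly_payment(500000)  # roughly $2,157 per month on a $500k purchase
```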

Future Works:

• There is some backend issue with the interface that will require an overhaul, which I will tackle first.
• Still making improvements on the map feature. The idea is to show nearest neighborhood info on the map for comparison.
• The dataset is incomplete for neighborhoods where brokers do not post listings on Streeteasy. More detailed and comprehensive data is needed.
• Need more features to consider such as crime rate, school ranking, etc. for different neighborhoods.

The tweenr is all grown up

Mon, 10/22/2018 - 02:00

(This article was first published on Data Imaginist, and kindly contributed to R-bloggers)

NOTE: tweenr was released some time ago but a theft of my computer while
writing the release post meant that I only just finished writing about it now.

I’m very happy to once again announce a package update, as a new major version
of tweenr is now on CRAN. This release, while significant in itself, is also
an important part of getting gganimate on CRAN, so if you care about
gganimate but have never heard of tweenr you should be happy nonetheless.

tweenr was my first sort-of popular package and filled a gap in the gganimate
version of yore, where smooth transitions were something you had to bring
yourself. It has lived almost unchanged since its initial release, but as I
began to develop the next iteration of gganimate it became clear that new
functionality was needed. Some of it ended up in the
transformr but a huge chunk has
been added to tweenr itself. A description of everything new follows below.

Something new, something old

The main API of the previous version of tweenr consisted of tween_states(),
tween_elements(), and tween_appear(). All of these needed serious changes in
both capabilities and API, to the extent that all prior code would break, so
instead I decided to keep the old functions unchanged and create new ones for a
brighter future.

tween_states ⇒ tween_state/keep_state

tween_states() was perhaps the most used of the old functions. It takes a list
of data.frames and a specification of how long transitions between them should
take and how long it should pause at each data.frame. This is perfect for
situations where you have discrete states and want to have a smooth transition
between them. Still it had certain shortcomings, such as requiring that each
data.frame included the same number of rows etc. (meaning that each “element”
should be present in each state). Further I have ended up finding it a bit
clumsy to use. The new function(s) that should serve the same needs as
tween_states() is tween_state() (and keep_state()) and they are much more
powerful. The biggest difference, perhaps, is that tween_state() only takes a
from and to state and not an arbitrarily long list of states. If transitions
are needed between more states you’ll need to chain calls together (potentially
with keep_state() if the state should pause between transitions). It will look
something like this:

library(tweenr)

irises <- split(iris, iris$Species)

iris_tween <- irises$setosa %>%
  tween_state(irises$versicolor, ease = 'cubic-in-out', nframes = 10) %>%
  keep_state(5) %>%
  tween_state(irises$virginica, ease = 'linear', nframes = 15) %>%
  keep_state(5)

As can be seen, if you like piping you’ll feel right at home. Apart from the
seemingly superficial change in API, the tween_state() also packs some new
tricks. One of these is per-column easing, so you can specify different easing
functions for different variables. Another, more fundamental one, is the
possibility of specifying an id to match rows by. Ultimately this means that
rows no longer needs to be matched by position and that you can now tween
between states with different numbers of elements. All of this is so important
that it will get its own section later on in Enter and Exit.

tween_elements ⇒ tween_components

While tween_states() was probably the most used of the legacy functions it was
by no means the only one. tween_elements() was a very powerful function that
let you specify different individual states for each element in a single
data.frame and then expand this to an arbitrary number of frames. The changes
that its heir brings are less dramatic than what happened with tween_states():
it simply adds the same features that tween_state() introduced. This means
per-column easing and the same features as described below in
Enter and Exit. Further it changes the semantics of a couple of
variables so they are now tidy evaluated. This means that state specifications
can be calculated on the fly, rather than having to exist inside the data.frame
to be tweened.

tween_appear ⇒ tween_events

tween_appear() never really felt right. The purpose was to let each row appear
in a specific frame while still allowing the user to define how it should
appear. To this end it expanded the data.frame by giving each row an age in
each frame (negative age meant that it had yet to appear) and then let the user
do with this as they pleased. What was missing was the whole idea of
Enter and Exit which I have already plugged multiple times.
tween_events() is a pretty radical change in order to solve the problem that
tween_appear() originally tried (but failed) to solve.

Enter and Exit

Before we go any further with other new tweening functions I think I owe it to
you to describe what all this entering and exiting I’ve been talking about
really is. If you have dabbled in D3.js the two words will be familiar but they
have slightly different meanings in tweenr. In D3, enter and exit describe a
selection of data that did not match in a data join between current and next
state, while in tweenr it is a function that modifies data that is going to
appear or disappear. If enter and/or exit is not given, the data will just pop
into existence in the first frame it relates to and disappear without a trace
after the last frame it relates to has ended. if you provide e.g. an enter
function this will be applied to all elements when they first appear. The result
of the function will then be inserted into the tweening prior to the original
data so that any changes the function does will gradually change to the original
data. This may sound quite confusing, but in essence it means that if you pass
in an enter function that sets the transparency variable to zero you'll get a
gradual fade-in effect. The exit function is just like it, but in reverse. But
let’s see it in effect instead:

df1 <- data.frame(x = 1:2, y = 2, alpha = 1) # 2 rows df2 <- data.frame(x = 2:0, y = 1, alpha = 1) # 3 rows fade <- function(data) { data$alpha <- 0 data } tween <- tween_state(df1, df2, ease = 'linear', nframes = 5, enter = fade) tween$alpha[tween$.id == 3] ## [1] 0.25 0.50 0.75 1.00 We can see that the alpha value of the third element (the one that doesn’t exist in the first state) gradually increase from zero to one. The reason why 0 isn’t included is that the enter and exit function return virtual states that doens’t remain in the data – only the transition to and from them does. New tweens on the block Appart from the new versions of the old functionality discussed above this version also includes some brand new tweens, mainly implemented to serve needs in gganimate but of course also available for everyone else. tween_along This is the tweening function that powers transition_reveal() in gganimate – if you have played with that you are fairly well situated to understand what it does. In essence it allows you to specify time points for the different rows in your data.frame and then tween between these. You might think that this is exactly what tween_components() does and you’d partly be right. The big difference is that tween_along() ensures equidistant frames, whereas tween_components() assigns all rows in the data to the nearest frame and then use the frame as a time variable. The latter will always have the raw data appearing in one frame or another while the former will not. Further, tween_along() will optionally keep earlier rows in your data at each frame, which is useful if you e.g. want a line to gradually appear along an axis. tween_at This is a pretty low level tweening function intened to get an exact state between two data frames or vectors. It takes two states and a numeric vector giving the tween position for each row and then calculates the intermediary rows. 
tween_at(mtcars[1:3, ], mtcars[4:6, ], runif(3), 'linear') ## mpg cyl disp hp drat wt qsec vs ## 1 21.27039 6.000000 226.2455 110.0000 3.345701 3.022205 18.47440 0.6759747 ## 2 20.51284 6.423621 202.3621 123.7677 3.741142 2.994673 17.02000 0.0000000 ## 3 19.77526 5.287121 183.2966 100.7227 3.148519 3.053659 19.64613 1.0000000 ## am gear carb ## 1 0.3240253 3.324025 1.972076 ## 2 0.7881894 3.788189 3.576379 ## 3 0.3564393 3.356439 1.000000 This is unlikely to be useful for directly creating animations (though I guess it is low-level enough to be able to be shoehorned into anything). It is being used in gganimate for calculating shadow falloff in shadow_wake(). tween_fill This tween takes a page out of tidyr::fill and simply fill out missing elements in a data frame or vector. Instead of being boring and repeating the prior or following data it doesn the tweenr thing and tweens between them: mtcars2 <- mtcars[1:7, ] mtcars2[2:6, ] <- NA tween_fill(mtcars2, 'cubic-in-out') ## mpg cyl disp hp drat wt qsec vs ## 1 21.00000 6.000000 160.0000 110.0 3.900000 2.620000 16.46000 0 ## 2 20.87593 6.037037 163.7037 112.5 3.887222 2.637593 16.44852 0 ## 3 20.00741 6.296296 189.6296 130.0 3.797778 2.760741 16.36815 0 ## 4 17.65000 7.000000 260.0000 177.5 3.555000 3.095000 16.15000 0 ## 5 15.29259 7.703704 330.3704 225.0 3.312222 3.429259 15.93185 0 ## 6 14.42407 7.962963 356.2963 242.5 3.222778 3.552407 15.85148 0 ## 7 14.30000 8.000000 360.0000 245.0 3.210000 3.570000 15.84000 0 ## am gear carb ## 1 1.00000000 4.000000 4 ## 2 0.98148148 3.981481 4 ## 3 0.85185185 3.851852 4 ## 4 0.50000000 3.500000 4 ## 5 0.14814815 3.148148 4 ## 6 0.01851852 3.018519 4 ## 7 0.00000000 3.000000 4 Neat-o… Grab bag of niceties There are more subtle additions to tweenr as well, most of which has also been driven by gganimate needs. Here’s an unceremonious list: • More information: In olden days tweenr simply added a .frame column to the output to identify the frame the row belonged to. 
It still does, but that column is now accompanied by a .phase column that tells whether the data is raw, static, entering, exiting, or transitioning, and an .id column that identifies the same data across frames.

• More support: The supported data types have been expanded considerably. Most notably, list columns are now accepted. If the list only contains numeric vectors these vectors will get tweened accordingly; if not, the list will be treated as constant.

• Better colour support: Colour has always been tweened in the LAB representation to get a more natural transition. In the old version the native convertColor() function was used, but this could lead to substantial slowdown when tweening lots of data. To address this, farver was developed and released, and this version of tweenr naturally uses it for colour space conversions now. In addition, tweenr now supports hex-colours with alpha.

I think that is it, but frankly there have been so many additions that I may have missed a few… First to find something I missed gets a sticker!

To leave a comment for the author, please follow the link and comment on their blog: Data Imaginist. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Faceted Graphs with cdata and ggplot2 Mon, 10/22/2018 - 00:03 (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers) In between client work, John and I have been busy working on our book, Practical Data Science with R, 2nd Edition. To demonstrate a toy example for the section I’m working on, I needed scatter plots of the petal and sepal dimensions of the iris data, like so: I wanted a plot for petal dimensions and sepal dimensions, but I also felt that two plots took up too much space. So, I thought, why not make a faceted graph that shows both: Except — which columns do I plot and what do I facet on? head(iris) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa Here’s one way to create the plot I want, using the cdata package along with ggplot2. First, load the packages and data: library("ggplot2") library("cdata") iris <- data.frame(iris) Now define the data-shaping transform, or control table. The control table is basically a picture that sketches out the final data shape that I want. I want to specify the x and y columns of the plot (call these the value columns of the data frame) and the column that I am faceting by (call this the key column of the data frame). And I also need to specify how the key and value columns relate to the existing columns of the original data frame. Here’s what the control table looks like: The control table specifies that the new data frame will have the columns flower_part, Length and Width. Every row of iris will produce two rows in the new data frame: one with a flower_part value of Petal, and another with a flower_part value of Sepal. The Petal row will take the Petal.Length and Petal.Width values in the Length and Width columns respectively. Similarly for the Sepal row. 
Here I create the control table in R, using the convenience function wrapr::build_frame() to create the controlTable data frame in a legible way.

(controlTable <- wrapr::build_frame(
   "flower_part", "Length"      , "Width"       |
   "Petal"      , "Petal.Length", "Petal.Width" |
   "Sepal"      , "Sepal.Length", "Sepal.Width" ))

##   flower_part       Length       Width
## 1       Petal Petal.Length Petal.Width
## 2       Sepal Sepal.Length Sepal.Width

Now I apply the transform to iris using the function rowrecs_to_blocks(). I also want to carry along the Species column so I can color the scatterplot points by species.

iris_aug <- rowrecs_to_blocks(
  iris,
  controlTable,
  columnsToCopy = c("Species"))

head(iris_aug)

##   Species flower_part Length Width
## 1  setosa       Petal    1.4   0.2
## 2  setosa       Sepal    5.1   3.5
## 3  setosa       Petal    1.4   0.2
## 4  setosa       Sepal    4.9   3.0
## 5  setosa       Petal    1.3   0.2
## 6  setosa       Sepal    4.7   3.2

And now I can create the plot!

ggplot(iris_aug, aes(x=Length, y=Width)) +
  geom_point(aes(color=Species, shape=Species)) +
  facet_wrap(~flower_part, labeller = label_both, scale = "free") +
  ggtitle("Iris dimensions") +
  scale_color_brewer(palette = "Dark2")

In the next post, I will show how to use cdata and ggplot2 to create a scatterplot matrix.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
RApiDatetime 0.0.4: Updates and Extensions

Sun, 10/21/2018 - 21:12

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The first update in a little while brings us release 0.0.4 of RApiDatetime, which got onto CRAN this morning via the lovely automated sequence of submission, pretest-recheck and pretest-publish.

RApiDatetime provides seven entry points for C-level functions of the R API for Date and Datetime calculations. The functions asPOSIXlt and asPOSIXct convert between long and compact datetime representation, formatPOSIXlt and Rstrptime convert to and from character strings, and POSIXlt2D and D2POSIXlt convert between Date and POSIXlt datetime. This release brings asDatePOSIXct as a seventh, courtesy of Josh Ulrich. All these functions are fairly useful, but not one of them was previously exported by R for C-level use by other packages. Which is silly, as this is generally extremely carefully written and tested code.

I also updated the exported base R code to what is in R 3.5.1 right now, added a few #nocov declarations (not all of which are reflected at the codecov page yet), and added a dependency badge at the GitHub repo.

Changes in RApiDatetime version 0.0.4 (2018-10-21)

• New function asDatePOSIXct (Josh Ulrich in #2)

• Synchronized with upstream code in base R (Dirk in #3)

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the rapidatetime page. For questions or comments please use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

automl package: part 1/2 why and how

Sun, 10/21/2018 - 20:32

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Why & how automl package

The automl package provides:

• the latest Deep Learning tricks (those who have taken Andrew NG’s MOOC on Coursera will be in familiar territory)

• hyperparameter autotuning with a metaheuristic (PSO)

• experimental stuff and more to come (you’re welcome as coauthor!)

Deep Learning existing frameworks, disadvantages

Deploying and maintaining most Deep Learning frameworks means: Python… The R language is so simple to install and maintain in production environments that it is obvious to use a pure R based package for deep learning!

Neural Network – Deep Learning, disadvantages

• 1st disadvantage: you have to manually test different combinations of parameters (number of layers, nodes, activation function, etc.) and then also manually tune the hyperparameters for training (learning rate, momentum, mini batch size, etc.)

• 2nd disadvantage: for those who are not mathematicians, calculating the derivative of a new cost or activation function may be an issue.

Metaheuristic – PSO, benefits

The Particle Swarm Optimization algorithm is a great and simple one.
In a few words, the first step consists in randomly throwing a set of particles into a space, and the next steps consist in discovering the best solution while converging. The video tutorial from Yarpiz is a great resource.

Birth of automl package

The automl package was born from the idea of using the PSO metaheuristic to address the disadvantages identified above. And last but not least: use R and R only.

3 functions are available:

• automl train manual: the manual mode to train a model

• automl train: the automatic mode to train a model

• automl predict: the prediction function to apply a trained model to data

Mix 1: hyperparameter tuning with PSO

Mix 1 consists in using the PSO algorithm to optimize the hyperparameters: each particle corresponds to a set of hyperparameters. The automl train function was made to do that. nb: parameters (number of nodes, activation functions, etc.) are not automatically tuned for the moment, but why not in the future.

Mix 2: PSO instead of gradient descent

Mix 2 is experimental; it consists in using the PSO algorithm to optimize the weights of the neural network in place of gradient descent: each particle corresponds to a set of neural network weight matrices. The automl train manual function does that too.

Next post

We will see how to use it in the next post. Feel free to comment or join the team being formed.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.
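As a small appendix to the automl post above, the PSO idea it relies on can be sketched in a few lines of base R. This is my own illustrative toy implementation, not code from the automl package; the inertia and acceleration constants (w, c1, c2) are common textbook defaults, not values taken from the package.

```r
# Toy particle swarm optimisation: minimise f over `dim` dimensions.
# Illustrative only -- NOT the automl implementation.
pso_sketch <- function(f, n_particles = 20, n_iter = 200, dim = 2,
                       w = 0.7, c1 = 1.5, c2 = 1.5) {
  pos <- matrix(runif(n_particles * dim, -5, 5), n_particles, dim) # random start
  vel <- matrix(0, n_particles, dim)
  pbest     <- pos                       # each particle's best position so far
  pbest_val <- apply(pos, 1, f)
  gbest <- pbest[which.min(pbest_val), ] # best position seen by the whole swarm
  for (i in seq_len(n_iter)) {
    r1 <- matrix(runif(n_particles * dim), n_particles, dim)
    r2 <- matrix(runif(n_particles * dim), n_particles, dim)
    gb <- matrix(gbest, n_particles, dim, byrow = TRUE)
    # velocity update: inertia + pull towards personal best + pull towards global best
    vel <- w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gb - pos)
    pos <- pos + vel
    val <- apply(pos, 1, f)
    better <- val < pbest_val
    pbest[better, ]   <- pos[better, ]
    pbest_val[better] <- val[better]
    gbest <- pbest[which.min(pbest_val), ]
  }
  gbest
}

# toy objective with its minimum at c(1, 2)
f <- function(x) (x[1] - 1)^2 + (x[2] - 2)^2
pso_sketch(f) # typically converges very close to c(1, 2)
```

In the hyperparameter-tuning "Mix 1" described above, each particle's position vector would hold a set of hyperparameters rather than plain coordinates, and f would be the validation loss of a network trained with them.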
Statistics Sunday: What Fast Food Can Tell Us About a Community and the World

Sun, 10/21/2018 - 15:25

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

Two statistical indices crossed my inbox in the last week, both of which use fast food restaurants to measure a concept indirectly. First up, in the wake of recent hurricanes, is the Waffle House Index. As The Economist explains:

Waffle House, a breakfast chain from the American South, is better known for reliability than quality. All its restaurants stay open every hour of every day. After extreme weather, like floods, tornados and hurricanes, Waffle Houses are quick to reopen, even if they can only serve a limited menu. That makes them a remarkably reliable if informal barometer for weather damage. The index was invented by Craig Fugate, a former director of the Federal Emergency Management Agency (FEMA) in 2004 after a spate of hurricanes battered America’s east coast. “If a Waffle House is closed because there’s a disaster, it’s bad. We call it red. If they’re open but have a limited menu, that’s yellow,” he explained to NPR, America’s public radio network. Fully functioning restaurants mean that the Waffle House Index is shining green.

Next is the Big Mac Index, created by The Economist:

The Big Mac index was invented by The Economist in 1986 as a lighthearted guide to whether currencies are at their “correct” level.
It is based on the theory of purchasing-power parity (PPP), the notion that in the long run exchange rates should move towards the rate that would equalise the prices of an identical basket of goods and services (in this case, a burger) in any two countries.

You might remember a discussion of the “basket of goods” in my post on the Consumer Price Index. And in fact, the Big Mac Index, which started as a way “to make exchange-rate theory more digestible,” has since become a global standard and is used in multiple studies. Now you can use it too, because the data and methodology have been made available on GitHub. R users will be thrilled to know that the code is written in R, but you’ll need to use a bit of Python to get at the Jupyter notebook they’ve put together. Fortunately, they’ve provided detailed information on installing and setting everything up.

To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

Getting the data from the Luxembourgish elections out of Excel

Sun, 10/21/2018 - 02:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

In this blog post, similar to a previous blog post, I am going to show you how we can go from an Excel workbook that contains data to a flat file.
I will be taking advantage of the structure of the tables inside the Excel sheets by writing a function that extracts the tables, and then mapping it over each sheet!

Last week, October 14th, Luxembourgish nationals went to the polls to elect the Grand Duke! No, actually, the Grand Duke does not get elected. But Luxembourgish citizens did go to the polls to elect the new members of the Chamber of Deputies (a sort of parliament, if you will). The way the elections work in Luxembourg is quite interesting; you can vote for a party, or vote for individual candidates from different parties. The candidates that get the most votes will then sit in the parliament. If you vote for a whole party, each of the candidates gets a vote. You get as many votes as there are candidates to vote for. So, for example, if you live in the capital city, also called Luxembourg, you get 21 votes to distribute. You could decide to give 10 votes to 10 candidates of party A and 11 to 11 candidates of party B.

Why 21 votes? The Chamber of Deputies is made up of 60 deputies, and the country is divided into four legislative circonscriptions. So each voter in a circonscription gets a number of votes that is proportional to the population size of that circonscription.

Now you certainly wonder why I put the flag of Gambia at the top of this post. This is because the government that was formed after the 2013 elections was made up of a coalition of 3 parties: the Luxembourg Socialist Workers’ Party, the Democratic Party and The Greens. The LSAP managed to get 13 seats in the Chamber, the DP got 13 and The Greens 6, meaning 32 seats out of 60. Because they made this coalition, they could form the government, and the coalition was named the Gambia coalition after the colors of these 3 parties: red, blue and green. If you want to take a look at the ballot from 2013 for the southern circonscription, click here.

Now that you have the context, we can go back to some data science.
The results of the elections of last week can be found on Luxembourg’s Open Data portal, right here. The data is trapped inside Excel sheets; just like I explained in a previous blog post, the data is easily read by humans, but not easily digested by any type of data analysis software. So I am going to show you how to go from this big Excel workbook to a flat file.

First of all, if you open the Excel workbook, you will notice that there are a lot of sheets: there is one for the whole country, named “Le Grand-Duché de Luxembourg”, one for each of the four circonscriptions, “Centre”, “Nord”, “Sud” and “Est”, and 102 more, one for each commune of the country (a commune is an administrative division). However, the tables are all very similarly shaped, and roughly at the same position. This is good, because we can write a function to extract the data and then map it over all the sheets.

First, let’s load some packages and the data for the country:

library("tidyverse")
library("tidyxl")
library("brotools")

# National Level 2018
elections_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx",
                                 sheets = "Le Grand-Duché de Luxembourg")

{brotools} is my own package. You can install it with:

devtools::install_github("b-rodrigues/brotools")

It contains a function that I will use down below. The function I wrote to extract the tables is not very complex, but it requires that you are familiar with how {tidyxl} imports Excel workbooks. So if you are not familiar with it, study the imported data frame for a few minutes.
It will make understanding the next function easier: extract_party <- function(dataset, starting_col, target_rows){ almost_clean <- dataset %>% filter(row %in% target_rows) %>% filter(col %in% c(starting_col, starting_col + 1)) %>% select(character, numeric) %>% fill(numeric, .direction = "up") %>% filter(!is.na(character)) party_name <- almost_clean$character[1] %>% str_split("-", simplify = TRUE) %>% .[2] %>% str_trim() almost_clean$character[1] <- "Pourcentage" almost_clean$party <- party_name colnames(almost_clean) <- c("Variables", "Values", "Party") almost_clean %>% mutate(Year = 2018) %>% select(Party, Year, Variables, Values) }

This function has three arguments: dataset, starting_col and target_rows. dataset is the
data I loaded with xlsx_cells from the {tidyxl} package. I think the following picture easily
illustrates what the function does:

So the function first filters only the rows we are interested in, then the columns. I then select
the columns I want, which are called character and numeric (if an Excel cell contains characters
you will find them in the character column; if it contains numbers you will find them in the numeric
column), then I fill the empty cells with the values from the numeric column and then I remove
the NA’s. These last two steps might not be so clear; this is what the data looks like up until the
select() function:

> elections_raw_2018 %>%
+     filter(row %in% seq(11,19)) %>%
+     filter(col %in% c(1, 2)) %>%
+     select(character, numeric)

# A tibble: 18 x 2
   character                    numeric
 1 1 - PIRATEN - PIRATEN             NA
 2 NA                            0.0645
 3 Suffrage total                    NA
 4 NA                            227549
 5 Suffrages de liste                NA
 6 NA                            181560
 7 Suffrage nominatifs               NA
 8 NA                             45989
 9 Pourcentage pondéré               NA
10 NA                            0.0661
11 Suffrage total pondéré            NA
12 NA                            13394.
13 Suffrages de liste pondéré        NA
14 NA                             10308
15 Suffrage nominatifs pondéré       NA
16 NA                             3086.
17 Mandats attribués                 NA
18 NA                                 2

So by filling the NA’s in the numeric the data now looks like this:

> elections_raw_2018 %>%
+     filter(row %in% seq(11,19)) %>%
+     filter(col %in% c(1, 2)) %>%
+     select(character, numeric) %>%
+     fill(numeric, .direction = "up")

# A tibble: 18 x 2
   character                    numeric
 1 1 - PIRATEN - PIRATEN         0.0645
 2 NA                            0.0645
 3 Suffrage total                227549
 4 NA                            227549
 5 Suffrages de liste            181560
 6 NA                            181560
 7 Suffrage nominatifs            45989
 8 NA                             45989
 9 Pourcentage pondéré           0.0661
10 NA                            0.0661
11 Suffrage total pondéré        13394.
12 NA                            13394.
13 Suffrages de liste pondéré     10308
14 NA                             10308
15 Suffrage nominatifs pondéré    3086.
16 NA                             3086.
17 Mandats attribués                  2
18 NA                                 2

And then I filter out the NA’s from the character column, and that’s almost it! I simply need
to add a new column with the party’s name and rename the other columns. I also add a “Year” column.

Now, each party will have a different starting column. The table with the data for the first party
starts on column 1, for the second party it starts on column 4, column 7 for the third party…
So the following vector contains all the starting columns:

position_parties_national <- seq(1, 24, by = 3)

(If you study the Excel workbook closely, you will notice that I do not extract the last two parties.
This is because these parties were not present in all of the 4 circonscriptions and are very, very,
very small.)

The target rows are always the same, from 11 to 19. Now, I simply need to map this function to
this list of positions and I get the data for all the parties:

elections_national_2018 <- map_df(position_parties_national, extract_party,
                                  dataset = elections_raw_2018,
                                  target_rows = seq(11, 19)) %>%
    mutate(locality = "Grand-Duchy of Luxembourg", division = "National")

I also added the locality and division columns to the data.

Let’s take a look:

glimpse(elections_national_2018)

## Observations: 72
## Variables: 6
## $ Party     "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN...
## $ Year      2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ Variables "Pourcentage", "Suffrage total", "Suffrages de liste...
## $ Values    6.446204e-02, 2.275490e+05, 1.815600e+05, 4.598900e+...
## $ locality  "Grand-Duchy of Luxembourg", "Grand-Duchy of Luxembo...
## $ division  "National", "National", "National", "National", "Nat...

Very nice.

Now we need to do the same for the 4 electoral circonscriptions. First, let’s load the data:

# Electoral districts 2018
districts <- c("Centre", "Nord", "Sud", "Est")

elections_district_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx",
                                          sheets = districts)

Now things get trickier. Remember I said that the number of seats is proportional to the population
of each circonscription? We simply can’t use the same target rows as before. For example, for the
“Centre” circonscription, the target rows go from 12 to 37, but for the “Est” circonscription
only from 12 to 23. Ideally, we would need a function that would return the target rows.

This is that function:

# The target rows I need to extract are different from district to district
get_target_rows <- function(dataset, sheet_to_extract, reference_address){

    last_row <- dataset %>%
        filter(sheet == sheet_to_extract) %>%
        filter(address == reference_address) %>%
        pull(numeric)

    seq(12, (11 + 5 + last_row))
}

This function needs a dataset, a sheet_to_extract and a reference_address. The reference
address is a cell that actually contains the number of seats in that circonscription, in our
case “B5”. We can easily get the list of target rows now:

# Get the target rows
list_targets <- map(districts, get_target_rows,
                    dataset = elections_district_raw_2018,
                    reference_address = "B5")

list_targets

## [[1]]
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [24] 35 36 37
##
## [[2]]
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25
##
## [[3]]
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [24] 35 36 37 38 39
##
## [[4]]
##  [1] 12 13 14 15 16 17 18 19 20 21 22 23
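As a quick sanity check of the arithmetic inside get_target_rows(): the first element of list_targets above is for "Centre", so its B5 cell must contain 21 (the number of seats, inferred from the output above rather than read from the workbook), because:

```r
seats_centre <- 21             # implied value of cell B5 on the "Centre" sheet
seq(12, 11 + 5 + seats_centre) # rows 12 to 37, matching list_targets[[1]]
```

The "+ 5" accounts for the fixed summary rows (totals, percentages, etc.) that sit below the per-candidate rows in every sheet.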

Now, let’s split the data we imported into a list, where each element of the list is a dataframe
with the data from one circonscription:

list_data_districts <- map(districts, ~filter(.data = elections_district_raw_2018, sheet == .))

Now I can easily map the function I defined above, extract_party, to this list of datasets. Well,
I say easily, but it’s a bit more complicated than before, because I now have a list of datasets
and a list of target rows:

elections_district_2018 <- map2(.x = list_data_districts, .y = list_targets,
                                ~map_df(position_parties_national, extract_party,
                                        dataset = .x, target_rows = .y))

The way to understand this is that for each element of list_data_districts and list_targets,
I have to map extract_party to each element of position_parties_national. This gives the intended
result:

elections_district_2018

## [[1]]
## # A tibble: 208 x 4
##    Party   Year Variables            Values
##  1 PIRATEN 2018 Pourcentage          0.0514
##  2 PIRATEN 2018 CLEMENT Sven (1)       8007
##  3 PIRATEN 2018 WEYER Jerry (2)        3446
##  4 PIRATEN 2018 CLEMENT Pascal (3)     3418
##  5 PIRATEN 2018 KUNAKOVA Lucie (4)     2860
##  6 PIRATEN 2018 WAMPACH Jo (14)        2693
##  7 PIRATEN 2018 LAUX Cynthia (6)       2622
##  8 PIRATEN 2018 ISEKIN Christian (5)   2610
##  9 PIRATEN 2018 SCHWEICH Georges (9)   2602
## 10 PIRATEN 2018 LIESCH Mireille (8)    2551
## # ... with 198 more rows
##
## [[2]]
## # A tibble: 112 x 4
##    Party   Year Variables                          Values
##  1 PIRATEN 2018 Pourcentage                        0.0767
##  2 PIRATEN 2018 COLOMBERA Jean (2)                   5074
##  3 PIRATEN 2018 ALLARD Ben (1)                       4225
##  4 PIRATEN 2018 MAAR Andy (3)                        2764
##  5 PIRATEN 2018 GINTER Joshua (8)                    2536
##  6 PIRATEN 2018 DASBACH Angelika (4)                 2473
##  7 PIRATEN 2018 GRÜNEISEN Sam (6)                    2408
##  8 PIRATEN 2018 BAUMANN Roy (5)                      2387
##  9 PIRATEN 2018 CONRAD Pierre (7)                    2280
## 10 PIRATEN 2018 TRAUT ép. MOLITOR Angela Maria (9)   2274
## # ... with 102 more rows
##
## [[3]]
## # A tibble: 224 x 4
##    Party   Year Variables                 Values
##  1 PIRATEN 2018 Pourcentage               0.0699
##  2 PIRATEN 2018 GOERGEN Marc (1)            9818
##  3 PIRATEN 2018 FLOR Starsky (2)            6737
##  4 PIRATEN 2018 KOHL Martine (3)            6071
##  5 PIRATEN 2018 LIESCH Camille (4)          6025
##  6 PIRATEN 2018 KOHL Sylvie (6)             5628
##  7 PIRATEN 2018 WELTER Christian (5)        5619
##  8 PIRATEN 2018 DA GRAÇA DIAS Yanick (10)   5307
##  9 PIRATEN 2018 WEBER Jules (7)             5301
## 10 PIRATEN 2018 CHMELIK Libor (8)           5247
## # ... with 214 more rows
##
## [[4]]
## # A tibble: 96 x 4
##    Party   Year Variables                       Values
##  1 PIRATEN 2018 Pourcentage                     0.0698
##  2 PIRATEN 2018 FRÈRES Daniel (1)                 4152
##  3 PIRATEN 2018 CLEMENT Jill (7)                  1943
##  4 PIRATEN 2018 HOUDREMONT Claire (2)             1844
##  5 PIRATEN 2018 BÖRGER Nancy (3)                  1739
##  6 PIRATEN 2018 MARTINS DOS SANTOS Catarina (6)   1710
##  7 PIRATEN 2018 BELLEVILLE Tatjana (4)            1687
##  8 PIRATEN 2018 CONTRERAS Gerald (5)              1687
##  9 PIRATEN 2018 Suffrages total                  14762
## 10 PIRATEN 2018 Suffrages de liste               10248
## # ... with 86 more rows
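The map2()-plus-nested-map_df() pattern used here can feel dense, so here is a stripped-down toy analogue (with plain numbers standing in for the election data) showing how each pair of outer elements is combined with every element of an inner vector:

```r
library(purrr)

outer_a <- list(10, 100)    # plays the role of list_data_districts
outer_b <- list(1, 2)       # plays the role of list_targets
inner   <- c(0.1, 0.2, 0.3) # plays the role of position_parties_national

# for each pair (.x, .y) taken from the two outer lists,
# map over `inner` and combine the three results
map2(outer_a, outer_b, ~ map_dbl(inner, function(p) .x + .y + p))
## [[1]]
## [1] 11.1 11.2 11.3
##
## [[2]]
## [1] 102.1 102.2 102.3
```

The real code does exactly this, except the inner map is a map_df() that row-binds the per-party tibbles instead of collecting numbers.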

I now need to add the locality and division columns:

elections_district_2018 <- map2(.y = elections_district_2018, .x = districts,
                                ~mutate(.y, locality = .x, division = "Electoral district")) %>%
    bind_rows()

We’re almost done! Now we need to do the same for the 102 remaining sheets, one for each commune
of Luxembourg. This will now go very fast, because we have all the building blocks from before:

communes <- xlsx_sheet_names("leg-2018-10-14-22-58-09-737.xlsx")

communes <- communes %-l%
    c("Le Grand-Duché de Luxembourg", "Centre", "Est", "Nord", "Sud", "Sommaire")

Let me introduce the following function: %-l%. This function removes elements from lists:

c("a", "b", "c", "d") %-l% c("a", "d")

## [1] "b" "c"

You can think of it as “minus for lists”. This is called an infix operator.

So this function is very useful to get the list of communes, and is part of my package, {brotools}.
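If you would rather not install a package just for this one operator, a minimal version is easy to define yourself. This is my own sketch, not necessarily how {brotools} implements it:

```r
# a home-made "minus for lists/vectors" infix operator
`%-l%` <- function(x, y) x[!x %in% y]

c("a", "b", "c", "d") %-l% c("a", "d")
## [1] "b" "c"
```

Any function whose name is wrapped in `%` signs can be used in this infix position in R, which is all there is to defining such an operator.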

As before, I load the data:

elections_communes_raw_2018 <- xlsx_cells("leg-2018-10-14-22-58-09-737.xlsx", sheets = communes)

Then I get my list of targets, but I need to change the reference address. It’s “B8” now, not “B5”.

# Get the target rows
list_targets <- map(communes, get_target_rows,
                    dataset = elections_communes_raw_2018,
                    reference_address = "B8")

I now create a list of datasets, one per commune, by mapping a filter function over the data:

list_data_communes <- map(communes, ~filter(.data = elections_communes_raw_2018, sheet == .))

And just as before, I get the data I need by using extract_party, and adding the “locality” and
“division” columns:

elections_communes_2018 <- map2(.x = list_data_communes, .y = list_targets,
                                ~map_df(position_parties_national, extract_party,
                                        dataset = .x, target_rows = .y))

elections_communes_2018 <- map2(.y = elections_communes_2018, .x = communes,
                                ~mutate(.y, locality = .x, division = "Commune")) %>%
    bind_rows()

The steps are so similar for the four circonscriptions and for the 102 communes that I could
have written a big wrapper function and then used it for the circonscriptions and communes at once.
But I was lazy.

Finally, I bind everything together and have a nice, tidy, flat file:

# Final results
elections_2018 <- bind_rows(list(elections_national_2018,
                                 elections_district_2018,
                                 elections_communes_2018))

glimpse(elections_2018)

## Observations: 15,544
## Variables: 6
## $ Party     "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN", "PIRATEN...
## $ Year      2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018...
## $ Variables "Pourcentage", "Suffrage total", "Suffrages de liste...
## $ Values    6.446204e-02, 2.275490e+05, 1.815600e+05, 4.598900e+...
## $ locality  "Grand-Duchy of Luxembourg", "Grand-Duchy of Luxembo...
## $ division  "National", "National", "National", "National", "Nat...
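Since the point of all this was to get a flat file, the last step would typically be a single write call. The file name here is just an example, not one used in the post:

```r
library(readr)

# persist the tidy election results as a flat CSV file
write_csv(elections_2018, "elections_2018.csv")
```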

This blog post is already quite long, so now that R can easily ingest the data, I will analyze it
in a future blog post.

If you found this blog post useful, you might want to follow me on twitter
for blog post updates.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

A Lazy Function

Sat, 10/20/2018 - 17:14

(This article was first published on CillianMacAodh, and kindly contributed to R-bloggers)


CillianMacAodh (Cillian McHugh)
20 October 2018

I have already written two posts about writing functions, and I will try to diversify my content. That said, I won't refrain from sharing something that has been helpful to me. The function(s) I describe in this post are an artefact left over from before I started using R Markdown. They are a product of their time, but may still be of use to people who haven't switched to R Markdown yet. They are a lazy (and quite imperfect) solution to a tedious task.

The Problem

At the time I wrote this function I was using R for my statistics and LibreOffice for writing. I would run a test in R and then write it up in LibreOffice. Each value that needed reporting had to be transferred from my R output to LibreOffice, and for each test there are a number of values that need reporting.

Writing up these tests is pretty formulaic. There is a set structure to the sentence; for example, the write-up of a t-test with a significant result nearly always looks something like this:

An independent samples t-test revealed a significant difference in X between the Y sample, (M = [ ], SD = [ ]), and the Z sample, (M = [ ], SD = [ ]), t([df]) = [ ], p = [ ].

And the write-up of a non-significant result looks something like this:

An independent samples t-test revealed no significant difference in X between the Y sample, (M = [ ], SD = [ ]), and the Z sample, (M = [ ], SD = [ ]), t([df]) = [ ], p = [ ].

Seven values (the square [ ] brackets) need to be reported for this single test.
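To see how mechanical this substitution is, the template above can be filled programmatically for a single test. This is a minimal sketch of the idea (not the post's function), using the built-in mtcars data; the group labels "automatic" and "manual" are illustrative choices of mine, not something stored in the data:

```r
# Minimal sketch: fill the significant-result template for one t-test.
# Uses the built-in mtcars data (mpg by transmission type, am).
t_res  <- t.test(mpg ~ am, data = mtcars)
grp_m  <- tapply(mtcars$mpg, mtcars$am, mean)  # per-group means
grp_sd <- tapply(mtcars$mpg, mtcars$am, sd)    # per-group SDs

sentence <- sprintf(
  paste0("An independent samples t-test revealed a significant difference ",
         "in mpg between the automatic sample, (M = %.2f, SD = %.2f), ",
         "and the manual sample, (M = %.2f, SD = %.2f), ",
         "t(%.2f) = %.3f, p = %.3f."),
  grp_m[1], grp_sd[1], grp_m[2], grp_sd[2],
  t_res$parameter, t_res$statistic, t_res$p.value)

cat(sentence)
```

For mtcars the difference happens to be significant, so the "significant difference" wording fits here; the function described below automates this substitution and additionally picks the wording based on the p-value.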
Whether you copy and paste or type each value, the reporting of such tests can be very tedious and leave you prone to errors in reporting.

The Solution

In order to make reporting values easier (and more accurate) I wrote the t_paragraph() function (and the related t_paired_paragraph() function). This provided an output that I could copy and paste into a Word (LibreOffice) document. This function is part of the desnum [1] package (McHugh, 2017).

The t_paragraph() Function

The t_paragraph() function runs a t-test and generates an output that can be copied and pasted into a word document. The code for the function is as follows:

# Create the function t_paragraph with arguments x, y, and measure
# x is the dependent variable
# y is the independent (grouping) variable
# measure is the name of the dependent variable, given as a string
t_paragraph <- function(x, y, measure) {

  # Run a t-test and store it as an object t
  t <- t.test(x ~ y)

  # If your grouping variable has labelled levels, the next line will
  # store them for reporting at a later stage
  labels <- levels(y)

  # Create an object for each value to be reported
  tsl <- as.vector(t$statistic)
  ts <- round(tsl, digits = 3)
  tpl <- as.vector(t$p.value)
  tp <- round(tpl, digits = 3)
  d_fl <- as.vector(t$parameter)
  d_f <- round(d_fl, digits = 2)
  ml <- as.vector(tapply(x, y, mean))
  m <- round(ml, digits = 2)
  sdl <- as.vector(tapply(x, y, sd))
  sd <- round(sdl, digits = 2)

  # Use print(paste0()) to combine the objects above and create two potential outputs
  # The output that is generated will depend on the result of the test

  # wording if a significant difference is observed
  if (tp < 0.05)
    print(paste0("An independent samples t-test revealed a significant difference in ",
                 measure, " between the ", labels[1], " sample, (M = ",
                 m[1], ", SD = ", sd[1], "), and the ", labels[2],
                 " sample, (M = ", m[2], ", SD = ", sd[2], "), t(",
                 d_f, ") = ", ts, ", p = ", tp, "."), quote = FALSE,
          digits = 2)

  # wording if no significant difference is observed
  if (tp > 0.05)
    print(paste0("An independent samples t-test revealed no difference in ",
                 measure, " between the ", labels[1], " sample, (M = ",
                 m[1], ", SD = ", sd[1], "), and the ", labels[2],
                 " sample, (M = ", m[2], ", SD = ", sd[2], "), t(",
                 d_f, ") = ", ts, ", p = ", tp, "."), quote = FALSE,
          digits = 2)
}

When using t_paragraph(), x is your DV, y is your grouping variable, while measure is a string giving the name of the dependent variable. To illustrate the function I'll use the mtcars dataset.

Applications of the t_paragraph() Function

The mtcars dataset comes with R. For information on it simply type help(mtcars). The variables of interest here are am (transmission; 0 = automatic, 1 = manual), mpg (miles per gallon), and qsec (1/4 mile time). The two questions I'm going to look at are:

1. Is there a difference in miles per gallon depending on transmission?
2. Is there a difference in 1/4 mile time depending on transmission?

Before running the test it is a good idea to look at the data [2]. Because we're going to look at differences between groups, we want to run descriptives for each group separately. To do this I'm going to combine the descriptives() function which I previously covered here (also part of the desnum package) and the tapply() function.

The tapply() function allows you to run a function on subsets of a dataset using a grouping variable (or index). The arguments are as follows: tapply(vector, index, function), where vector is the variable you want to pass through function and index is the grouping variable. The examples below will make this clearer.
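As a quick standalone warm-up (base R only; mtcars ships with R), tapply() with mean returns the group means directly, one per level of the index:

```r
# Mean miles per gallon for each transmission type (0 = automatic, 1 = manual)
tapply(mtcars$mpg, mtcars$am, mean)
#        0        1
# 17.14737 24.39231
```

Any function that accepts a vector can go in the third slot, which is why descriptives() works there below.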

We want to run descriptives on mtcars$mpg and on mtcars$qsec, and for each we want to group by transmission (mtcars$am). This can be done using tapply() and descriptives() together as follows:

tapply(mtcars$mpg, mtcars$am, descriptives)

## $`0`
##       mean       sd  min  max len
## 1 17.14737 3.833966 10.4 24.4  19
##
## $`1`
##       mean       sd min  max len
## 1 24.39231 6.166504  15 33.9  13

Recall that 0 = automatic, and 1 = manual. Replace mpg with qsec and run again:

tapply(mtcars$qsec, mtcars$am, descriptives)

## $`0`
##       mean       sd   min  max len
## 1 18.18316 1.751308 15.41 22.9  19
##
## $`1`
##    mean       sd  min  max len
## 1 17.36 1.792359 14.5 19.9  13

Running t_paragraph()

Now that we know the values for automatic vs manual cars we can run our t-tests using t_paragraph(). Our first question: Is there a difference in miles per gallon depending on transmission?

t_paragraph(mtcars$mpg, mtcars$am, "miles per gallon")

## [1] An independent samples t-test revealed a significant difference in miles per gallon between the sample, (M = 17.15, SD = 3.83), and the sample, (M = 24.39, SD = 6.17), t(18.33) = -3.767, p = 0.001.

There is a difference, and the output above can be copied and pasted into a word document with minimal changes required. Our second question was: Is there a difference in 1/4 mile time depending on transmission?

t_paragraph(mtcars$qsec, mtcars$am, "quarter-mile time")

## [1] An independent samples t-test revealed no difference in quarter-mile time between the sample, (M = 18.18, SD = 1.75), and the sample, (M = 17.36, SD = 1.79), t(25.53) = 1.288, p = 0.209.

This time there was no significant difference, and again the output can be copied and pasted into word with minimal changes.

Limitations

The function described was written a long time ago and could be updated; however, I no longer copy and paste into Word (having switched to R Markdown instead). The reporting of the p value is not always to APA standards: if p is < .001, that is what should be reported. The code for t_paragraph() could be updated to include the p_report function (described here), which would address this. Another limitation is that the formatting of the text isn't perfect: the letters (N, M, SD, t, p) should all be italicised. But having to manually fix this formatting is still easier than manually transferring individual values.

Conclusion

Despite the limitations, the functions t_paragraph() and t_paired_paragraph() [3] have made my life easier. I still use them occasionally. I hope they can be of use to anyone who is using R but has not switched to R Markdown yet.
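On the APA limitation mentioned above, a small helper along these lines could be slotted into t_paragraph() (a hypothetical sketch, not the actual p_report() function from desnum):

```r
# Hypothetical APA-style p-value formatter (illustration only, not desnum's p_report):
# report very small p values as "p < .001", otherwise drop the leading zero
format_p <- function(p) {
  if (p < 0.001) {
    "p < .001"
  } else {
    # sub() strips the leading zero, per APA style (e.g. "p = .209")
    paste0("p = ", sub("^0", "", format(round(p, 3), nsmall = 3)))
  }
}

format_p(0.209)    # "p = .209"
format_p(0.00004)  # "p < .001"
```

Replacing the bare ", p = ", tp piece of the paste0() call with format_p(tpl) would cover both cases.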
References

McHugh, C. (2017). desnum: Creates some useful functions.

[1] To install desnum just run devtools::install_github("cillianmiltown/R_desnum").

[2] In this case this is particularly useful because there are no value labels for mtcars$am, so it won't be clear from the output which values refer to the automatic group and which refer to the manual group. Running descriptives will help with this.

[3] If you want to see the code for t_paired_paragraph(), just load desnum and run t_paired_paragraph (without parentheses).


To leave a comment for the author, please follow the link and comment on their blog: CillianMacAodh. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Statistics Challenge Invites Students to Tackle Opioid Crisis Using Real-World Data

Fri, 10/19/2018 - 20:26

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

In 2016, 2.1 million Americans were found to have an opioid use disorder (according to SAMHSA), with drug overdose now the leading cause of injury and death in the United States. But some of the country’s top minds are working to fight this epidemic, and statisticians are helping to lead the charge.

In This is Statistics’ second annual fall data challenge, high school and undergraduate students will use statistics to analyze data and develop recommendations to help address this important public health crisis.

The contest invites teams of two to five students to put their statistical and data visualization skills to work using the Centers for Disease Control and Prevention (CDC)'s Multiple Cause of Death (Detailed Mortality) dataset, and contribute to creating healthier communities. Given the size and complexity of the CDC dataset, programming languages such as R can be used to manipulate the data and conduct analysis effectively.

Each submission will consist of a short essay and presentation of recommendations. Winners will be awarded for best overall analysis, best visualization and best use of external data. Submissions are due November 12, 2018.

If you or a student you know is interested in participating, get full contest details here.

Teachers, get resources about how to engage your students in the contest here.


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

Solving the Chinese postman problem

Fri, 10/19/2018 - 18:37

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Some pre-Halloween post today. It actually started while I was in Barcelona: the kids wanted to go back to a store we had seen on the first day, in the Gothic quarter, and I could not remember where it was. I said to myself that it would take quite a long time to walk down every street of the neighborhood. And I discovered that this is actually an old problem. In 1962, Meigu Guan was interested in a postman delivering mail to a number of streets such that the total distance walked by the postman was as short as possible. How could the postman ensure that the distance walked was a minimum?

A very close notion is the concept of a traversable graph, which is one that can be drawn without taking the pen off the paper and without retracing the same edge. In such a case the graph is said to have an Eulerian trail (yes, from Euler's bridges problem). An Eulerian trail uses all the edges of a graph. For a graph to be Eulerian (to have a closed Eulerian trail), all the vertices must be of even order.

An algorithm for finding an optimal Chinese postman route is:

1. List all odd vertices.
2. List all possible pairings of odd vertices.
3. For each pairing find the edges that connect the vertices with the minimum weight.
4. Find the pairings such that the sum of the weights is minimised.
5. On the original graph add the edges that have been found in Step 4.
6. The length of an optimal Chinese postman route is the sum of all the edge weights of the original graph, added to the total found in Step 4.
7. A route corresponding to this minimum weight can then be easily found.
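Steps 1 and 2 can be sketched in a few lines of base R on a toy edge list (the vertex names here are made up for illustration; the igraph-based code below handles the real graph):

```r
# Toy undirected graph as an edge list: each row is one edge
edges <- rbind(c("A", "B"), c("A", "C"), c("B", "C"), c("B", "D"))

# Step 1: compute each vertex's degree with table(), keep the odd ones
deg <- table(c(edges[, 1], edges[, 2]))
odd <- names(deg)[deg %% 2 == 1]
odd
# "B" "D"

# Step 2: all possible pairings of the odd vertices
combn(odd, 2)
```

With only two odd vertices there is a single pairing; on larger graphs combn() enumerates all of them so that Step 4 can pick the cheapest matching.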

For the first steps, we can use the codes from Hurley & Oldford’s Eulerian tour algorithms for data visualization and the PairViz package. First, we have to load some R packages

require(igraph)
require(graph)
require(eulerian)
require(GA)

Then use the following function from stackoverflow,

make_eulerian = function(graph){
  info = c("broken" = FALSE, "Added" = 0, "Successfull" = TRUE)
  is.even = function(x){ x %% 2 == 0 }
  search.for.even.neighbor = !is.even(sum(!is.even(degree(graph))))
  for(i in V(graph)){
    set.j = NULL
    uneven.neighbors = !is.even(degree(graph, neighbors(graph, i)))
    if(!is.even(degree(graph, i))){
      if(sum(uneven.neighbors) == 0){
        if(sum(!is.even(degree(graph))) > 0){
          info["broken"] = TRUE
          uneven.candidates <- !is.even(degree(graph, V(graph)))
          if(sum(uneven.candidates) != 0){
            set.j <- V(graph)[uneven.candidates][[1]]
          }else{
            info["Successfull"] <- FALSE
          }
        }
      }else{
        set.j <- neighbors(graph, i)[uneven.neighbors][[1]]
      }
    }else if(search.for.even.neighbor == TRUE & is.null(set.j)){
      info["Added"] <- info["Added"] + 1
      set.j <- neighbors(graph, i)[ !uneven.neighbors ][[1]]
      if(!is.null(set.j)){ search.for.even.neighbor <- FALSE }
    }
    if(!is.null(set.j)){
      if(i != set.j){
        graph <- add_edges(graph, edges = c(i, set.j))
        info["Added"] <- info["Added"] + 1
      }
    }
  }
  (list("graph" = graph, "info" = info))
}

Then, consider some network, with 12 nodes

g1 = graph(c(1,2, 1,3, 2,4, 2,5, 1,5, 3,5, 4,7, 5,7, 5,8, 3,6,
             6,8, 6,9, 9,11, 8,11, 8,10, 8,12, 7,10, 10,12, 11,12),
           directed = FALSE)

To plot that network, use

V(g1)$name = LETTERS[1:12]
V(g1)$color = rgb(0, 0, 1, .4)
ly = layout.kamada.kawai(g1)
plot(g1, vertex.color = V(g1)$color, layout = ly)

Then we convert it to some traversable graph by adding 5 edges

eulerian = make_eulerian(g1)
eulerian$info
     broken       Added Successfull
          0           5           1
g = eulerian$graph

as shown below

ly = layout.kamada.kawai(g)
plot(g, vertex.color = V(g)$color, layout = ly)

We cut each of those 5 added edges in two parts, and therefore we add 5 artificial nodes

A = as.matrix(as_adj(g))
A1 = as.matrix(as_adj(g1))
newA = lower.tri(A, diag = FALSE) * A1 + upper.tri(A, diag = FALSE) * A
for(i in 1:sum(newA == 2)) newA = cbind(newA, 0)
for(i in 1:sum(newA == 2)) newA = rbind(newA, 0)
s = nrow(A)
for(i in 1:nrow(A)){
  Aj = which(newA[i,] == 2)
  if(!is.null(Aj)){
    for(j in Aj){
      newA[i, s+1] = newA[s+1, i] = 1
      newA[j, s+1] = newA[s+1, j] = 1
      newA[i, j] = 1
      s = s + 1
    }}}

We get the following graph, where all nodes have an even degree!

newg = graph_from_adjacency_matrix(newA)
newg = as.undirected(newg)
V(newg)$name = LETTERS[1:17]
V(newg)$color = c(rep(rgb(0,0,1,.4), 12), rep(rgb(1,0,0,.4), 5))
ly2 = ly
transl = cbind(c(0,0,0,.2,0), c(.2,-.2,-.2,0,-.2))
for(i in 13:17){
  j = which(newA[i,] > 0)
  lc = ly[j,]
  ly2 = rbind(ly2, apply(lc, 2, mean) + transl[i-12,])
}
plot(newg, layout = ly2)

Our network is now the following (new nodes are small because actually, they don’t really matter, it’s just for computational reasons)

plot(newg, vertex.color = V(newg)$color, layout = ly2,
     vertex.size = c(rep(20, 12), rep(0, 5)),
     vertex.label.cex = c(rep(1, 12), rep(.1, 5)))

Now we can get the optimal path

n <- LETTERS[1:nrow(newA)]
g_2 <- new("graphNEL", nodes = n)
for(i in 1:nrow(newA)){
  for(j in which(newA[i,] > 0)){
    g_2 <- addEdge(n[i], n[j], g_2, 1)
  }}
etour(g_2, weighted = FALSE)
 [1] "A" "B" "D" "G" "E" "A" "C" "E" "H" "F" "I" "K" "H" "J" "G" "P" "J" "L" "K" "Q" "L" "H" "O" "F" "C"
[26] "N" "E" "B" "M" "A"

or

edg = attr(E(newg), "vnames")
ET = etour(g_2, weighted = FALSE)
parcours = trajet = rep(NA, length(ET) - 1)
for(i in 1:length(parcours)){
  u = c(ET[i], ET[i+1])
  ou = order(u)
  parcours[i] = paste(u[ou[1]], u[ou[2]], sep = "|")
  trajet[i] = which(edg == parcours[i])
}
parcours
 [1] "A|B" "B|D" "D|G" "E|G" "A|E" "A|C" "C|E" "E|H" "F|H" "F|I" "I|K" "H|K" "H|J" "G|J" "G|P" "J|P"
[17] "J|L" "K|L" "K|Q" "L|Q" "H|L" "H|O" "F|O" "C|F" "C|N" "E|N" "B|E" "B|M" "A|M"
trajet
 [1]  1  3  8  9  4  2  6 10 11 12 16 15 14 13 26 27 18 19 28 29 17 25 24  7 22 23  5 21 20

Let us try now on a real network of streets. Like Missoula, Montana.

I will not try to get the shapefile of the city; I will just try to replicate the photograph above.

If you look carefully, you will see a problem: nodes 10 and 93 have an odd degree (3 here), so one strategy is to connect them (which explains the grey line).

But actually, to be more realistic, we start at 93 and end at 10. Here is the optimal (shortest) path which goes along every street.

Now we are ready for Halloween, to go through all the streets of the neighborhood!


To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

Gold-Mining Week 7 (2018)

Fri, 10/19/2018 - 17:50

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 7 Gold Mining and Fantasy Football Projection Roundup now available. Go check out our cheat sheet for this week.

The post Gold-Mining Week 7 (2018) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

New Course: Interactive Data Visualization with rbokeh

Fri, 10/19/2018 - 15:47

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is the course link.

Course Description

Data visualization is an integral part of the data analysis process. This course introduces you to rbokeh, a visualization library for interactive web-based plots. You will learn how to use rbokeh layers and options to create effective visualizations that carry your message and emphasize your ideas. We will focus on the two main pieces of data visualization: wrangling data into the appropriate format, and employing the appropriate visualization tools, charts, and options from rbokeh.

Chapter 1: rbokeh Introduction (Free)

In this chapter you get introduced to rbokeh layers. You will learn how to specify data and arguments to create the desired plot, and how to combine multiple layers in one figure.

Chapter 2: rbokeh Aesthetic Attributes and Figure Options

In this chapter you will learn how to customize your rbokeh figures using aesthetic attributes and figure options. You will see how aesthetic attributes such as color, transparency, and shape can serve a purpose and add more information to your visualizations. In addition, you will learn how to activate the tooltip and specify the hover info in your figures.

Chapter 3: Data Manipulation for Visualization and More rbokeh Layers

In this chapter you will learn how to put your data in the right format to fit the desired figure, and how to transform between the wide and long formats. You will also see how to combine normal layers with regression lines. In addition, you will learn how to customize the interaction tools that appear with each figure.

Chapter 4: Grid Plots and Maps

In this chapter you will learn how to combine multiple plots in one layout using grid plots. In addition, you will learn how to create interactive maps.

Prerequisites

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.