R news and tutorials contributed by hundreds of R bloggers
Updated: 11 hours 42 min ago

### rOpenSci’s drake package

Tue, 07/10/2018 - 21:28

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

If you don’t know rOpenSci, then I recommend checking them out. They write a lot of really good packages for R. A relatively new seems to be drake. I’ve not played with it yet, but it looks to be very useful at giving indications about which parts of an analysis are subject to changes, and only rerunning those parts to speed up redoing an analysis (envisage the overlays for some version control systems or dropbox that show the status of files, although it’s more complicated than that). Knitr has caching, which goes some way to handling this, but here you can see where outdated parts are in the scope of the entire analysis…

It looks like a super tool!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Call for submissions open!

Tue, 07/10/2018 - 18:05

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

Experienced and wannabe speakers; R professionals and enthusiasts; people with good ideas and will to share them: this is who we are looking for. For what? For our Call for Submissions!

Here is the form —> https://goo.gl/forms/EURRqMWuvuzccuYu2

You have time until July 31st, so be sure to submit your ideas in time!

What’s MilanoR?

MilanoR is a user group dedicated to bring together local R users. Our aim is to exchange knowledge, learn and share tricks and techniques and provide R beginners with an opportunity to meet more experienced users. We often organize events, such as meetings, workshops and R-labs (aka R learnathons, we’ll look more into this!), both with local and international speakers. If you want an overview of what we did recently, take a look to milanor.net or read this article about our last year of events.

Who can contribute?

The answer is simple: everyone! Well, everyone with a knowledge of R and something to share.

If you come from abroad and you would need a financial support, write it in the Notes space, at the bottom of the form: we can’t promise anything, but we’ll do our best to try to find funding.

What kind of submissions are we looking for?
• Talks

A talk is usually around 20-45 minutes, and involves the use of a presentation. You can do a tutorial, explain to us your new discovery, idea, widget or whatever you have in mind! You have to specify if it’s going to be in English or in Italian – either is fine. Here‘s an example of what a talk presentation looks like.

• Lightning talks

Lightning talks are just like talks, but faster: only 5 minutes, and 15 slides of 20-30 seconds each. Ideal for someone who doesn’t have much time to prepare a whole 20-mins talk, or find themselves at ease with short timing. Here‘s a presentation to give you an idea of what we’re talking about.

• R-labs

R-labs are our proud venture that’s going absolutely well. During R-labs, a group of R lovers work together to solve a problem. Some examples: the Comune of Milano asked for our help to make the municipality budget accessible to citizens. We built a shiny app for that! (here it is!) We also programmed a representation of earthquakes evolution for EarthCloud and along with INGM (Istituto Nazionale Genetica Molecolare), a dashboard to recognize risks of various illnesses using personal data. We see a problem and we find a solution together. If you have an issue with R and you want to work on it together, then suggest an R-lab! The R-Lab can last for 3 hours in the evening or for an entire day on a Saturday.

• Blog articles

Usually tutorial about the most various R subjects. They’ll also be published on R-bloggers platform, so be ready for the spotlight! Here‘s an example.

• Workshop

Usually up to 2 hours. Tutorial for 20-50 (or more) people on a beginner or advanced R topic. This is our last workshop, which had a huge success!

Submit your abstracts now, and we’ll get in touch with you very soon!

The post Call for submissions open! appeared first on MilanoR.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: MilanoR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R community update: announcing useR Delhi July meetup

Tue, 07/10/2018 - 13:00

(This article was first published on R – Tech and Mortals, and kindly contributed to R-bloggers)

In my last post, I’d pointed out the importance of a local community and described how Delhi NCR useR group came about. It’s been a month since we’d started the group and I’m excited to announce that our first meetup is scheduled to take place on 14th July 2018, i.e. upcoming Saturday. Venue for the event is SocialCops headquarters in Saket, New Delhi. As a budding community, support from established organizations is a big boost, and we’re grateful to SocialCops for coming onboard as mentors and venue partners.
Event details can be found on meetup and you can RSVP .

Delhi NCR July useR meetup

In addition to providing an opportunity to interact with fellow community members, the meetup will also feature sessions by members from the R community. For this meetup, we’ve sessions from two exciting speakers lined up.
Our first session would be an AMA (Ask Me Anything) with Prof. Deepayan Sarkar of Indian Statistical Institute, Delhi. He’s one of 20 R-core members and author of the lattice package.

AMA with Prof Deepayan Sarkar

Our second session would be on “Satellite image classification in R” by Shilpa Arora of SocialCops. She’s an experienced data scientist who has worked on problems from various domains in the industry, including GIS and image classification.

Satellite Image Classification by Shilpa Arora

If you’re in the region, attend the event by registering on the meetup page or here.

We’re a community in its infancy and could use all the help we can get. If you know of someone who may be interested in speaking at the upcoming event or in the future, please let us know via mail, Twitter, Facebook or Meetup.
Also, please share and retweet our event and/or this blog. We would be grateful.

Thanks for being an amazing community and have a great day, #rstats!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Tech and Mortals. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### How to read Stata DTA files into R

Tue, 07/10/2018 - 04:00

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

The file contains 2017 face-to-face post-election survey responses along with explanatory notes. Read the Stata DTA file into R with two these two lines:

The data set is now stored as a dataframe df with 357 variables. To check the properties of the data set we type

str(df)

This gives the following output:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       2194 obs. of  357 variables: $finalserialno : atomic 10115 10119 10125 10215 10216 ... ..- attr(*, "label")= chr "Final Serial Number" ..- attr(*, "format.stata")= chr "%12.0g"$ serial                : atomic  000000399 000000398 000000400 000000347 ... ..- attr(*, "label")= chr "Respondent Serial Number" ..- attr(*, "format.stata")= chr "%9s" $a01 : atomic nhs brexit society immigration ... ..- attr(*, "label")= chr "A1: Most important issue" ..- attr(*, "format.stata")= chr "%240s"$ a02                   :Class 'labelled'  atomic [1:2194] 1 0 -1 -1 1 -1 2 -1 2 2 ... .. ..- attr(*, "label")= chr "Best party on most important issue" .. ..- attr(*, "format.stata")= chr "%8.0g" .. ..- attr(*, "labels")= Named num [1:13] NA NA NA 0 1 2 3 4 5 6 ... .. .. ..- attr(*, "names")= chr [1:13] "Not stated" "Refused" "Dont know" "None/No party" ...

The above output shows that the variables are already set to the correct types. The first variable finalserialno is numeric (i.e., atomic), the third variable a01 is character, and the fourth variable a02 has a class of ‘labelled’ which can be converted to a factor or categorical variable (after we handle missing values).
Each variable has an associated label attribute to help with interpretation. For example, without having to look up the explanatory notes, we can see that variable a01 contains the responses to the question “A1: most important issue” and variable a02 contains the responses to “Best party on most important issue”.

Missing values

Stata supports multiple types of missing values.  Read_dta automatically handles missing values in numeric and character variables. For categorical variables, missing values are typically encoded by negative numbers. Section 5.3 of the explanatory notes describes the encoding for this file: -1 (Don’t know), -2 (Refused) and -999 (Not stated). We now convert all three of these values to NA.

for (i in 1:length(df)) { if (class(df[[i]] == "labelled") df[[i]][df[[i]] &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt; 0] &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;- NA } Encoding categorical variables

The categorical variables of class “labelled” are stored as numeric vectors. Convert them into factors so they are correctly associated with the labels with only a single command:

df &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;- as_factor(df)

Note that we do this after converting the missing values to avoid spurious factor levels in the final dataset.

Find out more

You can find out about how to import and read Excel files into Displayr as well.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Displayr. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Optimize for Expected Profit in Lift Models

Tue, 07/10/2018 - 02:00

(This article was first published on Sweiss' Blog, and kindly contributed to R-bloggers)

hete is the wild west of analytics. no wrong answers yet – Bill Lattner

Intro

Generally lift models use a Qini coefficient to measure the performance of a model. However this metric is generally an indirect measure of what the user wants to achieve: profit. In this post I’ll discuss a measure of profitability in lift models. A neural network can optimize this cost function directly. A comparison between this method compared with a causal tree implementation shows promising results.

Current lift models

The goal of lift modeling (or Heterogeneous Treatment Effects ) is to assign an optimal action for a particular observation. This can be, among other things, targeted advertisement or price discrimination. There are several different formulations of lift models which include flip, causal trees, etc.

Essentially all try to estimate the interaction between covariates and the treatment. The most simple parameterized model is: $$y$$: response

$$x$$: covariates

$$t$$: randomly assigned treatment variable. Assumed to be binary here.

$$y = \beta_{1}x + \beta_{2}t+ \beta_{3}t*x$$

Here the treatment effect for a particular observation is $$\beta_{2+} \beta_{3}x$$. The key point here is that the treatment effect is now a function of $$x$$ which means we can see which observations are more or less effected than others.

Alternatively one can estimate a machine learning model $$y = f(t,x)$$ and the treatment effect can be calculated $$\hat{f}(t=1,x) – \hat{f}(t=0,x)$$.

However there are a few issues estimating this accurately which has spun several methods to accurately estimate it.

1. The heterogeneous treatment effect is probably very small. Indeed one may find that the treatment effect relative to main effect is so small it does not get split on in normal random forests / GBTs! This means we need a model to focus particularly on the treatment effect if we want the results to be useful.

2. We can have potentially hundreds of covariates available to include interactions with. We have no idea a priori which covariates are important so usually we have to expand parameter space significantly. Also, we don’t necessarily have the functional form available so we may take into account nonlinear transformations to the data.

3. We can’t compare the fitted treatment effect values to the ‘truth’. This is the biggest difference between lift models and normal prediction. In prediction settings we can set aside a test set and evaluate our predictions on known responses. Here we want to know the counterfactual; what would have happened if we gave an observation a different value, which lead us to the fundamental problem of causality.

So to sum up; we’re trying to estimate a very noisy signal, distributed among a large number of covariates, and we can’t check our estimated model directly.

Metrics to use

Ok 3) might be a bit of a stretch because there are metrics people can use to assess the performance of their model. One is qini coefficient.

This is a measure that observations with similar treatment effects and groups them into the treatment they were given. If these groups are very different then that is a good measure the model is.

One particular drawback among qini coefficient is that it focuses only on the rank ordering of predicted values. This can be an adequate measure when you care about ranking but there are limitations. The most important being that it does not assign an obvious cutoff between those affected and those not. In the next section I will describe the problem in terms of cost / benefit and suggest a different metric.

A new kind of metric

Suppose we have $$1…K$$ options in the treatment and the goal is to assign the most profitable treatment for each observation. We have a model to assign an optimal treatment for observation $$i$$: $$optim_{i,k}$$. This takes the value 1 if $$k$$ is best for observation $$i$$ else it’s 0.The expected profit is:

$$Expected Profit = \sum_{n=1}^{N}{I(t = optim_{i,k})* y_{i}(x,t=optim_{i,k})}$$

The problem is that we cannot assign and observe a treatment to our training data. Instead we can only look at the values we have. Instead I will ‘simulate’ an experiment the following way:

$$AverageProfit = {\frac{\sum_{i=1}^{N}\sum_{k=1}^{K}{I(optim_{i,k} = t_{i,k})*y_{i}}}{\sum_{i=1}^{N}\sum_{k=1}^{K}{I(optim_{i} = t_{i,k})}}}$$

Basically this is saying if the optimal results equal the randomly assigned treatment then we include those in our hypothetical model. Since I am assuming the treatment is randomly assigned I think this would give a good indicator on the average increase for each observations under this new models decisions. In order to test this out I compared the expected gain with the actual gain of a simulated dataset. Below is the scatterplot and it appears to be a pretty good metric for the actual results.

Model this Metric Directly?

If we have a measure that can accurately measure what we want, can we use it as a loss function? This should lead to a better model since the model will optimize over the metric of interest.

We cannot optimize this directly b/c the indicator function is non-differentiable. Instead replacing I() function with a probability and use the custom function below to maximize. – Thanks to Matt Becker for this insight.

$$CustomLoss = – \sum_{k=1}^{K}{\hat{p_{i,k}(x)}*I(t_i=k)*y_{i}}$$

This is essentialy answering the question: what is the probability that action $$k$$ is best for observation $$i$$?

Using Keras I hacked together a prototype.

Experiment

To test this new model (I’m calling it hete-optim) I simulated a similar dataset to that found in this paper. There are 200 variables with 15,000 training set and 3,000 test set. Of these there are 8 different scenarios using several nonlinear functions from the explanatory variables. One major difference is that I binarized the response. In addition I increased the random noise and decreased the relative treatment effect to get a more ‘realistic’

I’m comparing this model which my former colleague, Bill Lattner, Developed hete and to grf

Results

The Metric I’m comparing is profits with true model / profits with fitted model. So if a model get’s a score of .75 then that means that it captures 75% of potential gain using a lift model. This is sort of a normalized regret score so we aggregate results among the 8 scenarios. Below is a boxplot of scores by model type.

## Using as id variables

This method performs best on 7 out of 8 datasets. On average hete_opim gets ~99.8% of the gains while hete gets ~98.9% and grf is 99.5%. This suggests that this method might be on to something.

Conclusion

This post described a method to simulate profits in a real world lift model and the optimize it direclty. It shows promising results to existing techniques.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Sweiss' Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### How the odds ratio confounds: a brief study in a few colorful figures

Tue, 07/10/2018 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

The odds ratio always confounds: while it may be constant across different groups or clusters, the risk ratios or risk differences across those groups may vary quite substantially. This makes it really hard to interpret an effect. And then there is inconsistency between marginal and conditional odds ratios, a topic I seem to be visiting frequently, most recently last month.

My aim here is to generate a few figures that might highlight some of these issues.

Assume that there is some exposure (indicated by the use of a $$1$$ or $$0$$ subscript) applied across a number of different groups or clusters of people (think different regions, hospitals, schools, etc.) – indicated by some number or letter $$i$$. Furthermore, assume that the total number exposed at each location is the same as the number unexposed: $$N_{i0} = N_{i1} = N = 100$$.

The number of folks with exposure at a particular location $$i$$ who have a poor outcome is $$n_{i1}$$ and the number with a good outcome is $$N-n_{i1}$$. Likewise, the corresponding measures for folks not exposed are $$n_{i0}$$ and $$N-n_{i0}$$. The probabilities of a poor outcome for exposed and non-exposed are $$n_{i1}/N$$ and $$n_{i0}/N$$. The relative risk of a poor outcome for those exposed compared to those non exposed is

$\text{RR}_i = \frac{n_{i1}/N}{n_{i0}/N} = \frac{n_{i1}}{n_{i0}},$
the risk difference between exposed and unexposed groups is

$\text{RD}_i = \frac{n_{i1}}{N}-\frac{n_{i0}}{N} = \frac{n_{i1} – n_{i0}}{N},$
and the odds ratio is

$\text{OR}_i = \frac{[n_{i1}/N] / [(N – n_{i1})/N]}{[n_{i0}/N] / [(N – n_{i0})/N]}$
$= \frac{n_{i1}(N-n_{i0})}{n_{i0}(N-n_{i1})}.$

The simple conditional logistic regression model that includes a group-level random effect $$b_i$$ assumes a constant odds ratio between exposed and unexposed individuals across the different clusters:

$\text{logit} (Y_{ij}) = \beta_0 + \beta_1 E_{ij} + b_i,$
where $$E_{ij}$$ is an exposure indicator for person $$j$$ in group $$i$$. The parameter $$\text{exp}(\beta_1)$$ is an estimate of the odds ratio defined above.

The point of all of this is to illustrate that although the odds-ratio is the same across all groups/clusters (i.e., there is no $$i$$ subscript in $$\beta_1$$ and $$\text{OR}_i = \text{OR}$$), the risk ratios and risk differences can vary greatly across groups, particularly if the $$b$$’s vary considerably.

Constant odds ratio, different risk ratios and differences

If the odds ratio is constant and we know $$n_{i1}$$, we can perform a little algebraic maneuvering on the $$\text{OR}$$ formula above to find $$n_{i0}$$:

$n_{i0} = \frac{N \times n_{i1}}{\text{OR} \times (N – n_{i1}) + n_{i1}}$

If we assume that the $$n_{i1}$$’s can range from 2 to 98 (out of 100), we can see how the risk ratios and risk differences vary considerably even though we fix the odds ratio fixed at a value of 3 (don’t pay too close attention to the fact the $$n_0$$ is not an integer – this is just an illustration that makes a few violations – if I had used $$N=1000$$, we could have called this rounding error):

N <- 100 trueOddsRatio <- 3 n1 <- seq(2:98) n0 <- (N * n1)/(trueOddsRatio * (N - n1) + n1) oddsRatio <- ((n1 / (N - n1) ) / (n0 / (N - n0) )) riskRatio <- n1 / n0 riskDiff <- (n1 - n0) / N dn <- data.table(n1 = as.double(n1), n0, oddsRatio, riskRatio, riskDiff = round(riskDiff,3)) dn[1:6] ## n1 n0 oddsRatio riskRatio riskDiff ## 1: 1 0.3355705 3 2.98 0.007 ## 2: 2 0.6756757 3 2.96 0.013 ## 3: 3 1.0204082 3 2.94 0.020 ## 4: 4 1.3698630 3 2.92 0.026 ## 5: 5 1.7241379 3 2.90 0.033 ## 6: 6 2.0833333 3 2.88 0.039

With a constant odds ratio of 3, the risk ratios range from 1 to 3, and the risk differences range from almost 0 to just below 0.3. The odds ratio is not exactly informative with respect to these other two measures. The plots – two takes on the same data – tell a better story:

Another look at contrasting marginal vs conditional odds ratios

Using this same simple framework, I thought I’d see if I can illustrate the relationship between marginal and conditional odds ratios.

In this case, we have two groups/clusters where the conditional odds ratios are equivalent, yet when we combine the groups into a single entity, the combined (marginal) odds ratio is less than each of the conditional odds ratios.

In this scenario each cluster has 100 people who are exposed and 100 who are not, as before. $$a_1$$ and $$a_0$$ represent the number of folks with a poor outcome for the exposed and unexposed in the first cluster, respectively; $$b_1$$ and $$b_0$$ represent the analogous quantities in the second cluster. As before $$a_0$$ and $$b_0$$ are derived as a function of $$a_1$$ and $$b_1$$, respectively, and the constant odds ratio.

constantOR <- function(n1, N, OR) { return(N*n1 / (OR*(N-n1) + n1)) } # Cluster A a1 <- 55 a0 <- constantOR(a1, N = 100, OR = 3) (a1*(100 - a0)) / (a0 * (100 - a1)) ## [1] 3 # Cluster B b1 <- 35 b0 <- constantOR(b1, N = 100, OR = 3) (b1*(100 - b0)) / (b0 * (100 - b1)) ## [1] 3 # Marginal OR tot0 <- a0 + b0 tot1 <- a1 + b1 (tot1*(200 - tot0)) / (tot0 * (200 - tot1)) ## [1] 2.886952

For this example, the marginal odds ratio is less than the conditional odds ratio. How does this contrast between the marginal and conditional odds ratio play out with a range of possible outcomes – all meeting the requirement of a constant conditional odds ratio? (Note we are talking about odds ratio larger than 1; everything is flipped if the OR is < 1.) The plot below shows possible combinations of sums $$a_1 + b_1$$ and $$a_0 + b_0$$, where the constant conditional odds ratio condition holds within each group. The red line shows all points where the marginal odds ratio equals the conditional odds ratio (which happens to be 3 in this case):

Here is the same plot, but a yellow line is drawn in all cases where $$a_1 = b_1$$ (hence $$a_0 = b_0$$). This line is the directly over the earlier line where the marginal odds ratios equal 3. So, sort of proof by plotting. The marginal odds ratio appears to equal the conditional odds ratio when the proportions of each group are equal.

But are the marginal odds ratios not on the colored lines higher or lower than 3? To check this, look at the next figure. In this plot, the odds ratio is plotted as a function of $$a_1 + b_1$$, which represents the total number of poor outcomes in the combined exposed groups. Each line represents the marginal odds ratio for a specific value of $$a_1$$.

If you notice, the odds ratio reaches the constant conditional odds ratio (which is 3) only when $$a_1 + b_1 = 2a_1$$, or when $$a_1 = b_1$$. It appears then, when $$a_1 \ne b_1$$, the marginal odds ratio lies below the conditional odds ratio. Another “proof” by figure. OK, not a proof, but colorful nonetheless.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A Simple Spatial Equilibrium Model with Tidycensus

Tue, 07/10/2018 - 02:00

(This article was first published on Ignacio Sarmiento Barbieri, and kindly contributed to R-bloggers)

The Spatial Equilibrium concept is well known to urban economists. In a nutshell, it states that in equilibrium there are no rents to be gained by changing locations. Ed Glaeser begins Chapter 2 of his book: “Cities, Agglomeration, and Spatial Equilibrium” with the well known Alonso-Muth-Mills model. In this post, I want to summarize it briefly following Ed Glaeser presentation and reproduce his 2.1. and 2.2. figures. This is the perfect excuse to play around with the “tidycensus” package.

Spatial equilibrium in the Alons-Muth-Mills model

In its simplest form, the model assumes a monocentric closed city. Furthermore, we assume that there are $N$ city dwellers that maximize their utility which depends on C consumption and L units of land

\underset{C}{\mathrm{max}\,}U(C,L)

City dwellers use L units of land, consume C, and commute to the city center. Consumption is equal to their income, minus transport costs and rental costs per unit of land. City dwellers income is an exogenous wage w. Commuting costs are a function of distance and for simplicity we assume it linear t(d)=td. Rental costs per unit of land are r(d)L, where r(d) is the rental cost per unit of land. Thus,

C = w − td − r(d)L

To get to the spatial equilibrium we first rewrite utility function

\underset{d}{\mathrm{max}}\,U( w -td -r(d) L,L)

Then, the first order conditions imply

\frac{\partial U}{\partial d}(-t-\frac{\partial r}{\partial d} L )=0

r′= − \frac{t}{L}

The equilibrium condition states that rents must decline exactly to offset the increase in transportation cost. The simple assumption that commuting costs are linear implies that the rent gradient is linear:

r(d)=r(0) -\frac{td}{L}

The key prediction of the model is that land rents will decline with distance to the city center.

R code

Let’s take the model to the data and reproduce figures 2.1. and 2.2 of “Cities, Agglomeration, and Spatial Equilibrium”. The focus are two cities, Chicago and Boston. These cities are chosen because both differ in how easy is to access to their city centers. Chicago is fairly easy, Boston is more complicated. Our model then implies that gradients then should reflect the differential costs to access the city centers.

So let’s begin, the first step is to get some data. To do so I’m are going to use the “tidycensus” package. This package will allow me to get data from the census website using their API. We are also going to need the help of three other packages: “sf” to handle spatial data, “dplyr” my go-to package to wrangle data, and “ggplot2” to plot my results.

require("tidycensus", quietly=TRUE) require("sf", quietly=TRUE) require("dplyr", quietly=TRUE) require("ggplot2", quietly=TRUE)

In order to get access to the Census API, I need to supply a key, which can be obtained from http://api.census.gov/data/key_signup.html.

census_api_key("YOUR CENSUS API HERE") ## To install your API key for use in future sessions, run this function with install = TRUE.

Let’s get Median Housing Values from the latest ACS. I’ll use the variable “B25077_001E” which is the estimated Median Value of Owner-Occupied Housing Units. The package “tidycensus” provides the get_acs function that will let me retrieve this variable for the county at the block group level. It is important to note the option geometry=T. This option downloads also geometry features of the census block that will allow later on to compute distances.

chicago<-get_acs(geography = "block group", variables = "B25077_001E", state = "IL",county = "Cook County",year=2016,geometry = T) #retrieve median housing values for Cook County boston<-get_acs(geography = "block group", variables = "B25077_001E", state = "MA",county = "Suffolk County",year=2016,geometry = T) #retrieve median housing values for Suffolk County

Next, we need the city centers. I create two “sf” objects that have the coordinates to the city center.

chicago_cbd<-st_as_sf(x = read.table(text="-87.627800 41.881998"), coords = c(1,2), crs = "+proj=longlat +datum=WGS84") boston_cbd<-st_as_sf(x = read.table(text="-71.057083 42.361145"), coords = c(1,2), crs = "+proj=longlat +datum=WGS84")

Now, I need to put everything in the same projection. I re-project my cbds to the same projection as my tibbles.

chicago_cbd <-chicago_cbd %>% st_transform(st_crs(chicago) ) boston_cbd <-boston_cbd %>% st_transform(st_crs(boston) )

The next step is to create distances. For that, I use the st_distance function. Given the projection of my tibbles is in meters, I transform all my distances to miles, one meter is 0.000621371 miles.

chicago$dist_CBD<-st_distance(chicago,chicago_cbd) #computes distance to CBD (in meters) chicago$dist_CBD<-as.numeric(chicago$dist_CBD)*0.000621371 #change units to miles boston$dist_CBD<-st_distance(boston,boston_cbd) #computes distance to CBD (in meters) boston$dist_CBD<-as.numeric(boston$dist_CBD)*0.000621371 #change units to miles

The last part is some cleaning up. I add a column with the name of the city, keep block groups in Cook County that are within 10 miles of the city center, and keep the data.frames from my tibbles,. Finally, I bind the data.frames together.

boston$City<-"Boston" chicago$City<-"Chicago" chicago<-chicago %>% filter(dist_CBD<=10) chicago<-data.frame(chicago) boston<-data.frame(boston) dta<-rbind(chicago,boston)

We are now ready to plot. I use a regular scatter plot with hollow circles and add a linear regression line on a black and white background (theme_bw())

ggplot(dta, aes(x=dist_CBD, y=estimate, color=City)) + geom_point(shape=1) + # Use hollow circles geom_smooth(method=lm) + # Add linear regression line xlab("Distance to CBD (miles)") + ylab("Median Housing Prices ($)") + theme_bw() Conclusion: the data reflect the predictions of the simple Alons-Muth-Mills model, land rents will decline with distance to the city center. The speed at which they decline depends on transportation costs. Session Info sessionInfo() R version 3.5.0 (2018-04-23) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.5 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] bindrcpp_0.2.2 ggplot2_2.2.1 dplyr_0.7.4 sf_0.6-3 [5] tidycensus_0.4.6 loaded via a namespace (and not attached): [1] Rcpp_0.12.16 plyr_1.8.4 pillar_1.2.2 compiler_3.5.0 [5] bindr_0.1.1 class_7.3-14 tools_3.5.0 uuid_0.1-2 [9] digest_0.6.15 gtable_0.2.0 jsonlite_1.5 evaluate_0.10.1 [13] tibble_1.4.2 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.0 [17] DBI_0.8 curl_3.2 rgdal_1.2-18 yaml_2.1.18 [21] spData_0.2.8.3 e1071_1.6-8 httr_1.3.1 stringr_1.3.0 [25] knitr_1.20 xml2_1.2.0 hms_0.4.2 rappdirs_0.3.1 [29] tigris_0.7 tidyselect_0.2.4 classInt_0.2-3 rprojroot_1.3-2 [33] grid_3.5.0 glue_1.2.0 R6_2.2.2 foreign_0.8-70 [37] rmarkdown_1.9 sp_1.2-7 readr_1.1.1 tidyr_0.8.0 [41] purrr_0.2.5 udunits2_0.13 magrittr_1.5 scales_0.5.0 [45] backports_1.1.2 htmltools_0.3.6 units_0.5-1 maptools_0.9-2 [49] assertthat_0.2.0 rvest_0.3.2 colorspace_1.3-2 labeling_0.3 [53] stringi_1.1.7 lazyeval_0.2.1 munsell_0.4.3 lwgeom_0.1-4 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Ignacio Sarmiento Barbieri. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### R 3.5.1 update now available Tue, 07/10/2018 - 01:35 (This article was first published on Revolutions, and kindly contributed to R-bloggers) Last week the R Core Team released the latest update to the R statistical data analysis environment, R version 3.5.1. This update (codenamed "Feather Spray" — a Peanuts reference) makes no user-visible changes and fixes a few bugs. It is backwards-compatible with R 3.5.0, and users can find updates for Windows, Linux and Mac systems at their local CRAN mirror. (The update to Microsoft R Open featuring the R 3.5.1 engine is scheduled for release on August 29.) The complete list of fixes to R 3.5.1 is included in the release announcement, found at the link below. R-announce mailing list: R 3.5.1 is released var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### A Comparative Review of the R Commander GUI for R Tue, 07/10/2018 - 00:02 (This article was first published on R – r4stats.com, and kindly contributed to R-bloggers) Introduction The R Commander is a free and open source user interface for the R software, one that focuses on helping users learn R commands by point-and-clicking their way through analyses. The R Commander is available on Windows, Mac, and Linux; there is no server version. This is one of a series of reviews which aim to help non-programmers choose the user interface for R which is best for them. Each review also includes a cursory description of the programming support that each interface offers. Figure 1. The main R Commander window is in the upper left. A typical dialog box is in the front, and the graph it created is on the right. The data editor is on the upper right. Terminology There are various definitions of user interface types, so here’s how I’ll be using these terms: GUI = Graphical User Interface specifically using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers. IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses. Installation The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi, BlueSky, or RKWard, install in a single step. Others, such as Deducer, install in multiple steps. Advanced computer users often don’t appreciate how lost beginners can become while attempting even a single-step installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester! As described on the R Commander main web site, the installation basics are as follows: 1. Download R from CRAN and install it in the manner appropriate to your operating system. If you have an old version of R — that is, older than the current version — then it’s generally a good idea to install the current version of R before installing the Rcmdr package. On Windows, opt for a customized startup and select the single-document interface (“SDI,” see the Windows notes below for details). 2. On Mac OS X only, download and install XQuartz, and reboot your computer (see the Mac notes below for greater detail). 3. Start R, and at the > command prompt, type the command install.packages(“Rcmdr”). 4. Once it is installed, to load the Rcmdr package, just enter the command library(“Rcmdr”). 5. Optionally install Pandoc and LaTeX to get publication-quality output (via “Tools> Install auxiliary software”) Complete installation notes are here. They’re worth reading as they go on to point out several things that can go wrong. These include having an incompatible version of R (i.e. you skipped step 1), and R packages which fail to install. While these multiple steps are more challenging that single-step installations, they are in line with the developer’s goal of helping people learn to program in R. That audience would have to learn to install R and R packages, then load packages anyway. Plug-ins When choosing a GUI, one of the most fundamental questions is: what can do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling section of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, BlueSky, Deducer) through moderate (jamovi) to very active. R Commander’s development community is by far the most active, with 45 add-ons available. The add-ons are stored on CRAN and installed like any other software. You install them using the install.packages function, then choose “Tools> Load Rcmdr plug-ins…”, select the plug-in, and click OK. R Commander will then tell you that you need to restart R Commander to have it appear on the menus. The R Commander stores your list of plug-ins in your .Rprofile and will edit it for you. That’s important as editing it is a non-trivial task (see Installation, step 6). You can find a comprehensive list of plug-ins here: https://cran.r-project.org/web/packages/index.html. Startup Some user interfaces for R, such as jamovi and BlueSky Statistics, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as Deducer, have you start R, then load a package from your library, then call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway. You start R Commander by first starting the RGUI program that comes with the main R package. You can also start it from any program that offers an R console, such as the one that comes with the main R installation. Once R is started, you start the R Commander by loading it from your library by typing this command and pressing the Enter key: “library(“Rcmdr”).” The main control screen will then appear (Figure 1, upper left) along with a graphics screen (Figure 1, right). If you want to have the R Commander start automatically each time you start R, it’s possible to do so, but it is not a task for beginners. The R Commander creates an “.Rprofile” in your main directory and it includes instructions about how to “uncomment” a few lines by removing the leading “#” characters. However, files whose names consist only of an “.extension” are hidden by your file system, and if edited using the Notepad application, cannot be saved with that non-standard type of name. You can save it to a regular name like “Rprofile.txt”. You then must use operating system commands to rename it, as the Windows file manager also won’t let you choose a file name that is only an extension. Data Editor A data editor is a fundamental feature in data analysis software. It puts you in touch with your data, lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences (single-click sorting, icons that show excluded observations, etc.), to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, Bluesky, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it. You start the R Commander’s Data Editor by choosing “Data> New data set…” You can enter data immediately, though at first the variables are named simply V1, V2… and the rows are named 1,2,3…. You can click on the names to change them (see Figure 2). Clicking on the “Add row” or “Add column” buttons do just that, though the Enter key is a quicker way to get a new row. You can enter simple numeric data or character data; no scientific notation, no dates. The latter is converted to a factor, but there is no way to enter the underlying values such as 1, 2 and have the editor display Male, Female, for example. That slows down data entry. There is no way to enter or change any metadata other than variable and row names. Saving the data provides a lesson on R data structures. Since you started the process by creating a new “data set”, you might start looking on the menus for where to save such a thing. Instead, you have to know that in R, data sets reside in something called a “workspace”. So “Data: New data set…” is balanced by “File: Save R workspace”. It would be nice if there was some instruction explaining this situation. Figure 2. The R Commander’s data editor. Data Import The R Commander can import the file formats: CSV, TXT, Excel, Minitab, SPSS, SAS, and Stata. It can even import data directly from a URL, which is a rare feature for a GUI. These are all located under “Data> Import Data”. A particularly handy feature is the ability to explore and load data sets that are included with installed packages. That’s done via “Data> Data in packages…”. To get data from SQL database formats, you’ll have to use R code. Data Management It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing such tasks one variable at a time is tedious. Some GUIs, such as BlueSky, handle nearly all of these challenges. Others, such as RKWard offer just a handful of data management functions. The R Commander is able to recode many variables, adding an optional prefix to each name like “recoded_” to each variable that you choose to recode. It can also standardize many variables at once, but can only over-write the original values. Make a copy of your data set before doing that! Unfortunately, when it comes to other popular transformations such as the logarithm, you have to apply them one variable at time. For reshaping data sets, the R Commander can stack one set of variables into a single variable and create a factor to classify those values, but it can’t take along other relevant variables, nor can it do the reverse of this process by going from “long” to “wide” data structures. Overall, the R Commander offers a very useful set of data management tools: For managing the active data set as a whole: 1. View data 2. Select active data set 3. Refresh active data set 4. Help on active data set 5. Variables in active data set 6. Set case names 7. Subset active data set 8. Sort active data set 9. Aggregate variables in the active data set 10. Remove row(s) from active data set 11. Stack variables in active data set (half of reshaping discussed above) 12. Remove cases with missing data 13. Save active data set 14. Export active data set For managing variables in the active data set: 1. Recode variables (able to do many variables) 2. Compute new variables (can create only one new variable at a time) 3. Add observation numbers to data set 4. Standardize variables (able to do many variables at once) 5. Convert numeric variables to factors 6. Bin numeric variable 7. Reorder factor levels 8. Drop unused factor levels 9. Define contrasts for a factor 10. Rename variables 11. Delete variables from data set Menus & Dialog Boxes The goal of pointing & clicking your way through an analysis is to save time by recognizing menu settings rather than spend it on the memorization and practice required by programming. Some GUIs, such as jamovi make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning. Figure 1 shows a common screen layout. The main R Commander window is in the upper left. A typical dialog box is in the front, and the graph it created is on the right. The data editor is on the upper right. The R Commander’s menu structure contains some unique choices. No operations on data files are located on the usual “File” menu. For example, existing data sets or files are not opened using the usual “File> Open…”, but instead using “Data> Load data set…” menu. Also, everything on the models menu applies not to data but from models that you’ve already created from data. The other menus follow Windows standards. When switching between software packages, I found myself usually looking for data under the File menu. The rationale behind the R Commander’s approach is that the R function that opens files is named “load”. So this structure will help people learn more about R code (whether they’re headed that way or not!) The dialog boxes have their own style too, but one that is easy to learn. Rather than have empty role boxes that you drag or click variables into, the role boxes contain the full list of relevant variables, and you click on one (or more) to select them (see “X variable (pick one) in Fig. 1). In the cases where there is an empty role box, or you double-click a variable name to move it from a list to box. The R Commander does a nice job of helping you avoid asking for absurdities, such as the mean of a factor, by not displaying them in certain dialog boxes. The two objects you might be working on are shown on the toolbar right below the main menus. Clicking the “Data set:” tool will allow you to choose which data set most of the dialog boxes will refer to by default. That’s filled in automatically when you load, enter, or import a data set. Similarly, clicking the “Model:” tool will let you select a model which most of the choices on the Models menu will relate to. It too is filled in automatically each time you create a new model. See more on this in the Modeling section below. Continued here… var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### The Ten Commandments for a well-formatted database Mon, 07/09/2018 - 22:37 (This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers) Our diligent readers already know how important having well formatted data is for efficient statistical analyses. Here we gathered some advice on how to make a well structured database, in order to perform accurate analyses, and avoid driving your fellow analysts crazy. We are very grateful to Marion Louveaux, bio-image analyst for this translation of french version “Une base de données réussie” on our ThinkR website. An example is worth a thousand words. Let’s imagine you are a medical intern working on an experimental assay (any resemblance to people who actually existed …). We could have given the following title to our article: “How to make a database when you are a medical intern knowing nothing”. However this exercise may have a broader application: indeed our advice applies to anyone. Let’s say that you already have your medical question and your protocol, and that you “just” have to fill in your database. So, now what? As the proverb says, there are a thousand ways of doing it wrong, and often only one to do it right. Let’s be honest: if your database is not well ordered, then any further analyses will be hindered. Don’t worry, ThinkR is here to help you create a clear database. Here are ten commandments to avoid classical traps. Commandment 1: all your data shall fit into one single dataframe Spreading your data over several spreadsheets or several files is not an option. All your data should fit in one single dataframe. Commandment 2: Thou shalt respect a precise formatting For a perfect database, follow this rule: the table must be filled cell by cell, starting from the upper left corner and, of course, without skipping line(s). Commandment 3: A line = a statistical individual It is not always easy to define what a statistical individual is. In a nutshell, an answer to a survey corresponds to one individual in your experimental protocol. If a physical person answers two times, you can enter these on two lines. Commandment 4: A column = a variable A variable corresponds to a metric from your statistical individual (e.g. a characteristic that you can measure). For instance, the gender, the age or the colour of the eyes. For these categorical variables, where each answer corresponds to one level or another, but not several levels at the same time, each variable corresponds to one column. Do not create one column for “man” and one column for “woman”, each filled with yes/no, but a single one alternating men and women accross the different lines. When several levels can be taken by the same individual, use several columns. For instance, if your patients have several symptoms (hurting eyes/nose/neck…), you will have to create one column per symptom, and fill them with yes or no. Commandment 5: Thou shalt not encode thy qualitative variables Avoid as much as possible encoding qualitative variables: a 1/2 coding for the gender is useless, prefer the use of “woman” and “man” for a better readability. If you really can’t prevent yourself from using a code, then stick to 1/0 to mean presence/absence or yes/no. Not more ! For Excel lovers: don’t code your variables using colors or bold/italic/underlined. No “control in blue, test case in red”: nothing worse than such a format to loose information when exporting those data. Commandment 6: Thy database shall only contain data It is not unusual to see databases with aestethic formatting, containing titles, blank spaces, annotations, explanations and much more. Nothing is worse than this type of format for causing difficulties when importing the data into R. Repeat after me: the database must contain ONLY data. If you really need to make comments, you can always add a special column for them. Commandment 7: Homogeneous thou shalt be For reading, as well as for the analysis, a homogeneous notation is essential. For instance, don’t mix “boy”, “man” and “masculine”. Choose one and stick to it all across your database… including the exact spelling: R is case and blank sensitive. “Man 1” will not be the same as “man 1”, nor as “man1” or “man_1”. For column names, forget about “m1 glycemia”, “glucose rate month3” and “glycemia month6”. Rather use “glycemia_month_1”, “glycemia_month_3” and “glycemia_month_6”! When data are missing, leave a blank space. Commandment 8: Thy numerical variables with respect thou shalt treat You have numerical variables in your database. Fair enough, but please keep them numerical! For instance, the patient’s age must be a number, not “25 yrs” or “25 years old”, just “25”. For the dates, choose the universal notation YYYY-MM-DD, natively recognized by R. And don’t forget to respect Commandment 7 by standardizing the format. Commandment 9: Anonymous thy database thou shalt keep Don’t forget to make your database anonymous (especially if you are medical intern). When communicating, sharing or publishing your results, including the identities of people is not an option. Give them numbers, and keep the correspondance in another database with the names, medical files, … and all other information that is not relevant for statistical analysis. Commandment 10: Human readable thy database shall be Throughout the creation of your database never forget that your data must be perfectly readable by the computer… but also by a human reader! You database must be understandable and readable, either by you or a colleague. Give meaningful names to your variables, so that you understand yourself in 6 months when reopening the database. And avoid falling into the trap of simplicity: rather use “glycemia_month1”, “glycemia_month2” and “glycemia_month3” than G1, G2 and G3. Acknowledgements I would like to thank Victor Jones for reviewing and correcting this translation. Translated by Marion Louveaux, bio-image analyst The post The Ten Commandments for a well-formatted database appeared first on (en) The R Task Force. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Writing Pipe-friendly Functions Mon, 07/09/2018 - 02:14 (This article was first published on R – William Doane, and kindly contributed to R-bloggers) Pipes have been a fundamental aspect of computer programming for many decades. In short, the semantics of pipes can be thought of as taking the output from the left-hand side and passing it as input to the right-hand side. For example, in a linux shell, you might cat example.txt | unique | sort to take the contents of a text file, then take one copy of each row, then sort those remaining rows. | is a common, but not universal, pipe operator and on U.S. Qwerty keyboards, is found above the backslash key: \. Languages that don’t begin by supporting pipes often eventually implement some version of them. In R, the magrittr package introduced the %>% infix operator as a pipe operator and is most often pronounced as “then”. For example, “take the mtcars data.frame, THEN take the head of it, THEN…” and so on. For a function to be pipe friendly, it should at least take a data object (often named .data) as its first argument and return an object of the same type—possibly even the same, unaltered object. This contract ensures that your pipe-friendly function can exist in the middle of a piped workflow, accepting the input from its left-hand side and passing along output to its right-hand side. library(magrittr) custom_function <- function(.data) { message(str(.data)) .data } mtcars %>% custom_function() %>% head(10) %>% custom_function() This will first display the structure of the 32 by 10 mtcars data.frame, then take the head(10) of mtcars and display the structure of that 10 by 10 reduced version, ultimately returning the reduced version which is, by default in R, printed to the console. The dplyr package in R introduces the notion of a grouped data.frame. For example, in the mtcars data, there is a cyl parameter that classifies each observation as a 4, 6, or 8 cylinder vehicle. You might want to process each of these groups of rows separately—i.e., process all the 4 cylinder vehicles together, then all the 6 cylinder, then all the 8 cylinder: library(dplyr) mtcars %>% group_by(cyl) %>% tally() Note that dplyr re-exports the magrittr pipe operator, so it’s not necessary to attach both dplyr and magrittr explicitly; attaching dplyr will usually suffice. In order to make my custom function group-aware, I need to check the incoming .data object to see whether it’s a grouped data.frame. If it is, then I can use dplyr‘s do() function to call my custom function on each subset of the data. Here, the (.) notation denotes the subset of .data being handed to custom_function at each invocation. library(dplyr) custom_function <- function(.data) { if (dplyr::is_grouped_df(.data)) { return(dplyr::do(.data, custom_function(.))) } message(str(.data)) .data } mtcars %>% custom_function() mtcars %>% group_by(cyl) %>% custom_function() In these examples, I’ve messaged some metadata to the console, but your custom functions can do any work they like: create, plot, and save ggplots; compute statistics; generate log files; and so on. I usually include the R three-dots parameter, ..., to allow additional parameters to be passed into the function. custom_function <- function(.data, ...) { if (dplyr::is_grouped_df(.data)) { return(dplyr::do(.data, custom_function(., ...))) } message(str(.data)) .data } var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – William Doane. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Euler Problem 33: Digit Cancelling Fractions and Ford Circles Mon, 07/09/2018 - 02:00 (This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers) Euler Problem 33 takes us back to the world of fractions from our primary school days. Many kids hate and struggle learning about fractions but once you master them, a new world of numbers opens up. Unfortunately, the proliferation of digital calculators has negated the use of fractions in favour of decimal expressions. Fractions are an aesthetic way to express numbers, without having to resort to ugly random sequences of decimals. This is why I prefer to use 22/7 as an approximation of Pi over the ugly infinite series of decimals. This Numberphile video below explains fractions and Farey sequences. A Farey sequence contains all fractions between 0 and 1 with a maximum denominator. More formally, a Farey sequence of order n is the sequence of completely reduced fractions between 0 and 1 which, when in lowest terms, have denominators less than or equal to , arranged in order of increasing size. For example, the Farey Sequence with order 3 is: These sequences can be visualised in fractal-esque Ford Circles, but before we get to this, first solve Euler problem 33. Euler Problem 33 Definition The fraction 49/98 is a curious fraction, as an inexperienced mathematician in attempting to simplify it may incorrectly believe that 49/98 = 4/8, which is correct, is obtained by cancelling the 9s. We shall consider fractions like 30/50 = 3/5, to be trivial examples. There are exactly four nontrivial examples of this type of fraction, less than one in value, and containing two digits in the numerator and denominator. If the product of these four fractions is given in its lowest common terms, find the value of the denominator. Proposed Solution in R To solve this problem, we create a pseudo-Farey sequence by generating all different fractions with two decimals in the numerator and denominator. The loop generates all combinations of demoninators and numerators, excluding the trivial ones (multiples of 10 and 11). This solution converts the numbers to strings, strips any common duplicates, and tests the condition. The code concatenates vectors, which is not good practice. However, the loop is so short it does not matter much. You can view the code below or download it from my GitHub page. num <- vector() den <- vector() for (a in 11:99) { for (b in (a + 1):99) { trivial <- (a %% 10 == 0 | b && 10 == 0 | a %% 11 == 0 | b %% 11 == 0) if (!trivial) { i <- as.numeric(unlist(strsplit(as.character(a), ""))) j <- as.numeric(unlist(strsplit(as.character(b), ""))) digs <- c(i, j) dup <- digs[duplicated(digs)] digs <- digs[which(digs != dup)] if (length(digs) == 2 & a/b == digs[1]/digs[2]) { num <- c(num, a) den <- c(den, b) } } } } answer <- prod(den) / prod(num) print(answer) Farey Sequences and Ford Circles Next step is to generalise Euler problem 33 and write a function to generate Farey Sequences and visualise them using Ford Circles. The farey function generates a data table with the numerators (p) and denominators (q) of a Farey sequence. The function builds a list of all possible fractions for the solution space, excluding those with one as a Greatest Common Dominator, as defined by the gcc function. farey <- function(n) { fseq <- list() fseq[[1]] <- c(0, 1) i <- 2 gcd <- function(a, b) { # Euclid's method if (a == 0) return(b) if (b == 0) return(a) gcd(b, a%%b) } for (q in 2:n) { for (p in 1:(q - 1)){ if (gcd(p, q) == 1) { fseq[[i]] <- c(p, q) i <- i + 1 } } } fseq[[i]] <- c(1, 1) fseq <- as.data.frame(do.call(rbind, fseq)) names(fseq) <- c("p", "q") fseq <- fseq[order(fseq$p / fseq$q), ] return(fseq) } Standard ggplot2 cannot draw circles where the radius of the circles is related to the coordinate system. I tried to use the ggforce package to plot circles in ggplot2, but for some reason, I was not able to install this package on Ubuntu. As an alternative, I used a circle function sourced from StackOverflow. This function is called in a for-loop to build the circles on top of the empty canvas. Farey Sequence and Ford Circles (n = 20). library(tidyverse) lm_palette <- c("#008da1", "#005395", "#262e43", "#3b2758", "#865596", "#f26230") ford_circles <- farey(20) %>% mutate(x = p / q, y = 1 / (2* q^2), r = y, c = lm_palette[(q - 1)%%6 + 1]) g_circle <- function(r, x, y, color = NA, fill = "black", ...) { x <- x + r * cos(seq(0, pi, length.out = 100)) ymax <- y + r * sin(seq(0, pi, length.out = 100)) ymin <- y + r * sin(seq(0, -pi, length.out = 100)) annotate("ribbon", x = x, ymin = ymin, ymax = ymax, color = color, fill = fill, ...) } p <- ggplot(ford_circles, aes(x, y)) for (i in 1:nrow(ford_circles)) { p <- p + g_circle(ford_circles$r[i], ford_circles$x[i], ford_circles$y[i], fill = ford_circles$c[i]) } p + xlim(c(0, 1)) + coord_fixed() + theme_void() The post Euler Problem 33: Digit Cancelling Fractions and Ford Circles appeared first on The Lucid Manager. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data – The Lucid Manager. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### RStudio Connect v1.6.4.2 – Security Update Mon, 07/09/2018 - 02:00 (This article was first published on RStudio Blog, and kindly contributed to R-bloggers) A security vulnerability in a third-party library used by RStudio Connect was uncovered during a security audit last week. We have confirmed that this vulnerability has not been used against any of the RStudio Connect instances we host, and are unaware of it being exploited on any customer deployments. Under certain conditions, this vulnerability could compromise the session of a user that was tricked into visiting a specially crafted URL. The issue affects all versions of RStudio Connect up to and including 1.6.4.1, but none of our other products. We have prepared a hotfix: v1.6.4.2. RStudio remains committed to providing the most secure product possible. We regularly perform internal security audits against RStudio Connect in order to ensure the product’s security. As part of the responsible disclosure process, we will provide additional details about the vulnerability and how to ensure that you have not been affected, in the coming weeks once customers have had time to update their systems. For now, please update your RStudio Connect installations to version 1.6.4.2 as soon as possible. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Solver Interfaces in CVXR Mon, 07/09/2018 - 02:00 (This article was first published on R Views, and kindly contributed to R-bloggers) Introduction In our previous blog post , we introduced CVXR, an R package for disciplined convex optimization. The package allows one to describe an optimization problem with Disciplined Convex Programming rules using high level mathematical syntax. Passing this problem definition along (with a list of constraints, if any) to the solve function transforms it into a form that can be handed off to a solver. The default installation of CVXR comes with two (imported) open source solvers: • ECOS and its mixed integer cousin ECOS_BB via the CRAN package ECOSolveR • SCS via the CRAN package scs. CVXR (version 0.99) can also make use of several other open source solvers implemented in R packages: • The linear and mixed integer programming package lpSolve via the lpSolveAPI package • The linear and mixed integer programming package GLPK via the Rglpk package. About Solvers The real work of finding a solution is done by solvers, and writing good solvers is hard work. Furthermore, some solvers work particularly well for certain types of problems (linear programs, quadratic programs, etc.). Not surprisingly, there are commercial vendors who have solvers that are designed for performance and scale. Two well-known solvers are MOSEK and GUROBI. R packages for these solvers are also provided, but they require the problem data to be constructed in a specific form. This necessitates a bit of work in the current version of CVXR and is certainly something we plan to include in future versions. However, it is also true that these commercial solvers expose a much richer API to Python programmers than to R programmers. How, then, do we interface such solvers with R as quickly as possible, at least in the short term? Reticulate to the Rescue The current version of CVXR exploits the reticulate package for commercial solvers such as MOSEK and GUROBI. We took the Python solver interfaces in CVXPY version 0.4.11 , edited them suitably to make them self-contained, and hooked them up to reticulate. This means that one needs two prerequisites to use these commercial solvers in the current version of CVXR: Installing MOSEK/GUROBI Both MOSEK and GUROBI provide academic versions (registration required) free of charge. For example, Anaconda users can install MOSEK with the command: conda install -c mosek mosek Others can use the pip command: pip install -f https://download.mosek.com/stable/wheel/index.html Mosek GUROBI is handled in a similar fashion. The solvers must be activated using a license provided by the vendor. Once activated, one can check that CVXR recognizes the solver; installed_solvers() should list them. > installed_solvers() [1] "ECOS" "ECOS_BB" "SCS" "MOSEK" "LPSOLVE" "GLPK" "GUROBI" Further information More information on these solvers, along with a number of tutorial examples are available on the CVXR site. If you are attending useR! 2018, you can catch Anqi’s CVXR talk on Friday, July 13. _____='https://rviews.rstudio.com/2018/07/09/solver-interfaces-in-cvxr/'; var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Time Series Analysis With Documentation And Steps I Follow For Analytics Projects. Mon, 07/09/2018 - 01:01 (This article was first published on R-Analytics, and kindly contributed to R-bloggers) To do this I will create a prediction of the open values for Bitcoin in the next 3 days. The process I follow is based on CRISP-DM methodology: https://www.datasciencecentral.com/profiles/blogs/crisp-dm-a-standard-methodology-to-ensure-a-good-outcome 1.- Planning the activities. To plan the activities I use a spread sheet document, below I show the spread sheet sample, if you would like the document, please go to the next link: https://github.com/LaranIkal/R-ANALYTICS/blob/master/BitCoinPredictionsAdmin%26FollowUp.ods body,div,table,thead,tbody,tfoot,tr,th,td,p { font-family:"Calibri"; font-size:x-small } a.comment-indicator:hover + comment { background:#ffd; position:absolute; display:block; border:1px solid black; padding:0.5em; } a.comment-indicator { background:red; display:inline-block; border:1px solid black; width:0.5em; height:0.5em; } comment { display:none; } Activity Activity Description DueDate Activity Owner Status Comments Functional Requirement Specification A Text Document explaining the objectives of this project. 4/19/2018 Carlos Kassab Done Get Data For Analysis Get initial data to create feasibility analysis 4/19/2018 Carlos Kassab Done ETL Development ETL to get final data for next analysis 4/20/2018 Carlos Kassab Done 2018/04/19: In this case, there is not ETL needed, the dataset was downloaded from kaggle: https://www.kaggle.com/vivekchamp/bitcoin/data Exploratory Data Analysis Dataset summary and histogram to know the data normalization 4/20/2018 Carlos Kassab In progress Variables frequency Frequency of variable occurrence( frequency of values change, etc. ) 4/20/2018 Carlos Kassab Done 2018/04/19: We have already seen that our values change every day. Outliers Analysis Analysis of variability in numeric variables, show it in charts and grids.. 4/20/2018 Carlos Kassab In progress Time Series Decomposition – Getting metric charts, raw data, seasonality, trend and remainder. 4/20/2018 Carlos Kassab In progress Modelling Create the analytics model 4/25/2018 Carlos Kassab Not Started SQL View Development For Training, Validation And Testing NA Carlos Kassab Not Started 2018/04/19: No SQL view needed, everything is done inside the R script. Model Selection By using random parameters search algorithm, to find the right model to be used for this data. 4/25/2018 Carlos Kassab Not Started Model fine tunning. After finding the right algorithm, find the right model parameters to be used. 4/25/2018 Carlos Kassab Not Started Chart Development Final data chart development 4/25/2018 Carlos Kassab Not Started Data Validation Run the analytics model at least 2 weeks daily in order to see its behavior. NA Carlos Kassab Not Started Deployment Schedule the automatic execution of the R code. NA Carlos Kassab Not Started The first activity is the functional specification, this would be similar to business understanding in the crisp-dm methodology. I use a text document, for this analysis, you can get the document from this link: https://github.com/LaranIkal/R-ANALYTICS/blob/master/BitCoinPredictionRequirementSpecification.odt Now, the next step is to get the data for analysis and create the ETL script, in this case we just got the data from kaggle.com as mentioned in the documentation but, no ETL script was needed. So, we have our data, now we are going to analyze it, we will do all the activities mentioned in yellow in the grid above. I did this in a MarkDown document, here you can see the HTML output: https://github.com/LaranIkal/R-ANALYTICS/blob/master/BitcoinDataAnalysis.html Note. At the end, in order to have everything together I included the time series algorithms in the same rmd file that creates BitcoinDataAnalysis.html file. You can get all the sources from: https://github.com/LaranIkal/R-ANALYTICS Enjoy it!!!. Carlos Kassab https://www.linkedin.com/in/carlos-kassab-48b40743/ More information about R: https://www.r-bloggers.com var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R-Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### a thread to bin them all [puzzle] Mon, 07/09/2018 - 00:18 (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers) The most recent riddle on the Riddler consists in finding the shorter sequence of digits (in 0,1,..,9) such that all 10⁴ numbers between 0 (or 0000) and 9,999 can be found as a group of consecutive four digits. This sequence is obviously longer than 10⁴+3, but how long? On my trip to Brittany last weekend, I wrote an R code first constructing the sequence at random by picking with high preference the next digit among those producing a new four-digit number tenz=10^(0:3) wn2dg=function(dz) 1+sum(dz*tenz) seqz=rep(0,10^4) snak=wndz=sample(0:9,4,rep=TRUE) seqz[wn2dg(wndz)]=1 while (min(seqz)==0){ wndz[1:3]=wndz[-1];wndz[4]=0 wndz[4]=sample(0:9,1,prob=.01+.99*(seqz[wn2dg(wndz)+0:9]==0)) snak=c(snak,wndz[4]) sek=wn2dg(wndz) seqz[sek]=seqz[sek]+1} which usually returns a value above 75,000. I then looked through the sequence to eliminate useless replicas for (i in sample(4:(length(snak)-5))){ if ((seqz[wn2dg(snak[(i-3):i])]>1) &(seqz[wn2dg(snak[(i-2):(i+1)])]>1) &(seqz[wn2dg(snak[(i-1):(i+2)])]>1) &(seqz[wn2dg(snak[i:(i+3)])]>1)){ seqz[wn2dg(snak[(i-3):i])]=seqz[wn2dg(snak[(i-3):i])]-1 seqz[wn2dg(snak[(i-2):(i+1)])]=seqz[wn2dg(snak[(i-2):(i+1)])]-1 seqz[wn2dg(snak[(i-1):(i+2)])]=seqz[wn2dg(snak[(i-1):(i+2)])]-1 seqz[wn2dg(snak[i:(i+3)])]=seqz[wn2dg(snak[i:(i+3)])]-1 snak=snak[-i] seqz[wn2dg(snak[(i-3):i])]=seqz[wn2dg(snak[(i-3):i])]+1 seqz[wn2dg(snak[(i-2):(i+1)])]=seqz[wn2dg(snak[(i-2):(i+1)])]+1 seqz[wn2dg(snak[(i-1):(i+2)])]=seqz[wn2dg(snak[(i-1):(i+2)])]+1}} until none is found. A first attempt produced 12,911 terms in the sequence. A second one 12,913. A third one 12,871. Rather consistent figures but not concentrated enough to believe in achieving a true minimum. An overnight run produced 12,779 as the lowest value. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### World Income, Inequality and Murder Mon, 07/09/2018 - 00:00 (This article was first published on Theory meets practice..., and kindly contributed to R-bloggers) Abstract: We follow up on last weeks post on using Gapminder data to study the world’s income distribution. In order to assess the inequality of the distribution we compute the Gini coefficient for the world’s income distribution by Monte Carlo approximation and visualize the result as a time series. Furthermore, we animate the association between Gini coefficient and homicide rate per country using the new version of gganimate. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github. Introduction One of the main messages of the Chapter ‘The Gap Instinct’ of the book Factfulness is that there is no justification of the ‘we’ and ‘them’ classification of countries anymore, because they have converged towards the same levels in key indicators such as life expectancy, child mortality, births per female. The difference between countries is, hence, not as many imagine it to be: there is less inequality and no real gap. While reading, I became curious about the following: what if countries became more equal, but simultaneously inequality within countries became bigger? This was also indirectly a Disqus comment by F. Weidemann to the post Factfulness: Building Gapminder Income Mountains. Aim of the present post is to investigate this hypothesis using the Gapminder data. Furthermore, we use the country specific Gini coefficients to investigate the association with the number of homicides in the country. Gini coefficient There are different ways to measure income inequality, both in terms of which response you consider and which statistical summary you compute for it. Not going into the details of these discussion we use the GDP/capita in Purchasing Power Parities (PPP) measured in so called international dollars (fixed prices 2011). In other words, comparison between years and countries are possible, because the response is adjusted for inflation and differences in price of living. The Gini coefficient is a statistical measure to quantify inequality. In what follows we shall focus on computing the Gini coefficient for a continuous probability distribution with a known probability density function. Let the probability density function of the non-negative continuous income distribution be defined by $$f$$, then the Gini coefficient is given as half the relative mean difference: $G = \frac{1}{2\mu}\int_0^\infty \int_0^\infty |x-y| \> f(x) \> f(y) \> dx\> dy, \quad\text{where}\quad \mu = \int_{0}^\infty x\cdot f(x) dx.$ Depending on $$f$$ it might be possible to solve these integrals analytically, however, a straightforward computational approach is to use Monte Carlo sampling – as we shall see shortly. Personally, I find the above relative mean difference presentation of the Gini index much more intuitive than the area argument using the Lorenz curve. From the eqution it also becomes clear that the Gini coefficient is invariant to multiplicative changes in the income: if everybody increases their income by factor $$k>0$$ then the Gini coefficient remains the same, because $$|k x – k y| = k | x – y|$$ and $$E(k \cdot X) = k \mu$$. The above formula indirectly also states how to compute the Gini coefficient for a discrete sample of size $$n$$ and with incomes $$x_1,\ldots, x_n$$: $G = \frac{\sum_{i=1}^n \sum_{j=1}^n |x_i – x_j| \frac{1}{n} \frac{1}{n}}{2 \sum_{i=1}^n x_i \frac{1}{n}} = \frac{\sum_{i=1}^n \sum_{j=1}^n |x_i – x_j|}{2 n \sum_{j=1}^n x_j}.$ Approximating the integral by Monte Carlo If one is able to easily sample from $$f$$ then can instead of solving the integral analytically use $$k$$ pairs $$(x,y)$$ both drawn at random from $$f$$ to approximate the double integral: $G \approx \frac{1}{2\mu K} \sum_{k=1}^K |x_k – y_k|, \quad\text{where}\quad x_k \stackrel{\text{iid}}{\sim} f \text{ and } y_k \stackrel{\text{iid}}{\sim} f,$ where for our mixture model $\mu = \sum_{i=1}^{192} w_i \> E(X_i) = \sum_{i=1}^{192} w_i \exp\left(\mu_i + \frac{1}{2}\sigma_i^2\right).$ This allows us to compute $$G$$ even in case of a complex $$f$$ such as the log-normal mixture distribution. As always, the larger $$K$$ is, the better the Monte Carlo approximation is. ##Precision of Monte Carlo approx is controlled by the number of samples n <- 1e6 ##Compute Gini index of world income per year gini_year <- gm %>% group_by(year) %>% do({ x <- rmix(n, meanlog=.$meanlog, sdlog= .$sdlog, w=.$w) y <- rmix(n, meanlog=.$meanlog, sdlog= .$sdlog, w=.$w) int <- mean( abs(x-y) ) mu <- sum(exp( .$meanlog + 1/2 * .$sdlog^2) * .$w) data.frame(gini_all_mc=1/(2*mu)*int, country_gini=gini(.$gdp*.$population)) }) %>% rename(World Population GDP/capita=gini_all_mc, Country GDP/capita`=country_gini) ##Convert to long format gini_ts <- gini_year %>% gather(key="type", value="gini", -year) Results

We start by showing the country specific Gini coefficient per year since 1950 for a somewhat arbitrary selection of countries. The dashed black line shows the mean Gini coefficient each year over all 192 countries in the dataset.

In addition, we now show the Monte Carlo computed Gini coefficient for the world’s income distribution given as a log-normal mixture with 192 components.

We notice that the Gini coefficient taken over the 192 countries’ GDP/capita remains very stable over time. This, however, does not take the large differences in populations between countries into account. A fairer measure is thus the Gini coefficient for the world’s income distribution. We see that this Gini coefficient increased over time until peaking around 1990. From then on it has declined. However, the pre-1950 Gini coefficients are rather guesstimates as stated by Gapminder, hence, we zoom in on the period from 1970, because data are more reliable from this point on.

Gini coefficient and Homicide Rate

Finally, we end the post by illustrating the association between the Gini coefficient and the homicide rate per country using a 2D scatterplot over the years. The Gapminder data download page also contains data for this for the years 1950- 2005. Unfortunately, no data for more recent years are available from the Gapminder homepage, but the plot shown below is the situation in r show_year with a log-base-10 y-axis for the homicide rates. For each of the four Gapminder regional groups we also fit a simple linear regression line to the points of all countries within the region. Findings such as Fajnzylber, Lederman, and Loayza (2002) suggest that there is a strong positive correlation between Gini coefficient and homicide rate, but we see from the plot that there are regional differences and of course correlation is not causality. Explanations for this phenomena are debated and beyond the scope of this post.

We extend the plots to all years 1950-2005. Unfortunately, not all countries are available every year – so we only plot the available countries each year. This means that many African countries are missing from the animation. An improvement would be to try some form of linear interpolation. Furthermore, for the sake of simplicity of illustration, we fix countries with a reported murder rate of zero in a given year (happens for example for Cyprus, Iceland, Fiji in some years) to 0.01 per 100,000 population. This can be nicely animated using the new version of the gganimate pkg by Thomas Lin Pedersen.

## New version of gganimate. Not on CRAN yet. ## devtools::install_github('thomasp85/gganimate') require(gganimate) p <- ggplot(gm2_nozero, aes(x=gini, y=murder_rate,size=population, color=Region)) + geom_point() + scale_x_continuous(labels=scales::percent) + scale_y_continuous(trans="log10", breaks = trans_breaks("log10", function(x) 10^x,n=5), labels = trans_format("log10", function(x) ifelse(x<0, sprintf(paste0("%.",ifelse(is.na(x),"0",round(abs(x))),"f"),10^x), sprintf("%.0f",10^x)))) + geom_smooth(se=FALSE, method="lm", formula=y~x) + geom_text(data=gm2, aes(x=gini,y=murder_rate, label=country), vjust=-0.9, show.legend=FALSE) + ylab("Murder rate (per 100,000 population)") + xlab("Gini coefficient (in %)") + guides(size=FALSE) + labs(title = 'Year: {frame_time}') + transition_time(year) + ease_aes('linear') animate(p, nframes=length(unique(gm2$year)), fps=4, width=800, height=400, res=100) Discussion Based on the available Gapminder data we showed that in the last 25 years the Gini coefficient for the world’s income distribution has decreased. For several individual countries opposite dynamics are, however, observed. One particular concern is the share that the richest 1% have of the overall wealth: more than 50%. Literature Fajnzylber, P., D. Lederman, and N. Loayza. 2002. “Inequality and Violent Crime.” Journal of Law and Economics 45 (April). http://siteresources.worldbank.org/DEC/Resources/Crime&Inequality.pdf. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice.... R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Statistics Sunday: Mixed Effects Meta-Analysis Sun, 07/08/2018 - 20:08 (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers) As promised, how to conduct mixed effects meta-analysis in R: Code used in the video is available here. And I’d recommend the following posts to provide background for this video: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Speed up your R Work Sun, 07/08/2018 - 18:24 (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers) Introduction In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages. For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines. The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion. Keep in mind: unless the pipeline steps have non-trivial cost, the overhead of partitioning and distributing the work may overwhelm any parallel speedup. Also data.table itself already seems to exploit some thread-level parallelism (notice user time is greater than elapsed time). That being said, in this note we will demonstrate a synthetic example where computation is expensive due to a blow-up in an intermediate join step. Our example First we set up our execution environment and example (some details: OSX 10.13.4 on a 2.8 GHz Intel Core i5 Mac Mini (Late 2015 model) with 8GB RAM and hybrid disk drive). library("rqdatatable") ## Loading required package: rquery library("microbenchmark") library("ggplot2") library("WVPlots") suppressPackageStartupMessages(library("dplyr")) ## Warning: package 'dplyr' was built under R version 3.5.1 base::date() ## [1] "Sun Jul 8 09:05:25 2018" R.version.string ## [1] "R version 3.5.0 (2018-04-23)" parallel::detectCores() ## [1] 4 packageVersion("parallel") ## [1] '3.5.0' packageVersion("rqdatatable") ## [1] '0.1.2' packageVersion("rquery") ## [1] '0.5.1' packageVersion("dplyr") ## [1] '0.7.6' ncore <- parallel::detectCores() print(ncore) ## [1] 4 cl <- parallel::makeCluster(ncore) print(cl) ## socket cluster with 4 nodes on host 'localhost' set.seed(2362) mk_example <- function(nkey, nrep, ngroup = 20) { keys <- paste0("key_", seq_len(nkey)) key_group <- sample(as.character(seq_len(ngroup)), length(keys), replace = TRUE) names(key_group) <- keys key_table <- data.frame( key = rep(keys, nrep), stringsAsFactors = FALSE) key_table$data <- runif(nrow(key_table)) instance_table <- data.frame( key = rep(keys, nrep), stringsAsFactors = FALSE) instance_table$id <- seq_len(nrow(instance_table)) instance_table$info <- runif(nrow(instance_table)) # groups should be no finer than keys key_table$key_group <- key_group[key_table$key] instance_table$key_group <- key_group[instance_table$key] list(key_table = key_table, instance_table = instance_table) } dlist <- mk_example(10, 10) data <- dlist$instance_table annotation <- dlist$key_table rquery / rqdatatable

rquery and rqdatatable can implement a non-trivial calculation as follows.

# possible data lookup: find rows that # have lookup data <= info optree <- local_td(data) %.>% natural_join(., local_td(annotation), jointype = "INNER", by = "key") %.>% select_rows_nse(., data <= info) %.>% pick_top_k(., k = 1, partitionby = "id", orderby = "data", reverse = "data", keep_order_column = FALSE) %.>% orderby(., "id") cat(format(optree)) ## table('data'; ## key, ## id, ## info, ## key_group) %.>% ## natural_join(., ## table('annotation'; ## key, ## data, ## key_group), ## j= INNER, by= key) %.>% ## select_rows(., ## data <= info) %.>% ## extend(., ## row_number := row_number(), ## p= id, ## o= "data" DESC) %.>% ## select_rows(., ## row_number <= 1) %.>% ## drop_columns(., ## row_number) %.>% ## orderby(., id) res1 <- ex_data_table(optree) head(res1) ## data id info key key_group ## 1: 0.9152014 1 0.9860654 key_1 20 ## 2: 0.5599810 2 0.5857570 key_2 8 ## 3: 0.3011882 3 0.3334490 key_3 10 ## 4: 0.3650987 4 0.3960980 key_4 5 ## 5: 0.1469254 5 0.1753649 key_5 14 ## 6: 0.2567631 6 0.3510280 key_6 7 nrow(res1) ## [1] 94

And we can execute the operations in parallel.

parallel::clusterEvalQ(cl, library("rqdatatable")) ## [[1]] ## [1] "rqdatatable" "rquery" "stats" "graphics" "grDevices" ## [6] "utils" "datasets" "methods" "base" ## ## [[2]] ## [1] "rqdatatable" "rquery" "stats" "graphics" "grDevices" ## [6] "utils" "datasets" "methods" "base" ## ## [[3]] ## [1] "rqdatatable" "rquery" "stats" "graphics" "grDevices" ## [6] "utils" "datasets" "methods" "base" ## ## [[4]] ## [1] "rqdatatable" "rquery" "stats" "graphics" "grDevices" ## [6] "utils" "datasets" "methods" "base" res2 <- ex_data_table_parallel(optree, "key_group", cl) head(res2) ## data id info key key_group ## 1: 0.9152014 1 0.9860654 key_1 20 ## 2: 0.5599810 2 0.5857570 key_2 8 ## 3: 0.3011882 3 0.3334490 key_3 10 ## 4: 0.3650987 4 0.3960980 key_4 5 ## 5: 0.1469254 5 0.1753649 key_5 14 ## 6: 0.2567631 6 0.3510280 key_6 7 nrow(res2) ## [1] 94 data.table

data.table can implement the same function.

library("data.table") ## ## Attaching package: 'data.table' ## The following objects are masked from 'package:dplyr': ## ## between, first, last packageVersion("data.table") ## [1] '1.11.4' data_table_f <- function(data, annotation) { data <- data.table::as.data.table(data) annotation <- data.table::as.data.table(annotation) joined <- merge(data, annotation, by = "key", all=FALSE, allow.cartesian=TRUE) joined <- joined[joined$data <= joined$info, ] data.table::setorderv(joined, cols = "data") joined <- joined[, .SD[.N], id] data.table::setorderv(joined, cols = "id") } resdt <- data_table_f(data, annotation) head(resdt) ## id key info key_group.x data key_group.y ## 1: 1 key_1 0.9860654 20 0.9152014 20 ## 2: 2 key_2 0.5857570 8 0.5599810 8 ## 3: 3 key_3 0.3334490 10 0.3011882 10 ## 4: 4 key_4 0.3960980 5 0.3650987 5 ## 5: 5 key_5 0.1753649 14 0.1469254 14 ## 6: 6 key_6 0.3510280 7 0.2567631 7 nrow(resdt) ## [1] 94

We can also run data.table in parallel using wrapr::execute_parallel.

parallel::clusterEvalQ(cl, library("data.table")) ## [[1]] ## [1] "data.table" "rqdatatable" "rquery" "stats" "graphics" ## [6] "grDevices" "utils" "datasets" "methods" "base" ## ## [[2]] ## [1] "data.table" "rqdatatable" "rquery" "stats" "graphics" ## [6] "grDevices" "utils" "datasets" "methods" "base" ## ## [[3]] ## [1] "data.table" "rqdatatable" "rquery" "stats" "graphics" ## [6] "grDevices" "utils" "datasets" "methods" "base" ## ## [[4]] ## [1] "data.table" "rqdatatable" "rquery" "stats" "graphics" ## [6] "grDevices" "utils" "datasets" "methods" "base" parallel::clusterExport(cl, "data_table_f") dt_f <- function(tables_list) { data <- tables_list$data annotation <- tables_list$annotation data_table_f(data, annotation) } data_table_parallel_f <- function(data, annotation) { respdt <- wrapr::execute_parallel( tables = list(data = data, annotation = annotation), f = dt_f, partition_column = "key_group", cl = cl) %.>% data.table::rbindlist(.) data.table::setorderv(respdt, cols = "id") respdt } respdt <- data_table_parallel_f(data, annotation) head(respdt) ## id key info key_group.x data key_group.y ## 1: 1 key_1 0.9860654 20 0.9152014 20 ## 2: 2 key_2 0.5857570 8 0.5599810 8 ## 3: 3 key_3 0.3334490 10 0.3011882 10 ## 4: 4 key_4 0.3960980 5 0.3650987 5 ## 5: 5 key_5 0.1753649 14 0.1469254 14 ## 6: 6 key_6 0.3510280 7 0.2567631 7 nrow(respdt) ## [1] 94 dplyr

dplyr can also implement the example.

dplyr_pipeline <- function(data, annotation) { res <- data %>% inner_join(annotation, by = "key") %>% filter(data <= info) %>% group_by(id) %>% arrange(-data) %>% mutate(rownum = row_number()) %>% ungroup() %>% filter(rownum == 1) %>% arrange(id) res } resd <- dplyr_pipeline(data, annotation) head(resd) ## # A tibble: 6 x 7 ## key id info key_group.x data key_group.y rownum ## ## 1 key_1 1 0.986 20 0.915 20 1 ## 2 key_2 2 0.586 8 0.560 8 1 ## 3 key_3 3 0.333 10 0.301 10 1 ## 4 key_4 4 0.396 5 0.365 5 1 ## 5 key_5 5 0.175 14 0.147 14 1 ## 6 key_6 6 0.351 7 0.257 7 1 nrow(resd) ## [1] 94

And we can use wrapr::execute_parallel to parallelize the dplyr solution.

parallel::clusterEvalQ(cl, library("dplyr")) ## [[1]] ## [1] "dplyr" "data.table" "rqdatatable" "rquery" "stats" ## [6] "graphics" "grDevices" "utils" "datasets" "methods" ## [11] "base" ## ## [[2]] ## [1] "dplyr" "data.table" "rqdatatable" "rquery" "stats" ## [6] "graphics" "grDevices" "utils" "datasets" "methods" ## [11] "base" ## ## [[3]] ## [1] "dplyr" "data.table" "rqdatatable" "rquery" "stats" ## [6] "graphics" "grDevices" "utils" "datasets" "methods" ## [11] "base" ## ## [[4]] ## [1] "dplyr" "data.table" "rqdatatable" "rquery" "stats" ## [6] "graphics" "grDevices" "utils" "datasets" "methods" ## [11] "base" parallel::clusterExport(cl, "dplyr_pipeline") dplyr_f <- function(tables_list) { data <- tables_list$data annotation <- tables_list$annotation dplyr_pipeline(data, annotation) } dplyr_parallel_f <- function(data, annotation) { respdt <- wrapr::execute_parallel( tables = list(data = data, annotation = annotation), f = dplyr_f, partition_column = "key_group", cl = cl) %>% dplyr::bind_rows() %>% arrange(id) } respdplyr <- dplyr_parallel_f(data, annotation) head(respdplyr) ## # A tibble: 6 x 7 ## key id info key_group.x data key_group.y rownum ## ## 1 key_1 1 0.986 20 0.915 20 1 ## 2 key_2 2 0.586 8 0.560 8 1 ## 3 key_3 3 0.333 10 0.301 10 1 ## 4 key_4 4 0.396 5 0.365 5 1 ## 5 key_5 5 0.175 14 0.147 14 1 ## 6 key_6 6 0.351 7 0.257 7 1 nrow(respdplyr) ## [1] 94 Benchmark

We can benchmark the various realizations.

dlist <- mk_example(300, 300) data <- dlist$instance_table annotation <- dlist$key_table timings <- microbenchmark( data_table_parallel = nrow(data_table_parallel_f(data, annotation)), data_table = nrow(data_table_f(data, annotation)), rqdatatable_parallel = nrow(ex_data_table_parallel(optree, "key_group", cl)), rqdatatable = nrow(ex_data_table(optree)), dplyr_parallel = nrow(dplyr_parallel_f(data, annotation)), dplyr = nrow(dplyr_pipeline(data, annotation)), times = 10L) saveRDS(timings, "Parallel_rqdatatable_timings.RDS") Conclusion print(timings) ## Unit: seconds ## expr min lq mean median uq ## data_table_parallel 5.274560 5.457105 5.609827 5.546554 5.686829 ## data_table 9.401677 9.496280 9.701807 9.541218 9.748159 ## rqdatatable_parallel 7.165216 7.497561 7.587663 7.563883 7.761987 ## rqdatatable 12.490469 12.700474 13.320480 12.898154 14.229233 ## dplyr_parallel 6.492262 6.572062 6.784865 6.787277 6.875076 ## dplyr 20.056555 20.450064 20.647073 20.564529 20.800350 ## max neval ## 6.265888 10 ## 10.419316 10 ## 7.949404 10 ## 14.282269 10 ## 7.328223 10 ## 21.332103 10 # autoplot(timings) timings <- as.data.frame(timings) timings$seconds <- timings$time/1e+9 ScatterBoxPlotH(timings, xvar = "seconds", yvar = "expr", title="task duration by method")

The benchmark timings show parallelized data.table is the fastest, followed by parallelized dplyr, and parallelized rqdatatable. In the non-paraellized case data.table is the fastest, followed by rqdatatable, and then dplyr.

A reason dplyr sees greater speedup relative to its own non-parallel implementation (yet does not beat data.table) is that data.table starts already multi-threaded, so data.table is exploiting some parallelism even before we added the process level parallelism (and hence sees less of a speed up, though it is fastest).

rquery pipelines exhibit superior performance on big data systems (Spark, PostgreSQL, Amazon Redshift, and hopefully soon Google bigquery), and rqdatatable supplies a very good in-memory implementation of the rquery system based on data.table. rquery also speeds up solution development by supplying higher order operators and early debugging features.

In this note we have demonstrated simple procedures to reliably parallelize any of rqdatatable, data.table, or dplyr.

Note: we did not include alternatives such as multidplyr or dtplyr in the timings, as they did not appear to work on this example.

Materials

The original rendering of this article can be found here, source code here, and raw timings here.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Dealing with heteroskedasticity; regression with robust standard errors using R

Sun, 07/08/2018 - 02:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

First of all, is it heteroskedasticity or heteroscedasticity? According to
McCulloch (1985),
heteroskedasticity is the proper spelling, because when transliterating Greek words, scientists
use the Latin letter k in place of the Greek letter κ (kappa). κ sometimes is transliterated as
the Latin letter c, but only when these words entered the English language through French, such
as scepter.

Now that this is out of the way, we can get to the meat of this blogpost (foreshadowing pun).
A random variable is said to be heteroskedastic, if its variance is not constant. For example,
the variability of expenditures may increase with income. Richer families may spend a similar
amount on groceries as poorer people, but some rich families will sometimes buy expensive
items such as lobster. The variability of expenditures for rich families is thus quite large.
However, the expenditures on food of poorer families, who cannot afford lobster, will not vary much.
Heteroskedasticity can also appear when data is clustered; for example, variability of
expenditures on food may vary from city to city, but is quite constant within a city.

To illustrate this, let’s first load all the packages needed for this blog post:

library(robustbase) library(tidyverse) library(sandwich) library(lmtest) library(modelr) library(broom)

First, let’s load and prepare the data:

data("education") education <- education %>% rename(residents = X1, per_capita_income = X2, young_residents = X3, per_capita_exp = Y, state = State) %>% mutate(region = case_when( Region == 1 ~ "northeast", Region == 2 ~ "northcenter", Region == 3 ~ "south", Region == 4 ~ "west" )) %>% select(-Region)

I will be using the education data sat from the {robustbase} package. I renamed some columns
and changed the values of the Region column. Now, let’s do a scatterplot of per capita expenditures
on per capita income:

ggplot(education, aes(per_capita_income, per_capita_exp)) + geom_point() + theme_dark()

It would seem that, as income increases, variability of expenditures increases too. Let’s look
at the same plot by region:

ggplot(education, aes(per_capita_income, per_capita_exp)) + geom_point() + facet_wrap(~region) + theme_dark()

I don’t think this shows much; it would seem that observations might be clustered, but there are
not enough observations to draw any conclusion from this plot (in any case, drawing conclusions
from plots only is dangerous).

Let’s first run a good ol’ linear regression:

lmfit <- lm(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = education) summary(lmfit) ## ## Call: ## lm(formula = per_capita_exp ~ region + residents + young_residents + ## per_capita_income, data = education) ## ## Residuals: ## Min 1Q Median 3Q Max ## -77.963 -25.499 -2.214 17.618 89.106 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -467.40283 142.57669 -3.278 0.002073 ** ## regionnortheast 15.72741 18.16260 0.866 0.391338 ## regionsouth 7.08742 17.29950 0.410 0.684068 ## regionwest 34.32416 17.49460 1.962 0.056258 . ## residents -0.03456 0.05319 -0.650 0.519325 ## young_residents 1.30146 0.35717 3.644 0.000719 *** ## per_capita_income 0.07204 0.01305 5.520 1.82e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 39.88 on 43 degrees of freedom ## Multiple R-squared: 0.6292, Adjusted R-squared: 0.5774 ## F-statistic: 12.16 on 6 and 43 DF, p-value: 6.025e-08

Let’s test for heteroskedasticity using the Breusch-Pagan test that you can find in the {lmtest}
package:

bptest(lmfit) ## ## studentized Breusch-Pagan test ## ## data: lmfit ## BP = 17.921, df = 6, p-value = 0.006432

This test shows that we can reject the null that the variance of the residuals is constant,
thus heteroskedacity is present. To get the correct standard errors, we can use the vcovHC()
function from the {sandwich} package (hence the choice for the header picture of this post):

lmfit %>% vcovHC() %>% diag() %>% sqrt() ## (Intercept) regionnortheast regionsouth regionwest ## 311.31088691 25.30778221 23.56106307 24.12258706 ## residents young_residents per_capita_income ## 0.09184368 0.68829667 0.02999882

By default vcovHC() estimates a heteroskedasticity consistent (HC) variance covariance
matrix for the parameters. There are several ways to estimate such a HC matrix, and by default
vcovHC() estimates the “HC3” one. You can refer to Zeileis (2004)
for more details.

We see that the standard errors are much larger than before! The intercept and regionwest variables
are not statistically significant anymore.

You can achieve the same in one single step:

coeftest(lmfit, vcov = vcovHC(lmfit)) ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -467.402827 311.310887 -1.5014 0.14056 ## regionnortheast 15.727405 25.307782 0.6214 0.53759 ## regionsouth 7.087424 23.561063 0.3008 0.76501 ## regionwest 34.324157 24.122587 1.4229 0.16198 ## residents -0.034558 0.091844 -0.3763 0.70857 ## young_residents 1.301458 0.688297 1.8908 0.06540 . ## per_capita_income 0.072036 0.029999 2.4013 0.02073 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It’s is also easy to change the estimation method for the variance-covariance matrix:

coeftest(lmfit, vcov = vcovHC(lmfit, type = "HC0")) ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -467.402827 172.577569 -2.7084 0.009666 ** ## regionnortheast 15.727405 20.488148 0.7676 0.446899 ## regionsouth 7.087424 17.755889 0.3992 0.691752 ## regionwest 34.324157 19.308578 1.7777 0.082532 . ## residents -0.034558 0.054145 -0.6382 0.526703 ## young_residents 1.301458 0.387743 3.3565 0.001659 ** ## per_capita_income 0.072036 0.016638 4.3296 8.773e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As I wrote above, by default, the type argument is equal to “HC3”.

Another way of dealing with heteroskedasticity is to use the lmrob() function from the
{robustbase} package. This package is quite interesting, and offers quite a lot of functions
for robust linear, and nonlinear, regression models. Running a robust linear regression
is just the same as with lm():

lmrobfit <- lmrob(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = education) summary(lmrobfit) ## ## Call: ## lmrob(formula = per_capita_exp ~ region + residents + young_residents + per_capita_income, ## data = education) ## \--> method = "MM" ## Residuals: ## Min 1Q Median 3Q Max ## -57.074 -14.803 -0.853 24.154 174.279 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -156.37169 132.73828 -1.178 0.24526 ## regionnortheast 20.64576 26.45378 0.780 0.43940 ## regionsouth 10.79694 29.42747 0.367 0.71549 ## regionwest 45.22589 33.07950 1.367 0.17867 ## residents 0.03406 0.04412 0.772 0.44435 ## young_residents 0.57896 0.25512 2.269 0.02832 * ## per_capita_income 0.04328 0.01442 3.000 0.00447 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Robust residual standard error: 26.4 ## Multiple R-squared: 0.6235, Adjusted R-squared: 0.571 ## Convergence in 24 IRWLS iterations ## ## Robustness weights: ## observation 50 is an outlier with |weight| = 0 ( < 0.002); ## 7 weights are ~= 1. The remaining 42 ones are summarized as ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.05827 0.85200 0.93870 0.85250 0.98700 0.99790 ## Algorithmic parameters: ## tuning.chi bb tuning.psi refine.tol ## 1.548e+00 5.000e-01 4.685e+00 1.000e-07 ## rel.tol scale.tol solve.tol eps.outlier ## 1.000e-07 1.000e-10 1.000e-07 2.000e-03 ## eps.x warn.limit.reject warn.limit.meanrw ## 1.071e-08 5.000e-01 5.000e-01 ## nResample max.it best.r.s k.fast.s k.max ## 500 50 2 1 200 ## maxit.scale trace.lev mts compute.rd fast.s.large.n ## 200 0 1000 0 2000 ## psi subsampling cov ## "bisquare" "nonsingular" ".vcov.avar1" ## compute.outlier.stats ## "SM" ## seed : int(0)

This however, gives you different estimates than when fitting a linear regression model.
The estimates should be the same, only the standard errors should be different. This is because
the estimation method is different, and is also robust to outliers (at least that’s my understanding,
I haven’t read the theoretical papers behind the package yet).

Finally, it is also possible to bootstrap the standard errors. For this I will use the
bootstrap() function from the {modelr} package:

resamples <- 100 boot_education <- education %>% modelr::bootstrap(resamples)

Let’s take a look at the boot_education object:

boot_education ## # A tibble: 100 x 2 ## strap .id ## ## 1 001 ## 2 002 ## 3 003 ## 4 004 ## 5 005 ## 6 006 ## 7 007 ## 8 008 ## 9 009 ## 10 010 ## # ... with 90 more rows

The column strap contains resamples of the original data. I will run my linear regression
from before on each of the resamples:

( boot_lin_reg <- boot_education %>% mutate(regressions = map(strap, ~lm(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = .))) ) ## # A tibble: 100 x 3 ## strap .id regressions ## ## 1 001 ## 2 002 ## 3 003 ## 4 004 ## 5 005 ## 6 006 ## 7 007 ## 8 008 ## 9 009 ## 10 010 ## # ... with 90 more rows

I have added a new column called regressions which contains the linear regressions on each
bootstrapped sample. Now, I will create a list of tidied regression results:

( tidied <- boot_lin_reg %>% mutate(tidy_lm = map(regressions, broom::tidy)) ) ## # A tibble: 100 x 4 ## strap .id regressions tidy_lm ## ## 1 001 ## 2 002 ## 3 003 ## 4 004 ## 5 005 ## 6 006 ## 7 007 ## 8 008 ## 9 009 ## 10 010 ## # ... with 90 more rows

broom::tidy() creates a data frame of the regression results. Let’s look at one of these:

tidied$tidy_lm[[1]] ## term estimate std.error statistic p.value ## 1 (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04 ## 2 regionnortheast -11.75706535 18.72014312 -0.6280436 5.332970e-01 ## 3 regionsouth -8.89139412 15.51588707 -0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents -0.08921914 0.05557157 -1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09 This format is easier to handle than the standard lm() output: tidied$regressions[[1]] ## ## Call: ## lm(formula = per_capita_exp ~ region + residents + young_residents + ## per_capita_income, data = .) ## ## Coefficients: ## (Intercept) regionnortheast regionsouth ## -515.19836 -11.75707 -8.89139 ## regionwest residents young_residents ## 25.96218 -0.08922 1.41204 ## per_capita_income ## 0.08609

Now that I have all these regression results, I can compute any statistic I need. But first,
let’s transform the data even further:

list_mods <- tidied %>% pull(tidy_lm)

list_mods is a list of the tidy_lm data frames. I now add an index and
bind the rows together (by using map2_df() instead of map2()):

mods_df <- map2_df(list_mods, seq(1, resamples), ~mutate(.x, resample = .y))

Let’s take a look at the final object:

head(mods_df, 25) ## term estimate std.error statistic p.value ## 1 (Intercept) -515.19835839 129.61828003 -3.9747353 2.648477e-04 ## 2 regionnortheast -11.75706535 18.72014312 -0.6280436 5.332970e-01 ## 3 regionsouth -8.89139412 15.51588707 -0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents -0.08921914 0.05557157 -1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09 ## 8 (Intercept) -526.78466908 142.64004523 -3.6931050 6.207704e-04 ## 9 regionnortheast 3.35909252 21.97165783 0.1528830 8.792057e-01 ## 10 regionsouth 2.62845462 17.20973263 0.1527307 8.793251e-01 ## 11 regionwest 26.40527188 20.50214280 1.2879274 2.046593e-01 ## 12 residents -0.04834920 0.05722962 -0.8448282 4.028830e-01 ## 13 young_residents 1.49618012 0.37091593 4.0337445 2.208994e-04 ## 14 per_capita_income 0.07489435 0.01179888 6.3475800 1.140844e-07 ## 15 (Intercept) -466.17160566 130.18954610 -3.5807146 8.659014e-04 ## 16 regionnortheast -6.75977787 17.24994600 -0.3918724 6.970880e-01 ## 17 regionsouth 6.31061828 16.27953099 0.3876413 7.001938e-01 ## 18 regionwest 27.89261897 15.43852463 1.8066894 7.781178e-02 ## 19 residents -0.08760677 0.05007780 -1.7494134 8.735434e-02 ## 20 young_residents 1.23763741 0.32965208 3.7543746 5.168492e-04 ## 21 per_capita_income 0.08469609 0.01236601 6.8491057 2.128982e-08 ## 22 (Intercept) -316.44265722 166.96769979 -1.8952328 6.479692e-02 ## 23 regionnortheast 11.54907449 14.95746934 0.7721276 4.442620e-01 ## 24 regionsouth 9.25759229 16.89974638 0.5477947 5.866654e-01 ## 25 regionwest 31.83905855 16.56287888 1.9223143 6.120738e-02 ## resample ## 1 1 ## 2 1 ## 3 1 ## 4 1 ## 5 1 ## 6 1 ## 7 1 ## 8 2 ## 9 2 ## 10 2 ## 11 2 ## 12 2 ## 13 2 ## 14 2 ## 15 3 ## 16 3 ## 17 3 ## 18 3 ## 19 3 ## 20 3 ## 21 3 ## 22 4 ## 23 4 ## 24 4 ## 25 4

Now this is a very useful format, because I now can group by the term column and compute any
statistics I need, in the present case the standard deviation:

( r.std.error <- mods_df %>% group_by(term) %>% summarise(r.std.error = sd(estimate)) ) ## # A tibble: 7 x 2 ## term r.std.error ## ## 1 (Intercept) 226. ## 2 per_capita_income 0.0211 ## 3 regionnortheast 19.7 ## 4 regionsouth 19.1 ## 5 regionwest 18.7 ## 6 residents 0.0629 ## 7 young_residents 0.509

We can append this column to the linear regression model result:

lmfit %>% broom::tidy() %>% full_join(r.std.error) %>% select(term, estimate, std.error, r.std.error) ## Joining, by = "term" ## term estimate std.error r.std.error ## 1 (Intercept) -467.40282655 142.57668653 226.08735720 ## 2 regionnortheast 15.72740519 18.16259519 19.65120904 ## 3 regionsouth 7.08742365 17.29949951 19.05153934 ## 4 regionwest 34.32415663 17.49459866 18.72767470 ## 5 residents -0.03455770 0.05318811 0.06285984 ## 6 young_residents 1.30145825 0.35716636 0.50928879 ## 7 per_capita_income 0.07203552 0.01305066 0.02109277

As you see, using the whole bootstrapping procedure is longer than simply using either one of
the first two methods. However, this procedure is very flexible and can thus be adapted to a very
large range of situations. Either way, in the case of heteroskedasticity, you can see that
results vary a lot depending on the procedure you use, so I would advise to use them all as
robustness tests and discuss the differences.

If you found this blog post useful, you might want to follow me on twitter
for blog post updates.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...