rOpenSci’s drake package
(This article was first published on R – Insights of a PhD, and kindly contributed to Rbloggers)
If you don’t know rOpenSci, I recommend checking them out: they write a lot of really good packages for R. A relatively new one seems to be drake. I’ve not played with it yet, but it looks very useful for indicating which parts of an analysis are affected by changes, and rerunning only those parts to speed up redoing the analysis (envisage the overlays that some version control systems or Dropbox use to show the status of files, although it’s more sophisticated than that). knitr has caching, which goes some way towards handling this, but drake shows where the outdated parts sit in the scope of the entire analysis…
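As a rough illustration of the idea (a hypothetical sketch, not from the post — the file names and targets are made up): drake works from a plan of named targets, detects which targets are out of date, and rebuilds only those.

```r
library(drake)

# A hypothetical plan: each target is rebuilt only when its
# dependencies (code, files, or upstream targets) change.
plan <- drake_plan(
  raw    = read.csv(file_in("data.csv")),
  model  = lm(y ~ x, data = raw),
  report = summary(model)
)

make(plan)  # first run builds everything

# after editing code or data, only the stale targets are listed...
config <- drake_config(plan)
outdated(config)

make(plan)  # ...and only those are rebuilt
```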
It looks like a super tool!
To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.
Call for submissions open!
(This article was first published on MilanoR, and kindly contributed to Rbloggers)
Experienced and wannabe speakers; R professionals and enthusiasts; people with good ideas and will to share them: this is who we are looking for. For what? For our Call for Submissions!
Here is the form —> https://goo.gl/forms/EURRqMWuvuzccuYu2
You have time until July 31st, so be sure to submit your ideas in time!
What’s MilanoR?

MilanoR is a user group dedicated to bringing together local R users. Our aim is to exchange knowledge, learn and share tricks and techniques, and provide R beginners with an opportunity to meet more experienced users. We often organize events, such as meetings, workshops and Rlabs (aka R learnathons — we’ll look more into this below!), with both local and international speakers. If you want an overview of what we did recently, take a look at milanor.net or read this article about our last year of events.
Who can contribute?

The answer is simple: everyone! Well, everyone with some knowledge of R and something to share.
If you come from abroad and need financial support, note it in the Notes space at the bottom of the form: we can’t promise anything, but we’ll do our best to find funding.
What kind of submissions are we looking for? Talks
A talk is usually around 20–45 minutes and involves the use of a presentation. You can give a tutorial, or explain your new discovery, idea, widget or whatever you have in mind! You have to specify whether it’s going to be in English or in Italian – either is fine. Here‘s an example of what a talk presentation looks like.
 Lightning talks
Lightning talks are just like talks, but faster: only 5 minutes, with 15 slides of 20–30 seconds each. Ideal for someone who doesn’t have much time to prepare a whole 20-minute talk, or who feels at ease with short timings. Here‘s a presentation to give you an idea of what we’re talking about.
 Rlabs
Rlabs are our proud venture, and they are going very well. During an Rlab, a group of R lovers works together to solve a problem. Some examples: the Comune of Milano asked for our help to make the municipality budget accessible to citizens, and we built a Shiny app for that (here it is!). We also programmed a representation of earthquake evolution for EarthCloud and, along with INGM (Istituto Nazionale Genetica Molecolare), a dashboard to recognize risks of various illnesses using personal data. We see a problem and we find a solution together. If you have an issue with R and you want to work on it together, then suggest an Rlab! An Rlab can last 3 hours in the evening or an entire day on a Saturday.
 Blog articles
These are usually tutorials on a wide variety of R subjects. They’ll also be published on the R-bloggers platform, so be ready for the spotlight! Here‘s an example.
 Workshop
Usually up to 2 hours: a tutorial for 20–50 (or more) people on a beginner or advanced R topic. This is our last workshop, which was a huge success!
Submit your abstracts now, and we’ll get in touch with you very soon!
The post Call for submissions open! appeared first on MilanoR.
R community update: announcing useR Delhi July meetup
(This article was first published on R – Tech and Mortals, and kindly contributed to Rbloggers)
In my last post, I pointed out the importance of a local community and described how the Delhi NCR useR group came about. It’s been a month since we started the group, and I’m excited to announce that our first meetup is scheduled for 14th July 2018, i.e. the upcoming Saturday. The venue for the event is the SocialCops headquarters in Saket, New Delhi. As a budding community, support from established organizations is a big boost, and we’re grateful to SocialCops for coming onboard as mentors and venue partners.
Event details can be found on Meetup, and you can RSVP here.
In addition to providing an opportunity to interact with fellow community members, the meetup will also feature sessions by members of the R community. For this meetup, we have two exciting speakers lined up.
Our first session will be an AMA (Ask Me Anything) with Prof. Deepayan Sarkar of the Indian Statistical Institute, Delhi. He’s one of the 20 R Core members and the author of the lattice package.
Our second session will be on “Satellite image classification in R” by Shilpa Arora of SocialCops. She’s an experienced data scientist who has worked on problems from various industry domains, including GIS and image classification.
Satellite Image Classification by Shilpa Arora

If you’re in the region, attend the event by registering on the meetup page or here.
We’re a community in its infancy and could use all the help we can get. If you know of someone who may be interested in speaking at the upcoming event or in the future, please let us know via mail, Twitter, Facebook or Meetup.
Also, please share and retweet our event and/or this blog. We would be grateful.
Thanks for being an amazing community and have a great day, #rstats!
How to read Stata DTA files into R
(This article was first published on R – Displayr, and kindly contributed to Rbloggers)
The file contains the 2017 face-to-face post-election survey responses (from the British Election Study) along with explanatory notes. Read the Stata DTA file into R with these two lines:
```r
library(haven)
df <- read_dta("http://www.britishelectionstudy.com/wp-content/uploads/2018/01/bes_f2f_2017_v1.2.dta")
```

The data set is now stored as a data frame df with 357 variables. To check the properties of the data set we type:
```r
str(df)
```

This gives the following output:

```
Classes 'tbl_df', 'tbl' and 'data.frame': 2194 obs. of 357 variables:
 $ finalserialno: atomic  10115 10119 10125 10215 10216 ...
  ..- attr(*, "label")= chr "Final Serial Number"
  ..- attr(*, "format.stata")= chr "%12.0g"
 $ serial       : atomic  000000399 000000398 000000400 000000347 ...
  ..- attr(*, "label")= chr "Respondent Serial Number"
  ..- attr(*, "format.stata")= chr "%9s"
 $ a01          : atomic  nhs brexit society immigration ...
  ..- attr(*, "label")= chr "A1: Most important issue"
  ..- attr(*, "format.stata")= chr "%240s"
 $ a02          : Class 'labelled'  atomic [1:2194] 1 0 1 1 1 1 2 1 2 2 ...
  .. ..- attr(*, "label")= chr "Best party on most important issue"
  .. ..- attr(*, "format.stata")= chr "%8.0g"
  .. ..- attr(*, "labels")= Named num [1:13] NA NA NA 0 1 2 3 4 5 6 ...
  .. .. ..- attr(*, "names")= chr [1:13] "Not stated" "Refused" "Don't know" "None/No party" ...
```

The above output shows that the variables are already set to the correct types. The first variable, finalserialno, is numeric (i.e., atomic), the third variable, a01, is character, and the fourth variable, a02, has a class of 'labelled', which can be converted to a factor or categorical variable (after we handle missing values).
Each variable has an associated label attribute to help with interpretation. For example, without having to look up the explanatory notes, we can see that variable a01 contains the responses to the question “A1: most important issue” and variable a02 contains the responses to “Best party on most important issue”.
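The label is stored as an ordinary R attribute, so it can be read off a column with attr(). The snippet below mimics this on a toy vector (the data and label are made up) since the mechanism is the same for a real haven column:

```r
# haven stores the question text in a "label" attribute on each column;
# we mimic that here on a toy vector (hypothetical data)
x <- c(1, 0, 2)
attr(x, "label") <- "Best party on most important issue"

# with the real data this would be attr(df$a02, "label")
attr(x, "label")
# [1] "Best party on most important issue"
```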
Stata supports multiple types of missing values. read_dta automatically handles missing values in numeric and character variables. For categorical variables, missing values are typically encoded by negative numbers. Section 5.3 of the explanatory notes describes the encoding for this file: -1 (Don’t know), -2 (Refused) and -999 (Not stated). We now convert all three of these values to NA.
```r
for (i in seq_along(df)) {
  if (inherits(df[[i]], "labelled"))
    df[[i]][df[[i]] < 0] <- NA
}
```

Encoding categorical variables

The categorical variables of class “labelled” are stored as numeric vectors. Convert them into factors, so they are correctly associated with their labels, with only a single command:
```r
df <- as_factor(df)
```

Note that we do this after converting the missing values, to avoid spurious factor levels in the final dataset.
Find out more

You can also find out how to import and read Excel files into Displayr.
Optimize for Expected Profit in Lift Models
(This article was first published on Sweiss' Blog, and kindly contributed to Rbloggers)
hete is the wild west of analytics. no wrong answers yet – Bill Lattner
Intro

Lift models are generally evaluated with the Qini coefficient. However, this metric is an indirect measure of what the user usually wants to achieve: profit. In this post I’ll discuss a measure of profitability in lift models. A neural network can optimize this cost function directly, and a comparison of this method with a causal tree implementation shows promising results.
Current lift models

The goal of lift modeling (or heterogeneous treatment effect estimation) is to assign an optimal action to a particular observation. This can be, among other things, targeted advertisement or price discrimination. There are several different formulations of lift models, including flip models, causal trees, etc.
Essentially, all of them try to estimate the interaction between covariates and the treatment. The simplest parameterized model is as follows, where:

\(y\): response
\(x\): covariates
\(t\): randomly assigned treatment variable. Assumed to be binary here.
\(y = \beta_{1}x + \beta_{2}t + \beta_{3}tx\)
Here the treatment effect for a particular observation is \(\beta_{2} + \beta_{3}x\). The key point is that the treatment effect is now a function of \(x\), which means we can see which observations are more or less affected than others.
Alternatively, one can estimate a machine learning model \(y = f(t,x)\), and the treatment effect can be calculated as \(\hat{f}(t=1,x) – \hat{f}(t=0,x)\).
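As a rough sketch of that second approach (simulated data, with plain lm() standing in for any machine learning model):

```r
# Two-model sketch: fit the response separately on treated and control
# rows, then difference the predictions to get a per-observation
# treatment effect. All data here are simulated for illustration.
set.seed(1)
n <- 2000
x <- runif(n)
t <- rbinom(n, 1, 0.5)
y <- x + 0.5 * t * x + rnorm(n, sd = 0.1)  # true effect is 0.5 * x

fit_treated <- lm(y ~ x, data = data.frame(y = y[t == 1], x = x[t == 1]))
fit_control <- lm(y ~ x, data = data.frame(y = y[t == 0], x = x[t == 0]))

# estimated treatment effect: f(t = 1, x) - f(t = 0, x)
tau_hat <- predict(fit_treated, newdata = data.frame(x = x)) -
           predict(fit_control, newdata = data.frame(x = x))
```

With this simulated data, tau_hat tracks the true effect 0.5 * x closely, but the issues listed below explain why this naive recipe often fails in practice.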
However, there are a few issues with estimating this accurately, which has spawned several dedicated methods.

The heterogeneous treatment effect is probably very small. Indeed, one may find that the treatment effect is so small relative to the main effect that it never gets split on in normal random forests / GBTs! This means we need a model that focuses particularly on the treatment effect if we want the results to be useful.

We can have potentially hundreds of covariates available to interact with the treatment. We have no idea a priori which covariates are important, so we usually have to expand the parameter space significantly. Also, we don’t necessarily know the functional form, so we may need to account for nonlinear transformations of the data.

We can’t compare the fitted treatment effect values to the ‘truth’. This is the biggest difference between lift models and normal prediction. In prediction settings we can set aside a test set and evaluate our predictions on known responses. Here we want to know the counterfactual: what would have happened if we had given an observation a different treatment? This leads us to the fundamental problem of causality.
So to sum up: we’re trying to estimate a very noisy signal, distributed among a large number of covariates, and we can’t check our estimated model directly.
Metrics to use

OK, the third point might be a bit of a stretch, because there are metrics people can use to assess the performance of their model. One is the Qini coefficient.
The Qini coefficient ranks observations by their predicted treatment effect and then compares outcomes between the treatment groups within that ranking. If the cumulative difference between the groups is large, that is a good indication the model ranks well.
One particular drawback of the Qini coefficient is that it focuses only on the rank ordering of the predicted values. This can be an adequate measure when you care about ranking, but there are limitations, the most important being that it does not give an obvious cutoff between those affected and those not. In the next section I will describe the problem in terms of costs and benefits and suggest a different metric.
A new kind of metric

Suppose we have \(1 \ldots K\) options for the treatment, and the goal is to assign the most profitable treatment to each observation. We have a model that assigns an optimal treatment to observation \(i\): \(optim_{i,k}\), which takes the value 1 if \(k\) is best for observation \(i\) and 0 otherwise. The expected profit is:
\(\text{ExpectedProfit} = \sum_{i=1}^{N}\sum_{k=1}^{K}{I(t_i = optim_{i,k})\, y_{i}(x, t = optim_{i,k})}\)
The problem is that we cannot assign a new treatment to our training data and observe the outcome; we can only look at the values we have. Instead, I will ‘simulate’ an experiment the following way:
\(\text{AverageProfit} = \frac{\sum_{i=1}^{N}\sum_{k=1}^{K}{I(optim_{i,k} = t_{i,k})\, y_{i}}}{\sum_{i=1}^{N}\sum_{k=1}^{K}{I(optim_{i,k} = t_{i,k})}}\)
Basically, this says that if the optimal treatment equals the randomly assigned treatment, then we include that observation in our hypothetical model. Since I am assuming the treatment is randomly assigned, I think this gives a good indicator of the average increase for each observation under the new model’s decisions. To test this out, I compared the expected gain with the actual gain on a simulated dataset. Below is the scatterplot, and it appears to be a pretty good metric for the actual results.
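In R, the AverageProfit estimator above can be sketched as follows (toy numbers for a binary treatment, not the simulation from the post):

```r
# AverageProfit: average the observed outcome over the rows where the
# random assignment happened to match the model's recommendation.
average_profit <- function(y, t, optim) {
  match <- t == optim
  sum(y[match]) / sum(match)
}

y     <- c(10, 0, 5, 8)  # observed outcomes (made up)
t     <- c(1, 0, 1, 0)   # randomly assigned treatments
optim <- c(1, 1, 0, 0)   # model's recommended treatments

average_profit(y, t, optim)  # (10 + 8) / 2 = 9
```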
Model this metric directly?

If we have a measure that accurately captures what we want, can we use it as a loss function? This should lead to a better model, since the model will optimize over the metric of interest.
We cannot optimize this directly because the indicator function is non-differentiable. Instead, we replace the \(I()\) function with a probability and optimize the custom loss below. – Thanks to Matt Becker for this insight.
\(\text{CustomLoss} = -\sum_{i=1}^{N}\sum_{k=1}^{K}{\hat{p}_{i,k}(x)\, I(t_i = k)\, y_{i}}\)
This essentially answers the question: what is the probability that action \(k\) is best for observation \(i\)?
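In plain R, the loss for a single batch could be sketched like this (toy probabilities and outcomes, not the author's Keras prototype):

```r
# Custom loss sketch: p[i, k] is the predicted probability that action k
# is best for observation i; t gives the assigned action, y the outcome.
# The loss is the negative sum of p[i, t_i] * y_i.
custom_loss <- function(p, t, y) {
  -sum(p[cbind(seq_along(t), t)] * y)
}

p <- matrix(c(0.8, 0.2,
              0.3, 0.7), nrow = 2, byrow = TRUE)

custom_loss(p, t = c(1, 2), y = c(1, 1))  # -(0.8 + 0.7) = -1.5
```

Minimizing this pushes probability mass toward the treatments that coincided with high observed outcomes.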
Using Keras I hacked together a prototype.
Experiment

To test this new model (I’m calling it hete_optim), I simulated datasets similar to those found in this paper. There are 200 variables, with a 15,000-observation training set and a 3,000-observation test set. There are 8 different scenarios using several nonlinear functions of the explanatory variables. One major difference is that I binarized the response. In addition, I increased the random noise and decreased the relative treatment effect to get a more ‘realistic’ setting.
I’m comparing this model with hete, which my former colleague Bill Lattner developed, and with grf.
Results

The metric I’m comparing is profit with the fitted model divided by profit with the true model. So if a model gets a score of 0.75, that means it captures 75% of the potential gain from using a lift model. This is a sort of normalized regret score, so we can aggregate results across the 8 scenarios. Below is a boxplot of scores by model type.
This method performs best on 7 out of 8 datasets. On average, hete_optim captures ~99.8% of the gains, while hete captures ~98.9% and grf ~99.5%. This suggests that this method might be on to something.
Conclusion

This post described a method to estimate the profits of a lift model in a real-world setting and to optimize them directly. It shows promising results compared with existing techniques.
How the odds ratio confounds: a brief study in a few colorful figures
(This article was first published on ouR data generation, and kindly contributed to Rbloggers)
The odds ratio always confounds: while it may be constant across different groups or clusters, the risk ratios or risk differences across those groups may vary quite substantially. This makes it really hard to interpret an effect. And then there is the inconsistency between marginal and conditional odds ratios, a topic I seem to visit frequently, most recently last month.
My aim here is to generate a few figures that might highlight some of these issues.
Assume that there is some exposure (indicated by the use of a \(1\) or \(0\) subscript) applied across a number of different groups or clusters of people (think different regions, hospitals, schools, etc.) – indicated by some number or letter \(i\). Furthermore, assume that the total number exposed at each location is the same as the number unexposed: \(N_{i0} = N_{i1} = N = 100\).
The number of folks with exposure at a particular location \(i\) who have a poor outcome is \(n_{i1}\), and the number with a good outcome is \(N - n_{i1}\). Likewise, the corresponding measures for folks not exposed are \(n_{i0}\) and \(N - n_{i0}\). The probabilities of a poor outcome for the exposed and non-exposed are \(n_{i1}/N\) and \(n_{i0}/N\). The relative risk of a poor outcome for those exposed compared to those non-exposed is
\[\text{RR}_i = \frac{n_{i1}/N}{n_{i0}/N} = \frac{n_{i1}}{n_{i0}},\]
the risk difference between exposed and unexposed groups is
\[ \text{RD}_i = \frac{n_{i1}}{N} - \frac{n_{i0}}{N} = \frac{n_{i1} - n_{i0}}{N},\]
and the odds ratio is
\[ \text{OR}_i = \frac{[n_{i1}/N] \, / \, [(N - n_{i1})/N]}{[n_{i0}/N] \, / \, [(N - n_{i0})/N]} = \frac{n_{i1}(N - n_{i0})}{n_{i0}(N - n_{i1})}.\]
The simple conditional logistic regression model that includes a group-level random effect \(b_i\) assumes a constant odds ratio between exposed and unexposed individuals across the different clusters:
\[\text{logit} (Y_{ij}) = \beta_0 + \beta_1 E_{ij} + b_i,\]
where \(E_{ij}\) is an exposure indicator for person \(j\) in group \(i\). The parameter \(\text{exp}(\beta_1)\) is an estimate of the odds ratio defined above.
The point of all of this is to illustrate that although the odds ratio is the same across all groups/clusters (i.e., there is no \(i\) subscript in \(\beta_1\), and \(\text{OR}_i = \text{OR}\)), the risk ratios and risk differences can vary greatly across groups, particularly if the \(b_i\)’s vary considerably.
Constant odds ratio, different risk ratios and differences

If the odds ratio is constant and we know \(n_{i1}\), a little algebraic maneuvering on the \(\text{OR}\) formula above gives us \(n_{i0}\):
\[ n_{i0} = \frac{N \times n_{i1}}{\text{OR} \times (N – n_{i1}) + n_{i1}}\]
If we assume that the \(n_{i1}\)’s can range from 2 to 98 (out of 100), we can see how the risk ratios and risk differences vary considerably even though we fix the odds ratio at a value of 3 (don’t pay too close attention to the fact that \(n_{i0}\) is not an integer – this is just an illustration; if I had used \(N=1000\), we could have called it rounding error):
```r
library(data.table)

N <- 100
trueOddsRatio <- 3

n1 <- seq(2, 98)
n0 <- (N * n1) / (trueOddsRatio * (N - n1) + n1)

oddsRatio <- (n1 / (N - n1)) / (n0 / (N - n0))
riskRatio <- n1 / n0
riskDiff  <- (n1 - n0) / N

dn <- data.table(n1 = as.double(n1), n0, oddsRatio, riskRatio,
                 riskDiff = round(riskDiff, 3))
dn[1:6]
```

```
##    n1        n0 oddsRatio riskRatio riskDiff
## 1:  2 0.6756757         3      2.96    0.013
## 2:  3 1.0204082         3      2.94    0.020
## 3:  4 1.3698630         3      2.92    0.026
## 4:  5 1.7241379         3      2.90    0.033
## 5:  6 2.0833333         3      2.88    0.039
## 6:  7 2.4475524         3      2.86    0.046
```

With a constant odds ratio of 3, the risk ratios range from 1 to 3, and the risk differences range from almost 0 to just below 0.3. The odds ratio alone is not exactly informative with respect to these other two measures. The plots – two takes on the same data – tell a better story:
Another look at contrasting marginal vs conditional odds ratios

Using this same simple framework, I thought I’d see if I can illustrate the relationship between marginal and conditional odds ratios.
In this case, we have two groups/clusters where the conditional odds ratios are equivalent, yet when we combine the groups into a single entity, the combined (marginal) odds ratio is less than each of the conditional odds ratios.
In this scenario each cluster has 100 people who are exposed and 100 who are not, as before. \(a_1\) and \(a_0\) represent the number of folks with a poor outcome for the exposed and unexposed in the first cluster, respectively; \(b_1\) and \(b_0\) represent the analogous quantities in the second cluster. As before \(a_0\) and \(b_0\) are derived as a function of \(a_1\) and \(b_1\), respectively, and the constant odds ratio.
```r
constantOR <- function(n1, N, OR) {
  return(N * n1 / (OR * (N - n1) + n1))
}

# Cluster A
a1 <- 55
a0 <- constantOR(a1, N = 100, OR = 3)
(a1 * (100 - a0)) / (a0 * (100 - a1))
## [1] 3

# Cluster B
b1 <- 35
b0 <- constantOR(b1, N = 100, OR = 3)
(b1 * (100 - b0)) / (b0 * (100 - b1))
## [1] 3

# Marginal OR
tot0 <- a0 + b0
tot1 <- a1 + b1
(tot1 * (200 - tot0)) / (tot0 * (200 - tot1))
## [1] 2.886952
```

For this example, the marginal odds ratio is less than the conditional odds ratio. How does this contrast between the marginal and conditional odds ratios play out over a range of possible outcomes – all meeting the requirement of a constant conditional odds ratio? (Note we are talking about odds ratios larger than 1; everything is flipped if the OR is < 1.) The plot below shows possible combinations of the sums \(a_1 + b_1\) and \(a_0 + b_0\), where the constant conditional odds ratio condition holds within each group. The red line shows all points where the marginal odds ratio equals the conditional odds ratio (which happens to be 3 in this case):
Here is the same plot, but with a yellow line drawn through all cases where \(a_1 = b_1\) (hence \(a_0 = b_0\)). This line lies directly over the earlier line where the marginal odds ratio equals 3. So, a sort of proof by plotting: the marginal odds ratio appears to equal the conditional odds ratio when the proportions in each group are equal.
But are the marginal odds ratios not on the colored lines higher or lower than 3? To check this, look at the next figure. In this plot, the odds ratio is plotted as a function of \(a_1 + b_1\), which represents the total number of poor outcomes in the combined exposed groups. Each line represents the marginal odds ratio for a specific value of \(a_1\).
If you notice, the odds ratio reaches the constant conditional odds ratio (which is 3) only when \(a_1 + b_1 = 2a_1\), or when \(a_1 = b_1\). It appears then, that when \(a_1 \ne b_1\), the marginal odds ratio lies below the conditional odds ratio. Another “proof” by figure. OK, not a proof, but colorful nonetheless.
A Simple Spatial Equilibrium Model with Tidycensus
(This article was first published on Ignacio Sarmiento Barbieri, and kindly contributed to Rbloggers)
The spatial equilibrium concept is well known to urban economists. In a nutshell, it states that in equilibrium there are no rents to be gained by changing locations. Ed Glaeser begins Chapter 2 of his book “Cities, Agglomeration, and Spatial Equilibrium” with the well-known Alonso-Muth-Mills model. In this post, I want to summarize it briefly, following Ed Glaeser’s presentation, and reproduce his figures 2.1 and 2.2. This is the perfect excuse to play around with the “tidycensus” package.
Spatial equilibrium in the Alonso-Muth-Mills model

In its simplest form, the model assumes a monocentric closed city. Furthermore, we assume that there are $N$ city dwellers who maximize their utility, which depends on consumption C and L units of land
\underset{C}{\mathrm{max}\,}U(C,L)
City dwellers use L units of land, consume C, and commute to the city center. Consumption equals income minus transport costs and rental costs. City dwellers’ income is an exogenous wage w. Commuting costs are a function of distance, and for simplicity we assume they are linear: t(d) = td. Total rental costs are r(d)L, where r(d) is the rental cost per unit of land. Thus,
C = w − td − r(d)L
To get to the spatial equilibrium, we first rewrite the utility maximization problem in terms of distance:
\underset{d}{\mathrm{max}}\,U( w - td - r(d) L, L)
Then, the first order conditions imply
\frac{\partial U}{\partial C}\left( -t - \frac{\partial r}{\partial d} L \right)=0
r'(d) = - \frac{t}{L}
The equilibrium condition states that rents must decline exactly to offset the increase in transportation cost. The simple assumption that commuting costs are linear implies that the rent gradient is linear:
r(d)=r(0) - \frac{td}{L}
The key prediction of the model is that land rents will decline with distance to the city center.
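As a quick numerical sketch of this prediction (the parameter values r0, t and L below are hypothetical, chosen purely for illustration):

```r
# Linear rent gradient r(d) = r(0) - t*d/L from the model above;
# r0, t and L are hypothetical values chosen for illustration
rent <- function(d, r0 = 20, t = 2, L = 1) {
  r0 - t * d / L
}

rent(0)  # rent at the city center: 20
rent(5)  # five miles out: 10
```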
R code

Let’s take the model to the data and reproduce figures 2.1 and 2.2 of “Cities, Agglomeration, and Spatial Equilibrium”. The focus is on two cities, Chicago and Boston, chosen because they differ in how easy it is to access their city centers: Chicago is fairly easy, Boston is more complicated. Our model then implies that the gradients should reflect the differential costs of accessing the city centers.
So let’s begin. The first step is to get some data; to do so, I’m going to use the “tidycensus” package. This package allows me to get data from the Census website using its API. We are also going to need the help of three other packages: “sf” to handle spatial data, “dplyr”, my go-to package to wrangle data, and “ggplot2” to plot the results.
```r
require("tidycensus", quietly = TRUE)
require("sf", quietly = TRUE)
require("dplyr", quietly = TRUE)
require("ggplot2", quietly = TRUE)
```

In order to get access to the Census API, I need to supply a key, which can be obtained from http://api.census.gov/data/key_signup.html.
```r
census_api_key("YOUR CENSUS API HERE")
## To install your API key for use in future sessions, run this function with `install = TRUE`.
```

Let’s get median housing values from the latest ACS. I’ll use the variable “B25077_001E”, which is the estimated median value of owner-occupied housing units. The “tidycensus” package provides the get_acs function, which lets me retrieve this variable for each county at the block group level. It is important to note the option geometry = T. This option also downloads the geometry features of the census blocks, which will allow us to compute distances later on.
```r
# retrieve median housing values for Cook County
chicago <- get_acs(geography = "block group", variables = "B25077_001E",
                   state = "IL", county = "Cook County", year = 2016, geometry = T)

# retrieve median housing values for Suffolk County
boston <- get_acs(geography = "block group", variables = "B25077_001E",
                  state = "MA", county = "Suffolk County", year = 2016, geometry = T)
```

Next, we need the city centers. I create two “sf” objects that contain the coordinates of each city center.
```r
chicago_cbd <- st_as_sf(x = read.table(text = "-87.627800 41.881998"),
                        coords = c(1, 2), crs = "+proj=longlat +datum=WGS84")
boston_cbd  <- st_as_sf(x = read.table(text = "-71.057083 42.361145"),
                        coords = c(1, 2), crs = "+proj=longlat +datum=WGS84")
```
Now, I need to put everything in the same projection, so I reproject the CBDs to the same projection as my tibbles.

```r
chicago_cbd <- chicago_cbd %>% st_transform(st_crs(chicago))
boston_cbd  <- boston_cbd %>% st_transform(st_crs(boston))
```
The next step is to compute distances. For that, I use the st_distance function. Since the distances are computed in meters, I convert them to miles; one meter is 0.000621371 miles.

```r
chicago$dist_CBD <- st_distance(chicago, chicago_cbd)          # distance to CBD (in meters)
chicago$dist_CBD <- as.numeric(chicago$dist_CBD) * 0.000621371 # change units to miles
boston$dist_CBD  <- st_distance(boston, boston_cbd)            # distance to CBD (in meters)
boston$dist_CBD  <- as.numeric(boston$dist_CBD) * 0.000621371  # change units to miles
```
boston$City <- "Boston"
chicago$City <- "Chicago"
chicago <- chicago %>% filter(dist_CBD <= 10)
chicago <- data.frame(chicago)
boston <- data.frame(boston)
dta <- rbind(chicago, boston)

We are now ready to plot. I use a regular scatter plot with hollow circles and add a linear regression line on a black-and-white background (theme_bw()).
ggplot(dta, aes(x = dist_CBD, y = estimate, color = City)) +
  geom_point(shape = 1) +     # Use hollow circles
  geom_smooth(method = lm) +  # Add linear regression line
  xlab("Distance to CBD (miles)") +
  ylab("Median Housing Prices ($)") +
  theme_bw()
Conclusion: the data reflect the predictions of the simple Alonso-Muth-Mills model: land rents decline with distance to the city center, and the speed at which they decline depends on transportation costs.
Session Info

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] bindrcpp_0.2.2 ggplot2_2.2.1 dplyr_0.7.4 sf_0.6-3
[5] tidycensus_0.4.6

loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 plyr_1.8.4 pillar_1.2.2 compiler_3.5.0
[5] bindr_0.1.1 class_7.3-14 tools_3.5.0 uuid_0.1-2
[9] digest_0.6.15 gtable_0.2.0 jsonlite_1.5 evaluate_0.10.1
[13] tibble_1.4.2 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.0
[17] DBI_0.8 curl_3.2 rgdal_1.2-18 yaml_2.1.18
[21] spData_0.2.8.3 e1071_1.6-8 httr_1.3.1 stringr_1.3.0
[25] knitr_1.20 xml2_1.2.0 hms_0.4.2 rappdirs_0.3.1
[29] tigris_0.7 tidyselect_0.2.4 classInt_0.2-3 rprojroot_1.3-2
[33] grid_3.5.0 glue_1.2.0 R6_2.2.2 foreign_0.8-70
[37] rmarkdown_1.9 sp_1.2-7 readr_1.1.1 tidyr_0.8.0
[41] purrr_0.2.5 udunits2_0.13 magrittr_1.5 scales_0.5.0
[45] backports_1.1.2 htmltools_0.3.6 units_0.5-1 maptools_0.9-2
[49] assertthat_0.2.0 rvest_0.3.2 colorspace_1.3-2 labeling_0.3
[53] stringi_1.1.7 lazyeval_0.2.1 munsell_0.4.3 lwgeom_0.1-4

To leave a comment for the author, please follow the link and comment on their blog: Ignacio Sarmiento Barbieri.
R 3.5.1 update now available
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
Last week the R Core Team released the latest update to the R statistical data analysis environment, R version 3.5.1. This update (codenamed "Feather Spray" — a Peanuts reference) makes no user-visible changes and fixes a few bugs. It is backwards-compatible with R 3.5.0, and users can find updates for Windows, Linux and Mac systems at their local CRAN mirror. (The update to Microsoft R Open featuring the R 3.5.1 engine is scheduled for release on August 29.)
The complete list of fixes to R 3.5.1 is included in the release announcement, found at the link below.
R-announce mailing list: R 3.5.1 is released
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
A Comparative Review of the R Commander GUI for R
(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)
Introduction

The R Commander is a free and open source user interface for the R software, one that focuses on helping users learn R commands by point-and-clicking their way through analyses. The R Commander is available on Windows, Mac, and Linux; there is no server version.
This is one of a series of reviews which aim to help non-programmers choose the user interface for R which is best for them. Each review also includes a cursory description of the programming support that each interface offers.
Terminology
There are various definitions of user interface types, so here’s how I’ll be using these terms:
GUI = Graphical User Interface specifically using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.
IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.
Installation
The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi, BlueSky, or RKWard, install in a single step. Others, such as Deducer, install in multiple steps. Advanced computer users often don’t appreciate how lost beginners can become while attempting even a single-step installation. The help desks at most universities are flooded with such calls at the beginning of each semester!
As described on the R Commander main web site, the installation basics are as follows:
 Download R from CRAN and install it in the manner appropriate to your operating system. If you have an old version of R — that is, older than the current version — then it’s generally a good idea to install the current version of R before installing the Rcmdr package. On Windows, opt for a customized startup and select the single-document interface (“SDI,” see the Windows notes below for details).
 On Mac OS X only, download and install XQuartz, and reboot your computer (see the Mac notes below for greater detail).
 Start R, and at the > command prompt, type the command install.packages("Rcmdr").
 Once it is installed, to load the Rcmdr package, just enter the command library("Rcmdr").
 Optionally install Pandoc and LaTeX to get publication-quality output (via “Tools > Install auxiliary software”).
Complete installation notes are here. They’re worth reading as they go on to point out several things that can go wrong. These include having an incompatible version of R (i.e. you skipped step 1), and R packages which fail to install.
While these multiple steps are more challenging than single-step installations, they are in line with the developer’s goal of helping people learn to program in R. That audience would have to learn to install R and R packages, then load packages anyway.
Plugins
When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling section of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plugins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, BlueSky, Deducer) through moderate (jamovi) to very active.
R Commander’s development community is by far the most active, with 45 add-ons available. The add-ons are stored on CRAN and installed like any other software. You install them using the install.packages function, then choose “Tools > Load Rcmdr plugins…”, select the plugin, and click OK. R Commander will then tell you that you need to restart R Commander to have it appear on the menus. The R Commander stores your list of plugins in your .Rprofile and will edit it for you. That’s important, as editing it is a non-trivial task (see Installation, step 6).
You can find a comprehensive list of plugins here: https://cran.r-project.org/web/packages/index.html.
Startup
Some user interfaces for R, such as jamovi and BlueSky Statistics, start by double-clicking on a single icon, which is great for people who prefer not to write code. Others, such as Deducer, have you start R, then load a package from your library, then call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.
You start R Commander by first starting the RGUI program that comes with the main R package. You can also start it from any program that offers an R console, such as the one that comes with the main R installation. Once R is started, you start the R Commander by loading it from your library by typing this command and pressing the Enter key: library("Rcmdr"). The main control screen will then appear (Figure 1, upper left) along with a graphics screen (Figure 1, right).
If you want to have the R Commander start automatically each time you start R, it’s possible to do so, but it is not a task for beginners. The R Commander creates an “.Rprofile” in your main directory and it includes instructions about how to “uncomment” a few lines by removing the leading “#” characters. However, files whose names consist only of an “.extension” are hidden by your file system, and if edited using the Notepad application, cannot be saved with that non-standard type of name. You can save it to a regular name like “Rprofile.txt”. You then must use operating system commands to rename it, as the Windows file manager also won’t let you choose a file name that is only an extension.
Data Editor
A data editor is a fundamental feature in data analysis software. It puts you in touch with your data, lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences (single-click sorting, icons that show excluded observations, etc.), to a beginner what matters most are the differences in simplicity. Some GUIs, including jamovi and BlueSky, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard, trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.
You start the R Commander’s Data Editor by choosing “Data > New data set…” You can enter data immediately, though at first the variables are named simply V1, V2… and the rows are named 1, 2, 3…. You can click on the names to change them (see Figure 2). Clicking on the “Add row” or “Add column” buttons does just that, though the Enter key is a quicker way to get a new row. You can enter simple numeric data or character data; no scientific notation, no dates. The latter is converted to a factor, but there is no way to enter the underlying values such as 1, 2 and have the editor display Male, Female, for example. That slows down data entry.
There is no way to enter or change any metadata other than variable and row names.
Saving the data provides a lesson on R data structures. Since you started the process by creating a new “data set”, you might start looking on the menus for where to save such a thing. Instead, you have to know that in R, data sets reside in something called a “workspace”. So “Data: New data set…” is balanced by “File: Save R workspace”. It would be nice if there were some instruction explaining this situation.
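From the console, the distinction is easy to demonstrate. This is a sketch (not R Commander output) contrasting the two save styles; the data and file names are invented for illustration:

dat <- data.frame(id = 1:3, score = c(10, 12, 9))

# R Commander's route: the data set lives in the R workspace, so
# "File: Save R workspace" writes every object in memory at once.
ws <- tempfile(fileext = ".RData")
save.image(file = ws)           # later: load(ws) restores everything

# From the console you can instead save just the one data set:
rds <- tempfile(fileext = ".rds")
saveRDS(dat, file = rds)        # later: dat <- readRDS(rds)

The first route mirrors the R Commander's workspace-centric menus; the second is closer to the "one file = one data set" mental model that jamovi and BlueSky present.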
Data Import
The R Commander can import the file formats: CSV, TXT, Excel, Minitab, SPSS, SAS, and Stata. It can even import data directly from a URL, which is a rare feature for a GUI. These are all located under “Data> Import Data”. A particularly handy feature is the ability to explore and load data sets that are included with installed packages. That’s done via “Data> Data in packages…”.
To get data from SQL database formats, you’ll have to use R code.
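For readers who do need database data, a minimal sketch using the DBI and RSQLite packages might look like the following; the table, column names, and in-memory database here are made up for illustration and have nothing to do with R Commander itself:

library(DBI)

# Connect to an SQLite database held in memory (a file path would also work).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create a toy table so the query below has something to read.
dbWriteTable(con, "patients", data.frame(id = 1:3, age = c(25, 40, 33)))

# Pull the query result straight into a data.frame, ready for analysis.
patients <- dbGetQuery(con, "SELECT id, age FROM patients WHERE age >= 30")

dbDisconnect(con)

Once the result is an ordinary data.frame, it can be made the active data set like any other imported data.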
Data Management
It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing such tasks one variable at a time is tedious. Some GUIs, such as BlueSky, handle nearly all of these challenges. Others, such as RKWard offer just a handful of data management functions.
The R Commander is able to recode many variables, adding an optional prefix such as “recoded_” to the name of each variable that you choose to recode. It can also standardize many variables at once, but can only overwrite the original values. Make a copy of your data set before doing that! Unfortunately, when it comes to other popular transformations such as the logarithm, you have to apply them one variable at a time.
For reshaping data sets, the R Commander can stack one set of variables into a single variable and create a factor to classify those values, but it can’t take along other relevant variables, nor can it do the reverse of this process by going from “long” to “wide” data structures.
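As a point of comparison, base R’s stack() performs the same one-directional stacking from the console. This is a sketch with invented data, not an R Commander dialog:

# Two "wide" columns of measurements for three subjects.
wide <- data.frame(before = c(110, 125, 98), after = c(102, 118, 95))

# stack() pivots the columns into a single values column plus a
# factor ("ind") recording which column each value came from.
long <- stack(wide)
head(long)

Like the R Commander’s stacking dialog, this drops any other variables in the data set; going back from long to wide requires additional code.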
Overall, the R Commander offers a very useful set of data management tools:
For managing the active data set as a whole:
 View data
 Select active data set
 Refresh active data set
 Help on active data set
 Variables in active data set
 Set case names
 Subset active data set
 Sort active data set
 Aggregate variables in the active data set
 Remove row(s) from active data set
 Stack variables in active data set (half of reshaping discussed above)
 Remove cases with missing data
 Save active data set
 Export active data set
For managing variables in the active data set:
 Recode variables (able to do many variables)
 Compute new variables (can create only one new variable at a time)
 Add observation numbers to data set
 Standardize variables (able to do many variables at once)
 Convert numeric variables to factors
 Bin numeric variable
 Reorder factor levels
 Drop unused factor levels
 Define contrasts for a factor
 Rename variables
 Delete variables from data set
Menus & Dialog Boxes
The goal of pointing & clicking your way through an analysis is to save time by recognizing menu settings rather than spending it on the memorization and practice required by programming. Some GUIs, such as jamovi, make this easy by sticking to menu standards and using simpler dialog boxes; others, such as RKWard, use non-standard menus that are unique to it and hence require more learning.
Figure 1 shows a common screen layout. The main R Commander window is in the upper left. A typical dialog box is in the front, and the graph it created is on the right. The data editor is on the upper right.
The R Commander’s menu structure contains some unique choices. No operations on data files are located on the usual “File” menu. For example, existing data sets or files are not opened using the usual “File > Open…”, but instead using the “Data > Load data set…” menu. Also, everything on the Models menu applies not to data but to models that you’ve already created from data. The other menus follow Windows standards. When switching between software packages, I found myself usually looking for data under the File menu. The rationale behind the R Commander’s approach is that the R function that opens files is named “load”. So this structure will help people learn more about R code (whether they’re headed that way or not!)
The dialog boxes have their own style too, but one that is easy to learn. Rather than have empty role boxes that you drag or click variables into, the role boxes contain the full list of relevant variables, and you click on one (or more) to select them (see “X variable (pick one)” in Fig. 1). In the cases where there is an empty role box, you double-click a variable name to move it from a list to the box. The R Commander does a nice job of helping you avoid asking for absurdities, such as the mean of a factor, by not displaying them in certain dialog boxes.
The two objects you might be working on are shown on the toolbar right below the main menus. Clicking the “Data set:” tool will allow you to choose which data set most of the dialog boxes will refer to by default. That’s filled in automatically when you load, enter, or import a data set. Similarly, clicking the “Model:” tool will let you select a model which most of the choices on the Models menu will relate to. It too is filled in automatically each time you create a new model. See more on this in the Modeling section below.
To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.
The Ten Commandments for a well-formatted database
(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)
Our diligent readers already know how important having well formatted data is for efficient statistical analyses. Here we gathered some advice on how to make a well structured database, in order to perform accurate analyses, and avoid driving your fellow analysts crazy.
We are very grateful to Marion Louveaux, bioimage analyst, for this translation of the French version “Une base de données réussie” on our ThinkR website.
An example is worth a thousand words. Let’s imagine you are a medical intern working on an experimental assay (any resemblance to people who actually existed …). We could have given the following title to our article: “How to make a database when you are a medical intern knowing nothing”. However this exercise may have a broader application: indeed our advice applies to anyone.
Let’s say that you already have your medical question and your protocol, and that you “just” have to fill in your database. So, now what? As the proverb says, there are a thousand ways of doing it wrong, and often only one to do it right. Let’s be honest: if your database is not well ordered, then any further analyses will be hindered. Don’t worry, ThinkR is here to help you create a clear database. Here are ten commandments to avoid classical traps.
Commandment 1: All your data shall fit into one single dataframe

Spreading your data over several spreadsheets or several files is not an option. All your data should fit in one single dataframe.
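If your measurements currently live in two spreadsheets that share an identifier, a single merge() call brings them into one dataframe. This is a sketch with invented columns:

# Two separate tables that share a patient identifier.
visits  <- data.frame(patient_id = 1:3,
                      visit_date = c("2018-01-04", "2018-01-05", "2018-01-09"))
weights <- data.frame(patient_id = 1:3,
                      weight_kg  = c(70, 82, 65))

# One dataframe: one row per individual, every variable in one place.
full <- merge(visits, weights, by = "patient_id")

From here on, every analysis reads from the single object `full`, not from scattered files.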
Commandment 2: Thou shalt respect a precise formatting

For a perfect database, follow this rule: the table must be filled cell by cell, starting from the upper left corner and, of course, without skipping line(s).
Commandment 3: A line = a statistical individual

It is not always easy to define what a statistical individual is. In a nutshell, an answer to a survey corresponds to one individual in your experimental protocol. If a physical person answers two times, you can enter these on two lines.
Commandment 4: A column = a variable

A variable corresponds to a metric from your statistical individual (e.g. a characteristic that you can measure). For instance, the gender, the age or the colour of the eyes. For these categorical variables, where each answer corresponds to one level or another, but not several levels at the same time, each variable corresponds to one column. Do not create one column for “man” and one column for “woman”, each filled with yes/no, but a single one alternating men and women across the different lines.
When several levels can be taken by the same individual, use several columns. For instance, if your patients have several symptoms (hurting eyes/nose/neck…), you will have to create one column per symptom, and fill them with yes or no.
Commandment 5: Thou shalt not encode thy qualitative variables

Avoid as much as possible encoding qualitative variables: a 1/2 coding for the gender is useless; prefer the use of “woman” and “man” for better readability. If you really can’t prevent yourself from using a code, then stick to 1/0 to mean presence/absence or yes/no. No more!
For Excel lovers: don’t code your variables using colors or bold/italic/underlined. No “control in blue, test case in red”: nothing is worse than such a format for losing information when exporting those data.
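If you inherit data that was already coded as 1/2, a labelled factor restores readability without retyping anything. This is a sketch with an invented coding:

# Numeric codes inherited from someone else's spreadsheet.
gender_code <- c(1, 2, 2, 1)

# Attach human-readable labels once; every table and plot
# then shows words rather than opaque codes.
gender <- factor(gender_code, levels = c(1, 2), labels = c("woman", "man"))
table(gender)

The underlying values are unchanged, but the database now reads "woman"/"man" everywhere it is displayed.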
Commandment 6: Thy database shall only contain data

It is not unusual to see databases with aesthetic formatting, containing titles, blank spaces, annotations, explanations and much more. Nothing is worse than this type of format for causing difficulties when importing the data into R. Repeat after me: the database must contain ONLY data.
If you really need to make comments, you can always add a special column for them.
Commandment 7: Homogeneous thou shalt be

For reading, as well as for the analysis, a homogeneous notation is essential. For instance, don’t mix “boy”, “man” and “masculine”. Choose one and stick to it all across your database… including the exact spelling: R is case and blank sensitive. “Man 1” will not be the same as “man 1”, nor as “man1” or “man_1”. For column names, forget about “m1 glycemia”, “glucose rate month3” and “glycemia month6”. Rather use “glycemia_month_1”, “glycemia_month_3” and “glycemia_month_6”! When data are missing, leave a blank space.
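R’s case and blank sensitivity is easy to demonstrate. The four spellings below, invented for illustration, all count as distinct values:

# Four ways of writing what a human reads as the same label.
ids <- c("Man 1", "man 1", "man1", "man_1")

# R treats each spelling as its own value, silently
# splitting one individual into four groups.
length(unique(ids))
nlevels(factor(ids))

A quick table(ids) or unique() check like this is a cheap way to catch heterogeneous spellings before they contaminate an analysis.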
Commandment 8: Thy numerical variables with respect thou shalt treat

You have numerical variables in your database. Fair enough, but please keep them numerical! For instance, the patient’s age must be a number, not “25 yrs” or “25 years old”, just “25”.
For the dates, choose the universal notation YYYY-MM-DD, natively recognized by R. And don’t forget to respect Commandment 7 by standardizing the format.
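The YYYY-MM-DD recommendation can be checked directly: as.Date() parses it with no extra arguments, while any other format needs an explicit template. This is a sketch with made-up dates:

# ISO 8601 dates parse natively...
d1 <- as.Date("2018-07-15")

# ...anything else needs a format string, and inconsistent
# entries in the same column will simply fail to parse.
d2 <- as.Date("15/07/2018", format = "%d/%m/%Y")

d1 == d2

Both calls yield the same Date, but only the first works without knowing the template in advance, which is exactly why a homogeneous YYYY-MM-DD column is so much safer.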
Commandment 9: Anonymous thy database thou shalt keep

Don’t forget to make your database anonymous (especially if you are a medical intern). When communicating, sharing or publishing your results, including the identities of people is not an option. Give them numbers, and keep the correspondence in another database with the names, medical files, … and all other information that is not relevant for statistical analysis.
Commandment 10: Human readable thy database shall be

Throughout the creation of your database, never forget that your data must be perfectly readable by the computer… but also by a human reader! Your database must be understandable and readable, either by you or a colleague. Give meaningful names to your variables, so that you still understand them in 6 months when reopening the database. And avoid falling into the trap of simplicity: rather use “glycemia_month1”, “glycemia_month2” and “glycemia_month3” than G1, G2 and G3.
Acknowledgements

I would like to thank Victor Jones for reviewing and correcting this translation.
Translated by Marion Louveaux, bioimage analyst
The post The Ten Commandments for a well-formatted database appeared first on (en) The R Task Force.
To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force.
Writing Pipe-friendly Functions
(This article was first published on R – William Doane, and kindly contributed to R-bloggers)
Pipes have been a fundamental aspect of computer programming for many decades. In short, the semantics of pipes can be thought of as taking the output from the left-hand side and passing it as input to the right-hand side. For example, in a Linux shell, you might cat example.txt | uniq | sort to take the contents of a text file, then take one copy of each row, then sort those remaining rows. | is a common, but not universal, pipe operator and, on U.S. Qwerty keyboards, is found above the backslash key: \.
Languages that don’t begin by supporting pipes often eventually implement some version of them. In R, the magrittr package introduced the %>% infix operator as a pipe operator and is most often pronounced as “then”. For example, “take the mtcars data.frame, THEN take the head of it, THEN…” and so on.
For a function to be pipe-friendly, it should at least take a data object (often named .data) as its first argument and return an object of the same type—possibly even the same, unaltered object. This contract ensures that your pipe-friendly function can exist in the middle of a piped workflow, accepting the input from its left-hand side and passing along output to its right-hand side.
library(magrittr)

custom_function <- function(.data) {
  message(str(.data))
  .data
}

mtcars %>%
  custom_function() %>%
  head(10) %>%
  custom_function()

This will first display the structure of the 32-by-11 mtcars data.frame, then take the head(10) of mtcars and display the structure of that 10-by-11 reduced version, ultimately returning the reduced version which is, by default in R, printed to the console.
The dplyr package in R introduces the notion of a grouped data.frame. For example, in the mtcars data, there is a cyl parameter that classifies each observation as a 4, 6, or 8 cylinder vehicle. You might want to process each of these groups of rows separately—i.e., process all the 4 cylinder vehicles together, then all the 6 cylinder, then all the 8 cylinder:
library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  tally()

Note that dplyr re-exports the magrittr pipe operator, so it’s not necessary to attach both dplyr and magrittr explicitly; attaching dplyr will usually suffice.
In order to make my custom function group-aware, I need to check the incoming .data object to see whether it’s a grouped data.frame. If it is, then I can use dplyr’s do() function to call my custom function on each subset of the data. Here, the (.) notation denotes the subset of .data being handed to custom_function at each invocation.
library(dplyr)

custom_function <- function(.data) {
  if (dplyr::is_grouped_df(.data)) {
    return(dplyr::do(.data, custom_function(.)))
  }
  message(str(.data))
  .data
}

mtcars %>%
  custom_function()

mtcars %>%
  group_by(cyl) %>%
  custom_function()

In these examples, I’ve messaged some metadata to the console, but your custom functions can do any work they like: create, plot, and save ggplots; compute statistics; generate log files; and so on.
I usually include the R three-dots parameter, ..., to allow additional parameters to be passed into the function.
custom_function <- function(.data, ...) {
  if (dplyr::is_grouped_df(.data)) {
    return(dplyr::do(.data, custom_function(., ...)))
  }
  message(str(.data))
  .data
}

To leave a comment for the author, please follow the link and comment on their blog: R – William Doane.
Euler Problem 33: Digit Cancelling Fractions and Ford Circles
(This article was first published on The Devil is in the Data – The Lucid Manager, and kindly contributed to R-bloggers)
Euler Problem 33 takes us back to the world of fractions from our primary school days. Many kids hate and struggle learning about fractions but once you master them, a new world of numbers opens up. Unfortunately, the proliferation of digital calculators has negated the use of fractions in favour of decimal expressions. Fractions are an aesthetic way to express numbers, without having to resort to ugly random sequences of decimals. This is why I prefer to use 22/7 as an approximation of Pi over the ugly infinite series of decimals.
The Numberphile video below explains fractions and Farey sequences. A Farey sequence contains all fractions between 0 and 1 with a maximum denominator. More formally, a Farey sequence of order n is the sequence of completely reduced fractions between 0 and 1 which, when in lowest terms, have denominators less than or equal to n, arranged in order of increasing size. For example, the Farey sequence of order 3 is: 0/1, 1/3, 1/2, 2/3, 1/1.
These sequences can be visualised in fractalesque Ford Circles, but before we get to this, first solve Euler problem 33.
Euler Problem 33 Definition

The fraction 49/98 is a curious fraction, as an inexperienced mathematician in attempting to simplify it may incorrectly believe that 49/98 = 4/8, which is correct, is obtained by cancelling the 9s. We shall consider fractions like 30/50 = 3/5, to be trivial examples.
There are exactly four nontrivial examples of this type of fraction, less than one in value, and containing two digits in the numerator and denominator. If the product of these four fractions is given in its lowest common terms, find the value of the denominator.
Proposed Solution in R

To solve this problem, we create a pseudo-Farey sequence by generating all different fractions with two digits in the numerator and denominator. The loop generates all combinations of denominators and numerators, excluding the trivial ones (multiples of 10 and 11). This solution converts the numbers to strings, strips any common duplicates, and tests the condition. The code concatenates vectors, which is not good practice. However, the loop is so short it does not matter much.
You can view the code below or download it from my GitHub page.
num <- vector()
den <- vector()
for (a in 11:99) {
  for (b in (a + 1):99) {
    trivial <- (a %% 10 == 0 | b %% 10 == 0 | a %% 11 == 0 | b %% 11 == 0)
    if (!trivial) {
      i <- as.numeric(unlist(strsplit(as.character(a), "")))
      j <- as.numeric(unlist(strsplit(as.character(b), "")))
      digs <- c(i, j)
      dup <- digs[duplicated(digs)]
      digs <- digs[which(digs != dup)]
      if (length(digs) == 2 & a / b == digs[1] / digs[2]) {
        num <- c(num, a)
        den <- c(den, b)
      }
    }
  }
}
answer <- prod(den) / prod(num)
print(answer)

Farey Sequences and Ford Circles

The next step is to generalise Euler problem 33 and write a function to generate Farey sequences and visualise them using Ford Circles.
The farey function generates a data table with the numerators (p) and denominators (q) of a Farey sequence. The function builds a list of all possible fractions for the solution space, keeping only those whose numerator and denominator have a greatest common divisor of one, as computed by the gcd function.
farey <- function(n) {
  fseq <- list()
  fseq[[1]] <- c(0, 1)
  i <- 2
  gcd <- function(a, b) { # Euclid's method
    if (a == 0) return(b)
    if (b == 0) return(a)
    gcd(b, a %% b)
  }
  for (q in 2:n) {
    for (p in 1:(q - 1)) {
      if (gcd(p, q) == 1) {
        fseq[[i]] <- c(p, q)
        i <- i + 1
      }
    }
  }
  fseq[[i]] <- c(1, 1)
  fseq <- as.data.frame(do.call(rbind, fseq))
  names(fseq) <- c("p", "q")
  fseq <- fseq[order(fseq$p / fseq$q), ]
  return(fseq)
}

Standard ggplot2 cannot draw circles whose radius is tied to the coordinate system. I tried to use the ggforce package to plot circles in ggplot2, but for some reason I was not able to install this package on Ubuntu. As an alternative, I used a circle function sourced from Stack Overflow. This function is called in a for-loop to build the circles on top of the empty canvas.
Farey Sequence and Ford Circles (n = 20).

library(tidyverse)
lm_palette <- c("#008da1", "#005395", "#262e43", "#3b2758", "#865596", "#f26230")
ford_circles <- farey(20) %>%
  mutate(x = p / q,
         y = 1 / (2 * q^2),
         r = y,
         c = lm_palette[(q - 1) %% 6 + 1])
g_circle <- function(r, x, y, color = NA, fill = "black", ...) {
  x <- x + r * cos(seq(0, pi, length.out = 100))
  ymax <- y + r * sin(seq(0, pi, length.out = 100))
  ymin <- y - r * sin(seq(0, pi, length.out = 100))
  annotate("ribbon", x = x, ymin = ymin, ymax = ymax,
           color = color, fill = fill, ...)
}
p <- ggplot(ford_circles, aes(x, y))
for (i in 1:nrow(ford_circles)) {
  p <- p + g_circle(ford_circles$r[i], ford_circles$x[i], ford_circles$y[i],
                    fill = ford_circles$c[i])
}
p + xlim(c(0, 1)) + coord_fixed() + theme_void()

The post Euler Problem 33: Digit Cancelling Fractions and Ford Circles appeared first on The Lucid Manager.
To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data – The Lucid Manager.
RStudio Connect v1.6.4.2 – Security Update
(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)
A security vulnerability in a third-party library used by RStudio Connect was uncovered during a security audit last week. We have confirmed that this vulnerability has not been used against any of the RStudio Connect instances we host, and are unaware of it being exploited on any customer deployments. Under certain conditions, this vulnerability could compromise the session of a user who was tricked into visiting a specially crafted URL. The issue affects all versions of RStudio Connect up to and including 1.6.4.1, but none of our other products. We have prepared a hotfix: v1.6.4.2.
RStudio remains committed to providing the most secure product possible. We regularly perform internal security audits against RStudio Connect in order to ensure the product’s security.
As part of the responsible disclosure process, we will provide additional details about the vulnerability, and how to ensure that you have not been affected, in the coming weeks once customers have had time to update their systems. For now, please update your RStudio Connect installations to version 1.6.4.2 as soon as possible.
To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.
Solver Interfaces in CVXR
(This article was first published on R Views, and kindly contributed to R-bloggers)
Introduction

In our previous blog
post, we
introduced CVXR, an R package for disciplined convex
optimization. The package allows one to describe an optimization
problem with Disciplined Convex Programming
rules using high level mathematical syntax. Passing this problem
definition along (with a list of constraints, if any) to the solve
function transforms it into a form that can be handed off to
a solver. The default installation of CVXR comes with two (imported)
open source solvers:
- ECOS and its mixed integer cousin ECOS_BB, via the CRAN package ECOSolveR
- SCS, via the CRAN package scs.
CVXR (version 0.99) can also make use of several other open source
solvers implemented in R packages:
- The linear and mixed integer programming package lpSolve, via the lpSolveAPI package
- The linear and mixed integer programming package GLPK, via the Rglpk package.
The real work of finding a solution is done by solvers, and writing
good solvers is hard work. Furthermore, some solvers work particularly
well for certain types of problems (linear programs, quadratic
programs, etc.). Not surprisingly, there are commercial vendors who
have solvers that are designed for performance and scale. Two
well-known solvers are MOSEK and
GUROBI. R packages for these solvers are
also provided, but they require the problem data to be constructed in
a specific form. This necessitates a bit of work in the current version of
CVXR and is certainly something we plan to include in future versions.
However, it is also true that these commercial solvers expose a much
richer API to Python programmers than to R programmers. How, then, do we
interface such solvers with R as quickly as possible, at least
in the short term?
The current version of CVXR exploits the
reticulate package
for commercial solvers such as MOSEK and GUROBI. We took the Python solver interfaces in CVXPY version
0.4.11, edited them
suitably to make them self-contained, and hooked them up to reticulate.
This means that one needs two prerequisites to use these commercial solvers in the current version of CVXR:
- A Python installation
- The reticulate R package.
Both MOSEK and
GUROBI provide academic versions
(registration required) free of charge. For example,
Anaconda users can install MOSEK with the command:
Others can use the pip command:
pip install -f https://download.mosek.com/stable/wheel/index.html Mosek

GUROBI is handled in a similar fashion. The solvers must be activated using a
license provided by the vendor.
Once activated, one can check that CVXR recognizes the solver;
installed_solvers() should list them.
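As a quick sanity check, a tiny problem can confirm that a newly configured solver is picked up and usable. The sketch below uses a made-up least-squares problem (the matrix A and vector b are illustrative, not from the post) and the default ECOS solver; once MOSEK or GUROBI is activated, its name can be passed to solve() in the same way.

```r
# Sketch: verify which solvers CVXR can see and solve a tiny problem.
# The problem data are arbitrary illustrations.
library(CVXR)

installed_solvers()                  # lists ECOS, SCS and, once configured,
                                     # MOSEK and/or GUROBI

set.seed(1)
A <- matrix(rnorm(20), nrow = 10)
b <- rnorm(10)
x <- Variable(2)
prob <- Problem(Minimize(sum_squares(A %*% x - b)))
res <- solve(prob, solver = "ECOS")  # swap in "MOSEK" or "GUROBI" here
res$status
```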
More information on these solvers, along with a number of tutorial examples, is
available on the CVXR site. If you are
attending useR! 2018, you can catch
Anqi’s CVXR talk on Friday, July 13.
To leave a comment for the author, please follow the link and comment on their blog: R Views.
Time Series Analysis With Documentation And Steps I Follow For Analytics Projects.
(This article was first published on R-Analytics, and kindly contributed to R-bloggers)
To do this I will create a prediction of the open values for Bitcoin in the next 3 days.
The process I follow is based on the CRISP-DM methodology: https://www.datasciencecentral.com/profiles/blogs/crispdmastandardmethodologytoensureagoodoutcome
1. Planning the activities.
To plan the activities I use a spreadsheet document. Below I show a sample of the spreadsheet; if you would like the document, please go to the following link:
https://github.com/LaranIkal/RANALYTICS/blob/master/BitCoinPredictionsAdmin%26FollowUp.ods
Activity | Description | Due | Owner | Status | Notes
Get Data For Analysis | Get initial data to create feasibility analysis | 4/19/2018 | Carlos Kassab | Done |
ETL Development | ETL to get final data for next analysis | 4/20/2018 | Carlos Kassab | Done | 2018/04/19: In this case, there is no ETL needed; the dataset was downloaded from Kaggle: https://www.kaggle.com/vivekchamp/bitcoin/data
Exploratory Data Analysis | Dataset summary and histogram to know the data normalization | 4/20/2018 | Carlos Kassab | In progress |
Variables frequency | Frequency of variable occurrence (frequency of values change, etc.) | 4/20/2018 | Carlos Kassab | Done | 2018/04/19: We have already seen that our values change every day.
Outliers Analysis | Analysis of variability in numeric variables, shown in charts and grids | 4/20/2018 | Carlos Kassab | In progress |
Time Series Decomposition | Getting metric charts, raw data, seasonality, trend and remainder | 4/20/2018 | Carlos Kassab | In progress |
Modelling | Create the analytics model | 4/25/2018 | Carlos Kassab | Not Started |
SQL View Development For Training, Validation And Testing | | NA | Carlos Kassab | Not Started | 2018/04/19: No SQL view needed, everything is done inside the R script.
Model Selection | By using a random parameters search algorithm, find the right model to be used for this data | 4/25/2018 | Carlos Kassab | Not Started |
Model fine tuning | After finding the right algorithm, find the right model parameters to be used | 4/25/2018 | Carlos Kassab | Not Started |
Chart Development | Final data chart development | 4/25/2018 | Carlos Kassab | Not Started |
Data Validation | Run the analytics model daily for at least 2 weeks in order to see its behavior | NA | Carlos Kassab | Not Started |
Deployment | Schedule the automatic execution of the R code | NA | Carlos Kassab | Not Started |
The first activity is the functional specification; this would be similar to business understanding in the CRISP-DM methodology.
I use a text document, for this analysis, you can get the document from this link:
https://github.com/LaranIkal/RANALYTICS/blob/master/BitCoinPredictionRequirementSpecification.odt
Now, the next step is to get the data for analysis and create the ETL script. In this case we just got the data from kaggle.com as mentioned in the documentation, and no ETL script was needed.
So, we have our data; now we are going to analyze it, carrying out all the activities highlighted in yellow in the grid above. I did this in an R Markdown document; here you can see the HTML output:
https://github.com/LaranIkal/RANALYTICS/blob/master/BitcoinDataAnalysis.html
Note: at the end, in order to have everything together, I included the time series algorithms in the same Rmd file that creates the BitcoinDataAnalysis.html file.
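As an illustration of the decomposition step from the activity grid (trend, seasonality and remainder), here is a sketch using base R's stl() on a synthetic daily series; in the real analysis the Open column of the Kaggle bitcoin dataset would take the place of x.

```r
# Sketch: STL decomposition of a synthetic daily series; `x` stands in
# for the bitcoin Open prices analysed in the post.
set.seed(42)
x <- 100 + cumsum(rnorm(280)) + rep(5 * sin(2 * pi * (1:7) / 7), 40)
open_ts <- ts(x, frequency = 7)          # treat 7 days as one seasonal cycle
fit <- stl(open_ts, s.window = "periodic")
head(fit$time.series)                    # columns: seasonal, trend, remainder
# plot(fit)                              # charts for the analysis document
```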
You can get all the sources from:
https://github.com/LaranIkal/RANALYTICS
Enjoy it!

Carlos Kassab
https://www.linkedin.com/in/carloskassab48b40743/
To leave a comment for the author, please follow the link and comment on their blog: R-Analytics.
a thread to bin them all [puzzle]
(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)
The most recent riddle on the Riddler consists in finding the shortest sequence of digits (in 0,1,…,9) such that all 10⁴ numbers between 0 (or 0000) and 9,999 can be found as a group of four consecutive digits. This sequence is obviously longer than 10⁴+3, but how long? On my trip to Brittany last weekend, I wrote an R code first constructing the sequence at random, by picking with high preference the next digit among those producing a new four-digit number
tenz=10^(0:3)
wn2dg=function(dz) 1+sum(dz*tenz)
seqz=rep(0,10^4)
snak=wndz=sample(0:9,4,rep=TRUE)
seqz[wn2dg(wndz)]=1
while (min(seqz)==0){
  wndz[1:3]=wndz[-1];wndz[4]=0
  wndz[4]=sample(0:9,1,prob=.01+.99*(seqz[wn2dg(wndz)+0:9]==0))
  snak=c(snak,wndz[4])
  sek=wn2dg(wndz)
  seqz[sek]=seqz[sek]+1}

which usually returns a value above 75,000. I then looked through the sequence to eliminate useless replicas
for (i in sample(4:(length(snak)-5))){
  if ((seqz[wn2dg(snak[(i-3):i])]>1)
     &(seqz[wn2dg(snak[(i-2):(i+1)])]>1)
     &(seqz[wn2dg(snak[(i-1):(i+2)])]>1)
     &(seqz[wn2dg(snak[i:(i+3)])]>1)){
    seqz[wn2dg(snak[(i-3):i])]=seqz[wn2dg(snak[(i-3):i])]-1
    seqz[wn2dg(snak[(i-2):(i+1)])]=seqz[wn2dg(snak[(i-2):(i+1)])]-1
    seqz[wn2dg(snak[(i-1):(i+2)])]=seqz[wn2dg(snak[(i-1):(i+2)])]-1
    seqz[wn2dg(snak[i:(i+3)])]=seqz[wn2dg(snak[i:(i+3)])]-1
    snak=snak[-i]
    seqz[wn2dg(snak[(i-3):i])]=seqz[wn2dg(snak[(i-3):i])]+1
    seqz[wn2dg(snak[(i-2):(i+1)])]=seqz[wn2dg(snak[(i-2):(i+1)])]+1
    seqz[wn2dg(snak[(i-1):(i+2)])]=seqz[wn2dg(snak[(i-1):(i+2)])]+1}}

until none is found. A first attempt produced 12,911 terms in the sequence. A second one 12,913. A third one 12,871. Rather consistent figures, but not concentrated enough to believe a true minimum was achieved. An overnight run produced 12,779 as the lowest value.
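A small helper, not part of the original code, makes it easy to confirm that a candidate digit vector such as snak really covers all 10⁴ four-digit windows before counting its length:

```r
# Sketch: check that every four-digit window 0000-9999 occurs in a digit
# vector (here tried on a toy input that is obviously too short).
covers_all <- function(digits) {
  n <- length(digits)
  windows <- digits[1:(n - 3)] * 1000 + digits[2:(n - 2)] * 100 +
             digits[3:(n - 1)] * 10 + digits[4:n]
  length(unique(windows)) == 10^4
}
covers_all(rep(0:9, 4))   # FALSE: 40 digits cannot contain all 10^4 windows
```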
To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.
World Income, Inequality and Murder
(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)
Abstract: We follow up on last week's post on using Gapminder data to study the world's income distribution. In order to assess the inequality of the distribution, we compute the Gini coefficient for the world's income distribution by Monte Carlo approximation and visualize the result as a time series. Furthermore, we animate the association between the Gini coefficient and the homicide rate per country using the new version of gganimate.
This work is licensed under a Creative Commons AttributionShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from github.
Introduction

One of the main messages of the chapter 'The Gap Instinct' of the book Factfulness is that there is no justification of the 'we' and 'them' classification of countries anymore, because they have converged towards the same levels in key indicators such as life expectancy, child mortality, and births per female. The difference between countries is, hence, not as many imagine it to be: there is less inequality and no real gap. While reading, I became curious about the following: what if countries became more equal, but simultaneously inequality within countries became bigger? This was also indirectly a Disqus comment by F. Weidemann to the post Factfulness: Building Gapminder Income Mountains. The aim of the present post is to investigate this hypothesis using the Gapminder data. Furthermore, we use the country-specific Gini coefficients to investigate the association with the number of homicides in the country.
Gini coefficient

There are different ways to measure income inequality, both in terms of which response you consider and which statistical summary you compute for it. Without going into the details of this discussion, we use GDP/capita in Purchasing Power Parities (PPP) measured in so-called international dollars (fixed prices 2011). In other words, comparisons between years and countries are possible, because the response is adjusted for inflation and differences in cost of living.
The Gini coefficient is a statistical measure to quantify inequality. In what follows we shall focus on computing the Gini coefficient for a continuous probability distribution with a known probability density function. Let the probability density function of the nonnegative continuous income distribution be defined by \(f\), then the Gini coefficient is given as half the relative mean difference:
\[
G
= \frac{1}{2\mu}\int_0^\infty \int_0^\infty |x-y| \> f(x) \> f(y) \>
dx\> dy, \quad\text{where}\quad \mu = \int_{0}^\infty x\cdot f(x)\, dx.
\]
Depending on \(f\) it might be possible to solve these integrals analytically; however, a straightforward computational approach is to use Monte Carlo sampling, as we shall see shortly. Personally, I find the above relative mean difference presentation of the Gini index much more intuitive than the area argument using the Lorenz curve. From the equation it also becomes clear that the Gini coefficient is invariant to multiplicative changes in the income: if everybody increases their income by a factor \(k>0\), then the Gini coefficient remains the same, because \(|kx - ky| = k|x - y|\) and \(E(k \cdot X) = k \mu\).
The above formula indirectly also states how to compute the Gini coefficient for a discrete sample of size \(n\) with incomes \(x_1,\ldots, x_n\): \[
G = \frac{\sum_{i=1}^n \sum_{j=1}^n |x_i -
x_j| \frac{1}{n} \frac{1}{n}}{2 \sum_{i=1}^n x_i \frac{1}{n}} =
\frac{\sum_{i=1}^n \sum_{j=1}^n |x_i - x_j|}{2 n \sum_{j=1}^n x_j}.
\]
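The discrete formula translates directly into a few lines of R. This is a sketch using the pairwise mean absolute difference over all pairs; it is fine for small samples but O(n²) in memory:

```r
# Sketch: Gini coefficient of an income sample via the pairwise
# mean-absolute-difference formula above.
gini <- function(x) {
  n <- length(x)
  sum(abs(outer(x, x, "-"))) / (2 * n * sum(x))
}
gini(c(10, 10, 10, 10))   # 0: perfectly equal incomes
gini(c(0, 0, 0, 100))     # 0.75: one person holds everything
```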
If one is able to easily sample from \(f\), then instead of solving the integral analytically one can use \(K\) pairs \((x,y)\), both drawn at random from \(f\), to approximate the double integral:
\[
G \approx \frac{1}{2\mu K} \sum_{k=1}^K |x_k - y_k|, \quad\text{where}\quad
x_k \stackrel{\text{iid}}{\sim} f \text{ and } y_k \stackrel{\text{iid}}{\sim} f,
\] where for our mixture model \[
\mu = \sum_{i=1}^{192} w_i \> E(X_i) = \sum_{i=1}^{192} w_i \exp\left(\mu_i + \frac{1}{2}\sigma_i^2\right).
\] This allows us to compute \(G\) even in the case of a complex \(f\) such as the log-normal mixture distribution. As always, the larger \(K\) is, the better the Monte Carlo approximation.
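For a single log-normal component the Monte Carlo recipe can be sketched as follows. The parameters meanlog and sdlog are illustrative, not Gapminder estimates; for the log-normal the exact Gini coefficient is \(2\Phi(\sigma/\sqrt{2}) - 1\), which gives a check on the approximation:

```r
# Sketch: Monte Carlo Gini for one log-normal income distribution;
# parameters are illustrative.
set.seed(123)
K <- 1e6
meanlog <- 0; sdlog <- 1
x <- rlnorm(K, meanlog, sdlog)
y <- rlnorm(K, meanlog, sdlog)
mu <- exp(meanlog + sdlog^2 / 2)           # E(X) of the log-normal
G_mc <- mean(abs(x - y)) / (2 * mu)
G_exact <- 2 * pnorm(sdlog / sqrt(2)) - 1  # closed form for the log-normal
c(G_mc, G_exact)                           # both close to 0.52
```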
We start by showing the country specific Gini coefficient per year since 1950 for a somewhat arbitrary selection of countries. The dashed black line shows the mean Gini coefficient each year over all 192 countries in the dataset.
In addition, we now show the Monte Carlo computed Gini coefficient for the world's income distribution, given as a log-normal mixture with 192 components.
We notice that the Gini coefficient taken over the 192 countries' GDP/capita remains very stable over time. This, however, does not take the large differences in population between countries into account. A fairer measure is thus the Gini coefficient for the world's income distribution. We see that this Gini coefficient increased over time until peaking around 1990. From then on it has declined. However, the pre-1950 Gini coefficients are rather guesstimates, as stated by Gapminder; hence we zoom in on the period from 1970, because data are more reliable from this point on.
Gini coefficient and Homicide Rate

Finally, we end the post by illustrating the association between the Gini coefficient and the homicide rate per country using a 2D scatterplot over the years. The Gapminder data download page also contains data for this for the years 1950-2005. Unfortunately, no data for more recent years are available from the Gapminder homepage, but the plot shown below is the situation in r show_year, with a log-base-10 y-axis for the homicide rates. For each of the four Gapminder regional groups we also fit a simple linear regression line to the points of all countries within the region. Findings such as Fajnzylber, Lederman, and Loayza (2002) suggest that there is a strong positive correlation between Gini coefficient and homicide rate, but we see from the plot that there are regional differences, and of course correlation is not causality. Explanations for this phenomenon are debated and beyond the scope of this post.
We extend the plots to all years 1950-2005. Unfortunately, not all countries are available every year, so we only plot the available countries each year. This means that many African countries are missing from the animation. An improvement would be to try some form of linear interpolation. Furthermore, for the sake of simplicity of illustration, we fix countries with a reported murder rate of zero in a given year (this happens, for example, for Cyprus, Iceland and Fiji in some years) to 0.01 per 100,000 population. This can be nicely animated using the new version of the gganimate package by Thomas Lin Pedersen.
## New version of gganimate. Not on CRAN yet.
## devtools::install_github('thomasp85/gganimate')
require(gganimate)
p <- ggplot(gm2_nozero, aes(x = gini, y = murder_rate, size = population, color = Region)) +
  geom_point() +
  scale_x_continuous(labels = scales::percent) +
  scale_y_continuous(trans = "log10",
                     breaks = trans_breaks("log10", function(x) 10^x, n = 5),
                     labels = trans_format("log10", function(x)
                       ifelse(x < 0,
                              sprintf(paste0("%.", ifelse(is.na(x), "0", round(abs(x))), "f"), 10^x),
                              sprintf("%.0f", 10^x)))) +
  geom_smooth(se = FALSE, method = "lm", formula = y ~ x) +
  geom_text(data = gm2, aes(x = gini, y = murder_rate, label = country),
            vjust = 0.9, show.legend = FALSE) +
  ylab("Murder rate (per 100,000 population)") +
  xlab("Gini coefficient (in %)") +
  guides(size = FALSE) +
  labs(title = 'Year: {frame_time}') +
  transition_time(year) +
  ease_aes('linear')
animate(p, nframes = length(unique(gm2$year)), fps = 4, width = 800, height = 400, res = 100)

Discussion

Based on the available Gapminder data we showed that in the last 25 years the Gini coefficient for the world's income distribution has decreased. For several individual countries opposite dynamics are, however, observed. One particular concern is the share that the richest 1% have of the overall wealth: more than 50%.
Literature

Fajnzylber, P., D. Lederman, and N. Loayza. 2002. "Inequality and Violent Crime." Journal of Law and Economics 45 (April). http://siteresources.worldbank.org/DEC/Resources/Crime&Inequality.pdf.
To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....
Statistics Sunday: Mixed Effects Meta-Analysis
(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)
As promised, how to conduct mixed effects meta-analysis in R:
Code used in the video is available here. And I’d recommend the following posts to provide background for this video:
- What is meta-analysis?
- Introduction to effect sizes
- Variance and weights in meta-analysis
- Video on conducting fixed and random effects meta-analysis in R
- And the homepage for the metafor package
To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.
Speed up your R Work
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
Introduction

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base R and other packages.
For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition unrelated data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallel conveniently bundles all of these steps together when working with rquery pipelines.
The partitioning is specified by the user preparing a grouping column that tells the system which sets of rows must be kept together in a correct calculation. We are going to try to demonstrate everything with simple code examples, and minimal discussion.
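The partition-and-distribute pattern can be sketched with base R alone. This toy stand-in for the wrapr helpers uses a grouping column g in the role of the user-prepared partition column:

```r
# Sketch: split rows by a grouping column, process each chunk on a worker,
# and recombine; a base-R stand-in for the wrapr helpers.
library(parallel)

cl2 <- makeCluster(2)
d <- data.frame(g = rep(c("a", "b"), each = 3), x = 1:6)
chunks <- split(d, d$g)              # rows sharing g stay together
res <- parLapply(cl2, chunks, function(ch) {
  ch$y <- cumsum(ch$x)               # stand-in for an expensive per-group step
  ch
})
out <- do.call(rbind, res)
stopCluster(cl2)
out
```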
Keep in mind: unless the pipeline steps have nontrivial cost, the overhead of partitioning and distributing the work may overwhelm any parallel speedup. Also, data.table itself already seems to exploit some thread-level parallelism (notice user time is greater than elapsed time). That being said, in this note we will demonstrate a synthetic example where computation is expensive due to a blow-up in an intermediate join step.
Our example

First we set up our execution environment and example (some details: OSX 10.13.4 on a 2.8 GHz Intel Core i5 Mac Mini (Late 2015 model) with 8 GB RAM and a hybrid disk drive).
library("rqdatatable")
## Loading required package: rquery
library("microbenchmark")
library("ggplot2")
library("WVPlots")
suppressPackageStartupMessages(library("dplyr"))
## Warning: package 'dplyr' was built under R version 3.5.1
base::date()
## [1] "Sun Jul 8 09:05:25 2018"
R.version.string
## [1] "R version 3.5.0 (2018-04-23)"
parallel::detectCores()
## [1] 4
packageVersion("parallel")
## [1] '3.5.0'
packageVersion("rqdatatable")
## [1] '0.1.2'
packageVersion("rquery")
## [1] '0.5.1'
packageVersion("dplyr")
## [1] '0.7.6'

ncore <- parallel::detectCores()
print(ncore)
## [1] 4
cl <- parallel::makeCluster(ncore)
print(cl)
## socket cluster with 4 nodes on host 'localhost'

set.seed(2362)
mk_example <- function(nkey, nrep, ngroup = 20) {
  keys <- paste0("key_", seq_len(nkey))
  key_group <- sample(as.character(seq_len(ngroup)),
                      length(keys), replace = TRUE)
  names(key_group) <- keys
  key_table <- data.frame(
    key = rep(keys, nrep),
    stringsAsFactors = FALSE)
  key_table$data <- runif(nrow(key_table))
  instance_table <- data.frame(
    key = rep(keys, nrep),
    stringsAsFactors = FALSE)
  instance_table$id <- seq_len(nrow(instance_table))
  instance_table$info <- runif(nrow(instance_table))
  # groups should be no finer than keys
  key_table$key_group <- key_group[key_table$key]
  instance_table$key_group <- key_group[instance_table$key]
  list(key_table = key_table, instance_table = instance_table)
}

dlist <- mk_example(10, 10)
data <- dlist$instance_table
annotation <- dlist$key_table

rquery / rqdatatable

rquery and rqdatatable can implement a nontrivial calculation as follows.
# possible data lookup: find rows that
# have lookup data <= info
optree <- local_td(data) %.>%
  natural_join(.,
               local_td(annotation),
               jointype = "INNER", by = "key") %.>%
  select_rows_nse(., data <= info) %.>%
  pick_top_k(.,
             k = 1,
             partitionby = "id",
             orderby = "data",
             reverse = "data",
             keep_order_column = FALSE) %.>%
  orderby(., "id")
cat(format(optree))
## table('data';
##   key,
##   id,
##   info,
##   key_group) %.>%
##  natural_join(.,
##   table('annotation';
##    key,
##    data,
##    key_group),
##   j= INNER, by= key) %.>%
##  select_rows(.,
##   data <= info) %.>%
##  extend(.,
##   row_number := row_number(),
##   p= id,
##   o= "data" DESC) %.>%
##  select_rows(.,
##   row_number <= 1) %.>%
##  drop_columns(.,
##   row_number) %.>%
##  orderby(., id)
res1 <- ex_data_table(optree)
head(res1)
##         data id      info   key key_group
## 1: 0.9152014  1 0.9860654 key_1        20
## 2: 0.5599810  2 0.5857570 key_2         8
## 3: 0.3011882  3 0.3334490 key_3        10
## 4: 0.3650987  4 0.3960980 key_4         5
## 5: 0.1469254  5 0.1753649 key_5        14
## 6: 0.2567631  6 0.3510280 key_6         7
nrow(res1)
## [1] 94

And we can execute the operations in parallel.
parallel::clusterEvalQ(cl, library("rqdatatable"))
## [[1]]
## [1] "rqdatatable" "rquery"      "stats"       "graphics"    "grDevices"
## [6] "utils"       "datasets"    "methods"     "base"
##
## [[2]]
## [1] "rqdatatable" "rquery"      "stats"       "graphics"    "grDevices"
## [6] "utils"       "datasets"    "methods"     "base"
##
## [[3]]
## [1] "rqdatatable" "rquery"      "stats"       "graphics"    "grDevices"
## [6] "utils"       "datasets"    "methods"     "base"
##
## [[4]]
## [1] "rqdatatable" "rquery"      "stats"       "graphics"    "grDevices"
## [6] "utils"       "datasets"    "methods"     "base"
res2 <- ex_data_table_parallel(optree, "key_group", cl)
head(res2)
##         data id      info   key key_group
## 1: 0.9152014  1 0.9860654 key_1        20
## 2: 0.5599810  2 0.5857570 key_2         8
## 3: 0.3011882  3 0.3334490 key_3        10
## 4: 0.3650987  4 0.3960980 key_4         5
## 5: 0.1469254  5 0.1753649 key_5        14
## 6: 0.2567631  6 0.3510280 key_6         7
nrow(res2)
## [1] 94

data.table

data.table can implement the same function.
library("data.table")
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
##     between, first, last
packageVersion("data.table")
## [1] '1.11.4'
data_table_f <- function(data, annotation) {
  data <- data.table::as.data.table(data)
  annotation <- data.table::as.data.table(annotation)
  joined <- merge(data, annotation,
                  by = "key",
                  all = FALSE,
                  allow.cartesian = TRUE)
  joined <- joined[joined$data <= joined$info, ]
  data.table::setorderv(joined, cols = "data")
  joined <- joined[, .SD[.N], id]
  data.table::setorderv(joined, cols = "id")
}
resdt <- data_table_f(data, annotation)
head(resdt)
##    id   key      info key_group.x      data key_group.y
## 1:  1 key_1 0.9860654          20 0.9152014          20
## 2:  2 key_2 0.5857570           8 0.5599810           8
## 3:  3 key_3 0.3334490          10 0.3011882          10
## 4:  4 key_4 0.3960980           5 0.3650987           5
## 5:  5 key_5 0.1753649          14 0.1469254          14
## 6:  6 key_6 0.3510280           7 0.2567631           7
nrow(resdt)
## [1] 94

We can also run data.table in parallel using wrapr::execute_parallel.
parallel::clusterEvalQ(cl, library("data.table"))
## [[1]]
## [1] "data.table"  "rqdatatable" "rquery"      "stats"       "graphics"
## [6] "grDevices"   "utils"       "datasets"    "methods"     "base"
##
## [[2]]
## [1] "data.table"  "rqdatatable" "rquery"      "stats"       "graphics"
## [6] "grDevices"   "utils"       "datasets"    "methods"     "base"
##
## [[3]]
## [1] "data.table"  "rqdatatable" "rquery"      "stats"       "graphics"
## [6] "grDevices"   "utils"       "datasets"    "methods"     "base"
##
## [[4]]
## [1] "data.table"  "rqdatatable" "rquery"      "stats"       "graphics"
## [6] "grDevices"   "utils"       "datasets"    "methods"     "base"
parallel::clusterExport(cl, "data_table_f")
dt_f <- function(tables_list) {
  data <- tables_list$data
  annotation <- tables_list$annotation
  data_table_f(data, annotation)
}
data_table_parallel_f <- function(data, annotation) {
  respdt <- wrapr::execute_parallel(
      tables = list(data = data,
                    annotation = annotation),
      f = dt_f,
      partition_column = "key_group",
      cl = cl) %.>%
    data.table::rbindlist(.)
  data.table::setorderv(respdt, cols = "id")
  respdt
}
respdt <- data_table_parallel_f(data, annotation)
head(respdt)
##    id   key      info key_group.x      data key_group.y
## 1:  1 key_1 0.9860654          20 0.9152014          20
## 2:  2 key_2 0.5857570           8 0.5599810           8
## 3:  3 key_3 0.3334490          10 0.3011882          10
## 4:  4 key_4 0.3960980           5 0.3650987           5
## 5:  5 key_5 0.1753649          14 0.1469254          14
## 6:  6 key_6 0.3510280           7 0.2567631           7
nrow(respdt)
## [1] 94

dplyr

dplyr can also implement the example.
dplyr_pipeline <- function(data, annotation) { res <- data %>% inner_join(annotation, by = "key") %>% filter(data <= info) %>% group_by(id) %>% arrange(data) %>% mutate(rownum = row_number()) %>% ungroup() %>% filter(rownum == 1) %>% arrange(id) res } resd <- dplyr_pipeline(data, annotation) head(resd) ## # A tibble: 6 x 7 ## key id info key_group.x data key_group.y rownum ## ## 1 key_1 1 0.986 20 0.915 20 1 ## 2 key_2 2 0.586 8 0.560 8 1 ## 3 key_3 3 0.333 10 0.301 10 1 ## 4 key_4 4 0.396 5 0.365 5 1 ## 5 key_5 5 0.175 14 0.147 14 1 ## 6 key_6 6 0.351 7 0.257 7 1 nrow(resd) ## [1] 94 And we can use wrapr::execute_parallel to parallelize the dplyr solution.
parallel::clusterEvalQ(cl, library("dplyr")) ## [[1]] ## [1] "dplyr" "data.table" "rqdatatable" "rquery" "stats" ## [6] "graphics" "grDevices" "utils" "datasets" "methods" ## [11] "base" ## ## [[2]] ## [1] "dplyr" "data.table" "rqdatatable" "rquery" "stats" ## [6] "graphics" "grDevices" "utils" "datasets" "methods" ## [11] "base" ## ## [[3]] ## [1] "dplyr" "data.table" "rqdatatable" "rquery" "stats" ## [6] "graphics" "grDevices" "utils" "datasets" "methods" ## [11] "base" ## ## [[4]] ## [1] "dplyr" "data.table" "rqdatatable" "rquery" "stats" ## [6] "graphics" "grDevices" "utils" "datasets" "methods" ## [11] "base" parallel::clusterExport(cl, "dplyr_pipeline") dplyr_f <- function(tables_list) { data <- tables_list$data annotation <- tables_list$annotation dplyr_pipeline(data, annotation) } dplyr_parallel_f <- function(data, annotation) { respdt <- wrapr::execute_parallel( tables = list(data = data, annotation = annotation), f = dplyr_f, partition_column = "key_group", cl = cl) %>% dplyr::bind_rows() %>% arrange(id) } respdplyr <- dplyr_parallel_f(data, annotation) head(respdplyr) ## # A tibble: 6 x 7 ## key id info key_group.x data key_group.y rownum ## ## 1 key_1 1 0.986 20 0.915 20 1 ## 2 key_2 2 0.586 8 0.560 8 1 ## 3 key_3 3 0.333 10 0.301 10 1 ## 4 key_4 4 0.396 5 0.365 5 1 ## 5 key_5 5 0.175 14 0.147 14 1 ## 6 key_6 6 0.351 7 0.257 7 1 nrow(respdplyr) ## [1] 94 Benchmark: We can benchmark the various realizations.
dlist <- mk_example(300, 300) data <- dlist$instance_table annotation <- dlist$key_table timings <- microbenchmark( data_table_parallel = nrow(data_table_parallel_f(data, annotation)), data_table = nrow(data_table_f(data, annotation)), rqdatatable_parallel = nrow(ex_data_table_parallel(optree, "key_group", cl)), rqdatatable = nrow(ex_data_table(optree)), dplyr_parallel = nrow(dplyr_parallel_f(data, annotation)), dplyr = nrow(dplyr_pipeline(data, annotation)), times = 10L) saveRDS(timings, "Parallel_rqdatatable_timings.RDS") Conclusion: print(timings) ## Unit: seconds ## expr min lq mean median uq ## data_table_parallel 5.274560 5.457105 5.609827 5.546554 5.686829 ## data_table 9.401677 9.496280 9.701807 9.541218 9.748159 ## rqdatatable_parallel 7.165216 7.497561 7.587663 7.563883 7.761987 ## rqdatatable 12.490469 12.700474 13.320480 12.898154 14.229233 ## dplyr_parallel 6.492262 6.572062 6.784865 6.787277 6.875076 ## dplyr 20.056555 20.450064 20.647073 20.564529 20.800350 ## max neval ## 6.265888 10 ## 10.419316 10 ## 7.949404 10 ## 14.282269 10 ## 7.328223 10 ## 21.332103 10 # autoplot(timings) timings <- as.data.frame(timings) timings$seconds <- timings$time/1e+9 ScatterBoxPlotH(timings, xvar = "seconds", yvar = "expr", title="task duration by method") The benchmark timings show parallelized data.table is the fastest, followed by parallelized dplyr, and parallelized rqdatatable. In the non-parallelized case data.table is the fastest, followed by rqdatatable, and then dplyr.
A reason dplyr sees a greater speedup relative to its own non-parallel implementation (yet still does not beat data.table) is that data.table is multithreaded out of the box: it was already exploiting parallelism before we added the process-level parallelism, so it sees less of an additional speedup, though it remains the fastest.
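This implicit parallelism is easy to inspect or cap; the small sketch below (not from the original post) uses data.table's own thread controls, which is handy when trying to make single-threaded benchmarks comparable:

```r
library(data.table)

getDTthreads()   # threads data.table uses out of the box (usually > 1)
setDTthreads(1)  # cap at a single thread, isolating process-level parallelism
getDTthreads()   # now 1
setDTthreads(0)  # 0 restores the default (all logical cores)
```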
rquery pipelines exhibit superior performance on big data systems (Spark, PostgreSQL, Amazon Redshift, and hopefully soon Google BigQuery), and rqdatatable supplies a very good in-memory implementation of the rquery system based on data.table. rquery also speeds up solution development by supplying higher-order operators and early debugging features.
In this note we have demonstrated simple procedures to reliably parallelize any of rqdatatable, data.table, or dplyr.
Note: we did not include alternatives such as multidplyr or dtplyr in the timings, as they did not appear to work on this example.
Materials
The original rendering of this article can be found here, source code here, and raw timings here.
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.
Dealing with heteroskedasticity; regression with robust standard errors using R
(This article was first published on Econometrics and Free Software, and kindly contributed to Rbloggers)
First of all, is it heteroskedasticity or heteroscedasticity? According to
McCulloch (1985),
heteroskedasticity is the proper spelling, because when transliterating Greek words, scientists
use the Latin letter k in place of the Greek letter κ (kappa). κ sometimes is transliterated as
the Latin letter c, but only when these words entered the English language through French, such
as scepter.
Now that this is out of the way, we can get to the meat of this blogpost (foreshadowing pun).
A random variable is said to be heteroskedastic if its variance is not constant. For example,
the variability of expenditures may increase with income. Richer families may spend a similar
amount on groceries as poorer people, but some rich families will sometimes buy expensive
items such as lobster. The variability of expenditures for rich families is thus quite large.
However, the expenditures on food of poorer families, who cannot afford lobster, will not vary much.
Heteroskedasticity can also appear when data is clustered; for example, variability of
expenditures on food may vary from city to city, but is quite constant within a city.
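The income example can be mimicked with a quick simulation (illustrative only, not part of the original analysis): the noise standard deviation grows with x, so the residual spread at high x is several times the spread at low x.

```r
set.seed(1234)
x <- runif(1000, 1, 10)               # stand-in for income
y <- 2 + 3 * x + rnorm(1000, sd = x)  # noise sd grows with x: heteroskedastic
fit <- lm(y ~ x)
res <- residuals(fit)
sd_low  <- sd(res[x < quantile(x, 0.25)])  # residual spread at low "income"
sd_high <- sd(res[x > quantile(x, 0.75)])  # residual spread at high "income"
c(sd_low = sd_low, sd_high = sd_high)      # sd_high clearly exceeds sd_low
```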
To illustrate this, let’s first load all the packages needed for this blog post:
library(robustbase) library(tidyverse) library(sandwich) library(lmtest) library(modelr) library(broom) First, let’s load and prepare the data:
data("education") education <- education %>% rename(residents = X1, per_capita_income = X2, young_residents = X3, per_capita_exp = Y, state = State) %>% mutate(region = case_when( Region == 1 ~ "northeast", Region == 2 ~ "northcenter", Region == 3 ~ "south", Region == 4 ~ "west" )) %>% select(-Region) I will be using the education data set from the {robustbase} package. I renamed some columns
and changed the values of the Region column. Now, let’s do a scatterplot of per capita expenditures
on per capita income:
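The plotting code did not survive formatting; below is a sketch of what it presumably looked like. The data preparation is repeated (with a compact recoding equivalent to the case_when() above) so the chunk runs on its own:

```r
library(robustbase)
library(tidyverse)

data("education")
education <- education %>%
  rename(residents = X1, per_capita_income = X2, young_residents = X3,
         per_capita_exp = Y, state = State) %>%
  mutate(region = c("northeast", "northcenter", "south", "west")[Region]) %>%
  select(-Region)

# per capita expenditures against per capita income
p <- ggplot(education, aes(x = per_capita_income, y = per_capita_exp)) +
  geom_point()
p
```

The per-region view discussed next is the same plot with `facet_wrap(~region)` appended.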
It would seem that, as income increases, variability of expenditures increases too. Let’s look
at the same plot by region:
I don’t think this shows much; it would seem that observations might be clustered, but there are
not enough observations to draw any conclusion from this plot (in any case, drawing conclusions
from plots only is dangerous).
Let’s first run a good ol’ linear regression:
lmfit <- lm(per_capita_exp ~ region + residents + young_residents + per_capita_income, data = education) summary(lmfit) ## ## Call: ## lm(formula = per_capita_exp ~ region + residents + young_residents + ## per_capita_income, data = education) ## ## Residuals: ## Min 1Q Median 3Q Max ## -77.963 -25.499 -2.214 17.618 89.106 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 467.40283 142.57669 3.278 0.002073 ** ## regionnortheast 15.72741 18.16260 0.866 0.391338 ## regionsouth 7.08742 17.29950 0.410 0.684068 ## regionwest 34.32416 17.49460 1.962 0.056258 . ## residents 0.03456 0.05319 0.650 0.519325 ## young_residents 1.30146 0.35717 3.644 0.000719 *** ## per_capita_income 0.07204 0.01305 5.520 1.82e-06 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 39.88 on 43 degrees of freedom ## Multiple R-squared: 0.6292, Adjusted R-squared: 0.5774 ## F-statistic: 12.16 on 6 and 43 DF, p-value: 6.025e-08 Let's test for heteroskedasticity using the Breusch-Pagan test that you can find in the {lmtest}
package:
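The test call itself seems to have been lost in formatting; it is presumably just bptest() applied to the fitted model. The sketch below rebuilds the model so the chunk runs on its own (the compact region recoding stands in for the case_when() above):

```r
library(robustbase)
library(lmtest)
library(tidyverse)

data("education")
education <- education %>%
  rename(residents = X1, per_capita_income = X2, young_residents = X3,
         per_capita_exp = Y, state = State) %>%
  mutate(region = c("northeast", "northcenter", "south", "west")[Region]) %>%
  select(-Region)

lmfit <- lm(per_capita_exp ~ region + residents + young_residents + per_capita_income,
            data = education)
bp <- bptest(lmfit)  # Breusch-Pagan test: H0 is constant residual variance
bp
```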
This test shows that we can reject the null that the variance of the residuals is constant,
thus heteroskedasticity is present. To get the correct standard errors, we can use the vcovHC()
function from the {sandwich} package (hence the choice for the header picture of this post):
By default vcovHC() estimates a heteroskedasticity-consistent (HC) variance-covariance
matrix for the parameters. There are several ways to estimate such an HC matrix, and by default
vcovHC() estimates the “HC3” one. You can refer to Zeileis (2004)
for more details.
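The snippet computing the robust standard errors appears to have been dropped; the usual sandwich idiom is to take the square root of the diagonal of vcovHC(). The sketch below rebuilds the model so the chunk runs on its own (the compact region recoding stands in for the case_when() above):

```r
library(robustbase)
library(sandwich)
library(tidyverse)

data("education")
education <- education %>%
  rename(residents = X1, per_capita_income = X2, young_residents = X3,
         per_capita_exp = Y, state = State) %>%
  mutate(region = c("northeast", "northcenter", "south", "west")[Region]) %>%
  select(-Region)

lmfit <- lm(per_capita_exp ~ region + residents + young_residents + per_capita_income,
            data = education)
hc_se <- sqrt(diag(vcovHC(lmfit)))  # HC3 standard errors, one per coefficient
hc_se
```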
We see that the standard errors are much larger than before! The intercept and regionwest variables
are not statistically significant anymore.
You can achieve the same in one single step:
coeftest(lmfit, vcov = vcovHC(lmfit)) ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 467.402827 311.310887 1.5014 0.14056 ## regionnortheast 15.727405 25.307782 0.6214 0.53759 ## regionsouth 7.087424 23.561063 0.3008 0.76501 ## regionwest 34.324157 24.122587 1.4229 0.16198 ## residents 0.034558 0.091844 0.3763 0.70857 ## young_residents 1.301458 0.688297 1.8908 0.06540 . ## per_capita_income 0.072036 0.029999 2.4013 0.02073 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 It is also easy to change the estimation method for the variance-covariance matrix:
coeftest(lmfit, vcov = vcovHC(lmfit, type = "HC0")) ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 467.402827 172.577569 2.7084 0.009666 ** ## regionnortheast 15.727405 20.488148 0.7676 0.446899 ## regionsouth 7.087424 17.755889 0.3992 0.691752 ## regionwest 34.324157 19.308578 1.7777 0.082532 . ## residents 0.034558 0.054145 0.6382 0.526703 ## young_residents 1.301458 0.387743 3.3565 0.001659 ** ## per_capita_income 0.072036 0.016638 4.3296 8.773e-05 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 As I wrote above, by default, the type argument is equal to “HC3”.
Another way of dealing with heteroskedasticity is to use the lmrob() function from the
{robustbase} package. This package is quite interesting and offers quite a lot of functions
for robust linear and nonlinear regression models. Running a robust linear regression
is just the same as with lm():
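The lmrob() call seems to have been lost in formatting; a sketch of what it presumably looked like, with the same formula as the lm() fit (data preparation repeated, with a compact recoding equivalent to the case_when() above, so the chunk runs on its own):

```r
library(robustbase)
library(tidyverse)

data("education")
education <- education %>%
  rename(residents = X1, per_capita_income = X2, young_residents = X3,
         per_capita_exp = Y, state = State) %>%
  mutate(region = c("northeast", "northcenter", "south", "west")[Region]) %>%
  select(-Region)

# same formula as lm(), robust MM-estimation instead of least squares
lmrobfit <- lmrob(per_capita_exp ~ region + residents + young_residents + per_capita_income,
                  data = education)
summary(lmrobfit)
```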
This, however, gives you different estimates than when fitting a linear regression model.
The estimates should be the same; only the standard errors should be different. This is because
the estimation method is different and is also robust to outliers (at least that’s my understanding;
I haven’t read the theoretical papers behind the package yet).
Finally, it is also possible to bootstrap the standard errors. For this I will use the
bootstrap() function from the {modelr} package:
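The call creating boot_education seems to have been dropped; it is presumably a single modelr::bootstrap() call producing 100 resamples (data preparation repeated, with a compact recoding equivalent to the case_when() above, so the chunk runs on its own):

```r
library(robustbase)
library(tidyverse)
library(modelr)

data("education")
education <- education %>%
  rename(residents = X1, per_capita_income = X2, young_residents = X3,
         per_capita_exp = Y, state = State) %>%
  mutate(region = c("northeast", "northcenter", "south", "west")[Region]) %>%
  select(-Region)

# 100 bootstrap resamples of the prepared data, plus an .id column
boot_education <- modelr::bootstrap(education, 100)
boot_education
```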
Let’s take a look at the boot_education object:
boot_education ## # A tibble: 100 x 2 ## strap .id ## ## 1 001 ## 2 002 ## 3 003 ## 4 004 ## 5 005 ## 6 006 ## 7 007 ## 8 008 ## 9 009 ## 10 010 ## # ... with 90 more rows The column strap contains resamples of the original data. I will run my linear regression
from before on each of the resamples:
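The chunk mapping lm() over the resamples appears to be missing; below is a sketch of the presumable purrr idiom, with the whole pipeline rebuilt so the chunk runs on its own:

```r
library(robustbase)
library(tidyverse)
library(modelr)

data("education")
education <- education %>%
  rename(residents = X1, per_capita_income = X2, young_residents = X3,
         per_capita_exp = Y, state = State) %>%
  mutate(region = c("northeast", "northcenter", "south", "west")[Region]) %>%
  select(-Region)
boot_education <- modelr::bootstrap(education, 100)

# fit the same regression on every bootstrap resample, stored as a list-column
regressions <- boot_education %>%
  mutate(regressions = map(strap,
    ~ lm(per_capita_exp ~ region + residents + young_residents + per_capita_income,
         data = .)))
```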
I have added a new column called regressions which contains the linear regressions on each
bootstrapped sample. Now, I will create a list of tidied regression results:
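The tidying step also seems to have been lost; it is presumably a second map() over the list-column, this time with broom::tidy(). The pipeline is rebuilt so the chunk runs on its own:

```r
library(robustbase)
library(tidyverse)
library(modelr)
library(broom)

data("education")
education <- education %>%
  rename(residents = X1, per_capita_income = X2, young_residents = X3,
         per_capita_exp = Y, state = State) %>%
  mutate(region = c("northeast", "northcenter", "south", "west")[Region]) %>%
  select(-Region)
boot_education <- modelr::bootstrap(education, 100)

tidied <- boot_education %>%
  mutate(regressions = map(strap,
           ~ lm(per_capita_exp ~ region + residents + young_residents + per_capita_income,
                data = .)),
         tidy_lm = map(regressions, broom::tidy))  # one tidy data frame per fit
```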
broom::tidy() creates a data frame of the regression results. Let’s look at one of these:
tidied$tidy_lm[[1]] ## term estimate std.error statistic p.value ## 1 (Intercept) 515.19835839 129.61828003 3.9747353 2.648477e-04 ## 2 regionnortheast 11.75706535 18.72014312 0.6280436 5.332970e-01 ## 3 regionsouth 8.89139412 15.51588707 0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents 0.08921914 0.05557157 1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09 This format is easier to handle than the standard lm() output:
tidied$regressions[[1]] ## ## Call: ## lm(formula = per_capita_exp ~ region + residents + young_residents + ## per_capita_income, data = .) ## ## Coefficients: ## (Intercept) regionnortheast regionsouth ## 515.19836 11.75707 8.89139 ## regionwest residents young_residents ## 25.96218 0.08922 1.41204 ## per_capita_income ## 0.08609Now that I have all these regression results, I can compute any statistic I need. But first,
let’s transform the data even further:
list_mods is a list of the tidy_lm data frames. I now add an index and
bind the rows together (by using map2_df() instead of map2()):
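The code for these two steps seems to have been dropped; a sketch of the presumable pull()/map2_df() combination, with the pipeline rebuilt so the chunk runs on its own:

```r
library(robustbase)
library(tidyverse)
library(modelr)
library(broom)

data("education")
education <- education %>%
  rename(residents = X1, per_capita_income = X2, young_residents = X3,
         per_capita_exp = Y, state = State) %>%
  mutate(region = c("northeast", "northcenter", "south", "west")[Region]) %>%
  select(-Region)
boot_education <- modelr::bootstrap(education, 100)
tidied <- boot_education %>%
  mutate(regressions = map(strap,
           ~ lm(per_capita_exp ~ region + residents + young_residents + per_capita_income,
                data = .)),
         tidy_lm = map(regressions, broom::tidy))

list_mods <- tidied %>% pull(tidy_lm)            # a plain list of tidied results
mods_df <- map2_df(list_mods, seq_along(list_mods),
                   ~ mutate(.x, resample = .y))  # add an index and bind the rows
```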
Let’s take a look at the final object:
head(mods_df, 25) ## term estimate std.error statistic p.value ## 1 (Intercept) 515.19835839 129.61828003 3.9747353 2.648477e-04 ## 2 regionnortheast 11.75706535 18.72014312 0.6280436 5.332970e-01 ## 3 regionsouth 8.89139412 15.51588707 0.5730510 5.695950e-01 ## 4 regionwest 25.96217940 17.48335459 1.4849656 1.448476e-01 ## 5 residents 0.08921914 0.05557157 1.6054819 1.157079e-01 ## 6 young_residents 1.41203710 0.35297506 4.0003877 2.447887e-04 ## 7 per_capita_income 0.08609340 0.01191468 7.2258276 6.069534e-09 ## 8 (Intercept) 526.78466908 142.64004523 3.6931050 6.207704e-04 ## 9 regionnortheast 3.35909252 21.97165783 0.1528830 8.792057e-01 ## 10 regionsouth 2.62845462 17.20973263 0.1527307 8.793251e-01 ## 11 regionwest 26.40527188 20.50214280 1.2879274 2.046593e-01 ## 12 residents 0.04834920 0.05722962 0.8448282 4.028830e-01 ## 13 young_residents 1.49618012 0.37091593 4.0337445 2.208994e-04 ## 14 per_capita_income 0.07489435 0.01179888 6.3475800 1.140844e-07 ## 15 (Intercept) 466.17160566 130.18954610 3.5807146 8.659014e-04 ## 16 regionnortheast 6.75977787 17.24994600 0.3918724 6.970880e-01 ## 17 regionsouth 6.31061828 16.27953099 0.3876413 7.001938e-01 ## 18 regionwest 27.89261897 15.43852463 1.8066894 7.781178e-02 ## 19 residents 0.08760677 0.05007780 1.7494134 8.735434e-02 ## 20 young_residents 1.23763741 0.32965208 3.7543746 5.168492e-04 ## 21 per_capita_income 0.08469609 0.01236601 6.8491057 2.128982e-08 ## 22 (Intercept) 316.44265722 166.96769979 1.8952328 6.479692e-02 ## 23 regionnortheast 11.54907449 14.95746934 0.7721276 4.442620e-01 ## 24 regionsouth 9.25759229 16.89974638 0.5477947 5.866654e-01 ## 25 regionwest 31.83905855 16.56287888 1.9223143 6.120738e-02 ## resample ## 1 1 ## 2 1 ## 3 1 ## 4 1 ## 5 1 ## 6 1 ## 7 1 ## 8 2 ## 9 2 ## 10 2 ## 11 2 ## 12 2 ## 13 2 ## 14 2 ## 15 3 ## 16 3 ## 17 3 ## 18 3 ## 19 3 ## 20 3 ## 21 3 ## 22 4 ## 23 4 ## 24 4 ## 25 4 Now this is a very useful format, because I can now group by the term column and compute any
statistics I need, in the present case the standard deviation:
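The summarising chunk appears to have been lost; it is presumably a group_by()/summarise() pair taking the standard deviation of the bootstrap estimates per term. The pipeline is rebuilt so the chunk runs on its own:

```r
library(robustbase)
library(tidyverse)
library(modelr)
library(broom)

data("education")
education <- education %>%
  rename(residents = X1, per_capita_income = X2, young_residents = X3,
         per_capita_exp = Y, state = State) %>%
  mutate(region = c("northeast", "northcenter", "south", "west")[Region]) %>%
  select(-Region)
boot_education <- modelr::bootstrap(education, 100)
tidied <- boot_education %>%
  mutate(regressions = map(strap,
           ~ lm(per_capita_exp ~ region + residents + young_residents + per_capita_income,
                data = .)),
         tidy_lm = map(regressions, broom::tidy))
list_mods <- tidied %>% pull(tidy_lm)
mods_df <- map2_df(list_mods, seq_along(list_mods),
                   ~ mutate(.x, resample = .y))

# the bootstrapped standard error: per-term sd of the estimates across resamples
r.std.error <- mods_df %>%
  group_by(term) %>%
  summarise(r.std.error = sd(estimate))
r.std.error
```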
We can append this column to the linear regression model result:
lmfit %>% broom::tidy() %>% full_join(r.std.error) %>% select(term, estimate, std.error, r.std.error) ## Joining, by = "term" ## term estimate std.error r.std.error ## 1 (Intercept) 467.40282655 142.57668653 226.08735720 ## 2 regionnortheast 15.72740519 18.16259519 19.65120904 ## 3 regionsouth 7.08742365 17.29949951 19.05153934 ## 4 regionwest 34.32415663 17.49459866 18.72767470 ## 5 residents 0.03455770 0.05318811 0.06285984 ## 6 young_residents 1.30145825 0.35716636 0.50928879 ## 7 per_capita_income 0.07203552 0.01305066 0.02109277 As you see, using the whole bootstrapping procedure is longer than simply using either one of
the first two methods. However, this procedure is very flexible and can thus be adapted to a very
large range of situations. Either way, in the case of heteroskedasticity, you can see that
results vary a lot depending on the procedure you use, so I would advise using them all as
robustness checks and discussing the differences.
If you found this blog post useful, you might want to follow me on twitter
for blog post updates.
To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.