
Let’s Have Some Sympathy For The Part-time R User

Fri, 08/04/2017 - 16:38

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

When I started writing about methods for better "parametric programming" interfaces for dplyr for R users in December of 2016, I encountered three divisions in the audience:

  • dplyr users who had such a need, and wanted such extensions.
  • dplyr users who did not have such a need ("we always know the column names").
  • dplyr users who found the then-current fairly complex "underscore" and lazyeval system sufficient for the task.

Needing name substitution is a problem an advanced full-time R user can solve on their own. However, a part-time R user would greatly benefit from a simple, reliable, readable, documented, and comprehensible packaged solution.

Background

Roughly I suggested two possible methods for making the task easier:

  • Renaming views for data.frames. I have now implemented the idea as a call-scoped concept in replyr::replyr_apply_f_mapped() ("call-scoped", meaning the re-mapping lasts for the duration of a function call).
  • Symbol re-binding by a block-scoped command called let() (a common functional notation; and "block-scoped" meaning the re-mapping lasts for the span of a code-block). I released this solution to CRAN and publicly announced it on December 8 2016.

I mention dates to point out that this is something I have been inviting public comment on for some time.

Things change. Since the above time:

  • The development version of dplyr incorporated a new rlang/tidyeval package (probably around February 14th 2017).
  • rlang/tidyeval was released to CRAN in May 2017. Obviously rlang/tidyeval had been under development for some time, but I don’t think the parametric aspect of it was publicly discussed much before February 16, 2017 (notice that a formula-centric interface was still being contemplated).
  • dplyr 0.7.0 was released, based on rlang/tidyeval (June 9th, 2017).
  • dplyr excised direct use of lazyeval.
  • The dplyr "underscore verbs" (or methods) were all deprecated (i.e., no longer advised).

The rlang/tidyeval strategy is to capture un-evaluated user expressions (as a new object called a "quosure") and evaluate them with new language rules (with new bindings and something called an "overscope"). Also note that the rlang/tidyeval strategy is full integration or re-writing of packages in terms of rlang/tidyeval; it isn’t something you mix in or turn on and off.

Some points that I think have been under-represented in previous discussions include:

  • Not all R users consider themselves to be expert programmers (many are happy calling themselves analysts or statisticians).
  • R is often used in collaborative projects where there are varying levels of programming expertise.

The second point I think is particularly interesting. It means:

An R user who does not consider themselves an expert programmer could be maintaining code that they understand, but could not be expected to create from scratch.

Or:

Let’s have some sympathy for the part-time R user.

This is the point we will emphasize in our new example.

The example

The design and discussion of substitution solutions should be driven by concrete, realistic use cases. Working from larger examples gives us a taste of what working with each solution is like in practice. So, let’s pretend to discuss social science (instead of programming).

Suppose an analyst, psychologist, medical doctor, or scientist is building an assessment for some aspects of behavior and anxiety.

Often such assessments involve moving through a multiple-choice questionnaire and collecting a number of points that depend on the answers selected. One such assessment is the Generalized Anxiety Disorder 7 questionnaire (or GAD-7). It is a very simple system, as can be seen below.

One can treat such a test score as a classifier and assess it in terms of sensitivity, specificity, and different correspondence measures.

An obvious extension of such tests is to give a different number of points in different categories for each multiple-choice answer. For example we could imagine such a test where each answer gave a varying number of points in one of two categories called "withdrawal behavior" and "positive re-framing" (both in the sense of coping behaviors).

For example, our scientist might record the results of two subjects taking a test as follows:

d <- data.frame(
  subjectID = c(1, 1, 2, 2),
  surveyCategory = c(
    'withdrawal behavior',
    'positive re-framing',
    'withdrawal behavior',
    'positive re-framing'
  ),
  assessmentTotal = c(5, 2, 3, 4),
  stringsAsFactors = FALSE
)
print(d)

##   subjectID      surveyCategory assessmentTotal
## 1         1 withdrawal behavior               5
## 2         1 positive re-framing               2
## 3         2 withdrawal behavior               3
## 4         2 positive re-framing               4

# or in "wide form":
library("cdata")
moveValuesToColumns(d,
                    columnToTakeKeysFrom = 'surveyCategory',
                    columnToTakeValuesFrom = 'assessmentTotal',
                    rowKeyColumns = 'subjectID')

##   subjectID positive re-framing withdrawal behavior
## 1         1                   2                   5
## 2         2                   4                   3

A natural question is: how does one assign weights to each answer? One way would be to administer the test to a number of people the experimenter has classified as having either of the above-mentioned behaviors, and then perform a logistic regression to map assessment answers to the probability of a given diagnosis for this population (sketched below). By re-scaling the weights and rounding them to small integers we could have a test point system that is very close to performing a logistic regression classification. We may then be able to use the same assessment questions in a much more decisive manner than assigning all questions the same number of points.
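
A hypothetical sketch of deriving such weights (the data frame training_d, its 0/1 answer columns, and the expert-assigned diagnosis column are assumptions for illustration, not part of the example data used below):

# training_d: one row per respondent, 0/1 columns for the answers chosen,
# plus an expert-assigned 0/1 column "diagnosis" (all hypothetical)
fit <- glm(diagnosis ~ ., data = training_d, family = binomial())
round(coef(fit))  # rounded coefficients could serve as per-answer point values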

This sort of idea is what one would expect from a mixed and collaborating team that includes medical experts, statistics experts, and programmers. After some work our team might work out that scoring the assessment can be done by the simple R dplyr pipeline:

suppressPackageStartupMessages(library("dplyr"))

scale <- 0.237

d %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale)/
             sum(exp(assessmentTotal * scale)))

## # A tibble: 4 x 4
## # Groups:   subjectID [2]
##   subjectID      surveyCategory assessmentTotal probability
##
## 1         1 withdrawal behavior               5   0.6706221
## 2         1 positive re-framing               2   0.3293779
## 3         2 withdrawal behavior               3   0.4410258
## 4         2 positive re-framing               4   0.5589742

For each subject we take the row with maximal probability as the diagnosis. The diagnosis was already obvious from the original scores; the main addition is that the diagnosis confidence is now available as a probability estimate.

Each step of the above pipeline is learn-able:

  • The group_by() is arranging all rows associated with the same subject to work together in later calculations.
  • The exp(assessmentTotal * scale)/sum(exp(assessmentTotal * scale)) expression is the classic "sigmoid link" from logistic regression. It is the standard way (once you know it) of turning a free score into a probability estimate (a quick numeric check follows this list).
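
With two rows per subject the expression reduces to the logistic (sigmoid) function of the score difference; a quick check for subject 1 (this check is mine, not part of the original pipeline):

scale <- 0.237
exp(5 * scale) / (exp(5 * scale) + exp(2 * scale))  # approximately 0.6706, matching the table above
plogis((5 - 2) * scale)                             # the same value via the logistic function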

Suppose this assessment is tested and works well. It is then plausible that the team might ask their R expert to help them construct a much more complicated dplyr pipeline that better formats the results. Under Harlan Mills’ "Surgical Team" proposal (made famous in Frederick Brooks’ The Mythical Man-Month) we expect effective data science teams to have a diversity of deep expertise (not everybody knows everything, but a lot is known by the total team). We expect a well-staffed research team to include the statistician who worked out the sigmoid transform above, and a programmer who works out the pipeline we give below.

d %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale)/
             sum(exp(assessmentTotal * scale))) %>%
  arrange(probability, surveyCategory) %>%
  mutate(isDiagnosis = row_number() == n()) %>%
  filter(isDiagnosis) %>%
  ungroup() %>%
  select(subjectID, surveyCategory, probability) %>%
  rename(diagnosis = surveyCategory) %>%
  arrange(subjectID)

## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

This is indeed a long (and expert-level) pipeline. But the principle is:

  • It does useful work (concentrates down to the rows we want and ensures good presentation column names and sorting).
  • While a part-time R user would not be expected to come up with it, they could (with cooperation from the pipeline author) understand all the steps and safely use the pipeline in their project.
  • The application (which we spent some time describing) is what the team cares about; the pipeline is a means to an end (so even though it is long, it isn’t often the central subject of interest).
  • The longer pipeline is paying the bills, and helping patients. So some pain and cost are to be tolerated.

Let’s take this deliberately long (so as to be a strong test) example and see how hard the pipeline is to re-use under different methodologies.

Re-use

An issue that comes up is: can the team re-use the pipeline on another project? Suppose in their next project the ID column isn’t "subjectID" but it is "patientID" (and so on). Obviously they can copy and paste the original pipeline and change the names (which is not a bad practice for the first few re-uses).

But once this procedure is going to be used many times it is a good idea to wrap it up or genericize it so it can be safely re-adapted (so the users can’t accidentally forget to change one name in one place).

I will now walk through a number of approaches to this in terms of how hard they are on the researcher. We are assuming their R expert does the wrapping for them, but then must explain the concepts to the part-time R user so they truly understand and can maintain the tools they are using.

For our example we assume all the column names are coming from variables set somewhere else (in another R script, or coming from a spreadsheet that is read into R, or some other source). The nature of the columns is constant from analysis to analysis, but the exact names used may vary. For our example the column names are:

idCol <- "subjectID"
categoryCol <- "surveyCategory"
linkScoreCol <- "assessmentTotal"
indicatorCol <- "isDiagnosis"
probScoreCol <- "probability"
outcomeCol <- "diagnosis"

wrapr solution

In my opinion the easiest solution (in terms of cognitive load) is wrapr::let(). The R expert would share the following code:

library("wrapr") let( c( IDCOL = idCol, CATEGORYCOL = categoryCol, LINKSCORECOL = linkScoreCol, INDICATORCOL = indicatorCol, PROBSCORECOL = probScoreCol, OUTCOMECOL = outcomeCol ), d %>% group_by(IDCOL) %>% mutate(PROBSCORECOL = exp(LINKSCORECOL * scale)/ sum(exp(LINKSCORECOL * scale))) %>% arrange(PROBSCORECOL, CATEGORYCOL) %>% mutate(INDICATORCOL = row_number() == n()) %>% filter(INDICATORCOL) %>% ungroup() %>% select(IDCOL, CATEGORYCOL, PROBSCORECOL) %>% rename(OUTCOMECOL = CATEGORYCOL) %>% arrange(IDCOL) ) ## # A tibble: 2 x 3 ## subjectID diagnosis probability ## ## 1 1 withdrawal behavior 0.6706221 ## 2 2 positive re-framing 0.5589742

The concept is:

"let() works as if you had written the code with the names substituted as shown in the c() block."

And there is ample documentation showing how this can be used. Notice creating this code is completely mechanical (replace concrete names with the all-caps place holders) and the execution has an easy mental model (the place-holders are replaced with names stored in the variables).
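
For instance, a minimal sketch outside the pipeline (the iris column name and the variable holding it are illustrative choices of mine, not from the article):

library("wrapr")
columnWeWant <- "Sepal.Length"  # imagine the column name arriving from elsewhere
let(
  c(TARGETCOL = columnWeWant),
  summary(iris$TARGETCOL)  # behaves as if we had typed summary(iris$Sepal.Length)
)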

In this solution the adapted code looks like the original code.

replyr solution

The next easiest method in concept is replyr_apply_f_mapped().

The R expert would write the following, and the part-time R user (with some coaching) could maintain it.

library("replyr") d %>% replyr_apply_f_mapped( nmap = c( IDCOL = idCol, CATEGORYCOL = categoryCol, LINKSCORECOL = linkScoreCol, INDICATORCOL = indicatorCol, PROBSCORECOL = probScoreCol, OUTCOMECOL = outcomeCol ), f = . %>% group_by(IDCOL) %>% mutate(PROBSCORECOL = exp(LINKSCORECOL * scale)/ sum(exp(LINKSCORECOL * scale))) %>% arrange(PROBSCORECOL, CATEGORYCOL) %>% mutate(INDICATORCOL = row_number() == n()) %>% filter(INDICATORCOL) %>% ungroup() %>% select(IDCOL, CATEGORYCOL, PROBSCORECOL) %>% rename(OUTCOMECOL = CATEGORYCOL) %>% arrange(IDCOL) ) ## # A tibble: 2 x 3 ## subjectID diagnosis probability ## ## 1 1 withdrawal behavior 0.6706221 ## 2 2 positive re-framing 0.5589742

What the code does is exactly this:

  • It renames all of the columns in the data.frame to have the chosen names (in this case the all-caps names).
  • It then applies the user-supplied function f() to this data.frame.
  • The reverse of the name-mapping is applied to the result of f(), moving columns back to their original names.

The concept is:

replyr_apply_f_mapped() renames columns and back.

Below is an illustrative example showing the column names seen inside and outside the user supplied function.

print(colnames(d))

## [1] "subjectID"       "surveyCategory"  "assessmentTotal"

d %>%
  replyr_apply_f_mapped(
    nmap = c(
      IDCOL = idCol,
      CATEGORYCOL = categoryCol,
      LINKSCORECOL = linkScoreCol,
      INDICATORCOL = indicatorCol,
      PROBSCORECOL = probScoreCol,
      OUTCOMECOL = outcomeCol
    ),
    f = function(df) {
      df$PROBSCORECOL <- 1
      print(colnames(df))
      return(df)
    }
  ) %>%
  colnames()

## [1] "IDCOL"        "CATEGORYCOL"  "LINKSCORECOL" "PROBSCORECOL"
## [1] "subjectID"       "surveyCategory"  "assessmentTotal" "probability"

This is teachable and something the part-time R user can correctly extend and maintain. Though the user may possibly need to learn about wrapping a pipeline as an anonymous function (the ". %>%" notation).

rlang/tidyeval solution

For the rlang/tidyeval solution the expert writes the following code:

IDSYM <- rlang::sym(idCol)
CATEGORYSYM <- rlang::sym(categoryCol)
LINKSCORESYM <- rlang::sym(linkScoreCol)
INDICATORSYM <- rlang::sym(indicatorCol)
PROBSCORESYM <- rlang::sym(probScoreCol)
OUTCOMESYM <- rlang::sym(outcomeCol)

d %>%
  group_by(!!IDSYM) %>%
  mutate(!!PROBSCORESYM :=
           exp((!!LINKSCORESYM) * scale)/
             sum(exp((!!LINKSCORESYM) * scale))) %>%
  arrange(!!PROBSCORESYM, !!CATEGORYSYM) %>%
  mutate(!!INDICATORSYM := row_number() == n()) %>%
  filter(!!INDICATORSYM) %>%
  ungroup() %>%
  select(!!IDSYM, !!CATEGORYSYM, !!PROBSCORESYM) %>%
  rename(!!OUTCOMESYM := !!CATEGORYSYM) %>%
  arrange(!!IDSYM)

## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

Several points have to be taught to the part-time R user if this code is to be maintained:

  • The "!!" symbol does not have the same operator precedence as an assignment symbols such as "=" or ":=", so you must often place "!!"-expressions in extra parentheses.
  • In any assignment we must use ":=" for assignment when using "!!" on the left-hand side of the assignment.

The above are just some syntax edge-cases; we haven’t even gone into teaching rlang::sym(), "!!", and the theory and semantics of quasi-quotation.

seplyr solution

seplyr is an experiment to see what a referentially transparent (or completely value oriented) interface to dplyr would look like. Please don’t think of seplyr as an adapter (though it is, it sends all work to dplyr), but as an illustration of what a completely value-oriented dplyr might look like (i.e., one that did not capture un-evaluated user code through non-standard evaluation). Roughly seplyr is an experiment of the form: "what if one tried harder with something like the new dplyr::*_at() verbs."

Most of the seplyr methods are named *_se() and are designed to be very similar to their dplyr equivalents (and some are nearly identical to dplyr::*_at() methods, rename_se() being a notable exception).

library("seplyr") suppressPackageStartupMessages(library("glue")) d %>% group_by_se(idCol) %>% mutate_se(probScoreCol := glue('exp({linkScoreCol} * scale)/ sum(exp({linkScoreCol} * scale))')) %>% arrange_se(c(probScoreCol, categoryCol)) %>% mutate_se(indicatorCol := "row_number() == n()") %>% filter_se(indicatorCol) %>% ungroup() %>% select_se(c(idCol, categoryCol, probScoreCol)) %>% rename_se(outcomeCol := categoryCol) %>% arrange_se(idCol) ## # A tibble: 2 x 3 ## subjectID diagnosis probability ## ## 1 1 withdrawal behavior 0.6706221 ## 2 2 positive re-framing 0.5589742

The concept is:

"Only mutate needs non-standard evaluation."

seplyr accepts general expressions in many more places, but with proper organization and using a few temp-columns you really only need the full generality in mutate().

seplyr has its own issues:

  • It also needs a ":=" operator for assignment.
  • It insists on multiple arguments coming in as vectors (hence the use of "c()" throughout).
  • It runs into a bit of trouble with verbs that take expressions (mutate_se() being the most complicated) in that it needs a helper to substitute in the name of the variable holding the column name, which is later substituted out for the actual column name by seplyr. In this example we used glue::glue() to perform the substitution, but we could also try paste0() or gsub() (a small sketch follows this list).
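
For instance, the glue() call in the pipeline above could be replaced with a paste0() equivalent (a sketch using the same linkScoreCol variable defined earlier):

expr <- paste0("exp(", linkScoreCol, " * scale)/ sum(exp(", linkScoreCol, " * scale))")
# expr is now "exp(assessmentTotal * scale)/ sum(exp(assessmentTotal * scale))"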

The lesson from seplyr is that the mutate() verb does indeed need some kind of expression-manipulation tooling (direct string manipulation feels too crude). However, for the rest of the verbs the value-oriented notation is in fact quite natural, and really in no sense inferior to the dplyr originals.

Conclusion

Name substitution is a reasonable need that arises when re-using R work or when trying to iterate over column names. I have been publicly exploring variations of substitution systems so that R users can make an informed choice of one or more that most meets their needs and addresses their personal trade-offs between power, safety, readability, and teachability. These sections are not each an independent "yet another way of performing name substitution", but parts of a public conversation that should be had before name substitution is considered settled and fixed in stone.

A part-time R user will not have the background to quickly compare all of the available substitution systems. In fact such a user will only come to need a substitution system when they have a problem, so by definition they are in the middle of some other task. It is up to expert partners to evaluate and explain alternatives.

There is a temptation that if you are going to teach only one system it might as well be rlang/tidyeval as "that is what now comes with dplyr". I feel this is a false saving: while rlang/tidyeval "is already in dplyr", the rlang/tidyeval concepts and details are not "already in the user" (and in fact include a fair number of irregular exceptions that need to be taught and memorized).

Our preference is: wrapr::let(). wrapr::let() delivers a lot of (safe) power for a modest amount of cognitive load. Each of the above systems involves different trade-offs and compromises, and we feel one must really try a few in production before having an expert opinion.



R Markdown exercises part 2

Fri, 08/04/2017 - 16:24

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

R Markdown is one of the most popular data science tools; it is used to save and execute code and to create exceptional reports which are easily shareable.

The documents that R Markdown provides are fully reproducible and support a wide variety of static and dynamic output formats.

Markdown syntax provides an easy way of creating documents that can be converted to many other file types, while embedding R code in the report, so it is not necessary to keep the report and the R script separately. Furthermore, the report is written as normal text, so knowledge of HTML is not required. Of course no additional files are needed because everything is incorporated in the HTML file.
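
As a reminder (a minimal sketch, not one of the exercises below), an R Markdown file mixes a YAML header, Markdown text and R code chunks:

---
title: "My report"
output: html_document
---

Some normal text describing the analysis.

```{r}
summary(cars)
```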

Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Then check the solutions to check your answers.

Exercise 1

Make a table out of the object dataframe you created and set its numbers to have one significant figure. HINT: Use kable().

Exercise 2

Use bold text for your report’s title. HINT: Use ** **.

Exercise 3

Use Italic text for the author’s name. HINT: Use * *.

Learn more about reporting your results in the online course: R for Data Science Solutions. In this course you will learn how to:

  • Build a complete workflow in R for your data science problem
  • Go in-depth on how to report your results in an interactive way
  • And much more

Exercise 4

Add “Summary” as a Header of size 1 above your summary content.

Exercise 5

Add “Plot”, “Dataframe” and “Table 1” as Headers of size 3 above the rest of the three objects of your report respectively.

Exercise 6

Create manually a small table for your dataframe.

Exercise 7

Apply right alignment to the column “B”.

Exercise 8

Create an unordered list of the contents of column “A” of your dataframe.

Exercise 9

Transform the list you just created to ordered.

Exercise 10

Add a link named “Link” that leads to “www.r-exercises.com”.

Related exercise sets:
  1. How to create reports with R Markdown in RStudio
  2. R Markdown exercises part 1
  3. Building Shiny App exercises part 1
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


The package hdm for double selection inference with a simple example

Fri, 08/04/2017 - 16:20

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

In an earlier post I discussed Double Selection (DS), a procedure for inference after selecting controls. I showed an example of the consequences of ignoring the variable-selection step discussed in an article by Belloni, Chernozhukov and Hansen.

Some of the authors of the mentioned article created the hdm package, which implements the double selection using the Rigorous LASSO (RLASSO) to select the controls. The RLASSO uses the theory they developed (instead of cross-validation or an information criterion) to select the regularization parameter, normally referred to as λ.

Application

I am going to show an application based on the package’s vignettes, which is based on an article from Barro and Lee (1994). The hypothesis we want to test is whether less developed countries, with lower GDP per capita, grow faster than developed countries. In other words, there is a catch-up effect. The model equation is as follows:

growth_i = β0 + β1·log(gdp_i) + θ′x_i + ε_i

where growth_i is the GDP growth rate over a specific decade in country i, log(gdp_i) is the log of the GDP at the beginning of the decade, and x_i are controls that may affect the GDP. We want to know the effect of log(gdp_i) on growth_i, which is measured by β1. If our catch-up hypothesis is true, β1 must be positive and hopefully significant.

The dataset is available in the package. It has 62 variables and 90 observations. Each observation is a country, but the same country may have more than one observation if analysed in two different decades. The large number of variables will require some variable selection, and I will show what happens if we use a single LASSO selection and the Double Selection. The hdm package does all the DS steps in a single line of code; we do not need to estimate the two selection models and the Post-OLS individually. I will also run a naive OLS with all variables just for illustration.

library(hdm)
data("GrowthData")            # = use ?GrowthData for more information = #
dataset = GrowthData[, -2]    # = The second column is just a vector of ones = #

# = Naive OLS with all variables = #
# = I will select only the summary line that contains the initial log GDP = #
OLS = summary(lm(Outcome ~ ., data = dataset))$coefficients[1, ]

# = Single step selection LASSO and Post-OLS = #
# = I will select only the summary line that contains the initial log GDP = #
lasso = rlasso(Outcome ~ ., data = dataset, post = FALSE)   # = Run the Rigorous LASSO = #
selected = which(coef(lasso)[-c(1:2)] != 0)                 # = Select relevant variables = #
formula = paste(c("Outcome ~ gdpsh465", names(selected)), collapse = "+")
SS = summary(lm(formula, data = dataset))$coefficients[1, ]

# = Double Selection = #
DS = rlassoEffects(Outcome ~ ., I = ~gdpsh465, data = dataset)
DS = summary(DS)$coefficients[1, ]

(results = rbind(OLS, SS, DS))

##        Estimate  Std. Error    t value    Pr(>|t|)
## OLS  0.24716089  0.78450163  0.3150547 0.755056170
## SS   0.31168793  0.09832465  3.1699876 0.002169693
## DS  -0.04432403  0.01531925 -2.8933558 0.003811493

The OLS estimate is positive, however the standard error is very big because we have only 90 observations for more than 60 variables. The Single Selection estimate is also positive and, in this case, significant. However, the Double Selection showed a negative and significant coefficient. If the DS is correct, our initial catch-up hypothesis is wrong and poor countries grow less than rich countries. We can’t say that the DS is correct for sure, but it is backed up by a strong theory and lots of simulations that show that the SS is problematic. It is very, very unlikely that the SS results are more accurate than the DS. It is very surprising how much the results can change from one case to the other. You should at least be skeptical when you see this type of modelling and the selection of controls is not clear.

The hdm package has several other implementations in this framework such as instrumental variables and logit models and there are also more examples in the package vignette.

References

Belloni, A., V. Chernozhukov, and C. Hansen. “Inference on treatment effects after selection amongst high-dimensional controls.” https://arxiv.org/abs/1201.0224

Barro, Robert J., and Jong-Wha Lee. “Sources of economic growth.” Carnegie-Rochester conference series on public policy. Vol. 40. North-Holland, 1994. http://www.sciencedirect.com/science/article/pii/0167223194900027



Sampling distribution of weighted Gini coefficient by @ellis2013nz

Fri, 08/04/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Calculating Gini coefficients

Stats NZ releases a series of working papers, and a recent one caught my eye because of my interest in inequality statistics (disclaimer – I am working at Stats NZ at the moment, but on completely different things). Working Paper No 17–02 by Vic Duoba and Nairn MacGibbon is on Accurate calculation of a Gini index using SAS and R. While the conceptual definition of a Gini coefficient is well known, there are a few different methods out there for estimating one from finite data.

To refresh memories, a Gini coefficient is defined as the ratio A / (A +B) in the diagram below, which is drawn with simulated data I’ll introduce later in the post:

In a situation of perfect equality (everyone has the same income, or whatever it is we are measuring inequality of) the Lorenz curve is the diagonal line, and A is 0 so the Gini coefficient is 0. In a situation where the richest person has all the income and everyone else none, the Lorenz curve is horizontal along the bottom of the chart and spikes up to 1 at the extreme right – so B is 0 and the Gini coefficient is 1. Income Gini coefficients are usually calculated on household incomes, and according to Wikipedia they range from 0.24 (Slovenia) to 0.63 (Lesotho). My simulated data above is meant to represent individual income, for which inequality is higher than for household income in situations where significant numbers of low income individuals live with higher income people.

Duoba and MacGibbon do a brief review of algorithms and choose a preferred one, noting along the way that the existing Stats NZ method will underestimate inequality (very, very slightly – this isn’t something that should make the news please, it’s the fourth decimal place). They helpfully provide a SAS program to implement the algorithm which is literally two pages long, and a much shorter R program that is half a page.

Q1. Is acid::weighted.gini() using the best method?

Of course, shorter, more readable, and more mathematically-oriented (ie vectorized) code is one reason I prefer R to SAS. But another reason is that nearly any functionality one can imagine for published techniques already has an implementation out there we can borrow. In this respect, pitting R against SAS is simply no contest; it’s like Wikipedia versus an encyclopedia written by a single organisation. Wikipedia has 5.5 million articles in English, Britannica’s last hard copy had 40,000 (and the online edition has 120,000, but I think the hard copy is somehow more like SAS, if I’m not pushing the metaphor too far). The R universe quite simply has vastly more functionality available to it than is ready off the shelf for SAS users.

So while Duoba and MacGibbon’s 20 effective lines of R is preferable to 90 lines of SAS, even more convenient is to just use the one liner acid::weighted.gini(x, w), using free user contributions to the R universe.

With choice comes uncertainty. I did want to check that the method I’ve been using for calculating Gini coefficients from weighted data (like in this post on inter-country inequality) matched the one recommended by Duoba and MacGibbon. I used the weighted.gini function from Alexander Sohn (2016), acid: Analysing Conditional Income Distributions, R package version 1.1.

First, here’s code setting up the session and creating Duoba and MacGibbon’s StatsGini R function:

library(tidyverse)
library(acid)
library(microbenchmark)
library(forcats)
# Peter's Stats Stuff miscellaneous stuff package, install with:
# devtools::install_github("ellisp/pssmisc-r-package/pkg")
library(pssmisc)

# Duoba and MacGibbon's function, reproduced as in the original paper:
StatsGini <- function(x, w = rep(1, length(x))){
  # x and w are vectors
  # w can be left blank when calling the fn (i.e. no weighting)
  # Examples:
  # x <- c(3, 1, 7, 2, 5)
  # w <- c(1, 2, 3, 4, 5)
  # StatsGini(x, w) should yield 0.2983050847
  # StatsGini(c(0.25, 0.75), c(1, 1)) should yield 0.25
  n <- length(x)
  wxsum <- sum(w * x)
  wsum <- sum(w)
  sxw <- order(x, w)  # Ascending order sort
  sx <- w[sxw] * x[sxw]
  sw <- w[sxw]
  pxi <- vector(mode = "numeric", length = n)
  pci <- vector(mode = "numeric", length = n)
  pxi <- cumsum(sx) / wxsum
  pci <- cumsum(sw) / wsum
  G <- 0.0
  for (i in 2:n){
    G <- G - (pci[i] * pxi[i - 1] - pxi[i] * pci[i - 1])
  }
  return(G)
}

And here’s some checks. We see to 10 decimal places, using the toy examples provided by Stats NZ, we get identical results:

> options(digits = 10)
> x <- c(3, 1, 7, 2, 5)
> w <- c(1, 2, 3, 4, 5)
> StatsGini(x, w) # should yield 0.2983050847
[1] 0.2983050847
> StatsGini(c(0.25, 0.75), c(1, 1)) # should yield 0.25
[1] 0.25
> weighted.gini(x, w)
$Gini
             [,1]
[1,] 0.2983050847

> weighted.gini(c(0.25, 0.75), c(1, 1))
$Gini
     [,1]
[1,] 0.25

Tests with more complicated data showed identical results too but I won’t bore the reader with them.

A realistic individual income simulation

Next, I wanted to try out the two functions on some realistic data. Individual income data typically has a very complex distribution that is best thought of as a mixture of at least three distributions. There is usually a spike at zero for zero income individuals, a smattering of people with negative incomes, and then a mixture of skewed distributions for positive incomes. Typically, heavily skewed, like a log-Normal distribution with bonus outliers – there are a small number of very high income individuals out there.

To simulate the data in the Lorenz curve graphic I started with, I used the function below. It implements a mixture of log-t distributions (using t rather than Normal because of the longer tails), i.e. e to the power of a variable from a mix of t distributions. The mean of those t distributions is itself a random variable, in this case a non-transformed t distribution. Then each value is multiplied at random by -1, 0 or 1 to simulate the idiosyncratic negative, zero, positive nature of incomes.

When I create the data frame with 100,000 incomes in it, I also create a column called prob with randomly generated numbers between 0 and 1 following a Beta(2, 2) distribution. I’m going to use that later to simulate the experience of sampling individuals for a complex survey.

#' Generate a realistic mixed distribution of incomes
#'
#' alpha ~ t(d_alpha) scaled to mean, sd of (log(mu), sigma_alpha)
#' beta ~ t(d_beta) scaled to mean, sd of (alpha, sigma_beta)
#' x = exp(beta)
#' y = x * I(-1, 0, 1)
#'
#' @param n sample size to generate
#' @param mu median income
#' @param sigma_alpha standard deviation of underlying latent means
#' @param sigma_beta standard deviation of scaled t distribution underlying the population's log-t distribution
#' @param d_alpha degrees of freedom in the t distribution underlying the latent means
#' @param d_beta degrees of freedom in the t distribution underlying the log-t
#' @param pr vector of 2 elements, the probability of negative income and of zero income respectively
gen_incomes <- function(n, mu = 30000, sigma_alpha = 1, sigma_beta = 1,
                        d_alpha = 15, d_beta = 15, pr = c(0.01, 0.2)){
  alpha <- rt(n, d_alpha) * sigma_alpha + log(mu)
  beta <- exp(rt(n, d_beta) * sigma_beta + alpha)
  y <- beta * sample(c(-1, 0, 1), n, replace = TRUE, prob = c(pr, 1 - sum(pr)))
  return(y)
}

set.seed(123)
N <- 100000

population <- data.frame(income = gen_incomes(N),
                         sigma_alpha = 1,
                         d_alpha = 10,
                         d_beta = 150,
                         sigma_beta = 2,
                         prob = rbeta(N, 2, 2))

# draw density plot with a slice of 1000 people:
population %>%
  slice(1:1000) %>%
  ggplot(aes(x = income)) +
  scale_x_continuous("Annual income (modulus transform with power of 0.02)",
                     label = dollar,
                     trans = modulus_trans(0.02),
                     breaks = modulus_breaks(lambda = 0.02)) +
  geom_density(fill = "grey", alpha = 0.2) +
  geom_rug() +
  ggtitle("Density of simulated annual incomes",
          "First 1,000 values")

This is what that looks like:

I wrote about the modulus transform, used in the horizontal axis of that chart, in earlier blog posts. It’s an idea put forward by John and Draper in 1980 that I’m surprised hasn’t taken off more. It’s very similar to a Box-Cox power transform, but it works for values of 0 and negative values too. It does something very similar to a Box-Cox transform to the absolute value of the data, then returns the sign to it. It’s a much, much better answer than the commonly used alternatives for data that looks like it needs a log transformation but inconveniently has some zeros and negatives. I’ve implemented methods to do this in my new pssmisc ad hoc R package, referenced earlier in the code.
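
For reference, the transform itself is easy to write down (a generic sketch of the John and Draper formula for non-zero lambda, independent of the pssmisc implementation used above):

modulus <- function(x, lambda) {
  # Box-Cox-style power transform applied to abs(x) + 1, with the sign restored,
  # so it is defined for zero and negative values as well (lambda != 0 assumed)
  sign(x) * ((abs(x) + 1) ^ lambda - 1) / lambda
}
modulus(c(-1000, 0, 1000), lambda = 0.02)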

For completeness, here’s the code that drew the Lorenz curve diagram at the top of the post:

population %>%
  # thin down to just a sample of 10,000 to save on vector graphic file size:
  sample_n(10000) %>%
  arrange(income) %>%
  mutate(cum_prop_inc = cumsum(income) / sum(income),
         seq = 1:n() / n()) %>%
  ggplot(aes(x = seq, y = cum_prop_inc)) +
  geom_line() +
  geom_ribbon(aes(ymax = seq, ymin = cum_prop_inc), fill = "steelblue", alpha = 0.2) +
  geom_abline(intercept = 0, slope = 1, colour = "steelblue") +
  labs(x = "Cumulative proportion of population, from poorest to richest",
       y = "Cumulative proportion of income") +
  annotate("text", x = c(0.6, 0.95), y = c(0.4, 0.05), label = c("A", "B"), size = 7) +
  annotate("text", x = 0.5, y = 0.6, label = "Line of equality", angle = 45) +
  annotate("text", x = 0.85, y = 0.15, label = "Actual distribution", angle = 45) +
  coord_equal() +
  ggtitle("Lorenz curve with simulated income data",
          paste("Gini coefficient is", round(gini(population$income)$Gini, 2)))

Q2. Is acid::weighted.gini more or less efficient than the Stats NZ alternative?

Next I wanted to try out the efficiency of the two functions. To do this I gave them a tougher job – estimate the Gini coefficient for the whole population of 100,000, but using the inverse of the prob column as weights. The microbenchmark package makes it easy to do this 1,000 times for a reliable estimate, and even gives an automatic ggplot2 method.

y <- population$income
w <- 1 / population$prob

mb <- microbenchmark(
  weighted.gini(y, w)$Gini,
  StatsGini(y, w),
  times = 1000)

autoplot(mb, log = FALSE) +
  ggtitle("Time taken to estimated Gini coefficient from 100,000 observations",
          "Microbenchmark results")

With these results:

The method in acid::weighted.gini is slightly faster than StatsGini. acid uses matrix multiplication for a calculation that StatsGini does through a for loop, which explains the difference. It’s not a big deal though, with both functions reliably taking less than 50 millionths of a second to do the calculation on a sample size of 100,000.

Q3. What does the sampling distribution of Gini from a weighted complex survey sample look like?

Finally, I had a more substantive question of interest from a statistical point of view. In the real world, we have finite samples to estimate inequality from and our estimate of the Gini coefficient will be random; it will have a sampling distribution. The samples often come from a complex survey design, which complicates the estimation process and changes the sampling distribution. Allowing for data from sources like a complex survey is why both calculation methods allow for a vector of unequal weights, which in this context would normally be calculated by a method such as the inverse probability of a population member being in the sample.

I’ve looked previously at the sampling distribution of Gini coefficient, but only with unweighted data. Reading Duoba and MacGibbon’s paper reminded me that I should probably look at the realistic situation of estimating Gini coefficient with weighted data from a complex survey.

To do this in a pragmatic way, I simulated surveys with a range of sample sizes by creating multiple unequal-probability random samples from my population of 100,000. This doesn’t really re-create the experience of a complex survey because I haven’t got primary sampling units and intra-class correlation (which make estimation harder and sampling distributions wider), but it is a start for illustrative/exploratory purposes. I calculated the weighted Gini coefficient for each and kept the results. When we plot the distributions of all these efforts to estimate the population value (which we know is really 0.82) we get the following:

I truncated the horizontal axes (but not the density estimations) because there were a few very bad estimates from the smaller sample sizes – as high as 1.5 – that made the visual comparison difficult. How can a Gini coefficient be greater than 1? When there are enough negative incomes.
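
That last point is easy to check with a toy vector (my example, not part of the post's simulation; acid is already loaded above):

weighted.gini(c(-10, 1, 20))$Gini  # comes out well above 1 because of the negative value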

We see that a sample size of 500 isn’t too bad, with most estimates coming out between 0.7 and 0.9. But even with a sample of 10,000 there’s quite a bit of variation in the Gini coefficient of the sample. Lesson – don’t rely too much on the decimal points in inequality data coming from samples (and that’s without even getting into the other formidable measurement problems with economic data).

Here’s the code for the simulation:

# number of repeats at each sample size:
reps <- 100

# sample sizes, chosen to resemble realistic survey sizes:
ns <- c(30, 100, 500, 1000, 3000, 10000)

# matrix of zeros that we'll replace with the estimates of Gini
res <- matrix(0, nrow = reps, ncol = length(ns))

# obviously this next bit could be made much more efficient by parallel processing,
# but it's only a few minutes of running
set.seed(666)
for(i in 1:length(ns)){
  n <- ns[i]
  for(j in 1:reps){
    the_sample <- sample_n(population, size = n, replace = FALSE, weight = population$prob)
    res[j, i] <- weighted.gini(the_sample$income, w = 1 / the_sample$prob)$Gini
  }
}

res_df <- as.data.frame(res)
names(res_df) <- format(ns, big.mark = ",")

res_df %>%
  gather(sample_size, value) %>%
  mutate(sample_size = fct_reorder(sample_size, -value)) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~sample_size) +
  geom_density(alpha = 0.3, fill = "grey") +
  # a few values of > 1 make no sense and warp the graph:
  coord_cartesian(xlim = c(0.6, 1)) +
  labs(x = "Estimated weighted Gini") +
  ggtitle("Sampling distribution of weighted Gini point estimate from a complex survey",
          paste0("Various sample sizes; true value is ",
                 round(weighted.gini(population$income)$Gini, 2)))

Finished for today.



Saving High-Resolution ggplots: How to Preserve Semi-Transparency

Fri, 08/04/2017 - 10:50

(This article was first published on Easy Guides, and kindly contributed to R-bloggers)

This article describes solutions for preserving semi-transparency when saving ggplot2-based graphs into a high-quality postscript (.eps) file format.


Create a ggplot with semi-transparent color

To illustrate this, we start by creating ggplot2-based survival curves using the function ggsurvplot() in the survminer package. The ggsurvplot() function creates survival curves with the 95% confidence bands in a semi-transparent color.

First install (if needed) survminer as follows:

install.packages("survminer")

Then type this:

# Fit survival curves
require("survival")
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# Visualize
library("survminer")
p <- ggsurvplot(fit, data = lung,
                surv.median.line = "hv",           # Add median survival lines
                pval = TRUE,                       # Add p-value
                conf.int = TRUE,                   # Add the 95% confidence band
                risk.table = TRUE,                 # Add risk table
                tables.height = 0.2,
                tables.theme = theme_cleantable(),
                palette = "jco",
                ggtheme = theme_bw())
print(p)

In the plot above, the confidence band is semi-transparent. It can be saved to a PDF file without losing the semi-transparent color.

If you try to export the picture as a vector file (EPS or SVG, …), the 95% confidence interval will disappear and the saved plot looks as follows:

The problem is that EPS in R does not support transparency.

In the following sections, we’ll describe convenient solutions to save high-quality ggplots by preserving semi-transparency.

Save ggplots with semi-transparent colors

Use cairo-based postscript graphics devices

You can use the ggsave() function in ggplot2 as follows:

ggsave(filename = "survival-curves.eps", plot = print(p), device = cairo_eps)

Or use this:

cairo_ps(filename = "survival-curves.eps",
         width = 7, height = 7, pointsize = 12,
         fallback_resolution = 300)
print(p)
dev.off()

Note that the argument fallback_resolution is used to control the resolution in dpi at which semi-transparent areas are rasterized (the rest stays as vector format).

Export to powerpoint

You can export the plot to Powerpoint using the ReporteRs package. ReporteRs will give you a fully editable vector format with full support for transparency as well.

We previously described how to Create and format PowerPoint documents from R software using the ReporteRs package. We also described how to export an editable ggplot from R software to powerpoint.

Briefly, to export our survival curves from R to powerpoint, the script looks like this:

library('ReporteRs')

# Create a new powerpoint document
doc <- pptx()

# Add a new slide into the ppt document
doc <- addSlide(doc, slide.layout = "Two Content")

# Add a slide title
doc <- addTitle(doc, "Survival Curves: Editable Vector Graphics")

# Print the survival curves in the powerpoint
doc <- addPlot(doc,
               function() print(p, newpage = FALSE),
               vector.graphic = TRUE  # Make it editable
)

# write the document to a file
writeDoc(doc, file = "editable-survival-curves.pptx")

The output looks like this:

Edit the plot in powerpoint. See the video below: Editing ggplots Exported with ReporteRs into PWPT




Clustering with FactoMineR

Fri, 08/04/2017 - 10:05

(This article was first published on François Husson, and kindly contributed to R-bloggers)

Here is a course with videos that present hierarchical clustering and its complementarity with principal component methods. Four videos present a course on clustering: how to determine the number of clusters, how to describe the clusters, and how to perform the clustering when there are lots of individuals and/or lots of variables. Then you will find videos presenting the way to implement hierarchical clustering in FactoMineR.
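
A minimal sketch of that workflow (the data set and settings here are illustrative choices of mine, not taken from the course):

library(FactoMineR)
res.pca <- PCA(iris[, 1:4], graph = FALSE)              # principal component analysis first
res.hcpc <- HCPC(res.pca, nb.clust = -1, graph = FALSE) # hierarchical clustering on the components
res.hcpc$desc.var                                       # describe the clusters by the variables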

For more information, you can see the book below. Here are some reviews of the book and a link to order the book.



R for System Administration

Fri, 08/04/2017 - 05:33

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Just getting back from the most fun meetup I have been to in quite some time: episode 23 (by their count) of Open Source Open Mic hosted by Matt Godbolt and Joe Walnes here in Chicago. Nothing but a sequence of lightning talks. Plus beer and pizza. Sounds awesome? It was!

We had fantastic talks across at least half a dozen languages, covering both new-ish (Pony) and interesting (Rust, Go, …) languages, plus of course some Javascript and some Python, no Java (yay!), and a few batshit crazy things like a self-hosting database in its own (shell) code, a terminal gif viewer (!!), and more. And it gave me an opportunity to quickly (one evening and morning commute) jam out a presentation about what is in the title: R for system administration.

And I am only half-joking. I had used R a couple of years ago when I needed to select, subset, modify, … a large number of image files given some timestamp and filename patterns. And given how well R works in a vectorised manner with both regular expressions and timestamps, as well as on top of essentially all standard POSIX-style operating system / file-system functions, I picked up that thread again on the problem of … cleaning up the file storage underlying CRANberries which by now has well over fifty-seven thousand (!!) tarballs of CRAN packages based on now ten years of CRANberries. So I showed how to prune this in essentially half a dozen lines of R (and data.table code), plus some motivation—all just right for a lightning talk. Seemingly the talk went well enough as quite a few folks gave a thumbs up and compliments over beers afterwards.
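
The flavour of such a clean-up task might look something like the sketch below (my illustration only, not the code from the talk; the directory path and the keep-the-newest-tarball-per-package rule are assumptions):

library(data.table)
tarballs <- list.files("cranberries/sources", pattern = "\\.tar\\.gz$", full.names = TRUE)
info <- data.table(path  = tarballs,
                   pkg   = sub("_.*$", "", basename(tarballs)),  # package name from "name_version.tar.gz"
                   mtime = file.info(tarballs)$mtime)
keep  <- info[, .SD[which.max(mtime)], by = pkg]  # newest tarball per package
stale <- setdiff(info$path, keep$path)
# unlink(stale)  # delete the rest, once you have inspected 'stale'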

But see for yourself as the slides are now uploaded to my standard talks page.

My thanks to Matt and Joe for organizing the meetup. I think I will be back.



Initiating development of a chatbot with plumber and ngrok

Fri, 08/04/2017 - 02:00

(This article was first published on R, and kindly contributed to R-bloggers)

Chatbots have become a rage for some time now, with firms from various sectors investing in bots that would reduce or remove the need to employ a call centre whilst maintaining similar, if not greater, levels of efficiency. A notable example is DoNotPay, a ‘Lawyer Chatbot’ which is said to give free legal aid to refugees seeking asylum in the US and Canada, and some asylum support in the UK.

While it may be that most such companies are utilising in-house technology for chatbot development, it is also true that with the advent of developers like API.AI and Wit.AI, with Facebook exposing their API, and with Microsoft providing their SDKs for chatbot development, it has become much easier to craft an intelligent, engaging bot.

Although other programming languages like Python and Javascript have a fair number of tutorials and packages that can directly or indirectly interface with Facebook’s API, those for R have been rather scarce.

Following are my limited, baby steps into the world of chatbot development using R and ngrok, making use of the Facebook Chatbot Development Platform.

Since we are using Facebook’s platform, we need to first set up a Facebook page, and then set up a Facebook app – both of these are easy steps, so I am going to skip covering them in detail. We create any page on Facebook, assign it a name, and we move on to Facebook for developers, where we create a new Facebook App ID.

Then, we add ‘Messenger’ as our product to the page, and generate a page access token that our chatbot will use later on.

To be able to send real-time notifications to our chatbot (and also to verify us!), Facebook uses Webhooks. We are required to list a URL here to which Facebook will direct any notifications we ask it to (keep your copy of Messenger Platform Document at hand, folks!).

But where do we host our app such that we can communicate with it via its URL?
A possible candidate is Heroku – we can develop a simple chatbot and deploy it to Heroku. It will be live, and thus able to receive any verification requests and notifications that Facebook might send. R does not have an official buildpack for Heroku at the time of this writing, however. There is an unofficial buildpack on Github, but I have been unable to get it working for me and I am feeling lazy. So instead of Heroku, I am going to rely on ngrok.

Set up Webhook and Complete our Verification

We need to first set up our Webhook, as Facebook will initially send a Verification request. It is only after this verification that Facebook will send notifications to our chatbot via the callback URL. In a verification request, Facebook sends 3 parameters:

  1. hub.mode (this is set to ‘subscribe’)

  2. hub.challenge (this is a random string that Facebook expects us to send back)

  3. hub.verify_token (this is the same Verification Token that we will be entering in our Webhook’s setup)

Step 1: Write a script with the plumber package

I am going to quickly script a few lines that will enable us to verify our callback URL, using the plumber package. Plumber allows you to create APIs by merely decorating your existing R code with special annotations, and I encourage you to read their documentation. I have, in my current project, found it to provide functionality similar to what flask does for Python.

library(plumber)
library(httr)
library(data.table)
library(RCurl)
library(jsonlite)

#* @serializer html
#* @get /
myFunc <- function(hub.mode, hub.challenge, hub.verify_token){
  if(hub.verify_token == "RRocker"){
    return(hub.challenge)
  } else{
    return(paste("Verification Token mismatch", 403))
  }
}

Step 2: Save the Script and Run it!

Once you have copied the script shown above, save it as an R file, open an R session, and run the following code:

r <- plumb("")
r$run(port=5000)

Step 3: Launch ngrok

Now, you need to open ngrok and type in this command:


ngrok http 5000

This will show you a screen like the one below, where you can take note of the Web Interface and Forwarding URL. The web interface will allow you to inspect requests and their responses as they happen in real time, whereas the secure URL in the forwarding section will be used to register our callback URL with Facebook.

Step 4: Launch another R session and/or Ngrok Web Interface to test our script

You may open another R session to test our script. For example, I ran the below code and inspected the response in ngrok’s Web Interface as well as in R using the httr package:

library(httr)

url2 <- "https://a5da2efd.ngrok.io?hub.challenge=hubChallenger&hub.mode=subscribe&hub.verify_token=RRocker"
res1 <- GET(url2, verbose())

Keep in mind the significance of parameters shown above (after the question mark). You can see in a screenshot of ngrok’s web interface an inspection of our request and the response given by our script – things are looking good!

Step 5: Copy and paste the secure URL from ngrok to Facebook’s Callback URL in Webhook Setup

Here, we simply paste in the secure URL from ngrok to Facebook. If things are alright, we will see a green tick for the Webhook’s section.

What Now?

Admittedly, this is a very first step in our chatbot development endeavour. It is, however, a vital one. What we ought to do now is to carry out a perusal of Facebook’s documentation to understand how we need to handle POST requests, as it is such requests that will contain messages from other users as they try to engage with our chatbot.

I am unsure if I will be able to follow this through any time soon, as being employed full time while also working on a startup is already overwhelming! But perhaps somebody more skilled will take over from here before I return!

Reference

This blog entry owes a lot to Masnun’s blog post, which uses Python and ngrok to set up an echo bot.


Text categorization with deep learning, in R

Fri, 08/04/2017 - 00:42

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Given a short review of a product, like "I couldn't put it down!", can you predict what the product is? In that case it's pretty easy (it's a book), but this general problem of text categorization comes up in a lot of natural language analysis problems. In his talk at useR!2017 (shown below), Microsoft data scientist Angus Taylor demonstrates how to build a text categorization model in R. He applies a convolutional neural network (trained using the R interface to the MXNET deep learning platform) to Amazon review data, and creates a small Shiny app to categorize previously-unseen reviews. The talk also provides a brief introduction to convolutional neural networks and one-hot encoding, if you haven't come across those concepts before.

The model Angus uses in this example is described in more detail in the blog post, Cloud-Scale Text Classification with Convolutional Neural Networks on Microsoft Azure. If you'd like to implement something similar yourself, you may also want to check out this tutorial on deep learning for text classification, which provides code and sample data and explains how to use GPU-accelerated clusters in Azure to speed up the training process. The GPU-enabled Data Science Virtual Machine on Azure contains everything you need to implement the text categorization models described there.

Channel 9: Deep Learning for Natural Language Processing in R

 


Numerical Differentiation with Finite Differences in R

Thu, 08/03/2017 - 20:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 1 of 7 in the series Numerical Analysis

Numerical differentiation is a method of approximating the derivative of a function \(f\) at a particular value \(x\). Often, particularly in physics and engineering, a function may be too complicated to merit the work necessary to find the exact derivative, or the function itself is unknown, and all that is available are some points and the function evaluated at those points. Numerical differentiation, of which finite differences is just one approach, allows one to avoid these complications by approximating the derivative.

The most straightforward and simple approximation of the first derivative is defined as:

\[f'(x) \approx \frac{f(x + h) - f(x)}{h}\]

The approximation error of this equation can be found by performing a Taylor expansion of \(f(x + h)\) about \(x\), which gives:

\[f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(\xi), \qquad \xi \in (x, x + h)\]

We then rearrange the expansion and substitute the earlier approximation to arrive at an exact form of the approximation equation:

\[f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{h}{2} f''(\xi)\]

Finite Differences Methods for Numerical Differentiation

There are three primary differencing techniques, forward, backward, and central, for approximating the derivative of a function at a particular point. Note there are more accurate methods and algorithms for approximating derivatives; however, due to their relative ease of computation and accuracy, differencing methods are still a viable tool for performing numerical differentiation.

The forward difference approximation is the same as the earlier approximation:

\[f'(x) \approx \frac{f(x + h) - f(x)}{h}\]

The backward difference approximation is based on the values of the function evaluated at \(x\) and \(x - h\), and is defined as:

\[f'(x) \approx \frac{f(x) - f(x - h)}{h}\]

The central (or centered) difference approximation is essentially an average of the forward and backward differencing methods:

\[f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}\]

Error of Approximations

It was shown earlier that the forward and backward difference approximations have an error term of \(\frac{h}{2} f''(\xi)\), where \(\xi\) lies between \(x\) and \(x + h\), so the error is of order \(O(h)\). The error term of the central difference method can be found by performing Taylor expansions of \(f(x + h)\) and \(f(x - h)\) about \(x\):

\[f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(\xi_1)\]
\[f(x - h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(\xi_2)\]

These two equations are then subtracted, and solving for \(f'(x)\) leads to the central difference method:

\[f'(x) = \frac{f(x + h) - f(x - h)}{2h} - \frac{h^2}{6} f'''(\xi)\]

The central difference approximation is more accurate than the forward and backward differences and should be used whenever possible. In the following example the difference methods report broadly similar approximations, as the function and its derivative are rather simple; however, it is still best to apply the central difference approximation in actual practice.

Differencing Approximations Example in R

Consider the following set of data points:

  x      f(x)
 0.0   0.00000
 0.2   0.74140
 0.4   1.37180

x <- c(0.0, 0.2, 0.4)
fx <- c(0.00000, 0.74140, 1.3718)

The function is unknown; however, we would still like to approximate its derivative at the given values of \(x\). The following function combines the forward and backward differencing methods to approximate the derivative of the function at each value of \(x\).

finite.differences <- function(x, y) {
  if (length(x) != length(y)) {
    stop('x and y vectors must have equal length')
  }

  n <- length(x)

  # Initialize a vector of length n to enter the derivative approximations
  fdx <- vector(length = n)

  # Iterate through the values using the forward differencing method
  for (i in 2:n) {
    fdx[i-1] <- (y[i-1] - y[i]) / (x[i-1] - x[i])
  }

  # For the last value, since we are unable to perform the forward differencing method
  # as only the first n values are known, we use the backward differencing approach
  # instead. Note this will essentially give the same value as the last iteration
  # in the forward differencing method, but it is used as an approximation as we
  # don't have any more information
  fdx[n] <- (y[n] - y[n - 1]) / (x[n] - x[n - 1])

  return(fdx)
}

Using the function, perform the differencing on the values of \(x\) and the evaluated points \(f(x)\).

finite <- finite.differences(x, fx)
finite
## [1] 3.707 3.152 3.152

Let’s assume the data from earlier was taken from the function \(f(x) = e^x - 2x^2 + 3x - 1\).

f <- function(x) {
  return(exp(x) - 2 * x^2 + 3 * x - 1)
}

Using the central differencing method and the given function, we can approximate the derivative at each point. As \(h\) approaches 0, the approximation becomes more accurate. The following function approximates the values of the derivative at the given points for several different values of \(h\) to demonstrate how the approximations converge.

central.difference <- function(f, x) {
  steps <- c(0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001)

  n <- length(steps)
  fdx <- vector(length = n)

  for (h in 1:n) {
    fdx[h] <- (f(x + 0.5 * steps[h]) - f(x - 0.5 * steps[h])) / steps[h]
  }

  return(fdx)
}

Print the approximations of the derivative.

for (i in x) {
  print(central.difference(f, i))
}
## [1] 4.000417 4.000004 4.000000 4.000000 4.000000 4.000000 4.000000
## [1] 3.421912 3.421408 3.421403 3.421403 3.421403 3.421403 3.421403
## [1] 2.892446 2.891831 2.891825 2.891825 2.891825 2.891825 2.891825

We see the approximations converge rather quickly as \(h\) approaches 0, giving approximately 4.000000, 3.421403 and 2.891825 at \(x = 0.0, 0.2, 0.4\), respectively.

The derivative of the function is \(f'(x) = e^x - 4x + 3\). We shall compare our approximated values with the actual values of the derivative at each \(x\) to see how the central differencing method fared.

fdx <- function(x) { return(exp(x) - 4 * x + 3) }

For fun, plot both the function and its derivative to get a visualization of where the function’s derivative sits at the values of \(x\).

library(ggplot2)

ggplot(data = data.frame(x = 0), mapping = aes(x = x)) + 
  stat_function(fun = f, size = 1.25, alpha = 0.75) + 
  stat_function(fun = fdx, size = 1.25, color = 'blue', alpha = 0.75) + 
  xlim(-3, 3)

Print the calculated derivative’s values for each \(x\) and compare them with our previously approximated values.

actual <- vector(length = length(x))
central.approx <- c(4.00000, 3.421403, 2.891825)

for (i in 1:length(x)) {
  actual[i] <- fdx(x[i])
}

approx.df <- data.frame(cbind(actual, central.approx, actual - central.approx,
                              finite, actual - finite))
colnames(approx.df) <- c('Actual Values', 'Central Difference Approx',
                         'Central Differences Error', 'Finite Differences',
                         'Finite Differences Error')
approx.df
##   Actual Values Central Difference Approx Central Differences Error
## 1      4.000000                  4.000000              0.000000e+00
## 2      3.421403                  3.421403             -2.418398e-07
## 3      2.891825                  2.891825             -3.023587e-07
##   Finite Differences Finite Differences Error
## 1              3.707                0.2930000
## 2              3.152                0.2694028
## 3              3.152               -0.2601753

References

Burden, R. L., & Faires, J. D. (2011). Numerical analysis (9th ed.). Boston, MA: Brooks/Cole, Cengage Learning.

Weisstein, Eric W. “Numerical Differentiation.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/NumericalDifferentiation.html

The post Numerical Differentiation with Finite Differences in R appeared first on Aaron Schlegel.


Parallel Computing Exercises: Snow and Rmpi (Part-3)

Thu, 08/03/2017 - 17:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


The foreach statement, which was introduced in the previous set of exercises of this series, can work with various parallel backends. This set lets you practice working with the backends provided by the snow and Rmpi packages (on a single machine with multiple CPUs). The name of the former package stands for “Simple Network of Workstations”. It can employ various parallelization techniques; socket clustering is used here. The latter package is an R wrapper for MPI (the Message-Passing Interface), which is another parallelization technique.
The set also demonstrates that inter-process communication overhead has to be taken into account when preparing to use parallelization. If short tasks are run in parallel the overhead can offset the gains in performance from using multiple CPUs, and in some cases execution can get even slower. For parallelization to be useful, tasks that are run in parallel have to be long enough.
The exercises are based on an example of using bootstrapping to estimate the sampling distribution of linear regression coefficients. The regression is run multiple times on different sets of data derived from an original sample. The size of each derived data set is equal to the size of the original sample, which is possible because the sets are produced by random sampling with replacement. The original sample is taken from the InstEval data set, which comes with the lme4 package, and represents lecture/instructor evaluations by students at the ETH. The estimated distribution is not analyzed in the exercises.
The exercises require the packages foreach, snow, doSNOW, Rmpi, and doMPI to be installed.
IMPORTANT NOTE: the Rmpi package depends on an MPI software, which has to be installed on the machine separately. The software can be the following:

  • Windows: either the Microsoft MPI, or Open MPI library (the former one can be installed as an ordinary application).
  • OS X/macOS: the Open MPI library (available through Homebrew).
  • Linux: the Open MPI library (look for packages named libopenmpi (or openmpi, lib64openmpi, or similar), as well as libopenmpi-dev (or libopenmpi-devel, or similar) in your distribution’s repository).

The zipped data set can be downloaded here. For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.

Exercise 1
Load the data set, and assign it to the data_large variable.

Exercise 2
Create a smaller data set that will be used to compare how parallel computing performance depends on the size of the task. Use the sample function to obtain a random subset from the loaded data. Its size has to be 10% of the size of the original dataset (in terms of rows). Assign the subset to the data_small variable.
For reproducibility, set the seed to 1234.
Print the number of rows in the data_large and data_small data sets.

Exercise 3
Write a function that will be used as a task in parallel computing. The function has to take a data set as an input, and do the following:

  1. Resample the data, i.e. obtain a new sample of data based on the input data set. The number of rows in the new sample has to be equal to the one in the input data set (use the sample function as in the previous exercise, but change parameters to allow for resampling with replacement).
  2. Run a linear regression on the resampled data. Use y as the dependent variable, and the others as independent variables (this can be done by using the formula y ~ . as an argument to the lm function).
  3. Return a vector of coefficients of the linear regression.
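If you would like something concrete to compare your own answer with, here is a minimal sketch of such a task function; the name boot_coefs is made up for illustration, and the official solution linked above may differ:

boot_coefs <- function(dat) {
  # Resample rows with replacement, keeping the original number of rows
  idx <- sample(nrow(dat), nrow(dat), replace = TRUE)
  resampled <- dat[idx, ]

  # Regress y on all other columns and return the coefficient vector
  fit <- lm(y ~ ., data = resampled)
  coef(fit)
}

# e.g. boot_coefs(data_small)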
Learn more about optimizing your workflow in the online course Getting Started with R for Data Science. In this course you will learn how to:

  • efficiently organize your workflow to get the best performance of your entire project
  • get a full introduction to using R for a data science project
  • And much more

Exercise 4
Let’s test how much time it takes to run the task multiple times sequentially (not in parallel). Use the foreach statement with the %do% operator (as discussed in the previous set of exercises of this series) to run it:

  • 10 times with the data_large data set, and
  • 100 times with the data_small data set.

Use the rbind function as an argument to foreach to combine the results.
In both cases, measure how much time is spent on execution of the task (with the system.time function). Theoretically, the length of time spent should be roughly the same because the total number of rows processed is equal (it is 100,000 rows: 10,000 rows 10 times in the first case, and 1,000 rows 100 times in the second case), and the row length is the same. But is this the case in practice?

Exercise 5
Now we’ll prepare to run the task in parallel using 2 CPU cores. First, we’ll use a parallel computing backend for the foreach statement from the snow package. This requires two steps:

  1. Make a cluster for parallel execution using the makeCluster function from the snow package. Pass two arguments to this function: the size of the cluster (i.e. the number of CPU cores that will be used in computations), and the type of the cluster ("SOCK" in this case).
  2. Register the cluster with the registerDoSNOW function from the doSNOW package (which provides a foreach parallel adapter for the 'snow' package).
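In code, these two steps might look as follows (a minimal sketch for the 2-core cluster used in this set):

library(snow)
library(doSNOW)

cl <- makeCluster(2, type = "SOCK")  # step 1: a socket cluster with 2 workers
registerDoSNOW(cl)                   # step 2: register it as the foreach backend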

Exercise 6
Run the task 10 times with the large data set in parallel using the foreach statement with the %dopar% operator (as discussed in the previous set of exercises of this series). Measure the time spent on execution with the system.time function.
When done, use the stopCluster function from the snow package to stop the cluster.
Is the length of execution time smaller comparing to the one measured in Exercise 4?

Exercise 7
Repeat the steps listed in Exercise 5 and Exercise 6 to run the task 100 times using the small data set.
What is the change in the execution time?

Exercise 8
Next, we’ll use another parallel backend for the foreach function: the one provided by the Rmpi package (R’s wrapper for the Message-Passing Interface), accessible through an adapter from the doMPI package. From the user perspective, it differs from the snow-based backend in the following ways:

  • as mentioned above, additional software has to be installed for this backend to work (either (a) the Open MPI library, available for Windows, macOS, and Linux, or (b) the Microsoft MPI library, which is available for Windows),
  • when an mpi cluster is created, it immediately starts using CPUs as much as it can,
  • when the work is complete, the mpi execution environment has to be terminated; if terminated, it can’t be relaunched without restarting the R session (if you try to create an mpi cluster after the environment was terminated, the session will be aborted, which may result in a loss of data; see Exercise 10 for more details).

In this exercise, we’ll create an mpi execution environment to run the task using 2 CPU cores. This requires actions similar to the ones performed in Exercise 5:

  1. Make a cluster for parallel execution using the startMPIcluster function from the doMPI package. This function can take just one argument, which is the number of CPU cores to be used in computations.
  2. Register the cluster with the registerDoMPI function from the doMPI package.
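A minimal sketch of this setup, together with the clean-up functions used in the later exercises, might look as follows (doMPI loads Rmpi, which provides mpi.finalize()):

library(doMPI)

cl <- startMPIcluster(count = 2)  # step 1: an MPI cluster with 2 workers
registerDoMPI(cl)                 # step 2: register it as the foreach backend

# ... run foreach(...) %dopar% { ... } here ...

closeCluster(cl)   # stop the cluster (see Exercise 9)
mpi.finalize()     # terminate the MPI environment when completely done (see Exercise 10)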

After creating a cluster, you may check whether the CPU usage on your machine increased using Resource Monitor (Windows), Activity Monitor (macOS), top or htop commands (Linux), or other tools.

Exercise 9
Stop the cluster created in the previous exercise with the closeCluster command from the doMPI package. The CPU usage should fall immediately.

Exercise 10
Create an mpi cluster again, and use it as a backend for the foreach statement to run the task defined above:

  • 10 times with the data_large data set, and
  • 100 times with the data_small data set.

In both cases, start a cluster before running the task, and stop it afterwards. Measure how much time is spent on execution of the task. How does the time compare to the execution time with the snow cluster (found in Exercises 6 and 7)?
When done working with the clusters, terminate the mpi execution environment with the mpi.finalize function. Note that this function always returns 1.
Important! As mentioned above, if you intend to create an mpi cluster again after the environment was terminated you have to restart the R session, otherwise the current session will be aborted, which may result in a loss of data. In RStudio, an R session can be relaunched from the Session menu (relaunching the session this way does not affect the data, you’ll only need to reload libraries). In other cases, you may have to quit and restart R.

Related exercise sets:
  1. Parallel Computing Exercises: Foreach and DoParallel (Part-2)
  2. Parallel Computing Exercises: Snowfall (Part-1)
  3. Creating Sample Datasets – Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

Rborist version 0-1.8 available from CRAN

Thu, 08/03/2017 - 16:39

(This article was first published on Mood Stochastic, and kindly contributed to R-bloggers)

Version 0-1.8 of the Rborist implementation of the Random
Forest (TM) algorithm is now available from CRAN. Although most changes involve
refactoring to accommodate future updates, there are several bug fixes and
enhancements worth mentioning.

New option maxLeaf allows a limit to be set on the number of terminal nodes (i.e., leaves) in each trained tree. In order not to introduce behavior dependent upon the training algorithm, leaves are pruned from fully-trained trees.
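As a rough illustration only (a minimal sketch assuming the usual Rborist(x, y, ...) training call, not an excerpt from the package documentation), the new option might be used like this:

library(Rborist)

# Limit each trained tree to at most 64 terminal nodes; the data set and
# the value 64 are chosen purely for illustration
fit <- Rborist(iris[, 1:4], iris$Species, maxLeaf = 64)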

A few users had reported premature termination of their R sessions. All
reproducible instances were found to originate from uncaught exceptions thrown
from glue code. These are now caught.

An error in splitting sparse representations resulted in egregious behavior
when employing autocompression. This error has been repaired.

There has been a redoubling of effort to ensure regular data-access
patterns, particularly within nested loops. While decision-tree training is at
heart an exercise in random memory access, there are tangible benefits to
keeping irregular accesses to a minimum. In particular, we continue to see
improvements to both the Arborist’s performance and its scalability as this
focus is maintained.

On the now well-known flight-delay data set, for example,
Rborist achieves parity with Xgboost, the
acknowledged leader in performance, as the size of the training set increases.
Experiments were performed on an 8-core, single-node I7 having 32GB of DDR3
RAM. Two sets of timings were recorded for Rborist, one
employing factors to represent categorical predictors, the other employing a
one-hot encoding. Xgboost, which employs a one-hot encoding, was run with the
default fast ("approximate") execution mode. Comparative timings are given in
the table below, with the AUC value of prediction in parentheses:

[Table omitted from this feed: comparative training times for Rborist (factor predictors), Rborist (one-hot encoding) and Xgboost on the flight-delay data, with AUC values in parentheses; see the original post (blowup1.pdf).]

Finally, we note the passing of an unreported bug. As Zhao et al. note in
their 2017 JSS paper, the initial release of Rborist, 0-1.0,
"fails even on the small example of ParallelForest", a census of household
incomes. Although identified in 2015, the failure was never reported and we
remain unclear as to its provenance. Lamentable as this state of affairs is,
suffice it to say that version 0-1.8, training the example as a two-category
classification model, completes in approximately 25 seconds and infers the test
outcomes with an AUC value of 0.9.

Thanks go out to a number of people who have suggested new features,
isolated bugs and corrected documentation. These include Ryan Ballantine, Brent
Crossman, Chris Kennedy, Mark Juarez and Pantelis Hadjipantelis.


Generating Quadratic Primes: Euler Problem 27

Thu, 08/03/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

Solution to Euler Problem 27 using the R language. Find the product of the coefficients for the quadratic expression that produces the most primes. Continue reading →
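The excerpt above only links to the full write-up, but the problem statement is enough for a rough brute-force sketch. Note that the coefficient bounds below are the classic Project Euler constraints and are an assumption on my part; the author's actual solution may well be different (and faster).

is_prime <- function(n) {
  if (n < 2) return(FALSE)
  if (n < 4) return(TRUE)                 # 2 and 3 are prime
  if (n %% 2 == 0) return(FALSE)
  divs <- seq(3, max(3, floor(sqrt(n))), by = 2)
  all(n %% divs != 0)
}

# Number of consecutive primes produced by n^2 + a*n + b, starting at n = 0
chain_length <- function(a, b) {
  n <- 0
  while (is_prime(n^2 + a * n + b)) n <- n + 1
  n
}

best <- list(a = 0, b = 0, len = 0)
# b must itself be prime, because the chain starts at n = 0 with value b
for (b in Filter(is_prime, 2:1000)) {
  for (a in -999:999) {
    len <- chain_length(a, b)
    if (len > best$len) best <- list(a = a, b = b, len = len)
  }
}
best$a * best$b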

The post Generating Quadratic Primes: Euler Problem 27 appeared first on The Devil is in the Data.


Fun data: open data that is fun to analyse

Thu, 08/03/2017 - 02:00

(This article was first published on Blog, and kindly contributed to R-bloggers)


RStudio Connect v1.5.4 – Now Supporting Plumber!

Thu, 08/03/2017 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re thrilled to announce support for hosting Plumber APIs in RStudio Connect: version 1.5.4. Plumber is an R package that allows you to define web APIs by adding special annotations to your existing R code – allowing you to make your R functions accessible to other systems.

Below you can see the auto-generated “swagger” interface for a web API written using Plumber.

Develop Web APIs using Plumber

The open-source Plumber R package enables you to create web APIs by merely adding special comments to your existing functions. These APIs can then be leveraged from other systems in your organization. For instance, you could query some functions written in R from a Java or Python application. Or you could develop a client for your API in JavaScript and allow users to interact with your R functions from a web browser.
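For readers who have not seen Plumber before, the pattern looks roughly like this; the endpoint below is a made-up example for illustration, not something shipped with this release:

library(plumber)

#* Return the mean of a comma-separated list of numbers
#* @param values e.g. "1,2,3"
#* @get /mean
function(values) {
  mean(as.numeric(strsplit(values, ",")[[1]]))
}

Saved as, say, api.R, this can be served locally with plumb("api.R")$run(port = 8000) and then published to RStudio Connect like any other piece of content.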

Like Shiny applications, RStudio Connect supports one-step publishing, access controls, logging, and scaling for Plumber APIs. Visit the documentation for guidance on publishing APIs to RStudio Connect.

Users may now create and manage personal API keys that will allow them to programmatically access APIs that require authentication; see the user guide for more details.

Other notable changes this release:

  • Content search – On the content listing page, you can now search across all deployed content by title.
  • Official support for using PostgreSQL databases instead of SQLite. When configured appropriately, PostgreSQL can offer better performance. You can find documentation on configuration and the built-in database migration tool here.
  • No customization of external usernames – When a username is obtained from an external authentication provider like LDAP, RStudio Connect will no longer offer the user an opportunity to customize the associated internal username. Previously this situation could occur if the username obtained from the external provider included a special character that RStudio Connect didn’t allow in usernames. Now, whatever username is provided from the external provider will be used without complaint. See the admin guide for more details.
  • Upgraded our licensing software – 1.5.4 includes new licensing software that will minimize user issues and report errors more clearly when they do occur. This release also includes experimental support for floating licenses which can be used to support transient servers that might be running in Docker or another virtualized environment. Please contact support@rstudio.com if you’re interested in helping test this feature.
  • Added a health check endpoint to make monitoring easier. See the admin guide for more details.
  • Added support for Shiny reconnects. This enables users to reconnect to existing Shiny sessions after brief network interruptions. This feature is not yet enabled by default but you can turn it on by setting [Client].ReconnectTimeout to something like 15s.
  • The [Authentication].Inactivity setting can now be used to log users out after a period of inactivity. By default this feature is disabled, meaning users will remain logged in until their session expires, as controlled by the [Authentication].Lifetime setting. Additionally, we now do a better job of detecting when the user is logged out and immediately send them to the login page.
  • Support external R packages. This allows you to install an R package in the global system library and have deployed content use that package rather than trying to rebuild the package itself. This can be used as a workaround for packages that can’t be installed correctly using Packrat, but should be viewed as a last resort, since this practice decreases the reproducibility and isolation of your content. See the admin guide for more details.
  • If they exist, inject http_proxy and https_proxy environment variables into all child R processes. More documentation available here.
  • RStudio Connect now presents a warning when it is not using HTTPS. This is to remind users and administrators that it is insecure to send sensitive information like usernames and passwords over a non-secured connection. See the admin guide for more information on how to configure your server to use HTTPS. Alternatively, if you’re handling SSL termination outside of Connect and want to disable this warning, you can set [Http].NoWarning = true.
  • RStudio Connect no longer leaves any R processes running when you stop the service. When the rstudio-connect service is restarted or stopped, all running R jobs are immediately interrupted.
  • LDAP group queries are now cached for approximately ten seconds. This can significantly improve the load time of Shiny applications and other resources when using an LDAP server that contains many users or groups. Additionally, LDAP user searching has been improved to better handle certain configurations.

You can see the full release notes for RStudio Connect 1.5.4 here.

Upgrade Planning

You can expect the installation and startup of v1.5.4 to be completed in under a minute. Previously authenticated users will need to log in again the next time they visit the server.

If your server is not using Connect’s HTTPS capabilities, your users will see a warning about using an insecure configuration. If you’re doing SSL termination outside of Connect, you should configure [Http].NoWarning=true to remove this warning.

If you’re upgrading from a release older than 1.5.0, be sure to consider the “Upgrade Planning” notes from those other releases, as well.

If you haven’t yet had a chance to download and try RStudio Connect we encourage you to do so. RStudio Connect is the best way to share all the work that you do in R (Shiny apps, R Markdown documents, plots, dashboards, Plumber APIs, etc.) with collaborators, colleagues, or customers.

You can find more details or download a 45 day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below.


Applications in energy, retail and shipping

Wed, 08/02/2017 - 22:39

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The Solutions section of the Cortana Intelligence Gallery provides more than two dozen working examples of applying machine learning, data science and artificial intelligence to real-world problems. Each solution provides sample data, scripts for model training and evaluation, and reporting of predictions. You can deploy a complete stack in Azure to implement the solution with the click of a button, or follow instructions to deploy on your own hardware. The internals of each solution is fully documented and open source, so you can easily customize it to your needs.

Here's a brief overview of some solutions that have recently been posted to the Gallery. Click on the links in bold to be taken to the main solution page.

Customer Churn Prediction. This solution uses historical customer transaction data to identify new customers that are most likely to churn (switch to a competitor) in the near future. Azure Machine Learning is used to train a boosted decision tree model (with some Python scripts for data wrangling) on data stored in Azure SQL Data Warehouse; Azure Event Hub and Azure Stream Analytics are used to generate predictions on incoming data. Results are presented in interactive Power BI dashboards. Details in the Solution Guide on Github.

Demand Forecasting for Shipping and Distribution. This solution uses historical demand data (such as shipping or delivery records) to forecast future demand across customers, products and destinations. Forecasts are generated from data in Azure SQL Server using R functions within Azure ML and presented as a dashboard in Power BI. Developed in conjunction with New Zealand supply-chain company Kotahi. Details in the Solution Guide on Github.

Energy Supply Optimization. This solution determines the optimal mix of energy types (substation, battery cell, wind, solar, etc.) for efficient management of an energy grid. Azure Batch is used to manage a cluster of Data Science Virtual Machines, which run Pyomo (within Python) to solve complex numerical optimization problems. Results are presented in Power BI from data exported to Azure SQL Database. Details in the Solution Guide in Github.

Oil and Gas Tank Level Forecasting. This solution helps oil and gas storage facilities to anticipate problems that can lead to spills or emergency shutdowns. Data from supply sources combined with tank sensor data and outflow measurements are used to predict tank levels and flag anomalies. The zoo package for R is used to process the time series data and generate the forecasts, which are presented in a Power BI dashboard as shown below. Details in the Solution Guide on Github.

For more solutions in various industries, browse the Gallery at the link below.

Cortana Intelligence Gallery: Solutions


Data wrangling : Transforming (3/3)

Wed, 08/02/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


Data wrangling is a task of great importance in data analysis. It is the process of importing, cleaning and transforming raw data into actionable information for analysis, and it is a time-consuming process estimated to take about 60-80% of an analyst’s time. In this series we will go through this process. It is a brief series with the goal of sharpening the reader’s data wrangling skills. This is the third part of the series and it covers transforming your data. This can include filtering, summarizing, and ordering your data by different means, as well as combining various data sets, creating new variables, and many other manipulation tasks. In this post, we will go through a few more advanced transformation tasks on the mtcars data set, in particular table manipulation.

Before proceeding, it might be helpful to look over the help pages for inner_join, full_join, left_join, right_join, semi_join, anti_join, intersect, union, setdiff, and bind_rows.

Moreover, please install and load the dplyr package by running the following code.
install.packages("dplyr")
library(dplyr)
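If the two-table verbs are new to you, the general pattern used throughout these exercises looks like this; the tiny data frames below are made up purely for illustration (the exercises themselves use mtcars and a cars_table object joined on the variable ID):

band_a <- data.frame(ID = c(1, 2, 3), name  = c("Mick", "John", "Paul"))
band_b <- data.frame(ID = c(1, 3, 4), plays = c("vocals", "bass", "drums"))

inner_join(band_a, band_b, by = "ID")  # only IDs present in both tables
left_join(band_a, band_b, by = "ID")   # all rows of band_a, matched where possible
anti_join(band_a, band_b, by = "ID")   # rows of band_a with no match in band_b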

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Create a new object named car_inner containing the observations that have matching values in both tables mtcars and cars_table using as key the variable ID.

Exercise 2

Create a new object named car_left containing all the observations from the left table (mtcars), and the matched records from the right table (cars_table) using as key the variable ID.

Learn more about Data Pre-Processing in the online course R Data Pre-Processing & Data Management – Shape your Data!. In this course you will learn how to:

  • Work with popular libraries such as dplyr
  • Learn about methods such as pipelines
  • And much more

Exercise 3

Create a new object named car_right containing all the observations from the right table (cars_table), and the matched records from the left table (mtcars), using as key the variable ID.

Exercise 4

Create a new object named car_full containing all the observations from both the left (cars_table) and right (mtcars) tables, whether or not there is a match, using as key the variable ID.

Exercise 5

Create a new object named car_semi containing all the observations from mtcars where there are matching values in cars_table using as key the variable ID.

Exercise 6
Create a new object named car_anti containing all the observations from mtcars where there are not matching values in cars_table using as key the variable ID.

Exercise 7

Create a new object named cars_inter which contains rows that appear in both tables mtcars and cars.

Exercise 8

Create a new object named cars_union which contains the rows that appear in either of the tables mtcars and cars.

Exercise 9

Create a new object named cars_diff which contains the rows that appear in mtcars but not in cars.

Exercise 10

Append mtcars to cars and assign the result to the object car_rows.

Related exercise sets:
  1. Data table exercises: keys and subsetting
  2. Data wrangling : Transforming (2/3)
  3. Merging Dataframes Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

[R] Kenntnis-Tage 2017: Register now and benefit from the Summer Special

Wed, 08/02/2017 - 14:44

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

On November 8 and 9, Kassel will once more become the meeting point for the German-speaking R community.

From the usage of R in the automotive industry to risk analysis, from data mining with caret to R Markdown: The [R] Kenntnis-Tage 2017 are again standing for an exciting program – always with a focus on the application of R in a productive corporate setting.

The event offers an exchange platform for the day-to-day work with R which is unique in this form.

A diverse program

Informative data science success stories meet practical R tutorials and offer deep insights into the fields of application of the free data science language.

One of the highlights of the [R] Kenntnis-Tage 2017 will be the opening keynote by Markus Gesmann, one of the leading experts for the application of data science and R in the financial sector and developer of the R packages “googleVis” and “ChainLadder”. In the second keynote, Kira Schacht from the Rheinische Post and Marie-Louise Timcke from the Berliner Morgenpost will provide insights into the use of R in data journalism. Another keynote will be announced shortly.

Data science goes professional: The event’s motto is also reflected in the guest talks. Predictive maintenance at global market leader TRUMPF, forecast models in the energy sector or working with the platform Docker – experienced R users talk about their challenges and especially their solutions in R projects.

An integral part of the [R] Kenntnis-Tage are the tutorial sessions. In addition to the popular methodology tutorials on data visualization and cluster analysis, the agenda also covers the current trend topic Shiny.

Register now and save money with the promotion code “sommeR”

From now on until August 31, you can benefit from the [R] Kenntnis-Tage 2017 Summer Special: Use the promotion code “sommeR” and get a 17% discount on your attendance at the [R] Kenntnis-Tage.

For further information please visit our [R] Kenntnis-Tage page.


Twitter Coverage of the ISMB/ECCB Conference 2017

Wed, 08/02/2017 - 13:02

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

Search all the hashtags

ISMB (Intelligent Systems for Molecular Biology – which sounds rather old-fashioned now, doesn’t it?) is the largest conference for bioinformatics and computational biology. It is held annually and, when in Europe, jointly with the European Conference on Computational Biology (ECCB).

I’ve had the good fortune to attend twice: in Brisbane 2003 (very enjoyable early in my bioinformatics career, but unfortunately the seed for the “no more southern hemisphere meetings” decision), and in Toronto 2008. The latter was notable for its online coverage and for me, the pleasure of finally meeting in person many members of the online bioinformatics community.

The 2017 meeting (and its satellite meetings) were covered quite extensively on Twitter. My search using a variety of hashtags based on “ismb”, “eccb”, “17” and “2017” retrieved 9052 tweets, which form the basis of this summary at RPubs. Code and raw data can be found at Github.

Usually I just let these reports speak for themselves but in this case, I thought it was worth noting a few points:

ISMB/ECCB 2017 was tweeted extensively. My hashtag search retrieved 9052 tweets (the same search a day or so later – 8414, go figure) and there are sure to be some that were missed. July 23 saw more than 2800 tweets and one user tweeted more than 600 times. These are large numbers by comparison with many conferences.

Most of them managed to use the official meeting hashtag – #ismbeccb – but deciding on and sticking with a single hashtag always seems to be an issue. Of course, bioinformaticians know that naming things is hard.

Interesting evening activity. Tweets by day and time from conferences generally peak around presentations in the morning and afternoon sessions. ISMB/ECCB 2017 also has a couple of days with a large late evening peak. Perhaps this relates to an activity, or people gathering their thoughts in their rooms?

Communities of tweeters. I created two directed graphs from the Twitter data; one based on users who replied directly to other users, and one based on users who mentioned other users. Usually I load these into Gephi, colour by page rank, play around with layouts and generate a pretty, but uninformative graphic. This time I experimented with filters (to remove nodes with few connections) and community detection. It’s more art than (data) science, but there do seem to be some interesting subgraphs in the mentions network. Some of them may be based around special interest groups and their satellite meetings, such as BOSC and HitSeq. It’s likely that users in these groups are talking to one another and using group hashtags in addition to the main meeting hashtags.
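The actual code is in the Github repository linked above; as a rough sketch of the idea, building a mentions graph with igraph might look like the following. The tweets data frame, its column names and the degree cut-off are all assumptions for illustration (real data would come from a Twitter search, e.g. via rtweet):

library(igraph)

# Toy stand-in for the real data: one row per tweet
tweets <- data.frame(screen_name = c("alice", "bob", "carol", "alice"),
                     stringsAsFactors = FALSE)
tweets$mentions_screen_name <- list(c("bob", "carol"), "alice", NA, "bob")

# One edge per (tweeter -> mentioned user) pair
edges <- do.call(rbind, lapply(seq_len(nrow(tweets)), function(i) {
  to <- unlist(tweets$mentions_screen_name[i])
  to <- to[!is.na(to)]
  if (length(to) == 0) return(NULL)
  data.frame(from = tweets$screen_name[i], to = to, stringsAsFactors = FALSE)
}))

g <- graph_from_data_frame(edges, directed = TRUE)

# Drop weakly connected users before community detection
g_core <- induced_subgraph(g, V(g)[degree(g) >= 2])
membership(cluster_walktrap(as.undirected(g_core)))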

Many retweets, a lot of likes, few quotes. Meeting tweets were retweeted extensively: a majority of tweets were retweets, and almost half were favourites, indicating high engagement. There were far fewer quotes (retweets with additional comment), though; perhaps because the time and effort barrier is a little higher?

It’s still all about careers. About 20% of tweets had media attached, of which the most-liked was this one:

Tips for a successful research career by Christine Orengo #SCS17 #ismbeccb pic.twitter.com/7y8DdHlrIb

— LauraC (@LauCan88) July 21, 2017

Best wishes to all of those young attendees trying to build their careers in bioinformatics. And a reminder from one who knows: there are always alternatives.

Filed under: bioinformatics, meetings, R, statistics Tagged: eccb, ismb, twitter


F-Test: Compare Two Variances in R

Wed, 08/02/2017 - 08:42

F-test is used to assess whether the variances of two populations (A and B) are equal.


When do you use the F-test?

Comparing two variances is useful in several cases, including:

  • When you want to perform a two samples t-test to check the equality of the variances of the two samples

  • When you want to compare the variability of a new measurement method to an old one. Does the new method reduce the variability of the measure?

Research questions and statistical hypotheses

Typical research questions are:

  1. whether the variance of group A (\(\sigma^2_A\)) is equal to the variance of group B (\(\sigma^2_B\))?
  2. whether the variance of group A (\(\sigma^2_A\)) is less than the variance of group B (\(\sigma^2_B\))?
  3. whether the variance of group A (\(\sigma^2_A\)) is greater than the variance of group B (\(\sigma^2_B\))?

In statistics, we can define the corresponding null hypothesis (\(H_0\)) as follows:

  1. \(H_0: \sigma^2_A = \sigma^2_B\)
  2. \(H_0: \sigma^2_A \geq \sigma^2_B\)
  3. \(H_0: \sigma^2_A \leq \sigma^2_B\)

The corresponding alternative hypotheses (\(H_a\)) are as follows:

  1. \(H_a: \sigma^2_A \ne \sigma^2_B\) (different)
  2. \(H_a: \sigma^2_A < \sigma^2_B\) (less)
  3. \(H_a: \sigma^2_A > \sigma^2_B\) (greater)

Note that:

  • Hypothesis 1) is called a two-tailed test
  • Hypotheses 2) and 3) are called one-tailed tests
Formula of F-test

The test statistic can be obtained by computing the ratio of the two variances \(S_A^2\) and \(S_B^2\).

\[F = \frac{S_A^2}{S_B^2}\]

The degrees of freedom are \(n_A - 1\) (for the numerator) and \(n_B - 1\) (for the denominator).

Note that, the more this ratio deviates from 1, the stronger the evidence for unequal population variances.

Note that, the F-test requires the two samples to be normally distributed.

Compute F-test in R

R function

The R function var.test() can be used to compare two variances as follow:

# Method 1
var.test(values ~ groups, data, alternative = "two.sided")

# or Method 2
var.test(x, y, alternative = "two.sided")

  • x,y: numeric vectors
  • alternative: the alternative hypothesis. Allowed value is one of “two.sided” (default), “greater” or “less”.

Import and check your data into R

To import your data, use the following R code:

# If .txt tab file, use this
my_data <- read.delim(file.choose())

# Or, if .csv file, use this
my_data <- read.csv(file.choose())

Here, we’ll use the built-in R data set named ToothGrowth:

# Store the data in the variable my_data
my_data <- ToothGrowth

To get an idea of what the data look like, we start by displaying a random sample of 10 rows using the function sample_n() [in the dplyr package]:

library("dplyr")
sample_n(my_data, 10)
    len supp dose
43 23.6   OJ  1.0
28 21.5   VC  2.0
25 26.4   VC  2.0
56 30.9   OJ  2.0
46 25.2   OJ  1.0
7  11.2   VC  0.5
16 17.3   VC  1.0
4   5.8   VC  0.5
48 21.2   OJ  1.0
37  8.2   OJ  0.5

We want to test the equality of variances between the two groups OJ and VC in the column “supp”.

Preliminary test to check F-test assumptions

F-test is very sensitive to departure from the normal assumption. You need to check whether the data is normally distributed before using the F-test.

Shapiro-Wilk test can be used to test whether the normal assumption holds. It’s also possible to use Q-Q plot (quantile-quantile plot) to graphically evaluate the normality of a variable. Q-Q plot draws the correlation between a given sample and the normal distribution.

If there is doubt about normality, the better choice is to use Levene’s test or Fligner-Killeen test, which are less sensitive to departure from normal assumption.
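For example, a quick check on the ToothGrowth data loaded above might look like this (a minimal sketch using base R functions):

# Shapiro-Wilk normality test for each level of the supp factor
with(my_data, shapiro.test(len[supp == "OJ"]))
with(my_data, shapiro.test(len[supp == "VC"]))

# Q-Q plot of one group as a visual check
qqnorm(my_data$len[my_data$supp == "OJ"])
qqline(my_data$len[my_data$supp == "OJ"])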

Compute F-test

# F-test
res.ftest <- var.test(len ~ supp, data = my_data)
res.ftest

 F test to compare two variances

data:  len by supp
F = 0.6386, num df = 29, denom df = 29, p-value = 0.2331
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3039488 1.3416857
sample estimates:
ratio of variances
         0.6385951

Interpretation of the result

The p-value of the F-test is p = 0.2331433, which is greater than the significance level 0.05. In conclusion, there is no significant difference between the two variances.

Access to the values returned by var.test() function

The function var.test() returns a list containing the following components:

  • statistic: the value of the F test statistic.
  • parameter: the degrees of the freedom of the F distribution of the test statistic.
  • p.value: the p-value of the test.
  • conf.int: a confidence interval for the ratio of the population variances.
  • estimate: the ratio of the sample variances

The format of the R code to use for getting these values is as follow:

# ratio of variances
res.ftest$estimate
ratio of variances
         0.6385951

# p-value of the test
res.ftest$p.value
[1] 0.2331433

Infos

This analysis has been performed using R software (ver. 3.3.2).
