R-bloggers: R news and tutorials contributed by 750 R bloggers

Calculating Marginal Effects Exercises

Mon, 09/04/2017 - 08:54

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

A common experience for those in the social sciences migrating to R from SPSS or Stata is that some procedures that used to happen at the click of a button now require more work, or are too obscured by the unfamiliar language to see how to accomplish them. One such procedure I have run into is calculating the marginal effects of a generalized linear model. In this exercise set, we will explore calculating marginal effects for linear, logistic, and probit regression models in R.

Exercises in this section will be solved using the margins and mfx packages. It is recommended that you take a look at the concise and excellent documentation for these packages before continuing.

Answers to the exercises are available here.

Exercise 1
Load the mtcars dataset. Build a linear regression of mpg on wt, qsec, am, and hp.

Exercise 2
Print the coefficients from the linear model in the previous exercise.

Exercise 3
Using the margins package, find the marginal effects for the model from Exercise 1.
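
A minimal sketch of the workflow behind Exercises 1-3 (this assumes the CRAN margins package and is not the official solution):

library(margins)
fit <- lm(mpg ~ wt + qsec + am + hp, data = mtcars)
coef(fit)               # Exercise 2
summary(margins(fit))   # Exercise 3: average marginal effects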

Exercise 4
Verify that you receive the same results from Exercises 2 and 3. Why do these marginal effects match the coefficients found when printing the linear model object?

Exercise 5
Using the mtcars dataset, build a linear regression similar to Exercise 1, except include an interaction term between am and hp. Find the marginal effects for this regression.

Exercise 6
Using your favorite dataset (mine is field.goals from the nutshell package), construct a logistic regression.

Exercise 7
Explain why marginal effects for a logit model are more complex than for a linear model.

Exercise 8
For the next two exercises, you may use either package. Calculate the marginal effects at the mean.

Exercise 9
Calculate the average marginal effects.
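
One possible approach to Exercises 8-10, sketched with the mfx package; the formula and the data frame my_data are placeholders for whatever logistic model you built in Exercise 6:

library(mfx)
## marginal effects at the mean (MEM)
logitmfx(y ~ x1 + x2, data = my_data, atmean = TRUE)
## average marginal effects (AME)
logitmfx(y ~ x1 + x2, data = my_data, atmean = FALSE)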

Exercise 10
If these marginal effects are different, explain why they are different.

Related exercise sets:
  1. Introduction to copulas Exercises (Part-1)
  2. Multiple Regression (Part 3) Diagnostics
  3. Introduction to copulas Exercises (Part-2)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


Thou shalt not compare numeric values (except when it works)

Mon, 09/04/2017 - 07:32

(This article was first published on R – Irregularly Scheduled Programming, and kindly contributed to R-bloggers)

This was just going to be a few Tweets but it ended up being a bit of a rollercoaster of learning for me, and I haven’t blogged in far too long, so I’m writing it up quickly as a ‘hey look at that’ example for newcomers.

I’ve been working on the ‘merging data’ part of my book and, as I do when I’m writing this stuff, I had a play around with some examples to see if there was anything funky going on if a reader was to try something slightly different. I’ve been using dplyr for the examples after being thoroughly convinced on Twitter to do so. It’s going well. Mostly.

## if you haven't already done so, load dplyr
suppressPackageStartupMessages(library(dplyr))

My example involved joining together two tibbles containing text values. Nothing too surprising. I wondered though; do numbers behave the way I expect?

Now, a big rule in programming is ‘thou shalt not compare numbers’, and it holds especially true when numbers aren’t exactly integers. This is because representing non-integers is hard, and what you see on the screen isn’t always what the computer sees internally.

Thou shalt not compare numbers.
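
The classic base R illustration of why:

0.1 + 0.2 == 0.3
## [1] FALSE
print(0.1 + 0.2, digits = 17)
## [1] 0.30000000000000004
all.equal(0.1 + 0.2, 0.3)   # tolerance-based comparison
## [1] TRUE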

If I had a tibble where the column I would use to join had integers

dataA <- tribble(
  ~X,   ~Y,
  0L, 100L,
  1L, 101L,
  2L, 102L,
  3L, 103L
)
dataA
## # A tibble: 4 x 2
##       X     Y
## 1     0   100
## 2     1   101
## 3     2   102
## 4     3   103

and another tibble with numeric in that column

dataB <- tribble(
  ~X,    ~Z,
  0, 1000L,
  1, 1001L,
  2, 1002L,
  3, 1003L
)
dataB
## # A tibble: 4 x 2
##       X     Z
## 1     0  1000
## 2     1  1001
## 3     2  1002
## 4     3  1003

would they still join?

full_join(dataA, dataB)
## Joining, by = "X"
## # A tibble: 4 x 3
##       X     Y     Z
## 1     0   100  1000
## 2     1   101  1001
## 3     2   102  1002
## 4     3   103  1003

Okay, sure. R treats these as close enough to join. I mean, maybe it shouldn’t, but we’ll work with what we have. R doesn’t always think these are equal

identical(0L, 0)
## [1] FALSE
identical(2L, 2)
## [1] FALSE

though sometimes it does

0L == 0
## [1] TRUE
2L == 2
## [1] TRUE

(== coerces types before comparing). Well, what if one of these just ‘looks like’ the other value (can be coerced to the same?)

dataC <- tribble(
  ~X,   ~Z,
  "0", 100L,
  "1", 101L,
  "2", 102L,
  "3", 103L
)
dataC
## # A tibble: 4 x 2
##       X     Z
## 1     0   100
## 2     1   101
## 3     2   102
## 4     3   103
full_join(dataA, dataC)
## Joining, by = "X"
## Error in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y, check_na_matches(na_matches)): Can't join on 'X' x 'X' because of incompatible types (character / integer)

That’s probably wise. Of course, R is perfectly happy with things like

"2":"5" ## [1] 2 3 4 5

and == thinks that’s fine

"0" == 0L
## [1] TRUE
"2" == 2L
## [1] TRUE

but who am I to argue?

Anyway, how far apart can those integers and numerics be before they aren’t able to be joined? What if we shift the ‘numeric in name only’ values away from the integers just a teensy bit? .Machine$double.eps is the built-in value for ‘the tiniest number you can produce’. On this system it’s 2.220446 × 10^-16.

dataBeps <- tribble(
  ~X,                          ~Z,
  0 + .Machine$double.eps, 1000L,
  1 + .Machine$double.eps, 1001L,
  2 + .Machine$double.eps, 1002L,
  3 + .Machine$double.eps, 1003L
)
dataBeps
## # A tibble: 4 x 2
##              X     Z
## 1 2.220446e-16  1000
## 2 1.000000e+00  1001
## 3 2.000000e+00  1002
## 4 3.000000e+00  1003
full_join(dataA, dataBeps)
## Joining, by = "X"
## # A tibble: 6 x 3
##              X     Y     Z
## 1 0.000000e+00   100    NA
## 2 1.000000e+00   101    NA
## 3 2.000000e+00   102  1002
## 4 3.000000e+00   103  1003
## 5 2.220446e-16    NA  1000
## 6 1.000000e+00    NA  1001

Well, that’s… weirder. The values offset from 2 and 3 joined fine, but the 0 and 1 each got multiple copies since R thinks they’re different. What if we offset a little further?

dataB2eps <- tribble(
  ~X,                            ~Z,
  0 + 2*.Machine$double.eps, 1000L,
  1 + 2*.Machine$double.eps, 1001L,
  2 + 2*.Machine$double.eps, 1002L,
  3 + 2*.Machine$double.eps, 1003L
)
dataB2eps
## # A tibble: 4 x 2
##              X     Z
## 1 4.440892e-16  1000
## 2 1.000000e+00  1001
## 3 2.000000e+00  1002
## 4 3.000000e+00  1003
full_join(dataA, dataB2eps)
## Joining, by = "X"
## # A tibble: 8 x 3
##              X     Y     Z
## 1 0.000000e+00   100    NA
## 2 1.000000e+00   101    NA
## 3 2.000000e+00   102    NA
## 4 3.000000e+00   103    NA
## 5 4.440892e-16    NA  1000
## 6 1.000000e+00    NA  1001
## 7 2.000000e+00    NA  1002
## 8 3.000000e+00    NA  1003

That’s what I’d expect. So, what’s going on? Why does R think those numbers are the same? Let’s check with a minimal example: For each of the values 0:4, let’s compare that integer with the same offset by .Machine$double.eps

suppressPackageStartupMessages(library(purrr)) ## for the 'thou shalt not for-loop' crowd
map_lgl(0:4, ~ as.integer(.x) == as.integer(.x) + .Machine$double.eps)
## [1] FALSE FALSE  TRUE  TRUE  TRUE

And there we have it. Some sort of relative difference tolerance maybe? In any case, the general rule to live by is to never compare floats. Add this to the list of reasons why.
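
If you really do need to compare doubles, a tolerance-based check such as dplyr::near() (or base all.equal()) is the safer route; a quick sketch:

near(2, 2 + .Machine$double.eps)   # well within the default tolerance
## [1] TRUE
near(2, 2 + 1e-7)                  # too far apart
## [1] FALSE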

For what it’s worth, I’m sure this is hardly a surprising detail to the dplyr team. They’ve dealt with things like this for a long time and I’m sure it was much worse before those changes.



Migrating from GitHub to GitLab with RStudio (Tutorial)

Mon, 09/04/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

GitHub vs. GitLab

Git is a distributed implementation of version control. Many people have written very eloquently about why it is a good idea to use version control, not only if you collaborate in a team but also if you work on your own; one example is this article from RStudio’s Support pages.

In short, version control allows you to keep track of the changes you make to your code. It also keeps a history of all the changes you have made in the past and allows you to go back to specific versions if you made a major mistake. And Git makes collaborating on the same code very easy.

Most R packages are also hosted on GitHub. You can check out their R code in the repositories if you want to get a deeper understanding of the functions, and you can install the latest development versions of packages or install packages that are not on CRAN. The issue tracker function of GitHub also makes it easy to report and respond to issues/problems with your code.

Why would you want to leave GitHub?

Public repositories are free on GitHub, but you need to pay for private repos (if you are a student or work in academia, you get private repos for free). Since I recently switched from academia to industry and no longer fulfil these criteria, all my private repos would have to be switched to public in the future. Here, GitLab is a great alternative!

GitLab offers very similar functionality to GitHub. There are many pros and cons for using GitHub versus GitLab, but for me the selling point was that GitLab offers unlimited private projects and collaborators in its free plan.

Tutorial

Migrating from GitHub to GitLab with RStudio is very easy! Here, I will show how I migrated the GitHub repositories of R projects that I work with from within RStudio over to GitLab.

Beware that ALL code snippets below show Terminal code (they are NOT run from the R console)!

Migrating existing repositories

You first need to set up your GitLab account (you can log in with your GitHub account) and connect your old GitHub account. Under Settings & Account, you will find “Social sign-in”; here click on “Connect” next to the GitHub symbol (if you signed in with your GitHub account, it will already be connected).

Once you have done this, you can import all your GitHub repositories to GitLab. To do this, you first need to create a new project. Click on the drop-down arrow next to the plus sign in the top-right corner and select “New project”. This will open the following window:

Here, choose “Import project from GitHub” and choose the repositories you want to import.

If you go into one of your repositories, GitLab will show you a message at the top of the site that tells you that you need to add an SSH key. The SSH key is used for secure communication between the GitLab server and your computer when you want to share information, like push/pull commits.

If you already work with GitHub on your computer, you will have an SSH key set up and you can copy your public SSH key to GitLab. Follow the instructions here.

Here is how you do it on a Mac:

  1. Look for your public key and copy it to the clipboard

cat ~/.ssh/id_rsa.pub
pbcopy < ~/.ssh/id_rsa.pub

Then paste it into the respective field here.

The next step is to change the remote URL for pushing/pulling your project from RStudio. In your Git window (tab next to “Environment” and “History” for me), click on Settings and “Shell”.

Then write in the shell window that opened:

git remote set-url origin git@gitlab.com:<username>/<repository>.git

You can copy the link in the navigation bar of your repo on GitLab.

Check that you now have the correct new gitlab path by going to “Tools”, “Project Options” and “Git/SVN”.
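
You can also confirm the new remote from the shell (again Terminal code, not R); the URL shown is a placeholder for your own GitLab path:

git remote -v
# origin  git@gitlab.com:<username>/<repository>.git (fetch)
# origin  git@gitlab.com:<username>/<repository>.git (push)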

Also check your SSH key configuration with:

ssh -T git@gitlab.com

If you get the following message

The authenticity of host 'gitlab.com (52.167.219.168)' can't be established.
ECDSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?

type “yes” (and enter passphrase if prompted).

If everything is okay, you now get a message saying Welcome to GitLab!

Now, you can commit, push and pull from within RStudio just as you have done before!

In case of problems with pushing/pulling

In my case, I migrated both my private and my company’s GitHub repos to GitLab. While my private repos could be migrated without a hitch, migrating my company’s repos was a bit more tricky (because they had additional security settings, I assume).

Here is how I solved this problem with my company’s repos:

I have protected my SSH key with a passphrase. When pushing or pulling commits via the shell with git pull and git push origin master, I am prompted to enter my passphrase and everything works fine. Pushing/pulling from within RStudio, however, threw an error:

ssh_askpass: exec(/usr/X11R6/bin/ssh-askpass): No such file or directory
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.

I am using a MacBook Pro with MacOS Sierra version 10.12.6, so this might not be an issue with another operating system.

The following solution worked for me:

  1. Add your SSH key

ssh-add ~/.ssh/id_rsa
  2. Restart RStudio

Now I could commit, push and pull from within RStudio just as before!



It is Needlessly Difficult to Count Rows Using dplyr

Sun, 09/03/2017 - 23:05

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

  • Question: how hard is it to count rows using the R package dplyr?
  • Answer: surprisingly difficult.

When trying to count rows using dplyr or dplyr-controlled data structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis: the task is to avoid dplyr corner cases and irregularities (a few of which I attempt to document in this "dplyr inferno").




Let’s take an example from sparklyr issue 973:

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.2.9000'
library("sparklyr")
packageVersion("sparklyr")
## [1] '0.6.2'
sc <- spark_connect(master = "local")
## * Using Spark: 2.1.0
db_drop_table(sc, 'extab', force = TRUE)
## [1] 0
DBI::dbGetQuery(sc, "DROP TABLE IF EXISTS extab")
DBI::dbGetQuery(sc, "CREATE TABLE extab (n TINYINT)")
DBI::dbGetQuery(sc, "INSERT INTO extab VALUES (1), (2), (3)")
dRemote <- tbl(sc, "extab")
print(dRemote)
## # Source: table [?? x 1]
## # Database: spark_connection
##       n
## 1    01
## 2    02
## 3    03
dLocal <- data.frame(n = as.raw(1:3))
print(dLocal)
##    n
## 1 01
## 2 02
## 3 03

Many Apache Spark big data projects use the TINYINT type to save space. TINYINT behaves as a numeric type on the Spark side (you can run it through SparkML machine learning models correctly), and the translation of this type to R's raw type (which is not an arithmetic or numerical type) is something that is likely to be fixed very soon. However, there are other reasons a table might have R raw columns in it, so we should expect our tools to work properly with such columns present.

Now let’s try to count the rows of this table:

nrow(dRemote) ## [1] NA

That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

There are a number of common legitimate uses of nrow() in user code and package code including:

  • Checking if a table is empty.
  • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).
  • Confirming data size is the same as reported in other sources (Spark, database, and so on).
  • Reporting amount of work performed or rows-per-second processed.

The obvious generic dplyr idiom would then be dplyr::tally() (our code won’t know to call the new sparklyr::sdf_nrow() function, without writing code to check we are in fact looking at a Sparklyr reference structure):

tally(dRemote) ## # Source: lazy query [?? x 1] ## # Database: spark_connection ## nn ## ## 1 3

That works for Spark, but not for local tables:

dLocal %>% tally ## Using `n` as weighting variable ## Error in summarise_impl(.data, dots): Evaluation error: invalid 'type' (raw) of argument.

This issue is filed as dplyr issue 3070. The code above usually either errors out (if the column is raw) or creates a new total column called nn containing the sum of the n column instead of the count.

data.frame(n=100) %>% tally ## Using `n` as weighting variable ## nn ## 1 100

We could try adding a column and summing that:

dLocal %>% transmute(constant = 1.0) %>% summarize(n = sum(constant)) ## Error in mutate_impl(.data, dots): Column `n` is of unsupported type raw vector

That fails due to dplyr issue 3069: local mutate() fails if there are any raw columns present (even if they are not the columns you are attempting to work with).

We can try removing the dangerous column prior to other steps:

dLocal %>% select(-n) %>% tally ## data frame with 0 columns and 3 rows

That does not work on local tables, as tally fails to count 0-column objects (dplyr issue 3071; probably the same issue exists for many dplyr verbs, as we saw a related issue for dplyr::distinct).

And the method does not work on remote tables either (Spark, or database tables) as many of them do not appear to support 0-column results:

dRemote %>% select(-n) %>% tally ## Error: Query contains no columns

In fact we start to feel trapped here. For a data object whose only column is of type raw we can’t remove all the raw columns, as we would then form a zero-column result (which does not seem to always be legal), but we cannot add columns either, as that is a current bug for local frames. We could try some other transforms (such as joins, but we don’t have safe columns to join on).

At best we can try something like this:

nrow2 <- function(d) {
  n <- nrow(d)
  if(!is.na(n)) {
    return(n)
  }
  d %>%
    ungroup() %>%
    transmute(constant = 1.0) %>%
    summarize(tot = sum(constant)) %>%
    pull()
}

dRemote %>% nrow2()
## [1] 3
dLocal %>% nrow2()
## [1] 3

We are still experimenting with work-arounds in the replyr package (but it is necessarily ugly code).

spark_disconnect(sc)


Variable Selection with Elastic Net

Sun, 09/03/2017 - 22:50

(This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

LASSO has been a popular algorithm for variable selection and is extremely effective with high-dimensional data. However, it often tends to "over-regularize", producing a model that is overly compact and therefore under-predictive.

The Elastic Net addresses the aforementioned "over-regularization" by balancing between the LASSO and ridge penalties. In particular, a hyper-parameter, namely Alpha, is used to regularize the model such that the model becomes a LASSO in the case of Alpha = 1 and a ridge in the case of Alpha = 0. In practice, Alpha can be tuned easily by cross-validation. Below is a demonstration of the Elastic Net with the R glmnet package and its comparison with LASSO and ridge models.
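
For reference, the penalized objective that glmnet minimizes can be written (up to the exact scaling of the loss term) as

$$\min_{\beta_0,\beta}\;\frac{1}{N}\sum_{i=1}^{N}\ell\big(y_i,\;\beta_0 + x_i^{\top}\beta\big)\;+\;\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]$$

where the first term is the observation-level loss (the negative log-likelihood for the binomial family), so Alpha = 1 recovers the LASSO penalty and Alpha = 0 the ridge penalty.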

pkgs <- list("glmnet", "doParallel", "foreach", "pROC")
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

df1 <- read.csv("Downloads/credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, ]
set.seed(2017)
n <- nrow(df2)
sample <- sample(seq(n), size = n * 0.5, replace = FALSE)
train <- df2[sample, -1]
test <- df2[-sample, -1]
mdlY <- as.factor(as.matrix(train["DEFAULT"]))
mdlX <- as.matrix(train[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])
newY <- as.factor(as.matrix(test["DEFAULT"]))
newX <- as.matrix(test[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])

First of all, we estimate a LASSO model with Alpha = 1. The function cv.glmnet() is used to search for a regularization parameter, namely Lambda, that controls the penalty strength. As shown below, the model only identifies 2 attributes out of the 12 in total.

# LASSO WITH ALPHA = 1
cv1 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 1)
md1 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv1$lambda.1se, alpha = 1)
coef(md1)
#(Intercept) -1.963030e+00
#AGE          .
#ACADMOS      .
#ADEPCNT      .
#MAJORDRG     .
#MINORDRG     .
#OWNRENT      .
#INCOME      -5.845981e-05
#SELFEMPL     .
#INCPER       .
#EXP_INC      .
#SPENDING     .
#LOGSPEND    -4.015902e-02
roc(newY, as.numeric(predict(md1, newX, type = "response")))
#Area under the curve: 0.636

We next estimate a ridge model by setting Alpha = 0. Similarly, Lambda is found by cross-validation. Since the ridge penalty only regularizes the magnitude of each coefficient, we end up with a "full" model containing all model attributes. The model performance is slightly better with 10 more variables, which is a debatable outcome.

# RIDGE WITH ALPHA = 0
cv2 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = 0)
md2 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv2$lambda.1se, alpha = 0)
coef(md2)
#(Intercept) -2.221016e+00
#AGE         -4.184422e-04
#ACADMOS     -3.085096e-05
#ADEPCNT      1.485114e-04
#MAJORDRG     6.684849e-03
#MINORDRG     1.006660e-03
#OWNRENT     -9.082750e-03
#INCOME      -6.960253e-06
#SELFEMPL     3.610381e-03
#INCPER      -3.881890e-07
#EXP_INC     -1.416971e-02
#SPENDING    -1.638184e-05
#LOGSPEND    -6.213884e-03
roc(newY, as.numeric(predict(md2, newX, type = "response")))
#Area under the curve: 0.6435

At last, we use the Elastic Net by tuning the value of Alpha through a line search with parallelism. In this particular case, Alpha = 0.3 is chosen through cross-validation. As shown below, only 6 variables are used in the model, which performs even better than the ridge model with all 12 attributes.

# ELASTIC NET WITH 0 < ALPHA < 1
a <- seq(0.1, 0.9, 0.05)
search <- foreach(i = a, .combine = rbind) %dopar% {
  cv <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance", paralle = TRUE, alpha = i)
  data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i)
}
cv3 <- search[search$cvm == min(search$cvm), ]
md3 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv3$lambda.1se, alpha = cv3$alpha)
coef(md3)
#(Intercept) -1.434700e+00
#AGE         -8.426525e-04
#ACADMOS      .
#ADEPCNT      .
#MAJORDRG     6.276924e-02
#MINORDRG     .
#OWNRENT     -2.780958e-02
#INCOME      -1.305118e-04
#SELFEMPL     .
#INCPER      -2.085349e-06
#EXP_INC      .
#SPENDING     .
#LOGSPEND    -9.992808e-02
roc(newY, as.numeric(predict(md3, newX, type = "response")))
#Area under the curve: 0.6449



Combining and comparing models using Model Confidence Set

Sun, 09/03/2017 - 19:19

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

In many cases, especially if you are dealing with forecasting models, it is natural to use a large set of models to forecast the same variable and then select the best model using some error measure. For example, you can break the sample into a training sample (in-sample) and a test sample (out-of-sample), estimate all models in the training sample and see how they perform in the test sample. You could compare the models using the root mean squared error (RMSE) or the mean absolute error (MAE).

If you want to make things a little more formal, then you can use a statistical test such as Diebold-Mariano to compare the models pairwise. This test will tell you whether model A is better than model B. If you have m models to compare you will need to perform m(m-1)/2 tests. If m is large you will have a lot of test results to look at and reporting the results will become a problem (imagine that you have 100 models: you will have 4950 p-values to report).

Fortunately, Hansen, Lunde and Nason proposed (here) a suitable solution in an article in 2011. Their solution is called the Model Confidence Set (MCS). The MCS has an interpretation similar to a confidence interval, i.e. the (1 - alpha) MCS contains the best model with (1 - alpha) confidence. The number of models in the MCS increases as we decrease alpha, just like the size of a confidence interval. You only need to perform the MCS procedure once to compare all models, and you only have to report one p-value for each model, which is an output of the test.

Application

Before going to the application, let us define a function that calculates the RMSE for several models at the same time, which we will be using a lot.

rmse=function(real,...){
  forecasts=cbind(...)
  sqrt(colMeans((real-forecasts)^2))
}
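
For example, with an observed vector and two forecast vectors (purely illustrative numbers), the helper returns one RMSE per forecast, named after the columns of cbind(...):

y  <- c(1, 2, 3)
f1 <- c(1.1, 1.9, 3.2)
f2 <- c(0, 2, 4)
rmse(y, f1, f2)
##        f1        f2
## 0.1414214 0.8164966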

To the application!! We are going to generate some data with the following design:

  • 300 observations (T = 300) and 8 variables (n = 8),
  • All variables are generated from x_{i,t} = phi * x_{i,t-1} + lambda_i * f_t + e_{i,t} with phi = 0.6; lambda_i is sampled from a uniform(-1, 1) distribution and e_{i,t} is Normal with mean equal to 0 and variance equal to 4,
  • f_t is a common factor generated from f_t = rho * f_{t-1} + u_t with rho = 0.75, where u_t is Normal with mean equal to 0 and variance equal to 4,
  • I generated two independent factors: four variables are generated from factor 1 and the remaining four variables are generated from factor 2.
  • We assume that we do not know which factor generates each variable.
library(reshape2)
library(grid)
library(gridExtra)
library(ggplot2)
library(boot)

T = 300     # = number of observations = #
n = 8       # = Number of Variables = #
rho = 0.75  # = factors AR coefficient = #
phi = 0.6   # = Variables AR coefficient = #
set.seed(1) # = seed for replication = #

# = Generate two autoregressive factors = #
f = matrix(0, T, 2)
for(i in 2:T){
  f[i, ] = rho * f[i-1, ] + rnorm(2)
}

lambda = runif(n,-1,1) # = Factor loadings = #

# = Generate variables = #
Y = matrix(0, T, n)
for(i in 2:T){
  # = Based on the first factor = #
  Y[i, 1:4] = phi * Y[i-1, 1:4] + lambda[1:4] * f[i, 1] + rnorm(4, 0, 2)
  # = Based on the second factor = #
  Y[i, 5:8] = phi * Y[i-1, 5:8] + lambda[5:8] * f[i,2] + rnorm(4, 0, 2)
}

# = Organize variable lags = #
# = y is the first (dependent) variable = #
Y = as.data.frame(embed(Y, 2)[ , c(1, 9:16)])
# = y is the first (dependent) variable = #
colnames(Y) = c("y", paste("x", 1:n, sep = ""))

Now that we have our data, we are going to estimate all possible combinations of linear regressions using our 8 variables. The first variable will be the dependent variable, defined as y_t. The explanatory variables are the first lags of all eight variables (including the first), defined as x_{1,t-1}, ..., x_{8,t-1}. This design results in 255 models to be estimated and compared.

# = First we will generate formulas for all the 255 equations = #
variables = colnames(Y)[-1]
combinations = lapply(1:n, function(x)combn(1:n, x))
formulas = lapply(combinations, function(y){
  apply(y, 2, function(x){
    paste("~", paste(variables[x], collapse = " + "), sep=" ")
  })
})
formulas = unlist(formulas)

# = Now we break the sample into in-sample and out-of-sample = #
in_sample = Y[1:100, ]
out_of_sample = Y[-c(1:100), ]

# = Estimate the models = #
models = lapply(formulas, function(x) lm(paste("y", x), data = in_sample))

# = Compute forecasts = #
forecasts = Reduce(cbind, lapply(models, function(x) predict(x, out_of_sample)))
colnames(forecasts) = paste("model", 1:ncol(forecasts), sep = "")

# RMSES (Mean of all models, model with all variables)
rmse(out_of_sample$y,rowMeans(forecasts),forecasts[,ncol(forecasts)])
## [1] 2.225175 2.322508

As you can see, the average forecast is more accurate than the model using all 8 variables. Now we are going to compute the MCS using the function below. If you want more details on the algorithm, just look at the original article here.

mcs=function(Loss,R,l){
  LbarB=tsboot(Loss,colMeans,R=R,sim="fixed",l=l)$t
  Lbar=colMeans(Loss)
  zeta.Bi=t(t(LbarB)-Lbar)
  save.res=c()
  for(j in 1:(ncol(Loss)-1)){
    Lbardot=mean(Lbar)
    zetadot=rowMeans(zeta.Bi)
    vard=colMeans((zeta.Bi-zetadot)^2)
    t.stat=(Lbar-Lbardot)/sqrt(vard)
    t.max=max(t.stat)
    model.t.max=which(t.stat==t.max)
    t.stat.b=t(t(zeta.Bi-zetadot)/sqrt(vard))
    t.max.b=apply(t.stat.b,1,max)
    p=length(which(t.max<t.max.b))/R
    save.res=c(save.res,p)
    names(save.res)[j]=names(model.t.max)
    Lbar=Lbar[-model.t.max]
    zeta.Bi=zeta.Bi[,-model.t.max]
  }
  save.res=c(save.res,1)
  names(save.res)[j+1]=names(Lbar)
  save.p=save.res
  for(i in 2:(length(save.res)-1)){
    save.p[i]=max(save.res[i-1],save.res[i])
  }
  aux=match(colnames(Loss),names(save.p))
  save.p=save.p[aux]
  save.res=save.res[aux]
  return(list(test=save.p,individual=save.res))
}

The MCS is computed in a single line of code, but first we are going to break the out-of-sample period into two sub-samples. We are going to compute the MCS in the first sub-sample and compare the models in the second sub-sample. The MCS function returns the p-value needed to include each model in the set. For example, if model1 has a p-value of 0.1, it will be included in the 90% confidence MCS.

# = generate 2 subsamples = #
out1 = out_of_sample[1:100, ]
out2 = out_of_sample[-c(1:100), ]
f1 = forecasts[1:100, ]
f2 = forecasts[-c(1:100),]

# = Compute MCS = #
set.seed(123) # = Seed for replication = #
MCS=mcs(Loss=(f1-out1$y)^2, R=1000, l=3)$test # = R block bootstrap samples with block length = 3 = #
# = If you want to see the p-values write print(sort(MCS)) = #

# = Selected models in the 50% MCS = #
selected=which(MCS>0.5)

# RMSE (Mean of all models, model with all variables, MCS models)
rmse(out2$y,rowMeans(f2),f2[,ncol(f2)],rowMeans(f2[,selected]))
## [1] 2.142932 2.240110 2.013547

As you can see, if we combine only the models in the MCS the RMSE becomes smaller. This is not always the case, but it happens a lot. The combined results are better when the forecasts do not have a very high positive correlation. The MCS is not only useful for combining forecasts. Even if the combinations are not better than individual models, you still have a confidence set that includes the best model with the desired significance. If some model is excluded from the confidence set you have strong evidence that this model is worse than the ones in the set (depending on the confidence level, of course). Naturally, if you increase the confidence level, the number of models in the set also increases, just like in confidence intervals.

Finally, we can find some interesting relationships in the plots below. The first plot shows the forecasting errors as a function of the confidence level. The optimal confidence level is around 55%. If we increase it further the MCS starts to select a lot of bad models and the forecast becomes less accurate. The second plot shows the number of models as a function of the confidence level and the third plot shows the errors as a function of the number of models.

alpha = seq(0,1,0.01)
nmodels = sapply(alpha, function(x) length(which(MCS >= x)))
errors = sapply(alpha, function(x) rmse(out2$y, rowMeans(as.matrix(f2[ , which(MCS >= x)]))))
df = data.frame(confidence = 1 - alpha, errors = errors, nmodels = nmodels)
dfmelt = melt(df, id.vars = "errors")

g1=ggplot(data=df) + geom_point(aes(x=confidence,y=errors))
g2=ggplot(data=df) + geom_point(aes(x=confidence,y=nmodels))
g3=ggplot(data=df) + geom_point(aes(x=nmodels,y=errors))
grid.arrange(g1,g2,g3,ncol=2)

References

Hansen, Peter R., Asger Lunde, and James M. Nason. “The model confidence set.” Econometrica 79.2 (2011): 453-497.



Unifying packages to help users access R repositories

Sun, 09/03/2017 - 15:16

(This article was first published on Blog: John C. Nash, and kindly contributed to R-bloggers)

With over 11,000 packages in the main CRAN (Comprehensive R Archive Network) collection, users of R face a challenge when looking for tools to carry out their statistical and computational tasks, especially in subject areas outside their main domain of expertise. Moreover, a lot of packages have similar or overlapping features. Some are, to be kind, not very well written, use methods which are outdated, or are simply inappropriate to a given situation.

In an attempt to address this, I joined with Julia Silge and Spencer Graves to organize a special session “Navigating the R package universe” at the UseR2017 meeting in Brussels. A lot of other people contributed to the session — see https://github.com/nashjc/Rnavpkg, and especially the wiki portion, https://github.com/nashjc/Rnavpkg/wiki. The presentation materials are there, along with links to various follow-on material such as notes and this and other blog posts. In particular, there is a general follow-up blog post at https://juliasilge.com/blog/navigating-packages/.

There are lots of ideas about how to help users find the right package, but we found that we could consider them as generally falling into three categories:

  • unification of packages that have similar goals or features (of which more in this posting)
  • guidance in the form of CRAN Task Views and similar infrastructure (see Julia’s posting https://juliasilge.com/blog/package-guidance/)
  • search tools for R packages (pending posting by Spencer Graves)

There are overlaps in these categories, and we are certain some ideas have been left out. Let us know!

Approaches to unification of packages

In the breakout session on unification at UseR2017, many ideas were suggested, sometimes in rapid-fire fashion. Alice Daish kindly took notes on her cell phone, but I will apologize in advance if I have missed anything important.

My own view of unification with regard to the package universe is to reduce the effective span of the number of “things” the user must consider. If the rule of thumb for presentations is to have no more than (fill in your favourite single digit integer) bullets per screen, surely a user should not have to face thousands of choices? Following this objective, we can consider:

  • developing wrappers that provide a common syntax or interface to multiple packages
  • finding ways to identify packages that should not generally be used so they can be archived out of the immediate or first-level view of users who are searching for tools
  • promoting collaboration between developers of similar packages with the aim of consolidating those packages
  • encouraging those working on guidance and search to provide filtering of their output to reduce the choice burden on users.

None of these approaches can be considered “easy”.

Wrappers

My own efforts at unifying packages have been in the field of optimization, particularly function minimization. With Ravi Varadhan, I developed the optimx package to allow a common interface to a number of base and package tools. It is an indicator of the difficulty of doing this well that I later refactored optimx to optimr to allow for easier maintenance of the package. Note that because these packages call other packages, failures in the called functions may give rise to error messages in the unifying package. To avoid too many CRAN alerts, optimrx can be found on R-forge in the optimizer project https://r-forge.r-project.org/projects/optimizer/ and can call more solvers than optimr. I believe that optimrx is a worthwhile step towards simplifying the user’s task in carrying out function minimization where some of the parameters can also be bounded or temporarily fixed. A similar wrapper for global optimizers has been written by Hans Werner Borchers. gloptim is so far only on R-forge in the optimizer project. As I mentioned, writing these codes takes a lot of effort. They are never truly finished because new solvers become available from time to time.
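
As a small illustration of the wrapper idea (a sketch, not taken from the packages' documentation), optimx lets you run several base and contributed minimizers through one call and compare the results:

library(optimx)
## Rosenbrock banana function
fr <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
res <- optimx(par = c(-1.2, 1), fn = fr,
              method = c("Nelder-Mead", "BFGS", "L-BFGS-B"))
res   # one row of results per method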

A sample of some other packages which unify a number of methods:

  • mlr: Machine Learning in R
  • caret: Classification And REgression Training for developing predictive models
  • Matrix: Matrix methods for dense and sparse matrices (What would we do without them?)

We could also consider the nls() function in the stats package that is distributed with base R to unify several methods for nonlinear least squares. Unfortunately, the default method and derivative approximation are quite often mis-applied by non-experts (and even some experts!). For the Marquardt stabilization, users must look in packages like minpack.lm and nlsr. Duncan Murdoch and I wrote the last package to allow for analytic derivatives of expressions in nonlinear modelling, but it would be nice to be able to use them in conjunction with separable models (essentially those with linear structure with some nonlinear terms). While nlsr has a wrapnls() function to call nls(), we don’t yet have a way to access minpack.lm or some other packages with relevant features. There is always more work to do!
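
To make the nonlinear least squares point concrete, here is a small sketch on simulated data (not from the post) comparing base nls() with the Marquardt-stabilized nlsLM() from minpack.lm:

library(minpack.lm)
set.seed(42)
d <- data.frame(x = 1:25)
d$y <- 2 * exp(-0.1 * d$x) + rnorm(25, sd = 0.05)
## base R Gauss-Newton
fit_gn <- nls(y ~ a * exp(-b * x), data = d, start = list(a = 1, b = 0.2))
## Levenberg-Marquardt via minpack.lm
fit_lm <- nlsLM(y ~ a * exp(-b * x), data = d, start = list(a = 1, b = 0.2))
coef(fit_gn)
coef(fit_lm)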

Archiving

During the breakout session on unification, there were a lot of comments about packages that should be merged, as well as a few about packages that should be abandoned. A subtext of the discussion that is relevant to R in general is that younger workers were quite clear that they felt they could not voice their opinions openly because some packages had been written by members of well-established groups who might provide employment opportunities or other perks. In this, the fact that two of the session organizers (JN and SG) are retired and the third (JS) does not work in academia was noted. We are relatively protected from backlash to honest commentary. Here I genuinely mean honest and not capriciously critical. We must recognize that many packages have been important in the life of R but now need renewal or replacement. I have been trying to get my own contributions to the optim() function deprecated for some time! However, there are workers who, I am sure, will for a variety of reasons be unwilling to discuss ways to reorganize the package repositories.

Ultimately, the users will vote with their feet (or rather mouse/touchpad). However, there will be a lot of wasted effort that does them and R no good whatever.

Filtered presentation

The CRAN repository is, at present, offered to users as a list alphabetically or by date on the different mirrors, (e.g., https://cloud.r-project.org/). While a way to see every package available is necessary, it is far from friendly to users, either novices or experts. Clearly it is feasible to restructure the list(s) for different audiences. Moreover, this restructuring or filtering need not be on the CRAN sites, but could be delegated to sites dealing with particular applications or methods. Here is a prime area of overlap with the guidance and search ideas for navigating the packages. In particular, it will be helpful if automated tools are developed that do the lion’s share of the work for this.

Another approach to filtering is to present R tools within a different framework. https://www.jamovi.org/ uses a spreadsheet paradigm. https://www.tidyverse.org/ collects tools that deal with data science and adds a streamlined command set. Unfortunately, there is a learning cost to using new frameworks like these. They simplify the use of R in one domain, but add to the general collection of packages and tools.

Directions

Unification of R packages with the aim of reducing user costs in finding the right tools is never going to be trivial. Moreover, it is not as directly rewarding as writing an (apparently) new package or method, and getting a paper published about it. On the other hand, making R infrastructure more accessible to users is key to the continued vitality of the overall resource and community.

John C. Nash     2017-9-3



Permutation Theory In Action

Sat, 09/02/2017 - 19:38

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the things Win-Vector LLC does, both in our consulting and training practices.

Let’s take a look at an example. Suppose our two orderings are o1 (the ordering Spark ML chooses) and o2 (the order the R user chooses).

set.seed(326346)
symbols <- letters[1:7]

o1 <- sample(symbols, length(symbols), replace = FALSE)
o1
## [1] "e" "a" "b" "f" "d" "c" "g"

o2 <- sample(symbols, length(symbols), replace = FALSE)
o2
## [1] "d" "g" "f" "e" "b" "c" "a"

To translate Spark results into R results we need a permutation that takes o1 to o2. The idea is: if we had a permutation that takes o1 to o2 we could use it to re-map predictions that are in o1 order to be predictions in o2 order.

To solve this we crack open our article on the algebra of permutations.

We are going to use the fact that the R command base::order(x) builds a permutation p such that x[p] is in order.
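
A two-line reminder of what that means:

x <- c("c", "a", "b")
p <- order(x)
x[p]
## [1] "a" "b" "c"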

Given this the solution is: we find permutations p1 and p2 such that o1[p1] is ordered and o2[p2] is ordered. Then build a permutation perm such that o1[perm] = (o1[p1])[inverse_permutation(p2)]. I.e., to get from o1 to o2 move o1 to sorted order and then move from the sorted order to o2‘s order (by using the reverse of the process that sorts o2). Again, the tools to solve this are in our article on the relation between permutations and indexing.

Below is the complete solution (including combining the two steps into a single permutation):

p1 <- order(o1)
p2 <- order(o2)

# invert p2
# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
p2inv <- seq_len(length(p2))
p2inv[p2] <- seq_len(length(p2))

(o1[p1])[p2inv]
## [1] "d" "g" "f" "e" "b" "c" "a"

# composition rule: (o1[p1])[p2inv] == o1[p1[p2inv]]
# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
perm <- p1[p2inv]
o1[perm]
## [1] "d" "g" "f" "e" "b" "c" "a"

The equivalence "(o1[p1])[p2inv] == o1[p1[p2inv]]" is frankly magic (though it also quickly follows "by definition"), and studying it is the topic of our original article on permutations.

The above application is a good example of why it is nice to have a little theory worked out, even before you think you need it.



Pair Programming Statistical Analyses

Sat, 09/02/2017 - 00:00

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract

Control calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results and illustrate its use by a case study with two statisticians ping-ponging an analysis using dplyr and SQL, respectively.



Clip art is based on work by Gan Khoon Lay available under a CC BY 3.0 US license.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from .

Introduction

If you are a statistician working in climate science, data driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses, where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or have a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g. version control and unit tests, provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames and we shall illustrate its use by automatizing the comparison step of the control calculation ping-pong process.

The Case Story

Ada and Bob have to calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. The present exposition will for the sake of convenience ignore the complex survey part of the problem and just pretend that the sample corresponds to the population (complete inventory count).

The Data

After the questionnaire phase, the following data are available to Ada and Bob.

Id  Version  Region  Sales Volume  Staff Costs  People
01        1       A         23000        10003       5
02        1       B         12200         7200       1
03        1      NA         19500         7609       2
04        1       A         17500        13000       3
05        1       B        119500        90000      NA
05        2       B        119500        95691      19
06        1       B            NA        19990       6
07        1       A         19123        20100       8
08        1       D         25000          100      NA
09        1       D         45860        32555       9
10        1       E         33020        25010       7

Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

  • the overall number of fitness centres
  • total sales volume
  • total staff cost
  • total number of people employed in fitness centres

However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

Ada

Ada loves the tidyverse and in particular dplyr. This is her solution:

ada <- fitness %>%
  na.omit() %>%
  group_by(Region, Id) %>%
  top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People))
ada
## # A tibble: 4 x 5
##   Region NoOfUnits `Sales Volume` `Staff Costs` People
## 1      A         3          59623         43103     16
## 2      B         1         119500         95691     19
## 3      D         1          45860         32555      9
## 4      E         1          33020         25010      7

Bob

Bob has a dark past as a relational database management system (RDBMS) developer and, hence, has only recently experienced the joys of R. He therefore chooses a no-SQL-server-necessary SQLite-within-R approach. The hope is that in a big data situation this might be a little speedier than base R:

library(RSQLite)

## Create ad-hoc database
db <- dbConnect(SQLite(), dbname = file.path(filePath, "Test.sqlite"))
## Move fitness data into the ad-hoc DB
dbWriteTable(conn = db, name = "fitness", fitness, overwrite = TRUE, row.names = FALSE)
## Query using SQL
bob <- dbGetQuery(db, "
  SELECT Region,
         COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM fitness
  WHERE [Sales Volume] > 17500
  GROUP BY Region
")
bob
##   Region NoOfUnits Sales Volume Staff Costs People
## 1                1        19500        7609      2
## 2      A         2        42123       30103     13
## 3      B         2       239000      185691     19
## 4      D         2        70860       32655      9
## 5      E         1        33020       25010      7

Update: An alternative approach with less syntactic overhead would have been the sqldf package, which has a standard SQLite backend and automagically handles the import of the data.frame into the DB using the RSQLite pkg.

## Load package
suppressPackageStartupMessages(library(sqldf))
## Ensure SQLite backend.
options(sqldf.driver = "SQLite")
## Same query as before
bob <- sqldf("
  SELECT Region,
         COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM fitness
  WHERE [Sales Volume] > 17500
  GROUP BY Region
")

Even shorter is the direct use of SQL chunks in knitr, using the variable db as the connection and the chunk option output.var="bob":

SELECT Region,
       COUNT(*) As NoOfUnits,
       SUM([Sales Volume]) As [Sales Volume],
       SUM([Staff Costs]) AS [Staff Costs],
       SUM(People) AS People
FROM fitness
WHERE [Sales Volume] > 17500
GROUP BY Region

The Ping-Pong Phase

After Ada and Bob each have a result, they compare their resulting data.frame using the daff package, which was recently presented by @edwindjonge at the useR in Brussels.

library(daff)
diff <- diff_data(ada, bob)
diff$get_data()
##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++             1           19500          7609       2
## 2  ->      A      3->2   59623->42123  43103->30103  16->13
## 3  ->      B      1->2 119500->239000 95691->185691      19
## 4  ->      D      1->2   45860->70860  32555->32655       9
## 5          E         1          33020         25010       7

After Ada’s and Bob’s first serves, the two realize that their results only agree for region E.

Note: Currently, daff has the semi-annoying feature of not being able to show all the diffs when printing, but just n lines of the head and tail. As a consequence, for the purpose of this post, we overwrite the printing function such that it always shows all rows with differences.

## Small helper function for better printing
print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")
diff %>% print()
##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 +++             1           19500          7609       2
## 2  ->      A      3->2   59623->42123  43103->30103  16->13
## 3  ->      B      1->2 119500->239000 95691->185691      19
## 4  ->      D      1->2   45860->70860  32555->32655       9

The two decide to first agree on the number of units per region.

diff$get_data() %>% filter(`@@` != "") %>% select(`@@`, Region, NoOfUnits) ## @@ Region NoOfUnits ## 1 +++ 1 ## 2 -> A 3->2 ## 3 -> B 1->2 ## 4 -> D 1->2

One obvious reason for the discrepancies appears to be that Bob has results for an extra region. Therefore, Bob takes another look at his management of missing values and decides to improve his code by:

Pong Bob

SELECT Region,
       COUNT(*) As NoOfUnits,
       SUM([Sales Volume]) As [Sales Volume],
       SUM([Staff Costs]) AS [Staff Costs],
       SUM(People) AS People
FROM fitness
WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
GROUP BY Region

##   @@ Region NoOfUnits   Sales Volume   Staff Costs People
## 1 ->       A      3->2   59623->42123  43103->30103  16->13
## 2 ->       B      1->2 119500->239000 95691->185691      19
## 3 ->       D      1->2   45860->70860  32555->32655       9

Ping Bob

Better. Now the NA region is gone, but still quite some differences remain. Note: You may at this point want to stop reading and try yourself to fix the analysis – the data and code are available from the github repository.

Pong Bob

Now Bob notices that he forgot to handle the duplicate records and apparently misunderstood the exact definition of the €17,500 exclusion limit. His massaged SQL query looks as follows:

SELECT Region,
       COUNT(*) As NoOfUnits,
       SUM([Sales Volume]) As [Sales Volume],
       SUM([Staff Costs]) AS [Staff Costs],
       SUM(People) AS People
FROM (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People
      FROM fitness
      GROUP BY Id)
WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
GROUP BY Region

##   @@ Region NoOfUnits Sales Volume  Staff Costs People
## 1 ...    ...       ...          ...          ...    ...
## 2  ->      D      1->2 45860->70860 32555->32655      9

Ping Ada

Comparing with Ada, Bob is sort of envious that she was able to just use dplyr’s group_by and top_n functions. However, daff shows that there still is one difference left. By looking more carefully at Ada’s code it becomes clear that she accidentally leaves out one unit in region D. The reason is the too liberal use of na.omit, which also removes the one entry with an NA in one of the not-so-important columns. However, they discuss whether one really wants to include partial records or not, because the summation in the different columns is then over a different number of units. After consulting the standard operating procedure (SOP) for these kinds of surveys they decide to include the observation where possible. Here is Ada’s modified code:

ada2 <- fitness %>% filter(!is.na(Region)) %>%
  group_by(Region, Id) %>% top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People))
(diff_final <- diff_data(ada2, bob3)) %>% print()
## @@ Region NoOfUnits ... Staff Costs People
## 1 ... ... ... ... ... ...
## 2 -> D 2 ... 32655 NA->9

Pong Ada

Oops, Ada forgot to take care of the NA in the summation:

ada3 <- fitness %>% filter(!is.na(Region)) %>%
  group_by(Region, Id) %>% top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People, na.rm = TRUE))
diff_final <- diff_data(ada3, bob3)

## Check if the results really are the same
length(diff_final$get_data()) == 0
## [1] TRUE

Finally, their results agree, they move on to production, and the results are published in a nice report.

Conclusion

As shown, the ping-pong game is quite manual and becomes particularly annoying if at some point someone steps into the office with a statement like “Btw, I found some extra questionnaires, which need to be added to the analysis asap.” However, the two now aligned analysis scripts and the corresponding daff overlay could be put into a script that is triggered every time the data change. In case new discrepancies emerge, i.e. length(diff$get_data()) > 0, the two could then be informed automatically.
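A minimal sketch of such a trigger script could look as follows; the file names ada_analysis.R and bob_analysis.R are placeholders, and the only assumption is that the two aligned scripts produce the objects ada3 and bob3 as above.

library(daff)

check_proof_calculation <- function() {
  ## Each aligned script is assumed to leave its aggregated result
  ## (ada3 resp. bob3) in the calling environment; file names are placeholders.
  source("ada_analysis.R", local = TRUE)
  source("bob_analysis.R", local = TRUE)
  diff <- diff_data(ada3, bob3)
  if (length(diff$get_data()) > 0) {
    ## New discrepancies: notify Ada and Bob, e.g. by e-mail or by letting
    ## a scheduled CI job fail, and start another ping-pong round.
    warning("Ada and Bob disagree again - time for another ping-pong round.")
    print(diff)
  } else {
    message("Results agree - ready for production.")
  }
  invisible(diff)
}

Hooking such a function into a scheduler (e.g. cron or a CI pipeline) that runs whenever the data change would provide the automatic notification described above.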

Question 1: Now that the two get the same results, do you agree with them?

## # A tibble: 4 x 5
## Region NoOfUnits `Sales Volume` `Staff Costs` People
##
## 1 A 3 59623 43103 16
## 2 B 1 119500 95691 19
## 3 D 2 70860 32655 9
## 4 E 1 33020 25010 7

Question 2: Are you aware of any other good ways and tools to structure and automate such a process? If so, please share your experiences as a Disqus comment below. Control calculations appear to be quite common, but little structured code support seems to be available for such processes.



Photo is copyright Johnathan J. Stegeman under the GNU Free Documentation License, version 1.2.


To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice.... R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Practical Data Science for Stats

Fri, 09/01/2017 - 19:05

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

PeerJ Preprints has recently published a collection of articles that focus on the practical side of statistical analysis: Practical Data Science for Stats. While the articles are not peer-reviewed, they have been selected and edited by Jennifer Bryan and Hadley Wickham, both well-respected members of the R community. And while the articles provide great advice for any data scientist, the content does heavily feature the use of R, so it's particularly useful to R users.

There are 16 articles in the collection (with possibly more to come?). Here are just a few examples that caught my eye:

There's lots more to explore in the collection as well, including case studies on using R at the likes of AirBnB and the New York Mets. Check out the entire collection at the link below.

PeerJ Collections: Practical Data Science for Stats (via Jenny Bryan)


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Summarizing dataset to apply machine learning – exercises

Fri, 09/01/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

Dear reader,

If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.

This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.

Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Finally, check the solutions to verify your answers.

Exercise 1

Create a list of 80% of the rows in the original dataset to use for training. HINT: Use createDataPartition().

Exercise 2

Select 20% of the data for validation.

Exercise 3

Use the remaining 80% of data to train and test the models.

Exercise 4

Find the dimensions of the “iris” dataset. HINT: Use dim().

Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to:

  • Create a machine learning algorithm from a beginner point of view
  • Quickly dive into more advanced methods at an accessible pace and with more explanations
  • And much more

This course shows a complete workflow from start to finish. It is a great introduction, and a useful fallback when you already have some experience.

Exercise 5

Find the type of each attribute in your dataset. HINT: Use sapply().

Exercise 6

Take a look at the first 5 rows of your dataset. HINT: Use head().

Exercise 7

Find the levels of the variable “Species.” HINT: Use levels().

Exercise 8

Find the percentages of rows that belong to the labels you found in Exercise 7. HINT: Use prop.table() and table().

Exercise 9

Display the absolute count of instances for each class as well as its percentage. HINT: Use cbind().

Exercise 10

Display the summary of the “iris” dataset. HINT: Use summary().
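For readers who want to sanity-check their general approach before looking at the linked solutions, a minimal sketch along the lines of these exercises could look as follows; it assumes the caret package and the built-in iris dataset, and the object names are illustrative only.

library(caret)

## Exercises 1-3: an 80/20 split of iris
set.seed(7)
train_index <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
validation  <- iris[-train_index, ]  # 20% held back for validation
dataset     <- iris[train_index, ]   # 80% used to train and test the models

## Exercises 4-10: quick summaries of the training data
dim(dataset)
sapply(dataset, class)
head(dataset, 5)
levels(dataset$Species)
prop.table(table(dataset$Species)) * 100
cbind(freq = table(dataset$Species),
      percentage = prop.table(table(dataset$Species)) * 100)
summary(dataset)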

Related exercise sets:
  1. How to prepare and apply machine learning to your dataset
  2. Building Shiny App exercises part 4
  3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-5)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

How rOpenSci uses Code Review to Promote Reproducible Science

Fri, 09/01/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

This is cross-posted from the NumFOCUS blog. NumFOCUS is a nonprofit that supports and promotes world-class, innovative, open source scientific computing, including rOpenSci.


At rOpenSci, we create and curate software to help scientists with the data life cycle. These tools access, download, manage, and archive scientific data in open, reproducible ways. Early on, we realized this could only be a community effort. The variety of scientific data and workflows could only be tackled by drawing on contributions of scientists with field-specific expertise.

With the community approach came challenges. How could we ensure the quality of code written by scientists without formal training in software development practices? How could we drive adoption of best practices among our contributors? How could we create a community that would support each other in this work?

We have had great success addressing these challenges via peer review. We draw elements from a process familiar to our target community – academic peer review – and a practice from the software development world – production code review – to create a system that fosters software quality, ongoing education, and community development.

An Open Review Process

Production software review occurs within software development teams, open source or not. Contributions to a software project are reviewed by one or more other team members before incorporation into project source code. Contributions are typically small patches, and review serves as a check on quality, as well as an opportunity for training in team standards.

In academic peer review, external reviewers critique a complete product – usually a manuscript – with a very broad mandate to address any areas they see as deficient. Academic review is often anonymous and passing through it gives the product, and the author, a public mark of validation.

We blend these approaches. In our process, authors submit complete R packages to rOpenSci. Editors check that packages fit into our project’s scope, run a series of automated tests to ensure a baseline of code quality and completeness, and then assign two independent reviewers. Reviewers comment on usability, quality, and style of software code as well as documentation. Authors make changes in response, and once reviewers are satisfied with the updates, the package receives a badge of approval and joins our suite.

This process is quite iterative. After reviewers post a first round of extensive reviews, authors and reviewers chat in an informal back-and-forth, only lightly moderated by an editor. This lets both reviewers and authors pose questions of each other and explain differences of opinion. It can proceed at a much faster pace than typical academic review correspondence. We use the GitHub issues system for this conversation, and responses take minutes to days, rather than weeks to months.

The exchange is also open and public. Authors, reviewers, and editors all know each other’s identities. The broader community can view or even participate in the conversation as it happens. This provides an incentive to be thorough and provide non-adversarial, constructive reviews. Both authors and reviewers report that they enjoy and learn more from this open and direct exchange. It also has the benefit of building community. Participants have the opportunity to meaningfully network with new peers, and new collaborations have emerged via ideas spawned during the review process.

We are aware that open systems can have drawbacks. For instance, in traditional academic review, double-blind peer review can increase representation of female authors, suggesting bias in non-blind reviews. It is also possible reviewers are less critical in open review. However, we posit that the openness of the review conversation provides a check on review quality and bias; it’s harder to inject unsupported or subjective comments in public and without the cover of anonymity. Ultimately, we believe the ability of authors and reviewers to have direct but public communication improves quality and fairness.

Guidance and Standards

rOpenSci provides guidance on reviewing. This falls into two main categories: high-level best practices and low-level standards. High-level best practices are general and broadly applicable across languages and applications. These are practices such as “Write re-usable functions rather than repeating the same code,” “Test edge cases,” or “Write documentation for all of your functions.” Because of their general nature, these can be drawn from other sources and not developed from scratch. Our best practices are based on guidance originally developed by Mozilla Science Lab.

Low-level standards are specific to a language (in our case, R), applications (data interfaces) and user base (researchers). These include specific items such as naming conventions for functions, best choices of dependencies for certain tasks, and adherence to a code style guide. We have an extensive set of standards for our reviewers to check. These change over time as the R software ecosystem evolves, best practices change, and tooling and educational resources make new methods available to developers.

Our standards also change based on feedback from reviewers. We adopt into our standards suggestions that emerge from multiple reviewers across different packages. Many of these, we’ve found, have to do with the ease-of-use and consistency of software APIs, and with the type and location of information in documentation that makes it easiest to find. This highlights one of the advantages of external reviewers – they can provide a fresh perspective on usability, as well as test software under different use-cases than those imagined by the author.

As our standards have become more extensive, we have come to rely more on automated tools. The R ecosystem, like most languages, has a suite of tools for code linting, function testing, static code analysis and continuous integration. We require authors to use these, and editors run submissions through a suite of tests prior to sending them for review. This frees reviewers from the burden of low-level tasks to focus on high-level critiques where they can add the most value.
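To make this concrete, a sketch of the kind of automated checks involved might look like the following; these are common R ecosystem tools rather than a definitive description of rOpenSci's editorial toolchain.

## Run from the root of the package being submitted (illustrative only)
library(devtools)
library(lintr)
library(covr)

devtools::check()          # R CMD check: documentation, examples, unit tests
lintr::lint_package()      # static analysis / style linting
covr::package_coverage()   # how much of the code the test suite exercises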

The Reviewer Community

One of the core challenges and rewards of our work has been developing a community of reviewers.

Reviewing is a high-skill activity. Reviewers need expertise in the programming methods used in a software package and also the scientific field of its application. (“Find me someone who knows sensory ecology and sparse data structures!”) They need good communications skills and the time and willingness to volunteer. Thankfully, the open-science and open-source worlds are filled with generous, expert people. We have been able to expand our reviewer pool as the pace of submissions and the domains of their applications have grown.

Developing the reviewer pool requires constant recruitment. Our editors actively and broadly engage with developer and research communities to find new reviewers. We recruit from authors of previous submissions, co-workers and colleagues, at conferences, through our other collaborative work and on social media. In the open-source software ecosystem, one can often identify people with particular expertise by looking at their published software or contribution to other projects, and we often will cold-email potential reviewers whose published work suggests they would be a good match for a submission.

We cultivate our reviewer pool as well as expand it. We bring back reviewers so that they may develop reviewing as a skill, but not so often as to overburden them. We provide guidance and feedback to new recruits. When assigning reviewers to a submission, we aim to pair experienced reviewers with new ones, or reviewers with expertise on a package’s programming methods with those experienced in its field of application. These reviewers learn from each other, and diversity in perspectives is an advantage; less experienced developers often provide insight that more experienced ones do not on software usability, API design, and documentation. More experienced developers will more often identify inefficiencies in code, pitfalls due to edge-cases, or suggest alternate implementation approaches.

Expanding Peer Review for Code

Code review has been one of rOpenSci’s best initiatives. We build software, build skills, and build community, and the peer review process has been a major catalyst for all three. It has made both the software we develop internally and that we accept from outside contributors more reliable, usable, and maintainable. So we are working to promote open peer review of code by more organizations working with scientific software. We helped launch The Journal of Open Source Software, which uses a version of our review process to provide a developer-friendly publication venue. JOSS’s success has led to a spin-off, the Journal of Open Source Education, which uses an open, code-review-like processes to provide feedback on curricula and educational materials. We are also developing a pilot program where software papers submitted to a partner academic journal receive a badge for going through rOpenSci review. We are encouraged by other review initiatives like ReScience and The Programming Historian. BioConductor’s code reviews, which predate ours by several years, recently switched to an open model.

If your organization is developing or curating scientific code, we believe code review, implemented well, can be a great benefit. It can take considerable effort to begin, but here are some of the key lessons we’ve learned that can help:

  • Develop standards and guidelines for your authors and reviewers to use. Borrow these freely from other projects (feel free to use ours), and modify them iteratively as you go.
  • Use automated tools such as code linters, test suites, and test coverage measures to reduce burden on authors, reviewers, and editors as much as possible.
  • Have a clear scope. Spell out to yourselves and potential contributors what your project will accept, and why. This will save a lot of confusion and effort in the future.
  • Build a community through incentives of networking, opportunities to learn, and kindness.

rOpenSci is eager to work with other groups interested in developing similar review processes, especially if you are interested in reviewing and curating scientific software in languages other than R or beyond our scope of supporting the data life cycle. Software that implements statistical algorithms, for instance, is an area ripe for open peer review of code. Please get in touch if you have questions or wish to co-pilot a review system for your project.

And of course, if you want to review, we’re always looking for volunteers. Sign up at https://ropensci.org/onboarding.

You can support rOpenSci by becoming a NumFOCUS member or making a donation to the project.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RcppAnnoy 0.0.9

Fri, 09/01/2017 - 03:20

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version, 0.0.9, of RcppAnnoy, our Rcpp-based R integration of the nifty Annoy library by Erik Bernhardsson, is now on CRAN. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours.

This release corrects an issue for Windows users discovered by GitHub user ‘khoran’, who also suggested the fix of using binary mode. It upgrades to Annoy release 1.9.1 and brings its new Manhattan distance to RcppAnnoy. A number of unit tests were added as well, and we updated some packaging internals such as symbol registration.
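A minimal usage sketch of the new distance is shown below; it assumes the Manhattan index is exposed as an AnnoyManhattan module, mirroring the existing AnnoyEuclidean and AnnoyAngular ones.

library(RcppAnnoy)

d <- 3                                        # dimensionality of the vectors
a <- new(AnnoyManhattan, d)                   # assumed module name for the new distance
set.seed(42)
for (i in 1:100) a$addItem(i - 1, runif(d))   # items are indexed from 0
a$build(10)                                   # build the index with 10 trees
a$getNNsByItem(0, 5)                          # 5 approximate nearest neighbours of item 0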

And I presume I had a good streak emailing with Uwe’s robots as the package made it onto CRAN rather smoothly within ten minutes of submission:

 

Changes in this version are summarized here:

Changes in version 0.0.9 (2017-08-31)
  • Synchronized with Annoy upstream version 1.9.1

  • Minor updates in calls and tests as required by annoy 1.9.1

  • New Manhattan distance modules along with unit test code

  • Additional unit tests from upstream test code carried over

  • Binary mode is used for save (as suggested by @khoran in #21)

  • A new file init.c was added with calls to R_registerRoutines() and R_useDynamicSymbols()

  • Symbol registration is enabled in useDynLib

Courtesy of CRANberries, there is also a diffstat report for this release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The Proof-Calculation Ping Pong

Fri, 09/01/2017 - 00:00
Abstract

The proof-calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results.



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog post is available under a GNU General Public License (GPL v3) license from the github repository.

Introduction

If you are a statistician working in climate science, data-driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or have a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g., version control and unit tests provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames, and we shall illustrate its use by automating the comparison step of the statistical proof-calculation ping-pong process.

Case Story

Ada and Bob have to proof-calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. For this post we will ignore the complex survey part of the problem and just pretend that our sample corresponds to the population (complete inventory count). After the questionnaire phase, the following data are available to Ada and Bob.

Id Version Region Sales Volume Staff Costs People
01       1      A        23000       10003      5
02       1      B        12200        7200      1
03       1     NA        19500        7609      2
04       1      A        17500       13000      3
05       1      B       119500       90000     NA
05       2      B       119500       95691     19
06       1      B           NA       19990      6
07       1      A        19123       20100      8
08       1      D        25000         100     NA
09       1      D        45860       32555      9
10       1      E        33020       25010      7

Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

  • the overall number of fitness centres
  • total sales volume
  • total staff cost
  • total number of people employed in fitness centres

However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

Ada

Ada loves the tidyverse and in particular dplyr. This is her solution:

ada <- fitness %>% na.omit() %>%
  group_by(Region, Id) %>% top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People))
ada
## # A tibble: 4 x 5
## Region NoOfUnits `Sales Volume` `Staff Costs` People
##
## 1 A 3 59623 43103 16
## 2 B 1 119500 95691 19
## 3 D 1 45860 32555 9
## 4 E 1 33020 25010 7

Bob

Bob has a dark past as a database developer and hence only recently experienced the joys of R. He therefore chooses a no-SQL-server-necessary SQLite-within-R approach:

library(RSQLite)
## Create ad-hoc database
db <- dbConnect(SQLite(), dbname = file.path(filePath, "Test.sqlite"))
## Move fitness data into the ad-hoc DB
dbWriteTable(conn = db, name = "FITNESS", fitness, overwrite = TRUE, row.names = FALSE)
## Query using SQL
bob <- dbGetQuery(db, "
  SELECT Region, COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM FITNESS
  WHERE [Sales Volume] > 17500
  GROUP BY Region
")
bob
## Region NoOfUnits Sales Volume Staff Costs People
## 1 1 19500 7609 2
## 2 A 2 42123 30103 13
## 3 B 2 239000 185691 19
## 4 D 2 70860 32655 9
## 5 E 1 33020 25010 7

The Ping-Pong phase

After Ada and Bob each have a result, they compare their resulting data.frames using the daff package, which was recently presented by @edwindjonge at the useR! conference in Brussels.

library(daff)
diff <- diff_data(ada, bob)
diff$get_data()
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 +++ 1 19500 7609 2
## 2 -> A 3->2 59623->42123 43103->30103 16->13
## 3 -> B 1->2 119500->239000 95691->185691 19
## 4 -> D 1->2 45860->70860 32555->32655 9
## 5 E 1 33020 25010 7

After Ada’s and Bob’s serves, the two realize that their results agree only for one region (‘E’). Note: Currently, daff has the semi-annoying feature of not being able to show all of the diffs when printing, but only n lines of the head and tail. As a consequence, for the purpose of this post, we overwrite the printing function such that it always shows all rows with differences.

print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")
print(diff)
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 +++ 1 19500 7609 2
## 2 -> A 3->2 59623->42123 43103->30103 16->13
## 3 -> B 1->2 119500->239000 95691->185691 19
## 4 -> D 1->2 45860->70860 32555->32655 9

The two decide to first focus on agreeing on the number of units per region.

## @@ Region NoOfUnits
## 1 +++ 1
## 2 -> A 3->2
## 3 -> B 1->2
## 4 -> D 1->2

One obvious reason for the discrepancies appears to be that Bob has results for an extra region. Therefore, Bob takes another look at his management of missing values and decides to improve his code by:

Pong Bob

bob2 <- dbGetQuery(db, "
  SELECT Region, COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM FITNESS
  WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
  GROUP BY Region
")
diff2 <- diff_data(ada, bob2, ordered = FALSE, count_like_a_spreadsheet = FALSE)
diff2 %>% print()
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 -> A 3->2 59623->42123 43103->30103 16->13
## 2 -> B 1->2 119500->239000 95691->185691 19
## 3 -> D 1->2 45860->70860 32555->32655 9

Ping Bob

Better. Now the NA region is gone, but quite a few differences still remain. Note: You may at this point want to stop reading and try to fix the analysis yourself – the data and code are available from the github repository.

Pong Bob

Now Bob notices that he forgot to handle the duplicate records and massages the SQL query as follows:

bob3 <- dbGetQuery(db, "
  SELECT Region, COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People
        FROM FITNESS GROUP BY Id)
  WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
  GROUP BY Region
")
diff3 <- diff_data(ada, bob3, ordered = FALSE, count_like_a_spreadsheet = FALSE)
diff3 %>% print()
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 ... ... ... ... ... ...
## 2 -> D 1->2 45860->70860 32555->32655 9

Ping Ada

Comparing with Ada, Bob is sort of envious that she was able to just use dplyr‘s group_by and top_n functions. However, daff shows that there is still one difference left. By looking more carefully at Ada’s code it becomes clear that she accidentally leaves out one unit in region D. The reason is her too liberal use of na.omit, which also removes the one entry with an NA in one of the less important columns. The two then discuss whether one really wants to include such partial records at all, because the sums in the different columns are then taken over different numbers of units. After consulting the standard operating procedure (SOP) for this kind of survey, they decide to include the observation where possible. Here is Ada’s modified code:

ada2 <- fitness %>% filter(!is.na(Region)) %>%
  group_by(Region, Id) %>% top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People))
(diff_final <- diff_data(ada2, bob3)) %>% print()
## @@ Region NoOfUnits ... Staff Costs People
## 1 ... ... ... ... ... ...
## 2 -> D 2 ... 32655 NA->9

Pong Ada

Oops, forgot to take care of the NA in the summation:

ada3 <- fitness %>% filter(!is.na(Region)) %>%
  group_by(Region, Id) %>% top_n(1, Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits` = n(),
            `Sales Volume` = sum(`Sales Volume`),
            `Staff Costs` = sum(`Staff Costs`),
            People = sum(People, na.rm = TRUE))
diff_final <- diff_data(ada3, bob3)
length(diff_final$get_data()) == 0
## [1] TRUE

Conclusion

Finally, their results agree, they move on to production, and the results are published in a nice report.

Question 1: Do you agree with their results?

ada3
## # A tibble: 4 x 5
## Region NoOfUnits `Sales Volume` `Staff Costs` People
##
## 1 A 3 59623 43103 16
## 2 B 1 119500 95691 19
## 3 D 2 70860 32655 9
## 4 E 1 33020 25010 7

As shown, the ping-pong game is quite manual and becomes particularly annoying if at some point someone steps into the office with a statement like “Btw, I found some extra questionnaires, which need to be added to the analysis asap.” However, the two now aligned analysis scripts and the corresponding daff overlay could be put into a script that is triggered every time the data change. In case new discrepancies emerge, i.e. length(diff$get_data()) > 0, the two could then be informed automatically.

Question 2: Are you aware of any other good ways and tools to structure and automate such a process? If so, please share your experiences as a Disqus comment below.


Multiplicative Congruential Generators in R

Thu, 08/31/2017 - 20:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 2 of 2 in the series Random Number Generation

Multiplicative congruential generators, also known as Lehmer random number generators, are a type of linear congruential generator for generating pseudorandom numbers in (0, 1). The multiplicative congruential generator, often abbreviated as MLCG or MCG, is defined as a recurrence relation similar to the LCG with c = 0:

X_{i+1} = a * X_i mod m

Unlike the LCG, the parameters a and m for multiplicative congruential generators are more restricted, and the initial seed X_0 must be relatively prime to the modulus m (the greatest common divisor between X_0 and m is 1). The parameters long in common use are m = 2^31 - 1 = 2,147,483,647 and a = 7^5 = 16807. However, in a correspondence in the Communications of the ACM, Park, Miller and Stockmeyer changed the value of the parameter a, stating:

The minimal standard Lehmer generator we advocated had a modulus of m = 2^31 – 1 and a multiplier of a = 16807. Relative to this particular choice of multiplier, we wrote “… if this paper were to be written again in a few years it is quite possible that we would advocate a different multiplier ….” We are now prepared to do so. That is, we now advocate a = 48271 and, indeed, have done so “officially” since July 1990. This new advocacy is consistent with the discussion on page 1198 of [10]. There is nothing wrong with 16807; we now believe, however, that 48271 is a little better (with q = 44488, r = 3399).

Multiplicative Congruential Generators with Schrage’s Method

When using a large prime modulus such as m = 2^31 - 1, the multiplicative congruential generator can overflow. Schrage’s method was invented to overcome the possibility of overflow, and it applies when a * (m mod a) < m. We can check that the parameters in use satisfy this condition:

a <- 48271
m <- 2 ** 31 - 1
a * (m %% a) < m
## [1] TRUE

Schrage’s method restates the modulus m as the decomposition m = a * q + r, where r = m mod a and q = m div a (the integer quotient).
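For the parameters above, this decomposition is easy to verify in R; the quick check below reproduces the q and r values quoted by Park, Miller and Stockmeyer.

m <- 2147483647   # 2^31 - 1
a <- 48271
q <- m %/% a      # integer quotient
r <- m %% a       # remainder
c(q = q, r = r)
##     q     r
## 44488  3399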

Multiplicative Congruential Generator in R

We can implement a Lehmer random number generator in R using the parameters mentioned earlier.

lehmer.rng <- function(n = 10) {
  rng <- vector(length = n)
  m <- 2147483647
  a <- 48271
  q <- 44488
  r <- 3399

  # Set the seed using the current system time in microseconds.
  # The initial seed value must be coprime to the modulus m,
  # which we are not really concerning ourselves with for this example.
  d <- as.numeric(Sys.time())

  for (i in 1:n) {
    h <- d / q
    l <- d %% q
    t <- a * l - r * h
    if (t < 0) {
      d <- t
    } else {
      d <- t + m
    }
    rng[i] <- d / m
  }
  return(rng)
}

# Print the first 10 randomly generated numbers
lehmer.rng()
## [1] 0.68635675 0.12657390 0.84869106 0.16614698 0.08108171 0.89533896
## [7] 0.90708773 0.03195725 0.60847522 0.70736551

Plotting our multiplicative congruential generator in three dimensions allows us to visualize the apparent ‘randomness’ of the generator. As before, we generate three random vectors with our Lehmer RNG function and plot the points. The plot3D package is used to create the scatterplot and the animation package is used to animate each scatterplot as the length of the random vectors, n, increases.

library(plot3D)
library(animation)

n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)

saveGIF({
  for (i in 1:length(n)) {
    x <- lehmer.rng(n[i])
    y <- lehmer.rng(n[i])
    z <- lehmer.rng(n[i])
    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20,
              main = paste('n = ', n[i]))
  }
}, movie.name = 'lehmer.gif')

The generator appears to be producing suitably random numbers, as demonstrated by the increasing swarm of points as n increases.

References

Anne Gille-Genest (March 1, 2012). Implementation of the Pseudo-Random Number Generators and the Low Discrepancy Sequences.

Saucier, R. (2000). Computer Generation of Statistical Distributions (1st ed.). Aberdeen, MD. Army Research Lab.

Stephen K. Park; Keith W. Miller; Paul K. Stockmeyer (1993). “Technical Correspondence”. Communications of the ACM. 36 (7): 105–110.

The post Multiplicative Congruential Generators in R appeared first on Aaron Schlegel.


To leave a comment for the author, please follow the link and comment on their blog: R – Aaron Schlegel. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Probability functions intermediate

Thu, 08/31/2017 - 18:57

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises, we are going to explore some of the probability functions in R through practical applications. Basic probability knowledge is required. In case you are not familiar with the function apply, check the R documentation.

Note: We are going to use random number functions and random process functions in R, such as runif. A problem with these functions is that every time you run them, you will obtain a different value. To make your results reproducible, you can specify the value of the seed using set.seed(‘any number’) before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random number process.) For this set of exercises, we will use set.seed(1). Don’t forget to specify it before every exercise that includes random numbers.

 

 

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generating dice rolls: set your seed to 1 and generate 30 random numbers using runif. Save them in an object called random_numbers. Then use the ceiling function to round the values up. These values represent rolled dice values.

Exercise 2

Simulate one dice roll using the function rmultinom. Make sure n = 1 is inside the function, and save it in an object called die_result. The matrix die_result is a collection of 1 one and 5 zeros, with the one indicating which value was obtained during the process. Use the function which to create an output that shows only the value obtained after the die is rolled.

Exercise 3

Using rmultinom, simulate 30 dice rolls. Save them in a variable called dice_result and use apply to transform the matrix into a vector with the result of each die.

Exercise 4

Some gambling games use 2 dice; after being rolled, their values are summed. Simulate throwing 2 dice 30 times and record the sum of the values of each pair. Use rmultinom to simulate throwing 2 dice 30 times, and use the function apply to record the sum of the values of each experiment.
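For readers who want to sanity-check their approach to the dice exercises before looking at the linked answers, one possible sketch is shown below; the object names follow the exercise text.

set.seed(1)
## Exercise 2: one roll of a fair die - rmultinom returns a 6 x 1 matrix of counts
die_result <- rmultinom(n = 1, size = 1, prob = rep(1/6, 6))
which(die_result == 1)   # the face obtained

## Exercises 3-4: 30 throws of two dice, summing the face values of each pair
dice_result <- rmultinom(n = 30, size = 2, prob = rep(1/6, 6))
apply(dice_result, 2, function(counts) sum(counts * 1:6))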

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to

  • work with different binomial and logistic regression techniques,
  • know how to compare regression models and choose the right fit,
  • and much more.

Exercise 5

Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Using rnorm, simulate the height of 100 people and save it in an object called heights.

To get an idea of the values of heights, use the function summary.

Exercise 6

90% of the population is smaller than ____________?

Exercise 7

What percentage of the population is taller than 1.60 m?
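As a quick sanity check for Exercises 6 and 7 (not the official solutions), the theoretical answers follow directly from the Normal(mean = 1.70, sd = 0.1) population, and the empirical versions can be computed from the simulated heights vector:

qnorm(0.90, mean = 1.70, sd = 0.1)       # height below which 90% of the population falls
1 - pnorm(1.60, mean = 1.70, sd = 0.1)   # proportion taller than 1.60 m

## Empirical equivalents based on the simulated vector heights
quantile(heights, 0.90)
mean(heights > 1.60)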

Exercise 8

Run the following lines of code before this exercise. This will load a library required for the exercise.

if (!'MASS' %in% installed.packages()) install.packages('MASS')
library(MASS)

Simulate 1000 people with height and weight using the function mvrnorm with mu = c(1.70, 60) and
Sigma = matrix(c(.1,3.1,3.1,100), nrow = 2)

Exercise 9

How many people from the simulated population are taller than 1.70 m and heavier than 60 kg?

Exercise 10

How many people from the simulated population are taller than 1.75 m and lighter than 60 kg?

Related exercise sets:
  1. Lets Begin with something sample
  2. Probability functions beginner
  3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-6)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

DEADLINE EXTENDED: Last call for Boston EARL abstracts

Thu, 08/31/2017 - 18:56

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Are you solving problems and innovating with R?
Are you working with R in a commercial setting?
Do you enjoy sharing your knowledge?

If you said yes to any of the above, we want your abstract!

Share your commercial R stories with your peers at EARL Boston this November.

EARL isn’t about knowing the most or being the best in your field – it’s about taking what you’ve learnt and sharing it with others, so they can learn from your wins (and sometimes your failures, because we all have them!).

As long as your proposed presentation is focused on the commercial use of R, any topic from any industry is welcome!

The abstract submission deadline has been extended to Sunday 3 September.

Join David Robinson, Mara Averick and Tareef Kawaf on 1-3 November 2017 at The Charles Hotel in Cambridge.

To submit your abstract on the EARL website, click here.

See you in Boston!


To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Text featurization with the Microsoft ML package

Thu, 08/31/2017 - 17:23

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last week I wrote about how you can use the MicrosoftML package in Microsoft R to featurize images: reduce an image to a vector of 4096 numbers that quantify the essential characteristics of the image, according to an AI vision model. You can perform a similar featurization process with text as well, but in this case you have a lot more control of the features used to represent the text.

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews from 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here's one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It's great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms ("useful", "indispensable", "highly recommended"), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in the MicrosoftML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one- and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:

transformRule = list(
  featurizeText(
    vars = c(Features = "REVIEW_TEXT"),
    # ngramLength=2: include not only "Azure", "AD", but also "Azure AD"
    # skipLength=1 : "computer" and "compuuter" is the same
    wordFeatureExtractor = ngramCount(
      weighting = "tfidf",
      ngramLength = 2,
      skipLength = 1),
    language = "English"
  ),
  selectFeatures(
    vars = c("Features"),
    mode = minCount(500)
  )
)

# train using transforms !
model <- rxFastLinear(
  RATING ~ Features,
  data = train,
  mlTransforms = transformRule,
  type = "regression"   # not binary (numeric regression)
)

We can then look at the coefficients associated with these features (presence of n-grams) to assess their impact on the overall rating. By this standard, the top 10 words or word-pairs contributing to a negative rating are:

boring        -7.647399
waste         -7.537471
not           -6.355953
nothing       -6.149342
money         -5.386262
bad           -5.377981
no            -5.210301
worst         -5.051558
poorly        -4.962763
disappointed  -4.890280

Similarly, the top 10 words or word-pairs associated with a positive rating are:

will       3.073104
the|best   3.265797
love       3.290348
life       3.562267
wonderful  3.652950
,|and      3.762862
you        3.889580
excellent  3.902497
my         4.454115
great      4.552569

Another option is simply to look at the sentiment score for each review, which can be extracted using the getSentiment function. 

sentimentScores <- rxFeaturize(data=data, mlTransforms = getSentiment(vars = list(SentimentScore = "REVIEW_TEXT")))

As we expect, a negative sentiment (in the 0-0.5 range) is associated with 1- and 2-star reviews, while a positive sentiment (0.5-1.0) is associated with the 4- and 5-star reviews.

You can find more details on this analysis, including the Microsoft R code, at the link below.

Microsoft Technologies Blog for Enterprise Developers: Analyze your text in R (MicrosoftML)


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Why to use the replyr R package

Thu, 08/31/2017 - 16:48

Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on one of dim(), ncol() or nrow() may break. nrow() used to return something other than NA, so older work may not be reproducible.

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).



Tron: fights for the users.

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.

The explanation is: “tibble::truncate uses nrow()” and “print.tbl_spark is too slow since dbplyr started using tibble as the default way of printing records”.

A little digging gets us to this:

The above might make sense if tibble and dbplyr were the only users of dim(), ncol() or nrow().

Frankly if I call nrow() I expect to learn the number of rows in a table.

The suggestion is for all user code to adapt to use sdf_dim(), sdf_ncol() and sdf_nrow() (instead of tibble adapting). Even if practical (there are already a lot of existing sparklyr analyses), this prohibits the writing of generic dplyr code that works the same over local data, databases, and Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr dbplyr users (i.e., databases such as PostgreSQL), as I don’t see any obvious convenient “no please really calculate the number of rows for me” (other than “d %>% tally %>% pull“).

I admit, calling nrow() against an arbitrary query can be expensive. However, I am usually calling nrow() on physical tables (not on arbitrary dplyr queries or pipelines). Physical tables often deliberately carry explicit meta-data to make it possible for nrow() to be a cheap operation.

Allowing the user to write reliable generic code that works against many dplyr data sources is the purpose of our replyr package. Being able to use the same code many places increases the value of the code (without user facing complexity) and allows one to rehearse procedures in-memory before trying databases or Spark. Below are the functions replyr supplies for examining the size of tables:

library("replyr") packageVersion("replyr") #> [1] '0.5.4' replyr_hasrows(d) #> [1] TRUE replyr_dim(d) #> [1] 2 1 replyr_ncol(d) #> [1] 1 replyr_nrow(d) #> [1] 2 spark_disconnect(sc)

Note: the above is only working properly in the development version of replyr, as I only found out about the issue and made the fix recently.

replyr_hasrows() was added as I found in many projects the primary use of nrow() was to determine if there was any data in a table. The idea is: user code uses the replyr functions, and the replyr functions deal with the complexities of dealing with different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).

The point of replyr is to provide re-usable work arounds of design choices far away from our influence.


Pulling Data Out of Census Spreadsheets Using R

Thu, 08/31/2017 - 14:22

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In this post, I show a method for extracting small amounts of data from somewhat large Census Bureau Excel spreadsheets, using R.  The objects of interest are expenditures of state and local governments on hospital capital in Iowa for the years 2004 to 2014. The data can be found at http://www2.census.gov/govs/local/. The files at the site are yearly files.

The files to be used are those named ‘yrslsstab1a.xls’, where ‘yr‘ is replaced by the two digits of the year for a given year, for example, ’04’ or ’11’. The individual yearly files contain data for the whole country and for all of the states, over all classes of state and local government revenue and expenditures. The task is to extract three data points from each file – state and local expenditures, state expenditures, and local expenditures – for the state of Iowa.

The structure of the files varies from year to year, so first reviewing the files is important. I found two patterns for the expenditure data – data with and data without margins of error. The program locates the columns for Iowa and the row for hospital capital expenditures. Then, the data are extracted and put in a matrix for outputting.

First, character strings of the years are created, to be used in referencing the data sets, and a data frame is created to contain the final result.

years = c(paste("0", 4:9, sep=""), paste(10:14))
hospital.capital.expend <- data.frame(NA,NA,NA)

Second, the library ‘gdata’ is opened. The library ‘gdata’ contains functions useful for manipulating data in R and provides for reading data into R from an URL containing an Excel file.

library(gdata)

Third, a loop is run through the eleven years to fill in the ‘hospital.capital.expend’ data frame with the data from each year. The object ‘fn’ contains the URL of the Excel file for a given year. The function ‘paste’ concatenates the three parts of the URL. Note that ‘sep’ must be set to “” in the function.

for (i in 1:11) {
  fn = paste("http://www2.census.gov/govs/local/", years[i], "slsstab1a.xls", sep="")

Next, the Excel file is read into the object ‘ex’. The argument ‘header’ is set to ‘F’ so that all of the rows are input. Also, since all of the columns contain some character data, all of the data is forced to be character by setting ‘stringsAsFactors’ to ‘F’.  The function used to read the spreadsheet is ‘read.xls’ in the package ‘gdata’.

ex = read.xls(fn, sheet=1, header=F, stringsAsFactors=F)

Next, the row and column indices of the data are found using the functions ‘grepl’ and ‘which’. The first argument in ‘grepl’ is a pattern to be matched. For a data frame, the ‘grepl’ function returns a logical vector of ‘T’s and ‘F’s of length equal to the number of columns in the data frame – giving ‘T’ if the column contains the pattern and ‘F’ if not. Note that ‘grepl’ uses regular expressions and matches the pattern anywhere within the text, so the surrounding ‘*’ characters in the pattern below are not strictly necessary.

The ‘which’ function returns the indices of a logical vector which have the value ‘T’. So, ‘ssi1’ contains the index of the column containing ‘Hospital’ and ‘ssi2’ contains the index of the column containing ‘Iowa’. The object ‘ssi4’ contains the rows containing ‘Hospital’, since ‘ex[,ssi1]’ is a character vector instead of a data frame.   For all of the eleven years, the second incidence of ‘Hospital’ in the ‘Hospital’ column contains hospital expenditures.

  ssi1 = which(grepl("*Hospital*", ex, ignore.case=T))
  ssi2 = which(grepl("Iowa", ex, ignore.case=T))
  ssi4 = which(grepl("Hospital", ex[,ssi1], ignore.case=T))[2]

Next, the data are extracted, and the temporary files are removed. If the column index of ‘Iowa’ is less that 80, no margin of error was included and the data points are in the column of ‘Iowa’ and in the next two columns. If the column index of ‘Iowa’ is larger than 79, a margin of error was included and the data are in the column of ‘Iowa’ and the second and third columns to the right.

The capital expenditures are found one row below the ‘Hospital’ row, so one is added to ‘ssi4’ to get the correct row index. The data are put in the data frame ‘df.1’ which is row bound to the data frame ‘hospital.capital.expend’. The names of the columns in ‘df.1’ are set to ‘NA’ so that the row bind will work.  Then the temporary files are removed and the loop ends.

  if (ssi2<80) ssi5 = ssi2 + 0:2 else ssi5 = ssi2 + c(0,2,3)
  df.1 = data.frame(ex[ssi4+1, ssi5], stringsAsFactors = F)
  names(df.1) = c(NA,NA,NA)
  hospital.capital.expend = rbind(hospital.capital.expend, df.1)
  rm(fn, ex, df.1, ssi1, ssi2, ssi4, ssi5)
}

There are just a few steps left to clean things up. The first row of ‘hospital.capital.expend’, which just contains ‘NA’s, is removed. Then, the commas within the numbers, as extracted from the census file, are removed from the character strings using the function ‘gsub’ and the data frame is converted to a numeric matrix. Next, the eleven years are column bound to the matrix. Last, the columns are given names and the matrix is printed out.

hospital.capital.expend = as.matrix(hospital.capital.expend[-1,])
hospital.capital.expend = matrix(as.numeric(gsub(",","",hospital.capital.expend)), ncol=3)
hospital.capital.expend = cbind(2004:2014, hospital.capital.expend)
colnames(hospital.capital.expend) = c("Year", "State.Local", "State", "Local")
print(hospital.capital.expend)

That’s it!!!

    Related Post

    1. Extracting Tables from PDFs in R using the Tabulizer Package
    2. Extract Twitter Data Automatically using Scheduler R package
    3. An Introduction to Time Series with JSON Data
    4. Get Your Data into R: Import Data from SPSS, Stata, SAS, CSV or TXT
    5. Exporting Data from R to TXT, CSV, SPSS or Stata

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
