R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Infographic-style charts using the R waffle package

Fri, 09/08/2017 - 07:54

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

Infographics. I’ve seen good examples. I’ve seen more bad examples. In general, I prefer a good chart to an infographic. That said, there’s a “genre” of infographic that I do think is useful, which I’ll call “if X were 100 Y”. A good example: if the world were 100 people.

That method of showing proportions has been called a waffle chart and for extra “infographic-i-ness”, the squares can be replaced by icons. You want to do this using R? Of course you do. Here’s how.

There’s not much more here than you’ll find at the Github home of the R packages, waffle and extrafont. I’ve just made it a little more step-by-step.

1. Install the R packages
You need waffle to create waffle charts and extrafont to use icons in the charts.

install.packages(c("waffle", "extrafont"))
library(waffle)
library(extrafont)

2. Install Font Awesome
The icons that waffle can use are supplied by Font Awesome.

Download the zip file by clicking the Download link at their website. You may want to check out the paid option later but for now, just select the free download.

Unzip the file and navigate to the fonts directory. To install the font, use the file named fontawesome-webfont.ttf. For Windows or Mac OS X users, installation is as simple as double-clicking the file and choosing “Install”. I haven’t done this under Linux for a while; it may even be the same procedure these days.
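If you would rather not install the font system-wide (or the double-click route is not available to you), a possible alternative is to point extrafont straight at the unzipped fonts directory during the import step in the next section. This is only a sketch, not part of the original workflow, and the path below is hypothetical:

library(extrafont)
# import just the Font Awesome .ttf from wherever you unzipped it (hypothetical path)
font_import(paths = "~/Downloads/font-awesome/fonts", pattern = "fontawesome", prompt = FALSE)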

3. Import and register fonts
In R, fonts have to be imported to the extrafont database. You need to do this whenever a new font is installed that you want to use in R.

font_import()
# check that Font Awesome is imported
fonts()[grep("Awesome", fonts())]
# [1] "FontAwesome"

Fonts also have to be registered with the R output device. This happens when the package is loaded, but if you have just run font_import() in the current session you need to do it yourself with loadfonts(). If you’re using Windows and want to plot to the screen in RStudio, an additional argument is required:

# this should be fine for Mac OSX
loadfonts()
# use this if things look odd in RStudio under Windows
loadfonts(device = "win")

4. Create charts
Now we are ready to waffle. First, your basic waffle chart. The default colours are Colour Brewer “Set2”.

waffle(c(50, 30, 15, 5), rows = 5, title = "Your basic waffle chart")

Next, using icons. You can browse or search for available icon names on this page. You may need to fiddle with the size.

waffle(c(50, 30, 15, 5), rows = 5, use_glyph = "child", glyph_size = 6, title = "Look I made an infographic using R!")

You can use the iron() function to append waffle charts, which might be useful in comparisons. Here’s some made-up data involving some aspect of cars in two unnamed countries:

iron(
  waffle(c(no = 80, yes = 20), rows = 5, use_glyph = "car", glyph_size = 6,
         colors = c("#c7d4b6", "#a3aabd"), title = "Country A"),
  waffle(c(no = 70, yes = 30), rows = 5, use_glyph = "car", glyph_size = 6,
         colors = c("#c7d4b6", "#a3aabd"), title = "Country B")
)

Conclusion
That’s it, more or less. Are the icons any more effective than the squares? In certain settings, perhaps. You be the judge.

Filed under: R, statistics Tagged: packages, rstats, visualization, waffle


To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate.

Gold-Mining – Week 1 (2017)

Fri, 09/08/2017 - 01:28

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Edit: A version posted at 7:30pm and removed at 8:30pm (eastern time zone) had last year's figures. This one now points to the correct database.

Football is Back! But I’m still shaking off the rust… Bear with me, we will try to get it up tonight.

The post Gold-Mining – Week 1 (2017) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

Naive Principal Component Analysis (using R)

Thu, 09/07/2017 - 22:46

(This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Post from Pablo Bernabeu’s blog.

Principal Component Analysis (PCA) is a technique used to find the core components that underlie different variables. It comes in very useful whenever doubts arise about the true origin of three or more variables. There are two main methods for performing a PCA: naive or less naive. In the naive method, you first check some conditions in your data which will determine the essentials of the analysis. In the less-naive method, you set those yourself, based on whatever prior information or purposes you had. I will tackle the naive method, mainly by following the guidelines in Field, Miles, and Field (2012), with updated code where necessary. This lecture material was also useful.

The ‘naive’ approach is characterized by a first stage that checks whether the PCA should actually be performed with your current variables, or if some should be removed. The variables that are accepted are taken to a second stage which identifies the number of principal components that seem to underlie your set of variables. I ascribe these to the ‘naive’ or formal approach because either or both could potentially be skipped in exceptional circumstances, where the purpose is not scientific, or where enough information exists in advance.

STAGE 1.  Determine whether PCA is appropriate at all, considering the variables
  • Variables should be inter-correlated enough but not too much. Field et al. (2012) provide some thresholds, suggesting that no variable should have many correlations below .30, or any correlation at all above .90. Thus, in the example here, variable Q06 should probably be excluded from the PCA.
  • Bartlett’s test, on the nature of the intercorrelations, should be significant. Significance suggests that the correlation matrix is not an identity matrix, i.e. that the observed correlations are not merely sampling error.
  • KMO (Kaiser-Meyer-Olkin), a measure of sampling adequacy based on common variance (so it serves a similar purpose to Bartlett’s). As Field et al. review, ‘values between .5 and .7 are mediocre, values between .7 and .8 are good, values between .8 and .9 are great and values above .9 are superb’ (p. 761). There’s a general score as well as one per variable. The general one will often be good, whereas the individual scores are more likely to fail. Any variable with a score below .5 should probably be removed, and the test should be run again.
  • Determinant: an index of multicollinearity. The result should preferably be greater than .00001; markedly smaller values signal problematic multicollinearity among the variables.

Note that some of these tests are run on the dataframe and others on a correlation matrix of the data, as distinguished below.

# Necessary libraries
library(ltm)
library(lattice)
library(psych)
library(car)
library(pastecs)
library(scales)
library(ggplot2)
library(arules)
library(plyr)
library(Rmisc)
library(GPArotation)
library(gdata)
library(MASS)
library(qpcR)
library(dplyr)
library(gtools)
library(Hmisc)

# Select only your variables of interest for the PCA
dataset = mydata[, c('select_var1','select_var2','select_var3','select_var4',
                     'select_var5','select_var6','select_var7')]

# Create matrix: some tests will require it
data_matrix = cor(dataset, use = 'complete.obs')

# See intercorrelations
round(data_matrix, 2)

# Bartlett's
cortest.bartlett(dataset)

# KMO (Kaiser-Meyer-Olkin)
KMO(data_matrix)

# Determinant
det(data_matrix)

STAGE 2.  Identify number of components (aka factors)

In this stage, principal components (formally called ‘factors’ at this stage) are identified among the set of variables.

  • The identification is done through a basic, ‘unrotated’ PCA. The number of components set a priori must equal the number of variables that are being tested.
# Start off with unrotated PCA
pc1 = psych::principal(dataset, nfactors = length(dataset), rotate = "none")
pc1

Below, an example result:

## Principal Components Analysis
## Call: psych::principal(r = eng_prop, nfactors = 3, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           PC1   PC2  PC3 h2       u2 com
## Aud_eng -0.89  0.13 0.44  1 -2.2e-16 1.5
## Hap_eng  0.64  0.75 0.15  1  1.1e-16 2.0
## Vis_eng  0.81 -0.46 0.36  1 -4.4e-16 2.0
##
##                        PC1  PC2  PC3
## SS loadings           1.87 0.79 0.34
## Proportion Var        0.62 0.26 0.11
## Cumulative Var        0.62 0.89 1.00
## Proportion Explained  0.62 0.26 0.11
## Cumulative Proportion 0.62 0.89 1.00
##
## Mean item complexity = 1.9
## Test of the hypothesis that 3 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0
##  with the empirical chi square 0 with prob < NA
##
## Fit based upon off diagonal values = 1

Among the columns, first come the correlations between variables and components, followed by the h2 column with the ‘communalities’. If fewer factors than variables had been selected, communality values would be below 1. Then comes the uniqueness column (u2): uniqueness equals 1 minus the communality. Next is ‘com’, which reflects the complexity with which a variable relates to the principal components. Those components appear in the second table:
  • SS loadings (first row): the sums of squared loadings, or eigenvalues, namely the total variance explained by each linear component, expressed in units of variables explained out of all possible (three in the example above).
  • Proportion Var: variance explained out of a total of 1, obtained by dividing the eigenvalue by the number of variables (which here equals the number of components). Multiply by 100 and you get the percentage of total variance explained. In the example, 99% of the variance has been explained; rounding aside, we should actually expect 100% because the number of factors equalled the number of variables.
  • Cumulative Var: the variance added up consecutively across components.
  • Proportion Explained: variance explained over what has actually been explained (only when variables = factors is this the same as Proportion Var).
  • Cumulative Proportion: the explained variance added up consecutively across components.
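To make the link between the table and these definitions concrete, here is a minimal sketch (not part of the original post) that recomputes the main quantities by hand from the pc1 object created above:

L  <- unclass(pc1$loadings)  # matrix of variable-component correlations (loadings)
h2 <- rowSums(L^2)           # communalities
u2 <- 1 - h2                 # uniqueness = 1 - communality
ss <- colSums(L^2)           # SS loadings, i.e. the eigenvalues
ss / nrow(L)                 # Proportion Var: eigenvalue / number of variables
cumsum(ss / nrow(L))         # Cumulative Var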

Two sources will determine the number of components to select for the next stage:

  • Kaiser’s criterion: components with SS loadings > 1. In our example, only PC1.

A more lenient alternative is Jolliffe’s criterion, SS loadings > .7 (a quick way of counting components under either criterion is sketched after the scree-plot description below).

  • Scree plot: the number of points after point of inflexion. For this plot, call:
plot(pc1$values, type = 'b')

Imagine a straight line from the first point on the right. Once this line bends considerably, count the points after the bend and up to the last point on the left. The number of points is the number of components to select. The example here is probably the most complicated (two components were finally chosen), but normally it’s not difficult.
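As a quick cross-check of both criteria, the eigenvalues stored in pc1$values can simply be counted (a small sketch, not from the original post):

sum(pc1$values > 1)    # Kaiser's criterion: components with SS loadings > 1
sum(pc1$values > 0.7)  # Jolliffe's criterion: components with SS loadings > .7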

Based on both criteria, go ahead and select the definitive number of components.

STAGE 3.  Run definitive PCA

Run a command very similar to the one used before, but now with a more advanced method. The first PCA, a heuristic one, worked essentially on the inter-correlations. The definitive PCA, in contrast, will implement a prior shuffling known as ‘rotation’, to ensure that the result is robust enough (just like cards are shuffled). Explained variance is captured better this way. The go-to rotation method is the orthogonal, or ‘varimax’ (though others may be considered too).

# Now with varimax rotation, Kaiser-normalized by default:
pc2 = psych::principal(dataset, nfactors = 2, rotate = "varimax", scores = TRUE)
pc2
pc2$loadings

# Healthcheck
pc2$residual
pc2$fit
pc2$communality

We would want:

  • Less than half of residuals with absolute values > 0.05
  • Model fit > .9
  • All communalities > .7

If any of this fails, consider changing the number of factors. Next, the rotated components that have been ‘extracted’ from the core of the set of variables can be added to the dataset. This would enable the use of these components as new variables that might prove powerful and useful (as in this research).

dataset = cbind(dataset, pc2$scores)
summary(dataset$RC1, dataset$RC2)

STAGE 4.  Determine ascription of each variable to components

Check the main summary by just calling pc2, and see how each variable correlates with the rotated components. This is essential because it reveals how variables load on each component, or in other words, to which component a variable belongs. For instance, the table shown here belongs to a study about meaning of words. These results suggest that the visual and haptic modalities of words are quite related, whereas the auditory modality is relatively unique. When the analysis works out well, a cut-off point of r = .8 may be applied for considering a variable as part of a component.
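One convenient way to apply such a cut-off is to blank out the smaller loadings when printing; this is a sketch assuming the pc2 object from Stage 3:

print(pc2$loadings, cutoff = 0.8, sort = TRUE)  # only loadings above the cut-off are displayed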

STAGE 5.  Enjoy the plot

The plot is perhaps the coolest part about PCA. It really makes an awesome illustration of the power of data analysis.

ggplot(dataset, aes(RC1, RC2, label = as.character(main_eng))) +
  aes(x = RC1, y = RC2, by = main_eng) +
  stat_density2d(color = "gray87") +
  geom_text(size = 7) +
  ggtitle('English properties') +
  theme_bw() +
  theme(plot.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank()) +
  theme(axis.line = element_line(color = 'black')) +
  theme(axis.title.x = element_text(colour = 'black', size = 23,
                                    margin = margin(15, 15, 15, 15)),
        axis.title.y = element_text(colour = 'black', size = 23,
                                    margin = margin(15, 15, 15, 15)),
        axis.text.x = element_text(size = 16),
        axis.text.y = element_text(size = 16)) +
  labs(x = "", y = "Varimax-rotated Principal Component 2") +
  theme(plot.title = element_text(hjust = 0.5, size = 32, face = "bold",
                                  margin = margin(15, 15, 15, 15)))


Below is an example combining PCA plots with code similar to the above. These plots illustrate something further with regard to the relationships among modalities. In property words, the different modalities spread out more clearly than they do in concept words. This makes sense because in language, properties define concepts (see more).

An example of this code in use is available here (with data here).

References
Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: Sage.

Feel free to comment below or on the original post.


To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

In case you missed it: August 2017 roundup

Thu, 09/07/2017 - 22:18

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from August of particular interest to R users.

Using the featurizeText function in the MicrosoftML package to extract ngrams from unstructured text.

A joyplot visualizes the probabilities associated with phrases like "highly likely" and "little chance" by a sample of 46 Redditors.

Two examples of creating 3-D animations in R: a stereo cube, and the Charleston in motion capture data.

A tutorial on creating thematic maps in R, from ComputerWorld. 

Some tips on using R to query data in Power BI.

Using the Rcpp package to calculate a membership matrix for fuzzy k-means clustering.

A reimagining of Minard's chart of Napoleon's march on Russia.

Rankings of gender roles for men and women in film, from an analysis of scripts with the tidytext package.

Several talks at the upcoming Ignite conference feature R.

Norm Matloff's keynote at UseR!2017 discussed various obstacles to performance in parallel programming.

Peter Dalgaard marks the 20th anniversary of the formation of the R Core Group.

Using the rxFeaturize function in the MicrosoftML package to find images similar to a reference image.

Buzzfeed used R to identify possible spy planes in Flightradar24 data.

Timo Grossenbacher offers a helpful workflow for implementing reproducible data analysis in R.

A preview of version 0.10.0 of the dplyrXdf package, featuring support for the tidyeval framework on out-of-memory Xdf data files.

A tutorial on using CNTK via the keras package for forecasting.

Neils Berglund explains how to use the sqlrutils package to publish an R function as a SQL Server stored procedure.

Tomas Kalibera's presentation at UseR!2017 includes useful guidance on getting the most out of R's byte compiler.

The kadinsky package makes accidental aRt, deliberately.

Angus Taylor's UseR!2017 presentation uses MXNET to categorize product reviews on Amazon.

Various solutions for the energy, retail and shipping industries, with data and Microsoft R code.

Talks on new database interfaces from UseR!2017: Jim Hester on the ODBC package, and Kirill Müller on the DBI package.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Readability Redux

Tue, 09/05/2017 - 02:19

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I recently posted about using a Python module to convert HTML to usable text. Since then, a new package has hit CRAN dubbed htm2txt that is 100% R and uses regular expressions to strip tags from text.

I gave it a spin so folks could compare some basic output, but you should definitely give htm2txt a try on your own conversion needs since each method produces different results.

On my macOS systems, the htm2txt calls ended up invoking XQuartz (the X11 environment on macOS) and they felt kind of sluggish (base R regular expressions don’t have a “compile” feature and can be sluggish compared to other types of regular expression computations).

I decided to spend some of Labor Day (in the U.S.) laboring (not for long, though) on a (currently small) rJava-based R package dubbed jericho which builds upon work created by Martin Jericho which is used in at-scale initiatives like the Internet Archive. Yes, I’m trading Java for Python, but the combination of Java+R has been around for much longer and there are many solved problems in Java-space that don’t need to be re-invented (if you do know a header-only, cross-platform, C++ HTML-to-text library, definitely leave a comment).

Is it worth it to get rJava up and running to use jericho vs htm2txt? Let’s take a look:

library(jericho) # devtools::install_github("hrbrmstr/jericho")
library(microbenchmark)
library(htm2txt)
library(tidyverse)

c(
  "https://medium.com/starts-with-a-bang/science-knows-if-a-nation-is-testing-nuclear-bombs-ec5db88f4526",
  "https://en.wikipedia.org/wiki/Timeline_of_antisemitism",
  "http://www.healthsecuritysolutions.com/2017/09/04/watch-out-more-ransomware-attacks-incoming/"
) -> urls

map_chr(urls, ~paste0(read_lines(.x), collapse="\n")) -> sites_html

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[1])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[1])
  },
  htm2txt = {
    a <- htm2txt(sites_html[1])
  },
  times = 10
) -> mb1

# microbenchmark(
#   jericho_txt = {
#     a <- html_to_text(sites_html[2])
#   },
#   jericho_render = {
#     a <- render_html_to_text(sites_html[2])
#   },
#   htm2txt = {
#     a <- htm2txt(sites_html[2])
#   },
#   times = 10
# ) -> mb2

microbenchmark(
  jericho_txt = {
    a <- html_to_text(sites_html[3])
  },
  jericho_render = {
    a <- render_html_to_text(sites_html[3])
  },
  htm2txt = {
    a <- htm2txt(sites_html[3])
  },
  times = 10
) -> mb3

The second benchmark is commented out because I really didn’t have time to wait for it to complete (FWIW jericho goes fast in that test). Here’s what the other two look like:

mb1
## Unit: milliseconds
##            expr         min          lq        mean      median          uq         max neval
##     jericho_txt    4.121872    4.294953    4.567241    4.405356    4.734923    5.621142    10
##  jericho_render    5.446296    5.564006    5.927956    5.719971    6.357465    6.785791    10
##         htm2txt 1014.858678 1021.575316 1035.342729 1029.154451 1042.642065 1082.340132    10

mb3
## Unit: milliseconds
##            expr        min         lq       mean     median         uq        max neval
##     jericho_txt   2.641352   2.814318   3.297543   3.034445   3.488639   5.437411    10
##  jericho_render   3.034765   3.143431   4.708136   3.746157   5.953550   8.931072    10
##         htm2txt 417.429658 437.493406 446.907140 445.622242 451.443907 484.563958    10

You should run the conversion functions on your own systems to compare the results (they’re somewhat large to incorporate here). I’m fairly certain they do a comparable — if not better — job of extracting clean, pertinent text.

I need to separate the package into two (one for the base JAR and the other for the conversion functions) and add some more tests before a CRAN submission, but I think this would be a good addition to the budding arsenal of HTML-to-text conversion options in R.


To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

Analyzing Google Trends Data in R

Mon, 09/04/2017 - 13:32

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Google Trends shows the changes in the popularity of search terms over a given time (i.e., number of hits over time). It can be used to find search terms with growing or decreasing popularity or to review periodic variations from the past such as seasonality. Google Trends search data can be added to other analyses, manipulated and explored in more detail in R.

This post describes how you can use R to download data from Google Trends, and then include it in a chart or other analysis. We’ll discuss first how you can get overall (global) data on a search term (query), how to plot it as a simple line chart, and then how you can break the data down by geographical region. The first example I will look at is the rise and fall of the Blu-ray.

Analyzing Google Trends in R

I have never bought a Blu-ray disc and probably never will. In my world, technology moved from DVDs to streaming without the need for a high definition physical medium. I still see them in some shops, but it feels as though they are declining. Using Google Trends we can find out when interest in Blu-rays peaked.

The following R code retrieves the global search history since 2004 for Blu-ray.

library(gtrendsR)
library(reshape2)

google.trends = gtrends(c("blu-ray"), gprop = "web", time = "all")[[1]]
google.trends = dcast(google.trends, date ~ keyword + geo, value.var = "hits")
rownames(google.trends) = google.trends$date
google.trends$date = NULL

The first argument to the gtrends function is a list of up to 5 search terms. In this case, we have just one item. The second argument gprop is the medium searched on and can be any of web, news, images or youtube. The third argument time can be any of now 1-d, now 7-d, today 1-m, today 3-m, today 12-m, today+5-y or all (which means since 2004). A final possibility for time is to specify a custom date range e.g. 2010-12-31 2011-06-30.

Note that I am using gtrendsR version 1.9.9.0. This version improves upon the CRAN version 1.3.5 (as of August 2017) by not requiring a login. You may see a warning if your timezone is not set – this can be avoided by adding the following line of code:

Sys.setenv(TZ = "UTC")

After retrieving the data from Google Trends, I format it into a table with dates for the row names and search terms along the columns. The table below shows the result of running this code.

Plotting Google Trends data: Identifying seasonality and trends

Plotting the Google Trends data as an R chart we can draw two conclusions. First, interest peaked around the end of 2008. Second, there is a strong seasonal effect, with significant spikes around Christmas every year.

Note that results are relative to the total number of searches at each time point, with the maximum being 100. We cannot infer anything about the volume of Google searches. But we can say that as a proportion of all searches Blu-ray was about half as frequent in June 2008 compared to December 2008. An explanation about Google Trend methodology is here.

Google Trends by geographic region

Next, I will illustrate the use of country codes. To do so I will find the search history for skiing in Canada and New Zealand. I use the same code as previously, except modifying the gtrends line as below.

google.trends = gtrends(c("skiing"), geo = c("CA", "NZ"), gprop = "web", time = "2010-06-30 2017-06-30")[[1]]

The new argument to gtrends is geo, which allows the user to specify geographic codes to narrow the search region. The awkward part about geographical codes is that they are not always obvious. Country codes consist of two letters, for example, CA and NZ in this case. We could also use region codes such as US-CA for California. I find the easiest way to get these codes is to use this Wikipedia page.

An alternative way to find all the region-level codes for a given country is to use the following snippet of R code. In this case, it retrieves all the regions of Italy (IT).

library(gtrendsR)
geo.codes = sort(unique(countries[substr(countries$sub_code, 1, 2) == "IT", ]$sub_code))

Plotting the ski data below, we note the contrast between northern and southern hemisphere winters. Skiing is also relatively more popular in Canada than in New Zealand. The 2014 winter Olympics caused a notable spike in both countries, but particularly in Canada.

Create your own analysis

In this post I have shown how to import data from Google Trends using the R package gtrendsR. Anyone can click on this link to explore the examples used in this post or create your own analysis.

    Related Post

    1. Gradient boosting in R
    2. Radial kernel Support Vector Classifier
    3. Random Forests in R
    4. Network analysis of Game of Thrones
    5. Structural Changes in Global Warming

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    Calculating Marginal Effects Exercises

    Mon, 09/04/2017 - 08:54

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

A common experience for those in the social sciences migrating to R from SPSS or STATA is that some procedures that used to happen at the click of a button now require more work, or are obscured enough by the unfamiliar language that it is hard to see how to accomplish them. One such procedure I have run into is calculating the marginal effects of a generalized linear model. In this exercise set, we will explore calculating marginal effects for linear, logistic, and probit regression models in R.

Exercises in this section will be solved using the margins and mfx packages. It is recommended to take a look at the concise and excellent documentation for these packages before continuing.
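As a quick orientation before starting (this sketch is not part of the exercises, and the model and dataset here are arbitrary), the basic margins workflow looks roughly like this:

library(margins)
toy <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)  # any fitted model
summary(margins(toy))  # average marginal effect of each regressor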

    Answers to the exercises are available here.

    Exercise 1
    Load the mtcars dataset. Build a linear regression of mpg on wt, qsec, am, and hp.

    Exercise 2
    Print the coefficients from the linear model in the previous exercise.

    Exercise 3
Using the margins package, find the marginal effects.

    Exercise 4
    Verify that you receive the same results from Exercises 2 and 3. Why do these marginal effects match the coefficients found when printing the linear model object?

    Exercise 5
Using the mtcars dataset, build a linear regression similar to the one in Exercise 1, except include an interaction term between am and hp. Find the marginal effects for this regression.

    Exercise 6
    Using your favorite dataset (mine is field.goals from the nutshell package), construct a logistic regression.

    Exercise 7
Explain why marginal effects for a logit model are more complex than for a linear model.

    Exercise 8
    For the next two exercises, you may use either package. Calculate the marginal effects with respect to the mean.

    Exercise 9
    Calculate the average marginal effects.

    Exercise 10
    If these marginal effects are different, explain why they are different.

    Related exercise sets:
    1. Introduction to copulas Exercises (Part-1)
    2. Multiple Regression (Part 3) Diagnostics
    3. Introduction to copulas Exercises (Part-2)
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    Thou shalt not compare numeric values (except when it works)

    Mon, 09/04/2017 - 07:32

    (This article was first published on R – Irregularly Scheduled Programming, and kindly contributed to R-bloggers)

    This was just going to be a few Tweets but it ended up being a bit of a rollercoaster of learning for me, and I haven’t blogged in far too long, so I’m writing it up quickly as a ‘hey look at that’ example for newcomers.

    I’ve been working on the ‘merging data’ part of my book and, as I do when I’m writing this stuff, I had a play around with some examples to see if there was anything funky going on if a reader was to try something slightly different. I’ve been using dplyr for the examples after being thoroughly convinced on Twitter to do so. It’s going well. Mostly.

## if you haven't already done so, load dplyr
suppressPackageStartupMessages(library(dplyr))

    My example involved joining together two tibbles containing text values. Nothing too surprising. I wondered though; do numbers behave the way I expect?

    Now, a big rule in programming is ‘thou shalt not compare numbers’, and it holds especially true when numbers aren’t exactly integers. This is because representing non-integers is hard, and what you see on the screen isn’t always what the computer sees internally.

    Thou shalt not compare numbers.
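A classic illustration of the problem (a quick aside, not from the original post):

0.1 + 0.2 == 0.3               # FALSE
print(0.1 + 0.2, digits = 17)  # 0.30000000000000004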

    If I had a tibble where the column I would use to join had integers

dataA <- tribble(
  ~X,   ~Y,
  0L, 100L,
  1L, 101L,
  2L, 102L,
  3L, 103L
)
dataA
## # A tibble: 4 x 2
##       X     Y
##
## 1     0   100
## 2     1   101
## 3     2   102
## 4     3   103

    and another tibble with numeric in that column

dataB <- tribble(
  ~X,    ~Z,
   0, 1000L,
   1, 1001L,
   2, 1002L,
   3, 1003L
)
dataB
## # A tibble: 4 x 2
##       X     Z
##
## 1     0  1000
## 2     1  1001
## 3     2  1002
## 4     3  1003

    would they still join?

full_join(dataA, dataB)
## Joining, by = "X"
## # A tibble: 4 x 3
##       X     Y     Z
##
## 1     0   100  1000
## 2     1   101  1001
## 3     2   102  1002
## 4     3   103  1003

    Okay, sure. R treats these as close enough to join. I mean, maybe it shouldn’t, but we’ll work with what we have. R doesn’t always think these are equal

identical(0L, 0)
## [1] FALSE
identical(2L, 2)
## [1] FALSE

    though sometimes it does

0L == 0
## [1] TRUE
2L == 2
## [1] TRUE

    (== coerces types before comparing). Well, what if one of these just ‘looks like’ the other value (can be coerced to the same?)

dataC <- tribble(
   ~X,   ~Z,
  "0", 100L,
  "1", 101L,
  "2", 102L,
  "3", 103L
)
dataC
## # A tibble: 4 x 2
##       X     Z
##
## 1     0   100
## 2     1   101
## 3     2   102
## 4     3   103

full_join(dataA, dataC)
## Joining, by = "X"
## Error in full_join_impl(x, y, by$x, by$y, suffix$x, suffix$y, check_na_matches(na_matches)): Can't join on 'X' x 'X' because of incompatible types (character / integer)

That’s probably wise. Of course, R is perfectly happy with things like

"2":"5"
## [1] 2 3 4 5

and == thinks that’s fine

"0" == 0L
## [1] TRUE
"2" == 2L
## [1] TRUE

    but who am I to argue?

Anyway, how far apart can those integers and numerics be before they aren’t able to be joined? What if we shift the ‘numeric in name only’ values away from the integers just a teensy bit? .Machine$double.eps is the built-in value for ‘the tiniest number you can produce’. On this system it’s 2.220446e-16.

dataBeps <- tribble(
  ~X,                      ~Z,
  0 + .Machine$double.eps, 1000L,
  1 + .Machine$double.eps, 1001L,
  2 + .Machine$double.eps, 1002L,
  3 + .Machine$double.eps, 1003L
)
dataBeps
## # A tibble: 4 x 2
##              X     Z
##
## 1 2.220446e-16  1000
## 2 1.000000e+00  1001
## 3 2.000000e+00  1002
## 4 3.000000e+00  1003

full_join(dataA, dataBeps)
## Joining, by = "X"
## # A tibble: 6 x 3
##              X     Y     Z
##
## 1 0.000000e+00   100    NA
## 2 1.000000e+00   101    NA
## 3 2.000000e+00   102  1002
## 4 3.000000e+00   103  1003
## 5 2.220446e-16    NA  1000
## 6 1.000000e+00    NA  1001

    Well, that’s… weirder. The values offset from 2 and 3 joined fine, but the 0 and 1 each got multiple copies since R thinks they’re different. What if we offset a little further?

dataB2eps <- tribble(
  ~X,                        ~Z,
  0 + 2*.Machine$double.eps, 1000L,
  1 + 2*.Machine$double.eps, 1001L,
  2 + 2*.Machine$double.eps, 1002L,
  3 + 2*.Machine$double.eps, 1003L
)
dataB2eps
## # A tibble: 4 x 2
##              X     Z
##
## 1 4.440892e-16  1000
## 2 1.000000e+00  1001
## 3 2.000000e+00  1002
## 4 3.000000e+00  1003

full_join(dataA, dataB2eps)
## Joining, by = "X"
## # A tibble: 8 x 3
##              X     Y     Z
##
## 1 0.000000e+00   100    NA
## 2 1.000000e+00   101    NA
## 3 2.000000e+00   102    NA
## 4 3.000000e+00   103    NA
## 5 4.440892e-16    NA  1000
## 6 1.000000e+00    NA  1001
## 7 2.000000e+00    NA  1002
## 8 3.000000e+00    NA  1003

    That’s what I’d expect. So, what’s going on? Why does R think those numbers are the same? Let’s check with a minimal example: For each of the values 0:4, let’s compare that integer with the same offset by .Machine$double.eps

suppressPackageStartupMessages(library(purrr)) ## for the 'thou shalt not for-loop' crowd
map_lgl(0:4, ~ as.integer(.x) == as.integer(.x) + .Machine$double.eps)
## [1] FALSE FALSE  TRUE  TRUE  TRUE

    And there we have it. Some sort of relative difference tolerance maybe? In any case, the general rule to live by is to never compare floats. Add this to the list of reasons why.
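If you do have to compare floating-point values, a tolerance-based comparison is the usual way out. A small sketch (not from the original post), reusing the same offsets as above:

map_lgl(0:4, ~ dplyr::near(as.integer(.x), as.integer(.x) + .Machine$double.eps))
## all TRUE: near() compares within a tolerance (default .Machine$double.eps^0.5)
isTRUE(all.equal(0, 0 + .Machine$double.eps))  # TRUE; the base R equivalent idea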

    For what it’s worth, I’m sure this is hardly a surprising detail to the dplyr team. They’ve dealt with things like this for a long time and I’m sure it was much worse before those changes.


To leave a comment for the author, please follow the link and comment on their blog: R – Irregularly Scheduled Programming.

    Migrating from GitHub to GitLab with RStudio (Tutorial)

    Mon, 09/04/2017 - 02:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    GitHub vs. GitLab

    Git is a distributed implementation of version control. Many people have written very eloquently about why it is a good idea to use version control, not only if you collaborate in a team but also if you work on your own; one example is this article from RStudio’s Support pages.

    In short, its main feature is that version control allows you to keep track of the changes you make to your code. It will also keep a history of all the changes you have made in the past and allows you to go back to specific versions if you made a major mistake. And Git makes collaborating on the same code very easy.

    Most R packages are also hosted on GitHub. You can check out their R code in the repositories if you want to get a deeper understanding of the functions, you can install the latest development versions of packages or install packages that are not on CRAN. The issue tracker function of GitHub also makes it easy to report and respond to issues/problems with your code.
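For example, development versions can be installed straight from a repository host with the remotes package (a sketch that is not from the original post; the account and package names below are placeholders):

# install.packages("remotes")
remotes::install_github("someuser/somepackage")  # development version from GitHub (hypothetical repo)
remotes::install_gitlab("someuser/somepackage")  # the GitLab equivalent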

    Why would you want to leave GitHub?

    Public repositories are free on GitHub but you need to pay for private repos (if you are a student or work in academia, you get private repos for free). Since I switched from academia to industry lately and no longer fulfil these criteria, all my private repos would have to be switched to public in the future. Here, GitLab is a great alternative!

    GitLab offers very similar functionalities as GitHub. There are many pros and cons for using GitHub versus GitLab but for me, the selling point was that GitLab offers unlimited private projects and collaborators in its free plan.

    Tutorial

    Migrating from GitHub to GitLab with RStudio is very easy! Here, I will show how I migrated my GitHub repositories of R projects, that I work with from within RStudio, to GitLab.

Beware that ALL code snippets below show Terminal code (they are NOT from the R console)!

    Migrating existing repositories

You first need to set up your GitLab account (you can login with your GitHub account) and connect your old GitHub account. Under Settings & Account, you will find “Social sign-in”; here click on “Connect” next to the GitHub symbol (if you signed in with your GitHub account, it will already be connected).

    Once you have done this, you can import all your GitHub repositories to GitLab. To do this, you first need to create a new project. Click on the drop-down arrow next to the plus sign in the top-right corner and select “New project”. This will open the following window:

    Here, choose “Import project from GitHub” and choose the repositories you want to import.

    If you go into one of your repositories, GitLab will show you a message at the top of the site that tells you that you need to add an SSH key. The SSH key is used for secure communication between the GitLab server and your computer when you want to share information, like push/pull commits.

    If you already work with GitHub on your computer, you will have an SSH key set up and you can copy your public SSH key to GitLab. Follow the instructions here.

    Here is how you do it on a Mac:

    1. Look for your public key and copy it to the clipboard

cat ~/.ssh/id_rsa.pub
pbcopy < ~/.ssh/id_rsa.pub

    Then paste it into the respective field here.

    The next step is to change the remote URL for pushing/pulling your project from RStudio. In your Git window (tab next to “Environment” and “History” for me), click on Settings and “Shell”.

    Then write in the shell window that opened:

git remote set-url origin git@gitlab.com:<username>/<repository>.git

    You can copy the link in the navigation bar of your repo on GitLab.

    Check that you now have the correct new gitlab path by going to “Tools”, “Project Options” and “Git/SVN”.

    Also check your SSH key configuration with:

ssh -T git@gitlab.com

    If you get the following message

The authenticity of host 'gitlab.com (52.167.219.168)' can't be established.
ECDSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?

    type “yes” (and enter passphrase if prompted).

    If everything is okay, you now get a message saying Welcome to GitLab!

    Now, you can commit, push and pull from within RStudio just as you have done before!

    In case of problems with pushing/pulling

In my case, I migrated both my private and my company’s GitHub repos to GitLab. While my private repos could be migrated without a hitch, migrating my company’s repos was a bit more tricky (because they had additional security settings, I assume).

    Here is how I solved this problem with my company’s repos:

    I have protected my SSH key with a passphrase. When pushing or pulling commits via the shell with git pull and git push origin master, I am prompted to enter my passphrase and everything works fine. Pushing/pulling from within RStudio, however, threw an error:

ssh_askpass: exec(/usr/X11R6/bin/ssh-askpass): No such file or directory
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

    I am using a MacBook Pro with MacOS Sierra version 10.12.6, so this might not be an issue with another operating system.

    The following solution worked for me:

    1. Add your SSH key

    ssh-add ~/.ssh/id_rsa
2. Reinstall VS Code

    Now I could commit, push and pull from within RStudio just as before!


To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

    It is Needlessly Difficult to Count Rows Using dplyr

    Sun, 09/03/2017 - 23:05

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    • Question: how hard is it to count rows using the R package dplyr?
    • Answer: surprisingly difficult.

When trying to count rows using dplyr or dplyr-controlled data-structures (remote tbls such as Sparklyr or dbplyr structures), one is sailing between Scylla and Charybdis, the task being to avoid dplyr corner-cases and irregularities (a few of which I attempt to document in this "dplyr inferno").




    Let’s take an example from sparklyr issue 973:

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.2.9000'
library("sparklyr")
packageVersion("sparklyr")
## [1] '0.6.2'

sc <- spark_connect(master = "local")
## * Using Spark: 2.1.0
db_drop_table(sc, 'extab', force = TRUE)
## [1] 0
DBI::dbGetQuery(sc, "DROP TABLE IF EXISTS extab")
DBI::dbGetQuery(sc, "CREATE TABLE extab (n TINYINT)")
DBI::dbGetQuery(sc, "INSERT INTO extab VALUES (1), (2), (3)")

dRemote <- tbl(sc, "extab")
print(dRemote)
## # Source: table [?? x 1]
## # Database: spark_connection
##       n
##
## 1    01
## 2    02
## 3    03

dLocal <- data.frame(n = as.raw(1:3))
print(dLocal)
##    n
## 1 01
## 2 02
## 3 03

    Many Apache Spark big data projects use the TINYINT type to save space. TINYINT behaves as a numeric type on the Spark side (you can run it through SparkML machine learning models correctly), and the translation of this type to R‘s raw type (which is not an arithmetic or numerical type) is something that is likely to be fixed very soon. However, there are other reasons a table might have R raw columns in them, so we should expect our tools to work properly with such columns present.
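To see why a raw column is so awkward (a small aside, not from the original post): raw is not a numeric type, so ordinary arithmetic on it fails even though the values print like small integers.

r <- as.raw(1:3)
is.numeric(r)   # FALSE
as.integer(r)   # 1 2 3
# sum(r)        # errors: invalid 'type' (raw) of argument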

    Now let’s try to count the rows of this table:

nrow(dRemote)
## [1] NA

    That doesn’t work (apparently by choice!). And I find myself in the odd position of having to defend expecting nrow() to return the number of rows.

    There are a number of common legitimate uses of nrow() in user code and package code including:

    • Checking if a table is empty.
    • Checking the relative sizes of tables to re-order or optimize complicated joins (something our join planner might add one day).
    • Confirming data size is the same as reported in other sources (Spark, database, and so on).
    • Reporting amount of work performed or rows-per-second processed.

    The obvious generic dplyr idiom would then be dplyr::tally() (our code won’t know to call the new sparklyr::sdf_nrow() function, without writing code to check we are in fact looking at a Sparklyr reference structure):

tally(dRemote)
## # Source: lazy query [?? x 1]
## # Database: spark_connection
##      nn
##
## 1     3

    That works for Spark, but not for local tables:

dLocal %>% tally
## Using `n` as weighting variable
## Error in summarise_impl(.data, dots): Evaluation error: invalid 'type' (raw) of argument.

    The above issue is filed as dplyr issue 3070. The above code usually either errors-out (if the column is raw) or creates a new total column called nn with the sum of the n column instead of the count.

data.frame(n=100) %>% tally
## Using `n` as weighting variable
##    nn
## 1 100

    We could try adding a column and summing that:

dLocal %>%
  transmute(constant = 1.0) %>%
  summarize(n = sum(constant))
## Error in mutate_impl(.data, dots): Column `n` is of unsupported type raw vector

    That fails due to dplyr issue 3069: local mutate() fails if there are any raw columns present (even if they are not the columns you are attempting to work with).

    We can try removing the dangerous column prior to other steps:

dLocal %>%
  select(-n) %>%
  tally
## data frame with 0 columns and 3 rows

That does not work on local tables, as tally fails to count 0-column objects (dplyr issue 3071; probably the same issue exists for many dplyr verbs, as we saw a related issue for dplyr::distinct).

    And the method does not work on remote tables either (Spark, or database tables) as many of them do not appear to support 0-column results:

dRemote %>%
  select(-n) %>%
  tally
## Error: Query contains no columns

    In fact we start to feel trapped here. For a data-object whose only column is of type raw we can’t remove all the raw columns as we would then form a zero-column result (which does not seem to always be legal), but we can not add columns as that is a current bug for local frames. We could try some other transforms (such as joins, but we don’t have safe columns to join on).

    At best we can try something like this:

nrow2 <- function(d) {
  n <- nrow(d)
  if(!is.na(n)) {
    return(n)
  }
  d %>%
    ungroup() %>%
    transmute(constant = 1.0) %>%
    summarize(tot = sum(constant)) %>%
    pull()
}

dRemote %>% nrow2()
## [1] 3

dLocal %>% nrow2()
## [1] 3

    We are still experimenting with work-arounds in the replyr package (but it is necessarily ugly code).

spark_disconnect(sc)

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

    Variable Selection with Elastic Net

    Sun, 09/03/2017 - 22:50

    (This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

LASSO has been a popular algorithm for variable selection and is extremely effective with high-dimensional data. However, it often tends to “over-regularize”, producing a model that is overly compact and therefore under-predictive.

    The Elastic Net addresses the aforementioned “over-regularization” by balancing between LASSO and ridge penalties. In particular, a hyper-parameter, namely Alpha, would be used to regularize the model such that the model would become a LASSO in case of Alpha = 1 and a ridge in case of Alpha = 0. In practice, Alpha can be tuned easily by the cross-validation. Below is a demonstration of Elastic Net with R glmnet package and its comparison with LASSO and ridge models.

pkgs <- list("glmnet", "doParallel", "foreach", "pROC")
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

df1 <- read.csv("Downloads/credit_count.txt")
df2 <- df1[df1$CARDHLDR == 1, ]
set.seed(2017)
n <- nrow(df2)
sample <- sample(seq(n), size = n * 0.5, replace = FALSE)
train <- df2[sample, -1]
test <- df2[-sample, -1]
mdlY <- as.factor(as.matrix(train["DEFAULT"]))
mdlX <- as.matrix(train[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])
newY <- as.factor(as.matrix(test["DEFAULT"]))
newX <- as.matrix(test[setdiff(colnames(df1), c("CARDHLDR", "DEFAULT"))])

First of all, we estimate a LASSO model with Alpha = 1. The function cv.glmnet() is used to search for a regularization parameter, namely Lambda, that controls the penalty strength. As shown below, the model only identifies 2 attributes out of the total of 12.

# LASSO WITH ALPHA = 1
cv1 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance",
                 paralle = TRUE, alpha = 1)
md1 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv1$lambda.1se, alpha = 1)
coef(md1)
#(Intercept) -1.963030e+00
#AGE          .
#ACADMOS      .
#ADEPCNT      .
#MAJORDRG     .
#MINORDRG     .
#OWNRENT      .
#INCOME      -5.845981e-05
#SELFEMPL     .
#INCPER       .
#EXP_INC      .
#SPENDING     .
#LOGSPEND    -4.015902e-02
roc(newY, as.numeric(predict(md1, newX, type = "response")))
#Area under the curve: 0.636

We next estimate a ridge model as below by setting Alpha = 0. Similarly, Lambda is searched for by the cross-validation. Since the ridge penalty would only regularize the magnitude of each coefficient, we end up with a “full” model with all model attributes. The model performance is slightly better with 10 more variables, which is a debatable outcome.

# RIDGE WITH ALPHA = 0
cv2 <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance",
                 paralle = TRUE, alpha = 0)
md2 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv2$lambda.1se, alpha = 0)
coef(md2)
#(Intercept) -2.221016e+00
#AGE         -4.184422e-04
#ACADMOS     -3.085096e-05
#ADEPCNT      1.485114e-04
#MAJORDRG     6.684849e-03
#MINORDRG     1.006660e-03
#OWNRENT     -9.082750e-03
#INCOME      -6.960253e-06
#SELFEMPL     3.610381e-03
#INCPER      -3.881890e-07
#EXP_INC     -1.416971e-02
#SPENDING    -1.638184e-05
#LOGSPEND    -6.213884e-03
roc(newY, as.numeric(predict(md2, newX, type = "response")))
#Area under the curve: 0.6435

At last, we use the Elastic Net by tuning the value of Alpha through a line search with parallelism. In this particular case, Alpha = 0.3 is chosen through the cross-validation. As shown below, 6 variables are used in the model, which even performs better than the ridge model with all 12 attributes.

# ELASTIC NET WITH 0 < ALPHA < 1
a <- seq(0.1, 0.9, 0.05)
search <- foreach(i = a, .combine = rbind) %dopar% {
  cv <- cv.glmnet(mdlX, mdlY, family = "binomial", nfold = 10, type.measure = "deviance",
                  paralle = TRUE, alpha = i)
  data.frame(cvm = cv$cvm[cv$lambda == cv$lambda.1se], lambda.1se = cv$lambda.1se, alpha = i)
}
cv3 <- search[search$cvm == min(search$cvm), ]
md3 <- glmnet(mdlX, mdlY, family = "binomial", lambda = cv3$lambda.1se, alpha = cv3$alpha)
coef(md3)
#(Intercept) -1.434700e+00
#AGE         -8.426525e-04
#ACADMOS      .
#ADEPCNT      .
#MAJORDRG     6.276924e-02
#MINORDRG     .
#OWNRENT     -2.780958e-02
#INCOME      -1.305118e-04
#SELFEMPL     .
#INCPER      -2.085349e-06
#EXP_INC      .
#SPENDING     .
#LOGSPEND    -9.992808e-02
roc(newY, as.numeric(predict(md3, newX, type = "response")))
#Area under the curve: 0.6449


To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

    Combining and comparing models using Model Confidence Set

    Sun, 09/03/2017 - 19:19

    (This article was first published on R – insightR, and kindly contributed to R-bloggers)

    By Gabriel Vasconcelos

In many cases, especially if you are dealing with forecasting models, it is natural to use a large set of models to forecast the same variable and then select the best model using some error measure. For example, you can break the sample into a training sample (in-sample) and a test sample (out-of-sample), estimate all models in the training sample and see how they perform in the test sample. You could compare the models using the root mean squared error (RMSE) or the mean absolute error (MAE).

If you want to make things a little more formal, then you can use a statistical test such as Diebold-Mariano to compare the models pairwise. This test will tell you whether model A is better than model B. If you have n models to compare you will need to perform n(n-1)/2 tests. If n is large you will have a lot of test results to look at, and reporting them becomes a problem (imagine that you have 100 models: you will have 4950 p-values to report).
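As an aside, a single pairwise comparison of this kind can be carried out with the Diebold-Mariano test from the forecast package. A hedged sketch, where e1 and e2 are hypothetical vectors of out-of-sample forecast errors from two competing models:

library(forecast)
# e1, e2: forecast errors (actual minus forecast) of models A and B on the same test sample
dm.test(e1, e2, alternative = "two.sided", h = 1, power = 2)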

Fortunately, Hansen, Lunde and Nason proposed (here) a suitable solution in an article in 2011. Their solution is called the Model Confidence Set (MCS). The MCS has an interpretation similar to a confidence interval, i.e. the set contains the best model with confidence 1 - alpha. The number of models in the MCS increases as we decrease alpha, just like the size of a confidence interval. You only need to perform the MCS procedure once to compare all models, and you have to report only one p-value for each model, which is an output of the test.

    Application

    Before going to the application, let us define a function that calculates the RMSE for several models at the same time, which we will be using a lot.

rmse = function(real, ...) {
  forecasts = cbind(...)
  sqrt(colMeans((real - forecasts)^2))
}
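A quick usage note: because the forecasts are collected with cbind(...), naming the arguments gives a named vector with one RMSE per model (y, f_a and f_b are hypothetical vectors, not objects defined in this post):

# one RMSE per named forecast column
rmse(y, modelA = f_a, modelB = f_b)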

    To the application!! We are going to generate some data with the following design:

• 300 observations (T = 300) and 8 variables (n = 8),
• Each variable is generated from y_{i,t} = phi * y_{i,t-1} + lambda_i * f_t + e_{i,t} with phi = 0.6, where lambda_i is sampled from a uniform(-1,1) distribution and e_{i,t} is Normal with mean 0 and variance 4,
• f_t is a common autoregressive factor generated from f_t = rho * f_{t-1} + u_t with rho = 0.75, where u_t is standard Normal (this matches the code below),
• I generated two independent factors: four variables are generated from factor 1 and the remaining four variables are generated from factor 2.
• We assume that we do not know which factor generates each variable.
    library(reshape2) library(grid) library(gridExtra) library(ggplot2) library(boot) T = 300 # = number of observations = # n = 8 # = Number of Variables = # rho = 0.75 # = factors AR coefficient = # phi = 0.6 # = Variables AR coefficient = # set.seed(1) # = seed for replication = # # = Generate two autoregressive factors = # f = matrix(0, T, 2) for(i in 2:T){ f[i, ] = rho * f[i-1, ] + rnorm(2) } lambda = runif(n,-1,1) # = Factor loadings = # # = Generate variables = # Y = matrix(0, T, n) for(i in 2:T){ # = Based on the first factor = # Y[i, 1:4] = phi * Y[i-1, 1:4] + lambda[1:4] * f[i, 1] + rnorm(4, 0, 2) # = Based on the second factor = # Y[i, 5:8] = phi * Y[i-1, 5:8] + lambda[5:8] * f[i,2] + rnorm(4, 0, 2) } # = Organize variable lags = # # = y is the first (dependent) variable = # Y = as.data.frame(embed(Y, 2)[ , c(1, 9:16)]) # = y is the first (dependent) variable = # colnames(Y) = c("y", paste("x", 1:n, sep = ""))

Now that we have our data, we are going to estimate all possible combinations of linear regressions using our 8 variables. The first variable will be the dependent variable, defined as y. The explanatory variables are the first lags of all eight variables (including the first), defined as x1, ..., x8. This design results in 255 models (all non-empty subsets of the eight regressors) to be estimated and compared.

    # = First we will generate formulas for all the 255 equations = # variables = colnames(Y)[-1] combinations = lapply(1:n, function(x)combn(1:n, x)) formulas = lapply(combinations, function(y){ apply(y, 2, function(x){ paste("~", paste(variables[x], collapse = " + "), sep=" ") }) }) formulas = unlist(formulas) # = Now we break the sample into in-sample and out-of-sample = # in_sample = Y[1:100, ] out_of_sample = Y[-c(1:100), ] # = Estimate the models = # models = lapply(formulas, function(x) lm(paste("y", x), data = in_sample)) # = Compute forecasts = # forecasts = Reduce(cbind, lapply(models, function(x) predict(x, out_of_sample))) colnames(forecasts) = paste("model", 1:ncol(forecasts), sep = "") # RMSES (Mean of all models, model with all variables) rmse(out_of_sample$y,rowMeans(forecasts),forecasts[,ncol(forecasts)]) ## [1] 2.225175 2.322508

    As you can see, the average forecast is more accurate than the model using all 8 variables. Now we are going to compute the MCS using the function below. If you want more details on the algorithm, just look at the original article here.

    mcs=function(Loss,R,l){ LbarB=tsboot(Loss,colMeans,R=R,sim="fixed",l=l)$t Lbar=colMeans(Loss) zeta.Bi=t(t(LbarB)-Lbar) save.res=c() for(j in 1:(ncol(Loss)-1)){ Lbardot=mean(Lbar) zetadot=rowMeans(zeta.Bi) vard=colMeans((zeta.Bi-zetadot)^2) t.stat=(Lbar-Lbardot)/sqrt(vard) t.max=max(t.stat) model.t.max=which(t.stat==t.max) t.stat.b=t(t(zeta.Bi-zetadot)/sqrt(vard)) t.max.b=apply(t.stat.b,1,max) p=length(which(t.max<t.max.b))/R save.res=c(save.res,p) names(save.res)[j]=names(model.t.max) Lbar=Lbar[-model.t.max] zeta.Bi=zeta.Bi[,-model.t.max] } save.res=c(save.res,1) names(save.res)[j+1]=names(Lbar) save.p=save.res for(i in 2:(length(save.res)-1)){ save.p[i]=max(save.res[i-1],save.res[i]) } aux=match(colnames(Loss),names(save.p)) save.p=save.p[aux] save.res=save.res[aux] return(list(test=save.p,individual=save.res)) }

The MCS is computed in a single line of code, but first we are going to break the out-of-sample into two sub-samples. We are going to compute the MCS in the first sub-sample and compare the models in the second sub-sample. The MCS function returns the p-value needed to include each model in the set. For example, if model1 has a p-value=0.1, it will be included in the MCS of 90% confidence.

    # = generate 2 subsamples = # out1 = out_of_sample[1:100, ] out2 = out_of_sample[-c(1:100), ] f1 = forecasts[1:100, ] f2 = forecasts[-c(1:100),] # = Compute MCS = # set.seed(123) # = Seed for replication = # MCS=mcs( Loss=(f1-out1$y)^2, R=1000, l=3 )$test # = R block bootstrap samples with block length = 3 = # # = If you want to see the p-values write (print(sort(MCS)) = # # = Selected models in the 50% MCS = # selected=which(MCS>0.5) # RMSE (Mean of all models, model with all variables, MCS models) rmse(out2$y,rowMeans(f2),f2[,ncol(f2)],rowMeans(f2[,selected])) ## [1] 2.142932 2.240110 2.013547

As you can see, if we combine only the models in the MCS the RMSE becomes smaller. This is not always the case, but it happens a lot. The combined results are better when the forecasts do not have a very high positive correlation. The MCS is not only useful for combining forecasts. Even if the combinations are not better than individual models, you still have a confidence set that includes the best model with the desired significance. If some model is excluded from the confidence set you have strong evidence that this model is worse than the ones in the set (depending on the confidence level, of course). Naturally, if you increase the confidence level, the number of models in the set also increases, just like in confidence intervals.
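One rough way to illustrate the correlation point with the objects computed above is to compare the average pairwise correlation of the forecasts inside the 50% MCS with that of the full pool (a sketch, not part of the original post):

cc <- cor(f2)                       # correlation matrix of all out-of-sample forecasts
cc_mcs <- cc[selected, selected]    # restricted to the models in the 50% MCS
mean(cc_mcs[upper.tri(cc_mcs)])     # average pairwise correlation inside the MCS
mean(cc[upper.tri(cc)])             # average pairwise correlation over all models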

Finally, we can find some interesting relationships in the plots below. The first plot shows the forecasting errors as a function of the confidence level. The optimal confidence level is around 55%. If we increase it further, the MCS starts to select a lot of bad models and the forecast becomes less accurate. The second plot shows the number of models as a function of the confidence level, and the third plot shows the errors as a function of the number of models.

    alpha = seq(0,1,0.01) nmodels = sapply(alpha, function(x) length(which(MCS >= x))) errors = sapply(alpha, function(x) rmse(out2$y, rowMeans(as.matrix(f2[ , which(MCS >= x)])))) df = data.frame(confidence = 1 - alpha, errors = errors, nmodels = nmodels) dfmelt = melt(df, id.vars = "errors") g1=ggplot(data=df) + geom_point(aes(x=confidence,y=errors)) g2=ggplot(data=df) + geom_point(aes(x=confidence,y=nmodels)) g3=ggplot(data=df) + geom_point(aes(x=nmodels,y=errors)) grid.arrange(g1,g2,g3,ncol=2)

    References

    Hansen, Peter R., Asger Lunde, and James M. Nason. “The model confidence set.” Econometrica 79.2 (2011): 453-497.



    Unifying packages to help users access R repositories

    Sun, 09/03/2017 - 15:16

    (This article was first published on Blog: John C. Nash, and kindly contributed to R-bloggers)

With over 11,000 packages in the main CRAN (Comprehensive R Archive Network) collection, users of R face a challenge when looking for tools to carry out their statistical and computational tasks, especially in subject areas outside their main domain of expertise. Moreover, a lot of packages have similar or overlapping features. Some are, to be kind, not very well written, use methods which are outdated, or are simply inappropriate to a given situation.

    In an attempt to address this, I joined with Julia Silge and Spencer Graves to organize a special session “Navigating the R package universe” at the UseR2017 meeting in Brussels. A lot of other people contributed to the session — see https://github.com/nashjc/Rnavpkg, and especially the wiki portion, https://github.com/nashjc/Rnavpkg/wiki. The presentation materials are there, along with links to various follow-on material such as notes and this and other blog posts. In particular, there is a general follow-up blog post at https://juliasilge.com/blog/navigating-packages/.

    There are lots of ideas about how to help users find the right package, but we found that we could consider them as generally falling into three categories:

    • unification of packages that have similar goals or features (of which more in this posting)
    • guidance in the form of CRAN Task Views and similar infrastructure (see Julia’s posting https://juliasilge.com/blog/package-guidance/)
    • search tools for R packages (pending posting by Spencer Graves)

    There are overlaps in these categories, and we are certain some ideas have been left out. Let us know!

    Approaches to unification of packages

    In the breakout session on unification at UseR2017, many ideas were suggested, sometimes in rapid-fire fashion. Alice Daish kindly took notes on her cell phone, but I will apologize in advance if I have missed anything important.

    My own view of unification with regard to the package universe is to reduce the effective span of the number of “things” the user must consider. If the rule of thumb for presentations is to have no more than (fill in your favourite single digit integer) bullets per screen, surely a user should not have to face thousands of choices? Following this objective, we can consider:

    • developing wrappers that provide a common syntax or interface to multiple packages
    • finding ways to identify packages that should not generally be used so they can be archived out of the immediate or first-level view of users who are searching for tools
    • promoting collaboration between developers of similar packages with the aim of consolidating those packages
    • encouraging those working on guidance and search to provide filtering of their output to reduce the choice burden on users.

    None of these approaches can be considered “easy”.

    Wrappers

My own efforts at unifying packages have been in the field of optimization, particularly function minimization. With Ravi Varadhan, I developed the optimx package to allow a common interface to a number of base and package tools. It is an indicator of the difficulty of doing this well that I later refactored optimx to optimr to allow for easier maintenance of the package. Note that because these packages call other packages, failures in the called functions may give rise to error messages in the unifying package. To avoid too many CRAN alerts, optimrx can be found on R-forge in the optimizer project https://r-forge.r-project.org/projects/optimizer/ and can call more solvers than optimr. I believe that optimrx makes a worthwhile step towards simplifying the user's task in carrying out function minimization where some of the parameters can also be bounded or temporarily fixed. A similar wrapper for global optimizers has been written by Hans Werner Borchers. gloptim is so far only on R-forge in the optimizer project. As I mentioned, writing these codes takes a lot of effort. They are never truly finished because new solvers become available from time to time.
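To give a flavour of what such a wrapper buys the user, here is a minimal sketch with optimx; the toy objective function and starting values are made up for illustration:

library(optimx)
fr <- function(x) sum((x - c(1, 2))^2)   # toy objective with minimum at (1, 2)
# one call, several underlying solvers, one comparable table of results
optimx(par = c(0, 0), fn = fr, method = c("Nelder-Mead", "BFGS", "nlminb"))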

    A sample of some other packages which unify a number of methods:

    • mlr: Machine Learning in R
    • caret: Classification And REgression Training for developing predictive models
    • Matrix: Matrix methods for dense and sparse matrices (What would we do without them?)

    We could also consider the nls() function in the stats package that is distributed with base R to unify several methods for nonlinear least squares. Unfortunately, the default method and derivative approximation are quite often mis-applied by non-experts (and even some experts!). For the Marquardt stabilization, users must look in packages like minpack.lm and nlsr. Duncan Murdoch and I wrote the last package to allow for analytic derivatives of expressions in nonlinear modelling, but it would be nice to be able to use them in conjunction with separable models (essentially those with linear structure with some nonlinear terms). While nlsr has a wrapnls() function to call nls(), we don’t yet have a way to access minpack.lm or some other packages with relevant features. There is always more work to do!
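As a small illustration of why the Marquardt stabilization matters, here is a hedged sketch with made-up data; nls() with the default Gauss-Newton iteration can stop with a "singular gradient" error on problems of this kind, while the Levenberg-Marquardt version in minpack.lm tends to converge:

library(minpack.lm)
set.seed(1)
x <- 1:25
y <- 2 * exp(-0.1 * x) + rnorm(25, sd = 0.05)
dat <- data.frame(x = x, y = y)
# Levenberg-Marquardt nonlinear least squares; nlsr::nlxb() would be an
# analytic-derivative alternative for the same model
fitLM <- nlsLM(y ~ a * exp(b * x), data = dat, start = list(a = 1, b = 0))
coef(fitLM)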

    Archiving

    During the breakout session on unification, there were a lot of comments about packages that should be merged, as well as a few about packages that should be abandoned. A subtext of the discussion that is relevant to R in general is that younger workers were quite clear that they felt they could not voice their opinions openly because some packages had been written by members of well-established groups who might provide employment opportunities or other perks. In this, the fact that two of the session organizers (JN and SG) are retired and the third (JS) does not work in academia was noted. We are relatively protected from backlash to honest commentary. Here I genuinely mean honest and not capriciously critical. We must recognize that many packages have been important in the life of R but now need renewal or replacement. I have been trying to get my own contributions to the optim() function deprecated for some time! However, there are workers who, I am sure, will for a variety of reasons be unwilling to discuss ways to reorganize the package repositories.

    Ultimately, the users will vote with their feet (or rather mouse/touchpad). However, there will be a lot of wasted effort that does them and R no good whatever.

    Filtered presentation

    The CRAN repository is, at present, offered to users as a list alphabetically or by date on the different mirrors, (e.g., https://cloud.r-project.org/). While a way to see every package available is necessary, it is far from friendly to users, either novices or experts. Clearly it is feasible to restructure the list(s) for different audiences. Moreover, this restructuring or filtering need not be on the CRAN sites, but could be delegated to sites dealing with particular applications or methods. Here is a prime area of overlap with the guidance and search ideas for navigating the packages. In particular, it will be helpful if automated tools are developed that do the lion’s share of the work for this.

    Another approach to filtering is to present R tools within a different framework. https://www.jamovi.org/ uses a spreadsheet paradigm. https://www.tidyverse.org/ collects tools that deal with data science and adds a streamlined command set. Unfortunately, there is a learning cost to using new frameworks like these. They simplify the use of R in one domain, but add to the general collection of packages and tools.

    Directions

Unification of R packages with the aim of reducing the user's cost of finding the right tools is never going to be trivial. Moreover, it is not as directly rewarding as writing an (apparently) new package or method and getting a paper published about it. On the other hand, making R infrastructure more accessible to users is key to the continued vitality of the overall resource and community.

    John C. Nash     2017-9-3



    Permutation Theory In Action

    Sat, 09/02/2017 - 19:38

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the things Win-Vector LLC does in both our consulting and training practices.

    Let’s take a look at an example. Suppose our two orderings are o1 (the ordering Spark ML chooses) and o2 (the order the R user chooses).

    set.seed(326346) symbols <- letters[1:7] o1 <- sample(symbols, length(symbols), replace = FALSE) o1 ## [1] "e" "a" "b" "f" "d" "c" "g" o2 <- sample(symbols, length(symbols), replace = FALSE) o2 ## [1] "d" "g" "f" "e" "b" "c" "a"

To translate Spark results into R results we need a permutation that takes o1 to o2. The idea is: if we had a permutation that takes o1 to o2 we could use it to re-map predictions that are in o1 order into predictions in o2 order.

    To solve this we crack open our article on the algebra of permutations.

    We are going to use the fact that the R command base::order(x) builds a permutation p such that x[p] is in order.

    Given this the solution is: we find permutations p1 and p2 such that o1[p1] is ordered and o2[p2] is ordered. Then build a permutation perm such that o1[perm] = (o1[p1])[inverse_permutation(p2)]. I.e., to get from o1 to o2 move o1 to sorted order and then move from the sorted order to o2‘s order (by using the reverse of the process that sorts o2). Again, the tools to solve this are in our article on the relation between permutations and indexing.

    Below is the complete solution (including combining the two steps into a single permutation):

    p1 <- order(o1) p2 <- order(o2) # invert p2 # see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/ p2inv <- seq_len(length(p2)) p2inv[p2] <- seq_len(length(p2)) (o1[p1])[p2inv] ## [1] "d" "g" "f" "e" "b" "c" "a" # composition rule: (o1[p1])[p2inv] == o1[p1[p2inv]] # see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/ perm <- p1[p2inv] o1[perm] ## [1] "d" "g" "f" "e" "b" "c" "a"
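A small side note (my addition, not from the original post): since p2 is itself a permutation, its inverse can also be obtained with one more call to order(), so the whole translation collapses to a single expression:

perm_alt <- p1[order(p2)]      # order(p2) is exactly the inverse permutation p2inv
all(perm_alt == perm)          # TRUE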

The equivalence "(o1[p1])[p2inv] == o1[p1[p2inv]]" is frankly magic (though it also follows quickly "by definition"), and studying it is the topic of our original article on permutations.

    The above application is a good example of why it is nice to have a little theory worked out, even before you think you need it.



    Pair Programming Statistical Analyses

    Sat, 09/02/2017 - 00:00

    (This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

    Abstract

    Control calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results and illustrate its use by a case study with two statisticians ping-ponging an analysis using dplyr and SQL, respectively.



    Clip art is based on work by Gan Khoon Lay available under a CC BY 3.0 US license.

    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from .

    Introduction

    If you are a statistician working in climate science, data driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses, where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or have a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g. version control and unit tests, provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames and we shall illustrate its use by automatizing the comparison step of the control calculation ping-pong process.
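To fix ideas before the case study, here is a toy sketch of what such a data frame diff looks like; the two small data frames are made up, while diff_data() and get_data() are the daff functions used throughout this post:

library(daff)
a <- data.frame(id = 1:3, value = c(10, 20, 30))
b <- data.frame(id = 1:3, value = c(10, 25, 30))
diff_data(a, b)$get_data()   # rows that differ are flagged in the @@ column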

    The Case Story

    Ada and Bob have to calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. The present exposition will for the sake of convenience ignore the complex survey part of the problem and just pretend that the sample corresponds to the population (complete inventory count).

    The Data

    After the questionnaire phase, the following data are available to Ada and Bob.

Id  Version  Region  Sales Volume  Staff Costs  People
01  1        A        23000         10003        5
02  1        B        12200          7200        1
03  1        NA       19500          7609        2
04  1        A        17500         13000        3
05  1        B       119500         90000        NA
05  2        B       119500         95691        19
06  1        B           NA         19990        6
07  1        A        19123         20100        8
08  1        D        25000           100        NA
09  1        D        45860         32555        9
10  1        E        33020         25010        7

    Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

    • the overall number of fitness centres
    • total sales volume
    • total staff cost
    • total number of people employed in fitness centres

    However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

    Ada

    Ada loves the tidyverse and in particular dplyr. This is her solution:

ada <- fitness %>% na.omit() %>% group_by(Region,Id) %>% top_n(1,Version) %>% group_by(Region) %>% filter(`Sales Volume` >= 17500) %>% summarise(`NoOfUnits`=n(), `Sales Volume`=sum(`Sales Volume`), `Staff Costs`=sum(`Staff Costs`), People=sum(People)) ada ## # A tibble: 4 x 5 ## Region NoOfUnits `Sales Volume` `Staff Costs` People ## ## 1 A 3 59623 43103 16 ## 2 B 1 119500 95691 19 ## 3 D 1 45860 32555 9 ## 4 E 1 33020 25010 7

Bob

Bob has a dark past as a relational database management system (RDBMS) developer and, hence, only recently experienced the joys of R. He therefore chooses a no-SQL-server-necessary SQLite within R approach. The hope is that in a big-data situation this might be a little speedier than base R:

    library(RSQLite) ## Create ad-hoc database db <- dbConnect(SQLite(), dbname = file.path(filePath,"Test.sqlite")) ##Move fitness data into the ad-hoc DB dbWriteTable(conn = db, name = "fitness", fitness, overwrite=TRUE, row.names=FALSE) ##Query using SQL bob <- dbGetQuery(db, " SELECT Region, COUNT(*) As NoOfUnits, SUM([Sales Volume]) As [Sales Volume], SUM([Staff Costs]) AS [Staff Costs], SUM(People) AS People FROM fitness WHERE [Sales Volume] > 17500 GROUP BY Region ") bob ## Region NoOfUnits Sales Volume Staff Costs People ## 1 1 19500 7609 2 ## 2 A 2 42123 30103 13 ## 3 B 2 239000 185691 19 ## 4 D 2 70860 32655 9 ## 5 E 1 33020 25010 7

    Update: An alternative approach with less syntactic overhead would have been the sqldf package, which has a standard SQLite backend and automagically handles the import of the data.frame into the DB using the RSQLite pkg.

    ##Load package suppressPackageStartupMessages(library(sqldf)) ##Ensure SQLite backend. options(sqldf.driver = "SQLite") ##Same query as before bob <- sqldf(" SELECT Region, COUNT(*) As NoOfUnits, SUM([Sales Volume]) As [Sales Volume], SUM([Staff Costs]) AS [Staff Costs], SUM(People) AS People FROM fitness WHERE [Sales Volume] > 17500 GROUP BY Region ")

Even shorter is the direct use of SQL chunks in knitr, using the variable db as the connection and the chunk option output.var = "bob":

SELECT Region, COUNT(*) As NoOfUnits, SUM([Sales Volume]) As [Sales Volume], SUM([Staff Costs]) AS [Staff Costs], SUM(People) AS People FROM fitness WHERE [Sales Volume] > 17500 GROUP BY Region

The Ping-Pong Phase

After Ada and Bob each have a result, they compare their resulting data.frames using the daff package, which was recently presented by @edwindjonge at the useR in Brussels.

    library(daff) diff <- diff_data(ada, bob) diff$get_data() ## @@ Region NoOfUnits Sales Volume Staff Costs People ## 1 +++ 1 19500 7609 2 ## 2 -> A 3->2 59623->42123 43103->30103 16->13 ## 3 -> B 1->2 119500->239000 95691->185691 19 ## 4 -> D 1->2 45860->70860 32555->32655 9 ## 5 E 1 33020 25010 7

After Ada's and Bob's serve, the two realize that their results agree only for region E.

    Note: Currently, daff has the semi-annoying feature of not being able to show all the diffs when printing, but just n lines of the head and tail. As a consequence, for the purpose of this post, we overwrite the printing function such that it always shows all rows with differences.

    ##Small helper function for better printing print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "") diff %>% print() ## @@ Region NoOfUnits Sales Volume Staff Costs People ## 1 +++ 1 19500 7609 2 ## 2 -> A 3->2 59623->42123 43103->30103 16->13 ## 3 -> B 1->2 119500->239000 95691->185691 19 ## 4 -> D 1->2 45860->70860 32555->32655 9

    The two decide to first agree on the number of units per region.

    diff$get_data() %>% filter(`@@` != "") %>% select(`@@`, Region, NoOfUnits) ## @@ Region NoOfUnits ## 1 +++ 1 ## 2 -> A 3->2 ## 3 -> B 1->2 ## 4 -> D 1->2

    One obvious reason for the discrepancies appears to be that Bob has results for an extra region. Therefore, Bob takes another look at his management of missing values and decides to improve his code by:

Pong Bob

SELECT Region, COUNT(*) As NoOfUnits, SUM([Sales Volume]) As [Sales Volume], SUM([Staff Costs]) AS [Staff Costs], SUM(People) AS People FROM fitness WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL) GROUP BY Region

## @@ Region NoOfUnits Sales Volume Staff Costs People ## 1 -> A 3->2 59623->42123 43103->30103 16->13 ## 2 -> B 1->2 119500->239000 95691->185691 19 ## 3 -> D 1->2 45860->70860 32555->32655 9

Ping Bob

    Better. Now the NA region is gone, but still quite some differences remain. Note: You may at this point want to stop reading and try yourself to fix the analysis – the data and code are available from the github repository.

    Pong Bob

    Now Bob notices that he forgot to handle the duplicate records and apparently misunderstood the exact definition of the €17,500 exclusion limit. His massaged SQL query looks as follows:

SELECT Region, COUNT(*) As NoOfUnits, SUM([Sales Volume]) As [Sales Volume], SUM([Staff Costs]) AS [Staff Costs], SUM(People) AS People FROM (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People FROM fitness GROUP BY Id) WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL) GROUP BY Region

## @@ Region NoOfUnits Sales Volume Staff Costs People ## 1 ... ... ... ... ... ... ## 2 -> D 1->2 45860->70860 32555->32655 9

Ping Ada

Comparing with Ada, Bob is sort of envious that she was able to just use dplyr's group_by and top_n functions. However, daff shows that there is still one difference left. By looking more carefully at Ada's code it becomes clear that she accidentally leaves out one unit in region D. The reason is the too liberal use of na.omit, which also removes the one entry with an NA in one of the less important columns. They discuss whether one really wants to include partial records or not, because the summations in the different columns are then over different numbers of units. After consulting the standard operating procedure (SOP) for this kind of survey, they decide to include the observation where possible. Here is Ada's modified code:

ada2 <- fitness %>% filter(!is.na(Region)) %>% group_by(Region,Id) %>% top_n(1,Version) %>% group_by(Region) %>% filter(`Sales Volume` >= 17500) %>% summarise(`NoOfUnits`=n(), `Sales Volume`=sum(`Sales Volume`), `Staff Costs`=sum(`Staff Costs`), People=sum(People)) (diff_final <- diff_data(ada2,bob3)) %>% print() ## @@ Region NoOfUnits ... Staff Costs People ## 1 ... ... ... ... ... ... ## 2 -> D 2 ... 32655 NA->9

Pong Ada

    Oops, Ada forgot to take care of the NA in the summation:

    ada3 <- fitness %>% filter(!is.na(Region)) %>% group_by(Region,Id) %>% top_n(1,Version) %>% group_by(Region) %>% filter(`Sales Volume` >= 17500) %>% summarise(`NoOfUnits`=n(), `Sales Volume`=sum(`Sales Volume`), `Staff Costs`=sum(`Staff Costs`), People=sum(People,na.rm=TRUE)) diff_final <- diff_data(ada3,bob3) ##Check if the results really are the same length(diff_final$get_data()) == 0 ## [1] TRUE

Finally, their results agree; they move on to production, and the results are published in a nice report.

    Conclusion

    As shown, the ping-pong game is quite manual and particularly annoying, if at some point someone steps into the office with a statement like Btw, I found some extra questionnaires, which need to be added to the analysis asap. However, the two now aligned analysis scripts and the corresponding daff-overlay could be put into a script, which is triggered every time the data change. In case new discrepancies emerge as length(diff$get_data()) > 0, the two could then be automatically informed.
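A rough sketch of such a trigger script is given below. The wrappers run_ada() and run_bob() are hypothetical functions around the two aligned analyses above, and the notification step is only indicated by a comment:

library(daff)

check_agreement <- function(fitness) {
  ada <- run_ada(fitness)            # Ada's dplyr pipeline wrapped as a function
  bob <- run_bob(fitness)            # Bob's SQL query wrapped as a function
  d   <- diff_data(ada, bob)
  if (length(d$get_data()) > 0) {    # same check as used in the text above
    message("Discrepancies found - notifying Ada and Bob.")
    # e.g. send an e-mail or chat notification here (not shown)
  }
  invisible(d)
}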

    Question 1: Now the two get the same results, do you agree with them?

    ## # A tibble: 4 x 5 ## Region NoOfUnits `Sales Volume` `Staff Costs` People ## ## 1 A 3 59623 43103 16 ## 2 B 1 119500 95691 19 ## 3 D 2 70860 32655 9 ## 4 E 1 33020 25010 7

    Question 2: Are you aware of any other good ways and tools to structure and automatize such a process? If so, please share your experiences as a Disqus comment below. Control calculations appear quite common, but little structured code support appears to be available for such processes.



    Photo is copyright Johnathan J. Stegeman under the GNU Free Documentation License, version 1.2.



    Practical Data Science for Stats

    Fri, 09/01/2017 - 19:05

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    PeerJ Preprints has recently published a collection of articles that focus on the practical side of statistical analysis: Practical Data Science for Stats. While the articles are not peer-reviewed, they have been selected and edited by Jennifer Bryan and Hadley Wickham, both well-respected members of the R community. And while the articles provide great advice for any data scientist, the content does heavily feature the use of R, so it's particularly useful to R users.

    There are 16 articles in the collection (with possibly more to come?). Here are just a few examples that caught my eye:

    There's lots more to explore in the collection as well, including case studies on using R at the likes of AirBnB and the New York Mets. Check out the entire collection at the link below.

    PeerJ Collections: Practical Data Science for Stats (via Jenny Bryan)



    Summarizing dataset to apply machine learning – exercises

    Fri, 09/01/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    INTRODUCTION

    Dear reader,

    If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.

    This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.

    Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Finally, check the solutions to verify your answers.

    Exercise 1

    Create a list of 80% of the rows in the original dataset to use for training. HINT: Use createDataPartition().

    Exercise 2

    Select 20% of the data for validation.

    Exercise 3

    Use the remaining 80% of data to train and test the models.

    Exercise 4

    Find the dimensions of the “iris” dataset. HINT: Use dim().

    Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to:

    • Create a machine learning algorithm from a beginner point of view
    • Quickly dive into more advanced methods in an accessible pace and with more explanations
    • And much more

    This course shows a complete workflow start to finish. It is a great introduction and fallback when you have some experience.

    Exercise 5

    Find the type of each attribute in your dataset. HINT: Use sapply().

    Exercise 6

    Take a look at the first 5 rows of your dataset. HINT: Use head().

    Exercise 7

    Find the levels of the variable “Species.” HINT: Use levels().

    Exercise 8

    Find the percentages of rows that belong to the labels you found in Exercise 7. HINT: Use prop.table() and table().

    Exercise 9

    Display the absolute count of instances for each class as well as its percentage. HINT: Use cbind().

    Exercise 10

    Display the summary of the “iris” dataset. HINT: Use summary().

    Related exercise sets:
    1. How to prepare and apply machine learning to your dataset
    2. Building Shiny App exercises part 4
    3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-5)
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory


    How rOpenSci uses Code Review to Promote Reproducible Science

    Fri, 09/01/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    This is cross-posted from the NumFOCUS blog. NumFOCUS is a nonprofit that supports and promotes world-class, innovative, open source scientific computing, including rOpenSci.


    At rOpenSci, we create and curate software to help scientists with the data life cycle. These tools access, download, manage, and archive scientific data in open, reproducible ways. Early on, we realized this could only be a community effort. The variety of scientific data and workflows could only be tackled by drawing on contributions of scientists with field-specific expertise.

    With the community approach came challenges. How could we ensure the quality of code written by scientists without formal training in software development practices? How could we drive adoption of best practices among our contributors? How could we create a community that would support each other in this work?

We have had great success addressing these challenges via peer review. We draw elements from a process familiar to our target community – academic peer review – and a practice from the software development world – production code review – to create a system that fosters software quality, ongoing education, and community development.

    An Open Review Process

    Production software review occurs within software development teams, open source or not. Contributions to a software project are reviewed by one or more other team members before incorporation into project source code. Contributions are typically small patches, and review serves as a check on quality, as well as an opportunity for training in team standards.

    In academic peer review, external reviewers critique a complete product – usually a manuscript – with a very broad mandate to address any areas they see as deficient. Academic review is often anonymous and passing through it gives the product, and the author, a public mark of validation.

    We blend these approaches. In our process, authors submit complete R packages to rOpenSci. Editors check that packages fit into our project’s scope, run a series of automated tests to ensure a baseline of code quality and completeness, and then assign two independent reviewers. Reviewers comment on usability, quality, and style of software code as well as documentation. Authors make changes in response, and once reviewers are satisfied with the updates, the package receives a badge of approval and joins our suite.

    This process is quite iterative. After reviewers post a first round of extensive reviews, authors and reviewers chat in an informal back-and-forth, only lightly moderated by an editor. This lets both reviewers and authors pose questions of each other and explain differences of opinion. It can proceed at a much faster pace than typical academic review correspondence. We use the GitHub issues system for this conversation, and responses take minutes to days, rather than weeks to months.

    The exchange is also open and public. Authors, reviewers, and editors all know each other’s identities. The broader community can view or even participate in the conversation as it happens. This provides an incentive to be thorough and provide non-adversarial, constructive reviews. Both authors and reviewers report that they enjoy and learn more from this open and direct exchange. It also has the benefit of building community. Participants have the opportunity to meaningfully network with new peers, and new collaborations have emerged via ideas spawned during the review process.

    We are aware that open systems can have drawbacks. For instance, in traditional academic review, double-blind peer review can increase representation of female authors, suggesting bias in non-blind reviews. It is also possible reviewers are less critical in open review. However, we posit that the openness of the review conversation provides a check on review quality and bias; it’s harder to inject unsupported or subjective comments in public and without the cover of anonymity. Ultimately, we believe the ability of authors and reviewers to have direct but public communication improves quality and fairness.

    Guidance and Standards

    rOpenSci provides guidance on reviewing. This falls into two main categories: high-level best practices and low-level standards. High-level best practices are general and broadly applicable across languages and applications. These are practices such as “Write re-usable functions rather than repeating the same code,” “Test edge cases,” or “Write documentation for all of your functions.” Because of their general nature, these can be drawn from other sources and not developed from scratch. Our best practices are based on guidance originally developed by Mozilla Science Lab.

    Low-level standards are specific to a language (in our case, R), applications (data interfaces) and user base (researchers). These include specific items such as naming conventions for functions, best choices of dependencies for certain tasks, and adherence to a code style guide. We have an extensive set of standards for our reviewers to check. These change over time as the R software ecosystem evolves, best practices change, and tooling and educational resources make new methods available to developers.

Our standards also change based on feedback from reviewers. We adopt into our standards suggestions that emerge from multiple reviewers across different packages. Many of these, we’ve found, have to do with the ease of use and consistency of software APIs, and the type and location of information in documentation that make it easiest to find. This highlights one of the advantages of external reviewers – they can provide a fresh perspective on usability, as well as test software under different use cases than imagined by the author.

    As our standards have become more extensive, we have come to rely more on automated tools. The R ecosystem, like most languages, has a suite of tools for code linting, function testing, static code analysis and continuous integration. We require authors to use these, and editors run submissions through a suite of tests prior to sending them for review. This frees reviewers from the burden of low-level tasks to focus on high-level critiques where they can add the most value.
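For R packages, a hedged sketch of what such an automated pre-review check could look like; the exact tool set below is illustrative and the editors' actual checks may differ:

# run from the package's root directory
devtools::check()            # R CMD check: errors, warnings and notes
covr::package_coverage()     # unit test coverage
lintr::lint_package()        # style and static code analysis
goodpractice::gp()           # a bundle of common best-practice checks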

    The Reviewer Community

    One of the core challenges and rewards of our work has been developing a community of reviewers.

    Reviewing is a high-skill activity. Reviewers need expertise in the programming methods used in a software package and also the scientific field of its application. (“Find me someone who knows sensory ecology and sparse data structures!”) They need good communications skills and the time and willingness to volunteer. Thankfully, the open-science and open-source worlds are filled with generous, expert people. We have been able to expand our reviewer pool as the pace of submissions and the domains of their applications have grown.

    Developing the reviewer pool requires constant recruitment. Our editors actively and broadly engage with developer and research communities to find new reviewers. We recruit from authors of previous submissions, co-workers and colleagues, at conferences, through our other collaborative work and on social media. In the open-source software ecosystem, one can often identify people with particular expertise by looking at their published software or contribution to other projects, and we often will cold-email potential reviewers whose published work suggests they would be a good match for a submission.

    We cultivate our reviewer pool as well as expand it. We bring back reviewers so that they may develop reviewing as a skill, but not so often as to overburden them. We provide guidance and feedback to new recruits. When assigning reviewers to a submission, we aim to pair experienced reviewers with new ones, or reviewers with expertise on a package’s programming methods with those experienced in its field of application. These reviewers learn from each other, and diversity in perspectives is an advantage; less experienced developers often provide insight that more experienced ones do not on software usability, API design, and documentation. More experienced developers will more often identify inefficiencies in code, pitfalls due to edge-cases, or suggest alternate implementation approaches.

    Expanding Peer Review for Code

    Code review has been one of rOpenSci’s best initiatives. We build software, build skills, and build community, and the peer review process has been a major catalyst for all three. It has made both the software we develop internally and that we accept from outside contributors more reliable, usable, and maintainable. So we are working to promote open peer review of code by more organizations working with scientific software. We helped launch The Journal of Open Source Software, which uses a version of our review process to provide a developer-friendly publication venue. JOSS’s success has led to a spin-off, the Journal of Open Source Education, which uses an open, code-review-like processes to provide feedback on curricula and educational materials. We are also developing a pilot program where software papers submitted to a partner academic journal receive a badge for going through rOpenSci review. We are encouraged by other review initiatives like ReScience and The Programming Historian. BioConductor’s code reviews, which predate ours by several years, recently switched to an open model.

    If your organization is developing or curating scientific code, we believe code review, implemented well, can be a great benefit. It can take considerable effort to begin, but here are some of the key lessons we’ve learned that can help:

    • Develop standards and guidelines for your authors and reviewers to use. Borrow these freely from other projects (feel free to use ours), and modify them iteratively as you go.
    • Use automated tools such as code linters, test suites, and test coverage measures to reduce burden on authors, reviewers, and editors as much as possible.
    • Have a clear scope. Spell out to yourselves and potential contributors what your project will accept, and why. This will save a lot of confusion and effort in the future.
    • Build a community through incentives of networking, opportunities to learn, and kindness.

    rOpenSci is eager to work with other groups interested in developing similar review processes, especially if you are interested in reviewing and curating scientific software in languages other than R or beyond our scope of supporting the data life cycle. Software that implements statistical algorithms, for instance, is an area ripe for open peer review of code. Please get in touch if you have questions or wish to co-pilot a review system for your project.

    And of course, if you want to review, we’re always looking for volunteers. Sign up at https://ropensci.org/onboarding.

    You can support rOpenSci by becoming a NumFOCUS member or making a donation to the project.



    RcppAnnoy 0.0.9

    Fri, 09/01/2017 - 03:20

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version, 0.0.9, of RcppAnnoy, our Rcpp-based R integration of the nifty Annoy library by Erik, is now on CRAN. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours.

    This release corrects an issue for Windows users discovered by GitHub user ‘khoran’ who later also suggested the fix of binary mode. It upgrades to Annoy release 1.9.1 and brings its new Manhattan distance to RcppAnnoy. A number of unit tests were added as well, and we updated some packaging internals such as symbol registration.
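A minimal usage sketch of the new distance, assuming the Manhattan class is exposed through the same Rcpp module interface as the existing Euclidean and Angular ones (the data here are random and purely illustrative):

library(RcppAnnoy)
d <- 16                                   # dimensionality of the vectors
a <- new(AnnoyManhattan, d)               # index using Manhattan (L1) distance
set.seed(42)
for (i in 0:99) a$addItem(i, runif(d))    # Annoy items are indexed from zero
a$build(10)                               # build 10 trees
a$getNNsByItem(0, 5)                      # five approximate nearest neighbours of item 0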

    And I presume I had a good streak emailing with Uwe’s robots as the package made it onto CRAN rather smoothly within ten minutes of submission:

     

    Changes in this version are summarized here:

    Changes in version 0.0.9 (2017-08-31)
    • Synchronized with Annoy upstream version 1.9.1

    • Minor updates in calls and tests as required by annoy 1.9.1

    • New Manhattan distance modules along with unit test code

    • Additional unit tests from upstream test code carried over

    • Binary mode is used for save (as suggested by @khoran in #21)

    • A new file init.c was added with calls to R_registerRoutines() and R_useDynamicSymbols()

    • Symbol registration is enabled in useDynLib

    Courtesy of CRANberries, there is also a diffstat report for this release.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



    The Proof-Calculation Ping Pong

    Fri, 09/01/2017 - 00:00
    Abstract

    The proof-calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results.



    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from .

    Introduction

    If you are a statistician working in climate science, data driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses, where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or have a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g. version control and unit tests, provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames and we shall illustrate its use by automatizing the comparison step of the statistical proof-calculation ping-pong process.

    Case Story

    Ada and Bob have to proof-calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. For this post we will ignore the complex survey part of the problem and just pretend that our sample corresponds to the population (complete inventory count). After the questionnaire phase, the following data are available to Ada and Bob.

Id  Version  Region  Sales Volume  Staff Costs  People
01  1        A        23000         10003        5
02  1        B        12200          7200        1
03  1        NA       19500          7609        2
04  1        A        17500         13000        3
05  1        B       119500         90000        NA
05  2        B       119500         95691        19
06  1        B           NA         19990        6
07  1        A        19123         20100        8
08  1        D        25000           100        NA
09  1        D        45860         32555        9
10  1        E        33020         25010        7

    Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

    • the overall number of fitness centres
    • total sales volume
    • total staff cost
    • total number of people employed in fitness centres

    However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

    Ada

    Ada loves the tidyverse and in particular dplyr. This is her solution:

    ada <- fitness %>%
      na.omit() %>%
      group_by(Region, Id) %>%
      top_n(1, Version) %>%
      group_by(Region) %>%
      filter(`Sales Volume` >= 17500) %>%
      summarise(`NoOfUnits` = n(),
                `Sales Volume` = sum(`Sales Volume`),
                `Staff Costs` = sum(`Staff Costs`),
                People = sum(People))
    ada
    ## # A tibble: 4 x 5
    ##   Region NoOfUnits `Sales Volume` `Staff Costs` People
    ## 1 A              3          59623         43103     16
    ## 2 B              1         119500         95691     19
    ## 3 D              1          45860         32555      9
    ## 4 E              1          33020         25010      7

    Bob

    Bob has a dark past as a database developer and hence only recently experienced the joys of R. He therefore chooses an SQLite-within-R approach, so no SQL server is necessary:

    library(RSQLite)

    ## Create ad-hoc database
    db <- dbConnect(SQLite(), dbname = file.path(filePath, "Test.sqlite"))
    ## Move fitness data into the ad-hoc DB
    dbWriteTable(conn = db, name = "FITNESS", fitness, overwrite = TRUE, row.names = FALSE)
    ## Query using SQL
    bob <- dbGetQuery(db, "
      SELECT Region,
             COUNT(*) As NoOfUnits,
             SUM([Sales Volume]) As [Sales Volume],
             SUM([Staff Costs]) AS [Staff Costs],
             SUM(People) AS People
      FROM FITNESS
      WHERE [Sales Volume] > 17500
      GROUP BY Region
    ")
    bob
    ##   Region NoOfUnits Sales Volume Staff Costs People
    ## 1   <NA>         1        19500        7609      2
    ## 2      A         2        42123       30103     13
    ## 3      B         2       239000      185691     19
    ## 4      D         2        70860       32655      9
    ## 5      E         1        33020       25010      7

    The Ping-Pong phase

    After Ada and Bob each have a result, they compare the resulting data.frames using the daff package, which was recently presented by @edwindjonge at the useR! conference in Brussels.

    library(daff)
    diff <- diff_data(ada, bob)
    diff$get_data()
    ##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
    ## 1 +++   <NA>         1          19500          7609      2
    ## 2  ->      A      3->2   59623->42123  43103->30103 16->13
    ## 3  ->      B      1->2 119500->239000 95691->185691     19
    ## 4  ->      D      1->2   45860->70860  32555->32655      9
    ## 5          E         1          33020         25010      7

    After Ada’s and Bob’s first serves, the two realize that their results agree for only one region (‘E’). Note: daff currently has the semi-annoying feature of not showing all diffs when printing, but only n lines of the head and tail. For the purpose of this post we therefore overwrite the printing function so that it always shows all rows with differences.

    print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")
    print(diff)
    ##    @@ Region NoOfUnits   Sales Volume   Staff Costs People
    ## 1 +++   <NA>         1          19500          7609      2
    ## 2  ->      A      3->2   59623->42123  43103->30103 16->13
    ## 3  ->      B      1->2 119500->239000 95691->185691     19
    ## 4  ->      D      1->2   45860->70860  32555->32655      9

    The two decide to first focus on agreeing on the number of units per region.

    ##    @@ Region NoOfUnits
    ## 1 +++   <NA>         1
    ## 2  ->      A      3->2
    ## 3  ->      B      1->2
    ## 4  ->      D      1->2
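    The post does not show the code that produced this column-restricted view; one way to obtain it from the diff object (a sketch using dplyr, analogous to the print override above) is:

    library(dplyr)
    diff$get_data() %>%
      filter(`@@` != "") %>%            # keep only rows that differ
      select(`@@`, Region, NoOfUnits)   # restrict the view to the unit counts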

    One obvious reason for the discrepancies is that Bob has results for an extra region. Bob therefore takes another look at how he handles missing values and decides to improve his code by excluding records with a missing Region:

    Pong Bob

    bob2 <- dbGetQuery(db, "
      SELECT Region,
             COUNT(*) As NoOfUnits,
             SUM([Sales Volume]) As [Sales Volume],
             SUM([Staff Costs]) AS [Staff Costs],
             SUM(People) AS People
      FROM FITNESS
      WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
      GROUP BY Region
    ")
    diff2 <- diff_data(ada, bob2, ordered = FALSE, count_like_a_spreadsheet = FALSE)
    diff2 %>% print()
    ##   @@ Region NoOfUnits   Sales Volume   Staff Costs People
    ## 1 ->      A      3->2   59623->42123  43103->30103 16->13
    ## 2 ->      B      1->2 119500->239000 95691->185691     19
    ## 3 ->      D      1->2   45860->70860  32555->32655      9

    Ping Bob

    Better. Now the NA region is gone, but quite a few differences still remain. Note: You may want to stop reading at this point and try to fix the analysis yourself – the data and code are available from the github repository.

    Pong Bob

    Now Bob notices that he forgot to handle the duplicate questionnaire records (and that the protocol asks for a sales volume of at least 17500, i.e. >= rather than >) and massages the SQL query as follows:

    bob3 <- dbGetQuery(db, "
      SELECT Region,
             COUNT(*) As NoOfUnits,
             SUM([Sales Volume]) As [Sales Volume],
             SUM([Staff Costs]) AS [Staff Costs],
             SUM(People) AS People
      FROM (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People
            FROM FITNESS
            GROUP BY Id)
      WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
      GROUP BY Region
    ")
    diff3 <- diff_data(ada, bob3, ordered = FALSE, count_like_a_spreadsheet = FALSE)
    diff3 %>% print()
    ##   @@ Region NoOfUnits Sales Volume  Staff Costs People
    ## 1 ...    ...       ...          ...          ...    ...
    ## 2  ->      D      1->2 45860->70860 32555->32655      9

    Ping Ada

    Comparing with Ada, Bob is sort of envious that she was able to just use dplyr‘s group_by and top_n functions. However, daff shows that there is still one difference left. Looking more carefully at Ada’s code, it becomes clear that she accidentally leaves out one unit in region D. The reason is the too liberal use of na.omit, which also removes the one entry with an NA in one of the less important columns. They therefore discuss whether one really wants to include such partial records, because the summation in the different columns is then over a different number of units. After consulting the standard operating procedure (SOP) for this kind of survey, they decide to include the observation where possible. Here is Ada’s modified code:

    ada2 <- fitness %>%
      filter(!is.na(Region)) %>%
      group_by(Region, Id) %>%
      top_n(1, Version) %>%
      group_by(Region) %>%
      filter(`Sales Volume` >= 17500) %>%
      summarise(`NoOfUnits` = n(),
                `Sales Volume` = sum(`Sales Volume`),
                `Staff Costs` = sum(`Staff Costs`),
                People = sum(People))
    (diff_final <- diff_data(ada2, bob3)) %>% print()
    ##   @@ Region NoOfUnits ... Staff Costs People
    ## 1 ...    ...       ... ...         ...    ...
    ## 2  ->      D         2 ...       32655  NA->9

    Pong Ada

    Oops, forgot to take care of the NA in the summation:

    ada3 <- fitness %>%
      filter(!is.na(Region)) %>%
      group_by(Region, Id) %>%
      top_n(1, Version) %>%
      group_by(Region) %>%
      filter(`Sales Volume` >= 17500) %>%
      summarise(`NoOfUnits` = n(),
                `Sales Volume` = sum(`Sales Volume`),
                `Staff Costs` = sum(`Staff Costs`),
                People = sum(People, na.rm = TRUE))
    diff_final <- diff_data(ada3, bob3)
    length(diff_final$get_data()) == 0
    ## [1] TRUE

    Conclusion

    Finally, their results agree and they move the analysis to production; the results are published in a nice report.

    Question 1: Do you agree with their results?

    ada3
    ## # A tibble: 4 x 5
    ##   Region NoOfUnits `Sales Volume` `Staff Costs` People
    ## 1 A              3          59623         43103     16
    ## 2 B              1         119500         95691     19
    ## 3 D              2          70860         32655      9
    ## 4 E              1          33020         25010      7

    As shown, the ping-pong game is quite manual and becomes particularly annoying if, at some point, someone steps into the office with a statement like “Btw, I found some extra questionnaires, which need to be added to the analysis asap.” However, the two now-aligned analysis scripts and the corresponding daff comparison could be put into a script that is triggered every time the data change. In case new discrepancies emerge, i.e. length(diff$get_data()) > 0, the two could then be informed automatically.
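    A minimal sketch of what such a trigger script could look like (not from the original post; run_ada() and run_bob() are hypothetical wrappers around the two aligned analysis scripts):

    library(daff)

    check_agreement <- function(fitness) {
      # Re-run both independent analyses on the current data (hypothetical wrappers).
      ada <- run_ada(fitness)
      bob <- run_bob(fitness)
      d <- diff_data(ada, bob)
      if (length(d$get_data()) > 0) {
        # New discrepancies found: show them and notify both analysts,
        # e.g. by e-mail or a chat message.
        message("Proof-calculation mismatch detected:")
        print(d$get_data())
      }
      invisible(d)
    }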

    Question 2: Are you aware of other good ways and tools to structure and automate such a process? If so, please share your experiences as a Disqus comment below.

