R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Unifying packages to help users access R repositories

Sun, 09/03/2017 - 15:16

(This article was first published on Blog: John C. Nash, and kindly contributed to R-bloggers)

With over 11,000 packages in the main CRAN (Comprehensive R Archive Network) collection, users of R face a challenge when looking for tools to carry out their statistical and computational tasks, especially in subject areas outside their main domain of expertise. Moreover, a lot of packages have similar or overlapping features. Some are, to be kind, not very well written, use methods which are outdated, or are simply inappropriate to a given situation.

In an attempt to address this, I joined with Julia Silge and Spencer Graves to organize a special session “Navigating the R package universe” at the UseR2017 meeting in Brussels. A lot of other people contributed to the session — see https://github.com/nashjc/Rnavpkg, and especially the wiki portion, https://github.com/nashjc/Rnavpkg/wiki. The presentation materials are there, along with links to various follow-on material such as notes and this and other blog posts. In particular, there is a general follow-up blog post at https://juliasilge.com/blog/navigating-packages/.

There are lots of ideas about how to help users find the right package, but we found that we could consider them as generally falling into three categories:

  • unification of packages that have similar goals or features (of which more in this posting)
  • guidance in the form of CRAN Task Views and similar infrastructure (see Julia’s posting https://juliasilge.com/blog/package-guidance/)
  • search tools for R packages (pending posting by Spencer Graves)

There are overlaps in these categories, and we are certain some ideas have been left out. Let us know!

Approaches to unification of packages

In the breakout session on unification at UseR2017, many ideas were suggested, sometimes in rapid-fire fashion. Alice Daish kindly took notes on her cell phone, but I will apologize in advance if I have missed anything important.

My own view of unification with regard to the package universe is to reduce the effective span of the number of “things” the user must consider. If the rule of thumb for presentations is to have no more than (fill in your favourite single digit integer) bullets per screen, surely a user should not have to face thousands of choices? Following this objective, we can consider:

  • developing wrappers that provide a common syntax or interface to multiple packages
  • finding ways to identify packages that should not generally be used so they can be archived out of the immediate or first-level view of users who are searching for tools
  • promoting collaboration between developers of similar packages with the aim of consolidating those packages
  • encouraging those working on guidance and search to provide filtering of their output to reduce the choice burden on users.

None of these approaches can be considered “easy”.

Wrappers

My own efforts at unifying packages have been in the field of optimization, particularly function minimization. With Ravi Varadhan, I developed the optimx package to allow a common interface to a number of base and package tools. It is an indicator of the difficulty of doing this well that I later refactored optimx to optimr to allow for easier maintenance of the package. Note that because these packages call other packages, failures in the called functions may give rise to error messages in the unifying package. To avoid too many CRAN alerts, optimrx can be found on R-forge in the optimizer project https://r-forge.r-project.org/projects/optimizer/ and can call more solvers than optimr. I believe that optimrx is a worthwhile step towards simplifying the user’s task in carrying out function minimization where some of the parameters can also be bounded or temporarily fixed. A similar wrapper for global optimizers, gloptim, has been written by Hans Werner Borchers; so far it is only on R-forge in the optimizer project. As I mentioned, writing these codes takes a lot of effort. They are never truly finished because new solvers become available from time to time.
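To illustrate the common-interface idea, here is a minimal sketch (the base optim() calls are standard; the optimx() call with a vector of methods reflects my understanding of that package's interface and should be checked against its documentation):

# Rosenbrock function, a classic test problem for function minimization
rosen <- function(x) (1 - x[1])^2 + 100 * (x[2] - x[1]^2)^2

# Base R: one method per call
optim(c(-1.2, 1), rosen, method = "Nelder-Mead")
optim(c(-1.2, 1), rosen, method = "BFGS")

# optimx: one call can run several methods and tabulate the results side by side
library(optimx)
optimx(c(-1.2, 1), rosen, method = c("Nelder-Mead", "BFGS", "CG"))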

A sample of some other packages which unify a number of methods:

  • mlr: Machine Learning in R
  • caret: Classification And REgression Training for developing predictive models
  • Matrix: Matrix methods for dense and sparse matrices (What would we do without them? See the short example after this list.)
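For instance, a small illustration (not from the original post) of how Matrix lets dense and sparse storage share one interface:

library(Matrix)

# The same constructor can produce dense or sparse storage
d <- Matrix(c(0, 0, 2, 0, 1, 0), nrow = 2, sparse = FALSE)  # dense
s <- Matrix(c(0, 0, 2, 0, 1, 0), nrow = 2, sparse = TRUE)   # sparse

# Common operations work the same way on both representations
d %*% t(d)
s %*% t(s)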

We could also consider the nls() function in the stats package that is distributed with base R to unify several methods for nonlinear least squares. Unfortunately, the default method and derivative approximation are quite often mis-applied by non-experts (and even some experts!). For the Marquardt stabilization, users must look to packages like minpack.lm and nlsr. Duncan Murdoch and I wrote the latter package to allow for analytic derivatives of expressions in nonlinear modelling, but it would be nice to be able to use them in conjunction with separable models (essentially those with a linear structure plus some nonlinear terms). While nlsr has a wrapnls() function to call nls(), we don’t yet have a way to access minpack.lm or some other packages with relevant features. There is always more work to do!
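A hedged sketch of the three interfaces on a toy exponential-decay model (nlsLM() from minpack.lm and nlxb() from nlsr are named from memory; both packages are assumed installed, and their documentation is authoritative):

# Toy data for the model y ~ a * exp(-b * x)
set.seed(123)
x <- 1:25
y <- 2 * exp(-0.1 * x) + rnorm(25, sd = 0.05)
dat <- data.frame(x = x, y = y)

# Base R nls(): Gauss-Newton by default
fit_nls <- nls(y ~ a * exp(-b * x), data = dat, start = c(a = 1.5, b = 0.2))

# minpack.lm: Levenberg-Marquardt stabilization, same formula interface
fit_lm <- minpack.lm::nlsLM(y ~ a * exp(-b * x), data = dat, start = c(a = 1.5, b = 0.2))

# nlsr: analytic derivatives of the model expression
fit_nlxb <- nlsr::nlxb(y ~ a * exp(-b * x), data = dat, start = c(a = 1.5, b = 0.2))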

Archiving

During the breakout session on unification, there were a lot of comments about packages that should be merged, as well as a few about packages that should be abandoned. A subtext of the discussion that is relevant to R in general is that younger workers were quite clear that they felt they could not voice their opinions openly, because some packages had been written by members of well-established groups who might provide employment opportunities or other perks. It was noted that two of the session organizers (JN and SG) are retired and the third (JS) does not work in academia, so we are relatively protected from backlash to honest commentary. Here I genuinely mean honest and not capriciously critical. We must recognize that many packages have been important in the life of R but now need renewal or replacement. I have been trying to get my own contributions to the optim() function deprecated for some time! However, there are workers who, I am sure, will for a variety of reasons be unwilling to discuss ways to reorganize the package repositories.

Ultimately, the users will vote with their feet (or rather mouse/touchpad). However, there will be a lot of wasted effort that does them and R no good whatever.

Filtered presentation

The CRAN repository is, at present, offered to users as a list, alphabetically or by date, on the different mirrors (e.g., https://cloud.r-project.org/). While a way to see every package available is necessary, it is far from friendly to users, either novices or experts. Clearly it is feasible to restructure the list(s) for different audiences. Moreover, this restructuring or filtering need not be on the CRAN sites, but could be delegated to sites dealing with particular applications or methods. Here is a prime area of overlap with the guidance and search ideas for navigating the packages. In particular, it will be helpful if automated tools are developed that do the lion’s share of the work for this.

Another approach to filtering is to present R tools within a different framework. jamovi (https://www.jamovi.org/) uses a spreadsheet paradigm. The tidyverse (https://www.tidyverse.org/) collects tools that deal with data science and adds a streamlined command set. Unfortunately, there is a learning cost to using new frameworks like these. They simplify the use of R in one domain, but add to the general collection of packages and tools.

Directions

Unification of R packages with the aim of reducing user costs in finding the right tools is never going to be trivial. Moreover, it is not as directly rewarding as writing an (apparently) new package or method, and getting a paper published about it. On the other hand, making R infrastructure more accessible to users is key to the continued vitality of the overall resource and community.

John C. Nash     2017-9-3


To leave a comment for the author, please follow the link and comment on their blog: Blog: John C. Nash.

Permutation Theory In Action

Sat, 09/02/2017 - 19:38

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things more like R users expect, we need a way to translate one order to another.

Providing good solutions to gaps like this is one of the things Win-Vector LLC does, both in our consulting and training practices.

Let’s take a look at an example. Suppose our two orderings are o1 (the ordering Spark ML chooses) and o2 (the order the R user chooses).

set.seed(326346)
symbols <- letters[1:7]

o1 <- sample(symbols, length(symbols), replace = FALSE)
o1
## [1] "e" "a" "b" "f" "d" "c" "g"

o2 <- sample(symbols, length(symbols), replace = FALSE)
o2
## [1] "d" "g" "f" "e" "b" "c" "a"

To translate Spark results into R results we need a permutation that takes o1 to o2. The idea is: if we had a permutation that takes o1 to o2, we could use it to re-map predictions that are in o1 order to be predictions in o2 order.

To solve this we crack open our article on the algebra of permutations.

We are going to use the fact that the R command base::order(x) builds a permutation p such that x[p] is in order.

Given this the solution is: we find permutations p1 and p2 such that o1[p1] is ordered and o2[p2] is ordered. Then build a permutation perm such that o1[perm] = (o1[p1])[inverse_permutation(p2)]. I.e., to get from o1 to o2 move o1 to sorted order and then move from the sorted order to o2‘s order (by using the reverse of the process that sorts o2). Again, the tools to solve this are in our article on the relation between permutations and indexing.

Below is the complete solution (including combining the two steps into a single permutation):

p1 <- order(o1)
p2 <- order(o2)

# invert p2
# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
p2inv <- seq_len(length(p2))
p2inv[p2] <- seq_len(length(p2))

(o1[p1])[p2inv]
## [1] "d" "g" "f" "e" "b" "c" "a"

# composition rule: (o1[p1])[p2inv] == o1[p1[p2inv]]
# see: http://www.win-vector.com/blog/2017/05/on-indexing-operators-and-composition/
perm <- p1[p2inv]
o1[perm]
## [1] "d" "g" "f" "e" "b" "c" "a"

The equivalence "(o1[p1])[p2inv] == o1[p1[p2inv]]" is frankly magic (though it also quickly follows "by definition"), and studying it is the topic of our original article on permutations.
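As an aside (not from the original post), base R's match() builds the same permutation in one step, since match(o2, o1) returns the position in o1 of each element of o2:

# match() gives, for each element of o2, its index in o1
perm2 <- match(o2, o1)
identical(o1[perm2], o2)   # TRUE: o1 re-ordered by perm2 reproduces o2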

The above application is a good example of why it is nice to have a little theory worked out, even before you think you need it.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Pair Programming Statistical Analyses

Sat, 09/02/2017 - 00:00

(This article was first published on Theory meets practice..., and kindly contributed to R-bloggers)

Abstract

Control calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results and illustrate its use by a case study with two statisticians ping-ponging an analysis using dplyr and SQL, respectively.



Clip art is based on work by Gan Khoon Lay available under a CC BY 3.0 US license.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog post is available under the GNU General Public License (GPL v3).

Introduction

If you are a statistician working in climate science, data driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses, where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or have a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g. version control and unit tests, provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames and we shall illustrate its use by automatizing the comparison step of the control calculation ping-pong process.

The Case Story

Ada and Bob have to calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. The present exposition will for the sake of convenience ignore the complex survey part of the problem and just pretend that the sample corresponds to the population (complete inventory count).

The Data

After the questionnaire phase, the following data are available to Ada and Bob.

Id  Version  Region  Sales Volume  Staff Costs  People
01  1        A              23000        10003       5
02  1        B              12200         7200       1
03  1        NA             19500         7609       2
04  1        A              17500        13000       3
05  1        B             119500        90000      NA
05  2        B             119500        95691      19
06  1        B                 NA        19990       6
07  1        A              19123        20100       8
08  1        D              25000          100      NA
09  1        D              45860        32555       9
10  1        E              33020        25010       7

Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

  • the overall number of fitness centres
  • total sales volume
  • total staff cost
  • total number of people employed in fitness centres

However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

Ada

Ada loves the tidyverse and in particular dplyr. This is her solution:

ada <- fitness %>%
  na.omit() %>%
  group_by(Region,Id) %>%
  top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People))
ada
## # A tibble: 4 x 5
## Region NoOfUnits `Sales Volume` `Staff Costs` People
##
## 1 A 3 59623 43103 16
## 2 B 1 119500 95691 19
## 3 D 1 45860 32555 9
## 4 E 1 33020 25010 7

Bob

Bob has a dark past as a relational database management system (RDBMS) developer and, hence, only recently experienced the joys of R. He therefore chooses a no-SQL-server-necessary SQLite-within-R approach. The hope is that in a big data situation this might be a little speedier than base R:

library(RSQLite)

## Create ad-hoc database
db <- dbConnect(SQLite(), dbname = file.path(filePath,"Test.sqlite"))

##Move fitness data into the ad-hoc DB
dbWriteTable(conn = db, name = "fitness", fitness, overwrite=TRUE, row.names=FALSE)

##Query using SQL
bob <- dbGetQuery(db, "
  SELECT Region, COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM fitness
  WHERE [Sales Volume] > 17500
  GROUP BY Region
")
bob
## Region NoOfUnits Sales Volume Staff Costs People
## 1 1 19500 7609 2
## 2 A 2 42123 30103 13
## 3 B 2 239000 185691 19
## 4 D 2 70860 32655 9
## 5 E 1 33020 25010 7

Update: An alternative approach with less syntactic overhead would have been the sqldf package, which has a standard SQLite backend and automagically handles the import of the data.frame into the DB using the RSQLite pkg.

##Load package
suppressPackageStartupMessages(library(sqldf))

##Ensure SQLite backend.
options(sqldf.driver = "SQLite")

##Same query as before
bob <- sqldf("
  SELECT Region, COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM fitness
  WHERE [Sales Volume] > 17500
  GROUP BY Region
")

Even shorter is the direct use of SQL chunks in knitr, using the variable db as the connection and the chunk option output.var="bob":

SELECT Region, COUNT(*) As NoOfUnits,
       SUM([Sales Volume]) As [Sales Volume],
       SUM([Staff Costs]) AS [Staff Costs],
       SUM(People) AS People
FROM fitness
WHERE [Sales Volume] > 17500
GROUP BY Region

The Ping-Pong Phase

After Ada and Bob each have a result, they compare their resulting data.frames using the daff package, which was recently presented by @edwindjonge at the useR! conference in Brussels.

library(daff)
diff <- diff_data(ada, bob)
diff$get_data()
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 +++ 1 19500 7609 2
## 2 -> A 3->2 59623->42123 43103->30103 16->13
## 3 -> B 1->2 119500->239000 95691->185691 19
## 4 -> D 1->2 45860->70860 32555->32655 9
## 5 E 1 33020 25010 7

After Ada’s and Bob’s serve, the two realize that their results agree only for region E.

Note: Currently, daff has the semi-annoying feature of not being able to show all the diffs when printing, but just n lines of the head and tail. As a consequence, for the purpose of this post, we overwrite the printing function such that it always shows all rows with differences.

##Small helper function for better printing
print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")

diff %>% print()
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 +++ 1 19500 7609 2
## 2 -> A 3->2 59623->42123 43103->30103 16->13
## 3 -> B 1->2 119500->239000 95691->185691 19
## 4 -> D 1->2 45860->70860 32555->32655 9

The two decide to first agree on the number of units per region.

diff$get_data() %>%
  filter(`@@` != "") %>%
  select(`@@`, Region, NoOfUnits)
## @@ Region NoOfUnits
## 1 +++ 1
## 2 -> A 3->2
## 3 -> B 1->2
## 4 -> D 1->2

One obvious reason for the discrepancies appears to be that Bob has results for an extra region. Therefore, Bob takes another look at his management of missing values and decides to improve his code by:

Pong Bob

SELECT Region, COUNT(*) As NoOfUnits,
       SUM([Sales Volume]) As [Sales Volume],
       SUM([Staff Costs]) AS [Staff Costs],
       SUM(People) AS People
FROM fitness
WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
GROUP BY Region

## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 -> A 3->2 59623->42123 43103->30103 16->13
## 2 -> B 1->2 119500->239000 95691->185691 19
## 3 -> D 1->2 45860->70860 32555->32655 9

Ping Bob

Better. Now the NA region is gone, but still quite some differences remain. Note: You may at this point want to stop reading and try yourself to fix the analysis – the data and code are available from the github repository.

Pong Bob

Now Bob notices that he forgot to handle the duplicate records and apparently misunderstood the exact definition of the €17,500 exclusion limit. His massaged SQL query looks as follows:

SELECT Region, COUNT(*) As NoOfUnits,
       SUM([Sales Volume]) As [Sales Volume],
       SUM([Staff Costs]) AS [Staff Costs],
       SUM(People) AS People
FROM (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People
      FROM fitness
      GROUP BY Id)
WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
GROUP BY Region

## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 ... ... ... ... ... ...
## 2 -> D 1->2 45860->70860 32555->32655 9

Ping Ada

Comparing with Ada, Bob is sort of envious that she was able to just use dplyr's group_by and top_n functions. However, daff shows that there still is one difference left. By looking more carefully at Ada’s code it becomes clear that she accidentally leaves out one unit in region D. The reason is the too liberal use of na.omit, which also removes the one entry with an NA in one of the not-so-important columns. However, they discuss whether one really wants to include partial records or not, because summation in the different columns is then over a different number of units. After consulting the standard operating procedure (SOP) for these kinds of surveys, they decide to include the observation where possible. Here is Ada’s modified code:

ada2 <- fitness %>%
  filter(!is.na(Region)) %>%
  group_by(Region,Id) %>%
  top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People))

(diff_final <- diff_data(ada2,bob3)) %>% print()
## @@ Region NoOfUnits ... Staff Costs People
## 1 ... ... ... ... ... ...
## 2 -> D 2 ... 32655 NA->9

Pong Ada

Oops, Ada forgot to take care of the NA in the summation:

ada3 <- fitness %>%
  filter(!is.na(Region)) %>%
  group_by(Region,Id) %>%
  top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People,na.rm=TRUE))

diff_final <- diff_data(ada3,bob3)

##Check if the results really are the same
length(diff_final$get_data()) == 0
## [1] TRUE

Finally, their results agree and they move on to production and their results are published in a nice report.

Conclusion

As shown, the ping-pong game is quite manual and particularly annoying if at some point someone steps into the office with a statement like "Btw, I found some extra questionnaires, which need to be added to the analysis asap." However, the two now-aligned analysis scripts and the corresponding daff overlay could be put into a script which is triggered every time the data change. In case new discrepancies emerge, detected as length(diff$get_data()) > 0, the two could then be automatically informed.
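A minimal sketch of such an automated check (the .rds file names are hypothetical; it assumes each analysis script saves its aggregated result):

library(daff)

# Hypothetical: Ada's and Bob's scripts each write their result to an .rds file
ada_result <- readRDS("ada_result.rds")
bob_result <- readRDS("bob_result.rds")

diff <- diff_data(ada_result, bob_result)

# Alert if any discrepancies emerge after the data change
if (length(diff$get_data()) > 0) {
  print(diff$get_data())
  stop("Ada's and Bob's results no longer agree -- please investigate.")
} else {
  message("Results agree.")
}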

Question 1: Now the two get the same results, do you agree with them?

## # A tibble: 4 x 5
## Region NoOfUnits `Sales Volume` `Staff Costs` People
##
## 1 A 3 59623 43103 16
## 2 B 1 119500 95691 19
## 3 D 2 70860 32655 9
## 4 E 1 33020 25010 7

Question 2: Are you aware of any other good ways and tools to structure and automatize such a process? If so, please share your experiences as a Disqus comment below. Control calculations appear quite common, but little structured code support appears to be available for such processes.



Photo is copyright Johnathan J. Stegeman under the GNU Free Documentation License, version 1.2.


To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice....

Practical Data Science for Stats

Fri, 09/01/2017 - 19:05

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

PeerJ Preprints has recently published a collection of articles that focus on the practical side of statistical analysis: Practical Data Science for Stats. While the articles are not peer-reviewed, they have been selected and edited by Jennifer Bryan and Hadley Wickham, both well-respected members of the R community. And while the articles provide great advice for any data scientist, the content does heavily feature the use of R, so it's particularly useful to R users.

There are 16 articles in the collection (with possibly more to come?). Here are just a few examples that caught my eye:

There's lots more to explore in the collection as well, including case studies on using R at the likes of AirBnB and the New York Mets. Check out the entire collection at the link below.

PeerJ Collections: Practical Data Science for Stats (via Jenny Bryan)


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Summarizing dataset to apply machine learning – exercises

Fri, 09/01/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

INTRODUCTION

Dear reader,

If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.

This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.

Before proceeding, please follow our short tutorial.

Look at the examples given and try to understand the logic behind them. Then try to solve the exercises below using R, without looking at the answers. Finally, check the solutions to verify your answers.
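Before starting, a minimal setup sketch (it assumes the caret package, which provides createDataPartition(), is installed):

# Load caret (for createDataPartition and friends) and the built-in iris data
library(caret)
data(iris)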

Exercise 1

Create a list of 80% of the rows in the original dataset to use for training. HINT: Use createDataPartition().

Exercise 2

Select 20% of the data for validation.

Exercise 3

Use the remaining 80% of data to train and test the models.

Exercise 4

Find the dimensions of the “iris” dataset. HINT: Use dim().

Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to:

  • Create a machine learning algorithm from a beginner point of view
  • Quickly dive into more advanced methods in an accessible pace and with more explanations
  • And much more

This course shows a complete workflow from start to finish. It is a great introduction, and a useful fallback once you have some experience.

Exercise 5

Find the type of each attribute in your dataset. HINT: Use sapply().

Exercise 6

Take a look at the first 5 rows of your dataset. HINT: Use head().

Exercise 7

Find the levels of the variable “Species.” HINT: Use levels().

Exercise 8

Find the percentages of rows that belong to the labels you found in Exercise 7. HINT: Use prop.table() and table().

Exercise 9

Display the absolute count of instances for each class as well as its percentage. HINT: Use cbind().

Exercise 10

Display the summary of the “iris” dataset. HINT: Use summary().

Related exercise sets:
  1. How to prepare and apply machine learning to your dataset
  2. Building Shiny App exercises part 4
  3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-5)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

How rOpenSci uses Code Review to Promote Reproducible Science

Fri, 09/01/2017 - 09:00

(This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

This is cross-posted from the NumFOCUS blog. NumFOCUS is a nonprofit that supports and promotes world-class, innovative, open source scientific computing, including rOpenSci.


At rOpenSci, we create and curate software to help scientists with the data life cycle. These tools access, download, manage, and archive scientific data in open, reproducible ways. Early on, we realized this could only be a community effort. The variety of scientific data and workflows could only be tackled by drawing on contributions of scientists with field-specific expertise.

With the community approach came challenges. How could we ensure the quality of code written by scientists without formal training in software development practices? How could we drive adoption of best practices among our contributors? How could we create a community that would support each other in this work?

We have had great success addressing these challenges via peer review. We draw elements from a process familiar to our target community – academic peer review – and a practice from the software development world – production code review – to create a system that fosters software quality, ongoing education, and community development.

An Open Review Process

Production software review occurs within software development teams, open source or not. Contributions to a software project are reviewed by one or more other team members before incorporation into project source code. Contributions are typically small patches, and review serves as a check on quality, as well as an opportunity for training in team standards.

In academic peer review, external reviewers critique a complete product – usually a manuscript – with a very broad mandate to address any areas they see as deficient. Academic review is often anonymous and passing through it gives the product, and the author, a public mark of validation.

We blend these approaches. In our process, authors submit complete R packages to rOpenSci. Editors check that packages fit into our project’s scope, run a series of automated tests to ensure a baseline of code quality and completeness, and then assign two independent reviewers. Reviewers comment on usability, quality, and style of software code as well as documentation. Authors make changes in response, and once reviewers are satisfied with the updates, the package receives a badge of approval and joins our suite.

This process is quite iterative. After reviewers post a first round of extensive reviews, authors and reviewers chat in an informal back-and-forth, only lightly moderated by an editor. This lets both reviewers and authors pose questions of each other and explain differences of opinion. It can proceed at a much faster pace than typical academic review correspondence. We use the GitHub issues system for this conversation, and responses take minutes to days, rather than weeks to months.

The exchange is also open and public. Authors, reviewers, and editors all know each other’s identities. The broader community can view or even participate in the conversation as it happens. This provides an incentive to be thorough and provide non-adversarial, constructive reviews. Both authors and reviewers report that they enjoy and learn more from this open and direct exchange. It also has the benefit of building community. Participants have the opportunity to meaningfully network with new peers, and new collaborations have emerged via ideas spawned during the review process.

We are aware that open systems can have drawbacks. For instance, in traditional academic review, double-blind peer review can increase representation of female authors, suggesting bias in non-blind reviews. It is also possible reviewers are less critical in open review. However, we posit that the openness of the review conversation provides a check on review quality and bias; it’s harder to inject unsupported or subjective comments in public and without the cover of anonymity. Ultimately, we believe the ability of authors and reviewers to have direct but public communication improves quality and fairness.

Guidance and Standards

rOpenSci provides guidance on reviewing. This falls into two main categories: high-level best practices and low-level standards. High-level best practices are general and broadly applicable across languages and applications. These are practices such as “Write re-usable functions rather than repeating the same code,” “Test edge cases,” or “Write documentation for all of your functions.” Because of their general nature, these can be drawn from other sources and not developed from scratch. Our best practices are based on guidance originally developed by Mozilla Science Lab.

Low-level standards are specific to a language (in our case, R), applications (data interfaces) and user base (researchers). These include specific items such as naming conventions for functions, best choices of dependencies for certain tasks, and adherence to a code style guide. We have an extensive set of standards for our reviewers to check. These change over time as the R software ecosystem evolves, best practices change, and tooling and educational resources make new methods available to developers.

Our standards also change based on feedback from reviewers. We adopt into our standards suggestions that emerge from multiple reviewers across different packages. Many of these, we’ve found, have to do with the ease-of-use and consistency of software APIs, and the type and location of information in documentation that makes it easiest to find. This highlights one of the advantages of external reviewers – they can provide a fresh perspective on usability, as well as test software under different use-cases than imagined by the author.

As our standards have become more extensive, we have come to rely more on automated tools. The R ecosystem, like most languages, has a suite of tools for code linting, function testing, static code analysis and continuous integration. We require authors to use these, and editors run submissions through a suite of tests prior to sending them for review. This frees reviewers from the burden of low-level tasks to focus on high-level critiques where they can add the most value.
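For illustration, here is a hedged sketch of the kind of local checks a package author might run before submission (the exact tool set is an assumption, not rOpenSci's official checklist):

# R CMD check, unit tests and test coverage via devtools/testthat/covr
devtools::check()
devtools::test()
covr::package_coverage()

# Static style checks
lintr::lint_package()

# A bundle of common best-practice checks
goodpractice::gp()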

The Reviewer Community

One of the core challenges and rewards of our work has been developing a community of reviewers.

Reviewing is a high-skill activity. Reviewers need expertise in the programming methods used in a software package and also the scientific field of its application. (“Find me someone who knows sensory ecology and sparse data structures!”) They need good communications skills and the time and willingness to volunteer. Thankfully, the open-science and open-source worlds are filled with generous, expert people. We have been able to expand our reviewer pool as the pace of submissions and the domains of their applications have grown.

Developing the reviewer pool requires constant recruitment. Our editors actively and broadly engage with developer and research communities to find new reviewers. We recruit from authors of previous submissions, co-workers and colleagues, at conferences, through our other collaborative work and on social media. In the open-source software ecosystem, one can often identify people with particular expertise by looking at their published software or contribution to other projects, and we often will cold-email potential reviewers whose published work suggests they would be a good match for a submission.

We cultivate our reviewer pool as well as expand it. We bring back reviewers so that they may develop reviewing as a skill, but not so often as to overburden them. We provide guidance and feedback to new recruits. When assigning reviewers to a submission, we aim to pair experienced reviewers with new ones, or reviewers with expertise on a package’s programming methods with those experienced in its field of application. These reviewers learn from each other, and diversity in perspectives is an advantage; less experienced developers often provide insight that more experienced ones do not on software usability, API design, and documentation. More experienced developers will more often identify inefficiencies in code, pitfalls due to edge-cases, or suggest alternate implementation approaches.

Expanding Peer Review for Code

Code review has been one of rOpenSci’s best initiatives. We build software, build skills, and build community, and the peer review process has been a major catalyst for all three. It has made both the software we develop internally and that we accept from outside contributors more reliable, usable, and maintainable. So we are working to promote open peer review of code by more organizations working with scientific software. We helped launch The Journal of Open Source Software, which uses a version of our review process to provide a developer-friendly publication venue. JOSS’s success has led to a spin-off, the Journal of Open Source Education, which uses an open, code-review-like process to provide feedback on curricula and educational materials. We are also developing a pilot program where software papers submitted to a partner academic journal receive a badge for going through rOpenSci review. We are encouraged by other review initiatives like ReScience and The Programming Historian. Bioconductor’s code reviews, which predate ours by several years, recently switched to an open model.

If your organization is developing or curating scientific code, we believe code review, implemented well, can be a great benefit. It can take considerable effort to begin, but here are some of the key lessons we’ve learned that can help:

  • Develop standards and guidelines for your authors and reviewers to use. Borrow these freely from other projects (feel free to use ours), and modify them iteratively as you go.
  • Use automated tools such as code linters, test suites, and test coverage measures to reduce burden on authors, reviewers, and editors as much as possible.
  • Have a clear scope. Spell out to yourselves and potential contributors what your project will accept, and why. This will save a lot of confusion and effort in the future.
  • Build a community through incentives of networking, opportunities to learn, and kindness.

rOpenSci is eager to work with other groups interested in developing similar review processes, especially if you are interested in reviewing and curating scientific software in languages other than R or beyond our scope of supporting the data life cycle. Software that implements statistical algorithms, for instance, is an area ripe for open peer review of code. Please get in touch if you have questions or wish to co-pilot a review system for your project.

And of course, if you want to review, we’re always looking for volunteers. Sign up at https://ropensci.org/onboarding.

You can support rOpenSci by becoming a NumFOCUS member or making a donation to the project.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

RcppAnnoy 0.0.9

Fri, 09/01/2017 - 03:20

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new version 0.0.9 of RcppAnnoy, our Rcpp-based R integration of the nifty Annoy library by Erik, is now on CRAN. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours.

This release corrects an issue for Windows users discovered by GitHub user ‘khoran’, who later also suggested the fix of using binary mode. It upgrades to Annoy release 1.9.1 and brings its new Manhattan distance to RcppAnnoy. A number of unit tests were added as well, and we updated some packaging internals such as symbol registration.
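A small usage sketch of the new Manhattan distance (this assumes the exposed module class is named AnnoyManhattan, analogous to the AnnoyEuclidean and AnnoyAngular classes; the package documentation is authoritative):

library(RcppAnnoy)

f <- 3                                  # number of dimensions
a <- new(AnnoyManhattan, f)             # assumed class name for the new metric

set.seed(42)
for (i in 0:99) a$addItem(i, runif(f))  # items are indexed from 0

a$build(10)                             # build a forest of 10 trees
a$getNNsByItem(0, 5)                    # 5 approximate nearest neighbours of item 0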

And I presume I had a good streak emailing with Uwe’s robots, as the package made it onto CRAN rather smoothly within ten minutes of submission.

 

Changes in this version are summarized here:

Changes in version 0.0.9 (2017-08-31)
  • Synchronized with Annoy upstream version 1.9.1

  • Minor updates in calls and tests as required by annoy 1.9.1

  • New Manhattan distance modules along with unit test code

  • Additional unit tests from upstream test code carried over

  • Binary mode is used for save (as suggested by @khoran in #21)

  • A new file init.c was added with calls to R_registerRoutines() and R_useDynamicSymbols()

  • Symbol registration is enabled in useDynLib

Courtesy of CRANberries, there is also a diffstat report for this release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

The Proof-Calculation Ping Pong

Fri, 09/01/2017 - 00:00
Abstract

The proof-calculation ping-pong is the process of iteratively improving a statistical analysis by comparing results from two independent analysis approaches until agreement. We use the daff package to simplify the comparison of the two results.



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog post is available under the GNU General Public License (GPL v3).

Introduction

If you are a statistician working in climate science, data driven journalism, official statistics, public health, economics or any related field working with real data, chances are that you have to perform analyses, where you know there is zero tolerance for errors. The easiest way to ensure the correctness of such an analysis is to check your results over and over again (the iterated 2-eye principle). A better approach is to pair-program the analysis by either having a colleague read through your code (the 4-eye principle) or have a skilled colleague completely redo your analysis from scratch using her favorite toolchain (the 2-mindset principle). Structured software development in the form of, e.g. version control and unit tests, provides valuable inspiration on how to ensure the quality of your code. However, when it comes to pair-programming analyses, surprisingly many steps remain manual. The daff package provides the equivalent of a diff statement on data frames and we shall illustrate its use by automatizing the comparison step of the statistical proof-calculation ping-pong process.

Case Story

Ada and Bob have to proof-calculate their country’s quadrennial official statistics on the total income and number of employed people in fitness centers. A sample of fitness centers is asked to fill out a questionnaire containing their yearly sales volume, staff costs and number of employees. For this post we will ignore the complex survey part of the problem and just pretend that our sample corresponds to the population (complete inventory count). After the questionnaire phase, the following data are available to Ada and Bob.

Id  Version  Region  Sales Volume  Staff Costs  People
01  1        A              23000        10003       5
02  1        B              12200         7200       1
03  1        NA             19500         7609       2
04  1        A              17500        13000       3
05  1        B             119500        90000      NA
05  2        B             119500        95691      19
06  1        B                 NA        19990       6
07  1        A              19123        20100       8
08  1        D              25000          100      NA
09  1        D              45860        32555       9
10  1        E              33020        25010       7

Here Id denotes the unique identifier of the sampled fitness center, Version indicates the version of a center’s questionnaire and Region denotes the region in which the center is located. In case a center’s questionnaire lacks information or has inconsistent information, the protocol is to get back to the center and have it send a revised questionnaire. All Ada and Bob now need to do is aggregate the data per region in order to obtain region stratified estimates of:

  • the overall number of fitness centres
  • total sales volume
  • total staff cost
  • total number of people employed in fitness centres

However, the analysis protocol instructs that only fitness centers with an annual sales volume larger than or equal to €17,500 are to be included in the analysis. Furthermore, if missing values occur, they are to be ignored in the summations. Since a lot of muscle will be angered in case of errors, Ada and Bob agree on following the 2-mindset procedure and meet after an hour to discuss their results. Here is what each of them came up with.

Ada

Ada loves the tidyverse and in particular dplyr. This is her solution:

ada <- fitness %>%
  na.omit() %>%
  group_by(Region,Id) %>%
  top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People))
ada
## # A tibble: 4 x 5
## Region NoOfUnits `Sales Volume` `Staff Costs` People
##
## 1 A 3 59623 43103 16
## 2 B 1 119500 95691 19
## 3 D 1 45860 32555 9
## 4 E 1 33020 25010 7

Bob

Bob has a dark past as a database developer and, hence, only recently experienced the joys of R. He therefore chooses a no-SQL-server-necessary SQLite-within-R approach:

library(RSQLite)

## Create ad-hoc database
db <- dbConnect(SQLite(), dbname = file.path(filePath,"Test.sqlite"))

##Move fitness data into the ad-hoc DB
dbWriteTable(conn = db, name = "FITNESS", fitness, overwrite=TRUE, row.names=FALSE)

##Query using SQL
bob <- dbGetQuery(db, "
  SELECT Region, COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM FITNESS
  WHERE [Sales Volume] > 17500
  GROUP BY Region
")
bob
## Region NoOfUnits Sales Volume Staff Costs People
## 1 1 19500 7609 2
## 2 A 2 42123 30103 13
## 3 B 2 239000 185691 19
## 4 D 2 70860 32655 9
## 5 E 1 33020 25010 7

The Ping-Pong phase

After Ada and Bob each have a result, they compare their resulting data.frames using the daff package, which was recently presented by @edwindjonge at the useR! conference in Brussels.

library(daff)
diff <- diff_data(ada, bob)
diff$get_data()
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 +++ 1 19500 7609 2
## 2 -> A 3->2 59623->42123 43103->30103 16->13
## 3 -> B 1->2 119500->239000 95691->185691 19
## 4 -> D 1->2 45860->70860 32555->32655 9
## 5 E 1 33020 25010 7

After Ada’s and Bob’s serve, the two realize that their results agree only for one region (‘E’). Note: Currently, daff has the semi-annoying feature of not being able to show all the diffs when printing, but just n lines of the head and tail. As a consequence, for the purpose of this post, we overwrite the printing function such that it always shows all rows with differences.

print.data_diff <- function(x) x$get_data() %>% filter(`@@` != "")

print(diff)
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 +++ 1 19500 7609 2
## 2 -> A 3->2 59623->42123 43103->30103 16->13
## 3 -> B 1->2 119500->239000 95691->185691 19
## 4 -> D 1->2 45860->70860 32555->32655 9

The two decide to first focus on agreeing on the number of units per region.

## @@ Region NoOfUnits
## 1 +++ 1
## 2 -> A 3->2
## 3 -> B 1->2
## 4 -> D 1->2

One obvious reason for the discrepancies appears to be that Bob has results for an extra region. Therefore, Bob takes another look at his management of missing values and decides to improve his code by:

Pong Bob

bob2 <- dbGetQuery(db, "
  SELECT Region, COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM FITNESS
  WHERE ([Sales Volume] > 17500 AND REGION IS NOT NULL)
  GROUP BY Region
")
diff2 <- diff_data(ada, bob2, ordered=FALSE, count_like_a_spreadsheet=FALSE)
diff2 %>% print()
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 -> A 3->2 59623->42123 43103->30103 16->13
## 2 -> B 1->2 119500->239000 95691->185691 19
## 3 -> D 1->2 45860->70860 32555->32655 9

Ping Bob

Better. Now the NA region is gone, but still quite some differences remain. Note: You may at this point want to stop reading and try yourself to fix the analysis – the data and code are available from the github repository.

Pong Bob

Now Bob notices that he forgot to handle the duplicate records and massages the SQL query as follows:

bob3 <- dbGetQuery(db, "
  SELECT Region, COUNT(*) As NoOfUnits,
         SUM([Sales Volume]) As [Sales Volume],
         SUM([Staff Costs]) AS [Staff Costs],
         SUM(People) AS People
  FROM (SELECT Id, MAX(Version), Region, [Sales Volume], [Staff Costs], People
        FROM FITNESS
        GROUP BY Id)
  WHERE ([Sales Volume] >= 17500 AND REGION IS NOT NULL)
  GROUP BY Region
")
diff3 <- diff_data(ada, bob3, ordered=FALSE, count_like_a_spreadsheet=FALSE)
diff3 %>% print()
## @@ Region NoOfUnits Sales Volume Staff Costs People
## 1 ... ... ... ... ... ...
## 2 -> D 1->2 45860->70860 32555->32655 9

Ping Ada

Comparing with Ada, Bob is sort of envious that she was able to just use dplyr's group_by and top_n functions. However, daff shows that there still is one difference left. By looking more carefully at Ada’s code it becomes clear that she accidentally leaves out one unit in region D. The reason is the too liberal use of na.omit, which also removes the one entry with an NA in one of the not-so-important columns. However, they discuss whether one really wants to include partial records or not, because summation in the different columns is then over a different number of units. After consulting the standard operating procedure (SOP) for these kinds of surveys, they decide to include the observation where possible. Here is Ada’s modified code:

ada2 <- fitness %>%
  filter(!is.na(Region)) %>%
  group_by(Region,Id) %>%
  top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People))

(diff_final <- diff_data(ada2,bob3)) %>% print()
## @@ Region NoOfUnits ... Staff Costs People
## 1 ... ... ... ... ... ...
## 2 -> D 2 ... 32655 NA->9

Pong Ada

Oops, forgot to take care of the NA in the summation:

ada3 <- fitness %>%
  filter(!is.na(Region)) %>%
  group_by(Region,Id) %>%
  top_n(1,Version) %>%
  group_by(Region) %>%
  filter(`Sales Volume` >= 17500) %>%
  summarise(`NoOfUnits`=n(),
            `Sales Volume`=sum(`Sales Volume`),
            `Staff Costs`=sum(`Staff Costs`),
            People=sum(People,na.rm=TRUE))

diff_final <- diff_data(ada3,bob3)
length(diff_final$get_data()) == 0
## [1] TRUE

Conclusion

Finally, their results agree and they move on to production and their results are published in a nice report.

Question 1: Do you agree with their results?

ada3
## # A tibble: 4 x 5
## Region NoOfUnits `Sales Volume` `Staff Costs` People
##
## 1 A 3 59623 43103 16
## 2 B 1 119500 95691 19
## 3 D 2 70860 32655 9
## 4 E 1 33020 25010 7

As shown, the ping-pong game is quite manual and particularly annoying if at some point someone steps into the office with a statement like "Btw, I found some extra questionnaires, which need to be added to the analysis asap." However, the two now-aligned analysis scripts and the corresponding daff overlay could be put into a script which is triggered every time the data change. In case new discrepancies emerge, detected as length(diff$get_data()) > 0, the two could then be automatically informed.

Question 2: Are you aware of any other good ways and tools to structure and automatize such a process? If so, please share your experiences as a Disqus comment below.


Multiplicative Congruential Generators in R

Thu, 08/31/2017 - 20:00

(This article was first published on R – Aaron Schlegel, and kindly contributed to R-bloggers)

Part 2 of 2 in the series Random Number Generation

Multiplicative congruential generators, also known as Lehmer random number generators, are a type of linear congruential generator for generating pseudorandom numbers in the unit interval (0, 1). The multiplicative congruential generator, often abbreviated as MLCG or MCG, is defined as a recurrence relation similar to the LCG with c = 0:

X_(n+1) = a * X_n mod m

Unlike the LCG, the parameters a and m for multiplicative congruential generators are more restricted, and the initial seed X_0 must be relatively prime to the modulus m (the greatest common divisor between X_0 and m is 1). The current parameters in common use are m = 2^31 - 1 = 2,147,483,647 and a = 7^5 = 16,807. However, in a correspondence from the Communications of the ACM, Park, Miller and Stockmeyer changed the value of the parameter a, stating:

The minimal standard Lehmer generator we advocated had a modulus of m = 2^31 – 1 and a multiplier of a = 16807. Relative to this particular choice of multiplier, we wrote “… if this paper were to be written again in a few years it is quite possible that we would advocate a different multiplier ….” We are now prepared to do so. That is, we now advocate a = 48271 and, indeed, have done so “officially” since July 1990. This new advocacy is consistent with the discussion on page 1198 of [10]. There is nothing wrong with 16807; we now believe, however, that 48271 is a little better (with q = 44488, r = 3399).

Multiplicative Congruential Generators with Schrage’s Method

When using a large prime modulus such as m = 2^31 - 1, the multiplicative congruential generator can overflow. Schrage’s method was invented to overcome the possibility of overflow and relies on the fact that a(m mod a) < m. We can check that the parameters in use satisfy this condition:

a <- 48271
m <- 2 ** 31 - 1
a * (m %% a) < m
## [1] TRUE

Schrage’s method restates the modulus m as the decomposition m = aq + r, where q = m div a (the integer quotient) and r = m mod a (the remainder).
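A quick check of this decomposition with the parameters above (a small supplementary snippet, not from the original post):

q <- m %/% a   # integer quotient: 44488, matching Park, Miller and Stockmeyer
r <- m %% a    # remainder: 3399
m == a * q + r # TRUE: the decomposition recovers m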

Multiplicative Congruential Generator in R

We can implement a Lehmer random number generator in R using the parameters mentioned earlier.

lehmer.rng <- function(n = 10) {
  rng <- vector(length = n)
  m <- 2147483647
  a <- 48271
  q <- 44488
  r <- 3399
  # Seed using the current system time (in seconds).
  # The initial seed should be coprime to the modulus m,
  # which we are not really concerning ourselves with for this example.
  d <- floor(as.numeric(Sys.time()))
  for (i in 1:n) {
    h <- d %/% q              # integer part of d / q
    l <- d %% q               # d mod q
    t <- a * l - r * h        # Schrage's method avoids overflow here
    d <- if (t < 0) t + m else t
    rng[i] <- d / m
  }
  return(rng)
}

# Print the first 10 randomly generated numbers
lehmer.rng()
## [1] 0.68635675 0.12657390 0.84869106 0.16614698 0.08108171 0.89533896
## [7] 0.90708773 0.03195725 0.60847522 0.70736551

Plotting our multiplicative congruential generator in three dimensions allows us to visualize the apparent 'randomness' of the generator. As before, we generate three random vectors with our Lehmer RNG function and plot the points. The plot3D package is used to create the scatterplot and the animation package is used to animate each scatterplot as the length of the random vectors, n, increases.

library(plot3D)
library(animation)

n <- c(3, 10, 20, 100, 500, 1000, 2000, 5000, 10000, 20000)

saveGIF({
  for (i in 1:length(n)) {
    x <- lehmer.rng(n[i])
    y <- lehmer.rng(n[i])
    z <- lehmer.rng(n[i])
    scatter3D(x, y, z, colvar = NULL, pch = 20, cex = 0.5, theta = 20,
              main = paste('n = ', n[i]))
  }
}, movie.name = 'lehmer.gif')

The generator appears to be producing suitably random numbers, as demonstrated by the increasingly uniform swarm of points as n increases.
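As an additional check (not in the original post), one could compare a large sample from the generator against the uniform distribution, for example with a Kolmogorov-Smirnov test; for a well-behaved generator the p-value should typically not be small.

# Rough uniformity checks for the lehmer.rng() function defined above.
u <- lehmer.rng(10000)
ks.test(u, "punif")
hist(u, breaks = 50, main = "lehmer.rng(10000)", xlab = "u")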

References

Anne Gille-Genest (March 1, 2012). Implementation of the Pseudo-Random Number Generators and the Low Discrepancy Sequences.

Saucier, R. (2000). Computer Generation of Statistical Distributions (1st ed.). Aberdeen, MD. Army Research Lab.

Stephen K. Park; Keith W. Miller; Paul K. Stockmeyer (1993). "Technical Correspondence". Communications of the ACM. 36 (7): 105–110.

The post Multiplicative Congruential Generators in R appeared first on Aaron Schlegel.



Probability functions intermediate

Thu, 08/31/2017 - 18:57

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In this set of exercises, we are going to explore some of the probability functions in R through practical applications. Basic probability knowledge is required. In case you are not familiar with the function apply, check the R documentation.

Note: We are going to use random number functions and random process functions in R, such as runif. A problem with these functions is that every time you run them you will obtain a different value. To make your results reproducible, you can specify the value of the seed using set.seed(any number) before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random number process.) For this set of exercises we will use set.seed(1). Don't forget to specify it before every exercise that includes random numbers.
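For example (a minimal illustration, not one of the exercises), resetting the same seed before each call makes runif return the same values:

set.seed(1)
runif(3)
## [1] 0.2655087 0.3721239 0.5728534  (with R's default RNG)

set.seed(1)
runif(3)
## the same three values again, because the seed was reset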

 

 

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Generating dice rolls. Set your seed to 1 and generate 30 random numbers between 0 and 6 using runif. Save them in an object called random_numbers. Then use the ceiling function to round the values up. These values represent the results of 30 die rolls.

Exercise 2

Simulate one die roll using the function rmultinom. Make sure n = 1 is inside the function, and save the result in an object called die_result. The matrix die_result is a collection of one 1 and five 0s, where the 1 indicates which value was obtained in the roll. Use the function which to create an output that shows only the value obtained after the die is rolled.
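One possible way this could look (a sketch, not the official solution from the answers page):

set.seed(1)
# One roll of a fair six-sided die: a 6 x 1 matrix containing a single 1.
die_result <- rmultinom(n = 1, size = 1, prob = rep(1/6, 6))
# The row index of the 1 is the face that was rolled.
which(die_result == 1)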

Exercise 3

Using rmultinom, simulate 30 dice rolls. Save the result in a variable called dice_result and use apply to transform the matrix into a vector with the result of each roll.

Exercise 4

Some gambling games use 2 dice, and after they are rolled the two values are summed. Simulate throwing 2 dice 30 times and record the sum of the values of each pair. Use rmultinom to simulate throwing 2 dice 30 times, and use the function apply to record the sum of the values of each experiment.

Learn more about probability functions in the online course Statistics with R – Advanced Level. In this course you will learn how to

  • work with different binomial and logistic regression techniques,
  • know how to compare regression models and choose the right fit,
  • and much more.

Exercise 5

Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Using rnorm, simulate the height of 100 people and save it in an object called heights.

To get an idea of the values of heights, use the function summary.

Exercise 6

90% of the population is shorter than ____________?

Exercise 7

What percentage of the population is taller than 1.60 m?

Exercise 8

Run the following lines of code before this exercise. They will load a library required for the exercise.

if (!'MASS' %in% installed.packages()) install.packages('MASS')
library(MASS)

Simulate 1000 people with height and weight using the function mvrnorm with mu = c(1.70, 60) and
Sigma = matrix(c(.1,3.1,3.1,100), nrow = 2)

Exercise 9

How many people from the simulated population are taller than 1.70 m and heavier than 60 kg?

Exercise 10

How many people from the simulated population are taller than 1.75 m and lighter than 60 kg?

Related exercise sets:
  1. Lets Begin with something sample
  2. Probability functions beginner
  3. Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-6)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


DEADLINE EXTENDED: Last call for Boston EARL abstracts

Thu, 08/31/2017 - 18:56

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Are you solving problems and innovating with R?
Are you working with R in a commercial setting?
Do you enjoy sharing your knowledge?

If you said yes to any of the above, we want your abstract!

Share your commercial R stories with your peers at EARL Boston this November.

EARL isn’t about knowing the most or being the best in your field – it’s about taking what you’ve learnt and sharing it with others, so they can learn from your wins (and sometimes your failures, because we all have them!).

As long as your proposed presentation is focused on the commercial use of R, any topic from any industry is welcome!

The abstract submission deadline has been extended to Sunday 3 September.

Join David Robinson, Mara Averick and Tareef Kawaf on 1-3 November 2017 at The Charles Hotel in Cambridge.

To submit your abstract on the EARL website, click here.

See you in Boston!



Text featurization with the Microsoft ML package

Thu, 08/31/2017 - 17:23

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last week I wrote about how you can use the MicrosoftML package in Microsoft R to featurize images: reduce an image to a vector of 4096 numbers that quantify the essential characteristics of the image, according to an AI vision model. You can perform a similar featurization process with text as well, but in this case you have a lot more control of the features used to represent the text.

Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews from 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here's one example, selected at random:

What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It's great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.

The review contains many positive terms ("useful", "indispensable", "highly recommended"), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in the Microsoft ML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one- and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:

transformRule = list(
  featurizeText(
    vars = c(Features = "REVIEW_TEXT"),
    # ngramLength=2: include not only "Azure", "AD", but also "Azure AD"
    # skipLength=1 : "computer" and "compuuter" is the same
    wordFeatureExtractor = ngramCount(
      weighting = "tfidf",
      ngramLength = 2,
      skipLength = 1),
    language = "English"
  ),
  selectFeatures(
    vars = c("Features"),
    mode = minCount(500)
  )
)

# train using transforms !
model <- rxFastLinear(
  RATING ~ Features,
  data = train,
  mlTransforms = transformRule,
  type = "regression"  # not binary (numeric regression)
)

We can then look at the coefficients associated with these features (presence of n-grams) to assess their impact on the overall rating. By this standard, the top 10 words or word-pairs contributing to a negative rating are:

boring       -7.647399
waste        -7.537471
not          -6.355953
nothing      -6.149342
money        -5.386262
bad          -5.377981
no           -5.210301
worst        -5.051558
poorly       -4.962763
disappointed -4.890280

Similarly, the top 10 words or word-pairs associated with a positive rating are:

will       3.073104
the|best   3.265797
love       3.290348
life       3.562267
wonderful  3.652950
,|and      3.762862
you        3.889580
excellent  3.902497
my         4.454115
great      4.552569

Another option is simply to look at the sentiment score for each review, which can be extracted using the getSentiment function. 

sentimentScores <- rxFeaturize(data=data, mlTransforms = getSentiment(vars = list(SentimentScore = "REVIEW_TEXT")))

As we expect, a negative sentiment score (in the 0-0.5 range) is associated with 1- and 2-star reviews, while a positive sentiment score (0.5-1.0) is associated with the 4- and 5-star reviews.

You can find more details on this analysis, including the Microsoft R code, at the link below.

Microsoft Technologies Blog for Enterprise Developers: Analyze your text in R (MicrosoftML)



Why to use the replyr R package

Thu, 08/31/2017 - 16:48

Recently I noticed that the R package sparklyr had the following odd behavior:

suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '1.1.0.9000'

sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))

dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA

This means user code or user analyses that depend on one of dim(), ncol() or nrow() may break. nrow() used to return something other than NA, so older work may no longer be reproducible.

In fact: where I actually noticed this was deep in debugging a client project (not in a trivial example, such as above).



Tron: fights for the users.

In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both sparklyr and dbplyr users.

The explanation is: “tibble::truncate uses nrow()” and “print.tbl_spark is too slow since dbplyr started using tibble as the default way of printing records”.

A little digging gets us to this:

The above might make sense if tibble and dbplyr were the only users of dim(), ncol() or nrow().

Frankly if I call nrow() I expect to learn the number of rows in a table.

The suggestion is for all user code to adapt to use sdf_dim(), sdf_ncol() and sdf_nrow() (instead of tibble adapting). Even if this were practical (there are already a lot of existing sparklyr analyses), it prohibits writing generic dplyr code that works the same over local data, databases, and Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr dbplyr users (i.e., databases such as PostgreSQL), as I don't see any obvious convenient "no, please really calculate the number of rows for me" option (other than "d %>% tally %>% pull").
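For reference, here is a minimal sketch of that work-around on a local data frame; the same pipeline is intended to work against remote dplyr sources, though only the local case is shown here.

library("dplyr")

d_local <- data.frame(x = 1:2)

# Force an explicit row count, even where nrow() might return NA
# for a remote source.
d_local %>% tally() %>% pull()
#> [1] 2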

I admit that calling nrow() against an arbitrary query can be expensive. However, I am usually calling nrow() on physical tables (not on arbitrary dplyr queries or pipelines). Physical tables often deliberately carry explicit meta-data to make nrow() a cheap operation.

Allowing the user to write reliable generic code that works against many dplyr data sources is the purpose of our replyr package. Being able to use the same code many places increases the value of the code (without user facing complexity) and allows one to rehearse procedures in-memory before trying databases or Spark. Below are the functions replyr supplies for examining the size of tables:

library("replyr")
packageVersion("replyr")
#> [1] '0.5.4'

replyr_hasrows(d)
#> [1] TRUE
replyr_dim(d)
#> [1] 2 1
replyr_ncol(d)
#> [1] 1
replyr_nrow(d)
#> [1] 2

spark_disconnect(sc)

Note: the above is only working properly in the development version of replyr, as I only found out about the issue and made the fix recently.

replyr_hasrows() was added as I found in many projects the primary use of nrow() was to determine if there was any data in a table. The idea is: user code uses the replyr functions, and the replyr functions deal with the complexities of dealing with different data sources. This also gives us a central place to collect patches and fixes as we run into future problems. replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).

The point of replyr is to provide re-usable work-arounds for design choices far away from our influence.


Pulling Data Out of Census Spreadsheets Using R

Thu, 08/31/2017 - 14:22

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In this post, I show a method for extracting small amounts of data from somewhat large Census Bureau Excel spreadsheets, using R.  The objects of interest are expenditures of state and local governments on hospital capital in Iowa for the years 2004 to 2014. The data can be found at http://www2.census.gov/govs/local/. The files at the site are yearly files.

The files to be used are those named ‘yrslsstab1a.xls’, where ‘yr‘ is replaced by the two digits of the year for a given year, for example, ’04’ or ’11’. The individual yearly files contain data for the whole country and for all of the states, over all classes of state and local government revenue and expenditures. The task is to extract three data points from each file – state and local expenditures, state expenditures, and local expenditures – for the state of Iowa.

The structure of the files varies from year to year, so first reviewing the files is important. I found two patterns for the expenditure data – data with and data without margins of error. The program locates the columns for Iowa and the row for hospital capital expenditures. Then, the data are extracted and put in a matrix for outputting.

First, character strings of the years are created, to be used in referencing the data sets, and a data frame is created to contain the final result.

years = c(paste("0", 4:9, sep=""), paste(10:14))
hospital.capital.expend <- data.frame(NA, NA, NA)

Second, the library ‘gdata’ is opened. The library ‘gdata’ contains functions useful for manipulating data in R and provides for reading data into R from a URL containing an Excel file.

library(gdata)

Third, a loop is run through the eleven years to fill in the ‘hospital.capital.expend’ data frame with the data from each year. The object ‘fn’ contains the URL of the Excel file for a given year. The function ‘paste’ concatenates the three parts of the URL. Note that ‘sep’ must be set to “” in the function.

for (i in 1:11) {
  fn = paste("http://www2.census.gov/govs/local/", years[i],
             "slsstab1a.xls", sep="")

Next, the Excel file is read into the object ‘ex’. The argument ‘header’ is set to ‘F’ so that all of the rows are input. Also, since all of the columns contain some character data, all of the data is forced to be character by setting ‘stringsAsFactors’ to ‘F’.  The function used to read the spreadsheet is ‘read.xls’ in the package ‘gdata’.

ex = read.xls(fn, sheet=1, header=F, stringsAsFactors=F)

Next, the row and column indices of the data are found using the functions ‘grepl’ and ‘which’. The first argument in ‘grepl’ is a pattern to be matched. For a data frame, the ‘grepl’ function returns a logical vector of ‘T’s and ‘F’s of length equal to the number of columns in the data frame – giving ‘T’ if the column contains the pattern and ‘F’ if not. Note that ‘*’ can be used as a wild card in the pattern.  For a character vector, ‘grepl’ returns ‘T’ if an element of the vector matches the pattern and ‘F’ otherwise.

The ‘which’ function returns the indices of a logical vector which have the value ‘T’. So, ‘ssi1’ contains the index of the column containing ‘Hospital’ and ‘ssi2’ contains the index of the column containing ‘Iowa’. Since ‘ex[,ssi1]’ is a character vector rather than a data frame, ‘grepl’ works element-wise there, and ‘ssi4’ contains the index of the second row containing ‘Hospital’. For all of the eleven years, the second occurrence of ‘Hospital’ in the ‘Hospital’ column is the row with hospital expenditures.

ssi1 = which(grepl("*Hospital*", ex, ignore.case=T))
ssi2 = which(grepl("Iowa", ex, ignore.case=T))
ssi4 = which(grepl("Hospital", ex[,ssi1], ignore.case=T))[2]
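To see the behaviour described above on a toy example (my illustration, not part of the census extraction):

df <- data.frame(a = c("x", "y"), b = c("Iowa", "z"), stringsAsFactors = FALSE)

# On a data frame, 'grepl' tests each column as a whole.
grepl("Iowa", df)
## [1] FALSE  TRUE
which(grepl("Iowa", df))
## [1] 2

# On a character vector, 'grepl' tests each element.
which(grepl("Hospital", c("Revenue", "Hospital total", "Hospital capital")))
## [1] 2 3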

Next, the data are extracted, and the temporary files are removed. If the column index of ‘Iowa’ is less than 80, no margin of error was included and the data points are in the column of ‘Iowa’ and in the next two columns. If the column index of ‘Iowa’ is larger than 79, a margin of error was included and the data are in the column of ‘Iowa’ and the second and third columns to the right.

The capital expenditures are found one row below the ‘Hospital’ row, so one is added to ‘ssi4’ to get the correct row index. The data are put in the data frame ‘df.1’ which is row bound to the data frame ‘hospital.capital.expend’. The names of the columns in ‘df.1’ are set to ‘NA’ so that the row bind will work.  Then the temporary files are removed and the loop ends.

if (ssi2 < 80) ssi5 = ssi2 + 0:2 else ssi5 = ssi2 + c(0, 2, 3)
df.1 = data.frame(ex[ssi4+1, ssi5], stringsAsFactors = F)
names(df.1) = c(NA, NA, NA)
hospital.capital.expend = rbind(hospital.capital.expend, df.1)
rm(fn, ex, df.1, ssi1, ssi2, ssi4, ssi5)
}

There are just a few steps left to clean things up. The first row of ‘hospital.capital.expend’, which just contains ‘NA’s, is removed. Then, the commas within the numbers, as extracted from the census file, are removed from the character strings using the function ‘gsub’ and the data frame is converted to a numeric matrix. Next, the eleven years are column bound to the matrix. Last, the columns are given names and the matrix is printed out.

hospital.capital.expend = as.matrix(hospital.capital.expend[-1,])
hospital.capital.expend = matrix(as.numeric(gsub(",", "", hospital.capital.expend)), ncol=3)
hospital.capital.expend = cbind(2004:2014, hospital.capital.expend)
colnames(hospital.capital.expend) = c("Year", "State.Local", "State", "Local")
print(hospital.capital.expend)

That’s it!!!

    Related Post

    1. Extracting Tables from PDFs in R using the Tabulizer Package
    2. Extract Twitter Data Automatically using Scheduler R package
    3. An Introduction to Time Series with JSON Data
    4. Get Your Data into R: Import Data from SPSS, Stata, SAS, CSV or TXT
    5. Exporting Data from R to TXT, CSV, SPSS or Stata


    Community Call – rOpenSci Software Review and Onboarding

    Thu, 08/31/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    Are you thinking about submitting a package to rOpenSci's open peer software review? Considering volunteering to review for the first time? Maybe you're an experienced package author or reviewer and have ideas about how we can improve.

    Join our Community Call on Wednesday, September 13th. We want to get your feedback and we'd love to answer your questions!

    Agenda
    1. Welcome (Stefanie Butland, rOpenSci Community Manager, 5 min)
    2. guest: Noam Ross, editor (15 min)
      Noam will give an overview of the rOpenSci software review and onboarding, highlighting the role editors play and how decisions are made about policies and changes to the process.
    3. guest: Andee Kaplan, reviewer (15 min)
      Andee will give her perspective as a package reviewer, sharing specifics about her workflow and her motivation for doing this.
    4. Q & A (25 min, moderated by Noam Ross)
    Speaker bios

    Andee Kaplan is a Postdoctoral Fellow at Duke University. She is a recent PhD graduate from the Iowa State University Department of Statistics, where she learned a lot about R and reproducibility by developing a class on data stewardship for Agronomists. Andee has reviewed multiple (two!) packages for rOpenSci, iheatmapr and getlandsat, and hopes to one day be on the receiving end of the review process.

    Andee on GitHub, Twitter

    Noam Ross is one of rOpenSci's four editors for software peer review. Noam is a Senior Research Scientist at EcoHealth Alliance in New York, specializing in mathematical modeling of disease outbreaks, as well as training and standards for data science and reproducibility. Noam earned his Ph.D. in Ecology from the University of California-Davis, where he founded the Davis R Users' Group.

    Noam on GitHub, Twitter

    Resources


    Create and Update PowerPoint Reports using R

    Thu, 08/31/2017 - 02:50

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    In my sordid past, I was a data science consultant. One thing about data science that they don’t teach you at school is that senior managers in most large companies require reports to be in PowerPoint. Yet, I like to do my more complex data science in R, and PowerPoint and R are not natural allies. As a result, creating and updating PowerPoint reports using R can be painful.

    In this post, I discuss how to make R and PowerPoint work efficiently together. The underlying assumption is that R is your computational engine and that you are trying to get outputs into PowerPoint. I compare and contrast three tools for creating and updating PowerPoint reports using R: the free ReporteRs package and two commercial products, Displayr and Q.

    Option 1: ReporteRs

    The first approach to getting R and PowerPoint to work together is to use David Gohel’s ReporteRs. To my mind, this is the most “pure” of the approaches from an R perspective. If you are an experienced R user, this approach works in pretty much the way that you will expect it to work.

    The code below creates 250 crosstabs, conducts significance tests, and, if the p-value is less than 0.05, presents a slide containing each. And, yes, I know this is p-hacking, but this post is about how to use PowerPoint and R, not how to do statistics…

    library(devtools)
    devtools::install_github('davidgohel/ReporteRsjars')
    devtools::install_github('davidgohel/ReporteRs')
    install.packages(c('ReporteRs', 'haven', 'vcd', 'ggplot2', 'reshape2'))
    library(ReporteRs)
    library(haven)
    library(vcd)
    library(ggplot2)
    library(reshape2)

    dat = read_spss("http://wiki.q-researchsoftware.com/images/9/94/GSSforDIYsegmentation.sav")
    filename = "c://delete//Significant crosstabs.pptx" # the document to produce
    document = pptx(title = "My significant crosstabs!")
    alpha = 0.05 # The level at which the statistical testing is to be done.
    dependent.variable.names = c("wrkstat", "marital", "sibs", "age", "educ")
    all.names = names(dat)[6:55] # The first 50 variables in the file.
    counter = 0
    for (nm in all.names)
        for (dp in dependent.variable.names)
        {
            if (nm != dp)
            {
                v1 = dat[[nm]]
                if (is.labelled(v1)) v1 = as_factor(v1)
                v2 = dat[[dp]]
                l1 = attr(v1, "label")
                l2 = attr(v2, "label")
                if (is.labelled(v2)) v2 = as_factor(v2)
                if (length(unique(v1)) <= 10 && length(unique(v2)) <= 10) # Only performing tests if 10 or fewer rows and columns.
                {
                    x = xtabs(~v1 + v2)
                    x = x[rowSums(x) > 0, colSums(x) > 0]
                    ch = chisq.test(x)
                    p = ch$p.value
                    if (!is.na(p) && p <= alpha)
                    {
                        counter = counter + 1
                        # Creating the outputs.
                        crosstab = prop.table(x, 2) * 100
                        melted = melt(crosstab)
                        melted$position = 100 - as.numeric(apply(crosstab, 2, cumsum) - 0.5 * crosstab)
                        p = ggplot(melted, aes(x = v2, y = value, fill = v1)) + geom_bar(stat = 'identity')
                        p = p + geom_text(data = melted, aes(x = v2, y = position, label = paste0(round(value, 0), "%")), size = 4)
                        p = p + labs(x = l2, y = l1)
                        colnames(crosstab) = paste0(colnames(crosstab), "%")
                        #bar = ggplot() + geom_bar(aes(y = v1, x = v2), data = data.frame(v1, v2), stat="identity")
                        # Writing them to the PowerPoint document.
                        document = addSlide(document, slide.layout = "Title and Content")
                        document = addTitle(document, paste0("Standardized residuals and chart: ", l1, " by ", l2))
                        document = addPlot(doc = document, fun = print, x = p, offx = 3, offy = 1, width = 6, height = 5)
                        document = addFlexTable(doc = document, FlexTable(round(ch$stdres, 1), add.rownames = TRUE), offx = 8, offy = 2, width = 4.5, height = 3)
                    }
                }
            }
        }
    writeDoc(document, file = filename)
    cat(paste0(counter, " tables and charts exported to ", filename, "."))

    Below we see one of the admittedly ugly slides created using this code. With more time and expertise, I am sure I could have done something prettier. A cool aspect of the ReporteRs package is that you can then edit the file in PowerPoint. You can then get R to update any charts and other outputs originally created in R.

     

     

    Option 2: Displayr

    A completely different approach is to author the report in Displayr, and then export the resulting report from Displayr to PowerPoint.

    This has advantages and disadvantages relative to using ReporteRs. First, I will start with the big disadvantage, in the hope of persuading you of my objectivity (disclaimer: I have no objectivity, I work at Displayr).

    Each page of a Displayr report is created interactively, using a mouse and clicking and dragging things. In my earlier example using ReporteRs, I only created pages where there was a statistically significant association. Currently, there is no way of doing such a thing in Displayr.

    The flipside of using a graphical user interface like Displayr is that it is a lot easier to create attractive visualizations. As a result, the user has much greater control over the look and feel of the report. For example, the screenshot below shows a PowerPoint document created by Displayr. All but one of the charts have been created using R, and the first two are based on a moderately complicated statistical model (a latent class rank-ordered logit model).

    You can access the document used to create the PowerPoint report with R here (just sign in to Displayr first) – you can poke around and see how it all works.

    A benefit of authoring a report using Displayr is that the user can access the report online, interact with it (e.g., filter the data), and then export precisely what they want. You can see this document as it is viewed by a user of the online report here.

     

     

    Option 3: Q

    A third approach for authoring and updating PowerPoint reports using R is to use Q, which is a Windows program designed for survey reporting (same disclaimer as with Displayr). It works by exporting and updating results to a PowerPoint document. Q has two different mechanisms for exporting R analyses to PowerPoint. First, you can export R outputs, including HTMLwidgets, created in Q directly to PowerPoint as images. Second, you can create tables using R and then have these exported as native PowerPoint objects, such as Excel charts and PowerPoint tables.


    In Q, a Report contains a series of analyses. Analyses can either be created using R, or, using Q’s own internal calculation engine, which is designed for producing tables from survey data.

    The map above (in the Displayr report) is an HTMLwidget created using the plotly R package. It draws data from a table called Region, which would also be shown in the report. (The same R code in the Displayr example can be used in an R object within Q). So when exported into PowerPoint, it creates a page, using the PowerPoint template, where the title is Responses by region and the map appears in the middle of the page.

    The screenshot below shows another chart created in PowerPoint from R data. The data has been extracted from Google Trends using the gtrendsR R package. However, the chart itself is a standard Excel chart, attached to a spreadsheet containing the data. These slides can then be customized using all the normal PowerPoint tools and can be automatically updated when the data is revised.

     

    Explore the Displayr example

    You can access the Displayr document used to create and update the PowerPoint report with R here (just sign in to Displayr first). Here, you can poke around and see how it all works or create your own document.



    Pacific Island Hopping using R and iGraph

    Thu, 08/31/2017 - 02:00

    (This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

    Last month I enjoyed a relaxing holiday in the tropical paradise of Vanuatu. One rainy day I contemplated how to go island hopping across the Pacific ocean, visiting as many island nations as possible. The Pacific ocean is a massive body of water between Asia and the Americas which covers almost half the surface of the earth. The southern Pacific is strewn with island nations, from Australia to Chile. In this post, I describe how to use R to plan your next Pacific island hopping journey.

    The Pacific Ocean.

    Listing all airports

    My first step was to create a list of flight connections between each of the island nations in the Pacific ocean. I am not aware of a publicly available data set of international flights, so unfortunately I created a list manually (if you do know of such a data set, then please leave a comment).

    My manual research resulted in a list of international flights from or to island airports. This list might not be complete, but it is a start. My Pinterest board with Pacific island airline route maps was the information source for this list.

    The first code section reads the list of airline routes and uses the ggmap package to extract their coordinates from Google maps. The data frame with airport coordinates is saved for future reference to avoid repeatedly pinging Google for the same information.

    # Init
    library(tidyverse)
    library(ggmap)
    library(ggrepel)
    library(geosphere)

    # Read flight list and airport list
    flights <- read.csv("Geography/PacificFlights.csv", stringsAsFactors = FALSE)
    f <- "Geography/airports.csv"
    if (file.exists(f)) {
        airports <- read.csv(f)
    } else airports <- data.frame(airport = NA, lat = NA, lon = NA)

    # Lookup coordinates for new airports
    all_airports <- unique(c(flights$From, flights$To))
    new_airports <- all_airports[!(all_airports %in% airports$airport)]
    if (length(new_airports) != 0) {
        coords <- geocode(new_airports)
        new_airports <- data.frame(airport = new_airports, coords)
        airports <- rbind(airports, new_airports)
        airports <- subset(airports, !is.na(airport))
        write.csv(airports, "Geography/airports.csv", row.names = FALSE)
    }

    # Add coordinates to flight list
    flights <- merge(flights, airports, by.x = "From", by.y = "airport")
    flights <- merge(flights, airports, by.x = "To", by.y = "airport")

    Create the map

    To create a map, I modified the code to create flight maps I published in an earlier post. This code had to be changed to centre the map on the Pacific. Mapping the Pacific ocean is problematic because the -180 and +180 degree meridians meet around the date line. Longitudes west of the antemeridian are positive, while longitudes east are negative.

    The world2 data set in the borders function of the ggplot2 package is centred on the Pacific ocean. To enable plotting on this map, all negative longitudes are made positive by adding 360 degrees to them.

    # Pacific centric
    flights$lon.x[flights$lon.x < 0] <- flights$lon.x[flights$lon.x < 0] + 360
    flights$lon.y[flights$lon.y < 0] <- flights$lon.y[flights$lon.y < 0] + 360
    airports$lon[airports$lon < 0] <- airports$lon[airports$lon < 0] + 360

    # Plot flight routes
    worldmap <- borders("world2", colour = "#efede1", fill = "#efede1")
    ggplot() + worldmap +
        geom_point(data = airports, aes(x = lon, y = lat), col = "#970027") +
        geom_text_repel(data = airports, aes(x = lon, y = lat, label = airport),
                        col = "black", size = 2, segment.color = NA) +
        geom_curve(data = flights, aes(x = lon.x, y = lat.x, xend = lon.y, yend = lat.y, col = Airline),
                   size = .4, curvature = .2) +
        theme(panel.background = element_rect(fill = "white"),
              axis.line = element_blank(),
              axis.text.x = element_blank(),
              axis.text.y = element_blank(),
              axis.ticks = element_blank(),
              axis.title.x = element_blank(),
              axis.title.y = element_blank()) +
        xlim(100, 300) + ylim(-40, 40)

    Pacific Island Hopping

    This visualisation is aesthetic and full of context, but it is not the best visualisation to solve the travel problem. This map can also be expressed as a graph with nodes (airports) and edges (routes). Once the map is represented mathematically, we can generate travel routes and begin our Pacific Island hopping.

    The igraph package converts the flight list to a graph that can be analysed and plotted. The shortest_paths function can then be used to plan routes. If I wanted to travel from Auckland to Saipan in the Northern Mariana Islands, I would have to go through Port Vila, Honiara, Port Moresby, Chuuk and Guam before reaching Saipan. I am pretty sure there are quicker ways to get there, but this would be an exciting journey through the Pacific.

    library(igraph)

    g <- graph_from_edgelist(as.matrix(flights[, 1:2]), directed = FALSE)
    par(mar = rep(0, 4))
    plot(g, layout = layout.fruchterman.reingold, vertex.size = 0)
    V(g)
    shortest_paths(g, "Auckland", "Saipan")

    View the latest version of this code on GitHub.

    The post Pacific Island Hopping using R and iGraph appeared first on The Devil is in the Data.



    Probably more likely than probable

    Wed, 08/30/2017 - 23:33

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    What kind of probability are people talking about when they say something is "highly likely" or has "almost no chance"? The chart below, created by Reddit user zonination, visualizes the responses of 46 other Reddit users to "What probability would you assign to the phrase: <phrase>?" for various statements of probability. Each set of responses has been converted to a kernel density estimate and presented as a joyplot using R.
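    For readers who want to reproduce the general idea, here is a minimal sketch with simulated responses using the ggridges package (the original chart may have used a different ridgeline/joyplot package; the phrases and beta distributions below are illustrative only, not the Reddit data).

    library(ggplot2)
    library(ggridges)  # provides geom_density_ridges()

    set.seed(1)
    phrases <- c("Almost no chance", "About even", "Highly likely")  # illustrative subset
    responses <- data.frame(
      phrase = factor(rep(phrases, each = 46), levels = rev(phrases)),
      prob   = 100 * c(rbeta(46, 2, 30), rbeta(46, 20, 20), rbeta(46, 30, 3))
    )

    # One density ridge per phrase, mimicking the joyplot layout.
    ggplot(responses, aes(x = prob, y = phrase)) +
      geom_density_ridges() +
      labs(x = "Assigned probability (%)", y = NULL)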

    Somewhat surprisingly, the results from the Redditors hew quite closely to a similar study of 23 NATO intelligence officers in 2007. In that study, the officers, who were accustomed to reading intelligence reports with assertions of likelihood, were given a similar task with the same descriptions of probability. The results, here presented as a dotplot, are quite similar.

    For details on the analysis of the Redditors, including the data and R code behind the joyplot chart, check out the Github repository linked below.

    Github (zonination): Perceptions of Probability and Numbers



    Finding distinct rows of a tibble

    Wed, 08/30/2017 - 22:30

    (This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

    I’ve been using R or its predecessors for about 30 years, which means I know a lot about R, and also that I don’t necessarily know how to use modern R tools. Lately, I’ve been trying to unlearn some old approaches, and to re-learn them using the tidyverse approach to data analysis. I agree that it is much better, but old dogs and new tricks…
    Recently, I was teaching a class where I needed to extract some rows of a data set.
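    The excerpt ends here, but given the title, the tidyverse tool in question is presumably dplyr's distinct(). A minimal sketch of how it is typically used (my illustration, not code from the original post):

    library(dplyr)

    df <- data.frame(x = c(1, 1, 2, 2), y = c("a", "a", "a", "b"))

    distinct(df)                       # unique rows across all columns
    distinct(df, x)                    # unique values of x only
    distinct(df, x, .keep_all = TRUE)  # one row per unique x, keeping the other columns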



    R in the Data Science Stack at ODSC

    Wed, 08/30/2017 - 22:05

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)


    Register now for ODSC West in San Francisco, November 2-4 and save 60% with code RB60 until September 1st.

    R continues to hold its own in the data science landscape thanks in no small part to its flexibility.  That flexibility allows R to integrate with some of the most popular data science tools available.

    Given R’s memory bounds, it’s no surprise that deep learning tools like TensorFlow are on that list. Comprehensive, intuitive, and well documented, TensorFlow has quickly become one of the most popular deep learning platforms, and RStudio has released a package to integrate with the TensorFlow API. Not to be outdone, MXNet, another popular and powerful deep learning framework, has native support for R with an API interface.

    It doesn’t stop with deep learning. Data science is moving real-time, and the streaming analytics platform Apache Kafka is rapidly gaining traction with the community. The kafka package allows one to use the Kafka messaging queue via R. Spark is now one of the dominant machine learning platforms, and thus we see multiple R integrations in the form of the sparklyr and SparkR packages. The list will continue to grow, with package integrations released for H2O.ai, Druid, etc., and more on the way.

    At the  Open Data Science Conference, R has long been one of the most popular data science languages and ODSC West 2017 is no exception. We have a strong lineup this year that includes:

    • R Tools for Data Science
    • Modeling Big Data with R, sparklyr, and Apache Spark
    • Machine Learning with R
    • Introduction to Data Science with R
    • Modern Time-Series with Prophet
    • R4ML: A Scalable and Distributed framework in R for Machine Learning
    • Databases Using R
    • Geo-Spatial Data Visualization using R

    From an R user perspective, one of the most exciting things about ODSC West 2017 is that it offers an excellent opportunity to do a deep dive into some of the most popular data science tools you can now leverage with R. Talks and workshops on the conference schedule include:

    • Deep Learning from Scratch With TensorFlow
    • Apache Kafka for Real-time analytics
    • Deep learning with MXNet
    • Effective TensorFlow
    • Building an Open Source Analytics Solution with Kafka and Druid
    • Deep Neural Networks with Keras
    • Robust Data Pipelines with Apache Airflow
    • Apache Superset – A Modern, Enterprise-Ready Business Intelligence Web Application

    Over 3 packed days, ODSC West 2017 also offers a great opportunity to brush up on your modeling skills, including predictive analytics, time series, NLP, machine learning, image recognition, deep learning, autonomous vehicles, and AI chatbot assistants. Here are just a few of the data science workshops and talks scheduled:

    • Feature Selection from High Dimensions
    • Interpreting Predictions from Complex Models
    • Deep Learning for Recommender Systems
    • Natural Language Processing in Practice – Do’s and Don’ts
    • Machine Imaging recognition
    • Training a Prosocial Chatbot
    • Anomaly Detection Using Deep Learning
    • Myths of Data Science: Practical Issues You Can and Can Not Ignore.
    • Playing Detective with CNNs
    • Recommendation System Architecture and Algorithms
    • Driver and Occupants Monitoring AI for Autonomous Vehicles
    • Solving Impossible Problems by Collaborating with an AI
    • Dynamic Risk Networks: Mapping Risk in the Financial System

    With over 20 full training sessions, 50 workshops, and 100 speakers, ODSC West 2017 is ideal for everyone from beginners to experts looking to understand the latest in R tools and topics in data science and AI.

    Register now and save 60% with code RB60 until September 1st.

    Sheamus McGovern, CEO of ODSC


