R-bloggers
R news and tutorials contributed by hundreds of R bloggers

A useful forecast combination benchmark

Sun, 06/24/2018 - 02:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

Forecasting benchmarks are very important when testing new forecasting methods, to see how well they perform against some simple alternatives. Every week I get sent papers proposing new forecasting methods that fail to do better than even the simplest benchmark. They are rejected without review.
Typical benchmarks include the naïve method (especially for finance and economic data), the seasonal naïve method (for seasonal data), an automatically selected ETS model, and an automatically selected ARIMA model.
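Each of these benchmarks can be produced in a line or two with the forecast package; a minimal sketch (the AirPassengers series and the 12-month horizon are my own arbitrary choices, not from the original post):

library(forecast)
y <- AirPassengers   # any seasonal series will do for illustration
h <- 12              # forecast horizon

f_naive  <- naive(y, h = h)                 # naive: repeat the last observed value
f_snaive <- snaive(y, h = h)                # seasonal naive: repeat last year's value for each month
f_ets    <- forecast(ets(y), h = h)         # automatically selected ETS model
f_arima  <- forecast(auto.arima(y), h = h)  # automatically selected ARIMA model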



A primer in using Java from R – part 1

Sat, 06/23/2018 - 15:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

This primer consists of two parts and its goal is to provide a walk-through of using resources developed in Java from R. It is structured more as a “note to future self” than a proper educational article; however, I hope that some readers may still find it useful. It also lists a set of references that I found very helpful, for which I thank the respective authors.

The primer is split into 2 posts:

  1. In this first post we talk about using the rJava package to create objects, call methods and work with arrays; we also examine the various ways to call Java methods and how to call Java code from R directly via execution of shell commands.
  2. In the second one we discuss creating and using custom .jar archives within our R scripts and packages, handling Java exceptions from R, and take a quick look at a performance comparison between the low- and high-level interfaces provided by rJava.

R <3 Java, or maybe not?

Contents
  1. Calling Java from R directly
  2. The rJava package – an R to Java interface
  3. Calling Java methods using the rJava package
  4. Signatures in JNI notation
  5. References
Calling Java from R directly

Calling Java resources from R directly can be achieved using R’s system() function, which invokes the specified OS command. We can either use an already compiled Java class, or invoke the compilation itself via a system() call from R. Of course, for any real-world practical use we will probably do the Java coding, compilation and packaging into .jar files in a Java IDE and provide R with just the final .jar file(s); I however find it helpful to have a small example of the simplest complete case, for which even the following is sufficient. Integrating pre-prepared .jars into an R package will be covered in detail in the second part of this primer.
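Before doing so, it may be worth a quick sanity check that a JDK is actually visible from the shell R will invoke; a minimal sketch (not part of the original walk-through):

# Check that the Java runtime and compiler are on the PATH that R's shell sees
system("java -version")
system("javac -version")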

Let us show that by writing a very silly dummy class with just 2 methods:

  • main, which prints “Hello, World.” plus an optional suffix, if one is provided as an argument
  • SayMyName, which returns a string constructed from “My name is ” and getClass().getName()

This HelloWorldDummy.java file can look as follows:

package DummyJavaClassJustForFun;

public class HelloWorldDummy {
  public String SayMyName() {
    return("My name is " + getClass().getName());
  }
  public static void main(String[] args) {
    String stringArg = "And that is it.";
    if (args.length > 0) {
      stringArg = args[0];
    }
    System.out.println("Hello, World. " + stringArg);
  }
}

Compilation and execution via bash commands

Now that we have our dummy class ready, we can put the commands together and test them by executing them in a shell or, for RStudio fans, via RStudio’s cool Terminal feature. First, the compilation command, which may look something like the following, assuming that we are in the correct working directory:

$ javac DummyJavaClassJustForFun/HelloWorldDummy.java

Now that we have the class compiled, we can execute the main method, with and without the argument provided:

$ java DummyJavaClassJustForFun/HelloWorldDummy
$ java DummyJavaClassJustForFun/HelloWorldDummy "I like winter"

In case we need to compile and run with additional .jars located in the folder jars/, we specify that folder using -cp (class path):

$ javac -cp "jars/*" DummyJavaClassJustForFun/HelloWorldDummy.java
$ java -cp "jars/*:compile/src" DummyJavaClassJustForFun/HelloWorldDummy

Compilation and execution of Java code from R

Now that we have tested our commands, we can use R to do the compilation via the system function. Do not forget to cd into the correct directory within a single system call if needed:

system('cd data/; javac DummyJavaClassJustForFun/HelloWorldDummy.java')

After that we can also execute the main method, and the main method with one argument specified, just like we did it outside of R, once again using cd to enter the proper working directory if needed:

system('cd data/; java DummyJavaClassJustForFun/HelloWorldDummy')
system('cd data/; java DummyJavaClassJustForFun/HelloWorldDummy "Also I like winter"')

The rJava package – an R to Java interface

The rJava package provides a low-level interface to the Java virtual machine. It allows creating objects, calling methods and accessing fields of those objects. It also provides functionality to easily include our Java resources in R packages.

We can install it with the classic:

install.packages("rJava")

Note the system requirements: Java JDK 1.2 or higher (and, for JRI/REngine, JDK 1.4 or higher). After attaching the package, we also need to initialize a Java Virtual Machine (JVM):

## Attach rJava and Init a JVM
library(rJava)
.jinit()

In case of issues with attaching the package using library, one can refer to this helpful StackOverflow thread.

Creating Java objects with rJava

We will now very quickly go through the basic uses of the package. The .jnew function is used to create a new Java object. Note that the class argument requires a fully qualified class name in Java Native Interface notation.

# Creating a new object of java.lang class String
sHello <- .jnew(class = "java/lang/String", "Hello World!")
# Creating a new object of java.lang class Integer
iOne <- .jnew(class = "java/lang/Integer", "1")

Working with arrays via rJava

# Creating new arrays
iArray <- .jarray(1L:2L)
.jevalArray(iArray)
## [1] 1 2

# Using a list of 2 and lapply
# Integer Matrix int[2][2]
iMatrix <- .jarray(list(iArray, iArray), contents.class = "[I")
lapply(iMatrix, .jevalArray)
## [[1]]
## [1] 1 2
##
## [[2]]
## [1] 1 2

# Integer Matrix int[2][2]
square <- array(1:4, dim = c(2, 2))
square
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# Using dispatch = TRUE to create the array
# Using simplify = TRUE to return a nice R array
dSquare <- .jarray(square, dispatch = TRUE)
.jevalArray(dSquare, simplify = TRUE)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# Integer Tesseract int[2][2][2][2]
tesseract <- array(1L:16L, dim = c(2, 2, 2, 2))
tesseract
## , , 1, 1
##
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
##
## , , 2, 1
##
##      [,1] [,2]
## [1,]    5    7
## [2,]    6    8
##
## , , 1, 2
##
##      [,1] [,2]
## [1,]    9   11
## [2,]   10   12
##
## , , 2, 2
##
##      [,1] [,2]
## [1,]   13   15
## [2,]   14   16

# Use dispatch = TRUE to create the array
# Use simplify = TRUE to return a nice R array
# Interestingly, this seems weird
dTesseract <- .jarray(tesseract, dispatch = TRUE)
.jevalArray(dTesseract, simplify = TRUE)
## , , 1, 1
##
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    0
##
## , , 2, 1
##
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    8
##
## , , 1, 2
##
##      [,1] [,2]
## [1,]    9    0
## [2,]    0    0
##
## , , 2, 2
##
##      [,1] [,2]
## [1,]    0    0
## [2,]    0   16

Calling Java methods using the rJava package

rJava provides two levels of API:

  • fast, but inflexible low-level JNI-API in the form of the .jcall function
  • convenient (at the cost of performance) high-level reflection API based on the $ operator.

In practice, the rJava package gives us three ways to call Java methods, each with its own positives and negatives.

The low-level way – .jcall()

.jcall(obj, returnSig = "V", method, ...) calls a Java method with the supplied arguments the “low-level” way. A few important notes regarding its usage (for more, refer to the R help on .jcall):

  • it requires an exact match of argument and return types and does not perform any lookup in the reflection tables
  • passing sub-classes of the classes present in the method definition requires explicit casting using .jcast (see the short sketch after this list)
  • passing null arguments needs a proper class specification with .jnull
  • a vector of length 1 corresponding to a native Java type is considered a scalar; use .jarray to pass a vector as an array for safety
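A minimal sketch of the casting and null points above (the java.util.ArrayList example is my own illustration, not from the original post):

# ArrayList.add() expects a java.lang.Object, so the String reference must be cast explicitly
lst <- .jnew("java/util/ArrayList")
.jcall(lst, "Z", "add", .jcast(sHello, "java/lang/Object"))
# A null argument likewise needs to carry the class it stands in for
.jcall(lst, "Z", "add", .jnull("java/lang/Object"))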
# Calling a Java method length on the object low-level way
.jcall(sHello, returnSig = "I", "length")
## [1] 12

# Also we must be careful with the data types:
# This works
.jcall(sHello, returnSig = "C", "charAt", 5L)
## [1] 32

# This does not
.jcall(sHello, returnSig = "C", "charAt", 5)
## Error in .jcall(sHello, returnSig = "C", "charAt", 5): method charAt with signature (D)C not found

The high-level way – J()

J(class, method, ...) is the high-level API for accessing Java. It is slower than .jnew or .jcall since it has to use reflection to find the most suitable method.

  • to call a method, the method argument must be present as a character vector of length 1
  • if method is missing, J creates a class name reference
# Calling a Java method length on the object high-level way
J(sHello, "length")
## [1] 12

# Also, the high-level way will not help here
J(sHello, "charAt", 5L)
## Error in .jcall(o, "I", "intValue"): method intValue with signature ()I not found
J(sHello, "charAt", 5)
## Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.NoSuchMethodException: No suitable method for the given parameters

The high-level way with convenience – $

Closely connected to the J function, the $ operator for jobjRef Java object references provides convenient access to object attributes and Java method calls, by implementing the relevant methods for R’s completion generator.

  • $ returns either the value of the attribute or calls a method, depending on which name matches first
  • $<- assigns a value to the corresponding Java attribute
# And via the $ operator
sHello$length()
## [1] 12

# But these still do not work
sHello$charAt(5L)
## Error in .jcall(o, "I", "intValue"): method intValue with signature ()I not found
sHello$charAt(5)
## Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.NoSuchMethodException: No suitable method for the given parameters

Examining methods and fields

.DollarNames returns all fields and methods associated with the object. Method names are followed by ( or () depending on arity:

# vector of all fields and methods associated with sHello
.DollarNames(sHello)
##  [1] "CASE_INSENSITIVE_ORDER" "equals("
##  [3] "toString()"             "hashCode()"
##  [5] "compareTo("             "compareTo("
##  [7] "indexOf("               "indexOf("
##  [9] "indexOf("               "indexOf("
## [11] "valueOf("               "valueOf("
## [13] "valueOf("               "valueOf("
## [15] "valueOf("               "valueOf("
## [17] "valueOf("               "valueOf("
## [19] "valueOf("               "length()"
## [21] "isEmpty()"              "charAt("
## [23] "codePointAt("           "codePointBefore("
## [25] "codePointCount("        "offsetByCodePoints("
## [27] "getChars("              "getBytes()"
## [29] "getBytes("              "getBytes("
## [31] "getBytes("              "contentEquals("
## [33] "contentEquals("         "equalsIgnoreCase("
## [35] "compareToIgnoreCase("   "regionMatches("
## [37] "regionMatches("         "startsWith("
## [39] "startsWith("            "endsWith("
## [41] "lastIndexOf("           "lastIndexOf("
## [43] "lastIndexOf("           "lastIndexOf("
## [45] "substring("             "substring("
## [47] "subSequence("           "concat("
## [49] "replace("               "replace("
## [51] "matches("               "contains("
## [53] "replaceFirst("          "replaceAll("
## [55] "split("                 "split("
## [57] "join("                  "join("
## [59] "toLowerCase("           "toLowerCase()"
## [61] "toUpperCase()"          "toUpperCase("
## [63] "trim()"                 "toCharArray()"
## [65] "format("                "format("
## [67] "copyValueOf("           "copyValueOf("
## [69] "intern()"               "wait("
## [71] "wait("                  "wait()"
## [73] "getClass()"             "notify()"
## [75] "notifyAll()"            "chars()"
## [77] "codePoints()"

Signatures in JNI notation

Java Type                Signature
boolean                  Z
byte                     B
char                     C
short                    S
int                      I
long                     J
float                    F
double                   D
type[]                   [ type
method type              ( arg-types ) ret-type
fully-qualified-class    Lfully-qualified-class ;

In the fully-qualified-class row of the table above, note the
  • L prefix
  • ; suffix
For example
  • the Java method: long f (int n, String s, int[] arr);
  • has type signature: (ILjava/lang/String;[I)J
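These signatures are exactly what goes into the class argument of .jnew and the returnSig argument of .jcall; a small sketch (the substring call is my own illustrative choice, and the shown result is what I would expect):

# String.substring(int, int) has the JNI signature (II)Ljava/lang/String;
# so its return type is given to .jcall as "Ljava/lang/String;"
.jcall(sHello, returnSig = "Ljava/lang/String;", "substring", 0L, 5L)
## expected: [1] "Hello"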
References
  1. rJava basic crashcourse – at the rJava site on rforge, scroll down to the Documentation section
  2. The JNI Type Signatures – at Oracle JNI specs
  3. rJava documentation on CRAN
  4. Calling Java code from R by prof. Darren Wilkinson
  5. Mapping of types between Java (JNI) and native code
  6. Fixing issues with loading rJava


future.apply – Parallelize Any Base R Apply Function

Sat, 06/23/2018 - 02:00

(This article was first published on JottR on R, and kindly contributed to R-bloggers)


Got compute?

future.apply 1.0.0 – Apply Function to Elements in Parallel using Futures – is on CRAN. With this milestone release, all* base R apply functions now have corresponding futurized implementations. This makes it easier than ever before to parallelize your existing apply(), lapply(), mapply(), … code – just prepend future_ to an apply call that takes a long time to complete. That’s it! The default is sequential processing but by using plan(multiprocess) it’ll run in parallel.
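A minimal sketch of that “just prepend future_” idea (the toy function and input are mine, not from the post):

library(future.apply)
plan(multiprocess)   ## parallelize on the local machine

slow_square <- function(x) { Sys.sleep(0.1); x^2 }

## base R:      y <- lapply(1:20, slow_square)
## futurized:   the same call, with future_ prepended
y <- future_lapply(1:20, slow_square)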

Table: All future_nnn() functions in the future.apply package. Each function takes the same arguments as the corresponding base function does.

Function             Description
future_apply()       Apply Functions Over Array Margins
future_lapply()      Apply a Function over a List or Vector
future_sapply()      – “ –
future_vapply()      – “ –
future_replicate()   – “ –
future_mapply()      Apply a Function to Multiple List or Vector Arguments
future_Map()         – “ –
future_eapply()      Apply a Function Over Values in an Environment
future_tapply()      Apply a Function Over a Ragged Array

* future_rapply() – Recursively Apply a Function to a List – is yet to be implemented.

A Motivating Example

In the parallel package there is an example – in ?clusterApply – showing how to perform bootstrap simulations in parallel. After some small modifications to clarify the steps, it looks like the following:

library(parallel)
library(boot)

run1 <- function(...) {
  library(boot)
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 500, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
}

cl <- makeCluster(4) ## Parallelize using four cores
clusterSetRNGStream(cl, 123)
cd4.boot <- do.call(c, parLapply(cl, 1:4, run1))
boot.ci(cd4.boot, type = c("norm", "basic", "perc"),
        conf = 0.9, h = atanh, hinv = tanh)
stopCluster(cl)

The script defines a function run1() that produces 500 bootstrap samples, and then it calls this function four times, combines the four replicated samples into one cd4.boot, and at the end it uses boot.ci() to summarize the results.

The corresponding sequential implementation would look something like:

library(boot)

run1 <- function(...) {
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 500, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
}

set.seed(123)
cd4.boot <- do.call(c, lapply(1:4, run1))
boot.ci(cd4.boot, type = c("norm", "basic", "perc"),
        conf = 0.9, h = atanh, hinv = tanh)

We notice a few things about these two code snippets. First of all, in the parallel code, there are two library(boot) calls; one in the main code and one inside the run1() function. The reason for this is to make sure that the boot package is also attached in the parallel, background R session when run1() is called there. The boot package defines the boot.ci() function, as well as the boot() function and the cd4 data.frame – both used inside run1(). If boot is not attached inside the function, we would get an error on "object 'cd4' not found" when running the parallel code. In contrast, we do not need to do this in the sequential code. Also, if we later would turn our parallel script into a package, then R CMD check would complain if we kept the library(boot) call inside the run1() function.

Second, the example uses MASS::mvrnorm() in run1(). The reason for this is related to the above – if we use only mvrnorm(), we need to attach the MASS package using library(MASS) and also do so inside run1(). Since there is only one MASS function called, it’s easier and neater to use the form MASS::mvrnorm().

Third, the random-seed setup differs between the sequential and the parallel approach.

In summary, in order to turn the sequential script into a script that parallelizes using the parallel package, we would have to not only rewrite parts of the code but also be aware of important differences in order to avoid getting run-time errors due to missing packages or global variables.

One of the objectives of the future.apply package, and the future ecosystem in general, is to make transitions from writing sequential code to writing parallel code as simple and frictionless as possible.

Here is the same example parallelized using the future.apply package:

library(future.apply)
plan(multiprocess, workers = 4) ## Parallelize using four cores

library(boot)

run1 <- function(...) {
  cd4.rg <- function(data, mle) MASS::mvrnorm(nrow(data), mle$m, mle$v)
  cd4.mle <- list(m = colMeans(cd4), v = var(cd4))
  boot(cd4, corr, R = 500, sim = "parametric", ran.gen = cd4.rg, mle = cd4.mle)
}

set.seed(123)
cd4.boot <- do.call(c, future_lapply(1:4, run1, future.seed = TRUE))
boot.ci(cd4.boot, type = c("norm", "basic", "perc"),
        conf = 0.9, h = atanh, hinv = tanh)

The difference between the sequential base-R implementation and the future.apply implementation is minimal. The future.apply package is attached, the parallel plan of four workers is set up, and the lapply() call is replaced by future_lapply(), where we specify future.seed = TRUE to get statistically sound and numerically reproducible parallel random number generation (RNG).
More importantly, notice how there is no need to worry about which packages need to be attached on the workers and which global variables need to be exported. That is all taken care of automatically by the future framework.

Q&A

Q. What are my options for parallelization?
A. Everything in future.apply is processed through the future framework. This means that all parallelization backends supported by the parallel package are supported out of the box, e.g. on your local machine, and on local or remote ad-hoc compute clusters (also in the cloud). Additional parallelization and distribution schemas are provided by backends such as future.callr (parallelization on your local machine) and future.batchtools (large-scale parallelization via HPC job schedulers). For other alternatives, see the CRAN Page for the future package and the High-Performance and Parallel Computing with R CRAN Task View.

Q. Righty-oh, so how do I specify which parallelization backend to use?
A. A fundamental design pattern of the future framework is that the end user decides how and where to parallelize while the developer decides what to parallelize. This means that you do not specify the backend via some argument to the future_nnn() functions. Instead, the backend is specified by the plan() function – you can almost think of it as a global option that the end user controls. For example, plan(multiprocess) will parallelize on the local machine, so will plan(future.callr::callr), whereas plan(cluster, workers = c("n1", "n2", "remote.server.org")) will parallelize on two local machines and one remote machine. Using plan(future.batchtools::batchtools_sge) will distribute the processing on your SGE-supported compute cluster. BTW, you can also have nested parallelization strategies, e.g. plan(list(tweak(cluster, workers = nodes), multiprocess)) where nodes = c("n1", "n2", "remote.server.org").
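A small sketch of that division of labour, with the same developer code run under two end-user plans (the toy function is mine):

library(future.apply)
slow_sqrt <- function(x) { Sys.sleep(0.1); sqrt(x) }

plan(sequential)                   ## end-user choice: run sequentially
r1 <- future_sapply(1:8, slow_sqrt)

plan(multiprocess, workers = 2)    ## same code, now on two local workers
r2 <- future_sapply(1:8, slow_sqrt)

identical(r1, r2)                  ## the developer's code did not change
## [1] TRUE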

Q. What about load balancing?
A. The default behavior of all functions is to distribute equally-sized chunks of elements to each available background worker – such that each worker processes exactly one chunk (= one future). If the processing times vary significantly across chunks, you can increase the average number of chunks processed by each worker, e.g. to have them process two chunks on average, specify future.scheduling = 2.0. Alternatively, you can specify the number of elements processed per chunk, e.g. future.chunk.size = 10L (an analog to the chunk.size argument added to the parallel package in R 3.5.0).
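For instance (the values are arbitrary):

## on average two chunks (futures) per worker
y <- future_lapply(1:100, sqrt, future.scheduling = 2.0)

## or fix the number of elements per chunk directly
y <- future_lapply(1:100, sqrt, future.chunk.size = 10L)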

Q. What about random number generation (RNG)? I’ve heard it’s tricky to get right when running in parallel.
A. Just add future.seed = TRUE and you’re good. This will use the parallel-safe and statistically sound L’Ecuyer-CMRG RNG, a well-established parallel RNG algorithm that is also used by the parallel package. The future.apply functions use this in a way that is also invariant to the future backend and the amount of “chunking” used. To produce numerically reproducible results, call set.seed(123) beforehand (as in the above example), or simply use future.seed = 123.
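A tiny sketch of the two equivalent ways just mentioned (the rnorm() draw is only for illustration):

set.seed(123)
a <- future_lapply(1:4, function(i) rnorm(1), future.seed = TRUE)

## or pass the seed directly
b <- future_lapply(1:4, function(i) rnorm(1), future.seed = 123)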

Q. What about global variables? Whenever I’ve tried to parallelize code before, I often ran into errors on “this or that variable is not found”.
A. This is very rarely a problem when using the future framework – things work out of the box. Global variables and packages needed are automatically identified from static code inspection and passed on to the workers – even when the workers run on remote computers or in the cloud.

Happy futuring!



Thanks for Reading!

Fri, 06/22/2018 - 22:44

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

As I’ve been blogging more about statistics, R, and research in general, I’ve been trying to increase my online presence, sharing my blog posts in groups of like-minded people. Those efforts seem to have paid off, based on my view counts over the past year:

And based on read counts, here are my top 10 blog posts, most of which are stats-related:

  1. Beautiful Asymmetry – none of us is symmetrical, and that’s okay 
  2. Statistical Sins: Stepwise Regression – just step away from stepwise regression
  3. Statistics Sunday: What Are Degrees of Freedom? (Part 1) – and read Part 2 here
  4. Working with Your Facebook Data in R
  5. Statistics Sunday: Free Data Science and Statistics Resources
  6. Statistics Sunday: What is Bootstrapping?
  7. Statistical Sins: Know Your Variables (A Confession) – we all make mistakes, but we should learn from them
  8. Statistical Sins: Not Making it Fun (A Thinly Veiled Excuse to Post a Bunch of XKCD Cartoons) – the subtitle says it all
  9. Statistics Sunday: Taylor Swift vs. Lorde – Analyzing Song Lyrics – analyzing song lyrics is my jam
  10. How Has Taylor Swift’s Word Choice Changed Over Time? – ditto

It’s so nice to see people are enjoying the posts, even sharing them and reaching out with additional thoughts and questions. Thanks, readers!



A guide to working with character data in R

Fri, 06/22/2018 - 19:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R is primarily a language for working with numbers, but we often need to work with text as well. Whether it’s formatting text for reports, or analyzing natural language data, R provides a number of facilities for working with character data. Handling Strings with R, a free (CC-BY-NC-SA) e-book by UC Berkeley’s Gaston Sanchez, provides an overview of the ways you can manipulate characters and strings with R. 

There are many useful sections in the book.

Note that the book does not cover analysis of natural language data, for which you might want to check out the CRAN Task View on Natural Language Processing or the book Text Mining with R: A Tidy Approach. It’s also sadly silent on the topic of character encoding in R, a topic that often causes problems when dealing with text data, especially from international sources. Nonetheless, the book is a really useful overview of working with text in R, and has been updated extensively since it was last published in 2014. You can read Handling Strings with R at the link below.

Gaston Sanchez: Handling Strings with R



Using DataCamp’s Autograder to Teach R

Fri, 06/22/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Immediate and personalized feedback has been central to the learning experience on DataCamp since we launched the first courses. If students submit code that contains a mistake, they are told where they made a mistake, and how they can fix this. You can play around with it in our free Introduction to R course. The screenshot below is from our Intermediate R course.

To check submissions and generate feedback, every exercise on DataCamp features a so-called Submission Correctness Test, or SCT.

The SCT is a script of custom tests that assesses the code students submitted and the output and variables they created with their code. For every language that we teach on DataCamp, we have built a corresponding open-source package to easily verify all these elements of a student submission. For R exercises, this package is called testwhat. Over the years, we’ve added a lot of functions to:

  • check variable assignment
  • check function calls and results of function calls
  • check if statements, for loops, and while loops
  • check function definition
  • check whether the proper packages are loaded
  • check the output the student generated
  • check ggplot and ggvis plotting calls

When these checking functions spot a mistake, they will automatically generate a meaningful feedback message that points students to their mistakes. You can also specify custom feedback messages that override these automatically generated messages.

Historically, testwhat was closely linked with our proprietary backend that executes R code on DataCamp’s servers. Even though testwhat has always been open source, it wouldn’t work well without this custom backend and testwhat could only be used in the context of DataCamp. Today, however, testwhat can be used independently and supports other use cases as well. You can leverage everything testwhat has to offer to test student submissions, even when your format of teaching is very different from DataCamp’s. You can install the package from GitHub:

library(remotes) # devtools works fine too
install_github('datacamp/testwhat')

As a quick demo, assume that you ask students to create a variable x equal to 5. A string version of the ideal solution would be something like this:

solution_code <- 'x <- 5'

Now suppose that the student submitted a script where x is incorrectly set to be a vector of two values, which can be coded up as follows:

student_code <- 'x <- c(4, 5)'

testwhat features a function called setup_state() that executes a student submission and a solution script in separate environments and captures the output and errors the student code generated. All of this information is stored in the so-called exercise state that you can access with ex():

library(testwhat)
setup_state(stu_code = student_code, sol_code = solution_code)
ex()
##

This exercise state can be passed to the wide range of checking functions that testwhat features with the piping syntax from magrittr. To check whether the student defined a variable x, you can use check_object():

ex() %>% check_object('x')
##

This code runs fine, because the student actually defined a variable x. To continue and check whether the student defined the variable x correctly, you can use check_equal():

ex() %>% check_object('x') %>% check_equal()
## Error in check_that(is_equal(student_obj, solution_obj, eq_condition), : The contents of the variable `x` aren't correct. It has length 2, while it should have length 1.

This errors out. check_equal() detects that the value of x in the student environment (a vector) does not match the value of x in the solution environment (a single number). Notice how the error message contains a human-readable message that describes this mistake.

The above example was interactive: you set up a state, and then go on typing chains of check functions. On a higher level, you can use run_until_fail() to wrap around the checking code. This will run your battery of tests until it fails:

library(testwhat)
setup_state(stu_code = 'x <- 4', sol_code = 'x <- 5')
res <- run_until_fail({
  ex() %>% check_object('x') %>% check_equal()
})
res$correct
## [1] FALSE
res$message
## [1] "The contents of the variable `x` aren't correct."

This basic approach of building an exercise state with a student submission and solution and then using run_until_fail() to execute a battery of tests can easily be embedded in an autograding workflow that fits your needs. Suppose your students have submitted their assignments in the form of an R script that you have downloaded onto your computer in the folder submissions (we have prepopulated the folder with some random data). We have a bunch of R scripts in the submissions folder. Each R script contains a student submission:

dir('submissions')
##  [1] "student_a.R" "student_b.R" "student_c.R" "student_d.R" "student_e.R"
##  [6] "student_f.R" "student_g.R" "student_h.R" "student_i.R" "student_j.R"
readLines('submissions/student_a.R')
## [1] "x <- 4"
readLines('submissions/student_b.R')
## [1] "x <- 5"

We can now cycle through all of these submissions, and generate a data.frame that specifies whether or not the submission was correct per student:

library(testwhat)

folder_name <- 'submissions'
all_files <- file.path(folder_name, dir(folder_name))

student_results <- sapply(all_files, function(file) {
  student_code <- readLines(file)
  setup_state(sol_code = 'x <- 5', stu_code = student_code)
  res <- run_until_fail({
    ex() %>% check_object('x') %>% check_equal()
  })
  res$correct
})

results_df <- data.frame(name = all_files,
                         correct = student_results,
                         row.names = NULL,
                         stringsAsFactors = FALSE)
results_df
##                       name correct
## 1  submissions/student_a.R   FALSE
## 2  submissions/student_b.R    TRUE
## 3  submissions/student_c.R    TRUE
## 4  submissions/student_d.R    TRUE
## 5  submissions/student_e.R   FALSE
## 6  submissions/student_f.R    TRUE
## 7  submissions/student_g.R   FALSE
## 8  submissions/student_h.R   FALSE
## 9  submissions/student_i.R    TRUE
## 10 submissions/student_j.R    TRUE

This small demo is still a bit rough, but I hope it brings the point across: with some R fiddling to build the appropriate inputs for setup_state(), you can put all of testwhat’s checking functions to work. The tests you specify in run_until_fail() will both verify the students’ work and generate meaningful feedback to point them to the mistakes they are making.

To learn more about testwhat, you can visit the GitHub repo or the package documentation, generated with pkgdown. It features both vignettes that outline how to use certain checking functions and reference documentation that describes all the arguments you can specify. Don’t hesitate to raise issues on GitHub or reach out through content-engineering@datacamp.com – we love feedback! Happy teaching!

This blog was generated with RMarkdown. You can find the source here.



Melt and cast the shape of your data.frame – Exercises

Fri, 06/22/2018 - 15:56

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

 

Datasets often arrive in a form that is different from what our modelling or visualisation functions need, and those functions in turn don't necessarily require the same format as each other.

Reshaping data.frames is a step that all analysts need but many struggle with. Practicing this meta-skill will, in the long run, leave more time to focus on the actual analysis.

The solutions to this set will rely on data.table, mostly melt() and dcast(), which are originally from the reshape2 package. However, you can also get practice out of it using your favourite base-R, tidyverse or any other method and then compare the results.
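If melt() and dcast() are new to you, here is a minimal warm-up sketch with toy data of my own (not one of the exercises below):

library(data.table)

dt <- data.table(id = 1:2, a = c(10, 20), b = c(30, 40))

# wide -> long
long <- melt(dt, id.vars = "id", variable.name = "key", value.name = "value")

# long -> wide again
wide <- dcast(long, id ~ key, value.var = "value")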

Solutions are available here.

 

Exercise 1

Take the following data.frame from this form

df <- data.frame(id = 1:2,
                 q1 = c("A", "B"),
                 q2 = c("C", "A"),
                 stringsAsFactors = FALSE)
df
  id q1 q2
1  1  A  C
2  2  B  A

to this

  id question value
1  1       q1     A
2  2       q1     B
3  1       q2     C
4  2       q2     A

 

Exercise 2

Do the opposite; return the data.frame back to its original form.

Exercise 3

Set up the data.frame in terms of questions, such as the following:

  question id_1 id_2
1       q1    A    B
2       q2    C    A

 

Exercise 4

The data entry behind this data.frame went a little bit wrong. Get all the C and B entries into their corresponding columns.

df2 <- data.frame(
  A = c("A1", "A12", "A31", "A4"),
  B = c("B4", "C7", "C3", "B9"),
  C = c("C3", "B16", "B3", "C4")
)

 

Exercise 5

Get this data.frame

df3 <- data.frame(
  Join_ID = rep(1:3, each = 2),
  Type = rep(c("a", "b"), 3),
  v2 = c(8, 9, 7, 6, 5, 4)*10
)

 

To look like this:

  Join_ID a_v2 b_v2
1       1   80   90
2       2   70   60
3       3   50   40

 

Exercise 6

Revisiting a dataset used in an earlier exercise set on data exploration;
load the AER package and run the command data("Fertility") which loads the dataset Fertility to your workspace.
Melt it into the following format, with one row per child.

head(ferl)
  morekids age afam hispanic other work mother_id order gender
1       no  27   no       no    no    0         1     1   male
2       no  30   no       no    no   30         2     1 female
3       no  27   no       no    no    0         3     1   male
4       no  35  yes       no    no    0         4     1   male
5       no  30   no       no    no   22         5     1 female
6       no  26   no       no    no   40         6     1   male

 

Exercise 7

Take this

d1 = data.frame(
  ID = c(1,1,1,2,2,4,1,2),
  medication = c(1,2,3,1,2,7,2,8)
)
d1
  ID medication
1  1          1
2  1          2
3  1          3
4  2          1
5  2          2
6  4          7
7  1          2
8  2          8

to this form:

   ID medications
1:  1  1, 2, 3, 2
2:  2     1, 2, 8
3:  4           7

 

Note the solution doesn’t use melt() nor dcast(), so you might look at other options.

Exercise 8

Get this

dfs <- data.frame(
  Name = c(rep("name1", 3), rep("name2", 2)),
  MedName = c("atenolol 25mg", "aspirin 81mg", "sildenafil 100mg",
              "atenolol 50mg", "enalapril 20mg")
)
dfs
   Name          MedName
1 name1    atenolol 25mg
2 name1     aspirin 81mg
3 name1 sildenafil 100mg
4 name2    atenolol 50mg
5 name2   enalapril 20mg

 

Into the following format:

    Name  medication_1   medication_2     medication_3
1: name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
2: name2 atenolol 50mg enalapril 20mg

 

Exercise 9

Get the following data.frame organized in standard form

df7 <- data.table(
  v1 = c("name1, name2", "name3", "name4, name5"),
  v2 = c("1, 2", "3", "4, 5"),
  v3 = c(1, 2, 3)
)
df7
             v1   v2 v3
1: name1, name2 1, 2  1
2:        name3    3  2
3: name4, name5 4, 5  3

 

Expected output:

      v1 v2 v3
1: name1  1  1
2: name2  2  1
3: name3  3  2
4: name4  4  3
5: name5  5  3

 

The solution doesn’t use melt() nor dcast() and can be suprisingly hard.

Exercise 10

Convert this:

df <- data.frame(
  Method = c("10.fold.CV Lasso", "10.fold.CV.1SE", "BIC", "Modified.BIC"),
  n = c(30, 30, 50, 50, 50, 50, 100, 100),
  lambda = c(1, 3, 1, 2, 2, 0, 1, 2),
  df = c(21, 17, 29, 26, 25, 32, 34, 32)
)
> df
            Method   n lambda df
1 10.fold.CV Lasso  30      1 21
2   10.fold.CV.1SE  30      3 17
3              BIC  50      1 29
4     Modified.BIC  50      2 26
5 10.fold.CV Lasso  50      2 25
6   10.fold.CV.1SE  50      0 32
7              BIC 100      1 34
8     Modified.BIC 100      2 32

Into

            Method lambda_30 lambda_50 lambda_100 df_30 df_50 df_100
1 10.fold.CV Lasso         1         2               21    25
2   10.fold.CV.1SE         3         0               17    32
3              BIC                   1          1          29     34
4     Modified.BIC                   2          2          26     32

 

(Image by Joe Alterio)

Related exercise sets:
  1. Spatial Data Analysis: Introduction to Raster Processing (Part 1)
  2. Advanced Techniques With Raster Data – Part 3: Exercises
  3. Advanced Techniques With Raster Data: Part 1 – Unsupervised Classification
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory


Creating Slopegraphs with R

Fri, 06/22/2018 - 14:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Presenting data results in the most informative and compelling manner is part of the role of the data scientist. It's all well and good to master the arcana of some algorithm, to manipulate and master the numbers and bend them to your will to produce a “solution” that is both accurate and useful. But, those activities are typically in pursuit of informing some decision or at least providing information that serves a purpose. So taking those results and making them compelling and understandable by your audience is part of your job!

This article will focus on my efforts to develop an R function that is designed to automate the process of producing a Tufte-style slopegraph using ggplot2 and dplyr. Tufte is often considered one of the pioneers of data visualization, and you are likely to have been influenced by techniques he championed, such as the sparkline. I've been aware of slopegraphs for quite some time as an excellent visualization technique for some situations. So when I saw the article from Murtaza Haider titled “Edward Tufte’s Slopegraphs and political fortunes in Ontario” I just had to take a shot at writing a function.

To make it a little easier to get started with the function I have taken the liberty of providing a couple of small datasets for you to practice with, please see ?newcancer and ?newgdp.

Installation and setup

Long term I'll try to ensure the version on CRAN is well maintained, but for now you're better served by grabbing the current version from GitHub, since I tend to put the latest features and fixes there in between pushes to CRAN. I've provided the instructions for installing both, commented out below.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Install from CRAN
# install.packages("CGPfunctions")
# Or the development version from GitHub
# install.packages("devtools")
# devtools::install_github("ibecav/CGPfunctions")
library(CGPfunctions)

Simple examples

If you're unfamiliar with slopegraphs or just want to see what the display is all about, the dataset I've provided can get you started in one line:

newggslopegraph(newcancer, Year, Survival, Type)

Gives this plot:

Slopegraphs are designed to provide maximum information with “minimum ink”. Some key features are:

  • Scaling – this function plots to scale; a big gap indicates a big difference.
  • Names of the items on both the left-hand and right-hand axes are aligned, to make vertical scanning of the items’ names easier. Rank ordering is easily understood as well as change in rank over time.
  • Trends are easily understood over time via the slope. Many suggest using a thin, light gray line to connect the data. A too-heavy line is unnecessary and will make the chart harder to read. If the chart features many lines, judicious use of color can help.
  • A table (with more statistical detail) might be a good complement to use alongside the slopegraph. As Tufte notes: “The data table and the slopegraph are colleagues in explanation not competitors. One display can serve some but not all functions.”

Optionally you can provide important label information through the Title, Subtitle, and Caption arguments. You can suppress them altogether by setting them to NULL, but since I think they are very important, the default behavior is to gently remind you that you have not provided any information. Let's provide a title and sub-title but skip the caption.

newggslopegraph(dataframe = newcancer,
                Times = Year,
                Measurement = Survival,
                Grouping = Type,
                Title = "Estimates of Percent Survival Rates",
                SubTitle = "Based on: Edward Tufte, Beautiful Evidence, 174, 176.",
                Caption = NULL
                )

Gives this plot:

How it all works

It's all well and good to get the little demo to work, but it might be useful for you to understand how to extend it out to data you're interested in.

You'll need a dataframe with at least three columns; a minimal made-up example follows the list below. The function will do some basic error checking and complain if you don't hit the essentials.

  • Times is the column in the dataframe that corresponds to the x axis of the plot and is normally a set of moments in time expressed as characters, factors or ordered factors (in our case newcancer$Year). If it is truly time series data (especially with a lot of dates) you're much better off using an R function purpose-built for time series. In newcancer it's an ordered factor, mainly because if we fed the information in as character the sort order would be Year 10, Year 15, Year 20, Year 5, which is very confusing. A command like newcancer$Year <- factor(newcancer$Year, levels = c("Year.5", "Year.10", "Year.15", "Year.20"), labels = c("5 Year", "10 Year", "15 Year", "20 Year"), ordered = TRUE) is the way to force things the way you want them.
  • Measurement is the column that has the actual numbers you want to display along the y axis. Frequently that's a percentage, but it could just as easily be any number. Watch out for scaling issues here; you'll want to ensure the scale isn't disparate. In our case newcancer$Survival is the percentage of patients surviving at that point in time, so the maximum scale is 0 to 100.
  • Grouping is what controls how many individual lines are portrayed. Every attempt is made to color them and label them in ways that lead to clarity, but eventually you can have too many. In our example case the column is newcancer$Type, for the type of cancer or location.
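For instance, a minimal made-up dataframe with just the three required columns (the numbers mean nothing):

df <- data.frame(
  Times       = factor(rep(c("Before", "After"), each = 3),
                       levels = c("Before", "After"), ordered = TRUE),
  Measurement = c(10, 20, 30, 15, 18, 40),
  Grouping    = rep(c("A", "B", "C"), times = 2)
)
newggslopegraph(df, Times, Measurement, Grouping)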

As a sidenote, if you're interested in how the function was built and some of the underlying R programming please consider reading this other post about building the function step by step.

Another quick example

This is loosely based off a blog post from Murtaza Haider titled Edward Tufte’s Slopegraphs and political fortunes in Ontario. In this case we're going to plot the percent of the vote captured by some Canadian political parties. “The data is loosely based on real data but is not actually accurate”

moredata$Date is the hypothetical polling date as a factor (in this case character would work equally well). moredata$Party is the various political parties and moredata$Pct is the percentage of the vote they are estimated to have.

moredata <- structure(list(Date = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
                                            .Label = c("11-May-18", "18-May-18", "25-May-18"),
                                            class = "factor"),
                           Party = structure(c(5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L, 5L, 3L, 2L, 1L, 4L),
                                             .Label = c("Green", "Liberal", "NDP", "Others", "PC"),
                                             class = "factor"),
                           Pct = c(42.3, 28.4, 22.1, 5.4, 1.8, 41.9, 29.3, 22.3, 5, 1.4, 41.9, 26.8, 26.8, 5, 1.4)),
                      class = "data.frame", row.names = c(NA, -15L))

newggslopegraph(moredata, Date, Pct, Party,
                Title = "Notional data",
                SubTitle = NULL,
                Caption = NULL)
#>
#> Converting 'Date' to an ordered factor

Gives this plot:

There are a plethora of formatting options. See ?newggslopegraph for all of them. Here's a few.

newggslopegraph(moredata, Date, Pct, Party,
                Title = "Notional data",
                SubTitle = "none",
                Caption = "imaginary",
                LineColor = "gray",
                LineThickness = .5,
                YTextSize = 4
                )

Converting 'Date' to an ordered factor

Gives this plot:

The most complex is LineColor where you can do the following if you want to highlight the difference between the Liberal and NDP parties while making the other three less prominent…

newggslopegraph(moredata, Date, Pct, Party,
                Title = "Notional data",
                SubTitle = "none",
                Caption = "imaginary",
                LineColor = c("Green" = "gray", "Liberal" = "green", "NDP" = "red",
                              "Others" = "gray", "PC" = "gray"),
                LineThickness = .5,
                YTextSize = 4
                )
#>
#> Converting 'Date' to an ordered factor

Gives this plot:

One last set of data

Also from Tufte, this is data about a select group of countries' Gross Domestic Product (GDP). I'll use it to show you a tricky way to highlight certain countries without building a named vector: with LineColor = c(rep("gray",3), "red", rep("gray",3), "red", rep("gray",10)) the excess vector entries are silently dropped… The bottom line is that LineColor is simply a character vector that you can fill any way you choose.

newggslopegraph(newgdp, Year, GDP, Country,
                Title = "Gross GDP",
                SubTitle = NULL,
                Caption = NULL,
                LineThickness = .5,
                YTextSize = 4,
                LineColor = c(rep("gray",3), "red", rep("gray",3), "red", rep("gray",10))
                )

Gives this plot:

Finally, let me take a moment about crowding and labeling. I've made every effort to try and deconflict the labels on the left and right axis (in this example the Country) and that should work automatically as you resize your plot dimensions. ** pro tip – if you use RStudio you can press the zoom icon and then use the rescaling of the window to see best choices **.

But the numbers (GDP) are a different matter, and there's no easy way to ensure separation in a case like this data. There's a decent total spread from 57.4 to 20.7 and some really close measurements like France, Belgium, and Germany on the right side. My suggestion in a case like this is to create a new column in your dataframe rounded to two significant digits. So specifically it would be newgdp$rGDP <- signif(newgdp$GDP, 2). In my testing, at least, I've found this helps without creating inaccuracy and without forcing you to “stretch” vertically to disambiguate the numbers. This time I'll also use LineColor to highlight how Canada, Finland and Belgium fare from 1970 to 1979.

newgdp$rGDP <- signif(newgdp$GDP, 2)

newggslopegraph(newgdp, Year, rGDP, Country,
                Title = "Gross GDP",
                SubTitle = NULL,
                Caption = NULL,
                LineThickness = .5,
                YTextSize = 4,
                LineColor = c(rep("gray",6), rep("red",2), "red", rep("gray",10))
                )

Gives this plot:

One last example and the latest feature

Returning to the cancer dataset we initially used, I recently added a new feature that is only available from the GitHub version of the library. It's called WiderLabels and is a simple logical set to FALSE by default. If you change it to TRUE, as I have in the next example, it will expand the x axis and essentially give you more room for the side labels. This can be very useful in cases like the cancer data where you have a few long, complex labels. Here's a made-up example where I want to draw the reader's attention to certain cancers which appear to have a more precipitous decline in survival over time.

newggslopegraph(dataframe = newcancer,
                Times = Year,
                Measurement = Survival,
                Grouping = Type,
                Title = "Estimates of Percent Survival Rates",
                SubTitle = "Based on: Edward Tufte, Beautiful Evidence, 174, 176.",
                Caption = NULL,
                YTextSize = 2.5,
                LineThickness = .5,
                LineColor = c("red", rep("gray",4), "red", rep("gray",3)),
                WiderLabels = TRUE
                )
#>
#> You gave me 9 colors I'm recycling colors because you have 24 Types

Gives this plot:

If you like newggslopegraph, please consider filing a GitHub issue by leaving feedback at my GitHub, or by posting a comment here.

    Related Post

    1. How to use paletteR to automagically build palettes from pictures
    2. Visualize Market Basket analysis in R
    3. Lattice-Like Forest Plot using ggplot2 in R
    4. How does visualization in Plotly differ from Seaborn
    5. Twitter Analysis with Python


    Parallelizing Linear Regression or Using Multiple Sources

    Fri, 06/22/2018 - 06:31

    (This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

    My previous post explained how it is mathematically possible to parallelize the computation needed to estimate the parameters of a linear regression. More specifically, we have an \(n \times k\) matrix \(X\) and an \(n\)-dimensional vector \(y\), and we want to compute the least-squares estimate \(\widehat{\beta} = (X^\top X)^{-1} X^\top y\) by splitting the job. Instead of using all \(n\) observations at once, we've seen that it is possible to compute “something” using the first rows, then the next rows, etc. Then, finally, we “aggregate” the objects created to get our overall estimate.

    Parallelizing on multiple cores

    Let us see how it works from a computational point of view, running each computation on a different core of the machine. Each core will act as a slave, computing what we've seen in the previous post. Here, the data we use are

    y = cars$dist
    X = data.frame(1,cars$speed)
    k = ncol(X)

    On my laptop, I have three cores available, so we will split the job into chunks

    library(parallel)
    library(pbapply)
    ncl = detectCores()-1
    cl = makeCluster(ncl)

    This is more or less what we will do: we have our dataset, and we split the jobs across the cores.

    We can then create lists containing elements that will be sent to each core, as Ewen suggested,

    chunk = function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
    a_parcourir = chunk(seq_len(nrow(X)), ncl)
    for(i in 1:length(a_parcourir)) a_parcourir[[i]] = rep(i, length(a_parcourir[[i]]))
    Xlist = split(X, unlist(a_parcourir))
    ylist = split(y, unlist(a_parcourir))

    It is also possible to simplify the QR functions we will use

    compute_qr = function(x){
      list(Q=qr.Q(qr(as.matrix(x))), R=qr.R(qr(as.matrix(x))))
    }
    get_Vlist = function(j){
      Q3 = QR1[[j]]$Q %*% Q2list[[j]]
      t(Q3) %*% ylist[[j]]
    }
    clusterExport(cl, c("compute_qr", "get_Vlist"), envir=environment())

    Then, we can run our functions on each core. The first one is

    QR1 = parLapply(cl=cl, Xlist, compute_qr)

    note that it is also possible to use

    QR1 = pblapply(Xlist, compute_qr, cl=cl)

    which will include a progress bar (that can be nice when the database is rather large). Then use

    R1 = pblapply(QR1, function(x) x$R, cl=cl) %>% do.call("rbind", .)
    Q1 = qr.Q(qr(as.matrix(R1)))
    R2 = qr.R(qr(as.matrix(R1)))
    Q2list = split.data.frame(Q1, rep(1:ncl, each=k))
    clusterExport(cl, c("QR1", "Q2list", "ylist"), envir=environment())
    Vlist = pblapply(1:length(QR1), get_Vlist, cl=cl)
    sumV = Reduce('+', Vlist)

    and finally the output is

    solve(R2) %*% sumV
             [,1]
    X1 -17.579095
    X2   3.932409

    which is what we were expecting…
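    As a sanity check (not in the original post), these coefficients match the standard in-memory least-squares fit on the cars data:

    # Standard (non-parallel) least-squares fit for comparison
    coef(lm(dist ~ speed, data = cars))
    ## (Intercept)       speed
    ##  -17.579095    3.932409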

    Using multiple sources

    In practice, it might also happen that various “servers” have the data, but we cannot get a copy. It is, however, possible to run some functions on their servers and get some output that we can use afterwards.

    Datasets are supposed to be available somewhere. We can send a request and get a matrix back. Then we aggregate all of them, and send another request. That's what we will do here. Each provider should run a function on their part of the data; that function will return the matrix qr.R(qr(X_j)) for their chunk. More precisely, to the first provider, send

    function1 = function(subX){
      return(qr.R(qr(as.matrix(subX))))}
    R1 = function1(Xlist[[1]])

    and actually, send that function to all providers, and aggregate the output

    # m denotes the number of providers (here, m = length(Xlist))
    for(j in 2:m) R1 = rbind(R1, function1(Xlist[[j]]))

    Then create the following objects on your side

    Q1 = qr.Q(qr(as.matrix(R1)))
    R2 = qr.R(qr(as.matrix(R1)))
    Q2list = list()
    for(j in 1:m) Q2list[[j]] = Q1[(j-1)*k+1:k,]

    Finally, contact the providers one last time, and send them one of your objects

    function2 = function(subX, suby, Q){
      Q1 = qr.Q(qr(as.matrix(subX)))
      Q2 = Q
      return(t(Q1 %*% Q2) %*% suby)}

    Each provider should then run this function on their part of the data, passing also the matrix Q2list[[j]] as an argument (which we computed on our own side); the function returns the vector t(Q1 %*% Q2) %*% suby. For instance, ask the first provider to run

sumV = function2(Xlist[[1]], ylist[[1]], Q2list[[1]])

    and do the same with all providers

for(j in 2:m) sumV = sumV + function2(Xlist[[j]], ylist[[j]], Q2list[[j]])

solve(R2) %*% sumV
         [,1]
X1 -17.579095
X2   3.932409

    which is what we were expecting…

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

    Announcing new software review editors: Anna Krystalli and Lincoln Mullen

    Fri, 06/22/2018 - 02:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    Part of rOpenSci’s mission is to create technical infrastructure in the form of carefully vetted R software tools that lower barriers to working with data sources on the web. Our open peer software review system for community-contributed tools is a key component of this. As the rOpenSci community grows and more package authors submit their work for peer review, we need to expand our editorial board to maintain a speedy process. As our recent post shows, package submissions have grown every year since we started this experiment, and we see no reason they will slow down!

    Editors manage the review process, performing initial package checks, identifying reviewers, and moderating the process until the package is accepted by reviewers and transferred to rOpenSci. Anna Krystalli and Lincoln Mullen have both served as guest editors for rOpenSci and now they join as full members of the editorial board with Scott Chamberlain, Karthik Ram, Noam Ross and Maëlle Salmon. As a Research Software Engineer for University of Sheffield RSE, Anna brings the experience of working at the intersection of science and software development that we need. Lincoln has worked broadly in reproducibility for the digital humanities, and led editorial for specialized text analysis packages on an ad-hoc basis.

    Anna Krystalli is a Research Software Engineer at University of Sheffield with a PhD in marine macroecology. She has helped organize Reprohacks where teams try to reproduce papers nominated by their authors from supplied code and data; is one of Mozilla Foundation’s “50 People Who Made the Internet a Better Place in 2016”; and she co-organizes the Sheffield R Users Group. Anna reviewed the codemetar and rdflib packages for rOpenSci. During her second package review, Anna started work on an experimental reviewing workflow package, pkgreviewr.

    The overall goals of rOpenSci are fully aligned with my interests and passions, both personally and also professionally as a Research Software Engineer, tasked with helping researchers make the most of their code and data. It’s a great opportunity to learn more about the incredibly active and constantly developing space of research software development in R which will ultimately benefit the communities I serve.

    Anna on GitHub, Twitter, website

    Lincoln Mullen is an Assistant Professor at George Mason University with a PhD in history. He is a historian of American religious history and the nineteenth-century United States, with expertise in computational methods for texts and maps. Lincoln’s historical research that uses R includes an article on legal history published in the American Historical Review, the leading journal in history; America’s Public Bible, which won first-place in the National Endowment for the Humanities Chronicling America Data Challenge; and the NEH-funded project Mapping Early American Elections. Lincoln maintains six rOpenSci packages, including: tokenizers for fast, consistent tokenization of natural language text, textreuse for detecting text reuse and document similarity, and USAboundaries for historical and contemporary state, county, and Congressional district boundaries, as well as zip code tabulation area centroids.

    My software has benefitted from rOpenSci’s thorough peer reviews. I am glad to join the editorial team to help accomplish the rOpenSci mission of making software recognizable as scholarship.

    Lincoln on GitHub, Twitter, website

Want to learn more about rOpenSci software review?

    We’d love to see you soon in the onboarding repository as a submitter or reviewer!

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

    Idle thoughts lead to R internals: how to count function arguments

    Fri, 06/22/2018 - 01:55

    (This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

    “Some R functions have an awful lot of arguments”, you think to yourself. “I wonder which has the most?”

    It’s not an original thought: the same question as applied to the R base package is an exercise in the Functions chapter of the excellent Advanced R. Much of the information in this post came from there.

    There are lots of R packages. We’ll limit ourselves to those packages which ship with R, and which load on startup. Which ones are they?

    What packages load on starting R?
    Start a new R session and type search(). Here’s the result on my machine:


    search()
    [1] ".GlobalEnv" "tools:rstudio" "package:stats" "package:graphics" "package:grDevices"
    "package:utils" "package:datasets" "package:methods" "Autoloads" "package:base"

    We’re interested in the packages with priority = base. Next question:

    How can I see and filter for package priority?
    You don’t need dplyr for this, but it helps.

library(tidyverse)
installed.packages() %>% 
  as.tibble() %>% 
  filter(Priority == "base") %>% 
  select(Package, Priority)

# A tibble: 14 x 2
   Package   Priority
 1 base      base    
 2 compiler  base    
 3 datasets  base    
 4 graphics  base    
 5 grDevices base    
 6 grid      base    
 7 methods   base    
 8 parallel  base    
 9 splines   base    
10 stats     base    
11 stats4    base    
12 tcltk     base    
13 tools     base    
14 utils     base    
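For the record, the same filter can be done without dplyr, using plain matrix subsetting on the result of installed.packages() (a small base-R sketch of my own):

ip <- installed.packages()
# keep the rows whose Priority is "base"; %in% drops the NA priorities cleanly
ip[ip[, "Priority"] %in% "base", c("Package", "Priority")]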

    Comparing to the output from search(), we want to look at: stats, graphics, grDevices, utils, datasets, methods and base.

    How can I see all the objects in a package?
    Like this, for the base package. For other packages, just change base to the package name of interest.

    ls("package:base")

    However, not every object in a package is a function. Next question:

    How do I know if an object is a function?
    The simplest way is to use is.function().

is.function(ls)
[1] TRUE

    What if the function name is stored as a character variable, “ls”? Then we can use get():

    is.function(get("ls")) [1] TRUE

    But wait: what if two functions from different packages have the same name and we have loaded both of those packages? Then we specify the package too, using the pos argument.

    is.function(get("Position", pos = "package:base")) [1] TRUE is.function(get("Position", pos = "package:ggplot2")) [1] FALSE

    So far, so good. Now, to the arguments.

    How do I see the arguments to a function?
    Now things start to get interesting. In R, function arguments are called formals. There is a function of the same name, formals(), to show the arguments for a function. You can also use formalArgs() which returns a vector with just the argument names:

formalArgs(ls)
[1] "name"      "pos"       "envir"     "all.names" "pattern"   "sorted"

    But that won’t work for every function. Let’s try abs():

formalArgs(abs)
NULL

    The issue here is that abs() is a primitive function, and primitives don’t have formals. Our next two questions:

    How do I know if an object is a primitive?
    Hopefully you guessed that one:

is.primitive(abs)
[1] TRUE

    How do I see the arguments to a primitive?
    You can use args(), and you can pass the output of args() to formals() or formalArgs():

args(abs)
function (x) 
NULL
formalArgs(args(abs))
[1] "x"

    However, there are a few objects which are primitive functions for which this doesn’t work. Let’s not worry about those.

is.primitive(`:`)
[1] TRUE
formalArgs(args(`:`))
NULL
Warning message:
In formals(fun) : argument is not a function

    So what was the original question again?
    Let’s put all that together. We want to find the base packages which load on startup, list their objects, identify which are functions or primitive functions, list their arguments and count them up.

    We’ll create a tibble by pasting the arguments for each function into a comma-separated string, then pulling the string apart using unnest_tokens() from the tidytext package.

library(tidytext)
library(tidyverse)

pkgs <- installed.packages() %>% 
  as.tibble() %>% 
  filter(Priority == "base", 
         Package %in% c("stats", "graphics", "grDevices", "utils", "datasets", "methods", "base")) %>% 
  select(Package) %>% 
  rowwise() %>% 
  mutate(fnames = paste(ls(paste0("package:", Package)), collapse = ",")) %>% 
  unnest_tokens(fname, fnames, token = stringr::str_split, pattern = ",", to_lower = FALSE) %>% 
  filter(is.function(get(fname, pos = paste0("package:", Package)))) %>% 
  mutate(is_primitive = ifelse(is.primitive(get(fname, pos = paste0("package:", Package))), 1, 0),
         num_args = ifelse(is.primitive(get(fname, pos = paste0("package:", Package))),
                           length(formalArgs(args(fname))),
                           length(formalArgs(fname)))) %>% 
  ungroup()

    That throws out a few warnings where, as noted, args() doesn’t work for some primitives.

    And the winner is –

    pkgs %>% top_n(10) %>% arrange(desc(num_args)) Selecting by num_args # A tibble: 10 x 4 Package fname is_primitive num_args 1 graphics legend 0 39 2 graphics stars 0 33 3 graphics barplot.default 0 30 4 stats termplot 0 28 5 utils read.table 0 25 6 stats heatmap 0 24 7 base scan 0 22 8 graphics filled.contour 0 21 9 graphics hist.default 0 21 10 stats interaction.plot 0 21

    – the function legend() from the graphics package, with 39 arguments. From the base package itself, scan(), with 22 arguments.

    Just to wrap up, some histograms of argument number by package, suggesting that the base graphics functions tend to be the more verbose.

    pkgs %>% ggplot(aes(num_args)) + geom_histogram() + facet_wrap(~Package, scales = "free_y") + theme_bw() + labs(x = "arguments", title = "R base function arguments by package")

To leave a comment for the author, please follow the link and comment on their blog: R – What You're Doing Is Rather Desperate.

    A Comparative Review of the BlueSky Statistics GUI for R

    Thu, 06/21/2018 - 14:47

    (This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

    Introduction

    BlueSky Statistics’ desktop version is a free and open source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses.  A commercial version is also available which includes technical support and a version for Windows Terminal Servers such as Remote Desktop, or Citrix. Mac, Linux, or tablet users could run it via a terminal server.

    This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

     

    Terminology

    There are various definitions of user interface types, so here’s how I’ll be using these terms:

    GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

     

    Installation

    The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others install in multiple steps, such as the R Commander (two steps) and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

    The main BlueSky installation is easily performed in a single step. The installer provides its own embedded copy of R, simplifying the installation and ensuring complete compatibility between BlueSky and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have BlueSky control any version of R you choose, but if the version differs too much, you may run into occasional problems.

     

    Plug-in Modules

    When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

BlueSky is a fairly new open source project, and at the moment all the add-on modules are provided by the company. However, BlueSky's capabilities approach the comprehensiveness of R Commander, which currently has the most add-ons available. The BlueSky developers are working to create an Internet repository for module distribution.

     

    Startup

    Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

    You start BlueSky directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

     

    Data Editor

    A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

    BlueSky starts up by showing you its main Application screen (Figure 1) and prompts you to enter data with an empty spreadsheet-style data editor. You can start entering data immediately, though at first, the variables are simply named var1, var2…. You might think you can rename them by clicking on their names, but such changes are done in a different manner, one that will be very familiar to SPSS users. There are two tabs at the bottom left of the data editor screen, which are labeled “Data” and “Variables.” The “Data” tab is shown by default, but clicking on the “Variables” tab takes you to a screen (Figure 2) which displays the metadata: variable names, labels, types, classes, values, and measurement scale.

    Figure 1. The main BlueSky Application screen.

    The big advantage that SPSS offers is that you can change the settings of many variables at once. So if you had, say, 20 variables for which you needed to set the same factor labels (e.g. 1=strongly disagree…5=Strongly Agree) you could do it once and then paste them into the other 19 with just a click or two. Unfortunately, that’s not yet fully implemented in BlueSky. Some of the metadata fields can be edited directly. For the rest, you must instead follow the directions at the top of that screen and right click on each variable, one at a time, to make the changes. Complete copy and paste of metadata is planned for a future version.

    Figure 2. The Variables screen in the data editor. The “Variables” tab in the lower left is selected, letting us see the metadata for the same variables as shown in Figure 1.

    You can enter numeric or character data in the editor right after starting BlueSky. The first time you enter character data, it will offer to convert the variable from numeric to character and wait for you to approve the change. This is very helpful as it’s all too easy to type the letter “O” when meaning to type a zero “0”, or the letter “I” instead of number one “1”.

    To add rows, the Data tab is clearly labeled, “Click here to add a new row”. It would be much faster if the Enter key did that automatically.

    To add variables you have to go to the Variables tab and right-click on the row of any variable (variable names are in rows on that screen), then choose “Insert new variable at end.”

To enter factor data, it's best to leave it numeric such as 1 or 2, for male and female, then set the labels (which are called values using SPSS terminology) afterwards. The reason for this is that once labels are set, you must enter them from drop-down menus. While that ensures no invalid values are entered, it slows down data entry. The developer's future plans include automatic display of labels upon entry of numeric values.

    If you instead decide to make the variable a factor before entering numeric data, it’s best to enter the numbers as labels as well. It’s an oddity of R that factors are numeric inside, while displaying labels that may or may not be the same as the numbers they represent.

    To enter dates, enter them as character data and use the “Data> Compute” menu to convert the character data to a date. When I reported this problem to the developers, they said they would add this to the “Variables” metadata tab so you could set it to be a date variable before entering the data.

    If you have another data set to enter, you can start the process again by clicking “File> New”, and a new editor window will appear in a new tab. You can change data sets simply by clicking on its tab and its window will pop to the front for you to see. When doing analyses, or saving data, the data set that’s displayed in the editor is the one that will be used. That approach feels very natural; what you see is what you get.

    Saving the data is done with the standard “File > Save As” menu. You must save each one to its own file. While R allows multiple data sets (and other objects such as models) to be saved to a single file, BlueSky does not. Its developers chose to simplify what their users have to learn by limiting each file to a single data set. That is a useful simplification for GUI users. If a more advanced R user sends a compound file containing many objects, BlueSky will detect it and offer to open one data set (data frame) at a time.

    Figure 3. Output window showing standard journal-style tables. Syntax editor has been opened and is shown on right side.

     

    Data Import

    The open source version of BlueSky supports the following file formats, all located under “File> Open”:

    • Comma Separated Values (.csv)
    • Plain text files (.txt)
    • Excel (old and new xls file types)
    • Dbase’s DBF
    • SPSS (.sav)
    • SAS binary files (sas7bdat)
    • Standard R workspace files (RData) with individual data frame selection

    The SQL database formats are found under the “File> Import Data” menu. The supported formats include:

    • Microsoft Access
    • Microsoft SQL Server
    • MySQL
    • PostgreSQL
    • SQLite

     

    Data Management

    It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard handle only a few of these functions. Others, such as the R Commander, can handle many, but not all, of them.

    BlueSky offers one of the most comprehensive sets of data management tools of any R GUI. The “Data” menu offers the following set of tools. Not shown is an extensive set of character and date/time functions which appear under “Compute.”

    1. Missing Values
    2. Compute
    3. Bin Numeric Variables
    4. Recode (able to recode many at once)
5. Make Factor Variable (able to convert many at once)
    6. Transpose
    7. Transform (able to transform many at once)
    8. Sample Dataset
    9. Delete Variables
    10. Standardize Variables (able to standardize many at once)
    11. Aggregate (outputs results to a new dataset)
    12. Aggregate (outputs results to a printed table)
13. Subset (outputs to a new dataset)
    14. Subset (outputs results to a printed table)
    15. Merge Datasets
    16. Sort (outputs results to a new dataset)
    17. Sort (outputs results to a printed table)
    18. Reload Dataset from File
    19. Refresh Grid
    20. Concatenate Multiple Variables (handling missing values)
    21. Legacy (does same things but using base R code)
    22. Reshape (long to wide)
    23. Reshape (wide to long)

    Continued here…

To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com.

    Non-Linear Model in R Exercises

    Thu, 06/21/2018 - 08:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

A mechanistic model for the relationship between x and y sometimes needs parameter estimation. When model linearisation does not work, we need to use non-linear modelling.
    There are three main differences between non-linear and linear modelling in R:
    1. specify the exact nature of the equation
2. replace lm() with nls(), which stands for nonlinear least squares
3. sometimes we also need to specify starting values for the model parameters a, b, and c (a minimal sketch of such a call follows below).
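For orientation, here is a minimal, self-contained sketch of what such a call looks like (my own example on simulated data with generic names, not the exercise dataset):

set.seed(1)
dat <- data.frame(x = runif(50, 1, 10))
dat$y <- 2 * dat$x^0.7 + rnorm(50, sd = 0.3)

# power model y = a * x^b, with starting values supplied for a and b
fit <- nls(y ~ a * x^b, data = dat, start = list(a = 0.1, b = 1))
summary(fit)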

    In this exercise, we will use the same dataset as the previous exercise in polynomial regression here. Download the data-set here.
    A quick overview of the dataset.
    Response variable = number of invertebrates (INDIV)
    Explanatory variable = the area of each clump (AREA)
    Additional possible response variables = Species richness of invertebrates (SPECIES)
    Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

    Exercise 1
Load the dataset. Specify the model: try to fit a power function with nls(), using a = 0.1 and b = 1 as initial parameter values.

    Exercise 2
Do a quick check by creating a plot of residuals versus fitted values, since the normal plot will not work.

    Exercise 3
Try to build a self-start function for the power model.

    Exercise 4
    Generate the asymptotic model

    Exercise 5
Compare the asymptotic model to the power model using AIC. What can we infer?

    Exercise 6
    Plot the model in one graph

    Exercise 7
    Predict across the data and plot all three lines

    Related exercise sets:
    1. Spatial Data Analysis: Introduction to Raster Processing (Part 1)
    2. Spatial Data Analysis: Introduction to Raster Processing: Part-3
    3. Density-Based Clustering Exercises
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    Explaining Keras image classification models with lime

    Thu, 06/21/2018 - 02:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    Last week I published a blog post about how easy it is to train image classification models with Keras.

    What I did not show in that post was how to use the model for making predictions. This, I will do here. But predictions alone are boring, so I’m adding explanations for the predictions using the lime package.

    I have already written a few blog posts (here, here and here) about LIME and have given talks (here and here) about it, too.

None of them applies LIME to image classification models, though. And with the new(ish) March release of Thomas Lin Pedersen's lime package, lime is now not only on CRAN but also natively supports Keras and image classification models.

    Thomas wrote a very nice article about how to use keras and lime in R! Here, I am following this article to use Imagenet (VGG16) to make and explain predictions of fruit images and then I am extending the analysis to last week’s model and compare it with the pretrained net.

Loading libraries and models

library(keras)   # for working with neural nets
library(lime)    # for explaining models
library(magick)  # for preprocessing images
library(ggplot2) # for additional plotting
    • Loading the pretrained Imagenet model
    model <- application_vgg16(weights = "imagenet", include_top = TRUE) model ## Model ## ___________________________________________________________________________ ## Layer (type) Output Shape Param # ## =========================================================================== ## input_1 (InputLayer) (None, 224, 224, 3) 0 ## ___________________________________________________________________________ ## block1_conv1 (Conv2D) (None, 224, 224, 64) 1792 ## ___________________________________________________________________________ ## block1_conv2 (Conv2D) (None, 224, 224, 64) 36928 ## ___________________________________________________________________________ ## block1_pool (MaxPooling2D) (None, 112, 112, 64) 0 ## ___________________________________________________________________________ ## block2_conv1 (Conv2D) (None, 112, 112, 128) 73856 ## ___________________________________________________________________________ ## block2_conv2 (Conv2D) (None, 112, 112, 128) 147584 ## ___________________________________________________________________________ ## block2_pool (MaxPooling2D) (None, 56, 56, 128) 0 ## ___________________________________________________________________________ ## block3_conv1 (Conv2D) (None, 56, 56, 256) 295168 ## ___________________________________________________________________________ ## block3_conv2 (Conv2D) (None, 56, 56, 256) 590080 ## ___________________________________________________________________________ ## block3_conv3 (Conv2D) (None, 56, 56, 256) 590080 ## ___________________________________________________________________________ ## block3_pool (MaxPooling2D) (None, 28, 28, 256) 0 ## ___________________________________________________________________________ ## block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160 ## ___________________________________________________________________________ ## block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808 ## ___________________________________________________________________________ ## block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808 ## ___________________________________________________________________________ ## block4_pool (MaxPooling2D) (None, 14, 14, 512) 0 ## ___________________________________________________________________________ ## block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808 ## ___________________________________________________________________________ ## block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808 ## ___________________________________________________________________________ ## block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808 ## ___________________________________________________________________________ ## block5_pool (MaxPooling2D) (None, 7, 7, 512) 0 ## ___________________________________________________________________________ ## flatten (Flatten) (None, 25088) 0 ## ___________________________________________________________________________ ## fc1 (Dense) (None, 4096) 102764544 ## ___________________________________________________________________________ ## fc2 (Dense) (None, 4096) 16781312 ## ___________________________________________________________________________ ## predictions (Dense) (None, 1000) 4097000 ## =========================================================================== ## Total params: 138,357,544 ## Trainable params: 138,357,544 ## Non-trainable params: 0 ## ___________________________________________________________________________ model2 <- load_model_hdf5(filepath = "/Users/shiringlander/Documents/Github/DL_AI/Tutti_Frutti/fruits-360/keras/fruits_checkpoints.h5") 
model2 ## Model ## ___________________________________________________________________________ ## Layer (type) Output Shape Param # ## =========================================================================== ## conv2d_1 (Conv2D) (None, 20, 20, 32) 896 ## ___________________________________________________________________________ ## activation_1 (Activation) (None, 20, 20, 32) 0 ## ___________________________________________________________________________ ## conv2d_2 (Conv2D) (None, 20, 20, 16) 4624 ## ___________________________________________________________________________ ## leaky_re_lu_1 (LeakyReLU) (None, 20, 20, 16) 0 ## ___________________________________________________________________________ ## batch_normalization_1 (BatchNorm (None, 20, 20, 16) 64 ## ___________________________________________________________________________ ## max_pooling2d_1 (MaxPooling2D) (None, 10, 10, 16) 0 ## ___________________________________________________________________________ ## dropout_1 (Dropout) (None, 10, 10, 16) 0 ## ___________________________________________________________________________ ## flatten_1 (Flatten) (None, 1600) 0 ## ___________________________________________________________________________ ## dense_1 (Dense) (None, 100) 160100 ## ___________________________________________________________________________ ## activation_2 (Activation) (None, 100) 0 ## ___________________________________________________________________________ ## dropout_2 (Dropout) (None, 100) 0 ## ___________________________________________________________________________ ## dense_2 (Dense) (None, 16) 1616 ## ___________________________________________________________________________ ## activation_3 (Activation) (None, 16) 0 ## =========================================================================== ## Total params: 167,300 ## Trainable params: 167,268 ## Non-trainable params: 32 ## ___________________________________________________________________________ Load and prepare images

    Here, I am loading and preprocessing two images of fruits (and yes, I am cheating a bit because I am choosing images where I expect my model to work as they are similar to the training images…).

    • Banana
test_image_files_path <- "/Users/shiringlander/Documents/Github/DL_AI/Tutti_Frutti/fruits-360/Test"

img <- image_read('https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Banana-Single.jpg/272px-Banana-Single.jpg')
img_path <- file.path(test_image_files_path, "Banana", 'banana.jpg')
image_write(img, img_path)
#plot(as.raster(img))
    • Clementine
img2 <- image_read('https://cdn.pixabay.com/photo/2010/12/13/09/51/clementine-1792_1280.jpg')
img_path2 <- file.path(test_image_files_path, "Clementine", 'clementine.jpg')
image_write(img2, img_path2)
#plot(as.raster(img2))

Superpixels

    The segmentation of an image into superpixels are an important step in generating explanations for image models. It is both important that the segmentation is correct and follows meaningful patterns in the picture, but also that the size/number of superpixels are appropriate. If the important features in the image are chopped into too many segments the permutations will probably damage the picture beyond recognition in almost all cases leading to a poor or failing explanation model. As the size of the object of interest is varying it is impossible to set up hard rules for the number of superpixels to segment into – the larger the object is relative to the size of the image, the fewer superpixels should be generated. Using plot_superpixels it is possible to evaluate the superpixel parameters before starting the time consuming explanation function. (help(plot_superpixels))

    plot_superpixels(img_path, n_superpixels = 35, weight = 10)

    plot_superpixels(img_path2, n_superpixels = 50, weight = 20)

    From the superpixel plots we can see that the clementine image has a higher resolution than the banana image.

Prepare images for Imagenet

image_prep <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = c(224,224))
    x <- image_to_array(img)
    x <- array_reshape(x, c(1, dim(x)))
    x <- imagenet_preprocess_input(x)
  })
  do.call(abind::abind, c(arrays, list(along = 1)))
}
    • test predictions
    res <- predict(model, image_prep(c(img_path, img_path2))) imagenet_decode_predictions(res) ## [[1]] ## class_name class_description score ## 1 n07753592 banana 0.9929747581 ## 2 n03532672 hook 0.0013420776 ## 3 n07747607 orange 0.0010816186 ## 4 n07749582 lemon 0.0010625814 ## 5 n07716906 spaghetti_squash 0.0009176208 ## ## [[2]] ## class_name class_description score ## 1 n07747607 orange 0.78233224 ## 2 n07753592 banana 0.04653566 ## 3 n07749582 lemon 0.03868873 ## 4 n03134739 croquet_ball 0.03350329 ## 5 n07745940 strawberry 0.01862431
    • load labels and train explainer
model_labels <- readRDS(system.file('extdata', 'imagenet_labels.rds', package = 'lime'))
explainer <- lime(c(img_path, img_path2), as_classifier(model, model_labels), image_prep)

    Training the explainer (explain() function) can take pretty long. It will be much faster with the smaller images in my own model but with the bigger Imagenet it takes a few minutes to run.

    explanation <- explain(c(img_path, img_path2), explainer, n_labels = 2, n_features = 35, n_superpixels = 35, weight = 10, background = "white")
    • plot_image_explanation() only supports showing one case at a time
    plot_image_explanation(explanation)

    clementine <- explanation[explanation$case == "clementine.jpg",] plot_image_explanation(clementine)

    Prepare images for my own model
    • test predictions (analogous to training and validation images)
    test_datagen <- image_data_generator(rescale = 1/255) test_generator = flow_images_from_directory( test_image_files_path, test_datagen, target_size = c(20, 20), class_mode = 'categorical') predictions <- as.data.frame(predict_generator(model2, test_generator, steps = 1)) load("/Users/shiringlander/Documents/Github/DL_AI/Tutti_Frutti/fruits-360/fruits_classes_indices.RData") fruits_classes_indices_df <- data.frame(indices = unlist(fruits_classes_indices)) fruits_classes_indices_df <- fruits_classes_indices_df[order(fruits_classes_indices_df$indices), , drop = FALSE] colnames(predictions) <- rownames(fruits_classes_indices_df) t(round(predictions, digits = 2)) ## [,1] [,2] ## Kiwi 0 0.00 ## Banana 1 0.11 ## Apricot 0 0.00 ## Avocado 0 0.00 ## Cocos 0 0.00 ## Clementine 0 0.87 ## Mandarine 0 0.00 ## Orange 0 0.00 ## Limes 0 0.00 ## Lemon 0 0.00 ## Peach 0 0.00 ## Plum 0 0.00 ## Raspberry 0 0.00 ## Strawberry 0 0.01 ## Pineapple 0 0.00 ## Pomegranate 0 0.00 for (i in 1:nrow(predictions)) { cat(i, ":") print(unlist(which.max(predictions[i, ]))) } ## 1 :Banana ## 2 ## 2 :Clementine ## 6

    This seems to be incompatible with lime, though (or if someone knows how it works, please let me know) – so I prepared the images similarly to the Imagenet images.

image_prep2 <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = c(20, 20))
    x <- image_to_array(img)
    x <- reticulate::array_reshape(x, c(1, dim(x)))
    x <- x / 255
  })
  do.call(abind::abind, c(arrays, list(along = 1)))
}
    • prepare labels
    fruits_classes_indices_l <- rownames(fruits_classes_indices_df) names(fruits_classes_indices_l) <- unlist(fruits_classes_indices) fruits_classes_indices_l ## 9 10 8 2 11 ## "Kiwi" "Banana" "Apricot" "Avocado" "Cocos" ## 3 13 14 7 6 ## "Clementine" "Mandarine" "Orange" "Limes" "Lemon" ## 1 5 0 4 15 ## "Peach" "Plum" "Raspberry" "Strawberry" "Pineapple" ## 12 ## "Pomegranate"
    • train explainer
explainer2 <- lime(c(img_path, img_path2), as_classifier(model2, fruits_classes_indices_l), image_prep2)
explanation2 <- explain(c(img_path, img_path2), explainer2,
                        n_labels = 1, n_features = 20,
                        n_superpixels = 35, weight = 10,
                        background = "white")
    • plot feature weights to find a good threshold for plotting block (see below)
    explanation2 %>% ggplot(aes(x = feature_weight)) + facet_wrap(~ case, scales = "free") + geom_density()

    • plot predictions
    plot_image_explanation(explanation2, display = 'block', threshold = 5e-07)

    clementine2 <- explanation2[explanation2$case == "clementine.jpg",] plot_image_explanation(clementine2, display = 'block', threshold = 0.16)

    sessionInfo() ## R version 3.5.0 (2018-04-23) ## Platform: x86_64-apple-darwin15.6.0 (64-bit) ## Running under: macOS High Sierra 10.13.5 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib ## ## locale: ## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] ggplot2_2.2.1 magick_1.9 lime_0.4.0 keras_2.1.6 ## ## loaded via a namespace (and not attached): ## [1] stringdist_0.9.5.1 reticulate_1.8 xfun_0.2 ## [4] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6 ## [7] yaml_2.1.19 base64enc_0.1-3 rlang_0.2.1 ## [10] pillar_1.2.3 later_0.7.3 foreach_1.4.4 ## [13] plyr_1.8.4 tensorflow_1.8 stringr_1.3.1 ## [16] munsell_0.5.0 blogdown_0.6 gtable_0.2.0 ## [19] htmlwidgets_1.2 codetools_0.2-15 evaluate_0.10.1 ## [22] labeling_0.3 knitr_1.20 httpuv_1.4.4.1 ## [25] tfruns_1.3 parallel_3.5.0 curl_3.2 ## [28] Rcpp_0.12.17 xtable_1.8-2 scales_0.5.0 ## [31] backports_1.1.2 promises_1.0.1 jsonlite_1.5 ## [34] abind_1.4-5 mime_0.5 digest_0.6.15 ## [37] stringi_1.2.3 bookdown_0.7 shiny_1.1.0 ## [40] grid_3.5.0 rprojroot_1.3-2 tools_3.5.0 ## [43] magrittr_1.5 lazyeval_0.2.1 shinythemes_1.1.1 ## [46] glmnet_2.0-16 tibble_1.4.2 whisker_0.3-2 ## [49] zeallot_0.1.0 Matrix_1.2-14 gower_0.1.2 ## [52] assertthat_0.2.0 rmarkdown_1.10 iterators_1.0.9 ## [55] R6_2.2.2 compiler_3.5.0 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

    Scraping Responsibly with R

    Thu, 06/21/2018 - 02:00

    (This article was first published on Blog-rss on stevenmortimer.com, and kindly contributed to R-bloggers)

    I recently wrote a blog post here comparing the number of CRAN downloads an R package gets relative to its number of stars on GitHub. What I didn’t really think about during my analysis was whether or not scraping CRAN was a violation of its Terms and Conditions. I simply copy and pasted some code from R-bloggers that seemed to work and went on my merry way. In hindsight, it would have been better to check whether or not the scraping was allowed and maybe find a better way to get the information I needed. Of course, there was a much easier way to get the CRAN package metadata using the function tools::CRAN_package_db() thanks to a hint from Maëlle Salmon provided in this tweet.
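For reference, that is literally a one-liner (my own small sketch; it needs an internet connection, since it downloads the current CRAN metadata):

pkgs <- tools::CRAN_package_db()  # one row per package currently on CRAN
dim(pkgs)
head(pkgs$Package)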

    How to Check if Scraping is Permitted

    Also provided by Maëlle’s tweet was the recommendation for using the robotstxt package (currently having 27 Stars + one Star that I just added!). It doesn’t seem to be well known as it only has 6,571 total downloads. I’m hoping this post will help spread the word. It’s easy to use! In this case I’ll check whether or not CRAN permits bots on specific resources of the domain.

    My other blog post analysis originally started with trying to get a list of all current R packages on CRAN by parsing the HTML from https://cran.rstudio.com/src/contrib. The page looks like this:

    The question is whether or not scraping this page is permitted according to the robots.txt file on the cran.rstudio.com domain. This is where the robotstxt package can help us out. We can check simply by supplying the domain and path that is used to form the full link we are interested in scraping. If the paths_allowed() function returns TRUE then we should be allowed to scrape, if it returns FALSE then we are not permitted to scrape.

library(robotstxt)

paths_allowed(
  paths = "/src/contrib",
  domain = "cran.rstudio.com",
  bot = "*"
)
#> [1] TRUE

    In this case the value that is returned is TRUE meaning that bots are allowed to scrape that particular path. This was how I originally scraped the list of current R packages, even though you don’t really need to do that since there is the wonderful function tools::CRAN_package_db().

    After retrieving the list of packages I decided to scrape details from the DESCRIPTION file of each package. Here is where things get interesting. CRAN’s robots.txt file shows that scraping the DESCRIPTION file of each package is not allowed. Furthermore, you can verify this using the robotstxt package:

paths_allowed(
  paths = "/web/packages/ggplot2/DESCRIPTION",
  domain = "cran.r-project.org",
  bot = "*"
)
#> [1] FALSE

    However, when I decided to scrape the package metadata I did it by parsing the HTML from the canonical package link that resolves to the index.html page for the package. For example, https://cran.r-project.org/package=ggplot2 resolves to https://cran.r-project.org/web/packages/ggplot2/index.html. If you check whether scraping is allowed on this page, the robotstxt package says that it is permitted.

paths_allowed(
  paths = "/web/packages/ggplot2/index.html",
  domain = "cran.r-project.org",
  bot = "*"
)
#> [1] TRUE

paths_allowed(
  paths = "/web/packages/ggplot2",
  domain = "cran.r-project.org",
  bot = "*"
)
#> [1] TRUE

    This is a tricky situation because I can access the same information that is in the DESCRIPTION file just by going to the index.html page for the package where scraping seems to be allowed. In the spirit of respecting CRAN it logically follows that I should not be scraping the package index pages if the individual DESCRIPTION files are off-limits. This is despite there being no formal instruction from the robots.txt file about package index pages. All in all, it was an interesting bit of work and glad that I was able to learn about the robotstxt package so I can have it in my toolkit going forward.
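If you would rather read CRAN's robots.txt yourself instead of trusting my interpretation, the robotstxt package can also fetch the raw file (a small sketch using its get_robotstxt() helper):

library(robotstxt)
# download and print the raw robots.txt for the CRAN domain
get_robotstxt("cran.r-project.org")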

    Remember to Always Scrape Responsibly!

    DISCLAIMER: I only have a basic understanding of how robots.txt files work based on allowing or disallowing specified paths. I believe in this case CRAN’s robots.txt broadly permitted scraping, but too narrowly disallowed just the DESCRIPTION files. Perhaps this goes back to an older time where those DESCRIPTION files really were the best place for people to start scraping so it made sense to disallow them. Or the reason could be something else entirely.

To leave a comment for the author, please follow the link and comment on their blog: Blog-rss on stevenmortimer.com.

    Le Monde puzzle [#1053]

    Thu, 06/21/2018 - 00:18

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

    An easy arithmetic Le Monde mathematical puzzle again:

    1. If coins come in units of 1, x, and y, what is the optimal value of (x,y) that minimises the number of coins representing an arbitrary price between 1 and 149?
    2.  If the number of units is now four, what is the optimal choice?

    The first question is fairly easy to code

    coinz <- function(x,y){ z=(1:149) if (y

    and returns M=12 as the maximal number of coins, corresponding to x=4 and y=22. And a price tag of 129.  For the second question, one unit is necessarily 1 (!) and there is just an extra loop to the above, which returns M=8, with other units taking several possible values:

    [1] 40 11 3 [1] 41 11 3 [1] 55 15 4 [1] 56 15 4

    A quick search revealed that this problem (or a variant) is solved in many places, from stackexchange (for an average—why average?, as it does not make sense when looking at real prices—number of coins, rather than maximal), to a paper by Shalit calling for the 18¢ coin, to Freakonomics, to Wikipedia, although this is about finding the minimum number of coins summing up to a given value, using fixed currency denominations (a knapsack problem). This Wikipedia page made me realise that my solution is not necessarily optimal, as I use the remainders from the larger denominations in my code, while there may be more efficient divisions. For instance, running the following dynamic programming code

    coz=function(x,y){ minco=1:149 if (x

    returns the lower value of M=11 (with x=7,y=23) in the first case and M=7 in the second one.
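Since the code above reaches the feed truncated (everything after the first < was swallowed), here is a minimal greedy counter of my own, not the author's code, which reproduces M=12 at (x,y)=(4,22):

# greedy count for denominations 1, x, y (with 1 < x < y): use y first, then x, then 1s
coin_count <- function(p, x, y){
  r <- p %% y
  p %/% y + r %/% x + r %% x
}
# maximal number of coins needed over all prices 1..149
max_coins <- function(x, y) max(sapply(1:149, coin_count, x = x, y = y))
max_coins(4, 22)  # 12, attained at price 129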

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

    Awesome Twitter Word Clouds in R

    Wed, 06/20/2018 - 23:40

    (This article was first published on Little Miss Data, and kindly contributed to R-bloggers)

     

     

    We have officially kicked off summer of 2018!  Y’all know what that means right?  SUMMER OF DATA SCIENCE!!! Imagine it now, you could be soaking up the rays while harnessing your inner data science skills ALL. SUMMER. LONG!!

    Summer of data science (aka #SoDS18 on twitter) is a community created by Data Science Renee to support and promote learning new data science skills in a collaborative environment.  The community releases a list of books to collectively tackle during the program.  To participate; you pick a book, work through it and collaborate in the social channels of your choice (slack, twitter etc).

    For my goals, I decided to work through the book Tidy Text Mining with R by Julia Silge and David Robinson   I chose to tap into Twitter data for my text analysis using the rtweets package.  Inspired by some of the word clouds in the Tidy Text book, I decided to plot the data in fancy word clouds using wordcloud2.

      1) Install and Load the Libraries

    The first thing we need to do is install and load all required packages for our summertime fun work with R.  These are all of the install.packages() and library() commands below.  Note that some packages can be installed directly via CRAN and some need to be installed from github via the devtools package.  I wrote a blog on navigating various R package install issues here

    install.packages("rtweet") install.packages("tidytext") install.packages("dplyr") install.packages("stringr") require(devtools) install_github("lchiffon/wordcloud2") library(tidytext) library(dplyr) library(stringr) library(rtweet) library(wordcloud2)   2) API Authorization

    Before we get rolling, we need some data.  Let’s tap into twitter using the rtweets package.  rtweet is an excellent package for pulling twitter data.  However, first you need to be authorized to pull from the twitter API.  To get my API credentials, I followed the processes documented in rtweet reference article and the github readme page.  Following these steps, the process was pretty seamless.  I would only add that I had to uncheck “Enable Callback Locking” to establish my token in step 2.

    3) Establish your token

Execute the following code to establish your token.  As per the rtweet reference article, the create_token() function automatically saves your token as an environment variable for future use.

create_token(
  app = "PLACE YOUR APP NAME HERE",
  consumer_key = "PLACE YOUR CONSUMER KEY HERE",
  consumer_secret = "PLACE YOUR CONSUMER SECRET HERE")

4) Gather Tweets

Use the search_tweets() function to grab 2000 tweets with the hashtag of your choice.  I chose to look into #HandmaidsTale because the show is AMAZING.  We will then look at the data collected from the tweets via the head() and dim() commands.  Then we will use the unnest_tokens() command to transform the data to have only one word per row.  Next, we remove common words such as "the", "it" etc via the stop_words data set.  Finally, we will filter out a custom set of words of our choice.

#Grab tweets - note: reduce to 1000 if it's slow
hmt <- search_tweets(
  "#HandmaidsTale", n = 2000, include_rts = FALSE
)

#Look at tweets
head(hmt)
dim(hmt)
hmt$text

#Unnest the words - code via Tidy Text
hmtTable <- hmt %>% 
  unnest_tokens(word, text)

#remove stop words - aka typically very common words such as "the", "of" etc
data(stop_words)
hmtTable <- hmtTable %>% 
  anti_join(stop_words)

#do a word count
hmtTable <- hmtTable %>% 
  count(word, sort = TRUE)
hmtTable

#Remove other nonsense words
hmtTable <- hmtTable %>% 
  filter(!word %in% c('t.co', 'https', 'handmaidstale', "handmaid's", 'season', 'episode', 'de', 'handmaidsonhulu', 'tvtime',
                      'watched', 'watching', 'watch', 'la', "it's", 'el', 'en', 'tv',
                      'je', 'ep', 'week', 'amp'))

     

    5) Create a Word Cloud

    Use the wordcloud2() function out of the box to create a wordcloud on the #HandmaidsTale tweets.

    wordcloud2(hmtTable, size=0.7)

     

     

    6) Create a Better Word Cloud

    That word cloud was nice, but I think we can do better!  We need a color palette that suits the show and we would like a more suitable shape.  To do this we create a list of hex codes for our redPalette.  We then use the download.file() function to download some symbols from my github.  Note that you can also very easily create the word cloud shape by referencing local files.  Finally we create the word cloud via the wordcloud2() function and we set the shape with the figPath attribute.  

#Create Palette
redPalette <- c("#5c1010", "#6f0000", "#560d0d", "#c30101", "#940000")
redPalette

#Download images for plotting
url = "https://raw.githubusercontent.com/lgellis/MiscTutorial/master/twitter_wordcloud/handmaiden.jpeg"
handmaiden <- "handmaiden.jpg"
download.file(url, handmaiden) # download file

url = "https://raw.githubusercontent.com/lgellis/MiscTutorial/master/twitter_wordcloud/resistance.jpeg"
resistance <- "resistance.jpeg"
download.file(url, resistance) # download file

#plots
wordcloud2(hmtTable, size=1.6, figPath = handmaiden, color=rep_len(redPalette, nrow(hmtTable)))
wordcloud2(hmtTable, size=1.6, figPath = resistance, color=rep_len(redPalette, nrow(hmtTable)))
wordcloud2(hmtTable, size=1.6, figPath = resistance, color="#B20000")

     

    7) Marvel at your amazing word cloud  

    Yes, you have done it!  You’ve made a pretty sweet word cloud about an epic tv show.  Word clouds with custom symbols are a great way to make an immediate impact on your audience.  

     

       

       

    8) Go wild on word clouds

    Now that you’ve learned this trick, it’s time to unleash your new #SoDS18 powers on the subjects of your choice! In keeping with the TV theme, I can’t forget Westworld.  Below are a few more Westworld visualizations I created following similar steps to above. All code is available in my github repo.


    Thank You

    Thanks for reading along while we explored Twitter data, the beginnings of text analysis, and cool word clouds.  Please share your thoughts and creations with me on Twitter.

    Note that the full code is available on my GitHub repo.  If you have trouble downloading the file from GitHub, go to the main page of the repo and select “Clone or Download” and then “Download Zip”.


    To leave a comment for the author, please follow the link and comment on their blog: Little Miss Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    PYPL Language Rankings: Python ranks #1, R at #7 in popularity

    Wed, 06/20/2018 - 21:38

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    The new PYPL Popularity of Programming Languages (June 2018) index ranks Python at #1 and R at #7.

    Like the similar TIOBE language index, the PYPL index uses Google search activity to rank language popularity. PYPL, however, focuses on people searching for tutorials in the respective languages as a proxy for popularity. By that measure, Python has always been more popular than R (as you'd expect from a more general-purpose language), but both have been growing at similar rates. The chart below includes the three data-oriented languages tracked by the index (note that the vertical scale is logarithmic).

    Another language ranking was also released recently: the annual KDnuggets Analytics, Data Science and Machine Learning Poll. These rankings, however, are derived not from search trends but from the responses of self-selected poll participants, which perhaps explains the presence of RapidMiner at the #2 spot.


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

    Wed, 06/20/2018 - 19:23

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    We here at Win-Vector LLC have some really big news that we would very much like the R community’s help sharing.

    vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark.

    vtreat is a very complete and rigorous tool for preparing messy real-world data for supervised machine-learning tasks. It implements what we call “safe y-aware processing”, using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
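    As a rough sketch of that “show it some data” workflow (not taken from the announcement — the toy data frame below is invented purely for illustration, and this uses vtreat’s long-standing in-memory interface rather than the new database-backed path), designing and applying a treatment plan looks something like this:

    library(vtreat)

    # A made-up toy data frame: a numeric outcome y, a categorical x1 with
    # missing values, and a numeric x2 with missing values.
    d <- data.frame(
      x1 = c("a", "b", NA, "c", "a", "b", "c", NA, "a", "b"),
      x2 = c(1, 2, NA, 4, 5, NA, 7, 8, 9, 10),
      y  = c(1.1, 2.3, 1.8, 4.2, 5.0, 5.9, 7.1, 8.4, 8.9, 10.2),
      stringsAsFactors = FALSE
    )

    # Design a y-aware treatment plan for a numeric outcome.
    plan <- designTreatmentsN(d, varlist = c("x1", "x2"), outcomename = "y")

    # Apply the designed transform; the result is an all-numeric frame with
    # missing values and novel categorical levels handled automatically.
    # (For training data you would normally use mkCrossFrameNExperiment()
    # to avoid nested-model bias.)
    d_treated <- prepare(plan, d)
    head(d_treated)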

    Thanks to the rquery package, this data preparation transform can now be applied directly to databases such as PostgreSQL, Amazon Redshift, and Google BigQuery, or to big data systems such as Apache Spark. Or, thanks to the data.table and rqdatatable packages, even fast large in-memory transforms are possible.

    We have some basic examples of the new vtreat capabilities here and here.


    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Neural Networks Are Essentially Polynomial Regression

    Wed, 06/20/2018 - 19:00

    (This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

    You may be interested in my new arXiv paper, joint work with Xi Cheng, an undergraduate at UC Davis (now heading to Cornell for grad school); Bohdan Khomtchouk, a postdoc in biology at Stanford; and Pete Mohanty, a Science, Engineering & Education Fellow in statistics at Stanford. The paper is of a provocative nature, and we welcome feedback.

    A summary of the paper is:

    • We present a very simple, informal mathematical argument that neural networks (NNs) are in essence polynomial regression (PR). We refer to this as NNAEPR.
    • NNAEPR implies that we can use our knowledge of the “old-fashioned” method of PR to gain insight into how NNs — widely viewed somewhat warily as a “black box” — work inside.
    • One such insight is that the outputs of an NN layer will be prone to multicollinearity, with the problem becoming worse with each successive layer. This in turn may explain why convergence issues often develop in NNs. It also suggests that NN users tend to use overly large networks.
    • NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.
    • We investigated this on a wide variety of datasets, and found that in every case PR did as well as, and often better than, NNs.
    • We have developed a feature-rich R package, polyreg, to facilitate using PR in multivariate settings.

    Much work remains to be done (see the paper), but our results so far are very encouraging. By using PR, one can avoid the headaches of NNs, such as selecting good combinations of tuning parameters, dealing with convergence problems, and so on.
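    As a toy illustration of the PR side of the comparison (a base-R sketch on a built-in data set — not the polyreg package’s own interface, and not the paper’s experiments), a degree-2 polynomial regression is simply an ordinary linear model on polynomial features:

    # Sketch: degree-2 polynomial regression in base R, standing in for what a
    # small NN regressor might be asked to fit. Data set and degree are
    # illustrative choices only.
    d <- mtcars[, c("mpg", "hp", "wt")]

    # Polynomial terms up to degree 2 in each predictor, plus their interaction.
    fit <- lm(mpg ~ poly(hp, 2, raw = TRUE) + poly(wt, 2, raw = TRUE) + hp:wt, data = d)

    summary(fit)$r.squared           # in-sample fit
    head(predict(fit, newdata = d))  # predictions from the polynomial model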

    Also available are the slides for our presentation at GRAIL on this project.


    To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
