R-bloggers: R news and tutorials contributed by 750 R bloggers

Idle thoughts lead to R internals: how to count function arguments

7 hours 2 min ago

(This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers)

“Some R functions have an awful lot of arguments”, you think to yourself. “I wonder which has the most?”

It’s not an original thought: the same question as applied to the R base package is an exercise in the Functions chapter of the excellent Advanced R. Much of the information in this post came from there.

There are lots of R packages. We’ll limit ourselves to those packages which ship with R, and which load on startup. Which ones are they?

What packages load on starting R?
Start a new R session and type search(). Here’s the result on my machine:


search()
[1] ".GlobalEnv" "tools:rstudio" "package:stats" "package:graphics" "package:grDevices"
"package:utils" "package:datasets" "package:methods" "Autoloads" "package:base"

We’re interested in the packages with priority = base. Next question:

How can I see and filter for package priority?
You don’t need dplyr for this, but it helps.

library(tidyverse)

installed.packages() %>% 
  as.tibble() %>% 
  filter(Priority == "base") %>% 
  select(Package, Priority)

# A tibble: 14 x 2
   Package   Priority
 1 base      base
 2 compiler  base
 3 datasets  base
 4 graphics  base
 5 grDevices base
 6 grid      base
 7 methods   base
 8 parallel  base
 9 splines   base
10 stats     base
11 stats4    base
12 tcltk     base
13 tools     base
14 utils     base

Comparing to the output from search(), we want to look at: stats, graphics, grDevices, utils, datasets, methods and base.

How can I see all the objects in a package?
Like this, for the base package. For other packages, just change base to the package name of interest.

ls("package:base")

However, not every object in a package is a function. Next question:

How do I know if an object is a function?
The simplest way is to use is.function().

is.function(ls)
[1] TRUE

What if the function name is stored as a character variable, “ls”? Then we can use get():

is.function(get("ls")) [1] TRUE

But wait: what if two functions from different packages have the same name and we have loaded both of those packages? Then we specify the package too, using the pos argument.

is.function(get("Position", pos = "package:base")) [1] TRUE is.function(get("Position", pos = "package:ggplot2")) [1] FALSE

So far, so good. Now, to the arguments.

How do I see the arguments to a function?
Now things start to get interesting. In R, function arguments are called formals. There is a function of the same name, formals(), to show the arguments for a function. You can also use formalArgs() which returns a vector with just the argument names:

formalArgs(ls) [1] "name" "pos" "envir" "all.names" "pattern" "sorted"

But that won’t work for every function. Let’s try abs():

formalArgs(abs)
NULL

The issue here is that abs() is a primitive function, and primitives don’t have formals. Our next two questions:

How do I know if an object is a primitive?
Hopefully you guessed that one:

is.primitive(abs)
[1] TRUE

How do I see the arguments to a primitive?
You can use args(), and you can pass the output of args() to formals() or formalArgs():

args(abs)
function (x) 
NULL

formalArgs(args(abs))
[1] "x"

However, there are a few objects which are primitive functions for which this doesn’t work. Let’s not worry about those.

is.primitive(`:`)
[1] TRUE

formalArgs(args(`:`))
NULL
Warning message:
In formals(fun) : argument is not a function

So what was the original question again?
Let’s put all that together. We want to find the base packages which load on startup, list their objects, identify which are functions or primitive functions, list their arguments and count them up.

We’ll create a tibble by pasting the arguments for each function into a comma-separated string, then pulling the string apart using unnest_tokens() from the tidytext package.

library(tidytext)
library(tidyverse)

pkgs <- installed.packages() %>% 
  as.tibble() %>% 
  filter(Priority == "base", 
         Package %in% c("stats", "graphics", "grDevices", "utils", "datasets", "methods", "base")) %>% 
  select(Package) %>% 
  rowwise() %>% 
  mutate(fnames = paste(ls(paste0("package:", Package)), collapse = ",")) %>% 
  unnest_tokens(fname, fnames, token = stringr::str_split, pattern = ",", to_lower = FALSE) %>% 
  filter(is.function(get(fname, pos = paste0("package:", Package)))) %>% 
  mutate(is_primitive = ifelse(is.primitive(get(fname, pos = paste0("package:", Package))), 1, 0), 
         num_args = ifelse(is.primitive(get(fname, pos = paste0("package:", Package))), 
                           length(formalArgs(args(fname))), 
                           length(formalArgs(fname)))) %>% 
  ungroup()

That throws out a few warnings where, as noted, args() doesn’t work for some primitives.

And the winner is –

pkgs %>% top_n(10) %>% arrange(desc(num_args))

Selecting by num_args
# A tibble: 10 x 4
   Package  fname            is_primitive num_args
 1 graphics legend                      0       39
 2 graphics stars                       0       33
 3 graphics barplot.default             0       30
 4 stats    termplot                    0       28
 5 utils    read.table                  0       25
 6 stats    heatmap                     0       24
 7 base     scan                        0       22
 8 graphics filled.contour              0       21
 9 graphics hist.default                0       21
10 stats    interaction.plot            0       21

– the function legend() from the graphics package, with 39 arguments. From the base package itself, scan(), with 22 arguments.

Just to wrap up, some histograms of argument number by package, suggesting that the base graphics functions tend to be the more verbose.

pkgs %>% 
  ggplot(aes(num_args)) + 
  geom_histogram() + 
  facet_wrap(~Package, scales = "free_y") + 
  theme_bw() + 
  labs(x = "arguments", title = "R base function arguments by package")


A Comparative Review of the BlueSky Statistics GUI for R

Thu, 06/21/2018 - 14:47

(This article was first published on R – r4stats.com, and kindly contributed to R-bloggers)

Introduction

BlueSky Statistics' desktop version is a free and open source graphical user interface for the R software that focuses on beginners looking to point-and-click their way through analyses. A commercial version is also available, which includes technical support and a version for Windows Terminal Servers such as Remote Desktop or Citrix; Mac, Linux, or tablet users could run it via a terminal server.

This post is one of a series of reviews which aim to help non-programmers choose the Graphical User Interface (GUI) that is best for them. Additionally, these reviews include a cursory description of the programming support that each GUI offers.

 

Terminology

There are various definitions of user interface types, so here’s how I’ll be using these terms:

GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I do not include any assistance for programming in this definition. So, GUI users are people who prefer using a GUI to perform their analyses. They don’t have the time or inclination to become good programmers.

IDE = Integrated Development Environment which helps programmers write code. I do not include point-and-click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to perform their analyses.

 

Installation

The various user interfaces available for R differ quite a lot in how they’re installed. Some, such as jamovi or RKWard, install in a single step. Others install in multiple steps, such as the R Commander (two steps) and Deducer (up to seven steps). Advanced computer users often don’t appreciate how lost beginners can become while attempting even a simple installation. The HelpDesks at most universities are flooded with such calls at the beginning of each semester!

The main BlueSky installation is easily performed in a single step. The installer provides its own embedded copy of R, simplifying the installation and ensuring complete compatibility between BlueSky and the version of R it’s using. However, it also means if you already have R installed, you’ll end up with a second copy. You can have BlueSky control any version of R you choose, but if the version differs too much, you may run into occasional problems.

 

Plug-in Modules

When choosing a GUI, one of the most fundamental questions is: what can it do for you? What the initial software installation of each GUI gets you is covered in the Graphics, Analysis, and Modeling sections of this series of articles. Regardless of what comes built-in, it’s good to know how active the development community is. They contribute “plug-ins” which add new menus and dialog boxes to the GUI. This level of activity ranges from very low (RKWard, Deducer) through moderate (jamovi) to very active (R Commander).

BlueSky is a fairly new open source project, and at the moment all the add-on modules are provided by the company. However, BlueSky's capabilities approach the comprehensiveness of R Commander, which currently has the most add-ons available. The BlueSky developers are working to create an Internet repository for module distribution.

 

Startup

Some user interfaces for R, such as jamovi, start by double-clicking on a single icon, which is great for people who prefer to not write code. Others, such as R commander and JGR, have you start R, then load a package from your library, and call a function. That’s better for people looking to learn R, as those are among the first tasks they’ll have to learn anyway.

You start BlueSky directly by double-clicking its icon from your desktop, or choosing it from your Start Menu (i.e. not from within R itself). It interacts with R in the background; you never need to be aware that R is running.

 

Data Editor

A data editor is a fundamental feature in data analysis software. It puts you in touch with your data and lets you get a feel for it, if only in a rough way. A data editor is such a simple concept that you might think there would be hardly any differences in how they work in different GUIs. While there are technical differences, to a beginner what matters the most are the differences in simplicity. Some GUIs, including jamovi, let you create only what R calls a data frame. They use more common terminology and call it a data set: you create one, you save one, later you open one, then you use one. Others, such as RKWard trade this simplicity for the full R language perspective: a data set is stored in a workspace. So the process goes: you create a data set, you save a workspace, you open a workspace, and choose a data set from within it.

BlueSky starts up by showing you its main Application screen (Figure 1) and prompts you to enter data with an empty spreadsheet-style data editor. You can start entering data immediately, though at first, the variables are simply named var1, var2…. You might think you can rename them by clicking on their names, but such changes are done in a different manner, one that will be very familiar to SPSS users. There are two tabs at the bottom left of the data editor screen, which are labeled “Data” and “Variables.” The “Data” tab is shown by default, but clicking on the “Variables” tab takes you to a screen (Figure 2) which displays the metadata: variable names, labels, types, classes, values, and measurement scale.

Figure 1. The main BlueSky Application screen.

The big advantage that SPSS offers is that you can change the settings of many variables at once. So if you had, say, 20 variables for which you needed to set the same factor labels (e.g. 1=strongly disagree…5=Strongly Agree) you could do it once and then paste them into the other 19 with just a click or two. Unfortunately, that’s not yet fully implemented in BlueSky. Some of the metadata fields can be edited directly. For the rest, you must instead follow the directions at the top of that screen and right click on each variable, one at a time, to make the changes. Complete copy and paste of metadata is planned for a future version.

Figure 2. The Variables screen in the data editor. The “Variables” tab in the lower left is selected, letting us see the metadata for the same variables as shown in Figure 1.

You can enter numeric or character data in the editor right after starting BlueSky. The first time you enter character data, it will offer to convert the variable from numeric to character and wait for you to approve the change. This is very helpful as it’s all too easy to type the letter “O” when meaning to type a zero “0”, or the letter “I” instead of number one “1”.

To add rows, the Data tab is clearly labeled, “Click here to add a new row”. It would be much faster if the Enter key did that automatically.

To add variables you have to go to the Variables tab and right-click on the row of any variable (variable names are in rows on that screen), then choose “Insert new variable at end.”

To enter factor data, it's best to leave it numeric, such as 1 or 2 for male and female, then set the labels (which are called values in SPSS terminology) afterwards. The reason for this is that once labels are set, you must enter them from drop-down menus. While that ensures no invalid values are entered, it slows down data entry. The developers' future plans include automatic display of labels upon entry of numeric values.

If you instead decide to make the variable a factor before entering numeric data, it’s best to enter the numbers as labels as well. It’s an oddity of R that factors are numeric inside, while displaying labels that may or may not be the same as the numbers they represent.

To enter dates, enter them as character data and use the “Data> Compute” menu to convert the character data to a date. When I reported this problem to the developers, they said they would add this to the “Variables” metadata tab so you could set it to be a date variable before entering the data.

If you have another data set to enter, you can start the process again by clicking "File> New", and a new editor window will appear in a new tab. You can switch between data sets simply by clicking on a data set's tab, and its window will pop to the front for you to see. When doing analyses, or saving data, the data set that's displayed in the editor is the one that will be used. That approach feels very natural; what you see is what you get.

Saving the data is done with the standard “File > Save As” menu. You must save each one to its own file. While R allows multiple data sets (and other objects such as models) to be saved to a single file, BlueSky does not. Its developers chose to simplify what their users have to learn by limiting each file to a single data set. That is a useful simplification for GUI users. If a more advanced R user sends a compound file containing many objects, BlueSky will detect it and offer to open one data set (data frame) at a time.

Figure 3. Output window showing standard journal-style tables. Syntax editor has been opened and is shown on right side.

 

Data Import

The open source version of BlueSky supports the following file formats, all located under “File> Open”:

  • Comma Separated Values (.csv)
  • Plain text files (.txt)
  • Excel (old and new xls file types)
  • Dbase’s DBF
  • SPSS (.sav)
  • SAS binary files (sas7bdat)
  • Standard R workspace files (RData) with individual data frame selection

The SQL database formats are found under the “File> Import Data” menu. The supported formats include:

  • Microsoft Access
  • Microsoft SQL Server
  • MySQL
  • PostgreSQL
  • SQLite

 

Data Management

It’s often said that 80% of data analysis time is spent preparing the data. Variables need to be transformed, recoded, or created; strings and dates need to be manipulated; missing values need to be handled; datasets need to be stacked or merged, aggregated, transposed, or reshaped (e.g. from wide to long and back). A critically important aspect of data management is the ability to transform many variables at once. For example, social scientists need to recode many survey items, biologists need to take the logarithms of many variables. Doing these types of tasks one variable at a time can be tedious. Some GUIs, such as jamovi and RKWard handle only a few of these functions. Others, such as the R Commander, can handle many, but not all, of them.

BlueSky offers one of the most comprehensive sets of data management tools of any R GUI. The “Data” menu offers the following set of tools. Not shown is an extensive set of character and date/time functions which appear under “Compute.”

  1. Missing Values
  2. Compute
  3. Bin Numeric Variables
  4. Recode (able to recode many at once)
  5. Make Factor Variable (able to convert many at once)
  6. Transpose
  7. Transform (able to transform many at once)
  8. Sample Dataset
  9. Delete Variables
  10. Standardize Variables (able to standardize many at once)
  11. Aggregate (outputs results to a new dataset)
  12. Aggregate (outputs results to a printed table)
  13. Subset (outputs to a new dataset)
  14. Subset (outputs results to a printed table)
  15. Merge Datasets
  16. Sort (outputs results to a new dataset)
  17. Sort (outputs results to a printed table)
  18. Reload Dataset from File
  19. Refresh Grid
  20. Concatenate Multiple Variables (handling missing values)
  21. Legacy (does same things but using base R code)
  22. Reshape (long to wide)
  23. Reshape (wide to long)

Continued here…


Non-Linear Model in R Exercises

Thu, 06/21/2018 - 08:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

A mechanistic model for the relationship between x and y sometimes needs parameter estimation. When model linearisation does not work, we need to use non-linear modelling.
There are three main differences between non-linear and linear modelling in R:
1. specify the exact nature of the equation
2. replace lm() with nls(), which stands for non-linear least squares (see the sketch below)
3. sometimes we also need to specify starting values for the model parameters a, b, and c.
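To make point 2 concrete, here is a minimal nls() sketch on made-up data (not the exercise dataset), fitting a power function y = a * x^b with the kind of starting values used in Exercise 1:

# Minimal nls() sketch: fit y = a * x^b to simulated data
set.seed(1)
x <- runif(50, 1, 10)
y <- 2 * x^1.5 + rnorm(50)
fit <- nls(y ~ a * x^b, start = list(a = 0.1, b = 1))
summary(fit)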

In this exercise, we will use the same dataset as the previous exercise in polynomial regression here. Download the data-set here.
A quick overview of the dataset.
Response variable = number of invertebrates (INDIV)
Explanatory variable = the area of each clump (AREA)
Additional possible response variables = Species richness of invertebrates (SPECIES)
Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Load the dataset. Specify the model: try to fit a power function with nls(), using a = 0.1 and b = 1 as the initial parameter values.

Exercise 2
Do a quick check by creating a plot of residuals versus fitted values, since a normal plot will not work here.

Exercise 3
Try to build a self-start function for the power model.

Exercise 4
Generate the asymptotic model

Exercise 5
Compare the asymptotic model to the power model using AIC. What can we infer?

Exercise 6
Plot the model in one graph

Exercise 7
Predict across the data and plot all three lines

Related exercise sets:
  1. Spatial Data Analysis: Introduction to Raster Processing (Part 1)
  2. Spatial Data Analysis: Introduction to Raster Processing: Part-3
  3. Density-Based Clustering Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

Explaining Keras image classification models with lime

Thu, 06/21/2018 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Last week I published a blog post about how easy it is to train image classification models with Keras.

What I did not show in that post was how to use the model for making predictions. This, I will do here. But predictions alone are boring, so I’m adding explanations for the predictions using the lime package.

I have already written a few blog posts (here, here and here) about LIME and have given talks (here and here) about it, too.

None of them applies LIME to image classification models, though. And with the new(ish) March release of Thomas Lin Pedersen's lime package, lime is now not only on CRAN but also natively supports Keras and image classification models.

Thomas wrote a very nice article about how to use keras and lime in R! Here, I am following this article to use Imagenet (VGG16) to make and explain predictions of fruit images, and then I am extending the analysis to last week's model and comparing it with the pretrained net.

Loading libraries and models

library(keras)   # for working with neural nets
library(lime)    # for explaining models
library(magick)  # for preprocessing images
library(ggplot2) # for additional plotting
  • Loading the pretrained Imagenet model
model <- application_vgg16(weights = "imagenet", include_top = TRUE)
model

## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ===========================================================================
## input_1 (InputLayer)             (None, 224, 224, 3)           0
## block1_conv1 (Conv2D)            (None, 224, 224, 64)          1792
## block1_conv2 (Conv2D)            (None, 224, 224, 64)          36928
## block1_pool (MaxPooling2D)       (None, 112, 112, 64)          0
## block2_conv1 (Conv2D)            (None, 112, 112, 128)         73856
## block2_conv2 (Conv2D)            (None, 112, 112, 128)         147584
## block2_pool (MaxPooling2D)       (None, 56, 56, 128)           0
## block3_conv1 (Conv2D)            (None, 56, 56, 256)           295168
## block3_conv2 (Conv2D)            (None, 56, 56, 256)           590080
## block3_conv3 (Conv2D)            (None, 56, 56, 256)           590080
## block3_pool (MaxPooling2D)       (None, 28, 28, 256)           0
## block4_conv1 (Conv2D)            (None, 28, 28, 512)           1180160
## block4_conv2 (Conv2D)            (None, 28, 28, 512)           2359808
## block4_conv3 (Conv2D)            (None, 28, 28, 512)           2359808
## block4_pool (MaxPooling2D)       (None, 14, 14, 512)           0
## block5_conv1 (Conv2D)            (None, 14, 14, 512)           2359808
## block5_conv2 (Conv2D)            (None, 14, 14, 512)           2359808
## block5_conv3 (Conv2D)            (None, 14, 14, 512)           2359808
## block5_pool (MaxPooling2D)       (None, 7, 7, 512)             0
## flatten (Flatten)                (None, 25088)                 0
## fc1 (Dense)                      (None, 4096)                  102764544
## fc2 (Dense)                      (None, 4096)                  16781312
## predictions (Dense)              (None, 1000)                  4097000
## ===========================================================================
## Total params: 138,357,544
## Trainable params: 138,357,544
## Non-trainable params: 0
## ___________________________________________________________________________

model2 <- load_model_hdf5(filepath = "/Users/shiringlander/Documents/Github/DL_AI/Tutti_Frutti/fruits-360/keras/fruits_checkpoints.h5")
model2

## Model
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #
## ===========================================================================
## conv2d_1 (Conv2D)                (None, 20, 20, 32)            896
## activation_1 (Activation)        (None, 20, 20, 32)            0
## conv2d_2 (Conv2D)                (None, 20, 20, 16)            4624
## leaky_re_lu_1 (LeakyReLU)        (None, 20, 20, 16)            0
## batch_normalization_1 (BatchNorm (None, 20, 20, 16)            64
## max_pooling2d_1 (MaxPooling2D)   (None, 10, 10, 16)            0
## dropout_1 (Dropout)              (None, 10, 10, 16)            0
## flatten_1 (Flatten)              (None, 1600)                  0
## dense_1 (Dense)                  (None, 100)                   160100
## activation_2 (Activation)        (None, 100)                   0
## dropout_2 (Dropout)              (None, 100)                   0
## dense_2 (Dense)                  (None, 16)                    1616
## activation_3 (Activation)        (None, 16)                    0
## ===========================================================================
## Total params: 167,300
## Trainable params: 167,268
## Non-trainable params: 32
## ___________________________________________________________________________

Load and prepare images

Here, I am loading and preprocessing two images of fruits (and yes, I am cheating a bit because I am choosing images where I expect my model to work as they are similar to the training images…).

  • Banana
test_image_files_path <- "/Users/shiringlander/Documents/Github/DL_AI/Tutti_Frutti/fruits-360/Test"

img <- image_read('https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/Banana-Single.jpg/272px-Banana-Single.jpg')
img_path <- file.path(test_image_files_path, "Banana", 'banana.jpg')
image_write(img, img_path)
#plot(as.raster(img))
  • Clementine
img2 <- image_read('https://cdn.pixabay.com/photo/2010/12/13/09/51/clementine-1792_1280.jpg')
img_path2 <- file.path(test_image_files_path, "Clementine", 'clementine.jpg')
image_write(img2, img_path2)
#plot(as.raster(img2))

Superpixels

The segmentation of an image into superpixels is an important step in generating explanations for image models. It is important both that the segmentation is correct and follows meaningful patterns in the picture, and that the size/number of superpixels is appropriate. If the important features in the image are chopped into too many segments, the permutations will probably damage the picture beyond recognition in almost all cases, leading to a poor or failing explanation model. As the size of the object of interest varies, it is impossible to set up hard rules for the number of superpixels to segment into – the larger the object is relative to the size of the image, the fewer superpixels should be generated. Using plot_superpixels it is possible to evaluate the superpixel parameters before starting the time-consuming explanation function. (help(plot_superpixels))

plot_superpixels(img_path, n_superpixels = 35, weight = 10)

plot_superpixels(img_path2, n_superpixels = 50, weight = 20)

From the superpixel plots we can see that the clementine image has a higher resolution than the banana image.

Prepare images for Imagenet

image_prep <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = c(224, 224))
    x <- image_to_array(img)
    x <- array_reshape(x, c(1, dim(x)))
    x <- imagenet_preprocess_input(x)
  })
  do.call(abind::abind, c(arrays, list(along = 1)))
}
  • test predictions
res <- predict(model, image_prep(c(img_path, img_path2)))
imagenet_decode_predictions(res)

## [[1]]
##   class_name class_description        score
## 1  n07753592            banana 0.9929747581
## 2  n03532672              hook 0.0013420776
## 3  n07747607            orange 0.0010816186
## 4  n07749582             lemon 0.0010625814
## 5  n07716906  spaghetti_squash 0.0009176208
##
## [[2]]
##   class_name class_description      score
## 1  n07747607            orange 0.78233224
## 2  n07753592            banana 0.04653566
## 3  n07749582             lemon 0.03868873
## 4  n03134739      croquet_ball 0.03350329
## 5  n07745940        strawberry 0.01862431
  • load labels and train explainer
model_labels <- readRDS(system.file('extdata', 'imagenet_labels.rds', package = 'lime'))
explainer <- lime(c(img_path, img_path2), as_classifier(model, model_labels), image_prep)

Training the explainer (explain() function) can take pretty long. It will be much faster with the smaller images in my own model but with the bigger Imagenet it takes a few minutes to run.

explanation <- explain(c(img_path, img_path2), explainer, n_labels = 2, n_features = 35, n_superpixels = 35, weight = 10, background = "white")
  • plot_image_explanation() only supports showing one case at a time
plot_image_explanation(explanation)

clementine <- explanation[explanation$case == "clementine.jpg",] plot_image_explanation(clementine)

Prepare images for my own model
  • test predictions (analogous to training and validation images)
test_datagen <- image_data_generator(rescale = 1/255)

test_generator = flow_images_from_directory(
  test_image_files_path,
  test_datagen,
  target_size = c(20, 20),
  class_mode = 'categorical')

predictions <- as.data.frame(predict_generator(model2, test_generator, steps = 1))

load("/Users/shiringlander/Documents/Github/DL_AI/Tutti_Frutti/fruits-360/fruits_classes_indices.RData")
fruits_classes_indices_df <- data.frame(indices = unlist(fruits_classes_indices))
fruits_classes_indices_df <- fruits_classes_indices_df[order(fruits_classes_indices_df$indices), , drop = FALSE]
colnames(predictions) <- rownames(fruits_classes_indices_df)

t(round(predictions, digits = 2))

##             [,1] [,2]
## Kiwi           0 0.00
## Banana         1 0.11
## Apricot        0 0.00
## Avocado        0 0.00
## Cocos          0 0.00
## Clementine     0 0.87
## Mandarine      0 0.00
## Orange         0 0.00
## Limes          0 0.00
## Lemon          0 0.00
## Peach          0 0.00
## Plum           0 0.00
## Raspberry      0 0.00
## Strawberry     0 0.01
## Pineapple      0 0.00
## Pomegranate    0 0.00

for (i in 1:nrow(predictions)) {
  cat(i, ":")
  print(unlist(which.max(predictions[i, ])))
}

## 1 :Banana 
## 2 
## 2 :Clementine 
## 6

This seems to be incompatible with lime, though (or if someone knows how it works, please let me know) – so I prepared the images similarly to the Imagenet images.

image_prep2 <- function(x) {
  arrays <- lapply(x, function(path) {
    img <- image_load(path, target_size = c(20, 20))
    x <- image_to_array(img)
    x <- reticulate::array_reshape(x, c(1, dim(x)))
    x <- x / 255
  })
  do.call(abind::abind, c(arrays, list(along = 1)))
}
  • prepare labels
fruits_classes_indices_l <- rownames(fruits_classes_indices_df)
names(fruits_classes_indices_l) <- unlist(fruits_classes_indices)
fruits_classes_indices_l

##             9            10             8             2            11 
##        "Kiwi"      "Banana"     "Apricot"     "Avocado"       "Cocos" 
##             3            13            14             7             6 
##  "Clementine"   "Mandarine"      "Orange"       "Limes"       "Lemon" 
##             1             5             0             4            15 
##       "Peach"        "Plum"   "Raspberry"  "Strawberry"   "Pineapple" 
##            12 
## "Pomegranate"
  • train explainer
explainer2 <- lime(c(img_path, img_path2), as_classifier(model2, fruits_classes_indices_l), image_prep2)

explanation2 <- explain(c(img_path, img_path2), explainer2, n_labels = 1, n_features = 20, n_superpixels = 35, weight = 10, background = "white")
  • plot feature weights to find a good threshold for plotting block (see below)
explanation2 %>% ggplot(aes(x = feature_weight)) + facet_wrap(~ case, scales = "free") + geom_density()

  • plot predictions
plot_image_explanation(explanation2, display = 'block', threshold = 5e-07)

clementine2 <- explanation2[explanation2$case == "clementine.jpg",] plot_image_explanation(clementine2, display = 'block', threshold = 0.16)

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
## 
## other attached packages:
## [1] ggplot2_2.2.1 magick_1.9    lime_0.4.0    keras_2.1.6
## 
## loaded via a namespace (and not attached):
##  [1] stringdist_0.9.5.1 reticulate_1.8     xfun_0.2
##  [4] lattice_0.20-35    colorspace_1.3-2   htmltools_0.3.6
##  [7] yaml_2.1.19        base64enc_0.1-3    rlang_0.2.1
## [10] pillar_1.2.3       later_0.7.3        foreach_1.4.4
## [13] plyr_1.8.4         tensorflow_1.8     stringr_1.3.1
## [16] munsell_0.5.0      blogdown_0.6       gtable_0.2.0
## [19] htmlwidgets_1.2    codetools_0.2-15   evaluate_0.10.1
## [22] labeling_0.3       knitr_1.20         httpuv_1.4.4.1
## [25] tfruns_1.3         parallel_3.5.0     curl_3.2
## [28] Rcpp_0.12.17       xtable_1.8-2       scales_0.5.0
## [31] backports_1.1.2    promises_1.0.1     jsonlite_1.5
## [34] abind_1.4-5        mime_0.5           digest_0.6.15
## [37] stringi_1.2.3      bookdown_0.7       shiny_1.1.0
## [40] grid_3.5.0         rprojroot_1.3-2    tools_3.5.0
## [43] magrittr_1.5       lazyeval_0.2.1     shinythemes_1.1.1
## [46] glmnet_2.0-16      tibble_1.4.2       whisker_0.3-2
## [49] zeallot_0.1.0      Matrix_1.2-14      gower_0.1.2
## [52] assertthat_0.2.0   rmarkdown_1.10     iterators_1.0.9
## [55] R6_2.2.2           compiler_3.5.0


Scraping Responsibly with R

Thu, 06/21/2018 - 02:00

(This article was first published on Blog-rss on stevenmortimer.com, and kindly contributed to R-bloggers)

I recently wrote a blog post here comparing the number of CRAN downloads an R package gets relative to its number of stars on GitHub. What I didn’t really think about during my analysis was whether or not scraping CRAN was a violation of its Terms and Conditions. I simply copy and pasted some code from R-bloggers that seemed to work and went on my merry way. In hindsight, it would have been better to check whether or not the scraping was allowed and maybe find a better way to get the information I needed. Of course, there was a much easier way to get the CRAN package metadata using the function tools::CRAN_package_db() thanks to a hint from Maëlle Salmon provided in this tweet.

How to Check if Scraping is Permitted

Also provided by Maëlle’s tweet was the recommendation for using the robotstxt package (currently having 27 Stars + one Star that I just added!). It doesn’t seem to be well known as it only has 6,571 total downloads. I’m hoping this post will help spread the word. It’s easy to use! In this case I’ll check whether or not CRAN permits bots on specific resources of the domain.

My other blog post analysis originally started with trying to get a list of all current R packages on CRAN by parsing the HTML from https://cran.rstudio.com/src/contrib. The page looks like this:

The question is whether or not scraping this page is permitted according to the robots.txt file on the cran.rstudio.com domain. This is where the robotstxt package can help us out. We can check simply by supplying the domain and path that form the full link we are interested in scraping. If the paths_allowed() function returns TRUE, then we should be allowed to scrape; if it returns FALSE, then we are not permitted to scrape.

library(robotstxt)

paths_allowed(
  paths = "/src/contrib",
  domain = "cran.rstudio.com",
  bot = "*"
)
#> [1] TRUE

In this case the value that is returned is TRUE meaning that bots are allowed to scrape that particular path. This was how I originally scraped the list of current R packages, even though you don’t really need to do that since there is the wonderful function tools::CRAN_package_db().

After retrieving the list of packages I decided to scrape details from the DESCRIPTION file of each package. Here is where things get interesting. CRAN’s robots.txt file shows that scraping the DESCRIPTION file of each package is not allowed. Furthermore, you can verify this using the robotstxt package:

paths_allowed( paths = "/web/packages/ggplot2/DESCRIPTION", domain = "cran.r-project.org", bot = "*" ) #> [1] FALSE

However, when I decided to scrape the package metadata I did it by parsing the HTML from the canonical package link that resolves to the index.html page for the package. For example, https://cran.r-project.org/package=ggplot2 resolves to https://cran.r-project.org/web/packages/ggplot2/index.html. If you check whether scraping is allowed on this page, the robotstxt package says that it is permitted.

paths_allowed( paths = "/web/packages/ggplot2/index.html", domain = "cran.r-project.org", bot = "*" ) #> [1] TRUE paths_allowed( paths = "/web/packages/ggplot2", domain = "cran.r-project.org", bot = "*" ) #> [1] TRUE

This is a tricky situation because I can access the same information that is in the DESCRIPTION file just by going to the index.html page for the package where scraping seems to be allowed. In the spirit of respecting CRAN it logically follows that I should not be scraping the package index pages if the individual DESCRIPTION files are off-limits. This is despite there being no formal instruction from the robots.txt file about package index pages. All in all, it was an interesting bit of work and glad that I was able to learn about the robotstxt package so I can have it in my toolkit going forward.
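If you would rather read the rules yourself than query individual paths, the same package can fetch the raw robots.txt file (a quick sketch; output omitted):

# Download and print the raw robots.txt for a domain
rtxt <- robotstxt::get_robotstxt("cran.r-project.org")
rtxt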

Remember to Always Scrape Responsibly!

DISCLAIMER: I only have a basic understanding of how robots.txt files work based on allowing or disallowing specified paths. I believe in this case CRAN’s robots.txt broadly permitted scraping, but too narrowly disallowed just the DESCRIPTION files. Perhaps this goes back to an older time where those DESCRIPTION files really were the best place for people to start scraping so it made sense to disallow them. Or the reason could be something else entirely.


Le Monde puzzle [#1053]

Thu, 06/21/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

An easy arithmetic Le Monde mathematical puzzle again:

  1. If coins come in units of 1, x, and y, what is the optimal value of (x,y) that minimises the number of coins representing an arbitrary price between 1 and 149?
  2.  If the number of units is now four, what is the optimal choice?

The first question is fairly easy to code

coinz <- function(x,y){ z=(1:149) if (y

and returns M=12 as the maximal number of coins, corresponding to x=4 and y=22. And a price tag of 129.  For the second question, one unit is necessarily 1 (!) and there is just an extra loop to the above, which returns M=8, with other units taking several possible values:

[1] 40 11  3
[1] 41 11  3
[1] 55 15  4
[1] 56 15  4

A quick search revealed that this problem (or a variant) is solved in many places, from stackexchange (for an average—why average?, as it does not make sense when looking at real prices—number of coins, rather than maximal), to a paper by Shalit calling for the 18¢ coin, to Freakonomics, to Wikipedia, although this is about finding the minimum number of coins summing up to a given value, using fixed currency denominations (a knapsack problem). This Wikipedia page made me realise that my solution is not necessarily optimal, as I use the remainders from the larger denominations in my code, while there may be more efficient divisions. For instance, running the following dynamic programming code

coz=function(x,y){ minco=1:149 if (x

returns the lower value of M=11 (with x=7,y=23) in the first case and M=7 in the second one.
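For reference, here is a minimal dynamic-programming sketch of the change-making computation (my own illustration, not the code used above): it computes the minimum number of coins needed for every price from 1 to 149 with denominations 1, x and y, then takes the worst case over all prices.

# Minimum coins for each price 1..max_price with denominations 1, x, y
min_coins <- function(x, y, max_price = 149) {
  denoms <- c(1, x, y)
  best <- rep(Inf, max_price)
  for (p in seq_len(max_price)) {
    for (d in denoms[denoms <= p]) {
      best[p] <- min(best[p], if (d == p) 1 else best[p - d] + 1)
    }
  }
  best
}

max(min_coins(7, 23))   # 11, consistent with the optimum reported above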


PYPL Language Rankings: Python ranks #1, R at #7 in popularity

Wed, 06/20/2018 - 21:38

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The new PYPL Popularity of Programming Languages (June 2018) index ranks Python at #1 and R at #7.

Like the similar TIOBE language index, the PYPL index uses Google search activity to rank language popularity. PYPL, however, focuses on people searching for tutorials in the respective languages as a proxy for popularity. By that measure, Python has always been more popular than R (as you'd expect from a more general-purpose language), but both have been growing at similar rates. The chart below includes the three data-oriented languages tracked by the index (and note the vertical scale is logarithmic).

Another language ranking was also released recently: the annual KDnuggets Analytics, Data Science and Machine Learning Poll. These rankings, however, are derived not from search trends but from self-selected poll respondents, which perhaps explains the presence of Rapidminer at the #2 spot.


Big News: vtreat 1.2.0 is Available on CRAN, and it is now Big Data Capable

Wed, 06/20/2018 - 19:23

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We here at Win-Vector LLC have some really big news, and we would very much like the R community's help in sharing it.

vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark.

vtreat is a very complete and rigorous tool for preparing messy real world data for supervised machine-learning tasks. It implements a technique we call “safe y-aware processing” using cross-validation or stacking techniques. It is very easy to use: you show it some data and it designs a data transform for you.
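As a rough sketch of the basic in-memory workflow (using a small made-up data frame, so the details here are illustrative rather than taken from the package announcement):

library(vtreat)

# Toy data with a categorical variable, missing values, and a numeric outcome
d <- data.frame(
  x = c("a", "a", "b", "b", NA, "c"),
  z = c(1, 2, NA, 4, 5, 6),
  y = c(1, 2, 3, 4, 5, 6)
)

# Design y-aware treatments for the input variables...
treatments <- designTreatmentsN(d, varlist = c("x", "z"), outcomename = "y")

# ...then apply the plan to get a clean, all-numeric frame ready for modeling
d_prepared <- prepare(treatments, d)
head(d_prepared)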

Thanks to the rquery package, this data preparation transform can now be directly applied to databases, or big data systems such as PostgreSQL, Amazon RedShift, Apache Spark, or Google BigQuery. Or, thanks to the data.table and rqdatatable packages, even fast large in-memory transforms are possible.

We have some basic examples of the new vtreat capabilities here and here.


Neural Networks Are Essentially Polynomial Regression

Wed, 06/20/2018 - 19:00

(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

You may be interested in my new arXiv paper, joint work with Xi Cheng, an undergraduate at UC Davis (now heading to Cornell for grad school); Bohdan Khomtchouk, a post doc in biology at Stanford; and Pete Mohanty,  a Science, Engineering & Education Fellow in statistics at Stanford. The paper is of a provocative nature, and we welcome feedback.

A summary of the paper is:

  • We present a very simple, informal mathematical argument that neural networks (NNs) are in essence polynomial regression (PR). We refer to this as NNAEPR.
  • NNAEPR implies that we can use our knowledge of the “old-fashioned” method of PR to gain insight into how NNs — widely viewed somewhat warily as a “black box” — work inside.
  • One such insight is that the outputs of an NN layer will be prone to multicollinearity, with the problem becoming worse with each successive layer. This in turn may explain why convergence issues often develop in NNs. It also suggests that NN users tend to use overly large networks.
  • NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.
  • We investigated this on a wide variety of datasets, and found that in every case PR did as well as, and often better than, NNs.
  • We have developed a feature-rich R package, polyreg, to facilitate using PR in multivariate settings.

Much work remains to be done (see paper), but our results so far are very encouraging. By using PR, one can avoid the headaches of NN, such as selecting good combinations of tuning parameters, dealing with convergence problems, and so on.
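For readers who want to see what plain polynomial regression looks like before reaching for the polyreg package, here is a base-R sketch (this is just lm() with poly() on a built-in dataset, not the polyreg interface):

# Degree-2 polynomial regression with interactions, in base R
fit <- lm(mpg ~ poly(hp, 2, raw = TRUE) * poly(wt, 2, raw = TRUE), data = mtcars)
summary(fit)$r.squared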

Also available are the slides for our presentation at GRAIL on this project.


Intro To Time Series Analysis Part 2 :Exercises

Wed, 06/20/2018 - 06:38

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


In the exercises below, we will explore Time Series analysis in more depth. The previous exercise set is here; please follow the series in sequence.
Answers to these exercises are available here.
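As a starting point, the exercises assume the built-in AirPassengers dataset and the forecast package; a minimal setup looks like this:

# Setup assumed by the exercises below
library(forecast)
data("AirPassengers")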

Exercise 1

Load the AirPassengers data, check its class, and see the start and end of the series.

Exercise 2
Check the cycle of the time series AirPassengers.

Exercise 3

Create the lag plot using gglagplot() from the forecast package, and check how the relationship changes as the lag increases.

Exercise 4

Also plot the correlation for each of the lags. You can see that when the lag is above 6 the correlation drops, climbs up again at lag 12, and drops again at lag 18.
Exercise 5

Plot the histogram of AirPassengers using gghistogram() from the forecast package.

Exercise 6

Use tsdisplay() to plot the autocorrelation, the time series, and the partial autocorrelation together in the same plot.

Exercise 7

Find the outliers in the time series.

Related exercise sets:
  1. 3D plotting exercises
  2. Vector exercises
  3. Bayesian Inference – MCMC Diagnostics using coda : Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

Reading and analysing log files in the RRD database format

Wed, 06/20/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

I have frequent conversations with R champions and Systems Administrators responsible for R, in which they ask how they can measure and analyze the usage of their servers. Among the many solutions to this problem, one of my favourites is to use an RRD database and RRDtool.

From Wikipedia:

RRDtool (round-robin database tool) aims to handle time series data such as network bandwidth, temperatures or CPU load. The data is stored in a circular buffer based database, thus the system storage footprint remains constant over time.

RRDtool is a library written in C, with implementations that can also be accessed from the Linux command line. This makes it convenient for system development, but makes it difficult for R users to extract and analyze this data.

I am pleased to announce that I’ve been working on the rrd R package to import RRD files directly into tibble objects, thus making it easy to analyze your metrics.

As an aside, the RStudio Pro products (specifically RStudio Server Pro and RStudio Connect) also make use of RRD to store metrics – more about this later.

Understanding the RRD format as an R user

The name RRD is an initialism of Round Robin Database. The “round robin” refers to the fact that the database is always fixed in size, and as a new entry enters the database, the oldest entry is discarded. In practical terms, the database collects data for a fixed period of time, and information that is older than the threshold gets removed.

A second quality of RRD databases is that each datum is stored in different “consolidation data points”, where every data point is an aggregation over time. For example, a data point can represent an average value for the time period, or a maximum over the period. Typical consolidation functions include average, min and max.

The third quality is that every RRD database file typically consists of multiple archives. Each archive measures data for a different time period. For instance, the archives can capture data for intervals of 10 seconds, 30 seconds, 1 minute or 5 minutes.

As an example, here is a description of an RRD file that originated in RStudio Connect:

describe_rrd("rrd_cpu_0") #> A RRD file with 10 RRA arrays and step size 60 #> [1] AVERAGE_60 (43200 rows) #> [2] AVERAGE_300 (25920 rows) #> [3] MIN_300 (25920 rows) #> [4] MAX_300 (25920 rows) #> [5] AVERAGE_3600 (8760 rows) #> [6] MIN_3600 (8760 rows) #> [7] MAX_3600 (8760 rows) #> [8] AVERAGE_86400 (1825 rows) #> [9] MIN_86400 (1825 rows) #> [10] MAX_86400 (1825 rows)

This RRD file contains data for the properties of CPU 0 of the system. In this example, the first RRA archive contains averaged metrics for one minute (60s) intervals, while the second RRA measures the same metric, but averaged over five minutes. The same metrics are also available for intervals of one hour and one day.

Notice also that every archive has a different number of rows, representing a different historical period for which the data is kept. For example, the per-minute data AVERAGE_60 is retained for 43,200 periods (30 days), while the daily data MAX_86400 is retained for 1,825 periods (5 years).
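A quick back-of-the-envelope check, multiplying the number of rows by the step size, confirms these retention windows:

43200 * 60 / 86400   # AVERAGE_60: 30 days of per-minute data
1825 / 365           # MAX_86400: 5 years of daily data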

If you want to know more, please read the excellent introduction tutorial to RRD database.

Introducing the rrd package

Until recently, it wasn’t easy to import RRD files into R. But I was pleased to discover that a Google Summer of Code 2014 project created a proof-of-concept R package to read these files. The author of this package is Plamen Dimitrov, who published the code on GitHub and also wrote an explanatory blog post.

Because I had to provide some suggestions to our customers, I decided to update the package, provide some example code, and generally improve the reliability.

The result is not yet on CRAN, but you can install the development version of the package from GitHub.

Installing the package

To build the package from source, you first need to install librrd. Installing RRDtool from your Linux package manager will usually also install this library.

Using Ubuntu:

sudo apt-get install rrdtool librrd-dev

Using RHEL / CentOS:

sudo yum install rrdtool rrdtool-devel

Once you have the system requirements in place, you can install the development version of the R package from GitHub using:

# install.packages("devtools")
devtools::install_github("andrie/rrd")

Limitations

The package is not yet available for Windows.

Using the package

Once you’ve installed the package, you can start to use it. The package itself contains some built-in RRD files, so you should be able to run the following code directly.

library(rrd)

Describing the contents of an RRD file

To describe the contents of an RRD file, use describe_rrd(). This function reports the name of each archive (RRA), its consolidation function, and the number of observations:

rrd_cpu_0 <- system.file("extdata/cpu-0.rrd", package = "rrd")
describe_rrd(rrd_cpu_0)
#> A RRD file with 10 RRA arrays and step size 60
#> [1] AVERAGE_60 (43200 rows)
#> [2] AVERAGE_300 (25920 rows)
#> [3] MIN_300 (25920 rows)
#> [4] MAX_300 (25920 rows)
#> [5] AVERAGE_3600 (8760 rows)
#> [6] MIN_3600 (8760 rows)
#> [7] MAX_3600 (8760 rows)
#> [8] AVERAGE_86400 (1825 rows)
#> [9] MIN_86400 (1825 rows)
#> [10] MAX_86400 (1825 rows)

Reading an entire RRD file

To read an entire RRD file, i.e. all of the RRA archives, use read_rrd(). This returns a list of tibble objects:

cpu <- read_rrd(rrd_cpu_0)
str(cpu, max.level = 1)
#> List of 10
#>  $ AVERAGE60   :Classes 'tbl_df', 'tbl' and 'data.frame': 43199 obs. of 9 variables:
#>  $ AVERAGE300  :Classes 'tbl_df', 'tbl' and 'data.frame': 25919 obs. of 9 variables:
#>  $ MIN300      :Classes 'tbl_df', 'tbl' and 'data.frame': 25919 obs. of 9 variables:
#>  $ MAX300      :Classes 'tbl_df', 'tbl' and 'data.frame': 25919 obs. of 9 variables:
#>  $ AVERAGE3600 :Classes 'tbl_df', 'tbl' and 'data.frame': 8759 obs. of 9 variables:
#>  $ MIN3600     :Classes 'tbl_df', 'tbl' and 'data.frame': 8759 obs. of 9 variables:
#>  $ MAX3600     :Classes 'tbl_df', 'tbl' and 'data.frame': 8759 obs. of 9 variables:
#>  $ AVERAGE86400:Classes 'tbl_df', 'tbl' and 'data.frame': 1824 obs. of 9 variables:
#>  $ MIN86400    :Classes 'tbl_df', 'tbl' and 'data.frame': 1824 obs. of 9 variables:
#>  $ MAX86400    :Classes 'tbl_df', 'tbl' and 'data.frame': 1824 obs. of 9 variables:

Since the resulting object is a list of tibble objects, you can easily use R functions to work with an individual archive:

names(cpu)
#>  [1] "AVERAGE60"    "AVERAGE300"   "MIN300"       "MAX300"
#>  [5] "AVERAGE3600"  "MIN3600"      "MAX3600"      "AVERAGE86400"
#>  [9] "MIN86400"     "MAX86400"

To inspect the contents of the first archive (AVERAGE60), simply print the object – since it’s a tibble, you get 10 lines of output.

For example, the CPU archive contains a time stamp and metrics for average user and sys usage, as well as the nice value, idle time, interrupt requests and soft interrupt requests:

cpu[[1]]
#> # A tibble: 43,199 x 9
#>    timestamp              user     sys  nice  idle  wait   irq softirq
#>  *
#>  1 2018-04-02 12:24:00 0.0104  0.00811     0 0.981     0     0       0
#>  2 2018-04-02 12:25:00 0.0126  0.00630     0 0.979     0     0       0
#>  3 2018-04-02 12:26:00 0.0159  0.00808     0 0.976     0     0       0
#>  4 2018-04-02 12:27:00 0.00853 0.00647     0 0.985     0     0       0
#>  5 2018-04-02 12:28:00 0.0122  0.00999     0 0.978     0     0       0
#>  6 2018-04-02 12:29:00 0.0106  0.00604     0 0.983     0     0       0
#>  7 2018-04-02 12:30:00 0.0147  0.00427     0 0.981     0     0       0
#>  8 2018-04-02 12:31:00 0.0193  0.00767     0 0.971     0     0       0
#>  9 2018-04-02 12:32:00 0.0300  0.0274      0 0.943     0     0       0
#> 10 2018-04-02 12:33:00 0.0162  0.00617     0 0.978     0     0       0
#> # ... with 43,189 more rows, and 1 more variable: stolen

Since the data is in tibble format, you can easily extract specific data, e.g., the last values of the system usage:

tail(cpu$AVERAGE60$sys)
#> [1] 0.0014390667 0.0020080000 0.0005689333 0.0000000000 0.0014390667
#> [6] 0.0005689333

Reading only a single archive

The underlying code in the rrd package is written in C, and is therefore blazingly fast. Reading an entire RRD file takes a fraction of a second, but sometimes you may want to read only a single RRA archive.

To read a single RRA archive from an RRD file, use read_rra(). To use this function, you must specify several arguments that define the specific data to retrieve. This includes the consolidation function (e.g., "AVERAGE") and time step (e.g., 60). You must also specify either the start time or the number of steps, n_steps.

In this example, I extract the average for one-minute periods (step = 60) for one day (n_steps = 24 * 60):

end_time <- as.POSIXct("2018-05-02") # timestamp with data in example
avg_60 <- read_rra(rrd_cpu_0, cf = "AVERAGE", step = 60,
                   n_steps = 24 * 60, end = end_time)
avg_60
#> # A tibble: 1,440 x 9
#>    timestamp              user      sys  nice  idle     wait   irq softirq
#>  *
#>  1 2018-05-01 00:01:00 0.00458 0.00201      0 0.992 0            0       0
#>  2 2018-05-01 00:02:00 0.00258 0.000570     0 0.996 0            0       0
#>  3 2018-05-01 00:03:00 0.00633 0.00144      0 0.992 0            0       0
#>  4 2018-05-01 00:04:00 0.00515 0.00201      0 0.991 0            0       0
#>  5 2018-05-01 00:05:00 0.00402 0.000569     0 0.995 0            0       0
#>  6 2018-05-01 00:06:00 0.00689 0.00144      0 0.992 0            0       0
#>  7 2018-05-01 00:07:00 0.00371 0.00201      0 0.993 0.00144      0       0
#>  8 2018-05-01 00:08:00 0.00488 0.00201      0 0.993 0.000569     0       0
#>  9 2018-05-01 00:09:00 0.00748 0.000568     0 0.992 0            0       0
#> 10 2018-05-01 00:10:00 0.00516 0            0 0.995 0            0       0
#> # ... with 1,430 more rows, and 1 more variable: stolen

Plotting the results

The original RRDTool library for Linux contains some functions to easily plot the RRD data, a feature that distinguishes RRD from many other databases.

However, R already has very rich plotting capability, so the rrd R package doesn’t expose any specific plotting functions.

For example, you can easily plot these data using your favourite packages, like ggplot2:

library(ggplot2)
ggplot(avg_60, aes(x = timestamp, y = user)) +
  geom_line() +
  stat_smooth(method = "loess", span = 0.125, se = FALSE) +
  ggtitle("CPU0 usage, data read from RRD file")

Getting the RRD files from RStudio Server Pro and RStudio Connect

As I mentioned in the introduction, both RStudio Server Pro and RStudio Connect use RRD to store metrics. In fact, these metrics are used to power the administration dashboard of these products.

This means that often the easiest solution is simply to enable the admin dashboard and view the information there.

However, sometimes R users and system administrators have a need to analyze the metrics in more detail, so in this section, I discuss where you can find the files for analysis.

The administration guides for these products explain where to find the metrics files:

  • The admin guide for RStudio Server Pro discusses metrics in section 8.2, Monitoring Configuration.
    • By default, the metrics are stored at /var/lib/rstudio-server/monitor/rrd, although this path is configurable by the server administrator
    • RStudio Server Pro stores system metrics as well as user metrics
  • RStudio Connect discusses metrics in section 16.1 Historical Metrics
    • The default path for metrics logs is /var/lib/rstudio-connect/metrics, though again, this is configurable by the server administrator.
rsc <- "/var/lib/rstudio-connect/metrics/rrd"
rsp <- "/var/lib/rstudio-server/monitor/rrd"

If you want to analyze these files, it is best to copy them to a different location first. The security and permissions on both products are configured in such a way that it's not possible to read the files while they are in the original folder, so copy the files elsewhere and do the analysis there; a sketch of this step follows.
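For illustration only, the copy step could be scripted from R. This is a hedged sketch: the source path is the RStudio Connect default quoted above, the destination is an arbitrary placeholder, and you will typically need elevated privileges (for example sudo) to read the source directory.

src  <- "/var/lib/rstudio-connect/metrics/rrd"   # Connect default, adjust to your config
dest <- "~/rrd-copy"                             # placeholder destination
dir.create(dest, showWarnings = FALSE)

# copy every .rrd file out of the sandboxed location for analysis
file.copy(list.files(src, pattern = "\\.rrd$", full.names = TRUE),
          to = dest, overwrite = TRUE)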

Warning about using the RStudio Connect RRD files:

The RStudio Connect team is actively planning to change the way content-level metrics are stored, so data related to shiny apps, markdown reports, etc. will likely look different in a future release.

To be clear:

  • The schemas might change
  • RStudio Connect may stop tracking some metrics
  • It’s also possible that the entire mechanism might change

The only guarantees that we make in RStudio Connect are around the data that we actually surface:

  • server-wide user counts
  • RAM
  • CPU data

This means that if you analyze RRD files, you should be aware that the entire mechanism for storing metrics might change in future.

Additional caveat
  • The metrics collection process runs in a sandboxed environment, and it is not possible to publish a report to RStudio Connect that reads the metrics directly. If you want to automate a process to read the Connect metrics, you will have to set up a cron job to copy the files to a different location, and run the analysis against the copied files. (Also, re-read the warning that everything might change!)
Example

In the following worked example, I copied some RRD files that originated in RStudio Connect to a different location on disk, and stored that location in a config file.

First, list the file names:

config <- config::get()
rrd_location <- config$rrd_location

rrd_location %>%
  list.files() %>%
  tail(20)
##  [1] "content-978.rrd"      "content-986.rrd"      "content-98.rrd"
##  [4] "content-990.rrd"      "content-995.rrd"      "content-998.rrd"
##  [7] "cpu-0.rrd"            "cpu-1.rrd"            "cpu-2.rrd"
## [10] "cpu-3.rrd"            "license-users.rrd"    "network-eth0.rrd"
## [13] "network-lo.rrd"       "system-CPU.rrd"       "system.cpu.usage.rrd"
## [16] "system.load.rrd"      "system.memory.rrd"    "system-RAM.rrd"
## [19] "system.swap.rrd"      "system-SWAP.rrd"

The file names indicate that RStudio Connect collects metrics for the system (CPU, RAM, etc.), as well as for every piece of published content.

To look at the system load, first describe the contents of the "system.load.rrd" file:

sys_load <- file.path(rrd_location, "system.load.rrd")
describe_rrd(sys_load)
## A RRD file with 10 RRA arrays and step size 60
## [1] AVERAGE_60 (43200 rows)
## [2] AVERAGE_300 (25920 rows)
## [3] MIN_300 (25920 rows)
## [4] MAX_300 (25920 rows)
## [5] AVERAGE_3600 (8760 rows)
## [6] MIN_3600 (8760 rows)
## [7] MAX_3600 (8760 rows)
## [8] AVERAGE_86400 (1825 rows)
## [9] MIN_86400 (1825 rows)
## [10] MAX_86400 (1825 rows)

This output tells you that metrics are collected every 60 seconds (one minute), and then consolidated at selected multiples (1 minute, 5 minutes, 1 hour and 1 day). You can also tell that the consolidation functions are average, min and max.

To extract one month of data, averaged at 5-minute intervals, use step = 300:

dat <- read_rra(sys_load, cf = "AVERAGE", step = 300L,
                n_steps = (3600 / 300) * 24 * 30)
dat
## # A tibble: 8,640 x 4
##    timestamp            `1min`  `5min` `15min`
##  *
##  1 2018-05-10 19:10:00 0.0254  0.0214   0.05
##  2 2018-05-10 19:15:00 0.263   0.153    0.0920
##  3 2018-05-10 19:20:00 0.0510  0.117    0.101
##  4 2018-05-10 19:25:00 0.00137 0.0509   0.0781
##  5 2018-05-10 19:30:00 0       0.0168   0.0534
##  6 2018-05-10 19:35:00 0       0.01     0.05
##  7 2018-05-10 19:40:00 0.0146  0.0166   0.05
##  8 2018-05-10 19:45:00 0.00147 0.0115   0.05
##  9 2018-05-10 19:50:00 0.0381  0.0306   0.05
## 10 2018-05-10 19:55:00 0.0105  0.018    0.05
## # ... with 8,630 more rows

It is very easy to plot this using your preferred plotting package, e.g., ggplot2:

ggplot(dat, aes(x = timestamp, y = `5min`)) +
  geom_line() +
  stat_smooth(method = "loess", span = 0.125)

Conclusion

The rrd package, available from GitHub, makes it very easy to read metrics stored in the RRD database format. Reading an archive is very quick, and your resulting data is a tibble for an individual archive, or a list of tibbles for the entire file.

This makes it easy to analyze your data using the tidyverse packages, and to plot the information.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

a chain of collapses

Wed, 06/20/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A quick Riddler resolution during a committee meeting (!) of a short riddle: 36 houses stand in a row and collapse at times t=1,2,..,36. In addition, once a house collapses, its neighbours, if still standing, collapse at the next time unit. What are the shortest and longest lifespans of this row?

Since a house with index i collapses on its own by time i at the latest, the longest lifespan is 36, which is achieved, even with the extra neighbour rule, when the collapsing times are perfectly ordered along the row. For the shortest lifespan, I ran a short R simulation implementing the rules and monitoring the minimum, which found 7 as the minimal lifespan over 10⁵ draws. However, with an optimal ordering, each self-collapsing house starts a cluster that then grows by two houses (one on each side) per time unit, so the maximal number of collapsed houses after k time units is

1 + 2(k-1) + 1 + 2(k-2) + … + 1 = k + k(k-1) = k²

which happens to equal 36 for k=6, so the shortest possible lifespan of the 36 houses is 6. (A lifespan of 6 was indeed obtained in 10⁶ simulation draws!) The same argument gives the solution for any number of houses of the form k².
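For reference, here is a minimal simulation sketch of the kind of R code described above; it is my own reconstruction, not the original script.

row_lifespan <- function(n = 36) {
  t_own <- sample(n)                      # house i collapses on its own at time t_own[i]
  collapsed <- rep(FALSE, n)
  for (time in seq_len(n)) {
    nbrs <- c(which(collapsed) - 1, which(collapsed) + 1)
    nbrs <- nbrs[nbrs >= 1 & nbrs <= n]   # neighbours of houses that have already fallen
    collapsed[nbrs] <- TRUE               # they fall at this time unit
    collapsed[t_own == time] <- TRUE      # plus the house whose own time has come
    if (all(collapsed)) return(time)
  }
  n
}

set.seed(1)
min(replicate(1e5, row_lifespan()))       # the post reports 7 over 1e5 draws, 6 over 1e6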


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Searching For Unicorns (And Other NBA Myths)

Tue, 06/19/2018 - 23:27

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

A visual exploration of the 2017-2018 NBA landscape

The modern NBA landscape is rapidly changing.

Steph Curry has redefined the lead guard prototype with jaw-dropping shooting range coupled with unprecedented scoring efficiency for a guard. The likes of Marc Gasol, Al Horford and Kristaps Porzingis are paving the way for a younger generation of modern big men as defensive rim protectors who can space the floor on offense as three-point threats. Then there are the new-wave facilitators – LeBron James, Draymond Green, Ben Simmons – enormous athletes who can guard any position on defense and push the ball down court in transition.

For fans, analysts and NBA front offices alike, these are the prototypical players that make our mouths water. So what do they have in common?

For one, they are elite statistical outliers in at least two categories, and this serves as the primary motivation for my exploratory analysis tool: To identify NBA players in the 2017-2018 season that exhibited unique skill sets based on statistical correlations.

To access the tool, click here.

The Data

The tool uses box score data from the 2017-2018 NBA season (source: Kaggle) and focuses on the following categories: points, rebounds, assists, turnovers, steals, blocks, 3-pointers made, FG% and FT%. I also used Dean Oliver's formula for estimating a player's total possessions (outlined here).
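For reference, the commonly quoted simplified form of Oliver's possession estimate looks like the sketch below; this is an illustration of the idea only, and the full formula linked in the post includes further team-level adjustments.

# simplified possession estimate from field goal attempts, offensive rebounds,
# turnovers and free throw attempts
possessions <- function(fga, orb, tov, fta) {
  fga - orb + tov + 0.44 * fta
}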

To assess all players on an equal scale, I normalized the box score data for each player. For ease of interpretability, I chose "per 36 minute" normalization, which takes a player's per-minute production and extrapolates it to 36 minutes of playing time. In this way, the values displayed in the scatterplot represent each player's production per 36 minutes of playing time.

To ensure that the per-36 minute calculations did not generate any outliers due to small statistical samples, I removed all players with fewer than nine games in the season, as well as players who averaged three minutes or less per game.
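Taken together, a hedged dplyr sketch of this normalisation and filtering might look as follows; box_scores and its columns (player, minutes, pts, fg3m) are placeholder names, not the actual Kaggle schema.

library(dplyr)

per36 <- box_scores %>%                          # placeholder: one row per player-game
  group_by(player) %>%
  summarise(games   = n(),
            minutes = sum(minutes),
            pts     = sum(pts),
            fg3m    = sum(fg3m)) %>%
  filter(games >= 9, minutes / games > 3) %>%    # drop small statistical samples
  mutate(pts_36  = pts  / minutes * 36,          # extrapolate to per-36 production
         fg3m_36 = fg3m / minutes * 36)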

Using the tool: A demonstration

The tool is a Shiny application intended to be used for exploratory analysis and player discovery. To most effectively understand and interpret the charts, you can follow these steps:

Step 1: Assess the correlation matrix

The correlation matrix uses the Pearson correlation coefficient as a reference to guide your use of the dynamic scatter plot. Each dot represents the league-wide correlation between two statistical categories.

The color scale indicates the direction of the correlation: blue dots represent negatively correlated statistics, and red dots positively correlated statistics. The size of the dot indicates the magnitude of the correlation – that is, how strong the relationship is between the two statistics across the entire league. Large dots represent a strong correlation between two statistics, while small dots indicate that the two statistics have little or no linear relationship.
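As a rough sketch of how such a matrix could be computed and drawn, continuing the placeholder per36 columns from the earlier sketch (this is not the app's actual code):

library(ggplot2)

# pairwise Pearson correlations between the per-36 statistics
cors <- cor(per36[, c("pts_36", "fg3m_36")], use = "pairwise.complete.obs")
cor_long <- as.data.frame(as.table(cors))   # long form: Var1, Var2, Freq

ggplot(cor_long, aes(Var1, Var2, size = abs(Freq), colour = Freq)) +
  geom_point() +
  scale_colour_gradient2(low = "blue", mid = "white", high = "red", limits = c(-1, 1)) +
  labs(size = "|r|", colour = "r")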

Step 2: Select two statistics to plot for exploration

We can get a flavor of these relationships as we move to the scatterplot. (Follow along using the app.) For the purpose of identifying truly unique players, let’s look at a pairing of negatively correlated statistics with high magnitude (i.e. a blue, large dot): 3-pointers made (“3PM”) vs. Field goal percentage (“FG%”).

Step 3: Explore

It makes sense intuitively why these are negatively correlated – a player making a lot of threes is also attempting a lot of long-distance, low-percentage shots. Given the value of floor-spacing in today’s NBA, a high-volume 3-point shooter who is also an efficient scorer possesses unique abilities. So, let’s select FG% for our x-axis and 3PM for our y-axis (using the dropdowns in the menu bar), and see what we find…

The two dotted lines within the scatterplot represent the 50th percentile for each statistic. In the case of FG% vs. 3PM, we turn to the upper right quadrant, which represents the players who are above average in both FG% and 3-pointers made. To focus our analysis, we can zoom in on this quadrant for a close look.

To zoom, simply select and drag across the plotted space you want to zoom in to, in this case the upper right quadrant. You can also filter by position by simply selecting specific positions in the legend.

Hover over a point to see who the player is, as well as their per-36 statistics. At the top of our plot, no surprises here: Steph Curry. While his 4.7 threes per 36 minutes leads the league, what truly separates him is his 50% efficiency from the field. But we already know that Steph is an exceptional anomaly, so who else can we find?

While several superstars can also be found at the top of our plot – Kevin Durant, Kyrie Irving, and Klay Thompson stand out – we have quite a few role players up there as well: Kyle Korver, J.J. Redick, Kelly Olynyk and Joe Ingles. These are quality reserves who may not wow us with their overall statistical profiles, but play a crucial, high-value role on teams by spacing the floor without sacrificing scoring efficiency.

Step 4: Repeat

I recommend starting your exploration with the blue dots of the correlation matrix – blocks vs. threes, rebounds vs. threes, assists vs. blocks, for example. These are where you can identify players with the most unique skill pairings across the league. (Note: When plotting turnovers, be sure to focus below the median line, as it is better to have low turnovers than high.)

For fantasy basketball enthusiasts, this is a great tool to identify players with specific statistical strengths to construct a well-balanced team, or complement your roster core.

Conclusion

I really enjoyed building this tool and exploring its visualization of the NBA landscape. From an interpretability standpoint, however, it is not ideal that we can only focus on one player at a time. To improve on this, I plan to include an additional table that provides a deeper look at players that fall above the median line for both the X and Y statistics. In this way, we can further analyze these players across a larger range of performance variables.

 


To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

Dear data scientists, how to ease your job

Tue, 06/19/2018 - 11:00

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

You have got the best job of the 21st century. Like a treasure hunter, you look for a data treasure chest while sailing through data lakes. In many companies you are a digital maker – armed with skills that turn you into a modern-day polymath, and a toolset so holistic and complex at once that even astronauts would feel dizzy.

However, there is still something you carry every day to work: the burden of high expectations and demands of others – whether it is a customer, your supervisor or colleagues. Being a data scientist is a dream job, but also very stressful as it requires creative approaches and new solutions every day.  

Would it not be great if there were something that made your daily work easier?

Many requirements and your need for a solution

R, Python and Julia – does your department work with several programming languages? Would it not be great to have a solution that supports various languages and thus encourages what is so essential for data analysis: teamwork? Additionally, connectivity packages could enable you to work in a familiar environment such as RStudio.

Performance is everything: you create complex neural networks and process data quantities that make big data really deserve the word "big". Imagine you could transfer your analyses into a controlled and scalable environment where your scripts not only run reliably but also benefit from optimal load distribution – including horizontal scaling and improved system performance.

Data sources, users and analysis scripts – you are in search of a tool that can bring all components together in a bundled analysis project to manage resources more efficiently, raise transparency and develop a compliant workflow. The best possible solution is a role management system that can easily be extended to the specialist department.

Time is money. Of course, that also applies to your working time. You need a solution that can free you from time-consuming routine tasks, such as monitoring and parameterizing script execution, and that can trigger analyses on a schedule. Additionally, dynamic load distribution and logging of script output ensure the operationalization of script execution in a business-critical environment.

Keep an eye on the big picture: your performant analysis will not bring you the deserved satisfaction if you are not able to embed it into the existing IT landscape. A tool with consistent interfaces that integrate your analysis scripts via a REST API neatly into any existing application would be perfect to ease your daily workload.

eoda | data science core: a solution from data scientists to data scientists

Imagine a data science tool that incorporates the experience to leverage your potential in bringing data science to the enterprise environment.

Based on many years of experience from analysis projects and knowledge of your daily challenges, we have developed a solution for you: the eoda | data science core. You can manage your analysis projects in a flexible, performant and secure way. It gives you the space you need to deal with expectations and keep the love for the profession – as it is, after all, the best job in the world.

The eoda | data science environment provides a framework for creating and managing different containers with several setups for various applications. In the eoda | data science environment, scripts written in different languages such as R, Python or Julia can access common intermediate results and be managed in one data science project.

The eoda | data science core is the first main component of the eoda | data science environment. This will be complemented with the second component, the eoda | data science portal. How does the portal enable collaborative working, explorative analyses and a user-friendly visualization of results? Read all about it in the next article and find out.

For more information: www.data-science-environment.com


To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.

11 Jobs for R users from around the world (2018-06-19)

Tue, 06/19/2018 - 07:31
To post your R job on the next post

Just visit  this link and post a new R job  to the R community.

You can post a job for  free  (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers:  please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

 

All New R Jobs

 

  1. Full-Time
    Research Fellow UC Hastings Institute for Innovation Law – Posted by feldmanr
    San Francisco California, United States
    19 Jun 2018
  2. Full-Time
    Technical Support Engineer at RStudio RStudio – Posted by agadrow
    Anywhere
    19 Jun 2018
  3. Full-Time
    postdoc in psychiatry: machine learning in human genomics University of Iowa – Posted by michaelsonlab
    Anywhere
    18 Jun 2018
  4. Full-Time
    Lead Quantitative Developer The Millburn Corporation – Posted by The Millburn Corporation
    New York New York, United States
    15 Jun 2018
  5. Full-Time
    Research Data Analyst @ Arlington, Virginia, U.S. RSG – Posted by patricia.holland@rsginc.com
    Arlington Virginia, United States
    15 Jun 2018
  6. Full-Time
    Data Scientist / Senior Strategic Analyst (Communications & Marketing) Memorial Sloan Kettering Cancer Center – Posted by MSKCC
    New York New York, United States
    30 May 2018
  7. Full-Time
    Market Research Analyst: Mobility for RSG RSG – Posted by patricia.holland@rsginc.com
    Anywhere
    25 May 2018
  8. Full-Time
    Data Scientist @ New Delhi, India Amulozyme Inc. – Posted by Amulozyme
    New Delhi Delhi, India
    25 May 2018
  9. Full-Time
    Data Scientist/Programmer @ Milwaukee, Wisconsin, U.S. ConsensioHealth – Posted by ericadar
    Milwaukee Wisconsin, United States
    25 May 2018
  10. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    Anywhere
    2 May 2018
  11. Full-Time
    Lead Data Scientist @ Washington, District of Columbia, U.S. AFL-CIO – Posted by carterkalchik
    Washington District of Columbia, United States
    27 Apr 2018

 

In  R-users.com  you can see  all  the R jobs that are currently available.

R-users Resumes

R-users also has a  resume section  which features CVs from over 300 R users. You can  submit your resume  (as a “job seeker”) or  browse the resumes  for free.

(you may also look at  previous R jobs posts ).


RStudio Connect v1.6.4

Tue, 06/19/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio Connect version 1.6.4 is now available!

There are a few breaking changes and a handful of new features that are highlighted below.
We encourage you to upgrade as soon as possible!

Breaking

Please take note of important breaking changes before upgrading.

Pandoc 2

RStudio Connect includes Pandoc 1 and will now also include Pandoc 2. Admins do
not need to install either.

If you have deployed content with rmarkdown version 1.9 or higher, then that
content will now use Pandoc 2 at runtime. This brings in several bug fixes and
enables some new functionality, but does introduce some backwards
incompatibilities. To protect older versions of rmarkdown, Pandoc 1 will still
be used for content deployed with any rmarkdown version prior to 1.9. Content
not using the rmarkdown package will have Pandoc 2 available.

Pandoc is dynamically made available to content when it is executed, so content
using the newer version of rmarkdown will see Pandoc 2 immediately upon
upgrading RStudio Connect, whether or not you have updated the content recently.
The types of backwards incompatibilities we expect are issues like minor
white-space rendering differences.

R Markdown Rendering

The R Markdown rendering environment has been updated, which will break a
certain class of R Markdown documents. No action is needed for the majority of
R Markdown documents. Publishers will need to rewrite R Markdown documents that
depended on locally preserving and storing state in between renderings.

The update isolates renderings and protects against clashes caused by concurrent
writes, but also means that files written to the local directory during a render
will not be present or available the next time that the report is rendered.

For example, a report that writes a CSV file to disk on day 1 at a local
location, write.csv("data.csv"), and then on day 2 reads the same CSV with
read.csv("data.csv"), will no longer work. Publishers should refactor this
type of R Markdown document to write data to a database or a shared directory
that is not sandboxed, for instance to /app-data/data.csv.
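As an illustration of that refactor (a sketch only: dat is a placeholder data frame, and /app-data is the example shared path from these notes, which must exist and be writable on your server):

# before: state written next to the report, which is lost between sandboxed renders
write.csv(dat, "data.csv")
dat <- read.csv("data.csv")

# after: write to a shared, non-sandboxed location instead
write.csv(dat, "/app-data/data.csv")
dat <- read.csv("/app-data/data.csv")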

New Features

File Download

When a user accesses a Microsoft Word file or some other file type that is not
rendered in the browser, Connect previously downloaded the content immediately.
We have added a download page that simplifies the presentation of
browser-unfriendly file types.

Content Filtering

The RStudio Connect Dashboard now includes interactive labels for tag filters in
the content listing view. This simplifies keeping track of complex searches,
especially when returning to the Dashboard with saved filter state.

Log Download

The Connect UI truncates log files to show the latest output. However, when
someone downloads log files, the downloaded file is no longer truncated. This
makes it easier for a developer to inspect asset behavior with the full log file
available on Connect.

User Management

Connect now allows administrators to filter the users list by multiple account
statuses. The last day that each user was active is now displayed along with the
user list.

Upgrade Planning

Besides the breaking changes above, there are no special precautions to be aware of
when upgrading from v1.6.2 to v1.6.4. You can expect the installation and startup
of v1.6.4 to be complete in under a minute.

If you’re upgrading from a release older than v1.6.2, be sure to consider the
“Upgrade Planning” notes from the intervening releases, as well.

If you haven't yet had a chance to download and try RStudio Connect, we
encourage you to do so.
RStudio Connect is the best way to share all the work that you do in R (Shiny
apps, R Markdown documents, plots, dashboards, Plumber APIs, etc.) with
collaborators, colleagues, or customers.

You can find more details or download a 45-day evaluation of the product at
https://www.rstudio.com/products/connect/.
Additional resources can be found below.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Chat with the rOpenSci team at upcoming meetings

Tue, 06/19/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

You can find members of the rOpenSci team at various meetings and workshops around the world. Come say 'hi', learn how our software packages can enable your research, hear about our process for open peer software review and onboarding, find out how you can get connected with the community, or tell us how we can help you do open and reproducible research.

Where's rOpenSci?

When | Who | Where | What
June 23, 2018 | Maëlle Salmon | Cardiff, UK | satRday Cardiff
June 27-28, 2018 | Scott Chamberlain | Portland, OR | Bioinformatics Open Source Conference 2018 (BOSC)
July 4-6, 2018 | Maëlle Salmon | Rennes, FR | French R conference
July 10-13, 2018 | Jenny Bryan | Brisbane, AU | UseR!
July 28-Aug 2, 2018 | Jenny Bryan | Vancouver, CA | Joint Statistical Meetings (JSM)
Aug 6-10, 2018 | Carl Boettiger, Dan Sholler | New Orleans, LA | Ecological Society of America (ESA)
Aug 15-16, 2018 | Stefanie Butland | Cambridge, MA | R / Pharma
Aug 27-30, 2018 | Scott Chamberlain | Dunedin, NZ | Biodiversity Information Standards (TDWG)
Sep 3-7, 2018 | Jenny Bryan | Buenos Aires, AR | LatinR
Sep 11, 2018 | Maëlle Salmon | Radolfzell, GE | AniMove
Sep 12-14, 2018 | Jeroen Ooms | The Hague, NL | Use of R in Official Statistics (uRos2018)
Oct 26, 2018 | Nick Tierney (representing rOpenSci) | Seoul, KR | R User Conference in Korea 2018

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R vs Python: Image Classification with Keras

Tue, 06/19/2018 - 01:15

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Libraries for calling R from Python, or Python from R, have existed for years. Despite that, and despite the recent announcement of the Ursa Labs foundation by Wes McKinney, who is aiming to join forces with the RStudio foundation (Hadley Wickham in particular – find more here) to improve data scientists' workflows and unify libraries to be used not only in Python but in any programming language used by data scientists, some data professionals are still very strict about the language to be used for ANN models, limiting their dev environment exclusively to Python.

As a continuation of my R vs. Python comparison, I decided to test the performance of both languages in terms of the time required to train a convolutional neural network (CNN) based model for image recognition. As the starting point, I took the blog post by Dr. Shirin Glander on how easy it is to build a CNN model in R using Keras.

A few words about Keras. It is a Python library for artificial neural network ML models which provides a high-level frontend to various deep learning frameworks, with Tensorflow being the default one.
Keras has many pros, with the following among them:

  • Easy to build complex models in just a few lines of code => perfect for a dev cycle where you quickly experiment and check your ideas
  • Code recycling: one can easily swap the backend framework (let's say from CNTK to Tensorflow or vice versa) => DRY principle
  • Seamless use of GPUs => perfect for fast model tuning and experimenting

Since Keras is written in Python, it may be a natural choice for your dev environment to use Python. And that was the case until about a year ago, when RStudio founder J.J. Allaire announced the release of the Keras library for R in May 2017. I consider this to be a turning point for data scientists; now we can be more flexible with the dev environment and deliver results more efficiently, with the opportunity to extend existing solutions we may have written in R.

This brings me to the point of this post.
My hypothesis is that when it comes to ANN model building with Keras, Python is not a must, and depending on your client's request or tech stack, R can be used without limitations and with similar efficiency.

Image Classification with Keras

In order to test my hypothesis, I am going to perform image classification using the fruit images data from kaggle and train a CNN model with four hidden layers: two 2D convolutional layers, one pooling layer and one dense layer. RMSProp is used as the optimizer.

Tech stack

Hardware:
CPU: Intel Core i7-7700HQ: 4 cores (8 threads), 2800 – 3800 (Boost) MHz core clock
GPU: Nvidia Geforce GTX 1050 Ti Mobile: 4Gb vRAM, 1493 – 1620 (Boost) MHz core clock
RAM: 16 Gb

Software:
OS: Linux Ubuntu 16.04
R: ver. 3.4.4
Python: ver. 3.6.3
Keras: ver. 2.2
Tensorflow: ver. 1.7.0
CUDA: ver. 9.0 (note that the current tensorflow version supports ver. 9.0 while the up-to-date version of cuda is 9.2)
cuDNN: ver. 7.0.5 (note that the current tensorflow version supports ver. 7.0 while the up-to-date version of cuDNN is 7.1)

Code

The R and Python code snippets used for CNN model building are presented below. Thanks to the fruitful collaboration between F. Chollet and J.J. Allaire, the logic and function names in R closely mirror the Python ones.

R ## Courtesy: Dr. Shirin Glander. Code source: https://shirinsplayground.netlify.com/2018/06/keras_fruits/ library(keras) start <- Sys.time() fruit_list <- c("Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine", "Mandarine", "Orange", "Limes", "Lemon", "Peach", "Plum", "Raspberry", "Strawberry", "Pineapple", "Pomegranate") # number of output classes (i.e. fruits) output_n <- length(fruit_list) # image size to scale down to (original images are 100 x 100 px) img_width <- 20 img_height <- 20 target_size <- c(img_width, img_height) # RGB = 3 channels channels <- 3 # path to image folders path <- "path/to/folder/with/data" train_image_files_path <- file.path(path, "fruits-360", "Training") valid_image_files_path <- file.path(path, "fruits-360", "Test") # optional data augmentation train_data_gen %>% image_data_generator( rescale = 1/255 ) # Validation data shouldn't be augmented! But it should also be scaled. valid_data_gen <- image_data_generator( rescale = 1/255 ) # training images train_image_array_gen <- flow_images_from_directory(train_image_files_path, train_data_gen, target_size = target_size, class_mode = 'categorical', classes = fruit_list, seed = 42) # validation images valid_image_array_gen <- flow_images_from_directory(valid_image_files_path, valid_data_gen, target_size = target_size, class_mode = 'categorical', classes = fruit_list, seed = 42) ### model definition # number of training samples train_samples <- train_image_array_gen$n # number of validation samples valid_samples <- valid_image_array_gen$n # define batch size and number of epochs batch_size <- 32 epochs <- 10 # initialise model model <- keras_model_sequential() # add layers model %>% layer_conv_2d(filter = 32, kernel_size = c(3,3), padding = 'same', input_shape = c(img_width, img_height, channels)) %>% layer_activation('relu') %>% # Second hidden layer layer_conv_2d(filter = 16, kernel_size = c(3,3), padding = 'same') %>% layer_activation_leaky_relu(0.5) %>% layer_batch_normalization() %>% # Use max pooling layer_max_pooling_2d(pool_size = c(2,2)) %>% layer_dropout(0.25) %>% # Flatten max filtered output into feature vector # and feed into dense layer layer_flatten() %>% layer_dense(100) %>% layer_activation('relu') %>% layer_dropout(0.5) %>% # Outputs from dense layer are projected onto output layer layer_dense(output_n) %>% layer_activation('softmax') # compile model %>% compile( loss = 'categorical_crossentropy', optimizer = optimizer_rmsprop(lr = 0.0001, decay = 1e-6), metrics = 'accuracy' ) # fit hist <- fit_generator( # training data train_image_array_gen, # epochs steps_per_epoch = as.integer(train_samples / batch_size), epochs = epochs, # validation data validation_data = valid_image_array_gen, validation_steps = as.integer(valid_samples / batch_size), # print progress verbose = 2, callbacks = list( # save best model after every epoch callback_model_checkpoint(file.path(path, "fruits_checkpoints.h5"), save_best_only = TRUE), # only needed for visualising with TensorBoard callback_tensorboard(log_dir = file.path(path, "fruits_logs")) ) ) df_out <- hist$metrics %>% {data.frame(acc = .$acc[epochs], val_acc = .$val_acc[epochs], elapsed_time = as.integer(Sys.time()) - as.integer(start))} Python ## Author: D. 
Kisler - adoptation of R code from https://shirinsplayground.netlify.com/2018/06/keras_fruits/ from keras.preprocessing.image import ImageDataGenerator from keras.models import Sequential from keras.layers import (Conv2D, Dense, LeakyReLU, BatchNormalization, MaxPooling2D, Dropout, Flatten) from keras.optimizers import RMSprop from keras.callbacks import ModelCheckpoint, TensorBoard import PIL.Image from datetime import datetime as dt start = dt.now() # fruits categories fruit_list = ["Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine", "Mandarine", "Orange", "Limes", "Lemon", "Peach", "Plum", "Raspberry", "Strawberry", "Pineapple", "Pomegranate"] # number of output classes (i.e. fruits) output_n = len(fruit_list) # image size to scale down to (original images are 100 x 100 px) img_width = 20 img_height = 20 target_size = (img_width, img_height) # image RGB channels number channels = 3 # path to image folders path = "path/to/folder/with/data" train_image_files_path = path + "fruits-360/Training" valid_image_files_path = path + "fruits-360/Test" ## input data augmentation/modification # training images train_data_gen = ImageDataGenerator( rescale = 1./255 ) # validation images valid_data_gen = ImageDataGenerator( rescale = 1./255 ) ## getting data # training images train_image_array_gen = train_data_gen.flow_from_directory(train_image_files_path, target_size = target_size, classes = fruit_list, class_mode = 'categorical', seed = 42) # validation images valid_image_array_gen = valid_data_gen.flow_from_directory(valid_image_files_path, target_size = target_size, classes = fruit_list, class_mode = 'categorical', seed = 42) ## model definition # number of training samples train_samples = train_image_array_gen.n # number of validation samples valid_samples = valid_image_array_gen.n # define batch size and number of epochs batch_size = 32 epochs = 10 # initialise model model = Sequential() # add layers # input layer model.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'same', input_shape = (img_width, img_height, channels), activation = 'relu')) # hiddel conv layer model.add(Conv2D(filters = 16, kernel_size = (3,3), padding = 'same')) model.add(LeakyReLU(.5)) model.add(BatchNormalization()) # using max pooling model.add(MaxPooling2D(pool_size = (2,2))) # randomly switch off 25% of the nodes per epoch step to avoid overfitting model.add(Dropout(.25)) # flatten max filtered output into feature vector model.add(Flatten()) # output features onto a dense layer model.add(Dense(units = 100, activation = 'relu')) # randomly switch off 25% of the nodes per epoch step to avoid overfitting model.add(Dropout(.5)) # output layer with the number of units equal to the number of categories model.add(Dense(units = output_n, activation = 'softmax')) # compile the model model.compile(loss = 'categorical_crossentropy', metrics = ['accuracy'], optimizer = RMSprop(lr = 1e-4, decay = 1e-6)) # train the model hist = model.fit_generator( # training data train_image_array_gen, # epochs steps_per_epoch = train_samples // batch_size, epochs = epochs, # validation data validation_data = valid_image_array_gen, validation_steps = valid_samples // batch_size, # print progress verbose = 2, callbacks = [ # save best model after every epoch ModelCheckpoint("fruits_checkpoints.h5", save_best_only = True), # only needed for visualising with TensorBoard TensorBoard(log_dir = "fruits_logs") ] ) df_out = {'acc': hist.history['acc'][epochs - 1], 'val_acc': hist.history['val_acc'][epochs - 1], 'elapsed_time': 
(dt.now() - start).seconds}

Experiment

The models above were trained 10 times in both R and Python, on GPU and on CPU; the elapsed time and the final accuracy after 10 epochs were measured. The results of the measurements are presented in the plots below (click a plot to be redirected to the interactive plotly version).

From the plots above, one can see that:

  • the accuracy of your model doesn't depend on the language you use to build and train it (the plot shows only training accuracy, but the model doesn't have high variance and the validation accuracy is around 99% as well).
  • even though 10 measurements may not be conclusive, Python reduced the time required to train this CNN model by up to 15%. This is somewhat expected, because R uses Python under the hood when it executes Keras functions.

Let's perform an unpaired t-test, assuming that all our observations are normally distributed.

library(dplyr)
library(data.table)

# fetch the data used to plot graphs
d <- fread('keras_elapsed_time_rvspy.csv')

# unpaired t test:
# t_score = (mean1 - mean2)/sqrt(stdev1^2/n1 + stdev2^2/n2)
d %>%
  group_by(lang, eng) %>%
  summarise(el_mean = mean(elapsed_time),
            el_std  = sd(elapsed_time),
            n       = n()) %>%
  data.frame() %>%
  group_by(eng) %>%
  summarise(t_score = round(diff(el_mean)/sqrt(sum(el_std^2/n)), 2))

eng   t_score
cpu   11.38
gpu    9.64

The t-scores reflect a significant difference between the time required to train the CNN model in R and in Python, as we saw in the plots above.

Summary

Building and training a CNN model in R using Keras is as "easy" as in Python, with the same coding logic and function naming conventions. The final accuracy of your Keras model will depend on the neural net architecture, hyperparameter tuning, training duration, the amount of train/test data, etc., but not on the programming language you use for your DS project. Training a CNN Keras model in Python may, however, be up to 15% faster compared to R.

P.S.

If you would like to know more about Keras and to be able to build models with this awesome library, I recommend you these books:

As well as this Udemy course to start your journey with Keras.

Thanks a lot for your attention! I hope this post will be helpful for an aspiring data scientist to gain an understanding of use cases for different technologies and to avoid being biased when it comes to the selection of tools for DS projects.

    Related Post

    1. Update: Can we predict flu outcome with Machine Learning in R?
    2. Evaluation of Topic Modeling: Topic Coherence
    3. Natural Language Generation with Markovify in Python
    4. Anomaly Detection in R – The Tidy Way
    5. Forecasting with ARIMA – Part I

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    Statistics Sunday: Accessing the YouTube API with tuber

    Mon, 06/18/2018 - 17:16

    (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

    I haven’t had a lot of time to play with this but yesterday, I discovered the tuber R package, which allows you to interact with the YouTube API.

    To use the tuber package, not only do you need to install it in R – you'll need a Google account and will have to authorize 4 APIs through the Developer Console: all 3 YouTube APIs (though the Data API will be doing the heavy lifting) and the Freebase API. Before you authorize the first API, Google will have you create a project to tie the APIs to. Then, you'll find the APIs in the API library to add to this project. Click on each API and on the next screen, select Enable. You'll need to create credentials for each of the YouTube APIs. When asked to identify the type of app that will be accessing the YouTube API, select "Other".

    The tuber package requires two pieces of information from you, which you’ll get when you set up credentials for the OAuth 2.0 Client: client ID and client secret. Once you set those up, you can download them at any time in JSON format by going to the Credentials dashboard and clicking the download button on the far right.

    In R, load the tuber package, then call the yt_oauth function, using the client ID (which should end with something like “apps.googleusercontent.com”) and client secret (a string of letters and numbers). R will launch a browser window to authorize tuber to access the APIs. Once that’s done, you’ll be able to use the tuber package to, for instance, access data about a channel or get information about a video. My plan is to use my Facebook dataset to pull out the head songs I’ve shared and get the video information to generate a dataset of my songs. Look for more on that later. In the meantime, this great post on insightr can give you some details about using the tuber package.
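    A minimal sketch of that flow is below; the credentials and video ID are placeholders, and while yt_oauth() and get_video_details() are tuber functions, check the package documentation for the current argument names.

library(tuber)

client_id     <- "1234567890-abc.apps.googleusercontent.com"  # placeholder
client_secret <- "aBcDeFgH123_placeholder"                    # placeholder

# opens a browser window the first time so you can authorize access
yt_oauth(app_id = client_id, app_secret = client_secret)

# once authorized, pull the metadata for a single video
get_video_details(video_id = "abc123xyz")                     # placeholder video ID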

    [Apologies for the short and late post – I’ve been sick and haven’t had as much time to write recently. Hopefully I’ll get back to normal over the next week.]


    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

    Effectively scaling Shiny in enterprise

    Mon, 06/18/2018 - 16:58

    (This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

    James Blair, RStudio

    Scalability is a hot word these days, and for good reason. As data continues to grow in volume and importance, the ability to reliably access and reason about that data increases in importance. Enterprises expect data analysis and reporting solutions that are robust and allow several hundred, even thousands, of concurrent users while offering up-to-date security options.

    Shiny is a highly flexible and widely used framework for creating web applications using R. It enables data scientists and analysts to create dynamic content that provides straightforward access to their work for those with no working knowledge of R. While Shiny has been around for quite some time, recent introductions to the Shiny ecosystem make Shiny simpler and safer to deploy in an enterprise environment where security and scalability are paramount. These new tools in connection with RStudio Connect provide enterprise grade solutions that make Shiny an even more attractive option for data resource creation.

    Develop and Test

    Most Shiny applications are developed either locally on a personal computer or using an instance of RStudio Server. During development it can be helpful to understand application performance, specifically if there are any concerning bottlenecks. The profvis package provides functions for profiling R code and can profile the performance of Shiny applications. profvis provides a breakdown of code performance and can be useful for identifying potential areas for improving application responsiveness.
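    For instance, a profiling session might look like the following sketch (the app path is a placeholder):

library(profvis)
library(shiny)

# interact with the running app in the browser, then stop it to view the profile
profvis({
  runApp("path/to/app")
})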

    The recently released promises package provides asynchronous capabilities to Shiny applications. Asynchronous programming can be used to improve application responsiveness when several concurrent users are accessing the same application. While there is some overhead involved in creating asynchronous applications, this method can improve application responsiveness.
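    As a tiny sketch of the idea, using promises together with the future package (expensive_query() is a placeholder for slow work; this is not code from the original post):

library(shiny)
library(promises)
library(future)
plan(multisession)

server <- function(input, output, session) {
  output$result <- renderTable({
    # run the slow computation in another R process and return a promise;
    # the promise pipe (%...>%) continues once the result arrives
    future({ expensive_query() }) %...>%
      head(10)
  })
}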

    Once an application is fully developed and ready to be deployed, it’s useful to establish a set of behavioral expectations. These expectations can be used to ensure that future updates to the application don’t break or unexpectedly change behavior. Traditionally most testing of Shiny applications has been done by hand, which is both time consuming and error prone. The new shinytest package provides a clean interface for testing Shiny applications. Once an application is fully developed, a set of tests can be recorded and stored to compare against future application versions. These tests can be run programatically and can even be used with continuous integration (CI) platforms. Robust testing for Shiny applications is a huge step forward in increasing the deployability and dependability of such applications.
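    In practice the shinytest workflow is roughly as follows (the app path is a placeholder):

library(shinytest)

recordTest("path/to/app")   # interactively record the expected app behaviour
testApp("path/to/app")      # replay the recorded tests, e.g. locally or on CI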

    Deploy

    Once an application has been developed and tested to satisfaction, it must be deployed to a production environment in order to provide other users with application access. Production deployment of data resources within an enterprise centers on control. For example, access control and user authentication are important for controlling who has access to the application. Server resource control and monitoring are important for controlling application performance and server stability. These control points enable trustworthy and performant deployment.

    There are a few current solutions for deploying Shiny applications. Shiny Server provides both an open source and professional framework for publishing Shiny applications and making them available to a wide audience. The professional version provides features that are attractive for enterprise deployment, such as user authentication. RStudio Connect is a recent product from RStudio that provides several enhancements to Shiny Server. Specifically, RStudio Connect supports push button deployment and natively handles application dependencies, both of which simplify the deployment process. RStudio Connect also places resource control in the hands of the application developer, which lightens the load on system administrators and allows the developer to tune app performance to align with expectations and company priorities.

    Scale

    In order to be properly leveraged, a deployed application must scale to meet user demand. In some instances, applications will have low concurrent users and will not need any additional help to remain responsive. However, it is often the case in large enterprises that applications are widely distributed and concurrently accessed by several hundred or even thousands of users. RStudio Connect provides the ability to set up a cluster of servers to provide high availability (HA) and load balanced configurations in order to scale applications to meet the needs of concurrent users. Shiny itself has been shown to effectively scale to meet the demands of 10,000 concurrent users!

    As businesses continue searching for ways to efficiently capture and digest growing stores of data, R in connection with Shiny continues to establish itself as a robust and enterprise ready solution for data analysis and reporting.


    To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.
