R-bloggers – R news and tutorials contributed by hundreds of R bloggers

Intro To Time Series Analysis Part 2: Exercises

Wed, 06/20/2018 - 06:38

(This article was first published on R-exercises, and kindly contributed to R-bloggers)


In the exercises below, we will explore time series analysis further. The previous exercise set is here; please follow the series in sequence.
Answers to these exercises are available here.

Exercise 1

Load the AirPassengers data, check its class, and see the start and end of the series.

Exercise 2
Check the cycle of the AirPassengers time series.

Exercise 3

Create a lag plot using gglagplot from the forecast package, and check how the relationship changes as the lag increases.

Exercise 4

Also plot the correlation for each of the lags. You can see that when the lag is above 6 the correlation drops, climbs up again at 12, and drops again at 18.

Exercise 5

Plot a histogram of AirPassengers using gghistogram from the forecast package.

Exercise 6

Use tsdisplay to plot the time series, its autocorrelation and its partial autocorrelation together in the same plot.

Exercise 7

Find the outliers in the time series.
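
For readers who want a quick sanity check before opening the linked answers, the sketch below strings together the forecast-package functions these exercises point to. It is a minimal illustration, not the official solution set.

library(forecast)

data("AirPassengers")
class(AirPassengers)        # "ts"
start(AirPassengers)        # 1949, period 1
end(AirPassengers)          # 1960, period 12
cycle(AirPassengers)        # monthly cycle, 1 to 12

gglagplot(AirPassengers)    # relationship at increasing lags
ggAcf(AirPassengers)        # correlation at each lag
gghistogram(AirPassengers)  # distribution of the series
tsdisplay(AirPassengers)    # series, ACF and PACF in one plot
tsoutliers(AirPassengers)   # candidate outliers and suggested replacements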

Related exercise sets:
  1. 3D plotting exercises
  2. Vector exercises
  3. Bayesian Inference – MCMC Diagnostics using coda : Exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

Reading and analysing log files in the RRD database format

Wed, 06/20/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

I have frequent conversations with R champions and Systems Administrators responsible for R, in which they ask how they can measure and analyze the usage of their servers. Among the many solutions to this problem, one of my favourites is to use an RRD database and RRDtool.

From Wikipedia:

RRDtool (round-robin database tool) aims to handle time series data such as network bandwidth, temperatures or CPU load. The data is stored in a circular buffer based database, thus the system storage footprint remains constant over time.

RRDtool is a library written in C, with implementations that can also be accessed from the Linux command line. This makes it convenient for system development, but makes it difficult for R users to extract and analyze this data.

I am pleased to announce that I’ve been working on the rrd R package to import RRD files directly into tibble objects, thus making it easy to analyze your metrics.

As an aside, the RStudio Pro products (specifically RStudio Server Pro and RStudio Connect) also make use of RRD to store metrics – more about this later.

Understanding the RRD format as an R user

The name RRD is an initialism of Round Robin Database. The “round robin” refers to the fact that the database is always fixed in size, and as a new entry enters the database, the oldest entry is discarded. In practical terms, the database collects data for a fixed period of time, and information that is older than the threshold gets removed.

A second quality of RRD databases is that each datum is stored in different “consolidation data points”, where every data point is an aggregation over time. For example, a data point can represent an average value for the time period, or a maximum over the period. Typical consolidation functions include average, min and max.

The third quality is that every RRD database file typically consists of multiple archives. Each archive measures data for a different time period. For instance, the archives can capture data for intervals of 10 seconds, 30 seconds, 1 minute or 5 minutes.

As an example, here is a description of an RRD file that originated in RStudio Connect:

describe_rrd("rrd_cpu_0") #> A RRD file with 10 RRA arrays and step size 60 #> [1] AVERAGE_60 (43200 rows) #> [2] AVERAGE_300 (25920 rows) #> [3] MIN_300 (25920 rows) #> [4] MAX_300 (25920 rows) #> [5] AVERAGE_3600 (8760 rows) #> [6] MIN_3600 (8760 rows) #> [7] MAX_3600 (8760 rows) #> [8] AVERAGE_86400 (1825 rows) #> [9] MIN_86400 (1825 rows) #> [10] MAX_86400 (1825 rows)

This RRD file contains data for the properties of CPU 0 of the system. In this example, the first RRA archive contains averaged metrics for one minute (60s) intervals, while the second RRA measures the same metric, but averaged over five minutes. The same metrics are also available for intervals of one hour and one day.

Notice also that every archive has a different number of rows, representing a different historical period for which the data is kept. For example, the per-minute data AVERAGE_60 is retained for 43,200 periods (30 days), while the daily data MAX_86400 is retained for 1,825 periods (5 years).
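
A quick back-of-the-envelope check of those retention periods, multiplying rows by step size (in seconds) and converting to days:

# rows * step (seconds) / 86400 seconds per day
43200 * 60    / 86400   # AVERAGE_60:  30 days
25920 * 300   / 86400   # *_300:       90 days
8760  * 3600  / 86400   # *_3600:      365 days (1 year)
1825  * 86400 / 86400   # *_86400:     1825 days (5 years)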

If you want to know more, please read the excellent introduction tutorial to RRD database.

Introducing the rrd package

Until recently, it wasn’t easy to import RRD files into R. But I was pleased to discover that a Google Summer of Code 2014 project created a proof-of-concept R package to read these files. The author of this package is Plamen Dimitrov, who published the code on GitHub and also wrote an explanatory blog post.

Because I had to provide some suggestions to our customers, I decided to update the package, provide some example code, and generally improve the reliability.

The result is not yet on CRAN, but you can install the development version of the package from GitHub.

Installing the package

To build the package from source, you first need to install librrd. Installing RRDtool from your Linux package manager will usually also install this library.

Using Ubuntu:

sudo apt-get install rrdtool librrd-dev

Using RHEL / CentOS:

sudo yum install rrdtool rrdtool-devel

Once you have the system requirements in place, you can install the development version of the R package from GitHub using:

# install.packages("devtools")
devtools::install_github("andrie/rrd")

Limitations

The package is not yet available for Windows.

Using the package

Once you’ve installed the package, you can start to use it. The package itself contains some built-in RRD files, so you should be able to run the following code directly.

library(rrd)

Describing the contents of an RRD

To describe the contents of an RRD file, use describe_rrd(). This function reports information about the names of each archive (RRA) file, the consolidation function, and the number of observations:

rrd_cpu_0 <- system.file("extdata/cpu-0.rrd", package = "rrd")
describe_rrd(rrd_cpu_0)
#> A RRD file with 10 RRA arrays and step size 60
#> [1] AVERAGE_60 (43200 rows)
#> [2] AVERAGE_300 (25920 rows)
#> [3] MIN_300 (25920 rows)
#> [4] MAX_300 (25920 rows)
#> [5] AVERAGE_3600 (8760 rows)
#> [6] MIN_3600 (8760 rows)
#> [7] MAX_3600 (8760 rows)
#> [8] AVERAGE_86400 (1825 rows)
#> [9] MIN_86400 (1825 rows)
#> [10] MAX_86400 (1825 rows)

Reading an entire RRD file

To read an entire RRD file, i.e. all of the RRA archives, use read_rrd(). This returns a list of tibble objects:

cpu <- read_rrd(rrd_cpu_0)
str(cpu, max.level = 1)
#> List of 10
#>  $ AVERAGE60   :Classes 'tbl_df', 'tbl' and 'data.frame': 43199 obs. of 9 variables:
#>  $ AVERAGE300  :Classes 'tbl_df', 'tbl' and 'data.frame': 25919 obs. of 9 variables:
#>  $ MIN300      :Classes 'tbl_df', 'tbl' and 'data.frame': 25919 obs. of 9 variables:
#>  $ MAX300      :Classes 'tbl_df', 'tbl' and 'data.frame': 25919 obs. of 9 variables:
#>  $ AVERAGE3600 :Classes 'tbl_df', 'tbl' and 'data.frame': 8759 obs. of 9 variables:
#>  $ MIN3600     :Classes 'tbl_df', 'tbl' and 'data.frame': 8759 obs. of 9 variables:
#>  $ MAX3600     :Classes 'tbl_df', 'tbl' and 'data.frame': 8759 obs. of 9 variables:
#>  $ AVERAGE86400:Classes 'tbl_df', 'tbl' and 'data.frame': 1824 obs. of 9 variables:
#>  $ MIN86400    :Classes 'tbl_df', 'tbl' and 'data.frame': 1824 obs. of 9 variables:
#>  $ MAX86400    :Classes 'tbl_df', 'tbl' and 'data.frame': 1824 obs. of 9 variables:

Since the resulting object is a list of tibble objects, you can easily use R functions to work with an individual archive:

names(cpu)
#>  [1] "AVERAGE60"    "AVERAGE300"   "MIN300"       "MAX300"
#>  [5] "AVERAGE3600"  "MIN3600"      "MAX3600"      "AVERAGE86400"
#>  [9] "MIN86400"     "MAX86400"

To inspect the contents of the first archive (AVERAGE60), simply print the object – since it’s a tibble, you get 10 lines of output.

For example, the CPU metrics contain a time stamp and values for average user and sys usage, as well as the nice value, idle time, interrupt requests and soft interrupt requests:

cpu[[1]]
#> # A tibble: 43,199 x 9
#>    timestamp              user     sys  nice  idle  wait   irq softirq
#>  1 2018-04-02 12:24:00 0.0104  0.00811     0 0.981     0     0       0
#>  2 2018-04-02 12:25:00 0.0126  0.00630     0 0.979     0     0       0
#>  3 2018-04-02 12:26:00 0.0159  0.00808     0 0.976     0     0       0
#>  4 2018-04-02 12:27:00 0.00853 0.00647     0 0.985     0     0       0
#>  5 2018-04-02 12:28:00 0.0122  0.00999     0 0.978     0     0       0
#>  6 2018-04-02 12:29:00 0.0106  0.00604     0 0.983     0     0       0
#>  7 2018-04-02 12:30:00 0.0147  0.00427     0 0.981     0     0       0
#>  8 2018-04-02 12:31:00 0.0193  0.00767     0 0.971     0     0       0
#>  9 2018-04-02 12:32:00 0.0300  0.0274      0 0.943     0     0       0
#> 10 2018-04-02 12:33:00 0.0162  0.00617     0 0.978     0     0       0
#> # ... with 43,189 more rows, and 1 more variable: stolen

Since the data is in tibble format, you can easily extract specific data, e.g., the last values of the system usage:

tail(cpu$AVERAGE60$sys)
#> [1] 0.0014390667 0.0020080000 0.0005689333 0.0000000000 0.0014390667
#> [6] 0.0005689333

Reading only a single archive

The underlying code in the rrd package is written in C, and is therefore blazingly fast. Reading an entire RRD file takes a fraction of a second, but sometimes you may want to extract a specific RRA archive immediately.

To read a single RRA archive from an RRD file, use read_rra(). To use this function, you must specify several arguments that define the specific data to retrieve. This includes the consolidation function (e.g., "AVERAGE") and time step (e.g., 60). You must also specify either the start time or the number of steps, n_steps.

In this example, I extract the average for one-minute periods (step = 60) for one day (n_steps = 24 * 60):

end_time <- as.POSIXct("2018-05-02") # timestamp with data in example
avg_60 <- read_rra(rrd_cpu_0, cf = "AVERAGE", step = 60,
                   n_steps = 24 * 60, end = end_time)
avg_60
#> # A tibble: 1,440 x 9
#>    timestamp              user      sys  nice  idle     wait   irq softirq
#>  1 2018-05-01 00:01:00 0.00458 0.00201      0 0.992 0            0       0
#>  2 2018-05-01 00:02:00 0.00258 0.000570     0 0.996 0            0       0
#>  3 2018-05-01 00:03:00 0.00633 0.00144      0 0.992 0            0       0
#>  4 2018-05-01 00:04:00 0.00515 0.00201      0 0.991 0            0       0
#>  5 2018-05-01 00:05:00 0.00402 0.000569     0 0.995 0            0       0
#>  6 2018-05-01 00:06:00 0.00689 0.00144      0 0.992 0            0       0
#>  7 2018-05-01 00:07:00 0.00371 0.00201      0 0.993 0.00144      0       0
#>  8 2018-05-01 00:08:00 0.00488 0.00201      0 0.993 0.000569     0       0
#>  9 2018-05-01 00:09:00 0.00748 0.000568     0 0.992 0            0       0
#> 10 2018-05-01 00:10:00 0.00516 0            0 0.995 0            0       0
#> # ... with 1,430 more rows, and 1 more variable: stolen

Plotting the results

The original RRDTool library for Linux contains some functions to easily plot the RRD data, a feature that distinguishes RRD from many other databases.

However, R already has very rich plotting capability, so the rrd R package doesn’t expose any specific plotting functions.

For example, you can easily plot these data using your favourite packages, like ggplot2:

library(ggplot2)

ggplot(avg_60, aes(x = timestamp, y = user)) +
  geom_line() +
  stat_smooth(method = "loess", span = 0.125, se = FALSE) +
  ggtitle("CPU0 usage, data read from RRD file")

Getting the RRD files from RStudio Server Pro and RStudio Connect

As I mentioned in the introduction, both RStudio Server Pro and RStudio Connect use RRD to store metrics. In fact, these metrics are used to power the administration dashboard of these products.

This means that often the easiest solution is simply to enable the admin dashboard and view the information there.

However, sometimes R users and system administrators have a need to analyze the metrics in more detail, so in this section, I discuss where you can find the files for analysis.

The administration guides for these products explain where to find the metrics files:

  • The admin guide for RStudio Server Pro discusses metrics in this in section 8.2 Monitoring Configuration.
    • By default, the metrics are stored at /var/lib/rstudio-server/monitor/rrd, although this path is configurable by the server administrator
    • RStudio Server Pro stores system metrics as well as user metrics
  • RStudio Connect discusses metrics in section 16.1 Historical Metrics
    • The default path for metrics logs is /var/lib/rstudio-connect/metrics, though again, this is configurable by the server administrator.
rsc <- "/var/lib/rstudio-connect/metrics/rrd"
rsp <- "/var/lib/rstudio-server/monitor/rrd"

If you want to analyze these files, it is best to copy the files to a different location. The security and permissions on both products are configured in such a way that it’s not possible to read the files while they are in the original folder. Therefore, we recommend that you copy the files to a different location and do the analysis there.
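
One way to do that copy from R itself is sketched below; the source and destination paths are only examples and should be adjusted to your server, and depending on file permissions you may need to perform the copy as root instead (for example with sudo cp).

src  <- "/var/lib/rstudio-connect/metrics"   # or the RStudio Server Pro path above
dest <- "~/rrd-analysis"
dir.create(dest, showWarnings = FALSE)
file.copy(list.files(src, pattern = "\\.rrd$", full.names = TRUE), dest)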

Warning about using the RStudio Connect RRD files:

The RStudio Connect team is actively planning to change the way content-level metrics are stored, so data related to shiny apps, markdown reports, etc. will likely look different in a future release.

To be clear:

  • The schemas might change
  • RStudio Connect may stop tracking some metrics
  • It’s also possible that the entire mechanism might change

The only guarantees that we make in RStudio Connect are around the data that we actually surface:

  • server-wide user counts
  • RAM
  • CPU data

This means that if you analyze RRD files, you should be aware that the entire mechanism for storing metrics might change in future.

Additional caveat
  • The metrics collection process runs in a sandboxed environment, and it is not possible to publish a report to RStudio Connect that reads the metrics directly. If you want to automate a process to read the Connect metrics, you will have to set up a cron job to copy the files to a different location, and run the analysis against the copied files. (Also, re-read the warning that everything might change!)
Example

In the following worked example, I copied some rrd files that originated in RStudio Connect to a different location on disk, and stored this in a config file.

First, list the file names:

config <- config::get()
rrd_location <- config$rrd_location

rrd_location %>%
  list.files() %>%
  tail(20)
##  [1] "content-978.rrd"      "content-986.rrd"      "content-98.rrd"
##  [4] "content-990.rrd"      "content-995.rrd"      "content-998.rrd"
##  [7] "cpu-0.rrd"            "cpu-1.rrd"            "cpu-2.rrd"
## [10] "cpu-3.rrd"            "license-users.rrd"    "network-eth0.rrd"
## [13] "network-lo.rrd"       "system-CPU.rrd"       "system.cpu.usage.rrd"
## [16] "system.load.rrd"      "system.memory.rrd"    "system-RAM.rrd"
## [19] "system.swap.rrd"      "system-SWAP.rrd"

The file names indicate that RStudio Connect collects metrics for the system (CPU, RAM, etc.), as well as for every piece of published content.

To look at the system load, first describe the contents of the "system.load.rrd" file:

sys_load <- file.path(rrd_location, "system.load.rrd")
describe_rrd(sys_load)
## A RRD file with 10 RRA arrays and step size 60
## [1] AVERAGE_60 (43200 rows)
## [2] AVERAGE_300 (25920 rows)
## [3] MIN_300 (25920 rows)
## [4] MAX_300 (25920 rows)
## [5] AVERAGE_3600 (8760 rows)
## [6] MIN_3600 (8760 rows)
## [7] MAX_3600 (8760 rows)
## [8] AVERAGE_86400 (1825 rows)
## [9] MIN_86400 (1825 rows)
## [10] MAX_86400 (1825 rows)

This output tells you that metrics are collected every 60 seconds (one minute) and stored at selected aggregation intervals (1 minute, 5 minutes, 1 hour and 1 day). You can also tell that the consolidation functions are average, min and max.

To extract one month of data averaged at 5-minute intervals, use step = 300:

dat <- read_rra(sys_load, cf = "AVERAGE", step = 300L,
                n_steps = (3600 / 300) * 24 * 30)
dat
## # A tibble: 8,640 x 4
##    timestamp            `1min`  `5min` `15min`
##  1 2018-05-10 19:10:00 0.0254  0.0214   0.05
##  2 2018-05-10 19:15:00 0.263   0.153    0.0920
##  3 2018-05-10 19:20:00 0.0510  0.117    0.101
##  4 2018-05-10 19:25:00 0.00137 0.0509   0.0781
##  5 2018-05-10 19:30:00 0       0.0168   0.0534
##  6 2018-05-10 19:35:00 0       0.01     0.05
##  7 2018-05-10 19:40:00 0.0146  0.0166   0.05
##  8 2018-05-10 19:45:00 0.00147 0.0115   0.05
##  9 2018-05-10 19:50:00 0.0381  0.0306   0.05
## 10 2018-05-10 19:55:00 0.0105  0.018    0.05
## # ... with 8,630 more rows

It is very easy to plot this using your preferred plotting package, e.g., ggplot2:

ggplot(dat, aes(x = timestamp, y = `5min`)) +
  geom_line() +
  stat_smooth(method = "loess", span = 0.125)

Conclusion

The rrd package, available from GitHub, makes it very easy to read metrics stored in the RRD database format. Reading an archive is very quick, and your resulting data is a tibble for an individual archive, or a list of tibbles for the entire file.

This makes it easy to analyze your data using the tidyverse packages, and to plot the information.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

a chain of collapses

Wed, 06/20/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A quick Riddler resolution during a committee meeting (!) of a short riddle: 36 houses stand in a row and collapse at times t=1,2,…,36. In addition, once a house collapses, its neighbours, if still standing, collapse at the next time unit. What are the shortest and longest lifespans of this row?

Since a house with index i would collapse on its own by time i, the longest lifespan is 36, which can be achieved even with the extra rule when the collapsing times are perfectly ordered. For the shortest lifespan, I ran a short R code implementing the rules and monitoring the minimum, which found 7 as the minimal number over 10⁵ draws. However, with an optimal ordering, each time unit adds one house collapsing on its own plus one or two neighbours of the most recently collapsed ones, leading to a maximal number of collapsed houses after k time units of

1 + 2(k−1) + 1 + 2(k−2) + … = k + k(k−1) = k²

which happens to be equal to 36 for k=6. (Which was also obtained in 10⁶ draws!) This also gives the solution for any value of k.
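
The original code is not shown in the post; a minimal simulation along the lines described above might look like this (my own sketch, not Xi'an's code):

row_lifespan <- function(n = 36) {
  own  <- sample(n)            # house i collapses on its own at time own[i]
  down <- rep(FALSE, n)
  for (t in 1:n) {
    newly  <- !down & own <= t                                            # scheduled collapses
    spread <- !down & (c(FALSE, head(down, -1)) | c(tail(down, -1), FALSE))  # neighbours of already-collapsed houses
    down   <- down | newly | spread
    if (all(down)) return(t)
  }
  n
}

min(replicate(1e5, row_lifespan()))  # the post reports 7 at 10^5 draws and 6 at 10^6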


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

Searching For Unicorns (And Other NBA Myths)

Tue, 06/19/2018 - 23:27

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

A visual exploration of the 2017-2018 NBA landscape

The modern NBA landscape is rapidly changing.

Steph Curry has redefined the lead guard prototype with jaw-dropping shooting range coupled with unprecedented scoring efficiency for a guard. The likes of Marc Gasol, Al Horford and Kristaps Porzingis are paving the way for a younger generation of modern big men as defensive rim protectors who can space the floor on offense as three-point threats. Then there are the new-wave facilitators – LeBron James, Draymond Green, Ben Simmons – enormous athletes who can guard any position on defense and push the ball down court in transition.

For fans, analysts and NBA front offices alike, these are the prototypical players that make our mouths water. So what do they have in common?

For one, they are elite statistical outliers in at least two categories, and this serves as the primary motivation for my exploratory analysis tool: To identify NBA players in the 2017-2018 season that exhibited unique skill sets based on statistical correlations.

To access the tool, click here.

The Data

The tool uses box score data from the 2017-2018 NBA season (source: Kaggle) and focuses on the following categories: Points, rebounds, assists, turnovers, steals, blocks, 3-pointers made, FG% and FT%. I also used Dean Oliver’s formula for estimating a player’s total possessions (outlined here).

To assess all players on an equal scale, I normalized the box score data for each player. For ease of interpretability, I chose “per 36 minute” normalization, which takes a player’s per-minute production and extrapolates it to 36 minutes of playing time. In this way, the values displayed in the scatterplot represent each player’s production per 36 minutes of playing time.

To ensure that the per-36 minute calculations did not generate any outliers due to small statistical samples, I removed all players with fewer than nine games in the season, as well as players who averaged three minutes or less per game.
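
As a rough illustration of that normalization and filtering step, the sketch below assumes a box_scores data frame with one row per player-game; the data frame and column names are hypothetical, not the app's actual code.

library(dplyr)

per36 <- box_scores %>%
  group_by(player, position) %>%
  summarise(games   = n_distinct(game_id),
            minutes = sum(minutes),
            pts = sum(pts), reb = sum(reb), fg3m = sum(fg3m)) %>%
  ungroup() %>%
  filter(games >= 9, minutes / games > 3) %>%   # drop small statistical samples
  mutate(pts_36  = pts  / minutes * 36,
         reb_36  = reb  / minutes * 36,
         fg3m_36 = fg3m / minutes * 36)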

Using the tool: A demonstration

The tool is a Shiny application intended to be used for exploratory analysis and player discovery. To most effectively understand and interpret the charts, you can follow these steps:

Step 1: Assess the correlation matrix

The correlation matrix uses the Pearson correlation coefficient as a reference to guide your use of the dynamic scatter plot. Each dot represents the league-wide correlation between two statistical categories.

The color scale indicates the direction of the correlation. That is, blue dots represent negatively correlated statistics, and red dots positively correlated statistics. The size of the dot indicates the magnitude of the correlation – that is, how strong the relationship is between the two statistics across the entire league. Large dots represent high correlation between two statistics, while small dots indicate that the two statistics do not have a linear relationship.
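
Under the hood, such a matrix boils down to pairwise Pearson coefficients, which in R is a one-liner (again using the hypothetical per36 data frame from the sketch above):

cor(per36[, c("pts_36", "reb_36", "fg3m_36")], method = "pearson")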

Step 2: Select two statistics to plot for exploration

We can get a flavor of these relationships as we move to the scatterplot. (Follow along using the app.) For the purpose of identifying truly unique players, let’s look at a pairing of negatively correlated statistics with high magnitude (i.e. a blue, large dot): 3-pointers made (“3PM”) vs. Field goal percentage (“FG%”).

Step 3: Explore

It makes sense intuitively why these are negatively correlated – a player making a lot of threes is also attempting a lot of long-distance, low-percentage shots. Given the value of floor-spacing in today’s NBA, a high-volume 3-point shooter who is also an efficient scorer possesses unique abilities. So, let’s select FG% for our x-axis and 3PM for our y-axis (using the dropdowns in the menu bar), and see what we find…

The two dotted lines within the scatterplot represent the 50th percentile for each statistic. In the case of FG% vs. 3PM, we turn to the upper right quadrant, which represents the players who are above average in both FG% and 3-pointers made. To focus our analysis, we can zoom in on this quadrant for a close look.

To zoom, simply select and drag across the plotted space you want to zoom in to, in this case the upper right quadrant. You can also filter by position by simply selecting specific positions in the legend.

Scroll over a point to see who the player is, as well as their per-36 statistics. At the top of our plot, no surprises here: Steph Curry. While his 4.7 threes per 36 minutes leads the league, what truly separates him is his 50% efficiency from the field. But we already know that Steph is an exceptional anomaly, so who else can we find?

While several superstars can also be found at the top of our plot – Kevin Durant, Kyrie Irving, and Klay Thompson stand out – we have quite a few role players up there as well: Kyle Korver, J.J. Redick, Kelly Olynyk and Joe Ingles. These are quality reserves who may not wow us with their overall statistical profiles, but play a crucial, high-value role on teams by spacing the floor without sacrificing scoring efficiency.

Step 4: Repeat

I recommend starting your exploration on the blue-dots of the correlation matrix – blocks vs. threes, rebounds vs. threes, assists vs. blocks, for example. These are where you can identify players with the most unique skill pairings across the league. (Note: When plotting turnovers, be sure to focus below the median line, as it is better to have low turnovers than high.)

For fantasy basketball enthusiasts, this is a great tool to identify players with specific statistical strengths to construct a well-balanced team, or complement your roster core.

Conclusion

I really enjoyed building this tool and exploring its visualization of the NBA landscape. From an interpretability standpoint, however, it is not ideal that we can only focus on one player at a time. To improve on this, I plan to include an additional table that provides a deeper look at players that fall above the median line for both X and Y statistics. In this way, we can further analyze these players across a larger range of performance variables.

 


To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.

Dear data scientists, how to ease your job

Tue, 06/19/2018 - 11:00

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

You have got the best job of the 21st century. Like a treasure hunter, you search for a data treasure chest while sailing through data lakes. In many companies you are a digital maker – armed with skills that turn you into a modern-day polymath and a toolset so holistic and complex that even astronauts would feel dizzy.

However, there is still something you carry every day to work: the burden of high expectations and demands of others – whether it is a customer, your supervisor or colleagues. Being a data scientist is a dream job, but also very stressful as it requires creative approaches and new solutions every day.  

Would it not be great if there were something that made your daily work easier?

Many requirements and your need for a solution

R, Python and Julia – does your department work with several programming languages? Would it not be great to have a solution that supports various languages and thus encourages what is so essential for data analysis: teamwork? Additionally, connectivity packages could enable you to work in familiar surroundings such as RStudio.

Performance is everything: you create the most complex neural networks and process data volumes that make big data really deserve the word “big”. Imagine you could transfer your analyses into a controlled and scalable environment where your scripts not only run reliably, but also benefit from optimal load distribution. All this, including horizontal scaling and improved system performance.

Data sources, users and analysis scripts – you are in search of a tool that can bring all components together in a bundled analysis project to manage resources more efficiently, raise transparency and develop a compliant workflow. The best possible solution is role management that can easily be extended to the specialist department.

Time is money. Of course, that also applies to your working time. You need a solution that can free you from time-consuming routine tasks, such as monitoring and parameterizing script execution, and that can run analyses via time-based triggers. Additionally, dynamic load distribution and the logging of script output ensure the operationalization of script execution in a business-critical environment.

Keep an eye on the big picture: your performant analysis will not bring you the deserved satisfaction if you are not able to embed it into the existing IT landscape. A tool with consistent interfaces to integrate your analysis scripts via REST APIs neatly into any existing application would be perfect to ease your daily workload.

eoda | data science core: a solution from data scientists to data scientists

Imagine a data science tool that incorporates the experience to leverage your potential in bringing data science to the enterprise environment.

Based on many years of experience from analysis projects and knowledge of your daily challenges, we have developed a solution for you: the eoda | data science core. You can manage your analysis projects in a flexible, performant and secure way. It gives you the space you need to deal with expectations and keep the love for the profession – as it is, after all, the best job in the world.

The eoda | data science environment provides a framework for creating and managing different containers with several setups for various applications. In the eoda | data science environment scripts can access common intermediate results despite different languages like Rstats, Python or Julia and be managed in one data science project.

The eoda | data science core is the first main component of the eoda | data science environment. This will be complemented with the second component, the eoda | data science portal. How does the portal enable collaborative working, explorative analyses and a user-friendly visualization of results? Read all about it in the next article and find out.

For more information: www.data-science-environment.com


To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.

11 Jobs for R users from around the world (2018-06-19)

Tue, 06/19/2018 - 07:31
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs

 

All New R Jobs

 

  1. Full-Time
    Research Fellow UC Hastings Institute for Innovation Law – Posted by feldmanr
    San Francisco California, United States
    19 Jun 2018
  2. Full-Time
    Technical Support Engineer at RStudio RStudio – Posted by agadrow
    Anywhere
    19 Jun 2018
  3. Full-Time
    postdoc in psychiatry: machine learning in human genomics University of Iowa – Posted by michaelsonlab
    Anywhere
    18 Jun 2018
  4. Full-Time
    Lead Quantitative Developer The Millburn Corporation – Posted by The Millburn Corporation
    New York New York, United States
    15 Jun 2018
  5. Full-Time
    Research Data Analyst @ Arlington, Virginia, U.S. RSG – Posted by patricia.holland@rsginc.com
    Arlington Virginia, United States
    15 Jun 2018
  6. Full-Time
    Data Scientist / Senior Strategic Analyst (Communications & Marketing) Memorial Sloan Kettering Cancer Center – Posted by MSKCC
    New York New York, United States
    30 May 2018
  7. Full-Time
    Market Research Analyst: Mobility for RSG RSG – Posted by patricia.holland@rsginc.com
    Anywhere
    25 May 2018
  8. Full-Time
    Data Scientist @ New Delhi, India Amulozyme Inc. – Posted by Amulozyme
    New Delhi Delhi, India
    25 May 2018
  9. Full-Time
    Data Scientist/Programmer @ Milwaukee, Wisconsin, U.S. ConsensioHealth – Posted by ericadar
    Milwaukee Wisconsin, United States
    25 May 2018
  10. Full-Time
    Customer Success Rep RStudio – Posted by jclemens1
    Anywhere
    2 May 2018
  11. Full-Time
    Lead Data Scientist @ Washington, District of Columbia, U.S. AFL-CIO – Posted by carterkalchik
    Washington District of Columbia, United States
    27 Apr 2018

 

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


RStudio Connect v1.6.4

Tue, 06/19/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

RStudio Connect version 1.6.4 is now available!

There are a few breaking changes and a handful of new features that are highlighted below.
We encourage you to upgrade as soon as possible!

Breaking

Please take note of important breaking changes before upgrading.

Pandoc 2

RStudio Connect includes Pandoc 1 and will now also include Pandoc 2. Admins do
not need to install either.

If you have deployed content with rmarkdown version 1.9 or higher, then that
content will now use Pandoc 2 at runtime. This brings in several bug fixes and
enables some new functionality, but does introduce some backwards
incompatibilities. To protect older versions of rmarkdown, Pandoc 1 will still
be used for content deployed with any rmarkdown version prior to 1.9. Content
not using the rmarkdown package will have Pandoc 2 available.

Pandoc is dynamically made available to content when it is executed, so content
using the newer version of rmarkdown will see Pandoc 2 immediately upon
upgrading RStudio Connect, whether or not you have updated the content recently.
The types of backwards incompatibilities we expect are issues like minor
white-space rendering differences.
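
If you want to confirm which Pandoc a given piece of content sees, one hedged way is to render a small R Markdown chunk on the server that prints the versions in play:

# run inside the deployed content's R session
rmarkdown::pandoc_version()      # Pandoc version picked up by rmarkdown
packageVersion("rmarkdown")      # >= 1.9 means Pandoc 2 is used on Connect 1.6.4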

R Markdown Rendering

The R Markdown rendering environment has been updated, which will break a
certain class of R Markdown documents. No action is needed for the majority of
R Markdown documents. Publishers will need to rewrite R Markdown documents that
depended on locally preserving and storing state in between renderings.

The update isolates renderings and protects against clashes caused by concurrent
writes, but also means that files written to the local directory during a render
will not be present or available the next time that the report is rendered.

For example, a report that writes a CSV file to a local path on day 1, say
write.csv(dat, "data.csv"), and then reads the same CSV on day 2 with
read.csv("data.csv"), will no longer work. Publishers should refactor this type
of R Markdown document to write data to a database or to a shared directory
that is not sandboxed, for instance to /app-data/data.csv.
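
A hedged sketch of that refactor, following the /app-data example above (the results object is a placeholder for whatever the report produces):

shared_path <- "/app-data/data.csv"

# day 1 render: persist the data outside the sandboxed render directory
write.csv(results, shared_path, row.names = FALSE)

# day 2 render: read the persisted data back
results <- read.csv(shared_path)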

New Features

File Download

When a user accesses a Microsoft Word file or some other file type that is not
rendered in the browser, Connect previously downloaded the content immediately.
We have added a download page that simplifies the presentation of
browser-unfriendly file types.

Content Filtering

The RStudio Connect Dashboard now includes interactive labels for tag filters in
the content listing view. This simplifies keeping track of complex searches,
especially when returning to the Dashboard with saved filter state.

Log Download

The Connect UI truncates log files to show the latest output. However, when
someone downloads log files, the downloaded file is no longer truncated. This
makes it easier for a developer to inspect asset behavior with the full log file
available on Connect.

User Management

Connect now allows administrators to filter the users list by multiple account
statuses. The last day that each user was active is now displayed along with the
user list.

Upgrade Planning

Besides the breaking changes above, there are no special precautions to be aware of
when upgrading from v1.6.2 to v1.6.4. You can expect the installation and startup
of v1.6.4 to be complete in under a minute.

If you’re upgrading from a release older than v1.6.2, be sure to consider the
“Upgrade Planning” notes from the intervening releases, as well.

If you haven’t yet had a chance to download and try RStudio Connect, we
encourage you to do so.
RStudio Connect is the best way to share all the work that you do in R (Shiny
apps, R Markdown documents, plots, dashboards, Plumber APIs, etc.) with
collaborators, colleagues, or customers.

You can find more details or download a 45-day evaluation of the product at
https://www.rstudio.com/products/connect/.
Additional resources can be found below.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

Chat with the rOpenSci team at upcoming meetings

Tue, 06/19/2018 - 02:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

You can find members of the rOpenSci team at various meetings and workshops around the world. Come say ‘hi’, learn about how our software packages can enable your research, or about our process for open peer software review and onboarding, how you can get connected with the community or tell us how we can help you do open and reproducible research.

Where’s rOpenSci?

When                 Who                                    Where              What
June 23, 2018        Maëlle Salmon                          Cardiff, UK        satRday Cardiff
June 27-28, 2018     Scott Chamberlain                      Portland, OR       Bioinformatics Open Source Conference 2018 (BOSC)
July 4-6, 2018       Maëlle Salmon                          Rennes, FR         French R conference
July 10-13, 2018     Jenny Bryan                            Brisbane, AU       UseR!
July 28-Aug 2, 2018  Jenny Bryan                            Vancouver, CA      Joint Statistical Meetings (JSM)
Aug 6-10, 2018       Carl Boettiger, Dan Sholler            New Orleans, LA    Ecological Society of America (ESA)
Aug 15-16, 2018      Stefanie Butland                       Cambridge, MA      R / Pharma
Aug 27-30, 2018      Scott Chamberlain                      Dunedin, NZ        Biodiversity Information Standards (TDWG)
Sep 3-7, 2018        Jenny Bryan                            Buenos Aires, AR   LatinR
Sep 11, 2018         Maëlle Salmon                          Radolfzell, GE     AniMove
Sep 12-14, 2018      Jeroen Ooms                            The Hague, NL      Use of R in Official Statistics (uRos2018)
Oct 26, 2018         Nick Tierney (representing rOpenSci)   Seoul, KR          R User Conference in Korea 2018

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

R vs Python: Image Classification with Keras

Tue, 06/19/2018 - 01:15

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Libraries for executing R code from Python, or Python code from R, have existed for years. There is also the recent announcement of the Ursa Labs foundation by Wes McKinney, who aims to join forces with RStudio – Hadley Wickham in particular (find more here) – to improve data scientists’ workflows and unify libraries so they can be used not only in Python, but in any programming language used by data scientists. Despite all this, some data professionals are still very strict about the language to be used for ANN models, limiting their dev. environment exclusively to Python.

As a continuation of my R vs. Python comparison, I decided to test the performance of both languages in terms of the time required to train a convolutional neural network (CNN) based model for image recognition. As the starting point, I took the blog post by Dr. Shirin Glander on how easy it is to build a CNN model in R using Keras.

A few words about Keras. It is a Python library for artificial neural network ML models which provides a high-level front end to various deep learning frameworks, with TensorFlow being the default one.
Keras has many pros, with the following among them:

  • Easy to build complex models in just a few lines of code => perfect for a dev. cycle to quickly experiment and check your ideas
  • Code recycling: one can easily swap the backend framework (let’s say from CNTK to Tensorflow or vice versa) => DRY principle
  • Seamless use of GPU => perfect for fast model tuning and experimenting

Since Keras is written in Python, it may be a natural choice for your dev. environment to use Python. And that was the case until about a year ago, when RStudio founder J.J. Allaire announced the release of the Keras library for R in May ’17. I consider this to be a turning point for data scientists; now we can be more flexible with our dev. environment and deliver results more efficiently, with the opportunity to extend existing solutions we may have written in R.

It brings me to the point of this post.
My hypothesis is, when it comes to ANN ML model building with Keras, Python is not a must, and depending on your client’s request, or tech stack, R can be used without limitations and with similar efficiency.

Image Classification with Keras

In order to test my hypothesis, I am going to perform image classification using the fruit images data from Kaggle and train a CNN model with four hidden layers: two 2D convolutional layers, one pooling layer and one dense layer. RMSProp is used as the optimizer function.

Tech stack

Hardware:
CPU: Intel Core i7-7700HQ: 4 cores (8 threads), 2800 – 3800 (Boost) MHz core clock
GPU: Nvidia Geforce GTX 1050 Ti Mobile: 4Gb vRAM, 1493 – 1620 (Boost) MHz core clock
RAM: 16 Gb

Software:
OS: Linux Ubuntu 16.04
R: ver. 3.4.4
Python: ver. 3.6.3
Keras: ver. 2.2
Tensorflow: ver. 1.7.0
CUDA: ver. 9.0 (note that the current tensorflow version supports ver. 9.0 while the up-to-date version of cuda is 9.2)
cuDNN: ver. 7.0.5 (note that the current tensorflow version supports ver. 7.0 while the up-to-date version of cuDNN is 7.1)

Code

The R and Python code snippets used for CNN model building are presented below. Thanks to the fruitful collaboration between F. Chollet and J.J. Allaire, the logic and function names in R are very similar to the Python ones.

R

## Courtesy: Dr. Shirin Glander. Code source: https://shirinsplayground.netlify.com/2018/06/keras_fruits/
library(keras)

start <- Sys.time()

fruit_list <- c("Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine",
                "Mandarine", "Orange", "Limes", "Lemon", "Peach", "Plum", "Raspberry",
                "Strawberry", "Pineapple", "Pomegranate")

# number of output classes (i.e. fruits)
output_n <- length(fruit_list)

# image size to scale down to (original images are 100 x 100 px)
img_width <- 20
img_height <- 20
target_size <- c(img_width, img_height)

# RGB = 3 channels
channels <- 3

# path to image folders
path <- "path/to/folder/with/data"
train_image_files_path <- file.path(path, "fruits-360", "Training")
valid_image_files_path <- file.path(path, "fruits-360", "Test")

# optional data augmentation
train_data_gen <- image_data_generator(
  rescale = 1/255
)

# Validation data shouldn't be augmented! But it should also be scaled.
valid_data_gen <- image_data_generator(
  rescale = 1/255
)

# training images
train_image_array_gen <- flow_images_from_directory(train_image_files_path,
                                                    train_data_gen,
                                                    target_size = target_size,
                                                    class_mode = 'categorical',
                                                    classes = fruit_list,
                                                    seed = 42)

# validation images
valid_image_array_gen <- flow_images_from_directory(valid_image_files_path,
                                                    valid_data_gen,
                                                    target_size = target_size,
                                                    class_mode = 'categorical',
                                                    classes = fruit_list,
                                                    seed = 42)

### model definition

# number of training samples
train_samples <- train_image_array_gen$n
# number of validation samples
valid_samples <- valid_image_array_gen$n

# define batch size and number of epochs
batch_size <- 32
epochs <- 10

# initialise model
model <- keras_model_sequential()

# add layers
model %>%
  layer_conv_2d(filter = 32, kernel_size = c(3,3), padding = 'same',
                input_shape = c(img_width, img_height, channels)) %>%
  layer_activation('relu') %>%

  # Second hidden layer
  layer_conv_2d(filter = 16, kernel_size = c(3,3), padding = 'same') %>%
  layer_activation_leaky_relu(0.5) %>%
  layer_batch_normalization() %>%

  # Use max pooling
  layer_max_pooling_2d(pool_size = c(2,2)) %>%
  layer_dropout(0.25) %>%

  # Flatten max filtered output into feature vector
  # and feed into dense layer
  layer_flatten() %>%
  layer_dense(100) %>%
  layer_activation('relu') %>%
  layer_dropout(0.5) %>%

  # Outputs from dense layer are projected onto output layer
  layer_dense(output_n) %>%
  layer_activation('softmax')

# compile
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(lr = 0.0001, decay = 1e-6),
  metrics = 'accuracy'
)

# fit
hist <- model %>% fit_generator(
  # training data
  train_image_array_gen,

  # epochs
  steps_per_epoch = as.integer(train_samples / batch_size),
  epochs = epochs,

  # validation data
  validation_data = valid_image_array_gen,
  validation_steps = as.integer(valid_samples / batch_size),

  # print progress
  verbose = 2,
  callbacks = list(
    # save best model after every epoch
    callback_model_checkpoint(file.path(path, "fruits_checkpoints.h5"), save_best_only = TRUE),
    # only needed for visualising with TensorBoard
    callback_tensorboard(log_dir = file.path(path, "fruits_logs"))
  )
)

df_out <- hist$metrics %>%
  {data.frame(acc = .$acc[epochs],
              val_acc = .$val_acc[epochs],
              elapsed_time = as.integer(Sys.time()) - as.integer(start))}

Python

## Author: D. Kisler - adaptation of R code from https://shirinsplayground.netlify.com/2018/06/keras_fruits/
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import (Conv2D, Dense, LeakyReLU, BatchNormalization,
                          MaxPooling2D, Dropout, Flatten)
from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint, TensorBoard
import PIL.Image
from datetime import datetime as dt

start = dt.now()

# fruits categories
fruit_list = ["Kiwi", "Banana", "Plum", "Apricot", "Avocado", "Cocos", "Clementine",
              "Mandarine", "Orange", "Limes", "Lemon", "Peach", "Plum", "Raspberry",
              "Strawberry", "Pineapple", "Pomegranate"]
# number of output classes (i.e. fruits)
output_n = len(fruit_list)

# image size to scale down to (original images are 100 x 100 px)
img_width = 20
img_height = 20
target_size = (img_width, img_height)

# image RGB channels number
channels = 3

# path to image folders
path = "path/to/folder/with/data"
train_image_files_path = path + "fruits-360/Training"
valid_image_files_path = path + "fruits-360/Test"

## input data augmentation/modification
# training images
train_data_gen = ImageDataGenerator(
  rescale = 1./255
)
# validation images
valid_data_gen = ImageDataGenerator(
  rescale = 1./255
)

## getting data
# training images
train_image_array_gen = train_data_gen.flow_from_directory(train_image_files_path,
                                                            target_size = target_size,
                                                            classes = fruit_list,
                                                            class_mode = 'categorical',
                                                            seed = 42)
# validation images
valid_image_array_gen = valid_data_gen.flow_from_directory(valid_image_files_path,
                                                            target_size = target_size,
                                                            classes = fruit_list,
                                                            class_mode = 'categorical',
                                                            seed = 42)

## model definition
# number of training samples
train_samples = train_image_array_gen.n
# number of validation samples
valid_samples = valid_image_array_gen.n

# define batch size and number of epochs
batch_size = 32
epochs = 10

# initialise model
model = Sequential()

# add layers
# input layer
model.add(Conv2D(filters = 32, kernel_size = (3,3), padding = 'same',
                 input_shape = (img_width, img_height, channels), activation = 'relu'))
# hidden conv layer
model.add(Conv2D(filters = 16, kernel_size = (3,3), padding = 'same'))
model.add(LeakyReLU(.5))
model.add(BatchNormalization())
# using max pooling
model.add(MaxPooling2D(pool_size = (2,2)))
# randomly switch off 25% of the nodes per epoch step to avoid overfitting
model.add(Dropout(.25))
# flatten max filtered output into feature vector
model.add(Flatten())
# output features onto a dense layer
model.add(Dense(units = 100, activation = 'relu'))
# randomly switch off 50% of the nodes per epoch step to avoid overfitting
model.add(Dropout(.5))
# output layer with the number of units equal to the number of categories
model.add(Dense(units = output_n, activation = 'softmax'))

# compile the model
model.compile(loss = 'categorical_crossentropy',
              metrics = ['accuracy'],
              optimizer = RMSprop(lr = 1e-4, decay = 1e-6))

# train the model
hist = model.fit_generator(
  # training data
  train_image_array_gen,
  # epochs
  steps_per_epoch = train_samples // batch_size,
  epochs = epochs,
  # validation data
  validation_data = valid_image_array_gen,
  validation_steps = valid_samples // batch_size,
  # print progress
  verbose = 2,
  callbacks = [
    # save best model after every epoch
    ModelCheckpoint("fruits_checkpoints.h5", save_best_only = True),
    # only needed for visualising with TensorBoard
    TensorBoard(log_dir = "fruits_logs")
  ]
)

df_out = {'acc': hist.history['acc'][epochs - 1],
          'val_acc': hist.history['val_acc'][epochs - 1],
          'elapsed_time': (dt.now() - start).seconds}

Experiment

The models above were trained 10 times each with R and with Python, on both GPU and CPU; the elapsed time and the final accuracy after 10 epochs were measured. The results of the measurements are presented in the plots below (click a plot to be redirected to the interactive plotly version).

From the plots above, one can see that:

  • the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is around 99% as well).
  • even though 10 measurements may not be conclusive, Python reduced the time required to train the CNN model by up to 15%. This is somewhat expected, because R uses Python under the hood when it executes Keras functions.

Let’s perform an unpaired t-test, assuming that all our observations are normally distributed.

library(dplyr)
library(data.table)

# fetch the data used to plot graphs
d <- fread('keras_elapsed_time_rvspy.csv')

# unpaired t test:
# t_score = (mean1 - mean2)/sqrt(stdev1^2/n1 + stdev2^2/n2)
d %>%
  group_by(lang, eng) %>%
  summarise(el_mean = mean(elapsed_time),
            el_std  = sd(elapsed_time),
            n       = n()) %>%
  data.frame() %>%
  group_by(eng) %>%
  summarise(t_score = round(diff(el_mean) / sqrt(sum(el_std^2 / n)), 2))

eng   t_score
cpu   11.38
gpu    9.64

The t-scores reflect a significant difference between the time required to train a CNN model in R and in Python, as we saw in the plots above.

Summary

Building and training a CNN model in R using Keras is as “easy” as in Python, with the same coding logic and function naming conventions. The final accuracy of your Keras model will depend on the neural net architecture, hyperparameter tuning, training duration, train/test data amount, etc., but not on the programming language you use for your DS project. Training a CNN Keras model in Python may be up to 15% faster compared to R.

P.S.

If you would like to know more about Keras and to be able to build models with this awesome library, I recommend you these books:

As well as this Udemy course to start your journey with Keras.

Thanks a lot for your attention! I hope this post will be helpful for aspiring data scientists to gain an understanding of use cases for different technologies and to avoid being biased when it comes to selecting tools for DS projects.

    Related Post

    1. Update: Can we predict flu outcome with Machine Learning in R?
    2. Evaluation of Topic Modeling: Topic Coherence
    3. Natural Language Generation with Markovify in Python
    4. Anomaly Detection in R – The Tidy Way
    5. Forecasting with ARIMA – Part I

To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

Statistics Sunday: Accessing the YouTube API with tuber

Mon, 06/18/2018 - 17:16

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)

I haven’t had a lot of time to play with this, but yesterday I discovered the tuber R package, which allows you to interact with the YouTube API.

To use the tuber package, not only do you need to install it in R – you’ll also need a Google account and will have to authorize 4 APIs through the Developer Console: all 3 YouTube APIs (though the Data API will be doing the heavy lifting) and the Freebase API. Before you authorize the first API, Google will have you create a project to tie the APIs to. Then, you’ll find the APIs in the API library to add to this project. Click on each API and, on the next screen, select Enable. You’ll need to create credentials for each of the YouTube APIs. When asked to identify the type of app that will be accessing the YouTube API, select “Other”.

The tuber package requires two pieces of information from you, which you’ll get when you set up credentials for the OAuth 2.0 Client: the client ID and client secret. Once you set those up, you can download them at any time in JSON format by going to the Credentials dashboard and clicking the download button on the far right.

In R, load the tuber package, then call the yt_oauth function, using the client ID (which should end with something like “apps.googleusercontent.com”) and client secret (a string of letters and numbers). R will launch a browser window to authorize tuber to access the APIs. Once that’s done, you’ll be able to use the tuber package to, for instance, access data about a channel or get information about a video. My plan is to use my Facebook dataset to pull out the head songs I’ve shared and get the video information to generate a dataset of my songs. Look for more on that later. In the meantime, this great post on insightr can give you some details about using the tuber package.
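
A minimal sketch of that workflow; the client ID, client secret, video ID and channel ID below are placeholders, and get_stats()/get_channel_stats() are just two of the package's many helpers:

library(tuber)

yt_oauth(app_id     = "XXXXXXXX.apps.googleusercontent.com",
         app_secret = "YOUR_CLIENT_SECRET")

get_stats(video_id = "VIDEO_ID")               # views, likes, comments for one video
get_channel_stats(channel_id = "CHANNEL_ID")   # summary statistics for a channel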

    [Apologies for the short and late post – I’ve been sick and haven’t had as much time to write recently. Hopefully I’ll get back to normal over the next week.]


    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Effectively scaling Shiny in enterprise

    Mon, 06/18/2018 - 16:58

    (This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

    James Blair, RStudio

    Scalability is a hot word these days, and for good reason. As data continues to grow in volume and importance, the ability to reliably access and reason about that data increases in importance. Enterprises expect data analysis and reporting solutions that are robust and allow several hundred, even thousands, of concurrent users while offering up-to-date security options.

    Shiny is a highly flexible and widely used framework for creating web applications using R. It enables data scientists and analysts to create dynamic content that provides straightforward access to their work for those with no working knowledge of R. While Shiny has been around for quite some time, recent introductions to the Shiny ecosystem make Shiny simpler and safer to deploy in an enterprise environment where security and scalability are paramount. These new tools in connection with RStudio Connect provide enterprise grade solutions that make Shiny an even more attractive option for data resource creation.

    Develop and Test

    Most Shiny applications are developed either locally on a personal computer or using an instance of RStudio Server. During development it can be helpful to understand application performance, specifically if there are any concerning bottlenecks. The profvis package provides functions for profiling R code and can profile the performance of Shiny applications. profvis provides a breakdown of code performance and can be useful for identifying potential areas for improving application responsiveness.
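For instance, a quick way to profile an app during development (the app path below is a placeholder) is to wrap the call that runs it:

library(profvis)

# Run the app under the profiler; interact with it, then close it
# to open the flame graph and see where time is being spent.
profvis({
  shiny::runApp("path/to/your/app")  # placeholder path
})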

    The recently released promises package provides asynchronous capabilities to Shiny applications. Asynchronous programming can be used to improve application responsiveness when several concurrent users are accessing the same application. While there is some overhead involved in creating asynchronous applications, this method can improve application responsiveness.
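As a rough sketch of the idea (slow_summary() here is a hypothetical stand-in for any long-running computation), an asynchronous render might look like this:

library(shiny)
library(promises)
library(future)
plan(multisession)

# Inside a server function: run the expensive work in another process and
# render the result when the promise resolves, keeping the app responsive.
output$result <- renderText({
  future({ slow_summary(input$n) }) %...>%   # slow_summary() is hypothetical
    format()
})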

Once an application is fully developed and ready to be deployed, it’s useful to establish a set of behavioral expectations. These expectations can be used to ensure that future updates to the application don’t break or unexpectedly change behavior. Traditionally most testing of Shiny applications has been done by hand, which is both time consuming and error prone. The new shinytest package provides a clean interface for testing Shiny applications. Once an application is fully developed, a set of tests can be recorded and stored to compare against future application versions. These tests can be run programmatically and can even be used with continuous integration (CI) platforms. Robust testing for Shiny applications is a huge step forward in increasing the deployability and dependability of such applications.
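In practice (with the app directory path as a placeholder), recording and replaying a test looks roughly like this:

library(shinytest)

# Record an interactive session; this saves a test script under the app's tests/ folder.
recordTest("path/to/your/app")    # placeholder path

# Later (or on CI), replay the recorded tests and compare against the saved baseline.
testApp("path/to/your/app")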

    Deploy

    Once an application has been developed and tested to satisfaction, it must be deployed to a production environment in order to provide other users with application access. Production deployment of data resources within an enterprise centers on control. For example, access control and user authentication are important for controlling who has access to the application. Server resource control and monitoring are important for controlling application performance and server stability. These control points enable trustworthy and performant deployment.

    There are a few current solutions for deploying Shiny applications. Shiny Server provides both an open source and professional framework for publishing Shiny applications and making them available to a wide audience. The professional version provides features that are attractive for enterprise deployment, such as user authentication. RStudio Connect is a recent product from RStudio that provides several enhancements to Shiny Server. Specifically, RStudio Connect supports push button deployment and natively handles application dependencies, both of which simplify the deployment process. RStudio Connect also places resource control in the hands of the application developer, which lightens the load on system administrators and allows the developer to tune app performance to align with expectations and company priorities.

    Scale

    In order to be properly leveraged, a deployed application must scale to meet user demand. In some instances, applications will have low concurrent users and will not need any additional help to remain responsive. However, it is often the case in large enterprises that applications are widely distributed and concurrently accessed by several hundred or even thousands of users. RStudio Connect provides the ability to set up a cluster of servers to provide high availability (HA) and load balanced configurations in order to scale applications to meet the needs of concurrent users. Shiny itself has been shown to effectively scale to meet the demands of 10,000 concurrent users!

    As businesses continue searching for ways to efficiently capture and digest growing stores of data, R in connection with Shiny continues to establish itself as a robust and enterprise ready solution for data analysis and reporting.


    To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Not only LIME

    Mon, 06/18/2018 - 08:44

    (This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

I’ve heard about a number of consulting companies that decided to use a simple linear model instead of a black-box model with higher performance, because “the client wants to understand the factors that drive the prediction”.
    And usually the discussion goes as follows: “We have tried LIME for our black-box model; it is great, but it is not working in our case.” “Have you tried other explainers?” “What other explainers?”

    So here you have a map of different visual explanations for black-box models. Choose one in (on average) less than three simple steps.

    These are available in the DALEX package. Feel free to propose other visual explainers that should be added to this map (and the package).
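As a minimal sketch of how such explainers are built (assuming a fitted model `model` with feature data `X` and response `y`, and using the DALEX API as of mid-2018):

library(DALEX)

# Wrap the fitted model in an explainer object.
explainer <- explain(model, data = X, y = y)

# Two of the visual explanations from the map: global variable importance
# and the marginal response for a single variable.
vi <- variable_importance(explainer)
plot(vi)

sv <- single_variable(explainer, variable = "age")  # "age" is a placeholder feature name
plot(sv)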


    To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    bounceR 0.1.2: Automated Feature Selection

    Mon, 06/18/2018 - 08:15

    (This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

    New Features

As promised, we kept on working on our bounceR package. For one, we changed the interface: users no longer have to choose a number of tuning parameters that – thanks to my somewhat cryptic documentation – sound more complicated than they are. Inspired by H2O.ai’s feature of letting the user set the time he or she wants to wait, instead of a number of cryptic tuning parameters, we added a similar function.

Further, we changed the source code quite a bit. Henrik Bengtsson gave a very inspiring talk on parallelization using the genius future package at this year's eRum conference. A couple of days later, Davis Vaughan released furrr, an incredibly smart – kudos – wrapper on top of the no-less-genius purrr package. Davis' package combines purrr's mapping functions with future's parallelization madness. As you can tell, I am a big fan of all three of these packages. Thus, inspired by these new inventions, we wanted to make use of them in our package. So, the entire parallelization setup of bounceR now leverages furrr. This way, the parallelization is much smarter, faster, and works seamlessly on different operating systems.

    Practical Example

So, let's see how you can use it now. Let's start by downloading the package.

# you need devtools, cause we are just about to get it to CRAN, though we are not that far
library(devtools)

# now you are good to go
devtools::install_github("STATWORX/bounceR")

# now you can source it like every normal package
library(bounceR)

    To show how the feature selection works, we now need some data, so lets simulate some with our sim_data() function.

# simulate some data
data <- sim_data(n = 100, modelvars = 10, noisevars = 300)

Now you guys can all imagine that with 310 features on 100 observations, building models could be a little challenging. In order to model the target at all, you need to reduce your feature space. There are numerous ways to do so; in my last blog post I described our solution. Let's see how to use our algorithm.

# run our algorithm
selected_features <- featureSelection(data = data,
                                      target = "y",
                                      max_time = "30 mins",
                                      bootstrap = "regular",
                                      early_stopping = TRUE,
                                      parallel = TRUE)

What can you expect to get out of it? We return a list containing, of course, the optimal formula calculated by our algorithm. You also get a stability matrix, which shows a ranking of the features by importance. Additionally, we built in some convenient S4 methods, so you can easily access all the information you need.

    Outlook

I hope I could tease you a little into checking out the package and helping us further improve it. Currently, we are developing two new algorithms for feature selection; in the next iteration we will implement those two as well. I am looking forward to your comments, issues, and thoughts on the package.

    Cheers Guys!

About the author

    Lukas Strömsdörfer

Lukas is part of the Data Science team and is currently doing an external PhD at the University of Göttingen. In his free time, he is a passionate cyclist and likes watching TV series.

The post bounceR 0.1.2: Automated Feature Selection first appeared on STATWORX.


    To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Most Starred R Packages on GitHub

    Mon, 06/18/2018 - 02:00

    (This article was first published on Blog-rss on stevenmortimer.com, and kindly contributed to R-bloggers)

    It seems like all the best R packages proudly use GitHub and have a README adorned with badges across the top. The recent Microsoft acquisition of GitHub got me wondering: What proportion of current R packages use GitHub? Or at least refer to it in the URL of the package description. Also, what is the relationship between the number of CRAN downloads and the number of stars on a repository? My curiosity got the best of me so I hastily wrote a script to pull the data. Click here to go straight to the full script and data included at the bottom of this post. I acknowledge there are more elegant ways to have coded this, but let’s press on.

    Pulling List of Packages & their Details

    CRAN provides a list and links to all current packages at https://cran.rstudio.com/src/contrib. By scraping this page I found 12,675 current R packages (non-archived). For each package I pinged its detail page using the canonical form link (e.g. https://cran.r-project.org/package=ggplot2). From there I looked at the package description fields "URL" and "BugReports" to see if either contained “github.com”. It turns out that 3,718 of the packages (29.3% of the total) referenced GitHub. Below is the code snippet for retrieving the list of packages that was taken from Gergely Daróczi’s gist.

# getting list of the packages
pkgs <- readHTMLTable(readLines(file.path('https://cran.rstudio.com/src/contrib')),
                      which = 1, stringsAsFactors = FALSE)[,-1]

# filter out lines that aren't really packages
pkgs <- pkgs %>%
  filter(Size != "-", grepl('tar.gz$', Name)) %>%
  mutate(Name = sub('^([a-zA-Z0-9\\.]*).*', '\\1', Name))

    While retrieving the package metadata I pinged the GitHub API to see if I could get the number of stars for the repository. Currently, GitHub allows 5,000 authenticated requests per hour (link), but out of all the packages only 3,718 referenced GitHub, so I could make all the requests at once. Here is the function I used to take a cleaned up version of the package’s URL then form a request to the GitHub API to get star counts:

# get the star count from a clean version of the package's URL
gh_star_count <- function(url){
  stars <- tryCatch({
    this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url)
    req <- GET(this_url, gtoken)
    stop_for_status(req)
    cont <- content(req)
    cont$stargazers_count
  }, error = function(e){
    return(NA_integer_)
  })
  return(stars)
}

Analyzing the Data

    Once I had all the package detail data, I found that R packages, on average, have 35.7 GitHub stars, but the median number of stars is only 6! ggplot2 has the most stars with 3,174. In my analysis I removed the xgboost, h2o, and feather packages which point to the repository of their implementations in many languages, not just R.

    What I really found interesting was comparing CRAN downloads to GitHub repo stars. Using the cranlogs package I was able to get the total package downloads dating back to January 1, 2014. In contrast with the low star counts, the median downloads for R packages is 8,975. Combining stars and downloads data I found that the median R package has 903 downloads per star. Only 38.7% of packages had more than 10 stars, which shows how hard stars are to get even if you’ve written a great package. I’m not sure what proportion of R users frequently reference and contribute to GitHub, but it would be interesting to compare that with the high ratios of downloads to stars.

    There are some real outliers in the data. For example, the Rcpp package, perhaps the most downloaded package of all-time, has 15.8M downloads and only 377 stars. Similarly, Hadley’s scales package has 9.4M downloads and only 115 stars. These support/helper packages just don’t get the same star love as the headliners like ggplot2, shiny, and dplyr.

    Of course, I could not help but check out the stats for some of the most prolific package authors. After parsing out individuals with the ["aut", "cre"] roles I came to the not so surprising conclusion that Hadley has the most stars of any author with 12,408 stars. In contrast, Dirk Eddelbuettel had one of the lowest star-to-download ratios. For every ~38K downloads Dirk’s repositories will receive one star. Pay no attention to the man behind the curtain since his Rcpp package underpins a whole host of packages without all the GitHub fanfare. Here is a list of popular R package authors and their stats:

Author | Notable Packages | Downloads | Stars | Downloads Per Star
Hadley Wickham | ggplot2, dplyr, httr | 113,160,314 | 12,408 | 9,119.9
Dirk Eddelbuettel | Rcpp, BH | 28,433,586 | 745 | 38,165.9
Yihui Xie | knitr, rmarkdown, bookdown | 42,472,860 | 6,315 | 6,725.7
Winston Chang | R6, shiny | 17,161,005 | 4,027 | 4,261.5
Jennifer Bryan | readxl, gapminder, googlesheets | 6,055,774 | 1,714 | 3,533.1
JJ Allaire | rstudioapi, reticulate, tensorflow | 8,882,553 | 2,798 | 3,174.6
Jeroen Ooms | jsonlite, curl, openssl | 25,907,868 | 1,483 | 17,469.9
Scott Chamberlain | geojsonio, taxize | 1,770,664 | 2,528 | 700.4
Jim Hester | devtools, memoise, readr | 22,867,071 | 4,332 | 5,278.6
Kirill Müller | tibble, DBI | 36,159,009 | 1,077 | 33,573.8

    I’m sure you could create mixed models to determine the unique download to star relationship for individuals. Also, you could use other package attributes to predict stars or downloads, but I’ll leave that to another curious soul. I will include tables below regarding the top 10 most downloaded, most starred, most and least downloaded per star.

    Credits

    Credit is due since this script borrows a couple different pieces of code and concepts. Retrieving the list of packages is from Gergely Daróczi’s gist. Authenticating to GitHub was taken from one of the httr demos.

Appendix

Top 10 Most Starred Packages

Name | Author | Downloads | Stars | Downloads Per Star
ggplot2 | Hadley Wickham | 13,001,703 | 3,174 | 4,096.3
shiny | Winston Chang | 4,571,794 | 2,902 | 1,575.4
dplyr | Hadley Wickham | 8,276,844 | 2,408 | 3,437.2
devtools | Jim Hester | 5,536,730 | 1,645 | 3,365.8
knitr | Yihui Xie | 7,131,564 | 1,581 | 4,510.8
data.table | Matt Dowle | 6,005,795 | 1,457 | 4,122.0
plotly | Carson Sievert | 1,195,880 | 1,255 | 952.9
rmarkdown | Yihui Xie | 5,432,495 | 1,160 | 4,683.2
tensorflow | JJ Allaire | 94,856 | 1,033 | 91.8
bookdown | Yihui Xie | 126,586 | 1,009 | 125.5

Top 10 Most Downloaded Packages with Stars

Name | Author | Downloads | Stars | Downloads Per Star
Rcpp | Dirk Eddelbuettel | 15,824,781 | 377 | 41,975.5
ggplot2 | Hadley Wickham | 13,001,703 | 3,174 | 4,096.3
stringr | Hadley Wickham | 11,547,828 | 268 | 43,088.9
stringi | Marek Gagolewski | 11,310,113 | 122 | 92,705.8
digest | Dirk Eddelbuettel with contributions by Antoine Lucas | 11,233,244 | 42 | 267,458.2
plyr | Hadley Wickham | 10,340,396 | 470 | 22,000.8
R6 | Winston Chang | 9,993,128 | 212 | 47,137.4
reshape2 | Hadley Wickham | 9,582,245 | 173 | 55,388.7
scales | Hadley Wickham | 9,380,757 | 115 | 81,571.8
jsonlite | Jeroen Ooms | 9,112,790 | 176 | 51,777.2

Top 10 Packages by Stars per Download (frequently starred)

Name | Author | Downloads | Stars | Downloads Per Star
r2d3 | Javier Luraschi | 416 | 235 | 1.77
workflowr | John Blischak | 448 | 169 | 2.65
goodpractice | Hannah Frick | 523 | 192 | 2.72
xtensor | Johan Mabille | 2,057 | 664 | 3.10
scico | Thomas Lin Pedersen | 185 | 59 | 3.14
shinytest | Winston Chang | 418 | 113 | 3.70
furrr | Davis Vaughan | 724 | 171 | 4.23
pkgdown | Hadley Wickham | 1,589 | 332 | 4.79
rtika | Sasha Goodman | 168 | 32 | 5.25
mindr | Peng Zhao | 2,051 | 368 | 5.57

Bottom 10 Packages by Stars per Download (infrequently starred)

Name | Author | Downloads | Stars | Downloads Per Star
mime | Yihui Xie | 7,398,765 | 12 | 616,563.8
pkgmaker | Renaud Gaujoux | 1,228,173 | 2 | 614,086.5
rngtools | Renaud Gaujoux | 1,224,959 | 2 | 612,479.5
magic | Robin K. S. Hankin | 344,741 | 1 | 344,741.0
gsubfn | G. Grothendieck | 675,056 | 2 | 337,528.0
bindrcpp | Kirill Müller | 2,996,452 | 10 | 299,645.2
plogr | Kirill Müller | 3,343,099 | 12 | 278,591.6
digest | Dirk Eddelbuettel with contributions by Antoine Lucas | 11,233,244 | 42 | 267,458.2
munsell | Charlotte Wickham | 7,778,712 | 31 | 250,926.2
proto | Hadley Wickham | 2,593,246 | 11 | 235,749.6

Full Script

The full script is below; it is also available, along with the data, as a gist at: https://gist.github.com/StevenMMortimer/1b4b626d3d91240a77f969ae04b37114

# load packages & custom functions ---------------------------------------------
library(tidyverse)
library(httr)
library(cranlogs)
library(XML)
library(ggrepel)
library(scales)
library(knitr)
library(stringr)

gh_from_url <- function(x){
  if(!grepl(',', x)){
    x <- strsplit(x, " ")[[1]]
    x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))])
  } else {
    x <- strsplit(x, ",")[[1]]
    x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))])
  }
  x <- gsub("http://", "https://", tolower(x))
  x <- gsub("www\\.github\\.com", "github.com", x)
  x <- gsub("/$", "", x)
  x <- gsub("^github.com", "https://github.com", x)
  x <- gsub("/issues", "", x)
  x <- gsub("\\.git", "", x)
  return(x)
}

aut_maintainer_from_details <- function(x){
  x <- gsub("'|\"", "", x)
  if(grepl(',', x)){
    x <- strsplit(x, "\\],")[[1]]
    aut_cre_ind <- grepl(pattern='\\[aut, cre|\\[cre, aut|\\[cre', x, ignore.case=TRUE)
    if(any(aut_cre_ind)){
      x <- x[min(which(aut_cre_ind))]
      x <- gsub("\\[aut, cre|\\[cre, aut|\\[cre", "", x)
    }
    x <- strsplit(x, ",")[[1]][1]
    x <- trimws(gsub("\\]", "", x))
    x <- trimws(gsub(" \\[aut", "", x))
  }
  return(x)
}

gh_star_count <- function(url){
  stars <- tryCatch({
    this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url)
    req <- GET(this_url, gtoken)
    stop_for_status(req)
    cont <- content(req)
    cont$stargazers_count
  }, error = function(e){
    return(NA_integer_)
  })
  return(stars)
}

# authenticate to github -------------------------------------------------------
# use Hadley's key and secret
myapp <- oauth_app("github",
                   key = "56b637a5baffac62cad9",
                   secret = "8e107541ae1791259e9987d544ca568633da2ebf")
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
gtoken <- config(token = github_token)

# pull list of packages --------------------------------------------------------
# get list of currently available packages on CRAN
pkgs <- readHTMLTable(readLines(file.path('https://cran.rstudio.com/src/contrib')),
                      which = 1, stringsAsFactors = FALSE)[,-1]

# filter out lines that aren't really packages
pkgs <- pkgs %>%
  filter(Size != "-", grepl('tar.gz$', Name)) %>%
  mutate(Name = sub('^([a-zA-Z0-9\\.]*).*', '\\1', Name)) %>%
  distinct(Name, .keep_all = TRUE)

# get details for each package -------------------------------------------------
all_pkg_details <- NULL

# old fashioned looping!
# WARNING: This takes awhile to complete
for(i in 1:nrow(pkgs)){
  if(i %% 100 == 0){
    message(sprintf("Processing package #%s out of %s", i, nrow(pkgs)))
  }
  pkg_details <- readHTMLTable(readLines(file.path(sprintf('https://cran.r-project.org/package=%s', pkgs[i,]$Name))),
                               header=FALSE, which = 1, stringsAsFactors = FALSE)
  pkg_details <- pkg_details %>%
    mutate(V1 = gsub(":", "", V1)) %>%
    spread(V1, V2)
  this_url <- pkg_details$URL
  on_github <- FALSE
  this_github_url <- NA_character_
  gh_stars <- NA_integer_
  if(!is.null(this_url)){
    on_github <- grepl('http://github.com|https://github.com|http://www.github.com', this_url)
    if(on_github){
      this_github_url <- gh_from_url(this_url)
      gh_stars <- gh_star_count(this_github_url)
    } else {
      # check the BugReports URL as a backup (e.g. shiny package references GitHub this way)
      issues_on_github <- grepl('http://github.com|https://github.com|http://www.github.com', pkg_details$BugReports)
      if(length(issues_on_github) == 0 || !issues_on_github){
        this_github_url <- NA_character_
      } else {
        this_github_url <- gh_from_url(pkg_details$BugReports)
        gh_stars <- gh_star_count(this_github_url)
        on_github <- TRUE
      }
    }
  } else {
    this_url <- NA_character_
  }
  downloads <- cran_downloads(pkgs[i,]$Name, from = "2014-01-01", to = "2018-06-15")
  all_pkg_details <- rbind(all_pkg_details,
                           tibble(name = pkgs[i,]$Name,
                                  published = pkg_details$Published,
                                  author = aut_maintainer_from_details(pkg_details$Author),
                                  url = this_url,
                                  github_ind = on_github,
                                  github_url = this_github_url,
                                  downloads = sum(downloads$count),
                                  stars = gh_stars))
}

# basic summary stats -----------------------------------------------------------
# remove observations where the GitHub URL refers to a repository that
# is not specific to R and therefore might have an inflated star count
all_pkg_details_clean <- all_pkg_details %>%
  filter(!(name %in% c('xgboost', 'h2o', 'feather'))) %>%
  mutate(downloads_per_star = downloads / stars,
         downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star))

# proportion of all packages listing github
sum(all_pkg_details$github_ind)
mean(all_pkg_details$github_ind)

# proportion of packages with stars
mean(!is.na(all_pkg_details$stars))

# typical number of stars per package
mean(all_pkg_details_clean$stars, na.rm=TRUE)
median(all_pkg_details_clean$stars, na.rm=TRUE)
max(all_pkg_details_clean$stars, na.rm=TRUE)

# typical number of downloads per package
mean(all_pkg_details_clean$downloads, na.rm=TRUE)
median(all_pkg_details_clean$downloads, na.rm=TRUE)

# percent of packages over 10 stars
mean(all_pkg_details_clean$stars > 10, na.rm=TRUE)
mean(all_pkg_details_clean$downloads_per_star, na.rm=TRUE)
median(all_pkg_details_clean$downloads_per_star, na.rm=TRUE)

# stars histogram ----------------------------------------------------------------
ggplot(data=all_pkg_details_clean, mapping=aes(stars)) +
  geom_histogram(aes(fill=..count..), bins=60) +
  scale_x_continuous(trans = "log1p", breaks=c(0,1,2,3,10,100,1000,3000)) +
  labs(x = "Stars", y = "Count", fill = "Count", caption = "Source: api.github.com as of 6/16/18") +
  ggtitle("Distribution of GitHub Stars on R Packages") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(), plot.caption=element_text(hjust = 0))

# stars to downloads scatterplot -------------------------------------------------
plot_dat <- all_pkg_details_clean
idx_label <- which(with(plot_dat, downloads > 10000000 | stars > 1000))
plot_dat$name2 <- plot_dat$name
plot_dat$name <- ""
plot_dat$name[idx_label] <- plot_dat$name2[idx_label]
ggplot(data=plot_dat, aes(stars, downloads, label = name)) +
  geom_point(color = ifelse(plot_dat$name == "", "grey50", "red")) +
  geom_text_repel(box.padding = .5) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma) +
  labs(x = "GitHub Stars", y = "CRAN Downloads",
       caption = "Sources:\napi.github.com as of 6/16/18\ncranlogs as of 1/1/14 - 6/15/18") +
  ggtitle("Relationship Between CRAN Downloads and GitHub Stars") +
  theme_bw() +
  theme(plot.caption=element_text(hjust = 0))

# author stats --------------------------------------------------------------------
# summary by author
authors_detail <- all_pkg_details_clean %>%
  group_by(author) %>%
  summarize(downloads = sum(downloads, na.rm=TRUE),
            stars = sum(stars, na.rm=TRUE)) %>%
  mutate(downloads_per_star = downloads / stars,
         downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star)) %>%
  arrange(desc(downloads))

# popular authors
pop_authors <- tibble(author = c('Hadley Wickham', 'Dirk Eddelbuettel', 'Yihui Xie',
                                 'Winston Chang', 'Jennifer Bryan', 'JJ Allaire',
                                 'Jeroen Ooms', 'Scott Chamberlain', 'Jim Hester',
                                 'Kirill Müller'),
                      notable_packages = c('ggplot2, dplyr, httr', 'Rcpp, BH',
                                           'knitr, rmarkdown, bookdown', 'R6, shiny',
                                           'readxl, gapminder, googlesheets',
                                           'rstudioapi, reticulate, tensorflow',
                                           'jsonlite, curl, openssl', 'geojsonio, taxize',
                                           'devtools, memoise, readr', 'tibble, DBI'))
author_stats <- pop_authors %>%
  inner_join(., authors_detail, by='author') %>%
  select(author, notable_packages, downloads, stars, downloads_per_star) %>%
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# single author
#all_pkg_details_clean %>% filter(author == 'Dirk Eddelbuettel') %>% arrange(desc(downloads))

# top 10 lists ---------------------------------------------------------------------
# Top 10 Most Starred Packages
top_starred <- all_pkg_details_clean %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(stars)) %>%
  slice(1:10) %>%
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Top 10 Most Downloaded Packages with stars
top_downloaded <- all_pkg_details_clean %>%
  filter(!is.na(stars)) %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(downloads)) %>%
  slice(1:10) %>%
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Bottom 10 Packages by Downloads per Star (frequently starred)
frequently_starred <- all_pkg_details_clean %>%
  filter(downloads > 100) %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(downloads_per_star) %>%
  slice(1:10) %>%
  mutate(downloads_per_star = round(downloads_per_star, 2)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Top 10 Packages by Downloads per Star (infrequently starred)
infrequently_starred <- all_pkg_details_clean %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(downloads_per_star)) %>%
  slice(1:10) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

    To leave a comment for the author, please follow the link and comment on their blog: Blog-rss on stevenmortimer.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Sketchnotes from TWiML&AI: Practical Deep Learning with Rachel Thomas

    Mon, 06/18/2018 - 02:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    These are my sketchnotes for Sam Charrington’s podcast This Week in Machine Learning and AI about Practical Deep Learning with Rachel Thomas:

    Sketchnotes from TWiMLAI talk: Practical Deep Learning with Rachel Thomas

    You can listen to the podcast here.

In this episode, I’m joined by Rachel Thomas, founder and researcher at Fast AI. If you’re not familiar with Fast AI, the company offers a series of courses including Practical Deep Learning for Coders, Cutting Edge Deep Learning for Coders and Rachel’s Computational Linear Algebra course. The courses are designed to make deep learning more accessible to those without the extensive math backgrounds some other courses assume. Rachel and I cover a lot of ground in this conversation, starting with the philosophy and goals behind the Fast AI courses. We also cover Fast AI’s recent decision to switch their courses from Tensorflow to Pytorch, the reasons for this, and the lessons they’ve learned in the process. We discuss the role of the Fast AI deep learning library as well, and how it was recently used to help their team achieve top results, by a factor of more than ten, on a popular industry benchmark of training time and training cost. https://twimlai.com/twiml-talk-138-practical-deep-learning-with-rachel-thomas/


    To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    How Much Money Should Machines Earn?

    Sun, 06/17/2018 - 22:45

    (This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

    Every inch of sky’s got a star
    Every inch of skin’s got a scar
    (Everything Now, Arcade Fire)

I think that a very good way to start with R is doing an interactive visualization of some open data, because you will train many important skills of a data scientist: loading, cleaning, transforming and combining data, and performing a suitable visualization. Doing it interactively will also give you an idea of the power of R, because you will realise that you are able to handle indirectly other programming languages such as JavaScript.

    That’s precisely what I’ve done today. I combined two interesting datasets:

    • The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford, using a Gaussian process classifier and published in this paper in 2013.
• Job statistics (employment, median annual wages and typical education needed for entry) from the US Bureau of Labor, available here.

Apart from using dplyr to manipulate data and highcharter to do the visualization, I used the tabulizer package to extract the table of probabilities of computerisation from the PDF: it makes this task extremely easy.
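For anyone who wants to try the same approach, a minimal sketch looks like this (the PDF path is a placeholder for a local copy of the paper):

library(tabulizer)

# extract_tables() returns a list of matrices, one per table detected in the PDF.
tabs <- extract_tables("frey_osborne_2013.pdf")  # placeholder path
head(tabs[[1]])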

    This is the resulting plot:


    If you want to examine it in depth, here you have a full size version.

    These are some of my insights (its corresponding figures are obtained directly from the dataset):

• There is a moderate negative correlation between wages and probability of computerisation.
    • Around 45% of US jobs are threatened by machines (i.e. have a computerisation probability higher than 80%): half of them do not require formal education for entry.
    • In fact, 78% of jobs which do not require formal education for entry are threatened by machines; 0% of those which require a master's degree are.
    • Teachers are absolutely irreplaceable (0% are threatened by machines), but they earn 2.2% less than the average wage (unfortunately, I'm afraid this phenomenon occurs in many other countries as well).
    • Don't study to become a librarian or archivist: it seems a bad way to invest your time.
    • Mathematicians will survive the machines.

    The code of this experiment is available here.


    To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Version 0.6-11 of NIMBLE released

    Sat, 06/16/2018 - 18:48

    (This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

    We’ve released the newest version of NIMBLE on CRAN and on our website.

    Version 0.6-11 has important new features, notably support for Bayesian nonparametric mixture modeling, and more are on the way in the next few months.

    New features include:

    • support for Bayesian nonparametric mixture modeling using Dirichlet process mixtures, with specialized MCMC samplers automatically assigned in NIMBLE’s default MCMC (See Chapter 10 of the manual for details);
    • additional resampling methods available with the auxiliary and bootstrap particle filters;
    • user-defined filtering algorithms can be used with NIMBLE’s particle MCMC samplers;
    • MCMC thinning intervals can be modified at MCMC run-time;
    • both runMCMC() and nimbleMCMC() now drop burn-in samples before thinning, making their behavior consistent with each other;
    • increased functionality for the ‘setSeed’ argument in nimbleMCMC() and runMCMC();
    • new functionality for specifying the order in which sampler functions are executed in an MCMC; and
    • invalid dynamic indexes now result in a warning and NaN values but do not cause execution to error out, allowing MCMC sampling to continue.

    Please see the NEWS file in the installed package for more details
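As a flavour of the new BNP support, here is a minimal sketch (not taken from the release notes; y is assumed to be a numeric data vector of length n) of a Dirichlet process mixture of normals written with the new dCRP distribution and fitted with NIMBLE's default MCMC:

library(nimble)

dpCode <- nimbleCode({
  z[1:n] ~ dCRP(conc = alpha, size = n)   # cluster memberships via a Chinese restaurant process
  for(i in 1:n) {
    y[i] ~ dnorm(mu[z[i]], sd = sigma)    # each observation drawn from its cluster's mean
  }
  for(j in 1:n) {
    mu[j] ~ dnorm(0, sd = 10)             # cluster-specific means
  }
  alpha ~ dgamma(1, 1)
  sigma ~ dunif(0, 10)
})

samples <- nimbleMCMC(code = dpCode,
                      constants = list(n = length(y)),
                      data = list(y = y),
                      niter = 2000)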


    To leave a comment for the author, please follow the link and comment on their blog: R – NIMBLE. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Prediction Interval, the wider sister of Confidence Interval

    Sat, 06/16/2018 - 01:37

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    In this post, I will illustrate the use of prediction intervals for the comparison of measurement methods. In the example, a new spectral method for measuring whole blood hemoglobin is compared with a reference method. But first, let's start with discussing the large difference between a confidence interval and a prediction interval.

    Prediction interval versus Confidence interval

    Very often a confidence interval is misinterpreted as a prediction interval, leading to unrealistic “precise” predictions. As you will see, prediction intervals (PI) resemble confidence intervals (CI), but the width of the PI is by definition larger than the width of the CI.

    Let’s assume that we measure the whole blood hemoglobin concentration in a random sample of 100 persons. We obtain the estimated mean (Est_mean), limits of the confidence interval (CI_Lower and CI_Upper) and limits of the prediction interval (PI_Lower and PI_Upper): (The R-code to do this is in the next paragraph)

Est_mean  CI_Lower  CI_Upper  PI_Lower  PI_Upper
     140       138       143       113       167

    The graphical presentation:

    A Confidence interval (CI) is an interval of good estimates of the unknown true population parameter. About a 95% confidence interval for the mean, we can state that if we would repeat our sampling process infinitely, 95% of the constructed confidence intervals would contain the true population mean. In other words, there is a 95% chance of selecting a sample such that the 95% confidence interval calculated from that sample contains the true population mean.

    Interpretation of the 95% confidence interval in our example:

    • The values contained in the interval [138g/L to 143g/L] are good estimates of the unknown mean whole blood hemoglobin concentration in the population. In general, if we would repeat our sampling process infinitely, 95% of such constructed confidence intervals would contain the true mean hemoglobin concentration.

    A Prediction interval (PI) is an estimate of an interval in which a future observation will fall, with a certain confidence level, given the observations that were already observed. About a 95% prediction interval we can state that if we would repeat our sampling process infinitely, 95% of the constructed prediction intervals would contain the new observation.

    Interpretation of the 95% prediction interval in the above example:

    • Given the observed whole blood hemoglobin concentrations, the whole blood hemoglobin concentration of a new sample will be between 113g/L and 167g/L with a confidence of 95%. In general, if we would repeat our sampling process infinitely, 95% of the such constructed prediction intervals would contain the new hemoglobin concentration measurement.

Remark: very often we will read the interpretation “The whole blood hemoglobin concentration of a new sample will be between 113g/L and 167g/L with a probability of 95%.” (for example on Wikipedia). This interpretation is correct in the theoretical situation where the parameters (true mean and standard deviation) are known.

    Estimating a prediction interval in R

    First, let's simulate some data. The sample size in the plot above was (n=100). Now, to see the effect of the sample size on the width of the confidence interval and the prediction interval, let's take a “sample” of 400 hemoglobin measurements using the same parameters:

set.seed(123)
hemoglobin <- rnorm(400, mean = 139, sd = 14.75)
df <- data.frame(hemoglobin)

    Although we don't need a linear regression yet, I'd like to use the lm() function, which makes it very easy to construct a confidence interval (CI) and a prediction interval (PI). We can estimate the mean by fitting a “regression model” with an intercept only (no slope). The default confidence level is 95%.

Confidence interval:

CI <- predict(lm(df$hemoglobin ~ 1), interval="confidence")
CI[1,]
##      fit      lwr      upr
## 139.2474 137.8425 140.6524

    The CI object has a length of 400. But since there is no slope in our “model”, each row is exactly the same.

Prediction interval:

PI <- predict(lm(df$hemoglobin ~ 1), interval="predict")
## Warning in predict.lm(lm(df$hemoglobin ~ 1), interval = "predict"): predictions on current data refer to _future_ responses
PI[1,]
##      fit      lwr      upr
## 139.2474 111.1134 167.3815

    We get a “warning” that “predictions on current data refer to future responses”. That's exactly what we want, so no worries there. As you see, the column names of the objects CI and PI are the same. Now, let's visualize the confidence and the prediction interval.

    The code below is not very elegant but I like the result (tips are welcome :-))

library(ggplot2)

limits_CI <- aes(x=1.3, ymin=CI[1,2], ymax=CI[1,3])
limits_PI <- aes(x=0.7, ymin=PI[1,2], ymax=PI[1,3])

PI_CI <- ggplot(df, aes(x=1, y=hemoglobin)) +
  geom_jitter(width=0.1, pch=21, fill="grey", alpha=0.5) +
  geom_errorbar(limits_PI, width=0.1, col="#1A425C") +
  geom_point(aes(x=0.7, y=PI[1,1]), col="#1A425C", size=2) +
  geom_errorbar(limits_CI, width=0.1, col="#8AB63F") +
  geom_point(aes(x=1.3, y=CI[1,1]), col="#8AB63F", size=2) +
  scale_x_continuous(limits=c(0,2)) +
  scale_y_continuous(limits=c(95,190)) +
  theme_bw() +
  ylab("Hemoglobin concentration (g/L)") +
  xlab(NULL) +
  geom_text(aes(x=0.6, y=160, label="Prediction\ninterval", hjust="right", cex=2), col="#1A425C") +
  geom_text(aes(x=1.4, y=140, label="Confidence\ninterval", hjust="left", cex=2), col="#8AB63F") +
  theme(legend.position="none", axis.text.x = element_blank(), axis.ticks.x = element_blank())
PI_CI

    Gives this plot:

    The width of the confidence interval is very small, now that we have this large sample size (n=400). This is not surprising, as the estimated mean is the only source of uncertainty. In contrast, the width of the prediction interval is still substantial. The prediction interval has two sources of uncertainty: the estimated mean (just like the confidence interval) and the random variance of new observations.
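To make those two sources explicit, we can reproduce both intervals by hand from the usual formulas (using the simulated df from above); the CI half-width scales with s/sqrt(n), while the PI half-width scales with s*sqrt(1 + 1/n):

n <- nrow(df)
m <- mean(df$hemoglobin)
s <- sd(df$hemoglobin)
t_crit <- qt(0.975, df = n - 1)

# Reproduces the confidence interval from predict(..., interval = "confidence")
m + c(-1, 1) * t_crit * s / sqrt(n)

# Reproduces the prediction interval from predict(..., interval = "predict")
m + c(-1, 1) * t_crit * s * sqrt(1 + 1/n)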

    Example: comparing a new with a reference measurement method.

    A prediction interval can be useful in the case where a new method should replace a standard (or reference) method.
If we can predict well enough what the measurement by the reference method would be (given the new method), then the two methods give similar information and the new method can be used.

For example, in (Tian, 2017) a new spectral method (near-infrared) for measuring hemoglobin is compared with a gold standard. In contrast with the gold standard method, the new spectral method does not require reagents. Moreover, the new method is faster. We will investigate whether we can predict well enough, based on the concentration measured by the new method, what the measurement by the gold standard would be. (Note: the measured concentrations presented below are fictitious.)

Hb <- read.table("http://rforbiostatistics.colmanstatistics.be/wp-content/uploads/2018/06/Hb.txt", header = TRUE)
kable(head(Hb))

      New Reference
 84.96576  87.24013
 99.91483 103.38143
111.79984 116.71593
116.95961 116.72065
118.11140 113.51541
118.21411 121.70586

plot(Hb$New, Hb$Reference,
     xlab="Hemoglobin concentration (g/L) - new method",
     ylab="Hemoglobin concentration (g/L) - reference method")

    Gives this plot:

    Prediction Interval based on linear regression

    Now, let's fit a linear regression model predicting the hemoglobin concentrations measured by the reference method, based on the concentrations measured with the new method.

fit.lm <- lm(Hb$Reference ~ Hb$New)
plot(Hb$New, Hb$Reference,
     xlab="Hemoglobin concentration (g/L) - new method",
     ylab="Hemoglobin concentration (g/L) - reference method")
cat("Adding the regression line:")

    Adding the regression line:

abline(a=fit.lm$coefficients[1], b=fit.lm$coefficients[2])
cat("Adding the identity line:")

    Adding the identity line:

    abline (a=0, b=1, lty=2)

The plot:

If both measurement methods corresponded exactly, the intercept would be zero and the slope would be one (the “identity line”, dotted line). Now, let's calculate the confidence interval for this linear regression.

CI_ex <- predict(fit.lm, interval="confidence")
colnames(CI_ex) <- c("fit_CI", "lwr_CI", "upr_CI")

    And the prediction interval:

PI_ex <- predict(fit.lm, interval="prediction")
## Warning in predict.lm(fit.lm, interval = "prediction"): predictions on current data refer to _future_ responses
colnames(PI_ex) <- c("fit_PI", "lwr_PI", "upr_PI")

    We can combine those results in one data frame and plot both the confidence interval and the prediction interval.

Hb_results <- cbind(Hb, CI_ex, PI_ex)
kable(head(round(Hb_results), 1))

New Reference fit_CI lwr_CI upr_CI fit_PI lwr_PI upr_PI
 85        87     91     87     94     91     82     99

    Visualizing the CI (in green) and the PI (in blue):

plot(Hb$New, Hb$Reference,
     xlab="Hemoglobin concentration (g/L) - new method",
     ylab="Hemoglobin concentration (g/L) - reference method")
Hb_results_s <- Hb_results[order(Hb_results$New),]
lines(x=Hb_results_s$New, y=Hb_results_s$fit_CI)
lines(x=Hb_results_s$New, y=Hb_results_s$lwr_CI, col="#8AB63F", lwd=1.2)
lines(x=Hb_results_s$New, y=Hb_results_s$upr_CI, col="#8AB63F", lwd=1.2)
lines(x=Hb_results_s$New, y=Hb_results_s$lwr_PI, col="#1A425C", lwd=1.2)
lines(x=Hb_results_s$New, y=Hb_results_s$upr_PI, col="#1A425C", lwd=1.2)
abline(a=0, b=1, lty=2)

    Gives this plot:

In (Bland, Altman 2003) it is proposed to calculate the average width of this prediction interval and see whether it is acceptable. Another approach is to compare the calculated PI with an “acceptance interval”: if the PI is inside the acceptance interval for the measurement range of interest, the two methods can be considered interchangeable (see Francq, 2016).

    In the above example, both methods do have the same measurement scale (g/L), but the linear regression with prediction interval is particularly useful when the two methods of measurement have different units.

    However, the method has some disadvantages:

• Prediction intervals are very sensitive to deviations from the normal distribution.
    • In “standard” linear regression (or Ordinary Least Squares (OLS) regression), the presence of measurement error is allowed for the Y-variable (here, the reference method) but not for the X-variable (the new method). The absence of errors on the x-axis is one of the assumptions. Since we can expect some measurement error for the new method, this assumption is violated here.
    Taking into account errors on both axes

    In contrast to Ordinary Least Square (OLS) regression, Bivariate Least Square (BLS) regression takes into account the measurement errors of both methods (the New method and the Reference method). Interestingly, prediction intervals calculated with BLS are not affected when the axes are switched (del Rio, 2001).

    In 2017, a new R-package BivRegBLS was released. It offers several methods to assess the agreement in method comparison studies, including Bivariate Least Square (BLS) regression.

If the data are unreplicated but the variance of the measurement error of each method is known, the BLS() and XY.plot() functions can be used to fit a bivariate least squares regression line and the corresponding confidence and prediction intervals.

library(BivRegBLS)
Hb.BLS = BLS(data = Hb, xcol = c("New"), ycol = c("Reference"),
             var.y=10, var.x=8, conf.level=0.95)
XY.plot(Hb.BLS,
        yname = "Hemoglobin concentration (g/L) - reference method",
        xname = "Hemoglobin concentration (g/L) - new method",
        graph.int = c("CI","PI"))

    Gives this plot:

Now we would like to decide whether the new method can replace the reference method. We allow the methods to differ up to a given threshold, which is not clinically relevant. Based on this threshold an “acceptance interval” is created. Suppose that differences up to 10 g/L (= threshold) are not clinically relevant; then the acceptance interval can be defined as Y = X ± Δ, with Δ equal to 10. If the PI is inside the acceptance interval for the measurement range of interest then the two measurement methods can be considered to be interchangeable (see Francq, 2016).

The accept.int argument of the XY.plot() function allows for a visualization of this acceptance interval.

XY.plot(Hb.BLS,
        yname = "Hemoglobin concentration (g/L) - reference method",
        xname = "Hemoglobin concentration (g/L) - new method",
        graph.int = c("CI","PI"),
        accept.int = 10)

Gives this plot:

    For the measurement region 120g/L to 150 g/L, we can conclude that the difference between both methods is acceptable. If the measurement regions below 120g/l and above 150g/L are important, the new method cannot replace the reference method.

    Regression on replicated data

    In method comparison studies, it is advised to create replicates (2 or more measurements of the same sample with the same method). An example of such a dataset:

Hb_rep <- read.table("http://rforbiostatistics.colmanstatistics.be/wp-content/uploads/2018/06/Hb_rep.txt", header = TRUE)
kable(head(round(Hb_rep), 1))

New_rep1 New_rep2 Ref_rep1 Ref_rep2
      88       95       90       84

When replicates are available, the variances of the measurement errors are calculated for both the new and the reference method and used to estimate the prediction interval. Again, the BLS() function and the XY.plot() function are used to estimate and plot the BLS regression line, the corresponding CI, and the PI.

Hb_rep.BLS = BLS(data = Hb_rep,
                 xcol = c("New_rep1", "New_rep2"),
                 ycol = c("Ref_rep1", "Ref_rep2"),
                 qx = 1, qy = 1,
                 conf.level=0.95, pred.level=0.95)
XY.plot(Hb_rep.BLS,
        yname = "Hemoglobin concentration (g/L) - reference method",
        xname = "Hemoglobin concentration (g/L) - new method",
        graph.int = c("CI","PI"),
        accept.int = 10)

Gives this plot:

    It is clear that the prediction interval is not inside the acceptance interval here. The new method cannot replace the reference method. A possible solution is to average the repeats. The BivRegBLS package can create prediction intervals for the mean of (2 or more) future values, too! More information in this presentation (presented at useR!2017).

    In the plot above, averages of the two replicates are calculated and plotted. I'd like to see the individual measurements:

plot(x=c(Hb_rep$New_rep1, Hb_rep$New_rep2),
     y=c(Hb_rep$Ref_rep1, Hb_rep$Ref_rep2),
     xlab="Hemoglobin concentration (g/L) - new method",
     ylab="Hemoglobin concentration (g/L) - reference method")
lines(x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,2]), lwd=2)
lines(x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,3]), col="#8AB63F", lwd=2)
lines(x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,4]), col="#8AB63F", lwd=2)
lines(x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,5]), col="#1A425C", lwd=2)
lines(x=as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y=as.numeric(Hb_rep.BLS$Pred.BLS[,6]), col="#1A425C", lwd=2)
abline(a=0, b=1, lty=2)

    Gives this plot:

    Remarks
• Although not appropriate in the context of method comparison studies, Pearson correlation is still frequently used. See Bland & Altman (2003) for an explanation of why correlations are not advised.
    • Methods presented in this blogpost are not applicable to time-series
    References
    • Confidence interval and prediction interval: Applied Linear Statistical Models, 2005, Michael Kutner, Christopher Nachtsheim, John Neter, William Li. Section 2.5
    • Prediction interval for method comparison:
      Bland, J. M. and Altman, D. G. (2003), Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol, 22: 85-93. doi:10.1002/uog.12
      section: “Appropriate use of regression”.
    • Francq, B. G., and Govaerts, B. (2016) How to regress and predict in a Bland-Altman plot? Review and contribution based on tolerance intervals and correlated-errors-in-variables models. Statist. Med., 35: 2328-2358. doi: 10.1002/sim.6872.
    • del Río, F. J., Riu, J. and Rius, F. X. (2001), Prediction intervals in linear regression taking into account errors on both axes. J. Chemometrics, 15: 773-788. doi:10.1002/cem.663
    • Example of a method comparison study: H. Tian, M. Li, Y. Wang, D. Sheng, J. Liu, and L. Zhang, “Optical wavelength selection for portable hemoglobin determination by near-infrared spectroscopy method,” Infrared Phys. Techn 86, 98-102 (2017). doi.org/10.1016/j.infrared.2017.09.004.
    • the predict() and lm() functions of R: Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Wadsworth & Brooks/Cole.

      Related Post

      1. Six Sigma DMAIC Series in R – Part 2
      2. Six Sigma DMAIC Series in R – Part 1
      3. Implementation and Interpretation of Control Charts in R
      4. Time series model of forecasting future power demand
      5. Tennis Grand Slam Tournaments Champions Basic Analysis

      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

      Interpreting machine learning models with the lime package for R

      Fri, 06/15/2018 - 21:53

      (This article was first published on Revolutions, and kindly contributed to R-bloggers)

      Many types of machine learning classifiers, not least commonly-used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for any given case, it's difficult to answer the question, "why that label, and not one of the others?".

One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it's impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a global level, it might be possible to assess which variables are most influential on the classification at a local level, near the neighborhood of a particular data point. A procedure for doing so is described in this 2016 paper by Ribeiro et al, and implemented in the R package lime by Thomas Lin Pedersen and Michael Benesty (a port of the Python package of the same name).

      You can read about how the lime package works in the introductory vignette Understanding Lime, but this limerick by Mara Averick also sums things up nicely:

      There once was a package called lime,
      Whose models were simply sublime,
      It gave explanations for their variations,
      One observation at a time.

      "One observation at a time" is the key there: given a prediction (or a collection of predictions) it will determine the variables that most support (or contradict) the predicted classification.

      The lime package also works with text data: for example, you may have a model that classifies a paragraph of text as having sentiment "negative", "neutral" or "positive". In that case, lime will determine the words in that text which are most important in determining (or contradicting) the classification. The package helpfully also provides a shiny app making it easy to test out different sentences and see the local effect of the model.

      To learn more about the lime algorithm and how to use the associated R package, a great place to get started is the tutorial Visualizing ML Models with LIME from the University of Cincinnati Business Analytics R Programming Guide. The lime package is available on CRAN now, and you can always find the latest version at the GitHub repository linked below.

      GitHub (thomasp): lime (Local Interpretable Model-Agnostic Explanations)


      R Tip: Be Wary of “…”

      Fri, 06/15/2018 - 18:31

      (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

      R Tip: be wary of "...".

      The following code example contains an easy error in using the R function unique().

      vec1 <- c("a", "b", "c")
      vec2 <- c("c", "d")
      unique(vec1, vec2)
      # [1] "a" "b" "c"

      Notice that none of the novel values from vec2 are present in the result. Our mistake was that we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice that no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was R's "..." function signature feature.

      In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly I feel the reason I see more mistakes is I check a lot more.

      Data science coding is often done in a rush (deadlines, wanting to see results, and so on). Instead of moving fast, let’s take the time to think a bit about programming theory using a very small concrete issue. This lets us show how one can work a bit safer (saving time in the end), without sacrificing notational power or legibility.

      A confounding issue is that unique() failed to alert me of my mistake because unique()'s function signature (like that of so many R functions) includes a "..." argument. I might have been careless or in a rush, but it seems like unique() was also in a rush and did not care to supply argument inspection.
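
      One quick way to see this for yourself (a standard base-R check, not from the original post) is to inspect the generic's signature:

      # The "..." at the end is what silently absorbed vec2
      args(unique)
      # function (x, incomparables = FALSE, ...)
      # NULL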

      In R a function that includes a “...” in its signature will accept arbitrary arguments and values in addition to the ones explicitly declared and documented. There are three primary uses of “...” in R: accepting unknown arguments that are to be passed to later functions, building variadic functions, and forcing later arguments to be bound by name (my favorite use). Unfortunately, “...” is also often used to avoid documenting arguments and turns off a lot of very useful error checking.
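
      The third use deserves a quick illustration (a made-up helper of my own, not from the original post): placing "..." before an optional argument forces callers to name it, so a stray positional value cannot silently land in the wrong slot.

      # Hypothetical helper: "center" can only be set by name because it follows "..."
      center_values <- function(x, ..., center = TRUE) {
        if (center) x <- x - mean(x)
        x
      }

      center_values(c(1, 2, 3), center = FALSE)
      # [1] 1 2 3
      center_values(c(1, 2, 3), FALSE)
      # [1] -1  0  1   (FALSE was swallowed by "...", so center kept its default)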

      An example of the “accepting unknown arguments” use is lapply(). lapply() passes what it finds in “...” to whatever function it is working with. For example:

      lapply(c("a", "b"), paste, "!", sep = "")
      # [[1]]
      # [1] "a!"
      #
      # [[2]]
      # [1] "b!"

      Notice the arguments "!" and sep = "" were passed on to paste(). Since lapply() cannot know ahead of time what function the user will use, it relies on the "..." abstraction to forward arguments. Personally I never use this form and tend to write the somewhat more explicit and verbose style shown below.

      lapply(c("a", "b"), function(vi) {
        paste(vi, "!", sep = "")
      })

      I feel this form is more readable, as the arguments are seen where they are actually used. (Note: this is a notional example; in practice we would use paste0(c("a", "b"), "!") to directly produce the result as a vector.)

      An example of using “...” to supply a variadic interface is paste() itself.

      paste("a", "b", "c") # [1] "a b c"

      Other important examples include list() and c(). In fact I like list() and c() as they take only a "..." and no other arguments. Being variadic is so important to list() and c() that it is essentially all they do. One can often separate out the variadic intent with lists or vectors, as in:

      paste(c("a", "b", "c"), collapse = " ")
      # [1] "a b c"

      Even I don't write code such as the above (that is too long even for me), unless the values are coming from somewhere else (such as a variable). However, with wrapr's reduce/expand operator we can completely separate the collection of variadic arguments from their application. The notation looks like the following:

      library("wrapr")
      values <- c("a", "b", "c")
      values %.|% paste
      # [1] "a b c"

      Essentially, reduce/expand calls variadic functions with items taken from a vector or list as individual arguments (allowing one to program easily over variadic functions). %.|% is intended to place values in the "..." slot of a function (the variadic term). It is designed for a more perfect world where, when a function declares "..." in its signature, that is the only user-facing part of the signature. This is hardly ever actually the case in R, as common functions such as paste() and sum() have additional optional named arguments (which we are here leaving at their default values), whereas c() and list() are pure (take only "...").

      With a few non-standard (name capturing) and variadic value constructors one does not in fact need other functions to be name capturing or variadic. With such tools one can have these conveniences everywhere. For example we can convert our incorrect use of unique() into correct code using c().

      unique(c(vec1, vec2))
      # [1] "a" "b" "c" "d"

      In the above code roles are kept separate: c() is collecting values and unique() is applying a calculation. We don't need a powerful "super unique" or "super union" function; unique() is good enough if we remember to use c().

      In the spirit of our earlier article on function argument names, we have defined a convenience function wrapr::uniques() that enforces the use of value-carrying arguments. With wrapr::uniques(), if one attempts the mistake I have been discussing, one immediately gets a signaling error (instead of propagating incorrect calculations forward).

      library("wrapr")
      uniques(c(vec1, vec2))
      # [1] "a" "b" "c" "d"
      uniques(vec1, vec2)
      # Error: wrapr::uniques unexpected arguments: vec2

      With uniques() we either get the right answer, or we get immediately stopped at the mistaken step. This is a good way to work.


