R-bloggers
R news and tutorials contributed by 750 R bloggers

Easy time-series prediction with R: a tutorial with air traffic data from Lux Airport

Wed, 11/14/2018 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


In this blog post, I will show you how you can quickly and easily forecast a univariate time series.
I am going to use data from the EU Open Data Portal on air passenger transport. You can find the
data here. I downloaded
the data in the TSV format for Luxembourg Airport, but you could repeat the analysis for any airport.

Once you have the data, load some of the packages we are going to need:

library(tidyverse)
library(lubridate)
library(forecast)
library(tsibble)
library(brotools)

and define the following function:

ihs <- function(x){
    log(x + sqrt(x**2 + 1))
}

This function, the inverse hyperbolic sine, transforms data in a manner that is very
close to taking the log, but it is defined at 0. The data from Eurostat is incomplete for some reason,
so some observations are 0. To avoid having to log 0, which in R yields -Inf, I use this
transformation instead.
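As a quick sanity check, you can compare ihs() with log() in the console; the values in the comments below are approximate:

log(0)          # -Inf
ihs(0)          # 0
# for large x, ihs(x) is approximately log(2 * x) = log(x) + log(2)
ihs(1000)       # ~7.601
log(2 * 1000)   # ~7.601
# the transformation is invertible: sinh() undoes ihs()
sinh(ihs(150))  # 150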

Now, let’s load the data:

avia <- read_tsv("avia_par_lu.tsv")

## Parsed with column specification:
## cols(
##   .default = col_character()
## )
## See spec(...) for full column specifications.

Let’s take a look at the data:

head(avia)

## # A tibble: 6 x 238
##   `unit,tra_meas,… `2018Q1` `2018M03` `2018M02` `2018M01` `2017Q4` `2017Q3`
##   <chr>            <chr>    <chr>     <chr>     <chr>     <chr>    <chr>
## 1 FLIGHT,CAF_PAS,… 511      172       161       178       502      475
## 2 FLIGHT,CAF_PAS,… :        :         :         :         :        :
## 3 FLIGHT,CAF_PAS,… :        :         :         :         399      306
## 4 FLIGHT,CAF_PAS,… 485      167       151       167       493      497
## 5 FLIGHT,CAF_PAS,… 834      293       267       274       790      728
## 6 FLIGHT,CAF_PAS,… :        :         :         :         :        :
## # ... with 231 more variables: `2017Q2` <chr>, `2017Q1` <chr>,
## #   `2017M12` <chr>, `2017M11` <chr>, `2017M10` <chr>, `2017M09` <chr>,
## #   `2017M08` <chr>, `2017M07` <chr>, `2017M06` <chr>, `2017M05` <chr>, …

So yeah, useless in that state. The first column is actually composed of 3 columns merged together,
and instead of having one column with the date and another with the value, we have one column
per date. Some cleaning is necessary before using this data.

Let’s start with going from a wide to a long data set:

avia %>%
  select("unit,tra_meas,airp_pr\\time", contains("20")) %>%
  gather(date, passengers, -`unit,tra_meas,airp_pr\\time`)

The select() line keeps the first column plus the columns whose names contain the string “20”, so
the columns from 2000 onward. Then, using gather(), I go from wide to long. The data now looks
like this:

## # A tibble: 117,070 x 3
##    `unit,tra_meas,airp_pr\\time`  date   passengers
##    <chr>                          <chr>  <chr>
##  1 FLIGHT,CAF_PAS,LU_ELLX_AT_LOWW 2018Q1 511
##  2 FLIGHT,CAF_PAS,LU_ELLX_BE_EBBR 2018Q1 :
##  3 FLIGHT,CAF_PAS,LU_ELLX_CH_LSGG 2018Q1 :
##  4 FLIGHT,CAF_PAS,LU_ELLX_CH_LSZH 2018Q1 485
##  5 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDF 2018Q1 834
##  6 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDI 2018Q1 :
##  7 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDM 2018Q1 1095
##  8 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDR 2018Q1 :
##  9 FLIGHT,CAF_PAS,LU_ELLX_DE_EDDT 2018Q1 :
## 10 FLIGHT,CAF_PAS,LU_ELLX_DK_EKCH 2018Q1 :
## # ... with 117,060 more rows

Now, let’s separate the first column into 3 columns:

avia %>%
  select("unit,tra_meas,airp_pr\\time", contains("20")) %>%
  gather(date, passengers, -`unit,tra_meas,airp_pr\\time`) %>%
  separate(col = `unit,tra_meas,airp_pr\\time`,
           into = c("unit", "tra_meas", "air_pr\\time"), sep = ",")

This separates the first column into 3 new columns: “unit”, “tra_meas” and “air_pr\time”. This step
is not necessary for the rest of the analysis, but we might as well do it. The data now looks like this:

## # A tibble: 117,070 x 5
##    unit   tra_meas `air_pr\\time`  date   passengers
##    <chr>  <chr>    <chr>           <chr>  <chr>
##  1 FLIGHT CAF_PAS  LU_ELLX_AT_LOWW 2018Q1 511
##  2 FLIGHT CAF_PAS  LU_ELLX_BE_EBBR 2018Q1 :
##  3 FLIGHT CAF_PAS  LU_ELLX_CH_LSGG 2018Q1 :
##  4 FLIGHT CAF_PAS  LU_ELLX_CH_LSZH 2018Q1 485
##  5 FLIGHT CAF_PAS  LU_ELLX_DE_EDDF 2018Q1 834
##  6 FLIGHT CAF_PAS  LU_ELLX_DE_EDDI 2018Q1 :
##  7 FLIGHT CAF_PAS  LU_ELLX_DE_EDDM 2018Q1 1095
##  8 FLIGHT CAF_PAS  LU_ELLX_DE_EDDR 2018Q1 :
##  9 FLIGHT CAF_PAS  LU_ELLX_DE_EDDT 2018Q1 :
## 10 FLIGHT CAF_PAS  LU_ELLX_DK_EKCH 2018Q1 :
## # ... with 117,060 more rows

The next steps are simple renamings. I copy-pasted the information from the Eurostat page
where you can view the data online. On that page you can select the variables you want displayed
in the table, as well as the dictionary of the variables, which I used to recode the variables.
You can take a look at the whole cleaning workflow by clicking “Click to expand” below:

Click here to take a look at the whole cleaning workflow

avia_clean <- avia %>%
  select("unit,tra_meas,airp_pr\\time", contains("20")) %>%
  gather(date, passengers, -`unit,tra_meas,airp_pr\\time`) %>%
  separate(col = `unit,tra_meas,airp_pr\\time`,
           into = c("unit", "tra_meas", "air_pr\\time"), sep = ",") %>%
  mutate(tra_meas = fct_recode(tra_meas,
         `Passengers on board` = "PAS_BRD",
         `Passengers on board (arrivals)` = "PAS_BRD_ARR",
         `Passengers on board (departures)` = "PAS_BRD_DEP",
         `Passengers carried` = "PAS_CRD",
         `Passengers carried (arrival)` = "PAS_CRD_ARR",
         `Passengers carried (departures)` = "PAS_CRD_DEP",
         `Passengers seats available` = "ST_PAS",
         `Passengers seats available (arrivals)` = "ST_PAS_ARR",
         `Passengers seats available (departures)` = "ST_PAS_DEP",
         `Commercial passenger air flights` = "CAF_PAS",
         `Commercial passenger air flights (arrivals)` = "CAF_PAS_ARR",
         `Commercial passenger air flights (departures)` = "CAF_PAS_DEP")) %>%
  mutate(unit = fct_recode(unit,
         Passenger = "PAS",
         Flight = "FLIGHT",
         `Seats and berths` = "SEAT")) %>%
  mutate(destination = fct_recode(`air_pr\\time`,
         `WIEN-SCHWECHAT` = "LU_ELLX_AT_LOWW",
         `BRUSSELS` = "LU_ELLX_BE_EBBR",
         `GENEVA` = "LU_ELLX_CH_LSGG",
         `ZURICH` = "LU_ELLX_CH_LSZH",
         `FRANKFURT/MAIN` = "LU_ELLX_DE_EDDF",
         `HAMBURG` = "LU_ELLX_DE_EDDH",
         `BERLIN-TEMPELHOF` = "LU_ELLX_DE_EDDI",
         `MUENCHEN` = "LU_ELLX_DE_EDDM",
         `SAARBRUECKEN` = "LU_ELLX_DE_EDDR",
         `BERLIN-TEGEL` = "LU_ELLX_DE_EDDT",
         `KOBENHAVN/KASTRUP` = "LU_ELLX_DK_EKCH",
         `HURGHADA / INTL` = "LU_ELLX_EG_HEGN",
         `IRAKLION/NIKOS KAZANTZAKIS` = "LU_ELLX_EL_LGIR",
         `FUERTEVENTURA` = "LU_ELLX_ES_GCFV",
         `GRAN CANARIA` = "LU_ELLX_ES_GCLP",
         `LANZAROTE` = "LU_ELLX_ES_GCRR",
         `TENERIFE SUR/REINA SOFIA` = "LU_ELLX_ES_GCTS",
         `BARCELONA/EL PRAT` = "LU_ELLX_ES_LEBL",
         `ADOLFO SUAREZ MADRID-BARAJAS` = "LU_ELLX_ES_LEMD",
         `MALAGA/COSTA DEL SOL` = "LU_ELLX_ES_LEMG",
         `PALMA DE MALLORCA` = "LU_ELLX_ES_LEPA",
         `SYSTEM - PARIS` = "LU_ELLX_FR_LF90",
         `NICE-COTE D'AZUR` = "LU_ELLX_FR_LFMN",
         `PARIS-CHARLES DE GAULLE` = "LU_ELLX_FR_LFPG",
         `STRASBOURG-ENTZHEIM` = "LU_ELLX_FR_LFST",
         `KEFLAVIK` = "LU_ELLX_IS_BIKF",
         `MILANO/MALPENSA` = "LU_ELLX_IT_LIMC",
         `BERGAMO/ORIO AL SERIO` = "LU_ELLX_IT_LIME",
         `ROMA/FIUMICINO` = "LU_ELLX_IT_LIRF",
         `AGADIR/AL MASSIRA` = "LU_ELLX_MA_GMAD",
         `AMSTERDAM/SCHIPHOL` = "LU_ELLX_NL_EHAM",
         `WARSZAWA/CHOPINA` = "LU_ELLX_PL_EPWA",
         `PORTO` = "LU_ELLX_PT_LPPR",
         `LISBOA` = "LU_ELLX_PT_LPPT",
         `STOCKHOLM/ARLANDA` = "LU_ELLX_SE_ESSA",
         `MONASTIR/HABIB BOURGUIBA` = "LU_ELLX_TN_DTMB",
         `ENFIDHA-HAMMAMET INTERNATIONAL` = "LU_ELLX_TN_DTNH",
         `ENFIDHA ZINE EL ABIDINE BEN ALI` = "LU_ELLX_TN_DTNZ",
         `DJERBA/ZARZIS` = "LU_ELLX_TN_DTTJ",
         `ANTALYA (MIL-CIV)` = "LU_ELLX_TR_LTAI",
         `ISTANBUL/ATATURK` = "LU_ELLX_TR_LTBA",
         `SYSTEM - LONDON` = "LU_ELLX_UK_EG90",
         `MANCHESTER` = "LU_ELLX_UK_EGCC",
         `LONDON GATWICK` = "LU_ELLX_UK_EGKK",
         `LONDON/CITY` = "LU_ELLX_UK_EGLC",
         `LONDON HEATHROW` = "LU_ELLX_UK_EGLL",
         `LONDON STANSTED` = "LU_ELLX_UK_EGSS",
         `NEWARK LIBERTY INTERNATIONAL, NJ.` = "LU_ELLX_US_KEWR",
         `O.R TAMBO INTERNATIONAL` = "LU_ELLX_ZA_FAJS")) %>%
  mutate(passengers = as.numeric(passengers)) %>%
  select(unit, tra_meas, destination, date, passengers)

## Warning in evalq(as.numeric(passengers), <environment>): NAs introduced by
## coercion

There is quarterly data and monthly data. Let’s separate the two:

avia_clean_quarterly <- avia_clean %>%
  filter(tra_meas == "Passengers on board (arrivals)",
         !is.na(passengers)) %>%
  filter(str_detect(date, "Q")) %>%
  mutate(date = yq(date))

In the “date” column, I detect the observations with “Q” in their name, indicating quarterly data.
I do the same for monthly data, but I have to append the string “01” to the dates. This transforms
a date that looks like “2018M03” into “2018M0301”, which can then be converted into a proper
date using the ymd() function from lubridate. yq() was used for the quarterly data.
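To see how these two helpers parse the strings, you can run a quick check in the console; the dates shown in the comments are what I would expect:

yq("2018Q1")                 # "2018-01-01", the first day of the quarter
ymd(paste0("2018M03", "01")) # "2018-03-01"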

avia_clean_monthly <- avia_clean %>%
  filter(tra_meas == "Passengers on board (arrivals)",
         !is.na(passengers)) %>%
  filter(str_detect(date, "M")) %>%
  mutate(date = paste0(date, "01")) %>%
  mutate(date = ymd(date)) %>%
  select(destination, date, passengers)

Time for some plots. Let’s start with the raw data:

avia_clean_monthly %>%
  group_by(date) %>%
  summarise(total = sum(passengers)) %>%
  ggplot() +
  ggtitle("Raw data") +
  geom_line(aes(y = total, x = date), colour = "#82518c") +
  scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
  theme_blog()

And now with the logged data (or rather, the data transformed using the inverse hyperbolic sine
transformation):

avia_clean_monthly %>%
  group_by(date) %>%
  summarise(total = sum(passengers)) %>%
  mutate(total_ihs = ihs(total)) %>%
  ggplot() +
  ggtitle("Logged data") +
  geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
  scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
  theme_blog()

We clearly see a seasonal pattern in the data. There is also an upward trend. We will have to deal
with these two problems if we want to do some forecasting. For this, let’s limit ourselves to data
from before 2015, and convert the “passengers” column from the data to a time series object, using
the ts() function:

avia_clean_train <- avia_clean_monthly %>%
  select(date, passengers) %>%
  filter(year(date) < 2015) %>%
  group_by(date) %>%
  summarise(total_passengers = sum(passengers)) %>%
  pull(total_passengers) %>%
  ts(., frequency = 12, start = c(2005, 1))
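A few quick checks confirm that the resulting object is a monthly time series starting in January 2005 (a small sketch; the exact end point depends on the filter above):

frequency(avia_clean_train) # 12, i.e. monthly data
start(avia_clean_train)     # 2005 1
end(avia_clean_train)       # should be 2014 12, since we kept only pre-2015 data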

We will try to pseudo-forecast the data from 2015 to the last point available, March 2018.
First, let’s transform the data:

logged_data <- ihs(avia_clean_train)

Taking the log, or the ihs, of the data stabilizes the variance of the time series.

There might also be a need to difference the data. Computing the differences between consecutive
observations helps make the time series stationary. This will be taken care of by the auto.arima()
function, if needed. The auto.arima() function returns the best ARIMA model according to
statistical criteria such as the AIC, AICc or BIC.
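If you want to check by hand how much differencing would be needed, the {forecast} package provides helpers for this; auto.arima() does the equivalent internally, so the step below is purely informative:

ndiffs(logged_data)  # estimated number of first differences needed
nsdiffs(logged_data) # estimated number of seasonal differences needed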

(model_fit <- auto.arima(logged_data))

## Series: logged_data
## ARIMA(2,1,1)(2,1,0)[12]
##
## Coefficients:
##           ar1      ar2      ma1     sar1     sar2
##       -0.4061  -0.2431  -0.3562  -0.5590  -0.3282
## s.e.   0.2003   0.1432   0.1994   0.0911   0.0871
##
## sigma^2 estimated as 0.004503:  log likelihood=137.11
## AIC=-262.21   AICc=-261.37   BIC=-246.17

auto.arima() found that the best model would be an \(ARIMA(2, 1, 1)(2, 1, 0)_{12}\). This is a
seasonal ARIMA model, with p = 2, d = 1, q = 1, P = 2, D = 1 and Q = 0.

model_forecast <- forecast(model_fit, h = 39)

I can now forecast the next 39 months, which corresponds to the period from January 2015 to March 2018 for which data is available.

To plot the forecast, one could do a simple call to the plot function. But the resulting plot
is not very aesthetic. To plot my own, I have to grab the data that was forecast, and do some
munging again:

point_estimate <- model_forecast$mean %>%
  as_tsibble() %>%
  rename(point_estimate = value,
         date = index)

upper <- model_forecast$upper %>%
  as_tsibble() %>%
  spread(key, value) %>%
  rename(date = index,
         upper80 = `80%`,
         upper95 = `95%`)

lower <- model_forecast$lower %>%
  as_tsibble() %>%
  spread(key, value) %>%
  rename(date = index,
         lower80 = `80%`,
         lower95 = `95%`)

estimated_data <- reduce(list(point_estimate, upper, lower), full_join, by = "date")

as_tsibble() is a function from the {tsibble} package that converts time-series-aware objects
into time-aware tibbles. If you are not familiar with {tsibble}, I urge you to run the above lines
one by one, and especially to compare as_tsibble() with the standard as_tibble() from the {tibble}
package.

This is how estimated_data looks:

head(estimated_data)

## # A tsibble: 6 x 6 [1M]
##       date point_estimate upper80 upper95 lower80 lower95
##      <mth>          <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 2015 Jan           11.9    12.0    12.1    11.8    11.8
## 2 2015 Feb           11.9    12.0    12.0    11.8    11.7
## 3 2015 Mar           12.1    12.2    12.3    12.0    12.0
## 4 2015 Apr           12.2    12.3    12.4    12.1    12.1
## 5 2015 May           12.3    12.4    12.4    12.2    12.1
## 6 2015 Jun           12.3    12.4    12.5    12.2    12.1

We can now plot the data, with the forecast, and with the 95% confidence interval:

avia_clean_monthly %>%
  group_by(date) %>%
  summarise(total = sum(passengers)) %>%
  mutate(total_ihs = ihs(total)) %>%
  ggplot() +
  ggtitle("Logged data") +
  geom_line(aes(y = total_ihs, x = date), colour = "#82518c") +
  scale_x_date(date_breaks = "1 year", date_labels = "%m-%Y") +
  geom_ribbon(data = estimated_data,
              aes(x = date, ymin = lower95, ymax = upper95),
              fill = "#666018", alpha = 0.2) +
  geom_line(data = estimated_data,
            aes(x = date, y = point_estimate),
            linetype = 2, colour = "#8e9d98") +
  theme_blog()

The pseudo-forecast (the dashed line) is not very far from the truth, only overestimating the
seasonal peaks, but the true line is within the 95% confidence interval, which is good!
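To put a number on how close the pseudo-forecast is, you could also compare it to the held-out observations with accuracy() from the {forecast} package. A rough sketch, assuming a test set built the same way as avia_clean_train but for 2015 onwards (avia_clean_test is not an object defined in this post):

avia_clean_test <- avia_clean_monthly %>%
  filter(year(date) >= 2015) %>%
  group_by(date) %>%
  summarise(total_passengers = sum(passengers)) %>%
  pull(total_passengers) %>%
  ts(., frequency = 12, start = c(2015, 1))

# compare the forecast to the ihs-transformed held-out data
accuracy(model_forecast, x = ihs(avia_clean_test))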

If you found this blog post useful, you might want to follow me on twitter for blog post updates.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

Building a Repository of Alpine-based Docker Images for R, Part II

Wed, 11/14/2018 - 01:00

(This article was first published on Curiosity Killed the Cat, and kindly contributed to R-bloggers)

In the first article of this series, I built an Alpine-based Docker image with R base packages from Alpine’s native repositories, as well as one image with R compiled from source code. The images are hosted on Docker Hub, velaco/alpine-r repository.

The next step was either to address the fatal errors I found while testing the installation of R or to proceed with building an image with Shiny Server. The logical choice would have been to pass all tests with R’s base packages before proceeding, but I was a bit impatient and wanted to go through the process of building Shiny Server as soon as possible. After two weeks of trial and error, I finally have a container that can start the server and run Shiny apps.

Notes: The Dockerfile I used is available in the velaco/alpine-r repository on GitHub. The latest image with Shiny Server is not available on Docker Hub because it exceeded the time limit for automated builds, so I’ll have to upload it manually.

Building the R Shiny Server Image

Shiny Server doesn’t have precompiled installers for Alpine Linux, so I had to build it from source. I used Building Shiny Server from Source as a reference to determine which dependencies I’d need to install and to write the first version of the installation and post-installation instructions.

Dependencies

I wasn’t sure which dependencies to put under BUILD and which to put under PERSISTENT, so I added them all to a single variable for now. The packages python2, gcc, g++, git, cmake, and R (base and devel) are the main system requirements for installing Shiny Server from source. Based on my review of other Dockerfiles for Shiny, including the one maintained by Rocker, I also added curl-dev, libffi, psmisc, rrdtool, cairo-dev, libxt-dev, and libxml2-dev to the list of dependencies.

A system-wide installation of the shiny package and its dependencies is also necessary, and the first problem I encountered was the installation of the httpuv package. In order to install the httpuv package, I had to install automake, autoconf, m4, and file as well. The package asked for automake version 1.15 specifically, so I had to include the command echo 'http://dl-cdn.alpinelinux.org/alpine/v3.6/main' >> /etc/apk/repositories before adding the system requirements because that branch contains automake v1.15.1.

The wget package is needed to retrieve Node.js, and linux-headers is needed to compile it from source. (The precompiled binaries didn’t work.)

Here is the final list of dependencies I included in the Dockerfile:

FROM velaco/alpine-r:base-3.5.0-r1
MAINTAINER Aleksandar Ratesic "aratesic@gmail.com"

# TODO: Separate to build and persistent!
ENV ALL_DEPS \
    automake==1.15.1-r0 \
    autoconf \
    bash \
    m4 \
    file \
    curl-dev \
    python2 \
    g++ \
    gcc \
    cmake \
    # Forgot to remove libc6-compat, no longer needed
    libc6-compat \
    git \
    libffi \
    psmisc \
    rrdtool \
    cairo-dev \
    libxt-dev \
    libxml2-dev \
    linux-headers \
    wget

Installation

I forked the Shiny Server git repository to make sure I have control over the changes made to it while working on this image. Apart from the installation of system dependencies and the shiny package, the installation instructions in the first version of the Dockerfile were pretty much an exact copy of the instructions provided by RStudio on GitHub:

RUN echo 'http://dl-cdn.alpinelinux.org/alpine/v3.6/main' >> /etc/apk/repositories && \
    apk upgrade --update && \
    apk add --no-cache $ALL_DEPS && \
    R -e "install.packages(c('shiny'), repos='http://cran.rstudio.com/')" && \
    git clone https://github.com/velaco/shiny-server.git && \
    cd shiny-server && mkdir tmp && cd tmp && \
    DIR=`pwd` && PATH=$DIR/../bin:$PATH && \
    cmake -DCMAKE_INSTALL_PREFIX=/usr/local ../ && \
    make && mkdir ../build && \
    (cd .. && ./bin/npm --python="/usr/bin/python" install) && \
    (cd .. && ./bin/node /usr/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js --python="/usr/bin/python" rebuild) && \
    make install

The first problem during the installation occurred because the command cmake -DCMAKE_INSTALL_PREFIX=/usr/local ../ installed only pandoc, so the command (cd .. && ./bin/npm --python="$PYTHON" install) couldn’t find the Node.js binary. I decided to build an image up to that point and run a container to interactively look for a solution from the shell.

I retrieved Node.js with the command (cd .. && ./external/node/install-node.sh), but the installation failed again. I ran the command file ../ext/node/bin/node and learned that the requested program interpreter is /lib64/ld-linux-x86-64.so.2. The libc6-compat package contains that file and other compatibility libraries for glibc, so I added it and tried to install Node.js again. However, it didn’t help. I ended up with this error message:

Error relocating ../ext/node/bin/node: __register_atfork: symbol not found

As an alternative to libc6-compat, I found the glibc package for Alpine on GitHub (link to repository) and ran the following commands to retrieve and install the package:

wget -q -O /etc/apk/keys/sgerrand.rsa.pub https://alpine-pkgs.sgerrand.com/sgerrand.rsa.pub
wget https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.28-r0/glibc-2.28-r0.apk
apk add glibc-2.28-r0.apk && rm glibc-2.28-r0.apk

I tried to run the script again, but the binary was looking for the libraries in the wrong path. That problem was easy to fix with patchelf:

patchelf --set-rpath /usr/lib:/usr/glibc-compat/lib:/lib /shiny-server/ext/node/bin/node
patchelf --add-needed /lib/libc.musl-x86_64.so.1 /shiny-server/ext/node/bin/node

Although the location of the libraries was no longer an issue, rebuilding the modules still failed:

/shiny-server/tmp # (cd .. && ./bin/node ./ext/node/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js --python="/usr/bin/python" rebuild)
/shiny-server/ext/node/bin/node.bin: /usr/lib/libstdc++.so.6: no version information available (required by /shiny-server/ext/node/bin/node.bin)
/shiny-server/ext/node/bin/node.bin: /usr/lib/libstdc++.so.6: no version information available (required by /shiny-server/ext/node/bin/node.bin)
/shiny-server/ext/node/bin/node.bin: /usr/lib/libstdc++.so.6: no version information available (required by /shiny-server/ext/node/bin/node.bin)
/shiny-server/ext/node/bin/node.bin: /usr/lib/libstdc++.so.6: no version information available (required by /shiny-server/ext/node/bin/node.bin)
/shiny-server/ext/node/bin/node.bin: /usr/lib/libstdc++.so.6: no version information available (required by /shiny-server/ext/node/bin/node.bin)
/shiny-server/ext/node/bin/node.bin: /usr/lib/libstdc++.so.6: no version information available (required by /shiny-server/ext/node/bin/node.bin)
Segmentation fault (core dumped)

I ran ldd from inside the container and got the same error as before, stating that __register_atfork was not found:

/shiny-server/tmp # ldd ../ext/node/bin/node
        /lib64/ld-linux-x86-64.so.2 (0x7f6478766000)
        libdl.so.2 => /lib64/ld-linux-x86-64.so.2 (0x7f6478766000)
        librt.so.1 => /lib64/ld-linux-x86-64.so.2 (0x7f6478766000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x7f6478413000)
        libm.so.6 => /lib64/ld-linux-x86-64.so.2 (0x7f6478766000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x7f6478201000)
        libpthread.so.0 => /lib64/ld-linux-x86-64.so.2 (0x7f6478766000)
        libc.so.6 => /lib64/ld-linux-x86-64.so.2 (0x7f6478766000)
Error relocating ../ext/node/bin/node: __register_atfork: symbol not found
Error relocating ../ext/node/bin/node: backtrace: symbol not found

The glibc repository has an open issue that reported the same problem (__register_atfork: symbol not found) in 2016, but I couldn’t find the solution to it anywhere.

I decided to change my approach, so I tried to install nodejs and npm packages from the native repositories. The only Node.js version available from Alpine’s 3.8 repository was 8.11.4. Shiny Server comes with a script that downloads version 8.11.3, so I hoped that the difference of one patch version wouldn’t be a problem.

With npm and nodejs packages installed from the native repository, I modified the instructions to rebuild the modules before installing Shiny Server, and the following instruction successfully created a layer of the image:

RUN echo 'http://dl-cdn.alpinelinux.org/alpine/v3.6/main' >> /etc/apk/repositories && \
    apk upgrade --update && \
    apk add --no-cache $ALL_DEPS && \
    R -e "install.packages(c('shiny'), repos='http://cran.rstudio.com/')" && \
    git clone https://github.com/velaco/shiny-server.git && \
    cd shiny-server && mkdir tmp && cd tmp && \
    cmake -DCMAKE_INSTALL_PREFIX=/usr/local ../ && \
    make && mkdir ../build && \
    (cd .. && npm --python="/usr/bin/python" --unsafe-perm install) && \
    (cd .. && node /usr/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js --python="/usr/bin/python" rebuild) && \
    make install

Post-installation

The post-installation is quite straightforward. You need to create a link to the shiny-server executable from /usr/bin, create the shiny user, and create the directories for hosting the apps, logs, configuration files, etc. I decided to leave out the logs and not to touch the permissions, because the script executed when the container starts takes care of that. This is what the post-installation part of the instructions looked like:

RUN ln -s /usr/local/shiny-server/bin/shiny-server /usr/bin/shiny-server && \
    addgroup -g 82 -S shiny; \
    adduser -u 82 -D -S -G shiny shiny && \
    mkdir -p /srv/shiny-server && \
    mkdir -p /var/lib/shiny-server && \
    mkdir -p /etc/shiny-server

COPY shiny-server.conf /etc/shiny-server/shiny-server.conf
COPY shiny-server.sh /usr/bin/shiny-server.sh

# Move sample apps to test installation, clean up
RUN chmod 744 /usr/bin/shiny-server.sh && \
    mv /shiny-server/samples/sample-apps /srv/shiny-server/ && \
    mv /shiny-server/samples/welcome.html /srv/shiny-server/ && \
    rm -rf /shiny-server/

EXPOSE 3838

CMD ["/usr/bin/shiny-server.sh"]

There were no problems with this part of the instructions, so the image was built successfully.

Testing the Image

With anticipation, I executed the command

docker run --rm -it velaco/alpine-r:shiny-3.5.0-r1

and… nothing happened. It was possible to run the container and start a shell, but as soon as I executed exec /usr/bin/shiny-server, it stopped working without giving me any messages.

I ran the container again, this time without security restrictions, and used strace on /usr/bin/shiny-server to find out at which point it exits. Here’s what I found:

execve("/usr/local/shiny-server/ext/node/bin/shiny-server", ["/usr/local/shiny-server/ext/node"..., "/usr/local/shiny-server/lib/main"...], 0x7ffce0ae1518 /* 11 vars */) = -1 ENOENT (No such file or directory)

Turns out that the script Shiny Server uses to install Node.js contains a few instructions I skipped during the installation:

cp ext/node/bin/node ext/node/bin/shiny-server
rm ext/node/bin/npm
(cd ext/node/lib/node_modules/npm && ./scripts/relocate.sh)

There was no way I could run those instructions with Node.js installed from native packages, so I decided to modify the installation instructions to compile it from source:

RUN echo 'http://dl-cdn.alpinelinux.org/alpine/v3.6/main' >> /etc/apk/repositories && \
    apk upgrade --update && \
    apk add --no-cache $ALL_DEPS && \
    R -e "install.packages(c('shiny'), repos='http://cran.rstudio.com/')" && \
    git clone https://github.com/velaco/shiny-server.git && \
    cd shiny-server && mkdir tmp && cd tmp && \
    DIR=`pwd` && PATH=$DIR/../bin:$PATH && \
    cmake -DCMAKE_INSTALL_PREFIX=/usr/local ../ && \
    wget https://nodejs.org/dist/v8.11.3/node-v8.11.3.tar.xz && \
    mkdir node_source && \
    tar xf node-v8.11.3.tar.xz --strip-components=1 -C "node_source" && \
    rm node-v8.11.3.tar.xz && \
    (cd node_source && ./configure --prefix=/shiny-server/ext/node) && \
    (cd node_source && make) && \
    (cd node_source && make install) && \
    cp /shiny-server/ext/node/bin/node /shiny-server/ext/node/bin/shiny-server && \
    rm /shiny-server/ext/node/bin/npm && \
    (cd ../ext/node/lib/node_modules/npm && ./scripts/relocate.sh) && \
    make && mkdir ../build && \
    (cd .. && ./bin/npm --python="/usr/bin/python" --unsafe-perm install) && \
    (cd .. && ./bin/node ./ext/node/lib/node_modules/npm/node_modules/node-gyp/bin/node-gyp.js --python="/usr/bin/python" rebuild) && \
    make install

Building the image this way took more than 2 hours, but when I started the container and got confirmation that the server was running, I started celebrating. I should have waited a bit longer to celebrate because when I checked the example app in my browser, I learned that I still had work to do:

Apparently, the command su in Alpine Linux doesn’t understand the --login option. The server is configured to run_as shiny and the container was running as root, so I hoped that was the only reason it had to execute su at some point.

When I added the USER shiny instruction to the Dockerfile, the su command was no longer necessary. However, the server still didn’t work because the container executes a script to create the log directory. User shiny didn’t have permission to create the log directory, so I created it from the Dockerfile and set shiny:shiny as the owner of pretty much everything. (At this point I just wanted to get it over with, so some changes were probably not necessary and there was probably a better solution.)

With a log directory and shiny as the default user, the server successfully started the example Shiny app:

Next Steps

The image I created runs Shiny Server on Alpine Linux, but I wouldn’t feel comfortable using it in production yet. Here are my goals for improving the current images:

  • Make sure the R installation passes all tests. It doesn’t have to pass every test without a single warning, but it should at least pass them all without fatal errors.
  • Remove build dependencies from the image. The virtual size of this image is currently just over 500 MB. It can be reduced by removing some dependencies that are not required for running the server and probably some other files that are not necessary for running apps (e.g., documentation files, test suites, etc.).
  • Compile Node.js from source in a base container. Node took forever to compile and install. The image alpine-r:shiny-3.5.0-r1 can’t be built automatically on Docker Hub because it exceeds the time limit allowed for automated builds. Maybe it would be a better idea to compile Node.js in a separate image and use that image as a base for building the one with Shiny Server.

For now I’ll take a short break from this project to clear my head and get back to it with a fresh pair of eyes. Anyone interested in contributing and improving it can find the Dockerfiles in the velaco/alpine-r repository on GitHub.


To leave a comment for the author, please follow the link and comment on their blog: Curiosity Killed the Cat.

AI for Good: slides and notebooks from the ODSC workshop

Tue, 11/13/2018 - 16:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Last week at the ODSC West conference, I was thrilled with the interest in my Using AI for Good workshop: it was wonderful to find a room full of data scientists eager to learn how data science and artificial intelligence can be used to help people and the planet. The workshop was focused around projects from the Microsoft AI for Good program. I've included some details about the projects below, and you can also check out the workshop slides and the accompanying  Jupyter Notebooks that demonstrate the underlying AI methods used in the projects. 

The projects were drawn from the AI for Earth, AI for Accessibility and AI for Humanitarian Action programs, respectively:

The iNaturalist app, which uses image recognition and species distribution models to help citizen scientists identify animal and plant species from photos, and submit their observations to the scientific community. The app is available for iOS and Android smartphones, and you can also try this web-based demo with a picture of an animal or plant. 

The Seeing AI app (available for iOS), which helps blind and vision-impaired people navigate the world around them. (This video, featuring creator Saqib Shaikh, demonstrates how the app works.) The app detects and describes people and transcribes signs and documents, and you can try out the underlying vision APIs using just a web browser. You can also use R to drive the vision APIs, as shown in this notebook demonstrating the "Describe Scene" feature in Seeing AI. 

The AirSim Project, which is used to simulate image data for vision system training using the 3-D Unreal Engine. One application is to train a drone to automatically perform search and rescue operations after a disaster, as described in this video. There, 3-D simulations provided training data for the Custom Vision service, which you can try out yourself with R and this Jupyter Notebook.

You can try out all of the aforementioned notebooks by cloning this Azure Notebooks library, or download them at the Github repository below and run them in your own Jupyter Notebooks instance.

Github (revodavid): AI for Good Workshop


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

Those “other” apply functions…

Tue, 11/13/2018 - 13:38

(This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers)

So you know lapply, sapply, and apply…but…what about rapply, vapply, or eapply? These are generally a little less known as far as the apply family of functions in R go, so this post will explore how they work.

rapply

Let’s start with rapply. This function has a couple of different purposes. One is to recursively apply a function to a list. We’ll get to that in a moment. The other use of rapply is to apply a function to only those elements in a list (or columns in a data frame) that belong to a specified class. For example, let’s say we have a data frame with a mix of categorical and numeric variables, but we want to evaluate a function only on the numeric variables.

Use rapply to apply a function to elements of a given class

Using the traditional iris dataset, we can run the one-liner below to get the mean of each numeric column. This works almost exactly like sapply, except we add an extra parameter, class, to specify we only want to apply our function to the numeric columns in iris.

# apply to only numeric variables
rapply(iris, mean, class = "numeric")
rapply(iris, max, class = "numeric")
rapply(iris, min, class = "numeric")

# or apply to only factor columns
rapply(iris, summary, class = "factor")

If you’re unsure of the class of a particular column in a data frame, df, just run the following to get the classes of each variable.

sapply(df, class)

The other purpose of rapply is to apply a function recursively to a list.

temp <- list(a = c(1, 2, 3),
             b = list(a = c(4, 5, 6), b = c(7, 8, 9)),
             c = list(a = c(10, 11, 12), b = c(13, 14, 15)))

rapply(temp, sum)

Running rapply here will recursively sum the elements of each vector in each list and sub-list of temp.
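With the default how = "unlist" behaviour, the result is a named vector whose names concatenate the nesting levels; it should look roughly like this:

rapply(temp, sum)
#   a b.a b.b c.a c.b
#   6  15  24  33  42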

rapply, similar to sapply, has an optional parameter how, which specifies how the output should be returned. So in the above example, if we wanted the output to be returned as a list instead of a vector, we could do this:

rapply(temp, sum, how = "list")

Also, if our nested list contained mixed data types, we could specify applying a function to only a specific type:

rapply(temp, sum, how = "list", class = "numeric")

vapply

vapply works similarly to sapply, except that it requires an extra parameter specifying the type of the expected return value. This extra parameter is useful in coding because it can help ensure silent errors don’t cause issues for you.

For example, suppose we have the following list of mixed data types:

sample_list <- list(10, 20, 30, "some_string")

Now if we run the below code with sapply, we get a vector of characters, which is not exactly what we want. For instance, if we didn’t know our list had mixed data types, and we ran our function to get the max of each element in the list, then R doesn’t return an error here. Instead, it silently converts the output to a character vector.

sapply(sample_list, max)

However, we can catch this error using vapply.

vapply(sample_list, max, numeric(1))

This third parameter, numeric(1), specifies that we want the output returned by vapply to be numeric. This means that if an issue occurs in returning a numeric result, vapply will throw an error, rather than trying to coerce the result to a different data type. This could allow you to investigate why a character is in the list, for example.
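The FUN.VALUE template is not limited to numeric(1); it can describe any type and length that each call is expected to return. A couple of small examples:

# each call must return a single character value
vapply(sample_list, class, character(1))

# each call must return a numeric vector of length 2
vapply(1:3, function(i) c(min = i, max = i^2), numeric(2))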

vapply’s ability to catch errors based off data types makes it useful for running R code in production or as a scheduled task as an extra measure to guard against type issues.

eapply

eapply applies a function to every named element in an environment. This requires a little knowledge about how environments work in R. It works pretty similarly to lapply in that it also returns a list as output. The input to eapply, however, must be an environment, whereas lapply can take a variety of objects as inputs.

Here’s an example:

# create a new environment
e <- new.env()

e$a <- 10
e$b <- 20
e$c <- 30

# multiply every element in the environment by 2
new_e <- eapply(e, function(x) x * 2)

In the above example, we create a new environment, e, with three elements, and then use eapply to double the value of each element, storing the results in the list new_e.

Here’s another example, using the environment within a function:

sample_func <- function() {
  df1 <- data.frame(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8))
  df2 <- data.frame(a = c(1, 2, 3, 4, 5), b = c(5, 6, 7, 8, 9))
  df3 <- data.frame(a = c(1, 2, 3, 4, 5, 6), b = c(5, 6, 7, 8, 9, 10))

  eapply(environment(), nrow)
}

sample_func()

Above, calling sample_func will return a list with the number of rows of each respective data frame defined within the function’s environment (4, 5 and 6 rows, respectively).

eapply can also be called with the parameter USE.NAMES = FALSE, which will return an unnamed list.

eapply(environment(), nrow, USE.NAMES = FALSE)

One other difference versus lapply is that eapply can take an optional Boolean parameter called all.names that specifies if the function input should be applied to just the visible elements of the environment, or to all objects. Here’s an example to illustrate:

e <- new.env()

e$.test <- 100
e$other_test <- 64

Here, we defined an environment with one object called .test and one called other_test. Next, if we run this:

eapply(e, sqrt)

we’ll get back a list with one element – 8. As in, eapply is only applying the sqrt function to the object named other_test in e because it is not a hidden object. Names that begin with a dot (hidden objects) are excluded from the function’s application. To apply the function to every object in the environment, we need to set the parameter all.names = TRUE.
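Concretely (sqrt(100) is 10 and sqrt(64) is 8; the order of elements in the result may vary):

eapply(e, sqrt)                   # contains only other_test = 8
eapply(e, sqrt, all.names = TRUE) # contains .test = 10 and other_test = 8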

Please check out other R articles of mine here: http://theautomatic.net/category/r/.

The post Those “other” apply functions… appeared first on Open Source Automation.


To leave a comment for the author, please follow the link and comment on their blog: R – Open Source Automation.

A deep dive into glmnet: penalty.factor

Tue, 11/13/2018 - 07:13

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

The glmnet function (from the package of the same name) is probably the most used function for fitting the elastic net model in R. (It also fits the lasso and ridge regression, since they are special cases of elastic net.) The glmnet function is very powerful and has several function options that users may not know about. In a series of posts, I hope to shed some light on what these options do.

Here is the full signature of the glmnet function:

glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial", "cox", "mgaussian"),
    weights, offset = NULL, alpha = 1, nlambda = 100,
    lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04), lambda = NULL,
    standardize = TRUE, intercept = TRUE, thresh = 1e-07,
    dfmax = nvars + 1, pmax = min(dfmax * 2 + 20, nvars), exclude,
    penalty.factor = rep(1, nvars), lower.limits = -Inf, upper.limits = Inf,
    maxit = 100000, type.gaussian = ifelse(nvars < 500, "covariance", "naive"),
    type.logistic = c("Newton", "modified.Newton"),
    standardize.response = FALSE, type.multinomial = c("ungrouped", "grouped"))

In this post, we will focus on the penalty.factor option.

Unless otherwise stated, \(n\) will denote the number of observations, \(p\) will denote the number of features, and fit will denote the output/result of the glmnet call.

penalty.factor

When this option is not set, for each value of \(\lambda\) in lambda, glmnet is minimizing the following objective function:

\[\min_{\beta_0, \beta} \ \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2 + \lambda \sum_{j=1}^p \left( \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right)\]

When the option is set to a vector c(c_1, ..., c_p), glmnet minimizes the following objective instead:

\[\min_{\beta_0, \beta} \ \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2 + \lambda \sum_{j=1}^p c_j \left( \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right)\]
In the documentation, it is stated that “the penalty factors are internally rescaled to sum to nvars and the lambda sequence will reflect this change.” However, from my own experiments, it seems that the penalty factors are internally rescaled to sum to nvars but the lambda sequence remains the same. Let’s generate some data:

n <- 100; p <- 5; true_p <- 2
set.seed(11)
X <- matrix(rnorm(n * p), nrow = n)
beta <- matrix(c(rep(1, true_p), rep(0, p - true_p)), ncol = 1)
y <- X %*% beta + 3 * rnorm(n)

We fit two models, fit which uses the default options for glmnet, and fit2 which has penalty.factor = rep(2, 5):

fit <- glmnet(X, y)
fit2 <- glmnet(X, y, penalty.factor = rep(2, 5))

What we find is that these two models have the exact same lambda sequence and produce the same beta coefficients.

sum(fit$lambda != fit2$lambda)
# [1] 0
sum(fit$beta != fit2$beta)
# [1] 0

The same thing happens when we supply our own lambda sequence:

fit3 <- glmnet(X, y, lambda = c(1, 0.1, 0.01), penalty.factor = rep(10, 5))
fit4 <- glmnet(X, y, lambda = c(1, 0.1, 0.01), penalty.factor = rep(1, 5))

sum(fit3$lambda != fit4$lambda)
# [1] 0
sum(fit3$beta != fit4$beta)
# [1] 0

Hence, my conclusion is that if penalty.factor is set to c(c_1, ..., c_p), glmnet is really minimizing

\[\min_{\beta_0, \beta} \ \frac{1}{2n} \sum_{i=1}^n (y_i - \beta_0 - x_i^T \beta)^2 + \lambda \sum_{j=1}^p \tilde{c}_j \left( \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right),\]

where \(\tilde{c}_j = p \, c_j / \sum_{k=1}^p c_k\), i.e. the user-supplied penalty factors rescaled to sum to nvars.
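If that conclusion is right, then multiplying all penalty factors by a constant should change nothing, since only their relative sizes survive the rescaling. A quick sketch along the lines of the experiments above (not from the original post):

# non-constant penalty factors that differ only by a constant multiple
fit5 <- glmnet(X, y, penalty.factor = c(1, 2, 3, 4, 5))
fit6 <- glmnet(X, y, penalty.factor = c(1, 2, 3, 4, 5) / 3)
sum(fit5$lambda != fit6$lambda) # expected: 0
sum(fit5$beta != fit6$beta)     # expected: 0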


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends.

Installing RStudio & Shiny Servers

Tue, 11/13/2018 - 03:00

(This article was first published on R on datawookie, and kindly contributed to R-bloggers)

I did a remote install of Ubuntu Server today. This was somewhat novel because it’s the first time that I have not had physical access to the machine I was installing on. The server install went very smoothly indeed.

The next tasks were to install RStudio Server and Shiny Server. The installation process for each of these is well documented on the RStudio web site:

These are my notes. Essentially the same, with some small variations.

RStudio Server
  1. Install a recent version of R.
  2. Download the distribution.

    wget https://download2.rstudio.org/rstudio-server-1.1.463-amd64.deb
  3. Install the server.

    sudo dpkg -i rstudio-server-1.1.463-amd64.deb
  4. Verify the installation.

    sudo rstudio-server verify-installation
  5. RStudio Server runs on port 8787, so you should be able to access it in a browser at http://<server-address>:8787.

Shiny Server
  1. Become root and install the shiny package.

    sudo su
    R -e "install.packages('shiny', repos='https://cran.rstudio.com/')"
    exit
  2. Download the distribution.

    wget https://download3.rstudio.org/ubuntu-14.04/x86_64/shiny-server-1.5.9.923-amd64.deb
  3. Install the server.

    sudo dpkg -i shiny-server-1.5.9.923-amd64.deb
  4. Shiny Server runs on port 3838, so you should be able to access it in a browser at http://<server-address>:3838.


To leave a comment for the author, please follow the link and comment on their blog: R on datawookie.

Interchanging RMarkdown and “spinnable” R

Tue, 11/13/2018 - 01:00

(This article was first published on R on Abhijit Dasgupta, and kindly contributed to R-bloggers)

Dean Attali wrote this nice post a few years ago describing knitr’s spin function. This function allows a regular R file, with comments written with the roxygen2-style comment tag #', to
be rendered as an HTML document, with the comments rendered as text and the results of
the R code rendered in place, much as an RMarkdown document would. The basic rules for this are (from Dean’s post):

  • Any line beginning with #' is treated as a markdown directive (#' # title will be a header, #' some **bold** text results in some bold text)
  • Any line beginning with #+ is parsed as code chunk options
  • Run knitr::spin on the file

In effect, this “spinnable” R script is the complement of a RMarkdown document with
respect to format, since the RMarkdown document is primarily a text (Markdown) document with code chunks, and the R script is primarily a code document with text chunks.

RMarkdown is nice when you want to create a final document to report your analysis, but it can be burdensome when you’re developing code. The R script is much nicer for developing code, but it would be nice to get a polished document at the end of development without too much hassle. It turns out we can flip back and forth between the two complementary formats through functions in knitr, where fname denotes the respective source file:

  • RMarkdown to R : knitr::purl(fname, documentation = 2)
  • R to RMarkdown : knitr::spin(fname, knit = FALSE, format = "Rmd")

Why would you want to convert R to RMarkdown when it renders to HTML or Word or PDF automatically (within RStudio), you ask? Because spin uses knit2html/knit2pdf/knit for rendering the documents, rather than rmarkdown::render, which uses the power of Pandoc and hence has richer options in terms of formatting. You could create a pipeline that starts with your R script, creates the intermediate Rmd file using spin, and then uses rmarkdown::render to create the final product.
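A minimal sketch of such a pipeline could look like the function below; render_spun() and analysis.R are hypothetical names, not something knitr or rmarkdown provides:

render_spun <- function(fname, ...) {
  # create the intermediate .Rmd next to the .R script, without knitting it
  knitr::spin(fname, knit = FALSE, format = "Rmd")
  rmd <- sub("\\.[Rr]$", ".Rmd", fname)
  # let rmarkdown (and hence Pandoc) do the final rendering
  rmarkdown::render(rmd, ...)
}

render_spun("analysis.R", output_format = "html_document")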


To leave a comment for the author, please follow the link and comment on their blog: R on Abhijit Dasgupta.

Shiny 1.2.0: Plot caching

Tue, 11/13/2018 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re pleased to announce the CRAN release of Shiny v1.2.0! This release features Plot Caching, an important new tool for improving performance and scalability in Shiny apps.

If you’re not familiar with the term “caching”, it just means that when we perform a time-consuming operation, we save (cache) the results so that the next time that operation is requested, we can skip the actual operation and instantly fetch the previously cached results. Shiny’s reactive expressions do some amount of caching for you already, and you can use more explicit techniques to cache the various operations you might do to your data (using the memoise package, or manually saving intermediate data frames to disk as CSV or RDS, for two examples).

Plots, being very common and (potentially) expensive-to-compute outputs, are great candidates for caching, and in theory you can use renderImage to accomplish this. But because Shiny’s renderPlot function contains a lot of complex infrastructure code, it’s actually quite a difficult task. Despite some valiant attempts, all of the examples we’ve seen in the wild have had serious limitations we wanted to overcome.

Shiny v1.2.0 introduces a new function, renderCachedPlot, that makes it much easier to add plot caching to your app.

Using renderCachedPlot

Let’s take a simple, but expensive, plot output:

output$plot <- renderPlot({
  ggplot(diamonds, aes(carat, price, color = !!input$color_by)) +
    geom_point()
})

The diamonds dataset has 53,940 rows. On my laptop, this takes about 1580 milliseconds (1.58 seconds). Perhaps that’s fast enough for doing exploratory data analysis, but it’s slower than we’d like for a high traffic Shiny app.

We can tell Shiny to cache this plot in two steps.

  1. Change renderPlot to renderCachedPlot.
  2. Provide a suitable cacheKeyExpr. This is an expression that Shiny will use to determine which invocations of renderCachedPlot should be considered equivalent to each other. (In this case, two plots with different input$color_by values can’t be considered the “same” plot, so the cacheKeyExpr needs to have input$color_by.)
output$plot <- renderCachedPlot({
  ggplot(diamonds, aes(carat, price, color = !!input$color_by)) +
    geom_point()
  },
  cacheKeyExpr = { input$color_by }
)

With these code changes, the first time a plot with a particular input$color_by is requested, it will take the normal amount of time. But the next time it is requested, it will be almost instant, as the previously rendered plot will be reused.

Benchmarking the results

To quantify the performance difference between the cached and uncached versions, I turned each into a minimal Shiny app (source). This app simply provides the color_by input using the new varSelectInput control, and then renders the output using either of the two code examples above. Then I used our (still-in-development) Shiny load testing tools to record a test script, and “replay” it against both versions of the app, each running in a single R process.

I tested the renderPlot version of the app with 5 concurrent users, and the renderCachedPlot version with 25, 50, and 100 concurrent users. The difference in performance is as dramatic as we’d expect:

With only five concurrent users, the latency is already pretty bad with the renderPlot version. (Note that this isn’t intended to represent typical performance with Shiny apps in general! We chose a particularly torturous ggplot on purpose.)

On the other hand, the renderCachedPlot version doesn’t break a sweat with 50 concurrent users; and even at 100 concurrent users, the latency is still acceptable.

When to use plot caching

A Shiny app is a good candidate for plot caching if:

  1. The app has plot outputs that are time-consuming to generate,
  2. These plots are a significant fraction of the total amount of time the app spends thinking, and
  3. Most users are likely to request the same few plots.

Since our test has a pretty expensive plot as its only output, and our load testing tools simulate n users all doing the same exact thing, these numbers reflect a best-case scenario for plot caching.

Shiny can store your cached plots in memory, on disk, or with another backend like Redis. You also have a number of options for limiting the size of the cache. Be sure to read this article to get the most out of this feature!
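For example, to persist cached plots on disk instead of in memory, you can pass a cache object to renderCachedPlot. This is only a hedged sketch; check the article linked above for the exact interface of diskCache() and its size limits:

output$plot <- renderCachedPlot({
  ggplot(diamonds, aes(carat, price, color = !!input$color_by)) +
    geom_point()
  },
  cacheKeyExpr = { input$color_by },
  # store cached plots on disk, with a cap of roughly 50 MB
  cache = diskCache("./plot-cache", max_size = 50 * 1024^2)
)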

In future releases of Shiny, we hope to build on this foundation we’ve laid to dramatically speed up other areas of Shiny apps, like reactive expressions and non-plot outputs. In the meantime, we hope you find plot caching to be a useful addition to your performance toolkit!

Other changes in Shiny v1.2.0
  • Upgrade FontAwesome from 4.7.0 to 5.3.1 and made icon tags browsable, which means they will display in a web browser or RStudio viewer by default (#2186). Note that if your application or library depends on FontAwesome directly using custom CSS, you may need to make some or all of the changes recommended in Upgrade from Version 4. Font Awesome icons can also now be used in static R Markdown documents.

  • Address #174: Added datesdisabled and daysofweekdisabled as new parameters to dateInput(). This resolves #174 and exposes the underlying arguments of Bootstrap Datepicker. datesdisabled expects a character vector with values in yyyy/mm/dd format and daysofweekdisabled expects an integer vector with day integer IDs (Sunday=0, Saturday=6). The default value for both is NULL, which leaves all days selectable. Thanks, @nathancday! #2147

See the full changelog for Shiny v1.2.0 for other minor improvements and bug fixes we’ve made in this release.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

The Antarctic/Southern Ocean rOpenSci community

Tue, 11/13/2018 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Antarctic/Southern Ocean science and rOpenSci

Collaboration and reproducibility are fundamental to Antarctic and Southern Ocean science, and the value of data to Antarctic science has long been promoted. The Antarctic Treaty (which came into force in 1961) included the provision that scientific observations and results from Antarctica should be openly shared. The high cost and difficulty of acquisition means that data tend to be re-used for different studies once collected.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

Angela Bassa discusses managing data science teams and much more.

Mon, 11/12/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson, the host of DataFramed, the DataCamp podcast, recently interviewed Angela Bassa, the Director of Data Science at iRobot.

Here is the podcast link.

Introducing Angela Bassa

Hugo: Hi there Angela, and welcome to DataFramed.

Angela: Thanks, thanks for having me.

Hugo: It’s a great pleasure to have you on the show, and I’m really excited to be talking about managing data science teams today, and your work at iRobot. But, before we get into this conversation, I’d love to find out a bit about you. I thought maybe you could start by telling us what you’re known for in the data community.

Angela: Sure. I’ve been involved in data science under that moniker for, let’s call it four, five years. Most of the contributions that I’ve made, you know I don’t have a package named after me, or anything like that. But, I’ve spoken a lot about how to do data science within a business context in terms of corporate data science, and also how to develop the skills to succeed as a data scientist within a larger organization. I’d like to think that folks who care about what I have to say, probably care about that.

Hugo: Absolutely. I think that’s really important, particularly at this stage in the way data science is developing to consider what’s happening in a business context, because a lot of people I think talk about the kind of state of the art techniques, and all of the buzz terms we hear. But, we do need to remember that data science within an organization is there in order to be one set of inputs into the decision making process, right?

Angela: Yeah, exactly. There are lots of companies that have their product be data scientific or algorithmic, but a large portion of data science performed within a business context is actually done in service of the business, rather than packaged as its own product. Really understanding how that fits within the larger organization strategically really matters in terms of being successful.

How did you get into data science?

Hugo: Yeah, and that’s a point we’re going to get to in this conversation. Before we get there though, I’d like to know a bit about your trajectory in terms of how you got into data science initially.

Angela: The short answer is I was really lucky. I can’t claim to have gone to school with the ultimate goal of being a data science professional, even though I’m really, really glad that that’s how things turned out. I went to undergrad, and I went to an engineering school, so obviously what you do in an engineering school is you don’t do engineering. I did my undergrad in math, and they recruit really heavily on Wall Street for math professionals. It was either that or academia, and I didn’t really want to join the academic environment, so off to Wall Street I went. I hated it. It was such a bad personality fit. But I did get to work with data. That was my role, was to do data analysis, and monitor data activity in the market. That part I really liked. Nothing against the financial world, I still have friends there. It just wasn’t for me.

Hugo: When was that?

Angela: Oh my gosh, that was about 15 years ago.

Hugo: Okay.

Angela: After I left, I started doing strategy consulting because that’s the other thing that you do if you don’t do investment banking. You usually do strategy consulting. That’s when I really started getting my hands dirty with data, and modeling in particular, rather than just monitoring. I did a lot of pharmaceutical strategies. There’s a lot of statistics that go into how do you set up a controlled experiment, a randomized control trial so that you can test the efficacy of different treatments. We did a lot of that kind of consulting for large pharmaceutical companies, and biotechs, and med techs. I did that for about eight years.

Angela: I left that industry and joined a large marketing services organization. That’s where I got introduced to big data. Truly big data. I mean, that’s when we went from things that could run on a single machine. I mean, the machine might hang a little, but it would definitely all run within RAM, to computations that you really needed to understand how to run the computation as much as you needed to understand what the computation was. While that was exciting, the stakes were really low. I mean, if you mess up, somebody doesn’t get a coupon, right? I actually, here in the Boston area where I’m located, participating in the community, and meetups, and what not, I ended up meeting some folks from a company that used to exist called EnerNOC. They’ve since been acquired.

Angela: That’s where the great outcome really came, where I was doing data science under that name. The stakes were high enough, so if you messed up you could lead to a brownout, or a blackout, or something like that, and folks really depended on our analysis to be able to save money, to save electricity. I really enjoyed working in that, but then after a while the great folks at iRobot reached out to me, and as a nerd, the opportunity to do this thing that I love, data science, but, also work with robots, it was-

Hugo: That’s pretty cool.

Angela: … Yeah, it was hard to pass. It was really hard to pass.

What do you do at iRobot?

Hugo: What do you do at iRobot now?

Angela: I am the Director of Data Science at iRobot. I think there are sort of two sides to what I do. One side is managing the team, a team of Data Scientists, and Analysts, and Interns, and Contractors who help us achieve our objectives. The other part of that is setting those objectives, and understanding how we can do the most good for the company.

Hugo: Great. Can you just tell us a bit about what iRobot does?

Angela: Sure. iRobot is the number one consumer robotics company. In the US for sure, I’m pretty sure it’s the world as well. We are the makers of consumer robots that you may know and love, like the Roomba, and the Braava. Those are a robotic vacuum cleaner, and a robotic mop, respectively. I think it’s really fantastic, because robotics is hard. This is a company that has figured out how to operate in this really difficult environment, making robots that are affordable, and that are able to help folks who have these tools sort of do more with their day. It’s really exciting.

Hugo: Now I kind of want to delve into a bit about data science management. As we know, there are a lot of roads that lead to data science, and I’m sure there are a lot of roads that lead into becoming a data science manager as well. Maybe you could tell us how you actually got into this position, or into data science management in general?

Angela: Yeah. I think a lot of folks who get into management for technical fields, have a similar background. Which is that, you usually excel as an individual contributor, and then get sort of promoted into this completely different discipline. It’s funny, because a lot of the heuristics that make you a really good individual contributor, don’t necessarily translate when you go into management. As an individual contributor, you are answering questions, and posing questions. Really as a manager, you’re scaling humans. It is a completely different discipline in and of itself. It takes time and effort to get really good at it, and I think the first step is understanding that it’s a different job.

Hugo: And a job that you’re not necessarily, will have ever been trained to do in terms of being a successful individual contributor in your field of expertise, right?

Angela: Yeah, exactly. As an individual contributor you may get your toes wet, in terms of providing mentorship, or working closely with interns, and helping them derive the most benefit from those kinds of relationships, those internships. But, going from being an individual contributor to being a manager, you have to remember that your goal is not to answer the question, your goal is to empower folks to answer their questions.

Hugo: What has your journey been like?

Angela: The way that I ended up in management, I think … Well, it’s funny. When I first was offered the opportunity to manage another person on a team, we interviewed several candidates, and the person who ultimately got the job. There was I think some miscommunication, because I don’t have a PhD, I just have my undergraduate math degree, which I think served really well for how I started, which was in, as a Data Analyst. Back in the dark ages, before Hadoop existed.

Angela: The person that we hired had a PhD. He had just graduated. He starts day one, he finds out that he’s working for me directly, as opposed to with me. He ended up quitting that day.

Hugo: Oh wow. That was your first engagement managing someone?

Angela: Yes, exactly.

Hugo: Wow.

Angela: That left a mark, so I didn’t end up-

Hugo: Mm-hmm (affirmative).

Angela: … I ended up, every time that I was offered a similar opportunity, I found a way to not take that on.

Hugo: Sure.

Angela: After-

Hugo: First impressions last.

Angela: … I know. Then after a few years, it sort of became evident that, that was what needed to happen for this situation, the context that I was in. I was really cautious and worried, and I also didn’t want to give up the role of individual contributor. Something that I had a lot of passion for, that I … If I may so, I thought I was really good at, and I enjoyed. I was worried that the path for professional growth lay in switching tracks, and jumping into management. That isn’t always the case, but it certainly seemed like the door that was open to me at the time.

Angela: I was cautious, and a little bit … It felt bittersweet. But, after that, managing a single person, and having that second iteration worked much better. I think probably because I was more self aware, and the person that I was managing had a much better temperament. After that I ended up managing a whole engagement, where the two of us worked. Then from there, it just kept growing, where I managed a small team, and then ended up managing the function, the discipline of data science within an organization. It has been an evolution.

Data Science Team Models

Hugo: Yeah, interesting. In terms of a data science team needing to deliver value for a business, we need to consider how data science is embedded in an organization. I’m wondering to your mind, what different models exist for data science in orgs, and which is your favorite, or which do you have, or which is working for you at the moment?

Angela: I have personally worked in data science under several different umbrellas. For instance, I’ve had teams that were under the operations branch of the org chart, under finance, and financial operations, under IT, under engineering, or under a dedicated R&D organization. Obviously, that’s a lot of organizational structure, so there were a couple of re-orgs. I’ve even been part of several re-organizations, and one thing that I noticed is that, the data science team always changes hands, always changes branches of the org chart, every time there’s been a re-org, that I’ve been a party of. I think that speaks to the value that data science can bring to an organization. That, it seems like different arms of a company all want to be able to leverage this really powerful discipline.

Angela: I think the key to ensure that the function is successful within an organization, independent of where it sits, whether it sits next to product management, usually when you’re developing product features, or whether it sits within operations, so that you are delivering value back to the enterprise. I think the important thing there is to really allow the function to mature. Usually in companies, especially companies where data science is not the product. Because, otherwise in those cases, data science is part of the founding, right? You need that in order to deliver on the business proposition.

Angela: But, in other cases within a larger enterprise, within sort of legacy companies that are looking to employ the tool set, there’s usually a couple of people who are delivering on that, and they really need time to mature as a discipline within the organization to become experts in the strategy, and the objective of that organization to be able to become experts in the data, and the artifacts, and bring that back. That’s the one thing that I would say that, everything else really doesn’t matter in an organizational context. But, what matters is the ability to let that team go through a couple of iterations, so they get to a point where they’re past the exploration.

Important Management Strategies

Hugo: I think this idea of allowing the team to become mature and evolve is incredibly important, and something we’ll come back to. Something you mentioned just then is allowing the team also time to understand the data, and become experts. I think this is in the direction of facilitating, allowing the team to deliver as much value to the organization as possible. I’m wondering, just in general, what are the most important strategies as a manager to ensure that your team can deliver as much value as possible?

Angela: … I think the metaphor that I like to imagine is, and you’ll have to pardon me because I’m Brazilian, so I think in soccer, not football. But, in American football there is this concept of the lineman who creates space so that the quarterback can make his play. I think a lot of times we like to think of the manager as the quarterback, and I don’t think that’s right. I think the individual contributors, for their particular projects, for their tasks, are their own quarterbacks. The role of the manager is to really create the pocket, create space so that they can think, and create space so that they can see the whole field, and they can see the opportunities, and they can see the answer.

Angela: That’s sort of my mentality. What I coach my team to be able to do is to become experts in the data. I think if you get asked to perform an analysis, or to answer a question, a lot of the times what happens is the person doing the asking doesn’t necessarily have the imagination to envision what an answer might look like, or what might be possible, right? They have this narrow view, because they are experts in something else. They’re incredibly smart, but they’re smart in a different thing than the thing we are smart in. When they ask a question, sometimes the question is too low level, or too high level.

Angela: Part of the role of a Data Scientist is to be that therapist, getting the question just so, so that you really get at what the person doing the asking wants. Sometimes they don’t even know what they want, or they don’t even know what’s possible to get answers to. So being the lineman to create that space so that the quarterbacks can do their thing, and do the strategy, and figure out how to answer the question is really how I think of enabling the team to deliver the most value.

Hugo: There’s so much in there. Two takeaways, I was thinking of when listening was, there’s the aspect of managing expectations on both sides of what’s possible, what’s feasible. But, also this act of translation, and helping to turn business questions into data questions. Then, the reverse act of translation, turning the data answers into business answers as well.

Angela: I think that’s fundamentally the job of a Data Scientist. Because, everybody … I mean, it’s the 21st century. Every discipline has data, everybody has information that they’re using to inform their decision making. What makes data science unique is our ability to take a business question, and formally formulate it, formally articulate it in a way that we can use the tools of statistics, and in software development to create a solution that is reproducible, that is replicable, that is interpretable, and that is fit for purpose, that answers the question. Because, a lot of times, what can happen is Data Scientists will become so enamored with a particular approach, that they can try to use it for everything when it wouldn’t be a great fit. Or, they become enamored with a data set, and they use it because they can, not because they should. That translational step, from business, to math, to the technical components, back to the business, really is where the great Data Scientists make a difference.

Team Growth

Hugo: A recurring theme in this conversation so far is the maturing of a data science team, and the evolution as a team. As you said, you started off managing one person. I’m wondering, what are key aspects of a Data Science Manager to think about as your team grows in size as a function of time?

Angela: I think one of the things that happens over time, so I was the first Manager of Data Science at EnerNOC, the company that I was at before iRobot. I was the first Director of Data Science at iRobot, so these are two teams that I grew sort of from the ground up. The thing that happens in the very beginning, there is so much potential, but there’s also so much low hanging fruit. Having a team that has the flexibility to deliver on a couple of … I wouldn’t call them necessarily moon shots, but on a couple of high visibility, high sophistication answers, to start illustrating what’s possible, right? What are the amazing things that this new function can deliver on?

Angela: But also, that low hanging fruit. The quickest way to value is knocking those out, is taking the things that are easy, and answering them better than anybody else can, with an architecture that takes care of itself so that it requires minimal monitoring. You just start adding things to that pipeline, and solving problems that themselves are tiny, but that save an aggregate number of people seconds or minutes. Then, those add up. Having that flexibility means that in the very beginning, you have sort of undifferentiated talent, right? You have the quote/unquote, "Unicorns." I hate that word.

Hugo: So, data science generalists of some sort?

Angela: Exactly, yeah. Folks that have the basic tool set, and that with a little bit of guidance, can sort of play in all of those roles. But there’s something that folks in the medical sciences say, that I think is really relevant, is that ontogeny recapitulates phylogeny. And what I think that phrase means is that the way organisms develop, from fertilization to gestation or hatching, mimics the stages in the evolution of the animal’s remote ancestors. That’s a pretty random analogy, but the way I think it’s relevant here is that the way a data science team develops also mimics the stages in the evolution of a company. So, much like a startup, a budding data science team has lots of undifferentiated and flexible talent, and the team goes through several "pivots" as they try to establish their value, who their champions are, and the ideal way to engage with the rest of their internal customers. When they’re just small teams, they are rudimentary, and they are pluripotent, right? They act a little like stem cells, right? They can mature into anything.

Hugo: What then happens as a data team matures?
Angela: When you get some maturity into the team, that’s when you start to have specialization, that’s when you start to have differentiation. That’s when you start having folks who are really great at visualization, or really natural talents in terms of data platform engineering, or reliability. Folks who are great at QA, they have that personality, and that passion for the attention to detail. When there’s enough scale in the type of work that the data science team is doing, only then I think it makes sense to start having those dedicated, ancillary teams that can liberate the Data Scientist to truly focus on the science component, which is the hypothesis testing.

Hiring and Building a Team

Hugo: Once you enter that stage, how do you think about hiring, or building a team around the different skills? As you say, you don’t necessarily want data science generalists then, but you want a team of people whose skill sets, question-asking, and curiosities complement one another, right?

Angela: Yeah. I think as you start, you want to have folks who are well rounded. But, the farther along you go, I think it’s important to have a team that represents your end users, whoever they may be. I think, especially in companies where the product is data science, you want to make sure that your data science team looks like the folks who use your product, so that you have different perspectives, and you can ask different questions. And, everybody doesn’t look the same way, use the same tools, ask the same questions. I think it’s important to have that diversity in all dimensions. I think folks who are senior, and folks who are junior, there’s something I like to think about in terms of the luxury of ignorance. Which, is when junior folks in a team get to ask “dumb” questions, right? And dumb in quotation marks, because they’re not dumb. What they are is unencumbered. They’re unencumbered by the assumptions that we forget we make. They’re unencumbered by the heuristics that we’ve developed, that may not be applicable everywhere.

Angela: They have this luxury of being able to challenge the assumptions that folks with more seniority are perfectly capable of, but you start forgetting. You hear hooves, and you think horses, not zebras. Well, the more younger folks are like, "Well what if it’s a zebra?" And they challenge that, and they force you to think about why you’re making certain decisions.

Hugo: I love that, and I love that you describe it in terms of heuristics that we develop over time, because we know that when we start using heuristics a lot, they’re coupled with certain biases as well. So, having a new point of view, which is unencumbered by heuristics will also make us recognize our own biases hopefully also.

Angela: Absolutely. Not to knock on heuristics, they’re great, they exist for a reason.

Hugo: Mm-hmm (affirmative) and necessary.

Angela: We build them because they create shortcuts, and they make us efficient, right? It’s the whole thing about thinking fast and slow, and how our brains operate, and how we create our own Bayesian priors, and sort of go from them. But, I think having folks with different priors participate in the conversation, really enriches it.

Hugo: You mentioned earlier the types of questions that a data science team can think about, and perhaps should think about. You actually … We may get to this later, recently you sent me a draft of an article you’re writing for Harvard Business Review, and you make a nice distinction there between the space of questions a team might be able to answer, and the space of questions a team can and should answer. I thought maybe you could speak to that a bit, in light of this.

Angela: Yeah. I think that this is a perfect fit for that. In terms of what usually can happen, and very easily. I’ve been guilty of this as well is, you have access to data, and so you start correlating. You start exploring, and you start figuring out what could happen. I think while there’s certainly value in directionless exploration as you’re starting to build your own heuristics about these data artifacts. I think if possible, whenever possible, it’s much more important to think about first what the objective might be, and to have that north star as you start quote/unquote, "Spelunking," through data. When you think about what the question that is posed to you is, a lot of the times it can be easy to think, "Oh, well I don’t have a perfect fit for that question, but I have this other data set that I bet is correlated." And so, you start going there.

Angela: I think one of the things that makes a really good Data Scientist is also humility. Humility to know that maybe that’s not what it means. I mean, sometimes the answer is in an email thread somewhere that you don’t have access to, that you weren’t involved in, that you weren’t privy to. But, the answer exists elsewhere. I think it’s really important to have the self awareness to go and ask, and become an expert not just through spelunking in data, but spelunking through the organization, right? Making connections with other folks around the organization, and truly gaining insight into how the data is generated, what context is it used for, can it be repurposed, what are the issues that potentially arise with that repurposing?

Angela: And so, really figuring out what kinds of questions can be answered is great, but what kind of questions should be answered I think is something that the data scientists within an organization is well positioned to be able to ask, and perhaps a better position than anybody else.

Trade Offs

Hugo: Now, this speaks to a certain trade off I think. I’m wondering, in your role as a Data Science Manager, what are the types of trade offs you need to make, and how do you think about making the right ones?

Angela: Oh, I think being a manager in any discipline, but especially in data science, I think those trade offs are everything. Data science is a little bit different than other types of work, because you’re not just answering questions. A lot of times you are figuring out if a question is even answerable, right? It’s not just the how or the what, but it’s also the if. Figuring out those trade offs, a lot of other disciplines have different trade offs. But, a lot of the trade offs are also very similar in terms of how much time do you spend catching up on the latest findings of a discipline, the latest applications, the latest methodologies, versus selling the discipline, selling it internally, letting folks in legal, and in sales, and in operations know that this resource is available to them if they have questions, that they would love to have more information, more data to help with their decision making.

Angela: How much do you spend doing? Usually I’m making slides, or writing memos, or thinking through what everybody needs, and articulating that, and setting it in writing. Versus coaching, versus growing your team, and making sure they have what they need, and making sure that they are getting exposed to strategy so that they can make the best play when it’s their turn. As well as planning, and strategizing, and figuring out who do we need to talk to, when do we need to deliver something by, when do we need to do road shows, and present some of our findings so folks know that we’re a credible part of the organization that can be leveraged, and can bring value?

Angela: I think all of that are the things that you are constantly trying to juggle and optimize as a manager. Also, there’s loads of additional questions. Who do you bring into your team, and how do you make sure that everybody who comes and joins the team allows you to get network effects out of that expansion, so that you’re not just having a plus one, but you’re having plus N because of all of the ways in which that person improves a team, and covers blind spots?

Hugo: How do you think about the trade off between … I mean, when hiring for a data science position you can hire someone with incredibly strong quantitative and data science skills. But, you can also go about it, I presume, in terms of someone who maybe has some other expertise, and can pick up a bunch of the data science in the process as well, right?

Angela: Yeah. I’m a big fan of the data science boot camps. Not all of them, but I think there are several that have been fantastic for preparing folks who have that ambition, and that ability to learn the skills, right? I think there are parts to data science which is, that you can’t teach, right? You can’t teach somebody to want to answer a question correctly. But, the how I think is teachable. I think there are a lot of folks out there who are breaking into data science. I mean, the different institutes and universities are only starting to have quote/unquote, "Data science programs." I mean, pretty much everybody who came into data science over the last five years did something else as their training.

Angela: Here’s a perfect example. On the team at iRobot we have one of our Data Scientists who’s originally trained as a Marine Biologist. You would think, "What does a Marine Biologist do in a robotics consumer company?" Well, you’d be surprised, because it turns out that there’s a lot of research in her field. What she did, she did a lot of studies with pods of dolphin in the wild. She actually traveled all over, and I’m sort of jealous. It turns out that that kind of modeling expertise is really useful when you’re thinking about a fleet of robots, and how those robots behave-

Hugo: Oh wow.

Angela: … Independently, and dependently. You can think of a fleet of robots as a pod of dolphin in certain scenarios. Obviously it’s not a perfect analog, but a lot of the modeling becomes extremely handy. That knowledge exists in the world, and it’s just a question of how do you know to look for it there?

Hugo: Yes.

Angela: She brings that level of expertise to us. She’s an amazing Data Scientist. She has all of the qualifications to be a fantastic Data Scientist, technically. But, she also brings this added dimension that helps us solve problems differently, and I think better.

Hugo: Yeah, and of course being from academic research, or scientific research, knowing how to ask the right questions. But also, if she did a lot of travel, data collection, that type of stuff, thinking about the data generating process, how the data was generated, and how then you can model it is such a key part of what it is to do this type of work also.

Angela: Exactly, yeah. That’s one of the reasons that I’m a big fan of internship programs as well, because it can seem like grunt work, but it’s incredibly important grunt work, and I think we all did it. I mean, I did it when I was on Wall Street as well. As I was building my data sets, and I was building these databases that got monitored, I understood very intimately what I meant when I made design choices, and how my design choices propagated downstream so that what kinds of questions were easier to answer, and harder to answer, and why? What was my governance model, right? I didn’t have words for those things when I was starting out, but that’s what those are.

Angela: In our internship program we have folks become intimately familiar with data gathering, and data ingestion, and data management which, I think down the line helps them tremendously, because they are better able to understand context, and to understand how important it is to be respectful of those design decisions, and not use data sets for one thing, when they are really intended for another, and accounting for that.

Hugo: Yeah. Just quickly, for any of our listeners who are really enjoying this conversation and your work at iRobot, can they check out internship programs online or something along those lines?

Angela: Oh yeah, absolutely. If you go to our careers page and search for the data science internship, yes, if you’re interested, please apply.

Hugo: Fantastic. If you do apply, make sure to mention that you heard about it on the podcast.

Angela: Absolutely, yes.

Data Integration with Stakeholders

Hugo: Something you mentioned Angela, is this idea of it being a requirement in some ways to sell data science internally in an organization. Something I’m really interested in is how we’re going to see our data literacy spread across orgs, not only in data science teams. I’m wondering for you to have the best conversation with stakeholders, how much data do they need to be able to speak, or do you see a future in which C-suite and other stakeholders speak more data, and become more data literate?

Angela: Oh, I think the latter. I think it’s going to start becoming even harder to not be able to credibly discuss your product, or your strategy in a data literate way. I think the market has that expectation, I think it’s becoming table stakes. And, to be able to ensure that your strategic decisions are based on information that you have had the foresight to gather so that you can make the right decisions.

Common Pitfalls for Data Science Managers

Hugo: What are some common pitfalls or warnings you have for Data Science Managers?

Angela: One of my pet peeves is when a data team doesn’t know what data is available, and what it means, and how it can be used. I think the first thing that you need to do is have a big exploratory data analysis party, you know?

Hugo: Awesome. Data party, I love it.

Angela: … Yes. Dedicate some time, once a week, maybe 10% of everybody’s time dedicated to getting lost in the data, and truly understanding it, and setting up coffees with other folks in the organization so that you can ask questions about how that data is designed, and created, and collected, and stored, and labeled. I think that’s hugely important. I get really aggravated when folks think that’s a waste of time, because it’s undirected, and I think it’s hugely valuable if you’re going to be the expert on your company’s data, that you be the expert on your company’s data.

Angela: The other thing I think is to not over promise. One of the things that tends to happen is, folks know what’s possible, and so they paint a picture, but they forget to be pragmatic about how they’re going to execute. And so, not over promising is huge, but not under promising either. I think sandbagging backfires, I think you need to be able to accurately promise. Then, to deliver on it. That’s not just because it keeps you from that over/under promising situation, but also because it builds credibility. If you can accurately assess what your outcome is going to be, I think that lends credibility to the actual outcome as well.

Angela: I think one way to get to that point where you can promise and then deliver is to be honest, and to be transparent. Perhaps a little bit more transparent than with other disciplines, because the Data Scientist is trained to interrogate data, to interrogate situations. They’re going to be able to tell when you’re over promising, or when you’re under promising, or when you’re not sure of what the objective is. It’s really important to have that clarity, and to communicate that within the team, and within the organization.

Future of Data Science in Organizations

Hugo: Great. I think this has been a wonderful conversation about the state of data science management today, and your practice in particular. I’m wondering what the future of data science in organizations, particularly relative to the decision function, what this looks like to you.

Angela: It’s rosy. I think there’s job security in data science. It is definitely something that’s becoming more and more ingrained in the fabric of different organizations. I think that’s why it depends. The future is going to look different for companies that have their product be data scientific or algorithmic, versus companies that use data science in service of something else. I also think that the future looks different, whether the team is part of a public organization, a startup, or a large organization. And also, the time horizon. Is this a team that is exclusively research, and they’re working on moon shots? Versus, a team that is more operational, and is enterprise facing, and helping the company optimize its own functioning.

Angela: I think all of those have different curves that they’re on, but I think, in any respect, I can’t see a future where we’re not relying more and more on the expertise of folks who understand how to manipulate data.

Call to Action

Hugo: For all our listeners out there who are either Data Scientists, aspiring Data Scientists, or even have aspirations to get into data science management, do you have a call to action for them?

Angela: Well, I’m so glad you mentioned it, and actually you also mentioned it earlier during our conversation. I’m really excited, I just penned an article for HBR, and it’s actually part of a series that they’re putting together called, "Managing Data Science." It’s an eight week newsletter that they’re putting together, that focuses on making analytics and AI work for everybody’s organizations. I have an article coming up, so by the time this podcast hits the wires, I think it’s going to be two or three weeks old. I encourage you and your listeners to check it out.

Hugo: Fantastic, and we’ll include a link in the show notes as well.

Angela: Awesome, thank you.

Hugo: Angela, it’s been such a pleasure having you on the show.

Angela: Oh, it has been my pleasure. Thank you for letting me nerd out.



Preview my new book: Introduction to Reproducible Science in R

Mon, 11/12/2018 - 17:14

(This article was first published on R – Cartesian Faith, and kindly contributed to R-bloggers)

I’m pleased to share Part I of my new book “Introduction to Reproducible Science in R“. The purpose of this book is to approach model development and software development holistically, to help make science and research more reproducible. The need for such a book arose from challenges I’ve observed teaching graduate courses in natural language processing and machine learning, as well as training my own staff to become effective data scientists. While quantitative reasoning and mathematics are important, often I found that the primary obstacle to good data science was a lack of reproducibility and repeatability: it’s difficult to quickly reproduce someone else’s results. And this causes myriad headaches:

  • It’s difficult to validate a model
  • It’s difficult to diagnose a model
  • It’s difficult to improve/change a model
  • It’s difficult to reuse parts of a model

Ultimately, without repeatability, it’s difficult to trust a model. And if we can’t trust our models, what good are they?

This book therefore focuses on the practical aspects of data science, the tools and workflows necessary to make work repeatable and efficient. Part I introduces a complete Linux-based toolchain to go from basic prototyping to full-fledged operational/production models. It introduces tools like bash, make, git, and docker. I show how all these tools fit together to help imbue structure and repeatability into your project. These are tools that (some) professional data scientists use and can be used throughout your career. 

With this foundation established, Part II describes common model development workflows, from exploratory analysis to model design through to operationalization and reporting. I walk through these archetypical workflows and discuss the approaches and tools for accomplishing the steps in the workflows. Some examples include how to design code to compare models, effective approaches for testing code, how to create a server and schedule jobs to run models.

Finally, in Part III I dive into programming. Those that have read my blog know that I look at programming for data science differently from systems programming. This changes the way you program. I’m a strong advocate of functional programming for data science, because it fits better with the mathematics. This part introduces functional programming and discusses data structures from this perspective. I also show how to approach common problems in data science from this view.

In short, this book will not only help you become a better programmer, but a better scientist. I assume the reader knows how to program and has experience creating models. It is appropriate for practitioners, graduate students, and advanced upper-level undergraduates.

Any feedback is appreciated. Feel free to comment here or on Twitter.

Rowe – Introduction to Reproducible Science in R (Download Part I)

What happened to my other book, “Modeling Data With Functional Programming In R”? For those curious, my editor and I decided to postpone publishing it until after this book. I decided that I needed to provide a foundation that people could use to appreciate this other book. 



Time Series and MCHT

Mon, 11/12/2018 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

Over the past few weeks I’ve published articles about my new package, MCHT, starting with an introduction, a further technical discussion, demonstrating maximized Monte Carlo (MMC) hypothesis testing, bootstrap hypothesis testing, and last week I showed how to handle multi-sample and multivariate data. This is the final article where I explain the capabilities of the package. I show how MCHT can handle time series data.

I should mention that I’m not focused on the merits of the procedures I use as examples in these posts, and that’s going to be the case here. It’s possible (perhaps even likely) that there’s a better way to decide between the hypotheses than what I show here. In these articles, I’m more interested in showing what can be done rather than what should be done. In particular, I like simple examples that many can understand, even if they may not be the best tool for the task at hand.

So far I don’t think this has been a serious issue; that is, I don’t think the procedures I’ve shown so far could be considered controversial (I think the most controversial would be the permutation test example). But the example I want to use here could be argued with; I personally would not use it. That said, I’m still willing to demonstrate it because it doesn’t take much to understand what’s going on and it does demonstrate how time series data can be handled.

Context

Suppose we want to perform a test for the location of the mean, and thus decide between the hypotheses

$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_A: \mu \neq \mu_0$$

There is the usual $t$-statistic, which is $t = \sqrt{n}(\bar{x} - \mu_0)/s$, and as mentioned before the statistic assumes that the data came from a Normal distribution. That’s not all the test assumes, though. It also assumes that the data is independent and identically distributed.

In cross-sectional contexts this is fine, but it’s not okay when the data could depend on time and thus is not independent and identically distributed. Suppose instead that our data was generated according to a first-order autoregressive process (AR(1)), described below:

$$y_t = \mu + \epsilon_t, \qquad \epsilon_t = \rho\,\epsilon_{t-1} + \sigma u_t$$

In this context, assume $u_t \sim N(0, 1)$ and is independent and identically distributed. It’s no longer given that the conventional $t$-test will work as marketed since the data is no longer independent or identically distributed. Additionally, we have two nuisance parameters, $\rho$ and $\sigma$, that need to be accounted for.

Monte Carlo Testing with Time Series

We will view $\rho$ and $\sigma$ as nuisance parameters and use MMC testing to handle them. That leaves the question of how to simulate an AR(1) process. With MCHT, if you can simulate a process, you can test with it.

The time series model above has a stationary solution when $|\rho| < 1$, that is, when $\rho$ ranges between $-1$ and $1$. It's not possible to simulate a series of infinite length, but one can get close by simulating a series that is very long. In particular, one can simulate, say, 500 terms of the series starting at a fixed number, then the actual number of terms of the series wanted, then throw away the first 500 terms. This is known as burn-in and it's very common practice in time series simulation.

Fortunately MCHTest() allows for burn-in. Suppose that the sample size of the actual dataset is $n$ and we've decided that we want a burn-in period of 500 terms. Then we can do the following:

  1. Generate $n + 500$ random numbers to represent $u_t$ (except possibly for the scaling factor, as we're treating that as a nuisance parameter).
  2. Apply the recursive formula described above to the series after scaling the series by $\sigma$ and using a chosen $\rho$, and add $\mu$ to it.
  3. Keep only the last $n$ terms of the series; throw away the rest. This is your simulated dataset.
  4. After having obtained the simulated dataset, proceed with the Monte Carlo test as usual. (A standalone sketch of steps 1 through 3 follows below.)
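Here is a minimal sketch of steps 1 through 3, outside of the MCHTest() machinery. The particular values of $n$, $\rho$, $\sigma$, and $\mu$ are made up for illustration; in the MMC test, $\rho$ and $\sigma$ are instead chosen adversarially by the optimizer.

set.seed(123)

n <- 10         # sample size of the "actual" dataset
burn_in <- 500  # extra terms to throw away
rho <- 0.5      # illustrative nuisance parameter values
sigma <- 2
mu <- 1

# Step 1: generate n + 500 standard Normal draws for u_t
u <- rnorm(n + burn_in)

# Step 2: scale by sigma and apply the AR(1) recursion
#         eps_t = rho * eps_{t-1} + sigma * u_t
eps <- stats::filter(sigma * u, rho, method = "recursive")

# Step 3: drop the burn-in period and add mu to get the simulated dataset
y <- as.numeric(eps[-(1:burn_in)]) + mu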

With MMC, the unscaled series is fixed after we generate it, and we use optimization to adversarially choose $\rho$ and $\sigma$ so that we maximize the $p$-value of the test.

When using MCHTest(), the rand_gen function does not need to produce a dataset of the same length as the original dataset; this is what allows for burn-in. However, if you're going to do this, then the stat_gen function needs to know what the sample size of the dataset is. All you need to do is give the stat_gen function the parameter n; this will be given the sample size of the original dataset. And of course the test_stat function won't care whether the data came from a time series or not.

Putting this all together, we create the following test.

library(MCHT)
library(doParallel)

registerDoParallel(detectCores())

ts <- function(x, mu = 0) {
  sqrt(length(x)) * (mean(x) - mu)/sd(x)
}

rg <- function(n) {
  rnorm(n + 500) # Extra terms for a burn-in period
}

sg <- function(x, n, mu = 0, rho = 0, sigma = 1) {
  x <- sigma * x
  if (abs(rho) >= 1) {stop("Bad rho given!")}
  eps <- filter(x, rho, "recursive") # Apply the recursion
  eps <- eps[-(1:500)] # Throw away first 500 observations; they're burn-in
  dat <- eps + mu
  test_stat(dat, mu = mu) # Will be localizing
}

mc.ar1.t.test <- MCHTest(ts, sg, rg, N = 1000, seed = 123,
                         test_params = "mu",
                         nuisance_params = c("rho", "sigma"),
                         optim_control = list(
                           lower = c("rho" = -0.999, "sigma" = 0),
                           upper = c("rho" = 0.999, "sigma" = 100),
                           control = list("max.time" = 10)),
                         threshold_pval = 0.2, localize_functions = TRUE,
                         lock_alternative = FALSE)

dat <- c(-1.02, -1.13, 0.53, 0.21, 1.76, 1.79, 1.42, -0.31, -0.28, -0.44)

mc.ar1.t.test(dat, mu = 0, alternative = "two.sided")

## Warning in mc.ar1.t.test(dat, mu = 0, alternative = "two.sided"): Computed
## p-value is greater than threshold value (0.2); the optimization algorithm
## may have terminated early

##
##  Monte Carlo Test
##
## data:  dat
## S = 0.73415, p-value = 0.264
## alternative hypothesis: true mu is not equal to 0

mc.ar1.t.test(dat, mu = 3, alternative = "two.sided")

## Warning in mc.ar1.t.test(dat, mu = 3, alternative = "two.sided"): Computed
## p-value is greater than threshold value (0.2); the optimization algorithm
## may have terminated early

##
##  Monte Carlo Test
##
## data:  dat
## S = -7.9712, p-value = 0.504
## alternative hypothesis: true mu is not equal to 3

t.test(dat, mu = 3, alternative = "two.sided") # For reference

##
##  One Sample t-test
##
## data:  dat
## t = -7.9712, df = 9, p-value = 2.278e-05
## alternative hypothesis: true mean is not equal to 3
## 95 percent confidence interval:
##  -0.5265753  1.0325753
## sample estimates:
## mean of x
##     0.253

Conclusion

I have now covered what I consider the essential technical functionality of MCHT. All of the functionality I described in these posts is functionality that I want this package to have. Thus I personally am quite happy this package exists, which is good; I'm the package's primary audience, after all. All I can hope is that others find the package useful too.

I wrote this article more than a month before it was published, so perhaps I have made an update that isn't accounted for here, but as of this version (0.1.0), I'd say the package is in a beta stage of stability; it's usable, but features could be added or removed and there could be unknown bugs.

The following is a list of possible areas of expansion. This list exists mostly because I think it needs to exist; it gives me something to aim for before making a 1.0 release. That said, they could be useful features.

  • A function for making diagnostic-type plots for tests, such as a function creating a plot for the rejection probability function (RPF) as described in [1].
  • A function that accepts a MCHTest-class object and returns a function that, rather than returning a htest-class object, gives the test statistic, the simulated test statistics, and a p-value in a list; this could be useful for diagnostic work.
  • Real-world datasets that can be used for examples.
  • Functions with a simpler interface than MCHTest(), perhaps with more restrictions on inputs.
  • Pre-made MCHTest objects, perhaps implementing common Monte Carlo or bootstrap tests.

I also welcome community requests and collaboration. If you want a feature, consider issuing a pull request on GitHub.

Do you want more documentation? More examples? More background? Let me know! I'd be willing to write more on this subject. Perhaps if I amass enough content I could write a book documenting MCHT and Monte Carlo/bootstrap testing.

These blog posts together extend beyond 10,000 words, so I'm thinking I have enough material to submit an article to, say, J. Stat. Soft. or the R Journal and thus get my first publication where I'm the sole author. But this is something I'm still considering; I'm an insecure person at heart.

Next week I will still be using this package in a blog post, but I won't be writing about how to use it anymore; instead, I'll be using it to revisit a proposition I made many months ago. (It was because of that article I created this package.) Stay tuned, and thanks for reading!

References
  1. R. Davidson and J. G. MacKinnon, “The size distortion of bootstrap tests,” Econometric Theory, vol. 15 (1999), pp. 361–376.

Packt Publishing published a book for me entitled Hands-On Data Analysis with NumPy and Pandas, a book based on my video course Unpacking NumPy and Pandas. This book covers the basics of setting up a Python environment for data analysis with Anaconda, using Jupyter notebooks, and using NumPy and pandas. If you are starting out using Python for data analysis or know someone who is, please consider buying my book or at least spreading the word about it. You can buy the book directly or purchase a subscription to Mapt and read it there.

If you like my blog and would like to support it, spread the word (if not get a copy yourself)!



How to de-Bias Standard Deviation Estimates

Mon, 11/12/2018 - 16:48

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is not important to remove it for reasonably sized samples, and (despite that) give a very complete bias management solution.

Everitt’s The Cambridge Dictionary of Statistics 2nd Edition, Cambridge University Press defines a “statistic” as:

A numeric characteristic of a sample.

Informally we can say a statistic is a summary. And this lets us say the field of statistics is the science of relating the observed summaries of samples to the corresponding unobserved summary of the total population or universe the samples are drawn from.

For example: we can take our universe to be the set of female adult crew members of the Titanic, and our observable to be if they survived or not. Then our universe or total population is the following:

20 survivors (we will code these as 1's)
3 fatalities (we will code these as 0's)

We can ask R to show us some common summaries: mean (which we will denote as “p”), variance, and standard_deviation.

universe = c(rep(1, 20), rep(0, 3))

p = mean(universe)
print(p)
# [1] 0.8695652

variance <- mean((universe - p)^2)
print(variance)
# [1] 0.1134216

standard_deviation <- sqrt(variance)
print(standard_deviation)
# [1] 0.3367812

Note, we deliberately did not call R’s var() and sd() methods, as they both include Bessel’s sample correction. For an entire universe (sometimes called a population) the sample correction is inappropriate, and leads to a wrong answer.
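As a quick check of that point (a minimal sketch reusing universe, p, variance, and standard_deviation from above), undoing Bessel's correction by hand recovers the population figures:

n <- length(universe)

# var() and sd() divide by n - 1; rescaling by (n - 1)/n undoes the correction
all.equal(var(universe) * (n - 1) / n, variance)                 # should be TRUE
all.equal(sd(universe) * sqrt((n - 1) / n), standard_deviation)  # should be TRUE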

Python’s statistics library has a sensible approach. It presents un-corrected (population or universe versions) and corrected (sample versions) on equal footing:

From: https://docs.python.org/3/library/statistics.html.

The bias in question is falling off at a rate of 1/n (where n is our sample size). So the bias issue loses what little gravity it may ever have had when working with big data. Most sources of noise will be falling off at a slower rate of 1/sqrt(n), so it is unlikely this bias is going to be the worst feature of your sample.

But let’s pretend the sample size correction indeed is an important point for a while.

Under the “no bias allowed” rubric: if it is so vitally important to bias-correct the variance estimate, would it not be equally critical to correct the standard deviation estimate?

The practical answer seems to be: no. The straightforward standard deviation estimate itself is biased (it has to be, as a consequence of Jensen’s inequality). And pretty much nobody cares, corrects it, or teaches how to correct it, as it just isn’t worth the trouble.

Let’s convert that to a worked example. We are going to investigate the statistics of drawing samples of size 5 (uniformly with replacement) from our above universe or population. To see what is going on we will draw a great number of examples (though in practice when we could do this, we could more easily directly summarize over this small universe).

Here is the distribution of observed sample variances for both the naive calculation (assuming the variance observed on the sample is representative of the universe it was drawn from) and the Bessel corrected calculation (inflating the sample estimate of variance by n/(n-1), or 25% in this case). The idea is: the observable Bessel corrected variances of samples converge to the unobserved un-corrected (or naive) variance of the population.

We generated the above graph by uniformly drawing samples (with replacement) of size 5 from the above universe. The experiment was repeated 100000 times and we are depicting the distribution of what variance estimates were seen as a density plot (only a few discrete values occur in small samples). The black dashed vertical line is the mean of the 100000 variance estimates, and the vertical red line and blocks structure indicates the true variance of the original universe or population.
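A minimal sketch of that simulation is below. The seed is an arbitrary choice and the density plot itself is omitted, so exact figures will differ slightly from the graphs in this post:

set.seed(2018)  # arbitrary seed, not the one used for the graphs

n_rep <- 100000
sample_size <- 5

# Draw samples of size 5 with replacement and record both variance estimates
estimates <- t(replicate(n_rep, {
  s <- sample(universe, sample_size, replace = TRUE)
  naive <- mean((s - mean(s))^2)
  c(naive = naive,
    Bessel = naive * sample_size / (sample_size - 1))  # identical to var(s)
}))

colMeans(estimates)  # compare both means to the true variance, 0.1134216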

Notice the Bessel corrected estimate of variance is itself a higher variance estimate: many of the adjusted observations are further away from the true value. The Bessel correction (in this case) didn’t make any one estimate better, it made the average over all possible estimates better. In this case it is inflating all the positive variance estimates to trade-off against the zero variance estimates.

Now suppose we run the same experiment for standard deviation estimates.

The Bessel adjustment helped, but was nowhere near correcting the bias. Both estimates of standard deviation are severely downward biased.

Here is that experiment again, including an additional correction procedure (which we will call joint scale corrected).

Notice the scale corrected estimate is unbiased. We admit, if this were so massively important it would be taught more commonly. However, as standard deviation summaries are more common than variance summaries (example: summary.lm()), having an unbiased estimate for a standard deviation is probably more important than having an unbiased estimate for variance.

The scale corrected estimate did raise the variance of the estimation procedure, but not much more than the Bessel correction did:

   estimation_method  variance_of_estimate
1: scale_corrected_sd            0.05763954
2: naive_sd                      0.04554781
3: Bessel_sd                     0.05693477

How did we correct the standard deviation estimate?

It is quite simple:

  1. For a binomial outcome and sample size n, the standard deviation estimate is essentially a lookup table that maps the pair n (the sample size) and k (the number of 1’s seen) to an estimate. Call this table est(n,.). Remember est(n,.) is just a table of n+1 numbers.
  2. The expected value of our estimator is just going to be the sum of the probability of each observable situation times the estimate we make in that situation. This is supposed to match the unknown true standard deviation, which is sqrt(p(1-p)) (assuming we exactly knew p). This means for any p we want these two quantities near each other, i.e. we want the check equation

    sum over k of dbinom(k, n, p) * est(n, k)  ~  sqrt(p*(1-p))

    to hold approximately, dbinom(k, n, p) being the probability of drawing k 1's in n draws when 1's have a probability of p (this is a known quantity, a value we just copy in). More examples of using this methodology can be found here.
  3. We ask an optimizer to pick values for est(n,.) that simultaneously make many of these check equations near true. We used a discrete set of p's with a Jeffreys prior style example weighting. One could probably recast this as a calculus of variations problem over all p's and work closer to an analytic solution. A small sketch of this fitting step follows below.
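Here is a minimal sketch of that fitting step for n = 5 (our illustration, not the authors' exact code); the p grid, the Jeffreys-style weights, the Bessel-corrected starting values, and the use of optim() with BFGS are all assumptions made for the example.

n <- 5
p_grid <- seq(0.01, 0.99, by = 0.01)
w <- 1 / sqrt(p_grid * (1 - p_grid))        # Jeffreys-prior style weighting

loss <- function(est) {
  # est is a vector of n+1 candidate estimates, indexed by k = 0..n
  err <- vapply(seq_along(p_grid), function(i) {
    p <- p_grid[i]
    expected <- sum(dbinom(0:n, n, p) * est)  # E[estimate | p]
    expected - sqrt(p * (1 - p))              # deviation from the true sd
  }, numeric(1))
  sum(w * err^2)
}

# start from the Bessel-corrected estimates as a reasonable initial guess
start <- vapply(0:n, function(k) sd(c(rep(1, k), rep(0, n - k))), numeric(1))

fit <- optim(start, loss, method = "BFGS")
round(fit$par, 4)                           # the fitted lookup table est(5, 0:5)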

The concept is from signal processing: observation is an averaging filter, so we need to design an approximate inverse filter (something like an unsharp mask). The above inverse filter was specialized for the 0/1 binomial case, but one can use the above principles to design inverse filters for additional situations.

All of the above was easy in R and we got the following estimation table (here shown graphically with the un-adjusted and also Bessel adjusted estimation procedures).

In the above graph: n is the sample size, k is how many 1’s are observed, and the y axis is the estimate of the unknown true population standard deviation.

Notice the “joint scaled” solution wiggles around a bit.

As we have said, the above graph is the estimator. For samples of size 5 pick the estimator you want to use and the k corresponding to how many 1’s you saw in your sample: then the y height is the estimate you should use for population standard deviation.

In the next graph: p is the unknown true frequency of 1’s, and the y-axis is the difference between the expected value of the estimated standard deviation and the true standard deviation.

Notice the joint scaled estimate reports a non-zero standard deviation even for all-zeros or all-ones samples. Also notice the joint estimate tries to stay near the desired difference of zero, following a smooth curve that wiggles above and below it.

The next graph is the same sort of presentation for ratios of estimated to true standard deviations.

We still see the Bessel corrected estimates are better than naive, but not as good as jointly adjusted.

And that is how you correct standard deviation estimates (at least for binomial experiments). We emphasize that nobody makes this correction in practice, and for big data and large samples the correction is entirely pointless. The variance correction is equally pointless, but since it is easier to perform it is usually added.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Data Science With R Course Series – Week 9

Mon, 11/12/2018 - 10:45

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

There are only two more weeks in the course! This week extends what you learned about Expected Value by performing an optimization and a sensitivity analysis.

The optimization and sensitivity analysis will teach you how to identify the maximum business savings for the overtime problem and see how additional factors change the amount of savings:

  • Threshold Optimization – A method used to maximize expected savings by iteratively calculating the savings at various thresholds

  • Sensitivity Analysis – A method used to investigate how sensitive the expected savings is to various parameter values that were created based on assumptions

Get ready, this week is packed full of learning!

Here is a recap of our trajectory and the course overview:

Recap: Data Science With R Course Series

You’re in the Week 9: Expected Value Optimization And Sensitivity Analysis. Here’s our game-plan over the 10 articles in this series. We’ll cover how to apply data science for business with R following our systematic process.

Week 9: Expected Value Optimization And Sensitivity Analysis

Student Feedback

Week 9: Expected Value Optimization And Sensitivity Analysis

Threshold Optimization: Maximizing Expected ROI

Last week you learned how to increase business savings by targeting employee overtime. In this module, you will use the R package purrr to determine the maximum savings for the overtime policy.
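To make the idea concrete, here is a hypothetical sketch of that kind of threshold iteration with purrr. This is not the course's actual code; the toy predictions data and the calculate_savings() logic are made up purely for illustration.

library(dplyr)
library(purrr)

# made-up stand-in for the model output used in the course
set.seed(1)
predictions <- tibble(
  churn_prob       = runif(500),            # predicted probability the employee leaves
  expected_savings = rnorm(500, 100, 50)    # hypothetical net savings if targeted
)

calculate_savings <- function(threshold, preds) {
  # target everyone at or above the threshold and sum the expected savings
  sum(preds$expected_savings[preds$churn_prob >= threshold])
}

savings_tbl <- tibble(threshold = seq(0, 1, by = 0.01)) %>%
  mutate(savings = map_dbl(threshold, calculate_savings, preds = predictions))

savings_tbl %>% filter(savings == max(savings))   # threshold with the maximum savings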

Threshold Optimization: Visualizing The Expected Savings At Various Thresholds

Create a plot using the threshold optimization to visualize the optimization results (business savings). This is a useful way to compare the optimization analysis to the employee churn business case and see how the threshold optimization produces business savings.

Sensitivity Analysis: Adjusting Parameters To Test Assumptions

From the previous module, you determined that employees with a certain percentage of overtime (threshold) should be targeted to help reduce employee churn. However, this is based on a static employee salary.

Now you will learn how business savings can change based on different employee salaries (sensitivity).

Sensitivity Analysis: Visualizing The Effect Of Scenarios & Breakeven

In this module, you will analyze the threshold analysis with the addition of another variable, employee salary. Create a profitability heatmap to visualize how business savings changes based on employee revenue amounts.
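A hypothetical sketch of that kind of sensitivity grid (again not the course's code; it reuses the made-up calculate_savings() and predictions objects from the earlier sketch, and the salary scaling factor is an arbitrary placeholder):

library(tidyr)

sensitivity_tbl <- crossing(
  threshold  = seq(0, 1, by = 0.05),
  avg_salary = seq(50000, 120000, by = 10000)
) %>%
  mutate(savings = map2_dbl(threshold, avg_salary,
                            ~ calculate_savings(.x, predictions) * (.y / 80000)))

# sensitivity_tbl is the raw material for a threshold-by-salary profitability heatmap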

Challenge #5: Threshold Optimization For Stock Options

Your overtime analysis is complete, but now you see that people with no stock options are leaving at a faster rate than people with stock options.

In this two-part challenge, you will implement the same threshold optimization and sensitivity analysis, but this time for employee stock options.

Challenge #6: Sensitivity Analysis For Stock Options

Continuing the challenge, perform a sensitivity analysis for the stock option threshold, and adjust by stock option price.

Once you complete your solution, compare it with the instructor’s solution in the challenge solution videos.

You Need To Learn R For Business



To be efficient as a data scientist, you need to learn R. Take the course that has cut data science projects in half (see this testimonial from a leading data science consultant) and has progressed data scientists more than anything they have tried before. Over 10-weeks you learn what it has taken data scientists 10-years to learn:

  • Our systematic data science for business framework
  • R and H2O for Machine Learning
  • How to produce Return-On-Investment from data science
  • And much more.

Start Learning Today!

Next Up

The next article in the Data Science With R Series covers Build A Recommendation Algorithm To Improve Decision Making.

In the final week, you will implement a 3-Step Process for creating a recommendation algorithm. The learning modules are designed to integrate critical thinking and strategy development with data-driven decision making.

As an added bonus, you’ll see a sneak preview of the Shiny Web App built in DS4B 301-R. The Recommendation Algorithm developed here plays a major role in that course.

Week 10: Build A Recommendation Algorithm To Improve Decision Making

Our Revolutionary System For Learning R For Business

Are you interested in learning R For Business? Then look no further.

  • Business Science University has the most advanced, technology intensive, and streamlined data science educational system for business on the planet.

  • We are developing a NEW INTRODUCTORY COURSE (DS4B 101-R) that delivers an amazing educational experience for learners that want to apply R to business analysis.

  • The beginner DS4B 101-R is the prequel to the intermediate/advanced DS4B 201-R course

Course Launch Date

Launch is expected at the end of November 2018 / beginning of December 2018. Sign up at Business Science University to get updates.

Course Details

Our NEW BUSINESS ANALYSIS WITH R COURSE (DS4B 101-R) combines:

  • Teaching the Data Science with R Workflow in a 100% business context: data import, data manipulation (business aggregations, time-based calculations, text manipulation, categorical manipulation, and missing data), data visualization, business reporting with RMarkdown, and advanced analysis including scaling analysis and building functions.

  • Two state-of-the-art business projects.

Project 1 – Exploratory Analysis of Digital Media Merchant

  • Business Project 1: Complete a full exploratory analysis on a simulated digital media merchant:

    • Connect to a sqlite database to retrieve transactional data,
    • Join data from 11 database tables including customers, products, and time-based transactional history
    • Cleaning data using dplyr and tidyr
    • Business Analysis: Product Level Analysis, Customer Level Analysis, Time-Based Analysis, and Business Filtering to Isolate Key Populations
  • Business Project 2: Apply the advanced skills through a Customer Segmentation project

    • Use K-Means Clustering to analyze customer purchasing habits and group them into segments
    • Apply data visualization techniques to investigate the key buying habits
    • Bonus Material and More!

To leave a comment for the author, please follow the link and comment on their blog: business-science.io - Articles.

RcppArmadillo 0.9.200.4.0

Sat, 11/10/2018 - 21:01

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new RcppArmadillo release, now at 0.9.200.4.0, based on the new Armadillo release 9.200.4 from earlier this week, is now on CRAN, and should get to Debian very soon.

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to a Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 532 (or 31 more since just the last release!) other packages on CRAN.
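For readers new to the package, here is a minimal usage sketch (our illustration, not part of the release announcement): compiling a small Armadillo function from R with Rcpp::cppFunction().

library(Rcpp)

# compile a tiny C++ function that multiplies two matrices with Armadillo
cppFunction(depends = "RcppArmadillo", '
arma::mat arma_mult(const arma::mat& A, const arma::mat& B) {
  return A * B;   // Armadillo matrix multiplication
}')

A <- matrix(rnorm(9), 3, 3)
B <- matrix(rnorm(9), 3, 3)
all.equal(arma_mult(A, B), A %*% B)   # should be TRUE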

This release once again brings a number of improvements, see below for details.

Changes in RcppArmadillo version 0.9.200.4.0 (2018-11-09)
  • Upgraded to Armadillo release 9.200.4 (Carpe Noctem)

    • faster handling of symmetric positive definite matrices by rcond()

    • faster transpose of matrices with size ≥ 512×512

    • faster handling of compound sparse matrix expressions by accu(), diagmat(), trace()

    • faster handling of sparse matrices by join_rows()

    • expanded sign() to handle scalar arguments

    • expanded operators (*, %, +, −) to handle sparse matrices with differing element types (eg. multiplication of complex matrix by real matrix)

    • expanded conv_to() to allow conversion between sparse matrices with differing element types

    • expanded solve() to optionally allow keeping solutions of systems singular to working precision

    • workaround for gcc and clang bug in C++17 mode

  • Commented-out sparse matrix test consistently failing on the fedora-clang machine CRAN, and only there. No fix without access.

  • The ‘Unit test’ vignette is no longer included.

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Gold-Mining Week 10 (2018)

Sat, 11/10/2018 - 20:16

(This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

Week 10 Gold Mining and Fantasy Football Projection Roundup now available. Go get that free agent gold!

The post Gold-Mining Week 10 (2018) appeared first on Fantasy Football Analytics.


To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

The Gamification Of Fitbit: How an API Provided the Next Level of tRaining

Sat, 11/10/2018 - 18:29

(This article was first published on Stories by Matt.0 on Medium, and kindly contributed to R-bloggers)

“Anyone can look for fashion in a boutique or history in a museum. The creative explorer looks for history in a hardware store and fashion in an airport” — Robert Wieder

As much as I like Medium for its ability to reach an audience, I prefer the customization afforded by github.io. If you want to see a far fancier version of this post please visit this page instead.

Project Inspiration

You may, or may not, have heard of the term gamification but chances are you’ve experienced it.

Gamification is the application of game-design elements, and game principles, in non-game contexts. The idea is, if you use elements of games, like linking rules and rewards into a feedback system, you can make (almost) any activity motivating and fun.

Gamification is the concept behind eLearning. In elementary school I remember all the students wanted to play The Oregon Trail in computer class. I also remember another game where you had to solve math problems before something hit the floor. Okay, maybe it wasn’t the most thrilling introduction to gamification but I remember it nonetheless.

At some point in my career, I got tired of using nano and decided I wanted to learn Vim.

It was then that I discovered two very enjoyable examples of gamification:

  • shortcutFoo teaches you shortcuts for Vim, Emacs, Command Line, Regex etc. via interval training, which is essentially spaced repetition. This helps you memorize shortcuts more efficiently.

Today, I enjoy eLearning-gamification on platforms like DuoLingo, and DataCamp.

I’ve also recently started to participate in a Kaggle competition, “PUBG Finish Placement Prediction”. Kaggle is a Google owned hangout for data science enthusiasts where they can use machine learning to solve predictive analytics problems for cash and clout. Similar to chess there are so-called Kaggle Grandmasters.

The Quest

Our laboratory studies perinatal influences on the biological embedding of early adversity and its effects on mental health outcomes. We combine genetic, epigenetic and epidemiological approaches to identify pregnant women whose offspring may potentially be at risk for adverse mental health outcomes.

My supervisor approached me with a challenge; how feasible would it be to access biometric data from 200 Fitbits?

So I bought myself a Fitbit Charge2 fitness tracker and hit the gym!

At some point I think we both realized that this project was going to be a big undertaking. Perhaps R isn’t really intended to do large scale real-time data management from API services. It’s great for static files, or static endpoints, but if you’re working with multiple participants a dedicated solution like Fitabase may work the best – or so they claim.

Nonetheless, I wanted to try out a bunch of cool new things in R, like making a personal website using blogdown, using gganimate with Rokemon, accessing the Fitbit API with httr, as well as adding a background image with some custom CSS/HTML. Is there a better way to gamify my leaRning curve? I think not.

The following was my attempt at e-learning gamification for R.

I used the blogdown package to allow me to write blog posts as R Markdown documents, knitting everything to a nice neat static website that I can push online. It was a nice opportunity to learn a bit about pandoc, Hugo, CSS/HTML lurking beneath the server side code. I decided to go with the Academic theme for Hugo, pull in as much data as I could from the Fitbit API, clean it up, and then perform some exploratory data analysis. In the process, I generated some cool animated sprites and use video game inspired visualizations.

Setting up a Fitbit Developer Account

Fitbit uses an OAuth 2.0 access token for making HTTP requests to the Fitbit API. You need to set up an account to use the API and include your token in R. Instead of reading the FITBIT DEV HELP section I would direct the reader to the better, more concise instructions here.

Now that you have an account we’re ready to do stuff in R.

Set your token up:

# You Found A Secret Area!
token = "yourToken" Using the fitbitr package

I had never made an HTTP request before and although the process is officially documented here it can be a tad overwhelming. Therefore, I initially resorted to using an R package, called fitbitr, that accesses the Fitbit API under the hood.

Unfortunately this would limit me to only accessing some basic user information, heart rate and step count data.

Getting Basic User Info

The first function in this package sends a GET request to the Get Profile resource URL.

# Extracting Resources # Get userInfo
user_info <- fitbitr::getUserInfo(token)

# Hailing a Chocobo!
# What is my stride length in meters?
strideLengthWalking <- user_info$strideLengthWalking

My stride length is 68.5.

Stride length is measured from heel to heel and determines how far you walk with each step. On average, a man’s walking stride length is 2.5 feet, or 30 inches, while a woman’s average stride length is 2.2 feet, or 26.4 inches, according to this report.

# Hitting 80 MPH # What is my running stride length
strideLengthRunning <- user_info$strideLengthRunning

My running stride length is 105.5.

The Fitbit uses your sex and height by default to gauge your stride length which could potentially be inaccurate.

# Looking for the fourth chaos emerald # What is my average daily steps?
averageDailySteps <- user_info$averageDailySteps

My average daily steps is 14214.

Considering that the recommended daily step count is 10,000, I'd say that's acceptable. That being said, there's always room for improvement.

Accessing heart rate and footsteps with the fitbitr package

I’m going to grab a week’s worth of data for a very preliminary EDA.

# Smashing buttons
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

monday_heart <- getTimeSeries(token, type = "heart", activityDetail = "1min", date = "2018-08-20", startTime = "00:00", endTime = "23:59")
monday_heart %<>% mutate(date = "2018-08-20")
monday_steps <- getTimeSeries(token, type = "steps", activityDetail = "1min", date = "2018-08-20")
monday_steps %<>% mutate(date = "2018-08-20")
monday <- monday_heart %>% full_join(monday_steps)
monday %<>% mutate(week_date = "Monday")
monday %<>% mutate(day_of_week = "1")

tuesday_heart <- getTimeSeries(token, type = "heart", activityDetail = "1min", date = "2018-08-21")
tuesday_heart %<>% mutate(date = "2018-08-21")
tuesday_steps <- getTimeSeries(token, type = "steps", activityDetail = "1min", date = "2018-08-21")
tuesday_steps %<>% mutate(date = "2018-08-21")
tuesday <- tuesday_heart %>% full_join(tuesday_steps)
tuesday %<>% mutate(week_date = "Tuesday")
tuesday %<>% mutate(day_of_week = "2")

wednesday_heart <- getTimeSeries(token, type = "heart", activityDetail = "1min", date = "2018-08-22")
wednesday_heart %<>% mutate(date = "2018-08-22")
wednesday_steps <- getTimeSeries(token, type = "steps", activityDetail = "1min", date = "2018-08-22")
wednesday_steps %<>% mutate(date = "2018-08-22")
wednesday <- wednesday_heart %>% full_join(wednesday_steps)
wednesday %<>% mutate(week_date = "Wednesday")
wednesday %<>% mutate(day_of_week = "3")

thursday_heart <- getTimeSeries(token, type = "heart", activityDetail = "1min", date = "2018-08-23")
thursday_heart %<>% mutate(date = "2018-08-23")
thursday_steps <- getTimeSeries(token, type = "steps", activityDetail = "1min", date = "2018-08-23")
thursday_steps %<>% mutate(date = "2018-08-23")
thursday <- thursday_heart %>% full_join(thursday_steps)
thursday %<>% mutate(week_date = "Thursday")
thursday %<>% mutate(day_of_week = "4")

friday_heart <- getTimeSeries(token, type = "heart", activityDetail = "1min", date = "2018-08-24")
friday_heart %<>% mutate(date = "2018-08-24")
friday_steps <- getTimeSeries(token, type = "steps", activityDetail = "1min", date = "2018-08-24")
friday_steps %<>% mutate(date = "2018-08-24")
friday <- friday_heart %>% full_join(friday_steps)
friday %<>% mutate(week_date = "Friday")
friday %<>% mutate(day_of_week = "5")

saturday_heart <- getTimeSeries(token, type = "heart", activityDetail = "1min", date = "2018-08-24")
saturday_heart %<>% mutate(date = "2018-08-24")
saturday_steps <- getTimeSeries(token, type = "steps", activityDetail = "1min", date = "2018-08-24")
saturday_steps %<>% mutate(date = "2018-08-24")
saturday <- saturday_heart %>% full_join(saturday_steps)
saturday %<>% mutate(week_date = "Saturday")
saturday %<>% mutate(day_of_week = "6")

week <- monday %>% bind_rows(tuesday) %>% bind_rows(wednesday) %>% bind_rows(thursday) %>% bind_rows(friday) %>% bind_rows(saturday)

week$date <- as.Date(week$date)

Summary Statistics

# Opening pod bay doors
week %>%
group_by(type) %>%
summarise(
total = sum(value),
minimum = min(value),
mean = mean(value),
median = median(value),
maximum = max(value),
max_time = max(time)
) %>%
knitr::kable(digits = 3) %>%
kable_styling(full_width = F)

Exploratory Data Analysis

Since this is a post about gamification I decided to do something fun with my exploratory data visualizations. I wanted to use the Rokemon package which allows me to set the theme of ggplot2 (and ggplot2 extensions) to Game Boy and Game Boy Advance themes! When convenient, I’ve combined plots with cowplot.

Let’s take a quick look at the relationship and distribution of heart rate and step count.

# Doing the thing...
g <- week %>%
  spread(type, value) %>%
  rename(heart_rate = "heart rate") %>%
  na.omit() %>%
  ggplot(aes(steps, heart_rate)) + geom_point() + geom_smooth(method="lm", se=F, colour = "#DE7243")

gb <- g + theme_gameboy()
gba <- g + theme_gba()

plot_grid(gb, gba, labels = c("", ""), align = "h")

Alternatively, we could get a better look at the data by adding marginal density plots to the scatterplots with the ggMarginal() function from the ggExtra package.
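For example (a sketch, not code from the original post), reusing the scatterplot object g built above:

library(ggExtra)
ggMarginal(g + theme_gameboy(), type = "density", fill = "#DE7243")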

Weekly trends

Let’s take a quick look at the distribution of the contiguous variables to get a better idea than the mean and median.

# Loading..... Wait, what else were you expecting?
annotations_steps <- data_frame(
x = c(45, 100, 165),
y = c(0.01, 0.01, 0.01),
label = c('walking pace', 'brisk walking pace', 'running pace'),
type = c('steps', 'steps', 'steps')
)

g <- week %>%
ggplot(aes(value)) +
geom_density(fill = "#DE7243") +
geom_text(data = annotations_steps, aes(x = x, y = y, label = label), angle = -30, hjust = 1) +
facet_grid(.~type, scales = 'free_x') +
labs(title = 'Heart Rate and Steps-per-minute over two months',
subtitle = 'Data gathered from Fitbit Charge2')

g + theme_gameboy()
g + theme_gba()

Heart rate is a little right-skewed, probably due to sleep and sedentary work. Similarly, for step count you see only a small bump just under brisk walking pace, from when I skateboarded to work.

Exercise patterns

This week I didn’t work out so I thought I’d at least look at when I was on my way to work. The figure below shows blue for heart rate/min and orange is the number of steps/min.

# You are carrying too much to be able to run
between_six_nine <- function(time) time > 7*60*60 & time < 10*60*60
is_weekday <- function(day_of_week) day_of_week %in% 1:6

week$week_date_f <- factor(week$week_date, levels=c("Monday","Tuesday","Wednesday", "Thursday", "Friday", "Saturday"))

g <- week %>%
filter(between_six_nine(time) & is_weekday(day_of_week)) %>%
spread(type, value) %>%
ggplot(aes(x = time)) +
geom_bar(aes(y = steps), color = '#DE7243', alpha = 0.3, stat = 'identity') +
geom_line(aes(y = `heart rate`), color = '#E3F24D', size = 0.8) +
facet_grid(~week_date_f) +
scale_x_continuous(breaks=c(27000, 30000, 33000, 36000), labels=c("7am", "8am", "9am", "10am"))

g + theme_gameboy()
g + theme_gba()

My activity has been pretty much the same all week since I skateboard to work every morning.

Most active time

# 60% of the time, it loads ALL the time
step_counts <- week %>%
filter(type == 'steps') %>%
group_by(day_of_week) %>%
summarise(
type = last(type),
avg_num_steps = sprintf('avg num steps: %3.0f', sum(value)/52)
)

g <- week %>%
ggplot(aes(x= value, y = fct_rev(factor(day_of_week)))) +
geom_density_ridges(scale = 2.5, fill = "#DE7243") +
geom_text(data = step_counts, nudge_y = 0.15, hjust = 0,
aes(x = 85, y = fct_rev(factor(day_of_week)), label = avg_num_steps)) +
scale_y_discrete(breaks=1:6, labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) +
facet_grid(.~type, scales = "free") +
labs(x = '', y = "Day of the Week") g + theme_gameboy()
g + theme_gba()

The distribution of steps per minute was pretty constant because, as I said, I didn't work out; this likely reflects me shuffling around to get tea.

It looks like Monday was the day I got my heart rate up the most; the bimodal peak is probably from when I was running around looking for a rental property.

Accessing the Fitbit API directly with the httr package

Eventually, I found an excellent tutorial by obrl-soil which introduced me to the httr package and gave me the confidence I needed to peruse the Fitbit DEV web API reference. Now I was able to gain access to far more sources of data.

What data is available?

A brief overview of what data is available from the Fitbit API:

# Your Boko Club is badly damaged # make a kable table for data you can access from Fitbit API
dt01 <- data.frame(Scope = c("activity",
"heartrate",
"location",
"nutrition",
"profile",
"settings",
"sleep",
"social",
"weight"),
Description = c("The activity scope includes activity data and exercise log related features, such as steps, distance, calories burned, and active minutes",
"The heartrate scope includes the continuous heart rate data and related analysis",
"The location scope includes the GPS and other location data",
"The nutrition scope includes calorie consumption and nutrition related features, such as food/water logging, goals, and plans",
"The profile scope is the basic user information",
"The settings scope includes user account and device settings, such as alarms",
"The sleep scope includes sleep logs and related sleep analysis",
"The social scope includes friend-related features, such as friend list, invitations, and leaderboard",
"The weight scope includes weight and related information, such as body mass index, body fat percentage, and goals")
)

dt01 %>%
kable("html") %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "30em", background = "#E3F24D")

What are the units of measurement?

# Loading Cutscenes You Can't Skip
# make a Kable table of measurement information
dt03 <- data.frame(unitType = c("duration",
"distance",
"elevation",
"height",
"weight",
"body measurements",
"liquids",
"blood glucose"),
unit = c("milliseconds",
"kilometers",
"meters",
"centimeters",
"kilograms",
"centimeters",
"milliliters",
"millimoles per liter"))
dt03 %>%
kable("html") %>%
kable_styling(full_width = F) %>%
column_spec(1, bold = T, border_right = T) %>%
column_spec(2, width = "30em", background = "#E3F24D")

Define a function for turning a json list into a dataframe.

# Inserting last-minute subroutines into program... # json-as-list to dataframe (for simple cases without nesting!)
jsonlist_to_df <- function(data = NULL) {
purrr::transpose(data) %>%
purrr::map(., unlist) %>%
as_tibble(., stringsAsFactors = FALSE)
}

Investigating my 10Km run

GET request to retrieve minute-by-minute heart rate data for my 10km run.

# Preparing for the mini-boss
get_workout <- function(date = NULL, start_time = NULL, end_time = NULL,
                        token = Sys.getenv('FITB_AUTH')) {
  GET(url =
        paste0('https://api.fitbit.com/1/user/-/activities/heart/date/',
               date, '/1d/1min/time/', start_time, '/', end_time, '.json'),
      add_headers(Authorization = paste0("Bearer ", token)))
}

# Get the workout for my 10Km run
got_workout <- get_workout(date = '2018-10-21', start_time = '09:29', end_time = '10:24')
workout <- content(got_workout)

# summary
workout[['activities-heart']][[1]][['heartRateZones']] <- jsonlist_to_df(workout[['activities-heart']][[1]][['heartRateZones']])

# the dataset
workout[['activities-heart-intraday']][['dataset']] <- jsonlist_to_df(workout[['activities-heart-intraday']][['dataset']])

# format the time
workout$`activities-heart-intraday`$dataset$time <- as.POSIXlt(workout$`activities-heart-intraday`$dataset$time, format = '%H:%M:%S')
lubridate::date(workout$`activities-heart-intraday`$dataset$time) <- '2018-10-21'

# find time zone
# grep("Canada", OlsonNames(), value=TRUE)
lubridate::tz(workout$`activities-heart-intraday`$dataset$time) <- 'Canada/Eastern'

Let’s take a look at the summary for my 10Km run:

# Farming Hell Cows
workout$`activities-heart`[[1]]$heartRateZones %>% kable() %>% kable_styling(full_width = F)

obrl-soil used the MyZone Efforts Points (MEPS) which is calculated minute-by-minute as a percentage of max heart rate. It measures the effort put in – The more points the better. Another example of gamification in action.

# Looting a chest
meps_max <- function(age = NULL) { 207 - (0.7 * age) }

Mine is 186.

Now we create a tribble with 4 heart rate ranges showing the lower and upper bounds, and use mutate() to calculate my corresponding heart rate bounds for each MEPS zone.

# Taking the hobbits to Isengard
my_MEPS <- tribble(~MEPS, ~hr_range, ~hr_lo, ~hr_hi,
1, '50-59%', 0.50, 0.59,
2, '60-69%', 0.60, 0.69,
3, '70-79%', 0.70, 0.79,
4, '>=80', 0.80, 1.00) %>%
mutate(my_hr_low = floor(meps_max(30) * hr_lo),
my_hr_hi = ceiling(meps_max(30) * hr_hi))
my_MEPS

## # A tibble: 4 x 6
## MEPS hr_range hr_lo hr_hi my_hr_low my_hr_hi
##
## 1 1 50-59% 0.5 0.59 93 110
## 2 2 60-69% 0.6 0.69 111 129
## 3 3 70-79% 0.7 0.79 130 147
## 4 4 >=80 0.8 1 148 186

With the equation now defined let’s calculate my total MEPS:

# Checkpoint!
mep <- mutate(workout$`activities-heart-intraday`$dataset,
meps = case_when(value >= 146 ~ 4,
value >= 128 ~ 3,
value >= 109 ~ 2,
value >= 91 ~ 1,
TRUE ~ 0)) %>%
summarise("Total MEPS" = sum(meps))

Wow it’s 216!

I’m not sure what that exactly means but apparently the maximum possible MEPS in a 42-minute workout is 168 and since I ran this 10Km in 54:35 I guess that’s good?

I’d like to post sub 50 minutes on my next 10Km run but I’m not sure if I should be aiming to shoot for a greater percentage of peak heart rate minutes or not – guess I will need to look into this.

Minute-by-minute sleep data for one night

Let’s examine my sleep patterns last night.

# Resting at Campfire
get_sleep <- function(startDate = NULL, endDate = NULL, token = Sys.getenv('FITB_AUTH')){
  GET(url = paste0('https://api.fitbit.com/1.2/user/-/sleep/date/', startDate, "/", endDate, '.json'),
      add_headers(Authorization = paste0("Bearer ", token)))
}

# make sure that there is data for those days otherwise it tosses an error
got_sleep <- get_sleep(startDate = "2018-08-21", endDate = "2018-08-22")
sleep <- content(got_sleep)

dateRange <- seq(as.Date("2018-08-21"), as.Date("2018-08-22"), "days")

sleep_pattern <- NULL
for(j in 1:length(dateRange)){
sleep[['sleep']][[j]][['levels']][['data']] <- jsonlist_to_df(sleep[['sleep']][[j]][['levels']][['data']])
tmp <- sleep$sleep[[j]]$levels$`data`
sleep_pattern <- bind_rows(sleep_pattern, tmp)
}

Okay now that the data munging is complete, let’s look at my sleep pattern.

# Now entering... The Twilight Zone
g <- sleep_pattern %>% group_by(level, seconds) %>%
summarise() %>%
summarise(seconds = sum(seconds)) %>%
mutate(percentage = seconds/sum(seconds)) %>%
ggplot(aes(x = "", y = percentage, fill = c("S", "A", "R"))) +
geom_bar(width = 1, stat = "identity") +
theme(axis.text.y = element_blank(),
axis.text.x = element_blank(), axis.line = element_blank(), plot.caption = element_text(size = 5), plot.title = element_blank()) +
labs(fill = "class", x = NULL, y = NULL, title = "Sleep stages", caption = "A = Awake; R = Restless; S = Asleep") +
coord_polar(theta = "y", start = 0) +
scale_fill_manual(values = c("#FF3F3F", "#2BD1FC", "#BA90A6"))

g + theme_gameboy()
g + theme_gba()

A pie chart is probably not the best way to show this data. Let’s visualize the distribution with a box plot.

# Entering Cheat Codes!
g <- ggplot(sleep_pattern, aes(y=log10(seconds), x=level)) +
geom_boxplot(color="#031300", fill='#152403') +
labs(x = "", title = 'Sleep patterns over a month',
subtitle = 'Data gathered from Fitbit Charge2') +
theme(legend.position = "none")

g + theme_gameboy()
g + theme_gba()

An even better way to visualize the distribution would be to use a violin plot with the raw data points overlaid.

# Neglecting Sleep...
g <- ggplot(sleep_pattern, aes(y=log10(seconds), x=level)) +
geom_violin(color="#031300", fill='#152403') +
geom_point() +
labs(x = "", title = 'Sleep patterns over a month',
subtitle = 'Data gathered from Fitbit Charge2') +
theme(legend.position = "none")

g + theme_gameboy()
g + theme_gba()

Get Daily Activity Patterns For 3 Months

You can do API requests for various periods from the Fitbit Activity and Exercise Logs but since I’ve only had mine a couple months I’ll use the 3m period.

I will also need to trim off any day’s which are in the future otherwise they’ll appear as 0 calories in the figures. It’s best to use the Sys.Date() function rather than hardcoding the date when doing EDA, making a Shiny app, or parameterizing a RMarkdown file. This way you can explore different time periods without anything breaking.

I cannot remember when I started wearing my Fitbit but we can figure that out with the following code:

# ULTIMATE IS READY! # Query how many days since you've had fitbit for
inception <- user_info$memberSince

I’ve had my Fitbit since 2018–08–20.

Let’s gather data from September 20th until November 6th 2018.

# Catching them all!

### Calories
get_calories <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
get_calories <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/calories/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_calories <- get_calories(baseDate = "2018-11-20", period = "3m")
calories <- content(got_calories)
# turn into df
calories[['activities-calories']] <- jsonlist_to_df(calories[['activities-calories']])
# assign easy object and rename
calories <- calories[['activities-calories']]
colnames(calories) <- c("dateTime", "calories")

### STEPS
get_steps <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/steps/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_steps <- get_steps(baseDate = "2018-11-20", period = "3m")
steps <- content(got_steps)
# turn into df
steps[['activities-steps']] <- jsonlist_to_df(steps[['activities-steps']])
# assign easy object and rename
steps <- steps[['activities-steps']]
colnames(steps) <- c("dateTime", "steps")

### DISTANCE
get_distance <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/distance/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_distance <- get_distance(baseDate = "2018-11-20", period = "3m")
distance <- content(got_distance)
# turn into df
distance[['activities-distance']] <- jsonlist_to_df(distance[['activities-distance']])
# assign easy object and rename
distance <- distance[['activities-distance']]
colnames(distance) <- c("dateTime", "distance")

### FLOORS
get_floors <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/floors/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_floors <- get_floors(baseDate = "2018-11-20", period = "3m")
floors <- content(got_floors)
# turn into df
floors[['activities-floors']] <- jsonlist_to_df(floors[['activities-floors']])
# assign easy object and rename
floors <- floors[['activities-floors']]
colnames(floors) <- c("dateTime", "floors")

### ELEVATION
get_elevation <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/elevation/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_elevation <- get_elevation(baseDate = "2018-11-20", period = "3m")
elevation <- content(got_elevation)
# turn into df
elevation[['activities-elevation']] <- jsonlist_to_df(elevation[['activities-elevation']])
# assign easy object and rename
elevation <- elevation[['activities-elevation']]
colnames(elevation) <- c("dateTime", "elevation")

### minutesSedentary
get_minutesSedentary <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/minutesSedentary/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_minutesSedentary <- get_minutesSedentary(baseDate = "2018-11-20", period = "3m")
minutesSedentary <- content(got_minutesSedentary)
# turn into df
minutesSedentary[['activities-minutesSedentary']] <- jsonlist_to_df(minutesSedentary[['activities-minutesSedentary']])
# assign easy object and rename
minutesSedentary <- minutesSedentary[['activities-minutesSedentary']]
colnames(minutesSedentary) <- c("dateTime", "minutesSedentary")

### minutesLightlyActive
get_minutesLightlyActive <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/minutesLightlyActive/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_minutesLightlyActive <- get_minutesLightlyActive(baseDate = "2018-11-20", period = "3m")
minutesLightlyActive <- content(got_minutesLightlyActive)
# turn into df
minutesLightlyActive[['activities-minutesLightlyActive']] <- jsonlist_to_df(minutesLightlyActive[['activities-minutesLightlyActive']])
# assign easy object and rename
minutesLightlyActive <- minutesLightlyActive[['activities-minutesLightlyActive']]
colnames(minutesLightlyActive) <- c("dateTime", "minutesLightlyActive")

### minutesFairlyActive
get_minutesFairlyActive <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/minutesFairlyActive/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_minutesFairlyActive <- get_minutesFairlyActive(baseDate = "2018-11-20", period = "3m")
minutesFairlyActive <- content(got_minutesFairlyActive)
# turn into df
minutesFairlyActive[['activities-minutesFairlyActive']] <- jsonlist_to_df(minutesFairlyActive[['activities-minutesFairlyActive']])
# assign easy object and rename
minutesFairlyActive <- minutesFairlyActive[['activities-minutesFairlyActive']]
colnames(minutesFairlyActive) <- c("dateTime", "minutesFairlyActive")

### minutesVeryActive
get_minutesVeryActive <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/minutesVeryActive/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_minutesVeryActive <- get_minutesVeryActive(baseDate = "2018-11-20", period = "3m")
minutesVeryActive <- content(got_minutesVeryActive)
# turn into df
minutesVeryActive[['activities-minutesVeryActive']] <- jsonlist_to_df(minutesVeryActive[['activities-minutesVeryActive']])
# assign easy object and rename
minutesVeryActive <- minutesVeryActive[['activities-minutesVeryActive']]
colnames(minutesVeryActive) <- c("dateTime", "minutesVeryActive")

### activityCalories
get_activityCalories <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/activityCalories/date/', baseDate, "/", period, '.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_activityCalories <- get_activityCalories(baseDate = "2018-11-20", period = "3m")
activityCalories <- content(got_activityCalories)
# turn into df
activityCalories[['activities-activityCalories']] <- jsonlist_to_df(activityCalories[['activities-activityCalories']])
# assign easy object and rename
activityCalories <- activityCalories[['activities-activityCalories']]
colnames(activityCalories) <- c("dateTime", "activityCalories")

##### Join multiple dataframes with purrr::reduce and dplyr::left_join
activity_df <- list(calories, steps, distance, floors, elevation, activityCalories, minutesSedentary, minutesLightlyActive, minutesFairlyActive, minutesVeryActive) %>%
purrr::reduce(left_join, by = "dateTime")

# Add the dateTime to this dataframe
activity_df$dateTime <- as.Date(activity_df$dateTime)

names <- c(2:ncol(activity_df))
activity_df[,names] <- lapply(activity_df[,names], as.numeric)

# trim off any days that haven't happened yet
activity_df %<>% filter(dateTime <= "2018-11-06")

Get Recent Activity Types

# We're giving it all she's got!

get_frequentActivities <- function(baseDate = NULL, period = NULL, token = Sys.getenv('FITB_AUTH')){
GET(url = paste0('https://api.fitbit.com/1/user/-/activities/recent.json'),
add_headers(Authorization = paste0("Bearer ", token)))
}

got_frequentActivities <- get_frequentActivities(baseDate = "2018-11-20", period = "3m")
frequentActivities <- content(got_frequentActivities)

# This is a list object let's look at how many frequent activities are logged
length(frequentActivities)
## [1] 5

# Take a look at the object with str()
str(frequentActivities)
## List of 5
## $ :List of 6
## ..$ activityId : int 2131
## ..$ calories : int 0
## ..$ description: chr ""
## ..$ distance : int 0
## ..$ duration : int 3038000
## ..$ name : chr "Weights"
## $ :List of 6
## ..$ activityId : int 90009
## ..$ calories : int 0
## ..$ description: chr "Running - 5 mph (12 min/mile)"
## ..$ distance : int 0
## ..$ duration : int 1767000
## ..$ name : chr "Run"
## $ :List of 6
## ..$ activityId : int 90013
## ..$ calories : int 0
## ..$ description: chr "Walking less than 2 mph, strolling very slowly"
## ..$ distance : int 0
## ..$ duration : int 2407000
## ..$ name : chr "Walk"
## $ :List of 6
## ..$ activityId : int 90001
## ..$ calories : int 0
## ..$ description: chr "Very Leisurely - Less than 10 mph"
## ..$ distance : int 0
## ..$ duration : int 4236000
## ..$ name : chr "Bike"
## $ :List of 6
## ..$ activityId : int 15000
## ..$ calories : int 0
## ..$ description: chr ""
## ..$ distance : int 0
## ..$ duration : int 1229000
## ..$ name : chr "Sport"

I would never have considered myself a Darwin or a Thoreau but apparently strolling very slowly is my favorite activity in terms of time spent.

You can see that my Fitbit has also logged times for Weights, Sports and Biking which is likely from when I’ve manually logged my activities. There’s a possibility that Fitbit is registering Biking for when I skateboard.

Correlogram of Activity

Previously I had always used the corrplot package to create a correlation plot; however, it doesn’t play nicely with ggplot meaning you cannot add Game Boy themes easily. Nonetheless, I was able to give it a retro-looking palette with some minor tweaking.

Since I had two colors in mind from the original gameboy, and knew their hex code, I was able to generate a palette from this website.

# Aligning Covariance Matrices # drop dateTime
corr_df <- activity_df[,2:11]

# Correlation matrix
corr <- cor(na.omit(corr_df))
corrplot(corr, type = "upper", bg = "#9BBB0E", tl.col = "#565656", col = c("#CADCA0", "#B9CD93", "#A8BE85", "#97AF78", "#86A06B", "#75915E", "#648350", "#537443", "#426536", "#315629", "#20471B", "#0F380E"))

In a correlation plot the color of each circle indicates the magnitude of the correlation, and the size of the circle indicates its significance.

After a bit of searching for a ggplot2 extension I was able to use ggcorrplot which allowed me to use gameboy themes again!

# Generating textures...
ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 2,
tl.cex = 8,
show.legend = FALSE,
colors = c( "#306230", "#306230", "#0F380F" ),
title="Correlogram",
ggtheme=theme_gameboy)

# Game Over. Loading previous save

ggcorrplot(corr, hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 2,
tl.cex = 8,
show.legend = FALSE,
colors = c( "#3B7AAD", "#56B1F7", "#1D3E5D" ),
title="Correlogram",
ggtheme=theme_gba)

Exploring Activity

# Link saying "hyahhh!"

# Static
g <- activity_df %>%
ggplot(aes(x=dateTime, y=calories)) +
geom_line(colour = "black") +
geom_point(shape = 21, colour = "black", aes(fill = calories), size = 5, stroke = 1) +
xlab("") +
ylab("Calorie Expenditure")

g + theme_gameboy() + theme(legend.position = "none")
g + theme_gba() + theme(legend.position = "none")

# Panick! at the Discord...

# gganimate
g <- activity_df %>%
ggplot(aes(x=dateTime, y=calories)) +
geom_line(colour = "black") +
geom_point(shape = 21, colour = "black", aes(fill = calories), size = 5, stroke = 1) +
transition_time(dateTime) +
shadow_mark() +
ease_aes('linear') +
xlab("") +
ylab("Calorie Expenditure")

g + theme_gba() + theme(legend.position = "none")

Distance is determined by using your steps and your estimated stride length (for the height you put in).
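As a rough back-of-the-envelope check (our arithmetic, not from the original post, and assuming the stride length of 68.5 reported earlier is in centimeters, per the units table above):

steps_per_day <- 14214          # average daily steps from getUserInfo()
stride_m <- 68.5 / 100          # walking stride length in meters
steps_per_day * stride_m / 1000 # roughly 9.7 km of walking per day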

I’ve also made plots for Distance, Steps, Elevationand Floorsbut you’ll have to check out this page to see them.

Closing thoughts

Even though Fitbit offers a nice dashboard for a single user, it's not scalable. By accessing the data directly one can ask the questions they want of 200 individuals, or more. If one was so inclined, they could even build a fancy Shiny dashboard with bespoke visualizations.

If you have any questions or comments you can always reach me on LinkedIn. Till then, see you in the next post!

# Wubba Lubba Dub Dub

# https://www.spriters-resource.com/game_boy_advance/kirbynim/sheet/15585/
sprite_sheet <- png::readPNG("kirby.png")

Nframes <- 11 # number of frames to extract
width <- 29 # width of a frame
sprite_frames <- list() # storage for the extracted frames

# Not equal sized frames in the sprite sheet. Need to compensate for each frame
offset <- c(0, -4, -6, -7, -10, -16, -22, -26, -28, -29, -30)

# Manually extract each frame
for (i in seq(Nframes)) {
sprite_frames[[i]] <- sprite_sheet[120:148, (width*(i-1)) + (1:width) + offset[i], 1:3]
}

# Function to convert a sprite frame to a data.frame
# and remove any background pixels i.e. #00DBFF
sprite_frame_to_df <- function(frame) {
plot_df <- data_frame(
fill = as.vector(as.raster(frame)),
x = rep(1:width, width),
y = rep(width:1, each=width)
) %>%
filter(fill != '#00DBFF')
}

sprite_dfs <- sprite_frames %>%
map(sprite_frame_to_df) %>%
imap(~mutate(.x, idx=.y))

fill_manual_values <- unique(sprite_dfs[[1]]$fill)
fill_manual_values <- setNames(fill_manual_values, fill_manual_values)

mega_df <- dplyr::bind_rows(sprite_dfs)

p <- ggplot(mega_df, aes(x, y, fill=fill)) +
geom_tile(width=0.9, height=0.9) +
coord_equal(xlim=c(1, width), ylim=c(1, width)) +
scale_fill_manual(values = fill_manual_values) +
theme_gba() +
xlab("") +
ylab("") +
theme(legend.position = 'none', axis.text=element_blank(), axis.ticks = element_blank())

panim <- p +
transition_manual(idx, seq_along(sprite_frames)) +
labs(title = "gganimate Kirby")

gganimate::animate(panim, fps=30, width=400, height=400)

The Gamification Of Fitbit: How an API Provided the Next Level of tRaining was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


To leave a comment for the author, please follow the link and comment on their blog: Stories by Matt.0 on Medium.

Detailed introduction of “myprettyreport” R package

Sat, 11/10/2018 - 16:38

(This article was first published on R – MYHAPPYDATA BLOG, and kindly contributed to R-bloggers)

Package introduction:
Package myprettyreport helps to export ggplot2 graphs into a good-looking PDF file in a clear and easy way with a wide range of flexibility. It has a modular structure so the report elements can be combined in many ways.

Installation:
This package is currently only available on GitHub, so the proper way to install it is:

#install.packages("devtools") devtools::install_github("tarkomatas/myprettyreport")

Quick overview:

Functions of the package:
start_report() is the first mandatory function to generate the report.
add_cover_page() function generates the cover page of the report.
add_new_page() function adds a single report page to the report.
add_multiple_page() function adds multiple report pages to the report.
end_report() function generates the final output and closes the process.

Quick example:

library(ggplot2)
sample_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = 'lm', color = "#f44242", fill = "#fd9068")

library(magick)
sample_logo <- image_read("http://myhappydata.com/img/1.png")

library(myprettyreport)
start_report() %>%
  add_cover_page(
    logo = sample_logo,
    logo_size = 0.3
  ) %>%
  add_new_page(
    plot = sample_plot,
    need_header = TRUE,
    logo = sample_logo,
    logo_size = 0.2
  ) %>%
  end_report()

Result:


Basics

A minimal example which generates a PDF file but it contains just a blank page:

start_report() %>% end_report()

report output

As we can see, each element can be chained to the next with the magrittr::%>% (pipe) function.

Inserting a new report page into the PDF file creates a skeleton ggplot2 plot by default.

start_report() %>% add_new_page() %>% end_report()

Now let’s create our first very own ggplot2 object and export it. The report syntax is almost the same except we have to add our ggplot2 object to a report as a plot parameter value.

library(ggplot2)
sample_plot <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = 'lm', color = "#f44242", fill = "#fd9068")

start_report() %>%
  add_new_page(plot = sample_plot) %>%
  end_report()

The result will be equivalent so far if we just use the ggplot2 built in export function:

ggsave("report_output.pdf", width = 21, height = 29.7, units = "cm")

output_ggsave
output_myprettyreport

Although the result is the same, myprettyreport makes it a little easier to modify the report layout. We could create a header and a footer section like this:

start_report() %>%
  add_new_page(
    plot = sample_plot,
    need_header = TRUE,
    need_footer = TRUE,
    footer_text = "Copyright © Tamas Marko | 2018 | myhappydata.com") %>%
  end_report()

report_output

It is also possible to add a logo to the top left corner. All we need to do is load a logo file into R and pass it as the logo parameter. Optionally, the size of the logo can be specified (valid values are between 0 and 1).

library(magick)
sample_logo <- image_read("http://myhappydata.com/img/1.png")

start_report() %>%
  add_new_page(
    plot = sample_plot,
    need_header = TRUE,
    need_footer = TRUE,
    footer_text = "Copyright © Tamas Marko | 2018 | myhappydata.com",
    logo = sample_logo,
    logo_size = 0.2) %>%
  end_report()

report_output

Currently only 2 themes have been implemented. Now we change the theme from the default to the one called flashy. In this case we also have to modify the default header and footer colors, because the default text color is white.

start_report() %>%
  add_new_page(
    plot = sample_plot,
    need_header = TRUE,
    need_footer = TRUE,
    footer_text = "Copyright © Tamas Marko | 2018 | myhappydata.com",
    logo = sample_logo,
    logo_size = 1,
    theme = "flashy",
    header_color = "#f44242",
    footer_color = "#3d3c3c") %>%
  end_report()

report_output

We may notice that if we use this theme, the position of the logo also changes (it is now in the bottom left corner), so the size parameter of the logo will probably need to be reconfigured as well.

The next step could be adding one more page to our report. In this case the preferred way is to use the add_multiple_page() function because it results in a more concise syntax:

sample_plot2 <- ggplot(airquality, aes(x=Temp)) +
  geom_histogram(binwidth=1, fill = "#f66f6f")

plot_list <- list(sample_plot, sample_plot2)

start_report() %>%
  add_multiple_page(
    plot = plot_list,
    need_header = TRUE,
    need_footer = TRUE,
    page_number = rep("Copyright © Tamas Marko | 2018 | myhappydata.com", length(plot_list)),
    logo = sample_logo,
    logo_size = 1,
    theme = "flashy",
    header_color = "#f44242",
    footer_color = "#3d3c3c") %>%
  end_report()

It is important to note that the add_multiple_page() function was used here (not add_new_page()), and that it has a page_number parameter instead of footer_text! So in this case we have to repeat the footer text value once per page.

report_output

Now let’s add a cover page to our report. We have to insert the add_cover_page() function after the start_report() function.

start_report() %>%
  add_cover_page() %>%
  add_multiple_page(
    plot = plot_list,
    need_header = TRUE,
    need_footer = TRUE,
    page_number = rep("Copyright © Tamas Marko | 2018 | myhappydata.com", length(plot_list)),
    logo = sample_logo,
    logo_size = 1,
    theme = "flashy",
    header_color = "#f44242",
    footer_color = "#3d3c3c") %>%
  end_report()

[Figure: report_output]

It is also possible to change the theme of the cover page if we want the same “flashy” theme everywhere:

start_report() %>%
  add_cover_page(
    theme = "flashy"
  ) %>%
  add_multiple_page(
    plot = plot_list,
    need_header = TRUE,
    need_footer = TRUE,
    page_number = rep("Copyright © Tamas Marko | 2018 | myhappydata.com", length(plot_list)),
    logo = sample_logo,
    logo_size = 1,
    theme = "flashy",
    header_color = "#f44242",
    footer_color = "#3d3c3c") %>%
  end_report()

[Figure: report_output]

Now let’s add multiple plots to a single page:

start_report() %>%
  add_new_page(
    plot = plot_list,
    plot_hpos = c(1, 2),
    plot_vpos = c(1, 2),
    plot_area_layout = grid::grid.layout(2, 2, widths = c(1, 1), heights = c(1, 1))
  ) %>%
  end_report()

In this case we have to specify the layout of the plot area with the grid::grid.layout() function, and we also have to set the position of every plot (plot_hpos and plot_vpos). For example, with a 2×2 plot area, plot_hpos = c(1, 2) means that the first plot in plot_list goes on the left side of the page and the second one on the right.
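To get a feel for which cells plot_hpos and plot_vpos refer to, here is a small sketch of my own (not taken from the package documentation) that simply previews the same 2×2 layout with grid::grid.show.layout():

library(grid)

## Build the same 2 x 2 layout that is passed to add_new_page() above and draw
## its cell boundaries; the row/column indices are what plot_vpos/plot_hpos refer to.
lay <- grid.layout(2, 2, widths = unit(c(1, 1), "null"), heights = unit(c(1, 1), "null"))
grid.show.layout(lay)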

Of course, we could also use other libraries to reach the same goal. My personal recommendation is the patchwork package. As mentioned already, the result will be the same:

library(patchwork)

multiple_plot_one_page <- sample_plot + patchwork::plot_spacer() +
  patchwork::plot_spacer() + sample_plot2 +
  patchwork::plot_layout(ncol = 2, nrow = 2)

start_report() %>%
  add_new_page(
    plot = multiple_plot_one_page
  ) %>%
  end_report()

[Figure: report_output]
[Figure: report_output_patchwork]

Finally, let's create a report that has multiple pages and multiple plots on every page:

plot_list <- list(
  list(sample_plot, sample_plot2),
  list(sample_plot2, sample_plot)
)

start_report() %>%
  add_multiple_page(
    plot = plot_list,
    plot_hpos = c(1, 2),
    plot_vpos = c(1, 2),
    plot_area_layout = grid::grid.layout(2, 2, widths = c(1, 1), heights = c(1, 1))
  ) %>%
  end_report()

Comparing this syntax to the previous one, the only difference is that we need to create sublists inside our list: every plot defined within a sublist will appear on the same report page.
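As a sketch of how such a nested list could be built programmatically (this is my own illustration, not part of the original example; the grouping by cyl is arbitrary), one page per group with two plots each:

## Hypothetical helper: one sub-list (= one report page) per cylinder group of mtcars.
pages <- lapply(split(mtcars, mtcars$cyl), function(d) {
  list(
    ggplot(d, aes(x = wt, y = mpg)) + geom_point(),
    ggplot(d, aes(x = hp)) + geom_histogram(bins = 10)
  )
})
## `pages` has the same nested structure as plot_list above, so it could be
## passed to add_multiple_page() in the same way.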

Advanced features

Export multiple ggplot2 objects into multiple files

create_multiple_report <- function(db, xvar, yvar) {
  sample_plot <- ggplot(data = db, mapping = aes_string(x = xvar, y = yvar)) +
    geom_point() +
    stat_smooth(method = 'lm', color = "#f44242", fill = "#fd9068")

  start_report(
    filename = paste0(xvar, ".pdf")
  ) %>%
    add_new_page(plot = sample_plot) %>%
    end_report()
}

lapply(colnames(mtcars)[5:7], create_multiple_report, db = mtcars, yvar = "mpg")

[Figures: drat, qsec, wt (one PDF report per variable)]

Customize the report with extra grid parameters

The package allows us to pass extra grid parameters to the report to customize the final output. This can be useful if we need, for example, an additional text field. First of all, we have to know the default layout structure of the report. If you are not familiar with the basics of grid layouts, it is recommended to read up on that topic first. Below are the layout structures, as produced by the grid::grid.show.layout() function:

[Figures: cover_page_theme_basic, report_page_theme_basic]

[Figures: cover_page_theme_flashy, report_page_theme_flashy]

So, for example, if we want to insert an extra text field into the report page header when using the basic theme, the syntax is the following:

rp_e_layout_params <- function() {
  grid::grid.text(
    "example text",
    y = 0.7,
    vp = grid::viewport(layout.pos.row = 1, layout.pos.col = 1)
  )
}

start_report() %>%
  add_new_page(
    plot = sample_plot,
    need_header = TRUE,
    extra_layout_params = rp_e_layout_params
  ) %>%
  end_report()

To leave a comment for the author, please follow the link and comment on their blog: R – MYHAPPYDATA BLOG.

4 ways to be more efficient using RStudio’s Code Snippets, with 11 ready to use examples

Sat, 11/10/2018 - 13:00

(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)

Introduction

In this post we will look at yet another productivity-increasing feature of the RStudio IDE – Code Snippets. Code Snippets let us easily insert and potentially execute predefined pieces of code, and they work not just for R code but for many other languages as well.

In this post we will cover 4 different ways to increase productivity using Code Snippets and provide 11 real-life examples of their use that you can take advantage of instantly.

How do Code Snippets work?

Using, Viewing and editing snippets
  1. In RStudio, we can browse and define snippets under Tools -> Global Options... -> Code -> Edit Snippets window
  2. When typing code, the snippet will appear as an auto-complete option (similar to function names) if we type the first few letters of its name
  3. Use Shift+Tab to insert the snippet immediately or pick the snippet from the auto-complete list (by clicking or scrolling on it and pressing Tab)

Note that as there is no auto-completion when editing R Markdown documents, we need to use the Shift+Tab method exclusively in that case.

Sharing and exporting Snippets

Once we customize some snippets they are automatically saved to our ~/.R/snippets directory, one file per language. We can use these files to share our snippets with others, but also to edit them directly in the respective file, without the need to click through the RStudio menus. For languages that have a .snippets file present, RStudio will automatically load the snippets from that file first.
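For illustration, a shared r.snippets file could look something like this (the snippet names and bodies here are made up for the example, not RStudio defaults):

# ~/.R/snippets/r.snippets
snippet lib
    library(${1:package})

snippet msg
    message(sprintf("${1:format}", ${2:values}))

Remember that the snippet bodies have to be indented.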

Four common use-case scenarios

1. Automatically insert boilerplate or template-style code

The first and probably most frequent use of the Code Snippets feature is to quickly insert predefined pieces of code that require a lot of typing with little alternation, a.k.a. boilerplate code. A good illustration is a snippet covering a tryCatch block:

snippet tryc
    ${1:variable} <- tryCatch({
        ${2}
    }, warning = function(w) {
        message(sprintf("Warning in %s: %s", deparse(w[["call"]]), w[["message"]]))
        ${3}
    }, error = function(e) {
        message(sprintf("Error in %s: %s", deparse(e[["call"]]), e[["message"]]))
        ${4}
    }, finally = {
        ${5}
    })

Note that the snippet definition is indented using tabs instead of spaces.

After defining this Snippet and running it we will automatically get a good template for the block and we can focus on writing the important parts:

The numbered sections prefixed with $ such as ${2} let us define sections to which the cursor will jump after pressing Tab. We can also use ${1:predefinedvalue} to predefine a value for the sections.
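For instance, a minimal illustrative snippet of my own (similar in spirit to RStudio's built-in function snippet) that combines a pre-filled section with plain jump targets could be:

snippet fun
    ${1:name} <- function(${2:args}) {
        ${3:body}
    }

After insertion the cursor lands on name, and each further Tab press moves it to the next numbered section.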

Another example of this type of use may be a testthat block that quickly prepares a unit-testing file:

snippet tt
    context("${1}")
    # ${2} ----------
    test_that(
        "${2}",
        expect_${3}(${4})
    )

2. Pre-fill code to be run quickly

The second scenario where Code Snippets come in really handy is using them in the console for blocks of code that we run often in certain situations. One such example is attaching the packages we use in a particular context. For example, when developing an R package, the following may be handy:

snippet dd
    "library('devtools'); library('testthat'); library('pryr')"

With this snippet, after pressing dd and then Shift+Tab in the console, the library statements will appear and we can just press enter to run them and attach the mentioned packages. We can of course make separate snippets for example for attaching packages we use for interactive data analysis and plotting. This is one way to keep our .Rprofile clean and still have packages easily available when needed.
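For instance, an analogous snippet of my own for interactive analysis sessions (the package choice is purely an example) could be:

snippet da
    library('dplyr'); library('ggplot2')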

Another example for this scenario is to quickly run a benchmark comparing two or more pieces of code and visualize the results with a boxplot to get an overview:

snippet mm
    bench <- microbenchmark::microbenchmark(
        times = ${1:1:100},
        ${2:one} = ${3},
        ${4:two} = ${5}
    )
    if (requireNamespace("highcharter")) {
        highcharter::hcboxplot(bench[["time"]], bench[["expr"]], outliers = FALSE)
    } else {
        boxplot(bench, outline = FALSE)
    }

3. Execute code combined with rstudioapi

The one scenario where RStudio really shines is combining multiple features it offers. We can neatly combine the use of snippets, rstudioapi and the Terminal feature that we discussed previously for an amazing variety of productivity boosts.

Just one practical example, convenient when writing a blogdown site, is to instantly serve a preview of the blog in a separate session via the Terminal and view the site in the RStudio Viewer in one go. This is handy especially in the RStudio Server setting, where serving the site in the same session can make the IDE sluggish:

snippet ss
    `r eval({
        nocon <- function(link = 'http://127.0.0.1:9999') {
            inherits(suppressWarnings(try({
                con <- url(link, open = 'rb')
                close(con)
            }, silent = TRUE)), 'try-error')
        }
        if (nocon()) {
            termId <- rstudioapi::terminalExecute(
                'R -q -e \"blogdown::serve_site(port = 9999, browser = FALSE)\"',
                show = FALSE
            )
            while (nocon() && !identical(rstudioapi::terminalExitCode(termId), 1L)) {
                Sys.sleep(0.25)
                cat(".")
            }
        }
        if (identical(rstudioapi::terminalExitCode(termId), 1L)) {
            cat(rstudioapi::terminalBuffer(termId), sep = "\n")
        } else {
            rstudioapi::viewer('http://127.0.0.1:9999')
        }
    })`

After pressing ss and Shift+Tab, the site will be served in a separate R Session and previewed in the viewer.

Using eval(expression) like above lets us execute R code in snippets. This gives a lot of flexibility, and even more so when combined with eval(parse(text = "code as character string")).
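As a small illustration of this pattern (my own example, not one of the snippets from the post), a snippet that types today's date at the cursor could look like:

snippet dt
    `r eval(parse(text = "format(Sys.Date(), '%Y-%m-%d')"))`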

4. Execute code and paste result at cursor

The fourth option is to inject text following the cursor using $$. A simple but potentially powerful use of this feature is to pass commands to be executed via base R’s system() and get the results directly at our cursor:

snippet $$
    `r eval(parse(text = "system('$$', intern = TRUE)"))`

With the above, when typing $$ls into the editor and pressing Shift+Tab, we will see the list of files present in our working directory placed at our cursor.

Another handy use of this feature is to be able to quickly get a reproducible object definition by deparsing it:

snippet $$
    `r paste("$$ <-", deparse(eval(parse(text="$$")), width.cutoff = 500L))`
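For instance (my own illustration of the intended behaviour), typing $$letters[1:3] and pressing Shift+Tab should expand to something like:

letters[1:3] <- c("a", "b", "c")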

TL;DR – Just give me the snippets

The promised 11 potentially helpful snippets can be found here.


To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog.

One-arm Bayesian Adaptive Trial Simulation Code

Sat, 11/10/2018 - 01:00

(This article was first published on R – BioStatMatt, and kindly contributed to R-bloggers)

I had an opportunity recently to design a Bayesian adaptive trial with several interim analyses that allow for early stopping due to efficacy or futility. The code below implements the one-arm trial described in the great introductory article by Ben Saville et al. Three of the coauthors are current or former colleagues at Vandy. The figures below the code show the sample size distribution under the null hypothesis and an alternative hypothesis, respectively.

## Simulate Bayesian single-arm adaptive trial
## Allow early termination due to futility or efficacy
## Binary outcome
## Beta-binomial:
##   p ~ beta(a, b)
##   x_i ~ binomial(p) i = 1..n
##   p|x ~ beta(a + sum(x), b + n - sum(x))
## Efficacy at interim t if Pr(p > p_0 | x_{(t)}) > \gamma_e
## Futility at interim t if Pr(p > p_0 | x_{(t_max)}) < \gamma_f
## https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4247348/

library('rmutil') ## for betabinom

## Simulate entire trial
## ptru - true probability of outcome (p)
## pref - reference probability of outcome (p_0)
## nint - sample sizes at which to conduct interim analyses
## efft - efficacy threshold
## futt - futility threshold
## apri - prior beta parameter \alpha
## bpri - prior beta parameter \beta
simtrial <- function(
  ptru = 0.15,
  pref = 0.15,
  nint = c(10, 13, 16, 19),
  efft = 0.95,
  futt = 0.05,
  apri = 1,
  bpri = 1) {

  ## determine minimum number of 'successes' necessary to conclude
  ## efficacy if study continues to maximum sample size
  nmax <- max(nint)
  post <- sapply(0:nmax, function(nevt)
    1 - pbeta(pref, apri + nevt, bpri + nmax - nevt))
  nsuc <- min(which(post > efft) - 1)

  ## simulate samples
  samp <- rbinom(n = nmax, size = 1, prob = ptru)

  ## simulate interim analyses
  intr <- lapply(nint, function(ncur) {
    ## compute number of current events
    ecur <- sum(samp[1:ncur])
    ## compute posterior beta parameters
    abb <- apri + ecur
    bbb <- bpri + ncur - ecur
    sbb <- abb + bbb
    mbb <- abb/(abb + bbb)
    ## compute efficacy Pr(p > p_0 | x_{(t)})
    effp <- 1 - pbeta(pref, abb, bbb)
    ## return for efficacy
    if(effp > efft)
      return(list(action = 'stop', reason = 'efficacy', n = ncur))
    ## number of events necessary in remainder of study to conclude efficacy
    erem <- nsuc - ecur
    ## compute success probability Pr(p > p_0 | x_{(t_max)})
    if(erem > nmax - ncur) {
      ## not enough possible events
      sucp <- 0
    } else {
      ## not yet met efficacy threshold
      sucp <- 1 - pbetabinom(q = erem - 1, size = nmax - ncur, m = mbb, s = sbb)
    }
    if(sucp < futt)
      return(list(action = 'stop', reason = 'futility', n = ncur))
    return(list(action = 'continue', reason = '', n = ncur))
  })
  stpi <- match('stop', sapply(intr, `[[`, 'action'))
  return(intr[[stpi]])
}

## Simulate study with max sample size of 200 where true
## probability is identical to reference (i.e., the null
## hypothesis is true). This type of simulation helps us
## determine the overall type-I error rate.
nint <- c(40, 80, 120, 160, 200)
nmax <- max(nint)
res <- do.call(rbind, lapply(1:10000, function(t)
  as.data.frame(simtrial(ptru = 0.72, pref = 0.72, nint = nint,
                         efft = 0.975, futt = 0.20))))
## Prob. early termination (PET) due to Futility
mean(res$reason == 'futility' & res$n < nmax)
## PET Efficacy
mean(res$reason == 'efficacy' & res$n < nmax)
## Pr(conclude efficacy) 'type-I error rate'
mean(res$reason == 'efficacy')
## average and sd sample size
mean(res$n); sd(res$n)
barplot(prop.table(table(res$n)), xlab = 'Study Size (N)', main = "No Difference")

## Simulate study where true probability is greater than
## reference (i.e., an alternative hypothesis). This type
## of simulation helps us determine the study power.
res <- do.call(rbind, lapply(1:10000, function(t)
  as.data.frame(simtrial(ptru = 0.82, pref = 0.72, nint = nint,
                         efft = 0.975, futt = 0.20))))
## Prob. early termination (PET) due to Futility
mean(res$reason == 'futility' & res$n < nmax)
## PET Efficacy
mean(res$reason == 'efficacy' & res$n < nmax)
## Pr(conclude efficacy) 'power'
mean(res$reason == 'efficacy')
## average and sd sample size
mean(res$n); sd(res$n)
barplot(prop.table(table(res$n)), xlab = 'Study Size (N)', main = "35% Reduction")
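To make the stopping rule concrete, here is a small hand check of the efficacy criterion at a single interim analysis (my own illustration, not part of the original script; the interim numbers are made up):

## Hypothetical interim look: Beta(1, 1) prior, 40 patients enrolled, 35 responses.
## The posterior is Beta(1 + 35, 1 + 40 - 35), and the trial stops early for
## efficacy if Pr(p > p_0 | data) exceeds the 0.975 threshold.
apri <- 1; bpri <- 1
ncur <- 40; ecur <- 35
pref <- 0.72
effp <- 1 - pbeta(pref, apri + ecur, bpri + ncur - ecur)
effp          # posterior probability that the response rate exceeds the reference
effp > 0.975  # TRUE here would trigger early stopping for efficacy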


To leave a comment for the author, please follow the link and comment on their blog: R – BioStatMatt.
