R-bloggers
R news and tutorials contributed by 750 R bloggers

R tip: Make Your Results Clear with sigr

Sun, 11/04/2018 - 19:41

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

R is designed to make working with statistical models fast, succinct, and reliable.

For instance building a model is a one-liner:

model <- lm(Petal.Length ~ Sepal.Length, data = iris)

And producing a detailed diagnostic summary of the model is also a one-liner:

summary(model)
# Call:
# lm(formula = Petal.Length ~ Sepal.Length, data = iris)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -2.47747 -0.59072 -0.00668  0.60484  2.49512
#
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -7.10144    0.50666  -14.02   <2e-16 ***
# Sepal.Length  1.85843    0.08586   21.65   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8678 on 148 degrees of freedom
# Multiple R-squared:  0.76, Adjusted R-squared:  0.7583
# F-statistic: 468.6 on 1 and 148 DF,  p-value: < 2.2e-16

However, useful as the above is, it isn’t exactly presentation-ready. To formally report the R-squared of our model, we would have to cut and paste this information from the summary. That is a needlessly laborious and possibly error-prone step.
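To see what that manual route looks like, the relevant numbers can be dug out of the summary object by hand; a minimal sketch (the rounding and wording are still up to you):

s <- summary(model)
s$r.squared    # multiple R-squared, roughly 0.76
s$fstatistic   # F value with its degrees of freedom (468.6 on 1 and 148)
# these still have to be rounded and pasted into the report text by hand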

With the sigr package this can be made much easier:

library("sigr") Rsquared <- wrapFTest(model) print(Rsquared) # [1] "F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05)."

And this formal summary can be directly rendered into many formats (LaTeX, HTML, Markdown, and ASCII).

render(Rsquared, format="html")

F Test summary: (R2=0.76, F(1,148)=468.6, p<1e-05).
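Because render() returns plain text in the requested format, the same summary can also be emitted as Markdown and dropped into an R Markdown report, so the reported statistic never drifts from the fitted model. A small sketch; the inline-chunk usage is my own suggestion rather than something from the original post:

md <- render(Rsquared, format = "markdown")
cat(md)
# or, inline in the text of an .Rmd file:
# `r sigr::render(Rsquared, format = "markdown")`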

sigr can help make your publication workflow much easier and more repeatable/reliable.



New R Cheatsheet: Data Science Workflow with R

Sun, 11/04/2018 - 10:30

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

Teaching R is our mission at Business Science University because R is the most efficient language for exploring data, performing business analysis, and applying data science to business to extract ROI for an organization. R has an amazing ecosystem of tools that seamlessly work together, which has been termed the “tidyverse”. The “tidyverse” ecosystem can take business analysis to a new level. Here’s the key: 99.9% of business analysts don’t know R (and its awesome power). With our learning system, you can learn R quickly, setting yourself on a career trajectory in data science.


At Business Science, we are developing a revolutionary new system for teaching Business Analysis with R (Business Analysis with R is a new course we are developing at Business Science University). The system is revolutionary for a number of reasons (we’ll get to these in a minute). The cornerstone of our teaching process is the Data Science with R Workflow that was originally taught by Hadley Wickham and Garrett Grolemund in the excellent book, R for Data Science. The NEW R Cheatsheet links the documentation, cheatsheets, and key resources available for every R package in the Data Science with R Workflow into one meta-cheatsheet that illustrates the workflow.


Get The New R Cheatsheet

Just go to our website, and you’ll see it available under the “Resources” Tab. The NEW R Cheatsheet with the Data Science Workflow in R is available for download here.

Download the NEW R Cheatsheet

How To Use The Cheatsheet

The NEW R cheatsheet is an amazing reference. It contains every resource you need for referencing the tidyverse documentation in one spot. Let’s take a look.

The Workflow

The first thing you will notice is the workflow that is prominently presented. You can see where the various R Packages are used.

Data Science Workflow

Links To Documentation

Here’s the beauty of the R cheatsheet. With one click, you can easily get to the web documentation for any of the key tidyverse R packages.

One-Click To Documentation


Links To Package Cheatsheets

Wait, there’s more. By clicking “CS”, you can even get the individual R package cheatsheets. These are PDF documents maintained by RStudio that provide snapshots of the most important functions in each package.

One-Click To Package Cheatsheets


Links To Key Resources

We didn’t stop at documentation and cheatsheets. We also added in important references to get you up to speed quickly.

One-Click To Important References


Our Revolutionary System For Learning R For Business

Are you interested in learning R For Business? Then look no further.

  • Business Science University has the most advanced, technology intensive, and streamlined data science educational system for business on the planet.

  • We are developing a NEW INTRODUCTORY COURSE (DS4B 101-R) that delivers an amazing educational experience for learners that want to apply R to business analysis.

  • The beginner DS4B 101-R is the prequel to the intermediate/advanced DS4B 201-R course

Course Launch Date

Launch is expected at the end of November 2018 / beginning of December 2018. Sign up at Business Science University to get updates.

Course Details

Our NEW BUSINESS ANALYSIS WITH R COURSE (DS4B 101-R) combines:

  • Teaching the Data Science with R Workflow in a 100% business context: data import, data manipulation (business aggregations, time-based calculations, text manipulation, categorical manipulation, and missing data), data visualization, business reporting with RMarkdown, and advanced analysis including scaling analysis and building functions.

  • Two state-of-the-art business projects.

Project 1 – Exploratory Analysis of Digital Media Merchant

  • Business Project 1: Complete a full exploratory analysis on a simulated digital media merchant:

    • Connect to an SQLite database to retrieve transactional data,
    • Join data from 11 database tables, including customers, products, and time-based transactional history,
    • Clean data using dplyr and tidyr,
    • Business Analysis: Product Level Analysis, Customer Level Analysis, Time-Based Analysis, and Business Filtering to Isolate Key Populations
  • Business Project 2: Apply the advanced skills through a Customer Segmentation project

    • Use K-Means Clustering to analyze customer purchasing habits and group them into segments
    • Apply data visualization techniques to investigate the key buying habits
    • Bonus Material and More!
Why Choose Business Science for Education?
  • Research: We know how to learn R efficiently and what ingredients create high performance data science teams.

  • Business Application over Tools: We don’t teach tools. We teach how to solve business problems using tools. There is a key difference. Our approach builds knowledge you can apply immediately.

  • Learn In Weeks What Takes Years: When you take a Business Science University course, you learn everything needed to solve the business project. You learn from proven frameworks and workflows. We cut out anything that you don’t need to know. This makes our programs the most efficient programs for learning.

If You Can’t Wait for 101-R For Business Analysis, Get 201-R For Data Science



Available Now!

To be efficient as a data scientist, you need to learn R. Take the course that has cut data science projects in half (see this testimonial from a leading data science consultant) and has progressed data scientists more than anything they have tried before. Over 10 weeks you learn what it has taken data scientists 10 years to learn:

  • Our systematic data science for business framework
  • R and H2O for Machine Learning
  • How to produce Return-On-Investment from data science
  • And much more.

Start Learning Today!



New FSA Authors

Sun, 11/04/2018 - 06:00

(This article was first published on fishR Blog, and kindly contributed to R-bloggers)

It was always my hope that the Fisheries Stock Assessment (FSA) R package would become a community endeavor. With the recent release of v0.8.21 on CRAN, that hope has finally come to fruition. This version has two new co-authors.

Alexis Dinno is the author of the dunn.test package. The dunnTest() function in FSA is simply a wrapper for dunn.test() in her package. In other words, the foundational “guts” of dunnTest() were written by Alexis. This has been acknowledged in the last several versions of FSA, but by including Alexis as a co-author on FSA she will finally get some credit whenever FSA is cited for dunnTest().
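For readers who have not used it, a minimal sketch of how the wrapper is typically called; the formula interface and the "bonferroni" adjustment shown here are as I recall them, so check ?dunnTest for the definitive arguments:

library(FSA)
# Dunn's test of multiple comparisons following a Kruskal-Wallis test;
# under the hood this calls dunn.test::dunn.test()
dunnTest(Sepal.Length ~ Species, data = iris, method = "bonferroni")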

Powell Wheeler is a fisheries biologist in North Carolina who expanded removal() in FSA by coding the “Burnham method” as implemented in the popular (but compiled and commercial) MicroFish software. Adding this functionality to FSA has been on my “to-do list” since receiving a query for it from biologists in Idaho a few years ago. It now finally exists in FSA, thanks to Powell’s good work.
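A sketch of how the new option might be invoked; the catch vector is made up and the "Burnham" method label is an assumption on my part, so see ?removal for the exact name:

library(FSA)
# three-pass removal (depletion) counts from a hypothetical electrofishing survey
catch <- c(38, 26, 12)
fit <- removal(catch, method = "Burnham")  # assumed label for the new Burnham method
summary(fit)                               # abundance and capture-probability estimates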

Check out the new version of FSA, welcome Alexis and Powell, and thank them for their important contributions to FSA.

If you have something to contribute to FSA, please contact me.



Building a Repository of Alpine-based Docker Images for R, Part I

Sun, 11/04/2018 - 01:00

(This article was first published on Curiosity Killed the Cat, and kindly contributed to R-bloggers)

The Rocker Project maintains the official Docker images of interest to R users. I use their images as a base to deploy containerized Shiny apps, but the virtual size of the images I build tends to fall in the range between 400 and 600 MB. To reduce the size of my images, I decided to try building a Shiny Server on Alpine Linux as an alternative to Rocker’s Debian-based images. In this series of articles, I’ll document my progress from building a base image with R to building an image with Shiny Server. The Dockerfiles included in this article can be found at the velaco/alpine-r repository.

About Alpine Linux

Alpine Linux is a lightweight distribution that is popular for deploying containerized apps because the virtual size of an Alpine Docker image starts at 5 MB. The virtual size of images based on other distributions exceeds 100 MB before adding anything else to them, so Alpine achieves a significant reduction in size.

Some key differences between Alpine and other distributions that are relevant to my project are:

  • Alpine uses musl libc as a lightweight alternative to glibc. That means precompiled installers built on platforms with glibc could be useless, even if I install the libc6-compat or glibc packages for Alpine Linux.
  • Alpine drops old versions of packages when new versions become available, making version pinning impossible, so I probably won’t be able to use nodejs and npm packages from the official repositories to build Shiny Server.

Because of those characteristics, I assume that building the images will require compiling most key dependencies from source. I’ve never done that before, so working on this project will be a great (and painful) learning experience.

Build from Native Packages

Building from source can often take up a lot of time, so I decided to start with the native packages to build a working image as soon as possible. Here is what my Dockerfile looked like:

FROM alpine:3.8
MAINTAINER Aleksandar Ratesic "aratesic@gmail.com"

# Declare environment variables ------------------
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8

ENV BUILD_DEPS \
    cairo-dev \
    libxmu-dev \
    openjdk8-jre-base \
    pango-dev \
    perl \
    tiff-dev \
    tk-dev

ENV PERSISTENT_DEPS \
    R-mathlib \
    gcc \
    gfortran \
    icu-dev \
    libjpeg-turbo \
    libpng-dev \
    make \
    openblas-dev \
    pcre-dev \
    readline-dev \
    xz-dev \
    zlib-dev \
    bzip2-dev \
    curl-dev

# Install R and R-dev ------------------
RUN apk upgrade --update && \
    apk add --no-cache --virtual .build-deps $BUILD_DEPS && \
    apk add --no-cache --virtual .persistent-deps $PERSISTENT_DEPS && \
    apk add --no-cache R R-dev && \
    apk del .build-deps

CMD ["R", "--no-save"]

The resulting image has a virtual size of 136.4 MB. That’s an improvement in size compared to the rocker/r-base image, which is 277.4 MB according to MicroBadger. However, I should mention that the rocker/r-base image includes some packages, like littler, that my Dockerfile doesn’t, so it has additional features that could be of interest to R users. (Not to mention that it is also more stable for use in production than my image.)
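One quick way to see what this slimmer build trades away is to start a container from the image and inspect R’s optional capabilities; a sketch (the image tag is whatever you named your build, e.g. docker run --rm -it my-alpine-r R --no-save):

# inside the container's R session
capabilities()
# entries such as X11, cairo or tcltk are expected to be FALSE here,
# since the corresponding development packages are only build dependencies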

Build from Source Code

I recently learned that Alpine drops old versions of packages from the repository when new versions become available (source). That means some versions of R won’t always be available from the official repositories, so I decided to create a Dockerfile for building R from source as well. I made the following changes to the previous Dockerfile:

  • Declared environment variables for storing the R version and target directory for the source code.
  • Added wget and tar to build dependencies for getting and unpacking the sources.
  • Added libint and g++ to runtime dependencies. I learned that g++ is necessary for building from source after the first attempt raised an error while setting up the configuration options. The first container I started failed to run R because package libint was not installed, so I added it later.
  • Removed the R-mathlib package from runtime dependencies. Rmath is installed as a standalone library from source in this image.
  • Finally, I changed the RUN command accordingly to get, configure, and install R from source instead of adding the native packages.

This is what the final version of the file looked like:

FROM alpine:3.8
MAINTAINER Aleksandar Ratesic "aratesic@gmail.com"

ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV R_VERSION 3.5.1
ENV R_SOURCE /usr/local/src/R

ENV BUILD_DEPS \
    cairo-dev \
    libxmu-dev \
    openjdk8-jre-base \
    pango-dev \
    perl \
    tiff-dev \
    tk-dev \
    wget \
    tar

ENV PERSISTENT_DEPS \
    libint \
    gcc \
    g++ \
    gfortran \
    icu-dev \
    libjpeg-turbo \
    libpng-dev \
    make \
    openblas-dev \
    pcre-dev \
    readline-dev \
    xz-dev \
    zlib-dev \
    bzip2-dev \
    curl-dev

RUN apk upgrade --update && \
    apk add --no-cache --virtual .build-deps $BUILD_DEPS && \
    apk add --no-cache --virtual .persistent-deps $PERSISTENT_DEPS && \
    mkdir -p $R_SOURCE && cd $R_SOURCE && \
    wget https://cran.r-project.org/src/base/R-3/R-${R_VERSION}.tar.gz && \
    tar -zxvf R-${R_VERSION}.tar.gz && \
    cd R-${R_VERSION} && \
    ./configure --prefix=/usr/local \
                --without-x && \
    make && make install && \
    cd src/nmath/standalone && make && make install && \
    apk del .build-deps

CMD ["R"]

The virtual size of the image it built was 311.6 MB, but it should be possible to reduce the image even further by cleaning up after building R. I didn’t want to do that yet because I still had to run the container and perform make check from the shell to test the installation.

There was one non-fatal error (raised by showNonASCIIfile()) and two fatal errors, raised by the timezone.R and reg-tests-1c.R tests:

comparing ‘tools-Ex.Rout’ to ‘tools-Ex.Rout.save’ ... NOTE
--- /tmp/RtmpdpmEgM/Rdiffa445b3d67023
+++ /tmp/RtmpdpmEgM/Rdiffb445b7ac41733
@@ -796,8 +796,8 @@
 > cat(out, file = f, sep = "\n")
 >
 > showNonASCIIfile(f)
-1: fa*ile test of showNonASCII():
-4: This has an *mlaut in it.
+1: faile test of showNonASCII():
+4: This has an <fc>mlaut in it.
 > unlink(f)
 >
. . .
running code in 'timezone.R' ...make[4]: *** [Makefile.common:105: timezone.Rout] Error 1
make[4]: Leaving directory '/usr/local/src/R/R-3.5.1/tests'
Sys.timezone() appears unknown
. . .
running code in 'reg-tests-1c.R' ...make[3]: *** [Makefile.common:105: reg-tests-1c.Rout] Error 1
make[3]: Leaving directory '/usr/local/src/R/R-3.5.1/tests'
make[2]: *** [Makefile.common:291: test-Reg] Error 2
make[2]: Leaving directory '/usr/local/src/R/R-3.5.1/tests'
make[1]: *** [Makefile.common:170: test-all-basics] Error 1
make[1]: Leaving directory '/usr/local/src/R/R-3.5.1/tests'
make: *** [Makefile:240: check] Error 2

According to the R Installation and Administration manual:

Failures are not necessarily problems as they might be caused by missing functionality, but you should look carefully at any reported discrepancies. (Some non-fatal errors are expected in locales that do not support Latin-1, in particular in true C locales and non-UTF-8 non-Western-European locales.)

Although the container runs R and has passed most tests, I wouldn’t feel confident using it in production or as a base for building Shiny Server because of the fatal errors. Therefore, this Dockerfile stays in the staging branch of my repository for now, until those issues are fixed.

Next Steps

In R’s APKBUILD file I found the line

# TODO: Run provided test suite.

Because the R version installed from native packages was not tested, I have no reason to assume that it is more reliable than my build from source. I am tempted to build Shiny Server on Alpine as soon as possible, and I will definitely make an attempt at building that image just to get an overview of the process. However, while I explore the process of building Shiny Server from source, I will also focus on resolving the issues raised by the tests to create a reliable base image for Shiny.



coalesce with wrapr

Sat, 11/03/2018 - 22:49

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

coalesce is a classic useful SQL operator that picks the first non-NULL value in a sequence of values.

We thought we would share a nice R version of it for picking the first non-NA value, with convenient infix operator notation: wrapr::coalesce(). Here is a short example of it in action:

library("wrapr") NA %?% 0 # [1] 0

A more substantial application is the following.

library("wrapr") d <- wrapr::build_frame( "more_precise_sensor", "cheaper_back_up_sensor" | 0.31 , 0.25 | 0.41 , 0.5 | NA_real_ , 0.5 ) print(d) # more_precise_sensor cheaper_back_up_sensor # 1 0.31 0.25 # 2 0.41 0.50 # 3 NA 0.50 d$measurement <- d$more_precise_sensor %?% d$cheaper_back_up_sensor print(d) # more_precise_sensor cheaper_back_up_sensor measurement # 1 0.31 0.25 0.31 # 2 0.41 0.50 0.41 # 3 NA 0.50 0.50 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));


RProtoBuf 0.4.13 (and 0.4.12)

Sat, 11/03/2018 - 22:36

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release 0.4.13 of RProtoBuf got onto CRAN a few hours ago. RProtoBuf provides R with bindings for the Google Protocol Buffers (“ProtoBuf”) data encoding and serialization library used and released by Google, and deployed fairly widely in numerous projects as a language and operating-system agnostic protocol.
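For readers who haven’t used the package, a minimal round-trip sketch; I’m assuming here that the serialize_pb()/unserialize_pb() helpers behave like base serialize() when given a NULL connection, and the POSIXlt handling is one of the items fixed in the releases below:

library(RProtoBuf)
# round-trip a list holding a POSIXlt timestamp through ProtoBuf serialization
payload <- list(stamp = as.POSIXlt(Sys.time()), values = 1:3)
raw_msg <- serialize_pb(payload, connection = NULL)
unserialize_pb(raw_msg)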

It would also appear that I failed to blog about release 0.4.12 from July. Both releases contain a number of very nice and focused contributed pull requests – see below for more – as well as some standard packaging maintenance.

Changes in RProtoBuf version 0.4.13 (2018-11-03)
  • The configure setup is more robust with respect to the C++ setup (CRAN request).

  • POSIXlt elements are now properly serialized too (Jeffrey Shen in #48 and #50 fixing #47)

  • Added two Dockerfiles for continuous integration and use; see this url for more.

Changes in RProtoBuf version 0.4.12 (2018-07-11)
  • Recursive serialization of sublists returning themselves is now recognised (Jeffrey Shen in #38 fixing #37)

  • New function readProtoFiles2 to be consistent with protoc (Siddhartha Bagaria in #40 fixing #39)

  • Update Windows binary library used (Maciej Lach and Jeroen Ooms in #42 and follow-up commits)

  • New unit tests for new functionality (Siddhartha Bagaria in #45)

CRANberries also provides a diff to the previous release. The RProtoBuf page has copies of the (older) package vignette, the ‘quick’ overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



Book Review – Sound Analysis and Synthesis with R

Sat, 11/03/2018 - 18:38

(This article was first published on R-bloggers – Eryk Walczak, and kindly contributed to R-bloggers)

R might not be the most obvious tool when it comes to analysing audio data. However, an increasing number of packages allow analysing and synthesising sounds. One such package is seewave. Jerome Sueur, one of the authors of seewave, has now written a book about working with audio data in R. The book is entitled Sound Analysis and Synthesis with R and was published by Springer in 2018. I highly recommend it to anyone working with audio data.

The book starts with a general explanation of sound. Then it introduces R to readers who have no experience using it. Over its 17 chapters, the author describes the basic audio analyses that can be conducted with R. The underlying concepts are explained using both mathematical equations and R code. There is also some material on sound synthesis, but this is a minor point compared to the space devoted to analysis. Additional materials include the sound samples used across the book.

As mentioned before, the main topic of the book is the analysis of sound, predominantly in scientific settings. Researchers (or data scientists) typically want to load, visualise, play, and quantify a particular sound that they work on. These basic steps are described in this book with code examples that are simple to follow and richly illustrated with R-generated plots. Check the book preview here.
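As a flavour of that load-and-visualise workflow, here is a minimal sketch using seewave’s bundled example recording, with the plotting arguments left at their defaults:

library(seewave)
data(tico)      # a short bird-song recording shipped with seewave
oscillo(tico)   # waveform: amplitude over time
spectro(tico)   # spectrogram: frequency content over time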

If you ever need to paste, delete, repeat, or reverse audio files with R, then recipes for these tasks can be found in this book. The book contains twenty DIY Boxes which show alternative ways to use already-coded functions and demonstrate new tasks. These boxes cover topics ranging from loading audio files and plotting to frequency and amplitude analysis.

Even though the author created his own package, the book shows how to use a wide range of audio-specific R packages like tuneR or warbleR.

I can only wish that this book had been released earlier. It would have saved me a lot of pain conducting audio analyses.

Final verdict: 5/5



Visualize the Business Value of your Predictive Models with modelplotr

Sat, 11/03/2018 - 14:00

(This article was first published on Modelplot[R/py] introduction, and kindly contributed to R-bloggers)

Why ROC curves are a bad idea to explain your model to business people

Summary

In this blog post we explain the four most valuable evaluation plots to assess the business value of a predictive model. These plots are the cumulative gains, cumulative lift, response and cumulative response. Since these visualisations are not included in most popular model building packages or modules in R and Python, we show how you can easily create these plots for your own predictive models with our modelplotr R package and our modelplotpy Python module (Prefer Python? Read all about modelplotpy here!). This will help you to explain your model’s business value in layman’s terms to non-techies.

Intro

‘…And as we can see clearly on this ROC plot, the sensitivity of the model at the value of 0.2 on one minus the specificity is quite high! Right?…’.

If your fellow business colleagues didn’t already wander away during your presentation about your fantastic predictive model, it will definitely push them over the edge when you start talking like this. Why? Because the ROC curve is not easy to quickly explain and also difficult to translate into answers on the business questions your spectators have. And these business questions were the reason you’ve built a model in the first place!

What business questions? We build models for all kinds of supervised classification problems, such as predictive models to select the best records in a dataset, which can be customers, leads, items, events… For instance: you want to know which of your active customers have the highest probability to churn; you need to select those prospects that are most likely to respond to an offer; you have to identify transactions that have a high risk of being fraudulent. During your presentation, your audience is therefore mainly focused on answering questions like: Does your model enable us to select our target audience? How much better are we, using your model? What will the expected response on our campaign be?

During our model building efforts, we should already be focused on verifying how well the model performs. Often, we do so by training the model parameters on a selection or subset of records and testing the performance on a holdout set or external validation set. We look at a set of performance measures like the ROC curve and the AUC value. These plots and statistics are very helpful to check during model building and optimization whether your model is under- or overfitting and which set of parameters performs best on test data. However, these statistics are not that valuable in assessing the business value of the model you developed.

One reason that the ROC curve is not that useful in explaining the business value of your model, is because it’s quite hard to explain the interpretation of ‘area under the curve’, ‘specificity’ or ‘sensitivity’ to business people. Another important reason that these statistics and plots are useless in your business meetings is that they don’t help in determining how to apply your predictive model: What percentage of records should we select based on the model? Should we select only the best 10% of cases? Or should we stop at 30%? Or go on until we have selected 70%?… This is something you want to decide together with your business colleague to best match the business plans and campaign targets they have to meet. The four plots – the cumulative gains, cumulative lift, response and cumulative response – we are about to introduce are in our view the best ones for that cause.

Installing modelplotr

Before we start exploring the plots, let’s install our modelplotr package. Since it is available on github, it can easily be installed using devtools:

library(devtools)
install_github("modelplot/modelplotr")

Now we’re ready to use modelplotr. As you’ll probably see when you run the installation code, it comes with a number of other very valuable package installs. In another post, we’ll go into much more detail on our modelplot packages in R and Python and all their functionality. Here, we’ll focus on using modelplotr with a business example.

Example: Predictive models from caret and mlr on the Bank Marketing Data Set

Let’s get down to business! Our example is based on a publicly available dataset, called the Bank Marketing Data Set. It is one of the most popular datasets made available on the UCI Machine Learning Repository. The data set comes from a Portuguese bank and deals with a frequently-posed marketing question: whether a customer did or did not acquire a term deposit, a financial product. There are 4 datasets available and bank-additional-full.csv is the one we use. It contains the information of 41,188 customers and 21 columns of information.

To illustrate how to use modelplotr, let’s say that we work for this bank and our marketing colleagues have asked us to help to select the customers that are most likely to respond to a term deposit offer. For that purpose, we will develop a predictive model and create the plots to discuss the results with our marketing colleagues. Since we want to show you how to build the plots, not how to build a perfect model, we’ll use six of these columns in our example. Here’s a short description on the data we use:

  1. y: has the client subscribed a term deposit?
  2. duration: last contact duration, in seconds (numeric)
  3. campaign: number of contacts performed during this campaign and for this client
  4. pdays: number of days that passed by after the client was last contacted from a previous campaign
  5. previous: number of contacts performed before this campaign and for this client (numeric)
  6. euribor3m: euribor 3 month rate

Let’s load the data and have a quick look at it:

# download bank data and prepare
#zipname = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip'
# we encountered that the source at uci.edu is not always available, therefore we made a copy to our repos.
zipname = 'https://modelplot.github.io/img/bank-additional.zip'
csvname = 'bank-additional/bank-additional-full.csv'
temp <- tempfile()
download.file(zipname, temp, mode = "wb")
bank <- read.table(unzip(temp, csvname), sep = ";", stringsAsFactors = FALSE, header = T)
unlink(temp)

bank <- bank[, c('y', 'duration', 'campaign', 'pdays', 'previous', 'euribor3m')]

# rename target class value 'yes' for better interpretation
bank$y[bank$y == 'yes'] <- 'term.deposit'

# explore data
str(bank)
## 'data.frame': 41188 obs. of 6 variables:
## $ y        : chr "no" "no" "no" "no" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays    : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ euribor3m: num 4.86 4.86 4.86 4.86 4.86 ...

On this data, we’ve applied some predictive modeling techniques, created with the caret package and the mlr package. Both these popular R packages are wrappers for many predictive modeling techniques, such as logistic regression, random forest, XG boost, svm, neural nets and many, many others.

mlr and caret provide numerous different algorithms for binary classification. It should be noted that, to use our modelplotr package, you don’t have to use caret or mlr to build your models. More on this in the modelplotr package documentation; just have a look at ?modelplotr::aggregate_over_deciles(). To have a few models to evaluate with our plots, we do take advantage of mlr’s and caret’s greatness.

# prepare data for training and train models
test_size = 0.3
train_index = sample(seq(1, nrow(bank)), size = (1 - test_size) * nrow(bank), replace = F)
train = bank[train_index, ]
test = bank[-train_index, ]

# estimate some models with caret...
# setting caret cross validation, here tuned for speed (not accuracy!)
fitControl <- caret::trainControl(method = "cv", number = 2, classProbs = TRUE)

# mnl model using glmnet package
mnl = caret::train(y ~ ., data = train, method = "glmnet", trControl = fitControl)

# random forest using ranger package, here tuned for speed (not accuracy!)
rf = caret::train(y ~ ., data = train, method = "ranger", trControl = fitControl,
                  tuneGrid = expand.grid(.mtry = 2, .splitrule = "gini", .min.node.size = 10))

# ... and estimate some models with mlr
task = mlr::makeClassifTask(data = train, target = "y")

# discriminant model
lrn = mlr::makeLearner("classif.lda", predict.type = "prob")
lda = mlr::train(lrn, task)

# xgboost model
lrn = mlr::makeLearner("classif.xgboost", predict.type = "prob")
xgb = mlr::train(lrn, task)

Ok, we’ve generated some predictive models. Now, let’s prepare the data for plotting! We will focus on explaining to our marketing colleagues how good our predictive model works and how it can help them select customers for their term deposit campaign.

library(modelplotr)

# transform datasets and model objects into scored data and calculate deciles
prepare_scores_and_deciles(datasets = list("train", "test"),
                           dataset_labels = list("train data", "test data"),
                           models = list("rf", "mnl", "xgb", "lda"),
                           model_labels = list("random forest", "multinomial logit", "XGBoost", "Discriminant"),
                           target_column = "y")
## ... scoring caret model "rf" on dataset "train".
## ... scoring caret model "mnl" on dataset "train".
## ... scoring mlr model "xgb" on dataset "train".
## ... scoring mlr model "lda" on dataset "train".
## ... scoring caret model "rf" on dataset "test".
## ... scoring caret model "mnl" on dataset "test".
## ... scoring mlr model "xgb" on dataset "test".
## ... scoring mlr model "lda" on dataset "test".
## [1] "Data preparation step 1 succeeded! Dataframe 'scores_and_deciles' created."

# transform data generated with prepare_scores_and_deciles into aggregated data for chosen plotting scope
plotting_scope(select_model_label = 'XGBoost', select_dataset_label = 'test data')
## Data preparation step 3 succeeded! Dataframe 'plot_input' created.
##
## No comparison specified, default values are used.
##
## Single evaluation line will be plotted: Target value "term deposit" plotted for dataset "test data" and model "XGBoost".
## -> To compare models, specify: scope = "compare_models"
## -> To compare datasets, specify: scope = "compare_datasets"
## -> To compare target classes, specify: scope = "compare_targetclasses"
## -> To plot one line, do not specify scope or specify scope = "no_comparison".

What just happened? In the prepare_scores_and_deciles function, we’ve scored the customers in the train dataset and test dataset with their probability to acquire a term deposit, according to the predictive models we’ve just built with caret and mlr. Aside from the datasets and model objects, we had to specify the name of the target variable and for our convenience, we gave our datasets and models some useful labels.

In the second step, we specified the scope of the analysis. We’ve not specified the “scope” parameter, therefore the default – no comparison – is chosen. As the output notes, you can use modelplotr to evaluate your model(s) from several perspectives:

  • Interpret just one model (the default)
  • Compare the model performance across different datasets
  • Compare the performance across different models
  • Compare the performance across different target classes

Here, we will keep it simple and evaluate – from a business perspective – how well a selected model will perform on a selected dataset for one target class. We did specify values for some parameters, to focus on the XGBoost model on the test data. The default value for the target class is term deposit; since we want to focus on customers that do take term deposits, this default is perfect.

Let’s introduce the Gains, Lift and (cumulative) Response plots.

Before we throw more code and output at you, let’s get you familiar with the plots we so strongly advocate to use to assess a predictive model’s business value. Although each plot sheds light on the business value of your model from a different angle, they all use the same data:

  • Predicted probability for the target class
  • Equally sized groups based on this predicted probability
  • Actual number of observed target class observations in these groups

It’s common practice to split the data to score into 10 equally large groups and call these groups deciles. Observations that belong to the top 10% with the highest model probability in a set are in decile 1 of that set; the next 10% with high model probability are decile 2; and finally, the 10% of observations with the lowest model probability on the target class belong to decile 10.
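In R terms, assigning these deciles boils down to ranking cases on their predicted probability and cutting them into ten equal groups. A sketch of the idea using dplyr and the caret random forest fitted above (modelplotr does this for you in prepare_scores_and_deciles(), so this is only for intuition):

library(dplyr)
scored <- test %>%
  mutate(prob   = predict(rf, newdata = test, type = "prob")[, "term.deposit"],
         decile = ntile(desc(prob), 10))  # decile 1 = highest predicted probability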

Each of our four plots places the deciles on the x axis and another measure on the y axis. The deciles are plotted from left to right so the observations with the highest model probability are on the left side of the plot. This results in plots like this:

Now that it’s clear what is on the horizontal axis of each of the plots, we can go into more detail on the metrics for each plot on the vertical axis. For each plot, we’ll start with a brief explanation what insight you gain with the plot from a business perspective. After that, we apply it to our banking data and show some neat features of modelplotr to help you explain the value of your wonderful predictive models to others.

1. Cumulative gains plot

The cumulative gains plot – often named ‘gains plot’ – helps you answer the question:

When we apply the model and select the best X deciles, what % of the actual target class observations can we expect to target?

Hence, the cumulative gains plot visualises the percentage of the target class members you would have selected if you decided to select up until decile X. This is a very important business question, because in most cases you want to use a predictive model to target a subset of observations – customers, prospects, cases,… – instead of targeting all cases. And since we won’t build perfect models all the time, we will miss some potential. And that’s perfectly all right, because if we are not willing to accept that, we should not use a model in the first place. Or we should build a perfect model, one that scores all actual target class members with a 100% probability and all the cases that do not belong to the target class with a 0% probability. However, if you’re such a wizard, you don’t need these plots anyway, or you should have a careful look at your model – maybe you’re cheating?…

So, we’ll have to accept we will lose some. What percentage of the actual target class members you do select with your model at a given decile, that’s what the cumulative gains plot tells you. The plot comes with two reference lines to tell you how good/bad your model is doing: The random model line and the wizard model line. The random model line tells you what proportion of the actual target class you would expect to select when no model is used at all. This vertical line runs from the origin (with 0% of cases, you can only have 0% of the actual target class members) to the upper right corner (with 100% of the cases, you have 100% of the target class members). It’s the rock bottom of how your model can perform; are you close to this, then your model is not much better than a coin flip. The wizard model is the upper bound of what your model can do. It starts in the origin and rises as steep as possible towards 100%. If less than 10% of all cases belong to the target category, this means that it goes steep up from the origin to the value of decile 1 and cumulative gains of 100% and remains there for all other deciles as it is a cumulative measure. Your model will always move between these two reference lines – closer to a wizard is always better – and looks like this:

Back to our business example. How many of the term deposit buyers can we select within the top 20% scored by our predictive model? Let’s find out! To generate the cumulative gains plot, we can simply call the function plot_cumgains():

# plot the cumulative gains plot
plot_cumgains()

We don’t need to specify any parameters, since the default input is plot_input, which is generated with the plotting_scope() function we ran previously. There are several parameters available to customize the plot, though. If we want to emphasize the model performance at a given point, we can add both highlighting to the plot and add some explanatory text below the plot. Both are optional, though:

# plot the cumulative gains plot and annotate the plot at decile = 2
plot_cumgains(highlight_decile = 2)
##
## Plot annotation:
## - When we select 20% with the highest probability according to XGBoost, this selection holds 82% of all term deposit cases in test data.
##
##

Our ‘highlight_decile’ parameter adds some guiding elements to the plot at decile 2, as well as a text box below the graph with the interpretation of the plot at decile 2 in words. This interpretation is also printed to the console. Our simple model – only 6 predictors were used – seems to do a nice job of selecting the customers interested in buying term deposits. When we select the 20% with the highest probability according to XGBoost, this selection holds 82% of all term deposit cases in the test data. With a perfect model, we would have selected 100%, since less than 20% of all customers in the test set buy term deposits. A random pick would only hold 20% of the customers with term deposits. How much better than random we do brings us to plot number 2!
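For intuition, the 82% highlighted above can be reproduced by hand from the scored test set built in the decile sketch earlier: per decile, count the target cases and take the running share of the total. Again, this is only a sketch of the arithmetic behind the plot, not modelplotr’s internal code:

gains <- scored %>%
  group_by(decile) %>%
  summarise(targets = sum(y == "term.deposit")) %>%
  arrange(decile) %>%
  mutate(cum_gains = cumsum(targets) / sum(targets))  # share of all target cases captured up to each decile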

2. Cumulative lift plot

The cumulative lift plot, often referred to as lift plot or index plot, helps you answer the question:

When we apply the model and select the best X deciles, how many times better is that than using no model at all?

The lift plot helps you in explaining how much better selecting based on your model is compared to taking random selections instead. Especially when models are not yet used within a certain organisation or domain, this really helps business understand what selecting based on models can do for them.

The lift plot only has one reference line: the ‘random model’. With a random model we mean that each observation gets a random number and all cases are divided into deciles based on these random numbers. The % of actual target category observations in each decile would then be equal to the overall % of actual target category observations in the total set. Since the lift is calculated as the ratio of these two numbers, we get a horizontal line at the value of 1. Your model should, however, be able to do better, resulting in a high ratio for decile 1. How high the lift can get depends on the quality of your model, but also on the % of target class observations in the data: if 50% of your data belongs to the target class of interest, a perfect model would only do twice as well (lift: 2) as a random selection. With a smaller target class value, say 10%, the model can potentially be 10 times better (lift: 10) than a random selection. Therefore, no general guideline for a ‘good’ lift can be specified. Towards decile 10, since the plot is cumulative, with 100% of cases we have the whole set again, and therefore the cumulative lift will always end up at a value of 1. It looks like this:

To generate the cumulative lift plot for our XGBoost model predicting term deposit buying, we call the function plot_cumlift(). Let’s add some highlighting to see how much better a selection based on our model containing deciles 1 and 2 would be, compared to a random selection of 20% of all customers:

# plot the cumulative lift plot and annotate the plot at decile = 2
plot_cumlift(highlight_decile = 2)
##
## Plot annotation:
## - When we select 20% with the highest probability according to model XGBoost in test data, this selection for term deposit cases is 4.1 times better than selecting without a model.
##
##

A term deposit campaign targeted at a selection of 20% of all customers based on our XGBoost model can be expected to have a 4.1 times higher response (411%) compared to a random sample of customers. Not bad, right? The cumulative lift really helps in getting a positive return on marketing investments. It should be noted, though, that since the cumulative lift plot is relative, it doesn’t tell us how high the actual response to our campaign will be…
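The lift number itself is easy to reconstruct from the same per-decile aggregation: the cumulative response rate in the selected deciles divided by the overall response rate. A sketch, continuing from the scored data frame above:

overall_rate <- mean(scored$y == "term.deposit")
lift <- scored %>%
  group_by(decile) %>%
  summarise(n = n(), targets = sum(y == "term.deposit")) %>%
  arrange(decile) %>%
  mutate(cum_lift = (cumsum(targets) / cumsum(n)) / overall_rate)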

3. Response plot

One of the easiest to explain evaluation plots is the response plot. It simply plots the percentage of target class observations per decile. It can be used to answer the following business question:

When we apply the model and select decile X, what is the expected % of target class observations in that decile?

The plot has one reference line: The % of target class cases in the total set. It looks like this:

A good model starts with a high response value in the first decile(s) and suddenly drops quickly towards 0 for later deciles. This indicates good differentiation between target class members – getting high model scores – and all other cases. An interesting point in the plot is the location where your model’s line intersects the random model line. From that decile onwards, the % of target class cases is lower than a random selection of cases would hold.

To generate the response plot for our term deposit model, we can simply call the function plot_response(). Let’s immediately highlight the plot to have the interpretation of the response plot at decile 1 added to the plot:

# plot the response plot and annotate the plot at decile = 1
plot_response(highlight_decile = 1)
##
## Plot annotation:
## - When we select decile 1 according to model XGBoost in dataset test data the % of term deposit cases in the selection is 61%.
##
##

As the plot shows and the text below the plot states: when we select decile 1 according to model XGBoost in dataset test data, the % of term deposit cases in the selection is 61%. This is quite good, especially when compared to the overall likelihood of 11%. The response in the second decile is much lower, about 28%. From decile 3 onwards, the expected response will be lower than the overall likelihood of 10.4%. However, most of the time, our model will be used to select from the highest decile down to some cut-off decile. That makes it even more relevant to have a look at the cumulative version of the response plot. And guess what, that’s our final plot!

4. Cumulative response plot

Finally, one of the most used plots: the cumulative response plot. It answers the question burning on every business rep’s lips:

When we apply the model and select up until decile X, what is the expected % of target class observations in the selection?

The reference line in this plot is the same as in the response plot: the % of target class cases in the total set.

Whereas the response plot crosses the reference line, in the cumulative response plot it never crosses it but ends up at the same point for decile 10: Selecting all cases up until decile 10 is the same as selecting all cases, hence the % of target class cases will be exactly the same. This plot is most often used to decide – together with business colleagues – up until what decile to select for a campaign.

Back to our banking business case. To generate the cumulative response plot, we call the function plot_cumresponse(). Let’s highlight it at decile 3 to see what the overall expected response will be if we select prospects for a term deposit offer based on our XGBoost model:

# plot the cumulative response plot and annotate the plot at decile = 3
plot_cumresponse(highlight_decile = 3)
##
## Plot annotation:
## - When we select deciles 1 until 3 according to model XGBoost in dataset test data the % of term deposit cases in the selection is 33%.
##
##

When we select deciles 1 until 3 according to model XGBoost in dataset test data the % of term deposit cases in the selection is 33%. Since the test data is an independent set, not used to train the model, we can expect the response on the term deposit campaign to be 33%.
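Both the response and cumulative response numbers follow from the same aggregation as before: the per-decile share of target cases, and its running version over the selected deciles. One more sketch along the same lines:

resp <- scored %>%
  group_by(decile) %>%
  summarise(n = n(), targets = sum(y == "term.deposit")) %>%
  arrange(decile) %>%
  mutate(response     = targets / n,                 # % target cases within each decile
         cum_response = cumsum(targets) / cumsum(n)) # % target cases up to each decile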

The cumulative response percentage at a given decile is a number your business colleagues can really work with: is that response big enough to have a successful campaign, given costs and other expectations? Will the absolute number of sold term deposits meet the targets? Or do we lose too much of the potential term deposit buyers by only selecting the top 30%? To answer that question, we can go back to the cumulative gains plot. And that’s why there’s no absolute winner among these plots and we advise using them all. To make that happen, there’s also a function to easily combine all four plots.

All four plots together

With the function call plot_all we get all four plots on one grid. We can easily save it to a file to include it in a presentation or share it with colleagues.

# plot all four evaluation plots and save to file
plot_all(save_fig = TRUE, save_fig_filename = 'Selection model Term Deposits')
## Plot is saved as: c:/TEMP/jupyter-blog/content/Selection model Term Deposits.png

Neat! With these plots, we are able to talk with business colleagues about the actual value of our predictive model, without having to bore them with technicalities or any nitty-gritty details. We’ve translated our model into business terms and visualised it to simplify interpretation and communication. Hopefully, this helps many of you in discussing how to optimally take advantage of your predictive model building efforts.

Get more out of modelplotr: using different scopes.

As we mentioned earlier, the modelplotr package also enables you to make interesting comparisons, using the scope parameter: comparisons between different models, between different datasets and (in case of a multiclass target) between different target classes. Curious? Please have a look at the package documentation or read our other posts on modelplotr.

However, to give one example, we could compare whether XGBoost was indeed the best choice to select the top-30% customers for a term deposit offer:

# set plotting scope to model comparison
plotting_scope(scope = "compare_models")
## Data preparation step 3 succeeded! Dataframe 'plot_input' created.
##
## Models "random forest", "multinomial logit", "XGBoost", "Discriminant" compared for dataset "test data" and target value "term deposit".

# plot the cumulative response plot and annotate the plot at decile = 3
plot_cumresponse(highlight_decile = 3)
##
## Plot annotation:
## - When we select deciles 1 until 3 according to model random forest in dataset test data the % of term deposit cases in the selection is 34%.
## - When we select deciles 1 until 3 according to model multinomial logit in dataset test data the % of term deposit cases in the selection is 32%.
## - When we select deciles 1 until 3 according to model XGBoost in dataset test data the % of term deposit cases in the selection is 33%.
## - When we select deciles 1 until 3 according to model Discriminant in dataset test data the % of term deposit cases in the selection is 33%.
##
##

Seems like the algorithm used will not make a big difference in this case. Hopefully you agree by now that using these plots really can make a difference in explaining the business value of your predictive models!

In case you experience issues when using modelplotr, please let us know via the issues section on Github. Any other feedback or suggestions, please let us know via jurriaan.nagelkerke@gmail.com or pb.marcus@hotmail.com. Happy modelplotting!



A quick look at GHCN version 4

Sat, 11/03/2018 - 03:15

(This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers)

GHCN version 4 beta is available. Using the GHS population dataset, the ~27,000 GHCN v4 sites were filtered to collect only rural stations. GHS combines two datasets: a 10-meter built-surface satellite dataset and a human population dataset (https://ghsl.jrc.ec.europa.eu/). Using the site locations, the population within 10 km of each site was extracted. Two cases were considered: one where rural was defined as fewer than 16 people per square km, and one with fewer than 7 people per square km. This filtering led to two subsets of the ~27K GHCN v4 stations: one with 15K stations and a second with 12K stations. The temperature data used was the unadjusted monthly T Avg.
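A sketch of the filtering step described above; the data frame and column names here are hypothetical stand-ins (the post doesn’t show its code), with pop_km2 holding the GHS-derived population density within 10 km of each station:

# keep only stations below the two rural-density thresholds
rural_16 <- subset(ghcn_sites, pop_km2 < 16)  # ~15K stations in the post
rural_7  <- subset(ghcn_sites, pop_km2 < 7)   # ~12K stations in the post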



To leave a comment for the author, please follow the link and comment on their blog: Steven Mosher's Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Analyzing NetHack data, part 1: What kills the players

Sat, 11/03/2018 - 01:00

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


Abstract

In this post, I will analyse the data I scraped and put into an R package, which I called {nethack}.
NetHack is a roguelike game; for more context, read my previous blog
post.
You can install the {nethack} package and play around with the data yourself by installing it from github:

devtools::install_github("b-rodrigues/nethack")

And to use it:

library(nethack) data("nethack")

The data contains information on games played from 2001 to 2018: 322485 rows and 14 columns. Getting and then sharing the data was the focus of the previous post; this one focuses on analyzing it. By the way, all the content from the public server I scrape is under the CC BY 4.0 license.

I built the package by using the very useful {devtools} package.

Introduction

What I want from this first analysis are several simple things: how many players manage to ascend (meaning, to win), which monster kills the most players, and finally to extract data from the dumplog
column. The dumplog column is a bit special; each element of the dumplog column is a log file
that contains a lot of information from the last turns of a player. I will leave this for a future
blog post, though.

Let’s load some packages first:

library(nethack) library(tidyverse) library(lubridate) library(magrittr) library(brotools)

{brotools} is my own package that contains some functions that I use daily. If you want to
install it, run the following line:

devtools::install_github("b-rodrigues/brotools")

The documentation is not up to date; I think I'll fix that and release it on CRAN. Some day.

Now, let’s load the “nethack” data, included in the {nethack} package:

data("nethack") head(nethack) ## rank score name time turns lev_max hp_max role race gender alignment ## 1 1 360 jkm NA 2/2 -2/25 Sam Hum Mal Law ## 2 2 172 yosemite NA 1/1 -1/10 Tou Hum Fem Neu ## 3 3 2092 dtype NA 6/7 -2/47 Val Hum Fem Neu ## 4 4 32 joorko NA 1/1 0/15 Sam Hum Mal Law ## 5 5 118 jorko NA 1/1 0/11 Rog Orc Fem Cha ## 6 6 1757 aaronl NA 5/5 0/60 Bar Hum Mal Neu ## death date ## 1 killed by a brown mold 2001-10-24 ## 2 killed by a jackal 2001-10-24 ## 3 killed by a fire ant 2001-10-24 ## 4 killed by a jackal 2001-10-24 ## 5 killed by a jackal 2001-10-24 ## 6 killed by a hallucinogen-distorted ghoul, while helpless 2001-10-24 ## dumplog ## 1 NA ## 2 NA ## 3 NA ## 4 NA ## 5 NA ## 6 NA

Let’s create some variables that might be helpful (or perhaps not, we’ll see):

nethack %<>% mutate(date = ymd(date), year = year(date), month = month(date), day = day(date))

This makes it easy to look at the data from, say, June 2017:

nethack %>% filter(year == 2017, month == 6) %>% brotools::describe() ## # A tibble: 15 x 13 ## variable type nobs mean sd mode min max q25 median ## ## 1 day Nume… 1451 17.4 9.00e0 1 1 30 10 19 ## 2 month Nume… 1451 6 0. 6 6 6 6 6 ## 3 rank Nume… 1451 47.1 2.95e1 1 1 100 20 47 ## 4 score Nume… 1451 38156. 3.39e5 488 0 5966425 402. 953 ## 5 turns Nume… 1451 4179. 1.23e4 812 1 291829 860. 1796 ## 6 year Nume… 1451 2017 0. 2017 2017 2017 2017 2017 ## 7 alignme… Char… 1451 NA NA Law NA NA NA NA ## 8 death Char… 1451 NA NA kill… NA NA NA NA ## 9 gender Char… 1451 NA NA Mal NA NA NA NA ## 10 hp_max Char… 1451 NA NA -1/16 NA NA NA NA ## 11 lev_max Char… 1451 NA NA 4/4 NA NA NA NA ## 12 name Char… 1451 NA NA ohno… NA NA NA NA ## 13 race Char… 1451 NA NA Hum NA NA NA NA ## 14 role Char… 1451 NA NA Kni NA NA NA NA ## 15 time Char… 1451 NA NA 01:1… NA NA NA NA ## # ... with 3 more variables: q75 , n_missing , n_unique

Let’s also take a look at a dumplog:

Click to expand; the dumplog is quite long

nethack %>% filter(year == 2018, month == 10) %>% slice(1) %>% pull(dumplog) ## [[1]] ## [1] "Unix NetHack Version 3.6.1 - last build Fri Apr 27 19:25:48 2018. (d4ebae12f1a709d1833cf466dd0c553fb97518d2)" ## [2] "" ## [3] "Game began 2018-09-30 22:27:18, ended 2018-10-01 00:01:12." ## [4] "" ## [5] "brothertrebius, neutral female gnomish Ranger" ## [6] "" ## [7] " -----" ## [8] " -------- |....# ----- --------" ## [9] " |/..%.=| #...^|######|...| ##.......|" ## [10] " |/[%..%| #|...| #|...| # |......|" ## [11] " |......| #----- #-...-######....<..|" ## [12] " -----|-- ### -|-.- # |......|" ## [13] " ## ## # # ----f---" ## [14] " #### # # ## f@Y" ## [15] " # # # #" ## [16] " -----.-------# # #" ## [17] " |........%..|# # #" ## [18] " |............# # #" ## [19] " |...........| 0## #" ## [20] " |...........| -.--- #" ## [21] " ------------- |^..|##" ## [22] " |...|#" ## [23] " |0>..#" ## [24] " -----" ## [25] "" ## [26] "Brothertre the Trailblazer St:15 Dx:12 Co:16 In:13 Wi:15 Ch:6 Neutral" ## [27] "Dlvl:6 $:59 HP:0(54) Pw:40(40) AC:0 Exp:8 T:7398 Satiated Burdened" ## [28] "" ## [29] "Latest messages:" ## [30] " In what direction? l" ## [31] " You shoot 2 arrows." ## [32] " The 1st arrow hits the ape." ## [33] " The 2nd arrow hits the ape!" ## [34] " The ape hits!" ## [35] " The ape hits!" ## [36] " The ape bites!" ## [37] " You ready: q - 9 uncursed arrows." ## [38] " In what direction? l" ## [39] " The arrow hits the ape." ## [40] " The ape hits!" ## [41] " The ape hits!" ## [42] " The ape bites!" ## [43] " The ape hits!" ## [44] " The ape hits!" ## [45] " The ape misses!" ## [46] " In what direction? l" ## [47] " You shoot 2 arrows." ## [48] " The 1st arrow hits the ape!" ## [49] " The 2nd arrow hits the ape." ## [50] " The ape misses!" ## [51] " The ape hits!" ## [52] " The ape misses!" ## [53] " In what direction? l" ## [54] " You shoot 2 arrows." ## [55] " The 1st arrow misses the ape." ## [56] " The 2nd arrow hits the ape." ## [57] " The ape misses!" ## [58] " The ape hits!" ## [59] " The ape bites!" ## [60] " In what direction? l" ## [61] " The arrow hits the ape!" ## [62] " The ape hits!" ## [63] " The ape misses!" ## [64] " The ape bites!" ## [65] " You hear someone cursing shoplifters." ## [66] " The ape misses!" ## [67] " The ape hits!" ## [68] " The ape bites!" ## [69] " What do you want to write with? [- amnqsvBJM-OWZ or ?*] -" ## [70] " You write in the dust with your fingertip." ## [71] " What do you want to write in the dust here? Elbereth" ## [72] " The ape hits!" ## [73] " The ape hits!" ## [74] " You die..." ## [75] " Do you want your possessions identified? [ynq] (y) y" ## [76] " Do you want to see your attributes? [ynq] (y) n" ## [77] " Do you want an account of creatures vanquished? [ynaq] (y) n" ## [78] " Do you want to see your conduct? [ynq] (y) n" ## [79] " Do you want to see the dungeon overview? 
[ynq] (y) q" ## [80] "" ## [81] "Inventory:" ## [82] " Coins" ## [83] " $ - 59 gold pieces" ## [84] " Weapons" ## [85] " m - 17 blessed +1 arrows" ## [86] " n - a blessed +0 arrow" ## [87] " q - 3 +0 arrows (in quiver)" ## [88] " s - a +0 bow (weapon in hand)" ## [89] " B - 11 +1 darts" ## [90] " N - 11 +0 darts" ## [91] " a - a +1 dagger (alternate weapon; not wielded)" ## [92] " Armor" ## [93] " T - an uncursed +0 dwarvish iron helm (being worn)" ## [94] " z - an uncursed +0 pair of leather gloves (being worn)" ## [95] " U - a cursed -4 pair of iron shoes (being worn)" ## [96] " e - an uncursed +2 cloak of displacement (being worn)" ## [97] " h - a blessed +0 dwarvish mithril-coat (being worn)" ## [98] " Comestibles" ## [99] " f - 3 uncursed cram rations" ## [100] " j - 2 uncursed food rations" ## [101] " L - an uncursed food ration" ## [102] " P - an uncursed lembas wafer" ## [103] " I - an uncursed lizard corpse" ## [104] " o - an uncursed tin of spinach" ## [105] " Scrolls" ## [106] " G - 2 uncursed scrolls of blank paper" ## [107] " t - an uncursed scroll of confuse monster" ## [108] " V - an uncursed scroll of identify" ## [109] " Potions" ## [110] " x - an uncursed potion of gain ability" ## [111] " H - a blessed potion of sleeping" ## [112] " g - 3 uncursed potions of water" ## [113] " Rings" ## [114] " O - an uncursed ring of slow digestion (on left hand)" ## [115] " v - an uncursed ring of stealth (on right hand)" ## [116] " Tools" ## [117] " p - an uncursed magic lamp" ## [118] " k - an uncursed magic whistle" ## [119] " Q - an uncursed mirror" ## [120] " C - an uncursed saddle" ## [121] " D - an uncursed stethoscope" ## [122] " y - a +0 unicorn horn" ## [123] " i - 7 uncursed wax candles" ## [124] " Gems/Stones" ## [125] " W - an uncursed flint stone" ## [126] " M - an uncursed worthless piece of red glass" ## [127] " Z - an uncursed worthless piece of violet glass" ## [128] " J - an uncursed worthless piece of white glass" ## [129] "" ## [130] "Brothertrebius the Ranger's attributes:" ## [131] "" ## [132] "Background:" ## [133] " You were a Trailblazer, a level 8 female gnomish Ranger." ## [134] " You were neutral, on a mission for Venus" ## [135] " who was opposed by Mercury (lawful) and Mars (chaotic)." ## [136] "" ## [137] "Final Characteristics:" ## [138] " You had 0 hit points (max:54)." ## [139] " You had 40 magic power (max:40)." ## [140] " Your armor class was 0." ## [141] " You had 1552 experience points." ## [142] " You entered the dungeon 7398 turns ago." ## [143] " Your strength was 15 (limit:18/50)." ## [144] " Your dexterity was 12 (limit:18)." ## [145] " Your constitution was 16 (limit:18)." ## [146] " Your intelligence was 13 (limit:19)." ## [147] " Your wisdom was 15 (limit:18)." ## [148] " Your charisma was 6 (limit:18)." ## [149] "" ## [150] "Final Status:" ## [151] " You were satiated." ## [152] " You were burdened; movement was slightly slowed." ## [153] " You were wielding a bow." ## [154] "" ## [155] "Final Attributes:" ## [156] " You were piously aligned." ## [157] " You were telepathic." ## [158] " You had automatic searching." ## [159] " You had infravision." ## [160] " You were displaced." ## [161] " You were stealthy." ## [162] " You had slower digestion." ## [163] " You were guarded." ## [164] " You are dead." 
## [165] "" ## [166] "Vanquished creatures:" ## [167] " a warhorse" ## [168] " a tengu" ## [169] " a quivering blob" ## [170] " an iron piercer" ## [171] " 2 black lights" ## [172] " a gold golem" ## [173] " a werewolf" ## [174] " 3 lizards" ## [175] " 2 dingoes" ## [176] " a housecat" ## [177] " a white unicorn" ## [178] " 2 dust vortices" ## [179] " a plains centaur" ## [180] " an ape" ## [181] " a Woodland-elf" ## [182] " 2 soldier ants" ## [183] " a bugbear" ## [184] " an imp" ## [185] " a wood nymph" ## [186] " a water nymph" ## [187] " a rock piercer" ## [188] " a pony" ## [189] " 3 fog clouds" ## [190] " a yellow light" ## [191] " a violet fungus" ## [192] " 2 gnome lords" ## [193] " 2 gnomish wizards" ## [194] " 2 gray oozes" ## [195] " 2 elf zombies" ## [196] " a straw golem" ## [197] " a paper golem" ## [198] " 2 giant ants" ## [199] " 2 little dogs" ## [200] " 3 floating eyes" ## [201] " 8 dwarves" ## [202] " a homunculus" ## [203] " 3 kobold lords" ## [204] " 3 kobold shamans" ## [205] " 13 hill orcs" ## [206] " 4 rothes" ## [207] " 2 centipedes" ## [208] " 3 giant bats" ## [209] " 6 dwarf zombies" ## [210] " a werejackal" ## [211] " 3 iguanas" ## [212] " 23 killer bees" ## [213] " an acid blob" ## [214] " a coyote" ## [215] " 3 gas spores" ## [216] " 5 hobbits" ## [217] " 7 manes" ## [218] " 2 large kobolds" ## [219] " a hobgoblin" ## [220] " 2 giant rats" ## [221] " 2 cave spiders" ## [222] " a yellow mold" ## [223] " 6 gnomes" ## [224] " 8 garter snakes" ## [225] " 2 gnome zombies" ## [226] " 8 geckos" ## [227] " 11 jackals" ## [228] " 5 foxes" ## [229] " 2 kobolds" ## [230] " 2 goblins" ## [231] " a sewer rat" ## [232] " 6 grid bugs" ## [233] " 3 lichens" ## [234] " 2 kobold zombies" ## [235] " 5 newts" ## [236] "206 creatures vanquished." ## [237] "" ## [238] "No species were genocided or became extinct." ## [239] "" ## [240] "Voluntary challenges:" ## [241] " You never genocided any monsters." ## [242] " You never polymorphed an object." ## [243] " You never changed form." ## [244] " You used no wishes." ## [245] "" ## [246] "The Dungeons of Doom: levels 1 to 6" ## [247] " Level 1:" ## [248] " A fountain." ## [249] " Level 2:" ## [250] " A sink." ## [251] " Level 3:" ## [252] " A general store, a fountain." ## [253] " Level 4:" ## [254] " A general store, a fountain." ## [255] " Stairs down to The Gnomish Mines." ## [256] " Level 5:" ## [257] " A fountain." ## [258] " Level 6: <- You were here." ## [259] " A general store." ## [260] " Final resting place for" ## [261] " you, killed by an ape." ## [262] "The Gnomish Mines: levels 5 to 8" ## [263] " Level 5:" ## [264] " Level 6:" ## [265] " Level 7:" ## [266] " Many shops, a temple, some fountains." ## [267] " Level 8:" ## [268] "" ## [269] "Game over:" ## [270] " ----------" ## [271] " / \\" ## [272] " / REST \\" ## [273] " / IN \\" ## [274] " / PEACE \\" ## [275] " / \\" ## [276] " | brothertrebius |" ## [277] " | 59 Au |" ## [278] " | killed by an ape |" ## [279] " | |" ## [280] " | |" ## [281] " | |" ## [282] " | 2018 |" ## [283] " *| * * * | *" ## [284] " _________)/\\\\_//(\\/(/\\)/\\//\\/|_)_______" ## [285] "" ## [286] "Goodbye brothertrebius the Ranger..." ## [287] "" ## [288] "You died in The Dungeons of Doom on dungeon level 6 with 6652 points," ## [289] "and 59 pieces of gold, after 7398 moves." ## [290] "You were level 8 with a maximum of 54 hit points when you died." ## [291] ""

Now, I am curious to see how many games are played per day:

runs_per_day <- nethack %>% group_by(date) %>% count() %>% ungroup() ggplot(runs_per_day, aes(y = n, x = date)) + geom_point(colour = "#0f4150") + geom_smooth(colour = "#82518c") + theme_blog() ## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The number of games per day seems to have been stable since 2015, at around 50. But what is also interesting is not only the number of games played, but also how many of these games resulted in a win.

For this, let's also add a new column that tells us whether the player ascended (won the game) or not:

nethack %<>% mutate(Ascended = ifelse(death == "ascended", "Ascended", "Died an horrible death"))

I’m curious to see how many players managed to ascend… NetHack being as hard as diamonds, probably
not a lot:

ascensions_per_day <- nethack %>% group_by(date, Ascended) %>% count() %>% rename(Total = n) ggplot(ascensions_per_day) + geom_area(aes(y = Total, x = as.Date(date), fill = Ascended)) + theme_blog() + labs(y = "Number of runs", x = "Date") + scale_fill_blog() + theme(legend.title = element_blank())

Yeah, just as expected. Because there is so much data, it's difficult to see clearly, though. Depending on the size of the screen you're reading this on, it might seem that on some days there are a lot of ascensions. This is only an impression due to the resolution of the picture. Let's look at the share of ascensions per year (and how many times the quests fail miserably), and this will become more apparent:

ascensions_per_day %>% mutate(Year = year(as.Date(date))) %>% group_by(Year, Ascended) %>% summarise(Total = sum(Total, na.rm = TRUE)) %>% group_by(Year) %>% mutate(denom = sum(Total, na.rm = TRUE)) %>% ungroup() %>% mutate(Share = Total/denom) %>% ggplot() + geom_col(aes(y = Share, x = Year, fill = Ascended)) + theme_blog() + scale_fill_blog() + theme(legend.title = element_blank())

I will now convert the “time” column to seconds. I am not yet sure that this column is really useful,
because NetHack is a turn based game. This means that when the player does not move, neither do the
monsters. So the seconds spent playing might not be a good proxy for actual time spent playing.
But it makes for a good exercise:

convert_to_seconds <- function(time_string){ time_numeric <- time_string %>% str_split(":", simplify = TRUE) %>% as.numeric time_in_seconds <- sum(time_numeric * c(3600, 60, 1)) time_in_seconds }

The strings I want to convert are of the form "01:34:43", so I split at the ":" and then convert the result to numeric. I end up with an atomic vector (c(1, 34, 43)). Then I multiply each element by the right number of seconds and sum to get the total. Let's apply it to my data:

nethack %<>% mutate(time_in_seconds = map_dbl(time, convert_to_seconds))
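As a quick sanity check, the helper returns what we would expect for the example string quoted above:

convert_to_seconds("01:34:43") # 1*3600 + 34*60 + 43
## [1] 5683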

What is the distribution of “time_in_seconds”?

nethack %>% describe(time_in_seconds) ## # A tibble: 1 x 13 ## variable type nobs mean sd mode min max q25 median q75 ## ## 1 time_in… Nume… 322485 23529. 2.73e5 61 2.72e7 622 1689 5486 ## # ... with 2 more variables: n_missing , n_unique

We see that the minimum of time_in_seconds is 61 whereas the maximum is of the order of 27200000…
This must be a mistake, because that is almost one year!

nethack %>% filter(time_in_seconds == max(time_in_seconds, na.rm = TRUE)) ## rank score name time turns lev_max hp_max role race ## 1 28 3173960108 fisted 7553:41:49 6860357 4/47 362/362 Wiz Elf ## gender alignment death date dumplog year ## 1 Mal Neu drowned in a pool of water 2017-02-02 NA 2017 ## month day Ascended time_in_seconds ## 1 2 2 Died an horrible death 27193309

Well… maybe “fisted” wanted to break the record of the longest NetHack game ever. Congratulations!

Let's take a look at the density, but cut it at the 90th percentile:

nethack %>% filter(!is.na(time_in_seconds), time_in_seconds < quantile(time_in_seconds, 0.9, na.rm = TRUE)) %>% ggplot() + geom_density(aes(x = time_in_seconds), colour = "#82518c") + theme_blog()

As expected, the distribution is right skewed. However, as explained above NetHack is a turn based
game, meaning that if the player does not move, the monsters won’t move either. Perhaps it makes more
sense to look at the turns column:

nethack %>% describe(turns) ## # A tibble: 1 x 13 ## variable type nobs mean sd mode min max q25 median q75 ## ## 1 turns Nume… 322485 4495. 19853. 1 6.86e6 871 1818 3582 ## # ... with 2 more variables: n_missing , n_unique

The maximum is quite large too. Just like before, let’s focus by cutting the variable at the 90th percentile:

nethack %>% filter(!is.na(turns), turns < quantile(turns, 0.9, na.rm = TRUE)) %>% ggplot() + geom_density(aes(x = turns), colour = "#82518c") + theme_blog()

I think that using turns makes more sense. In a future blog post, I will estimate a survival model to see how long players survive, and will use turns instead of time_in_seconds.

Analysis: What kills the players

To know what kills players so much, some cleaning of the death column is in order. Death can
occur from poisoning, starvation, accidents, drowning… of course monsters can kill the player too.
Here are some values of the death variable:

burned by a tower of flame
choked on a lichen corpse
died of starvation
fell into a pit of iron spikes
killed by a gnome
killed by a gnome called Blabla
killed by a gnome called Blabla while sleeping
slipped while mounting a saddled pony
slipped while mounting a saddled pony called Jolly Jumper
zapped her/himself with a spell

To know what is the most frequent cause of death, I have to do some cleaning, because if not,
“killed by a gnome” and “killed by a gnome called Blabla” would be two different causes of death.
In the end, what interests me is to know how many times the player got killed by a gnome.

The following lines do a cleanup of the death variable:

nethack %<>% mutate(death2 = case_when(str_detect(death, "poisoned") ~ "poisoned", str_detect(death, "slipped") ~ "accident", str_detect(death, "petrified") ~ "petrified", str_detect(death, "choked") ~ "accident", str_detect(death, "caught.*self") ~ "accident", str_detect(death, "starvation") ~ "starvation", str_detect(death, "drowned") ~ "drowned", str_detect(death, "fell") ~ "fell", str_detect(death, "zapped") ~ "zapped", str_detect(death, "killed") ~ "killed", TRUE ~ death)) %>% mutate(death3 = str_extract(death, "(?<=by|while).*")) %>% mutate(death3 = case_when(str_detect(death3, ",|\\bcalled\\b") ~ str_extract(death3, "(.*?),|(.*?)\\bcalled\\b"), TRUE ~ death3)) %>% mutate(death3 = str_remove(death3, ",|called|\\ban?"), death3 = str_trim(death3))

death2 is a new variable, in which I broadly categorize causes of death. Using regular expressions I detect causes of death and aggregate some categories, for instance "slipped" and "choked" into "accident". Then, I want to extract everything that comes after the strings "by" or "while", and put the result into a new variable called death3. Then I detect the string "," or "called"; if one of these strings is present, I extract everything that comes before "," or before "called". Finally, I remove ",", "called", "a" or "an" from the string and trim the whitespace.

Let’s take a look at these new variables:

set.seed(123) nethack %>% select(name, death, death2, death3) %>% sample_n(10) ## name death death2 death3 ## 92740 DianaFury killed by a death ray killed death ray ## 254216 Oddabit killed by a tiger killed tiger ## 131889 shachaf killed by a fire ant killed fire ant ## 284758 a43 poisoned by a killer bee poisoned killer bee ## 303283 goast killed by a gecko killed gecko ## 14692 liberty killed by a gnome king killed gnome king ## 170303 arch18 ascended ascended ## 287786 foolishwtf killed by a bat killed bat ## 177826 Renleve killed by a giant bat killed giant bat ## 147248 TheOV killed by a black unicorn killed black unicorn

Now it is quite easy to know which monsters are the meanest buttholes; let's focus on the top 15. Most likely, these are going to be early game monsters. Let's see:

nethack %>% filter(!is.na(death3)) %>% count(death3) %>% top_n(15) %>% mutate(death3 = fct_reorder(death3, n, .desc = FALSE)) %>% ggplot() + geom_col(aes(y = n, x = death3)) + coord_flip() + theme_blog() + scale_fill_blog() + ylab("Number of deaths caused") + xlab("Monster") ## Selecting by n

Seems like soldier ants are the baddest, followed by jackals and dwarfs. As expected, these are
mostly early game monsters. Thus, it would be interesting to look at this distribution, but at
different stages in the game. Let’s create a categorical variable that discretizes turns,
and then create one plot per category:

Click to expand

nethack %>% filter(!is.na(death3)) %>% filter(!is.na(turns)) %>% mutate(turn_flag = case_when(between(turns, 1, 5000) ~ "Less than 5000", between(turns, 5001, 10000) ~ "Between 5001 and 10000", between(turns, 10001, 20000) ~ "Between 10001 and 20000", between(turns, 20001, 40000) ~ "Between 20001 and 40000", between(turns, 40001, 60000) ~ "Between 40001 and 60000", turns > 60000 ~ "More than 60000")) %>% mutate(turn_flag = factor(turn_flag, levels = c("Less than 5000", "Between 5001 and 10000", "Between 10001 and 20000", "Between 20001 and 40000", "Between 40001 and 60000", "More than 60000"), ordered = TRUE)) %>% group_by(turn_flag) %>% count(death3) %>% top_n(15) %>% nest() %>% mutate(data = map(data, ~mutate(., death3 = fct_reorder(death3, n, .desc = TRUE)))) %>% mutate(plots = map2(.x = turn_flag, .y = data, ~ggplot(data = .y) + geom_col(aes(y = n, x = death3)) + coord_flip() + theme_blog() + scale_fill_blog() + ylab("Number of deaths caused") + xlab("Monster") + ggtitle(.x))) %>% pull(plots) ## Selecting by n ## [[1]]

## ## [[2]]

## ## [[3]]

## ## [[4]]

## ## [[5]]

## ## [[6]]


Finally, for this section, I want to know if there are levels, or floors, where players die more
often than others. For this, we can take a look at the lev_max column. Observations in this
column are of the form “8/10”. This means that the player died on level 8, but the lowest level
that was explored was the 10th. Let’s do this for the year 2017 first. Before anything, I have
to explain the layout of the levels of the game. You can see a diagram
here. The player starts on floor 1,
and goes down to level 53. Then the player can ascend by going up through levels -1 to -5. But there are more levels than these: -6 and -9 are the sky, and the player can teleport there (but will
fall to his death). If the player teleports to level -10, he’ll enter heaven (and die too). Because
these levels are special, I do not consider them here. I do not consider level 0 either, which is
“Nowhere”. Let’s get the number of players who died on each floor, but also compute the cumulative
death count:

died_on_level <- nethack %>% filter(Ascended == "Died an horrible death") %>% mutate(died_on = str_extract(lev_max, "-?\\d{1,}")) %>% mutate(died_on = as.numeric(died_on)) %>% group_by(year) %>% count(died_on) %>% filter(died_on >= -5, died_on != 0) %>% mutate(died_on = case_when(died_on == -1 ~ 54, died_on == -2 ~ 55, died_on == -3 ~ 56, died_on == -4 ~ 57, died_on == -5 ~ 58, TRUE ~ died_on)) %>% arrange(desc(died_on)) %>% mutate(cumul_deaths = cumsum(n))

Let’s take a look:

head(died_on_level) ## # A tibble: 6 x 4 ## # Groups: year [6] ## year died_on n cumul_deaths ## ## 1 2002 58 5 5 ## 2 2003 58 11 11 ## 3 2004 58 19 19 ## 4 2005 58 28 28 ## 5 2006 58 25 25 ## 6 2007 58 22 22

Now, let’s compute the number of players who ascended and add this to the cumulative count:

ascended_yearly <- nethack %>% filter(Ascended == "Ascended") %>% group_by(year) %>% count(Ascended)

Let’s take a look:

head(ascended_yearly) ## # A tibble: 6 x 3 ## # Groups: year [6] ## year Ascended n ## ## 1 2001 Ascended 4 ## 2 2002 Ascended 38 ## 3 2003 Ascended 132 ## 4 2004 Ascended 343 ## 5 2005 Ascended 329 ## 6 2006 Ascended 459

I will modify the dataset a little bit and merge it with the previous one:

ascended_yearly %<>% rename(ascended_players = n) %>% select(-Ascended)

Let’s add this to the data frame from before by merging both, and then we can compute the
surviving players:

died_on_level %<>% full_join(ascended_yearly, by = "year") %>% mutate(surviving_players = cumul_deaths + ascended_players)

Now we can compute the share of players who died on each level:

died_on_level %>% mutate(death_rate = n/surviving_players) %>% ggplot(aes(y = death_rate, x = as.factor(died_on))) + geom_line(aes(group = year, alpha = year), colour = "#82518c") + theme_blog() + ylab("Death rate") + xlab("Level") + theme(axis.text.x = element_text(angle = 90), legend.position = "none") + scale_y_continuous(labels = scales::percent)

Looks like level 7 is consistently the most dangerous! The death rate there is more than 35%!

That’s it for this blog post, in the next one, I will focus on what players kill!

If you found this blog post useful, you might want to follow me on twitter for blog post updates.


To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Le Monde puzzle [#1073]

Sat, 11/03/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

And here is Le Monde mathematical puzzle  last competition problem

Find the number of integers such that their 15 digits all belong to {1, 2, 3, 4}, and the absolute difference between two consecutive digits is 1. Among these numbers, how many have 1 as their three-before-last digit, and how many have 2?

Combinatorics!!! While it seems like a Gordian knot because the number of choices depends on whether or not a digit is an endpoint (1 or 4), there is a nice recurrence relation between the numbers of such integers with n digits and leftmost digit i, namely that

n⁴=(n-1)³, n³=(n-1)²+(n-1)⁴, n²=(n-1)¹+(n-1)³, n¹=(n-1)²

with starting values 1¹=1²=1³=1⁴=1 (and hopefully obvious notations). Hence it is sufficient to iterate the corresponding transition matrix

on this starting vector 14 times (with R, which does not enjoy a built-in matrix power) to end up with

15¹=610, 15²=987, 15³=987, 15⁴=610

which leads to 3194 different numbers as the solution to the first question. As pointed out later by Amic and Robin in Dauphine, this happens to be twice Fibonacci’s sequence.

For the second question, the same matrix applies, with a different initial vector. Out of the 3+5+5+3=16 different integers with 4 digits, 3 start with 1 and 5 with 2. Then multiplying (3,0,0,0) by M¹¹ leads to 267+165=432 different values for the 15-digit integers, and multiplying (0,5,0,0) by M¹¹ leads to 445+720=1165 integers. (With the reassuring property that 432+1165 is half of 3194!) This is yet another puzzle in the competition that is of no special difficulty and with no further challenge going from the first to the second question…
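Since the transition matrix image does not reproduce well here, a short R sketch of the computation follows, with the matrix written out from the recurrence and a small helper standing in for the missing built-in matrix power:

M <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
matpow <- function(A, k) Reduce(`%*%`, replicate(k, A, simplify = FALSE))
c(matpow(M, 14) %*% rep(1, 4))       ## 610 987 987 610, summing to 3194
sum(matpow(M, 11) %*% c(3, 0, 0, 0)) ## 432 integers with 1 as three-before-last digit
sum(matpow(M, 11) %*% c(0, 5, 0, 0)) ## 1165 integers with 2 as three-before-last digit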


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Master R shiny: One trick to build maintainable and scalable event chains

Fri, 11/02/2018 - 13:10

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

Introduction

Writing appealing interactive web applications – one of STATWORX's many competences – is easy with R shiny. Just a few lines of code in one R script create all the logic you need to let the magic of shiny happen. It is so simple that you can make a hello world app in a heartbeat, like so.

library(shiny) ui <- fluidPage( "Hello, World!" ) server <- function(input, output, session) { } shinyApp(ui, server)

Today I am going to show you one way you can use native shiny syntax to modularize pieces of your code in a way that makes your code basis easily maintainable and extendable. Since I assume you are already familiar with shiny, I’ll skip the intro wading pool example and go right to the high-dive.

What are event chains?

An event chain describes the relationship between events and tasks and how the events affect each other. In some cases, you may want to have an app that takes user input and performs actions based on the nature of the input, potentially asking for more information along the way. In such a case, chances are you want to implement an event chain. You could immediately start hacking some crude solution to your problem, but you may risk creating hardly comprehensible code. Furthermore, imagine that requirements on your event chain suddenly change. In this case, it is important to modularize your event chain so that it remains maintainable and adaptable.

Example: the friend logger

So, let me illustrate how to build a modularized event chain. Imagine you are pedantic about time and take appointments seriously. Quite to the detriment of your so-called "friends", you make no exceptions. Every time a friend is too late, you suffer so badly that you have decided to use a shiny app to keep score of your friends' visits in order to determine how reliable they are (you pathetic you!). The requirements on the app's usage are simple, as shown in the graph below.

You want to compare the expected arrival time of your friend with his actual arrival time. If his delay is above a certain threshold (e.g. 5 minutes), you want to record his excuse for being late. If you deem his excuse acceptable, you overlook his sin (but still keep a record!). If he is punctual, he receives a bonus point. If he arrives too late and his excuse is not acceptable, he receives a minus point. In any case, you log his visit (how low can you get?). To keep things more visual, here is a sketch of the app's UI, including the event sequence when a friend is late.

Now, it is time to implement the app.

Event chain architecture in R Shiny

It takes two ingredients to implement event chains:

  1. triggers that are stored in reactiveValues()
  2. observers (observeEvent()) that are triggered and carry out the actual checks and other computations

The actual trick is to find the appropriate number of observeEvent()s so that each step in the event chain is covered by one observeEvent and therefore no code redundancies are created. Using the example above, we have three possible sequences of events:

  1. Friend is too late and has a good excuse
  2. Friend is too late and doesn't have a good excuse
  3. Friend is not too late

In all three cases, we need to log a friend's visit, so it definitely makes sense to put the visit logging part in one observeEvent and to call that observer at the end of each of the sequences above. Drawing an event chain diagram comes in especially handy here, as it supports a suitable architectural design choice. I used draw.io for the task.

For the app, I used one reactiveValues-object in which I put all triggers (you can find the whole app code on GitHub).

shinyServer(function(input, output, session) { # Data rv <- reactiveValues( ... # Triggers ask_for_reason = TRUE, change_friend_score = TRUE, save_visit = TRUE, error = FALSE ) ... })

I use boolean values for the triggers so that I only have to negate them if I want to change their value (a <- !a). Using integers would also work, but I find the flip-trick nicer. Let's look at the part of the chain where a friend's punctuality is checked in more detail. The module that checks punctuality also reads in the data. Depending on the input, it either calls the "Ask-for-a-reason"-module or directly calls the visit logger.

# Submit friend data ---- observeEvent(input$submit, { # Collect data ... is_delayed <- difftime(actual_time, expected_time, units = "mins") > input$acceptance if (is_delayed) { # Friend is delayed --> trigger Ask-for-reason-module rv$ask_for_reason <- isolate(!rv$ask_for_reason) return() } # Friend seems punctual --> Add a point to score list :) friend_data <- set_data(friend_data, score = 1) # Trigger visit logger rv$change_friend_score <- isolate(!rv$change_friend_score) })

As you can see, once you have drawn the event chain it is quite intuitive to translate it into shiny code. If the friend is punctual, we set his score to one (score will be added in the visit logger module) and call the visit logger module, which looks like this:

# Change friend score ---- observeEvent(rv$change_friend_score, ignoreInit = TRUE, { rv$friend_score[friend_score$name == friend_data$name, "score"] <- isolate(rv$friend_score[friend_score$name == friend_data$name, "score"]) + friend_data$score # Make change permanent saveRDS(rv$friend_score, "data/friend_score.RDS") rv$save_visit <- isolate(!rv$save_visit) })

Note that the rv$save_visit trigger simply calls an observer that adds another row to the friend visit table and does some cleaning.

So now let's make a little test run with the ready product. For your app to work, you of course have to first create an initial dataset with your friends and their initial scores in order to know who you are keeping record of. In the example below, we have four friends: John, Emily, Olivia, and Ethan. You can see their scores in the lower left corner. Past visits are logged and displayed in the upper right corner.

John wants to hang out with us to play some brutal video games, and for no obvious reason we made an appointment at 9 am. However, John shows up 7 (!!!) minutes late. Enough is enough. We enter his misdeed.

It exceeds the threshold, so we are, as expected, prompted to enter the reason.

When we asked John to justify himself, he just shrugged his shoulders. How dare he?! That's a minus point…

Extend our event chain

Even though you hurt all over because of John's unreliability, you are quite happy with your app. Yet, things could be better! For example, every time you misspell a friend's name in the name field when logging a visit, the app crashes. Your app could use some (additional) sanity checks. A perfect use case for showing the flexibility of your architecture. After a few months of deep reflection, you came up with a new event flow graph that takes care of wrong inputs.

You identified two spots where the app ought to be stabilized. First, you want to throw an error to the user if a friend doesn't exist (without stopping the app). Second, you require yourself to enter a reason (we all know how sloppy our future selves can be from time to time).

With the already chosen modularized structure, it is easy to incorporate these checks. You simply need to add one more trigger (rv$error) and one global container that stores the error information.

# Error handler error <- reactiveValues( title = "", message = "" )

If, for example, you want to check whether an entered name exists in your database, all you have to do is add a few lines of code at the beginning of the observer where a friend's punctuality is checked.

# Submit friend data ---- observeEvent(input$submit, { # Friend exists? if (!input$name %in% rv$friend_score$name) { error$title <- "%s not found" %>% sprintf(., input$name) error$message <- h1("404") rv$error <- isolate(!rv$error) return() } ... })

If the name doesn't match any of your friends' names, you trigger an error handler module whose only purpose is to show an error message:

# Error handling ---- observeEvent(rv$error, ignoreInit = TRUE, { showModal(modalDialog( title = error$title, error$message, footer = actionButton("exit", "Ok", class = "btn-primary") )) })

The nice thing is that you can use this module to handle any errors, no matter which sanity checks have caused them.

So if we go back to the app now and enter a name that doesn't exist (like Tobias), we get the following error message:

Furthermore, if we forget to enter a reason when asked for one, we get a passive-aggressive reminder:

You are welcome! Now, would you excuse me? I have some visits to log…

About the author

Tobias Krabel

Tobias is part of the Data Science team and is currently completing his second Master's degree, in computer science. In his free time he volunteers and enjoys hiking in nature.

The post Master R shiny: One trick to build maintainable and scalable event chains first appeared on STATWORX.


To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The blocks and rows theory of data shaping

Fri, 11/02/2018 - 05:15

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RcppAnnoy 0.0.11

Fri, 11/02/2018 - 02:01

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new release of RcppAnnoy is now on CRAN.

RcppAnnoy is the Rcpp-based R integration of the nifty Annoy library by Erik. Annoy is a small and lightweight C++ template header library for very fast approximate nearest neighbours—originally developed to drive the famous Spotify music discovery algorithm.

This release updates to a new upstream version (including a new distance measure), and includes a spiffy new vignette by Aaron Lun describing how to use Annoy from C++ as he does in his new Bioconductor package BiocNeighbours.
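For readers new to the package, basic usage from R looks roughly like the sketch below; the data here is made up, and the package help pages and the new vignette are the authoritative references:

library(RcppAnnoy)
a <- new(AnnoyAngular, 10)                    # index for 10-dimensional vectors
for (i in 1:100) a$addItem(i - 1, rnorm(10))  # item ids are zero-based
a$build(50)                                   # build 50 trees
a$getNNsByItem(0, 5)                          # 5 approximate nearest neighbours of item 0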

All changes in this version are summarized below:

Changes in version 0.0.11 (2018-10-30)
  • Synchronized with Annoy upstream (#26, #30, #36).

  • Added new Hamming distance measure functionality; should be considered experimental as the functionality depends on integer values.

  • Travis CI use was updated to the R 3.5 PPA (#28)

  • New vignette about Annoy use from C++ via Rcpp (Aaron Lun in #29 addressing #19; also #32, #33)

  • The vignette was rewritten using pinp (#34, #35).

Courtesy of CRANberries, there is also a diffstat report for this release.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

collateral

Fri, 11/02/2018 - 01:00

(This article was first published on James Goldie (rstats posts), and kindly contributed to R-bloggers)

Toward the end of my PhD analysis, I’ve started leaning on purrr more and more to get analysis done on small multiples quickly and safely. Where before I might have used nested loops and potentially missed a lot of problems, I now split data frames out, build statistical models, extract model parameters and print plots all in one workflow.

This has incredible benefits for workflow: instead of having to track group identifiers across each output you produce, data and outputs are linked to group identifiers by row, and you can do further tidying to summarise these complex outputs. You essentially have a summary of your analysis right there in the data frame.

In particular, purrr::safely() and purrr::quietly() wrap a number of R’s output and error handling systems, allowing you to push through errors as you iterate and capture them along with warnings, messages and other output. Instead of getting zero output until every group is behaving, you can get to the end of your analysis and then review to see what didn’t work out.
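As a reminder of what these wrappers return on their own, here is a small sketch using the log() example from the purrr documentation:

library(purrr)
safe_log <- safely(log)
safe_log(10)   # $result = 2.302585, $error = NULL
safe_log("a")  # $result = NULL, $error = <simpleError: non-numeric argument ...>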

These two functions already lend themselves well to mapping, but since they return nested lists with each component of the output, it's not always easy to tell, until you burrow into the returned column, which groups returned output, which stopped with errors and which flagged other warnings or messages.

In response, I built the collateral package. Collateral provides two new map variants:

  • map_safely() wraps safely() and,
  • map_quietly() wraps quietly().

As well as automatically wrapping these functions (so you can use them exactly as you would the vanilla map()), these functions style the resulting list (or list-column) using pillar—the same system that provides enhanced tibble output. You can scan down the column and see which mapped elements returned a result, which stopped with errors, and which returned warnings, messages or other output. If you’d like to inspect these kinds of output in more detail (for example, retrieve an error or read returned warnings), you can do so just as you would a manually mapped-and-wrapped function call, using components like $result, $error or $warning.

Here’s a quick example, using a safe version of log (as in the safely() documentation):

library(tidyverse) library(collateral) test = # tidy up and trim down for the example mtcars %>% rownames_to_column(var = "car") %>% as_data_frame() %>% select(car, cyl, disp, wt) %>% # spike some rows in cyl == 4 to make them fail mutate(wt = dplyr::case_when( wt < 2 ~ -wt, TRUE ~ wt)) %>% # nest and do some operations quietly() nest(-cyl) %>% mutate(qlog = map_quietly(data, ~ log(.$wt))) test #> # A tibble: 3 x 4 #> cyl data qlog #> #> 1 6 R O _ _ #> 2 4 R O _ W #> 3 8 R O _ _

The captured side effects appear in the tibble columns: (R)esults, (O)utput, (M)essages and (W)arnings. If you're using map_safely() instead, you'll see (R)esults and (E)rrors. If we investigate row 2 further, we can see the problem:

test$qlog[[2]]$warnings # [1] "NaNs produced" test$qlog[[2]]$result # [1] 0.8415672 1.1600209 1.1474025 0.7884574 NaN NaN 0.9021918 NaN 0.7608058 # [10] NaN 1.0224509

Even better: because collateral plugs into pillar, which powers enhanced tibble output, you get color in the terminal!

(Next on the to-do list: getting this going with knitted output!)

You can find and install collateral on GitHub. If you have ideas or bugs, please get in touch or
file an issue!


To leave a comment for the author, please follow the link and comment on their blog: James Goldie (rstats posts). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

2018-11 Variable-Width Bezier Splines in R

Thu, 11/01/2018 - 23:58

(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

This report describes support for a new type of variable-width line in the ‘vwline’ package for R that is based on Bezier curves. There is also a new function for specifying the width of a variable-width line based on Bezier curves and there is a new linejoin and lineend style, called “extend”, that is available when both the line and the width of the line are based on Bezier curves. This report also introduces a small ‘gridBezier’ package for drawing Bezier curves in R.

Paul Murrell

Download


To leave a comment for the author, please follow the link and comment on their blog: R – Stat Tech. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Automated Email Reports with R

Thu, 11/01/2018 - 23:00

(This article was first published on R programming – Journey of Analytics, and kindly contributed to R-bloggers)

R is an amazing tool to perform advanced statistical analysis and create stunning visualizations. However, data scientists and analytics practitioners do not work in silos, so these analyses have to be copied and emailed to senior managers and partner teams. Cut-copy-paste sounds great, but if it is a daily or periodic task, it is more useful to automate the reports. So in this blog post, we are going to learn how to do exactly that.

The R-code uses specific library packages to do this:

  • RDCOMClient – to connect to Outlook and send emails. In most offices, Outlook is still the de facto email client, so this is fine. However, if you are using Slack or something different it may not work.
  • r2excel – to create an Excel output file.

The screenshot below shows the final email view:

email screenshot

As seen in the screenshot, the email contains the following:

  • Custom subject with current date
  • Embedded image
  • Attachments – 1 Excel and 1 pdf report
Code Explanation:

The code and supporting input files are available here, under the Projects page under Nov2018. The code has six parts:

  • Prepare the work space.
  • Pull the data from source.
  • Cleaning and calculations
  • Create pdf.
  • Create Excel file.
  • Send email.


Prepare the work space

I always set the relative paths and working directories at the very beginning, so it is easier to change paths later. You can replace the link with a shared network drive path as well.

Load library packages and custom functions. My code uses the r2excel package, which is not directly available as a CRAN package, so you need to install it with devtools using the code below.

It is possible to do something similar using the “xlsx” package, but r2excel is easier.

library(devtools) install_github("kassambara/r2excel") library(r2excel)

Some other notes:

  • you need the first 2 lines of code only the first time you install; from then on, you only need to load the library.
  • r2excel seems to work only with 64-bit installations of R and Rstudio.
  • you do need Java installed on your computer. If you see an error about java namespace, then check the path variables. There is a very useful thread on Stackoverflow, so take a look.
  • As always, if you see errors Google it and use the Stack Overflow conversations. In 99% of cases, you will find an answer.
Pull the data from source

This is where we connect to an Excel CSV (or text) file. In practice, most people connect to a database of some kind. The R-script I am using connects to a .csv file, but I have also added the code to connect to a SQL database.

That code snippet is commented out, so feel free to substitute your own SQL database links. The code will also work for an Amazon EC2 cluster.

Some points to keep in mind:

  • If you are using sqlquery() then please note that if your query has an error then R sadly shows only a standard error message. So test your query on SQL server to ensure that you are not missing anything.
  • Some queries do take a long time, if you are pulling from a huge dataset. Also, the time taken will be longer in R compared to a direct SQL server connection. Using the Sys.time() command before and after the query is helpful to know how long the query took to complete (see the short sketch after this list).
  • If you are only planning to pull the data randomly, it may make sense to pull from SQL server and store locally. Use the fread() function to read those files.
  • If you are using R desktop instead of R-server, the amount of data you can pull may be limited by your system configuration.
  • ALWAYS optimize your query. Even if you have unlimited memory and computation power, only pull the data you absolutely need. Otherwise you end up unnecessarily sorting through irrelevant data.
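As promised above, here is a rough sketch of the timing trick; the connection object, table, and query are placeholders, and I am using a generic DBI call rather than the commented-out snippet from the script:

library(DBI)
start_time <- Sys.time()
app_data <- dbGetQuery(con, "SELECT * FROM app_reviews")  # con and the query are placeholders
Sys.time() - start_time                                   # how long the pull took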
Cleaning and calculations

For the current data, there are no NAs, so we don’t need to account for those. However, the read.csv() command creates factors, which I personally do not like, as they sometimes cause issues while merging.

Some of the column names have "." where R converted the spaces in the names, so we will manually replace those with an underscore using the gsub() function.
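A minimal sketch of this cleaning step; the file name and data frame name are placeholders for whatever source you actually read in:

apps <- read.csv("googleplaystore.csv", stringsAsFactors = FALSE)  # avoid factors up front
names(apps) <- gsub(".", "_", names(apps), fixed = TRUE)  # e.g. "Content.Rating" -> "Content_Rating"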

We will also rank the apps based on categories of interest, namely:

  • Most Popular Apps – by number of Reviews
  • Most Popular Apps – by number Downloads and Reviews
  • Most Popular Categories – Paid Apps only
  • Most popular apps with 1 billion installations.
Create pdf

We are going to use the pdf() function to paste all graphs into a pdf document. Basically, what this function does is write the graphs to a file rather than showing them on the console. So the only thing to remember is that if you are testing graphs or make an incorrect graph, everything will get written to the pdf until you hit the dev.off() function. Sometimes, if a graph throws an error, you may end up with a blank page, or worse, with a corrupt file that cannot be opened.

Currently the code only prints 2 simple graphs using the ggplot() and barplot() functions, but you can include many other plots as well.
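A minimal sketch of the pdf step, assuming the cleaned apps data frame from above; the column names and file name are illustrative:

library(ggplot2)
pdf("app_report.pdf", width = 8, height = 6)
barplot(table(apps$Category), las = 2, main = "Apps per category")
print(ggplot(apps, aes(x = Rating)) + geom_histogram() + ggtitle("Rating distribution"))  # ggplot objects must be print()-ed inside pdf()
dev.off()  # close the device so the file is written properly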


Create Excel file.

The Excel file is created in the sequence below (a condensed sketch follows the list):

  • Specify the filename and create an object of type .xlsx. This will create an empty Excel placeholder; it is only complete when you save the workbook using saveWorkbook() at the end of the section.
  • Use the sheets() to create different worksheets within the Excel.
  • The  xlsx.addHeader() adds a bold Header to each sheet which will help readers understand the content on the page. The r2excel package has other functions to add more informative text in smaller (non-header) font as well, if you need to give some context to readers. Obviously, this is optional if you don’t want to add them.
  • xlsx.addTable() – this is the crucial function that adds the content to Excel, the main “meat” of what you need to show.
  • saveWorkbook() – this function will save the Excel to the folder.
  • xlsx.openFile() – this function opens the file so you can view contents. I typically have the script running on automated mode, so when the Excel opens I am notified that the script completed.
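Here is the condensed sketch promised above, following the r2excel tutorial usage; the sheet name, the top_reviewed data frame, and the exact argument names are assumptions:

library(r2excel)
wb <- createWorkbook(type = "xlsx")
sheet1 <- createSheet(wb, sheetName = "Most reviewed apps")
xlsx.addHeader(wb, sheet1, value = "Most Popular Apps - by number of Reviews", level = 2)
xlsx.addTable(wb, sheet1, data = top_reviewed)   # top_reviewed is an assumed summary table
saveWorkbook(wb, "app_report.xlsx")
xlsx.openFile("app_report.xlsx")                 # open the finished file as a completion signal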
Send email

The email is sent using the following functions (a stripped-down sketch follows the list):

  • OutApp() – creates an Outlook object. As I mentioned earlier, you do need Outlook and need to be signed in for this to work. I use Outlook for work and at home, so I have not explored options for Slack or other email clients.
  • outMail[["To"]] – specify the people in the "To" field. You could also read email addresses from a file and pass the values here.
  • outMail[["cc"]] – similar concept, for the cc field.
  • outMail[["Subject"]] – I have used the paste0() function to add the current date to the subject, so recipients know it is the latest report.
  • outMail[["HTMLBody"]] – I used the HTML body so that I can embed the image. If you don't know HTML programming, no worries! The code is pretty intuitive, and you should be able to follow what I've done. The image is basically an attachment which the HTML code forces to be viewed within the body of the email. If you are sending the email to people outside the organization, they may see a small box instead of the image, with a cross in the top left (or right) of the box. Usually, when they hover the mouse near the box and right-click, it will ask them to download images. You may have seen similar messages in Gmail, along with a link to "show images" or "always show images from this sender". You obviously cannot control what the recipient selects, but testing by sending to yourself first helps smooth out potential aesthetic issues.
  • outMail[["Attachments"]] – function to add attachments.
  • outMail$Send() – until you run this command, the mail will not be sent. If you are using this in the office, you may get a popup asking you to do one of the following. Most of these will generally go away after the first use, but if they don't, please look up the issue on StackOverflow or contact your IT support for firewall and other security settings.
    • a popup asking you to hit "send"
    • a popup asking you to "classify" the attachments (internal / public / confidential); select as appropriate. For me this is usually "internal".
    • a popup asking you to accept "trust" settings
    • a popup asking you to allow the backend app to access Outlook.
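Putting those pieces together, here is a minimal sketch using the RDCOMClient package; the addresses, file paths, and the cid-based image embedding are illustrative assumptions and may need adjusting for your setup:

library(RDCOMClient)

# Create the Outlook COM object (Outlook must be installed and signed in).
OutApp  <- COMCreate("Outlook.Application")
outMail <- OutApp$CreateItem(0)  # 0 = a new mail item

# Recipients and subject (placeholder addresses).
outMail[["To"]]      <- "team@example.com"
outMail[["Cc"]]      <- "manager@example.com"
outMail[["Subject"]] <- paste0("Weekly report - ", Sys.Date())

# HTML body; the image is attached below and referenced by file name so
# that Outlook displays it inside the body.
outMail[["HTMLBody"]] <- paste0(
  "<p>Hi all,</p><p>Please find this week's report attached.</p>",
  "<img src='cid:report_graph.png'>"
)

# Attachments (placeholder paths).
outMail[["Attachments"]]$Add(normalizePath("report_graph.png"))
outMail[["Attachments"]]$Add(normalizePath("weekly_report.xlsx"))

# Nothing goes out until this line runs.
outMail$Send()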

 

That is it – and you are done! You have successfully learned how to send an automated email via R.


To leave a comment for the author, please follow the link and comment on their blog: R programming – Journey of Analytics.

New Course: Analyzing Election and Polling Data in R

Thu, 11/01/2018 - 19:19

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Here is the course link.

Course Description

This is an introductory course to the R programming language as applied in the context of political data analysis. In this course students learn how to wrangle, visualize, and model data with R by applying data science techniques to real-world political data such as public opinion polling and election results. The tools that you’ll use in this course, from the dplyr, ggplot2, and choroplethr packages, among others, are staples of data science and can be used to analyze almost any dataset you get your hands on. Students will learn how to mutate columns and filter datasets, graph points and lines on charts, make maps, and create models to understand relationships between variables and predict the future. This course is suitable for anyone who already has downloaded R and knows the basics, like how to install packages.
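To give a flavour of the wrangling and plotting the course covers, here is a small illustrative example with made-up approval numbers (not a course dataset), assuming dplyr and ggplot2 are installed:

library(dplyr)
library(ggplot2)

# Hypothetical approval polling data, for illustration only.
polls <- data.frame(
  president  = c("Truman", "Truman", "Obama", "Obama", "Trump", "Trump"),
  year       = c(1946, 1950, 2010, 2014, 2017, 2018),
  approve    = c(43, 39, 48, 42, 39, 41),
  disapprove = c(45, 52, 46, 51, 55, 53)
)

polls %>%
  mutate(net_approval = approve - disapprove) %>%   # mutate a new column
  filter(year >= 2010) %>%                          # filter the dataset
  ggplot(aes(x = year, y = net_approval, colour = president)) +
  geom_line() +                                     # lines on a chart
  geom_point() +                                    # points on a chart
  labs(title = "Net approval over time (illustrative data)")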

Chapter 1: Presidential Job Approval Polls (Free)

Chapter one uses a dataset of job approval polling for US presidents since Harry Truman to introduce you to data wrangling and visualization in the tidyverse.

Chapter 2: U.S. House and Senate Polling

In this chapter, you will embark on a historical analysis of "generic ballot" US House polling and use data visualization and modeling to answer two big questions: Has the country changed over time? Can polls predict elections?

Chapter 3: Election Results and Political Demography

This chapter teaches you how to make maps and understand linear regression in R. With election results from the United States and the United Kingdom, you’ll also learn how to use regression models to analyze the relationship between two (or more!) variables.

Chapter 4: Predicting the Future of Politics

In this ensemble of applied statistics and data analysis, you will wrangle, visualize, and model polling and prediction data for two sets of very important US elections: the 2018 House midterms and 2020 presidential election.

Prerequisites

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.

Azure ML Studio now supports R 3.4

Thu, 11/01/2018 - 17:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Azure ML Studio, the collaborative drag-and-drop data science workbench, now supports R 3.4 in the Execute R Script module. Now you can combine the built-in data manipulation and analysis modules of ML Studio with R scripts to accomplish other data tasks, as for example in this workflow for oil and gas tank forecasting.

With the Execute R Script module you can immediately use more than 650 R packages which come preinstalled in the Azure ML Studio environment. You can also use other R packages (including packages not on CRAN) and source in R scripts you develop elsewhere (as shown above), although this does require the time to install them in the Studio environment. You can even create custom ML Studio models encapsulating R code for others to use in the drag-and-drop environment.
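For readers who have not used the module before, a minimal Execute R Script body looks roughly like this; the maml.* calls are the module's standard input/output port mappings in Azure ML Studio (classic), and the dplyr step is just an illustrative placeholder:

# Inside an Execute R Script module in Azure ML Studio (classic):

# Read the data frame connected to the module's first input port.
dataset1 <- maml.mapInputPort(1)

# Any of the preinstalled packages can be loaded directly.
library(dplyr)

# Illustrative transformation; replace with your own logic.
data.set <- dataset1 %>%
  mutate(row_id = row_number())

# Send the result to the module's output port.
maml.mapOutputPort("data.set")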

If you're new to Azure ML Studio, check out the Quickstart Tutorial for R to learn how to use the Execute R Script module, and follow the link below to see what's new in the latest update.

Microsoft Docs: What's New in Azure Machine Learning Studio


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

EARL Houston: Interview with Robert Gentleman

Thu, 11/01/2018 - 16:57

(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)

Dr. Gentleman’s work at 23andMe focuses on the exploration of how human genetic and trait data in the 23andMe database can be used to identify new therapies for disease. Dr. Gentleman is also recognised as one of the originators of the R programming language and has been awarded the Benjamin Franklin Award, a recognition for Open Access in the Life Sciences presented by the Bioinformatics Organisation. His keynote will focus on the History of R and some thoughts on data science.

Dr Robert Gentleman, it is an honour to have you as a keynote speaker at our EARL conference in Houston. We are intrigued to hear more about your career to date and how your work around open access to scientific data has helped shape valuable research worldwide…

Amongst your significant achievements to date has been the development of the R programming language, alongside fellow statistician Ross Ihaka, at the University of Auckland in the mid-1990s. What prompted you to develop a new statistical programming language?

Ross and I had a lot of familiarity with S (from Bell Labs) and at that time (my recollection is) there were a lot more languages for Statistics around.  We were interested in how languages were used for data analysis. Both Ross and I had some experience with Lisp and Scheme and at that time some of the work in computer science was showing how one could easily write interpreters for different types of languages, largely based on simple Scheme prototypes. We liked a lot about S, but there were a few places where we thought that different design decisions might provide improved functionality. So we wrote a simple Scheme interpreter and then gradually modified it into the core of the R language. As we went forward we added all sorts of different capabilities and found a large number of great collaborators.

As we made some progress we found that there were others around the world who were also interested in developing a system like R. And luckily, at just about that time, the internet became reliable and tools evolved that really helped support a distributed software development process. That group of collaborators became R Core and then later formed the nucleus for the R Foundation.

Probably the most important development was CRAN and some of the important tools that were developed to support the widespread creation of packages. Really a very large part of the success of R is due to the ability of any scientist to write a package containing code to carry out an analysis and to share that.

In 2008 you were awarded the Benjamin Franklin Award for your contribution to open access research. What areas of your research contributed towards being awarded this prestigious accolade?

I believe that my work on R was important, but perhaps more important for that award was the creation of the Bioconductor Project, together with a number of really great colleagues. Our paper in Genome Biology describes both those involved and what we did.

In your opinion how has the application of open source R for big data analysis, predictive modelling, data science and visualisation evolved since its inception?

In too many ways for me to do a good job of really describing. As I said above, there is now CRAN (and the Bioconductor package repository), where there are a vast number of packages. UseRs can easily get a package to try out just about any idea. Those packages are not always well written or well supported, but they provide a simple, fast way to try ideas out. And mostly the packages are of high quality, and the developers are often very happy to discuss ideas with users. The community aspect is important. And R has become increasingly performant.

Your work today involves the fascinating work of combining bioinformatics and computational drug discovery at 23andMe. What led to your transition to drug discovery or was it a natural progression?

I had worked in two cancer centers including The Dana Farber in Boston and The Fred Hutchinson in Seattle. When I was on the faculty at Harvard, and then the Fred Hutchinson in Seattle, I was developing a computational biology department. The science is fantastic at both institutions and I learned a lot about cancer and how we could begin to use computational methods to begin exploring and understanding some of the computational molecular biology that is important.

But I also became convinced that making new drugs was something that would happen in a drug company. I wanted to see how computational methods could help lead to better and faster target discovery. When I was approached by Genentech it seemed like a great opportunity – and it was. Genentech is a fantastic company; I spent almost six years there and learned a huge amount.

As things progressed, I became convinced that using human genetics to do drug discovery was likely to be more effective than any other strategy that I was aware of. And when the opportunity came to join 23andMe I took it. And 23andMe is also a great company. We are in the early stages of drug discovery still, but I am very excited about the progress we are making and the team I am working with.

How is data science being used to improve health and accelerate the discovery of therapies?

If we are using a very broad definition of data science (by training I am a statistician, and still think that careful hypothesis-driven research is essential to much discovery), it is essential. Better models, more data and careful analysis have yielded many breakthroughs.

Perhaps a different question is 'where are the problems'? And for me, the biggest problem is that I am not sure there is as much appreciation of the impact of bias as there should be. Big data is great – but it really only addresses the variance problem. Bias is different: it is harder to discover and its effects can be substantial. Put another way, the question is just how generalizable the results are.

In addition to the development of the R programming language, what have been your proudest career achievements that you'd like to share?

The Bioconductor Project, working with my graduate students and post-docs and pretty much anytime I gave someone good advice.

Can you tell us about what to expect from your keynote talk and what might be the key take-home messages for our EARL delegates?

I hope an appreciation of why it is important to get involved in developing software systems and tools. And I hope some things to think about when approaching large-scale data analysis projects.

Inspired by the work of Dr Robert Gentleman, what questions would you like to ask? Tickets to EARL Houston are still available. Find out more and get tickets here.

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.
