
Use R to write multiple tables to a single Excel file

Tue, 08/07/2018 - 17:28

(This article was first published on R – Fabio Marroni's Blog, and kindly contributed to R-bloggers)

The possibility of saving several tables in a single file is a nice feature of Excel. When sharing results with colleagues, it might be useful to compact everything in a single file.
As a bioinformatician, I am too lazy to do that manually, and I searched the web for tools that allow doing that.
I found out that there are at least two R packages that work well: xlsx and openxlsx.
First of all, let’s install the packages.

install.packages(c("xlsx","openxlsx"))

Now, let’s try to write some random stuff using openxlsx.

start_time <- Sys.time()

library(openxlsx)

of <- "example_1.xlsx"
OUT <- createWorkbook()
for(aaa in 1:20)
{
    mdf <- data.frame(matrix(runif(n = 1000), ncol = 10, nrow = 100))
    sname <- paste("Worksheet_", aaa, sep = "")
    addWorksheet(OUT, sname)
    writeData(OUT, sheet = sname, x = mdf)
}
saveWorkbook(OUT, of)

end_time <- Sys.time()
end_time - start_time

Running time on my machine is around 3 seconds.
Now let's try to save some very similar random stuff using xlsx.

start_time <- Sys.time()

library(xlsx)

of <- "example_2.xlsx"
for(aaa in 1:20)
{
    mdf <- data.frame(matrix(runif(n = 1000), ncol = 10, nrow = 100))
    sname <- paste("Worksheet_", aaa, sep = "")
    ifelse(aaa == 1, app <- "FALSE", app <- "TRUE")
    write.xlsx(mdf, file = of, sheetName = sname, row.names = FALSE, append = as.logical(app))
}

end_time <- Sys.time()
end_time - start_time

Running time is approximately 25s.

It looks like openxlsx is much faster than xlsx, and so I currently prefer it.
If you try to use openxlsx to save a workbook under the name of an existing file, you will get an error message.

saveWorkbook(OUT, outfile)

Error in saveWorkbook(OUT, outfile) : File already exists!

This might be useful if you want to be sure you never overwrite files by mistake, but it can also be annoying. To avoid this error message and allow overwriting of files, you just have to use the following command:

saveWorkbook(OUT,outfile,overwrite=TRUE)

Note: I had an experience (which I am no longer able to reproduce, and maybe I am not remembering well!) of openxlsx being unable to save files that xlsx could save without any problem. With this in mind, I kept note of both approaches, hoping that at least one would work in any situation!
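With that in mind, here is a minimal sketch of a helper that writes a named list of data frames with openxlsx and falls back to xlsx if that fails. The function name and structure are my own and not part of the original post.

# Hypothetical helper: try openxlsx first, fall back to xlsx on failure.
# `tables` is assumed to be a named list of data frames.
write_tables_to_xlsx <- function(tables, file, overwrite = TRUE) {
    result <- try({
        wb <- openxlsx::createWorkbook()
        for (sname in names(tables)) {
            openxlsx::addWorksheet(wb, sname)
            openxlsx::writeData(wb, sheet = sname, x = tables[[sname]])
        }
        openxlsx::saveWorkbook(wb, file, overwrite = overwrite)
    }, silent = TRUE)
    if (inherits(result, "try-error")) {
        # Fall back to xlsx, appending every sheet after the first
        for (i in seq_along(tables)) {
            xlsx::write.xlsx(tables[[i]], file = file, sheetName = names(tables)[i],
                             row.names = FALSE, append = (i > 1))
        }
    }
    invisible(file)
}

For example, write_tables_to_xlsx(list(Worksheet_1 = mdf, Worksheet_2 = mdf), "example_3.xlsx") would write both tables to a single file.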


ShinyProxy 2.0.1 is out!

Tue, 08/07/2018 - 14:45

(This article was first published on Open Analytics, and kindly contributed to R-bloggers)

ShinyProxy is a novel, open source platform to deploy Shiny apps for the enterprise
or larger organizations.

Embedding Shiny Apps

Although Shiny apps are very popular for interactive data analysis, many organizations told us they need to integrate these apps more closely within larger applications and portals. In previous releases we broke down the walls to make this happen: hiding the navbar, single sign-on, theming the landing page and advanced networking support were only a few steps in that direction. With ShinyProxy 2.0.1 we finished the job by implementing a REST API that lets you manage (launch, shut down) Shiny apps and consume their content programmatically in unprecedented ways. Data scientists can now keep ownership of their Shiny app knowing that it will seamlessly integrate with any other technology used by their organization's websites, applications or portals.

OpenId Connect Authentication

A second breakthrough that made it into the 2.0.1 release is support for an additional authentication / authorization technology. ShinyProxy already supports many technologies out of the box, including LDAP, social authentication (GitHub, Facebook, etc.) and single sign-on protocols, but with OpenId Connect we add another modern protocol that makes it possible to integrate with, for example, Auth0.
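As a rough illustration (not from the original post), an OpenId Connect setup might look like the sketch below. The property names and values are assumptions based on the ShinyProxy documentation of the time, so check https://shinyproxy.io for the authoritative configuration.

# Illustrative only -- property names assumed, values are placeholders
proxy:
  authentication: openid
  openid:
    auth-url: https://example.auth0.com/authorize
    token-url: https://example.auth0.com/oauth/token
    jwks-url: https://example.auth0.com/.well-known/jwks.json
    client-id: <your-client-id>
    client-secret: <your-client-secret>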

Miscellaneous improvements

In the authentication/authorization area, and upon user request, we also added support for setting SSL/HTTPS modes in the Keycloak back-end for single sign-on. In terms of user experience, we now use the display-name of apps as the browser tab name, so different Shiny apps in the browser can be more easily distinguished. Logging has also been improved, and the documentation was brushed up for the privileged setting in proxy.docker, proxy.kubernetes and for individual app containers.

Speaking of documentation, note that ShinyProxy 2.x.y uses a new YAML configuration that differs slightly from the 1.x.y versions.

Version 1.x.y:

shiny:
  proxy:
    ...
    apps:
    - name: 01_hello
      display-name: Hello Application
      description: Application which demonstrates the basics of a Shiny app
      docker-cmd: ["R", "-e", "shinyproxy::run_01_hello()"]
      docker-image: openanalytics/shinyproxy-demo
      groups: [scientists, mathematicians]

Version 2.x.y:

proxy:
  ...
  specs:
  - id: 01_hello
    display-name: Hello Application
    description: Application which demonstrates the basics of a Shiny app
    container-cmd: ["R", "-e", "shinyproxy::run_01_hello()"]
    container-image: openanalytics/shinyproxy-demo
    access-groups: [scientists, mathematicians]

There are many things cooking in the ShinyProxy kitchen, and this change prepares for what is coming, but that will be the subject of a future blog post. In any case, updated documentation can be found on https://shinyproxy.io and, as always, community support for this new release is available at https://support.openanalytics.eu.

Don’t hesitate to send in questions or suggestions and have fun with ShinyProxy!


Kaggle Competition In 30 Minutes: Predict Home Credit Default Risk With R

Tue, 08/07/2018 - 10:15

(This article was first published on business-science.io - Articles, and kindly contributed to R-bloggers)

We were very excited when Home Credit teamed up with Kaggle to host the Home Credit Default Risk Challenge. Default risk is a topic that impacts all financial institutions, and one that machine learning can help solve. We decided to flip the goal of this challenge: Kaggle competitions are performance driven, giving a data scientist months to fine-tune a model for maximum performance. That is not reality. We turned the goal upside down, focusing on a combination of speed, efficiency, and performance.

In the real world businesses have finite schedules for data science projects. Because of this, we redesigned the competition goal to get very good results quickly. We took a look at the top performance (measured by AUC or area under the curve), which is currently around 0.80. We decided to see how quickly we could develop a competitive model. Follow along and learn how we got a model of 0.70 on a Kaggle submission in about 30 minutes with H2O automated machine learning (AutoML)!

Learning Trajectory

In this article, we flip the Kaggle Competition goal upside down, focusing on a combination of efficiency and performance. Matt shows you how to build a very good model that achieves a 0.70 AUC in about 30 minutes of analysis effort.

We kick the project off with an overview of credit default risk. You’ll begin by understanding what credit default risk is and how machine learning can help identify personal loan applications that are likely to lead to default. We’ll also go over why a Kaggle challenge redesign is more representative of what you’ll need to do in the real world.

Next, we’ll go through the process Matt used to build a competitive model in about 30 minutes. This includes about 10 minutes of strategizing and 20 minutes of actual coding. You’ll see how Matt used the following process to efficiently build a high performance model.

H2O AutoML Preprocessing Workflow

We end on next steps, which provides details on our end-to-end data science project course that will transform your data science abilities in 10 weeks.

Transform Your Abilities In 10 Weeks

If you’re interested in learning how to apply critical thinking and data science while solving a real-world business problem following an end-to-end data science project, check out Data Science For Business (DS4B 201-R). Over the course of 10 weeks you will solve an end-to-end Employee Churn data science project following our systematic Business Science Problem Framework.

Get Started Today!

Overview

We'll briefly touch on what credit default risk is, how machine learning can help, and why we set up a 30-minute Kaggle challenge.

What Is Credit Default Risk?

In the world of finance, Credit or Default Risk is the probability that companies or individuals will be unable to make the required payments on their debt obligations. Lenders are exposed to default risk on every extension of credit. Think of financial institutions like Citi, which lends to millions of people worldwide, or JP Morgan Chase & Co, which lends to hundreds of thousands of companies worldwide.

JP Morgan: A Major Lender Of Corporate Loans

Every time Citi or JP Morgan extends a loan to a person or a corporation it assesses the level of risk involved in the transaction. This is because every once in a while people or companies won’t be able to make payments. In the dynamic world we live in unfortunate events happen and circumstances change.
Default risk comprises two components:

  1. Systemic Risk – The risk related to an external system of interconnected relationships (think of this as the risk associated with the economy, interest rate changes, inflation, recessions, and wars). This risk is often considered unavoidable; the only options are to hedge or to stay out of the market.

  2. Unsystematic Risk – The risk inherent to a person or company (think of this as a company whose sales drop because a competitor moves in). This risk is avoidable through diversification, i.e. making bets on many, preferably uncorrelated, assets.

In this article, we'll be focusing on a specific type of risk called Credit or Default Risk, which has both systemic and unsystematic drivers. The main point is that the drivers of default risk can be measured and analyzed for patterns related to default. As a result, the probability of default for a person or an institution is not random. This is where machine learning can help.

The probability of default for a person or an institution is not random. This is where machine learning can help.

Machine Learning For Personal Default Risk

Like any machine learning problem, default risk has drivers, or features related to the person or company that have a relationship with the likelihood of a default. As discussed previously, some of the drivers are systemic and others are unsystemic. Identifying and measuring these drivers are the keys to predicting default.

For the Kaggle Competition, Home Credit (the company) has supplied us with data from several data sources. The following Data Architecture Diagram shows the interrelationships between the data files provided.

Data Architecture Diagram For Home Credit Default Risk Competition

For the purposes of this article, we'll focus on the application_train.csv and application_test.csv files, which contain a significant amount of useful information for predicting credit default. These files are the main source of information on people who have applied for personal loans, including features related to their loan applications.

Several other data sources are provided, such as bureau.csv and previous_application.csv. These data have "one-to-many" relationships with the main file and can be combined with it through a process called feature engineering (a rough sketch follows below). In a tentative future article, we will go into strategies for feature engineering with data that have "one-to-many" relationships (e.g. creating features from bureau.csv that can be combined with application_train.csv).
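As a rough sketch of what that aggregation could look like (this is not from the original article, it assumes application_train.csv and bureau.csv have been read in as shown later in the post, and the AMT_CREDIT_SUM column name in bureau.csv is an assumption), one-to-many data can be collapsed to one row per applicant and joined onto the application data:

library(dplyr)
library(readr)

bureau_tbl <- read_csv("00_data/bureau.csv")

# Collapse the one-to-many bureau records to one row per applicant
bureau_features_tbl <- bureau_tbl %>%
    group_by(SK_ID_CURR) %>%
    summarize(
        bureau_loan_count  = n(),
        bureau_credit_mean = mean(AMT_CREDIT_SUM, na.rm = TRUE)
    )

# Join the engineered features back onto the application data
application_train_tbl <- application_train_tbl %>%
    left_join(bureau_features_tbl, by = "SK_ID_CURR")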

Why A Kaggle Challenge Redesign?

While we enjoy the idea of the Kaggle Challenge, it’s not representative of a real world scenario where projects have fixed time schedules. The Kaggle Competition goal is to compete on maximum performance. The goal in the real world is quite often to get a good model within a defined project schedule.

Time to completion is important because projects have fixed resources. The longer you are on a project, the more you cost the company money and the more you delay other projects that you could be helping with.

The task of developing a high performance model is just one small piece of a data scientist's job. It's unrealistic to think that an organization will be OK with you (the data scientist) dedicating hundreds of man-hours to squeezing every last drop of performance out of the model. It's not reality. Recall the saying, "Great is the enemy of Good". This is often the case in real-world data science.

Real-World End-To-End Data Science Project Structure

In the real world, you won’t have months that you can dedicate to a model. Rather, you will have maybe a day or two. The majority of your project time will be spent on:

  1. Understanding the business (2 weeks)

  2. Communicating with decision makers to turn their input into a data collection strategy (2 weeks)

  3. Collecting and cleaning the data (1 week to several months)

  4. Building a model (1 to 2 days)

  5. Converting machine learning results to expected savings or profitability from a business change/improvement and communicating results (1 to 2 weeks)

  6. Releasing these results into production via web applications (Several weeks)

This is what we call an “end-to-end” data science project.

Interested in Solving an End-To-End Data Science Project?

If you’re interested in learning how to apply critical thinking and data science while solving a real-world business problem following an end-to-end data science project, check out Data Science For Business (DS4B 201-R). Over the course of 10 weeks you will solve an end-to-end Employee Churn data science project following our systematic Business Science Problem Framework.

Get Started Today!

Completing A Kaggle Competition In 30 Minutes

By Matt Dancho, Founder of Business Science

Why 30-minutes for a Kaggle Challenge?

Because I wanted to show you that if you leverage high performance tools, you can drastically cut your modeling time while still getting a very good model. As a result, you can spend more time getting better features (up front in the project) and communicating results (after the model is developed). If you get more time after the first round of models is built, you can always go back and improve model performance by adding more data (more features).

I'll provide details on how I personally approached the Kaggle Competition and what I did each step of the way to get a good solution fast. Here's the general process, with code, that I used to complete a high performing model in about 30 minutes.

Important Note: I did not set out with 30 minutes in mind.

My actual time was roughly 30 minutes.

Basis For Kaggle Competition Redesign

One of my favorite episodes of the Tim Ferris Podcast is one in which he interviews Walter O’Brien, “the real life Scorpion”. Walter is the basis for the hit TV show, Scorpion, that brings together a team of hackers and intellectual misfits to solve crime with technology. While the podcast episode sparked a lot of controversy related to Walter’s history and claims that could not be verified, I found a particular piece of the episode very interesting: The setup of the 1993 Computing Olympics, which Walter reportedly participated in.

In the final challenge of the Computing Olympics, a computer scientist had a fixed amount of time (3 hours) to solve a very complicated business problem by creating an algorithm. The best contestants followed this strategy:

  1. The top contestants began by critically thinking: Thinking about the problem and what tools they had in their arsenal to solve it. This could be upwards of 50% to 2/3rd of the allotted competition time.

  2. They spent the remaining 1/3rd to 50% of the time speed coding: Hammering out their solution, converting it to code with the programming languages that they knew would be the best tools for the scenario.

There’s a lot of value in this approach because of the critical thinking process and the fact that in the real world you will likely only have a fixed amount of days you can work on a model before the business deadline is hit. I wanted to see if this approach could be applied to the Home Credit Default Risk challenge.

Setup And Data (5 min)

This part took about 5 minutes excluding the time to download the data and place it into a folder.

You can collect the data from the Kaggle Home Credit Default Risk Competition site. Just a heads up: the datasets total 688MB in full, so we need to be mindful of disk space and, more importantly, RAM.

Once the data is downloaded, I added it to a folder in my home directory called “00_data”.

Libraries

I loaded the following libraries to tackle this Home Credit Default Risk problem. The tidyverse pulls in dplyr, ggplot2, and several other must-haves (e.g. forcats, purrr, etc). The recipes package is loaded for preprocessing. The h2o package is going to do the heavy lifting via automated machine learning (AutoML).

# General
library(tidyverse)
library(skimr)

# Preprocessing
library(recipes)

# Machine Learning
library(h2o)

Data

Next, I read the data with read_csv(), which takes care of most of the column formatting for us.

application_train_tbl <- read_csv("00_data/application_train.csv")
application_test_tbl  <- read_csv("00_data/application_test.csv")

Note that the training file size is 158MB. This can be a problem when making multiple copies of this data in R, which can quickly cause us to run out of RAM. A strategy I used was to remove previous iterations of the data as I went through processing. You'll see me use rm() a lot.
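As a tiny illustration of that pattern (my own addition, with a hypothetical object name), freeing an intermediate and then calling gc() lets R reclaim the memory:

# Drop intermediate copies as soon as they are no longer needed
intermediate_tbl <- application_train_tbl   # hypothetical intermediate copy
rm(intermediate_tbl)
gc()   # garbage-collect so the freed memory can be reclaimed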

Next, I took a quick slice() of the data to see what features are present. I use kable() to format the table for printing here, but this was not done during the 30 minutes of coding.

application_train_tbl %>%
    slice(1:10) %>%
    knitr::kable()

(Wide table output omitted: the first 10 rows of application_train_tbl across all of its columns, from SK_ID_CURR and TARGET through AMT_REQ_CREDIT_BUREAU_YEAR.)

application_train_tbl: First 10 Rows

A few points about the data:

  • The training data contains 307K observations, which represent people who applied for and received loans

  • The “TARGET” column identifies whether or not the person defaulted

  • The remaining 121 features describe various attributes about the person, loan, and application

  • There are several additional data sets (bureau.csv, previous_application.csv). I ignored these since most of the important data was likely to be in the main file. These auxiliary files contain data in one-to-many relationships, which would require aggregation (a feature engineering method) and considerably more work for an unknown benefit.

I was more interested in agile iteration: Building a good model quickly, then going back to try to improve later.

Train And Test Sets

Next, I formatted the data as train and test sets. I noted that application_test_tbl did not have a TARGET column (I found this out when I tried to process it with recipes during the first go-around). The "test" set is the set the Kaggle Competition is measured against and the one we will submit later. Note that I remove the unneeded data sets after I'm finished with them in order to save memory.

# Training data: Separate into x and y tibbles
x_train_tbl <- application_train_tbl %>% select(-TARGET)
y_train_tbl <- application_train_tbl %>% select(TARGET)

# Testing data: What we submit in the competition
x_test_tbl <- application_test_tbl

# Remove the original data to save memory
rm(application_train_tbl)
rm(application_test_tbl)

Data Inspection

I used one of my favorite packages for data investigation, skimr. This package helps me understand how much missing data I'm dealing with and gives an overall sense of the data by data type.

Normally I just run skim() on the data. However, for the purposes of outputting in a readable format on the web, I’ll use skim_to_list()

skim_to_list(x_train_tbl)

## $integer
## # A tibble: 40 x 12
##    variable  missing complete      n mean   sd     p0    p25   median
##  *
##  1 CNT_CHIL~       0   307511 307511 " ~    " ~        0  " ~       0
##  2 DAYS_BIR~       0   307511 307511 "-160~ " 43~  -252~  "-19~ -15750
##  3 DAYS_EMP~       0   307511 307511 " 638~ "1412~ -179~  " -2~  -1213
##  4 DAYS_ID_~       0   307511 307511 " -29~ " 15~  -7197  " -4~  -3254
##  5 FLAG_CON~       0   307511 307511 " ~    " ~        0  " ~       1
##  6 FLAG_DOC~       0   307511 307511 " ~    " ~        0  " ~       0
##  7 FLAG_DOC~       0   307511 307511 " ~    " ~        0  " ~       0
##  8 FLAG_DOC~       0   307511 307511 " ~    " ~        0  " ~       0
##  9 FLAG_DOC~       0   307511 307511 " ~    " ~        0  " ~       0
## 10 FLAG_DOC~       0   307511 307511 " ~    " ~        0  " ~       0
## # ... with 30 more rows, and 3 more variables: p75 , p100 ,
## #   hist
##
## $character
## # A tibble: 16 x 8
##    variable         missing complete n     min   max   empty n_unique
##  *
##  1 CODE_GENDER            0   307511 3075~ 1     3         0        3
##  2 EMERGENCYSTATE_~  145755   161756 3075~ 2     3         0        2
##  3 FLAG_OWN_CAR           0   307511 3075~ 1     1         0        2
##  4 FLAG_OWN_REALTY        0   307511 3075~ 1     1         0        2
##  5 FONDKAPREMONT_M~  210295    97216 3075~ 13    21        0        4
##  6 HOUSETYPE_MODE    154297   153214 3075~ 14    16        0        3
##  7 NAME_CONTRACT_T~       0   307511 3075~ 10    15        0        2
##  8 NAME_EDUCATION_~       0   307511 3075~ 15    29        0        5
##  9 NAME_FAMILY_STA~       0   307511 3075~ 5     20        0        6
## 10 NAME_HOUSING_TY~       0   307511 3075~ 12    19        0        6
## 11 NAME_INCOME_TYPE       0   307511 3075~ 7     20        0        8
## 12 NAME_TYPE_SUITE     1292   306219 3075~ 6     15        0        7
## 13 OCCUPATION_TYPE    96391   211120 3075~ 7     21        0       18
## 14 ORGANIZATION_TY~       0   307511 3075~ 3     22        0       58
## 15 WALLSMATERIAL_M~  156341   151170 3075~ 5     12        0        7
## 16 WEEKDAY_APPR_PR~       0   307511 3075~ 6     9         0        7
##
## $numeric
## # A tibble: 65 x 12
##    variable   missing complete      n mean   sd     p0    p25   median
##  *
##  1 AMT_ANNUI~      12   307499 307511 " 271~ " 14~  " 1~  " 16~ " 249~
##  2 AMT_CREDIT       0   307511 307511 " 6e+~ " 4e~  " 45~ "270~ "5135~
##  3 AMT_GOODS~     278   307233 307511 "5383~ "369~  " 40~ "238~ "4500~
##  4 AMT_INCOM~       0   307511 307511 "1687~ "237~  " 25~ "112~ "1471~
##  5 AMT_REQ_C~   41519   265992 307511 " ~    " ~    " ~   " ~   " ~
##  6 AMT_REQ_C~   41519   265992 307511 " ~    " ~    " ~   " ~   " ~
##  7 AMT_REQ_C~   41519   265992 307511 " ~    " ~    " ~   " ~   " ~
##  8 AMT_REQ_C~   41519   265992 307511 " ~    " ~    " ~   " ~   " ~
##  9 AMT_REQ_C~   41519   265992 307511 " ~    " ~    " ~   " ~   " ~
## 10 AMT_REQ_C~   41519   265992 307511 " ~    " ~    " ~   " ~   " ~
## # ... with 55 more rows, and 3 more variables: p75 , p100 ,
## #   hist

I see that there are 3 data type formats: integer, numeric, and character. For H2O we need numeric and factor. I see some of the integer values are “FLAG”, which typically indicates a factor.

Analysis Strategy (5 min)

The first 5 minutes were dedicated to mocking up a plan. I began by framing a solution on a piece of scrap paper.

Analysis Strategy on Scrap Paper

The general workflow (1) is at the top where I wanted to accomplish the following goals:

  1. Read the data

  2. Prepare the data

  3. Model – This step I wanted to use H2O because I knew I could leverage automated machine learning (AutoML) to vastly speed up the modeling process.

  4. Make predictions

The second half (2) of the scrap paper is how I planned to execute each step. The workflow involved a special tool, recipes, to accomplish the preprocessing. I planned to minimally transform the data, addressing missing values and categorical data.

With a plan in hand, I got started.

Preprocessing Strategy (5 min)

I leveraged my knowledge of how H2O AutoML works under the hood to develop a simplified strategy to tackle preprocessing. H2O AutoML handles common tasks internally such as creating dummy variables (or one hot encoding), performing statistical transformations (e.g. YeoJohnson), etc. As a result, this greatly simplifies the preprocessing tasks that I needed to worry about.

The main things I focused on were:

  1. Character data that needs to be factored: h2o.automl() requires any character data to be converted to a factor.

  2. Numeric data that has a low number of unique values: This is common in business data, where there are numeric features with a low number of unique values. These are typically categorical data.

  3. Missing values: Missing values need to be either imputed or have their columns and/or rows dropped. I made a point to impute all data before modeling with h2o.automl(), although I believe that AutoML has methods for handling missing data.

Here’s the simplified strategy I used for preprocessing for AutoML. I diagrammed it after the fact to show the flow of how the entire process works beyond my “chicken scratch” scrap paper diagram shown previously. Keep in mind that H2O handles most data transformations “under the hood”, which is really useful for getting good models quickly.

H2O AutoML Preprocessing Workflow

Interested in Learning Preprocessing Strategies For Business?

If you’re interested in learning how to apply the preprocessing strategies using R Programming beyond the simplified strategy shown, we teach preprocessing with recipes in Chapter 3 of Data Science For Business (DS4B 201-R) as part of our 10-Week end-to-end Employee Churn data science project.

Get Started Today!

Let’s implement the H2O AutoML Preprocessing Workflow shown in the diagram.

Character Data

First, I collected the names of all character data. These columns will be converted into factors.

string_2_factor_names <- x_train_tbl %>%
    select_if(is.character) %>%
    names()

string_2_factor_names

##  [1] "NAME_CONTRACT_TYPE"         "CODE_GENDER"
##  [3] "FLAG_OWN_CAR"               "FLAG_OWN_REALTY"
##  [5] "NAME_TYPE_SUITE"            "NAME_INCOME_TYPE"
##  [7] "NAME_EDUCATION_TYPE"        "NAME_FAMILY_STATUS"
##  [9] "NAME_HOUSING_TYPE"          "OCCUPATION_TYPE"
## [11] "WEEKDAY_APPR_PROCESS_START" "ORGANIZATION_TYPE"
## [13] "FONDKAPREMONT_MODE"         "HOUSETYPE_MODE"
## [15] "WALLSMATERIAL_MODE"         "EMERGENCYSTATE_MODE"

Numeric Factor Data

Next, I looked at which numeric data should be factored (categorical). These typically have a low number of unique levels. I used a few dplyr and purrr operations to get a count of the unique levels. The real trick is using map_df() with the anonymous function ~ unique(.) %>% length(), which counts the unique values in each column, then gather()-ing the results in the long format.

unique_numeric_values_tbl <- x_train_tbl %>%
    select_if(is.numeric) %>%
    map_df(~ unique(.) %>% length()) %>%
    gather() %>%
    arrange(value) %>%
    mutate(key = as_factor(key))

unique_numeric_values_tbl

## # A tibble: 105 x 2
##    key                         value
##
##  1 FLAG_MOBIL                      2
##  2 FLAG_EMP_PHONE                  2
##  3 FLAG_WORK_PHONE                 2
##  4 FLAG_CONT_MOBILE                2
##  5 FLAG_PHONE                      2
##  6 FLAG_EMAIL                      2
##  7 REG_REGION_NOT_LIVE_REGION      2
##  8 REG_REGION_NOT_WORK_REGION      2
##  9 LIVE_REGION_NOT_WORK_REGION     2
## 10 REG_CITY_NOT_LIVE_CITY          2
## # ... with 95 more rows

I used a factor limit of 7, meaning any numeric column with fewer than 7 unique values will be converted to a factor. I then collected the column names by pull()-ing the key column and storing them as character data.

factor_limit <- 7

num_2_factor_names <- unique_numeric_values_tbl %>%
    filter(value < factor_limit) %>%
    arrange(desc(value)) %>%
    pull(key) %>%
    as.character()

num_2_factor_names

##  [1] "AMT_REQ_CREDIT_BUREAU_HOUR"  "REGION_RATING_CLIENT"
##  [3] "REGION_RATING_CLIENT_W_CITY" "FLAG_MOBIL"
##  [5] "FLAG_EMP_PHONE"              "FLAG_WORK_PHONE"
##  [7] "FLAG_CONT_MOBILE"            "FLAG_PHONE"
##  [9] "FLAG_EMAIL"                  "REG_REGION_NOT_LIVE_REGION"
## [11] "REG_REGION_NOT_WORK_REGION"  "LIVE_REGION_NOT_WORK_REGION"
## [13] "REG_CITY_NOT_LIVE_CITY"      "REG_CITY_NOT_WORK_CITY"
## [15] "LIVE_CITY_NOT_WORK_CITY"     "FLAG_DOCUMENT_2"
## [17] "FLAG_DOCUMENT_3"             "FLAG_DOCUMENT_4"
## [19] "FLAG_DOCUMENT_5"             "FLAG_DOCUMENT_6"
## [21] "FLAG_DOCUMENT_7"             "FLAG_DOCUMENT_8"
## [23] "FLAG_DOCUMENT_9"             "FLAG_DOCUMENT_10"
## [25] "FLAG_DOCUMENT_11"            "FLAG_DOCUMENT_12"
## [27] "FLAG_DOCUMENT_13"            "FLAG_DOCUMENT_14"
## [29] "FLAG_DOCUMENT_15"            "FLAG_DOCUMENT_16"
## [31] "FLAG_DOCUMENT_17"            "FLAG_DOCUMENT_18"
## [33] "FLAG_DOCUMENT_19"            "FLAG_DOCUMENT_20"
## [35] "FLAG_DOCUMENT_21"

Missing Data

I wanted to see how much missing data I was dealing with. The trick here was using the summarize_all() function with an anonymous function (.funs = ~ sum(is.na(.)) / length(.)) to calculate the percentage of missing data.

missing_tbl <- x_train_tbl %>%
    summarize_all(.funs = ~ sum(is.na(.)) / length(.)) %>%
    gather() %>%
    arrange(desc(value)) %>%
    filter(value > 0)

missing_tbl

## # A tibble: 67 x 2
##    key                      value
##
##  1 COMMONAREA_AVG           0.699
##  2 COMMONAREA_MODE          0.699
##  3 COMMONAREA_MEDI          0.699
##  4 NONLIVINGAPARTMENTS_AVG  0.694
##  5 NONLIVINGAPARTMENTS_MODE 0.694
##  6 NONLIVINGAPARTMENTS_MEDI 0.694
##  7 FONDKAPREMONT_MODE       0.684
##  8 LIVINGAPARTMENTS_AVG     0.684
##  9 LIVINGAPARTMENTS_MODE    0.684
## 10 LIVINGAPARTMENTS_MEDI    0.684
## # ... with 57 more rows

A number of the features have over 65% missing values. My plan was to impute all of the data using the mean for numeric features and the mode for categorical features. More on this next.

Preprocessing Implementation And H2O Modeling (15 min)

This step took about 15 minutes in total. We first use recipes to preprocess the data for H2O AutoML. We then model with h2o.

Recipes Preprocessing

I used recipes to quickly preprocess the data for AutoML.

  1. Converting character to factor: I used step_string2factor() for the character names we stored earlier.

  2. Converting numeric values with few unique levels to factor: I used step_num2factor() for the numeric column names we stored earlier.

  3. Imputing missing data: I used step_meanimpute() for the numeric features and step_modeimpute() for the categorical features.

After applying the step_ functions, I prepared the recipe using prep(). The trick here was to use stringsAsFactors = FALSE, which eliminates a pesky error.

rec_obj <- recipe(~ ., data = x_train_tbl) %>%
    step_string2factor(string_2_factor_names) %>%
    step_num2factor(num_2_factor_names) %>%
    step_meanimpute(all_numeric()) %>%
    step_modeimpute(all_nominal()) %>%
    prep(stringsAsFactors = FALSE)

rec_obj

## Data Recipe
##
## Inputs:
##
##       role #variables
##  predictor        121
##
## Training data contained 307511 data points and 298909 incomplete rows.
##
## Operations:
##
## Factor variables from NAME_CONTRACT_TYPE, ... [trained]
## Factor variables from 35 items [trained]
## Mean Imputation for SK_ID_CURR, ... [trained]
## Mode Imputation for NAME_CONTRACT_TYPE, ... [trained]

Next, I used bake() on the training and testing data sets to implement the minimal transformations.

x_train_processed_tbl <- bake(rec_obj, x_train_tbl)
x_test_processed_tbl  <- bake(rec_obj, x_test_tbl)

We can see the results of the transformation. For brevity, I show the first 30 columns and use the glimpse() function to quickly scan the data.

Before Transformation

Before transformation (x_train_tbl), we can see that NA values are present and many of the flags are coded as integer.

# Before transformation
x_train_tbl %>%
    select(1:30) %>%
    glimpse()

## Observations: 307,511
## Variables: 30
## $ SK_ID_CURR                 100002, 100003, 100004, 100006...
## $ NAME_CONTRACT_TYPE         "Cash loans", "Cash loans", "R...
## $ CODE_GENDER                "M", "F", "M", "F", "M", "M", ...
## $ FLAG_OWN_CAR               "N", "N", "Y", "N", "N", "N", ...
## $ FLAG_OWN_REALTY            "Y", "N", "Y", "Y", "Y", "Y", ...
## $ CNT_CHILDREN               0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ AMT_INCOME_TOTAL           202500.00, 270000.00, 67500.00...
## $ AMT_CREDIT                 406597.5, 1293502.5, 135000.0,...
## $ AMT_ANNUITY                24700.5, 35698.5, 6750.0, 2968...
## $ AMT_GOODS_PRICE            351000, 1129500, 135000, 29700...
## $ NAME_TYPE_SUITE            "Unaccompanied", "Family", "Un...
## $ NAME_INCOME_TYPE           "Working", "State servant", "W...
## $ NAME_EDUCATION_TYPE        "Secondary / secondary special...
## $ NAME_FAMILY_STATUS         "Single / not married", "Marri...
## $ NAME_HOUSING_TYPE          "House / apartment", "House / ...
## $ REGION_POPULATION_RELATIVE 0.018801, 0.003541, 0.010032, ...
## $ DAYS_BIRTH                 -9461, -16765, -19046, -19005,...
## $ DAYS_EMPLOYED              -637, -1188, -225, -3039, -303...
## $ DAYS_REGISTRATION          -3648, -1186, -4260, -9833, -4...
## $ DAYS_ID_PUBLISH            -2120, -291, -2531, -2437, -34...
## $ OWN_CAR_AGE                NA, NA, 26, NA, NA, NA, 17, 8,...
## $ FLAG_MOBIL                 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ FLAG_EMP_PHONE             1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
## $ FLAG_WORK_PHONE            0, 0, 1, 0, 0, 1, 0, 1, 0, 0, ...
## $ FLAG_CONT_MOBILE           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ FLAG_PHONE                 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, ...
## $ FLAG_EMAIL                 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ OCCUPATION_TYPE            "Laborers", "Core staff", "Lab...
## $ CNT_FAM_MEMBERS            1, 2, 1, 2, 1, 2, 3, 2, 2, 1, ...
## $ REGION_RATING_CLIENT       2, 1, 2, 2, 2, 2, 2, 3, 2, 2, ...

After Transformation

After the transformation (x_train_processed_tbl), the NA values are imputed, and we are left with factor and numeric (integer and double) data.

# After transformation
x_train_processed_tbl %>%
    select(1:30) %>%
    glimpse()

## Observations: 307,511
## Variables: 30
## $ SK_ID_CURR                 100002, 100003, 100004, 100006...
## $ NAME_CONTRACT_TYPE         Cash loans, Cash loans, Revolv...
## $ CODE_GENDER                M, F, M, F, M, M, F, M, F, M, ...
## $ FLAG_OWN_CAR               N, N, Y, N, N, N, Y, Y, N, N, ...
## $ FLAG_OWN_REALTY            Y, N, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ CNT_CHILDREN               0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...
## $ AMT_INCOME_TOTAL           202500.00, 270000.00, 67500.00...
## $ AMT_CREDIT                 406597.5, 1293502.5, 135000.0,...
## $ AMT_ANNUITY                24700.5, 35698.5, 6750.0, 2968...
## $ AMT_GOODS_PRICE            351000, 1129500, 135000, 29700...
## $ NAME_TYPE_SUITE            Unaccompanied, Family, Unaccom...
## $ NAME_INCOME_TYPE           Working, State servant, Workin...
## $ NAME_EDUCATION_TYPE        Secondary / secondary special,...
## $ NAME_FAMILY_STATUS         Single / not married, Married,...
## $ NAME_HOUSING_TYPE          House / apartment, House / apa...
## $ REGION_POPULATION_RELATIVE 0.018801, 0.003541, 0.010032, ...
## $ DAYS_BIRTH                 -9461, -16765, -19046, -19005,...
## $ DAYS_EMPLOYED              -637, -1188, -225, -3039, -303...
## $ DAYS_REGISTRATION          -3648, -1186, -4260, -9833, -4...
## $ DAYS_ID_PUBLISH            -2120, -291, -2531, -2437, -34...
## $ OWN_CAR_AGE                12.06109, 12.06109, 26.00000, ...
## $ FLAG_MOBIL                 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ FLAG_EMP_PHONE             1, 1, 1, 1, 1, 1, 1, 1, 0, 1, ...
## $ FLAG_WORK_PHONE            0, 0, 1, 0, 0, 1, 0, 1, 0, 0, ...
## $ FLAG_CONT_MOBILE           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ FLAG_PHONE                 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, ...
## $ FLAG_EMAIL                 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ OCCUPATION_TYPE            Laborers, Core staff, Laborers...
## $ CNT_FAM_MEMBERS            1, 2, 1, 2, 1, 2, 3, 2, 2, 1, ...
## $ REGION_RATING_CLIENT       2, 1, 2, 2, 2, 2, 2, 3, 2, 2, ...

I also converted the target to a factor, which is the format needed for H2O to perform binary classification. The trick is to convert the numeric value to character then to factor.

y_train_processed_tbl <- y_train_tbl %>%
    mutate(TARGET = TARGET %>% as.character() %>% as.factor())

Last, I removed the unprocessed datasets and recipe object to save memory.

rm(rec_obj)
rm(x_train_tbl)
rm(x_test_tbl)
rm(y_train_tbl)

Interested in Learning Preprocessing Strategies For Business?

If you’re interested in learning how to apply the preprocessing strategies using R Programming beyond the simplified strategy shown, we teach preprocessing with recipes in Chapter 3 of Data Science For Business (DS4B 201-R) as part of our 10-Week end-to-end Employee Churn data science project.

Get Started Today!

Modeling

I implemented Automated Machine Learning (AutoML) with H2O. First, I loaded h2o and initialized a local cluster with h2o.init(). We can also call h2o.no_progress() to eliminate progress bars.

h2o.init()

## Connection successful!
##
## R is connected to the H2O cluster:
##     H2O cluster uptime:         4 hours 42 minutes
##     H2O cluster version:        3.16.0.2
##     H2O cluster version age:    8 months and 7 days !!!
##     H2O cluster name:           H2O_started_from_R_mdanc_inu316
##     H2O cluster total nodes:    1
##     H2O cluster total memory:   1.85 GB
##     H2O cluster total cores:    8
##     H2O cluster allowed cores:  8
##     H2O cluster healthy:        TRUE
##     H2O Connection ip:          localhost
##     H2O Connection port:        54321
##     H2O Connection proxy:       NA
##     H2O Internal Security:      FALSE
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4
##     R Version:                  R version 3.4.4 (2018-03-15)

h2o.no_progress()

Next, I converted the dataframe to an H2O Frame using as.h2o(). H2O expects both the target and the training features to be in the same data frame, so I used bind_cols() first.

data_h2o <- as.h2o(bind_cols(y_train_processed_tbl, x_train_processed_tbl))

I split the training data into train, validation, and test sets with a 70/15/15 split using h2o.splitFrame(). Note that I only used the loan application training data (not the testing data) for this part; the testing data will be used to make the Kaggle Competition submission.

splits_h2o <- h2o.splitFrame(data_h2o, ratios = c(0.7, 0.15), seed = 1234)

train_h2o <- splits_h2o[[1]]
valid_h2o <- splits_h2o[[2]]
test_h2o  <- splits_h2o[[3]]

Next, I ran h2o.automl(). I set max_runtime_secs = 90 to speed up training at the expense of some performance (the default is 3600 seconds, which can take quite a while). The resulting leader model is produced in approximately 6 minutes.

y <- "TARGET" x <- setdiff(names(train_h2o), y) automl_models_h2o <- h2o.automl( x = x, y = y, training_frame = train_h2o, validation_frame = valid_h2o, leaderboard_frame = test_h2o, max_runtime_secs = 90 ) automl_leader <- automl_models_h2o@leader

Interested in Learning H2O AutoML For Business?

If you’re interested in learning how to perform H2O Automated Machine Learning using R Programming beyond the simplified modeling shown, we teach H2O AutoML modeling (including Grid Search and Cross Validation) in Chapter 4 of Data Science For Business (DS4B 201-R) as part of our 10-Week end-to-end Employee Churn data science project.

Get Started Today!

H2O Performance And Kaggle Submission

This part was excluded from the timed challenge since the model was already developed; it simply tests the validity of the results and submits the final solution.

H2O Performance

We can perform a brief performance analysis to assess the AutoML Leader model. First, let’s create an H2O performance object using h2o.performance(). We’ll set new data to test_h2o. Note that this is not the application test set, but rather a random sample of 15% of the application training set.

performance_h2o <- h2o.performance(automl_leader, newdata = test_h2o)

Next, let’s output the Confusion Matrix using the h2o.confusionMatrix() function. It’s important to understand that this model uses a threshold of 0.139, which maximizes the F1. This is NOT always the best case for the business. If you want to learn how to optimize the classification threshold, our Data Science For Business With R course covers Expected Value, Threshold Optimization, and Sensitivity Analysis in Chapters 7 and 8.

performance_h2o %>%
    h2o.confusionMatrix()

## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.112512548039536:
##            0    1    Error         Rate
## 0      36871 5516 0.130134  =5516/42387
## 1       2185 1528 0.588473  =2185/3713
## Totals 39056 7044 0.167050  =7701/46100

We can see that 1,528 people that defaulted were correctly detected, 5,516 were incorrectly predicted to default but actually did not, and 2,185 were incorrectly predicted not to default but actually did. This last group is called False Negatives, and this population can be EXTREMELY COSTLY to the lender, typically far more costly than False Positives. This is why threshold optimization is important. Check out DS4B 201-R if you want to learn how to do threshold optimization.
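As a quick, hand-rolled sketch of what inspecting an alternative cutoff could look like (my own addition, not part of the original analysis, and the 0.30 threshold is arbitrary), the class probabilities can be re-thresholded directly:

pred_test_tbl <- h2o.predict(automl_leader, newdata = test_h2o) %>%
    as.tibble() %>%
    bind_cols(test_h2o %>% as.tibble() %>% select(TARGET))

threshold <- 0.30  # arbitrary, more conservative cutoff than the F1-optimal one

pred_test_tbl %>%
    mutate(predicted = ifelse(p1 >= threshold, "1", "0")) %>%
    count(TARGET, predicted)  # counts for each actual/predicted combination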

We can also check out the AUC on the performance set. We’re getting about 0.73, which is pretty good.

performance_h2o %>%
    h2o.auc()

## [1] 0.7330787

Interested in Learning H2O Performance Analysis For Business?

If you’re interested in learning how to do performance analysis for H2O modeling using R Programming beyond the simplified performance analysis shown, we teach:

  • Chapter 5: H2O Performance Analysis (ROC, Precision vs Recall, Gain & Lift) on an entire leaderboard of H2O models (stacked ensembles, deep learning, random forest, and more)
  • Chapter 7: Expected Value Framework and targeting using cost/benefit analysis
  • Chapter 8: Optimizing the threshold to maximize profitability and performing sensitivity analysis to account for unknowns.

Check out Data Science For Business (DS4B 201-R) as part of our 10-Week end-to-end Employee Churn data science project.

Get Started Today!

Kaggle Submission

This is the moment of truth. I made my final predictions for the Kaggle Competition. Here’s the procedure and final results.

First, we can make some predictions using h2o.predict().

prediction_h2o <- h2o.predict(automl_leader, newdata = as.h2o(x_test_processed_tbl))

Next, let’s convert to a tibble and bind with the correct SK_ID_CURR. We’ll grab the SK_ID_CURR and p1 (positive class probability) columns, which is the format expected for the Kaggle Competition.

prediction_tbl <- prediction_h2o %>%
    as.tibble() %>%
    bind_cols(
        x_test_processed_tbl %>% select(SK_ID_CURR)
    ) %>%
    select(SK_ID_CURR, p1) %>%
    rename(TARGET = p1)

prediction_tbl

## # A tibble: 48,744 x 2
##    SK_ID_CURR TARGET
##
##  1     100001 0.0632
##  2     100005 0.0969
##  3     100013 0.0516
##  4     100028 0.0418
##  5     100038 0.0892
##  6     100042 0.0477
##  7     100057 0.0367
##  8     100065 0.0613
##  9     100066 0.0348
## 10     100067 0.131
## # ... with 48,734 more rows

Finally, we can output the submission to a CSV file using the write_csv() function.

prediction_tbl %>% write_csv(path = "submission_001.csv")

Submitting my results, I got an AUC of approximately 0.70. Not bad for 30 minutes of modeling.

Kaggle Submission

The important thing to understand is that this is a great baseline for a model. We can begin to improve it, which can take significant effort depending on how far you want to go.

We did not perform feature engineering or any additional model tuning beyond what AutoML performs by default. Doing so would get us closer to the top results.
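For example, a hypothetical next iteration (not part of the timed run) could simply give AutoML its default hour-long budget on the same frames:

automl_models_v2 <- h2o.automl(
    x = x,
    y = y,
    training_frame    = train_h2o,
    validation_frame  = valid_h2o,
    leaderboard_frame = test_h2o,
    max_runtime_secs  = 3600  # the default full hour instead of 90 seconds
)

automl_leader_v2 <- automl_models_v2@leader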

Next Steps (Transform Your Abilities)

It's critical to understand that modeling is just one part of the overall data science project. Make no mistake, it's an important part, but other parts are equally if not more important, which is why our Data Science For Business With R (DS4B 201-R) course is successfully teaching data science students how to apply data science in the real world.

We teach end-to-end data science. This means you learn how to:

  • Chapter 1: Sizing The Problem, Custom Functions With Tidy Eval: Learn how to understand whether a business problem exists by sizing the problem. In addition, create custom functions using Tidy Eval, a programming interface that is needed if you want to build custom functions using dplyr and ggplot2.

  • Chapter 2: Exploration with GGally, skimr – We show you how to explore data and identify relationships efficiently and effectively

  • Chapter 3: recipes, Premodeling Correlation Analysis: We teach you how to use recipes to develop a data transformation workflow, and we show you how to perform a pre-modeling correlation analysis so you do not move into modeling prematurely, which again saves you time

  • Chapters 4 and 5: H2O AutoML: You will first learn how to use all of the major h2o functions to perform automated machine learning for binary classification including working with the H2O leaderboard, visualizing results, and performing grid search and cross validation. In the next chapter, you learn how to evaluate multiple models against each other with ROC, Precision vs Recall, Gain and Lift, which is necessary to explain your choices for best model to the business stakeholders (your manager and key decision makers).

  • Chapter 6: LIME: You learn how to create local explanations to interpret black-box machine learning models. Explanations for the model predictions are critical because the business cannot do anything to improve unless they understand why. We show you how with lime.

  • Chapters 7 and 8: Expected Value Framework, Threshold Optimization, Sensitivity Analysis: These are my two favorite chapters because they show you how to link the churn classification model to financial results, and they teach purrr for iterative grid-search style optimization! These are POWERFUL CONCEPTS.

  • Chapter 9: Recommendation Algorithm: We again use a correlation analysis but in a different way. We discretize, creating a special visualization that enables us to develop a recommendation strategy.

Data Science For Business With R (DS4B 201-R)

Learn everything you need to know to complete a real-world, end-to-end data science project with the R programming language. Transform your abilities in 10 weeks.

Get Started Today!

Business Science University Course Roadmap

We have one course out and two courses coming soon!

Data Science For Business With R (DS4B 201-R)

Over the course of 10 weeks, we teach you how to solve an end-to-end data science project. Available now!

Transform your abilities by solving employee churn over a 10-week end-to-end data science project in DS4B 201-R

Get Started Today!

Building A Shiny Application (DS4B 301-R)

Our next course teaches you how to take the H2O model, LIME explanation results, and the recommendation algorithm you develop in DS4B 201-R and turn them into a Shiny Web Application that predicts employee attrition! Coming in Q3 2018.

Shiny App That Predicts Attrition and Recommends Management Strategies, Taught in DS4B 301-R (Building A Shiny Web App)

Kelly O'Briant is the lead developer for the Shiny App Course coming soon. She's a brilliant software engineer / data scientist who knows how to make a great-looking, high-performing Shiny app.

Sign Up! Coming Q3!

Data Science For Business With Python (DS4B 201-P)

Did we mention we have a DS4B Python Course coming?! Well, we do! Coming in Q4 2018.

The problem changes: Customer Churn! The tools will be H2O, LIME, and a host of other tools implemented in Python + Spark.

Python Track: Data Science For Business With Python And Spark

Favio Vazquez, Principal Data Scientist at OXXO, is building the Python + Spark equivalent of DS4B 201-R. He's hugely talented, knowing Python, Spark, and R, along with a host of other data science tools.

Sign Up! Coming Q4!

Don’t Miss A Beat

Connect With Business Science

If you like our software (anomalize, tidyquant, tibbletime, timetk, and sweep), our courses, and our company, you can connect with us:


Animating the Goals of the World Cup: Comparing the old vs. new gganimate and tweenr API

Tue, 08/07/2018 - 03:48

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Welcome to Part 3 of my series on "Visualizing the World Cup with R"! This is the culmination of a mini project that I've been working on throughout the World Cup (you can check out the Github Repo here). In addition, having listened to Thomas Pedersen's excellent keynote at UseR! 2018 in Brisbane on the NEW gganimate and tweenr API, I am taking advantage of the fortuitous timing to also compare the APIs, using the goals as the examples!

I had finished these animations a couple of weeks ago but didn't make them available until I presented at the TokyoR MeetUp last week! Hadley Wickham and Joe Rickert were the special guests, and with the number of speakers and attendees it felt more like a mini-conference than a regular meetup. If you're ever in Tokyo, come join us for some R&R…and R! You can check out a recording of my talk on YouTube.

Let's get started!

Coordinate position data

Since this series started, several people have asked me where I got the data. I thought I made it quite clear in Part 1 but I will reiterate in the next few paragraphs.

I get a lot of my data science/visualization news from Twitter, which has made a weird comeback by providing a platform for communities like #rstats (I never thought I'd be creating a Twitter account in 2017!). Through it I've come across some wonderful World Cup visualizations by the Financial Times, FiveThirtyEight, and a host of other people. As you can see from a great example of World Cup penalties by the BBC below, the data is provided by sports analytics companies, primarily Opta!

Great! But can an average joe like me just waltz in, slap down a fiver, and say “GIMME THE DATA”? Unfortunately not: it costs quite a lot! This isn't really a knock on Opta or other sports analytics companies; since FIFA and the FAs of the respective nations didn't do this kind of thing, the free market stepped in to fill the gap. Still, I'm sure I'm not the only one who wishes this kind of data were free, apart from some datasets of varying quality you see on Kaggle (none of which are as granular as what Opta provides anyway).

So, envious of those with the financial backing to procure such data, and mildly annoyed at others online who didn't bother sharing exactly how they got their data or even what tools they used, I started thinking of ways I could get the data myself.

One possible way was to use RSelenium or other JavaScript-capable web scrapers on soccer analytics websites and their cool dashboards, like WhoScored.com. However, since I wouldn't have been able to master these tools before the World Cup ended (during which whatever I ended up creating would be most relevant), I decided to create the coordinate position data myself!

With the plotting system in ggsoccer and ggplot2 it's really not that hard to figure out the positions on the soccer field plot, as you can see in the picture below:

ggplot(point_data) +
  annotate_pitch() +
  theme_pitch(aspect_ratio = NULL) +
  coord_flip() +
  geom_point(aes(x = x, y = y), size = 1.5) +
  geom_text(aes(x = x, y = y, label = label),
            vjust = 1.5, color = "red")

There's also a way to put the coordinates in a 120×80 format (which is much more intuitive); you can do that by adding the *_scale arguments inside the annotate_pitch() function. However, I only realized this after I had the coordinate positions for the 100×100 plot embedded in my head, so that's what I kept going with.
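For anyone curious, here is a minimal sketch of what that might look like. I'm assuming the argument names x_scale and y_scale (going from the "*_scale" naming mentioned above and the ggsoccer version available at the time of writing), and a hypothetical point_data_120x80 data frame whose coordinates are already recorded on a 120×80 scale, so treat this as illustrative only.

library(ggplot2)
library(ggsoccer)

# hypothetical data recorded directly in 120x80 units
point_data_120x80 <- data.frame(x = c(60, 110), y = c(40, 70))

ggplot(point_data_120x80) +
  # scale the default 100x100 pitch outline to a 120x80 pitch (assumed args)
  annotate_pitch(x_scale = 1.2, y_scale = 0.8) +
  theme_pitch(aspect_ratio = NULL) +
  coord_flip() +
  geom_point(aes(x = x, y = y), size = 1.5)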

There is also the “Soccer event logger” here (incidentally also by Ben Torvaney), which allows you to mouse-click specific points on the field and then download a .csv file of the coordinate positions you clicked. This might be easier, but personally I like to experiment within the R environment and take notes and ideas in R Markdown as I go; it's definitely an option for others, though.

… and that's how Part 1 was born! But I wasn't going to stop there: soccer is a moving, flowing game, and while static images are OK, they just don't capture the FEEL of the sport. So this is where gganimate and tweenr came in!

Out of all the World Cup stuff I've animated so far, by far the most complicated was Gazinsky's goal in the opening game. This is because I had to track not only the ball's movement but the movement of multiple players as well. So most of the API comparison will be done with this goal.

Let's take a look at the packages that I'll be using:

library(ggplot2)     # general plotting base
library(dplyr)       # data manipulation/tidying
library(ggsoccer)    # draw soccer field plot
library(ggimage)     # add soccer ball emoji + flags
library(extrafont)   # incorporate Dusha font into plots
library(gganimate)   # animate goal plots
library(tweenr)      # create in-between frames for data
library(purrr)       # for creating a list of dataframes for tweenr
library(countrycode) # for finding ISO codes for geom_flag()

# loadfonts() run once every new session

Gazinsky's first goal

Let's first look at the set of dataframes with the coordinate data points necessary for this to work:

pass_data <- data.frame(
  x = c(100, 94, 82, 82.5, 84, 76.5, 75.5, 94, 99.2),
  y = c(0, 35, 31, 22, 8, 13, 19, 60, 47.5),
  time = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

golovin_movement <- data.frame(
  x = c(78, 80, 80, 80, 75.5, 74.5, 73.5, 73, 73),
  y = c(30, 30, 27, 25, 10, 9, 15, 15, 15),
  label = "Golovin",
  time = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

zhirkov_movement <- data.frame(
  x = c(98, 90, 84, 84, 84, 84, 84, 84, 84),
  y = c( 0, 2, 2, 2, 2, 2, 2, 2, 2),
  label = "Zhirkov",
  time = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

gazinsky_movement <- data.frame(
  x = c(0, 0, 0, 0, NA, 92, 92, 92, 92),
  y = c(0, 0, 0, 0, NA, 66.8, 66.8, 66.8, 66.8),
  label = "Gazinsky",
  time = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# ONLY in static + gganimate versions
segment_data <- data.frame(
  x = c(77.5, 98),
  y = c(22, 2),
  xend = c(75, 84),
  yend = c(15, 3),
  linetype = c("dashed", "dashed"),
  color = c("black", "black"),
  size = c(1.2, 1.25)
)

# saudi defender
saudi_data <- data.frame(
  x = c(95, 95, 90, 87, 84, 80, 79, 79, 79),
  y = c(35, 35, 35, 32, 28, 25, 24, 25, 26),
  label = "M. Al-Breik",
  time = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

### soccer ball
ball_data <- tribble(
  ~x,    ~y,   ~time,
  100,   0,    1,
  94,    35,   2,
  82,    31,   3,
  82.5,  25,   4,
  84,    6,    5,
  77,    13,   6,
  76,    19,   7,
  94,    60,   8,
  99.2,  47.5, 9
)

If you're manually creating these, you could also use tribble() instead of data.frame(). It takes up a bit more space, as you can see with ball_data, but it is probably more readable when you're sharing the code (such as when creating a reprex on SO or RStudio Community).
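To make the comparison concrete, here is the same tiny table written both ways (purely illustrative; it is not part of the goal data):

library(tibble)

# column-by-column spelling
pts_df  <- data.frame(x = c(100, 94), y = c(0, 35), time = c(1, 2))

# row-by-row spelling: longer, but easier to read when sharing
pts_trb <- tribble(
  ~x,  ~y, ~time,
  100,   0,     1,
   94,  35,     2
)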

And here is the ggplot code for the gganimate version (no tween frames)!

Note: you need to be careful about the ordering of the ggplot elements. Make sure the soccer ball emoji code comes near the end, after the labels, so that the player name labels don't cover the ball as it moves around!

ggplot(pass_data) + annotate_pitch() + theme_pitch() + coord_flip( xlim = c(49, 101), ylim = c(-1, 101)) + geom_segment( data = segment_data, aes(x = x, y = y, xend = xend, yend = yend), size = segment_data$size, color = segment_data$color, linetype = c("dashed", "dashed")) + geom_label( data = saudi_data, aes(x = x, y = y, label = label), color = "darkgreen") + geom_label(data = zhirkov_movement, aes(x = x, y = y, frame = time, label = label), color = "red") + geom_label(data = golovin_movement, aes(x = x, y = y, frame = time, label = label), color = "red") + geom_label( data = gazinsky_movement, aes(x = x, y = y, label = label), color = "red") + ggimage::geom_emoji( data = ball_data, aes(x = x, y = y, frame = time), image = "26bd", size = 0.035) + ggtitle( label = "Russia (5) vs. (0) Saudi Arabia", subtitle = "First goal, Yuri Gazinsky (12th Minute)") + labs(caption = "By Ryo Nakagawara (@R_by_Ryo)") + annotate( "text", x = 69, y = 65, family = "Dusha V5", label = "After a poor corner kick clearance\n from Saudi Arabia, Golovin picks up the loose ball, \n exchanges a give-and-go pass with Zhirkov\n before finding Gazinsky with a beautiful cross!") + theme(text = element_text(family = "Dusha V5"))

Now let's check out how we would do it with the in-between frames added in using tweenr!

The important bit with the old API was that you had to create a list of data frames of the different states of your data. In this case, that means one data frame for each observation of the data or, to put it more simply, for each value of the “time” variable (a data frame of coordinate positions for time = 1, time = 2, etc.). This is done with pmap(), with data.frame() passed to the .f argument.

With this list of data frames created, we can pass it to the tween_states() function to create the in-between frames connecting each of the data frames in the list. Take note of the arguments of tween_states(), as they'll show up again in the new API later.

### soccer ball
b_list <- ball_data %>% pmap(data.frame)

ball_tween <- b_list %>%
  tween_states(tweenlength = 0.5, statelength = 0.00000001,
               ease = "linear", nframes = 75)

### Golovin
golovin_movement_list <- golovin_movement %>% pmap(data.frame)

golovin_tween <- golovin_movement_list %>%
  tween_states(tweenlength = 0.5, statelength = 0.00000001,
               ease = "linear", nframes = 75)

golovin_tween <- golovin_tween %>% mutate(label = "Golovin")

### Zhirkov
zhirkov_movement_list <- zhirkov_movement %>% pmap(data.frame)

zhirkov_tween <- zhirkov_movement_list %>%
  tween_states(tweenlength = 0.5, statelength = 0.00000001,
               ease = "linear", nframes = 75)

zhirkov_tween <- zhirkov_tween %>% mutate(label = "Zhirkov")

Now with these newly created tween dataframes, we pass them into our ggplot code as before and specify the frame argument with the newly created “.frame” variable.

ggplot(pass_data) + annotate_pitch() + theme_pitch() + coord_flip(xlim = c(49, 101), ylim = c(-1, 101)) + geom_label( data = saudi_data, aes(x = x, y = y, label = label), color = "darkgreen") + geom_label(data = zhirkov_tween, aes(x = x, y = y, frame = .frame, label = label), color = "red") + geom_label(data = golovin_tween, aes(x = x, y = y, frame = .frame, label = label), color = "red") + geom_label( data = gazinsky_movement, aes(x = x, y = y, label = label), color = "red") + ggimage::geom_emoji( data = ball_tween, aes(x = x, y = y, frame = .frame), image = "26bd", size = 0.035) + ggtitle(label = "Russia (5) vs. (0) Saudi Arabia", subtitle = "First goal, Yuri Gazinsky (12th Minute)") + labs(caption = "By Ryo Nakagawara (@R_by_Ryo)") + annotate("text", x = 69, y = 65, family = "Dusha V5", label = "After a poor corner kick clearance\n from Saudi Arabia, Golovin picks up the loose ball, \n exchanges a give-and-go pass with Zhirkov\n before finding Gazinsky with a beautiful cross!") + theme(text = element_text(family = "Dusha V5"))

Looks good. Now, let's check out how things changed with the new API!

New gganimate & tweenr

Once again, let's start by looking at just animating across the “time” variable without creating in-between frames.

ggplot(pass_data) + annotate_pitch() + theme_pitch() + theme(text = element_text(family = "Dusha V5")) + coord_flip(xlim = c(49, 101), ylim = c(-1, 101)) + geom_label( data = zhirkov_movement, aes(x = x, y = y, label = label), color = "red") + geom_label( data = golovin_movement, aes(x = x, y = y, label = label), color = "red") + geom_label( data = gazinsky_movement, aes(x = x, y = y, label = label), color = "red") + geom_label( data = saudi_data, aes(x = x, y = y, label = label), color = "darkgreen") + ggimage::geom_emoji( data = ball_data, aes(x = x, y = y), image = "26bd", size = 0.035) + ggtitle(label = "Russia (5) vs. (0) Saudi Arabia", subtitle = "First goal, Yuri Gazinsky (12th Minute)") + labs(caption = "By Ryo Nakagawara (@R_by_Ryo)") + annotate("text", x = 69, y = 65, family = "Dusha V5", label = "After a poor corner kick clearance\n from Saudi Arabia, Golovin picks up the loose ball, \n exchanges a give-and-go pass with Zhirkov\n before finding Gazinsky with a beautiful cross!") + # new gganimate code: transition_manual(frames = time)

It's quite nice that I don't have to specify frame = some_time_variable in every geom that I want animated now!

However, you can see that, as with the old gganimate code, the new transition_manual() function just speeds through the specified “time” variable without actually creating in-between frames. Let's use the other transition_*() functions to specify the tween frames and set the animation speed.

Here I will use transition_states(), with “time” being the column I pass to the states argument. Instead of having to create a “.frame” column with the tween_states() function, I can just pass the “time” variable into transition_states() and it will tween the frames for me right inside the ggplot code! The transition_length argument plays the role of the old tweenlength argument of tween_states(), and state_length likewise corresponds to statelength.

Unlike in the version I showed in my presentation, I added Mohammed Al-Breik's movement as well. I felt it was a bit silly (and unfair) to show him just standing there after his headed clearance!

ggplot(pass_data) + annotate_pitch() + theme_pitch() + coord_flip(xlim = c(49, 101), ylim = c(-1, 101)) + geom_label( data = saudi_data, aes(x = x, y = y, label = label), color = "darkgreen") + geom_label( data = zhirkov_movement, aes(x = x, y = y, label = label), color = "red") + geom_label( data = golovin_movement, aes(x = x, y = y, label = label), color = "red") + geom_label( data = gazinsky_movement, aes(x = x, y = y, label = label), color = "red") + enter_grow(fade = TRUE) + ggimage::geom_emoji( data = ball_data, aes(x = x, y = y), image = "26bd", size = 0.035) + ggtitle( label = "Russia (5) vs. (0) Saudi Arabia", subtitle = "First goal, Yuri Gazinsky (12th Minute)") + labs(caption = "By Ryo Nakagawara (@R_by_Ryo)") + annotate( "text", x = 69, y = 65, family = "Dusha V5", label = "After a poor corner kick clearance\n from Saudi Arabia, Golovin picks up the loose ball, \n exchanges a give-and-go pass with Zhirkov\n before finding Gazinsky with a beautiful cross!") + theme(text = element_text(family = "Dusha V5")) + # new gganimate code: transition_states( time, transition_length = 0.5, state_length = 0.0001, wrap = FALSE) + ease_aes("linear") anim_save(filename = "gazin_new_tweenr.gif")

Now you may be wondering why I didn't use the seemingly more logical choice, the transition_time() function, so let me explain.

I manually created the timing of the coordinate data, so naturally the transitions are slightly off compared to real data. This goal animation was split into nine “time” values, one for each important bit of the play that I thought would transition well when connected with the others. I then ran it through gganimate to see if it flowed well and, once I was satisfied, I let tweenr fill in the blanks between each “time” value.

With the new API, however, using the transition_time() function wouldn't allow me to control the transition length and state length the way transition_states() does! Try running the code above with transition_time(time = time) instead and you'll see what I mean.
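If you want to see the difference without re-rendering the whole goal plot, here is a tiny, self-contained toy of my own (not taken from the original animation) contrasting the two transitions:

library(ggplot2)
library(gganimate)

toy <- data.frame(x = c(1, 2, 3), y = c(1, 3, 2), time = c(1, 2, 3))
p <- ggplot(toy, aes(x, y)) + geom_point(size = 4)

# explicit control over how long transitions and pauses last
p + transition_states(time, transition_length = 0.5, state_length = 0.0001)

# interpolates along the raw time values; no transition/state length knobs
p + transition_time(time)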

If I had real data, with proper timing values in the “time” column that worked seamlessly with the coordinate data points, then it would have been appropriate to use transition_time(). Examples of this kind of data set include the gapminder data set used in the package README, which uses the “year” variable, and the data set behind the cool NBA animation by James Curley seen here, which records coordinate positions and exact times at a very granular level.

A cool new thing you can play around with in the new gganimate is the different enter/exit animations! However, I couldn't really get them to work for Gazinsky's label… In the mtcars example on the gganimate GitHub repo, the boxplots disappear when there is no data for a specific combination of variables, but I can't seem to set up the Gazinsky label dataframe correctly to implement the same effect.

Ideally, I want Gazinsky's label to show up only from time = 6 onwards. I tried filling the coordinate positions for time = 1 to time = 5 with NAs or 0s, but that didn't seem to trigger the effect… when I tried “x = 0, y = 0” at time = 5, the player label zipped in from the bottom of the screen to the penalty box at time = 6, which was unintentionally very funny!

Any help here will be greatly appreciated!

Osako's goal vs. Colombia

Japan faced a tough opponent in Colombia in our opening game of the World Cup, even with the man advantage early on. Although our passing tired out the tenacious Colombians, we were finding it hard to create a breakthrough. In came Keisuke Honda who, within minutes of his arrival, delivered a beautiful cross from a corner kick for Osako to head home!

This goal was a lot easier to animate and, to be honest, it was the first one I actually got working a few weeks ago! That was mainly because the ball's movement was the only thing I really had to worry about.

cornerkick_data <- data.frame(x = 99, y = 0.3,
                              x2 = 94, y2 = 47)

osako_gol <- data.frame(x = 94, y = 49,
                        x2 = 100, y2 = 55.5)

ball_data <- data.frame(
  x = c(99, 94, 100),
  y = c(0.3, 47, 55.5),
  time = c(1, 2, 3))

player_label <- data.frame(
  x = c(92, 99),
  y = c(49, 2),
  label = c("Osako", "Honda"))

wc_logo <- data.frame(x = 107, y = 85) %>%
  mutate(image = "https://upload.wikimedia.org/wikipedia/en/thumb/6/67/2018_FIFA_World_Cup.svg/1200px-2018_FIFA_World_Cup.svg.png")

flag_data <- data.frame(
  x = c(110, 110),
  y = c( 13, 53),
  team = c("japan", "colombia")
  ) %>%
  mutate(
    image = team %>% countrycode(., origin = "country.name", destination = "iso2c")
  ) %>%
  select(-team)

For this animation, I used one of the many easing functions available in tweenr, quadratic-out, to get the speed of the ball from a corner kick just about right. You can refer to this awesome website to check out most of the different easing functions you can use in ease_aes()!

ggplot(ball_data) + annotate_pitch() + theme_pitch() + theme( text = element_text(family = "Dusha V5"), plot.margin=grid::unit(c(0,0,0,0), "mm")) + coord_flip( xlim = c(55, 112), ylim = c(-1, 101)) + geom_label( data = player_label, aes(x = x, y = y, label = label), family = "Dusha V5") + geom_point( aes(x = 98, y = 50), size = 3, color = "green") + annotate( geom = "text", family = "Dusha V5", hjust = c(0.5, 0.5, 0.5, 0.5), size = c(6.5, 4.5, 5, 3), label = c("Japan (2) vs. Colombia (1)", "Kagawa (PEN 6'), Quintero (39'), Osako (73')", "Japan press their man advantage, substitute Honda\ndelivers a delicious corner kick for Osako to (somehow) tower over\nColombia's defense and flick a header into the far corner!", "by Ryo Nakagawara (@R_by_Ryo)"), x = c(110, 105, 70, 53), y = c(30, 30, 47, 85)) + ggimage::geom_emoji( # soccer ball emoji aes(x = x, y = y), image = "26bd", size = 0.035) + ggimage::geom_flag( # Japan + Colombia flag data = flag_data, aes(x = x, y = y, image = image), size = c(0.08, 0.08)) + geom_image( # World Cup logo data = wc_logo, aes(x = x, y = y, image = image), size = 0.17) + # new gganimate code: transition_states( time, transition_length = 0.5, state_length = 0.0001, wrap = FALSE) + ease_aes("quadratic-out")

As you can see it's quite easy and fun to make these! I am hoping to make more in the future, especially when the new season begins!

A small note on the flags: I used a bit of a hacky solution to get them into the title, but both Ben and Hadley recommended the emo package and its ji() function, which lets you insert emoji into R Markdown and inline code. So that's something I'll be looking into in the near future!
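As a rough idea of what that could look like (a sketch only: emo lives on GitHub at hadley/emo, and the emoji keyword below is my own guess, so double-check it against emo's lookup table):

# devtools::install_github("hadley/emo")   # emo is GitHub-only
library(ggplot2)

# inline in R Markdown you could write `r emo::ji("soccer")`; in a plot,
# the emoji can be pasted straight into the title string
ggplot() +
  ggtitle(paste("World Cup", emo::ji("soccer"), "goals, animated"))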

Japan's Offside Trap vs. Senegal!

For the final animation, I tried to recreate something you don't see every day: an offside trap!

# PLAYERS # JAPAN: x, y (blue) Senegal: x2, y2 (lightgreen) trap_data <- data.frame( time = c(1, 2, 3, 4, 5), # ball trajectory x = c(70, 70, 70, 87, 95), y = c(85, 85, 85, 52, 33), # offside line bar #xo = c(83, 81.2, 79, 77.5, 70), xoend = c(83.8, 81.8, 79, 78.5, 71), yo = c( 5, 5, 5, 5, 5), yoend = c(95, 95, 95, 95, 95), # players: japan jx = c(83, 81, 77, 75, 70), jy = c(rep(65, 5)), jx2 = c(83, 81.8, 78.5, 77, 70), jy2 = c(rep(60.5, 5)), jx3 = c(83, 81, 76.5, 75, 71), jy3 = c(rep(55, 5)), jx4 = c(83, 81.2, 76.3, 75, 70), jy4 = c(rep(52, 5)), jx5 = c(82.8, 81, 77, 74, 70), jy5 = c(rep(49, 5)), jx6 = c(83, 81.8, 77, 74, 70), jy6 = c(rep(45, 5)), jx7 = c(83.8, 81, 79, 77.5, 70), jy7 = c(rep(40, 5)), # players: senegal sx = c(83, 84, 84, 84, 84), sy = c(rep(33, 5)), sx2 = c(83, 85, 87, 92, 95), sy2 = c(38, 37, 35, 34, 33), sx3 = c(83, 84, 84, 83, 83), sy3 = c(rep(41, 5)), sx4 = c(83, 84, 83, 78, 78), sy4 = c(rep(45, 5)), sx5 = c(83, 84, 87, 88, 89), sy5 = c(rep(52, 5)), sx6 = c(83, 85, 84, 84, 83), sy6 = c(rep(69, 5)) ) # flags flag_data <- data.frame( x = c( 48, 93), y = c(107, 107), team = c("japan", "senegal") ) %>% mutate( image = team %>% countrycode(., origin = "country.name", destination = "iso2c") ) %>% select(-team) # extra players: goalkeeper_data <- data.frame( x = c(98), y = c(50) ) senegal_data <- data.frame( x = c(55, 55, 68.5), y = c(50, 60, 87) )

In the code below, take note of the “wrap” option in transition_states(). You can set it to FALSE if you don't want the last state to transition back to the first state (the default is TRUE).

ggplot(trap_data) + annotate_pitch() + theme_pitch(aspect_ratio = NULL) + coord_fixed( xlim = c(30, 101), ylim = c(-5, 131)) + # offside line geom_segment(aes(x = xoend, y = yo, xend = xoend, yend = yoend), color = "black", size = 1.3) + # japan geom_point(aes(x = jx, y = jy), size = 4, color = "blue") + geom_point(aes(x = jx2, y = jy2), size = 4, color = "blue") + geom_point(aes(x = jx3, y = jy3), size = 4, color = "blue") + geom_point(aes(x = jx4, y = jy4), size = 4, color = "blue") + geom_point(aes(x = jx5, y = jy5), size = 4, color = "blue") + geom_point(aes(x = jx6, y = jy6), size = 4, color = "blue") + geom_point(aes(x = jx7, y = jy7), size = 4, color = "blue") + # senegal geom_point(aes(x = sx, y = sy), size = 4, color = "green") + geom_point(aes(x = sx2, y = sy2), size = 4, color = "green") + geom_point(aes(x = sx3, y = sy3), size = 4, color = "green") + geom_point(aes(x = sx4, y = sy4), size = 4, color = "green") + geom_point(aes(x = sx5, y = sy5), size = 4, color = "green") + geom_point(aes(x = sx6, y = sy6), size = 4, color = "green") + # free kick spot (reference) geom_point(aes(x = 70, y = 85), color = "blue", size = 1.2) + # goalkeeper geom_point( data = goalkeeper_data, aes(x = x, y = y), size = 4, color = "blue") + # senegal defenders geom_point( data = senegal_data, aes(x = x, y = y), size = 4, color = "green") + annotate( geom = "text", family = "Dusha V5", hjust = c(0, 0), size = c(6, 6.5), label = c("Japan (2) vs. Senegal (2)", "The Perfect Offside Trap"), x = c(30, 30), y = c(107, 115)) + ggimage::geom_flag( data = flag_data, aes(x = x, y = y, image = image), size = c(0.07, 0.07)) + ggimage::geom_emoji( aes(x = x, y = y), image = "26bd", size = 0.035) + # NEW gganimate code transition_states( states = time, transition_length = 0.2, state_length = 0.00000001, wrap = FALSE) + ease_aes("linear")

So against the height advantage and physicality of Senegal, the thinking behind Japan's strategy was…

Let's take a few minutes to reflect on the new API.

Personal thoughts

The best thing about the new API is, without a doubt, that there are no more intermediary steps between tweening the data and plotting. As long as you have some kind of “time” variable, you don't have to manually create the list of data frames for each state yourself, as the transition_*() functions do it all for you in the ggplot code!

ease_aes() also allows you to specify the easing function of the transitions within the ggplot code. From "linear" to "quartic" to "elastic", along with modifiers such as "in", "out", and "in-out", you have a lot to choose from to satisfy your animation needs. Thomas did mention in his keynote that he wants a better name for this, so any suggestions? Maybe something like ease_tween(), easing_fun(), ease_trans(), or ease_transitions()?

With everything streamlined so that the animation code fits seamlessly into the ggplot grammar, I feel you can read the code as a whole much more easily; you don't have to refer back to a separate chunk of code that shows how you created the tween frames. The transition to a “grammar of animation” is truly in motion!

New options in gganimate and tweenr!

Now I'll talk about a few other new things that I didn't have a chance to show this time around.

There is a host of different enter_*() and exit_*() functions to choose from to control how data appear and disappear over the course of your animation. Some of the built-in effects available include *_fade(), *_grow(), and *_shrink(), with extra arguments like early that control whether the data appears or disappears at the beginning of the transition or at the end.

With the old API, since you created the frames yourself with tween_states(), you got a data frame of the expanded tween frames that you could inspect at your leisure. Now that the tweening is done inside the ggplot code, you might think you can't explicitly access those frames, but this is where the frame_vars() function comes in! Using this function you can access metadata about each frame rendered in your latest animation:

frames_data <- frame_vars(animation = last_animation())

The “frame_source” column shows you where each individual frame image is saved, so you can copy the frames, re-animate them with magick instead, or do anything else you want!
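For example (a sketch assuming the magick package is installed and an animation has just been rendered), you could read those frame files back in and rebuild the GIF at a different frame rate:

library(gganimate)
library(magick)

frames_data <- frame_vars(animation = last_animation())

# read the individual frame PNGs back in and re-animate them at 10 fps
frames_imgs <- image_read(frames_data$frame_source)
image_animate(frames_imgs, fps = 10)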

Panning and zooming across different states of the animation is another new concept introduced in the new gganimate, via the series of view_*() functions such as view_zoom() and view_step(). Within these you can use arguments like pause_length to specify the length of the zoomed view and step_length to specify the length of the transition between view points. I didn't implement them in these GIFs because I had already used the coord_*() functions to focus on certain areas of the pitch, and the events I was animating needed a wide perspective of the field. They may come into play in future goal or play-by-play animations where I recreate a neat bit of build-up play from a full-field view and then zoom in on the off-the-ball movement of a certain player, so they're definitely a set of functions to keep an eye on!

Finally, in previous versions you used the gganimate() function to save the animation to your computer, but now that is done with anim_save(). The README on GitHub has a very clear explanation of this, so take a look under the “Where is my animation?” section here.

There's still much to learn about the new API, and I'm sure there will be more changes and fixes before the first CRAN release, but this is a great step in the right direction. I will eagerly await the next release!

    Related Post

    1. Machine Learning Results in R: one plot to rule them all! (Part 1 – Classification Models)
    2. Seaborn Categorical Plots in Python
    3. Matplotlib Library Tutorial with Examples – Python
    4. Visualize the World Cup with R! Part 1: Recreating Goals with ggsoccer and ggplot2
    5. Creating Slopegraphs with R

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Extracting and Processing eBird Data

    Tue, 08/07/2018 - 02:00

    (This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

    eBird is an online tool for recording bird
    observations. The eBird database currently contains over 500 million
    records of bird sightings, spanning every country and nearly every bird
    species, making it an extremely valuable resource for bird research and
    conservation. These data can be used to map the distribution and
    abundance of species, and assess how species’ ranges are changing over
    time. This dataset is available for download as a text file; however,
    this file is huge (over 180 GB!) and, therefore, poses some unique
    challenges. In particular, it isn’t possible to import and manipulate
    the full dataset in R. Working with these data typically requires
    filtering them to a smaller subset of desired observations before
    reading into R. This filtering is most efficiently done using AWK, a
    Unix utility and programming language for processing column formatted
    text data. The auk package acts as a front end for AWK, allowing users
    to filter eBird data before import into R, and provides tools to perform
    some important pre-processing of the data. The name of this package
    comes from the happy coincidence that the command line tool
    AWK, upon which
    the package is based, is pronounced the same as
    auk, the family of sea birds also
    known as Alcids.

    This post will demonstrate how to use auk to extract eBird data for
    two bird species (Swainson’s and Bicknell’s Thrush) in New Hampshire,
    then use these data to produce some maps comparing the distributions of
    these species.

    Installation

    This package is available on
    CRAN and can be installed
    with:

    install.packages("auk")

    The development version can be installed from GitHub using devtools. Windows users should use this development version since it fixes a bug present in the CRAN version:

    devtools::install_github("CornellLabofOrnithology/auk")

    auk requires the Unix utility AWK, which is available on most Linux
    and Mac OS X machines. Windows users will first need to install
    Cygwin before using this package. Note that
    Cygwin should be installed in the default location
    (C:/cygwin/bin/gawk.exe or C:/cygwin64/bin/gawk.exe) in order for
    auk to work.

    Accessing the data

    Access to the full eBird dataset is provided via two large,
    tab-separated text files. The eBird Basic Dataset (EBD) contains the
    bird observation information and consists of one line for each
    observation of a species on a checklist. The Sampling Event Data
    contains just the checklist-level information (e.g. time and date,
    location, and search effort information) and consists of one line for
    each checklist.

    To obtain the eBird data, begin by creating an eBird account and
    signing in. eBird data are freely available; however, you will need to
    request access in order to obtain the EBD.
    Filling out the data request form allows eBird to keep track of the
    number of people using the data and obtain information on the
    applications for which the data are used.

    Once you have access to the data, proceed to the download page.
    Download and uncompress (twice!) both the EBD and the corresponding
    Sampling Event Data. Put these two large text files somewhere sensible
    on either your computer’s hard drive or an external drive, and remember
    the location of the containing folder.

    Using auk

    There are essentially five main tasks that auk is designed to
    accomplish, and I’ll use a simple example to demonstrate each in turn:

    1. Cleaning
    2. Filtering
    3. Importing
    4. Pre-processing
    5. Zero-filling

    Let’s start by loading the required packages and defining a variable for
    the path to the folder that contains both the EBD and sampling event
    data.

    library(auk)
    library(raster)
    library(tidyverse)
    library(sf)
    library(rnaturalearth)
    library(tigris)
    library(viridisLite)

    # path to ebird data
    # change this to the location where you saved the data
    ebd_dir <- "/Users/mes335/data/ebird/ebd_relMay-2018"

    1. Cleaning

    Some rows in the EBD may have special characters in the comments fields
    (for example, tabs) that will cause import errors, and the dataset has
    an extra blank column at the end. The function auk_clean() drops these
    problematic records and removes the blank column. In addition, you can
    use remove_text = TRUE to remove free text entry columns, including
    the comments and location and observation names. These fields are
    typically not required and removing them can significantly decrease file
    size.

    This process should be run on both the EBD and sampling event data. It
    typically takes several hours for the full EBD; however, it only needs
    to be run once because the output from the process is saved out to a new
    tab-separated text file for subsequent use. After running auk_clean(),
    you can delete the original, uncleaned data files to save space.

    # ebd
    f <- file.path(ebd_dir, "ebd_relMay-2018.txt")
    f_clean <- file.path(ebd_dir, "ebd_relMay-2018_clean.txt")
    auk_clean(f, f_out = f_clean, remove_text = TRUE)

    # sampling
    f_sampling <- file.path(ebd_dir, "ebd_sampling_relMay-2018.txt")
    f_sampling_clean <- file.path(ebd_dir, "ebd_sampling_relMay-2018_clean.txt")
    auk_clean(f_sampling, f_out = f_sampling_clean, remove_text = TRUE)

    2. Filtering

    The EBD is huge! If we’re going to work with it, we need to extract a
    manageable subset of the data. With this in mind, the main purpose of
    auk is to provide a variety of functions to define taxonomic, spatial,
    temporal, or effort-based filters. To get started, we’ll use auk_ebd()
    to set up a reference to the EBD. We’ll also provide a reference to the
    sampling event data. This step is optional, but it will allow us to
    apply exactly the same set of filters (except for taxonomic filters) to
    the sampling event data and the EBD. We’ll see why this is valuable
    later.

    # define the paths to ebd and sampling event files f_in_ebd <- file.path(ebd_dir, "ebd_relMay-2018_clean.txt") f_in_sampling <- file.path(ebd_dir, "ebd_sampling_relMay-2018_clean.txt") # create an object referencing these files auk_ebd(file = f_in_ebd, file_sampling = f_in_sampling) #> Input #> EBD: /Users/mes335/data/ebird/ebd_relMay-2018/ebd_relMay-2018_clean.txt #> Sampling events: /Users/mes335/data/ebird/ebd_relMay-2018/ebd_sampling_relMay-2018_clean.txt #> #> Output #> Filters not executed #> #> Filters #> Species: all #> Countries: all #> States: all #> Spatial extent: full extent #> Date: all #> Start time: all #> Last edited date: all #> Protocol: all #> Project code: all #> Duration: all #> Distance travelled: all #> Records with breeding codes only: no #> Complete checklists only: no

    As an example, let’s extract records from New Hampshire for Bicknell’s
    Thrush and Swainson’s Thrush from
    complete
    checklists in June of any year. Bicknell’s Thrush breeds at high
    elevations in New England, such as the White Mountains of New Hampshire,
    one of my favourite places in the Northeast. Swainson’s Thrush is a
    related, and much more widespread species, that breeds at lower
    elevations in the same area. While hiking in the White Mountains in the
    early summer, I love listening for the turnover from Swainson’s to
    Bicknell’s Thrush as I move up in elevation. Let’s see if this pattern
    is reflected in the data.

    Note that each filtering criterion corresponds to a different function,
    all functions begin with auk_ for easy auto-completion, and the
    functions can be chained together to define multiple filters. Also,
    printing the auk_ebd object will show which filters have been defined.

    ebd_filters <- auk_ebd(f_in_ebd, f_in_sampling) %>% auk_species(c("Bicknell's Thrush", "Swainson's Thrush")) %>% auk_state("US-NH") %>% auk_date(c("*-06-01", "*-06-30")) %>% auk_complete() ebd_filters #> Input #> EBD: /Users/mes335/data/ebird/ebd_relMay-2018/ebd_relMay-2018_clean.txt #> Sampling events: /Users/mes335/data/ebird/ebd_relMay-2018/ebd_sampling_relMay-2018_clean.txt #> #> Output #> Filters not executed #> #> Filters #> Species: Catharus bicknelli, Catharus ustulatus #> Countries: all #> States: US-NH #> Spatial extent: full extent #> Date: *-06-01 - *-06-30 #> Start time: all #> Last edited date: all #> Protocol: all #> Project code: all #> Duration: all #> Distance travelled: all #> Records with breeding codes only: no #> Complete checklists only: yes

    This is just a small set of the potential filters; for a full list,
    consult the documentation.

    The above code only defines the filters, no data has actually been
    extracted yet. To compile the filters into an AWK script and run it, use
    auk_filter(). Since I provided an EBD and sampling event file to
    auk_ebd(), both will be filtered and I will need to provide two output
    filenames.

    f_out_ebd <- "data/ebd_thrush_nh.txt"
    f_out_sampling <- "data/ebd_thrush_nh_sampling.txt"
    ebd_filtered <- auk_filter(ebd_filters, file = f_out_ebd,
                               file_sampling = f_out_sampling)

    Running auk_filter() takes a long time, typically a few hours, so
    you’ll need to be patient. To download the pre-filtered data used in
    the remainder of this post, visit the GitHub repository.

    3. Importing

    Now that we have an EBD extract of a reasonable size, we can read it
    into an R data frame. The files output from auk_filter() are just
    tab-separated text files, so we could read them using any of our usual R
    tools, e.g. read.delim(). However, auk contains functions
    specifically designed for reading in EBD data. These functions choose
    sensible variable names, set the data types of columns correctly, and
    perform two important post-processing steps: taxonomic roll-up and
    de-duplicating group checklists.
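    To make the contrast concrete, here is a quick sketch (the path is the
    f_out_ebd file created above; read.delim() works fine, it just skips all
    of the EBD-specific niceties):

    # base R import: plain tab-separated text, names and types left as-is
    ebd_base <- read.delim("data/ebd_thrush_nh.txt", quote = "", stringsAsFactors = FALSE)

    # auk import: tidy names, correct types, rollup and de-duplication applied
    ebd_tidy <- auk::read_ebd("data/ebd_thrush_nh.txt")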

    We’ll put the sampling event data aside for now, and just read in the
    EBD:

    ebd <- read_ebd(f_out_ebd) glimpse(ebd) #> Observations: 2,253 #> Variables: 42 #> $ checklist_id "S3910100", "S4263131", "S3936987... #> $ global_unique_identifier "URN:CornellLabOfOrnithology:EBIR... #> $ last_edited_date "2014-05-07 19:23:51", "2014-05-0... #> $ taxonomic_order 24602, 24600, 24600, 24602, 24602... #> $ category "species", "species", "species", ... #> $ common_name "Swainson's Thrush", "Bicknell's ... #> $ scientific_name "Catharus ustulatus", "Catharus b... #> $ observation_count "X", "1", "2", "1", "1", "2", "5"... #> $ breeding_bird_atlas_code NA, NA, NA, NA, NA, NA, NA, NA, N... #> $ breeding_bird_atlas_category NA, NA, NA, NA, NA, NA, NA, NA, N... #> $ age_sex NA, NA, NA, NA, NA, NA, NA, NA, N... #> $ country "United States", "United States",... #> $ country_code "US", "US", "US", "US", "US", "US... #> $ state "New Hampshire", "New Hampshire",... #> $ state_code "US-NH", "US-NH", "US-NH", "US-NH... #> $ county "Grafton", "Grafton", "Coos", "Co... #> $ county_code "US-NH-009", "US-NH-009", "US-NH-... #> $ iba_code NA, NA, "US-NH_2408", "US-NH_2419... #> $ bcr_code 14, 14, 14, 14, 14, 14, 14, 14, 1... #> $ usfws_code NA, NA, NA, NA, NA, NA, NA, NA, N... #> $ atlas_block NA, NA, NA, NA, NA, NA, NA, NA, N... #> $ locality_id "L722150", "L291562", "L295828", ... #> $ locality_type "H", "H", "H", "H", "H", "H", "H"... #> $ latitude 43.94204, 44.16991, 44.27132, 45.... #> $ longitude -71.72562, -71.68741, -71.30302, ... #> $ observation_date 2008-06-01, 2004-06-30, 2008-06-... #> $ time_observations_started NA, NA, NA, "09:00:00", "08:00:00... #> $ observer_id "obsr50351", "obsr131204", "obsr1... #> $ sampling_event_identifier "S3910100", "S4263131", "S3936987... #> $ protocol_type "Incidental", "Incidental", "Inci... #> $ protocol_code "P20", "P20", "P20", "P22", "P22"... #> $ project_code "EBIRD", "EBIRD", "EBIRD_BCN", "E... #> $ duration_minutes NA, NA, NA, 120, 180, 180, 180, 1... #> $ effort_distance_km NA, NA, NA, 7.242, 11.265, 11.265... #> $ effort_area_ha NA, NA, NA, NA, NA, NA, NA, NA, N... #> $ number_observers NA, 1, NA, 2, 2, 2, 2, 2, 2, 16, ... #> $ all_species_reported TRUE, TRUE, TRUE, TRUE, TRUE, TRU... #> $ group_identifier NA, NA, NA, NA, NA, NA, NA, NA, N... #> $ has_media FALSE, FALSE, FALSE, FALSE, FALSE... #> $ approved TRUE, TRUE, TRUE, TRUE, TRUE, TRU... #> $ reviewed FALSE, FALSE, FALSE, FALSE, FALSE... #> $ reason NA, NA, NA, NA, NA, NA, NA, NA, N... 4. Pre-processing

    By default, two important pre-processing steps are performed to handle
    taxonomy and group checklists. In most cases, you’ll want this to be
    done; however, these can be turned off to get the raw data.

    ebd_raw <- read_ebd(f_out_ebd,
                        # leave group checklists as is
                        unique = FALSE,
                        # leave taxonomy as is
                        rollup = FALSE)

    Taxonomic rollup

    The eBird
    taxonomy

    is an annually updated list of all field recognizable taxa. Taxa are
    grouped into eight different categories, some at a higher level than
    species and others at a lower level. The function auk_rollup() (called
    by default by read_ebd()) produces an EBD containing just true
    species. The categories and their treatment by auk_rollup() are as
    follows:

    • Species: e.g., Mallard.
    • ISSF or Identifiable Sub-specific Group: Identifiable subspecies
      or group of subspecies, e.g., Mallard (Mexican). Rolled-up to
      species level.
    • Intergrade: Hybrid between two ISSF (subspecies or subspecies
      groups), e.g., Mallard (Mexican intergrade). Rolled-up to species level.
    • Form: Miscellaneous other taxa, including recently-described
      species yet to be accepted or distinctive forms that are not
      universally accepted (Red-tailed Hawk (Northern), Upland Goose
      (Bar-breasted)). If the checklist contains multiple taxa
      corresponding to the same species, the lower level taxa are rolled
      up, otherwise these records are left as is.
    • Spuh: Genus or identification at broad level – e.g., duck sp.,
      dabbling duck sp. Dropped by auk_rollup().
    • Slash: Identification to Species-pair (e.g., American Black
      Duck/Mallard). Dropped by auk_rollup().
    • Hybrid: Hybrid between two species, e.g., American Black Duck x
      Mallard (hybrid). Dropped by auk_rollup().
    • Domestic: Distinctly-plumaged domesticated varieties that may be
      free-flying (these do not count on personal lists) e.g., Mallard
      (Domestic type). Dropped by auk_rollup().

    auk contains the full eBird taxonomy as a data frame.

    glimpse(ebird_taxonomy) #> Observations: 15,966 #> Variables: 8 #> $ taxon_order 3.0, 5.0, 6.0, 7.0, 13.0, 14.0, 17.0, 32.0, 33... #> $ category "species", "species", "slash", "species", "spe... #> $ species_code "ostric2", "ostric3", "y00934", "grerhe1", "le... #> $ common_name "Common Ostrich", "Somali Ostrich", "Common/So... #> $ scientific_name "Struthio camelus", "Struthio molybdophanes", ... #> $ order "Struthioniformes", "Struthioniformes", "Strut... #> $ family "Struthionidae (Ostriches)", "Struthionidae (O... #> $ report_as NA, NA, NA, NA, NA, "lesrhe2", "lesrhe2", NA, ...

    We can use one of the example datasets in the package to explore what
    auk_rollup() does.

    # get the path to the example data included in the package f <- system.file("extdata/ebd-rollup-ex.txt", package = "auk") # read in data without rolling up ebd_wo_ru <- read_ebd(f, rollup = FALSE) # rollup ebd_w_ru <- auk_rollup(ebd_wo_ru) # all taxa not identifiable to species are dropped unique(ebd_wo_ru$category) #> [1] "domestic" "form" "hybrid" "intergrade" "slash" #> [6] "spuh" "issf" "species" unique(ebd_w_ru$category) #> [1] "species" # yellow-rump warbler subspecies rollup # without rollup, there are three observations ebd_wo_ru %>% filter(common_name == "Yellow-rumped Warbler") %>% select(checklist_id, category, common_name, subspecies_common_name, observation_count) #> # A tibble: 5 x 5 #> checklist_id category common_name subspecies_common_… observation_cou… #> #> 1 S45216549 intergra… Yellow-rump… Yellow-rumped Warb… 3 #> 2 S45115209 intergra… Yellow-rump… Yellow-rumped Warb… 1 #> 3 S33053360 issf Yellow-rump… Yellow-rumped Warb… 2 #> 4 S33053360 issf Yellow-rump… Yellow-rumped Warb… 1 #> 5 S33053360 species Yellow-rump… 2 # with rollup, they have been combined ebd_w_ru %>% filter(common_name == "Yellow-rumped Warbler") %>% select(checklist_id, category, common_name, observation_count) #> # A tibble: 3 x 4 #> checklist_id category common_name observation_count #> #> 1 S45216549 species Yellow-rumped Warbler 3 #> 2 S45115209 species Yellow-rumped Warbler 1 #> 3 S33053360 species Yellow-rumped Warbler 5 Group checklists

    eBird observers birding together can share checklists resulting in group
    checklists. In the simplest case, all observers will have seen the same
    set of species; however, observers can also add or remove species from
    their checklist. In the EBD, group checklists result in duplicate
    records, one for each observer. auk_unique() (called by default by
    read_ebd()) de-duplicates the EBD, resulting in one record for each
    species on each group checklist.
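    If you imported the raw data with unique = FALSE, as in the ebd_raw
    example above, you can apply these steps explicitly afterwards; a minimal
    sketch:

    # de-duplicate group checklists on the raw import
    ebd_unique <- auk_unique(ebd_raw)

    # and roll the taxonomy up to species, mirroring read_ebd()'s defaults
    ebd_species <- auk_rollup(ebd_unique)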

    5. Zero-filling

    So far we’ve been working with presence-only data; however, many
    applications of the eBird data require presence-absence information.
    Although observers only explicitly record presence, they have the option
    of designating their checklists as “complete”, meaning they are
    reporting all the species they saw or heard. With complete checklists,
    any species not reported can be taken to have an implicit count of zero.
    Therefore, by focusing on complete checklists, we can use the sampling
    event data to “zero-fill” the EBD producing presence-absence data. This
    is why it’s important to filter both the EBD and the sampling event data
    at the same time; we need to ensure that the EBD observations are drawn
    from the population of checklists defined by the sampling event data.

    Given an EBD file or data frame, and corresponding sampling event data,
    the function auk_zerofill() produces zero-filled, presence-absence
    data.

    zf <- auk_zerofill(f_out_ebd, f_out_sampling) zf #> Zero-filled EBD: 12,938 unique checklists, for 2 species. zf$observations #> # A tibble: 25,876 x 4 #> checklist_id scientific_name observation_count species_observed #> #> 1 G1003224 Catharus bicknelli 0 FALSE #> 2 G1003224 Catharus ustulatus X TRUE #> 3 G1022603 Catharus bicknelli 0 FALSE #> 4 G1022603 Catharus ustulatus 0 FALSE #> 5 G1054429 Catharus bicknelli 3 TRUE #> 6 G1054429 Catharus ustulatus X TRUE #> 7 G1054430 Catharus bicknelli 0 FALSE #> 8 G1054430 Catharus ustulatus X TRUE #> 9 G1054431 Catharus bicknelli 1 TRUE #> 10 G1054431 Catharus ustulatus X TRUE #> # ... with 25,866 more rows zf$sampling_events #> # A tibble: 12,938 x 29 #> checklist_id last_edited_date country country_code state state_code #> #> 1 S10836735 2014-05-07 19:1… United… US New … US-NH #> 2 S10936397 2014-05-07 19:0… United… US New … US-NH #> 3 S10995603 2014-05-07 19:0… United… US New … US-NH #> 4 S11005948 2012-06-19 12:4… United… US New … US-NH #> 5 S11013341 2012-06-20 18:2… United… US New … US-NH #> 6 S15136517 2013-09-10 08:1… United… US New … US-NH #> 7 S11012701 2017-06-12 06:3… United… US New … US-NH #> 8 S10995655 2012-06-17 19:5… United… US New … US-NH #> 9 S15295339 2014-05-07 18:5… United… US New … US-NH #> 10 S10932312 2014-05-07 18:4… United… US New … US-NH #> # ... with 12,928 more rows, and 23 more variables: county , #> # county_code , iba_code , bcr_code , usfws_code , #> # atlas_block , locality_id , locality_type , #> # latitude , longitude , observation_date , #> # time_observations_started , observer_id , #> # sampling_event_identifier , protocol_type , #> # protocol_code , project_code , duration_minutes , #> # effort_distance_km , effort_area_ha , #> # number_observers , all_species_reported , #> # group_identifier

    The resulting auk_zerofill object is a list of two data frames:
    observations stores the species’ counts for each checklist and
    sampling_events stores the checklists. The checklist_id field can be
    used to combine the files together manually, or you can use the
    collapse_zerofill() function.
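    Joining them manually is just an inner join on that field (a minimal
    sketch with dplyr); collapse_zerofill(), shown next, does the equivalent
    for you:

    library(dplyr)

    # stitch the checklist-level and species-level tables back together
    zf_flat <- inner_join(zf$sampling_events, zf$observations, by = "checklist_id")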

    collapse_zerofill(zf) #> # A tibble: 25,876 x 32 #> checklist_id last_edited_date country country_code state state_code #> #> 1 S10836735 2014-05-07 19:1… United… US New … US-NH #> 2 S10836735 2014-05-07 19:1… United… US New … US-NH #> 3 S10936397 2014-05-07 19:0… United… US New … US-NH #> 4 S10936397 2014-05-07 19:0… United… US New … US-NH #> 5 S10995603 2014-05-07 19:0… United… US New … US-NH #> 6 S10995603 2014-05-07 19:0… United… US New … US-NH #> 7 S11005948 2012-06-19 12:4… United… US New … US-NH #> 8 S11005948 2012-06-19 12:4… United… US New … US-NH #> 9 S11013341 2012-06-20 18:2… United… US New … US-NH #> 10 S11013341 2012-06-20 18:2… United… US New … US-NH #> # ... with 25,866 more rows, and 26 more variables: county , #> # county_code , iba_code , bcr_code , usfws_code , #> # atlas_block , locality_id , locality_type , #> # latitude , longitude , observation_date , #> # time_observations_started , observer_id , #> # sampling_event_identifier , protocol_type , #> # protocol_code , project_code , duration_minutes , #> # effort_distance_km , effort_area_ha , #> # number_observers , all_species_reported , #> # group_identifier , scientific_name , #> # observation_count , species_observed

    This zero-filled dataset is now suitable for applications such as
    species distribution modeling.

    Applications of eBird data

    Now that we’ve imported some eBird data into R, it’s time for the fun
    stuff! Whether we’re making maps or fitting models, most of the
    application of eBird data will occur outside of auk; however, I’ll
    include a couple of simple examples here just to give you a sense of
    what’s possible.

    Making an occurrence map

    One of the most obvious things to do with the species occurrence data is
    to make a map! Here I’ll compare the distributions of the two thrush
    species.

    # convert ebd to spatial object ebd_sf <- ebd %>% select(common_name, latitude, longitude) %>% st_as_sf(coords = c("longitude", "latitude"), crs = 4326) # get state boundaries using rnaturalearth package states <- ne_states(iso_a2 = c("US", "CA"), returnclass = "sf") nh <- filter(states, postal == "NH") %>% st_geometry() # map par(mar = c(0, 0, 0, 0), bg = "skyblue") # set plot extent plot(nh, col = NA) # add state boundaries plot(states %>% st_geometry(), col = "grey40", border = "white", add = TRUE) plot(nh, col = "grey20", border = "white", add = TRUE) # ebird data plot(ebd_sf %>% filter(common_name == "Swainson's Thrush"), col = "#4daf4a99", pch = 19, cex = 0.75, add = TRUE) plot(ebd_sf %>% filter(common_name == "Bicknell's Thrush") %>% st_geometry(), col = "#377eb899", pch = 19, cex = 0.75, add = TRUE)

    Bicknell’s Thrush is a high elevation breeder and, as expected,
    observations of this species are restricted to the White Mountains in
    the north-central part of the state. In contrast, Swainson’s Thrush
    occurs more widely throughout the state.

    Mapping frequency of occurrence

    The above map doesn’t account for effort. For example, is a higher
    density of observations indicative of a biological pattern or of more
    birders visiting an area? We can address this by using the zero-filled
    data to produce maps of frequency of observation on checklists. To do
    this, I’ll start by converting the eBird data into a spatial object in
    sf format and projecting to an equal area coordinate reference system.

    # pres-abs data by species
    checklists_sf <- collapse_zerofill(zf) %>%
      st_as_sf(coords = c("longitude", "latitude"), crs = 4326) %>%
      # project to equal area
      st_transform(crs = "+proj=laea +lat_0=44 +lon_0=-71.70") %>%
      inner_join(ebird_taxonomy %>%
                   select(species_code, scientific_name, common_name),
                 by = "scientific_name") %>%
      select(species_code, common_name, scientific_name, species_observed)

    Next, I’ll create a raster template from this spatial object consisting
    of a regular 10km grid of cells. Then I use rasterize() to calculate
    the observation frequency within each cell.

    r_template <- raster(as(checklists_sf, "Spatial"))
    res(r_template) <- 10000

    species <- unique(checklists_sf$species_code)
    r_freq <- map(species,
                  ~ filter(checklists_sf, species_code == .) %>%
                    rasterize(r_template, field = "species_observed", fun = mean)) %>%
      stack() %>%
      setNames(species)

    Finally, let’s visualize these data to compare the spatial patterns
    between species.

    par(mar = c(3, 1, 1, 1), mfrow = c(1, 2)) for (sp in species) { common <- filter(ebird_taxonomy, species_code == sp) %>% pull(common_name) # set plot extent plot(nh %>% st_transform(crs = projection(r_freq)), main = common) # land s <- states %>% st_transform(crs = projection(r_freq)) %>% st_geometry() plot(s, col = "grey60", border = NA, add = TRUE) # frequency data plot(r_freq[[sp]], breaks = seq(0, 1, length.out = 101), col = viridis(100), legend = FALSE, add = TRUE) # state lines plot(s, col = NA, border = "white", add = TRUE) } # legend par(mfrow = c(1, 1), mar = c(0, 0, 0, 0), new = TRUE) fields::image.plot(zlim = c(0, 1), legend.only = TRUE, col = viridis(100), smallplot = c(0.25, 0.75, 0.05, 0.08), horizontal = TRUE, axis.args = list(at = seq(0, 1, by = 0.25), labels = c(0, 0.25, 0.5, 0.75, 1), fg = NA, col.axis = "black", lwd.ticks = 0, line = -1), legend.args = list(text = "Observation Frequency", side = 3, col = "black"))

    Swainson’s Thrush is more widespread and more frequently seen, while
    Bicknell’s Thrush is constrained to the White Mountains; both species
    are uncommon in southern New Hampshire.

    auk vs. rebird

    Those interested in eBird data may also want to consider
    rebird, an R package that
    provides an interface to the eBird APIs. The functions in rebird are
    mostly limited to accessing recent (i.e. within the last 30 days)
    observations, although ebirdfreq() does
    provide historical frequency of observation data. In contrast, auk
    gives access to the full set of ~ 500 million eBird observations. For
    most ecological applications, users will require auk; however, for
    some use cases, e.g. building tools for birders, rebird provides a
    quick and easy way to access data. rebird is part of the species
    occurrence data (spocc) suite.
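    For a quick taste of rebird (a sketch from the package documentation
    rather than from this post; the exact helpers and location codes may
    vary between rebird versions, so treat the call below as illustrative):

    library(rebird)

    # historical frequency of observation for New Hampshire checklists,
    # the kind of data ebirdfreq() mentioned above returns
    nh_freq <- ebirdfreq(loctype = "states", loc = "US-NH")
    head(nh_freq)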

    Users working with species occurrence data may be interested in
    scrubr, for cleaning biological
    occurrence records, and the rOpenSci taxonomy
    suite.

    Acknowledgements

    This package is based on AWK scripts provided as part of the eBird Data
    Workshop given by Wesley Hochachka, Daniel Fink, Tom Auer, and Frank La
    Sorte at the 2016 NAOC on August 15, 2016.

    auk benefited significantly from the rOpenSci
    review process, including helpful suggestions from Auriel Fournier and
    Edmund Hart.


    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Mapping the stock market using self-organizing maps

    Mon, 08/06/2018 - 21:24

    (This article was first published on Data based investing, and kindly contributed to R-bloggers)

    Self-organizing maps are an unsupervised learning approach for visualizing multi-dimensional data in a two-dimensional plane. They are great for clustering and for finding correlations in the data. In this post we apply self-organizing maps to historical US stock market data to find interesting correlations and clusters. We’ll use data from Shiller, Goyal and the BLS to calculate historical valuation levels, interest rates, inflation rates, unemployment rates and future ten-year total real returns for the years 1948 to 2008.



    You can see a clear correlation between the different valuation measures, and that low valuations have led to high returns. There’s a slight negative correlation between the valuation measures and unemployment, i.e. valuations have been higher when unemployment has been lower. Charlie Bilello has a great article on the subject. There’s also a positive correlation between unemployment and rates, which means that rates have typically been higher when unemployment has been higher.

    Next, let’s look at clusters formed using hierarchical clustering. We’ll form four clusters on the same plane as used in the above analysis. Let’s look at the results: the balls inside each hexagon correspond to individual months. We are currently in the green cluster, which has typically led to low returns. Why have low unemployment, low rates and low inflation led to low returns; aren’t these things good for the stock market? I see two possible causes: these conditions tend to revert back to their mean (which means worsening macroeconomic conditions), and investors tend to extrapolate past returns into the future (see a great tweet on the subject by Michael Batnick). The second cause leads to high valuations, which are present in the green cluster.

    Which cluster is the best place to be in? I’d say the gray one, but the data seems to support the blue one as well. The good thing is that there are other countries that are currently in these two clusters. Even though I recommend looking at valuations alone rather than macroeconomic indicators, a good place worth checking for all that macro stuff is tradingeconomics.com. The R code used in the analysis is available here.
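    The linked code is the authoritative version; purely as an illustration of the general workflow (with made-up data and my own variable names), fitting a self-organizing map and cutting it into four clusters with the kohonen package looks roughly like this:

    library(kohonen)

    # made-up stand-in for the scaled valuation / macro / return variables
    set.seed(1)
    macro_data <- scale(matrix(rnorm(720 * 6), ncol = 6,
                               dimnames = list(NULL, c("CAPE", "PB", "PS", "rate",
                                                       "inflation", "unemployment"))))

    # fit a 10x10 hexagonal self-organizing map
    som_fit <- som(macro_data,
                   grid = somgrid(xdim = 10, ydim = 10, topo = "hexagonal"),
                   rlen = 500)

    # hierarchical clustering of the codebook vectors into four clusters
    clusters <- cutree(hclust(dist(getCodes(som_fit))), k = 4)
    plot(som_fit, type = "mapping", bgcol = rainbow(4)[clusters])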

    To leave a comment for the author, please follow the link and comment on their blog: Data based investing.

    SQL Server 2017 Machine Learning services with R book

    Mon, 08/06/2018 - 19:33

    (This article was first published on R – TomazTsql, and kindly contributed to R-bloggers)

    Yes, I am finally blogging this.

    This blog post is slightly different, since it brings you the title of the book that my dear friend Julie Koesmarno (blog | twitter) and I have written, which was published in March 2018 by Packt Publishing.

    The book covers the R Machine Learning Services available in Microsoft SQL Server 2017 (and 2016): how to start, handle and operationalise R code, how to deploy and manage your predictive models, and how to bring the complete solution to your enterprise environment. It explores CI/CD, dives into examples built on the RevoScaleR algorithms, and brings data science closer to database administrators and data analysts.

    More specifically, the content of the book is as follows (as noted in the table of contents):

    1: Introduction to R and SQL Server
    2: Overview of Microsoft Machine Learning Server and SQL Server
    3: Managing Machine Learning Services for SQL Server 2017 and R
    4: Data Exploration and Data Visualization
    5: RevoScaleR Package
    6: Predictive Modeling
    7: Operationalizing R Code
    8: Deploying, Managing, and Monitoring Database Solutions containing R Code
    9: Machine Learning Services with R for DBAs
    10: R and SQL Server 2016/2017 Features Extended

    My dear friend and co-author Julie, a long-time SQL Server community member and dedicated tech and data science lover, and I had a great time working on this book, sharing the code and the ideas and collaborating on what became a great end product. Thank you, Julie.

    I would also like to thank all the people involved for their help, expertise and inspiration: the people at Packt Publishing, Hamish Watson, and a special thanks to you, Marlon Ribunal (blog | twitter), for your reviews and comments during the writing, and to you, dear David Wentzel (website | linkedin), for your chapter comments and your review.

    Finally, thank you to the Microsoft SQL Server community, SQL friends and SQL family, the R community and R Consortium, and the Revolution Analytics community, gathered and led by David Smith (twitter). Not only the concept of R in Microsoft SQL Server but also the intersection of technologies has brought together so many beautiful people, minds and ideas that will, in time, help many businesses and industries worldwide.

    Much appreciated!

    The book is available on Amazon, or you can get your copy from Packt.

    Happy reading and coding!


    To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

    Data Journalism & Interactive Visualization (Transcript)

    Mon, 08/06/2018 - 19:23

    (This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

    Here is a link to the podcast.

    Introducing Amber Thomas

    Hugo: Hi there, Amber, and welcome to DataFramed.

    Amber: Hey, hi. Thanks for having me.

    Hugo: Such a pleasure to have you on the show. I’m really excited to hear about your work at The Pudding today, data journalism, data storytelling, interactive visualization, all of these things, and I’d love to jump in and hear a bit about The Pudding and what you do there.

    Amber: Yeah, definitely. The Pudding is this amorphous internet thing. We refer to ourselves as a collection of data-driven visual essays. Other people have referred to us as an online magazine, or some people call us a blog, which I feel like we’re not quite a blog, and we’re not quite a newspaper. We’re this weird thing in the middle, but basically what we do is we go out into the world and ask all sorts of questions. Generally, questions about things that people are already talking about. We joke that a lot of our stories are things that people would argue about over beers. So, a friendly argument about something, and we try to add data to that discussion or that argument, and then tell the story in a fun, interactive, visual way on the web.

    Hugo: That’s cool, and I love that you describe The Pudding as something amorphous and a weird thing in the middle, because essentially from my perspective, The Pudding is something that has arisen relatively recently as a result of many of the new technologies we have, right?

    Amber: Yeah, absolutely. I mean, it wouldn’t be possible without a lot of the interactive tools that we use and the programming languages and all of the data that we have access to and the ability to analyze all of the data. None of the stories that we tell would be possible without all of that computing power and those languages. So, it is a new thing, and I think it started January of 2017. So, it’s about a year and a half old.

    Hugo: And I do think this idea of interactive visualization, particularly with platforms and websites such as The Pudding, it’s very interesting because it changes the whole paradigm of storytelling and journalism in a sense, right? Instead of having thousands of words that people need to read and navigate in order to extract meaning, letting people interact with visuals and figures and that type of stuff allows them to move through that space of data themselves, and I think this is something we’ll speak more about later in this talk.

    Amber: Yeah, absolutely. And we try to make our stories as visually driven as possible, so we actually will often write a story and then go back and cut out a bunch of the prose and make sure that the visuals are telling the story. And of course, there’s still prose to give you background information and some other insight, but we try to let the interactives and the visuals be the actual driving force of the story.

    Hugo: That’s actually something I’ve admired in much of The Pudding’s work is that prose will come up interspersed through plots or when you hover over something to add more information, as opposed to directing the information flow.

    Amber: Absolutely, yeah. It’s like bonus content rather than the content itself.

    What do you do at The Pudding?

    Hugo: Great. So, what do you do at The Pudding?

    Amber: So, just as The Pudding is an amorphous thing, my role is like that as well. My title is journalist engineer, which is a weird collaboration or amalgamation of different roles that I do there, and this is the case for most of the people that work at The Pudding, but basically, I have the ability to do all of the things for a story. So, everything from coming up with the idea and collecting data, analyzing data, designing what the visuals should look like, writing the story itself, and programming the interactive visualization for the web. So, that’s why the title is a little weird. There’s the data analyst part and the journalist part and the front-end engineer part and the designer part and all of these pieces. So, that’s what I do there. All of the things.

    Hugo: And actually, as you know, your colleague and founder of The Pudding, Matt Daniels, has a great medium piece called ‘The Journalist Engineer,’ which elucidates a lot of these things, and we’ll link to that in the show notes.

    Amber: Yeah, definitely. He actually wrote that before we landed on that title for our roles, but we just kept coming back to the journalist engineer. We kicked around a bunch of other titles and just decided that one worked best.

    Hugo: It makes sense, but as you said, there are so many moving parts in this type of position. You mentioned web developer, data analyst, designer, journalist, and I’m just wondering what was your trajectory to pick up all these moving parts so that you could work in this role? How did you get into it?

    Amber: So, I took a very winding path to get to where I am now. By training, I am not any of those things. By training, I’m actually a marine biologist. I went to college for marine biology and chemistry. I went to grad school and have a Master’s Degree in Marine Sciences. I worked as a research scientist for several years. So, my background is in academic science. In terms of picking up all of those skills, I actually picked up some of them in my work as a scientist. I started learning to do data analysis and statistics in the R programming language when I was in grad school, and I used it on and off throughout my work as a researcher. It really depended on who I was working with and what tools they were using. But data analysis and experimental design and all of those things I still do today, that all came up in my work as a research scientist or as an academic.

    The communication of complex things also came up a lot. When you’re a scientist, it’s really important that you can communicate the results of your research, whether that’s in an academic setting or in a conference talk. When I was working as a research scientist, I was at an aquarium, and so I was studying the animals that lived there and I often had to communicate my research to children. I got, I think, pretty decent at explaining really complicated topics to anybody who was listening to me, which comes in really, really handy when that’s your job. The only difference now is that I’m writing it down instead of talking about it.

    Yeah, so I did a lot of that. I worked briefly … I had started this small science communication service where I worked with scientists to translate their research into something more accessible to the general public, and then I worked with an animator to animate that story into something that the general public could consume in a more exciting way. All of that came into play in transitioning into this current role. The rest of it I have learned on the fly. I started learning JavaScript and D3 programming specifically about a year and a half ago because I had an idea for a visualization that I wanted to use that I couldn’t, for the life of me, figure out how to do in R, and so I just started teaching myself D3, and when you do that, you learn HTML and CSS on the way. I knew a little bit of that, but I have learned a bunch more as I’ve continued working in D3. A lot of it I just have picked up as I’ve needed it.

    Hugo: Yeah and it looks as though D3 is part of the bread and butter of how you work at The Pudding. Is that safe to say?

    Amber: Yes, definitely. That was the steepest learning curve when I started working with The Pudding ’cause I had only been using D3 for a month or two when I reached out to them about freelancing the first time, and they were really receptive to helping me out when I got stuck with D3, and I was really motivated to learn it, and now I use it all the time. And I still spend a lot of time on Stack Overflow.

    Hugo: Yeah. So, I’m glad that you’ve said that. I’m glad that you’ve said it was a steep learning curve, because D3, I think, is incredibly powerful, but it’s also well renowned for being pretty tough to start off with, right?

    Amber: Oh, absolutely. Because I’m an R user, I was used to ggplot where you can make a bar graph or something in two lines of code, and in D3, that same bar graph would take 50 lines of code, or maybe not quite that many, but it was way more than two. And I think that was because I had this disconnect in my mind that D3 was a charting library in the same way that ggplot is a charting library, when really D3 is this … it can do a bunch of other stuff. It can make charts, but it doesn’t have to make charts. It gives you the ability to control every pixel on your screen, which is really cool, but that means that there’s a lot of information that has to go into telling it where every pixel belongs and how you want it to be presented.
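    (As a toy illustration of that contrast, not from the interview: a basic bar chart in ggplot2 really is just a couple of lines, assuming a hypothetical data frame df with columns category and value.)

    library(ggplot2)
    # df is a hypothetical data frame with columns 'category' and 'value'
    ggplot(df, aes(x = category, y = value)) + geom_col()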

    Why is D3 code used most commonly in your line of work?

    Hugo: So, when it takes so much work to write the D3 code, why is it then the state-of-the-art or the most powerful, or why is it the most standard in your line of work?

    Amber: I think because it’s so powerful, especially when it comes to interactive things. You can do so much with D3, and again, because it’s not really … it doesn’t have default charts. There’s no bar chart function that you can just feed stuff into. If you want to add bars, you have to program it to add rectangles that are bound to the data that you’ve fed into it and all of these things. Because of that, it’s incredibly customizable. We joke that all of our data vis things are bespoke. They’re all custom-written code. And sometimes that’s overkill. So, we do, on occasion for static graphics, make stuff in R, and then sometimes we’ll clean it up in Illustrator or other static design things, and then you’ll have just a JPG or a PNG, an image on your website, but sometimes it works well to make it in D3 and have everything just natively on the website instead of an image embedded there.

    Hugo: Yep, exactly, and it was developed actually at the New York Times by Mike Bostock?

    Amber: Mm-hmm (affirmative), yeah, and I think it was only in 2011 that it came out, so it’s still pretty recent, even though it’s grown like crazy in popularity.

    What projects have you worked on at The Pudding?

    Hugo: Fantastic. So, I’m so excited about getting in and hearing about some of your projects and stories that you’ve worked on at The Pudding, so maybe you can start off by telling me about one or two of them.

    Amber: Sure. So, I’ve worked on quite a few stories now at The Pudding, and I did mention that my job entails doing all of the parts of a story, but for any given story, I’m not necessarily doing all of the pieces. A lot of our work is collaborative, where somebody does one piece of it and somebody else does another piece. This first example I worked on parts of it, but this was a huge collaborative effort, so the story is called ‘How Far is Too Far?’ and it basically focuses on how long it takes people to drive from where they are in the US to their nearest abortion clinic. This story was very sensitive, and it’s one of our stories that has a little bit more of a political tone to it, but it’s a story that we thought was really important to tell.

    Hugo: Yeah. And particularly at a time when these things are changing and legislators are altering this type of landscape that may affect a lot of people on the ground, right?

    Amber: Absolutely. That’s where the general idea of this came from. There’s these laws that specifically target abortion clinics, and in the state of Texas, these laws were actually causing a bunch of clinics to close. There was a group that brought this to the Circuit Court of Appeals in Texas, saying that when these clinics closed patients that need to access these clinics now have to drive over 150 miles to their nearest clinics, and one of the judges replied, "Do you know how long that takes in Texas at 75 miles an hour?" That really got us thinking about … well, I mean that’s not very good feedback to that problem, but that does bring up an interesting point that the thing that really affects people is how long it actually takes to drive from where they are to where these clinics are, and a lot of other research projects we had seen around this focused on the distance as the crow flies, when the driving distance is what really affects people.

    So, we broke up the country into these hexagons and looked at how long it would take you to drive to the nearest clinic from the center of each of these hexagons and from the center of any city with a population above 50,000.

    Hugo: I recall that in the first figure, there’s a slider which allows me or the user to have a look at the distribution of cities and places geographically where people can reach an abortion clinic in under a certain amount of time and change that and see how that changes with respect to the time that I’m wanting to know about.

    Amber: Yep. Exactly. We wanted to let people really explore the data themselves. When we talk about something being a really long trip … I’m originally from Connecticut where anything over an hour feels so long, but I know people from other parts of the country where a five hour trip is what they consider long. So, we wanted to allow people to really experience this data with whatever frame of far distance they have. Yeah, that first chart really lets you restrict the data and change how long of a drive you want to see.

    Hugo: And how does the story evolve after that?

    Amber: The next thing we wanted to really drive home is that we can say that you can reach a clinic from wherever you are in, say, a two hour round trip drive, but the thing is that not all clinics treat all patients. Depending on how far along they are in the pregnancy, they may or may not treat a specific patient. So, we made another graphic that is, again, looking at the whole country with all of these little hexagons, and we showed how the round trip time to the nearest clinic changes the further along in a pregnancy you get. The further along you get, the fewer and fewer clinics will actually be able to work with you.

    Hugo: I won’t give out too many spoilers, but there are quite dramatic differences there.

    Amber: Yeah. Absolutely. That was something that was really surprising for us and that’s why we wanted to include this. When we started exploring the data, we didn’t expect there to be quite as drastic differences as there were. For this graphic there’s, again, a slider that allows you to change how far along in a pregnancy a patient is. We have it auto-playing, so if you just scroll through, you’ll see that map animate.

    Hugo: What’s your impression of what this interactivity gives to the reader? Does it empower the reader? What’s your take on this?

    Amber: I think so. I think it really allows the reader to make the story their own. Again, if they don’t really think an hour round-trip drive is far, they can decide how far they want to look at. Like, which distances they are most interested in looking at and how far into a pregnancy they are interested in getting this story from. They can change the lens of the story a little bit to be a little bit personalized. We do have a table in the story as well that shows you the cities with the longest round-trip travel times, and it geolocates you. It’ll give you the cities with the longest travel times, but it’ll also show you your city for reference. Again, we’re really trying to make these stories personal to the reader so that it drives home the point a little bit more. Adding these interactions and some of these subtle elements really helps to make a story personal for whoever’s reading it.

    Greetings from Mars

    Hugo: That’s fantastic, and that actually provides a really nice segue into another piece of yours that I think is wonderful, which is the ‘Greetings from Mars,’ which actually locates me or whoever’s looking at it to compare Martian environments to where I am, right?

    Amber: Mm-hmm (affirmative). Absolutely. This one is a story where I actually did do all the pieces. So, the data comes from the Curiosity Rover, which is of course on Mars, and it collects a bunch of data, but some of the data it collects is about the weather and its current environment, and I was really excited to find this data and I didn’t know what to do with it. Then I started thinking about what if the Mars rover didn’t understand what postcards were for and thought that postcards were just when you go on vacation and you say, "The weather is great, and I wish you were here." And the Curiosity Rover was telling this story through postcards because it just wanted to tell you all about the weather using postcards, ’cause of course, that’s what postcards are for. This story, when you scroll through, it is literally postcards that flip over as you scroll, and Curiosity tells you all about the weather on Mars and it uses the weather where you currently are as a point of comparison.

    When I’m looking at it … right now, I’m in Seattle, and it says, "It looks like today in Seattle, the weather is partly cloudy throughout the day," and it gives you a range of temperatures, so yeah, it walks you through it. This one updates every day, so the data is the most current weather on Mars and the most current weather in your area as well. This one is very personal for the reader.

    Hugo: For the listener I’ll remind what I said to you earlier. I thought it was great, but it said, "It looks like Today you’re in Hoboken." I’m very close to Hoboken, but it gives me the temperatures in Fahrenheit, and I said as an Australian I’d love them in Celsius, and your response was, "Of course if I was in Australia, I would receive them in Celsius, but because I’m in the US, I don’t."

    Amber: Right. It’s actually pretty funny. Shortly after we published the story, somebody gave us similar feedback that they wish the story was in Celsius. I asked where they were, and I believe they were in Sweden, so I was confused as to why it wasn’t working. It turned out that they had their location services disabled, so we had a default location for this story as New York. So, it was giving them information as if they were in New York, which was of course not what they were expecting, but it was functioning the way we wanted it to. We did our best to make it as personal for as many people as possible, but sometimes we can’t control for everything like that.

    Hugo: You said something that I found very interesting that you did all the things on this piece, as opposed to working in a collaborative environment. I’m wondering if there’s anything very different when you’re doing it all yourself about the process. I can imagine it’s more frustrating.

    Amber: Yeah. I think it’s interesting when I’m working with stuff on my own. And when I say on my own, I still do rely on the team a lot, and I did have an editor for this piece to bounce some ideas off of and things like that. I was getting feedback throughout the process, but I find that I think through things out loud. When I’m working with somebody, sometimes that process is smoother or faster because I have somebody else to bounce ideas off, and we work together on figuring out the story.

    When I work on projects by myself, sometimes I get really caught up in one part of it. I will be content to analyze data for weeks and never move forward with the rest of the story, or I get really excited about one piece of it and then it makes sense in my head, but as soon as I tell it to someone else, they’re like, "Oh, that doesn’t really make sense," and I have to take a step back and start over. I’ve gotten better about it, of at least being aware that I do that and looking for more feedback throughout the process when I work on projects by myself now. But yeah, it’s a little bit different ’cause you just put your head down and move forward on the story and hope that it ends up in a good spot.

    Hugo: For sure. But you’re absolutely right. I have this all the time where I think something I have done is a great idea, and as soon as I start to explain it to someone else, even before they tell me it isn’t, I’m like, "Oh, wait a second. That doesn’t sound like it did in my head."

    Amber: Yeah. Exactly. That’s actually what happened with the Mars story. My original thought once I found out there was all this weather data, I was like, "Oh, I’m gonna write ‘Welcome to Mars,’ and it’s gonna be a welcome packet for people who just moved there," and I was super excited and I mocked up this whole thing, and as soon as I told it to the team, they were like, "But, Amber, if people were moving to Mars, don’t you think they would know what the weather was like before they left for Mars?" I was like, "Oh, yep. You’re 100% right." It was just something I hadn’t thought of because I was so excited about the data and the idea. It needed a reframing.

    Hugo: For sure. That’s true, unless it’s one of those mystery vacation packages, but I wouldn’t expect that to be … that would be a horrible mystery vacation.

    Amber: It would take so long to get there.

    Hugo: Seriously. I need to be back at work now. There’s one other piece that we discussed of yours that I think is so wonderful. It’s the makeup shades piece, so I thought maybe you could tell our listeners a little bit about this.

    Amber: Yeah. This actually was a story that came to us from a freelancer. So, we do work with a lot of freelancers at The Pudding, and this was one that I had stumbled upon on Twitter. This illustrator named Jason Li was trying to come up with a better color palette for skin tones for the characters that he was illustrating, and he had written about using the skin tones from emojis and gotten feedback from people on that. He decided to look into the shades of makeup that are being offered because makeup brands … they have a stake in their colors actually matching people’s skin tones. He wrote this little blog post about making a new color palette, and I read this and was like, "This is amazing. I want more of this."

    I reached out and asked if he would be interested in expanding the story for The Pudding, and he was really excited, and he brought on another person to help us with the story. For this one, basically the idea came down to this: these days, foundation shades and makeup shades are all coming out saying that they have these diverse shade ranges. Rihanna came out with this brand called Fenty Beauty last year, and it had 40 shades when it launched. It was this huge thing, and now a bunch of other brands are rushing to get up to 40 shades, and we wanted to see if all 40 shades are created equal across all of these brands. Just because you have 40 shades, are you actually making the colors diverse enough for the people that need them?

    It’s this exploration through the color shades that are offered through different brands here in the US and Japan, Nigeria, and India. Jason did most of the data collection and story writing, and I did a lot of the design of the graphics. Jason and I went back and forth a lot on how the graphics should be designed and shown to the reader to help illustrate our point as well as we could do it, and then I did all of the front-end programming for it.

    Hugo: Great. I won’t say too much about it, and we’ll include a link to it in the show notes, but I do think it’s quite wonderful how you’ve demonstrated the relative distributions of shades, where the highest density of shades are in certain products in certain countries, and there are some quite … ones that make sense post knowing it, but some quite surprising results in there with respect to countries. For example, in India, the result makes sense after knowing the fact, but it is also surprising at first sight, I think.

    Amber: Absolutely. That’s the feedback we’ve gotten the most from readers is that India was surprising, and it was super shocking for us, too. That’s always interesting when you’re still surprised by what you’re finding.

    Hugo: Absolutely. In particular, the way you show the frequencies and distributions is very, very nice visually. I won’t say anything more ’cause I want our listeners to go and check these out straightaway.

    Amber: Awesome. Well, thank you.

    The Process

    Hugo: Of course. I really wanna thank you before we move on for … this has been a difficult task to describe interactive data visualizations in words on a podcast, and I think you’ve done it incredibly well. Moving on, I’d like to know a bit about process because in data science and science in general, a common paradigm of the process is to start with data and let it lead you to the hypotheses and then questions. But in what you do, you start with a question, right?

    Amber: Mm-hmm (affirmative).

    Hugo: How does this different approach change things for you as a practitioner?

    Amber: I think it changes a lot of things, but primarily it changes where we find the data. Before I was working with The Pudding when I was just doing personal projects on my own, I would go and just find a data set and analyze it and just look for interesting things, but now basically the stories that we end up writing that end up being the most interesting are things that started with a question. Again, we’re trying to come up with these questions that people are already discussing. Questions that people are maybe having a friendly argument about, and we try to answer that using data, which usually, not always … a lot of the times means we need to collect the data ourselves or scrape it from some source. It’s usually not a dataset that’s already existing and on the internet and cleaned and ready to go.

    Hugo: Is there a challenge when you’ve got a correct or interesting question but the data, once you find it, doesn’t really give you the story that you were hoping for?

    Amber: Uh-huh. Yep. Yeah, that happens a lot. That actually happened with the very first story I proposed. I was really excited about this story and I thought it was gonna be great and I collected all this data that was … I had to scrape a website, and so it wasn’t a data set that already existed. Once I started analyzing it, it just wasn’t interesting, and sometimes we’re able to recover and we’ll end up answering a different question than the one we set out to answer, so this is where we fall back into line with what’s more typical. Once you have the data, you can start looking for other things because you might find surprising insights that you wanna talk about either in addition to or instead.

    But yeah, that first project I tried working on, every angle I could think of was not coming across the way I wanted it to. Sometimes that means we end up killing a story and we just don’t write it, which happens sometimes. I think of it a little bit as publishing negative results in science. It just doesn’t happen very often because basically all you’re saying is, "What we thought was gonna happen didn’t," and that’s the end. There’s of course huge debate about whether or not that should be the case and whether or not we should publish uninteresting things, but at least for The Pudding, we try not to publish things that we don’t think will be interesting.

    Hugo: Yeah, absolutely. Of course, in basic science research, there’s an argument that there should be more negative results being published.

    Amber: Of course. Exactly. That comes down to having people not repeat projects and experiments that other people already know aren’t gonna work and things like that. In science, it does make sense to do that sort of thing. We’ve talked about making blog posts about failed stories for that purpose so people are aware that, "Hey, we tried this, and it didn’t really work, but if somebody else wants to try it, go for it." We haven’t gotten around to that, but maybe if there’s public interest we can check that out as well.

    Hugo: Something that you mentioned at the start of our chat is that at The Pudding, you think about the types of questions and conversations you write about as the types of things you wanna talk about over beer, right?

    Amber: Mm-hmm (affirmative).

    Hugo: Yeah, I was gonna say I think starting with a question like that probably gives you a sense of direction and some sort of interest there.

    Amber: Absolutely. It really gives you a sense of focus. Like I said, sometimes I have the bad habit of when I find a new data set, I’m like, "I wanna explore all the things," and I could easily spend weeks and weeks just having fun and playing with the data set, but when you go in with some sort of vague purpose in mind, it helps to give you a sense of, "Okay, here’s where I should at least start," because those big data sets can also be really intimidating if you don’t know where to start. I’ve done that, too, where I’ve downloaded huge data sets and been excited, and then been like, "I don’t know what to do with all of this," and then it just sits on my computer and I don’t do anything with it. It definitely gives you a direction to go, and then you can branch off from there.

    Hugo: I think I’ve definitely been there with large data sets, and I think a lot of us probably have as well, so you’re in good company. As we’ve discussed earlier, this is a really new phenomenon, this interactive data vis stuff. It’s essentially a revolution in terms of how we can tell stories, how we can receive stories. The fact that the paradigm of storytelling and journalism in particular used to be from the printing press to the broadsheets and that you have static information passed to you from a newspaper. Now that we have web-based infrastructures, a lot of JavaScript and D3 which allows practitioners such as yourself and your colleagues to build this type of stuff, I think the future is an incredibly interesting space. I’m just wondering, in your mind, what type of future can we expect to see in terms of interactive data vis and the impact on journalism and storytelling in particular?

    Amber: That’s such a great question, and I think it can go in any number of ways. People are experimenting with all sorts of stuff in this space right now. They’re trying out different technologies: so, virtual reality and augmented reality are being brought into the data vis space, and we’re experimenting with the ways we present data. I mentioned a couple minutes ago that we occasionally kill stories, and that sounds like a bad thing, but we actually are set up so that we have the flexibility to do that. That allows us to experiment with things, which when you experiment, sometimes things are gonna fail, and I think we’re not the only people that are doing that. Lots of people are experimenting and trying to find new and exciting ways to present information.

    On the internet, people’s attention is the most valuable commodity, and everybody’s always looking for your attention if you’re a consumer of anything on the internet. It’s hard to break through the noise. I think that’s where these interactive graphics and some of this new experimentation and things like that are really being leveraged. I think it’s an exciting future. I have high hopes for it. I know that there’s some discussion within the community of where everything is headed, but from where I’m sitting, I’m looking forward to it. I think we’re in a good spot.

    Hugo: As am I, and I do think this idea of turning what essentially have been consumers as passive readers, essentially, into active storytellers to themselves is incredibly interesting. Getting people doing something and interacting as soon as possible. That’s something we strive for at DataCamp as well, really, is getting people coding as soon as possible and giving them automated but interactive feedback. We do have a philosophy here of getting people doing stuff with their hands, either on a keyboard or on a mouse or whatever it may be as soon as possible.

    Amber: Absolutely. It really brings them into the story, and it makes it their own. Like you said, that’s how DataCamp works as well. You feel like you’re actually gaining the experience and doing the thing.

    Does everyone have their own specialization at The Pudding?

    Hugo: I’ve got the strong sense that you and your colleagues are all generalists in terms of you do a lot of different things, but does everyone have their own specialization? How does this work?

    Amber: A little bit. We are a pretty small team. We very recently got up to six people, so when you have a small team and you’re trying to put out these stories and things like that, it really helps us to have people that can be generalists and do all the parts. But yeah, we do have things that we are better at. When I first started at The Pudding, data analysis was definitely my forte, and I was still learning D3 and I was making a bunch of mistakes and stuff failed all the time. But I was able to lean into my ability to do data analysis, and that’s the case for some of my other colleagues. Some of them are great with design, and we’re constantly tapping them for design input. Some are great with the programming side, so we’re constantly tapping them for help with fixing bugs and things like that. We do all have our strengths, and so we have people who are the go-to resource if you have an issue with X, Y, or Z, but we all have the ability to do all of the pieces if we need to.

    Hugo: I’m sure we have a bunch of listeners out there who would love to get involved in this type of work. What skills are needed to be successful at data journalism and visual storytelling and so on?

    Amber: There’s a lot of things. It sounds like I’m listing off a ton of technologies that you need to know, and I don’t know that that’s always the case. Our team uses a variety of tools for analysis. Like I said, I use R. One of my colleagues uses Python. One uses SQL, JavaScript. So, we use all sorts of stuff for analysis, so the tool itself doesn’t make as big of a difference.

    Hugo: And they may change in the future, right, as well?

    Amber: Totally. That’s actually what I was gonna say is one of the best things to have in this field is the ability to adapt and learn new things as you go, whether that’s a new language or a new statistical test because we’re not focused on the same topic every time, we’re constantly having to learn new methods of analysis. We’re just trying to stay right on the edge of things as much as possible. The ability to just learn new things and be ready to do that stuff as you go is really important, and I think on a similar note, having a beginner’s mindset is also really helpful.

    Basically what I mean by that is just our stories are aimed at the general public, and so when you’re writing a story you become a mini expert on a topic, and it’s really easy to slip into giving a bunch of jargon or making things not quite accessible. If you try to approach things as a beginner would, it really helps you to communicate things in an easy way and in a way that other people who are actually beginners can appreciate and can understand. I think being able to think about things like a beginner would is really helpful as well. If you are a beginner, you’re in good company because you have an exact beginner’s mindset.

    Hugo: I think that’s really cool, and it seems to me a lot of the topics you’ve worked on you may have been a beginner at the start in some sense. I don’t know how much you knew about the atmosphere of Mars beforehand, but I’m sure you learnt along the way, which allowed you, in that case, to have that beginner’s mindset from the start, essentially.

    Amber: Exactly. Yeah. I didn’t really know anything about the atmosphere of Mars. I knew very little. I knew it was cold, but I didn’t know how cold. I think you have to be willing to be a beginner all the time. That’s hard sometimes, but the more you practice it, the easier it gets to just … you’re always gonna kind of feel like you don’t know anything, but that’s okay because you can learn it and then you’ll know the things. There’s two other things I think are really important: one is the ability to give and receive feedback. I mentioned that my team works a lot, whether we’re working collaboratively or I’m working on a story on my own, I still get a lot of feedback from my team, and we have designated sessions where we just give feedback to one another. That also happens a lot online. People post things that they’re working on on Twitter, and I’ve had pretty good experiences with people giving feedback on the internet, but of course, people on the internet aren’t always very nice about things.

    Finding a way to give feedback to people that’s constructive and can help them become better, and then also being able to receive that feedback without feeling like they are criticizing you as a person is really important. That’s something that I am still working on all the time. I get feedback and I’m like, "Oh, no. I worked so hard on that," and then I’m like, "Oh, wait, no. They’re not criticizing me, they’re trying to make me better." That mind flip and that ability to work on that is really important.

    Hugo: For sure.

    Amber: The last thing is tied into all of this is just communication is so important, and I think that’s the case for so many fields, especially in the data world ’cause you’re often not working in a silo. You’re working in collaboration with other teams and sometimes with clients and sometimes with the general public, so being able to communicate what you have done and what you’re planning on doing is really, really important.

    Hugo: I agree completely, and I think communication is something that we all know is incredibly important, but you don’t necessarily see it in the job listings. You see like, "Know how to use Hadoop," or something like that, and communication isn’t something that’s necessarily stressed in terms of working data scientists having to communicate with managers or non-technical stakeholders and this type of stuff. It is really key to keep on reiterating this again and again.

    Amber: I think it really should be on more job listings and stuff, but I think it’s also fair to put that on a resume if you’re applying for jobs because I think everybody realizes that it is important, but I don’t know why it doesn’t end up on job listings.

    What is your favorite data science technique?

    Hugo: We haven’t spoken too much about specific techniques and methodologies. We have a bit, but I’m just wondering what one of your favorite data sciencey techniques or methodologies is.

    Amber: I don’t know that I have a favorite, and part of that just comes down to I’m really bad at picking favorites of anything. I’m very indecisive when it comes to stuff like that, but I think it’s also because we’re constantly using different methods of analysis. I don’t know that I have a favorite technique, but one thing that I do a lot that I absolutely love doing, so I’m gonna take a moment to preach it here because I think it makes sense; within the R environment, I use R Markdown a lot, which is very similar to Jupyter Notebooks if you’re a Python user, and basically, the way that I use them is I include my analysis, but in between chunks of code I also include notes about why I did something or where the information came from, or sometimes it’s like, "This was the Stack Overflow post that helped me figure this out." The reason I love doing that is ’cause it helped me a lot when I was learning how to do things so that I could go back and figure out what I was thinking and what my thought process was.

    But also, The Pudding has a client branch called Polygraph, and so we work with clients a lot. Those markdown documents are really great to send to clients, like, "Here’s an analysis of your data, and here’s how we worked through it," and they can read your inner thoughts and see what you’ve found before you make it all pretty and on the internet. I’ve found that that’s been really fantastic, and so I always make those documents and I have never once regretted it. I think that’s my favorite data sciencey thing that I do.
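    (A minimal sketch of the kind of R Markdown document described here, with made-up file names and contents, just to illustrate the prose-between-chunks structure:)

    ---
    title: "Analysis notes"
    output: html_document
    ---

    Notes on why this step exists, where the data came from, and a link to the
    Stack Overflow answer that helped.

    ```{r read-data}
    clinics <- read.csv("clinics.csv")  # hypothetical input file
    summary(clinics)
    ```

    Interpretation of the output above, before moving on to the next chunk.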

    Hugo: Fantastic. I think, once again, that speaks to a certain form of comprehensive communication as well, developing this document so that other people have access to it and you have access to it downstream as well.

    Amber: Honestly, I put a lot of those … when I was still learning, I had made a portfolio online and I had put a lot of those things into this portfolio, and everybody was like, "Oh, that’s so great that you make this for other people." And I was like, "Yeah, totally." And I do make it for other people, but I also, like you said, totally make it for myself because sometimes I forget why I did something, and I’ve gone back to stuff a year out and been like, "Why did I do that?" And I’ve been really glad that I wrote it down.

    Hugo: Yeah. I say this probably too often for my listeners on the podcast, but when I talk about commenting my code, it’s for other people but the most important other person is me in three weeks.

    Amber: Yep. And you always think you’re gonna remember what you were doing, but I don’t know. In my experience, I never do.

    Call to Action

    Hugo: Future me hates me now. I can tell you that. Great. So Amber, I’m wondering, as a final question, do you have a call to action for our listeners out there?

    Amber: Yeah. I think for anybody who’s looking to start in this field or if you’re trying to figure out if data journalism is something you might enjoy, my biggest advice is just to get started. Start with a small-ish question that you’re really excited about and start exploring it and don’t be discouraged if you have to go and collect some data on your own. Again, keep it to a reasonable size so that you don’t get completely overwhelmed, but I’ve found starting with something you’re excited about and something you’re personally invested in helps you really follow a project all the way through. You can learn so many fun things along the way. Feel free to go on the internet and look up how to do stuff and ask people there for advice and feedback because I think there’s a lot of really strong data science and data vis community members out there who are really excited to help beginners. If you have a story idea that you think would be great for The Pudding, we are always looking for freelancers. If you’ve got something that’s a proof of concept with analysis and a little bit of a general story, send it our way. We’d love to hear it.

    Hugo: I’ll just reiterate that all of Amber’s work we’re gonna put in the show notes, and check out everything on The Pudding. It’s all fantastic. Start with a small question and send through some ideas to them if you’re interested. Amber, it’s been an absolute pleasure having you on the show.

    Amber: Yeah. Thanks so much. This was so great. I appreciate being here.


    To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming.

    Temporal aggregations on time series data – Writing R functions to tidy meteorological data and getting some insights from it

    Mon, 08/06/2018 - 17:08

    (This article was first published on R – Monkeys Working, and kindly contributed to R-bloggers)

    Hi everyone!

    In this post we’re going to work with time series data and write R functions to aggregate hourly and daily time series into monthly time series to catch a glimpse of their underlying patterns. For this analysis we’re going to use public meteorological data recorded by the government of the Argentinian province of San Luis. Data on rainfall, temperature, humidity and, in some cases, wind is published on the REM website (Red de Estaciones Meteorológicas, http://www.clima.edu.ar/), where you can also download meteorological data (in .csv format) recorded by weather stations at different places in San Luis.

    Even though weather stations are capable of recording a huge amount of data, there is also a lot of work to do to turn it into information. For example, if observations are taken every hour, you’ll have 8,760 records in one year!

    So, in this post we’re going to:

    • Write functions in R to aggregate the large amount of information provided by weather stations.
    • Make these functions as general as possible in terms of the structure of the input data (time series data) in order to simplify downstream analysis of this data.
    • Finally, we’re going to plot our results to reach some conclusions about the weather in one town in San Luis Province, Argentina.

    We chose to work with data from La Calera. Take a look at our target site on the map! It’s a little town in the northwest of San Luis Province, located in a semiarid region. This makes the town and its neighbors really interesting places to work with, because in these desert places the climate has well-defined seasons and rainfall occurs mainly in a few months of the year.

    We have already downloaded the .csv file for La Calera from the REM website, and you can find it here. OK! Let’s work!

    Tidying data – Writing an R function

    First, we’re going to read the .csv with R and look at its structure. How many rows does it have? What are the first and last dates? Which variables were recorded? We’ll use head(), nrow() and str() to find out.
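    (The original code chunk isn’t reproduced in this excerpt; a minimal sketch, assuming the downloaded file is called la_calera.csv:)

    # Read the raw weather-station file and inspect its structure
    calera <- read.csv("la_calera.csv", stringsAsFactors = FALSE)  # hypothetical file name
    head(calera)   # first rows
    nrow(calera)   # number of hourly records
    str(calera)    # variables and their types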

    As you can see, this weather station recorded rainfall (mm), humidity (%) and temperature (°C) every hour. Data recording started in October 2007 and finished in April 2018 (the day we downloaded the .csv). In total there are more than 91,000 observations for each variable! Hourly data won’t allow us to find some of the climatic patterns we are looking for, so we need to summarize the information first as daily data and then as monthly data.

    We’re going to start by writing an R function that will be useful for organizing the data in different stages of the analysis (see the flowchart below). It takes the date of every row and splits it into three columns: one with the day, a second with the month, and the last one with the year. In this function we’re going to use the lubridate package, which has the functions we need: day(), month() and year(), each of which extracts the corresponding time component from a given date. In the following functions, we’ll use these new columns to subset the data.
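    (The author’s function isn’t reproduced in this excerpt. Below is a minimal sketch of such a helper; the name AddDateColumns is ours, and it would produce a data frame like the calera_organized object used in the calls further down:)

    library(lubridate)

    # Sketch: add day, month and year columns extracted from the date column
    # (assumes the date column can be parsed by as.POSIXct(); adjust the format if needed)
    AddDateColumns <- function(data.x, ncol.date = 1) {
      d <- as.POSIXct(data.x[, ncol.date])
      data.x$day   <- day(d)
      data.x$month <- month(d)
      data.x$year  <- year(d)
      data.x
    }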

    If we use our function, we get:

    Aggregate daily data – Writing an R function

    Now we’re going to aggregate our hourly data into daily data. We wrote another function that uses the columns we have just created to subset the data. For each day, we created a subset with its corresponding hours and calculated the daily mean or sum (see the diagram below).

    As we’re working with meteorological data, the sum will be useful for precipitation and the mean for temperature. We calculated the daily mean value as the average of the extreme hourly values, because daily mean temperature is commonly estimated as the average of the minimum and maximum temperatures (see http://climod.nrcc.cornell.edu/static/glossary.html and Dall’Amico and Hornsteiner, 2006).

    That is, for each day we have the set of hourly observations x_1, x_2, …, x_24, and we calculate the daily mean as (min(x_1, …, x_24) + max(x_1, …, x_24)) / 2.

    In R, we can do this using the max() and min() functions. But wait! It’s normal for weather stations to stop working on some days, so you can have entire days where only NAs were recorded. You have to keep this in mind, because on such days max() and min() will warn you:

    In min(x) : no non-missing arguments to min; returning Inf
    In max(x) : no non-missing arguments to max; returning -Inf

    To avoid that, we used an if…else statement in line 32 of the function.
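    (The definition of HourlyToDaily isn’t included in this excerpt. The sketch below is an assumption of roughly what it does, matching the argument names used in the calls that follow; na.presence is accepted but ignored here, and real-world date handling may differ.)

    # Simplified sketch of an hourly-to-daily aggregation function (not the original code)
    HourlyToDaily <- function(data.x, ncol.obs, ncol.date, value = "mean", na.presence = TRUE) {
      days <- as.Date(data.x[, ncol.date])
      agg <- function(x) {
        x <- x[!is.na(x)]
        if (length(x) == 0) return(NA)  # a day where only NAs were recorded
        if (value == "accumulated") sum(x) else (min(x) + max(x)) / 2
      }
      daily <- tapply(data.x[, ncol.obs], days, agg)
      data.frame(date = as.Date(names(daily)), value = as.numeric(daily))
    }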

    Exploratory plots – Daily data from La Calera

    Let’s plot daily data to look for patterns!

    # 1) a) Daily accumulated rainfall
    d_mm_ac <- HourlyToDaily(data.x = calera_organized, ncol.obs = 2,
                             na.presence = TRUE, value = "accumulated", ncol.date = 1)
    head(d_mm_ac)
    #
    plot(x = d_mm_ac$date, y = d_mm_ac$value, type = "l",
         xlab = "Date", ylab = "daily accumulated precipitation (mm)")
    #
    # 1) b) Daily mean temperature (°C)
    d_t_mean <- HourlyToDaily(data.x = calera_organized, ncol.obs = 3,
                              na.presence = TRUE, value = "mean", ncol.date = 1)
    head(d_t_mean)
    #
    plot(x = d_t_mean$date, y = d_t_mean$value, type = "l",
         xlab = "Date", ylab = "daily mean temperature (°C)")

    We already have our first results! Rainfall is higher during one period of each year, and between those periods there are stretches of dry weather. On the other hand, mean temperature has an oscillating pattern, reaching 30°C in summer and 0°C in winter. So, as we expected from this temperate desert place, seasons are well defined and temperature has a wide range through the year.

    Aggregate monthly data – Writing an R function

    OK, let’s continue a bit more! To aggregate from daily data to monthly data, we first apply the first function again (to add the time-component columns) and then build a function with for loops similar to the second one. In this case we didn’t use max() or min(), because monthly means are simply the average of the daily means in each month.

        We ran the function:

    # 2) Getting monthly data from daily data
    #
    MonthlyValue <- function(data.x, ncol.obs, value = "mean", na.presence = TRUE) {
      #
      if (!any(value == c("mean", "accumulated"))) {
        stop("value must be 'mean' or 'accumulated'")
      }
      #
      # Calculate the mean of the variable of interest
      first.year <- min(data.x$year, na.rm = na.presence)
      last.year <- max(data.x$year, na.rm = na.presence)
      x.per.month <- list()
      name.month.tot <- list()
      name.year.tot <- list()
      c = 1
      # Creates a temporary i data.frame for each year == i
      for (i in first.year:last.year) {
        x.anual <- data.x[data.x$year == i, ]
        # It takes the i year df and creates a j df for each month == j
        first.month = min(x.anual$month, na.rm = na.presence)
        last.month = max(x.anual$month, na.rm = na.presence)
        name.month <- list()
        name.year <- list()
        x.value <- list()
        for (j in first.month:last.month) {
          x.month <- x.anual[x.anual$month == j, ]
          # Calculates the mean value or the accumulated value
          if (value == "mean") {
            x.value[j] <- mean(x = x.month[ , ncol.obs], na.rm = na.presence)
            name.month[j] <- j
            name.year[j] <- i
          } else {
            x.value[j] <- sum(x = x.month[ , ncol.obs], na.rm = na.presence)
            name.month[j] <- j
            name.year[j] <- i
          }
        } # end cycle month
        x.per.month[[c]] <- unlist(x.value)
        name.month.tot[[c]] <- unlist(name.month)
        name.year.tot[[c]] <- unlist(name.year)
        c = c + 1
      } # end cycle year
      monthly.data <- as.data.frame(cbind(unlist(name.month.tot),
                                          unlist(name.year.tot),
                                          unlist(x.per.month)))
      if (value == "mean") {
        colnames(monthly.data) <- c("month.order", "year", "mean.value")
      } else {
        colnames(monthly.data) <- c("month.order", "year", "accumulated.value")
      }
      return(monthly.data)
    }

    and we obtained the following monthly time series:

    Exploratory plots – Monthly data from La Calera

    To end this analysis, we’re going to explore our monthly data. For this we are going to make a panel of barplots using a for loop, with an overlaid line plot for each of the years in the dataset.

    # plot with monthly data
    plot.calera <- function() {
      parameters.1 <- par(mfrow = c(4, 3), mar = c(3, 3, 3, 1.5), oma = c(3, 3, 3, 3))
      for (i in seq(from = unique(m_mm_ac$year)[1],
                    to = unique(m_mm_ac$year)[length(unique(m_mm_ac$year))])) {
        barplot(height = m_mm_ac[m_mm_ac$year == i, 3],
                names.arg = m_mm_ac[m_mm_ac$year == i, 1],
                col = "lightgreen", axes = TRUE, main = i, cex.main = 0.8,
                ylim = c(0, 140), xlim = c(0, 12), space = 0)
        lines(y = (m_t_mean[m_t_mean$year == i, 3]), ylim = c(0, 70),
              x = ((m_mm_ac[m_mm_ac$year == i, 1]) - 0.5),
              type = "o", col = "darkorange", lwd = 2, pch = 16)
        axis(side = 4)
      }
      title(main = "Accumulated rainfalls and temperature per month in La Calera",
            cex.main = 1.3, xlab = "Months", cex.lab = 1.5,
            ylab = "Monthly accumulated precipitation", outer = TRUE, line = 0)
      mtext(text = "Monthly mean temperature °C", side = 4, cex = 1, outer = TRUE, line = 1.2)
      par(parameters.1)
    }
    plot_calera <- plot.calera()

As we previously saw, rains are concentrated in only a few months. The bar charts show that La Calera has a dry period during winter, when temperatures are lower, and a rainy period during the warm seasons.

Well, the analysis of yearly means of accumulated precipitation and temperature is now up to you!

You can see the entire R script for this post here, and remember that all of the scripts and data used for the blog are available in our GitHub repositories.

And if you’re interested in other analyses of different kinds of time series data, you could take a look at our other post:

    See you in the next post!

     


    To leave a comment for the author, please follow the link and comment on their blog: R – Monkeys Working. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Do GPU-based Basic Linear Algebra Subprograms (BLAS) improve the performance of standard modeling techniques in R?

    Mon, 08/06/2018 - 09:30

    (This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

    Introduction

The speed or run-time of models in R can be a critical factor, especially considering the size and complexity of modern datasets. The number of data points as well as the number of features can easily be in the millions. Even relatively trivial modeling procedures can consume a lot of time, which matters both for optimizing and for updating models. An easy way to speed up computations is to use an optimized BLAS (Basic Linear Algebra Subprograms). This has potential especially because R’s default BLAS is well regarded for its stability and portability, not necessarily its speed. Alternative libraries include ATLAS and OpenBLAS, which we will use below. Multiple blog posts have shown that they are able to improve the performance of linear algebra operations in R, especially those of the infamous R-benchmark-25.R.

With nvblas, NVIDIA offers a GPU-based BLAS library, which it claims is significantly faster than standard implementations. nvblas is part of the latest CUDA releases such as CUDA 9. According to NVIDIA, nvblas is able to speed up procedures such as the computation of the determinant of a 2500×2500 matrix by a factor of more than ten. However, most users from the data science sphere don’t use basic linear algebra operations directly – they usually use high-level abstractions such as lm() or gam() for modeling.

With little effort we can swap the default BLAS for OpenBLAS and already get significant performance gains. So we want to find out whether nvblas can speed up computations just as easily! In this comparison we pit R’s default BLAS and the optimized libraries ATLAS and OpenBLAS (all of which use CPUs only) against nvblas (which uses both CPU and GPU).

    Setup

    We simulated data for three scenarios:

• Regression: We draw data for a fixed number of numerical and dummy variables (using a normal distribution with Poisson-distributed variance and mean 0 and a Bernoulli distribution with uniformly distributed probability, respectively) and calculate the target as a linear combination of the features, adding nonlinear interaction effects and a random error term. The following models are set up to solve this problem:
    • Linear Regression: using glm(family = “gaussian”) from stats
    • Additive Regression: using bam(family = “gaussian”) with max. 20 smoothed variables from mgcv
    • Extreme Gradient Boosting: using xgb.train(nrounds = 30, tree_method = “hist”) from xgboost, using a 75%-25% train-test-split
    • Classification: We draw data in a similar fashion as with the regression example, but derive the binary outcome from the former dependent variable by splitting at the median of the outcome variable and adding an additional 5%-dropout. The following models are set up to solve this problem:
    • Generalized linear Regression: using glm(family = “binomial”) from stats
    • Generalized additive Regression: using bam(family = “binomial”) with max. 20 smoothed variables from mgcv
    • Extreme Gradient Boosting: using xgb.train(nrounds = 30, tree_method = “hist”) from xgboost, using a 75%-25% train-test-split
    • Random Intercept: We draw data in a similar fashion as with the regression example, but adding an additional random intercept for random groups. The following models are set up to solve this problem:
    • Linear Mixed Model: using lmer() from lme4
    • Linear Mixed Model: using lme() from nlme
    • Additive Mixed Model: using bam(family = “gaussian”) with max. 20 smoothed variables and the bs parameter from mgcv

    The particular pick of functions simply reflects which models and packages we currently use at INWT Statistics. So those are the most interesting ones for us to look at. When you have a thorough understanding of how these methods are implemented, you may, in part, anticipate some of the results. To execute the experiments, we used an AWS machine capable of both efficient CPU and GPU computations: p3.2xlarge with Machine Learning AMI containing the latest CUDA version. With this setup, it is relatively easy to change the BLAS library by configuring it via an environment variable:

    env LD_PRELOAD=libnvblas.so Rscript code.R

    To permanently switch the BLAS version, one can use update-alternatives:

    sudo update-alternatives --config libblas.so.3gf

    To check which BLAS and LAPACK-version a running R-console is using, do

    ps aux | grep exec/R

    to get the PID and then do:

    lsof -p <PID> | grep 'blas\|lapack'

The xgboost package plays a special role in our tests: in addition to changing the BLAS library to nvblas (the optimization done by xgboost does not improve much with different BLAS versions), one can switch the tree-growing algorithm to gpu_hist, provided the package is installed with GPU support. This GPU-optimized variant of the histogram tree growth method should outperform the CPU version, and this is what we did in our tests.
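For illustration, here is a minimal sketch of what switching xgboost to the GPU histogram method might look like; the data objects and parameter values are placeholders rather than the code used in our benchmark, and a GPU-enabled xgboost build is assumed:

library(xgboost)

# toy data standing in for the simulated features and target described above
X <- matrix(rnorm(1000 * 20), ncol = 20)
y <- rnorm(1000)
dtrain <- xgb.DMatrix(data = X, label = y)

# same call pattern as in the text, with the GPU-optimized histogram method
fit <- xgb.train(params = list(objective = "reg:squarederror",
                               tree_method = "gpu_hist"),
                 data = dtrain,
                 nrounds = 30)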

    Results

    The experiments were done for different sample sizes and numbers of categorical and numerical features. We show the results for a sample of 100,000 and 100 categorical as well as 100 numerical features. If you think this is small: we do not analyse images, videos or gene expression data; this is a somewhat realistic scenario in which performance becomes interesting and relevant for us.

The R-benchmark-25 needs 5.54 seconds using OpenBLAS, 11.21 seconds using ATLAS, 34.62 seconds with the default BLAS and 7.78 – 8.08 seconds using nvblas, depending on whether OpenBLAS or ATLAS is used as the fallback (which is specified in the nvblas.config). Even for R-benchmark-25, nvblas is not faster than OpenBLAS; however, both run significantly faster than the default BLAS.

    So how do the different BLAS-versions perform for high-level-functions in R? The plots show the speed of the methods using different BLAS-versions.

    When we started out we were hoping for a clear winner and somehow we also found one. However depending on your pick of models and applications there are some unexpected results.

    • No matter which BLAS library we used, bam picked it up and the run-time improved significantly. When we have a machine with sufficiently strong CPUs ATLAS is about as fast as using nvblas and a GPU. OpenBLAS still shows an improvement over the default BLAS. A good thing about bam is that it covers a wide range of model classes. Especially random intercept or generally random effect models (sadly) do not get a lot of attention in the machine learning era.
    • xgboost gains from nvblas. This is not very surprising since it uses an optimized algorithm for GPUs. However, it does not gain anything from ATLAS or OpenBLAS.
• For the traditionalists: glm and lme will profit from OpenBLAS. Very surprising, however, is the performance loss when we used ATLAS and nvblas. Either there is considerable overhead when using these libraries or the configuration is not as plug-and-play as we hoped. In any case it is reason enough to praise OpenBLAS for speeding up a lot of computations out of the box, even if specialized models may gain more from other libraries.
    • What is wrong with lmer: lme4 uses the RcppEigen package to replace the default BLAS library. For a lot of users this has the advantage that lmer will run fast without you doing anything. But since it is not using the default BLAS it will also not benefit from replacing it with ATLAS, OpenBLAS or nvblas. lmer and lme (OpenBLAS) have been comparable in run time in most of our tests. More pressing is the question which one implements the method you need.

    We hoped to see that nvblas would easily outperform all other libraries. Of course it is not that simple. For statistical modeling, the libraries which have been tuned, optimized and debugged for decades are hard to beat. If you are looking for a plug-and-play speed improvement I would still vote for OpenBLAS.


    To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    StatsModels: the first nail in R’s coffin

    Mon, 08/06/2018 - 03:49

    (This article was first published on The Shape of Code » R, and kindly contributed to R-bloggers)

    In 2012, when I decided to write a book on evidence-based software engineering, R was the obvious system to use for data analysis. At the time, lots of new books had “using R” or “with R” added at the end of their titles; I chose “using R”.

    When developers tell me they need to do some statistical analysis, and ask whether they should use Python or R, I tell them to use Python if statistics is a small part of the program, otherwise use R.

If I started work on the book today, I would still choose R. If I were starting five years from now, I could be choosing Python.

    To understand why I think Python will eventually take over the niche currently occupied by R, we need to understand the unique selling points of both systems.

    R’s strengths are that it supports a way of thinking that is a good fit for doing data analysis and has an extensive collection of packages that simplify the task of applying a wide variety of analysis techniques to data.

Python also has packages supporting the commonly used data analysis techniques. But nearly all the Python packages provide a developer-mentality interface (i.e., they provide an API like any other package), whereas R provides data-analysis-mentality interfaces. R supports a way of thinking that data analysts can identify with.
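For readers who have not seen it, this is roughly what a data-analysis-mentality interface looks like in R: a formula plus a data frame, returning a model object with a rich printed summary. The dataset and model below are arbitrary illustrations, not anything from the article.

# mtcars ships with R; the model itself is just an illustration
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
summary(fit)  # coefficient table, R-squared, residual summary, etc.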

Python’s strengths, over R, are a much larger base of developers and language support for writing large programs (R is really a scripting language). Yes, Python has a package ecosystem supporting the full spectrum of application domains, but this is not relevant for analysing a successful invasion of R’s niche market (it is, however, relevant for enticing new developers who are still making up their minds).

    StatsModels is a Python package based around R’s data-analysis-mentality interface. When I discovered this package a few months ago, I realised the first nail had been hammered into R’s coffin.

    Yes, today R has nearly all the best statistical analysis packages and a large chunk of the leading edge stuff. But packages can be reimplemented (C code can be copy-pasted, the R code mapped to Python); there is no magic involved. Leading edge has a short shelf life, and what proves to be useful can be duplicated; the market for leading edge code in a mature market (e.g., data analysis) is tiny.

    A bunch of bright young academics looking to make a name for themselves will see the major trees in the R forest have been felled. The trees in the Python data-analysis-mentality forest are still standing; all it takes is a few people wanting to be known as the person who implemented the Python package that everybody uses for XYZ analysis.

    A collection of packages supporting the commonly (and eventually not so commonly) used data analysis techniques, with a data-analysis-mentality interface, removes a major selling point for using R. Python is a bigger developer market with support for many other application domains.

The flow of developers starting out with R will slow down, and casual R users will have nothing to lose from switching when the right project comes along. There will be groups where everybody uses R and will continue to use R because that is what everybody else in the group uses. Ten to twenty years from now, R developers could be working in a ghost town.


    To leave a comment for the author, please follow the link and comment on their blog: The Shape of Code » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Wrangling Wikileaks DMs

    Mon, 08/06/2018 - 02:00

    (This article was first published on Colin Fay, and kindly contributed to R-bloggers)

    Using R to turn raw data into browsable and reusable content.

    About

On the 29th of July 2018, Emma Best published on her website a copy of
11k+ WikiLeaks Twitter DMs:
https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/

To be honest, I’m not really interested in the content of this dataset.
What really interested me is that it’s raw data (copied and pasted
text) waiting to be parsed, and that I could use R to turn these
elements into reusable and browsable content.

    The results

    Here are the links to the pages I’ve created with R from this dataset:

    • Home has the full
      dataset, to search and download.
    • Timeline has a
      series of time-related content: notably DMs by years, and daily
      count of DMs.
• Users holds the dataset for each user.
• mentions_urls holds the extracted mentions and urls.
• methodo contains the methodology used for the data wrangling.

Methodology

Extracting the content

    As I wanted to use the data offline (and not re-download it each time I
    compile the outputs), I’ve first extracted and saved the dataset as a
    .txt. You can now see it at https://colinfay.me/wikileaksdm/raw.txt.

    Here is the code
    used:

    library(tidyverse) ## ── Attaching packages ────────────────────────────────────────── tidyverse 1.2.1 ── ## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5 ## ✔ tibble 1.4.2 ✔ dplyr 0.7.5 ## ✔ tidyr 0.8.1 ✔ stringr 1.3.1 ## ✔ readr 1.1.1 ✔ forcats 0.3.0 ## ── Conflicts ───────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() library(rvest) ## Loading required package: xml2 ## ## Attaching package: 'rvest' ## The following object is masked from 'package:purrr': ## ## pluck ## The following object is masked from 'package:readr': ## ## guess_encoding # Reading the page doc <- read_html("https://emma.best/2018/07/29/11000-messages-from-private-wikileaks-chat-released/") # Extracting the paragraphs doc <- doc %>% # Getting the p html_nodes("p") %>% # Getting the text html_text() # Removing the empty lines doc <- doc[! nchar(doc) == 0] # Lines 1 to 9 are the content of the blogpost, not the content of the conversation. doc[1:9] ## [1] "“Objectivity is short-hand for not having a significant pre-conceived agenda, eliding facts the audience would be interested in, or engaging in obvious falsehoods.” ~ WikiLeaks" ## [2] "Presented below are over 11,000 messages from the WikiLeaks + 10 chat, from which only excerpts have previously been published." ## [3] "The chat is presented nearly in its entirety, with only a handful redactions made to protect the privacy and personal information of innocent, third parties as well as the already public name of an individual who has sent hate mail, made legal threats and who the source for the DMs considers a threat. It is at the request of the source that Mark’s full name is redacted, leaving only his first name and his initials (which he specified is alright). Though MGT’s full name is already public and easily discoverable, the source’s wishes are being respected. Beyond this individual, the redactions don’t include any information that’s relevant to understanding WikiLeaks or their activities." ## [4] "The chat log shows WikiLeaks’ private attitudes, their use of FOIA laws, as well as discussions about WikiLeaks’ lobbying and attempts to “humiliate” politicians, PR and propaganda efforts (such as establishing a “medium term truth” for “phase 2”), troll operations, attempts to engineer situations where WikiLeaks would be able to sue their critics, and in some instances where WikiLeaks helped direct lawsuits filed by third parties or encouraged criminal investigations against their opponents. In some instances, the chats are revealing. In others, they show a mundane consistency with WikiLeaks’ public stances. A few are provocative and confounding." ## [5] "The extract below was created using DMArchiver, and is presented as pure text to make it easier to search and to provide as much metadata as possible (i.e. times as well as dates). The formatting is presented as-is, and shows users’ display names rather than their twitter handles. (Note: Emmy B is @GreekEmmy, not the author.)" ## [6] "CW: At various points in the chat, there are examples of homophobia, transphobia, ableism, sexism, racism, antisemitism and other objectionable content and language. Some of these are couched as jokes, but are still likely to (and should) offend, as a racist or sexist jokes doesn’t cease to be racist or sexist because of an expected or desired laugh. Attempts to dismiss of these comments as “ironic” or “just trolling” merely invites comparisons to 4chan and ironic nazis. 
These comments, though offensive, are included in order to present as full and complete a record as possible and to let readers judge the context, purpose and merit of these comments for themselves." ## [7] " " ## [8] "If any current or former staffers, volunteers or hackers wants to add to my growing collection of leaks from within #WikiLeaks, please reach out. DMs are open and I’m EmmaBest on Wire." ## [9] "— Emma Best (\u1d1c//ғ\u1d0f\u1d1c\u1d0f) \U0001f3f3‍\U0001f308 (@NatSecGeek) June 28, 2018" doc[10] ## [1] "[2015-05-01 13:52:11] group Dm on Wls related trolls activity, incoming events & general topics." # Removing these lines: doc <- doc[10:length(doc)]

    And then simply save it as a txt:

    write(doc, "raw.txt")

    I can now easily reaccess it.

res_raw <- read.delim("https://colinfay.me/wikileaksdm/raw.txt", sep = "\n", header = FALSE) %>%
  # Turning the character vector into a tibble
  as.tibble() %>%
  # Renaming the V1 column
  rename(value = V1)
res_raw
## # A tibble: 11,377 x 1
##    value
##
##  1 [2015-05-01 13:52:11] group Dm on Wls related trolls activity, i…
##  2 [2015-05-01 13:53:39] There’s a race on now to ‘spoil’ our …
##  3 [2015-05-01 13:55:02] Greenberg, who palled up with the ope…
##  4 [2015-05-01 13:55:53] ..stripped the hostility. We’re putti…
##  5 [2015-05-01 13:56:03] suppressed.
##  6 [2015-05-01 14:01:26] yes, both Wired & Verge contained DDB’s li…
##  7 [2015-05-01 14:02:37] – cyber attacks. must be one of, if not th…
##  8 [2015-05-01 14:03:55] Greenberg is a partisan. Fraudulent t…
##  9 [2015-05-01 14:29:48] [Tweet] https://twitter.com/m_cetera/statu…
## 10 [2015-05-01 14:32:09] Hi all. More comfortable discuss…
## # ... with 11,367 more rows

Cleaning the data

DMs have a specific structure: [date hour] <user> text, except for
one “author”, <DMConversationEntry>, which is the meta-information
about the conversation (renaming of the channel, users joining and
leaving, etc.). In order to tidy the format, let’s add
<DMConversationEntry> as an author.

res <- res_raw %>%
  mutate(value = str_replace_all(value, "\\[DMConversationEntry\\]", "<DMConversationEntry>"))

Also, I’ll remove the last entry of the corpus, which doesn’t fit the
    conversation format:

res[nrow(res),]
## # A tibble: 1 x 1
##   value
##
## 1 [LatestTweetID] 931704226425856001
res <- filter(res, ! str_detect(value, "931704226425856001"))

Some messages are split across several lines. These lines don’t start with a
date (they are the middle of a DM), so I’ll paste the content of these
lines onto the end of the preceding line.

    Here is an example with lines 93 &
    94:

value
“[2015-05-02 14:12:27] OK, thanks H. Security issues were about who was on the list then?”
“Never quite know who you’re dealing with online I guess. I don’t, anyway!”

    Here, 94 will be pasted at the end of 93 and removed.

    Let’s loop this:

for (i in nrow(res):1){
  if (!grepl(pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry", res[i,])){
    res[i-1,] <- paste(res[i-1,], res[i,])
  }
}
# Remove lines with no date or no DMConversationEntry
res <- res %>%
  mutate(has_date = str_detect(value, pattern = "\\[.{4}-.{2}-.{2} .{2}:.{2}:.{2}\\]|DMConversationEntry")) %>%
  filter(has_date) %>%
  select(-has_date)

Extract key elements

    We’ll now need to split the content in three: user, date, and
    text.

My first try was with extract():

res <- res %>%
  extract(
    value,
    c("date", "user", "text"),
    regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] <([a-zA-Z0-9 ]*)> (.*)"
  )

But that didn’t fit well: the DMConversationEntry lines have no date (I will
fill them in later), so I need an NA there, hence the three-step process:

res <- res %>%
  extract(value, "user", regex = "<([a-zA-Z0-9 ]*)>", remove = FALSE) %>%
  extract(value, "date", regex = "\\[(.{4}-.{2}-.{2} .{2}:.{2}:.{2})\\] .*", remove = FALSE) %>%
  extract(value, "text", regex = "<[a-zA-Z0-9 ]*> (.*)", remove = FALSE) %>%
  select(-value)

    When date is missing, it’s because it’s a DMConversationEntry. Let’s
    verify that:

res %>%
  filter(user == "DMConversationEntry") %>%
  summarize(nas = sum(is.na(date)), nrow = n())
## # A tibble: 1 x 2
##     nas  nrow
##
## 1    20    20

In order to have a date here, we will fill this with the directly
preceding date:

res <- fill(res, date)

Saving data

Global

write_csv(res, "wikileaks_dm.csv")

Year

    Find the min and max years:

    range(res$date) ## [1] "2015-05-01 13:52:11" "2017-11-10 04:30:46"

    Filter and save a csv for each year:

walk(2015:2017,
     ~ filter(res, lubridate::year(date) == .x) %>%
       write_csv(glue::glue("{.x}.csv")))

User

    Filter and save a csv for each user:

walk(unique(res$user),
     ~ filter(res, user == .x) %>%
       write_csv(glue::glue("user_{make.names(.x)}.csv")))

Counting users participation

res %>%
  count(user, sort = TRUE) %>%
  write_csv("user_count.csv")

Counting activity by days

res %>%
  mutate(date = lubridate::ymd_hms(date),
         date = lubridate::date(date)) %>%
  count(date) %>%
  write_csv("daily.csv")

Adding extra info

    Extracting all the mentions (@something):

mentions <- res %>%
  mutate(mention = str_extract_all(text, "@[a-zA-Z0-9_]+")) %>%
  unnest(mention) %>%
  select(mention, everything())
write_csv(mentions, "mentions.csv")

# Count them
mentions %>%
  count(mention, sort = TRUE) %>%
  write_csv("mentions_count.csv")

    Extracting all the urls (http(s)something):

urls <- res %>%
  mutate(url = str_extract_all(text, "http.+")) %>%
  unnest() %>%
  select(url, everything())
write_csv(urls, "urls.csv")

Adding JSON format

I’ve also chosen to export a JSON version of each csv.

list.files(pattern = "csv") %>%
  walk(function(x) {
    o <- read_csv(x)
    jsonlite::write_json(o, path = glue::glue("{tools::file_path_sans_ext(x)}.json"))
  })
dir.create("json")
list.files(pattern = "json") %>%
  walk(function(x){
    file.copy(x, glue::glue("json/{x}"))
    unlink(x)
  })

Building a website with Markdown and GitHub

    Here’s a list of random elements from the process of building these
    pages with R.

    Hosting

My website is hosted on
    GitHub, with the home
    url (colinfay.me) pointing to the root of this repo. If I create a new
    folder pouet, and put inside this folder a file called index.html, I
    can then go to colinfay.me/pouet, and get a new website from there. As
    the wikileaks extraction already had its own repo, I’ve chosen to list
    this repo https://github.com/ColinFay/wikileaksdm as a submodule of my
    website’s repo.

    More about submodules:
    https://git-scm.com/book/en/v2/Git-Tools-Submodules

Inside this wikileaksdm project, I gathered all the data, an index.Rmd
which is used as the homepage, and other Rmd files for the other pages. Each
is compiled as html.

    Styling the pages

    Markdown default style is nice, but I wanted something different. This
    is why I used {markdowntemplates}, with the skeleton template. The
    yaml looks like:

---
title: "Wikileaks Twitter DM - Home"
author: '@_colinfay'
date: "2018-08-06"
fig_width: 10
fig_height: 4
navlink: "[Wikileaks Twitter DM](https://colinfay.me/wikileaksdm)"
og:
  type: "article"
  title: "Wikileaks Twitter DM"
footer:
  - content: 'colinfay.me @_colinfay'
output: markdowntemplates::skeleton
---

    Here, you can see some new things: footer content, og for open graph
    data, and navlink for the content of the header.

Include the same markdown content several times

    All the pages have the same intro content, so I can use
    shiny::includeMarkdown to include it on each page (this way, I’ll only
    need to update the content once if needed). Put it between backticks
    with an r, and the markdown is integrated at compilation time as html.

    See here, line 21:
    https://raw.githubusercontent.com/ColinFay/wikileaksdm/master/index.Rmd
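As an illustration, an inline chunk along these lines pulls the shared file into each page at compile time; the file name intro.md is a hypothetical placeholder:

# Written inline in the .Rmd, between backticks with an r:
# `r shiny::includeMarkdown("intro.md")`
shiny::includeMarkdown("intro.md")  # returns the markdown rendered as html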

    Include font awesome icons

Before every link, there is a small icon.

    This could have been done with CSS, but I’ve used the {fontawesome}
    package, also between backticks and with an r, to include them.

    See here, line 33:
    https://raw.githubusercontent.com/ColinFay/wikileaksdm/master/index.Rmd
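For instance, an inline call along these lines renders one of the icons in the compiled html; the icon name used here is an arbitrary example:

# Written inline in the .Rmd as `r fontawesome::fa("link")`
fontawesome::fa("link")  # returns the SVG for the chosen Font Awesome icon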

    Page content

All the pages include interactive elements and a static plot.
Interactive tables have been rendered with the {DT} package, and the
timeline with {dygraphs}. Under each dygraph, there is a static plot
made with {ggplot2}. In order to organise these two plots (interactive
and static), the second plot is put inside a <details> HTML tag. This
makes it possible to create foldable content inside the page.

    See: https://twitter.com/_ColinFay/status/1022836135452663809

    Prefilling functions

I use dygraph and datatable several times, with the same default
arguments (e.g. extensions = "Buttons", options = list(scrollX = TRUE,
dom = "Bfrtip", buttons = c("copy", "csv"))). As I didn’t want to retype
these elements each time, I’ve called purrr::partial on them:

dt <- partial(
  datatable,
  extensions = "Buttons",
  options = list(
    scrollX = TRUE,
    dom = "Bfrtip",
    buttons = c("copy", "csv")
  )
)

This new dt function is then used as the default datatable rendering.
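For example, the wrapper can be called just like datatable() itself, with the shared options already filled in; the data below is only a stand-in:

dt(head(mtcars))                       # equivalent to datatable(head(mtcars), extensions = "Buttons", ...)
dt(head(mtcars), caption = "Example")  # additional datatable() arguments can still be passed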

    Read more

    If you want to read the code and discover the content, feel free to
    browse the website and the github repo:


    To leave a comment for the author, please follow the link and comment on their blog: Colin Fay. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Collecting Expressions in R

    Mon, 08/06/2018 - 01:36

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    Not a full R article, but a quick note demonstrating by example the advantage of being able to collect many expressions and pack them into a single extend_se() node.

This example may seem extreme or unnatural. However, we have seen that once you expose a system to enough users you see a lot more extreme use cases than you would at first expect. We have actually seen many tens of columns added to a mart in a large irregular block (so not the same transform for each column) by building up long pipelines, so this simplified example is in fact relevant to production deployments.

    First set up our packages, database connection, and remote table.

    library("dplyr") ## Warning: package 'dplyr' was built under R version 3.5.1 ## ## Attaching package: 'dplyr' ## The following objects are masked from 'package:stats': ## ## filter, lag ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union library("rquery") library("microbenchmark") library("ggplot2") library("WVPlots") library("rqdatatable") library("cdata") use_spark <- TRUE # connect if(use_spark) { conf <- sparklyr::spark_config() conf$spark.yarn.am.cores <- 2 conf$spark.executor.cores <- 2 mem_size <- "4G" conf$spark.executor.memory <- mem_size conf$spark.yarn.am.memory <- mem_size conf$`sparklyr.shell.driver-memory` <- mem_size conf$`sparklyr.shell.executor-memory` <- mem_size conf$`spark.yarn.executor.memoryOverhead` <- mem_size con <- sparklyr::spark_connect(version='2.2.0', master = "local", config = conf) } else { con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(), host = 'localhost', port = 5432, user = 'johnmount', password = '') } # configure rquery connection options dbopts <- rq_connection_tests(con) db_hdl <- rquery_db_info( connection = con, is_dbi = TRUE, connection_options = dbopts) print(db_hdl) ## [1] "rquery_db_info(DBIConnection_spark_connection_spark_shell_connection, is_dbi=TRUE, note=\"\")" nrow <- 1000000 td <- rq_copy_to(db_hdl, "d", data.frame(x = seq_len(nrow)), overwrite = TRUE, temporary = TRUE) tbl <- dplyr::tbl(con, "d") ncol <- 100

    rquery torture function: add 100 columns to a 1000000 row table.

    rquery_fn <- function(db_hdl, td, ncol, return_sql = FALSE) { expressions <- character(0) for(i in seq_len(ncol)) { expri <- paste0("x_", i) %:=% paste0("x + ", i) expressions <- c(expressions, expri) } ops <- td %.>% extend_se(., expressions) %.>% select_rows_nse(., x == 3) if(return_sql) { return(to_sql(ops, db_hdl)) } # force execution db_hdl %.>% ops } cat(rquery_fn(db_hdl, td, 5, return_sql = TRUE)) ## SELECT * FROM ( ## SELECT ## `x`, ## `x` + 1 AS `x_1`, ## `x` + 2 AS `x_2`, ## `x` + 3 AS `x_3`, ## `x` + 4 AS `x_4`, ## `x` + 5 AS `x_5` ## FROM ( ## SELECT ## `x` ## FROM ## `d` ## ) tsql_97238965696940256963_0000000000 ## ) tsql_97238965696940256963_0000000001 ## WHERE `x` = 3 rquery_fn(db_hdl, td, 5) ## x x_1 x_2 x_3 x_4 x_5 ## 1 3 4 5 6 7 8

The row-selection step is there to cut down on the in-memory cost of bringing the result back to R. Obviously we could optimize the example away by moving the filter earlier in the example pipeline. We ask the reader to take this example as a stand-in for a more complicated (though nasty) real-world example where such optimizations are not available.
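For what it is worth, a sketch of that filter-first variant follows; expressions stands for the character vector built inside rquery_fn() above, so this is purely an illustration and not code used in the benchmark:

# filter first, then extend: only the surviving row gets the 100 new columns
ops_optimized <- td %.>%
  select_rows_nse(., x == 3) %.>%
  extend_se(., expressions)
db_hdl %.>% ops_optimized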

    Same torture for dplyr.

    dplyr_fn <- function(tbl, ncol, return_sql = FALSE) { pipeline <- tbl xvar <- rlang::sym("x") for(i in seq_len(ncol)) { res_i <- rlang::sym(paste0("x_", i)) pipeline <- pipeline %>% mutate(., !!res_i := !!xvar + i) } pipeline <- pipeline %>% filter(., x == 3) if(return_sql) { return(dbplyr::remote_query(pipeline)) } # force execution pipeline %>% collect(.) } cat(dplyr_fn(tbl, 5, return_sql = TRUE)) ## SELECT * ## FROM (SELECT `x`, `x_1`, `x_2`, `x_3`, `x_4`, `x` + 5 AS `x_5` ## FROM (SELECT `x`, `x_1`, `x_2`, `x_3`, `x` + 4 AS `x_4` ## FROM (SELECT `x`, `x_1`, `x_2`, `x` + 3 AS `x_3` ## FROM (SELECT `x`, `x_1`, `x` + 2 AS `x_2` ## FROM (SELECT `x`, `x` + 1 AS `x_1` ## FROM `d`) `tyiyhhxjag`) `gshhunpiup`) `teowzjcshb`) `hdrfwlzycc`) `lsniejpwft` ## WHERE (`x` = 3.0) dplyr_fn(tbl, 5) ## # A tibble: 1 x 6 ## x x_1 x_2 x_3 x_4 x_5 ## ## 1 3 4 5 6 7 8

We can also collect expressions efficiently using seplyr (seplyr is a thin wrapper over dplyr, so seplyr’s method mutate_se() is essentially instructions for how to do the same thing using rlang).

    seplyr_fn <- function(tbl, ncol, return_sql = FALSE) { expressions <- character(0) for(i in seq_len(ncol)) { expri <- paste0("x_", i) %:=% paste0("x + ", i) expressions <- c(expressions, expri) } pipeline <- tbl %>% seplyr::mutate_se(., expressions) %>% filter(., x == 3) if(return_sql) { return(dbplyr::remote_query(pipeline)) } # force execution pipeline %>% collect(.) } cat(seplyr_fn(tbl, 5, return_sql = TRUE)) ## SELECT * ## FROM (SELECT `x`, `x` + 1.0 AS `x_1`, `x` + 2.0 AS `x_2`, `x` + 3.0 AS `x_3`, `x` + 4.0 AS `x_4`, `x` + 5.0 AS `x_5` ## FROM `d`) `gaktzzkzxq` ## WHERE (`x` = 3.0) seplyr_fn(tbl, 5) ## # A tibble: 1 x 6 ## x x_1 x_2 x_3 x_4 x_5 ## ## 1 3 4 5 6 7 8

    Time the functions. Timing is not going to be certain given issues such as cluster state and query caching.

    timings <- microbenchmark( rquery = rquery_fn(db_hdl, td, ncol), dplyr = dplyr_fn(tbl, ncol), seplyr = seplyr_fn(tbl, ncol), times = 10L) saveRDS(timings, "CollectExprs_timings.RDS")

    Present the results.

    print(timings) ## Unit: milliseconds ## expr min lq mean median uq max neval ## rquery 995.955 1018.481 1153.364 1065.092 1281.715 1502.717 10 ## dplyr 2156.264 2219.900 2899.534 2473.929 3791.673 4714.063 10 ## seplyr 1074.775 1180.591 1453.980 1273.424 1598.804 2398.883 10 #autoplot(timings) timings <- as.data.frame(timings) timings$seconds <- timings$time/10^9 timings$method <- factor(timings$expr) timings$method <- reorder(timings$method, timings$seconds) WVPlots::ScatterBoxPlotH(timings, "seconds", "method", "task time by method") tratio <- timings %.>% project_nse(., groupby = "method", mean_seconds = mean(seconds)) %.>% pivot_to_rowrecs(., columnToTakeKeysFrom = "method", columnToTakeValuesFrom = "mean_seconds", rowKeyColumns = NULL) %.>% extend_nse(., ratio = dplyr/rquery) tratio[] ## dplyr rquery seplyr ratio ## 1: 2.899534 1.153364 1.45398 2.513979 ratio_str <- sprintf("%.2g", tratio$ratio)

    rquery is about 2.5 times faster than dplyr for this task at this scale for this data implementation and configuration (we have also seen an over 8 times difference for this example on PostgreSQL).

    if(use_spark) { sparklyr::spark_disconnect(con) } else { DBI::dbDisconnect(con) }

    Original code here.


    To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Time to Accept It: publishing in the Journal of Statistical Software

    Sun, 08/05/2018 - 22:10

    (This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

    I’m reblogging this article mostly for myself. If you’ve been following my blog, you’ll see that recently I published an article on organizing R code that mentioned using packages to organize that code. One of the advantages of doing so is that the work you’ve done is easily distributed. If the methods are novel in some way, you may even get a paper in J. Stat. Soft. or the R Journal that helps people learn how to use your software and exposes the methodology to a wider audience. Therefore we should know something about those journals. (I recently got a good reply on Reddit about the difference between these journals.)

    The Geokook.

    When I was considering submitting my paper on psd to J. Stat. Soft. (JSS), I kept noticing that the time from “Submitted” to “Accepted” was nearly two years in many cases.  I ultimately decided that was much too long of a review process, no matter what the impact factor might be (and in two years time, would I even care?).  Tonight I had the sudden urge to put together a dataset of times to publication.

    Fortunately the JSS website is structured such that it only took a few minutes playing with XML scraping (*shudder*) to write the (R) code to reproduce the full dataset.  I then ran a changepoint (published in JSS!) analysis to see when shifts in mean time have occurred.  Here are the results:

    Top: The number of days for a paper to go from ‘Submitted’ to ‘Accepted’ as a function of the cumulative issue index (each paper is an “issue”…



    To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Longitudinal heat plots

    Sat, 08/04/2018 - 18:51

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)

During our research on the effect of prednisone consumption during pregnancy on health outcomes of the baby (Palmsten K, Rolland M, Hebert MF, et al., Patterns of prednisone use during pregnancy in women with rheumatoid arthritis: Daily and cumulative dose. Pharmacoepidemiol Drug Saf. 2018 Apr;27(4):430-438. https://www.ncbi.nlm.nih.gov/pubmed/29488292) we developed a custom plot to visualize, for each patient, their daily and cumulative consumption of prednisone during pregnancy. Since the publication these plots have raised some interest, so here is the code used to produce them.

    Data needs to be in the following format:
    1 line per patient
1 column for the patient ID (named id) and then 1 column per unit of time (here days) reporting the measure for that day.
To illustrate the type of data we dealt with, I first generate a random dataset containing 25 patients followed for n days (n different for each patient, randomly chosen between 50 and 200), with a daily consumption value randomly selected between 0 and a maximum dose (randomly determined for each patient between 10 and 50).

Then we compute the cumulative consumption, i.e., the sum of all previous days.

# str_c() comes from stringr (loaded here explicitly)
library(stringr)

# initial parameters for simulating data
n_indiv <- 25
min_days <- 50
max_days <- 200
min_dose <- 10
max_dose <- 50

# list of ids
id_list <- str_c("i", 1:n_indiv)

# initializing empty table
my_data <- as.data.frame(matrix(NA, n_indiv, (max_days + 1)))
colnames(my_data) <- c("id", str_c("d", 1:max_days))
my_data$id <- id_list

# daily simulated data
set.seed(113)
for(i in 1:nrow(my_data)){
  # n days follow up
  n_days <- round(runif(1, min_days, max_days))
  # maximum dose for this patient
  dose <- round(runif(1, min_dose, max_dose))
  # random daily value between 0 and the patient's maximum dose
  my_data[i, 2:ncol(my_data)] <- c(runif(n_days, 0, dose), rep(NA, (max_days - n_days)))
}

# cumulative simulated data
my_cum_data <- my_data
for(i in 3:ncol(my_cum_data)){
  my_cum_data[[i]] <- my_cum_data[[i]] + my_cum_data[[i - 1]]
}

    Our plots use the legend.col function found here: https://aurelienmadouasse.wordpress.com/2012/01/13/legend-for-a-continuous-color-scale-in-r/
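That helper is not reproduced in this post, so here is a minimal sketch of a legend.col-style function (a vertical color bar with value labels drawn in the right margin), just so the plotting function below can be run; it is an assumption, not the published implementation:

legend.col <- function(col, lev) {
  # draw a thin vertical color bar just outside the right edge of the plot region
  n <- length(col)
  usr <- par("usr")                    # current plot limits: x1, x2, y1, y2
  bar_w <- (usr[2] - usr[1]) / 40      # bar width relative to the x range
  x_left <- usr[2] + bar_w             # start a little to the right of the plot
  y_breaks <- seq(usr[3], usr[4], length.out = n + 1)
  for (i in seq_len(n)) {
    rect(x_left, y_breaks[i], x_left + bar_w, y_breaks[i + 1],
         col = col[i], border = NA, xpd = TRUE)
  }
  # label a few values of lev along the bar
  at <- seq(usr[3], usr[4], length.out = 5)
  labs <- round(seq(min(lev), max(lev), length.out = 5))
  text(x_left + 2.5 * bar_w, at, labels = labs, cex = 0.7, xpd = TRUE)
}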

    Here is the longitudinal heat plot function:

# Color legend on the top
long_heat_plot <- function(my_data, cutoff, xmax) {
  # my_data: longitudinal data with one line per individual, 1st column with id,
  #          and then one column per unit of time
  # cutoff: cutoff value for color plot, all values above cutoff are same color
  # xmax: x axis max value
  n_lines <- nrow(my_data)
  line_count <- 1
  # color scale
  COLS <- rev(heat.colors(cutoff))
  # plotting area
  par(oma = c(1, 1, 4, 1), mar = c(2, 2, 2, 4), xpd = TRUE)
  # plot init
  plot(1, 1, xlim = c(0, xmax), ylim = c(1, n_lines), pch = '.',
       ylab = 'Individual', xlab = 'Time unit', yaxt = 'n', cex.axis = 0.8)
  # plot line for each woman one at a time
  for (i in 1 : n_lines) {
    # get id
    id1 <- my_data$id[i]
    # get trajectory data maxed at max_val
    id_traj <- my_data[my_data$id == id1, 2:ncol(my_data)]
    # get last day
    END <- max(which(!is.na(id_traj)))
    # plot dotted line
    x1 <- 1:xmax
    y1 <- rep(i, xmax)
    lines(x1, y1, lty = 3)
    for (j in 1 : (ncol(my_data) - 1)) {
      # trim traj to max val
      val <- min(id_traj[j], cutoff)
      # plot traj
      points(j, i, col = COLS[val], pch = 20, cex = 1)
    }
    # add limit line
    points((END + 1), i, pch = "|", cex = 0.9)
  }
  # add legend
  legend.col(col = COLS, lev = 1:cutoff)
  mtext(side = 3, line = 3, 'unit of measurement')
  opar <- par()
  text_size <- opar$cex.main
}

    Then we generate the corresponding plots:

    long_heat_plot(my_data, 50, 200)

    long_heat_plot(my_cum_data, 5000, 200)

    And here is what we did in our study:


    To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Introducing the HCmodelSets Package

    Sat, 08/04/2018 - 17:17

    (This article was first published on R – insightR, and kindly contributed to R-bloggers)

    By Henrique Helfer Hoeltgebaum

    Introduction

    I am happy to introduce the package HCmodelSets, which is now available on CRAN. This package implements the methods proposed by Cox, D.R. and Battey, H.S. (2017). In particular it performs the reduction, exploratory and model selection phases given in the aforementioned reference. The software supports linear regression, likelihood-based fitting of generalized linear regression models and the proportional hazards model fitted by partial likelihood.

The standard method described in the literature to deal with sparse regression is the LASSO proposed by Tibshirani (1996), which assumes sparsity of the effects. This results in a single model, leaving open the possibility that other sparse choices of explanatory features fit equally well. To explore these other sparse possibilities, Cox, D.R. and Battey, H.S. (2017) provide methods that specify alternative models that are essentially as descriptive as the models fitted using LASSO. The key idea of Cox, D.R. and Battey, H.S. (2017) is to organize the variable indexes into a hypercube and then run regressions over rows and columns (and equivalently in higher dimensions). Significant variables in the hypercube are retained and then organized into a lower-dimensional one. This process is repeated until a user-specified low dimension is reached, and the procedure returns the number of times each variable is considered significant at each dimension across all models fitted. Such a strategy fits numerous combinations of sparse models, which allows for the separate analysis of subsets of explanatory variables.

    Constructing sets of well-fitting models with HCmodelSets package

Now let us make use of the R package HCmodelSets; its functions will be used, and detailed, to estimate model sets when dealing with a high number of explanatory variables. Note that different response variables, namely binary and continuous, have distinct statistical properties which, under some mild conditions, were recently discussed in Cox, D.R. and Battey, H.S. (2018).

We demonstrate the procedure of Cox, D.R. and Battey, H.S. (2017) using simulated data to recover the true model. Consider a continuous variable which depends on only a small number of variables. First, we create this Data Generating Process (DGP), assuming continuous values, using the DGP function.

    library("HCmodelSets") dgp = DGP(s=5, a=3, sigStrength=1, rho=0.9, n=100, intercept=5, noise=1, var=1, d=1000, DGP.seed = 2019)

The DGP sample size is 100 and, out of 1000 variables, only 5 are relevant. To see which ones are relevant, just use the TRUE.idx value. Next we perform the reduction phase using the first 70 observations, which successively discards variables according to appropriate decision rules. At the end, it provides the number and indexes of variables selected at each stage.

    outcome.Reduction.Phase = Reduction.Phase(X=dgp$X[1:70,],Y=dgp$Y[1:70], family=gaussian, seed.HC = 1093)

Also, one can use the vector.signif argument to supply a vector of p-values defining one’s own decision rule for each reduction of the hypercube. If it is not provided, as in the example above, the default option selects the two most significant variables in each traversal of the hypercube for the highest-dimensional one and then adopts p-values of 0.01 for the subsequent lower dimensions. Also, since the response is continuous, family = gaussian must be specified. In the example above we did not provide the dimension of the hypercube to be used in the first-stage reduction; however, this could easily be done using the dmHC argument.

    The next step is to perform the exploratory phase on the variables retained through the reduction phase in the same 70 observations, which will return any significant squared and/or interaction terms. We opt to use the variables in the hypercube of dimension 2 that were selected at least once as this retains 25 variables.

idxs = outcome.Reduction.Phase$List.Selection$`Hypercube with dim 2`$numSelected1
outcome.Exploratory.Phase = Exploratory.Phase(X = dgp$X[1:70, ], Y = dgp$Y[1:70],
                                              list.reduction = idxs,
                                              family = gaussian, signif = 0.01)

    At the end, only one squared variable is relevant, and will be used to construct sets of well-fitting models using the remaining 30 observations.

sq.terms = outcome.Exploratory.Phase$mat.select.SQ
in.terms = outcome.Exploratory.Phase$mat.select.INTER
MS = ModelSelection.Phase(X = dgp$X[71:100, ], Y = dgp$Y[71:100],
                          list.reduction = idxs,
                          sq.terms = sq.terms, in.terms = in.terms, signif = 0.01)

Note that since we did not specify the modelSize argument, the procedure adopts the minimum of 26 (the 25 variables retained in the reduction phase plus the 1 term from the exploratory phase) and 5, i.e. min(26, 5) = 5. Since we specified s=5 in the DGP function, i.e., the true number of variables that generates the process, we will analyze the results for models of size 5.
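If larger candidate models were of interest, the modelSize argument could be set explicitly. A hedged sketch, where the value 7 is arbitrary:

MS.larger = ModelSelection.Phase(X = dgp$X[71:100, ], Y = dgp$Y[71:100],
                                 list.reduction = idxs,
                                 sq.terms = sq.terms, in.terms = in.terms,
                                 signif = 0.01, modelSize = 7)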

TRUE.idx = dgp$TRUE.idx
sets = sort(table(MS$goodModels$`Model Size 5`), decreasing = TRUE)

### frequency table of variables selected
print(sets)
##
##    379    479    732    438    613    152    901    333    596    408
##  12650  12650  12649  11710  10753   7810   7809   7787   7716   7491
##    756     24    123    366    986    670 333 ^2    335    200    661
##   7283   7276   7257   7233   7193   7160   7119   7108   7099   7094
##    634    171    844     20    225    909
##   7066   7048   7032   7017   7002   6978

### intersection between the true model and the variables obtained in the Model Selection phase
intersect(TRUE.idx, names(sets))
## [1] "152" "379" "479" "613" "901"

Note that the true variables are fully recovered at the end, along with a few other variables and non-linear effects which might also be relevant for fitting the observed process.

    References

    Cox, D. R., & Battey, H. S. (2017). Large numbers of explanatory variables, a semi-descriptive analysis. Proceedings of the National Academy of Sciences, 114(32), 8592-8595.

    Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288.

    Battey, H. S., & Cox, D. R. (2018). Large numbers of explanatory variables: a probabilistic assessment. Proc. R. Soc. A, 474(2215), 20170631.

    Hoeltgebaum HH (2018). HCmodelSets: Regression with a Large Number of Potential Explanatory Variables. R package version 0.1.1, URL https://CRAN.R-project.org/package=HCmodelSets.


    To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Video: How to run R and Python in SQL Server from a Jupyter notebook

    Fri, 08/03/2018 - 18:53

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

Did you know that you can run R and Python code within a SQL Server instance? Not only does this give you a powerful server to process your data science calculations, but it makes things faster by eliminating the need to move data between client and server. Best of all, you can access this power via your favourite interface on your local machine, like the RStudio IDE for R or the VS Code Python extensions.

To make this work, all you need are the functions to launch computations in SQL Server, provided in the RevoScaleR package for R and the revoscalepy library for Python. The video above demonstrates how to set things up for Python; R users should check out the tutorial Learn in-database analytics using R in SQL Server.
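For R, the general pattern is to point a compute context at the SQL Server instance and then run rx* functions against an in-database data source. A minimal sketch, assuming SQL Server Machine Learning Services is configured; the connection string, table, and column names are placeholders:

library(RevoScaleR)  # ships with Microsoft R Client / SQL Server ML Services

conn <- "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=True;"

# send subsequent rx* computations to the SQL Server instance
rxSetComputeContext(RxInSqlServer(connectionString = conn))

# an in-database table as the data source
flights <- RxSqlServerData(table = "dbo.flights", connectionString = conn)

# the model is fitted inside SQL Server; only the summary comes back
fit <- rxLinMod(ArrDelay ~ DepDelay, data = flights)
summary(fit)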

     


    To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Stats Note: Making Sense of Open-Ended Responses with Text Analysis

    Fri, 08/03/2018 - 18:02

    (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Using Text Mining on Open Ended Items

Good survey design is both art and science. You have to think about how people will read and process your questions, and what sorts of responses might result from different question forms and wording. One of the big rules I follow in survey design is that you don’t assess any of your most important topics with an open-ended item. Most people skip them, because they’re more work than selecting options from a list, and the people who do complete them may give you terse, unhelpful, or gibberish answers.

    When I was working on a large survey among people with spinal cord injuries/disorders, the survey designers decided to assess the exact details of the respondent’s spinal injury or disorder with an open-ended question so that people could describe it “in their own words.” As you might have guessed, the data were mostly meaningless and as such unusable. But many hypotheses and research questions dealt with looking at differences between people who sustained an injury and those with a disorder affecting their spinal cord. We couldn’t even begin to test or examine any of those in the course of our study. It was unbelievably frustrating, because we could have gotten the information we needed with a single question and some categorical responses. We could have then asked people to supplement their answer with the open-ended question. Most would skip it, but we’d have the data we needed to answer our questions.

    In my job, I inherited the results of a large survey involving a variety of dental professional populations. Once again, certain items that could have been addressed with a few close-ended questions were instead open-ended questions and not many of the responses are useful. The item that inspired this blog post assessed the types of products dental assistants are involved in purchasing, which can include anything from office supplies to personal protective equipment to large equipment (X-ray machine, etc.). Everyone had a different way of articulating what they were involved with purchasing, some simply saying “all dental supplies” or “everything,” while others gave more specific details. In total, about 400 people responded to this item, which is a lot of data to dig through. But thanks to my new experience with text mining in R, I was able to try to make some sense of responses. Mind you, a lot of the responses can’t be understood, but it’s at least something.

    I decided I could probably begin to categorize responses by slicing the responses up into the individual words. I can generate counts overall as well as by respondent, and use this to begin examining and categorizing the data. Finally, if I need some additional context, I can ask R to give me all responses that contain a certain word.

    Because I don’t own these data, I can’t share them on my blog. But I can demonstrate with different text data to show what I did. To do this, I’ll use one of my all-time favorite books, The Wonderful Wizard of Oz by L. Frank Baum, which is available through Project Gutenberg. The gutenbergr package will let me download the full-text.

    library(gutenbergr)
    gutenberg_works(title == "The Wonderful Wizard of Oz")
    ## # A tibble: 1 x 8
    ## gutenberg_id title author gutenberg_autho~ language gutenberg_books~
## <int> <chr> <chr> <int> <chr> <chr>
    ## 1 55 The Wo~ Baum, L~ 42 en Children's Lite~
## # ... with 2 more variables: rights <chr>, has_text <lgl>

Now we have the gutenberg_id, which will be used to download the full text into a data frame.

    WOz <- gutenberg_download(gutenberg_id = 55)
    ## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
    ## Using mirror http://aleph.gutenberg.org
    WOz$line <- row.names(WOz)

    To start exploring the data, I’ll need to use the tidytext package to unnest tokens (words) and remove stop words that don’t tell us much.

    library(tidyverse)
    library(tidytext)
    tidy_oz <- WOz %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words)
    ## Joining, by = "word"

    Now I can begin to explore my “responses” by generating overall counts and even counts by respondent. (For the present example, I’ll use my line number variable instead of the respondent id number I used in the actual dataset.)

    word_counts <- tidy_oz %>%
    count(word, sort = TRUE)
    resp_counts <- tidy_oz %>%
    count(line, word, sort = TRUE) %>%
    ungroup()
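To actually eyeball those counts, one can print the top rows or throw them into a quick bar chart. This is just a minimal sketch using the word_counts object from above; the plot itself isn’t part of the original analysis.

# Ten most frequent words across the whole book
head(word_counts, 10)

# The same information as a quick bar chart (ggplot2 is loaded with the tidyverse)
word_counts %>%
  top_n(10, n) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Word count")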

    When I look at my overall counts, I see that the most frequently used words are the main characters of the book. So from this, I could generate a category “characters.” Words like “Kansas”, “Emerald” and “City” are also common, and I could create a category called “places.” Finally, “heart” and “brains” are common – a category could be created to encompass what the characters are seeking. Obviously, this might not be true for every instance. It could be that someone “didn’t have the heart” to tell someone something. I can try to separate out those instances by looking at the original text.

    heart <- WOz[grep("heart",WOz$text),]
    head(heart)
    ## # A tibble: 6 x 3
    ## gutenberg_id text line
## <int> <chr> <chr>
    ## 1 55 happiness to childish hearts than all other human cr~ 48
    ## 2 55 the heartaches and nightmares are left out. 62
    ## 3 55 and press her hand upon her heart whenever Dorothy's~ 110
    ## 4 55 strange people. Her tears seemed to grieve the kind~ 375
    ## 5 55 Dorothy ate a hearty supper and was waited upon by t~ 517
    ## 6 55 She ate a hearty breakfast, and watched a wee Munchk~ 545

Unfortunately, this got me any line containing a word that starts with “heart,” like “hearty” and “heartache.” Let’s rewrite that command with word boundaries (\\b) so it only matches “heart” as a whole word.

    heart <- WOz[grep("\\bheart\\b",WOz$text),]
    head(heart)
    ## # A tibble: 6 x 3
    ## gutenberg_id text line
## <int> <chr> <chr>
    ## 1 55 and press her hand upon her heart whenever Dorothy's~ 110
    ## 2 55 "\"Do you suppose Oz could give me a heart?\"" 935
    ## 3 55 brains, and a heart also; so, having tried them both~ 975
    ## 4 55 "rather have a heart.\"" 976
    ## 5 55 grew to love her with all my heart. She, on her par~ 992
    ## 6 55 now no heart, so that I lost all my love for the Mun~ 1024

    Now I can use this additional context to determine which instances of “heart” are about the Tin Man’s quest.
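One way to code up that categorization is with a small hand-built lookup table joined onto the tokens. The word lists and category labels below are only illustrative, not the ones used for the actual survey:

# Illustrative word-to-category lookup (unnest_tokens lowercases words by default)
categories <- tibble(
  word     = c("dorothy", "scarecrow", "lion", "kansas", "emerald", "city", "heart", "brains"),
  category = c("characters", "characters", "characters", "places", "places", "places", "quests", "quests")
)

# Attach a category to every token that appears in the lookup
categorized <- tidy_oz %>%
  inner_join(categories, by = "word")

# Category counts overall and per line ("respondent")
categorized %>% count(category, sort = TRUE)
categorized %>% count(line, category, sort = TRUE)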

It seems like these tools would be most useful for open-ended responses that are still very categorical, but they could also help with determining themes in more narrative data.
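Translated back to a survey setting, the same pipeline might look something like the sketch below. The survey data frame and its resp_id and response columns are made-up stand-ins, since the real data can’t be shared:

library(tidyverse)
library(tidytext)

# Hypothetical survey responses, one row per respondent
survey <- tibble(
  resp_id  = 1:3,
  response = c("all dental supplies", "gloves, masks, and office supplies", "everything")
)

# Slice the responses into words and drop stop words
tidy_survey <- survey %>%
  unnest_tokens(word, response) %>%
  anti_join(stop_words, by = "word")

# Word counts overall and by respondent
tidy_survey %>% count(word, sort = TRUE)
tidy_survey %>% count(resp_id, word, sort = TRUE)

# Pull the full responses containing a given word, for context
survey[grep("\\bsupplies\\b", survey$response), ]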


    To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    The program for uRos2018 is online

    Fri, 08/03/2018 - 12:55

    (This article was first published on R – Mark van der Loo, and kindly contributed to R-bloggers)

    The uRos2018 conference is aimed at professionals and academics who are involved in producing or consuming official (government) statistics.

We are happy to announce that we recently posted the full program of the 6th International Conference on the Use of R in Official Statistics (uRos2018) on our website.

    In summary:
    • Six tutorials in the areas of
      • Data Cleaning
      • Network Analyses
      • Survey Estimation
      • Data manipulation with data.table
      • Analyzing spatial data
      • Visualizing spatial data.
    • Two keynote speakers:
      • Alina Matei, professor of statistics at the University of Neuchatel and maintainer of the sampling package.
      • Jeroen Ooms, R superprogrammer and maintainer of R and Rtools for Windows (UC Berkeley)
• Eleven sessions of contributed talks, with five presentations from all over the world.
• One session devoted to the results of a two-day unconf held prior to the conference.
    • One social dinner
    • Two journals will devote a special topic to the conference.

    All the abstracts will be published online soon.

    Registration is still open
    • You are welcome to register
    • Follow us on twitter for the latest news and updates!

    To leave a comment for the author, please follow the link and comment on their blog: R – Mark van der Loo. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    World Cup prediction winners

    Fri, 08/03/2018 - 02:00

    (This article was first published on R blog posts on sandsynligvis.dk, and kindly contributed to R-bloggers)

Predicting how the different teams would fare in the FIFA World Cup has been of great interest to the general public, and it also attracted quite some attention in the R community. The World Cup has ended, and by now everyone knows that France managed to take home the trophy that slipped through their fingers when they hosted the UEFA Euro 2016 championship. But who won the more important competition of predicting the outcome?



    To leave a comment for the author, please follow the link and comment on their blog: R blog posts on sandsynligvis.dk. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
