R-bloggers: R news and tutorials contributed by 750 R bloggers

R 3.5.2 now available

Thu, 12/20/2018 - 22:37

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

R 3.5.2, the latest version of the R language for statistical computation and graphics from the R Foundation, was released today. (This release is codenamed "Eggshell Igloo", likely in reference to this or this Peanuts cartoon.) Compared to R 3.5.1, this update includes only bug fixes, so R scripts and packages compatible with R 3.5.0 or R 3.5.1 should work without modification.

Installers for R on Windows and Debian Linux are available now; other platforms should be available from CRAN within the next few days. If you want to see the full list of fixes, check out the announcement from the R Core Group at the link below.

R-announce mailing list: R 3.5.2 is released


Your AI journey… and Happy Holidays!

Thu, 12/20/2018 - 11:31

(This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

I want to draw your attention to a very valuable (and short!) whitepaper from my colleague, Professor Andrew Ng, where he shares important insights on how to lead companies into the AI era.

The five steps outlined in the paper are:

  1. Execute pilot projects to gain momentum
  2. Build an in-house AI team
  3. Provide broad AI training
  4. Develop an AI strategy
  5. Develop internal and external communications

Many points raised in the paper coincide with my own consulting experience in different industries: AI Transformation Playbook

I also recently gave an interview on the same topic, you can find it here:
Artificial intelligence: What is the best way for companies to innovate?

Happy reading!

If you need qualified support/consulting for your own AI journey, especially for crafting an AI strategy, setting up and managing AI projects, building an AI organization, and planning/conducting AI coaching, please contact me here: Prof. Dr. Holger K. von Jouanne-Diedrich

…and now for something completely different: like every year, Christmas arrives totally unexpectedly!

So, I wish you a Merry Christmas/Happy Holidays (whichever applies) with the following little ctree function:

# ctree: prints a Christmas tree with number of branch levels N
ctree <- function(N = 7) {
  for (i in 1:N) cat(rep("", N-i+1), rep("*", i), "\n")
  cat(rep("", N), "*\n")
}

ctree(5)
##      *
##     * *
##    * * *
##   * * * *
##  * * * * *
##      *

ctree()
##        *
##       * *
##      * * *
##     * * * *
##    * * * * *
##   * * * * * *
##  * * * * * * *
##        *

ctree(9)
##          *
##         * *
##        * * *
##       * * * *
##      * * * * *
##     * * * * * *
##    * * * * * * *
##   * * * * * * * *
##  * * * * * * * * *
##          *


Day 20 – little helper char_replace

Thu, 12/20/2018 - 08:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it would be much easier to share and improve those functions if they were within a package. Until the 24th of December I will present one function from helfRlein each day. So, on the 20th day of Christmas my true love gave to me…

What can it do?

This little helper replaces non-standard characters (such as the German umlaut "ä") with their standard equivalents (in this case "ae"). It is also possible to force all characters to lower case, trim whitespace, or replace whitespace and dashes with underscores.
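To give a rough idea of the behaviour, here is a minimal base-R sketch (this is not the helfRlein implementation; the argument names simply mirror the real function as used below, and the character map is deliberately incomplete):

char_replace_sketch <- function(x, to_lower = FALSE, to_underscore = FALSE,
                                rm_space = FALSE, rm_dash = FALSE) {
  # map a few common non-standard characters to standard equivalents
  subs <- c("ä" = "ae", "ö" = "oe", "ü" = "ue", "ß" = "ss",
            "É" = "E", "é" = "e", "è" = "e", "ë" = "e",
            "á" = "a", "à" = "a", "í" = "i", "ó" = "o")
  for (i in seq_along(subs)) x <- gsub(names(subs)[i], subs[[i]], x, fixed = TRUE)
  x <- trimws(x)                                     # trim surrounding whitespace
  if (to_lower)      x <- tolower(x)
  if (to_underscore) x <- gsub("[ -]", "_", x)       # spaces and dashes -> underscores
  if (rm_space)      x <- gsub(" ", "", x, fixed = TRUE)
  if (rm_dash)       x <- gsub("-", "", x, fixed = TRUE)
  x
}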

How to use it?

Let's look at a small example with different settings:

x <- " Élizàldë-González Strasse" char_replace(x, to_lower = TRUE) [1] "elizalde-gonzalez strasse" char_replace(x, to_lower = TRUE, to_underscore = TRUE) [1] "elizalde_gonzalez_strasse" char_replace(x, to_lower = FALSE, rm_space = TRUE, rm_dash = TRUE) [1] "ElizaldeGonzalezStrasse" Overview

To see all the other functions you can either check out our GitHub or you can read about them here.

Have a merry advent season!

About the author

Jakob Gepp

Numbers were always my passion, and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

ABOUT US

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


The post Day 20 – little helper char_replace first appeared on STATWORX.


BH 1.69.0-0 pre-releases and three required changes

Thu, 12/20/2018 - 04:39

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Our BH package provides a sizeable portion of the Boost C++ libraries as a set of template headers for use by R. It is quite popular, and frequently used together with Rcpp. The BH CRAN page shows e.g. that it is used by rstan, dplyr as well as a few other packages. The current count of reverse dependencies is at 159.

Boost releases a new version every four months. The last release we packaged was 1.66 from February—and a new Boost 1.69 just came out. So I packaged it, being somewhat careful as usual, as CRAN insists on the suppression of diagnostic #pragma statements and a few other things; see the BH GitHub repo for details.

Given the rather ginormous footprint of BH — in this release about 154 MB installed — I am proceeding carefully as usual. I started with two full reverse-depends checks on the weekend, which led to three regressions; I will contact the two corresponding maintainers (the third affected package is one of mine). Details follow. In short, we should be in good shape to release in due course, and details on how to access the package now for testing and local use are below.

Upgrading BH on CRAN from 1.66 to 1.69

We ran two initial reverse-dependency checks, with results pushed to the usual repo. The final summary is in this file, with the money line being

141 successes, 10 failures, and 8 skipped packages.

Of the 10 failures, four are due to missing Depends or Suggests packages: these packages passed once those were installed. Of the remaining six, three are recurring issues we also have with e.g. Rcpp. That leaves three new regressions, or as CRAN calls it, ‘change to worse’. We also verified that these packages do indeed pass with the CRAN version of BH, i.e., 1.66.

What follows is a short discussion of each respective package.

First regression: phonics

The package uses

using namespace Rcpp;
using namespace boost;
using namespace std;

which flattens namespaces—and with Boost 1.69 now gets a collision as distance() is no longer unique. Simply prefixing the four or five occurrences with std:: fixes it. The complete (and very short) patch follows.

diff -ru phonics.orig/src/metaphone.cpp phonics/src/metaphone.cpp
--- phonics.orig/src/metaphone.cpp      2018-08-16 19:05:22.000000000 +0000
+++ phonics/src/metaphone.cpp   2018-12-20 03:31:06.325881441 +0000
@@ -151,13 +151,13 @@
             break;
         case 'G':
             if (nc == 'H') {
-                if(!(is("BDH", at(word, distance(word.begin(), i) - 3)) ||
-                     at(word, distance(word.begin(), i) - 4) == 'H')) {
+                if(!(is("BDH", at(word, std::distance(word.begin(), i) - 3)) ||
+                     at(word, std::distance(word.begin(), i) - 4) == 'H')) {
                     meta += 'F';
                     i++;
                 }
             } else if(nc == 'N') {
-                if (is(alpha, nnc) && substr(word, distance(word.begin(), i) + 1, 3) != "NED") {
+                if (is(alpha, nnc) && substr(word, std::distance(word.begin(), i) + 1, 3) != "NED") {
                     meta += 'K';
                 }
             } else if(is(soft, nc) && pc != 'G') {
@@ -187,7 +187,7 @@
             } else if(nc == 'H') {
                 meta += 'X';
                 i += 1;
-            } else if(!traditional && substr(word, distance(word.begin(), i) + 1, 3) == "CHW") {
+            } else if(!traditional && substr(word, std::distance(word.begin(), i) + 1, 3) == "CHW") {
                 meta += 'X';
                 i += 2;
             } else {
@@ -200,7 +200,7 @@
             } else if(nc == 'H') {
                 meta += '0';
                 i += 1;
-            } else if(substr(word, distance(word.begin(), i) + 1, 2) != "CH") {
+            } else if(substr(word, std::distance(word.begin(), i) + 1, 2) != "CH") {
                 meta += 'T';
             }
             break;

Second regression: RcppStreams

Here we end up with a linking error. The following symbol, expanded with c++filt, is coming up as missing:

$ c++filt _ZN5boost5proto6detail17display_expr_implC1ERKS2_
boost::proto::detail::display_expr_impl::display_expr_impl(boost::proto::detail::display_expr_impl const&)
$

I am the upstream author/maintainer here. This looks somewhat tricky, but a quick commenting-out of an optional / verbose display call does the trick. So that is what we may end up doing.

Third regression: TDA

This generated a lot of messages with error: ‘next’ is not a member of ‘boost’ when boost::next(somevar) was invoked. A quick Google search suggests that the declaration moved to another header file and that including boost/next_prior.hpp should fix it — and it does. One-line patch follows.

diff -ru TDA.orig/src/diag.cpp TDA/src/diag.cpp
--- TDA.orig/src/diag.cpp       2018-08-06 00:09:20.000000000 +0000
+++ TDA/src/diag.cpp    2018-12-20 03:48:10.404143947 +0000
@@ -5,6 +5,8 @@
 // for Rcpp
 #include
+#include <boost/next_prior.hpp>
+
 
 // for Rips
 #include
 #include

That may not be the best place for the include but it demonstrated that we simply need to add it somewhere so that diag.cpp compiles.

Package Access

We installed the release candidate package in the ghrr drat repo. To install BH 1.69.0-0, either install the drat package and set the repo as described at the ghrr drat repo, or just do

install.packages("BH", repo="http://ghrr.github.io/drat/")

This will allow for continued testing of BH 1.69.0 before we upload it to CRAN, probably early January.

Comments and suggestions about BH are welcome via the mailing list or the issue tracker at the GitHub repo.

This post by [Dirk Eddelbuettel](http://dirk.eddelbuettel.com)
originated on his [Thinking inside the box](http://dirk.eddelbuettel.com/blog/) blog.
Please report excessive re-aggregation in third-party for-profit settings.


Examining the Tweeting Patterns of Prominent Crossfit Gyms

Thu, 12/20/2018 - 03:16

(This article was first published on R – Mathew Analytics, and kindly contributed to R-bloggers)

A. Introduction

The growth of Crossfit has been one of the biggest developments in the fitness industry over the past decade. Promoted as both a physical exercise philosophy and a competitive fitness sport, Crossfit is a high-intensity fitness program incorporating elements from several sports and exercise protocols such as high-intensity interval training, Olympic weightlifting, plyometrics, powerlifting, gymnastics, strongman, and so forth. Now with over 10,000 Crossfit-affiliated gyms (boxes) throughout the United States, the market has certainly become more saturated, and gyms must employ more distinctive marketing strategies to attract new members. In this post, I will investigate how some prominent Crossfit boxes are utilizing Twitter to engage with consumers. While Twitter is a great platform for news and entertainment, it is usually not the place for customer acquisition given the lack of targeted messaging. Furthermore, unlike platforms like Instagram, Twitter is simply not an image/video-centric tool where followers can view accomplishments from their favorite fitness heroes, witness people achieving their goals, and so forth. Given these shortcomings, I wanted to understand how some prominent Crossfit boxes are actually using their Twitter accounts. In this post, I will investigate tweeting patterns and content characteristics, perform sentiment analysis, and build a predictive model for retweets.

B. Extract Data From Twitter

We begin by extracting the desired data from Twitter using the rtweet package in R. There are six prominent Crossfit gyms whose entire Twitter timelines we will use. To get this data, I looped through a vector containing each of their Twitter handles and used the get_timeline function to pull the desired data. Notice that there is a user-defined function called add_engineered_features that is used to add a number of extra date columns. That function is available on my GitHub page here.

library(rtweet)
library(lubridate)
library(devtools)
library(data.table)
library(ggplot2)
library(hms)
library(scales)

# set working directory
setwd("~/Desktop/rtweet_crossfit")

final_dat.tmp <- list()
cf_gyms <- c("reebokcrossfit5", "crossfitmayhem", "crossfitsanitas",
             "sfcrossfit", "cfbelltown", "WindyCityCF")

for(each_box in cf_gyms){
  message("Getting data for: ", each_box)
  # this call was garbled during extraction; get_timeline() is the function named in
  # the text, and n = 3200 (the API maximum per timeline) is an assumed argument
  each_cf <- get_timeline(each_box, n = 3200) %>% data.table()
  each_cf$crossfit_name <- each_box
  suppressWarnings(
    add_engineered_dates(each_cf, date_col = "created_at")
  )
  final_dat.tmp[[each_box]] <- each_cf
  message("")
}

final_dat <- rbindlist(final_dat.tmp)
final_dat$contains_hashtags <- ifelse(!is.na(final_dat$hashtags), 1, 0)
final_dat$hashtags_count <- lapply(final_dat$hashtags, function(x) length(x[!is.na(x)]))

C. Exploratory Data Analysis

Let us start by investigating this data set to get a better understanding of trends and patterns across these various Crossfit boxes. The important thing to note is that not all of these Twitter accounts are currently active. We can see that crossfitmayhem, sfcrossfit, and WindyCityCF are the only ones that remain active.

final_dat[, .(min_max = range(as.Date(created_at))), by=crossfit_name][,label := rep(c("first_tweet","last_tweet"))][]

C1. Total Number of Tweets

Sfcrossfit, which is the oldest of these gyms, has the highest number of tweets. However, when looking at the total number of tweets per active month, they were less active than two other gyms.

## total number of tweets
p1 = final_dat[, .N, by=crossfit_name] %>%
  ggplot(., aes(x=reorder(crossfit_name, N), y=N)) +
  geom_bar(stat='identity', fill="steelblue") +
  coord_flip() +
  labs(x="", y="") +
  ylim(0,3000) +
  ggtitle("Total Number of Tweets") +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"))

## number of tweets per active month
p2 = final_dat[, .(.N,
                   start = lubridate::ymd_hms(min(created_at)),
                   months_active = lubridate::interval(lubridate::ymd_hms(min(created_at)), Sys.Date()) %/% months(1)),
               by=crossfit_name][, .(tweets_per_month = N/months_active), by=crossfit_name] %>%
  ggplot(., aes(x=reorder(crossfit_name, tweets_per_month), y=tweets_per_month)) +
  geom_bar(stat='identity', fill="steelblue") +
  coord_flip() +
  labs(x="", y="") +
  ylim(0,32) +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black")) +
  ggtitle("Total Number of Tweets per Active Month")

## add both plots to a single pane
grid.arrange(p1, p2, nrow=1)

C2. Total Number of Tweets Over Time

The time series of the total number of tweets by month shows that each gym had one or two peaks from 2012 through 2016 when they were aggressively sharing content with their followers. However, over the past two years, each gym has reduced its Twitter usage significantly.

## total number of tweets by month
final_dat[, .N, by = .(crossfit_name, created_at_YearMonth)][order(crossfit_name, created_at_YearMonth)][, created_at_YearMonth := lubridate::ymd(paste(created_at_YearMonth, "-01"))] %>%
  ggplot(., aes(created_at_YearMonth, N, colour=crossfit_name)) +
  geom_line(group=1, lwd=0.6) +
  facet_wrap(~crossfit_name) +
  labs(x="", y="") +
  theme(legend.position="none") +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ggtitle("Total Number of Tweets")

## total number of tweets by year
ggplot(data = final_dat,
       aes(lubridate::month(created_at, label=TRUE, abbr=TRUE),
           group=factor(lubridate::year(created_at)),
           color=factor(lubridate::year(created_at)))) +
  geom_line(stat="count") +
  geom_point(stat="count") +
  facet_wrap(~crossfit_name) +
  labs(x="", colour="Year") +
  xlab("") + ylab("") +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ggtitle("Total Number of Tweets by Year")

C3. Tweeting Volume by Year, Month, and Day

For each Crossfit gym, I plotted the volume of tweets by year, month, and day. Oddly enough, there really are not any noticeable patterns in these charts.

## years with the highest number of tweets
ggplot(final_dat, aes(created_at_Year)) +
  geom_bar(fill="steelblue") +
  facet_wrap(~crossfit_name) +
  labs(x="", y="") +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ylim(0,800) +
  ggtitle("Total Number of Tweets by Year")

## months with the highest number of tweets
final_dat[, created_at_YearMonth2 := lubridate::ymd(paste(created_at_YearMonth, "-01"))][] %>%
  ggplot(., aes(lubridate::month(created_at_YearMonth2, label=TRUE, abbr=TRUE))) +
  geom_bar(fill="steelblue") +
  facet_wrap(~crossfit_name) +
  labs(x="", y="") +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ylim(0,500) +
  ggtitle("Total Number of Tweets by Month")

## days with the highest number of tweets
final_dat[, created_at_YearMonth2 := lubridate::wday(lubridate::ymd(paste(created_at_YearMonth, "-01")), label=T)][] %>%
  ggplot(., aes(created_at_YearMonth2)) +
  geom_bar(fill="steelblue") +
  facet_wrap(~crossfit_name) +
  labs(x="", y="") +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ylim(0,600) +
  ggtitle("Total Number of Tweets by Day")

C4. Tweeting Volume by Time of Day

For most of these gyms, we see that the majority of their tweets came during the second half of the day or early in the morning. Given that most Crossfit boxes have their first classes at around 5am or 6am, and are usually busiest in the evening, this indicates that these businesses are tweeting during the hours when their facility is busiest.

## tweet density over the day
ggplot(data = final_dat) +
  geom_density(aes(x = created_at_Time, y = ..scaled..), fill="steelblue", alpha=0.3) +
  xlab("Time") + ylab("Density") +
  scale_x_datetime(breaks = date_breaks("2 hours"), labels = date_format("%H:%M")) +
  facet_wrap(~crossfit_name) +
  labs(x="", y="") +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x=element_text(angle=90,hjust=1)) +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ggtitle("Tweet Density by Time of Day")

C5. Source of Tweets

While a couple of the gyms are using a marketing management platform such as Hootsuite or IFTTT, the majority of tweets are coming from a Twitter application or some other social media tool such as Facebook or Instagram.

## source of tweets
final_dat[, .N, by = .(crossfit_name, source)][order(crossfit_name,-N)][, head(.SD, 3), by=crossfit_name] %>%
  ggplot(., aes(x=source, y=N)) +
  geom_bar(stat="identity", fill="steelblue") +
  coord_flip() +
  facet_wrap(~crossfit_name) +
  labs(x="", y="") +
  labs(x="", y="", colour="") +
  ylim(0,2500) +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black")) +
  ggtitle("Source of Tweets")

D. Content Characteristics

Let us investigate the characteristics of the content that is being tweeted. From average tweet length to the number of retweets, this information will further expand our understanding of how these prominent Crossfit boxes are utilizing their Twitter accounts.

D1. Frequency of Retweets and Original Tweets

Only one gym, cfbelltown, had a majority of its tweets being retweets. Furthermore, a plurality of tweets from crossfitmayhem were not original. In the plots showing retweets over time, we can also see that crossfitmayhem did a lot of retweeting from 2014 through 2016, but has pushed mostly original tweets since then.

## retweets and original tweets
final_dat[, .N, by=.(crossfit_name,is_retweet)][, percen := N/sum(N), by=.(crossfit_name)][is_retweet=="TRUE"] %>%
  ggplot(., aes(x=reorder(crossfit_name, N), y=percen)) +
  geom_bar(stat="identity", position ="dodge", fill="steelblue") +
  coord_flip() +
  labs(x="", y="", colour="") +
  ylim(0,0.8) +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black")) +
  ggtitle("Percentage of Tweets That Were Not Original")

# total number of retweets over time
ggplot(data = final_dat, aes(x = created_at, fill = is_retweet)) +
  geom_histogram(bins=48) +
  labs(x="", y="", colour="") +
  scale_fill_manual(values = c("steelblue2","steelblue"), name = "Retweet") +
  facet_wrap(~crossfit_name) +
  labs(x="", y="") +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ggtitle("Total Number of Retweets Over Time")

D2. Frequency of Favourited Tweets

Crossfitmayhem was the only box where a majority of the tweets were actually favorited at least once. Furthermore, it had a moderately large number of tweets that were favorited three or more times.

## amount of favorited tweets
final_dat[, .N, by=.(crossfit_name,favorite_count)][, percen := N/sum(N), by=.(crossfit_name)][favorite_count==0,] %>%
  ggplot(., aes(x=reorder(crossfit_name,percen), y=percen)) +
  geom_bar(stat="identity", fill="steelblue") +
  coord_flip() +
  labs(x="", y="", colour="") +
  ylim(0,1) +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black")) +
  ggtitle("Percentage of Tweets That Were Not Favorited")

## amount of favorited tweets by count
final_dat[, .N, by=.(crossfit_name,favorite_count)][, percen := N/sum(N), by=.(crossfit_name)][favorite_count <= 10,][order(crossfit_name,favorite_count)] %>%
  ggplot(., aes(x=reorder(favorite_count, -favorite_count), y=percen)) +
  geom_bar(stat="identity", fill="steelblue") +
  coord_flip() +
  labs(x="", y="", colour="") +
  facet_wrap(~crossfit_name) +
  ylim(0,1) +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ggtitle("Percentage of Tweets Based on Their Favorite Count")

D3. Use of Hashtags

Reebokcrossfit5 had the largest percentage of tweets that contained hashtags while crossfitsanitas and sfcrossfit rarely used hashtags in their tweets. And when it comes to the hashtags that are being used, it seems that these boxes are using the crossfit hashtag and the name of their own gym.

## contains hashtags
final_dat[, .N, by=.(crossfit_name,contains_hashtags)][, percen := N/sum(N), by=.(crossfit_name)][contains_hashtags==1] %>%
  ggplot(., aes(x=reorder(crossfit_name,percen), y=percen)) +
  geom_bar(stat="identity", fill="steelblue") +
  coord_flip() +
  labs(x="", y="", colour="") +
  ylim(0,0.8) +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black")) +
  ggtitle("Percentage of Tweets That Contain Hashtags")

## frequently used hashtags
final_dat[hashtags != "", ][,.N, by=.(crossfit_name, hashtags)][order(crossfit_name, -N)][, head(.SD, 3), by=crossfit_name] %>%
  ggplot(., aes(x = reorder(hashtags, -N), y=N)) +
  geom_bar(stat="identity", fill="steelblue") +
  coord_flip() +
  ylim(0,150) +
  facet_wrap(~crossfit_name) +
  labs(x="", y="", colour="") +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ggtitle("Most Commonly Used Hashtags")

D4. Tweet Length

The following chart shows summary statistics for the length of the tweets from each of the Crossfit boxes. crossfitsanitas and WindyCityCF seem to have the longest tweets on average, and sfcrossfit has the shortest.

## tweet length
melt(final_dat[, .(avg_length = mean(display_text_width, na.rm=T),
                   med_length = median(display_text_width, na.rm=T),
                   min_length = min(display_text_width, na.rm=T),
                   max_length = max(display_text_width, na.rm=T)),
               by=.(crossfit_name)],
     id="crossfit_name") %>%
  ggplot(., aes(x=reorder(crossfit_name, value), y=value)) +
  geom_bar(stat="identity", fill="steelblue") +
  coord_flip() +
  labs(x="Date", y="Number of Tweets", colour="Crossfit Box") +
  theme(plot.title = element_text(hjust = 0.5)) +
  facet_wrap(~variable) +
  labs(x="Time", y="Number of Tweets", colour="Crossfit Box") +
  theme(axis.text.x=element_text(angle=90,hjust=1)) +
  theme(plot.title = element_text(hjust = 0.5), strip.text.x = element_text(size = 10)) +
  ylim(0,200) +
  ggtitle("Tweet Length Summary Statistics")

E. Sentiment Analysis

In order to evaluate the emotion associated with the tweets of each Crossfit box, I used the syuzhet package. This package is based on the NRC emotion lexicon, which maps words to various emotions (joy, fear, etc.) and to sentiment polarity (positive/negative). We calculate the emotion scores based on the words present in the tweets and plot them.
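As a quick illustration of what the package returns (a hedged example with a made-up sentence; the result is a one-row data frame with counts for eight emotions plus negative/positive polarity):

library(syuzhet)
get_nrc_sentiment("Great workout today, proud of everyone who showed up!")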

We can see that the majority of tweets for each Crossfit box had a largely positive sentiment; trust and anticipation were other common emotions. For cfbelltown and crossfitsanitas, the third most common category was negative.

## sentiment analysis
library(syuzhet)

sentiment_scores = list()
#each_gym="sfcrossfit"
for(each_gym in unique(final_dat[,crossfit_name])){
  print(each_gym)
  tweet_text = final_dat[crossfit_name==each_gym, text]
  # removing retweets
  tweet_text <- gsub("(RT|via)((?:\\b\\w*@\\w+)+)","",tweet_text)
  # removing mentions
  tweet_text <- gsub("@\\w+","",tweet_text)
  all_sentiments <- get_nrc_sentiment((tweet_text))
  #head(all_sentiments)
  final_sentimentscores <- data.table(colSums(all_sentiments))
  #head(final_sentimentscores)
  names(final_sentimentscores) <- "score"
  final_sentimentscores <- cbind("sentiment"=colnames(all_sentiments), final_sentimentscores)
  final_sentimentscores$gym = each_gym
  #final_sentimentscores
  sentiment_scores[[each_gym]] <- final_sentimentscores
}

sentiment_scores <- rbindlist(sentiment_scores)
dim(sentiment_scores)
sentiment_scores

ggplot(sentiment_scores, aes(x=sentiment, y=score)) +
  geom_bar(aes(fill=sentiment), stat = "identity") +
  theme(legend.position="none") +
  xlab("Sentiments") + ylab("Scores") +
  coord_flip() +
  facet_wrap(~gym) +
  labs(x="", y="", colour="") +
  theme(plot.title = element_text(hjust = 0.5), axis.ticks.x = element_line(colour = "black"), axis.ticks.y = element_line(colour = "black"), strip.text.x = element_text(size = 10)) +
  ggtitle("Total Sentiment Across All Original Tweets")

F. Predictive Model

Out of curiosity, I wanted to see if a predictive model could be generated to predict whether a tweet would be retweeted. For this task, I looked at just the tweets from sfcrossfit. The target variable was 0 or 1, with zero representing a tweet that was not retweeted and one representing any tweet that was retweeted one or more times. Using the caret package, I trained a random forest on my training data and tried to predict on new data. The confusion matrix shows how well the model did at predicting whether a tweet would be retweeted or not. Perhaps with some better features, a better model could have been produced, but I chose not to invest more time on this task.

library(tidyverse)
library(tidytext)
library(stringr)
library(caret)
library(tm)

final_dat[1:2]
final_dat[, .N, by=crossfit_name]

final_dat_sub = final_dat[crossfit_name=="sfcrossfit",]
final_dat_sub[, favorite_count_new := ifelse(favorite_count==0, 0, 1)]
final_dat_sub[, retweet_count_new := ifelse(retweet_count==0, 0, 1)]
final_dat_sub[1:3]
#final_dat_sub$retweet_count_new

splt <- createDataPartition(final_dat_sub$retweet_count_new, p = 0.70, list = F)
train <- final_dat_sub[splt, .(created_at, favorite_count, is_retweet, display_text_width, contains_hashtags, retweet_count_new)]
dim(train)
test <- final_dat_sub[-splt, .(created_at, favorite_count, is_retweet, display_text_width, contains_hashtags, retweet_count_new)]
dim(test)

congress_rf <- train(x = as.matrix(train[,.(year(created_at), month(created_at), favorite_count, is_retweet, display_text_width, contains_hashtags)]),
                     y = factor(train$retweet_count_new),
                     method = "rf",
                     ntree = 200,
                     trControl = trainControl(method = "oob"))
congress_rf$finalModel
varImpPlot(congress_rf$finalModel)

pred = predict(congress_rf, as.matrix(final_dat_sub[-splt,.(year(created_at), month(created_at), favorite_count, is_retweet, display_text_width, contains_hashtags)]))
confusionMatrix(final_dat_sub[-splt,retweet_count_new], pred, positive="1")

G. Conclusion

So there you have it. An investigation of how several prominent Crossfit gyms are using their Twitter accounts to engage with consumers. At the end of the day, I would suggest that any business in the health and wellness space should invest more time on Instagram or YouTube than Twitter when it comes to brand marketing or customer acquisition. Twitter is great for news and entertainment, but it isn’t the ideal platform to share fitness content or inspire new members.


How to Scrape Data from a JavaScript Website with R

Thu, 12/20/2018 - 01:00

(This article was first published on Curiosity Killed the Cat, and kindly contributed to R-bloggers)

In September 2017, I found myself working on a project that required odds data for football. At the time I didn’t know about resources such as Football-Data or the odds-api, so I decided to build a scraper to collect data directly from the bookmakers. However, most of them used JavaScript to display their odds, so I couldn’t collect the data with R and rvest alone. In this article, I’ll demonstrate how PhantomJS can be used with R to scrape JS-rendered content from the web.

Requirements

In addition to R’s base packages, I’ll need the following for this example:

  • rvest – This package for R is a wrapper for the xml2 and httr packages. It allows you to download and extract data from HTML and XML.
  • PhantomJS – A headless browser that can be used to access webpages and extract information from them, among other things.
  • Selector Gadget – This one is optional, but it makes life easier because it isn’t necessary to go through the source code to find the XPath and/or CSS selectors for extracting the information you need from an HTML document.
Collecting Data

For this example I’ll stick to the original intent of the project and collect odds from an online bookmaker. Paddy Power (no affiliation) uses JavaScript and also doesn’t have an API to my knowledge, so I’ll collect odds from their website.

Without PhantomJS

To collect odds data with rvest, we need to:

  1. Read the HTML file from a URL.
  2. Extract the HTML elements containing the odds. (The elements that contain the data I am interested in have the attribute class="avb-item".)
  3. Extract the contents of the elements.
  4. Clean the data if necessary.
> withoutJS <- xml2::read_html("https://www.paddypower.com/football?tab=today") %>%
+   rvest::html_nodes(".avb-item") %>%
+   rvest::html_text()
> withoutJS
character(0)

We ended up with an empty character vector because the HTML document contains only JS snippets in the source code instead of the content we see in the browser. That means the CSS selector we can see in the browser does not exist in the document R is trying to extract data from.

With PhantomJS

The first step is to save the following script in a separate file:

// scraper_PaddyPower.js

// Create a webpage object
var page = require('webpage').create();
// Include the File System module for writing to files
var fs = require('fs');

// Specify source and path to output file
var url = 'https://www.paddypower.com/football?tab=today'
var path = 'odds_PaddyPower.html'

page.open(url, function (status) {
    var content = page.content;
    fs.write(path, content, 'w')
    phantom.exit();
});

The purpose of this script is to retrieve the HTML file from the specified URL and store it into a local HTML file, so that R can read contents from that file instead of reading the contents directly from the URL.

We can then use the system() function to invoke the script from R and repeat the steps outlined in the attempt without PhantomJS to extract the data:

> system("/usr/local/phantomjs-2.1.1/bin/phantomjs scraper_PaddyPower.js")
> withJS <- xml2::read_html("odds_PaddyPower.html") %>%
+   rvest::html_nodes(".avb-item") %>%
+   rvest::html_text()

Note for Windows users: The invocation of the script has the same format as the command for Linux, but you may have to add .exe to phantomjs for it to work.

The content is currently a mess and the variable is one large chunk of text, i.e. a character vector of length 1. If I remove the white space and split it into individual elements, here’s what I get:

> pp_odds_data <- withJS %>%
+   gsub("[\n\t]", "", .) %>%
+   stringr::str_trim() %>%
+   gsub("\\s+", " ", .) %>%
+   strsplit(" ") %>%
+   unlist()
> pp_odds_data[1:40]
 [1] "Bahrain"        "Tajikistan"     "4/7"            "13/5"           "5/1"            "17:00"          "More"
 [8] "Bets"           ">"              "Hatayspor"      "Genclerbirligi" "1/14"           "13/2"           "19/1"
[15] "More"           "Bets"           ">"              "Darica"         "Genclerbirligi" "Antalyaspor"    "4/1"
[22] "23/10"          "4/7"            "More"           "Bets"           ">"              "Akhisar"        "Belediye"
[29] "Fatih"          "Karagumrukspor" "1/3"            "10/3"           "6/1"            "16:00"          "More"
[36] "Bets"           ">"              "Giresunspor"    "Fenerbahce"     "9/2"

There’s still some useless data mixed in, and there is no guarantee that I can turn this vector into a data frame to store and analyze the data. In this particular case, team names can have more than two words, so I should extract information about matches and odds separately:

> matches <- xml2::read_html("odds_PaddyPower.html") %>%
+   rvest::html_nodes(".avb-item .ui-scoreboard-coupon") %>%
+   rvest::html_text() %>%
+   stringr::str_trim() %>%
+   gsub("\\s\\s+", " vs ", .)
> odds <- xml2::read_html("odds_PaddyPower.html") %>%
+   rvest::html_nodes(".avb-item .btn-odds") %>%
+   rvest::html_text() %>%
+   stringr::str_trim()
> odds_selector <- seq(1, length(odds), 3)
> pp_odds_df <- data.frame(Match = matches,
+                          Home = odds[odds_selector],
+                          Draw = odds[(odds_selector + 1)],
+                          Away = odds[(odds_selector + 2)])
> head(pp_odds_df)
                                     Match Home  Draw Away
1                    Bahrain vs Tajikistan  4/7  13/5  5/1
2              Hatayspor vs Genclerbirligi 1/14  13/2 19/1
3     Darica Genclerbirligi vs Antalyaspor  4/1 23/10  4/7
4 Akhisar Belediye vs Fatih Karagumrukspor  1/3  10/3  6/1
5                Giresunspor vs Fenerbahce  9/2  11/4  1/2
6             Botosani vs Dunarea Calarasi 8/13   5/2  9/2

Now I can further transform the data before storing and analyzing it, e.g. convert fractional odds into decimal, split the Match column into “Home Team” and “Away Team” columns, and so on.
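For instance, here is a hedged sketch of those two follow-up transformations (fractional odds "a/b" become decimal odds a/b + 1, and tidyr::separate() splits the Match column on the " vs " marker inserted earlier; the new column names are my own):

library(dplyr)
library(tidyr)

# convert fractional odds such as "13/5" into decimal odds
frac_to_dec <- function(x) {
  parts <- strsplit(as.character(x), "/", fixed = TRUE)
  vapply(parts, function(p) 1 + as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

pp_odds_df %>%
  mutate_at(vars(Home, Draw, Away), frac_to_dec) %>%
  separate(Match, into = c("HomeTeam", "AwayTeam"), sep = " vs ")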

Final Comments

Although the script I used still works today, I should note that the development of PhantomJS is currently suspended and its git repository is archived. However, there are many ways to collect data from a JavaScript-rendered website, so you can also try using rvest and V8 from R or Python and Selenium to collect the data you need. Good luck!


Spelling 2.0: Improved Markdown and RStudio Support

Thu, 12/20/2018 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

We have released updates for the rOpenSci text analysis tools. This tech note highlights some of the major improvements in the spelling package and in the underlying hunspell package, which provides the spelling engine.

install.packages("spelling")

Update to the latest versions to use these cool new features!

Upcoming version of #rstats spelling package will also check your package readme and news files. pic.twitter.com/bem8rGx9e3

— Jeroen Ooms (@opencpu) December 17, 2018

Automatic Checking of README and NEWS files

Users that are already using spelling on their packages might discover a few new typos! The new version of spelling now also checks the readme.md, news.md, and index.md files in your package root directory.

spelling::spell_check_package()
#   WORD       FOUND IN
# AppVeyor   README.md:5
# CMD        description:3
#            README.md:12
# devtools   README.md:93
# hunspell   spell_check_files.Rd:18,32
#            spell_check_package.Rd:24
#            README.md:31

Of course it also still checks your manual pages, vignettes and description fields. Hence we now spell check all content that is used to generate your package website with pkgdown.

Improved Markdown Parsing

Parsing of markdown files in spell_check_files() and spell_check_package() has been tweaked in a few places, resulting in fewer false positives.

First of all, it now properly parses the yaml front-matter and runs the spell check only on the title, subtitle and description fields. The other fields usually contain code or options, not actual language we want to check.

spelling::spell_check_files("README.md")
#   WORD       FOUND IN
# AppVeyor   README.md:5
# CMD        README.md:12
# devtools   README.md:93
# hunspell   README.md:31
# readme     README.md:37
# RStudio    README.md:8,93
# wordlist   README.md:12

Also, the parser now filters words that contain an @ symbol. Such a word is usually an rmarkdown citation, a Twitter handle, or an email address, none of which should be spell checked.

Finally, all markdown files are now treated as UTF-8, which should fix a problem that Windows users may have been seeing when checking files that contain non-ASCII text, such as words with diacritics or basically anything that is not in Latin script. Which brings us to the next topic…

Support for RStudio Dictionaries

The spelling and hunspell packages require a dictionary for spell checking text in a given language. By default we include en-US and en-GB, but installing additional dictionaries has become even easier. The package will now automatically find dictionaries that you have installed in RStudio:

The RStudio IDE has a menu to install additional languages (Global Options -> Spelling -> Main Dictionary Language -> Install). The installed dictionaries will automatically become available to hunspell and you can use them in all hunspell/spelling functions that take a lang or dict argument.

hunspell::list_dictionaries()
#  [1] "bg_BG"     "ca_ES"     "cs_CZ"     "da_DK"     "de_DE_neu" "de_DE"     "el_GR"
#  [8] "en_AU"     "en_CA"     "en_GB"     "en_US"     "es_ES"     "fr_FR"     "hr_HR"
# [15] "hu-HU"     "id_ID"     "it_IT"     "lt_LT"     "lv_LV"     "nb_NO"     "nl_NL"
# [22] "pl_PL"     "pt_BR"     "pt_PT"     "ro_RO"     "ru_RU"     "sh"        "sk_SK"
# [29] "sl_SI"     "sr"        "sv_SE"     "uk_UA"     "vi_VN"

hunspell::hunspell("Nou breekt mijn klomp!", dict = 'nl_NL')
# character(0)

Not using RStudio? The last chapter of the hunspell vignette explains in detail how to install dictionaries on different platforms.

Low Level Hunspell Updates

Text analysis experts might notice some significant improvements in hunspell because it includes the latest libhunspell 1.7 library. This version has lots of new features and bug fixes by László Németh.

Among the changes are performance enhancements and tweaks in the suggestion algorithm, i.e. the hunspell::hunspell_suggest() function in R. For a complete overview of changes have a look at the release notes from libhunspell.

Also, we have updated the en-US and en-GB dictionaries included with the hunspell package to the latest version 2018.11.01 from LibreOffice. This update includes many new words that may have previously been marked as spelling errors.


Data, movies and ggplot2

Wed, 12/19/2018 - 09:05

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

Yet another boring barplot?
No!
I’ve asked my students from MiNI WUT to visualize some data about their favorite movies or series.
Results are pretty awesome.
Believe me or not, but charts in these posters are created with ggplot2 (most of them)!

Star Wars

Fan of StaR WaRs? Find out which color is the most popular for lightsabers!
Yes, these lightsabers are created with ggplot2.
Would you guess which characters are overweight?
Find the R code and the data on the GitHub.

Harry Pixel

Take frames from Harry Potter movies, use k-means to extract dominant colors for each frame, calculate the derivative of the color changes, and here you are.
The R code and the poster are here.
(Steep derivatives in color space are a nice proxy for dynamic scenes.)
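For the curious, here is a hedged sketch of the dominant-colour idea (not the students' code; "frame.jpg" is a placeholder path):

library(jpeg)   # readJPEG()

img <- readJPEG("frame.jpg")                        # height x width x 3 array in [0, 1]
px  <- data.frame(r = as.vector(img[, , 1]),
                  g = as.vector(img[, , 2]),
                  b = as.vector(img[, , 3]))
km  <- kmeans(px, centers = 5, nstart = 5)          # 5 dominant colours for this frame
rgb(km$centers[, 1], km$centers[, 2], km$centers[, 3])  # hex codes of the cluster centres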

Social Network for Super Heroes

Have you ever wondered what the distribution of superpowers looks like among the Avengers?
Check out this poster or dive into the data.

Pardon my French, but…

Scrape transcripts from over 100k movies, find out how many curse words appear in these movies, and plot the statistics.
Here are sources and the poster.
(Bonus Question 1: how are curse words related to the Obama/Trump presidencies?
Bonus Question 2: is the number of hard curse words increasing or not?)

Rick and Morty

Interested in the demography of characters from Rick and Morty?
Here is the R code and the poster.
(Tricky question: what is happening with season 3?)

Twin Peaks

Transcripts from Twin Peaks are full of references to coffee and donuts.
Except for the episode in which Laura's murderer is revealed (oops, spoiler alert).
Check out this by yourself with these scripts.

The Lion King

Which Disney’s movie is the most popular?
It wasn’t hard to guess.

Box Office

5D scatterplots?
Here you go.

Next time I will ask my students to visualize data about R packages…
Or maybe you have some other ideas?


Spinning Pins

Wed, 12/19/2018 - 08:00

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Condemned to spend a whole life preparing some farewell (Desarraigo, Extremoduro)

I live just a few minutes from the Spanish National Museum of Science and Technology (MUNCYT), where I go from time to time with my family. The museum is full of interesting artifacts, from a portrait of Albert Einstein made with thousands of small dice to a wind machine for experimenting with the Bernoulli effect. There is also a device which creates waves. It is formed by a number of sticks, arranged vertically, each of them ending in a ball (so each forms a sort of pin). When you push the start button, all the pins start to move, describing circles. Since each pair of consecutive pins is separated by the same angle, the resulting movement imitates a wave. In this experiment I created some machines of my own. The first one imitates exactly the one in the museum:

If you look carefully at just one pin, you will see that it describes a circle. Each one starts at a different angle and all of them move at the same speed. Although an individual pin is pretty boring, together they create a nice pattern. The museum's machine is formed by just one row of 20 pins. I created a 20×20 grid of pins to make the result more appealing.

Playing with the angle between pins, you can create other nice patterns like these:

The code is incredibly simple and can be used as a starting point to create much more complicated patterns, changing the speed depending on time or on the location of the pins. Play with the colors or shapes of the points, the number of pins, or their separation and speed. The magic of movement is created with the gganimate package. You can find the code here.
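As a rough illustration of the idea (a hedged sketch, not the original code linked above): each pin sits on a grid and traces a small circle, and its phase depends on its position, which is what produces the wave.

library(ggplot2)
library(gganimate)

# 20x20 grid of pins; each traces a circle whose phase is shifted by its position
pins <- expand.grid(x = 1:20, y = 1:20)
frames <- do.call(rbind, lapply(seq(0, 2 * pi, length.out = 40), function(tt) {
  transform(pins,
            t  = tt,
            px = x + 0.35 * cos(tt + (x + y) * pi / 10),
            py = y + 0.35 * sin(tt + (x + y) * pi / 10))
}))

ggplot(frames, aes(px, py)) +
  geom_point(size = 2, colour = "steelblue") +
  coord_equal() +
  theme_void() +
  transition_time(t)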

Merry Christmas and Happy New Year. Thanks a lot for reading my posts.


My R take on Advent of Code – Day 2

Wed, 12/19/2018 - 01:00

(This article was first published on r-tastic, and kindly contributed to R-bloggers)

This is my second blog post from the series My R take on Advent of Code. If you’d like to know more about Advent of Code, check out the first post from the series or simply go to their website. Below you’ll find the challenge from Day 2 and the solution that worked for me. As always, feel free to leave comments if you have different ideas on how this could have been solved!

Day 2 Puzzle

(…) you scan the likely candidate boxes again, counting the number that have an ID containing exactly two of any letter and then separately counting those with exactly three of any letter. You can multiply those two counts together to get a rudimentary checksum and compare it to what your device predicts.
For example, if you see the following box IDs:

abcdef contains no letters that appear exactly two or three times.
bababc contains two a and three b, so it counts for both.
abbcde contains two b, but no letter appears exactly three times.
abcccd contains three c, but no letter appears exactly two times.
aabcdd contains two a and two d, but it only counts once.
abcdee contains two e.
ababab contains three a and three b, but it only counts once.

Of these box IDs, four of them contain a letter which appears exactly twice, and three of them contain a letter which appears exactly three times. Multiplying these together produces a checksum of 4 * 3 = 12.
What is the checksum for your list of box IDs?

So what is it all about? As complicated as it may sound, essentially we need to:

  • understand which string contains letters that appear exactly 2 times
  • understand which string contains letters that appear exactly 3 times
  • count the number of each type of string
  • multiply them together

Doesn’t sound so bad anymore, ey? This is how we can go about it:

First load your key packages…

library(dplyr)
library(stringr)
library(tibble)
library(purrr)

… and have a look at what the raw input looks like.

# check raw input
glimpse(input)
## chr "xrecqmdonskvzupalfkwhjctdb\nxrlgqmavnskvzupalfiwhjctdb\nxregqmyonskvzupalfiwhjpmdj\nareyqmyonskvzupalfiwhjcidb\"| __truncated__

Right, Advent of Code will never give you nice and clean data to work with, that’s for sure. But it doesn’t look like things are too bad this time – let’s just split it by the new line and keep it as a vector for now. Does it look reasonably good?

# clean it
clean_input = strsplit(input, '\n') %>% unlist() # split by new line
glimpse(clean_input)
## chr [1:250] "xrecqmdonskvzupalfkwhjctdb" "xrlgqmavnskvzupalfiwhjctdb" ...

Much better! Now, let’s put it all in a data frame for now; we’ll need it very soon.

# put it in the data.frame
df2 <- tibble(input = str_trim(clean_input))
head(df2)
## # A tibble: 6 x 1
##   input
##
## 1 xrecqmdonskvzupalfkwhjctdb
## 2 xrlgqmavnskvzupalfiwhjctdb
## 3 xregqmyonskvzupalfiwhjpmdj
## 4 areyqmyonskvzupalfiwhjcidb
## 5 xregqpyonskvzuaalfiwhjctdy
## 6 xwegumyonskvzuphlfiwhjctdb

Now, the way I approached this was to split each word into letters and then count how many times each letter occurred. Then, to identify words with 2 occurrences, I filtered only those letters that occur twice, and if the final table has any rows, the word counts as a yes. Take the first example:

strsplit(input, '\n') %>%
  unlist() %>%
  .[[1]] # get the first example
## [1] "xrecqmdonskvzupalfkwhjctdb"

Let’s split it by letter, put it in a tibble, and count each letter’s occurrences:

strsplit(input, '\n') %>%
  unlist() %>%
  .[[1]] %>%        # get the first example
  strsplit('') %>%  # split letters
  unlist() %>%      # get a vector
  as_tibble() %>%   # transform vector to tibble
  rename_(letters = names(.)[1]) %>% # name the column: letters
  count(letters)
## # A tibble: 23 x 2
##    letters     n
##
##  1 a           1
##  2 b           1
##  3 c           2
##  4 d           2
##  5 e           1
##  6 f           1
##  7 h           1
##  8 j           1
##  9 k           2
## 10 l           1
## # ... with 13 more rows

Now, do we have any double occurrences there?

# test: counting double letter occurrences
strsplit(input, '\n') %>%
  unlist() %>%
  .[[1]] %>%         # get the first example
  strsplit('') %>%   # split letters
  unlist() %>%       # get a vector
  as_tibble() %>%    # transform vector to tibble
  rename_(letters = names(.)[1]) %>% # name the column: letters
  count(letters) %>% # count letter occurrences
  filter(n == 2) %>% # get only those with double occurrences
  nrow()             # how many are there?
## [1] 3

Definitely yes. Let’s repeat the process for triple occurrences:

# test: counting triple letter occurrences
strsplit(input, '\n') %>%
  unlist() %>%
  .[[1]] %>%        # get the first example
  strsplit('') %>%  # split letters
  unlist() %>%
  as_tibble() %>%   # transforming vector to tibble
  rename_(letters = names(.)[1]) %>%
  count(letters) %>%
  filter(n == 3) %>%
  nrow()
## [1] 0

Not much luck with those in this case. To make our life easier, let’s wrap both calculations in functions…

### wrap-up in functions
# count double occurrences
count2 <- function(x) {
  result2 <- as.character(x) %>%
    strsplit('') %>%   # split by letters
    unlist() %>%
    as_tibble() %>%    # transforming vector to tibble
    rename_(letters = names(.)[1]) %>%
    count(letters) %>% # count letter occurrences
    filter(n == 2) %>%
    nrow()
  return(result2)
}

# count triple occurrences
count3 <- function(x) {
  result2 <- as.character(x) %>%
    strsplit('') %>%
    unlist() %>%
    as_tibble() %>%    # transforming vector to tibble
    rename_(letters = names(.)[1]) %>%
    count(letters) %>%
    filter(n == 3) %>%
    nrow()
  return(result2)
}

…and apply them to the whole dataset:

### apply functions to input
occurs2 <- map_int(df2$input, count2)
occurs3 <- map_int(df2$input, count3)
str(occurs2)
## int [1:250] 3 3 3 3 2 3 3 2 2 2 ...

Now, all we need to do is check how many positive elements we have in each vector and multiply their lengths by each other:

# solution
length(occurs2[occurs2 != 0]) * length(occurs3[occurs3 != 0])
## [1] 5976

Voila!
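As a side note, the same checksum can be cross-checked with a compact base-R version (a sketch that reuses the clean_input vector from above):

# letter frequencies per ID, then flag IDs with any exact double / triple
counts <- lapply(strsplit(clean_input, ""), table)
has2 <- vapply(counts, function(tb) any(tb == 2), logical(1))
has3 <- vapply(counts, function(tb) any(tb == 3), logical(1))
sum(has2) * sum(has3)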


All the (NBA) box scores you ever wanted

Tue, 12/18/2018 - 18:13

(This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

In this previous post, I showed how one can scrape top-level NBA game data from BasketballReference.com. In the post after that, I demonstrated how to scrape play-by-play data for one game. After writing those posts, I thought to myself: why not do both? And that is what I did: scrape all the box scores for the 2017-18 NBA season and save them to an RDS object.

The code for web scraping can be found here. I don’t think it’s particularly insightful to go through the details line-by-line… What I’ll do instead is explain what is in the saved RDS object which will allow you to make use of the data, and I’ll end off with some visualizations to demonstrate what one can do with this data, specifically:

  • Do Kevin Durant and Steph Curry get in each other’s way?
  • Are the Golden State Warriors really better in the 3rd quarter?

(Code that I wrote for these questions can be found here.) Let’s load the packages and data (data available as links from this page):

library(tidyverse) game_df <- readRDS("NBA-2018_game_data.rds") master_list <- readRDS("NBA-2018_box_score.rds")

First, let’s take a quick look at game_df which contains the top-level game data:

head(game_df) #> game_id date_game game_start_time visitor_team_name visitor_pts home_team_name home_pts #> 1 201710170CLE 2017-10-17 8:01p Boston Celtics 99 Cleveland Cavaliers 102 #> 2 201710170GSW 2017-10-17 10:30p Houston Rockets 122 Golden State Warriors 121 #> 3 201710180BOS 2017-10-18 7:30p Milwaukee Bucks 108 Boston Celtics 100 #> 4 201710180DAL 2017-10-18 8:30p Atlanta Hawks 117 Dallas Mavericks 111 #> 5 201710180DET 2017-10-18 7:00p Charlotte Hornets 90 Detroit Pistons 102 #> 6 201710180IND 2017-10-18 7:00p Brooklyn Nets 131 Indiana Pacers 140 #> overtimes attendance game_remarks game_type #> 1 20562 Regular #> 2 19596 Regular #> 3 18624 Regular #> 4 19709 Regular #> 5 20491 Regular #> 6 15008 Regular

The first column, game_id, is BasketBallReference.com’s way of uniquely identifying a game. For example, if we wanted to see the results of the game between the Celtics and the Cavs on 17 Oct 2017, we would look up the game_id (201710170CLE) and go to the URL https://www.basketball-reference.com/boxscores/201710170CLE.html. The game_id is how I was able to scrape the box scores for all the games.
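
For example, a minimal sketch of how one might build such a URL programmatically (the helper name is illustrative, not from the original post):

# sketch: construct a box score URL from a game_id
bbref_url <- function(game_id) {
  paste0("https://www.basketball-reference.com/boxscores/", game_id, ".html")
}
bbref_url("201710170CLE")
#> [1] "https://www.basketball-reference.com/boxscores/201710170CLE.html"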

Next, let’s look at the master_list object which contains all the box score information:

typeof(master_list) #> [1] "list" length(master_list) #> [1] 1312 head(names(master_list)) #> [1] "201710170CLE" "201710170GSW" "201710180BOS" "201710180DAL" "201710180DET" "201710180IND"

It is a list of length 1312, which corresponds to the total number of games (Regular Season and Playoffs) in the 2017-18 season. The keys of the list are the game_ids, and (as we will see shortly) each element of the list is itself a list of length 5. Let’s take a closer look at the first element of the list:

str(master_list[[1]]) #> List of 5 #> $ visitor_basic_boxscore:'data.frame': 13 obs. of 22 variables: #> ..$ Player: chr [1:13] "Jaylen Brown" "Kyrie Irving" "Jayson Tatum" "Al Horford" ... #> ..$ MP : chr [1:13] "39:36" "39:21" "36:32" "32:07" ... #> ..$ FG : num [1:13] 11 8 5 2 1 5 2 2 0 0 ... #> ..$ FGA : num [1:13] 23 17 12 7 2 16 6 2 2 1 ... #> ..$ FG% : num [1:13] 0.478 0.471 0.417 0.286 0.5 0.313 0.333 1 0 0 ... #> ..$ 3P : num [1:13] 2 4 1 0 0 0 1 0 0 0 ... #> ..$ 3PA : num [1:13] 9 9 2 2 1 4 3 0 1 1 ... #> ..$ 3P% : num [1:13] 0.222 0.444 0.5 0 0 0 0.333 NA 0 0 ... #> ..$ FT : num [1:13] 1 2 3 5 0 2 4 2 0 0 ... #> ..$ FTA : num [1:13] 2 2 3 7 0 3 4 4 0 0 ... #> ..$ FT% : num [1:13] 0.5 1 1 0.714 NA 0.667 1 0.5 NA NA ... #> ..$ ORB : num [1:13] 1 2 4 0 0 0 0 2 0 0 ... #> ..$ DRB : num [1:13] 5 2 6 7 1 9 3 3 0 1 ... #> ..$ TRB : num [1:13] 6 4 10 7 1 9 3 5 0 1 ... #> ..$ AST : num [1:13] 0 10 3 5 0 3 2 1 0 0 ... #> ..$ STL : num [1:13] 2 3 0 0 0 2 4 0 0 0 ... #> ..$ BLK : num [1:13] 0 0 0 1 0 2 0 1 0 0 ... #> ..$ TOV : num [1:13] 3 2 1 0 0 2 0 2 0 0 ... #> ..$ PF : num [1:13] 5 4 4 2 1 2 0 5 1 0 ... #> ..$ PTS : num [1:13] 25 22 14 9 2 12 9 6 0 0 ... #> ..$ +/- : num [1:13] -5 -1 6 8 3 -8 4 -14 -10 2 ... #> ..$ Role : chr [1:13] "Starter" "Starter" "Starter" "Starter" ... #> $ visitor_adv_boxscore :'data.frame': 13 obs. of 17 variables: #> ..$ Player: chr [1:13] "Jaylen Brown" "Kyrie Irving" "Jayson Tatum" "Al Horford" ... #> ..$ MP : chr [1:13] "39:36" "39:21" "36:32" "32:07" ... #> ..$ TS% : num [1:13] 0.523 0.615 0.526 0.446 0.5 0.346 0.58 0.798 0 0 ... #> ..$ eFG% : num [1:13] 0.522 0.588 0.458 0.286 0.5 0.313 0.417 1 0 0 ... #> ..$ 3PAr : num [1:13] 0.391 0.529 0.167 0.286 0.5 0.25 0.5 0 0.5 1 ... #> ..$ FTr : num [1:13] 0.087 0.118 0.25 1 0 0.188 0.667 2 0 0 ... #> ..$ ORB% : num [1:13] 2.4 4.9 10.5 0 0 0 0 10.1 0 0 ... #> ..$ DRB% : num [1:13] 13.2 5.3 17.1 22.7 19.9 26.8 16 16.4 0 21.7 ... #> ..$ TRB% : num [1:13] 7.6 5.1 13.7 10.9 9.5 12.8 7.7 13.1 0 10.4 ... #> ..$ AST% : num [1:13] 0 46.5 13.4 22.6 0 14.1 15.8 8.1 0 0 ... #> ..$ STL% : num [1:13] 2.4 3.7 0 0 0 2.8 9.9 0 0 0 ... #> ..$ BLK% : num [1:13] 0 0 0 2.5 0 4.5 0 4.1 0 0 ... #> ..$ TOV% : num [1:13] 11.2 10.1 7 0 0 10.4 0 34.7 0 0 ... #> ..$ USG% : num [1:13] 29.9 22.2 17.3 13.8 16.8 24.3 17.5 13.3 10.2 9.1 ... #> ..$ ORtg : num [1:13] 89 124 115 111 95 70 129 110 0 0 ... #> ..$ DRtg : num [1:13] 104 103 108 105 107 96 87 105 112 107 ... #> ..$ Role : chr [1:13] "Starter" "Starter" "Starter" "Starter" ... #> $ home_basic_boxscore :'data.frame': 14 obs. of 22 variables: #> ..$ Player: chr [1:14] "LeBron James" "Jae Crowder" "Derrick Rose" "Dwyane Wade" ... #> ..$ MP : chr [1:14] "41:12" "34:44" "31:15" "28:30" ... #> ..$ FG : num [1:14] 12 3 5 3 4 4 2 3 2 0 ... #> ..$ FGA : num [1:14] 19 10 14 10 9 7 3 8 3 0 ... #> ..$ FG% : num [1:14] 0.632 0.3 0.357 0.3 0.444 0.571 0.667 0.375 0.667 NA ... #> ..$ 3P : num [1:14] 1 1 1 0 1 1 0 0 0 0 ... #> ..$ 3PA : num [1:14] 5 5 3 1 4 3 0 1 0 0 ... #> ..$ 3P% : num [1:14] 0.2 0.2 0.333 0 0.25 0.333 NA 0 NA NA ... #> ..$ FT : num [1:14] 4 4 3 2 6 1 1 0 0 0 ... #> ..$ FTA : num [1:14] 4 4 4 2 7 1 3 0 0 0 ... #> ..$ FT% : num [1:14] 1 1 0.75 1 0.857 1 0.333 NA NA NA ... #> ..$ ORB : num [1:14] 1 1 1 1 3 0 1 0 1 0 ... #> ..$ DRB : num [1:14] 15 4 3 1 8 4 5 0 1 0 ... #> ..$ TRB : num [1:14] 16 5 4 2 11 4 6 0 2 0 ... #> ..$ AST : num [1:14] 9 2 2 3 0 1 2 0 0 0 ... #> ..$ STL : num [1:14] 0 2 0 0 0 0 0 0 1 0 ... #> ..$ BLK : num [1:14] 2 0 0 2 0 0 0 0 0 0 ... 
#> ..$ TOV : num [1:14] 4 1 2 4 2 0 2 1 1 0 ... #> ..$ PF : num [1:14] 3 2 2 1 2 4 3 3 3 2 ... #> ..$ PTS : num [1:14] 29 11 14 8 15 10 5 6 4 0 ... #> ..$ +/- : num [1:14] 2 7 -7 0 1 7 2 -2 6 -1 ... #> ..$ Role : chr [1:14] "Starter" "Starter" "Starter" "Starter" ... #> $ home_adv_boxscore :'data.frame': 14 obs. of 17 variables: #> ..$ Player: chr [1:14] "LeBron James" "Jae Crowder" "Derrick Rose" "Dwyane Wade" ... #> ..$ MP : chr [1:14] "41:12" "34:44" "31:15" "28:30" ... #> ..$ TS% : num [1:14] 0.698 0.468 0.444 0.368 0.621 0.672 0.579 0.375 0.667 NA ... #> ..$ eFG% : num [1:14] 0.658 0.35 0.393 0.3 0.5 0.643 0.667 0.375 0.667 NA ... #> ..$ 3PAr : num [1:14] 0.263 0.5 0.214 0.1 0.444 0.429 0 0.125 0 NA ... #> ..$ FTr : num [1:14] 0.211 0.4 0.286 0.2 0.778 0.143 1 0 0 NA ... #> ..$ ORB% : num [1:14] 2.5 3 3.3 3.7 11 0 5.3 0 8.1 0 ... #> ..$ DRB% : num [1:14] 35 11.1 9.2 3.4 27 17.5 24.5 0 7.5 0 ... #> ..$ TRB% : num [1:14] 19.4 7.2 6.4 3.5 19.4 9.1 15.3 0 7.8 0 ... #> ..$ AST% : num [1:14] 43.7 8.2 10.1 15.3 0 7.5 14.8 0 0 0 ... #> ..$ STL% : num [1:14] 0 2.8 0 0 0 0 0 0 3.8 0 ... #> ..$ BLK% : num [1:14] 4.2 0 0 6 0 0 0 0 0 0 ... #> ..$ TOV% : num [1:14] 16.2 7.8 11.3 26.9 14.2 0 31.6 11.1 25 NA ... #> ..$ USG% : num [1:14] 26 15.9 24.6 22.6 21.4 14.7 13.9 27.3 13.5 0 ... #> ..$ ORtg : num [1:14] 126 100 88 67 117 138 92 67 102 0 ... #> ..$ DRtg : num [1:14] 94 98 105 102 100 102 100 107 96 106 ... #> ..$ Role : chr [1:14] "Starter" "Starter" "Starter" "Starter" ... #> $ pdp_df :'data.frame': 114 obs. of 4 variables: #> ..$ time : num [1:114] 16 72 88 99 131 151 166 191 191 238 ... #> ..$ visitor: num [1:114] 2 4 4 4 4 4 6 6 6 8 ... #> ..$ home : num [1:114] 0 0 1 3 5 7 7 8 9 9 ... #> ..$ period : num [1:114] 1 1 1 1 1 1 1 1 1 1 ...

The first 4 elements of the list (visitor_basic_boxscore, visitor_adv_boxscore, home_basic_boxscore, home_adv_boxscore) are data frames for the basic and advanced box scores of both teams. They are the box scores you see on BasketballReference.com (with some minor preprocessing):

Box scores from BasketballReference.com.

The last element of the list is a data frame containing play-by-play information, much like that in the previous post.

Hopefully some of you will take this data and make some cool data visualizations from it! Below, I will walk through how I use this dataset for two visualizations.

Do Kevin Durant and Steph Curry get in each other’s way?

Kevin Durant and Steph Curry are two of the most lethal scorers in today’s game. Does both of them being on the same team hamper the other’s production? It is pretty difficult to pin down precisely what we mean by this statement; we will instead make do with a couple of suggestive plots.

The first plot we can make is a scatterplot of the points scored by these two players. If the relationship is positively correlated, it could mean that they make each other better.

Let’s pull out the game_ids for Golden State:

team_name <- "Golden State Warriors" player1 <- "Stephen Curry" player2 <- "Kevin Durant" # get game_ids game_rows <- which(game_df$visitor_team_name == team_name | game_df$home_team_name == team_name) game_ids <- game_df[game_rows, "game_id"]

Next, we set up a data frame consisting of 4 columns: the game_id, game_type (regular or playoff), and points scored by players 1 and 2 respectively.

points_df <- data.frame(matrix(NA, ncol = 4, nrow = length(game_ids))) names(points_df) <- c("game_id", "game_type", player1, player2)

The loop below builds up the data frame row by row (not the most R-like way of programming but it gets the job done…):

for (i in 1:nrow(points_df)) { id <- game_ids[i] points_df[i, 1] <- id points_df[i, 2] <- game_df[game_rows[i], "game_type"] # get the correct basic box score if (game_df[game_rows[i], "visitor_team_name"] == team_name) { boxscore <- master_list[[id]]$visitor_basic_boxscore } else { boxscore <- master_list[[id]]$home_basic_boxscore } # get player points if (player1 %in% boxscore$Player) { points_df[i, 3] <- subset(boxscore, Player == player1)$PTS } if (player2 %in% boxscore$Player) { points_df[i, 4] <- subset(boxscore, Player == player2)$PTS } } head(points_df) #> game_id game_type Stephen Curry Kevin Durant #> 1 201710170GSW Regular 22 20 #> 2 201710200NOP Regular 28 22 #> 3 201710210MEM Regular 37 29 #> 4 201710230DAL Regular 29 25 #> 5 201710250GSW Regular 30 29 #> 6 201710270GSW Regular 20 31

Next, we make the scatterplot (notice how we can use the get function to refer to the columns programmatically). We force the axes to start at 0 and the aspect ratio to be 1, and include the 45 degree line for reference.

ggplot(data = points_df, aes(x = get(player1), y = get(player2))) + geom_point(alpha = 0.5) + geom_smooth(se = FALSE) + geom_abline(slope = 1, intercept = 0, lwd = 0.5, col = "red", lty = 2) + scale_x_continuous(limits = c(0, NA)) + scale_y_continuous(limits = c(0, NA)) + labs(x = player1, y = player2, title = "Points scored") + coord_fixed()
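
As an aside (a sketch, not from the original post), recent versions of ggplot2 also let you refer to columns by a string via the .data pronoun, which avoids get() entirely:

# sketch: the same scatterplot using the .data pronoun instead of get()
ggplot(data = points_df, aes(x = .data[[player1]], y = .data[[player2]])) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, lwd = 0.5, col = "red", lty = 2) +
  labs(x = player1, y = player2, title = "Points scored") +
  coord_fixed()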

One might say, wouldn’t it be better to compare games where KD and Curry play, vs. games where only KD plays? If KD scores more if he is playing with Curry, that could suggest that playing together makes him more productive. To make these plots, we introduce two new columns which are Boolean variables representing whether a given player played in that game or not. We then make boxplots to compare the points scored depending on whether the other player played.

# ignore games where both didn't play points_df2 <- points_df[!(is.na(points_df[[3]]) & is.na(points_df[[4]])), ] points_df2[[paste(player1, "played")]] <- !is.na(points_df2[[3]]) points_df2[[paste(player2, "played")]] <- !is.na(points_df2[[4]]) # boxplots ggplot(data = points_df2, aes(x = get(paste(player1, "played")), y = get(player2))) + geom_boxplot() + geom_jitter(height = 0, alpha = 0.5, col = "blue") + labs(x = paste(player1, "played?"), y = "Points scored", title = paste("Points", player2, "scored")) ggplot(data = points_df2, aes(x = get(paste(player2, "played")), y = get(player1))) + geom_boxplot() + geom_jitter(height = 0, alpha = 0.5, col = "blue") + labs(x = paste(player2, "played?"), y = "Points scored", title = paste("Points", player1, "scored"))

From the plots, it looks like each player scores a bit more when the other person is not playing, although there is quite a bit of variance.

Instead of looking at points scored, we could look at more sophisticated measures of production. For example, the NBA computes individual offensive and defensive ratings for each game based on a complicated formula (see here for details). These measures are provided as ORtg and DRtg in the advanced box score.

The code below pulls out the ORtg and DRtg values for KD and Curry:

points_df <- data.frame(matrix(NA, ncol = 6, nrow = length(game_ids))) names(points_df) <- c("game_id", "game_type", sapply(c(player1, player2), function(x) paste(x, "ORtg")), sapply(c(player1, player2), function(x) paste(x, "DRtg"))) for (i in 1:nrow(points_df)) { id <- game_ids[i] points_df[i, 1] <- id points_df[i, 2] <- game_df[game_rows[i], "game_type"] # get the correct advanced box score if (game_df[game_rows[i], "visitor_team_name"] == team_name) { boxscore <- master_list[[id]]$visitor_adv_boxscore } else { boxscore <- master_list[[id]]$home_adv_boxscore } # get player ORtg & DRtg if (player1 %in% boxscore$Player) { points_df[i, 3] <- subset(boxscore, Player == player1)$ORtg points_df[i, 5] <- subset(boxscore, Player == player1)$DRtg } if (player2 %in% boxscore$Player) { points_df[i, 4] <- subset(boxscore, Player == player2)$ORtg points_df[i, 6] <- subset(boxscore, Player == player2)$DRtg } }

And here are the plots (along with a linear fit):
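
The exact plotting code is in the linked script, but here is a rough sketch of how the ORtg scatterplot with a linear fit might be produced from the points_df built above (the DRtg plot is analogous; the styling of the original figures may differ):

# sketch: ORtg scatterplot with a linear fit (DRtg is analogous)
ggplot(data = points_df,
       aes(x = get(paste(player1, "ORtg")), y = get(paste(player2, "ORtg")))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = paste(player1, "ORtg"), y = paste(player2, "ORtg"),
       title = "Offensive ratings")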


This was pretty interesting to me: it looks like the ORtgs are independent of each other while the DRtgs are highly correlated with each other. Might this suggest that defense is more of a team activity, while offense can be driven by individuals?

There are of course many other visualizations we could do to explore this question. We might also want to see how the KD-Curry pairing compares with other pairings. Because of the way the code is written, it’s really easy to generate these figures for other pairings: we just need to update the team_name, player1 and player2 variables and rerun the code. As an example, here are the ORtg and DRtg scatterplots for Damian Lillard and C.J. McCollum of the Portland Trail Blazers:


We see similar trends here.

Are the Golden State Warriors really better in the 3rd quarter?

Last season, it seemed like the Warriors were often down at the half, only to come roaring back in the third quarter. Is this really their best quarter? (This NYT article says yes and suggests why that might be the case.) To answer this question, we’ll make a lead tracker graphic (much like the previous post) but (i) we will do it for all games GSW played in, and (ii) we reset the score differential to 0 at the beginning of each period.

We first set up a data frame that records the game_id, whether the team was at home, and whether they won (we might want to use the latter two variables in our plotting later):

team_name <- "Golden State Warriors" game_rows <- which(game_df$visitor_team_name == team_name | game_df$home_team_name == team_name) game_ids <- game_df[game_rows, "game_id"] # set up game data frame and play-by-play list df <- data.frame(matrix(NA, ncol = 3, nrow = length(game_ids))) names(df) <- c("game_id", "home", "win") df$game_id <- game_ids df$home <- game_df[game_rows, "home_team_name"] == team_name df$win <- (df$home & game_df[game_rows, "home_pts"] > game_df[game_rows, "visitor_pts"]) | ((!df$home) & game_df[game_rows, "home_pts"] < game_df[game_rows, "visitor_pts"])

Next, we create a list with keys being the game_ids and the value being the play-by-play data frame from the master list we scraped earlier. We standardize the column names so that GSW’s score is labeled team and the opponent’s score is labeled opp. (Note: At the time of writing, BasketBallReference.com had incorrect data for the game 201710200NOP, so I have removed that game from our data.)

pbp_list <- list() for (i in 1:nrow(df)) { id <- game_ids[i] pdp <- master_list[[id]]$pdp_df if (df[i, "home"]) { names(pdp) <- c("time", "opp", "team", "period") } else { names(pdp) <- c("time", "team", "opp", "period") } pbp_list[[id]] <- pdp } # NOTE: mistake in 201710200NOP. original data on website is wrong # https://www.basketball-reference.com/boxscores/pbp/201710200NOP.html # so we remove it df <- df[-which(df$game_id == "201710200NOP"), ] pbp_list[["201710200NOP"]] <- NULL

We create a function parse_pbp which helps us reset the score at the beginning of every quarter. To make the lines in our eventual visualization smoother, parse_pbp also adds additional rows for the beginning and end of every quarter.

parse_pbp <- function(pbp) { pbp <- rbind(0, pbp) new_pbp <- pbp # get points scored in the period last_opp <- 0; last_team <- 0; last_period <- 0 for (i in 2:nrow(pbp)) { if (pbp[i, "period"] > last_period + 1) { last_period <- last_period + 1 last_opp <- pbp[i-1, "opp"] last_team <- pbp[i-1, "team"] } new_pbp$opp[i] <- pbp$opp[i] - last_opp new_pbp$team[i] <- pbp$team[i] - last_team } # add extra rows to denote beginning and end of periods num_period <- max(new_pbp$period) for (i in 1:num_period) { end_row <- new_pbp[max(which(new_pbp$period == i)), ] end_row[1] <- 12 * 60 * min(i, 4) + 5 * 60 * max(i-4, 0) beg_row <- c(0, 0, 0, i) beg_row[1] <- 12 * 60 * min(i-1, 4) + 5 * 60 * max(i-1-4, 0) new_pbp <- rbind(new_pbp, beg_row) new_pbp <- rbind(new_pbp, end_row) } new_pbp <- new_pbp[order(new_pbp$time), ] new_pbp$adv <- with(new_pbp, team - opp) new_pbp[-1, ] } pbp_list <- lapply(pbp_list, parse_pbp)

Finally, we make the list into a dataframe:

pbp_df <- data.frame(matrix(ncol = 6, nrow = 0)) names(pbp_df) <- c("game_id", "home", "win", "time", "period", "adv") for (i in seq_along(pbp_list)) { xx <- pbp_list[[i]] xx$game_id <- df$game_id[i] xx$home <- df$home[i] xx$win <- df$win[i] pbp_df <- rbind(pbp_df, xx) } head(pbp_df) #> time opp team period adv game_id home win #> 124 0 0 0 1 0 201710170GSW TRUE FALSE #> 2 13 2 0 1 -2 201710170GSW TRUE FALSE #> 3 55 2 1 1 -1 201710170GSW TRUE FALSE #> 4 55 2 2 1 0 201710170GSW TRUE FALSE #> 5 90 2 5 1 3 201710170GSW TRUE FALSE #> 6 104 5 5 1 0 201710170GSW TRUE FALSE

We can now make the line plot easily. We use the interaction function to group the line plots by both game_id and quarter, color the lines by whether GSW won or not, and manually define the breaks to be at the end of the periods:

periods <- unique(pbp_df$period) x_value <- ifelse(periods <= 4, 12 * 60 * periods, 12 * 60 * 4 + 5 * 60 * (periods - 4)) x_label <- ifelse(periods <= 4, paste0("Q", periods), paste0("OT", periods - 4)) ggplot(pbp_df, aes(x = time, y = adv)) + geom_line(aes(col = win, group = interaction(game_id, period)), lwd = 0.1) + geom_smooth(aes(group = period), col = "black", se = FALSE) + scale_x_continuous(breaks = x_value, labels = x_label) + scale_color_manual(values = c("#ff6600", "#3366ff")) + labs(title = paste("Point Advantage by Quarter,", team_name)) + theme_minimal() + theme(plot.title = element_text(face = "bold", hjust = 0.5), axis.title.x = element_blank(), axis.title.y = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), legend.position = "bottom")

It may not be so obvious in this plot, but there certainly seems to be a bit of a difference in the third quarter. On average, the Warriors are even or up 1-2 points in most quarters, but in the third quarter GSW gains almost 5 points. The difference is more obvious in the plot below, where we zoom in on the y-axis. (Note that we have to use coord_cartesian instead of scale_y_continuous to zoom in: scale_y_continuous removes the points outside the range before plotting, which gives an incorrect result for the geom_smooth layer.)

ggplot(pbp_df, aes(x = time, y = adv)) + geom_line(aes(col = win, group = interaction(game_id, period)), lwd = 0.1) + geom_smooth(aes(group = period), col = "black", se = FALSE) + scale_x_continuous(breaks = x_value, labels = x_label) + scale_color_manual(values = c("#ff6600", "#3366ff")) + coord_cartesian(ylim = c(-10, 10)) + labs(title = paste("Point Advantage by Quarter,", team_name)) + theme_minimal() + theme(plot.title = element_text(face = "bold", hjust = 0.5), axis.title.x = element_blank(), axis.title.y = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), legend.position = "bottom")

The advantage becomes even more obvious when we facet by whether GSW won or not: in wins, they gain almost 10 points in the 3rd quarter, compared to something like 1-3 points in each of the other quarters.
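
A hedged sketch of how such a faceted version might be produced (essentially the zoomed-in plot above with a facet layer added; the original figure’s styling may differ):

# sketch: split the zoomed-in plot into panels by game outcome
ggplot(pbp_df, aes(x = time, y = adv)) +
  geom_line(aes(col = win, group = interaction(game_id, period)), lwd = 0.1) +
  geom_smooth(aes(group = period), col = "black", se = FALSE) +
  scale_x_continuous(breaks = x_value, labels = x_label) +
  scale_color_manual(values = c("#ff6600", "#3366ff")) +
  coord_cartesian(ylim = c(-10, 10)) +
  facet_wrap(~ win, ncol = 1) +
  labs(title = paste("Point Advantage by Quarter,", team_name)) +
  theme_minimal() +
  theme(legend.position = "bottom")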


To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

AzureStor: an R package for working with Azure storage

Tue, 12/18/2018 - 17:18

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Hong Ooi, senior data scientist, Microsoft Azure

A few weeks ago, I introduced the AzureR family of packages for working with Azure in R. Since then, I’ve also written articles on how to use AzureRMR to interact with Azure Resource Manager, how to use AzureVM to manage virtual machines, and how to use AzureContainers to deploy R functions with Azure Kubernetes Service. This article is the next in the series, and covers AzureStor: an interface to Azure storage.

The Resource Manager interface: creating and deleting storage accounts

AzureStor implements an interface to Azure Resource Manager, which you can use to manage storage accounts: creating them, retrieving them, deleting them, and so forth. This is done via the appropriate methods of the az_resource_group class. For example, the following code shows how you might create a new storage account from scratch.

library(AzureStor) # get the resource group for the storage account rg <- AzureRMR::az_rm$ new(tenant="{tenant_id}", app="{app_id}", password="{password}")$ get_subscription("{subscription_id}")$ get_resource_group("myresourcegroup") # create the storage account # by default, this will be in the resource group's region rg$create_storage_account("mynewstorage")

Without any options, this will create a storage account with the following parameters:

  • General purpose account (all storage types supported)
  • Locally redundant storage (LRS) replication
  • Hot access tier (for blob storage)
  • HTTPS connection required for access

You can change these by setting the arguments to create_storage_account(). For example, to create an account with geo-redundant storage replication and the default blob access tier set to “cool”:

rg$create_storage_account("myotherstorage", replication="Standard_GRS", access_tier="cool")

To retrieve an existing storage account, use the get_storage_account() method. Only the storage account name is required.

# retrieve one of the accounts created above stor2 <- rg$get_storage_account("myotherstorage")

Finally, to delete a storage account, you simply call its delete() method. Alternatively, you can call the delete_storage_account() method of the az_resource_group class, which will do the same thing. In both cases, AzureStor will prompt you for confirmation that you really want to delete the storage account.

rg$delete_storage_account("mynewstorage") stor2$delete() # if you have the storage account object

The client interface: working with storage

Storage endpoints

Perhaps the more relevant part of AzureStor for most users is its client interface to storage. With this, you can upload and download files and blobs, create containers and shares, list files, and so on. Unlike the ARM interface, the client interface uses S3 classes. This is for a couple of reasons: it is more familiar to most R users, and it is consistent with most other data manipulation packages in R, in particular the tidyverse.

The starting point for client access is the storage_endpoint object, which stores information about the endpoint of a storage account: the URL that you use to access storage, along with any authentication information needed. The easiest way to obtain an endpoint object is via the storage account resource object’s get_blob_endpoint() and get_file_endpoint() methods:

# get the storage account object stor <- AzureRMR::az_rm$ new(tenant="{tenant_id}", app="{app_id}", password="{password}")$ get_subscription("{subscription_id}")$ get_resource_group("myresourcegroup")$ get_storage_account("mynewstorage") stor$get_blob_endpoint() # Azure blob storage endpoint # URL: https://mynewstorage.blob.core.windows.net/ # Access key: # Account shared access signature: # Storage API version: 2018-03-28 stor$get_file_endpoint() # Azure file storage endpoint # URL: https://mynewstorage.file.core.windows.net/ # Access key: # Account shared access signature: # Storage API version: 2018-03-28

This shows that the base URL to access blob storage is https://mynewstorage.blob.core.windows.net/, while that for file storage is https://mynewstorage.file.core.windows.net/. While it’s not displayed, the endpoint objects also include the access key necessary for authenticated access to storage; this is obtained directly from the storage account resource.

More practically, you will usually want to work with a storage endpoint without having to go through the process of authenticating with Azure Resource Manager (ARM). Often, you may not have any ARM credentials to start with. In this case, you can create the endpoint object directly with blob_endpoint() and file_endpoint():

# same as above blob_endp <- blob_endpoint( "https://mynewstorage.blob.core.windows.net/", key="mystorageaccesskey") file_endp <- file_endpoint( "https://mynewstorage.file.core.windows.net/", key="mystorageaccesskey")

Notice that when creating the endpoint this way, you have to provide the access key explicitly.

Instead of an access key, you can provide a shared access signature (SAS) to gain authenticated access. The main difference between using a key and a SAS is that the former unlocks access to the entire storage account. A user who has a key can access all containers and files, and can read, modify and delete data without restriction. On the other hand, a user with a SAS can be limited to have access only to specific files, or be limited to read access, or only for a given span of time, and so on. This is usually much better in terms of security.

Usually, the SAS will be given to you by your system administrator. However, if you have the storage account resource object, you can generate and use a SAS as follows. Note that generating a SAS requires the storage account’s access key.

# shared access signature: read/write access, container+object access, valid for 12 hours now <- Sys.time() sas <- stor$get_account_sas(permissions="rw", resource_types="co", start=now, end=now + 12 * 60 * 60, key=stor$list_keys()[1]) # create an endpoint object with a SAS, but without an access key blob_endp <- stor$get_blob_endpoint(sas=sas)

If you don’t have a key or a SAS, you will only have access to unauthenticated (public) containers and file shares.

Container and object access: blob containers, file shares, blobs, files

Given an endpoint object, AzureStor provides the following methods for working with containers:

  • blob_container: get an existing blob container
  • create_blob_container: create a new blob container
  • delete_blob_container: delete a blob container
  • list_blob_containers: return a list of blob container objects
  • file_share: get an existing file share
  • create_file_share: create a new file share
  • delete_file_share: delete a file share
  • list_file_shares: return a list of file share objects

Here is some example blob container code showing their use. The file share code is similar.

# an existing container cont <- blob_container(blob_endp, "mycontainer") # create a new container and allow # unauthenticated (public) access to blobs newcont <- create_blob_container(blob_endp, "mynewcontainer", public_access="blob") # delete the container delete_blob_container(newcont) # piping also works library(magrittr) blob_endp %>% blob_container("mycontainer")

As a convenience, instead of providing an endpoint object and a container name, you can also provide the full URL to the container. If you do this, you’ll also have to supply any necessary authentication details such as the access key or SAS.

cont <- blob_container( "https://mynewstorage.blob.core.windows.net/mycontainer", key="mystorageaccountkey") share <- file_share( "https://mynewstorage.file.core.windows.net/myshare", key="mystorageaccountkey")

Given a blob container or file share object, use the list_blobs() and list_azure_files() functions to list the storage objects they contain. The “azure” in list_azure_files is to avoid any confusion with R’s regular list.files function.

# list blobs inside a blob container list_blobs(cont) # Name Last-Modified Content-Length # 1 fs.txt 2018-10-13 11:34:30 132 # 2 fs2.txt 2018-10-13 11:04:36 731930 # if you want only the filenames list_blobs(cont, info="name") # [1] "fs.txt" "fs2.txt" # and for files inside a file share list_azure_files(share, "/") # name type size # 1 100k.txt File 100000 # 2 fs.txt File 132

To transfer files and blobs, use the following functions:

  • upload_blob/download_blob: transfer a file to or from a blob container.
  • upload_azure_file/download_azure_file: transfer a file to or from a file share.
  • upload_to_url: upload a file to a destination given by a URL. This dispatches to either upload_blob or upload_azure_file as appropriate.
  • download_from_url: download a file from a source given by a URL, the opposite of upload_to_url. This is analogous to base R’s download.file but with authentication built in.
# upload a file to a blob container blob_endp <- blob_endpoint( "https://mynewstorage.blob.core.windows.net/", key="mystorageaccesskey") cont <- blob_container(blob_endp, "mycontainer") upload_blob(cont, src="myfile", dest="myblob") # again, piping works blob_endpoint( "https://mynewstorage.blob.core.windows.net/", key="mystorageaccesskey") %>% blob_container("mycontainer") %>% upload_blob("myfile", "myblob") # download a blob, overwriting any existing destination file download_blob(cont, "myblob", "myfile", overwrite=TRUE) # as a convenience, you can download directly from an Azure URL download_from_url( "https://mynewstorage.blob.core.windows.net/mycontainer/myblob", "myfile", key="mystorageaccesskey", overwrite=TRUE)

File shares have the additional feature of supporting directories. To create and delete directories, use create_azure_dir() and delete_azure_dir():

list_azure_files(share, "/") # name type size # 1 100k.txt File 100000 # 2 fs.txt File 132 # create a directory under the root of the file share create_azure_dir(share, "newdir") # confirm that the directory has been created list_azure_files(share, "/") # name type size # 1 100k.txt File 100000 # 2 fs.txt File 132 # 3 newdir Directory NA # delete the directory delete_azure_dir(share, "newdir")

The AzureStor package is available now on GitHub.
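
A minimal installation sketch (the GitHub repository path below is an assumption on my part; check the package’s documentation for the canonical location):

# sketch: install AzureStor from GitHub (repository path assumed, not verified)
# install.packages("remotes")
remotes::install_github("cloudyr/AzureStor")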


To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

So you want to play a pRank in R…?

Tue, 12/18/2018 - 13:36

(This article was first published on R – Open Source Automation, and kindly contributed to R-bloggers)



So…you want to play a pRank with R? This short post will give you a fun function you can use in R to help you out!

How to change a file’s modified time with R

Let’s say we have a file, test.txt. What if we want to change the last modified date of the file (let’s suppose the file’s not that important)? Suppose, for instance, we want to give the file a last modified date back in the 1980s. We can do that with one line of code.

First, let’s use file.info to check the current modified date of some file called test.txt.

file.info("test.txt")

We can see above by looking at mtime that this file was last modified December 4th, 2018.

Now, we can use a function called Sys.setFileTime to change the modified date to any date including or after January 1, 1970.

Sys.setFileTime("test.txt", "1980-12-04")

The base R function above takes the file name as the first parameter and the date as the second parameter. If we use file.info on this file, we can see the newly changed last modified date – set back in 1980!
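
For instance, a quick check along the lines of the earlier call (a sketch; the printed value will depend on your timezone):

# sketch: confirm the new modified time
file.info("test.txt")$mtime
# should now show a date in December 1980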

If we want to change the modified dates for all the files in a directory, we could do this:

sapply(list.files("/path/to/some/directory", full.names = TRUE), function(FILE) Sys.setFileTime(FILE, "1975-01-01"))

The above code will change the last modified time of every file in the directory specified to be January 1, 1975.

You can also, to an extent, make the last modified time some date in the future (up to 2038 as of this writing, presumably because of the classic 32-bit Unix timestamp limit).

sapply(list.files("/path/to/some/directory", full.names = TRUE), function(FILE) Sys.setFileTime(FILE, "2038-01-18"))

Please check out my other R posts here, or subscribe via the right side of the page to receive updates about new posts! To learn more about file manipulation in R, please click here.

The post So you want to play a pRank in R…? appeared first on Open Source Automation.


To leave a comment for the author, please follow the link and comment on their blog: R – Open Source Automation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Alternative approaches to scaling Shiny with RStudio Shiny Server, ShinyProxy or custom architecture.

Tue, 12/18/2018 - 13:23

(This article was first published on r – Appsilon Data Science | End­ to­ End Data Science Solutions, and kindly contributed to R-bloggers)

Shiny is a great tool for fast prototyping. When a data science team creates a Shiny app, it sometimes becomes very popular. From that point on, the app is a production tool used by many people, and it should be reliable and fast for many concurrent users. There are many ways to optimize a Shiny app, like using promises for non-blocking access, profvis for finding bottlenecks, and Shiny modules for well-structured code. Regardless of this, the main aspect you should focus on is how you serve the application.

Why is scaling Shiny tricky? R is single-threaded

This means that all users connected to one R process will block each other. Multithreading is what normally keeps an application responsive, and R by design doesn’t offer it. You have to use workarounds to provide it.

For example, you can do this by serving multiple instances of the app. Also, a promises package can be used to improve responsiveness, but it is still fundamentally different than promises in JavaScript. JavaScript promises are different thanks to the event loop mechanism, which features fully asynchronous I/O and worker threads.
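
As an illustration (a minimal sketch, not from the original article; heavy_computation() is a made-up placeholder for your own long-running code), the promises + future pattern in a Shiny app looks roughly like this:

# sketch: non-blocking computation with promises and future in Shiny
library(shiny)
library(promises)
library(future)
plan(multisession)  # futures run in separate R processes

heavy_computation <- function(n) { Sys.sleep(5); rnorm(n) }  # stand-in for real work

ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("result")
)

server <- function(input, output, session) {
  output$result <- renderPlot({
    n <- input$n                             # read reactive values before launching the future
    future({ heavy_computation(n) }) %...>%  # run in a worker process
      hist(main = "Result")                  # then render when the promise resolves
  })
}

shinyApp(ui, server)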

R is slow. It was designed for convenient data analysis, not for webapps.

The R language was created to make data analysis faster. It has proved its power, and that’s why it is the tool of choice for many data scientists. However, faster analysis doesn’t mean better runtime performance.

Data analysis in R is fast because of the convenient syntax and the amazing number of useful statistical packages, but the execution time of the language itself is slow. A detailed explanation of this is given by Hadley Wickham in this article about R performance. There are also many benchmarks of R’s speed.

Personally, I am excited by alternative implementations of R language, which aim to improve performance. One of them is Oracle FastR based on GraalVM, and another is JVM-based Renjin. Currently, they are still evolving, but I believe their time will come soon.

Four ways how you can serve Shiny

Many people ask us what the effective options are for serving a Shiny app, especially for a large number of users. The industry standard tool for enterprises is RStudio Shiny Server Pro, which costs $9,995/year for 20 concurrent users. It is a mature, powerful solution, which we often recommend to our clients.

However, this is not the only possibility.

Before you make a decision on how to scale your app to multiple users, it is important to understand what your needs are:

  • What is the app initialization time?
  • Does it load a lot of data into memory on startup?
  • What is the app complexity?
  • How heavy are the calculations?
  • What is the expected behavior of users?

Answering these questions will help you choose the best fit for your use case.

Now, let’s see what options are available to you:

(1) Shiny Server Open Source

Shiny Server Open Source is limited to one R process per app, which can potentially serve multiple user sessions (connections to the app). This is totally fine for apps that don’t have many users, but it doesn’t work well for apps that will have large numbers of users. When you have many users, they will all be served by one process and will inevitably block each other.

This architecture is visualized by the following diagram:

(2) Shiny Server Pro

In contrast, on Shiny Server Pro, you can have multiple R processes per app. This means that many concurrent users can be distributed between separate processes and are served more efficiently. As there is no limitation on the number of processes, you can make use of all your machine resources.

On Shiny Server Pro, you can configure a strategy for how resources should be handled with the `utilization_scheduler` parameter (a hedged config sketch follows the list below).

For example, you can set:

  • The maximum R process capacity, i.e. the number of concurrent users per single R process;
  • The maximum number of R processes per single app;
  • When the server should spawn a new R process, e.g. when existing processes reach 90% of their capacity.
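
A hedged sketch of what such a configuration might look like in shiny-server.conf (the parameter order reflects my reading of the Shiny Server Pro documentation; verify against the docs for your version):

# sketch: shiny-server.conf fragment (Shiny Server Pro)
location /myapp {
  app_dir /srv/shiny-server/myapp;
  # up to 20 sessions per R process, spawn a new process at 90% load,
  # and at most 3 R processes for this app
  utilization_scheduler 20 .9 3;
}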

Below you can see an example situation:

In this scenario, five users want to access the app. Shiny Server Pro initializes three worker processes, and users are distributed between them.

(3) ShinyProxy

ShinyProxy from OpenAnalytics is an open source alternative for serving Shiny apps. It provides many features required by enterprise applications. Its architecture is based on docker containers, which isolate the app’s environment and can serve not only Shiny apps but other apps, too (like Dash).

The key difference is that ShinyProxy starts a new app instance for each new user. The architecture is straightforward, convenient, and easy to maintain. This is how it looks on a diagram:

(4) Custom architecture like Appsilon Revolver

Sometimes deploying a Shiny app requires a custom solution. Let’s say we have a Shiny app that takes 60 seconds to initialize, and that is deemed an unacceptable wait time for a user. In this situation, starting an app instance for each user is not the way to go.

At Appsilon, we create custom architectures for Shiny apps. In many cases, we recommend Shiny Server or ShinyProxy because they are often the right tool for a given task. However, there are exceptions.

One of our custom architectures is a project code-named “Revolver.” We wanted to create a scalable architecture for a Shiny app that takes a long time to initialize and performs a lot of heavy computations. It is not a standalone product, and it was first designed for one of our clients. Since then, we’ve used it to deploy several production apps.

Here is how it looks:

 

This approach uses docker containers that serve the application using Shiny Server Open Source. In front of application instances, there is a load balancer that distributes the traffic.

The difference between our product and ShinyProxy is that there are N pre-initialized containers that wait for user connections. The number of containers is configured by the app admin and can be auto-adjusted. The advantage of this approach is that the app is served instantly. Users do not have to wait for the app initialization.

It is also worth noting that this custom architecture supports SSL connection, authentication, and other enterprise requirements.

Cheat Sheet

I hope this article on the possible solutions for scaling Shiny apps to many concurrent users was helpful to you. If you’d like to chat about a specific use case, you can contact us through https://appsilon.com/shiny

Below, we have compiled the information shared above into a comparison cheat sheet diagram. Again, I hope it helps!

Article Alternative approaches to scaling Shiny with RStudio Shiny Server, ShinyProxy or custom architecture. comes from Appsilon Data Science | End­ to­ End Data Science Solutions.


To leave a comment for the author, please follow the link and comment on their blog: r – Appsilon Data Science | End­ to­ End Data Science Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

vtreat Variable Importance

Tue, 12/18/2018 - 05:07

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

vtreat’s purpose is to produce pure numeric R data.frames that are ready for supervised predictive modeling (predicting a value from other values). By ready we mean: a purely numeric data frame with no missing values and a reasonable number of columns (missing values re-encoded with indicators, and high-degree categorical variables re-encoded as effects codes or impact codes).

In this note we will discuss a small aspect of the vtreat package: variable screening.

Part of the vtreat philosophy is to assume after the vtreat variable processing the next step is a sophisticated supervised machine learning method. Under this assumption we assume the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) will handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.

However, an important exception is: variable screening. In practice we have seen wide data-warehouses with hundreds of columns overwhelm and defeat state of the art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).

The upshot is: even in 2018 you can not treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.

To help with this vtreat incorporates a per-variable linear significance report. This report shows how useful each variable is taken alone in a linear or generalized linear model (some details can be found here). However, this sort of calculation was optimized for speed, not discovery power.

vtreat now includes a direct variable valuation system that works very well with complex numeric relationships. It is a function called vtreat::value_variables_N() for numeric or regression problems and vtreat::value_variables_C() for binomial classification problems. It works by fitting two transformed copies of each numeric variable to the outcome. One transform is a low-frequency transform realized as an optimal k-segment linear model for a moderate choice of k. The other fit is a high-frequency transform realized as a k-nearest neighbor average for a moderate choice of k. Some of the methodology is shown here.

We recommend using vtreat::value_variables_*() as an initial variable screen.

Let’s demonstrate this using the data from the segment fitter example. In our case the value to be predicted ("y") is a noisy copy of sin(x). Let’s set up our example data:

set.seed(1999) d <- data.frame(x = seq(0, 15, by = 0.25)) d$y_ideal <- sin(d$x) d$x_noise <- d$x[sample.int(nrow(d), nrow(d), replace = FALSE)] d$y <- d$y_ideal + 0.5*rnorm(nrow(d)) dim(d) ## [1] 61 4

Now a simple linear valuation of the the variables can be produced as follows.

cfe <- vtreat::mkCrossFrameNExperiment( d, varlist = c("x", "x_noise"), outcomename = "y") ## [1] "vtreat 1.3.4 start initial treatment design Mon Dec 17 20:02:58 2018" ## [1] " start cross frame work Mon Dec 17 20:02:59 2018" ## [1] " vtreat::mkCrossFrameNExperiment done Mon Dec 17 20:02:59 2018" sf <- cfe$treatments$scoreFrame knitr::kable(sf[, c("varName", "rsq", "sig")]) varName rsq sig x 0.0050940 0.5846527 x_noise 0.0155874 0.3377146

Notice the signal carrying variable did not score better (having a larger r-squared and a smaller (better) significance value) than the noise variable (that is unrelated to the outcome). This is because the relation between x and y is not linear.

Now let’s try vtreat::value_variables_N().

vf = vtreat::value_variables_N( d, varlist = c("x", "x_noise"), outcomename = "y") rownames(vf) <- NULL knitr::kable(vf[, c("var", "rsq", "sig")]) var rsq sig x 0.26719313 0.0000600 x_noise 0.01591601 0.9978984

Now the difference is night and day. The important variable x is singled out (scores very well), and the unimportant variable x_noise doesn’t often score well. Though, as with all significance tests, useless variables can get lucky from time to time (an issue that can be addressed by using a Cohen’s-d style calculation).

Our modeling advice is:

  • Use vtreat::value_variables_*()
  • Pick all variables with sig <= 1/number_of_variables_being_considered.

The idea is: each "pure noise" (or purely useless) variable has a significance that is distributed uniformly between zero and one. So the expected number of useless variables that make it through the above screening is number_of_useless_variables * P[useless_sig <= 1/number_of_variables_being_considered]. This equals number_of_useless_variables * 1/number_of_variables_being_considered. As number_of_useless_variables <= number_of_variables_being_considered we see this quantity is no more than one. So we expect a constant number of useless variables to sneak through this filter. The hope is: this should not be enough useless variables to overwhelm the next stage supervised machine learning step.
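
As a concrete illustration (a sketch, not from the original post), applying this rule to the vf score frame produced above would look something like:

# sketch: keep variables whose significance beats the 1/n threshold
n_considered <- nrow(vf)                          # number of candidate variables
screened <- vf[vf$sig <= 1 / n_considered, "var"]
screened                                          # given the score frame above, only "x" survives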

Obviously there are situations where variable importance can not be discovered without considering joint distributions. The most famous one is "xor", where the concept to be learned is whether an odd or even number of indicator variables are zero or one (each such variable is individually completely uninformative about the outcome until you have all of the variables simultaneously). However, for practical problems you often have that most variables have a higher marginal predictive power taken alone than they have in the final joint model (as other, better, variables consume some of the common variables’ predictive power in the joint model). With this in mind, single variable screening often at least gives an indication of where to look.

In conclusion the vtreat package and vtreat::value_variables_*() can be a valuable addition to your supervised learning practice.


To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Statistics in Glaucoma: Part III

Tue, 12/18/2018 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Samuel Berchuck is a Postdoctoral Associate in Duke University’s Department of Statistical Science and Forge-Duke’s Center for Actionable Health Data Science.

Joshua L. Warren is an Assistant Professor of Biostatistics at Yale University.

Looking Forward in Glaucoma Progression Research

The contribution of the womblR package and corresponding statistical methodology is a technique for correctly accounting for the complex spatial structure of the visual field. The purpose of this method is to properly model visual field data, so that an effective diagnostic is derived that discriminates progression status. This is one of many important clinical questions that needs to be addressed in glaucoma research. Others include: quantifying visual field variability to create simulations of healthy and progression patients, combining multiple data modalities to obtain a composite diagnostic, and predicting the timing and spatial location of future vision loss. There is opportunity within the glaucoma literature for the development of quantitative methods that answer important clinical questions, are easy to understand, and are simple to use. To this end, in closing this three-part series, we present a final example of a new method that uses change points to assess future vision loss.

Modeling Changes on the Visual Field Using Spatial Change Points

To motivate the use of change points, we note that patients diagnosed with glaucoma are often monitored for years with slow changes in visual functionality. It is not until disease progression that notable vision loss occurs, and the deterioration is often swift. This disease course inspires a modeling framework that can identify a point of functional change in the course of follow-up, thus change points are employed. This is an appealing modeling framework, because the concept of disease progression becomes intrinsically parameterized into the model, with the change point representing the point of functional change. In this model, the time of the change point triggers a simultaneous change in both the mean and variance process. Furthermore, to account for the typical trend of a long period of monitoring with little change followed by abrupt vision loss, we force the mean and variance to be constant before the change point. For the mean process, and assuming data from a patient with nine visual field tests, this results in \[\mu_t\left(\mathbf{s}_i\right)=\left\{ \begin{array}{ll}
{\beta}_0\left(\mathbf{s}_i\right) & x_t \leq \theta\left(\mathbf{s}_i\right),\\
{\beta}_0\left(\mathbf{s}_i\right) + {\beta}_1\left(\mathbf{s}_i\right)\left\{x_t-\theta\left(\mathbf{s}_i\right)\right\} & \theta\left(\mathbf{s}_i\right) \leq x_t.\end{array} \right. \quad t = 1,\ldots,9 \quad i = 1,\ldots,52 \] Here, the change point at location \(\mathbf{s}_i\) is given by \(\theta(\mathbf{s}_i)\), and \(x_t\) is the days from baseline visit for follow-up visit \(t\). A final important detail is that the change points \(\theta(\mathbf{s}_i)\) are truncated in the observed follow-up range, \((x_1, x_9)\). In practice, the true change point can occur outside of this region, so we define a latent process, \(\eta(\mathbf{s}_i)\), that defines the true change point, \(\theta(\mathbf{s}_i) = \max\{\min\{\eta(\mathbf{s}_i), x_9\}, x_1\}\). Finally, all of the location-specific effects are modeled using a novel multivariate conditional autoregressive (MCAR) prior that incorporates the anatomy detailed in the womblR method. More details can be found in Berchuck et al. 2018.
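
The truncation of the latent change point to the observed follow-up window is easy to express directly; a minimal R sketch (the visit times below are hypothetical, and the function is illustrative rather than part of spCP):

# sketch: theta(s_i) = max{ min{ eta(s_i), x_9 }, x_1 }
truncate_cp <- function(eta, x) {
  pmax(pmin(eta, max(x)), min(x))
}
x <- c(0, 0.5, 1.1, 1.6, 2.2, 2.7, 3.3, 3.9, 4.4)  # hypothetical visit times (years)
truncate_cp(c(-1, 2.0, 10), x)
## [1] 0.0 2.0 4.4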

We once again rely on MCMC methods for inference, and a package similar to womblR was developed that implements the spatially varying change point model, spCP. This package has much of the same functionality as womblR, and we demonstrate its functionality below.

We begin by loading spCP. All of the visual field data (VFSeries), adjacencies (HFAII_Queen), and anatomical angles (GarwayHeath) are included in the spCP package.

###Load package library(spCP) ###Format data blind_spot <- c(26, 35) # define blind spot VFSeries <- VFSeries[order(VFSeries$Location), ] # sort by location VFSeries <- VFSeries[order(VFSeries$Visit), ] # sort by visit VFSeries <- VFSeries[!VFSeries$Location %in% blind_spot, ] # remove blind spot locations Y <- VFSeries$DLS # define observed outcome data Time <- unique(VFSeries$Time) / 365 # years since baseline visit MaxTime <- max(Time) ###Neighborhood objects W <- HFAII_Queen[-blind_spot, -blind_spot] # visual field adjacency matrix M <- dim(W)[1] # number of locations DM <- GarwayHeath[-blind_spot] # Garway-Heath angles Nu <- length(Time) # number of visits ###Obtain bounds for spatial parameter (details are in Berchuck et al. 2018) pdist <- function(x, y) pmin(abs(x - y), (360 - pmax(x, y) + pmin(x, y))) #Dissimilarity metric distance function (i.e., circumference) DM_Matrix <- matrix(nrow = M, ncol = M) for (i in 1:M) { for (j in 1:M) { DM_Matrix[i, j] <- pdist(DM[i], DM[j]) } } BAlpha <- -log(0.5) / min(DM_Matrix[DM_Matrix > 0]) AAlpha <- 0 ###Hyperparameters Hypers <- list(Alpha = list(AAlpha = AAlpha, BAlpha = BAlpha), Sigma = list(Xi = 6, Psi = diag(5)), Delta = list(Kappa2 = 1000)) ###Starting values Starting <- list(Sigma = 0.01 * diag(5), Alpha = mean(c(AAlpha, BAlpha)), Delta = c(0, 0, 0, 0, 0)) ###Metropolis tuning variances Tuning <- list(Lambda0Vec = rep(1, M), Lambda1Vec = rep(1, M), EtaVec = rep(1, M), Alpha = 1) ###MCMC inputs MCMC <- list(NBurn = 10000, NSims = 250000, NThin = 25, NPilot = 20)

Once the inputs have been properly formatted, the program can be run.

###Run MCMC sampler reg.spCP <- spCP(Y = Y, DM = DM, W = W, Time = Time, Starting = Starting, Hypers = Hypers, Tuning = Tuning, MCMC = MCMC, Family = "tobit", Weights = "continuous", Distance = "circumference", Rho = 0.99, ScaleY = 10, ScaleDM = 100, Seed = 54) ## Burn-in progress: |*************************************************| ## Sampler progress: 0%.. 10%.. 20%.. 30%.. 40%.. 50%.. 60%.. 70%.. 80%.. 90%.. 100%..

To visualize the estimated change points, we can use the PlotCP function from spCP. The function requires the model fit object and the original data set, plus the variable names of the raw DLS, time (in years), and spatial locations.

VFSeries$TimeYears <- VFSeries$Time / 365 PlotCP(reg.spCP, VFSeries, dls = "DLS", time = "TimeYears", location = "Location", cp.line = TRUE, cp.ci = TRUE)

PlotCP displays the posterior mean of each change point as a blue vertical line, with dashed 95% credible intervals. The mean process and its credible interval are plotted using red lines, and the raw DLS values are shown as black points. For this example patient, the majority of the change points lie at the edges of follow-up: locations where the DLS is constant over time have estimated change points at the end of follow-up, while locations that are already trending at baseline have change points at the beginning. This information provides clinicians with visual and quantitative confirmation of functional changes across the visual field.

Change Points as a Proxy for Progression

To formalize the importance of the change points, we convert their presence or absence into a clinically useful quantity: the probability that a change point has already occurred at each location across the visual field. To make this accessible to clinicians, we create a gif that presents these probabilities throughout a patient’s follow-up and projects them one and a half years into the future. In Berchuck et al. 2018, it is shown that these change points are highly predictive of progression.

We begin by extracting and calculating the change point probabilities.

###Extract change point posterior samples
eta <- reg.spCP$eta

###Convert change points to probabilities of occurring before time t
NFrames <- 50 # number of frames in GIF
GIF_Times <- seq(0, MaxTime + 1.5, length.out = NFrames) # extend the GIF 1.5 years past the end of follow-up
GIF_Days <- round(GIF_Times * 365) # convert to days for use later
CP_Probs <- matrix(nrow = M, ncol = NFrames)

###Obtain probabilities at each time point
for (t in 1:NFrames) {
  CP_Probs[, t] <- apply(eta, 2, function(x) mean(x < GIF_Times[t]))
}
colnames(CP_Probs) <- GIF_Times

Now, to create a gif of the probabilities, we use the magick package, and in particular, the functions image_graph and image_animate. Furthermore, we use the PlotSensitivity function from womblR to plot the predicted probabilities on the visual field.

###Load packages
library(magick) # package for creating GIFs
library(womblR) # loaded for PlotSensitivity

###Create GIF
Img <- image_graph(600, 600, res = 96)
for (f in 1:NFrames) {
  p <- womblR::PlotSensitivity(CP_Probs[, f],
                               legend.lab = expression(paste("Pr[", eta, "(s)] < ", t)),
                               zlim = c(0, 1), bins = 250, legend.round = 2, border = FALSE,
                               main = bquote("Days from baseline: t = " ~ .(GIF_Days[f]) ~
                                               " (" ~ t[max] ~ " = " ~ .(Time[Nu] * 365) ~ ")"))
}
dev.off()

## quartz_off_screen
##                 2

Now, we animate and print the created gif with the image_animate function, specifying 10 frames per second via the fps option.

Animation <- image_animate(Img, fps = 10)
Animation
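To share the animation outside of an R session, it can also be written to disk with magick's image_write; the file name below is just a placeholder.

###Save the animation to a file (file name is a placeholder)
image_write(Animation, path = "cp_probabilities.gif")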

This gif has several properties that make it clinically useful. Its space-time nature allows clinicians to understand not only the current state of the disease, but also the progression pattern throughout follow-up. Furthermore, the gif projects the pattern and risk of progression over the next one and a half years, giving clinicians a tool for planning around future risk.

Conclusions and Future Directions

Our hope in developing these R packages is for them to be used clinically, and to inspire other quantitative scientists to do the same. Statistical methods developed for medical research are typically published without any corresponding software package. This means that no matter how impactful a method may be, it is unlikely to make a clinical impact for many years, because the inferential machinery is complex to implement. Clinicians depend on quantitative methods for analyzing the massive amounts of data that exist in today’s world, and they typically rely on the proprietary software built into the imaging machines themselves. This software is useful, but because the underlying methods are often not published, the results can be difficult to interpret. More open-source software for medical research will lead to greater collaboration and greater visibility for the important problems being addressed by health researchers. The R environment, including CRAN and RStudio, makes it particularly easy to create and share R packages, and Rcpp and its relatives allow those packages to be computationally fast. Our hope is that the womblR and spCP packages illustrate this and excite people to get involved in glaucoma research, or one of the many other important health areas.

Reference
  1. Berchuck, S.I., Mwanza, J.C., & Warren, J.L. (2018). “A Spatially Varying Change Points Model for Monitoring Glaucoma Progression Using Visual Field Data”.




rcites – The story behind the package

Tue, 12/18/2018 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

The Ecology Hackathon

Almost one year ago now, ecologists filled a room for the “Ecology Hackathon: Developing R Packages for Accessing, Synthesizing and Analyzing Ecological Data”, co-organised by rOpenSci Fellow Nick Golding and Methods in Ecology and Evolution. This hackathon was part of the “Ecology Across Borders” Joint Annual Meeting 2017 of BES, GfÖ, NecoV, and EEF in Ghent. People gathered around different tables to work on different ideas to implement as R packages. At our table, we were around ten people who knew very little about what we were aiming for. We barely knew each other and nobody had clear expectations, just the desire to learn more about R packages. We were interested in a common idea posted on a wishlist in the rOpenSci community: building an R package to interact with CITES and its Speciesplus database. CITES (the Convention on International Trade in Endangered Species of Wild Fauna and Flora) is an international agreement between governments that provides key information to ensure that international trade in specimens of wild animals and plants does not threaten their survival. At 10 am, nobody had a clear idea of where to start. By 6 pm, we had a functional prototype of the rcites package, which was really rewarding and gave us motivation to follow up on the package development. We did great teamwork, met new researchers, and learned a bunch of new stuff. This was definitely a successful hackathon!

Now, almost one year later, it appears to us that several aspects of the hackathon were key to our success. The first was the short but thorough introduction to R package development by Maëlle Salmon (Maëlle was an Editor for rOpenSci software peer review at the time, and became a staff Research Software Engineer with rOpenSci shortly after). Within an hour or so we had a clear vision of what an R package development pipeline looks like. A second key was the help of more experienced people: our group benefited immensely from the experience of Tamora James, who kindly answered our questions. The third key was the diversity of people interested in the project. While some of us were beginner R users, others had moderate experience in creating packages and others had already dealt with CITES before, so we were able to split tasks organically. Some of us learned how to access the CITES/Speciesplus API and how to write a request via URLs. Others worked on rough versions of functional code, while others started writing the help files and testing the package. After a few hours of frustration, not knowing why a given function refused to work as intended, things started to fit together like pieces of a jigsaw puzzle. This was very empowering.

To summarize the hackathon: we learned a lot. First, how to create an R package, i.e. creating the file structure, writing functions and their documentation, testing functions, and checking whether the package works (which was already a lot at that stage of development). Second, how to write URL requests in R to query a web API (a sketch of what such a request looks like is given below). Third, how to use GitHub for collaborative coding. Of course, we did not become experts on all of these topics; it was just the beginning of an adventure and a lot of work remained, but still … what a day!
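For readers curious what such a raw request looks like, here is a minimal sketch using httr against the Species+/CITES web service. The base URL, endpoint, header name, and response structure are written down as we recall them from the public API documentation, and the token is a placeholder, so please check the Species+ API docs before relying on them; in practice, rcites wraps all of this for you.

###Raw Species+ API request with httr (endpoint and header as recalled; token is a placeholder)
library(httr)
resp <- GET("https://api.speciesplus.net/api/v1/taxon_concepts",
            query = list(name = "Loxodonta africana"),
            add_headers("X-Authentication-Token" = "YOUR_TOKEN_HERE"))
stop_for_status(resp) # error out on a failed request
taxa <- content(resp, as = "parsed") # parse the JSON response into R lists
taxa$taxon_concepts[[1]]$id # taxon identifier, assuming the documented response structure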

The long way to perfection

Although most of the code was written during the hackathon in Ghent, the package was deeply restructured in the following months. We realized that a good R package is more than a set of functions: it requires proper organization of the code to avoid redundancy, adequate unit testing, clear and complete documentation, a cool hexagonal sticker, and a lot of thought about how to make the package user-friendly. All of this takes time, and we moved slowly while completing the package. After seven months, we were ready to submit the package to rOpenSci for review and to release a first version (v0.1.0) on CRAN.

The help of rOpenSci and its onboarding program was pivotal. Two dedicated reviewers, Noam Ross and Margaret Siple, tested the package and suggested enhancements. Their reviews were of top-notch quality: they gave extremely valuable advice and were really supportive of our work and the development of the package. After the review, we were very enthusiastic and did our best to address all of their comments. Interestingly, during that stage the code of our key functions shrank while the number of “private” functions (gathered in zzz.R) increased. We actually added several functionalities while structuring the code better to offer a better user experience (well, we hope so). We can now say that the difference between a working package and a well-crafted package is substantial. We also understand why rOpenSci suggests doing the CRAN submission after the review. There is no rush for a CRAN release, as the package is already available on GitHub, and the number of changes after a revision is likely to warrant a major release, meaning that scripts built on earlier versions of the package will no longer work with the revised package. This is exactly what happened to us: we released v1.0.0 after the review process, and we’ve learned our lesson.

So how does rcites work?

For a quick illustration, we can retrieve CITES information on the native distribution of the African bush elephant (Loxodonta africana, hereafter “elephant”). The elephant is a highly endangered species that is not only illegally traded globally but also a flagship species of nature conservation and the logo species of CITES.

We start with a basic setup: we load the package and set the token (see details in the package vignettes):

library(rcites)
set_token("8QW6Qgh57sBG2k0gtt")

In order to access information about the elephant, we first need to retrieve its Speciesplus taxon identifier. For this, we use the spp_taxonconcept() function and the elephant’s scientific name, Loxodonta africana, as query_taxon argument.

elephant_taxonconcept <- spp_taxonconcept(query_taxon = "Loxodonta africana")
elephant_taxonconcept

Now we can map the elephant’s distribution with the help of the rworldmap package.

library(rworldmap)
par(las = 1)
elephant_distr <- spp_distributions(taxon_id = "4521", verbose = FALSE)$distributions
map2 <- joinCountryData2Map(elephant_distr, joinCode = "ISO2",
                            nameJoinColumn = "iso_code2", nameCountryColumn = "name")
plot(c(-23, 62), c(45, -40), type = "n", main = "Loxodonta africana",
     xlab = "Longitude", ylab = "Latitude")
plot(map2, add = TRUE)
plot(map2[!is.na(map2$iso_code2), ], col = "#cba74d", add = TRUE)


In this map we can see all countries where the elephant occurs naturally.
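If you would rather inspect the distribution table than map it, the columns used for the join above can be previewed directly; this is just a small usage note assuming elephant_distr is the data frame returned by spp_distributions() above.

###Preview the join columns of the distribution table
head(elephant_distr[, c("iso_code2", "name")])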

For more functionalities, please check the package documentation and the vignette Study case: the African bush elephant (Loxodonta africana).
For citation, please also refer to the release paper (doi: 10.21105/joss.01091).



An R Shiny app to recognize flower species

Mon, 12/17/2018 - 14:45

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

Introduction

Playing around with PyTorch and R Shiny resulted in a simple Shiny app where the user can upload a flower image, and the system then predicts the flower species.

Steps that I took
  1. Download labeled flower data from the Visual Geometry Group,
  2. Install PyTorch and download their transfer learning tutorial script,
  3. Slightly adjust the script to work on the flower data,
  4. Train and save the model as a (*.pt) file,
  5. Use the R reticulate package to call Python code from within R so that the PyTorch model can be used in R (a minimal sketch of this step follows below),
  6. Create a Shiny app that allows the user to upload an image and displays the predicted flower species.
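For step 5, the following is a minimal sketch of how reticulate can load a saved PyTorch model and score an uploaded image. The file name, image size, and class labels are placeholders, the model output is assumed to be raw class scores, and the preprocessing must mirror whatever transforms the training script used, so treat this as an outline rather than the app's actual code.

###Score an image with a saved PyTorch model via reticulate (names and labels are placeholders)
library(reticulate)
torch       <- import("torch")
torchvision <- import("torchvision")
PIL         <- import("PIL")

model <- torch$load("flower_model.pt") # model saved from the Python training script
model$eval()

# Preprocessing must match the transforms used during training
tf <- torchvision$transforms$Compose(list(
  torchvision$transforms$Resize(c(224L, 224L)),
  torchvision$transforms$ToTensor()
))

predict_flower <- function(image_path, class_labels) {
  img    <- PIL$Image$open(image_path)$convert("RGB")
  tensor <- tf(img)$unsqueeze(0L) # add a batch dimension
  scores <- model(tensor) # raw class scores
  idx    <- as.integer(torch$argmax(scores, dim = 1L)$item()) + 1L # 0-based to 1-based index
  class_labels[idx]
}

Inside the Shiny server function, predict_flower() could then be called on the uploaded file's temporary path (input$yourfile$datapath).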
Some links

GitHub repo with: a Python notebook to fine-tune the resnet18 model, an R script with the Shiny app, and a data folder with images.

A live running Shiny app can be found here

Cheers, Longhow



Day 17 – little helper to_na

Mon, 12/17/2018 - 08:00

(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)

We at STATWORX work a lot with R and we often use the same little helper functions within our projects. These functions ease our daily work life by reducing repetitive code parts or by creating overviews of our projects. At first, there was no plan to make a package, but soon I realised that it would be much easier to share and improve those functions if they were within a package. Up until the 24th of December I will present one function from helfRlein each day. So, on the 17th day of Christmas my true love gave to me…

What can it do?

This little helper is just a convenience function. Sometimes during your data preparation, you have a vector with infinite values like Inf or -Inf or even NaN values. These kinds of values can (they do not have to!) mess up your evaluation and models, whereas most functions are able to handle missing values. So, this little helper removes such values and replaces them with NA.
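The actual helfRlein implementation may differ (for example, it may take extra care to preserve the original vector type), but a minimal sketch of such a helper could look like this:

# Minimal sketch of a to_na-style helper (not the helfRlein source)
to_na_sketch <- function(x) {
  x[is.infinite(x) | is.nan(x)] <- NA # Inf, -Inf and NaN become NA; other values stay untouched
  x
}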

How to use it?

A small example to give you the idea:

test <- list(a = c("a", "b", NA), b = c(NaN, 1,2, -Inf), c = c(TRUE, FALSE, NaN, Inf)) lapply(test, to_na) $a [1] "a" "b" NA $b [1] NA 1 2 NA $c [1] TRUE FALSE NA

A little advice along the way: since there are different types of NA depending on the other values within a vector, you might want to check the format if you apply to_na to groups or subsets.

test <- list(NA, c(NA, "a"), c(NA, 2.3), c(NA, 1L))
str(test)

List of 4
 $ : logi NA
 $ : chr [1:2] NA "a"
 $ : num [1:2] NA 2.3
 $ : int [1:2] NA 1

Overview

To see all the other functions you can either check out our GitHub or you can read about them here.

Have a merry advent season!

About the author

Jakob Gepp

Numbers have always been my passion, and as a data scientist and statistician at STATWORX I can fulfill my nerdy needs. I am also responsible for our blog. So if you have any questions or suggestions, just send me an email!

ABOUT US

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. 


The post Day 17 – little helper to_na first appeared on STATWORX.



2018-13 Rendering HTML Content in R Graphics

Mon, 12/17/2018 - 00:23

(This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

This report describes several R packages that allow HTML content to be rendered as part of an R plot. The core package is called ‘layoutEngine’, but that package requires a “backend” package to perform HTML layout calculations. Three example backends are demonstrated: ‘layoutEngineCSSBox’, ‘layoutEnginePhantomJS’, and ‘layoutEngineDOM’. We also introduce two new font packages, ‘gyre’ and ‘courier’.

Paul Murrell

Download


