R bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 9 hours 18 min ago

Writing academic articles using R Markdown and LaTeX

Thu, 10/05/2017 - 01:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

One of my favourite activities in R is using Markdown to create business reports. Most of my work I export to MS Word to communicate analytical results with my colleagues. For my academic work and eBooks, I prefer LaTeX to produce great typography. This article explains how to write academic articles and essays combining R Markdown and LaTeX. The article is formatted in accordance with the APA (American Psychological Association) requirements.

To illustrate the principles of using R Markdown and LaTeX, I recycled an essay about problems with body image that I wrote for a psychology course many years ago. You can find the completed paper and all necessary files on my GitHub repository.

Body Image

Body image describes the way we feel about the shape of our body. The literature on this topic demonstrates that many people, especially young women, struggle with their body image. A negative body image has been strongly associated with eating disorders. Psychologists measure body image using a special scale, shown in the image below.

My paper measures the current and ideal body shape of the subject and the body shape of the most attractive other sex. The results confirm previous research which found that body dissatisfaction for females is significantly higher than for men. The research also found a mild positive correlation between age and ideal body shape for women and between age and the female body shape found most attractive by men. You can read the full paper on my personal website.

Body shape measurement scale.

R Markdown and LaTeX

The R Markdown file for this essay uses the Sweave package to integrate R code with LaTeX. The first two code chunks create a table to summarise the respondents using the xtable package. This package creates LaTeX or HTML tables from data generated by R code.

The first lines of the code read and prepare the data, while the second set of lines creates a table in LaTeX code. The code chunk uses results=tex to ensure the output is interpreted as LaTeX, and this approach is used in most of the other chunks. The image is created within the document, saved as a PDF file, and then included back into the document as a figure with an appropriate label and caption.

<<results=tex>>=
body <- read.csv("body_image.csv")

# Respondent characteristics
body$Cohort <- cut(body$Age, c(0, 15, 30, 50, 99),
                   labels = c("<16", "16--30", "31--50", ">50"))
body$Date <- as.Date(body$Date)
body$Current_Ideal <- body$Current - body$Ideal

library(xtable)
respondents <- addmargins(table(body$Gender, body$Cohort))
xtable(respondents, caption = "Age profile of survey participants",
       label = "gender-age", digits = 0)
@

Configuration

I created this file in RStudio, using its Sweave and knitr functionality. To knit the file for this paper you will need to install the apa6 and ccicons packages in your LaTeX distribution. The apa6 package provides macros to format papers in accordance with the requirements of the American Psychological Association.
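If you prefer to compile from the console rather than RStudio's Compile PDF button, here is a minimal sketch; the file name is a placeholder for the actual .Rnw source in the repository:

library(knitr)

# Compile the Sweave/knitr source straight to PDF. This assumes a LaTeX
# installation with the apa6 and ccicons packages available.
knit2pdf("body_image_essay.Rnw")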

The post Writing academic articles using R Markdown and LaTeX appeared first on The Devil is in the Data.


Introducing the Deep Learning Virtual Machine on Azure

Wed, 10/04/2017 - 21:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

A new member has just joined the family of Data Science Virtual Machines on Azure: the Deep Learning Virtual Machine. Like other DSVMs in the family, the Deep Learning VM is a pre-configured environment with all the tools you need for data science and AI development pre-installed. The Deep Learning VM is designed specifically for GPU-enabled instances, and comes with a complete suite of deep learning frameworks including Tensorflow, PyTorch, MXNet, Caffe2 and CNTK. It also comes with example scripts and data sets to get you started on deep learning and AI problems.

The DLVM, along with all the DSVMs, also provides a complete suite of data science tools including R, Python, Spark, and much more.

There have also been some updates and additions to the tools provided across the entire DSVM family.

All Data Science Virtual Machines, including the Deep Learning Virtual Machine, are available as Windows and Ubuntu Linux instances, and are free of any software charges: pay only for the infrastructure charge according to the power and size of the instance you choose. An Azure account is required, but you can get started with $200 in free Azure credits here.

Microsoft Azure: Data Science Virtual Machines

 


Time Series Analysis in R Part 3: Getting Data from Quandl

Wed, 10/04/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

This is part 3 of a multi-part guide on working with time series data in R. You can find the previous parts here: Part 1, Part 2.

Generated data like that used in Parts 1 and 2 is great for the sake of example, but not very interesting to work with. So let's get some real-world data that we can work with for the rest of this tutorial. There are countless sources of time series data we could use, including some that are already included in R and its packages. We'll use some of this data in examples, but I'd like to expand our horizons a bit.

Quandl has a great warehouse of financial and economic data, some of which is free. We can use the Quandl R package to obtain data using the API. If you do not have the package installed in R, you can do so using:

install.packages('Quandl')

You can browse the site for a series of interest and get its API code. Below is an example of using the Quandl R package to get housing price index data. This data originally comes from the Yale Department of Economics and is featured in Robert Shiller’s book “Irrational Exuberance”. We use the Quandl function and pass it the code of the series we want. We also specify “ts” for the type argument so that the data is imported as an R ts object. We can also specify start and end dates for the series. This particular data series goes all the way back to 1890. That is far more than we need so I specify that I want data starting in January of 1990. I do not supply a value for the end_date argument because I want the most recent data available. You can find this data on the web here.

library(Quandl)

hpidata <- Quandl("YALE/NHPI", type = "ts", start_date = "1990-01-01")
plot.ts(hpidata, main = "Robert Shiller's Nominal Home Price Index")

Gives this plot:

While we are here, let’s grab some additional data series for later use. Below, I get data on US GDP and US personal income, and the University of Michigan Consumer Survey on selling conditions for houses. Again I obtained the relevant codes by browsing the Quandl website. The data are located on the web here, here, and here.

gdpdata <- Quandl("FRED/GDP", type = "ts", start_date = "1990-01-01")
pidata <- Quandl("FRED/PINCOME", type = "ts", start_date = "1990-01-01")
umdata <- Quandl("UMICH/SOC43", type = "ts")[, 1]

plot.ts(cbind(gdpdata, pidata), main = "US GDP and Personal Income, billions $")

Gives this plot:

plot.ts(umdata, main = "University of Michigan Consumer Survey, Selling Conditions for Houses")

Gives this plot:

The Quandl API also has some basic options for data preprocessing. The US GDP data is at a quarterly frequency, but suppose we want annual data. We can use the collapse argument to collapse the data to a lower frequency. Here we convert the data to annual as we import it.

gdpdata_ann <- Quandl("FRED/GDP", type = "ts", start_date = "1990-01-01", collapse = "annual")
frequency(gdpdata_ann)
[1] 1

We can also transform our data on the fly as it is imported. The Quandl function has a transform argument that allows us to specify the type of data transformation we want to perform. There are five options: "diff", "rdiff", "normalize", "cumul", and "rdiff_from". Specifying the transform argument as "diff" returns the simple difference, "rdiff" yields the percentage change, "normalize" gives an index where each value is that value divided by the first period value and multiplied by 100, "cumul" gives the cumulative sum, and "rdiff_from" gives each value as the percent difference between itself and the last value in the series. For more details on these transformations, check the API documentation here.

For example, here we get the data in percent change form:

gdpdata_pc <- Quandl("FRED/GDP", type = "ts", start_date = "1990-01-01", transform = "rdiff")
plot.ts(gdpdata_pc * 100, ylab = "% change", main = "US Gross Domestic Product, % change")

Gives this plot:
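The collapse and transform arguments can also be combined in a single call. A small sketch using the same series, requesting annual data expressed as percent change from the previous year:

gdpdata_ann_pc <- Quandl("FRED/GDP", type = "ts", start_date = "1990-01-01",
                         collapse = "annual", transform = "rdiff")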

You can find additional documentation on using the Quandl R package here. I’d also encourage you to check out the vast amount of free data that is available on the site. The API allows a maximum of 50 calls per day from anonymous users. You can sign up for an account and get your own API key, which will allow you to make as many calls to the API as you like (within reason of course).
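If you do sign up, you can register your key once per session with the Quandl.api_key function provided by the package; replace the placeholder string with your own key:

library(Quandl)
Quandl.api_key("YOUR_API_KEY")  # placeholder: the key from your Quandl account settings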

In Part 4, we will discuss visualization of time series data. We’ll go beyond the base R plotting functions we’ve used up until now and learn to create better-looking and more functional plots.

    Related Post

    1. Pulling Data Out of Census Spreadsheets Using R
    2. Extracting Tables from PDFs in R using the Tabulizer Package
    3. Extract Twitter Data Automatically using Scheduler R package
    4. An Introduction to Time Series with JSON Data
    5. Get Your Data into R: Import Data from SPSS, Stata, SAS, CSV or TXT

    nzelect 0.4.0 on CRAN with results from 2002 to 2014 and polls up to September 2017 by @ellis2013nz

    Wed, 10/04/2017 - 13:00

    (This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

    More nzelect New Zealand election data on CRAN

    Version 0.4.0 of my nzelect R package is now on CRAN. The key changes from version 0.3.0 are:

    • election results by voting place are now available back to 2002 (was just 2014)
    • polling place locations, presented as consistent values of latitude and longitude, now available back to 2008 (was just 2014)
    • voting intention polls are now complete up to the 2017 general election (previously stopped about six months ago on CRAN, although the GitHub version was always kept up to date)
    • a few minor bug fixes, e.g. allocate_seats now takes integer arguments, not just numeric (a brief usage sketch follows below).
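    A minimal sketch of that allocate_seats point, with made-up party names and vote counts; the votes and parties argument names are my recollection of the package documentation, so check ?allocate_seats before relying on them:

    library(nzelect)

    # hypothetical vote counts, supplied as integers (now accepted as well as numerics)
    votes <- c(PartyA = 500000L, PartyB = 400000L, PartyC = 100000L)
    allocate_seats(votes = votes, parties = names(votes))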

    The definitive source of New Zealand election statistics is the Electoral Commission. If there are any discrepancies between their results and those in nzelect, it's a bug, so please file an issue on GitHub. The voting intention polls come from Wikipedia.

    Example – special and early votes

    Currently, while we wait for the counting of the “special” votes from the 2017 election, there's renewed interest in the differences between special votes and those counted on election night. Special votes are those cast by anyone voting outside their electorate, by people who enrolled in the month before the election, or by people in one of a few other categories. Mostly, these are people who are on the move or who have only recently enrolled, and such people are obviously different from the run-of-the-mill voter.

    Here’s a graphic created with the nzelect R package that shows how the “special votes” in the past have been disproportionately important for Greens and Labour:

    Here’s the code to create that graphic:

    # get the latest version from CRAN:
    install.packages("nzelect")

    library(nzelect)
    library(tidyverse)
    library(ggplot2)
    library(scales)
    library(ggrepel)
    library(forcats)

    # palette of colours for the next couple of charts:
    palette <- c(parties_v, Other = "pink2", `Informal Votes` = "grey")

    # special votes:
    nzge %>%
      filter(voting_type == "Party") %>%
      mutate(party = fct_lump(party, 5)) %>%
      mutate(dummy = grepl("special", voting_place, ignore.case = TRUE)) %>%
      group_by(electorate, party, election_year) %>%
      summarise(prop_before = sum(votes[dummy]) / sum(votes),
                total_votes = sum(votes)) %>%
      ungroup() %>%
      mutate(party = gsub(" Party", "", party),
             party = gsub("ACT New Zealand", "ACT", party),
             party = gsub("New Zealand First", "NZ First", party)) %>%
      mutate(party = fct_reorder(party, prop_before)) %>%
      ggplot(aes(x = prop_before, y = party, size = total_votes, colour = party)) +
      facet_wrap(~election_year) +
      geom_point(alpha = 0.1) +
      ggtitle("'Special' votes proportion by party, electorate and year",
              "Each point represents the proportion of a party's vote in each electorate that came from special votes") +
      labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package",
           y = "") +
      scale_size_area("Total party votes", label = comma) +
      scale_x_continuous("\nPercentage of party's votes that were 'special'", label = percent) +
      scale_colour_manual(values = palette, guide = FALSE)

    Special votes are sometimes confused with advance voting in general. While many special votes are advance votes, the relationship is far from one to one. We see this particularly acutely by comparing the previous graphic to one that is identical except that it identifies all advance votes (those with the phrase “BEFORE” in the Electoral Commission’s description of polling place):

    While New Zealand First are the party that gains least proportionately from special votes, they gain the most from advance votes, although the difference between parties is fairly marginal. New Zealand First voters are noticeably more likely to be in an older age bracket than the voters for other parties. My speculation on their disproportionate share of advance voting is that it is related to that, although I’m not an expert in that area and am interested in alternative views.

    This second graphic also shows nicely just how much advance voting is becoming a feature of the electoral landscape. Unlike the proportion of votes that are “special”, which has been fairly stable, the proportion of votes that are cast in advance has increased very substantially over the past decade, and increased further in the 2017 election (for which final results come out on Saturday).

    Here’s the code for the second graphic; it’s basically the same as the previous chunk of code, except filtering on a different character string in the voting place name:

    nzge %>%
      filter(voting_type == "Party") %>%
      mutate(party = fct_lump(party, 5)) %>%
      mutate(dummy = grepl("before", voting_place, ignore.case = TRUE)) %>%
      group_by(electorate, party, election_year) %>%
      summarise(prop_before = sum(votes[dummy]) / sum(votes),
                total_votes = sum(votes)) %>%
      ungroup() %>%
      mutate(party = gsub(" Party", "", party),
             party = gsub("ACT New Zealand", "ACT", party),
             party = gsub("New Zealand First", "NZ First", party)) %>%
      mutate(party = fct_reorder(party, prop_before)) %>%
      ggplot(aes(x = prop_before, y = party, size = total_votes, colour = party)) +
      facet_wrap(~election_year) +
      geom_point(alpha = 0.1) +
      ggtitle("'Before' votes proportion by party, electorate and year",
              "Each point represents the proportion of a party's vote in each electorate that were cast before election day") +
      labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package",
           y = "") +
      scale_size_area("Total party votes", label = comma) +
      scale_x_continuous("\nPercentage of party's votes that were before election day", label = percent) +
      scale_colour_manual(values = palette, guide = FALSE)

    ‘Party’ compared to ‘Candidate’ vote

    Looking for something else to showcase, I thought it might be interesting to pool all five elections for which I have the detailed results and compare the party vote (ie the proportional representation choice out of the two votes New Zealanders get) to the candidate vote (ie the representative member choice). Here’s a graphic that does just that:

    We see that New Zealand First and the Greens are the two parties that are most noticeably above the diagonal line indicating equality between party and candidate votes. This isn’t a surprise – these are minority parties that appeal to (different) demographic and issues-based communities that are dispersed across the country, and generally have little chance of winning individual electorates. Hence the practice of voters is often to split their votes. This is all perfectly fine and is exactly how mixed-member proportional voting systems are meant to work.

    Here’s the code that produced the scatter plot:

    nzge %>%
      group_by(voting_type, party) %>%
      summarise(votes = sum(votes)) %>%
      spread(voting_type, votes) %>%
      ggplot(aes(x = Candidate, y = Party, label = party)) +
      geom_abline(intercept = 0, slope = 1, colour = "grey50") +
      geom_point() +
      geom_text_repel(colour = "steelblue") +
      scale_x_log10("Total 'candidate' votes", label = comma,
                    breaks = c(1, 10, 100, 1000) * 1000) +
      scale_y_log10("Total 'party' votes", label = comma,
                    breaks = c(1, 10, 100, 1000) * 1000) +
      ggtitle("Lots of political parties: total votes from 2002 to 2014",
              "New Zealand general elections") +
      labs(caption = "Source: www.electionresults.govt.nz, collated in the nzelect R package") +
      coord_equal()

    What next?

    Obviously, the next big thing for nzelect is to get the 2017 results in once they are announced on Saturday. I should be able to do this for the GitHub version by early next week. I would have delayed the CRAN release until the 2017 results were available, but unfortunately I had a bug in some of the examples in my help files that stopped them working after 23 September 2017, so I had to rush a fix, along with the latest enhancements, to CRAN to avoid problems with the CRAN maintainers (whom I fully endorse and thank, by the way).

    My other plans for nzelect over the next months to a year include:

    • reliable point locations for voting places back to 2002
    • identify consistent/duplicate voting places over time to make it easier to analyse comparative change by micro location
    • add detailed election results for the 1999 and 1996 elections (these are saved under a different naming convention to those from 2002 onwards, which is why they need a bit more work)
    • add high-level election results for elections prior to 1994

    The source code for cleaning the election data and packaging it into nzelect is on GitHub. The package itself is on CRAN and installable in the usual way.


    RProtoBuf 0.4.11

    Wed, 10/04/2017 - 02:25

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    RProtoBuf provides R bindings for the Google Protocol Buffers ("ProtoBuf") data encoding and serialization library used and released by Google, and deployed fairly widely in numerous projects as a language and operating-system agnostic protocol.
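    As a minimal usage sketch of what the bindings look like, based on my recollection of the package's introductory vignette (the tutorial.Person message type comes from the addressbook.proto file bundled with the package):

    library(RProtoBuf)

    # create a message of the bundled tutorial.Person type
    p <- new(tutorial.Person, name = "Dirk", id = 1)
    p$name                      # access a field
    msg <- p$serialize(NULL)    # serialize to a raw vector
    tutorial.Person$read(msg)   # parse it back into a message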

    A new release, RProtoBuf 0.4.11, appeared on CRAN earlier today. Not unlike the other recent releases, it is mostly a maintenance release which switches two of the vignettes over to using the pinp package and its template for vignettes.

    Changes in RProtoBuf version 0.4.11 (2017-10-03)
    • The RProtoBuf-intro and RProtoBuf-quickref vignettes were converted to Rmarkdown using the templates and style file from the pinp package.

    • A few minor internal upgrades

    CRANberries also provides a diff to the previous release. The RProtoBuf page has copies of the (older) package vignette, the ‘quick’ overview vignette, a unit test summary vignette, and the pre-print for the JSS paper. Questions, comments etc should go to the GitHub issue tracker off the GitHub repo.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    Create Powerpoint presentations from R with the OfficeR package

    Tue, 10/03/2017 - 23:30

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    For many of us data scientists, whatever the tools we use to conduct research or perform an analysis, our superiors are going to want the results as a Microsoft Office document. Most likely it's a Word document or a PowerPoint presentation, and it probably has to follow the corporate branding guidelines to boot. The OfficeR package, by David Gohel, addresses this problem by allowing you to take a Word or PowerPoint template and programmatically insert text, tables and charts generated by R into the template to create a complete document. (The OfficeR package also represents a leap forward from the similar ReporteRs package: it's faster, and no longer has a dependency on a Java installation.)

    At his blog, Len Kiefer takes the OfficeR package through its paces, demonstrating how to create a PowerPoint deck using R. The process is pretty simple:

    • Create a template PowerPoint presentation to host the slides. You can use the Slide Master mode in PowerPoint to customize the style of the slides you will create using R, and you can use all of PowerPoint's features to control layout, select fonts and colors, and include images like logos. 
    • For each slide you wish to create, you can either reference a template slide included in your base presentation, or add a new slide based on one of your custom layouts.
    • For each slide, you will reference one or more placeholders (regions where content goes) in the layout using their unique names. You can then use R functions to fill them with text, hyperlinks, tables or images. The formatting of each will be controlled by the formatting you specified within PowerPoint. (A minimal code sketch follows this list.)
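    Here is a minimal sketch of that workflow with the package (on CRAN as officer), assuming the current API; the template file name and the layout/master names are placeholders you would swap for the ones in your own deck:

    library(officer)

    # start from a branded template deck (placeholder file name)
    doc <- read_pptx("corporate_template.pptx")

    # add a slide based on one of the template's layouts (placeholder names)
    doc <- add_slide(doc, layout = "Title and Content", master = "Office Theme")

    # fill placeholders by type; ph_with() also accepts data frames and ggplot objects
    doc <- ph_with(doc, value = "Housing inventory update",
                   location = ph_location_type(type = "title"))
    doc <- ph_with(doc, value = head(mtcars[, 1:4]),
                   location = ph_location_type(type = "body"))

    print(doc, target = "housing_inventory.pptx")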

    Len used this process to convert the content of a blog post on housing inventory into the PowerPoint deck you see embedded below.

    You can find details of how to create Microsoft Office documents in the OfficeR vignette on PowerPoint documents, and there's a similar guide for Word documents, too. Or just take a look at the blog post linked above for a worked example.

    Len Kiefer: Crafting a PowerPoint Presentation with R


    R live class – Statistics for Data Science

    Tue, 10/03/2017 - 17:52

    (This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

     

    Statistics for Data Science is our second course of the autumn term. It takes place on October 11-12 in Milano (Lima).

    In this two-day course you will learn how to develop a wide variety of linear and generalized linear models with R. The course follows a step-by-step approach: starting from the simplest linear regression we will add complexity to the model up to the most sophisticated GLM, looking at the statistical theory along with its R implementation. Supported by plenty of examples, the course will give you a wide overview of the R capabilities to model and to make predictions.

    Statistics for Data Science Outlines
    • t-test, ANOVA
    • Linear, Polynomial and Multiple Regression
    • More Complex Linear Models
    • Generalized Linear Models
    • Logistic Regression
    • Poisson Dep. Var. Regression
    • Gamma Dep. Var. Regression
    • Check of models assumptions
    • Brief outlines of GAM, Mixed Models, Neural Networks, Tree-based Modelling

     

    Statistics for Data Science is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

    The course is for max 6 attendees.

    Location

    The course location is 550 m (a 7-minute walk) from Milano Centrale station and just 77 m (a 1-minute walk) from Lima subway station.

    Registration

    If you want to reserve a seat, go to: FAQ, detailed program and tickets.

    Other R courses | Autumn term

    You can find an overview of all our courses here. Next dates will be:

    • October 25-26: Machine Learning with R. Find patterns in large data sets using the R tools for Dimensionality Reduction, Clustering, Classification and Prediction. Reserve now!
    • November 7-8: Data Visualization and Dashboard with R. Show the story behind your data: create beautiful effective visualizations and interactive Shiny dashboards. Reserve now!
    • November 21-22: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
    • November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!

    In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange together a tailor-made course, picking all the topics that are interesting for your organization and dropping the rest.

    The post R live class – Statistics for Data Science appeared first on Quantide – R training & consulting.


    Dashboard Design: 8 Types of Online Dashboards

    Tue, 10/03/2017 - 13:31

    (This article was first published on R – Displayr, and kindly contributed to R-bloggers)

    What type of online dashboard will work best for your data? This post reviews eight types of online dashboards to assist you in choosing the right approach for your next dashboard. Note that there may well be more than eight types of dashboards; I am sure I will miss a few. If so, please tell me in the comments section of this post.

    KPI Online Dashboards

    The classic dashboards are designed to report key performance indicators (KPIs). Think of the dashboard of a car or the cockpit of an airplane. The KPI dashboard is all about dials and numbers. Typically, these dashboards are live and show the latest numbers. In a business context, they typically show trend data as well.

    A very simple example of a KPI dashboard is below. Such dashboards can, of course, be huge, with lots of pages crammed with numbers and charts covering all manner of operational and strategic data.

    Click the image for an interactive version

    Geographic Online Dashboards

    The most attractive dashboards are often geographic. The example below was created by Iaroslava Mizai in Tableau. Due to people being inspired by such dashboards, I imagine that a lot of money has been spent on Tableau licenses.

    While visually attractive, such dashboards tend to make up a tiny proportion of the dashboards in widespread use. Outside of sales, geography, and demography, few people spend much time exploring geographic data.

    Click the image for an interactive version

    Catalog Online Dashboards

    A catalog online dashboard is based around a menu. The viewer can select the results they are interested in from that menu. It is a much more general type of dashboard, used for displaying data rather than geography, and you can use any variable to cut the data. For example, the Catalog Dashboard below gives the viewer a choice of country to investigate.

    Click the image for an interactive version

    The dashboard below has the same basic idea, except the user navigates by clicking the control box on the right-side of the heading. In this example of a brand health dashboard, the control box is currently set to IGA (but you could click on it to change it to another supermarket).

     

    Click the image for an interactive version

    The PowerPoint Alternative Dashboard

    A story dashboard consists of a series of pages specifically ordered for the reader. This type of online dashboard is used as a powerful alternative to PowerPoint, with the additional benefits of being interactive, updatable and live. Typically, a user navigates through such a dashboard either with navigation buttons (forward and backward) or with the navigation bar on the left, as shown in the online dashboard example below.

    Drill-Down Online Dashboards

    A drill-down is an online dashboard (or dashboard control) where the viewer can “drill” into the data to get more information. The whole dashboard is organized in a hierarchical fashion.

    There are five common ways of facilitating drill-downs in dashboards: zoom, graphical filtering, control-based drill-downs, filtered drill-downs, and hyperlink drill-downs (landing pages). The choice of which to use is partly technological and partly related to the structure of the data.

    1. Zoom

    Zooming is perhaps the most widely used technique for permitting users to drill-down. The user can typically achieve the zoom via mouse, touch, + buttons, and draggers. For example, the earlier Microsoft KPI dashboard permitted the viewer to change the time series window by dragging on the bottom of each chart.

    While zooming is the most aesthetically pleasing way of drilling into data, it is also the least general. This approach to dashboarding only works when there is a strong and obvious ordering of the data. This is typically only the case with geographic and time series data, although sometimes data is forced into a hierarchy to make zooming possible. This is the case in the Zooming example below, which shows survival rates for the Titanic (double-click on it to zoom).

    Unless you are writing everything from scratch, the ability to add zoom to a dashboard will depend on the components being used (i.e. whether the components support zoom).

    Click the image for an interactive version

    2. Graphical filtering

    Graphical filtering allows the user to explore data by clicking on graphical elements of the dashboard. For example, in this QLik dashboard, I clicked on the Ontario pie slice (on the right of the screen) and all the other elements on the page automatically updated to show data relating to Ontario.

    Graphical filtering is cool. However, it requires both highly structured data and quite a bit of time spent figuring out how to design and implement the user interface. Dashboards with graphical filtering are also the most challenging to build. The most amazing examples tend to be bespoke websites created by data journalists (e.g., http://www.poppyfield.org/). The most straightforward way of creating dashboards with graphical filtering tends to be using business intelligence tools, like Qlik and Tableau. Typically, there is a lot of effort required to structure the data up front; you then get the graphical filtering “for free”. If you are more the DIY type, wanting to build your own dashboards and pay nothing, RStudio's Shiny is probably the most straightforward option.

    Click the image for an interactive version

    3. Control-based drill-downs

    A quicker and easier way of implementing drill-downs is to give the user controls that they can use to select data. From a user interface perspective, the appearance is essentially the same as with the Supermarket Brand Health dashboard (example a few dashboards above). Here, a user chooses from the available options (or uses sliders, radio buttons, etc.).
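    As an illustration, here is a minimal sketch of a control-based drill-down using Shiny (one of the DIY tools mentioned earlier), where a drop-down control filters the data behind a chart:

    library(shiny)
    library(ggplot2)

    ui <- fluidPage(
      selectInput("cyl", "Cylinders", choices = sort(unique(mtcars$cyl))),
      plotOutput("mpg_plot")
    )

    server <- function(input, output) {
      output$mpg_plot <- renderPlot({
        # drill down: show only the cars with the selected number of cylinders
        ggplot(subset(mtcars, cyl == input$cyl), aes(x = wt, y = mpg)) +
          geom_point()
      })
    }

    shinyApp(ui, server)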

    4. Filtered drill-downs

    When drilling down involves restricting the data to a subset of the observations (e.g., to a subset of respondents in a survey), users can zoom in using filtering tools. For example, you can filter the Supermarket Brand Health dashboard by various demographic groups. While using filters to zoom in is the least sexy of the ways of permitting users to drill into data, it is usually the most straightforward to implement. Furthermore, it is also a lot more general than any of the other styles of drill-downs considered so far. For example, the picture below illustrates drilling into the data of women aged 35 or more (using the Filters drop-down menu in the top right corner).

    Click the image for an interactive version

    5. Hyperlink drill-downs

    The most general approach for creating drill-downs is to link together multiple pages with hyperlinks. While all of the other approaches involve some aspect of filtering, hyperlinks enable the user to drill into qualitatively different data. Typically, there is a landing page that contains a summary of key data, and the user clicks on the data of interest to drill down and get more information. In the example of a hyperlinked dashboard below, the landing page shows the performance of different departments in a supermarket. The viewer clicks on the result for a department (e.g., CHECK OUT), which takes them to a screen showing more detailed results.

    Click the image for an interactive version

     

    Interactive Infographic Dashboard

    Infographic dashboards present viewers with a series of closely related charts, text, and images. Here is an example of an interactive infographic on Gamers, where the user can change the country at the top and the dashboard automatically updates.

    Click the image for an interactive version

    Visual Confections Dashboard

    A visual confection is an online dashboard that layers multiple visual elements, whereas an infographic is a series of related visualizations. The dashboard below overlays time series information with exercise and diet information.

    Click the image for an interactive version

    Simulator Dashboards

    The final type of dashboard that I can think of is a simulator. The simulator dashboard example below is from a latent class logit choice model of the egg market. The user can select different properties for each of the competitors and the dashboard predicts market share.

    Click the image for an interactive version

     Create your own Online Dashboards

    I have mentioned a few specific apps for creating online dashboards, including Tableau, QLik, and Shiny. All the other online dashboards in this post used R from within Displayr (you can even just use Displayr to see the underlying R code for each online dashboard). To explore or replicate the Displayr dashboards, just follow the links below for Edit mode for each respective dashboard, and then click on each of the visual elements.

    Microsoft KPI

    Overview: A one-page dashboard showing stock price and Google Trends data for Microsoft.
    Interesting features: Automatically updated every 24 hours, pulling in data from Yahoo Finance and Google Trends.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the dashboard.

     

    Europe and Immigration

    Overview: Attitudes of Europeans to Immigration
    Interesting features: Based on 213,308 survey responses collected over 13 years. Custom navigation via images and hyperlinks.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboard.

     

    Supermarket Brand Health

    Overview: Usage and attitudes towards supermarkets
    Interesting features: Uses a control (combo box) to update the calculations for the chosen supermarket brand.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboard.

     

    Supermarket Department NPS

    Overview: Performance by department of supermarkets.
    Interesting features: Color-coding of circles based on underlying data (they change when the data is filtered using the Filters menu in the top right). Custom navigation, whereby the user clicks on the circle for a department and gets more information about that department.
    Edit mode: Click here to see the dashboard.
    View mode: Click here to see the underlying document.

     

    Blood Glucose Confection

    Overview: Blood glucose measurements and food diary.
    Interesting features: The fully automated underlying charts that integrate data from a wearable blood glucose implant and a food diary. See Layered Data Visualizations Using R, Plotly, and Displayr for more about this dashboard.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboards.

     

    Interactive infographic

    Overview: An infographic that updates based on the viewer’s selection of country.
    Interesting features: Based on an infographic created in Canva. The data is pasted in from a spreadsheet  (i.e., no hookup to a database).
    Edit mode: Click here to see the dashboard.
    View mode: Click here to see the underlying document.

     

     

    Presidential MaxDiff

    Overview: A story-style dashboard showing an analysis of what Americans desire in their Commander-in-Chief.
    Interesting features: A revised data file can be used to automatically update the visualizations, text, and the underlying analysis (a MaxDiff model); i.e., it is an automated report.
    Edit mode: Click here to see the underlying document.
    View mode: Click here to see the online dashboards.

     

    Choice Simulator

    Overview: A decision-support system
    Interesting features: The simulator is hooked up directly to an underlying latent class model. See How to Create an Online Choice Simulator for more about this dashboard.
    Edit mode: Click here to see the dashboard.
    View mode: Click here to see the underlying document.

     

     


    googleLanguageR – Analysing language through the Google Cloud Machine Learning APIs

    Tue, 10/03/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    One of the greatest assets human beings possess is the power of speech and language, from which almost all our other accomplishments flow. To be able to analyse communication offers us a chance to gain a greater understanding of one another.

    To help you with this, googleLanguageR is an R package that allows you to perform speech-to-text transcription, neural net translation and natural language processing via the Google Cloud machine learning services.

    An introduction to the package is below, but you can find out more details at the googleLanguageR website.

    Google's bet

    Google predicts that machine learning is to be a fundamental feature of business, and so they are looking to become the infrastructure that makes machine learning possible. Metaphorically speaking: If machine learning is electricity, then Google wants to be the pylons carrying it around the country.

    Google may not be the only company with such ambitions, but one advantage Google has is the amount of data it possesses. Twenty years of web crawling has given it an unprecedented corpus to train its models. In addition, its recent moves into voice and video give it one of the biggest audio and speech datasets, all of which have been used to help create machine learning applications within its products such as search and Gmail. Further investment in machine learning is shown by Google's purchase of Deepmind, a UK-based A.I. research firm that was recently in the news for defeating the top Go champion with its neural-network-trained Go bot. Google has also taken an open-source route with the creation and publication of Tensorflow, a leading machine learning framework.

    Whilst you can create your own machine learning models, for those users who lack the expertise, data or time to do so, Google also offers an increasing range of pre-trained machine learning APIs, such as image and video recognition or job search. googleLanguageR wraps the subset of those machine learning APIs that are language flavoured – Cloud Speech, Translation and Natural Language.

    Since they carry complementary outputs that can be used in each other's inputs, all three of the APIs are included in one package. For example, you can transcribe a recording of someone speaking in Danish, translate that to English, identify how positive or negative the speaker felt about its content (sentiment analysis), and then identify the most important concepts and objects within the content (entity analysis).

    Motivations

    Fake news

    One reason why I started looking at this area was the growth of 'fake news' and its effect on political discourse on social media. I wondered if there was some way to put metrics on how much a news story fuels one's own bias within one's own filter bubble. The entity API provides a way to perform entity and sentiment analysis at scale on tweets; by then comparing the preferences of different users and news sources, the hope is to be able to judge how much they are in agreement with your own biases, views and trusted reputation sources.

    Make your own Alexa

    Another motivating application is the growth of voice commands, which will become the primary way we interact with technology. Already, Google reports that up to 20% of searches in its app are made via voice. I'd like to be able to say "R, print me out that report for client X". A Shiny app that records your voice, uploads it to the API, then parses the returned text into actions gives you a chance to create your very own Alexa-like infrastructure.

    The voice activated internet connected speaker, Amazon's Alexa – image from www.amazon.co.uk

    Translate everything

    Finally, I live and work in Denmark. As Danish is spoken by fewer than 6 million people, applications that work in English may not be available in Danish very quickly, if at all. The API's translation service is the one that made the news in 2016 for "inventing its own language"; it offers much better English-to-Danish translations than the free web version and may make services available in Denmark sooner.

    Using the library

    To use these APIs within R, you first need to do a one-time setup to create a Google Project, add a credit card and authenticate; this is detailed on the package website.

    After that, you feed in the R objects you want to operate upon. The rOpenSci review helped to ensure that this can scale up easily, so that you can feed in large character vectors which the library will parse and rate limit as required. The functions also work within tidyverse pipe syntax.

    Speech-to-text

    The Cloud Speech API is exposed via the gl_speech function.

    It supports multiple audio formats and languages, and you can either feed it a sub-60-second audio file directly or perform asynchronous requests for longer audio files.

    Example code:

    library(googleLanguageR)

    my_audio <- "my_audio_file.wav"
    gl_speech(my_audio)
    # A tibble: 1 x 3
    #  transcript confidence words
    #*
    #1 Hello Mum   0.9227779

    Translation

    The Cloud Translation API lets you translate text via gl_translate

    As you are charged per character, one tip here if you are working with lots of different languages is to perform detection of language offline first using another rOpenSci package, cld2. That way you can avoid charges for text that is already in your target language i.e. English.

    library(googleLanguageR)
    library(cld2)
    library(purrr)

    my_text <- c("Katten sidder på måtten", "The cat sat on the mat")

    ## offline detect language via cld2
    detected <- map_chr(my_text, detect_language)
    # [1] "DANISH"  "ENGLISH"

    ## get non-English text
    translate_me <- my_text[detected != "ENGLISH"]

    ## translate
    gl_translate(translate_me)
    ## A tibble: 1 x 3
    #  translatedText                detectedSourceLanguage text
    #*
    #1 The cat is sitting on the mat da                     Katten sidder på måtten

    Natural Language Processing

    The Natural Language API reveals the structure and meaning of text, accessible via the gl_nlp function.

    It returns several analyses:

    • Entity analysis – finds named entities (currently proper names and common nouns) in the text along with entity types, salience, mentions for each entity, and other properties. If possible, will also return metadata about that entity such as a Wikipedia URL.
    • Syntax – analyzes the syntax of the text and provides sentence boundaries and tokenization along with part of speech tags, dependency trees, and other properties.
    • Sentiment – the overall sentiment of the text, represented by a magnitude [0, +inf] and score between -1.0 (negative sentiment) and 1.0 (positive sentiment)

    These are all useful for getting an understanding of the meaning of a sentence, and this API has potentially the greatest number of applications of the three featured. With entity analysis, auto-categorisation of text is possible; the syntax returns let you pull out nouns and verbs for parsing into other actions; and the sentiment analysis allows you to get a feeling for the emotion within text.
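    As a small hedged sketch of that middle point, using the nlp object created in the demonstration further below (whose tokens element is a data frame with content and tag columns, as the str() output shows), the nouns and verbs can be pulled out like this:

    tokens <- nlp$tokens[[1]]

    # keep just the words the API tagged as nouns or verbs
    tokens$content[tokens$tag %in% c("NOUN", "VERB")]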

    A demonstration is below which gives an idea of what output you can generate:

    library(googleLanguageR)

    quote <- "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
    nlp <- gl_nlp(quote)

    str(nlp)
    #List of 6
    # $ sentences :List of 1
    # ..$ :'data.frame': 1 obs. of 4 variables:
    # .. ..$ content : chr "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
    # .. ..$ beginOffset: int 0
    # .. ..$ magnitude : num 0.6
    # .. ..$ score : num -0.6
    # $ tokens :List of 1
    # ..$ :'data.frame': 20 obs. of 17 variables:
    # .. ..$ content : chr [1:20] "Two" "things" "are" "infinite" ...
    # .. ..$ beginOffset : int [1:20] 0 4 11 15 23 25 29 38 42 48 ...
    # .. ..$ tag : chr [1:20] "NUM" "NOUN" "VERB" "ADJ" ...
    # .. ..$ aspect : chr [1:20] "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" "ASPECT_UNKNOWN" ...
    # .. ..$ case : chr [1:20] "CASE_UNKNOWN" "CASE_UNKNOWN" "CASE_UNKNOWN" "CASE_UNKNOWN" ...
    # .. ..$ form : chr [1:20] "FORM_UNKNOWN" "FORM_UNKNOWN" "FORM_UNKNOWN" "FORM_UNKNOWN" ...
    # .. ..$ gender : chr [1:20] "GENDER_UNKNOWN" "GENDER_UNKNOWN" "GENDER_UNKNOWN" "GENDER_UNKNOWN" ...
    # .. ..$ mood : chr [1:20] "MOOD_UNKNOWN" "MOOD_UNKNOWN" "INDICATIVE" "MOOD_UNKNOWN" ...
    # .. ..$ number : chr [1:20] "NUMBER_UNKNOWN" "PLURAL" "NUMBER_UNKNOWN" "NUMBER_UNKNOWN" ...
    # .. ..$ person : chr [1:20] "PERSON_UNKNOWN" "PERSON_UNKNOWN" "PERSON_UNKNOWN" "PERSON_UNKNOWN" ...
    # .. ..$ proper : chr [1:20] "PROPER_UNKNOWN" "PROPER_UNKNOWN" "PROPER_UNKNOWN" "PROPER_UNKNOWN" ...
    # .. ..$ reciprocity : chr [1:20] "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" "RECIPROCITY_UNKNOWN" ...
    # .. ..$ tense : chr [1:20] "TENSE_UNKNOWN" "TENSE_UNKNOWN" "PRESENT" "TENSE_UNKNOWN" ...
    # .. ..$ voice : chr [1:20] "VOICE_UNKNOWN" "VOICE_UNKNOWN" "VOICE_UNKNOWN" "VOICE_UNKNOWN" ...
    # .. ..$ headTokenIndex: int [1:20] 1 2 2 2 2 6 2 6 9 6 ...
    # .. ..$ label : chr [1:20] "NUM" "NSUBJ" "ROOT" "ACOMP" ...
    # .. ..$ value : chr [1:20] "Two" "thing" "be" "infinite" ...
    # $ entities :List of 1
    # ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6 obs. of 9 variables:
    # .. ..$ name : chr [1:6] "human stupidity" "things" "universe" "universe" ...
    # .. ..$ type : chr [1:6] "OTHER" "OTHER" "OTHER" "OTHER" ...
    # .. ..$ salience : num [1:6] 0.1662 0.4771 0.2652 0.2652 0.0915 ...
    # .. ..$ mid : Factor w/ 0 levels: NA NA NA NA NA NA
    # .. ..$ wikipedia_url: Factor w/ 0 levels: NA NA NA NA NA NA
    # .. ..$ magnitude : num [1:6] NA NA NA NA NA NA
    # .. ..$ score : num [1:6] NA NA NA NA NA NA
    # .. ..$ beginOffset : int [1:6] 42 4 29 86 29 86
    # .. ..$ mention_type : chr [1:6] "COMMON" "COMMON" "COMMON" "COMMON" ...
    # $ language : chr "en"
    # $ text : chr "Two things are infinite: the universe and human stupidity; and I'm not sure about the universe."
    # $ documentSentiment:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 2 variables:
    # ..$ magnitude: num 0.6
    # ..$ score : num -0.6

    Acknowledgements

    This package is 10 times better due to the efforts of the rOpenSci reviewers Neal Richardson and Julia Gustavsen, who have whipped the documentation, outputs and test cases into the form they are today in 0.1.0. Many thanks to them.

    Hopefully, this is just the beginning and the package can be further improved by its users – if you do give the package a try and find a potential improvement, raise an issue on GitHub and we can try to implement it. I'm excited to see what users can do with these powerful tools.


    Mapping ecosystems of software development

    Tue, 10/03/2017 - 02:00

    (This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

    I have a new post on the Stack Overflow blog today about the complex, interrelated ecosystems of software development. On the data team at Stack Overflow, we spend a lot of time and energy thinking about tech ecosystems and how technologies are related to each other. One way to get at this idea of relationships between technologies is tag correlations: how often technology tags at Stack Overflow appear together relative to how often they appear separately. One place we see developers using tags at Stack Overflow is on their Developer Stories. If we are interested in how technologies are connected and how they are used together, developers' own descriptions of their work and careers are a great place to get that.

    I released the data for this network structure as a dataset on Kaggle so you can explore it for yourself! For example, the post for Stack Overflow includes an interactive visualization created using the networkD3 package, but we can create other kinds of visualizations using the ggraph package. Either way, trusty igraph comes into play.

    library(readr)
    library(igraph)
    library(ggraph)

    stack_network <- graph_from_data_frame(read_csv("stack_network_links.csv"),
                                           vertices = read_csv("stack_network_nodes.csv"))

    set.seed(2017)
    ggraph(stack_network, layout = "fr") +
      geom_edge_link(alpha = 0.2, aes(width = value)) +
      geom_node_point(aes(color = as.factor(group), size = 10 * nodesize)) +
      geom_node_text(aes(label = name), family = "RobotoCondensed-Regular", repel = TRUE) +
      theme_graph(base_family = "RobotoCondensed-Regular") +
      theme(plot.title = element_text(family = "Roboto-Bold"),
            legend.position = "none") +
      labs(title = "Stack Overflow Tag Network",
           subtitle = "Tags correlated on Developer Stories")
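    For an interactive version along the lines of the one in the Stack Overflow post, here is a hedged sketch with networkD3; the CSV column names (source, target, value, name, group, nodesize) are assumed from the files used above, and forceNetwork() wants zero-based integer indices rather than tag names:

    library(readr)
    library(networkD3)

    nodes <- as.data.frame(read_csv("stack_network_nodes.csv"))
    links <- as.data.frame(read_csv("stack_network_links.csv"))

    # convert tag names to zero-based row indices into the nodes data frame
    links$source_id <- match(links$source, nodes$name) - 1
    links$target_id <- match(links$target, nodes$name) - 1

    forceNetwork(Links = links, Nodes = nodes,
                 Source = "source_id", Target = "target_id", Value = "value",
                 NodeID = "name", Group = "group", Nodesize = "nodesize",
                 opacity = 0.9, zoom = TRUE)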

    We have explored these kinds of network structures using all kinds of data sources at Stack Overflow, from Q&A to traffic, and although we see similar relationships across all of them, we really like Developer Stories as a data source for this particular question. Let me know if you have any comments or questions!


    Comparing assault death rates in the US to other advanced democracies

    Mon, 10/02/2017 - 23:28

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)

    In an effort to provide context to the frequent mass shootings in the United States, Kieran Healy (Associate Professor of Sociology at Duke University) created this updated chart comparing assault death rates in the US to that of 23 other advanced democracies. The chart shows the rate (per 100,000 citizens) of death caused by assaults (stabbings, gunshots, etc. by a third party). Assaults are used rather than gun deaths specifically, as that's the only statistic for which readily comparable data is available. The data come from the OECD Health Status Database through 2015, the most recent complete year available.

    The goal of this chart is to "set the U.S. in some kind of longitudinal context with broadly comparable countries", and to that end OECD countries Estonia and Mexico are not included. (Estonia suffered a spike of violence in the mid-90's, and Mexico has been embroiled in drug violence for decades. See the chart with Estonia and Mexico included here.) Healy provides a helpful FAQ justifying this decision and other issues related to the data and their presentation.

    Healy used the R language (and, specifically the ggplot2 graphics package) to create this chart, and the source code is available on Github.

    For more context around this chart follow the link below, and also his prior renderings and commentary related to the same data through 2013 and through 2010.

    Kieran Healy: Assault deaths to 2015


    Processing Rmarkdown documents with Eclipse and StatET

    Mon, 10/02/2017 - 23:06

    (This article was first published on R-project – lukemiller.org, and kindly contributed to R-bloggers)

    Processing R markdown (Rmd) documents with Eclipse/StatET external tools requires a different setup than processing ‘regular’ knitr documents (Rnw). I was having problems getting the whole rmarkdown -> pandoc workflow working on Eclipse, but the following fix seems to have resolved it, and I can generate Word or HTML documents from a single .Rmd file with a YAML header (see image below).

    A .Rmd document with YAML header set to produce a Microsoft Word .docx output file.

    For starters, I open the Run > External Tools > External Tools Configurations window. The Rmarkdown document is considered a type of Wikitext, so create a new Wikitext + R Document Processing entry. To do that, hit the little blank-page-with-a-plus icon found in the upper left corner of the window, above where it says “type filter text”.

    Using the Auto (YAML) single-step preset.

    Enter a name for this configuration in the Name field (where it says knitr_RMD_yaml in my image above). I then used the Load Preset/Example menu in the upper right to choose the “Auto (YAML) using RMarkdown, single-step” option. This processes your .Rmd document directly using the rmarkdown::render() function, skipping over a separate knitr::knit() step, but the output should look the same.

    Next go to the “2) Produce Output” tab (you are skipping over the “1) R-Weave” tab because you chose the 1-step process). By default the entry in the File field here was causing errors for me. The change is pictured below, so that the File entry reads "${file_name_base:${source_file_path}}.${out_file_ext}". This change allowed my setup to actually find the .Rmd and output .md files successfully, so that the .md document could then be passed on to pandoc.

Modify the File entry from the default so that it instead reads the same as what’s pictured here.

    This all assumes that you’ve previously downloaded and installed pandoc so that it can be found on the Windows system $PATH.
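If you prefer to check the pipeline outside Eclipse, the same single-step processing can be reproduced from an R console; the file name below is just a placeholder for your own document.

library(rmarkdown)
render("my_report.Rmd")                                   # uses the output format from the YAML header
render("my_report.Rmd", output_format = "word_document")  # or override the format explicitly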


    Marketing Multi-Channel Attribution model based on Sales Funnel with R

    Thu, 09/28/2017 - 06:00

    (This article was first published on R language – AnalyzeCore – data is beautiful, data is a story, and kindly contributed to R-bloggers)

    This is the last post in the series of articles about using Multi-Channel Attribution in marketing. In previous two articles (part 1 and part 2), we’ve reviewed a simple and powerful approach based on Markov chains that allows you to effectively attribute marketing channels.

    In this article, we will review another fascinating approach that marries heuristic and probabilistic methods. Again, the core idea is straightforward and effective.

Sales Funnel

Usually, companies have some kind of idea on how their clients move along the user journey from first visiting a website to closing a purchase. This sequence of steps is called a Sales (purchasing or conversion) Funnel. Classically, the Sales Funnel includes at least four steps:
    • Awareness – the customer becomes aware of the existence of a product or service (“I didn’t know there was an app for that”),
    • Interest – actively expressing an interest in a product group (“I like how your app does X”),
    • Desire – aspiring to a particular brand or product (“Think I might buy a yearly membership”),
    • Action – taking the next step towards purchasing the chosen product (“Where do I enter payment details?”).

    For an e-commerce site, we can come up with one or more conditions (events/actions) that serve as an evidence of passing each step of a Sales Funnel.

    For some extra information about Sales Funnel, you can take a look at my (rather ugly) approach of Sales Funnel visualization with R.

    Companies, naturally, lose some share of visitors on each following step of a Sales Funnel as it gets narrower. That’s why it looks like a string of bottlenecks. We can calculate a probability of transition from the previous step to the next one based on recorded history of transitions. On the other hand, customer journeys are sequences of sessions (visits) and these sessions are attributed to different marketing channels.

Therefore, we can link marketing channels with the probability of a customer passing through each step of the Sales Funnel. And here is the core idea of the concept: the probability of moving through each “bottleneck” represents the value of the marketing channel that leads a customer through it. The higher the probability of passing a “neck”, the lower the value of the channel that provided the transition; and vice versa, the lower the probability, the higher the value of the marketing channel in question.
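As a toy illustration of that idea (the numbers here are invented, not taken from the data below), suppose the step-to-step transition probabilities were 0.20, 0.25, 0.90 and 0.98; the step importances and their weighted versions would then be:

step_prob <- c(awareness = 0.20, interest = 0.25, desire = 0.90, action = 0.98)
step_importance <- 1 - step_prob                                    # harder "necks" are worth more
step_importance_weighted <- step_importance / sum(step_importance)  # rescale so the weights sum to 1
round(step_importance_weighted, 3)
# awareness  interest    desire    action
#     0.479     0.449     0.060     0.012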

    Let’s study the concept with the following example. First off, we’ll define the Sales Funnel and a set of conditions which will register as customer passing through each step of the Funnel.

    • 0 step (necessary condition) – customer visits a site for the first time
    • 1st step (awareness) – visits two site’s pages
    • 2nd step (interest) – reviews a product page
    • 3rd step (desire) –  adds a product to the shopping cart
    • 4th step (action) – completes purchase

    Second, we need to extract the data that includes sessions where corresponding events occurred. We’ll simulate this data with the following code:

library(tidyverse)
library(purrrlyr)
library(reshape2)

##### simulating the "real" data #####
set.seed(454)
df_raw <- data.frame(customer_id = paste0('id', sample(c(1:5000), replace = TRUE)),
                     date = as.POSIXct(rbeta(10000, 0.7, 10) * 10000000, origin = '2017-01-01', tz = "UTC"),
                     channel = paste0('channel_', sample(c(0:7), 10000, replace = TRUE,
                                                         prob = c(0.2, 0.12, 0.03, 0.07, 0.15, 0.25, 0.1, 0.08))),
                     site_visit = 1) %>%
    mutate(two_pages_visit = sample(c(0, 1), 10000, replace = TRUE, prob = c(0.8, 0.2)),
           product_page_visit = ifelse(two_pages_visit == 1,
                                       sample(c(0, 1), length(two_pages_visit[which(two_pages_visit == 1)]),
                                              replace = TRUE, prob = c(0.75, 0.25)),
                                       0),
           add_to_cart = ifelse(product_page_visit == 1,
                                sample(c(0, 1), length(product_page_visit[which(product_page_visit == 1)]),
                                       replace = TRUE, prob = c(0.1, 0.9)),
                                0),
           purchase = ifelse(add_to_cart == 1,
                             sample(c(0, 1), length(add_to_cart[which(add_to_cart == 1)]),
                                    replace = TRUE, prob = c(0.02, 0.98)),
                             0)) %>%
    dmap_at(c('customer_id', 'channel'), as.character) %>%
    arrange(date) %>%
    mutate(session_id = row_number()) %>%
    arrange(customer_id, session_id)

df_raw <- melt(df_raw, id.vars = c('customer_id', 'date', 'channel', 'session_id'),
               value.name = 'trigger', variable.name = 'event') %>%
    filter(trigger == 1) %>%
    select(-trigger) %>%
    arrange(customer_id, date)

    And the data sample looks like:

    Next up, the data needs to be preprocessed. For example, it would be useful to replace NA/direct channel with the previous one or separate first-time purchasers from current customers, or even create different Sales Funnels based on new and current customers, segments, locations and so on. I will omit this step but you can find some ideas on preprocessing in my previous blogpost.

    The important thing about this approach is that we only have to attribute the initial marketing channel, one that led the customer through their first step. For instance, a customer initially reviews a product page (step 2, interest) and is brought by channel_1. That means any future product page visits from other channels won’t be attributed until the customer makes a purchase and starts a new Sales Funnel journey.

    Therefore, we will filter records for each customer and save the first unique event of each step of the Sales Funnel using the following code:

### removing not first events ###
df_customers <- df_raw %>%
    group_by(customer_id, event) %>%
    filter(date == min(date)) %>%
    ungroup()

Note that in this way we assume that all customers were first-time buyers; therefore, every subsequent purchase event will be removed by the above code.

Now, we can use the obtained data frame to compute the Sales Funnel's transition probabilities, the importance of each Sales Funnel step, and their weighted importance. According to the method, the higher the probability, the lower the value of the channel. Therefore, we will calculate the importance of each step as 1 minus its transition probability. After that, we need to weight the importances because their sum will be greater than 1. We will do these calculations with the following code:

### Sales Funnel probabilities ###
sf_probs <- df_customers %>%
    group_by(event) %>%
    summarise(customers_on_step = n()) %>%
    ungroup() %>%
    mutate(sf_probs = round(customers_on_step / customers_on_step[event == 'site_visit'], 3),
           sf_probs_step = round(customers_on_step / lag(customers_on_step), 3),
           sf_probs_step = ifelse(is.na(sf_probs_step) == TRUE, 1, sf_probs_step),
           sf_importance = 1 - sf_probs_step,
           sf_importance_weighted = sf_importance / sum(sf_importance)
    )

    A hint: it can be a good idea to compute Sales Funnel probabilities looking at a limited prior period, for example, 1-3 months. The reason is that customers’ flow or “necks” capacities could vary due to changes on a company’s site or due to changes in marketing campaigns and so on. Therefore, you can analyze the dynamics of the Sales Funnel’s transition probabilities in order to find the appropriate time period.

    I can’t publish a blogpost without visualization. This time I suggest another approach for the Sales Funnel visualization that represents all customer journeys through the Sales Funnel with the following code:

### Sales Funnel visualization ###
df_customers_plot <- df_customers %>%
    group_by(event) %>%
    arrange(channel) %>%
    mutate(pl = row_number()) %>%
    ungroup() %>%
    mutate(pl_new = case_when(
        event == 'two_pages_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'two_pages_visit'])) / 2),
        event == 'product_page_visit' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'product_page_visit'])) / 2),
        event == 'add_to_cart' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'add_to_cart'])) / 2),
        event == 'purchase' ~ round((max(pl[event == 'site_visit']) - max(pl[event == 'purchase'])) / 2),
        TRUE ~ 0
    ),
    pl = pl + pl_new)

df_customers_plot$event <- factor(df_customers_plot$event,
                                  levels = c('purchase', 'add_to_cart', 'product_page_visit', 'two_pages_visit', 'site_visit'))

# color palette
cols <- c('#4e79a7', '#f28e2b', '#e15759', '#76b7b2', '#59a14f',
          '#edc948', '#b07aa1', '#ff9da7', '#9c755f', '#bab0ac')

ggplot(df_customers_plot, aes(x = event, y = pl)) +
    theme_minimal() +
    scale_colour_manual(values = cols) +
    coord_flip() +
    geom_line(aes(group = customer_id, color = as.factor(channel)), size = 0.05) +
    geom_text(data = sf_probs, aes(x = event, y = 1, label = paste0(sf_probs*100, '%')), size = 4, fontface = 'bold') +
    guides(color = guide_legend(override.aes = list(size = 2))) +
    theme(legend.position = 'bottom',
          legend.direction = "horizontal",
          panel.grid.major.x = element_blank(),
          panel.grid.minor = element_blank(),
          plot.title = element_text(size = 20, face = "bold", vjust = 2, color = 'black', lineheight = 0.8),
          axis.title.y = element_text(size = 16, face = "bold"),
          axis.title.x = element_blank(),
          axis.text.x = element_blank(),
          axis.text.y = element_text(size = 8, angle = 90, hjust = 0.5, vjust = 0.5, face = "plain")) +
    ggtitle("Sales Funnel visualization - all customers journeys")

    Ok, seems we now have everything to make final calculations. In the following code, we will remove all users that didn’t make a purchase. Then, we’ll link weighted importances of the Sales Funnel steps with sessions by event and, at last, summarize them.

### computing attribution ###
df_attrib <- df_customers %>%
    # removing customers without purchase
    group_by(customer_id) %>%
    filter(any(as.character(event) == 'purchase')) %>%
    ungroup() %>%
    # joining step's importances
    left_join(., sf_probs %>% select(event, sf_importance_weighted), by = 'event') %>%
    group_by(channel) %>%
    summarise(tot_attribution = sum(sf_importance_weighted)) %>%
    ungroup()

    As the result, we’ve obtained the number of conversions that have been distributed by marketing channels:

    In the same way you can distribute the revenue by channels.

    At the end of the article, I want to share OWOX company’s blog where you can read more about the approach: Funnel Based Attribution Model.

    In addition, you can find that OWOX provides an automated system for Marketing Multi-Channel Attribution based on BigQuery. Therefore, if you are not familiar with R or don’t have a suitable data warehouse, I can recommend you to test their service.


    The post Marketing Multi-Channel Attribution model based on Sales Funnel with R appeared first on AnalyzeCore – data is beautiful, data is a story.


    RcppZiggurat 0.1.4

    Thu, 09/28/2017 - 04:06

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A maintenance release of RcppZiggurat is now on the CRAN network for R. It switched the vignette to our new pinp package and its two-column pdf default.

    The RcppZiggurat package updates the code for the Ziggurat generator which provides very fast draws from a Normal distribution. The package provides a simple C++ wrapper class for the generator improving on the very basic macros, and permits comparison among several existing Ziggurat implementations. This can be seen in the figure where Ziggurat from this package dominates accessing the implementations from the GSL, QuantLib and Gretl—all of which are still way faster than the default Normal generator in R (which is of course of higher code complexity).
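If you just want to try the generator, a quick comparison along these lines should work (a small sketch assuming the package's zrnorm() sampler and the microbenchmark package; timings will vary by machine):

library(RcppZiggurat)
library(microbenchmark)

# draw 100,000 standard-normal variates with base R and with the Ziggurat generator
microbenchmark(base     = rnorm(1e5),
               ziggurat = zrnorm(1e5),
               times = 50)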

    The NEWS file entry below lists all changes.

    Changes in version 0.1.4 (2017-07-27)
    • The vignette now uses the pinp package in two-column mode.

    • Dynamic symbol registration is now enabled.

    Courtesy of CRANberries, there is also a diffstat report for the most recent release. More information is on the RcppZiggurat page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


    Featurizing images: the shallow end of deep learning

    Thu, 09/28/2017 - 00:09

    (This article was first published on Revolutions, and kindly contributed to R-bloggers)


    by Bob Horton and Vanja Paunic, Microsoft AI and Research Data Group

Training deep learning models from scratch requires large data sets and significant computational resources. Using pre-trained deep neural network models to extract relevant features from images allows us to build classifiers using standard machine learning approaches that work well for relatively small data sets. In this context, a deep learning solution can be thought of as incorporating layers that compute features, followed by layers that map these features to outcomes; here we’ll just map the features to outcomes ourselves.

    We explore an example of using a pre-trained deep learning image classifier to generate features for use with traditional machine learning approaches to address a problem the original model was never trained on (see the blog post “Image featurization with a pre-trained deep neural network model” for other examples). This approach allows us to quickly and easily create a custom classifier for a specific specialized task, using only a relatively small training set. We use the image featurization abilities of Microsoft R Server 9.1 (MRS) to create a classifier for different types of knots in lumber. These images were made publicly available from the laboratory of Prof., Dr. Olli Silven, University of Oulu, Finland, in 1995. Please note that we are using this problem as an academic example of an image classification task with clear industrial implications, but we are not really trying to raise the bar in this well-established field.

    We characterize the performance of the machine learning model and describe how it might fit into the framework of a lumber grading system. Knowing the strengths and weaknesses of the classifier, we discuss how it could be used to triage additional image data for labeling by human experts, so that the system can be iteratively improved.

    The pre-trained deep learning models used here are optional components that can be installed alongside Microsoft R Server 9.1; directions are here.

    Lumber grading

In the sawmill industry lumber grading is an important step of the manufacturing process. Improved grading accuracy and better control of quality variation in production lead directly to improved profits. Grading has traditionally been done by visual inspection, in which a (human) grader marks each piece of lumber as it leaves the mill, according to factors like size, category, and position of knots, cracks, species of tree, etc. Visual inspection is often an error prone and laborious task. Certain defect classes may be difficult to distinguish, even for a human expert. To that end, a number of automated lumber grading systems have been developed which aim to improve the accuracy and the efficiency of lumber grading [2-7].

    Loading and visualizing data

    Let us start by downloading the data.

DATA_DIR <- file.path(getwd(), 'data')

if(!dir.exists(DATA_DIR)){
  dir.create(DATA_DIR)
  knots_url <- 'http://www.ee.oulu.fi/research/imag/knots/KNOTS/knots.zip'
  names_url <- 'http://www.ee.oulu.fi/research/imag/knots/KNOTS/names.txt'
  download.file(knots_url, destfile = file.path(DATA_DIR, 'knots.zip'))
  download.file(names_url, destfile = file.path(DATA_DIR, 'names.txt'))
  unzip(file.path(DATA_DIR, 'knots.zip'), exdir = file.path(DATA_DIR, 'knot_images'))
}

    Let’s now load the data from the downloaded files and look at some of those files.

knot_info <- read.delim(file.path(DATA_DIR, "names.txt"), sep=" ", header=FALSE, stringsAsFactors=FALSE)[1:2]
names(knot_info) <- c("file", "knot_class")
knot_info$path <- file.path(DATA_DIR, "knot_images", knot_info$file)
names(knot_info)
## [1] "file"       "knot_class" "path"

    We’ll be trying to predict knot_class, and here are the counts of the categories:

table(knot_info$knot_class)
##
## decayed_knot     dry_knot    edge_knot encased_knot    horn_knot
##           14           69           65           29           35
##    leaf_knot   sound_knot
##           47          179

    Four of these labels relate to how the knot is integrated into the structure of the surrounding wood; these are the descriptions from the README file:

    • sound: “A knot grown firmly into the surrounding wood material and does not contain any bark or signs of decay. The color may be very close to the color of sound wood.”

    • dry: “A firm or partially firm knot, and has not taken part to the vital processes of growing wood, and does not contain any bark or signs of decay. The color is usually darker than the color of sound wood, and a thin dark ring or a partial ring surrounds the knot.”

    • encased: “A knot surrounded totally or partially by a bark ring. Compared to dry knot, the ring around the knot is thicker.”

    • decayed: “A knot containing decay. Decay is difficult to recognize, as it usually affects only the strength of the knot.”

    Edge, horn and leaf knots are related to the orientation of the knot relative to the cutting plane, the position of the knot on the board relative to the edge, or both. These attributes are of different character than the ones related to structural integration. Theoretically, you could have an encased horn knot, or a decayed leaf knot, etc., though there do not happen to be any in this dataset. Unless otherwise specified, we have to assume the knot is sound, cross-cut, and not near an edge. This means that including these other labels makes the knot_class column ‘untidy’, in that it refers to more than one different kind of characteristic. We could try to split out distributed attributes for orientation and position, as well as for structural integration, but to keep this example simple we’ll just filter the data to keep only the sound, dry, encased, and decayed knots that are cross-cut and not near an edge. (It turns out that you can use featurization to fit image classification models that recognize position and orientation quite well, but we’ll leave that as an exercise for the reader.)

knot_classes_to_keep <- c("sound_knot", "dry_knot", "encased_knot", "decayed_knot")
knot_info <- knot_info[knot_info$knot_class %in% knot_classes_to_keep,]
knot_info$knot_class <- factor(knot_info$knot_class, levels=knot_classes_to_keep)

    We kept the knot_class column as character data while we were deciding which classes to keep, but now we can make that a factor.

    Here are a few examples of images from each of the four classes:

    library("pixmap") set.seed(1) samples_per_category <- 3 op <- par(mfrow=c(1, samples_per_category), oma=c(2,2,2,2), no.readonly = TRUE) for (kc in levels(knot_info$knot_class)){ kc_files <- knot_info[knot_info$knot_class == kc, "path"] kc_examples <- sample(kc_files, samples_per_category) for (kc_ex in kc_examples){ pnm <- read.pnm(kc_ex) plot(pnm, xlab=gsub(".*(knot[0-9]+).*", "\\1", kc_ex)) mtext(text=gsub("_", " ", kc), side=3, line=0, outer=TRUE, cex=1.7) } }

par(op)

Featurizing images

    Here we use the rxFeaturize function from Microsoft R Server, which allows us to perform a number of transformations on the knot images in order to produce numerical features. We first resize the images to fit the dimensions required by the pre-trained deep neural model we will use, then extract the pixels to form a numerical data set, then run that data set through a DNN pre-trained model. The result of the image featurization is a numeric vector (“feature vector”) that represents key characteristics of that image.

    Image featurization here is accomplished by using a deep neural network (DNN) model that has already been pre-trained by using millions of images. Currently, MRS supports four types of DNNs – three ResNet models (18, 50, 101)[1] and AlexNet [8].

knot_data_df <- rxFeaturize(data = knot_info,
                            mlTransforms = list(loadImage(vars = list(Image = "path")),
                                                resizeImage(vars = list(Features = "Image"),
                                                            width = 224, height = 224,
                                                            resizingOption = "IsoPad"),
                                                extractPixels(vars = "Features"),
                                                featurizeImage(var = "Features",
                                                               dnnModel = "resnet101")),
                            mlTransformVars = c("path", "knot_class"))
## Elapsed time: 00:02:59.8362547

    We have chosen the “resnet101” DNN model, which is 101 layers deep; the other ResNet options (18 or 50 layers) generate features more quickly (well under 2 minutes each for this dataset, as opposed to several minutes for ResNet-101), but we found that the features from 101 layers work better for this example.

    We have placed the features in a dataframe, which lets us use any R algorithm to build a classifier. Alternatively, we could have saved them directly to an XDF file (the native file format for Microsoft R Server, suitable for large datasets that would not fit in memory, or that you want to distribute on HDFS), or generated them dynamically when training the model (see examples in the earlier blog post).

    Since the featurization process takes a while, let’s save the results up to this point. We’ll put them in a CSV file so that, if you are so inspired, you can open it in Excel and admire the 2048 numerical features the deep neural net model has created to describe each image. Once we have the features, training models on this data will be relatively fast.

write.csv(knot_data_df, "knot_data_df.csv", row.names=FALSE)

Select training and test data

    Now that we have extracted numeric features from each image, we can use traditional machine learning algorithms to train a classifier.

in_training_set <- sample(c(TRUE, FALSE), nrow(knot_data_df), replace=TRUE, prob=c(2/3, 1/3))
in_test_set <- !in_training_set

    We put about two thirds of the examples (203) into the training set and the rest (88) in the test set.

    Fit Random Forest model

    We will use the popular randomForest package, since it handles both “wide” data (with more features than cases) as well as multiclass outcomes.

library(randomForest)

form <- formula(paste("knot_class",
                      paste(grep("Feature", names(knot_data_df), value=TRUE), collapse=" + "),
                      sep=" ~ "))

training_set <- knot_data_df[in_training_set,]
test_set <- knot_data_df[in_test_set,]

fit_rf <- randomForest(form, training_set)

pred_rf <- as.data.frame(predict(fit_rf, test_set, type="prob"))
pred_rf$knot_class <- knot_data_df[in_test_set, "knot_class"]
pred_rf$pred_class <- predict(fit_rf, knot_data_df[in_test_set, ], type="response")

# Accuracy
with(pred_rf, sum(pred_class == knot_class)/length(knot_class))
## [1] 0.8068182

    Let’s see how the classifier did for all four classes. Here is the confusion matrix, as well as a moisaic plot showing predicted class (pred_class) against the actual class (knot_class).

with(pred_rf, table(knot_class, pred_class))
##               pred_class
## knot_class     sound_knot dry_knot encased_knot decayed_knot
##   sound_knot           51        0            0            0
##   dry_knot              3       19            0            0
##   encased_knot          6        5            1            0
##   decayed_knot          2        1            0            0

mycolors <- c("yellow3", "yellow2", "thistle3", "thistle2")
mosaicplot(table(pred_rf[,c("knot_class", "pred_class")]),
           col = mycolors, las = 1, cex.axis = .55, main = NULL,
           xlab = "Actual Class", ylab = "Predicted Class")

    It looks like the classifier performs really well on the sound knots. On the other hand, all of the decayed knots in the test set are misclassified. This is a small class, so let’s look at those misclassified samples.

library(dplyr)

pred_rf$path <- test_set$path

# misclassified knots
misclassified <- pred_rf %>% filter(knot_class != pred_class)

# number of misclassified decayed knots
num_example <- sum(misclassified$knot_class == "decayed_knot")

op <- par(mfrow = c(1, num_example), mar=c(2,1,0,1), no.readonly = TRUE)
for (i in which(misclassified$knot_class == "decayed_knot")){
  example <- misclassified[i, ]
  pnm <- read.pnm(example$path)
  plot(pnm)
  mtext(text=sprintf("predicted as\n%s", example$pred_class), side=1, line=-2)
}
mtext(text='Misclassified Decayed Knots', side = 3, outer = TRUE, line = -3, cex=1.2)

    par(op)

    We were warned about the decayed knots in the README file; they are more difficult to visually classify. Interestingly, the ones classified as sound actually do appear to be well-integrated into the surrounding wood; they also happen to look somewhat rotten. Also, the decayed knot classified as dry does have a dark border, and looks like a dry knot. These knots appear to be two decayed sound knots and a decayed dry knot; in other words, maybe decay should be represented as a separate attribute that is independent of the structural integration of the knot. We may have found an issue with the labels, rather than a problem with the classifier.

    Let’s look at the performance of the classifier more closely. Using the scores returned by the classifier, we can plot an ROC curve for each of the individual classes.

library(pROC)

plot_target_roc <- function(target_class, outcome, test_data, multiclass_predictions){
  is_target <- test_data[[outcome]] == target_class
  roc_obj <- roc(is_target, multiclass_predictions[[target_class]])
  plot(roc_obj, print.auc=TRUE, main=target_class)
  text(x=0.2, y=0.2, labels=sprintf("total cases: %d", sum(is_target)), pos=3)
}

op <- par(mfrow=c(2,2), oma=c(2,2,2,2),
          #mar=c(0.1 + c(7, 4, 4, 2)),
          no.readonly = TRUE)

for (knot_category in levels(test_set$knot_class)){
  plot_target_roc(knot_category, "knot_class", test_set, pred_rf)
}
mtext(text="ROC curves for individual classes", side=3, line=0, outer=TRUE, cex=1.5)

    par(op)

    These ROC curves show that the classifier scores can be used to provide significant enrichment for any of the classes. With sound knots, for example, the score can unambiguously identify a large majority of the cases that are not of that class. Another way to look at this is to consider the ranges of each classification score for each actual class:

library(tidyr)
library(ggplot2)

pred_rf %>%
  select(-path, -pred_class) %>%
  gather(prediction, score, -knot_class) %>%
  ggplot(aes(x=knot_class, y=score, col=prediction)) +
  geom_boxplot()

    Note that the range of the sound knot and dry knot scores tend to be higher for all four classes. But when you consider the scores for a given prediction class across all the actual classes, the scores tend to be higher when they match than when they don’t. For example, even though the decayed knot score never goes very high, it tends to be higher for the decayed knots than for other classes. Here’s a boxplot of just the decayed_knot scores for all four actual classes:

    boxplot(decayed_knot ~ knot_class, pred_rf, xlab="actual class", ylab="decayed_knot score", main="decayed_knot score for all classes")

    Even though the multi-class classifier did not correctly identify any of the three decayed knots in the test set, this does not mean that it is useless for finding decayed knots. In fact, the decayed knots had higher scores for the decayed predictor than most of the other knots did (as shown in the boxplots above), and this score it is able to correctly determine that almost 80% of the knots are not decayed. This means that this classifier could be used to screen for knots that are more likely to be decayed, or alternatively, you could use these scores to isolate a collection of cases that are unlikely to be any of the kinds of knots your classifier is good at recognizing. This ability to focus on things we are not good at might be helpful if we needed to search through a large collection of images to find more of the kinds of knots for which we need more training data. We could show those selected knots to expert human labelers, so they wouldn’t need to label all the examples in the entire database. This would help us get started with an active learning process, where we use a model to help develop a better model.
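A minimal sketch of that triage idea, using the objects already defined above: rank the test images by their decayed_knot score and hand only the top of the list to an expert labeler.

library(dplyr)

review_queue <- pred_rf %>%
  arrange(desc(decayed_knot)) %>%            # most "decayed-looking" images first
  select(path, knot_class, decayed_knot) %>%
  head(10)

review_queue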

    Here we’ve shown an open-source random forest model trained on features that have been explicitly written to a dataframe. The model classes in the MicrosoftML package can also generate these features dynamically, so that the trained model can accept a file name as input and automatically generate the features on demand during scoring. For examples see the earlier blog post. In general, these data sets are “wide”, in that they have a large number of features; in this example, they have far more columns of features than rows of cases. This means that you need to select algorithms that can handle wide data, like elastic net regularized linear models, or, as we’ve seen here, random forests. Many of the MicrosoftML algorithms are well suited for wide datasets.
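As a sketch of the elastic net alternative mentioned above (this is not part of the original analysis; it simply reuses the same feature columns with the glmnet package):

library(glmnet)

x_train <- as.matrix(training_set[, grep("Feature", names(training_set))])
x_test  <- as.matrix(test_set[, grep("Feature", names(test_set))])

# multinomial elastic net (alpha = 0.5 mixes ridge and lasso penalties)
fit_enet <- cv.glmnet(x_train, training_set$knot_class, family = "multinomial", alpha = 0.5)

pred_enet <- predict(fit_enet, x_test, s = "lambda.min", type = "class")
mean(pred_enet == as.character(test_set$knot_class))   # accuracy, comparable to the random forest above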

    Discussion

    Image featurization allows us to make effective use of relatively small sets of labeled images that would not be sufficient to train a deep network from scratch. This is because we can re-use the lower-level features that the pre-trained model had learned for more general image classification tasks.

    This custom classifier was constructed quite quickly; although the featurizer required several minutes to run, after the features were generated all training was done on relatively small models and data sets (a few hundred cases with a few thousand features), where training a given model takes on the order of seconds, and tuning hyperparameters is relatively straightforward. Industrial applications commonly require specialized classification or scoring models that can be used to address specific technical or regulatory concerns, so having a rapid and adaptable approach to generate such models should have considerable practical utility.

    Featurization is the low-hanging fruit of transfer learning. More general transfer learning allows the specialization of deeper and deeper layers in a pre-trained general model, but the deeper you want to go, the more labelled training data you will generally need. We often find ourselves awash in data, but limited by the availability of quality labels. Having a partially effective classifier can let you bootstrap an active learning approach, where a crude model is used to triage images, classifying the obvious cases directly and referring only those with uncertain scores to expert labelers. The larger set of labeled images is then used to build a better model, which can do more effective triage, and so on, leading to iterative model improvement. Companies like CrowdFlower use these kinds of approaches to optimize data labeling and model building, but there are many use cases where general crowdsourcing may not be adequate (such as when the labeling must be done by experts), so having a straightforward way to bootstrap that process could be quite useful.

    The labels we have may not be tidy, that is, they may not refer to distinct characteristics. In this case, the decayed knot does not seem to really be an alternative to “encased”, “dry”, or “sound” knots; rather, it seems that any of these categories of knots might possibly be decayed. In many applications, it is not always obvious what a properly distributed outcome labeling should include. This is one reason that a quick and easy approach is valuable; it can help you to clarify these questions with iterative solutions that can help frame the discussion with domain experts. In a subsequent iteration we may want to consider “decayed” as a separate attribute from “sound”, “dry” or “encased”.

Image featurization gives us a simple way to build domain-specific image classifiers using relatively small training sets. In the common use case where data is plentiful but good labels are not, this provides a rapid and straightforward first step toward getting the labeled cases we need to build more sophisticated models; with a few iterations, we might even learn to recognize decayed knots. Seriously, wood that knot be cool?

    References
1. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. (2016) Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.

    2. Kauppinen H. and Silven O.: A Color Vision Approach for Grading Lumber. In Theory & Applications of Image Processing II – Selected papers from the 9th Scandinavian Conference on Image Analysis (ed. G. Borgefors), World Scientific, pp. 367-379 1995.

    3. Silven O. and Kauppinen H.: Recent Developments in Wood Inspection. (1996) International Journal on Pattern Recognition and Artificial Intelligence, IJPRAI, pp. 83-95, 1996.

    4. Rinnhofer, Alfred & Jakob, Gerhard & Deutschl, Edwin & Benesova, Wanda & Andreu, Jean-Philippe & Parziale, Geppy & Niel, Albert. (2006). A multi-sensor system for texture based high-speed hardwood lumber inspection. Proceedings of SPIE – The International Society for Optical Engineering. 5672. 34-43. 10.1117/12.588199.

    5. Irene Yu-Hua Gu, Henrik Andersson, Raul Vicen. (2010) Wood defect classification based on image analysis and support vector machines. Wood Science and Technology 44:4, 693-704. Online publication date: 1-Nov-2010.

    6. Kauppinen H, Silven O & Piirainen T. (1999) Self-organizing map based user interface for visual surface inspection. Proc. 11th Scandinavian Conference on Image Analysis (SCIA’99), June 7-11, Kangerlussuaq, Greenland, 801-808.

    7. Kauppinen H, Rautio H & Silven O (1999). Nonsegmenting defect detection and SOM-based classification for surface inspection using color vision. Proc. EUROPTO Conf. on Polarization and Color Techniques in Industrial Inspection (SPIE Vol. 3826), June 17-18, Munich, Germany, 270-280.

    8. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. (2012) ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.


    New Course! Supervised Learning in R: Classification

    Wed, 09/27/2017 - 16:08

    (This article was first published on DataCamp Blog, and kindly contributed to R-bloggers)

Hi there! We are proud to launch our latest R & machine learning course, Supervised Learning in R: Classification, by Brett Lantz!

    This beginner-level introduction to machine learning covers four of the most common classification algorithms. You will come away with a basic understanding of how each algorithm approaches a learning task, as well as learn the R functions needed to apply these tools to your own work.

    Take me to chapter 1!

    Supervised Learning in R: Classification features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you an expert in machine learning in R!

    What you’ll learn:

    Chapter 1: k-Nearest Neighbors (kNN)

    This chapter will introduce classification while working through the application of kNN to self-driving vehicles.

    Chapter 2: Naive Bayes

    Naive Bayes uses principles from the field of statistics to make predictions. This chapter will introduce the basics of Bayesian methods.

    Chapter 3: Logistic Regression

Logistic regression involves fitting a curve to numeric data to make predictions about binary events.

    Chapter 4: Classification Trees

Classification trees use flowchart-like structures to make decisions. Because humans can readily understand these tree structures, classification trees are useful when transparency is needed.
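The course teaches these methods step by step, but just to give a flavor of the kind of model the last chapter covers, here is a minimal classification tree on R's built-in iris data (not course material, purely illustrative):

library(rpart)

set.seed(42)
in_train <- sample(nrow(iris), 100)

tree_fit <- rpart(Species ~ ., data = iris[in_train, ], method = "class")
tree_pred <- predict(tree_fit, iris[-in_train, ], type = "class")

table(predicted = tree_pred, actual = iris$Species[-in_train])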

    Start your path to mastering ML in R with Supervised Learning in R: Classification!


    Oneway ANOVA Explanation and Example in R; Part 1

    Wed, 09/27/2017 - 15:00

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

This tutorial was inspired by this post published at DataScience+ by Bidyut Ghosh. Special thanks also to Dani Navarro, The University of New South Wales (Sydney), for the book Learning Statistics with R (hereafter simply LSR) and the lsr package available through CRAN. I highly recommend it.

Let’s load the required R packages

library(ggplot2)
library(lsr)
library(psych)
library(car)
library(tidyverse)
library(dunn.test)
library(BayesFactor)
library(scales)
library(knitr)
library(kableExtra)
options(width = 130)
options(knitr.table.format = "html")

The Oneway Analysis of Variance (ANOVA)

    The Oneway ANOVA is a statistical technique that allows us to compare mean differences of one outcome (dependent) variable across two or more groups (levels) of one independent variable (factor). If there are only two levels (e.g. Male/Female) of the independent (predictor) variable the results are analogous to Student’s t-test. It is also true that ANOVA is a special case of the GLM or regression models so as the number of levels increases it might make more sense to try one of those approaches. ANOVA also allows for comparisons of mean differences across multiple factors (Factorial or Nway ANOVA) which we won’t address here.
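A quick way to see the two-level equivalence (not part of the tyre analysis below, just an illustration on R's built-in mtcars data): with two groups, the ANOVA F statistic is the square of the pooled-variance t statistic and the p values match.

t.test(mpg ~ am, data = mtcars, var.equal = TRUE)   # pooled-variance t-test, two groups
summary(aov(mpg ~ factor(am), data = mtcars))       # oneway ANOVA: F here equals that t squared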

    Our scenario and data

Professor Ghosh’s original scenario can be summarized this way. Imagine that you are interested in understanding whether knowing the brand of car tire can help you predict whether you will get more or less mileage before you need to replace them. We’ll draw what is hopefully a random sample of 60 tires from four different manufacturers and use the mean mileage by brand to help inform our thinking. While we expect variation across our sample, we’re interested in whether the differences between the tire brands (the groups) are significantly larger than what we would expect from random variation within the groups.

    Our research or testable hypothesis is then described \[\mu_{Apollo} \ne \mu_{Bridgestone} \ne \mu_{CEAT} \ne \mu_{Falken}\] as at least one of the tire brand populations is different than the other three. Our null is basically “nope, brand doesn’t matter in predicting tire mileage – all brands are the same”.
    He provides the following data set with 60 observations. I’ve chosen to download directly but you could of course park the file locally and work from there.

Column    Contains                  Type
Brands    What brand tyre           factor
Mileage   Tyre life in thousands    num

tyre <- read.csv("https://datascienceplus.com/wp-content/uploads/2017/08/tyre.csv")
# tyre <- read.csv("tyre.csv") # if you have it on your local hard drive
str(tyre)
## 'data.frame': 60 obs. of 2 variables:
##  $ Brands : Factor w/ 4 levels "Apollo","Bridgestone",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Mileage: num 33 36.4 32.8 37.6 36.3 ...
summary(tyre)
##         Brands       Mileage
##  Apollo     :15   Min.   :27.88
##  Bridgestone:15   1st Qu.:32.69
##  CEAT       :15   Median :34.84
##  Falken     :15   Mean   :34.74
##                   3rd Qu.:36.77
##                   Max.   :41.05
head(tyre)
##   Brands Mileage
## 1 Apollo  32.998
## 2 Apollo  36.435
## 3 Apollo  32.777
## 4 Apollo  37.637
## 5 Apollo  36.304
## 6 Apollo  35.915

View(tyre) # if you use RStudio this is a nice way to see the data in spreadsheet format
    The data set contains what we expected. The dependent variable Mileage is numeric and the independent variable Brand is of type factor. R is usually adept at coercing a chr string or an integer as the independent variable but I find it best to explicitly make it a factor when you’re working on ANOVAs.

    Let’s graph and describe the basics. First a simple boxplot of all 60 data points along with a summary using the describe command from the package psych. Then in reverse order lets describe describeBy and boxplot breaking it down by group (in our case tire brand).

boxplot(tyre$Mileage, horizontal = TRUE, main="Mileage distribution across all brands", col = "blue")

describe(tyre) # the * behind Brands reminds us it's a factor and some of the numbers are nonsensical
##         vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
## Brands*    1 60  2.50 1.13   2.50    2.50 1.48  1.00  4.00  3.00  0.00    -1.41 0.15
## Mileage    2 60 34.74 2.98  34.84   34.76 3.09 27.88 41.05 13.17 -0.11    -0.44 0.38

describeBy(tyre$Mileage, group = tyre$Brand, mat = TRUE, digits = 2)
##     item      group1 vars  n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
## X11    1      Apollo    1 15 34.80 2.22  34.84   34.85 2.37 30.62 38.33  7.71 -0.18    -1.24 0.57
## X12    2 Bridgestone    1 15 31.78 2.20  32.00   31.83 1.65 27.88 35.01  7.13 -0.29    -1.05 0.57
## X13    3        CEAT    1 15 34.76 2.53  34.78   34.61 2.03 30.43 41.05 10.62  0.64     0.33 0.65
## X14    4      Falken    1 15 37.62 1.70  37.38   37.65 1.18 34.31 40.66  6.35  0.13    -0.69 0.44

boxplot(tyre$Mileage~tyre$Brands,
        main="Boxplot comparing Mileage of Four Brands of Tyre",
        col= rainbow(4),
        horizontal = TRUE)

Let’s format the table describeBy generates to make it a little nicer using the kable function from the knitr package (with kableExtra for styling). Luckily describeBy generates a dataframe with mat=TRUE, and after that we can select which columns to publish (dropping some of the less used) as well as change the column labels as desired.

describeBy(tyre$Mileage, group = tyre$Brand, mat = TRUE) %>% # create dataframe
    select(Brand = group1, N = n, Mean = mean, SD = sd, Median = median, Min = min, Max = max,
           Skew = skew, Kurtosis = kurtosis, SEM = se) %>%
    kable(align = c("lrrrrrrrr"), digits = 2, row.names = FALSE,
          caption = "Tire Mileage Brand Descriptive Statistics") %>%
    kable_styling(bootstrap_options = c("bordered", "responsive", "striped"), full_width = FALSE)

Tire Mileage Brand Descriptive Statistics

Brand         N   Mean    SD  Median    Min    Max   Skew  Kurtosis   SEM
Apollo       15  34.80  2.22   34.84  30.62  38.33  -0.18     -1.24  0.57
Bridgestone  15  31.78  2.20   32.00  27.88  35.01  -0.29     -1.05  0.57
CEAT         15  34.76  2.53   34.78  30.43  41.05   0.64      0.33  0.65
Falken       15  37.62  1.70   37.38  34.31  40.66   0.13     -0.69  0.44

    Certainly much nicer looking and I only scratched the surface of the options available. We can certainly look at the numbers and learn a lot. But let’s see if we can also improve our plotting to be more informative.
    The more I use ggplot2 the more I love the ability to use it to customize the presentation of the data to optimize understanding! The next plot might be accused of being a little “busy” but essentially answers our Oneway ANOVA question in one picture (note that I have stayed with the original decision to set \(\alpha\) = 0.01 significance level (99% confidence intervals)).

ggplot(tyre, aes(reorder(Brands,Mileage),Mileage,fill=Brands))+
# ggplot(tyre, aes(Brands,Mileage,fill=Brands))+ # if you want to leave them alphabetic
  geom_jitter(colour = "dark gray",width=.1) +
  stat_boxplot(geom ='errorbar',width = 0.4) +
  geom_boxplot()+
  labs(title="Boxplot, dotplot and SEM plot of mileage for four brands of tyres",
       x = "Brands (sorted)",
       y = "Mileage (in thousands)",
       subtitle ="Gray dots=sample data points, Black dot=outlier, Blue dot=mean, Red=99% confidence interval",
       caption = "Data from https://datascienceplus.com/one-way-anova-in-r/") +
  guides(fill=FALSE) +
  stat_summary(fun.data = "mean_cl_normal", colour = "red", size = 1.5, fun.args = list(conf.int=.99)) +
  stat_summary(geom="point", fun.y=mean, color="blue") +
  theme_bw()

    By simple visual inspection it certainly appears that we have evidence of the effect of tire brand on mileage. There is one outlier for the CEAT brand but little cause for concern. Means and medians are close together so no major concerns about skewness. Different brands have differing amounts of variability but nothing shocking visually.

    Oneway ANOVA Test & Results

    So the heart of this post is to actually execute the Oneway ANOVA in R. There are several ways to do so but let’s start with the simplest from the base R first aov. While it’s possible to wrap the command in a summary or print statement I recommend you always save the results out to an R object in this case tyres.aov. It’s almost inevitable that further analysis will ensue and the tyres.aov object has a wealth of useful information. If you’re new to R a couple of quick notes. The dependent variable goes to the left of the tilde and our independent or predictor variable to the right. aov is not limited to Oneway ANOVA so adding additional factors is possible.
    As I mentioned earlier ANOVA is a specialized case of the GLM and therefore the list object returned tyres.aov is actually of both aov and lm class. The names command will give you some sense of all the information contained in the list object. We’ll access some of this later as we continue to analyze our data. The summary command gives us the key ANOVA data we need and produces a classic ANOVA table. If you’re unfamiliar with them and want to know more especially where the numbers come from I recommend a good introductory stats text. As noted earlier I recommend Learning Statistics with R LSR see Table 14-1 on page 432.

tyres.aov <- aov(Mileage~Brands, tyre)
class(tyres.aov)
## [1] "aov" "lm"
typeof(tyres.aov)
## [1] "list"
names(tyres.aov)
##  [1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"
##  [8] "df.residual"   "contrasts"     "xlevels"       "call"          "terms"         "model"
summary(tyres.aov)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Brands       3  256.3   85.43   17.94 2.78e-08 ***
## Residuals   56  266.6    4.76
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    We can reject the null hypothesis at the \(\alpha\) = 0.01 significance level (99% confidence). The F statistic is calculated as \[F = \frac{MS_{between}}{MS_{within}}\] and the table gives us the precise p value and the common asterisks to show “success”.
    In published results format that probably looks like “a Oneway ANOVA showed a significant effect for brand on tire mileage, F(3,56)=17.94, p<.01”. In other words, we can reject the null hypothesis that these data came from brand tire populations where the average tire mileage life was the same! Making it a prediction statement, we can see that brand type helps predict mileage life.
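Those reporting numbers don't have to be copied by hand; they can be pulled straight out of the aov object (summary(tyres.aov)[[1]] is just a data frame of the ANOVA table):

aov_table <- summary(tyres.aov)[[1]]
F_value <- aov_table[["F value"]][1]
p_value <- aov_table[["Pr(>F)"]][1]
sprintf("F(%.0f,%.0f) = %.2f, p = %.2g",
        aov_table[["Df"]][1], aov_table[["Df"]][2], F_value, p_value)
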
    That’s exciting news, but leaves us with some other unanswered questions.
    The data provide support for the hypothesis that the means aren’t all equal – that’s called the omnibus test. We have support for rejecting \[\mu_{Apollo} = \mu_{Bridgestone} = \mu_{CEAT} = \mu_{Falken}\] but at this point we can’t state with any authority which specific pairs are different, all we can say is that at least one is different! When we look at the graph we made earlier we can guess we know but let’s do better than that. How can we use confidence intervals to help us understand whether the data are indicating simple random variation or whether the underlying population is different. We just need to compute the confidence interval for each brand’s mean and then see which brand means lie inside or outside the confidence interval of the others. We would expect that if we ran our experiment 100 times with our sample size numbers for each brand the mileage mean would lie inside the upper and lower limit of our confidence interval 99 times (with \(\alpha\) = 0.01) out of those 100 times. If our data shows it outside the confidence interval that is evidence of a statistically significant difference for that specific pairing.
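One way to compute those per-brand 99% intervals directly, sketched with dplyr and the objects already loaded above:

library(dplyr)

tyre %>%
  group_by(Brands) %>%
  summarise(n = n(),
            mean_mileage = mean(Mileage),
            sem = sd(Mileage) / sqrt(n),
            lower99 = mean_mileage - qt(0.995, df = n - 1) * sem,
            upper99 = mean_mileage + qt(0.995, df = n - 1) * sem)
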
    But we don’t have to rely on our graph, we can be more precise and test it in a very controlled fashion.

    A priori and post hoc comparisons

    We could just take mileage and brands and run all the possible t tests. There would be 6 of them; Apollo -v- Bridgestone, Apollo -v- CEAT, Apollo -v- Falken, Bridgestone -v- CEAT, Bridgestone -v- Falken, and CEAT -v- Falken. Base R provides pairwise.t.test to feed it the data and allow it to rapidly make all the relevant comparisons. lsr provides a helper function that makes it possible to simply feed it the aov object and do the same.
    The “answers” appear to support our read of the graph. All of the possible pairs seem to be different other than Apollo -v- CEAT which is what the graph shows. The significance levels R spits out are all much smaller than p<.01. Break out the champagne start the victory dance.

    pairwise.t.test(tyre$Mileage, tyre$Brands, p.adjust.method = "none")
    ##
    ##  Pairwise comparisons using t tests with pooled SD
    ##
    ## data:  tyre$Mileage and tyre$Brands
    ##
    ##             Apollo  Bridgestone CEAT
    ## Bridgestone 0.00037 -           -
    ## CEAT        0.96221 0.00043     -
    ## Falken      0.00080 9.7e-10     0.00069
    ##
    ## P value adjustment method: none

    # unfortunately pairwise.t.test doesn't accept formula style or an aov object
    # lsr library to the rescue
    posthocPairwiseT(tyres.aov, p.adjust.method = "none") # equivalent, just easier to use the aov object
    ##
    ##  Pairwise comparisons using t tests with pooled SD
    ##
    ## data:  Mileage and Brands
    ##
    ##             Apollo  Bridgestone CEAT
    ## Bridgestone 0.00037 -           -
    ## CEAT        0.96221 0.00043     -
    ## Falken      0.00080 9.7e-10     0.00069
    ##
    ## P value adjustment method: none

    But that would be wrong and here’s why. Assuming we want to have 99% confidence again, across all six unique pairings, we are “cheating” if we don’t adjust the rejection region (and our confidence intervals) and just run the test six times. It’s analogous to rolling the die six times instead of once. The more simultaneous tests we run the more likely we are to find a difference even though none exists. We need to adjust our thinking and our confidence to account for the fact that we are making multiple comparisons (a.k.a. simultaneous comparisons). Our confidence interval must be made wider (more conservative) to account for the fact we are making multiple simultaneous comparisons. Thank goodness the tools exist to do this for us. As a matter of fact there is no one single way to make the adjustment… there are many.
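    To make that concrete with the simplest possible adjustment (an illustrative calculation, not from the original post), a Bonferroni correction spreads the family-wise \(\alpha\) of 0.01 across the six comparisons:

    0.01 / 6  # per-comparison alpha under a Bonferroni correction, about 0.0017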
    One starting position is that it makes a difference whether you have specified (hypothesized) some specific relationships a priori (in advance) or whether you're exploring post hoc (after the fact, also called "fishing"). The traditional position is that a priori testing grants you more latitude and less need to be conservative. The only thing that is certain is that some adjustment is necessary. In his original post Professor Ghosh applied one of the classical choices for making an adjustment, Tukey's Honestly Significant Difference (HSD) (https://en.wikipedia.org/wiki/Tukey%27s_range_test). Let's reproduce his work, first as two tables at two confidence levels.

    TukeyHSD(tyres.aov, conf.level = 0.95)
    ##   Tukey multiple comparisons of means
    ##     95% family-wise confidence level
    ##
    ## Fit: aov(formula = Mileage ~ Brands, data = tyre)
    ##
    ## $Brands
    ##                           diff        lwr       upr     p adj
    ## Bridgestone-Apollo -3.01900000 -5.1288190 -0.909181 0.0020527
    ## CEAT-Apollo        -0.03792661 -2.1477456  2.071892 0.9999608
    ## Falken-Apollo       2.82553333  0.7157143  4.935352 0.0043198
    ## CEAT-Bridgestone    2.98107339  0.8712544  5.090892 0.0023806
    ## Falken-Bridgestone  5.84453333  3.7347143  7.954352 0.0000000
    ## Falken-CEAT         2.86345994  0.7536409  4.973279 0.0037424

    TukeyHSD(tyres.aov, conf.level = 0.99)
    ##   Tukey multiple comparisons of means
    ##     99% family-wise confidence level
    ##
    ## Fit: aov(formula = Mileage ~ Brands, data = tyre)
    ##
    ## $Brands
    ##                           diff        lwr        upr     p adj
    ## Bridgestone-Apollo -3.01900000 -5.6155816 -0.4224184 0.0020527
    ## CEAT-Apollo        -0.03792661 -2.6345082  2.5586550 0.9999608
    ## Falken-Apollo       2.82553333  0.2289517  5.4221149 0.0043198
    ## CEAT-Bridgestone    2.98107339  0.3844918  5.5776550 0.0023806
    ## Falken-Bridgestone  5.84453333  3.2479517  8.4411149 0.0000000
    ## Falken-CEAT         2.86345994  0.2668783  5.4600415 0.0037424

    A lot of output there, but not too difficult to understand. We can see the 6 pairings we have been tracking listed in the first column. The diff column is the difference between the means of the two brands listed; so the mean for Bridgestone is 3,019 miles less than for Apollo. The lwr and upr columns show the lower and upper CI limits. Notice they change between the two confidence levels we've run, whereas the mean difference and exact p value do not. So the good news is that even with our more conservative Tukey HSD test we have empirical support for 5 out of the 6 possible differences.

    Now let’s graph just the .99 CI version.

    par()$oma # current margins
    ## [1] 0 0 0 0
    par(oma = c(0, 5, 0, 0)) # adjust the margins because the factor names are long
    plot(TukeyHSD(tyres.aov, conf.level = 0.99), las = 1, col = "red")
    par(oma = c(0, 0, 0, 0)) # put the margins back

    If you’re a visual learner, as I am, this helps. We’re looking at the differences in means amongst the pairs of brands. 0 on the x axis means no difference at all and the red horizontals denote 99% confidence intervals.
    Finally, as I mentioned earlier, there are many different ways (tests) of adjusting. Tukey HSD is very common and is easy to access and graph. But two others worth noting are the Bonferroni and its successor, the Holm. Let's go back to our earlier use of pairwise.t.test. We'll use it again (as well as the lsr wrapper function posthocPairwiseT). You can use the built-in R help for p.adjust to see all the methods available. I recommend holm as a general position, but know your options.

    pairwise.t.test(tyre$Mileage, tyre$Brands, p.adjust.method = "bonferroni")
    ##
    ##  Pairwise comparisons using t tests with pooled SD
    ##
    ## data:  tyre$Mileage and tyre$Brands
    ##
    ##             Apollo Bridgestone CEAT
    ## Bridgestone 0.0022 -           -
    ## CEAT        1.0000 0.0026      -
    ## Falken      0.0048 5.8e-09     0.0041
    ##
    ## P value adjustment method: bonferroni

    pairwise.t.test(tyre$Mileage, tyre$Brands, p.adjust.method = "holm")
    ##
    ##  Pairwise comparisons using t tests with pooled SD
    ##
    ## data:  tyre$Mileage and tyre$Brands
    ##
    ##             Apollo Bridgestone CEAT
    ## Bridgestone 0.0019 -           -
    ## CEAT        0.9622 0.0019      -
    ## Falken      0.0021 5.8e-09     0.0021
    ##
    ## P value adjustment method: holm

    posthocPairwiseT(tyres.aov) # default is Holm
    ##
    ##  Pairwise comparisons using t tests with pooled SD
    ##
    ## data:  Mileage and Brands
    ##
    ##             Apollo Bridgestone CEAT
    ## Bridgestone 0.0019 -           -
    ## CEAT        0.9622 0.0019      -
    ## Falken      0.0021 5.8e-09     0.0021
    ##
    ## P value adjustment method: holm

    Happily, given our data, we get the same overall answer with very slightly different numbers. As it turns out we have very strong effect sizes, so the choice of test doesn't change our overall answers. Wait, what's an effect size, you ask? That's our next question, which we will cover in the second part.



      Churn Prediction with Automatic ML

      Wed, 09/27/2017 - 10:01

      (This article was first published on MLJAR Blog Category: R, and kindly contributed to R-bloggers)

      Sometimes we don’t even realize how common machine learning (ML) is in our daily lives. Various “intelligent” algorithms help us for instance with finding the most important facts (Google), they suggest what movie to watch (Netflix), or influence our shopping decisions (Amazon). The biggest international companies quickly recognized the potential of machine learning and transferred it to business solutions.

      Nowadays it is not only big companies that are able to use ML. Imagine a situation, not so abstract, in which a company tries to predict customer behavior based on some personal data. Just a few years ago, the best strategy to solve this problem would have been to hire a good data science team. Today, thanks to the growing popularity of ML, it is within reach even for small start-ups.

      Today, I would like to present a demo of how to solve a difficult business problem with ML. We will take advantage of the mljar.com service and its R API. With just a few lines of code we will be able to achieve very good results.

      A typical problem for companies operating on a contractual basis (like Internet or phone providers) is whether a customer will decide to stay for the next period of time or churn. It is crucial for a company to focus on customers who are at risk of churning in order to prevent it.

      In this example we will help a telecom company predict which customers are likely to renew a contract and which are not. We will base our predictions on data from the past. The dataset is publicly available and can be downloaded, for example, here: https://github.com/WLOGSolutions/telco-customer-churn-in-r-and-h2o/tree/master/data.

      It contains descriptors, which can be divided into three different categories:

      • demographic data: gender, age, education, etc.
      • account information: number of complaints, average call duration, etc.
      • services used: Internet, voice etc.

      Finally, we will also have a column with two labels: churn and no churn, which is our target to predict.

      Now that we have the problem set up and understand our data, we can move on to the code. mljar.com has both an R and a Python API, but this time we focus on the former.

      First of all, we need to import necessary libraries.

      If you miss one of them, you can always easily install it with install.packages command.
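      As a rough sketch of what those imports might look like (mljar and pROC are used later in this post; anything beyond them is an assumption on my part):

      library(mljar)  # R API for the mljar.com service
      library(pROC)   # ROC curves and threshold selection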

      The following lines enable you to read and clean the dataset. In three steps we: get rid of irrelevant columns (time), select only complete records and remove duplicated rows.
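      A minimal sketch of those three steps; the file name churn.csv and the time column name are assumptions based on the description above:

      churn <- read.csv("churn.csv")            # file name is an assumption
      churn$time <- NULL                        # 1. drop the irrelevant time column
      churn <- churn[complete.cases(churn), ]   # 2. keep only complete records
      churn <- churn[!duplicated(churn), ]      # 3. remove duplicated rows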

      It is important to validate our final ML model before publishing, so we split the churn data into training and test set in proportion 7:3.
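      A sketch of a 7:3 split by random row indices:

      set.seed(1234)  # for a reproducible split
      train_idx <- sample(seq_len(nrow(churn)), size = floor(0.7 * nrow(churn)))
      train <- churn[train_idx, ]
      test  <- churn[-train_idx, ]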

      For convenience, let’s assign features and labels to separate variables:
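      For example (assuming the label column is named churn):

      x.tr <- train[, setdiff(names(train), "churn")]   # training features
      y.tr <- train$churn                               # training labels
      x.te <- test[, setdiff(names(test), "churn")]     # test features
      y.te <- test$churn                                # test labels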

      And that’s it! We are ready to use mljar to train our ML models.

      (Before, make sure that you have mljar account and your mljar token is set. For more information take a look at: https://github.com/mljar/mljar-api-R/)

      The main advantage of mljar is that it frees the user from thinking about model selection. We can try various models and compare their performance. In this case, as a performance measure we choose the Area Under the ROC Curve (https://en.wikipedia.org/wiki/Receiver_operating_characteristic).

      We use the following ML algorithms: logistic regression, xgboost, light gradient boosting, extra trees, regularized greedy forest and k-nearest neighbors. Moreover, in R training all these models are as simple as that:
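      A sketch of that call using the mljar_fit function from the mljar R API; the algorithm codes, metric name and project/experiment titles shown here are assumptions, so check the package documentation for the exact arguments:

      model <- mljar_fit(x.tr, y.tr,
                         proj_title = "Churn",                 # project name is an assumption
                         exp_title  = "churn-prediction",
                         algorithms = c("logreg", "xgb", "lgb", "etc", "rgfc", "knn"),
                         metric     = "auc")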

      After running this, when you log in into mljar.com web service and select “Churn” project you should see training data preview with some basic statistics. It is very helpful when you consider which features to use. In this example we decided to take advantage of all of them, but that’s not always the best strategy.

      With a little bit of patience (which depends on your mljar subscription), we got our final model with the best results. All models are also listed at mljar.com in Results card.

      The mljar_fit function returns the best model, which in this case is: ensemble_avg.

      Ensemble average means that this is a combination of various models, which usually performs better together than individual models.

      Another useful option of mljar web service is hidden in “Feature Analysis” card. If you select some model there, you will see a graphical illustration of feature importance. For instance for extra trees algorithm we have:

      The graph leads to a conclusion that age, unpaid invoice balance and monthly billed amounts are the most important customer descriptors, whereas number of calls or using some extra services have almost no impact on churning.

      To predict labels on the test set, we use mljar_predict command.
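      A sketch of that step; the exact argument list of mljar_predict is an assumption, so check the mljar R API documentation:

      y_pr <- mljar_predict(model, x.te, "Churn")   # predicted churn probabilities for the test set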

      Note that the variable y_pr contains probabilities of our churn/no-churn labels. With help of pROC package we will easily find out which threshold is the best in our case:
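      For instance, a sketch along these lines (assuming the first column of y_pr holds the churn probability):

      roc_obj  <- roc(response = y.te, predictor = y_pr[[1]])    # ROC curve from pROC
      best     <- coords(roc_obj, x = "best", ret = "threshold")
      best_thr <- unlist(best)[[1]]      # works whether coords returns a vector or a data frame
      y_pr_c   <- y_pr[[1]] > best_thr   # logical churn / no-churn decision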

      In the data frame y_pr_c our classification results are encoded as logical values, so let's transform them into a more readable form:
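      For example (the y_pr_c name follows the sketch above; the label text is taken from the dataset description):

      y_pr_c <- ifelse(y_pr_c, "churn", "no churn")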

      Finally, we can evaluate our model’s prediction with different measures:
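      A sketch of one such evaluation, assuming the actual test labels in y.te use the same churn / no churn coding:

      cm <- table(predicted = y_pr_c, actual = y.te)   # confusion matrix
      cm
      accuracy  <- sum(diag(cm)) / sum(cm)
      precision <- cm["churn", "churn"] / sum(cm["churn", ])
      recall    <- cm["churn", "churn"] / sum(cm[, "churn"])
      c(accuracy = accuracy, precision = precision, recall = recall)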

      A similar analysis has been performed using another R library for machine learning: H2O (see https://github.com/WLOGSolutions/telco-customer-churn-in-r-and-h2o). You can see a comparison of results in the table below:

      I hope I have convinced you that some complex business problems can be solved easily with machine learning algorithms using tools like mljar.

      (Worth knowing: you can repeat all steps of the above analysis with no coding at all, using only the mljar web service. Try it out at mljar.com. If you would like to run the code, it is here.)

      For more interesting R resources please check r-bloggers.



      rrricanes to Access Tropical Cyclone Data

      Wed, 09/27/2017 - 09:00
      What is rrricanes

      Why Write rrricanes?

      There is a tremendous amount of weather data available on the internet. Much of it is in raw format and not very easy to obtain. Hurricane data is no different. When one thinks of this data they may be inclined to think it is a bunch of map coordinates with some wind values and not much else. A deeper look will reveal structural and forecast data. An even deeper look will find millions of data points from hurricane reconnaissance, computer forecast models, ship and buoy observations, satellite and radar imagery, …

      rrricanes is an attempt to bring this data together in a way that doesn't just benefit R users, but other languages as well.

      I began learning R in 2015 and immediately wished I had a hurricane-specific dataset when Hurricane Patricia became a harmless but historic hurricane roaming the Pacific waters. I revisited the idea as Hurricane Matthew took aim at Florida and the southeast in 2016. Being unable to use R to study and consolidate Matthew's data led me to begin learning package development. Thus, the birth of rrricanes.

      In this article, I will take you on a lengthy tour of the most important features of rrricanes and what the data means. If you have a background working with hurricane data, most of this will be redundant. My aim here is to cover the big ideas behind the package and explain them under the assumption you, the reader, are unfamiliar with the data offered.

      rrricanes is not intended to be used in emergency situations. I write this article as areas where I have lived, or currently live, are under the gun from Hurricane Harvey, and rrricanes is unable to obtain data due to external issues (I will describe these later). It is designed with the intent of answering questions and exploring ideas outside of a time-sensitive environment.

      rrricanes will not be available in CRAN for quite some time. The current schedule is May 15, 2018 (the "start" of the East Pacific hurricane season). This year is solely for testing under real-time conditions.

      And rrricanesdata

      The NHC archives text products dating back to at least 1998 (some earlier years exist but yet to be implemented in this package). Accessing this data is a time-consuming process on any computer. A limit of 4 requests per second is put in place to avoid being banned (or restricted) from the archives. So, if a hurricane has 20 text products you wish to pull and parse, this will take 5 seconds. Most cyclones have more and some, far more.

      rrricanesdata is a complement package to rrricanes. rrricanesdata contains post-scraped datasets of the archives for all available storms, with the exception of advisories issued in the current month. This means you can explore the various datasets without the wait.

      rrricanesdata will be updated monthly if an advisory has been issued the previous month. There will be regular monthly updates approximately from May through November – the typical hurricane season. In some cases, a cyclone may develop in the off-season. rrricanesdata will be updated on the same schedule.

      ELI5 the Data

      This package covers tropical cyclones that have developed in the Atlantic basin (north Atlantic ocean) or East Pacific basin (northeast Pacific east of 140°W). Central Pacific (140°W – 180°W) may be mixed in if listed in the NHC archives.

      While traditionally the hurricane season for each basin runs from mid-May or June through November, some cyclones have developed outside of this time frame.

      Every tropical cyclone (any tropical low, whether classified as a tropical depression, tropical storm or hurricane) has a core set of text products officially issued by the National Hurricane Center. These products are issued every six hours.

      Much of this data has changed in format over the years. Some products have been discontinued and replaced by new products or wrapped into existing products. Some of these products are returned in raw text format; it is not cleaned and may contain HTML characters. Other products are parsed with every piece of data extracted and cleaned.

      I have done my best to ensure data is high quality. But, I cannot guarantee it is perfect. If you do believe you have found an error, please let me know; even if it seems small. I would rather be notified of a false error than ignore a true one.

      The Products

      Each advisory product is listed below with an abbreviation in parentheses. Unless otherwise noted, these products are issued every six hours. Generally, the times issued are 03:00, 09:00, 15:00 and 21:00 UTC. Some products may be issued in three-hour increments and, sometimes, two-hour increments. update can be issued at any time.

      • Storm Discussion (discus) – These are technical discussions centered on the current structure of the cyclone, satellite presentation, computer forecast model tendencies and more. These products are not parsed.

      • Forecast/Advisory (fstadv) – This data-rich product lists the current location of the cyclone, its wind structure, its forecast and the forecast wind structure.

      • Public Advisory (public) – These are general text statements issued for the public-at-large. Information in these products is a summary of the Forecast/Advisory product along with any watches and warnings issued, changed, or cancelled. Public Advisory products are the only regularly-scheduled product that may be issued intermittently (every three hours and, occasionally, every two hours) when watches and warnings are in effect. These products are not parsed.

      • Wind Speed Probabilities (wndprb) – These products list the probability of a minimum sustained wind speed expected in a given forecast window. This product replaces the Strike Probabilities product beginning in 2006 (see below).

      • Updates (update) – Tropical Cyclone Updates may be issued at any time if a storm is an immediate threat to land or if the cyclone undergoes a significant change of strength or structure. The information in this product is general. These products are not parsed.

      Discontinued Products
      • Strike Probabilities (prblty) – List the probability of a tropical cyclone passing within 65 nautical miles of a location within a forecast window. Replaced in 2006 by the Wind Speed Probabilities product.

      • Position Estimates (posest) – Typically issued as a storm is threatening land but generally rare (see Hurricane Ike 2008, Key AL092008). It is generally just an update of the current location of the cyclone. After the 2011 hurricane season, this product was discontinued; Updates are now issued in their place. These products are not parsed.

      Primary Key

      Every cyclone has a Key. However, not all products contain this value (prblty, for example). Products issued during and after the 2005 hurricane season contain this variable.

      Use Key to tie datasets together. If Key does not exist, you will need to use a combination of Name and Date, depending on your requirements. Keep in mind that, unless a name is retired, names are recycled every seven years. For example, there are multiple cyclones named Katrina but you may want to isolate on Katrina, 2005.

      Installation

      rrricanes will not be submitted to CRAN until just prior to the 2018 hurricane season. It can be installed from GitHub using devtools:

      devtools::install_github("ropensci/rrricanes", build_vignettes = TRUE)

      Optional Supporting Packages

      rrricanesdata uses a drat repository to host the large, pre-processed datasets.

      install.packages("rrricanesdata", repos = "https://timtrice.github.io/drat/", type = "source")

      To use high resolution tracking charts, you may also wish to install the rnaturalearthhires package:

      install.packages("rnaturalearthhires", repos = "http://packages.ropensci.org", type = "source")

      Linux users may also need to install:

      • libgdal-dev
      • libproj-dev
      • libxml2-dev
      Get a List of Storms

      We start exploring rrricanes by finding a storm (or storms) we wish to analyze. For this, we use get_storms. There are two optional parameters:

      • years Between 1998 and current year

      • basins One or both "AL" and "EP"

      An empty call to the function will return storms for both the Atlantic and East Pacific basin for the current year.

      library(dplyr)
      library(rrricanes)
      get_storms() %>% print(n = nrow(.))
      ## # A tibble: 33 x 4
      ##     Year Name                           Basin
      ##
      ##  1  2017 Tropical Storm Arlene          AL
      ##  2  2017 Tropical Storm Bret            AL
      ##  3  2017 Tropical Storm Cindy           AL
      ##  4  2017 Tropical Depression Four       AL
      ##  5  2017 Tropical Storm Don             AL
      ##  6  2017 Tropical Storm Emily           AL
      ##  7  2017 Hurricane Franklin             AL
      ##  8  2017 Hurricane Gert                 AL
      ##  9  2017 Hurricane Harvey               AL
      ## 10  2017 Potential Tropical Cyclone Ten AL
      ## 11  2017 Hurricane Irma                 AL
      ## 12  2017 Hurricane Jose                 AL
      ## 13  2017 Hurricane Katia                AL
      ## 14  2017 Hurricane Lee                  AL
      ## 15  2017 Hurricane Maria                AL
      ## 16  2017 Tropical Storm Adrian          EP
      ## 17  2017 Tropical Storm Beatriz         EP
      ## 18  2017 Tropical Storm Calvin          EP
      ## 19  2017 Hurricane Dora                 EP
      ## 20  2017 Hurricane Eugene               EP
      ## 21  2017 Hurricane Fernanda             EP
      ## 22  2017 Tropical Storm Greg            EP
      ## 23  2017 Tropical Depression Eight-E    EP
      ## 24  2017 Hurricane Hilary               EP
      ## 25  2017 Hurricane Irwin                EP
      ## 26  2017 Tropical Depression Eleven-E   EP
      ## 27  2017 Tropical Storm Jova            EP
      ## 28  2017 Hurricane Kenneth              EP
      ## 29  2017 Tropical Storm Lidia           EP
      ## 30  2017 Hurricane Otis                 EP
      ## 31  2017 Hurricane Max                  EP
      ## 32  2017 Hurricane Norma                EP
      ## 33  2017 Tropical Storm Pilar           EP
      ## # ... with 1 more variables: Link

      Function get_storms returns four variables:

      • Year – year of the cyclone.

      • Name – name of the cyclone.

      • Basin – basin the cyclone developed (AL for Atlantic, EP for east Pacific).

      • Link – URL to the cyclone's archive page.

      The variables Name and Link are the only variables that could potentially change. For example, you'll notice a Name value of Potential Tropical Cyclone Ten. If this storm became a tropical storm then it would receive a new name and the link to the archive page would change as well.

      For this example we will explore Hurricane Harvey.

      Text Products Current Data

      Once we have identified the storms we want to retrieve we can begin working on getting the products. In the earlier discussion of the available products, recall I used abbreviations such as discus, fstadv, etc. These are the terms we will use when obtaining data.

      The easiest method to getting storm data is the function get_storm_data. This function can take multiple storm archive URLs and return multiple datasets within a list.

      ds <- get_storms() %>% filter(Name == "Hurricane Harvey") %>% pull(Link) %>% get_storm_data(products = c("discus", "fstadv"))

      This process may take some time (particularly, fstadv products). This is because the NHC website allows no more than 80 connections every 10 seconds. rrricanes processes four links every half second.

      rrricanes uses the dplyr progress bar to keep you informed of the status. You can turn this off by setting option dplyr.show_progress to FALSE.

      An additional option is rrricanes.working_msg; FALSE by default. This option will show a message for each advisory currently being worked. I primarily added it to help find products causing problems but you may find it useful at some point.
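      For example, both options can be set up front:

      options(dplyr.show_progress = FALSE)   # hide the progress bar
      options(rrricanes.working_msg = TRUE)  # show a message for each advisory being worked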

      At this point, we have a list – ds – of dataframes. Each dataframe is named after the product.

      names(ds)
      ## [1] "discus" "fstadv"

      discus is one of the products that isn't parsed; the full text of the product is returned.

      str(ds$discus) ## Classes 'tbl_df', 'tbl' and 'data.frame': 43 obs. of 6 variables: ## $ Status : chr "Potential Tropical Cyclone" "Tropical Storm" "Tropical Storm" "Tropical Storm" ... ## $ Name : chr "Nine" "Harvey" "Harvey" "Harvey" ... ## $ Adv : num 1 2 3 4 5 6 7 8 9 10 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 21:00:00" ... ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Contents: chr "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nPotential Tropical Cyclone Nine Discussion Number 1\nNWS National"| __truncated__ "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nTropical Storm Harvey Discussion Number 2\nNWS National Hurricane"| __truncated__ "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nTropical Storm Harvey Discussion Number 3\nNWS National Hurricane"| __truncated__ "\nZCZC MIATCDAT4 ALL\nTTAA00 KNHC DDHHMM\n\nTropical Storm Harvey Discussion Number 4\nNWS National Hurricane"| __truncated__ ...

      The fstadv dataframes, however, are parsed and contain the bulk of the information for the storm.

      str(ds$fstadv) ## Classes 'tbl_df', 'tbl' and 'data.frame': 43 obs. of 117 variables: ## $ Status : chr "Potential Tropical Cyclone" "Tropical Storm" "Tropical Storm" "Tropical Storm" ... ## $ Name : chr "Nine" "Harvey" "Harvey" "Harvey" ... ## $ Adv : num 1 2 3 4 5 6 7 8 9 10 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 21:00:00" ... ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Lat : num 13.1 13 13 13.1 13.1 13.4 13.7 13.8 13.9 14.1 ... ## $ Lon : num -54.1 -55.8 -57.4 -59.1 -61.3 -62.9 -64.1 -65.9 -68.1 -70 ... ## $ Wind : num 30 35 35 35 35 35 35 35 35 30 ... ## $ Gust : num 40 45 45 45 45 45 45 45 45 40 ... ## $ Pressure : num 1008 1004 1005 1004 1005 ... ## $ PosAcc : num 30 30 30 30 30 40 40 30 30 30 ... ## $ FwdDir : num 270 270 270 270 270 275 275 275 275 275 ... ## $ FwdSpeed : num 15 16 16 16 18 18 16 18 19 19 ... ## $ Eye : num NA NA NA NA NA NA NA NA NA NA ... ## $ SeasNE : num NA 60 60 60 75 150 60 60 45 NA ... ## $ SeasSE : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SeasSW : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SeasNW : num NA 60 60 60 60 60 45 60 45 NA ... ## $ NE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ SE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ SW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ NW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ NE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ SE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ SW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ NW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ NE34 : num NA 30 50 50 60 60 60 60 0 NA ... ## $ SE34 : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SW34 : num NA 0 0 0 0 0 0 0 0 NA ... ## $ NW34 : num NA 30 50 50 60 60 60 60 60 NA ... ## $ Hr12FcstDate : POSIXct, format: "2017-08-18 00:00:00" "2017-08-18 06:00:00" ... ## $ Hr12Lat : num 13.1 13.1 13.2 13.2 13.3 13.6 14 14 14.1 14.3 ... ## $ Hr12Lon : num -56.4 -58.3 -59.9 -61.7 -63.8 -65.7 -66.8 -68.7 -70.9 -73 ... ## $ Hr12Wind : num 30 35 35 35 35 35 35 35 35 30 ... ## $ Hr12Gust : num 40 45 45 45 45 45 45 45 45 40 ... ## $ Hr12NE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12SE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12SW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12NW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12NE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12SE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12SW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12NW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr12NE34 : num NA 30 50 50 60 60 60 60 60 NA ... ## $ Hr12SE34 : num NA 0 0 0 0 0 0 0 0 NA ... ## $ Hr12SW34 : num NA 0 0 0 0 0 0 0 0 NA ... ## $ Hr12NW34 : num NA 30 50 50 60 60 60 60 60 NA ... ## $ Hr24FcstDate : POSIXct, format: "2017-08-18 12:00:00" "2017-08-18 18:00:00" ... ## $ Hr24Lat : num 13.2 13.4 13.6 13.5 13.6 13.9 14.3 14.3 14.4 14.6 ... ## $ Hr24Lon : num -59.8 -61.6 -63.3 -65.2 -67.3 -69.3 -70.4 -72.7 -74.9 -77 ... ## $ Hr24Wind : num 35 40 40 40 40 40 40 40 40 35 ... ## $ Hr24Gust : num 45 50 50 50 50 50 50 50 50 45 ... ## $ Hr24NE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24SE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24SW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24NW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24NE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24SE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24SW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24NW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr24NE34 : num 50 40 50 50 60 60 60 60 60 60 ... 
## $ Hr24SE34 : num 0 0 0 0 30 0 0 0 0 0 ... ## $ Hr24SW34 : num 0 0 0 0 30 0 0 0 0 0 ... ## $ Hr24NW34 : num 50 40 50 50 60 60 60 60 60 60 ... ## $ Hr36FcstDate : POSIXct, format: "2017-08-19 00:00:00" "2017-08-19 06:00:00" ... ## $ Hr36Lat : num 13.5 13.7 13.9 13.9 14 14.2 14.5 14.6 14.9 15.2 ... ## $ Hr36Lon : num -63.2 -65.1 -67 -68.8 -71.1 -73 -74.3 -76.7 -78.7 -80.5 ... ## $ Hr36Wind : num 40 45 45 40 40 40 40 45 45 40 ... ## $ Hr36Gust : num 50 55 55 50 50 50 50 55 55 50 ... ## $ Hr36NE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36SE64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36SW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36NW64 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36NE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36SE50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36SW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36NW50 : num NA NA NA NA NA NA NA NA NA NA ... ## $ Hr36NE34 : num 60 60 60 60 60 60 60 70 70 60 ... ## $ Hr36SE34 : num 0 30 30 30 30 0 0 0 0 0 ... ## $ Hr36SW34 : num 0 30 30 30 30 0 0 0 0 0 ... ## $ Hr36NW34 : num 60 60 60 60 60 60 60 70 70 60 ... ## $ Hr48FcstDate : POSIXct, format: "2017-08-19 12:00:00" "2017-08-19 18:00:00" ... ## $ Hr48Lat : num 13.9 14 14.1 14.1 14.3 14.5 14.8 15.2 15.7 16 ... ## $ Hr48Lon : num -66.7 -68.8 -70.9 -72.7 -75 -76.7 -78.1 -80.1 -82.4 -83.8 ... ## $ Hr48Wind : num 45 45 50 50 50 45 45 50 50 45 ... ## $ Hr48Gust : num 55 55 60 60 60 55 55 60 60 55 ... ## $ Hr48NE50 : num NA NA 30 30 30 NA NA 30 30 NA ... ## $ Hr48SE50 : num NA NA 0 0 0 NA NA 0 0 NA ... ## $ Hr48SW50 : num NA NA 0 0 0 NA NA 0 0 NA ... ## $ Hr48NW50 : num NA NA 30 30 30 NA NA 30 30 NA ... ## $ Hr48NE34 : num 60 60 60 60 70 60 60 90 90 70 ... ## $ Hr48SE34 : num 30 30 30 30 30 30 0 50 50 0 ... ## $ Hr48SW34 : num 30 30 30 30 30 30 0 40 40 0 ... ## $ Hr48NW34 : num 60 60 60 60 70 60 60 70 70 70 ... ## $ Hr72FcstDate : POSIXct, format: "2017-08-20 12:00:00" "2017-08-20 18:00:00" ... ## $ Hr72Lat : num 14.5 14.5 14.8 15 15 15.5 16.5 17 17.5 18 ... ## $ Hr72Lon : num -74.5 -76.5 -78.6 -80.2 -82 -83.5 -84.7 -86.5 -88 -89 ... ## $ Hr72Wind : num 55 55 60 60 60 60 55 55 60 45 ... ## $ Hr72Gust : num 65 65 75 75 75 75 65 65 75 55 ... ## [list output truncated]

      Each product can also be accessed on its own. For example, if you only wish to view discus products, use the get_discus function. fstadv products can be accessed with get_fstadv. Every product's specific function is prefixed with get_.

      To understand the variable definitions, access the help file for each of these functions (i.e., ?get_fstadv). They contain full definitions on the variables and their purpose.

      As you can see, the fstadv dataframe is very wide. There may be instances you only want to focus on specific pieces of the product. I've developed tidy functions to help trim these datasets:

      • tidy_fcst

      • tidy_fcst_wr

      • tidy_fstadv

      • tidy_wr

      These datasets exist in rrricanesdata as fcst, fcst_wr, adv, and wr, respectively (see below).

      Most tropical cyclone forecast/advisory products will contain multiple forecast points. Initially, only three-day forecasts were issued. Beginning with the 2003 season, 96- and 120-hour (five-day) forecasts were issued.

      If a storm is not expected to survive the full forecast period, then only relevant forecasts will be issued.

      We use tidy_fcst to return these forecast points in a tidy fashion from fstadv.

      str(tidy_fcst(ds$fstadv)) ## Classes 'tbl_df', 'tbl' and 'data.frame': 283 obs. of 8 variables: ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Adv : num 1 1 1 1 1 1 1 2 2 2 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 15:00:00" ... ## $ FcstDate: POSIXct, format: "2017-08-18 00:00:00" "2017-08-18 12:00:00" ... ## $ Lat : num 13.1 13.2 13.5 13.9 14.5 15.5 17 13.1 13.4 13.7 ... ## $ Lon : num -56.4 -59.8 -63.2 -66.7 -74.5 -82 -87.5 -58.3 -61.6 -65.1 ... ## $ Wind : num 30 35 40 45 55 65 65 35 40 45 ... ## $ Gust : num 40 45 50 55 65 80 80 45 50 55 ...

      Wind radius values are issued with parameters of 34, 50 and 64. These values are the radius to which minimum one-minute sustained winds can be expected or exist.

      A tropical depression will not have associated wind radius values since the maximum winds of a depression are 30 knots. If a tropical storm has winds less than 50 knots, then it will only have wind radius values for the 34-knot wind field. If winds are greater than 50 knots, then it will have wind radius values for 34 and 50 knot winds. A hurricane will have all wind radius fields.

      Wind radius values are further separated by quadrant: NE (northeast), SE, SW and NW. Not all quadrants will have values, particularly if the cyclone is struggling to organize. For example, you will often find that a minimal hurricane only has hurricane-force winds (64 knots) in the northeast quadrant.

      When appropriate, a forecast/advisory product will contain these values for the current position and for each forecast position. Use tidy_wr and tidy_fcst_wr, respectively, for these variables.

      str(tidy_wr(ds$fstadv)) ## Classes 'tbl_df', 'tbl' and 'data.frame': 56 obs. of 8 variables: ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Adv : num 2 3 4 5 6 7 8 9 15 16 ... ## $ Date : POSIXct, format: "2017-08-17 21:00:00" "2017-08-18 03:00:00" ... ## $ WindField: num 34 34 34 34 34 34 34 34 34 34 ... ## $ NE : num 30 50 50 60 60 60 60 0 100 80 ... ## $ SE : num 0 0 0 0 0 0 0 0 0 30 ... ## $ SW : num 0 0 0 0 0 0 0 0 0 20 ... ## $ NW : num 30 50 50 60 60 60 60 60 50 50 ... str(tidy_fcst_wr(ds$fstadv)) ## Classes 'tbl_df', 'tbl' and 'data.frame': 246 obs. of 9 variables: ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Adv : num 1 1 1 1 1 2 2 2 2 2 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 15:00:00" ... ## $ FcstDate : POSIXct, format: "2017-08-18 12:00:00" "2017-08-19 00:00:00" ... ## $ WindField: num 34 34 34 34 50 34 34 34 34 34 ... ## $ NE : num 50 60 60 80 30 30 40 60 60 80 ... ## $ SE : num 0 0 30 40 0 0 0 30 30 40 ... ## $ SW : num 0 0 30 40 0 0 0 30 30 40 ... ## $ NW : num 50 60 60 80 30 30 40 60 60 80 ...

      Lastly, you may only want to focus on current storm details. For this, we use tidy_fstadv:

      str(tidy_fstadv(ds$fstadv)) ## Classes 'tbl_df', 'tbl' and 'data.frame': 43 obs. of 18 variables: ## $ Key : chr "AL092017" "AL092017" "AL092017" "AL092017" ... ## $ Adv : num 1 2 3 4 5 6 7 8 9 10 ... ## $ Date : POSIXct, format: "2017-08-17 15:00:00" "2017-08-17 21:00:00" ... ## $ Status : chr "Potential Tropical Cyclone" "Tropical Storm" "Tropical Storm" "Tropical Storm" ... ## $ Name : chr "Nine" "Harvey" "Harvey" "Harvey" ... ## $ Lat : num 13.1 13 13 13.1 13.1 13.4 13.7 13.8 13.9 14.1 ... ## $ Lon : num -54.1 -55.8 -57.4 -59.1 -61.3 -62.9 -64.1 -65.9 -68.1 -70 ... ## $ Wind : num 30 35 35 35 35 35 35 35 35 30 ... ## $ Gust : num 40 45 45 45 45 45 45 45 45 40 ... ## $ Pressure: num 1008 1004 1005 1004 1005 ... ## $ PosAcc : num 30 30 30 30 30 40 40 30 30 30 ... ## $ FwdDir : num 270 270 270 270 270 275 275 275 275 275 ... ## $ FwdSpeed: num 15 16 16 16 18 18 16 18 19 19 ... ## $ Eye : num NA NA NA NA NA NA NA NA NA NA ... ## $ SeasNE : num NA 60 60 60 75 150 60 60 45 NA ... ## $ SeasSE : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SeasSW : num NA 0 0 0 0 0 0 0 0 NA ... ## $ SeasNW : num NA 60 60 60 60 60 45 60 45 NA ...

      In release 0.2.1, tidy_fstadv will be renamed to tidy_adv.

      One final note on the data: all speed variables are measured in knots, distance variables in nautical miles, and pressure variables in millibars. Functions knots_to_mph and mb_to_in are available for speed/pressure conversions. Function nm_to_sm to convert nautical miles to survey miles will be included in release 0.2.1.
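      As a small sketch of those conversions on the current-position data (dplyr is already loaded above; the new column names are just illustrative):

      ds$fstadv %>%
        mutate(Wind_mph    = knots_to_mph(Wind),
               Pressure_in = mb_to_in(Pressure)) %>%
        select(Name, Adv, Date, Wind, Wind_mph, Pressure, Pressure_in)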

      Archived Data

      rrricanesdata was built to make it easier to get pre-processed datasets. As mentioned earlier, rrricanesdata will be updated the first of every month if any advisory was issued for the previous month. (As I am now writing this portion in September, all of Hurricane Harvey's advisories – the last one issued the morning of August 31 – exist in rrricanesdata release 0.0.1.4.)

      As with rrricanes, rrricanesdata is not available in CRAN (nor will it be, due to size limitations).

      I'll load all datasets with the call:

      library(rrricanesdata)
      data(list = data(package = "rrricanesdata")$results[,3])

      All core product datasets are available. The dataframes adv, fcst, fcst_wr and wr are the dataframes created by tidy_fstadv, tidy_fcst, tidy_fcst_wr and tidy_wr, respectively.
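      As a sketch of using Key to tie two of them together (joining on Key, Adv and Date is my assumption about the natural keys here), the current storm details can be matched with their wind radii:

      adv %>%
        left_join(wr, by = c("Key", "Adv", "Date"))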

      Tracking Charts

      rrricanes also comes with helper functions to quickly generate tracking charts. These charts use rnaturalearthdata (for high resolution maps, use package rnaturalearthhires). These charts are not required – as Bob Rudis demonstrates succinctly – so feel free to experiment.

      You can generate a default plot for the entire globe with tracking_chart:

      tracking_chart()

      You may find this handy when examining cyclones that cross basins (from the Atlantic to east Pacific such as Hurricane Otto, 2016).

      tracking_chart takes three parameters (in addition to dots for other ggplot calls):

      • countries – By default, show country borders

      • states – By default, show state borders

      • res – resolution; default is 110nm.

      We do not see countries and states in the map above because of the ggplot defaults. Let's try it again:

      tracking_chart(color = "black", size = 0.1, fill = "white")

      We can "zoom in" on each basin with helper functions al_tracking_chart and ep_tracking_chart:

      al_tracking_chart(color = "black", size = 0.1, fill = "white")

      ep_tracking_chart(color = "black", size = 0.1, fill = "white")

      GIS Data

      GIS data exists for some cyclones and varies by year. This is a relatively new archive by the NHC and is inconsistent from storm to storm.

      The "gis" functions are as follows:

      • gis_advisory

      • gis_latest

      • gis_prob_storm_surge

      • gis_windfield

      • gis_wsp

      Another area of inconsistency with these products is how they are organized. For example, gis_advisory, gis_prob_storm_surge and gis_windfield can be retrieved with a storm Key (unique identifier for every cyclone; see fstadv$Key). Except for gis_prob_storm_surge, you can even pass an advisory number (see fstadv$Adv).

      gis_wsp requires a datetime value; to access a specific GIS package for a storm's advisory you would need to use a variable such as fstadv$Date, subtract three hours and convert to "%Y%m%d%H" format ("%m", "%d", and "%H" are optional).
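      A sketch of building that value from an fstadv Date; whether the parameter is literally named datetime is an assumption on my part, so check ?gis_wsp:

      d <- ds$fstadv %>% filter(Adv == 19) %>% pull(Date)   # advisory issue time (UTC)
      gis_wsp(datetime = strftime(d - 3 * 60 * 60, "%Y%m%d%H", tz = "UTC"))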

      All of the above functions only return URLs to their respective datasets. This was done to allow you to validate the quantity of datasets you wish to retrieve since, in some cases, the dataset may not exist at all or there may be several available. Use gis_download with the requested URLs to retrieve your datasets.

      Let's go through each of these. First, let's get the Key of Hurricane Harvey:

      # Remember that ds already and only contains data for Hurricane Harvey
      key <- ds$fstadv %>% pull(Key) %>% first()

      gis_advisory

      gis_advisory returns a dataset package containing past and forecast plot points and lines, a forecast cone (area representing where the cyclone could track), wind radius data and current watches and warnings.

      gis_advisory takes two parameters:

      • Key

      • advisory (optional)

      If we leave out advisory we get all related datasets for Hurricane Harvey:

      x <- gis_advisory(key = key)
      length(x)
      ## [1] 77
      head(x, n = 5L)
      ## [1] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_001.zip"
      ## [2] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_001A.zip"
      ## [3] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_002.zip"
      ## [4] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_002A.zip"
      ## [5] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_003.zip"

      As you can see, there is quite a bit (and why the core gis functions only return URLs rather than the actual datasets). Let's trim this down a bit. Sneaking a peek (cheating) I find advisory 19 seems a good choice.

      gis_advisory(key = key, advisory = 19)
      ## [1] "http://www.nhc.noaa.gov/gis/forecast/archive/al092017_5day_019.zip"

      Good; there is a data package available for this advisory. Once you have confirmed the package you want to retrieve, use gis_download to get the data.

      gis <- gis_advisory(key = key, advisory = 19) %>% gis_download()
      ## OGR data source with driver: ESRI Shapefile
      ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_5day_lin"
      ## with 1 features
      ## It has 7 fields
      ## OGR data source with driver: ESRI Shapefile
      ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_5day_pgn"
      ## with 1 features
      ## It has 7 fields
      ## OGR data source with driver: ESRI Shapefile
      ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_5day_pts"
      ## with 8 features
      ## It has 23 fields
      ## OGR data source with driver: ESRI Shapefile
      ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017-019_ww_wwlin"
      ## with 5 features
      ## It has 8 fields

      Let's see what we have.

      str(gis) ## List of 4 ## $ al092017_019_5day_lin:Formal class 'SpatialLinesDataFrame' [package "sp"] with 4 slots ## .. ..@ data :'data.frame': 1 obs. of 7 variables: ## .. .. ..$ STORMNAME: chr "Harvey" ## .. .. ..$ STORMTYPE: chr "HU" ## .. .. ..$ ADVDATE : chr "1000 PM CDT Thu Aug 24 2017" ## .. .. ..$ ADVISNUM : chr "19" ## .. .. ..$ STORMNUM : num 9 ## .. .. ..$ FCSTPRD : num 120 ## .. .. ..$ BASIN : chr "AL" ## .. ..@ lines :List of 1 ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:8, 1:2] -94.6 -95.6 -96.5 -97.1 -97.3 -97.5 -97 -95 25.2 26.1 ... ## .. .. .. .. ..@ ID : chr "0" ## .. ..@ bbox : num [1:2, 1:2] -97.5 25.2 -94.6 29.5 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "x" "y" ## .. .. .. ..$ : chr [1:2] "min" "max" ## .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot ## .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs" ## $ al092017_019_5day_pgn:Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots ## .. ..@ data :'data.frame': 1 obs. of 7 variables: ## .. .. ..$ STORMNAME: chr "Harvey" ## .. .. ..$ STORMTYPE: chr "HU" ## .. .. ..$ ADVDATE : chr "1000 PM CDT Thu Aug 24 2017" ## .. .. ..$ ADVISNUM : chr "19" ## .. .. ..$ STORMNUM : num 9 ## .. .. ..$ FCSTPRD : num 120 ## .. .. ..$ BASIN : chr "AL" ## .. ..@ polygons :List of 1 ## .. .. ..$ :Formal class 'Polygons' [package "sp"] with 5 slots ## .. .. .. .. ..@ Polygons :List of 1 ## .. .. .. .. .. ..$ :Formal class 'Polygon' [package "sp"] with 5 slots ## .. .. .. .. .. .. .. ..@ labpt : num [1:2] -95.4 29.2 ## .. .. .. .. .. .. .. ..@ area : num 51.7 ## .. .. .. .. .. .. .. ..@ hole : logi FALSE ## .. .. .. .. .. .. .. ..@ ringDir: int 1 ## .. .. .. .. .. .. .. ..@ coords : num [1:1482, 1:2] -94.6 -94.7 -94.7 -94.7 -94.7 ... ## .. .. .. .. ..@ plotOrder: int 1 ## .. .. .. .. ..@ labpt : num [1:2] -95.4 29.2 ## .. .. .. .. ..@ ID : chr "0" ## .. .. .. .. ..@ area : num 51.7 ## .. ..@ plotOrder : int 1 ## .. ..@ bbox : num [1:2, 1:2] -100 24.9 -91 33 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "x" "y" ## .. .. .. ..$ : chr [1:2] "min" "max" ## .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot ## .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs" ## $ al092017_019_5day_pts:Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots ## .. ..@ data :'data.frame': 8 obs. of 23 variables: ## .. .. ..$ ADVDATE : chr [1:8] "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" ... ## .. .. ..$ ADVISNUM : chr [1:8] "19" "19" "19" "19" ... ## .. .. ..$ BASIN : chr [1:8] "AL" "AL" "AL" "AL" ... ## .. .. ..$ DATELBL : chr [1:8] "10:00 PM Thu" "7:00 AM Fri" "7:00 PM Fri" "7:00 AM Sat" ... ## .. .. ..$ DVLBL : chr [1:8] "H" "H" "M" "M" ... ## .. .. ..$ FCSTPRD : num [1:8] 120 120 120 120 120 120 120 120 ## .. .. ..$ FLDATELBL: chr [1:8] "2017-08-24 7:00 PM Thu CDT" "2017-08-25 7:00 AM Fri CDT" "2017-08-25 7:00 PM Fri CDT" "2017-08-26 7:00 AM Sat CDT" ... ## .. .. ..$ GUST : num [1:8] 90 115 135 120 85 45 45 45 ## .. .. ..$ LAT : num [1:8] 25.2 26.1 27.2 28.1 28.6 28.5 28.5 29.5 ## .. .. ..$ LON : num [1:8] -94.6 -95.6 -96.5 -97.1 -97.3 -97.5 -97 -95 ## .. .. ..$ MAXWIND : num [1:8] 75 95 110 100 70 35 35 35 ## .. .. ..$ MSLP : num [1:8] 973 9999 9999 9999 9999 ... 
## .. .. ..$ SSNUM : num [1:8] 1 2 3 3 1 0 0 0 ## .. .. ..$ STORMNAME: chr [1:8] "Hurricane Harvey" "Hurricane Harvey" "Hurricane Harvey" "Hurricane Harvey" ... ## .. .. ..$ STORMNUM : num [1:8] 9 9 9 9 9 9 9 9 ## .. .. ..$ STORMSRC : chr [1:8] "Tropical Cyclone" "Tropical Cyclone" "Tropical Cyclone" "Tropical Cyclone" ... ## .. .. ..$ STORMTYPE: chr [1:8] "HU" "HU" "MH" "MH" ... ## .. .. ..$ TCDVLP : chr [1:8] "Hurricane" "Hurricane" "Major Hurricane" "Major Hurricane" ... ## .. .. ..$ TAU : num [1:8] 0 12 24 36 48 72 96 120 ## .. .. ..$ TCDIR : num [1:8] 315 9999 9999 9999 9999 ... ## .. .. ..$ TCSPD : num [1:8] 9 9999 9999 9999 9999 ... ## .. .. ..$ TIMEZONE : chr [1:8] "CDT" "CDT" "CDT" "CDT" ... ## .. .. ..$ VALIDTIME: chr [1:8] "25/0000" "25/1200" "26/0000" "26/1200" ... ## .. ..@ coords.nrs : num(0) ## .. ..@ coords : num [1:8, 1:2] -94.6 -95.6 -96.5 -97.1 -97.3 -97.5 -97 -95 25.2 26.1 ... ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : NULL ## .. .. .. ..$ : chr [1:2] "coords.x1" "coords.x2" ## .. ..@ bbox : num [1:2, 1:2] -97.5 25.2 -94.6 29.5 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "coords.x1" "coords.x2" ## .. .. .. ..$ : chr [1:2] "min" "max" ## .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot ## .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs" ## $ al092017_019_ww_wwlin:Formal class 'SpatialLinesDataFrame' [package "sp"] with 4 slots ## .. ..@ data :'data.frame': 5 obs. of 8 variables: ## .. .. ..$ STORMNAME: chr [1:5] "Harvey" "Harvey" "Harvey" "Harvey" ... ## .. .. ..$ STORMTYPE: chr [1:5] "HU" "HU" "HU" "HU" ... ## .. .. ..$ ADVDATE : chr [1:5] "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" "1000 PM CDT Thu Aug 24 2017" ... ## .. .. ..$ ADVISNUM : chr [1:5] "19" "19" "19" "19" ... ## .. .. ..$ STORMNUM : num [1:5] 9 9 9 9 9 ## .. .. ..$ FCSTPRD : num [1:5] 120 120 120 120 120 ## .. .. ..$ BASIN : chr [1:5] "AL" "AL" "AL" "AL" ... ## .. .. ..$ TCWW : chr [1:5] "TWA" "HWA" "TWR" "TWR" ... ## .. ..@ lines :List of 5 ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:3, 1:2] -97.7 -97.4 -97.2 24.3 25.2 ... ## .. .. .. .. ..@ ID : chr "0" ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:3, 1:2] -97.2 -97.2 -97.3 26 26.1 ... ## .. .. .. .. ..@ ID : chr "1" ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:3, 1:2] -97.2 -97.2 -97.3 26 26.1 ... ## .. .. .. .. ..@ ID : chr "2" ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:10, 1:2] -95.6 -95.3 -95.1 -95.1 -94.8 ... ## .. .. .. .. ..@ ID : chr "3" ## .. .. ..$ :Formal class 'Lines' [package "sp"] with 2 slots ## .. .. .. .. ..@ Lines:List of 1 ## .. .. .. .. .. ..$ :Formal class 'Line' [package "sp"] with 1 slot ## .. .. .. .. .. .. .. ..@ coords: num [1:16, 1:2] -97.3 -97.3 -97.3 -97.4 -97.4 ... ## .. .. .. .. ..@ ID : chr "4" ## .. 
..@ bbox : num [1:2, 1:2] -97.7 24.3 -94.4 29.8 ## .. .. ..- attr(*, "dimnames")=List of 2 ## .. .. .. ..$ : chr [1:2] "x" "y" ## .. .. .. ..$ : chr [1:2] "min" "max" ## .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot ## .. .. .. ..@ projargs: chr "+proj=longlat +a=6371200 +b=6371200 +no_defs"

      We get four spatial dataframes: the forecast track (lines), the forecast cone (a polygon), the forecast points, and the watch/warning lines.

      names(gis)
      ## [1] "al092017_019_5day_lin" "al092017_019_5day_pgn" "al092017_019_5day_pts"
      ## [4] "al092017_019_ww_wwlin"

      With the exception of point spatial dataframes (which can be converted to a dataframe using tibble::as_data_frame), use the helper function shp_to_df to convert the spatial dataframes to dataframes.

      Forecast Track

      library(ggplot2)
      al_tracking_chart(color = "black", size = 0.1, fill = "white") +
        geom_path(data = shp_to_df(gis$al092017_019_5day_lin), aes(x = long, y = lat))

      Use geom_path instead of geom_line to keep the positions in order.

      You can "zoom in" even further using ggplot2::coord_equal. For that, we need to know the limits of our objects (minimum and maximum latitude and longitude) or bounding box. Thankfully, the sp package can get us this information with the bbox function.

      But, we don't want to use the "al092017_019_5day_lin" dataset. Our gis dataset contains a forecast cone which expands well beyond the lines dataset. Take a look:

      sp::bbox(gis$al092017_019_5day_lin)
      ##     min   max
      ## x -97.5 -94.6
      ## y  25.2  29.5
      sp::bbox(gis$al092017_019_5day_pgn)
      ##          min       max
      ## x -100.01842 -90.96327
      ## y   24.86433  33.01644

      So, let's get the bounding box of our forecast cone dataset and zoom in on our map.

      bb <- sp::bbox(gis$al092017_019_5day_pgn)
      al_tracking_chart(color = "black", size = 0.1, fill = "white") +
        geom_path(data = shp_to_df(gis$al092017_019_5day_lin), aes(x = long, y = lat)) +
        coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

      That's much better. For simplicity I'm going to save the base map, bp, without the line plot.

      bp <- al_tracking_chart(color = "black", size = 0.1, fill = "white") +
        coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

      Forecast Points

      Forecast points identify each forecast position along with forecast winds and date. Remember that for point spatial dataframes you use tibble::as_data_frame rather than shp_to_df.

      bp + geom_point(data = tibble::as_data_frame(gis$al092017_019_5day_pts), aes(x = long, y = lat))

      If you ran the code above you would get an error.

      Error in FUN(X[[i]], ...) : object 'long' not found

      Why? The variable long does not exist as it does in other GIS datasets; it is lon. This is one of the inconsistencies I was referring to previously. Additionally, the variables are all uppercase.

      names(gis$al092017_019_5day_pts)
      ##  [1] "ADVDATE"   "ADVISNUM"  "BASIN"     "DATELBL"   "DVLBL"
      ##  [6] "FCSTPRD"   "FLDATELBL" "GUST"      "LAT"       "LON"
      ## [11] "MAXWIND"   "MSLP"      "SSNUM"     "STORMNAME" "STORMNUM"
      ## [16] "STORMSRC"  "STORMTYPE" "TCDVLP"    "TAU"       "TCDIR"
      ## [21] "TCSPD"     "TIMEZONE"  "VALIDTIME"

      Let's try it again.

      bp + geom_point(data = tibble::as_data_frame(gis$al092017_019_5day_pts), aes(x = LON, y = LAT))

      Better.

      Forecast Cone

      A forecast cone identifies the probability of error in a forecast. Forecasting tropical cyclones is tricky business – errors increase the further out a forecast is issued. Theoretically, any area within a forecast cone is at risk of seeing cyclone conditions within the given period of time.

      Generally, a forecast cone package contains two subsets: 72-hour forecast cone and 120-hour forecast cone. This is identified in the dataset under the variable FCSTPRD. Let's take a look at the 72-hour forecast period:

      bp + geom_polygon(data = shp_to_df(gis$al092017_019_5day_pgn) %>% filter(FCSTPRD == 72), aes(x = long, y = lat, color = FCSTPRD))

      Nothing there!

      As mentioned earlier, these are experimental products issued by the NHC and they do contain inconsistencies. To demonstrate, I'll use Hurricane Ike advisory 42.

      df <- gis_advisory(key = "AL092008", advisory = 42) %>% gis_download()
      ## OGR data source with driver: ESRI Shapefile
      ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_5day_lin"
      ## with 2 features
      ## It has 9 fields
      ## OGR data source with driver: ESRI Shapefile
      ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_5day_pgn"
      ## with 2 features
      ## It has 9 fields
      ## OGR data source with driver: ESRI Shapefile
      ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_5day_pts"
      ## with 13 features
      ## It has 20 fields
      ## OGR data source with driver: ESRI Shapefile
      ## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092008.042_ww_wwlin"
      ## with 5 features
      ## It has 10 fields

      al_tracking_chart(color = "black", size = 0.1, fill = "white") +
        geom_polygon(data = shp_to_df(df$al092008_042_5day_pgn) %>% filter(FCSTPRD == 72),
                     aes(x = long, y = lat))

      We do, however, have a 120-hour forecast cone for Hurricane Harvey. Let's go ahead and plot that.

      bp + geom_polygon(data = gis$al092017_019_5day_pgn, aes(x = long, y = lat), alpha = 0.15)

      It's an odd-looking forecast cone, for sure. But this demonstrates the entire area that Harvey could have potentially traveled.

      Watches and Warnings

      Our last dataset in this package is "al092017_019_ww_wwlin". These are the current watches and warnings in effect. This is a spatial lines dataframe that needs shp_to_df. Again, we use geom_path instead of geom_line, and we want to group our paths by the variable TCWW.

      bp + geom_path(data = shp_to_df(gis$al092017_019_ww_wwlin), aes(x = long, y = lat, group = group, color = TCWW), size = 1)

      The paths won't follow our coastlines exactly but you get the idea. The abbreviations don't really give much information, either. Convert TCWW to factor and provide better labels for your legend.

ww_wlin <- shp_to_df(gis$al092017_019_ww_wwlin)
ww_wlin$TCWW <- factor(ww_wlin$TCWW,
                       levels = c("TWA", "TWR", "HWA", "HWR"),
                       labels = c("Tropical Storm Watch", "Tropical Storm Warning",
                                  "Hurricane Watch", "Hurricane Warning"))
bp + geom_path(data = ww_wlin, aes(x = long, y = lat, group = group, color = TCWW), size = 1)

See Forecast/Advisory GIS on the rrricanes website for an example of putting all of this data together in one map.

      gis_prob_storm_surge

We can also plot the probabilistic storm surge for given locations. Again, you will need the storm Key for this function. There are two additional parameters:

      • products

      • datetime

      products can be one or both of "esurge" and "psurge". esurge shows the probability of the cyclone exceeding the given storm surge plus tide within a given forecast period. psurge shows the probability of a given storm surge within a specified forecast period.

      One or more products may not exist depending on the cyclone and advisory.

      The products parameter expects a list of values for each product. For esurge products, valid values are 10, 20, 30, 40 or 50. For psurge products, valid values are 0, 1, 2, …, 20.
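For example, a products list requesting a couple of values from each product type might look like the sketch below (whether each product actually exists still depends on the cyclone and advisory):

# A sketch of building the products list; availability still depends on the
# cyclone and advisory, so the call may return fewer URLs than requested.
products <- list(
  "esurge" = c(10, 30, 50),  # valid esurge values: 10, 20, 30, 40 or 50
  "psurge" = 0:5             # valid psurge values: 0 through 20
)
gis_prob_storm_surge(key = key, products = products)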

      Let's see if any esurge products exist for Harvey.

length(gis_prob_storm_surge(key = key, products = list("esurge" = seq(10, 50, by = 10))))
## [1] 150

      And psurge:

length(gis_prob_storm_surge(key = key, products = list("psurge" = 0:20)))
## [1] 630

      So, we have access to a ton of data here. When discussing gis_advisory, we were able to filter by advisory number. With gis_prob_storm_surge, this is not an option; we have to use the datetime parameter to filter. Let's find the Date for advisory 19.

(d <- ds$fstadv %>% filter(Adv == 19) %>% pull(Date))
## [1] "2017-08-25 03:00:00 UTC"

esurge

Now, let's view all esurge products for the date only (excluding the time).

gis_prob_storm_surge(key = key,
                     products = list("esurge" = seq(10, 50, by = 10)),
                     datetime = strftime(d, "%Y%m%d", tz = "UTC"))
##  [1] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082500.zip"
##  [2] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082506.zip"
##  [3] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082512.zip"
##  [4] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082518.zip"
##  [5] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082500.zip"
##  [6] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082506.zip"
##  [7] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082512.zip"
##  [8] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082518.zip"
##  [9] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082500.zip"
## [10] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082506.zip"
## [11] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082512.zip"
## [12] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082518.zip"
## [13] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082500.zip"
## [14] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082506.zip"
## [15] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082512.zip"
## [16] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082518.zip"
## [17] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082500.zip"
## [18] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082506.zip"
## [19] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082512.zip"
## [20] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082518.zip"

That's still quite a bit. We can filter it further by adding the hour to the datetime parameter.

      gis_prob_storm_surge(key = key, products = list("esurge" = seq(10, 50, by = 10)), datetime = strftime(d, "%Y%m%d%H", tz = "UTC"))

      This call will give you an error:

      Error: No data available for requested storm/advisory

But this isn't entirely correct. When an advisory package is issued, it contains information for the release time. Some of the GIS datasets are based on the release time minus 3 hours. So, we need to subtract 3 hours from d.

Note: There is an additional value, not extracted as of the latest release, that records the position of the cyclone three hours prior. (As I understand it from the NHC, this is due to the time it takes to collect and prepare the data.) Per Issue #102, these values will be added in release 0.2.1. From then on, instead of subtracting three hours from the Date variable, you can simply use the PrevPosDate value for this function.
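As a sketch, assuming a release where PrevPosDate is available in the fstadv dataframe, the lookup could then be as simple as:

# Sketch assuming rrricanes 0.2.1+, where fstadv is expected to include a
# PrevPosDate column (the position time, three hours before the release time).
d_prev <- ds$fstadv %>% filter(Adv == 19) %>% pull(PrevPosDate)
gis_prob_storm_surge(key = key, products = list("esurge" = 50),
                     datetime = strftime(d_prev, "%Y%m%d%H", tz = "UTC"))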

      Let's try it again with the math:

gis_prob_storm_surge(key = key,
                     products = list("esurge" = seq(10, 50, by = 10)),
                     datetime = strftime(d - 60 * 60 * 3, "%Y%m%d%H", tz = "UTC"))
## [1] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge10_2017082500.zip"
## [2] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge20_2017082500.zip"
## [3] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge30_2017082500.zip"
## [4] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge40_2017082500.zip"
## [5] "http://www.nhc.noaa.gov/gis/storm_surge/al092017_esurge50_2017082500.zip"

      As I don't want to get all of these datasets, I'll limit my esurge to show surge values with at least a 50% chance of being exceeded:

gis <- gis_prob_storm_surge(key = key,
                            products = list("esurge" = 50),
                            datetime = strftime(d - 60 * 60 * 3, "%Y%m%d%H", tz = "UTC")) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082500_e50"
## with 97 features
## It has 2 fields

      This will bring us a spatial polygon dataframe.

df <- shp_to_df(gis$al092017_2017082500_e50)
bb <- sp::bbox(gis$al092017_2017082500_e50)
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 161313 obs. of 9 variables:
## $ long   : num -93.2 -93.2 -93.2 -93.2 -93.2 ...
## $ lat    : num 29.9 29.9 29.9 29.9 29.9 ...
## $ order  : int 1 2 3 4 5 6 7 8 9 10 ...
## $ hole   : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ piece  : Factor w/ 349 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group  : Factor w/ 7909 levels "0.1","0.2","0.3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ id     : chr "0" "0" "0" "0" ...
## $ POINTID: int 1 1 1 1 1 1 1 1 1 1 ...
## $ TCSRG50: num 0 0 0 0 0 0 0 0 0 0 ...

al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = df, aes(x = long, y = lat, group = group, color = TCSRG50)) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

      This plot tells us that, along the central Texas coast, the expected storm surge along with tides is greater than 7.5 feet and there is a 50% chance of this height being exceeded.

      psurge

      The psurge product gives us the probabilistic storm surge for a location within the given forecast period.

gis <- gis_prob_storm_surge(key = key,
                            products = list("psurge" = 20),
                            datetime = strftime(d - 60 * 60 * 3, "%Y%m%d%H", tz = "UTC")) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082500_gt20"
## with 12 features
## It has 2 fields

      This will bring us a spatial polygon dataframe.

df <- shp_to_df(gis$al092017_2017082500_gt20)
bb <- sp::bbox(gis$al092017_2017082500_gt20)
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3293 obs. of 9 variables:
## $ long     : num -96.8 -96.8 -96.8 -96.8 -96.8 ...
## $ lat      : num 28.5 28.5 28.4 28.4 28.4 ...
## $ order    : int 1 2 3 4 5 6 7 8 9 10 ...
## $ hole     : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ piece    : Factor w/ 54 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group    : Factor w/ 227 levels "0.1","0.2","0.3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ id       : chr "0" "0" "0" "0" ...
## $ POINTID  : int 1 1 1 1 1 1 1 1 1 1 ...
## $ PSurge20c: num 1 1 1 1 1 1 1 1 1 1 ...

al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = df, aes(x = long, y = lat, group = group, color = PSurge20c)) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

      This map shows the cumulative probability that a storm surge of greater than 20 feet will be seen within the highlighted regions.

This particular map doesn't help much as we've zoomed in too far. What may be of use is a list of probability stations as obtained from the NHC. For this, you can use al_prblty_stations (ep_prblty_stations returns FALSE since, as of this writing, the format is invalid).

      stations <- al_prblty_stations() al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = df, aes(x = long, y = lat, group = group, color = PSurge20c)) + geom_label(data = stations, aes(x = Lon, y = Lat, label = Location)) + coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

      gis_windfield

      When possible, there may also be a cyclone wind radius dataset for the current and forecast positions. With this function we can resort back to Key and an advisory number.

gis <- gis_windfield(key = key, advisory = 19) %>% gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082503_forecastradii"
## with 15 features
## It has 13 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "al092017_2017082503_initialradii"
## with 3 features
## It has 13 fields

names(gis)
## [1] "al092017_2017082503_forecastradii" "al092017_2017082503_initialradii"

      Let's get the bounding box and plot our initialradii dataset.

bb <- sp::bbox(gis$al092017_2017082503_initialradii)
al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = shp_to_df(gis$al092017_2017082503_initialradii),
               aes(x = long, y = lat, group = group, fill = factor(RADII)), alpha = 0.5) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

      And add the forecast wind radii data onto the chart (modifying bb):

bb <- sp::bbox(gis$al092017_2017082503_forecastradii)
al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = shp_to_df(gis$al092017_2017082503_initialradii),
               aes(x = long, y = lat, group = group, fill = factor(RADII)), alpha = 0.5) +
  geom_polygon(data = shp_to_df(gis$al092017_2017082503_forecastradii),
               aes(x = long, y = lat, group = group, fill = factor(RADII)), alpha = 0.5) +
  geom_label(data = stations, aes(x = Lon, y = Lat, label = Location)) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

      gis_wsp

      Our last GIS dataset is wind speed probabilities. This dataset is not storm specific nor even basin-specific; you may get results for cyclones halfway across the world.

      The two parameters needed are:

      • datetime – again, using the %Y%m%d%H format (not all values are required)

      • res – Resolution of the probabilities: 5 km, 0.5 degrees or 0.1 degrees.

      Wind fields are for 34, 50 and 64 knots. Not all resolutions or windfields will be available at a given time.

      Sticking with our variable d, let's first make sure there is a dataset that exists for that time.

gis_wsp(datetime = strftime(d - 60 * 60 * 3, format = "%Y%m%d%H", tz = "UTC"))
## [1] "http://www.nhc.noaa.gov/gis/"

      For this article, we'll stick to the higher resolution plot.

We need a temporary replacement for gis_wsp(); the function will be fixed in the package soon.

gis_wsp_2 <- function(datetime, res = c(5, 0.5, 0.1)) {
  res <- as.character(res)
  res <- stringr::str_replace(res, "^5$", "5km")
  res <- stringr::str_replace(res, "^0.5$", "halfDeg")
  res <- stringr::str_replace(res, "^0.1$", "tenthDeg")
  year <- stringr::str_sub(datetime, 0L, 4L)
  request <- httr::GET("http://www.nhc.noaa.gov/gis/archive_wsp.php",
                       query = list(year = year))
  contents <- httr::content(request, as = "parsed", encoding = "UTF-8")
  ds <- rvest::html_nodes(contents, xpath = "//a") %>%
    rvest::html_attr("href") %>%
    stringr::str_extract(".+\\.zip$") %>%
    .[stats::complete.cases(.)]
  if (nchar(datetime) < 10) {
    ptn_datetime <- paste0(datetime, "[:digit:]+")
  } else {
    ptn_datetime <- datetime
  }
  ptn_res <- paste(res, collapse = "|")
  ptn <- sprintf("%s_wsp_[:digit:]{1,3}hr(%s)", ptn_datetime, ptn_res)
  links <- ds[stringr::str_detect(ds, ptn)]
  links <- paste0("http://www.nhc.noaa.gov/gis/", links)
  return(links)
}

gis <- gis_wsp_2(
  datetime = strftime(d - 60 * 60 * 3, format = "%Y%m%d%H", tz = "UTC"),
  res = 5) %>%
  gis_download()
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "2017082500_wsp34knt120hr_5km"
## with 11 features
## It has 1 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "2017082500_wsp50knt120hr_5km"
## with 11 features
## It has 1 fields
## OGR data source with driver: ESRI Shapefile
## Source: "/var/folders/gs/4khph0xs0436gmd2gdnwsg080000gn/T//Rtmp1wYg2x", layer: "2017082500_wsp64knt120hr_5km"
## with 11 features
## It has 1 fields

      All of these datasets are spatial polygon dataframes. Again, we will need to convert to dataframe using shp_to_df.

      bb <- sp::bbox(gis$`2017082500_wsp34knt120hr_5km`)

      Examine the structure.

df <- shp_to_df(gis$`2017082500_wsp34knt120hr_5km`)
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 24182 obs. of 8 variables:
## $ long      : num -97.2 -97.2 -97.2 -97.3 -97.3 ...
## $ lat       : num 20.3 20.3 20.3 20.3 20.4 ...
## $ order     : int 1 2 3 4 5 6 7 8 9 10 ...
## $ hole      : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ piece     : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ group     : Factor w/ 52 levels "0.1","0.2","0.3",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ id        : chr "0" "0" "0" "0" ...
## $ PERCENTAGE: chr "<5%" "<5%" "<5%" "<5%" ...

al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = df, aes(x = long, y = lat, group = group, fill = PERCENTAGE), alpha = 0.25) +
  coord_equal(xlim = c(bb[1,1], bb[1,2]), ylim = c(bb[2,1], bb[2,2]))

      There aren't many ways we can narrow this down other than using arbitrary longitude values. The observations in the dataset do not have a variable identifying which storm each set of values belongs to. So, I'll remove the coord_equal call so we're only focused on the Atlantic basin.

      al_tracking_chart(color = "black", size = 0.1, fill = "white") + geom_polygon(data = df, aes(x = long, y = lat, group = group, fill = PERCENTAGE), alpha = 0.25)

      Of course, you can narrow it down further as you see fit.
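As a rough sketch, an arbitrary longitude cutoff (the value below is only an example) keeps the focus on the Atlantic side:

# Rough sketch: drop polygons well west of the Atlantic basin using an
# arbitrary longitude cutoff; adjust the value as needed, since the dataset
# carries no storm identifier.
df_atl <- df %>% dplyr::filter(long > -105)
al_tracking_chart(color = "black", size = 0.1, fill = "white") +
  geom_polygon(data = df_atl, aes(x = long, y = lat, group = group, fill = PERCENTAGE), alpha = 0.25)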

Do not confuse this GIS dataset with the wndprb product or the similar prblty products, both of which only identify probabilities for given locations.

      gis_latest

      For active cyclones, you can retrieve all available GIS datasets using gis_latest. Note that, unlike the previous GIS functions, this function will return a list of all GIS dataframes available.

      gis <- gis_latest()

Now we have a large list of GIS spatial dataframes. Two things to point out here. First, we now have a "windswath" GIS dataset. This dataset, to the best of my knowledge, does not exist on its own; therefore, no archived "windswath" datasets are available.

      Second, I have found this data fluctuates even from minute to minute. Earlier this year when attempting to develop automated reporting, I found the return value of the call would vary almost with every call.

Of course, that doesn't mean it is not valuable, which is why it has been included. You can easily perform checks for the specific data you are looking for; if it doesn't exist, bail and try again in a few minutes.
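A minimal sketch of that pattern, assuming the returned list is named the same way as gis_download() results, might look like this (the dataset pattern, wait time, and retry count are only examples):

# Sketch: look for a specific dataset among the names of the gis_latest()
# result and, if it is missing, wait and try again. The regex pattern and
# number of retries are arbitrary examples.
wanted <- "_5day_pgn$"
for (attempt in 1:3) {
  gis <- gis_latest()
  if (any(grepl(wanted, names(gis)))) break
  Sys.sleep(300)  # wait five minutes before retrying
}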

      Potential Issues Using rrricanes

      I cannot stress enough that rrricanes is not intended for use during emergency situations, as I myself learned during Hurricane Harvey. The package currently relies on the NHC website which, I truly believe, is curated by hand.

      The most common problems I've noticed are:

      • The NHC website being unable to load or slow to respond. This was a hassle in previous releases but seems to have been ironed out as of release 0.2.0.6. Nonetheless, there may be instances where response time is slow.

      • Incorrect storm archive links. For example, during Harvey the link to Harvey's archive page was listed incorrectly on the annual archives page. If I manually typed the link as it should have been, the storm's correct archive page would load.

      As I become more aware of potential problems, I will look for workarounds. I would be greatly appreciative of any problems being posted to the rrricanes repository.

      I will also post known issues beyond my control (such as NHC website issues) to Twitter using the #rrricanes hashtag.

      Future Plans

      The following data will be added to rrricanes as time allows:

      • Reconnaissance data (release 0.2.2)

      • Computer forecast model data (release 0.2.3)

      • Archived satellite images (tentative)

      • Ship and buoy data (tentative)

      Reconnaissance data itself will be a massive project. There are numerous types of products. And, as advisory product formats have changed over the years, so have these. Any help in this task would be tremendously appreciated!

      Some computer forecast models are in the public domain and can certainly be of tremendous use. Some are difficult to find (especially archived). The caution in this area is that many websites post this data but may have limitations on how it can be accessed.

      Additionally, data may be added as deemed fitting.

      Contribute

      Anyone is more than welcome to contribute to the package. I would definitely appreciate any help. See Contributions for more information.

      I would ask that you follow the Tidyverse style guide. Release 0.2.1 will fully incorporate these rules.

      You do not need to submit code in order to be listed as a contributor. If there is a data source (that can legally be scraped) that you feel should be added, please feel free to submit a request. Submitting bug reports and feature requests are all extremely valuable to the success of rrricanes.

      Acknowledgments

I want to thank the rOpenSci community for embracing rrricanes and accepting the package into their vast portfolio. This is my first attempt at making a project part of a larger community, and the lessons learned have been outstanding.

      I want to thank Maelle Salmon who, in a sense, has been like a guiding angel from start to finish during the entire onboarding and review process.

      I want to give a very special thanks to Emily Robinson and Joseph Stachelek for taking the time to put rrricanes to the test, giving valuable insight and recommendations on improving it.

      And a thank-you also to James Molyneux, Mark Padgham, and Bob Rudis, all of whom have offered guidance or input that has helped make rrricanes far better than it would have been on my own.


      What is the appropriate population scaling of the Affordable Care Act Funding?

      Wed, 09/27/2017 - 05:15

      (This article was first published on Jonathan Sidi's R Blog, and kindly contributed to R-bloggers)

      Analysis of the effects of the Graham-Cassidy Bill on the ACA population –

      I have been trying to decipher for myself, what is in the current (well, yesterday’s) Graham-Cassidy health care bill. I saw this image on many news outlets a few days ago and my inner hate for pie charts bubbled up.

This is a zoom in on the pie chart … From what I can gather, these figures are attempting to say that specific states are getting relatively more of the federal health care funds under the Affordable Care Act (ACA) than their relative state population. Among the many things that are really hard to do with pie charts, comparing distributions ranks pretty high up there.

It is common practice, when comparing geographical areas with different population levels, to scale using the rate per person (per capita) or per number of people, e.g. the rate per 1,000 people. In this case it would be population-adjusted state federal funding. The question that needs answering is: what is the relevant population?
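As a toy illustration of that scaling (the funding and population figures below are invented purely for illustration):

# Toy example of per-1,000 scaling; the numbers are made up.
funds <- c(stateA = 5000, stateB = 1200)   # federal funds ($ millions)
pop   <- c(stateA = 2.5e6, stateB = 4e5)   # population used for scaling
1000 * funds / pop                         # funds ($ millions) per 1,000 people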

Many charts in the last week have scaled the funding adjusted to state population (as is alluded to in the figure above), but the funds are not actually being used by everyone in each state; most people have health care from their employer. So, what is the actual population that is being serviced by the federal funding for the ACA? How much of a different picture does that paint from the original figure?

Hopefully this post will motivate readers to start looking around for more data on the effect of the proposed bill on the appropriation of federal funds at the state level.

My sources of information are the Kaiser Family Foundation site, which has a great database of data on the ACA and the proposed bill, and Wikipedia for auxiliary population data. We will end up with the following figure, but along the way I learned a number of things that I didn't know from reading online and seeing the news on TV.

A quick note before you proceed – this is not meant to be an all-encompassing analysis of the predicted effects of the Graham-Cassidy bill. As has been said before, "Healthcare is hard…", and if I made any bad assumptions I apologize in advance and welcome any comments and suggestions to better understand the subject matter.

      Saying that, let’s continue:

library(xml2)
library(rvest)
library(dplyr)
library(ggplot2)
library(geofacet)

knitr::opts_chunk$set(fig.height=7,fig.width=12,warning=FALSE,message=FALSE)

Scraping the relevant information

Kaiser Family Foundation

ACA and Graham-Cassidy federal spending by state.

kf_spending <- (xml2::read_html('http://www.kff.org/health-reform/issue-brief/state-by-state-estimates-of-changes-in-federal-spending-on-health-care-under-the-graham-cassidy-bill/')%>%
  rvest::html_table())[[2]][-c(1,2,3,55),]

names(kf_spending) <- c('Location','ACA','GC','DIFF','PCT')

kf_spending$Location[which(kf_spending$Location=='DC')]='District of Columbia'

kf_spending <- kf_spending%>%mutate_at(.vars=vars(ACA,GC,DIFF,PCT),.funs=funs(as.numeric(gsub('[,%]','',.))))

ACA Medicaid expansion by state

The decision of each state to accept Medicaid expansion will have a large effect on the net effect of GC on the redistribution of federal funds. States that did not accept Medicaid expansion are expected to have a net positive increase in federal funds.

#http://www.kff.org/health-reform/state-indicator/state-activity-around-expanding-medicaid-under-the-affordable-care-act/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D
kf_expansion <- read.csv('data/kf_expansion.csv',stringsAsFactors = FALSE,skip = 2)
kf_expansion <- kf_expansion[-c(1,53:61),-3]
names(kf_expansion)[2] <- 'Expansion'

Population of ACA enrollment by state.

      The target population that will be used to scale the federal funds is the total marketplace enrollment for each state. We also add the characteristic of type of marketplace applied in the state to check if that has any effect.

      • Federally-Facilitated Market
      • State-based Marketplace
      • State-based Marketplace (using HealthCare.gov)
#http://www.kff.org/health-reform/state-indicator/total-marketplace-enrollment/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Total%20Individuals%20Who%20Have%20Selected%20a%20Marketplace%20Plan%22,%22sort%22:%22asc%22%7D
kf_marketplace_pop <- read.csv('data/kf_marketplace_pop.csv',stringsAsFactors = FALSE,skip = 2)
kf_marketplace_pop <- kf_marketplace_pop[-c(1,53:59),]
names(kf_marketplace_pop)[c(2,3)] <- c('Marketplace_Type','N')

Wikipedia

State characteristics (2016 elections and general population)

To gather more characteristics of each state, we add the 2016 general election results and the total population in each state, so the prevalent scaling can be used as a comparison.

wiki_elections <- (xml2::read_html('https://en.wikipedia.org/wiki/United_States_presidential_election,_2016')%>%
  rvest::xml_nodes(xpath='//*[@id="mw-content-text"]/div/div[40]/table')%>%
  rvest::html_table())[[1]][-c(1,58),c(1,3,6,23)]

names(wiki_elections) <- c('Location','Clinton','Trump','Total')

wiki_elections$Location[grep('^Nebraska',wiki_elections$Location)] <- 'Nebraska'
wiki_elections$Location[grep('^Maine',wiki_elections$Location)] <- 'Maine'

wiki_elections <- wiki_elections%>%
  mutate_at(.vars = vars(Clinton,Trump,Total),.funs=funs(as.numeric(gsub('[,]','',.))))%>%
  group_by(Location)%>%summarise_at(.vars = vars(Clinton,Trump,Total),.funs = funs(sum))%>%
  mutate(ClintonPct=Clinton/Total,TrumpPct=Trump/Total,TrumpWin=ifelse(TrumpPct>ClintonPct,'Trump Win','Clinton Win'))

wiki_pop <- (xml2::read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population')%>%
  rvest::xml_nodes(xpath='//*[@id="mw-content-text"]/div/table[1]')%>%
  rvest::html_table())[[1]][-c(30,53:64),c(3,4)]

names(wiki_pop) <- c('Location','Total_N')
wiki_pop$Total_N <- as.numeric(gsub('[,]','',wiki_pop$Total_N))

Join all the data sets

We join all the information and create a new variable – the change in federal funds from ACA to GC. This is done both for the rate per 1,000 individuals who have selected a market based plan and for the broader rate per 1,000 individuals of the state total. The former assumes that this more concise definition of the population better reflects the population serviced by the federal funding, and that it is the potential population that would be serviced by the GC bill.

kf <- kf_marketplace_pop%>%
  left_join(kf_expansion,by='Location')%>%
  left_join(wiki_pop,by='Location')

kf <- kf_spending%>%left_join(kf, by = c('Location'))%>%
  mutate(ratio_ACA=1000*ACA/N,ratio_GC=1000*GC/N,ratio_DIFF = ratio_GC-ratio_ACA,
         tot_ratio_ACA=1000*ACA/Total_N,tot_ratio_GC=1000*GC/Total_N,tot_ratio_DIFF = tot_ratio_GC-tot_ratio_ACA)%>%
  arrange(desc(ratio_DIFF))

kf <- kf%>%left_join(wiki_elections,by='Location')

kf$Expansion <- sprintf('Medicaid Expansion %s',kf$Expansion)
kf$Location <- factor(kf$Location,levels = kf$Location)
kf$Marketplace_Type <- factor(kf$Marketplace_Type,labels=c('Federally-Facilitated Market','State-based Marketplace','State-based Marketplace (using HealthCare.gov)'))

Plots

Percent of state population enrolled in ACA

First we want to see the scope of the population in each state that has selected an ACA market based plan (note California… not quite 12% of the US population).

kf%>%
  mutate(pop_pct=100*N/Total_N)%>%
  arrange(desc(pop_pct))%>%
  mutate(Location=factor(Location,levels=Location))%>%
  ggplot(aes(x=Location,y=pop_pct))+
  geom_point()+
  coord_flip()+
  labs(y='Percent of Population that have selected an ACA market based plan')

Overall distribution by Medicaid Expansion

We then check whether there really is a difference between states that did and did not expand Medicaid under the ACA, and whether voting Republican versus Democratic makes a difference.

boxplot_dat <- kf%>%
  dplyr::select(Expansion,Marketplace_Type,TrumpWin, ratio_DIFF,tot_ratio_DIFF)%>%
  reshape2::melt(.,id=c('Marketplace_Type','Expansion','TrumpWin'))

levels(boxplot_dat$variable) <- c('per 1,000 Individuals who have\nselected a market based plan','per 1,000 Individuals')

boxplot_dat%>%
  ggplot(aes(x=Expansion, y=value, fill=TrumpWin))+
  geom_boxplot()+
  geom_hline(aes(yintercept=0),linetype=2)+
  facet_wrap(~variable,ncol=1,scales='free_y')+
  labs(title='Change in Federal Funds ACA vs Graham-Cassidy, 2020-2026',
       y='Change in Federal Funds ($ Millions) per 1,000 individuals')+
  theme_bw()+
  theme(legend.position = 'bottom')

Drilling down to state level figures, we show for each state the change from ACA funding to the proposed GC funding per 1,000 persons who selected a market based ACA plan. The arrows move from ACA to GC funding and the y-axis is ordered by increasing net difference. This comparison is faceted by the different characteristics scraped above.

      Some things to look for:

      • New York has the largest negative net funding per 1,000 persons.
      • Kentucky has the largest negative net funding per 1,000 persons among Republican leaning states.
      • The net increase in funding per 1,000 persons for states that did not expand medicaid is mostly minimal.
p <- kf%>%ggplot(aes(x=Location,xend=Location,yend=ratio_GC,y=ratio_ACA,colour=ratio_DIFF))+
  geom_segment(arrow = arrow(length = unit(0.02, "npc")))+
  coord_flip()+
  scale_colour_gradient(low = 'red',high = 'blue',name='Difference')+
  labs(title='Change in Federal Funds per 1,000 Individuals who have\nselected a market based plan ACA vs Graham-Cassidy, 2020-2026',
       subtitle='Arrow pointing to movement from ACA to Graham-Cassidy',
       caption='Source: Kaiser Family Foundation',
       y='Federal Funds ($ Millions) per 1,000 individuals who have selected a market based plan')+
  theme_bw()+
  theme(legend.position = 'bottom')

Political Leaning

p + facet_wrap(~ TrumpWin , scales='free_y')

ACA Medicaid expansion

p + facet_wrap(~ Expansion , scales='free_y')

ACA Medicaid expansion and Political Leaning

p + facet_wrap(~ Expansion + TrumpWin , scales='free_y')

State Marketplace Type

p + facet_wrap(~ Marketplace_Type, scales='free_y')

ACA Medicaid expansion and State Marketplace Type

p + facet_wrap(~ Expansion + Marketplace_Type , scales='free_y')

      Geofaceting

Lastly, we construct a geographic representation of the difference between the ACA and the GC bill using Ryan Hafen's geofacet package.

states_facet <- state_ranks%>%left_join(kf%>%rename(name=Location),by='name')
states_facet$Expansion <- factor(states_facet$Expansion,labels=c('Expansion','No Expansion'))
states_facet$tile_lbl <- sprintf('%s\n%s',states_facet$Expansion,states_facet$TrumpWin)

Total State Population

states_facet%>%
  ggplot(aes(x='1', y='1',fill=tot_ratio_DIFF)) +
  geom_tile() +
  geom_text(aes(label=tile_lbl),size=2)+
  theme_bw() +
  facet_geo( ~ state)+
  scale_fill_gradient2(low='red',mid='white',high='green',name='Difference') +
  theme(legend.position = 'bottom',
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank())+
  labs(title='Change in Federal Funds per 1,000 Individuals, 2020-2026',
       caption='Source: Kaiser Family Foundation')

ACA enrollment population

states_facet%>%
  ggplot(aes(x='1', y='1',fill=ratio_DIFF)) +
  geom_tile() +
  geom_text(aes(label=tile_lbl),size=2)+
  theme_bw() +
  facet_geo( ~ state)+
  scale_fill_gradient2(low='red',mid='white',high='green',name='Difference') +
  theme(legend.position = 'bottom',
        axis.text = element_blank(),
        axis.title = element_blank(),
        axis.ticks = element_blank())+
  labs(title='Change in Federal Funds per 1,000 Individuals who have\nselected a market based plan ACA vs Graham-Cassidy, 2020-2026',
       caption='Source: Kaiser Family Foundation')


