R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Recent R Data Packages

Wed, 11/01/2017 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

It has never been easier to access data from R. Not only does there seem to be a constant stream of new packages that access the APIs of data providers, but it is also becoming popular for package authors to wrap up fairly large datasets into R packages. Below are 44 R packages concerned with data in one way or another that have made it to CRAN over the past two months.

alphavantager v0.1.0: Implements an interface to the Alpha Vantage API to fetch historical data on stocks, physical currencies, and digital/crypto currencies. See the README to get the key.

AmesHousing v0.0.2: Contains raw and processed versions of the Ames Iowa Housing Data. See De Cock (2011).

billboard v0.1.0: Contains data sets regarding songs on the Billboard Hot 100 list from 1960 to 2016, including ranks for the given year, musical features, and lyrics.

BIS v0.1.0: Provides an interface to data provided by the Bank for International Settlements. There is a vignette.

bomrang v0.0.8: Provides functions to interface with Australian Government Bureau of Meteorology data. There is an Introduction and an Example.

brazilmaps v0.1.0: Enables users to obtain Brazilian map spatial objects of varying region types, e.g., cities, states, microregions, and mesoregions.

census v0.2.0: Provides functions to scrape US Census American Community Survey (ACS) data and metadata. There is a brief vignette.

cptec v0.1.0: Retrieves data from CPTEC/INPE (Centro de Previsão de Tempo e Estudos Climáticos, Instituto Nacional de Pesquisas Espaciais), a weather and climate forecasting center in Latin America.

coinmarketcapr v0.1: Provides functions to extract and monitor the price and market capitalization of cryptocurrencies from Coin Market Cap.

copulaData v0.0-1: Contains data sets used for copula modeling in addition to those in the package copula.

CytobankAPIstats v1.0: Provides tools to access and process cytometry data from Cytobank.

data360r v1.0.1: Provides an interface to the API for the World Bank’s TCdata360 and Govdata360 platforms.

DescriptiveStats.OBeu v1.2.1: Provides functions to estimate and return the needed parameters for visualizations designed for OpenBudgets.eu datasets. There is a Getting Started Guide and two other vignettes.

dobson v0.1: Contains example data sets from the book An Introduction to Generalised Linear Models by Dobson and Barnett.

ensemblepp v0.1-0: Contains temperature and precipitation ensemble weather forecasts and observations taken at Innsbruck, Austria, which are contained in the forthcoming book (Elsevier) “Statistical Postprocessing of Ensemble Forecasts” by Vannitsem, Wilks, and Messner.

fedreporter v0.2.1: Implements an API to the Federal RePORTER, allowing users to search projects from different government agencies.

fingertipsR v0.1.3: Provides an interface to the API for Fingertips, which contains data for many indicators of public health in England. There are vignettes for plotting Life Expectancy and for Interactively Selecting Indicators.

GetITRData v0.6: Provides functions to read quarterly and annual financial reports – including assets, liabilities, income, and cash flow statements – from Bovespa’s ITR (informacoes trimestrais) system. There is a vignette.

GetLattesData v0.6: Implements an API for downloading and reading XML data directly from Lattes. There is a vignette.

IIS v1.0: Contains the datasets and functions from the book Intuitive Introductory Statistics.

jaod v0.1.0: Provides a client for the Directory of Open Access Journals (DOAJ). See the API documentation and README file.

LAGOSNE v1.00: Provides a client for programmatic access to the Lake Multi-scaled Geospatial and Temporal database, with functions for accessing lake water quality and ecological context data for the US. There is a vignette on the Structure of LAGOSNE and another for Working with LAGOSNE. A map in the original post shows chlorophyll concentrations in Pennsylvania.

lexiconPT v0.1.0: Provides access to Portuguese lexicons for sentiment analysis.

mapsapi v0.1.0: Provides an interface to the Google Maps API. See the vignette to get started.

matchbook v1.0.7: Provides a wrapper for the Matchbook API.

microdemic v0.1.0: Provides programmatic access to scholarly articles in the Microsoft Academic Graph.

mozzie v0.1.0: Contains weekly dengue cases in 25 districts of Sri Lanka from week 52 of 2008 to week 21 of 2014.

nyctaxi v0.0.1: Provides an interface to New York City’s Taxi and Limousine Commission Trip Data. The vignette describes how to get started.

opendotaR v0.1.4: Provides access to the OpenDota API.

PakPC2017 v0.3.0: Provides data sets and functions for exploration of the Pakistan Population Census 2017.

PakPMICS2014Ch v0.1.0: Provides a data set and function for exploring the Multiple Indicator Cluster Survey: 2014 Children Punjab, Pakistan.

PakPMICS2014HH v0.1.0: Provides a data set and function for exploring the Multiple Indicator Cluster Survey: 2014 Households Punjab, Pakistan.

PakPMICS2014HL v0.1.0: Provides a data set and function for exploring the Multiple Indicator Cluster Survey: 2014 Household Listings Punjab, Pakistan.

PakPMICS2014Wm v0.1.0: Provides a data set and function for exploring the Multiple Indicator Cluster Survey: 2014 Women (age 15-49 years) Punjab, Pakistan.

pder v1.0-0: Provides data sets for the Panel Data Econometrics with R book.

realestateDK v0.1.0: Provides quarterly Danish housing market statistics since 1992, including average square-meter prices and the number of free trades for parcel and terraced houses, condominiums, and holiday homes.

rcongresso v0.1.3: Wraps the API for the Brazilian Chamber of Deputies. There is an Introduction and a vignette for using the package with purrr.

repurrrsive v0.1.0: Contains recursive lists in the form of R objects, ‘JSON’, and ‘XML’, for use in teaching with examples, including color palettes, Game of Thrones characters, GitHub repositories, entities from the Star Wars universe, and more.

spatstat.data v1.1-1: Contains datasets for the spatstat package.

statsDK v0.1.1: Provides a wrapper for the API to Statistics Denmark. Have a look at the API Console and the vignette to get started with the package.

SPYvsSPY: Contains historical data from the legendary Mad magazine comic strip Spy vs. Spy.

UdderQuarterInfectionData v1.0.0: Provides the udder quarter infection data set, which contains infection times of individual cow udder quarters with Corynebacterium bovis. See Laevens et al. 1997.

USCF v0.1.3: Provides a function to retrieve information from the U.S. Chess Federation website.

XKCDdata v0.1.0: Provides a function to download data from individual XKCD comics, by Randall Munroe.


An example of how to use the new R promises package

Wed, 11/01/2017 - 01:00

(This article was first published on Appsilon Data Science Blog, and kindly contributed to R-bloggers)

The long-awaited promises package will be released soon!

Being as impatient as I am when it comes to new technology, I decided to play with the currently available implementation of promises that Joe Cheng shared and recently presented at the EARL conference in London.

In this article you’ll get to know the upcoming promises package: how to use it and how it differs from the existing future package.

Promises/futures are a concept used in almost every major programming language. We’ve used Tasks in C#, Futures in Scala, and Promises in JavaScript, and they all adhere to a common understanding of what a promise is.

If you are not familiar with the concept of promises, asynchronous tasks, or futures, I advise you to take a longer moment and dive into the topic. If you’d like to dive deeper and achieve a higher level of understanding, read about continuation monads in Haskell. We’ll be comparing the new promises package with the future package, which has been around for a while, so if you haven’t used it before I suggest you first take a look at its overview vignette: https://cran.r-project.org/web/packages/future/vignettes/future-1-overview.html
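If you have never touched future, here is a minimal sketch (a toy example, not from the original post) of the two calls this article relies on: future() starts a computation asynchronously, and value() blocks until the result is available.

library(future)
plan(multiprocess)                 # run futures in separate R processes

f <- future({ Sys.sleep(5); 42 })  # returns immediately; the work happens in the background
value(f)                           # blocks until the future resolves, then returns 42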

Citing Joe Cheng, our aim is to:

  1. Execute long-running code asynchronously on separate thread.
  2. Be able to do something with the result (if success) or error (if failure), when the task completes, back on the main R thread.

A promise object represents the eventual result of an async task. A promise is an R6 object that knows:

  1. Whether the task is running, succeeded, or failed
  2. The result (if succeeded) or error (if failed)

Without further ado, let’s get our hands on the code! You should be able to just copy-paste code into RStudio and run it.

R is single-threaded. This means that users cannot interact with your Shiny app while a long-running task is being executed on the server. Let’s take a look at an example:

longRunningFunction <- function(value) {
  Sys.sleep(5)
  return(value)
}

a <- longRunningFunction(1)
b <- longRunningFunction(2)

print("User interaction") # Time: 10s

c <- longRunningFunction(10)

print(a)
print(b)

sumAC <- a + c
sumBC <- b + c

print("User interaction") # Time: 15s

print(sumAC + sumBC)
print("User interaction") # Time: 15s

We’ll use a simplified version of user interaction while additional computations are happening on the server. Let’s assume that we can’t simply put all the computations in a separate block of code and run it separately using the future package. There are many cases where it is very difficult, or almost impossible, to gather all computations and run them elsewhere as one big block of code.

The user cannot interact with the app for 10 seconds until the computations are finished, and then has to wait another 5 seconds for the next interaction. This is not where we want to be: user interactions should be as fast as possible, and the user shouldn’t have to wait when it isn’t necessary. Let’s fix that using the future package we already know.

install.packages("future")
library(future)
plan(multiprocess)

longRunningFunction <- function(value) {
  Sys.sleep(5)
  return(value)
}

a <- future(longRunningFunction(1))
b <- future(longRunningFunction(2))

print("User interaction") # Time: 0s

c <- future(longRunningFunction(10))

print(value(a))
print(value(b))

sumAC <- value(a) + value(c)
sumBC <- value(b) + value(c)

print("User interaction") # Time: 5s

print(sumAC + sumBC)
print("User interaction") # Time: 5s

Nice, now the first user interaction can happen in parallel! But the second interaction is still blocked – we have to wait for the values in order to print their sum. To fix that, we’d like to chain the computation into the summing function instead of waiting synchronously for the result. We can’t do that using pure futures, though (assuming we can’t just put all these computations in one single block of code and run it in parallel). Ideally, we’d like to be able to write code similar to the one below:

library(future)
plan(multiprocess)

longRunningFunction <- function(value) {
  Sys.sleep(5)
  return(value)
}

a <- future(longRunningFunction(1))
b <- future(longRunningFunction(2))

print("User interaction") # Time: 0s

c <- future(longRunningFunction(10))

future(print(value(a)))
future(print(value(b)))

sumAC <- future(value(a) + value(c))
sumBC <- future(value(b) + value(c))

print("User interaction") # Time: 0s

future(print(value(sumAC) + value(sumBC)))
print("User interaction") # Time: 0s

Unfortunately, the future package won’t allow us to do that.

What we can do is use the promises package from RStudio!

devtools::install_github("rstudio/promises")

Let’s play with the promises! I simplified our example to let us focus on using promises first:

library(future)
plan(multiprocess)
library(tibble)

longRunningFunction <- function(value) {
  Sys.sleep(5)
  return(value)
}

a <- future(longRunningFunction(tibble(number = 1:100)))
print(value(a))

print("User interaction") # Time: 5s

We’d like to chain the result of longRunningFunction to a print function so that once the longRunningFunction is finished, its results are printed.

We can achieve that by using the %...>% operator. It works like the very popular %>% operator from magrittr. Think of %...>% as “sometime in the future, once I have the result of the operation, pass it to the next function”. The three dots symbolise the fact that we have to wait and that the result will be passed in the future; it’s not happening now.

library(future)
plan(multiprocess)
library(promises)
library(tibble)

longRunningFunction <- function(value) {
  Sys.sleep(5)
  return(value)
}

a <- future(longRunningFunction(tibble(number = 1:100)))

a %...>% print() # Time: 5s

print("User interaction") # Time: 0s

Pure magic.

But what if I want to filter the result first and then print the processed data? Just keep on chaining:

library(future)
plan(multiprocess)
library(promises)
library(tibble)
library(dplyr)

longRunningFunction <- function(value) {
  Sys.sleep(5)
  return(value)
}

a <- future(longRunningFunction(tibble(number = 1:100)))

a %...>%
  filter(number %% 2 == 1) %...>%
  sum() %...>%
  print()

print("User interaction")

Neat. But how can I print the result of filtering and then pass it to the sum function? There is a tee operator, the same as the one magrittr provides (but one that operates on a promise). It passes the incoming result to the next function and, if you chain it further, it forwards that same result rather than the return value of print(). Think of it as splitting a railway: the value is printed on a side track that then ends, and we get back onto the main track:

library(future)
plan(multiprocess)
library(promises)
library(tibble)
library(dplyr)

longRunningFunction <- function(value) {
  Sys.sleep(5)
  return(value)
}

a <- future(longRunningFunction(tibble(number = 1:100)))

a %...>%
  filter(number %% 2 == 1) %...T>%
  print() %...>%
  sum() %...>%
  print()

print("User interaction")

What about errors? They are thrown somewhere other than the main thread, so how can I catch them? You guessed it – there is an operator for that as well. Use %...!% to handle errors:

library(future)
plan(multiprocess)
library(promises)
library(tibble)
library(dplyr)

longRunningFunction <- function(value) {
  stop("ERROR")
  return(value)
}

a <- future(longRunningFunction(tibble(number = 1:100)))

a %...>%
  filter(number %% 2 == 1) %...T>%
  print() %...>%
  sum() %...>%
  print() %...!%
  (function(error) {
    print(paste("Unexpected error: ", error$message))
  })

print("User interaction")

But in our example we’re not just chaining one computation. There is a longRunningFunction call that eventually returns 1 and another one that eventually returns 2. We need to somehow join the two: once both of them are ready, we’d like to use them and return the sum. We can use the promise_all function to achieve that. It takes a list of promises as an argument and returns a promise that eventually resolves to a list of the results of each of those promises.
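As a quick illustration (a sketch, not from the original post; the names x and y are chosen purely for the example):

library(future); plan(multiprocess)
library(promises)

p1 <- future(1)
p2 <- future(2)

promise_all(x = p1, y = p2) %...>%  # eventually resolves to list(x = 1, y = 2)
  with(print(x + y))                # prints 3 once both futures have finished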

Perfect. We know the tools that we can use to chain asynchronous functions. Let’s use them in our example then:

library(future)
plan(multiprocess)
library(promises)
library(purrr)

longRunningFunction <- function(value) {
  Sys.sleep(5)
  return(value)
}

a <- future(longRunningFunction(1))
b <- future(longRunningFunction(2))

print("User interaction") # Time: 0s

c <- future(longRunningFunction(10))

a %...>% print()
b %...>% print()

sumAC <- promise_all(a, c) %...>% reduce(`+`)
sumBC <- promise_all(b, c) %...>% reduce(`+`)

print("User interaction") # Time: 0s

promise_all(sumAC, sumBC) %...>% reduce(`+`) %...>% print()
print("User interaction") # Time: 0s

A task for you – in the line sumAC <- promise_all(a, c) %...>% reduce(`+`), print the list of values from promises a and c before they are summed up.
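If you want to check your answer afterwards, one possible solution (a sketch using the tee operator introduced above) is:

sumAC <- promise_all(a, c) %...T>% print() %...>% reduce(`+`)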

Handful of useful information:

[1] Support for promises has been implemented in Shiny, but neither the CRAN release nor the GitHub master branch includes it yet. Until that support is merged, you’ll have to install Shiny from the async branch:

devtools::install_github("rstudio/shiny@async")

[2] Beta-quality code at https://github.com/rstudio/promises

[3] Early drafts of docs temporarily hosted at: https://medium.com/@joe.cheng

[4] Joe Cheng talk on EARL 2017 in London – https://www.dropbox.com/s/2gf6tfk1t345lyf/async-talk-slides.pdf?dl=0

[5] The plan is to release everything on CRAN by end of this year.

I hope you have as much fun playing with the promises as I did! I’m planning to play with shiny support for promises next.

Till next time!


2017 Beijing Workshop on Forecasting

Wed, 11/01/2017 - 01:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

Later this month I’m speaking at the 2017 Beijing Workshop on Forecasting, to be held on Saturday 18 November at the Central University of Finance and Economics.
I’m giving four talks as part of the workshop. Other speakers are Junni Zhang, Lei Song, Hui Bu, Feng Li and Yanfei Kang.
Full program details are available online.


Survey of Kagglers finds Python, R to be preferred tools

Tue, 10/31/2017 - 20:14

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Competitive predictive modeling site Kaggle conducted a survey of participants in prediction competitions, and the 16,000 responses provide some insights about that user community. (Whether those trends generalize to the wider community of all data scientists is unclear, however.) One question of interest asked what tools Kagglers use at work. Python is the most commonly-used tool within this community, and R is second. (Respondents could select more than one tool.)

Interestingly, the rankings varied according to the job title of the respondent. Either R or Python was the top-ranked tool for every job-title subgroup except one (database administrators, who preferred SQL), split as follows:

  • R: Business Analyst, Data Analyst, Data Miner, Operations Researcher, Predictive Modeler, Statistician
  • Python: Computer Scientist, Data Scientist, Engineer, Machine Learning Engineer, Other, Programmer, Researcher, Scientist, Software Developer

You can find summaries of the other questions in the survey at the link below. An anonymized dataset of survey responses is also available, as is the "Kaggle Kernel" (a kind of notebook) containing the R code behind the survey analysis.

Kaggle: The State of Data Science and Machine Learning, 2017


Free Software Foundation “Social Benefit” Award Nominations

Tue, 10/31/2017 - 16:00

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Ezra Haber Glenn, the author of the acs package in R, recently posted about the Free Software Foundation’s “Social Benefit” Award on the acs mailing list:

acs.R Community:

The Free Software Foundation is now accepting nominations for the 2017
“Project of Social Benefit Award,” presented to the project or team
responsible for applying free software, or the ideas of the free
software movement, in a project that intentionally and significantly
benefits society in other aspects of life.

If anyone is willing to nominate the acs package, the recognition
would be much appreciated — the package has been generously supported
by MIT and the Puget Sound Regional Council, as well as a great deal
of user-feedback and creative development on the part of the
ACS/Census/R community.

The nomination form is quick and easy — see
https://my.fsf.org/projects-of-social-benefit-award-nomination.
Deadline 11/5.

More info at https://www.fsf.org/awards/sb-award/.

Thanks!

I’m reposting this here for a few reasons.

The first is that I only learned about this award from Ezra’s post, and I think that it’s worth raising awareness of the award itself.

The second is that, in my opinion, the acs package does “intentionally and significantly benefit society.” I have used the acs package over several years to learn more about US demographics. Choroplethr, my R package for creating statistical maps, also uses the acs package to retrieve data from the Census Bureau. Several thousand people have taken my free course on Choroplethr, and each of those people has benefitted from the acs package as well.

Finally, I’m mentioning this award to point out that R package developers receive compensation in different ways. None of us receive monetary compensation when people use our packages. However, Ezra has indicated that getting nominated for this award would be useful to him.

For all these reasons, I was happy to nominate the acs package for the Free Software Foundation’s “Social Benefit” Award. It took me less than 5 minutes to fill out the form. If you are a user of choroplethr, and you enjoy its integration with US Census data, then I encourage you to nominate the acs package as well. You can do so here.

The post Free Software Foundation “Social Benefit” Award Nominations appeared first on AriLamstein.com.


linl 0.0.2: Couple improvements

Tue, 10/31/2017 - 13:34

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Following up on the initial 0.0.1 release of linl, Aaron and I are happy to announce release 0.0.2, which reached the CRAN network on Sunday in a smooth ‘CRAN-pretest-publish’ auto-admittance. linl provides a simple-yet-powerful Markdown—and RMarkdown—wrapper around the venerable LaTeX letter class; an expanded example is included as the package vignette.

This version sets a few sensible default values for font, font size, margins, signature (non-)indentation and more; it also expands the documentation.
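Since the expanded example lives in the vignette, here is only a rough sketch of what such an RMarkdown letter looks like (the YAML field names below are an assumption based on the package skeleton and may differ in the released version; only output: linl::linl is essential):

---
opening: Dear R-bloggers readers,
closing: Sincerely,
output: linl::linl
---

The body of the letter is written in plain (R)Markdown and rendered through the LaTeX letter class.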

The NEWS entry follows:

Changes in linl version 0.0.2 (2017-10-29)
  • Set a few defaults for a decent-looking skeleton and template: font, fontsize, margins, left-justify closing (#3)

  • Blockquote display is now a default as well (#4).

  • Updated skeleton.Rmd and vignette source accordingly

  • Documented new default options (#5 and #6).

  • Links are now by default printed as footnotes (#9).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the linl page. For questions or comments use the issue tracker off the GitHub repo.


This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


pinp 0.0.3: More docs, more features

Tue, 10/31/2017 - 02:31

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Our pinp package for snazzier one- or two-column vignettes received its second update. Now at version 0.0.3, it arrived on CRAN on Saturday with minimal fuzz as a ‘CRAN-pretest-publish’ transition.

We added more frontmatter options, documented more, and streamlined some internals of the LaTeX class borrowed from PNAS. A screenshot of the (updated) vignette can be seen below. Additional screenshots are at the pinp page.

The NEWS entry for this release follows.

Changes in pinp version 0.0.3 (2017-10-28)
  • Section ‘Acknowledgements’ now conditional on a frontmatter setting, section ‘Matmethods’ has been removed, pnasbreak no longer used which stabilizes LaTeX float formatting. References are now shown in the column just like other content (Dirk in #36).

  • Vignette now uses new numbered sections frontmatter switch which improves the pdf outline.

  • New front-matter options for title/section header colors, and link colors (Dirk in #39).

  • YAML frontmatter options are now documented in the help page for pinp as well (Dirk in #41).

  • Some typos were fixed (Michael in #42 and #43).

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the pinp page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


gg_tweet’ing Power Outages

Tue, 10/31/2017 - 02:16

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

As many folks know, I live in semi-rural Maine and we were hit pretty hard with a wind+rain storm Sunday to Monday. The hrbrmstr compound has had no power (besides a generator) and no stable/high-bandwidth internet (Verizon LTE was heavily congested) since 0500 Monday, and still doesn’t as I write this post.

I’ve played with scraping power outage data from Central Maine Power but there’s a great Twitter account — PowerOutage_us — that has done much of the legwork for the entire country. They don’t cover everything and do not provide easily accessible historical data (likely b/c evil folks wld steal it w/o payment or credit) but they do have a site you can poke at and do provide updates via Twitter. As you’ve seen in a previous post, we can use the rtweet package to easily read Twitter data. And, the power outage tweets are regular enough to identify and parse. But raw data is so…raw.

While one could graph data just for one’s self, I decided to marry this power scraping capability with a recent idea I’ve been toying with adding to hrbrthemes or ggalt: gg_tweet(). Imagine being able to take a ggplot2 object and “plot” it to Twitter, fully conforming to Twitter’s stream or card image sizes. By conforming to these size constraints, they don’t get cropped in the timeline view (if you allow images to be previewed in-timeline). This is even more powerful if you have some helper functions for proper theme-ing (font sizes especially need to be tweaked). Enter gg_tweet().

Power Scraping

We’ll cover scraping @PowerOutage_us first, but we’ll start with all the packages we’ll need and a helper function to convert power outage estimates to numeric values:

library(httr)
library(magick)
library(rtweet)
library(stringi)
library(hrbrthemes)
library(tidyverse)

words_to_num <- function(x) {
  map_dbl(x, ~{
    val <- stri_match_first_regex(.x, "^([[:print:]]+) #PowerOutages")[,2]
    mul <- case_when(
      stri_detect_regex(val, "[Kk]") ~ 1000,
      stri_detect_regex(val, "[Mm]") ~ 1000000,
      TRUE ~ 1
    )
    val <- stri_replace_all_regex(val, "[^[:digit:]\\.]", "")
    as.numeric(val) * mul
  })
}

Now, I can’t cover setting up rtweet OAuth here. The vignette and package web site do that well.

The bot tweets infrequently enough that this is really all we need (though, bump up n as you need to):

outage <- get_timeline("PowerOutage_us", n=300)

Yep, that gets the last 300 tweets from said account. It’s amazingly simple.

Now, the outage tweets for the east coast / northeast are not individually uniform but collectively they are (there’s a pattern that may change but you can tweak this if they do):

filter(outage, stri_detect_regex(text, "\\#(EastCoast|NorthEast)")) %>%
  mutate(created_at = lubridate::with_tz(created_at, 'America/New_York')) %>%
  mutate(number_out = words_to_num(text)) %>%
  ggplot(aes(created_at, number_out)) +
  geom_segment(aes(xend=created_at, yend=0), size=5) +
  scale_x_datetime(date_labels = "%Y-%m-%d\n%H:%M", date_breaks="2 hours") +
  scale_y_comma(limits=c(0,2000000)) +
  labs(
    x=NULL, y="# Customers Without Power",
    title="Northeast Power Outages",
    subtitle="Yay! Twitter as a non-blather data source",
    caption="Data via: @PowerOutage_us"
  ) -> gg

That pipe chain looks for key hashtags (for my area), rejiggers the time zone, and calls the helper function to, say, convert 1.2+ Million to 1200000. Finally it builds a mostly complete ggplot2 object (you should make the max Y limit more dynamic).
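A quick sketch of that last suggestion (trimmed to the essentials; it assumes the filtered/mutated tweets are first stored in a hypothetically named data frame, outage_ne, instead of being piped straight into ggplot()):

# derive the Y limit from the data instead of hard-coding 2,000,000
outage_ne <- filter(outage, stri_detect_regex(text, "\\#(EastCoast|NorthEast)")) %>%
  mutate(number_out = words_to_num(text))

gg <- ggplot(outage_ne, aes(created_at, number_out)) +
  geom_segment(aes(xend = created_at, yend = 0), size = 5) +
  scale_y_comma(limits = c(0, max(outage_ne$number_out, na.rm = TRUE) * 1.1))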

You can plot that on your own (print gg). We’re here to tweet, so let’s go into the next section.

Magick Tweeting

@opencpu made it possible to shunt plot output to a magick device. This means we have really precise control over ggplot2 output size as well as the ability to add other graphical components to a ggplot2 plot using magick idioms. One thing we need to take into account is “retina” plots. They are — essentially — double resolution plots (72 => 144 pixels per inch). For the best looking plots we need to go retina, but that also means kicking up base plot theme font sizes a bit. Let’s build on hrbrthemes::theme_ipsum_rc() a bit and make a theme_tweet_rc():

theme_tweet_rc <- function(grid = "XY", style = c("stream", "card"), retina=TRUE) {
  style <- match.arg(tolower(style), c("stream", "card"))
  switch(
    style,
    stream = c(24, 18, 16, 14, 12),
    card = c(22, 16, 14, 12, 10)
  ) -> font_sizes
  theme_ipsum_rc(
    grid = grid,
    plot_title_size = font_sizes[1],
    subtitle_size = font_sizes[2],
    axis_title_size = font_sizes[3],
    axis_text_size = font_sizes[4],
    caption_size = font_sizes[5]
  )
}

Now, we just need a way to take a ggplot2 object and shunt it off to twitter. The following gg_tweet() function does not (now) use rtweet as I’ll likely add it to either ggalt or hrbrthemes and want to keep dependencies to a minimum. I may opt-in to bypass the current method since it relies on environment variables vs an RDS file for app credential storage. Regardless, one thing I wanted to do here was provide a way to preview the image before tweeting.

So you pass in a ggplot2 object (likely adding the tweet theme to it) and a Twitter status text (there’s a TODO to check the length for 140c compliance) plus choose a style (stream or card, defaulting to stream) and decide on whether you’re cool with the “retina” default.

Unless you tell it to send the tweet it won’t, giving you a chance to preview the image before sending, just in case you want to tweak it a bit before committing it to the Twitterverse. It also returns the magick object it creates in the event you want to do something more with it:

gg_tweet <- function(g, status = "ggplot2 image", style = c("stream", "card"), retina=TRUE, send = FALSE) {
  style <- match.arg(tolower(style), c("stream", "card"))
  switch(
    style,
    stream = c(w=1024, h=512),
    card = c(w=800, h=320)
  ) -> dims
  dims["res"] <- 72
  if (retina) dims <- dims * 2
  fig <- image_graph(width=dims["w"], height=dims["h"], res=dims["res"])
  print(g)
  dev.off()
  if (send) {
    message("Posting image to twitter...")
    tf <- tempfile(fileext = ".png")
    image_write(fig, tf, format="png")
    # Create an app at apps.twitter.com w/callback URL of http://127.0.0.1:1410
    # Save the app name, consumer key and secret to the following
    # Environment variables
    app <- oauth_app(
      appname = Sys.getenv("TWITTER_APP_NAME"),
      key = Sys.getenv("TWITTER_CONSUMER_KEY"),
      secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
    )
    twitter_token <- oauth1.0_token(oauth_endpoints("twitter"), app)
    POST(
      url = "https://api.twitter.com/1.1/statuses/update_with_media.json",
      config(token = twitter_token),
      body = list(
        status = status,
        media = upload_file(path.expand(tf))
      )
    ) -> res
    warn_for_status(res)
    unlink(tf)
  }
  fig
}

Two Great Tastes That Taste Great Together

We can combine the power outage scraper & plotter with the tweeting code and just do:

gg_tweet(
  gg + theme_tweet_rc(grid="Y"),
  status = "Progress! #rtweet #gg_tweet",
  send=TRUE
)

That was, in-fact, the last power outage tweet I sent.

Next Steps

Ironically, given current levels of U.S. news and public “discourse” on Twitter and some inane machinations in my own area of domain expertise (cyber), gg_tweet() is likely one of the few ways I’ll be interacting with Twitter for a while. You can ping me on Keybase — hrbrmstr — or join the rstats Keybase team via keybase team request-access rstats if you need to poke me for anything for a while.

FIN

Kick the tyres and watch for gg_tweet() ending up in ggalt or hrbrthemes. Don’t hesitate to suggest (or code up) feature requests. This is still an idea in-progress and definitely not ready for prime time without a bit more churning. (Also, words_to_num() can be optimized, it was hastily crafted).


Building Communities Together at ozunconf, 2017

Tue, 10/31/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Just last week we organised the 2nd rOpenSci ozunconf, the sibling rOpenSci unconference held in Australia. Last year it was held in Brisbane; this time around, the ozunconf was hosted in Melbourne, on October 26-27, 2017.
At the ozunconf, we brought together 45 R-software users and developers, scientists, and open data enthusiasts from academia, industry, government, and non-profits. Participants travelled from far and wide, with people coming from 6 cities around Australia, 2 cities in New Zealand, and one city in the USA.


computational methods for numerical analysis with R [book review]

Tue, 10/31/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

This is a book by James P. Howard, II, which I received from CRC Press for review in CHANCE. (As usual, the customary warning applies: most of this blog post will appear later in my book review column in CHANCE.) It consists of a traditional introduction to numerical analysis with backup from R code and packages. The early chapters set the scene, from basics on R to notions of numerical errors, before moving to linear algebra, interpolation, optimisation, integration, differentiation, and ODEs. The book comes with a package, cmna, that reproduces the algorithms and testing. While I do not find much originality in the book, given its adherence to simple resolutions of the above topics, I could nonetheless use it for an elementary course in our first-year classes, with maybe the exception of the linear algebra chapter, which I did not find very helpful.

“…you can have a solution fast, cheap, or correct, provided you only pick two.” (p.27)

The (minor) issue I have with the book, and that a potential mathematically keen student could face as well, is that there is little in the way of justifying a particular approach to a given numerical problem (as opposed to others) and of characterising the limitations and failures of the presented methods (although this happens from time to time, as e.g. for gradient descent, p.191). [Seeping in my Gallic “mal-être”, I am prone to over-criticise methods in class, to the (increased) despair of my students!, but I also feel that avoiding over-rosy presentations is a good way to avoid later disappointments or even disasters.] In the case of this book, finding [more] ways of detecting would-be disasters would have been nice.

An uninteresting and highly idiosyncratic side comment is that the author preferred the French style for long division to the American one, reminding me of my first exposure to the latter, a few months ago! Another comment from a statistician is that mentioning time series inter- or extra-polation without a statistical model sounds close to anathema! And makes extrapolation a weapon without a cause.

“…we know, a priori, exactly how long the [simulated annealing] process will take since it is a function of the temperature and the cooling rate.” (p.199)

Unsurprisingly, the section on Monte Carlo integration is disappointing for a statistician/probabilistic numericist like me, as it fails to give a complete enough picture of the methodology. All simulations seem to proceed there from a large enough hypercube. And recommending the “fantastic” (p.171) R function integrate as a default is scary, given the ability of the selected integration bounds to mislead its users. Similarly, I feel that the simulated annealing section does not provide enough of a cautionary tale about the highly sensitive impact of cooling rates and absolute temperatures. It is only through the raw output of the algorithm applied to the travelling salesman problem that the novice reader can perceive the impact of some of these factors. (The acceptance bound on the jump (6.9) is incidentally wrongly called a probability on p.199, since it can take values larger than one.)
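To see how integrate can be misled by the bounds, the classic illustration from R’s own help page (?integrate) is enough: as the finite upper bound grows, the quadrature points eventually miss the mass of the integrand altogether.

integrate(dnorm, 0, 2)      # 0.4772499, as expected
integrate(dnorm, 0, 2000)   # still about 0.5
integrate(dnorm, 0, 20000)  # reports roughly 0 with a tiny estimated error on many systems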



Recent updates to the Team Data Science Process

Mon, 10/30/2017 - 20:25

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

It's been over a year since we first introduced the Team Data Science Process (TDSP). The data, technology, and practices behind data science continue to evolve, and the TDSP has evolved in parallel. Over the past year, several new facets have been added; see the announcement linked below for details.

For an example of applying the TDSP to effective data science projects, check out Buck Woody's 10-part series walking through every stage of a typical data science project.

As the practice of data science changes, the TDSP continues to evolve. The TDSP is an open project hosted on Github, and your contributions are welcome.

Cortana Intelligence and Machine Learning Blog: The Microsoft Team Data Science Process (TDSP) – Recent Updates


R live class | Data Visualization and Dashboard with R | Nov 7-8 Milan

Mon, 10/30/2017 - 17:31

(This article was first published on R blog | Quantide - R training & consulting, and kindly contributed to R-bloggers)

 

Data Visualization and Dashboard with R is our fourth course of the autumn term. It takes place on November 7-8 in a location close to Milano Lima.
This course will teach you how to build beautiful, effective and flexible plots using the most modern R tools for data visualization. Then you will discover how to embed visualizations and tables in a powerful Shiny app, to make your data easily navigable and let its insights emerge.
You should take this course if you have some knowledge of R and would like to deepen your data visualization skills in R, covering both static visualizations and dashboards.

Data Visualization and Dashboard with R: Outlines

– ggplot2 grammar
– Creating plots with ggplot (Scatter Plot, Line Plot, Bar Plot, Histogram, Box Plot, Surface Plot)
– Customizing Plots (aesthetics, legend, axes, faceting and themes)
– Specialised visualisation tools: ggmap and ggally
– Basic shiny concepts
– The structure of a shiny app
– Shiny: the server side and the user side
– Understanding reactivity in shiny
– An overview of html widgets

Data Visualization and Dashboard with R is organized by the R training and consulting company Quantide and is taught in Italian, while all the course materials are in English.

This course is for max 6 attendees.

Location

The course location is 550 m (a 7-minute walk) from Milano Centrale station and just 77 m (a 1-minute walk) from the Lima subway station.

Registration

If you want to reserve a seat go to: FAQ, detailed program and tickets.

Other R courses | Autumn term

You can find an overview of all our courses here. Next dates will be:

  • November 21-22: R with Database and Big Data. From databases to distributed infrastructure, master the R techniques to handle and query Big Data. Reserve now!
  • November 29-30: Professional R Programming. Organise, document and test your code: write efficient functions, improve the code reproducibility and build R packages. Reserve now!

In case you are a group of people interested in more than one class, write us at training[at]quantide[dot]com! We can arrange together a tailor-made course, picking all the topics that are interesting for your organization and dropping the rest.

The post R live class | Data Visualization and Dashboard with R | Nov 7-8 Milan appeared first on Quantide – R training & consulting.


heatmaply: an R package for creating interactive cluster heatmaps for online publishing

Mon, 10/30/2017 - 15:07

(This article was first published on R – R-statistics blog, and kindly contributed to R-bloggers)

This post on the heatmaply package is based on my recent paper in the journal Bioinformatics (a link to a stable DOI). The paper was published just last week, and since it is released as CC-BY, I am permitted (and delighted) to republish it here in full. My co-authors for this paper are Jonathan Sidi, Alan O’Callaghan, and Carson Sievert.

Summary: heatmaply is an R package for easily creating interactive cluster heatmaps that can be shared online as a stand-alone HTML file. Interactivity includes a tooltip display of values when hovering over cells, as well as the ability to zoom in to specific sections of the figure from the data matrix, the side dendrograms, or annotated labels.  Thanks to the synergistic relationship between heatmaply and other R packages, the user is empowered by a refined control over the statistical and visual aspects of the heatmap layout.

Availability: The heatmaply package is available under the GPL-2 Open Source license. It comes with a detailed vignette, and is freely available from: http://cran.r-project.org/package=heatmaply

Contact: Tal.Galili@math.tau.ac.il

Introduction

A cluster heatmap is a popular graphical method for visualizing high dimensional data. In it, a table of numbers is scaled and encoded as a tiled matrix of colored cells. The rows and columns of the matrix are ordered to highlight patterns and are often accompanied by dendrograms and extra columns of categorical annotation. The ongoing development of this iconic visualization, spanning over more than a century, has provided the foundation for one of the most widely used of all bioinformatics displays (Wilkinson and Friendly, 2009). When using the R language for statistical computing (R Core Team, 2016), there are many available packages for producing static heatmaps, such as: stats, gplots, heatmap3, fheatmap, pheatmap, and others. Recently released packages also allow for more complex layouts; these include gapmap, superheat, and ComplexHeatmap (Gu et al., 2016). The next evolutionary step has been to create interactive cluster heatmaps, and several solutions are already available. However, these solutions, such as the idendro R package (Sieger et al., 2017), are often focused on providing an interactive output that can be explored only on the researcher’s personal computer. Some solutions do exist for creating shareable interactive heatmaps. However, these are either dependent on a specific online provider, such as XCMS Online, or require JavaScript knowledge to operate, such as InCHlib. In practice, when publishing in academic journals, the reader is left with a static figure only (often in a png or pdf format).

To fill this gap, we have developed the heatmaply R package for easily creating a shareable HTML file that contains an interactive cluster heatmap. The interactivity is based on a client-side JavaScript code that is generated based on the user’s data, after running the following command:

install.packages("heatmaply")
library(heatmaply)
heatmaply(data, file = "my_heatmap.html")

The HTML file contains a publication-ready, interactive figure that allows the user to zoom in as well as see values when hovering over the cells. This self-contained HTML file can be made available to interested readers by uploading it to the researcher’s homepage or as supplementary material on the journal’s server. Concurrently, this interactive figure can be displayed in RStudio’s viewer pane, included in a Shiny application, or embedded in knitr/RMarkdown HTML documents.
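As a minimal sketch of the Shiny route (not part of the paper; it relies on heatmaply returning a plotly object, so the standard plotly bindings apply):

library(shiny)
library(plotly)
library(heatmaply)

ui <- fluidPage(plotlyOutput("heatmap"))

server <- function(input, output) {
  # heatmaply() builds a plotly widget, so renderPlotly() can display it
  output$heatmap <- renderPlotly(heatmaply(mtcars))
}

shinyApp(ui, server)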

The rest of this paper offers guidelines for creating effective cluster heatmap visualization. Figure 1 demonstrates the suggestions from this section on data from project Tycho (van Panhuis et al., 2013), while the online supplementary information includes the interactive version, as well as several examples of using the package on real-world biological data.

Fig. 1. The (square root) number of people infected by Measles in 50 states, from 1928 to 2003. Vaccines were introduced in 1963

An interactive version of the measles heatmap is embedded in the original post via an iframe: the measles_heatmaply.html file was uploaded to GitHub and then embedded with a short HTML snippet.

heatmaply – a simple example

The generation of cluster heatmaps is a subtle process (Gehlenborg and Wong, 2012; Weinstein, 2008), requiring the user to make many decisions along the way. The major decisions to be made deal with the data matrix and the dendrogram. The raw data often need to be transformed in order to have a meaningful and comparable scale, while an appropriate color palette should be picked. The clustering of the data requires us to decide on a distance measure between the observation, a linkage function, as well as a rotation and coloring of branches that manage to highlight interpretable clusters. Each such decision can have consequences on the patterns and interpretations that emerge. In this section, we go through some of the arguments in the function heatmaply, aiming to make it easy for the user to tune these important statistical and visual parameters. Our toy example visualizes the effect of vaccines on measles infection. The output is given in the static Fig. 1, while an interactive version is available online in the supplementary file “measles.html”. Both were created using:

heatmaply(
  x = sqrt(measles),
  color = viridis, # the default
  Colv = NULL,
  hclust_method = "average",
  k_row = NA,
  # ...
  file = c("measles.html", "measles.png")
)

The first argument of the function (x) accepts a matrix of the data. In the measles data, each row corresponds with a state, each column with a year (from 1928 to 2003), and each cell with the number of people infected with measles per 100,000 people. In this example, the data were scaled twice – first by not giving the raw number of cases with measles, but scaling them relatively to 100,000 people, thus making it possible to more easily compare between states. And second by taking the square root of the values. This was done since all the values in the data represent the same unit of measure, but come from a right-tailed distribution of count data with some extreme observations. Taking the square root helps with bringing extreme observations closer to one another, helping to avoid an extreme observation from masking the general pattern. Other transformations that may be considered come from Box-Cox or Yeo-Johnson family of power transformations. If each column of the data were to represent a different unit of measure, then leaving the values unchanged will often result in the entire figure being un-usable due to the column with the largest range of values taking over most of the colors in the figure. Possible per-column transformations include the scale function, suitable for data that are relatively normal. normalize, and percentize functions bring data to the comparable 0 to 1 scale for each column. The normalize function preserves the shape of each column’s distribution by subtracting the minimum and dividing by the maximum of all observations for each column. The percentize function is similar to ranking but with the simpler interpretation of each value being replaced by the percent of observations that have that value or below. It uses the empirical cumulative distribution function of each variable on its own values. The sparseness of the dataset can be explored using is.na10.
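A compact sketch of these transformation choices (an illustration on built-in datasets, not part of the paper; the helper functions are exported by heatmaply):

library(heatmaply)
heatmaply(scale(mtcars))        # z-scores per column, for roughly normal columns
heatmaply(normalize(mtcars))    # rescales each column to [0, 1], preserving its shape
heatmaply(percentize(mtcars))   # replaces each value with its empirical percentile
heatmaply(is.na10(airquality))  # visualizes the pattern of missing values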

Once the data are adequately scaled, it is important to choose a good color palette for the data. Other than being pretty, an ideal color palette should have three (somewhat conflicting) properties: (1) Colorful, spanning as wide a palette as possible so as to make differences easy to see; (2) Perceptually uniform, so that values close to each other have similar-appearing colors compared with values that are far away, consistently across the range of values; and (3) Robust to colorblindness, so that the above properties hold true for people with common forms of colorblindness, as well as printing well in grey scale. The default passed to the color argument in heatmaply is viridis, which offers a sequential color palette, offering a good balance of these properties. Divergent color scale should be preferred when visualizing a correlation matrix, as it is important to make the low and high ends of the range visually distinct. A helpful divergent palette available in the package is cool_warm (other alternatives in the package include RdBu, BrBG, or RdYlBu, based on the RColorBrewer package). It is also advisable to set the limits argument to range from -1 to 1.
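For instance, a correlation matrix could be drawn along these lines (a sketch, not part of the paper; note that the palette argument is spelled colors in the CRAN release, and cool_warm(256) generates the divergent palette):

heatmaply(cor(mtcars), colors = cool_warm(256), limits = c(-1, 1))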

Passing NULL to the Colv argument, in our example, removed the column dendrogram (since we wish to keep the order of the columns, relating to the years). The row dendrogram is automatically calculated using hclust with a Euclidean distance measure and the average linkage function. The user can choose an alternative clustering function (hclustfun), distance measure (dist_method), or linkage function (hclust_method), or keep a dendrogram only in the rows/columns or none at all (through the dendrogram argument). Users can also supply their own dendrogram objects to the Rowv (or Colv) arguments. The preparation of the dendrograms can be made easier using the dendextend R package (Galili, 2015) for comparing and adjusting dendrograms. These choices are all left for the user to decide. Setting the k_col/k_row argument to NA makes the function search for the number of clusters (between 2 and 10) by which to color the branches of the dendrogram. The number picked is the one that yields the highest average silhouette coefficient (based on the find_k function from dendextend). Lastly, the heatmaply function uses the seriation package to find an “optimal” ordering of rows and columns (Hahsler et al., 2008). This is controlled using the seriation argument, where the default is “OLO” (optimal-leaf-order), which rotates the branches so that the sum of distances between adjacent leaves (labels) is minimized (i.e., it optimizes the Hamiltonian path length restricted by the dendrogram structure). The other arguments in the example were omitted since they are self-explanatory – the exact code is available in the supplementary material.
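
As a small illustration of these arguments, the row dendrogram could also be prepared by hand with dendextend and passed to Rowv; this is a sketch of one possible workflow, not the only one:

library(heatmaply)
library(dendextend)

# build and polish a row dendrogram explicitly, then hand it to heatmaply
row_dend <- ladderize(as.dendrogram(hclust(dist(sqrt(measles)), method = "average")))

heatmaply(sqrt(measles),
          Rowv = row_dend,   # user-supplied dendrogram for the rows
          Colv = NULL,       # keep the year columns in chronological order
          k_row = NA)        # color row branches using the silhouette-based find_k()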

In order to make some of the above easier, we created the shinyHeatmaply package (available on CRAN), which offers a GUI to help guide the researcher through the heatmap construction, with the functionality to export the heatmap as an html file together with a summary of the parameter specifications needed to reproduce the heatmap with heatmaply. For a more detailed step-by-step demonstration of using heatmaply on biological datasets, you should explore the heatmaplyExamples package (at github.com/talgalili/heatmaplyExamples).

The following biological examples are available and fully reproducible from within the package. You may also view them online in the following links (the html files also include the R code for producing the figures):

Acknowledgements

The heatmaply package was made possible by leveraging many wonderful R packages, including ggplot2 (Wickham, 2009), plotly (Sievert et al., 2016), dendextend (Galili, 2015) and many others. We would also like to thank Yoav Benjamini, Madeline Bauer, and Marilyn Friedes for their helpful comments, as well as Joe Cheng for initiating the collaboration with Tal Galili on d3heatmap, which helped lay the foundation for heatmaply.

Funding: This work was supported in part by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project).  

Conflict of Interest: none declared.

References
  • Galili,T. (2015) dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31, 3718–3720.
  • Gehlenborg,N. and Wong,B. (2012) Points of view: Heat maps. Nat. Methods, 9, 213–213.
  • Gu,Z. et al. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics, 32, 2847–2849.
  • Hahsler,M. et al. (2008) Getting Things in Order : An Introduction to the R Package seriation. J. Stat. Softw., 25, 1–27.
  • van Panhuis,W.G. et al. (2013) Contagious Diseases in the United States from 1888 to the Present. N. Engl. J. Med., 369, 2152–2158.
  • R Core Team (2016) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
  • Sieger,T. et al. (2017) Interactive Dendrograms: The R Packages idendro and idendr0. J. Stat. Softw., 76.
  • Sievert,C. et al. (2016) plotly: Create Interactive Web Graphics via ‘plotly.js’.
  • Weinstein,J.N. (2008) BIOCHEMISTRY: A Postgenomic Visual Icon. Science, 319, 1772–1773.
  • Wickham,H. (2009) ggplot2: Elegant Graphics for Data Analysis.
  • Wilkinson,L. and Friendly,M. (2009) The History of the Cluster Heat Map. Am. Stat., 63, 179–184.


Making a Shiny dashboard using ‘highcharter’ – Analyzing Inflation Rates

Mon, 10/30/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Shiny is an R package that lets R developers and users build web apps using R itself. It lets R users analyze and visualize data and deploy their machine learning models directly in the form of a web app. The package lets you host standalone apps on a webpage, embed them in R Markdown documents, or build dashboards and various forecasting applications. You can also extend your Shiny apps with CSS themes, htmlwidgets, and JavaScript actions. Shiny lets us write both the client-side front-end code and the server-side script in R. More details on this package can be found here.

I recently learned Shiny and started developing a web application using it. Since then I have been in love with it and have been using it in every data science and analytics project. The syntax is easy to understand, and there are lots of good articles and documentation available to help you learn and use it. I personally had a background in developing full-stack web applications using HTML, CSS, JavaScript and other JS-based scripting languages, so I found the syntax easy to pick up.

This article is meant for people who have at least a basic understanding of how to use Shiny in R, or who have some knowledge of developing web apps. That said, it is easy for a beginner or intermediate R user to learn Shiny and its syntax and start developing web apps.

In this article I am going to demonstrate how to make a dashboard using the shinydashboard package. shinydashboard is a package built on top of Shiny that is specifically used to make and deploy dashboards easily in R. More details on this package can be found here.

Secondly, I am going to demonstrate how to use the highcharter package, an R wrapper for the Highcharts JavaScript visualization library, to develop attractive plots. Its syntax is somewhat similar to the qplot function from the ggplot2 package, so if you have experience with ggplot2 you will find it easy to use. Details about highcharter can be found here.

Analyzing Inflation Rates

In this article I am going to demonstrate how to use the shinydashboard and highcharter packages in R to analyze and visualize inflation rates of major economies and other economic and regional trade unions.

What are inflation rates

The inflation rate is the general rate at which the prices of goods and services within a particular economy are rising, and hence the rate at which the purchasing power of the currency is declining. High inflation is definitely not good for an economy because it erodes the value of money. In general, an economy’s central bank tries to work towards keeping the inflation rate low and avoiding deflation.

Deflation is the opposite of inflation. Deflation occurs when the inflation rate becomes negative, i.e., drops below 0. Deflation is more harmful and dangerous for an economy because it means that the prices of goods and services are going to decrease. This sounds great for consumers, but what it actually signals is that demand for goods and services has declined over a long period of time, which directly indicates that a recession is on its way. That brings job losses, declining wages and a big hit to stock portfolios. Deflation slows an economy’s growth: as prices fall, people defer (postpone) purchases in the hope of a better, lower price. Because of this, companies and firms have to cut the cost of their goods and products, which in turn pushes down the wages of their employees.

I won’t explain these economic terms in any more depth here; I leave it to readers to read up on them further. I am sure you will find such subjects quite interesting.

Let’s start with designing the UI in R

In Shiny you have the choice to write the UI and server-side code in a single file, but I prefer to write the client-side and server-side code in separate files, for readability and modularity and to stop things from getting messy as the code grows longer.

#loading the packages
library(shinydashboard)
require(shiny)
require(highcharter)

#layout of the dashboard
#defining character vectors for select inputs
country <- c("India","United States","Mexico","Canada","China, People's Republic of",
             "Japan","Russian Federation","Germany","United Kingdom","European Union",
             "ASEAN-5","New Zealand","Australia","Netherlands","Luxembourg",
             "France","Qatar","United Arab Emirates","Saudi Arabia")

unions <- c("Major advanced economies (G7)","European Union","Emerging and Developing Europe",
            "ASEAN-5","Commonwealth of Independent States","Emerging and Developing Asia",
            "Latin America and the Caribbean",
            "Middle East, North Africa, Afghanistan, and Pakistan")

#function used to define the dashboard
dashboardPage(
  #defines header
  skin = "red",

  #header of the dashboard
  dashboardHeader(
    title = "Inflation Rates",
    dropdownMenu()
  ),

  #defines sidebar of the dashboard
  dashboardSidebar(
    sidebarMenu(
      #the sidebar menu items
      menuItem("Dashboard", tabName = "dashboard", icon = icon("dashboard")),
      menuItem("About", tabName = "about", icon = icon("th")),
      menuItem("Trade Unions", tabName = "unions", icon = icon("signal")),
      menuItem("World", tabName = "world", icon = icon("globe"))
    )),

  #defines the body of the dashboard
  dashboardBody(
    #to add external CSS
    tags$head(
      tags$link(rel = "stylesheet", type = "text/css", href = "custom.css")
    ),

    tabItems(
      #First TAB Menu - dashboard; first argument should be the 'tabName' value of the menuItem function
      tabItem(tabName = "dashboard",
        fluidRow(
          column(12,
            #box() is similar to a 'div' element in HTML
            box(
              selectInput("country", label = "Select Country", choices = country),
              width = 12) # end box
          ), #end column

          #box for plotting the time series plot
          column(12,
            box(
              #the function below defines a highcharter output plot which will be made on the server side
              highchartOutput("hcontainer"),
              width = "12") #end box2
          ), #end column

          br(), #line break
          h4("Relative inflation rates time series plot", align = "center"),
          br(),
          column(12,
            box(highchartOutput("hc2"), width = 12))
        ), #end row
        h4("Made with love from", strong("Anish Singh Walia")),
        a("R code for this project", target = "_blank",
          href = "https://github.com/anishsingh20/Analzying-Inflation-Rates-Worldwide")
      ),

      #second tab menu - ABOUT
      tabItem(tabName = "about",
        h2("What is Inflation ?", style = "text-align:center"),
        br(),
        br(),
        box(width = 12, height = "400px",
          p(style = "font-size:20px", strong("Inflation"), "rates are the general rate at which price of the goods and services within a particular economy are rising and the purchasing power of the currency is declining due to the higher priced goods. High inflation is definitely not good for an economy because it will always reduce the value for money. In general central banks of an economy tries to and work towards reducing the inflation rate and avoiding deflation."),
          p(style = "font-size:20px", strong("Deflation"), "is opposite of inflation. Deflation occurs when the inflation rates become negative or are below 0. Deflation is more harmful and dangerous for an economy because it means that the prices of goods and services are going to decrease. Now, this sounds amazing for consumers like us. But what actually happens is that the demand for goods and services have declined over a long term of time. This directly indicates that a recession is on its way. This brings job losses, declining wages and a big hit to the stock portfolio. Deflation slows economy's growth. As prices fall, people defer(postpone) purchases in hope of a better lower price deal. Due to this companies and firms have to cut down the cost of their goods and products which directly affects the wages of the employees which have to be lowered.")
        )
      ),

      tabItem(tabName = "unions",
        h3("Time series of Inflation rates of Economic trade unions", align = "center"),
        fluidRow(
          column(12,
            box(selectInput("region", label = "Select Economic Region", choices = unions), width = 12)
          ),
          box(highchartOutput("hc3"), width = 12)
        ) # end row
      ),

      tabItem(tabName = "world",
        h3("World's Inflation Rates", align = "center"),
        box(highchartOutput("hc4"), width = 12)
      )
    ) #end tabitems
  ) #end body
) #end dashboard

In the above code one can notice that the various functions used to design the UI are similar to HTML elements, but Shiny lets us use them as R functions, with the HTML attributes passed as function arguments. It is hard to explain everything in the above code here, so I encourage readers to check the documentation and help files of the above functions and to go through the links to the various resources that I will add at the end.

The highchartOutput(id) function takes an ‘id’ as its argument and defines a placeholder for a highcharter plot, which will be built and rendered on the server side. Here we only design and define the UI and layout of the dashboard.

Let’s write the Server logic

All the server-side code lives in a separate file named server.R. It contains only the logic needed to reactively and dynamically generate the various plots.

The inflation rates dataset was downloaded from the IMF (International Monetary Fund) website and is publicly available. I first had to do some preprocessing and transformation using the tidyr and dplyr packages to convert the dataset to the desired format.

require(shinydashboard)
require(ggplot2)
require(dplyr)
require(highcharter) #to plot amazing time series plots
library(readxl)
require(tidyr)

#loading the dataset
inflation <- read_excel("inflation.xls")

year <- c(1980:2022)        #making a vector consisting of all years
year <- as.character(year)  #converting to character type to use in gather()

#converting dataset to a long format
inf <- inflation %>% gather(year, key = "Year", value = "InflationRate")
inf <- na.omit(inf)         #omitting NA values
names(inf) <- c("region", "year", "inflation")
inf$year <- as.integer(inf$year)

India <- filter(inf, region == "India")
India$inflation <- as.numeric(India$inflation)
India$year <- as.numeric(India$year)
China <- filter(inf, region == "China, People's Republic of")
Ger   <- filter(inf, region == "Germany")
Japan <- filter(inf, region == "Japan")
US    <- filter(inf, region == "United States")
EU    <- filter(inf, region == "European Union")
UK    <- filter(inf, region == "United Kingdom")
Fr    <- filter(inf, region == "France")
uae   <- filter(inf, region == "United Arab Emirates")

#server function and logic
server <- function(input, output) {

  output$hcontainer <- renderHighchart({ #write all R-code inside this
    #input$country is used to extract the select input value from the UI and then make
    #a dataframe based on the selected input
    df <- inf %>% filter(region == input$country) #making the dataframe of the country
    df$inflation <- as.numeric(df$inflation)
    df$year <- as.numeric(df$year)
    hchart(df, "line", hcaes(x = year, y = inflation)) %>% # chart type "line" is assumed here
      hc_exporting(enabled = TRUE) %>% #to enable downloading the plot image
      hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                 shared = TRUE, borderWidth = 2) %>% #designing the hoverable tooltip
      hc_title(text = "Time series plot of Inflation Rates", align = "center") %>% #title
      hc_subtitle(text = "Data Source: IMF", align = "center") %>% #subtitle of the plot
      hc_add_theme(hc_theme_elementary()) #adding theme
    #to add 3-d effects
    #hc_chart(type = "column",
    #         options3d = list(enabled = TRUE, beta = 15, alpha = 15))
  }) # end hcontainer

  output$hc2 <- renderHighchart({
    highchart() %>% # highchart() starting object assumed here
      hc_xAxis(categories = inf$year) %>%
      hc_add_series(name = "India", data = India$inflation) %>%
      hc_add_series(name = "USA", data = US$inflation) %>%
      hc_add_series(name = "UK", data = UK$inflation) %>%
      hc_add_series(name = "China", data = China$inflation) %>%
      hc_add_series(name = "Germany", data = Ger$inflation) %>%
      hc_add_series(name = "Japan", data = Japan$inflation) %>%
      #to add colors
      hc_colors(c("red", "blue", "green", "purple", "darkpink", "orange")) %>%
      hc_add_theme(hc_theme_elementary())
  }) # end hc2

  output$hc3 <- renderHighchart({
    union <- inf %>% filter(region == input$region)
    union$year <- as.numeric(union$year)
    union$inflation <- as.numeric(union$inflation)
    hchart(union, "line", hcaes(x = year, y = inflation)) %>% # chart type "line" is assumed here
      hc_exporting(enabled = TRUE) %>%
      hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                 shared = TRUE, borderWidth = 2) %>%
      hc_title(text = "Time series plot of Inflation Rates for Economic Unions", align = "center") %>%
      hc_subtitle(text = "Data Source: IMF", align = "center") %>%
      hc_add_theme(hc_theme_elementary())
  }) #end hc3

  output$hc4 <- renderHighchart({
    world <- inf %>% filter(region == "World")
    world$year <- as.numeric(world$year)
    world$inflation <- as.numeric(world$inflation)
    hchart(world, "line", hcaes(x = year, y = inflation)) %>% # chart type "line" is assumed here
      hc_exporting(enabled = TRUE) %>%
      hc_tooltip(crosshairs = TRUE, backgroundColor = "#FCFFC5",
                 shared = TRUE, borderWidth = 2) %>%
      hc_title(text = "Time series plot of Inflation Rates for World", align = "center") %>%
      hc_subtitle(text = "Data Source: IMF", align = "center") %>%
      hc_add_theme(hc_theme_elementary())
  }) #end hc4
}

In the above code we access each plot defined in the UI by its id, in the form output$id. We use the function renderHighchart({ ... }) to render the plots, writing all the R code that develops the output inside it. Inside this function, as you can see, is simply the R code to make the plots.
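
The pattern can be reduced to a minimal pairing of a UI placeholder and its server-side renderer; the sketch below uses the built-in mtcars data and an illustrative id rather than the inflation data:

library(shiny)
library(highcharter)

ui <- fluidPage(
  highchartOutput("hcontainer")            # placeholder declared in the UI
)

server <- function(input, output) {
  output$hcontainer <- renderHighchart({   # filled in on the server side
    hchart(mtcars, "scatter", hcaes(x = wt, y = mpg))
  })
}

shinyApp(ui, server)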

The function hchart(df, type, hcaes(x, y), ...) works in much the same fashion as qplot: it takes the data, the type of plot, and the plot aesthetics as its arguments. Everything else, such as the title and subtitle, is piped onto it, just like we add extra plotting characteristics with layers in ggplot2. For more details read http://jkunst.com/highcharter/hchart.html
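
For instance, a sketch of this piping style (using mtcars as a stand-in dataset) might look like:

library(highcharter)

hchart(mtcars, "scatter", hcaes(x = wt, y = mpg, group = cyl)) %>%
  hc_title(text = "Weight vs. mileage") %>%
  hc_subtitle(text = "Data source: mtcars") %>%
  hc_add_theme(hc_theme_elementary())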

Screenshots of the Dashboard:

Deploying a Shiny app is just a matter of minutes. You only need an account on shinyapps.io, which offers infrastructure as a service (IaaS) to run and manage your application, and you can deploy your app directly from the RStudio IDE.
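
Deployment can also be scripted with the rsconnect package; the app name below is just a placeholder, and the account is assumed to have been linked beforehand with rsconnect::setAccountInfo():

library(rsconnect)
deployApp(appDir = ".", appName = "inflation-rates-dashboard")  # hypothetical app name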

Conclusion

This was a short tutorial on how to make a dashboard in R using the shinydashboard and highcharter packages. The article is not meant to explain everything in detail, but rather to demonstrate how shinydashboard and highcharter can be used to build good-looking dashboards and web apps. I will attach below the various resources that helped me out and that should help you learn to develop web apps in R.

Resources

Hope this article is motivating enough to at least get you started with learning and developing your own web apps in R and start using highcharter in R to visualize data. For any doubt or queries feel free to drop a comment. I will be happy to help. Do like and share the article.
Happy coding!

    Related Post

    1. Time Series Analysis in R Part 2: Time Series Transformations
    2. Time Series Analysis in R Part 1: The Time Series Object
    3. Parsing Text for Emotion Terms: Analysis & Visualization Using R
    4. Using MongoDB with R
    5. Finding Optimal Number of Clusters


    Increasing Airline Customer Satisfaction

    Mon, 10/30/2017 - 02:38

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    Motivation

    The sheer size of the airline industry provides a reason to care about it: it affects not only millions of people directly (flyers, pilots, engineers, etcetera), but also millions more indirectly by the heft of its economic presence. In a December 2016 report, the International Air Transport Association (IATA) wrote:

    “While airline industry profits are expected to have reached a cyclical peak in 2016 of $35.6 billion, a soft landing in profitable territory is expected in 2017 with a net profit of $29.8 billion. 2017 is expected to be the eighth year in a row of aggregate airline profitability, illustrating the resilience to shocks that have been built into the industry structure. On average, airlines will retain $7.54 for every passenger carried.” (reference)

    As a resident of the US, the daughter of an airline pilot, and a semi-frequent flyer, I have a personal interest in the US airline industry. In looking at carriers by region in the same article mentioned above, the IATA concluded: “The strongest financial performance is being delivered by airlines in North America. Net post-tax profits will be the highest at $18.1 billion next year […]. The net margin for the region’s carriers is also expected to be the strongest at 8.5% with an average profit of $19.58/passenger.”

    Although the North American airline industry is strong, it must be ever-vigilant about keeping up with customer demands in order to maintain its continued growth and its continued position as industry leader across regions. Of course, success in this regard requires airlines to know what customers care about in the first place. Discovering what airline customers like and dislike about their flight experiences was the starting point for this project. To view the project in an interactive Shiny App, click here.

    Data 

    To understand more precisely which aspects of a flight shape customers’ opinions, I decided to scrape the website Skytrax, which collects customer-written reviews of flights for nearly every airline in operation. A typical review appears as follows:

    To get the data from posts of this kind into an analyzable format, I wrote a python script using Selenium to define a web scraping spider, the code for which can be found here on my Github. Of the hundreds of airlines available for reviewing on Skytrax, I limited the scope of my investigation to the eleven largest US-based companies: Alaska Airlines, Allegiant Air, American Airlines, Delta Air Lines, Frontier Airlines, Hawaiian Airlines, JetBlue Airways, Southwest Airlines, Spirit Airlines, United Airlines, and Virgin America. The variables I included in the scrape were:

    • airline: Airline with which the review-writer flew
    • overall: Overall airline rating (out of 10) given by the review-writer
    • author: Name of the review-writer
    • date: Date the review was written
    • customer_review: Text of the customer review
    • aircraft: Aircraft class/type on which the review-writer flew (possibilities too numerous to list; example: Boeing 737)
    • traveller_type: Type of traveller of the review-writer (Business, Couple Leisure, Family Leisure, Solo Leisure)
    • cabin: Type of cabin/class in which the review-writer flew (Business Class, Economy Class, First Class, Premium Economy)
    • route: Origin and destination of the flight (example: Chicago to Boston)
    • date_flown: Month and year in which the flight in question took place
    • seat_comfort: Rating (out of 5) of seat comfort
    • cabin_service: Rating (out of 5) of the inflight service
    • food_bev: Rating (out of 5) of the quality of the inflight food and beverages
    • entertainment: Rating (out of 5) of the availability and connectivity of the inflight entertainment; includes movies and wifi connectivity
    • ground_service: Rating (out of 5) of the service on the ground before and/or after the flight
    • value_for_money: Rating (out of 5) of the value of the airline against the cost of a ticket
    • recommended: Does the review-writer plan to recommend the airline to others (Yes or No)

    One detail of the data worth mentioning is that the variables coming from the “table” in the review (aircraft, type of traveller, etcetera) were not required fields for the review writer. This meant that different reviews could contain different subsets of variables from that table. This affected both the engineering and analysis aspects of this project. On the engineering side, the changing tables from review to review forced me to use Selenium rather than the much faster python package Scrapy. On the analysis side, reviews that did not contain all possible table fields would have missing values (NAs) for those omitted variables, and missingness was leveraged to help answer the more focused questions that follow.

     

    Question 1

    Out of seat comfort, cabin service, food and beverages, entertainment, and ground service, which aspect of a flight has the most influence on a customer’s overall rating?

    This is a classic machine learning question that is easy to ask but difficult to answer, the difficulty lying in the potentially subtle interactions among the predictor variables. To attack this question in my project, I turned to random forests; this choice was motivated by the need to avoid machine learning techniques (such as linear regression) that rely on normality assumptions for the predictor variables. Given the biased nature of reviews, nearly all of the predictor variables in question here (seat comfort, cabin service, food and beverages, entertainment, ground service) were not normally distributed but rather bimodal, with peaks near ratings of 1 and 5.

    I used the randomForest() function from the R package “randomForest”,  which uses the non-parametric Breiman random forest algorithm to produce regression models. As a side perk, it estimates the importance of each predictor variable, relative to the others in question, in predicting the response variable. This output is what I used to determine which of my five variables was most important to the overall flight rating. The details of the Breiman random forest regression algorithm as well as the details for the algorithms used to determine the prediction importance can be found here. Below is a visual of the variable importance output from the randomForest() function when run on my data:
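
    A call of roughly the following shape produces this kind of importance output; it is only a sketch, in which the data frame name reviews and the exact formula are assumptions for illustration (the rating columns match those listed in the Data section):

    library(randomForest)

    set.seed(42)
    rf_fit <- randomForest(overall ~ seat_comfort + cabin_service + food_bev +
                             entertainment + ground_service,
                           data = reviews,        # hypothetical data frame of scraped reviews
                           importance = TRUE)

    importance(rf_fit)   # table of per-variable importance estimates
    varImpPlot(rf_fit)   # dotchart of variable importance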

    One thing to note is that the above computation was done on a version of my dataset in which I had imputed “3” ratings into all the missing ratings. This was done with the idea that if someone had omitted a variable from their review, they likely felt ambivalent about it and would give a middling rating of 3/5 if pressed. However, this was an assumption that could alter my results so I did two things:

    1. Ran the same computation on three different versions of my data: one with “3” imputed for NAs; one with the variable mean imputed for NAs in that variable; one with only reviews that were entirely complete. The above result, as mentioned, is from the first version of imputed data. While the importance ratings (the numbers themselves) changed from version to version, the order of variable importance was constant throughout. This allows me to conclude that according to the Breiman algorithm, ground service was the most important variable in predicting a customer’s overall rating for a flight, followed by seat comfort, cabin service, food and beverage, and entertainment (in that order).
    2. Analyzed the proportion of reviews that included each variable. Imputing missing values inevitably skews results, but eliminating them costs us potentially valuable information. In this case, I believed missingness in various fields tended to fall into the category of “Missing Not at Random,” meaning that the cause for missingness was actually related to the value of the variable in question; in particular, I believed fields with high amounts of missingness were likely less important to customers than fields with very little missingness. To analyze this, I graphed the proportion of reviews that included each variable:
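
    A sketch of that computation, again assuming a data frame named reviews with the rating columns listed earlier, is:

    rating_cols <- c("seat_comfort", "cabin_service", "food_bev",
                     "entertainment", "ground_service")

    # share of reviews in which each rating field was filled in
    prop_included <- colMeans(!is.na(reviews[, rating_cols]))
    barplot(prop_included, las = 2,
            ylab = "Proportion of reviews including the field")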

    From this we see that cabin service and seat comfort are fields that are included in nearly every review, while ground service is only included in about 55% of the reviews.

    Insight:  Cabin service and seat comfort are filled in by nearly every customer who writes a review, so these aspects of a flight are important to most customers. Since these variables rank as second and third most important in predicting overall flight score according to the Breiman random forest algorithm, we may conclude that these two fields have the most influence on a customer’s overall rating of a flight. Furthermore, while ground service appears to be the field most often omitted from reviews (and therefore possibly the least important to customers in general), overall flight rating is highly dependent on ground service for those customers who do include it. In other words, most people don’t care too much about ground service, but those who do care a lot. 

     

    Question 2

    How are US airlines performing across different aspects of customer flight experience?

    Given our results in Question 1, an airline may now want to compare itself to other airlines and to the industry as a whole across the variables of cabin service, entertainment, food and beverage, ground service, and seat comfort. To analyze this, I counted the number of reviews giving 1, 2, 3, 4, 5, and NA ratings for each variable and within each airline, as well as for the industry as a whole. For seat comfort ratings specifically, we have the following results:

     

    Each of the five variables (cabin service, entertainment, food and beverage, ground service, and seat comfort) was explored, and going over all of them led to the following observations:

    • JetBlue Airways had the best ratings for seat comfort across all airlines and thus should market themselves as the industry leaders in seat comfort. Similarly, Alaska Airlines should market themselves as the industry leader in cabin service. Given the results from Question 1, both JetBlue and Alaska would likely see a boost in sales if customers knew they lead in seat comfort and cabin service since these are the variables (of the five studied so far) that impact customers’ overall impressions of a flight the most.
    • The industry as a whole sees quite a lot of low or missing ratings in entertainment. An airline that has received middling ratings in other fields could distinguish itself with entertainment (inflight movies, wifi access, etcetera) and potentially influence more customers to care more about their inflight entertainment experiences.
    • Spirit Airlines consistently receives mostly 1 ratings, which suggests that across all considered fields customers tend to be unhappy with their experience. Yet, Spirit Airlines continues to grow. This suggests a need for more exploration into the needs and wants of airline customers.

    Insight: On the whole the US airline industry is doing the best with cabin service and the worst with ground service and seat comfort (in these fields, there were fewer 5s than any other rating). In addition, there is a huge lack of ratings for entertainment. Within each of the five fields studied, there are opportunities for individual airlines to capitalize on the difference between their own performance and that of the industry: either boast about leading in a currently important field (seat comfort or cabin service) or put money towards becoming a leader in an overlooked arena (entertainment, ground service, food and beverage).

     

    Question 3

    What words come up most frequently in positive reviews? In negative reviews?

    The previous questions aimed to better understand airline customer views on five specific aspects of flight experience (cabin service, entertainment, food and beverage, ground service, seat comfort), but since these five fields do not account for all that could impact a customer’s overall experience, I wanted to analyze the actual text of their reviews. To do this, I used word clouds generated by a combination of the “tm”, “wordcloud,” and “memoise” packages in R. I analyzed the positive and negative reviews separately and for both individual airlines and the industry as a whole. Positive reviews were those with overall ratings of 6/10 or better and negative reviews were those with overall ratings of 5/10 or worse.
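
    A sketch of the word-cloud step for one group of reviews is shown below; positive_text is a hypothetical character vector holding the texts of the reviews rated 6/10 or better:

    library(tm)
    library(wordcloud)
    library(RColorBrewer)

    corpus <- VCorpus(VectorSource(positive_text))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    tdm  <- TermDocumentMatrix(corpus)
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

    wordcloud(names(freq), freq, max.words = 100,
              colors = brewer.pal(8, "Dark2"))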

    Here are the positive and negative word clouds for the industry as a whole:

    In both the positive and negative reviews, the word ‘time’ is one of the top three most-used words. This suggests that, indeed, the five fields considered in questions 1 and 2 did not capture all that is important to flyers (an obvious statement to flyers, of course!). Looking through the word clouds for all the airlines individually further emphasizes that customers primarily rave about and complain about time-related topics. Interestingly, the word ‘hours’ comes up in negative reviews, suggesting that customers can tolerate a moderate amount of lateness or delay (say, under an hour) but lose patience after that. Conversely, no amount of time comes up in positive reviews; potentially, any amount of time that an airline can save a customer is rewarded.

    The words ‘seats’ and ‘service’ still appear in the top five words too, though, so the analysis from the previous questions is corroborated.

    In addition to providing additional information about the trends in the industry, individual airlines can benefit from these word cloud analyses by seeing words that appear in their clouds and not in others (or vice versa). For example, ‘Atlanta’ comes up in the negative reviews for Delta, suggesting that Delta has problems in Atlanta; similarly, Frontier has problems in Denver, and Hawaiian airline customers mention class – both positively and negatively – more than those of any other airline. Spirit, although it ranks as the worst among the airlines in nearly every one of the five fields considered in questions 1 and 2, has a negative word cloud that is not distinguished from the other negative word clouds. That is, Spirit customers still complain the most in text reviews about delays and time, and in fact, Spirit customers mention the word ‘seat’ in their positive reviews!

    Insight: Airline customers write about time more than anything else, followed by service and seats. Given the surprising findings about Spirit Airlines, it’s possible that saving/wasting time is an even stronger predictor of a customer’s overall flight rating than anything else. In business practice, this would suggest that anything an airline can do to save customers time would boost their sales; for some airlines, making tweaks even in single cities may lead to higher level of customer satisfaction.

     

    Conclusion and future directions

    Despite the fact that there are a myriad of factors that impact a flyer’s experience with a flight, airlines can boost customer satisfaction by focusing on a few main aspects of flights – particularly time, seat comfort, and cabin service. This project only scratches the surface in understanding all that impacts flyers’ impressions and each flight aspect merits a full investigation of its own. For example, research into the cost and benefit of improving seat comfort would be highly beneficial for an airline looking to improve sales through seat comfort. In addition, I’d like to employ more advanced machine learning techniques to help predict customer flight ratings and, subsequently, sales. For this, I’d like to look into the unused data I scraped about routes, traveller type, cabin, and date flown, as well as incorporate weather and economic data.

    The post Increasing Airline Customer Satisfaction appeared first on NYC Data Science Academy Blog.



    Deep learning and the German Data Science Job Market

    Mon, 10/30/2017 - 01:00

    (This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

    Almost 2 years ago, I wrote a short post on the German data science market by analysing open position on the key job platforms; Stepstone, Monster and Indeed. Recently I received requests to update the post with fresh data.

    So here it comes. To add a slightly different spin, I thought it would be interesting to see how widespread and prevalent the deep learning / AI trend shows up in the job market.

    To get started, I scraped the number of available job openings from the sites: Monster, Stepstone and Indeed searching for the term “Data Scientist”.

    To put the numbers in perspective, in December 2015 I found 65 jobs on Indeed, 36 on Monster and 36 on Stepstone. As a first take-away, the data science market increased somewhere between 5- and 10-fold.
    The ~600 positions on Indeed are advertised by 254 companies, which is almost a 5-fold increase versus 2015.

    In order to see how important deep learning is, I compare the number of open positions for that search term with other data-related search terms. We see that there are 400 open positions with that keyword. Data mining and machine learning are more common, by a factor of 2 and 4.5, respectively.

    Looking at advertised methods, we see that it is always good to have some standard data preparation and statistics skills under the belt. Looking at the technology search terms, we see a huge interest in Spark (almost a 10-fold increase compared to the 2015 data). What is especially interesting here is that there are “only” 330 data engineer positions; hence big data technologies are mentioned in a much wider range of job postings.

    Finally, looking at programming languages and business intelligence tools we see that Java and Excel are still highly in demand.

    Two years ago I concluded by stating:
    Given this analysis, I would conclude that Germany is not (yet) a hot market for data scientist. Both in terms of regional distribution as well as the low absolute and relative number of open positions suggests that the job market is not well developed.

    Given the 5-fold increase in positions, I think that German companies have finally realized that the digital transformation requires highly skilled, data-educated employees.



    2017-04 Speeding up gridSVG

    Sun, 10/29/2017 - 23:23

    (This article was first published on R – Stat Tech, and kindly contributed to R-bloggers)

    This report describes changes in version 1.6-0 of the ‘gridSVG’ package for R. The most important result of these changes is that ‘gridSVG’ is now much faster at generating SVG files (in situations where it used to be painfully slow).

    Paul Murrell

    Download



    Big Data Transforms

    Sun, 10/29/2017 - 16:33

    (This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

    As part of our consulting practice Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using R and substantial data stores (relational database variants such as PostgreSQL, or big data systems such as Spark).



    Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher order data operators."

    The R package DBI gives us direct access to SQL and the package dplyr gives us access to a transform grammar that can either be executed or translated into SQL.

    But, as we point out in the replyr README: moving from in-memory R to large data systems is always a bit of a shock as you lose a lot of your higher order data operators or transformations. Missing operators include:

    • union (binding by rows many data frames into a single data frame).
    • split (splitting a single data frame into many data frames).
    • pivot (moving row values into columns).
    • un-pivot (moving column values to rows).

    To repeat: if you are an R user used to using one of dplyr::bind_rows(), base::split(), tidyr::spread(), or tidyr::gather(), you will find these functions do not work on remote data sources, but have replacement implementations in the replyr package.

    For example:

    library("RPostgreSQL") ## Loading required package: DBI suppressPackageStartupMessages(library("dplyr")) isSpark <- FALSE # Can work with PostgreSQL my_db <- DBI::dbConnect(dbDriver("PostgreSQL"), host = 'localhost', port = 5432, user = 'postgres', password = 'pg') # # Can work with Sparklyr # my_db <- sparklyr::spark_connect(version='2.2.0', # master = "local") # isSpark <- TRUE d <- dplyr::copy_to(my_db, data.frame(x = c(1,5), group = c('g1', 'g2'), stringsAsFactors = FALSE), 'd') print(d) ## # Source: table [?? x 2] ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres] ## x group ## ## 1 1 g1 ## 2 5 g2 # show dplyr::bind_rows() fails. dplyr::bind_rows(list(d, d)) ## Error in bind_rows_(x, .id): Argument 1 must be a data frame or a named atomic vector, not a tbl_dbi/tbl_sql/tbl_lazy/tbl

    The replyr package supplies R accessible implementations of these missing operators for large data systems such as PostgreSQL and Spark.

    For example:

    # using the development version of replyr https://github.com/WinVector/replyr
    library("replyr")
    ## Loading required package: seplyr
    ## Loading required package: wrapr
    ## Loading required package: cdata
    packageVersion("replyr")
    ## [1] '0.8.2'

    # binding rows
    dB <- replyr_bind_rows(list(d, d))
    print(dB)
    ## # Source: table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##       x group
    ## 1     1 g1
    ## 2     5 g2
    ## 3     1 g1
    ## 4     5 g2

    # splitting frames
    replyr_split(dB, 'group')
    ## $g2
    ## # Source: table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##       x group
    ## 1     5 g2
    ## 2     5 g2
    ##
    ## $g1
    ## # Source: table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##       x group
    ## 1     1 g1
    ## 2     1 g1

    # pivoting
    pivotControl <- buildPivotControlTable(d,
                                           columnToTakeKeysFrom = 'group',
                                           columnToTakeValuesFrom = 'x',
                                           sep = '_')
    dW <- moveValuesToColumnsQ(keyColumns = NULL,
                               controlTable = pivotControl,
                               tallTableName = 'd',
                               my_db = my_db,
                               strict = FALSE) %>%
      compute(name = 'dW')
    print(dW)
    ## # Source: table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##   group_g1 group_g2
    ## 1        1        5

    # un-pivoting
    unpivotControl <- buildUnPivotControlTable(nameForNewKeyColumn = 'group',
                                               nameForNewValueColumn = 'x',
                                               columnsToTakeFrom = colnames(dW))
    moveValuesToRowsQ(controlTable = unpivotControl,
                      wideTableName = 'dW',
                      my_db = my_db)
    ## # Source: table [?? x 2]
    ## # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
    ##      group     x
    ## 1 group_g1     1
    ## 2 group_g2     5

    The point is: using the replyr package you can design in terms of higher-order data transforms, even when working with big data in R. Designs in terms of these operators tend to be succinct, powerful, performant, and maintainable.

    To master the terms moveValuesToRows and moveValuesToColumns I suggest trying the following two articles:

    if(isSpark) {
      status <- sparklyr::spark_disconnect(my_db)
    } else {
      status <- DBI::dbDisconnect(my_db)
    }
    my_db <- NULL


    Practical Machine Learning with R and Python – Part 4

    Sun, 10/29/2017 - 14:40

    (This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

    This is the 4th installment of my ‘Practical Machine Learning with R and Python’ series. In this part I discuss classification with Support Vector Machines (SVMs), using both a linear and a radial basis kernel, and Decision Trees. Further, a closer look is taken at some of the metrics associated with binary classification, namely accuracy versus precision and recall. I also touch upon validation curves, precision-recall curves, ROC curves and AUC, with equivalent code in R and Python.

    This post is a continuation of my 3 earlier posts on Practical Machine Learning in R and Python
    1. Practical Machine Learning with R and Python – Part 1
    2. Practical Machine Learning with R and Python – Part 2
    3. Practical Machine Learning with R and Python – Part 3

    The RMarkdown file with the code and the associated data files can be downloaded from Github at MachineLearning-RandPython-Part4

    Support Vector Machines (SVMs) are another useful machine learning model that can be used for both regression and classification problems. When used for classification, SVMs compute the hyperplane that separates the 2 classes with the maximum margin. To do this the features may be transformed into a larger multi-dimensional feature space. SVMs can be used with different kernels, namely linear, polynomial or radial basis, to determine the best-fitting model for a given classification problem.

    In the 2nd part of this series, Practical Machine Learning with R and Python – Part 2, I had mentioned the various metrics that are used in classification ML problems, namely accuracy, precision, recall and F1 score. Accuracy gives the fraction of data that were correctly classified as belonging to the +ve or -ve class. However, ‘accuracy’ by itself is not a good enough measure because it does not distinguish between the different ways in which data are incorrectly classified (false positives versus false negatives). This issue becomes even more critical in different domains. For example, a surgeon who would like to detect cancer would err on the side of caution and classify even a possibly non-cancerous patient as possibly having cancer, rather than mis-classify a malignancy as benign. Here we would like to increase recall, or sensitivity, which is given by Recall = TP/(TP+FN); that is, we try to reduce mis-classification by either increasing the true positives (TP) or reducing the false negatives (FN).

    On the other hand, search algorithms would like to increase precision, which tries to reduce the number of irrelevant results returned: Precision = TP/(TP+FP). In other words, we do not want ‘false positives’, i.e., irrelevant results, to appear in the search results, and so there is a need to reduce the false positives.
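
    As a small worked example of these formulas, the snippet below computes the metrics from the cells of a hypothetical confusion matrix (the counts are made up for illustration):

    TP <- 82; FP <- 3; FN <- 3; TN <- 54

    accuracy  <- (TP + TN) / (TP + TN + FP + FN)
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)   # also called sensitivity
    f1        <- 2 * precision * recall / (precision + recall)

    round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 3)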

    When we try to increase ‘precision’, we do so at the cost of ‘recall’, and vice-versa. I found this diagram and explanation from Wikipedia very useful (source: Wikipedia):

    “Consider a brain surgeon tasked with removing a cancerous tumor from a patient’s brain. The surgeon needs to remove all of the tumor cells since any remaining cancer cells will regenerate the tumor. Conversely, the surgeon must not remove healthy brain cells since that would leave the patient with impaired brain function. The surgeon may be more liberal in the area of the brain she removes to ensure she has extracted all the cancer cells. This decision increases recall but reduces precision. On the other hand, the surgeon may be more conservative in the brain she removes to ensure she extracts only cancer cells. This decision increases precision but reduces recall. That is to say, greater recall increases the chances of removing healthy cells (negative outcome) and increases the chances of removing all cancer cells (positive outcome). Greater precision decreases the chances of removing healthy cells (positive outcome) but also decreases the chances of removing all cancer cells (negative outcome).”

    1.1a. Linear SVM – R code

    In the R code below I use an SVM with a linear kernel.

    source('RFunctions-1.R')
    library(dplyr)
    library(e1071)
    library(caret)
    library(reshape2)
    library(ggplot2)

    # Read data. Data from SKLearn
    cancer <- read.csv("cancer.csv")
    cancer$target <- as.factor(cancer$target)

    # Split into training and test sets
    train_idx <- trainTestSplit(cancer, trainPercent=75, seed=5)
    train <- cancer[train_idx, ]
    test <- cancer[-train_idx, ]

    # Fit a linear basis kernel. Do not scale the data
    svmfit=svm(target~., data=train, kernel="linear", scale=FALSE)
    ypred=predict(svmfit,test)

    # Print a confusion matrix
    confusionMatrix(ypred,test$target)
    ## Confusion Matrix and Statistics
    ##
    ##           Reference
    ## Prediction  0  1
    ##          0 54  3
    ##          1  3 82
    ##
    ##                Accuracy : 0.9577
    ##                  95% CI : (0.9103, 0.9843)
    ##     No Information Rate : 0.5986
    ##     P-Value [Acc > NIR] : <2e-16
    ##
    ##                   Kappa : 0.9121
    ##  Mcnemar's Test P-Value : 1
    ##
    ##             Sensitivity : 0.9474
    ##             Specificity : 0.9647
    ##          Pos Pred Value : 0.9474
    ##          Neg Pred Value : 0.9647
    ##              Prevalence : 0.4014
    ##          Detection Rate : 0.3803
    ##    Detection Prevalence : 0.4014
    ##       Balanced Accuracy : 0.9560
    ##
    ##        'Positive' Class : 0

    1.1b Linear SVM – Python code

    The code below creates an SVM with a linear kernel in Python and also prints the corresponding classification metrics.

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.datasets import make_classification, make_blobs
    from sklearn.metrics import confusion_matrix
    from matplotlib.colors import ListedColormap
    from sklearn.datasets import load_breast_cancer

    # Load the cancer data
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

    clf = LinearSVC().fit(X_train, y_train)
    print('Breast cancer dataset')
    print('Accuracy of Linear SVC classifier on training set: {:.2f}'
          .format(clf.score(X_train, y_train)))
    print('Accuracy of Linear SVC classifier on test set: {:.2f}'
          .format(clf.score(X_test, y_test)))
    ## Breast cancer dataset
    ## Accuracy of Linear SVC classifier on training set: 0.92
    ## Accuracy of Linear SVC classifier on test set: 0.94

    1.2 Dummy classifier

    Often when we perform classification tasks using any ML model, namely logistic regression, SVMs, neural networks etc., it is very useful to determine how well the ML model performs against a dummy classifier. A dummy classifier uses some simple rule, like predicting the majority class, instead of fitting an ML model. It is essential that our ML model does much better than the dummy classifier. This problem is even more important for imbalanced classes where we have only about 10% +ve samples: if an ML model we create has an accuracy of about 0.90, it is evident that our classifier is not doing any better than a dummy classifier which simply predicts the majority class of this imbalanced dataset and also comes up with 0.90. We need to be able to do better than that.

    In the examples below (1.3a & 1.3b) it can be seen that SVMs with ‘radial basis’ kernel with unnormalized data, for both R and Python, do not perform any better than the dummy classifier.

    1.2a Dummy classifier – R code

    R does not seem to have an explicit dummy classifier. I created a simple dummy classifier that predicts the majority class. SKlearn in Python also includes other strategies like uniform, stratified etc. but this should be possible to create in R also.

    # Create a simple dummy classifier that computes the ratio of the majority class to the total
    DummyClassifierAccuracy <- function(train, test, type="majority"){
        if(type=="majority"){
            count <- sum(train$target==1)/dim(train)[1]
        }
        count
    }

    cancer <- read.csv("cancer.csv")
    cancer$target <- as.factor(cancer$target)

    # Create training and test sets
    train_idx <- trainTestSplit(cancer, trainPercent=75, seed=5)
    train <- cancer[train_idx, ]
    test <- cancer[-train_idx, ]

    # Dummy classifier majority class
    acc=DummyClassifierAccuracy(train, test)
    sprintf("Accuracy is %f", acc)
    ## [1] "Accuracy is 0.638498"

    1.2b Dummy classifier – Python code

    This dummy classifier uses the majority class.

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import confusion_matrix

    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

    # Negative class (0) is most frequent
    dummy_majority = DummyClassifier(strategy = 'most_frequent').fit(X_train, y_train)
    y_dummy_predictions = dummy_majority.predict(X_test)

    print('Dummy classifier accuracy on test set: {:.2f}'
          .format(dummy_majority.score(X_test, y_test)))
    ## Dummy classifier accuracy on test set: 0.63

    1.3a – Radial SVM (un-normalized) – R code

    SVMs perform better when the data are normalized or scaled. The 2 examples below show that an SVM with a radial basis kernel on un-normalized data does not perform any better than the dummy classifier.

    library(dplyr)
    library(e1071)
    library(caret)
    library(reshape2)
    library(ggplot2)

    # Radial SVM unnormalized
    train_idx <- trainTestSplit(cancer, trainPercent=75, seed=5)
    train <- cancer[train_idx, ]
    test <- cancer[-train_idx, ]

    # Unnormalized data
    svmfit=svm(target~., data=train, kernel="radial", cost=10, scale=FALSE)
    ypred=predict(svmfit,test)
    confusionMatrix(ypred,test$target)
    ## Confusion Matrix and Statistics
    ##
    ##           Reference
    ## Prediction  0  1
    ##          0  0  0
    ##          1 57 85
    ##
    ##                Accuracy : 0.5986
    ##                  95% CI : (0.5131, 0.6799)
    ##     No Information Rate : 0.5986
    ##     P-Value [Acc > NIR] : 0.5363
    ##
    ##                   Kappa : 0
    ##  Mcnemar's Test P-Value : 1.195e-13
    ##
    ##             Sensitivity : 0.0000
    ##             Specificity : 1.0000
    ##          Pos Pred Value : NaN
    ##          Neg Pred Value : 0.5986
    ##              Prevalence : 0.4014
    ##          Detection Rate : 0.0000
    ##    Detection Prevalence : 0.0000
    ##       Balanced Accuracy : 0.5000
    ##
    ##        'Positive' Class : 0

    1.4b – Radial SVM (un-normalized) – Python code

    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Load the cancer data
    (X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
    X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

    clf = SVC(C=10).fit(X_train, y_train)
    print('Breast cancer dataset (unnormalized features)')
    print('Accuracy of RBF-kernel SVC on training set: {:.2f}'
          .format(clf.score(X_train, y_train)))
    print('Accuracy of RBF-kernel SVC on test set: {:.2f}'
          .format(clf.score(X_test, y_test)))
    ## Breast cancer dataset (unnormalized features)
    ## Accuracy of RBF-kernel SVC on training set: 1.00
    ## Accuracy of RBF-kernel SVC on test set: 0.63

    1.5a – Radial SVM (Normalized) – R Code

The data is scaled (normalized) before fitting the SVM model. The SVM model has two parameters: a) C – a large C means less regularization, while a small C means more regularization; b) gamma – a small gamma gives a smoother, wider decision boundary with more misclassification, while a large gamma gives a tighter decision boundary.

The R code below computes the training and test accuracy as the regularization parameter C is changed; a similar sketch for gamma follows the code.

trainingAccuracy <- NULL
testAccuracy <- NULL
C1 <- c(.01, .1, 1, 10, 20)
for(i in C1){
    svmfit <- svm(target~., data=train, kernel="radial", cost=i, scale=TRUE)
    ypredTrain <- predict(svmfit, train)
    ypredTest <- predict(svmfit, test)
    a <- confusionMatrix(ypredTrain, train$target)
    b <- confusionMatrix(ypredTest, test$target)
    trainingAccuracy <- c(trainingAccuracy, a$overall[1])
    testAccuracy <- c(testAccuracy, b$overall[1])
}
print(trainingAccuracy)
##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
## 0.6384977 0.9671362 0.9906103 0.9976526 1.0000000
print(testAccuracy)
##  Accuracy  Accuracy  Accuracy  Accuracy  Accuracy
## 0.5985915 0.9507042 0.9647887 0.9507042 0.9507042

a <- rbind(C1, as.numeric(trainingAccuracy), as.numeric(testAccuracy))
b <- data.frame(t(a))
names(b) <- c("C1", "trainingAccuracy", "testAccuracy")
df <- melt(b, id="C1")
ggplot(df) + geom_line(aes(x=C1, y=value, colour=variable), size=2) +
    xlab("C (SVC regularization) value") + ylab("Accuracy") +
    ggtitle("Training and test accuracy vs C (regularization)")
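To complement the sweep over C above, a similar loop can be run over gamma at a fixed cost. This is only a rough sketch (not part of the original post); it reuses the train/test split created earlier and the e1071/caret packages already loaded.

# Sketch: sweep gamma at fixed cost (illustrative only)
gammaVals <- c(0.001, 0.01, 0.1, 1, 10)
trainAccGamma <- NULL
testAccGamma <- NULL
for(g in gammaVals){
    fit <- svm(target~., data=train, kernel="radial", cost=1, gamma=g, scale=TRUE)
    trainAccGamma <- c(trainAccGamma,
                       confusionMatrix(predict(fit, train), train$target)$overall[1])
    testAccGamma <- c(testAccGamma,
                      confusionMatrix(predict(fit, test), test$target)$overall[1])
}
data.frame(gamma=gammaVals, train=as.numeric(trainAccGamma), test=as.numeric(testAccGamma))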

    1.5b – Radial SVM (normalized) – Python

    The Radial basis kernel is used on normalized data for a range of ‘C’ values and the result is plotted.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print('Breast cancer dataset (normalized with MinMax scaling)')
trainingAccuracy = []
testAccuracy = []
for C1 in [.01, .1, 1, 10, 20]:
    clf = SVC(C=C1).fit(X_train_scaled, y_train)
    acctrain = clf.score(X_train_scaled, y_train)
    accTest = clf.score(X_test_scaled, y_test)
    trainingAccuracy.append(acctrain)
    testAccuracy.append(accTest)

# Create a dataframe indexed by C
C1 = [.01, .1, 1, 10, 20]
trainingAccuracy = pd.DataFrame(trainingAccuracy, index=C1)
testAccuracy = pd.DataFrame(testAccuracy, index=C1)

# Plot training and test accuracy as a function of C
df = pd.concat([trainingAccuracy, testAccuracy], axis=1)
df.columns = ['trainingAccuracy', 'testAccuracy']
fig1 = df.plot()
fig1 = plt.title('Training and test accuracy vs C (SVC)')
fig1.figure.savefig('fig1.png', bbox_inches='tight')
## Breast cancer dataset (normalized with MinMax scaling)

    Output image: 

    1.6a Validation curve – R code

Sklearn includes a function for creating validation curves: a parameter such as gamma or C is varied and the accuracy is computed and plotted. I did not find an equivalent in R, but I think it is a useful function, so I have created the R equivalent below.

# The R equivalent of np.logspace
seqLogSpace <- function(start, stop, len){
    a <- seq(log10(10^start), log10(10^stop), length=len)
    10^a
}

# Read the data. This is taken from the sklearn cancer data
cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)
set.seed(6)

# Create the range of gamma in log space
param_range = seqLogSpace(-3, 2, 20)
# Initialize the overall training and test accuracy to NULL
overallTrainAccuracy <- NULL
overallTestAccuracy <- NULL

# Loop over the parameter range of gamma
for(i in param_range){
    # Set no of folds
    noFolds <- 5
    # Create the rows which fall into different folds from 1..noFolds
    folds = sample(1:noFolds, nrow(cancer), replace=TRUE)
    # Initialize the training and test accuracy of folds to 0
    trainingAccuracy <- 0
    testAccuracy <- 0
    # Loop through the folds
    for(j in 1:noFolds){
        # The training set is all rows for which the fold index is != j (k-1 folds -> training)
        train <- cancer[folds != j, ]
        # The rows which have j as the fold index become the test set
        test <- cancer[folds == j, ]
        # Create a SVM model for this fold
        svmfit <- svm(target~., data=train, kernel="radial", gamma=i, scale=TRUE)
        # Add up the fold accuracies for training and test separately
        ypredTrain <- predict(svmfit, train)
        ypredTest <- predict(svmfit, test)
        # Create confusion matrices
        a <- confusionMatrix(ypredTrain, train$target)
        b <- confusionMatrix(ypredTest, test$target)
        # Get the accuracy
        trainingAccuracy <- trainingAccuracy + a$overall[1]
        testAccuracy <- testAccuracy + b$overall[1]
    }
    # Compute the average accuracy over the K folds for this value of gamma
    overallTrainAccuracy <- c(overallTrainAccuracy, trainingAccuracy/noFolds)
    overallTestAccuracy <- c(overallTestAccuracy, testAccuracy/noFolds)
}

# Create a dataframe
a <- rbind(param_range, as.numeric(overallTrainAccuracy), as.numeric(overallTestAccuracy))
b <- data.frame(t(a))
names(b) <- c("gamma", "trainingAccuracy", "testAccuracy")
df <- melt(b, id="gamma")
# Plot with a log x-axis
ggplot(df) + geom_line(aes(x=gamma, y=value, colour=variable), size=2) +
    xlab("gamma (RBF kernel)") + ylab("Accuracy") +
    ggtitle("Training and test accuracy vs gamma") +
    scale_x_log10()

    1.6b Validation curve – Python

    Compute and plot the validation curve as gamma is varied.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_cancer)

# Create gamma values from 10^-3 to 10^2 with 20 equally spaced intervals
param_range = np.logspace(-3, 2, 20)
# Compute the validation curve
train_scores, test_scores = validation_curve(SVC(), X_scaled, y_cancer,
                                             param_name='gamma',
                                             param_range=param_range, cv=10)

# Plot the figure
fig2 = plt.figure()
# Compute the mean and standard deviation of the scores
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

fig2 = plt.title('Validation Curve with SVM')
fig2 = plt.xlabel('$\gamma$ (gamma)')
fig2 = plt.ylabel('Score')
fig2 = plt.ylim(0.0, 1.1)
lw = 2
fig2 = plt.semilogx(param_range, train_scores_mean, label='Training score',
                    color='darkorange', lw=lw)
fig2 = plt.fill_between(param_range, train_scores_mean - train_scores_std,
                        train_scores_mean + train_scores_std, alpha=0.2,
                        color='darkorange', lw=lw)
fig2 = plt.semilogx(param_range, test_scores_mean, label='Cross-validation score',
                    color='navy', lw=lw)
fig2 = plt.fill_between(param_range, test_scores_mean - test_scores_std,
                        test_scores_mean + test_scores_std, alpha=0.2,
                        color='navy', lw=lw)
fig2.figure.savefig('fig2.png', bbox_inches='tight')

    Output image: 

    1.7a Validation Curve (Preventing data leakage) – Python code

In the course Applied Machine Learning in Python, the Professor states that applying the same data transformation to the entire dataset causes data leakage: "The proper way to do cross-validation when you need to scale the data is not to scale the entire dataset with a single transform, since this will indirectly leak information into the training data about the whole dataset, including the test data (see the lecture on data leakage later in the course). Instead, scaling/normalizing must be computed and applied for each cross-validation fold separately."

So below I scale the training and test folds separately and plot the result. In the lecture the Professor states that this can also be done using pipelines (see the caret sketch after the code).

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.cross_validation import KFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Read the data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
# Set the parameter range
param_range = np.logspace(-3, 2, 20)
# Set number of folds
folds = 5
# Initialize
overallTrainAccuracy = []
overallTestAccuracy = []

# Loop over the parameter range
for c in param_range:
    trainingAccuracy = 0
    testAccuracy = 0
    kf = KFold(len(X_cancer), n_folds=folds)
    # Partition into training and test folds
    for train_index, test_index in kf:
        # Partition the data according to the fold indices generated
        X_train, X_test = X_cancer[train_index], X_cancer[test_index]
        y_train, y_test = y_cancer[train_index], y_cancer[test_index]
        # Scale X_train and X_test separately within each fold
        scaler = MinMaxScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        # Fit a SVC model for each C
        clf = SVC(C=c).fit(X_train_scaled, y_train)
        # Compute the training and test scores
        acctrain = clf.score(X_train_scaled, y_train)
        accTest = clf.score(X_test_scaled, y_test)
        trainingAccuracy += np.sum(acctrain)
        testAccuracy += np.sum(accTest)
    # Compute the mean training and test accuracy
    overallTrainAccuracy.append(trainingAccuracy/folds)
    overallTestAccuracy.append(testAccuracy/folds)

overallTrainAccuracy = pd.DataFrame(overallTrainAccuracy, index=param_range)
overallTestAccuracy = pd.DataFrame(overallTestAccuracy, index=param_range)

# Plot training and test accuracy as a function of C
df = pd.concat([overallTrainAccuracy, overallTestAccuracy], axis=1)
df.columns = ['trainingAccuracy', 'testAccuracy']
fig3 = plt.title('Validation Curve with SVM')
fig3 = plt.xlabel('C (SVC regularization)')
fig3 = plt.ylabel('Score')
fig3 = plt.ylim(0.5, 1.1)
lw = 2
fig3 = plt.semilogx(param_range, overallTrainAccuracy, label='Training score',
                    color='darkorange', lw=lw)
fig3 = plt.semilogx(param_range, overallTestAccuracy, label='Cross-validation score',
                    color='navy', lw=lw)
fig3 = plt.legend(loc='best')
fig3.figure.savefig('fig3.png', bbox_inches='tight')

    Output image: 
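In R, a similar leakage-free workflow can be obtained with caret, whose preProcess argument is applied within each resampling fold rather than to the full dataset. The snippet below is only a rough sketch of this idea (not part of the original post); it assumes the cancer data frame loaded earlier and requires the kernlab package for method="svmRadial".

# Sketch: per-fold scaling in R via caret's preProcess (illustrative only)
library(caret)
set.seed(6)
ctrl <- trainControl(method="cv", number=5)
# preProcess="range" rescales predictors to [0,1] using only the training part of each fold
svmCV <- train(target~., data=cancer, method="svmRadial",
               preProcess="range", trControl=ctrl, tuneLength=5)
svmCV$results[, c("C", "Accuracy")]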

    1.8 a Decision trees – R code

Decision trees in R can be fitted and plotted using the rpart and rpart.plot packages.

library(rpart)
library(rpart.plot)
# Create a decision tree
m <- rpart(Species~., data=iris)
# Plot the tree
rpart.plot(m, extra=2, main="Decision Tree - IRIS")
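To parallel the Python version below, which also reports training and test accuracy, a rough sketch of the same check in R could look like this (not part of the original post; it reuses the trainTestSplit() helper defined earlier in this series).

# Sketch: training and test accuracy for an rpart tree on iris (illustrative only)
train_idx <- trainTestSplit(iris, trainPercent=75, seed=3)
irisTrain <- iris[train_idx, ]
irisTest <- iris[-train_idx, ]
mTree <- rpart(Species~., data=irisTrain)
# Accuracy on training and test sets
mean(predict(mTree, irisTrain, type="class") == irisTrain$Species)
mean(predict(mTree, irisTest, type="class") == irisTest$Species)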

     

1.8 b Decision trees – Python code

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
import graphviz

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state = 3)
clf = DecisionTreeClassifier().fit(X_train, y_train)

print('Accuracy of Decision Tree classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))

dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
## Accuracy of Decision Tree classifier on training set: 1.00
## Accuracy of Decision Tree classifier on test set: 0.97

1.9a Feature importance – R code

I found the following code, which has a snippet for computing feature importance via caret. Sklearn has a nice method for this. The results in R and Python differ, which is not too surprising: the R code below trains a random forest (method="rf") via caret, while the Python code fits a single decision tree. Any thoughts on a closer comparison are welcome (see also the rpart sketch after the R code).

set.seed(3)
# Load the libraries
library(mlbench)
library(caret)
# Load the dataset
cancer <- read.csv("cancer.csv")
cancer$target <- as.factor(cancer$target)
# Split into predictors and target
data <- cancer[, 1:31]
target <- cancer[, 32]
# Train the model (random forest with cross-validation)
model <- train(data, target, method="rf", preProcess="scale",
               trControl=trainControl(method="cv"))
# Compute variable importance
importance <- varImp(model)
# Summarize importance
print(importance)
# Plot importance
plot(importance)
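For a comparison closer to the single decision tree used in the Python code below, one possibility (a rough sketch, not from the original post) is to fit an rpart tree on the same data and inspect its variable.importance element.

# Sketch: feature importance from a single rpart decision tree (illustrative only)
library(rpart)
treeModel <- rpart(target~., data=cancer)
# Named vector of importance scores, sorted in decreasing order
head(sort(treeModel$variable.importance, decreasing=TRUE), 10)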

1.9b Feature importance – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Read the data
cancer = load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)

# Use the DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 8,
                             random_state = 0).fit(X_train, y_train)
c_features = len(cancer.feature_names)
print('Breast cancer dataset: decision tree')
print('Accuracy of DT classifier on training set: {:.2f}'
      .format(clf.score(X_train, y_train)))
print('Accuracy of DT classifier on test set: {:.2f}'
      .format(clf.score(X_test, y_test)))

# Plot the feature importances
fig4 = plt.figure(figsize=(10,6), dpi=80)
fig4 = plt.barh(range(c_features), clf.feature_importances_)
fig4 = plt.xlabel("Feature importance")
fig4 = plt.ylabel("Feature name")
fig4 = plt.yticks(np.arange(c_features), cancer.feature_names)
fig4 = plt.tight_layout()
plt.savefig('fig4.png', bbox_inches='tight')
## Breast cancer dataset: decision tree
## Accuracy of DT classifier on training set: 0.96
## Accuracy of DT classifier on test set: 0.94

    Output image: 

    1.10a Precision-Recall, ROC curves & AUC- R code

I tried several R packages for plotting Precision-Recall and ROC curves and computing the AUC; PRROC seems to work well. Precision-Recall curves show the tradeoff between precision and recall: the higher the precision, the lower the recall, and vice versa. ROC curves that hug the top left corner indicate high sensitivity and specificity and excellent accuracy. (The AUC values themselves can be read off the PRROC objects; see the note after the ROC code below.)

    source("RFunctions-1.R") library(dplyr) library(caret) library(e1071) library(PRROC) # Read the data (this data is from sklearn!) d <- read.csv("digits.csv") digits <- d[2:66] digits$X64 <- as.factor(digits$X64) # Split as training and test sets train_idx <- trainTestSplit(digits,trainPercent=75,seed=5) train <- digits[train_idx, ] test <- digits[-train_idx, ] # Fit a SVM model with linear basis kernel with probabilities svmfit=svm(X64~., data=train, kernel="linear",scale=FALSE,probability=TRUE) ypred=predict(svmfit,test,probability=TRUE) head(attr(ypred,"probabilities")) ## 0 1 ## 6 7.395947e-01 2.604053e-01 ## 8 9.999998e-01 1.842555e-07 ## 12 1.655178e-05 9.999834e-01 ## 13 9.649997e-01 3.500032e-02 ## 15 9.994849e-01 5.150612e-04 ## 16 9.999987e-01 1.280700e-06 # Store the probability of 0s and 1s m0<-attr(ypred,"probabilities")[,1] m1<-attr(ypred,"probabilities")[,2] # Create a dataframe of scores scores <- data.frame(m1,test$X64) # Class 0 is data points of +ve class (in this case, digit 1) and -ve class (digit 0) #Compute Precision Recall pr <- pr.curve(scores.class0=scores[scores$test.X64=="1",]$m1, scores.class1=scores[scores$test.X64=="0",]$m1, curve=T) # Plot precision-recall curve plot(pr)

# Plot the ROC curve
roc <- roc.curve(m0, m1, curve=TRUE)
plot(roc)
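If only the AUC values are needed, the objects returned by PRROC can be inspected directly; a quick sketch, assuming the pr and roc objects created above:

# Sketch: extract AUC values from the PRROC objects (illustrative only)
pr$auc.integral   # area under the precision-recall curve
roc$auc           # area under the ROC curve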

    1.10b Precision-Recall, ROC curves & AUC- Python code

For Python, Logistic Regression is used to plot the Precision-Recall and ROC curves and to compute the AUC.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve, auc

# Load the digits
dataset = load_digits()
X, y = dataset.data, dataset.target

# Create 2 classes: i) digit 1 (positive class) ii) digit 0 (all other digits)
# Make a copy of the target
z = y.copy()
# Replace all non 1's with 0
z[z != 1] = 0

X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)
# Fit a LR model
lr = LogisticRegression().fit(X_train, y_train)

# Compute the decision scores
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))
# Show the decision_function scores for the first 20 instances
y_score_list

precision, recall, thresholds = precision_recall_curve(y_test, y_scores_lr)
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

# Plot the Precision-Recall curve
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.axes().set_aspect('equal')
plt.savefig('fig5.png', bbox_inches='tight')

# Compute and plot the ROC curve
y_score_lr = lr.fit(X_train, y_train).decision_function(X_test)
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_score_lr)
roc_auc_lr = auc(fpr_lr, tpr_lr)

plt.figure()
plt.xlim([-0.01, 1.00])
plt.ylim([-0.01, 1.01])
plt.plot(fpr_lr, tpr_lr, lw=3, label='LogRegr ROC curve (area = {:0.2f})'.format(roc_auc_lr))
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve (1-of-10 digits classifier)', fontsize=16)
plt.legend(loc='lower right', fontsize=13)
plt.plot([0, 1], [0, 1], color='navy', lw=3, linestyle='--')
plt.axes()
plt.savefig('fig6.png', bbox_inches='tight')

    output

    output

    1.10c Precision-Recall, ROC curves & AUC- Python code

In the code below, classification probabilities are used to compute and plot the Precision-Recall curve, the ROC curve, and the AUC.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import precision_recall_curve

dataset = load_digits()
X, y = dataset.data, dataset.target
# Make a copy of the target
z = y.copy()
# Replace all non 1's with 0
z[z != 1] = 0

X_train, X_test, y_train, y_test = train_test_split(X, z, random_state=0)
svm = LinearSVC()
# Need to use CalibratedClassifierCV to predict probabilities for LinearSVC
clf = CalibratedClassifierCV(svm)
clf.fit(X_train, y_train)
y_proba_lr = clf.predict_proba(X_test)

precision, recall, thresholds = precision_recall_curve(y_test, y_proba_lr[:,1])
closest_zero = np.argmin(np.abs(thresholds))
closest_zero_p = precision[closest_zero]
closest_zero_r = recall[closest_zero]

#plt.figure(figsize=(15,15),dpi=80)
plt.figure()
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.plot(precision, recall, label='Precision-Recall Curve')
plt.plot(closest_zero_p, closest_zero_r, 'o', markersize = 12, fillstyle = 'none', c='r', mew=3)
plt.xlabel('Precision', fontsize=16)
plt.ylabel('Recall', fontsize=16)
plt.axes().set_aspect('equal')
plt.savefig('fig7.png', bbox_inches='tight')

    output

Note: As with other posts in this series on 'Practical Machine Learning with R and Python', this post is based on these two MOOC courses:
1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Stanford Online
2. Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera

    Conclusion

This 4th part looked at SVMs with linear and radial basis kernels, decision trees, the precision-recall tradeoff, ROC curves and AUC.

    Stick around for further updates. I’ll be back!
Comments, suggestions and corrections are welcome.

    Also see
    1. A primer on Qubits, Quantum gates and Quantum Operations
    2. Dabbling with Wiener filter using OpenCV
    3. The mind of a programmer
    4. Sea shells on the seashore
5. yorkr pads up for the Twenty20s: Part 1- Analyzing team's match performance

    To see all posts see Index of posts


    To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

    Capturing the Brexit vote in data

    Sun, 10/29/2017 - 13:27

    (This article was first published on fReigeist » R, and kindly contributed to R-bloggers)

    In my recent research I have worked on understanding the key correlates of Brexit. This paper is joint work with my co-authors Sascha Becker and Dennis Novy and has now been published in Economic Policy. After having gone through the peer-review process, I am very happy to share the data and the underlying code.

    On the construction of a rich local authority level dataset

After the EU Referendum, I started assembling a dataset to study the question of to what extent low turnout among potential Remain voters may help explain the result of the referendum. In particular, back then I looked at the extent to which rainfall may have discouraged people from turning up to vote in the first place. After all, a lot of polls, the financial markets and the betting markets seemed to indicate that Remain would win.

    It was not inconceivable to imagine that turnout may have been particularly low among a set of voters living in the London commuter belt, who were turned off from voting on a working day after an arduous journey into London that was marred by train cancelations due to the bad weather.

The result was that there appears to have been a weak effect on turnout, but this effect was not biased in either direction with regard to voter preferences: Remain would have lost even on a sunny day (we confirm this finding in the new paper with a different rainfall data set).

    In any case, this was the starting point to a significant data collection effort. In particular we have combined data from all the 380 local authorities in England, Scotland and Wales as well as some data of ward level voting across five cities. The resulting set of covariates is quite sizable and they can be roughly grouped into the following four categories:

    1. characteristics of the underlying economic structure of an area, like the unemployment rate, wages or sectoral employment shares and changes therein,
    2. demographic and human capital characteristics like age structure, education or life satisfaction,
    3. exposure to the EU measured by indicators like the level of EU transfers, trade and the growth rate of immigration from other EU countries,
    4. public service provision and fiscal consolidation that covers the share of public employment, as well as the reduction in public spending per capita among others and NHS performance metrics.

To analyse which of these covariates are strong correlates of the Leave vote share in an area, we used a simple machine-learning method that chooses the subset of covariates that best predicts the referendum Leave share. The idea is to build a robust predictive model that achieves significant out-of-sample prediction accuracy.

More formally, best subset selection solves the following optimization problem to obtain an optimal coefficient vector beta that can robustly predict the vote Leave share y_c, for a given subset size k:

\min_{\beta} \sum_{c} (y_c - x_c'\beta)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \mathbf{1}(\beta_j \neq 0) \leq k

While the formulation may be a bit technical, it boils down to estimating all possible combinations of regressors. That means all possible models that include 1, 2, 3, …, p covariates are estimated. This amounts to estimating 2^p models, which very quickly becomes infeasible (see the small illustration below). Lasso and other methods, such as forward and backward stepwise selection, solve approximations to the combinatorial problem that BSS solves. The method is usually taught at the beginning of machine learning courses, as it provides a great way to illustrate the bias-variance tradeoff.
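As a quick back-of-the-envelope illustration (not part of the original post), the number of candidate models explodes very quickly with the number of covariates p:

# Sketch: number of models exhaustive best subset selection must consider (illustrative only)
p <- c(10, 20, 30, 40, 50)
data.frame(p = p, models = 2^p)
# 2^30 is already over a billion candidate models, 2^50 over a quadrillion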

    The result can be summarized in a simple bar chart and is somewhat striking:

     

    R2 of best models with different groups of regressors

What this suggests is that “fundamental factors” such as economic structure, demographics and human capital are the strongest correlates of the Leave vote in the UK; the direct level effects of migration and trade exposure, captured in the group of regressors called “EU Exposure”, seem second order.

It is also striking that even very simple empirical models with just 10 to 20 variables do a good job of capturing the variation in the Leave vote share across the whole of the UK. The best model, which includes variables measuring vote shares of different parties (especially UKIP) from the 2014 European Parliamentary election, captures 95% of the overall variation.

The observation that variables capturing direct exposure to the European Union, in particular immigration, explain so little seems at odds with voter narratives around the EU referendum, which were dominated by the immigration topic. In fact, the 2015 British Election Study suggests that voters considered immigration to be the single most important issue of the time.

    Unigram word cloud constructed out of responses to question “what is single most important issue facing country at the time” from the 2015 BES study

In the word cloud above, the prominence of the keyword immigration is quite striking. Looking at the bigram word cloud drawn from the British Election Study, the themes of the individual responses become even more apparent.

    Bigram word cloud constructed out of responses to question “what is single most important issue facing country at the time” from the 2015 BES study

There, phrases like “many people”, “small island” and “uncontrolled immigration” appear in addition to the simple immigration keyword. Other themes, such as “national debt”, “health service” and concerns about the “welfare system”, also feature prominently. Overall this suggests that immigration may have been a salient feature of the public debate, yet this sits oddly with the fact that the variable groups pertaining to EU immigration capture very little of the variation in the intensity of the Leave vote across the UK.

     

In another paper with Sascha, we confirm this observation in a panel setup. We show that a local area’s exposure to Eastern European immigration has, at best, only a small impact on anti-EU preferences in the UK, as measured by UKIP voting in European Parliamentary elections.

While our code for the Economic Policy paper is implemented in Stata, it is very easy to replicate this work in R. Below is an example of how you would implement BSS in R through the leaps package.

    The leaps package has a nice visualization of which regressors are included in which of the best models. The below example replicates the results in our Table 1.

library(data.table)
library(haven)
library(leaps)

DTA <- data.table(read_dta(file="data/brexitevote-ep2017.dta"))

# Table 1 in the paper would be replicated by running
TABLE1 <- regsubsets(Pct_Leave~InitialEUAccession+MigrantEUAccessionGrowth+
                     InitialOldEUMember+MigrantOldEUMemberGrowth+
                     InitialElsewhere+MigrantElsewhereGrowth+
                     Total_EconomyEU_dependence+eufundspercapitapayment13+
                     EU75Leaveshare, nbest=1, data=DTA)
plot(TABLE1, scale="r2")

     

    Plotting best subset results

    If you want to replicate the word clouds from the 2015 British Election study, the following lines of code will do. You will need to download the British Election Study dataset here.

library(data.table)
library(haven)
library(quanteda)

DTA <- data.table(read_dta(file="data/bes_f2f_original_v3.0.dta"))
C <- corpus(DTA$A1)
docnames(C) <- DTA$finalserialno

# Unigram and bigram document-feature matrices
C.dfmunigram <- dfm(C, tolower = TRUE, stem = FALSE, removeNumbers=TRUE,
                    removePunct=TRUE, remove = stopwords("english"), ngrams=1)
C.dfm <- dfm(C, tolower = TRUE, stem = FALSE, removeNumbers=TRUE,
             removePunct=TRUE, remove = stopwords("english"), ngrams=2)

plot(C.dfmunigram)
plot(C.dfm)

    To leave a comment for the author, please follow the link and comment on their blog: fReigeist » R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
