R news and tutorials contributed by hundreds of R bloggers

Business confidence and economic growth by @ellis2013nz

Tue, 07/31/2018 - 20:30

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

The ANZ bank have a nice bit of publicity for themselves each month in New Zealand with the release of the results of their monthly Business Outlook Survey. This week it caused a bit of a stir, with the ANZ's own commentary reporting that the New Zealand corporate sector is "in a funk" (ANZ's words), with a net 45 percentage points of businesses pessimistic about the economy (while a net positive 4% were positive about their own activity). This led to some animated discussion on Twitter and elsewhere about what this measure actually means.

I set out to explore the value of business confidence as a measure of what the economy is actually going to do. The ANZ’s data back to the 1970s are available from Trading Economics, but for non-subscribers this is only as a graphic, slightly inconvenient for my purposes. David Hood pointed out that the OECD have a monthly business confidence measure for many countries. The ultimate provenance of the data with regard to New Zealand isn’t clear to me, but it follows a similar pattern to the ANZ (not identical, and on a different scale) and it is publicly available so I thought I would work with that.

All the code for today’s post is at the bottom of the post.

Introducing “Business Confidence”

Here's the business confidence series reported by the OECD as part of its leading indicators collection. Because a crucial part of the controversy is whether businesses report (and possibly feel) their confidence differently according to their personal political views of the government of the day, I've coloured this chart by the party of the Prime Minister for most of the month that the survey relates to.

Business confidence is usually reported as “percent of optimists minus percent of pessimists”, which in principle is symmetric around zero. It’s not obvious how the OECD have converted such a measure (if they have) into something centred around 100.

The data go back to the 1960s. The dramatic drop reported by the ANZ in the last week hasn’t yet percolated to the OECD database, if indeed the ANZ are one of the sources for the OECD data. The OECD data go to June 2018, which is extremely up to date for data of this sort.

The most interesting thing with this graph is probably the historical matter – how confidence used to have a semi-regular and large cycle, but since the 1990s the regularity has gone (as has some of the volatility). Those big patterns in the first few decades that look like seasonality actually reflect some other cycle, longer than a year; a cycle that no longer seems to apply with such regularity. There were indeed dramatic changes in the economy, and in government-business relations, in the 1980s and 1990s that could explain this. But I'd also want to check up on how the survey is delivered, and particularly how things are converted to this index, before drawing much from that.

There’s a small component of seasonal regularity, changing over time, which we can see in this LOESS-based seasonal decomposition:

In the analysis later on, I made a seasonally adjusted version of confidence, and took the mean value over a quarter as the quarterly value.
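In condensed form (the full script at the bottom of the post does this properly, including building bc_nz from the OECD CSV), the seasonal adjustment and quarterly roll-up look roughly like this:

library(seasonal)
library(dplyr)

# monthly OECD confidence for New Zealand, as a time series
bc_ts <- ts(bc_nz$Value, start = c(1961, 6), frequency = 12)
bc_nz$bc_sa <- final(seas(bc_ts))       # seasonally adjusted version of confidence

bc_q <- bc_nz %>%
  group_by(yr, qtr) %>%                 # yr and qtr columns as built in the full code below
  summarise(bc = mean(bc_sa))           # quarterly value = mean of the monthly values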

Economic growth

I decided the quarterly chain-volume (ie adjusted for price differences) GDP production measure would be a good one to compare to confidence levels. I manually downloaded this as a CSV from Stats NZ’s Infoshare tool. The published data naturally shows very strong non-stationarity (ie growth, with increasing population and productivity) and much stronger seasonality:

Seasonality in the national accounts is driven by many things, with tourism a particularly big contributor in New Zealand.

I used the same methods (X13-SEATS-ARIMA, via Christoph Sax's seasonal R package – with defaults such as adjusting for Easter moving from quarter to quarter) to seasonally adjust these quarterly GDP totals as I did for business confidence. Then I used the seasonally adjusted totals as the basis for growth quarter-to-quarter:

Note that these percentage rates are just from one quarter to the next, not the annualised growth rates that get most attention in policy discussions.

Stats NZ publish this series only back to 1987.
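Again in condensed form (the full code at the bottom of the post reads the Infoshare CSV into gdp_q and does some extra tidying), the adjustment and growth calculation look roughly like this:

library(seasonal)

gdp_ts <- ts(gdp_q$gdp_p_cv, start = c(1987, 3), frequency = 4)
gdp_q$gdp_sa <- final(seas(gdp_ts))   # X13-SEATS-ARIMA; defaults include the Easter adjustment

n <- nrow(gdp_q)
gdp_q$growth <- c(NA, gdp_q$gdp_sa[2:n] / gdp_q$gdp_sa[1:(n - 1)] - 1)  # quarter on quarter, not annualised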

Combining the two

The hope/expectation of business confidence as a leading indicator is that it tells us something about future GDP. If we look at the relationship between business confidence and GDP 1, 2, 3 and 4 quarters ahead, we see that there is indeed a positive statistical relationship between the two:

Connected scatter plots like those above are probably the best exploratory tool for this sort of bivariate time series, but they're too ugly for lots of communication purposes. In this case, we can see that generally, when business confidence is higher, economic growth in the current period and the next four periods forward (with gradually weakening relationships) is likely to be higher too. However, there's a lot of noise in the relationship; check out, for example, the top left quadrant of any of the facets comparing business confidence (at various lags) with GDP, which shows periods when business confidence was low but economic growth ended up being high.

It’s worth noting that a fair chunk of the noise seems to come from the older data (darker in the image above).

As a reference point, I’ve included the lagged value of quarterly growth itself. Later on, I’ll be looking to see if business confidence helps us predict GDP more than just knowing the historical values of growth does, so this is the right thing to compare it to. The relationship between growth and its own lagged value is stronger than between growth and business confidence, as seen visually in the connected scatter plots above but also in simple correlations:
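The correlations themselves can be computed directly from the combined quarterly data frame (comb in the full code at the bottom of the post); a condensed version of that calculation is:

library(dplyr)

comb %>%
  select(growth, growth_lag1, bc, bc_lag1, bc_lag2, bc_lag3, bc_lag4) %>%
  cor(use = "complete.obs") %>%     # drop the one row with a missing lagged growth value
  round(2)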

How does it look when we consider periods with a Labour Prime Minister separately from periods with one from National (the only two parties to provide Prime Ministers in the time considered)? Visually, it looks like the relationship between business confidence and subsequent economic growth figures is stronger when National (centre-right on the political spectrum and associated with being pro-business) is in power or leading a coalition rather than Labour (centre-left):

This fits with a convenient narrative of business owners erring on the side of pessimism when Labour are in power but being more rational in their expectations with a National government. However, the statistical modelling discussed in the next section didn’t find significant evidence of this or other party effects (on growth, on confidence, or on the relationship between growth and confidence). So we leave this one as a non-conclusion.

Modelling growth on confidence (and politics)

I fit three different models to the data exploring how best to explain future economic growth. These were:

  • model_simp – treat it as a univariate time series problem, and only use past values of economic growth to explain the future
  • model_apol – an apolitical model which adds lagged business confidence (at 1, 2, 3 and 4 quarter lags) as explanatory variables but does not take into account who was in political power
  • model_full – which adds a dummy indicator for periods with a Labour Prime Minister, and also an interaction of that indicator with the first quarter lag of business confidence

All the models were fit with auto.arima from Rob Hyndman’s forecast R package, which uses a well-tested algorithm to choose the best time series approach (combination of moving average randomness and lagged values of the response variable) for the response variable itself, in this case quarterly growth in seasonally adjusted real GDP.
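Condensed from the full code at the bottom of the post, the three fits look roughly like this (growth_ts is the quarterly growth series and comb the combined data frame; here I pass the regressors as a matrix, which recent versions of forecast require):

library(dplyr)
library(forecast)

xreg <- comb %>%
  mutate(labour_ind = as.integer(pm_party == "Labour"),
         bc_lag1_lab = labour_ind * bc_lag1) %>%
  select(bc_lag1:bc_lag4, labour_ind, bc_lag1_lab) %>%
  as.matrix()

model_simp <- auto.arima(growth_ts)                          # univariate only
model_apol <- auto.arima(growth_ts, xreg = xreg[, 1:4])      # adds lagged business confidence
model_full <- auto.arima(growth_ts, xreg = xreg)             # adds the Labour dummy and interaction

AIC(model_simp, model_apol, model_full)                      # lower is better; model_apol wins here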

Time series data is not worth as much, data point for data point, as is data that is sampled independently of a correlating dimension such as time. The modelling in this section takes this into account (unlike far too much applied statistical/econometric work), and this adds some much-needed conservatism to the inference process. Graphics such as those shown above are likely to lead to hasty conclusions if they are interpreted as though the data are all independently selected and equally valuable, rather than containing a lot of redundant information from the correlation over time.

The model_apol was the strongest of the three (with the lowest AIC). This was counter to my expectation that the simple univariate model would win. This is a vote of confidence for business confidence measures having something to add to understanding of future economic growth. This exercise doesn’t compare them to other leading indicators (such as commodity and currency prices, to pick two of the most available and pertinent measures, particularly for an export-oriented economy like New Zealand with very large dairy and tourism industries), and it’s quite possible that business confidence is picking up signals from those other measures. But it certainly is picking up something that is of value.

An increase in business confidence of 1 point on the OECD scale (which would be a bit over one standard deviation of changes in the past decade or so) is estimated to lead to an increase in quarterly economic growth of somewhere between 0.0 and 0.3 percentage points – for example, from real seasonally adjusted quarterly growth of 1.00% to 1.15% (or to anywhere between 1.00% and 1.30%), all else being equal. So plunges in business confidence are not worth panicking over, but shouldn’t be ignored either.

Modelling confidence on growth (and politics)

I also tried flipping the problem around. Can we predict business confidence, knowing just past values of business confidence, the political party in power, and current and past values of GDP? Following the same process as above, I came to a slightly surprising conclusion – neither the party in power, nor current and lagged values of GDP growth, nor a combination of the two, adds materially to a model trying to explain business confidence. My best model to explain business confidence was a simple univariate time series model. In other words, the best predictor (of the very limited set of candidates) of business confidence is past business confidence.
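The reversed models follow the same pattern (again a condensed sketch of the full code below):

library(forecast)

bc_sub_ts <- ts(comb$bc, frequency = 4, start = c(1987, 3))
xreg2 <- as.matrix(data.frame(growth      = comb$growth,
                              growth_lag1 = comb$growth_lag1,
                              labour_ind  = as.integer(comb$pm_party == "Labour")))

model_nothing <- auto.arima(bc_sub_ts)                  # univariate
model_all     <- auto.arima(bc_sub_ts, xreg = xreg2)    # adds growth, lagged growth and party
AIC(model_nothing, model_all)                           # the univariate model comes out best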

Code library(tidyverse) library(stringr) library(lubridate) library(forecast) library(seasonal) library(nzelect) # for party colours library(scales) #--------------download data that was manually downloaded earlier-------------- download.file("https://github.com/ellisp/blog-source/blob/master/_working/DP_LIVE_01082018094138947.csv?raw=true", destfile = "DP_LIVE_01082018094138947.csv") download.file("https://raw.githubusercontent.com/ellisp/blog-source/master/_working/SNE445001_20180801_075329_92.csv", destfile = "SNE445001_20180801_075329_92.csv") #--------------prep - who was in power?------------- # Dates taken from https://en.wikipedia.org/wiki/List_of_Prime_Ministers_of_New_Zealand # a data frame of who was in power for every day over a 70 year period: power <- data_frame( date = as.Date(c( "12/12/1957", "12/12/1960", "8/12/1972", "12/12/1975", "26/07/1984", "2/11/1990", "5/12/1999", "19/11/2008", "26/10/2017"), format = "%d/%m/%Y"), pm_party = c( "Labour", "National", "Labour", "National", "Labour", "National", "Labour", "National", "Labour"), # the pm_id identifier is used later in grouping data together to avoid annoying connecting lines # across the years: pm_id = 1:9 ) %>% right_join( data_frame(date = as.Date("1957-12-11") + 1:(70 * 365)), by = "date" ) %>% fill(pm_party, pm_id) %>% mutate(qtr = quarter(date), yr = year(date), mon = month(date)) # roll up by month for later use: power_m <- power %>% group_by(yr, mon, pm_party, pm_id) %>% summarise(freq = n()) %>% group_by(yr, mon) %>% summarise(pm_party = pm_party[freq == max(freq)], pm_id = pm_id[freq == max(freq)]) # roll up by quarter for later use: power_q <- power %>% group_by(yr, qtr, pm_party, pm_id) %>% summarise(freq = n()) %>% group_by(yr, qtr) %>% summarise(pm_party = pm_party[freq == max(freq)], pm_id = pm_id[freq == max(freq)]) #--------------------Monthly business confidence data---------------------- # See https://data.oecd.org/leadind/business-confidence-index-bci.htm # Data were downloaded as CSV non-programmatically. 
# On a different scale and a bit different in shape (but related) to: # https://tradingeconomics.com/new-zealand/business-confidence bc <- read.csv("DP_LIVE_01082018094138947.csv", stringsAsFactors = FALSE) names(bc)[1] <- "LOCATION" # read.csv corrupts the first name on the way in bc_nz <- bc %>% filter(LOCATION == "NZL") %>% mutate(Time = as.Date(paste0(TIME, "-15"), format = "%Y-%m-%d"), yr = year(Time), mon = month(Time), qtr = quarter(Time)) %>% select(-INDICATOR, -SUBJECT, -MEASURE, -FREQUENCY) %>% as_tibble() %>% left_join(power_m, by = c("yr", "mon")) bc_nz %>% ggplot(aes(x = Time, y = Value, colour = pm_party, group = pm_id)) + geom_step() + labs(x = "", y = "Amplitude-adjusted business confidence index\nlong term average = 100", caption = "Source: OECD, Business Confidence Index; Wikipedia, Prime Ministers of New Zealand") + ggtitle("Monthly business confidence in New Zealand") + scale_colour_manual("Prime Minister's party:", values = parties_v) # There's a small monthly seasonality present in business confidence bc_ts <- ts(bc_nz$Value, start = c(1961, 6), frequency = 12) par(family = main_font, font.main = 1) plot(stl(bc_ts, s.window = 7), main = "Weak and variable seasonality in New Zealand monthly business confidence") tsdisplay(bc_ts) # not shown # seasonally adjusted version of business confidence bc_nz$bc_sa <- final(seas(bc_ts)) #----------------quarterly GDP----------------------------- # Total GDP, chain volume, production measure: gdp <- read.csv("SNE445001_20180801_075329_92.csv", stringsAsFactors = FALSE, skip = 1) names(gdp) <- c("yr_qtr", "gdp_p_cv") gdp_q <- gdp %>% mutate(qtr = substring(yr_qtr, 6, 6), yr = substring(yr_qtr, 1, 4), yr_num = as.numeric(yr) + as.numeric(qtr) / 4 - 0.125) %>% filter(yr == as.numeric(yr)) %>% arrange(yr, qtr) # Much stronger seasonality in the GDP growth rates than there was in the business confidence: gdp_ts <- ts(gdp_q$gdp_p_cv, start = c(1987, 3), frequency = 4) par(family = main_font, font.main = 1) plot(stl(gdp_ts, s.window = 7), main = "Strong and consistent seasonality in New Zealand quarterly GDP") tsdisplay(gdp_ts) # not shown in blog # create a seasonally adjusted version of the volume series, which we'll use for growth rates gdp_sa <- final(seas(gdp_ts)) gdp_q$gdp_sa <- gdp_sa # note that this next method only works because it isn't a tibble. Viva la base. 
n <- nrow(gdp_q) gdp_q$growth <- c(NA, gdp_q[2:n, "gdp_sa"] / gdp_q[1:(n - 1), "gdp_sa"] - 1) # remove the first row, which is NA for growth (no comparison possible): gdp_q <- gdp_q[-1, ] # make a time series out of it for use later: growth_ts <- ts(gdp_q$growth, start = c(1987, 3), frequency = 4) # add a lagged growth series for some graphic comparisons: gdp_q <- gdp_q %>% mutate(growth_lag1 = c(NA, growth[-n()])) %>% mutate(yr = as.integer(yr), qtr = as.integer(qtr)) %>% left_join(power_q, by = c("yr", "qtr")) ggplot(gdp_q, aes(x = yr_num, y = growth, colour = pm_party, group = pm_id)) + geom_line() + labs(x = "", y = "Quarterly growth in seasonally adjusted GDP\n(Compares each quarter to the previous; not annualised)", caption = "Source: Stats NZ, chain volume GDP, production measure") + scale_y_continuous(label = percent) + scale_colour_manual("Prime Minister's party:", values = parties_v) + ggtitle("Growth rates in New Zealand GDP over time") #--------------combining the two series------------- # First we need to make a quarterly version of the monthly business confidence data bc_q <- bc_nz %>% group_by(yr, qtr) %>% summarise(bc = mean(bc_sa)) %>% ungroup() %>% mutate(bc_lag1 = c(NA, bc[-n()]), bc_lag2 = c(NA, bc_lag1[-n()]), bc_lag3 = c(NA, bc_lag2[-n()]), bc_lag4 = c(NA, bc_lag3[-n()])) # then we merge this with the quarterly GDP: comb <- gdp_q %>% inner_join(bc_q, by = c("yr", "qtr")) %>% as_tibble() # Draw a connected scatter plot: comb %>% select(yr_num:bc_lag4) %>% gather(variable, value, -yr_num, -growth, -gdp_sa, -pm_party, -pm_id) %>% ggplot(aes(y = growth, x = value, colour = yr_num)) + scale_colour_viridis(option = "B") + facet_wrap(~variable, scale = "free_x") + geom_path() + geom_smooth(method = "lm") + scale_y_continuous(label = percent)+ labs(y = "Quarterly growth in seasonally adjusted real GDP", x = "Value of business confidence, or lagged business confidence, or lagged quarterly growth", caption = "Source: OECD (business confidence) and Stats NZ (chain volume GDP, production measure)") + ggtitle("Relationship between business confidence and economic growth", "Business confidence ('bc') is positively correlated with future economic growth,\nbut not as strongly as past values of growth itself.") # Let's quantify those correlations: cors <- comb %>% select(growth:bc_lag4, -pm_party, -pm_id) %>% slice(2:n()) %>% cor %>% as.data.frame() cors[ , "correlate"] <- row.names(cors) cors %>% filter(correlate != "growth") %>% mutate(correlate = fct_reorder(correlate, growth)) %>% ggplot(aes(x = growth, y = correlate)) + geom_point() + labs(x = "Correlation coefficient with real quarterly growth in seasonally adjusted GDP", caption = "Source: OECD (business confidence) and Stats NZ (chain volume GDP, production measure)") + ggtitle("Relationship between business confidence and economic growth", "Business confidence is positively correlated with future economic growth, but not as strongly as is past values of growth itself.") # alternative version of the connected scatter plot, this time with political party in power: comb %>% select(yr_num:bc_lag4) %>% gather(variable, value, -yr_num, -growth, -gdp_sa, -pm_party, -pm_id) %>% ggplot(aes(y = growth, x = value)) + geom_path(aes(colour = pm_party, group = pm_id)) + geom_smooth(method = "lm") + facet_grid(pm_party ~ variable, scales = "free_x") + labs(y = "Quarterly growth in seasonally adjusted real GDP", x = "Value of business confidence, or lagged business confidence, or lagged quarterly growth", caption = "Source: OECD (business 
confidence) and Stats NZ (chain volume GDP, production measure)") + ggtitle("Relationship between business confidence and economic growth", "The relationship between self-reported business confidence and GDP *looks* stronger under National than Labour. However, the evidence isn't significant of Party of the Prime Minister impacting on either GDP or business confidence") + scale_colour_manual("Prime Minister's party:", values = parties_v) + scale_y_continuous(label = percent) #----------------does business confidence help predict growth?------------ # Modelling to see if the lagged business confidence adds anything to a simple univariate time series model of GDP: xreg <- comb %>% mutate(labour_ind = as.integer(pm_party == "Labour")) %>% select(bc_lag1:bc_lag4, labour_ind) %>% mutate(bc_lag1_lab = labour_ind * bc_lag1) # Full model predicts growth including with Labour PM as a dummy, and interacting with the first level lag: model_full <- auto.arima(growth_ts, xreg = xreg) # Apolitical model ignores who is PM: model_apol <- auto.arima(growth_ts, xreg = xreg[ , c(1:4)]) # Simple model just treats growth as a univariate time series, says other information has nothing to offer: model_simp <- auto.arima(growth_ts) # Business confidence does help. The best model (lowest AIC) for predicting GDP ignores who is PM, # but does included lagged values of business confidence (ie model_apol): AIC(model_full, model_apol, model_simp) model_apol model_full # the estimate of the bc_lag1 coefficient in the model_full is 0.0018. So an increase in business confidence by 1 on the OECD # scale would increase forecast quarterly GDP by 0.18 percentage points (eg from 1.00% to 1.18%). 1 is a # reasonable increase - about 1.1 standard deviations (using last 20 years' variability as a baseline) # sd(tail(bc_ts, 80)) # blog reports results for model_apol, which has the lowest AIC. #---------------------can party explain business confidence?------------------- # On the other hand, it's worth noting that lagged and current values of GDP growth # do *not* at all appear to be much related to business confidence xreg <- comb %>% mutate(labour_ind = as.integer(pm_party == "Labour")) %>% select(growth, growth_lag1, labour_ind) bc_sub_ts <- ts(comb$bc, frequency = 4, start = c(1987, 3)) model_all <- auto.arima(bc_sub_ts, xreg = xreg) model_apol <- auto.arima(bc_sub_ts, xreg = xreg[ ,1:2]) model_lab_only <- auto.arima(bc_sub_ts, xreg = xreg[ , 3]) model_nothing <- auto.arima(bc_sub_ts) AIC(model_all, model_apol, model_lab_only, model_nothing) # model_nothing is lowest


New R package debugr – use automatic debug messages to improve your code

Mon, 07/30/2018 - 23:41

(This article was first published on Topics in R, and kindly contributed to R-bloggers)

debugr is a new package designed to support debugging in R. It mainly provides the dwatch() function which prints a debug output to the console or to a file. A debug output can consist of a static text message, the values of one or more objects (potentially transformed by applying some functions) or the value of one or multiple (more complex) R expressions. Whether or not a debug message is displayed can be made dependent on the evaluation of a criterion phrased as an R expression. Generally, debug messages are only shown if the debug mode is activated. The debug mode is activated and deactivated with debugr_switchOn() and debugr_switchOff(), respectively, which change the logical debugr.active value in the global options. Since debug messages are only displayed in debug mode, the dwatch() function calls can even remain in the original code as they remain silent and won’t have any effect until the debug mode is switched on again. Read more:
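As a rough illustration of the workflow described above (the argument names msg, objs and crit below are assumptions based on this description, not verified against the package; see ?dwatch for the real interface):

library(debugr)

myfun <- function(x) {
  y <- x^2
  # Print a message and the value of y, but only while debug mode is active
  # (and, assuming a criterion argument exists, only when it evaluates to TRUE)
  dwatch(msg = "inside myfun", objs = c("y"), crit = "y > 100")
  invisible(y)
}

debugr_switchOn()     # turn debug mode on: dwatch() calls now produce output
myfun(20)
debugr_switchOff()    # turn it off again: the same calls stay silent
myfun(20)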



Long Running Tasks With Shiny: Challenges and Solutions

Mon, 07/30/2018 - 21:08

(This article was first published on Fells Stats, and kindly contributed to R-bloggers)

One of the great additions to the R ecosystem in recent years is RStudio's Shiny package. With it, you can easily whip up and share a user interface for a new statistical method in just a few hours. Today I want to share some of the methods and challenges that come up when the actual computation of a result takes a non-trivial amount of time (e.g. >5 seconds).

First Attempt

Shiny operates in a reactive programming framework. Fundamentally this means that any time any UI element that affects the result changes, so does the result. This happens automatically, with your analysis code running every time a widget is changed. In a lot of cases, this is exactly what you want and it makes Shiny programs concise and easy to make; however in the case of long running processes, this can lead to frozen UI elements and a frustrating user experience.

The easiest solution is to use an Action Button and only run the analysis code when the action button is clicked. Another important component is to provide your user with feedback as to how long the analysis is going to take. Shiny has nice built in progress indicators that allow you to do this.


library(shiny)

# Define UI for application that draws a histogram
ui <- fluidPage(
  # Application title
  titlePanel("Long Run"),
  # Sidebar with a slider input for number of bins
  sidebarLayout(
    sidebarPanel(
      actionButton('run', 'Run')
    ),
    # Show a plot of the generated distribution
    mainPanel(
      tableOutput("result")
    )
  )
)

server <- function(input, output) {
  N <- 10
  result_val <- reactiveVal()
  observeEvent(input$run, {
    result_val(NULL)
    withProgress(message = 'Calculation in progress', {
      for(i in 1:N){
        # Long Running Task
        Sys.sleep(1)
        # Update progress
        incProgress(1/N)
      }
      result_val(quantile(rnorm(1000)))
    })
  })
  output$result <- renderTable({
    result_val()
  })
}

# Run the application
shinyApp(ui = ui, server = server)

 

 

The above implementation has some of the things we want out of our interface:

  • The long running analysis is only executed when “Run” is clicked.
  • Progress is clearly displayed to the user.

It does have some serious downsides though:

  • If “Run” is clicked multiple times, the analysis is run back to back. A frustrated user can easily end up having to abort their session because they clicked too many times.
  • There is no way to cancel the calculation. The session’s UI is locked while the computation takes place. Often a user will realize that some of the options they’ve selected are incorrect and will want to restart the computation. With this interface, they will have to wait however long the computation takes before they can fix the issue.
  • The whole server is blocked while the computation takes place. If multiple users are working with the app, the UIs of all users are frozen while any one user has an analysis in progress.
A Second Attempt With Shiny Async

Having the whole server blocked is a big issue if you want to have your app scale beyond a single concurrent user. Fortunately, Shiny's new support for asynchronous processing can be used to remove this behavior. Instead of assigning a value to the reactive value 'result_val', we will create a promise to execute the analysis in the future (using the future function) and, when it is done, assign it to result_val (using %...>%).


library(shiny)
library(promises)
library(future)
plan(multiprocess)

# Define UI for application that draws a histogram
ui <- fluidPage(
  # Application title
  titlePanel("Long Run Async"),
  # Sidebar with a slider input for number of bins
  sidebarLayout(
    sidebarPanel(
      actionButton('run', 'Run')
    ),
    # Show a plot of the generated distribution
    mainPanel(
      tableOutput("result")
    )
  )
)

server <- function(input, output) {
  N <- 10
  result_val <- reactiveVal()
  observeEvent(input$run, {
    result_val(NULL)
    future({
      print("Running...")
      for(i in 1:N){
        Sys.sleep(1)
      }
      quantile(rnorm(1000))
    }) %...>% result_val()
  })
  output$result <- renderTable({
    req(result_val())
  })
}

# Run the application
shinyApp(ui = ui, server = server)

 

 

When “Run” is clicked, the UI is now blocked only for the individual performing the analysis. Other users will be able to perform analyses of their own concurrently. That said, we still have some undesirable properties:

  • If “Run” is clicked multiple times, the analysis is run back to back.
  • There is no way to cancel the calculation.
  • The user cannot monitor progress. Shiny’s progress bar updates do not support calling them from within future, so we’ve had to remove the progress bar from the UI. This is not a huge problem for tasks that take a few seconds, but for those that take minutes or hours, not knowing how long until the results show up can be very frustrating.
Third Time Is the Charm

In order to solve the cancel and monitoring problems, we need to be able to communicate between the app and the inside of the promise. This can be accomplished with the use of a file, where progress and interrupt requests are read and written. If the user clicks the cancel button, “interrupt” is written to the file. During the course of the computation the analysis code checks whether interrupt has been signaled and if so, throws an error. If no interrupt has been requested, the analysis code writes its progress to the file. If Status is clicked, Shiny reads the file and notifies the user of its contents.

The last addition to the code is to create a reactive value nclicks that prevents the Run button from triggering multiple analyses.

 


library(shiny)
library(promises)
library(future)
plan(multiprocess)

ui <- fluidPage(
  titlePanel("Long Run Stoppable Async"),
  sidebarLayout(
    sidebarPanel(
      actionButton('run', 'Run'),
      actionButton('cancel', 'Cancel'),
      actionButton('status', 'Check Status')
    ),
    mainPanel(
      tableOutput("result")
    )
  )
)

server <- function(input, output) {
  N <- 10

  # Status File
  status_file <- tempfile()

  get_status <- function(){
    scan(status_file, what = "character", sep = "\n")
  }

  set_status <- function(msg){
    write(msg, status_file)
  }

  fire_interrupt <- function(){
    set_status("interrupt")
  }

  fire_ready <- function(){
    set_status("Ready")
  }

  fire_running <- function(perc_complete){
    if(missing(perc_complete)){
      msg <- "Running..."
    } else {
      msg <- paste0("Running... ", perc_complete, "% Complete")
    }
    set_status(msg)
  }

  interrupted <- function(){
    get_status() == "interrupt"
  }

  # Delete file at end of session
  onStop(function(){
    print(status_file)
    if(file.exists(status_file)) unlink(status_file)
  })

  # Create Status File
  fire_ready()

  nclicks <- reactiveVal(0)
  result_val <- reactiveVal()
  observeEvent(input$run, {

    # Don't do anything if analysis is already being run
    if(nclicks() != 0){
      showNotification("Already running analysis")
      return(NULL)
    }

    # Increment clicks and prevent concurrent analyses
    nclicks(nclicks() + 1)

    result_val(data.frame(Status = "Running..."))

    fire_running()

    result <- future({
      print("Running...")
      for(i in 1:N){

        # Long Running Task
        Sys.sleep(1)

        # Check for user interrupts
        if(interrupted()){
          print("Stopping...")
          stop("User Interrupt")
        }

        # Notify status file of progress
        fire_running(100 * i / N)
      }

      # Some results
      quantile(rnorm(1000))
    }) %...>% result_val()

    # Catch interrupt (or any other error) and notify user
    result <- catch(result, function(e){
      result_val(NULL)
      print(e$message)
      showNotification(e$message)
    })

    # After the promise has been evaluated set nclicks to 0 to allow for another Run
    result <- finally(result, function(){
      fire_ready()
      nclicks(0)
    })

    # Return something other than the promise so shiny remains responsive
    NULL
  })

  output$result <- renderTable({
    req(result_val())
  })

  # Register user interrupt
  observeEvent(input$cancel, {
    print("Cancel")
    fire_interrupt()
  })

  # Let user get analysis progress
  observeEvent(input$status, {
    print("Status")
    showNotification(get_status())
  })
}

# Run the application
shinyApp(ui = ui, server = server)

 

 

All three of the problems with the original async code have been solved with this implementation. That said, some care should be taken when using async operations like this. It is possible for race conditions to occur, especially if you have multiple “Run” buttons in a single app.

 

Final Thoughts

It is great that Shiny now supports Asynchronous programming. It allows applications to be scaled much more easily, especially when long running processes are present. Making use of these features does add some complexity. The final implementation has ~ 3 times more lines of code compared to the first (naive) attempt.

It is less than ideal that the user has to click a button to get the status of the computation. I’d much prefer it if we were able to remove this button and just have a progress bar; however this is currently not possible within Shiny proper, though it might be achievable to inject some kludgy javascript magic to get a progress bar.

 

 

 



A Certification for R Package Quality

Mon, 07/30/2018 - 18:35

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

There are more than 12,000 packages for R available on CRAN, and many others available on Github and elsewhere. But how can you be sure that a given R package follows best development practices for high-quality, secure software?

Based on a recent survey of R users related to challenges in selecting R packages, the R Consortium now recommends a way for package authors to self-validate that their package follows best practices for development. The CII Best Practices Badge Program, developed by the Linux Foundation's Core Infrastructure Initiative, defines a set of criteria that open-source software projects should follow for quality and security. The criteria relate to open-source license standards, documentation, secure development and delivery practices, version control and bug reporting practices, build and test processes, and much more. R packages that meet these standards signal to R users that they can be relied upon, and as the standards document notes:

There is no set of practices that can guarantee that software will never have defects or vulnerabilities; even formal methods can fail if the specifications or assumptions are wrong. Nor is there any set of practices that can guarantee that a project will sustain a healthy and well-functioning development community. However, following best practices can help improve the results of projects. For example, some practices enable multi-person review before release, which can both help find otherwise hard-to-find technical vulnerabilities and help build trust and a desire for repeated interaction among developers from different organizations.

Developers can self-validate their packages (free of charge) using an online tool to check their adherence to the standards and get a "passing", "silver" or "gold" rating. Passing projects are eligible to display the badge (shown above) as a sign of quality and security, and almost 200 projects qualify as of this writing. A number of R packages have already gone through or started the certification process, including ggplot2, the R Consortium projects covr and DBI, and the built-in R package Matrix. For more details on how R packages can get to a passing certification, and the R Consortium survey that led to the recommendation, see the R Consortium blog post at the link below.

R Consortium: Should R Consortium Recommend CII Best Practices Badge for R Packages: Latest Survey Results

 



Podcast on Nonclinical Statistics

Mon, 07/30/2018 - 17:47

(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

Hugo Bowne-Anderson and I spoke about data science in pharmaceuticals, the tidyverse, and more for the excellent DataFramed podcast from DataCamp. Listen to it here or through your favorite blogging app.



R Consortium Proposal Accepted!

Mon, 07/30/2018 - 15:00

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

Today I am happy to announce that my proposal to the R Consortium was accepted!

I first announced that I was submitting a proposal back in March. As a reminder, the proposal has two distinct parts:

  1. Creating a guide to working with US Census data in R.
  2. Creating an R Consortium Working Group focused on US Census Data.

If you’d like to read the proposal in its entirety, you can do so here.

I am currently planning to develop the "Census Guide" in public using GitHub. If you'd like to follow along with the development, you can do so by visiting the GitHub repository and clicking the "watch" button:

View the GitHub Repository

The post R Consortium Proposal Accepted! appeared first on AriLamstein.com.



Rj Editor – Analyse your data with R in jamovi

Mon, 07/30/2018 - 10:00

(This article was first published on jamovi, and kindly contributed to R-bloggers)

tl;dr
  • Rj Editor lets you analyse data in jamovi with R, and make use of your favourite R packages from within the jamovi statistical spreadsheet
  • jmvconnect makes it easy to access jamovi data sets from R

Rj is a new module for the jamovi statistical spreadsheet that allows you to use the R programming language to analyse data from within jamovi. Although jamovi is already built on top of R, and all the analyses it provides are written in R, to date it hasn’t been possible to enter R code directly. Rj changes that.

There are many reasons you might want to do this; there are a lot (thousands!) of analyses available in R packages that haven’t been made available as jamovi modules (yet!), and Rj allows you to make use of these analyses from within jamovi. Additionally, you can make use of loops and if-statements, allowing (among other things) conditional analyses and simulation studies.

For some, using R in a spreadsheet will be an ideal place to begin learning R. For others it can be an easy way to share R analyses with less technically savvy colleagues. (And some people just prefer to code!)

Installation

Rj is available from the jamovi library, and requires a recent version of jamovi. The jamovi library is accessible from the Analyses tab -> modules menu at the top right of the jamovi window. Scroll down the list until you find Rj, and select install. jamovi will download Rj, install it, and then Rj will be available from the jamovi analysis menu (the menu should pulsate blue when you first install a module).

Enter code

To run an R analysis, select Rj Editor from the R analysis menu. This will bring up the editor for entering your R code. The data set is available to you as a data frame, simply as data. To get started, you might like to run descriptives on it.
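For example, something like the following should work (a minimal sketch using jmv, the package behind jamovi's own analyses):

library(jmv)
descriptives(data)    # `data` is the data frame Rj exposes for the current spreadsheet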

Or if you prefer the dplyr approach, you could go:

library(jmv)
library(dplyr)
library(magrittr)

select(data, 1:3) %>%
  descriptives()

You’ll notice as you’re entering this code, Rj auto-suggests function names, like in RStudio.

To run the code, press Control+Shift+Enter (or ⌘+Shift+Enter if you’re on a Mac). jamovi will run the R code and the results will appear in the results panel like other analyses. You can continue to make changes to the code, and press Control+Shift+Enter to run it again.

By default, Rj makes use of the version of R bundled with jamovi. This includes many packages (jmv and all its dependencies, see here), and will be sufficient for many people, but if you need to make use of additional R packages then you'll need to use the System R version. If you select the configuration gear at the top right of the code box, you'll see an option to change the R version used.

The System R version uses the version of R you have installed (i.e. from CRAN). This has the advantage that your R code now has access to all of the packages you have installed for that version of R. jamovi should locate your system R installation automatically (It uses the same algorithm that RStudio uses). The last thing you will need is to have the jmvconnect R package installed in your system R library. This package allows your system R version to access the jamovi data sets. You can install it from an R terminal or from RStudio with:

install.packages('jmvconnect')

Once this is done, moving from the jamovi R to the System R should be seamless.

It’s worth remembering that sharing jamovi files with colleagues becomes a bit more complicated when you make use of the System R version. If they want to make changes and re-run your analyses, they will need to have the same R packages installed – that’s the price of flexibility!

Not all data

When Rj runs R code, by default it makes the whole data set available as a data frame called data. However, it’s likely that your analysis only makes use of a few columns, and doesn’t need the whole data set. You can limit the columns made available to the analysis by including a special comment at the top of your script, of the form:

# (column1, column2, column3)
library(jmv)
...

In this instance, only the named columns will appear in the data data frame. This can speed the analysis up, particularly if you are working with large data sets. Additionally, this lets jamovi know that the analysis is only using these columns, and the analysis will not need to be re-run if changes are made to other columns.

To the terminal!

There may be times when you'll want to transition to an R terminal or RStudio for analysing a data set. This is where the jmvconnect R package comes in handy. jmvconnect lets you read the data sets from a running jamovi instance into an R session. At the time of writing it has two functions:

  • what()
  • read()

what() lists the available data sets, and read() reads them. For example, you might go:

> library(jmvconnect)
>
> what()

 Available Data Sets
 ─────────────────────────────────────
        Title           Rows    Cols
 ─────────────────────────────────────
   1    iris             150       5
   2    Tooth Growth      60       3
 ─────────────────────────────────────

and then read the data set with:

data <- read('Tooth Growth')

or

data <- read(2)

Before you ask, we intend on adding support for reading .omv files from R too (and for saving/opening .RData files from inside jamovi).

Work in progress

Auto-suggest is a work in progress, but you’ll still find it pretty useful. At present, it only suggests functions from the base packages, the recommended packages, and jmv. This is something we’ll broaden in the future. We also need to add a help system where you can access package documentation.

Package installation for the System R is technically possible through Rj, but is less than ideal. The UI will hang, and you won’t receive any feedback as to what’s going on. This is something we’ll improve in the future. For now, you can just use an R terminal, or RStudio.

On some windows machines, an R window flashes briefly when running analyses using the System R. I’m not sure why this is, but I think it’s a bug in the evaluate package. If anyone knows any more about the issue, I’d be keen to hear from you.

Finally, if you're using jamovi on Linux from the PPA, it's currently using an older version of R, and you may encounter issues with plots when using the system R. If this is a problem, we recommend using the flatpak version from flathub instead.

In summary

Rj and jmvconnect make it easy to access R from jamovi, and jamovi from R.



K-fold cross-validation in Stan

Mon, 07/30/2018 - 08:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Comparing multiple models is one of the core but also one of the trickiest elements of data analysis. Under a Bayesian framework the loo package in R allows you to derive (among other things) leave-one-out cross-validation metrics to compare the predictive abilities of different models.

Cross-validation is basically: (i) separating the data into chunks, (ii) fitting the model while holding out one chunk at a time, (iii) evaluating the probability density of the held-out chunk of data based on the parameter estimates, and (iv) deriving some metrics from the likelihood of the held-out data. The basic idea is that if your model does a good job, then it can predict the value of held-out data pretty well.

We can separate the data into a different number of chunks, or folds, from 2 up to N, the number of data points. Holding out one data point at a time is called leave-one-out cross-validation and can be computationally costly (you need to fit the model N times); fortunately, Vehtari and colleagues have been developing methods to estimate leave-one-out cross-validation without having to do the expensive computation. As with every estimation, though, it sometimes cannot give reliable results, and one could/should then fall back to standard K-fold cross-validation by fitting the model multiple times.

The aim of this post is to show one simple example of K-fold cross-validation in Stan via R, so that when loo cannot give you reliable estimates, you may still derive metrics to compare models.

The Stan code

Below is the Stan code for a simple linear normal regression allowing K-fold cross-validation

/* Standard normal regression for any number of predictor variables
   with weakly informative priors on the betas and on the standard deviation */
data{
  int N; //number of observations
  int K; //number of predictor variables
  matrix[N,K] X; //the model matrix including intercept
  vector[N] y; //the response variable
  int holdout[N]; //index whether the observation should be held out (1) or used (0)
}
parameters{
  vector[K] beta; //the regression parameters
  real<lower=0> sigma; //the residual standard deviation
}
model{
  vector[N] mu; //the linear predictor
  mu = X * beta; //the regression
  //priors
  beta[1] ~ normal(0,10);
  beta[2:K] ~ normal(0,5);
  sigma ~ normal(0,10);
  //likelihood holding out some data
  for(n in 1:N){
    if(holdout[n] == 0){
      target += normal_lpdf(y[n] | mu[n], sigma);
    }
  }
}
generated quantities{
  vector[N] log_lik;
  for (n in 1:N){
    log_lik[n] = normal_lpdf(y[n] | X[n,] * beta, sigma);
  }
}

The key element here is to pass as data a vector of 0s and 1s, as long as the observed data, indicating whether each point should be held out during model estimation or not. The likelihood is estimated by going through each data point and asking whether it should be held out; if not, that data point is used to increment the likelihood based on the model parameters.

Below, in the generated quantities block, we retrieve the probability density (the height of the probability distribution) for each data point, including the held-out ones.

The best way to go on if you want to try out the code is to copy/paste the model code into a .stan file.

Simulating some data and fitting the models

Let's start by setting some basic quantities and creating the held-out indices.

#first load the libraries
library(rstan)
rstan_options(auto_write = TRUE)
library(pbmcapply)
library(loo)

N <- 100      #sample size
K <- 2        #number of predictors
n_fold <- 10  #number of folds

#create 10 folds of data
hh <- sample(1:N, size = N, replace = FALSE)
holdout_10 <- matrix(0, nrow = N, ncol = n_fold)
for(i in 1:n_fold){
  id <- seq(1, 100, by = 10)
  holdout_10[hh[id[i]:(id[i] + 9)], i] <- 1
}
#some sanity checks
#apply(holdout_10,1,sum)
#apply(holdout_10,2,sum)
#turn into a list
holdout_10 <- split(holdout_10, rep(1:ncol(holdout_10), each = nrow(holdout_10)))

Randomly assigning each data point to a different fold is the trickiest part of the data preparation in K-fold cross-validation. What I basically did is randomly sample N times without replacement from the data point index (the object hh), and put the first 10 indices in the first fold, the subsequent 10 in the second fold, and so on.

Note that if there is some kind of grouping structure in your data that you want to be respected while splitting it, things can get a bit more complex; there are certainly some clever functions out there that might help you out (such as in the caret package), as sketched below.
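For instance (an illustration, not part of the original post), caret's createFolds() could produce the fold assignment, which you could then convert to the 0/1 holdout vectors used here, using the response y simulated in the next chunk:

library(caret)

folds <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
# convert each fold (a vector of row indices) into the 0/1 holdout vector the Stan model expects
holdout_caret <- lapply(folds, function(idx) as.integer(seq_len(N) %in% idx))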

Now we can simulate some data and create the data objects to be passed to Stan. Note that we will need one data object per fold, so a list is an easy way to combine them:

X <- cbind(rep(1, N), runif(N, -2, 2))
y <- rnorm(N, X %*% c(1, 0.5), 1)
#the basic data object
data_m <- list(N = N, K = K, X = X, y = y)
#create a list of data list
data_l <- rep(list(data_m), 10)
#add the holdout index to it
for(i in 1:10) data_l[[i]]$holdout <- holdout_10[[i]]

We are now ready to fit the model to each fold. We could just loop through the folds, but this is inefficient; instead I am going to use the functions available in this link and pasted at the end of this post. Basically, stan_kfold outputs a list of stanfit objects (one for each fold), extract_log_lik_K outputs an S x N matrix, where S is the number of posterior draws, in which each element is the log-likelihood of a data point from the fit in which it was held out, and kfold computes the expected log pointwise predictive density (elpd). In other words, the elpd is the height (density) of the probability distribution, given the model parameters, at the data points (pointwise) that were held out (predictive).
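For reference, the quantity computed by the kfold() helper at the end of the post (my transcription of equations 20 and 21 of Vehtari et al., which the helper code cites) is:

\mathrm{elpd}_i = \log\!\left(\frac{1}{S}\sum_{s=1}^{S} p\!\left(y_i \mid \theta^{(s)}\right)\right),
\qquad
\widehat{\mathrm{elpd}}_{\mathrm{kfold}} = \sum_{i=1}^{N}\mathrm{elpd}_i,
\qquad
\mathrm{se} = \sqrt{N \,\mathrm{Var}_i\!\left(\mathrm{elpd}_i\right)}

where the θ^(s) are the S posterior draws from the fit in which observation i was held out; the helper's logColMeansExp() computes the log of the inner average as colLogSumExps(x) - log(S) for numerical stability.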

#run the functions
ss <- stan_kfold(file = "Documents/PostDoc_Ghent/STAN_stuff/Models/normal_model_basic_cv.stan",
                 data_l, chains = 4, cores = 2)
ee <- extract_log_lik_K(ss, holdout_10)
kk <- kfold(ee)
#compare with official loo results
ll <- loo(ee)

The ee matrix can actually also be used as input to loo to get some more metrics.

All this is nice and fine, but these metrics are only relative and only truly make sense when comparing different models fitted to the same data. So let's fit two additional models, one overly complex and one overly simple:

# fit a too complex and a too simple model
X_comp <- cbind(X, runif(N, -2, 2))
X_simp <- X[, 1, drop = FALSE]
# new data
data_comp <- data_l
for(i in 1:10){
  data_comp[[i]]$X <- X_comp
  data_comp[[i]]$K <- 3
}
data_simp <- data_l
for(i in 1:10){
  data_simp[[i]]$X <- X_simp
  data_simp[[i]]$K <- 1
}
#fit the new models
ss_comp <- stan_kfold(file = "Documents/PostDoc_Ghent/STAN_stuff/Models/normal_model_basic_cv.stan",
                      data_comp, chains = 4, cores = 2)
ss_simp <- stan_kfold(file = "Documents/PostDoc_Ghent/STAN_stuff/Models/normal_model_basic_cv.stan",
                      data_simp, chains = 4, cores = 2)
ee_comp <- extract_log_lik_K(ss_comp, holdout_10)
ee_simp <- extract_log_lik_K(ss_simp, holdout_10)
#compare the models
compare(loo(ee), loo(ee_comp), loo(ee_simp))
#output:
#              elpd_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
#loo(ee)             0.0   -148.9         9.5   4.3      1.0 297.7     19.1
#loo(ee_comp)       -1.8   -150.7         9.6   5.5      1.3 301.4     19.2
#loo(ee_simp)      -11.2   -160.1         8.9   2.8      0.8 320.1     17.8

The compare function outputs quite a lot of information; let's go through it.

The first column is the difference in the summed expected log pointwise predictive density; the difference between the second and the first model is -1.8, meaning that the first model has slightly higher predictive density. The second column is the summed expected log pointwise predictive density itself; the values in the first column are differences from this one. The third column is the standard error of the expected log pointwise predictive density. The fourth and fifth columns are the effective number of parameters in the model and its standard error. The sixth column is -2 times the second column, which puts the elpd on the deviance scale and so makes it similar to other information criteria such as AIC or DIC. The final column is the standard error of that information criterion.
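A quick arithmetic check ties the columns together:

# looic is just -2 times elpd_loo, e.g. for the first model:
-2 * -148.9        # 297.8, matching the reported looic of 297.7 up to rounding
# and elpd_diff is each model's elpd_loo minus that of the best model:
-150.7 - (-148.9)  # -1.8
-160.1 - (-148.9)  # -11.2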

From this output one could argue that the model with two predictors is definitively superior to the intercept-only model, and is slightly better than the model with 3 parameters.

Voila. Do remember that model selection is very tricky; I would not encourage using information criteria or cross-validation metrics to decide whether one should include that particular interaction or this extra covariate. To my mind, these types of comparison are relevant for comparing different model structures, such as simple regression vs hierarchical models vs spatial effects vs autoregressive structures, or for comparing models based on different sets of covariates, such as asking whether plant growth is better predicted by trait information or by environmental information.

Happy folding

The function code

Below is the code of the function used to fit the models and extract the information:

#functions slightly modified from: https://github.com/stan-dev/stancon_talks/blob/master/2017/Contributed-Talks/07_nicenboim/kfold.Rmd #function to parrallelize all computations #need at least two chains !!! stan_kfold <- function(file, list_of_datas, chains, cores,...){ library(pbmcapply) badRhat <- 1.1 # don't know why we need this? n_fold <- length(list_of_datas) model <- stan_model(file=file) # First parallelize all chains: sflist <- pbmclapply(1:(n_fold*chains), mc.cores = cores, function(i){ # Fold number: k <- ceiling(i / chains) s <- sampling(model, data = list_of_datas[[k]], chains = 1, chain_id = i,...) return(s) }) # Then merge the K * chains to create K stanfits: stanfit <- list() for(k in 1:n_fold){ inchains <- (chains*k - (chains - 1)):(chains*k) # Merge `chains` of each fold stanfit[[k]] <- sflist2stanfit(sflist[inchains]) } return(stanfit) } #extract log-likelihoods of held-out data extract_log_lik_K <- function(list_of_stanfits, list_of_holdout, ...){ require(loo) K <- length(list_of_stanfits) list_of_log_liks <- plyr::llply(1:K, function(k){ extract_log_lik(list_of_stanfits[[k]],...) }) # `log_lik_heldout` will include the loglike of all the held out data of all the folds. # We define `log_lik_heldout` as a (samples x N_obs) matrix # (similar to each log_lik matrix) log_lik_heldout <- list_of_log_liks[[1]] * NA for(k in 1:K){ log_lik <- list_of_log_liks[[k]] samples <- dim(log_lik)[1] N_obs <- dim(log_lik)[2] # This is a matrix with the same size as log_lik_heldout # with 1 if the data was held out in the fold k heldout <- matrix(rep(list_of_holdout[[k]], each = samples), nrow = samples) # Sanity check that the previous log_lik is not being overwritten: if(any(!is.na(log_lik_heldout[heldout==1]))){ warning("Heldout log_lik has been overwritten!!!!") } # We save here the log_lik of the fold k in the matrix: log_lik_heldout[heldout==1] <- log_lik[heldout==1] } return(log_lik_heldout) } #compute ELPD kfold <- function(log_lik_heldout) { library(matrixStats) logColMeansExp <- function(x) { # should be more stable than log(colMeans(exp(x))) S <- nrow(x) colLogSumExps(x) - log(S) } # See equation (20) of @VehtariEtAl2016 pointwise <- matrix(logColMeansExp(log_lik_heldout), ncol= 1) colnames(pointwise) <- "elpd" # See equation (21) of @VehtariEtAl2016 elpd_kfold <- sum(pointwise) se_elpd_kfold <- sqrt(ncol(log_lik_heldout) * var(pointwise)) out <- list( pointwise = pointwise, elpd_kfold = elpd_kfold, se_elpd_kfold = se_elpd_kfold) #structure(out, class = "loo") return(out) }



    Following the Movement of Birds in the United States

    Mon, 07/30/2018 - 07:22

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    The American Birder

    For the millions of bird watchers in America, relevant and useful resources are always a welcome sight. Range maps and ecological histories enhance the bird watching experience by adding a layer of conservation awareness and help hobbyists become more acquainted with the birds they observe. As a birder myself, I am always looking for new applications that help me achieve a greater understanding of the birds I observe on a daily basis. Learning more about these fascinating and beautiful animals helps to bring the bigger picture into focus; they have a much larger role in our world than just the momentary glimpse you get when observing them in a park or when walking down the street.

    The Data

    I chose to use the eBird dataset from the Cornell Lab of Ornithology to construct a Shiny application in R that allows a birder to “zoom out” from an isolated bird observation. eBird uses crowd-sourced data from around the world to track the locations and times of bird observations. Using their online interface, a user can view observations of many different species of birds, explore bird watching hot-spots, and even see a real time observation submission map. Observations are available from as far back as the year 1900, and the increasing accessibility of technology translates to an ever increasing avalanche of data pouring in. The total data set today consists of over 500 million observations!

    The App

    Using the Shiny package in R, I was able to build an application that explored observations from 2016 of 10 species of birds in the United States. I aggregated the observations by county and produced a graph using ggplot2 to show the frequency of observations throughout the US. The observations can be filtered by month using a slider bar to inspect the distribution of sightings during a particular season. I also added a feature that allows the user to filter the observations by breeding season, which I implemented by using estimated breeding season ranges from the Cornell Lab of Ornithology Birds of North America website. A feature that I found really interesting and was excited to add to my application is the “play” button which shows an animated map that cycles through the months of the year and displays the bird sightings accordingly. This provides the user with an important perspective on the movement of different species throughout the US which can sometimes be lost when looking at separate monthly range maps one at a time.
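As a rough sketch of the app's structure (the data frame obs, its species/county/month/n columns, and the bar-style plot are hypothetical stand-ins, not the actual eBird code), the month slider simply filters the aggregated observations before plotting, and the animate option supplies the "play" button:

library(shiny)
library(ggplot2)
library(dplyr)

# hypothetical stand-in data: one row per species/county/month with a count n
set.seed(1)
obs <- expand.grid(species = c("Short-eared Owl", "Baltimore Oriole"),
                   county  = c("Kings", "Queens", "Suffolk"),
                   month   = 1:12)
obs$n <- rpois(nrow(obs), lambda = 20)

ui <- fluidPage(
  selectInput("species", "Species", choices = sort(unique(obs$species))),
  sliderInput("months", "Months", min = 1, max = 12, value = c(1, 12), step = 1,
              animate = animationOptions(interval = 1500)),  # the "play" button
  plotOutput("map")
)

server <- function(input, output, session) {
  output$map <- renderPlot({
    obs %>%
      filter(species == input$species,
             month >= input$months[1], month <= input$months[2]) %>%
      group_by(county) %>%
      summarise(n = sum(n)) %>%
      ggplot(aes(county, n)) +   # stand-in for the county-level map layer
      geom_col()
  })
}

shinyApp(ui, server)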

     

    Map that shows number of species observations in a specific month range

     

In addition to visualizing the movement of birds throughout the US, I wanted to provide additional functionality for bird watchers. I thought an interesting question to ask would be: at what time are birds most often being seen? To answer this question, I added a histogram which shows the frequency of times that a certain species was observed. What I found was that 8:00 AM was by far the most frequent time that an eBird user submitted an observation. What I have concluded is that the data is biased; many more people are actively bird watching around 8:00 AM, thus, the number of observations spikes around that time. This does not necessarily mean that a species is more likely to be seen at 8:00 AM; it means that there are more people actively looking. However, I did find a different trend for the only owl species (the Short-eared Owl) that I included in my list of species. This species showed a maximum sighting frequency at around 5:00 PM, which is in agreement with the fact that owls become more active around dusk. This leads me to believe that the functionality of this feature is mainly relevant for determining very basic activity levels for certain species.
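As a rough sketch of that kind of histogram (the data here are simulated; a hypothetical time column in decimal hours stands in for the eBird observation times):

library(ggplot2)

# simulated stand-in: most observation times cluster around 8 AM, with a smaller bump near 5 PM
set.seed(123)
sightings <- data.frame(
  time = pmin(pmax(c(rnorm(500, mean = 8, sd = 2), rnorm(100, mean = 17, sd = 1.5)), 0), 24)
)

ggplot(sightings, aes(time)) +
  geom_histogram(binwidth = 1, boundary = 0) +
  scale_x_continuous(breaks = seq(0, 24, by = 4)) +
  labs(x = "Hour of day", y = "Number of observations")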

     

    Histogram that shows the frequency of species observations by time of day

     

When viewing the range map I found that regional movement of bird sightings was apparent; however, it was easy to overlook state-level observation trends. I elected to add a bar graph that breaks down observations by month for every state where there were sightings. This feature allows the user to see a clear pattern in sighting frequency over the course of a year. The visualization enabled by the graph makes simple work of detecting whether the bird is a year-round resident or only present for certain seasons. This is important for understanding seasonal distribution of species. Although I have not implemented this functionality yet, a daily breakdown of sightings could yield important information on bird migration stopover sites (i.e. where birds temporarily stop to refuel during migration).

     

    Graph that shows the frequency of species observations by month for a specific state

     

    The final section of my application allows the user to inspect the data behind the graphics. This can be useful if the user wishes to extract a specific value of observations at a certain time or in a certain location. The data table has a search function that allows the user to filter the data by county, state, or time of day.

     

    Data tables that allow users to inspect specific values in the data set

     

    Going Forward

    Although the application is functional, there are several potential areas for improvement that I would like to address in the future.

1. First (and I believe most important) is that the size of the data for some species is quite large, which leads to loading issues where graphics can take several seconds to render. This makes the play function of the map difficult to use effectively in some cases. I believe that these cases would benefit tremendously from either further optimization of the code or incorporating graphics packages with quicker rendering capabilities (ideally a combination of both).
    2. Second, I would like to allow the user to inspect a much larger list of species and range of years. Due to the size of the data, storage is a significant issue which may not be avoidable without establishing a dedicated server to host the data.
3. Third, the observations by county are currently displayed using a log scale. I decided to use a log scale over the raw number of observations because many areas have observations of 1-100 and were vastly overshadowed by areas that had observations in the thousands. These areas with lower observations can still show significant trends, and I wanted to make sure they were not ignored. Still, this system does not address the issue of there being a bias in sighting frequency strictly due to larger numbers of available birders. Areas with larger populations will produce higher numbers of sightings simply because more people are actively looking for birds (similar to the issue I have with the time histogram). I would like to implement a system that standardizes county sightings by county population, which would give a more normalized representation of sighting frequency. (A small sketch of the log-scale fill follows this list.)
4. Finally, my current system uses counties as groups for aggregating observations by location. This can introduce problems in location detail. For example, many counties in the western United States are very large and can diminish the granularity of the map. I have seen other bird range mapping tools (such as the eBird map) that instead use rectangular areas denoted by longitude and latitude and do not rely on human-designated borders. Implementing this method could increase the level of accuracy of the map's representation of sighting hot-spots.
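As a small sketch of the log-scale choice mentioned in point 3 (county names and counts here are invented), ggplot2 can apply the transformation directly to the fill scale:

library(ggplot2)

# invented county-level counts spanning several orders of magnitude
counts <- data.frame(county = paste("County", LETTERS[1:5]),
                     n = c(3, 40, 250, 1800, 12000))

ggplot(counts, aes(county, 1, fill = n)) +
  geom_tile() +   # stand-in for the county polygons
  scale_fill_gradient(trans = "log10", low = "lightyellow", high = "darkred") +
  labs(fill = "Observations\n(log scale)")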
    Hey, Thanks!

    Thank you for taking the time to read about my project! As someone who is passionate about ecology and animal behavior I found building this application to be very rewarding and insightful. I am always trying to think about new ways to marry technology and environmental biology, and data science is an incredibly powerful tool that I can use to ask and answer questions in a field that I think is fascinating. Feel free to check out my application and I welcome any feedback!



    Version 0.6-12 of NIMBLE released

    Mon, 07/30/2018 - 03:36

    (This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)

    We’ve released the newest version of NIMBLE on CRAN and on our website. Version 0.6-12 is primarily a maintenance release with various bug fixes.

    Changes include:

    • a fix for the bootstrap particle filter to correctly calculate weights when particles are not resampled (the filter had been omitting the previous weights when calculating the new weights);
    • addition of an option to print MCMC samplers of a particular type;
    • avoiding an overly-aggressive check for ragged arrays when building models; and
    • avoiding assigning a sampler to non-conjugate inverse-Wishart nodes (thereby matching our handling of Wishart nodes).

    Please see the NEWS file in the installed package for more details.



    Beyond Basic R – Introduction and Best Practices

    Mon, 07/30/2018 - 02:00

    (This article was first published on The USGS OWI blog , and kindly contributed to R-bloggers)

    We queried more than 60 people who have taken the USGS Introduction to R class over the last two years to understand what other skills and techniques are desired, but not covered in the course. Though many people have asked for an intermediate level class, we believe that many of the skills could be best taught through existing online materials. Instead of creating a stand-alone course, we invested our time into compiling the results of the survey, creating short examples, and linking to the necessary resources within a series of blog posts. This is the first in a series of 5 posts called Beyond basic R.

    Other posts that will be released during the next month include:

    • Data munging
    • Plotting with ggplot2 and setting custom themes
    • Mapping
    • Version control via Git

    You can see all blogs in this series by clicking the tag “Beyond Basic R” or follow this link.

    Best practices for writing reproducible R scripts

    Scripting can significantly increase your ability to create reproducible results, figures, or reports so that both your collaborators and future self can successfully rerun code and get the same results. However, just because you’ve put code into a script doesn’t make it reproducible. There are some general organization and code writing tips that can elevate your scripts into reproducible code.

    Code organization
    • Put all library() calls and any hard-coded variables at the top of the script. This makes package dependencies and variables that need to be changed very apparent.
    • Use RStudio projects to organize your scripts, data, and output. Add scripts to your RStudio project inside a subfolder called R, src, or something similar. Make sure you have separate folders for data inputs, data outputs, plots, and reports (e.g. R Markdown). All of these folders help keep content in a project organized so that others can find what they need. When you share an RStudio project or go between computers, R knows to interpret file paths relative to the project folder – no more changing working directories! However, having all your RMarkdown files in a different subfolder from data means you might need to use .. in your file path to go “up” a level, e.g. ../input/mydata.csv. Learn more about RStudio projects here and see an example of an RStudio project setup here.
    • Modularize your code. One script that has hundreds of lines of code can be challenging to maintain and troubleshoot. Instead, you should separate your analysis into multiple “modules” at logical points. Then, you can use source to execute the contents of each module. When you separate your workflow into multiple scripts, it is useful to name them in ways that indicate the order in which they should be used, e.g. 1_fetch_data.R, 7_plot_data.R, etc. Modules are useful as actual scripts, but modularized workflows could also be a collection of user-defined functions (or a mixture of both). You can learn the basics about functions in the USGS Introduction to R material and find advanced information about functions in Hadley Wickham’s Advanced R tutorial. A short runner-script sketch follows this list.
    • Do not save your workspace. When you exit R/RStudio, you probably get a prompt about saving your workspace image – don’t do this! The very nature of reproducible workflows is that you can reproduce your results by re-running your scripts. For this reason, there is no need to save your workspace – you can get all of your variables back by re-running your code the next time you open R. You can learn more about why this is not a great practice from a lesson by Jenny Bryan and a post by Martin Johnsson. You can turn off this prompt so that you are not tempted by going to Tools > Global Options and clicking “Never” in the dropdown next to “Save workspace to .RData on exit”.
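As a brief illustration of that modular layout (the file and function names below are invented for the sketch, echoing the numbering scheme mentioned above), a short "runner" script can source each module in order:

# run_analysis.R -- illustrative top-level script for a modular workflow
library(dplyr)                  # all library() calls and hard-coded inputs at the top
input_file <- "input/mydata.csv"

source("R/1_fetch_data.R")      # hypothetical module defining fetch_data()
source("R/2_clean_data.R")      # hypothetical module defining clean_data()
source("R/7_plot_data.R")       # hypothetical module defining plot_data()

raw     <- fetch_data(input_file)
cleaned <- clean_data(raw)
plot_data(cleaned, out_file = "plots/summary.png")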
    Code itself
    • Use library() NOT require() when loading packages in scripts. library will throw an error if the package is not already installed, but require will silently fail. This means you get a mysterious failure later on when functions within a package are not available and you thought they were. Learn more about this in Yihui Xie’s blog post. (A tiny demonstration follows this list.)
    • Do not use functions that change someone’s computer (e.g. install.packages, or setwd). Jenny Bryan has a great blog about the pitfalls of creating code that is not self-contained.
    • Comment incessantly. Comments should not take the place of clean and clear code, but can help explain why you might have done certain steps or used certain functions. Commenting is just as important for you as it is for any one else reading your code – as they say, your worst collaborator is you from 6 months ago. Learn a few more tricks with commenting and organizing code that RStudio offers here.
    • Follow a style and be consistent. There is no single correct way to style your code, but the important part is that you are consistent. Consistent style makes your code more readable, which makes collaboration with others and the future you much easier. When we talk about style, we are talking about how you name variables, functions, or files (camelCase, under_scores, etc) and how your code is visually organized (e.g. how long a single line can span, where you indent, etc). You can choose your own style guide, but this style guide from Google and this one from tidyverse are good places to start. When it comes to visual organization, RStudio has auto-indent features that help make it easier for you to put your code in the right place, and you can always force the RStudio indent by highlighting code and using CTRL + I.
    • As you are writing your code, take advantage of RStudio’s autocomplete features (and/or copy/paste). Typing mistakes are often a reason that code doesn’t work (e.g. a lowercase letter that really should have been uppercase), and using auto-complete or copy/paste reduces the prevalence of these mistakes. To use auto-complete, start typing something and hit the Tab button on your keyboard. You should see options start to pop up. More about auto-complete features here.
    • Learn to use loops or functions when you find yourself copying and pasting chunks of code with only minor changes. More code means more to maintain and troubleshoot, so try to reuse code via loops and functions when possible. Consider this tutorial from Nice R Code, the USGS Introduction to R lesson on R programming structures, or the Software Carpentry loop tutorial to learn more.
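A tiny sketch of the library() vs. require() difference from the first bullet above, using a deliberately nonexistent package name:

# library() stops immediately with an error if the package is missing:
# library(notARealPackage)
# Error in library(notARealPackage) : there is no package called 'notARealPackage'

# require() merely warns and returns FALSE, so the script keeps running
# and fails later in a less obvious place:
ok <- require(notARealPackage)
ok
# [1] FALSE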

    This is a brief list of good practices to consider when writing R code, and there are lots of other resources to reference when it comes to “best practices”. You should take a look at other posts to get an idea of what the R community thinks more broadly. To start, you could reference Best Practices for Writing R by Software Carpentry or Efficient R programming by Colin Gillespie and Robin Lovelace. As you explore suggested practices online, keep in mind that R is open-source software and is constantly evolving which means best practices will evolve, too.

    Disclaimer

    Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.



    Visualizing Wine Reviews

    Mon, 07/30/2018 - 01:30

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

     

    Have you ever been wine shopping and wondered if the ratings actually mean anything? Do only high-priced wines get good reviews? This analysis attempts to demystify some of the confusion behind these ratings by examining a wide range of wines reviewed by a popular wine publication and showing ways to choose a wine based on other factors such as country of origin and varietal.

    Background

    I obtained a data set of reviews for over 110,000 different wines published by Wine Enthusiast magazine between 1999 and 2017. The wines reviewed originated from 42 different countries and ranged in price from $4 to $3300. Reviews were written by at least 20 different professional wine tasters (some anonymous) and included a rating of the wine on a 100-point scale. Only wines with a rating of 80 or higher are reviewed and included in the database. The rating scale used by Wine Enthusiast magazine is provided below. A “Classic” rating is extremely rare – in fact, only 115 wines among the 110,000 reviewed received a rating of 98 or higher.

    Classic     98-100  The pinnacle of quality.
    Superb      94-97   A great achievement.
    Excellent   90-93   Highly recommended.
    Very Good   87-89   Often good value; well recommended.
    Good        83-86   Suitable for everyday consumption; often good value.
    Acceptable  80-82   Can be employed in casual, less-critical circumstances.

     

    Wine Ratings vs. Price

    One of the first questions you may have when shopping for a wine and seeing its rating is whether the wine will be good enough for the occasion or whether it’s better to spend more money and get a better wine. To better visualize the relationship between price and rating, I’ve plotted the rating against the wine price for 5000 wines sampled from the data set. It’s no surprise that, yes, higher-priced wines do tend to have higher ratings. However, the extremely high-priced wines graphed on a linear scale make it difficult to see the relationship for the majority of the wines reviewed, so I’ve also plotted the price on a logarithmic scale, which shows a much more direct relationship between price and wine rating.

    The good news for (frugal) wine lovers is that the spread in the data for many of the wines in the $10 to $100 range reveals that there are still many wines with “Excellent” ratings of 90 and above within reach. Using the linear regression line in the Rating vs. log(Price) graph, we can determine that wines plotted several points above the line are better values compared to others in their price range or rating category.
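As a hedged sketch of that fit (the data below are simulated, with made-up points and price columns standing in for the review data, so the coefficients are illustrative only):

library(ggplot2)
set.seed(1)

# simulated stand-in for the reviews: a rating ('points') loosely tied to log10(price)
wine <- data.frame(price = exp(runif(500, log(4), log(500))))
wine$points <- round(84 + 4 * log10(wine$price) + rnorm(500, sd = 2))

fit <- lm(points ~ log10(price), data = wine)
coef(fit)  # slope = expected change in rating per tenfold increase in price

ggplot(wine, aes(price, points)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm") +
  scale_x_log10()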

    Varietal Ratings and Prices by Country

    Next I was interested in finding out whether the ratings and prices for different varietals varied by country and whether there were significant differences in price and rating. These bar charts, which compare the average rating and median price for several varietals from the 5 countries with the most wines reviewed, showed some surprising insights as well. For example, Bordeaux red blends from Portugal were, on average, more highly rated than those from the other four countries and had a significantly lower median selling price. On the other hand, Spanish Rieslings were rated lower than those from the other four countries shown, but sold for a similar price. Comparing favorite wine varietals in this manner enables consumers to find better deals on higher-rated wines and encourages them to be more adventurous in trying different wines.

    Wine Selector

    Finally, I also put together a few tools to help explore the full database of wines reviewed and find the best values on wines specified by varietal and price range. The user can also choose a desired rating category and search for the wines I determined were the “Best Values” from my wine rating vs. price analysis. Another tool allows the user to see the most popular wine varietal from different countries around the world and from each of the U.S. states that produce wine. This may be useful for travelers wishing to find the best type of wine to drink when visiting or to buy as a memorable souvenir.

    Most Popular Wine by Region

    Conclusion

    Overall, we can see that higher-priced wines tend to have higher ratings; however, more can be learned when comparing these two variables. Using these tools, we can discover new wines and find the best values for wines we love. Additional work includes the ability to filter to wines that are currently available, using today’s prices, since many of the reviews may date back up to 17 years, and adding links to the full text of each wine’s review.

    Cheers!

     

    Data source: Kaggle Dataset: Wine Reviews

    Data analysis and interactive charts were developed with R and Shiny and can be found here.

    Github code



    Co-integration and Pairs Trading

    Sun, 07/29/2018 - 22:28

    (This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

    Co-integration is an important statistical concept behind the statistical arbitrage strategy known as “Pairs Trading”. While projecting a stock price with time series models is by all means difficult, it is technically feasible to find a pair of (or even a portfolio of) stocks sharing a common trend such that a linear combination of the two series is stationary, which is what we call co-integration. The underlying logic of Pairs Trading is to monitor the movements of co-integrated stocks and to look for trading opportunities when a divergence presents itself. Under the mean-reversion assumption, the stock prices would tend to move back to the long-term equilibrium. As a result, the spread between two co-integrated stock prices would eventually converge. Furthermore, given the stationarity of the spread between co-integrated stocks, it becomes possible to forecast such a spread with time series models.
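Before the utility function, here is a small self-contained sketch of the idea: two simulated price series sharing a common random-walk trend are each non-stationary, yet their spread is stationary. The augmented Dickey-Fuller test from the tseries package is used here purely for illustration (the post itself relies on the Johansen test from urca):

library(tseries)

set.seed(42)
n <- 1000
trend <- cumsum(rnorm(n))            # shared stochastic trend (random walk)
p1 <- 10 + 1.0 * trend + rnorm(n)    # two "prices" driven by the same trend
p2 <-  5 + 0.8 * trend + rnorm(n)

spread <- p1 - (1.0 / 0.8) * p2      # a linear combination that removes the trend
adf.test(spread)                     # small p-value: the spread looks stationary
adf.test(p1)                         # large p-value: each price on its own does not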

    Below is an R utility function that helps identify pairwise co-integrations, based upon the Johansen test, out of an arbitrary number of stock prices provided in a list of tickers.

    For instance, based on a starting date of 2010/01/01 and a list of tickers for major US banks, we are able to identify 23 pairs of co-integrated stock prices out of 78 pairwise combinations. It is interesting to see that the stock prices of two regional players, e.g. Fifth Third and M&T, are highly co-integrated, as visualized in the chart below.

pkgs <- list("quantmod", "doParallel", "foreach", "urca")
lapply(pkgs, require, character.only = T)
registerDoParallel(cores = 4)

jtest <- function(t1, t2) {
  start <- sd
  getSymbols(t1, from = start)
  getSymbols(t2, from = start)
  j <- summary(ca.jo(cbind(get(t1)[, 6], get(t2)[, 6])))
  r <- data.frame(stock1 = t1, stock2 = t2, stat = j@teststat[2])
  r[, c("pct10", "pct5", "pct1")] <- j@cval[2, ]
  return(r)
}

pair <- function(lst) {
  d2 <- data.frame(t(combn(lst, 2)))
  stat <- foreach(i = 1:nrow(d2), .combine = rbind) %dopar%
            jtest(as.character(d2[i, 1]), as.character(d2[i, 2]))
  stat <- stat[order(-stat$stat), ]
  # THE PIECE GENERATING * CAN'T BE DISPLAYED PROPERLY IN WORDPRESS
  rownames(stat) <- NULL
  return(stat)
}

sd <- "2010-01-01"
tickers <- c("FITB", "BBT", "MTB", "STI", "PNC", "HBAN", "CMA", "USB", "KEY", "JPM", "C", "BAC", "WFC")
pair(tickers)

   stock1 stock2      stat pct10 pct5  pct1 coint
1     STI    JPM 27.207462 12.91 14.9 19.19   ***
2    FITB    MTB 21.514142 12.91 14.9 19.19   ***
3     MTB    KEY 20.760885 12.91 14.9 19.19   ***
4    HBAN    KEY 19.247719 12.91 14.9 19.19   ***
5       C    BAC 18.573168 12.91 14.9 19.19    **
6    HBAN    JPM 18.019051 12.91 14.9 19.19    **
7    FITB    BAC 17.490536 12.91 14.9 19.19    **
8     PNC   HBAN 16.959451 12.91 14.9 19.19    **
9    FITB    BBT 16.727097 12.91 14.9 19.19    **
10    MTB   HBAN 15.852456 12.91 14.9 19.19    **
11    PNC    JPM 15.822610 12.91 14.9 19.19    **
12    CMA    BAC 15.685086 12.91 14.9 19.19    **
13   HBAN    BAC 15.446149 12.91 14.9 19.19    **
14    BBT    MTB 15.256334 12.91 14.9 19.19    **
15    MTB    JPM 15.178646 12.91 14.9 19.19    **
16    BBT   HBAN 14.808770 12.91 14.9 19.19     *
17    KEY    BAC 14.576440 12.91 14.9 19.19     *
18   FITB    JPM 14.272424 12.91 14.9 19.19     *
19    STI    BAC 14.253971 12.91 14.9 19.19     *
20   FITB    PNC 14.215647 12.91 14.9 19.19     *
21    MTB    BAC 13.891615 12.91 14.9 19.19     *
22    MTB    PNC 13.668863 12.91 14.9 19.19     *
23    KEY    JPM 12.952239 12.91 14.9 19.19     *



    House price data cleansing and segmentation tool.

    Sun, 07/29/2018 - 18:44

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    Project background

    The Land Registry publishes data for each housing sale transaction that is registered in England & Wales. This data has been used extensively for many analyses, from price evolution over time to the assessment of price differences between areas. The dataset is publicly available under the government licence and dates back to 1995.

    The main variables we find in each dataset are the location of the transaction (with full postcode, used to geolocate each property), the type of asset (Flat, Detached, Bungalow, … and Other, which was discarded) and the price. Unfortunately, there is no qualitative or quantitative information about each asset (e.g. the floor area would be very useful).

    The main issue when working with this dataset, as with any transactional dataset (residential, offices, …), is that the distribution is usually skewed, with values ranging from just a few thousand pounds to many millions. For this reason, we have created a tool with two main objectives:

    • Filtering the original dataset based on the Median Absolute Deviation times a factor to discard outliers.
    • Further filtering the resulting dataset to segment the market data, so users can focus on an upper or lower percentile range of the sample
    Cleanse the data and discard outliers

    Although it is common practice, it would be wrong to cleanse the data by only applying a minimum and maximum value filter to the whole dataset, mainly because of the two reasons listed below:

    • The highly skewed distribution of the price range
    • The variability of prices within each group

    To tackle the first issue, we decided to implement the MAD (Median Absolute Deviation) instead of the standard deviation, to be less dependent on the variance of the data, and to apply a factor below and above the median to filter the data. So a factor of 2 will discard values more than 2 MADs above or below the median; the higher the factor applied, the more extreme values will be included in the final sample.

    At the same time, and to avoid geographical bias, we apply this methodology to each borough. That is, we calculate the median and MAD for each borough and, with the calculated median, we then apply the MAD factor filtering. In other words, a £1 million house in a central borough won’t be considered an outlier, but it might be discarded if the house is located in one of the outer boroughs.
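A compact sketch of that per-borough filter (the transactions data frame and its borough/price columns are invented placeholders for the Land Registry fields):

library(dplyr)
set.seed(1)

# invented stand-in for the price-paid data
transactions <- data.frame(
  borough = rep(c("Camden", "Bromley"), each = 500),
  price   = c(rlnorm(500, log(800000), 0.6), rlnorm(500, log(350000), 0.5))
)

mad_factor <- 2  # the user-selected factor

filtered <- transactions %>%
  group_by(borough) %>%
  filter(abs(price - median(price)) <= mad_factor * mad(price)) %>%
  ungroup()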

     

     

    Market segmentation

    Real Estate investors market their properties to very specific market segments, i.e. for the top end of the market. Having this capability to dynamically select which ‘slice’ of the market we are interested in is, therefore, a very useful tool when working with housing prices.

    With the percentiles slider implemented, the user can easily select the segment of the market he/she is interested in. For example, analysing the top end of the market would be as easy as selecting 0.8 to 1 on the slider.
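The segmentation behind the slider amounts to a simple quantile filter; continuing with the invented filtered data from the sketch above:

# keep, say, the top 20% of the cleansed sample
lower <- 0.8
upper <- 1.0

segment <- filtered %>%
  filter(price >= quantile(price, lower),
         price <= quantile(price, upper))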

    Download the data

    The primary objective of this tool is to implement specific criteria to filter and segment a dataset. Hence, a download capability has also been implemented. We can download either the full dataset, once it is cleansed according to the user’s selection, or a borough summary for convenience.

     Next steps

    This application should be extended to allow the following:

    • To implement the capability to upload your own CSV.
    • Mapping each of the data properties to help with data visualization.
    • Making the coding as generic as possible to apply the same methodology to other country datasets.
    Access the hosted Shiny Application: https://natxomoreno.shinyapps.io/London_House_Prices_Stat_Explorer/


    histoRicalg: The effort to document historic and historical numerical algorithms in R

    Sun, 07/29/2018 - 17:17

    (This article was first published on R – Strange Attractors, and kindly contributed to R-bloggers)

    Recently, the R-consortium accepted a new project called histoRicalg. The main goal of the project is to document and transfer knowledge of some older algorithms used by R and by other computational systems. There is a lot of R written in Fortran—much of which is in the old F77 format—and in C whose original implementations themselves may have been in older languages. It is worthwhile to trace the provenance of these routines in their own right. Firstly, to understand how the code evolved. Secondly, to transfer much of the compiled wisdom and experience of those who wrote these algorithms to a new generation of statisticians, programmers, and analysts. Reviewing the evolution of these algorithms becomes even more valuable when it unearths potential or actual bugs, as was recently seen in nlm() and in optim::L-BFGS-B.

    To that purpose, Dr. John C. Nash, well-known for his contributions to R and the general statistical knowledge-base, recently started this project. Project members do not have to be academics or expert programmers—I certainly am neither! Rather, anyone who is interested in helping document how these fundamental algorithms came to be and what may be done to make them better is enthusiastically invited. Younger R users, statisticians, programmers, or anyone willing to learn about the methods that underlie the computations in R are especially welcome. This relates directly to one of the project’s first goals, which is to create a “Working Group on Algorithms Used in R” focused on identifying and prioritizing algorithmic issues and developing procedures for linking older and younger workers. The project is being hosted on Gitlab, where there is a wiki and already nearly a dozen vignettes on various R functions. For more information, or to sign up, people are invited to contact Dr. Nash directly or via the project’s mailing list. Looking forward to seeing you there soon!

    The post histoRicalg: The effort to document historic and historical numerical algorithms in R appeared first on Strange Attractors.



    ggplot “Doodling” with HIBP Breaches

    Sun, 07/29/2018 - 16:08

    (This article was first published on R – rud.is, and kindly contributed to R-bloggers)

    After reading this interesting analysis of “How Often Are Americans’ Accounts Breached?” by Gaurav Sood (which we need more of in cyber-land) I gave in to the impulse to do some gg-doodling with the “Have I Been Pwnd” JSON data he used.

    It’s just some basic data manipulation with some heavy ggplot2 styling customization, so no real need for exposition beyond noting that there are many other ways to view the data. I just settled on centered segments early on and went from there. If you do a bit of gg-doodling yourself, drop a note in the comments with a link.

    You can see a full-size version of the image via this link.

library(hrbrthemes) # use github or gitlab version
library(tidyverse)

# get the data
dat_url <- "https://raw.githubusercontent.com/themains/pwned/master/data/breaches.json"
jsonlite::fromJSON(dat_url) %>%
  mutate(BreachDate = as.Date(BreachDate)) %>%
  tbl_df() -> breaches

# selected breach labels df
group_by(breaches, year = lubridate::year(BreachDate)) %>%
  top_n(1, wt=PwnCount) %>%
  ungroup() %>%
  filter(year %in% c(2008, 2015, 2016, 2017)) %>% # pick years where labels will fit nicely
  mutate(
    lab = sprintf("%s\n%sM accounts", Name, as.integer(PwnCount/1000000))
  ) %>%
  arrange(year) -> labs

# num of known breaches in that year for labels
count(breaches, year = lubridate::year(BreachDate)) %>%
  mutate(nlab = sprintf("n=%s", n)) %>%
  mutate(lab_x = as.Date(sprintf("%s-07-02", year))) -> year_cts

mutate(breaches, p_half = PwnCount/2) %>% # for centered segments
  ggplot() +
  geom_segment( # centered segments
    aes(BreachDate, p_half, xend=BreachDate, yend=-p_half),
    color = ft_cols$yellow, size = 0.3
  ) +
  geom_text( # selected breach labels
    data = labs, aes(BreachDate, PwnCount/2, label = lab),
    lineheight = 0.875, size = 3.25, family = font_rc,
    hjust = c(0, 1, 1, 0), vjust = 1,
    nudge_x = c(25, -25, -25, 25), nudge_y = 0,
    color = ft_cols$slate
  ) +
  geom_text( # top year labels
    data = year_cts, aes(lab_x, Inf, label = year),
    family = font_rc, size = 4, vjust = 1, lineheight = 0.875,
    color = ft_cols$gray
  ) +
  geom_text( # bottom known breach count totals
    data = year_cts, aes(lab_x, -Inf, label = nlab, size = n),
    vjust = 0, lineheight = 0.875, color = ft_cols$peach,
    family = font_rc, show.legend = FALSE
  ) +
  scale_x_date( # break on year
    name = NULL, date_breaks = "1 year", date_labels = "%Y"
  ) +
  scale_y_comma(name = NULL, limits = c(-450000000, 450000000)) + # make room for labels
  scale_size_continuous(range = c(3, 4.5)) + # tolerable font sizes
  labs(
    title = "HIBP (Known) Breach Frequency & Size",
    subtitle = "Segment length is number of accounts; n=# known account breaches that year",
    caption = "Source: HIBP via "
  ) +
  theme_ft_rc(grid="X") +
  theme(axis.text.y = element_blank()) +
  theme(axis.text.x = element_blank())


    CHAID vs. ranger vs. xgboost — a comparison

    Sun, 07/29/2018 - 16:03

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    In an earlier post, I focused on an in-depth visit with CHAID (Chi-square automatic interaction detection). Quoting myself, I said “As the name implies it is fundamentally based on the venerable Chi-square test – and while not the most powerful (in terms of detecting the smallest possible differences) or the fastest, it really is easy to manage and more importantly to tell the story after using it”. In this post I'll spend a little time comparing CHAID with a random forest algorithm in the ranger library and with a gradient boosting algorithm via the xgboost library. I'll use the exact same data set for all three so we can draw some easy comparisons about their speed and their accuracy.

    I do believe CHAID is a great choice for some sets of data and some circumstances but I'm interested in some empirical information, so off we go.

    Setup and library loading

    If you've never used CHAID before you may also not have partykit. CHAID isn't on CRAN but I have provided the commented out install command below. ranger and xgboost are available from CRAN and are straightforward to install. You'll also get a variety of messages, none of which is relevant to this example so I've suppressed them.

    # install.packages("partykit") # install.packages("CHAID", repos="http://R-Forge.R-project.org") # install.packages("ranger") # install.packages("xgboost") require(dplyr) require(tidyr) require(ggplot2) require(CHAID) require(purrr) require(caret) require(ranger) require(xgboost) require(kableExtra) # just to make the output nicer theme_set(theme_bw()) # set theme for ggplot2 Predicting customer churn for a fictional TELCO company

    We're going to use a dataset that comes to us from the IBM Watson Project. It's a very practical example and an understandable dataset. A great use case for the algorithms we'll be using. Imagine yourself in a fictional company faced with the task of trying to predict which customers are going to leave your business for another provider a.k.a. churn. Obviously we'd like to be able to predict this phenomenon and potentially target these customers for retention or just better project our revenue. Being able to predict churn even a little bit better could save us lots of money, especially if we can identify the key indicators and influence them.

In the original posting I spent a great deal of time explaining the mechanics of loading and prepping the data. This time we'll do that quickly and efficiently, and if you need an explanation of what's going on please refer back. I've embedded some comments in the code where I think they'll be most helpful. First we'll grab the data from the IBM site using read.csv; in this case I'm happy to let it tag most of our variables as factors since that's what we'll want for our CHAID work.

    set.seed(2018) churn <- read.csv("https://community.watsonanalytics.com/wp-content/uploads/2015/03/WA_Fn-UseC_-Telco-Customer-Churn.csv") str(churn) ## 'data.frame': 7043 obs. of 21 variables: ## $ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ... ## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ... ## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ... ## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ... ## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ... ## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ... ## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ... ## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ... ## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ... ## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ... ## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ... ## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ... ## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ... ## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ... ## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ... ## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ... ## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ... ## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ... ## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ... ## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

    We have data on 7,043 customers across 21 variables. customerID can't really be a predictor but we will use it in a little bit. Churn is what we want to predict so we have 19 potential predictor variables to work with. Four of them were not automatically converted to factors so we'll have to look into them for CHAID. For a review of what the output means and how CHAID works please refer back.

    Let's address the easiest thing first. SeniorCitizen is coded zero and one instead of yes/no so let's recode that in a nice conservative fashion and see what the breakdown is.

# Fix senior citizen status
churn$SeniorCitizen <- recode_factor(
  churn$SeniorCitizen,
  `0` = "No",
  `1` = "Yes",
  .default = "Should not happen"
)
summary(churn$SeniorCitizen)
##   No  Yes
## 5901 1142

    We have three variables left that are numeric, now that we have addressed senior citizen status. Let's use a combination of dplyr and ggplot2 to see what the distribution looks like using a density plot.

churn %>%
  select_if(is.numeric) %>%
  gather(metric, value) %>%
  ggplot(aes(value, fill = metric)) +
  geom_density(show.legend = FALSE) +
  facet_wrap( ~ metric, scales = "free")
## Warning: Removed 11 rows containing non-finite values (stat_density).

    The plot:

    Well those aren't the most normal looking distributions and we have this message ## Warning: Removed 11 rows containing non-finite values (stat_density). which alerts us to the fact that there are some missing values in our data. Let's first figure out where the missing data is:

churn %>%
  select_if(anyNA) %>%
  summary
##   TotalCharges
##  Min.   :  18.8
##  1st Qu.: 401.4
##  Median :1397.5
##  Mean   :2283.3
##  3rd Qu.:3794.7
##  Max.   :8684.8
##  NA's   :11

    Now we know that total customer charges are missing 11 entries. Our three algorithms vary as to how gracefully they handle missing values but at this point, we have several options including:

    • Eliminate the entire customer record if anything is missing
    • Impute or substitute in some reasonable value like the mean or the median for missing values
    • Do some fancier imputation to make sure we substitute in the most plausible value for TotalCharges

Elimination is easy, efficient, and conservative, and since it is a very small percentage of our total data set it is unlikely to cost us much information for the models that don't handle missing values well. But for purposes of this blog post, and to help demonstrate some of the capabilities within caret (since we're going to use it anyway), we'll try median and knn (k nearest neighbor) imputation.

    First let's make a vector that contains the customerID numbers of the eleven cases in question.

    xxx <- churn %>% filter_all(any_vars(is.na(.))) %>% select(customerID) xxx <- as.vector(xxx$customerID) xxx ## [1] "4472-LVYGI" "3115-CZMZD" "5709-LVOEQ" "4367-NUYAO" "1371-DWPAZ" "7644-OMVMY" "3213-VVOLG" "2520-SGTTA" "2923-ARZLG" "4075-WKNIU" "2775-SEFEE" churn %>% filter(customerID %in% xxx) ## customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn ## 1 4472-LVYGI Female No Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 NA No ## 2 3115-CZMZD Male No No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 NA No ## 3 5709-LVOEQ Female No Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 NA No ## 4 4367-NUYAO Male No Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 NA No ## 5 1371-DWPAZ Female No Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 NA No ## 6 7644-OMVMY Male No Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 NA No ## 7 3213-VVOLG Male No Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 NA No ## 8 2520-SGTTA Female No Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 NA No ## 9 2923-ARZLG Male No Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 NA No ## 10 4075-WKNIU Female No Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 NA No ## 11 2775-SEFEE Male No No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 NA No

As you look at those eleven records it doesn't appear they are “average”! In particular, I'm worried that the MonthlyCharges look small and that they have 0 tenure for this group. There's no way of knowing for certain, but it could be that these are just the newest customers with very little time using our service. Let's use our list to compare these eleven against the total population; that will help us decide what to do about the missing cases. Replacing with the median value is simple and easy, but it may well not be the most accurate choice.

churn %>%
  filter(customerID %in% xxx) %>%
  summarise(median(MonthlyCharges))
##   median(MonthlyCharges)
## 1                  25.75

median(churn$MonthlyCharges, na.rm = TRUE)
## [1] 70.35

churn %>%
  filter(customerID %in% xxx) %>%
  summarise(median(tenure))
##   median(tenure)
## 1              0

median(churn$tenure, na.rm = TRUE)
## [1] 29

    The median MonthlyCharges are much lower and instead of two years or so of median tenure this group has none. Let's use the preProcess function in caret to accomplish several goals. We'll ask it to impute the missing values for us using both knnImpute (k nearest neighbors) and a pure median medianImpute. From the ?preProcess help pages:

    k-nearest neighbor imputation is carried out by finding the k closest samples (Euclidian distance) in the training set. Imputation via bagging fits a bagged tree model for each predictor (as a function of all the others). This method is simple, accurate and accepts missing values, but it has much higher computational cost. Imputation via medians takes the median of each predictor in the training set, and uses them to fill missing values. This method is simple, fast, and accepts missing values, but treats each predictor independently, and may be inaccurate.

    We'll also have it transform our numeric variables using YeoJohnson and identify any predictor variables that have near zero variance nzv.

    # using k nearest neighbors pp_knn <- preProcess(churn, method = c("knnImpute", "YeoJohnson", "nzv")) # simple output pp_knn ## Created from 7032 samples and 21 variables ## ## Pre-processing: ## - centered (3) ## - ignored (18) ## - 5 nearest neighbor imputation (3) ## - scaled (3) ## - Yeo-Johnson transformation (3) ## ## Lambda estimates for Yeo-Johnson transformation: ## 0.45, 0.93, 0.25 # more verbose pp_knn$method ## $knnImpute ## [1] "tenure" "MonthlyCharges" "TotalCharges" ## ## $YeoJohnson ## [1] "tenure" "MonthlyCharges" "TotalCharges" ## ## $ignore ## [1] "customerID" "gender" "SeniorCitizen" "Partner" "Dependents" "PhoneService" "MultipleLines" "InternetService" "OnlineSecurity" "OnlineBackup" "DeviceProtection" "TechSupport" "StreamingTV" "StreamingMovies" "Contract" "PaperlessBilling" "PaymentMethod" "Churn" ## ## $center ## [1] "tenure" "MonthlyCharges" "TotalCharges" ## ## $scale ## [1] "tenure" "MonthlyCharges" "TotalCharges" # using medians pp_median <- preProcess(churn, method = c("medianImpute", "YeoJohnson", "nzv")) pp_median ## Created from 7032 samples and 21 variables ## ## Pre-processing: ## - ignored (18) ## - median imputation (3) ## - Yeo-Johnson transformation (3) ## ## Lambda estimates for Yeo-Johnson transformation: ## 0.45, 0.93, 0.25 pp_median$method ## $medianImpute ## [1] "tenure" "MonthlyCharges" "TotalCharges" ## ## $YeoJohnson ## [1] "tenure" "MonthlyCharges" "TotalCharges" ## ## $ignore ## [1] "customerID" "gender" "SeniorCitizen" "Partner" "Dependents" "PhoneService" "MultipleLines" "InternetService" "OnlineSecurity" "OnlineBackup" "DeviceProtection" "TechSupport" "StreamingTV" "StreamingMovies" "Contract" "PaperlessBilling" "PaymentMethod" "Churn"

    The preProcess function creates a list object of class preProcess that contains information about what needs to be done and what the results of the transformations will be, but we need to apply the predict function to actually make the changes proposed. So at this point let's create two new dataframes nchurn1 and nchurn2 that contain the data after the pre-processing has occurred. Then we can see how the results compare.

nchurn1 <- predict(pp_knn, churn)
nchurn2 <- predict(pp_median, churn)

nchurn2 %>%
  filter(customerID %in% xxx) %>%
  summarise(median(TotalCharges))
##   median(TotalCharges)
## 1             20.79526

median(nchurn2$TotalCharges, na.rm = TRUE)
## [1] 20.79526

nchurn1 %>%
  filter(customerID %in% xxx) %>%
  summarise(median(TotalCharges))
##   median(TotalCharges)
## 1            -1.849681

median(nchurn1$TotalCharges, na.rm = TRUE)
## [1] 0.01820494

It may also be useful to visualize the data, as we did earlier, to see how the transformations have changed the density plots.

nchurn1 %>%
  select_if(is.numeric) %>%
  gather(metric, value) %>%
  ggplot(aes(value, fill = metric)) +
  geom_density(show.legend = FALSE) +
  facet_wrap( ~ metric, scales = "free")

    Gives this plot:

nchurn2 %>%
  select_if(is.numeric) %>%
  gather(metric, value) %>%
  ggplot(aes(value, fill = metric)) +
  geom_density(show.legend = FALSE) +
  facet_wrap( ~ metric, scales = "free")

    Gives this plot:

If you compare the two plots you can see that they vary imperceptibly except for the y axis scale. There is no warning about missing values, and if you scroll back and compare with the original plots of the raw variables, the shapes of tenure and TotalCharges have changed significantly because of the transformation.

    I'm pretty convinced that knn provides a much better approximation of those eleven missing values than a mere median substitution so let's make those changes and move on to comparing models. While we're at it, let's go ahead and remove the unique customer ID number as well. We really only needed it to compare a few specific cases.

    churn <- predict(pp_knn,churn) churn$customerID <- NULL str(churn) ## 'data.frame': 7043 obs. of 20 variables: ## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ... ## $ SeniorCitizen : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ... ## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ... ## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ... ## $ tenure : num -1.644 0.297 -1.495 0.646 -1.495 ... ## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ... ## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ... ## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ... ## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ... ## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ... ## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ... ## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ... ## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ... ## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ... ## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ... ## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ... ## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ... ## $ MonthlyCharges : num -1.158 -0.239 -0.343 -0.731 0.214 ... ## $ TotalCharges : num -1.81 0.254 -1.386 0.233 -1.249 ... ## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

    One more step before we start using CHAID, ranger, and xgboost and while we have the data in one frame. Let's take the 3 numeric variables and create 3 analogous variables as factors. This is necessary because CHAID requires categorical a.k.a. nominal data. If you'd like to review the options for how to “cut” the data please refer back to my earlier post.

    churn <- churn %>%
      mutate_if(is.numeric,
                funs(factor = cut_number(., n=5,
                                         labels = c("Lowest","Below Middle","Middle","Above Middle","Highest"))))
    summary(churn)
    ## gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn tenure_factor MonthlyCharges_factor TotalCharges_factor
    ## Female:3488 No :5901 No :3641 No :4933 Min. :-1.8439 No : 682 No :3390 DSL :2421 No :3498 No :3088 No :3095 No :3473 No :2810 No :2785 Month-to-month:3875 No :2872 Bank transfer (automatic):1544 Min. :-1.5685 Min. :-1.929306 No :5174 Lowest :1481 Lowest :1420 Lowest :1409
    ## Male :3555 Yes:1142 Yes:3402 Yes:2110 1st Qu.:-0.8555 Yes:6361 No phone service: 682 Fiber optic:3096 No internet service:1526 No internet service:1526 No internet service:1526 No internet service:1526 No internet service:1526 No internet service:1526 One year :1473 Yes:4171 Credit card (automatic) :1522 1st Qu.:-0.9632 1st Qu.:-0.783551 Yes:1869 Below Middle:1397 Below Middle:1397 Below Middle:1408
    ## Median : 0.1183 Yes :2971 No :1526 Yes :2019 Yes :2429 Yes :2422 Yes :2044 Yes :2707 Yes :2732 Two year :1695 Electronic check :2365 Median : 0.2021 Median : 0.018205 Middle :1408 Middle :1411 Middle :1409
    ## Mean : 0.0000 Mailed check :1612 Mean : 0.0000 Mean :-0.002732 Above Middle:1350 Above Middle:1407 Above Middle:1408
    ## 3rd Qu.: 0.9252 3rd Qu.: 0.8341 3rd Qu.: 0.868066 Highest :1407 Highest :1408 Highest :1409
    ## Max. : 1.3421 Max. : 1.7530 Max. : 1.758003

    Okay, now we have three additional variables that end in _factor; they're like their numeric equivalents, only cut into five roughly equal-sized bins.

    Training and testing our models

    We're going to use caret to train and test all three of the algorithms on our data. We could invoke the individual model functions directly, but caret allows us to use a common set of steps. We'll employ cross-validation, a.k.a. cv, to mitigate the problem of over-fitting. This article explains it well, so I won't repeat that explanation here; I'll simply show you how to run the steps in R.

    This is also a good time to point out that caret has extraordinarily comprehensive documentation, which I used extensively; here I'm limiting myself to the basics.

    As a first step, let's take 30% of our data and put it aside as the testing dataset. Why 30%? It doesn't have to be; it could be as low as 20% or as high as 40%, depending on how conservative you want to be and how much data you have at hand. Since this is just a tutorial, we'll simply use 30% as a representative number. I'm going to use caret syntax, which is the line with createDataPartition(churn$Churn, p=0.7, list=FALSE) in it. That takes our dataset churn and makes a 70% split, ensuring that we keep the split of our outcome variable Churn as close to 70/30 as we can. This is important because our data is already pretty lop-sided for outcomes. The two subsequent lines take the vector intrain and produce two separate dataframes, testing and training. They have 2,112 and 4,931 customers respectively.

    intrain <- createDataPartition(churn$Churn, p=0.7, list=FALSE)
    training <- churn[intrain,]
    testing <- churn[-intrain,]
    dim(training)
    ## [1] 4931   23
    dim(testing)
    ## [1] 2112   23

    CHAID

    Now that we have training and testing datasets, let's remove the numeric versions of the variables, since CHAID can't use them.

    # first pass at CHAID
    # remove numbers
    training <- training %>%
      select_if(is.factor)
    dim(training)
    ## [1] 4931   20
    testing <- testing %>%
      select_if(is.factor)
    dim(testing)
    ## [1] 2112   20

    The next step is a little counter-intuitive but quite practical. It turns out that many models do not perform well when you feed them a formula for the model, even if they claim to support a formula interface (as CHAID does). Here's a Stack Overflow link that discusses this in detail, but my suggestion to you is to always separate them and avoid the problem altogether. We're just taking our predictors, or features, and putting them in x, while we put our outcome in y.

    # create response and feature data
    features <- setdiff(names(training), "Churn")
    x <- training[, features]
    y <- training$Churn

    trainControl is the next function within caret we need to use. Chapter 5 in the caret doco covers it in great detail. I'm simply going to pluck out a few sane and safe options: method = "cv" gets us cross-validation; number = 5 is pretty obvious; I happen to like seeing the progress in case I want to go for coffee, so verboseIter = TRUE (here I turn it off, since the static output is rather boring); and I play it safe and explicitly save my predictions with savePredictions = "final". We put everything in train_control, which we'll use in a minute. We'll use this same train_control for all our models.

    # set up 5-fold cross validation procedure
    train_control <- trainControl(method = "cv",
                                  number = 5,
                                  # verboseIter = TRUE,
                                  savePredictions = "final")

    By default caret allows us to adjust three parameters in our chaid model: alpha2, alpha3, and alpha4. As a matter of fact, it will allow us to build a grid of those parameters and test all the permutations we like, using the same cross-validation process. I'm a bit worried that we're not being conservative enough, so I'd like to train our model using p values for alpha that are not .05, .03, and .01 but instead the de facto levels in my discipline: .05, .01, and .001. The relevant argument to caret's train is tuneGrid. We'll use the base R function expand.grid to build a dataframe with all the combinations and then feed it to caret via tuneGrid = search_grid in our call to train.

    # set up tuning grid default
    search_grid <- expand.grid(
      alpha2 = c(.05, .01, .001),
      alpha4 = c(.05, .01, .001),
      alpha3 = -1
    )

    Now we can use the train function in caret to train our model! It wants to know what our x and y's are, as well as our training control parameters which we've parked in train_control.

    chaid.model <- train(
      x = x,
      y = y,
      method = "chaid",
      trControl = train_control,
      tuneGrid = search_grid
    )
    chaid.model
    ## CHi-squared Automated Interaction Detection
    ##
    ## 4931 samples
    ##   19 predictor
    ##    2 classes: 'No', 'Yes'
    ##
    ## No pre-processing
    ## Resampling: Cross-Validated (5 fold)
    ## Summary of sample sizes: 3944, 3946, 3945, 3944, 3945
    ## Resampling results across tuning parameters:
    ##
    ##   alpha2  alpha4  Accuracy   Kappa
    ##   0.001   0.001   0.7805723  0.3956853
    ##   0.001   0.010   0.7832088  0.3786619
    ##   0.001   0.050   0.7767165  0.3956067
    ##   0.010   0.001   0.7844263  0.3950757
    ##   0.010   0.010   0.7832088  0.3740510
    ##   0.010   0.050   0.7767167  0.3883397
    ##   0.050   0.001   0.7844263  0.3950757
    ##   0.050   0.010   0.7817889  0.3764934
    ##   0.050   0.050   0.7783394  0.3921129
    ##
    ## Tuning parameter 'alpha3' was held constant at a value of -1
    ## Accuracy was used to select the optimal model using the largest value.
    ## The final values used for the model were alpha2 = 0.05, alpha3 = -1 and alpha4 = 0.001.

    And after roughly two minutes it's done. Let's inspect what we have so far. The output gives us a nice concise summary: 4931 cases with 19 predictors. It also gives us an idea of how many of the 4931 cases were used in the individual folds: Summary of sample sizes: 3944, 3946, 3945, 3944, 3945. If you need a review of what alpha2, alpha4, and alpha3 are, please review the ?chaid doco.

    You'll notice that I stored the results in an object called chaid.model. That object has lots of useful information you can access (it's a list object of class “train”). As a matter of fact we will be creating one object per run and then using the stored information to build a nice comparison later. For now here are some useful examples of what's contained in the object…

    • Produce the confusion matrix across all folds: confusionMatrix(chaid.model)
    • Plot the effect of the tuning parameters on accuracy: plot(chaid.model). Note that the scaling deceives the eye; the results are actually quite close across the plot
    • Check on variable importance: varImp(chaid.model)
    • How long did it take? Look in chaid.model$times

    If you need a refresher on what these represent please see the earlier post on CHAID.

    confusionMatrix(chaid.model)
    ## Cross-Validated (5 fold) Confusion Matrix
    ##
    ## (entries are percentual average cell counts across resamples)
    ##
    ##           Reference
    ## Prediction   No  Yes
    ##        No  66.2 14.3
    ##        Yes  7.2 12.2
    ##
    ##  Accuracy (average) : 0.7844

    plot(chaid.model)

    Gives this plot:

    varImp(chaid.model)
    ## ROC curve variable importance
    ##
    ##                       Importance
    ## Contract                100.0000
    ## tenure_factor            95.7457
    ## OnlineSecurity           81.2110
    ## TechSupport              74.1473
    ## TotalCharges_factor      57.0885
    ## MonthlyCharges_factor    53.3755
    ## OnlineBackup             48.3382
    ## DeviceProtection         45.3815
    ## PaperlessBilling         44.2540
    ## Partner                  36.7387
    ## Dependents               35.4481
    ## PaymentMethod            29.7699
    ## SeniorCitizen            22.5287
    ## StreamingMovies          10.3811
    ## MultipleLines             9.1536
    ## InternetService           7.8258
    ## StreamingTV               6.7502
    ## gender                    0.6105
    ## PhoneService              0.0000

    chaid.model$times
    ## $everything
    ##    user  system elapsed
    ## 146.121   2.658 149.002
    ##
    ## $final
    ##    user  system elapsed
    ##   1.401   0.045   1.445
    ##
    ## $prediction
    ## [1] NA NA NA

    One of the nice aspects of CHAID as a method is that it is relatively easy to "see" your model in either text or plot format. While there are packages that will help you "see" a random forest, by definition (pardon the pun) it's hard to see the forest because of all the trees. Simply "printing" the final model with chaid.model$finalModel gives you the text representation, while plot(chaid.model$finalModel) plots the final model. As I explained in the earlier post, it's nice being able to see at a high level where your model fits well and where it misses.

    chaid.model$finalModel
    ##
    ## Model formula:
    ## .outcome ~ gender + SeniorCitizen + Partner + Dependents + PhoneService +
    ##     MultipleLines + InternetService + OnlineSecurity + OnlineBackup +
    ##     DeviceProtection + TechSupport + StreamingTV + StreamingMovies +
    ##     Contract + PaperlessBilling + PaymentMethod + tenure_factor +
    ##     MonthlyCharges_factor + TotalCharges_factor
    ##
    ## Fitted party:
    ## [1] root
    ## |   [2] Contract in Month-to-month
    ## |   |   [3] InternetService in DSL
    ## |   |   |   [4] TotalCharges_factor in Lowest: No (n = 326, err = 49.7%)
    ## |   |   |   [5] TotalCharges_factor in Below Middle: No (n = 255, err = 27.5%)
    ## |   |   |   [6] TotalCharges_factor in Middle, Above Middle, Highest: No (n = 282, err = 15.2%)
    ## |   |   [7] InternetService in Fiber optic
    ## |   |   |   [8] tenure_factor in Lowest
    ## |   |   |   |   [9] OnlineSecurity in No, No internet service: Yes (n = 414, err = 25.4%)
    ## |   |   |   |   [10] OnlineSecurity in Yes: No (n = 33, err = 42.4%)
    ## |   |   |   [11] tenure_factor in Below Middle: Yes (n = 413, err = 41.4%)
    ## |   |   |   [12] tenure_factor in Middle: No (n = 352, err = 44.3%)
    ## |   |   |   [13] tenure_factor in Above Middle, Highest: No (n = 276, err = 31.9%)
    ## |   |   [14] InternetService in No: No (n = 378, err = 17.2%)
    ## |   [15] Contract in One year
    ## |   |   [16] StreamingMovies in No: No (n = 303, err = 8.3%)
    ## |   |   [17] StreamingMovies in No internet service: No (n = 268, err = 2.2%)
    ## |   |   [18] StreamingMovies in Yes: No (n = 457, err = 20.6%)
    ## |   [19] Contract in Two year
    ## |   |   [20] InternetService in DSL, No: No (n = 866, err = 1.2%)
    ## |   |   [21] InternetService in Fiber optic: No (n = 308, err = 8.1%)
    ##
    ## Number of inner nodes:     7
    ## Number of terminal nodes: 14

    plot(chaid.model$finalModel)

    Gives this plot:

    Finally, probably the most important step of all: we'll take our trained model and apply it to the testing data that we held back to see how well it fits data it has never seen before. This is a key step because it reassures us that we have not overfit our model (if you want a fuller understanding, please consider reading this post on EliteDataScience). We'll take the model we built with the training dataset, chaid.model, have it predict against the testing dataset, and see how we did with a confusionMatrix.

    confusionMatrix(predict(chaid.model, newdata = testing), testing$Churn)
    ## Confusion Matrix and Statistics
    ##
    ##           Reference
    ## Prediction   No  Yes
    ##        No  1443  331
    ##        Yes  109  229
    ##
    ##                Accuracy : 0.7917
    ##                  95% CI : (0.7737, 0.8088)
    ##     No Information Rate : 0.7348
    ##     P-Value [Acc > NIR] : 7.793e-10
    ##
    ##                   Kappa : 0.3878
    ##  Mcnemar's Test P-Value : < 2.2e-16
    ##
    ##             Sensitivity : 0.9298
    ##             Specificity : 0.4089
    ##          Pos Pred Value : 0.8134
    ##          Neg Pred Value : 0.6775
    ##              Prevalence : 0.7348
    ##          Detection Rate : 0.6832
    ##    Detection Prevalence : 0.8400
    ##       Balanced Accuracy : 0.6693
    ##
    ##        'Positive' Class : No
    ##

    Very nice! Our accuracy on testing actually exceeds the accuracy we achieved in training.

    Random Forest via ranger

    One of the nicest things about using caret is that it is pretty straight-forward to move from one model to another. The amount of work we have to do while moving from CHAID to ranger and eventually xgboost is actually quite modest.

    ranger will accept a mix of factors and numeric variables, so our first step will be to go back and recreate training and testing using the numeric versions of tenure, MonthlyCharges, and TotalCharges instead of the _factor versions. intrain still holds our list of rows that should be in training, so we'll follow the exact same process, just keeping the numeric versions, and arrive at x and y to feed to caret and ranger.

    ##### using ranger
    # intrain <- createDataPartition(churn$Churn,p=0.7,list=FALSE)
    training <- churn[intrain,]
    testing <- churn[-intrain,]
    dim(training)
    ## [1] 4931   23
    dim(testing)
    ## [1] 2112   23
    training <- training %>%
      select(-ends_with("_factor"))
    dim(training)
    ## [1] 4931   20
    # testing <- testing %>%
    #   select(-ends_with("_factor"))
    dim(testing)
    ## [1] 2112   23

    # create response and feature data
    features <- setdiff(names(training), "Churn")
    x <- training[, features]
    y <- training$Churn

    As I mentioned earlier train_control doesn't have to change at all. So I'll just print it to remind you of what's in there.

    search_grid is almost always specific to the model and this is no exception. When we consult the documentation for ranger within caret we see that we can adjust mtry, splitrule, and min.node.size. We'll put in some reasonable values for those and then put the resulting grid into rf_grid. I tried to give ranger's search grid about the same amount of flexibility as I did for CHAID.

    ##### reusing train_control
    head(train_control)
    ## $method
    ## [1] "cv"
    ##
    ## $number
    ## [1] 5
    ##
    ## $repeats
    ## [1] NA
    ##
    ## $search
    ## [1] "grid"
    ##
    ## $p
    ## [1] 0.75
    ##
    ## $initialWindow
    ## NULL

    # define a grid of parameter options to try with ranger
    rf_grid <- expand.grid(mtry = c(2:4),
                           splitrule = c("gini"),
                           min.node.size = c(3, 5, 7))
    rf_grid
    ##   mtry splitrule min.node.size
    ## 1    2      gini             3
    ## 2    3      gini             3
    ## 3    4      gini             3
    ## 4    2      gini             5
    ## 5    3      gini             5
    ## 6    4      gini             5
    ## 7    2      gini             7
    ## 8    3      gini             7
    ## 9    4      gini             7

    Okay, we're ready to train our model using ranger now. The only additional line we need (besides changing the method from chaid to ranger) is to tell it what to use to capture variable importance, e.g. "impurity".

    # re-fit the model with the parameter grid
    rf.model <- train(
      x = x,
      y = y,
      method = "ranger",
      trControl = train_control,
      tuneGrid = rf_grid,
      importance = "impurity")
    rf.model
    ## Random Forest
    ##
    ## 4931 samples
    ##   19 predictor
    ##    2 classes: 'No', 'Yes'
    ##
    ## No pre-processing
    ## Resampling: Cross-Validated (5 fold)
    ## Summary of sample sizes: 3946, 3944, 3945, 3944, 3945
    ## Resampling results across tuning parameters:
    ##
    ##   mtry  min.node.size  Accuracy   Kappa
    ##   2     3              0.7990299  0.4368130
    ##   2     5              0.8000437  0.4380797
    ##   2     7              0.7976106  0.4298171
    ##   3     3              0.7963928  0.4342273
    ##   3     5              0.7980151  0.4365761
    ##   3     7              0.7986242  0.4360132
    ##   4     3              0.7970009  0.4343468
    ##   4     5              0.7982171  0.4358659
    ##   4     7              0.7943648  0.4228518
    ##
    ## Tuning parameter 'splitrule' was held constant at a value of gini
    ## Accuracy was used to select the optimal model using the largest value.
    ## The final values used for the model were mtry = 2, splitrule = gini and min.node.size = 5.

    Now we can run the exact same set of commands as we did with chaid.model on rf.model.

    confusionMatrix(rf.model)
    ## Cross-Validated (5 fold) Confusion Matrix
    ##
    ## (entries are percentual average cell counts across resamples)
    ##
    ##           Reference
    ## Prediction   No  Yes
    ##        No  67.1 13.6
    ##        Yes  6.4 12.9
    ##
    ##  Accuracy (average) : 0.8

    plot(rf.model)

    Gives this plot:

    varImp(rf.model)
    ## ranger variable importance
    ##
    ##                  Overall
    ## tenure           100.000
    ## TotalCharges      97.336
    ## MonthlyCharges    88.321
    ## Contract          66.882
    ## OnlineSecurity    43.533
    ## TechSupport       37.978
    ## PaymentMethod     34.685
    ## InternetService   29.435
    ## OnlineBackup      18.389
    ## DeviceProtection  14.376
    ## PaperlessBilling  13.363
    ## MultipleLines     10.114
    ## Partner            9.712
    ## StreamingTV        9.607
    ## StreamingMovies    8.183
    ## gender             7.998
    ## Dependents         7.679
    ## SeniorCitizen      7.314
    ## PhoneService       0.000

    rf.model$times
    ## $everything
    ##    user  system elapsed
    ## 108.364   1.834  18.655
    ##
    ## $final
    ##    user  system elapsed
    ##   2.160   0.030   0.359
    ##
    ## $prediction
    ## [1] NA NA NA

    Now, the all-important prediction against the testing data set.

    confusionMatrix(predict(rf.model, newdata = testing), testing$Churn)
    ## Confusion Matrix and Statistics
    ##
    ##           Reference
    ## Prediction   No  Yes
    ##        No  1427  286
    ##        Yes  125  274
    ##
    ##                Accuracy : 0.8054
    ##                  95% CI : (0.7879, 0.8221)
    ##     No Information Rate : 0.7348
    ##     P-Value [Acc > NIR] : 2.048e-14
    ##
    ##                   Kappa : 0.4501
    ##  Mcnemar's Test P-Value : 2.969e-15
    ##
    ##             Sensitivity : 0.9195
    ##             Specificity : 0.4893
    ##          Pos Pred Value : 0.8330
    ##          Neg Pred Value : 0.6867
    ##              Prevalence : 0.7348
    ##          Detection Rate : 0.6757
    ##    Detection Prevalence : 0.8111
    ##       Balanced Accuracy : 0.7044
    ##
    ##        'Positive' Class : No
    ##

    Very nice! Once again our accuracy on testing actually exceeds the accuracy we achieved in training. Looks like we were more accurate than CHAID but we'll come back to that after we finish xgboost.

    Extreme Gradient Boosting via xgboost

    Moving from ranger to xgboost is even easier than it was from CHAID.

    xgboost, like ranger, will accept a mix of factors and numeric variables, so there is no need to change our training and testing datasets at all. There's also no need to change our train_control. As far as tuning goes, caret supports 7 of the many parameters that you could feed to ?xgboost. If you consult the caret documentation here under xgbTree you'll see them listed. If you don't provide any tuning guidance, it will provide a default set of pretty rational initial values. I initially ran it that way, but for the purposes of this post I have chosen only the few that seem to make the largest difference to accuracy and set the rest to constants.

    One final important note about the code below: notice that in the train command I am feeding a formula, Churn ~ ., to train. If you try to give it the same x = x and y = y syntax I used with ranger, it will fail. That's because, as stated in the doco, "xgb.train accepts only an xgb.DMatrix as the input. xgboost, in addition, also accepts matrix, dgCMatrix, or name of a local data file." You could use commands like xx <- model.matrix(~. -1, data=x)[,-1] and yy <- as.numeric(y) -1 to convert them, but since our dataset is small I'm just going to use the formula interface.

    # reusing train_control
    head(train_control)
    ## $method
    ## [1] "cv"
    ##
    ## $number
    ## [1] 5
    ##
    ## $repeats
    ## [1] NA
    ##
    ## $search
    ## [1] "grid"
    ##
    ## $p
    ## [1] 0.75
    ##
    ## $initialWindow
    ## NULL

    # define a grid of parameter options to try with xgboost
    xgb_grid <- expand.grid(nrounds = c(100, 150, 200),
                            max_depth = 1,
                            min_child_weight = 1,
                            subsample = 1,
                            gamma = 0,
                            colsample_bytree = 0.8,
                            eta = c(.2, .3, .4))
    xgb_grid
    ##   nrounds max_depth min_child_weight subsample gamma colsample_bytree eta
    ## 1     100         1                1         1     0              0.8 0.2
    ## 2     150         1                1         1     0              0.8 0.2
    ## 3     200         1                1         1     0              0.8 0.2
    ## 4     100         1                1         1     0              0.8 0.3
    ## 5     150         1                1         1     0              0.8 0.3
    ## 6     200         1                1         1     0              0.8 0.3
    ## 7     100         1                1         1     0              0.8 0.4
    ## 8     150         1                1         1     0              0.8 0.4
    ## 9     200         1                1         1     0              0.8 0.4

    # Fit the model with the parameter grid
    xgboost.model <- train(Churn ~ .,
                           training,
                           method = "xgbTree",
                           tuneGrid = xgb_grid,
                           trControl = train_control)
    xgboost.model
    ## eXtreme Gradient Boosting
    ##
    ## 4931 samples
    ##   19 predictor
    ##    2 classes: 'No', 'Yes'
    ##
    ## No pre-processing
    ## Resampling: Cross-Validated (5 fold)
    ## Summary of sample sizes: 3945, 3944, 3945, 3944, 3946
    ## Resampling results across tuning parameters:
    ##
    ##   eta  nrounds  Accuracy   Kappa
    ##   0.2  100      0.7974010  0.4272991
    ##   0.2  150      0.7998353  0.4414212
    ##   0.2  200      0.8002400  0.4465105
    ##   0.3  100      0.8002410  0.4449462
    ##   0.3  150      0.8020667  0.4532478
    ##   0.3  200      0.8028779  0.4565789
    ##   0.4  100      0.8010528  0.4478758
    ##   0.4  150      0.8020655  0.4530151
    ##   0.4  200      0.8024716  0.4546512
    ##
    ## Tuning parameter 'max_depth' was held constant at a value of 1
    ## Tuning parameter 'gamma' was held constant at a value of 0
    ## Tuning parameter 'colsample_bytree' was held constant at a value of 0.8
    ## Tuning parameter 'min_child_weight' was held constant at a value of 1
    ## Tuning parameter 'subsample' was held constant at a value of 1
    ## Accuracy was used to select the optimal model using the largest value.
    ## The final values used for the model were nrounds = 200, max_depth = 1, eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample = 1.

    After a (relatively) brief moment the results are back. Average accuracy in training is .8029, which is better than either CHAID or ranger. We can run the same additional commands, simply substituting xgboost.model.

    confusionMatrix(xgboost.model)
    ## Cross-Validated (5 fold) Confusion Matrix
    ##
    ## (entries are percentual average cell counts across resamples)
    ##
    ##           Reference
    ## Prediction   No  Yes
    ##        No  66.5 12.7
    ##        Yes  7.0 13.8
    ##
    ##  Accuracy (average) : 0.8029

    plot(xgboost.model)

    The plot:

    varImp(xgboost.model)
    ## xgbTree variable importance
    ##
    ##   only 20 most important variables shown (out of 30)
    ##
    ##                                      Overall
    ## tenure                             100.00000
    ## InternetServiceFiber optic          64.17202
    ## ContractTwo year                    48.23551
    ## InternetServiceNo                   26.81070
    ## ContractOne year                    10.38092
    ## OnlineSecurityYes                    7.23882
    ## StreamingTVYes                       5.04154
    ## MultipleLinesYes                     2.87755
    ## StreamingMoviesYes                   2.74796
    ## TechSupportYes                       2.14009
    ## PhoneServiceYes                      1.52273
    ## DependentsYes                        1.33244
    ## SeniorCitizenYes                     0.50710
    ## genderMale                           0.23151
    ## DeviceProtectionYes                  0.03949
    ## MonthlyCharges                       0.00000
    ## MultipleLinesNo phone service        0.00000
    ## StreamingMoviesNo internet service   0.00000
    ## OnlineBackupNo internet service      0.00000
    ## PartnerYes                           0.00000

    xgboost.model$times
    ## $everything
    ##    user  system elapsed
    ##   4.398   0.096   4.497
    ##
    ## $final
    ##    user  system elapsed
    ##   0.266   0.002   0.269
    ##
    ## $prediction
    ## [1] NA NA NA

    Now, the all-important prediction against the testing data set.

    confusionMatrix(predict(xgboost.model, newdata = testing), testing$Churn)
    ## Confusion Matrix and Statistics
    ##
    ##           Reference
    ## Prediction   No  Yes
    ##        No  1405  262
    ##        Yes  147  298
    ##
    ##                Accuracy : 0.8063
    ##                  95% CI : (0.7888, 0.823)
    ##     No Information Rate : 0.7348
    ##     P-Value [Acc > NIR] : 9.038e-15
    ##
    ##                   Kappa : 0.4682
    ##  Mcnemar's Test P-Value : 1.731e-08
    ##
    ##             Sensitivity : 0.9053
    ##             Specificity : 0.5321
    ##          Pos Pred Value : 0.8428
    ##          Neg Pred Value : 0.6697
    ##              Prevalence : 0.7348
    ##          Detection Rate : 0.6652
    ##    Detection Prevalence : 0.7893
    ##       Balanced Accuracy : 0.7187
    ##
    ##        'Positive' Class : No
    ##

    Very nice! Once again our accuracy on testing, .8063, actually exceeds the accuracy we achieved in training. Looks like we were more accurate than either CHAID or ranger; we'll focus on the comparison in the next section.

    Comparing Models

    At this juncture we're faced with a problem I've had before. We're drowning in data from the individual confusionMatrix results. We'll resort to the same purrr solution to give us a far more legible table of results focusing on the metrics I'm most interested in. To do that we need to:

    • Make a named list called modellist that contains our 3 models with a descriptive name for each
    • Use map from purrr to apply the predict command to each model in turn to our testing dataset
    • Pipe those results to a second map command to generate a confusion matrix comparing our predictions to testing$Churn which are the actual outcomes.
    • Pipe those results to a complex map_dfr that creates a dataframe of all the results with each model as a row.
    • Separately grab the elapsed times for training with commands like chaid.model$times$everything[[3]]
    • Separately grab the best accuracy for training with commands like max(chaid.model$results$Accuracy)
    • Then use kable to make a pretty table that is much easier to understand.
    modellist <- list("CHAID" = chaid.model,
                      "ranger" = rf.model,
                      "xgboost" = xgboost.model)
    CompareResults <- map(modellist, ~ predict(.x, newdata = testing)) %>%
      map(~ confusionMatrix(testing$Churn, .x)) %>%
      map_dfr(~ cbind(as.data.frame(t(.x$overall)),
                      as.data.frame(t(.x$byClass))),
              .id = "Model")
    CompareResults[1, "ETime"] <- chaid.model$times$everything[[3]]
    CompareResults[2, "ETime"] <- rf.model$times$everything[[3]]
    CompareResults[3, "ETime"] <- xgboost.model$times$everything[[3]]
    CompareResults[1, "BestTrain"] <- max(chaid.model$results$Accuracy)
    CompareResults[2, "BestTrain"] <- max(rf.model$results$Accuracy)
    CompareResults[3, "BestTrain"] <- max(xgboost.model$results$Accuracy)
    kable(CompareResults, "html") %>%
      kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

    What do we know?

    Well our table looks very nice but there's probably still too much information. What data should we focus on and what conclusions can we draw from our little exercise in comparative modeling? I will draw your attention back to this webpage to review the terminology for classification models and how to interpret a confusion matrix.

    Accuracy, Kappa, and F1 are all measures of overall accuracy, and there are merits to each. Pos Pred Value and Neg Pred Value are related but more nuanced ideas that we'll discuss in a minute. We'll also want to talk about the time it took to train each model (ETime) and training accuracy (BestTrain).
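    To make those ideas concrete, here is a quick hand computation (my own sketch, not part of the original pipeline) of several of these metrics from the CHAID test-set confusion matrix printed earlier, where the 'Positive' class is "No":

    # cell counts from confusionMatrix(predict(chaid.model, newdata = testing), testing$Churn)
    TP <- 1443   # predicted No, actually No
    FP <- 331    # predicted No, actually Yes
    FN <- 109    # predicted Yes, actually No
    TN <- 229    # predicted Yes, actually Yes

    (TP + TN) / (TP + FP + FN + TN)   # Accuracy    ~ 0.7917
    TP / (TP + FN)                    # Sensitivity ~ 0.9298
    TN / (TN + FP)                    # Specificity ~ 0.4089
    precision <- TP / (TP + FP)
    recall    <- TP / (TP + FN)
    2 * precision * recall / (precision + recall)   # F1 ~ 0.8677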

    Let's use dplyr to select just these columns we want and see what we can glean from this reduced table.

    CompareResults %>%
      select(Model, ETime, BestTrain, Accuracy, Kappa, F1,
             'Pos Pred Value', 'Neg Pred Value') %>%
      kable("html") %>%
      kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

    Model     ETime    BestTrain  Accuracy   Kappa      F1         Pos Pred Value  Neg Pred Value
    CHAID     162.493  0.7844263  0.7916667  0.3878325  0.8677090  0.9297680       0.4089286
    ranger     19.375  0.8000437  0.8053977  0.4501003  0.8741194  0.9194588       0.4892857
    xgboost     4.650  0.8028779  0.8063447  0.4681509  0.8729419  0.9052835       0.5321429

    Clearly xgboost is the fastest to train a model: more than 30 times faster than CHAID, and roughly four times faster than ranger, for this data. Not really surprising, since xgboost is a very modern codebase designed from the ground up to be fast and efficient.

    One interesting fact you can glean from all 3 models is that they all did better on testing than they did on training. This is slightly unusual, since one would expect performance to slip a little on unseen data, but it is likely due to a lucky split, with more of the difficult-to-predict cases falling in training than in testing. The good news is that it leaves us comfortable that we did not overfit our model to the training data, which is why we were conservative in our fitting and cross-validated the training data.

    No matter which "accuracy measure" we look at (Accuracy, Kappa, or F1), the answer is pretty consistent: xgboost "wins", i.e. it is the most accurate. The one exception is F1, where ranger edges it out by 0.11775 percentage points, which works out to being correct on about 3 more cases out of the 2,112 in the testing set.

    Notice that the differences in accuracy are not large as percentages: xgboost is 1.4678 percentage points more accurate than CHAID, which means it correctly predicted 31 more customers. While more accurate is always "better", the practical significance is also a matter of what the stakes are. If a wrong prediction costs you $1,000.00, that additional accuracy is more important than if the dollar amount is small.
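    As a quick back-of-the-envelope check of that claim (my own arithmetic, using the numbers from the table above):

    # difference in test-set accuracy multiplied by the number of test cases
    (0.8063447 - 0.7916667) * 2112   # roughly 31 additional correct predictions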

    I also deliberately included the positive and negative predictive values, the columns labelled Pos Pred Value and Neg Pred Value, for a very specific reason. Notice that CHAID has the highest Pos Pred Value, which means it is the most accurate at predicting customers who did not "churn": of the 1,552 customers who did not leave us, it correctly predicted 1,443 of them. xgboost, on the other hand, was much better on Neg Pred Value, correctly predicting 298 out of the 560 customers who did leave us. While Accuracy, Kappa and F1 take different approaches to finding "balanced" accuracy, sometimes one case, negative or positive, has more important implications for your business, and you should choose your measure accordingly.

    At least at this point, after a possibly tl;dr journey, we have some empirical data to inform my original statement about CHAID: "As the name implies it is fundamentally based on the venerable Chi-square test – and while not the most powerful (in terms of detecting the smallest possible differences) or the fastest, it really is easy to manage and more importantly to tell the story after using it".

    What don't we know?
    • Whether this example would apply to other types of datasets. Absolutely not! This sort of data is almost ideal for CHAID, since it involves a lot of nominal/categorical and/or ordinal data, and CHAID will slow down quickly as we add more columns. More generally, this was one example on a relatively small dataset, more about learning something about caret and process than a true comparison of accuracy across a wide range of cases.
    • Whether this is the "best" these models can do with this data. Absolutely not! I made no attempt to seriously tune any of them, just aimed for some mild comparability, and I made no effort to feature engineer or otherwise adjust. I'm pretty certain that if you tried you could squeeze a little more out of all three. Even with CHAID there's more we could do very easily: I arbitrarily divided tenure into 5 equal-sized bins. Why not 10? Why not equal-width instead of equal-sized bins? (There's a short sketch of those alternatives right after this list.)
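    Here's that sketch: a minimal, untested illustration of two alternative binnings of tenure using ggplot2's cut_number() and cut_interval() helpers (cut_number() is the same function used earlier).

    library(ggplot2)                            # cut_number() and cut_interval() live here
    table(cut_number(churn$tenure, n = 10))     # ten bins with roughly equal counts in each
    table(cut_interval(churn$tenure, n = 5))    # five bins of equal width instead of equal size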

    I hope you've found this useful. I am always open to comments, corrections and suggestions.


      To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

      Statistics Sunday: More Text Analysis – Term Frequency and Inverse Document Frequency

      Sun, 07/29/2018 - 15:30

      (This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


      Statistics Sunday: Term Frequency and Inverse Document Frequency

      As a mixed methods researcher, I love working with qualitative data, but I also love the idea of using quantitative methods to add some meaning and context to the words. This is the main reason I’ve started digging into using R for text mining, and these skills have paid off in not only fun blog posts about Taylor Swift, Lorde, and “Hotel California”, but also in analyzing data for my job (blog post about that to be posted soon). So today, I thought I’d keep moving forward to other tools you can use in text analysis: term frequency and inverse document frequency.

      These tools are useful when you have multiple documents you’re analyzing, such as interview text from different people or books by the same author. For my demonstration today, I’ll be using (what else?) song lyrics, this time from Florence + the Machine (one of my all-time favorites), who just dropped a new album, High as Hope. So let’s get started by pulling in those lyrics.

      library(geniusR)

      high_as_hope <- genius_album(artist = "Florence the Machine", album = "High as Hope")
      ## Joining, by = c("track_title", "track_n", "track_url")
      library(tidyverse)
      library(tidytext)
      tidy_hope <- high_as_hope %>%
      unnest_tokens(word,lyric) %>%
      anti_join(stop_words)
      ## Joining, by = "word"
      head(tidy_hope)
      ## # A tibble: 6 x 4
      ## track_title track_n line word
      ##
      ## 1 June 1 1 started
      ## 2 June 1 1 crack
      ## 3 June 1 2 woke
      ## 4 June 1 2 chicago
      ## 5 June 1 2 sky
      ## 6 June 1 2 black

      Now we have a tidy dataset with stop words removed. Before we go any farther, let’s talk about the tools we’re going to apply. Often, when we analyze text, we want to try to discover what different documents are about – what are their topics or themes? One way to do that is to look at common words used in a document, which can tell us something about the document’s theme. An overall measure of how often a term comes up in a particular document is term frequency (TF).

      Removing stop words is an important step before looking at TF, because otherwise, the highest-frequency words wouldn’t be very meaningful – they’d be words that fill every sentence, like “the” or “a.” But there still might be many common words that don’t get weeded out by our stop words anti-join. And it’s often the less frequently used words that tell us something about the meaning of a document. This is where inverse document frequency (IDF) comes in; it takes into account how common a word is across a set of documents, giving higher weight to words that are infrequent across the set and lower weight to common words. This means that a word used a great deal in one song but rarely or never in the other songs will have a higher IDF.

      We can use these two values at the same time, by multiplying them together to form TF-IDF, which tells us the frequency of a term in a document adjusted for how common it is across the set of documents. And thanks to the tidytext package, these values can be automatically calculated for us with the bind_tf_idf function. First, we need to reformat our data a bit, by counting use of each word by song. We do this by referencing the track_title variable in our count function, which tells R to group by this variable, followed by what we want R to count (the variable called word).

      song_words <- tidy_hope %>%
      count(track_title, word, sort = TRUE) %>%
      ungroup()

      The bind_tf_idf function needs 3 arguments: word (or whatever we called the variable containing our words), the document indicator (in this case, track_title), and the word counts by document (n).

      song_words <- song_words %>%
      bind_tf_idf(word, track_title, n) %>%
      arrange(desc(tf_idf))

      head(song_words)
      ## # A tibble: 6 x 6
      ## track_title word n tf idf tf_idf
      ##
      ## 1 Hunger hunger 25 0.236 2.30 0.543
      ## 2 Grace grace 16 0.216 2.30 0.498
      ## 3 The End of Love wash 18 0.209 2.30 0.482
      ## 4 Hunger ooh 20 0.189 2.30 0.434
      ## 5 Patricia wonderful 10 0.125 2.30 0.288
      ## 6 100 Years hundred 12 0.106 2.30 0.245
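      As a quick sanity check of the arithmetic (my own sketch, assuming the natural-log idf that tidytext uses): “hunger” appears 25 times among roughly 106 non-stop words in the track “Hunger”, and in only 1 of the album’s 10 tracks, so

      tf  <- 25 / 106     # ~ 0.236
      idf <- log(10 / 1)  # ~ 2.30
      tf * idf            # ~ 0.543, matching the tf_idf column above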

      Some of the results are unsurprising – “hunger” is far more common in the track called “Hunger” than any other track, “grace” is more common in “Grace”, and “hundred” is more common in “100 Years”. But let’s explore the different words by plotting the highest tf-idf for each track. To keep the plot from getting ridiculously large, I’ll just ask for the top 5 for each of the 10 tracks.

      song_words %>%
      mutate(word = factor(word, levels = rev(unique(word)))) %>%
      group_by(track_title) %>%
      top_n(5) %>%
      ungroup() %>%
      ggplot(aes(word, tf_idf, fill = track_title)) +
      geom_col(show.legend = FALSE) +
      labs(x = NULL, y = "tf-idf") +
      facet_wrap(~track_title, ncol = 2, scales = "free") +
      coord_flip()
      ## Selecting by tf_idf

      Some tracks have more than 5 words listed, because of ties, but this plot helps us to look for commonalities and differences across the tracks. There is a strong religious theme across many of the tracks, with concepts like “pray”, “god”, “grace”, and “angel” coming up in many tracks. The song “Patricia” uses many positively-valenced words like “wonderful” and “believer”. “No Choir” references music-themed words. And “Sky Full of Song” references things that fly (like “arrow”) and things in the sky (like “thunder”).

      What does Florence Welch have to say about the meaning behind this album?

      There is loneliness in this record, and there’s issues, and pain, and things that I struggled with, but the overriding feeling is that I have hope about them, and that’s what kinda brought me to this title; I was gonna call it The End of Love, which I actually saw as a positive thing cause it was the end of a needy kind of love, it was the end of a love that comes from a place of lack, it’s about a love that’s bigger and broader, that takes so much explaining. It could sound a bit negative but I didn’t really think of it that way.

      She’s also mentioned that High as Hope is the first album she made sober, so her past struggles with addiction are certainly also a theme of the album. And many reviews of the album (like this one) talk about the meaning and stories behind the music. This information can provide some context to the TF-IDF results.


      To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.

      But can ravens forecast?

      Sun, 07/29/2018 - 14:10

      (This article was first published on R – thinkr, and kindly contributed to R-bloggers)

      Why forecast sales?

      Humans have the magical ability to plan for future events, for future gain. It’s not quite a uniquely human trait. Because apparently ravens can match a 4-year-old.

      An abundance of data, and some very nice R packages, make our ability to plan all the more powerful.

      A couple of months ago we looked at sales from an historical perspective in Digital Marketplace. Six months later. In this post, we’ll use the sales data to March 31st to model a time-series forecast for the next two years. The techniques apply to any time series with characteristics of trend, seasonality or longer-term cycles.

      Why forecast sales? Business plans require a budget, e.g. for resources, marketing and office space. A good projection of revenue provides the foundation for the budget. And, for an established business, with historical data, time-series forecasting is one way to deliver a robust projection.

      The forecast assumes one continues to do what one’s doing. So, it provides a good starting-point. Then one might, for example, add assumptions about new products or services.

      The power of iteration

      Businesses typically deal with many product / service lines. So the ability to iteratively forecast multiple time series is very powerful.

      We’ll deal first with each government framework: G-Cloud (cloud services), and DOS (Digital Outcomes & Specialists). Then we’ll iterate through G-Cloud’s lot structure: Cloud Hosting, Cloud Software and Cloud Support. Only three child levels, but the principle is easily scaled up.

      Cleaning data

      G-Cloud suppliers are contractually-obliged under the government’s framework to report monthly on their buyer invoicing. So, if some months were missed, then there would be a one-time catch-up in a later month. This could result in the odd outlier in the Digital Marketplace sales data. However, as revealed in the more detailed analysis (with code), none was discovered.

      Seasonal decomposition

      By decomposing the historical data we can tease out the underlying trend and seasonality:

        • Trend:  G-Cloud sales have grown over time as more suppliers have added their services to the government frameworks. And more Public Sector organizations have found the benefits of purchasing Cloud services this way. Why? Because it’s a faster, simpler, more transparent and competitive contracting vehicle.
        • Seasonality:  Suppliers often manage their sales and financials based on a quarterly cycle. There’s a particular emphasis on a strong close to the financial year (often December 31st for commercial enterprises). And government buyers may want to make optimal use of their budgets at the close of their financial year (March 31st). Consequently, we see quarterly seasonality with an extra spike in March, and a secondary peak in December.
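      For readers who want to try that kind of decomposition themselves, here is a minimal sketch using the forecast package (assuming a monthly ts object of sales called sales_ts, which is not defined in this post):

      library(forecast)
      # split the monthly series into trend, seasonal and remainder components
      decomp <- mstl(sales_ts)
      autoplot(decomp)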

      Forecasting sales for each framework

      Using AutoRegressive Integrated Moving Average (ARIMA) modelling, we can select from close to 100 models to describe the autocorrelations in the data. Then we can use the generated model to forecast future sales.

      In the plot below, we project two years ahead with 80% and 95% prediction intervals. This means the darker-shaded 80% range should include the future sales value with an 80% probability. Likewise with a 95% probability when adding the wider and lighter-shaded area.
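      A minimal sketch of that workflow (again assuming a monthly ts called sales_ts; the author's full code is linked at the end of the post):

      library(forecast)
      fit <- auto.arima(sales_ts)          # select an ARIMA model automatically
      fc  <- forecast(fit, h = 24,         # two years of monthly projections
                      level = c(80, 95))   # 80% and 95% prediction intervals
      autoplot(fc)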

      The DOS framework (for project-related services) was launched more recently in June 2016. It exhibits different time-series characteristics. Hence a different ARIMA model.

      Forecasting sales for the component lots

      The G-Cloud framework comprises three lots. There are different ways of forecasting multiple time series. We will do so in one shot, with the best model tailored to each lot. The possible approaches, and code used, are detailed here.
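      One way to do that in one shot (a sketch only; the linked write-up has the author's actual approach) is to hold the lots in a named list of monthly ts objects and map the same fit-and-forecast step over them with purrr:

      library(forecast)
      library(purrr)

      # lot_series is assumed to be a named list of monthly ts objects,
      # one per lot: Cloud Hosting, Cloud Software, Cloud Support
      lot_forecasts <- map(lot_series, function(y) {
        forecast(auto.arima(y), h = 24, level = c(80, 95))
      })

      autoplot(lot_forecasts[["Cloud Hosting"]])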

      Ravens aren’t yet ready for forecasting with R. But then neither are 4-year-olds, are they?

      R toolkit

      R packages and functions (excluding base) used in this analysis.

        Packages     Functions
        purrr        map[6]; map2_df[1]; possibly[1]; set_names[1]; simplify[1]; some[1]; when[1]
        readr        guess_encoding[3]; locale[2]; read_csv[2]; parse_number[1]
        dplyr        mutate[12]; if_else[8]; filter[6]; group_by[6]; desc[3]; first[3]; select[3]; summarise[3]; summarize[2]; arrange[1]; as_tibble[1]; between[1]; bind_rows[1]; collapse[1]; data_frame[1]; n[1]
        tibble       str_c[7]; fixed[2]; str_remove[2]; str_count[1]; str_detect[1]; str_extract[1]; str_replace[1]
        stringr      str_c[7]; fixed[2]; str_remove[2]; str_count[1]; str_detect[1]; str_extract[1]; str_replace[1]
        rebus        or[4]; alpha[1]; literal[1]; whole_word[1]
        lubridate    month[11]; year[4]; ceiling_date[1]; date[1]; days_in_month[1]; myd[1]; parse_date_time[1]; tz[1]; ymd[1]
        timetk       tk_ts[3]
        sweep        str_c[7]; fixed[2]; str_remove[2]; str_count[1]; str_detect[1]; str_extract[1]; str_replace[1]
        tidyr        str_c[7]; fixed[2]; str_remove[2]; str_count[1]; str_detect[1]; str_extract[1]; str_replace[1]
        forecast     forecast[14]; auto.arima[8]; autoplot[7]; BoxCox[4]; mstl[2]; ndiffs[2]; nsdiffs[2]; tsclean[2]; BoxCox.lambda[1]; seasonal[1]
        ggplot2      autoplot[7]; xlab[6]; ylab[6]; aes[5]; theme[5]; labs[4]; element_rect[3]; geom_ribbon[2]; unit[2]; alpha[1]; element_line[1]; element_text[1]; facet_wrap[1]; geom_line[1]; geom_path[1]; geom_text[1]; ggplot[1]; margin[1]; scale_x_date[1]
        scales       or[4]; alpha[1]; literal[1]; whole_word[1]
        cowplot      draw_label[1]; plot_grid[1]
        ggthemes     theme_economist[1]
        kableExtra   kable[3]; kable_styling[2]
        knitr        kable[3]; opts_chunk[1]

      View the code here.

      Citations

      R Development Core Team (2008). R: A language and environment for
      statistical computing. R Foundation for Statistical Computing,
      Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

      Hyndman R, Athanasopoulos G, Bergmeir C, Caceres G, Chhay L, O’Hara-Wild M, Petropoulos F, Razbash S, Wang E, Yasmeen F (2018). forecast: Forecasting functions for time series and linear models. R package version 8.4, http://pkg.robjhyndman.com/forecast.

      Hyndman RJ, Khandakar Y (2008). “Automatic time series forecasting: the forecast package for R.” Journal of Statistical Software, 26(3), 1–22. http://www.jstatsoft.org/article/view/v027i03.

      Contains public sector information licensed under the Open Government Licence v3.0.

      The post But can ravens forecast? appeared first on thinkr.


      To leave a comment for the author, please follow the link and comment on their blog: R – thinkr.

      June 2018: Top 40 New Packages

      Sun, 07/29/2018 - 02:00

      (This article was first published on R Views, and kindly contributed to R-bloggers)

      Approximately 144 new packages stuck to CRAN in June. The fact that 31 of these are specialized to particular scientific disciplines or analyses provides some evidence for my hypothesis that working scientists are actively adopting R. Below are my Top 40 picks for June, organized into the categories of Computational Methods, Data, Data Science, Economics, Science, Statistics, Time Series, Utilities and Visualizations. The Data packages, especially rtrek and opensensmapr, look like they have some interesting new data to explore.

      Computational Methods

      nnTensor v0.99.1: Provides methods for non-negative matrix factorization and decomposition. See Cichocki et al. (2009) for details.

      RcppEigenAD v1.0.0: Provides functions to compile C++ code using Rcpp, Eigen, and CppAD to produce first- and second-order partial derivatives, and also provides an implementation of Faà di Bruno's formula to combine the partial derivatives of composed functions. See Hardy (2006).

      rcrane v1.0: Provides optimization algorithms to estimate coefficients in models such as linear regression and neural networks. Includes batch gradient descent, stochastic gradient descent, minibatch gradient descent, and coordinate descent. See Kiwiel (2001) (doi:10.1007/PL00011414), Yu Nesterov (2004), Ferguson (1982), Zeiler (2012), and Wright (2015). The vignette introduces the package.

      Data

      bjscrapeR v0.1.0: Scrapes crime data from the National Crime Victimization Survey, which tracks personal and household crime in the USA.

      genesysr v0.9.1: Implements an API to access data on plant genetic resources from genebanks around the world published on Genesys. The vignette offers a short tutorial.

      opensensmapr v0.4.1: Allows users to download real-time environmental measurements and sensor station metadata from the OpenSenseMap API. There are vignettes for Visualization, Exploration, and Caching Data for Reproducibility.

      readabs v0.2.1: Provides functions to read Excel files from the Australian Bureau of Statistics into Tidy Data Sets. See the vignette.

      rppo v1.0: Implements an interface to the Global Plant Phenology Data Portal. See the vignette.

      rtrek v0.1.0: Provides datasets related to the Star Trek fictional universe, functions for working with the data, and access to real-world datasets based on the televised series and other related licensed media productions. It interfaces with Wikipedia, the Star Trek API (STAPI), Memory Alpha, and Memory Beta to retrieve data, metadata, and other information relating to Star Trek. See the README for usage information.

      skynet v1.2.2: Implements methods for generating air transport statistics based on publicly available data from the U.S. Bureau of Transport Statistics (BTS). See the vignette.

      Data Science

      AdaSampling v1.1: Implements the adaptive sampling procedure, a framework for both positive unlabeled learning and learning with class label noise. See Yang et al. (2018) and the vignette.

      AROC v1.0: Provides functions to estimate the covariate-adjusted Receiver Operating Characteristic (AROC) curve and pooled (unadjusted) ROC curve. See de Carvalho and Rodriguez-Alvarez (2018).

      cloudml v0.5.1: Provides an interface to the Google Cloud Machine Learning Platform. There is a Getting Started Guide and vignettes on Deploying Models, Cloud storage, Training, and Hyperparameter Tuning.

      reclin v0.1.0: Provides functions to assist in performing probabilistic record linkage and deduplication: generating pairs, comparing records, an EM algorithm for estimating m- and u-probabilities, and forcing one-to-one matching. There is an Introduction and a vignette on Duplication.

      vip v0.1.0: Provides a general framework for constructing variable importance plots from various types of machine learning models, based on a novel approach using partial dependence plots and individual conditional expectation curves as described in Greenwell et al. (2018). See the README for details and examples.

      wevid v0.4.2: Provides functions to quantify the performance of a binary classifier through weight of evidence. These can be used with any test dataset on which you have observed case-control status, and have computed prior and posterior probabilities of case status using a model learned on a training dataset. Look at this website for details and examples.

      Economics

      trade v0.5.3: Provides tools for working with trade models, including the ability to calibrate different consumer-demand systems and simulate the effects of tariffs and quotas under different competitive regimes. The vignette provides details.

      Science

      linpk v1.0: Provides functions and a shiny application to generate concentration-time profiles from linear pharmacokinetic (PK) systems. Single or multiple doses may be specified. The vignette offers details and examples.

      ratematrix v1.0: Provides functions to estimate the evolutionary rate matrix (R) using Markov chain Monte Carlo (MCMC), as described in Caetano and Harmon (2017). There is a vignette on Setting a custom starting point and another on Using prior distributions.

      spectralAnalysis v3.12.0: Provides a toolkit for spectral analysis, enabling users to pre-process, visualize, and analyse process-analytical data from spectral measurements made during a chemical process.

      Statistics

      betaboost v1.0.1: Implements boosting beta regression for potentially high-dimensional data Mayr et al. (2018) using the same parametrization as betareg Cribari-Neto and Zeileis (2010). The underlying algorithms are implemented via the R add-on packages mboost Hofner et al. (2014) and gamboostLSS Mayr et al. (2012). The vignette offers examples.

      bfw v0.1.0: Provides a framework for conducting Bayesian analysis using Markov chain Monte Carlo with the JAGS sampler. There are vignettes on Fitting Latent Data, Fitting Observed Data, the Predict Metric, Plotting, and Regression.

      CaseBasedReasoning v0.1: Given a large set of problems and their individual solutions, case-based reasoning seeks to solve a new problem by referring to the solution of that problem that is “most similar” to the new problem. See Dippon et al. (2002), the vignette on Motivation, and examples of case-based reasoning with a Cox-Beta Model and a Random Forest Model.

      coxed v0.1.1: Provides functions for generating, simulating, and visualizing expected durations and marginal changes in duration from the Cox proportional hazards model. There is a vignette on using the coxed() function and another on simulating survival data.

      GLMMadaptive v0.2-0: Provides functions to fit generalized linear mixed models for a single grouping factor under maximum likelihood, approximating the integrals over the random effects with an adaptive Gaussian quadrature rule. See Pinheiro and Bates (1995) and the vignettes on Custom Models, GLMMadaptive Basics, Methods for MixMod Objects, and Zero-Inflated and Two-Part Mixed Effects Models.

      glmmfields v0.1.0: Implements generalized linear mixed models with robust random fields for spatiotemporal modeling. The vignette provides examples.

      kendallRandomWalks v0.9.3: Provides functions for simulating Kendall random walks, continuous-space Markov chains generated by the Kendall generalized convolution. See Jasiulis-Gołdyn (2014) for details and the vignettes Kendall Random Walks and Studying the Behavior of Kendall Random Walks.

      netSEM v0.5.0: Provides functions for structural equation modeling. There is an Introduction and vignettes on Backsheet Degradation, Backsheet Cracking, Current Voltage Features, and Modeling of the Weathering Driven Degradation of Poly(ethylene-terephthalate) Films.

      umap v0.1.0.3: Implements the uniform manifold approximation and projection technique for dimension reduction as described in McInnes and Healy (2018). The vignette shows how to use the package.

      vimp v1.0.0: Provides functions to calculate point estimates of, and valid confidence intervals for, non-parametric variable importance measures in high and low dimensions. For information about the methods, see Williamson et al. (2017). The vignette contains an introduction to the package.

      vsgoftest v0.3-2: Implements Vasicek and Song goodness-of-fit tests (based on Kullback-Leibler divergence) for a family of distributions that includes the uniform, Gaussian, log-normal, exponential, gamma, Weibull, Pareto, Fisher, Laplace, and beta distributions. See Lequesne and Regnault (2018) for details and the Tutorial.

      Time Series

      anomaly v1.0.0: Implements the CAPA (Collective And Point Anomaly) algorithm of Fisch, Eckley and Fearnhead (2018) for the detection of anomalies in time series data.

      exuber v0.1.0: Provides functions for testing and dating periods of explosive dynamics (exuberance) in time series using recursive unit root tests as proposed by Phillips et al. (2015), and for simulating a variety of periodically-collapsing bubble models. The estimation and simulation utilize the matrix inversion lemma from the recursive least squares algorithm, which results in a significant speed improvement. See the README to get started.

      Utilities

      BiocManager v1.30.1: Implements a tool to install and update Bioconductor packages. The vignette shows how to use the package.

      IntervalSurgeon v1.0: Provides functions for manipulating integer-bounded intervals including finding overlaps, piling, and merging. The vignette shows how to use the package.

      pkgbuild v1.0.0: Provides functions used to build R packages. Locates compilers needed to build R packages on various platforms and ensures the PATH is configured appropriately.

      rqdatatable v0.1.2: Implements the rquery piped query algebra using data.table. There is a vignette on Grouped Sampling and a Logistic Example.

      ssh v0.2: Provides functions to connect to a remote server over SSH to transfer files via SCP, setup a secure tunnel, or run a command or script on the host while streaming stdout and stderr directly to the client. There is a vignette.

      Visualization

      mgcViz v0.1.1: An extension of the mgcv package, providing visual tools for Generalized Additive Models (GAMs) that exploit the additive structure of GAMs, scale to large data sets, and can be used in conjunction with a wide range of response distributions. See the vignette for examples.

      tiler v0.2.0: Provides functions to create geographic map tiles from geospatial map files or non-geographic map tiles from simple image files. The vignette provides an introduction.



      To leave a comment for the author, please follow the link and comment on their blog: R Views.
