R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Tips for A/B Testing with R

Wed, 11/22/2017 - 10:19

(This article was first published on INWT-Blog-RBloggers, and kindly contributed to R-bloggers)

Which layout of an advertisement leads to more clicks? Would a different color or position of the purchase button lead to a higher conversion rate? Does a special offer really attract more customers – and which of two phrasings would be better?

For a long time, people have trusted their gut feeling to answer these questions. Today all these questions can be answered by conducting an A/B test. For this purpose, visitors of a website are randomly assigned to one of two groups, between which the target metric (e.g., click-through rate or conversion rate) can then be compared. Due to this randomization, the groups do not systematically differ in any other relevant dimension. This means: if your target metric takes a significantly higher value in one group, you can be quite sure that it is because of your treatment and not because of any other variable.

In comparison to other methods, conducting an A/B test does not require extensive statistical knowledge. Nevertheless, some caveats have to be taken into account.

When making a statistical decision, there are two possible errors (see also table 1): A Type I error means that we observe a significant result although there is no real difference between our groups. A Type II error means that we do not observe a significant result although there is in fact a difference. The Type I error can be controlled and set to a fixed number in advance, e.g., at 5%, often denoted as α or the significance level. The Type II error in contrast cannot be controlled directly. It decreases with the sample size and the magnitude of the actual effect. When, for example, one of the designs performs way better than the other one, it’s more likely that the difference is actually detected by the test in comparison to a situation where there is only a small difference with respect to the target metric. Therefore, the required sample size can be computed in advance, given α and the minimum effect size you want to be able to detect (statistical power analysis). Knowing the average traffic on the website you can get a rough idea of the time you have to wait for the test to complete. Setting the rule for the end of the test in advance is often called “fixed-horizon testing”.
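Base R can do this power calculation directly with power.prop.test; the rates below mirror the simulation parameters used further down and are purely illustrative:

# Required sample size per group to detect a difference between a 5% and an 8%
# click-through rate with alpha = 0.05 and 80% power
power.prop.test(p1 = 0.05, p2 = 0.08, sig.level = 0.05, power = 0.8)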

Table 1: Overview of possible errors and correct decisions in statistical tests

                                         Effect really exists
                                         No                               Yes
Statistical test is significant:   No    True negative                    Type II error (false negative)
                                   Yes   Type I error (false positive)    True positive

Statistical tests generally provide the p-value which reflects the probability of obtaining the observed result (or an even more extreme one) just by chance, given that there is no effect. If the p-value is smaller than α, the result is denoted as “significant”.

When running an A/B test you may not always want to wait until the end but take a look from time to time to see how the test performs. What if you suddenly observe that your p-value has already fallen below your significance level – doesn’t that mean that the winner has already been identified and you could stop the test? Although this conclusion is very appealing, it can also be very wrong. The p-value fluctuates strongly during the experiment and even if the p-value at the end of the fixed-horizon is substantially larger than α, it can go below α at some point during the experiment. This is the reason why looking at your p-value several times is a little bit like cheating, because it makes your actual probability of a Type I error substantially larger than the α you chose in advance. This is called “α inflation”. At best you only change the color or position of a button although it does not have any impact. At worst, your company provides a special offer which causes costs but actually no gain. The more often you check your p-value during the data collection, the more likely you are to draw wrong conclusions. In short: As attractive as it may seem, don’t stop your A/B test early just because you are observing a significant result. In fact you can prove that if you increase your time horizon to infinity, you are guaranteed to get a significant p-value at some point in time.

The following code simulates some data and plots the course of the p-value during the test. (For the first samples, which are still very small, R returns a warning that the chi-square approximation may be incorrect.)

library(timeDate)
library(ggplot2)

# Choose parameters:
pA <- 0.05     # True click-through rate for group A
pB <- 0.08     # True click-through rate for group B
nA <- 500      # Number of cases for group A
nB <- 500      # Number of cases for group B
alpha <- 0.05  # Significance level

# Simulate data:
data <- data.frame(group = rep(c("A", "B"), c(nA, nB)),
                   timestamp = sample(seq(as.timeDate('2016-06-02'),
                                          as.timeDate('2016-06-09'), by = 1), nA + nB),
                   clickedTrue = as.factor(c(rbinom(n = nA, size = 1, prob = pA),
                                             rbinom(n = nB, size = 1, prob = pB))))
# Order data by timestamp (the timeDate column receives this name
# when it is coerced into the data frame)
data <- data[order(data$GMT.x..i..), ]
levels(data$clickedTrue) <- c("0", "1")

# Compute current p-values after every observation:
pValues <- c()
index <- c()
for (i in 50:dim(data)[1]){
  presentData <- table(data$group[1:i], data$clickedTrue[1:i])
  if (all(rowSums(presentData) > 0)){
    pValues <- c(pValues, prop.test(presentData)$p.value)
    index <- c(index, i)
  }
}
results <- data.frame(index = index, pValue = pValues)

# Plot the p-values:
ggplot(results, aes(x = index, y = pValue)) +
  geom_line() +
  geom_hline(aes(yintercept = alpha)) +
  scale_y_continuous(name = "p-value", limits = c(0, 1)) +
  scale_x_continuous(name = "Observed data points") +
  theme(text = element_text(size = 20))

The figure below shows an example with 500 observations and true rates of 5% in both groups, i.e., no actual difference. You can see that the p-value nevertheless crosses the threshold several times, but finally takes a very high value. By stopping this test early, it would have been very likely to draw a wrong conclusion.

Example for the course of a p-value for two groups with no actual difference

The following code shows how to test the difference between two rates in R, e.g., click-through rates or conversion rates. You can apply the code to your own data by replacing the URL to the example data with your file path. To test the difference between two proportions, you can use the function prop.test, which is equivalent to Pearson's chi-squared test; for small samples you should use Fisher's exact test instead. prop.test returns a p-value and a confidence interval for the difference between the two rates. The interpretation of a 95% confidence interval is as follows: if you conducted such an analysis many times, 95% of the resulting confidence intervals would contain the true difference. Afterwards you can also take a look at the fluctuations of the p-value during the test by using the code from above.

library(readr)

# Specify file path:
dataPath <- "https://www.inwt-statistics.de/files/INWT/downloads/exampleDataABtest.csv"

# Read data
data <- read_csv(file = dataPath)

# Inspect structure of the data
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 3 variables:
##  $ group      : chr "A" "A" "A" "B" ...
##  $ time       : POSIXct, format: "2016-06-02 02:17:53" "2016-06-02 03:03:54" ...
##  $ clickedTrue: int 0 0 1 0 0 0 0 0 0 0 ...

# Change the column names
names(data) <- c("group", "time", "clickedTrue")

# Change type of group to factor
data$group <- as.factor(data$group)

# Change type of click-through variable to factor
data$clickedTrue <- as.factor(data$clickedTrue)
levels(data$clickedTrue) <- c("0", "1")

# Compute frequencies and conduct test for proportions
# (frequency table with successes in the first column)
freqTable <- table(data$group, data$clickedTrue)[, c(2, 1)]

# Print frequency table
freqTable
##
##      1   0
##   A  20 480
##   B  40 460

# Conduct significance test
prop.test(freqTable, conf.level = .95)
##
##  2-sample test for equality of proportions with continuity correction
##
## data:  freqTable
## X-squared = 6.4007, df = 1, p-value = 0.01141
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.071334055 -0.008665945
## sample estimates:
## prop 1 prop 2
##   0.04   0.08
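As mentioned above, Fisher's exact test is preferable for small samples; here is a minimal sketch, reusing the freqTable object from the code above:

# Fisher's exact test on the same 2x2 frequency table
fisher.test(freqTable)
# As with prop.test, the output contains a p-value; for a 2x2 table it also
# reports an odds ratio with a confidence interval instead of a difference in rates.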

There are some more pitfalls, but most of them can easily be avoided. First, as a counterpart of stopping your test early because of a significant result, you could gather more data after the planned end of the test because the results have not yet become significant. This likewise leads to α inflation. A second, similar problem arises when running several tests at once: the probability of a false-positive result is α for each individual test, so the overall probability that at least one of the results is false-positive is much larger. Always keep in mind that some of the significant results may have been caused by chance. Third, you can also get into trouble when you reach the required sample size very fast and stop the test after only a few hours. You should always consider that the behavior of users in this specific time slot might not be representative of the general case. To avoid this, plan the duration of the test so that it covers at least 24 hours, or a whole week when customers behave differently at the weekend than on a typical work day. A fourth caveat concerns a rather moral issue: when users discover they are part of an experiment and suffer disadvantages as a result, they might rightly become angry. (This problem will probably not arise due to a different-colored button, but maybe because of different prices or special offers.)
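If you do run several tests at once, one simple countermeasure is to adjust the p-values for multiplicity with base R's p.adjust; the p-values below are made up for illustration:

# Hypothetical p-values from three simultaneous A/B tests
pvals <- c(0.011, 0.04, 0.20)
p.adjust(pvals, method = "bonferroni")  # controls the family-wise error rate
p.adjust(pvals, method = "holm")        # uniformly more powerful than Bonferroni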

If you are willing to invest some more time, you may want to learn about techniques to avoid α inflation when conducting multiple tests or stopping your test as soon as the p-value crosses a certain threshold. In addition, there are techniques to include previous knowledge in your computations with Bayesian approaches. The latter is especially useful when you have rather small samples, but previous knowledge about the values that your target metric usually takes.
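To illustrate the Bayesian idea, here is a small beta-binomial sketch that reuses the click counts from the example data above; the uniform Beta(1, 1) priors are an assumption and could be replaced by informative priors encoding previous knowledge:

set.seed(1)
# Clicks and sample sizes as in the example data above
clicksA <- 20; nA <- 500
clicksB <- 40; nB <- 500
# Posterior distributions of the click-through rates under Beta(1, 1) priors
postA <- rbeta(1e5, 1 + clicksA, 1 + nA - clicksA)
postB <- rbeta(1e5, 1 + clicksB, 1 + nB - clicksB)
# Posterior probability that variant B has the higher click-through rate
mean(postB > postA)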


To leave a comment for the author, please follow the link and comment on their blog: INWT-Blog-RBloggers.

October 2017 New Packages

Wed, 11/22/2017 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Of the 182 new packages that made it to CRAN in October, here are my picks for the “Top 40”. They are organized into eight categories: Engineering, Machine Learning, Numerical Methods, Science, Statistics, Time Series, Utilities and Visualizations. Engineering is a new category, and its appearance may be an early signal for the expansion of R into a new domain. The Science category is well-represented this month. I think this is the result of the continuing trend for working scientists to wrap their specialized analyses into R packages.

Engineering

FlowRegEnvCost 0.1.1: Calculates the daily environmental costs of river-flow regulation by dams based on García de Jalon et al. (2017).

rroad v0.0.4: Computes and visualizes the International Roughness Index (IRI) given a longitudinal road profile for a single road segment, or for a sequence of segments with a fixed length. For details on The International Road Roughness Experiment establishing a correlation and a calibration standard for measurements, see the World Bank technical paper. The vignette shows an example of a Road Condition Analysis. The following scaleogram was produced from a continuous wavelet transform of a 3D accelerometer signal.

Machine Learning

detrendr v0.1.0: Implements a method based on an algorithm by Nolan et al. (2017) for detrending images affected by bleaching. See the vignette.

MlBayesOpt v0.3.3: Provides a framework for using Bayesian optimization (see Shahriari et al.) to tune hyperparameters for support vector machine, random forest, and extreme gradient boosting models. The vignette shows how to set things up.

rerf v1.0: Implements the Randomer Forest (RerF) algorithm developed by Tomita (2016), which is similar to the Random Combination (Forest-RC) algorithm developed by Breiman (2001). Both algorithms form splits using linear combinations of coordinates.

Numerical Methods

episode v1.0.0: Provides statistical tools for inferring unknown parameters in continuous time processes governed by ordinary differential equations (ODE). See the Introduction.

KGode v1.0.1: Implements the kernel ridge regression and the gradient matching algorithm proposed in Niu et al. (2016), and the warping algorithm proposed in Niu et al. (2017) for improving parameter estimation in ODEs.

Science

adjclust v0.5.2: Implements a constrained version of hierarchical agglomerative clustering, in which each observation is associated with a position, and only adjacent clusters can be merged. The algorithm, which is time- and memory-efficient, is described in Alia Dehman (2015). There are vignettes on Clustering Hi-C Contact Maps, Implementation Notes, and Inferring Linkage Disequilibrium blocks from Genotypes.

hsdar v0.6.0: Provides functions for transforming reflectance spectra, calculating vegetation indices and red edge parameters, and spectral resampling for hyperspectral remote sensing and simulation. The Introduction offers several examples.

mapfuser v0.1.2: Constructs consensus genetic maps with LPmerge (See Endelman and Plomion (2014)) and models the relationship between physical distance and genetic distance using thin-plate regression splines (see Wood (2003)). The vignette explains how to use the package.

mortAAR v1.0.0: Provides functions for the analysis of archaeological mortality data. See Chamberlain (2006). There is a vignette on Lifetables and an Extended Discussion.

skyscapeR v0.2.2: Provides a tool set for data reduction, visualization and analysis in skyscape archaeology, archaeoastronomy and cultural astronomy. The vignette shows how to use the package.

Statistics

BayesRS v0.1.2: Fits hierarchical linear Bayesian models, samples from the posterior distributions of model parameters in JAGS, and computes Bayes factors for group parameters of interest with the Savage-Dickey density ratio (see Wetzels et al. (2009)). There is an Introduction.

CatPredi v1.1: Allows users to categorize a continuous predictor variable in a logistic or a Cox proportional hazards regression setting, by maximizing the discriminative ability of the model. See Barrio et al. (2015) and Barrio et al. (2017).

CovTools v0.2.1: Provides a collection of geometric and inferential tools for convenient analysis of covariance structures. For an introduction to covariance in multivariate statistical analysis, see Schervish (1987).

genlogis v0.5.0: Provides basic distribution functions for a generalized logistic distribution proposed by Rathie and Swamee (2006).

emmeans v0.9.1: Provides functions to obtain estimated marginal means (EMMs) for many linear, generalized linear, and mixed models, and computes contrasts or linear functions of EMMs, trends, and comparisons of slopes. There are twelve vignettes including The Basics, Comparisons and Contrasts, Confidence Intervals and Tests, Interaction Analysis, and Working with Messy Data.

ESTER v0.1.0: Provides an implementation of sequential testing that uses evidence ratios computed from the Akaike weights of a set of models. For details see Burnham & Anderson (2004). There is a vignette.

FarmTest v1.0.0: Provides functions to perform robust multiple testing for means in the presence of latent factors. It uses Huber’s loss function to estimate distribution parameters and accounts for strong dependence among coordinates via an approximate factor model. See Zhou et al.(2017) for details. There is a vignette to get you started.

miic v0.1: Implements an information-theoretic method which learns a large class of causal or non-causal graphical models from purely observational data, while including the effects of unobserved latent variables, commonly found in many datasets. For more information see Verny et al. (2017).

modcmfitr v0.1.0: Fits a modified version of the Connor-Mosimann distribution ( Connor & Mosimann (1969)), a Connor-Mosimann distribution, or a Dirichlet distribution to elicited quantiles of a multinomial distribution. See the vignette for details.

pense v1.0.8: Provides a robust penalized elastic net S and MM estimator for linear regression as described in Freue et al. (2017).

paramtest v0.1.0: Enables running simulations or other functions while easily varying parameters from one iteration to the next. The vignette shows how to run a power simulation.

rENA v0.1.0: Implements functions to perform epistemic network analysis (ENA), a novel method for identifying and quantifying connections among elements in coded data, and representing them in dynamic network models, which illustrate the structure of connections and measure the strength of association among elements in a network.

rma.exact v0.1.0: Provides functions to compute an exact CI for the population mean under a random-effects model. For details, see Michael, Thornton, Xie, and Tian (2017).

Time Series

carfima v1.0.1: Provides a toolbox to fit a continuous-time, fractionally integrated ARMA process (CARFIMA) to univariate and irregularly spaced time-series data, using a general-order CARFIMA(p, H, q) model for p>q as specified in Tsai and Chan (2005).

colorednoise v0.0.1: Provides tools for simulating populations with white noise (no temporal autocorrelation), red noise (positive temporal autocorrelation), and blue noise (negative temporal autocorrelation) based on work by Ruokolainen et al. (2009). The vignette describes colored noise.

nnfor v0.9: Provides functions to facilitate automatic time-series modelling with neural networks. Look here for some help getting started.

Utilities

hdf5r v1.0.0: Provides an object-oriented wrapper for the HDF5 API using R6 classes. The vignette shows how to use the package.

geoops v0.1.2: Provides tools for working with the GeoJSON geospatial data interchange format. There is an Introduction.

linl v0.0.2: Adds a LaTeX Letter class to rmarkdown, using the pandoc-letter template adapted for use with markdown. See the vignette and README for details.

oshka v0.1.2: Expands quoted language by recursively replacing any symbol that points to quoted language with the language itself. There is an Introduction and a vignette on Non Standard Evaluation Functions.

rcreds v0.6.6: Provides functions to read and write credentials to and from an encrypted file. The vignette describes how to use the package.

RMariaDB v1.0-2: Implements a DBI-compliant interface to MariaDB and MySQL databases.

securitytxt v0.1.0: Provides tools to identify and parse security.txt files to enable the analysis and adoption of the Web Security Policies draft standard.

usethis v1.1.0: Automates package and project setup tasks, including setting up unit testing, test coverage, continuous integration, Git, GitHub, licenses, Rcpp, RStudio projects, and more that would otherwise be performed manually. README provides examples.

xltabr v0.1.1: Provides functions to produce nicely formatted cross tabulations in Excel using openxlsx (https://cran.r-project.org/package=openxlsx); it was developed to help automate the process of publishing Official Statistics. Look here for documentation.

Visualizations

iheatmapr v0.4.2: Provides a system for making complex, interactive heatmaps. Look at the webpage for examples.

otvPlots v0.2.0: Provides functions to automate the visualization of variable distributions over time, and compute time-aggregated summary statistics for large datasets. See the README for an introduction.



To leave a comment for the author, please follow the link and comment on their blog: R Views.

Le Monde puzzle [#1029]

Wed, 11/22/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A convoluted counting Le Monde mathematical puzzle:

A film theatre has a waiting room and several projection rooms, with four films on display. A first set of 600 spectators enters the waiting room and votes for their favourite film. The most popular film is projected to the spectators who voted for it, while the remaining spectators stay in the waiting room. They are joined by a new set of 600 spectators, who also vote for their favourite film. The selected film (which may be the same as the first one) is then shown to those who voted for it, and the remaining spectators stay in the waiting room. This pattern is repeated for a total number of 10 votes, after which the remaining spectators leave. What are the maximal possible numbers of waiting spectators and of spectators in a projection room?

A first attempt by random sampling does not produce extreme enough events to reach those maxima:

wm=rm=600 #waiting and watching
for (v in 1:V){
  film=rep(0,4) #votes on each film
  for (t in 1:9){
    film=film+rmultinom(1,600,rep(1,4))
    rm=max(rm,max(film))
    film[order(film)[4]]=0
    wm=max(wm,sum(film)+600)}
  rm=max(rm,max(film)+600)}

where the last line adds the last batch of arriving spectators to the largest group of waiting ones. This code only returns 1605 for the maximal number of waiting spectators. And 1155 for the maximal number in a projection room.  Compared with the even separation of the first 600 into four groups of 150… I thus looked for an alternative deterministic allocation:

wm=rm=0
film=rep(0,4)
for (t in 1:9){
  size=sum(film)+600
  film=c(rep(ceiling(size/4),3),size-3*ceiling(size/4))
  film[order(film)[4]]=0
  rm=max(rm,max(film)+600)
  wm=max(wm,sum(film)+600)}

which tries to preserve as many waiting spectators as possible for the last round (and always considers the scenario of all newcomers backing the largest waiting group for the next film). The outcome of this sequence moves up to 1155 for the largest projection audience and 2264 for the largest waiting group. I however wonder if splitting into two groups in the final round(s) could even increase the size of the last projection. And indeed halving the last batch into two groups leads to 1709 spectators in the final projection. With uncertainties about the validity of the split towards ancient spectators keeping their vote fixed! (I did not think long enough about this puzzle to turn it into a more mathematical problem…)

While in Warwick, I reconsidered the problem from a dynamic programming perspective, always keeping the notion that it was optimal to allocate the votes evenly between some of the films (from 1 to 4). Using the recursive R code

optiz=function(votz,t){
  if (t==9){
    return(sort(votz)[3]+600)
  }else{
    goal=optiz(sort(votz)+c(0,0,600,-max(votz)),t+1)
    goal=rep(goal,4)
    for (i in 2:4){
      film=sort(votz);film[4]=0;film=sort(film)
      size=sum(film[(4-i+1):4])+600
      film[(4-i+1):4]=ceiling(size/i)
      while (sum(film[(4-i+1):4])>size) film[4]=film[4]-1
      goal[i]=optiz(sort(film),t+1)}
    return(max(goal))}}

led to a maximal audience size of 1619. [Which is also the answer provided by Le Monde]

Filed under: Books, Kids, R Tagged: combinatorics, Le Monde, R


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

4-Day Workshop in Toronto: Learn R, tidyverse, and Shiny with Dean Attali and Kirill Muller

Tue, 11/21/2017 - 21:00

(This article was first published on Dean Attali's R Blog, and kindly contributed to R-bloggers)

Kirill Muller and I are excited to announce a 4-day R workshop in Toronto on January 23-26, 2018!

To register or for more information click here

Readers of this blog can receive a 10% discount by using promo code TIDYSHINYTEN (limited-time offer).
If you know someone who might benefit from the workshop, you can receive 10% of the ticket price using this affiliate link (we will pay you using PayPal).

The course will take place in the Hilton Hotel in Toronto, Canada over 4 full days. It is intended for beginners and experienced R users alike. For any questions, feel free to contact Dean at dean@attalitech.com


To leave a comment for the author, please follow the link and comment on their blog: Dean Attali's R Blog.

xray: The R Package to Have X Ray Vision on your Datasets

Tue, 11/21/2017 - 19:00

(This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers)


This package lets you analyze the variables of a dataset to evaluate the shape of your data. Consider it the first step once you have your data ready for modeling: you can use the package to analyze all variables and check whether anything looks odd enough to be worth transforming, or whether a variable should be avoided altogether.

Installation

You can install xray from github with:

# install.packages("devtools")
devtools::install_github("sicarul/xray")

Usage

Anomaly detection

xray::anomalies analyzes all your columns for anomalies, whether they are NAs, zeroes, infinite values, etc., and warns you if it detects variables with at least 80% of rows affected by those anomalies. It also warns you when all rows have the same value.

Example:

data(longley)
badLongley=longley
badLongley$GNP=NA
xray::anomalies(badLongley)

#> Warning in xray::anomalies(badLongley): Found 1 possible problematic variables:
#> GNP
#> $variables
#>       Variable  q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
#> 1          GNP 16  16   1     0     0      0      0    0    0         1
#> 2 GNP.deflator 16   0   0     0     0      0      0    0    0        16
#> 3   Unemployed 16   0   0     0     0      0      0    0    0        16
#> 4 Armed.Forces 16   0   0     0     0      0      0    0    0        16
#> 5   Population 16   0   0     0     0      0      0    0    0        16
#> 6         Year 16   0   0     0     0      0      0    0    0        16
#> 7     Employed 16   0   0     0     0      0      0    0    0        16
#>      type anomalous_percent
#> 1 Logical                 1
#> 2 Numeric                 0
#> 3 Numeric                 0
#> 4 Numeric                 0
#> 5 Numeric                 0
#> 6 Integer                 0
#> 7 Numeric                 0
#>
#> $problem_variables
#>   Variable  q qNA pNA qZero pZero qBlank pBlank qInf pInf qDistinct
#> 1      GNP 16  16   1     0     0      0      0    0    0         1
#>      type anomalous_percent
#> 1 Logical                 1
#>                                                              problems
#> 1 Anomalies present in 100% of the rows. Less than 2 distinct values.

Distributions

xray::distributions tries to analyze the distribution of your variables, so you can understand how each variable is statistically structured. It also returns a percentiles table of numeric variables as a result, which can inform you of the shape of the data.

distrLongley=longley
distrLongley$testCategorical=c(rep('One',7), rep('Two', 9))
xray::distributions(distrLongley)


#>       Variable      p_1     p_10     p_25     p_50     p_75    p_90      p_99
#> 1 GNP.deflator    83.78    88.35   94.525    100.6   111.25  114.95    116.72
#> 2          GNP 237.8537   258.74  317.881  381.427 454.0855 510.387  549.3859
#> 3   Unemployed   187.93   201.55  234.825   314.35   384.25   434.4   478.725
#> 4 Armed.Forces   147.61    160.3    229.8   271.75  306.075  344.85   358.695
#> 5   Population 107.7616 109.2025 111.7885 116.8035  122.304  126.61  129.7466
#> 6         Year  1947.15   1948.5  1950.75   1954.5  1958.25  1960.5   1961.85
#> 7     Employed  60.1938  60.7225  62.7125   65.504  68.2905 69.4475    70.403

Distributions along a time axis

xray::timebased also investigates your distributions, but shows the change over time; if a distribution changes over time (for example, a variable stops or starts being collected), you can easily visualize it.

dateLongley=longley
dateLongley$Year=as.Date(paste0(dateLongley$Year,'-01-01'))
dateLongley$Data='Original'
ndateLongley=dateLongley
ndateLongley$GNP=dateLongley$GNP+10
ndateLongley$Data='Offseted'
xray::timebased(rbind(dateLongley, ndateLongley), 'Year')



#> [1] "7 charts have been generated."

If you are just starting to learn about data science or want to explore further, I encourage you to check out this cool book made by my buddy Pablo Casas: The Data Science Live Book


To leave a comment for the author, please follow the link and comment on their blog: R - Data Science Heroes Blog.

βCEA

Tue, 11/21/2017 - 17:32

(This article was first published on Gianluca Baio's blog, and kindly contributed to R-bloggers)

Recently, I’ve been doing a lot of work on the beta version of BCEA (I was after all born in Agrigento $-$ in the picture to the left $-$, which is a Greek city, so a beta version sounds about right…). 

The new version is only available as a beta-release from our GitHub repository – usual ways to install it are through the devtools package.

There aren’t very many changes from the current CRAN version, although the one thing I did change is kind of big. In fact, I’ve embedded the web-app functionalities within the package. So, it is now possible to launch the web-app from the current R session using the new function BCEAweb. This takes as arguments three inputs: a matrix e containing $S$ simulations for the measures of effectiveness computed for the $T$ interventions; a matrix c containing the simulations for the measures of costs; and a data frame or matrix containing simulations for the model parameters. 

In fact, none of the inputs is required and the user can actually launch an empty web-app, in which the inputs can be uploaded, say, from a spreadsheet (there are in fact other formats available).
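For concreteness, here is a minimal sketch of the calls described above; e, c and params are placeholders for your own simulation objects, and the install step is only indicated loosely:

library(BCEA)  # beta release installed from the GitHub repository mentioned above

# e: matrix of S effectiveness simulations for the T interventions,
# c: matrix of cost simulations, params: data frame (or matrix) of
# parameter simulations (all three are placeholders here)
BCEAweb(e, c, params)

# None of the inputs is required, so an empty web-app can also be launched
# and the inputs uploaded later, e.g. from a spreadsheet:
BCEAweb()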

I think the web-app facility is not necessary when you’ve gone through the trouble of actually installing the R package and you’re obviously using it from R. But it’s helpful, nonetheless, for example in terms of producing some standard output (perhaps even more than the actual package $-$ which I think is more flexible) and of reporting, with the cool facility based on pandoc.

This means there are a few more packages "suggested" on installation and potentially a longer compilation time for the package $-$ but nothing major. The new version is under testing but I may be able to release it on CRAN soon-ish… And there are other cool things we're playing around with (the links here give all the details!).


To leave a comment for the author, please follow the link and comment on their blog: Gianluca Baio's blog.

Scale up your parallel R workloads with containers and doAzureParallel

Tue, 11/21/2017 - 17:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by JS Tan (Program Manager, Microsoft)

The R language is by far the most popular statistical language, and has seen massive adoption in both academia and industry. In our new data-centric economy, the models and algorithms that data scientists build in R are not just being used for research and experimentation. They are now also being deployed into production environments, and directly into products themselves.

However, taking your workload in R and deploying it at production capacity, and at scale, is no trivial matter.  Because of R's rich and robust package ecosystem, and the many versions of R, reproducing the environment of your local machine in a production setting can be challenging. Let alone ensuring your model's reproducibility!

This is why using containers is extremely important when it comes to operationalizing your R workloads. I'm happy to announce that the doAzureParallel package, powered by Azure Batch, now supports fully containerized deployments. With this migration, doAzureParallel will not only help you scale out your workloads, but will also do it in a completely containerized fashion, letting you bypass the complexities of dealing with inconsistent environments. Now that doAzureParallel runs on containers, we can ensure a consistent immutable runtime while handling custom R versions, environments, and packages.

By default, the container used in doAzureParallel is the 'rocker/tidyverse:latest' container that is developed and maintained as part of the rocker project. For most cases, and especially for beginners, this image will contain most of what is needed. However, as users become more experienced or have more complex deployment requirements, they may want to change the Docker image that is used, or even build their own. doAzureParallel supports both those options, giving you flexibility (without any compromise on reliability). Configuring the Docker image is easy. Once you know which Docker image you want to use, you can simply specify its location in the cluster configuration and doAzureParallel will just know to use it when provisioning subsequent clusters. More details on configuring your Docker container settings with doAzureParallel are included in the documentation. 
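For orientation, here is a minimal sketch of what a containerized doAzureParallel run typically looks like; the JSON file names are placeholders, and the Docker image itself is specified in the cluster configuration as described above:

library(doAzureParallel)
library(foreach)

# Credentials and cluster settings live in JSON configuration files
# (placeholder file names); the Docker image to use is set in the cluster
# configuration, as discussed in the post.
setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)

# Any foreach loop now runs on the containerized Azure Batch pool
results <- foreach(i = 1:100, .combine = c) %dopar% {
  mean(rnorm(1e6))
}

stopCluster(cluster)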

With this release, we hope to unblock many users who are looking to take their R models and scale them up in the cloud. To get started with doAzureParallel, visit our Github page. Please give it a try and let us know if you have questions, feedback, or suggestions, either on Github or via email at razurebatch@microsoft.com.

Github (Azure): doAzureParallel

 


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

dplyr and the design effect in survey samples

Tue, 11/21/2017 - 13:59

(This article was first published on Data Literacy - The blog of Andrés Gutiérrez, and kindly contributed to R-bloggers)

Blogdown entry here.

For those guys like me who are not such R geeks, this trick could be of interest. The package dplyr can be very useful when it comes to data manipulation, and you can extract valuable information from a data frame. For example, if you want to count how many humans have a particular hair color, you can run the following piece of code:

library(dplyr)

starwars %>% filter(species == "Human") %>%
group_by(hair_color) %>%
summarise(n = n())

   hair_color        n
   auburn            1
   auburn, grey      1
   auburn, white     1
   black             8
   blond             3
   brown            14
   brown, grey       1
   grey              1
   none              3
   white             2

As a result, the former query gives you a data frame, which you can use in another query. For example, if you want to know the average number of individuals per hair color, you can use summarise twice:

library(dplyr)

starwars %>% filter(species == "Human") %>%
group_by(hair_color) %>%
summarise(n = n()) %>%
summarise(x.b = mean(n))

   x.b
   3.5

Now, turning our attention to statistics, it is known that, when dealing with sample surveys, one measure of interest is the design effect defined as

$Deff \approx 1 + (\bar{m} - 1)\rho$

where $\bar{m}$ is the average cluster size and $\rho$ is the intraclass correlation coefficient. If you are dealing with survey data and you want to figure out the value of $\bar{m}$ and $\rho$, you can use dplyr. Let’s use the Lucy data of the samplesize4surveys package to show how you can do it.

library(samplesize4surveys)
data(Lucy)

m <- Lucy %>% group_by(Zone) %>%
summarise(n = n()) %>%
summarise(m = mean(n))

rho <- ICC(y = Lucy$Taxes, cl = Lucy$Zone)$ICC

DEFF <- 1 + (as.integer(m) - 1) * rho
DEFF

To leave a comment for the author, please follow the link and comment on their blog: Data Literacy - The blog of Andrés Gutiérrez.

Practical Machine Learning with R and Python – Part 6

Tue, 11/21/2017 - 12:46

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

Introduction

This is the final and concluding part of my series on ‘Practical Machine Learning with R and Python’. In this series I included the implementations of the most common Machine Learning algorithms in R and Python. The algorithms implemented were

1. Practical Machine Learning with R and Python – Part 1 In this initial post, I touch upon regression of a continuous target variable. Specifically I touch upon Univariate, Multivariate, Polynomial regression and KNN regression in both R and Python
2. Practical Machine Learning with R and Python – Part 2 In this post, I discuss Logistic Regression, KNN classification and Cross Validation error for both LOOCV and K-Fold in both R and Python
3. Practical Machine Learning with R and Python – Part 3 This 3rd part covered feature selection in Machine Learning. Specifically I touch upon best fit, forward fit, backward fit, ridge (L2 regularization) & lasso (L1 regularization). The post includes equivalent code in R and Python.
4. Practical Machine Learning with R and Python – Part 4 In this part I discussed SVMs, Decision Trees, Validation, Precision-Recall, AUC and ROC curves
5. Practical Machine Learning with R and Python – Part 5 In this penultimate part, I touch upon B-splines, natural splines, smoothing splines, Generalized Additive Models (GAMs), Decision Trees, Random Forests and Gradient Boosted Trees.

In this last part I cover Unsupervised Learning. Specifically I cover the implementations of Principal Component Analysis (PCA), K-Means and Hierarchical Clustering. You can download this R Markdown file from Github at MachineLearning-RandPython-Part6

1.1a Principal Component Analysis (PCA) – R code

Principal Component Analysis is used to reduce the dimensionality of the input. In the code below, 8 x 8 pixel images of handwritten digits are reduced to their principal components. A scatter plot of the first 2 principal components then gives a very good visual representation of the data.

library(dplyr)
library(ggplot2)
# Note: This example is adapted from the example in the book Python Data Science Handbook by
# Jake VanderPlas (https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html)

# Read the digits data (from sklearn datasets)
digits = read.csv("digits.csv")
# Create a digit classes target variable
digitClasses <- factor(digits$X0.000000000000000000e.00.29)
# Invoke the Principal Component analysis on columns 1-64
digitsPCA = prcomp(digits[,1:64])
# Create a dataframe of the PCA scores
df <- data.frame(digitsPCA$x)
# Bind the digit classes
df1 <- cbind(df, digitClasses)
# Plot only the first 2 principal components as a scatter plot
ggplot(df1, aes(x = PC1, y = PC2, col = digitClasses)) + geom_point() +
  ggtitle("Top 2 Principal Components")

1.1b Variance explained vs no of principal components – R code

In the code below the variance explained vs the number of principal components is plotted. It can be seen that with 20 Principal components almost 90% of the variance is explained by this reduced dimensional model.

# Read the digits data (from sklearn datasets)
digits = read.csv("digits.csv")
# Digits target
digitClasses <- factor(digits$X0.000000000000000000e.00.29)
digitsPCA = prcomp(digits[,1:64])
# Get the standard deviation
sd = digitsPCA$sdev
# Compute the variance
digitsVar = digitsPCA$sdev^2
# Compute the percent variance explained
percentVarExp = digitsVar/sum(digitsVar)
# Plot the percent variance explained as a function of the number of principal components
plot(cumsum(percentVarExp), xlab="Principal Component",
     ylab="Cumulative Proportion of Variance Explained",
     main="Principal Components vs % Variance explained", ylim=c(0,1), type='l', lwd=2,
     col="blue")

1.1c Principal Component Analysis (PCA) – Python code

import numpy as np
from sklearn.decomposition import PCA
from sklearn import decomposition
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

# Load the digits data
digits = load_digits()
# Select only the first 2 principal components
pca = PCA(2)  # project from 64 to 2 dimensions
# Compute the first 2 PCA projections
projected = pca.fit_transform(digits.data)
# Plot a scatter plot of the first 2 principal components
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('spectral', 10))
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.colorbar()
plt.title("Top 2 Principal Components")
plt.savefig('fig1.png', bbox_inches='tight')

1.1d Variance vs no of principal components – Python code

import numpy as np
from sklearn.decomposition import PCA
from sklearn import decomposition
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
# Select all 64 principal components
pca = PCA(64)  # keep all 64 dimensions
projected = pca.fit_transform(digits.data)
# Obtain the explained variance for each principal component
varianceExp = pca.explained_variance_ratio_
# Compute the cumulative variance explained
totVarExp = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
# Plot the variance explained as a function of the number of principal components
plt.plot(totVarExp)
plt.xlabel('No of principal components')
plt.ylabel('% variance explained')
plt.title('No of Principal Components vs Total Variance explained')
plt.savefig('fig2.png', bbox_inches='tight')

1.2a K-Means – R code

In the code below, the first 2 principal components of the handwritten digits are first plotted as a scatter plot. The 10 centroids of the clusters corresponding to the 10 different digits are then overlaid on this scatter plot.

library(ggplot2)
# Read the digits data
digits = read.csv("digits.csv")
# Create digit classes target variable
digitClasses <- factor(digits$X0.000000000000000000e.00.29)
# Compute the principal components
digitsPCA = prcomp(digits[,1:64])
# Create a data frame of principal components and the digit classes
df <- data.frame(digitsPCA$x)
df1 <- cbind(df, digitClasses)
# Pick only the first 2 principal components
a <- df[,1:2]
# Compute K-Means with 10 clusters and allow for 1000 iterations
k <- kmeans(a, 10, 1000)
# Create a dataframe of the centroids of the clusters
df2 <- data.frame(k$centers)
# Plot the first 2 principal components with the K-Means centroids
ggplot(df1, aes(x = PC1, y = PC2, col = digitClasses)) + geom_point() +
  geom_point(data = df2, aes(x = PC1, y = PC2), col = "black", size = 4) +
  ggtitle("Top 2 Principal Components with KMeans clustering")

1.2b K-Means – Python code

The centroids of the 10 different handwritten digits are plotted over the scatter plot of the first 2 principal components.

import numpy as np
from sklearn.decomposition import PCA
from sklearn import decomposition
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

digits = load_digits()
# Select only the first 2 principal components
pca = PCA(2)  # project from 64 to 2 dimensions
projected = pca.fit_transform(digits.data)
# Create 10 different clusters
kmeans = KMeans(n_clusters=10)
# Compute the clusters
kmeans.fit(projected)
y_kmeans = kmeans.predict(projected)
# Get the cluster centroids
centers = kmeans.cluster_centers_
centers
# Create a scatter plot of the first 2 principal components
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('spectral', 10))
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.colorbar()
# Overlay the centroids on the scatter plot
plt.scatter(centers[:, 0], centers[:, 1], c='darkblue', s=100)
plt.savefig('fig3.png', bbox_inches='tight')

1.3a Hierarchical clusters – R code

Hierarchical clustering is another type of unsupervised learning. It successively joins the closest pair of objects (points or clusters) based on some 'distance' metric. In this type of clustering we do not have to choose the number of centroids; instead, we can cut the resulting dendrogram at an appropriate height to get a desired and reasonable number of clusters. The following 'distance' (linkage) metrics are used when combining successive objects:

  • Ward
  • Complete
  • Single
  • Average
  • Centroid
# Read the IRIS dataset
iris <- datasets::iris
iris2 <- iris[,-5]
species <- iris[,5]
# Compute the distance matrix
d_iris <- dist(iris2)
# Use the 'average' method to form the clusters
hc_iris <- hclust(d_iris, method = "average")
# Plot the clusters
plot(hc_iris)
# Cut tree into 3 groups
sub_grp <- cutree(hc_iris, k = 3)
# Number of members in each cluster
table(sub_grp)
## sub_grp
##  1  2  3
## 50 64 36
# Draw rectangles around the clusters
rect.hclust(hc_iris, k = 3, border = 2:5)

1.3b Hierarchical clusters – Python code

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the IRIS data set
iris = load_iris()
# Generate the linkage matrix using the average method
Z = linkage(iris.data, 'average')
# Plot the dendrogram
#dendrogram(Z)
#plt.xlabel('Data')
#plt.ylabel('Distance')
#plt.suptitle('Samples clustering', fontweight='bold', fontsize=14);
#plt.savefig('fig4.png', bbox_inches='tight')

Conclusion

This is the last and concluding part of my series on Practical Machine Learning with R and Python. These parallel implementations of R and Python can be used as a quick reference while working on a large project. A person who is adept in one of the languages R or Python, can quickly absorb code in the other language.

Hope you find this series useful!

More interesting things to come. Watch this space!

References

  1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Online Stanford
  2. Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera

Also see
1. The many faces of latency
2. Simulating a Web Join in Android
3. The Anamoly
4. yorkr pads up for the Twenty20s:Part 3:Overall team performance against all oppositions
5. Bend it like Bluemix, MongoDB using Auto-scale – Part 1!

To see all posts see ‘Index of posts


To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

Using an R ‘template’ package to enhance reproducible research or the ‘R package syndrome’

Tue, 11/21/2017 - 11:00
Motivation

Have you ever had the feeling that the creation of your data analysis report(s) resulted in looking up, copy-pasting and reuse of code from previous analyses?
This approach is time consuming and prone to errors. If you frequently analyze similar data (types), e.g. from a standardized analysis workflow or different experiments on the same platform, automating your report creation via an R 'template' package can be a very useful and time-saving step. It also allows you to focus on the important part of the analysis (i.e. the experiment- or analysis-specific part).
Imagine that you need to analyze tens or hundreds of runs of data in the same format: making use of an R 'template' package can save you hours, days or even weeks. On the go, reports can be adjusted, errors corrected and extensions added without much effort.

A bit of history

The reproducibility of an analysis workflow remains a complex challenge for R data scientists. The creation of dynamic and reproducible analysis reports in R started to spread with the creation of Sweave [1] by Friedrich Leisch, enabling the creation of mixed R and LaTeX reports. This was further expanded with the knitr package [2] by Yihui Xie in 2012. The concept of child documents and parametrized (rmarkdown) reports is another step towards this goal [3]. Our approach integrates the efforts towards reproducible research by means of report templates integrated within an R package structure, namely the 'R template' package.

Advantages of an R ‘template’ package

Integrating report templates in an R package structure enables to take advantage of the functionalities of an R package. R functions and report templates, wrapping the entire analysis workflow can be contained in the same code structure.
Bug fixes and extensions of an analysis can be tracked via the package versioning. Because your entire analysis workflow is contained within a single container, the code for the creation of the analysis report can easily be exchanged with collaborators and a new version installed smoothly in a standard manner.
Successive modifications of the reports can easily be applied to previous analysis reports.

How do you start the development and/or writing of your own R ‘templates’?

Distinguish the data- or experiment-specific parts (e.g. different models, visualizations) from the frequently used analysis parts. The latter contains the parts of the analysis that are consistent throughout the different analyses or experiments (i.e. the code that you were copy-pasting from previous analyses). Think of e.g. the reporting of results from a standard analysis or modeling technique like linear discriminant analysis. The modeling function, visualization and results formatting are the same irrespective of the specific experiment at hand. The experiment-specific part consists of e.g. the input data, model or analysis parameters, etc.
The code which is consistent throughout the different analyses can be captured in R functions or the equivalent R ‘template’ documents for reporting. The R ‘template’ documents can indeed be seen as the equivalent of an R function for reporting purposes, integrated within an R package, with input parameters and potentially some default values (i.e. the most common value).
Each part of an analysis report can be contained in separated modular child document(s), integrated within an R package structure.
This can be either one ‘template’ document or several (nested) ‘template’ documents, depending on personal preference regarding structuring (e.g. one document per analysis technique) as well as complexity of the analysis at hand. An advantage of using a more complex nested approach is that it allows to include or exclude several parts of the analysis depending on requests of the user and/or the data at hand and/or the findings within the analysis.
These so-called template document(s) can be run from outside the package, with data/parameters specific to a certain analysis/experiment.

Our suggested approach

The approach that we recommend makes use of a main 'master' template which in turn calls the (different) child document(s) contained in the R 'template' package (see figure). This allows the developers to easily extend the template without running into issues with previous reports.

It is advisable and user-friendly to create a start template which mentions the required and optional input parameters necessary for the downstream analysis. A dedicated function can be created to extract the template(s) path (after the package installation), which can be used to render/run the document.
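A minimal sketch of such a helper, assuming a hypothetical package called myTemplates that ships its reports under inst/template/:

# Hypothetical helper exported by the 'myTemplates' package: returns the
# installed path of a report template shipped with the package
getTemplatePath <- function(template = "mainReport.Rmd") {
  path <- system.file("template", template, package = "myTemplates")
  if (path == "")
    stop("Template '", template, "' not found in package 'myTemplates'")
  path
}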

To enhance the approach:

  • progress messages can be implemented in the template to follow the report creation
  • default parameters can be set at the start of each template
  • session information can be included at the end of the main template to track the versions of the R package and its dependencies.
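Inside the template documents these enhancements only take a few lines; a small sketch with hypothetical parameter names:

# At the start of a template: fall back to a default if a parameter was not provided
if (!exists("nTopResults")) nTopResults <- 10

# Progress messages printed in the console while the report is rendered
message("Fitting the model for ", inputDataFile, " ...")

# At the end of the main template: record R and package versions used for the analysis
sessionInfo()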
How to use an R ‘template’ package?

The template(s) contained in the R package can be called directly from R. The necessary parameters for the report can either be specified directly inside the document or passed via the render call, which keeps the R environment cleaner. The latter option might be quite complicated, depending on the analysis at hand.
The report creation can also be combined with a Shiny user interface. This has the advantage that even non-(expert) R-users can easily create a reproducible report without specifying anything in R at all. (Although it might be that your report does not lend itself to a shiny interface.) Suppose your data are in a standardized format (e.g. several runs of data with the same type of information). In this case a shiny application can be developed with a list of possible runs, some user-specified additional parameters and a button to create and/or download the report. In this way even users without any R knowledge can create their own standardized report in a straightforward way from the same R template package.
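As a sketch, rendering such a template for one specific experiment could look as follows (reusing the hypothetical getTemplatePath() helper from above; parameter names are illustrative):

library(rmarkdown)

# Option 1: define the inputs in the R session, then render the installed template
inputDataFile <- "run42.csv"
render(getTemplatePath("mainReport.Rmd"), output_file = "run42_report.html")

# Option 2: pass the inputs via the 'params' argument of render(), assuming the
# template declares them in its YAML 'params' field
render(getTemplatePath("mainReport.Rmd"),
       params = list(inputDataFile = "run42.csv", nTopResults = 10),
       output_file = "run42_report.html")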

Summary of the advantages

To summarize, using an R ‘template’ package makes the approach less prone to errors, saves time and keeps the reports consistent across different analyses. It is easy to keep track of changes, extensions and bug fixes via appropriate package versioning, which ensures the reproducibility of an entire analysis workflow. The combination of an R ‘template’ package with a shiny application creates opportunities for non(-expert) R users to create their own reports.

Is the development of an R ‘template’ worth the effort?

The development of such template reports might seem cumbersome and time-consuming, and the use of the R package structure for such purposes overkill, a.k.a. the ‘R package syndrome’.
However, a lot of time can be saved and analysis reproducibility gained when using this package afterwards, which makes the effort worthwhile.

All the tools are already available thanks to the open R community, so do-it-yourself!

Our presentation at the UseR2017 conference is available here. An example of an R ‘template’ package with a shiny interface is available here. Feel free to contact us for more information: Laure Cougnaud (laure.cougnaud.at.openanalytics.eu), Kirsten Van Hoorde (kirsten.vanhoorde.at.openanalytics.eu).

[1] F. Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In W. Härdle and B. Rönz, editors, Compstat 2002 “Proceedings in Computational Statistics”, pages 575-580. Physika Verlag, Heidelberg, Germany, 2002.

[2] https://cran.r-project.org/web/packages/knitr/index.html.

[3] http://rmarkdown.rstudio.com/developer_parameterized_reports.html.


How to Build a Geographic Dashboard with Real-Time Data

Tue, 11/21/2017 - 03:08

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

In this post, I show how to build an interactive geographic dashboard using Displayr, Plotly and R. It is particularly fascinating in that it tracks the real-time position of military aircraft. To do so, I am first going to bring in some data from two separate sources (regions based on the size of military air-force, and then the real-time tracking of aircraft positions). The dashboard displays the dynamic data in two ways: region shading (to indicate the size of the military force per country) and point markers (for the aircraft positions). I then build a map to neatly and attractively display all this data.

Reading tabular data from the web

I am going to shade each country of the map according to the size of its military air-force. To do this I need a list of aircraft per country. Fortunately, the ever useful Wikipedia has just what I need (here). The following code reads in the data and cleans it to produce a neat table.

library(httr)
library(XML)

# Extract table via URL
url <- "https://en.wikipedia.org/wiki/List_of_countries_by_level_of_military_equipment#List"
r <- GET(url)
airforces <- readHTMLTable(doc = content(r, "text"))[[2]]

# Clean required columns
airforces <- airforces[-1, c("Country[note 1]", "Military aircraft[note 3]")]
colnames(airforces) <- c("Country", "MilitaryAircraft")
remove.bracket.content <- function(s) {
  return(gsub("\\[.+\\]", "", s))
}
airforces <- data.frame(apply(airforces, 2, remove.bracket.content))
airforces$MilitaryAircraft <- as.numeric(gsub(",", "", airforces$MilitaryAircraft))
airforces

Pooling real-time data from across the globe

Compared to the above, the second source of data is more dynamic. I am using ADS-B, which pools real-time flight tracking information from across the globe. Not all military operations are top secret. Apparently, some aircraft operated by the armed forces broadcast their positions publicly.

To collate this information, I construct a URL to retrieve a JSON object with information about military aircraft (JSON is a flexible text format used for data exchange). Then I parse the JSON to create a data.frame.

library(jsonlite)

url <- "http://public-api.adsbexchange.com/VirtualRadar/AircraftList.json?"
url <- paste0(url, "fMilQ=TRUE")

positions <- fromJSON(url)$acList
if (length(positions) != 0) {
  positions <- positions[positions$Type != "TEST", ]
  positions <- positions[!is.na(positions$Lat), ]
}
positions

Shading countries of a map

The following code produces a plotly map of the world. The countries are shaded according to the size of the air-force, with the scale shown on a legend. In plotly terminology, each layer of the map is known as a trace. 

library(plotly)
library(flipFormat)

# specify map area and projection
g <- list(scope = "world",
          showframe = FALSE,
          showcoastlines = TRUE,
          projection = list(type = 'mercator'),
          lonaxis = list(range = c(-140, 179)),
          lataxis = list(range = c(-55, 70)),
          resolution = 50) # higher resolution

# shade countries by airforce size
p <- plot_geo(airforces) %>%
  add_trace(data = airforces,
            name = "Airforce",
            z = ~MilitaryAircraft,
            color = ~MilitaryAircraft,
            colors = 'Blues',
            locations = ~Country,
            marker = list(line = list(color = toRGB("grey"), width = 0.5)),
            showscale = TRUE,
            locationmode = "country names",
            colorbar = list(title = 'Airforce', separatethousands = TRUE)) %>%
  config(displayModeBar = F) %>%
  layout(geo = g,
         margin = list(l = 0, r = 0, t = 0, b = 0, pad = 0),
         paper_bgcolor = 'transparent')

Adding markers for the aircraft

Finally, I add markers showing the airplane locations as another trace. I use a different color for those with speed less than 200 knots and altitude less than 2000 feet. The hover text contains more detail about the aircraft.

aircolors = rep("airborne", nrow(positions)) # create a vector with the status of each aeroplane
aircolors[positions$Spd < 200 & positions$Alt < 2000] <- "ground/approach"

hovertext = paste0("Operator:", positions$Op,
                   "\nModel:", positions$Mdl,
                   "\nAltitude(ft):", sapply(positions$Alt, FormatAsReal))
hoverinfo = rep("all", nrow(positions))

p = add_trace(p,
              data = positions,
              x = positions$Long,
              y = positions$Lat,
              color = aircolors,
              hovertext = hovertext,
              showlegend = FALSE)

The final result is shown below. Since broadcasting of position is purely voluntary, I suspect that the Top Gun flights are not shown!

Adding some finishing touches

While the map above shows the required information, it can easily be made more useful and appealing. Displayr allows me to add a control to switch between regions, text and image annotations, and a background. Here is a link to the finished dashboard, with a screenshot shown below.

Thinking about it again, if you can’t see any military aircraft at all on the radar, then you should really worry!

Try it yourself

You can explore the dashboards used in this post within Displayr. All the code is embedded within the pages (look on the right-hand panel after you click on an object). Click here to see the copy of the working document (without background and finishing touches). Click here to open a copy of the finished document that includes the background, control box, and other finishing touches.



R charts in a Tweet

Mon, 11/20/2017 - 21:25

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Twitter recently doubled the maximum length of a tweet to 280 characters, and while all users now have access to longer tweets, few have taken advantage of the opportunity. Bob Rudis used the rtweet package to analyze tweets sent with the #rstats hashtag since 280-char tweets were introduced, and most still kept below the old 140-character limit. The actual percentage differed by the Twitter client being used; I’ve reproduced the charts for the top 10 clients below. (Click the chart for even more clients, and details of the analysis including the R code that generated it.)

That being said, some have embraced the new freedom with gusto, not least my Microsoft colleague Paige Bailey who demonstrated you can fit some pretty nice R charts — and even animations! — into just 280 characters:

df <- expand.grid(x = 1:10, y=1:10)
df$angle <- runif(100, 0, 2*pi)
df$speed <- runif(100, 0, sqrt(0.1 * df$x))

ggplot(df, aes(x, y)) +
geom_point() +
geom_spoke(aes(angle = angle, radius = speed))

y’all twitterpeople give me 280 characters?
yr just gonna get code samples pic.twitter.com/hyGEE2DxGy

— @DynamicWebPaige (@DynamicWebPaige) November 8, 2017

x <- getURL("https://t.co/ivZZvodbNK")
b <- read.csv(text=x)
c <- get_map(location=c(-122.080954,36.971709), maptype="terrain", source="google", zoom=14)
ggmap(c) +
geom_path(data=b, aes(color=elevation), size=3)+
scale_color_gradientn(colours=rainbow(7), breaks=seq(25, 200, 25)) pic.twitter.com/7WdQLR56uZ

— @DynamicWebPaige (@DynamicWebPaige) November 11, 2017

library(dygraphs)
lungDeaths <- cbind(mdeaths, fdeaths)
dygraph(lungDeaths) %>%
dySeries("mdeaths", label = "Male") %>%
dySeries("fdeaths", label = "Female") %>%
dyOptions(stackedGraph = TRUE) %>%
dyRangeSelector(height = 20)

I ❤️ R’s interactive visualizations SO MUCH pic.twitter.com/LevjElly3L

— @DynamicWebPaige (@DynamicWebPaige) November 16, 2017

library(plot3D)

par(mar = c(2, 2, 2, 2))
par(mfrow = c(1, 1))
x <- seq(0, 2*pi,length.out=50)
y <- seq(0, pi,length.out=50)
M <- mesh(x, y)

surf3D(x = (3+2*cos(M$x)) * cos(M$y),
y = (3+2*cos(M$x)) * sin(M$y),
z = 2 * sin(M$x),
colkey=FALSE,
bty="b2") pic.twitter.com/a6GwQTaGYC

— @DynamicWebPaige (@DynamicWebPaige) November 17, 2017

For more 280-character examples in R and other languages follow this Twitter thread, and for more on analyzing tweets with rtweet follow the link below.

rud.is: Twitter Outer Limits : Seeing How Far Have Folks Fallen Down The Slippery Slope to “280” with rtweet



Kaggle’s Advanced Regression Competition: Predicting Housing Prices in Ames, Iowa

Mon, 11/20/2017 - 19:31

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Introduction

Kaggle.com is a website designed for data scientists and data enthusiasts to connect and compete with each other. It is an open community that hosts forums and competitions in the wide field of data. In each Kaggle competition, competitors are given a training data set, which is used to train their models, and a test data set, used to test their models. Kagglers can then submit their predictions to view how well their score (e.g., accuracy or error) compares to others’.

As a team, we joined the House Prices: Advanced Regression Techniques Kaggle challenge to test our model building and machine learning skills. For this competition, we were tasked with predicting housing prices of residences in Ames, Iowa. Our training data set included 1460 houses (i.e., observations) accompanied by 79 attributes (i.e., features, variables, or predictors) and the sales price for each house. Our testing set included 1459 houses with the same 79 attributes, but sales price was not included as this was our target variable.

To view our code (split between R and Python) and our project presentation slides for this project see our shared GitHub repository.

 

Understanding the Data

Of the 79 variables provided, 51 were categorical and 28 were continuous.

Our first step was to combine these data sets into a single set both to account for the total missing values and to fully understand all the classes for each categorical variable. That is, there might be missing values or different class types in the test set that are not in the training set.

 

Processing the Data

Response Variable

As our response variable, Sale Price, is continuous, we will be utilizing regression models. One assumption of linear regression models is that the error between the observed and expected values (i.e., the residuals) should be normally distributed. Violations of this assumption often stem from a skewed response variable. Sale Price has a right skew, so we log + 1 transform it to normalize its distribution.
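In R, assuming the training data lives in a data frame called train, this transformation is a one-liner:

# log(1 + x) transform to reduce the right skew of the response (column name assumed)
train$SalePrice <- log1p(train$SalePrice)
hist(train$SalePrice) # distribution is roughly symmetric after the transform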

 

Missing Data

Machine learning algorithms do not handle missing values very well, so we must obtain an understanding of the missing values in our data to determine the best way to handle them. We find that 34 of the predictor variables have values that are interpreted by R and Python as missing (i.e., “NA” and “NaN”). Below we describe examples of some of the ways we treated these missing data.

 

1) NA/NaN is actually a class:

In many instances, what R and Python interpret as a missing value is actually a class of the variable. For example, Pool Quality is comprised of 5 classes: Excellent, Good, Fair, Typical, and NA. The NA class describes houses that do not have a pool, but our coding languages interpret houses of NA class as a missing value instead of a class of the Pool Quality variable.

Our solution was to impute most of the NA/NaN values to a value of “None.”
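A sketch of this recoding for one such variable, with an assumed column name, looks like this:

# Treat NA in Pool Quality as its own class ("None") rather than as missing
pool <- as.character(train$PoolQC) # column name assumed
pool[is.na(pool)] <- "None"
train$PoolQC <- as.factor(pool)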

 

2) Not every NA/NaN corresponds to a missing attribute:

While we found that most NA/NaN values corresponded to an actual class for different variables, some NA/NaN values actually represented missing data. For example, we find that three houses with NA/NaN values for Pool Quality, also have a non-zero value for the variable, Pool Area (square footage of pool). These three houses likely have a pool, but its quality was not assessed or input into the data set.

Our solution was to first calculate mean Pool Area for each class of Pool Quality, then impute the missing Pool Quality classes based on how close that house’s Pool Area was to the mean Pool Areas for each Pool Quality class. For example, the first row in the below picture on the left has a Pool Area of 368 square feet. The average Pool Area for houses with Excellent pool quality (Ex) is about 360 square feet (picture on the right). Therefore, we imputed this house to have a Pool Quality of Excellent.
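A rough sketch of this nearest-mean imputation, with assumed column names and applied before the remaining NAs are recoded to “None”, could be:

# Mean pool area per observed pool quality class
class_means <- tapply(train$PoolArea[!is.na(train$PoolQC)],
                      train$PoolQC[!is.na(train$PoolQC)], mean)

# For houses with a pool but no quality rating, pick the class with the closest mean area
missing_idx <- which(is.na(train$PoolQC) & train$PoolArea > 0)
for (i in missing_idx) {
  train$PoolQC[i] <- names(which.min(abs(class_means - train$PoolArea[i])))
}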

3) Domain knowledge:

Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.

Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.
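One way to express this neighborhood-based imputation (column names assumed):

# Median Lot Frontage per neighborhood, computed on the non-missing values
med_front <- tapply(train$LotFrontage, train$Neighborhood, median, na.rm = TRUE)

# Fill each missing Lot Frontage with the median of that house's neighborhood
na_idx <- is.na(train$LotFrontage)
train$LotFrontage[na_idx] <- med_front[as.character(train$Neighborhood[na_idx])]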

4) Imputing with mode:

Most variables have some intuitive relationship to other variables, and imputation can be based on these related features. But some missing values are found in variables with no apparent relation to others. For example, the Electrical variable, which describes the electrical system, was missing for a single observation.

Our solution was to simply find the most common class for this categorical variable and impute for this missing value.
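A small sketch of this mode imputation (column name assumed):

# Most frequent class of the Electrical variable
mode_class <- names(which.max(table(train$Electrical)))
train$Electrical[is.na(train$Electrical)] <- mode_class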

Ordinal Categories

For linear (but not tree-based) models, categorical variables must be converted to numeric values. There are two types of categorical features: ordinal, where there is an inherent order to the classes (e.g., Excellent is greater than Good, which is greater than Fair), and nominal, where there is no obvious order (e.g., red, green, and blue).

Our solution for ordinal variables was to simply assign the classes a number corresponding to their relative ranks. For example, Kitchen Quality has five classes: Excellent, Good, Typical, Fair, and Poor, which we encoded (i.e., converted) to the numbers 5, 4, 3, 2, and 1, respectively.
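A compact way to apply such a ranking is a named lookup vector; the abbreviated labels below are assumptions about how the classes are coded in the data set:

# Map ordered quality labels to integers (Poor ... Excellent)
quality_rank <- c(Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5)
train$KitchenQual <- unname(quality_rank[as.character(train$KitchenQual)])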

Nominal Categories

The ranking of nominal categories is not appropriate as there is no actual rank among the classes.

Our solution was to one-hot encode these variables, which creates a new variable for each class with values of zero (not present) or one (present).
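In R this can be done, for example, with model.matrix (or caret::dummyVars); the sketch below assumes the missing values have already been imputed:

# One-hot encode the factor columns; dropping the intercept gives each class its own 0/1 column
train_encoded <- as.data.frame(model.matrix(~ . - 1, data = train))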

Outliers

An outlier can be defined with a quantitative (i.e., statistical) or qualitative definition. We opted for the qualitative version when looking for outliers: observations that are abnormally far from other values. Viewing the relationship between Above Ground Living Area and Sale Price, we noticed some very large areas for very low prices.

Our solution was to remove these observations as we thought they fit our chosen definition of an outlier, and because they might increase our models’ errors.

 

Skewness

While there are few assumptions regarding the independent variables of regression models, often transforming skewed variables to a normal distribution can improve model performance.

Our solution was to log + 1 transform several of the predictors.

 

Near Zero Predictors

Predictors with very low variance offer little predictive power to models.

Our solution was to find the ratio of the second most frequent value to the most frequent value for each predictor, and to remove variables where this ratio was less than 0.05. This roughly translates to dropping variables where 95% or more of the values are the same.
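A sketch of this frequency-ratio filter:

# Ratio of the second most frequent to the most frequent value in each column
freq_ratio <- sapply(train, function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  if (length(tab) < 2) 0 else as.numeric(tab[2] / tab[1])
})
train <- train[, freq_ratio >= 0.05] # drop near-zero-variance predictors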

 

Feature Engineering

Feature (variable or predictor) engineering is one of the most important steps in model creation. Often there is valuable information “hidden” in the predictors that is only revealed when manipulating these features in some way. Below are just some examples of the features we created:

  • Remodeled (categorical): Yes or No if Year Built is different from Year Remodeled; If the year the house was remodeled is different from the year it was built, the remodeling likely increases property value
  • Seasonality (categorical): Combined Month Sold with Year Sold; While more houses were sold during summer months, this likely varies across years, especially during the time period these houses were sold, which coincides with the housing crash (2006-2010)
  • New House (categorical): Yes or No if Year Sold is equal to Year Built; If a house was sold the same year it was built, we might expect it was in high demand and might have a higher Sale Price
  • Total Area (continuous): Sum of all variables that describe the area of different sections of a house; There are many variables that pertain to the square footage of different aspects of each house, we might expect that the total square footage has a strong influence on Sale Price
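The engineered features above could be built along these lines; the column names are assumptions based on the Ames data dictionary:

# Assumed column names from the Ames data set
train$Remodeled   <- ifelse(train$YearRemodAdd != train$YearBuilt, "Yes", "No")
train$NewHouse    <- ifelse(train$YrSold == train$YearBuilt, "Yes", "No")
train$Seasonality <- paste(train$MoSold, train$YrSold, sep = "-")
train$TotalArea   <- train$TotalBsmtSF + train$GrLivArea + train$GarageArea # a subset of the area variables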

 

Models and Results

Now that we have prepared our data set, we can begin training our models and use them to predict Sale Price.

We trained and tested dozens of versions of the models described below with different combinations of engineered features and processed variables. The information in the table represents our best results for each model. The table explains the pros and cons for each model type, the optimal hyperparameters found through either grid search or Bayesian optimization, our test score, and the score we received from Kaggle. Our scores are the root mean square error (RMSE) of our predictions, which is a metric describing the difference between the observed values and our predicted values for Sale Price; scores closer to zero are better.

For brevity, we will not describe the details of the different models. However, see the following links for more information about how each model is used to create predictions: random forest, gradient boost, XGBoost, elastic net regularization for regression.

 

Below are plots summarizing variables that contribute most to the respective model’s prediction of Sale Price.

For most models, predictors related to square footage (Area), quality (different Quality measures), and age (Year Built) have the strongest impact on each model’s predictions.

 


 

There is no visualization for our best model, which was an ensemble of four other models. The predictions for this ensembled model are calculated by averaging the predictions from the separate models (two linear regression models and two tree-based models). The idea is that each model’s predictions include error both above and below the real values, and the averaged predictions of the best models might have less overall error than any one single model.
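Conceptually the ensemble is just an average of the individual model predictions; a sketch with hypothetical prediction vectors (on the log scale used for the response) would be:

# Hypothetical predictions from four tuned models
pred_ensemble <- (pred_lasso + pred_enet + pred_gbm + pred_xgb) / 4
final_price   <- expm1(pred_ensemble) # back-transform from log(1 + SalePrice)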

One note is that tree-based models (random forest, gradient boosting, and XGBoosting) cannot provide an idea of positive or negative influence for each variable on Sale Price, rather they can only provide an idea of how important that variable is to the models’ predictions overall. In contrast, linear models can provide information about which variables positively and negatively influence Sale Price. For the figure immediately above, the strongest predictor, residency in a Commercial Zone, is actually negatively related to Sale Price.

 

Conclusions

The objective of this Kaggle competition was to build models to predict housing prices of different residences in Ames, IA. Our best model resulted in an RMSE of 0.1071, which translates to an error of about $9000 (or about 5%) for the average-priced house.

While this error is quite low, the interpretability of our model is poor. Each model found within our ensembled model varies with respect to the variables that are most important to predicting Sale Price. The best way to interpret our ensemble is to look for shared variables among its constituent models. The variables seen as most important or as strongest predictors through our models were those related to square footage, the age and condition of the home, the neighborhood where the house was located, the city zone where the house was located, and the year the house was sold.



«smooth» package for R. Common ground. Part II. Estimators

Mon, 11/20/2017 - 18:21

(This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers)

A bit about estimates of parameters

Hi everyone! Today I want to tell you about parameters estimation of smooth functions. But before going into details, there are several things that I want to note. In this post we will discuss bias, efficiency and consistency of estimates of parameters, so I will use phrases like “efficient estimator”, implying that we are talking about some optimisation mechanism that gives efficient estimates of parameters. It is probably not obvious for people without statistical background to understand what the hell is going on and why we should care, so I decided to give a brief explanation. Although there are strict statistical definitions of the aforementioned terms (you can easily find them in Wikipedia or anywhere else), I do not want to copy-paste them here, because there are only a couple of important points worth mentioning in our context. So, let’s get started.

Bias refers to the expected difference between the estimated value of parameter (on a specific sample) and the “true” one. Having unbiased estimates of parameters is important because they should lead to more accurate forecasts (at least in theory). For example, if the estimated parameter is equal to zero, while in fact it should be 0.5, then the model would not take the provided information into account correctly and as a result will produce less accurate point forecasts and incorrect prediction intervals. In inventory this may mean that we constantly order 100 units less than needed only because the parameter is lower than it should be.

Efficiency means that if the sample size increases, then the estimated parameters will not change substantially, they will vary in a narrow range (variance of estimates will be small). In the case with inefficient estimates the increase of sample size from 50 to 51 observations may lead to the change of a parameter from 0.1 to, let’s say, 10. This is bad because the values of parameters usually influence both point forecasts and prediction intervals. As a result the inventory decision may differ radically from day to day. For example, we may decide that we urgently need 1000 units of product on Monday, and order it just to realise on Tuesday that we only need 100. Obviously this is an exaggeration, but no one wants to deal with such an erratic stocking policy, so we need to have efficient estimates of parameters.

Consistency means that our estimates of parameters will get closer to the stable values (what statisticians would refer to as “true” values) with the increase of the sample size. This is important because in the opposite case estimates of parameters will diverge and become less and less realistic. This once again influences both point forecasts and prediction intervals, which will be less meaningful than they should have been. In a way consistency means that with the increase of the sample size the parameters will become more efficient and less biased. This in turn means that the more observations we have, the better. There is a prejudice in the world of practitioners that the situation in the market changes so fast that old observations quickly become useless. As a result many companies just throw away the old data. Although, in general, the statement about market changes is true, forecasters tend to work with models that take this into account (e.g. Exponential smoothing, ARIMA). These models adapt to potential changes. So, we may benefit from the old data because it allows us to get more consistent estimates of parameters. Just keep in mind that you can always remove the annoying bits of data, but you can never un-throw away the data.

Having clarified these points, we can proceed to the topic of today’s post.

One-step-ahead estimators of smooth functions

We already know that the default estimator used for smooth functions is Mean Squared Error (for one-step-ahead forecasts). If the residuals are distributed normally / log-normally, then the minimum of MSE gives estimates that also maximise the respective likelihood function. As a result the estimates of parameters become nice: consistent and efficient. It is also known in statistics that the minimum of MSE gives mean estimates of the parameters, which means that MSE also produces unbiased estimates of parameters (if the model is specified correctly and bla-bla-bla). This works very well when we deal with symmetric distributions of random variables. But it may perform poorly otherwise.

In this post we will use the series N1823 for our examples:

library(Mcomp)
x <- ts(c(M3$N1823$x,M3$N1823$xx),frequency=frequency(M3$N1823$x))

Plot the data in order to see what we have:

plot(x)

N1823 series

The data seems to have slight multiplicative seasonality, which changes over the time, but it is hard to say for sure. Anyway, in order to simplify things, we will apply an ETS(A,A,N) model to this data, so we can see how the different estimators behave. We will withhold 18 observations as it is usually done for monthly data in M3.

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T)

N1823 and ETS(A,A,N) with MSE

Here’s the output:

Time elapsed: 0.08 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.147 0.000
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 629.249
Cost function type: MSE; Cost function value: 377623.069

Information criteria:
     AIC     AICc      BIC
1703.389 1703.977 1716.800
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -14%; Bias: -74.1%; MAPE: 16.8%; SMAPE: 15.1%
MASE: 0.855; sMAE: 13.4%; RelMAE: 1.047; sMSE: 2.4%

It is hard to make any reasonable conclusions from the graph and the output, but it seems that we slightly overforecast the data. At least the prediction interval covers all the values in the holdout. Relative MAE is equal to 1.047, which means that the model produced forecasts less accurate than Naive. Let’s have a look at the residuals:

qqnorm(resid(ourModel))
qqline(resid(ourModel))

QQ-plot of the residuals from ETS(A,A,N) with MSE

The residuals of this model do not look normal: many empirical quantiles are far from the theoretical ones. If we conduct a Shapiro-Wilk test, then we have to reject the hypothesis of normality for the residuals at the 5% level:

shapiro.test(resid(ourModel))
> p-value = 0.001223

This may indicate that other estimators may do a better job. And there is a magical parameter cfType in the smooth functions which allows estimating models differently. It controls what to use and how to use it. You can select the following estimators instead of MSE:

  • MAE – Mean Absolute Error:

\begin{equation} \label{eq:MAE}
\text{MAE} = \frac{1}{T} \sum_{t=1}^T |e_{t+1}|
\end{equation}

The minimum of MAE gives median estimates of the parameters. MAE is considered to be a more robust estimator than MSE. If you have asymmetric distribution, give MAE a try. It gives consistent, but not efficient estimates of parameters. Asymptotically, if the distribution of the residuals is normal, the estimators of MAE converge to the estimators of MSE (which follows from the connection between mean and median of normal distribution).

Let’s see what happens with the same model, on the same data when we use MAE:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MAE")

N1823 and ETS(A,A,N) with MAE

Time elapsed: 0.09 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.101 0.000
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 636.546
Cost function type: MAE; Cost function value: 462.675

Information criteria:
     AIC     AICc      BIC
1705.879 1706.468 1719.290
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -5.1%; Bias: -32.1%; MAPE: 12.9%; SMAPE: 12.4%
MASE: 0.688; sMAE: 10.7%; RelMAE: 0.842; sMSE: 1.5%

There are several things to note from the graph and the output. First, the smoothing parameter alpha is smaller than in the previous case. Second, Relative MAE is smaller than one, which means that model in this case outperformed Naive. Comparing this value with the one from the previous model, we can conclude that MAE worked well as an estimator of parameters for this data. Finally, the graph shows that point forecasts go through the middle of the holdout sample, which is reflected with lower values of error measures. The residuals are still not normally distributed, but this is expected, because they won’t become normal just because we used a different estimator:

QQ-plot of the residuals from ETS(A,A,N) with MAE

  • HAM – Half Absolute Moment:

\begin{equation} \label{eq:HAM}
\text{HAM} = \frac{1}{T} \sum_{t=1}^T \sqrt{|e_{t+1}|}
\end{equation}
This is an even more robust estimator than MAE. On count data its minimum corresponds to the mode of the data. In the case of continuous data the minimum of this estimator corresponds to something not yet well studied, between the mode and the median. The paper about this thing is currently in a draft stage, but you can already give it a try, if you want. This is also a consistent, but not efficient, estimator.

The same example, the same model, but HAM as an estimator:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="HAM")

N1823 and ETS(A,A,N) with HAM

Time elapsed: 0.06 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.001 0.001
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 666.439
Cost function type: HAM; Cost function value: 19.67

Information criteria:
     AIC     AICc      BIC
1715.792 1716.381 1729.203
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -1.7%; Bias: -14.1%; MAPE: 11.4%; SMAPE: 11.4%
MASE: 0.63; sMAE: 9.8%; RelMAE: 0.772; sMSE: 1.3%

This estimator produced even more accurate forecasts in this example, forcing smoothing parameters to become close to zero. Note that the residuals standard deviation in case of HAM is larger than in case of MAE which in its turn is larger than in case of MSE. This means that one-step-ahead parametric and semiparametric prediction intervals will be wider in case of HAM than in case of MAE, than in case of MSE. However, taking that the smoothing parameters in the last model are close to zero, the multiple steps ahead prediction intervals of HAM may be narrower than the ones of MSE.

Finally, it is worth noting that the optimisation of models using different estimators takes different amounts of time. MSE is the slowest, while HAM is the fastest estimator. I don’t have any detailed explanation of this, but it obviously happens because of the form of the cost function surfaces. So if you are in a hurry and need to estimate something somehow, you can give HAM a try. Just remember that information criteria may become inapplicable in this case.

Multiple-steps-ahead estimators of smooth functions

While these three estimators above are calculated based on the one-step-ahead forecast error, the next three are based on multiple steps ahead estimators. They can be useful if you want to have a more stable and “conservative” model (a paper on this topic is currently in the final stage). Prior to v2.2.1 these estimators had different names, be aware!

  • MSE\(_h\) – Mean Squared Error for h-steps ahead forecast:

\begin{equation} \label{eq:MSEh}
\text{MSE}_h = \frac{1}{T} \sum_{t=1}^T e_{t+h}^2
\end{equation}
The idea of this estimator is very simple: if you are interested in 5 steps ahead forecasts, then optimise over this horizon, not one-step-ahead. However, by using this estimator, we shrink the smoothing parameters towards zero, forcing the model to become closer to the deterministic and robust to outliers. This applies both to ETS and ARIMA, but the models behave slightly differently. The effect of shrinkage increases with the increase of \(h\). The forecast accuracy may increase for that specific horizon, but it will almost surely decrease for all the other horizons. Keep in mind that this is in general an inefficient and biased estimator with a much slower convergence to the true value than the one-step-ahead estimators. This estimator is eventually consistent, but it may need a very large sample to become one. This means that this estimator may result in values of parameters very close to zero even if they are not really needed for the data. I personally would advise using this thing on large samples (for instance, on high frequency data). By the way, Nikos Kourentzes, Rebecca Killick and I are working on a paper on that topic, so stay tuned.

Here’s what happens when we use this estimator:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MSEh")

N1823 and ETS(A,A,N) with MSEh

Time elapsed: 0.24 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha beta
    0    0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 657.781
Cost function type: MSEh; Cost function value: 550179.34

Information criteria:
     AIC     AICc      BIC
30393.86 30404.45 30635.25
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -10.4%; Bias: -62%; MAPE: 14.9%; SMAPE: 13.8%
MASE: 0.772; sMAE: 12.1%; RelMAE: 0.945; sMSE: 1.8%

As you can see, the smoothing parameters are now equal to zero, which gives us the straight line going through all the data. If we had 1008 observations instead of 108, the parameters would not be shrunk to zero, because the model would need to adapt to changes in order to minimise the respective cost function.

  • TMSE – Trace Mean Squared Error:

The need for having a specific 5 steps ahead forecast is not common, so it makes sense to work with something that deals with one to h steps ahead:
\begin{equation} \label{TMSE}
\text{TMSE} = \sum_{j=1}^h \frac{1}{T} \sum_{t=1}^T e_{t+j}^2
\end{equation}
This estimator is more reasonable than MSE\(_h\) because it takes into account all the errors from one to h-steps-ahead forecasts. This is a desired behaviour in inventory management, because we are not so much interested in how much we will sell tomorrow or next Monday, but rather how much we will sell starting from tomorrow till next Monday. However, the variance of forecast errors h-steps-ahead is usually larger than the variance of one-step-ahead errors (because of the increasing uncertainty), which leads to the effect of “masking”: the latter is hidden behind the former. As a result, if we use TMSE as the estimator, the final values are more strongly influenced by the long-term errors than by the short-term ones (see Taieb and Atiya, 2015 paper). This estimator is not recommended if short-term forecasts are more important than long-term ones. Plus, this is still a less efficient and more biased estimator than the one-step-ahead estimators, with slow convergence to the true values, similar to MSE\(_h\), but slightly better.

This is what happens in our example:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="TMSE")

N1823 and ETS(A,A,N) with TMSE

Time elapsed: 0.2 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha  beta
0.075 0.000
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 633.48
Cost function type: TMSE; Cost function value: 7477097.717

Information criteria:
     AIC     AICc      BIC
30394.36 30404.94 30635.75
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -7.5%; Bias: -48.9%; MAPE: 13.4%; SMAPE: 12.6%
MASE: 0.704; sMAE: 11%; RelMAE: 0.862; sMSE: 1.5%

Comparing the model estimated using TMSE with the same one estimated using MSE and MSE\(_h\), it is worth noting that the smoothing parameters in this model are greater than in case of MSE\(_h\), but less than in case of MSE. This demonstrates that there is a shrinkage effect in TMSE, forcing the parameters towards zero, but the inclusion of one-step-ahead errors makes model slightly more flexible than in case of MSE\(_h\). Still, it is advised to use this estimator on large samples, where the estimates of parameters become more efficient and less biased.

  • GTMSE – Geometric Trace Mean Squared Error:

This is similar to TMSE, but derived from the so called Trace Forecast Likelihood (which I may discuss at some point in one of the future posts). The idea here is to take logarithms of each MSE\(_j\) and then sum them up:
\begin{equation} \label{eq:GTMSE}
\text{GTMSE} = \sum_{j=1}^h \log \left( \frac{1}{T} \sum_{t=1}^T e_{t+j}^2 \right)
\end{equation}
Logarithms make variances of errors on several steps ahead closer to each other. For example, if the variance of one-step-ahead error is equal to 100 and the variance of 10 steps ahead error is equal to 1000, then their logarithms will be 4.6 and 6.9 respectively. As a result when GTMSE is used as an estimator, the model will take into account both short and long term errors. So this is a more balanced estimator of parameters than MSE\(_h\) and TMSE. This estimator is more efficient than both TMSE and MSE\(_j\) because of the log-scale and converges to true values faster than the previous two, but still can be biased on small samples.

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="GTMSE")

N1823 and ETS(A,A,N) with GTMSE

Time elapsed: 0.18 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha beta
    0    0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 649.253
Cost function type: GTMSE; Cost function value: 232.419

Information criteria:
     AIC     AICc      BIC
30402.77 30413.36 30644.16
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -8.2%; Bias: -53.8%; MAPE: 13.8%; SMAPE: 12.9%
MASE: 0.72; sMAE: 11.3%; RelMAE: 0.882; sMSE: 1.6%

In our example GTMSE shrinks both smoothing parameters towards zero and makes the model deterministic, which corresponds to MSE\(_h\). However the initial values are slightly different, which leads to slightly different forecasts. Once again, it is advised to use this estimator on large samples.

Keep in mind that all those multiple steps ahead estimators take more time for the calculation, because the model needs to produce h-steps-ahead forecasts from each observation in sample.

  • Analytical multiple steps ahead estimators.

There is also a non-documented feature in smooth functions (currently available only for pure additive models) – analytical versions of multiple steps ahead estimators. In order to use it, we need to add “a” in front of the desired estimator: aMSE\(_h\), aTMSE, aGTMSE. In this case only one-step-ahead forecast error is produced and after that the structure of the applied state-space model is used in order to reconstruct multiple steps ahead estimators. This feature is useful if you want to use the multiple steps ahead estimator on small samples, where the multi-steps errors cannot be calculated appropriately. It is also useful in cases of large samples, when the time of estimation is important.

These estimators have similar properties to their empirical counterparts, but work faster and are based on asymptotic properties. Here is an example of analytical MSE\(_h\) estimator:

ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="aMSEh")

N1823 and ETS(A,A,N) with aMSEh

Time elapsed: 0.11 seconds
Model estimated: ETS(AAN)
Persistence vector g:
alpha beta
    0    0
Initial values were optimised.
5 parameters were estimated in the process
Residuals standard deviation: 627.818
Cost function type: aMSEh; Cost function value: 375907.976

Information criteria:
     AIC     AICc      BIC
30652.15 30662.74 30893.55
95% parametric prediction intervals were constructed
100% of values are in the prediction interval
Forecast errors:
MPE: -1.9%; Bias: -14.6%; MAPE: 11.7%; SMAPE: 11.6%
MASE: 0.643; sMAE: 10%; RelMAE: 0.787; sMSE: 1.3%

The resulting smoothing parameters are shrunk towards zero, similar to MSE\(_h\), but the initial values are slightly different, which leads to different forecasts. Note that the time elapsed in this case is 0.11 seconds instead of 0.24 as in MSE\(_h\). The difference in time may increase with the increase of sample size and forecasting horizon.

  • Similar to MSE, there are empirical multi-step MAE and HAM in smooth functions (e.g. MAE\(_h\) and THAM). However, they are currently implemented mainly “because I can” and for fun, so I cannot give you any recommendations about them.

Conclusions

Now that we discussed all the possible estimators that you can use with smooth, you are most probably confused and completely lost. The question that may naturally appear after you have read this post is “What should I do?” Frankly speaking, I cannot give you an appropriate answer and a set of universal recommendations, because this is still an under-researched problem. However, I have some advice.

First, Nikos Kourentzes and Juan Ramon Trapero found that in case of high frequency data (they used solar irradiation data) using MSE\(_h\) and TMSE leads to the increase in forecasting accuracy in comparison with the conventional MSE. However in order to achieve good accuracy in case of MSE\(_h\), you need to estimate \(h\) separate models, while with TMSE you need to estimate only one. So, TMSE is faster than MSE\(_h\), but at the same time leads to at least as accurate forecasts as in case of MSE\(_h\) for all the steps from 1 to h.

Second, if you have asymmetrically distributed residuals in the model after using MSE, give MAE or HAM a try – they may improve your model and its accuracy.

Third, analytical counterparts of multi-step estimators can be useful in one of the two situations: 1. When you deal with very large samples (e.g. high frequency data), want to use advanced estimation methods, but want them to work fast. 2. When you work with small sample, but want to use the properties of these estimators anyway.

Finally, don’t use MSE\(_h\), TMSE and GTMSE if you are interested in the values of parameters of models – they will almost surely be inefficient and biased. This applies to both ETS and ARIMA models, which will become close to their deterministic counterparts in this case. Use conventional MSE instead.



Dataiku 4.1.0: More support for R users!

Mon, 11/20/2017 - 16:24

(This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers)

Introduction

Recently, Dataiku 4.1.0 was released, it now offers much more support for R users. But wait a minute, Data-what? I guess some of you do not know Dataiku, so what is Dataiku in the first place? It is a collaborative data science platform created to design and run data products at scale. The main themes of the product are:

Collaboration & Orchestration: A data science project often involves a team of people with different skills and interests. To name a few, we have data engineers, data scientists, business analysts, business stakeholders, hardcore coders, R users and Python users. Dataiku provides a platform to accommodate the needs of these different roles to work together on data science projects.

Productivity: Whether you like hard core coding or are more GUI oriented, the platform offers an environment for both. A flow interface can handle most of the steps needed in a data science project, and this can be enriched by Python or R recipes. Moreover, a managed notebook environment is integrated in Dataiku to do whatever you want with code.

Deployment of data science products: As a data scientist you can produce many interesting deliverables, e.g. graphs, data transformations, analyses, predictive models. The Dataiku platform facilitates the deployment of these deliverables, so that others (in your organization) can consume them. There are dashboards, web-apps, model APIs, productionized model APIs and data pipelines.

 

There is a free version which already contains a lot of useful features, and there is a paid version with “enterprise features”. See the Dataiku website for more info.

Improved R Support in 4.1.0

Among the many new features, the one that interests me most as an R user is the improved support for R. Previous versions of Dataiku already offered some support for R; this version adds the following improvements. There is now support for:

R Code environments

In Dataiku you can now create so-called code environments for R (and Python). A code environment is a standalone and self-contained environment to run R code. Each environment can have its own set of packages (and specific versions of packages). Dataiku provides a handy GUI to manage different code environments. The figure below shows an example code environment with specific packages.

 

In Dataiku, whenever you make use of R (in R recipes, Shiny, R Markdown or when creating R APIs), you can select a specific R code environment to use.

R Markdown reports & Shiny applications

If you are working in RStudio you most likely already know R Markdown documents and Shiny applications. In this version, you can also create them in Dataiku. Now, why would you do that and not just create them in RStudio? Well, the reports and shiny apps become part of the Dataiku environment and so:

  • They are managed in the environment. You will have a good overview of all reports and apps and see who has created/edited them.
  • You can make use of all data that is already available in the Dataiku environment.
  • Moreover, the resulting reports and Shiny apps can be embedded inside Dataiku dashboards.

 

The figure above shows a R markdown report in Dataiku, the interface provides a nice way to edit the report, alter settings and publish the report. Below is an example dashboard in Dataiku with a markdown and Shiny report.

Creating R API’s

Once you have created an analytical model in R, you want to deploy it and make use of its predictions. With Dataiku you can now easily expose R prediction models as an API. In fact, you can expose any R function as an API. The Dataiku GUI provides an environment where you can easily set up and test an R API. Moreover, the Dataiku API Node, which can be installed on a (separate) production server, imports the R models that you have created in the GUI and can take care of load balancing, high availability and scaling of real-time scoring.

The following three figures give you an overview of how easy it is to work with the R API functionality.

First, define an API endpoint and R (prediction) function.

 

Then, define the R function, it can make use of data in Dataiku, R objects created earlier or any R library you need.

 

Then, test and deploy the R function. Dataiku provides a handy interface to test your function/API.

 

Finally, once you are satisfied with the R API, you can make a package of it. That package can then be imported on a production server with the Dataiku API node installed, from which you can then serve API requests.

Conclusion

The new Dataiku 4.1.0 version has a lot to offer for anyone involved in a data science project. The system already has a wide range support for Python, but now with the improved support for R, the system is even more useful to a very large group of data scientists.

Cheers, Longhow.



Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest

Mon, 11/20/2017 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Customer churn occurs when customers or subscribers stop doing business with a company or service, also known as customer attrition. It is also referred to as loss of clients or customers. One industry in which churn rates are particularly useful is the telecommunications industry, because most customers have multiple options from which to choose within a geographic location.

Similar to the concept of predicting employee turnover, we are going to predict customer churn using a telecom dataset. We will introduce Logistic Regression, Decision Tree, and Random Forest. But this time, we will do all of the above in R. Let’s get started!

Data Preprocessing

The data was downloaded from IBM Sample Data Sets. Each row represents a customer, each column contains that customer’s attributes:

library(plyr)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(ggthemes)
library(caret)
library(MASS)
library(randomForest)
library(party)

churn <- read.csv('Telco-Customer-Churn.csv')
str(churn)

'data.frame': 7043 obs. of 21 variables:
 $ customerID      : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ...
 $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
 $ SeniorCitizen   : int 0 0 0 0 0 0 0 0 0 0 ...
 $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
 $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
 $ tenure          : int 1 34 2 45 2 8 22 10 28 62 ...
 $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
 $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
 $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
 $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
 $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
 $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
 $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
 $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
 $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
 $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
 $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
 $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
 $ MonthlyCharges  : num 29.9 57 53.9 42.3 70.7 ...
 $ TotalCharges    : num 29.9 1889.5 108.2 1840.8 151.7 ...
 $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

The raw data contains 7043 rows (customers) and 21 columns (features). The “Churn” column is our target.

We use sapply to check the number of missing values in each column. We find that there are 11 missing values in the “TotalCharges” column. So, let’s remove all rows with missing values.

sapply(churn, function(x) sum(is.na(x)))
      customerID           gender    SeniorCitizen          Partner
               0                0                0                0
      Dependents           tenure     PhoneService    MultipleLines
               0                0                0                0
 InternetService   OnlineSecurity     OnlineBackup DeviceProtection
               0                0                0                0
     TechSupport      StreamingTV  StreamingMovies         Contract
               0                0                0                0
PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges
               0                0                0               11
           Churn
               0

churn <- churn[complete.cases(churn), ]

Looking at the variables, we can see that we have some wrangling to do.

1. We will change “No internet service” to “No” for six columns, they are: “OnlineSecurity”, “OnlineBackup”, “DeviceProtection”, “TechSupport”, “streamingTV”, “streamingMovies”.

cols_recode1 <- c(10:15)
for(i in 1:ncol(churn[,cols_recode1])) {
  churn[,cols_recode1][,i] <- as.factor(mapvalues(churn[,cols_recode1][,i],
                                                  from = c("No internet service"), to = c("No")))
}

2. We will change “No phone service” to “No” for column “MultipleLines”

churn$MultipleLines <- as.factor(mapvalues(churn$MultipleLines, from=c("No phone service"), to=c("No")))

3. Since the minimum tenure is 1 month and the maximum tenure is 72 months, we can group customers into five tenure groups: "0-12 Month", "12-24 Month", "24-48 Month", "48-60 Month", "> 60 Month".

min(churn$tenure); max(churn$tenure)
[1] 1
[1] 72

group_tenure <- function(tenure){
  if (tenure >= 0 & tenure <= 12){
    return('0-12 Month')
  }else if(tenure > 12 & tenure <= 24){
    return('12-24 Month')
  }else if (tenure > 24 & tenure <= 48){
    return('24-48 Month')
  }else if (tenure > 48 & tenure <= 60){
    return('48-60 Month')
  }else if (tenure > 60){
    return('> 60 Month')
  }
}
churn$tenure_group <- sapply(churn$tenure, group_tenure)
churn$tenure_group <- as.factor(churn$tenure_group)

4. Change the values in column “SeniorCitizen” from 0 or 1 to “No” or “Yes”.

churn$SeniorCitizen <- as.factor(mapvalues(churn$SeniorCitizen, from=c("0","1"), to=c("No", "Yes")))

5. Remove the columns we do not need for the analysis.

churn$customerID <- NULL
churn$tenure <- NULL

Exploratory data analysis and feature selection

Correlation between numeric variables

numeric.var <- sapply(churn, is.numeric)
corr.matrix <- cor(churn[,numeric.var])
corrplot(corr.matrix, main="\n\nCorrelation Plot for Numerical Variables", method="number")

Gives this plot:

Monthly Charges and Total Charges are correlated, so one of them will be removed from the model. We remove Total Charges.

churn$TotalCharges <- NULL

Bar plots of categorical variables

p1 <- ggplot(churn, aes(x=gender)) + ggtitle("Gender") + xlab("Gender") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p2 <- ggplot(churn, aes(x=SeniorCitizen)) + ggtitle("Senior Citizen") + xlab("Senior Citizen") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p3 <- ggplot(churn, aes(x=Partner)) + ggtitle("Partner") + xlab("Partner") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p4 <- ggplot(churn, aes(x=Dependents)) + ggtitle("Dependents") + xlab("Dependents") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p1, p2, p3, p4, ncol=2)

Gives this plot:

p5 <- ggplot(churn, aes(x=PhoneService)) + ggtitle("Phone Service") + xlab("Phone Service") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p6 <- ggplot(churn, aes(x=MultipleLines)) + ggtitle("Multiple Lines") + xlab("Multiple Lines") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p7 <- ggplot(churn, aes(x=InternetService)) + ggtitle("Internet Service") + xlab("Internet Service") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p8 <- ggplot(churn, aes(x=OnlineSecurity)) + ggtitle("Online Security") + xlab("Online Security") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p5, p6, p7, p8, ncol=2)

Gives this plot:

p9 <- ggplot(churn, aes(x=OnlineBackup)) + ggtitle("Online Backup") + xlab("Online Backup") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p10 <- ggplot(churn, aes(x=DeviceProtection)) + ggtitle("Device Protection") + xlab("Device Protection") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p11 <- ggplot(churn, aes(x=TechSupport)) + ggtitle("Tech Support") + xlab("Tech Support") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p12 <- ggplot(churn, aes(x=StreamingTV)) + ggtitle("Streaming TV") + xlab("Streaming TV") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p9, p10, p11, p12, ncol=2)

Gives this plot:

p13 <- ggplot(churn, aes(x=StreamingMovies)) + ggtitle("Streaming Movies") + xlab("Streaming Movies") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p14 <- ggplot(churn, aes(x=Contract)) + ggtitle("Contract") + xlab("Contract") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p15 <- ggplot(churn, aes(x=PaperlessBilling)) + ggtitle("Paperless Billing") + xlab("Paperless Billing") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p16 <- ggplot(churn, aes(x=PaymentMethod)) + ggtitle("Payment Method") + xlab("Payment Method") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
p17 <- ggplot(churn, aes(x=tenure_group)) + ggtitle("Tenure Group") + xlab("Tenure Group") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
grid.arrange(p13, p14, p15, p16, p17, ncol=2)

Gives this plot:

All of the categorical variables seem to have a reasonably broad distribution; therefore, all of them will be kept for further analysis.

Logistic Regression

First, we split the data into training and testing sets

set.seed(2017)   # set the seed before partitioning so the split is reproducible
intrain <- createDataPartition(churn$Churn, p = 0.7, list = FALSE)
training <- churn[intrain,]
testing <- churn[-intrain,]

Confirm the splitting is correct

dim(training); dim(testing)
[1] 4924   19
[1] 2108   19

Fitting the Logistic Regression Model

LogModel <- glm(Churn ~ ., family=binomial(link="logit"), data=training)
print(summary(LogModel))

Call:
glm(formula = Churn ~ ., family = binomial(link = "logit"), data = training)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0370  -0.6793  -0.2861   0.6590   3.1608

Coefficients:
                                      Estimate Std. Error z value Pr(>|z|)
(Intercept)                          -2.030149   1.008223  -2.014 0.044053 *
genderMale                           -0.039006   0.077686  -0.502 0.615603
SeniorCitizenYes                      0.194408   0.101151   1.922 0.054611 .
PartnerYes                           -0.062031   0.092911  -0.668 0.504363
DependentsYes                        -0.017974   0.107659  -0.167 0.867405
PhoneServiceYes                      -0.387097   0.788745  -0.491 0.623585
MultipleLinesYes                      0.345052   0.214799   1.606 0.108187
InternetServiceFiber optic            1.022836   0.968062   1.057 0.290703
InternetServiceNo                    -0.829521   0.978909  -0.847 0.396776
OnlineSecurityYes                    -0.393647   0.215993  -1.823 0.068379 .
OnlineBackupYes                      -0.113220   0.213825  -0.529 0.596460
DeviceProtectionYes                  -0.025797   0.213317  -0.121 0.903744
TechSupportYes                       -0.306423   0.220920  -1.387 0.165433
StreamingTVYes                        0.399209   0.395912   1.008 0.313297
StreamingMoviesYes                    0.338852   0.395890   0.856 0.392040
ContractOne year                     -0.805584   0.127125  -6.337 2.34e-10 ***
ContractTwo year                     -1.662793   0.216346  -7.686 1.52e-14 ***
PaperlessBillingYes                   0.338724   0.089407   3.789 0.000152 ***
PaymentMethodCredit card (automatic) -0.018574   0.135318  -0.137 0.890826
PaymentMethodElectronic check         0.373214   0.113029   3.302 0.000960 ***
PaymentMethodMailed check             0.001400   0.136446   0.010 0.991815
MonthlyCharges                       -0.005001   0.038558  -0.130 0.896797
tenure_group0-12 Month                1.858899   0.205956   9.026  < 2e-16 ***
tenure_group12-24 Month               0.968497   0.201452   4.808 1.53e-06 ***
tenure_group24-48 Month               0.574822   0.185500   3.099 0.001943 **
tenure_group48-60 Month               0.311790   0.200096   1.558 0.119185
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 5702.8  on 4923  degrees of freedom
Residual deviance: 4094.4  on 4898  degrees of freedom
AIC: 4146.4

Number of Fisher Scoring iterations: 6

Feature Analysis
The top three most-relevant features include Contract, tenure_group and PaperlessBilling.

anova(LogModel, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Churn

Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                              4923     5702.8
gender            1     0.39      4922     5702.4 0.5318602
SeniorCitizen     1    95.08      4921     5607.3 < 2.2e-16 ***
Partner           1   107.29      4920     5500.0 < 2.2e-16 ***
Dependents        1    27.26      4919     5472.7 1.775e-07 ***
PhoneService      1     1.27      4918     5471.5 0.2597501
MultipleLines     1     9.63      4917     5461.8 0.0019177 **
InternetService   2   452.01      4915     5009.8 < 2.2e-16 ***
OnlineSecurity    1   183.83      4914     4826.0 < 2.2e-16 ***
OnlineBackup      1    69.94      4913     4756.1 < 2.2e-16 ***
DeviceProtection  1    47.58      4912     4708.5 5.287e-12 ***
TechSupport       1    82.78      4911     4625.7 < 2.2e-16 ***
StreamingTV       1     4.90      4910     4620.8 0.0269174 *
StreamingMovies   1     0.36      4909     4620.4 0.5461056
Contract          2   309.25      4907     4311.2 < 2.2e-16 ***
PaperlessBilling  1    14.21      4906     4297.0 0.0001638 ***
PaymentMethod     3    38.85      4903     4258.1 1.867e-08 ***
MonthlyCharges    1     0.10      4902     4258.0 0.7491553
tenure_group      4   163.67      4898     4094.4 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Analyzing the deviance table we can see the drop in deviance when adding each variable one at a time. Adding InternetService, Contract and tenure_group significantly reduces the residual deviance. The other variables such as PaymentMethod and Dependents seem to improve the model less even though they all have low p-values.

Assessing the predictive ability of the Logistic Regression model

testing$Churn <- as.character(testing$Churn)
testing$Churn[testing$Churn=="No"] <- "0"
testing$Churn[testing$Churn=="Yes"] <- "1"
fitted.results <- predict(LogModel, newdata=testing, type='response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != testing$Churn)
print(paste('Logistic Regression Accuracy', 1-misClasificError))
[1] "Logistic Regression Accuracy 0.796489563567362"

Logistic Regression Confusion Matrix

print("Confusion Matrix for Logistic Regression"); table(testing$Churn, fitted.results > 0.5)
[1] "Confusion Matrix for Logistic Regression"

    FALSE TRUE
  0  1392  156
  1   273  287

Odds Ratio
One of the interesting performance measurements in logistic regression is the odds ratio. Basically, the odds ratio for a predictor tells us by how much the odds of the event (here, churning) are multiplied for a one-unit increase in that predictor: values above 1 indicate higher odds of churning, values below 1 indicate lower odds.

exp(cbind(OR=coef(LogModel), confint(LogModel)))
Waiting for profiling to be done...
                                            OR      2.5 %     97.5 %
(Intercept)                          0.1313160 0.01815817  0.9461855
genderMale                           0.9617454 0.82587632  1.1199399
SeniorCitizenYes                     1.2145919 0.99591418  1.4807053
PartnerYes                           0.9398537 0.78338071  1.1276774
DependentsYes                        0.9821863 0.79488224  1.2124072
PhoneServiceYes                      0.6790251 0.14466019  3.1878587
MultipleLinesYes                     1.4120635 0.92707245  2.1522692
InternetServiceFiber optic           2.7810695 0.41762316 18.5910286
InternetServiceNo                    0.4362582 0.06397364  2.9715699
OnlineSecurityYes                    0.6745919 0.44144273  1.0296515
OnlineBackupYes                      0.8929545 0.58709919  1.3577947
DeviceProtectionYes                  0.9745328 0.64144877  1.4805460
TechSupportYes                       0.7360754 0.47707096  1.1344691
StreamingTVYes                       1.4906458 0.68637788  3.2416264
StreamingMoviesYes                   1.4033353 0.64624171  3.0518161
ContractOne year                     0.4468271 0.34725066  0.5717469
ContractTwo year                     0.1896086 0.12230199  0.2861341
PaperlessBillingYes                  1.4031556 1.17798691  1.6725920
PaymentMethodCredit card (automatic) 0.9815977 0.75273387  1.2797506
PaymentMethodElectronic check        1.4523952 1.16480721  1.8145076
PaymentMethodMailed check            1.0014007 0.76673087  1.3092444
MonthlyCharges                       0.9950112 0.92252949  1.0731016
tenure_group0-12 Month               6.4166692 4.30371945  9.6544837
tenure_group12-24 Month              2.6339823 1.78095906  3.9256133
tenure_group24-48 Month              1.7768147 1.23988035  2.5676783
tenure_group48-60 Month              1.3658675 0.92383315  2.0261505

Decision Tree

Decision Tree visualization
For illustration purposes, we are going to use only three variables to plot the decision tree: "Contract", "tenure_group" and "PaperlessBilling".

tree <- ctree(Churn ~ Contract + tenure_group + PaperlessBilling, training)
plot(tree, type='simple')

Gives this plot:

1. Of the three variables we use, Contract is the most important one for predicting whether a customer churns.
2. If a customer has a one-year or two-year contract, he or she is less likely to churn, regardless of whether PaperlessBilling is used.
3. On the other hand, a customer on a month-to-month contract, in the 0-12 month tenure group, and using PaperlessBilling is more likely to churn.

Decision Tree Confusion Matrix
We use all the variables to produce the confusion matrix and make predictions.

testing$Churn <- as.factor(ifelse(testing$Churn == "1", "Yes", "No"))   # restore the "No"/"Yes" labels recoded during the logistic regression step
pred_tree <- predict(tree, testing)
print("Confusion Matrix for Decision Tree"); table(Predicted = pred_tree, Actual = testing$Churn)
[1] "Confusion Matrix for Decision Tree"
         Actual
Predicted   No  Yes
      No  1395  346
      Yes  153  214

Decision Tree Accuracy

p1 <- predict(tree, training)
tab1 <- table(Predicted = p1, Actual = training$Churn)
tab2 <- table(Predicted = pred_tree, Actual = testing$Churn)
print(paste('Decision Tree Accuracy', sum(diag(tab2))/sum(tab2)))
[1] "Decision Tree Accuracy 0.763282732447818"

The Decision Tree accuracy (about 0.76) is actually somewhat lower than the Logistic Regression accuracy (about 0.80). Let's see if we can do better using Random Forest.

Random Forest

Random Forest Initial Model

rfModel <- randomForest(Churn ~ ., data = training)
print(rfModel)

Call:
 randomForest(formula = Churn ~ ., data = training)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 20.65%
Confusion matrix:
      No Yes class.error
No  3245 370   0.1023513
Yes  647 662   0.4942704

The error rate is relatively low when predicting “No”, and the error rate is much higher when predicting “Yes”.
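As a quick sanity check, the class.error column above can be recomputed from the OOB confusion matrix stored in the fitted model. This is a minimal sketch using the rfModel object from above:

# Sketch: per-class OOB error rates (rows of the confusion matrix are actual classes)
oob_cm <- rfModel$confusion[, c("No", "Yes")]
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)
round(class_error, 4)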

Random Forest Prediction and Confusion Matrix

pred_rf <- predict(rfModel, testing)
caret::confusionMatrix(pred_rf, testing$Churn)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1381  281
       Yes  167  279

               Accuracy : 0.7875
                 95% CI : (0.7694, 0.8048)
    No Information Rate : 0.7343
    P-Value [Acc > NIR] : 9.284e-09

                  Kappa : 0.4175
 Mcnemar's Test P-Value : 9.359e-08

            Sensitivity : 0.8921
            Specificity : 0.4982
         Pos Pred Value : 0.8309
         Neg Pred Value : 0.6256
             Prevalence : 0.7343
         Detection Rate : 0.6551
   Detection Prevalence : 0.7884
      Balanced Accuracy : 0.6952

       'Positive' Class : No

Random Forest Error Rate

plot(rfModel)

Gives this plot:

We use this plot to help us determine the number of trees. As the number of trees increases, the OOB error rate decreases and then becomes almost constant; it stops decreasing noticeably after about 100 to 200 trees.
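If you prefer numbers over eyeballing the plot, the per-tree OOB error is stored in the fitted object's err.rate matrix. A minimal sketch, again using the rfModel object from above:

# Sketch: inspect how the OOB error evolves with the number of trees
oob <- rfModel$err.rate[, "OOB"]
tail(oob, 5)       # OOB error for the last few trees
which.min(oob)     # number of trees at which the OOB error was lowest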

Tune Random Forest Model

t <- tuneRF(training[, -18], training[, 18], stepFactor = 0.5, plot = TRUE, ntreeTry = 200, trace = TRUE, improve = 0.05)

Gives this plot:

We use this plot to help us choose the value of mtry. The OOB error rate is lowest when mtry is 2, so we choose mtry = 2.

Fit the Random Forest Model After Tuning

rfModel_new <- randomForest(Churn ~ ., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE)
print(rfModel_new)

Call:
 randomForest(formula = Churn ~ ., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE)
               Type of random forest: classification
                     Number of trees: 200
No. of variables tried at each split: 2

        OOB estimate of  error rate: 19.7%
Confusion matrix:
      No Yes class.error
No  3301 314   0.0868603
Yes  656 653   0.5011459

OOB error rate decreased to 19.7% from 20.65% earlier.

Random Forest Predictions and Confusion Matrix After Tuning

pred_rf_new <- predict(rfModel_new, testing)
caret::confusionMatrix(pred_rf_new, testing$Churn)
Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1404  305
       Yes  144  255

               Accuracy : 0.787
                 95% CI : (0.7689, 0.8043)
    No Information Rate : 0.7343
    P-Value [Acc > NIR] : 1.252e-08

                  Kappa : 0.3989
 Mcnemar's Test P-Value : 4.324e-14

            Sensitivity : 0.9070
            Specificity : 0.4554
         Pos Pred Value : 0.8215
         Neg Pred Value : 0.6391
             Prevalence : 0.7343
         Detection Rate : 0.6660
   Detection Prevalence : 0.8107
      Balanced Accuracy : 0.6812

       'Positive' Class : No

The accuracy did not increase, but the sensitivity improved compared with the initial Random Forest model.

Random Forest Feature Importance

varImpPlot(rfModel_new, sort=T, n.var = 10, main = 'Top 10 Feature Importance')

Gives this plot:

Summary

From the above example, we can see that Logistic Regression and Random Forest performed better than Decision Tree for customer churn analysis for this particular dataset.

Throughout the analysis, I have learned several important things:
1. Features such as tenure_group, Contract, PaperlessBilling, MonthlyCharges and InternetService appear to play a role in customer churn.
2. There does not seem to be a relationship between gender and churn.
3. Customers on a month-to-month contract, with PaperlessBilling, and within 12 months of tenure are more likely to churn; on the other hand, customers with a one- or two-year contract, more than 12 months of tenure, and no PaperlessBilling are less likely to churn.

Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above.

    Related Post

    1. Find Your Best Customers with Customer Segmentation in Python
    2. Principal Component Analysis – Unsupervised Learning
    3. ARIMA models and Intervention Analysis
    4. A Gentle Introduction on Market Basket Analysis — Association Rules
    5. Building A Book Recommender System – The Basics, kNN and Matrix Factorization

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    9 Jobs for R users from around the world (2017-11-20)

    Mon, 11/20/2017 - 13:43
    To post your R job on the next post

    Just visit  this link and post a new R job  to the R community.

    You can post a job for  free  (and there are also “featured job” options available for extra exposure).

    Current R jobs

    Job seekers:  please follow the links below to learn more and apply for your R job of interest:

    Featured Jobs

     

    More New R Jobs

     

     

    1. Full-Time
      R&D Database Developer @ Toronto, Canada Crescendo Technology Ltd – Posted by Crescendo
      Toronto Ontario, Canada
      17 Nov 2017
    2. Full-Time
      Customer Success Representative RStudio, Inc. – Posted by jclemens1
      Anywhere
      17 Nov 2017
    3. Full-Time
      Data Science Engineer Bonify – Posted by arianna@meritocracy
      Berlin Berlin, Germany
      17 Nov 2017
    4. Freelance
      Dashboard/Visualization developer in R/Shiny International Finance Corporation – Posted by Oleksiy
      Anywhere
      14 Nov 2017
    5. Part-Time
      Development of User-Defined Calculations and Graphical Outputs for WHO’s Influenza Data World Health Organization – Posted by aspenh
      Anywhere
      6 Nov 2017
    6. Full-Time
      Data Scientist for H Labs @ Chicago, Illinois, United States Heidrick & Struggles – Posted by Heidrick1
      Chicago Illinois, United States
      2 Nov 2017
    7. Full-Time
      Business Data Analytics Faculty Maryville University – Posted by tlorden
      St. Louis Missouri, United States
      2 Nov 2017
    8. Full-Time
      PROGRAMMER/SOFTWARE DEVELOPMENT ENGINEER/COMPUTATIONAL AND MACHINE LEARNING SPECIALIST fchiava
      Cambridge Massachusetts, United States
      10 Oct 2017
    9. Full-Time
      Data Scientist @ London Hiscox – Posted by alan.south
      London England, United Kingdom
      13 Sep 2017

    In  R-users.com  you can see  all  the R jobs that are currently available.

    R-users Resumes

    R-users also has a  resume section  which features CVs from over 300 R users. You can  submit your resume  (as a “job seeker”) or  browse the resumes  for free.

    (you may also look at  previous R jobs posts ).

    var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

    Housing Prices in Ames, Iowa: Kaggle’s Advanced Regression Competition

    Mon, 11/20/2017 - 09:05

    (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

    Introduction

    Residential real estate prices are fascinating… and frustrating. The homebuyer, the home-seller, the real estate agent, the economist, and the banker are all interested in housing prices, but they’re sufficiently subtle that no one person or industry has a complete understanding.

    For our third overall project and first group project we were assigned Kaggle’s Advanced Regression Techniques Competition. The goal, for the project and the original competition, was to predict housing prices in Ames, Iowa. While this particular competition is no longer active, the premise proved to be a veritable playground for testing our knowledge of data cleaning, exploratory data analysis, statistics, and, most importantly, machine learning. What follows is the combined work of Quentin Picard, Paul Ton, Kathryn Bryant, and Hans Lau.

    Data

    As stated on the Kaggle competition description page, the data for this project was compiled by Dean De Cock for educational purposes, and it includes 79 predictor variables (house attributes) and one target variable (price). As a result of the educational nature of the competition, the data was pre-split into a training set and a test set; the two datasets were given in the forms of csv files, each around 450 KB in size. Each of the predictor variables could fall under one of the following:

    • lot/land variables
    • location variables
    • age variables
    • basement variables
    • roof variables
    • garage variables
    • kitchen variables
    • room/bathroom variables
    • utilities variables
    • appearance variables
    • external features (pools, porches, etc.) variables

    The specifics of the 79 predictor variables are omitted for brevity but can be found in the data_description.txt file on the competition website. The target variable was Sale Price, given in US dollars.

    Project Overview (Process)

    The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit as this was the metric used by Kaggle to evaluate submitted models.
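    To make the metric concrete, here is a minimal R sketch of the evaluation function; the function name rmse_log and the example names in the comment are ours, not project code:

    # Sketch: RMSE of log Sale Price, the metric Kaggle used for this competition
    rmse_log <- function(actual, predicted) {
      sqrt(mean((log(actual) - log(predicted))^2))
    }
    # Example usage: rmse_log(private_test$SalePrice, predicted_prices)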

    Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R but  we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing with curriculum than by anything else.

    Throughout the project our group wanted to mimic a real-world machine learning project as much as possible, which meant that although we were given both a training set and a test set, we opted to treat the given test set as if it were “future” data. As a result, we further split the Kaggle training data into a private training set and a private testing set, with an 80/20 split, respectively. This allowed us to evaluate models in two ways before predicting on the Kaggle test data: with RMSE  of predictions made on the private test set and with cross validation RMSE of the entire training set.
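    As a rough sketch (not our exact code), an 80/20 private split can be done in a few lines of base R; the data frame name kaggle_train and the seed are assumptions:

    # Sketch: split the Kaggle training data into private train/test sets (80/20)
    set.seed(42)
    idx <- sample(seq_len(nrow(kaggle_train)), size = floor(0.8 * nrow(kaggle_train)))
    private_train <- kaggle_train[idx, ]
    private_test  <- kaggle_train[-idx, ]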

    Given the above choices, the process for training and evaluating each individual model was broken down as follows:

    1. Grid search to tune hyperparameters (if applicable) on the private train set
    2. Fit model to private train set with tuned parameters
    3. Predict on private test set; if RMSE okay, proceed.
    4. Fit new model to Kaggle train set with private train hyperparameters
    5. Cross-validate on Kaggle train set; if CV RMSE okay, proceed.
    6. Predict on Kaggle test set
    7. Submit predictions to Kaggle

    Computing RMSE of predictions made on the private test set served as a group sanity check, in that if anything was amiss with a model we found out at this point and could correct for it before proceeding. The decision to fit a new model to the Kaggle train set in step 4 using the private train hyperparameters found in step 1 was one for which we never felt completely at ease; we were trying to minimize overfitting to the Kaggle training set, but also recognized that we weren’t using all the possible information we could by not re-tuning using the full training set. One benefit to this choice was that we never saw any major discrepancies between our own cross validation scores in step 5 and the Kaggle scores from step 7. With more time, we would have further investigated whether keeping the private train parameters or re-tuning for the final models was more beneficial.

    The last noteworthy choice we made in our overall process was to feed differently-cleaned datasets to our linear-based models and our tree-based models. In linear-based models (linear, ridge, LASSO, elastic net, spline), prediction values are continuous and sensitive to outliers, so we opted to sacrifice information about “rare” houses in favor of gaining better predictions on “common” houses. We also wanted to minimize the likelihood of rare levels only showing up in our test data and/or causing columns of all 0’s in dummified data. We executed this tradeoff by releveling any nominal categorical variables that contained extremely rare classes, where “rare” was defined to be “occurring in less than 1% of the observations.”

    For example, the Heating variable had six levels but four of them combined accounted for only about 1% of the observations; so, we combined these four levels into a new ‘other’ level so that releveled Heating had only three levels, all accounting for 1% or more of the observations. Depending on the variable, we either created an ‘other’ level as in the example or we grouped rare levels into existing levels according to level similarity in the variable documentation.
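    A minimal sketch of this releveling step for the Heating example, assuming a data frame named kaggle_train with the Heating column (our actual implementation may have differed in the details):

    # Sketch: collapse levels of Heating occurring in less than 1% of observations into 'other'
    level_freq  <- prop.table(table(kaggle_train$Heating))
    rare_levels <- names(level_freq)[level_freq < 0.01]
    heating     <- as.character(kaggle_train$Heating)
    heating[heating %in% rare_levels] <- "other"
    kaggle_train$Heating <- factor(heating)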

    We opted not to relevel data fed into our tree-based models because tree predictions are more robust to outliers and rare classes; trees can separate rare observations from others through splits, which prevents common observation predictions from being distorted by rare observations in fitting.

    Data Cleaning and EDA

    We were fortunate with the Kaggle data in that it came to us relatively clean. The only basic cleaning tasks were to correct typos in levels of categorical variables, specify numeric or categorical variables in R, and rename variables beginning with numbers to satisfy R’s variable name requirements. There were a number of  “quality” and “condition” variables that had levels of Poor, Fair, Typical/Average, Good, and Excellent which we label encoded as integers 1-5 to preserve their inherent ordinality. For nominal categorical variables, we used one-hot encoding from the ‘vtreat’ package in R. (Most machine learning algorithms in Python require all variables to have numeric values, and although R’s machine learning algorithms can handle nominal categorical variables in non-numeric/string form, R’s computations effectively use one-hot encoding under the hood in these cases.)
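    As an illustration of these two encoding steps, here is a hedged sketch; KitchenQual stands in for one of the quality variables (the abbreviated level codes are assumptions about the raw data), and the vtreat calls follow its designTreatmentsZ()/prepare() workflow:

    # Sketch: ordinal label encoding of a quality variable (Poor ... Excellent -> 1-5)
    qual_levels <- c("Po", "Fa", "TA", "Gd", "Ex")
    kaggle_train$KitchenQual <- as.integer(factor(kaggle_train$KitchenQual, levels = qual_levels))

    # Sketch: one-hot encoding of nominal categorical variables with vtreat
    library(vtreat)
    plan         <- designTreatmentsZ(kaggle_train, varlist = c("Neighborhood", "MSZoning"))
    train_onehot <- prepare(plan, kaggle_train)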

    After taking care of these basic cleaning issues, we needed to address missingness in our data. Below is a plot that shows the variables containing missing values and the degree to which missingness occurs in each (shown in yellow):

    For all variables except Garage Year Built and Lot Frontage, we performed basic mode imputation. Mode imputation was chosen for simplicity and because it could be done with both categorical and numerical data. Of course, mode imputation  has the negative side effect of artificially decreasing variance within the affected variable and therefore would not be appropriate for variables with higher degrees of missingness. It was for this reason, in fact, that we approached missingness differently for Garage Year Built and Lot Frontage.
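    A small sketch of this mode imputation; the helper name impute_mode is ours, and MasVnrType is used purely as an example column:

    # Sketch: impute missing values in a column with its mode (most frequent value)
    impute_mode <- function(x) {
      mode_val <- names(sort(table(x), decreasing = TRUE))[1]
      if (is.numeric(x)) mode_val <- as.numeric(mode_val)
      x[is.na(x)] <- mode_val
      x
    }
    kaggle_train$MasVnrType <- impute_mode(kaggle_train$MasVnrType)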

    For missing values in Garage Year Built, we imputed the Year Built for the house. We justified this because most garages are built at the same time as the house, and houses without garages get no penalty or benefit by having the Garage Year Built equal to the Year Built for the house.

    Lot Frontage, the variable with the greatest degree of missingness, required some research and exploration to arrive at an imputation strategy. Lot Frontage was defined to be the linear feet of street connected to the property. Given that most properties were either regularly or only slightly irregularly shaped according to Lot Shape, we deduced that most properties would have Lot Frontage values that were correlated with Lot Area values. However, since length is measured in units and area is measured in square units, we found it most appropriate to relate log(Lot Frontage) with log(Lot Area), so as to get a linear relationship between the two. See plot below.

     

    Thus, we imputed missing values for Lot Frontage by fitting a linear model of log(Lot Frontage) regressed onto log(Lot Area), predicting on the missing values for Lot Frontage using Lot Area, and then exponentiating (inverse log-ing) the result.
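    In code, that imputation looks roughly like the following sketch (column names LotFrontage and LotArea and the data frame name are assumptions):

    # Sketch: impute LotFrontage from LotArea on the log-log scale
    lf_fit  <- lm(log(LotFrontage) ~ log(LotArea), data = kaggle_train)   # rows with missing LotFrontage are dropped by default
    missing <- is.na(kaggle_train$LotFrontage)
    kaggle_train$LotFrontage[missing] <- exp(predict(lf_fit, newdata = kaggle_train[missing, ]))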

    The next step in EDA after finding and addressing missingness was to look at outliers. Looking at boxplots of both Sale Price and log(Sale Price) we saw that there were quite a few outliers, where outliers were mathematically defined to be observations lying more than 1.5*IQR (interquartile range) above the third quartile or below the first quartile.
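    For reference, the 1.5*IQR rule can be written out in a few lines of R; this sketch flags outliers of log Sale Price and is illustrative rather than our exact code:

    # Sketch: flag observations outside the 1.5 * IQR whiskers of log(SalePrice)
    y   <- log(kaggle_train$SalePrice)
    q   <- quantile(y, c(0.25, 0.75))
    iqr <- q[2] - q[1]
    outliers <- which(y < q[1] - 1.5 * iqr | y > q[2] + 1.5 * iqr)
    length(outliers)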

    Although removing the outliers may have improved our predictions for average-priced homes, we were hesitant to do so due to the relatively small size of our sample (~1600 observations in Kaggle training set). We felt that removing any outliers without further justification than them simply being outliers would likely jeopardize our predictions for houses in the extremes.

    Since outliers would have the most impact on the fit of linear-based models, we further investigated outliers by training a basic multiple linear regression model on the Kaggle training set with all observations included; we then looked at the resulting influence and studentized residuals plots:

    From these, we saw that there were only two observations that could justifiably be removed: observation 1299 and observation 251. These were both beyond or on the lines representing Cook’s Distance, meaning that as individual observations they had a significant impact on the regression formula; as such, we removed these two observations from consideration.
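    The diagnostics we used come straight from base R's lm plotting methods; a minimal sketch, where train_dummified is a placeholder name for our cleaned, dummified training data:

    # Sketch: influence diagnostics for the basic multiple linear regression
    mlr_data <- train_dummified
    mlr_data$LogSalePrice <- log(mlr_data$SalePrice)
    mlr_data$SalePrice <- NULL
    lm_fit <- lm(LogSalePrice ~ ., data = mlr_data)
    plot(lm_fit, which = 4)    # Cook's distance per observation
    plot(lm_fit, which = 5)    # standardized residuals vs leverage, with Cook's distance contours
    head(sort(cooks.distance(lm_fit), decreasing = TRUE))   # most influential observations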

    The last bit of preprocessing we did was dealing with multicollinearity. This, as for outliers, was cleaning done mostly for the benefit of linear-based models; in fact, it was done only for vanilla multiple linear regression since regularization in ridge, LASSO, and elastic net models deals with collinearity by construction. To eliminate collinearity issues, we used the findLinearCombos() function from the ‘caret’ package in R on our dummified data. This function identified linear combinations between predictor variables and allowed us to easily drop linearly dependent variables.
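    A sketch of that collinearity cleanup, with x_dummified standing in for our dummified predictor matrix:

    # Sketch: drop linearly dependent columns identified by caret
    library(caret)
    combo_info <- findLinearCombos(x_dummified)
    combo_info$linearCombos                 # groups of linearly dependent columns
    if (length(combo_info$remove) > 0) {
      x_dummified <- x_dummified[, -combo_info$remove]
    }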

    Feature Engineering

    For feature engineering we took a two-pronged approach: we used anecdotal data/personal knowledge and we did research.

    • Garage interaction: Garage Quality * Number of Cars Garage Holds.
      • If a home has a really great or really poor garage, the impact of that quality component on price will be exacerbated by the size of the garage.
    • Total number of bathrooms: Full Bath + Half Bath + Basement Full Bath + Basement Half Bath.
      • In our experience, houses are often listed in terms of total number of bedrooms and total number of bathrooms. Our data had total number of bedrooms, but lacked total number of bathrooms.
    • Average room size: Above-Ground Living Area / Total Number of Rooms Above Ground.
      • “Open concept” homes have been gaining popularity and homes with large rooms have always been popular, and with the provided variables we believe average room size might address both of these trends.
    • Bathroom to room ratio: (Full Bath + Half Bath) / Number of Bedrooms Above Ground
      • The number of bathrooms desired in a house depends on the number of bedrooms. A home with one bedroom will not be viewed nearly as negatively for having only one bathroom as would a house with three bedrooms.
    • Comparative size of living area: Above-Ground Living Area / mean(Above-Ground Living Area)
      • This variable attempts to capture house size as it directly compares to other houses, but in a manner different from mere number of rooms.
    • Landscape-ability interaction: Lot Shape * Land Contour
      • Landscaped homes tend to sell for more (see this article) and the ability for a lot to be easily landscaped is, in part, determined by the shape of the lot (regular, slightly irregular, irregular, very irregular) and the land contour (level, banked, hillside, depressed). Certain combinations may be workable (regular shape, hillside) while other combinations (very irregular shape, hillside) may make landscaping difficult and/or expensive.

    Of the six features that we added, only “landscape-ability” resulted from research; we either already had the important industry variables in the data (total above-ground square footage or neighborhood, for example) or we did not have the information we needed to create the desired variables. Additional features we would have liked to have added include geolocation data for proximity to schools and shopping centers, quality of nearby schools, and specific rooms/house features remodeled (if applicable).
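    For concreteness, here is a sketch of how the six engineered features above might be built with dplyr; the column names follow the Ames data documentation but should be treated as assumptions about our exact cleaning output, and the quality/shape/contour variables are assumed to already be encoded as integers:

    # Sketch: engineered features described above
    library(dplyr)
    kaggle_train <- kaggle_train %>%
      mutate(
        GarageInteraction = GarageQual * GarageCars,
        TotalBathrooms    = FullBath + HalfBath + BsmtFullBath + BsmtHalfBath,
        AvgRoomSize       = GrLivArea / TotRmsAbvGrd,
        BathToBedRatio    = (FullBath + HalfBath) / BedroomAbvGr,
        RelativeSize      = GrLivArea / mean(GrLivArea),
        Landscapeability  = as.integer(LotShape) * as.integer(LandContour)
      )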

    One step that we glossed over a bit was extensive adding and removing of features. We noticed with just a few trials that removing existing features seemed to negatively impact the performance of our models and that the models improved when we added features. Given that the regularized linear models would give preference to more important features via larger coefficients and tree-based models would give preference to more important features by splitting preferences, we opted not to spend too much time manually adding and removing features. Essentially, we allowed our models to select the most predictive features themselves.

    Modeling

    For our individual models, we trained the following:

    • Multiple linear regression
    • Ridge regression
    • LASSO regression
    • Elastic net regression
    • Spline regression
    • Basic decision tree
    • Random forest
    • Gradient boosted tree
    • XGBoosted tree

    Overall our models consistently had RMSE values between .115 and .145. As stated earlier, our cross validation scores were usually very close to our scores on Kaggle. To our surprise, our linear-based models did very well compared to our tree-based models. Even the much-hyped XGBoost model was at best on par with our linear-based spline and elastic net models, and our random forest models tended to be much worse. An abbreviated table of our results is shown below:

                      Elastic Net   Spline    Random Forest   Gradient Boost   XGBoost   Ensemble
    Cross Validation  0.12157       0.11181   0.14171         0.12491          0.12282   0.11227
    Held-out Test     0.11762       0.11398   0.13834         0.11403          0.11485   n/a
    Kaggle            0.12107       0.11796   n/a             n/a              n/a       0.11710

    As we endeavored to find a final, “best” model via ensembling and stacking, we followed the advice of our instructor (and successful Kaggler) Zeyu by attempting to combine models with different strengths and weaknesses.

    We took the predictions from each of our best base models (Spline, Gradient Boost, XGBoost) as the features to use for the next level. Our best result came from a weighted average of the three, with weights determined by a grid search. The blend of 76% Spline, 14.5% Gradient Boost, and 9.5% XGBoost gave us a 0.11227 RMSE on our training set and 0.11710 on Kaggle.
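    The weight search itself can be a simple grid over blends that sum to one; a sketch, where pred_spline, pred_gbm, pred_xgb and y_true are placeholder vectors of log-price predictions and actual log prices:

    # Sketch: grid search over ensemble blend weights (weights sum to 1)
    rmse <- function(a, b) sqrt(mean((a - b)^2))
    best <- list(rmse = Inf)
    for (w1 in seq(0, 1, by = 0.005)) {
      for (w2 in seq(0, 1 - w1, by = 0.005)) {
        w3    <- 1 - w1 - w2
        blend <- w1 * pred_spline + w2 * pred_gbm + w3 * pred_xgb
        err   <- rmse(blend, y_true)
        if (err < best$rmse) best <- list(rmse = err, weights = c(w1, w2, w3))
      }
    }
    best$weights   # ideally something close to c(0.76, 0.145, 0.095)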

    We also experimented with using a second level meta-model to do the stacking, but with a linear meta-model, the coefficients were highly unstable because the predictions were so closely related to each other, and with a gradient boost meta-model, we were unable to beat our best base model alone.

    Conclusions

    As mentioned above, we were surprised by the strength of our various linear models in comparison to our tree models. We suspect this had a lot to do with the data itself, in that Sale Price (or rather, log(Sale Price)) likely has a relatively linear relationship with the predictor variables. This highlights one of the most important takeaways from this project: that linear models have a real place in machine learning.

    In situations like this, where performance between a simpler model and a more complex model are similar, the better interpretability and the ease of training of the simpler model may also prove to be deciding factors on model choice.  In most cases, stacking multiple base-layer models into a second-layer is impractical, despite performing slightly better.



    RcppEigen 0.3.3.3.1

    Mon, 11/20/2017 - 01:23

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A maintenance release 0.3.3.3.1 of RcppEigen is now on CRAN (and will get to Debian soon). It brings Eigen 3.3.* to R.

    The impetus was a request from CRAN to change the call to Rcpp::Rcpp.plugin.maker() to only use :: as the function has in fact been exported and accessible for a pretty long time. So now the usage pattern catches up. Otherwise, Haiku-OS is now supported and a minor Travis tweak was made.

    The complete NEWS file entry follows.

    Changes in RcppEigen version 0.3.3.3.1 (2017-11-19)
    • Compilation under Haiku-OS is now supported (Yu Gong in #45).

    • The Rcpp.plugin.maker helper function is called via :: as it is in fact exported (yet we had old code using :::).

    • A spurious argument was removed from an example call.

    • Travis CI now uses https to fetch the test runner script.

    Courtesy of CRANberries, there is also a diffstat report for the most recent release.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.



    RcppClassic 0.9.9

    Mon, 11/20/2017 - 01:22

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    A maintenance release RcppClassic 0.9.9 is now at CRAN. This package provides a maintained version of the otherwise deprecated first Rcpp API; no new projects should use it.

    Per a request from CRAN, we changed the call to Rcpp::Rcpp.plugin.maker() to only use :: as the function has in fact been exported and accessible for a pretty long time. So now the usage pattern catches up.

    Courtesy of CRANberries, there are changes relative to the previous release.

    Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


