R-bloggers: R news and tutorials contributed by 750 R bloggers

Online data science (with R) courses at Udemy for only $10 (from March 9th until the 14th)

Thu, 03/09/2017 - 17:28

In order to get the discount, simply choose a link below and, when paying, use the promo code: CHANGEIT

Udemy is offering readers of R-bloggers access to its global online learning marketplace for only $10 per course! This deal (a 50%–90% discount) applies to hundreds of their courses – including many on R programming, data science, and machine learning.

Click here to browse ALL (R and non-R) courses

Advanced R courses: 

Introductory R courses: 

Non R courses for data science: 


The Rise of Civilization, Visualized with R

Thu, 03/09/2017 - 17:09

This animation by geographer James Cheshire shows something at once simple and profound: the founding and growth of the cities of the world since the dawn of civilization.

Dr Cheshire created the animation using R and the rworldmap package, using data from this Nature dataset. The complete R code is provided in the blog post linked below, and you can easily tweak the code to change the projection of the map. Check out the complete post below to see two other versions of the animation: one focusing on the birth of civilization in the Old World, and another depicting the pre-colonial civilizations and the subsequent colonization of the Americas. Mapping 5,000 Years of City Growth

Some Win-Vector R packages

Thu, 03/09/2017 - 16:35

This post concludes our mini-series of Win-Vector open source R packages. We end with WVPlots, a collection of ready-made ggplot2 plots we find handy.

Please read on for list of some of the Win-Vector LLC open-source R packages that we are pleased to share.

For each package we have prepared a short introduction, so you can quickly check if a package suits your needs using the links below.

Forecasting Stock Returns using ARIMA model

Thu, 03/09/2017 - 16:30

By Milind Paradkar

“Prediction is very difficult, especially about the future”. Many of you must have come across this famous quote by Niels Bohr, the Danish physicist. Prediction is the theme of this blog post. In this post, we will cover the popular ARIMA forecasting model to predict returns on a stock and demonstrate a step-by-step process of ARIMA modeling using R programming.

What is a forecasting model in Time Series?

Forecasting involves predicting values for a variable using its historical data points, or predicting the change in one variable given the change in the value of another variable. Forecasting approaches are primarily categorized into qualitative forecasting and quantitative forecasting. Time series forecasting falls under the category of quantitative forecasting, wherein statistical principles and concepts are applied to the historical data of a variable to forecast its future values. Commonly used time series forecasting techniques include:

  • Autoregressive Models (AR)
  • Moving Average Models (MA)
  • Seasonal Regression Models
  • Distributed Lags Models
What is Autoregressive Integrated Moving Average (ARIMA)?

ARIMA stands for Autoregressive Integrated Moving Average; it is also known as the Box-Jenkins approach. Box and Jenkins claimed that non-stationary data can be made stationary by differencing the series, Yt. The general model for Yt is written as,

Yt = ϕ1Yt−1 + ϕ2Yt−2 + … + ϕpYt−p + ϵt + θ1ϵt−1 + θ2ϵt−2 + … + θqϵt−q

where Yt is the differenced time series value, ϕ and θ are unknown parameters, and the ϵt are independent, identically distributed error terms with zero mean. Here, Yt is expressed in terms of its past values and the current and past values of the error terms.

The ARIMA model combines three basic methods:

  • AutoRegression (AR) – In auto-regression the values of a given time series data are regressed on their own lagged values, which is indicated by the “p” value in the model.
  • Differencing (I-for Integrated) – This involves differencing the time series data to remove the trend and convert a non-stationary time series to a stationary one. This is indicated by the “d” value in the model. If d = 1, it looks at the difference between two time series entries, if d = 2 it looks at the differences of the differences obtained at d =1, and so forth.
  • Moving Average (MA) – The moving average nature of the model is represented by the “q” value which is the number of lagged values of the error term.

This model is called Autoregressive Integrated Moving Average or ARIMA(p,d,q) of Yt.  We will follow the steps enumerated below to build our model.

Step 1: Testing and Ensuring Stationarity

To model a time series with the Box-Jenkins approach, the series has to be stationary. A stationary time series is one without trend, having a constant mean and variance over time, which makes predicting future values easier.

Testing for stationarity – We test for stationarity using the Augmented Dickey-Fuller unit root test. The p-value resulting from the ADF test has to be less than 0.05 or 5% for a time series to be stationary. If the p-value is greater than 0.05 or 5%, you conclude that the time series has a unit root which means that it is a non-stationary process.

Differencing – To convert a non-stationary process to a stationary process, we apply the differencing method. Differencing a time series means finding the differences between consecutive values of a time series data. The differenced values form a new time series dataset which can be tested to uncover new correlations or other interesting statistical properties.

We can apply the differencing method consecutively more than once, giving rise to the “first differences”, “second order differences”, etc.

We apply the appropriate differencing order (d) to make a time series stationary before we can proceed to the next step.
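A quick base-R illustration of first and second differences (a sketch on a made-up vector, not data from the post):

```r
# Illustrative only: differencing a short made-up series with base R's diff()
x <- c(1, 4, 9, 16, 25)          # a series with a clear (quadratic) trend

d1 <- diff(x, differences = 1)   # first differences (d = 1)
d2 <- diff(x, differences = 2)   # differences of the differences (d = 2)

print(d1)  # 3 5 7 9  -- still trending
print(d2)  # 2 2 2    -- constant, i.e. trend removed
```

Here d = 1 leaves a linear trend, while d = 2 removes it entirely.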

Step 2: Identification of p and q

In this step, we identify the appropriate order of Autoregressive (AR) and Moving average (MA) processes by using the Autocorrelation function (ACF) and Partial Autocorrelation function (PACF).  Please refer to our blog, “Starting out with Time Series” for an explanation of ACF and PACF functions.

Identifying the p order of AR model

For AR models, the ACF will dampen exponentially and the PACF will be used to identify the order (p) of the AR model. If we have one significant spike at lag 1 on the PACF, then we have an AR model of the order 1, i.e. AR(1). If we have significant spikes at lag 1, 2, and 3 on the PACF, then we have an AR model of the order 3, i.e. AR(3).

Identifying the q order of MA model

For MA models, the PACF will dampen exponentially and the ACF plot will be used to identify the order of the MA process. If we have one significant spike at lag 1 on the ACF, then we have an MA model of the order 1, i.e. MA(1). If we have significant spikes at lag 1, 2, and 3 on the ACF, then we have an MA model of the order 3, i.e. MA(3).
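This ACF/PACF behaviour can be checked on simulated data; a sketch (not from the original post) using base R's arima.sim():

```r
# Sketch: simulate an AR(2) process and inspect its ACF/PACF
# to recover the order visually
set.seed(1)
sim_ar2 <- arima.sim(model = list(ar = c(0.6, -0.3)), n = 500)

acf(sim_ar2, main = "ACF: dampens gradually for an AR process")
pacf(sim_ar2, main = "PACF: should cut off after lag 2 for AR(2)")
```

For this simulated AR(2) series, the PACF plot should show significant spikes only at the first two lags.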

Step 3: Estimation and Forecasting

Once we have determined the parameters (p,d,q) we estimate the accuracy of the ARIMA model on a training data set and then use the fitted model to forecast the values of the test data set using a forecasting function. In the end, we cross check whether our forecasted values are in line with the actual values.
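The fit-then-forecast workflow can be sketched in base R with arima() and predict(); here on the built-in lh dataset, standing in for a returns series (illustrative only):

```r
# Minimal sketch: fit an ARIMA(1,0,0) model on the built-in 'lh'
# dataset and forecast the next three values
fit <- arima(lh, order = c(1, 0, 0))
fc  <- predict(fit, n.ahead = 3)

fc$pred  # point forecasts
fc$se    # their standard errors
```

The same pattern (fit on training data, forecast one or more steps ahead, compare against held-out values) is what the stock example below implements with the forecast package.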

Building ARIMA model using R programming

Now, let us follow the steps explained to build an ARIMA model in R. There are a number of packages available for time series analysis and forecasting. We load the relevant R package for time series analysis and pull the stock data from yahoo finance.

library(quantmod); library(tseries); library(timeSeries); library(forecast); library(xts)

# Pull data from Yahoo Finance
getSymbols('TECHM.NS', from = '2012-01-01', to = '2015-01-01')

# Select the relevant close price series
stock_prices = TECHM.NS[, 4]

In the next step, we compute the logarithmic returns of the stock as we want the ARIMA model to forecast the log returns and not the stock price. We also plot the log return series using the plot function.

# Compute the log returns for the stock
stock = diff(log(stock_prices), lag = 1)
stock = stock[!is.na(stock)]

# Plot log returns
plot(stock, type = 'l', main = 'log returns plot')

Next, we call the ADF test on the returns series data to check for stationarity. The p-value of 0.01 from the ADF test tells us that the series is stationary. If the series were to be non-stationary, we would have first differenced the returns series to make it stationary.

# Conduct ADF test on log returns series
print(adf.test(stock))

In the next step, we fix a breakpoint which will be used to split the returns dataset into two parts further down the code.

# Split the dataset in two parts - training and testing
breakpoint = floor(nrow(stock) * (2.9/3))

We truncate the original returns series till the breakpoint, and call the ACF and PACF functions on this truncated series.

# Apply the ACF and PACF functions
par(mfrow = c(1, 1))
acf.stock = acf(stock[c(1:breakpoint), ], main = 'ACF Plot', lag.max = 100)
pacf.stock = pacf(stock[c(1:breakpoint), ], main = 'PACF Plot', lag.max = 100)

We can observe these plots and arrive at the Autoregressive (AR) order and Moving Average (MA) order.

We know that for AR models, the ACF will dampen exponentially and the PACF plot will be used to identify the order (p) of the AR model. For MA models, the PACF will dampen exponentially and the ACF plot will be used to identify the order (q) of the MA model. From these plots let us select AR order = 2 and MA order = 2. Thus, our ARIMA parameters will be (2,0,2).

Our objective is to forecast the entire returns series from breakpoint onwards. We will make use of the For Loop statement in R and within this loop we will forecast returns for each data point from the test dataset.

In the code given below, we first initialize a series which will store the actual returns and another series to store the forecasted returns.  In the For Loop, we first form the training dataset and the test dataset based on the dynamic breakpoint.

We call the arima function on the training dataset with the order specified as (2, 0, 2). We use this fitted model to forecast the next data point with the forecast.Arima function. The confidence level is set at 99%; one can adjust the confidence level argument to tune the model. We will be using the forecasted point estimate from the model. The “h” argument in the forecast function indicates the number of values that we want to forecast – in this case, the next day's returns.

We can use the summary function to confirm the results of the ARIMA model are within acceptable limits. In the last part, we append every forecasted return and the actual return to the forecasted returns series and the actual returns series respectively.

# Initializing an xts object for actual log returns
Actual_series = xts(0, as.Date("2014-11-25", "%Y-%m-%d"))

# Initializing a dataframe for the forecasted return series
forecasted_series = data.frame(Forecasted = numeric())

for (b in breakpoint:(nrow(stock) - 1)) {
  stock_train = stock[1:b, ]
  stock_test = stock[(b + 1):nrow(stock), ]

  # Summary of the ARIMA model using the determined (p,d,q) parameters
  fit = arima(stock_train, order = c(2, 0, 2), include.mean = FALSE)
  summary(fit)

  # Plotting the ACF plot of the residuals
  acf(fit$residuals, main = "Residuals plot")

  # Forecasting the log returns
  arima.forecast = forecast.Arima(fit, h = 1, level = 99)
  summary(arima.forecast)

  # Plotting the forecast
  par(mfrow = c(1, 1))
  plot(arima.forecast, main = "ARIMA Forecast")

  # Creating a series of forecasted returns for the forecasted period
  forecasted_series = rbind(forecasted_series, arima.forecast$mean[1])
  colnames(forecasted_series) = c("Forecasted")

  # Creating a series of actual returns for the forecasted period
  Actual_return = stock[(b + 1), ]
  Actual_series = c(Actual_series, xts(Actual_return))
  rm(Actual_return)

  print(stock_prices[(b + 1), ])
  print(stock_prices[(b + 2), ])
}

Before we move to the last part of the code, let us check the results of the ARIMA model for a sample data point from the test dataset.

From the coefficients obtained, the return equation can be written as:

Yt = 0.6072*Y(t-1) – 0.8818*Y(t-2) – 0.5447*ε(t-1) + 0.8972*ε(t-2)

The standard error is given for the coefficients, and this needs to be within the acceptable limits. The Akaike information criterion (AIC) score is a good indicator of ARIMA model accuracy: the lower the AIC score, the better the model. We can also view the ACF plot of the residuals; a good ARIMA model will have its autocorrelations below the threshold limit. The forecasted point return is -0.001326978, which is given in the last row of the output.

Let us check the accuracy of the ARIMA model by comparing the forecasted returns versus the actual returns. The last part of the code computes this accuracy information.

# Adjust the length of the Actual return series
Actual_series = Actual_series[-1]

# Create a time series object of the forecasted series
forecasted_series = xts(forecasted_series, index(Actual_series))

# Create a plot of the two return series - Actual versus Forecasted
plot(Actual_series, type = 'l', main = 'Actual Returns Vs Forecasted Returns')
lines(forecasted_series, lwd = 1.5, col = 'red')
legend('bottomright', c("Actual", "Forecasted"), lty = c(1, 1), lwd = c(1.5, 1.5), col = c('black', 'red'))

# Create a table for the accuracy of the forecast
comparison = merge(Actual_series, forecasted_series)
comparison$Accuracy = sign(comparison$Actual_series) == sign(comparison$Forecasted)
print(comparison)

# Compute the accuracy percentage metric
Accuracy_percentage = sum(comparison$Accuracy == 1) * 100 / length(comparison$Accuracy)
print(Accuracy_percentage)

If the sign of the forecasted return equals the sign of the actual return, we assign it a positive accuracy score. The accuracy percentage of the model comes to around 55%, which looks like a decent number. One can try running the model for other possible combinations of (p,d,q), or instead use the auto.arima function, which selects the optimal parameters automatically.
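The idea of trying other (p,d,q) combinations can be sketched in base R as a small AIC grid search (illustrative, on the built-in lh series rather than the stock returns):

```r
# Illustrative sketch: pick (p, q) by minimum AIC on the built-in 'lh'
# series; tryCatch() skips any orders that fail to converge
orders <- expand.grid(p = 0:2, q = 0:2)

aics <- apply(orders, 1, function(o) {
  fit <- tryCatch(arima(lh, order = c(o["p"], 0, o["q"])),
                  error = function(e) NULL)
  if (is.null(fit)) NA else AIC(fit)
})

best <- orders[which.min(aics), ]
print(best)   # the (p, q) pair with the lowest AIC
```

This is essentially what auto.arima automates, with additional refinements such as unit-root tests for choosing d.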


To conclude, in this post we covered the ARIMA model and applied it to forecast stock returns using the R programming language. We also cross-checked our forecasted results against the actual returns. In our upcoming posts, we will cover other time series forecasting techniques and try them in Python/R.

Next Step

If you want to learn various aspects of Algorithmic trading then check out the Executive Programme in Algorithmic Trading (EPAT™). The course covers training modules like Statistics & Econometrics, Financial Computing & Technology, and Algorithmic & Quantitative Trading. EPAT™ equips you with the required skill sets to build a promising career in algorithmic trading. Enroll now!

The post Forecasting Stock Returns using ARIMA model appeared first on .

[R] Kenntnis-Tage 2017: Save the date, save your ticket, save your talk

Thu, 03/09/2017 - 14:01

(This article was first published on eoda english R news, and kindly contributed to R-bloggers)

When data science success stories meet practical R tutorials, the result is a one-of-a-kind event for data analysts with a business background. On November 8 and 9, the [R] Kenntnis-Tage 2017 will be the platform for exchanging everyday challenges and solutions when working with the programming language R.

[R] Kenntnis-Tage-2017 in Kassel – Germany

Diverse program and inspiring networking

A rich program around data science and R awaits the participants on those two days:

  • Practical and inspiring talks, case studies and keynote speeches which offer insights into the various fields of application
  • Tutorials for deepening the participants’ knowledge of R
  • Networking and exchange with other data scientists, R users, developers and IT decision makers

Participants can gather valuable inspiration for writing their own success story with data science and R in times of data-driven business processes.

Call for papers: Submit your topic and be part of the [R] Kenntnis-Tage 2017

The exchange across all industries and topics in particular provides participants with valuable best practices for their own issues. Therefore, data science specialist eoda invites users, developers, admins and IT decision makers to submit topics via the call for papers by April 30.

Last year’s event a total success

With the [R] Kenntnis-Tage 2017, eoda continues its own success story. At last year's event, around 50 participants – some from abroad – enhanced their knowledge of R in workshops and tutorials, and gained new inspiration for use cases from guest lectures by companies such as Lufthansa Industry Solutions and TRUMPF Laser.

Participants from different industries praised the event for giving innovative impulses for applying data science and R in the future.

Registration now open: book early and benefit from early bird discount

For more information about the [R] Kenntnis-Tage 2017 and the registration with special conditions for early bookers visit

To leave a comment for the author, please follow the link and comment on their blog: eoda english R news. R-bloggers offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...

Negative Binomial Regression for Complex Samples (Surveys) #rstats

Thu, 03/09/2017 - 09:57

The survey-package from Thomas Lumley is a great toolkit when analyzing complex samples. It provides svyglm(), to fit generalised linear models to data from a complex survey design. svyglm() covers all families that are also provided by R’s glm() – however, the survey-package has no function to fit negative binomial models, which might be useful for overdispersed count models. Yet, the package provides a generic svymle() to fit user-specified likelihood estimations. In his book, Appendix E, Thomas Lumley describes how to write your own likelihood-function, passed to svymle(), to fit negative binomial models for complex samples. So I wrote a small “wrapper” and implemented a function svyglm.nb() in my sjstats-package.
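For comparison, the ordinary unweighted negative binomial model (which svyglm.nb extends to complex samples) can be fit with MASS::glm.nb(); a sketch (not from the original post) on the quine data that ships with MASS:

```r
# Sketch: an ordinary (unweighted) negative binomial model for
# overdispersed counts, using the 'quine' data (days absent from
# school) that ships with the MASS package
library(MASS)

fit_nb <- glm.nb(Days ~ Sex + Age, data = quine)
fit_nb$theta   # estimated dispersion parameter
```

svyglm.nb() plays the same role as glm.nb() here, but additionally accounts for the survey design (clusters, strata and weights).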

# ------------------------------------------
# This example reproduces the results from
# Lumley 2010, figure E.7 (Appendix E, p256)
# ------------------------------------------
library(sjstats)
library(survey)
data(nhanes_sample)

# create survey design
des <- svydesign(
  id = ~ SDMVPSU,
  strat = ~ SDMVSTRA,
  weights = ~ WTINT2YR,
  nest = TRUE,
  data = nhanes_sample
)

# fit negative binomial regression
fit <- svyglm.nb(total ~ factor(RIAGENDR) * (log(age) + factor(RIDRETH1)), des)

# print coefficients and standard errors
round(cbind(coef(fit), survey::SE(fit)), 2)
#>                                          [,1] [,2]
#> theta.(Intercept)                        0.81 0.05
#> eta.(Intercept)                          2.29 0.16
#> eta.factor(RIAGENDR)2                   -0.80 0.18
#> eta.log(age)                             1.07 0.23
#> eta.factor(RIDRETH1)2                    0.08 0.15
#> eta.factor(RIDRETH1)3                    0.09 0.18
#> eta.factor(RIDRETH1)4                    0.82 0.30
#> eta.factor(RIDRETH1)5                    0.06 0.38
#> eta.factor(RIAGENDR)2:log(age)          -1.22 0.27
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)2 -0.18 0.26
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)3  0.60 0.19
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)4  0.06 0.37
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)5  0.38 0.44

The function returns an object of class svymle, so all methods provided by the survey-package for this class work – it’s just that there are only a few, and common methods like predict() are currently not implemented. Maybe, hopefully, future updates of the survey-package will include such features.

Tagged: R, rstats, Statistik

Euler Problem 15: Pathways Through a Lattice

Thu, 03/09/2017 - 08:52

Euler Problem 15 analyses taxicab geometry. This system replaces the usual distance function with the sum of the absolute differences of the Cartesian coordinates – in other words, the distance a taxi would travel in a grid plan. The fifteenth Euler problem asks us to determine the number of possible routes a taxi can take in a city of a certain size.

Euler Problem 15 Definition

Starting in the top left corner of a 2×2 grid, and only being able to move to the right and down, there are exactly 6 routes to the bottom right corner. How many possible routes are there through a 20×20 grid?


The defined lattice is one larger than the number of squares. Along the edges of the lattice, only one pathway is possible: straight to the right, or straight down. We can calculate the number of possible pathways for the remaining points by adding the number of pathways to the right of and below each point.

For the two by two lattice the solution space is:

6  3  1
3  2  1
1  1  0

The total number of pathways from the upper left corner to the lower right corner is thus 6. This logic can now be applied to a grid of any arbitrary size using the following code.

The code defines the lattice and initiates the boundary conditions. The bottom row and the right column are filled with 1 as there is only one solution from these points. The code then calculates the pathways by working backwards through the matrix. The final solution is the number in the first cell.

# Define lattice
nLattice <- 20
lattice <- matrix(ncol = nLattice + 1, nrow = nLattice + 1)

# Boundary conditions
lattice[nLattice + 1, -(nLattice + 1)] <- 1
lattice[-(nLattice + 1), nLattice + 1] <- 1

# Calculate pathways
for (i in nLattice:1) {
  for (j in nLattice:1) {
    lattice[i, j] <- lattice[i + 1, j] + lattice[i, j + 1]
  }
}
answer <- lattice[1, 1]

[Figure: Taxicab Geometry]
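As a cross-check (not part of the original post), the number of monotone lattice paths also has a closed form: each route through an n×n grid is a sequence of 2n moves of which exactly n go down, giving C(2n, n) routes:

```r
# Cross-check via the closed form: a route through an n x n grid is a
# sequence of 2n moves, and we choose which n of them go down
nLattice <- 20
answer_closed_form <- choose(2 * nLattice, nLattice)
print(answer_closed_form)  # 137846528820
```

For the 2×2 grid this gives choose(4, 2) = 6, matching the solution space shown above.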

The post Euler Problem 15: Pathways Through a Lattice appeared first on The Devil is in the Data.

Changing names in the tidyverse: An example for many regressions

Thu, 03/09/2017 - 07:17

(This article was first published on R – Stat Bandit, and kindly contributed to R-bloggers)

A collaborator posed an interesting R question to me today. She wanted to do several regressions using different outcomes, with models being computed on different strata defined by a combination of experimental design variables. She then just wanted to extract the p-values for the slopes for each of the models, and then filter the strata based on p-value levels.

This seems straightforward, right? Let’s set up a toy example:

library(tidyverse)
dat <- as_tibble(expand.grid(letters[1:4], 1:5))
d <- vector('list', nrow(dat))
set.seed(102)
for (i in 1:nrow(dat)) {
  x <- rnorm(100)
  d[[i]] <- tibble(x = x, y1 = 3 - 2 * x + rnorm(100), y2 = -4 + 5 * x + rnorm(100))
}
dat <- as_tibble(bind_cols(dat, tibble(dat = d))) %>% unnest()
knitr::kable(head(dat), format = 'html')

Var1  Var2           x          y1         y2
a        1   0.1805229   4.2598245  -3.004535
a        1   0.7847340   0.0023338  -2.104949
a        1  -1.3531646   3.1711898  -9.156758
a        1   1.9832982  -0.7140910   5.966377
a        1   1.2384717   0.3523034   2.131004
a        1   1.2006174   0.6267716   1.752106

Now we’re going to perform two regressions, one using y1 and one using y2 as the dependent variables, for each stratum defined by Var1 and Var2.

out <- dat %>%
  nest(-Var1, -Var2) %>%
  mutate(model1 = map(data, ~lm(y1 ~ x, data = .)),
         model2 = map(data, ~lm(y2 ~ x, data = .)))

Now conceptually, all we do is tidy up the output for the models using the broom package, filter on the rows containing the slope information, and extract the p-values, right? Not quite….

library(broom)
out_problem <- out %>%
  mutate(output1 = map(model1, ~tidy(.)),
         output2 = map(model2, ~tidy(.))) %>%
  select(-data, -model1, -model2) %>%
  unnest()
names(out_problem)

[1] "Var1" "Var2" "term" "estimate" "std.error"
[6] "statistic" "p.value" "term" "estimate" "std.error"
[11] "statistic" "p.value"

We've got two sets of output, but with the same column names!!! This is a problem! An easy solution would be to preface the column names with the name of the response variable. I struggled with this today until I discovered the setNames function.
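setNames() is base R: it returns its first argument with the names replaced, which makes it usable inline within a pipe. A tiny standalone illustration:

```r
# setNames() returns the object with its names replaced, so it can be
# used inline to prefix column names
df <- data.frame(estimate = 1.5, p.value = 0.02)
df2 <- setNames(df, paste('y1', names(df), sep = '_'))

names(df2)  # "y1_estimate" "y1_p.value"
```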

out_nice <- out %>%
  mutate(output1 = map(model1, ~tidy(.)),
         output2 = map(model2, ~tidy(.)),
         output1 = map(output1, ~setNames(., paste('y1', names(.), sep = '_'))),
         output2 = map(output2, ~setNames(., paste('y2', names(.), sep = '_')))) %>%
  select(-data, -model1, -model2) %>%
  unnest()

This is a compact representation of the results of both regressions by strata, and we can extract the information we would like very easily. For example, to extract the stratum-specific slope estimates:

out_nice %>%
  filter(y1_term == 'x') %>%
  select(Var1, Var2, ends_with('estimate')) %>%
  knitr::kable(digits = 3, format = 'html')

Var1  Var2  y1_estimate  y2_estimate
a        1       -1.897        5.036
b        1       -2.000        5.022
c        1       -1.988        4.888
d        1       -2.089        5.089
a        2       -2.052        5.015
b        2       -1.922        5.004
c        2       -1.936        4.969
d        2       -1.961        4.959
a        3       -2.043        5.017
b        3       -2.045        4.860
c        3       -1.996        5.009
d        3       -1.922        4.894
a        4       -2.000        4.942
b        4       -2.000        4.932
c        4       -2.033        5.042
d        4       -2.165        5.049
a        5       -2.094        5.010
b        5       -1.961        5.122
c        5       -2.106        5.153
d        5       -1.974        5.009


To leave a comment for the author, please follow the link and comment on their blog: R – Stat Bandit.

Being successful on Kaggle using `mlr`

Thu, 03/09/2017 - 01:00

(This article was first published on mlr-org, and kindly contributed to R-bloggers)

Achieving a good score on a Kaggle competition is typically quite difficult. This blog post outlines 7 tips for beginners to improve their ranking on the Kaggle leaderboards. For this purpose, I also created a Kernel for the Kaggle bike sharing competition that shows how the R package mlr can be used to tune an xgboost model with random search in parallel (using 16 cores). The R script scores rank 90 (of 3251) on the Kaggle leaderboard.

7 Rules
  1. Use good software
  2. Understand the objective
  3. Create and select features
  4. Tune your model
  5. Validate your model
  6. Ensemble different models
  7. Track your progress
1. Use good software

Whether you choose R, Python or another language to work on Kaggle, you will most likely need to leverage quite a few packages to follow best practices in machine learning. To save time, you should use ‘software’ that offers a standardized and well-tested interface for the important steps in your workflow:

  • Benchmarking different machine learning algorithms (learners)
  • Optimizing hyperparameters of learners
  • Feature selection, feature engineering and dealing with missing values
  • Resampling methods for validation of learner performance
  • Parallelizing the points above

Examples of ‘software’ that implement the steps above and more:

  • For Python: scikit-learn
  • For R: mlr or caret.
2. Understand the objective

To develop a good understanding of the Kaggle challenge, you should:

  • Understand the problem domain:
    • Read the description and try to understand the aim of the competition.
    • Keep reading the forum and looking into scripts/kernels of others, learn from them!
    • Domain knowledge might help you (i.e., read publications about the topic, wikipedia is also ok).
    • Use external data if allowed (e.g., google trends, historical weather data).
  • Explore the dataset:
    • Which features are numerical, categorical, ordinal or time dependent?
    • Decide how to handle missing values. Some options:
      • Impute missing values with the mean, median or with values that are out of range (for numerical features).
      • Interpolate missing values if the feature is time dependent.
      • Introduce a new category for the missing values or use the mode (for categorical features).
    • Do exploratory data analysis (for the lazy: wait until someone else uploads an EDA kernel).
    • Insights you learn here will inform the rest of your workflow (creating new features).

Make sure you choose an approach that directly optimizes the measure of interest!

  • The median minimizes the mean absolute error (MAE) and
    the mean minimizes the mean squared error (MSE).
  • By default, many regression algorithms predict the expected mean but there
    are counterparts that predict the expected median
    (e.g., linear regression vs. quantile regression).

  • For strange measures: Use algorithms where you can implement your own objective
    function, see e.g.
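The median/mean point above can be verified numerically in a few lines (a sketch with an arbitrary vector):

```r
# Numeric check: the median minimizes absolute error (MAE), while the
# mean minimizes squared error (MSE)
x <- c(1, 2, 3, 10)
mae <- function(c) sum(abs(x - c))
mse <- function(c) sum((x - c)^2)

mae(median(x)) <= mae(mean(x))   # TRUE
mse(mean(x))   <= mse(median(x)) # TRUE
```

This is why a median-predicting method can beat a mean-predicting one when the competition metric is MAE.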

3. Create and select features

In many Kaggle competitions, finding a “magic feature” can dramatically increase your ranking. Sometimes, better data beats better algorithms! You should therefore try to introduce new features containing valuable information (which can’t be found by the model) or remove noisy features (which can decrease model performance):

  • Concatenate several columns
  • Multiply/Add several numerical columns
  • Count NAs per row
  • Create dummy features from factor columns
  • For time series, you could try
    • to add the weekday as new feature
    • to use rolling mean or median of any other numerical feature
    • to add features with a lag…
  • Remove noisy features: Feature selection / filtering
4. Tune your model

Typically you can focus on a single model (e.g. xgboost) and tune its hyperparameters for optimal performance.

5. Validate your model

Good machine learning models not only work on the data they were trained on, but also on unseen (test) data that was not used for training the model. When you use training data to make any kind of decision (like feature or model selection, hyperparameter tuning, …), the data becomes less valuable for generalization to unseen data. So if you just use the public leaderboard for testing, you might overfit to the public leaderboard and lose many ranks once the private leaderboard is revealed. A better approach is to use validation to get an estimate of performance on unseen data:

  • First figure out how the Kaggle data was split into train and test data. Your resampling strategy should follow the same method if possible. So if Kaggle uses, e.g., a feature for splitting the data, you should not use random samples for creating cross-validation folds.
  • Set up a resampling procedure, e.g., cross-validation (CV) to measure your model performance
  • Improvements on your local CV score should also lead to improvements on the leaderboard.
  • If this is not the case, you can try
    • several CV folds (e.g., 3-fold, 5-fold, 8-fold)
    • repeated CV (e.g., 3 times 3-fold, 3 times 5-fold)
    • stratified CV
  • mlr offers nice visualizations to benchmark different algorithms.
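For readers not using mlr, the cross-validation procedure above can be sketched in plain base R (a hypothetical 5-fold split on mtcars):

```r
# Sketch: 5-fold cross-validation for a simple linear model on mtcars,
# reporting the RMSE on each held-out fold
set.seed(123)
k <- 5
folds <- split(sample(nrow(mtcars)), rep(1:k, length.out = nrow(mtcars)))

rmse <- sapply(folds, function(test_idx) {
  fit  <- lm(mpg ~ wt + hp, data = mtcars[-test_idx, ])
  pred <- predict(fit, newdata = mtcars[test_idx, ])
  sqrt(mean((mtcars$mpg[test_idx] - pred)^2))
})

mean(rmse)  # estimated out-of-sample error
```

mlr's resample() and caret's train() wrap exactly this kind of loop, with support for stratified and repeated variants.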
6. Ensemble different models (see, e.g. this guide):

After training many different models, you might want to ensemble them into one strong model using one of these methods:

  • simple averaging or voting
  • finding optimal weights for averaging or voting
  • stacking
7. Track your progress

A kaggle project might get quite messy very quickly, because you might try and prototype
many different ideas. To avoid getting lost, make sure to keep track of:

  • What preprocessing steps were used to create the data
  • What model was used for each step
  • What values were predicted in the test file
  • What local score did the model achieve
  • What public score did the model achieve

If you do not want to use a tool like git, at least make sure you create subfolders
for each prototype. This way you can later analyse which models you might want to ensemble
or use for your final commits for the competition.

To leave a comment for the author, please follow the link and comment on their blog: mlr-org.

The nhmrcData package: NHMRC funding outcomes data made tidy

Thu, 03/09/2017 - 00:00

Do you like R? Information about Australian biomedical research funding outcomes? Tidy data? If the answers to those questions are “yes”, then you may also like nhmrcData, a collection of datasets derived from funding statistics provided by the Australian National Health & Medical Research Council. It’s also my first R package (more correctly, R data package).

Read on for the details.

1. Installation

The package is hosted at Github and is in a subdirectory of a top-level repository, so it can be installed using the devtools package, then loaded in the usual way:

devtools::install_github("neilfws/politics", subdir = "nhmrcData", build_vignettes = TRUE)
library(nhmrcData)

There are currently 14 datasets in the package. No need to type data(); they are “lazy-loaded”, so just start typing “nhmrc” then hit Tab to see their names. The datasets are (somewhat) documented and briefly described in the repository README so for this post, I’ll focus on just four examples: those related to gender.

2. Examples
Example code is (or will be) in the package vignette (I’m a bit over copy-pasting code to WordPress). Here are the results.

2.1 nhmrcOutcomesGenderBRA
The nhmrcOutcomesGenderBRA dataset contains information on funding outcomes by gender, fellowship scheme and broad research area for 2013 – 2015. Here’s an attempt to capture all of that in an area plot.

Mmm, look at those success rates. My qualitative first impression is that there are more women applicants in several categories, without a corresponding increase in funded proposals.

2.2 nhmrcOutcomesGenderFellows
The nhmrcOutcomesGenderFellows dataset also summarises outcomes by fellowship scheme, this time broken down into levels.

What’s interesting here is the difference in numbers by gender after the career development stage.

2.3 nhmrcOutcomesGenderPartTime
The nhmrcOutcomesGenderPartTime dataset looks at the relatively small number of part-time fellowship applications.

I tried to create this chart in the style of a “population pyramid”, to contrast directly the numbers by gender. Looks rather like women could use more part-time opportunities earlier in their careers.

2.4 nhmrcOutcomesGenderScheme
Finally, the nhmrcOutcomesGenderScheme dataset: yet another summary by fellowship scheme but this time, broken down into 29 categories, such as “Australian Biomedical (Peter Doherty)”.

That’s the category I chose to display here and again, it indicates that more women are trying as hard or harder, yet this is not reflected in funding rates.

Of course, these charts are exploratory qualitative observations and it would be useful to apply some statistical tests where appropriate.

3. Thoughts on publicly-available data and tidiness
It’s great when organisations make their data available, but the next step is making it usable. Excel files are a start – but why are they always so dreadfully formatted? The answer of course is that people cannot decide whether Excel is a database or a presentation tool (it is neither), so we end up with the worst of all worlds. PDF as a data source is just plain wrong, end of story. For this project I used the excellent tabulizer package to extract PDF tables into R matrices, followed by a lot of customised, manual cleaning which was impossible to automate or to make reproducible. The best that I could do was to dump the results as CSV files in the package data-raw directory.

Given that organisations struggle to get even the basics of data correct, I suppose hoping for some consideration of tidiness is wildly optimistic. So many spreadsheets with a header row of years (2000, 2001, 2002…) as opposed to a column named year. The tidy philosophy certainly makes manipulation and visualisation using R much easier, though I note that the occasional “spread” to wide format is still required for certain calculations. For example, funding outcomes are often expressed as total applications and successful applications. When those values are stored in the same column, it’s easier to calculate the number of unsuccessful applications (total – successful) by spreading to two columns, then gathering back again.
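In base R terms, the spread–subtract round trip described above looks roughly like this (the numbers are invented; with tidyverse tools one would use tidyr::spread() and tidyr::gather() instead):

```r
# Long format: one row per (year, outcome) pair.
long <- data.frame(
  year    = c(2014, 2014, 2015, 2015),
  outcome = c("total", "successful", "total", "successful"),
  n       = c(100, 20, 120, 18)
)

# Spread to wide so the subtraction is a simple column operation.
wide <- reshape(long, idvar = "year", timevar = "outcome", direction = "wide")
names(wide) <- sub("^n\\.", "", names(wide))
wide$unsuccessful <- wide$total - wide$successful
wide  # columns: year, total, successful, unsuccessful
```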

4. Thoughts on package writing
I’ve long wanted to write an R package but like many people, I kept putting it into the “too hard just now” basket. This is no longer the case thanks to some excellent resources. The first of these of course is RStudio itself, which includes several tools to facilitate package writing, building, testing and documentation. For an overview, start here. There are also many excellent online guides and tutorials. One such resource (which includes links to others) is Karl Broman’s R package primer.

Time will tell how easy it will be to update the package with new NHMRC data. In the meantime, I hope some of you find this useful and welcome your constructive feedback.

Filed under: R, statistics

Meetup: Stream processing with R in AWS

Wed, 03/08/2017 - 18:29

Update: Slides click here.

Stream processing with R in AWS
by Gergely Daróczi

Abstract: R is rarely mentioned among the big data tools, although it’s fairly well scalable for most data science problems and ETL tasks. This talk presents an open-source R package to interact with Amazon Kinesis via the MultiLangDaemon bundled with the Amazon KCL to start multiple R sessions on a machine or cluster of nodes to process data from theoretically any number of Kinesis shards. Besides the technical background and a quick introduction on how Kinesis works, this talk will feature some stream processing use-cases at, and will also provide an overview and hands-on demos on the related data infrastructure built on the top of Docker, Amazon ECS, ECR, KMS, Redshift and a bunch of third-party APIs.

Bio: Gergely Daróczi is an enthusiastic R user and package developer, founder of an R-based web application at, Ph.D. candidate in Sociology, Director of Analytics at with a strong interest in designing a scalable data platform built on the top of R, AWS and a dozen APIs. He maintains some CRAN packages mainly dealing with reporting and API integrations, co-authored a number of journal articles in social and medical sciences, and recently published the “Mastering Data Analysis with R” book at Packt.


– 6:30pm arrival, food/drinks and networking
– 7:30pm talks
– 9 pm more networking

You must have a confirmed RSVP and please arrive by 7:25pm at the latest. Please RSVP here on Eventbrite.

Venue: Edmunds, 2401 Colorado Avenue (this is Edmunds’ NEW LOCATION, don’t go to the old one)
Park underneath the building (Colorado Center), Edmunds will validate.

Reproducible Finance with R: Interactive Maps and ETF Analysis

Wed, 03/08/2017 - 18:00

by Jonathan Regenstein

In this post, I’ll describe a Shiny app to support the Emerging Markets ETF Country Exposure analysis developed in a previous post.

I have done some additional work and updated the analysis to include five ETFs in the app, whereas we originally imported data on one ETF. The new notebook is available here.

As we start to build our Shiny app, we will assume that our underlying Notebook has been used to build the desired ETF data objects and spatial data frame, and that those have been saved in an appropriately named .RDat file. That’s an important assumption because the first thing we’ll do in the chunk below is load up that file.

A brief aside before we dive into the first code chunk: if you look at the source code for this app, you might notice that the code chunks have different ‘contexts’ – something we haven’t had in our previous flexdashboards. For example, the first chunk has context = "data", the next has context = "render", and so on. That is because I am taking advantage of a new runtime option called shiny_prerendered that is still under development. This runtime allows your flexdashboard to “pre-render” certain objects, meaning they are knit prior to deployment. Here, we pre-render our data and our map object.

First, we’ll load up our data and build a map of the world, shaded by our original emerging markets fund. That is what a user will see when first arriving to the app. We can think of it as the default map and shading before the user has done anything. We didn’t have to use emerging markets as the default, but it was the first fund we imported in the Notebook.

This code should look familiar. When we load the .RDat file, all of the saved objects go to the global environment, and after the chunk has been run (or pre-rendered), the data and map object are available.

Now we want to set up some radio buttons so that the user can eventually change the shading of the map. This is pure UI work and is a simple selection of radio buttons. Note, in particular, how we give the user a choice of radio buttons with a nice intuitive label but then assign each button with a string value that matches the name of our fund objects. For example, we label the second button as “Global Infrastructure ETF”, which is what the user sees. But, upon choosing that button, we assign an input value equal to “infrastructure”.

Go back and look at the Notebook and, no surprise, “infrastructure” is what we titled our infrastructure fund object, as well as the column in the spatial data frame that contains country weights for the infrastructure fund.

I will harp on this point again below but it was quite purposeful to choose one label – “infrastructure” – for both the fund object and the column in the spatial data frame. It allows us to use one radio button to reference both of those. The same convention holds true for the other funds: one label for the fund object and the spatial data frame column that holds country weights.

Let’s go ahead and render the default map to the user.

Next, we build that default map in the “server” context. It will be shaded by whatever defaults we chose in the first chunk. In this case, we shade by the emerging market ETF country weights.

After we have that map, finally, the fun part: we’re going to allow the user to change the shading of the map according to the radio buttons. But – but! – we don’t want to rebuild a map of the world upon each radio button selection. Instead, we want to change just the shading and the pop up, and we’ll use the leafletProxy() function to accomplish that.

This will be a four-step process: (1) use observeEvent() to capture the input from the radio button, (2) build a new palette based on that input, (3) build a new pop up based on that input, (4) display the new map, which consists of the original map plus the new palette and pop up.

Have a read through the code chunk and then we’ll dissect it.

  1. Inside the observeEvent function, capture the radio button selection with ‘input$fund’, and save it to an object called ‘indicator’.

  2. Create a new palette based on the indicator called ‘newPal’, in the same way as we created the palette in the first code chunk, except instead of explicitly passing in a fund like ’emerging_market’, pass in ‘indicator’, which has taken on a fund value from the radio button.

  3. Create a new pop up based on the indicator using the same process. Call it newPopup and pass it ‘indicator’ instead of an explicit fund name.

  4. Display the new map.
    a) Use leafletProxy() to build out that new map. We don’t want to rebuild the whole map, so we tell leafletProxy() to start with the already built ‘fundMap’.
    b) Remove the layerId so we have a clean slate. Add back the provider tiles; then add back the polygons.
    c) To add a palette, we do two things: use ‘newPal’, which is based on the radio button, and use the column from the spatial data frame that has the same name as ‘indicator’ (that is, shade the palette by ‘world_fund_country_weights[[indicator]]’). We can do that because the column in the spatial data frame has the same name as the fund object.

In the original Notebook where we created the data frame emerging_market_country_weights, we made a decision that makes interactive shading smoother: we named the column in the data frame that holds the country weights ’emerging_market’. Then we added that data frame to the spatial data frame, and the spatial data frame got a new column called ’emerging_market’.

Notice that because the column in the spatial data frame, emerging_market, has the same name as our fund object, we can reference both from one selection of the radio button. Convenient!

Just to finish off the new map, we use ‘popup = newPopup’ to add the popup object we created from the radio input. Our app now has an interactively shaded map!

Before we head to happy hour, though, let’s finish up with a relatively simple addition to the app. When a user clicks a country, not only does the user see a popup but also country level detail in a data table. The data table shows the individual companies that the ETF holds in the clicked country, and the weight to each of those companies.

Again, the hard work was done in the Notebook. When we loaded the .RDat file, it contained an object for each fund, with that convenient label I have belabored, with a column for countries and a column for companies. All we need to do is filter that object by the clicked country. Our friend eventReactive() again allows us to capture the clicked country’s id, and we subset by that id. For example, if a user clicks on China, we subset and display only the companies with ‘China’ in the country column.

And, we’re done! Please note, however, that the techniques employed in this map can be used to map country exposures of mutual funds, bespoke stock/bond portfolios, currency and commodities holdings, etc. Enjoy!

Employee Retention with R Based Data Science Accelerator

Wed, 03/08/2017 - 16:22

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)

Employee retention has been and will continue to be one of the biggest challenges of a company. While classical tactics such as promotion, competitive perks, etc. are practiced as ways to retain employees, it is now a hot trend to rely on machine learning technology to discover behavioral patterns with which a company can understand their employees better.

Employee demographic data has been studied and used for analyzing employees’ inclination towards leaving a company. Nowadays, with the proliferation of the Internet, employees’ behavior can better be understood and analyzed through such data as internal and external social media postings. Such data can be leveraged for the analysis of, for example, sentiment and thereby determination of an employee’s likelihood of leaving the company. Novel cognitive computing technology based on artificial intelligence tools empowers today’s HR departments to identify staff who are likely to churn before they do. Through pro-active intervention HR can better manage staff to encourage them to remain longer term with the company.

This blog post introduces an R-based data science accelerator that can be quickly adopted by a data scientist to prototype a solution for the employee attrition prediction scenario. The prediction is based on two types of employee data that are typically already collected by companies:

  1. Static data which does not tend to change over time. This type of data may refer to demographic and organizational data such as age, gender, title, etc. A characteristic of this type of data is that within a certain period they do not change or solely change in a deterministic way. For instance, years of service of an employee is static as the number increments every year.
  2. The second type of data is the dynamically evolving information about an employee. Recent studies revealed that sentiment plays a critical role in employee attrition prediction. Classical measures of sentiment require employee surveys of work satisfaction. Social media posts are useful for sentiment analysis as employees may express their feelings about work there. Unstructured data such as text can be collected for mining patterns which are indicative of employees with different inclinations to churn.

Attrition prediction is a scenario that takes historic employee data as input to identify individuals who are inclined to leave. The basic procedure is to extract features from the available data that might previously have been analyzed manually, and to build predictive models on a training set with labels describing employment status. Normally it can be formalized as a supervised classification problem; what makes it distinctive is that the classes of employees with different employment statuses are usually imbalanced. Training on such an imbalanced data set requires resampling or cost-sensitive learning techniques. For sentiment analysis on unstructured data such as text, pre-processing techniques that extract analysis-friendly quantitative features should be applied. Commonly used feature extraction methods for text analysis include word-to-vector, term frequency, or term frequency–inverse document frequency, etc. The choice of algorithm for building the model depends on the data characteristics. In case a specific algorithm does not yield the desired results, we have found that ensemble techniques can be deployed to effectively boost model performance.
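As a minimal illustration of the resampling idea, here is naive random oversampling of the minority class in base R (a sketch with a toy label vector, not the accelerator’s actual code):

```r
# Return row indices of a training set rebalanced by duplicating minority rows.
oversample <- function(labels, seed = 1) {
  set.seed(seed)
  counts <- table(labels)
  minority <- names(which.min(counts))
  deficit <- max(counts) - min(counts)
  extra <- sample(which(labels == minority), deficit, replace = TRUE)
  c(seq_along(labels), extra)
}

y <- c(rep("stay", 9), rep("leave", 1))
idx <- oversample(y)
table(y[idx])  # both classes now appear 9 times
```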

The data science language R is a convenient tool for performing HR churn prediction analysis. A lightweight data science accelerator that demonstrates the process of predicting employee attrition is shared in this Github repository. The walk-through basically shows cutting-edge machine learning and text mining techniques applied in R.

The code for the analytics is provided in an R markdown document, which can be interactively executed step by step to aid replication and learning. Documents of various formats (e.g., PDF, html, Jupyter Notebook, etc.) can be produced directly from the markdown document.

Taking this employee attrition data set, for example, one can easily visualize and perform correlation analysis with R packages. For instance, we may plot the distribution of monthly income of employees from different departments and intuitively analyze whether income can be a factor that affects attrition.

library(xlsx)    # provides read.xlsx(); not shown in the original post
library(ggplot2)

df <- read.xlsx("./attrition.xlsx", sheetIndex = 1, header = TRUE, colClasses = NA)

ggplot(df, aes(x = factor(Department), y = MonthlyIncome, color = factor(Attrition))) +
  geom_boxplot() +
  xlab("Department") +
  ylab("Monthly income") +
  scale_fill_discrete(guide = guide_legend(title = "Attrition")) +
  theme_bw()

R packages such as tm offer handy functions for dealing with text mining tasks on employees’ social media posts. Sometimes text analysis in multiple languages is of great interest to employers that have subsidiaries across different language-speaking locations. This can be done either via a language-specific text analysis package (e.g., jiebaR for Chinese segmentation) or on-line translation. This R code demonstrates how to perform tokenization and term frequency counts on English-Chinese text with jiebaR and tm.
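For intuition, the term-frequency part of such a pipeline can be mimicked in a few lines of base R; tm’s TermDocumentMatrix does the same bookkeeping plus stop-word removal, stemming and so on. The example posts are invented:

```r
posts <- c("great team great project", "project deadline stress")

# Lower-case, split on whitespace, count term occurrences.
tokens <- unlist(strsplit(tolower(posts), "\\s+"))
tf <- sort(table(tokens), decreasing = TRUE)
tf[["great"]]  # 2
```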

For the complete solution for employee attrition, follow the link below.

Github (Microsoft): Data Science Accelerator – Employee Attrition Prediction with Sentiment Analysis

Dataviz of the week: 8/3/17

Wed, 03/08/2017 - 14:26


Corinne Riddell posted this on Twitter. It’s one version of multiple time series that she tried out, one for each US state. It’s not the finished article, but it is really nice for its combination of that recognisable shape (I suppose if your country has a dull shape like Portugal — no offence — then readers won’t immediately recognise the meaning of the arrangement) and the clean, simple small multiples. Admittedly, the time series has enough signal-to-noise to make this possible, and only a few unusual states; without that it might start to look like spaghetti. But it’s always worth sketching options like this to see how they work.

Could the state names get dropped? Probably not, but they could turn into two-letter abbreviations. The whole idea of the small multiple is that it’s not a precise x-y mapping but a general impression, so the y-axis labels could go (just as there are no x-axis labels).

Overall, I really like it and would like to write up a UK function for this to add to my dataviz toolbox.

Lesser known dplyr tricks

Wed, 03/08/2017 - 12:00

In this blog post I share some lesser-known (at least I believe they are) tricks that use mainly functions from dplyr.

Removing unneeded columns

Did you know that you can use - in front of a column name to remove it from a data frame?

mtcars %>% select(-disp) %>% head()

##                    mpg cyl  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 105 2.76 3.460 20.22  1  0    3    1

Re-ordering columns

Still using select(), it is easy to re-order the columns of your data frame:

mtcars %>% select(cyl, disp, hp, everything()) %>% head()

##                   cyl disp  hp  mpg drat    wt  qsec vs am gear carb
## Mazda RX4           6  160 110 21.0 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       6  160 110 21.0 3.90 2.875 17.02  0  1    4    4
## Datsun 710          4  108  93 22.8 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      6  258 110 21.4 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   8  360 175 18.7 3.15 3.440 17.02  0  0    3    2
## Valiant             6  225 105 18.1 2.76 3.460 20.22  1  0    3    1

As its name implies, everything() simply means all the other columns.

Renaming columns with rename()

mtcars <- rename(mtcars, spam_mpg = mpg)
mtcars <- rename(mtcars, spam_disp = disp)
mtcars <- rename(mtcars, spam_hp = hp)
head(mtcars)

##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am
## Mazda RX4             21.0   6       160     110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         21.0   6       160     110 3.90 2.875 17.02  0  1
## Datsun 710            22.8   4       108      93 3.85 2.320 18.61  1  1
## Hornet 4 Drive        21.4   6       258     110 3.08 3.215 19.44  1  0
## Hornet Sportabout     18.7   8       360     175 3.15 3.440 17.02  0  0
## Valiant               18.1   6       225     105 2.76 3.460 20.22  1  0
##                   gear carb
## Mazda RX4            4    4
## Mazda RX4 Wag        4    4
## Datsun 710           4    1
## Hornet 4 Drive       3    1
## Hornet Sportabout    3    2
## Valiant              3    1

Selecting columns with a regexp

It is easy to select the columns that start with “spam” with some helper functions:

mtcars %>% select(contains("spam")) %>% head()

##                   spam_mpg spam_disp spam_hp
## Mazda RX4             21.0       160     110
## Mazda RX4 Wag         21.0       160     110
## Datsun 710            22.8       108      93
## Hornet 4 Drive        21.4       258     110
## Hornet Sportabout     18.7       360     175
## Valiant               18.1       225     105

Also take a look at starts_with(), ends_with(), contains(), matches(), num_range(), one_of() and everything().

Create new columns with mutate() and if_else()

mtcars %>%
  mutate(vs_new = if_else(vs == 1, "one", "zero", NA_character_)) %>%
  head()

##   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb vs_new
## 1     21.0   6       160     110 3.90 2.620 16.46  0  1    4    4   zero
## 2     21.0   6       160     110 3.90 2.875 17.02  0  1    4    4   zero
## 3     22.8   4       108      93 3.85 2.320 18.61  1  1    4    1    one
## 4     21.4   6       258     110 3.08 3.215 19.44  1  0    3    1    one
## 5     18.7   8       360     175 3.15 3.440 17.02  0  0    3    2   zero
## 6     18.1   6       225     105 2.76 3.460 20.22  1  0    3    1    one

You might want to create a new variable conditionally on several values of another column:

mtcars %>%
  mutate(carb_new = case_when(.$carb == 1 ~ "one",
                              .$carb == 2 ~ "two",
                              .$carb == 4 ~ "four",
                              TRUE ~ "other")) %>%
  head(15)

##    spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am gear carb
## 1      21.0   6     160.0     110 3.90 2.620 16.46  0  1    4    4
## 2      21.0   6     160.0     110 3.90 2.875 17.02  0  1    4    4
## 3      22.8   4     108.0      93 3.85 2.320 18.61  1  1    4    1
## 4      21.4   6     258.0     110 3.08 3.215 19.44  1  0    3    1
## 5      18.7   8     360.0     175 3.15 3.440 17.02  0  0    3    2
## 6      18.1   6     225.0     105 2.76 3.460 20.22  1  0    3    1
## 7      14.3   8     360.0     245 3.21 3.570 15.84  0  0    3    4
## 8      24.4   4     146.7      62 3.69 3.190 20.00  1  0    4    2
## 9      22.8   4     140.8      95 3.92 3.150 22.90  1  0    4    2
## 10     19.2   6     167.6     123 3.92 3.440 18.30  1  0    4    4
## 11     17.8   6     167.6     123 3.92 3.440 18.90  1  0    4    4
## 12     16.4   8     275.8     180 3.07 4.070 17.40  0  0    3    3
## 13     17.3   8     275.8     180 3.07 3.730 17.60  0  0    3    3
## 14     15.2   8     275.8     180 3.07 3.780 18.00  0  0    3    3
## 15     10.4   8     472.0     205 2.93 5.250 17.98  0  0    3    4
##    carb_new
## 1      four
## 2      four
## 3       one
## 4       one
## 5       two
## 6       one
## 7      four
## 8       two
## 9       two
## 10     four
## 11     four
## 12    other
## 13    other
## 14    other
## 15     four

Mind the .$ before the variable carb. There is a github issue about this, and it is already fixed in the development version of dplyr, which means that in the next version of dplyr, case_when() will work as any other specialized dplyr function inside mutate().

Apply a function to certain columns only, by rows

mtcars %>%
  select(am, gear, carb) %>%
  purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2
head(mtcars2)

## # A tibble: 6 × 4
##      am  gear  carb sum_am_gear_carb
##   <dbl> <dbl> <dbl>            <dbl>
## 1     1     4     4                9
## 2     1     4     4                9
## 3     1     4     1                6
## 4     0     3     1                4
## 5     0     3     2                5
## 6     0     3     1                4

For this, I had to use purrr’s by_row() function. You can then add this column to your original data frame:

mtcars <- cbind(mtcars, "sum_am_gear_carb" = mtcars2$sum_am_gear_carb)
head(mtcars)

##                   spam_mpg cyl spam_disp spam_hp drat    wt  qsec vs am
## Mazda RX4             21.0   6       160     110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag         21.0   6       160     110 3.90 2.875 17.02  0  1
## Datsun 710            22.8   4       108      93 3.85 2.320 18.61  1  1
## Hornet 4 Drive        21.4   6       258     110 3.08 3.215 19.44  1  0
## Hornet Sportabout     18.7   8       360     175 3.15 3.440 17.02  0  0
## Valiant               18.1   6       225     105 2.76 3.460 20.22  1  0
##                   gear carb sum_am_gear_carb
## Mazda RX4            4    4                9
## Mazda RX4 Wag        4    4                9
## Datsun 710           4    1                6
## Hornet 4 Drive       3    1                4
## Hornet Sportabout    3    2                5
## Valiant              3    1                4

Use do() to do any arbitrary operation

mtcars %>%
  group_by(cyl) %>%
  do(models = lm(spam_mpg ~ drat + wt, data = .)) %>%
  broom::tidy(models)

## Source: local data frame [9 x 6]
## Groups: cyl [3]
## 
##     cyl        term   estimate  std.error  statistic     p.value
##   <dbl>       <chr>      <dbl>      <dbl>      <dbl>       <dbl>
## 1     4 (Intercept) 33.2493403 17.0987286  1.9445504 0.087727622
## 2     4        drat  1.3244329  3.4519717  0.3836743 0.711215433
## 3     4          wt -5.2400608  2.2150213 -2.3656932 0.045551615
## 4     6 (Intercept) 30.6544931  7.5141648  4.0795609 0.015103868
## 5     6        drat -0.4435744  1.1740862 -0.3778039 0.724768945
## 6     6          wt -2.9902720  1.5685053 -1.9064468 0.129274249
## 7     8 (Intercept) 29.6519180  7.0878976  4.1834574 0.001527613
## 8     8        drat -1.4698722  1.6285054 -0.9025897 0.386081744
## 9     8          wt -2.4518017  0.7985112 -3.0704664 0.010651044

do() is useful when you want to use any R function (user defined functions work too!) with dplyr functions. First I grouped the observations by cyl and then ran a linear model for each group. Then I converted the output to a tidy data frame using broom::tidy().

Using dplyr() functions inside your own functions

extract_vars <- function(data, some_string){
  data %>%
    select_(lazyeval::interp(~contains(some_string))) -> data
  return(data)
}

extract_vars(mtcars, "spam")

##                     spam_mpg spam_disp spam_hp
## Mazda RX4               21.0     160.0     110
## Mazda RX4 Wag           21.0     160.0     110
## Datsun 710              22.8     108.0      93
## Hornet 4 Drive          21.4     258.0     110
## Hornet Sportabout       18.7     360.0     175
## Valiant                 18.1     225.0     105
## Duster 360              14.3     360.0     245
## Merc 240D               24.4     146.7      62
## Merc 230                22.8     140.8      95
## Merc 280                19.2     167.6     123
## Merc 280C               17.8     167.6     123
## Merc 450SE              16.4     275.8     180
## Merc 450SL              17.3     275.8     180
## Merc 450SLC             15.2     275.8     180
## Cadillac Fleetwood      10.4     472.0     205
## Lincoln Continental     10.4     460.0     215
## Chrysler Imperial       14.7     440.0     230
## Fiat 128                32.4      78.7      66
## Honda Civic             30.4      75.7      52
## Toyota Corolla          33.9      71.1      65
## Toyota Corona           21.5     120.1      97
## Dodge Challenger        15.5     318.0     150
## AMC Javelin             15.2     304.0     150
## Camaro Z28              13.3     350.0     245
## Pontiac Firebird        19.2     400.0     175
## Fiat X1-9               27.3      79.0      66
## Porsche 914-2           26.0     120.3      91
## Lotus Europa            30.4      95.1     113
## Ford Pantera L          15.8     351.0     264
## Ferrari Dino            19.7     145.0     175
## Maserati Bora           15.0     301.0     335
## Volvo 142E              21.4     121.0     109

About this last point, you can read more about it here.

Hope you liked this small list of tricks!

Follow-up forecasting forum in Eindhoven

Tue, 03/07/2017 - 23:35

Last October I gave a 3-day masterclass on “Forecasting with R” in Eindhoven, Netherlands. There is a follow-up event planned for Tuesday 18 April 2017. It is particularly designed for people who attended the 3-day class, but if anyone else wants to attend they would be welcome.

Please register here if you want to attend.

The preliminary schedule is as follows.

10.00 — 11.00 New developments in forecasting using R

  • forecast v8.0
  • prophet
  • forecastHybrid
  • opera
Coffee break

11.15 — 12.15 Open discussion of forecasting problems and questions

Lunch break

13.00 — 14.00 Brainstorm about possible features for forecast v9.0

Advanced Econometrics: Model Selection

Tue, 03/07/2017 - 21:00

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

On Thursday, March 23rd, I will give the third lecture of the PhD course on advanced tools for econometrics, on model selection and variable selection, where we will focus on ridge and lasso regressions. Slides are available online.

The first part was on Nonlinearities in Econometric models, and the second one on Simulations.
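For readers who want to experiment before the lecture: ridge regression has the closed form beta_hat = (X'X + lambda I)^(-1) X'y, which is easy to check in base R (the lasso has no closed form; glmnet is the usual tool for both). This sketch is illustrative and not taken from the slides:

```r
# Ridge estimate via its closed form; lambda = 0 recovers ordinary least squares.
ridge <- function(X, y, lambda) {
  p <- ncol(X)
  solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
}

set.seed(1)
X <- scale(matrix(rnorm(100 * 3), ncol = 3))  # standardised predictors
y <- drop(X %*% c(2, 0, -1)) + rnorm(100)

# Larger lambda shrinks the coefficients towards zero (columns: lambda = 0, 50).
round(cbind(ridge(X, y, 0), ridge(X, y, 50)), 2)
```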

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

Building views with R

Tue, 03/07/2017 - 19:01


[Here you can see the Building views with R cheat sheet at a full resolution]


In database theory a query is a request for data or information from a database table or combination of tables.

Since dplyr, we have something in R that conceptually resembles a query quite closely:


library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.5


mtcars %>%
  tbl_df() %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))

## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 26.66364 4.509828
## 2     6 19.74286 1.453567
## 3     8 15.10000 2.560048

What I particularly appreciate about dplyr is the possibility of building my query as a step-by-step set of R statements that I can progressively test at each step.



Again in database theory, a view is the result set of a stored query on the data, which the database users can query just as they would in a table.

I would like to have something similar to a view in R.

As far as I know, I can achieve this goal in three ways:

  • Function makeActiveBinding
  • Operator %<a-% from package pryr
  • My proposed %>>% operator


Function makeActiveBinding()

Function makeActiveBinding(sym, fun, env) installs a function in an environment env so that getting the value of sym calls fun with no arguments.

As a basic example I can actively bind a function that simulates a dice to an object named dice :

makeActiveBinding("dice", function() sample(1:6, 1), env = globalenv())

so that:

replicate(5 , dice)

## [1] 5 1 6 2 3

Similarly, I can wrap a dplyr expression into a function:

f <- function() {
  mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
}

and then actively bind it to a symbol:

makeActiveBinding('view', f , env = globalenv())

so that, any time we call view, the result of function f() is computed again:


view

## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 26.66364 4.509828
## 2     6 19.74286 1.453567
## 3     8 15.10000 2.560048

As a result, if I change any value of mpg within mtcars, view is automatically updated:

mtcars$mpg[c(1,3,5)] <- 0
view

## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 24.59091 9.231192
## 2     6 16.74286 7.504189
## 3     8 13.76429 4.601606

Clearly, I have to admit that all of this looks quite unfriendly, at least to me.


Operator %<a-%

A valid alternative, that wraps away the complexity of function makeActiveBinding() is provided by operator %<a-% from package pryr:

view %<a-% {
  mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
}

Again, if I change any value of mpg within mtcars, the value of view gets automatically updated:

mtcars$mpg[c(1,3,5)] <- 50
view

## # A tibble: 3 × 3
##     cyl mean_mpg    sd_mpg
##   <dbl>    <dbl>     <dbl>
## 1     4 29.13636  8.159568
## 2     6 23.88571 11.593451
## 3     8 17.33571  9.688503

Note that in this case I have to enclose the whole expression within curly brackets.

Moreover, the assignment operator %<a-% goes on the left-hand side of my chain of dplyr statements.


Operator %>>%

Finally, I would like to propose a third alternative, still based on makeActiveBinding(), that I named %>>%:

`%>>%` <- function(expr, x) {
  x <- substitute(x)
  call <- match.call()[-1]
  fun <- function() {NULL}
  body(fun) <- call$expr
  makeActiveBinding(sym = deparse(x), fun = fun, env = parent.frame())
  invisible(NULL)
}

that can be used as:

mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg)) %>>%
  view

And again, if I change the values of mpg:

mtcars$mpg[c(1,3,5)] <- 100

the content of view changes accordingly:

view

## # A tibble: 3 × 3
##     cyl mean_mpg   sd_mpg
##   <dbl>    <dbl>    <dbl>
## 1     4 33.68182 22.41624
## 2     6 31.02857 30.44321
## 3     8 20.90714 22.88454

I believe this operator offers two advantages:

  • Avoids the usage of curly brackets around my dplyr expression
  • Allows me to actively assign the result of my chain of dplyr statements, in a more natural way at the end of the chain
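As a further illustration of my own (not from the original post), %>>% is not tied to dplyr: any expression can go on its left-hand side, and the bound symbol re-evaluates it on each access:

```r
# Assumes the %>>% operator defined above is in scope.
rnorm(3) %>>% noise
noise  # three random draws
noise  # three new draws: the expression is re-run on each access
```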

The post Building views with R appeared first on Quantide – R training & consulting.

In case you missed it: February 2017 roundup

Tue, 03/07/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from February of particular interest to R users. 

Public policy researchers use R to predict neighbourhoods in US cities subject to gentrification.

The ggraph package provides a grammar-of-graphics framework for visualizing directed and undirected graphs.

Facebook has open-sourced the "prophet" package they use for forecasting time series at scale.

A preview of features coming soon to R Tools for Visual Studio 1.0.

On the differences between using Excel and R for data analysis.

A data scientist suggests a "Gloom Index" for identifying the most depressing songs by the band Radiohead.

Catterplots is a package that replaces scatterplot points with cats. (Tufte would not approve.)

A collection of tips on using Microsoft R Server from the support team.

A summary of the many improvements slated for R 3.4.0.

R code using the RevoScaleR package to classify a large database of galaxy images in SQL Server.

A review of four deep learning packages for R: MXNet, darch, deepnet and h2o.

An update on more than a dozen projects and community initiatives funded by R Consortium grants.

R has overtaken SAS in Statistics job listings.

ModernDive is a free online textbook on Statistics and data science using R. 

A solution (with R code) for modeling customer churn in the retail industry using SQL Server R Services.

The superheat package provides enhanced heatmap graphics for R.

The fst package provides a new serialization format for R data focused on performance.

Thomas Dinsmore reflects on major events in the R Project and for Microsoft in 2016.

And some general interest stories (not necessarily related to R): A big drain; 'Vous' vs 'tu'; a remembrance of the late Hans Rosling; and Ten Meter Tower, a short film.

As always, thanks for the comments and please send any suggestions to me. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

sigr: Simple Significance Reporting

Tue, 03/07/2017 - 16:44

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

sigr is a simple R package that conveniently formats a few statistics and their significance tests. This allows the analyst to use the correct test no matter what modeling package or procedure they use.

Model Example

Let’s take as our example the following linear relation between x and y:

library('sigr')
set.seed(353525)
d <- data.frame(x= rnorm(5))
d$y <- 2*d$x + 0.5*rnorm(nrow(d))

stats::lm() has among the most complete summaries of all models in R, so we can easily see the quality of fit and its significance:

model <- lm(y~x, d=d)
summary(model)

## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##       1       2       3       4       5 
##  0.3292  0.3421  0.0344 -0.1671 -0.5387 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.4201     0.2933   1.432  0.24747   
## x             2.1117     0.2996   7.048  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4261 on 3 degrees of freedom
## Multiple R-squared:  0.943,  Adjusted R-squared:  0.9241 
## F-statistic: 49.67 on 1 and 3 DF,  p-value: 0.005871

sigr::wrapFTest() can render the relevant model quality summary.

cat(render(wrapFTest(model), pLargeCutoff=1))

F Test summary: (R2=0.94, F(1,3)=50, p=0.0059).

Note: sigr reports the un-adjusted R-squared, as this is the one that significance is reported for in summary.lm().

sigr also carries around the important summary components for use in code.

unclass(wrapFTest(model))

## $test
## [1] "F Test"
## 
## $numdf
## [1] 1
## 
## $dendf
## [1] 3
## 
## $FValue
## [1] 49.66886
## 
## $R2
## [1] 0.9430403
## 
## $pValue
## [1] 0.005871236
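For instance (a sketch of my own, assuming the model built above), the pValue component can drive a programmatic decision without any string parsing:

```r
# wrapFTest() returns a list-like object whose components
# were shown by unclass() above.
ft <- wrapFTest(model)
if (ft$pValue < 0.01) {
  message("fit is significant at the 1% level")
}
```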

In this function sigr is much like broom::glance() or modelr::rsquare().

broom::glance(model)

##   r.squared adj.r.squared     sigma statistic     p.value df    logLik
## 1 0.9430403     0.9240538 0.4261232  49.66886 0.005871236  2 -1.552495
##       AIC      BIC deviance df.residual
## 1 9.10499 7.933304 0.544743           3

modelr::rsquare(model, d)

## [1] 0.9430403

This is something like what is reported by caret::train() (where caret controls the model training process).

cr <- caret::train(y~x, d, method = 'lm')

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.

cr$results

##   intercept      RMSE  Rsquared    RMSESD   RsquaredSD
## 1      TRUE 0.9631662 0.9998162 0.6239989 0.0007577834

summary(cr$finalModel)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      X1      X2      X3      X4      X5 
##  0.3292  0.3421  0.0344 -0.1671 -0.5387 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.4201     0.2933   1.432  0.24747   
## x             2.1117     0.2996   7.048  0.00587 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4261 on 3 degrees of freedom
## Multiple R-squared:  0.943,  Adjusted R-squared:  0.9241 
## F-statistic: 49.67 on 1 and 3 DF,  p-value: 0.005871

(I presume cr$results$Rsquared is a model quality report and not a consistency of cross-validation procedure report. If it is a model quality report it is somehow inflated, perhaps by the resampling procedure. So I apologize for using either caret::train() or its results incorrectly.)

Data example

The issues in taking summary statistics (and significances) from models include:

  • Working only from models limits you to models that include the statistic you want.
  • Working only from models is mostly "in-sample." You need additional procedures for test or hold-out data.

With sigr it is also easy to reconstruct quality and significance from the predictions, no matter where they came from (without needing the model data structures).

In-sample example

d$pred <- predict(model, newdata = d)
cat(render(wrapFTest(d, 'pred', 'y'), pLargeCutoff=1))

F Test summary: (R2=0.94, F(1,3)=50, p=0.0059).

Notice we reconstruct the summary statistic and significance, independent of the model data structures. This means the test is generic and can be used on any regression (modulo informing the significance model of the appropriate number of parameters). It also can be used on held-out or test data.
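For a simple regression, the quantities wrapFTest() reports can be checked by hand. Here is a sketch of my own, under the assumption of 1 numerator degree of freedom and n - 2 denominator degrees of freedom (using the d and d$pred built above):

```r
# Hand computation of R-squared and the F statistic from predictions.
rss <- sum((d$y - d$pred)^2)      # residual sum of squares
tss <- sum((d$y - mean(d$y))^2)   # total sum of squares
r2  <- 1 - rss / tss              # un-adjusted R-squared
f   <- (r2 / 1) / ((1 - r2) / (nrow(d) - 2))
p   <- pf(f, 1, nrow(d) - 2, lower.tail = FALSE)
```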

In this mode it is a lot like ModelMetrics::rmse() or caret::postResample().

ModelMetrics::rmse(d$y, d$pred)

## [1] 0.3300736

caret::postResample(d$y, d$pred)

##      RMSE  Rsquared 
## 0.3300736 0.9430403

Out of sample example

If we had more data we could look at the quality of the prediction on that data (though take care: the number of degrees of freedom is different for held-out data).

d2 <- data.frame(x= rnorm(5))
d2$y <- 2*d2$x + 0.5*rnorm(nrow(d2))
d2$pred <- predict(model, newdata = d2)
cat(render(wrapFTest(d2, 'pred', 'y'), pLargeCutoff=1))

F Test summary: (R2=0.76, F(1,3)=9.6, p=0.054).

Exotic model example

We could have used glmnet instead of lm:

library("glmnet")
d$one <- 1 # make sure we have at least 2 columns
xmat <- as.matrix(d[, c('one','x')])
glmnetModel <- cv.glmnet(xmat, d$y)

## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold

summary(glmnetModel)

##            Length Class  Mode     
## lambda     53     -none- numeric  
## cvm        53     -none- numeric  
## cvsd       53     -none- numeric  
## cvup       53     -none- numeric  
## cvlo       53     -none- numeric  
## nzero      53     -none- numeric  
## name        1     -none- character
## glmnet.fit 12     elnet  list     
## lambda.min  1     -none- numeric  
## lambda.1se  1     -none- numeric  

d$predg <- as.numeric(predict(glmnetModel, newx= xmat))
cat(render(wrapFTest(d, 'predg', 'y'), pLargeCutoff=1))

F Test summary: (R2=0.91, F(1,3)=30, p=0.012).


Because sigr can render to "LaTeX" it can (when used in conjunction with latex2exp) also produce formatted titles for plots.

library("ggplot2")
library("latex2exp")
f <- paste0(format(model$coefficients['x'], digits= 3),
            '*x + ',
            format(model$coefficients['(Intercept)'], digits= 3))
title <- paste0("linear y ~ ", f, " relation")
subtitle <- latex2exp::TeX(render(wrapFTest(d, 'pred', 'y'),
                                  format= 'latex'))
ggplot(data=d, mapping=aes(x=pred, y=y)) +
  geom_point() +
  geom_abline(color='blue') +
  xlab(f) +
  ggtitle(title, subtitle= subtitle)

Conclusion

sigr computes a few statistics or summaries (with standard significance estimates) and returns the calculation in both machine readable and formatted forms. For non-standard summaries sigr includes some resampling methods for significance estimation. For formatting sigr tries to get near the formats specified by "The American Psychological Association." sigr works with models (such as lm and glm(family='binomial')) or data, and is a small package with few dependencies.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.