### R charts in a Tweet

R bloggers - Mon, 11/20/2017 - 21:25

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Twitter recently doubled the maximum length of a tweet to 280 characters, and while all users now have access to longer tweets, few have taken advantage of the opportunity. Bob Rudis used the rtweet package to analyze tweets sent with the #rstats hashtag since 280-char tweets were introduced, and most still kept below the old 280-character limit. The actual percentage differed by the Twitter client being used; I’ve reproduced the charts for the top 10 clients below. (Click the chart for even more clients, and details of the analysis including the R code that generated it.)

That being said, some have embraced the new freedom with gusto, not least my Microsoft colleague Paige Bailey who demonstrated you can fit some pretty nice R charts — and even animations! — into just 280 characters:

df <- expand.grid(x = 1:10, y=1:10)
df$angle <- runif(100, 0, 2*pi) df$speed <- runif(100, 0, sqrt(0.1 * df$x)) ggplot(df, aes(x, y)) + geom_point() + geom_spoke(aes(angle = angle, radius = speed)) y’all twitterpeople give me 280 characters? yr just gonna get code samples pic.twitter.com/hyGEE2DxGy — @DynamicWebPaige (@DynamicWebPaige) November 8, 2017 x <- getURL(“https://t.co/ivZZvodbNK“) b <- read.csv(text=x) c <- get_map(location=c(-122.080954,36.971709), maptype=”terrain”, source=”google”, zoom=14) ggmap(c) + geom_path(data=b, aes(color=elevation), size=3)+ scale_color_gradientn(colours=rainbow(7), breaks=seq(25, 200, 25)) pic.twitter.com/7WdQLR56uZ — @DynamicWebPaige (@DynamicWebPaige) November 11, 2017 library(dygraphs) lungDeaths <- cbind(mdeaths, fdeaths) dygraph(lungDeaths) %>% dySeries(“mdeaths”, label = “Male”) %>% dySeries(“fdeaths”, label = “Female”) %>% dyOptions(stackedGraph = TRUE) %>% dyRangeSelector(height = 20) I ❤️ R’s interactive visualizations SO MUCH pic.twitter.com/LevjElly3L — @DynamicWebPaige (@DynamicWebPaige) November 16, 2017 library(plot3D) par(mar = c(2, 2, 2, 2)) par(mfrow = c(1, 1)) x <- seq(0, 2*pi,length.out=50) y <- seq(0, pi,length.out=50) M <- mesh(x, y) surf3D(x = (3+2*cos(M$x)) * cos(M$y), y = (3+2*cos(M$x)) * sin(M$y), z = 2 * sin(M$x),
colkey=FALSE,

— @DynamicWebPaige (@DynamicWebPaige) November 17, 2017

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Kaggle’s Advanced Regression Competition: Predicting Housing Prices in Ames, Iowa

R bloggers - Mon, 11/20/2017 - 19:31

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Introduction

Kaggle.com is a website designed for data scientists and data enthusiasts to connect and compete with each other. It is an open community that hosts forums and competitions in the wide field of data. In each Kaggle competition, competitors are given a training data set, which is used to train their models, and a test data set, used to test their models. Kagglers can then submit their predictions to view how well their score (e.g., accuracy or error) compares to others’.

As a team, we joined the House Prices: Advanced Regression Techniques Kaggle challenge to test our model building and machine learning skills. For this competition, we were tasked with predicting housing prices of residences in Ames, Iowa. Our training data set included 1460 houses (i.e., observations) accompanied by 79 attributes (i.e., features, variables, or predictors) and the sales price for each house. Our testing set included 1459 houses with the same 79 attributes, but sales price was not included as this was our target variable.

To view our code (split between R and Python) and our project presentation slides for this project see our shared GitHub repository.

Understanding the Data

Of the 79 variables provided, 51 were categorical and 28 were continuous.

Our first step was to combine these data sets into a single set both to account for the total missing values and to fully understand all the classes for each categorical variable. That is, there might be missing values or different class types in the test set that are not in the training set.

Processing the Data

Response Variable

As our response variable, Sale Price, is continuous, we will be utilizing regression models. One assumption of linear regression models is that the error between the observed and expected values (i.e., the residuals) should be normally distributed. Violations of this assumption often stem from a skewed response variable. Sale Price has a right skew, so we log + 1 transform it to normalize its distribution.

Missing Data

Machine learning algorithms do not handle missing values very well, so we must obtain an understanding of the missing values in our data to determine the best way to handle them. We find that 34 of the predictor variables have values that are interpreted by R and Python as missing (i.e., “NA” and “NaN”). Below we describe examples of some of the ways we treated these missing data.

1) NA/NaN is actually a class:

In many instances, what R and Python interpret as a missing value is actually a class of the variable. For example, Pool Quality is comprised of 5 classes: Excellent, Good, Fair, Typical, and NA. The NA class describes houses that do not have a pool, but our coding languages interpret houses of NA class as a missing value instead of a class of the Pool Quality variable.

Our solution was to impute most of the NA/NaN values to a value of “None.”

2) Not every NA/NaN corresponds to a missing attribute:

While we found that most NA/NaN values corresponded to an actual class for different variables, some NA/NaN values actually represented missing data. For example, we find that three houses with NA/NaN values for Pool Quality, also have a non-zero value for the variable, Pool Area (square footage of pool). These three houses likely have a pool, but its quality was not assessed or input into the data set.

Our solution was to first calculate mean Pool Area for each class of Pool Quality, then impute the missing Pool Quality classes based on how close that house’s Pool Area was to the mean Pool Areas for each Pool Quality class. For example, the first row in the below picture on the left has a Pool Area of 368 square feet. The average Pool Area for houses with Excellent pool quality (Ex) is about 360 square feet (picture on the right). Therefore, we imputed this house to have a Pool Quality of Excellent.

3) Domain knowledge:

Some variables had a moderate amount of missingness. For example, about 17% of the houses were missing the continuous variable, Lot Frontage, the linear feet of street connected to the property. Intuitively, attributes related to the size of a house are likely important factors regarding the price of the house. Therefore, dropping these variables seems ill-advised.

Our solution was based on the assumption that houses in the same neighborhood likely have similar features. Thus, we imputed the missing Lot Frontage values based on the median Lot Frontage for the neighborhood in which the house with missing value was located.

4) Imputing with mode:

Most variables have some intuitive relationship to other variables, and imputation can be based on these related features. But some missing values are found in variables with no apparent relation to others. For example, the Electrical variable, which describes the electrical system, was missing for a single observation.

Our solution was to simply find the most common class for this categorical variable and impute for this missing value.

Ordinal Categories

For linear (but not tree-based) models, categorical variables must be treated as continuous. There are two types of categorical features: ordinal, where there is an inherent order to the classes (e.g., Excellent is greater than Good, which is greater than Fair), and nominal, where there is no obvious order (e.g., red, green, and blue).

Our solution for ordinal variables was to simply assign the classes a number corresponding to their relative ranks. For example, Kitchen Quality has five classes: Excellent, Good, Typical, Fair, and Poor, which we encoded (i.e., converted) to the numbers 5, 4, 3, 2, and 1, respectively.

Nominal Categories

The ranking of nominal categories is not appropriate as there is no actual rank among the classes.

Our solution was to one-hot encode these variables, which creates a new variable for each class with values of zero (not present) or one (present).

Outliers

An outlier can be defined with a quantitative (i.e., statistical) or qualitative definition. We opted for the qualitative version when looking for outliers: observations that are abnormally far from other values. Viewing the relationship between Above Ground Living Area and Sale Price, we noticed some very large areas for very low prices.

Our solution was to remove these observations as we thought they fit our chosen definition of an outlier, and because they might increase our models’ errors.

Skewness

While there are few assumptions regarding the independent variables of regression models, often transforming skewed variables to a normal distribution can improve model performance.

Our solution was to log + 1 transform several of the predictors.

Near Zero Predictors

Predictors with very low variance offer little predictive power to models.

Our solution was to find the ratio of the second most frequent value to the most frequent value for each predictor, and to remove variables where this ratio was less than 0.05. This roughly translates to dropping variables where 95% or more of the values are the same.

Feature Engineering

Feature (variable or predictor) engineering is one of the most important steps in model creation. Often there is valuable information “hidden” in the predictors that is only revealed when manipulating these features in some way. Below are just some examples of the features we created:

• Remodeled (categorical): Yes or No if Year Built is different from Year Remodeled; If the year the house was remodeled is different from the year it was built, the remodeling likely increases property value
• Seasonality (categorical): Combined Month Sold with Year Sold; While more houses were sold during summer months, this likely varies across years, especially during the time period these houses were sold, which coincides with the housing crash (2006-2010)
• New House (categorical): Yes or No if Year Sold is equal to Year Built; If a house was sold the same year it was built, we might expect it was in high demand and might have a higher Sale Price
• Total Area (continuous): Sum of all variables that describe the area of different sections of a house; There are many variables that pertain to the square footage of different aspects of each house, we might expect that the total square footage has a strong influence on Sale Price

Models and Results

Now that we have prepared our data set, we can begin training our models and use them to predict Sale Price.

We trained and tested dozens of versions of the models described below with different combinations of engineered features and processed variables. The information in the table represents our best results for each model. The table explains the pros and cons for each model type, the optimal hyperparameters found through either grid search or Bayesian optimization, our test score, and the score we received from Kaggle. Our scores the root mean square error (RMSE) of our predictions, which is a metric for describing the difference between the observed values and our predicted values for Sale Price; scores closer to zero are better.

For brevity, we will not describe the details of the different models. However, see the following links for more information about how each model is used to create predictions: random forest, gradient boost, XGBoost, elastic net regularization for regression.

Below are plots summarizing variables that contribute most to the respective model’s prediction of Sale Price.

For most models, predictors related to square footage (Area), quality (different Quality measures), and age (Year Built) have the strongest impact on each model’s predictions.

There is no visualization for our best model, which was an ensemble of four other models. The predictions for this ensembled model are calculated by averaging the predictions from the separate models (two linear regression models and two tree-based models). The idea is that each model’s predictions include error both above and below the real values, and the averaged predictions of the best models might have less overall error than any one single model.

One note is that tree-based models (random forest, gradient boosting, and XGBoosting) cannot provide an idea of positive or negative influence for each variable on Sale Price, rather they can only provide an idea of how important that variable is to the models’ predictions overall. In contrast, linear models can provide information about which variables positively and negatively influence Sale Price. For the figure immediately above, the strongest predictor, residency in a Commercial Zone, is actually negatively related to Sale Price.

Conclusions

The objective of this Kaggle competition was to build models to predict housing prices of different residences in Ames, IA. Our best model resulted in an RMSE of 0.1071, which translates to an error of about $9000 (or about 5%) for the average-priced house. While this error is quite low, the interpretability of our model is poor. Each model found within our ensembled model varies with respect to the variables that are most important to predicting Sale Price. The best way to interpret our ensemble is to look for shared variables among its constituent models. The variables seen as most important or as strongest predictors through our models were those related to square footage, the age and condition of the home, the neighborhood where the house was located, the city zone where the house was located, and the year the house was sold. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### «smooth» package for R. Common ground. Part II. Estimators R bloggers - Mon, 11/20/2017 - 18:21 (This article was first published on R – Modern Forecasting, and kindly contributed to R-bloggers) A bit about estimates of parameters Hi everyone! Today I want to tell you about parameters estimation of smooth functions. But before going into details, there are several things that I want to note. In this post we will discuss bias, efficiency and consistency of estimates of parameters, so I will use phrases like “efficient estimator”, implying that we are talking about some optimisation mechanism that gives efficient estimates of parameters. It is probably not obvious for people without statistical background to understand what the hell is going on and why we should care, so I decided to give a brief explanation. Although there are strict statistical definitions of the aforementioned terms (you can easily find them in Wikipedia or anywhere else), I do not want to copy-paste them here, because there are only a couple of important points worth mentioning in our context. So, let’s get started. Bias refers to the expected difference between the estimated value of parameter (on a specific sample) and the “true” one. Having unbiased estimates of parameters is important because they should lead to more accurate forecasts (at least in theory). For example, if the estimated parameter is equal to zero, while in fact it should be 0.5, then the model would not take the provided information into account correctly and as a result will produce less accurate point forecasts and incorrect prediction intervals. In inventory this may mean that we constantly order 100 units less than needed only because the parameter is lower than it should be. Efficiency means that if the sample size increases, then the estimated parameters will not change substantially, they will vary in a narrow range (variance of estimates will be small). In the case with inefficient estimates the increase of sample size from 50 to 51 observations may lead to the change of a parameter from 0.1 to, let’s say, 10. This is bad because the values of parameters usually influence both point forecasts and prediction intervals. As a result the inventory decision may differ radically from day to day. For example, we may decide that we urgently need 1000 units of product on Monday, and order it just to realise on Tuesday that we only need 100. Obviously this is an exaggeration, but no one wants to deal with such an erratic stocking policy, so we need to have efficient estimates of parameters. Consistency means that our estimates of parameters will get closer to the stable values (what statisticians would refer to as “true” values) with the increase of the sample size. This is important because in the opposite case estimates of parameters will diverge and become less and less realistic. This once again influences both point forecasts and prediction intervals, which will be less meaningful than they should have been. In a way consistency means that with the increase of the sample size the parameters will become more efficient and less biased. This in turn means that the more observations we have, the better. There is a prejudice in the world of practitioners that the situation in the market changes so fast that the old observations become useless very fast. As a result many companies just through away the old data. Although, in general the statement about the market changes is true, the forecasters tend to work with the models that take this into account (e.g. Exponential smoothing, ARIMA). These models adapt to the potential changes. So, we may benefit from the old data because it allows us getting more consistent estimates of parameters. Just keep in mind, that you can always remove the annoying bits of data but you can never un-throw away the data. Having clarified these points, we can proceed to the topic of today’s post. One-step-ahead estimators of smooth functions We already know that the default estimator used for smooth functions is Mean Squared Error (for one step ahead forecast). If the residuals are distributed normally / log-normally, then the minimum of MSE gives estimates that also maximise the respective likelihood function. As a result the estimates of parameters become nice: consistent and efficient. It is also known in statistics that minimum of MSE gives mean estimates of the parameters, which means that MSE also produces unbiased estimates of parameters (if the model is specified correctly and bla-bla-bla). This works very well, when we deal with symmetric distributions of random variables. But it may perform poorly otherwise. In this post we will use the series N1823 for our examples: library(Mcomp) x <- ts(c(M3$N1823$x,M3$N1823$xx),frequency=frequency(M3$N1823$x)) Plot the data in order to see what we have: plot(x) N1823 series The data seems to have slight multiplicative seasonality, which changes over the time, but it is hard to say for sure. Anyway, in order to simplify things, we will apply an ETS(A,A,N) model to this data, so we can see how the different estimators behave. We will withhold 18 observations as it is usually done for monthly data in M3. ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T) N1823 and ETS(A,A,N) with MSE Here’s the output: Time elapsed: 0.08 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0.147 0.000 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 629.249 Cost function type: MSE; Cost function value: 377623.069 Information criteria: AIC AICc BIC 1703.389 1703.977 1716.800 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -14%; Bias: -74.1%; MAPE: 16.8%; SMAPE: 15.1% MASE: 0.855; sMAE: 13.4%; RelMAE: 1.047; sMSE: 2.4% It is hard to make any reasonable conclusions from the graph and the output, but it seems that we slightly overforecast the data. At least the prediction interval covers all the values in the holdout. Relative MAE is equal to 1.047, which means that the model produced forecasts less accurate than Naive. Let’s have a look at the residuals: qqnorm(resid(ourModel)) qqline(resid(ourModel)) QQ-plot of the residuals from ETS(A,A,N) with MSE The residuals of this model do not look normal, a lot of empirical quantiles a far from the theoretical ones. If we conduct Shapiro-Wilk test, then we will have to reject the hypothesis of normality for the residuals on 5%: shapiro.test(resid(ourModel)) > p-value = 0.001223 This may indicate that other estimators may do a better job. And there is a magical parameter cfType in the smooth functions which allows to estimate models differently. It controls what to use and how to use it. You can select the following estimators instead of MSE: • MAE – Mean Absolute Error: \label{eq:MAE} \text{MAE} = \frac{1}{T} \sum_{t=1}^T |e_{t+1}| The minimum of MAE gives median estimates of the parameters. MAE is considered to be a more robust estimator than MSE. If you have asymmetric distribution, give MAE a try. It gives consistent, but not efficient estimates of parameters. Asymptotically, if the distribution of the residuals is normal, the estimators of MAE converge to the estimators of MSE (which follows from the connection between mean and median of normal distribution). Let’s see what happens with the same model, on the same data when we use MAE: ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MAE") N1823 and ETS(A,A,N) with MAE Time elapsed: 0.09 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0.101 0.000 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 636.546 Cost function type: MAE; Cost function value: 462.675 Information criteria: AIC AICc BIC 1705.879 1706.468 1719.290 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -5.1%; Bias: -32.1%; MAPE: 12.9%; SMAPE: 12.4% MASE: 0.688; sMAE: 10.7%; RelMAE: 0.842; sMSE: 1.5% There are several things to note from the graph and the output. First, the smoothing parameter alpha is smaller than in the previous case. Second, Relative MAE is smaller than one, which means that model in this case outperformed Naive. Comparing this value with the one from the previous model, we can conclude that MAE worked well as an estimator of parameters for this data. Finally, the graph shows that point forecasts go through the middle of the holdout sample, which is reflected with lower values of error measures. The residuals are still not normally distributed, but this is expected, because they won’t become normal just because we used a different estimator: QQ-plot of the residuals from ETS(A,A,N) with MAE • HAM – Half Absolute Moment: \label{eq:HAM} \text{HAM} = \frac{1}{T} \sum_{t=1}^T \sqrt{|e_{t+1}|} This is even more robust estimator than MAE. On count data its minimum corresponds to the mode of data. In case of continuous data the minimum of this estimator corresponds to something not yet well studied, between mode and median. The paper about this thing is currently in a draft stage, but you can already give it a try, if you want. This is also consistent, but not efficient estimator. The same example, the same model, but HAM as an estimator: ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="HAM") N1823 and ETS(A,A,N) with HAM Time elapsed: 0.06 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0.001 0.001 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 666.439 Cost function type: HAM; Cost function value: 19.67 Information criteria: AIC AICc BIC 1715.792 1716.381 1729.203 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -1.7%; Bias: -14.1%; MAPE: 11.4%; SMAPE: 11.4% MASE: 0.63; sMAE: 9.8%; RelMAE: 0.772; sMSE: 1.3% This estimator produced even more accurate forecasts in this example, forcing smoothing parameters to become close to zero. Note that the residuals standard deviation in case of HAM is larger than in case of MAE which in its turn is larger than in case of MSE. This means that one-step-ahead parametric and semiparametric prediction intervals will be wider in case of HAM than in case of MAE, than in case of MSE. However, taking that the smoothing parameters in the last model are close to zero, the multiple steps ahead prediction intervals of HAM may be narrower than the ones of MSE. Finally, it is worth noting that the optimisation of models using different estimators takes different time. MSE is the slowest, while HAM is the fastest estimator. I don’t have any detailed explanation of this, but this obviously happens because of the form of the cost functions surfaces. So if you are in a hurry and need to estimate something somehow, you can give HAM a try. Just remember that information criteria may become inapplicable in this case. Multiple-steps-ahead estimators of smooth functions While these three estimators above are calculated based on the one-step-ahead forecast error, the next three are based on multiple steps ahead estimators. They can be useful if you want to have a more stable and “conservative” model (a paper on this topic is currently in the final stage). Prior to v2.2.1 these estimators had different names, be aware! • MSE$$_h$$ – Mean Squared Error for h-steps ahead forecast: \label{eq:MSEh} \text{MSE}_h = \frac{1}{T} \sum_{t=1}^T e_{t+h}^2 The idea of this estimator is very simple: if you are interested in 5 steps ahead forecasts, then optimise over this horizon, not one-step-ahead. However, by using this estimator, we shrink the smoothing parameters towards zero, forcing the model to become closer to the deterministic and robust to outliers. This applies both to ETS and ARIMA, but the models behave slightly differently. The effect of shrinkage increases with the increase of $$h$$. The forecasts accuracy may increase for that specific horizon, but it almost surely will decrease for all the other horizons. Keep in mind that this is in general not efficient and biased estimator with a much slower convergence to the true value than the one-step-ahead estimators. This estimator is eventually consistent, but it may need a very large sample to become one. This means that this estimator may result in values of parameters very close to zero even if they are not really needed for the data. I personally would advise using this thing on large samples (for instance, on high frequency data). By the way, Nikos Kourentzes, Rebecca Killick and I are working on a paper on that topic, so stay tuned. Here’s what happens when we use this estimator: ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="MSEh") N1823 and ETS(A,A,N) with MSEh Time elapsed: 0.24 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0 0 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 657.781 Cost function type: MSEh; Cost function value: 550179.34 Information criteria: AIC AICc BIC 30393.86 30404.45 30635.25 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -10.4%; Bias: -62%; MAPE: 14.9%; SMAPE: 13.8% MASE: 0.772; sMAE: 12.1%; RelMAE: 0.945; sMSE: 1.8% As you can see, the smoothing parameters are now equal to zero, which gives us the straight line going through all the data. If we had 1008 observations instead of 108, the parameters would not be shrunk to zero, because the model would need to adapt to changes in order to minimise the respective cost function. • TMSE – Trace Mean Squared Error: The need for having a specific 5 steps ahead forecast is not common, so it makes sense to work with something that deals with one to h steps ahead: \label{TMSE} \text{TMSE} = \sum_{j=1}^h \frac{1}{T} \sum_{t=1}^T e_{t+j}^2 This estimator is more reasonable than MSE$$_h$$ because it takes into account all the errors from one to h-steps-ahead forecasts. This is a desired behaviour in inventory management, because we are not so much interested in how much we will sell tomorrow or next Monday, but rather how much we will sell starting from tomorrow till the next Monday. However, the variance of forecast errors h-steps-ahead is usually larger than the variance of one-step-ahead errors (because of the increasing uncertainty), which leads to the effect of “masking”: the latter is hidden behind the former. As a result if we use TMSE as the estimator, the final values are seriously influenced by the long term errors than the short term ones (see Taieb and Atiya, 2015 paper). This estimator is not recommended if short-term forecasts are more important than long term ones. Plus, this is still less efficient and more biased estimator than one-step-ahead estimators, with slow convergence to the true values, similar to MSE$$_h$$, but slightly better. This is what happens in our example: ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="TMSE") N1823 and ETS(A,A,N) with TMSE Time elapsed: 0.2 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0.075 0.000 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 633.48 Cost function type: TMSE; Cost function value: 7477097.717 Information criteria: AIC AICc BIC 30394.36 30404.94 30635.75 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -7.5%; Bias: -48.9%; MAPE: 13.4%; SMAPE: 12.6% MASE: 0.704; sMAE: 11%; RelMAE: 0.862; sMSE: 1.5% Comparing the model estimated using TMSE with the same one estimated using MSE and MSE$$_h$$, it is worth noting that the smoothing parameters in this model are greater than in case of MSE$$_h$$, but less than in case of MSE. This demonstrates that there is a shrinkage effect in TMSE, forcing the parameters towards zero, but the inclusion of one-step-ahead errors makes model slightly more flexible than in case of MSE$$_h$$. Still, it is advised to use this estimator on large samples, where the estimates of parameters become more efficient and less biased. • GTMSE – Geometric Trace Mean Squared Error: This is similar to TMSE, but derived from the so called Trace Forecast Likelihood (which I may discuss at some point in one of the future posts). The idea here is to take logarithms of each MSE$$_j$$ and then sum them up: \label{eq:GTMSE} \text{GTMSE} = \sum_{j=1}^h \log \left( \frac{1}{T} \sum_{t=1}^T e_{t+j}^2 \right) Logarithms make variances of errors on several steps ahead closer to each other. For example, if the variance of one-step-ahead error is equal to 100 and the variance of 10 steps ahead error is equal to 1000, then their logarithms will be 4.6 and 6.9 respectively. As a result when GTMSE is used as an estimator, the model will take into account both short and long term errors. So this is a more balanced estimator of parameters than MSE$$_h$$ and TMSE. This estimator is more efficient than both TMSE and MSE$$_j$$ because of the log-scale and converges to true values faster than the previous two, but still can be biased on small samples. ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="GTMSE") N1823 and ETS(A,A,N) with GTMSE Time elapsed: 0.18 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0 0 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 649.253 Cost function type: GTMSE; Cost function value: 232.419 Information criteria: AIC AICc BIC 30402.77 30413.36 30644.16 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -8.2%; Bias: -53.8%; MAPE: 13.8%; SMAPE: 12.9% MASE: 0.72; sMAE: 11.3%; RelMAE: 0.882; sMSE: 1.6% In our example GTMSE shrinks both smoothing parameters towards zero and makes the model deterministic, which corresponds to MSE$$_h$$. However the initial values are slightly different, which leads to slightly different forecasts. Once again, it is advised to use this estimator on large samples. Keep in mind that all those multiple steps ahead estimators take more time for the calculation, because the model needs to produce h-steps-ahead forecasts from each observation in sample. • Analytical multiple steps ahead estimators. There is also a non-documented feature in smooth functions (currently available only for pure additive models) – analytical versions of multiple steps ahead estimators. In order to use it, we need to add “a” in front of the desired estimator: aMSE$$_h$$, aTMSE, aGTMSE. In this case only one-step-ahead forecast error is produced and after that the structure of the applied state-space model is used in order to reconstruct multiple steps ahead estimators. This feature is useful if you want to use the multiple steps ahead estimator on small samples, where the multi-steps errors cannot be calculated appropriately. It is also useful in cases of large samples, when the time of estimation is important. These estimators have similar properties to their empirical counterparts, but work faster and are based on asymptotic properties. Here is an example of analytical MSE$$_h$$ estimator: ourModel <- es(x,"AAN",silent=F,intervals="p",h=18,holdout=T,cfType="aMSEh") N1823 and ETS(A,A,N) with aMSEh Time elapsed: 0.11 seconds Model estimated: ETS(AAN) Persistence vector g: alpha beta 0 0 Initial values were optimised. 5 parameters were estimated in the process Residuals standard deviation: 627.818 Cost function type: aMSEh; Cost function value: 375907.976 Information criteria: AIC AICc BIC 30652.15 30662.74 30893.55 95% parametric prediction intervals were constructed 100% of values are in the prediction interval Forecast errors: MPE: -1.9%; Bias: -14.6%; MAPE: 11.7%; SMAPE: 11.6% MASE: 0.643; sMAE: 10%; RelMAE: 0.787; sMSE: 1.3% The resulting smoothing parameters are shrunk towards zero, similar to MSE$$_h$$, but the initial values are slightly different, which leads to different forecasts. Note that the time elapsed in this case is 0.11 seconds instead of 0.24 as in MSE$$_h$$. The difference in time may increase with the increase of sample size and forecasting horizon. • Similar to MSE, there are empirical multi-step MAE and HAM in smooth functions (e.g. MAE$$_h$$ and THAM). However, they are currently implemented mainly “because I can” and for fun, so I cannot give you any recommendations about them. Conclusions Now that we discussed all the possible estimators that you can use with smooth , you are most probably confused and completely lost. The question that may naturally appear after you have read this post is “What should I do?” Frankly speaking, I cannot give you appropriate answer and a set of universal recommendations, because this is still under-researched problem. However, I have some advice. First, Nikos Kourentzes and Juan Ramon Trapero found that in case of high frequency data (they used solar irradiation data) using MSE$$_h$$ and TMSE leads to the increase in forecasting accuracy in comparison with the conventional MSE. However in order to achieve good accuracy in case of MSE$$_h$$, you need to estimate $$h$$ separate models, while with TMSE you need to estimate only one. So, TMSE is faster than MSE$$_h$$, but at the same time leads to at least as accurate forecasts as in case of MSE$$_h$$ for all the steps from 1 to h. Second, if you have asymmetrically distributed residuals in the model after using MSE, give MAE or HAM a try – they may improve your model and its accuracy. Third, analytical counterparts of multi-step estimators can be useful in one of the two situations: 1. When you deal with very large samples (e.g. high frequency data), want to use advanced estimation methods, but want them to work fast. 2. When you work with small sample, but want to use the properties of these estimators anyway. Finally, don’t use MSE$$_h$$, TMSE and GTMSE if you are interested in the values of parameters of models – they will almost surely be inefficient and biased. This applies to both ETS and ARIMA models, which will become close to their deterministic counterparts in this case. Use conventional MSE instead. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Modern Forecasting. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Dataiku 4.1.0: More support for R users! R bloggers - Mon, 11/20/2017 - 16:24 (This article was first published on R – Longhow Lam's Blog, and kindly contributed to R-bloggers) Introduction Recently, Dataiku 4.1.0 was released, it now offers much more support for R users. But wait a minute, Data-what? I guess some of you do not know Dataiku, so what is Dataiku in the first place? It is a collaborative data science platform created to design and run data products at scale. The main themes of the product are: Collaboration & Orchestration: A data science project often involves a team of people with different skills and interests. To name a few, we have data engineers, data scientists, business analysts, business stakeholders, hardcore coders, R users and Python users. Dataiku provides a platform to accommodate the needs of these different roles to work together on data science projects. Productivity: Whether you like hard core coding or are more GUI oriented, the platform offers an environment for both. A flow interface can handle most of the steps needed in a data science project, and this can be enriched by Python or R recipes. Moreover, a managed notebook environment is integrated in Dataiku to do whatever you want with code. Deployment of data science products: As a data scientist you can produce many interesting stuff, i.e. graphs, data transformations, analysis, predictive models. The Dataiku platform facilitates the deployment of these deliverables, so that others (in your organization) can consume them. There are dashboards, web-apps, model API’s, productionized model API’s and data pipelines. There is a free version which contains already a lot of features to be very useful, and there is an paid version, with “enterprise features“. See for the Dataiku website for more info. Improved R Support in 4.1.0 Among many new features, and the one that interests me the most as an R user, is the improved support for R. In previous versions of Dataiku there was already some support for R, this version has the following improvements. There is now support for: R Code environments In Dataiku you can now create so-called code environments for R (and Python). A code environment is a standalone and self-contained environment to run R code. Each environment can have its own set of packages (and specific versions of packages). Dataiku provides a handy GUI to manage different code environments. The figure below shows an example code environment with specific packages. In Dataiku whenever you make use of R –> in R recipes, Shiny, R Markdown or creating R API’s you can select a specific R code environment to use. R Markdown reports & Shiny applications If you are working in RStudio you most likely already know R Markdown documents and Shiny applications. In this version, you can also create them in Dataiku. Now, why would you do that and not just create them in RStudio? Well, the reports and shiny apps become part of the Dataiku environment and so: • They are managed in the environment. You will have a good overview of all reports and apps and see who has created/edited them. • You can make use of all data that is already available in the Dataiku environment. • Moreover, the resulting reports and Shiny apps can be embedded inside Dataiku dashboards. The figure above shows a R markdown report in Dataiku, the interface provides a nice way to edit the report, alter settings and publish the report. Below is an example dashboard in Dataiku with a markdown and Shiny report. Creating R API’s Once you created an analytical model in R, you want to deploy it and make use of its predictions. With Dataiku you can now easily expose R prediction models as an API. In fact, you can expose any R function as an API. The Dataiku GUI provides an environment where you can easily set up and test an R API’s. Moreover, The Dataiku API Node, which can be installed on a (separate) production server imports the R models that you have created in the GUI and can take care of load balancing, high availability and scaling of real-time scoring. The following three figures give you an overview of how easy it is to work with the R API functionality. First, define an API endpoint and R (prediction) function. Then, define the R function, it can make use of data in Dataiku, R objects created earlier or any R library you need. Then, test and deploy the R function. Dataiku provides a handy interface to test your function/API. Finally, once you are satisfied with the R API you can make a package of the API, that package can then be imported on a production server with Dataiku API node installed. From which you can then serve API requests. Conclusion The new Dataiku 4.1.0 version has a lot to offer for anyone involved in a data science project. The system already has a wide range support for Python, but now with the improved support for R, the system is even more useful to a very large group of data scientists. Cheers, Longhow. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – Longhow Lam's Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest R bloggers - Mon, 11/20/2017 - 15:00 (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers) Customer churn occurs when customers or subscribers stop doing business with a company or service, also known as customer attrition. It is also referred as loss of clients or customers. One industry in which churn rates are particularly useful is the telecommunications industry, because most customers have multiple options from which to choose within a geographic location. Similar concept with predicting employee turnover, we are going to predict customer churn using telecom dataset. We will introduce Logistic Regression, Decision Tree, and Random Forest. But this time, we will do all of the above in R. Let’s get started! Data Preprocessing The data was downloaded from IBM Sample Data Sets. Each row represents a customer, each column contains that customer’s attributes: library(plyr) library(corrplot) library(ggplot2) library(gridExtra) library(ggthemes) library(caret) library(MASS) library(randomForest) library(party) churn <- read.csv('Telco-Customer-Churn.csv') str(churn) 'data.frame': 7043 obs. of 21 variables:$ customerID : Factor w/ 7043 levels "0002-ORFBO","0003-MKNFE",..: 5376 3963 2565 5536 6512 6552 1003 4771 5605 4535 ... $gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...$ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ... $Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...$ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ... $tenure : int 1 34 2 45 2 8 22 10 28 62 ...$ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ... $MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...$ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ... $OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...$ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ... $DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...$ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ... $StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...$ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ... $Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...$ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ... $PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...$ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ... $TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...$ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

The raw data contains 7043 rows (customers) and 21 columns (features). The “Churn” column is our target.

We use sapply to check the number if missing values in each columns. We found that there are 11 missing values in “TotalCharges” columns. So, let’s remove all rows with missing values.

sapply(churn, function(x) sum(is.na(x))) customerID gender SeniorCitizen Partner 0 0 0 0 Dependents tenure PhoneService MultipleLines 0 0 0 0 InternetService OnlineSecurity OnlineBackup DeviceProtection 0 0 0 0 TechSupport StreamingTV StreamingMovies Contract 0 0 0 0 PaperlessBilling PaymentMethod MonthlyCharges TotalCharges 0 0 0 11 Churn 0 churn <- churn[complete.cases(churn), ] Look at the variables, we can see that we have some wranglings to do.

1. We will change “No internet service” to “No” for six columns, they are: “OnlineSecurity”, “OnlineBackup”, “DeviceProtection”, “TechSupport”, “streamingTV”, “streamingMovies”.

cols_recode1 <- c(10:15) for(i in 1:ncol(churn[,cols_recode1])) { churn[,cols_recode1][,i] <- as.factor(mapvalues (churn[,cols_recode1][,i], from =c("No internet service"),to=c("No"))) }

2. We will change “No phone service” to “No” for column “MultipleLines”

churn$MultipleLines <- as.factor(mapvalues(churn$MultipleLines, from=c("No phone service"), to=c("No")))

3. Since the minimum tenure is 1 month and maximum tenure is 72 months, we can group them into five tenure groups: “0–12 Month”, “12–24 Month”, “24–48 Months”, “48–60 Month”, “> 60 Month”

min(churn$tenure); max(churn$tenure) [1] 1 [1] 72 group_tenure = 0 & tenure 12 & tenure 24 & tenure 48 & tenure 60){ return('> 60 Month') } } churn$tenure_group <- sapply(churn$tenure,group_tenure) churn$tenure_group <- as.factor(churn$tenure_group)

4. Change the values in column “SeniorCitizen” from 0 or 1 to “No” or “Yes”.

churn$SeniorCitizen <- as.factor(mapvalues(churn$SeniorCitizen, from=c("0","1"), to=c("No", "Yes")))

5. Remove the columns we do not need for the analysis.

churn$customerID <- NULL churn$tenure <- NULL Exploratory data analysis and feature selection

Correlation between numeric variables

numeric.var <- sapply(churn, is.numeric) corr.matrix <- cor(churn[,numeric.var]) corrplot(corr.matrix, main="\n\nCorrelation Plot for Numerical Variables", method="number")

Gives this plot:

The Monthly Charges and Total Charges are correlated. So one of them will be removed from the model. We remove Total Charges.

churn$TotalCharges <- NULL Bar plots of categorical variables p1 <- ggplot(churn, aes(x=gender)) + ggtitle("Gender") + xlab("Gender") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p2 <- ggplot(churn, aes(x=SeniorCitizen)) + ggtitle("Senior Citizen") + xlab("Senior Citizen") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p3 <- ggplot(churn, aes(x=Partner)) + ggtitle("Partner") + xlab("Partner") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p4 <- ggplot(churn, aes(x=Dependents)) + ggtitle("Dependents") + xlab("Dependents") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() grid.arrange(p1, p2, p3, p4, ncol=2) Gives this plot: p5 <- ggplot(churn, aes(x=PhoneService)) + ggtitle("Phone Service") + xlab("Phone Service") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p6 <- ggplot(churn, aes(x=MultipleLines)) + ggtitle("Multiple Lines") + xlab("Multiple Lines") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p7 <- ggplot(churn, aes(x=InternetService)) + ggtitle("Internet Service") + xlab("Internet Service") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p8 <- ggplot(churn, aes(x=OnlineSecurity)) + ggtitle("Online Security") + xlab("Online Security") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() grid.arrange(p5, p6, p7, p8, ncol=2) Gives this plot: p9 <- ggplot(churn, aes(x=OnlineBackup)) + ggtitle("Online Backup") + xlab("Online Backup") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p10 <- ggplot(churn, aes(x=DeviceProtection)) + ggtitle("Device Protection") + xlab("Device Protection") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p11 <- ggplot(churn, aes(x=TechSupport)) + ggtitle("Tech Support") + xlab("Tech Support") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p12 <- ggplot(churn, aes(x=StreamingTV)) + ggtitle("Streaming TV") + xlab("Streaming TV") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() grid.arrange(p9, p10, p11, p12, ncol=2) Gives this plot: p13 <- ggplot(churn, aes(x=StreamingMovies)) + ggtitle("Streaming Movies") + xlab("Streaming Movies") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p14 <- ggplot(churn, aes(x=Contract)) + ggtitle("Contract") + xlab("Contract") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p15 <- ggplot(churn, aes(x=PaperlessBilling)) + ggtitle("Paperless Billing") + xlab("Paperless Billing") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p16 <- ggplot(churn, aes(x=PaymentMethod)) + ggtitle("Payment Method") + xlab("Payment Method") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() p17 <- ggplot(churn, aes(x=tenure_group)) + ggtitle("Tenure Group") + xlab("Tenure Group") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal() grid.arrange(p13, p14, p15, p16, p17, ncol=2) Gives this plot: All of the categorical variables seem to have a reasonably broad distribution, therefore, all of them will be kept for the further analysis. Logistic Regression First, we split the data into training and testing sets intrain<- createDataPartition(churn$Churn,p=0.7,list=FALSE) set.seed(2017) training<- churn[intrain,] testing<- churn[-intrain,]

Confirm the splitting is correct

dim(training); dim(testing) [1] 4924 19 [1] 2108 19

Fitting the Logistic Regression Model

LogModel <- glm(Churn ~ .,family=binomial(link="logit"),data=training) print(summary(LogModel)) Call: glm(formula = Churn ~ ., family = binomial(link = "logit"), data = training) Deviance Residuals: Min 1Q Median 3Q Max -2.0370 -0.6793 -0.2861 0.6590 3.1608 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.030149 1.008223 -2.014 0.044053 * genderMale -0.039006 0.077686 -0.502 0.615603 SeniorCitizenYes 0.194408 0.101151 1.922 0.054611 . PartnerYes -0.062031 0.092911 -0.668 0.504363 DependentsYes -0.017974 0.107659 -0.167 0.867405 PhoneServiceYes -0.387097 0.788745 -0.491 0.623585 MultipleLinesYes 0.345052 0.214799 1.606 0.108187 InternetServiceFiber optic 1.022836 0.968062 1.057 0.290703 InternetServiceNo -0.829521 0.978909 -0.847 0.396776 OnlineSecurityYes -0.393647 0.215993 -1.823 0.068379 . OnlineBackupYes -0.113220 0.213825 -0.529 0.596460 DeviceProtectionYes -0.025797 0.213317 -0.121 0.903744 TechSupportYes -0.306423 0.220920 -1.387 0.165433 StreamingTVYes 0.399209 0.395912 1.008 0.313297 StreamingMoviesYes 0.338852 0.395890 0.856 0.392040 ContractOne year -0.805584 0.127125 -6.337 2.34e-10 *** ContractTwo year -1.662793 0.216346 -7.686 1.52e-14 *** PaperlessBillingYes 0.338724 0.089407 3.789 0.000152 *** PaymentMethodCredit card (automatic) -0.018574 0.135318 -0.137 0.890826 PaymentMethodElectronic check 0.373214 0.113029 3.302 0.000960 *** PaymentMethodMailed check 0.001400 0.136446 0.010 0.991815 MonthlyCharges -0.005001 0.038558 -0.130 0.896797 tenure_group0-12 Month 1.858899 0.205956 9.026 < 2e-16 *** tenure_group12-24 Month 0.968497 0.201452 4.808 1.53e-06 *** tenure_group24-48 Month 0.574822 0.185500 3.099 0.001943 ** tenure_group48-60 Month 0.311790 0.200096 1.558 0.119185 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 5702.8 on 4923 degrees of freedom Residual deviance: 4094.4 on 4898 degrees of freedom AIC: 4146.4 Number of Fisher Scoring iterations: 6

Feature Analysis
The top three most-relevant features include Contract, tenure_group and PaperlessBilling.

anova(LogModel, test="Chisq") Analysis of Deviance Table Model: binomial, link: logit Response: Churn Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 4923 5702.8 gender 1 0.39 4922 5702.4 0.5318602 SeniorCitizen 1 95.08 4921 5607.3 < 2.2e-16 *** Partner 1 107.29 4920 5500.0 < 2.2e-16 *** Dependents 1 27.26 4919 5472.7 1.775e-07 *** PhoneService 1 1.27 4918 5471.5 0.2597501 MultipleLines 1 9.63 4917 5461.8 0.0019177 ** InternetService 2 452.01 4915 5009.8 < 2.2e-16 *** OnlineSecurity 1 183.83 4914 4826.0 < 2.2e-16 *** OnlineBackup 1 69.94 4913 4756.1 < 2.2e-16 *** DeviceProtection 1 47.58 4912 4708.5 5.287e-12 *** TechSupport 1 82.78 4911 4625.7 < 2.2e-16 *** StreamingTV 1 4.90 4910 4620.8 0.0269174 * StreamingMovies 1 0.36 4909 4620.4 0.5461056 Contract 2 309.25 4907 4311.2 < 2.2e-16 *** PaperlessBilling 1 14.21 4906 4297.0 0.0001638 *** PaymentMethod 3 38.85 4903 4258.1 1.867e-08 *** MonthlyCharges 1 0.10 4902 4258.0 0.7491553 tenure_group 4 163.67 4898 4094.4 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Analyzing the deviance table we can see the drop in deviance when adding each variable one at a time. Adding InternetService, Contract and tenure_group significantly reduces the residual deviance. The other variables such as PaymentMethod and Dependents seem to improve the model less even though they all have low p-values.

Assessing the predictive ability of the Logistic Regression model

testing$Churn <- as.character(testing$Churn) testing$Churn[testing$Churn=="No"] <- "0" testing$Churn[testing$Churn=="Yes"] <- "1" fitted.results <- predict(LogModel,newdata=testing,type='response') fitted.results 0.5,1,0) misClasificError <- mean(fitted.results != testing$Churn) print(paste('Logistic Regression Accuracy',1-misClasificError)) [1] "Logistic Regression Accuracy 0.796489563567362" Logistic Regression Confusion Matrix print("Confusion Matrix for Logistic Regression"); table(testing$Churn, fitted.results > 0.5) [1] "Confusion Matrix for Logistic Regression" FALSE TRUE 0 1392 156 1 273 287

Odds Ratio
One of the interesting performance measurements in logistic regression is Odds Ratio.Basically, Odds ratio is what the odds of an event is happening.

exp(cbind(OR=coef(LogModel), confint(LogModel))) Waiting for profiling to be done... OR 2.5 % 97.5 % (Intercept) 0.1313160 0.01815817 0.9461855 genderMale 0.9617454 0.82587632 1.1199399 SeniorCitizenYes 1.2145919 0.99591418 1.4807053 PartnerYes 0.9398537 0.78338071 1.1276774 DependentsYes 0.9821863 0.79488224 1.2124072 PhoneServiceYes 0.6790251 0.14466019 3.1878587 MultipleLinesYes 1.4120635 0.92707245 2.1522692 InternetServiceFiber optic 2.7810695 0.41762316 18.5910286 InternetServiceNo 0.4362582 0.06397364 2.9715699 OnlineSecurityYes 0.6745919 0.44144273 1.0296515 OnlineBackupYes 0.8929545 0.58709919 1.3577947 DeviceProtectionYes 0.9745328 0.64144877 1.4805460 TechSupportYes 0.7360754 0.47707096 1.1344691 StreamingTVYes 1.4906458 0.68637788 3.2416264 StreamingMoviesYes 1.4033353 0.64624171 3.0518161 ContractOne year 0.4468271 0.34725066 0.5717469 ContractTwo year 0.1896086 0.12230199 0.2861341 PaperlessBillingYes 1.4031556 1.17798691 1.6725920 PaymentMethodCredit card (automatic) 0.9815977 0.75273387 1.2797506 PaymentMethodElectronic check 1.4523952 1.16480721 1.8145076 PaymentMethodMailed check 1.0014007 0.76673087 1.3092444 MonthlyCharges 0.9950112 0.92252949 1.0731016 tenure_group0-12 Month 6.4166692 4.30371945 9.6544837 tenure_group12-24 Month 2.6339823 1.78095906 3.9256133 tenure_group24-48 Month 1.7768147 1.23988035 2.5676783 tenure_group48-60 Month 1.3658675 0.92383315 2.0261505 Decision Tree

Decision Tree visualization
For illustration purpose, we are going to use only three variables for plotting Decision Trees, they are “Contract”, “tenure_group” and “PaperlessBilling”.

tree <- ctree(Churn~Contract+tenure_group+PaperlessBilling, training) plot(tree, type='simple')

Gives this plot:

1. Out of three variables we use, Contract is the most important variable to predict customer churn or not churn.
2. If a customer in a one-year or two-year contract, no matter he (she) has PapelessBilling or not, he (she) is less likely to churn.
3. On the other hand, if a customer is in a month-to-month contract, and in the tenure group of 0–12 month, and using PaperlessBilling, then this customer is more likely to churn.

Decision Tree Confusion Matrix
We are using all the variables to product confusion matrix table and make predictions.

pred_tree <- predict(tree, testing) print("Confusion Matrix for Decision Tree"); table(Predicted = pred_tree, Actual = testing$Churn) [1] "Confusion Matrix for Decision Tree" Actual Predicted No Yes No 1395 346 Yes 153 214 Decision Tree Accuracy p1 <- predict(tree, training) tab1 <- table(Predicted = p1, Actual = training$Churn) tab2 <- table(Predicted = pred_tree, Actual = testing$Churn) print(paste('Decision Tree Accuracy',sum(diag(tab2))/sum(tab2))) [1] "Decision Tree Accuracy 0.763282732447818" The accuracy for Decision Tree has hardly improved. Let’s see if we can do better using Random Forest. Random Forest Random Forest Initial Model rfModel <- randomForest(Churn ~., data = training) print(rfModel) Call: randomForest(formula = Churn ~ ., data = training) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 4 OOB estimate of error rate: 20.65% Confusion matrix: No Yes class.error No 3245 370 0.1023513 Yes 647 662 0.4942704 The error rate is relatively low when predicting “No”, and the error rate is much higher when predicting “Yes”. Random Forest Prediction and Confusion Matrix pred_rf <- predict(rfModel, testing) caret::confusionMatrix(pred_rf, testing$Churn) Confusion Matrix and Statistics Reference Prediction No Yes No 1381 281 Yes 167 279 Accuracy : 0.7875 95% CI : (0.7694, 0.8048) No Information Rate : 0.7343 P-Value [Acc > NIR] : 9.284e-09 Kappa : 0.4175 Mcnemar's Test P-Value : 9.359e-08 Sensitivity : 0.8921 Specificity : 0.4982 Pos Pred Value : 0.8309 Neg Pred Value : 0.6256 Prevalence : 0.7343 Detection Rate : 0.6551 Detection Prevalence : 0.7884 Balanced Accuracy : 0.6952 'Positive' Class : No

Random Forest Error Rate

plot(rfModel)

Gives this plot:

We use this plot to help us determine the number of trees. As the number of trees increases, the OOB error rate decreases, and then becomes almost constant. We are not able to decrease the OOB error rate after about 100 to 200 trees.

Tune Random Forest Model

t <- tuneRF(training[, -18], training[, 18], stepFactor = 0.5, plot = TRUE, ntreeTry = 200, trace = TRUE, improve = 0.05)

Gives this plot:

We use this plot to give us some ideas on the number of mtry to choose. OOB error rate is at the lowest when mtry is 2. Therefore, we choose mtry=2.

Fit the Random Forest Model After Tuning

rfModel_new <- randomForest(Churn ~., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE) print(rfModel_new) Call: randomForest(formula = Churn ~ ., data = training, ntree = 200, mtry = 2, importance = TRUE, proximity = TRUE) Type of random forest: classification Number of trees: 200 No. of variables tried at each split: 2 OOB estimate of error rate: 19.7% Confusion matrix: No Yes class.error No 3301 314 0.0868603 Yes 656 653 0.5011459

OOB error rate decreased to 19.7% from 20.65% earlier.

Random Forest Predictions and Confusion Matrix After Tuning

pred_rf_new <- predict(rfModel_new, testing) caret::confusionMatrix(pred_rf_new, testing$Churn) Confusion Matrix and Statistics Reference Prediction No Yes No 1404 305 Yes 144 255 Accuracy : 0.787 95% CI : (0.7689, 0.8043) No Information Rate : 0.7343 P-Value [Acc > NIR] : 1.252e-08 Kappa : 0.3989 Mcnemar's Test P-Value : 4.324e-14 Sensitivity : 0.9070 Specificity : 0.4554 Pos Pred Value : 0.8215 Neg Pred Value : 0.6391 Prevalence : 0.7343 Detection Rate : 0.6660 Detection Prevalence : 0.8107 Balanced Accuracy : 0.6812 'Positive' Class : No The accuracy did not increase but the sensitivity improved, compare with the initial Random Forest model. Random Forest Feature Importance varImpPlot(rfModel_new, sort=T, n.var = 10, main = 'Top 10 Feature Importance') Gives this plot: Summary From the above example, we can see that Logistic Regression and Random Forest performed better than Decision Tree for customer churn analysis for this particular dataset. Throughout the analysis, I have learned several important things: 1. Features such as tenure_group, Contract, PaperlessBilling, MonthlyCharges and InternetService appear to play a role in customer churn. 2. There does not seem to be a relationship between gender and churn. 3. Customers in a month-to-month contract, with PaperlessBilling and are within 12 months tenure, are more likely to churn; On the other hand, customers with one or two year contract, with longer than 12 months tenure, that are not using PaperlessBilling, are less likely to churn. Source code that created this post can be found here. I would be pleased to receive feedback or questions on any of the above. Related Post var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### 9 Jobs for R users from around the world (2017-11-20) R bloggers - Mon, 11/20/2017 - 13:43 To post your R job on the next post Just visit this link and post a new R job to the R community. You can post a job for free (and there are also “featured job” options available for extra exposure). Current R jobs Job seekers: please follow the links below to learn more and apply for your R job of interest: Featured Jobs More New R Jobs 1. Full-Time R&D Database Developer @ Toronto, Canada Crescendo Technology Ltd – Posted by Crescendo Toronto Ontario, Canada 17 Nov 2017 2. Full-Time Customer Success Representative RStudio, Inc. – Posted by jclemens1 Anywhere 17 Nov 2017 3. Full-Time Data Science Engineer Bonify – Posted by arianna@meritocracy Berlin Berlin, Germany 17 Nov 2017 4. Freelance Dashboard/Visualization developer in R/Shiny International Finance Corporation – Posted by Oleksiy Anywhere 14 Nov 2017 5. Part-Time Development of User-Defined Calculations and Graphical Outputs for WHO’s Influenza Data World Health Organization – Posted by aspenh Anywhere 6 Nov 2017 6. Full-Time Data Scientist for H Labs @ Chicago, Illinois, United States Heidrick & Struggles – Posted by Heidrick1 Chicago Illinois, United States 2 Nov 2017 7. Full-Time Business Data Analytics Faculty Maryville University – Posted by tlorden St. Louis Missouri, United States 2 Nov 2017 8. Full-Time PROGRAMMER/SOFTWARE DEVELOPMENT ENGINEER/COMPUTATIONAL AND MACHINE LEARNING SPECIALIST fchiava Cambridge Massachusetts, United States 10 Oct 2017 9. Full-Time Data Scientist @ London Hiscox – Posted by alan.south London England, United Kingdom 13 Sep 2017 In R-users.com you can see all the R jobs that are currently available. R-users Resumes R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free. (you may also look at previous R jobs posts ). var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); ### Find the subset of a raster that best matches another raster R-sig-geo - Mon, 11/20/2017 - 11:37 Hello, I have two raster objects of identical resolution but different sizes. And I need to find a subset of raster B that is most similar to raster A by for example a simple correlation measure. This subset of course needs to have the same size as A. Now I could loop through the dimensions of A and crop all possible subsets of size dim(A) and locate the one that has the highest correlation with A. However, I wonder if there are any more efficient algorithms to solve the problem. library(raster) b <- brick(system.file("external/rlogo.grd", package="raster")) A <- crop(b[[1]], extent(b[[1]], 20, 60, 25, 75)) B <- b[[2]] [[alternative HTML version deleted]] _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo ### Housing Prices in Ames, Iowa: Kaggle’s Advanced Regression Competition R bloggers - Mon, 11/20/2017 - 09:05 (This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers) Introduction Residential real estate prices are fascinating… and frustrating. The homebuyer, the home-seller, the real estate agent, the economist, and the banker are all interested in housing prices, but they’re sufficiently subtle that no one person or industry has a complete understanding. For our third overall project and first group project we were assigned Kaggle’s Advanced Regression Techniques Competition. The goal, for the project and the original competition, was to predict housing prices in Ames, Iowa. While this particular competition is no longer active, the premise proved to be a veritable playground for testing our knowledge of data cleaning, exploratory data analysis, statistics, and, most importantly, machine learning. What follows is the combined work of Quentin Picard, Paul Ton, Kathryn Bryant, and Hans Lau. Data As stated on the Kaggle competition description page, the data for this project was compiled by Dean De Cock for educational purposes, and it includes 79 predictor variables (house attributes) and one target variable (price). As a result of the educational nature of the competition, the data was pre-split into a training set and a test set; the two datasets were given in the forms of csv files, each around 450 KB in size. Each of the predictor variables could fall under one of the following: • lot/land variables • location variables • age variables • basement variables • roof variables • garage variables • kitchen variables • room/bathroom variables • utilities variables • appearance variables • external features (pools, porches, etc.) variables The specifics of the 79 predictor variables are omitted for brevity but can be found in the data_description.txt file on the competition website. The target variable was Sale Price, given in US dollars. Project Overview (Process) The lifecycle of our project was a typical one. We started with data cleaning and basic exploratory data analysis, then proceeded to feature engineering, individual model training, and ensembling/stacking. Of course, the process in practice was not quite so linear and the results of our individual models alerted us to areas in data cleaning and feature engineering that needed improvement. We used root mean squared error (RMSE) of log Sale Price to evaluate model fit as this was the metric used by Kaggle to evaluate submitted models. Data cleaning, EDA, feature engineering, and private train/test splitting (and one spline model!) were all done in R but we used Python for individual model training and ensembling/stacking. Using R and Python in these ways worked well, but the decision to split work in this manner was driven more by timing with curriculum than by anything else. Throughout the project our group wanted to mimic a real-world machine learning project as much as possible, which meant that although we were given both a training set and a test set, we opted to treat the given test set as if it were “future” data. As a result, we further split the Kaggle training data into a private training set and a private testing set, with an 80/20 split, respectively. This allowed us to evaluate models in two ways before predicting on the Kaggle test data: with RMSE of predictions made on the private test set and with cross validation RMSE of the entire training set. Given the above choices, the process for training and evaluating each individual model was broken down as follows: 1. Grid search to tune hyper parameters (if applicable) to private train set 2. Fit model to private train set with tuned parameters 3. Predict on private test set; if RMSE okay, proceed. 4. Fit new model to Kaggle train set with private train hyperparameters 5. Cross-validate on Kaggle train set; if CV RMSE okay, proceed. 6. Predict on Kaggle test set 7. Submit predictions to Kaggle Computing RMSE of predictions made on the private test set served as a group sanity check, in that if anything was amiss with a model we found out at this point and could correct for it before proceeding. The decision to fit a new model to the Kaggle train in step 4 set using the private train hyperparameters found in step 1 was one for which we never felt completely at ease; we were trying to minimize overfitting to the Kaggle training set, but also recognized that we weren’t using all the possible information we could by not re-tuning using the full training set. One benefit to this choice was that we never saw any major discrepancies between our own cross validation scores in step 5 and the Kaggle scores from step 7. With more time, we would have further investigated whether keeping the private train parameters or re-tuning for the final models was more beneficial. The last noteworthy choice we made in our overall process was to feed differently-cleaned datasets to our linear-based models and our tree-based models. In linear-based models (linear, ridge, LASSO, elastic net, spline), prediction values are continuous and sensitive to outliers so we opted to sacrifice information about “rare” houses in favor of gaining better predictions on “common” houses. We also wanted to minimize the likelihood of rare levels only showing up in our test data and/or causing columns of all 0’s in dummifed data. We executed this tradeoff by releveling any nominal categorical variables that contained extremely rare classes, where “rare” was defined to be “occuring in less than 1% of the observations.” For example, the Heating variable had six levels but four of them combined accounted for only about 1% of the observations; so, we combined these four levels into a new ‘other’ level so that releveled Heating had only three levels, all accounting for 1% or more of the observations. Depending on the variable, we either created an ‘other’ level as in the example or we grouped rare levels into existing levels according to level similarity in the variable documentation. We opted not to relevel data fed into our tree-based models because tree predictions are more robust to outliers and rare classes; trees can separate rare observations from others through splits, which prevents common observation predictions from being distorted by rare observations in fitting. Data Cleaning and EDA We were fortunate with the Kaggle data in that it came to us relatively clean. The only basic cleaning tasks were to correct typos in levels of categorical variables, specify numeric or categorical variables in R, and rename variables beginning with numbers to satisfy R’s variable name requirements. There were a number of “quality” and “condition” variables that had levels of Poor, Fair, Typical/Average, Good, and Excellent which we label encoded as integers 1-5 to preserve their inherent ordinality. For nominal categorical variables, we used one-hot encoding from the ‘vtreat’ package in R. (Most machine learning algorithms in Python require all variables to have numeric values, and although R’s machine learning algorithms can handle nominal categorical variables in non-numeric/string form, R’s computations effectively use one-hot encoding under the hood in these cases.) After taking care of these basic cleaning issues, we needed to address missingness in our data. Below is a plot that shows the variables containing missing values and the degree to which missingness occurs in each (shown in yellow): For all variables except Garage Year Built and Lot Frontage, we performed basic mode imputation. Mode imputation was chosen for simplicity and because it could be done with both categorical and numerical data. Of course, mode imputation has the negative side effect of artificially decreasing variance within the affected variable and therefore would not be appropriate for variables with higher degrees of missingness. It was for this reason, in fact, that we approached missingness differently for Garage Year Built and Lot Frontage. For missing values in Garage Year Built, we imputed the Year Built for the house. We justified this because most garages are built at the same time as the house, and houses without garages get no penalty or benefit by having the Garage Year Built equal to the Year Built for the house. For Lot Frontage, the variable with the greatest degree of missingness, was researched and explored in order to arrive at an imputation strategy. Lot Frontage was defined to be the linear feet of street connected to the property. Given that most properties were either regularly or only slightly irregularly shaped according to Lot Shape, we deduced that most properties would have Lot Frontage values that were correlated with Lot Area values. However, since length is measured in units and area is measure in square units, we found it most appropriate to relate log(Lot Frontage) with log(Lot Area), so as to get a linear relationship between the two. See plot below. Thus, we imputed missing values for Lot Frontage by fitting a linear model of log(Lot Frontage) regressed onto log(Lot Area), predicting on the missing values for Lot Frontage using Lot Area, and then exponentiating (inverse log-ing) the result. The next step in EDA after finding and addressing missingness was to look at outliers. Looking at boxplots of both Sale Price and log(Sale Price) we saw that there were quite a few outliers, where ‘outliers’ mathematically defined to be observations lying more than 1.5*IQR (Inner Quartile Range) above the third quartile and below the first quartile. Although removing the outliers may have improved our predictions for average-priced homes, we were hesitant to do so due to the relatively small size of our sample (~1600 observations in Kaggle training set). We felt that removing any outliers without further justification than them simply being outliers would likely jeopardize our predictions for houses in the extremes. Since outliers would have the most impact on the fit of linear-based models, we further investigated outliers by training a basic multiple linear regression model on the Kaggle training set with all observations included; we then looked at the resulting influence and studentized residuals plots: From these, we saw that there were only two observations that could justifiably be removed: observation 1299 and observation 251. These were both beyond or on the lines representing Cook’s Distance, meaning that as individual observations they had a significant impact on the regression formula; as such, we removed these two observations from consideration. The last bit of preprocessing we did was dealing with multicollinearity. This, as for outliers, was cleaning done mostly for the benefit of linear-based models; in fact, it was done only for vanilla multiple linear regression since regularization in ridge, LASSO, and elastic net models deals with collinearity by construction. To eliminate collinearity issues, we used the findLinearCombos() function from the ‘caret’ package in R on our dummified data. This function identified linear combinations between predictor variables and allowed us to easily drop linearly dependent variables. Feature Engineering For feature engineering we took a two-pronged approach: we used anecdotal data/personal knowledge and we did research. • Garage interaction: Garage Quality * Number of Cars Garage Holds. • If a home has a really great or really poor garage, the impact of that quality component on price will be exacerbated by the size of the garage. • Total number of bathrooms: Full Bath + Half Bath + Basement Full Bath + Basement Half Bath. • In our experience, houses are often listed in terms of total number of bedrooms and total number of bathrooms. Our data had total number of bedrooms, but lacked total number of bathrooms. • Average room size: Above-Ground Living Area / Total Number of R00ms Above Ground. • “Open concept” homes have been gaining popularity and homes with large rooms have always been popular, and with the provided variables we believe average room size might address both of these trends. • Bathroom to room ratio: (Full Bath + Half Bath) / Number of Bedrooms Above Ground • The number of bathrooms desired in a house depends on the number of bedrooms. A home with one bedroom will not be viewed nearly as negatively for having only one bathroom as would a house with three bedrooms. • Comparative size of living area: Above-Ground Living Area / mean(Above-Ground Living Area) • This variable attempts to capture house size as it directly compares to other houses, but in a manner different from mere number of rooms. • Landscape-ability interaction: Lot Shape * Land Contour • Landscaped homes tend to sell for more (see this article) and the ability for a lot to be easily landscaped is, in part, determined by the shape of the lot (regular, slightly irregular, irregular, very irregular) and the land contour (level, banked, hillside, depressed). Certain combinations may be workable (regular shape, hillside) while other combinations (very irregular shape, hillside) may make landscaping difficult and/or expensive. Of the six features that we added, only “landscape-ability” resulted from research; we either already had the important industry variables in the data (total above-ground square footage or neighborhood, for example) or we did not have the information we needed to create the desired variables. Additional features we would have liked to have added include geolocation data for proximity to schools and shopping centers, quality of nearby schools, and specific rooms/house features remodeled (if applicable). One step that we glossed over a bit was extensive adding and removing of features. We noticed with just a few trials that removing existing features seemed to negatively impact the performance of our models and that the models improved when we added features. Given that the regularized linear models would give preference to more important features via larger coefficients and tree-based models would give preference to more important features by splitting preferences, we opted not to spend too much time manually adding and removing features. Essentially, we allowed our models to select the most predictive features themselves. Modeling For our individual models, we trained the following: • Multiple linear regression • Ridge regression • LASSO regression • Elastic net regression • Spline regression • Basic decision tree • Random forest • Gradient boosted tree • XGBoosted tree Overall our models consistently had RMSE values between .115 and .145. As stated earlier, our cross validation scores were usually very close to our scores on Kaggle. To our surprise, our linear-based models did very well compared to our tree-based models. Even the much-hyped XGBoost model was at best on par with our linear-based spline and elastic net models, and our random forest models tended to be much worse. An abbreviated table of our results is shown below: Elastic Net Spline Random Forest Gradient Boost XGBoost Ensemble Cross Validation 0.12157 0.11181 0.14171 0.12491 0.12282 0.11227 Held-out Test 0.11762 0.11398 0.13834 0.11403 0.11485 n/a Kaggle 0.12107 0.11796 n/a n/a n/a 0.11710 As we endeavored to find a final, “best” model via ensembling and stacking, we followed the advice of our instructor (and successful Kaggler) Zeyu by attempting to combine models with different strengths and weaknesses. We took the predictions from each of our best base models (Spline, Gradient Boost, XGBoost) as the features to use for the next level. Our best result came from a weighted average of the three, with weights determined by a grid search. The blend of 76% Spline, 14.5% Gradient Boost, and 9.5% Xgboost gave us a 0.11227 RMSE on our training set and 0.11710 on Kaggle. We also experimented with using a second level meta-model to do the stacking, but with a linear meta-model, the coefficients were highly unstable because the predictions were so closely related to each other, and with a gradient boost meta-model, we were unable to beat our best base model alone. Conclusions As mentioned above, we were surprised by the strength of our various linear models in comparison to our tree models. We suspect this had a lot to do with the data itself, in that Sale Price (or rather, log(Sale Price)) likely has a relatively linear relationship with the predictor variables. This highlights one of the most important takeaways from this project: that linear models have a real place in machine learning. In situations like this, where performance between a simpler model and a more complex model are similar, the better interpretability and the ease of training of the simpler model may also prove to be deciding factors on model choice. In most cases, stacking multiple base-layer models into a second-layer is impractical, despite performing slightly better. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### RcppEigen 0.3.3.3.1 R bloggers - Mon, 11/20/2017 - 01:23 (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers) A maintenance release 0.3.3.3.1 of RcppEigen is now on CRAN (and will get to Debian soon). It brings Eigen 3.3.* to R. The impetus was a request from CRAN to change the call to Rcpp::Rcpp.plugin.maker() to only use :: as the function has in fact been exported and accessible for a pretty long time. So now the usage pattern catches up. Otherwise, Haiku-OS is now supported and a minor Travis tweak was made. The complete NEWS file entry follows. Changes in RcppEigen version 0.3.3.3.1 (2017-11-19) • Compilation under Haiku-OS is now supported (Yu Gong in #45). • The Rcpp.plugin.maker helper function is called via :: as it is in fact exported (yet we had old code using :::). • A spurious argument was removed from an example call. • Travis CI now uses https to fetch the test runner script. Courtesy of CRANberries, there is also a diffstat report for the most recent release. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### RcppClassic 0.9.9 R bloggers - Mon, 11/20/2017 - 01:22 (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers) A maintenance release RcppClassic 0.9.9 is now at CRAN. This package provides a maintained version of the otherwise deprecated first Rcpp API; no new projects should use it. Per a request from CRAN, we changed the call to Rcpp::Rcpp.plugin.maker() to only use :: as the function has in fact been exported and accessible for a pretty long time. So now the usage pattern catches up. Courtesy of CRANberries, there are changes relative to the previous release. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page. This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Timing in R R bloggers - Mon, 11/20/2017 - 01:00 (This article was first published on R on The Jumping Rivers Blog, and kindly contributed to R-bloggers) As time goes on, your R scripts are probably getting longer and more complicated, right? Timing parts of your script could save you precious time when re-running code over and over again. Today I’m going to go through the 4 main functions for doing so. Nested timings 1) Sys.time() Sys.time() takes a “snap-shot” of the current time and so it can be used to record start and end times of code. start_time = Sys.time() Sys.sleep(0.5) end_time = Sys.time() To calculate the difference, we just use a simple subtraction end_time - start_time ## Time difference of 0.5061 secs Notice it creates a neat little message for the time difference. 2) The tictoc package You can install the CRAN version of tictoc via install.packages("tictoc") whilst the most recent development is available via the tictoc GitHub page. library("tictoc") Like Sys.time(), tictoc also gives us ability to nest timings within code. However, we now have some more options to customise our timing. At it’s most basic it acts like Sys.time(): tic() Sys.sleep(0.5) toc() ## 0.506 sec elapsed Now for a more contrived example. # start timer for the entire section, notice we can name sections of code tic("total time") # start timer for first subsection tic("Start time til half way") Sys.sleep(2) # end timer for the first subsection, log = TRUE tells toc to give us a message toc(log = TRUE) ## Start time til half way: 2.037 sec elapsed Now to start the timer for the second subsection tic("Half way til end") Sys.sleep(2) # end timer for second subsection toc(log = TRUE) ## Half way til end: 2.012 sec elapsed # end timer for entire section toc(log = TRUE) ## total time: 4.067 sec elapsed We can view the results as a list (format = TRUE returns this list in a nice format), rather than raw code tic.log(format = TRUE) ## [[1]] ## [1] "Start time til half way: 2.037 sec elapsed" ## ## [[2]] ## [1] "Half way til end: 2.012 sec elapsed" ## ## [[3]] ## [1] "total time: 4.067 sec elapsed" We can also reset the log via tic.clearlog() Comparing functions 1) system.time() Why oh WHY did R choose to give system.time() a lower case s and Sys.time() and upper case s? Anyway… system.time() can be used to time functions without having to take note of the start and end times. system.time(Sys.sleep(0.5)) ## user system elapsed ## 0.000 0.000 0.501 system.time(Sys.sleep(1)) ## user system elapsed ## 0.000 0.000 1.003 We only want to take notice of the “elapsed” time, for the definition of the “user” and “system” times see this thread. For a repeated timing, we would use the replicate() function. system.time(replicate(10, Sys.sleep(0.1))) ## user system elapsed ## 0.000 0.000 1.007 2) The microbenchmark package You can install the CRAN version of microbenchmark via install.packages("microbenchmark") Alternatively you can install the latest update via the microbenchmark GitHub page. library("microbenchmark") At it’s most basic, microbenchmark() can we used to time single pieces of code. # times = 10: repeat the test 10 times # unit = "s": output in seconds microbenchmark(Sys.sleep(0.1), times = 10, unit = "s") ## Unit: seconds ## expr min lq mean median uq max neval ## Sys.sleep(0.1) 0.1001 0.1006 0.1005 0.1006 0.1006 0.1006 10 Notice we get a nicely formatted table of summary statistics. We can record our times in anything from seconds to nanoseconds(!!!!). Already this is better than system.time(). Not only that, but we can compare sections of code in an easy-to-do way and name the sections of code for an easy-to-read output. sleep = microbenchmark(sleepy = Sys.sleep(0.1), sleepier = Sys.sleep(0.2), sleepiest = Sys.sleep(0.3), times = 10, unit = "s") As well as this (more?!) microbenchmark comes with a two built-in plotting functions. microbenchmark:::autoplot.microbenchmark(sleep) microbenchmark:::boxplot.microbenchmark(sleep) These provide quick and efficient ways of visualising our timings. Conclusion Sys.time() and system.time() have there place, but for most cases we can do better. The tictoc and microbenchmark packages are particularly useful and make it easy to store timings for later use, and the range of options for both packages stretch far past the options for Sys.time() and system.time(). The built-in plotting functions are handy. Thanks for chatting! var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R on The Jumping Rivers Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Where to live in the Netherlands based on temperature XKCD style R bloggers - Mon, 11/20/2017 - 01:00 (This article was first published on Clean Code, and kindly contributed to R-bloggers) After seeing a plot of best places to live in Spain and the USA based on the weather, I had to chime in and do the same thing for the Netherlands. The idea is simple, determine where you want to live based on your temperature preferences. First the end result: How to read this plot? In this xkcd comic we see that the topleft of the the graph represents “if you hate cold and hate heat”, if you go down from the topleft to the bottom left the winters get colder and ending in “if you love cold and hate heat”. Going to the right the heat (and humidity) go up ending in “if you love cold and love heat”. Going up to the top right: “if you hate cold and love heat”. This post explains how to make the plot, to see where I got the data and what procedures I took look at https://github.com/RMHogervorst/xkcd_weather_cities_nl. Inspiration According to this post by Maële Salmon inspired by xkcd, we can determine our ideal city by looking at wintertemperature and humidex (combination of humidity and summer heat). I’ve seen major cities in the USA (post by Maelle) and where to live in Spain by Claudia Guirao. There is even one on France in French, of course. So I had to make one for the Netherlands too. There is just a small detail, The Netherlands is tiny, the United States is approximately 237 times larger then The Netherlands. From The Hague to the German border directly east is 164 km (101 miles) and from Leeuwarden to Maastricht in the south is 262 km (162 miles). Essentially my home country has a moderate sea climate with warm summers and relatively mild winters. I expect there to be no real big differences between the cities. I use the average temperature over december, january and february for winter temperature and calculate the humidex using the comf package. This humidex is a combination of humidity and temperature. • 20 to 29: Little to no discomfort • 30 to 39: Some discomfort • 40 to 45: Great discomfort; avoid exertion • Above 45: Dangerous; heat stroke quite possible For cities I went the extremely lazy way and took all of the available weatherstations provided by the Dutch weather service (KNMI, — Royal Netherlands, Metereological Institute). There are 46 stations in the Netherlands. These are not always in the cities but sort of broadly describe the entire country. Plot like XKCD The XKCD package has wonderful plotting abilities and a cool font that you have to install. I copied and modified the code from the post of Mäelle, because it is really good! If you want to see how I did the data cleaning go to the github page. First we plot all of the stations in the Netherlands most of these places will probably not be familiar to non-Dutch people. library("xkcd") library("ggplot2") library("extrafont") library("ggrepel") xrange <- range(result$humidex_avg) yrange <- range(result$wintertemp) set.seed(3456) ggplot(result, aes(humidex_avg, wintertemp)) + geom_point() + geom_text_repel(aes(label = NAME), family = "xkcd", max.iter = 50000, size = 3)+ ggtitle("Where to live in The Netherlands \nbased on your temperature preferences", subtitle = "Data from KNMI, 2016-2017") + xlab("Summer heat and humidity via Humidex")+ ylab("Winter temperature in Celsius degrees") + xkcdaxis(xrange = xrange, yrange = yrange)+ theme_xkcd() + theme(text = element_text(size = 13, family = "xkcd")) But what does that mean, in the grand scheme of things? As you might notice the range is very small. It would be interesting to add some US cities and perhaps some Spanisch cities to compare against. For fun I also added the Dutch Caribean islands. xrange2 <- range(c(18,40)) # modified these by hand to increase the margins yrange2 <- range(c(-5,40)) USA <- tribble( ~NAME, ~humidex_avg, ~wintertemp, "DETROIT, MI", 27, 0, "NASHVILLE, TN", 33, 9, "FORT MEYERS, FL",37, 20 ) SPAIN <- tribble( ~NAME, ~humidex_avg, ~wintertemp, "MADRID, SPAIN", 27, 8, "TENERIFE, SPAIN", 24, 13, "BARCELONA, SPAIN",32, 10 ) D_CARI <- tribble( ~NAME, ~humidex_avg, ~wintertemp, "Bonaire, Caribbean Netherlands", 27, calcHumx(29,76), "Sint Eustatius, Caribbean Netherlands", 28, calcHumx(28,77), "Saba, Caribbean Netherlands",30, calcHumx(30,76) ) set.seed(3456) ggplot(result, aes(humidex_avg, wintertemp)) + geom_point(alpha = .7) + geom_text_repel(aes(label = NAME), family = "xkcd", max.iter = 50000, size = 2)+ geom_text_repel(data = USA, aes(humidex_avg, wintertemp, label = NAME), family = "xkcd", max.iter = 50000, size = 2, color = "blue")+ geom_point(data = USA, aes(humidex_avg, wintertemp), color = "blue")+ geom_text_repel(data = SPAIN, aes(humidex_avg, wintertemp, label = NAME), family = "xkcd", max.iter = 50000, size = 2, color = "red")+ geom_point(data = SPAIN, aes(humidex_avg, wintertemp),color = "red")+ geom_text_repel(data = D_CARI, aes(humidex_avg, wintertemp, label = NAME), family = "xkcd", max.iter = 50000, size = 2, color = "orange")+ geom_point(data = D_CARI, aes(humidex_avg, wintertemp), color = "orange")+ labs(title = "Where to live in The Netherlands based on \nyour temperature preferences \nCompared with some places in Spain, Caribbean NL and USA", subtitle = "Data from KNMI, 2016-2017, added Spain and US cities", x = "Summer heat and humidity via Humidex", y = "Winter temperature in Celsius degrees", caption = "includes Caribbean Netherlands" ) + xkcdaxis(xrange = xrange2, yrange = yrange2)+ theme_xkcd() + theme(text = element_text(size = 16, family = "xkcd")) Finally we can compare the wintertemperature and summer humidex in The Netherlands by placing the points on a map using the coordinates of the measure stations. NL <- map_data(map = "world",region = "Netherlands") result %>% rename(LON = LON(east), LAT = LAT(north)) %>% ggplot( aes(LON, LAT))+ geom_point(aes(color = wintertemp))+ geom_text_repel(aes(label = NAME), family = "xkcd", size = 3, max.iter = 50000)+ geom_polygon(data = NL, aes(x=long, y = lat, group = group), fill = NA, color = "black") + coord_map()+ labs(title = "Wintertemperature in NL", subtitle = "data from KNMI (2016,2017", x = "", y = "")+ theme_xkcd()+ theme(text = element_text(size = 16, family = "xkcd")) result %>% rename(LON = LON(east), LAT = LAT(north)) %>% ggplot( aes(LON, LAT))+ geom_point(aes(color = humidex_avg))+ geom_text_repel(aes(label = NAME), family = "xkcd", size = 3, max.iter = 50000)+ geom_polygon(data = NL, aes(x=long, y = lat, group = group), fill = NA, color = "black") + coord_map()+ labs(title = "Humidex in NL", subtitle = "data from KNMI (2016,2017", x = "", y = "")+ theme_xkcd()+ theme(text = element_text(size = 12, family = "xkcd"))+ scale_color_continuous(low = "white", high = "red") Now show us what your country looks like! Where to live in the Netherlands based on temperature XKCD style was originally published by at Clean Code on November 20, 2017. var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Clean Code. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Content Evaluation: What is the Value of Social Media? R bloggers - Mon, 11/20/2017 - 01:00 (This article was first published on Florian Teschner, and kindly contributed to R-bloggers) As most bloggers, I do check my analytics stats on a regular basis. What I do not really look at is social shares. For most blog posts traffic follows a clear pattern; 2 days of increased traffic, followed by a steady decrease to the base traffic volume. The amplitude varies massively depending on how interesting/viral the post was. However, in the long-term this virality does not drive the majority of the traffic but rather the search ranking of individual posts, as 80% of the base traffic comes through organic search results. There is plenty of discussion on the value of a share/like with very different take-aways. I thought it would be great to analyse if social shares do increase my traffic. In addition, I realized that it is pretty easy to get the number of Facebook and LinkedIn shares for each of my blog posts from r-bloggers. (Please find the script at the end of the post.) For the analysis, I scraped the social share numbers and merged them with my Google Analytics stats. In order to compare the traffic between blog posts, I normalise the total traffic per post by number of days the post is online. To get a first picture of the data we can plot the normalized traffic over the number of shares (both by blog post). Next, we can analyse how the variables are correlated. I included the total pageviews as reference. M <- cor(dff[,c("shares", "linkedIn1", "pageviews", "normalizedPageViews")]) corrplot.mixed(M) We see that LinkedIn and Facebook shares are positively correlated with r=.56 and that the correlation between shares and normalized pageviews is the same. When looking at the total page views we see that have a much lower correlation with shares both from facebook as well as Linkedin. To determine the “value” of social media to drive traffic volume, we can regress the number of social shares on the normalized traffic numbers. dff$DaysOnline <- as.numeric(Sys.Date()-dff$Date) fit1 <- lm(normalizedPageViews ~ shares + linkedIn1, data=dff) fit2 <- lm(normalizedPageViews ~ shares + linkedIn1, data=dff) fit3 <- lm(normalizedPageViews ~ shares + linkedIn1+ DaysOnline, data=dff) stargazer(fit1, fit2, fit3, type = "html" ,title = "The value of shares for daily traffic") The value of shares for daily traffic Dependent variable: normalizedPageViews (1) (2) (3) shares 0.043** 0.043** 0.026 (0.017) (0.017) (0.022) linkedIn1 0.138 0.138 0.101 (0.126) (0.126) (0.129) DaysOnline -0.008 (0.006) Constant 1.462 1.462 5.744 (1.114) (1.114) (3.571) Observations 33 33 33 R2 0.341 0.341 0.375 Adjusted R2 0.297 0.297 0.310 Residual Std. Error 5.042 (df = 30) 5.042 (df = 30) 4.994 (df = 29) F Statistic 7.755*** (df = 2; 30) 7.755*** (df = 2; 30) 5.802*** (df = 3; 29) Note: *p<0.1; **p<0.05; ***p<0.01 As in the correlation plot, we see that highly shared posts have more daily visits. If you take the estimates at face value one can also state that a linkedIn share is worth 3 times a facebook share in terms of additional traffic. In the last regression I also included a variable (DaysOnline) to capture my learning effect. The longer post is online the lower is the daily traffic. (e.g. posts published when I started blogging). While writing this post, I realized that the script can also be used to analyse; • which topics are highly shared • which authors are popular • how social sharing changed over time on r-bloggers. Have fun! var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### assigning lower , upper and control arguments R-sig-geo - Sun, 11/19/2017 - 20:33 Dear all I have a question about assigning control, lower and upper bounds arguments in fit.stvariogram function. How we should choose them for different covariance models(like separable, metric,...). Thanks alot. [[alternative HTML version deleted]] _______________________________________________ R-sig-Geo mailing list [hidden email] https://stat.ethz.ch/mailman/listinfo/r-sig-geo ### Automatic output format in Rmarkdown R bloggers - Sun, 11/19/2017 - 19:21 (This article was first published on Data Literacy - The blog of Andrés Gutiérrez, and kindly contributed to R-bloggers) I am writing a Rmarkdown document with plenty of tables, and I want them in a decent format, e.g. kable. However I don’t want to format them one by one. For example, I have created the following data frame in dplyr data2 %>% group_by(uf) %>% summarise(n = n(), ) %>% arrange(desc(n)) One solution to the output format of this data frame would be to name it as an object in R, and then give it a format by using the kable function. t1 <- data2 %>% group_by(uf) %>% summarise(n = n(), ) %>% arrange(desc(n)) knitr::kable(t1) However, if your document has hundreds of these queries and you need a faster way to compile the document, while keeping the kable style automatically, avoiding giving a name to the data frame and even avoiding to call the kable function over that name, you can use the printr package. Just add the following piece of code inside a chunk at the beginning of your document and voilá. library(printr) Now, all of your data frames will have a decent style, and you do not need to worry about this issue. For example, I have knitted a presentation by using printr and the first code in this post, and this is the result: var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: Data Literacy - The blog of Andrés Gutiérrez. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Why is R slow? some explanations and MKL/OpenBLAS setup to try to fix this R bloggers - Sun, 11/19/2017 - 17:00 (This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers) Introduction Many users tell me that R is slow. With old R releases that is 100% true provided old R versions used its own numerical libraries instead of optimized numerical libraries. But, numerical libraries do not explain the complete story. In many cases slow code execution can be attributed to inefficient code and in precise terms because of not doing one or more of these good practises: • Using byte-code compiler • Vectorizing operations • Using simple data structures (i.e using data frames instead of matrices in large computing instances) • Re-using results I would add another good practise: “Use the tidyverse”. Provided tidyverse packages such as dplyr benefit from Rcpp, having a C++ backend can be faster than using dplyr’s equivalents in base (i.e plain vanilla) R. The idea of this post is to clarify some ideas. R does not compete with C or C++ provided they are different languages. R and data.table package may compete with Python and numpy library. This does not mean that I’m defending R over Python or backwards. The reason behind this is that both R and Python implementations consists in an interpreter while in C and C++ it consists in a compiler, and this means that C and C++ will always be faster because in really over-simplifying terms compiler implementations are closer to the machine. Basic setup for general usage As an Ubuntu user I can say the basic R installation from Canonical or CRAN repositories work for most of the things I do on my laptop. When I use RStudio Server Pro© that’s a different story because I really want to optimize things because when I work with large data (i.e. 100GB in RAM) a 3% more of resources efficiency or reduced execution time is valuable. Installing R with OpenBLAS will give you a tremedous performance boost, and that will work for most of laptop situations. I explain how to do that in detail for Ubuntu 17.10 and Ubuntu 16.04 but a general setup would be as simple as one of this two options: # 1: Install from Canonical (default Ubuntu repository) sudo apt-get update && sudo apt-get upgrade sudo apt-get install libopenblas-dev r-base # 2: Install from CRAN mirror sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9 printf '#CRAN mirror\ndeb https://cran.rstudio.com/bin/linux/ubuntu artful/\ndeb-src https://cran.rstudio.com/bin/linux/ubuntu artful/\n' | sudo tee -a /etc/apt/sources.list.d/cran-mirror.list sudo apt-get update && sudo apt-get upgrade sudo apt-get install libopenblas-dev r-base # 3: Install RStudio (bonus) cd ~/Downloads wget https://download1.rstudio.org/rstudio-xenial-1.1.383-amd64.deb sudo apt-get install gdebi sudo gdebi rstudio-xenial-1.1.383-amd64.deb printf '\nexport QT_STYLE_OVERRIDE=gtk\n' | sudo tee -a ~/.profile Being (1) a substitute of (2). It’s totally up to you which one to use and both will give you a really fast R compared to installing it without OpenBLAS. Benchmarking different R setups I already use R with OpenBLAS just like the setup above. I will compile parallel R instances to do the benchmarking. Installing Intel© MKL numerical libraries My benchmarks do indicate that in my case it’s convenient to take the time it takes to install Intel© MKL. The execution time is strongly reduces for some operations when compared to R with OpenBLAS performance. Run this to install MKL: # keys taken from https://software.intel.com/en-us/articles/installing-intel-free-libs-and-python-apt-repo wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list' sudo apt-get update && sudo apt-get install intel-mkl-64bit Installing CRAN R with MKL To compile it from source (in this case it’s the only option) run these lines: # key added after sudo apt-get update returned a warning following this guide: https://support.rstudio.com/hc/en-us/articles/218004217-Building-R-from-source sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9 printf '#CRAN mirror\ndeb https://cran.rstudio.com/bin/linux/ubuntu artful/\ndeb-src https://cran.rstudio.com/bin/linux/ubuntu artful/\n' | sudo tee -a /etc/apt/sources.list.d/cran-mirror.list # you need to enable multiverse repo or packages as xvfb won't be found sudo rm -rf /etc/apt/sources.list printf 'deb http://us.archive.ubuntu.com/ubuntu artful main restricted universe multiverse deb-src http://us.archive.ubuntu.com/ubuntu artful main restricted universe multiverse\n deb http://security.ubuntu.com/ubuntu artful-security main restricted universe multiverse deb-src http://security.ubuntu.com/ubuntu artful-security main restricted universe multiverse\n deb http://us.archive.ubuntu.com/ubuntu artful-updates main restricted universe multiverse deb-src http://us.archive.ubuntu.com/ubuntu artful-updates main restricted universe multiverse\n' | sudo tee -a /etc/apt/sources.list sudo apt-get update sudo apt-get clean sudo apt-get autoclean sudo apt-get autoremove sudo apt-get upgrade --with-new-pkgs sudo apt-get build-dep r-base cd ~/GitHub/r-with-intel-mkl wget https://cran.r-project.org/src/base/R-3/R-3.4.2.tar.gz tar xzvf R-3.4.2.tar.gz cd R-3.4.2 source /opt/intel/mkl/bin/mklvars.sh intel64 MKL="-Wl,--no-as-needed -lmkl_gf_lp64 -Wl,--start-group -lmkl_gnu_thread -lmkl_core -Wl,--end-group -fopenmp -ldl -lpthread -lm" ./configure --prefix=/opt/R/R-3.4.2-intel-mkl --enable-R-shlib --with-blas="$MKL" --with-lapack make && sudo make install printf '\nexport RSTUDIO_WHICH_R=/usr/local/bin/R\nexport RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl\n' | tee -a ~/.profile Installing CRAN R with OpenBLAS

Just not to interfere with working installation I decided to compile a parallel instance from source:

cd ~/GitHub/r-with-intel-mkl/ rm -rf R-3.4.2 tar xzvf R-3.4.2.tar.gz cd R-3.4.2 ./configure --prefix=/opt/R/R-3.4.2-openblas --enable-R-shlib --with-blas --with-lapack make && sudo make install printf 'export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R\n' | tee -a ~/.profile Installing CRAN R with no optimized numerical libraries

There is a lot of discussion and strong evidence from different stakeholders in the R community that do indicate that this is by far the most inefficient option. I compiled this just to make a complete benchmark:

cd ~/GitHub/r-with-intel-mkl/ rm -rf R-3.4.2 tar xzvf R-3.4.2.tar.gz cd R-3.4.2 ./configure --prefix=/opt/R/R-3.4.2-defaults --enable-R-shlib make && sudo make install printf 'export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R\n' | tee -a ~/.profile Installing Microsoft© R Open with MKL

This R version includes MKL by default and it’s supposed to be easy to install. I could not make it run and that’s bad because different articles (like this post by Brett Klamer) state that this R version is really efficient but no different to standard CRAN R with MKL numerical libraries.

In any case here’s the code to install this version:

cd ~/GitHub/r-with-intel-mkl wget https://mran.blob.core.windows.net/install/mro/3.4.2/microsoft-r-open-3.4.2.tar.gz tar xzvf microsoft-r-open-3.4.2.tar.gz cd microsoft-r-open sudo ./install.sh printf 'export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R\n' | tee -a ~/.profile # it was not possible to start /opt/microsoft/ropen/3.4.2/lib64/R/bin/R # the error is: # *** caught segfault *** # address 0x50, cause 'memory not mapped' # removing Microsoft R # https://mran.microsoft.com/documents/rro/installation#revorinst-uninstall steps did not work sudo apt-get remove 'microsoft-r-.*' sudo apt-get autoclean && sudo apt-get autoremove Benchmark results

My scripts above do edit ~/.profile. This is to open RStudio and work with differently configured R instances on my computer.

I released the benchmark results and scripts on GitHub. The idea is to run the same scripts from ATT© and Microsoft© to see how different setups perform.

To work with CRAN R with MKL I had to edit ~/.profile because of how I configurated the instances. So I had to run nano ~/.profile and comment the last part of the file to obtain this result:

#export RSTUDIO_WHICH_R=/usr/bin/R export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R #export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R #export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R #export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R

After that I log out and then log in to open RStudio to run the benchmark.

The other two cases are similar and the benchmark results were obtained editing ~/.profile, logging out and in and opening RStudio with the corresponding instance.

As an example, this result starts with the R version and the corresponding numerical libraries used in that sessions. Any other result are reported in the same way.

And here are the results from ATT© benchmarking script:

Task CRAN R with MKL (seconds) CRAN R with OpenBLAS (seconds) CRAN R with no optimized libraries (seconds) Creation, transp., deformation of a 2500×2500 matrix (sec) 0.68 0.68 0.67 2400×2400 normal distributed random matrix ^1000 0.56 0.56 0.56 Sorting of 7,000,000 random values 0.79 0.79 0.79 2800×2800 cross-product matrix (b = a’ * a) 0.3 0.36 14.55 Linear regr. over a 3000×3000 matrix (c = a \ b’) 0.17 0.22 6.98 FFT over 2,400,000 random values 0.33 0.33 0.33 Eigenvalues of a 640×640 random matrix 0.22 0.49 0.74 Determinant of a 2500×2500 random matrix 0.2 0.22 2.99 Cholesky decomposition of a 3000×3000 matrix 0.31 0.21 5.76 Inverse of a 1600×1600 random matrix 0.2 0.21 2.79 3,500,000 Fibonacci numbers calculation (vector calc) 0.54 0.54 0.54 Creation of a 3000×3000 Hilbert matrix (matrix calc) 0.23 0.24 0.23 Grand common divisors of 400,000 pairs (recursion) 0.27 0.29 0.3 Creation of a 500×500 Toeplitz matrix (loops) 0.28 0.28 0.28 Escoufier’s method on a 45×45 matrix (mixed) 0.22 0.23 0.28 Total time for all 15 tests 5.3 5.62 37.78 Overall mean (weighted mean) 0.31 0.32 0.93

And here are the results from Microsoft© benchmarking script:

Task CRAN R with MKL (seconds) CRAN R with OpenBLAS (seconds) CRAN R with no optimized libraries (seconds) Matrix multiply 5.985 13.227 165.18 Cholesky Factorization 1.061 2.349 26.762 Singular Value Decomposition 7.268 18.325 47.076 Principal Components Analysis 14.932 40.612 162.338 Linear Discriminant Analysis 26.195 43.75 117.537 Actions after benchmarking results

I decided to try Intel MKL and I’ll write another post benchmarking things I do everyday beyond what is considered in the scripts.

To clean my system I deleted all R instances but MKL:

sudo apt-get remove r-base r-base-dev sudo apt-get remove 'r-cran.*' sudo apt-get autoclean && sudo apt-get autoremove sudo apt-get build-dep r-base sudo rm -rf /opt/R/R-3.4.2-openblas sudo rm -rf /opt/R/R-3.4.2-defaults sudo ln -s /opt/R/R-3.4.2-intel-mkl/bin/R /usr/bin/R

I edited ~/.profile so the final lines are:

export RSTUDIO_WHICH_R=/usr/bin/R #export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-intel-mkl/bin/R #export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-openblas/bin/R #export RSTUDIO_WHICH_R=/opt/R/R-3.4.2-defaults/bin/R #export RSTUDIO_WHICH_R=/opt/microsoft/ropen/3.4.2/lib64/R/bin/R

And I also decided to configure my user packages directory from zero:

# remove installed user packages rm -rf ~/R # create new user packages directory mkdir ~/R/ mkdir ~/R/x86_64-pc-linux-gnu-library/ mkdir ~/R/x86_64-pc-linux-gnu-library/3.4 # install common packages R --vanilla << EOF install.packages(c("tidyverse","data.table","dtplyr","devtools","roxygen2","bit64","pacman"), repos = "https://cran.rstudio.com/") q() EOF # Export to HTML/Excel R --vanilla << EOF install.packages(c("htmlTable","openxlsx"), repos = "https://cran.rstudio.com/") q() EOF # Blog tools R --vanilla << EOF install.packages(c("knitr","rmarkdown"), repos='http://cran.us.r-project.org') q() EOF sudo apt-get install python-pip sudo pip install --upgrade --force-reinstall markdown rpy2==2.7.8 pelican==3.6.3 # PDF extraction tools sudo apt-get install libpoppler-cpp-dev default-jre default-jdk sudo R CMD javareconf R --vanilla << EOF library(devtools) install.packages(c("rjava","pdftools"), repos = "https://cran.rstudio.com/") install_github("ropensci/tabulizer") q() EOF # TTF/OTF fonts usage R --vanilla << EOF install.packages("showtext", repos = "https://cran.rstudio.com/") q() EOF # Cairo for graphic devices sudo apt-get install libgtk2.0-dev libxt-dev libcairo2-dev R --vanilla << EOF install.packages("Cairo", repos = "https://cran.rstudio.com/") q() EOF Concluding remarks

The benchmark exposed here are in no way a definitive end to the long discussion on numerical libraries. My results show some evidence that indicates, that because of more speed for some operations, I should use MKL.

One of the advantages of the setup I explained is that you can use MKL with Python. In that case numpy calculations will be boosted.

Using MKL with AMD© processors might not provide an important improvement when compared to use OpenBLAS. This is because MKL uses specific processor instructions that work well with i3 or i5 processors but not neccesarily with non-Intel models.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### M4 Forecasting Competition

R bloggers - Sun, 11/19/2017 - 01:00

(This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers)

The “M” competitions organized by Spyros Makridakis have had an enormous influence on the field of forecasting. They focused attention on what models produced good forecasts, rather than on the mathematical properties of those models. For that, Spyros deserves congratulations for changing the landscape of forecasting research through this series of competitions.
Makridakis & Hibon, (JRSSA 1979) was the first serious attempt at a large empirical evaluation of forecast methods.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RStudio Keyboard Shortcuts for Pipes

R bloggers - Sat, 11/18/2017 - 19:32

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

I have just released some simple RStudio add-ins that are great for creating keyboard shortcuts when working with pipes in R.

You can install the add-ins from here (which also includes both installation instructions and use instructions/examples).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Statebins Reimagined

R bloggers - Sat, 11/18/2017 - 16:33

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

A long time ago, in a github repo far, far away there lived a tiny package that made it possible to create equal area, square U.S. state cartograms in R dubbed statebins. Three years have come and gone and — truth be told — I’ve never been happy with that package. It never felt “right” and that gnawing feeling finally turned into action with a “re-imagining” of the API.

Previously, on statebins…

There were three different functions in the old-style package:

• one for discrete scales (it automated ‘cuts’)
• one for continuous scales
• one for manual scales

It also did some hack-y stuff with grobs to try to get things to look good without putting too much burden on the user.

All that “mostly” worked, but I always ended up doing some painful workaround when I needed more custom charts (not that I have to use this package much given the line of work I’m in).

Downsizing statebins

Now, there’s just one function for making the cartograms — statebins() — and another for applying a base theme — theme_statebins(). The minimalisation has some advantages that we’ll take a look at now, starting with the most basic example (the one on the manual page):

data(USArrests) USArrests$state <- rownames(USArrests) statebins(USArrests, value_col="Assault", name = "Assault") + theme_statebins(legend_position="right") Two things should stand out there: • you got scale_fill_distiller() for free! • labels are dark/light depending on the tile color Before we go into ^^, it may be helpful to show the new function interface: statebins(state_data, state_col = "state", value_col = "value", dark_label = "black", light_label = "white", font_size = 3, state_border_col = "white", state_border_size = 2, ggplot2_scale_function = ggplot2::scale_fill_distiller, ...) You pass in the state name/abbreviation & value columns like the old interface but also specify colors for the dark & light labels (set hex code color with 00 ending alpha values if you don’t want labels but Muricans are pretty daft and generally need the abbreviations on the squares). You can set the font size, too (we’ll do that in a bit) and customize the border color (usually to match the background of the target medium). BUT, you also pass in the ggplot2 scale function you want to use and the named parameters for it (that’s what the ... is for). So, yes I’ve placed more of a burden on you if you want discrete cuts, but I’ve also made the package way more flexible and made it possible to keep the labels readable without you having to lift an extra coding finger. The theme()-ing is also moved out to a separate theme function which makes it easier for you to further customize the final output. But that’s not all! There are now squares for Puerto Rico, the Virgin Islands and New York City (the latter two were primarily for new features/data in cdcfluview but they are good to have available). Let’s build out a larger example with some of these customizations (we’ll make up some data to do that): library(statebins) library(tidyverse) library(viridis) data(USArrests) # make up some data for the example rownames_to_column(USArrests, "state") %>% bind_rows( data_frame( state = c("Virgin Islands", "Puerto Rico", "New York City"), Murder = rep(mean(max(USArrests$Murder),3)), Assault = rep(mean(max(USArrests$Assault),3)), Rape = rep(mean(max(USArrests$Rape),3)), UrbanPop = c(93, 95, 100) ) ) -> us_arrests statebins(us_arrests, value_col="Assault", ggplot2_scale_function = viridis::scale_fill_viridis) + labs(title="USArrests + made up data") + theme_statebins("right")

Cutting to the chase

I still think it makes more sense to use binned data in these cartograms, and while you no longer get that for “free”, it’s not difficult to do:

adat <- suppressMessages(read_csv("http://www.washingtonpost.com/wp-srv/special/business/states-most-threatened-by-trade/states.csv?cache=1")) mutate( adat, share = cut(avgshare94_00, breaks = 4, labels = c("0-1", "1-2", "2-3", "3-4")) ) %>% statebins( value_col = "share", ggplot2_scale_function = scale_fill_brewer, name = "Share of workforce with jobs lost or threatened by trade" ) + labs(title = "1994-2000") + theme_statebins()

More manual labor

You can also still use hardcoded colors, but it’s a little more work on your end (but not much!):

election_2012 <- suppressMessages(read_csv("https://raw.githubusercontent.com/hrbrmstr/statebins/master/tmp/election2012.csv")) mutate(election_2012, value = ifelse(is.na(Obama), "Romney", "Obama")) %>% statebins( font_size=4, dark_label = "white", light_label = "white", ggplot2_scale_function = scale_fill_manual, name = "Winner", values = c(Romney = "#2166ac", Obama = "#b2182b") ) + theme_statebins()

BREAKING NEWS: Rounded corners

A Twitter request ended up turning into a new feature this afternoon (after I made this post) => rounded corners:

data(USArrests) USArrests$state <- rownames(USArrests) statebins(USArrests, value_col="Assault", name = "Assault", round=TRUE) + theme_statebins(legend_position="right") FIN It’ll be a while before this hits CRAN and I’m not really planning on keeping the old interface when the submission happens. So, it’ll be on GitHub for a bit to let folks chime in on what additional features you want and whether you really need to keep the deprecated functions around in the package. So, kick the tyres and don’t hesitate to shoot over some feedback! var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script')); To leave a comment for the author, please follow the link and comment on their blog: R – rud.is. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more... ### Seasonality of plagues by @ellis2013nz R bloggers - Sat, 11/18/2017 - 12:00 (This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers) Seasonality of plague deaths In the course of my ramblings through history, I recently came across Mark R. Welford and Brian H. Bossak Validation of Inverse Seasonal Peak Mortality in Medieval Plagues, Including the Black Death, in Comparison to Modern Yersinia pestis-Variant Diseases. This article is part of an interesting epidemiological literature on the seasonality of medieval and early modern European plagues and the twentieth century outbreaks of plague caused by the Yersinia pestis bacterium. Y. pestis is widely credited (if that’s the word) for the 6th century Plague of Justinian, the mid 14th century Black Death, and the Third Pandemic from the mid nineteenth to mid twentieth centuries. The Black Death killed between 75 and 200 million people in Eurasis in only a few years; the Third Pandemic killed at least 12 million. These are important diseases to understand! In their 2009 article, Welford and Bossak added additional evidence and rigour to the argument that the difference in seasonality of peak deaths between the first two suspected bubonic plague pandemics and the Third Pandemic indicates a material difference in the diseases, whether it be a variation in Y. pestis itself, its mode and timing of transmission, or even the presence of an unknown human-to-human virus. This latter explanation would account for the extraordinary speed of spread of the Medieval Black Death, which I understand is surprising for a rodent-borne bacterial disease. These are deep waters; as a newcomer to the topic there’s too much risk of me getting it wrong if I try to paraphrase, so I’ll leave it there and encourage you to read specialists in the area. Plot polishing One of Welford and Bossak’s graphics caught my eye. Their original is reproduced at the bottom of this post. It makes the point of the differing seasonalities in the different outbreaks of plague but is pretty cluttered. I set myself the job of improving it as a way of familiarising myself with the issues. My first go differed from there’s only in minimal respects: • split the graphic into four facets for each plague agent / era to allow easier comparison • direct labels of the lines rather than relying on a hard-to-parse legend • added a grey ribbon showing a smoothed overall trend, to give a better idea of the “average” seasonality of the particuar era’s outbreak in each facet • removed gridlines All the R code for today’s post, from download to animation, is collected at the end of the post. I think the grey ribbon of the smoothed average line is the best addition here. It makes it much clearer that there is an average seasonality in the European (medieval and early modern) plagues not apparent in the Indian and Chinese examples. I wasn’t particularly happy with the logarithm transform of the vertical axis. While this is a great transformation for showing changes in terms of relative difference (eg growth rates), when we’re showing seasonality like this the natural comparison is of the distance of each point from the bottom. It is natural to interpret the space as proportionate to a number, as though there is no transformation. In effect, the transformation is hiding the extent of the seasonality. I suspect the main reason for the use of the transform at all was to avoid the high numbers for some outbreaks hiding the variation in the smaller ones. An alternative approach is to allow the vertical axes of some of the facets to use varying scales. This is legitimate so long as any comparison of magnitudes is within each facet, and comparisons across facets are of trends and patterns, not magnitude. That is the case here. Here’s how that looks: I think this is a significant improvement. The concentration of deaths in a few months of each outbreak is now much clearer. There’s an obvious problem with both the charts so far, which is the clutter of text from the labels. I think they are an improvement on the legend, but they are a bit of mess. This is in fact a classic problem of how to present longitudinal data in what is often termed a “spaghetti plot” for obvious reasons. Generally, such charts only work if we don’t particularly care which line is which. I attempted to address this problem by aggregating the samples to the country level, while keeping different lines for different combinations of country and year. I made a colour palette matched to countries so they would have the same colour over time. I also changed the scale to being the proportion of deaths in each month from the total year’s outbreak. If what we’re after is in fact the proportion of deaths in each month (ie the seasonality) rather than the magnitudes, then let’s show that on the graphic: This is starting to look better. Animating While I think it was worth while aggregating the deaths in those small medieval towns and villages to get a less cluttered plot, it does lose some lovely granular detail. Did you know for instance that the medieval Welsh town of Abergwillar is completely invisible on the internet other than in discussion of the plague in 1349? One way around this problem of longitudinal data where we have a time series for many cases, themselves of some interest, is to animate through the cases. This also gave me a chance to use Thomas Pedersen’s handy tweenr R package to smooth the transitions between each case study. Note how I’ve re-used the country colour palette, and used the extra visual space given by spreading facets out over time (in effect) to add some additional contextual information, like the latitude of each location. Some innocent thoughts on the actual topic This was a toe in the water for something I’d stumbled across while reading a history book. There’s lots of interesting questions here. Plague is in the news again today. There’s a big outbreak in Madagascar, with 171 deaths between August and mid November 2017 and more than 2,000 infections. And while I have been writing this blog, the Express has reported on North Korea stockpiling plague bacteria for biological warfare. On the actual topic of whether seasonality of deaths tells us that the medieval Black Death was different from 19th century Chinese and Indian plagues, I’d like to see more data. For example, what were the death rates by month in China in the 14th century, where the outbreak probably began? Presumably not available. Nineteenth century Bombay was different from a medieval European village in many important respects. Bombay’s winter in fact is comparable in temperature to Europe’s summer, so the coincidence of these as peak times for plague deaths perhaps has something in common. On the other hand, Manchuria is deadly, freezing cold in its winter when pneumonic plague deaths peaked in our data. Interesting topic. R code Here’s the R code that did all that. It’s nothing spectacular. I actually typed by hand the data from the original table, which was only available, as far as I can see, as an image of a table. Any interest is likely to be in the particular niceties of some of the plot polishing eg using a consistent palette for country colour when colour is actually mapped to country-year combination int he data. Although tweenr is often used in combination with the animate R package, I prefer to do what animate does manually: save the individual frames somewhere I have set myself and call ImageMagick directly via a system() call. I find the animate package adds a layer of problems – particularly on Windows – that sometimes gives me additional headaches. Calling another application via system() works fine. library(tidyverse) library(scales) library(openxlsx) library(directlabels) library(forcats) library(RColorBrewer) library(stringr) library(tweenr) #========download and prepare data========= download.file("https://github.com/ellisp/ellisp.github.io/raw/master/data/plague-mortality.xlsx", destfile = "plauge-mortality.xlsx", mode = "wb") orig <- read.xlsx("plague-mortality.xlsx", sheet = "data") d1 <- orig %>% rename(place = Place, year = Date, agent = Agent, country = Country, latitude = Latitude) %>% gather(month, death_rate, -place, -year, -agent, -country, -latitude) %>% # put the months' levels in correct order so they work in charts etc mutate(month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")), month_number = as.numeric(month)) %>% mutate(place_date = paste(place, year), agent = fct_reorder(agent, year)) %>% arrange(place, year, month_number) # define a caption for multiple use: the_caption <- str_wrap("Graphic by Peter Ellis, reworking data and figures published by Mark R. Welford and Brian H. Bossak 'Validation of Inverse Seasonal Peak Mortality in Medieval Plagues, Including the Black Death, in Comparison to Modern Yersinia pestis-Variant Diseases'", 100) #============faceted version of original graphic===================== p1 <- d1 %>% ggplot(aes(x = month_number, y = death_rate, colour = place_date)) + geom_smooth(aes(group = NULL), se = TRUE, colour = NA, size = 4) + geom_line(aes(x = month_number), size = 0.9) + scale_y_log10("Number of deaths in a month", limits = c(1, 1000000), label = comma) + scale_x_continuous("", breaks = 1:12, labels = levels(d1$month), limits = c(0.5, 12)) + facet_wrap(~agent) + theme(panel.grid = element_blank()) + labs(caption = the_caption) + ggtitle("Seasonal deaths by different plague agents", "Medieval and early modern Black Death and plague peaked in summer, whereas early 20th Century outbreaks peaked in late winter") direct.label(p1, list("top.bumpup", fontfamily = "myfont", cex = 0.8)) # good # version with free y axes as an alternative to the log transform p1b <- p1 + facet_wrap(~agent, scales = "free_y") + scale_y_continuous() direct.label(p1b, list("top.bumpup", fontfamily = "myfont", cex = 0.8)) #==============================y axis as proportion=================== d2 <- d1 %>% group_by(country, year, month_number, agent) %>% summarise(deaths = sum(death_rate, na.rm = TRUE)) %>% mutate(country_date = paste(country, year)) %>% group_by(country_date) %>% mutate(prop_deaths = deaths / sum(deaths, na.rm = TRUE)) %>% # convert the zeroes back to NA to be consistent with original presentation: mutate(prop_deaths = ifelse(prop_deaths == 0, NA, prop_deaths)) # Defining a palette. This is a little complex because although colour # is going to be mapped to country_year, I want each country to have its own # colour regardless of the year. I'm going to use Set1 of Cynthia Brewer's colours, # except for the yellow which is invisible against a white background pal1 <- data_frame(country = unique(d2$country)) pal1$colour <- brewer.pal(nrow(pal1) + 1, "Set1")[-6] pal2 <- distinct(d2, country, country_date) %>% left_join(pal1, by = "country") pal3 <- pal2$colour names(pal3) <- pal2$country_date # draw graphic with colour mapped to country-year combination, but colours come out # at country level: p2 <- d2 %>% ggplot(aes(x = month_number, y = prop_deaths, colour = country_date)) + geom_smooth(aes(group = NULL), se = TRUE, colour = NA, size = 4) + # geom_point() + geom_line(aes(x = month_number)) + scale_color_manual(values = pal3) + scale_y_continuous("Proportion of the year's total deaths in each month\n", label = percent, limits = c(0, 0.65)) + scale_x_continuous("", breaks = 1:12, labels = levels(d1$month), limits = c(0.5, 12)) + facet_wrap(~agent) + theme(panel.grid = element_blank()) + ggtitle("Seasonal deaths by different plague agents", "Medieval and early modern Black Death and plague peaked in summer, whereas early 20th Century outbreaks peaked in late winter") + labs(caption = the_caption) direct.label(p2, list("top.bumpup", fontfamily = "myfont", cex = 0.8)) #============animation=================== d3 <- d1 %>% group_by(place_date, country, latitude) %>% mutate(prop_deaths = death_rate / sum(death_rate, na.rm = TRUE), prop_deaths = ifelse(is.na(prop_deaths), 0, prop_deaths)) %>% ungroup() %>% mutate(place_date = fct_reorder(place_date, year), place_date_id = as.numeric(place_date)) %>% arrange(place_date_id, month_number) place_dates <- levels(d3$place_date) # change this to a list of data.frames, one for each state d3_l <- lapply(place_dates, function(x){ d3 %>% filter(place_date == x) %>% select(prop_deaths, month, place_date, agent) }) # use tweenr to create a single data frame combining those 20+ frames in the list, with interpolated # values for smooth animation: d3_t <- tween_states(d3_l, tweenlength = 5, statelength = 8, ease = "cubic-in-out", nframes = 600) %>% # caution - tween_states loses the information ont the ordering of factors mutate(month= factor(as.character(month), levels = levels(d1$month)), place_date = factor(as.character(place_date, levels = place_dates))) # make a temporary folder to store some thousands of PNG files as individual frames of the animation unlink("0114-tmp", recursive = TRUE) dir.create("0114-tmp") for(i in 1:max(d3_t$.frame)){ png(paste0("0114-tmp/", i + 10000, ".png"), 2000, 1000, res = 200) the_data <- filter(d3_t, .frame == i) the_title <- paste("Deaths from", the_data[1, "agent"], "in", the_data[1, "place_date"]) tmp <- d3 %>% filter(place_date == the_data[1, "place_date"]) %>% select(place_date, country, latitude) %>% distinct() the_xlab <- with(tmp, paste0(place_date, ", ", country, ", ", latitude, " degrees North")) the_high_point <- the_data %>% filter(prop_deaths == max(prop_deaths) ) %>% slice(1) %>% mutate(prop_deaths = prop_deaths + 0.03) %>% mutate(lab = agent) the_colour <- pal2[pal2$country == tmp$country, ]$colour[1] print( ggplot(the_data, aes(x = as.numeric(month), y = prop_deaths)) + geom_ribbon(aes(ymax = prop_deaths), ymin = 0, fill = the_colour, colour = "grey50", alpha = 0.1) + scale_x_continuous(the_xlab, breaks = 1:12, labels = levels(d1$month), limits = c(0.5, 12)) + ggtitle(the_title, "Distribution by month of the total deaths that year") + theme(panel.grid = element_blank()) + geom_text(data = the_high_point, aes(label = lab)) + labs(caption = the_caption) + scale_y_continuous("Proportion of the year's total deaths in each month\n", label = percent, limits = c(0, 0.68)) ) dev.off() # counter: if (i / 20 == round(i / 20)){cat(i)} } # combine all those frames into a single animated gif pd <- setwd("0114-tmp") system('magick -loop 0 -delay 8 *.png "0114-plague-deaths.gif"') setwd(pd) # clean up unlink("plague-mortality.xlsx") The original

As promised, here’s the original Welford and Bossak image. Sadly, they dropped Abergwillar; and they have data for Poona which I couldn’t find in a table in their paper.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...