R-bloggers: R news and tutorials contributed by (750) R bloggers

Freedman’s paradox

Tue, 06/06/2017 - 02:00

(This article was first published on Alexej's blog, and kindly contributed to R-bloggers)

Recently I came across the classic 1983 paper A note on screening regression equations by David Freedman. Freedman strikingly demonstrates the dangers of data reuse in statistical analyses. The potentially dangerous scenarios include those where the results of one statistical procedure performed on the data are fed into a second procedure performed on the same data. As a concrete example, Freedman considers the practice of performing variable selection first, and then fitting another model using only the identified variables on the same data that was used to identify them in the first place. Because of the unexpectedly high severity of the problem this phenomenon became known as “Freedman’s paradox”. Moreover, in his paper Freedman derives asymptotic estimates for the resulting errors.

The 1983 paper presents a simulation with only 10 repetitions. But in the present day it is very easy (both in terms of computational time and implementation difficulty) to reproduce the simulation with many more repetitions (even my phone’s computational power is probably higher than that of the high performance computer that Freedman used in the 80’s). We also have more convenient ways to visualize the results than in the 80’s. So let’s do it.

I am going to use a few R packages (most notably the package broom to fit and analyze many many linear models in a single step).

library(dplyr)
library(broom)
library(ggplot2)
library(tidyr)
set.seed(20170605)

The considered data structure is the following:

  • A matrix of predictors with 100 rows and 50 columns is generated with independent standard normal entries.
  • The response variable is generated independently of the model matrix (also from the standard normal distribution), i.e., the true answer is that there is no relationship between predictors and response.

Instead of Freedman’s 10 repetitions we perform 1000. So let’s generate all 1000 datasets at once as stacked in a large data frame:

n_row <- 100
# n_col is set to 51 because the 51st column will serve as y
n_col <- 51
n_rep <- 1000
# a stack of matrices for all n_rep repetitions is generated...
X <- matrix(rnorm(n_rep * n_row * n_col), n_rep * n_row, n_col)
colnames(X) <- paste0("X", 1:n_col)
# ...and then transformed to a data frame with a repetition number column
X_df <- as_data_frame(X) %>%
  mutate(repetition = rep(1:n_rep, each = n_row))

The data are analyzed with two successive linear models, the second (illegally) reusing the results of the first.

The first model fit.
After the 1000 ordinary linear models are fit to the data, we record for each of them the R squared, the F test statistic with corresponding p-value, and the t test statistics with p-values for the individual regression coefficients.

Using functions from the broom package we can fit and extract information from all 1000 models at once.

# all models can be fit at once...
models_df <- X_df %>%
  group_by(repetition) %>%
  do(full_model = lm(X51 ~ . + 0, data = select(., -repetition)))
# ...then the results are extracted
model_coefs <- tidy(models_df, full_model)
model_statistics <- glance(models_df, full_model)
model_statistics$data_reuse <- rep(FALSE, nrow(model_statistics))

The second model fit.
For each one of the first 1000 models, the corresponding second linear model is fit using only those variables which have p-values significant at the 25% level in the first model.
That is, the second model uses the first model for variable selection.

This gives us 1000 reduced re-fitted linear models. We record the same model statistics (R squared, F, and t tests) as for the first group of models.

reduced_models <- list()
for (i in 1:n_rep) {
  full_data <- X_df %>% filter(repetition == i)
  significant_coefs <- model_coefs %>% filter(repetition == i & p.value < 0.25)
  reduced_data <- select(full_data, one_of(unlist(significant_coefs[ , "term"])), X51)
  reduced_models[[i]] <- lm(X51 ~ . + 0, data = reduced_data)
  tmp_df <- glance(reduced_models[[i]])
  tmp_df$repetition <- i
  tmp_df$data_reuse <- TRUE
  model_statistics <- bind_rows(model_statistics, tmp_df)
}

Finally let’s look at the results. The figure shows the distributions of the considered model statistics across the 1000 repetitions for model fits with and without data reuse (the code producing this figure is given at the bottom of this post):

Well, the R squared statistic shows a moderate change between models with or without data reuse (average of 0.3093018 vs. 0.5001641). The F test statistic however grows immensely to an average of 3.2806118 (from 1.0480097), and the p-values fall after data reuse to an average of 0.0112216 (from 0.5017696), below the widely used (but arbitrary) 5% significance level.
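For the record, the quoted averages can be recomputed directly from the model_statistics data frame built above; a minimal dplyr sketch (not part of the original post):

model_statistics %>%
  group_by(data_reuse) %>%
  summarise(mean_r_squared = mean(r.squared),   # R squared
            mean_F         = mean(statistic),   # F test statistic
            mean_p_value   = mean(p.value))     # p-value of the F test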

Obviously the model with data reuse is highly misleading here, because in fact there are absolutely no relationships between the predictor variables and the response (as per the data generation procedure).

In fact, Freedman derived asymptotic estimates for the magnitudes of change in the considered model statistics, and they indeed match the above observations. However I’m too lazy to summarize them here. So I refer to the primary source.

This code generates the above figure:

model_statistics %>%
  select(r.squared, p.value, statistic, repetition, data_reuse) %>%
  mutate(data_reuse = ifelse(data_reuse, "With Data Reuse", "Without Data Reuse")) %>%
  mutate(data_reuse = factor(data_reuse,
                             levels = c("Without Data Reuse", "With Data Reuse"),
                             ordered = TRUE)) %>%
  rename("F-statistic" = statistic, "p-value" = p.value, "R squared" = r.squared) %>%
  gather(stat, value, -repetition, -data_reuse) %>%
  ggplot(aes(x = stat, y = value)) +
  geom_violin(aes(fill = stat), scale = "width", draw_quantiles = c(0.25, 0.5, 0.75)) +
  geom_hline(yintercept = 0.05, linetype = 2, size = 0.3) +
  facet_wrap(~data_reuse) +
  theme_linedraw() +
  scale_y_continuous(breaks = c(0.05, 2, 4, 6)) +
  ggtitle(paste(n_rep, "repetitions of an LM fit with", n_row, "rows,", n_col, "columns"))


The Case Against Seasonal Unit Roots

Mon, 06/05/2017 - 19:26

(This article was first published on R – first differences, and kindly contributed to R-bloggers)

There are several ways to model seasonality in a time series. Traditionally, trend-cycle decompositions such as the Holt-Winters procedure have been very popular. Even today, applied researchers often try to account for seasonality by using seasonal dummy variables. But of course, in a stochastic process it seems unreasonable to assume that seasonal effects are purely deterministic. Therefore, in a time series context seasonal extensions of the classical ARMA model are very popular. One of these extensions is the seasonal unit root model

(1 - L^S) y_t = u_t,

where L is the usual lag operator, S is the period length of the seasonality (such as S = 4 or S = 12 for a yearly cycle in quarterly or monthly data), and u_t is some short run component such as an innovation term or a SARMA(p,q)-(P,Q) model.

I have always been puzzled by the popularity of this process. Probably it is due to its obvious conceptual simplicity. It also seems to be a natural extension of the usual non-seasonal integrated ARIMA model. However, the model generates a feature that we will hardly ever observe in an actual time series: as time progresses, the difference between consecutive values of y_t becomes arbitrarily large.

To see this consider the following example. To generate seasonal unit root processes we first define a function that generates seasonal sums.

seasonal_sum <- function(data, S) {
  out <- data
  for (t in (S + 1):length(data)) {
    out[t] <- data[t] + out[(t - S)]
  }
  out
}

We then generate a sample of 250 observations from this process and look at its plot and its autocorrelation function. We choose a period of S = 12, so that the example resembles a yearly cycle in monthly data.

series <- seasonal_sum(rnorm(250), S = 12)
acf(series)

ts.plot(series, ylab="series", xlab="t")

From the autocorrelation function (ACF) it can be seen that there is a pronounced seasonal behavior, with a spike in the ACF at each lag that is an integer multiple of S = 12. However, the plot of the series shows a curious behavior: as t increases, we see that the difference between two consecutive observations increases. This behavior becomes even more pronounced if we increase the sample size to 2500.

ts.plot(seasonal_sum(rnorm(2500),S=12), ylab="series")

To understand this feature, consider the usual unit root model y_t = y_{t-1} + e_t with an innovation e_t with variance sigma^2. This can be expressed as the sum over all past innovations,

y_t = sum_{i=1}^{t} e_i.

From this representation it is easy to show that the variance of the process is given by

Var(y_t) = t * sigma^2,

so that the variance becomes infinite as t approaches infinity. This is a property that seems to apply to many economic and financial time series and is therefore completely reasonable.

Now, the seasonal unit root model can be expressed in a similar way, but with an important twist. To see this, denote the s-th innovation in the j-th repetition of the cycle of length S by e_{s,j}. This means that if you have monthly observations, the innovation in the first January in the sample is e_{1,1} and the innovation in the second January in the sample is e_{1,2}. By the same principle, the innovation in the j-th December in the sample would be e_{12,j}. Therefore, any observation y_t, with t = s + (j-1)S for some s in {1, ..., S} and some repetition j, can be represented as

y_{s+(j-1)S} = sum_{i=1}^{j} e_{s,i}.

The important thing to note here is that for two consecutive observations within the j-th repetition of the cycle we have y_{s+(j-1)S} = sum_{i=1}^{j} e_{s,i} and y_{s+1+(j-1)S} = sum_{i=1}^{j} e_{s+1,i}. Since e_{s,i} and e_{s+1,i} are independent streams of random numbers, this means that y_{s+(j-1)S} and y_{s+1+(j-1)S} are independent random walks! Consequently, the difference of the process is given by

y_{s+1+(j-1)S} - y_{s+(j-1)S} = sum_{i=1}^{j} e_{s+1,i} - sum_{i=1}^{j} e_{s,i},

so that

Var( y_{s+1+(j-1)S} - y_{s+(j-1)S} ) = 2 * j * sigma^2.

Since j goes to infinity as t goes to infinity, so does the variance of the changes. Has anybody ever seen a series that exhibits such a feature? Of course in reality we would expect that the innovations are not i.i.d. but show some kind of dependence structure, so that the random walks are not completely independent anymore. However, if the dependence is weak – such as that of an ARMA process – they are still asymptotically independent for large lags. Therefore, the same issue arises, as can be seen from the example below.
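(Before turning to that example, the variance claim is easy to verify numerically. The following quick sketch is not part of the original post; it reuses the seasonal_sum() function defined above and checks that the sample variance of the first differences grows roughly like 2 * (t/S) * sigma^2.)

# Simulation check of Var(y_t - y_{t-1}); with standard normal innovations
# (sigma^2 = 1) we expect roughly 2 * t / S at time t.
set.seed(42)
S <- 12
n_obs <- 1200
n_sim <- 500
sims <- replicate(n_sim, seasonal_sum(rnorm(n_obs), S = S))  # n_obs x n_sim matrix
d <- apply(sims, 2, diff)                # first differences, rows correspond to t = 2, ..., n_obs
t_check <- c(120, 600, 1200)
apply(d[t_check - 1, ], 1, var)          # approximately 20, 100, 200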

sarima_sim <- function(T, S, arma_model) {
  arma_series <- arima.sim(n = T, model = arma_model)
  seasonal_sum(data = arma_series, S = S)
}
sarima_series <- sarima_sim(T = 250, S = 12, arma_model = list(ar = c(0.5, 0.3)))
acf(sarima_series)

ts.plot(sarima_series, ylab="series")

ts.plot(sarima_sim(T=2500, S=12, arma_model=list(ar=c(0.5,0.3))), ylab="series")

So what is the conclusion from all this? The seasonal unit root process seems to be ill suited to model most behavior that we observe in practice. However, it is well known that it often generates a good fit. Especially in shorter time series the drawbacks of the seasonal unit root process do not have to become visible. Nevertheless, I think it is fair to say that one could envision a more satisfactory model. One avenue that seems very useful in this context is that of seasonal long memory processes that are able to combine some persistence in the cyclical fluctuations with a finite variance.

Another important conclusion is that we have to be careful with seemingly direct extensions of standard models such as the ARIMA. The fact that the ARIMA is extremely successful in modelling the non-seasonal component does not necessarily mean that the SARIMA is also a good model for the seasonal applications that we have in mind.



Quantile Regression in R exercises

Mon, 06/05/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

The standard OLS (Ordinary Least Squares) model explains the relationship between independent variables and the conditional mean of the dependent variable. In contrast, quantile regression models this relationship for different quantiles of the dependent variable.
In this exercise set we will use the quantreg package (package description: here) to implement quantile regression in R.

Answers to the exercises are available here.
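If you want to get a feel for the interface before starting, here is a minimal sketch (not one of the official answers) of an OLS fit next to a median regression on the same data:

# Minimal quantreg sketch: OLS benchmark vs. median regression (tau = 0.5).
library(quantreg)
data(barro)                                        # Barro & Lee growth data
fit_ols <- lm(y.net ~ ., data = barro)             # conditional mean
fit_med <- rq(y.net ~ ., tau = 0.5, data = barro)  # conditional median
summary(fit_med)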

Exercise 1
Load the quantreg package and the barro dataset (Barro and Lee, 1994). This has data on GDP growth rates for various countries.
Next, summarize the data.

Exercise 2
The dependent variable is y.net (annual change in per capita GDP). The remaining variables will be used to explain y.net. It is easier to combine variables using cbind before applying regression techniques. Combine variables so that we can write Y ~ X.

Exercise 3
Regress y.net on the independent variables using OLS. We will use this result as benchmark for comparison.

Exercise 4
Using the rq function, estimate the model at the median y.net. Compare results from exercise-3.


Exercise 5
Estimate the model for the first and third quartiles and compare results.

Exercise 6
Using a single command estimate the model for 10 equally spaced deciles of y.net.

Exercise 7
The quantreg package also offers shrinkage estimators to determine which variables play the most important role in predicting y.net. Estimate the model with LASSO-based quantile regression at the median level with lambda=0.5.

Exercise 8
Quantile plots are most useful for interpreting results. To do that we need to define the sequence of percentiles. Use the seq function to define the sequence of percentiles from 5% to 95% with a jump of 5%.

Exercise 9
Use the result from exercise-8 to plot the graphs. Note that the red line is the OLS estimate bounded by the dotted lines which represent confidence intervals.

Exercise 10
Using results from exercise-5, test whether coefficients are significantly different for the first and third quartile based regressions.


There is usually more than one way in R

Mon, 06/05/2017 - 17:38

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one– and preferably only one –obvious way to do it.

Frankly, in R (especially once you add many packages) there is usually more than one way. As an example we will talk about the common R functions str(), head(), and the tibble package's glimpse().

tibble::glimpse()

Consider the important task of inspecting a data.frame to see column types and a few example values. The dplyr/tibble/tidyverse way of doing this is as follows:

library("tibble") glimpse(mtcars) Observations: 32 Variables: 11 $ mpg 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.... $ cyl 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4 $ disp 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.8, 275.8, 27... $ hp 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65,... $ drat 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07, 2.93, 3.0... $ wt 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.070, 3.730, 3.... $ qsec 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.40, 17.60, 18... $ vs 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1 $ am 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1 $ gear 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4 $ carb 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2 utils::str()

A common “base-R” (actually from the utils package) way to examine the data is:

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

However, both of the above summaries have unfortunately obscured an important fact about the mtcars data.frame: the car names! This is because mtcars stores this important key as row-names instead of as a column. Even base::summary() will hide this from the analyst.
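One simple (and, fittingly, not unique) way around this is to promote the row names to a proper column before inspecting the data; a small sketch, not from the original post:

# Make the car names visible to glimpse()/str()/summary() by turning the
# row names into a regular column first.
mtcars_named <- tibble::rownames_to_column(mtcars, var = "car")
# or, in base R:
mtcars_named2 <- data.frame(car = rownames(mtcars), mtcars, row.names = NULL)
head(mtcars_named[, 1:4])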

utils::head()

The base-R command head() (again from the utils package) provides a good way to examine the first few rows of data:

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We are missing the size of the table and the column types, but those are easy to get with “dim(mtcars)” and “stack(vapply(mtcars, class, character(1)))”. And we can get something like the “columns on the side” presentation as follows:

t(head(mtcars))
     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
mpg      21.00        21.000      22.80         21.400             18.70   18.10
cyl       6.00         6.000       4.00          6.000              8.00    6.00
disp    160.00       160.000     108.00        258.000            360.00  225.00
hp      110.00       110.000      93.00        110.000            175.00  105.00
drat      3.90         3.900       3.85          3.080              3.15    2.76
wt        2.62         2.875       2.32          3.215              3.44    3.46
qsec     16.46        17.020      18.61         19.440             17.02   20.22
vs        0.00         0.000       1.00          1.000              0.00    1.00
am        1.00         1.000       1.00          0.000              0.00    0.00
gear      4.00         4.000       4.00          3.000              3.00    3.00
carb      4.00         4.000       1.00          1.000              2.00    1.00

Also, head() is usually re-adapted (through R‘s S3 object system) to work with remote data sources.

library("sparklyr") sc <- sparklyr::spark_connect(version='2.0.2', master = "local") dRemote <- copy_to(sc, mtcars) head(dRemote) # Source: query [6 x 11] # Database: spark connection master=local[4] app=sparklyr local=TRUE # # # A tibble: 6 x 11 # mpg cyl disp hp drat wt qsec vs am gear carb # # 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 # 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 # 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 # 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 # 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 # 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 glimpse(dRemote) # Observations: 32 # Variables: 11 # # Rerun with Debug # Error in if (width[i] <= max_width[i]) next : # missing value where TRUE/FALSE needed broom::glance(dRemote) # Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl Conclusion

R often has more than one way to nearly perform the same task. When working in R consider trying a few functions to see which one best fits your needs. Also be aware that base-R (R with the standard packages) often already has powerful capabilities.


ANOVA in R: afex may be the solution you are looking for

Mon, 06/05/2017 - 17:14

(This article was first published on Computational Psychology - Henrik Singmann, and kindly contributed to R-bloggers)

Prelude: When you start with R and try to estimate a standard ANOVA, which is relatively simple in commercial software like SPSS, R kind of sucks. Especially for unbalanced designs or designs with repeated measures, replicating the results from such software in base R may require considerable effort. For a newcomer (and even an old timer) this can be somewhat off-putting. After I had gained experience developing my first package and was once again struggling with R and ANOVA, I had enough and decided to develop afex. If you know this feeling, afex is also for you.

A new version of afex (0.18-0) was accepted on CRAN a few days ago. This version only fixes a small bug that was introduced in the last version: aov_ez did not work with more than one covariate (thanks to tkerwin for reporting this bug).

I want to use this opportunity to introduce one of the main functionalities of afex. It provides a set of functions that make calculating ANOVAs easy. In the default settings, afex automatically uses appropriate orthogonal contrasts for factors, transforms numerical variables into factors, uses so-called Type III sums of squares, and allows for any number of factors including repeated-measures (or within-subjects) factors and mixed/split-plot designs. Together this guarantees that the ANOVA results correspond to the results obtained from commercial statistical packages such as SPSS or SAS. On top of this, the ANOVA object returned by afex (of class afex_aov) can be directly used for follow-up or post-hoc tests/contrasts using the lsmeans package .

Example Data

Let me illustrate how to calculate an ANOVA with a simple example. We use data courtesy of Andrew Heathcote and colleagues. The data are lexical decision and word naming latencies for 300 words and 300 nonwords from 45 participants. Here we only look at three factors:

  • task is a between subjects (or independent-samples) factor: 25 participants worked on the lexical decision task (lexdec; i.e., participants had to make a binary decision whether or not the presented string is a word or nonword) and 20 participants on the naming task (naming; i.e., participant had to say the presented string out loud).
  • stimulus is a repeated-measures or within-subjects factor that codes whether a presented string was a word or nonword.
  • length is also a repeated-measures factor that gives the number of characters of the presented strings with three levels: 4, 5, and 6.

The dependent variable is the response latency or response time for each presented string. More specifically, as is common in the literature, we analyze the log of the response times, log_rt. After excluding erroneous responses, each participant responded to between 135 and 150 words and between 124 and 150 nonwords. To use this data in an ANOVA one needs to aggregate the data such that only one observation per participant and cell of the design (i.e., combination of all factors) remains. As we will see, afex does this automatically for us (this is one of the features I blatantly stole from ez).

library(afex) data("fhch2010") # load data (comes with afex) mean(!fhch2010$correct) # error rate # [1] 0.01981546 fhch <- droplevels(fhch2010[ fhch2010$correct,]) # remove errors str(fhch2010) # structure of the data # 'data.frame': 13222 obs. of 10 variables: # $ id : Factor w/ 45 levels "N1","N12","N13",..: 1 1 1 1 1 1 1 1 ... # $ task : Factor w/ 2 levels "naming","lexdec": 1 1 1 1 1 1 1 1 1 1 ... # $ stimulus : Factor w/ 2 levels "word","nonword": 1 1 1 2 2 1 2 2 1 2 ... # $ density : Factor w/ 2 levels "low","high": 2 1 1 2 1 2 1 1 1 1 ... # $ frequency: Factor w/ 2 levels "low","high": 1 2 2 2 2 2 1 2 1 2 ... # $ length : Factor w/ 3 levels "4","5","6": 3 3 2 2 1 1 3 2 1 3 ... # $ item : Factor w/ 600 levels "abide","acts",..: 363 121 ... # $ rt : num 1.091 0.876 0.71 1.21 0.843 ... # $ log_rt : num 0.0871 -0.1324 -0.3425 0.1906 -0.1708 ... # $ correct : logi TRUE TRUE TRUE TRUE TRUE TRUE ...

We first load the data and remove the roughly 2% errors. The structure of the data.frame (obtained via str()) shows us that the data has a few more factors than discussed here. To specify our ANOVA we first use the function aov_car(), which works very similarly to base R aov() but, like all afex functions, uses car::Anova() (read: function Anova() from package car) as the backend for calculating the ANOVA.

Specifying an ANOVA

(a1 <- aov_car(log_rt ~ task*length*stimulus + Error(id/(length*stimulus)), fhch))
# Contrasts set to contr.sum for the following variables: task
# Anova Table (Type 3 tests)
#
# Response: log_rt
#                 Effect          df  MSE          F   ges p.value
# 1                 task       1, 43 0.23  13.38 ***   .22   .0007
# 2               length 1.83, 78.64 0.00  18.55 ***  .008  <.0001
# 3          task:length 1.83, 78.64 0.00       1.02 .0004     .36
# 4             stimulus       1, 43 0.01 173.25 ***   .17  <.0001
# 5        task:stimulus       1, 43 0.01  87.56 ***   .10  <.0001
# 6      length:stimulus 1.70, 72.97 0.00       1.91 .0007     .16
# 7 task:length:stimulus 1.70, 72.97 0.00       1.21 .0005     .30
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘+’ 0.1 ‘ ’ 1
#
# Sphericity correction method: GG
# Warning message:
# More than one observation per cell, aggregating the data using mean (i.e, fun_aggregate = mean)!

The printed output is an ANOVA table that could basically be copied to a manuscript as is. One sees the terms in column Effect, the degrees of freedoms (df), the mean-squared error (MSE, I would probably remove this column in a manuscript), the F-value (F, which also contains the significance stars), and the p-value (p.value). The only somewhat uncommon column is ges which provides generalized eta-squared, ‘the recommended effect size statistics for repeated measure designs’  . The standard output also reports Greenhouse-Geisser (GG) corrected df for repeated-measures factors with more than two levels (to account for possible violations of sphericity). Note that these corrected df are not integers.

We can also see a warning notifying us that afex has detected that each participant and cell of the design provides more than one observation which are then automatically aggregated using mean. The warning serves the purpose to notify the user in case this was not intended (i.e., when there should be only one observation per participant and cell of the design). The warning can be suppressed via specifying fun_aggregate = mean explicitly in the call to aov_car.
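In other words, the following call (a sketch based on the call shown above) produces the same ANOVA without the warning:

# Same ANOVA as a1, with the aggregation made explicit so no warning is emitted.
a1b <- aov_car(log_rt ~ task*length*stimulus + Error(id/(length*stimulus)),
               data = fhch, fun_aggregate = mean)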

The formula passed to aov_car basically needs to be the same as for standard aov with a few differences:

  • It must have an Error term specifying the column containing the participant (or unit of observation) identifier (e.g., minimally +Error(id)). This is necessary to allow the automatic aggregation even in designs without repeated-measures factor.
  • Repeated-measures factors only need to be defined in the Error term and do not need to be enclosed by parentheses. Consequently, the following call produces the same ANOVA: aov_car(log_rt ~ task + Error(id/length*stimulus), fhch)

     

In addition to aov_car, afex provides two further functions for calculating ANOVAs. These functions produce the same output but differ in how the ANOVA is specified.

  • aov_ez allows the ANOVA specification not via a formula but via character vectors (and is similar to ez::ezANOVA()): aov_ez(id = "id", dv = "log_rt", fhch, between = "task", within = c("length", "stimulus"))
  • aov_4 requires a formula for which the id and repeated-measures factors need to be specified as in lme4::lmer() (with the same simplification that repeated-measures factors only need to be specified in the random part): aov_4(log_rt ~ task + (length*stimulus|id), fhch) aov_4(log_rt ~ task*length*stimulus + (length*stimulus|id), fhch)
Follow-up Tests

A common requirement after the omnibus test provided by the ANOVA is some sort of follow-up analysis. For this purpose, afex is fully integrated with lsmeans.

For example, assume we are interested in the significant task:stimulus interaction. As a first step we might want to investigate the marginal means of these two factors:

lsmeans(a1, c("stimulus","task")) # NOTE: Results may be misleading due to involvement in interactions #  stimulus task        lsmean         SE    df    lower.CL    upper.CL #  word     naming -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435 #  nonword  naming -0.02687619 0.04250050 48.46 -0.11230839  0.05855602 #  word     lexdec  0.00331642 0.04224522 47.37 -0.08165241  0.08828525 #  nonword  lexdec  0.05640801 0.04224522 47.37 -0.02856083  0.14137684 # # Results are averaged over the levels of: length # Confidence level used: 0.95

From this we can see that naming trials seem to be generally slower (as a reminder, the dv is log-transformed RT in seconds, so values below 0 correspond to RTs between 0 and 1). It also appears that the difference between word and nonword trials is larger in the naming task than in the lexdec task. We test this with the following code using a few different lsmeans functions. We first use lsmeans again, but this time using task as the conditioning variable specified in by. Then we use pairs() for obtaining all pairwise comparisons within each conditioning stratum (i.e., level of task). This already provides us with the correct tests, but does not control for the family-wise error rate across both tests. To get that, we simply update() the returned results and remove the conditioning by setting by=NULL. In the call to update we can already specify the method for error control, and we specify 'holm' because it is uniformly more powerful than Bonferroni.

# set up conditional marginal means:
(ls1 <- lsmeans(a1, c("stimulus"), by="task"))
# task = naming:
#  stimulus      lsmean         SE    df    lower.CL    upper.CL
#  word     -0.34111656 0.04250050 48.46 -0.42654877 -0.25568435
#  nonword  -0.02687619 0.04250050 48.46 -0.11230839  0.05855602
#
# task = lexdec:
#  stimulus      lsmean         SE    df    lower.CL    upper.CL
#  word      0.00331642 0.04224522 47.37 -0.08165241  0.08828525
#  nonword   0.05640801 0.04224522 47.37 -0.02856083  0.14137684
#
# Results are averaged over the levels of: length
# Confidence level used: 0.95

update(pairs(ls1), by=NULL, adjust = "holm")
#  contrast       task      estimate         SE df t.ratio p.value
#  word - nonword naming -0.31424037 0.02080113 43 -15.107  <.0001
#  word - nonword lexdec -0.05309159 0.01860509 43  -2.854  0.0066
#
# Results are averaged over the levels of: length
# P value adjustment: holm method for 2 tests

Hmm. These results show that the stimulus effects in both task conditions are independently significant. Obviously, the difference between them must also be significant then, right?

pairs(update(pairs(ls1), by=NULL))
#  contrast                            estimate         SE df t.ratio p.value
#  wrd-nnwrd,naming - wrd-nnwrd,lexdec -0.2611488 0.02790764 43  -9.358  <.0001

They obviously are. As a reminder, the interaction is testing exactly this, the difference of the difference. And we can actually recover the F-value of the interaction using lsmeans alone by invoking yet another of its functions, test(..., joint=TRUE).

test(pairs(update(pairs(ls1), by=NULL)), joint=TRUE)
#  df1 df2      F p.value
#    1  43 87.565  <.0001

These last two examples were perhaps not particularly interesting from a statistical point of view, but show an important ability of lsmeans. Any set of estimated marginal means produced by lsmeans, including any sort of (custom) contrasts, can be used again for further tests or for calculating new sets of marginal means. And with test() we can even obtain joint F-tests over several parameters using joint=TRUE. lsmeans is extremely powerful and one of my most frequently used packages that basically performs all tests following an omnibus test (and in its latest version it directly interfaces with rstanarm, so it can now also be used for a lot of Bayesian stuff, but this is the topic of another blog post).
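As a small illustration of that flexibility (a sketch, not from the original post), a custom contrast can be passed to contrast() as a plain list of coefficient vectors; with ls1 from above, the following reproduces the word-minus-nonword comparison within each task:

# Custom contrast over the levels of stimulus (word, nonword) within each task.
contrast(ls1, list(word_vs_nonword = c(1, -1)))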

Finally, lsmeans can also be used directly for plotting by invoking lsmip:

lsmip(a1, task ~ stimulus)

Note that lsmip does not add error bars to the estimated marginal means, but only plots the point estimates. There are mainly two reasons for this. First, as soon as repeated-measures factors are involved, it is difficult to decide which error bars to plot. Standard error bars based on the standard error of the mean are not appropriate for within-subjects comparisons. For those, one would need to use within-subject intervals (see also here or here). Especially for plots like the current one with both independent-samples and repeated-measures factors (i.e., mixed within-between designs or split-plot designs) no error bar will allow comparisons across both dimensions. Second, only ‘if the SE [i.e., standard error] of the mean is exactly 1/2 the SE of the difference of two means — which is almost never the case — it would be appropriate to use overlapping confidence intervals to test comparisons of means’ (lsmeans author Russel Lenth, the link provides an alternative).

We can also use lsmeans in combination with lattice to plot the results on the unconstrained scale (i.e., after back-transforming the data from the log scale to the original scale of response time in seconds). The plot is not shown here.

lsm1 <- summary(lsmeans(a1, c("stimulus","task")))
lsm1$lsmean <- exp(lsm1$lsmean)
require(lattice)
xyplot(lsmean ~ stimulus, lsm1, group = task, type = "b",
       auto.key = list(space = "right"))

 

Summary
  • afex provides a set of functions that make specifying standard ANOVAs for an arbitrary number of between-subjects (i.e., independent-sample) or within-subjects (i.e., repeated-measures) factors easy: aov_car(), aov_ez(), and aov_4().
  • In its default settings, the afex ANOVA functions replicate the results of commercial statistical packages such as SPSS or SAS (using orthogonal contrasts and Type III sums of squares).
  • Fitted ANOVA models can be passed to lsmeans for follow-up tests, custom contrast tests, and plotting.
  • For specific questions visit the new afex support forum: afex.singmann.science (I think we just need someone to ask the first ANOVA question to get the ball rolling).
  • For more examples see the vignette or here (blog post by Ulf Mertens) or download the full example R script used here.

As a caveat, let me end this post with some cautionary remarks from Douglas Bates (fortunes::fortune(184)) who explains why ANOVA in R is supposed to not be the same as in other software packages (i.e., he justifies why it ‘sucks’):

You must realize that R is written by experts in statistics and statistical computing who, despite popular opinion, do not believe that everything in SAS and SPSS is worth copying. Some things done in such packages, which trace their roots back to the days of punched cards and magnetic tape when fitting a single linear model may take several days because your first 5 attempts failed due to syntax errors in the JCL or the SAS code, still reflect the approach of “give me every possible statistic that could be calculated from this model, whether or not it makes sense”. The approach taken in R is different. The underlying assumption is that the useR is thinking about the analysis while doing it.
— Douglas Bates (in reply to the suggestion to include type III sums of squares and lsmeans in base R to make it more similar to SAS or SPSS)
R-help (March 2007)

Maybe he is right, but maybe what I have described here is useful to some degree.


R⁶ — Scraping Images To PDFs

Mon, 06/05/2017 - 16:34

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

I’ve been doing intermittent prep work for a follow-up to an earlier post on store closings and came across this CNN Money “article” on it. Said “article” is a deliberately obfuscated or lazily crafted series of GIF images that contain all the Radio Shack impending store closings. It’s the most comprehensive list I’ve found, but the format is terrible and there’s no easy, in-browser way to download them all.

CNN has ToS that prevent automated data gathering from CNN-proper. But, they used Adobe Document Cloud for these images which has no similar restrictions from a quick glance at their ToS. That means you get an R⁶ post on how to grab the individual 38 images and combine them into one PDF. I did this all with the hopes of OCRing the text, which has not panned out too well since the image quality and font was likely deliberately set to make it hard to do precisely what I’m trying to do.

If you work through the example, you’ll get a feel for:

  • using sprintf() to take a template and build a vector of URLs
  • use dplyr progress bars
  • customize httr verb options to ensure you can get to content
  • use purrr to iterate through a process of turning raw image bytes into image content (via magick) and turn a list of images into a PDF
library(httr)
library(magick)
library(tidyverse)

url_template <- "https://assets.documentcloud.org/documents/1657793/pages/radioshack-convert-p%s-large.gif"

pb <- progress_estimated(38)
sprintf(url_template, 1:38) %>%
  map(~{
    pb$tick()$print()
    GET(url = .x,
        add_headers(
          accept = "image/webp,image/apng,image/*,*/*;q=0.8",
          referer = "http://money.cnn.com/interactive/technology/radio-shack-closure-list/index.html",
          authority = "assets.documentcloud.org"))
  }) -> store_list_pages

map(store_list_pages, content) %>%
  map(image_read) %>%
  reduce(image_join) %>%
  image_write("combined_pages.pdf", format = "pdf")

I figured out the Document Cloud links and necessary httr::GET() options by using Chrome Developer Tools and my curlconverter package.
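For completeness, here is roughly how the OCR attempt mentioned above could look (a sketch, not from the original post, using the tesseract package; expect noisy output given the image quality):

# Convert each downloaded GIF to PNG and run it through tesseract OCR.
library(tesseract)
map(store_list_pages, content) %>%
  map(image_read) %>%
  map_chr(~ ocr(image_write(.x, tempfile(fileext = ".png")))) %>%
  paste(collapse = "\n") -> radioshack_text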

If any academic-y folks have a test subject (er, summer intern) with a free hour and would be willing to have them transcribe this list and stick it on GitHub, you'd have my eternal thanks.


14 Jobs for R users (2017-05-06) – from all over the world

Mon, 06/05/2017 - 16:30
To post your R job on the next post

Just visit this link and post a new R job to the R community.

You can post a job for free (and there are also “featured job” options available for extra exposure).

Current R jobs

Job seekers: please follow the links below to learn more and apply for your R job of interest:

Featured Jobs
More New Jobs
  1. Full-Time
    Phd opportunity @ Barcelona / Moorepark Teagasc (Ireland), Universitat Autònoma de Barcelona (Spain) and CEDRIC (Centre d’études et de recherché en informatique et communications) of CNAM (Paris) – Posted by emmanuel.serrano.ferron
    Anywhere
    25 May 2017
  2. Full-Time
    Research Software Engineer @ Bailrigg, England Lancaster University – Posted by killick
    Bailrigg England, United Kingdom
    24 May 2017
  3. Full-Time
    Senior Data Scientist @ Minneapolis, Minnesota, U.S. General Mills – Posted by dreyco676
    Minneapolis Minnesota, United States
    23 May 2017
  4. Part-Time
    Big Data Discovery Automation in Digital Health (5-10 hours per week) MedStar Institutes for Innovation – Posted by Praxiteles
    Anywhere
    22 May 2017
  5. Full-Time
    Data Scientist / Analytics Consultant Hedera Insights – Posted by HederaInsights
    Antwerpen Vlaanderen, Belgium
    18 May 2017
  6. Full-Time
    Data Scientists for ArcelorMittal @ Avilés, Principado de Asturias, Spain ArcelorMittal – Posted by JuanM
    Avilés Principado de Asturias, Spain
    12 May 2017
  7. Full-Time
    Data Scientist (m/f) Meritocracy – Posted by arianna@meritocracy
    Hamburg Hamburg, Germany
    11 May 2017
  8. Full-Time
    Data Scientist Prospera Technologies – Posted by Prospera Technologies
    Tel Aviv-Yafo Tel Aviv District, Israel
    11 May 2017
  9. Full-Time
    Data Scientist One Acre Fund – Posted by OAF
    Nairobi Nairobi County, Kenya
    9 May 2017
  10. Full-Time
    Back-End Developer – Sacramento Kings (Sacramento, CA) Sacramento Kings – Posted by khavey
    Sacramento California, United States
    9 May 2017
  11. Full-Time
    Data Analyst, Chicago Violence Reduction Strategy National Network for Safe Communities (NNSC) – Posted by nnsc
    Chicago Illinois, United States
    5 May 2017
  12. Full-Time
    Transportation Market Research Analyst @ Arlington, Virginia, U.S. RSG – Posted by patricia.holland@rsginc.com
    Arlington Virginia, United States
    4 May 2017
  13. Full-Time
    Data Manager @ Los Angeles, California, U.S. Virtual Pediatric Systems, LLC – Posted by gsotocampos
    Los Angeles California, United States
    3 May 2017
  14. Full-Time
    Sr. Manager, Analytics @ Minneapolis, Minnesota, U.S. Korn Ferry – Posted by jone1087
    Minneapolis Minnesota, United States
    3 May 2017

In R-users.com you can see all the R jobs that are currently available.

R-users Resumes

R-users also has a resume section which features CVs from over 300 R users. You can submit your resume (as a “job seeker”) or browse the resumes for free.

(you may also look at previous R jobs posts).


Weather forecast with regression models – part 2

Mon, 06/05/2017 - 14:27

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In the first part, I introduced the weather dataset and outlined its exploratory analysis. In this second part of the tutorial, we are going to build multiple logistic regression models to forecast the weather. Specifically, we intend to produce the following forecasts:

  • tomorrow’s weather forecast at 9am of the current day
  • tomorrow’s weather forecast at 3pm of the current day
  • tomorrow’s weather forecast at late evening time of the current day

For each of above tasks, a specific subset of variables shall be available, and precisely:

  • 9am: MinTemp, WindSpeed9am, Humidity9am, Pressure9am, Cloud9am, Temp9am
  • 3pm: (9am variables) + Humidity3pm, Pressure3pm, Cloud3pm, Temp3pm, MaxTemp
  • evening: (3pm variables) + Evaporation, Sunshine, WindGustDir, WindGustSpeed

We suppose MinTemp is already available at 9am, as the overnight temperature determines that value. We suppose MaxTemp is already available at 3pm, as it is determined during the central hours of the day. Further, we suppose Sunshine, Evaporation, WindGustDir and WindGustSpeed are only final by late evening. The other variables are explicitly bound to a specific time of day.

suppressPackageStartupMessages(library(caret))
set.seed(1023)
weather_data5 <- read.csv("weather_data5.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
colnames(weather_data5)
 [1] "MinTemp"       "MaxTemp"       "Evaporation"   "Sunshine"      "WindGustDir"
 [6] "WindGustSpeed" "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[11] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"
[16] "Temp3pm"       "RainTomorrow"
nrow(weather_data5)
[1] 353
sum(weather_data5["RainTomorrow"] == "Yes")
[1] 64
sum(weather_data5["RainTomorrow"] == "No")
[1] 289

We are going to split the original dataset in a training dataset (70% of original data) and a testing dataset (30% remaining).

train_rec <- createDataPartition(weather_data5$RainTomorrow, p = 0.7, list = FALSE)
training <- weather_data5[train_rec,]
testing <- weather_data5[-train_rec,]

We check the balance of RainTomorrow Yes/No fractions in the training and testing datasets.

sum(training["RainTomorrow"] == "Yes")/sum(training["RainTomorrow"] == "No") [1] 0.2216749 sum(testing["RainTomorrow"] == "Yes")/sum(testing["RainTomorrow"] == "No") [1] 0.2209302

The dataset is slightly unbalanced with respect to the label to be predicted. We will check later whether some remedy is needed or whether the achieved accuracy is satisfactory as is.
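Should a remedy turn out to be necessary, one simple option (not applied in this post) would be to rebalance the training set before fitting, for instance with caret's up-sampling helper:

# Up-sample the minority class ("Yes") in the training set; a sketch only.
training_up <- upSample(x = training[, names(training) != "RainTomorrow"],
                        y = training$RainTomorrow, yname = "RainTomorrow")
table(training_up$RainTomorrow)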

9AM Forecast Model

For all models, we are going to take advantage of a train control directive made available by the caret package which prescribes repeated k-fold cross-validation. The k-fold cross-validation method involves splitting the training dataset into k subsets. For each subset, one is held out while the model is trained on all other subsets. This process is completed until accuracy is determined for each instance in the training dataset, and an overall accuracy estimate is provided. The process of splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross-validation. The final model accuracy is taken as the mean over the repeats.

trControl <- trainControl(method = "repeatedcv", repeats = 5, number = 10, verboseIter = FALSE)

The trainControl is passed as a parameter to the train() caret function. As a further input to the train() function, the tuneLength parameter indicates the number of different values to try for each algorithm parameter. However in case of the “glm” method, it does not drive any specific behavior and hence it will be omitted.

Taking into account the results from the exploratory analysis, we compare two models for the 9AM prediction. The first one is determined and evaluated as follows.

predictors_9am_c1 <- c("Cloud9am", "Humidity9am", "Pressure9am", "Temp9am")
formula_9am_c1 <- as.formula(paste("RainTomorrow",
                                   paste(predictors_9am_c1, collapse="+"), sep="~"))
mod9am_c1_fit <- train(formula_9am_c1, data=training, method="glm",
                       family="binomial", trControl = trControl, metric = 'Accuracy')
mod9am_c1_fit$results$Accuracy
[1] 0.8383538

The resulting accuracy shown above is the one determined by the repeated k-fold cross validation as above explained. The obtained value is not that bad considering the use of just 9AM available data.

(summary_rep <- summary(mod9am_c1_fit$finalModel))
Call:
NULL

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7667  -0.5437  -0.3384  -0.1874   2.7742

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 96.01611   33.57702    2.860 0.004242 **
Cloud9am     0.17180    0.07763    2.213 0.026893 *
Humidity9am  0.06769    0.02043    3.313 0.000922 ***
Pressure9am -0.10286    0.03283   -3.133 0.001729 **
Temp9am      0.10595    0.04545    2.331 0.019744 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 177.34  on 243  degrees of freedom
AIC: 187.34

Number of Fisher Scoring iterations: 5

From the above summary, all predictors are significant for the logistic regression model. We can test the null hypothesis that the logistic model represents an adequate fit for the data by computing the following p-value.

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9994587

We further investigate if any predictor can be dropped by taking advantage of the drop1() function.

drop1(mod9am_c1_fit$finalModel, test="Chisq")
Single term deletions

Model:
.outcome ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
            Df Deviance    AIC     LRT  Pr(>Chi)
<none>           177.34 187.34
Cloud9am     1   182.48 190.48  5.1414 0.0233623 *
Humidity9am  1   190.19 198.19 12.8502 0.0003374 ***
Pressure9am  1   187.60 195.60 10.2668 0.0013544 **
Temp9am      1   183.13 191.13  5.7914 0.0161050 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We can evaluate a second model where Temp9am is replaced by MinTemp. We do not keep both, as they are correlated (see the exploratory analysis in part #1).

predictors_9am_c2 <- c("Cloud9am", "Humidity9am", "Pressure9am", "MinTemp")
formula_9am_c2 <- as.formula(paste("RainTomorrow",
                                   paste(predictors_9am_c2, collapse="+"), sep="~"))
mod9am_c2_fit <- train(formula_9am_c2, data=training, method="glm",
                       family="binomial", trControl = trControl, metric = 'Accuracy')
mod9am_c2_fit$results$Accuracy
[1] 0.8366462

(summary_rep <- summary(mod9am_c2_fit$finalModel))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7233  -0.5446  -0.3510  -0.1994   2.7237

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) 103.00407   33.57102   3.068  0.00215 **
Cloud9am      0.16379    0.08014   2.044  0.04096 *
Humidity9am   0.05462    0.01855   2.945  0.00323 **
Pressure9am  -0.10792    0.03298  -3.272  0.00107 **
MinTemp       0.06750    0.04091   1.650  0.09896 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.9  on 247  degrees of freedom
Residual deviance: 180.3  on 243  degrees of freedom
AIC: 190.3

Number of Fisher Scoring iterations: 5

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual)
[1] 0.9990409

This second model shows a similar accuracy value; however, the p-value of MinTemp shows that this predictor is not significant. Further, the explained deviance is slightly lower than that of the first model. As a conclusion, we select the first model for the 9AM predictions and go on to compute its accuracy on the test set.

mod9am_pred <- predict(mod9am_c1_fit, testing)
confusionMatrix(mod9am_pred, testing[,"RainTomorrow"])
Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  80  12
       Yes  6   7

               Accuracy : 0.8286
                 95% CI : (0.7427, 0.8951)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.4602

                  Kappa : 0.3405
 Mcnemar's Test P-Value : 0.2386

            Sensitivity : 0.9302
            Specificity : 0.3684
         Pos Pred Value : 0.8696
         Neg Pred Value : 0.5385
             Prevalence : 0.8190
         Detection Rate : 0.7619
   Detection Prevalence : 0.8762
      Balanced Accuracy : 0.6493

       'Positive' Class : No

The confusion matrix shows encouraging results, not a bad test accuracy for a 9AM weather forecast. We then go on building the 3PM prediction model.

3PM Forecast Model

We are going to use what we expect to be reliable predictors at 3PM. We have already seen that they are not strongly correlated with each other and that they are not linearly dependent.

predictors_3pm_c1 <- c("Cloud3pm", "Humidity3pm", "Pressure3pm", "Temp3pm")
formula_3pm_c1 <- as.formula(paste("RainTomorrow",
                                   paste(predictors_3pm_c1, collapse="+"), sep="~"))
mod3pm_c1_fit <- train(formula_3pm_c1, data = training, method = "glm",
                       family = "binomial", trControl = trControl, metric = 'Accuracy')
mod3pm_c1_fit$results$Accuracy
[1] 0.8582333

This model shows an acceptable accuracy as measured on the training set.

(summary_rep <- summary(mod3pm_c1_fit$finalModel))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3065  -0.5114  -0.2792  -0.1340   2.6641

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) 122.56229   35.20777   3.481 0.000499 ***
Cloud3pm      0.27754    0.09866   2.813 0.004906 **
Humidity3pm   0.06745    0.01817   3.711 0.000206 ***
Pressure3pm  -0.12885    0.03443  -3.743 0.000182 ***
Temp3pm       0.10877    0.04560   2.385 0.017071 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 159.64  on 243  degrees of freedom
AIC: 169.64

Number of Fisher Scoring iterations: 6

All predictors are reported as significant for the model. Let us further verify whether any predictor can be dropped.

drop1(mod3pm_c1_fit$finalModel, test="Chisq")

Single term deletions

Model:
.outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
            Df Deviance    AIC     LRT  Pr(>Chi)
                 159.64 169.64
Cloud3pm     1   168.23 176.23  8.5915  0.003377 **
Humidity3pm  1   175.91 183.91 16.2675 5.500e-05 ***
Pressure3pm  1   175.60 183.60 15.9551 6.486e-05 ***
Temp3pm      1   165.77 173.77  6.1329  0.013269 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9999913

We go on with the computation of the test set accuracy.

mod3pm_pred <- predict(mod3pm_c1_fit, testing)
confusionMatrix(mod3pm_pred, testing[,"RainTomorrow"])

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  82   8
       Yes  4  11

               Accuracy : 0.8857
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.04408

                  Kappa : 0.58
 Mcnemar's Test P-Value : 0.38648

            Sensitivity : 0.9535
            Specificity : 0.5789
         Pos Pred Value : 0.9111
         Neg Pred Value : 0.7333
             Prevalence : 0.8190
         Detection Rate : 0.7810
   Detection Prevalence : 0.8571
      Balanced Accuracy : 0.7662

       'Positive' Class : No

The test set prediction accuracy is quite satisfactory.

Evening Forecast Model

As a first evening forecast model, we introduce the Sunshine variable and at the same time take Cloud3pm and Humidity3pm out, as they are strongly correlated with Sunshine.
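The correlations motivating this choice can be verified on the training set with a one-liner (again a quick sketch; part #1 covers the exploratory analysis in detail):

cor(training[, c("Sunshine", "Cloud3pm", "Humidity3pm")], use = "complete.obs")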

predictors_evening_c1 <- c("Pressure3pm", "Temp3pm", "Sunshine")
formula_evening_c1 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c1, collapse="+"), sep="~"))
mod_ev_c1_fit <- train(formula_evening_c1, data=training, method="glm", family="binomial",
                       trControl = trControl, metric = 'Accuracy')
mod_ev_c1_fit$results$Accuracy
[1] 0.8466974

The training set based accuracy is acceptable.

(summary_rep <- summary(mod_ev_c1_fit$finalModel))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.9404 -0.4532 -0.2876 -0.1350  2.4664

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 174.43458   37.12201   4.699 2.61e-06 ***
Pressure3pm  -0.17175    0.03635  -4.726 2.29e-06 ***
Temp3pm       0.06506    0.03763   1.729   0.0838 .
Sunshine     -0.41519    0.07036  -5.901 3.61e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 156.93  on 244  degrees of freedom
AIC: 164.93

Number of Fisher Scoring iterations: 6

The Temp3pm predictor is reported as not significant and can be dropped, as confirmed below.

drop1(mod_ev_c1_fit$finalModel, test="Chisq")

Single term deletions

Model:
.outcome ~ Pressure3pm + Temp3pm + Sunshine
            Df Deviance    AIC    LRT  Pr(>Chi)
                 156.93 164.93
Pressure3pm  1   185.64 191.64 28.712 8.400e-08 ***
Temp3pm      1   160.03 166.03  3.105   0.07805 .
Sunshine     1   203.03 209.03 46.098 1.125e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9999968

As a second tentative model, we take advantage of the 3PM model predictors together with WindGustDir and WindGustSpeed.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir", "WindGustSpeed")
formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~"))
mod_ev_c2_fit <- train(formula_evening_c2, data=training, method="glm", family="binomial",
                       trControl = trControl, metric = 'Accuracy')
mod_ev_c2_fit$results$Accuracy
[1] 0.8681179

This results in an improved training set accuracy.

(summary_rep <- summary(mod_ev_c2_fit$finalModel)) Deviance Residuals: Min 1Q Median 3Q Max -2.44962 -0.42289 -0.20929 -0.04328 2.76204 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 9.928e+01 4.863e+01 2.042 0.041193 * Cloud3pm 2.569e-01 1.081e-01 2.376 0.017481 * Humidity3pm 8.441e-02 2.208e-02 3.822 0.000132 *** Pressure3pm -1.088e-01 4.695e-02 -2.318 0.020462 * Temp3pm 1.382e-01 5.540e-02 2.495 0.012596 * WindGustDirENE -1.064e+00 1.418e+00 -0.750 0.453254 WindGustDirESE 1.998e+00 1.088e+00 1.837 0.066235 . WindGustDirN -2.731e-03 1.759e+00 -0.002 0.998761 WindGustDirNE 1.773e+00 1.260e+00 1.407 0.159364 WindGustDirNNE -1.619e+01 2.884e+03 -0.006 0.995520 WindGustDirNNW 1.859e+00 1.021e+00 1.821 0.068672 . WindGustDirNW 1.011e+00 1.009e+00 1.002 0.316338 WindGustDirS 1.752e+00 1.248e+00 1.404 0.160394 WindGustDirSE -1.549e+01 2.075e+03 -0.007 0.994043 WindGustDirSSE -1.051e-01 1.636e+00 -0.064 0.948777 WindGustDirSSW -1.451e+01 6.523e+03 -0.002 0.998225 WindGustDirSW 2.002e+01 6.523e+03 0.003 0.997551 WindGustDirW 1.108e+00 1.183e+00 0.936 0.349051 WindGustDirWNW 1.325e-01 1.269e+00 0.104 0.916848 WindGustDirWSW -1.574e+01 4.367e+03 -0.004 0.997124 WindGustSpeed 1.674e-02 2.065e-02 0.811 0.417420 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 135.61 on 227 degrees of freedom AIC: 177.61 Number of Fisher Scoring iterations: 17

The WindGustDir and WindGustSpeed predictors are reported as not significant. Also, the AIC value is noticeably higher.

drop1(mod_ev_c2_fit$finalModel, test="Chisq") Single term deletions Model: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + WindGustDirWNW + WindGustDirWSW + WindGustSpeed Df Deviance AIC LRT Pr(>Chi) 135.61 177.61 Cloud3pm 1 141.64 181.64 6.0340 0.01403 * Humidity3pm 1 154.49 194.49 18.8817 1.391e-05 *** Pressure3pm 1 141.38 181.38 5.7692 0.01631 * Temp3pm 1 142.21 182.21 6.5992 0.01020 * WindGustDirENE 1 136.20 176.20 0.5921 0.44159 WindGustDirESE 1 139.40 179.40 3.7883 0.05161 . WindGustDirN 1 135.61 175.61 0.0000 0.99876 WindGustDirNE 1 137.55 177.55 1.9412 0.16354 WindGustDirNNE 1 136.40 176.40 0.7859 0.37535 WindGustDirNNW 1 139.43 179.43 3.8160 0.05077 . WindGustDirNW 1 136.69 176.69 1.0855 0.29747 WindGustDirS 1 137.64 177.64 2.0293 0.15430 WindGustDirSE 1 136.36 176.36 0.7458 0.38782 WindGustDirSSE 1 135.61 175.61 0.0041 0.94866 WindGustDirSSW 1 135.64 175.64 0.0339 0.85384 WindGustDirSW 1 138.38 178.38 2.7693 0.09609 . WindGustDirW 1 136.52 176.52 0.9106 0.33996 WindGustDirWNW 1 135.62 175.62 0.0109 0.91689 WindGustDirWSW 1 135.85 175.85 0.2414 0.62318 WindGustSpeed 1 136.27 176.27 0.6633 0.41541 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

WindGustDir shows borderline p-values for some specific directions, while WindGustSpeed is clearly not significant and should be dropped from the model. Hence, we redefine the model after taking WindGustSpeed out.

predictors_evening_c2 <- c(predictors_3pm_c1, "WindGustDir") formula_evening_c2 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c2, collapse="+"), sep="~")) mod_ev_c2_fit <- train(formula_evening_c2, data=training, method="glm", family="binomial", trControl = trControl, metric = 'Accuracy') mod_ev_c2_fit$results$Accuracy [1] 0.8574256 (summary_rep <- summary(mod_ev_c2_fit$finalModel)) Deviance Residuals: Min 1Q Median 3Q Max -2.45082 -0.40624 -0.21271 -0.04667 2.71193 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 118.76549 42.77047 2.777 0.005490 ** Cloud3pm 0.25566 0.10794 2.369 0.017859 * Humidity3pm 0.08052 0.02126 3.787 0.000152 *** Pressure3pm -0.12701 0.04172 -3.044 0.002333 ** Temp3pm 0.13035 0.05441 2.396 0.016592 * WindGustDirENE -1.13954 1.43581 -0.794 0.427398 WindGustDirESE 2.03313 1.10027 1.848 0.064624 . WindGustDirN -0.00212 1.68548 -0.001 0.998996 WindGustDirNE 1.59121 1.25530 1.268 0.204943 WindGustDirNNE -16.31932 2891.85796 -0.006 0.995497 WindGustDirNNW 1.87617 1.03532 1.812 0.069962 . WindGustDirNW 1.09105 1.01928 1.070 0.284433 WindGustDirS 1.93346 1.24036 1.559 0.119046 WindGustDirSE -15.50996 2073.57766 -0.007 0.994032 WindGustDirSSE -0.09029 1.63401 -0.055 0.955934 WindGustDirSSW -14.34617 6522.63868 -0.002 0.998245 WindGustDirSW 19.99670 6522.63868 0.003 0.997554 WindGustDirW 1.16834 1.18736 0.984 0.325127 WindGustDirWNW 0.16961 1.28341 0.132 0.894858 WindGustDirWSW -15.71816 4349.40572 -0.004 0.997117 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 234.90 on 247 degrees of freedom Residual deviance: 136.27 on 228 degrees of freedom AIC: 176.27 Number of Fisher Scoring iterations: 17 drop1(mod_ev_c2_fit$finalModel, test="Chisq") Single term deletions Model: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE + WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + WindGustDirNNW + WindGustDirNW + WindGustDirS + WindGustDirSE + WindGustDirSSE + WindGustDirSSW + WindGustDirSW + WindGustDirW + WindGustDirWNW + WindGustDirWSW Df Deviance AIC LRT Pr(>Chi) 136.27 176.27 Cloud3pm 1 142.24 180.24 5.9653 0.014590 * Humidity3pm 1 154.50 192.50 18.2255 1.962e-05 *** Pressure3pm 1 146.76 184.76 10.4852 0.001203 ** Temp3pm 1 142.33 180.33 6.0611 0.013819 * WindGustDirENE 1 136.93 174.93 0.6612 0.416126 WindGustDirESE 1 140.13 178.13 3.8568 0.049545 * WindGustDirN 1 136.27 174.27 0.0000 0.998996 WindGustDirNE 1 137.87 175.87 1.5970 0.206332 WindGustDirNNE 1 137.14 175.14 0.8698 0.351012 WindGustDirNNW 1 140.09 178.09 3.8159 0.050769 . WindGustDirNW 1 137.52 175.52 1.2477 0.263994 WindGustDirS 1 138.78 176.78 2.5032 0.113617 WindGustDirSE 1 137.03 175.03 0.7541 0.385179 WindGustDirSSE 1 136.28 174.28 0.0031 0.955862 WindGustDirSSW 1 136.30 174.30 0.0290 0.864850 WindGustDirSW 1 139.00 177.00 2.7314 0.098393 . WindGustDirW 1 137.28 175.28 1.0100 0.314905 WindGustDirWNW 1 136.29 174.29 0.0174 0.894921 WindGustDirWSW 1 136.51 174.51 0.2373 0.626164 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

WindGustDirESE is reported as significant, hence including WindGustDir was the right choice, and we accept this model as a candidate. The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9999998

As a third and final candidate, we gather a small set of predictors: Pressure3pm and Sunshine.

predictors_evening_c3 <- c("Pressure3pm", "Sunshine")
formula_evening_c3 <- as.formula(paste("RainTomorrow", paste(predictors_evening_c3, collapse="+"), sep="~"))
mod_ev_c3_fit <- train(formula_evening_c3, data=training, method="glm", family="binomial",
                       trControl = trControl, metric = 'Accuracy')
mod_ev_c3_fit$results$Accuracy
[1] 0.8484513

The training set based accuracy has an acceptable value.

(summary_rep <- summary(mod_ev_c3_fit$finalModel))

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.1123 -0.4602 -0.2961 -0.1562  2.3997

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 192.97146   36.09640   5.346 8.99e-08 ***
Pressure3pm  -0.18913    0.03545  -5.336 9.52e-08 ***
Sunshine     -0.35973    0.06025  -5.971 2.36e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.90  on 247  degrees of freedom
Residual deviance: 160.03  on 245  degrees of freedom
AIC: 166.03

Number of Fisher Scoring iterations: 6

All predictors are reported as significant.

drop1(mod_ev_c3_fit$finalModel, test="Chisq")

Single term deletions

Model:
.outcome ~ Pressure3pm + Sunshine
            Df Deviance    AIC    LRT  Pr(>Chi)
                 160.03 166.03
Pressure3pm  1   199.36 203.36 39.324 3.591e-10 ***
Sunshine     1   206.09 210.09 46.060 1.147e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The p-value associated with the null hypothesis that this model is a good fit for the data is:

1 - pchisq(summary_rep$deviance, summary_rep$df.residual) [1] 0.9999938

We compare the last two models by running an ANOVA analysis to check whether the lower residual deviance of the first model is significant or not.

anova(mod_ev_c2_fit$finalModel, mod_ev_c3_fit$finalModel, test="Chisq")

Analysis of Deviance Table

Model 1: .outcome ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDirENE +
    WindGustDirESE + WindGustDirN + WindGustDirNE + WindGustDirNNE + WindGustDirNNW +
    WindGustDirNW + WindGustDirS + WindGustDirSE + WindGustDirSSE + WindGustDirSSW +
    WindGustDirSW + WindGustDirW + WindGustDirWNW + WindGustDirWSW
Model 2: .outcome ~ Pressure3pm + Sunshine
  Resid. Df Resid. Dev  Df Deviance Pr(>Chi)
1       228     136.27
2       245     160.03 -17   -23.76   0.1261
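As a complementary aside (not part of the original analysis), the two candidates can also be compared via AIC, which penalizes the larger model for its extra parameters:

AIC(mod_ev_c2_fit$finalModel, mod_ev_c3_fit$finalModel)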

Based on the ANOVA p-value, there is no significant difference between them. We therefore keep both models: the first because it provides better training accuracy than the second, and the second for its simplicity. Let us evaluate the test accuracy for both of them.

modevening_pred <- predict(mod_ev_c2_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  83   9
       Yes  3  10

               Accuracy : 0.8857
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.04408

                  Kappa : 0.5604
 Mcnemar's Test P-Value : 0.14891

            Sensitivity : 0.9651
            Specificity : 0.5263
         Pos Pred Value : 0.9022
         Neg Pred Value : 0.7692
             Prevalence : 0.8190
         Detection Rate : 0.7905
   Detection Prevalence : 0.8762
      Balanced Accuracy : 0.7457

       'Positive' Class : No

Good test accuracy, with high sensitivity and positive predictive value. We then test the second evening forecast candidate model.

modevening_pred <- predict(mod_ev_c3_fit, testing)
confusionMatrix(modevening_pred, testing[,"RainTomorrow"])

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  81   7
       Yes  5  12

               Accuracy : 0.8857
                 95% CI : (0.8089, 0.9395)
    No Information Rate : 0.819
    P-Value [Acc > NIR] : 0.04408

                  Kappa : 0.598
 Mcnemar's Test P-Value : 0.77283

            Sensitivity : 0.9419
            Specificity : 0.6316
         Pos Pred Value : 0.9205
         Neg Pred Value : 0.7059
             Prevalence : 0.8190
         Detection Rate : 0.7714
   Detection Prevalence : 0.8381
      Balanced Accuracy : 0.7867

       'Positive' Class : No

The test accuracy equals that of the previous model, with lower sensitivity but higher specificity. The positive predictive value is also improved.
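To keep the comparison at hand, the test accuracies of the fitted candidates can be collected into a single data frame (a small convenience sketch, not part of the original analysis):

candidate_fits <- list(mod9am_c1 = mod9am_c1_fit, mod3pm_c1 = mod3pm_c1_fit,
                       mod_ev_c2 = mod_ev_c2_fit, mod_ev_c3 = mod_ev_c3_fit)
test_acc <- sapply(candidate_fits, function(fit) {
  pred <- predict(fit, testing)
  confusionMatrix(pred, testing[,"RainTomorrow"])$overall["Accuracy"]
})
data.frame(model = names(candidate_fits), test_accuracy = unname(test_acc))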

Summing up, our final choices are:

mod9am_c1_fit:  RainTomorrow ~ Cloud9am + Humidity9am + Pressure9am + Temp9am
mod3pm_c1_fit:  RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm
mod_ev_c2_fit:  RainTomorrow ~ Cloud3pm + Humidity3pm + Pressure3pm + Temp3pm + WindGustDir
mod_ev_c3_fit:  RainTomorrow ~ Pressure3pm + Sunshine

We save the training record indexes and the training and testing datasets, together with all the chosen models, for further analysis.

saveRDS(list(weather_data5, train_rec, training, testing, mod9am_c1_fit, mod9am_c2_fit,
             mod3pm_c1_fit, mod_ev_c2_fit, mod_ev_c3_fit),
        file="wf_log_reg_part2.rds")
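In a later session the saved objects can be restored with readRDS(); the list elements come back in the same order used in the saveRDS() call above (a small usage sketch):

saved <- readRDS("wf_log_reg_part2.rds")
mod3pm_c1_fit <- saved[[7]]   # 7th element of the saved list, as per the order above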

Conclusions

We built models for weather forecasting at different hours of the day. We explored training accuracy and predictor significance to determine the best models, then computed the test accuracy for each and collected the results. Prediction accuracy turned out to be quite satisfactory, especially for the 3PM and evening models. In the next part of this tutorial, we will tune these models to improve their accuracy where possible.

If you have any questions, please feel free to comment below.

    Related Post

    1. Weather forecast with regression models – part 1
    2. Weighted Linear Support Vector Machine
    3. Logistic Regression Regularized with Optimization
    4. Analytical and Numerical Solutions to Linear Regression Problems
    5. How to create a loop to run multiple regression models
    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    Shiny app to explore ggplot2

    Mon, 06/05/2017 - 10:09

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)

    Do you struggle with learning ggplot2? Do you have problems understanding what aesthetics actually do and how manipulating them changes your plots?

    Here is the solution! Explore 33 ggplot2 geoms on one website!

    I created this ggplot2 explorer to help all R learners understand how to plot beautiful/useful charts using ggplot2, the most popular visualization package. It won’t teach you how to write the code, but it will definitely show you what ggplot2 geoms look like, and how manipulating their arguments changes the visualization.

    You can find my app here

    And you can find code on my github

    To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

    Unconf recap 1

    Mon, 06/05/2017 - 09:00

    (This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers)

    Following Stefanie's recap of unconf 17, we will spend this entire week posting summaries of projects developed at the event. We plan to highlight 4-5 projects each day, with detailed posts from a handful of teams to follow.

    skimr

    Summary: skimr, a package inspired by Hadley Wickham's precis package, aims to provide summary statistics iteratively and interactively as part of a pipeline. The package provides easily skimmable summary statistics to help you better understand your data and see what is missing.

    Team:
    Amelia McNamara, Eduardo Arino de la Rubia, Hao Zhu, Julia Lowndes, Shannon Ellis, Elin Waring, Michael Quinn, Hope McLeod

    Github: https://github.com/ropenscilabs/skimr

    emldown

    Summary: emldown is a package for creating a helpful website based on EML metadata. Ecological Metadata Language (EML) is a metadata specification developed by and for the ecology discipline. EML is implemented as a series of XML document types that can be used in a modular and extensible manner to document ecological data.

    Team: Maëlle Salmon, Kara Woo, Andrew McDonald, Carl Boettiger

    Github: https://github.com/ropenscilabs/emldown

    testrmd

    Summary: testrmd provides facilities to enable testing of and reporting on tested R Markdown chunks. When running R Markdown documents as part of a workflow where the data are likely to change with each render, testrmd ensures that each chunk will alert the reader to any problems that may arise as a result of such subtle changes in data.

    Team: Mara Averick, Christopher Tull, Joe Cheng, Robert Flight

    Github: https://github.com/ropenscilabs/testrmd

    webrockets

    Summary: The webrockets package implements a basic websocket listener in R. The implementation draws heavily on @hrbmstr's wrapper of easywsclient in https://github.com/ropenscilabs/decapitated.

    Team:

    Miles McBain, Alicia Schep, Carl Ganz, Bob Rudis, Henrik Bengtsson, Joe Cheng

    Github: https://github.com/ropenscilabs/webrockets

    To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

    Taking Advanced Analytics to the Cloud – Part I: R on AWS

    Mon, 06/05/2017 - 05:59

    (This article was first published on R – Luke Stanke, and kindly contributed to R-bloggers)


    Running R on the cloud isn’t very difficult. This demo shows how to get Rstudio running on Amazon Web Services. To run R on the cloud we need to initiate a machine/computer and install R – that’s all very simple. Where you might get caught up is in the settings and permissions.

    Step 1: Create an AWS Account

    If you don’t have an AWS account you’ll need to sign up.

    Step 2: Create a Role

    Once you are signed-up and signed into AWS the first thing you’ll need to do is create a role for your machine. A role provides the necessary permissions for your machine to access various APIs available by AWS.

    You can do this by searching IAM on the services tab.

    Once you are on the page click “Roles” on the left menu, then the “Create New Role” button.

    Then select Amazon EC2 from the AWS Service Roll.

    There are a number of services that you could run with the instance you will set up – and there are varying permissions, too. For now, just search for and select AmazonS3FullAccess, AmazonEC2FullAccess, RedshiftFullAccess, and AthenaFullAccess. You really won’t need any of these right away, but they will be useful when connecting to other services like S3 or another EC2 instance. Note: the photo does not include AthenaFullAccess, but you should include it!

    From there you’ll be good to go! Depending on your access needs your policies should be selected accordingly. In fact, if you were doing this correctly, you’d want to create policies that are not over-arching like the full access options we’ve selected for this demo.

    Step 3: Create a Key Pair

    Next, you’ll want to create a key pair. This will allow you to securely log into your instance to update your machine and install Rstudio.

    Go to Services and search EC2 in the search bar. Once the EC2 page loads, click “Key Pairs” under the “Network & Security” section on the left menu bar.

    From there click “Create Key Pair”. Give your key a name and hit create. The key pair will download as a .pem file. Do not lose this key! Also: do not share this key!

    Step 4: Create a R/Rstudio/Shiny Security Group

    As you can see from the steps thus far, it's all about security. From the EC2 menu, under the “Network & Security” section on the left menu, select “Security Groups”. Click the “Create Security Group” button and a pop-up will appear. Create a security group named “Rstudio/Shiny”. Under description, write “Allows port access for R and Shiny”. You can leave the VPC drop down alone. On the inbound tab – the default – add three new rules.

    Under the first rule select SSH from the dropdown. The source should be auto populated with 0.0.0.0/0. This opens up SSH to anywhere in the world – but you’ll need the .pem key to login.

    For the second rule leave it as custom TCP and type 8787 as the port range. This is the port that needs to be opened on the server for you to connect to Rstudio. Under source type 0.0.0.0/0. This means you could log into Rstudio from any IP address. You could also just use your own IP address if you know it, too.

    For the third rule, leave it as custom TCP as well and type 3838 as the port range. This is the port for the Shiny connection. We're not going to use it for the immediate demo, but it'll be very useful in the future. Under source, type 0.0.0.0/0 as well.

    Step 5: Launch EC2 Instance

    Staying in the EC2 Section click “Instance” under the “Instances” Section on the left menu. Click the “Launch Instance” button. This will take you to “Step 1: Choose an Amazon Machine Image”.

    You’ll have a number of tabs to select AMIs from. Stay on the Quick Start tab and select Amazon Linux AMI – it’s the first option. It has a number of pre-built tools that are useful – but it doesn’t have R installed. You’ll do that in a bit.

    For “Step 2: Choose an Instance Type”, choosing an instance type can be daunting. Here you are essentially specifying the type of computer you want to run. I typically select from the General purpose, Compute optimized, or Memory optimized options depending on the type of models I’m running and the type of data I am working with. For this example select t2.micro, because this is just a demo of the tool. Click “Next: Configure Instance”. Note: you’ll need a more powerful machine to install the packages in Step 8 and beyond. I’d recommend a c2.large machine just to be safe.

    For “Step 3: Configure Instance”, under “IAM Role” select the role you created earlier – my role was called EC2- S3-Redshift. Note: Under Advanced Options you could send a bash script to configure your instance, instead we’ll do it with command line tools. Click “Next: Add Storage”.

    For “Step 4: Add Storage” we can stick with the default settings – for now. Click “Next: Add Tags”

    For “Step 5: Add Tags”, click “Add Tag”, enter a key of “Name” and a value of “Rstudio Example” – this will give our machine a name of “Rstudio Example”. Having a name is important when you have more than one machine running. Click “Next: Configure Security Group”

    For “Step 6: Configure Security Group”, under “Assign a Security Group” select “Select an existing security group”. Find and select your “Rstudio/Shiny” security group. If this is your first time doing this, it should be fine. If you have multiple security groups like myself you’ll have to search through the lot. Click “Review and Launch”.

    Under the review screen, make sure you’ve selected instance types, security groups, and IAM roles as mentioned above. Click “Launch” – you’ll get a pop-up to select an existing key pair or create a new key pair. Choose an existing key pair and select the key pair you just created. My key is called awesome_key. Acknowledge that you’ve selected the key pair and that you’ll need it to log on. Note: make sure you have the key pair to log into your instance. This should go without saying, but I’m saying it – you need your key pair to log into your machine and set it up! Launch your instance.

    Step 6: Login to your instance

    This is where the Windows OS world diverges from the Apple/Unix/Linux/Ubuntu worlds. Windows doesn’t have a built-in terminal like these other machines so you’ll have to download and setup PuTTy if you don’t have it already. Next you’ll use your terminal to SSH into your newly created instance and set it up. The instance will likely need about 60 seconds to setup from when you hit launch.

    After you launch, click on your instance id just to the right of the message saying the instance launch has been initiated.

    This will take you to your instance on the EC2 dashboard. The dashboard is full of important information – most notably it’ll tell you if your instance is up and running. From the image below you can see my instance state is green running. This tells me it’s ready to be SSH’ed into.

    You can also see the Public DNS on the right and on the description tab. We’ll need that information to SSH into it. Note: you can also see my IAM Role and key pair name. Click the “Connect” button just to the right of the “Launch Instance” button. This provides you with additional directions on how to connect to your instance.

    First, let’s open our terminal and change directories so that we are in the folder that contains our .pem key. If you haven’t moved it out of your downloads folder, it’s probably there – and you should probably move it.

    Next change the permissions of your key pair to allow you to SSH onto your AWS instance.

    chmod 400 awesome_key.pem

    Then SSH onto your machine using the following format. You’ll have to replace the key pair with the key pair you’ve created. You’ll also have to change the public DNS address to the address of your machine.

    ssh -i "awesome_key.pem" ec2-user@ec2-54-145-158-106.compute-1.amazonaws.com

    Once you are logged in it your terminal should look like this:

    Step 7: Setup your instance

    Once we are logged in we need to update our machine, install a few additional programs, install R, and install Rstudio. You can do this by running the following commands line-by-line through your EC2 instance.

    # Update the machine
    sudo yum -y update
    # Install programs that run well with the devtools package
    sudo yum -y install libcurl-devel openssl-devel # used for devtools
    # Install programs that assist APIs
    sudo yum -y install libxml2 libxml2-devel
    # Install R
    sudo su
    yum install -y R

    After running this code you will have 1) updated your machine; 2) installed tools that allow the devtools package to run; 3) installed tools that allow packages like httr and aws.s3 to run; and 4) installed base R.

    Next you’ll want to install the most recent version of Rstudio and Shiny. Check here to find the most recent releases of Rstudio and Shiny. Edit the code so that you install the most recent version. Run the install of Rstudio and Shiny.

    # Install RStudio Server - change version when installing your Rstudio
    wget -P /tmp https://s3.amazonaws.com/rstudio-dailybuilds/rstudio-server-rhel-1.0.143-x86_64.rpm
    sudo yum install -y --nogpgcheck /tmp/rstudio-server-rhel-1.0.143-x86_64.rpm
    # Install shiny and shiny-server - change version when installing your Shiny Server
    R -e "install.packages('shiny', repos='http://cran.rstudio.com/')"
    wget https://download3.rstudio.org/centos5.9/x86_64/shiny-server-1.5.3.838-rh5-x86_64.rpm
    yum install -y --nogpgcheck shiny-server-1.5.3.838-rh5-x86_64.rpm
    # Add user(s)
    sudo useradd -m stanke
    sudo passwd stanke

    Finally, add a user and change the password. Once this is done you can terminate your SSH tunnel.

    Step 8: Log into your Rstudio Instance

    Copy your public DNS that was located on your EC2 page earlier. Paste that public DNS into your browser and add “:8787” after your instance – in my case “ec2-54-145-158-106.compute-1.amazonaws.com:8787”. Hit enter. Your Rstudio login page should appear. Enter your credentials from your new user. Click “Sign In”.

    Step 9: Setup your Rstudio Defaults.

    So technically you’ve done it. You’ve logged into Rstudio on AWS. But let’s take this another few steps. Let’s set up some defaults so that any time we want to set up an Rstudio instance on AWS we don’t have to go through the hassle we just did above. Let’s install a bunch of packages that you might regularly use. This install might take a while since you’ll be installing a number of packages onto your machine.

    ## install packages if not present
    install.packages(
      c("devtools", "sparklyr", "ggplot2", "magrittr", "tidyverse",
        "Lahman", "DBI", "rsparkling", "h2o", "ghit", "xml2",
        "stringr", "magrittr", "data.table", "clipr")
    )

    Let’s also create a new .R file and write a few lines of code. You don’t really need to do this, but it’ll show the power of Step 10 in a few minutes.

    Here is the R script you can use:

    ### get data
    data("iris")
    ### See data structure
    head(iris)
    ### Save iris data locally.
    write.csv(iris, "iris.csv")

    After the initial script: A .R file, an iris object, and an iris.csv file.

    Step 10: Take an image of your current instance

    Setting up Rstudio is a pain – all that time waiting for packages to install, getting your data just right, only to possibly find out you didn’t size your instance correctly. No one wants to have to deal with that every time they setup R on AWS. This isn’t a problem as you can take a snapshot of your instance as it is and spin off new instances from the point and time of the snapshot.

    To take a snapshot go back to the webpage with your EC2 details. Click the “Actions” button, then go to “Image” and “Create Image”.

    From there enter an Image Name of “Rstudio AMI Example” and Image Description of “Rstudio AMI Example”. Click “Create Image” and wait a few minutes.

    Step 11: Create an instance from the AMI

    Launch a new instance. On “Step 1: Choose an Amazon Machine Image”, click “My AMIs”. Select “Rstudio AMI Example”. Follow Step 5 above for the rest of the setup. However with this instance tag the Name as “Rstudio AMI”. If you’ve set up everything correctly you shouldn’t need to SSH into your instance to configure.

    Copy the Public DNS into your browser and add “:8787”. Login with your username and password derived earlier. Once logged in you’ll see that your .R script and .csv scripts are still saved to the instance – allowing you to quickly jump back into your analyses.

    In fact, creating an image and then creating multiple instances will allow you to quickly fit models and test which instance type will be best for the models/data you are currently working with and eventually minimize costs. If the data become more complicated or larger in size you can create a new instance that can accommodate those changes. However, sometimes the data become so large and we need to use a distributed system – Rstudio can sit on top of distributed systems as well – I’ll talk about that in a different post. This approach will get you a great jump-start though.

    To leave a comment for the author, please follow the link and comment on their blog: R – Luke Stanke.

    2017 Fantasy Football Projections

    Mon, 06/05/2017 - 05:10

    (This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers)

    We are releasing our 2017 fantasy football projections in a Shiny webapp using R.  The app allows you to calculate custom rankings/projections for your league based on your league settings.  The projections incorporate more sources of projections than any other site, and have been the most accurate projections over the last 5 years.  New features of the app this season include the ability to view additional variables (including “per game” projections).  You can access the Projections tool here:

    http://apps.fantasyfootballanalytics.net

    For instructions how to use the app, see here.  We also have a Draft Optimizer tool (see here).  See our articles on how to win your Snake Draft and Auction Draft.  We will be updating the projections as the season approaches with more sources of projections.  Feel free to add suggestions in the comments!

    The post 2017 Fantasy Football Projections appeared first on Fantasy Football Analytics.

    To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

    More on R and Reproducible Research

    Sun, 06/04/2017 - 23:33

    (This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)

    Readers who found my posting last week on our revisit package of interest may wish to check the somewhat related package ropensci/rrrpkg, which Edzer Pemesma tweeted about today.

    To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.

    Packaging the TheyWorkForYou API

    Sun, 06/04/2017 - 21:19

    (This article was first published on R – CONJUGATEPRIOR, and kindly contributed to R-bloggers)

    TheyWorkForYou is a great website for keeping up with British politics and one of the many fine things mySociety does to make democracy in the UK more transparent.

    There’s also an API, accessible via http and wrapped up for a few languages. However, R is not amongst them, so I wrote twfy.

    If you’re interested in using it (and you’ve got devtools installed) you can install it with

    devtools::install_github("conjugateprior/twfy")

    It was my first proper API package and a bit of a learning experience. If you want to hear more about that, read on.

    APIs

    First some recap, for those just joining us.

    The TheyWorkForYou API works with parameterized GETs to URLs with a common base:

    http://theyworkforyou.com/api/

    and different endpoints, depending on what you want. First you sign up for an API key and then you make the calls.

    For example, if you want a list of UK parliamentary constituencies then your endpoint is getConstituency, which takes either a name or a postcode, plus your API key (key) and an output specification, and returns a structured constituency object.

    In a browser, the complete call looks like

    https://www.theyworkforyou.com/api/getConstituency?name=Keighley&output=js&key=adsfiddscsdlsdlk

    where of course adsfiddscsdlsdlk isn’t really an API key. It just plays one on the web.

    The server returns a JSON object:

    { "bbc_constituency_id" : "344", "guardian_election_results" : "http://www.guardian.co.uk/politics/constituency/1050/keighley", "guardian_id" : "1050", "guardian_name" : "Keighley", "pa_id" : "338", "name" : "Keighley" }

    Except that’s only sort of true. The server claims to return a Javascript object, as we can tell from its MIME type. "text/javascript; charset=iso-8859-1". We’ll just treat it like JSON though.

    Making this call and decoding the result programmatically is straightforward with the right packages

    library(httr)
    library(jsonlite)

    q <- list(output="js", key="adsfiddscsdlsdlk", name="Keighley")
    url <- "https://www.theyworkforyou.com/api/getConstituency"
    resp <- GET(url, query=q)          # make the call
    json <- fromJSON(content(resp))    # parse the response

    where we’ve let the GET function deal with all the URL escaping, and just asked fromJSON to do the right thing. By default, jsonlite generally believes the right thing looks like a data.frame and most of the time that’s fine.

    So far so straightforward.

    Creating a general purpose function to call the API

    It’s pretty easy to generalize this code to create a generic API calling function

    call_api(endpoint, ...)

    Inside that function we’ll just grab the arguments as list(...), add our preferred output format and the API key, and hand the whole thing to GET.
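    A minimal sketch of what such a generic caller might look like (an illustration of the idea only; the real call_api inside twfy may differ in its details):

    library(httr)
    library(jsonlite)

    call_api <- function(endpoint, ...){
      params <- list(...)
      params$key <- Sys.getenv("TWFY_API_KEY")   # API key read from the environment
      params$output <- "js"                      # preferred output format
      resp <- GET(paste0("https://www.theyworkforyou.com/api/", endpoint), query = params)
      fromJSON(content(resp, as = "text", encoding = "ISO-8859-1"))
    }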

    With call_api in hand it’s possible to make all the API functions look pretty much the same, e.g.

    getConstituencies <- function(date=NULL, search=NULL){
      params <- list(date=date, search=search)
      call_api("getConstituencies", params)
    }

    but you know, why write ‘getConstituencies’ twice?

    Let’s be a bit more general, and use the fact that R functions know what their names are, and that all function parameters have a name, to make a completely general function body.

    The actual function in twfy is

    getConstituencies <- function(date=NULL, search=NULL){
      params <- params_from_call(match.call())
      do.call("call_api", params)
    }

    which has exactly the same body as

    getMPInfo <- function(id, fields=NULL){
      params <- params_from_call(match.call())
      do.call("call_api", params)
    }

    Cute.

    If the API adds another endpoint, I’ll create a new function with this body, give it the name of the endpoint, and write the help in roxygen above it.

    So how does this work?

    Well, inside (any) function as.list(match.call()) is a list with an unlabeled first element that is the name of the function, and subsequent labeled components that are its arguments. If we call getConstituencies function above with search="Keigh" that means

    [[1]]
    getConstituencies

    $search
    [1] "Keigh"

    All the package’s params_from_call does is remove the first argument from the list and re-add it (as.character, because it’s actually an R symbol) under the new label endpoint, so that params is

    $search
    [1] "Keigh"

    $endpoint
    [1] "getConstituencies"

    I then use do.call to call call_api with these arguments. This works because call_api is looking for an argument called endpoint and zero or more named arguments, and params gives it one of each.
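    To make that concrete, here is a rough sketch of what params_from_call might do (assumed for illustration; the real helper in twfy may differ):

    params_from_call <- function(mcall){
      params <- as.list(mcall)[-1]                  # keep only the named arguments
      params$endpoint <- as.character(mcall[[1]])   # the function name becomes the endpoint
      params
    }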

    This leads to the question: why even have a separate function for each endpoint offered by the API? There are two answers:

    First, an important effect of wrapping an API is to have the documentation near to hand. This requires separate R functions to write the roxygen above.

    Speaking of documentation, TheyWorkForYou is a little bit vague about what each of its endpoints returns, so if you’re listening, a pointer to some more documentation would be great.

    Second, it is sometimes useful to pre- or post-process the arguments to do.call. Here’s an example of how documentation and pre-processing interact:

    getDebates <- function(type=c("commons", "westminsterhall", "lords",
                                  "scotland", "northernireland"),
                           date=NULL, search=NULL, person=NULL, gid=NULL,
                           order=c("d", "r"), page=NULL, num=NULL){
      params <- params_from_call(match.call())
      params$type <- match.arg(type)
      params$order <- match.arg(order)
      do.call("call_api", params)
    }

    The user must specify a legislative body to search with the type argument, and can specify a results ordering with the order argument. The function definition is a good place to put the small number of argument possibilities, not least because they will get picked up by command completion.

    In the code above I process the function’s arguments as usual, but then step in and fix the values of type and order using match.arg in the normal way, before making the call.

    Where did I leave my keys?

    Like most APIs TheyWorkForYou requires a key to use. Here I follow Hadley Wickham’s very useful guidelines (see the links at the end) and store it as an environment variable.

    In twfy there’s an internal function that prompts for a key as necessary

    get_api_key <- function(){
      key <- Sys.getenv("TWFY_API_KEY")
      if (key == ""){
        key <- ask_for_key()
        if (key != ""){
          Sys.setenv(TWFY_API_KEY=key)
          add_key_to_renviron(key) # and set up for next time
        } else
          stop("Hint: you can request an API key from http://theyworkforyou.com/api/key")
      }
      key
    }

    The first time it’s needed, this prompts the user for the key, sets its value in the local environment, and writes a line into the user’s .Renviron file so it’s available in later sessions.
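    For illustration, the .Renviron step could be as simple as the sketch below (an assumption about the internals; the real add_key_to_renviron helper may differ):

    add_key_to_renviron <- function(key){
      renviron <- file.path(Sys.getenv("HOME"), ".Renviron")
      cat(sprintf("TWFY_API_KEY=%s\n", key), file = renviron, append = TRUE)
    }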

    There is a set_api_key, but this is only really needed to reset an existing key.

    Testing with keys

    If you’re a fan of continuous integration, then the next challenge is to set things up in such a way as not to expose the API key in the server logs or hardcode it into the R source. twfy uses Travis, and for Travis the solution is to set the api key as an environment variable in the repository settings.

    By default these variables do not appear in the build logs, and that’s the way we like it.

    The current .travis.yml for twfy looks like

    language: R
    sudo: false
    cache: packages
    before_install:
      - echo "TWFY_API_KEY=${TWFY_API_KEY}" > ~/.Renviron

    Actually I’m not sure whether it’s even necessary to drop Travis’s copy of the API key into the .Renviron to get picked up by the package functions. It’s possible that R picks up local environment variables more reliably on Ubuntu than on my OS X box.

    Still, this works. The package builds and no third parties (or forks) see the API key.

    Further reading

    If you find yourself wrapping an API I’d thoroughly recommend reading

    Somebody always got there first

    It turns out that Jack Blumenau saw TheyWorkForYou’s API and thought the same thing. Great minds, and all that. You can see his take on the problem here. He likes XML quite a bit more than me, apparently.

    In any case, as Chairman Mao once said: “Let a hundred flowers blossom, let a hundred schools of thought contend. And let the winner drive the losers onto a small volcanic island shaped like a sweet potato”. Or something like that.

    To leave a comment for the author, please follow the link and comment on their blog: R – CONJUGATEPRIOR.

    Data Visualization with googleVis exercises part 1

    Sun, 06/04/2017 - 18:00

    (This article was first published on R-exercises, and kindly contributed to R-bloggers)

    Line, Bar and Column Charts

    Hello everybody! This is the first part of an exciting data visualization series that r-exercises.com is going to provide you with.

    Data visualization involves the creation and study of the visual representation of data. The increasing amounts of data created by Internet activity make the “data visualization” tool totally necessary for data scientists.

    One of the best packages that R language provides for this purpose is googleVis and guess what, we are going to see its amazing features together. In this first part we are going to see basic stuff of three useful types of charts (Line Chart, Bar Chart and Column Chart) before “digging deeper” in the next parts.

    NOTE: The charts are created locally by your browser. In case they are not displayed at once press F5 to reload the page.

    Read the examples below to understand the logic of what we are going to do and then test your skills with the exercise set we prepared for you. Let's begin!

    Answers to the exercises are available here.

    Install the Package.

    Exercise 1

    Install and load the package googleVis.

    Create a data frame

    First of all let’s create an experimental data.frame to use for all our plots. This is an example:
    dfr=data.frame(name=c("GRE", "ARG", "BRA"),
    val1=c(20,32,19),
    val2=c(25,52,12))

    Exercise 2

    Create a data frame named “df”. Give as variables the “Pts” (Points) and “Rbs” (Rebounds) of three NBA players. Names and values are up to you.

    Line Chart

    To produce a Line Chart you can use:
    LineC <- gvisLineChart(df)
    plot(LineC)

    Exercise 3

    Create a list named “LineC” and pass to it the “df” data frame you just created as a line chart. HINT: Use gvisLineChart().

    Exercise 4

    Plot the line chart. HINT: Use plot().

    Line chart with two axis

    Below there is an example of this type of Line Chart:
    LineC2 <- gvisLineChart(df, "name", c("val1","val2"),
    options=list(
    series="[{targetAxisIndex: 0},
    {targetAxisIndex:1}]",
    vAxes="[{title:'val1'}, {title:'val2'}]"
    ))
    plot(LineC2)

    As you can see, you create a list (options) that maps val1 and val2 to the two axes respectively.

    Exercise 5

    Create a single axis Line chart that displays only the “Pts” of the “df” data frame.

    Exercise 6

    Now create a two axis line chart that displays both “Pts” and “Rbs” of the “df” data frame. HINT: Use list().

    Bar Chart

    It is quite simple to create a bar chart with googleVis with:
    BarC <- gvisBarChart(df)
    plot(BarC)

    Exercise 7

    Create a list named “BarC” and pass to it the “df” data frame you just created as a bar chart. HINT: Use gvisBarChart().

    Exercise 8

    Plot the bar chart. HINT: Use plot().

    Column Chart

    Column charts could be considered as vertical bar charts and are quite simple to be created too.
    ColumnC <- gvisColumnChart(df)
    plot(ColumnC)

    Exercise 9

    Create a list named “ColumnC” and pass to it the “df” data frame you just created as a column chart. HINT: Use gvisColumnChart().

    Exercise 10

    Plot the column chart. HINT: Use plot().

    Related exercise sets:
    1. Getting started with Plotly: basic Plots
    2. Shiny Applications Layouts Exercises (Part-9)
    3. Customize a scatterplot exercises
    4. Explore all our (>1000) R exercises
    5. Find an R course using our R Course Finder directory
    To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

    RcppArmadillo 0.7.900.2.0

    Sun, 06/04/2017 - 16:21

    (This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

    The new RcppArmadillo release 0.7.900.2.0 is now on CRAN, and the Debian package was just updated as well.

    Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use, with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 350 other packages on CRAN – an increase of 32 since the last CRAN release of 0.7.800.2.0 in April!

    With the 7.900.* series of Armadillo, Conrad has started to more fully utilize OpenMP (also see Wikipedia on OpenMP) for operations that can be parallelized. To use this in your package you need to update its src/Makevars{,.win} file similarly to what the skeleton default now uses

    PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
    PKG_LIBS = $(SHLIB_OPENMP_CFLAGS) $(LAPACK_LIBS) $(BLAS_LIBS) $(FLIBS)

    and you may want to enable C++11 while you are at it—though this may pose issues with older-than-ancient RHEL installations which are still (way too) pervasive so we do not do it by default (yet).
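    If you do opt in to C++11, the standard mechanism is to add a CXX_STD line to the same src/Makevars{,.win} file, e.g. (a one-line sketch; see the WRE manual for details):

    CXX_STD = CXX11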

    Here, we once again rely on the build infrastructure automagically provided by R itself: if and when OpenMP is available, R will use it via $(SHLIB_OPENMP_CXXFLAGS) etc; see the fine WRE manual for details. That said, some operating systems make this harder than others, and macOS usually takes the crown. See for example this blog post by James for surviving in that environment. I am a little short on details because on Linux these things just work, and have for well over a decade. The rcpp-devel mailing list will be the best place for questions.

    Changes in this release relative to the previous CRAN release are as follows:

    Changes in RcppArmadillo version 0.7.900.2.0 (2017-06-02)
    • Upgraded to Armadillo release 7.900.2 (Evil Banana Republic)

      • Expanded clamp() to handle cubes

      • Computationally expensive element-wise functions (such as exp(), log(), cos(), etc) can now be automatically sped up via OpenMP; this requires a C++11/C++14 compiler with OpenMP 3.0+ support for GCC and clang compilers

      • One caveat: when using GCC, use of -march=native in conjunction with -fopenmp may lead to speed regressions on recent processors

    • Added gcc 7 to support compiler check (James Balamuta in #128 addressing #126).

    • A unit test helper function for rmultinom was corrected (#133).

    • OpenMP support was added to the skeleton helper in inline.R

    Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

    This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.


    Linking R to IQFeed with the QuantTools package

    Sun, 06/04/2017 - 09:18

    IQFeed provides streaming data services and trading solutions that cover the Agricultural, Energy and Financial marketplace. It is a well known and recognized data feed provider geared toward retail users and small institutions. The subscription price starts at around $80/month.

Stanislav Kovalevsky has developed a package called QuantTools. It is an all-in-one package designed to enhance quantitative trading modelling. It allows you to download and organize historical market data from multiple sources like Yahoo, Google, Finam, MOEX and IQFeed. The feature that interests me the most is the ability to link IQFeed to R. I’ve been using IQFeed for a few years and I’m happy with it (I’m not affiliated with the company in any way). More information can be found here. I’ve been looking for an integration within R for a while and here it is. As a result, after running a few tests, I moved the code I still had in Python into R. Just for completeness, here’s a link that explains how to download historical data from IQFeed using Python.

QuantTools offers four main functionalities: getting market data, storing/retrieving market data, plotting time series data, and back-testing.

    • Get Market Data

First make sure that IQFeed is open. You can download either daily or intraday data. The code below downloads daily prices (Open, High, Low, Close) for SPY from 1st Jan 2016 to 1st June 2016.

## Generic parameters
library(QuantTools)
from = '2016-01-01'
to = '2016-06-01'
symbol = 'SPY'

## Request data
get_iqfeed_data(symbol, from, to)

The code below downloads intraday data from 1st May 2016 to 3rd May 2016.

## Generic parameters
library(QuantTools)
from = '2016-05-01'
to = '2016-05-03'
symbol = 'SPY'

## Request data
get_iqfeed_data(symbol, from, to, period = 'tick')

    Note the period parameter. It can take any of the following values: tick, 1min, 5min, 10min, 15min, 30min, hour, day, week, month, depending on the frequency you need.
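For instance, swapping in one of those values turns the tick request above into a minute-bar request; this one-liner is my own illustration rather than code from the original post:

## same request as above, but asking for 1-minute bars instead of ticks
get_iqfeed_data(symbol, from, to, period = '1min')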

    • Store/Retrieve Market Data

QuantTools makes the process of managing and storing tick market data easy. You just set up the storage parameters and you are ready to go. The parameters are where to store the data, from which date, and which symbols you would like stored. You can add more symbols at any time, and if they are not present in the storage, QuantTools tries to get their data from the specified start date. The code below will save the data in the following directory: “C:/Users/Arnaud/Documents/Market Data/iqfeed”. There is one sub-folder per instrument and the data is saved in .rds files.

library(QuantTools)
settings = list(
  iqfeed_storage = paste(path.expand('~'), 'Market Data', 'iqfeed', sep = '/'),
  iqfeed_symbols = c('SPY', 'QQQ'),
  iqfeed_storage_from = format(Sys.Date() - 3)
)
QuantTools_settings(settings)

# Update storage with data from last date available until today
store_iqfeed_data()

You can also store data between specific dates. Replace the last line of code above with one of the lines below:

# Update storage with data from last date available until specified date
store_iqfeed_data(to = format(Sys.Date()))

# Update storage with data between from and to dates
store_iqfeed_data(from = format(Sys.Date() - 3), to = format(Sys.Date()))

    Now should you want to get back some of the data you stored, just run something like:

    get_iqfeed_data(symbol = 'SPY', from = '2017-06-01', to = '2017-06-02', period = 'tick', local = TRUE)

Note that only tick data is supported in local storage, so period must be 'tick'.

    • Plot time series data

QuantTools provides the plot_ts function to plot time series data without weekend, holiday and overnight gaps. In the example below, I first retrieve the data stored above, then select the first 100 price observations and finally draw the chart.

## Retrieve previously stored data
spy = get_iqfeed_data(symbol = 'SPY', from = '2017-06-01', to = '2017-06-02',
                      period = 'tick', local = TRUE)

## Select the first 100 rows
spy_price = spy[, .(time, price)][1:100]

## Plot
plot_ts(spy_price)

Two things to notice: first, spy is a data.table object, hence the syntax above (for a quick overview of data.table capabilities, have a look at the excellent cheat sheet from DataCamp). Second, the local parameter is TRUE, as the data is retrieved from internal storage.
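As a small data.table illustration (my own sketch, not from the original post), the stored ticks can be aggregated into one-minute average prices; it assumes the spy object retrieved above, with its time and price columns:

## my sketch: average tick price per minute from the spy object above
library(data.table)
spy_min = spy[, .(avg_price = mean(price)),
              by = .(minute = format(time, '%Y-%m-%d %H:%M'))]
head(spy_min)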

    • Back testing

QuantTools allows you to write your own trading strategy using its C++ API. I’m not going to elaborate on this, as it is essentially C++ code; you can refer to the Examples section on the QuantTools website.

Overall I find the package extremely useful and well documented. The only missing bit is a live feed between R and IQFeed, which would make the package a real end-to-end solution.

As usual, any comments are welcome.


    Quick illustration of Metropolis and Metropolis-in-Gibbs Sampling in R

    Sun, 06/04/2017 - 02:00

    (This article was first published on R – BioStatMatt, and kindly contributed to R-bloggers)

The code below gives a simple implementation of the Metropolis and Metropolis-in-Gibbs sampling algorithms, which are useful for sampling from probability densities whose normalizing constant is difficult to calculate, that are irregular, or that are high-dimensional (the Metropolis-in-Gibbs case).

## Metropolis sampling
## x    - current value of Markov chain (numeric vector)
## targ - target log density function
## prop - function with prototype function(x, ...) that generates
##        a proposal value from a symmetric proposal distribution
library('mvtnorm')
prop_mvnorm <- function(x, ...)
  drop(rmvnorm(1, mean = x, ...))
metropolis <- function(x, targ, prop = prop_mvnorm, ...) {
  xnew <- prop(x)
  lrat <- targ(xnew, ...) - targ(x, ...)
  if(log(runif(1)) < lrat)
    x <- xnew
  return(x)
}

## Metropolis-in-Gibbs sampling
## x    - current value of Markov chain (numeric vector)
## targ - target log density function
## ...  - arguments passed to 'targ'
gibbs <- function(x, targ, ...) {
  for(i in 1:length(x)) {
    ## define full conditional
    targ1 <- function(x1, ...) {
      x[i] <- x1
      targ(x, ...)
    }
    ## sample using Metropolis algorithm
    x[i] <- metropolis(x[i], targ1, ...)
  }
  return(x)
}
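Before moving to the bivariate example, a quick sanity check of the interface (my addition, not part of the original post) is to run a few Metropolis steps against a univariate standard normal:

## quick sanity check (not from the original post): a few Metropolis steps
## targeting a univariate standard normal
targ_norm <- function(x, ...) dnorm(x, log = TRUE)
x <- 0
for(j in 1:5)
  x <- metropolis(x, targ_norm)
x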

The following code produces the figure below to illustrate the two methods using a 'dumbbell' distribution (cf. R package 'ks' vignette).

### The code below illustrates the use of the functions above
## target 'dumbbell' density (cf. R package 'ks' vignette)
library('ks')
mus <- rbind(c(-2,2), c(0,0), c(2,-2))
sigmas <- rbind(diag(2), matrix(c(0.8, -0.72, -0.72, 0.8), nrow=2), diag(2))
cwt <- 3/11
props <- c((1-cwt)/2, cwt, (1-cwt)/2)
targ_test <- function(x, ...) log(dmvnorm.mixt(x, mus=mus, Sigmas=sigmas, props=props))

## covariance matrix referenced in the calls below (not defined above,
## so an identity matrix is assumed here as a placeholder)
vcov_test <- diag(2)

## plot contours of target 'dumbbell' density
set.seed(42)
par(mfrow=c(1,2))
plotmixt(mus=mus, Sigmas=sigmas, props=props, xlim=c(-4,4), ylim=c(-4,4),
         xlab=expression(x[1]), ylab=expression(x[2]), main="Metropolis-in-Gibbs")

## initialize and sample using Metropolis-in-Gibbs
xcur <- gibbs(c(0,0), targ_test, sigma=vcov_test)
for(j in 1:2000) {
  xcur <- gibbs(xcur, targ_test, sigma=vcov_test)
  points(xcur[1], xcur[2], pch=20, col='#00000055')
}

## plot contours of target 'dumbbell' density
plotmixt(mus=mus, Sigmas=sigmas, props=props, xlim=c(-4,4), ylim=c(-4,4),
         xlab=expression(x[1]), ylab=expression(x[2]), main="Metropolis")

## initialize and sample using Metropolis
xcur <- metropolis(c(0,0), targ_test, sigma=vcov_test)
for(j in 1:2000) {
  xcur <- metropolis(xcur, targ_test, sigma=vcov_test)
  points(xcur[1], xcur[2], pch=20, col='#00000055')
}

    The figure illustrates two contrasting properties of the two methods:

    1. Metropolis-in-Gibbs samples can get 'stuck' in certain regions of the support, especially when there are multiple modes and/or significant correlation among the random variables. This is not as much a problem for Metropolis sampling.
    2. Metropolis sampling can produce fewer unique samples, because the proposal density can approximate the target density poorly. This occurs more often for high-dimensional target densities; a rough numerical check is sketched below.
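To put a rough number on the second point, here is a small check of my own (not part of the original post) that stores the draws and compares how often each chain actually moves, reusing targ_test, gibbs and metropolis from above:

## rough check (my addition): proportion of iterations in which each chain moves
n_iter <- 2000
draws_gibbs <- draws_metro <- matrix(NA, n_iter, 2)
xg <- xm <- c(0, 0)
for(j in 1:n_iter) {
  xg <- gibbs(xg, targ_test)
  xm <- metropolis(xm, targ_test)
  draws_gibbs[j, ] <- xg
  draws_metro[j, ] <- xm
}
mean(rowSums(abs(diff(draws_gibbs))) > 0)  ## Metropolis-in-Gibbs
mean(rowSums(abs(diff(draws_metro))) > 0)  ## Metropolis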

To leave a comment for the author, please follow the link and comment on their blog: R – BioStatMatt.


    datazar

    Sun, 06/04/2017 - 00:17

    (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A few weeks ago and then some, I [as occasional blogger!] got contacted by datazar.com to write a piece on this data-sharing platform. I then went and checked what this was all about, having the vague impression this was a platform where I could store and run R code, besides dropping collective projects, but from what I quickly read, it sounds more like being able to run R scripts from one’s machine using data and code stored on datazar.com. But after reading just one more blog entry I finally understood it is also possible to run R, SQL, NotebookJS (and LaTeX) directly on that platform, without downloading code or data to one’s machine. Which makes it a definite plus of this site, as users can experiment with no transfer to their computer. Hence on a larger variety of platforms. While personally I do not [yet?] see how to use it for my research or [limited] teaching, it seems like [yet] another interesting exploration of the positive uses of the Internet to collaborate and communicate on scientific issues! With no opinion on privacy and data protection offered by the site, of course.

    Filed under: R, Statistics, University life Tagged: blogging, data privacy, data science, data-sharing, datazar.com, LaTeX, R, R-bloggers


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.


    Deep Learning Dude pt 1

    Sat, 06/03/2017 - 17:39

    (This article was first published on Rblog – Thinkinator, and kindly contributed to R-bloggers)

    You’ve probably noticed that Deep Learning is all the rage right now. AlphaGo has beaten the world champion at Go, you can google cat photos and be sure you won’t accidentally get photos of canines, and many other near-miraculous feats: all enabled by Deep Learning with neural nets. (I am thinking of coining the phrase “laminar learning” to add some panache to old-school non-deep learning.)

    I do a lot of my work in R, and it turns out that not one but two R packages have recently been released that enable R users to use the famous Python-based deep learning package, Keras: keras and kerasR.

    The first thing you’ll need to do is to make sure you have Python 3 on your system. Windows generally doesn’t have Python at all, while Macs and some Linux systems will include Python 2.7. The way I’d recommend to get Python 3 is to download and install Anaconda. (Few people do data science with a base-installed Python: they mostly use Anaconda, which pre-packages a lot of data science tools along with Python. Trying to install all of those tools yourself is … difficult.)

    The usual place to install Anaconda is in your home directory. Then, on a Mac or Linux system add ~/anaconda/bin: to the beginning of your PATH environment variable. For example, on my system, I edited the .profile file in my home directory and the beginning of the PATH line looks like export PATH=~/anaconda/bin:/usr/local/bin:, so that when I type python the system will use the one provided by Anaconda.

One of the nice things about Anaconda is that it provides an enhanced package management system, conda. This is similar to R's CRAN in some sense. But to install Keras, we'll use Python's pip, which is similar to R's devtools and is run from a command line (not from within Python):

    pip install keras tensorflow pydot graphviz

    should do the trick. The last two packages allow you to plot the model, though the plot is fairly boring.

Of the two packages (keras and kerasR), I've started using kerasR because it has some nice tutorials and it's on CRAN, so that's what I'll use here. (keras is installed via devtools.)

In R, install packages kerasR and reticulate from CRAN. Loading kerasR requires an extra step beyond the usual library call:

    reticulate::use_python ("~/anaconda/bin/python")
    library (kerasR)

The use_python call tells R where to find the Python 3 that you're using, the one into which you also pip-installed keras. If the library call doesn't think for a moment and then report “successfully loaded keras”, something is wrong, and you'll get a bunch of Python error messages that will give you a clue if you examine them carefully. If you forget the use_python call, R will look at the base-installed Python and won't be able to import keras in Python. The things that can go wrong are myriad, and I can't really be more specific, unfortunately.

    If you got this far, you can now follow the kerasR tutorial. At the moment (03 June), there is an error in the first model of the tutorial (Boston Housing) that causes it to perform poorly. I’ve submitted a trouble ticket, and the code I’d recommend is:

    mod <- Sequential ()

    mod$add (Dense (units=50, input_shape=13))
    mod$add (Activation ("relu"))

    mod$add (Dense (units=1))
    # mod$add (Activation ("relu"))

    keras_compile (mod, loss="mse", optimizer=RMSprop())

    boston <- load_boston_housing ()
    X_train <- scale (boston$X_train)
    Y_train <- boston$Y_train
    X_test <- scale (boston$X_test, center=attr (X_train, "scaled:center"), scale=attr (X_train, "scaled:scale"))
    Y_test <- boston$Y_test

    keras_fit (mod, X_train, Y_train, batch_size=32, epochs=200, verbose=1, validation_split=0.1)

    pred <- keras_predict (mod, X_test)
    sd (as.numeric (pred) - Y_test) / sd (Y_test)

    plot (Y_test, pred)
    abline (0, 1)

Keras works a bit differently from the way R usually works, in that mod$add modifies the model mod directly, in place. The code above first creates an empty model (Sequential ()), and the mod$add calls then add a layer with 50 hidden units and a “relu” activation function, followed by the 1-unit output layer.

This is pretty much a simple Hello World model, with 13 input variables and one hidden layer of 50 units. You could have built much the same model in older R neural net packages, but Keras has many advantages; a rough comparison sketch follows.
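For instance, here is roughly what the same 13-50-1 fit might look like with the classic nnet package. This is my own sketch, not part of the original post: it reuses X_train, Y_train, X_test and Y_test from the code above, and nnet's single hidden layer with its own optimizer and weight decay will of course give somewhat different numbers.

## rough nnet analogue of the Keras model above (my sketch, not from the post)
library (nnet)
fit_nnet <- nnet (X_train, Y_train, size = 50, linout = TRUE,
                  decay = 0.1, maxit = 500, trace = FALSE)
pred_nnet <- predict (fit_nnet, X_test)
## same skill measure as above: residual sd relative to the sd of the target
sd (as.numeric (pred_nnet) - Y_test) / sd (Y_test)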

In the tutorial (as of 03 June), the scale calls for the X training and testing data were independent and not linked. In the version above, I scale the training data and then apply the same center and scale to the testing data, just as you would when you deploy a model: training represents the data we already have, while testing represents new data arriving in the future. (This is a pet peeve on my part, and not generally important.)

More importantly, the tutorial also accidentally applied a second normalization to the data in the prediction step, which put the test data on a different scale from the training data. The version above has results that look pretty reasonable:

This example isn’t Deep Learning(tm), and you could’ve done it with older R neural net packages, but it’s just the start of Keras exploration. Follow the kerasR tutorials and the links they recommend. For more details on what the various kerasR functions do, check out the Keras pages. (Remember that kerasR doesn’t implement everything in Keras itself.)

    For a very readable explanation of Deep Learning architectures, first read Neural Network Zoo Prequel, and then Neural Network Zoo, by the Asimov Institute.

One of the advantages of Keras is that it’s built on TensorFlow, which takes full advantage of clusters of machines, GPUs, and all those other things that make Google Google. Unfortunately, by “GPU” we mean Nvidia GPUs (i.e., GPUs that support CUDA). My Mac laptop has an AMD graphics chip, so I can’t use GPUs, though I can still develop things on my laptop and then someday spin things up on Amazon Web Services and use GPU-based instances.

    Filed under: Data Science, R, Rblog, Uncategorized


To leave a comment for the author, please follow the link and comment on their blog: Rblog – Thinkinator.

