RTutor: The effect of the TseTse fly on African Development
(This article was first published on Economics and R – R posts, and kindly contributed to R-bloggers)
What is the causal impact of the TseTse fly on historical and current development in Africa? Marcella Alsan studies this question in her fascinating article ‘The effect of the TseTse fly on African development’ in the American Economic Review, 2015.
As part of her Bachelor's thesis at Ulm University, Vanessa Schöller has created a very nice RTutor problem set that allows you to replicate the main insights of the paper in an interactive fashion.
Here is a screenshot:
Like in previous RTutor problem sets, you can enter free R code in a web-based Shiny app. The code will be automatically checked, and you can get hints on how to proceed. In addition, you are challenged by many multiple-choice quizzes.
To install the problem set locally, first install RTutor as explained here:
https://github.com/skranz/RTutor
and then install the problem set package:
https://github.com/vanessaschoeller/RTutorTseTse
There is also an online version hosted on shinyapps.io that allows you to explore the problem set without any local installation. (The online version is capped at 30 hours of total usage time per month, so it may be greyed out when you click on it.)
https://vanessaschoeller.shinyapps.io/RTutorTseTse/
If you want to learn more about RTutor, try out other problem sets, or create a problem set yourself, take a look at the RTutor GitHub page:
https://github.com/skranz/RTutor
To leave a comment for the author, please follow the link and comment on their blog: Economics and R – R posts. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Deferred & Remote Function Execution in R
(This article was first published on R – Random Remarks, and kindly contributed to R-bloggers)
What would you say if you could automatically wrap your R pipeline – which consists of numerous functions and variables – into a single function? What would you say if you could do it repeatedly, with no extra effort regardless of its complexity? Would you store it to document your progress at that particular moment in time? Would you serialize it and send it to a remote R session to offload computations? Well, it's kind of possible.
defer brings a new capability to R – packaging, as a single R function, a number of interconnected functions and variables that exist in the current R session. It is something I found useful during fast-prototyping phases of my own work and have wanted to share for quite some time now – and at this point I'm interested in seeing whether it turns out to be useful for someone else besides me.
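The underlying idea can be sketched in a few lines of base R: copy a function and its dependencies into a fresh environment, so the result keeps working after the originals are removed. This is only a sketch of the concept, not defer's actual implementation, and make_wrapper is a hypothetical helper name introduced here.

```r
# Base-R sketch (make_wrapper is NOT defer's API): re-home a function and
# its dependencies in a fresh environment whose parent is baseenv().
make_wrapper <- function(fun, deps) {
  env <- new.env(parent = baseenv())
  for (nm in names(deps)) {
    val <- deps[[nm]]
    # dependent functions must also resolve names inside `env`
    if (is.function(val)) environment(val) <- env
    assign(nm, val, envir = env)
  }
  environment(fun) <- env
  fun
}

C <- 10
f <- function(x) x * x + C
h <- function(z) f(z) + 1
wrapper <- make_wrapper(h, list(C = C, f = f))
rm(C, f, h)        # the originals are gone...
wrapper(2)         # ...but the wrapper still works: (4 + 10) + 1
#> [1] 15
```

defer goes much further than this (it discovers the dependencies automatically), but the environment trick above is the essential mechanism.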
Here's a simplistic example where defer is given a top-level function h() that requires two other functions (f() and g()), a variable (C), and a library call (base::mean).
C <- 10
f <- function(x) x*x + C
g <- function(y) (f(y) + 10) * f(y+3)
h <- function(z) mean(c(g(z), g(z+1), g(z+2)))
wrapper <- defer(h)
#> Found functions:
#>   g, f
#> variables:
#>   C
#> library calls:
#>   base::mean
rm(C, f, g, h)
wrapper(10)
#> [1] 29688.67
Here is another example where that same wrapper() can be run in a remote R session (via OpenCPU) without installing or deploying any extra code on the remote server. First, we serialize wrapper as a base64-encoded string.
library(base64enc)
buffer <- rawConnection(raw(0), 'w')
saveRDS(wrapper, buffer)
serialized <- base64enc::base64encode(rawConnectionValue(buffer))
Then, we write that string into a temporary file, together with a few lines of code to deserialize and run it in a remote R session.
local_script_path <- tempfile(fileext = ".R")
cat(paste0("base64 <- '", serialized, "'\n",
           "library(base64enc)\n",
           "bytes <- rawConnection(base64enc::base64decode(base64), 'r')\n",
           "deserialized <- readRDS(bytes)\n",
           "deserialized(10)\n"),
    file = local_script_path)
Finally, we upload the file to the OpenCPU server and run it via base::source.
library(httr)
result <- httr::POST(public_opencpu_url,
                     body = list(file = upload_file(local_script_path)))
content(result, 'text')
#> [1] "$value\n[1] 29688.67\n\n$visible\n[1] TRUE\n\n"
And that looks like a number we've seen already, right?
The package is currently available only from GitHub, as it is far from ready for CRAN.
R Weekly Bulletin Vol – X
This week's R bulletin will cover topics on grouping data using the ntile function, how to open files automatically, and how to format an Excel sheet using R.
We will also cover the choose, sample, runif, and rnorm functions. Hope you like this R weekly bulletin. Enjoy reading!
Shortcut Keys
1. Fold selected chunk – Alt+L
2. Unfold selected chunk – Shift+Alt+L
3. Fold all – Alt+0
The ntile function is part of the dplyr package, and is used for grouping data. The syntax for the function is given by:
ntile(x, n)
Where,
“x” is the vector of values and
“n” is the number of buckets/groups to divide the data into.
Example:
In this example, we first create a data frame from two vectors, one comprising stock symbols and the other their respective prices. We then group the values in the Price column into 2 groups, and the ranks are populated in a new column called "Ntile". In the last line, we select only those values which fall in the 2nd bucket, using the subset function.
library(dplyr)
Ticker = c("PAGEIND", "MRF", "BOSCHLTD", "EICHERMOT", "TIDEWATER")
Price = c(14742, 33922, 24450, 21800, 5519)
data = data.frame(Ticker, Price)
data$Ntile = ntile(data$Price, 2)
print(data)
ranked_data = subset(data, subset = (Ntile == 2))
print(ranked_data)

Automatically open the saved files
If you are saving the output returned upon executing an R script, and also want to open the file after running the code, you can use the shell.exec function. This function opens the specified file using the application specified in the Windows file associations.
A file association associates a file with an application capable of opening that file. More commonly, a file association associates a class of files (usually determined by their filename extension, such as .txt) with a corresponding application (such as a text editor).
The example below illustrates the usage of the function.
shell.exec(filename)
Example:
df = data.frame(Symbols = c("ABAN", "BPCL", "IOC"), Price = c(212, 579, 538))
write.csv(df, "Stocks List.csv")
shell.exec("Stocks List.csv")

Quick format of the Excel sheet for column width
We can format Excel sheets for column width using the commands given below (these functions come from the xlsx package). In the example, the first line loads the Excel workbook specified by the file name. The third and fourth lines use the autoSizeColumn function to adjust the width of the columns specified in "colIndex" for each of the worksheets. The last line saves the workbook after making the formatting changes.
Example:
library(xlsx)
wb = loadWorkbook(file_name)
sheets = getSheets(wb)
autoSizeColumn(sheets[[1]], colIndex = 1:7)
autoSizeColumn(sheets[[2]], colIndex = 1:5)
saveWorkbook(wb, file_name)

Functions Demystified

choose function
The choose function computes the number of combinations nCr. The syntax for the function is given as:
choose(n,r)
where,
n is the number of elements
r is the number of subset elements
nCr = n! / (r! * (n-r)!)
Examples:
choose(5, 2)
[1] 10
choose(2, 1)
[1] 2
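The formula above can be verified directly against the factorial function:

```r
# Check choose(n, r) against the formula nCr = n! / (r! * (n - r)!)
n <- 5; r <- 2
factorial(n) / (factorial(r) * factorial(n - r))
#> [1] 10
choose(n, r) == factorial(n) / (factorial(r) * factorial(n - r))
#> [1] TRUE
```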
sample function
The sample function randomly selects n items from a given vector. By default, the samples are selected without replacement, which means that the function will not select the same item twice. The syntax for the function is given as:
sample(vector, n)
Example: Consider a vector consisting of yearly revenue growth data for a stock. We select 5 years revenue growth at random using the sample function.
Revenue = c(12, 10.5, 11, 9, 10.75, 11.25, 12.1, 10.5, 9.5, 11.45)
sample(Revenue, 5)
[1] 11.45 12.00  9.50 12.10 10.50
Some statistical processes require sampling with replacement; in such cases you can pass replace = TRUE to the sample function.
Example:
x = c(1, 3, 5, 7)
sample(x, 7, replace = TRUE)
[1] 7 1 5 3 7 3 5
runif and rnorm functions
The runif function generates uniform random numbers between 0 and 1. Its argument is the number of random values to be generated.
Example:
# This will generate 7 uniform random numbers between 0 and 1.
runif(7)
[1] 0.6989614 0.5750565 0.6918520 0.3442109 0.5469400 0.7955652 0.5258890
# This will generate 5 uniform random numbers between 2 and 4.
runif(5, min = 2, max = 4)
[1] 2.899836 2.418774 2.906082 3.728974 2.720633
The rnorm function generates random numbers from a normal distribution. The syntax for the function is given as:
rnorm(n, mean, sd)
Example:
# Generates 6 numbers from a normal distribution with a mean of 3
# and a standard deviation of 0.25.
rnorm(6, 3, 0.25)
[1] 3.588193 3.095924 3.240684 3.061176 2.905392 2.891183
Next Step
We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways, methods, and R functions for our readers.
(Linear Algebra) Do not scale your matrix
(This article was first published on Florian Privé, and kindly contributed to R-bloggers)
In this post, I will show you that you generally don't need to explicitly scale a matrix. Maybe you wanted to know more about WHY matrices should be scaled when doing linear algebra. I will recall that at the beginning, but the rest will focus on HOW to avoid explicitly scaling matrices. We will apply our findings to the computation of Principal Component Analysis (PCA), and then to Pearson correlation at the end.
WHY scale matrices?
Generally, if you don't center the columns of a matrix before PCA, you end up with the loadings of PC1 being the column means, which is not of much interest.
n <- 100; m <- 10
a <- matrix(0, n, m); a[] <- rnorm(length(a))
a <- sweep(a, 2, 1:m, '+')
colMeans(a)
##  [1]  0.9969826  1.9922219  3.0894905  3.9680835  4.9433477  6.0007352
##  [7]  6.9524794  8.0898546  8.9506657 10.0354630
pca <- prcomp(a, center = FALSE)
cor(pca$rotation[, 1], colMeans(a))
## [1] 0.9999991
Now, say you have centered the columns of your matrix; do you also need to scale them? That is to say, do they need to have the same norm or standard deviation?
PCA consists in finding an orthogonal basis that maximizes the variation in the data you analyze. So, if one column has much more variation than the others, it will probably end up being PC1, which is not of much interest.
n <- 100; m <- 10
a <- matrix(0, n, m); a[] <- rnorm(length(a))
a[, 1] <- 100 * a[, 1]
apply(a, 2, sd)
##  [1] 90.8562836  0.8949268  0.9514208  1.0437561  0.9995103  1.0475651
##  [7]  0.9531466  0.9651383  0.9804789  0.9605121
pca <- prcomp(a, center = TRUE)
pca$rotation[, 1]
##  [1] 9.999939e-01 6.982439e-04 5.037773e-04 9.553513e-05 8.937869e-04
##  [6] 1.893732e-03 3.053232e-04 1.811950e-03 1.658014e-03 9.937442e-04
Hope I convinced you of WHY it is important to scale matrix columns before doing PCA. I will now show you HOW not to do it explicitly.
Re-implementation of PCA
In this part, I will show you the basic code if you want to re-implement PCA yourself. It will serve as a basis to show how to do linear algebra on a scaled matrix without explicitly scaling it.
# True one
n <- 100; m <- 10
a <- matrix(0, n, m); a[] <- rnorm(length(a))
pca <- prcomp(a, center = TRUE, scale. = TRUE)
# DIY
a.scaled <- scale(a, center = TRUE, scale = TRUE)
K <- crossprod(a.scaled)
K.eigs <- eigen(K, symmetric = TRUE)
v <- K.eigs$vectors
PCs <- a.scaled %*% v
# Verif, recall that PCs can be opposites between runs
plot(v, pca$rotation)
all.equal(sqrt(K.eigs$values), sqrt(n - 1) * pca$sdev)
## [1] TRUE
plot(PCs, pca$x)

Linear algebra behind the previous implementation
Suppose \(m < n\) (\(m\) is the number of columns and \(n\) is the number of rows). Let us denote by \(\tilde{X}\) the scaled matrix. A partial singular value decomposition of \(\tilde{X}\) is \(\tilde{X} \approx U \Delta V^T\) where \(U\) is an \(n \times K\) matrix such that \(U^T U = I_K\), \(\Delta\) is a \(K \times K\) diagonal matrix and \(V\) is an \(m \times K\) matrix such that \(V^T V = I_K\). Taking \(K = m\), you end up with \(\tilde{X} = U \Delta V^T\).
\(U \Delta\) are the scores (PCs) of the PCA and \(V\) are the loadings (rotation coefficients). \(K = \tilde{X}^T \tilde{X} = (U \Delta V^T)^T \cdot U \Delta V^T = V \Delta U^T U \Delta V^T = V \Delta^2 V^T\). So, when doing the eigen decomposition of K, you get \(V\) and \(\Delta^2\) because \(K V = V \Delta^2\). For getting the scores, you then compute \(\tilde{X} V = U \Delta\).
These are exactly the steps implemented above.
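As a standalone sanity check (a small sketch on fresh random data, not part of the original post), these relationships can be verified against R's built-in svd():

```r
set.seed(42)
n <- 100; m <- 10
X.scaled <- scale(matrix(rnorm(n * m), n, m))     # centered and scaled columns
s <- svd(X.scaled)                                # X.scaled = U D V^T
e <- eigen(crossprod(X.scaled), symmetric = TRUE)

# eigenvalues of X^T X are the squared singular values of X
all.equal(e$values, s$d^2)
# the loadings V agree up to column signs
all.equal(abs(e$vectors), abs(s$v))
# the scores: X V equals U D (again up to sign)
all.equal(abs(X.scaled %*% e$vectors), abs(s$u %*% diag(s$d)))
```

All three comparisons should return TRUE up to numerical tolerance, which is exactly why the eigen-decomposition of \(K\) recovers the PCA.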
Implicit scaling of the matrix
Do you know the matrix formulation of column scaling? \(\tilde{X} = C_n X S\) where \(C_n = I_n - \frac{1}{n} 1_n 1_n^T\) is the centering matrix and \(S\) is an \(m \times m\) diagonal matrix with the scaling coefficients (typically, \(S_{j,j} = 1 / \text{sd}_j\)).
# Let's verify
sds <- apply(a, 2, sd)
a.scaled2 <- (diag(n) - tcrossprod(rep(1, n)) / n) %*% a %*% diag(1 / sds)
all.equal(a.scaled2, a.scaled, check.attributes = FALSE)
## [1] TRUE
In our previous implementation, we computed \(\tilde{X}^T \tilde{X}\) and \(\tilde{X} V\). We are going to compute them again, without explicitly scaling the matrix.
Product
Let us begin with something easy: \(\tilde{X} V = C_n X S V = C_n (X (S V))\). So, you can compute \(\tilde{X} V\) without explicitly scaling \(X\). Let us verify:
SV <- v / sds
XSV <- a %*% SV
CXSV <- sweep(XSV, 2, colMeans(XSV), '-')
all.equal(CXSV, PCs)
## [1] TRUE

Self cross-product
A little more tricky: \(\tilde{X}^T \tilde{X} = (C_n X S)^T \cdot C_n X S = S^T X^T C_n X S\) (\(C_n^2 = C_n\) is intuitive because centering an already centered matrix doesn't change it).
\(\tilde{X}^T \tilde{X} = S^T X^T (I_n - \frac{1}{n} 1_n 1_n^T) X S = S^T (X^T X - X^T (\frac{1}{n} 1_n 1_n^T) X) S = S^T (X^T X - \frac{1}{n} s_X s_X^T) S\) where \(s_X\) is the vector of column sums of \(X\).
Let us verify with a rough implementation:
sx <- colSums(a)
K2 <- (crossprod(a) - tcrossprod(sx) / n) / tcrossprod(sds)
all.equal(K, K2)
## [1] TRUE

Conclusion
We have recalled some steps of the computation of Principal Component Analysis on a scaled matrix. We have seen how to compute the different steps of the implementation without having to explicitly scale the matrix. This "implicit" scaling can be quite useful when you manipulate very large matrices, because you neither copy the matrix nor do useless computations. In my next post, I will present my new package that uses this trick to make a lightning-fast partial SVD on very large matrices.
Appendix: application to Pearson correlation
Pearson correlation is merely a self cross-product on a centered and normalized (columns with unit norm) matrix. Let us just implement that with our new trick.
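The identity can first be written in plain base R, without any C++ (cor3_r is an illustrative name introduced here, not from the original post):

```r
# Pearson correlation as a crossproduct on the *unscaled* matrix, using the
# implicit-scaling identity derived above.
cor3_r <- function(mat) {
  n <- nrow(mat)
  sums  <- colSums(mat) / sqrt(n)      # s_X / sqrt(n)
  corr  <- crossprod(mat)              # X^T X
  diags <- sqrt(diag(corr) - sums^2)   # norms of the centered columns
  # (X^T X - s_X s_X^T / n), then divide by the outer product of the norms
  (corr - tcrossprod(sums)) / tcrossprod(diags)
}

set.seed(1)
a <- matrix(rnorm(200 * 5), 200, 5)
all.equal(cor3_r(a), cor(a))
#> [1] TRUE
```

The Rcpp version below does the same shift-and-scale, but in place on the crossproduct, avoiding the extra temporary matrices.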
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix& correlize(NumericMatrix& mat,
                         const NumericVector& shift,
                         const NumericVector& scale) {
  int n = mat.nrow();
  int i, j;
  for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
      // corresponds to "- \frac{1}{n} s_X s_X^T"
      mat(i, j) -= shift(i) * shift(j);
      // corresponds to "S^T (...) S"
      mat(i, j) /= scale(i) * scale(j);
    }
  }
  return mat;
}

cor3 <- function(mat) {
  sums <- colSums(mat) / sqrt(nrow(mat))
  corr <- crossprod(mat)
  diags <- sqrt(diag(corr) - sums^2)
  correlize(corr, shift = sums, scale = diags)
}

a <- matrix(0, 1000, 1000); a[] <- rnorm(length(a))
all.equal(cor3(a), cor(a))
## [1] TRUE

library(microbenchmark)
microbenchmark(
  cor3(a),
  cor(a),
  times = 20
)
## Unit: milliseconds
##     expr       min        lq      mean    median        uq       max neval
##  cor3(a)  39.38138  39.67276  40.68596  40.09893  40.65623  46.46785    20
##   cor(a) 635.74350 637.33605 639.34810 638.09980 639.36110 651.61876    20
Hacking the principles of #openscience #workshops
(This article was first published on R – christopher lortie, and kindly contributed to R-bloggers)
In a previous post, I discussed the key elements that really stood out for me in recent workshops associated with open science, data science, and ecology. Summer workshop season is upon us, and here are some principles that can be used to hack a workshop. These hacks can be applied a priori as an instructor, or in situ as a participant or instructor, by engaging with the context from a pragmatic, problem-solving perspective.
Principles
1. Embrace open pedagogy.
2. Use current best practices from traditional teaching contexts.
3. Be learner-centered.
4. Speak less, do more.
5. Solve authentic challenges.
Hacks (for each principle)
1. Prepare learning outcomes for every lesson.
2. Identify solve-a-problem opportunities in advance and be open to ones that emerge organically during the workshop.
3. Use no slide decks. This challenges the instructor to more directly engage with the students and participants in the workshop and leaves space for students to shape content and narrative to some extent. Decks lock all of us in. This is appropriate for some contexts such as conference presentations, but workshops can be more fluid and open.
4. Plan pauses. Prepare your lessons with gaps for contributions. Prepare a list of questions to offer up for every lesson and provide time for discussion of solutions.
5. Use real evidence/data to answer a compelling question (scale can be limited, approach beta as long as an answer is provided, and the challenge can emerge if teaching is open and space provided for the workshop participants to ideate).
A final hack, which is a more general teaching principle: consider keeping all teaching materials within a single ecosystem that references outwards only as needed. For me, this has become preparing all content in RStudio, knitting it to HTML, then pushing it to GitHub gh-pages for sharing as a webpage (or site). Participants can then engage with all ideas and content, including code and data, in one place.
Teach kids about R with Minecraft
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
As I mentioned earlier this week, I was on a team at the ROpenSci Unconference (with Brooke Anderson, Karl Broman, Gergely Daróczi, and my Microsoft colleagues Mario Inchiosa and Ali Zaidi) to work on a project to interface the R language with Minecraft. The resulting R package, miner, is now available to install from Github. The goal of the package is to introduce budding programmers to the R language via their interest in Minecraft, and to that end there's also a book (R Programming with Minecraft) and associated R package (craft) under development to provide lots of fun examples of manipulating the Minecraft world with R.
Create objects in Minecraft with R functions
If you're a parent, you're probably already aware of the Minecraft phenomenon, but if not: it's kind of like the Lego of the digital generation. Kids (and kids at heart) enter a virtual 3D world composed of cubes representing ground, water, trees and other materials, and use those cubes to build their own structures, which they can then explore with their friends using their in-game avatars. Inspired by the Python-focused book "Learn to Program with Minecraft", Karl had the brilliant idea of building a similar interface around R.
The miner package provides just a few simple functions to manipulate the game world: find or move a player's position; add or remove blocks in the world; send a message to all players in the world. The functions are deliberately simple, designed to be used to build more complex tasks. For example, you could write a function to detect when a player is standing in a hole. It would use the getPlayerIds and getPlayerPos functions to find the IDs and locations of all players in the world, and the getBlocks function to check whether the player is surrounded by blocks that are not code 0 (air).
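A hole-detection check along those lines might look like the sketch below. This is an illustration, not code from the miner package: standing_in_hole and fake_get_block are hypothetical names, and the block-lookup function is passed in as an argument so the logic can be tried without a running Minecraft server (with miner, it would be a thin wrapper around getBlocks()).

```r
# Sketch: a player is "in a hole" if all four blocks around them at foot
# level are solid (i.e. not block code 0, which is air).
standing_in_hole <- function(pos, get_block, AIR = 0) {
  around <- list(c(1, 0), c(-1, 0), c(0, 1), c(0, -1))   # x/z offsets
  blocks <- vapply(around, function(d) {
    get_block(pos[1] + d[1], pos[2], pos[3] + d[2])
  }, numeric(1))
  all(blocks != AIR)                # TRUE if no side is open air
}

# Fake world for testing: solid stone (code 1) below y = 0, air above.
fake_get_block <- function(x, y, z) if (y < 0) 1 else 0

standing_in_hole(c(0, -1, 0), fake_get_block)   # surrounded by stone
#> [1] TRUE
standing_in_hole(c(0,  1, 0), fake_get_block)   # standing in the open
#> [1] FALSE
```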
The package also provides a couple of "listener" functions: you can detect player chat messages, or when players strike a block with their sword. You can then use these to write functions that react to player actions. For example, you can write a chatbot to play a game with the player, create an AI to solve an in-game maze, or give the player the powers of Elsa from Frozen to freeze water by walking on it.
To get started with the miner package, you'll need to purchase a copy of Minecraft for Windows, Mac or Linux if you don't have one already. (Note: the Windows 10 version from the Microsoft Store isn't compatible with the miner package.) This is what you'll use to explore the world managed by the Minecraft server, which you'll also need to install along with the RaspberryJuice plugin. You can find setup details in the miner package vignette. We installed the open-source Spigot server on an Ubuntu VM running in Azure; you might find this Dockerfile helpful if you're trying something similar.
The miner and craft packages are available to install from GitHub at the links below. The packages are a work in progress: comments, suggestions, pull requests, and additional examples for the R Programming with Minecraft book are most welcome!
Github (ROpenSci labs): miner and craft
Weather forecast with regression models – part 1
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
In this tutorial we are going to analyse a weather dataset to produce exploratory analysis and forecast reports based on regression models.
We are going to take advantage of a public dataset which is part of the exercise datasets of the "Data Mining and Business Analytics with R" book (Wiley) written by Johannes Ledolter. In doing so, we will use several R packages, including the caret package.
The tutorial is going to be split in the following four parts:
1. Introducing the weather dataset and outlining its exploratory analysis
2. Building logistic regression models for 9am, 3pm and late evening weather forecasts
3. Tuning to improve the accuracy of the previously built models and showing ROC plots
4. Making considerations on "at-least" moderate rainfall scenarios and building additional models to predict further weather variables
Overall, we are going to take advantage of the following packages:
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(gmodels))
suppressPackageStartupMessages(library(lattice))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(Kmisc))
suppressPackageStartupMessages(library(ROCR))
suppressPackageStartupMessages(library(corrplot))

Exploratory Analysis
The dataset under analysis is publicly available at:
jledolter – datamining – data exercises
It contains daily observations from a single weather station. Let us start by loading the data and taking a look at.
set.seed(1023)
weather_data <- read.csv(url("https://www.biz.uiowa.edu/faculty/jledolter/datamining/weather.csv"),
                         header = TRUE, sep = ",", stringsAsFactors = TRUE)
kable(head(weather_data))
(The rendered table shows the first six daily observations from the Canberra station, covering all 24 columns from Date through RainTomorrow.)
Metrics at a specific Date and Location are given together with the RainTomorrow variable, indicating whether rain occurred on the next day.
colnames(weather_data)
 [1] "Date"          "Location"      "MinTemp"       "MaxTemp"       "Rainfall"
 [6] "Evaporation"   "Sunshine"      "WindGustDir"   "WindGustSpeed" "WindDir9am"
[11] "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[16] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"
[21] "Temp3pm"       "RainToday"     "RISK_MM"       "RainTomorrow"
The description of the variables is the following.
Date: The date of observation (a date object).
Location: The common name of the location of the weather station
MinTemp: The minimum temperature in degrees centigrade
MaxTemp: The maximum temperature in degrees centigrade
Rainfall: The amount of rainfall recorded for the day in millimeters.
Evaporation: Class A pan evaporation (in millimeters) during 24 h
Sunshine: The number of hours of bright sunshine in the day
WindGustDir: The direction of the strongest wind gust in the 24 h to midnight
WindGustSpeed: The speed (in kilometers per hour) of the strongest wind gust in the 24 h to midnight
WindDir9am: The direction of the wind gust at 9 a.m.
WindDir3pm: The direction of the wind gust at 3 p.m.
WindSpeed9am: Wind speed (in kilometers per hour) averaged over 10 min before 9 a.m.
WindSpeed3pm: Wind speed (in kilometers per hour) averaged over 10 min before 3 p.m.
Humidity9am: Relative humidity (in percent) at 9 am
Humidity3pm: Relative humidity (in percent) at 3 pm
Pressure9am: Atmospheric pressure (hpa) reduced to mean sea level at 9 a.m.
Pressure3pm: Atmospheric pressure (hpa) reduced to mean sea level at 3 p.m.
Cloud9am: Fraction of sky obscured by cloud at 9 a.m. This is measured in ”oktas,” which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates completely clear sky, while an 8 indicates that it is completely overcast
Cloud3pm: Fraction of sky obscured by cloud at 3 p.m; see Cloud9am for a description of the values
Temp9am: Temperature (degrees C) at 9 a.m.
Temp3pm: Temperature (degrees C) at 3 p.m.
RainToday: Integer 1 if precipitation (in millimeters) in the 24 h to 9 a.m. exceeds 1 mm, otherwise 0
RISK_MM: The continuous target variable; the amount of rain recorded during the next day
RainTomorrow: The binary target variable whether it rains or not during the next day
Looking then at the data structure, we discover the dataset includes a mix of numerical and categorical variables.
str(weather_data)
'data.frame':	366 obs. of  24 variables:
 $ Date         : Factor w/ 366 levels "1/1/2008","1/10/2008",..: 63 74 85 87 88 89 90 91 92 64 ...
 $ Location     : Factor w/ 1 level "Canberra": 1 1 1 1 1 1 1 1 1 1 ...
 $ MinTemp      : num  8 14 13.7 13.3 7.6 6.2 6.1 8.3 8.8 8.4 ...
 $ MaxTemp      : num  24.3 26.9 23.4 15.5 16.1 16.9 18.2 17 19.5 22.8 ...
 $ Rainfall     : num  0 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 ...
 $ Evaporation  : num  3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6 4 5.4 ...
 $ Sunshine     : num  6.3 9.7 3.3 9.1 10.6 8.2 8.4 4.6 4.1 7.7 ...
 $ WindGustDir  : Factor w/ 16 levels "E","ENE","ESE",..: 8 2 8 8 11 10 10 1 9 1 ...
 $ WindGustSpeed: int  30 39 85 54 50 44 43 41 48 31 ...
 $ WindDir9am   : Factor w/ 16 levels "E","ENE","ESE",..: 13 1 4 15 11 10 10 10 1 9 ...
 $ WindDir3pm   : Factor w/ 16 levels "E","ENE","ESE",..: 8 14 6 14 3 1 3 1 2 3 ...
 $ WindSpeed9am : int  6 4 6 30 20 20 19 11 19 7 ...
 $ WindSpeed3pm : int  20 17 6 24 28 24 26 24 17 6 ...
 $ Humidity9am  : int  68 80 82 62 68 70 63 65 70 82 ...
 $ Humidity3pm  : int  29 36 69 56 49 57 47 57 48 32 ...
 $ Pressure9am  : num  1020 1012 1010 1006 1018 ...
 $ Pressure3pm  : num  1015 1008 1007 1007 1018 ...
 $ Cloud9am     : int  7 5 8 2 7 7 4 6 7 7 ...
 $ Cloud3pm     : int  7 3 7 7 7 5 6 7 7 1 ...
 $ Temp9am      : num  14.4 17.5 15.4 13.5 11.1 10.9 12.4 12.1 14.1 13.3 ...
 $ Temp3pm      : num  23.6 25.7 20.2 14.1 15.4 14.8 17.3 15.5 18.9 21.7 ...
 $ RainToday    : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 1 2 ...
 $ RISK_MM      : num  3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 0 ...
 $ RainTomorrow : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 1 1 2 1 ...
We have available 366 records:
(n <- nrow(weather_data))

[1] 366

which span the following timeline:
c(as.character(weather_data$Date[1]), as.character(weather_data$Date[n]))

[1] "11/1/2007" "10/31/2008"

We further notice the following relation between RISK_MM and the RainTomorrow variable:
all.equal(weather_data$RISK_MM > 1, weather_data$RainTomorrow == "Yes")

[1] TRUE

The Rainfall and RainToday variables are equivalent according to the following relationship (as anticipated by the Rainfall description):
all.equal(weather_data$Rainfall > 1, weather_data$RainToday == "Yes")

[1] TRUE

To make things more challenging, we decide to take RISK_MM, Rainfall and RainToday out (together with Date and Location), and build a new dataset as shown below.
weather_data2 <- subset(weather_data, select = -c(Date, Location, RISK_MM, Rainfall, RainToday))
colnames(weather_data2)

[1] "MinTemp" "MaxTemp" "Evaporation" "Sunshine" "WindGustDir" "WindGustSpeed"
[7] "WindDir9am" "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "Humidity9am" "Humidity3pm"
[13] "Pressure9am" "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am" "Temp3pm"
[19] "RainTomorrow"

(cols_withNa <- apply(weather_data2, 2, function(x) sum(is.na(x))))

MinTemp MaxTemp Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am
0 0 0 3 3 2 31
WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am Pressure3pm
1 7 0 0 0 0 0
Cloud9am Cloud3pm Temp9am Temp3pm RainTomorrow
0 0 0 0 0

Look at the NA counts associated with WindDir9am. If WindDir9am were not a significant predictor for RainTomorrow, we could drop that column and increase the complete-cases count. We will determine whether that is possible during the categorical variable analysis. For now, we consider only records without NA values.
weather_data3 <- weather_data2[complete.cases(weather_data2),]

Categorical Variable Analysis

factor_vars <- names(which(sapply(weather_data3, class) == "factor"))
factor_vars <- setdiff(factor_vars, "RainTomorrow")
chisq_test_res <- lapply(factor_vars, function(x) {
  chisq.test(weather_data3[, x], weather_data3[, "RainTomorrow"], simulate.p.value = TRUE)
})
names(chisq_test_res) <- factor_vars
chisq_test_res

$WindGustDir
Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)
data: weather_data3[, x] and weather_data3[, "RainTomorrow"]
X-squared = 35.833, df = NA, p-value = 0.001999

$WindDir9am
Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)
data: weather_data3[, x] and weather_data3[, "RainTomorrow"]
X-squared = 23.813, df = NA, p-value = 0.06547

$WindDir3pm
Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)
data: weather_data3[, x] and weather_data3[, "RainTomorrow"]
X-squared = 9.403, df = NA, p-value = 0.8636

The Chi-squared p-values shown above tell us that RainTomorrow depends upon WindGustDir (we reject the null hypothesis that RainTomorrow does not depend upon WindGustDir). For WindDir9am and WindDir3pm, by contrast, the p-values are above 0.05, so we cannot reject the null hypothesis. We therefore expect to take advantage of WindGustDir as a predictor, and not to consider WindDir9am and WindDir3pm for that purpose.
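As a side note, the mechanics of such an independence test can be sketched on synthetic data (the wind and rain vectors below are made up, purely to illustrate the chisq.test call; by construction they are independent):

```r
# Illustrative only: two independent synthetic categorical variables.
set.seed(42)
wind <- factor(sample(c("N", "S", "E", "W"), 200, replace = TRUE))
rain <- factor(sample(c("No", "Yes"), 200, replace = TRUE, prob = c(0.7, 0.3)))

# Same call shape as above; simulate.p.value avoids relying on the
# chi-squared approximation when some cells have low counts.
res <- chisq.test(wind, rain, simulate.p.value = TRUE)
res$p.value  # independent by construction, so typically large
```

A small p-value (as for WindGustDir above) is evidence against independence; a large one (as for WindDir3pm) gives no reason to reject it.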
It is also possible to gain a visual understanding of the significance of these categorical variables by means of bar charts built from the cross-table row proportions.
barchart_res <- lapply(factor_vars, function(x) {
  title <- colnames(weather_data3[, x, drop = FALSE])
  wgd <- CrossTable(weather_data3[, x], weather_data3$RainTomorrow, prop.chisq = F)
  barchart(wgd$prop.row, stack = F,
           auto.key = list(rectangles = TRUE, space = "top", title = title))
})
names(barchart_res) <- factor_vars
barchart_res$WindGustDir
barchart_res$WindDir9am
barchart_res$WindDir3pm

Since WindDir9am is not a candidate predictor and has more than 30 NA values, we decide to drop it; as a consequence, the number of NA-free records increases from 328 to 352. The same holds for WindDir3pm.
weather_data4 <- subset(weather_data2, select = -c(WindDir9am, WindDir3pm))
weather_data5 <- weather_data4[complete.cases(weather_data4),]
colnames(weather_data5)

[1] "MinTemp" "MaxTemp" "Evaporation" "Sunshine" "WindGustDir"
[6] "WindGustSpeed" "WindSpeed9am" "WindSpeed3pm" "Humidity9am" "Humidity3pm"
[11] "Pressure9am" "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
[16] "Temp3pm" "RainTomorrow"

So we end up with a dataset of 16 potential predictors: one categorical variable (WindGustDir) and 15 quantitative ones.
Quantitative Variable Analysis

In this section, we carry out the exploratory analysis of the quantitative variables. We start with a visualization of the correlations among them.
factor_vars <- names(which(sapply(weather_data5, class) == "factor"))
numeric_vars <- setdiff(colnames(weather_data5), factor_vars)
numeric_vars <- setdiff(numeric_vars, "RainTomorrow")
numeric_vars
numeric_vars_mat <- as.matrix(weather_data5[, numeric_vars, drop = FALSE])
numeric_vars_cor <- cor(numeric_vars_mat)
corrplot(numeric_vars_cor)

Looking at the correlation plot above, we can state that:
– Temp9am is strongly positively correlated with MinTemp
– Temp9am is strongly positively correlated with MaxTemp
– Temp9am is strongly positively correlated with Temp3pm
– Temp3pm is strongly positively correlated with MaxTemp
– Pressure9am is strongly positively correlated with Pressure3pm
– Humidity3pm is strongly negatively correlated with Sunshine
– Cloud9am is strongly negatively correlated with Sunshine
– Cloud3pm is strongly negatively correlated with Sunshine
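The correlation step itself can be sketched on synthetic data (the variables below are simulated stand-ins for the real weather columns, constructed only to illustrate the cor() call):

```r
# Illustrative only: simulated stand-ins for a few weather columns.
set.seed(1)
temp9am  <- runif(50, 5, 20)
temp3pm  <- temp9am + rnorm(50, 5, 1)       # built to be strongly positively correlated
sunshine <- runif(50, 0, 12)
cloud3pm <- 8 - 0.6 * sunshine + rnorm(50)  # built to be negatively correlated

# Same computation as numeric_vars_cor above, on the toy matrix.
m <- cor(cbind(temp9am, temp3pm, sunshine, cloud3pm))
round(m, 2)
```

In the real tutorial, this matrix is what corrplot() renders; the statements in the list above are simply readings of its largest positive and negative entries.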
The pairs plot highlights whether any relationship among variables is in place, such as a linear relationship.
pairs(weather_data5[, numeric_vars], col = weather_data5$RainTomorrow)

Visual analysis reveals a linear relationship between the following variable pairs:

MinTemp and MaxTemp
MinTemp and Evaporation
MaxTemp and Evaporation
Temp9am and MinTemp
Temp3pm and MaxTemp
Pressure9am and Pressure3pm
Humidity9am and Humidity3pm
WindSpeed9am and WindGustSpeed
WindSpeed3pm and WindGustSpeed
Box plots and density plots give a visual understanding of a predictor's significance: the less the predictor's value distributions overlap across the levels of the variable to be predicted (RainTomorrow), the more useful the predictor.
gp <- invisible(lapply(numeric_vars, function(x) {
  ggplot(data = weather_data5,
         aes(x = RainTomorrow, y = eval(parse(text = x)), col = RainTomorrow)) +
    geom_boxplot() + xlab("RainTomorrow") + ylab(x) + ggtitle("") +
    theme(legend.position = "none")
}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs = lapply(gp[x], ggplotGrob), nrow = 2, ncol = 2)
}))
grob_plots

gp <- invisible(lapply(numeric_vars, function(x) {
  ggplot(data = weather_data5, aes(x = eval(parse(text = x)), col = RainTomorrow)) +
    geom_density() + xlab(x) + ggtitle(paste(x, "density", sep = " "))
}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs = lapply(gp[x], ggplotGrob), nrow = 2, ncol = 2)
}))
grob_plots

From all these plots, we can state that Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm and Sunshine are predictors worth considering.
Finally, we save the dataset currently under analysis for the rest of this tutorial.
write.csv(weather_data5, file = "weather_data5.csv", row.names = FALSE)

Conclusions

Exploratory analysis has highlighted interesting relationships across variables. These preliminary results give us hints for predictor selection and will be used in the rest of this analysis.
If you have any questions, please feel free to comment below.
References
Related Post
 Weighted Linear Support Vector Machine
 Logistic Regression Regularized with Optimization
 Analytical and Numerical Solutions to Linear Regression Problems
 How to create a loop to run multiple regression models
 Regression model with auto correlated errors – Part 3, some astrology
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
The code (and other stuff…)
I’ve received a couple of emails and comments on one of the General Election posts asking me to share the code I’ve used.
In general, I think this is a bit dirty and lots could be done in a more efficient way; effectively, I’m doing this out of my own curiosity, and while I think the model is sensible, it’s probably not “publication-standard” (in terms of annotation etc.).
Anyway, I’ve created a (rather plain) GitHub repository, which contains the basic files (including the R script, R functions, basic data and JAGS model). Given time (which I’m not given…), I’d like to put in a lot more description and perhaps also write a Stan version of the model code. I could also write a more precise model description; I’ll try to update the material on GitHub.
On another note, the previous posts have been syndicated in a couple of places (here and here), which was nice. And finally, here’s a little update with the latest data. As of today, the model predicts the following seat distribution.
mean sd 2.5% median 97.5%
Conservative 352.124 3.8760350 345 352 359
Labour 216.615 3.8041091 211 217 224
UKIP 0.000 0.0000000 0 0 0
Lib Dem 12.084 1.8752228 8 12 16
SNP 49.844 1.8240041 45 51 52
Green 0.000 0.0000000 0 0 0
PCY 1.333 0.9513233 0 2 3
Other 0.000 0.0000000 0 0 0
Labour are still slowly but surely gaining some ground; I’m not sure the effect of the debate earlier this week (which was deserted by the PM) is visible yet, as only a couple of the polls included were conducted after it.
Another interesting thing (following up on this post) is the analysis of the marginal seats that the model predicts will swing away from their 2015 winners. I’ve updated the plot, which now looks as below.
Now there are 30 constituencies that are predicted to change hands, many still towards the Tories. I am not a political scientist, so I don’t really know all the ins and outs of these, but I think a couple of examples are quite interesting and I would venture some comment…
So, the model doesn’t know about the recent by-elections in Copeland and Stoke-on-Trent South, and so it still labels these seats as “Labour” (as they were in 2015), although the Tories now actually control Copeland.
Given the polls and the impact of the EU referendum (both were strong Leave areas, with around 60% and 70% of the vote, respectively), and given that the Tories did well in 2015 (36% vs Labour’s 42% in Copeland, and 33% to Labour’s 39% in Stoke), the model suggests that both are likely to switch to the Tories this time around.
In fact, we know that at the time of the by-election, while Copeland (where the contest was mostly Labour v Tories) did go blue, Stoke didn’t. But there, the main battle was between the Labour and UKIP candidates (UKIP had got 21% in 2015). And the by-election was fought last February, when the Tories’ lead was much more robust than it probably is now.
Another interesting area is Twickenham; historically a constituency leaning to the Lib Dems, it was captured by the Conservatives in 2015. But since then, in another by-election, the Tories have lost another similar area (Richmond Park, with a massive swing), and the model is suggesting that Twickenham could follow suit come next Thursday.
Finally, Clacton was the only seat won by UKIP in 2015, but since then the elected MP (a former Tory turned UKIP) has left the party and is not contesting the seat. This, combined with the poor standing of UKIP in the polls, produces the unsurprising outcome that Clacton is predicted to go blue with basically no uncertainty…
These results look reasonable to me; not sure how life will turn out, of course. As many commentators have noted, much may depend on turnout among younger voters. Or other factors. And probably there’ll be another instance of the “Shy Tory” effect (I’ll think about this if I get some time before the final prediction). But the model does seem to make some sense…
A Shiny App for Exploring Commodities Prices and Economic Indicators, via Quandl
(This article was first published on R Views, and kindly contributed to R-bloggers)
In a previous post, we created an R Notebook to explore the relationship between the copper/gold price ratio and 10year Treasury yields (if you’re curious why we might care about this relationship, have a quick look at that previous post), relying on data from Quandl. Today, we’ll create a Shiny app that lets users choose which different commodities ratios and different economic indicators to investigate. Perhaps users don’t care about Dr. Copper and Treasury yields, but instead want to explore the oil/gold price ratio and how it correlates with the US inflation rate, or the EU inflation rate. Let’s give them some flexibility!
The finished app is available here.
Before we get to it, note a few issues that we have seen in the past.
Very similar to our previous Shiny app, in the code chunk below we have some important decisions about how a user selects a commodity. We could use textInput to allow the user to enter the code for the desired data set, which would not limit the user in any way: he or she could choose any data set on Quandl. The cost would be that the user would need to know, or go to Quandl and look up, the code for any data set.
For example, to import iron ore prices, the user would have to type in ODA/PIORECR_USD. That’s a big problem if your end users are not familiar with and have no interest in data set codes. To emphasize convenience and usability we will go with selectInput instead of textInput, meaning our app will show a dropdown of a few choices. The user just clicks on “Iron Ore” instead of typing ODA/PIORECR_USD, or clicks on “copper” instead of typing CHRIS/CME_HG1.1. But, if a user wants to work with a data set that we haven’t included, said user is out of luck.
Another big decision is how many choices to give the user. I’ve included a few: copper, oil, iron, platinum, palladium and silver. That’s a nice smattering of commodities whose prices tend to rise with a growing economy, plus silver which does not. I included silver so we could take a look at a commodity that should behave differently from the rest. As always, our choice here is driven by how broad this app should be. We could have added steel, lead, zinc, and tin, or we could have included just copper, oil and iron, depending on the use case. Either way, the number of drop downs is another tradeoff between usability and flexibility.
The final decision is a bit more nuanced and requires looking ahead to how these inputs will be used further down in the app. Have a peek at the object called commodityChoices and you might notice that we don’t strictly need that object. We could have put the vector of choices as an argument to selectInput, so that our code would have read choices = c("Copper" = "CHRIS/CME_HG1.1", ...) instead of choices = commodityChoices. In that choice assignment, “Copper” is called the name and “CHRIS/CME_HG1.1” is called the value (together we can think of them as a name-value pair). The reason for building a separate commodityChoices object is that we want the ability to extract either the name or the value of the name-value pair. Usually we would care only about the value, because we want to pass the value to Quandl and import the data, but the name is going to be useful when we label our graphs.
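As a quick sketch of this point (using a shortened, hypothetical version of the choices vector), extracting either side of a name-value pair looks like this:

```r
# A shortened, illustrative version of the commodityChoices vector.
commodityChoices <- c("Copper" = "CHRIS/CME_HG1",
                      "Silver" = "LBMA/SILVER")

chosen_value <- "LBMA/SILVER"  # what selectInput would hand us back

# The value goes to Quandl; the name goes on the chart label.
quandl_code <- unname(commodityChoices[commodityChoices == chosen_value])
nice_label  <- names(commodityChoices[commodityChoices == chosen_value])

quandl_code  # "LBMA/SILVER"
nice_label   # "Silver"
```

This `names(x[x == value])` pattern is exactly what the dySeries() labels use later in the app.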
Without further ado, let’s look at commodityChoices, econIndicatorChoices and the use of selectInput.
# Create a vector of commodity choices.
commodityChoices <- c("Copper" = "CHRIS/CME_HG1",
                      "WTI oil" = "FRED/DCOILWTICO",
                      "Iron Ore" = "ODA/PIORECR_USD", # monthly
                      "Platinum" = "LPPM/PLAT",
                      "Palladium" = "LPPM/PALL",
                      "Silver" = "LBMA/SILVER")

# Make those commodity choices available via selectInput.
selectInput("commodity", "Commodity",
            choices = commodityChoices,
            selected = "Copper")

# Create a vector of economic indicator choices.
econIndicatorChoices <- c("10-Yr Yield" = "FRED/DGS10", # daily
                          "US CPI" = "RATEINF/INFLATION_USA", # monthly
                          "Japan CPI" = "RATEINF/INFLATION_JPN",
                          "EU CPI" = "RATEINF/INFLATION_EUR")

# Make those economic indicator choices available via selectInput.
selectInput("econIndicator", "Economic Indicator",
            choices = econIndicatorChoices,
            selected = "10-Yr Yield")

# A standard date range input.
dateRangeInput("dateRange", "Date range",
               start = "1990-01-01", end = "2016-12-31")

Now that we have the inputs in a sidebar for the user, it’s back to Quandl to import the data for the chosen commodity, gold and the chosen economic indicator. There’s a common date range for all three, so we’ll start by creating start- and end-date objects.
ratio_indicator <- reactive({

  Quandl.api_key("your API key here")

  start_date <- format(input$dateRange[1])
  end_date <- format(input$dateRange[2])

We could now write three separate calls to Quandl, one for each data set, but instead let’s make use of the map() function from the purrr package. If you’re not familiar with purrr, have a look here. I’ll just say that you’ll probably never have to use lapply() again (and that should be motivation enough); in short, the family of map() functions takes a function and applies it to the elements of a vector, similar to the apply() functions.
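For readers unfamiliar with purrr, the pattern map(x, f, ...) corresponds closely to base R's lapply(x, f, ...). A toy stand-in for the Quandl call (fake_fetch is hypothetical, invented purely for illustration) shows the shape of what follows:

```r
# Hypothetical stand-in for Quandl(): just echoes its arguments.
fake_fetch <- function(code, start_date, end_date) {
  paste(code, start_date, end_date, sep = " | ")
}

codes <- c("CHRIS/CME_HG1", "CHRIS/CME_GC1.1", "FRED/DGS10")

# map(codes, fake_fetch, ...) in purrr; the base-R equivalent:
results <- lapply(codes, fake_fetch,
                  start_date = "1990-01-01", end_date = "2016-12-31")

results[[1]]  # "CHRIS/CME_HG1 | 1990-01-01 | 2016-12-31"
```

The extra named arguments are forwarded to each call, exactly as start_date, end_date, collapse and type are forwarded to Quandl() below.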
Before we can use map() though, we need a vector to feed it. Let’s create a vector of Quandl codes.
# Create a vector of 3 data set codes:
# 1) the commodity chosen by the user
# 2) the gold Quandl code
# 3) the economic indicator chosen by the user
gold_code <- "CHRIS/CME_GC1.1"

# Vector of Quandl codes.
data_set_codes <- c(input$commodity, gold_code, input$econIndicator)

Then we’ll apply the Quandl() function by piping our vector of codes through map().
# Pipe the data_set_codes vector to Quandl via the map() function.
# Note we can still set the start and end date and object type,
# as we always can with Quandl.
quandlData <- data_set_codes %>%
  map(Quandl,
      start_date = start_date,
      end_date = end_date,
      collapse = "monthly",
      type = "xts") %>%

Next, we will use map() to apply the na.locf() function to our time series and ensure that no NAs remain.
  # Replace all NAs using map() and na.locf().
  map(na.locf, fromLast = TRUE) %>%

If we stopped here, we would have a list of three xts series; but I don’t want a list, I want one xts object. So, we’ll pipe our list of three and use reduce() with merge() to combine the list of three time series into one xts object.
  # Merge to one xts object using reduce() and merge().
  reduce(merge) %>%
  # Add nicer column names.
  `colnames<-`(c(names(commodityChoices[commodityChoices == input$commodity]),
                 "Gold",
                 names(econIndicatorChoices[econIndicatorChoices == input$econIndicator])))

Alright: after running our Quandl codes through that series of mapped functions, we have three time series stored in one xts object, and now we want to calculate the price ratio of the chosen commodity to gold.
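A note on the na.locf() step in the pipeline above: it comes from the zoo package. Purely to illustrate the last-observation-carried-forward idea, here is a minimal base-R sketch (locf is a made-up helper, not zoo's implementation):

```r
# A base-R sketch of last-observation-carried-forward (forward direction).
locf <- function(x) {
  idx <- cumsum(!is.na(x))            # index of the last non-NA seen so far
  filled <- x[!is.na(x)][pmax(idx, 1)]
  filled[idx == 0] <- x[1]            # leading NAs have nothing to carry; keep NA
  filled
}

locf(c(1, NA, NA, 4, NA))  # 1 1 1 4 4
```

zoo's na.locf() additionally supports fromLast = TRUE to fill in the opposite direction, which is the form the app uses above.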
To create that price ratio, we need to divide the first column by the second column, and we’ll store the result in a new column called ratio.
Then we will save just the ratio and the economic indicator columns into their own xts object. That is not strictly necessary, but it makes things cleaner and easier when we pass the data to dygraph().
  # Create a column and add the price ratio.
  quandlData$ratio <- quandlData[, 1] / quandlData[, 2]

  # Save just the ratio and the economic indicator data.
  ratio_indicator <- merge(quandlData$ratio, quandlData[, 3])

  # Add more general names.
  colnames(ratio_indicator) <- c("ratio", "indicator")

  return(ratio_indicator)
})

Now we just need to pass our reactive object ratio_indicator() to dygraph() and follow the same steps as we did when testing in our Notebook.
We will use dyRoller() to smooth out our chart, making each point an average of the number of periods specified with rollPeriod = X. This won’t affect our xts object, where we store the data; it just makes the chart more readable.
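For intuition, the kind of smoothing dyRoller() applies is essentially a moving average. A minimal base-R sketch with stats::filter (the exact windowing dygraphs uses may differ; this only shows the averaging idea):

```r
# A centered 3-period moving average, as a stand-in for rollPeriod = 3.
x <- c(1, 2, 3, 4, 5, 6)
rolled <- as.numeric(stats::filter(x, rep(1/3, 3), sides = 2))
rolled  # the ends have no full window, so they come back NA
```

Each interior point becomes the mean of itself and its neighbors, which is why a higher rollPeriod makes a noisy series easier to read.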
Remember also that we are charting two time series on the same chart and they are on different scales, so we want to add a right-hand-side y-axis.
To do so, we need to invoke dyAxis() for the left-hand axis, called “y”. Then we invoke dyAxis() for the right-hand axis, called “y2”. We also need to set independentTicks = TRUE so that we can use a unique, independent value scale for the right-hand side. Next, in our dySeries() call for each time series, we assign each one to an axis. Here we assign “ratio” with axis = 'y', so that the commodity/gold price ratio will be on the left-hand scale, and we assign “indicator” with axis = 'y2', so the economic indicator will be on the right-hand scale.
dygraphOutput("ratio_indicator")

output$ratio_indicator <- renderDygraph({
  dygraph(ratio_indicator()) %>%
    # Add the rollPeriod for smoothing.
    dyRoller(rollPeriod = 3) %>%
    # Create two independent axes, just as we did in the Notebook.
    dyAxis("y", label = "USD") %>%
    dyAxis("y2", label = "Percent (%)", independentTicks = TRUE) %>%
    # Assign each time series to an axis.
    # Use the name from the name-value pair to create nice labels for each.
    dySeries("ratio", axis = 'y',
             label = paste(names(commodityChoices[commodityChoices == input$commodity]),
                           "/Gold (LHS)", sep = ""),
             color = "blue") %>%
    dySeries("indicator", axis = 'y2',
             label = paste(names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                           "(RHS)", sep = ""),
             color = "red")
})

We could end things here, but let’s go ahead and add a chart showing the rolling correlation between the ratio and the indicator. We’ve done so much work to calculate and wrangle these time series, we might as well put them to use!
First, we’ll calculate the rolling correlation using the rollapply() function. Nothing too complicated here.
dygraphOutput("rollingCorrelation")

output$rollingCorrelation <- renderDygraph({

  rolling_cor <- rollapply(ratio_indicator(), 24,
                           function(x) cor(x[, 1], x[, 2], use = "pairwise.complete.obs"),
                           by.column = FALSE)

  # Make a nicer name for the xts object that stores the rolling correlation.
  # This name will be displayed when a user hovers on the dygraph.
  names(rolling_cor) <- paste(names(commodityChoices[commodityChoices == input$commodity]),
                              "/Gold ",
                              names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                              " Correlation", sep = "")
  avg <- round(mean(rolling_cor, na.rm = T), 2)
  mini <- round(min(rolling_cor, na.rm = T), 2)
  maxi <- round(max(rolling_cor, na.rm = T), 2)
  dygraph(rolling_cor,
          main = paste(names(commodityChoices[commodityChoices == input$commodity]),
                       "/Gold ",
                       names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                       " Correlation", sep = "")) %>%
    dyRangeSelector(dateWindow = c("2015-01-01", "2016-12-31")) %>%
    # Add a line for the mean, min and max.
    dyLimit(avg, color = 'purple') %>%
    dyLimit(mini, color = 'red') %>%
    dyLimit(maxi, color = 'blue') %>%
    # Add an event for the US election.
    dyEvent("2016-11-08", "Trump!", labelLoc = "bottom")
})
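As a side note, the rolling correlation that rollapply() computes above can be sketched in plain base R on synthetic data (rolling_cor_base is a hypothetical helper, for illustration only; zoo's rollapply handles the windowing and xts bookkeeping for us):

```r
# Base-R sketch of a width-24 rolling correlation between two series.
rolling_cor_base <- function(a, b, width) {
  n <- length(a)
  sapply(width:n, function(i) cor(a[(i - width + 1):i], b[(i - width + 1):i]))
}

set.seed(7)
a <- cumsum(rnorm(60))      # a random-walk "ratio" series
b <- a + rnorm(60, sd = 0.2)  # an "indicator" built to track it closely

rc <- rolling_cor_base(a, b, 24)
range(rc)  # correlations near 1, since b tracks a by construction
```

Each point is the correlation over the trailing 24 observations, which is exactly what the dygraph of rolling_cor displays.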
Things got a little choppy today with all the piping, so just a reminder that if you want the reusable code for this app, it’s available via the source code button at the top right of the live app. Thanks, and see you next time.
To leave a comment for the author, please follow the link and comment on their blog: R Views.
Python and R top 2017 KDnuggets rankings
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
The results of KDnuggets' 18th annual poll of data science software usage are in, and for the first time in three years Python has edged out R as the most popular software. While R increased its share of usage from 45.7% in last year's poll to 52.1% this year, Python's usage among data scientists increased even more, from 36.6% of users in 2016 to 52.6% of users this year.
There were some interesting moves in the long tail as well. Several tools entered the KDnuggets chart for the first time, including Keras (9.5% of users), PyCharm (9.0%) and Microsoft R Server (4.3%). And several returning tools saw big jumps in usage, including Microsoft Cognitive Toolkit (3.4% of users), TensorFlow (20.2%) and Power BI (10.2%). Microsoft SQL Server increased its share to 11.6% (up from 10.8%), whereas SAS (7.1%) and Matlab (7.4%) saw declines. Julia, somewhat surprisingly, remained flat at 1.1%.
For the complete results and analysis of the 2017 KDnuggets data science software poll, follow the link below.
KDnuggets: New Leader, Trends, and Surprises in Analytics, Data Science, Machine Learning Software Poll
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Canada Labour Market: Future Perspectives
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
For anyone looking for job opportunities, it is nice to have an idea of how the job market will perform in the future for your chosen career or industry. Many countries have open data sets that offer this kind of data. In these exercises, we will use R to analyze the future of the Canadian labour market.
Answers to the exercises are available here.
Exercise 1
Download and read into R all data sets from Canadian Occupational Projection System (COPS) – 2015 to 2024 projections.
Exercise 2
Load the tidyr library. Use gather to rearrange any occupation-related data set that presents time-series data into a tidy data format.
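If gather() is new to you: it turns a set of year columns into key-value rows. The same wide-to-long reshaping can be sketched with base R's stats::reshape on toy data (the columns below are invented for illustration, not taken from the COPS files):

```r
# Toy wide-format data: one column per year, as in the COPS spreadsheets.
wide <- data.frame(Occupation = c("Welder", "Nurse"),
                   X2015 = c(100, 200),
                   X2016 = c(110, 210))

# Wide-to-long with base R; gather(wide, year, value, X2015:X2016)
# would produce the same tidy shape.
long <- reshape(wide, direction = "long",
                varying = c("X2015", "X2016"), v.names = "value",
                times = c(2015, 2016), timevar = "year",
                idvar = "Occupation")

long[order(long$Occupation, long$year), c("Occupation", "year", "value")]
```

Either way, each row of the result holds one occupation-year observation, which is the tidy format the later exercises assume.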
Exercise 3
Use gather to rearrange ALL the other occupation-related data sets that present time-series data into a tidy data format, and combine them into a single data frame.
Exercise 4
Remove lines that present NA values, columns in French, and the “X” in front of every year. Take a look at your tidy data set.
 Learn in-depth how to work with dplyr
 Get a full introduction to the data.table package
 And much more
Exercise 5
Let’s do the same with the industries data sets. Start by taking one of the industries data sets that presents data as a time series, and use gather.
Exercise 6
Apply the same procedure as in Exercise 5 to all the other industries data sets, and combine them into a new data frame.
Exercise 7
Remove NAs and the French columns. In addition, set year and value as numeric, and take a look at your new tidy data set about industries.
Exercise 8
Find the industries that have the lowest number of job seekers, and create a new data set by subsetting the previous one.
Exercise 9
Plot the recently created data set using a line for each industry.
Exercise 10
Create a similar plot for the top 5 occupations with the fewest job seekers.
 Reshape 2 Exercises
 As.Date() Exercises
 Data frame exercises Vol. 2
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
In defense of wrapr::let()
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
Saw this the other day:
In defense of wrapr::let() (originally part of replyr, and still re-exported by that package), I would say:
 let() was deliberately designed for a single real-world use case: working with data when you don’t know the column names at the time you are writing the code (i.e., the column names will come later, in a variable). We can rephrase that as: there is deliberately less to learn, as let() is adapted to a need (instead of one having to adapt to let()).
 The R community already has months of experience confirming that let() works reliably in production while interacting with a number of different packages.
 let() will continue to be a very specific, consistent, reliable, and relevant tool even after dplyr 0.6.* is released and the community gains experience with rlang/tidyeval in production.
If rlang/tidyeval is your thing, by all means please use and teach it. But please continue to consider also using wrapr::let(). If you are trying to get something done quickly, or trying to share work with others: a “deeper theory” may not be the best choice.
An example follows.
In “base R” one can write:
d <- data.frame(x = 1:3)

If we know the column name we wish to add to this data frame, we write:
d$newColumn <- 1

The above is “non-parameterized” evaluation, in that the variable name is taken from the source code and not from a variable carrying the information. This is great for ad-hoc analysis, but it would be hard to write functions, scripts and packages if this were the only way we had to manipulate columns.
This isn’t a problem as R supplies an additional parameterized notation as we show here. When we don’t know the name of the column (but expect it to be in a variable at run time) we write:
# code written very far away from us
variableHoldingColumnName <- 'newColumn'

# our code
d[[variableHoldingColumnName]] <- 1

The purpose of wrapr::let() is to allow one to use the non-parameterized form as if it were parameterized. Obviously, this is only useful when there is no convenient parameterized alternative (which is the case for some packages). But for teaching purposes: how would wrapr::let() let us use “$” as if it were parameterized (which we would have to do if [[ ]] and [ ] did not exist)?
With wrapr::let() we can rewrite the “dollar-sign form” as:
wrapr::let(
  c(NEWCOL = variableHoldingColumnName),
  {
    d$NEWCOL <- 1
  }
)

The name “NEWCOL” is a stand-in name that we write all our code in terms of. The expression “c(NEWCOL = variableHoldingColumnName)” says: “NEWCOL is to be interpreted as whatever name is stored in the variable variableHoldingColumnName.” Notice we can’t tell from the code what value is in variableHoldingColumnName, as that won’t be known until execution time. The alias list or vector can be arbitrarily long and built wherever you like (it is just data). The expression block can be arbitrarily large and complex (so you need only pay the mapping-notation cost once per function, not once per line of code).
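To demystify the idea of substituting a stand-in name before evaluation, here is a base-R sketch (this is emphatically not wrapr's actual implementation, just the classic do.call/substitute idiom applied to the same example):

```r
# Base-R sketch of stand-in name substitution; not how wrapr::let() is built.
variableHoldingColumnName <- "newColumn"
d <- data.frame(x = 1:3)

# Substitute NEWCOL with the name stored in variableHoldingColumnName,
# then evaluate the resulting expression.
expr <- do.call(substitute,
                list(quote(d$NEWCOL <- 1),
                     list(NEWCOL = as.name(variableHoldingColumnName))))
eval(expr)

d$newColumn  # 1 1 1
```

The substitution happens on the unevaluated expression, so by the time eval() runs, the code is literally `d$newColumn <- 1`; that is the essence of mapping a stand-in name to a run-time name.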
And that is wrapr::let().
If you don’t need parametric names you don’t need wrapr::let(). If you do need parametric names wrapr::let() is one of the most reliable and easiest to learn ways to get them.
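To round out the idea, here is a small sketch (not from the original post) showing that the alias vector can map several stand-in names at once; the variable names inputName and outputName are hypothetical stand-ins for names that would arrive at run time:

```r
library(wrapr)

d <- data.frame(x = 1:3)
inputName  <- "x"          # assumed to arrive from elsewhere at run time
outputName <- "xDoubled"   # hypothetical output column name

# Code is written against the stand-ins XCOL and NEWCOL;
# let() substitutes the run-time names before evaluating the block.
wrapr::let(
  c(XCOL = inputName, NEWCOL = outputName),
  {
    d$NEWCOL <- 2 * d$XCOL
  }
)
print(d$xDoubled)  # 2 4 6
```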
A nice article from a user recording their experience trying different ways to parameterize their code (including wrapr::let()) can be found here.
If you wish to learn more we have a lot of resources available:
 A vignette: vignette('let', package='wrapr').
 The method help: help('let', package='wrapr')
 Examples: Using replyr::let to Parameterize dplyr Expressions.
 A video lecture: My recent BARUG talk: Parametric Programming in R with replyr.
 Some notes: Parametric Programming in R.
 The package introduction: The wrapr introduction.
 Comparison to the soon-to-be-deprecated SE underbar/underscore methods: Comparative examples using replyr::let.
And many more examples on our blog.
To leave a comment for the author, please follow the link and comment on their blog: R – WinVector Blog.
A Primer in Functional Programming in R (Part 2)
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
In the last set of exercises, we saw how powerful functional programming principles can be, how dramatically they can increase the readability of your code, and how easily you can work with them. In this set of exercises we will look at functional programming principles with purrr. purrr comes with a number of interesting features and is really useful for writing clean and concise code. Please check the documentation and load the purrr library in your R session before starting this exercise set.
Answers to the exercises are available here
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
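Before starting, a quick warm-up illustration of the map family (this is not an answer to any exercise below, and uses mtcars rather than the exercise datasets):

```r
library(purrr)

# map() returns a list; the typed variants return atomic vectors.
map_dbl(mtcars, mean)                      # named numeric vector: mean of each column
map_chr(1:3, ~ paste0("item_", .x))        # "item_1" "item_2" "item_3"
map2_chr(c("a", "b"), c("x", "y"), paste)  # parallel iteration: "a x" "b y"
```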
Exercise 1
From the airquality dataset (available in base R), find the mean, median, and standard deviation of all columns using map functions.
Exercise 2
In the same dataset, find the 95th percentile of each column, excluding the NA values.
Exercise 3
Load the iris dataset. With the help of pipe and map functions, find the mean of the relevant columns. Keep in mind that mean is meant for numeric columns, so you may need multiple map-like functions. The output should be a data frame.
Exercise 4
I have a vector x <- 1:20. I want to multiply every odd element by 2 and every even element by 4. Find a solution using purrr.
Exercise 5
I have a sequence x <- 1:100. I want to increase the 1st, 11th, 21st, …, 91st elements by 1. How can I achieve that?
Exercise 6
Suppose I have two character vectors
x <- letters  # letters is the vector of lowercase letters available in R
y <- LETTERS  # LETTERS is the vector of uppercase letters available in R
How do I join each capital letter with the small letter so that the end result looks like this:
"A,a" "B,b" … and so on
Exercise 7
The previous exercise gave you an intuition of how to work on two vectors in parallel. Now write a function that accepts a list of character vectors of the same size and joins them like the previous exercise. So if I have a list like x <- list(c("a","b","c"), c("d","e","f"), c("g","h","i")), it should give me the output:
[1] "a d g" "b e h" "c f i"
Exercise 8
Using a functional tool from purrr, reverse the letters so that the output is a string of 26 characters starting from z and ending at a,
like “z y x w v u t s r q p o n m l k j i h g f e d c b a”.
Exercise 9
An exercise set on purrr wouldn’t be complete if we didn’t mention its “Filter”-like functions. Take a numeric vector 1:100, keep all the numbers that are divisible by 2, and then remove the ones that are divisible by 5.
Exercise 10
The ability to create partial functions is a powerful tool in functional programming. It allows functions to behave a little like data structures. Consider creating a partial function whenever you see yourself repeating a function argument. This may not sound very useful in the context of this exercise, but I am sure you will find it very useful in your R gigs. Now create a partial function
ql, which finds the 25th and 50th percentiles of a column from a data frame. Use it to find the same from the airquality dataset. Don’t use quantile twice in this exercise.
 Higher Order Functions Exercises
 A Primer in Functional Programming in R (Part – 1)
 Building Shiny App exercises part 4
 Explore all our (>1000) R exercises
 Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
A tidy model pipeline with twidlr and broom
(This article was first published on blogR, and kindly contributed to R-bloggers)
@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.
The problem

Different model functions take different types of inputs (data.frames, matrices, etc.) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:
Thus, different models may need very different code.
However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.
Two-step modelling

To understand the solution, think of the problem as a two-step process, depicted in this Figure:
Step 1: from data to fitted model

Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!
To demonstrate:
#devtools::install_github("drsimonj/twidlr")  # To install
library(twidlr)
lm(mtcars, hp ~ .)
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat
#>      79.048        2.063        8.204        0.439        4.619
#>          wt         qsec           vs           am         gear
#>      27.660        1.784       25.813        9.486        7.216
#>        carb
#>      18.749

This means we can pipe data.frames into any model function exposed by twidlr. For example:
library(dplyr)
mtcars %>% lm(hp ~ .)
#>
#> Call:
#> stats::lm(formula = formula, data = data)
#>
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat
#>      79.048        2.063        8.204        0.439        4.619
#>          wt         qsec           vs           am         gear
#>      27.660        1.784       25.813        9.486        7.216
#>        carb
#>      18.749

Step 2: fitted model to tidy results

Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: glance, tidy, and augment! To demonstrate:
#install.packages("broom")  # To install
library(broom)
fit <- mtcars %>% lm(hp ~ .)

glance(fit)
#>   r.squared adj.r.squared    sigma statistic     p.value df   logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

tidy(fit)
#>           term    estimate   std.error statistic     p.value
#> 1  (Intercept)  79.0483879 184.5040756 0.4284371 0.672695339
#> 2          mpg   2.0630545   2.0905650 0.9868407 0.334955314
#> 3          cyl   8.2037204  10.0861425 0.8133655 0.425134929
#> 4         disp   0.4390024   0.1492007 2.9423609 0.007779725
#> 5         drat   4.6185488  16.0829171 0.2871711 0.776795845
#> 6           wt  27.6600472  19.2703681 1.4353668 0.165910518
#> 7         qsec   1.7843654   7.3639133 0.2423121 0.810889101
#> 8           vs  25.8128774  19.8512410 1.3003156 0.207583411
#> 9           am   9.4862914  20.7599371 0.4569518 0.652397317
#> 10        gear   7.2164047  14.6160152 0.4937327 0.626619355
#> 11        carb  18.7486691   7.0287674 2.6674192 0.014412403

augment(fit) %>% head()
#>           .rownames  hp  mpg cyl disp drat    wt  qsec vs am gear carb
#> 1         Mazda RX4 110 21.0   6  160 3.90 2.620 16.46  0  1    4    4
#> 2     Mazda RX4 Wag 110 21.0   6  160 3.90 2.875 17.02  0  1    4    4
#> 3        Datsun 710  93 22.8   4  108 3.85 2.320 18.61  1  1    4    1
#> 4    Hornet 4 Drive 110 21.4   6  258 3.08 3.215 19.44  1  0    3    1
#> 5 Hornet Sportabout 175 18.7   8  360 3.15 3.440 17.02  0  0    3    2
#> 6           Valiant 105 18.1   6  225 2.76 3.460 20.22  1  0    3    1
#>     .fitted    .resid      .hat   .sigma     .cooksd .std.resid
#> 1 148.68122 38.681220 0.2142214 24.75946 0.069964902  1.6801773
#> 2 140.62866 30.628664 0.2323739 25.43881 0.049861042  1.3460408
#> 3  79.99158 13.008418 0.3075987 26.38216 0.014633059  0.6019364
#> 4 125.75448 15.754483 0.2103960 26.31579 0.011288712  0.6826601
#> 5 183.21756  8.217565 0.2016137 26.53317 0.002878707  0.3541128
#> 6 111.38490  6.384902 0.3147448 26.55680 0.003682813  0.2969840

A single, tidy pipeline

So twidlr and broom functions can be combined into a single, tidy pipeline to go from data.frame to tidy output:
library(twidlr)
library(broom)

mtcars %>%
  lm(hp ~ .) %>%
  glance()
#>   r.squared adj.r.squared    sigma statistic     p.value df   logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

Any model included in twidlr and broom can be used in this same way. Here’s a kmeans example:
iris %>%
  select(-Species) %>%
  kmeans(centers = 3) %>%
  tidy()
#>         x1       x2       x3       x4 size withinss cluster
#> 1 5.901613 2.748387 4.393548 1.433871   62 39.82097       1
#> 2 5.006000 3.428000 1.462000 0.246000   50 15.15100       2
#> 3 6.850000 3.073684 5.742105 2.071053   38 23.87947       3

And a ridge regression with cross-fold validation example:
mtcars %>%
  cv.glmnet(am ~ ., alpha = 0) %>%
  glance()
#>   lambda.min lambda.1se
#> 1  0.2284167  0.8402035

So next time you want to do some tidy modelling, keep this pipeline in mind:
Limitations

Currently, the major limitation of this approach is that a model must be covered by both twidlr and broom. For example, you can’t use randomForest in this pipeline because, although twidlr exposes a data.frame-friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.
Sign off

Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.
To leave a comment for the author, please follow the link and comment on their blog: blogR.
Correcting bias in meta-analyses: What not to do (meta-showdown Part 1)
(This article was first published on R – Nicebread, and kindly contributed to R-bloggers)
tl;dr: Publication bias and p-hacking can dramatically inflate effect size estimates in meta-analyses. Many methods have been proposed to correct for such bias and to estimate the underlying true effect. In a large simulation study, we found out which methods do not work well under which conditions, and give recommendations on what not to use. Estimated reading time: 7 min.

It is well known that publication bias and p-hacking inflate effect size estimates from meta-analyses. In recent years, methodologists have developed an ever-growing menu of statistical approaches to correct for such overestimation. However, to date it was unclear under which conditions they perform well, and what to do if they disagree. Born out of a Twitter discussion, Evan Carter, Joe Hilgard, Will Gervais and I did a large simulation project in which we compared the performance of naive random effects meta-analysis (RE), trim-and-fill (TF), p-curve, p-uniform, PET, PEESE, PET-PEESE, and the three-parameter selection model (3PSM). Previous investigations typically looked only at publication bias or questionable research practices (QRPs), but not both; used non-representative study-level sample sizes; or compared only a few bias-correcting techniques, but not all of them. Our goal was to simulate a research literature that is as realistic as possible for psychology. In order to simulate several research environments, we fully crossed five experimental factors: (1) the true underlying effect, δ (0, 0.2, 0.5, 0.8); (2) between-study heterogeneity, τ (0, 0.2, 0.4); (3) the number of studies in the meta-analytic sample, k (10, 30, 60, 100); (4) the percentage of studies in the meta-analytic sample produced under publication bias (0%, 60%, 90%); and (5) the use of QRPs in the literature that produced the meta-analytic sample (none, medium, high).
This blog post summarizes some insights from our study, internally called “meta-showdown”. Check out the preprint and the interactive app metaExplorer. The fully reproducible and reusable simulation code is on GitHub, and more information is on OSF.
In this blog post, I will highlight some lessons that we learned during the project, primarily focusing on what not to do when performing a meta-analysis.
Limitation-of-generalizability disclaimer: these recommendations apply to typical sample sizes, effect sizes, and heterogeneities in psychology; other research literatures might have different settings and therefore a different performance of the methods. Furthermore, the recommendations rely on the modeling assumptions of our simulation. We went a long way to make these as realistic as possible, but other assumptions could lead to other results.
Never trust a naive random effects meta-analysis or trim-and-fill (unless you meta-analyze a set of registered reports)

If studies have no publication bias, nothing can beat plain old random effects meta-analysis: it has the highest power, the least bias, and the highest efficiency compared to all other methods. Even in the presence of some (though not extreme) QRPs, naive RE performs better than all other methods. When can we expect no publication bias? If (and, in my opinion, only if) we meta-analyze a set of registered reports.
But.
In any setting other than registered reports, a consequential amount of publication bias must be expected. In the field of psychology/psychiatry, more than 90% of all published hypothesis tests are significant (Fanelli, 2011) despite the average power being estimated at around 35% (Bakker, van Dijk, & Wicherts, 2012) – the gap points towards a huge publication bias. In the presence of publication bias, naive random effects meta-analysis and trim-and-fill have false positive rates approaching 100%:
More thoughts about trim-and-fill’s inability to recover δ = 0 are in Joe Hilgard’s blog post. (Note: this insight is not really new and has been shown multiple times before, for example by Moreno et al., 2009, and Simonsohn, Nelson, and Simmons, 2014.)
Our recommendation: Never trust meta-analyses based on naive random effects or trim-and-fill, unless you can rule out publication bias. Results from previously published meta-analyses based on these methods should be treated with a lot of skepticism.
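For reference, the “naive” random-effects model discussed above can be fit in a few lines with the metafor package (this is an illustration on metafor’s built-in BCG example data, not the authors’ simulation code):

```r
library(metafor)

# escalc() computes effect sizes (here log relative risks) and their
# sampling variances from the 2x2 counts in the dat.bcg example dataset.
dat <- escalc(measure = "RR", ai = tpos, bi = tneg,
              ci = cpos, di = cneg, data = dat.bcg)

# Naive random-effects meta-analysis (REML by default) -- the model that,
# per the post, should not be trusted when publication bias is plausible.
res <- rma(yi, vi, data = dat)
summary(res)  # pooled estimate, tau^2, confidence interval, tests
```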
Do not use p-curve and p-uniform for effect size estimation (under heterogeneity)
As a default, heterogeneity should always be expected – even under the most controlled conditions, where many labs perform the same computer-administered experiment, a large proportion of studies showed significant and substantial amounts of between-study heterogeneity (cf. ManyLabs 1 and 3; see also our supplementary document for more details). p-curve and p-uniform assume homogeneous effect sizes, and their performance is impacted to a large extent by heterogeneity:
As you can see, all other methods retain the nominal false positive rate, but p-curve and p-uniform go through the roof as soon as heterogeneity comes into play (see also McShane, Böckenholt, & Hansen, 2016; van Aert et al., 2016).
Under H1, heterogeneity leads to overestimation of the true effect:
(additional settings for these plots: no QRPs, no publication bias, k = 100 studies, true effect size = 0.5)
Note that in their presentation of p-curve, Simonsohn et al. (2014) emphasize that, in the presence of heterogeneity, p-curve is intended as an estimate of the average true effect size among the studies submitted to p-curve (see here, Supplement 2). p-curve may indeed yield an accurate estimate of the true effect size among the significant studies, but in our view, the goal of bias correction in meta-analysis is to estimate the average effect of all conducted studies. Of course this latter estimation hinges on modeling assumptions (e.g., that the effects are normally distributed), which can be disputed, and there might be applications where the underlying true effect of all significant studies is indeed more interesting.
Furthermore, as McShane et al. (2016) demonstrate, p-curve and p-uniform are constrained versions of the more general three-parameter selection model (3PSM; Iyengar & Greenhouse, 1988). The 3PSM estimates (a) the mean of the true effect, δ; (b) the heterogeneity, τ; and (c) the probability that a non-significant result enters the literature, p. The constraints of p-curve and p-uniform are: 100% publication bias (i.e., p = 0) and homogeneity (i.e., τ = 0). Hence, for the estimation of effect sizes, 3PSM seems to be a good replacement for p-curve and p-uniform, as it makes these constraints testable.
Our recommendation: Do not use p-curve or p-uniform for effect size estimation when heterogeneity can be expected (which is nearly always the case).
Ignore over-adjustments in the opposite direction

Many bias-correcting methods are driven by QRPs – the more QRPs, the stronger the downward correction. However, this effect can get so strong that methods over-adjust into the opposite direction, even if all studies in the meta-analysis have the same sign:
Note: You need to set the option “Keep negative estimates” to get this plot.
Our recommendation: Ignore biascorrected results that go into the opposite direction; set the estimate to zero, do not reject H₀.
Typical small-study effects (e.g., from p-hacking or publication bias) induce a negative correlation between sample size and effect size – the smaller the sample, the larger the observed effect size. PET-PEESE aims to correct for that relationship. In the absence of bias and QRPs, however, random fluctuations can lead to a positive correlation between sample size and effect size, which leads to a PET and PEESE slope of the unintended sign. Without publication bias, this reversal of the slope actually happens quite often.
See for example the next figure. The true effect size is zero (red dot), naive random effects meta-analysis slightly overestimates the true effect (see black dotted triangle), but PET and PEESE massively over-adjust towards more positive effects:
PET-PEESE was never intended to correct in the reverse direction. An underlying biasing process would have to systematically remove small studies that show a significant result with larger effect sizes, and keep small studies with non-significant results. In the current incentive structure, I see no reason for such a process.
Our recommendation: Ignore the PET-PEESE correction if it has the wrong sign.
PET-PEESE sometimes overestimates, sometimes underestimates
A bias can be more easily accepted if it is always conservative – then one could conclude: “This method might miss some true effects, but if it indicates an effect, we can be quite confident that it really exists.” Depending on the conditions (i.e., how much publication bias, how many QRPs, etc.), however, PET/PEESE sometimes shows huge overestimation and sometimes huge underestimation.
For example, with no publication bias, some heterogeneity (τ=0.2), and severe QRPs, PET/PEESE underestimates the true effect of δ = 0.5:
In contrast, if no effect exists in reality, but there is strong publication bias, large heterogeneity, and no QRPs, PET/PEESE overestimates a lot:
In fact, the distribution of PET/PEESE estimates looks virtually identical for these two examples, although the underlying true effect is δ = 0.5 in the upper plot and δ = 0 in the lower plot. Furthermore, note the huge spread of PET/PEESE estimates (the error bars visualize the 95% quantiles of all simulated replications): Any single PET/PEESE estimate can be very far off.
Our recommendation: As one cannot know the condition of reality, it is safest not to use PET/PEESE at all.
Recommendations in a nutshell: What you should not use in a meta-analysis
Again, please consider the “limitations of generalizability” disclaimer above.
 When you can exclude publication bias (i.e., in the context of registered reports), do not use bias-correcting techniques. Even in the presence of some QRPs they perform worse than plain random effects meta-analysis.
 In any other setting except registered reports, expect publication bias, and do not use random effects meta-analysis or trim-and-fill. Both will give you a false positive rate approaching 100% in typical settings, and a biased estimate.
 Under heterogeneity, p-curve and p-uniform overestimate the underlying effect and have false positive rates ≥ 50%.
 Even if all studies entering a meta-analysis point in the same direction (e.g., all are positive), bias-correcting techniques sometimes over-adjust and return a significant estimate of the opposite direction. Ignore these results, set the estimate to zero, and do not reject H₀.
 Sometimes PET/PEESE adjusts in the wrong direction (i.e., increasing the estimated true effect size).
As with any general recommendations, there might be good reasons to ignore them.
Additional technical recommendations

 In rare cases, the p-uniform package (v. 0.0.2) does not provide a lower CI. In this case, ignore the estimate.
 Do not run p-curve or p-uniform on ≤ 3 significant and directionally consistent studies. Although computationally possible, this gives hugely variable results, which are often very biased. See our supplemental material for more information and plots.
 If the 3PSM method (in the implementation of McShane et al., 2016) returns an incomplete covariance matrix, ignore the result (even if a point estimate is provided).
Now you probably ask: But what should I use? Read our preprint for an answer!
The post Correcting bias in meta-analyses: What not to do (meta-showdown Part 1) appeared first on Nicebread.
To leave a comment for the author, please follow the link and comment on their blog: R – Nicebread.
Simple bash script for a fresh install of R and its dependencies in Linux
(This article was first published on R and Finance, and kindly contributed to R-bloggers)
I’ve been working with Linux for some time, but always in a dual-boot setup. While I used Linux Mint at home, my university computer always had Windows. Most of the time it was not a problem to work in one or the other and share files between computers with Dropbox.
After getting constantly frustrated with Windows updates, crashes, and slowdowns, I decided to make the full switch. All my computers are now Linux based. I am very happy with my choice and regret not making the switch earlier.
Since I formatted all three of my computers (home/laptop/work), I wrote a small bash file to automate the process of installing R and its dependencies. I use lots of R packages on a daily basis. For some of them, it is required to install dependencies using the terminal. Each time an install.packages() call failed, I saved the name of the required software and added it to the bash file. While my bash file will not cover all dependencies for all packages, it will suffice for a great proportion.
The steps for using it are:
1) Copy the contents of the code presented next into a file called InstallR.sh. Do notice that the name of the release, in this case trusty, is added to the CRAN address. Try not to run the bash file twice, or it will append the CRAN address more than once in /etc/apt/sources.list.
#!/bin/bash
# Adds R to apt and installs it
#
# Instructions:
# sudo chmod 700 InstallR.sh
# sudo ./InstallR.sh

sudo echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | sudo tee -a /etc/apt/sources.list
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y r-base r-base-dev r-cran-xml r-cran-rjava libcurl4-openssl-dev
sudo apt-get install -y libssl-dev libxml2-dev openjdk-7* libgdal-dev libproj-dev libgsl-dev
sudo apt-get install -y xml2 default-jre default-jdk mesa-common-dev libglu1-mesa-dev freeglut3-dev
sudo apt-get install -y mesa-common-dev libx11-dev r-cran-rgl r-cran-rglpk r-cran-rsymphony r-cran-plyr
sudo apt-get install -y r-cran-reshape r-cran-reshape2 r-cran-rmysql
sudo R CMD javareconf

2) Change the permissions of the file so that it can be executed:
sudo chmod 700 InstallR.sh

3) Execute it as sudo:
sudo ./InstallR.sh

That’s it. I hope others will find this simple bash script useful as well.
To leave a comment for the author, please follow the link and comment on their blog: R and Finance.
A Partial Remedy to the Reproducibility Problem
(This article was first published on Mad (Data) Scientist, and kindly contributed to R-bloggers)
Several years ago, John Ioannidis jolted the scientific establishment with an article titled “Why Most Published Research Findings Are False.” He had concerns about inattention to statistical power, multiple inference issues, and so on. Most people had already been aware of all this, of course, but that conversation opened the floodgates, and many more issues were brought up, such as hidden lab-to-lab variability. In addition, there is the occasional revelation of outright fraud.
Many consider the field to be at a crisis point.
At the 2014 JSM, Phil Stark organized a last-minute session on the issue, including Marcia McNutt, former editor of Science, and Yoav Benjamini of multiple inference methodology fame. The session attracted a standing-room-only crowd.
In this post, Reed Davis and I are releasing the prototype of an R package that we are writing, revisit, with the goal of partially remedying the statistical and data wrangling aspects of this problem. It is assumed that the authors of a study have supplied (possibly via carrots or sticks) not only the data but also the complete code for their analyses, from data cleaning up through formal statistical analysis.
There are two main aspects:
 The package allows the user to “replay” the authors’ analysis, and most importantly, explore other alternate analyses that the authors may have overlooked. The various alternate analyses may be saved for sharing.
 It warns of statistical errors, such as over-reliance on p-values, the need for multiple inference procedures, possible distortion due to outliers, etc.
The term user here could refer to several different situations:
 The various authors of a study, collaborating and trying different analyses during the course of the study.
 Reviewers of a paper submitted for publication on the results of the study.
 Fellow scientists who wish to delve further into the study after it is published.
The package has text and GUI versions. The latter is currently implemented as an RStudio addin.
The package is on my GitHub site, and has a fairly extensive README file introducing the goals and usage.
To leave a comment for the author, please follow the link and comment on their blog: Mad (Data) Scientist.
U.S. Residential Energy Use: Machine Learning on the RECS Dataset
(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)
Contributed by Thomas Kassel. He is currently enrolled in the NYC Data Science Academy remote bootcamp program taking place from January to May 2017. This post is based on his final capstone project, focusing on the use of machine learning techniques learned throughout the course.
Introduction
The residential sector accounts for up to 40% of annual U.S. electricity consumption, representing a large opportunity for energy efficiency and conservation. A strong understanding of the main electricity enduses in residences can allow homeowners to make more informed decisions to lower their energy bills, help utilities maximize efficiency/incentive programs, and allow governments or NGOs to better forecast energy demand and address climate concerns.
The Residential Energy Consumption Survey (RECS) collects energy-related data on a nationally representative sample of U.S. homes. First conducted in 1978 and administered every 4-5 years by the Energy Information Administration, it is a leading data source for residential energy analysis. Methodology for the study is well-documented – a comprehensive overview of RECS can be found in the program’s technical documentation.
This project applied machine learning methods to the most recently available RECS dataset, published in 2009. The primary goals were twofold, as is common in many regression exercises:
 Description – identification of variable importance or features with the strongest statistical associations to annual residential electricity usage (kWh)
 Prediction – feature engineering and use of machine learning platform h2o to optimize predictive accuracy over annual electricity usage
EDA and Data PreProcessing
RECS 2009 consists of 12,083 observations (each representing a unique housing unit) and over 900 features encompassing physical building characteristics, appliances, occupant behavior/usage patterns, occupant demographics, etc. These features serve as independent variables in predicting the outcome variable, annual electricity usage in kilowatt-hours (kWh). Because RECS aims to cover a comprehensive overview of many types of residences for a variety of analytical purposes (beyond energy use prediction), many of the features are sparsely populated, collinear, or uncorrelated with kWh usage. Therefore, a preliminary and recurring task throughout the project was dimension reduction.
Since most of the independent variables were collected through occupant interviews – as opposed to exact physical examination of the residence – the values of many are binned as factor/categorical variables. The dataset’s 931 original variables had the following characteristics:
Missingness was common in the raw dataset, with 73 features having NA values in more than 95% of observations. To quickly gauge whether these features correlated with the outcome variable kWh (and therefore should be retained and imputed), I made use of flexible EDA graphing functions from the GGally R package. An example is shown below on continuous variables, revealing generally low r-statistics in correlation to kWh, with non-significant p-values (not shown). GGally also accommodates similar exploratory analysis on factor variables via pairwise box plots. Overall, although a more complex imputation strategy could have been employed to address these 73 features, they were dropped due to their high missingness and little evidence of predictive power over the outcome variable.
Feature Engineering/Reduction
Factor Encoding
As previously mentioned, the majority of the dataset features were factor variables, many of which needed recoding in order to correctly capture the variation being described. As illustrated below, nominal factor variables such as WALLTYPE have no intrinsic ordering; it would not make sense to order “stucco” higher than “wood”, for example. On the other hand, ordinal factors such as AGERFRI1 (age of most-used refrigerator) have a clear order to their factor levels; each level denotes a numerically higher value than the previous. Information contained in variables like AGERFRI1 is often more appropriately formatted in integer/numeric form, which can better convey their continuous range of values, aid interpretation of the variable coefficient, and reduce standard error by decreasing the number of coefficients for model estimation (many algorithms generate a separate model coefficient for each factor level via one-hot encoding).
In the image above, WALLTYPE was left as a factor while AGERFRI1 was recoded as an integer value (using the midpoint of each bin as the number of years). This factor-to-integer recoding was applied to additional binned variables relating to the age of appliances and the frequency of their use.
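The recoding amounts to mapping each bin label to its midpoint in years; the bin labels and midpoints below are illustrative rather than the exact RECS codes.

```r
# Map ordinal age bins to their midpoints (in years); labels are illustrative.
age_midpoints <- c("Less than 2 years" = 1,  "2 to 4 years"     = 3,
                   "5 to 9 years"      = 7,  "10 to 14 years"   = 12,
                   "15 to 19 years"    = 17, "20 years or more" = 22)

recode_age <- function(f) unname(age_midpoints[as.character(f)])

fridge_age <- factor(c("2 to 4 years", "10 to 14 years", "Less than 2 years"))
recode_age(fridge_age)  # 3 12 1
```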
New factor variables were also created by consolidating information from existing factors into one-hot encoded binary values, such as presence of a heated pool, use of electricity for space heating, etc.
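A sketch of that consolidation follows; the RECS-style variable names and level codes below are assumptions for illustration, not the survey’s actual codebook.

```r
# Consolidate existing factor codes into one-hot binary indicators.
# POOL/FUELPOOL/FUELHEAT mimic RECS-style names; the level codes are assumed.
homes <- data.frame(
  POOL     = c(1, 1, 0),   # 1 = residence has a pool
  FUELPOOL = c(5, 1, NA),  # assumed code: 5 = pool heated with electricity
  FUELHEAT = c(5, 2, 5)    # assumed code: 5 = electric space heating
)
homes$hasHeatedPool    <- as.integer(homes$POOL == 1 & homes$FUELPOOL %in% 5)
homes$hasElecSpaceHeat <- as.integer(homes$FUELHEAT == 5)
homes$hasHeatedPool     # 1 0 0  (%in% treats NA as a non-match)
homes$hasElecSpaceHeat  # 1 0 1
```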
Feature Reduction
We can visualize multicollinearity among the integer and numeric features using correlation plots, again made with R package GGally. In the plot below, a conceptual group of collinear variables stands out, i.e. those directly or indirectly representing a residence’s size, for example, total square footage (TOTSQFT), cooling square footage (TOTCSQFT), number of bathrooms (NCOMBATH), number of refrigerators (NUMFRIG), etc.
To confirm the degree of multicollinearity, variance inflation factors were calculated based on a linear model with the above numeric features – those with high VIF scores (loosely following the rule of thumb VIF > 5.0) were dropped.
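The VIF itself is simple to compute from its definition (the post does not show which package was used; car::vif is the common choice). A hand-rolled version, demonstrated on the built-in mtcars data:

```r
# VIF_j = 1 / (1 - R^2), where R^2 comes from regressing predictor j on the rest.
vif_scores <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
}

# mtcars example: displacement is strongly collinear with weight and horsepower
X <- as.matrix(mtcars[, c("disp", "hp", "wt", "qsec")])
v <- setNames(vif_scores(X), colnames(X))
high_vif <- names(v)[v > 5]  # candidates to drop under the VIF > 5 rule of thumb
```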
Despite variable recoding and decorrelation using the methods previously described, a large number of features remained in the dataset, the majority of them factor variables. Although principal component analysis (PCA) is a powerful method for dimension reduction, it does not generalize easily to categorical data. Feature reduction was therefore continued through an exploratory LASSO regression model, utilizing the L1 regularization penalty to drive non-significant variable coefficients in the model’s output to 0. This was helpful in identifying and dropping features with very little predictive power over kWh usage, particularly factor variables that were not covered in the multicollinearity exercise above.
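The post does not show the LASSO code (glmnet is the usual tool). As a minimal illustration of how the L1 penalty drives weak coefficients exactly to zero, here is a hand-rolled coordinate-descent LASSO on simulated data:

```r
# Soft-thresholding operator: the closed-form coordinate update under an L1 penalty
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Coordinate-descent LASSO on standardized predictors
lasso_cd <- function(X, y, lambda, iters = 100) {
  X <- scale(X); y <- y - mean(y)
  n <- nrow(X); b <- rep(0, ncol(X))
  for (it in seq_len(iters)) {
    for (j in seq_along(b)) {
      r <- y - X[, -j, drop = FALSE] %*% b[-j]        # partial residual
      b[j] <- soft_threshold(crossprod(X[, j], r) / n, lambda)
    }
  }
  b
}

# Only the first predictor drives y; the other four should be zeroed out
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
y <- 2 * X[, 1] + rnorm(200)
b <- lasso_cd(X, y, lambda = 0.5)
```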
Modeling With H2o
The processed and cleaned dataset was migrated to the open-source, online machine learning platform h2o, which offers highly efficient and scalable modeling methods. In contrast to machine learning conducted with slower native R packages (caret, glm) in the local R environment, the R package h2o facilitates API calls to h2o’s online platform, sending the given dataset to be distributed and parallel-processed across multiple clusters. H2o offers an array of the most common machine learning algorithms (glm, kNN, random forests, gbm, deep learning) at very impressive run times, greatly lessening the computational burden of model fitting and tuning. It also provides handy built-in functions for preprocessing such as training/validation/test set splitting, in this case chosen to be 70/20/10.
View the code on Gist.
Multivariable Regression
To best understand variable importance and interpret linear effects of the features on kWh usage, a simple multivariable regression model was fit using h2o’s glm function with default parameters (no regularization). A variable importance plot (see below) ranks the top 15 features by z-statistic, all of which are quite large – an observed z-score of 15 translates to a p-value of 9.63e-54, or virtually zero. Hence all 15 variables show extremely significant statistical evidence of a relationship to kWh, with an additional ~100 features (not shown) also meeting significance at the p = 0.05 threshold. Variables beginning with “hasElec” were feature engineered (as previously described), validating the approach of using domain knowledge to add or recode valuable information in new features. Coefficient estimates are denoted on the bar for each feature. We can observe that the presence of electric space heating at a residence (one-hot encoded as a factor variable) increases yearly electricity usage at the residence by about 2,722 kWh; each additional TOTCSQFT (air-conditioned square foot, numeric) adds 1.146 kWh/year.
ElasticNet Regression with Grid Search
While the default h2o glm provided interpretability and p-values, adding regularization to the linear model enabled an increase in predictive accuracy. H2o allows flexible grid search methods to tune the main glm hyperparameters alpha (type of regularization) and lambda (degree of coefficient shrinkage). Because grid search can quickly become computationally expensive, I utilized h2o’s powerful early-stopping and random-search functionality to ensure that computation time is not wasted on very small, marginal gains in validation accuracy.
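The shape of such a grid search is sketched below. The candidate values are illustrative, and the h2o.grid call is left commented out because it requires a live H2O cluster (started via h2o.init()).

```r
# Hyperparameter grid and random-search criteria for an elastic-net GLM.
# Candidate values are illustrative, not the post's exact grid.
hyper_params <- list(
  alpha  = seq(0, 1, by = 0.1),        # 0 = ridge, 1 = LASSO
  lambda = 10 ^ seq(-4, 2, by = 0.5)   # degree of coefficient shrinkage
)
search_criteria <- list(
  strategy           = "RandomDiscrete", # sample the grid rather than a full sweep
  stopping_metric    = "RMSE",           # early stopping on validation RMSE
  stopping_rounds    = 5,
  stopping_tolerance = 1e-3
)
# grid <- h2o.grid("glm", x = predictors, y = "KWH",
#                  training_frame = train, validation_frame = valid,
#                  hyper_params = hyper_params, search_criteria = search_criteria)
```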
View the code on Gist.
A primary purpose of grid search is to choose optimal tuning parameters according to a validation metric – in this case root-mean-squared error (RMSE) of kWh prediction on the validation set – and then train a new model using the optimal values discovered on the full dataset (no train/validation split) to maximize sample size. In 5 minutes of training, h2o fit all ~440 model combinations specified in the grid search, with the best-RMSE model having alpha = 1.0 (LASSO regression) and lambda = 1.0 (small amount of regularization). These parameters were then used to fit a final model on 90% of the data (combining the 70% train and 20% validation sets), and performance was evaluated on the 10% holdout test data, which the model had not yet seen.
Gradient Boosted Machine
To increase predictive accuracy over generalized linear model-based approaches, I used h2o’s implementation of gradient boosted machines. Boosting, a nonlinear tree-based approach, sequentially fits many shallow decision trees, each one correcting the residual errors of the ensemble built so far, to slowly reduce overall predictive error (in the regression setting, often RMSE). Grid search was performed using the same methods as above and with the following tuning parameters:
 learn_rate – the shrinkage applied to each new tree’s contribution (how strongly each tree corrects the previous trees’ errors)
 max_depth – max tree depth (rule of thumb – sqrt of # of features)
 sample_rate – percentage of rows/observations to be chosen to fit a given tree
 col_sample_rate – percentage of columns to be chosen to fit a given tree
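The four parameters above can be expressed as an h2o.grid hyper_params list; the candidate values here are illustrative, and the call itself is commented out since it needs a live H2O cluster.

```r
# GBM tuning grid mirroring the four parameters listed above (values illustrative)
gbm_hyper_params <- list(
  learn_rate      = c(0.01, 0.05, 0.1),
  max_depth       = c(3, 5, 7, 9),
  sample_rate     = c(0.7, 0.8, 0.9, 1.0),
  col_sample_rate = c(0.5, 0.7, 0.9)
)
# grid <- h2o.grid("gbm", x = predictors, y = "KWH",
#                  training_frame = train, validation_frame = valid,
#                  hyper_params = gbm_hyper_params,
#                  search_criteria = list(strategy = "RandomDiscrete",
#                                         max_runtime_secs = 300))
```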
Following the same process as with GLM, parameter values were then chosen from the trained model with the best validation set RMSE. The model was retrained on the full 90% of training data and tested on the final 10% holdout split.
Results from the two modeling strategies are summarized in the table below.
Additionally, predicted vs. actual kWh values for both modeling strategies are plotted against the most significant numeric feature, CDD30YR (30-year average of cooling degree-days in the residence’s region). Both model types demonstrate reasonable ability to predict kWh over a range of annual cooling loads/climate regions.
Conclusions

Final predictive accuracy was similar for the linear-based (GLM) and tree-based (GBM) models on the RECS dataset. The two models’ RMSE values represent a large improvement over an RMSE of 7,560, which was established as a baseline by initially fitting a default GLM on the raw dataset (before any feature engineering or reduction). This improvement validates the approaches used: domain-driven feature engineering of highly significant variables and grid search for hyperparameter tuning.
Aside from the regression setting, future analysis could be applied to RECS in a classification context. Since the majority of the dataset’s features are factor variables, energy providers could use the rich training set to attempt to answer important energy-related classification questions about their customers – type of space heating, presence of major appliances, or whether the home is a good candidate for solar installation. Greater predictive ability over residential electricity use will contribute over time to a smarter, cleaner, and more efficient power grid.
The post U.S. Residential Energy Use: Machine Learning on the RECS Dataset appeared first on NYC Data Science Academy Blog.
To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog.
Complete Subset Regressions, simple and powerful
(This article was first published on R – insightR, and kindly contributed to Rbloggers)
By Gabriel Vasconcelos

The complete subset regressions (CSR) is a forecasting method proposed by Elliott, Gargano and Timmermann in 2013. It is a very simple but powerful technique. Suppose you have a set of variables and you want to forecast one of them using information from the others. If your variables are highly correlated and the variable you want to predict is noisy, you will have collinearity problems and in-sample overfitting because the model will try to fit the noise.
These problems may be solved if you estimate a smaller model using only a subset of the explanatory variables; however, if you do not know which variables are important, you may lose information. What if we estimate models from many different subsets and combine their forecasts? Even better, what if we estimate models for all possible combinations of variables? This would be a good solution; however, with only 20 variables the number of regressions would already be more than 1 million. The CSR is a compromise between using only one subset and all possible subsets. Instead of estimating all possible combinations of models, we fix a value k and estimate all possible models with k variables each. Then we combine the forecasts from all these models to get our final result. Naturally, k must be smaller than the number of variables you have. Let us review the steps:
 Suppose you have K explanatory variables. Estimate all possible models with k variables each,
 Compute the forecasts for all the models,
 Compute the average of all the forecasts to have the final result.
We are going to generate data from a linear model where the explanatory variables, X, are drawn from a multivariate normal distribution. The coefficients are generated from a normal distribution and multiplied by a parameter alpha to ensure the dependent variable has a significant random component. The value for K is 10 and the CSR is estimated with k = 4.
set.seed(1) # = Seed for replication = #
K = 10  # = Number of Variables = #
T = 300 # = Number of Observations = #

# = Generate covariance matrix = #
library(mvtnorm)
D = diag(0.1, K)
P = matrix(rnorm(K * K), K, K)
sigma = t(P) %*% D %*% P

alpha = 0.1
beta = rnorm(K) * alpha # = coefficients = #

X = rmvnorm(T, sigma = sigma) # = Explanatory Variables = #
u = rnorm(T)                  # = Error = #
y = X %*% beta + u            # = Generate y = #

# = Break data into in-sample and out-of-sample = #
y.in = y[1:200]
y.out = y[-c(1:200)]
X.in = X[1:200, ]
X.out = X[-c(1:200), ]

# = Estimate model by OLS to compare = #
model = lm(y.in ~ X.in)
pred = cbind(1, X.out) %*% coef(model)

# = CSR = #
k = 4
models = combn(K, k) # = Gives all combinations with 4 variables = #
csr.fitted = rep(0, length(y.in))  # = Store in-sample fitted values = #
csr.pred = rep(0, length(y.out))   # = Store forecasts = #
for (i in 1:ncol(models)) {
  m = lm(y.in ~ X.in[, models[, i]])
  csr.fitted = csr.fitted + fitted(m)
  csr.pred = csr.pred + cbind(1, X.out[, models[, i]]) %*% coef(m)
}

R2ols = 1 - var(y.in - fitted(model)) / var(y.in)          # = R2 OLS = #
R2csr = 1 - var(y.in - csr.fitted / ncol(models)) / var(y.in) # = R2 CSR = #
c(R2ols, R2csr)
## [1] 0.1815733 0.1461342

# = In-sample fit = #
plot(y.in, type = "l")
lines(fitted(model), col = 2)
lines(csr.fitted / ncol(models), col = 4)

# = Out-of-sample fit = #
plot(y.out, type = "l")
lines(pred, col = 2)
lines(csr.pred / ncol(models), col = 4)

# = MAPE = #
MAPEols = mean(abs(y.out - pred))
MAPEcsr = mean(abs(y.out - csr.pred / ncol(models)))
c(MAPEols, MAPEcsr)
## [1] 0.8820019 0.8446682

The main conclusion from the results is that the CSR gives up some in-sample performance to improve the forecasts. In fact, the CSR is very robust to overfitting and you should definitely add it to your collection to use when you believe you have this type of problem. Most modern forecasting techniques share the same idea of accepting some in-sample bias to obtain a more parsimonious or stable model out-of-sample.
This is the simplest possible implementation of the CSR. In some cases you may have fixed controls you want to include in all regressions, such as lags of the dependent variable. The CSR’s computational cost increases quickly with the number of variables; if you are in a high-dimensional framework, you may need some type of pre-selection of the variables to reduce the problem’s size.
References

Elliott, Graham, Antonio Gargano, and Allan Timmermann. “Complete subset regressions.” Journal of Econometrics 177.2 (2013): 357-373.
To leave a comment for the author, please follow the link and comment on their blog: R – insightR.
Euler Problem 23: Non-Abundant Sums
(This article was first published on The Devil is in the Data, and kindly contributed to Rbloggers)
Euler problem 23 asks us to solve a problem with abundant or excessive numbers.
These are numbers for which the sum of their proper divisors is greater than the number itself.
12 is an abundant number because the sum of its proper divisors (the aliquot sum) is larger than 12: (1 + 2 + 3 + 4 + 6 = 16).
All highly composite numbers or anti-primes greater than six are abundant numbers. These are numbers that have so many divisors that they are considered the opposite of primes, as explained in the Numberphile video below.
Euler Problem 23 Definition

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.
A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.
As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis, even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.
Solution

This solution repurposes the divisors function that determines the proper divisors of a number, introduced for Euler Problem 21. The first code snippet creates the sequence of all abundant numbers up to 28123 (sequence A005101 in the OEIS). An abundant number is one whose aliquot sum is larger than the number itself.
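The proper.divisors helper from Problem 21 is not reproduced in the post; a minimal sketch is given below. It assumes the helper returns all divisors of n including n itself, which is consistent with the later subtraction of n to obtain the aliquot sum.

```r
# Sketch of the Problem 21 helper: returns ALL divisors of n (including n),
# so the aliquot sum is computed downstream as sum(proper.divisors(n)) - n.
proper.divisors <- function(n) {
  divisors <- c()
  for (i in seq_len(floor(sqrt(n)))) {
    if (n %% i == 0) {
      divisors <- c(divisors, i)
      if (i != n / i) divisors <- c(divisors, n / i)
    }
  }
  sort(divisors)
}
sum(proper.divisors(12)) - 12  # 16, so 12 is abundant
```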
# Generate abundant numbers (OEIS A005101)
A005101 <- function(x) {
  abundant <- vector()
  a <- 1
  for (n in 1:x) {
    aliquot.sum <- sum(proper.divisors(n)) - n
    if (aliquot.sum > n) {
      abundant[a] <- n
      a <- a + 1
    }
  }
  return(abundant)
}
abundant <- A005101(28123)

The solution to this problem is also a sequence in the Online Encyclopedia of Integer Sequences (OEIS A048242). That page states that the highest number in this sequence is 20161, not 28123 as stated in the problem definition.
The second section of code creates a list of all potential numbers that are not the sum of two abundant numbers. The next bit of code sieves out any sum of two abundant numbers from the list. The answer is determined by adding the remaining numbers in the sequence.
# Create a list of potential numbers that are not the sum of two abundant numbers
A048242 <- 1:20161

# Remove any number that is the sum of two abundant numbers
for (i in 1:length(abundant)) {
  for (j in i:length(abundant)) {
    if (abundant[i] + abundant[j] <= 20161) {
      A048242[abundant[i] + abundant[j]] <- NA
    }
  }
}
A048242 <- A048242[!is.na(A048242)]
answer <- sum(A048242)
print(answer)

The post Euler Problem 23: Non-Abundant Sums appeared first on The Devil is in the Data.
To leave a comment for the author, please follow the link and comment on their blog: The Devil is in the Data.