R news and tutorials contributed by (750) R bloggers

### RTutor: The effect of the TseTse fly on African Development

Sat, 06/03/2017 - 10:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

What is the causal impact of the TseTse fly on historical and current development in Africa? Marcella Alsan studies this question in her fascinating article ‘The effect of the TseTse fly on African development’ in the American Economic Review, 2015.

As part of her Bachelor Thesis at Ulm University, Vanessa Schöller has generated a very nice RTutor problem set that allows you to replicate the main insights of the paper in an interactive fashion.

Here is a screenshot:

Like in previous RTutor problem sets, you can enter free R code in a web-based shiny app. The code will be automatically checked, and you can get hints on how to proceed. In addition, you are challenged by many multiple choice quizzes.

To install the problem set locally, first install RTutor as explained here:

https://github.com/skranz/RTutor

and then install the problem set package:

https://github.com/vanessaschoeller/RTutorTseTse

There is also an online version hosted by shinyapps.io that allows you to explore the problem set without any local installation. (The online version is capped at 30 hours total usage time per month, so it may be greyed out when you click on it.)

https://vanessaschoeller.shinyapps.io/RTutorTseTse/

If you want to learn more about RTutor, to try out other problem sets, or to create a problem set yourself, take a look at the RTutor Github page

https://github.com/skranz/RTutor


To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


### Deferred & Remote Function Execution in R

Sat, 06/03/2017 - 08:31

(This article was first published on R – Random Remarks, and kindly contributed to R-bloggers)

What would you say if you could automatically wrap your R pipeline – which consists of numerous functions and variables – into a single function? What would you say if you could do it repeatedly, with no extra effort regardless of its complexity? Would you store it to document your progress at that particular moment in time? Would you serialize and send it to a remote R session to offload computations? Well, it’s kind-of possible.

defer brings a new capability to R – packaging, as a single R function, a number of inter-connected functions and variables which exist in the current R session. It is something I found useful during fast-prototyping phases of my own work and wanted to share for quite some time now – and at this point I’m interested in seeing if it turns out useful for someone else besides me.

Here’s a simplistic example where defer is given a top-level function h() that requires two other functions (f() and g()), a variable (C) and a library call (base::mean).

```r
C <- 10
f <- function(x) x*x + C
g <- function(y) (f(y) + 10) * f(y+3)
h <- function(z) mean(c(g(z), g(z+1), g(z+2)))

wrapper <- defer(h)
#> Found functions:
#>   g, f
#> variables:
#>   C
#> library calls:
#>   base::mean

rm(C, f, g, h)
wrapper(10)
#> [1] 29688.67
```

Here is another example where that same wrapper() can be run in a remote R session (via opencpu) without installing or deploying any extra code on the remote server. First, we serialize wrapper as a base64-encoded string.

```r
library(base64enc)
buffer <- rawConnection(raw(0), 'w')
saveRDS(wrapper, buffer)
serialized <- base64enc::base64encode(rawConnectionValue(buffer))
```

Then, we write that string into a temporary file together with a few lines of code to deserialize and run it in a remote R session.

```r
local_script_path <- tempfile(fileext = ".R")
cat(paste0("base64 <- '", serialized, "'\n",
           "library(base64enc)\n",
           "bytes <- rawConnection(base64enc::base64decode(base64), 'r')\n",
           "deserialized <- readRDS(bytes)\n",
           "deserialized(10)\n"),
    file = local_script_path)
```

Finally, we upload the file to the opencpu server and run it via base::source.

```r
library(httr)
result <- httr::POST(public_opencpu_url,
                     body = list(file = upload_file(local_script_path)))
content(result, 'text')
#> [1] "$value\n[1] 29688.67\n\n$visible\n[1] TRUE\n\n"
```

And that seems like a number we’ve seen already, right?
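The round trip can also be sanity-checked locally, without any server. Below is a minimal base-R sketch (make_adder is a made-up stand-in for the defer-produced wrapper) showing that a closure, together with the state it captures, survives saveRDS/readRDS through a raw connection:

```r
# Hypothetical stand-in for defer's wrapper(): a plain closure capturing state.
make_adder <- function(k) function(x) x + k
f <- make_adder(32)

# serialize the closure (with its enclosing environment) to a raw vector
buf <- rawConnection(raw(0), "w")
saveRDS(f, buf)
bytes <- rawConnectionValue(buf)
close(buf)

# restore it from the bytes and call it, as the remote session would
con <- rawConnection(bytes, "r")
g <- readRDS(con)
close(con)
g(10)
#> [1] 42
```

The same mechanism underlies the base64 version above; the encoding step only makes the bytes safe to paste into a script.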

The package is currently available only from GitHub, as it is not yet ready for CRAN.


To leave a comment for the author, please follow the link and comment on their blog: R – Random Remarks.


### R Weekly Bulletin Vol – X

Sat, 06/03/2017 - 03:49

This week’s R bulletin will cover topics on grouping data using the ntile function, how to open files automatically, and formatting an Excel sheet using R.

We will also cover functions like the choose function, the sample function, and the runif and rnorm functions. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Fold selected chunk – Alt+L
2. Unfold selected chunk – Shift+Alt+L
3. Fold all – Alt+0

Problem Solving Ideas

Grouping data using the ntile function

The ntile function is part of the dplyr package, and is used for grouping data. The syntax for the function is given by:

ntile(x, n)

Where,
“x” is the vector of values and
“n” is the number of buckets/groups to divide the data into.

Example:

In this example, we first create a data frame from two vectors: one comprising stock symbols, and the other their respective prices. We then group the values in the Price column into 2 buckets, and the bucket ranks are populated in a new column called “Ntile”. In the last line we select only those values which fall in the 2nd bucket, using the subset function.

```r
library(dplyr)
Ticker = c("PAGEIND", "MRF", "BOSCHLTD", "EICHERMOT", "TIDEWATER")
Price = c(14742, 33922, 24450, 21800, 5519)
data = data.frame(Ticker, Price)
data$Ntile = ntile(data$Price, 2)
print(data)
```

```r
ranked_data = subset(data, subset = (Ntile == 2))
print(ranked_data)
```
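If dplyr is not at hand, the same bucketing can be sketched in base R. The helper below, ntile_base, is a hypothetical name (not part of any package); it mimics ntile's convention of assigning the larger buckets first:

```r
# base-R sketch of ntile(x, n): rank the values, then map ranks to buckets,
# with larger buckets coming first (dplyr's convention)
ntile_base <- function(x, n) {
  r <- rank(x, ties.method = "first")
  floor(n * (r - 1) / length(x)) + 1
}

Price <- c(14742, 33922, 24450, 21800, 5519)
ntile_base(Price, 2)
#> [1] 1 2 2 1 1
```

With 5 values and 2 buckets, the first bucket gets 3 values (ranks 1-3) and the second gets 2 (ranks 4-5), matching dplyr's allocation.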

Automatically open the saved files

If you are saving the output returned from an R script and also want to open the file after running the code, you can use the shell.exec function. This function opens the specified file using the application specified in the Windows file associations; note that it is available on Windows only.

A file association associates a file with an application capable of opening that file. More commonly, a file association associates a class of files (usually determined by their filename extension, such as .txt) with a corresponding application (such as a text editor).
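Since shell.exec exists only on Windows, a cross-platform variant has to shell out to the operating system's own opener. The sketch below assumes the standard "open" (macOS) and "xdg-open" (Linux desktops) commands are present; open_file is a made-up helper name:

```r
# cross-platform "open this file" helper (sketch; relies on OS opener commands)
open_file <- function(path) {
  sysname <- Sys.info()[["sysname"]]
  if (sysname == "Windows") {
    shell.exec(path)                    # uses Windows file associations
  } else if (sysname == "Darwin") {
    system2("open", shQuote(path))      # macOS opener
  } else {
    system2("xdg-open", shQuote(path))  # most Linux desktops
  }
  invisible(path)
}
```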

The example below illustrates the usage of the function.
shell.exec(filename)

Example:

```r
df = data.frame(Symbols = c("ABAN", "BPCL", "IOC"), Price = c(212, 579, 538))
write.csv(df, "Stocks List.csv")
shell.exec("Stocks List.csv")
```

Quick format of the Excel sheet for column width

We can format the width of Excel sheet columns using the commands given below (these functions come from the xlsx package). In the example, the first line loads the Excel workbook specified by the file name. The third and the fourth lines use the autoSizeColumn function to adjust the width of the columns specified in “colIndex” for each of the worksheets. The last line saves the workbook again after making the formatting changes.

Example:

```r
library(xlsx)
wb = loadWorkbook(file_name)
sheets = getSheets(wb)
autoSizeColumn(sheets[[1]], colIndex = 1:7)
autoSizeColumn(sheets[[2]], colIndex = 1:5)
saveWorkbook(wb, file_name)
```

Functions Demystified

choose function

The choose function computes the combination nCr. The syntax for the function is given as:

choose(n,r)

where,
n is the number of elements
r is the number of subset elements

nCr = n!/(r! * (n-r)!)

Examples:

choose(5, 2)

[1] 10

choose(2, 1)

[1] 2
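The results above can be cross-checked against the factorial formula directly:

```r
# nCr via the explicit factorial formula n! / (r! * (n-r)!), compared with choose()
n <- 5; r <- 2
manual <- factorial(n) / (factorial(r) * factorial(n - r))
c(choose(n, r), manual)
#> [1] 10 10
```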

sample function

The sample function randomly selects n items from a given vector. By default, the samples are selected without replacement, which means that the function will not select the same item twice. The syntax for the function is given as:

sample(vector, n)

Example: Consider a vector consisting of yearly revenue growth data for a stock. We select five years of revenue growth at random using the sample function.

```r
Revenue = c(12, 10.5, 11, 9, 10.75, 11.25, 12.1, 10.5, 9.5, 11.45)
sample(Revenue, 5)
```

[1] 11.45 12.00 9.50 12.10 10.50

Some statistical processes require sampling with replacement; in such cases, you can specify replace = TRUE to the sample function.

Example:

```r
x = c(1, 3, 5, 7)
sample(x, 7, replace = TRUE)
```

[1] 7 1 5 3 7 3 5
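Since sample draws from R's random number generator, repeated runs give different results unless the seed is fixed. A quick illustration:

```r
# fixing the RNG seed makes the draw reproducible
set.seed(42)
draw1 <- sample(1:10, 3)
set.seed(42)
draw2 <- sample(1:10, 3)
identical(draw1, draw2)
#> [1] TRUE
```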

runif and rnorm functions

The runif function generates uniform random numbers between 0 and 1. The argument of the runif function is the number of random values to be generated; the optional min and max arguments generate values over a different range.

Example:

```r
# This will generate 7 uniform random numbers between 0 and 1.
runif(7)
```

[1] 0.6989614 0.5750565 0.6918520 0.3442109 0.5469400 0.7955652 0.5258890

```r
# This will generate 5 uniform random numbers between 2 and 4.
runif(5, min = 2, max = 4)
```

[1] 2.899836 2.418774 2.906082 3.728974 2.720633

The rnorm function generates random numbers from a normal distribution. The syntax for the function is given as:

rnorm(n, mean, sd)

Example:

```r
# generates 6 numbers from a normal distribution with a mean of 3
# and a standard deviation of 0.25
rnorm(6, 3, 0.25)
```

[1] 3.588193 3.095924 3.240684 3.061176 2.905392 2.891183
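As a sanity check on the parameters, with a large sample the empirical mean and standard deviation should be close to the requested ones:

```r
# empirical moments of a large normal sample approach the requested
# mean = 3 and sd = 0.25
x <- rnorm(100000, mean = 3, sd = 0.25)
c(mean(x), sd(x))  # both close to 3 and 0.25
```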

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will cover more interesting methods and R functions for our readers.



### (Linear Algebra) Do not scale your matrix

Sat, 06/03/2017 - 02:00

(This article was first published on Florian Privé, and kindly contributed to R-bloggers)

In this post, I will show you that you generally don’t need to explicitly scale a matrix. Maybe you wanted to know more about WHY matrices should be scaled when doing linear algebra; I will recall that at the beginning, but the rest will focus on HOW to avoid explicitly scaling matrices. We will apply our findings to the computation of Principal Component Analysis (PCA), and then to Pearson correlation at the end.

WHY scale matrices?

Generally, if you don’t center the columns of a matrix before PCA, you end up with the loadings of PC1 being the column means, which is not of much interest.

```r
n <- 100; m <- 10
a <- matrix(0, n, m); a[] <- rnorm(length(a))
a <- sweep(a, 2, 1:m, '+')
colMeans(a)
## [1]  0.9969826  1.9922219  3.0894905  3.9680835  4.9433477  6.0007352
## [7]  6.9524794  8.0898546  8.9506657 10.0354630
pca <- prcomp(a, center = FALSE)
cor(pca$rotation[, 1], colMeans(a))
## [1] 0.9999991
```

Now, say you have centered the columns of your matrix; do you also need to scale them? That is to say, do they need to have the same norm or standard deviation? PCA consists in finding an orthogonal basis that maximizes the variation in the data you analyze. So, if there is a column with much more variation than the others, it will probably end up being PC1, which is not of much interest.

```r
n <- 100; m <- 10
a <- matrix(0, n, m); a[] <- rnorm(length(a))
a[, 1] <- 100 * a[, 1]
apply(a, 2, sd)
## [1] 90.8562836  0.8949268  0.9514208  1.0437561  0.9995103  1.0475651
## [7]  0.9531466  0.9651383  0.9804789  0.9605121
pca <- prcomp(a, center = TRUE)
pca$rotation[, 1]
## [1]  9.999939e-01  6.982439e-04 -5.037773e-04 -9.553513e-05  8.937869e-04
## [6] -1.893732e-03 -3.053232e-04  1.811950e-03 -1.658014e-03  9.937442e-04
```

Hope I convinced you on WHY it is important to scale matrix columns before doing PCA. I will now show you HOW to avoid doing it explicitly.

Reimplementation of PCA

In this part, I will show you the basic code if you want to reimplement PCA yourself. It will serve as a basis to show you how to do linear algebra on a scaled matrix, without explicitly scaling the matrix.

```r
# True one
n <- 100; m <- 10
a <- matrix(0, n, m); a[] <- rnorm(length(a))
pca <- prcomp(a, center = TRUE, scale. = TRUE)

# DIY
a.scaled <- scale(a, center = TRUE, scale = TRUE)
K <- crossprod(a.scaled)
K.eigs <- eigen(K, symmetric = TRUE)
v <- K.eigs$vectors
PCs <- a.scaled %*% v

# Verif, recall that PCs can be opposites between runs
plot(v, pca$rotation)
```

```r
all.equal(sqrt(K.eigs$values), sqrt(n - 1) * pca$sdev)
## [1] TRUE
plot(PCs, pca$x)
```

Linear algebra behind the previous implementation

Suppose $$m < n$$ ($$m$$ is the number of columns and $$n$$ is the number of rows). Let us denote $$\tilde{X}$$ the scaled matrix. A partial singular value decomposition of $$\tilde{X}$$ is $$\tilde{X} \approx U \Delta V^T$$ where $$U$$ is an $$n \times K$$ matrix such that $$U^T U = I_K$$, $$\Delta$$ is a $$K \times K$$ diagonal matrix and $$V$$ is an $$m \times K$$ matrix such that $$V^T V = I_K$$. Taking $$K = m$$, you end up with $$\tilde{X} = U \Delta V^T$$. $$U \Delta$$ are the scores (PCs) of the PCA and $$V$$ are the loadings (rotation coefficients).

$$K = \tilde{X}^T \tilde{X} = (U \Delta V^T)^T \cdot U \Delta V^T = V \Delta U^T U \Delta V^T = V \Delta^2 V^T$$. So, when doing the eigen decomposition of $$K$$, you get $$V$$ and $$\Delta^2$$ because $$K V = V \Delta^2$$. To get the scores, you then compute $$\tilde{X} V = U \Delta$$. These are exactly the steps implemented above.

Implicit scaling of the matrix

Do you know the matrix formulation of column scaling? $$\tilde{X} = C_n X S$$ where $$C_n = I_n - \frac{1}{n} 1_n 1_n^T$$ is the centering matrix and $$S$$ is an $$m \times m$$ diagonal matrix with the scaling coefficients (typically, $$S_{j,j} = 1 / \text{sd}_j$$).

```r
# Let's verify
sds <- apply(a, 2, sd)
a.scaled2 <- (diag(n) - tcrossprod(rep(1, n)) / n) %*% a %*% diag(1 / sds)
all.equal(a.scaled2, a.scaled, check.attributes = FALSE)
## [1] TRUE
```

In our previous implementation, we computed $$\tilde{X}^T \tilde{X}$$ and $$\tilde{X} V$$. We are going to compute them again, without explicitly scaling the matrix.

Product

Let us begin with something easy: $$\tilde{X} V = C_n X S V = C_n (X (S V))$$. So, you can compute $$\tilde{X} V$$ without explicitly scaling $$X$$.
Let us verify:

```r
SV <- v / sds
XSV <- a %*% SV
CXSV <- sweep(XSV, 2, colMeans(XSV), '-')
all.equal(CXSV, PCs)
## [1] TRUE
```

Self cross-product

A little more tricky: $$\tilde{X}^T \tilde{X} = (C_n X S)^T \cdot C_n X S = S^T X^T C_n X S$$ ($$C_n^2 = C_n$$ is intuitive because centering an already centered matrix doesn’t change it). $$\tilde{X}^T \tilde{X} = S^T X^T (I_n - \frac{1}{n} 1_n 1_n^T) X S = S^T (X^T X - X^T (\frac{1}{n} 1_n 1_n^T) X) S = S^T (X^T X - \frac{1}{n} s_X s_X^T) S$$ where $$s_X$$ is the vector of column sums of $$X$$. Let us verify with a rough implementation:

```r
sx <- colSums(a)
K2 <- (crossprod(a) - tcrossprod(sx) / n) / tcrossprod(sds)
all.equal(K, K2)
## [1] TRUE
```

Conclusion

We have recalled some steps of the computation of Principal Component Analysis on a scaled matrix. We have seen how to compute the different steps of the implementation without having to explicitly scale the matrix. This “implicit” scaling can be quite useful when you manipulate very large matrices, because you are neither copying the matrix nor making useless computations. In my next post, I will present my new package that uses this trick to make a lightning-fast partial SVD on very large matrices.

Appendix: application to Pearson correlation

Pearson correlation is merely a self cross-product on a centered and normalized (columns with unit norm) matrix. Let us just implement that with our new trick.

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericMatrix& correlize(NumericMatrix& mat,
                         const NumericVector& shift,
                         const NumericVector& scale) {
  int n = mat.nrow();
  int i, j;

  for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
      // corresponds to "- \frac{1}{n} s_X * s_X^T"
      mat(i, j) -= shift(i) * shift(j);
      // corresponds to "S^T (...) S"
      mat(i, j) /= scale(i) * scale(j);
    }
  }

  return mat;
}
```

```r
cor3 <- function(mat) {
  sums <- colSums(mat) / sqrt(nrow(mat))
  corr <- crossprod(mat)
  diags <- sqrt(diag(corr) - sums^2)
  correlize(corr, shift = sums, scale = diags)
}

a <- matrix(0, 1000, 1000); a[] <- rnorm(length(a))
all.equal(cor3(a), cor(a))
## [1] TRUE

library(microbenchmark)
microbenchmark(
  cor3(a),
  cor(a),
  times = 20
)
## Unit: milliseconds
##     expr       min        lq      mean    median        uq       max neval
##  cor3(a)  39.38138  39.67276  40.68596  40.09893  40.65623  46.46785    20
##   cor(a) 635.74350 637.33605 639.34810 638.09980 639.36110 651.61876    20
```

To leave a comment for the author, please follow the link and comment on their blog: Florian Privé.

### Hacking the principles of #openscience #workshops

Fri, 06/02/2017 - 20:47

(This article was first published on R – christopher lortie, and kindly contributed to R-bloggers)

In a previous post, I discussed the key elements that really stood out for me in recent workshops associated with open science, data science, and ecology.
Summer workshop season is upon us, and here are some principles to consider that can be used to hack a workshop. These hacks can be applied a priori as an instructor, or in situ as a participant or instructor, by engaging with the context from a pragmatic, problem-solving perspective.

Principles

1. Embrace open pedagogy.
2. Use current best practices from traditional teaching contexts.
3. Be learner centered.
4. Speak less, do more.
5. Solve authentic challenges.

Hacks (for each principle)

1. Prepare learning outcomes for every lesson.
2. Identify solve-a-problem opportunities in advance and be open to ones that emerge organically during the workshop.
3. Use no slide decks. This challenges the instructor to engage more directly with the students and participants in the workshop and leaves space for students to shape content and narrative to some extent. Decks lock all of us in. This is appropriate for some contexts such as conference presentations, but workshops can be more fluid and open.
4. Plan pauses. Prepare your lessons with gaps for contributions. Prepare a list of questions to offer up for every lesson and provide time for discussion of solutions.
5. Use real evidence/data to answer a compelling question (scale can be limited, the approach can be beta as long as an answer is provided, and the challenge can emerge if teaching is open and space is provided for the workshop participants to ideate).

A final hack that is a more general teaching principle: consider keeping all teaching materials within a single ecosystem that references outwards only as needed. For me, this has become all content prepared in RStudio, knitted to html, then pushed to GitHub gh-pages for sharing as a webpage (or site). Participants can then engage with all ideas and content, including code and data, in one place.
To leave a comment for the author, please follow the link and comment on their blog: R – christopher lortie.

### Teach kids about R with Minecraft

Fri, 06/02/2017 - 17:19

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

As I mentioned earlier this week, I was on a team at the ROpenSci Unconference (with Brooke Anderson, Karl Broman, Gergely Daróczi, and my Microsoft colleagues Mario Inchiosa and Ali Zaidi) to work on a project to interface the R language with Minecraft. The resulting R package, miner, is now available to install from Github. The goal of the package is to introduce budding programmers to the R language via their interest in Minecraft, and to that end there's also a book (R Programming with Minecraft) and an associated R package (craft) under development to provide lots of fun examples of manipulating the Minecraft world with R.

Create objects in Minecraft with R functions

If you're a parent you're probably already aware of the Minecraft phenomenon, but if not: it's kind of like the Lego of the digital generation.
Kids (and kids-at-heart) enter a virtual 3-D world composed of cubes representing ground, water, trees and other materials, and use those cubes to build their own structures, which they can then explore with their friends using their in-game avatars.

Inspired by the Python-focused book "Learn to program with Minecraft", Karl had the brilliant idea of building a similar interface around R. The miner package provides just a few simple functions to manipulate the game world: find or move a player's position; add or remove blocks in the world; send a message to all players in the world. The functions are deliberately simple, designed to be used to build more complex tasks. For example, you could write a function to detect when a player is standing in a hole. It uses the getPlayerIds and getPlayerPos functions to find the IDs and locations of all players in the world, and the getBlocks function to check whether the player is surrounded by blocks that are not code 0 (air).

The package also provides a couple of "listener" functions: you can detect player chat messages, or when players strike a block with their sword. You can then use these to write functions that react to player actions. For example, you can write a chat-bot to play a game with the player, create an AI to solve an in-game maze, or give the player the powers of Elsa from Frozen to freeze water by walking on it.

To get started with the miner package, you'll need to purchase a copy of Minecraft for Windows, Mac or Linux if you don't have one already. (Note: the Windows 10 version from the Microsoft Store isn't compatible with the miner package.) This is what you'll use to explore the world managed by the Minecraft server, which you'll also need to install along with the RaspberryJuice plugin. You can find setup details in the miner package vignette. We installed the open-source Spigot server on an Ubuntu VM running in Azure; you might find this Dockerfile helpful if you're trying something similar.
The miner and craft packages are available to install from Github at the links below. The packages are a work in progress: comments, suggestions, pull requests and additional examples for the R Programming with Minecraft book are most welcome!

Github (ROpenSci labs): miner and craft

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### Weather forecast with regression models – part 1

Fri, 06/02/2017 - 14:21

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

In this tutorial we are going to analyse a weather dataset to produce exploratory analysis and forecast reports based on regression models. We are going to take advantage of a public dataset which is part of the exercise datasets of the “Data Mining and Business Analytics with R” book (Wiley) written by Johannes Ledolter. In doing that, we will use several R packages, the caret package included. The tutorial is going to be split into the following four parts:

1. Introducing the weather dataset and outlining its exploratory analysis
2. Building logistic regression models for 9am, 3pm and late evening weather forecasts
3. Tuning to improve the accuracy of the previously built models and showing ROC plots
4. Making considerations on “at-least” moderate rainfall scenarios and building additional models to predict further weather variables

R Packages

Overall, we are going to take advantage of the following packages:

```r
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(gmodels))
suppressPackageStartupMessages(library(lattice))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(Kmisc))
suppressPackageStartupMessages(library(ROCR))
suppressPackageStartupMessages(library(corrplot))
```

Exploratory Analysis

The dataset under analysis is publicly available at: jledolter – datamining – data exercises. It contains daily observations from a single weather station. Let us start by loading the data and taking a look at it.
```r
set.seed(1023)
weather_data <- read.csv(url("https://www.biz.uiowa.edu/faculty/jledolter/datamining/weather.csv"),
                         header = TRUE, sep = ",", stringsAsFactors = TRUE)
kable(head(weather_data))
```

|Date |Location | MinTemp| MaxTemp| Rainfall| Evaporation| Sunshine|WindGustDir | WindGustSpeed|WindDir9am |WindDir3pm | WindSpeed9am| WindSpeed3pm| Humidity9am| Humidity3pm| Pressure9am| Pressure3pm| Cloud9am| Cloud3pm| Temp9am| Temp3pm|RainToday | RISK_MM|RainTomorrow |
|:---------|:--------|-------:|-------:|--------:|-----------:|--------:|:-----------|-------------:|:----------|:----------|------------:|------------:|-----------:|-----------:|-----------:|-----------:|--------:|--------:|-------:|-------:|:---------|-------:|:------------|
|11/1/2007 |Canberra | 8.0| 24.3| 0.0| 3.4| 6.3|NW | 30|SW |NW | 6| 20| 68| 29| 1019.7| 1015.0| 7| 7| 14.4| 23.6|No | 3.6|Yes |
|11/2/2007 |Canberra | 14.0| 26.9| 3.6| 4.4| 9.7|ENE | 39|E |W | 4| 17| 80| 36| 1012.4| 1008.4| 5| 3| 17.5| 25.7|Yes | 3.6|Yes |
|11/3/2007 |Canberra | 13.7| 23.4| 3.6| 5.8| 3.3|NW | 85|N |NNE | 6| 6| 82| 69| 1009.5| 1007.2| 8| 7| 15.4| 20.2|Yes | 39.8|Yes |
|11/4/2007 |Canberra | 13.3| 15.5| 39.8| 7.2| 9.1|NW | 54|WNW |W | 30| 24| 62| 56| 1005.5| 1007.0| 2| 7| 13.5| 14.1|Yes | 2.8|Yes |
|11/5/2007 |Canberra | 7.6| 16.1| 2.8| 5.6| 10.6|SSE | 50|SSE |ESE | 20| 28| 68| 49| 1018.3| 1018.5| 7| 7| 11.1| 15.4|Yes | 0.0|No |
|11/6/2007 |Canberra | 6.2| 16.9| 0.0| 5.8| 8.2|SE | 44|SE |E | 20| 24| 70| 57| 1023.8| 1021.7| 7| 5| 10.9| 14.8|No | 0.2|No |

Metrics at a specific Date and Location are given together with the RainTomorrow variable, indicating whether rain occurred on the next day.
```r
colnames(weather_data)
 [1] "Date"          "Location"      "MinTemp"       "MaxTemp"       "Rainfall"
 [6] "Evaporation"   "Sunshine"      "WindGustDir"   "WindGustSpeed" "WindDir9am"
[11] "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[16] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"
[21] "Temp3pm"       "RainToday"     "RISK_MM"       "RainTomorrow"
```

The description of the variables is the following.

- Date: The date of observation (a date object).
- Location: The common name of the location of the weather station
- MinTemp: The minimum temperature in degrees centigrade
- MaxTemp: The maximum temperature in degrees centigrade
- Rainfall: The amount of rainfall recorded for the day in millimeters
- Evaporation: Class A pan evaporation (in millimeters) during 24 h
- Sunshine: The number of hours of bright sunshine in the day
- WindGustDir: The direction of the strongest wind gust in the 24 h to midnight
- WindGustSpeed: The speed (in kilometers per hour) of the strongest wind gust in the 24 h to midnight
- WindDir9am: The direction of the wind gust at 9 a.m.
- WindDir3pm: The direction of the wind gust at 3 p.m.
- WindSpeed9am: Wind speed (in kilometers per hour) averaged over 10 min before 9 a.m.
- WindSpeed3pm: Wind speed (in kilometers per hour) averaged over 10 min before 3 p.m.
- Humidity9am: Relative humidity (in percent) at 9 a.m.
- Humidity3pm: Relative humidity (in percent) at 3 p.m.
- Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9 a.m.
- Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3 p.m.
- Cloud9am: Fraction of sky obscured by cloud at 9 a.m. This is measured in ”oktas,” which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates completely clear sky, while an 8 indicates that it is completely overcast
- Cloud3pm: Fraction of sky obscured by cloud at 3 p.m.; see Cloud9am for a description of the values
- Temp9am: Temperature (degrees C) at 9 a.m.
- Temp3pm: Temperature (degrees C) at 3 p.m.
- RainToday: integer 1 if precipitation (in millimeters) in the 24 h to 9 a.m. exceeds 1 mm, otherwise 0
- RISK_MM: the continuous target variable; the amount of rain recorded during the next day
- RainTomorrow: the binary target variable indicating whether or not it rains during the next day

Looking then at the data structure, we discover that the dataset includes a mix of numerical and categorical variables.

```r
str(weather_data)
```

```
'data.frame':	366 obs. of  24 variables:
 $ Date         : Factor w/ 366 levels "1/1/2008","1/10/2008",..: 63 74 85 87 88 89 90 91 92 64 ...
 $ Location     : Factor w/ 1 level "Canberra": 1 1 1 1 1 1 1 1 1 1 ...
 $ MinTemp      : num  8 14 13.7 13.3 7.6 6.2 6.1 8.3 8.8 8.4 ...
 $ MaxTemp      : num  24.3 26.9 23.4 15.5 16.1 16.9 18.2 17 19.5 22.8 ...
 $ Rainfall     : num  0 3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 ...
 $ Evaporation  : num  3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6 4 5.4 ...
 $ Sunshine     : num  6.3 9.7 3.3 9.1 10.6 8.2 8.4 4.6 4.1 7.7 ...
 $ WindGustDir  : Factor w/ 16 levels "E","ENE","ESE",..: 8 2 8 8 11 10 10 1 9 1 ...
 $ WindGustSpeed: int  30 39 85 54 50 44 43 41 48 31 ...
 $ WindDir9am   : Factor w/ 16 levels "E","ENE","ESE",..: 13 1 4 15 11 10 10 10 1 9 ...
 $ WindDir3pm   : Factor w/ 16 levels "E","ENE","ESE",..: 8 14 6 14 3 1 3 1 2 3 ...
 $ WindSpeed9am : int  6 4 6 30 20 20 19 11 19 7 ...
 $ WindSpeed3pm : int  20 17 6 24 28 24 26 24 17 6 ...
 $ Humidity9am  : int  68 80 82 62 68 70 63 65 70 82 ...
 $ Humidity3pm  : int  29 36 69 56 49 57 47 57 48 32 ...
 $ Pressure9am  : num  1020 1012 1010 1006 1018 ...
 $ Pressure3pm  : num  1015 1008 1007 1007 1018 ...
 $ Cloud9am     : int  7 5 8 2 7 7 4 6 7 7 ...
 $ Cloud3pm     : int  7 3 7 7 7 5 6 7 7 1 ...
 $ Temp9am      : num  14.4 17.5 15.4 13.5 11.1 10.9 12.4 12.1 14.1 13.3 ...
 $ Temp3pm      : num  23.6 25.7 20.2 14.1 15.4 14.8 17.3 15.5 18.9 21.7 ...
 $ RainToday    : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 1 1 1 1 2 ...
 $ RISK_MM      : num  3.6 3.6 39.8 2.8 0 0.2 0 0 16.2 0 ...
 $ RainTomorrow : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 1 1 2 1 ...
```
We have 366 records available:

```r
(n <- nrow(weather_data))
```

```
[1] 366
```

which span the following timeline:

```r
c(as.character(weather_data$Date[1]), as.character(weather_data$Date[n]))
```

```
[1] "11/1/2007"  "10/31/2008"
```

We further notice that RISK_MM relates to the RainTomorrow variable as follows:

```r
all.equal(weather_data$RISK_MM > 1, weather_data$RainTomorrow == "Yes")
```

```
[1] TRUE
```

The Rainfall and RainToday variables are equivalent according to the following relationship (as anticipated by the RainToday description):

```r
all.equal(weather_data$Rainfall > 1, weather_data$RainToday == "Yes")
```

```
[1] TRUE
```

To make the task more challenging, we decide to take RISK_MM, Rainfall, and RainToday out (along with Date and Location), and derive a new dataset:

```r
weather_data2 <- subset(weather_data,
                        select = -c(Date, Location, RISK_MM, Rainfall, RainToday))
colnames(weather_data2)
```

```
 [1] "MinTemp"       "MaxTemp"       "Evaporation"   "Sunshine"      "WindGustDir"   "WindGustSpeed"
 [7] "WindDir9am"    "WindDir3pm"    "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[13] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"       "Temp3pm"
[19] "RainTomorrow"
```

```r
(cols_withNa <- apply(weather_data2, 2, function(x) sum(is.na(x))))
```

```
      MinTemp       MaxTemp   Evaporation      Sunshine   WindGustDir WindGustSpeed    WindDir9am
            0             0             0             3             3             2            31
   WindDir3pm  WindSpeed9am  WindSpeed3pm   Humidity9am   Humidity3pm   Pressure9am   Pressure3pm
            1             7             0             0             0             0             0
     Cloud9am      Cloud3pm       Temp9am       Temp3pm  RainTomorrow
            0             0             0             0             0
```

Look at the NA counts associated with WindDir9am. If WindDir9am turned out not to be a significant predictor of RainTomorrow, we could drop that column and thereby increase the complete-cases count. We will determine whether that is possible in the categorical variable analysis. For now, we consider only records without NA values.
```r
weather_data3 <- weather_data2[complete.cases(weather_data2),]
```

Categorical Variable Analysis

```r
factor_vars <- names(which(sapply(weather_data3, class) == "factor"))
factor_vars <- setdiff(factor_vars, "RainTomorrow")
chisq_test_res <- lapply(factor_vars, function(x) {
  chisq.test(weather_data3[, x], weather_data3[, "RainTomorrow"],
             simulate.p.value = TRUE)
})
names(chisq_test_res) <- factor_vars
chisq_test_res
```

```
$WindGustDir

	Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  weather_data3[, x] and weather_data3[, "RainTomorrow"]
X-squared = 35.833, df = NA, p-value = 0.001999


$WindDir9am

	Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  weather_data3[, x] and weather_data3[, "RainTomorrow"]
X-squared = 23.813, df = NA, p-value = 0.06547


$WindDir3pm

	Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)

data:  weather_data3[, x] and weather_data3[, "RainTomorrow"]
X-squared = 9.403, df = NA, p-value = 0.8636
```

The Chi-squared p-values shown above tell us that RainTomorrow values depend on WindGustDir: we reject, at the 5% level, the null hypothesis that RainTomorrow is independent of WindGustDir. For WindDir9am (p = 0.065) and WindDir3pm (p = 0.86), on the other hand, we cannot reject the null hypothesis of independence. We therefore expect to take advantage of WindGustDir as a predictor, and not to consider WindDir9am and WindDir3pm for that purpose.
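The same conclusion can be reached programmatically by pulling the p-values out of the list of test results built above (a minimal sketch, assuming `chisq_test_res` is in the workspace):

```r
# Extract the simulated p-value from each chi-squared test result.
p_values <- sapply(chisq_test_res, function(res) res$p.value)

# Keep as candidate predictors only the factors significant at the 5% level.
candidate_factors <- names(p_values[p_values < 0.05])
candidate_factors
```

With the p-values reported above, only WindGustDir survives the cut.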

It is also possible to get a visual sense of the significance of these categorical variables by taking advantage of bar charts built from the cross-table row proportions.

```r
barchart_res <- lapply(factor_vars, function(x) {
  title <- colnames(weather_data3[, x, drop = FALSE])
  wgd <- CrossTable(weather_data3[, x], weather_data3$RainTomorrow, prop.chisq = F)
  barchart(wgd$prop.row, stack = F,
           auto.key = list(rectangles = TRUE, space = "top", title = title))
})
names(barchart_res) <- factor_vars
barchart_res$WindGustDir
```

Barchart of WindGustDir

```r
barchart_res$WindDir9am
```

Barchart of WindDir9am

```r
barchart_res$WindDir3pm
```

Barchart of WindDir3pm

Since WindDir9am is not a candidate predictor and has more than 30 NA values, we decide to drop it; the same goes for WindDir3pm. As a consequence, the number of NA-free records increases from 328 to 352.

```r
weather_data4 <- subset(weather_data2, select = -c(WindDir9am, WindDir3pm))
weather_data5 <- weather_data4[complete.cases(weather_data4),]
colnames(weather_data5)
```

```
 [1] "MinTemp"       "MaxTemp"       "Evaporation"   "Sunshine"      "WindGustDir"
 [6] "WindGustSpeed" "WindSpeed9am"  "WindSpeed3pm"  "Humidity9am"   "Humidity3pm"
[11] "Pressure9am"   "Pressure3pm"   "Cloud9am"      "Cloud3pm"      "Temp9am"
[16] "Temp3pm"       "RainTomorrow"
```

So, we end up with a dataset made up of 16 potential predictors: one categorical (WindGustDir) and 15 quantitative.

Quantitative Variable Analysis

In this section, we carry on the exploratory analysis of the quantitative variables. We start with a visualization of the correlation relationships among variables.

```r
factor_vars <- names(which(sapply(weather_data5, class) == "factor"))
numeric_vars <- setdiff(colnames(weather_data5), factor_vars)
numeric_vars <- setdiff(numeric_vars, "RainTomorrow")
numeric_vars
numeric_vars_mat <- as.matrix(weather_data5[, numeric_vars, drop = FALSE])
numeric_vars_cor <- cor(numeric_vars_mat)
corrplot(numeric_vars_cor)
```

From the correlation plot shown above, we can state that:

- Temp9am is strongly positively correlated with MinTemp
- Temp9am is strongly positively correlated with MaxTemp
- Temp9am is strongly positively correlated with Temp3pm
- Temp3pm is strongly positively correlated with MaxTemp
- Pressure9am is strongly positively correlated with Pressure3pm
- Humidity3pm is strongly negatively correlated with Sunshine
- Cloud9am is strongly negatively correlated with Sunshine
- Cloud3pm is strongly negatively correlated with Sunshine

The pairs plot highlights whether any relationship, such as a linear one, holds between pairs of variables.
```r
pairs(weather_data5[, numeric_vars], col = weather_data5$RainTomorrow)
```

Visual inspection suggests a linear relationship between the following variable pairs:

MinTemp and MaxTemp
MinTemp and Evaporation
MaxTemp and Evaporation
Temp9am and MinTemp
Temp3pm and MaxTemp
Pressure9am and Pressure3pm
Humidity9am and Humidity3pm
WindSpeed9am and WindGustSpeed
WindSpeed3pm and WindGustSpeed
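These visually suggested relationships can also be checked numerically. A quick sketch (assuming the `weather_data5` data frame built above):

```r
# Candidate pairs suggested by the pairs plot.
pairs_to_check <- list(
  c("MinTemp", "MaxTemp"),
  c("Temp9am", "MinTemp"),
  c("Temp3pm", "MaxTemp"),
  c("Pressure9am", "Pressure3pm"),
  c("Humidity9am", "Humidity3pm"),
  c("WindSpeed9am", "WindGustSpeed")
)

# Pearson correlation for each pair, on pairwise complete observations.
sapply(pairs_to_check, function(p) {
  round(cor(weather_data5[[p[1]]], weather_data5[[p[2]]],
            use = "pairwise.complete.obs"), 2)
})
```

Pairs with correlations close to +/-1 confirm the linear relationships seen in the plot.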

Boxplots and density plots give a visual sense of a predictor's significance: the less the predictor's value distributions overlap across the levels of the variable to be predicted (RainTomorrow), the more useful the predictor.

```r
gp <- invisible(lapply(numeric_vars, function(x) {
  ggplot(data = weather_data5,
         aes(x = RainTomorrow, y = eval(parse(text = x)), col = RainTomorrow)) +
    geom_boxplot() +
    xlab("RainTomorrow") + ylab(x) + ggtitle("") +
    theme(legend.position = "none")
}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs = lapply(gp[x], ggplotGrob), nrow = 2, ncol = 2)
}))
grob_plots
```

```r
gp <- invisible(lapply(numeric_vars, function(x) {
  ggplot(data = weather_data5,
         aes(x = eval(parse(text = x)), col = RainTomorrow)) +
    geom_density() +
    xlab(x) + ggtitle(paste(x, "density", sep = " "))
}))
grob_plots <- invisible(lapply(chunk(1, length(gp), 4), function(x) {
  marrangeGrob(grobs = lapply(gp[x], ggplotGrob), nrow = 2, ncol = 2)
}))
grob_plots
```

From all those plots, we can state that Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm and Sunshine are predictors to be considered.
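A complementary numeric check is to compare the per-group medians of those candidate predictors, which quantifies the separation the boxplots show visually (a sketch, assuming `weather_data5` from above):

```r
# Median of each candidate predictor, split by RainTomorrow.
candidates <- c("Humidity3pm", "Pressure9am", "Pressure3pm",
                "Cloud9am", "Cloud3pm", "Sunshine")
aggregate(weather_data5[, candidates],
          by = list(RainTomorrow = weather_data5$RainTomorrow),
          FUN = median)
```

Large gaps between the "Yes" and "No" medians back up the visual impression of a useful predictor.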

Ultimately, we save the dataset currently under analysis for use in the rest of this tutorial.

```r
write.csv(weather_data5, file = "weather_data5.csv", row.names = FALSE)
```

Conclusions

Exploratory analysis has highlighted interesting relationships across variables. These preliminary results give us hints for predictor selection and will be used in the rest of this analysis.

If you have any questions, please feel free to comment below.



To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...


### The code (and other stuff…)

Fri, 06/02/2017 - 12:56

I’ve received a couple of emails or comments on one of the General Election posts to ask me to share the code I’ve used.

In general, I think this is a bit dirty and lots could be done in a more efficient way – effectively, I’m doing this out of my own curiosity, and while I think the model is sensible, it’s probably not “publication-standard” (in terms of annotation etc).

Anyway, I’ve created a (rather plain) GitHub repository, which contains the basic files (including the R script, R functions, basic data and JAGS model). Given time (which I’m not given…), I’d like to add a lot more description and perhaps also write a Stan version of the model code. I could also write a more precise model description – I’ll try to update the material on GitHub.

On another note, the previous posts have been syndicated in a couple of places (here and here), which was nice. And finally, here’s a little update with the latest data. As of today, the model predicts the following seats distribution.

```
                mean        sd 2.5% median 97.5%
Conservative 352.124 3.8760350  345    352   359
Labour       216.615 3.8041091  211    217   224
UKIP           0.000 0.0000000    0      0     0
Lib Dem       12.084 1.8752228    8     12    16
SNP           49.844 1.8240041   45     51    52
Green          0.000 0.0000000    0      0     0
PCY            1.333 0.9513233    0      2     3
Other          0.000 0.0000000    0      0     0
```

Labour are still slowly but surely gaining some ground – I’m not sure the effects of the debate earlier this week (which was deserted by the PM) are visible yet, as only a couple of the polls included were conducted after it.

Another interesting thing (following up on this post) is the analysis of the marginal seats that the model predicts will swing away from the 2015 winners. I’ve updated the plot, which now looks as below.

Now there are 30 constituencies that are predicted to change hands, many still towards the Tories. I am not a political scientist, so I don’t really know all the ins and outs of these, but I think a couple of examples are quite interesting and I would venture some comments…

So, the model doesn’t know about the recent by-elections in Copeland and Stoke-on-Trent Central, and so still labels these seats as “Labour” (as they were in 2015), although the Tories now actually control Copeland.

In the prediction, the polls and the impact of the EU referendum both matter: the two constituencies were strong Leave areas (with around 60% and 70% of the preference, respectively), and the Tories did reasonably well in 2015 (36% vs Labour’s 42% in Copeland, and 33% vs Labour’s 39% in Stoke). So, the model is suggesting that both are likely to switch to the Tories this time around.

In fact, we know that at the time of the by-elections, while Copeland (where the contest was mostly Labour v Tories) did go blue, Stoke didn’t. But there, the main battle was between the Labour and UKIP candidates (UKIP had got 21% in 2015). And the by-election was fought last February, when the Tories’ lead was much more robust than it probably is now.

Another interesting area is Twickenham – historically a constituency leaning towards the Lib Dems, which was captured by the Conservatives in 2015. But since then, in another by-election, the Tories have lost a similar area (Richmond Park, with a massive swing), and the model is suggesting that Twickenham could follow suit, come next Thursday.

Finally, Clacton was the only seat won by UKIP in 2015, but since then the elected MP (a former Tory turned UKIP) has left the party and is not contesting the seat. This, combined with UKIP’s poor standing in the polls, produces the unsurprising outcome that Clacton is predicted to go blue with basically no uncertainty…

These results look reasonable to me – not sure how life will turn out, of course. As many commentators have noted, much may depend on turnout among younger voters. Or other factors. And probably there’ll be another instance of the “Shy Tory effect” (I’ll think about this if I get some time before the final prediction). But the model does seem to make some sense…


### A Shiny App for Exploring Commodities Prices and Economic Indicators, via Quandl

Fri, 06/02/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

In a previous post, we created an R Notebook to explore the relationship between the copper/gold price ratio and 10-year Treasury yields (if you’re curious why we might care about this relationship, have a quick look at that previous post), relying on data from Quandl. Today, we’ll create a Shiny app that lets users choose which different commodities ratios and different economic indicators to investigate. Perhaps users don’t care about Dr. Copper and Treasury yields, but instead want to explore the oil/gold price ratio and how it correlates with the US inflation rate, or the EU inflation rate. Let’s give them some flexibility!

The finished app is available here.

Before we get to it, note a few issues that we have seen in the past.

Very similar to our previous Shiny app, in the code chunk below, we have some important decisions about how a user selects a commodity. We could use textInput to allow the user to enter the code for the desired data set which would not limit the user in any way – he or she could choose any dataset on Quandl. The cost would be that the user would need to know, or go to Quandl and look up, the code for any data set.

For example, to import iron ore prices, the user would have to type in ODA/PIORECR_USD. That’s a big problem if your end users are not familiar with and have no interest in data set codes. To emphasize convenience and usability we will go with selectInput instead of textInput, meaning our app will show a drop-down of a few choices. The user just clicks on “Iron Ore” instead of typing ODA/PIORECR_USD, or clicks on “copper” instead of typing CHRIS/CME_HG1.1. But, if a user wants to work with a data set that we haven’t included, said user is out of luck.

Another big decision is how many choices to give the user. I’ve included a few: copper, oil, iron, platinum, palladium and silver. That’s a nice smattering of commodities whose prices tend to rise with a growing economy, plus silver which does not. I included silver so we could take a look at a commodity that should behave differently from the rest. As always, our choice here is driven by how broad this app should be. We could have added steel, lead, zinc, and tin, or we could have included just copper, oil and iron, depending on the use case. Either way, the number of drop downs is another tradeoff between usability and flexibility.

The final decision is a bit more nuanced and requires looking ahead to how these inputs will be used further down in the app. Have a peek at the object called commodityChoices and you might notice that we don’t strictly need that object. We could have put the vector of choices as an argument to selectInput, so that our code would have read choices = c("Copper" = "CHRIS/CME_HG1.1", ...) instead of choices = commodityChoices. In that choice assignment, “copper” is called the name and “CHRIS/CME_HG1.1” is called the value (together we can think of them as a name-value pair). The reason for building a separate commodityChoices object is that we want the ability to extract either the name or the value of the name-value pair. Usually we would care only about the value, because we want to pass the value to Quandl and import the data, but that name is going to be useful when we label our graphs.
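The name-extraction trick is worth seeing in isolation. A minimal sketch, using two of the name-value pairs from the app:

```r
commodityChoices <- c("Copper"  = "CHRIS/CME_HG1",
                      "WTI oil" = "FRED/DCOILWTICO")

# Given a value (the Quandl code a user selected), recover the display name:
# subsetting a named vector keeps the names, and names() extracts them.
selected_value <- "FRED/DCOILWTICO"
names(commodityChoices[commodityChoices == selected_value])
# [1] "WTI oil"
```

This is exactly the lookup the app performs later when it builds chart labels from `input$commodity`.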

Without further ado, let’s look at commodityChoices, econIndicatorChoices and the use of selectInput.

```r
# Create a vector of commodity choices.
commodityChoices <- c(
  "Copper"    = "CHRIS/CME_HG1",
  "WTI oil"   = "FRED/DCOILWTICO",
  "Iron Ore"  = "ODA/PIORECR_USD",  # monthly
  "Platinum"  = "LPPM/PLAT",
  "Palladium" = "LPPM/PALL",
  "Silver"    = "LBMA/SILVER")

# Make those commodity choices available via selectInput.
selectInput("commodity", "Commodity",
            choices = commodityChoices,
            selected = "Copper")

# Create a vector of economic indicator choices.
econIndicatorChoices <- c(
  "10-Yr Yield" = "FRED/DGS10",             # daily
  "US CPI"      = "RATEINF/INFLATION_USA",  # monthly
  "Japan CPI"   = "RATEINF/INFLATION_JPN",
  "EU CPI"      = "RATEINF/INFLATION_EUR")

# Make those economic indicator choices available via selectInput.
selectInput("econIndicator", "Economic Indicator",
            choices = econIndicatorChoices,
            selected = "10-yr Yield")

# A standard date range input.
dateRangeInput("dateRange", "Date range",
               start = "1990-01-01",
               end   = "2016-12-31")
```

Now that we have the inputs in a sidebar for the user, it’s back to Quandl to import the data for the chosen commodity, gold and the chosen economic indicator. There’s a common date range for all three so we’ll start by creating start and end date objects.

```r
ratio_indicator <- reactive({

  Quandl.api_key("your API key here")

  start_date <- format(input$dateRange[1])
  end_date   <- format(input$dateRange[2])
```

We could now write three separate calls to Quandl for each of the data sets but, instead, let’s make use of the map() function from the purrr package. If you’re not familiar with purrr, have a look here. I’ll just say that you’ll probably never have to use lapply() again (and that should be motivation enough), but, in short, the family of map() functions takes a function and applies it to the elements in a vector, similar to the apply() functions.
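To get a feel for the parallel before we use it on Quandl codes, here is a toy comparison of `lapply()` and `purrr::map()` (a sketch; it assumes the purrr package is installed):

```r
library(purrr)

codes <- c("a", "b", "c")

# lapply and purrr::map are close cousins: both apply a function
# to each element of the vector and return a list of results.
res_lapply <- lapply(codes, toupper)
res_map    <- map(codes, toupper)

identical(res_lapply, res_map)  # TRUE
```

In the app, the function being mapped is `Quandl()` and the vector holds three data set codes.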

Before we can use map() though, we need a vector to feed it. Let’s create a vector of Quandl codes.

```r
# Create a vector of 3 data set codes:
# 1) commodity chosen by user
# 2) gold quandl code
# 3) economic indicator chosen by user
gold_code <- "CHRIS/CME_GC1.1"

# Vector of Quandl codes.
data_set_codes <- c(input$commodity, gold_code, input$econIndicator)
```

Then we’ll apply the Quandl() function by piping our vector of codes and using map().

```r
# Pipe the data_set_codes vector to Quandl via the map() function.
# Note we can still set the start and end date and object type,
# as we always can with Quandl.
quandlData <- data_set_codes %>%
  map(Quandl,
      start_date = start_date,
      end_date = end_date,
      collapse = "monthly",
      type = "xts") %>%
```

Next, we will use map() to apply the na.locf() function to our time series and ensure that no NAs remain.

```r
  # Replace all NAs using map() and na.locf().
  map(na.locf, fromLast = TRUE) %>%
```

If we stopped here, we would have a list of three xts series, but I don’t want a list, I want one xts object. So, we’ll pipe our list of three and use the reduce() + merge() to combine our list of 3 time series into one xts object.

```r
  # Merge to one xts object using reduce() and merge().
  reduce(merge) %>%
  # Add nicer column names.
  `colnames<-`(c(names(commodityChoices[commodityChoices == input$commodity]),
                 "Gold",
                 names(econIndicatorChoices[econIndicatorChoices == input$econIndicator])))
```

Alright, after running our Quandl codes through that series of mapped functions, we have three time series stored in one xts object, and now we want to calculate the price ratio of the chosen commodity/gold.

To create that price ratio, we need to divide the first column by the second column and we’ll store it in a new column called ratio.

Then we will save just the ratio and the economic indicator columns into their own xts object. That is not strictly necessary, but it makes things cleaner and easier when we pass them to dygraph().

```r
  # Create a column and add the price ratio.
  quandlData$ratio <- quandlData[, 1]/quandlData[, 2]

  # Save just the ratio and the economic indicator data.
  ratio_indicator <- merge(quandlData$ratio, quandlData[, 3])

  # Add more general names.
  colnames(ratio_indicator) <- c("ratio", "indicator")

  return(ratio_indicator)
})
```

Now we just need to pass our reactive object ratio_indicator() to dygraph() and follow the same steps as we did when testing in our Notebook.

We will use dyRoller() to smooth out our chart and make each point an average of the number of periods specified with rollPeriod = X. This won’t affect our xts object, where we store the data, it just makes the chart more readable.

Remember also that we are charting two time series on the same chart and they are on different scales, so we want to add a right-hand-side y-axis.

To do so, we need to invoke dyAxis() for the left-hand axis, called “y”. Then we invoke dyAxis() for the right-hand axis, called “y2”. We also need to set independentTicks = TRUE so that we can use a unique, independent value scale for the right-hand side. Next, in our dySeries() call for each time series, we assign each one to an axis. Here we assign “ratio” with axis = 'y', so that the commodity-gold price ratio will be on the left-hand scale, and we assign “indicator” with axis = 'y2', so the economic indicator will be on the right-hand scale.

```r
dygraphOutput("ratio_indicator")

output$ratio_indicator <- renderDygraph({
  dygraph(ratio_indicator()) %>%
    # Add the rollPeriod for smoothing.
    dyRoller(rollPeriod = 3) %>%
    # Create two independent axes, just as we did in the Notebook.
    dyAxis("y", label = "USD") %>%
    dyAxis("y2", label = "Percent (%)", independentTicks = TRUE) %>%
    # Assign each time series to an axis.
    # Use the name from the name-value pair to create nice labels for each.
    dySeries("ratio", axis = 'y',
             label = paste(names(commodityChoices[commodityChoices == input$commodity]),
                           "/Gold (LHS)", sep = ""),
             color = "blue") %>%
    dySeries("indicator", axis = 'y2',
             label = paste(names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                           "(RHS)", sep = ""),
             color = "red")
})
```

We could end things here, but let’s go ahead and add a chart to show the rolling correlation between the ratio and the indicator. We’ve done so much work to calculate and wrangle these time series, we might as well put them to use!

First, we’ll calculate the rolling correlation using the rollapply() function. Nothing too complicated here.

```r
dygraphOutput("rollingCorrelation")

output$rollingCorrelation <- renderDygraph({

  rolling_cor <- rollapply(ratio_indicator(), 24,
                           function(x) cor(x[, 1], x[, 2],
                                           use = "pairwise.complete.obs"),
                           by.column = FALSE)

  # Make a nicer name for the xts object that stores the rolling correlation.
  # This name will be displayed when a user hovers on the dygraph.
  names(rolling_cor) <- paste(names(commodityChoices[commodityChoices == input$commodity]),
                              "/Gold ",
                              names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                              " Correlation", sep = "")
```

It’s not necessary, but I like to display the mean, minimum and maximum rolling correlations on the chart. We’ll store those in three objects: avg, mini, and maxi.

```r
  avg  <- round(mean(rolling_cor, na.rm = T), 2)
  mini <- round(min(rolling_cor, na.rm = T), 2)
  maxi <- round(max(rolling_cor, na.rm = T), 2)
```

Now we pass our rolling_cor xts object to dygraph() and pass the mean, minimum and maximum objects to dyLimit().

```r
  dygraph(rolling_cor,
          main = paste(names(commodityChoices[commodityChoices == input$commodity]),
                       "/Gold ",
                       names(econIndicatorChoices[econIndicatorChoices == input$econIndicator]),
                       " Correlation", sep = "")) %>%
    dyRangeSelector(dateWindow = c("2015-01-01", "2016-12-31")) %>%
    # Add a line for the mean, min and max.
    dyLimit(avg, color = 'purple') %>%
    dyLimit(mini, color = 'red') %>%
    dyLimit(maxi, color = 'blue') %>%
    # Add an event for the US election.
    dyEvent("2016-11-08", "Trump!", labelLoc = "bottom")
})
```

And, we’re done! It’s fun to explore different relationships amongst different time series with this app. And once we have this template in the toolkit, all sorts of different data sets can be substituted in for exploration. For example, we might want to port this work over to a currencies dashboard, or a country GDP dashboard. The nice thing is, it’s just a matter of finding the right Quandl codes and imagining new hypotheses to explore.

Things got a little choppy today with all the piping, so just a reminder that if you want the reusable code for this app, it’s available via the source code button at the top right of the live app. Thanks, and see you next time.



To leave a comment for the author, please follow the link and comment on their blog: R Views.


### Python and R top 2017 KDnuggets rankings

Thu, 06/01/2017 - 23:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The results of KDnuggets' 18th annual poll of data science software usage are in, and for the first time in three years Python has edged out R as the most popular software. While R increased its share of usage from 45.7% in last year's poll to 52.1% this year, Python's usage among data scientists increased even more, from 36.6% of users in 2016 to 52.6% of users this year.

There were some interesting moves in the long tail, as well. Several tools entered the KDnuggets chart for the first time, including Keras (9.5% of users), PyCharm (9.0%) and Microsoft R Server (4.3%). And several returning tools saw big jumps in usage, including Microsoft Cognitive Toolkit (3.4% of users), TensorFlow (20.2%) and Power BI (10.2%). Microsoft SQL Server increased its share to 11.6% (up from 10.8%), whereas SAS (7.1%) and Matlab (7.4%) saw declines. Julia, somewhat surprisingly, remained flat at 1.1%.

For the complete results and analysis of the 2017 KDnuggets data science software poll, follow the link below.

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### Canada Labour Market: Future Perspectives

Thu, 06/01/2017 - 19:21

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

For anyone looking for job opportunities, it is nice to have an idea of how the job market will perform in the future for your chosen career or industry. Many countries have open data sets that offer this kind of data. In these exercises we will use R to analyze the future prospects of the Canadian labour market.

Answers to the exercises are available here.

Exercise 1

Exercise 2
Load the tidyr library. Use gather to rearrange any occupation-related data set that presents time-series data into a tidy format.
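If you are unsure where to start, the gather() pattern looks like this on a toy data frame (the column names here are hypothetical; the real data sets have one column per year, prefixed with "X"):

```r
library(tidyr)

# A hypothetical wide-format slice: one row per occupation,
# one column per year, mimicking the shape of the real data sets.
wide <- data.frame(
  Occupation = c("Nurses", "Welders"),
  X2017 = c(100, 50),
  X2018 = c(110, 45)
)

# gather() stacks the year columns into key/value pairs,
# leaving Occupation as an identifier column.
long <- gather(wide, key = "year", value = "value", -Occupation)
long
```

The result has one row per occupation-year combination, which is the tidy shape the later exercises build on.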

Exercise 3
Use gather to rearrange ALL other occupation-related data sets that present time-series data into a tidy format, and pile them up into a single data frame.

Exercise 4
Remove lines that present NA values, columns in French, and the “X” in front of every year. Take a look at your tidy data set.


Exercise 5
Let’s do the same with the industry data sets. Start by taking one of the industry data sets that presents data as a time series and use gather.

Exercise 6
Apply the procedure of Exercise 5 to all other industry data sets, and pile them up in a new data frame.

Exercise 7
Remove NAs and the French columns. In addition, set year and value as numeric, and take a look at your new tidy data set about industries.

Exercise 8
Find out the industries that have the lowest number of jobseekers, and create a new data set by subsetting the previous one.

Exercise 9
Plot the recently created data set using a line for each industry.

Exercise 10
Create a similar plot for the top 5 occupations with the fewest jobseekers.


To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

### In defense of wrapr::let()

Thu, 06/01/2017 - 18:01

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Saw this the other day:

In defense of wrapr::let() (originally part of replyr, and still re-exported by that package) I would say:

• let() was deliberately designed for a single real-world use case: working with data when you don’t know the column names when you are writing the code (i.e., the column names will come later in a variable). We can re-phrase that as: there is deliberately less to learn as let() is adapted to a need (instead of one having to adapt to let()).
• The R community already has months of experience confirming let() working reliably in production while interacting with a number of different packages.
• let() will continue to be a very specific, consistent, reliable, and relevant tool even after dplyr 0.6.* is released, and the community gains experience with rlang/tidyeval in production.

If rlang/tidyeval is your thing, by all means please use and teach it. But please continue to consider also using wrapr::let(). If you are trying to get something done quickly, or trying to share work with others: a “deeper theory” may not be the best choice.

An example follows.

In “base R” one can write:

d <- data.frame(x = 1:3)

If we know the column name we wish to add to this data frame we write:

d$newColumn <- 1

The above is “non-parameterized” evaluation in that the variable name is taken from the source code, and not from a variable carrying the information. This is great for ad-hoc analysis, but it would be hard to write functions, scripts and packages if this were the only way we had to manipulate columns. This isn’t a problem, as R supplies an additional parameterized notation, as we show here. When we don’t know the name of the column (but expect it to be in a variable at run time) we write:

# code written very far away from us
variableHoldingColumnName <- 'newColumn'

# our code
d[[variableHoldingColumnName]] <- 1

The purpose of wrapr::let() is to allow one to use the non-parameterized form as if it were parameterized. Obviously this is only useful if there is not a convenient parameterized alternative (which is the case for some packages). But for teaching purposes: how would wrapr::let() let us use the “$” as if it were parameterized (which we would have to do if [[]] and [] did not exist)?

With wrapr::let() we can re-write the “dollar-sign form” as:

wrapr::let(
  c(NEWCOL = variableHoldingColumnName),
  {
    d$NEWCOL <- 1
  }
)

The name “NEWCOL” is a stand-in name that we write all our code in terms of. The expression “c(NEWCOL = variableHoldingColumnName)” is saying: “NEWCOL is to be interpreted as being whatever name is stored in the variable variableHoldingColumnName.” Notice we can’t tell from the code what value is in variableHoldingColumnName, as that won’t be known until execution time. The alias list or vector can be arbitrarily long and built wherever you like (it is just data). The expression block can be arbitrarily large and complex (so you need only pay the mapping notation cost once per function, not once per line of code).
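Putting the pieces together, here is a minimal self-contained version of the example (assuming the wrapr package is installed):

```r
library(wrapr)

d <- data.frame(x = 1:3)

# imagine this assignment happening far away from our code
variableHoldingColumnName <- 'newColumn'

# NEWCOL is substituted at execution time by the name held in
# variableHoldingColumnName, and the block runs in our environment
let(
  c(NEWCOL = variableHoldingColumnName),
  {
    d$NEWCOL <- 1
  }
)

print(d)  # d now has a column named "newColumn", all 1s
```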

And that is wrapr::let().

If you don’t need parametric names you don’t need wrapr::let(). If you do need parametric names wrapr::let() is one of the most reliable and easiest to learn ways to get them.

A nice article from a user recording their experience trying different ways to parameterize their code (including wrapr::let()) can be found here.

If you wish to learn more we have a lot of resources available:

And many more examples on our blog.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

### A Primer in functional Programming in R (part 2)

Thu, 06/01/2017 - 17:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the last exercise, we saw how powerful functional programming principles can be, how they can dramatically increase the readability of your code, and how easily you can work with them. In this set of exercises we will look at functional programming principles with purrr. purrr comes with a number of interesting features and is really useful for writing clean and concise code. Please check the documentation and load the purrr library in your R session before starting this exercise set.
Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
From the airquality dataset (available in base R), find the mean, median, and standard deviation of all columns using map functions.
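One possible approach (not the official solution): a data frame is a list of columns, so `map_dbl()` iterates over columns and returns a named numeric vector, one entry per column.

```r
library(purrr)

# airquality ships with base R; na.rm = TRUE handles its missing values
means   <- map_dbl(airquality, mean, na.rm = TRUE)
medians <- map_dbl(airquality, median, na.rm = TRUE)
sds     <- map_dbl(airquality, sd, na.rm = TRUE)

means["Wind"]  # mean wind speed, indexed by column name
```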

Exercise 2

In the same dataset, find the 95th percentile of each column, excluding the NA values.

Exercise 3
Load the iris dataset. With the help of the pipe and map functions, find the mean of the relevant columns. Keep in mind that mean is meant for numeric columns, so you may need multiple map-like functions. I expect the output as a data frame.

Exercise 4
I have a vector x <- 1:20. I want to multiply every odd element by 2 and every even element by 4. Find a solution using purrr.
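A sketch of one solution, reading "odd element" as an element with an odd value (positions could be handled similarly with `imap()`):

```r
library(purrr)

x <- 1:20
# branch on the value inside the mapping function
res <- map_dbl(x, ~ if (.x %% 2 == 1) .x * 2 else .x * 4)
head(res)  # 2 8 6 16 10 24
```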

Exercise 5
I have a sequence x <- 1:100. I want to increase the 1st, 11th, 21st, …, 91st elements by 1. How can I achieve that?
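One way to do this (a sketch, not the official solution) is `map_at()`, which applies a function only at the given positions and leaves the rest untouched:

```r
library(purrr)

x <- 1:100
# map_at returns a list, so unlist() brings it back to a vector
res <- unlist(map_at(x, seq(1, 91, by = 10), ~ .x + 1))
res[c(1, 2, 11)]  # 2 2 12
```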
Exercise 6
Suppose I have two character vectors
x <- letters # letters is a built-in vector of the lowercase alphabet
y <- LETTERS # LETTERS is a built-in vector of the uppercase alphabet

How do I join each capital letter with the corresponding small letter so that the end result looks like this:
"A,a" "B,b" … and so on?
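A sketch of one solution using `map2_chr()`, which walks two vectors in parallel:

```r
library(purrr)

# paste each capital letter with its lowercase counterpart
res <- map2_chr(LETTERS, letters, ~ paste(.x, .y, sep = ","))
head(res, 2)  # "A,a" "B,b"
```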

Exercise 7
The previous exercise gave you the intuition of how to work on two vectors in parallel. Now write a solution that accepts a list of character vectors of the same size and joins them like the previous exercise. So if I have a
list like x <- list(c("a","b","c"), c("d","e","f"), c("g","h","i")), it should give me the output
[1] "a d g" "b e h" "c f i"
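This generalization is what `pmap_chr()` is for: it extends map2 to a list of any number of equal-length vectors. A sketch:

```r
library(purrr)

x <- list(c("a","b","c"), c("d","e","f"), c("g","h","i"))
# pmap calls paste("a","d","g"), paste("b","e","h"), ...
res <- pmap_chr(x, paste)
res  # "a d g" "b e h" "c f i"
```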

Exercise 8
Using a functional tool from purrr, reverse the letters so that my output is a string of 26 characters starting from z and ending at a,
like “z y x w v u t s r q p o n m l k j i h g f e d c b a”.
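One way (a sketch): `reduce()` folds the reversed alphabet into a single string, pasting one letter onto the accumulator at a time.

```r
library(purrr)

res <- reduce(rev(letters), paste)
substr(res, 1, 5)  # "z y x"
```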

Exercise 9
An exercise set on purrr won’t be complete if we don’t mention its “filter”-like functions. Take a numeric vector of 1:100, keep all the numbers which are divisible by 2, and after that remove the ones which are divisible by 5.
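A sketch with purrr's filter-like helpers `keep()` and `discard()`:

```r
library(purrr)

# keep the evens, then discard multiples of 5
res <- discard(keep(1:100, ~ .x %% 2 == 0), ~ .x %% 5 == 0)
head(res)  # 2 4 6 8 12 14
```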

Exercise 10
The ability to create partial functions is a powerful tool in functional programming. It allows functions to behave a little like data structures. Consider creating a partial function whenever you see yourself repeating a function’s arguments. This may not sound very useful in the context of this exercise, but I am sure you will find it very useful in your R gigs. Now create a partial function
ql, which finds the 25th and 50th percentiles of a column from a data frame. Use it to find the same for the airquality dataset. Don’t use quantile twice in this exercise.
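A sketch of one solution: `partial()` pre-fills quantile()'s arguments once, so quantile itself is only written a single time.

```r
library(purrr)

# ql is quantile with probs and na.rm fixed in advance
ql <- partial(quantile, probs = c(0.25, 0.5), na.rm = TRUE)

# apply it to every column of airquality
map(airquality, ql)["Wind"]
```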

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

### A tidy model pipeline with twidlr and broom

Thu, 06/01/2017 - 14:00

(This article was first published on blogR, and kindly contributed to R-bloggers)

@drsimonj here to show you how to go from data in a data.frame to a tidy data.frame of model output by combining twidlr and broom in a single, tidy model pipeline.

The problem

Different model functions take different types of inputs (data.frames, matrices, etc) and produce different types of output! Thus, we’re often confronted with the very untidy challenge presented in this Figure:

Thus, different models may need very different code.

However, it’s possible to create a consistent, tidy pipeline by combining the twidlr and broom packages. Let’s see how this works.

Two-step modelling

To understand the solution, think of the problem as a two-step process, depicted in this Figure:

Step 1: from data to fitted model

Step 1 must take data in a data.frame as input and return a fitted model object. twidlr exposes model functions that do just this!

To demonstrate:

# devtools::install_github("drsimonj/twidlr")  # To install
library(twidlr)

lm(mtcars, hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat  
#>      79.048       -2.063        8.204        0.439       -4.619  
#>          wt         qsec           vs           am         gear  
#>     -27.660       -1.784       25.813        9.486        7.216  
#>        carb  
#>      18.749

This means we can pipe data.frames into any model function exposed by twidlr. For example:

library(dplyr)

mtcars %>% lm(hp ~ .)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)          mpg          cyl         disp         drat  
#>      79.048       -2.063        8.204        0.439       -4.619  
#>          wt         qsec           vs           am         gear  
#>     -27.660       -1.784       25.813        9.486        7.216  
#>        carb  
#>      18.749

Step 2: fitted model to tidy results

Step 2 must take a fitted model object as its input and return a tidy data frame of results. This is precisely what the broom package does via three functions: glance, tidy, and augment! To demonstrate:

# install.packages("broom")  # To install
library(broom)

fit <- mtcars %>% lm(hp ~ .)

glance(fit)
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

tidy(fit)
#>           term    estimate   std.error  statistic     p.value
#> 1  (Intercept)  79.0483879 184.5040756  0.4284371 0.672695339
#> 2          mpg  -2.0630545   2.0905650 -0.9868407 0.334955314
#> 3          cyl   8.2037204  10.0861425  0.8133655 0.425134929
#> 4         disp   0.4390024   0.1492007  2.9423609 0.007779725
#> 5         drat  -4.6185488  16.0829171 -0.2871711 0.776795845
#> 6           wt -27.6600472  19.2703681 -1.4353668 0.165910518
#> 7         qsec  -1.7843654   7.3639133 -0.2423121 0.810889101
#> 8           vs  25.8128774  19.8512410  1.3003156 0.207583411
#> 9           am   9.4862914  20.7599371  0.4569518 0.652397317
#> 10        gear   7.2164047  14.6160152  0.4937327 0.626619355
#> 11        carb  18.7486691   7.0287674  2.6674192 0.014412403

augment(fit) %>% head()
#>           .rownames  hp  mpg cyl disp drat    wt  qsec vs am gear carb
#> 1         Mazda RX4 110 21.0   6  160 3.90 2.620 16.46  0  1    4    4
#> 2     Mazda RX4 Wag 110 21.0   6  160 3.90 2.875 17.02  0  1    4    4
#> 3        Datsun 710  93 22.8   4  108 3.85 2.320 18.61  1  1    4    1
#> 4    Hornet 4 Drive 110 21.4   6  258 3.08 3.215 19.44  1  0    3    1
#> 5 Hornet Sportabout 175 18.7   8  360 3.15 3.440 17.02  0  0    3    2
#> 6           Valiant 105 18.1   6  225 2.76 3.460 20.22  1  0    3    1
#>     .fitted     .resid      .hat   .sigma     .cooksd .std.resid
#> 1 148.68122 -38.681220 0.2142214 24.75946 0.069964902 -1.6801773
#> 2 140.62866 -30.628664 0.2323739 25.43881 0.049861042 -1.3460408
#> 3  79.99158  13.008418 0.3075987 26.38216 0.014633059  0.6019364
#> 4 125.75448 -15.754483 0.2103960 26.31579 0.011288712 -0.6826601
#> 5 183.21756  -8.217565 0.2016137 26.53317 0.002878707 -0.3541128
#> 6 111.38490  -6.384902 0.3147448 26.55680 0.003682813 -0.2969840

A single, tidy pipeline

So twidlr and broom functions can be combined into a single, tidy pipeline to go from data.frame to tidy output:

library(twidlr)
library(broom)

mtcars %>%
  lm(hp ~ .) %>%
  glance()
#>   r.squared adj.r.squared    sigma statistic     p.value df    logLik
#> 1 0.9027993     0.8565132 25.97138  19.50477 1.89833e-08 11 -142.8905
#>        AIC      BIC deviance df.residual
#> 1 309.7809 327.3697 14164.76          21

Any model included in twidlr and broom can be used in this same way. Here’s a kmeans example:

iris %>%
  select(-Species) %>%
  kmeans(centers = 3) %>%
  tidy()
#>         x1       x2       x3       x4 size withinss cluster
#> 1 5.901613 2.748387 4.393548 1.433871   62 39.82097       1
#> 2 5.006000 3.428000 1.462000 0.246000   50 15.15100       2
#> 3 6.850000 3.073684 5.742105 2.071053   38 23.87947       3

And a ridge regression with cross-fold validation example:

mtcars %>%
  cv.glmnet(am ~ ., alpha = 0) %>%
  glance()
#>   lambda.min lambda.1se
#> 1  0.2284167  0.8402035

So next time you want to do some tidy modelling, keep this pipeline in mind:

Limitations

Currently, the major limitation for this approach is that a model must be covered by twidlr and broom. For example, you can’t use randomForest in this pipeline because, although twidlr exposes a data.frame friendly version of it, broom doesn’t provide tidying methods for it. So if you want to write tidy code for a model that isn’t covered by these packages, have a go at helping out by contributing to these open source projects! To get started creating and contributing to R packages, take a look at Hadley Wickham’s free book, “R Packages”.

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.

To leave a comment for the author, please follow the link and comment on their blog: blogR.

### Correcting bias in meta-analyses: What not to do (meta-showdown Part 1)

Thu, 06/01/2017 - 10:24

tl;dr: Publication bias and p-hacking can dramatically inflate effect size estimates in meta-analyses. Many methods have been proposed to correct for such bias and to estimate the underlying true effect. In a large simulation study, we found out which methods do not work well under which conditions, and we give recommendations on what not to use. Estimated reading time: 7 min. It is well known that publication bias and p-hacking inflate effect size estimates from meta-analyses. In recent years, methodologists have developed an ever-growing menu of statistical approaches to correct for such overestimation. However, to date it was unclear under which conditions they perform well, and what to do if they disagree. Born out of a Twitter discussion, Evan Carter, Joe Hilgard, Will Gervais and I did a large simulation project, where we compared the performance of naive random effects meta-analysis (RE), trim-and-fill (TF), p-curve, p-uniform, PET, PEESE, PET-PEESE, and the three-parameter selection model (3PSM).

Previous investigations typically looked only at publication bias or questionable research practices (QRPs) (but not both), used non-representative study-level sample sizes, or only compared a few bias-correcting techniques, but not all of them. Our goal was to simulate a research literature that is as realistic as possible for psychology. In order to simulate several research environments, we fully crossed five experimental factors: (1) the true underlying effect, δ (0, 0.2, 0.5, 0.8); (2) between-study heterogeneity, τ (0, 0.2, 0.4); (3) the number of studies in the meta-analytic sample, k (10, 30, 60, 100); (4) the percentage of studies in the meta-analytic sample produced under publication bias (0%, 60%, 90%); and (5) the use of QRPs in the literature that produced the meta-analytic sample (none, medium, high).

This blog post summarizes some insights from our study, internally called “meta-showdown”. Check out the preprint; and the interactive app metaExplorer. The fully reproducible and reusable simulation code is on Github, and more information is on OSF.

In this blog post, I will highlight some lessons that we learned during the project, primarily focusing on what not to do when performing a meta-analysis.

Limitation of generalizability disclaimer: These recommendations apply to typical sample sizes, effect sizes, and heterogeneities in psychology; other research literatures might have different settings and therefore a different performance of the methods. Furthermore, the recommendations rely on the modeling assumptions of our simulation. We went a long way to make them as realistic as possible, but other assumptions could lead to other results.

Never trust a naive random effects meta-analysis or trim-and-fill (unless you meta-analyze a set of registered reports)

If studies have no publication bias, nothing can beat plain old random effects meta-analysis: it has the highest power, the least bias, and the highest efficiency compared to all other methods. Even in the presence of some (though not extreme) QRPs, naive RE performs better than all other methods. When can we expect no publication bias? If (and, in my opinion only if) we meta-analyze a set of registered reports.

But.

In any other setting except registered reports, a consequential amount of publication bias must be expected. In the field of psychology/psychiatry, more than 90% of all published hypothesis tests are significant (Fanelli, 2011) despite the average power being estimated as around 35% (Bakker, van Dijk, & Wicherts, 2012) – the gap points towards a huge publication bias. In the presence of publication bias, naive random effects meta-analysis and trim-and-fill have false positive rates approaching 100%:

More thoughts about trim-and-fill’s inability to recover δ=0 are in Joe Hilgard’s blog post. (Note: this insight is not really new and has been shown multiple times before, for example by Moreno et al., 2009, and Simonsohn, Nelson, and Simmons, 2014).

Our recommendation: Never trust meta-analyses based on naive random effects and trim-and-fill, unless you can rule out publication bias. Results from previously published meta-analyses based on these methods should be treated with a lot of skepticism.

Do not use p-curve and p-uniform for effect size estimation (under heterogeneity)

As a default, heterogeneity should always be expected – even under the most controlled conditions, where many labs perform the same computer-administered experiment, a large proportion of effects showed significant and substantial between-study heterogeneity (cf. ManyLabs 1 and 3; see also our supplementary document for more details). p-curve and p-uniform assume homogeneous effect sizes, and their performance is impacted to a large extent by heterogeneity:

As you can see, all other methods retain the nominal false positive rate, but p-curve and p-uniform go through the roof as soon as heterogeneity comes into play (see also McShane, Böckenholt, & Hansen, 2016; van Aert et al., 2016).

Under H1, heterogeneity leads to overestimation of the true effect:

(additional settings for these plots: no QRPs, no publication bias, k = 100 studies, true effect size = 0.5)

Note that in their presentation of p-curve, Simonsohn et al. (2014) emphasize that, in the presence of heterogeneity, p-curve is intended as an estimate of the average true effect size among the studies submitted to p-curve (see here, Supplement 2). p-curve may indeed yield an accurate estimate of the true effect size among the significant studies, but in our view, the goal of bias-correction in meta-analysis is to estimate the average effect of all conducted studies. Of course this latter estimation hinges on modeling assumptions (e.g., that the effects are normally distributed), which can be disputed, and there might be applications where indeed the underlying true effect of all significant studies is more interesting.

Furthermore, as McShane et al (2016) demonstrate, p-curve and p-uniform are constrained versions of the more general three-parameter selection model (3PSM; Iyengar & Greenhouse, 1988). The 3PSM estimates (a) the mean of the true effect, δ, (b) the heterogeneity, τ, and (c) the probability that a non-significant result enters the literature, p. The constraints of p-curve and p-uniform are: 100% publication bias (i.e., p = 0) and homogeneity (i.e., τ = 0). Hence, for the estimation of effect sizes, 3PSM seems to be a good replacement for p-curve and p-uniform, as it makes these constraints testable.

Our recommendation: Do not use p-curve or p-uniform for effect size estimation when heterogeneity can be expected (which is nearly always the case).

Ignore overadjustments in the opposite direction

Many bias-correcting methods are driven by QRPs – the more QRPs, the stronger the downward correction. However, this effect can get so strong, that methods overadjust into the opposite direction, even if all studies in the meta-analysis are of the same sign:

Note: You need to set the option “Keep negative estimates” to get this plot.

Our recommendation: Ignore bias-corrected results that go into the opposite direction; set the estimate to zero, do not reject H₀.

Do not take it seriously if PET-PEESE does a reverse correction

Typical small-study effects (e.g., by p-hacking or publication bias) induce a negative correlation between sample size and effect size – the smaller the sample, the larger the observed effect size. PET-PEESE aims to correct for that relationship. In the absence of bias and QRPs, however, random fluctuations can lead to a positive correlation between sample size and effect size, which leads to a PET and PEESE slope of the unintended sign. Without publication bias, this reversal of the slope actually happens quite often.

See for example the next figure. The true effect size is zero (red dot), naive random effects meta-analysis slightly overestimates the true effect (see black dotted triangle), but PET and PEESE massively overadjust towards more positive effects:

PET-PEESE was never intended to correct in the reverse direction. An underlying biasing process would have to systematically remove small studies that show a significant result with larger effect sizes, and keep small studies with non-significant results. In the current incentive structure, I see no reason for such a process.

Our recommendation: Ignore the PET-PEESE correction if it has the wrong sign.

PET-PEESE sometimes overestimates, sometimes underestimates

A bias can be more easily accepted if it is always conservative – then one could conclude: “This method might miss some true effects, but if it indicates an effect, we can be quite confident that it really exists”. Depending on the conditions (i.e., how much publication bias, how much QRPs, etc.), however, PET/PEESE sometimes shows huge overestimation and sometimes huge underestimation.

For example, with no publication bias, some heterogeneity (τ=0.2), and severe QRPs, PET/PEESE underestimates the true effect of δ = 0.5:

In contrast, if no effect exists in reality, but there is strong publication bias, large heterogeneity and no QRPs, PET/PEESE overestimates a lot:

In fact, the distribution of PET/PEESE estimates looks virtually identical for these two examples, although the underlying true effect is δ = 0.5 in the upper plot and δ = 0 in the lower plot. Furthermore, note the huge spread of PET/PEESE estimates (the error bars visualize the 95% quantiles of all simulated replications): Any single PET/PEESE estimate can be very far off.

Our recommendation: As one cannot know the condition of reality, it is safest not to use PET/PEESE at all.

Recommendations in a nutshell: What you should not use in a meta-analysis

Again, please consider the “limitations of generalizability” disclaimer above.

• When you can exclude publication bias (i.e., in the context of registered reports), do not use bias-correcting techniques. Even in the presence of some QRPs they perform worse than plain random effects meta-analysis.
• In any other setting except registered reports, expect publication bias, and do not use random effects meta-analysis or trim-and-fill. Both will give you a 100% false positive rate in typical settings, and a biased estimation.
• Under heterogeneity, p-curve and p-uniform overestimate the underlying effect and have false positive rates >= 50%
• Even if all studies entering a meta-analysis point into the same direction (e.g., all are positive), bias-correcting techniques sometimes overadjust and return a significant estimate of the opposite direction. Ignore these results, set the estimate to zero, do not reject H₀.
• Sometimes PET/PEESE adjust into the wrong direction (i.e., increasing the estimated true effect size)

As with any general recommendations, there might be good reasons to ignore them.

• In rare cases, the p-uniform package (v. 0.0.2) does not provide a lower CI. In such cases, ignore the estimate.
• Do not run p-curve or p-uniform on <=3 significant and directionally consistent studies. Although computationally possible, this gives hugely variable results, which are often very biased. See our supplemental material for more information and plots.
• If the 3PSM method (in the implementation of McShane et al., 2016) returns an incomplete covariance matrix, ignore the result (even if a point estimate is provided).

The post Correcting bias in meta-analyses: What not to do (meta-showdown Part 1) appeared first on Nicebread.

### Simple bash script for a fresh install of R and its dependencies in Linux

Thu, 06/01/2017 - 09:00

(This article was first published on R and Finance, and kindly contributed to R-bloggers)

I’ve been working with Linux for some time, but always in a dual-boot setup. While I used Linux Mint at home, my university computer always had Windows. Most of the time it was not a problem working in one or the other and sharing files between computers with Dropbox.

After getting constantly frustrated with Windows updates, crashes and slow-downs, I decided to make the full switch. All my computers are now Linux based. I am very happy with my choice and regret not making the switch earlier.

Since I formatted all three of my computers (home/laptop/work), I wrote a small bash file to automate the process of installing R and its dependencies. I use lots of R packages on a daily basis. For some of them, it is required to install dependencies using the terminal. Each time an install.packages() call failed, I saved the name of the required software and added it to the bash file. While my bash file will not cover all dependencies for all packages, it will suffice for a great proportion.

The steps for using it are:

1) Copy the contents of the code presented next into a file called InstallR.sh. Do notice that the name of the release, in this case trusty, is added to the CRAN address. Avoid running the bash file twice, or it will append the CRAN address more than once to /etc/apt/sources.list.

#!/bin/bash
# Adds R to apt and installs it
#
# Instructions:
# sudo chmod 700 InstallR.sh
# ./InstallR.sh

echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | sudo tee -a /etc/apt/sources.list
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install -y r-base r-base-dev r-cran-xml r-cran-rjava libcurl4-openssl-dev
sudo apt-get install -y libssl-dev libxml2-dev openjdk-7-* libgdal-dev libproj-dev libgsl-dev
sudo apt-get install -y xml2 default-jre default-jdk mesa-common-dev libglu1-mesa-dev freeglut3-dev
sudo apt-get install -y mesa-common-dev libx11-dev r-cran-rgl r-cran-rglpk r-cran-rsymphony r-cran-plyr
sudo apt-get install -y r-cran-reshape r-cran-reshape2 r-cran-rmysql
sudo R CMD javareconf

2) Change the permissions of the file so that it can be executed:

sudo chmod 700 InstallR.sh

3) Execute it as sudo

sudo ./InstallR.sh

That’s it. I hope others will find this simple bash script useful as well.

To leave a comment for the author, please follow the link and comment on their blog: R and Finance.

### A Partial Remedy to the Reproducibility Problem

Thu, 06/01/2017 - 06:48

Several years ago, John Ioannidis jolted the scientific establishment with an article titled, “Why Most Published Research Findings Are False.” He had concerns about inattention to statistical power, multiple inference issues and so on. Most people had already been aware of all this, of course, but that conversation opened the floodgates, and many more issues were brought up, such as hidden lab-to-lab variability. In addition, there is the occasional revelation of outright fraud.

Many consider the field to be at a crisis point.

In the 2014 JSM, Phil Stark organized a last-minute session on the issue, including Marcia McNutt, former editor of Science and Yoav Benjamini of multiple inference methodology fame. The session attracted a standing-room-only crowd.

In this post, Reed Davis and I are releasing the prototype of an R package that we are writing, revisit, with the goal of partially remedying the statistical and data wrangling aspects of this problem. It is assumed that the authors of a study have supplied (possibly via carrots or sticks) not only the data but also the complete code for their analyses, from data cleaning up through formal statistical analysis.

There are two main aspects:

• The package allows the user to “replay” the authors’ analysis, and most importantly, explore other alternate analyses that the authors may have overlooked. The various alternate analyses may be saved for sharing.
• Warn of statistical errors, such as: overreliance on p-values; need for multiple inference procedures; possible distortion due to outliers; etc.

The term user here could refer to several different situations:

• The various authors of a study, collaborating and trying different analyses during the course of the study.
• Reviewers of a paper submitted for publication on the results of the study.
• Fellow scientists who wish to delve further into the study after it is published.

The package has text and GUI versions. The latter is currently implemented as an RStudio add-in.

The package is on my GitHub site, and has a fairly extensive README file introducing the goals and usage.

### U.S. Residential Energy Use: Machine Learning on the RECS Dataset

Thu, 06/01/2017 - 02:25

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

Contributed by Thomas Kassel. He is currently enrolled in the NYC Data Science Academy remote bootcamp program taking place from January-May 2017. This post is based on his final capstone project, focusing on the use of machine learning techniques learned throughout the course.

Introduction

The residential sector accounts for up to 40% of annual U.S. electricity consumption, representing a large opportunity for energy efficiency and conservation. A strong understanding of the main electricity end-uses in residences can allow homeowners to make more informed decisions to lower their energy bills, help utilities maximize efficiency/incentive programs, and allow governments or NGOs to better forecast energy demand and address climate concerns.

The Residential Energy Consumption Survey, RECS, collects energy-related data on a nationally representative sample of U.S. homes. First conducted in 1978 and administered every 4-5 years by the Energy Information Administration, it is a leading data source for residential energy analysis. Methodology for the study is well-documented – a comprehensive overview of RECS can be found in the program’s technical documentation.

This project applied machine learning methods to the most recently available RECS dataset, published in 2009. The primary goals were twofold, as is common in many regression exercises:

• Description – identification of variable importance or features with the strongest statistical associations to annual residential electricity usage (kWh)
• Prediction – feature engineering and use of machine learning platform h2o to optimize predictive accuracy over annual electricity usage

EDA and Data Pre-Processing

RECS 2009 consists of 12,083 observations (each representing a unique housing unit) and over 900 features encompassing physical building characteristics, appliances, occupant behavior/usage patterns, occupant demographics, etc. These features serve as independent variables in predicting the outcome variable, annual electricity usage in kilowatt-hours (kWh). Because RECS aims to provide a comprehensive overview of many types of residences for a variety of analytical purposes (beyond energy use prediction), many of the features are sparsely populated, collinear, or uncorrelated with kWh usage. Therefore, a preliminary and recurring task throughout the project was dimension reduction.

Since most of the independent variables were collected through occupant interviews – as opposed to exact physical examination of the residence – the values of many are binned as factor/categorical variables. The dataset’s 931 original variables had the following characteristics:

Missingness was common in the raw dataset, with 73 features having NA values in more than 95% of observations. To quickly gauge whether these features correlated with the outcome variable kWh (and therefore should be retained and imputed), I made use of flexible EDA graphing functions from the GGally R package. An example is shown below on continuous variables, revealing generally low r-statistics in correlation to kWh and non-significant p-values (not shown). GGally also accommodates similar exploratory analysis on factor variables via pairwise box plots. Overall, although a more complex imputation strategy could have been employed to address these 73 features, they were dropped due to their high missingness and little evidence of predictive power over the outcome variable.
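A minimal sketch of that screening step, on a synthetic stand-in for the data (the column names are hypothetical; the post used GGally::ggpairs, while this sketch checks correlations with base cor.test to stay package-free):

```r
# Toy stand-in for the survey data (hypothetical column names)
set.seed(1)
n <- 1000
recs <- data.frame(
  KWH       = rnorm(n, 11000, 3000),                 # outcome: annual kWh
  SPARSEVAR = ifelse(runif(n) < 0.97, NA, rnorm(n))  # ~97% missing
)

# Flag features that are NA in more than 95% of observations
na_frac <- colMeans(is.na(recs))
sparse  <- setdiff(names(na_frac)[na_frac > 0.95], "KWH")

# Check each sparse feature's correlation with kWh on the rows where it
# is observed; low r and a non-significant p support dropping it
for (v in sparse) {
  ct <- cor.test(recs[[v]], recs$KWH)
  cat(v, "r =", round(ct$estimate, 3), "p =", round(ct$p.value, 3), "\n")
}

# Drop the sparse, uninformative features rather than impute them
recs <- recs[, setdiff(names(recs), sparse), drop = FALSE]
```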

Feature Engineering/Reduction

Factor Encoding

As previously mentioned, the majority of the dataset features were factor variables, many of which needed recoding in order to correctly capture the variation being described. As illustrated below, nominal factor variables such as WALLTYPE have no intrinsic ordering; it would not make sense to order “stucco” higher than “wood”, for example. On the other hand, ordinal factors such as AGERFRI1 (age of most-used refrigerator) have a clear order to their factor levels; each level denotes a numerically higher value than the previous. Information contained in variables like AGERFRI1 is often more appropriately formatted in integer/numeric form, which can better convey their continuous range of values, aid interpretation of the variable coefficient, and reduce standard error by decreasing the number of coefficients for model estimation (many algorithms generate a separate model coefficient for each factor level via one-hot-encoding).

In the image above, WALLTYPE was left as a factor while AGERFRI1 was recoded as an integer value (using the mid range of each bin as number of years). This factor-to-integer recoding was applied to additional binned variables relating to age of appliances and frequency of use of those appliances.
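A sketch of that ordinal-to-integer recoding, with made-up bin labels (the actual RECS codes for AGERFRI1 differ):

```r
# Hypothetical bin labels for AGERFRI1 (age of most-used refrigerator);
# each bin is mapped to its midpoint in years
age_bins <- factor(c("Less than 2 years", "2 to 4 years", "5 to 9 years",
                     "10 to 14 years", "15 to 19 years"))

midpoints <- c("Less than 2 years" = 1, "2 to 4 years"   = 3,
               "5 to 9 years"      = 7, "10 to 14 years" = 12,
               "15 to 19 years"    = 17)

# Ordinal factor -> integer: the model now estimates one coefficient
# instead of one per factor level
AGERFRI1_num <- as.integer(midpoints[as.character(age_bins)])
AGERFRI1_num
```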

New factor variables were also created by consolidating information from existing factors into one-hot-encoded binary values such as presence of heated pool, use of electricity for space heating, etc.

Feature Reduction

We can visualize multicollinearity among the integer and numeric features using correlation plots, again made with R package GGally. In the plot below, a conceptual group of collinear variables stands out, i.e. those directly or indirectly representing a residence’s size, for example, total square footage (TOTSQFT), cooling square footage (TOTCSQFT), number of bathrooms (NCOMBATH), number of refrigerators (NUMFRIG), etc.

To confirm the degree of multicollinearity, variance inflation factors were calculated based on a linear model with the above numeric features – those with high VIF scores (loosely following the rule of thumb VIF > 5.0) were dropped.
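A sketch of the VIF screen on toy data, computing VIF by hand as 1/(1 - R^2) of each feature regressed on the rest so no extra package is needed (the variable relationships here are invented):

```r
set.seed(2)
n <- 500
TOTSQFT  <- rnorm(n, 2000, 600)
TOTCSQFT <- 0.8 * TOTSQFT + rnorm(n, 0, 50)  # nearly collinear with TOTSQFT
NUMFRIG  <- rpois(n, 1) + 1

X <- data.frame(TOTSQFT, TOTCSQFT, NUMFRIG)

# VIF of feature v = 1 / (1 - R^2) from regressing v on the other features
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 1)

# Drop features exceeding the rule-of-thumb threshold VIF > 5
keep <- names(vif)[vif <= 5]
```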

Despite variable recoding and de-correlation using the methods previously described, a large number of features remained in the dataset, the majority of them being factor variables. Although principal component analysis (PCA) is a powerful method for dimension reduction, it does not generalize easily to categorical data. Therefore, feature reduction was continued instead through the use of an exploratory LASSO regression model, utilizing the L1 regularization penalty to drive non-significant variable coefficients in the model’s output to 0. This was helpful in identifying and dropping features with very little predictive power over kWh usage, particularly factor variables that were not covered in the multicollinearity exercise above.
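A hedged sketch of that LASSO screening step, here using glmnet (the post does not name its implementation) on synthetic data where only 3 of 10 features are truly predictive:

```r
library(glmnet)
set.seed(3)
n <- 400; p <- 10
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
y <- 2 * X[, 1] - 1.5 * X[, 2] + X[, 3] + rnorm(n)

# alpha = 1 is the pure L1 (LASSO) penalty; cv.glmnet picks lambda
cv    <- cv.glmnet(X, y, alpha = 1)
coefs <- coef(cv, s = "lambda.1se")

# Features whose coefficients were shrunk exactly to 0 can be dropped
dropped <- rownames(coefs)[-1][as.vector(coefs[-1] == 0)]
kept    <- setdiff(colnames(X), dropped)
```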

Modeling With H2o

The processed and cleaned dataset was migrated to open-source, online machine learning platform h2o, which offers highly efficient and scalable modeling methods. In contrast to machine learning conducted in slower native R packages (caret, glm) in the local R environment, R package h2o facilitates API calls to h2o’s online platform, sending the given dataset to be distributed and parallel-processed among multiple clusters. H2o offers an array of the most common machine learning algorithms (glm, kNN, random forests, gbm, deep learning) at very impressive run times – lessening the large burden of computational speed in the model fitting and tuning process. It also provides handy, built-in functions for pre-processing such as training/validation/test set splitting, in this case chosen to be 70/20/10.

View the code on Gist.
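The actual code is in the linked gist; as a rough stand-in for h2o.splitFrame's 70/20/10 split, here is the same idea in base R on a toy data frame:

```r
set.seed(4)
df <- data.frame(kwh = rnorm(1000), x = rnorm(1000))

# Assign each row to train/validation/test with 70/20/10 probabilities
idx   <- sample(c("train", "valid", "test"), nrow(df),
                replace = TRUE, prob = c(0.7, 0.2, 0.1))
train <- df[idx == "train", ]
valid <- df[idx == "valid", ]
test  <- df[idx == "test",  ]

# Realized split proportions
round(table(idx) / nrow(df), 2)
```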

Multivariable Regression

To best understand variable importance and interpret linear effects of the features on kWh usage, a simple multivariable regression model was fit using h2o's glm function with default parameters (no regularization). A variable importance plot (see below) ranks the top 15 features by z-statistic, all of which are quite large – an observed z-score of 15 translates to a p-value of 9.63e-54, or virtually zero. Hence all 15 variables show extremely significant statistical evidence of a relationship to kWh, with an additional ~100 of the features (not shown) also meeting significance at the p = 0.05 threshold. Variables beginning with "hasElec" were feature engineered (previously described), validating the approach of using domain knowledge to add or recode valuable information in new features. Coefficient estimates are denoted on the bar for each feature. We can observe that the presence of electric space heating at a residence (one-hot-encoded as a factor variable) increases yearly electricity usage at the residence by about 2,722 kWh; each additional square foot of cooled area (TOTCSQFT, numeric) adds about 1.146 kWh/year.

ElasticNet Regression with Grid Search

While default h2o glm provided interpretability and p-values, adding regularization to the linear model enabled an increase in predictive accuracy. H2o allows flexible grid search methods to tune the main glm hyperparameters alpha (type of regularization) and lambda (degree of coefficient shrinkage). Because grid search can quickly become computationally expensive, I utilized h2o’s powerful early-stopping and random search functionality to ensure that computation time is not wasted for very small, marginal gains in validation accuracy.

View the code on Gist.
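The post tunes alpha and lambda with h2o.grid (see the linked gist); as a package-light stand-in, the same search over alpha, using glmnet's built-in lambda path and scoring each candidate by validation RMSE, can be sketched as follows (glmnet assumed installed, data synthetic):

```r
library(glmnet)
set.seed(5)
n <- 500; p <- 8
X    <- matrix(rnorm(n * p), n, p)
beta <- rnorm(p)
y    <- as.vector(X %*% beta + rnorm(n))

tr <- 1:350; va <- 351:500                 # simple train/validation split
alphas <- seq(0, 1, by = 0.25)             # 0 = ridge ... 1 = LASSO

# For each alpha, fit the whole lambda path and keep the best
# validation RMSE over all lambdas
scores <- sapply(alphas, function(a) {
  fit  <- glmnet(X[tr, ], y[tr], alpha = a)
  pred <- predict(fit, X[va, ])            # one column per lambda
  min(sqrt(colMeans((pred - y[va])^2)))
})
best_alpha <- alphas[which.min(scores)]
```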

A primary purpose of grid search is to choose optimal tuning parameters according to a validation metric –  in this case root-mean-squared-error (RMSE) of kWh prediction on the validation set – and then train a new model using the optimal values discovered on the full dataset (no train/validation split) to maximize sample size. In 5 minutes of training, h2o fit all ~440 model combinations specified in the grid search, with the best RMSE model having alpha = 1.0 (LASSO regression) and lambda = 1.0 (small amount of regularization). These parameters were then used to fit a final model on 90% of the data (combining 70% train and 20% validation sets) and performance was evaluated on the 10% holdout test data, which had not yet been seen by the model.

To increase predictive accuracy over generalized linear model-based approaches, I used h2o's implementation of gradient boosted machines. Boosting, a nonlinear tree-based approach, sequentially fits many shallow decision trees, each one correcting the residual errors of the ensemble so far, to slowly reduce overall predictive error (in the regression setting, often RMSE). Grid search was performed using the same methods as above and with the following tuning parameters:

• learn_rate – how much error should each tree “remember” from the previous
• max_depth – max tree depth (rule of thumb – sqrt of # of features)
• sample_rate – percentage of rows/observations to be chosen to fit a given tree
• col_sample_rate – percentage of columns to be chosen to fit a given tree
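
To make the four parameters above concrete, an illustrative h2o.grid hyperparameter list might look like this (the values are made up, not the post's actual grid):

```r
# Candidate values for the GBM grid search (illustrative only)
gbm_params <- list(
  learn_rate      = c(0.01, 0.05, 0.1),  # per-tree error "memory"
  max_depth       = c(3, 5, 7),          # maximum tree depth
  sample_rate     = c(0.7, 0.9, 1.0),    # row subsample per tree
  col_sample_rate = c(0.5, 0.8, 1.0)     # column subsample per tree
)
# grid <- h2o.grid("gbm", hyper_params = gbm_params, ...)
```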

Following the same process as with GLM, parameter values were then chosen from the trained model with the best validation set RMSE. The model was re-trained on the full 90% of training data and tested on the final 10% holdout split.

Results from the two modeling strategies are summarized in the table below.

Additionally, predicted vs actual kWh values for either modeling strategy are plotted against the most significant numeric feature, CDD30YR (30-year average of cooling degree-days in the residence’s region). Both model types demonstrate reasonable ability to predict kWh over a range of annual cooling loads/climate regions.

Conclusions

Final predictive accuracy was similar for both linear-based (GLM) and tree-based (GBM) models on the RECS dataset. The two models’ RMSE values represent a large improvement over an RMSE of 7,560, which was established as a baseline accuracy by initially fitting a default GLM on the raw dataset (before any feature engineering or reduction). This improvement validates the approaches used – using domain knowledge to feature engineer highly significant variables and grid search for hyperparameter tuning.

Aside from the regression setting, future analysis could be applied to RECS in a classification context. Since the majority of the dataset’s features are factor variables, energy providers could use the rich training set to attempt to answer important energy-related classification questions about their customers – type of space heating, presence of major appliances, or whether the home is a good candidate for solar installation. Greater predictive ability over residential electricity use will contribute over time to a smarter, cleaner, and more efficient power grid.

The post U.S. Residential Energy Use: Machine Learning on the RECS Dataset appeared first on NYC Data Science Academy Blog.

### Complete Subset Regressions, simple and powerful

Thu, 06/01/2017 - 02:05

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

The complete subset regressions (CSR) is a forecasting method proposed by Elliott, Gargano and Timmermann in 2013. It is a very simple but powerful technique. Suppose you have a set of variables and you want to forecast one of them using information from the others. If your variables are highly correlated and the variable you want to predict is noisy, you will have collinearity problems and in-sample overfitting because the model will try to fit the noise.

These problems may be solved if you estimate a smaller model using only a subset of the explanatory variables; however, if you do not know which variables are important, you may lose information. What if we estimate models from many different subsets and combine their forecasts? Even better, what if we estimate models for all possible combinations of variables? This would be a good solution; however, with only 20 variables the number of regressions would already be more than 1 million. The CSR is a compromise between using only one subset and all possible subsets. Instead of estimating all possible combinations of models, we fix a number k and estimate all possible models with k variables. Then we combine the forecasts from all these models to get our final result. Naturally, k must be smaller than the number of variables you have. Let us review the steps:

• Suppose you have K explanatory variables. Estimate all possible models with k variables each,
• Compute the forecasts from all the models,
• Compute the average of all the forecasts to obtain the final result.
Application

We are going to generate data from a linear model where the explanatory variables, X, are drawn from a multivariate normal distribution. The coefficients are generated from a normal distribution and multiplied by a parameter (alpha = 0.1 in the code) to ensure the dependent variable has a significant random component. The number of variables K is 10 and the CSR is estimated with k = 4.

set.seed(1) # = Seed for replication = #
K = 10      # = Number of Variables = #
T = 300     # = Number of Observations = #

# = Generate covariance matrix = #
library(mvtnorm)
D = diag(0.1, K)
P = matrix(rnorm(K * K), K, K)
sigma = t(P) %*% D %*% P

alpha = 0.1
beta = rnorm(K) * alpha       # = coefficients = #
X = rmvnorm(T, sigma = sigma) # = Explanatory Variables = #
u = rnorm(T)                  # = Error = #
y = X %*% beta + u            # = Generate y = #

# = Break data into in-sample and out-of-sample = #
y.in = y[1:200]
y.out = y[-c(1:200)]
X.in = X[1:200, ]
X.out = X[-c(1:200), ]

# = Estimate model by OLS to compare = #
model = lm(y.in ~ X.in)
pred = cbind(1, X.out) %*% coef(model)

# = CSR = #
k = 4
models = combn(K, k)                # = Gives all combinations with 4 variables = #
csr.fitted = rep(0, length(y.in))   # = Store in-sample fitted values = #
csr.pred = rep(0, length(y.out))    # = Store forecast = #
for (i in 1:ncol(models)) {
  m = lm(y.in ~ X.in[, models[, i]])
  csr.fitted = csr.fitted + fitted(m)
  csr.pred = csr.pred + cbind(1, X.out[, models[, i]]) %*% coef(m)
}

R2ols = 1 - var(y.in - fitted(model)) / var(y.in)            # = R2 OLS = #
R2csr = 1 - var(y.in - csr.fitted / ncol(models)) / var(y.in) # = R2 CSR = #
c(R2ols, R2csr)

## [1] 0.1815733 0.1461342

# = In-sample fit = #
plot(y.in, type = "l")
lines(fitted(model), col = 2)
lines(csr.fitted / ncol(models), col = 4)

# = Out-of-sample fit = #
plot(y.out, type = "l")
lines(pred, col = 2)
lines(csr.pred / ncol(models), col = 4)

# = MAPE = #
MAPEols = mean(abs(y.out - pred))
MAPEcsr = mean(abs(y.out - csr.pred / ncol(models)))
c(MAPEols, MAPEcsr)

## [1] 0.8820019 0.8446682

The main conclusion from the results is that the CSR gives up some in-sample performance to improve the forecasts. In fact, the CSR is very robust to overfitting and you should definitely add it to your collection to use when you believe that you are having this type of problem. Most modern forecasting techniques have the same idea of accepting some bias in-sample to have a more parsimonious or stable model out-of-sample.

This is the simplest implementation possible for the CSR. In some cases you may have fixed controls you want to include in all regressions, such as lags of the dependent variable. The computational cost of the CSR increases quickly with the number of variables. If you are in a high-dimensional framework you may need to do some type of pre-selection of the variables to reduce the problem's size.
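A minimal sketch of the fixed-controls extension: force a fixed control column into every subset regression and take combinations over the remaining columns only (the data and column choice here are hypothetical):

```r
set.seed(6)
T <- 100; K <- 6; k <- 2
X <- matrix(rnorm(T * K), T, K)
y <- rowSums(X[, 1:3]) + rnorm(T)

fixed <- 1                    # column(s) included in every regression
free  <- setdiff(1:K, fixed)
models <- combn(free, k)      # subsets are taken over the free columns only

fitted_sum <- rep(0, T)
for (i in 1:ncol(models)) {
  # each regression uses the fixed control plus one k-subset
  m <- lm(y ~ X[, c(fixed, models[, i])])
  fitted_sum <- fitted_sum + fitted(m)
}
csr_fitted <- fitted_sum / ncol(models)  # average over all subset models
```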

References

Elliott, Graham, Antonio Gargano, and Allan Timmermann. “Complete subset regressions.” Journal of Econometrics 177.2 (2013): 357-373.


### Euler Problem 23: Non-Abundant Sums

Thu, 06/01/2017 - 02:00

(This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers)

A demonstration of the abundance of the number 12 using Cuisenaire rods (Wikipedia).

Euler problem 23 asks us to solve a problem with abundant or excessive numbers.

These are numbers for which the sum of its proper divisors is greater than the number itself.

12 is an abundant number because the sum of its proper divisors (the aliquot sum) is larger than 12: (1 + 2 + 3 + 4 + 6 = 16).

All highly composite numbers or anti-primes greater than six are abundant numbers. These are numbers that have so many divisors that they are considered the opposite of primes, as explained in the Numberphile video below.

Euler Problem 23 Definition

A perfect number is a number for which the sum of its proper divisors is exactly equal to the number. For example, the sum of the proper divisors of 28 would be 1 + 2 + 4 + 7 + 14 = 28, which means that 28 is a perfect number.

A number n is called deficient if the sum of its proper divisors is less than n and it is called abundant if this sum exceeds n.

As 12 is the smallest abundant number, 1 + 2 + 3 + 4 + 6 = 16, the smallest number that can be written as the sum of two abundant numbers is 24. By mathematical analysis, it can be shown that all integers greater than 28123 can be written as the sum of two abundant numbers. However, this upper limit cannot be reduced any further by analysis, even though it is known that the greatest number that cannot be expressed as the sum of two abundant numbers is less than this limit.

Find the sum of all the positive integers which cannot be written as the sum of two abundant numbers.

Solution

This solution repurposes the divisors function, introduced for Euler Problem 21, that determines the proper divisors of a number. The first code snippet creates the sequence of all abundant numbers up to 28123 (sequence A005101 in the OEIS). An abundant number is one whose aliquot sum is larger than the number itself.

# Generate abundant numbers (OEIS A005101)
A005101 <- function(x) {
  abundant <- vector()
  a <- 1
  for (n in 1:x) {
    aliquot.sum <- sum(proper.divisors(n)) - n
    if (aliquot.sum > n) {
      abundant[a] <- n
      a <- a + 1
    }
  }
  return(abundant)
}
abundant <- A005101(28123)

The solution to this problem is also a sequence in the Online Encyclopedia of Integer Sequences (OEIS A048242). This page states that the highest number in this sequence is 20161, not 28123 as stated in the problem definition.

The second section of code creates a list of all candidate numbers that might not be the sum of two abundant numbers. The next bit of code sieves out every number that is the sum of two abundant numbers. The answer is the sum of the numbers remaining in the sequence.

# Create a list of potential numbers that are not the sum of two abundant numbers
A048242 <- 1:20161

# Remove any number that is the sum of two abundant numbers
for (i in 1:length(abundant)) {
  for (j in i:length(abundant)) {
    if (abundant[i] + abundant[j] <= 20161) {
      A048242[abundant[i] + abundant[j]] <- NA
    }
  }
}
A048242 <- A048242[!is.na(A048242)]
answer <- sum(A048242)
print(answer)

The post Euler Problem 23: Non-Abundant Sums appeared first on The Devil is in the Data.
