R-bloggers: R news and tutorials contributed by 750 R bloggers

Preview of EARL San Francisco

Mon, 05/22/2017 - 23:32

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The first ever EARL (Enterprise Applications of the R Language) conference in San Francisco will take place on June 5-7 (and it's not too late to register). The EARL conference series is now in its fourth year, and the prior conferences in London and Boston have each been a fantastic way to learn how R is used in real-world applications. Judging from the speaker lineup in San Francisco, next month's event looks like it will feature some great R stories as well. Included on the program will be stories from the following companies:

  • AirBnB (Keynote presentation by Ricardo Bion)
  • StitchFix (Keynote presentation by Hilary Parker)
  • Uber (Keynote presentation by Prakhar Mehrota)
  • Hitachi Solutions (David Bishop)
  • Amgen (Daniel Saldana)
  • Pfizer (Luke Fostveldt)
  • Fred Hutch Cancer Research Center (Thomas Vaughan)
  • Domino Data Lab (Eduardo  Ariño de la Rubia)
  • Genentech (Gabriel Becker)

… and many more. On a personal note, I'm looking forward to presenting with my colleague Bharath Sankaranarayan on using Microsoft R to predict hospital stays.

For more information on EARL 2017 San Francisco or to register, follow the link below. (And if you can't make San Francisco, check out EARL 2017 London, coming September 12-14.)

EARL: San Francisco 2017

 

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

The Marcos Lopez de Prado Hierarchical Risk Parity Algorithm

Mon, 05/22/2017 - 20:21

(This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers)

This post will be about replicating the Marcos Lopez de Prado algorithm from his paper, "Building Diversified Portfolios that Outperform Out-of-Sample". This algorithm attempts to strike a tradeoff between the classic mean-variance optimization algorithm, which takes the covariance structure into account but is unstable, and an inverse-volatility algorithm, which ignores covariance but is more stable.

This is a paper that I struggled with until I ran the code in Python (I have Anaconda installed but have trouble installing some packages such as Keras because I'm on Windows…would love to have someone walk me through setting up a Linux dual-boot), as I assumed that the clustering algorithm was actually able to concretely group every asset into a particular cluster (i.e. ETF 1 would be in cluster 1, ETF 2 in cluster 3, etc.). Turns out, that isn't at all the case.

Here’s how the algorithm actually works.

First off, it computes a covariance and correlation matrix (created from simulated data in Marcos's paper). Next, it runs a hierarchical clustering algorithm on a distance-transformed correlation matrix, using the "single" linkage method (i.e. friend of friends; run ?hclust in R to read up more on this). The key output here is the order of the assets produced by the clustering algorithm. Note well: this order is the only relevant artifact of the entire clustering step.
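For reference, the paper maps correlations to distances via d = sqrt((1 - rho)/2) before clustering; below is a minimal sketch of that step, assuming corMat holds the correlation matrix (the implementation later in this post simply calls dist() on the correlation matrix itself, which still reproduces the paper's weights for this data set).

# Correlation-to-distance mapping from the paper: d_ij = sqrt((1 - rho_ij) / 2)
distMat <- sqrt(0.5 * (1 - as.matrix(corMat)))
clustOrderPaper <- hclust(as.dist(distMat), method = "single")$order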

Using this order, it then uses an algorithm that does the following:

Initialize a vector of weights equal to 1 for each asset.

Then, run the following recursive algorithm:

1) Break the order vector into two lists of equal length (or as close to equal length as possible).

2) For each half of the list, compute the inverse-variance weights from the diagonal of the covariance-matrix slice containing the assets of interest, and then compute the variance of that cluster under those weights (i.e. w' Σ w, where Σ is the covariance slice).

3) Then, do a basic inverse-variance weighting across the two clusters. Call the weight of cluster 0 alpha = 1 - cluster_variance_0/(cluster_variance_0 + cluster_variance_1), and the weight of cluster 1 its complement, 1 - alpha.

4) Multiply the weights of all assets belonging to cluster 0 by the cluster-0 weight, and the weights of all assets belonging to cluster 1 by the cluster-1 weight. That is, weights[index_assets_cluster_0] *= alpha and weights[index_assets_cluster_1] *= 1 - alpha.

5) Lastly, repeat this entire process recursively on any half that contains more than one asset, until every asset sits in its own cluster.

Here is the implementation in R code.

First off, the correlation matrix and the covariance matrix for use in this code, obtained from Marcos Lopez De Prado’s code in the appendix in his paper.

> covMat
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 1.000647799 -0.003050479 0.010033224 -0.010759689 -0.005036503 0.008762563 0.998201625 -0.001393196 -0.001254522 -0.009365991
2 -0.003050479 1.009021349 0.008613817 0.007334478 -0.009492688 0.013031817 -0.009420720 -0.015346223 1.010520047 1.013334849
3 0.010033224 0.008613817 1.000739363 -0.000637885 0.001783293 1.001574768 0.006385368 0.001922316 0.012902050 0.007997935
4 -0.010759689 0.007334478 -0.000637885 1.011854725 0.005759976 0.000905812 -0.011912269 0.000461894 0.012572661 0.009621670
5 -0.005036503 -0.009492688 0.001783293 0.005759976 1.005835878 0.005606343 -0.009643250 1.008567427 -0.006183035 -0.007942770
6 0.008762563 0.013031817 1.001574768 0.000905812 0.005606343 1.064309825 0.004413960 0.005780148 0.017185396 0.011601336
7 0.998201625 -0.009420720 0.006385368 -0.011912269 -0.009643250 0.004413960 1.058172027 -0.006755374 -0.008099181 -0.016240271
8 -0.001393196 -0.015346223 0.001922316 0.000461894 1.008567427 0.005780148 -0.006755374 1.074833155 -0.011903469 -0.013738378
9 -0.001254522 1.010520047 0.012902050 0.012572661 -0.006183035 0.017185396 -0.008099181 -0.011903469 1.075346677 1.015220126
10 -0.009365991 1.013334849 0.007997935 0.009621670 -0.007942770 0.011601336 -0.016240271 -0.013738378 1.015220126 1.078586686

> corMat
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 1.000000000 -0.003035829 0.010026270 -0.010693011 -0.005020245 0.008490954 0.970062043 -0.001343386 -0.001209382 -0.009015412
2 -0.003035829 1.000000000 0.008572055 0.007258718 -0.009422702 0.012575370 -0.009117080 -0.014736040 0.970108941 0.971348946
3 0.010026270 0.008572055 1.000000000 -0.000633903 0.001777455 0.970485047 0.006205079 0.001853505 0.012437239 0.007698212
4 -0.010693011 0.007258718 -0.000633903 1.000000000 0.005709500 0.000872861 -0.011512172 0.000442908 0.012052964 0.009210090
5 -0.005020245 -0.009422702 0.001777455 0.005709500 1.000000000 0.005418538 -0.009347204 0.969998023 -0.005945165 -0.007625721
6 0.008490954 0.012575370 0.970485047 0.000872861 0.005418538 1.000000000 0.004159261 0.005404237 0.016063910 0.010827955
7 0.970062043 -0.009117080 0.006205079 -0.011512172 -0.009347204 0.004159261 1.000000000 -0.006334331 -0.007592568 -0.015201540
8 -0.001343386 -0.014736040 0.001853505 0.000442908 0.969998023 0.005404237 -0.006334331 1.000000000 -0.011072068 -0.012759610
9 -0.001209382 0.970108941 0.012437239 0.012052964 -0.005945165 0.016063910 -0.007592568 -0.011072068 1.000000000 0.942667300
10 -0.009015412 0.971348946 0.007698212 0.009210090 -0.007625721 0.010827955 -0.015201540 -0.012759610 0.942667300 1.000000000

Now, for the implementation.

This reads in the two matrices above and gets the clustering order.

covMat <- read.csv('cov.csv', header = FALSE)
corMat <- read.csv('corMat.csv', header = FALSE)

clustOrder <- hclust(dist(corMat), method = 'single')$order

This is the clustering order:

> clustOrder [1] 9 2 10 1 7 3 6 4 5 8

Next, the getIVP (get inverse-variance portfolio) and getClusterVar functions (note: I'm trying to keep the naming conventions identical to those in Dr. Lopez de Prado's paper).

getIVP <- function(covMat) {
  invDiag <- 1/diag(as.matrix(covMat))
  weights <- invDiag/sum(invDiag)
  return(weights)
}

getClusterVar <- function(covMat, cItems) {
  covMatSlice <- covMat[cItems, cItems]
  weights <- getIVP(covMatSlice)
  cVar <- t(weights) %*% as.matrix(covMatSlice) %*% weights
  return(cVar)
}

Next, my code diverges from the code in the paper, because I do not use the list comprehension structure, but instead opt for a recursive algorithm, as I find that style to be more readable.

One wrinkle to note is the use of the double-arrow assignment operator (<<-), which assigns to a variable outside the scope of the recurFun function. I assign the initial weights vector w in the global environment and update it from within the recurFun function. I am aware that it is a faux pas to create variables in the global environment, but my attempts at creating a temporary environment in which to update the weight vector did not produce the updating mechanism I had hoped for, so a little bit of assistance with refactoring this code would be appreciated.

getRecBipart <- function(covMat, sortIx) {
  # keeping track of w in the global environment
  assign("w", value = rep(1, ncol(covMat)), envir = .GlobalEnv)
  recurFun(covMat, sortIx)
  return(w)
}

recurFun <- function(covMat, sortIx) {
  subIdx <- 1:trunc(length(sortIx)/2)
  cItems0 <- sortIx[subIdx]
  cItems1 <- sortIx[-subIdx]
  cVar0 <- getClusterVar(covMat, cItems0)
  cVar1 <- getClusterVar(covMat, cItems1)
  # as.numeric drops the 1 x 1 matrix dimension returned by getClusterVar
  alpha <- as.numeric(1 - cVar0/(cVar0 + cVar1))

  # scoping mechanics using w as a free parameter
  w[cItems0] <<- w[cItems0] * alpha
  w[cItems1] <<- w[cItems1] * (1 - alpha)

  if(length(cItems0) > 1) {
    recurFun(covMat, cItems0)
  }
  if(length(cItems1) > 1) {
    recurFun(covMat, cItems1)
  }
}
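As an aside on the refactoring question above: one way to avoid writing into the global environment is to define the recursive helper inside the wrapper, so that <<- resolves to the wrapper's own environment rather than .GlobalEnv. The sketch below (getRecBipart2 is a hypothetical name) reuses getClusterVar and is meant only to illustrate the scoping idea, not to replace the code above.

getRecBipart2 <- function(covMat, sortIx) {
  w <- rep(1, ncol(covMat))  # lives in this function's environment, not .GlobalEnv
  recurFun <- function(cItems) {
    subIdx <- 1:trunc(length(cItems)/2)
    cItems0 <- cItems[subIdx]
    cItems1 <- cItems[-subIdx]
    cVar0 <- getClusterVar(covMat, cItems0)
    cVar1 <- getClusterVar(covMat, cItems1)
    alpha <- as.numeric(1 - cVar0/(cVar0 + cVar1))
    # <<- finds w in the enclosing getRecBipart2 environment
    w[cItems0] <<- w[cItems0] * alpha
    w[cItems1] <<- w[cItems1] * (1 - alpha)
    if (length(cItems0) > 1) recurFun(cItems0)
    if (length(cItems1) > 1) recurFun(cItems1)
  }
  recurFun(sortIx)
  return(w)
}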

Lastly, let’s run the function.

out <- getRecBipart(covMat, clustOrder)

With the result (which matches the paper):

> out [1] 0.06999366 0.07592151 0.10838948 0.19029104 0.09719887 0.10191545 0.06618868 0.09095933 0.07123881 0.12790318

So, hopefully this democratizes the use of this technology in R. While I have seen a raw Rcpp implementation and one from the Systematic Investor Toolbox, neither of those implementations satisfied me from a “plug and play” perspective. This implementation solves that issue. Anyone here can copy and paste these functions into their environment and immediately make use of one of the algorithms devised by one of the top minds in quantitative finance.

A demonstration in a backtest using this methodology will be forthcoming.

Thanks for reading.

NOTE: I am always interested in networking and full-time opportunities which may benefit from my skills. Furthermore, I am also interested in project work in the volatility ETF trading space. My linkedin profile can be found here.

To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.

Instrumental Variables in R exercises (Part-2)

Mon, 05/22/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

This is the second part of the series on Instrumental Variables. For other parts of the series follow the tag instrumental variables.

In this exercise set we will build on the example from part-1.
We will now consider an over-identified case i.e. we have multiple IVs for an endogenous variable. We will also look at tests for endogeneity and over-identifying restrictions.

Answers to the exercises are available here.

Exercise 1
Load the AER package (package description: here). Next, load PSID1976 dataset provided with the AER package. This has data regarding labor force participation of married women.
Define a new data-frame that has data for all married women that were employed. This data-frame will be used for the following exercises.
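As a hedged sketch of the setup (the column names participation and wage below follow the AER documentation for PSID1976; adjust if your version differs):

library(AER)
data("PSID1976", package = "AER")
# keep only the married women who were working (i.e. in the labor force)
working <- subset(PSID1976, participation == "yes")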

Exercise 2
We will use a more elaborate model as compared to the previous set.
Regress log(wage) on education, experience and experience^2.

Exercise 3
Previously, we identified feducation and meducation as possible IVs for education.
Regress education on experience, experience^2, feducation and meducation. Comment on your results.

Exercise 4
Test the hypothesis that feducation and meducation are jointly equal to zero.

Exercise 5
Use 2SLS to estimate the IV based return to education.

Exercise 6
Use the ivreg command to get the same results as in Exercise-5. Note that standard errors would be different.

Exercise 7
There is a short form for the ivreg command which saves time when we are dealing with numerous variables.
Try the short format and check that you get the same results as in Exercise-6.
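For reference, a hedged sketch of both forms, assuming the working data frame from Exercise 1 (the second call uses ivreg's "." shorthand to reuse the regressors from the first part of the formula):

# Long form: all exogenous regressors repeated after '|', education replaced by its IVs
iv_long  <- ivreg(log(wage) ~ education + experience + I(experience^2) |
                    experience + I(experience^2) + feducation + meducation,
                  data = working)
# Short form: start from the regressors, drop education, add the IVs
iv_short <- ivreg(log(wage) ~ education + experience + I(experience^2) |
                    . - education + feducation + meducation,
                  data = working)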

Exercise 8
Regress log(wage) on education, experience, experience^2 and residuals from the model estimated in Exercise-3.
Use your result to test for the endogeneity of education.

Exercise 9
Regress the residuals from exercise-7 on experience, experience^2, feducation and meducation.
Use your result to test for over-identifying restrictions.

Exercise 10
The two tests we did in exercises 8 and 9 can be conveniently obtained using the summary command with diagnostics turned on. Verify that you get the same test results with the summary command.
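A sketch of that call, using the hypothetical iv_short object from the earlier sketch:

# Reports weak-instrument, Wu-Hausman (endogeneity) and Sargan (over-identification) tests
summary(iv_short, diagnostics = TRUE)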

Related exercise sets:
  1. Instrumental Variables in R exercises (Part-1)
  2. Bioinformatics Tutorial with Exercises in R (part 1)
  3. Let’s get started with dplyr
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

The End of the Honeymoon: Falling Out of Love with quantstrat

Mon, 05/22/2017 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

I spent good chunks of Friday, Saturday, and Sunday attempting to write another blog post on using R and the quantstrat package for backtesting, and all I have to show for my work is frustration. So I’ve started to fall out of love with quantstrat and am thinking of exploring Python backtesting libraries from now on.

Here’s my story…

Cross-Validation

Let’s suppose you’ve read my past blog posts on using R for analyzing stock data, in particular the most recent blog post on R for finance, on parameter optimization and order types. You’ll notice that we’ve been doing two things (perhaps among others):

  • We only used the most recent data as out-of-sample data. What if there are peculiarities unique to the most recent time frame, and it's not a good judge (alone) of future performance?
  • We evaluate a strategy purely on its returns. Is this the best approach? What if a strategy doesn’t earn as much as another but it handles risk well?

I wanted to write an article focusing on addressing these points, by looking at cross-validation and different metrics for evaluating performance.

Data scientists want to fit models to training data that will do a good job of predicting future, out-of-sample data points. This is not done by simply picking the model that performs best on the training data; doing so leads to overfitting, where a model appears to perform well on the training data but does not generalize to out-of-sample data. Techniques need to be applied to guard against it.

A data scientist may first split her data set used for developing a predictive algorithm into a training set and a test set. The data scientist locks away the test set in a separate folder on the hard drive, never looking at it until she’s satisfied with the model she’s developed using the training data. The test set will be used once a final model has been found to determine the final model’s expected performance.

She would like to be able to simulate out-of-sample performance with the training set, though. The usual approach is to split the training set into, say, 10 different folds, or smaller data sets, so she can apply cross-validation. With cross-validation, she will choose one of the folds to be held out, and she fits a predictive model on the remaining nine folds. After fitting the model, she sees how the fitted model performs on the held-out fold, which the fitted model has not seen. She then repeats this process for the nine other folds, to get a sense of the distribution of the performance of the model, or its average performance, on out-of-sample data. If there are hyperparameters (parameters that describe some higher-order aspect of a model and are not learned from the data in the usual way other parameters are; they're difficult to define rigorously), she may try different combinations of them and see which combination leads to the model with the best predictive ability in cross-validation. After determining which predictive model generally leads to the best results and which hyperparameters are optimal, she trains a model on the training set with those hyperparameters, evaluates its performance on the test set, and reports the results, perhaps deploying the model.
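As a minimal sketch of the idea (fit_fun(), eval_fun(), and train_data below are placeholders for whatever model, metric, and data set you are using):

k <- 10
folds <- sample(rep(1:k, length.out = nrow(train_data)))  # random fold assignment
cv_scores <- sapply(1:k, function(i) {
  fit <- fit_fun(train_data[folds != i, ])   # fit on the other k - 1 folds
  eval_fun(fit, train_data[folds == i, ])    # score on the held-out fold
})
mean(cv_scores)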

For those developing trading algorithms, the goals are similar, but there are some key differences:

  1. We evaluate a trading method not by predictive accuracy but by some other measure of performance, such as profitability or profitability relative to risk. (Maybe predictive accuracy is profitable, maybe not; if the most profitable trading system always underestimates its profitability, we’re fine with it.)
  2. We are using data where time and the order in which data comes in is thought to matter. We cannot just reshuffle data; important features would be lost. So we cannot divide up the data set randomly into different folds; order must be preserved.

I’ll talk more about how we could evaluate a trading system later; for now, let’s focus on how we can apply the idea of cross-validation analysis.

(Illustration by Max Kuhn in his book, The caret Package.)

For time-dependent data, we can employ walk-forward analysis or rolling-forecast-origin techniques. This comes in various flavors (see the above illustration), but I will focus on one. We first divide the data set into, say, ten periods. We fit a trading algorithm on the first period in the data set, then see how it performs on the "out-of-sample" second period. Then we repeat for the second and third periods, third and fourth periods, and so on until we've reached the end of the data set. The out-of-sample data sets are then used for evaluating the potential performance of the trading system in question. If we like, we may keep a final data set, perhaps the most recent one, for final evaluation of whatever trading system passes this initial test.

Other variants include overlapping training/testing periods (as described, my walk-forward analysis approach does not overlap) or ones where the initial window grows from beginning to end. Of these, I initially think either the approach I've described or the approach with overlapping training/testing periods makes the most sense.
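A rough sketch of the non-overlapping scheme described above, purely to illustrate how the index sets line up (n and n_periods are arbitrary numbers here):

n <- 1000
n_periods <- 10
breaks <- floor(seq(1, n + 1, length.out = n_periods + 1))
splits <- lapply(1:(n_periods - 1), function(i) {
  list(train = breaks[i]:(breaks[i + 1] - 1),        # fit on period i
       test  = breaks[i + 1]:(breaks[i + 2] - 1))    # evaluate on period i + 1
})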

Attempting to Use quantstrat

I discovered that quantstrat has a function that I thought would implement the type of walk-forward analysis I wanted, called walk.forward(). I was able to define a training period duration, a testing period duration, an objective function to maximize, and many other features I wanted. It could take a strategy designed for optimization with apply.paramset() and apply a walk-forward analysis.

So I tried it once and the results were cryptic. The function returns a list with the results of training… but no sign of the test results. I would try looking at the portfolio object and tradeStats() output, but there was no sign of a portfolio. I had no idea why. I still have no idea where the portfolio went initially. I would try clearing my environments and restarting R and doing the whole process again. Same results: no sign of a portfolio, and all I got was a list of results for training data, with no sign of results for testing data. I looked at the source code. I saw that it was doing something with the test data: it was using applyStrategy() on the test data.

I had no idea what the hell that meant. When I use applyStrategy() as the user, I have to assume that the portfolio is wiped out, the account is reset, and every application is effectively a fresh start. So when I saw that line, it looked as if the function got an optimized function, applied it to the test data, and would repeat this process all the while forgetting the results of previous applications of the optimized strategy to test data sets. But online sources said that was not what was happening; the results of this backtesting were being appended back-to-back, which did not mesh with either my notion of walk-forward analysis (which allows for overlapping time frames, which does not mean you can “glue” together test data results) or what applyStrategy() does. And it still does not make sense to me. (That is what it does, though.)

The fact that the demonstration of walk.forward() supplied with quantstrat is bugged (and I never found how to fix all its bugs) didn’t help.

So I tried using the function’s auditing feature; by supplying a string to audit.prefix, *.RData files will be created containing the results of training and testing, storing them in a special .audit environment. So I changed the working directory to one specifically meant for the audit files (there could be a lot of them and I didn’t want to clutter up my other directories; I hate clutter), passed a string to audit.prefix, and tried again. After the analysis was completed, I look in the directory to see the files.

There’s nothing there.

It took hours and I had to go through a debugger to finally discover that walk.forward() was saving files, but it was using my Documents directory as the working directory, not the directory I setwd()ed into. Why? I have no idea. It could be quantstrat, blotter, or RStudio’s workbooks. I have no idea how long it took to figure out where my files had gone, but it was too long however long it was.

Good news though was that I now could try and figure out what was happening to the test data. This presentation by Guy Yollin was helpful, though it still took a while to get it figured out.

From what I can tell, what walk.forward() does is simulate a trading strategy that optimizes, trades, and re-optimizes on new data, and trades. So it doesn’t look like cross-validation as I described but rather a trading strategy, and that’s not really what I wanted here. I wanted to get a sense of the distribution of returns when using window sizes chosen optimally from backtesting. I don’t think there’s any simple way to do that with quantstrat.

Furthermore, I can’t see how I can collect the kind of statistics I want, such as a plot of the value of the account or the account’s growth since day one, with the way the walk-forward analysis is being done. Furthermore, the results I did get look… strange. I don’t know how those kinds of results would appear… at all. They look like nonsense. So I don’t even know if the function is working right.

Oh, did I mention that I discovered that the way I thought backtesting was being done isn’t how it’s actually done in quantstrat? When I was doing backtesting, I would rather see the results of the backtest on a variety of stocks instead of just one. I imagine an automated backtest behaving like a trader with an account, and active trades tie up the trader’s funds. So if the trader sees an opportunity but does not have the cash to act on it, the trader does not take the trade. (There is no margin here.)

I thought applyStrategy() was emulating such a trader, but the documentation (and the logs produced, for that matter), suggest that’s not what’s happening. It backtests on each symbol separately, unaware of any other positions that may exist. I don’t like that.

It was very difficult to figure this all out, especially given the state of the documentation currently. That said, in the developers’ defense, quantstrat is still considered under heavy development. Maybe in the future the documentation will be better. I feel, though, that creating good documentation is not something done once you’ve completed coding; you write good documentation as you code, not after.

blotter: The Fatal Flaw

Granted, I’m an ignorant person, but I don’t feel like all the confusion that led to me wasting my weekend is completely my fault. quantstrat's design is strange and, honestly, not that intuitive. I think the problem stems from quantstrat's building off an older, key package: blotter.

blotter is a package intended to manage transactions, accounts, portfolios, financial instruments, and so on. All the functions for creating accounts and portfolios are blotter functions, not quantstrat functions. blotter does this by setting up its own environment, the .blotter environment, and the functions I’ve shown are for communicating with this environment, adding and working with entities stored in it.

I don’t know what the purpose of creating a separate environment for managing those objects is. I feel that by doing so a wall has been placed between me and the objects I’m trying to understand. quantstrat takes this further by defining a .strategy environment where strategies actually “live”, so we must use special functions to interact with that separate universe, adding features to our strategies, deleting and redefining them with special functions, and so on.

To me, this usage of environments feels very strange, and I think it makes quantstrat difficult and unintuitive. In some sense, I feel like strategies, portfolios, accounts, etc. are pseudo-objects in the object oriented programming (OOP) sense, and the developers of blotter and quantstrat are trying to program like R is a language that primarily follows the OOP paradigm.

While R does support OOP, it’s a very different kind of OOP and I would not call R an OOP language. R is a functional programming language. Many see the tidyverse as almost revolutionary in making R easy to use. That’s because the tidyverse of packages, including ggplot2, magrittr, dplyr, and others, recognizes that R relies on the functional programming paradigm and that this paradigm does, in fact, lead to elegant code. Users can write pipelines, and functions to put into pipelines, that make complicated tasks easy to program. No part of the pipeline is cut off from the user, and the process of building the pipeline is very intuitive.

When I wrote my first blog post on R for finance and trading, one of the common comments I heard was, “tidyquant is better.” I have yet to look seriously at tidyquant, and maybe it is better, but so far all I’ve seen is tidyquant replicating quantmod's functionality, and I have no complaint with quantmod (although tidyquant may also be suffering from the original sin of relying too heavily on environments). That said, the tidyquant approach may yet be the future of R backtesting.

Let’s try re-imagining what a better R backtesting package might look like, one that takes lessons from the tidyverse. Let’s first look at defining a strategy with quantstrat (after all the obnoxious boilerplate code).

strategy_st <- "myStrat"
rm.strat(strategy_st) # Must be done to delete or define a new strategy
strategy(strategy_st, store = TRUE) # This is how to create a strategy

add.indicator(strategy = strategy_st, name = "SMA",
              arguments = list(x = quote(Cl(mktdata)), n = 20),
              label = "fastMA")
add.indicator(strategy = strategy_st, name = "SMA",
              arguments = list(x = quote(Cl(mktdata)), n = 50),
              label = "slowMA")

add.signal(strategy = strategy_st, name = "sigCrossover2", # Remember me?
           arguments = list(columns = c("fastMA", "slowMA"), relationship = "gt"),
           label = "bull")
add.signal(strategy = strategy_st, name = "sigCrossover2",
           arguments = list(columns = c("fastMA", "slowMA"), relationship = "lt"),
           label = "bear")

add.rule(strategy = strategy_st, name = "ruleSignal",
         arguments = list(sigcol = "bull", sigval = TRUE,
                          ordertype = "market", orderside = "long",
                          replace = FALSE, TxnFees = "fee", prefer = "Open",
                          osFUN = osMaxDollarBatch,
                          maxSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)),
                          tradeSize = quote(floor(getEndEq(account_st, Date = timestamp) * .1)),
                          batchSize = 100),
         type = "enter", path.dep = TRUE, label = "buy")
add.rule(strategy = strategy_st, name = "ruleSignal",
         arguments = list(sigcol = "bear", sigval = TRUE,
                          orderqty = "all", ordertype = "market", orderside = "long",
                          replace = FALSE, TxnFees = "fee", prefer = "Open"),
         type = "exit", path.dep = TRUE, label = "sell")

add.distribution(strategy_st, paramset.label = "MA",
                 component.type = "indicator", component.label = "fastMA",
                 variable = list(n = 5 * 1:10), label = "nFAST")
add.distribution(strategy_st, paramset.label = "MA",
                 component.type = "indicator", component.label = "slowMA",
                 variable = list(n = 5 * 1:10), label = "nSLOW")
add.distribution.constraint(strategy_st, paramset.label = "MA",
                            distribution.label.1 = "nFAST",
                            distribution.label.2 = "nSLOW",
                            operator = "<", label = "MA.Constraint")

After all this code, we have a strategy that lives in the .strategy environment, locked away from us. What if we were to create strategies like we create plots in ggplot2? The code might look like this:

myStrat <- strategy() +
  indicator(SMA, args = list(x = Close, n = 20), label = "fastMA") +
  indicator(SMA, args = list(x = Close, n = 50), label = "slowMA") +
  signal(sigCrossover, quote(fastMA > slowMA), label = "bull") +
  signal(sigCrossover, quote(fastMA < slowMA), label = "bear") +
  rule( ..... ) +
  rule( ..... ) +
  distribution( ...... ) +
  distribution( ...... ) +
  distribution_constraint( ...... )

In blotter (and thus quantstrat) we create porfolios and accounts like so:

portfolio_st <- "myPortf"
account_st <- "myAcct"
rm.strat(portfolio_st)
rm.strat(account_st)

# This creates a portfolio
initPortf(portfolio_st, symbols = symbols, initDate = initDate, currency = "USD")
# Then an account
initAcct(account_st, portfolios = portfolio_st, initDate = initDate,
         currency = "USD", initEq = 100000)
# And finally an orderbook
initOrders(portfolio_st, store = TRUE)

Then we run the strategy with:

applyStrategy(strategy_st, portfolios = portfolio_st)

and we see the results with this code:

updatePortf(portfolio_st)
dateRange <- time(getPortfolio(portfolio_st)$summary)[-1]
updateAcct(account_st, dateRange)
updateEndEq(account_st)

tStats <- tradeStats(Portfolios = portfolio_st, use="trades", inclZeroDays = FALSE)
tStats[, 4:ncol(tStats)] <- round(tStats[, 4:ncol(tStats)], 2)
print(data.frame(t(tStats[, -c(1,2)])))

final_acct <- getAccount(account_st)
plot(final_acct$summary$End.Eq["2010/2016"] / 100000,
     main = "Portfolio Equity", ylim = c(0.8, 2.5))
lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")

A tidyverse approach might look like the following:

# Create an account with a pipe
myAcct <- symbols %>%
  portfolio(currency = "USD") %>%
  account(currency = "USD", initEq = 100000)

# Backtesting generates an object containing the orderbook and all other
# necessary data (such as price at trade)
ob <- myAcct %>% myStrat # This is a backtest

Other functions would then extract any information we need from the object ob, when we ask for them.

If I wanted to optimize, I might do the following:

ob_list <- myAcct %>% myStrat(optim = "MA") # Gives a list of orderbooks, and
                                            # optimizes the parameter set MA

And as for the original reason I started to gripe, I might be able to find slices for the analysis with caret functions (they exist), subset the data in symbols according to these slices, and get ob in, say, an lapply() loop.
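A hedged sketch of what those caret slices might look like, using createTimeSlices() (the window sizes here are arbitrary examples):

library(caret)
slices <- createTimeSlices(1:500, initialWindow = 250, horizon = 50,
                           fixedWindow = TRUE, skip = 49)
# slices$train and slices$test are lists of index vectors; each train/test pair could
# be used to subset the price data before backtesting on that window, e.g. via lapply()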

I like the idea for this interface, myself.

Conclusion (On To Python)

I do not have the time to write a package like this myself. I don’t know enough, and I am a Ph.D. student who should be preparing for qualifying exams, research, etc. For now, though, I’m turned off to quantstrat, and thus R for backtesting and trading.

That said, my stock data analysis posts are still, by far, the most popular content of this blog, and I still would like to find a good backtesting system. I’m looking at going back to Python, and the backtrader package has caught my eye. I will either use that or zipline, but I am leaning to backtrader, since initially I like its logging features and its use of OOP.

I would love to hear thoughts and comments, especially from users of either backtrader or zipline or Python for stock data analysis in general.

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.

R vs Python: Different similarities and similar differences

Mon, 05/22/2017 - 14:46

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

A debate about which language is better suited for data science, R or Python, can set off diehard fans of these languages into a tizzy. This post tries to look at some of the different similarities and similar differences between these languages. To a large extent, the ease or difficulty of learning R or Python is subjective. I have heard that R has a steeper learning curve than Python, and also vice versa. This probably depends on the degree of familiarity with the language. To a large extent, both R and Python do the same things in just slightly different ways and syntaxes. The ease or difficulty of an R or Python construct is largely in the eyes of the beholder (nay, the programmer), we could say. I include my own experience with the languages below.

1. R data types

Two of the most commonly used data types in R are

  1. Character
  2. Numeric

Python has several data types

  1. Int
  2. float
  3. Long
  4. Complex and so on
2. R Vector vs Python List

A common data type in R is the vector. Python has a similar data type, the list

# R vectors
a<-c(4,5,1,3,4,5)
print(a[3])
## [1] 1
print(a[3:4]) # R does not always need the explicit print.
## [1] 1 3
#R type of variable
print(class(a))
## [1] "numeric"
# Length of a
print(length(a))
## [1] 6

# Python lists
a=[4,5,1,3,4,5]
print(a[2]) # Some python IDEs require the explicit print
print(a[2:5])
print(type(a))
# Length of a
print(len(a))
## 1
## [1, 3, 4]
## <class 'list'>
## 6

2a. Other data types – Python

Python also has certain other data types, like the tuple and dictionary, as shown below. R does not have as many data types; nevertheless, we can do everything in R that Python does with these.

# Python tuple
b = (4,5,7,8)
print(b)
#Python dictionary
c={'name':'Ganesh','age':54,'Work':'Professional'}
print(c)
#Print type of variable c
## (4, 5, 7, 8)
## {'name': 'Ganesh', 'age': 54, 'Work': 'Professional'}

2. Type of Variable

To know the type of a variable in R we use 'class'. In Python the corresponding command is 'type'.

#R - Type of variable
a<-c(4,5,1,3,4,5)
print(class(a))
## [1] "numeric"

#Python - Print type of list a and tuple b
a=[4,5,1,3,4,5]
print(type(a))
b=(4,3,"the",2)
print(type(b))
## <class 'list'>
## <class 'tuple'>

3. Length

To know length in R, use length()

#R - Length of vector
# Length of a
a<-c(4,5,1,3,4,5)
print(length(a))
## [1] 6

To know the length of a list, tuple or dict we can use len()

# Python - Length of list , tuple etc
# Length of a
a=[4,5,1,3,4,5]
print(len(a))
# Length of b
b = (4,5,7,8)
print(len(b))
## 6
## 4

4. Accessing help

To access help in R we use the ‘?’ or the ‘help’ function

#R - Help - To be done in R console or RStudio
#?sapply
#help(sapply)

Help in Python on any topic can be obtained as follows:

#Python help - This can be done on a (I)Python console
#help(len)
#?len

5. Subsetting

The key difference between R and Python with regards to subsetting is that in R the index starts at 1. In Python it starts at 0, much like C, C++ or Java. To subset a vector in R we use:

#R - Subset
a<-c(4,5,1,3,4,8,12,18,1)
print(a[3])
## [1] 1
# To print a range or a slice. Print from the 3rd to the 6th element
print(a[3:6])
## [1] 1 3 4 8

Python also uses indices. The difference in Python is that the index starts from 0.

#Python - Subset
a=[4,5,1,3,4,8,12,18,1]
# Print the 4th element (starts from 0)
print(a[3])
# Print a slice from 4 to 6th element
print(a[3:6])
## 3
## [3, 4, 8]

6. Operations on vectors in R and operation on lists in Python

In R we can do many operations on vectors, e.g. element-wise addition, subtraction, exponentiation, product etc., as shown below.

#R - Operations on vectors
a<- c(5,2,3,1,7)
b<- c(1,5,4,6,8)

#Element wise Addition
print(a+b)
## [1] 6 7 7 7 15
#Element wise subtraction
print(a-b)
## [1] 4 -3 -1 -5 -1
#Element wise product
print(a*b)
## [1] 5 10 12 6 56
# Exponentiating the elements of a vector
print(a^2)
## [1] 25 4 9 1 49

In Python to do this on lists we need to use the ‘map’ and the ‘lambda’ function as follows

# Python - Operations on list
a =[5,2,3,1,7]
b =[1,5,4,6,8]

#Element wise addition with map & lambda
print(list(map(lambda x,y: x+y,a,b)))
#Element wise subtraction
print(list(map(lambda x,y: x-y,a,b)))
#Element wise product
print(list(map(lambda x,y: x*y,a,b)))
# Exponentiating the elements of a list
print(list(map(lambda x: x**2,a)))
## [6, 7, 7, 7, 15]
## [4, -3, -1, -5, -1]
## [5, 10, 12, 6, 56]
## [25, 4, 9, 1, 49]

However if we create ndarrays from lists then we can do the element wise addition,subtraction,product, etc. like R. Numpy is really a powerful module with many, many functions for matrix manipulations

import numpy as np
a =[5,2,3,1,7]
b =[1,5,4,6,8]
a=np.array(a)
b=np.array(b)

#Element wise addition
print(a+b)
#Element wise subtraction
print(a-b)
#Element wise product
print(a*b)
# Exponentiating the elements of a list
print(a**2)
## [ 6 7 7 7 15]
## [ 4 -3 -1 -5 -1]
## [ 5 10 12 6 56]
## [25 4 9 1 49]

7. Getting the index of element

To determine the index of an element which satisfies a specific logical condition in R, use 'which'. In the code below, the index of the element which is equal to 1 is 4.

# R - Which
a<- c(5,2,3,1,7)
print(which(a == 1))
## [1] 4

In Python array we can use np.where to get the same effect. The index will be 3 as the index starts from 0

# Python - np.where
import numpy as np
a =[5,2,3,1,7]
a=np.array(a)
print(np.where(a==1))
## (array([3], dtype=int64),)

8. Data frames

R, by default, comes with a set of built-in datasets. For Python, there are some datasets which come with the Scikit-learn package.

# R
# To check built datasets use
#data() - In R console or in R Studio
#iris - Don't print to console

We can use the in-built data sets that come with Scikit package

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

# This creates a Sklearn bunch
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)

9. Working with dataframes

With R you can work with dataframes directly. For more complex dataframe operations in R there are convenient packages like dplyr, reshape2 etc. For Python we need to use the Pandas package. Pandas is quite comprehensive in the list of things we can do with data frames. The most common operations on a dataframe are:

  • Check the size of the dataframe
  • Take a look at the top 5 or bottom 5 rows of dataframe
  • Check the content of the dataframe
a.Size

In R use dim()

#R - Size
dim(iris)
## [1] 150 5

For Python use .shape

#Python - size
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.shape

b. Top & bottom 5 rows of dataframe

To see the top and bottom rows of a data frame we use head() & tail(), as shown below for R and Python.

#R
head(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
tail(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.head(5))
print(iris.tail(5))
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
## 3 4.6 3.1 1.5 0.2
## 4 5.0 3.6 1.4 0.2
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
## 145 6.7 3.0 5.2 2.3
## 146 6.3 2.5 5.0 1.9
## 147 6.5 3.0 5.2 2.0
## 148 6.2 3.4 5.4 2.3
## 149 5.9 3.0 5.1 1.8

c. Check the content of the dataframe

#R
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm) 150 non-null float64
## sepal width (cm) 150 non-null float64
## petal length (cm) 150 non-null float64
## petal width (cm) 150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

d. Check column names

#R
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
#Get column names
print(iris.columns)
## Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
## 'petal width (cm)'],
## dtype='object')

e. Rename columns

In R we can assign a vector to column names

#R
colnames(iris) <- c("lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal","Species")
colnames(iris)
## [1] "lengthOfSepal" "widthOfSepal" "lengthOfPetal" "widthOfPetal"
## [5] "Species"

In Python we can assign a list to the dataframe's .columns attribute

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.columns = ["lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal"]
print(iris.columns)
## Index(['lengthOfSepal', 'widthOfSepal', 'lengthOfPetal', 'widthOfPetal'], dtype='object')

f.Details of dataframe

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm) 150 non-null float64
## sepal width (cm) 150 non-null float64
## petal length (cm) 150 non-null float64
## petal width (cm) 150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

g. Subsetting dataframes

# R
#To subset a dataframe 'df' in R we use df[row,column] or df[row vector,column vector]
#df[row,column]
iris[3,4]
## [1] 0.2
#df[row vector, column vector]
iris[2:5,1:3]
## lengthOfSepal widthOfSepal lengthOfPetal
## 2 4.9 3.0 1.4
## 3 4.7 3.2 1.3
## 4 4.6 3.1 1.5
## 5 5.0 3.6 1.4
#If we omit the row vector, then it implies all rows or if we omit the column vector
# then implies all columns for that row
iris[2:5,]
## lengthOfSepal widthOfSepal lengthOfPetal widthOfPetal Species
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
# In R we can select specific columns by column names
iris$Sepal.Length[2:5]
## NULL

#Python
# To select an entire row we use .iloc. The index can be used with the ':'. If
# .iloc[start row: end row]. If start row is omitted then it implies the beginning of
# data frame, if end row is omitted then it implies all rows till end
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.iloc[3])
print(iris[:5])
# In python we can select columns by column name as follows
print(iris['sepal length (cm)'][2:6])
#If you want to select more than 2 columns then you must use the double '[[]]' since the
# index is a list itself
print(iris[['sepal length (cm)','sepal width (cm)']][4:7])
## sepal length (cm) 4.6
## sepal width (cm) 3.1
## petal length (cm) 1.5
## petal width (cm) 0.2
## Name: 3, dtype: float64
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
## 3 4.6 3.1 1.5 0.2
## 4 5.0 3.6 1.4 0.2
## 2 4.7
## 3 4.6
## 4 5.0
## 5 5.4
## Name: sepal length (cm), dtype: float64
## sepal length (cm) sepal width (cm)
## 4 5.0 3.6
## 5 5.4 3.9
## 6 4.6 3.4

h. Computing Mean, Standard deviation

#R
#Mean
mean(iris$lengthOfSepal)
## [1] 5.843333
#Standard deviation
sd(iris$widthOfSepal)
## [1] 0.4358663

#Python
#Mean
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris['sepal length (cm)'].mean())
#Standard deviation
print(iris['sepal width (cm)'].std())
## 5.843333333333335
## 0.4335943113621737

i. Boxplot

A boxplot can be produced in R using base plotting:

#R
boxplot(iris$lengthOfSepal)

 

Matplotlib is a popular package in Python for plots

#Python
import sklearn as sklearn
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img=plt.boxplot(iris['sepal length (cm)'])
plt.show(img)

j.Scatter plot

#R
plot(iris$widthOfSepal,iris$lengthOfSepal)

#Python
import matplotlib.pyplot as plt
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img=plt.scatter(iris['sepal width (cm)'],iris['sepal length (cm)'])
#plt.show(img)

k. Read from csv file

#R
tendulkar= read.csv("tendulkar.csv",stringsAsFactors = FALSE,na.strings=c(NA,"-"))
#Dimensions of dataframe
dim(tendulkar)
## [1] 347 13
names(tendulkar)
## [1] "X" "Runs" "Mins" "BF" "X4s"
## [6] "X6s" "SR" "Pos" "Dismissal" "Inns"
## [11] "Opposition" "Ground" "Start.Date"

Use pandas.read_csv() for Python

#Python
import pandas as pd

#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
print(tendulkar.columns)
## (347, 13)
## Index(['Unnamed: 0', 'Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos',
## 'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
## dtype='object')

l. Clean the dataframe in R and Python.

The following steps are done for R and Python
1.Remove rows with ‘DNB’
2.Remove rows with ‘TDNB’
3.Remove rows with absent
4.Remove the “*” indicating not out
5.Remove incomplete rows with NA for R or NaN in Python
6.Do a scatter plot

#R
# Remove rows with 'DNB'
a <- tendulkar$Runs != "DNB"
tendulkar <- tendulkar[a,]
dim(tendulkar)
## [1] 330 13
# Remove rows with 'TDNB'
b <- tendulkar$Runs != "TDNB"
tendulkar <- tendulkar[b,]
# Remove rows with absent
c <- tendulkar$Runs != "absent"
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 329 13
# Remove the "*" indicating not out
tendulkar$Runs <- as.numeric(gsub("\\*","",tendulkar$Runs))
dim(tendulkar)
## [1] 329 13
# Select only complete rows - complete.cases()
c <- complete.cases(tendulkar)
#Subset the rows which are complete
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 327 13
# Do some base plotting - Scatter plot
plot(tendulkar$BF,tendulkar$Runs)

#Python
import pandas as pd
import matplotlib.pyplot as plt

#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
print(tendulkar.shape)
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
print(tendulkar.shape)
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar
print(tendulkar.shape)
# Remove the "*" indicating not out
tendulkar.Runs= tendulkar.Runs.str.replace(r"[*]","")
#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
print(tendulkar.shape)
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
#Scatter plot
plt.scatter(tendulkar.BF,tendulkar.Runs)
## (347, 13)
## (330, 13)
## (329, 13)
## (329, 13)
## (327, 13)

m.Chaining operations on dataframes

To chain a set of operations in R we need to use a package like dplyr; Pandas supports this directly. The following operations are done on the tendulkar data frame, using dplyr for R and Pandas for Python:

  1. Group by ground
  2. Compute average runs in each ground
  3. Arrange in descending order
#R
library(dplyr)
tendulkar1 <- tendulkar %>% group_by(Ground) %>% summarise(meanRuns= mean(Runs)) %>% arrange(desc(meanRuns))
head(tendulkar1,10)
## # A tibble: 10 × 2
## Ground meanRuns
## <chr> <dbl>
## 1 Multan 194.00000
## 2 Leeds 193.00000
## 3 Colombo (RPS) 143.00000
## 4 Lucknow 142.00000
## 5 Dhaka 132.75000
## 6 Manchester 93.50000
## 7 Sydney 87.22222
## 8 Bloemfontein 85.00000
## 9 Georgetown 81.00000
## 10 Colombo (SSC) 77.55556

#Python
import pandas as pd

#Read csv
tendulkar= pd.read_csv("tendulkar.csv",na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a=tendulkar.Runs !="DNB"
tendulkar=tendulkar[a]
# Remove rows with 'TDNB'
b=tendulkar.Runs !="TDNB"
tendulkar=tendulkar[b]
# Remove rows with absent
c= tendulkar.Runs != "absent"
tendulkar=tendulkar
# Remove the "*" indicating not out
tendulkar.Runs= tendulkar.Runs.str.replace(r"[*]","")
#Select only complete rows - dropna()
tendulkar=tendulkar.dropna()
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
tendulkar1= tendulkar.groupby('Ground').mean()['Runs'].sort_values(ascending=False)
print(tendulkar1.head(10))
## (347, 13)
## Ground
## Multan 194.000000
## Leeds 193.000000
## Colombo (RPS) 143.000000
## Lucknow 142.000000
## Dhaka 132.750000
## Manchester 93.500000
## Sydney 87.222222
## Bloemfontein 85.000000
## Georgetown 81.000000
## Colombo (SSC) 77.555556
## Name: Runs, dtype: float64

9. Functions

product <- function(a,b){
  c<- a*b
  c
}
product(5,7)
## [1] 35

def product(a,b):
    c = a*b
    return c

print(product(5,7))
## 35

Conclusion

Personally, I took to R much like a duck takes to water. I found the R syntax very simple and mostly intuitive. R packages like dplyr, ggplot2 and reshape2 make the language quite irresistible. R is dynamically typed and, in everyday use, deals mostly with numeric and character data, as opposed to the fuller set of built-in data types in Python.

Python has too many bells and whistles, which can be a little bewildering to the novice. It is possible that they become useful as one gains more experience with the language. I also found that installing Python packages sometimes gives errors with Python versions 2.7 or 3.6, which will leave you scrambling to Google to find how to fix the problems. This can be quite frustrating. R, on the other hand, makes installing packages a breeze.

Anyway, this is my current opinion, and like all opinions, may change in the course of time. Let’s see!

I may write a follow up post with more advanced features of R and Python. So do keep checking! Long live R! Viva la Python!

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts ….

Bagging, the perfect solution for model instability

Mon, 05/22/2017 - 14:27

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos Motivation

The name bagging comes from bootstrap aggregating. It is a machine learning technique proposed by Breiman (1996) to increase stability in potentially unstable estimators. For example, suppose you want to run a regression with a few variables in two steps. First, you run the regression with all the variables in your data and select the significant ones. Second, you run a new regression using only the selected variables and compute the predictions.

This procedure is not wrong if your problem is forecasting. However, this two step estimation may result in highly unstable models. If many variables are important but individually their importance is small, you will probably leave some of them out, and small perturbations on the data may drastically change the results.

Bagging

Bagging solves the instability problem by giving each variable several chances to be in the model. The procedure described in the previous section is repeated in many bootstrap samples, and the final forecast is the average forecast across all these samples.

In this post I will show an example of Bagging with regression in the selection context described above, but you should keep in mind that this technique can be used for any unstable model. The most successful case is using Bagging on regression trees, resulting in the famous random forest. Bagging is also very robust to overfitting and produces forecasts with less variance.

Application

I am going to show an application of Bagging from scratch, using the boot package for the bootstrap replications. We start by generating some data with the following dgp:


y = X beta + e,    x_ij = A_j f_i + u_ij

The first equation shows how the variable we want to predict, y, is generated. The second equation shows that the independent variables are generated by a common factor f, which induces a correlation structure. Since we need instability, the data will be generated to have an R² of around 0.3 and 40 candidate variables (the first 30 coefficients are 1 or -1 and the last 10 are zero, as in the code below), so that many variables are important but their individual importance is small.

N = 300 # = Number of obs. = #
K = 40 # = Number of Variables = #
set.seed(123) # = Seed for replication = #
f = rnorm(N) # = Common factor = #
A = runif(K,-1,1) # = Coefficients from the second equation = #
X = t(A%*%t(f)) + matrix(rnorm(N*K), N, K) # = Generate xij = #
beta = c((-1)^(1:(K-10)) ,rep(0, 10)) # = Coefficients from the first equation, the first 30 are equal 1 or -1 and the last 10 are 0 = #

# = R2 setup = #
aux = var(X%*%beta)
erro = rnorm(N, 0, sqrt(2*aux)) # = The variance of the error will be twice the variance of X%*%beta = #
y = X %*% beta+erro # = Generate y = #
1-var(erro)/var(y) # = R2 = #
## [,1]
## [1,] 0.2925866

# = Break data into in-sample and out-of-sample = #
y.in = y[1:(N-100)]
X.in = X[1:(N-100),]
y.out = y[-c(1:(N-100))]
X.out = X[-c(1:(N-100)),]

Now we must define a function to be called by the bootstrap. This function must receive the data and an argument for the indexes that tells R where the sampling must be made. The function will run a linear regression with all variables, select those with t-statistics bigger than 1.96 in absolute value, run a new regression with the selected variables, and store the coefficients.

bagg = function(data, ind) {
  sample = data[ind, ]
  y = sample[, 1]
  X = sample[, 2:ncol(data)]
  model1 = lm(y ~ X)
  t.stat = summary(model1)$coefficients[-1, 3]
  selected = which(abs(t.stat) >= 1.96)
  model2 = lm(y ~ X[, selected])
  coefficients = coef(model2)
  cons = coefficients[1]
  betas = coefficients[-1]
  aux = rep(0, ncol(X))
  aux[selected] = betas
  res = c(cons, aux)
  return(res)
}

Now we are ready to use the boot function to run the bagging. The code below does precisely that with 500 bootstrap replications. It also runs the simple OLS procedure so we can compare. If you type selected after running this code you will see that many variables were left out and that some variables whose true coefficient is zero were selected.

library(boot)

# = Bagging = #
bagging = boot(data = cbind(y.in, X.in), statistic = bagg, R = 500)
bagg.coef = bagging$t
y.pred.bagg = rowMeans(cbind(1, X.out) %*% t(bagg.coef))

# = OLS = #
ols0 = lm(y.in ~ X.in)
selected = which(abs(summary(ols0)$coefficients[-1, 3]) >= 1.96)
ols1 = lm(y.in ~ X.in[, selected])
y.pred = cbind(1, X.out[, selected]) %*% coef(ols1)

# = Forecasting RMSE = #
sqrt(mean((y.pred - y.out)^2))      # = OLS = #
## [1] 10.43338
sqrt(mean((y.pred.bagg - y.out)^2)) # = Bagging = #
## [1] 9.725554

The results show that bagging has a smaller RMSE than OLS. It is also interesting to see what happens to the coefficients: the plot below shows that bagging performs some shrinkage of the coefficients towards zero.

barplot(coef(ols0)[-1])
barplot(colMeans(bagg.coef)[-1], add = TRUE, col = "red")

Finally, what would happen if we had no instability? If you run the same code again with a much higher R² (say around 0.8) you will find the plot below. To raise the R², shrink the error variance by replacing the multiplier 2 with a smaller value in the line erro = rnorm(N, 0, sqrt(2*aux)) in the dgp. As you can see, bagging and the simple procedure are the same in the absence of instability.
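
For reference, the link between that multiplier and the R² of the dgp is simple: since the error variance is c times the variance of X %*% beta, the R² is 1 / (1 + c). A quick check (my own addition, not from the original post):

# R2 implied by the error-variance multiplier c in erro = rnorm(N, 0, sqrt(c * aux))
c_values <- c(2, 1, 0.25)
data.frame(multiplier = c_values, R2 = 1 / (1 + c_values))
# c = 2 gives an R2 of about 0.33 (the unstable setup above); c = 0.25 gives R2 = 0.8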

References

Breiman, Leo (1996). “Bagging predictors”. Machine Learning. 24 (2): 123–140

To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Connecting R to an Oracle Database

Mon, 05/22/2017 - 12:30

(This article was first published on R – Gerald Belton, and kindly contributed to R-bloggers)

R is a very popular language for doing analytics, and particularly statistics, on your data. There are a number of R functions for reading in data, but most of them take a delimited text file (such as .CSV) for input. That’s great if your existing data is in a spreadsheet, but if you have large amounts of data, it’s probably stored in a relational database. If you work for a large company, chances are that it is an Oracle database.

The most efficient way to access an Oracle database from R is using the RODBC package, available from CRAN. If the RODBC package is not installed in your R environment, use the install.packages("RODBC") command to install it. ODBC stands for Open DataBase Connectivity, an open standard application programming interface (API) for databases. ODBC was created by the SQL Access Group and first released in September 1992. Although Microsoft Windows was the first to provide an ODBC product, versions now exist for Linux and Macintosh platforms as well. ODBC is built in to current versions of Windows. If you are using a different operating system, you'll need to install an ODBC driver manager.
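
A quick check that the package and the ODBC layer are visible from R might look like this:

# install.packages("RODBC")  # run once if RODBC is not installed yet
library(RODBC)
odbcDataSources()            # lists the DSNs known to your ODBC driver manager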

Before you can access a database from R, you’ll need to create a Data Source Name, or DSN. This is an alias to the database, which provides the connection details. In Windows, you create the DSN using the ODBC Source Administrator. This tool can be found in the Control Panel. In Windows 10, it’s under System and Security -> Administrative Tools -> ODBC Data Sources. Or you can just type “ODBC” in the search box. On my system, it looks like this:

As you can see, I already have a connection to an Oracle database. To set one up, click Add, and you’ll get this box:

Select the appropriate driver (in my case, Oracle in OraDB12Home1) and click the Finish button. A Driver Configuration box opens:

For “Data Source Name,” you can put in almost anything you want. This is the name you will use in R when you connect to the database.

The “Description” field is optional, and again, you can put in whatever you want.

TNS Service Name is the name that you (or your company's database administrator) assigned when configuring the Oracle database, and "User ID" is the ID that you use with the database.

After you fill in these fields, click the “Test Connection” button. Another box pops up, with the TNS Service Name and User ID already populated, and an empty field for your password. Enter your password and click “OK.” You should see a “Connection Successful” message. If not, check the Service Name, User ID, and Password.

Now you are ready to connect R to the database.

Here’s the R code that you need:

# Load RODBC package
library(RODBC)

# Create a connection to the database called "channel"
channel <- odbcConnect("DATABASE", uid = "USERNAME", pwd = "PASSWORD")

# Query the database and put the results into the data frame "dataframe"
dataframe <- sqlQuery(channel, "
  SELECT *
  FROM SCHEMA.DATATABLE")

# When finished, it's a good idea to close the connection
odbcClose(channel)

A couple of comments about this code are in order:

First, I don’t like the idea of having a password appear, unencrypted, in the R program. One possible solution is to prompt the user for the password before creating the connection:

pswd <- readline("Input Password: ")
channel <- odbcConnect("DATABASE", uid = "USERNAME", pwd = pswd)

This will enable the connection to be made without compromising the security of the password.

Second, sqlQuery will pass to Oracle whatever is inside the quotation marks. This is the workhorse function of the RODBC package. The term 'query' includes any valid SQL statement, including table creation, updates, etc., as well as 'SELECT's.

Finally, I should mention that R works with data that is loaded into the computer’s memory. If you try to load a really huge database into memory all at once, it will a) take a very long time, and b) possibly fail due to exceeding your computer’s memory capacity. Of course, relational database systems like Oracle are the natural habitat of very large data sets, so that may be your motivation for connecting R to Oracle in the first place. Carefully constructed SQL Queries will let Oracle do the work of managing the data, and return just the data that R needs for performing analytics.
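
For instance, a query that lets Oracle do the heavy lifting and returns only a small summary to R might look like the sketch below (SCHEMA.SALES and its columns are hypothetical names, not a real table):

# Aggregate inside Oracle, pull back only the summary rows
monthly <- sqlQuery(channel, "
  SELECT TO_CHAR(sale_date, 'YYYY-MM') AS sale_month,
         SUM(amount)                   AS total_amount
  FROM   SCHEMA.SALES
  GROUP  BY TO_CHAR(sale_date, 'YYYY-MM')
  ORDER  BY sale_month")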

Writing SQL Queries is beyond the scope of this blog post. If you need help with that, there are plenty of free tutorials on the web, or you might find this book helpful: Oracle 12c for Dummies

 

To leave a comment for the author, please follow the link and comment on their blog: R – Gerald Belton. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Sunflowers for COLOURlovers

Mon, 05/22/2017 - 09:00

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Andar, lo que es andar, anduve encima siempre de las nubes (Del tiempo perdido, Robe)

If you give importance to colours, maybe you already know COLOURlovers. As can be read on their website, COLOURlovers is a creative community where people from around the world create and share colors, palettes and patterns, discuss the latest trends and explore colorful articles… All in the spirit of love.

There is an R package called colourlovers which provides access to the COLOURlovers API. It makes it very easy to choose nice colours for your graphics. I used the clpalettes function to search for the top palettes of the website. Their names are pretty suggestive as well: Giant Goldfish, Thought Provoking, Adrift in Dreams, let them eat cake … Inspired by this post I have done a Shiny app to create colored flowers using those palettes. Seeds are arranged according to the golden angle. One example:

Some others:









You can play with the app here.

If you want to do your own sunflowers, here you have the code. This is the ui.R file:

library(colourlovers)
library(rlist)

top = clpalettes('top')
sapply(1:length(top), function(x) list.extract(top, x)$title) -> titles

fluidPage(
  titlePanel("Sunflowers for COLOURlovers"),
  fluidRow(
    column(3,
      wellPanel(
        selectInput("pal", label = "Palette:", choices = titles),
        sliderInput("nob", label = "Number of points:",
                    min = 200, max = 500, value = 400, step = 50)
      )
    ),
    mainPanel(
      plotOutput("Flower")
    )
  )
)

And this is the server.R one:

library(shiny)
library(ggplot2)
library(colourlovers)
library(rlist)
library(dplyr)

top = clpalettes('top')
sapply(1:length(top), function(x) list.extract(top, x)$title) -> titles

CreatePlot = function(ang = pi*(3-sqrt(5)), nob = 150, siz = 15, sha = 21, pal = "LoversInJapan") {
  list.extract(top, which(titles == pal))$colors %>%
    unlist %>% as.vector() %>% paste0("#", .) -> all_colors
  colors = data.frame(hex = all_colors, darkness = colSums(col2rgb(all_colors)))
  colors %>% arrange(-darkness) -> colors
  background = colors[1, "hex"] %>% as.character
  colors %>% filter(hex != background) %>% .[, 1] %>% as.vector() -> colors

  ggplot(data.frame(r = sqrt(1:nob), t = (1:nob)*ang*pi/180),
         aes(x = r*cos(t), y = r*sin(t))) +
    geom_point(colour = sample(colors, nob, replace = TRUE, prob = exp(1:length(colors))),
               aes(size = (nob - r)), shape = 16) +
    scale_x_continuous(expand = c(0, 0), limits = c(-sqrt(nob)*1.4, sqrt(nob)*1.4)) +
    scale_y_continuous(expand = c(0, 0), limits = c(-sqrt(nob)*1.4, sqrt(nob)*1.4)) +
    theme(legend.position = "none",
          panel.background = element_rect(fill = background),
          panel.grid = element_blank(),
          axis.ticks = element_blank(),
          axis.title = element_blank(),
          axis.text = element_blank())
}

function(input, output) {
  output$Flower = renderPlot({
    CreatePlot(ang = 180*(3-sqrt(5)), nob = input$nob, siz = input$siz,
               sha = as.numeric(input$sha), pal = input$pal)
  }, height = 550, width = 550)
}
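
And if you only want the golden-angle seed arrangement itself, without the Shiny and COLOURlovers parts, a stripped-down sketch (my own simplification, not part of the app) is:

library(ggplot2)
ang <- pi * (3 - sqrt(5))   # the golden angle, in radians
nob <- 400
seeds <- data.frame(r = sqrt(1:nob), t = (1:nob) * ang)
ggplot(seeds, aes(x = r * cos(t), y = r * sin(t), size = nob - r)) +
  geom_point(shape = 16, colour = "darkgoldenrod") +
  coord_equal() +
  theme_void() +
  theme(legend.position = "none")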

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Using R: When using do in dplyr, don’t forget the dot

Sun, 05/21/2017 - 11:50

There will be a few posts about switching from plyr/reshape2 for data wrangling to the more contemporary dplyr/tidyr.

My most common use of plyr looked something like this: we take a data frame, split it by some column(s), and use an anonymous function to do something useful. The function takes a data frame and returns another data frame, both of which could very possibly have only one row. (If, in fact, it has to have only one row, I’d suggest an assert_that() call as the first line of the function.)

library(plyr) results <- ddply(some_data, "key", function(x) { ## do something; return data.frame() })

Or maybe, if I felt serious and thought the function would ever be used again, I’d write:

calculate <- function(x) { ## do something; return data.frame() } result <- ddply(some_data, "key", calculate)

Rinse and repeat over and over again. For me, discovering ddply was like discovering vectorization, but for data frames. Vectorization lets you think of operations on vectors, without having to think about their elements. ddply lets you think about operations on data frames, without having to think about rows and columns. It saves a lot of thinking.

The dplyr equivalent would be do(). It looks like this:

library(dplyr) grouped <- group_by(some_data, key) result <- do(grouped, calculate(.))

Or once again with magrittr:

library(magrittr) some_data %>% group_by(key) %>% do(calculate(.)) -> result

(Yes, I used the assignment arrow from the left hand side to the right hand side. Roll your eyes all you want. I think it’s in keeping with the magrittr theme of reading from left to right.)

One important thing here, which got me at first: There has to be a dot! Just passing the function name, as one would have done with ddply, will not work:

grouped <- group_by(some_data, key)
## will not work: Error: Results are not data frames at positions ...
try(result <- do(grouped, calculate))

Don’t forget the dot!

Postat i:computer stuff Tagged: dplyr, plyr, R

Shiny Applications Layouts Exercises (Part-9)

Sat, 05/20/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Shiny Application Layouts – Update Input

In the ninth part of our series we will look at the update input scenario. This is different from the Dynamic UI example we met in part 8, where the UI component is generated on the server and sent to the UI, where it replaces an existing UI component.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly you can make a revision on what you learnt in “Building Shiny App” series as we will build basic shiny staff in order to present it in the proper way.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

UI Context

As always let us create the skeleton of the app.
#ui.r
fluidPage(
titlePanel("Title"),
fluidRow(
column(3,
wellPanel()),
column(3,
wellPanel()),
column(3,
wellPanel())
)
)
#server.R
function(input, output, clientData, session) {}

Exercise 1

Build the initial UI of your app. Give it a title, and use 4 columns and one row for each of the three wellPanels that we are going to use.

Applying reactivity

To create a reactive context for our app we will use the observe({}) function on the server side.

Exercise 2

Create reactive context. HINT: Use observe({}).

“Master” Inputs.

Now we are going to create the two inputs that control the rest of the inputs of the app. Look at the example:
#ui.r
textInput("text_input",
"labels:",
"Some Text"),
sliderInput("slider_input",
"values:",
min = 1, max = 100, value = 50)
#server.r
t_input <- input$text_input
s_input <- input$slider_input

Exercise 3

Create a text input on the UI side that controls the labels, and then pass it to a new variable on the server side.

Learn more about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

Exercise 4

Create a slider input on the UI side that controls the values, and then pass it to a new variable on the server side.

Dependent Inputs

Firstly let’s create a text input that changes both the label and the text.
#ui.r
textInput("Text", "Text input:", value = "text")
#server.r
updateTextInput(session, "Text",
label = paste("Sth", t_input),
value = paste("Sth", s_input)
)

Exercise 5

Create a text input, in the second wellPanel, that changes its label and its value according to the two “Master” inputs you created before. HINT: Use updateTextInput().

Exercise 6

Create a slider input, in the second wellPanel, that changes its label and its value according to the two “Master” inputs you created before. HINT: Use updateSliderInput().

Exercise 7

Create a numeric input, in the second wellPanel, that changes its label and its value according to the two “Master” inputs you created before. HINT: Use updateNumericInput().

Exercise 8

Create a date input, in the second wellPanel, that changes its label and its value according to the two “Master” inputs you created before. HINT: Use updateDateInput().

In order to create a Checkbox group with the same conditions as the rest of the inputs we just created we should first create a list of options like in the example below:
#server.r
options <- list()
options[[paste("option", s_input, "A")]] <-
paste0("option-", s_input, "-A")
options[[paste("option", s_input, "B")]] <-
paste0("option-", s_input, "-B")
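
To see where a list like this ends up, here is a minimal sketch (with a hypothetical inputId, so not the full answer to the exercises below): it is simply passed as the choices argument of the update function inside the observer.

#server.r (inside observe({ ... }); "checkGroup" is a hypothetical inputId)
updateCheckboxGroupInput(session, "checkGroup",
label = paste("checkboxgroup", t_input),
choices = options,
selected = paste0("option-", s_input, "-A")
)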

Exercise 9

Create a list with three choices for your Checkbox Group. HINT: Use list().

Exercise 10

Create a checkbox group input, in the third wellPanel, that changes its label and its values according to the two “Master” inputs you created before. HINT: Use updateCheckboxGroupInput().

Related exercise sets:
  1. Building Shiny App exercises part 4
  2. Building Shiny App exercises part 1
  3. Building Shiny App exercises part 3
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

New series: R and big data (concentrating on Spark and sparklyr)

Sat, 05/20/2017 - 15:56

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Win-Vector LLC has recently been teaching how to use R with big data through Spark and sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big-data using R, share some hints, and even admit to some of the warts found in this combination of systems.

The ability to perform sophisticated analyses and modeling on “big data” with R is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).


J. Howard Miller’s, 1943.

The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using Spark through R and are starting a new series of articles tagged “R and big data” to help you produce production quality solutions quickly.

Please read on for a brief description of our new articles series: “R and big data.”

Background

R is a best of breed in-memory analytics platform. R allows the analyst to write programs that operate over their data and bring in a huge suite of powerful statistical techniques and machine learning procedures. Spark is an analytics platform designed to operate over big data that exposes some of its own statistical and machine learning capabilities. R can now be operated “over Spark“. That is: R programs can delegate tasks to Spark clusters and issue commands to Spark clusters. In some cases the syntax for operating over Spark is deliberately identical to working over data stored in R.

Why R and Spark

The advantages are:

  • Spark can work at a scale and speed far beyond what native R can handle. The ability to send work to Spark increases R's capabilities.
  • R has machine learning and statistical capabilities that go far beyond what is available on Spark or any other “big data” system (many of which are descended from report generation or basic analytics). The ability to use specialized R methods on data samples yields additional capabilities.
  • R and Spark can share code and data.

The R/Spark combination is not the only show in town, but it is a powerful capability that may not be safe to ignore. We will also talk about additional tools that can be brought into the mix, such as the powerful large-scale machine learning capabilities from h2o.
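
To give a tiny taste of what that sharing looks like in practice, here is a sketch assuming a local Spark installation plus the sparklyr and dplyr packages (not one of the upcoming articles):

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")   # or the master URL of your cluster
mtcars_tbl <- copy_to(sc, mtcars)       # push an R data frame out to Spark
mtcars_tbl %>%                          # dplyr verbs are translated to Spark SQL
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  collect()                             # bring the (small) result back into R
spark_disconnect(sc)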

The warts

Frankly a lot of this is very new, and still on the "bleeding edge." Spark 2.x has only been available in stable form since July 26, 2016 (or just under a year). Spark 2.x is much more capable than the Spark 1.x series in terms of both data manipulation and machine learning, so we strongly suggest clients insist on Spark 2.x clusters from their infrastructure vendors (such as Cloudera, Hortonworks, MapR, and others), despite Spark 2.x having only recently become available in these packaged solutions. The sparklyr adapter itself was first available on CRAN only as of September 24th, 2016, and SparkR only started shipping with Spark as of Spark 1.4 in June 2015.

While R/Spark is indeed a powerful combination, nobody seems to be sharing a lot of production experiences and best practices with it yet.

Some of the problems are sins of optimism. A lot of people still confuse successfully standing a cluster up with effectively using it. Other people confuse the statistical procedures available in in-memory R (which are very broad and often quite mature) with those available in Spark (which are less numerous and less mature).

Our goal

What we want to do with the “R and big data” series is:

  • Give a taste of some of the power of the R/Spark combination.
  • Share a “capabilities and readiness” checklist you should apply when evaluating infrastructure.
  • Start to publicly document R/Spark best practices.
  • Describe some of the warts and how to work around them.
  • Share fun tricks and techniques that make working with R/Spark much easier and more effective.
The start

Our next article in this series will be up soon and will discuss the nature of data-handles in Sparklyr (one of the R/Spark interfaces) and how to manage your data inventory neatly.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Sankey charts for swinging voters

Sat, 05/20/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Continuing my examination of the individual level voting behaviour from the New Zealand Election Study, I wanted to look at the way individuals swap between parties, and between “did not vote” and a party, from one election to another. How much and how this happens is obviously an important question for both political scientists and for politicians.

Vote transition visualisations

I chose a Sankey chart as a way of showing the transition matrix from self-reported party vote in the 2011 election to the 2014 election. Here’s a static version:

And here is the more screen-friendly interactive version, with mouseover tooltips to give actual estimates:

The point with these graphics is to highlight the transitions. For example, what were the implications of turnout being higher in 2014 than 2011 (77.9% of enrolled voters in 2014 compared to 74.2% in 2011)? Judging from this survey data, the National Party gained 6.6% of the enrolled population in 2014 by converting them from a 2011 “did not vote” and lost only 3.6% in the other direction. This net gain of three percentage points was enough to win the election for the National-led coalition. In contrast, the Labour party had a net gain from “did not vote” in 2011 of only 0.2 percentage points. Remember though that these are survey-based estimates, and subject to statistical error.

I find setting up and polishing Sankey charts – controlling colours for example – a bit of a pain, so the code at the bottom of this post on how this was done might be of interest.

Weighting, missing data, population and age complications

Those visualisations have a few hidden fishhooks, which careful readers would find if they compare the percentages in the tooltips of the interactive version with percentages reported by the New Zealand Electoral Commission.

  • The 2014 percentages are proportions of the enrolled population. As the 2014 turnout of enrolled voters was 77.9%, the numbers here are noticeably less than the usually cited percentages which were used to translate into seat counts (for example, National Party reported party vote of 47.0% of votes becomes 36.6% of enrolled voters)
  • The 2011 percentages are even harder to explain, because I’ve chosen not only to scale the party vote and “did not vote” to the 2011 enrolled population as reported by the Commission, but also to add in around 5% of the 2014 population that were too young to vote in 2011.

Two things I would have liked to have taken into account but wasn’t able to were:

  • The “leakage” from the 2011 election of people who were deceased or had left the country by 2014
  • Explicit recognition of people who voted in 2014 but not in 2011 because they were out of the country. There is a variable in the survey that picks up the year the respondent came to live in New Zealand if not born here, but for only 10 respondents was this 2012 or later (in contrast to age – there were 58 respondents aged 20 or less in 2014).

I re-weighted the survey so the 2014 and 2011 reported party votes matched the known totals (with the addition of people aged 15 to 17 in 2011). One doesn't normally re-weight a survey based on answers provided by the respondents, but in this case I think it makes perfect sense to calibrate to the public totals. The biggest impact is that for both years, but particularly 2011, the totals for "did not vote" are materially understated when relying on the respondents' self-report and the published weighting of the NZES, compared to the results when calibrated to the known totals.
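
The calibration itself is only a few lines with the survey package; this fragment is the core of the full code listed at the end of the post:

library(survey)
# design with the original NZES weights, then calibrate to the known party vote totals:
nzes_svy <- svydesign(~1, data = nzes4, weights = ~dwtfin)
nzes_cal <- calibrate(nzes_svy, list(~partyvote2014, ~partyvote2011),
                      list(actual_vote_2014, actual_vote_2011))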

When party vote in 2011 had been forgotten or was an NA, and this wasn’t explained by being too young in 2011, I used multiple imputation based on a subset of relevant variables to give five instances of probable party vote to each such respondent.

Taken together, all this gives the visualisations a perspective based in 2014. It is better to think of it as “where did the 2014 voters come from” than “where did the 2011 voters go”. This is fairly natural when we consider it is the 2014 New Zealand Election Study, but is worth keeping in mind in interpretation.

Age (and hence the impact new young voters coming in, and of older voters passing on) is important in voting behaviour, as even the most casual observation of politics shows. In New Zealand, the age distribution of party voters in 2014 is seen in the chart below:

Non-voters, Green voters and to a certain extent Labour voters are young; New Zealand First voters are older. If this interests you though, I suggest you look at the multivariate analysis in this blog post or, probably more fun, this fancy interactive web app which lets you play with the predicted probabilities of voting based on a combination of demographic and socio-economic status variables.

Code

Here’s the R code that did that imputation, weighting, and the graphics:

library(tidyverse) library(forcats) library(riverplot) library(sankeyD3) library(foreign) library(testthat) library(mice) library(grid) library(survey) # Caution networkD3 has its own sankeyD3 with less features. Make sure you know which one you're using! # Don't load up networkD3 in the same session #============import data====================== # See previous blog posts for where this comes from: unzip("D:/Downloads/NZES2014GeneralReleaseApril16.sav.zip") nzes_orig <- read.spss("NZES2014GeneralReleaseApril16.sav", to.data.frame = TRUE, trim.factor.names = TRUE) # Five versions of each row, so when we do imputation later on we # in effect doing multiple imputation: nzes <- nzes_orig[rep(1:nrow(nzes_orig), each = 5), ] %>% # lump up party vote in 2014: mutate(partyvote2014 = ifelse(is.na(dpartyvote), "Did not vote", as.character(dpartyvote)), partyvote2014 = gsub("M.ori", "Maori", partyvote2014), partyvote2014 = gsub("net.Man", "net-Man", partyvote2014), partyvote2014 = fct_infreq(partyvote2014)) %>% mutate(partyvote2014 = fct_lump(partyvote2014, 5)) # party vote in 2011, and a smaller set of columns to do the imputation: nzes2 <- nzes %>% mutate(partyvote2011 = as.factor(ifelse(dlastpvote == "Don't know", NA, as.character(dlastpvote)))) %>% # select a smaller number of variables so imputation is feasible: select(dwtfin, partyvote2014, partyvote2011, dethnicity_m, dage, dhhincome, dtradeunion, dprofassoc, dhousing, dclass, dedcons, dwkpt) # impute the missing values. Although we are imputing only a single set of values, # because we are doing it with each row repeated five times this has the same impact, # for the simple analysis we do later on, as multiple imputation: nzes3 <- complete(mice(nzes2, m = 1)) # Lump up the 2011 vote, tidy labels, and work out who was too young to vote: nzes4 <- nzes3 %>% mutate(partyvote2011 = fct_lump(partyvote2011, 5), partyvote2011 = ifelse(grepl("I did not vote", partyvote2011), "Did not vote", as.character(partyvote2011)), partyvote2011 = ifelse(dage < 21, "Too young to vote", partyvote2011), partyvote2014 = as.character(partyvote2014)) #===============re-weighting to match actual votes in 2011 and 2014=================== # This makes a big difference, for both years, but more for 2011. "Did not vote" is only 16% in 2011 # if we only calibrate to 2014 voting outcomes, but in reality was 25.8%. We calibrate for both # 2011 and 2014 so the better proportions for both elections. 
#------------2014 actual outcomes---------------- # http://www.elections.org.nz/news-media/new-zealand-2014-general-election-official-results actual_vote_2014 <- data_frame( partyvote2014 = c("National", "Labour", "Green", "NZ First", "Other", "Did not vote"), Freq = c(1131501, 604534, 257356, 208300, 31850 + 16689 + 5286 + 95598 + 34095 + 10961 + 5113 + 1730 + 1096 + 872 + 639, NA) ) # calculate the did not vote, from the 77.9 percent turnout actual_vote_2014[6, 2] <- (100 / 77.9 - 1) * sum(actual_vote_2014[1:5, 2]) # check I did the turnout sums right: expect_equal(0.779 * sum(actual_vote_2014[ ,2]), sum(actual_vote_2014[1:5, 2])) #---------------2011 actual outcomes--------------------- # http://www.elections.org.nz/events/past-events-0/2011-general-election/2011-general-election-official-results actual_vote_2011 <- data_frame( partyvote2011 = c("National", "Labour", "Green", "NZ First", "Other", "Did not vote", "Too young to vote"), Freq = c(1058636, 614937, 247372, 147511, 31982 + 24168 + 23889 + 13443 + 59237 + 11738 + 1714 + 1595 + 1209, NA, NA) ) # calculate the did not vote, from the 74.21 percent turnout at # http://www.elections.org.nz/events/past-events/2011-general-election actual_vote_2011[6, 2] <- (100 / 74.21 - 1) * sum(actual_vote_2011[1:5, 2]) # check I did the turnout sums right: expect_equal(0.7421 * sum(actual_vote_2011[1:6 ,2]), sum(actual_vote_2011[1:5, 2])) # from the below, we conclude 4.8% of the 2014 population (as estimated by NZES) # were too young to vote in 2011: nzes_orig %>% mutate(tooyoung = dage < 21) %>% group_by(tooyoung) %>% summarise(pop = sum(dwtfin), n = n()) %>% ungroup() %>% mutate(prop = pop / sum(pop)) # this is pretty approximate but will do for our purposes. actual_vote_2011[7, 2] <- 0.048 * sum(actual_vote_2011[1:6, 2]) # Force the 2011 numbers to match the 2014 population actual_vote_2011$Freq <- actual_vote_2011$Freq / sum(actual_vote_2011$Freq) * sum(actual_vote_2014$Freq) #------------create new weights-------------------- # set up survey design with the original weights: nzes_svy <- svydesign(~1, data = nzes4, weights = ~dwtfin) # calibrate weights to the known total party votes in 2011 and 2014: nzes_cal <- calibrate(nzes_svy, list(~partyvote2014, ~partyvote2011), list(actual_vote_2014, actual_vote_2011)) # extract those weights for use in straight R (not just "survey") nzes5 <- nzes4 %>% mutate(weight = weights(nzes_cal)) # See impact of calibrated weights for the 2014 vote: prop.table(svytable(~partyvote2014, nzes_svy)) # original weights prop.table(svytable(~partyvote2014, nzes_cal)) # recalibrated weights # See impact of calibrated weights for the 2011 vote: prop.table(svytable(~partyvote2011, nzes_svy)) # original weights prop.table(svytable(~partyvote2011, nzes_cal)) # recalibrated weights #=======================previous years vote riverplot version============ the_data <- nzes5 %>% mutate( # add two spaces to ensure the partyvote2011 does not have any # names that exactly match the party vote in 2014 partyvote2011 = paste(partyvote2011, " "), partyvote2011 = gsub("M.ori", "Maori", partyvote2011)) %>% group_by(partyvote2011, partyvote2014) %>% summarise(vote_prop = sum(weight)) %>% ungroup() # change names to the names wanted by makeRiver names(the_data) <- c("col1", "col2", "Value") # node ID need to be characters I think c1 <- unique(the_data$col1) c2 <- unique(the_data$col2) nodes_ref <- data_frame(fullname = c(c1, c2)) %>% mutate(position = rep(c(1, 2), times = c(length(c1), length(c2)))) %>% mutate(ID = LETTERS[1:n()]) edges 
<- the_data %>% left_join(nodes_ref[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>% rename(N1 = ID) %>% left_join(nodes_ref[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>% rename(N2 = ID) %>% as.data.frame(stringsAsFactors = FALSE) rp <- makeRiver(nodes = as.vector(nodes_ref$ID), edges = edges, node_labels = nodes_ref$fullname, # manual vertical positioning by parties. Look at # nodes_ref to see the order in which positions are set. # This is a pain, so I just let it stay with the defaults: # node_ypos = c(4, 1, 1.8, 3, 6, 7, 5, 4, 1, 1.8, 3, 6, 7), node_xpos = nodes_ref$position, # set party colours; all based on those in nzelect::parties_v: node_styles = list(C = list(col = "#d82a20"), # red labour D = list(col = "#00529F"), # blue national E= list(col = "black"), # black NZFirst B = list(col = "#098137"), # green J = list(col = "#d82a20"), # labour I = list(col = "#098137"), # green K = list(col = "#00529F"), # national L = list(col = "black"))) # NZ First # Turn the text horizontal, and pale grey ds <- default.style() ds$srt <- 0 ds$textcol <- "grey95" mygp <- gpar(col = "grey75") # using PNG rather than SVG as vertical lines appear in the SVG version par(bg = "grey40") # note the plot_area argument - for some reason, defaults to using only half the # vertical space available, so set this to higher than 0.5!: plot(rp, default_style = ds, plot_area = 0.9) title(main = "Self-reported party vote in 2011 compared to 2014", col.main = "white", font.main = 1) grid.text(x = 0.15, y = 0.1, label = "2011 party vote", gp = mygp) grid.text(x = 0.85, y = 0.1, label = "2014 party vote", gp = mygp) grid.text(x = 0.95, y = 0.03, gp = gpar(fontsize = 7, col = "grey75"), just = "right", label = "Source: New Zealand Election Study, analysed at https://ellisp.github.io") #=======================sankeyD3 version==================== nodes_ref2 <- nodes_ref %>% mutate(ID = as.numeric(as.factor(ID)) - 1) %>% as.data.frame() edges2 <- the_data %>% ungroup() %>% left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>% rename(N1 = ID) %>% left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>% rename(N2 = ID) %>% as.data.frame(stringsAsFactors = FALSE) %>% mutate(Value = Value / sum(Value) * 100) pal <- 'd3.scaleOrdinal() .range(["#DCDCDC", "#098137", "#d82a20", "#00529F", "#000000", "#DCDCDC", "#DCDCDC", "#098137", "#d82a20", "#00529F", "#000000", "#DCDCDC"]);' sankeyNetwork(Links = edges2, Nodes = nodes_ref2, Source = "N1", Target = "N2", Value = "Value", NodeID = "fullname", NodeGroup = "fullname", LinkGroup = "col2", fontSize = 12, nodeWidth = 30, colourScale = JS(pal), numberFormat = JS('d3.format(".1f")'), fontFamily = "Calibri", units = "%", nodeShadow = TRUE, showNodeValues = FALSE, width = 700, height = 500) #=======other by the by analysis================== # Age density plot by party vote # Remember to weight by the survey weights - in this case it controls for # the under or over sampling by age in the original design. nzes5 %>% ggplot(aes(x = dage, fill = partyvote2014, weight = weight / sum(nzes5$weight))) + geom_density(alpha = 0.3) + facet_wrap(~partyvote2014, scales = "free_y") + scale_fill_manual(values = parties_v) + theme(legend.position = "none") + labs(x = "Age at time of 2014 election", caption = "Source: New Zealand Election Study") + ggtitle("Age distribution by Party Vote in the 2014 New Zealand General Election")

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R Weekly Bulletin Vol – IX

Sat, 05/20/2017 - 08:42

This week's R bulletin will cover how to list files, extract file names, and create a folder using R.

We will also cover functions like the select function, the filter function, and the arrange function. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Run the current chunk – Ctrl+Alt+C
2. Run the next chunk – Ctrl+Alt+N
3. Run the current function definition – Ctrl+Alt+F

Problem Solving Ideas

How to list files with a particular extension

To list files with a particular extension, one can use the pattern argument in the list.files function. For example to list csv files use the following syntax.

Example:

files = list.files(pattern = "\\.csv$")

This will list all the csv files present in the current working directory. To list files in any other folder, you need to provide the folder path.

list.files(path = "C:/Users/MyFolder", pattern = "\\.csv$")

The $ at the end anchors the match to the end of the string, and the \\. ensures that you match only files with the extension .csv.
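
If you also want the full path of each matching file returned (handy for reading them in straight away), add the full.names argument:

list.files(path = "C:/Users/MyFolder", pattern = "\\.csv$", full.names = TRUE)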

Extracting file name using gsub function

When we download stock data from Google Finance, the file's name corresponds to the stock symbol. If we want to extract the stock symbol from the file name, we can do it using the gsub function. The function searches for matches to the pattern argument and replaces all of them with the value given in the replacement argument. The syntax for the function is:

gsub(pattern, replacement, x)

where,

pattern – is a character string containing a regular expression to be matched in the given character vector.
replacement – a replacement for matched pattern.
x – is a character vector where matches are sought.

In the example given below, we extract the file names for files stored in the "Reading MFs" folder. We have downloaded the stock price data into the R working directory for two companies, namely MRF and PAGEIND Ltd.

Example:

folderpath = paste(getwd(), "/Reading MFs", sep = "")
temp = list.files(folderpath, pattern = "*.csv")
print(temp)

[1] “MRF.csv”  “PAGEIND.csv”

gsub("*.csv$", "", temp)

[1] “MRF”   “PAGEIND”

Create a folder using R

One can create a folder via R with the help of the “dir.create” function. The function creates a folder with the name as specified in the last element of the path. Trailing path separators are discarded.

The syntax is given as:

dir.create(path, showWarnings = FALSE, recursive = FALSE)

Example:

dir.create("D:/RCodes", showWarnings = FALSE, recursive = FALSE)

This will create a folder called “RCodes” in the D drive.

Functions Demystified

select function

The select function comes from the dplyr package and can be used to select certain columns of a data frame which you need. Consider the data frame “df” given in the example.

Example:

library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Suppose we wanted to select the first 2 columns only. We can use the names of the
# columns in the second argument to select them from the main data frame.
subset_df = select(df, Ticker:OpenPrice)
print(subset_df)

# Suppose we want to omit the OpenPrice column using the select function. We can do so by
# using the negative sign along with the column name as the second argument to the function.
subset_df = select(df, -OpenPrice)
print(subset_df)

# We can also use the 'starts_with' and the 'ends_with' arguments for selecting columns from
# the data frame. The example below will select all the columns which end with the word 'Price'.
subset_df = select(df, ends_with("Price"))
print(subset_df)

filter function

The filter function comes from the dplyr package and is used to extract subsets of rows from a data frame. This function is similar to the subset function in R.

Example:

library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Suppose we want to select stocks with closing prices above 750. We can do so using
# the filter function in the following manner:
subset_df = filter(df, ClosePrice > 750)
print(subset_df)

# One can also use a combination of conditions as the second argument in filtering a data set.
subset_df = filter(df, ClosePrice > 750 & OpenPrice < 2000)
print(subset_df)

arrange function

The arrange function is part of the dplyr package and is used to reorder the rows of a data frame according to one of the columns. Rows are arranged in ascending order by default; to arrange them in descending order, use the special desc() operator.

Example:

library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Arrange in descending order
subset_df = arrange(df, desc(OpenPrice))
print(subset_df)

# Arrange in ascending order
subset_df = arrange(df, -desc(OpenPrice))
print(subset_df)

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Download the PDF Now!

The post R Weekly Bulletin Vol – IX appeared first on .

Which science is all around? #BillMeetScienceTwitter

Sat, 05/20/2017 - 02:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I'll admit I didn't really know who Bill Nye was before yesterday. His name sounds a bit like Bill Nighy's, that's all I knew. But well, science is all around, and quite often scientists on Twitter start interesting campaigns. Remember the #actuallylivingscientists campaign, whose animals I dedicated a blog post to? This time, the Twitter campaign is the #BillMeetScienceTwitter hashtag, with which scientists introduce themselves to the famous science TV host Bill Nye. Here is a nice article about the movement.

Since I like surfing on Twitter trends, I decided to download a few of these tweets and to use my own R interface to the Monkeylearn machine learning API, monkeylearn (part of the rOpenSci project!), to classify the tweets in the hope of finding the most represented science fields. So, which science is all around?

Getting the tweets

It might sound a bit like trolling by now, but if you wanna get Twitter data, I recommend using rtweet because it’s a good package and because it’s going to replace twitteR which you might know from other blogs.

I only keep tweets in English, and moreover original ones, i.e. not retweets.

library("rtweet") billmeet <- search_tweets(q = "#BillMeetScienceTwitter", n = 18000, type = "recent") billmeet <- unique(billmeet) billmeet <- dplyr::filter(billmeet, lang == "en") billmeet <- dplyr::filter(billmeet, is_retweet == FALSE)

I’ve ended up with 2491 tweets.

Classifying the tweets

I've chosen to use this taxonomy classifier, which classifies text according to generic topics and has quite a few stars on the Monkeylearn website. I don't think it was trained on tweets, and it wasn't trained to classify science topics in particular, which is not optimal, but it has the merit of being readily available. I've still not started training my own algorithms, and anyway, if I did I'd start by creating a very crucial algorithm for determining animal fluffiness in pictures, not text mining stuff. That was a bit off topic; let's go back to science Twitter!

When I decided to use my own package I had forgotten that it takes charge of cutting the request vector into groups of 20 tweets, since the API only accepts 20 texts at a time. I thought I'd have to do that splitting myself, but no: since I did it once in the code of the package, I'll never need to write that code ever again. Great feeling! Look at how easy the code is after cleaning up the tweets a bit! One just needs to wait a bit before getting all the results.

output <- monkeylearn::monkeylearn_classify(request = billmeet$text, classifier_id = "cl_5icAVzKR") str(output) ## Classes 'tbl_df', 'tbl' and 'data.frame': 4466 obs. of 4 variables: ## $ category_id: int 64638 64640 64686 64696 64686 64687 64689 64692 64648 64600 ... ## $ probability: num 0.207 0.739 0.292 0.784 0.521 0.565 0.796 0.453 0.301 0.605 ... ## $ label : chr "Computers & Internet" "Internet" "Humanities" "Religion & Spirituality" ... ## $ text_md5 : chr "f7b28f45ea379b4ca6f34284ce0dc4b7" "f7b28f45ea379b4ca6f34284ce0dc4b7" "b95429d83df2cabb9cd701a562444f0b" "b95429d83df2cabb9cd701a562444f0b" ... ## - attr(*, "headers")=Classes 'tbl_df', 'tbl' and 'data.frame': 0 obs. of 0 variables

In the output, the package creator decided not to include the whole text corresponding to each line but rather its digest, computed with the MD5 algorithm. So to join the output to the tweets again, I'll have to first digest the tweet, which I do by just copying the code from the package. After all, I wrote it. Maybe it was the only time I successfully used vapply in my whole life.

billmeet <- dplyr::mutate(billmeet, text_md5 = vapply(X = text, FUN = digest::digest, FUN.VALUE = character(1), USE.NAMES = FALSE, algo = "md5")) billmeet <- dplyr::select(billmeet, text, text_md5) output <- dplyr::left_join(output, billmeet, by = "text_md5")

Looking at this small sample, some things make sense, others make less sense, either because the classification isn't good or because the tweet looks like spam. Since my own field isn't text analysis, I'll consider myself happy with these results, but I'd of course be happy to read any better version of it.

As in my #first7jobs post, I'll make a very arbitrary decision and keep only the labels to which a probability higher than 0.5 was attributed.

output <- dplyr::filter(output, probability > 0.5)

This covers 0.45 of the original tweets sample. I can only hope it’s a representative sample.

How many labels do I have by tweet?

dplyr::group_by(output) %>% dplyr::summarise(nlabels = n()) %>% dplyr::group_by(nlabels) %>% dplyr::summarise(n_tweets = n()) %>% knitr::kable() nlabels n_tweets 1415 1

Perfect, only one.

Looking at the results

I know I suck at finding good section titles… At least I like the title of the post, which is a reference to the song Bill Nighy, not Bill Nye, sings in Love Actually. My husband assumed that science Twitter has more biomedical stuff. Now, even if my results were to support this fact, note that this could as well be because it’s easier to classify biomedical tweets.

I’ll first show a few examples of tweets for given labels.

dplyr::filter(output, label == "Chemistry") %>% head(n = 5) %>% knitr::kable() category_id probability label text_md5 text 64701 0.530 Chemistry e82fc920b07ea9d08850928218529ca9 Hi @billnye I started off running BLAST for other ppl but now I have all the money I make them do my DNA extractions #BillMeetScienceTwitter 64701 0.656 Chemistry d21ce4386512aae5458565fc2e36b686 .@uw’s biochemistry dept – home to Nobel Laureate Eddy Fischer & ZymoGenetics co founder Earl Davie… https://t.co/0nsZW3b3xu 64701 0.552 Chemistry 1d5be9d1e169dfbe2453b6cbe07a4b34 Yo @BillNye – I’m a chemist who plays w lasers & builds to study protein interactions w materials #BillMeetScienceTwitter 64701 0.730 Chemistry 1b6a25fcb66deebf35246d7eeea34b1f Meow @BillNye! I’m Zee and I study quantum physics and working on a Nobel prize. #BillMeetScienceTwitter https://t.co/oxAZO5Y6kI 64701 0.873 Chemistry 701d8c53e3494961ee7f7146b28b9c8c Hi @BillNye, I’m a organic chemist studying how molecules form materials like the liquid crystal shown below.… https://t.co/QNG2hSG8Fw dplyr::filter(output, label == "Aquatic Mammals") %>% head(n = 5) %>% knitr::kable() category_id probability label text_md5 text 64609 0.515 Aquatic Mammals f070a05b09d2ccc85b4b1650139b6cd0 Hi Bill, I am Anusuya. I am a palaeo-biologist working at the University of Cape Town. @BillNye #BillMeetScienceTwitter 64609 0.807 Aquatic Mammals bb06d18a1580c28c255e14e15a176a0f Hi @BillNye! I worked with people at APL to show that California blue whales are nearly recovered #BillMeetScienceTwitter 64609 0.748 Aquatic Mammals 1ca07aad8bc1abe54836df8dd1ff1a9d Hi @BillNye! I’m researching marine ecological indicators to improve Arctic marine monitoring and management… https://t.co/pJv8Om4IeI 64609 0.568 Aquatic Mammals a140320fcf948701cfc9e7b01309ef8b More like as opposed to vaginitis in dolphins or chimpanzees or sharks #BillMeetScienceTwitter https://t.co/gFCQIASty1 64609 0.520 Aquatic Mammals 06d1e8423a7d928ea31fd6db3c5fee05 Hi @BillNye I study visual function in ppl born w/o largest connection between brain hemispheres #callosalagenesis… https://t.co/WSz8xsP38R dplyr::filter(output, label == "Internet") %>% head(n = 5) %>% knitr::kable() category_id probability label text_md5 text 64640 0.739 Internet f7b28f45ea379b4ca6f34284ce0dc4b7 @BillNye #AskBillNye @BillNye join me @AllendaleCFD. More details at https://t.co/nJPwWARSsa #BillMeetScienceTwitter             64640 0.725 Internet b2b7843dc9fcd9cd959c828beb72182d @120Stat you could also use #actuallivingscientist #womeninSTEM or #BillMeetScienceTwitter to spread the word about your survey as well   64640 0.542 Internet a357e1216c5e366d7f9130c7124df316 Thank you so much for the retweet, @BillNye! I’m excited for our next generation of science-lovers!… https://t.co/B3iz3KVCOQ   64640 0.839 Internet 61712f61e877f3873b69fed01486d073 @ParkerMolloy Hi @BillNye, Im an elem school admin who wants 2 bring in STEM/STEAM initiatives 2 get my students EX… https://t.co/VMLO3WKVRv   64640 0.924 Internet 4c7f961acfa2cdd17c9af655c2e81684 I just filled my twitter-feed with brilliance. #BIllMeetScienceTwitter

Based on that, and on the huge number of internet-labelled tweets, I decided to remove those.

library("ggplot2") library("viridis") label_counts <- output %>% dplyr::filter(label != "Internet") %>% dplyr::group_by(label) %>% dplyr::summarise(n = n()) %>% dplyr::arrange(desc(n)) label_counts <- label_counts %>% dplyr::mutate(label = ifelse(n < 5, "others", label)) %>% dplyr::group_by(label) %>% dplyr::summarize(n = sum(n)) %>% dplyr::arrange(desc(n)) label_counts <- dplyr::mutate(label_counts, label = factor(label, ordered = TRUE, levels = unique(label))) ggplot(label_counts) + geom_bar(aes(label, n, fill = label), stat = "identity")+ scale_fill_viridis(discrete = TRUE, option = "plasma")+ theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1), text = element_text(size=25), legend.position = "none")

In the end, I’m always skeptical when looking at the results of such classifiers, and well at the quality of my sample to begin with – but then I doubt there ever was a hashtag that was perfectly used to only answer the question and not spam it and comment it (which is what I’m doing). I’d say it seems to support my husband’s hypothesis about biomedical stuff.

I’m pretty sure Bill Nye won’t have had the time to read all the tweets, but I think he should save them, or at least all the ones he can get via the Twitter API thanks to e.g. rtweet, in order to be able to look through them next time he needs an expert. And in the random sample of tweets he’s read, let’s hope he was exposed to a great diversity of science topics (and of scientists), although, hey, the health and life related stuff is the most interesting of course. Just kidding. I liked reading tweets about various scientists, science rocks! And these last words would be labelled with “performing arts”, perfect way to end this post.

To leave a comment for the author, please follow the link and comment on their blog: Maëlle. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

New R Users group in Münster!

Sat, 05/20/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

This is to announce that Münster now has its very own R users group!

If you are from the area, come join us (or if you happen to know someone who is and who might be interested, please forward the info).

You can find us on meetup.com: https://www.meetup.com/Munster-R-Users-Group/ and we are also on the R User groups list.

Code for the logo, which of course has been created in R:

library(ggplot2)
library(ggmap)
library(ggthemes)

munster <- get_map(location = 'Munster', zoom = 12, maptype = "watercolor")

ggmap(munster) +
  scale_y_continuous(limits = c(51.94, 51.97)) +
  annotate("text", x = 7.565, y = 51.96, label = "MünsteR", size = 8) +
  annotate("text", x = 7.685, y = 51.945, label = "R Users Group", size = 5) +
  theme_map()

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

A Primer in Functional Programming in R Exercises (Part – 1)

Fri, 05/19/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the exercises below we cover the basics of functional programming in R (part 1 of a two-part series of exercises on functional programming). We consider recursion in R, the apply family of functions, and higher-order functions such as Map, Reduce and Filter.
Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
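
If Map, Reduce and Filter are new to you, here is a tiny warm-up before the exercises (these are not the exercise answers):

Map(`+`, 1:3, 4:6)                     # element-wise combination: list(5, 7, 9)
Filter(function(x) x %% 2 == 0, 1:10)  # keep only the even numbers: 2 4 6 8 10
Reduce(`+`, 1:5)                       # fold a vector with +: 15
Reduce(`+`, 1:5, accumulate = TRUE)    # running totals: 1 3 6 10 15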

Exercise 1
Create a function which calculates the factorial of a number with the help of Reduce.

Exercise 2
Create the same factorial function, but with recursion and memoization.

Exercise 3
Create a function cum_add which performs cumulative summation. For example, if x <- 1:3 then cum_add(x) will result in 1 3 6. Don't use cumsum.

Exercise 4
Create a function which takes a data frame and returns the mean, minimum and maximum of all numeric columns. Your function should take a data frame as input, e.g. your_func(iris).
Try to avoid loops; it is possible to do it in one line. Show off your functional skills.

Learn more about Functions in the online course R Programming: Advanced Analytics In R For Data Science. In this course you will learn how to prepare your data quicker and more efficiently, leaving you more time to focus on analyzing and visualizing your results!

Exercise 5
Create a function centerColumnAroundMean which takes a numeric data frame and transforms it in such a way that each column's values are centered around that column's mean. For example, if my data frame is like this
df

then centerColumnAroundMean(df) will result in

It is possible to achieve this with a single line of R. Please don't use loops.
Exercise 6
I have a list of movies which has two movie franchises as its elements, STARWARS and LOTR. You can create the movie list with
my_movie_list<- list(STARWARS= list("A NEW HOPE","The Last Jedi","The Force Awakens"),LOTR=list("THE FELLOWSHIP OF THE RING","THE Two Towers","The RETURN of the KING","Hobbit" = list("An unexpected Journey","The Battle of the FIVE ARMY","The Desolation of Smaug") ))
now the problem I am facing is some of the texts are in caps and some of them are in small without any particular order , I would like the list to have a format like
“The Last Jedi” , Help me in writing a function which will do the same . Please keep in mind that the list is a nested list .
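
One possible sketch: rapply walks the nested list and reformats every character leaf, here by lower-casing it and then applying tools::toTitleCase:

tidy_titles <- function(movies) rapply(movies,
                                       function(x) tools::toTitleCase(tolower(x)),
                                       classes = "character",
                                       how = "replace")
tidy_titles(my_movie_list)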

Exercise 7
Load the diamonds dataset from the ggplot2 package. I want to buy a diamond of each color but don't want to pay the top price; I assume the second-highest price is good enough for me. Help me find the second-highest price for each color in the dataset.
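
A sketch of one approach, splitting prices by color and picking the second-largest unique value:

library(ggplot2)
sapply(split(diamonds$price, diamonds$color),
       function(p) sort(unique(p), decreasing = TRUE)[2])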

Exercise 8
Use the diamonds dataset already loaded in the previous exercise. I want to know the average price for each combination of cut and color; your output should show the mean price for every cut/color combination. Don't use table.
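
One possible sketch uses tapply, which cross-tabulates the mean price by cut and color without calling table:

tapply(diamonds$price, list(diamonds$cut, diamonds$color), mean)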

Exercise 9
Load the iris dataset. I want to get the third row for each group of species. Help me write a short function to achieve that.
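
A sketch of one approach: split by species, take row three of each piece, and bind the pieces back together:

third_row_per_group <- function(df) do.call(rbind, lapply(split(df, df$Species), function(d) d[3, ]))
third_row_per_group(iris)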

Exercise 10
Create a new environment with the new.env() command and create three vector variables in that environment: a = 1:10; b = 100:500; c = 1000:1500.
Without knowing or manually calling mean on each of the vector variables, can you print the mean of every variable in the new environment?
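
A sketch of one approach (the variable names are mine): eapply applies a function to every object in an environment:

my_env <- new.env()
my_env$a <- 1:10
my_env$b <- 100:500
my_env$c <- 1000:1500
eapply(my_env, mean)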

Related exercise sets:
  1. Accessing Dataframe Objects Exercises
  2. Correlation and Correlogram Exercises
  3. Building Shiny App exercises part 4
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Text Mining with R: A Tidy Approach

Fri, 05/19/2017 - 18:00

(This article was first published on Pachá (Batteries Included), and kindly contributed to R-bloggers)

About the book

This book applies tidy data principles to text analysis. The aim is to present tools to make many text mining tasks easier, more effective, and consistent with tools already in use, and in particular it presents the tidytext R package.

I love this ebook. At the moment you can read the chapters on the book's website, and I hope it will soon be available on Amazon so that I can have a paperback copy.

The authors of this beautiful exposition of methodology and coding are Julia Silge and David Robinson. Kudos to both of them. In particular, I have been following Julia's blog posts over the last two years and using them as a reference when teaching R in my courses.

Table of contents

List of chapters:

  1. The tidy text format
  2. Sentiment analysis with tidy data
  3. Analyzing word and document frequency: tf-idf
  4. Relationships between words: n-grams and correlations
  5. Converting to and from non-tidy formats
  6. Topic modeling
  7. Case study: comparing Twitter archives
  8. Case study: mining NASA metadata
  9. Case study: analyzing usenet text
  10. References
Remarkable contributions of this book

In my opinion chapter 5 is one of the best expositions of data structures in R. By using modern R packages such as dplyr and tidytext, among others, the authors move between tibble, DocumentTermMatrix and VCorpus objects, presenting a set of good practices in R along the way and including ggplot2 charts that make concepts such as sentiment analysis clear.
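
To give a flavour of that workflow, here is a minimal sketch (the tiny example table is mine, not taken from the book) that casts a tidy word-count tibble into a DocumentTermMatrix and tidies it back again:

library(dplyr)
library(tidytext)

# A made-up tidy table: one row per document/word pair
word_counts <- tibble::tibble(
  document = c("doc1", "doc1", "doc2"),
  word     = c("tidy", "data", "tidy"),
  n        = c(2L, 1L, 3L)
)

dtm <- word_counts %>% cast_dtm(document, word, n)  # tidy tibble -> DocumentTermMatrix
tidy(dtm)                                           # and back to one row per document/term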

If you often hear colleagues saying that R syntax is awkward, show this material to them. People who used R five years ago or more, and haven't used it in a while, will probably be amazed to see how the %>% operator is used here.

Text analysis requires working with a variety of tools, many of which have inputs and outputs that aren’t in a tidy form. What the authors present here is a noble and remarkable piece of work.

To leave a comment for the author, please follow the link and comment on their blog: Pachá (Batteries Included). R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

AzureDSVM: a new R package for elastic use of the Azure Data Science Virtual Machine

Fri, 05/19/2017 - 17:30

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)

The Azure Data Science Virtual Machine (DSVM) is a curated VM which provides commonly-used tools and software for data science and machine learning, pre-installed. AzureDSVM is a new R package that enables seamless interaction with the DSVM from a local R session, by providing functions for the following tasks:

  1. Deployment, deallocation, deletion of one or multiple DSVMs;
  2. Remote execution of local R scripts: compute contexts available in Microsoft R Server can be enabled for enhanced computation efficiency for either a single DSVM or a cluster of DSVMs;
  3. Retrieval of the cost and total expense of using the DSVM(s).

AzureDSVM is built upon the AzureSMR package and depends on the same set of R packages (httr, jsonlite, etc.). It requires the same initial set-up in Azure Active Directory for authentication.

To install AzureDSVM with the devtools package:

library(devtools)
devtools::install_github("Azure/AzureDSVM")
library("AzureDSVM")

When deploying a Data Science Virtual Machine, the machine name, size, OS type, etc. must be specified. AzureDSVM supports DSVMs on Ubuntu, CentOS, Windows, and Windows with the Deep Learning Toolkit (on GPU-class instances). For example, the following code fires up a D4 v2 Ubuntu DSVM located in South East Asia:

deployDSVM(context,
           resource.group = "example",
           location       = "southeastasia",
           size           = "Standard_D4_v2",
           os             = "Ubuntu",
           hostname       = "mydsvm",
           username       = "myname",
           pubkey         = "pubkey")

where context is an azureActiveContext object, created by the AzureSMR::createAzureContext() function, that encapsulates the credentials (tenant ID, client ID, etc.) used for Azure authentication.
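
For reference, such a context object might be created along the following lines (a sketch; the placeholder credentials, and the assumption that the usual AzureSMR arguments tenantID, clientID and authKey apply, are mine):

library(AzureSMR)

# Placeholder values -- substitute your own Azure Active Directory details
context <- createAzureContext(tenantID = "<tenant-id>",
                              clientID = "<client-id>",
                              authKey  = "<client-secret>")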

In addition to launching a single DSVM, the AzureDSVM package makes it easy to launch a cluster with multiple virtual machines. Multi-deployment supports:

  1. creating a collection of independent DSVMs which can be distributed to a group of data scientists for collaborative projects, as well as
  2. clustering a set of connected DSVMs for high-performance computation.

To create a cluster of 5 Ubuntu DSVMs with default VM size, use:

cluster <- deployDSVMCluster(context,
                             resource.group = RG,
                             location       = "southeastasia",
                             hostnames      = "mydsvm",
                             usernames      = "myname",
                             pubkeys        = "pubkey",
                             count          = 5)

To execute a local script on a remote cluster of DSVMs with a specified Microsoft R Server compute context, use the executeScript function. (NOTE: only Linux-based DSVM instances are supported at the moment, as under the hood the remote execution is achieved via SSH. Microsoft R Server 9.x allows remote interaction for both Linux and Windows; more details can be found here.) Here, we use the RxForeachDoPar context (as indicated by the compute.context option):

executeScript(context,
              resource.group  = "southeastasia",
              machines        = "dsvm_names_in_the_cluster",
              remote          = "fqdn_of_dsvm_used_as_master",
              user            = "myname",
              script          = "path_to_the_script_for_remote_execution",
              master          = "fqdn_of_dsvm_used_as_master",
              slaves          = "fqdns_of_dsvms_used_as_slaves",
              compute.context = "clusterParallel")

Information about the cost and expense incurred by the DSVM(s) can be retrieved with:

consum <- expenseCalculator(context,
                            instance    = "mydsvm",
                            time.start  = "time_stamp_of_starting_point",
                            time.end    = "time_stamp_of_ending_point",
                            granularity = "Daily",
                            currency    = "USD",
                            locale      = "en-US",
                            offerId     = "offer_id_of_azure_subscription",
                            region      = "southeastasia")
print(consum)

Detailed introductions and tutorials can be found in the AzureDSVM Github repository, linked below.

Github (Azure): AzureDSVM

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R/Finance 2017 livestreaming today and tomorrow

Fri, 05/19/2017 - 16:18

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you weren't able to make it to Chicago for R/Finance, the annual conference devoted to applications of R in the financial industry, don't fret: the entire conference is being livestreamed (with thanks to the team at Microsoft). You can watch the proceedings at aka.ms/r_finance, and recordings will be available at the same link after the event.

Check out the conference program below for the schedule of events (all times are US Central Daylight Time).

Friday, May 19th, 2017
  • 09:30 – 09:35  Kickoff
  • 09:35 – 09:40  Sponsor Introduction
  • 09:40 – 10:10  Marcelo Perlin: GetHFData: An R package for downloading and aggregating high frequency trading data from Bovespa; Jeffrey Mazar: The obmodeling Package; Yuting Tan: Return Volatility, Market Microstructure Noise, and Institutional Investors: Evidence from High Frequency Market; Stephen Rush: Adverse Selection and Broker Execution; Jerzy Pawlowski: How Can Machines Learn to Trade?
  • 10:10 – 10:30  Michael Hirsch: Revealing High-Frequency Trading Provisions of Liquidity with Visualization in R
  • 10:30 – 10:50  Matthew Dixon: MLEMVD: A R Package for Maximum Likelihood Estimation of Multivariate Diffusion Models
  • 10:50 – 11:10  Break
  • 11:10 – 11:30  Seoyoung Kim: Zero-Revelation RegTech: Detecting Risk through Linguistic Analysis of Corporate Emails and News
  • 11:30 – 12:10  Szilard Pafka: No-Bullshit Data Science
  • 12:10 – 13:30  Lunch
  • 13:30 – 14:00  Francesco Bianchi: Measuring Risk with Continuous Time Generalized Autoregressive Conditional Heteroscedasticity Models; Eina Ooka: Bunched Random Forest in Monte Carlo Risk Simulation; Matteo Crimella: Operational Risk Stress Testing: An Empirical Comparison of Machine Learning Algorithms and Time Series Forecasting Methods; Thomas Zakrzewski: Using R for Regulatory Stress Testing Modeling; Andy Tang: How much structure is best?
  • 14:00 – 14:20  Robert McDonald: Ratings and Asset Allocation: An Experimental Analysis
  • 14:20 – 14:50  Break
  • 14:50 – 15:10  Dries Cornilly: Nearest Comoment Estimation with Unobserved Factors and Linear Shrinkage
  • 15:10 – 15:30  Bernhard Pfaff: R package: mcrp: Multiple criteria risk contribution optimization
  • 15:30 – 16:00  Oliver Haynold: Practical Options Modeling with the sn Package, Fat Tails, and How to Avoid the Ultraviolet Catastrophe; Shuang Zhou: A Nonparametric Estimate of the Risk-Neutral Density and Its Applications; Luis Damiano: A Quick Intro to Hidden Markov Models Applied to Stock Volatility; Oleg Bondarenko: Rearrangement Algorithm and Maximum Entropy; Xin Chen: Risk and Performance Estimator Standard Errors for Serially Correlated Returns
  • 16:00 – 16:20  Qiang Kou: Text analysis using Apache MxNet
  • 16:20 – 16:40  Robert Krzyzanowski: Syberia: A development framework for R
  • 16:40 – 16:52  Matt Dancho: New Tools for Performing Financial Analysis Within the 'Tidy' Ecosystem; Leonardo Silvestri: ztsdb, a time-series DBMS for R users

Saturday, May 20th, 2017
  • 08:00 – 09:00  Coffee/Breakfast
  • 09:00 – 09:05  Kickoff
  • 09:05 – 09:35  Stephen Bronder: Integrating Forecasting and Machine Learning in the mlr Framework; Leopoldo Catania: Generalized Autoregressive Score Models in R: The GAS Package; Guanhao Feng: Regularizing Bayesian Predictive Regressions; Jonas Rende: partialCI: An R package for the analysis of partially cointegrated time series; Carson Sievert: Interactive visualization for multiple time series
  • 09:35 – 09:55  Emanuele Guidotti: yuimaGUI: A graphical user interface for the yuima package
  • 09:55 – 10:15  Daniel Kowal: A Bayesian Multivariate Functional Dynamic Linear Model
  • 10:15 – 10:45  Break
  • 10:45 – 11:05  Jason Foster: Scenario Analysis of Risk Parity using RcppParallel
  • 11:05 – 11:35  Michael Weylandt: Convex Optimization for High-Dimensional Portfolio Construction; Lukas Elmiger: Risk Parity Under Parameter Uncertainty; Ilya Kipnis: Global Adaptive Asset Allocation, and the Possible End of Momentum; Vyacheslav Arbuzov: Dividend strategy: towards the efficient market; Nabil Bouamara: The Alpha and Beta of Equity Hedge UCITS Funds – Implications for Momentum Investing
  • 11:35 – 12:15  Dave DeMers: Risk Fast and Slow
  • 12:15 – 13:35  Lunch
  • 13:35 – 13:55  Eric Glass: Equity Factor Portfolio Case Study
  • 13:55 – 14:15  Jonathan Regenstein: Reproducible Finance with R: A Global ETF Map
  • 14:15 – 14:35  David Ardia: Markov-Switching GARCH Models in R: The MSGARCH Package
  • 14:35 – 14:55  Keven Bluteau: Forecasting Performance of Markov-Switching GARCH Models: A Large-Scale Empirical Study
  • 14:55 – 15:07  Riccardo Porreca: Efficient, Consistent and Flexible Credit Default Simulation; Maisa Aniceto: Machine Learning and the Analysis of Consumer Lending
  • 15:07 – 15:27  David Smith: Detecting Fraud at 1 Million Transactions per Second
  • 15:27 – 15:50  Break
  • 15:50 – 16:10  Thomas Harte: The PE package: Modeling private equity in the 21st century
  • 16:10 – 16:30  Guanhao Feng: The Market for English Premier League (EPL) Odds
  • 16:30 – 16:50  Bryan Lewis: Project and conquer
  • 16:50 – 17:00  Prizes and Feedback
  • 17:00 – 17:05  Conclusion

R/Finance 2017 livestream: aka.ms/r_finance

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Improving automatic document production with R

Fri, 05/19/2017 - 10:02

(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

In this post, I describe the latest iteration of my automatic document production with R. It improves upon the methods used in Rtraining, and previous work on this topic can be read by following the auto deploying R documentation tag.

I keep banging on about this area because reproducible research / analytical document pipelines is something I have a keen interest in. I see it as a core part of DataOps: it's vital for ensuring our models and analyses are correct, and for boosting our productivity.

Even after (or because of) a few years of off and on again development to the process, Rtraining had a number of issues:

  • build time was long because all presentations and their respective dependencies were required
  • if a single presentation broke, any later presentations would not get generated
  • the presentation build step was in “after_success” so didn’t trigger notifications
  • the build script for the presentations did a lot of stuff I thought could be removed
  • the dynamic index page sucks

This post covers how I’m attempting to fix all bar the last problem (more on that in a later post).

With the problems outlined, let’s look at my new base solution and how it addresses these issues.

Structure

I have built a template that can be used to generate multiple presentations and publish them to a docs/ directory for online hosting by GitHub. I can now use this template to produce category repositories, based on the folders in inst/slides/ in Rtraining. I can always split them out further at a later date.

The new repo is structured like so:

  • Package infrastructure
    • DESCRIPTION – used for managing dependencies primarily
    • R/ – store any utility functions
    • .Rbuildignore – avoid non-package stuff getting checked
  • Presentations
    • pres/ – directory for storing presentation .Rmd files
    • pres/_output.yml – file with render preferences
  • Output directory
    • docs/ – directory for putting generated slides in
  • Document generation infrastructure
    • .travis.yml – used to generate documents every time we push a commit
    • buildpres.sh – shell script doing the git workflow and calling R
    • buildpres.R – R script that performs the render step
Presentations
  • My Rtraining repo contained all presentations in the inst/slidedecks/ directory with further categories. This meant that if someone installed Rtraining, they'd get all the decks. I think this is a sub-optimal experience for people, especially because it meant installing so many packages, and I'll be focusing instead on improving the web delivery.
  • Render requirements are now stored in an _output.yml file instead of being hard-coded into the render step, so that I can add more variants later
  • I’m currently using a modified version of the revealjs package as I’ve built a heavily customised theme. As I’m not planning on any of these presentation packages ever going on CRAN, I can use the Remotes field in DESCRIPTION to describe the location. This simplifies the code significantly.
Document generation
Automatic document generation with R

Travis

I use travis-ci to perform the presentation builds. The instructions I provide travis are:

language: r
cache: packages
latex: false
warnings_are_errors: false
install:
- R -e 'install.packages("devtools")'
- R -e 'devtools::install_deps(dep = T)'
- R CMD build --no-build-vignettes --no-manual .
- R CMD check --no-build-vignettes --no-manual *tar.gz
- Rscript -e 'devtools::install(pkg = ".")'
before_script:
- chmod +x ./buildpres.sh
script:
- ./buildpres.sh

One important thing to note here is that I used some arguments on my package build and check steps along with latex: false to drastically reduce the build time as I have no intention of producing PDFs normally.

The install section is the prep work, and then the script section does the important bit. Now if there are errors, I’ll get notified!

Bash

The script that gets executed in my Travis build:

  • changes over to a GITHUB_PAT based connection to the repo to facilitate pushing changes and does some other config
  • executes the R render step
  • puts the R execution log in docs/ for debugging
  • commits all the changes using the important prefix [CI SKIP] so we don’t get infinite build loops
  • pushes the changes
#!/bin/bash
AUTHORNAME="Steph"
AUTHOREMAIL="Steph@itsalocke.com"
GITURL="https://$GITHUB_PAT@github.com/$TRAVIS_REPO_SLUG.git"

git remote set-url origin $GITURL
git pull
git checkout master
git config --global user.name $AUTHORNAME
git config --global user.email $AUTHOREMAIL

R CMD BATCH './buildpres.R'

cp buildpres.Rout docs/

git add docs/
git commit -am "[ci skip] Documents produced in clean environment via Travis $TRAVIS_BUILD_NUMBER"
git push -u --quiet origin master
R

The R step is now very minimal in that it works out what presentations to generate, then loops through them and builds each one according to the options specified in _output.yml

library(rmarkdown)

# Find every presentation source in pres/ and render it into docs/
slides <- list.files("pres", pattern = "\\.Rmd$", full.names = TRUE)

for (f in slides) render(f, output_dir = "docs")
Next steps for me

This work has substantially mitigated most of the issues I had with my previous Rtraining workflow. I now have to get all my slide decks building under this new process.

I will be writing about making an improved presentation portal and how to build and maintain your own substantially modified revealjs theme at a later date.

The modified workflow and scripts also have implications on my pRojects package that I’m currently developing along with Jon Calder. I’d be very interested to hear from you if you have thoughts on how to make things more streamlined.

The post Improving automatic document production with R appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
