R news and tutorials contributed by hundreds of R bloggers

### Ridge Regression and the Lasso

Tue, 05/23/2017 - 19:43

(This article was first published on R – Real Data, and kindly contributed to R-bloggers)

In my last post, Which linear model is best?, I wrote about using
stepwise selection as a method for selecting linear models. This post
will be about two methods that slightly modify ordinary least squares
(OLS) regression: ridge regression and the lasso.

Ridge regression and the lasso are closely related, but only the lasso
has the ability to select predictors. Like OLS, ridge regression
attempts to minimize the residual sum of squares. However, ridge
regression includes an additional 'shrinkage' penalty – the sum of the
squared coefficient estimates – which shrinks the coefficient estimates
towards zero. The impact of this penalty is controlled by a tuning
parameter, lambda (determined separately, typically by
cross-validation). Two interesting implications of this design are that
when λ = 0 the OLS coefficients are returned, and as λ → ∞ the
coefficients approach zero.
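To see this concretely, here is a toy base-R illustration of my own (not from the original post): for a no-intercept model the ridge solution solves (X'X + λI)b = X'y, so λ = 0 reproduces OLS exactly, while a huge λ crushes every coefficient towards zero.

```r
# Toy example (hypothetical data): ridge via the normal equations.
set.seed(1)
X <- scale(matrix(rnorm(100 * 3), 100, 3))      # centered, scaled predictors
y <- drop(X %*% c(2, -1, 0.5) + rnorm(100))     # known true coefficients

ridge_coef <- function(X, y, lambda) {
  # (X'X + lambda * I) b = X'y
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# lambda = 0 matches OLS; a very large lambda shrinks toward zero
round(cbind(ols  = ridge_coef(X, y, 0),
            mild = ridge_coef(X, y, 10),
            huge = ridge_coef(X, y, 1e6)), 4)
```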

To take a look at this, set up a model matrix (removing the intercept
column), store the response variable as y, and create a vector of
lambda values.

swiss <- datasets::swiss
x <- model.matrix(Fertility ~ ., swiss)[, -1]
y <- swiss$Fertility
lambda <- 10^seq(10, -2, length = 100)

First, let's verify that when λ = 0 we get the same coefficients as the
OLS model.

#create test and training sets
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-10
set.seed(489)
train = sample(1:nrow(x), nrow(x)/2)
test = (-train)
ytest = y[test]

#OLS
swisslm <- lm(Fertility ~ ., data = swiss)
coef(swisslm)
##      (Intercept)      Agriculture      Examination        Education
##       66.9151817       -0.1721140       -0.2580082       -0.8709401
##         Catholic Infant.Mortality
##        0.1041153        1.0770481
#ridge
ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda)
predict(ridge.mod, s = 0, exact = T, type = 'coefficients')[1:6,]
##      (Intercept)      Agriculture      Examination        Education
##       66.9365901       -0.1721983       -0.2590771       -0.8705300
##         Catholic Infant.Mortality
##        0.1040307        1.0770215

The differences here are nominal. Let's see if we can use ridge to
improve on the OLS estimate.

swisslm <- lm(Fertility ~ ., data = swiss, subset = train)
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = lambda)
#find the best lambda from our list via cross-validation
cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
bestlam <- cv.out$lambda.min
#make predictions
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test,])
s.pred <- predict(swisslm, newdata = swiss[test,])
#check MSE
mean((s.pred - ytest)^2)
## [1] 106.0087
mean((ridge.pred - ytest)^2)
## [1] 93.02157

Ridge performs better for this data according to the MSE.

#a look at the coefficients
out = glmnet(x[train,], y[train], alpha = 0)
predict(ridge.mod, type = "coefficients", s = bestlam)[1:6,]
##      (Intercept)      Agriculture      Examination        Education
##      64.90631178      -0.16557837      -0.59425090      -0.35814759
##         Catholic Infant.Mortality
##       0.06545382       1.30375306

As expected, most of the coefficient estimates are more conservative.

Let's have a look at the lasso. The big difference here is in the
shrinkage term – the lasso penalizes the sum of the absolute values of
the coefficient estimates rather than their squares.

lasso.mod <- glmnet(x[train,], y[train], alpha = 1, lambda = lambda)
lasso.pred <- predict(lasso.mod, s = bestlam, newx = x[test,])
mean((lasso.pred - ytest)^2)
## [1] 124.1039

The MSE is a bit higher for the lasso estimate. Let's check out the
coefficients.

lasso.coef <- predict(lasso.mod, type = 'coefficients', s = bestlam)[1:6,]
lasso.coef

Looks like the lasso places high importance on Education,
Examination, and Infant.Mortality. From this we also gain some
evidence that Catholic and Agriculture are not useful predictors for
this model. It is likely that Catholic and Agriculture do have some
effect on Fertility, though, since pushing those coefficients to zero
hurt the model's test MSE.

There is plenty more to delve into here, but I'll leave the details to
the experts. I am always happy to have your take on the topics I write
about. I learn just as much from you all as I do in researching the
topics I cover.

I think the next post will be about more GIS stuff – maybe on rasters or
point pattern analysis.

Kiefer

To leave a comment for the author, please follow the link and comment on their blog: R – Real Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R⁶ — Idiomatic (for the People)

Tue, 05/23/2017 - 13:56

(This article was first published on R – rud.is, and kindly contributed to R-bloggers)

NOTE: I’ll do my best to ensure the next post will have nothing to do with Twitter, and this post might not completely meet my R⁶ criteria.

A single, altruistic, nigh exuberant R tweet about slurping up a directory of CSVs devolved quickly — at least in my opinion, and partly (sadly) with my aid — into a thread that ultimately strayed from a crucial point: idiomatic is in the eye of the beholder.

I’m not linking to the twitter thread, but there are enough folks with sufficient Klout scores on it (is Klout even still a thing?) that you can easily find it if you feel so compelled.

I’ll take a page out of the U.S. High School “write an essay” playbook and start with a definition of idiomatic:

using, containing, or denoting expressions that are natural to a native speaker

That comes from idiom:

a form of expression natural to a language, person, or group of people

I usually joke with my students that a strength (and weakness) of R is that there are ~twelve ways to do any given task. While the statement is deliberately hyperbolic, the core message is accurate: there’s more than one way to do most things in R. A cascading truth is: what makes one way more “correct” over another often comes down to idiom.

My rstudio::conf 2017 presentation included an example of my version of using purrr for idiomatic CSV/JSON directory slurping. There are lots of ways to do this in R (the point of the post is not really to show you how to do the directory slurping and it is unlikely that I'll approve comments with code snippets about that task). Here are three: one from the base R tribe, one from the data.table tribe, and one from the tidyverse tribe:
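The original snippets appear to have been embedded as images; as a rough reconstruction of my own (assuming a directory of identically-structured per-year CSVs such as `data/2016.csv`), the three idioms might look something like:

```r
# hypothetical directory of per-year CSV files
paths <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

# base R tribe: read each file, bind the rows
df_base <- do.call(rbind, lapply(paths, read.csv))

# data.table tribe (fast fread + rbindlist):
# dt <- data.table::rbindlist(lapply(paths, data.table::fread))

# tidyverse tribe (purrr + readr):
# tbl <- purrr::map_df(paths, readr::read_csv)
```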

You get data for all the “years” into a data.frame, data.table and tibble (respectively) with those three “phrases”.

However, what if you want the year as a column? Many of these “datalogger” CSV data sets do not have a temporal “grouping” variable as they let the directory structure & naming conventions embed that bit of metadata. That information would be nice, though:
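Again as my own sketch (the original snippets were images), a base-R version that recovers the year from each file name and carries it along as a column:

```r
# hypothetical per-year CSVs named like data/2016.csv
paths <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
years <- sub("\\.csv$", "", basename(paths))   # "2016", "2017", ...

# bind each file with its year column
# tidyverse analogue: purrr::set_names(paths, years) then
#   purrr::map_df(readr::read_csv, .id = "year")
# data.table analogue: data.table::rbindlist(
#   setNames(lapply(paths, data.table::fread), years), idcol = "year")
df <- do.call(rbind, Map(function(p, yr) cbind(read.csv(p), year = yr),
                         paths, years))
```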

All three versions do the same thing, and each tribe understands each idiom.

The data.table and tidyverse versions get you much faster file reading and the ability to "fill" missing columns, another common slurping task. You can hack something together in base R to do column fills (you'll find a few StackOverflow answers that accomplish such a task), but you will likely decide to choose one of the other idioms for that and become just as comfortable in the new idiom.

There are multiple ways to further extend the slurping example, but that’s not the point of the post.

Each set of snippets contains 100% valid R code. They accomplish the task and are idiomatic for each tribe. Despite what any “mil gun feos turrach na latsa” experts’ exchange would try to tell you, the best idiom is the one that works for you/you & your collaborators and the one that gets you to the real work — data analysis — in the most straightforward & reproducible way possible (for you).

Idiomatic does not mean there’s only a singular One, True Way™, and I think a whole host of us forget that at times.

Write good, clean, error-free, reproducible code.

Choose idioms that work best for you and your collaborators.


### How to analyze max-diff data in R

Tue, 05/23/2017 - 09:33

(This article was first published on R – Displayr, and kindly contributed to R-bloggers)

This post discusses a number of options that are available in R for analyzing data from max-diff experiments, using the package flipMaxDiff. For a more detailed explanation of how to analyze max-diff, and what the outputs mean, you should read the post How max-diff analysis works.

The post covers installing the packages and importing your data and experimental design, before discussing counting analysis and the more powerful and statistically valid latent class analysis.

Step 1: Installing the packages

The first step is to install the flipMaxDiff package and a series of dependent packages. Depending on how your R has been set up, you may need to install none of these (e.g., if using Displayr), or even more packages than are shown below.

install.packages("devtools")
library(devtools)
install.packages("AlgDesign")
install_github("Displayr/flipData")
install_github("Displayr/flipTransformations")
install.packages("Rcpp")
install.packages("foreign")
install_github("Displayr/flipMaxDiff")
install_github("Displayr/flipStandardCharts")

Step 2: Bring in your data

The mechanism by which you import your survey data will depend upon its format and location. For the example that is used in our main post on analysing max-diff, you can use the following to import the SPSS file:

tech.data = foreign::read.spss("http://wiki.q-researchsoftware.com/images/f/f1/Technology_2017.sav", to.data.frame = TRUE)

Step 3: Bring in the experimental design

The flipMaxDiff package contains a method for creating your own experimental design, and this is discussed more here. In many cases, you may simply need to load in the design from your PC, or from a URL. The example that we cover here uses the following:

tech.design = read.csv("http://wiki.q-researchsoftware.com/images/7/78/Technology_MaxDiff_Design.csv")

Step 4: Counting analysis

While the latent class analysis technique described below is more valid and powerful than analyses based on counting the selections made by respondents, the following code may be used to examine the results discussed in the post How max-diff analysis works.

First, collect the sections of the data set for the selections made by the respondents:

best = tech.data[, c("Q5a_left", "Q5b_left", "Q5c_left", "Q5d_left", "Q5e_left", "Q5f_left")]
worst = tech.data[, c("Q5a_right", "Q5b_right", "Q5c_right", "Q5d_right", "Q5e_right", "Q5f_right")]

Here, the variable naming convention arises from the most-preferred option being shown on the left in the questionnaire, and the least-preferred option being shown on the right.

This code will count up the number of times each alternative was selected as best:

b = unlist(best)               # Turning the 6 variables into 1 variable
t = table(b)                   # Creating a table
s = sort(t, decreasing = TRUE) # Sorting from highest to lowest
best.score = cbind(Best = s)   # Putting a name at the top of the column

This code will count up the number of selections for each of the alternatives in your max-diff experiment:

b = table(unlist(best))
best.and.worst = cbind(Best = b, Worst = table(unlist(worst)))[order(b, decreasing = TRUE),]

To compute the difference between the best and worst counts:

diff = best.and.worst[, 1] - best.and.worst[, 2]
cbind("Best - Worst" = diff)

Step 5: Latent class analysis

The flipMaxDiff package contains a function for estimating the shares in the max-diff experiment that uses latent class analysis. For a more detailed explanation of what this means, and how to interpret the outputs, see this post.

To run the latent class analysis, you can use the following:

library(flipMaxDiff)
# Name the alternatives in the design
alt.names <- c("Apple", "Microsoft", "IBM", "Google", "Intel",
               "Samsung", "Sony", "Dell", "Yahoo", "Nokia")
result = FitMaxDiff(design = tech.design,
                    version = rep(1, nrow(best)),
                    best = best,
                    worst = worst,
                    alternative.names = alt.names,
                    n.classes = 5,
                    output = "Classes")
print(result)

The arguments that have been used in this example are:

• design: a data frame containing the experimental design.
• version: a vector of integers which can be used to specify which respondents were shown different versions of the experimental design. In this example, the design is simple and there is only a single version shown to all respondents.
• best: a data frame containing the options that were chosen as most-preferred.
• worst: a data frame containing the options that were chosen as least-preferred.
• alternative.names: a vector containing the names of the alternatives that were used in the max-diff experiment. In the code above, the names of the technology companies are used.
• n.classes: the number of classes to include in the analysis. Refer to this post for how to work out how many classes to include.
• output: a string which denotes the type of output that should be displayed when the object is printed. In this case we have chosen “Classes”, and so the print will show information about the preference shares for each class identified. Change this to “Probabilities”, and the output will show the Mean Probability (%) and distribution of shares.

Step 6: Obtaining respondent-level information

The latent class output can now be used to obtain various bits of information about the respondents. These can then be fed into later analyses:

To generate a new vector which tells you which class each respondent has been assigned to, you may use:

flipMaxDiff::Memberships(result)

In the latent class model, each respondent has a certain probability of membership for each class. The Memberships() method used above assigns each respondent to the class with the highest probability. To see the probabilities themselves:

input.maxdiff = result
probs = input.maxdiff$posterior.probabilities
colnames(probs) = paste0('Class ', 1:ncol(probs))
probs

Respondent-level preference shares:

respondent.shares = prop.table(exp(as.matrix(flipMaxDiff::RespondentParameters(result))), 1)

Step 7: Obtaining information from the output table

While the latent class analysis outputs are displayed in a cool HTMLWidget, the data in the class table can also be extracted with result$probabilities. This lets you extract parts of the table for charting or for other analyses. For example, in our 5-class solution, we can plot the preference shares from the Total column as a donut chart (from the package flipStandardCharts):

library(flipStandardCharts)
pref.shares = result$probabilities[, 6]
pref.shares = sort(pref.shares, decreasing = TRUE) * 100
Chart(pref.shares, type = "Donut")

Step 8: Preference simulation

Here, preference simulation refers to the process of removing some alternatives from the calculated shares, and then rebasing the remaining shares to see how they adjust in the absence of the removed alternatives. We first modify result$probabilities by removing the Total column (which will be recomputed later), as well as the rows for the alternatives that we want to exclude. Then, the results are rebased to add to 100 using the prop.table function. A new Total column is computed by taking the weighted sum, where the weights are given by the class sizes contained in result$class.sizes.

# Remove the total column
x = result$probabilities[, -6] * 100
# Removing Apple and Samsung
x = x[c(-1, -6),]
# Re-basing
x = prop.table(x, 2) * 100
# Adding the total
sizes = result$class.sizes
x = cbind(x, Total = rowSums(sweep(x, 2, sizes, "*")))
new.preferences = rbind(x, Total = colSums(x))

Summary

Once you have loaded in your survey data and experimental design, the latent class analysis from flipMaxDiff can be run with a couple of lines of code. Additional methods are then available for saving respondent-level information, and data from the output table, to feed them in to your charts and analyses.


### simglm submission to CRAN this week

Tue, 05/23/2017 - 07:00

(This article was first published on Educate-R - R, and kindly contributed to R-bloggers)

This is a quick note looking for any further feedback on the simglm package prior to CRAN submission later this week. The goal is to submit Thursday or Friday. The last few documentation finishing touches are happening now, working toward a version 0.5.0 release on CRAN.

For those who have not seen this package yet, the aim is to simulate regression models (single level and multilevel models) as well as employ empirical power analyses based on Monte Carlo simulation. The package is relatively flexible in user control of inputs to generate data.

To install the package and also build the vignettes:

devtools::install_github("lebebr01/simglm", build_vignettes = TRUE)

Then, to generate a simple single-level data set:

library(simglm)

fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                  var_type = c("single", "single", "single"),
                  cov_param = list(list(mean = 0, sd = 4),
                                   list(mean = 0, sd = 3),
                                   list(mean = 0, sd = 3)))
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
temp_single <- sim_reg(fixed = fixed, fixed_param = fixed_param,
                       cov_param = cov_param, n = n, error_var = error_var,
                       with_err_gen = with_err_gen, data_str = "single")
head(temp_single)
##   X.Intercept.         act       diff   numCourse act.numCourse     Fbeta
## 1            1 -2.11697901 -0.1490870 -0.90292680  1.9114770938 -5.954293
## 2            1  0.01298227 -0.1310381 -0.06197237 -0.0008045421  1.702379
## 3            1  0.44564723  0.5913073 -0.59650183 -0.2658293887  1.754481
## 4            1 -0.03528805 -0.5113031 -0.05915731  0.0020875460  1.144669
## 5            1  1.77940941  0.5097288  0.54804919  0.9752038827 13.495946
## 6            1 -1.42185444  0.4145870  1.08424301 -1.5416357400 -2.561252
##          err  sim_data ID
## 1 -0.9567737 -6.911066  1
## 2  1.3386926  3.041071  2
## 3  0.3470914  2.101572  3
## 4  0.9178861  2.062555  4
## 5  0.8016335 14.297580  5
## 6  0.2499601 -2.311292  6

Then one can extend this to conduct a power analysis. The benefit of this approach is that you are able to generate data that hopefully more closely resembles the data that will actually be collected, and you can build assumption violations, sample size differences, and other conditions directly into the power analysis, similar to a Monte Carlo simulation.

fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                  var_type = c("single", "single", "single"),
                  opts = list(list(mean = 0, sd = 2),
                              list(mean = 0, sd = 2),
                              list(mean = 0, sd = 1)))
n <- NULL
error_var <- NULL
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 10
terms_vary <- list(n = c(20, 40, 60, 80, 100),
                   error_var = c(5, 10, 20),
                   fixed_param = list(c(0.5, 1.1, 0.6, 0.9, 1.1),
                                      c(0.6, 1.1, 0.6, 0.9, 1.1)),
                   cov_param = list(list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                                         mean = c(0, 0, 0), sd = c(2, 2, 1),
                                         var_type = c("single", "single", "single")),
                                    list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                                         mean = c(0.5, 0, 0), sd = c(2, 2, 1),
                                         var_type = c("single", "single", "single"))))
power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param,
                     cov_param = cov_param, n = n, error_var = error_var,
                     with_err_gen = with_err_gen, data_str = "single",
                     pow_param = pow_param, alpha = alpha,
                     pow_dist = pow_dist, pow_tail = pow_tail,
                     replicates = replicates, terms_vary = terms_vary)
head(power_out)
## Source: local data frame [6 x 11]
## Groups: var, n, error_var, fixed_param [3]
##
##           var     n error_var         fixed_param
## 1 (Intercept)    20         5 0.5,1.1,0.6,0.9,1.1
## 2 (Intercept)    20         5 0.5,1.1,0.6,0.9,1.1
## 3 (Intercept)    20         5 0.6,1.1,0.6,0.9,1.1
## 4 (Intercept)    20         5 0.6,1.1,0.6,0.9,1.1
## 5 (Intercept)    20        10 0.5,1.1,0.6,0.9,1.1
## 6 (Intercept)    20        10 0.5,1.1,0.6,0.9,1.1
## # ... with 7 more variables: cov_param, avg_test_stat,
## #   sd_test_stat, power, num_reject, num_repl,
## #   data

Questions and feedback are welcomed by filing an issue on GitHub here: https://github.com/lebebr01/simglm/issues.


### Preview of EARL San Francisco

Mon, 05/22/2017 - 23:32

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The first ever EARL (Enterprise Applications of the R Language) conference in San Francisco will take place on June 5-7 (and it's not too late to register).  The EARL conference series is now in its fourth year, and the prior conferences in London and Boston have each been a fantastic way to learn how R is used in real-world applications. Judging from the speaker lineup in San Francisco, next month's event looks like it will feature some great R stories as well. Included on the program will be talks from the following companies:

• AirBnB (Keynote presentation by Ricardo Bion)
• StitchFix (Keynote presentation by Hilary Parker)
• Uber (Keynote presentation by Prakhar Mehrota)
• Hitachi Solutions (David Bishop)
• Amgen (Daniel Saldana)
• Pfizer (Luke Fostveldt)
• Fred Hutch Cancer Research Center (Thomas Vaughan)
• Domino Data Lab (Eduardo  Ariño de la Rubia)
• Genentech (Gabriel Becker)

… and many more. On a personal note, I'm looking forward to presenting with my colleague Bharath Sankaranarayan on using Microsoft R to predict hospital stays.

For more information on EARL 2017 San Francisco or to register, follow the link below. (And if you can't make San Francisco, check out EARL 2017 London, coming September 12-14.)

EARL: San Francisco 2017


### The Marcos Lopez de Prado Hierarchical Risk Parity Algorithm

Mon, 05/22/2017 - 20:21

This post will be about replicating the Marcos Lopez de Prado algorithm from his paper "Building Diversified Portfolios that Outperform Out of Sample". This algorithm attempts a tradeoff between the classic mean-variance optimization algorithm, which takes a covariance structure into account but is unstable, and an inverse-volatility algorithm, which ignores covariance but is more stable.

This is a paper that I struggled with until I ran the code in Python (I have anaconda installed but have trouble installing some packages such as keras because I’m on windows…would love to have someone walk me through setting up a Linux dual-boot), as I assumed that the clustering algorithm actually was able to concretely group every asset into a particular cluster (I.E. ETF 1 would be in cluster 1, ETF 2 in cluster 3, etc.). Turns out, that isn’t at all the case.

Here’s how the algorithm actually works.

First off, it computes a covariance and correlation matrix (created from simulated data in Marcos’s paper). Next, it uses a hierarchical clustering algorithm on a distance-transformed correlation matrix, with the “single” method (I.E. friend of friends–do ?hclust in R to read up more on this). The key output here is the order of the assets from the clustering algorithm. Note well: this is the only relevant artifact of the entire clustering algorithm.

Using this order, it then uses an algorithm that does the following:

Initialize a vector of weights equal to 1 for each asset.

Then, run the following recursive algorithm:

1) Break the order vector up into two lists of equal length (or as close to equal as possible).

2) For each half of the list, compute the inverse-variance weights (that is, using just the diagonal) of the covariance matrix slice containing the assets of interest, and then compute the variance of the cluster when multiplied by the weights (I.E. w' * S^2 * w).

3) Then, do a basic inverse-variance weighting for the two clusters. Call the weight of cluster 0 alpha = 1 - cluster_variance_0/(cluster_variance_0 + cluster_variance_1), and the weight of cluster 1 its complement, 1 - alpha.

4) Multiply all weights for assets in cluster 0 by the weight of cluster 0, and all weights for assets in cluster 1 by the weight of cluster 1. That is, weights[index_assets_cluster_0] *= alpha, weights[index_assets_cluster_1] *= 1-alpha.

5) Lastly, for any list that isn't of length 1 (that is, not a single asset), repeat this entire process until every asset is its own cluster.

Here is the implementation in R code.

First off, the correlation matrix and the covariance matrix for use in this code, obtained from Marcos Lopez De Prado’s code in the appendix in his paper.

> covMat
      V1           V2           V3           V4           V5           V6           V7           V8           V9           V10
1   1.000647799 -0.003050479  0.010033224 -0.010759689 -0.005036503  0.008762563  0.998201625 -0.001393196 -0.001254522 -0.009365991
2  -0.003050479  1.009021349  0.008613817  0.007334478 -0.009492688  0.013031817 -0.009420720 -0.015346223  1.010520047  1.013334849
3   0.010033224  0.008613817  1.000739363 -0.000637885  0.001783293  1.001574768  0.006385368  0.001922316  0.012902050  0.007997935
4  -0.010759689  0.007334478 -0.000637885  1.011854725  0.005759976  0.000905812 -0.011912269  0.000461894  0.012572661  0.009621670
5  -0.005036503 -0.009492688  0.001783293  0.005759976  1.005835878  0.005606343 -0.009643250  1.008567427 -0.006183035 -0.007942770
6   0.008762563  0.013031817  1.001574768  0.000905812  0.005606343  1.064309825  0.004413960  0.005780148  0.017185396  0.011601336
7   0.998201625 -0.009420720  0.006385368 -0.011912269 -0.009643250  0.004413960  1.058172027 -0.006755374 -0.008099181 -0.016240271
8  -0.001393196 -0.015346223  0.001922316  0.000461894  1.008567427  0.005780148 -0.006755374  1.074833155 -0.011903469 -0.013738378
9  -0.001254522  1.010520047  0.012902050  0.012572661 -0.006183035  0.017185396 -0.008099181 -0.011903469  1.075346677  1.015220126
10 -0.009365991  1.013334849  0.007997935  0.009621670 -0.007942770  0.011601336 -0.016240271 -0.013738378  1.015220126  1.078586686
> corMat
      V1           V2           V3           V4           V5           V6           V7           V8           V9           V10
1   1.000000000 -0.003035829  0.010026270 -0.010693011 -0.005020245  0.008490954  0.970062043 -0.001343386 -0.001209382 -0.009015412
2  -0.003035829  1.000000000  0.008572055  0.007258718 -0.009422702  0.012575370 -0.009117080 -0.014736040  0.970108941  0.971348946
3   0.010026270  0.008572055  1.000000000 -0.000633903  0.001777455  0.970485047  0.006205079  0.001853505  0.012437239  0.007698212
4  -0.010693011  0.007258718 -0.000633903  1.000000000  0.005709500  0.000872861 -0.011512172  0.000442908  0.012052964  0.009210090
5  -0.005020245 -0.009422702  0.001777455  0.005709500  1.000000000  0.005418538 -0.009347204  0.969998023 -0.005945165 -0.007625721
6   0.008490954  0.012575370  0.970485047  0.000872861  0.005418538  1.000000000  0.004159261  0.005404237  0.016063910  0.010827955
7   0.970062043 -0.009117080  0.006205079 -0.011512172 -0.009347204  0.004159261  1.000000000 -0.006334331 -0.007592568 -0.015201540
8  -0.001343386 -0.014736040  0.001853505  0.000442908  0.969998023  0.005404237 -0.006334331  1.000000000 -0.011072068 -0.012759610
9  -0.001209382  0.970108941  0.012437239  0.012052964 -0.005945165  0.016063910 -0.007592568 -0.011072068  1.000000000  0.942667300
10 -0.009015412  0.971348946  0.007698212  0.009210090 -0.007625721  0.010827955 -0.015201540 -0.012759610  0.942667300  1.000000000

Now, for the implementation.

This reads in the two matrices above and gets the clustering order.
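The snippet itself did not survive in this copy of the post; a sketch of that step (my reconstruction, assuming the paper's distance transform d = sqrt((1 − ρ)/2) applied to the correlation matrix read in above):

```r
# corMat is the 10x10 correlation matrix shown above
distMat <- dist(sqrt(0.5 * (1 - corMat)))  # euclidean distance between transformed rows
hc <- hclust(distMat, method = "single")   # "friend of friends" linkage
clustOrder <- hc$order                     # the only artifact of the clustering we keep
```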

This is the clustering order:

> clustOrder [1] 9 2 10 1 7 3 6 4 5 8

Next, the getIVP (get Inverse Variance Portfolio) and getClusterVar functions (note: I'm trying to keep the naming conventions identical to Dr. Lopez de Prado's paper):

getIVP <- function(covMat) {
  invDiag <- 1/diag(as.matrix(covMat))
  weights <- invDiag/sum(invDiag)
  return(weights)
}

getClusterVar <- function(covMat, cItems) {
  covMatSlice <- covMat[cItems, cItems]
  weights <- getIVP(covMatSlice)
  cVar <- t(weights) %*% as.matrix(covMatSlice) %*% weights
  return(cVar)
}

Next, my code diverges from the code in the paper, because I do not use the list comprehension structure, but instead opt for a recursive algorithm, as I find that style to be more readable.

One wrinkle to note is the use of the double-arrow assignment operator (<<-), which assigns to a variable outside the scope of the recurFun function. I assign the initial weights vector w in the global environment and update it from within the recurFun function. I am aware that it is a faux pas to create variables in the global environment, but my attempts at creating a temporary environment in which to update the weight vector did not produce the updating mechanism I had hoped for, so a little bit of assistance with refactoring this code would be appreciated.

getRecBipart <- function(covMat, sortIx) {
  # keeping track of w in the global environment
  assign("w", value = rep(1, ncol(covMat)), envir = .GlobalEnv)
  recurFun(covMat, sortIx)
  return(w)
}

recurFun <- function(covMat, sortIx) {
  subIdx <- 1:trunc(length(sortIx)/2)
  cItems0 <- sortIx[subIdx]
  cItems1 <- sortIx[-subIdx]
  cVar0 <- getClusterVar(covMat, cItems0)
  cVar1 <- getClusterVar(covMat, cItems1)
  alpha <- 1 - cVar0/(cVar0 + cVar1)

  # scoping mechanics using w as a free parameter
  w[cItems0] <<- w[cItems0] * alpha
  w[cItems1] <<- w[cItems1] * (1-alpha)

  if(length(cItems0) > 1) {
    recurFun(covMat, cItems0)
  }
  if(length(cItems1) > 1) {
    recurFun(covMat, cItems1)
  }
}

Lastly, let’s run the function.

out <- getRecBipart(covMat, clustOrder)

With the result (which matches the paper):

> out [1] 0.06999366 0.07592151 0.10838948 0.19029104 0.09719887 0.10191545 0.06618868 0.09095933 0.07123881 0.12790318

So, hopefully this democratizes the use of this technology in R. While I have seen a raw Rcpp implementation and one from the Systematic Investor Toolbox, neither of those implementations satisfied me from a “plug and play” perspective. This implementation solves that issue. Anyone here can copy and paste these functions into their environment and immediately make use of one of the algorithms devised by one of the top minds in quantitative finance.

A demonstration in a backtest using this methodology will be forthcoming.

NOTE: I am always interested in networking and full-time opportunities which may benefit from my skills. Furthermore, I am also interested in project work in the volatility ETF trading space. My linkedin profile can be found here.

### Instrumental Variables in R exercises (Part-2)

Mon, 05/22/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

This is the second part of the series on Instrumental Variables. For other parts of the series follow the tag instrumental variables.

In this exercise set we will build on the example from part-1.
We will now consider an over-identified case i.e. we have multiple IVs for an endogenous variable. We will also look at tests for endogeneity and over-identifying restrictions.

Answers to the exercises are available here.

Exercise 1
Load the AER package (package description: here). Next, load PSID1976 dataset provided with the AER package. This has data regarding labor force participation of married women.
Define a new data-frame that has data for all married women that were employed. This data-frame will be used for the following exercises.

Exercise 2
We will use a more elaborate model as compared to the previous set.
Regress log(wage) on education, experience and experience^2.

Exercise 3
Previously, we identified feducation and meducation as possible IVs for education.
Regress education on experience, experience^2, feducation and meducation. Comment on your results.

Exercise 4
Test the hypothesis that feducation and meducation are jointly equal to zero.

Exercise 5

Exercise 6
Use the ivreg command to get the same results as in Exercise-5. Note that standard errors would be different.
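For intuition, the two-stage logic that ivreg automates can be sketched by hand with NumPy. This is hypothetical synthetic data, not the PSID1976 set; the variable names and data-generating process are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical setup mimicking over-identification: x is endogenous
# (it shares the component v with the error term u), while z1 and z2
# are valid instruments correlated with x but not with u's extra noise.
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
v = rng.normal(size=n)
x = z1 + z2 + v
u = v + rng.normal(size=n)   # endogeneity: u and x both contain v
y = 0.5 * x + u              # true coefficient on x is 0.5

def ols(X, y):
    # Least-squares coefficients (lstsq for numerical stability).
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Naive OLS: biased because cov(x, u) != 0
beta_ols = ols(np.column_stack([ones, x]), y)[1]

# Stage 1: project the endogenous regressor on the instruments
Z = np.column_stack([ones, z1, z2])
x_hat = Z @ ols(Z, x)

# Stage 2: regress y on the first-stage fitted values
beta_2sls = ols(np.column_stack([ones, x_hat]), y)[1]
```

The manual second stage reproduces the 2SLS coefficient, but, as the exercise notes, the standard errors from a hand-rolled second stage are not valid; ivreg computes the corrected ones.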

Exercise 7
There is a short form for the ivreg command which saves time when we are dealing with numerous variables.
Try the short format and check that you get the same results as in Exercise-6.

Exercise 8
Regress log(wage) on education, experience, experience^2 and residuals from the model estimated in Exercise-3.
Use your result to test for the endogeneity of education.

Exercise 9
Regress the residuals from exercise-7 on experience, experience^2, feducation and meducation.
Use your result to test for over-identifying restrictions.

Exercise 10
The two tests we did in exercises 8 and 9 can be conveniently obtained using the summary command with diagnostics turned on. Verify that you get the same test results with the summary command.


To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### The End of the Honeymoon: Falling Out of Love with quantstrat

Mon, 05/22/2017 - 17:00

(This article was first published on R – Curtis Miller's Personal Website, and kindly contributed to R-bloggers)

Introduction

I spent good chunks of Friday, Saturday, and Sunday attempting to write another blog post on using R and the quantstrat package for backtesting, and all I have to show for my work is frustration. So I’ve started to fall out of love with quantstrat and am thinking of exploring Python backtesting libraries from now on.

Here’s my story…

Cross-Validation

Let’s suppose you’ve read my past blog posts on using R for analyzing stock data, in particular the most recent blog post on R for finance, on parameter optimization and order types. You’ll notice that we’ve been doing two things (perhaps among others):

• We used only the most recent data as out-of-sample data. What if there are peculiarities unique to the most recent time frame, making it a poor judge (alone) of future performance?
• We evaluated a strategy purely on its returns. Is this the best approach? What if a strategy doesn’t earn as much as another but handles risk well?

I wanted to write an article focusing on addressing these points, by looking at cross-validation and different metrics for evaluating performance.

Data scientists want to fit models to training data that will do a good job of predicting future, out-of-sample data points. This is not done by finding the model that performs best on the training data; that leads to overfitting: a model may appear to perform well on training data but will not generalize to out-of-sample data. Techniques need to be applied to guard against it.

A data scientist may first split her data set used for developing a predictive algorithm into a training set and a test set. The data scientist locks away the test set in a separate folder on the hard drive, never looking at it until she’s satisfied with the model she’s developed using the training data. The test set will be used once a final model has been found to determine the final model’s expected performance.

She would like to simulate out-of-sample performance with the training set, though. The usual approach is to split the training set into, say, 10 different folds, or smaller data sets, and apply cross-validation. With cross-validation, she holds out one of the folds and fits a predictive model on the remaining nine. After fitting the model, she sees how it performs on the held-out fold, which the fitted model has not seen. She then repeats this process for the nine other folds to get a sense of the distribution, or average, of the model’s performance on out-of-sample data. If there are hyperparameters (parameters that describe some higher-order aspect of a model and are not learned from the data in the usual way other parameters are; they’re difficult to define rigorously), she may try different combinations of them and see which combination leads to the model with the best predictive ability in cross-validation. After determining which predictive model generally leads to the best results and which hyperparameters are optimal, she trains a model on the training set with those hyperparameters, evaluates its performance on the test set, and reports the results, perhaps deploying the model.
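The fold mechanics described above can be sketched in a few lines of plain Python. This is an index-only illustration; real implementations (caret, scikit-learn) typically also shuffle and can stratify, which is deliberately omitted here since the next paragraphs explain why shuffling is off the table for time-ordered data.

```python
def k_folds(n, k):
    # Split indices 0..n-1 into k contiguous folds of near-equal size.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def cross_validate(n, k):
    # Yield (train_indices, held_out_indices) pairs, one per fold:
    # each fold is held out once while the others form the training set.
    folds = k_folds(n, k)
    for i, held_out in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, held_out
```

A model would be fitted on `train` and scored on `held_out` in each iteration, and the k scores averaged.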

For those developing trading algorithms, the goals are similar, but there are some key differences:

1. We evaluate a trading method not by predictive accuracy but by some other measure of performance, such as profitability or profitability relative to risk. (Maybe predictive accuracy is profitable, maybe not; if the most profitable trading system always underestimates its profitability, we’re fine with it.)
2. We are using data where time and the order in which data comes in is thought to matter. We cannot just reshuffle data; important features would be lost. So we cannot divide up the data set randomly into different folds; order must be preserved.

I’ll talk more about how we could evaluate a trading system later; for now, let’s focus on how we can apply the idea of cross-validation analysis.

(Illustration by Max Kuhn in his book, The caret Package.)

For time-dependent data, we can employ walk-forward analysis or rolling forecast origin techniques. This comes in various flavors (see the above illustration), but I will focus on one flavor. We first divide up the data set into, say, ten periods. We first fit a trading algorithm on the first period in the data set, then see how it performs on the “out-of-sample” second period. Then we repeat for the second and third periods, third and fourth periods, and so on until we’ve run to the end of the data set. The out-of-sample data sets are then used for evaluating the potential performance of the trading system in question. If we like, we may be keeping a final data set, perhaps the most recent data set, for final evaluation of whatever trading system passes this initial test.

Other variants include overlapping training/testing periods (as described, my walk-forward analysis approach does not overlap) or ones where the initial window grows from beginning to end. Of these, I initially think either the approach I’ve described or the one with overlapping training/testing periods makes the most sense.
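The train/test windows for the non-overlapping flavor described above are easy to generate; here is a minimal Python sketch (the function name and the even-period scheme are my own, not quantstrat's):

```python
def walk_forward_windows(n_obs, n_periods):
    # Divide n_obs observations into n_periods contiguous periods, then pair
    # each period (train) with the one right after it (test), preserving
    # time order throughout.
    bounds = [round(i * n_obs / n_periods) for i in range(n_periods + 1)]
    periods = [range(bounds[i], bounds[i + 1]) for i in range(n_periods)]
    return [(periods[i], periods[i + 1]) for i in range(n_periods - 1)]
```

Each pair would drive one fit-then-evaluate step, and the collected test-period results give the out-of-sample distribution discussed above.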

Attempting to Use quantstrat

I discovered that quantstrat has a function that I thought would implement the type of walk-forward analysis I wanted, called walk.forward(). I was able to define a training period duration, a testing period duration, an objective function to maximize, and many other features I wanted. It could take a strategy designed for optimization with apply.paramset() and apply a walk-forward analysis.

So I tried it once and the results were cryptic. The function returns a list with the results of training… but no sign of the test results. I would try looking at the portfolio object and tradeStats() output, but there was no sign of a portfolio. I had no idea why. I still have no idea where the portfolio went initially. I would try clearing my environments and restarting R and doing the whole process again. Same results: no sign of a portfolio, and all I got was a list of results for training data, with no sign of results for testing data. I looked at the source code. I saw that it was doing something with the test data: it was using applyStrategy() on the test data.

I had no idea what the hell that meant. When I use applyStrategy() as the user, I have to assume that the portfolio is wiped out, the account is reset, and every application is effectively a fresh start. So when I saw that line, it looked as if the function took the optimized strategy, applied it to the test data, and repeated this process, all the while forgetting the results of previous applications of the optimized strategy to earlier test data sets. But online sources said that was not what was happening; the results of this backtesting were being appended back-to-back, which did not mesh with either my notion of walk-forward analysis (which allows for overlapping time frames, meaning you cannot simply “glue” together test data results) or what applyStrategy() does. And it still does not make sense to me. (That is what it does, though.)

The fact that the demonstration of walk.forward() supplied with quantstrat is bugged (and I never found how to fix all its bugs) didn’t help.

So I tried using the function’s auditing feature; by supplying a string to audit.prefix, *.RData files will be created containing the results of training and testing, storing them in a special .audit environment. So I changed the working directory to one specifically meant for the audit files (there could be a lot of them and I didn’t want to clutter up my other directories; I hate clutter), passed a string to audit.prefix, and tried again. After the analysis was completed, I look in the directory to see the files.

There’s nothing there.

It took hours and I had to go through a debugger to finally discover that walk.forward() was saving files, but it was using my Documents directory as the working directory, not the directory I setwd()ed into. Why? I have no idea. It could be quantstrat, blotter, or RStudio’s workbooks. I have no idea how long it took to figure out where my files had gone, but it was too long however long it was.

Good news though was that I now could try and figure out what was happening to the test data. This presentation by Guy Yollin was helpful, though it still took a while to get it figured out.

From what I can tell, what walk.forward() does is simulate a trading strategy that optimizes, trades, re-optimizes on new data, and trades again. So it doesn’t look like cross-validation as I described it, but rather a trading strategy in itself, and that’s not really what I wanted here. I wanted to get a sense of the distribution of returns when using window sizes chosen optimally from backtesting. I don’t think there’s any simple way to do that with quantstrat.

Furthermore, I can’t see how I can collect the kind of statistics I want, such as a plot of the value of the account or the account’s growth since day one, with the way the walk-forward analysis is being done. On top of that, the results I did get look… strange. I don’t know how those kinds of results would appear… at all. They look like nonsense. So I don’t even know if the function is working right.

Oh, did I mention that I discovered that the way I thought backtesting was being done isn’t how it’s actually done in quantstrat? When I was doing backtesting, I would rather see the results of the backtest on a variety of stocks instead of just one. I imagine an automated backtest behaving like a trader with an account, and active trades tie up the trader’s funds. So if the trader sees an opportunity but does not have the cash to act on it, the trader does not take the trade. (There is no margin here.)

I thought applyStrategy() was emulating such a trader, but the documentation (and the logs produced, for that matter), suggest that’s not what’s happening. It backtests on each symbol separately, unaware of any other positions that may exist. I don’t like that.

It was very difficult to figure this all out, especially given the state of the documentation currently. That said, in the developers’ defense, quantstrat is still considered under heavy development. Maybe in the future the documentation will be better. I feel, though, that creating good documentation is not something done once you’ve completed coding; you write good documentation as you code, not after.

blotter: The Fatal Flaw

Granted, I’m an ignorant person, but I don’t feel like all the confusion that led to me wasting my weekend is completely my fault. quantstrat‘s design is strange and, honestly, not that intuitive. I think the problem stems from quantstrat‘s building off an older, key package: blotter.

blotter is a package intended to manage transactions, accounts, portfolios, financial instruments, and so on. All the functions for creating accounts and portfolios are blotter functions, not quantstrat functions. blotter does this by setting up its own environment, the .blotter environment, and the functions I’ve shown are for communicating with this environment, adding and working with entities stored in it.

I don’t know what the purpose of creating a separate environment for managing those objects is. I feel that by doing so a wall has been placed between me and the objects I’m trying to understand. quantstrat takes this further by defining a .strategy environment where strategies actually “live”, so we must use special functions to interact with that separate universe, adding features to our strategies, deleting and redefining them with special functions, and so on.

To me, this usage of environments feels very strange, and I think it makes quantstrat difficult and unintuitive. In some sense, I feel like strategies, portfolios, accounts, etc. are pseudo-objects in the object oriented programming (OOP) sense, and the developers of blotter and quantstrat are trying to program like R is a language that primarily follows the OOP paradigm.

While R does support OOP, it’s a very different kind of OOP and I would not call R an OOP language. R is a functional programming language. Many see the tidyverse as almost revolutionary in making R easy to use. That’s because the tidyverse packages, including ggplot2, magrittr, dplyr, and others, recognize that R relies on the functional programming paradigm, and this paradigm does, in fact, lead to elegant code. Users can write pipelines, and functions to put into pipelines, that make complicated tasks easy to program. No part of the pipeline is cut off from the user, and the process of building the pipeline is very intuitive.

When I wrote my first blog post on R for finance and trading, one of the common comments I heard was, “tidyquant is better.” I have yet to look seriously at tidyquant, and maybe it is better, but so far all I’ve seen is tidyquant replace quantmod‘s functionality, and I have no complaint with quantmod (although tidyquant may also be suffering from the original sin of relying too heavily on environments). That said, the tidyquant approach may yet be the future of R backtesting.

Let’s try re-imagining what a better R backtesting package might look like, one that takes lessons from the tidyverse. Let’s first look at defining a strategy with quantstrat (after all the obnoxious boilerplate code).

After all this code, we have a strategy that lives in the .strategy environment, locked away from us. What if we were to create strategies like we create plots in ggplot2? The code might look like this:

myStrat <- strategy() +
  indicator(SMA, args = list(x = Close, n = 20), label = "fastMA") +
  indicator(SMA, args = list(x = Close, n = 50), label = "slowMA") +
  signal(sigCrossover, quote(fastMA > slowMA), label = "bull") +
  signal(sigCrossover, quote(fastMA < slowMA), label = "bear") +
  rule( ..... ) +
  rule( ..... ) +
  distribution( ...... ) +
  distribution( ...... ) +
  distribution_constraint( ...... )

In blotter (and thus quantstrat) we create portfolios and accounts like so:

portfolio_st <- "myPortf"
account_st <- "myAcct"
rm.strat(portfolio_st)
rm.strat(account_st)

# This creates a portfolio
initPortf(portfolio_st, symbols = symbols, initDate = initDate, currency = "USD")
# Then an account
initAcct(account_st, portfolios = portfolio_st, initDate = initDate,
         currency = "USD", initEq = 100000)
# And finally an orderbook
initOrders(portfolio_st, store = TRUE)

Then we run the strategy with:

applyStrategy(strategy_st, portfolios = portfolio_st)

and we see the results with this code:

updatePortf(portfolio_st)
dateRange <- time(getPortfolio(portfolio_st)$summary)[-1]
updateAcct(account_st, dateRange)
updateEndEq(account_st)

tStats <- tradeStats(Portfolios = portfolio_st, use = "trades", inclZeroDays = FALSE)
tStats[, 4:ncol(tStats)] <- round(tStats[, 4:ncol(tStats)], 2)
print(data.frame(t(tStats[, -c(1,2)])))

final_acct <- getAccount(account_st)
plot(final_acct$summary$End.Eq["2010/2016"] / 100000,
     main = "Portfolio Equity", ylim = c(0.8, 2.5))
lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")

A tidyverse approach might look like the following:

# Create an account with a pipe
myAcct <- symbols %>%
  portfolio(currency = "USD") %>%
  account(currency = "USD", initEq = 100000)

# Backtesting generates an object containing the orderbook and all other
# necessary data (such as price at trade)
ob <- myAcct %>% myStrat   # This is a backtest

Other functions would then extract any information we need from the object ob, when we ask for them.

If I wanted to optimize, I might do the following:

ob_list <- myAcct %>% myStrat(optim = "MA")   # Gives a list of orderbooks, and
                                              # optimizes the parameter set MA

And as for the original reason I started to gripe, I might be able to find slices for the analysis with caret functions (they exist), subset the data in symbols according to these slices, and get ob in, say, an lapply() loop.

I like the idea for this interface, myself.

Conclusion (On To Python)

I do not have the time to write a package like this myself. I don’t know enough, and I am a Ph.D. student who should be preparing for qualifying exams/research/etc. For now, though, I’m turned off of quantstrat, and thus R for backtesting and trading.

That said, my stock data analysis posts are still, by far, the most popular content of this blog, and I still would like to find a good backtesting system. I’m looking at going back to Python, and the backtrader package has caught my eye. I will either use that or zipline, but I am leaning to backtrader, since initially I like its logging features and its use of OOP.

I would love to hear thoughts and comments, especially from users of either backtrader or zipline or Python for stock data analysis in general.

To leave a comment for the author, please follow the link and comment on their blog: R – Curtis Miller's Personal Website.

### R vs Python: Different similarities and similar differences

Mon, 05/22/2017 - 14:46

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

A debate about which language is better suited for data science, R or Python, can set off diehard fans of these languages into a tizzy. This post looks at some of the different similarities and similar differences between the languages. To a large extent the ease or difficulty of learning R or Python is subjective. I have heard that R has a steeper learning curve than Python, and also vice versa; this probably depends on one’s degree of familiarity with the language. To a large extent, both R and Python do the same things in just slightly different ways and syntaxes. The ease or difficulty of an R or Python construct is largely in the ‘eyes of the beholder’, nay, the programmer, we could say. I include my own experience with the languages below.

1. R data types

R has several atomic data types; the two used most often are

1. Character
2. Numeric

(R also has logical, integer, and complex types.)

Python has several data types

1. int
2. float
3. complex
4. bool, str and so on

(Python 2’s separate long type was merged into int in Python 3.)
2. R Vector vs Python List

A common data type in R is the vector. Python has a similar data type, the list

# R vectors
a<-c(4,5,1,3,4,5)
print(a[3])
## [1] 1
print(a[3:4]) # R does not always need the explicit print.
## [1] 1 3

#R type of variable
print(class(a))
## [1] "numeric"

# Length of a
print(length(a))
## [1] 6

# Python lists
a=[4,5,1,3,4,5]
print(a[2]) # Some python IDEs require the explicit print
print(a[2:5])
print(type(a))
# Length of a
print(len(a))
## 1
## [1, 3, 4]
## <class 'list'>
## 6

2a. Other data types – Python

Python also has certain other data types like the tuple, dictionary etc as shown below. R does not have as many of the data types, nevertheless we can do everything that Python does in R

# Python tuple
b = (4,5,7,8)
print(b)
#Python dictionary
c={'name':'Ganesh','age':54,'Work':'Professional'}
print(c)
#Print type of variable c
print(type(c))
## (4, 5, 7, 8)
## {'name': 'Ganesh', 'age': 54, 'Work': 'Professional'}
## <class 'dict'>

2. Type of Variable

To know the type of a variable in R we use ‘class’; in Python the corresponding command is ‘type’.

#R - Type of variable
a<-c(4,5,1,3,4,5)
print(class(a))
## [1] "numeric"

#Python - Print types of list a and tuple b
a=[4,5,1,3,4,5]
print(type(a))
b=(4,3,"the",2)
print(type(b))
## <class 'list'>
## <class 'tuple'>

3. Length

To know length in R, use length()

#R - Length of vector
# Length of a
a<-c(4,5,1,3,4,5)
print(length(a))
## [1] 6

To know the length of a list, tuple, or dict in Python we can use len()

# Python - Length of list, tuple etc
# Length of a
a=[4,5,1,3,4,5]
print(len(a))
# Length of b
b = (4,5,7,8)
print(len(b))
## 6
## 4

4. Accessing help

To access help in R we use the ‘?’ or the ‘help’ function

#R - Help - To be done in R console or RStudio
#?sapply
#help(sapply)

Help in Python on any topic can be accessed as follows

#Python help - This can be done on a (I)Python console
#help(len)
#?len

5. Subsetting

The key difference between R and Python with regard to subsetting is that in R the index starts at 1, while in Python it starts at 0, much like C, C++, or Java. To subset a vector in R we use

#R - Subset
a<-c(4,5,1,3,4,8,12,18,1)
print(a[3])
## [1] 1
# To print a range or a slice. Print from the 3rd to the 6th element
print(a[3:6])
## [1] 1 3 4 8

Python also uses indices. The difference in Python is that the index starts from 0.

#Python - Subset
a=[4,5,1,3,4,8,12,18,1]
# Print the 4th element (index starts from 0)
print(a[3])
# Print a slice from the 4th to the 6th element
print(a[3:6])
## 3
## [3, 4, 8]

6. Operations on vectors in R and operations on lists in Python

In R we can do many operations on vectors, e.g. element-wise addition, subtraction, product, exponentiation, etc., as shown below

#R - Operations on vectors
a<- c(5,2,3,1,7)
b<- c(1,5,4,6,8)
#Element wise Addition
print(a+b)
## [1] 6 7 7 7 15
#Element wise subtraction
print(a-b)
## [1] 4 -3 -1 -5 -1
#Element wise product
print(a*b)
## [1] 5 10 12 6 56
# Exponentiating the elements of a vector
print(a^2)
## [1] 25 4 9 1 49

In Python to do this on lists we need to use the ‘map’ and the ‘lambda’ function as follows

# Python - Operations on list
a =[5,2,3,1,7]
b =[1,5,4,6,8]
#Element wise addition with map & lambda
print(list(map(lambda x,y: x+y,a,b)))
#Element wise subtraction
print(list(map(lambda x,y: x-y,a,b)))
#Element wise product
print(list(map(lambda x,y: x*y,a,b)))
# Exponentiating the elements of a list
print(list(map(lambda x: x**2,a)))
## [6, 7, 7, 7, 15]
## [4, -3, -1, -5, -1]
## [5, 10, 12, 6, 56]
## [25, 4, 9, 1, 49]

However, if we create ndarrays from the lists, then we can do element-wise addition, subtraction, product, etc. like in R. NumPy is really a powerful module with many, many functions for matrix manipulations

import numpy as np
a =[5,2,3,1,7]
b =[1,5,4,6,8]
a=np.array(a)
b=np.array(b)
#Element wise addition
print(a+b)
#Element wise subtraction
print(a-b)
#Element wise product
print(a*b)
# Exponentiating the elements of an array
print(a**2)
## [ 6  7  7  7 15]
## [ 4 -3 -1 -5 -1]
## [ 5 10 12  6 56]
## [25  4  9  1 49]

7. Getting the index of an element

To determine the index of an element which satisfies a specific logical condition in R, use ‘which’. In the code below, the index of the element equal to 1 is 4

# R - which
a<- c(5,2,3,1,7)
print(which(a == 1))
## [1] 4

With a NumPy array in Python we can use np.where to the same effect. The index will be 3 since the index starts from 0

# Python - np.where
import numpy as np
a =[5,2,3,1,7]
a=np.array(a)
print(np.where(a==1))
## (array([3], dtype=int64),)

8. Data frames

R, by default, comes with a set of built-in datasets. Python offers some datasets through the Scikit-learn package

# R
# To check built-in datasets use
#data() - In R console or in R Studio
#iris - Don't print to console

We can use the built-in datasets that come with the Scikit-learn package

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

# This creates a Sklearn bunch
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)

9. Working with dataframes

With R you can work with dataframes directly. For more complex dataframe operations in R there are convenient packages like dplyr, reshape2, etc. For Python we need to use the Pandas package, which is quite comprehensive in what it can do with data frames. The most common operations on a dataframe are:

• Check the size of the dataframe
• Take a look at the top 5 or bottom 5 rows of dataframe
• Check the content of the dataframe
a. Size

In R use dim()

#R - Size
dim(iris)
## [1] 150 5

For Python use .shape

#Python - size
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.shape

b. Top & bottom 5 rows of dataframe

To see the top and bottom rows of a data frame we use head() & tail() as shown below for R and Python

#R
head(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa

tail(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.head(5))
print(iris.tail(5))
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
## 3 4.6 3.1 1.5 0.2
## 4 5.0 3.6 1.4 0.2
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
## 145 6.7 3.0 5.2 2.3
## 146 6.3 2.5 5.0 1.9
## 147 6.5 3.0 5.2 2.0
## 148 6.2 3.4 5.4 2.3
## 149 5.9 3.0 5.1 1.8

c. Check the content of the dataframe

#R
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50

str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm) 150 non-null float64
## sepal width (cm) 150 non-null float64
## petal length (cm) 150 non-null float64
## petal width (cm) 150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

d. Check column names

#R
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
#Get column names
print(iris.columns)
## Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
##        'petal width (cm)'],
##       dtype='object')

e. Rename columns

In R we can assign a vector to column names

#R
colnames(iris) <- c("lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal","Species")
colnames(iris)
## [1] "lengthOfSepal" "widthOfSepal" "lengthOfPetal" "widthOfPetal"
## [5] "Species"

In Python we can assign a list to the dataframe’s .columns attribute

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.columns = ["lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal"]
print(iris.columns)
## Index(['lengthOfSepal', 'widthOfSepal', 'lengthOfPetal', 'widthOfPetal'], dtype='object')

f. Details of dataframe

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm) 150 non-null float64
## sepal width (cm) 150 non-null float64
## petal length (cm) 150 non-null float64
## petal width (cm) 150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None

g. Subsetting dataframes

# R
#To subset a dataframe 'df' in R we use df[row,column] or df[row vector,column vector]
#df[row,column]
iris[3,4]
## [1] 0.2
#df[row vector, column vector]
iris[2:5,1:3]
## lengthOfSepal widthOfSepal lengthOfPetal
## 2 4.9 3.0 1.4
## 3 4.7 3.2 1.3
## 4 4.6 3.1 1.5
## 5 5.0 3.6 1.4
#If we omit the row vector, it implies all rows; if we omit the column vector,
#it implies all columns for those rows
iris[2:5,]
## lengthOfSepal widthOfSepal lengthOfPetal widthOfPetal Species
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
# In R we can also call specific columns by column name (Sepal.Length was
# renamed above, hence NULL)
iris$Sepal.Length[2:5]
## NULL

#Python
# To select an entire row we use .iloc. The index can be used with the ':', as in
# .iloc[start row : end row]. If the start row is omitted it implies the beginning
# of the data frame; if the end row is omitted it implies all rows till the end
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.iloc[3])
print(iris[:5])
# In python we can select columns by column name as follows
print(iris['sepal length (cm)'][2:6])
#If you want to select more than 2 columns then you must use the double '[[]]'
#since the index is a list itself
print(iris[['sepal length (cm)','sepal width (cm)']][4:7])
## sepal length (cm) 4.6
## sepal width (cm) 3.1
## petal length (cm) 1.5
## petal width (cm) 0.2
## Name: 3, dtype: float64
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
## 3 4.6 3.1 1.5 0.2
## 4 5.0 3.6 1.4 0.2
## 2 4.7
## 3 4.6
## 4 5.0
## 5 5.4
## Name: sepal length (cm), dtype: float64
## sepal length (cm) sepal width (cm)
## 4 5.0 3.6
## 5 5.4 3.9
## 6 4.6 3.4

h. Computing Mean, Standard deviation

#R
#Mean
mean(iris$lengthOfSepal)
## [1] 5.843333
#Standard deviation
sd(iris$widthOfSepal)
## [1] 0.4358663

#Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
#Mean
print(iris['sepal length (cm)'].mean())
#Standard deviation
print(iris['sepal width (cm)'].std())
## 5.843333333333335
## 0.4335943113621737

i. Boxplot

A boxplot can be produced in R using base plotting

#R
boxplot(iris$lengthOfSepal)

Matplotlib is a popular package in Python for plots

#Python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img = plt.boxplot(iris['sepal length (cm)'])
plt.show()

j. Scatter plot

#R
plot(iris$widthOfSepal,iris$lengthOfSepal)

#Python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
img = plt.scatter(iris['sepal width (cm)'],iris['sepal length (cm)'])
plt.show()

k. Read from csv file

#R
tendulkar = read.csv("tendulkar.csv", stringsAsFactors = FALSE, na.strings=c(NA,"-"))
#Dimensions of dataframe
dim(tendulkar)
## [1] 347  13
names(tendulkar)
##  [1] "X"          "Runs"       "Mins"       "BF"         "X4s"
##  [6] "X6s"        "SR"         "Pos"        "Dismissal"  "Inns"
## [11] "Opposition" "Ground"     "Start.Date"

#Python
import pandas as pd
#Read csv
tendulkar = pd.read_csv("tendulkar.csv", na_values=["-"])
print(tendulkar.shape)
print(tendulkar.columns)
## (347, 13)
## Index(['Unnamed: 0', 'Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos',
##        'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
##       dtype='object')

l. Clean the dataframe in R and Python

The following steps are done for R and Python
1. Remove rows with 'DNB'
2. Remove rows with 'TDNB'
3. Remove rows with 'absent'
4. Remove the "*" indicating not out
5. Remove incomplete rows with NA for R or NaN in Python
6. Do a scatter plot

#R
# Remove rows with 'DNB'
a <- tendulkar$Runs != "DNB"
tendulkar <- tendulkar[a,]
dim(tendulkar)
## [1] 330  13
# Remove rows with 'TDNB'
b <- tendulkar$Runs != "TDNB"
tendulkar <- tendulkar[b,]
# Remove rows with 'absent'
c <- tendulkar$Runs != "absent"
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 329  13
# Remove the "*" indicating not out
tendulkar$Runs <- as.numeric(gsub("\\*","",tendulkar$Runs))
dim(tendulkar)
## [1] 329  13
# Select only complete rows - complete.cases()
c <- complete.cases(tendulkar)
#Subset the rows which are complete
tendulkar <- tendulkar[c,]
dim(tendulkar)
## [1] 327  13
# Do some base plotting - Scatter plot
plot(tendulkar$BF,tendulkar$Runs)

#Python
import pandas as pd
import matplotlib.pyplot as plt
#Read csv
tendulkar = pd.read_csv("tendulkar.csv", na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a = tendulkar.Runs != "DNB"
tendulkar = tendulkar[a]
print(tendulkar.shape)
# Remove rows with 'TDNB'
b = tendulkar.Runs != "TDNB"
tendulkar = tendulkar[b]
print(tendulkar.shape)
# Remove rows with 'absent'
c = tendulkar.Runs != "absent"
tendulkar = tendulkar[c]
print(tendulkar.shape)
# Remove the "*" indicating not out
tendulkar.Runs = tendulkar.Runs.str.replace(r"[*]","")
#Select only complete rows - dropna()
tendulkar = tendulkar.dropna()
print(tendulkar.shape)
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
#Scatter plot
plt.scatter(tendulkar.BF,tendulkar.Runs)
## (347, 13)
## (330, 13)
## (329, 13)
## (329, 13)
## (327, 13)

m. Chaining operations on dataframes

To chain a set of operations in R we need a package like dplyr; pandas supports chaining out of the box. The following operations are performed on the tendulkar data frame, with dplyr in R and pandas in Python:

1. Group by ground
2. Compute average runs in each ground
3. Arrange in descending order
#R
library(dplyr)
tendulkar1 <- tendulkar %>%
  group_by(Ground) %>%
  summarise(meanRuns = mean(Runs)) %>%
  arrange(desc(meanRuns))
head(tendulkar1,10)
## # A tibble: 10 × 2
##           Ground  meanRuns
##            <chr>     <dbl>
## 1         Multan 194.00000
## 2          Leeds 193.00000
## 3  Colombo (RPS) 143.00000
## 4        Lucknow 142.00000
## 5          Dhaka 132.75000
## 6     Manchester  93.50000
## 7         Sydney  87.22222
## 8   Bloemfontein  85.00000
## 9     Georgetown  81.00000
## 10 Colombo (SSC)  77.55556

#Python
import pandas as pd
#Read csv
tendulkar = pd.read_csv("tendulkar.csv", na_values=["-"])
print(tendulkar.shape)
# Remove rows with 'DNB'
a = tendulkar.Runs != "DNB"
tendulkar = tendulkar[a]
# Remove rows with 'TDNB'
b = tendulkar.Runs != "TDNB"
tendulkar = tendulkar[b]
# Remove rows with 'absent'
c = tendulkar.Runs != "absent"
tendulkar = tendulkar[c]
# Remove the "*" indicating not out
tendulkar.Runs = tendulkar.Runs.str.replace(r"[*]","")
#Select only complete rows - dropna()
tendulkar = tendulkar.dropna()
tendulkar.Runs = tendulkar.Runs.astype(int)
tendulkar.BF = tendulkar.BF.astype(int)
tendulkar1 = tendulkar.groupby('Ground').mean()['Runs'].sort_values(ascending=False)
print(tendulkar1.head(10))
## (347, 13)
## Ground
## Multan           194.000000
## Leeds            193.000000
## Colombo (RPS)    143.000000
## Lucknow          142.000000
## Dhaka            132.750000
## Manchester        93.500000
## Sydney            87.222222
## Bloemfontein      85.000000
## Georgetown        81.000000
## Colombo (SSC)     77.555556
## Name: Runs, dtype: float64

9. Functions

#R
product <- function(a,b){
  c <- a*b
  c
}
product(5,7)
## [1] 35

#Python
def product(a,b):
    c = a*b
    return c
print(product(5,7))
## 35

Conclusion

Personally, I took to R much like a ‘duck takes to water’. I found the R syntax simple and mostly intuitive, and R packages like dplyr, ggplot2 and reshape2 make the language quite irresistible. R is loosely typed, with a small set of basic types such as numeric and character, as opposed to the full-fledged data types in Python.

Python has too many bells and whistles, which can be a little bewildering to the novice, though they may well prove useful as one becomes more experienced with the language. I also found that installing Python packages sometimes gives errors with Python versions 2.7 or 3.6, which will leave you scrambling to Google for fixes. This can be quite frustrating. R, on the other hand, makes installing packages a breeze.

Anyway, this is my current opinion, and like all opinions, may change in the course of time. Let’s see!

I may write a follow up post with more advanced features of R and Python. So do keep checking! Long live R! Viva la Python!

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Bagging, the perfect solution for model instability

Mon, 05/22/2017 - 14:27

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos Motivation

The name bagging comes from bootstrap aggregating. It is a machine learning technique proposed by Breiman (1996) to increase stability in potentially unstable estimators. For example, suppose you want to run a regression with a few variables in two steps. First, you run the regression with all the variables in your data and select the significant ones. Second, you run a new regression using only the selected variables and compute the predictions.

This procedure is not wrong if your problem is forecasting. However, this two-step estimation may result in highly unstable models. If many variables are important but individually their importance is small, you will probably leave some of them out, and small perturbations of the data may drastically change the results.

Bagging

Bagging solves the instability problem by giving each variable several chances to be in the model. The procedure described in the previous section is repeated on many bootstrap samples, and the final forecast is the average forecast across all these samples.

In this post I will show an example of bagging with regression in the selection context described above, but you should keep in mind that this technique can be used for any unstable model. The most successful case is using bagging on regression trees, which results in the famous random forest. Bagging is also very robust to overfitting and produces forecasts with less variance.

Application

I am going to show an application of bagging from scratch, using the boot package for the bootstrap replications. We start by generating some data with the following dgp:

The first equation shows how the variable we want to predict, y, is generated: y = Xβ + ε. The second equation shows that the independent variables are generated by a common factor, x_j = a_j f + u_j, which will induce a correlation structure. Since we need instability, the data will be generated to have an R² of around 0.3, in order to have many variables with small individual importance.

N = 300 # = Number of obs. = #
K = 40 # = Number of Variables = #
set.seed(123) # = Seed for replication = #
f = rnorm(N) # = Common factor = #
A = runif(K,-1,1) # = Coefficients from the second equation = #
X = t(A%*%t(f)) + matrix(rnorm(N*K), N, K) # = Generate xij = #
beta = c((-1)^(1:(K-10)), rep(0, 10)) # = Coefficients from the first equation: the first 30 are 1 or -1, the last 10 are 0 = #

# = R2 setup = #
aux = var(X%*%beta)
erro = rnorm(N, 0, sqrt(2*aux)) # = The variance of the error will be twice the variance of X%*%beta = #
y = X %*% beta + erro # = Generate y = #
1-var(erro)/var(y) # = R2 = #
##           [,1]
## [1,] 0.2925866

# = Break data into in-sample and out-of-sample = #
y.in = y[1:(N-100)]
X.in = X[1:(N-100),]
y.out = y[-c(1:(N-100))]
X.out = X[-c(1:(N-100)),]

Now we must define a function to be called by the bootstrap. This function must receive the data and an index argument that tells R where the sampling must be made. It will run a linear regression with all variables, select those with absolute t-statistics bigger than 1.96, run a new regression with the selected variables and store the coefficients.

bagg = function(data, ind){
  sample = data[ind,]
  y = sample[, 1]
  X = sample[, 2:ncol(data)]
  model1 = lm(y ~ X)
  t.stat = summary(model1)$coefficients[-1, 3]
  selected = which(abs(t.stat) >= 1.96)
  model2 = lm(y ~ X[, selected])
  coefficients = coef(model2)
  cons = coefficients[1]
  betas = coefficients[-1]
  aux = rep(0, ncol(X))
  aux[selected] = betas
  res = c(cons, aux)
  return(res)
}

Now we are ready to use the boot function to run the bagging. The code below does precisely that with 500 bootstrap replications. It also runs the simple OLS procedure for comparison. If you type selected after running this code you will see that many relevant variables were left out and some variables with true zero coefficients were selected.

library(boot)

# = Bagging = #
bagging = boot(data = cbind(y.in, X.in), statistic = bagg, R = 500)
bagg.coef = bagging$t
y.pred.bagg = rowMeans(cbind(1, X.out) %*% t(bagg.coef))

# = OLS = #
ols0 = lm(y.in ~ X.in)
selected = which(abs(summary(ols0)$coefficients[-1, 3]) >= 1.96)
ols1 = lm(y.in ~ X.in[, selected])
y.pred = cbind(1, X.out[, selected]) %*% coef(ols1)

# = Forecasting RMSE = #
sqrt(mean((y.pred - y.out)^2)) # = OLS = #
## [1] 10.43338
sqrt(mean((y.pred.bagg - y.out)^2)) # = Bagging = #
## [1] 9.725554

The results show that bagging has an RMSE smaller than that of OLS. It is also interesting to see what happens to the coefficients. The plot below shows that bagging performs some shrinkage of the coefficients towards zero.

Finally, what would happen if we had no instability? If you run the same code again with a much higher R² you will find the plot below. To raise the R², just shrink the error variance in the line erro = rnorm(N, 0, sqrt(2*aux)) in the dgp, for example by using a small fraction of aux instead of 2*aux. As you can see, bagging and the simple procedure are the same in the absence of instability.

References

Breiman, Leo (1996). “Bagging predictors”. Machine Learning. 24 (2): 123–140

To leave a comment for the author, please follow the link and comment on their blog: R – insightR.

### Connecting R to an Oracle Database

Mon, 05/22/2017 - 12:30

(This article was first published on R – Gerald Belton, and kindly contributed to R-bloggers)

R is a very popular language for doing analytics, and particularly statistics, on your data. There are a number of R functions for reading in data, but most of them take a delimited text file (such as .CSV) for input. That’s great if your existing data is in a spreadsheet, but if you have large amounts of data, it’s probably stored in a relational database. If you work for a large company, chances are that it is an Oracle database.

The most efficient way to access an Oracle database from R is using the RODBC package, available from CRAN. If the RODBC package is not installed in your R environment, use the install.packages(“RODBC”) command to install it. ODBC stands for Open DataBase Connectivity, an open standard application programming interface (API) for databases. ODBC was created by the SQL Access Group and first released in September 1992. Although Microsoft Windows was the first platform to provide an ODBC product, versions now exist for Linux and Macintosh as well. ODBC is built into current versions of Windows. If you are using a different operating system, you'll need to install an ODBC driver manager.

Before you can access a database from R, you’ll need to create a Data Source Name, or DSN. This is an alias to the database, which provides the connection details. In Windows, you create the DSN using the ODBC Source Administrator. This tool can be found in the Control Panel. In Windows 10, it’s under System and Security -> Administrative Tools -> ODBC Data Sources. Or you can just type “ODBC” in the search box. On my system, it looks like this:

As you can see, I already have a connection to an Oracle database. To set one up, click Add, and you’ll get this box:

Select the appropriate driver (in my case, Oracle in OraDB12Home1) and click the Finish button. A Driver Configuration box opens:

For “Data Source Name,” you can put in almost anything you want. This is the name you will use in R when you connect to the database.

The “Description” field is optional, and again, you can put in whatever you want.

TNS Service Name is the name that you (or your company database administrator) assigned when configuring the Oracle database. And “User ID” is the ID that you use with the database.

After you fill in these fields, click the “Test Connection” button. Another box pops up, with the TNS Service Name and User ID already populated, and an empty field for your password. Enter your password and click “OK.” You should see a “Connection Successful” message. If not, check the Service Name, User ID, and Password.

Now you are ready to connect R to the database.

Here’s the R code that you need:

# Load RODBC package
library(RODBC)

# Create a connection to the database called "channel"
channel <- odbcConnect("DATABASE", uid="USERNAME", pwd="PASSWORD")

# Query the database and put the results into the data frame "dataframe"
dataframe <- sqlQuery(channel, "
  SELECT *
  FROM SCHEMA.DATATABLE")

# When finished, it's a good idea to close the connection
odbcClose(channel)

First, I don’t like the idea of having a password appear, unencrypted, in the R program. One possible solution is to prompt the user for the password before creating the connection:
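A minimal sketch of that approach, assuming an interactive session (readline() is base R; in RStudio, rstudioapi::askForPassword() is an alternative that masks the typed input):

```r
library(RODBC)

# Prompt the user instead of hard-coding the password in the script
pwd <- readline(prompt = "Enter database password: ")

channel <- odbcConnect("DATABASE", uid = "USERNAME", pwd = pwd)

# Remove the password from the workspace once the connection is made
rm(pwd)
```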

This will enable the connection to be made without compromising the security of the password.

Second, sqlQuery will pass to Oracle whatever is inside the quotation marks. This is the workhorse function of the RODBC package. The term ‘query’ includes any valid SQL statement, including table creation, updates, etc., as well as ‘SELECT’s.

Finally, I should mention that R works with data that is loaded into the computer’s memory. If you try to load a really huge database into memory all at once, it will a) take a very long time, and b) possibly fail due to exceeding your computer’s memory capacity. Of course, relational database systems like Oracle are the natural habitat of very large data sets, so that may be your motivation for connecting R to Oracle in the first place. Carefully constructed SQL Queries will let Oracle do the work of managing the data, and return just the data that R needs for performing analytics.
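For example, a query can push filtering and aggregation down to Oracle so that R receives only a small summary. The table and column names below are hypothetical, purely for illustration:

```r
library(RODBC)
channel <- odbcConnect("DATABASE", uid = "USERNAME", pwd = "PASSWORD")

# Oracle does the filtering and aggregation; R gets back only the summary rows
monthly_totals <- sqlQuery(channel, "
  SELECT TRUNC(order_date, 'MM') AS order_month,
         SUM(order_total)        AS total_sales
  FROM   schema_name.orders
  WHERE  order_date >= DATE '2016-01-01'
  GROUP  BY TRUNC(order_date, 'MM')
  ORDER  BY order_month")

odbcClose(channel)
```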

Writing SQL Queries is beyond the scope of this blog post. If you need help with that, there are plenty of free tutorials on the web, or you might find this book helpful: Oracle 12c for Dummies

To leave a comment for the author, please follow the link and comment on their blog: R – Gerald Belton.

### Sunflowers for COLOURlovers

Mon, 05/22/2017 - 09:00

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Andar, lo que es andar, anduve encima siempre de las nubes (Del tiempo perdido, Robe)

If you give importance to colours, maybe you already know COLOURlovers. As can be read on their website, COLOURlovers is a creative community where people from around the world create and share colors, palettes and patterns, discuss the latest trends and explore colorful articles… All in the spirit of love.

There is an R package called colourlovers which provides access to the COLOURlovers API. It makes it very easy to choose nice colours for your graphics. I used the clpalettes function to search for the top palettes of the website. Their names are pretty suggestive as well: Giant Goldfish, Thought Provoking, Adrift in Dreams, let them eat cake… Inspired by this post I have built a Shiny app to create colored flowers using those palettes. Seeds are arranged according to the golden angle. One example:

Some others:

You can play with the app here.

If you want to do your own sunflowers, here you have the code. This is the ui.R file:

library(colourlovers)
library(rlist)

top = clpalettes('top')
sapply(1:length(top), function(x) list.extract(top, x)$title) -> titles

fluidPage(
  titlePanel("Sunflowers for COLOURlovers"),
  fluidRow(
    column(3,
      wellPanel(
        selectInput("pal", label = "Palette:", choices = titles),
        sliderInput("nob", label = "Number of points:",
                    min = 200, max = 500, value = 400, step = 50)
      )
    ),
    mainPanel(
      plotOutput("Flower")
    )
  )
)

And this is the server.R one:

library(shiny)
library(ggplot2)
library(colourlovers)
library(rlist)
library(dplyr)

top = clpalettes('top')
sapply(1:length(top), function(x) list.extract(top, x)$title) -> titles

CreatePlot = function(ang=pi*(3-sqrt(5)), nob=150, siz=15, sha=21, pal="LoversInJapan") {
  list.extract(top, which(titles==pal))$colors %>%
    unlist %>%
    as.vector() %>%
    paste0("#", .) -> all_colors
  colors = data.frame(hex=all_colors, darkness=colSums(col2rgb(all_colors)))
  colors %>% arrange(-darkness) -> colors
  background = colors[1,"hex"] %>% as.character
  colors %>% filter(hex!=background) %>% .[,1] %>% as.vector() -> colors

  ggplot(data.frame(r=sqrt(1:nob), t=(1:nob)*ang*pi/180),
         aes(x=r*cos(t), y=r*sin(t))) +
    geom_point(colour=sample(colors, nob, replace=TRUE, prob=exp(1:length(colors))),
               aes(size=(nob-r)), shape=16) +
    scale_x_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4)) +
    scale_y_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4)) +
    theme(legend.position="none",
          panel.background = element_rect(fill=background),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank())
}

function(input, output) {
  output$Flower = renderPlot({
    CreatePlot(ang=180*(3-sqrt(5)), nob=input$nob, siz=input$siz,
               sha=as.numeric(input$sha), pal=input$pal)
  }, height = 550, width = 550)
}

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.

### Using R: When using do in dplyr, don’t forget the dot

Sun, 05/21/2017 - 11:50

There will be a few posts about switching from plyr/reshape2 for data wrangling to the more contemporary dplyr/tidyr.

My most common use of plyr looked something like this: we take a data frame, split it by some column(s), and use an anonymous function to do something useful. The function takes a data frame and returns another data frame, both of which could very possibly have only one row. (If, in fact, it has to have only one row, I’d suggest an assert_that() call as the first line of the function.)

library(plyr)
results <- ddply(some_data, "key", function(x) {
  ## do something; return data.frame()
})

Or maybe, if I felt serious and thought the function would ever be used again, I’d write:

calculate <- function(x) {
  ## do something; return data.frame()
}
result <- ddply(some_data, "key", calculate)

Rinse and repeat over and over again. For me, discovering ddply was like discovering vectorization, but for data frames. Vectorization lets you think of operations on vectors, without having to think about their elements. ddply lets you think about operations on data frames, without having to think about rows and columns. It saves a lot of thinking.

The dplyr equivalent would be do(). It looks like this:

library(dplyr)
grouped <- group_by(some_data, key)
result <- do(grouped, calculate(.))

Or once again with magrittr:

library(magrittr)
some_data %>%
  group_by(key) %>%
  do(calculate(.)) -> result

(Yes, I used the assignment arrow from the left hand side to the right hand side. Roll your eyes all you want. I think it’s in keeping with the magrittr theme of reading from left to right.)

One important thing here, which got me at first: There has to be a dot! Just passing the function name, as one would have done with ddply, will not work:

grouped <- group_by(some_data, key)
## will not work: Error: Results are not data frames at positions ...
try(result <- do(grouped, calculate))

Don’t forget the dot!


### Shiny Applications Layouts Exercises (Part-9)

Sat, 05/20/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Shiny Application Layouts – Update Input

In the ninth part of our series we will look at the update input scenario. This is different from the dynamic UI example we met in part 8, where the UI component is generated on the server and sent to the UI, where it replaces an existing UI component.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly you can make a revision on what you learnt in “Building Shiny App” series as we will build basic shiny staff in order to present it in the proper way.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let's begin!

Answers to the exercises are available here.

UI Context

As always let us create the skeleton of the app.
#ui.r
fluidPage(
titlePanel("Title"),
fluidRow(
column(3,
wellPanel()),
column(3,
wellPanel()),
column(3,
wellPanel())
)
)
#server.R
function(input, output, clientData, session) {}

Exercise 1

Build the initial UI of your app. Give it a title, and use one fluidRow with a column of width 3 for each of the three wellPanels that we are going to use.

Applying reactivity

To create a reactive context for our app we will use the observe({}) function in our server side.

Exercise 2

Create reactive context. HINT: Use observe({}).

“Master” Inputs.

Now we are going to create the two inputs that control the rest of the inputs of the app. Look at the example:
#ui.r
textInput("text_input",
"labels:",
"Some Text"),
sliderInput("slider_input",
"values:",
min = 1, max = 100, value = 50)
#server.r
t_input <- input$text_input
s_input <- input$slider_input

Exercise 3

Create a text Input in the ui side that controls labels and then pass it to a new variable on the server side.

Learn more about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

Exercise 4

Create a slider Input in the ui side that controls values and then pass it to a new variable on the server side.

Dependent Inputs

Firstly let’s create a text input that changes both the label and the text.
#ui.r
textInput("Text", "Text input:", value = "text")
#server.r
updateTextInput(session, "Text",
label = paste("Sth", t_input),
value = paste("Sth", s_input)
)

Exercise 5

Create a text input, in the second wellPanel, that changes its label and its value according to the two “Master” inputs you created before. HINT: Use updateTextInput().

Exercise 6

Create a slider input, in the second wellPanel, that changes its label and its value according to the two “Master” inputs you created before. HINT: Use updateSliderInput().

Exercise 7

Create a numeric input, in the second wellPanel, that changes its label and its value according to the two “Master” inputs you created before. HINT: Use updateNumericInput().

Exercise 8

Create a date input, in the second wellPanel, that changes its label and its value according to the two “Master” inputs you created before. HINT: Use updateDateInput().

In order to create a Checkbox group with the same conditions as the rest of the inputs we just created we should first create a list of options like in the example below:
#server.r
options <- list()
options[[paste("option", s_input, "A")]] <-
paste0("option-", s_input, "-A")
options[[paste("option", s_input, "B")]] <-
paste0("option-", s_input, "-B")

Exercise 9

Create a list with three choices for your Checkbox Group. HINT: Use list().

Exercise 10

Create a checkbox group input, in the third wellPanel, that changes its label and its values according to the two “Master” inputs you created before. HINT: Use updateCheckboxGroupInput().


To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

### New series: R and big data (concentrating on Spark and sparklyr)

Sat, 05/20/2017 - 15:56

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Win-Vector LLC has recently been teaching how to use R with big data through Spark and sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big-data using R, share some hints, and even admit to some of the warts found in this combination of systems.

The ability to perform sophisticated analyses and modeling on “big data” with R is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).

The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using Spark through R and are starting a new series of articles tagged “R and big data” to help you produce production quality solutions quickly.

Please read on for a brief description of our new articles series: “R and big data.”

Background

R is a best of breed in-memory analytics platform. R allows the analyst to write programs that operate over their data and bring in a huge suite of powerful statistical techniques and machine learning procedures. Spark is an analytics platform designed to operate over big data that exposes some of its own statistical and machine learning capabilities. R can now be operated “over Spark“. That is: R programs can delegate tasks to Spark clusters and issue commands to Spark clusters. In some cases the syntax for operating over Spark is deliberately identical to working over data stored in R.

Why R and Spark

• Spark can work at a scale and speed far larger than native R. The ability to send work to Spark increases R's capabilities.
• R has machine learning and statistical capabilities that go far beyond what is available on Spark or any other “big data” system (many of which are descended from report generation or basic analytics). The ability to use specialized R methods on data samples yields additional capabilities.
• R and Spark can share code and data.

The R/Spark combination is not the only show in town, but it is a powerful capability that may not be safe to ignore. We will also talk about additional tools that can be brought into the mix, such as the powerful large-scale machine learning capabilities from h2o.
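As a small taste of the "deliberately identical syntax" point above, a basic sparklyr session looks something like the following sketch (assuming a local Spark installation; spark_connect, copy_to and the dplyr verbs are the standard sparklyr interface):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (a cluster URL would be used in production)
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark; mtcars_tbl is a remote handle, not local data
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark")

# Familiar dplyr syntax is translated to Spark SQL and executed on the cluster
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()   # collect() brings the (small) result back into R

spark_disconnect(sc)
```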

The warts

Frankly, a lot of this is very new and still on the "bleeding edge." Spark 2.x has only been available in stable form since July 26, 2016 (just under a year). Spark 2.x is much more capable than the Spark 1.x series in terms of both data manipulation and machine learning, so we suggest clients strongly insist on Spark 2.x clusters from their infrastructure vendors (such as Cloudera, Hortonworks, MapR, and others), despite these having only recently become available in packaged solutions. The sparklyr adapter itself was first available on CRAN only as of September 24th, 2016. And SparkR only started distributing with Spark as of version 1.4, in June 2015.

While R/Spark is indeed a powerful combination, nobody seems to be sharing many production experiences and best practices with it yet.

Some of the problems are sins of optimism. A lot of people still confuse successfully standing up a cluster with effectively using it. Others confuse the statistical and machine learning procedures available in in-memory R (which are very broad and often quite mature) with those available in Spark (which are less numerous and less mature).

Our goal

What we want to do with the “R and big data” series is:

• Give a taste of some of the power of the R/Spark combination.
• Share a “capabilities and readiness” checklist you should apply when evaluating infrastructure.
• Start to publicly document R/Spark best practices.
• Describe some of the warts and how to work around them.
• Share fun tricks and techniques that make working with R/Spark much easier and more effective.
The start

Our next article in this series will be up soon; it will discuss the nature of data handles in sparklyr (one of the R/Spark interfaces) and how to manage your data inventory neatly.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Sankey charts for swinging voters

Sat, 05/20/2017 - 14:00

(This article was first published on Peter's stats stuff - R, and kindly contributed to R-bloggers)

Continuing my examination of the individual level voting behaviour from the New Zealand Election Study, I wanted to look at the way individuals swap between parties, and between “did not vote” and a party, from one election to another. How much and how this happens is obviously an important question for both political scientists and for politicians.

Vote transition visualisations

I chose a Sankey chart as a way of showing the transition matrix from self-reported party vote in the 2011 election to the 2014 election. Here’s a static version:

And here is the more screen-friendly interactive version, with mouseover tooltips to give actual estimates:

The point with these graphics is to highlight the transitions. For example, what were the implications of turnout being higher in 2014 than 2011 (77.9% of enrolled voters in 2014 compared to 74.2% in 2011)? Judging from this survey data, the National Party gained 6.6% of the enrolled population in 2014 by converting them from a 2011 “did not vote” and lost only 3.6% in the other direction. This net gain of three percentage points was enough to win the election for the National-led coalition. In contrast, the Labour party had a net gain from “did not vote” in 2011 of only 0.2 percentage points. Remember though that these are survey-based estimates, and subject to statistical error.

I find setting up and polishing Sankey charts – controlling colours for example – a bit of a pain, so the code at the bottom of this post on how this was done might be of interest.

Weighting, missing data, population and age complications

Those visualisations have a few hidden fishhooks, which careful readers would find if they compare the percentages in the tooltips of the interactive version with percentages reported by the New Zealand Electoral Commission.

• The 2014 percentages are proportions of the enrolled population. As the 2014 turnout of enrolled voters was 77.9%, the numbers here are noticeably less than the usually cited percentages which were used to translate into seat counts (for example, National Party reported party vote of 47.0% of votes becomes 36.6% of enrolled voters)
• The 2011 percentages are even harder to explain, because I’ve chosen not only to scale the party vote and “did not vote” to the 2011 enrolled population as reported by the Commission, but also to add in around 5% of the 2014 population that were too young to vote in 2011.

Two things I would have liked to have taken into account but wasn’t able to were:

• The “leakage” from the 2011 election of people who were deceased or had left the country by 2014
• Explicit recognition of people who voted in 2014 but not in 2011 because they were out of the country. There is a variable in the survey that picks up the year the respondent came to live in New Zealand if not born here, but for only 10 respondents was this 2012 or later (in contrast to age – there were 58 respondents aged 20 or less in 2014).

I re-weighted the survey so the 2014 and 2011 reported party votes matched the known totals (with the addition of people aged 15 to 17 in 2011). One doesn’t normally re-weight a survey based on answers provided by the respondents, but in this case I think it makes perfect sense to calibrate to the public totals. The biggest impact is that for both years, but particularly 2011, relying on the respondents’ self-report and the published weighting of the NZES, totals for “did not vote” are materially understated, compared to results when calibrated to the known totals.

When party vote in 2011 had been forgotten or was an NA, and this wasn’t explained by being too young in 2011, I used multiple imputation based on a subset of relevant variables to give five instances of probable party vote to each such respondent.

Taken together, all this gives the visualisations a perspective based in 2014. It is better to think of it as “where did the 2014 voters come from” than “where did the 2011 voters go”. This is fairly natural when we consider it is the 2014 New Zealand Election Study, but is worth keeping in mind in interpretation.

Age (and hence the impact new young voters coming in, and of older voters passing on) is important in voting behaviour, as even the most casual observation of politics shows. In New Zealand, the age distribution of party voters in 2014 is seen in the chart below:

Non-voters, Green voters and to a certain extent Labour voters are young; New Zealand First voters are older. If this interests you though, I suggest you look at the multivariate analysis in this blog post or, probably more fun, this fancy interactive web app which lets you play with the predicted probabilities of voting based on a combination of demographic and socio-economic status variables.

Code

Here’s the R code that did that imputation, weighting, and the graphics:

To leave a comment for the author, please follow the link and comment on their blog: Peter's stats stuff - R.

### R Weekly Bulletin Vol – IX

Sat, 05/20/2017 - 08:42

This week’s R bulletin will cover topics on how to list files, extracting file names, and creating a folder using R.

We will also cover functions like the select function, filter function, and the arrange function. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Run the current chunk – Ctrl+Alt+C
2. Run the next chunk – Ctrl+Alt+N
3. Run the current function definition – Ctrl+Alt+F

Problem Solving Ideas

How to list files with a particular extension

To list files with a particular extension, one can use the pattern argument of the list.files function. For example, to list csv files use the following syntax.

Example:

files = list.files(pattern = "\\.csv$")

This will list all the csv files present in the current working directory. To list files in any other folder, you need to provide the folder path.

list.files(path = "C:/Users/MyFolder", pattern = "\\.csv$")

The $ at the end anchors the match to the end of the string, and \\. ensures that you match a literal dot, so only files with the extension .csv are returned.
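As a quick self-contained check of this pattern, you can create a few files in a temporary folder and list only the .csv ones (the folder and file names here are made up for illustration):

```r
# Set up a throwaway folder with hypothetical files
tmp <- file.path(tempdir(), "csv_demo")
dir.create(tmp, showWarnings = FALSE)
file.create(file.path(tmp, c("MRF.csv", "PAGEIND.csv", "notes.txt")))

# Only files ending in ".csv" match; "notes.txt" is excluded
csv_files <- list.files(path = tmp, pattern = "\\.csv$")
print(csv_files)
```

Dropping the `$` anchor would also match a file like `report.csv.bak`, which is why anchoring matters.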

Extracting file name using gsub function

When we download stock data from Google Finance, the file’s name corresponds to the stock symbol. If we want to extract the stock symbol from the file name, we can do it using the gsub function. The function searches for matches to the pattern argument and replaces all of them with the value given in the replacement argument. The syntax of the function is:

gsub(pattern, replacement, x)

where,

pattern – is a character string containing a regular expression to be matched in the given character vector.
replacement – a replacement for matched pattern.
x – is a character vector where matches are sought.
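Before applying this to real files, the idea can be sketched on a plain character vector (the file names below are hypothetical):

```r
# Hypothetical downloaded file names
fnames <- c("MRF.csv", "PAGEIND.csv")

# Strip the ".csv" suffix to recover the stock symbols
symbols <- gsub("\\.csv$", "", fnames)
print(symbols)
```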

In the example given below, we extract the file names of files stored in the “Reading MFs” folder. We have downloaded stock price data into the R working directory for two companies, MRF and PAGEIND Ltd.

Example:

folderpath = paste(getwd(), "/Reading MFs", sep = "")
temp = list.files(folderpath, pattern = "\\.csv$")
print(temp)

[1] “MRF.csv”  “PAGEIND.csv”

gsub("\\.csv$", "", temp)

[1] “MRF”   “PAGEIND”

Create a folder using R

One can create a folder via R with the help of the “dir.create” function. The function creates a folder with the name as specified in the last element of the path. Trailing path separators are discarded.

The syntax is given as:

dir.create(path, showWarnings = FALSE, recursive = FALSE)

Example:

dir.create("D:/RCodes", showWarnings = FALSE, recursive = FALSE)

This will create a folder called “RCodes” in the D drive.
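With recursive = TRUE, dir.create will also build any missing intermediate folders; a minimal sketch using the session’s temporary directory (the folder names are arbitrary):

```r
# "project/data/raw" does not exist yet; recursive = TRUE creates all three levels
path <- file.path(tempdir(), "project", "data", "raw")
dir.create(path, showWarnings = FALSE, recursive = TRUE)
print(dir.exists(path))
```

With recursive = FALSE (the default), the same call would fail silently here because the parent folders do not exist.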

Functions Demystified select function

The select function comes from the dplyr package and can be used to select certain columns of a data frame which you need. Consider the data frame “df” given in the example.

Example:

library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Suppose we wanted to select the first 2 columns only. We can use the names of
# the columns in the second argument to select them from the main data frame.
subset_df = select(df, Ticker:OpenPrice)
print(subset_df)

# Suppose we want to omit the OpenPrice column using the select function. We can
# do so by using the negative sign along with the column name as the second
# argument to the function.
subset_df = select(df, -OpenPrice)
print(subset_df)

# We can also use the 'starts_with' and 'ends_with' helpers to select columns
# from the data frame. The example below selects all columns ending in 'Price'.
subset_df = select(df, ends_with("Price"))
print(subset_df)
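For comparison, the same selections can be written in base R with bracket indexing, no dplyr required (a sketch using the same example data frame):

```r
Ticker <- c("INFY", "TCS", "HCL", "TECHM")
OpenPrice <- c(2012, 2300, 900, 520)
ClosePrice <- c(2021, 2294, 910, 524)
df <- data.frame(Ticker, OpenPrice, ClosePrice)

# Equivalent of select(df, Ticker:OpenPrice)
first_two <- df[, c("Ticker", "OpenPrice")]

# Equivalent of select(df, -OpenPrice)
no_open <- df[, setdiff(names(df), "OpenPrice")]

# Equivalent of select(df, ends_with("Price")), via a regular expression
price_cols <- df[, grepl("Price$", names(df))]
print(names(price_cols))
```

The dplyr versions are more readable; the base R versions avoid a package dependency.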

filter function

The filter function comes from the dplyr package and is used to extract subsets of rows from a data frame. This function is similar to the subset function in R.

Example:

library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Suppose we want to select stocks with closing prices above 750. We can do so
# using the filter function in the following manner:
subset_df = filter(df, ClosePrice > 750)
print(subset_df)

# One can also use a combination of conditions as the second argument when
# filtering a data set.
subset_df = filter(df, ClosePrice > 750 & OpenPrice < 2000)
print(subset_df)
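Since filter is described above as similar to base R’s subset function, here is the same filtering done with subset for comparison (using the same example data):

```r
Ticker <- c("INFY", "TCS", "HCL", "TECHM")
OpenPrice <- c(2012, 2300, 900, 520)
ClosePrice <- c(2021, 2294, 910, 524)
df <- data.frame(Ticker, OpenPrice, ClosePrice)

# Equivalent of filter(df, ClosePrice > 750): keeps INFY, TCS, and HCL
high_close <- subset(df, ClosePrice > 750)

# Equivalent of filter(df, ClosePrice > 750 & OpenPrice < 2000): keeps only HCL
both <- subset(df, ClosePrice > 750 & OpenPrice < 2000)
print(both)
```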

arrange function

The arrange function is part of the dplyr package, and is used to reorder rows of a data frame according to one of the columns. Columns can be arranged in descending order or ascending order by using the special desc() operator.

Example:

library(dplyr)
Ticker = c("INFY", "TCS", "HCL", "TECHM")
OpenPrice = c(2012, 2300, 900, 520)
ClosePrice = c(2021, 2294, 910, 524)
df = data.frame(Ticker, OpenPrice, ClosePrice)
print(df)

# Arrange in descending order
subset_df = arrange(df, desc(OpenPrice))
print(subset_df)

# Arrange in ascending order (the default, so desc() is not needed)
subset_df = arrange(df, OpenPrice)
print(subset_df)
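The base R equivalent of arrange uses order to compute a row permutation (a sketch on the same example data):

```r
Ticker <- c("INFY", "TCS", "HCL", "TECHM")
OpenPrice <- c(2012, 2300, 900, 520)
ClosePrice <- c(2021, 2294, 910, 524)
df <- data.frame(Ticker, OpenPrice, ClosePrice)

# Equivalent of arrange(df, desc(OpenPrice))
desc_df <- df[order(df$OpenPrice, decreasing = TRUE), ]

# Equivalent of arrange(df, OpenPrice)
asc_df <- df[order(df$OpenPrice), ]
print(asc_df$OpenPrice)
```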

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.


### Which science is all around? #BillMeetScienceTwitter

Sat, 05/20/2017 - 02:00

(This article was first published on Maëlle, and kindly contributed to R-bloggers)

I’ll admit I didn’t really know who Bill Nye was before yesterday. His name sounds a bit like Bill Nighy’s, that’s all I knew. But science is all around, and quite often scientists on Twitter start interesting campaigns. Remember the #actuallylivingscientists campaign, to whose animals I dedicated a blog post? This time, the Twitter campaign is the #BillMeetScienceTwitter hashtag, with which scientists introduce themselves to the famous science TV host Bill Nye. Here is a nice article about the movement.

Since I like surfing on Twitter trends, I decided to download a few of these tweets and to use my own R interface to the Monkeylearn machine learning API, monkeylearn (part of the rOpenSci project!), to classify the tweets in the hope of finding the most represented science fields. So, which science is all around?

Getting the tweets

It might sound a bit like trolling by now, but if you wanna get Twitter data, I recommend using rtweet because it’s a good package and because it’s going to replace twitteR which you might know from other blogs.

I only keep tweets in English, and moreover original ones, i.e. not retweets.

library("rtweet")
billmeet <- search_tweets(q = "#BillMeetScienceTwitter", n = 18000, type = "recent")
billmeet <- unique(billmeet)
billmeet <- dplyr::filter(billmeet, lang == "en")
billmeet <- dplyr::filter(billmeet, is_retweet == FALSE)

I’ve ended up with 2491 tweets.

Classifying the tweets

I’ve chosen to use this taxonomy classifier, which classifies text according to generic topics and had quite a few stars on the MonkeyLearn website. I don’t think it was trained on tweets, and it wasn’t trained to classify science topics in particular, which is not optimal, but it had the merit of being readily available. I’ve still not started training my own algorithms, and anyway, if I did I’d start by creating a very crucial algorithm for determining animal fluffiness on pictures, not text mining stuff. This was a bit off topic, let’s go back to science Twitter!

When I decided to use my own package I had forgotten that it takes care of cutting the request vector into groups of 20 tweets, since the API only accepts 20 texts at a time. I thought I’d have to do that splitting myself, but no: since I did it once in the code of the package, I’ll never need to write that code again. Great feeling! Look at how easy the code is after cleaning up the tweets a bit! One just needs to wait a while for all the results.

output <- monkeylearn::monkeylearn_classify(request = billmeet$text, classifier_id = "cl_5icAVzKR")
str(output)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4466 obs. of 4 variables:
##  $ category_id: int  64638 64640 64686 64696 64686 64687 64689 64692 64648 64600 ...
##  $ probability: num  0.207 0.739 0.292 0.784 0.521 0.565 0.796 0.453 0.301 0.605 ...
##  $ label      : chr  "Computers & Internet" "Internet" "Humanities" "Religion & Spirituality" ...
##  $ text_md5   : chr  "f7b28f45ea379b4ca6f34284ce0dc4b7" "f7b28f45ea379b4ca6f34284ce0dc4b7" "b95429d83df2cabb9cd701a562444f0b" "b95429d83df2cabb9cd701a562444f0b" ...
##  - attr(*, "headers")=Classes 'tbl_df', 'tbl' and 'data.frame': 0 obs. of 0 variables

In the output, the package creator decided not to include the whole text corresponding to each line but a digested form of it, produced by the MD5 algorithm. So to join the output back to the tweets, I first have to digest each tweet, which I do by copying the code from the package. After all, I wrote it. Maybe it was the only time I successfully used vapply in my whole life.

billmeet <- dplyr::mutate(billmeet,
                          text_md5 = vapply(X = text,
                                            FUN = digest::digest,
                                            FUN.VALUE = character(1),
                                            USE.NAMES = FALSE,
                                            algo = "md5"))
billmeet <- dplyr::select(billmeet, text, text_md5)
output <- dplyr::left_join(output, billmeet, by = "text_md5")

Looking at this small sample, some things make sense and others make less sense, either because the classification isn’t good or because the tweet looks like spam. Since my own field isn’t text analysis, I’ll consider myself happy with these results, but I’d of course be happy to read any better version of them.

As in my #first7jobs post, I’ll make a very arbitrary decision and keep only the labels assigned a probability higher than 0.5.

output <- dplyr::filter(output, probability > 0.5)

This covers 45% of the original tweet sample. I can only hope it’s a representative one.

How many labels do I have by tweet?

dplyr::group_by(output, text_md5) %>%
  dplyr::summarise(nlabels = n()) %>%
  dplyr::group_by(nlabels) %>%
  dplyr::summarise(n_tweets = n()) %>%
  knitr::kable()

nlabels n_tweets
1       1415

Perfect, only one.

Looking at the results

I know I suck at finding good section titles… At least I like the title of the post, which is a reference to the song Bill Nighy, not Bill Nye, sings in Love Actually. My husband assumed that science Twitter has a lot of biomedical stuff. Now, even if my results were to support this, note that it could just as well be because biomedical tweets are easier to classify.

I’ll first show a few examples of tweets for given labels.

Based on that, and on the huge number of internet-labelled tweets, I decided to remove those.

library("ggplot2")
library("viridis")

label_counts <- output %>%
  dplyr::filter(label != "Internet") %>%
  dplyr::group_by(label) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n))

label_counts <- label_counts %>%
  dplyr::mutate(label = ifelse(n < 5, "others", label)) %>%
  dplyr::group_by(label) %>%
  dplyr::summarize(n = sum(n)) %>%
  dplyr::arrange(desc(n))

label_counts <- dplyr::mutate(label_counts,
                              label = factor(label, ordered = TRUE, levels = unique(label)))

ggplot(label_counts) +
  geom_bar(aes(label, n, fill = label), stat = "identity") +
  scale_fill_viridis(discrete = TRUE, option = "plasma") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1),
        text = element_text(size = 25),
        legend.position = "none")

In the end, I’m always skeptical when looking at the results of such classifiers, and at the quality of my sample to begin with – but then, I doubt there was ever a hashtag used only to answer the original question, without spam and commentary (which is what I’m doing). I’d say the results seem to support my husband’s hypothesis about biomedical stuff.

I’m pretty sure Bill Nye won’t have had the time to read all the tweets, but I think he should save them, or at least all the ones he can get via the Twitter API thanks to e.g. rtweet, in order to be able to look through them next time he needs an expert. And in the random sample of tweets he’s read, let’s hope he was exposed to a great diversity of science topics (and of scientists), although, hey, the health and life related stuff is the most interesting of course. Just kidding. I liked reading tweets about various scientists, science rocks! And these last words would be labelled with “performing arts”, perfect way to end this post.

To leave a comment for the author, please follow the link and comment on their blog: Maëlle.

### New R Users group in Münster!

Sat, 05/20/2017 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

This is to announce that Münster now has its very own R users group!

If you are from the area, come join us (or if you happen to know someone who is and who might be interested, please forward the info).

You can find us on meetup.com: https://www.meetup.com/Munster-R-Users-Group/ and we are also on the R User groups list.

Code for the logo, which of course has been created in R:

library(ggplot2)
library(ggmap)
library(ggthemes)

munster <- get_map(location = 'Munster', zoom = 12, maptype = "watercolor")

ggmap(munster) +
  scale_y_continuous(limits = c(51.94, 51.97)) +
  annotate("text", x = 7.565, y = 51.96, label = "MünsteR", size = 8) +
  annotate("text", x = 7.685, y = 51.945, label = "R Users Group", size = 5) +
  theme_map()

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

### A Primer in Functional Programming in R Exercises (Part – 1)

Fri, 05/19/2017 - 18:00

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

In the exercises below we cover the basics of functional programming in R (part 1 of a two-part series of exercises on functional programming). We consider recursion in R, the apply family of functions, and higher-order functions such as Map, Reduce, and Filter.
Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Create a function which calculates the factorial of a number with the help of Reduce.

Exercise 2
Create the same factorial function, but using recursion and memoization.

Exercise 3
Create a function cum_add which computes a cumulative sum; e.g. if x <- 1:3, then cum_add(x) will return 1 3 6. Don’t use cumsum.

Exercise 4
Create a function which takes a data frame and returns the mean, minimum, and maximum of all numeric columns, e.g. your_func(iris).
Try to avoid loops; it is possible to do it in one line. Show off your functional skills.

Learn more about Functions in the online course R Programming: Advanced Analytics In R For Data Science. In this course you will learn how to prepare your data quicker and more efficiently, leaving you more time to focus on analyzing and visualizing your results!

Exercise 5
Create a function centerColumnAroundMean which takes a numeric data frame and manipulates it in such a way that every column’s values are centered around that column’s mean. For example, if my data frame is like this
df

then centerColumnAroundMean(df) will result in

It is possible to achieve this in a single line of R; please don’t use loops.
Exercise 6
I have a list of movies containing two movie franchises as elements, Star Wars and LOTR. You can create the movie list with
my_movie_list<- list(STARWARS= list("A NEW HOPE","The Last Jedi","The Force Awakens"),LOTR=list("THE FELLOWSHIP OF THE RING","THE Two Towers","The RETURN of the KING","Hobbit" = list("An unexpected Journey","The Battle of the FIVE ARMY","The Desolation of Smaug") ))
Now the problem I am facing is that some of the titles are in caps and some in lower case, without any particular order. I would like the list to have a consistent format like
“The Last Jedi”. Help me write a function which does this. Keep in mind that the list is a nested list.

Exercise 7
Load the diamonds dataset from the ggplot2 package. I want to buy a diamond of each color but don’t want to pay the top price; I assume the second-highest price is good enough for me. Help me find the second-highest price for each color in the dataset.

Exercise 8
Use the diamonds dataset already loaded in the previous exercise. I want to know the average price for each combination of cut and color. Your output should be similar to the following. Don’t use table.

Exercise 9
Load the iris dataset. I want to get the third row for each group of species. Help me write a short function to achieve that.

Exercise 10
Create a new environment with the new.env() command and create 3 vector variables in that environment: a = 1:10; b = 100:500; c = 1000:1500.
Without manually calling mean on each vector, can you print the mean of every variable in the new environment?

Related exercise sets:

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.