### Ridge Regression and the Lasso

(This article was first published on **R – Real Data**, and kindly contributed to R-bloggers)

In my last post, Which linear model is best?, I wrote about using stepwise selection as a method for selecting linear models, which turns out to have some issues (see this article, and Wikipedia).

This post will be about two methods that slightly modify ordinary least squares (OLS) regression – ridge regression and the lasso.

Ridge regression and the lasso are closely related, but only the lasso has the ability to select predictors. Like OLS, ridge regression attempts to minimize the residual sum of squares for a given model. However, ridge regression adds a ‘shrinkage’ penalty – lambda times the sum of the squared coefficient estimates – which shrinks the estimates of the coefficients towards zero. The strength of this penalty is controlled by the tuning parameter lambda, which is determined separately. Two interesting implications of this design are that when λ = 0 the OLS coefficients are returned, and as λ → ∞ the coefficients approach zero.
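In symbols, the two penalized objectives look like this (a standard textbook formulation added here for reference, not taken from the original post; y_i are responses, x_ij predictors, and λ ≥ 0 the tuning parameter):

```latex
\hat{\beta}^{\,ridge} = \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
  + \lambda \sum_{j=1}^{p}\beta_j^2
\qquad
\hat{\beta}^{\,lasso} = \arg\min_{\beta}\;
  \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2
  + \lambda \sum_{j=1}^{p}|\beta_j|
```

Setting λ = 0 removes the penalty entirely (giving OLS), while letting λ grow forces the penalty to dominate and drives the coefficients towards zero.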

To take a look at this, set up a model matrix (removing the intercept column), store the dependent variable as y, and create a vector of lambda values.
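The setup code isn't shown in this extract; a minimal sketch, assuming the swiss data and the glmnet package used in the snippets below (the particular lambda grid is an assumption), would be:

```r
library(glmnet)

# model matrix of predictors, with the intercept column removed
x <- model.matrix(Fertility ~ ., swiss)[, -1]
# the dependent variable
y <- swiss$Fertility
# a grid of lambda values spanning several orders of magnitude
lambda <- 10^seq(10, -2, length = 100)
```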

First, let's prove the fact that when λ = 0 we get the same coefficients as the OLS model.

Fit your models.

```r
# OLS
swisslm <- lm(Fertility ~ ., data = swiss)
coef(swisslm)
##      (Intercept)      Agriculture      Examination        Education 
##       66.9151817       -0.1721140       -0.2580082       -0.8709401 
##         Catholic Infant.Mortality 
##        0.1041153        1.0770481

# ridge
ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda)
predict(ridge.mod, s = 0, exact = T, type = 'coefficients')[1:6,]
##      (Intercept)      Agriculture      Examination        Education 
##       66.9365901       -0.1721983       -0.2590771       -0.8705300 
##         Catholic Infant.Mortality 
##        0.1040307        1.0770215
```

The differences here are nominal. Let's see if we can use ridge to improve on the OLS estimate.
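The train/test code isn't shown in this extract; a plausible sketch of how the MSE comparison could be set up, assuming the x, y and lambda objects described above (the seed, the split, and the cv.glmnet call are all assumptions):

```r
set.seed(1)
train <- sample(seq_len(nrow(x)), floor(nrow(x) / 2))
test <- setdiff(seq_len(nrow(x)), train)

# OLS fit on the training half, evaluated on the held-out half
swisslm <- lm(Fertility ~ ., data = swiss, subset = train)
ols.pred <- predict(swisslm, newdata = swiss[test, ])

# ridge: choose lambda by cross-validation, then predict on the test half
ridge.mod <- glmnet(x[train, ], y[train], alpha = 0, lambda = lambda)
cv.out <- cv.glmnet(x[train, ], y[train], alpha = 0)
bestlam <- cv.out$lambda.min
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x[test, ])

# compare held-out mean squared errors
mean((ols.pred - y[test])^2)
mean((ridge.pred - y[test])^2)
```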

Ridge performs better for this data according to the MSE.

```r
# a look at the coefficients
out = glmnet(x[train,], y[train], alpha = 0)
predict(ridge.mod, type = "coefficients", s = bestlam)[1:6,]
##      (Intercept)      Agriculture      Examination        Education 
##      64.90631178      -0.16557837      -0.59425090      -0.35814759 
##         Catholic Infant.Mortality 
##       0.06545382       1.30375306
```

As expected, most of the coefficient estimates are more conservative.

Let's have a look at the lasso. The big difference here is in the shrinkage term – the lasso takes the absolute value of the coefficient estimates.

The MSE is a bit higher for the lasso estimate. Let's check out the coefficients.
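The lasso code isn't shown in this extract either; a sketch under the same assumptions (alpha = 1 switches glmnet from ridge to the lasso; the train/test split variables are assumptions carried over from the ridge comparison):

```r
# lasso: same call as ridge, but with alpha = 1
lasso.mod <- glmnet(x[train, ], y[train], alpha = 1, lambda = lambda)
cv.lasso <- cv.glmnet(x[train, ], y[train], alpha = 1)

# held-out MSE for the lasso fit
lasso.pred <- predict(lasso.mod, s = cv.lasso$lambda.min, newx = x[test, ])
mean((lasso.pred - y[test])^2)

# coefficients at the chosen lambda; some may be shrunk exactly to zero
predict(lasso.mod, type = "coefficients", s = cv.lasso$lambda.min)[1:6, ]
```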

Looks like the lasso places high importance on Education, Examination, and Infant.Mortality. From this we also gain some evidence that Catholic and Agriculture are not useful predictors for this model. It is likely that Catholic and Agriculture do have some effect on Fertility, though, since pushing those coefficients to zero hurt the model.

There is plenty more to delve into here, but I'll leave the details to the experts. I am always happy to have your take on the topics I write about, so please feel free to leave a comment or contact me. Oftentimes I learn just as much from you all as I do in researching the topic I write about.

I think the next post will be about more GIS stuff – maybe on rasters or point pattern analysis.

Thank you for reading!

Kiefer

To **leave a comment** for the author, please follow the link and comment on their blog: **R – Real Data**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R⁶ — Idiomatic (for the People)

(This article was first published on **R – rud.is**, and kindly contributed to R-bloggers)

NOTE: I’ll do my best to ensure the next post will have nothing to do with Twitter, and this post might not completely meet my R⁶ criteria.

A single, altruistic, nigh exuberant R tweet about slurping up a directory of CSVs devolved quickly — at least in my opinion, and partly (sadly) with my aid — into a thread that ultimately strayed from a crucial point: *idiomatic is in the eye of the beholder*.

I’m not linking to the twitter thread, but there are enough folks with sufficient Klout scores on it (is Klout even still a thing?) that you can easily find it if you feel so compelled.

I’ll take a page out of the U.S. High School “write an essay” playbook and start with a definition of *idiomatic*:

*using, containing, or denoting expressions that are natural to a native speaker*

That comes from *idiom*:

*a form of expression natural to a language, person, or group of people*

I usually joke with my students that a strength (and weakness) of R is that there are ~twelve ways to do any given task. While the statement is deliberately hyperbolic, the core message is accurate: there’s more than one way to do most things in R. A cascading truth is: what makes one way more “correct” over another often comes down to idiom.

My rstudio::conf 2017 presentation included an example of my version of using purrr for idiomatic CSV/JSON directory slurping. There are lots of ways to do this in R (the point of the post is not really to show you how to do the directory slurping, and it is unlikely that I'll approve comments with code snippets about that task). Here are three: one from the base R tribe, one from the data.table tribe, and one from the tidyverse tribe:

```r
# We need some files and we'll use base R to make some
dir.create("readings")
for (i in 1970:2010)
  write.csv(mtcars, file.path("readings", sprintf("%s.csv", i)), row.names = FALSE)

fils <- list.files("readings", pattern = ".csv$", full.names = TRUE)

do.call(rbind, lapply(fils, read.csv, stringsAsFactors = FALSE))

data.table::rbindlist(lapply(fils, data.table::fread))

purrr::map_df(fils, readr::read_csv)
```

You get data for all the “years” into a data.frame, data.table and tibble (respectively) with those three “phrases”.

However, what if you want the year as a column? Many of these “datalogger” CSV data sets do not have a temporal “grouping” variable as they let the directory structure & naming conventions embed that bit of metadata. That information would be nice, though:

```r
do.call(rbind, lapply(fils, function(x) {
  f <- read.csv(x, stringsAsFactors = FALSE)
  f$year <- gsub("^readings/|\\.csv$", "", x)
  f
}))

dt <- data.table::rbindlist(lapply(fils, data.table::fread), idcol = "year")
dt[, year := gsub("^readings/|\\.csv$", "", fils[year])]

purrr::map_df(fils, readr::read_csv, .id = "year") %>%
  dplyr::mutate(year = stringr::str_replace_all(fils[as.numeric(year)], "^readings/|\\.csv$", ""))
```

All three versions do the same thing, and each tribe understands each idiom.

The data.table and tidyverse versions get you much faster file reading and the ability to “fill” missing columns — another common slurping task. You can hack something together in base R to do column fills (you'll find a few StackOverflow answers that accomplish such a task), but you will likely decide to choose one of the other idioms for that and become equally comfortable in that new idiom.
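A small sketch of the column-fill behavior mentioned above (the frames a and b are invented for illustration):

```r
# two frames sharing column x, each with one column the other lacks
a <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)
b <- data.frame(x = 3:4, z = c(TRUE, FALSE))

# data.table fills columns missing from either frame with NA
data.table::rbindlist(list(a, b), fill = TRUE)

# tidyverse equivalent; bind_rows also fills missing columns with NA
dplyr::bind_rows(a, b)
```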

There are multiple ways to further extend the slurping example, but that’s not the point of the post.

Each set of snippets contains 100% valid R code. They accomplish the task and are *idiomatic for each tribe*. Despite what any *“mil gun feos turrach na latsa”* experts’ exchange would try to tell you, the best idiom is the one that works for you/you & your collaborators and the one that gets you to the real work — data analysis — in the most straightforward & reproducible way possible (for you).

Idiomatic does not mean there’s only a singular One, True Way™, and I think a whole host of us forget that at times.

Write good, clean, error-free, reproducible code.

Choose idioms that work best for you and your collaborators.

Adapt when necessary.


### How to analyze max-diff data in R

(This article was first published on **R – Displayr**, and kindly contributed to R-bloggers)

This post discusses a number of options that are available in R for analyzing data from max-diff experiments, using the package flipMaxDiff. For a more detailed explanation of how to analyze max-diff, and what the outputs mean, you should read the post How max-diff analysis works.

The post covers installing packages and importing your data and experimental design, before discussing counting analysis and the more powerful and statistically valid latent class analysis.

Step 1: Installing the packages

The first step is to install the flipMaxDiff package and a series of dependent packages. Depending on how your R has been set up, you may need to install none of these (e.g., if using Displayr), or even more packages than are shown below.

```r
install.packages("devtools")
library(devtools)
install.packages("AlgDesign")
install_github("Displayr/flipData")
install_github("Displayr/flipTransformations")
install.packages("Rcpp")
install.packages("foreign")
install_github("Displayr/flipMaxDiff")
install_github("Displayr/flipStandardCharts")
```

Step 2: Bring in your data

The mechanism by which you import your survey data will depend upon its format and location. For the example that is used in our main post on analysing max-diff, you can use the following to import the SPSS file:

```r
tech.data = foreign::read.spss("http://wiki.q-researchsoftware.com/images/f/f1/Technology_2017.sav",
                               to.data.frame = TRUE)
```

Step 3: Bring in the experimental design

The flipMaxDiff package contains a method for creating your own experimental design, and this is discussed more here. In many cases, you may simply need to load in the design from your PC, or from a URL. The example that we cover here uses the following:

```r
tech.design = read.csv("http://wiki.q-researchsoftware.com/images/7/78/Technology_MaxDiff_Design.csv")
```

Step 4: Counting analysis

While the latent class analysis technique described below is more valid and powerful than analyses based on counting the selections made by respondents, the following code may be used to examine the results discussed in the post How max-diff analysis works.

First, collect the sections of the data set for the selections made by the respondents:

```r
best = tech.data[, c("Q5a_left", "Q5b_left", "Q5c_left", "Q5d_left", "Q5e_left", "Q5f_left")]
worst = tech.data[, c("Q5a_right", "Q5b_right", "Q5c_right", "Q5d_right", "Q5e_right", "Q5f_right")]
```

Here, the variable naming convention arises from the most-preferred option being shown on the left in the questionnaire, and the least-preferred option being shown on the right.

This code will count up the number of times each alternative was selected as best:

```r
b = unlist(best)               # Turning the 6 variables into 1 variable
t = table(b)                   # Creating a table
s = sort(t, decreasing = TRUE) # Sorting from highest to lowest
# Putting a name at the top of the column, and naming it.
best.score = cbind(Best = s)
```

This code will count up the number of selections for each of the alternatives in your max-diff experiment:

```r
b = table(unlist(best))
best.and.worst = cbind(Best = b, Worst = table(unlist(worst)))[order(b, decreasing = TRUE),]
```

To compute the difference between the best and worst counts:

```r
diff = best.and.worst[, 1] - best.and.worst[, 2]
cbind("Best - Worst" = diff)
```

Step 5: Latent class analysis

The flipMaxDiff package contains a function for estimating the shares in the max-diff experiment that uses latent class analysis. For a more detailed explanation of what this means, and how to interpret the outputs, see this post.

To run the latent class analysis, you can use the following:

```r
library(flipMaxDiff)

# Name the alternatives in the design
alt.names <- c("Apple", "Microsoft", "IBM", "Google", "Intel",
               "Samsung", "Sony", "Dell", "Yahoo", "Nokia")
result = FitMaxDiff(design = tech.design,
                    version = rep(1, nrow(best)),
                    best = best,
                    worst = worst,
                    alternative.names = alt.names,
                    n.classes = 5,
                    output = "Classes")
print(result)
```

The arguments that have been used in this example are:

- design: a data frame containing the experimental design.
- version: a vector of integers which can be used to specify which respondents were shown different versions of the experimental design. In this example, the design is simple and there is only a single version shown to all respondents.
- best: a data frame containing the options that were chosen as most-preferred.
- worst: a data frame containing the options that were chosen as least-preferred.
- alternative.names: a vector containing the names of the alternatives that were used in the max-diff experiment. In the code above, the names of the technology companies are used.
- n.classes: the number of classes to include in the analysis. Refer to this post for how to work out how many classes to include.
- output: a string which denotes the type of output that should be displayed when the object is printed. In this case we have chosen “Classes”, and so the print will show information about the preference shares for each class identified. Change this to “Probabilities”, and the output will show the Mean Probability (%) and distribution of shares.

Step 6: Obtaining respondent-level information

The latent class output can now be used to obtain various bits of information about the respondents. These can then be fed into later analyses:

To generate a new vector which tells you which class each respondent has been assigned to, you may use:

```r
flipMaxDiff::Memberships(result)
```

In the latent class model, each respondent has a certain probability of membership for each class. The Memberships() method used above assigns each respondent to the class with the highest probability. To see the probabilities:

```r
input.maxdiff = result
probs = input.maxdiff$posterior.probabilities
colnames(probs) = paste0('Class ', 1:ncol(probs))
probs
```

Respondent-level preference shares:

```r
respondent.shares = prop.table(exp(as.matrix(flipMaxDiff::RespondentParameters(result))), 1)
```

Step 7: Obtaining information from the output table

While the latent class analysis outputs are displayed in a cool HTMLWidget, the data in the class table can also be extracted with result$probabilities. This lets you extract parts of the table for charting or for other analyses. For example, in our 5-class solution, we can plot the preference shares from the Total column as a donut chart (from the package flipStandardCharts):

```r
library(flipStandardCharts)
pref.shares = result$probabilities[, 6]
pref.shares = sort(pref.shares, decreasing = TRUE) * 100
Chart(pref.shares, type = "Donut")
```

Step 8: Preference simulation

Here, preference simulation refers to the process of removing some alternatives from the calculated shares, and then rebasing the remaining shares to see how they adjust in the absence of the removed alternatives. We first modify result$probabilities by removing the Total column (which will be recomputed later), as well as the rows for the alternatives that we want to exclude. Then, the results are rebased to add to 100 using the prop.table function. A new Total column is computed by taking the weighted sum, where the weights are given by the class sizes contained in result$class.sizes.

```r
# Remove the total column
x = result$probabilities[, -6] * 100
# Removing Apple and Samsung
x = x[c(-1, -6),]
# Re-basing
x = prop.table(x, 2) * 100
# Adding the total
sizes = result$class.sizes
x = cbind(x, Total = rowSums(sweep(x, 2, sizes, "*")))
new.preferences = rbind(x, Total = colSums(x))
```

Summary

Once you have loaded in your survey data and experimental design, the latent class analysis from flipMaxDiff can be run with a couple of lines of code. Additional methods are then available for saving respondent-level information, and data from the output table, to feed into your charts and analyses.


### simglm submission to CRAN this week

(This article was first published on **Educate-R - R**, and kindly contributed to R-bloggers)

This is a quick note looking for any further feedback on the simglm package prior to CRAN submission later this week; the goal is to submit Thursday or Friday. The last few documentation finishing touches are happening now, working toward a version 0.5.0 release on CRAN.

For those who have not seen this package yet, the aim is to simulate regression models (single level and multilevel models) as well as employ empirical power analyses based on Monte Carlo simulation. The package is relatively flexible in user control of inputs to generate data.

To install the package and also build the vignettes:

```r
devtools::install_github("lebebr01/simglm", build_vignettes = TRUE)
```

Then to generate a simple single level data set:

```r
library(simglm)

fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                  var_type = c("single", "single", "single"),
                  cov_param = list(list(mean = 0, sd = 4),
                                   list(mean = 0, sd = 3),
                                   list(mean = 0, sd = 3)))
n <- 150
error_var <- 3
with_err_gen = 'rnorm'

temp_single <- sim_reg(fixed = fixed, fixed_param = fixed_param,
                       cov_param = cov_param, n = n, error_var = error_var,
                       with_err_gen = with_err_gen, data_str = "single")
head(temp_single)
##   X.Intercept.         act       diff   numCourse act.numCourse     Fbeta
## 1            1 -2.11697901 -0.1490870 -0.90292680  1.9114770938 -5.954293
## 2            1  0.01298227 -0.1310381 -0.06197237 -0.0008045421  1.702379
## 3            1  0.44564723  0.5913073 -0.59650183 -0.2658293887  1.754481
## 4            1 -0.03528805 -0.5113031 -0.05915731  0.0020875460  1.144669
## 5            1  1.77940941  0.5097288  0.54804919  0.9752038827 13.495946
## 6            1 -1.42185444  0.4145870  1.08424301 -1.5416357400 -2.561252
##          err  sim_data ID
## 1 -0.9567737 -6.911066  1
## 2  1.3386926  3.041071  2
## 3  0.3470914  2.101572  3
## 4  0.9178861  2.062555  4
## 5  0.8016335 14.297580  5
## 6  0.2499601 -2.311292  6
```

Then one can extend this to conduct a power analysis. The benefit of this approach is that you are able to generate data that hopefully more closely resembles the data to be collected, and you can also build assumption violations, sample size differences, and other conditions directly into the power analysis, similar to a Monte Carlo simulation.

```r
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(0.5, 1.1, 0.6, 0.9, 1.1)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                  var_type = c("single", "single", "single"),
                  opts = list(list(mean = 0, sd = 2),
                              list(mean = 0, sd = 2),
                              list(mean = 0, sd = 1)))
n <- NULL
error_var <- NULL
with_err_gen <- 'rnorm'
pow_param <- c('(Intercept)', 'act', 'diff', 'numCourse')
alpha <- .01
pow_dist <- "t"
pow_tail <- 2
replicates <- 10
terms_vary <- list(n = c(20, 40, 60, 80, 100),
                   error_var = c(5, 10, 20),
                   fixed_param = list(c(0.5, 1.1, 0.6, 0.9, 1.1),
                                      c(0.6, 1.1, 0.6, 0.9, 1.1)),
                   cov_param = list(list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                                         mean = c(0, 0, 0), sd = c(2, 2, 1),
                                         var_type = c("single", "single", "single")),
                                    list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
                                         mean = c(0.5, 0, 0), sd = c(2, 2, 1),
                                         var_type = c("single", "single", "single"))))

power_out <- sim_pow(fixed = fixed, fixed_param = fixed_param, cov_param = cov_param,
                     n = n, error_var = error_var, with_err_gen = with_err_gen,
                     data_str = "single", pow_param = pow_param, alpha = alpha,
                     pow_dist = pow_dist, pow_tail = pow_tail,
                     replicates = replicates, terms_vary = terms_vary)
head(power_out)
## Source: local data frame [6 x 11]
## Groups: var, n, error_var, fixed_param [3]
##
##           var     n error_var         fixed_param
## 1 (Intercept)    20         5 0.5,1.1,0.6,0.9,1.1
## 2 (Intercept)    20         5 0.5,1.1,0.6,0.9,1.1
## 3 (Intercept)    20         5 0.6,1.1,0.6,0.9,1.1
## 4 (Intercept)    20         5 0.6,1.1,0.6,0.9,1.1
## 5 (Intercept)    20        10 0.5,1.1,0.6,0.9,1.1
## 6 (Intercept)    20        10 0.5,1.1,0.6,0.9,1.1
## # ... with 7 more variables: cov_param, avg_test_stat,
## #   sd_test_stat, power, num_reject, num_repl, data
```

Questions and feedback are welcomed by filing an issue on GitHub here: https://github.com/lebebr01/simglm/issues.


### Preview of EARL San Francisco

(This article was first published on **Revolutions**, and kindly contributed to R-bloggers)

The first ever EARL (Enterprise Applications of the R Language) conference in San Francisco will take place on June 5-7 (and it's not too late to register). The EARL conference series is now in its fourth year, and the prior conferences in London and Boston have each been a fantastic way to learn how R is used in real-world applications. Judging from the speaker lineup in San Francisco, next month's event looks like it will feature some great R stories as well. Included on the program will be stories from the following companies:

- AirBnB (Keynote presentation by Ricardo Bion)
- StitchFix (Keynote presentation by Hilary Parker)
- Uber (Keynote presentation by Prakhar Mehrota)
- Hitachi Solutions (David Bishop)
- Amgen (Daniel Saldana)
- Pfizer (Luke Fostveldt)
- Fred Hutch Cancer Research Center (Thomas Vaughan)
- Domino Data Lab (Eduardo Ariño de la Rubia)
- Genentech (Gabriel Becker)

… and many more. On a personal note, I'm looking forward to presenting with my colleague Bharath Sankaranarayan on using Microsoft R to predict hospital stays.

For more information on EARL 2017 San Francisco or to register, follow the link below. (And if you can't make San Francisco, check out EARL 2017 London, coming September 12-14.)

EARL: San Francisco 2017


### The Marcos Lopez de Prado Hierarchical Risk Parity Algorithm

(This article was first published on **R – QuantStrat TradeR**, and kindly contributed to R-bloggers)

This post will be about replicating the Marcos Lopez de Prado algorithm from his paper *Building Diversified Portfolios that Outperform Out-of-Sample*. This algorithm attempts a tradeoff between the classic mean-variance optimization algorithm, which takes the covariance structure into account but is unstable, and an inverse-volatility algorithm, which ignores covariance but is more stable.

This is a paper that I struggled with until I ran the code in Python (I have anaconda installed but have trouble installing some packages such as keras because I’m on windows…would love to have someone walk me through setting up a Linux dual-boot), as I assumed that the clustering algorithm actually was able to concretely group every asset into a particular cluster (I.E. ETF 1 would be in cluster 1, ETF 2 in cluster 3, etc.). Turns out, that isn’t at all the case.

Here’s how the algorithm actually works.

First off, it computes a covariance and correlation matrix (created from simulated data in Marcos’s paper). Next, it uses a hierarchical clustering algorithm on a distance-transformed correlation matrix, with the “single” method (I.E. friend of friends–do ?hclust in R to read up more on this). The key output here is the order of the assets from the clustering algorithm. Note well: this is the only relevant artifact of the entire clustering algorithm.

Using this order, it then uses an algorithm that does the following:

Initialize a vector of weights equal to 1 for each asset.

Then, run the following recursive algorithm:

1) Break the order vector into two lists of equal length (or as close to equal as possible).

2) For each half of the list, compute the inverse-variance weights from the diagonal of the covariance matrix slice containing the assets of interest, and then compute the variance of the cluster under those weights (i.e., w’ * Σ * w, where Σ is the covariance slice).

3) Then, do a basic inverse-variance weight for the two clusters. Call the weight of cluster 0 alpha = 1-cluster_variance_0/(cluster_variance_0 + cluster_variance_1), and the weight of cluster 1 its complement. (1 – alpha).

4) Multiply all assets in the original vector of weights containing assets in cluster 0 with the weight of cluster 0, and all weights containing assets in cluster 1 with the weight of cluster 1. That is, weights[index_assets_cluster_0] *= alpha, weights[index_assets_cluster_1] *= 1-alpha.

5) Lastly, if the list isn’t of length 1 (that is, not a single asset), repeat this entire process until every asset is its own cluster.

Here is the implementation in R code.

First off, the correlation matrix and the covariance matrix for use in this code, obtained from Marcos Lopez De Prado’s code in the appendix in his paper.

```r
> covMat
             V1           V2           V3           V4           V5           V6           V7           V8           V9          V10
1   1.000647799 -0.003050479  0.010033224 -0.010759689 -0.005036503  0.008762563  0.998201625 -0.001393196 -0.001254522 -0.009365991
2  -0.003050479  1.009021349  0.008613817  0.007334478 -0.009492688  0.013031817 -0.009420720 -0.015346223  1.010520047  1.013334849
3   0.010033224  0.008613817  1.000739363 -0.000637885  0.001783293  1.001574768  0.006385368  0.001922316  0.012902050  0.007997935
4  -0.010759689  0.007334478 -0.000637885  1.011854725  0.005759976  0.000905812 -0.011912269  0.000461894  0.012572661  0.009621670
5  -0.005036503 -0.009492688  0.001783293  0.005759976  1.005835878  0.005606343 -0.009643250  1.008567427 -0.006183035 -0.007942770
6   0.008762563  0.013031817  1.001574768  0.000905812  0.005606343  1.064309825  0.004413960  0.005780148  0.017185396  0.011601336
7   0.998201625 -0.009420720  0.006385368 -0.011912269 -0.009643250  0.004413960  1.058172027 -0.006755374 -0.008099181 -0.016240271
8  -0.001393196 -0.015346223  0.001922316  0.000461894  1.008567427  0.005780148 -0.006755374  1.074833155 -0.011903469 -0.013738378
9  -0.001254522  1.010520047  0.012902050  0.012572661 -0.006183035  0.017185396 -0.008099181 -0.011903469  1.075346677  1.015220126
10 -0.009365991  1.013334849  0.007997935  0.009621670 -0.007942770  0.011601336 -0.016240271 -0.013738378  1.015220126  1.078586686

> corMat
             V1           V2           V3           V4           V5           V6           V7           V8           V9          V10
1   1.000000000 -0.003035829  0.010026270 -0.010693011 -0.005020245  0.008490954  0.970062043 -0.001343386 -0.001209382 -0.009015412
2  -0.003035829  1.000000000  0.008572055  0.007258718 -0.009422702  0.012575370 -0.009117080 -0.014736040  0.970108941  0.971348946
3   0.010026270  0.008572055  1.000000000 -0.000633903  0.001777455  0.970485047  0.006205079  0.001853505  0.012437239  0.007698212
4  -0.010693011  0.007258718 -0.000633903  1.000000000  0.005709500  0.000872861 -0.011512172  0.000442908  0.012052964  0.009210090
5  -0.005020245 -0.009422702  0.001777455  0.005709500  1.000000000  0.005418538 -0.009347204  0.969998023 -0.005945165 -0.007625721
6   0.008490954  0.012575370  0.970485047  0.000872861  0.005418538  1.000000000  0.004159261  0.005404237  0.016063910  0.010827955
7   0.970062043 -0.009117080  0.006205079 -0.011512172 -0.009347204  0.004159261  1.000000000 -0.006334331 -0.007592568 -0.015201540
8  -0.001343386 -0.014736040  0.001853505  0.000442908  0.969998023  0.005404237 -0.006334331  1.000000000 -0.011072068 -0.012759610
9  -0.001209382  0.970108941  0.012437239  0.012052964 -0.005945165  0.016063910 -0.007592568 -0.011072068  1.000000000  0.942667300
10 -0.009015412  0.971348946  0.007698212  0.009210090 -0.007625721  0.010827955 -0.015201540 -0.012759610  0.942667300  1.000000000
```

Now, for the implementation.

This reads in the two matrices above and gets the clustering order.

```r
covMat <- read.csv('cov.csv', header = FALSE)
corMat <- read.csv('corMat.csv', header = FALSE)

clustOrder <- hclust(dist(corMat), method = 'single')$order
```

This is the clustering order:

```r
> clustOrder
 [1]  9  2 10  1  7  3  6  4  5  8
```

Next, the getIVP (get Inverse Variance Portfolio) and getClusterVar functions (note: I'm trying to keep the naming conventions identical to those in Dr. Lopez de Prado's paper):

```r
getIVP <- function(covMat) {
  invDiag <- 1/diag(as.matrix(covMat))
  weights <- invDiag/sum(invDiag)
  return(weights)
}

getClusterVar <- function(covMat, cItems) {
  covMatSlice <- covMat[cItems, cItems]
  weights <- getIVP(covMatSlice)
  cVar <- t(weights) %*% as.matrix(covMatSlice) %*% weights
  return(cVar)
}
```

Next, my code diverges from the code in the paper, because I do not use the list comprehension structure, but instead opt for a recursive algorithm, as I find that style to be more readable.

One wrinkle to note is the use of the double-arrow assignment operator, which assigns to a variable outside the scope of the recurFun function. I assign the initial weights vector w in the global environment, and update it from within the recurFun function. I am aware that it is a faux pas to create variables in the global environment, but my attempts at creating a temporary environment in which to update the weight vector did not produce the updating mechanism I had hoped for, so a little bit of assistance with refactoring this code would be appreciated.

```r
getRecBipart <- function(covMat, sortIx) {
  # keeping track of w in the global environment
  assign("w", value = rep(1, ncol(covMat)), envir = .GlobalEnv)
  recurFun(covMat, sortIx)
  return(w)
}

recurFun <- function(covMat, sortIx) {
  subIdx <- 1:trunc(length(sortIx)/2)
  cItems0 <- sortIx[subIdx]
  cItems1 <- sortIx[-subIdx]
  cVar0 <- getClusterVar(covMat, cItems0)
  cVar1 <- getClusterVar(covMat, cItems1)
  alpha <- 1 - cVar0/(cVar0 + cVar1)

  # scoping mechanics using w as a free parameter
  w[cItems0] <<- w[cItems0] * alpha
  w[cItems1] <<- w[cItems1] * (1-alpha)

  if(length(cItems0) > 1) {
    recurFun(covMat, cItems0)
  }
  if(length(cItems1) > 1) {
    recurFun(covMat, cItems1)
  }
}
```

Lastly, let's run the function.

```r
out <- getRecBipart(covMat, clustOrder)
```

With the result (which matches the paper):

```r
> out
 [1] 0.06999366 0.07592151 0.10838948 0.19029104 0.09719887 0.10191545
 [7] 0.06618868 0.09095933 0.07123881 0.12790318
```

So, hopefully this democratizes the use of this technique in R. While I have seen a raw Rcpp implementation and one from the Systematic Investor Toolbox, neither of those implementations satisfied me from a “plug and play” perspective. This implementation solves that issue. Anyone here can copy and paste these functions into their environment and immediately make use of one of the algorithms devised by one of the top minds in quantitative finance.

A demonstration in a backtest using this methodology will be forthcoming.

Thanks for reading.

NOTE: I am always interested in networking and full-time opportunities which may benefit from my skills. Furthermore, I am also interested in project work in the volatility ETF trading space. My linkedin profile can be found here.


### Instrumental Variables in R exercises (Part-2)

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

This is the second part of the series on Instrumental Variables. For other parts of the series follow the tag instrumental variables.

In this exercise set we will build on the example from part-1.

We will now consider an over-identified case i.e. we have multiple IVs for an endogenous variable. We will also look at tests for endogeneity and over-identifying restrictions.

Answers to the exercises are available here.

**Exercise 1**

Load the AER package (package description: here). Next, load PSID1976 dataset provided with the AER package. This has data regarding labor force participation of married women.

Define a new data-frame that has data for all married women that were employed. This data-frame will be used for the following exercises.

**Exercise 2**

We will use a more elaborate model as compared to the previous set.

Regress log(wage) on education, experience and experience^2.

**Exercise 3**

Previously, we identified feducation and meducation as possible IVs for education.

Regress education on experience, experience^2, feducation and meducation. Comment on your results.

**Exercise 4**

Test the hypothesis that feducation and meducation are jointly equal to zero.

**Exercise 5**

Use 2SLS to estimate the IV based return to education.

**Exercise 6**

Use the ivreg command to get the same results as in Exercise-5. Note that standard errors would be different.

**Exercise 7**

There is a short form for the ivreg command which saves time when we are dealing with numerous variables.

Try the short format and check that you get the same results as in Exercise-6.

**Exercise 8**

Regress log(wage) on education, experience, experience^2 and residuals from the model estimated in Exercise-3.

Use your result to test for the endogeneity of education.

**Exercise 9**

Regress the residuals from exercise-7 on experience, experience^2, feducation and meducation.

Use your result to test for over-identifying restrictions.

**Exercise 10**

The two tests we did in exercises 8 and 9 can be conveniently obtained using the summary command with diagnostics turned on. Verify that you get the same test results with the summary command.
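As a sanity check on the mechanics behind Exercises 3–6, here is a hedged NumPy sketch of 2SLS on synthetic data. The variables z1, z2, x and y are illustrative stand-ins for the instruments, education and log(wage); they are not the PSID1976 columns:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
# two instruments (standing in for feducation / meducation)
z1, z2 = rng.normal(size=n), rng.normal(size=n)
u = rng.normal(size=n)                      # unobserved confounder
x = z1 + 0.5 * z2 + u + rng.normal(size=n)  # endogenous "education"
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # true return is 2.0

def ols(X, y):
    # least-squares coefficient vector
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive OLS is biased upward because x is correlated with u
b_ols = ols(np.column_stack([np.ones(n), x]), y)[1]

# Stage 1: project x on the instruments (plus intercept)
Z = np.column_stack([np.ones(n), z1, z2])
x_hat = Z @ ols(Z, x)

# Stage 2: regress y on the fitted values
b_iv = ols(np.column_stack([np.ones(n), x_hat]), y)[1]
print(b_ols, b_iv)
```

Doing the second stage by hand like this also illustrates the caveat in Exercise 6: manual second-stage standard errors ignore the first-stage estimation, which is why ivreg's standard errors differ.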


To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.

### The End of the Honeymoon: Falling Out of Love with quantstrat

(This article was first published on ** R – Curtis Miller's Personal Website**, and kindly contributed to R-bloggers)

I spent good chunks of Friday, Saturday, and Sunday attempting to write another blog post on using R and the **quantstrat** package for backtesting, and all I have to show for my work is frustration. So I’ve started to fall out of love with **quantstrat** and am thinking of exploring Python backtesting libraries from now on.

Here’s my story…

**Cross-Validation**

Let's suppose you've read my past blog posts on using R for analyzing stock data, in particular the most recent blog post on R for finance, on parameter optimization and order types. You'll notice that we've been doing two things (perhaps among others):

- We only used the most recent data as out-of-sample data. What if there are peculiarities unique to the most recent time frame, making it a poor judge (alone) of future performance?
- We evaluate a strategy purely on its returns. Is this the best approach? What if a strategy doesn’t earn as much as another but it handles risk well?

I wanted to write an article focusing on addressing these points, by looking at cross-validation and different metrics for evaluating performance.

Data scientists want to fit training models to data that will do a good job of predicting *future*, out-of-sample data points. This is *not* done by finding the model that performs the best on the training data. This is called overfitting; a model may appear to perform well on training data but will not generalize to out-of-sample data. Techniques need to be applied to prevent against it.

A data scientist may first split her data set used for developing a predictive algorithm into a training set and a test set. The data scientist locks away the test set in a separate folder on the hard drive, never looking at it until she’s satisfied with the model she’s developed using the training data. The test set will be used once a final model has been found to determine the final model’s expected performance.

She would like to be able to simulate out-of-sample performance with the training set, though. The usual approach is to split the training set into, say, 10 different folds, or smaller data sets, and apply cross-validation. With cross-validation, she holds out one of the folds and fits a predictive model on the remaining nine. After fitting the model, she sees how it performs on the held-out fold, which the fitted model has not seen. She then repeats this for each of the other folds, to get a sense of the distribution, or average, of the model's performance on out-of-sample data. If there are hyperparameters (parameters that describe some higher-order aspect of a model and are not learned from the data in the usual way other parameters are; they're difficult to define rigorously), she may try different combinations of them and see which combination leads to the model with the best predictive ability in cross-validation. After determining which predictive model generally leads to the best results and which hyperparameters are optimal, she trains a model on the full training set with those hyperparameters, evaluates its performance on the test set, and reports the results, perhaps deploying the model.
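The fold mechanics described above can be sketched in a few lines of plain Python (the function name is mine, not from any package):

```python
import random

def k_fold_splits(indices, k, seed=0):
    # shuffle once, deal the indices into k folds, hold each fold out in turn
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]   # fit on train, evaluate on the held-out fold

for train, held_out in k_fold_splits(range(10), 5):
    print(len(train), held_out)
```

Each of the 5 splits trains on 8 observations and evaluates on the 2 held out; every observation is held out exactly once.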

For those developing trading algorithms, the goals are similar, but there are some key differences:

- We evaluate a trading method not by predictive accuracy but by some other measure of performance, such as profitability or profitability relative to risk. (*Maybe* predictive accuracy is profitable, maybe not; if the most profitable trading system always underestimates its profitability, we're fine with it.)
- We are using data where time, and the order in which data comes in, is thought to matter. We cannot just reshuffle the data; important features would be lost. So we cannot divide the data set randomly into different folds; order must be preserved.

I’ll talk more about how we could evaluate a trading system later; for now, let’s focus on how we can apply the idea of cross-validation analysis.

*(Illustration by Max Kuhn in his book,* The caret Package*.)*

For time-dependent data, we can employ walk-forward analysis or rolling forecast origin techniques. This comes in various flavors (see the above illustration), but I will focus on one. We first divide the data set into, say, ten periods. We then fit a trading algorithm on the first period and see how it performs on the "out-of-sample" second period. We repeat this for the second and third periods, third and fourth periods, and so on until we've reached the end of the data set. The out-of-sample data sets are then used for evaluating the potential performance of the trading system in question. If we like, we may keep a final data set, perhaps the most recent one, for final evaluation of whatever trading system passes this initial test.

Other variants include overlapping training/testing periods (as described, my walk-forward analysis approach does not overlap) or ones where the initial window grows from beginning to end. Of these, I think either the approach I've described or the one with overlapping training/testing periods makes the most sense.
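The non-overlapping flavor I described can be sketched as a simple index generator, independent of quantstrat (plain Python; the function name is mine):

```python
def walk_forward_splits(n_obs, n_periods):
    # non-overlapping variant: fit on period i, test on period i + 1
    edges = [round(i * n_obs / n_periods) for i in range(n_periods + 1)]
    for i in range(n_periods - 1):
        train = range(edges[i], edges[i + 1])
        test = range(edges[i + 1], edges[i + 2])
        yield train, test

# e.g. 100 observations in 10 periods: first split fits on observations
# 0-9 and tests on 10-19, then slides forward one period at a time
for train, test in walk_forward_splits(100, 10):
    print(train, test)
```

The out-of-sample test ranges, taken together, cover everything except the first period, which is only ever used for fitting.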

Attempting to Use **quantstrat**

I discovered that **quantstrat** has a function that I *thought* would implement the type of walk-forward analysis I wanted, called walk.forward(). I was able to define a training period duration, a testing period duration, an objective function to maximize, and many other features I wanted. It could take a strategy designed for optimization with apply.paramset() and apply a walk-forward analysis.

So I tried it once and the results were cryptic. The function returns a list with the results of training… but no sign of the test results. I would try looking at the portfolio object and tradeStats() output, but there was no sign of a portfolio. I had no idea why. I still have no idea where the portfolio went initially. I would try clearing my environments and restarting R and doing the whole process again. Same results: no sign of a portfolio, and all I got was a list of results for training data, with no sign of results for testing data. I looked at the source code. I saw that it was doing *something* with the test data: it was using applyStrategy() on the test data.

I had no idea what the hell that meant. When I use applyStrategy() as the user, I have to assume that the portfolio is wiped out, the account is reset, and every application is effectively a fresh start. So when I saw that line, it looked as if the function took the optimized strategy, applied it to the test data, and repeated this process, all the while forgetting the results of previous applications of the optimized strategy to test data sets. But online sources said that was *not* what was happening; the results of this backtesting were being appended back-to-back, which did not mesh with either my notion of walk-forward analysis (which allows for overlapping time frames, in which case you cannot simply "glue" together test data results) or what applyStrategy() does. And it still does not make sense to me. (That is what it does, though.)

The fact that the demonstration of walk.forward() supplied with **quantstrat** is bugged (and I never found how to fix all its bugs) didn’t help.

So I tried using the function's auditing feature; by supplying a string to audit.prefix, *.RData files will be created containing the results of training and testing, stored in a special .audit environment. So I changed the working directory to one specifically meant for the audit files (there could be a lot of them and I didn't want to clutter up my other directories; I hate clutter), passed a string to audit.prefix, and tried again. After the analysis completed, I looked in the directory to see the files.

There’s nothing there.

It took hours and I had to go through a debugger to finally discover that walk.forward() *was* saving files, but it was using my Documents directory as the working directory, not the directory I setwd()ed into. Why? I have no idea. It could be **quantstrat**, **blotter**, or RStudio’s workbooks. I have no idea how long it took to figure out where my files had gone, but it was too long however long it was.

Good news though was that I now could try and figure out what was happening to the test data. This presentation by Guy Yollin was helpful, though it still took a while to get it figured out.

From what I can tell, what walk.forward() does is simulate a trading strategy that optimizes, trades, and re-optimizes on new data, and trades. So it doesn’t look like cross-validation as I described but rather a trading strategy, and that’s not really what I wanted here. I wanted to get a sense of the distribution of returns when using window sizes chosen optimally from backtesting. I don’t think there’s any simple way to do that with **quantstrat**.

Furthermore, I can’t see how I can collect the kind of statistics I want, such as a plot of the value of the account or the account’s growth since day one, with the way the walk-forward analysis is being done. Furthermore, the results I did get look… strange. I don’t know how those kinds of results would appear… at all. They look like nonsense. So I don’t even know if the function is working right.

Oh, did I mention that I discovered that the way I thought backtesting was being done isn’t how it’s actually done in **quantstrat**? When I was doing backtesting, I would rather see the results of the backtest on a variety of stocks instead of just one. I imagine an automated backtest behaving like a trader with an account, and active trades tie up the trader’s funds. So if the trader sees an opportunity but does not have the cash to act on it, the trader does not take the trade. (There is no margin here.)

I thought applyStrategy() was emulating such a trader, but the documentation (and the logs produced, for that matter), suggest that’s *not* what’s happening. It backtests on each symbol separately, unaware of any other positions that may exist. I don’t like that.

It was very difficult to figure this all out, especially given the state of the documentation currently. That said, in the developers’ defense, **quantstrat** is still considered under heavy development. Maybe in the future the documentation will be better. I feel, though, that creating good documentation is not something done once you’ve completed coding; you write good documentation *as* you code, not *after*.

**blotter**: The Fatal Flaw

Granted, I'm an ignorant person, but I don't feel like all the confusion that led to me wasting my weekend is completely my fault. **quantstrat**'s design is strange and, honestly, not that intuitive. I think the problem stems from **quantstrat**'s building off an older, key package: **blotter**.

**blotter** is a package intended to manage transactions, accounts, portfolios, financial instruments, and so on. All the functions for creating accounts and portfolios are **blotter** functions, not **quantstrat** functions. **blotter** does this by setting up its own environment, the **.blotter** environment, and the functions I’ve shown are for communicating with this environment, adding and working with entities stored in it.

I don’t know what the purpose of creating a separate environment for managing those objects is. I feel that by doing so a wall has been placed between me and the objects I’m trying to understand. **quantstrat** takes this further by defining a **.strategy** environment where strategies actually “live”, so we must use special functions to interact with that separate universe, adding features to our strategies, deleting and redefining them with special functions, and so on.

To me, this usage of environments feels very strange, and I think it makes **quantstrat** difficult and unintuitive. In some sense, I feel like strategies, portfolios, accounts, etc. are pseudo-objects in the object oriented programming (OOP) sense, and the developers of **blotter** and **quantstrat** are trying to program like R is a language that primarily follows the OOP paradigm.

While R does support OOP, it's a very different kind of OOP, and I would not call R an OOP language. R is a *functional* programming language. Many see the **tidyverse** as almost revolutionary in making R easy to use. That's because the **tidyverse** family of packages, including **ggplot2**, **magrittr**, **dplyr**, and others, recognizes that R relies on the functional programming paradigm and that this paradigm does, in fact, lead to elegant code. Users can write pipelines, and functions to put into pipelines, that make complicated tasks easy to program. No part of the pipeline is cut off from the user, and the process of building the pipeline is very intuitive.

When I wrote my first blog post on R for finance and trading, one of the common comments I heard was, “**tidyquant** is better.” I have yet to look seriously at **tidyquant**, and maybe it is better, but so far all I’ve seen is **tidyquant** replace **quantmod**‘s functionality, and I have no complaint with **quantmod** (although **tidyquant** may also be suffering from the original sin of relying too heavily on environments). That said, the **tidyquant** approach may yet be the future of R backtesting.

Let’s try re-imagining what a better R backtesting package might look like, one that takes lessons from the **tidyverse**. Let’s first look at defining a strategy with **quantstrat** (after all the obnoxious boilerplate code).

After all this code, we have a strategy that lives in the .strategy environment, locked away from us. What if we were to create strategies like we create plots in **ggplot2**? The code might look like this:

In **blotter** (and thus **quantstrat**) we create porfolios and accounts like so:

Then we run the strategy with:

```r
applyStrategy(strategy_st, portfolios = portfolio_st)
```

and we see the results with this code:

```r
updatePortf(portfolio_st)
dateRange <- time(getPortfolio(portfolio_st)$summary)[-1]
updateAcct(account_st, dateRange)
updateEndEq(account_st)

tStats <- tradeStats(Portfolios = portfolio_st, use = "trades", inclZeroDays = FALSE)
tStats[, 4:ncol(tStats)] <- round(tStats[, 4:ncol(tStats)], 2)
print(data.frame(t(tStats[, -c(1,2)])))

final_acct <- getAccount(account_st)
plot(final_acct$summary$End.Eq["2010/2016"] / 100000,
     main = "Portfolio Equity", ylim = c(0.8, 2.5))
lines(SPY$SPY.Adjusted / SPY$SPY.Adjusted[[1]], col = "blue")
```

A **tidyverse** approach might look like the following:

Other functions would then extract any information we need from the object ob, when we ask for them.

If I wanted to optimize, I might do the following:

```r
ob_list <- myAcct %>% myStrat(optim = "MA")  # Gives a list of orderbooks, and
                                             # optimizes the parameter set MA
```

And as for the original reason I started to gripe, I might be able to find slices for the analysis with **caret** functions (they exist), subset the data in symbols according to these slices, and get ob in, say, an lapply() loop.

I like the idea for this interface, myself.

**Conclusion (On To Python)**

I do not have the time to write a package like this myself. I don't know enough, and I *am* a Ph.D. student and *should* be preparing for qualifying exams/research/etc. For now, though, I'm turned off to **quantstrat**, and thus R for backtesting and trading.

That said, my stock data analysis posts are still, by far, the most popular content of this blog, and I still would like to find a good backtesting system. I’m looking at going back to Python, and the **backtrader** package has caught my eye. I will either use that or **zipline**, but I am leaning to **backtrader**, since initially I like its logging features and its use of OOP.

I would love to hear thoughts and comments, especially from users of either **backtrader** or **zipline** or Python for stock data analysis in general.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Curtis Miller's Personal Website**.

### R vs Python: Different similarities and similar differences

(This article was first published on ** R – Giga thoughts …**, and kindly contributed to R-bloggers)

A debate about which language is better suited for data science, R or Python, can set off diehard fans of these languages into a tizzy. This post looks at some of the different similarities and similar differences between these languages. To a large extent, the ease or difficulty of learning R or Python is subjective. I have heard that R has a steeper learning curve than Python, and also vice versa; this probably depends on one's degree of familiarity with the language. To a large extent, both R and Python do the same things in just slightly different ways and syntaxes. The ease or difficulty of an R/Python construct is largely in the 'eye of the beholder', or rather of the programmer, we could say. I include my own experience with the languages below.

**1. R data types**

R's atomic vectors most commonly hold one of two data types (logical, integer and complex also exist):

- Character
- Numeric

Python has several data types

- int
- float
- complex
- bool, and so on

A common data type in R is the vector. Python has a similar data type, the list.

```r
# R vectors
a <- c(4,5,1,3,4,5)
print(a[3])
## [1] 1
print(a[3:4])  # R does not always need the explicit print
## [1] 1 3
# R type of variable
print(class(a))
## [1] "numeric"
# Length of a
print(length(a))
## [1] 6
```

```python
# Python lists
a = [4,5,1,3,4,5]
print(a[2])  # Some Python IDEs require the explicit print
print(a[2:5])
print(type(a))
# Length of a
print(len(a))
## 1
## [1, 3, 4]
## <class 'list'>
## 6
```

**2a. Other data types – Python**

Python also has certain other data types like the tuple, dictionary etc., as shown below. R does not have as many data types; nevertheless, we can do everything that Python does in R.

```python
# Python tuple
b = (4,5,7,8)
print(b)
# Python dictionary
c = {'name':'Ganesh','age':54,'Work':'Professional'}
print(c)
## (4, 5, 7, 8)
## {'name': 'Ganesh', 'age': 54, 'Work': 'Professional'}
```

**2. Type of variable**

To know the type of a variable in R we use 'class'; in Python the corresponding command is 'type'.

```r
# R - Type of variable
a <- c(4,5,1,3,4,5)
print(class(a))
## [1] "numeric"
```

```python
# Python - Print types of list a and tuple b
a = [4,5,1,3,4,5]
print(type(a))
b = (4,3,"the",2)
print(type(b))
## <class 'list'>
## <class 'tuple'>
```

**3. Length**

To know the length in R, use length().

```r
# R - Length of vector
a <- c(4,5,1,3,4,5)
print(length(a))
## [1] 6
```

To know the length of a list, tuple or dict in Python we can use len().

```python
# Python - Length of list, tuple etc.
a = [4,5,1,3,4,5]
print(len(a))
# Length of b
b = (4,5,7,8)
print(len(b))
## 6
## 4
```

**4. Accessing help**

To access help in R we use the '?' or the 'help' function.

```r
# R - Help - To be done in R console or RStudio
#?sapply
#help(sapply)
```

Help in Python on any topic involves

```python
# Python help - This can be done on a (I)Python console
#help(len)
#?len
```

**5. Subsetting**

The key difference between R and Python with regard to subsetting is that in R the index starts at 1; in Python it starts at 0, much like C, C++ or Java. To subset a vector in R we use

```r
# R - Subset
a <- c(4,5,1,3,4,8,12,18,1)
print(a[3])
## [1] 1
# To print a range or a slice. Print from the 3rd to the 6th element
print(a[3:6])
## [1] 1 3 4 8
```

Python also uses indices. The difference in Python is that the index starts from 0.

```python
# Python - Subset
a = [4,5,1,3,4,8,12,18,1]
# Print the 4th element (index starts from 0)
print(a[3])
# Print a slice from the 4th to the 6th element
print(a[3:6])
## 3
## [3, 4, 8]
```

**6. Operations on vectors in R and on lists in Python**

In R we can do many element-wise operations on vectors, e.g. addition, subtraction, product, exponentiation etc., as shown below.

```r
# R - Operations on vectors
a <- c(5,2,3,1,7)
b <- c(1,5,4,6,8)
# Element-wise addition
print(a+b)
## [1] 6 7 7 7 15
# Element-wise subtraction
print(a-b)
## [1] 4 -3 -1 -5 -1
# Element-wise product
print(a*b)
## [1] 5 10 12 6 56
# Exponentiating the elements of a vector
print(a^2)
## [1] 25 4 9 1 49
```

In Python, to do this on lists we need to use the 'map' and 'lambda' functions as follows.

```python
# Python - Operations on lists
a = [5,2,3,1,7]
b = [1,5,4,6,8]
# Element-wise addition with map & lambda
print(list(map(lambda x,y: x+y, a, b)))
# Element-wise subtraction
print(list(map(lambda x,y: x-y, a, b)))
# Element-wise product
print(list(map(lambda x,y: x*y, a, b)))
# Exponentiating the elements of a list
print(list(map(lambda x: x**2, a)))
## [6, 7, 7, 7, 15]
## [4, -3, -1, -5, -1]
## [5, 10, 12, 6, 56]
## [25, 4, 9, 1, 49]
```

However, if we create ndarrays from the lists we can do element-wise addition, subtraction, product etc. just like in R. NumPy is a really powerful module with many functions for matrix manipulations.

```python
import numpy as np
a = [5,2,3,1,7]
b = [1,5,4,6,8]
a = np.array(a)
b = np.array(b)
# Element-wise addition
print(a+b)
# Element-wise subtraction
print(a-b)
# Element-wise product
print(a*b)
# Exponentiating the elements of an array
print(a**2)
## [ 6  7  7  7 15]
## [ 4 -3 -1 -5 -1]
## [ 5 10 12  6 56]
## [25  4  9  1 49]
```

**7. Getting the index of an element**

To determine the index of an element which satisfies a specific logical condition in R, use 'which'. In the code below, the index of the element equal to 1 is 4.

```r
# R - which
a <- c(5,2,3,1,7)
print(which(a == 1))
## [1] 4
```

In a Python NumPy array we can use np.where to get the same effect. The index will be 3, as the index starts from 0.

```python
# Python - np.where
import numpy as np
a = [5,2,3,1,7]
a = np.array(a)
print(np.where(a == 1))
## (array([3], dtype=int64),)
```

**8. Data frames**

R, by default, comes with a set of in-built datasets. In Python, some datasets come with the scikit-learn package.

```r
# R
# To check built-in datasets use
#data() - In R console or in R Studio
#iris - Don't print to console
```

In Python we can use the in-built data sets that come with the scikit-learn package.

```python
# Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
# This creates a sklearn bunch
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
```

**9. Working with dataframes**

With R you can work with dataframes directly. For more complex dataframe operations in R there are convenient packages like dplyr, reshape2 etc. For Python we need the pandas package, which is quite comprehensive in what it can do with data frames. The most common operations on a dataframe are:

- Check the size of the dataframe
- Take a look at the top 5 or bottom 5 rows of dataframe
- Check the content of the dataframe

In R use dim()

```r
# R - Size
dim(iris)
## [1] 150 5
```

For Python use .shape.

```python
# Python - size
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.shape
```

**b. Top & bottom 5 rows of dataframe**

To see the top and bottom rows of a data frame we use head() & tail() as shown below for R and Python.

```r
# R
head(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
tail(iris,5)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
```

```python
# Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.head(5))
print(iris.tail(5))
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
## 3 4.6 3.1 1.5 0.2
## 4 5.0 3.6 1.4 0.2
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
## 145 6.7 3.0 5.2 2.3
## 146 6.3 2.5 5.0 1.9
## 147 6.5 3.0 5.2 2.0
## 148 6.2 3.4 5.4 2.3
## 149 5.9 3.0 5.1 1.8
```

**c. Check the content of the dataframe**

```r
# R
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

```python
# Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm) 150 non-null float64
## sepal width (cm) 150 non-null float64
## petal length (cm) 150 non-null float64
## petal width (cm) 150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None
```

**d. Check column names**

```r
# R
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
```

```python
# Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
# Get column names
print(iris.columns)
## Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
## 'petal width (cm)'],
## dtype='object')
```

**e. Rename columns**

In R we can assign a vector to the column names.

```r
# R
colnames(iris) <- c("lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal","Species")
colnames(iris)
## [1] "lengthOfSepal" "widthOfSepal" "lengthOfPetal" "widthOfPetal"
## [5] "Species"
```

In Python we can assign a list to iris.columns.

```python
# Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.columns = ["lengthOfSepal","widthOfSepal","lengthOfPetal","widthOfPetal"]
print(iris.columns)
## Index(['lengthOfSepal', 'widthOfSepal', 'lengthOfPetal', 'widthOfPetal'], dtype='object')
```

**f. Details of dataframe**

```python
# Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.info())
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 4 columns):
## sepal length (cm) 150 non-null float64
## sepal width (cm) 150 non-null float64
## petal length (cm) 150 non-null float64
## petal width (cm) 150 non-null float64
## dtypes: float64(4)
## memory usage: 4.8 KB
## None
```

**g. Subsetting dataframes**

```r
# R
# To subset a dataframe 'df' in R we use df[row,column] or df[row vector,column vector]
# df[row,column]
iris[3,4]
## [1] 0.2
# df[row vector, column vector]
iris[2:5,1:3]
## lengthOfSepal widthOfSepal lengthOfPetal
## 2 4.9 3.0 1.4
## 3 4.7 3.2 1.3
## 4 4.6 3.1 1.5
## 5 5.0 3.6 1.4
# If we omit the row vector it implies all rows; if we omit the column vector
# it implies all columns for those rows
iris[2:5,]
## lengthOfSepal widthOfSepal lengthOfPetal widthOfPetal Species
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
# In R we can also call specific columns by column name
# (NULL here because the column was renamed to lengthOfSepal above)
iris$Sepal.Length[2:5]
## NULL
```

```python
# Python
# To select an entire row we use .iloc. The index can be used with ':' as in
# .iloc[start row : end row]. If the start row is omitted it implies the beginning
# of the data frame; if the end row is omitted it implies all rows till the end
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
print(iris.iloc[3])
print(iris[:5])
# In Python we can select columns by column name as follows
print(iris['sepal length (cm)'][2:6])
# If you want to select more than 2 columns then you must use the double '[[]]' since the
# index is a list itself
print(iris[['sepal length (cm)','sepal width (cm)']][4:7])
## sepal length (cm) 4.6
## sepal width (cm) 3.1
## petal length (cm) 1.5
## petal width (cm) 0.2
## Name: 3, dtype: float64
## sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
## 0 5.1 3.5 1.4 0.2
## 1 4.9 3.0 1.4 0.2
## 2 4.7 3.2 1.3 0.2
## 3 4.6 3.1 1.5 0.2
## 4 5.0 3.6 1.4 0.2
## 2 4.7
## 3 4.6
## 4 5.0
## 5 5.4
## Name: sepal length (cm), dtype: float64
## sepal length (cm) sepal width (cm)
## 4 5.0 3.6
## 5 5.4 3.9
## 6 4.6 3.4
```

**h. Computing mean and standard deviation**

```r
# R
# Mean
mean(iris$lengthOfSepal)
## [1] 5.843333
# Standard deviation
sd(iris$widthOfSepal)
## [1] 0.4358663
```

```python
# Python
import sklearn as sklearn
import pandas as pd
from sklearn import datasets
data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
# Mean
print(iris['sepal length (cm)'].mean())
# Standard deviation
print(iris['sepal width (cm)'].std())
## 5.843333333333335
## 0.4335943113621737
```

**i. Boxplot**

A boxplot can be produced in R using base plot.

```
#R
boxplot(iris$lengthOfSepal)
```

Matplotlib is a popular package in Python for plotting:

```
#Python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

data = datasets.load_iris()
# Convert to Pandas dataframe
iris = pd.DataFrame(data.data, columns=data.feature_names)
plt.boxplot(iris['sepal length (cm)'])
plt.show()
```

j. Scatter plot

```
#R
plot(iris$widthOfSepal,iris$lengthOfSepal)
```

```
#Python
import matplotlib.pyplot as plt
plt.scatter(iris['sepal width (cm)'],iris['sepal length (cm)'])
plt.show()
```

k. Read from csv file

```
#R
tendulkar = read.csv("tendulkar.csv", stringsAsFactors = FALSE, na.strings = c(NA,"-"))
#Dimensions of dataframe
dim(tendulkar)
## [1] 347  13

names(tendulkar)
##  [1] "X"          "Runs"       "Mins"       "BF"         "X4s"
##  [6] "X6s"        "SR"         "Pos"        "Dismissal"  "Inns"
## [11] "Opposition" "Ground"     "Start.Date"
```

Use pandas.read_csv() for Python.

```
#Python
import pandas as pd
#Read csv
tendulkar = pd.read_csv("tendulkar.csv", na_values=["-"])
print(tendulkar.shape)
print(tendulkar.columns)
## (347, 13)
## Index(['Unnamed: 0', 'Runs', 'Mins', 'BF', '4s', '6s', 'SR', 'Pos',
##        'Dismissal', 'Inns', 'Opposition', 'Ground', 'Start Date'],
##       dtype='object')
```

l. Clean the dataframe in R and Python

The following steps are done for both R and Python:

1. Remove rows with ‘DNB’
2. Remove rows with ‘TDNB’
3. Remove rows with ‘absent’
4. Remove the “*” indicating not out
5. Remove incomplete rows (NA in R, NaN in Python)
6. Do a scatter plot
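As a rough sketch of how these cleaning steps look in pandas, here is the same logic run on a tiny hand-made frame standing in for the real tendulkar.csv (the data below is invented for illustration):

```python
import pandas as pd

# Toy stand-in for the tendulkar data (invented values)
tendulkar = pd.DataFrame({
    "Runs": ["15", "DNB", "59*", "TDNB", "8", "absent", None],
    "BF":   [24,    None,  172,   None,   10,  None,    35],
})

# 1-3. Remove rows with 'DNB', 'TDNB' and 'absent'
tendulkar = tendulkar[~tendulkar["Runs"].isin(["DNB", "TDNB", "absent"])].copy()
# 4. Remove the '*' indicating not out
tendulkar["Runs"] = tendulkar["Runs"].str.replace("*", "", regex=False)
# 5. Drop incomplete rows (NaN)
tendulkar = tendulkar.dropna()
tendulkar["Runs"] = tendulkar["Runs"].astype(int)

print(tendulkar["Runs"].tolist())  # → [15, 59, 8]
```

The scatter plot (step 6) would then just be plt.scatter on the cleaned columns.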

To chain a set of operations in R we need a package like dplyr; Pandas supports chaining natively. The following operations are performed on the tendulkar data frame with dplyr in R and with Pandas in Python:

- Group by ground
- Compute average runs in each ground
- Arrange in descending order
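That chain can be sketched in pandas as follows, on invented toy data rather than the real tendulkar frame:

```python
import pandas as pd

# Toy data standing in for the tendulkar frame (invented values)
df = pd.DataFrame({
    "Ground": ["Sydney", "Sydney", "Chennai", "Chennai", "Perth"],
    "Runs":   [148, 241, 155, 81, 5],
})

# group by ground, compute average runs, arrange in descending order
avg = (df.groupby("Ground")["Runs"]
         .mean()
         .sort_values(ascending=False))

print(avg.tolist())  # → [194.5, 118.0, 5.0]
```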

Personally, I took to R much like a ‘duck takes to water’. I found the R syntax very simple and mostly intuitive. R packages like dplyr, ggplot2 and reshape2 make the language quite irresistible. R is weakly typed, with a small set of atomic types, as opposed to the full-fledged type system of Python.

Python has many bells and whistles, which can be a little bewildering to the novice, though they may prove useful as one becomes more experienced with the language. I also found that installing Python packages sometimes gives errors with Python versions 2.7 or 3.6, leaving you scrambling to Google for fixes, which can be quite frustrating. R, on the other hand, makes installing packages a breeze.

Anyway, this is my current opinion, and like all opinions, may change in the course of time. Let’s see!

I may write a follow up post with more advanced features of R and Python. So do keep checking! Long live R! Viva la Python!

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Giga thoughts …**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Bagging, the perfect solution for model instability

(This article was first published on ** R – insightR**, and kindly contributed to R-bloggers)

**By Gabriel Vasconcelos**

Motivation

The name bagging comes from bootstrap aggregating. It is a machine learning technique proposed by Breiman (1996) to increase stability in potentially unstable estimators. For example, suppose you want to run a regression with a few variables in two steps. First, you run the regression with all the variables in your data and select the significant ones. Second, you run a new regression using only the selected variables and compute the predictions.

This procedure is not wrong if your problem is forecasting. However, this two-step estimation may result in highly unstable models. If many variables are important but individually their importance is small, you will probably leave some of them out, and small perturbations of the data may drastically change the results.

Bagging

Bagging solves the instability problem by giving each variable several chances to be in the model. The same procedure described in the previous section is repeated in many bootstrap samples, and the final forecast is the average of the forecasts across all these samples.

In this post I will show an example of Bagging with regression in the variable-selection context described above. But you should keep in mind that this technique can be used for any unstable model. The most successful case is using Bagging on regression trees, resulting in the famous random forest. Bagging is also very robust to overfitting and produces forecasts with less variance.
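To make the resample-and-average mechanics concrete before the R application, here is a deliberately tiny Python sketch. The real example below uses R's boot package; here the "model" is just the sample mean, a stand-in that illustrates only the mechanics of bagging, not its variance-reduction benefit (which needs an unstable estimator):

```python
import random
import statistics

def bagged_estimate(data, n_boot=200, seed=123):
    """Bagging: fit a simple 'model' on many bootstrap samples, average the results."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # bootstrap sample: draw len(data) points with replacement
        sample = rng.choices(data, k=len(data))
        # "fit" the model on this sample (here: the sample mean)
        estimates.append(statistics.mean(sample))
    # aggregate: average the per-sample estimates
    return statistics.mean(estimates)

data = [2.0, 4.0, 6.0, 8.0, 10.0]
est = bagged_estimate(data)
# the bagged estimate sits close to the plain sample mean (6.0)
print(round(est, 2))
```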

Application

I am going to show an application of Bagging from scratch, using the boot package for the bootstrap replications. We start by generating some data with the following DGP (data generating process):

The first equation shows how the variable we want to predict, y, is generated. The second equation shows that the independent variables are generated by a common factor, which will induce a correlation structure. Since we need instability, the data will be generated to have an R2 of around 0.3 and to have many variables with small individual importance.

```
N = 300 # = Number of obs. = #
K = 40 # = Number of Variables = #
set.seed(123) # = Seed for replication = #
f = rnorm(N) # = Common factor = #
A = runif(K,-1,1) # = Coefficients from the second equation = #
X = t(A%*%t(f)) + matrix(rnorm(N*K), N, K) # = Generate xij = #
beta = c((-1)^(1:(K-10)), rep(0, 10)) # = Coefficients from the first equation: the first 30 are 1 or -1, the last 10 are 0 = #

# = R2 setup = #
aux = var(X%*%beta)
erro = rnorm(N, 0, sqrt(2*aux)) # = The variance of the error will be twice the variance of X%*%beta = #
y = X %*% beta + erro # = Generate y = #
1-var(erro)/var(y) # = R2 = #
##           [,1]
## [1,] 0.2925866

# = Break data into in-sample and out-of-sample = #
y.in = y[1:(N-100)]
X.in = X[1:(N-100),]
y.out = y[-c(1:(N-100))]
X.out = X[-c(1:(N-100)),]
```

Now we must define a function to be called by the bootstrap. This function must receive the data and an index argument that tells R where the sampling must be made. The function runs a linear regression with all the variables, selects those with t-statistics bigger than 1.96, runs a new regression with only the selected variables, and stores the coefficients.

```
bagg = function(data, ind){
  sample = data[ind,]
  y = sample[, 1]
  X = sample[, 2:ncol(data)]
  model1 = lm(y ~ X)
  t.stat = summary(model1)$coefficients[-1, 3]
  selected = which(abs(t.stat) >= 1.96)
  model2 = lm(y ~ X[, selected])
  coefficients = coef(model2)
  cons = coefficients[1]
  betas = coefficients[-1]
  aux = rep(0, ncol(X))
  aux[selected] = betas
  res = c(cons, aux)
  return(res)
}
```

Now we are ready to use the boot function to run the Bagging. The code below does precisely that with 500 bootstrap replications. It also runs the simple OLS procedure for comparison. If you type selected after running this code you will see that many variables were left out and that some variables whose true coefficients are zero were selected.

```
library(boot)

# = Bagging = #
bagging = boot(data = cbind(y.in, X.in), statistic = bagg, R = 500)
bagg.coef = bagging$t
y.pred.bagg = rowMeans(cbind(1, X.out)%*%t(bagg.coef))

# = OLS = #
ols0 = lm(y.in ~ X.in)
selected = which(abs(summary(ols0)$coefficients[-1, 3]) >= 1.96)
ols1 = lm(y.in ~ X.in[, selected])
y.pred = cbind(1, X.out[, selected])%*%coef(ols1)

# = Forecasting RMSE = #
sqrt(mean((y.pred-y.out)^2)) # = OLS = #
## [1] 10.43338
sqrt(mean((y.pred.bagg-y.out)^2)) # = Bagging = #
## [1] 9.725554
```

The results show that Bagging has a smaller RMSE than OLS. It is also interesting to see what happens to the coefficients. The plot below shows that bagging performs some shrinkage on the coefficients towards zero.

```
barplot(coef(ols0)[-1])
barplot(colMeans(bagg.coef)[-1], add=TRUE, col="red")
```

Finally, what would happen if we had no instability? If you run the same code again with a much higher R2 you will find the plot below. To raise the R2, replace the 2 in erro = rnorm(N, 0, sqrt(2*aux)) in the DGP with a much smaller value. As you can see, bagging and the simple procedure give the same results in the absence of instability.

References

Breiman, Leo (1996). “Bagging predictors”. Machine Learning 24(2): 123–140.


### Connecting R to an Oracle Database

(This article was first published on ** R – Gerald Belton**, and kindly contributed to R-bloggers)

R is a very popular language for doing analytics, and particularly statistics, on your data. There are a number of R functions for reading in data, but most of them take a delimited text file (such as .CSV) for input. That’s great if your existing data is in a spreadsheet, but if you have large amounts of data, it’s probably stored in a relational database. If you work for a large company, chances are that it is an Oracle database.

The most efficient way to access an Oracle database from R is using the RODBC package, available from CRAN. If the RODBC package is not installed in your R environment, use the *install.packages("RODBC")* command to install it. ODBC stands for Open DataBase Connectivity, an open standard application programming interface (API) for databases. ODBC was created by the SQL Access Group and first released in September, 1992. Although Microsoft Windows was the first to provide an ODBC product, versions now exist for Linux and Macintosh platforms as well. ODBC is built in to current versions of Windows. If you are using a different operating system, you'll need to install an ODBC driver manager.
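For readers on Linux or macOS with a driver manager such as unixODBC, the DSN described in the next section lives in a text file instead of the ODBC Source Administrator. A hypothetical odbc.ini entry might look like the fragment below; the exact keys depend on the Oracle driver you install, so treat every name and value here as an assumption to adapt, not as a definitive configuration:

```
# /etc/odbc.ini -- hypothetical DSN definition for unixODBC
[OracleDSN]
Driver      = Oracle 12c ODBC driver
ServerName  = //dbhost.example.com:1521/ORCLPDB1
UserID      = myuser
```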

Before you can access a database from R, you’ll need to create a Data Source Name, or DSN. This is an alias to the database, which provides the connection details. In Windows, you create the DSN using the ODBC Source Administrator. This tool can be found in the Control Panel. In Windows 10, it’s under System and Security -> Administrative Tools -> ODBC Data Sources. Or you can just type “ODBC” in the search box. On my system, it looks like this:

As you can see, I already have a connection to an Oracle database. To set one up, click Add, and you’ll get this box:

Select the appropriate driver (in my case, Oracle in OraDB12Home1) and click the Finish button. A Driver Configuration box opens:

For “Data Source Name,” you can put in almost anything you want. This is the name you will use in R when you connect to the database.

The “Description” field is optional, and again, you can put in whatever you want.

TNS Service Name is the name that you (or your company data base administrator) assigned when configuring the Oracle database. And “User ID” is your ID that you use with the database.

After you fill in these fields, click the “Test Connection” button. Another box pops up, with the TNS Service Name and User ID already populated, and an empty field for your password. Enter your password and click “OK.” You should see a “Connection Successful” message. If not, check the Service Name, User ID, and Password.

Now you are ready to connect R to the database.

Here’s the R code that you need:

```
# Load RODBC package
library(RODBC)

# Create a connection to the database called "channel"
channel <- odbcConnect("DATABASE", uid="USERNAME", pwd="PASSWORD")

# Query the database and put the results into the data frame "dataframe"
dataframe <- sqlQuery(channel, "SELECT * FROM SCHEMA.DATATABLE")

# When finished, it's a good idea to close the connection
odbcClose(channel)
```

A couple of comments about this code are in order:

First, I don’t like the idea of having a password appear, unencrypted, in the R program. One possible solution is to prompt the user for the password before creating the connection:

```
pswd <- readline("Input Password: ")
channel <- odbcConnect("DATABASE", uid="USERNAME", pwd=pswd)
```

This will enable the connection to be made without compromising the security of the password.

Second, the sqlQuery will pass to Oracle whatever is inside the quotation marks. This is the workhorse function of the RODBC package. The term ‘query’ includes any valid SQL statement including table creation, updates, etc, as well as ‘SELECT’s.

Finally, I should mention that R works with data that is loaded into the computer’s memory. If you try to load a really huge database into memory all at once, it will a) take a very long time, and b) possibly fail due to exceeding your computer’s memory capacity. Of course, relational database systems like Oracle are the natural habitat of very large data sets, so that may be your motivation for connecting R to Oracle in the first place. Carefully constructed SQL Queries will let Oracle do the work of managing the data, and return just the data that R needs for performing analytics.

Writing SQL Queries is beyond the scope of this blog post. If you need help with that, there are plenty of free tutorials on the web, or you might find this book helpful: Oracle 12c for Dummies


### Sunflowers for COLOURlovers

(This article was first published on ** R – Fronkonstin**, and kindly contributed to R-bloggers)

To walk, really walk, I always walked on top of the clouds (from “Del tiempo perdido”, by Robe)

If you give importance to colours, maybe you know already COLOURlovers. As can be read in their website, *COLOURlovers is a creative community where people from around the world create and share colors, palettes and patterns, discuss the latest trends and explore colorful articles… All in the spirit of love*.

There is a R package called colourlovers which provides access to the COLOURlovers API. It makes very easy to choose nice colours for your graphics. I used clpalettes function to search for the top palettes of the website. Their names are pretty suggestive as well: Giant Goldfish, Thought Provoking, Adrift in Dreams, let them eat cake … Inspired by this post I have done a Shiny app to create colored flowers using that palettes. Seeds are arranged according to the golden angle. One example:

Some others:

You can play with the app here.

If you want to do your own sunflowers, here you have the code. This is the ui.R file:

library(colourlovers) library(rlist) top=clpalettes('top') sapply(1:length(top), function(x) list.extract(top, x)$title)->titles fluidPage( titlePanel("Sunflowers for COLOURlovers"), fluidRow( column(3, wellPanel( selectInput("pal", label = "Palette:", choices = titles), sliderInput("nob", label = "Number of points:", min = 200, max = 500, value = 400, step = 50) ) ), mainPanel( plotOutput("Flower") ) ) )And this is the server.R one:

library(shiny) library(ggplot2) library(colourlovers) library(rlist) library(dplyr) top=clpalettes('top') sapply(1:length(top), function(x) list.extract(top, x)$title)->titles CreatePlot = function (ang=pi*(3-sqrt(5)), nob=150, siz=15, sha=21, pal="LoversInJapan") { list.extract(top, which(titles==pal))$colors %>% unlist %>% as.vector() %>% paste0("#", .) -> all_colors colors=data.frame(hex=all_colors, darkness=colSums(col2rgb(all_colors))) colors %>% arrange(-darkness)->colors background=colors[1,"hex"] %>% as.character colors %>% filter(hex!=background) %>% .[,1] %>% as.vector()->colors ggplot(data.frame(r=sqrt(1:nob), t=(1:nob)*ang*pi/180), aes(x=r*cos(t), y=r*sin(t)))+ geom_point(colour=sample(colors, nob, replace=TRUE, prob=exp(1:length(colors))), aes(size=(nob-r)), shape=16)+ scale_x_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+ scale_y_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+ theme(legend.position="none", panel.background = element_rect(fill=background), panel.grid=element_blank(), axis.ticks=element_blank(), axis.title=element_blank(), axis.text=element_blank())} function(input, output) { output$Flower=renderPlot({ CreatePlot(ang=180*(3-sqrt(5)), nob=input$nob, siz=input$siz, sha=as.numeric(input$sha), pal=input$pal) }, height = 550, width = 550 )}
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Fronkonstin**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Using R: When using do in dplyr, don’t forget the dot

There will be a few posts about switching from plyr/reshape2 for data wrangling to the more contemporary dplyr/tidyr.

My most common use of plyr looked something like this: we take a data frame, split it by some column(s), and use an anonymous function to do something useful. The function takes a data frame and returns another data frame, both of which could very possibly have only one row. (If, in fact, it has to have only one row, I’d suggest an assert_that() call as the first line of the function.)

library(plyr) results <- ddply(some_data, "key", function(x) { ## do something; return data.frame() })Or maybe, if I felt serious and thought the function would ever be used again, I’d write:

calculate <- function(x) { ## do something; return data.frame() } result <- ddply(some_data, "key", calculate)Rinse and repeat over and over again. For me, discovering ddply was like discovering vectorization, but for data frames. Vectorization lets you think of operations on vectors, without having to think about their elements. ddply lets you think about operations on data frames, without having to think about rows and columns. It saves a lot of thinking.

The dplyr equivalent would be do(). It looks like this:

library(dplyr) grouped <- group_by(some_data, key) result <- do(grouped, calculate(.))Or once again with magrittr:

library(magrittr) some_data %>% group_by(key) %>% do(calculate(.)) -> result(Yes, I used the assignment arrow from the left hand side to the right hand side. Roll your eyes all you want. I think it’s in keeping with the magrittr theme of reading from left to right.)

One important thing here, which got me at first: There has to be a dot! Just passing the function name, as one would have done with ddply, will not work:

```
grouped <- group_by(some_data, key)
## will not work: Error: Results are not data frames at positions ...
try(result <- do(grouped, calculate))
```

Don’t forget the dot!

Posted in: computer stuff. Tagged: dplyr, plyr, R

### Shiny Applications Layouts Exercises (Part-9)

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

**Shiny Application Layouts – Update Input**

In the ninth part of our series we will see the updated input scenario. This is different from the Dynamic UI example we met in part 8, where the UI component is generated on the server and sent to the UI, where it replaces an existing UI component.

This part can be useful for you in two ways.

First of all, you can see different ways to enhance the appearance and the utility of your shiny app.

Secondly, you can revise what you learnt in the “Building Shiny App” series, as we will build basic Shiny stuff in order to present it in the proper way.

Read the examples below to understand the logic of what we are going to do, and then test your skills with the exercise set we prepared for you. Let’s begin!

Answers to the exercises are available here.

**UI Context**

As always let us create the skeleton of the app.

```
#ui.r
fluidPage(
  titlePanel("Title"),
  fluidRow(
    column(3,
      wellPanel()),
    column(3,
      wellPanel()),
    column(3,
      wellPanel())
  )
)

#server.R
function(input, output, clientData, session) {}
```

**Exercise 1**

Build the initial UI of your app. Give it a title, and use one row of columns, one for each of the three wellPanels that we are going to use.

**Applying reactivity**

To create a reactive context for our app we will use the observe({}) function in our server side.

**Exercise 2**

Create reactive context. **HINT**: Use observe({}).

**“Master” Inputs.**

Now we are going to create the two inputs that control the rest of the inputs of the app. Look at the example:

#ui.r

textInput("text_input",

"labels:",

"Some Text"),

sliderInput("slider_input",

"values:",

min = 1, max = 100, value = 50)

#server.r

t_input <- input$text_input

s_input <- input$slider_input

**Exercise 3**

Create a text Input in the ui side that controls labels and then pass it to a new variable on the server side.

**Learn more** about Shiny in the online course R Shiny Interactive Web Apps – Next Level Data Visualization. In this course you will learn how to create advanced Shiny web apps; embed video, pdfs and images; add focus and zooming tools; and many other functionalities (30 lectures, 3hrs.).

**Exercise 4**

Create a slider Input in the ui side that controls values and then pass it to a new variable on the server side.

**Dependent Inputs**

Firstly let’s create a text input that changes both the label and the text.

#ui.r

textInput("Text", "Text input:", value = "text")

#server.r

updateTextInput(session, "Text",

label = paste("Sth", t_input),

value = paste("Sth", s_input)

)

**Exercise 5**

Create a text input in the second wellPanel that changes its label and its value according to the two “Master” Inputs you created before. **HINT**: Use updateTextInput().

**Exercise 6**

Create a slider input in the second wellPanel that changes its label and its value according to the two “Master” Inputs you created before. **HINT**: Use updateSliderInput().

**Exercise 7**

Create a numeric input in the second wellPanel that changes its label and its value according to the two “Master” Inputs you created before. **HINT**: Use updateNumericInput().

**Exercise 8**

Create a date input in the second wellPanel that changes its label and its value according to the two “Master” Inputs you created before. **HINT**: Use updateDateInput().

In order to create a Checkbox group with the same conditions as the rest of the inputs we just created we should first create a list of options like in the example below:

```
#server.r
options <- list()

options[[paste("option", s_input, "A")]] <-
  paste0("option-", s_input, "-A")

options[[paste("option", s_input, "B")]] <-
  paste0("option-", s_input, "-B")
```

**Exercise 9**

Create a list with three choices for your Checkbox Group. **HINT**: Use list().

**Exercise 10**

Create a checkbox group input in the third wellPanel that changes its label and its value according to the two “Master” Inputs you created before. **HINT**: Use updateCheckboxGroupInput().

- Building Shiny App exercises part 4
- Building Shiny App exercises part 1
- Building Shiny App exercises part 3
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory


### New series: R and big data (concentrating on Spark and sparklyr)

(This article was first published on ** R – Win-Vector Blog**, and kindly contributed to R-bloggers)

Win-Vector LLC has recently been teaching how to use R with big data through Spark and sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big-data using R, share some hints, and even admit to some of the warts found in this combination of systems.

The ability to perform sophisticated analyses and modeling on “big data” with R is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).

The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using Spark through R and are starting a new series of articles tagged “R and big data” to help you produce production quality solutions quickly.

Please read on for a brief description of our new articles series: “R and big data.”

Background

R is a best-of-breed in-memory analytics platform. R allows the analyst to write programs that operate over their data and bring in a huge suite of powerful statistical techniques and machine learning procedures. Spark is an analytics platform designed to operate over big data that exposes some of its own statistical and machine learning capabilities. R can now be operated “over Spark“. That is: R programs can delegate tasks to Spark clusters and issue commands to Spark clusters. In some cases the syntax for operating over Spark is deliberately identical to working over data stored in R.

Why R and SparkThe advantages are:

- Spark can work at a scale and speed far larger than native R. The ability to send work to Spark increases R’s capabilities.
- R has machine learning and statistical capabilities that go far beyond what is available on Spark or any other “big data” system (many of which are descended from report generation or basic analytics). The ability to use specialized R methods on data samples yields additional capabilities.
- R and Spark can share code and data.

The R/Spark combination is not the only show in town; but it is a powerful capability that may not be safe to ignore. We will also talk about additional tools that can be brought into the mix, such as the powerful large-scale machine learning capabilities from h2o.

The warts

Frankly, a lot of this is very new and still on the “bleeding edge.” Spark 2.x has only been available in stable form since July 26, 2016 (just under a year). Spark 2.x is much more capable than the Spark 1.x series in terms of both data manipulation and machine learning, so we strongly suggest clients insist on Spark 2.x clusters from their infrastructure vendors (such as Cloudera, Hortonworks, MapR, and others), despite Spark 2.x having become available in these packaged solutions only recently. The sparklyr adapter itself was first available on CRAN only as of September 24th, 2016. And SparkR only started shipping with Spark as of Spark 1.4, in June 2015.

While R/Spark is indeed a powerful combination, nobody seems to be sharing a lot of production experience and best practices with it yet.

Some of the problems are sins of optimism. A lot of people still confuse successfully standing up a cluster with effectively using it. Other people confuse the statistical procedures available in in-memory R (which are very broad and often quite mature) with those available in Spark (which are less numerous and less mature).

Our goal

What we want to do with the “R and big data” series is:

- Give a taste of some of the power of the R/Spark combination.
- Share a “capabilities and readiness” checklist you should apply when evaluating infrastructure.
- Start to publicly document R/Spark best practices.
- Describe some of the warts and how to work around them.
- Share fun tricks and techniques that make working with R/Spark much easier and more effective.

Our next article in this series will be up soon and will discuss the nature of data-handles in Sparklyr (one of the R/Spark interfaces) and how to manage your data inventory neatly.


### Sankey charts for swinging voters

(This article was first published on ** Peter's stats stuff - R**, and kindly contributed to R-bloggers)

Continuing my examination of the individual level voting behaviour from the New Zealand Election Study, I wanted to look at the way individuals swap between parties, and between “did not vote” and a party, from one election to another. How much and how this happens is obviously an important question for both political scientists and for politicians.

Vote transition visualisations

I chose a Sankey chart as a way of showing the transition matrix from self-reported party vote in the 2011 election to the 2014 election. Here’s a static version:

And here is the more screen-friendly interactive version, with mouseover tooltips to give actual estimates:

The point with these graphics is to highlight the transitions. For example, what were the implications of turnout being higher in 2014 than 2011 (77.9% of enrolled voters in 2014 compared to 74.2% in 2011)? Judging from this survey data, the National Party gained 6.6% of the enrolled population in 2014 by converting them from a 2011 “did not vote” and lost only 3.6% in the other direction. This net gain of three percentage points was enough to win the election for the National-led coalition. In contrast, the Labour party had a net gain from “did not vote” in 2011 of only 0.2 percentage points. Remember though that these are survey-based estimates, and subject to statistical error.
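Underlying any Sankey chart of this kind is a transition matrix: counts of (2011 vote, 2014 vote) pairs converted to shares of the electorate. As a minimal sketch of that tabulation, with a handful of invented respondents standing in for the weighted survey data:

```python
# Building a toy vote-transition matrix with the standard library.
# The respondent data below is invented for illustration only.
from collections import Counter

respondents = [
    ("National", "National"), ("Labour", "Labour"),
    ("Did not vote", "National"), ("National", "Did not vote"),
    ("Did not vote", "National"), ("Labour", "Green"),
]

transitions = Counter(respondents)  # count each (2011, 2014) pair
total = sum(transitions.values())

# share of the (toy) electorate making each 2011 -> 2014 move
shares = {pair: round(100 * n / total, 1) for pair, n in transitions.items()}
print(shares[("Did not vote", "National")])  # → 33.3
```

In the real analysis each respondent would contribute their survey weight rather than a count of one.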

I find setting up and polishing Sankey charts – controlling colours for example – a bit of a pain, so the code at the bottom of this post on how this was done might be of interest.

Weighting, missing data, population and age complications

Those visualisations have a few hidden fishhooks, which careful readers would find if they compare the percentages in the tooltips of the interactive version with percentages reported by the New Zealand Electoral Commission.

- The 2014 percentages are proportions of the *enrolled* population. As the 2014 turnout of enrolled voters was 77.9%, the numbers here are noticeably less than the usually cited percentages which were used to translate into seat counts (for example, the National Party’s reported party vote of 47.0% of votes becomes 36.6% of enrolled voters).
- The 2011 percentages are even harder to explain, because I’ve chosen not only to scale the party vote and “did not vote” to the 2011 enrolled population as reported by the Commission, but also to add in around 5% of the *2014* population that were too young to vote in 2011.

Two things I would have liked to have taken into account but wasn’t able to were:

- The “leakage” from the 2011 election of people who were deceased or had left the country by 2014
- Explicit recognition of people who voted in 2014 but not in 2011 because they were out of the country. There is a variable in the survey that picks up the year the respondent came to live in New Zealand if not born here, but for only 10 respondents was this 2012 or later (in contrast to age – there were 58 respondents aged 20 or less in 2014).

I re-weighted the survey so the 2014 and 2011 reported party votes matched the known totals (with the addition of people aged 15 to 17 in 2011). One doesn’t normally re-weight a survey based on answers provided by the respondents, but in this case I think it makes perfect sense to calibrate to the public totals. The biggest impact is that for both years, but particularly 2011, relying on the respondents’ self-report and the published weighting of the NZES, totals for “did not vote” are materially understated, compared to results when calibrated to the known totals.
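The post itself uses `survey::calibrate` to hit the 2011 and 2014 vote margins simultaneously (see the code below). The underlying idea, for the simplest one-margin case, can be sketched in base R as post-stratification; all numbers here are made up for illustration:

```r
# Toy illustration of calibrating survey weights to known totals:
# rescale each respondent's weight so that the weighted total of each
# self-reported vote category matches the official total for that category.
svy <- data.frame(vote = c("A", "A", "B", "B", "B", "Did not vote"),
                  w0   = rep(100, 6))                  # original survey weights
known_totals <- c(A = 180, B = 270, `Did not vote` = 150)  # "official" results

survey_totals <- tapply(svy$w0, svy$vote, sum)         # weighted totals pre-calibration
svy$w1 <- svy$w0 * known_totals[as.character(svy$vote)] /
                   survey_totals[as.character(svy$vote)]

# Weighted totals now match the official figures exactly:
tapply(svy$w1, svy$vote, sum)
```

Real calibration (as in `survey::calibrate`) generalises this to several margins at once, which is what calibrating to both the 2011 and 2014 results requires.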

When party vote in 2011 had been forgotten or was an NA, and this wasn’t explained by being too young in 2011, I used multiple imputation based on a subset of relevant variables to give five instances of probable party vote to each such respondent.

Taken together, all this gives the visualisations a perspective based in 2014. It is better to think of it as “where did the 2014 voters come from” than “where did the 2011 voters go”. This is fairly natural when we consider it is the 2014 New Zealand Election Study, but is worth keeping in mind in interpretation.

Age (and hence the impact of new young voters coming in, and of older voters passing on) is important in voting behaviour, as even the most casual observation of politics shows. In New Zealand, the age distribution of party voters in 2014 is seen in the chart below:

Non-voters, Green voters and to a certain extent Labour voters are young; New Zealand First voters are older. If this interests you though, I suggest you look at the multivariate analysis in this blog post or, probably more fun, this fancy interactive web app which lets you play with the predicted probabilities of voting based on a combination of demographic and socio-economic status variables.

Code

Here's the R code that did that imputation, weighting, and the graphics:

```r
library(tidyverse)
library(forcats)
library(riverplot)
library(sankeyD3)
library(foreign)
library(testthat)
library(mice)
library(grid)
library(survey)

# Caution: networkD3 has its own sankeyNetwork with fewer features. Make sure
# you know which one you're using! Don't load up networkD3 in the same session.

#============import data======================
# See previous blog posts for where this comes from:
unzip("D:/Downloads/NZES2014GeneralReleaseApril16.sav.zip")
nzes_orig <- read.spss("NZES2014GeneralReleaseApril16.sav",
                       to.data.frame = TRUE, trim.factor.names = TRUE)

# Five versions of each row, so when we do imputation later on we are
# in effect doing multiple imputation:
nzes <- nzes_orig[rep(1:nrow(nzes_orig), each = 5), ] %>%
   # lump up party vote in 2014:
   mutate(partyvote2014 = ifelse(is.na(dpartyvote), "Did not vote", as.character(dpartyvote)),
          partyvote2014 = gsub("M.ori", "Maori", partyvote2014),
          partyvote2014 = gsub("net.Man", "net-Man", partyvote2014),
          partyvote2014 = fct_infreq(partyvote2014)) %>%
   mutate(partyvote2014 = fct_lump(partyvote2014, 5))

# party vote in 2011, and a smaller set of columns to do the imputation:
nzes2 <- nzes %>%
   mutate(partyvote2011 = as.factor(ifelse(dlastpvote == "Don't know", NA,
                                           as.character(dlastpvote)))) %>%
   # select a smaller number of variables so imputation is feasible:
   select(dwtfin, partyvote2014, partyvote2011, dethnicity_m, dage, dhhincome,
          dtradeunion, dprofassoc, dhousing, dclass, dedcons, dwkpt)

# impute the missing values. Although we are imputing only a single set of values,
# because we are doing it with each row repeated five times this has the same impact,
# for the simple analysis we do later on, as multiple imputation:
nzes3 <- complete(mice(nzes2, m = 1))

# Lump up the 2011 vote, tidy labels, and work out who was too young to vote:
nzes4 <- nzes3 %>%
   mutate(partyvote2011 = fct_lump(partyvote2011, 5),
          partyvote2011 = ifelse(grepl("I did not vote", partyvote2011),
                                 "Did not vote", as.character(partyvote2011)),
          partyvote2011 = ifelse(dage < 21, "Too young to vote", partyvote2011),
          partyvote2014 = as.character(partyvote2014))

#===============re-weighting to match actual votes in 2011 and 2014===================
# This makes a big difference, for both years, but more for 2011. "Did not vote" is
# only 16% in 2011 if we only calibrate to 2014 voting outcomes, but in reality was
# 25.8%. We calibrate for both 2011 and 2014 so we get better proportions for both
# elections.

#------------2014 actual outcomes----------------
# http://www.elections.org.nz/news-media/new-zealand-2014-general-election-official-results
actual_vote_2014 <- data_frame(
   partyvote2014 = c("National", "Labour", "Green", "NZ First", "Other", "Did not vote"),
   Freq = c(1131501, 604534, 257356, 208300,
            31850 + 16689 + 5286 + 95598 + 34095 + 10961 + 5113 + 1730 + 1096 + 872 + 639,
            NA)
)

# calculate the did not vote, from the 77.9 percent turnout
actual_vote_2014[6, 2] <- (100 / 77.9 - 1) * sum(actual_vote_2014[1:5, 2])

# check I did the turnout sums right:
expect_equal(0.779 * sum(actual_vote_2014[ , 2]), sum(actual_vote_2014[1:5, 2]))

#---------------2011 actual outcomes---------------------
# http://www.elections.org.nz/events/past-events-0/2011-general-election/2011-general-election-official-results
actual_vote_2011 <- data_frame(
   partyvote2011 = c("National", "Labour", "Green", "NZ First", "Other",
                     "Did not vote", "Too young to vote"),
   Freq = c(1058636, 614937, 247372, 147511,
            31982 + 24168 + 23889 + 13443 + 59237 + 11738 + 1714 + 1595 + 1209,
            NA, NA)
)

# calculate the did not vote, from the 74.21 percent turnout at
# http://www.elections.org.nz/events/past-events/2011-general-election
actual_vote_2011[6, 2] <- (100 / 74.21 - 1) * sum(actual_vote_2011[1:5, 2])

# check I did the turnout sums right:
expect_equal(0.7421 * sum(actual_vote_2011[1:6, 2]), sum(actual_vote_2011[1:5, 2]))

# from the below, we conclude 4.8% of the 2014 population (as estimated by NZES)
# were too young to vote in 2011:
nzes_orig %>%
   mutate(tooyoung = dage < 21) %>%
   group_by(tooyoung) %>%
   summarise(pop = sum(dwtfin), n = n()) %>%
   ungroup() %>%
   mutate(prop = pop / sum(pop))

# this is pretty approximate but will do for our purposes.
actual_vote_2011[7, 2] <- 0.048 * sum(actual_vote_2011[1:6, 2])

# Force the 2011 numbers to match the 2014 population
actual_vote_2011$Freq <- actual_vote_2011$Freq / sum(actual_vote_2011$Freq) *
   sum(actual_vote_2014$Freq)

#------------create new weights--------------------
# set up survey design with the original weights:
nzes_svy <- svydesign(~1, data = nzes4, weights = ~dwtfin)

# calibrate weights to the known total party votes in 2011 and 2014:
nzes_cal <- calibrate(nzes_svy,
                      list(~partyvote2014, ~partyvote2011),
                      list(actual_vote_2014, actual_vote_2011))

# extract those weights for use in straight R (not just "survey")
nzes5 <- nzes4 %>%
   mutate(weight = weights(nzes_cal))

# See impact of calibrated weights for the 2014 vote:
prop.table(svytable(~partyvote2014, nzes_svy))  # original weights
prop.table(svytable(~partyvote2014, nzes_cal))  # recalibrated weights

# See impact of calibrated weights for the 2011 vote:
prop.table(svytable(~partyvote2011, nzes_svy))  # original weights
prop.table(svytable(~partyvote2011, nzes_cal))  # recalibrated weights

#=======================previous years vote riverplot version============
the_data <- nzes5 %>%
   mutate(
      # add two spaces to ensure the partyvote2011 does not have any
      # names that exactly match the party vote in 2014
      partyvote2011 = paste(partyvote2011, "  "),
      partyvote2011 = gsub("M.ori", "Maori", partyvote2011)) %>%
   group_by(partyvote2011, partyvote2014) %>%
   summarise(vote_prop = sum(weight)) %>%
   ungroup()

# change names to the names wanted by makeRiver
names(the_data) <- c("col1", "col2", "Value")

# node IDs need to be characters I think
c1 <- unique(the_data$col1)
c2 <- unique(the_data$col2)
nodes_ref <- data_frame(fullname = c(c1, c2)) %>%
   mutate(position = rep(c(1, 2), times = c(length(c1), length(c2)))) %>%
   mutate(ID = LETTERS[1:n()])

edges <- the_data %>%
   left_join(nodes_ref[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>%
   rename(N1 = ID) %>%
   left_join(nodes_ref[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>%
   rename(N2 = ID) %>%
   as.data.frame(stringsAsFactors = FALSE)

rp <- makeRiver(nodes = as.vector(nodes_ref$ID), edges = edges,
                node_labels = nodes_ref$fullname,
                # manual vertical positioning by parties. Look at
                # nodes_ref to see the order in which positions are set.
                # This is a pain, so I just let it stay with the defaults:
                # node_ypos = c(4, 1, 1.8, 3, 6, 7, 5, 4, 1, 1.8, 3, 6, 7),
                node_xpos = nodes_ref$position,
                # set party colours; all based on those in nzelect::parties_v:
                node_styles = list(C = list(col = "#d82a20"),  # red labour
                                   D = list(col = "#00529F"),  # blue national
                                   E = list(col = "black"),    # black NZ First
                                   B = list(col = "#098137"),  # green
                                   J = list(col = "#d82a20"),  # labour
                                   I = list(col = "#098137"),  # green
                                   K = list(col = "#00529F"),  # national
                                   L = list(col = "black")))   # NZ First

# Turn the text horizontal, and pale grey
ds <- default.style()
ds$srt <- 0
ds$textcol <- "grey95"

mygp <- gpar(col = "grey75")

# using PNG rather than SVG as vertical lines appear in the SVG version
par(bg = "grey40")
# note the plot_area argument - for some reason, defaults to using only half the
# vertical space available, so set this to higher than 0.5!:
plot(rp, default_style = ds, plot_area = 0.9)
title(main = "Self-reported party vote in 2011 compared to 2014",
      col.main = "white", font.main = 1)
grid.text(x = 0.15, y = 0.1, label = "2011 party vote", gp = mygp)
grid.text(x = 0.85, y = 0.1, label = "2014 party vote", gp = mygp)
grid.text(x = 0.95, y = 0.03, gp = gpar(fontsize = 7, col = "grey75"), just = "right",
          label = "Source: New Zealand Election Study, analysed at https://ellisp.github.io")

#=======================sankeyD3 version====================
nodes_ref2 <- nodes_ref %>%
   mutate(ID = as.numeric(as.factor(ID)) - 1) %>%
   as.data.frame()

edges2 <- the_data %>%
   ungroup() %>%
   left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col1" = "fullname")) %>%
   rename(N1 = ID) %>%
   left_join(nodes_ref2[ , c("fullname", "ID")], by = c("col2" = "fullname")) %>%
   rename(N2 = ID) %>%
   as.data.frame(stringsAsFactors = FALSE) %>%
   mutate(Value = Value / sum(Value) * 100)

pal <- 'd3.scaleOrdinal()
          .range(["#DCDCDC", "#098137", "#d82a20", "#00529F", "#000000", "#DCDCDC",
                  "#DCDCDC", "#098137", "#d82a20", "#00529F", "#000000", "#DCDCDC"]);'

sankeyNetwork(Links = edges2, Nodes = nodes_ref2, Source = "N1", Target = "N2",
              Value = "Value", NodeID = "fullname", NodeGroup = "fullname",
              LinkGroup = "col2", fontSize = 12, nodeWidth = 30,
              colourScale = JS(pal), numberFormat = JS('d3.format(".1f")'),
              fontFamily = "Calibri", units = "%", nodeShadow = TRUE,
              showNodeValues = FALSE, width = 700, height = 500)

#=======other by the by analysis==================
# Age density plot by party vote.
# Remember to weight by the survey weights - in this case it controls for
# the under or over sampling by age in the original design.
nzes5 %>%
   ggplot(aes(x = dage, fill = partyvote2014, weight = weight / sum(nzes5$weight))) +
   geom_density(alpha = 0.3) +
   facet_wrap(~partyvote2014, scales = "free_y") +
   scale_fill_manual(values = parties_v) +
   theme(legend.position = "none") +
   labs(x = "Age at time of 2014 election",
        caption = "Source: New Zealand Election Study") +
   ggtitle("Age distribution by Party Vote in the 2014 New Zealand General Election")
```
To **leave a comment** for the author, please follow the link and comment on their blog: ** Peter's stats stuff - R**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### R Weekly Bulletin Vol – IX

This week's R bulletin will cover how to list files, extract file names, and create a folder using R.

We will also cover functions like the select function, the filter function, and the arrange function. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. Run the current chunk – Ctrl+Alt+C

2. Run the next chunk – Ctrl+Alt+N

3. Run the current function definition – Ctrl+Alt+F

To list files with a particular extension, one can use the pattern argument of the list.files function. For example, to list CSV files use the following syntax.

**Example: **

This will list all the csv files present in the current working directory. To list files in any other folder, you need to provide the folder path.
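The example block itself was lost in this copy of the post; a minimal sketch of what it presumably looked like (using a temporary directory so it runs anywhere, whereas the post simply uses the current working directory):

```r
# Create a sample CSV file so there is something to find:
dir <- tempdir()
write.csv(data.frame(x = 1), file.path(dir, "demo.csv"), row.names = FALSE)

# List all CSV files in that folder (drop the path argument to search
# the current working directory, as in the post):
list.files(path = dir, pattern = "\\.csv$")
```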

```r
list.files(path = "C:/Users/MyFolder", pattern = "\\.csv$")
```

The `$` at the end of the pattern anchors the match to the end of the string, and `\\.` ensures that only files whose names end with the literal extension .csv are matched.

Extracting a file name using the gsub function

When we download stock data from Google Finance, the file's name corresponds to the stock symbol. If we want to extract the stock symbol from the file name, we can do it using the gsub function. The function searches for a match to the pattern argument and replaces all the matches with the replacement value given in the replacement argument. The syntax for the function is:

```r
gsub(pattern, replacement, x)
```

where,

pattern – is a character string containing a regular expression to be matched in the given character vector.

replacement – a replacement for matched pattern.

x – is a character vector where matches are sought.

In the example given below, we extract the file names for files stored in the "Reading MFs" folder. We have downloaded the stock price data into the R working directory for two companies, namely MRF and PAGEIND Ltd.

**Example:**

```r
# temp holds the downloaded file names; the original example built it with
# list.files, e.g.:
temp <- list.files(pattern = "\\.csv$")
temp
## [1] "MRF.csv"     "PAGEIND.csv"

# strip the extension to leave the stock symbol:
gsub(".csv$", "", temp)
## [1] "MRF"     "PAGEIND"
```

Create a folder using R

One can create a folder via R with the help of the "dir.create" function. The function creates a folder with the name as specified in the last element of the path. Trailing path separators are discarded.

The syntax is given as:

```r
dir.create(path, showWarnings = FALSE, recursive = FALSE)
```

**Example: **

This will create a folder called “RCodes” in the D drive.
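The example code is missing from this copy of the post; it presumably looked like `dir.create("D:/RCodes")`. A portable sketch that runs on any platform:

```r
# The post's Windows example:
# dir.create("D:/RCodes")

# Portable equivalent, using a temporary location:
path <- file.path(tempdir(), "RCodes")
dir.create(path, showWarnings = FALSE)
dir.exists(path)  # TRUE
```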

Functions Demystified

select function

The select function comes from the dplyr package and can be used to select certain columns of a data frame which you need. Consider the data frame "df" given in the example.

**Example:**
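The example's data frame was not reproduced in this copy, so here is a minimal sketch with a hypothetical "df" (the column names are made up for illustration):

```r
library(dplyr)

# Hypothetical stand-in for the post's data frame "df":
df <- data.frame(Ticker = c("MRF", "PAGEIND"),
                 Open   = c(64000, 17000),
                 Close  = c(64250, 17150))

# Keep just the Ticker and Close columns:
select(df, Ticker, Close)
```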

The filter function comes from the dplyr package and is used to extract subsets of rows from a data frame. This function is similar to the subset function in R.

**Example: **
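The example block is missing here; a minimal sketch with hypothetical data showing filter alongside the base-R subset function the post compares it to:

```r
library(dplyr)

# Hypothetical data (not from the post):
df <- data.frame(Ticker = c("MRF", "PAGEIND", "INFY"),
                 Close  = c(64250, 17150, 950))

# Rows where Close exceeds 10000:
filter(df, Close > 10000)

# The base-R equivalent using subset():
subset(df, Close > 10000)
```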

The arrange function is part of the dplyr package, and is used to reorder rows of a data frame according to one of the columns. Columns can be arranged in descending order or ascending order by using the special desc() operator.

** Example:**
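Again the example code did not survive extraction; a minimal sketch with hypothetical data, including the desc() helper mentioned above:

```r
library(dplyr)

# Hypothetical data (not from the post):
df <- data.frame(Ticker = c("MRF", "PAGEIND", "INFY"),
                 Close  = c(64250, 17150, 950))

# Ascending order by Close:
arrange(df, Close)

# Descending order, using the desc() operator:
arrange(df, desc(Close))
```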

**Next Step**

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

Download the PDF Now!

### Which science is all around? #BillMeetScienceTwitter

(This article was first published on ** Maëlle**, and kindly contributed to R-bloggers)

I'll admit I didn't really know who Bill Nye was before yesterday. His name sounds a bit like Bill Nighy's, that's all I knew. But well, science is all around and quite often scientists on Twitter start interesting campaigns. Remember the #actuallylivingscientists campaign, whose animals I dedicated a blog post to? This time, the Twitter campaign is the #BillMeetScienceTwitter hashtag, with which scientists introduce themselves to the famous science TV host Bill Nye. Here is a nice article about the movement.

Since I like surfing on Twitter trends, I decided to download a few of these tweets and to use my own R interface to the Monkeylearn machine learning API, monkeylearn (part of the rOpenSci project!), to classify the tweets in the hope of finding the most represented science fields. So, which science is all around?

Getting the tweets

It might sound a bit like trolling by now, but if you want to get Twitter data, I recommend using rtweet because it's a good package and because it's going to replace twitteR, which you might know from other blogs.

I only keep tweets in English, and moreover original ones, i.e. not retweets.

```r
library("rtweet")
billmeet <- search_tweets(q = "#BillMeetScienceTwitter", n = 18000, type = "recent")
billmeet <- unique(billmeet)
billmeet <- dplyr::filter(billmeet, lang == "en")
billmeet <- dplyr::filter(billmeet, is_retweet == FALSE)
```

I've ended up with 2491 tweets.

Classifying the tweets

I've chosen to use this taxonomy classifier, which classifies text according to generic topics and had quite a few stars on the Monkeylearn website. I don't think it was trained on tweets, and it wasn't trained to classify science topics in particular, which is not optimal, but it had the merit of being readily available. I've still not started training my own algorithms, and anyway, if I did I'd start by creating a very crucial algorithm for determining animal fluffiness on pictures, not text mining stuff. This was a bit off topic; let's go back to science Twitter!

When I decided to use my own package I had forgotten that it takes charge of cutting the request vector into groups of 20 tweets, since the API only accepts 20 texts at a time. I thought I'd have to do that splitting myself, but no: since I did it once in the code of the package, I'll never need to write that code again. Great feeling! Look at how easy the code is after cleaning up the tweets a bit! One just needs to wait a bit before getting all the results.
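The splitting the package does internally can be sketched like this; this is not monkeylearn's actual code, just the idea of cutting a request vector into groups of at most 20 texts per API call:

```r
texts  <- paste("tweet", 1:45)

# One group per API request, each of at most 20 elements:
groups <- split(texts, ceiling(seq_along(texts) / 20))

lengths(groups)  # 20 20 5
```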

```r
output <- monkeylearn::monkeylearn_classify(request = billmeet$text,
                                            classifier_id = "cl_5icAVzKR")
str(output)

## Classes 'tbl_df', 'tbl' and 'data.frame': 4466 obs. of 4 variables:
##  $ category_id: int 64638 64640 64686 64696 64686 64687 64689 64692 64648 64600 ...
##  $ probability: num 0.207 0.739 0.292 0.784 0.521 0.565 0.796 0.453 0.301 0.605 ...
##  $ label      : chr "Computers & Internet" "Internet" "Humanities" "Religion & Spirituality" ...
##  $ text_md5   : chr "f7b28f45ea379b4ca6f34284ce0dc4b7" "f7b28f45ea379b4ca6f34284ce0dc4b7" "b95429d83df2cabb9cd701a562444f0b" "b95429d83df2cabb9cd701a562444f0b" ...
##  - attr(*, "headers")=Classes 'tbl_df', 'tbl' and 'data.frame': 0 obs. of 0 variables
```

In the output, the package creator decided not to store the whole text corresponding to each line but its MD5 digest. So to join the output back to the tweets, I first digest each tweet, simply copying the code from the package. After all, I wrote it. Maybe it was the only time I successfully used vapply in my whole life.

```r
billmeet <- dplyr::mutate(billmeet,
                          text_md5 = vapply(X = text,
                                            FUN = digest::digest,
                                            FUN.VALUE = character(1),
                                            USE.NAMES = FALSE,
                                            algo = "md5"))
billmeet <- dplyr::select(billmeet, text, text_md5)
output <- dplyr::left_join(output, billmeet, by = "text_md5")
```

Looking at this small sample, some things make sense, others make less sense, either because the classification isn't good or because the tweet looks like spam. Since my own field isn't text analysis, I'll consider myself happy with these results, but I'd of course be happy to read any better version of it.

As in my #first7jobs post, I'll make a very arbitrary decision and keep only the labels to which a probability higher than 0.5 was attributed.

```r
output <- dplyr::filter(output, probability > 0.5)
```

This covers about 45% of the original tweet sample. I can only hope it's a representative sample.

How many labels do I have by tweet?

```r
dplyr::group_by(output, text_md5) %>%
  dplyr::summarise(nlabels = n()) %>%
  dplyr::group_by(nlabels) %>%
  dplyr::summarise(n_tweets = n()) %>%
  knitr::kable()
```

| nlabels | n_tweets |
|--------:|---------:|
|       1 |     1415 |

Perfect, only one.

Looking at the results

I know I suck at finding good section titles… At least I like the title of the post, which is a reference to the song Bill Nighy, not Bill Nye, sings in Love Actually. My husband assumed that science Twitter has mostly biomedical stuff. Now, even if my results were to support this, note that it could just as well be because it's easier to classify biomedical tweets.

I’ll first show a few examples of tweets for given labels.

```r
dplyr::filter(output, label == "Chemistry") %>% head(n = 5) %>% knitr::kable()
```

| category_id | probability | label     | text_md5                         | text |
|------------:|------------:|-----------|----------------------------------|------|
| 64701 | 0.530 | Chemistry | e82fc920b07ea9d08850928218529ca9 | Hi @billnye I started off running BLAST for other ppl but now I have all the money I make them do my DNA extractions #BillMeetScienceTwitter |
| 64701 | 0.656 | Chemistry | d21ce4386512aae5458565fc2e36b686 | .@uw's biochemistry dept – home to Nobel Laureate Eddy Fischer & ZymoGenetics co founder Earl Davie… https://t.co/0nsZW3b3xu |
| 64701 | 0.552 | Chemistry | 1d5be9d1e169dfbe2453b6cbe07a4b34 | Yo @BillNye – I'm a chemist who plays w lasers & builds to study protein interactions w materials #BillMeetScienceTwitter |
| 64701 | 0.730 | Chemistry | 1b6a25fcb66deebf35246d7eeea34b1f | Meow @BillNye! I'm Zee and I study quantum physics and working on a Nobel prize. #BillMeetScienceTwitter https://t.co/oxAZO5Y6kI |
| 64701 | 0.873 | Chemistry | 701d8c53e3494961ee7f7146b28b9c8c | Hi @BillNye, I'm a organic chemist studying how molecules form materials like the liquid crystal shown below.… https://t.co/QNG2hSG8Fw |

```r
dplyr::filter(output, label == "Aquatic Mammals") %>% head(n = 5) %>% knitr::kable()
```

| category_id | probability | label           | text_md5                         | text |
|------------:|------------:|-----------------|----------------------------------|------|
| 64609 | 0.515 | Aquatic Mammals | f070a05b09d2ccc85b4b1650139b6cd0 | Hi Bill, I am Anusuya. I am a palaeo-biologist working at the University of Cape Town. @BillNye #BillMeetScienceTwitter |
| 64609 | 0.807 | Aquatic Mammals | bb06d18a1580c28c255e14e15a176a0f | Hi @BillNye! I worked with people at APL to show that California blue whales are nearly recovered #BillMeetScienceTwitter |
| 64609 | 0.748 | Aquatic Mammals | 1ca07aad8bc1abe54836df8dd1ff1a9d | Hi @BillNye! I'm researching marine ecological indicators to improve Arctic marine monitoring and management… https://t.co/pJv8Om4IeI |
| 64609 | 0.568 | Aquatic Mammals | a140320fcf948701cfc9e7b01309ef8b | More like as opposed to vaginitis in dolphins or chimpanzees or sharks #BillMeetScienceTwitter https://t.co/gFCQIASty1 |
| 64609 | 0.520 | Aquatic Mammals | 06d1e8423a7d928ea31fd6db3c5fee05 | Hi @BillNye I study visual function in ppl born w/o largest connection between brain hemispheres #callosalagenesis… https://t.co/WSz8xsP38R |

```r
dplyr::filter(output, label == "Internet") %>% head(n = 5) %>% knitr::kable()
```

| category_id | probability | label    | text_md5                         | text |
|------------:|------------:|----------|----------------------------------|------|
| 64640 | 0.739 | Internet | f7b28f45ea379b4ca6f34284ce0dc4b7 | @BillNye #AskBillNye @BillNye join me @AllendaleCFD. More details at https://t.co/nJPwWARSsa #BillMeetScienceTwitter |
| 64640 | 0.725 | Internet | b2b7843dc9fcd9cd959c828beb72182d | @120Stat you could also use #actuallivingscientist #womeninSTEM or #BillMeetScienceTwitter to spread the word about your survey as well |
| 64640 | 0.542 | Internet | a357e1216c5e366d7f9130c7124df316 | Thank you so much for the retweet, @BillNye! I'm excited for our next generation of science-lovers!… https://t.co/B3iz3KVCOQ |
| 64640 | 0.839 | Internet | 61712f61e877f3873b69fed01486d073 | @ParkerMolloy Hi @BillNye, Im an elem school admin who wants 2 bring in STEM/STEAM initiatives 2 get my students EX… https://t.co/VMLO3WKVRv |
| 64640 | 0.924 | Internet | 4c7f961acfa2cdd17c9af655c2e81684 | I just filled my twitter-feed with brilliance. #BIllMeetScienceTwitter |

Based on that, and on the huge number of internet-labelled tweets, I decided to remove those.

```r
library("ggplot2")
library("viridis")

label_counts <- output %>%
  dplyr::filter(label != "Internet") %>%
  dplyr::group_by(label) %>%
  dplyr::summarise(n = n()) %>%
  dplyr::arrange(desc(n))

label_counts <- label_counts %>%
  dplyr::mutate(label = ifelse(n < 5, "others", label)) %>%
  dplyr::group_by(label) %>%
  dplyr::summarize(n = sum(n)) %>%
  dplyr::arrange(desc(n))

label_counts <- dplyr::mutate(label_counts,
                              label = factor(label, ordered = TRUE,
                                             levels = unique(label)))

ggplot(label_counts) +
  geom_bar(aes(label, n, fill = label), stat = "identity") +
  scale_fill_viridis(discrete = TRUE, option = "plasma") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1),
        text = element_text(size = 25),
        legend.position = "none")
```

In the end, I'm always skeptical when looking at the results of such classifiers, and of the quality of my sample to begin with – but then I doubt there was ever a hashtag used only to answer the question rather than to spam or comment on it (which is what I'm doing). I'd say the results seem to support my husband's hypothesis about biomedical stuff.

I’m pretty sure Bill Nye won’t have had the time to read all the tweets, but I think he should save them, or at least all the ones he can get via the Twitter API thanks to e.g. rtweet, in order to be able to look through them next time he needs an expert. And in the random sample of tweets he’s read, let’s hope he was exposed to a great diversity of science topics (and of scientists), although, hey, the health and life related stuff is the most interesting of course. Just kidding. I liked reading tweets about various scientists, science rocks! And these last words would be labelled with “performing arts”, perfect way to end this post.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Maëlle**.

### New R Users group in Münster!

(This article was first published on ** Shirin's playgRound**, and kindly contributed to R-bloggers)

This is to announce that Münster now has its very own R users group!

If you are from the area, come join us (or if you happen to know someone who is and who might be interested, please forward the info).

You can find us on meetup.com: https://www.meetup.com/Munster-R-Users-Group/ and we are also on the R User groups list.

Code for the logo, which of course has been created in R:

```r
library(ggplot2)
library(ggmap)
library(ggthemes)

munster <- get_map(location = 'Munster', zoom = 12, maptype = "watercolor")

ggmap(munster) +
  scale_y_continuous(limits = c(51.94, 51.97)) +
  annotate("text", x = 7.565, y = 51.96, label = "MünsteR", size = 8) +
  annotate("text", x = 7.685, y = 51.945, label = "R Users Group", size = 5) +
  theme_map()
```
To **leave a comment** for the author, please follow the link and comment on their blog: ** Shirin's playgRound**.

### A Primer in Functional Programming in R Exercises (Part – 1)

(This article was first published on ** R-exercises**, and kindly contributed to R-bloggers)

In the exercises below we cover the basics of functional programming in R (part 1 of a two-part series of exercises on functional programming). We consider recursion in R, the apply family of functions, and higher-order functions such as Map, Reduce, and Filter.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Create a function which calculates the factorial of a number with the help of Reduce.

**Exercise 2**

Create the same factorial function, but using recursion and memoization.

**Exercise 3**

Create a function cum_add which computes a cumulative sum; e.g. if x <- 1:3, then cum_add(x) results in 1 3 6. Don't use cumsum.

**Exercise 4**

Create a function which takes a data frame and returns the mean, minimum, and maximum of all numeric columns, e.g. your_func(iris).

Try to avoid loops; it's possible to do it in one line. Show off your functional skills.

**Learn more**about Functions in the online course R Programming: Advanced Analytics In R For Data Science. In this course you will learn how to prepare your data quicker and more efficiently, leaving you more time to focus on analyzing and visualizing your results!

**Exercise 5**

Create a function centerColumnAroundMean which takes a numeric data frame and manipulates it in such a way that every column's values are centered on the mean of that column. For example, if my data frame is like this

df

then centerColumnAroundMean(df) will result in

It is possible to achieve this with a single line of R; please don't use loops.

**Exercise 6**

I have a list of movies containing two movie franchises as elements, Star Wars and LOTR. You can create the movie list with

```r
my_movie_list <- list(
  STARWARS = list("A NEW HOPE", "The Last Jedi", "The Force Awakens"),
  LOTR = list("THE FELLOWSHIP OF THE RING", "THE Two Towers", "The RETURN of the KING",
              Hobbit = list("An unexpected Journey", "The Battle of the FIVE ARMY",
                            "The Desolation of Smaug")))
```

Now, the problem I am facing is that some of the titles are in caps and some in lower case, without any particular order. I would like the list to have a consistent format like

"The Last Jedi". Help me write a function which does this. Please keep in mind that the list is a nested list.

**Exercise 7**

Load the diamonds dataset from the ggplot2 package. I want to buy a diamond of each color but don't want to pay the top price; I assume the second-highest price is good enough for me. Help me find the second-highest price for each color in the dataset.

**Exercise 8**

Use the diamonds dataset already loaded in the previous exercise. I want to know the average price for each combination of cut and color; your output should be similar to the following. Don't use table.

**Exercise 9**

Load the iris dataset. I want to get the third row for each species group. Help me write a short function to achieve that.

**Exercise 10**

Create a new environment with the new.env() command and create 3 vector variables in that environment: a = 1:10; b = 100:500; c = 1000:1500.

Without manually calling mean on each vector, can you print the mean of every variable in the new environment?


To **leave a comment** for the author, please follow the link and comment on their blog: ** R-exercises**.
