R news and tutorials contributed by hundreds of R bloggers
Updated: 3 hours 46 min ago

### Formal ways to compare forecasting models: Rolling windows

Thu, 11/09/2017 - 01:49

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

Overview

When working with time-series forecasting we often have to choose between a few potential models, and the best way is to test each model in pseudo-out-of-sample estimations. In other words, we simulate a forecasting situation where we drop some data from the estimation sample to see how each model performs.

Naturally, if you do only one (or just a few) forecasting tests, your results will have little robustness, and in the next forecast the results may change drastically. Another possibility is to estimate the model on, say, half of the sample and use the estimated model to forecast the other half. This is better than a single forecast, but it does not account for possible changes in the structure of the data over time because you have only one estimation of the model. The most accurate way to compare models is using rolling windows. Suppose you have, for example, 200 observations of a time series. First you estimate the model with the first 100 observations to forecast observation 101. Then you include observation 101 in the estimation sample and estimate the model again to forecast observation 102. The process is repeated until you have a forecast for all 100 out-of-sample observations. This procedure is also called an expanding window. If you instead drop the first observation in each iteration to keep the window size always the same, you have a fixed rolling window estimation. In the end you will have 100 forecasts for each model, and you can calculate RMSE, MAE and formal tests such as Diebold & Mariano.
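The two schemes differ only in the first index of the estimation sample. A minimal sketch of the idea (my own toy AR(1) example, not the application below) looks like this:

```r
# Sketch: fixed vs expanding window forecasts on a simulated AR(1) series
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.5), n = 200))
w_size <- 100
n_windows <- length(y) - w_size

fc_fixed <- fc_exp <- numeric(n_windows)
for (i in 1:n_windows) {
  in_fixed <- y[i:(w_size + i - 1)]  # window slides: always 100 observations
  in_exp   <- y[1:(w_size + i - 1)]  # window grows: 100, 101, 102, ...
  fc_fixed[i] <- predict(arima(in_fixed, order = c(1, 0, 0)), n.ahead = 1)$pred
  fc_exp[i]   <- predict(arima(in_exp,   order = c(1, 0, 0)), n.ahead = 1)$pred
}
out <- y[(w_size + 1):length(y)]
sqrt(mean((out - fc_fixed)^2))  # RMSE, fixed window
sqrt(mean((out - fc_exp)^2))    # RMSE, expanding window
```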

In general, the fixed rolling window is better than the expanding window, as the following example shows. Suppose we have two models:

Let’s assume that the true value of is zero. If we use expanding windows the asymptotic theory tells us that will go to zero and both models will be the same. If that is the case, we may be unable to distinguish which model is more accurate to forecast . However, the first model is better than the second model in small samples and it is just as good in large samples. We should be able to identify this feature. Fixed rolling windows keep the sample size fixed and they are free from this problem conditional on the sample size. In this case, the Diebold & Mariano test becomes the Giacomini & White test.

Application

In this example we are going to use some inflation data from the AER package. First let’s have a look at the function embed. This function is very useful in this rolling window framework because we often include lags of variables in the models and the function embed creates all lags for us in a single line of code. Here is an example:

```r
library(AER)
library(xts)
library(foreach)
library(reshape2)
library(ggplot2)
library(forecast)

## = embed = ##
x1 = c(1, 2, 3, 4, 5)
x2 = c(11, 12, 13, 14, 15)
x = cbind(x1, x2)
(x_embed = embed(x, 2))

##      [,1] [,2] [,3] [,4]
## [1,]    2   12    1   11
## [2,]    3   13    2   12
## [3,]    4   14    3   13
## [4,]    5   15    4   14
```

As you can see, the first two columns show the variables x1 and x2 at lag 0 and the last two columns show the same variables with one lag. We lost one observation because of the lag operation.

Now to the real example! We are going to estimate a model to forecast US inflation using four autoregressive terms (four lags of inflation), four lags of industrial production, and dummy variables for months. The second model will be a simple random walk. I took the first log-difference of both variables (CPI and industrial production index). The code below loads and prepares the data with the embed function.

```r
## = Load Data = ##
data("USMacroSWM")
data = as.xts(USMacroSWM)[ , c("cpi", "production")]
data = cbind(diff(log(data[ , "cpi"])), diff(log(data[ , "production"])))[-1, ]

## = Prep data with embed = ##
lag = 4
X = embed(data, lag + 1)
X = as.data.frame(X)
colnames(X) = paste(rep(c("inf", "prod"), lag + 1),
                    sort(rep(paste("l", 0:lag, sep = ""), 2)), sep = "")
X$month = months(tail(index(data), nrow(X)))
head(X)

##         infl0       prodl0        infl1       prodl1        infl2
## 1 0.005905082  0.000000000 -0.002275314  0.004085211  0.000000000
## 2 0.006770507 -0.005841138  0.005905082  0.000000000 -0.002275314
## 3 0.007618231  0.005841138  0.006770507 -0.005841138  0.005905082
## 4 0.019452426  0.007542827  0.007618231  0.005841138  0.006770507
## 5 0.003060112  0.009778623  0.019452426  0.007542827  0.007618231
## 6 0.006526018  0.013644328  0.003060112  0.009778623  0.019452426
##         prodl2        infl3       prodl3        infl4       prodl4
## 1 -0.008153802  0.017423641  0.005817352  0.006496543  0.005851392
## 2  0.004085211  0.000000000 -0.008153802  0.017423641  0.005817352
## 3  0.000000000 -0.002275314  0.004085211  0.000000000 -0.008153802
## 4 -0.005841138  0.005905082  0.000000000 -0.002275314  0.004085211
## 5  0.005841138  0.006770507 -0.005841138  0.005905082  0.000000000
## 6  0.007542827  0.007618231  0.005841138  0.006770507 -0.005841138
##       month
## 1      June
## 2      July
## 3    August
## 4 September
## 5   October
## 6  November
```

The following code estimates 391 fixed rolling windows with a sample size of 300 in each window:

```r
# = Number of windows and window size = #
w_size = 300
n_windows = nrow(X) - 300

# = Rolling Window Loop = #
forecasts = foreach(i = 1:n_windows, .combine = rbind) %do% {
  # = Select data for the window (in and out-of-sample) = #
  X_in = X[i:(w_size + i - 1), ]  # = change to X[1:(w_size + i - 1), ] for expanding window
  X_out = X[w_size + i, ]

  # = Regression Model = #
  m1 = lm(infl0 ~ . - prodl0, data = X_in)
  f1 = predict(m1, X_out)

  # = Random Walk = #
  f2 = tail(X_in$infl0, 1)

  return(c(f1, f2))
}
```

Finally, the remaining code calculates the forecasting errors, the forecasting RMSE across the rolling windows, and the Giacomini & White test. As you can see, the test rejects the null hypothesis that both models are equally accurate, and the RMSE is smaller for the model with the lags, production and dummies.

```r
# = Calculate and plot errors = #
e1 = tail(X[ , "infl0"], nrow(forecasts)) - forecasts[ , 1]
e2 = tail(X[ , "infl0"], nrow(forecasts)) - forecasts[ , 2]
df = data.frame("date" = tail(as.Date(index(data)), n_windows),
                "Regression" = e1, "RandomWalk" = e2)
mdf = melt(df, id.vars = "date")

ggplot(data = mdf) +
  geom_line(aes(x = date, y = value, linetype = variable, color = variable))
```

```r
# = RMSE = #
(rmse1 = 1000 * sqrt(mean(e1 ^ 2)))
## [1] 2.400037
(rmse2 = 1000 * sqrt(mean(e2 ^ 2)))
## [1] 2.62445

# = DM test = #
(dm = dm.test(e1, e2))
##
##  Diebold-Mariano Test
##
## data:  e1 e2
## DM = -1.977, Forecast horizon = 1, Loss function power = 2,
## p-value = 0.04874
## alternative hypothesis: two.sided
```

References

Diebold, Francis X., and Robert S. Mariano. “Comparing predictive accuracy.” Journal of Business & Economic Statistics 20.1 (2002): 134-144.

Giacomini, Raffaella, and Halbert White. “Tests of conditional predictive ability.” Econometrica 74.6 (2006): 1545-1578.


To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Introduction to Visualizing Asset Returns

Thu, 11/09/2017 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

In a previous post, we reviewed how to import daily prices, build a portfolio, and calculate portfolio returns. Today, we will visualize the returns of our individual assets that ultimately get mashed into a portfolio. The motivation here is to make sure we have scrutinized our assets before they get into our portfolio, because once the portfolio has been constructed, it is tempting to keep the analysis at the portfolio level.

By way of a quick reminder, our ultimate portfolio consists of the following.

+ SPY (S&P 500 fund) weighted 25%
+ EFA (a non-US equities fund) weighted 25%
+ IJS (a small-cap value fund) weighted 20%
+ EEM (an emerging-mkts fund) weighted 20%
+ AGG (a bond fund) weighted 10%

```r
library(tidyverse)
library(tidyquant)
library(timetk)
library(tibbletime)
library(highcharter)
```

To get our objects into the global environment, we use the next code chunk, which should look familiar from the previous post: we will create one xts object and one tibble, in long/tidy format, of monthly log returns.

```r
# The symbols vector holds our tickers.
symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")

prices <-
  getSymbols(symbols, src = 'yahoo', from = "2005-01-01",
             auto.assign = TRUE, warnings = FALSE) %>%
  map(~Ad(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(symbols)

# XTS method
prices_monthly <- to.monthly(prices, indexAt = "last", OHLC = FALSE)
asset_returns_xts <- na.omit(Return.calculate(prices_monthly, method = "log"))

# Tidyverse method, to long, tidy format
asset_returns_long <-
  prices %>%
  to.monthly(indexAt = "last", OHLC = FALSE) %>%
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, returns, -date) %>%
  group_by(asset) %>%
  mutate(returns = (log(returns) - log(lag(returns))))
```

We now have two objects holding monthly log returns, asset_returns_xts and asset_returns_long. First, let’s use highcharter to visualize the xts formatted returns.

Highcharter is fantastic for visualizing a time series or many time series. First, we set highchart(type = "stock") to get a nice time series line. Then we add each of our series to the highcharter code flow. In this case, we’ll add our columns from the xts object.

```r
highchart(type = "stock") %>%
  hc_title(text = "Monthly Log Returns") %>%
  hc_add_series(asset_returns_xts$SPY, name = names(asset_returns_xts$SPY)) %>%
  hc_add_series(asset_returns_xts$EFA, name = names(asset_returns_xts$EFA)) %>%
  hc_add_series(asset_returns_xts$IJS, name = names(asset_returns_xts$IJS)) %>%
  hc_add_theme(hc_theme_flat()) %>%
  hc_navigator(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE)
```

Take a look at the chart. It has a line for the monthly log returns of three of our ETFs (and in my opinion, it’s already starting to get crowded). We might be able to pull some useful intuition from this chart. Perhaps one of our ETFs remained stable during the 2008 financial crisis, or had an era of consistently negative/positive returns. Highcharter is great for plotting time series line charts.

Highcharter does have the capacity for histogram making. One method is to first call the base function hist on the data with the arguments for breaks and plot = FALSE. Then we can call hchart on that object.

```r
hc_spy <- hist(asset_returns_xts$SPY, breaks = 50, plot = FALSE)

hchart(hc_spy) %>%
  hc_title(text = "SPY Log Returns Distribution")
```


Nothing wrong with that chart, and it shows us the distribution of SPY returns. However, highcharter is missing an easy way to chart multiple histograms, and to add density lines to those multiple histograms. The functionality is fine for one set of returns, but here we want to see the distribution of all of our returns series together.

For that, we will head to the tidyverse and use ggplot2 on our tidy tibble called asset_returns_long. Because it is in long, tidy format, and it is grouped by the asset column, we can chart the asset histograms collectively on one chart.

```r
# Center all titles in the upcoming ggplots
theme_update(plot.title = element_text(hjust = 0.5))

asset_returns_long %>%
  ggplot(aes(x = returns, fill = asset)) +
  geom_histogram(alpha = 0.25, binwidth = .01)
```

Let’s use facet_wrap(~asset) to break these out by asset. We can add a title with ggtitle.

```r
asset_returns_long %>%
  ggplot(aes(x = returns, fill = asset)) +
  geom_histogram(alpha = 0.25, binwidth = .01) +
  facet_wrap(~asset) +
  ggtitle("Monthly Returns Since 2005")
```

Maybe we don’t want to use a histogram, but instead want to use a density line to visualize the various distributions. We can use stat_density(geom = "line", alpha = 1) to do this. The alpha argument controls the transparency of the line (here fully opaque). Let’s also add labels to the x and y axes with the xlab and ylab functions.

```r
asset_returns_long %>%
  ggplot(aes(x = returns, colour = asset, fill = asset)) +
  stat_density(geom = "line", alpha = 1) +
  ggtitle("Monthly Returns Since 2005") +
  xlab("monthly returns") +
  ylab("distribution")
```

That chart is quite digestible, but we can also facet_wrap(~asset) to break the densities out into individual charts.

```r
asset_returns_long %>%
  ggplot(aes(x = returns, colour = asset, fill = asset)) +
  stat_density(geom = "line", alpha = 1) +
  facet_wrap(~asset) +
  ggtitle("Monthly Returns Since 2005") +
  xlab("monthly returns") +
  ylab("distribution")
```

Now we can combine all of our ggplots into one nice, faceted plot.

At the same time, to add to the aesthetic toolkit a bit, we will do some editing to the label colors. First off, let’s choose a different color besides black to be the theme. I will go with cornflower blue, because it’s a nice shade and I don’t see it used very frequently elsewhere. Once we have a color, we can choose the different elements of the chart to change in the theme function. I make a lot of changes here by way of example, but feel free to comment out a few of those lines and see the different options.

```r
asset_returns_long %>%
  ggplot(aes(x = returns, colour = asset, fill = asset)) +
  stat_density(geom = "line", alpha = 1) +
  geom_histogram(alpha = 0.25, binwidth = .01) +
  facet_wrap(~asset) +
  ggtitle("Monthly Returns Since 2005") +
  xlab("monthly returns") +
  ylab("distribution") +
  # Lots of elements can be customized in the theme() function
  theme(plot.title = element_text(colour = "cornflowerblue"),
        strip.text.x = element_text(size = 8, colour = "white"),
        strip.background = element_rect(colour = "white", fill = "cornflowerblue"),
        axis.text.x = element_text(colour = "cornflowerblue"),
        axis.text = element_text(colour = "cornflowerblue"),
        axis.ticks.x = element_line(colour = "cornflowerblue"),
        axis.text.y = element_text(colour = "cornflowerblue"),
        axis.ticks.y = element_line(colour = "cornflowerblue"),
        axis.title = element_text(colour = "cornflowerblue"),
        legend.title = element_text(colour = "cornflowerblue"),
        legend.text = element_text(colour = "cornflowerblue"))
```

We now have one chart, with histograms and line densities broken out for each of our assets. This would scale nicely if we had more assets and wanted to peek at more distributions of returns.

We have not done any substantive work today, but the chart of monthly returns is a tool to quickly glance at the data and see if anything unusual jumps out, or some sort of hypothesis comes to mind. We are going to be combining these assets into a portfolio and, once that occurs, we will rarely view the assets in isolation again. Before that leap to portfolio building, it’s a good idea to glance at the portfolio component distributions.

That’s all for today. Thanks for reading!



To leave a comment for the author, please follow the link and comment on their blog: R Views.

### bridgesampling [R package]

Thu, 11/09/2017 - 00:17

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

Quentin F. Gronau, Henrik Singmann and Eric-Jan Wagenmakers have arXived a detailed documentation about their bridgesampling R package. (No wonder that researchers from Amsterdam favour bridge sampling!) [The package relates to a [52 pages] tutorial on bridge sampling by Gronau et al. that I will hopefully comment on soon.] The bridge sampling methodology for marginal likelihood approximation requires two Monte Carlo samples for a ratio of two integrals. A nice twist in this approach is to use a dummy integral that is already available, with respect to a probability density that is an approximation to the exact posterior. This means avoiding the difficulties with bridge sampling of bridging two different parameter spaces, in possibly different dimensions, with potentially very little overlap between the posterior distributions. The substitute probability density is chosen as Normal or warped Normal, rather than a t, which would provide more stability in my opinion. The bridgesampling package also provides an error evaluation for the approximation, although based on spectral estimates derived from the coda package. The remainder of the document exhibits how the package can be used in conjunction with either JAGS or Stan. And concludes with the following words of caution:
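As a toy illustration (my own sketch, not from the package documentation), here is a minimal use of the package's matrix interface on a conjugate normal-mean model, where the log marginal likelihood is available in closed form; a direct draw from the known posterior stands in for genuine MCMC output:

```r
# Hedged sketch: bridge sampling of the marginal likelihood for
# y ~ N(mu, 1) with prior mu ~ N(0, 1)
library(bridgesampling)
set.seed(1)
y <- rnorm(20, 0.3, 1)
n <- length(y)

# The posterior is N(sum(y)/(n + 1), 1/(n + 1)); sample it directly
# in place of an MCMC run (column names must match lb/ub below)
post <- matrix(rnorm(5000, sum(y) / (n + 1), sqrt(1 / (n + 1))),
               ncol = 1, dimnames = list(NULL, "mu"))

log_posterior <- function(pars, data) {
  mu <- pars["mu"]
  sum(dnorm(data$y, mu, 1, log = TRUE)) + dnorm(mu, 0, 1, log = TRUE)
}

bs <- bridge_sampler(samples = post, log_posterior = log_posterior,
                     data = list(y = y),
                     lb = c(mu = -Inf), ub = c(mu = Inf))
bs$logml  # compare against the analytic log marginal likelihood
```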

“It should also be kept in mind that there may be cases in which the bridge sampling procedure may not be the ideal choice for conducting Bayesian model comparisons. For instance, when the models are nested it might be faster and easier to use the Savage-Dickey density ratio (Dickey and Lientz 1970; Wagenmakers et al. 2010). Another example is when the comparison of interest concerns a very large model space, and a separate bridge sampling based computation of marginal likelihoods may take too much time. In this scenario, Reversible Jump MCMC (Green 1995) may be more appropriate.”


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

### Calculating the house edge of a slot machine, with R

Wed, 11/08/2017 - 23:32

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Modern slot machines (fruit machines, pokies, or whatever those electronic gambling devices are called in your part of the world) are designed to be addictive. They're also usually quite complicated, with a bunch of features that affect the payout of a spin: multiple symbols with different pay scales, wildcards, scatter symbols, free spins, jackpots … the list goes on. Many machines also let you play multiple combinations at the same time (20 lines, or 80, or even more with just one spin). All of this complexity is designed to make it hard for you, the player, to judge the real odds of success. But rest assured: in the long run, you always lose.

All slot machines are designed to have a "house edge" — the percentage of player bets retained by the machine in the long run — greater than zero. Some may take 1% of each bet (over a long-run average); some may take as much as 15%. But every slot machine takes something.

That being said, with all those complex rules and features, calculating the house edge, even when you know all of the underlying probabilities and frequencies, is no easy task. Giora Simchoni demonstrates the problem with an R script to calculate the house edge of an "open source" slot machine, Atkins Diet. Click the image below to try it out.

This virtual machine is at a typical level of complexity of modern slot machines. Even though we know the pay table (which is always public) and the relative frequency of the symbols on the reels (which usually isn't), calculating the house edge for this machine requires several pages of code. You could calculate the expected return analytically, of course, but it turns out to be a somewhat error-prone combinatorial problem. The simplest approach is to simulate playing the machine 100,000 times or so. Then we can have a look at the distribution of the payouts over all of these spins:
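To make the simulation idea concrete, here is a hedged, self-contained sketch (my own toy machine, not the Atkins Diet game): a single-line, three-reel slot with made-up symbol frequencies and a made-up pay table, whose house edge we estimate by brute-force simulation:

```r
# Hedged sketch: house edge of a hypothetical 3-reel, single-line slot
set.seed(42)
reel <- c("7", "BAR", "BAR", "CHERRY", "CHERRY", "CHERRY", rep("BLANK", 14))
pay  <- c("7" = 2000, "BAR" = 300, "CHERRY" = 120)  # three-of-a-kind payouts, 1-unit bet

spin <- function() {
  s <- sample(reel, 3, replace = TRUE)  # three independent reels
  if (length(unique(s)) == 1 && s[1] != "BLANK") pay[[s[1]]] else 0
}

wins <- replicate(1e5, spin())
house_edge <- 1 - mean(wins)  # fraction of each 1-unit bet kept by the house
house_edge

# Analytic check: 1 - (0.05^3 * 2000 + 0.10^3 * 300 + 0.15^3 * 120) = 0.045,
# so the simulated edge should hover around 4.5%
```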

The x axis here is log(Total Wins + 1), in log-dollars, from a single spin. It's interesting to see the impact of the bet size (which increases variance but doesn't change the distribution), and the number of lines played. Playing one 20-line game isn't the same as playing 20 1-line games, because the re-use of the symbols means multi-line wins are not independent: a high-value symbol (like a wild) may contribute to wins on multiple lines. Conversely, losing combinations have a tendency to cluster together, too. It all balances in the end, but the possibility of more frequent wins (coupled with higher-value losses) is apparently appealing to players, since many machines encourage multi-line play.

Nonetheless, however you play, the house edge is always positive. For Atkins Diet, it's about 4% for single-line play, and about 3.2% for multi-line play. You can see the details of the calculation, and the complete R code behind it, at the link below.

Giora Simchoni: Don't Drink and Gamble (via the author)


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### Creating Reporting Template with Glue in R

Wed, 11/08/2017 - 15:00

Report generation is a very important part of any organization’s Business Intelligence and Analytics division. The ability to create automated reports from the given data is one of the most desirable capabilities that any innovative team would strive for. And that is one area where SAS is considered more mature than R – not because R does not have those features, but primarily because R practitioners are not familiar with them. That’s the feeling I had today when I stumbled upon the glue package in R, which is a very good and competitive alternative to reporting-template packages like whisker and brew.

The package can be installed directly from CRAN.

```r
install.packages('glue')
```

Let us try to put together a very minimal reporting template to output basic information about the given Dataset.

```r
library(glue)

df <- mtcars
msg <- 'Dataframe Info: \n\n This dataset has {nrow(df)} rows and {ncol(df)} columns. \n There {ifelse(sum(is.na(df))>0,"is","are")} {sum(is.na(df))} Missing Values.'
glue(msg)
```

Dataframe Info:

 This dataset has 32 rows and 11 columns.
 There are 0 Missing Values.

As in the above code, glue() is the primary function. It takes a string with R expressions enclosed in curly braces {}, whose resulting values get interpolated into the given string. Creating the templatised string is what we have done with msg. This whole exercise wouldn’t make much sense if it were needed for only one dataset; rather, it serves its purpose when the same code is used for different datasets with no code change. Hence let us try running this on a different dataset – R’s built-in iris dataset. Also, since we are outputting the count of missing values, let’s manually assign NA to two cells and run the code.

```r
df <- iris
df[2, 3] <- NA
df[4, 2] <- NA
msg <- 'Dataframe Info: \n\n This dataset has {nrow(df)} rows and {ncol(df)} columns. \n There {ifelse(sum(is.na(df))>0,"is","are")} {sum(is.na(df))} Missing Values.'
glue(msg)
```

Dataframe Info:

 This dataset has 150 rows and 5 columns.
 There is 2 Missing Values.

That looks fine. But what if we want to report the contents of the dataframe? That’s where coupling glue's glue_data() function with magrittr's %>% operator helps.

```r
library(magrittr)

head(mtcars) %>%
  glue_data("* {rownames(.)} has {cyl} cylinders and {hp} hp")

* Mazda RX4 has 6 cylinders and 110 hp
* Mazda RX4 Wag has 6 cylinders and 110 hp
* Datsun 710 has 4 cylinders and 93 hp
* Hornet 4 Drive has 6 cylinders and 110 hp
* Hornet Sportabout has 8 cylinders and 175 hp
* Valiant has 6 cylinders and 105 hp
```

This is just to introduce glue and its possibilities. It could potentially help in automating a lot of reports and also in starting with exception-based reporting. The code used in the article can be found on my GitHub.
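As a hedged sketch of what exception-based reporting could look like with glue (my own example, with a made-up mpg threshold), we can filter a dataset down to the rows that breach a rule and template only those:

```r
# Hedged sketch: report only the "exceptions" (cars below a made-up 15 mpg threshold)
library(glue)
library(dplyr)
library(tibble)

alerts <- mtcars %>%
  rownames_to_column("car") %>%
  filter(mpg < 15) %>%                 # keep only rule-breaching rows
  glue_data("ALERT: {car} averages {mpg} mpg with {hp} hp")

alerts
```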



### R / Finance 2018 Call for Papers

Wed, 11/08/2017 - 13:50

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

The tenth (!!) annual R/Finance conference will take place in Chicago on the UIC campus on June 1 and 2, 2018. Please see the call for papers below (or at the website) and consider submitting a paper.

We are once again very excited about our conference, thrilled about who we hope may agree to be our anniversary keynotes, and hope that many R/Finance users will not only join us in Chicago in June but also submit an exciting proposal.

So read on below, and see you in Chicago in June!

Call for Papers

R/Finance 2018: Applied Finance with R
June 1 and 2, 2018
University of Illinois at Chicago, IL, USA

The tenth annual R/Finance conference for applied finance using R will be held June 1 and 2, 2018 in Chicago, IL, USA at the University of Illinois at Chicago. The conference will cover topics including portfolio management, time series analysis, advanced risk tools, high-performance computing, market microstructure, and econometrics. All will be discussed within the context of using R as a primary tool for financial risk management, portfolio construction, and trading.

Over the past nine years, R/Finance has included attendees from around the world. It has featured presentations from prominent academics and practitioners, and we anticipate another exciting line-up for 2018.

We invite you to submit complete papers in pdf format for consideration. We will also consider one-page abstracts (in txt or pdf format) although more complete papers are preferred. We welcome submissions for both full talks and abbreviated "lightning talks." Both academic and practitioner proposals related to R are encouraged.

All slides will be made publicly available at conference time. Presenters are strongly encouraged to provide working R code to accompany the slides. Data sets should also be made public for the purposes of reproducibility (though we realize this may be limited due to contracts with data vendors). Preference may be given to presenters who have released R packages.

Please submit proposals online at http://go.uic.edu/rfinsubmit. Submissions will be reviewed and accepted on a rolling basis with a final submission deadline of February 2, 2018. Submitters will be notified via email by March 2, 2018 of acceptance, presentation length, and financial assistance (if requested).

Financial assistance for travel and accommodation may be available to presenters. Requests for financial assistance do not affect acceptance decisions. Requests should be made at the time of submission. Requests made after submission are much less likely to be fulfilled. Assistance will be granted at the discretion of the conference committee.

Additional details will be announced via the conference website at http://www.RinFinance.com/ as they become available. Information on previous years’ presenters and their presentations is also at the conference website. We will make a separate announcement when registration opens.

For the program committee:

Gib Bassett, Peter Carl, Dirk Eddelbuettel, Brian Peterson,
Dale Rosenthal, Jeffrey Ryan, Joshua Ulrich


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

### RQuantLib 0.4.4: Several smaller updates

Wed, 11/08/2017 - 12:45

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A shiny new (mostly-but-not-completely maintenance) release of RQuantLib, now at version 0.4.4, arrived on CRAN overnight, and will get to Debian shortly. This is the first release in over a year, and it contains (mostly) a small number of fixes throughout. It also includes the update to the new DateVector and DatetimeVector classes which become the default with the upcoming Rcpp 0.12.14 release (just like this week’s RcppQuantuccia release). One piece of new code is due to François Cocquemas, who added support for discrete dividends to both European and American options. See below for the complete set of changes reported in the NEWS file.

As with release 0.4.3 a little over a year ago, we will not have new Windows binaries from CRAN as I apparently have insufficient powers of persuasion to get CRAN to update their QuantLib libraries. So we need a volunteer. If someone could please build a binary package for Windows from the 0.4.4 sources, I would be happy to once again host it on the GHRR drat repo. Please contact me directly if you can help.

Changes are listed below:

Changes in RQuantLib version 0.4.4 (2017-11-07)
• Changes in RQuantLib code:

• Equity options can now be analyzed via discrete dividends through two vectors of dividend dates and values (Francois Cocquemas in #73 fixing #72)

• Some package and dependency information was updated in files DESCRIPTION and NAMESPACE.

• The new Date(time)Vector classes introduced with Rcpp 0.12.8 are now used when available.

• Minor corrections were applied to BKTree, to vanilla options for the case of intraday time stamps, to the SabrSwaption documentation, and to bond utilities for the most recent QuantLib release.

Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the RQuantLib page. Questions, comments etc should go to the rquantlib-devel mailing list off the R-Forge page. Issue tickets can be filed at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

### solrium 1.0: Working with Solr from R

Wed, 11/08/2017 - 01:00

(This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers)

Nearly 4 years ago I wrote on this blog about an R package solr for working with the database Solr. Since then we’ve created a refresh of that package in the solrium package. Since solrium first hit CRAN about two years ago, users have raised a number of issues that required breaking changes. Thus, this blog post is about a major version bump in solrium.

What is Solr?

Solr is a “search platform” – a NoSQL database – where data is organized into so-called documents, which are XML/JSON/etc. blobs of text. Documents are nested within either collections or cores (depending on the mode you start Solr in). Solr makes it easy to search for documents, with a huge variety of parameters and a number of different data formats (JSON/XML/CSV). Solr is similar to Elasticsearch (see our Elasticsearch client elastic) – and was around before it. Solr is, in my opinion, harder to set up than Elasticsearch, but I don’t claim to be an expert on either.

Vignettes

Notable features
• Added in v1, you can now work with many connection objects to different Solr instances.
• Methods for the major search functionalities: search, highlight, stats, mlt, group, and facet. In addition, a catch-all function, solr_all, to combine all of those.
• Comprehensive coverage of the Solr HTTP API
• Can coerce data from Solr API into data.frame’s when possible
Setup

Install solrium

install.packages("solrium")

Or get the development version:

devtools::install_github("ropensci/solrium")
library(solrium)

Initialize a client

A big change in v1 of solrium is solr_connect has been replaced by SolrClient. Now you create an R6 connection object with SolrClient, then you can call methods on that R6 object, OR you can pass the connection object to functions.

By default, SolrClient$new() sets connection details for a Solr instance that’s running on localhost, on port 8983.

(conn <- SolrClient$new())
#>
#> host: 127.0.0.1
#> path:
#> port: 8983
#> scheme: http
#> errors: simple
#> proxy:

On instantiation, it does not check that the Solr instance is up, but merely sets connection details. You can check if the instance is up by doing for example (assuming you have a collection named gettingstarted):
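For instance, one lightweight check (a sketch, which assumes a local Solr instance with a gettingstarted collection, as in the Solr quick-start tutorial) is to run a query that asks for zero rows and see whether it returns without error:

```r
# Sketch: assumes Solr running locally with a "gettingstarted" collection.
# Asking for zero rows keeps the check cheap; an error means the instance
# (or the collection) is not reachable.
library(solrium)
conn <- SolrClient$new()
conn$search(name = "gettingstarted", params = list(q = "*:*", rows = 0))
```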

A good hint when connecting to a publicly exposed Solr instance is that you likely don’t need to specify a port, so a pattern like this should work to connect to a URL like http://foobar.com/search:

SolrClient$new(host = "foobar.com", path = "search", port = NULL)

If the instance uses SSL, simply specify that like:

SolrClient$new(host = "foobar.com", path = "search", port = NULL, scheme = "https")

Query and body parameters

Another big change in the package is that we wanted to make it easy to determine whether your Solr query gets passed as query parameters in a GET request or as body in a POST request. Solr clients in some other languages do this, and it made sense to port over that idea here. Now you pass your key-value pairs to either params or body. If nothing is passed to body, we do a GET request. If something is passed to body we do a POST request, even if there’s also key-value pairs passed to params.

This change does break the interface we had in the old version, but we think it’s worth it.

For example, to do a search you have to pass the collection name and a list of named parameters:

conn$search(name = "gettingstarted", params = list(q = "*:*"))
#> # A tibble: 5 x 5
#>       id   title title_str  `_version_` price
#> 1     10 adfadsf   adfadsf 1.582913e+18    NA
#> 2     12  though    though 1.582913e+18    NA
#> 3     14 animals   animals 1.582913e+18    NA
#> 4      1                   1.582913e+18   100
#> 5      2                   1.582913e+18   500

You can instead pass the connection object to solr_search:

solr_search(conn, name = "gettingstarted", params = list(q = "*:*"))
#> # A tibble: 5 x 5
#>       id   title title_str  `_version_` price
#> 1     10 adfadsf   adfadsf 1.582913e+18    NA
#> 2     12  though    though 1.582913e+18    NA
#> 3     14 animals   animals 1.582913e+18    NA
#> 4      1                   1.582913e+18   100
#> 5      2                   1.582913e+18   500

And the same pattern applies for the other functions:

• solr_facet
• solr_group
• solr_mlt
• solr_highlight
• solr_stats
• solr_all

A user requested the ability to do atomic updates – partial updates to documents without having to re-index the entire document.

Two functions were added: update_atomic_json and update_atomic_xml for JSON and XML based updates. Check out their help pages for usage.
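As a rough sketch of what an atomic update looks like (the document id and field here are hypothetical, and the exact call signature is an assumption — do check ?update_atomic_json), the body uses Solr's standard atomic-update JSON syntax:

```r
# Hypothetical example: set a new price on document "10" in the
# "gettingstarted" collection without re-indexing the whole document.
# "set" is Solr's atomic-update modifier; others include "add" and "inc".
library(solrium)
conn <- SolrClient$new()
body <- '[{"id": "10", "price": {"set": 45}}]'
update_atomic_json(conn, body, name = "gettingstarted")
```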

Search results as attributes

solr_search and solr_all in v1 gain attributes that include numFound, start, and maxScore. That is, you can get to these three values after data is returned. Note that some Solr instances may not return all three values.

For example, let’s use the Public Library of Science Solr search instance at http://api.plos.org/search:

plos <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL)

Search

res <- plos$search(params = list(q = "*:*"))

Get attributes

attr(res, "numFound")
#> [1] 1902279
attr(res, "start")
#> [1] 0
attr(res, "maxScore")
#> [1] 1

A user highlighted that there’s a performance penalty when asking for too many rows. The resulting change in solrium is that in some search functions we automatically adjust the rows parameter to avoid the performance penalty.

Usage in other packages

I maintain 4 other packages that use solrium: rplos, ritis, rdatacite, and rdryad. If you are interested in using solrium in your package, looking at any of those four packages will give a good sense of how to do it.

Notes

solr pkg

The solr package will soon be archived on CRAN. We’ve moved all packages depending on it to solrium. Let me know ASAP if you have any complaints about archiving it on CRAN.

Feedback!

Please do upgrade/install solrium v1 and let us know what you think.


To leave a comment for the author, please follow the link and comment on their blog: rOpenSci - open tools for open science.

### In case you missed it: October 2017 roundup

Wed, 11/08/2017 - 00:04

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

In case you missed them, here are some articles from October of particular interest to R users.

A recent survey of competitors on the Kaggle platform reveals that Python (76%) and R (59%) are the preferred tools for building predictive models.

Microsoft's "Team Data Science Process" has been updated with new guidelines on use of the IDEAR framework for R and Python.

Microsoft R Open 3.4.2 is now available for Windows, Mac and Linux.

Using the foreach package to estimate bias of rpart trees via bootstrapping.

Replays of webinars on the Azure Data Science VM, and on document collection analysis with Azure ML Workbench, are now available.

The "officer" package makes it possible to create PowerPoint and Word documents from R, and even include editable R charts.

An online book on statistical machine learning with the MicrosoftML package.

An updated list of major events in the history of the R project, 1992-2016.

An overview of the R manuals, now also available in Bookdown format.

An R-based analysis comparing the speeds of bikes and taxis for trips across New York City.

Vision-based AI techniques used to estimate the population of endangered snow leopards.

ROpenSci interviews David Smith about working in the R community.

A generative neural network, implemented in R, synthesizes startup names and business plans.

Two R-themed crosswords: a cryptic one by Barry Rowlingson, and a standard one from R-Ladies DC.

A tutorial on using Azure Data Lake Analytics with R.

The remarkable growth of R, as seen in StackOverflow traffic data.

Version 1.0.0 of the dplyrXdf package, providing dplyr operations for Microsoft R out-of-memory data files, is now available.

The GPU-enabled Deep Learning Virtual Machine on Azure includes R, Spark, Tensorflow and more.

A comparison of assault death rates in the US and other advanced democracies, generated in R by Kieran Healy.

And some general interest stories (not necessarily related to R):

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

### 9th MilanoR Meeting: November 20th

Tue, 11/07/2017 - 17:29

(This article was first published on MilanoR, and kindly contributed to R-bloggers)

MilanoR Staff is happy to announce the 9th MilanoR Meeting!

A MilanoR meeting is an occasion to bring together R users in the Milano area to share knowledge and experiences. The meeting is a free event, open to everybody. The meeting consists of some R talks and a free buffet + networking time.

This time we want to focus on a specific topic: data visualization with R. We’ll try to explore the R dataviz world from any corner, checking tools, ideas and applied case studies.

When

Monday, November 20, 2017

from 7:00 to 9:30 pm

Agenda

Intro: Dataviz with R: past, present and future

by MilanoR

Talk: Automagically build the perfect palette for your plot with paletteR

by  Andrea Cirillo

Lightning Talks:

From modelling to visualization by Stefano Biffani;  Mapping data with R by Stefano Barberis; Interactive plots with Highchart by Nicola Pesenti; Shiny in a Data Science app lifecycle by Moreno Riccardi

Where

Mikamai
Via Venini 42, Milano

Buffet

Our sponsors will provide a free buffet!

Registration

MilanoR Meeting is a free event, open to all R users and enthusiasts or those who wish to learn more about R.

Places are limited so if you would like to attend the meeting, you must register here!

FAQ

Uhm, this MilanoR meeting looks different from the others… How does it work?

You’re right, this time we decided to focus on a specific topic, and just explore it from several corners. So instead of having two long talks, we moved to many, but shorter.

Ideally, the meeting may be divided into three half-hour blocks: from about 19:00 to 19:30 we will have welcome presentations and a first introduction to the R dataviz world.

From about 19:30 to 20:00 Andrea Cirillo will introduce the package paletteR, going deeper into a new tool for dataviz (and showing how everyone can extend the R capabilities in every field).

From 20:00 to 20:30 we’ll see a roundup of examples that complete our exploration of the R data visualization world: from modeling to interactive dashboard, we’ll see how R is used in the “real” world.

Then, as always, free food and chat <3

So, who are the speakers?

Mariachiara Fortuna is a freelance statistician who works with Quantide and takes care of MilanoR

Andrea Cirillo is a Quant Analyst at Intesa Sanpaolo currently employing R to explore and assess advanced credit risk models.

Stefano Biffani is a biostatistician/geneticist using R in animal science, working with AIA, Associazione Italiana Allevatori

Stefano Barberis is a PhD in Statistics at the University of Milano-Bicocca. His research is focused on representation and mapping of geospatial data

Nicola Pesenti works as data analyst at Istituto Mangiagalli, previously at SSE Airtricity, Dublin

Moreno Riccardi is Deliverability Specialist & Data Analyst at MailUp

Quantide is a provider of consulting services and training courses about Data Science and Big Data. It’s specialized in R, the open source software for statistical computing. Headquartered in Legnano, near Milan (Italy), Quantide has been supporting customers from several industries around the world for 10 years. Quantide is the founder and the supporter of the MilanoR meeting, since its first edition in 2012.

Open Search Network is an innovative headhunting company, based in London and focused on Italian Data and Digital experts (for opportunities in Italy and abroad). OSN is specialised in opportunities for Data Science, Data Engineering and Data Strategy professionals, from middle management to executive. OSN is the first headhunting company to specialise in the Italian data sector and to organise online competitions for recruitment purposes.

I have another question

The post 9th MilanoR Meeting: November 20th appeared first on MilanoR.


To leave a comment for the author, please follow the link and comment on their blog: MilanoR.

### Setting up twitter streamR Service on an Ubuntu server

Tue, 11/07/2017 - 16:56

(This article was first published on R – AmitKohli.com, and kindly contributed to R-bloggers)

I am working on a super-secret project for which I am harvesting a highly confidential source of data: Twitter. The idea is to gather a small amount of twitter data, but for a long time… maybe a year. I tried to use the package twitteR, but it can only grab up to a week of tweets… it’s not really good for a set-it-and-forget-it ongoing capture since it requires user-based authentication, which means (I guess) that a machine can’t authenticate for it. Tangibly this means a human needs to start the process every time. So I could run the script weekly, but of course there are days you miss, or runs at different times… plus it’s just plain annoying…

does no #rstats package useTwitter oauth 2.0? I can’t believe it… this means that automated tweet-scrapers in R aren’t possible? pic.twitter.com/eq9dnrbB6L

— amit (@VizMonkey) August 9, 2017

And then I remembered streamR, which allows exactly this kind of ongoing scraping. This blog documents my experience setting this up on my server, using a linux service.

Let’s Go!

(Small meta-note: I’m experimenting with a new blogging style: showing more of my errors and my iterative approach to solving problems in order to counter the perception of the perfect analyst… something a bunch of people have been talking about recently. I was exposed to it by   and  during EARL London, and it’s really got me thinking. Anyway, I do realize that it makes for a messier read full of tangents and dead ends. Love it? Hate it? Please let me know what you think in the comments!)

(metanote 2: The linux bash scripts are available in their own github repo)

So if you don’t have a linux server of your own, follow Dean Attali’s excellent guide to set one up on Digital Ocean… it’s cheap and totally worth it. Obviously, you’ll need to install streamR, and also ROAuth. I use other packages in the scripts here; up to you to do it exactly how I do it or not. Also… remember that when you install R packages on Ubuntu, you have to do it as the superuser in linux, not from R (otherwise that package won’t be available for any other user (like shiny). If you don’t know what I’m talking about then you didn’t read Dean Attali’s guide like I said above… why are you still here?). Actually, it’s so annoying to have to remember how to correctly install R packages on linux that I created a little utility for it. Save the following into a file called “Rinstaller.sh”:

#!/bin/bash
# Ask the user what package to install
echo "what package should I grab?"
read varname
echo "I assume you mean CRAN, but to use github type g"
read source
if [ "$source" = "g" ]; then
    echo --------------------------------
    echo "Installing $varname from GitHub"
    sudo su - -c "R -e \"devtools::install_github('$varname')\""
else
    echo --------------------------------
    echo "Grabbing $varname from CRAN"
    # mirror URL was truncated in the original; cran.rstudio.com assumed
    sudo su - -c "R -e \"install.packages('$varname', repos='http://cran.rstudio.com/')\""
fi

This script will accept an input (the package name) and then ask whether to install from CRAN or from GitHub. For GitHub, obviously, you need to supply the user account and package name. There! Now we don’t need to remember anything anymore! Oh, make sure you chmod 777 Rinstaller.sh (which lets anyone execute the file) and then run it as ./Rinstaller.sh

Anyway, I messed around with streamR for a while and figured out how I wanted to structure the files. I think I want 3 files… one to authenticate, one to capture tweets, and the third to do the supersecret analysis. Here they are:

Authenticator

## Auth
library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "myKey"
consumerSecret <- "mySecret"
my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
save(my_oauth, file = "my_oauth.Rdata")

So we use this file to connect to the Twitter service. You will need to set yourself up with an API key… go do that here and select “Create new app”. It’s fairly painless.

(small caveat: make sure the my_oauth file is saved in the working directory. You can make sure of it by creating a Project for these three files… tho this is a pain… more on this later).

Tweet-Getter

library(streamR)
library(here)
## Get
load("/srv/shiny-server/SecretFolder/my_oauth.Rdata")
filterStream("/srv/shiny-server/SecretFolder/tweets.json",
             track = "SecretTopic",
             oauth = my_oauth)

OK, so we run the authenticator once, then we can run this file, which just goes out and gathers all tweets related to SecretTopic and saves them to tweets.json. This works with my stream because it’s a relatively small number of tweets… but be careful: if your topic gets tons of hits, the file can grow VERY quickly. You might be interested in splitting up the output into multiple files. Check this to see how.

On working directories: it’s super annoying. It’s typically bad practice to specify a direct path to files in your script; instead, it’s encouraged that you use tech to “know your path”… for example, we can use the here package, or use the Project folder. The problem that arises when running files from some kind of automated cron or scheduler is that it doesn’t know how to read .Rproj files, and therefore doesn’t know what folder to use. I asked this question in the RStudio Community site, which has sparked a large discussion… check it out! Anyway, the last script:

Tweet-analyzer

## read
library(streamR)
tweets.df <- parseTweets("tweets.json", verbose = FALSE)
## Do the Secret Stuff

Ok, so now we can authenticate, gather tweets, and analyze the resulting file!

OK cool! So let’s get the TweetGetter running! As long as it’s running, it will be appending tweets to the json file. We could run it on our laptop, but it’ll stop running when we close our laptop, so that’s a perfect candidate to run on a server. If you don’t know how to get your stuff up into a linux server, I recommend saving your work locally, setting up git, git pushing it up to a private github remote (CAREFUL! This will have your Private Keys so make sure you don’t use a public repo) and then git pulling it into your server.
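That workflow can be sketched as follows (the remote URL is hypothetical, and the network-touching commands are commented out so the sketch runs locally):

```shell
# On your laptop: put the project under git and push it to a PRIVATE
# remote, since my_oauth.Rdata contains your keys.
cd /tmp && rm -rf SecretFolder && mkdir SecretFolder && cd SecretFolder
git init -q
git config user.email "you@example.com" && git config user.name "you"
echo 'library(streamR)' > newTweetGetter.R
git add . && git commit -q -m "add tweet getter"
# git remote add origin git@github.com:you/SecretFolder.git  # hypothetical URL
# git push -u origin master
# Then on the server: git clone git@github.com:you/SecretFolder.git
git log --oneline | wc -l
```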

OK, set it up to run on the server

The first time you run the script, it will ask you to authenticate… So I recommend running the Authenticator file from RStudio on the server, which will allow you to grab the auth code and paste it into the Rstudio session. Once you’re done, you should be good to capture tweets on that server. The problem is, if you run the TweetGetter in RStudio… when you close that session, it stops the script.

Idea 2: Hrm… let’s try in the shell. So SSH into the server (on windows use Putty), go to the Project folder and type:

Rscript TweetGetter.R

It runs, but when I close the SSH session it also stops the script :-\ . I guess that instance is tied to the SSH session? I don’t get it… but whatever, fine.

Idea 3: set a cronjob to run it! In case you don’t know, cron jobs are the schedulers on linux. Run crontab -e to edit the jobs, and crontab -l to view what jobs you have scheduled. To understand the syntax of the crontabs, see this.
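For reference, the five time fields come before the command, in this order (a generic annotated sketch, not one of the jobs used below):

```
# ┌ minute (0-59)
# │  ┌ hour (0-23)
# │  │ ┌ day of month (1-31)
# │  │ │ ┌ month (1-12)
# │  │ │ │ ┌ day of week (0-6, Sunday = 0)
# │  │ │ │ │
 30  2 * * 1 /path/to/some_script.sh   # runs at 02:30 every Monday
```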

So the idea would be to start the task on a schedule… that way it’s not my session that started it… although of course, if it’s set on a schedule and the schedule dictates it’s time to start up again but the file is already running, I don’t want it to run twice… hrm…

Oh I know! I’ll create a small bash file (like a small executable) that CHECKS if the thingie is running, and if it isn’t then run it, if it is, then don’t do anything! This is what I came up with:

if pgrep -x "Rscript" > /dev/null
then
    echo "Running"
else
    echo "Stopped... restarting"
    Rscript "/srv/shiny-server/SecretFolder/newTweetGetter.R"
fi

WARNING! THIS IS WRONG.

What this is saying is “check if ‘Rscript’ is running on the server (I assumed I didn’t have any OTHER running R process at the time, a valid assumption in this case). If it is, then just say ‘Running’; if it’s not, then say ‘Stopped… restarting’ and re-run the file, using Rscript.” Then, we can put this file on the cron job to run hourly… so hourly I will check if the job is running or not, and if it’s stopped, restart it. This is what the cron job looks like:

1 * * * * "/srv/shiny-server/SecretFolder/chek.sh"

In other words, run the file chek.sh during minute 1 of every hour, every day of the month, every month of the year, and every day of the week (ie, every hour :))

OK…. Cool! So I’m good right? Let me check if the json is getting tweets… hrm… no data in the past 10 minutes or so… has nobody tweeted or is it broken? Hrm2… how does one check the cronjob log file? Oh, there is none… but shouldn’t there be? ::google:: I guess there is supposed to be one… ::think:: Oh, it’s because I’m logged in with a user that doesn’t have admin rights, so when it tries to create a log file in a protected folder, it gets rejected… Well Fine! I’ll pipe the output of the run to a file in a folder I know I can write to. (Another option is to set up the cron job as the root admin…. ie instead of crontab -e you would say sudo crontab -e… but if there’s one thing I know about linux is that I don’t know linux and therefore I use admin commands as infrequently as I can get away with). So how do I pipe run contents to a location I can see? Well… google says this is one way:

40 * * * * "/srv/shiny-server/SecretFolder/chek.sh" >> /home/amit/SecretTweets.log 2>&1

So what this is doing is running the file just as before, but the >> pushes the results to a log file on my home directory. Just a bit of Linux for you… > recreates the piped output everytime (ie overwrites), whereas >> appends to what was already there. The 2>&1 part means ‘grab standard output and errors’… if you wanna read more about why, geek out, but I think you’re basically saying “grab any errors and pipe them to standard output and then grab all standard output”.
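A quick way to see the difference between > and >> for yourself, using a throwaway file in /tmp:

```shell
echo "first run" > /tmp/pipe_demo.log    # > creates/overwrites the file
echo "second run" >> /tmp/pipe_demo.log  # >> appends to what is there
echo "third run" > /tmp/pipe_demo.log    # > again: previous lines are gone
cat /tmp/pipe_demo.log                   # prints "third run"
```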

OK, so after looking at the output, I saw what was happening… during every crontab run, the chek.sh file made it seem like the newTweetGetter.R wasn’t running… so it would restart it, gather 1 tweet and then time out. What strange behaviour! Am I over some Twitter quota? No, it can’t be, it’s a streaming service, twitter will feed me whatever it wants, I’m not requesting any amount… so it can’t be that.

Here is where I threw my hands up and asked Richard, my local linux expert, for help.

Enter a very useful command: top. This command, and its slightly cooler version htop (which doesn’t come in Ubuntu by default but is easy to install… sudo apt install htop), quickly showed me that when you call an R file via Rscript, it doesn’t launch a process called Rscript; it launches a process called /usr/lib/R/bin/exec/R --slave --no-restore --file=/srv/shiny-server/SecretFolder/newTweetGetter.R. Which explains why chek.sh didn’t think it was running (when it was)… and why the second run would try to connect to the twitter stream and get rejected (because the first script was already connected). So this is where Richard said “BTW, you should probably set this up as a service…”. And being who I am and Richard who he is, I said: “ok”. (Although I didn’t give up on the cron… see ENDNOTE1).

After a bit of playing around, we found that the shiny-server linux service was probably easy enough to manipulate and get functional (guidance here and here), so let’s do it!

Setting up the service:
1. First of all, go to the folder where all the services live.  cd /etc/systemd/system/
2. Next, copy the shiny service into your new one, called SECRETtweets.service: sudo cp shiny-server.service SECRETtweets.service
3. Now edit the contents! sudo nano SECRETtweets.service and copy paste the following code:
[Unit]
Description=SECRETTweets

[Service]
Type=simple
User=amit
ExecStart=/usr/bin/Rscript "/srv/shiny-server/SecretFolder/newTweetGetter.R"
Restart=always
WorkingDirectory=/srv/shiny-server/SecretFolder/
Environment="LANG=en_US.UTF-8"

[Install]
WantedBy=multi-user.target

4. restart the daemon that picks up services? Don’t know why… just do it: sudo systemctl daemon-reload

5. Now start the service!! sudo systemctl start SECRETtweets

Where each part does this:

• Description is what the thingie does
• Type says how to run it, and “simple” is the default… but check the documentation if u wanna do something more fancy
• User this defines what user is running the service. This is a bit of extra insurance, in case you installed a package as yourself and not as a superuser (which is the correct way)
• ExecStart is the command to run
• Restart by specifying this to “always”, if the script ever goes down, it’ll automatically restart and start scraping again! Super cool, no? WARNING: Not sure about whether this can cause trouble… if twitter is for some reason pissed off and doesn’t want to serve tweets to you anymore, not sure if CONSTANTLY restarting this could get you in trouble. If I get banned, I’ll letchu know… stay tuned)
• WorkingDirectory This part is where the magic happens. Remember earlier on we were worried and worried about HOW to pass the working directory to the R script? This is how!! Now we don’t have to worry about paths on the server anymore!
• Environment is the language
• WantedBy I have no idea what this does and don’t care because it works!

So there you go! This is the way to set up a proper service that you can monitor, and treat properly like any formal linux process! Enjoy!

ENDNOTE 1

Ok, it’s true… sometimes a Service is the right thing to do, but not always: if you have a job that runs for a certain amount of time, finishes, and then you want to run it again discretely later, you should set it up as a cron job. So for those cases, here’s the correct script to check that the script is running, even assigning a working directory.

if ps aux | grep "R_file_you_want_to_check.R" | grep -v grep > /dev/null
then
    echo "Running, all good!"
else
    echo "Not running... will restart:"
    cd /path_to_your_working_directory
    Rscript "R_file_you_want_to_check.R"
fi

save that as chek.sh and assign it to the cron with the output to your home path, like:

40 * * * * "/srv/shiny-server/SecretFolder/chek.sh" >> /home/amit/SecretTweets.log 2>&1


To leave a comment for the author, please follow the link and comment on their blog: R – AmitKohli.com.

### Metropolis-in-Gibbs Sampling and Runtime Analysis with Profviz

Tue, 11/07/2017 - 15:12

(This article was first published on R – Stable Markets, and kindly contributed to R-bloggers)

First off, here are the previous posts in my Bayesian sampling series: Bayesian Simple Linear Regression with Gibbs Sampling in R Blocked Gibbs Sampling in R for Bayesian Multiple Linear Regression In the first post, I illustrated Gibbs Sampling – an algorithm for getting draws from a posterior when conditional posteriors are known. In the … Continue reading Metropolis-in-Gibbs Sampling and Runtime Analysis with Profviz →


To leave a comment for the author, please follow the link and comment on their blog: R – Stable Markets.

### simmer 3.6.4

Tue, 11/07/2017 - 14:08

(This article was first published on R – Enchufa2, and kindly contributed to R-bloggers)

The fourth update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. This release patches several bugs regarding resource priority and preemption management when seized amounts greater than one were involved. Check the examples available in the corresponding issues on GitHub (#114, #115, #116) to see whether you are affected.

It can be noted that we already broke the intended bi-monthly release cycle, but it is for a good reason, since we are preparing a journal publication. Further details to come.

Minor changes and fixes:
• Fix preemption in non-saturated multi-server resources when seizing amounts > 1 (#114).
• Fix queue priority in non-saturated finite-queue resources when seizing amounts > 1 (#115).
• Fix resource seizing: avoid jumping the queue when there is room in the server but other arrivals are waiting (#116).

Article originally published in Enchufa2.es: simmer 3.6.4.


To leave a comment for the author, please follow the link and comment on their blog: R – Enchufa2.

### Drawing 10 Million Points With ggplot: Clifford Attractors

Tue, 11/07/2017 - 08:15

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

For me, mathematics cultivates a perpetual state of wonder about the nature of mind, the limits of thoughts, and our place in this vast cosmos (Clifford A. Pickover – The Math Book: From Pythagoras to the 57th Dimension, 250 Milestones in the History of Mathematics)

I am a big fan of Clifford Pickover and I often find inspiration in his books. Thanks to him, I discovered the harmonograph and Parrondo’s paradox, among many other mathematical treasures. Apart from being a great teacher, he also invented a family of strange attractors bearing his name. Clifford attractors are defined by these equations:

x_{n+1} = sin(a*y_n) + c*cos(a*x_n)
y_{n+1} = sin(b*x_n) + d*cos(b*y_n)

There are infinitely many attractors, since a, b, c and d are free parameters. Given a value for each parameter and a starting point (x0, y0), the equations define the location of the point at step n purely in terms of its location at step n-1; an attractor can be thought of as the trajectory described by such a particle. This plot shows the evolution of a particle starting at (x0, y0)=(0, 0) with parameters a=-1.24458046630025, b=-1.25191834103316, c=-1.81590817030519 and d=-1.90866735205054 over 10 million steps:
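The recurrence is easy to iterate directly. Here is a minimal Python sketch of the same map (the post's own implementation below uses Rcpp for speed; this version is only illustrative):

```python
import math

def clifford_step(x, y, a, b, c, d):
    """One application of the Clifford attractor map."""
    return (math.sin(a * y) + c * math.cos(a * x),
            math.sin(b * x) + d * math.cos(b * y))

def clifford_trajectory(n, x0, y0, a, b, c, d):
    """Iterate the map n-1 times, returning n points starting at (x0, y0)."""
    xs, ys = [x0], [y0]
    for _ in range(n - 1):
        x, y = clifford_step(xs[-1], ys[-1], a, b, c, d)
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = clifford_trajectory(1000, 0.0, 0.0,
                             -1.24458046630025, -1.25191834103316,
                             -1.81590817030519, -1.90866735205054)
# Every x stays within 1 + |c| and every y within 1 + |d|,
# since each coordinate is a sine plus a scaled cosine
```

Note that the trajectory is bounded regardless of the parameters, which is why the rendered point cloud stays in a compact region of the plane.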

Changing parameters is really entertaining. Drawings have a sandy appearance:

From a technical point of view, the challenge is creating the data frame with all the locations, since it must have 10 million rows and must be populated sequentially. A very fast way to do it is with the Rcpp package. To render the plot I use ggplot, which works quite well. Here is the code if you want to play with Clifford attractors:

library(Rcpp)
library(ggplot2)
library(dplyr)

opt = theme(legend.position = "none",
            panel.background = element_rect(fill="white"),
            axis.ticks = element_blank(),
            panel.grid = element_blank(),
            axis.title = element_blank(),
            axis.text = element_blank())

cppFunction('DataFrame createTrajectory(int n, double x0, double y0,
                                        double a, double b, double c, double d) {
  // create the columns
  NumericVector x(n);
  NumericVector y(n);
  x[0]=x0;
  y[0]=y0;
  for(int i = 1; i < n; ++i) {
    x[i] = sin(a*y[i-1])+c*cos(a*x[i-1]);
    y[i] = sin(b*x[i-1])+d*cos(b*y[i-1]);
  }
  // return a new data frame
  return DataFrame::create(_["x"]= x, _["y"]= y);
}')

a=-1.24458046630025
b=-1.25191834103316
c=-1.81590817030519
d=-1.90866735205054

df=createTrajectory(10000000, 0, 0, a, b, c, d)

png("Clifford.png", units="px", width=1600, height=1600, res=300)
ggplot(df, aes(x, y)) + geom_point(color="black", shape=46, alpha=.01) + opt
dev.off()

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.

### Plots of Back-Calculated Lengths-At-Age I

Tue, 11/07/2017 - 07:00

(This article was first published on fishR Blog, and kindly contributed to R-bloggers)

Last spring, I posted about my version of a modified age-bias plot. One reader commented on that post via Twitter – “Now that you solved the age-bias plot, how about the ‘best’ display of back-calculated length-at-age data, with VonB growth curve overlay?”. In addition, I recently received a question related to the non-convergence of a hierarchical (mixed) model applied to fitting a von Bertalanffy growth function (VBGF) to back-calculated lengths at age. In exploring that question, I realized that a “good” plot of back-calculated lengths at age was needed to understand why the VBGF may (or may not) fit.

Here I write about my initial attempts to visualize back-calculated lengths at age with what are basically spaghetti plots. Spaghetti plots show individual longitudinal traces for each subject (e.g., one example). Recently “spaghetti plots” were in the news to show modeled paths of hurricanes (e.g., I particularly enjoyed this critique).

In this post, I examine back-calculated lengths (mm) at age for Walleye (Sander vitreus) captured from Lake Mille Lacs, Minnesota in late fall (September-October). [More details are here.] These data were kindly provided by the Minnesota Department of Natural Resources, are available in the FSAData package, and were used extensively in the “Growth Estimation: Growth Models and Statistical Inference” chapter of the forthcoming “Age and Growth of Fishes: Principles and Techniques” book to be published by the American Fisheries Society. For simplicity of presentation here, these data were reduced to a single year and sex and several superfluous variables were removed. A “snapshot” of the data file is below.

ID            Est.Age   TL  BC.Age  BC.Len
2002.10416.F        1  232       1   81.21
2002.10493.F        1  248       1  114.60
2002.10606.F        1  287       1  112.21
2002.1153.F         1  273       1  116.29
2002.1154.F         1  244       1  157.55
2002.20742.F       10  592       6  523.52
2002.20742.F       10  592       7  539.77
2002.20742.F       10  592       8  554.83
2002.20742.F       10  592       9  567.56
2002.20742.F       10  592      10  576.20

These fish were captured in late fall such that the observed length includes current year’s growth. However, the observed age does not account for time since the fish’s “birthday.” In other words, the observed age at capture should be a “fractional age” such that it represents completed years of growth plus the fraction of the current year’s growth season completed (i.e., the “current age” should be something like 10.9 rather than 10). An example of this is seen by comparing the observed length at capture (in TL) and the back-calculated length (in BC.Len) to age-1 for the first fish in the data.frame (first line in data shown above).

Some of the plots below require a data.frame where the length and age for the oldest age match in time. In other words, this data.frame should contain the length of the fish on the fish’s last “birthday.” With these data, that length is the back-calculated length at the age (in BC.Age) that matches the age of the fish at the time of capture (in Est.Age). With other data, that length may simply be the length of the fish at the time of capture. An example of this data.frame is below (especially compare the last five lines below to the last five lines in the previous data.frame snippet above).

ID            Est.Age   TL  BC.Age  BC.Len
2002.10416.F        1  232       1   81.21
2002.10493.F        1  248       1  114.60
2002.10606.F        1  287       1  112.21
2002.1153.F         1  273       1  116.29
2002.1154.F         1  244       1  157.55
2002.20381.F       10  568      10  563.73
2002.20483.F       10  594      10  580.81
2002.20511.F       10  628      10  618.89
2002.20688.F       10  620      10  611.08
2002.20742.F       10  592      10  576.20
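Constructing this "last completed birthday" data.frame amounts to keeping the rows where the back-calculated age equals the age at capture. A quick sketch in pandas (toy data standing in for the Walleye file; the column names follow the snapshots above, and the mean summary mirrors the per-age means used later in the post):

```python
import pandas as pd

# Toy stand-in for the back-calculated length-at-age data (two fish)
df = pd.DataFrame({
    "ID":      ["A", "B", "B", "B"],
    "Est.Age": [1, 3, 3, 3],
    "BC.Age":  [1, 1, 2, 3],
    "BC.Len":  [81.2, 100.0, 200.0, 300.0],
})

# Length at the last completed "birthday": rows where BC.Age == Est.Age
df2 = df[df["BC.Age"] == df["Est.Age"]]

# Mean back-calculated length at age, grouped by age at capture
df3 = (df.groupby(["Est.Age", "BC.Age"], as_index=False)["BC.Len"]
         .mean()
         .rename(columns={"BC.Len": "mnBC.Len"}))
```

With real data the same two lines apply unchanged; only the input frame differs.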

Finally, in some of the plots below I include the mean back-calculated length at age. An example of this data.frame is below.

fEst.Age  BC.Age  mnBC.Len
       1       1  111.3464
       2       1  113.4633
       2       2  225.3060
       3       1   98.5338
       3       2  211.4096
      10       6  505.5437
      10       7  538.0147
      10       8  558.6757
      10       9  571.4900
      10      10  579.8577

Plots for Exploratory Data Analysis

When modeling fish growth, I explore the data to make observations about (i) variability in length at each age and (ii) the “shape” of growth (i.e., whether or not there is evidence for a horizontal asymptote or an inflection point). When using repeated-measures data, for example back-calculated lengths at age, I also observe the “shape” of growth for each individual and (iii) identify how the back-calculated lengths at age from older fish compare to the back-calculated lengths at age from younger fish (as major differences could suggest “Lee’s Phenomenon”, substantial changes in growth between year-classes or over time, or problems with the back-calculation model). In this section, I describe two plots (with some augmentations to the first type) that could be useful during this exploratory stage. In the last section, I describe a plot that could be used for publication.

Figure 1 shows longitudinal traces of back-calculated lengths at age for each fish, with separate colors for fish with different observed ages at capture. From this I see variability of approximately 100 mm at each age, individual fish that generally follow the typical shape of a VBGF, and some evidence that back-calculated lengths at earlier ages from “older” fish at capture are somewhat lower than the back-calculated lengths at earlier ages for “younger” fish at capture (this is most evident with the pinkish lines).

Figure 1: Traces of back-calculated lengths at age for each fish. Traces with the same color are fish with the same observed age at capture.

Figure 2 is the same as Figure 1 except that heavy lines have been added for the mean back-calculated lengths at age for fish from each age at capture. Here the evidence that back-calculated lengths at earlier ages from “older” fish at capture are somewhat lower than those from “younger” fish at capture is a little more obvious.

Figure 2: Traces of back-calculated lengths at age for each fish with mean back-calculated lengths at age shown by the heavier lines. Traces with the same color are fish with the same observed age at capture.

Figure 3 is the same as Figure 1 but adds points for the length and age of each fish at the last completed year of growth. These points are nearest to the observed lengths and ages at capture (and would equal the observed lengths and ages at capture for datasets where the fish were captured before the current season’s growth had commenced) and, thus, most nearly represent the data that would be used to fit a growth model if back-calculations had not been made. With this I observe that most traces of back-calculated lengths at age pass near these points, which suggests that “growth” has not changed dramatically over the time represented in these data and that the model used to back-calculate lengths and ages is not dramatically incorrect.

Figure 3: Traces of back-calculated lengths at age for each fish. Traces with the same color are fish with the same observed age at capture.

The previous spaghetti plots are cluttered because of the number of individual fish. This clutter can be somewhat reduced by creating separate spaghetti plots for each observed age at capture (Figure 4). From this, I observe the clear start of an asymptote at about age 5, an indication of a slight inflection around age 2 (most evident for fish that were older at capture), and that a good portion of the variability in length at early ages may be attributable to fish from different year-classes (i.e., of different observed ages at capture). It is, however, more difficult to see that back-calculated lengths at earlier ages from “older” fish at capture are somewhat lower than the back-calculated lengths at earlier ages for “younger” fish at capture. [Note that I left the facet for age-1 fish in this plot to remind me that there were age-1 fish in these data, even though they do not show a trace. Also, the color here is superfluous and could be removed. I left the color here for comparison with previous figures.]

Figure 4: Traces of back-calculated lengths at age for each fish separated by observed age at capture. Black lines in each facet are the mean back-calculated lengths at age for fish shown in that facet.

Publication Graphic with Model Overlaid

For publication I would include traces for individual fish, but without color-coding by estimated age at capture, and overlay the population-average growth model (i.e., the growth model expressed using the “fixed effects” for each model parameter; Figure 5).

Figure 5: Traces of back-calculated lengths at age for each fish (lighter black lines) with the population-averaged von Bertalanffy growth function (dark black line) overlaid. The equation for the best-fit von Bertalanffy growth function is shown.

R Code

## Required packages
library(FSA)      # for filterD(), headtail()
library(FSAdata)  # for WalleyeML
library(dplyr)    # for %>%, select(), arrange(), unique(), group_by(), et al.
library(magrittr) # for %<>%
library(ggplot2)  # for all of the plotting functions
library(nlme)     # for nlme()

## Custom ggplot2 theme ... the black-and-white theme with slightly
## lighter and dashed grid lines and slightly lighter facet labels.
theme_BC <- function() {
  theme_bw() %+replace%
    theme(panel.grid.major=element_line(color="gray95",linetype="dashed"),
          panel.grid.minor=element_line(color="gray95",linetype="dashed"),
          strip.background=element_rect(fill="gray95"))
}

## Base ggplot
## Uses the main data.frame, BC.Age on the x- and BC.Len on the y-axis
## group=ID to connect lengths at age for each fish
## Applies the custom theme, labels the axes, controls ticks on the x-axis
## Removes legends ("guides") related to colors and sizes
p <- ggplot(data=df,aes(x=BC.Age,y=BC.Len,group=ID)) +
  theme_BC() +
  scale_x_continuous("Back-Calculated Age",breaks=seq(0,10,2)) +
  scale_y_continuous("Back-Calculated Length (mm)") +
  guides(color=FALSE,size=FALSE)

## Main spaghetti plot
p + geom_line(aes(color=fEst.Age),alpha=1/8)

## Spaghetti plot augmented with mean back-calc'd lengths at age
p + geom_line(aes(color=fEst.Age),alpha=1/8) +
  geom_line(data=df3,aes(x=BC.Age,y=mnBC.Len,group=fEst.Age,color=fEst.Age),size=1)

## Spaghetti plot augmented with last-full-length-at-age points
p + geom_line(aes(color=fEst.Age),alpha=1/8) +
  geom_point(data=df2,aes(color=fEst.Age),alpha=1/5,size=0.9)

## Make facet labels for the plot below
lbls <- paste("Age =",levels(df$fEst.Age))
names(lbls) <- levels(df$fEst.Age)

## Spaghetti plot separated by age at capture (with means)
p + geom_line(aes(color=fEst.Age),alpha=1/5) +
  facet_wrap(~fEst.Age,labeller=labeller(fEst.Age=lbls)) +
  geom_line(data=df3,aes(x=BC.Age,y=mnBC.Len,group=NULL),color="black")

## von B equation for the plot
( tmp <- fixef(fitVB) )
lbl <- paste("L==",round(exp(tmp[1]),0),
             "*bgroup('(',1-e^-",round(tmp[2],3),
             "(Age-",round(tmp[3],2),"),')')")

## Add transparent traces for each individual fish
## Add a trace for the overall von B model (from the fixed effects)
## Add a label for the overall von B model
p + geom_line(aes(),alpha=1/15) +
  stat_function(data=data.frame(T=seq(1,10,0.1)),aes(x=T,y=NULL,group=NULL),
                fun=vbT,args=list(logLinf=fixef(fitVB)),size=1.1) +
  geom_text(data=data.frame(x=7,y=120),aes(x=x,y=y,group=NULL,label=lbl),
            parse=TRUE,size=4)

To leave a comment for the author, please follow the link and comment on their blog: fishR Blog.

### Practical Machine Learning with R and Python – Part 5

Tue, 11/07/2017 - 02:11

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

This is the 5th and probably penultimate part of my series on ‘Practical Machine Learning with R and Python’. The earlier parts of this series included

1. Practical Machine Learning with R and Python – Part 1: In this initial post, I touch upon univariate, multivariate, polynomial regression and KNN regression in R and Python
2. Practical Machine Learning with R and Python – Part 2: In this post, I discuss Logistic Regression, KNN classification and cross-validation error for both LOOCV and K-Fold in both R and Python
3. Practical Machine Learning with R and Python – Part 3: This post covered ‘feature selection’ in Machine Learning. Specifically I touch on best fit, forward fit, backward fit, ridge (L2 regularization) & lasso (L1 regularization). The post includes equivalent code in R and Python.
4. Practical Machine Learning with R and Python – Part 4: In this part I discussed SVMs, Decision Trees, validation, precision-recall, and ROC curves

This post ‘Practical Machine Learning with R and Python – Part 5’ discusses regression with B-splines, natural splines, smoothing splines, generalized additive models (GAMS), bagging, random forest and boosting

As with my previous posts in this series, this post is largely based on the following 2 MOOC courses

1. Statistical Learning, Prof Trevor Hastie & Prof Robert Tibshirani, Stanford Online
2. Applied Machine Learning in Python, Prof Kevyn Collins-Thompson, University of Michigan, Coursera

You can download this R Markdown file and associated data files from Github at MachineLearning-RandPython-Part5

For this part I have used the data sets from the UCI Machine Learning repository (Communities and Crime, and Auto MPG).

1. Splines

When performing regression (continuous or logistic) between a target variable and a feature (or a set of features), a single polynomial fit over the entire range of the data set usually does not perform well. Rather, we need to fit separate regression curves to different sections of the data set.

There are several techniques which do this, e.g. piecewise-constant functions, piecewise-linear functions, piecewise-quadratic/cubic/4th-order polynomial functions, etc. One such set of functions is the cubic splines, which fit cubic polynomials to successive sections of the dataset. The points where the cubic splines join are called ‘knots’.

Since each section has a different cubic polynomial, there could be discontinuities (or breaks) at these knots. To prevent these discontinuities, ‘natural splines’ and ‘smoothing splines’ ensure that the separate cubic functions have 2nd-order continuity at these knots with the adjacent pieces. 2nd-order continuity implies that the value, 1st-order derivative and 2nd-order derivative at these knots are equal.

A cubic spline with knots ξ_k, k = 1, 2, ..., K is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot. We can write

f(X) = β_0 + β_1 b_1(X) + β_2 b_2(X) + ... + β_{K+3} b_{K+3}(X)

The b_j(X) are called ‘basis’ functions. For the truncated power basis these are b_1(X) = X, b_2(X) = X^2, b_3(X) = X^3, and b_{3+k}(X) = (X − ξ_k)_+^3 for k = 1, 2, ..., K. The 1st and 2nd derivatives of cubic splines are continuous at the knots. Hence splines provide a smooth, continuous fit to the data by fitting different cubics to different sections of the data.
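This 2nd-order continuity is easy to check numerically. A minimal sketch using scipy's CubicSpline with bc_type="natural" (which additionally forces the 2nd derivative to zero at the boundary knots; this Python analogue is an illustrative assumption, not code from the post):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Knots are the data x-values; fit a natural cubic spline through (x, y)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.sin(x)
cs = CubicSpline(x, y, bc_type="natural")  # zero 2nd derivative at both ends

# 2nd-order continuity: the 2nd derivative matches on both sides of each
# interior knot (evaluate just to the left and just to the right)
eps = 1e-9
left = cs(x[1:-1] - eps, 2)
right = cs(x[1:-1] + eps, 2)
```

The interpolation condition means cs reproduces y at the knots exactly, while the "natural" boundary condition makes the ends of the curve much less wiggly, as discussed for natural splines below.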

1.1a Fit a 4th degree polynomial – R code

In the code below a non-linear function (a 4th-order polynomial) is used to fit the data. Usually when we fit a single polynomial to the entire data set, the tails of the fit tend to vary a lot, particularly if there are fewer points at the ends. Splines help in reducing this variation at the extremities.

library(dplyr)
library(ggplot2)
source('RFunctions-1.R')

# Read the data (from UCI)
df=read.csv("auto_mpg.csv",stringsAsFactors = FALSE)
df1 <- as.data.frame(sapply(df,as.numeric))

# Select specific columns
df2 <- df1 %>% dplyr::select(cylinder,displacement,horsepower,weight,
                             acceleration,year,mpg)
auto <- df2[complete.cases(df2),]

# Fit a 4th degree polynomial
fit=lm(mpg~poly(horsepower,4),data=auto)

# Display a summary of the fit
summary(fit)
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 4), data = auto)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -14.8820  -2.5802  -0.1682   2.2100  16.1434
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            23.4459     0.2209 106.161   <2e-16 ***
## poly(horsepower, 4)1 -120.1377     4.3727 -27.475   <2e-16 ***
## poly(horsepower, 4)2   44.0895     4.3727  10.083   <2e-16 ***
## poly(horsepower, 4)3   -3.9488     4.3727  -0.903    0.367
## poly(horsepower, 4)4   -5.1878     4.3727  -1.186    0.236
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.373 on 387 degrees of freedom
## Multiple R-squared:  0.6893, Adjusted R-squared:  0.6861
## F-statistic: 214.7 on 4 and 387 DF,  p-value: < 2.2e-16

# Get the range of horsepower
hp <- range(auto$horsepower)
# Create a sequence to be used for plotting
hpGrid <- seq(hp[1],hp[2],by=10)
# Predict for these values of horsepower, with standard errors
pred=predict(fit,newdata=list(horsepower=hpGrid),se=TRUE)
# Compute bands at 2 x SE on either side
seBands=cbind(pred$fit+2*pred$se.fit,pred$fit-2*pred$se.fit)
# Plot the fit with the standard-error bands
plot(auto$horsepower,auto$mpg,xlim=hp,cex=.5,col="black",xlab="Horsepower",
     ylab="MPG", main="Polynomial of degree 4")
lines(hpGrid,pred$fit,lwd=2,col="blue")
matlines(hpGrid,seBands,lwd=2,col="blue",lty=3)

1.1b Fit a 4th degree polynomial – Python code

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Read the auto data
autoDF = pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
# Select columns
autoDF1 = autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
# Convert all columns to numeric
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
# Drop NAs
autoDF3 = autoDF2.dropna()
autoDF3.shape
X = autoDF3[['horsepower']]
y = autoDF3['mpg']
# Create a polynomial of degree 4
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
# Fit a polynomial regression line
linreg = LinearRegression().fit(X_poly, y)
# Create a range of values
hpGrid = np.arange(np.min(X),np.max(X),10)
hp = hpGrid.reshape(-1,1)
# Transform to 4th degree
poly = PolynomialFeatures(degree=4)
hp_poly = poly.fit_transform(hp)
# Create a scatter plot
plt.scatter(X,y)
# Compute the prediction
ypred = linreg.predict(hp_poly)
plt.title("Polynomial of degree 4")
fig2 = plt.xlabel("Horsepower")
fig2 = plt.ylabel("MPG")
# Draw the regression curve
plt.plot(hp,ypred,c="red")
plt.savefig('fig1.png', bbox_inches='tight')

1.1c Fit a B-Spline – R Code

In the code below a B-spline is fit to the data. The B-spline requires the manual selection of knots.

# Splines
library(splines)
# Fit a B-spline to the data. Select knots at 60, 75, 100, 150
fit=lm(mpg~bs(horsepower,df=6,knots=c(60,75,100,150)),data=auto)
# Use the fitted regression to predict
pred=predict(fit,newdata=list(horsepower=hpGrid),se=T)
# Create a scatter plot
plot(auto$horsepower,auto$mpg,xlim=hp,cex=.5,col="black",xlab="Horsepower",
     ylab="MPG", main="B-Spline with 4 knots")
# Draw lines with 2 standard errors on either side
lines(hpGrid,pred$fit,lwd=2)
lines(hpGrid,pred$fit+2*pred$se,lty="dashed")
lines(hpGrid,pred$fit-2*pred$se,lty="dashed")
abline(v=c(60,75,100,150),lty=2,col="darkgreen")

1.1d Fit a Natural Spline – R Code

Here a ‘Natural Spline’ is used to fit the data. The natural spline extrapolates linearly beyond the boundary knots, so the ends of the function are much more constrained than those of a regular spline or a global polynomial, whose ends can wag a lot more. Natural splines do not require the explicit selection of knots.

# There is no need to select the knots here. There is a smoothing parameter
# which can be specified by the degrees-of-freedom 'df' parameter. The natural spline:
fit2=lm(mpg~ns(horsepower,df=4),data=auto)
pred=predict(fit2,newdata=list(horsepower=hpGrid),se=T)
plot(auto$horsepower,auto$mpg,xlim=hp,cex=.5,col="black",xlab="Horsepower",
     ylab="MPG", main="Natural Splines")
lines(hpGrid,pred$fit,lwd=2)
lines(hpGrid,pred$fit+2*pred$se,lty="dashed")
lines(hpGrid,pred$fit-2*pred$se,lty="dashed")

1.1.e Fit a Smoothing Spline – R code

Here a smoothing spline is used. Smoothing splines also do not require the explicit setting of knots. We can change the degrees-of-freedom (‘df’) parameter to get the best fit.

# The smoothing spline has a smoothing parameter, the degrees of freedom.
# This is too wiggly:
plot(auto$horsepower,auto$mpg,xlim=hp,cex=.5,col="black",xlab="Horsepower",
     ylab="MPG", main="Smoothing Splines")
# Here df is set to 16. This has a lot of variance
fit=smooth.spline(auto$horsepower,auto$mpg,df=16)
lines(fit,col="red",lwd=2)
# We can use cross-validation to let the spline pick the value of this
# smoothing parameter. We do not need to set the degrees of freedom 'df'
fit=smooth.spline(auto$horsepower,auto$mpg,cv=TRUE)
lines(fit,col="blue",lwd=2)
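For comparison, scipy's UnivariateSpline plays a similar role to smooth.spline on the Python side, with smoothing controlled by a factor s rather than df (s=0 interpolates; larger s gives a smoother curve). A hedged sketch on synthetic data (the data and names here are made up for illustration, not taken from the post):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Synthetic noisy "mpg vs horsepower"-like trend
rng = np.random.default_rng(0)
x = np.linspace(40, 230, 200)
y = 50 - 0.2 * x + rng.normal(scale=2.0, size=x.size)

# s=0 forces interpolation (very wiggly, analogous to a large df);
# a larger s trades fidelity for smoothness (analogous to a small df)
wiggly = UnivariateSpline(x, y, s=0)
smooth = UnivariateSpline(x, y, s=len(x) * 4.0)
```

The smoothed fit ends up placing far fewer knots than the interpolating one, which is the same wiggliness trade-off the df parameter controls in R.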

1.1e Splines – Python

There isn’t as much treatment of splines in Python and sklearn. I did find the LSQUnivariateSpline and UnivariateSpline classes in scipy. The LSQUnivariateSpline requires the explicit setting of knots.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from scipy.interpolate import LSQUnivariateSpline

autoDF = pd.read_csv("auto_mpg.csv",encoding="ISO-8859-1")
autoDF.shape
autoDF.columns
autoDF1 = autoDF[['mpg','cylinder','displacement','horsepower','weight','acceleration','year']]
autoDF2 = autoDF1.apply(pd.to_numeric, errors='coerce')
auto = autoDF2.dropna()
auto = auto[['horsepower','mpg']].sort_values('horsepower')
# Set the knots manually
knots = [65,75,100,150]
# Create arrays for X & y
X = np.array(auto['horsepower'])
y = np.array(auto['mpg'])
# Fit an LSQUnivariateSpline
s = LSQUnivariateSpline(X,y,knots)
# Plot the spline
xs = np.linspace(40,230,1000)
ys = s(xs)
plt.scatter(X, y)
plt.plot(xs, ys)
plt.savefig('fig2.png', bbox_inches='tight')

1.2 Generalized Additive Models (GAMs)

Generalized Additive Models (GAMs) are a really powerful ML tool.

In GAMs we use a different function for each of the variables. GAMs give a much better fit since we can choose any function for the different sections.

1.2a Generalized Additive Models (GAMs) – R Code

The plots below show the smoothing spline that is fit for each of the features horsepower, cylinder, displacement, year and acceleration. We can use any function, for example loess, a 4th-order polynomial, etc.

library(gam)
# Fit a smoothing spline for horsepower, cylinder, displacement and acceleration
gam=gam(mpg~s(horsepower,4)+s(cylinder,5)+s(displacement,4)+s(year,4)+
          s(acceleration,5),data=auto)
# Display the summary of the fit. This gives the significance of each parameter.
# An ANOVA is also given for each of the features
summary(gam)
##
## Call: gam(formula = mpg ~ s(horsepower, 4) + s(cylinder, 5) + s(displacement,
##     4) + s(year, 4) + s(acceleration, 5), data = auto)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -8.3190 -1.4436 -0.0261  1.2279 12.0873
##
## (Dispersion Parameter for gaussian family taken to be 6.9943)
##
##     Null Deviance: 23818.99 on 391 degrees of freedom
## Residual Deviance: 2587.881 on 370 degrees of freedom
## AIC: 1898.282
##
## Number of Local Scoring Iterations: 3
##
## Anova for Parametric Effects
##                     Df  Sum Sq Mean Sq  F value    Pr(>F)
## s(horsepower, 4)     1 15632.8 15632.8 2235.085 < 2.2e-16 ***
## s(cylinder, 5)       1   508.2   508.2   72.666 3.958e-16 ***
## s(displacement, 4)   1   374.3   374.3   53.514 1.606e-12 ***
## s(year, 4)           1  2263.2  2263.2  323.583 < 2.2e-16 ***
## s(acceleration, 5)   1   372.4   372.4   53.246 1.809e-12 ***
## Residuals          370  2587.9     7.0
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                    Npar Df Npar F     Pr(F)
## (Intercept)
## s(horsepower, 4)         3 13.825 1.453e-08 ***
## s(cylinder, 5)           3 17.668 9.712e-11 ***
## s(displacement, 4)       3 44.573 < 2.2e-16 ***
## s(year, 4)               3 23.364 7.183e-14 ***
## s(acceleration, 5)       4  3.848  0.004453 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow=c(2,3))
plot(gam,se=TRUE)

1.2b Generalized Additive Models (GAMs) – Python Code

I did not find an equivalent of GAMs in sklearn for Python. There was an early prototype (2012) on GitHub; it looks like it is still a work in progress or has probably been abandoned.

1.3 Tree based Machine Learning Models

Tree-based Machine Learning models are all based on the ‘bootstrapping’ technique. In bootstrapping, given a sample of size N, we create datasets of size N by sampling the original dataset with replacement. Machine Learning models are built on the different bootstrapped samples and then averaged.
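The bootstrap itself is a one-liner: draw N observations with replacement from the original N. A quick sketch in plain numpy (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(100)  # original sample of size N = 100

# One bootstrap sample: draw N observations with replacement
boot = rng.choice(data, size=data.size, replace=True)

# Roughly 1 - 1/e (~63.2%) of the original points appear in a given sample;
# the points never drawn are the "out-of-bag" (OOB) observations
in_bag = np.unique(boot)
oob = np.setdiff1d(data, in_bag)
```

Repeating this draw B times and fitting a model to each sample gives the B models that bagging and random forests then average.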

Decision Trees as seen above have the tendency to overfit. There are several techniques that help to avoid this namely a) Bagging b) Random Forests c) Boosting

Bagging, Random Forest and Gradient Boosting

Bagging: Bagging, or Bootstrap Aggregation, decreases the variance of predictions by creating separate Decision Tree based ML models on the different samples and then averaging these ML models.

Random Forests: Bagging is a greedy algorithm: each tree produces splits over all variables in an attempt to minimize the error, so the different ML models end up highly correlated. Random Forests remove this shortcoming by considering only a random subset of the features at each split. Hence the features chosen and the resulting trees are largely uncorrelated, and when these ML models are averaged the performance is much better.

Boosting: Gradient Boosted Decision Trees also use an ensemble of trees, but they don’t build Machine Learning models on random sets of features at each step. Rather, small and simple trees are built, and successive trees try to minimize the error of the earlier trees.

Out of Bag (OOB) Error: In Random Forests and other bagged models, for each bootstrap sample taken from the dataset there will be samples left out. These are known as Out-of-Bag samples. Classification accuracy computed on these OOB samples is known as the OOB error.
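A hedged scikit-learn sketch of the three ideas side by side, using the library's bundled breast-cancer data (not the post's cancer.csv); oob_score=True exposes the OOB accuracy just described:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)

X, y = load_breast_cancer(return_X_y=True)

# Bagging: every split may use the full feature set, so trees are correlated
bag = BaggingClassifier(n_estimators=100, oob_score=True,
                        random_state=0).fit(X, y)

# Random Forest: a random subset of features at each split decorrelates trees
rf = RandomForestClassifier(n_estimators=100, oob_score=True,
                            random_state=0).fit(X, y)

# Boosting: shallow trees fit sequentially to the previous trees' errors
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                random_state=0).fit(X, y)

# bag.oob_score_ and rf.oob_score_ hold the OOB classification accuracy
```

This is only an illustrative comparison on toy data, not a benchmark; the series' own tree examples in R follow below.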

1.31a Decision Trees – R Code

The code below creates a Decision tree with the cancer training data. The summary of the fit is output. Based on the ML model, the predict function is used on test data and a confusion matrix is output.

# Read the cancer data
library(tree)
library(caret)
library(e1071)
cancer <- read.csv("cancer.csv",stringsAsFactors = FALSE)
cancer <- cancer[,2:32]
cancer$target <- as.factor(cancer$target)
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]
# Create Decision Tree
cancerStatus=tree(target~.,train)
summary(cancerStatus)
##
## Classification tree:
## tree(formula = target ~ ., data = train)
## Variables actually used in tree construction:
## [1] "worst.perimeter"      "worst.concave.points" "area.error"
## [4] "worst.texture"        "mean.texture"         "mean.concave.points"
## Number of terminal nodes:  9
## Residual mean deviance:  0.1218 = 50.8 / 417
## Misclassification error rate: 0.02347 = 10 / 426
pred <- predict(cancerStatus,newdata=test,type="class")
confusionMatrix(pred,test$target)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 49  7
##          1  8 78
##
##                Accuracy : 0.8944
##                  95% CI : (0.8318, 0.9397)
##     No Information Rate : 0.5986
##     P-Value [Acc > NIR] : 4.641e-15
##
##                   Kappa : 0.7795
##  Mcnemar's Test P-Value : 1
##
##             Sensitivity : 0.8596
##             Specificity : 0.9176
##          Pos Pred Value : 0.8750
##          Neg Pred Value : 0.9070
##              Prevalence : 0.4014
##          Detection Rate : 0.3451
##    Detection Prevalence : 0.3944
##       Balanced Accuracy : 0.8886
##
##        'Positive' Class : 0
##
# Plot decision tree with labels
plot(cancerStatus)
text(cancerStatus,pretty=0)

1.31b Decision Trees – Cross Validation – R Code

We can also perform a Cross Validation on the data to identify the Decision Tree which will give the minimum deviance.

library(tree)
cancer <- read.csv("cancer.csv",stringsAsFactors = FALSE)
cancer <- cancer[,2:32]
cancer$target <- as.factor(cancer$target)
train_idx <- trainTestSplit(cancer,trainPercent=75,seed=5)
train <- cancer[train_idx, ]
test <- cancer[-train_idx, ]
# Create Decision Tree
cancerStatus=tree(target~.,train)
# Execute 10 fold cross validation
cvCancer=cv.tree(cancerStatus)
plot(cvCancer)

# Plot the deviance vs the size of the tree
plot(cvCancer$size,cvCancer$dev,type='b')

prunedCancer=prune.tree(cancerStatus,best=4)
plot(prunedCancer)
text(prunedCancer,pretty=0)

pred <- predict(prunedCancer,newdata=test,type="class")
confusionMatrix(pred,test$target)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction  0  1
##          0 50  7
##          1  7 78
##
##                Accuracy : 0.9014
##                  95% CI : (0.8401, 0.945)
##     No Information Rate : 0.5986
##     P-Value [Acc > NIR] : 7.988e-16
##
##                   Kappa : 0.7948
##  Mcnemar's Test P-Value : 1
##
##             Sensitivity : 0.8772
##             Specificity : 0.9176
##          Pos Pred Value : 0.8772
##          Neg Pred Value : 0.9176
##              Prevalence : 0.4014
##          Detection Rate : 0.3521
##    Detection Prevalence : 0.4014
##       Balanced Accuracy : 0.8974
##
##        'Positive' Class : 0
##

1.31c Decision Trees – Python Code

Below is the Python code for creating Decision Trees. The accuracy, precision, recall and F1 score are computed on the test data set.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import graphviz

cancer = load_breast_cancer()
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(X_cancer, y_cancer, random_state = 0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of Decision Tree classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))
y_predicted=clf.predict(X_test)
confusion = confusion_matrix(y_test, y_predicted)
print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_predicted)))
print('Precision: {:.2f}'.format(precision_score(y_test, y_predicted)))
print('Recall: {:.2f}'.format(recall_score(y_test, y_predicted)))
print('F1: {:.2f}'.format(f1_score(y_test, y_predicted)))

# Plot the Decision Tree
clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
dot_data = tree.export_graphviz(clf, out_file=None,
                         feature_names=cancer.feature_names,
                         class_names=cancer.target_names,
                         filled=True, rounded=True,
                         special_characters=True)
graph = graphviz.Source(dot_data)
graph
## Accuracy of Decision Tree classifier on training set: 1.00
## Accuracy of Decision Tree classifier on test set: 0.87
## Accuracy: 0.87
## Precision: 0.97
## Recall: 0.82
## F1: 0.89

1.31d Decision Trees – Cross Validation – Python Code

In the code below 10-fold cross validation is performed for different depths of the tree and the accuracy is computed. The accuracy on the held-out folds seems to plateau when the depth is 8, but it is seen to increase again from 10 to 12. More analysis needs to be done here.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split, KFold

(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)

def computeCVAccuracy(X,y,folds):
    accuracy=[]
    foldAcc=[]
    depth=[1,2,3,4,5,6,7,8,9,10,11,12]
    for i in depth:
        kf = KFold(len(X),n_folds=folds)
        for train_index, test_index in kf:
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]
            clf = DecisionTreeClassifier(max_depth = i).fit(X_train, y_train)
            score=clf.score(X_test, y_test)
            accuracy.append(score)
        foldAcc.append(np.mean(accuracy))
    return(foldAcc)

cvAccuracy=computeCVAccuracy(pd.DataFrame(X_cancer),pd.DataFrame(y_cancer),folds=10)
df1=pd.DataFrame(cvAccuracy)
df1.columns=['cvAccuracy']
df=df1.reindex([1,2,3,4,5,6,7,8,9,10,11,12])
df.plot()
plt.title("Decision Tree - 10-fold Cross Validation Accuracy vs Depth of tree")
plt.xlabel("Depth of tree")
plt.ylabel("Accuracy")
plt.savefig('fig3.png', bbox_inches='tight')

1.4a Random Forest – R code

A Random Forest is fit using the Boston data. The summary shows that 4 variables were randomly chosen at each split and the resulting ML model explains 88.72% of the variance (computed on the Out of Bag samples). The variable importance is also plotted. It can be seen that 'rooms' and 'status' are the most influential features in the model.

library(randomForest)
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Select specific columns
Boston <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                               "distances","highways","tax","teacherRatio","color",
                               "status","medianValue")
# Fit a Random Forest on the Boston training data
rfBoston=randomForest(medianValue~.,data=Boston)
# Display the summary of the fit. It can be seen that the MSE is 9.52
# and the percentage variance explained is 88.72%. About 4 variables were tried at each
# split for a maximum of 500 trees.
# The MSE and percent variance are computed on the Out of Bag samples
rfBoston
##
## Call:
##  randomForest(formula = medianValue ~ ., data = Boston)
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##           Mean of squared residuals: 9.521672
##                     % Var explained: 88.72
# List and plot the variable importances
importance(rfBoston)
##              IncNodePurity
## crimeRate        2602.1550
## zone              258.8057
## indus            2599.6635
## charles           240.2879
## nox              2748.8485
## rooms           12011.6178
## age              1083.3242
## distances        2432.8962
## highways          393.5599
## tax              1348.6987
## teacherRatio     2841.5151
## color             731.4387
## status          12735.4046
varImpPlot(rfBoston)

1.4b Random Forest-OOB and Cross Validation Error – R code

The figure below shows the OOB error and the Cross Validation error vs the ‘mtry’. Here mtry indicates the number of random features that are chosen at each split. The lowest test error occurs when mtry = 8

library(randomForest)
df=read.csv("Boston.csv",stringsAsFactors = FALSE) # Data from MASS - SL
# Select specific columns
Boston <- df %>% dplyr::select("crimeRate","zone","indus","charles","nox","rooms","age",
                               "distances","highways","tax","teacherRatio","color",
                               "status","medianValue")
# Split into training and test sets
train_idx <- trainTestSplit(Boston,trainPercent=75,seed=5)
train <- Boston[train_idx, ]
test <- Boston[-train_idx, ]
# Initialize OOB and test error
oobError <- NULL
testError <- NULL
# In the code below the number of variables to consider at each split is increased
# from 1 - 13 (max features) and the OOB error and the test MSE are computed
for(i in 1:13){
    fitRF=randomForest(medianValue~.,data=train,mtry=i,ntree=400)
    oobError[i] <- fitRF$mse[400]
    pred <- predict(fitRF,newdata=test)
    testError[i] <- mean((pred-test$medianValue)^2)
}
# Plot the OOB and test error against mtry
matplot(1:13,cbind(testError,oobError),pch=19,col=c("red","blue"),
        type="b",xlab="mtry (no of variables at each split)",
        ylab="Mean Squared Error",
        main="Random Forest - OOB and Test Error")
legend("topright",legend=c("Test","OOB"),pch=19,col=c("red","blue"))

1.4c Random Forest – Python code

The Python code for Random Forest regression is shown below. The training and test scores are computed. The variable importance shows that 'rooms' and 'status' are the most influential variables.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
X=df[['crimeRate','zone', 'indus','charles','nox','rooms',
      'age','distances','highways','tax',
      'teacherRatio','color','status']]
y=df['medianValue']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
regr = RandomForestRegressor(max_depth=4, random_state=0)
regr.fit(X_train, y_train)
print('R-squared score (training): {:.3f}'
     .format(regr.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(regr.score(X_test, y_test)))
feature_names=['crimeRate','zone', 'indus','charles','nox','rooms',
               'age','distances','highways','tax',
               'teacherRatio','color','status']
print(regr.feature_importances_)
plt.figure(figsize=(10,6),dpi=80)
c_features=X_train.shape[1]
plt.barh(np.arange(c_features),regr.feature_importances_)
plt.xlabel("Feature importance")
plt.ylabel("Feature name")
plt.yticks(np.arange(c_features), feature_names)
plt.tight_layout()
plt.savefig('fig4.png', bbox_inches='tight')
## R-squared score (training): 0.917
## R-squared score (test): 0.734
## [ 0.03437382  0.          0.00580335  0.          0.00731004  0.36461548
##   0.00638577  0.03432173  0.0041244   0.01732328  0.01074148  0.0012638
##   0.51373683]

1.4d Random Forest – Cross Validation and OOB Error – Python code

As with R, 'max_features' determines the number of randomly chosen features the random forest will consider at each split. The plot shows that the MSE is lowest when max_features = 8.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
X=df[['crimeRate','zone', 'indus','charles','nox','rooms',
      'age','distances','highways','tax',
      'teacherRatio','color','status']]
y=df['medianValue']
cvError=[]
oobError=[]
oobMSE=[]
for i in range(1,13):
    regr = RandomForestRegressor(max_depth=4, n_estimators=400,max_features=i,oob_score=True,random_state=0)
    mse= np.mean(cross_val_score(regr, X, y, cv=5,scoring = 'neg_mean_squared_error'))
    # Since this is neg_mean_squared_error I have inverted the sign to get MSE
    cvError.append(-mse)
    # Fit on all data to compute OOB error
    regr.fit(X, y)
    # Record the OOB error for each max_features=i setting
    oob = 1 - regr.oob_score_
    oobError.append(oob)
    # Get the Out of Bag prediction
    oobPred=regr.oob_prediction_
    # Compute the Mean Squared Error between OOB Prediction and target
    mseOOB=np.mean(np.square(oobPred-y))
    oobMSE.append(mseOOB)
# Plot the CV Error and OOB MSE against max_features
maxFeatures=np.arange(1,13)
cvError=pd.DataFrame(cvError,index=maxFeatures)
oobMSE=pd.DataFrame(oobMSE,index=maxFeatures)
errors=pd.concat([cvError,oobMSE],axis=1)
errors.columns=['CV Error','OOB MSE']
fig8=errors.plot()
fig8=plt.title('Random forest - CV Error and OOB Error vs max_features')
fig8.figure.savefig('fig8.png', bbox_inches='tight')
# Plot the OOB Error vs max_features
plt.plot(range(1,13),oobError)
fig2=plt.title("Random Forest - OOB Error vs max_features (variable no of features)")
fig2=plt.xlabel("max_features (variable no of features)")
fig2=plt.ylabel("OOB Error")
fig2.figure.savefig('fig7.png', bbox_inches='tight')

1.5a Boosting – R code

Here a Gradient Boosted ML model is built with n.trees = 5000, a learning rate of 0.01 and an interaction depth of 4. The feature importance plot again shows that rooms and status are the 2 most important features. The MSE vs the number of trees plateaus at around 2000 trees.

library(gbm)
# Perform gradient boosting on the Boston data set. The distribution is gaussian since we
# are minimizing the MSE. The interaction depth specifies the number of splits
boostBoston=gbm(medianValue~.,data=train,distribution="gaussian",n.trees=5000,
                shrinkage=0.01,interaction.depth=4)
# The summary gives the variable importance. The 2 most significant variables are
# number of rooms and lower status
summary(boostBoston)

##                       var    rel.inf
## rooms               rooms 42.2267200
## status             status 27.3024671
## distances       distances  7.9447972
## crimeRate       crimeRate  5.0238827
## nox                   nox  4.0616548
## teacherRatio teacherRatio  3.1991999
## age                   age  2.7909772
## color               color  2.3436295
## tax                   tax  2.1386213
## charles           charles  1.3799109
## highways         highways  0.7644026
## indus               indus  0.7236082
## zone                 zone  0.1001287
# The plots below show how each variable relates to the median value of the home. As
# the number of rooms increases the median value increases, and as lower status increases
# the median value decreases
par(mfrow=c(1,2))
# Plot the relation between the top 2 features and the target
plot(boostBoston,i="rooms")
plot(boostBoston,i="status")

# Create a sequence of trees between 100-5000 incremented by 50
nTrees=seq(100,5000,by=50)
# Predict the values for the test data
pred <- predict(boostBoston,newdata=test,n.trees=nTrees)
# Compute the MSE for each number of trees
boostError <- apply((pred-test$medianValue)^2,2,mean)
# Plot the MSE vs the number of trees
plot(nTrees,boostError,pch=19,col="blue",ylab="Mean Squared Error",
     main="Boosting Test Error")

1.5b Cross Validation Boosting – R code

Included below is the cross validation error vs the learning rate. The lowest error occurs for a learning rate of 0.09.

cvError <- NULL
s <- c(.001,0.01,0.03,0.05,0.07,0.09,0.1)
for(i in seq_along(s)){
    cvBoost=gbm(medianValue~.,data=train,distribution="gaussian",n.trees=5000,
                shrinkage=s[i],interaction.depth=4,cv.folds=5)
    cvError[i] <- mean(cvBoost$cv.error)
}
# Create a data frame for plotting
a <- rbind(s,cvError)
b <- as.data.frame(t(a))
# Plot the Cross Validation error against the shrinkage (learning rate)
ggplot(b,aes(s,cvError)) + geom_point() + geom_line(color="blue") +
    xlab("Shrinkage") + ylab("Cross Validation Error") +
    ggtitle("Gradient boosted trees - Cross Validation error vs Shrinkage")

1.5c Boosting – Python code

A gradient boosted ML model is created in Python below. The R-squared score is computed on the training and test data.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
X=df[['crimeRate','zone', 'indus','charles','nox','rooms',
      'age','distances','highways','tax',
      'teacherRatio','color','status']]
y=df['medianValue']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
regr = GradientBoostingRegressor()
regr.fit(X_train, y_train)
print('R-squared score (training): {:.3f}'
     .format(regr.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
     .format(regr.score(X_test, y_test)))
## R-squared score (training): 0.983
## R-squared score (test): 0.821

1.5d Cross Validation Boosting – Python code

The cross validation error is computed as the learning rate is varied. The minimum CV error occurs at a learning rate of around 0.04.

import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("Boston.csv",encoding = "ISO-8859-1")
X=df[['crimeRate','zone', 'indus','charles','nox','rooms',
      'age','distances','highways','tax',
      'teacherRatio','color','status']]
y=df['medianValue']
cvError=[]
learning_rate =[.001,0.01,0.03,0.05,0.07,0.09,0.1]
for lr in learning_rate:
    regr = GradientBoostingRegressor(max_depth=4, n_estimators=400,learning_rate =lr,random_state=0)
    mse= np.mean(cross_val_score(regr, X, y, cv=10,scoring = 'neg_mean_squared_error'))
    # Since this is neg_mean_squared_error I have inverted the sign to get MSE
    cvError.append(-mse)
plt.plot(learning_rate,cvError)
plt.title("Gradient Boosting - 10-fold CV - Mean Squared Error vs learning rate")
plt.xlabel("Learning rate")
plt.ylabel("Mean Squared Error")
plt.savefig('fig6.png', bbox_inches='tight')

Conclusion This post covered Splines and Tree based ML models like Bagging, Random Forest and Boosting. Stay tuned for further updates.

You may also like

To see all posts see Index of posts


To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### RcppQuantuccia 0.0.2

Tue, 11/07/2017 - 01:33

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A first maintenance release of RcppQuantuccia got to CRAN earlier today.

RcppQuantuccia brings the Quantuccia header-only subset / variant of QuantLib to R. At present it mostly offers calendaring, but Quantuccia just got a decent amount of new functions so hopefully we can offer more here too.

This release was motivated by the upcoming Rcpp release which will deprecate the old Date and Datetime vectors in favour of newer ones. So this release of RcppQuantuccia switches to the newer ones.

Other changes are below:

Changes in version 0.0.2 (2017-11-06)

• Added bespoke and joint calendars.

• Using new date(time) vectors (#6).

Courtesy of CRANberries, there is also a diffstat report relative to the previous release. More information is on the RcppQuantuccia page. Issues and bug reports should go to the GitHub issue tracker.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

### Data acquisition in R (2/4)

Tue, 11/07/2017 - 01:00

(This article was first published on Ilya Kashnitsky, and kindly contributed to R-bloggers)

R is an incredible tool for reproducible research. In the present series of blog posts I want to show how one can easily acquire data within an R session, documenting every step in a fully reproducible way. There are numerous data acquisition options for R users. Of course, I do not attempt to show all the data possibilities and tend to focus mostly on demographic data. If your prime interest lies outside human population statistics, it’s worth checking the amazing Open Data Task View.

The series consists of four posts:

For each of the data acquisition options I provide a small visualization use case.

Eurostat

The package eurostat has a function search_eurostat to search for relevant datasets. Sadly, though, this function does not provide the codes of all the datasets that have the expression of interest in the title. For example, a search on the expression life expectancy produces an output with just 2 results, which does not make much sense. Thus, the best strategy is to go to the Eurostat website, find the needed dataset code, and fetch the desired dataset by its code. Note that there is a separate database for subregional-level indicators.

I am going to download life expectancy estimates for European countries; the dataset code is demo_mlexpec.

It can take a while because the dataset is quite big (0.4m obs). If the automated procedure does not work, one can download the data manually via the Bulk Download Service of Eurostat.

Let’s have a look at the remaining life expectancy at age 65, the most common conventional age at retirement, in some European countries, separately for males, females, and total population. Some data preparation steps are needed. First, we only need the life expectancy estimates for those aged 65. Next, we don’t need total population, only males and females separately. Finally, let’s select just a bunch of countries: Germany, France, Italy, Russia, Spain, the UK.

e0 %>% filter(! sex == "T",
              age == "Y65",
              geo %in% c("DE", "FR", "IT", "RU", "ES", "UK")) %>%
        ggplot(aes(x = time %>% year(), y = values, color = sex))+
        geom_path()+
        facet_wrap(~ geo, ncol = 3)+
        labs(y = "Life expectancy at age 65", x = NULL)+
        theme_minimal()

World Bank

There are several packages that provide an API to World Bank data. Probably the most elaborate one is the fairly recent wbstats. Its wbsearch function does a great job searching through the database for relevant datasets. For example, wbsearch("fertility") produces a dataframe of 339 entries with the codes and names of the relevant indicators.

library(tidyverse)
library(wbstats)

# search for a dataset of interest
wbsearch("fertility") %>% head

        indicatorID                                                    indicator
2479 SP.DYN.WFRT.Q5 Total wanted fertility rate (births per woman): Q5 (highest)
2480 SP.DYN.WFRT.Q4           Total wanted fertility rate (births per woman): Q4
2481 SP.DYN.WFRT.Q3           Total wanted fertility rate (births per woman): Q3
2482 SP.DYN.WFRT.Q2           Total wanted fertility rate (births per woman): Q2
2483 SP.DYN.WFRT.Q1  Total wanted fertility rate (births per woman): Q1 (lowest)
2484    SP.DYN.WFRT                     Wanted fertility rate (births per woman)

Let’s have a look at the indicator Lifetime risk of maternal death (%) (code SH.MMR.RISK.ZS). World Bank provides a variety of country groupings. One of the curious groupings divides countries based on the advance over the Demographic Transition path. Below I plot our selected indicator for (1) the countries that have passed the Demographic Transition, (2) the countries that haven’t yet experienced demographic dividend, and (3) the whole World.

# fetch the selected dataset
df_wb <- wb(indicator = "SH.MMR.RISK.ZS", startdate = 2000, enddate = 2015)

# have a look at the data for one year
df_wb %>% filter(date == 2015) %>% View

df_wb %>% filter(iso2c %in% c("V4", "V1", "1W")) %>%
        ggplot(aes(x = date %>% as.numeric(), y = value, color = country))+
        geom_path(size = 1)+
        scale_color_brewer(NULL, palette = "Dark2")+
        labs(x = NULL, y = NULL, title = "Lifetime risk of maternal death (%)")+
        theme_minimal()+
        theme(panel.grid.minor = element_blank(),
              legend.position = c(.8, .9))

OECD

Organization for Economic Cooperation and Development provides detailed economic and demographic data on the member countries. There is an R package OECD that streamlines the use of their data in R. The function search_dataset works nicely to browse the available datasets by keywords. Then get_dataset would fetch the chosen dataset. In the example below I grab the data on the duration of unemployment and then plot the data for the male population of EU16, EU28 and the US as heatmaps.

library(tidyverse)
library(viridis)
library(OECD)

# search by keyword
search_dataset("unemployment") %>% View

# download the selected dataset
df_oecd <- get_dataset("AVD_DUR")

# turn variable names to lowercase
names(df_oecd) <- names(df_oecd) %>% tolower()

df_oecd %>% filter(country %in% c("EU16", "EU28", "USA"),
                   sex == "MEN",
                   ! age == "1524") %>%
        ggplot(aes(obstime, age, fill = obsvalue))+
        geom_tile()+
        scale_fill_viridis("Months", option = "B")+
        scale_x_discrete(breaks = seq(1970, 2015, 5) %>% paste)+
        facet_wrap(~ country, ncol = 1)+
        labs(x = NULL, y = "Age groups",
             title = "Average duration of unemployment in months, males")+
        theme_minimal()

WID

The World Wealth and Income Database (WID) is a harmonized dataset on income and wealth inequality. The developers of the database provide an R package to get their data, which so far is only available from GitHub.

library(tidyverse)

#install.packages("devtools")
devtools::install_github("WIDworld/wid-r-tool")
library(wid)

The function to acquire data is download_wid(). To specify the arguments, one would have to consult help pages of the package and select desired datasets.

?wid_series_type ?wid_concepts

The following nice example is adapted from the package vignette. It shows the share of wealth that was owned by the richest 1% and 10% of population in France and Great Britain.

df_wid <- download_wid(
        indicators = "shweal", # Shares of personal wealth
        areas = c("FR", "GB"), # In France and Great Britain
        perc = c("p90p100", "p99p100") # Top 10% and top 1%
)

df_wid %>% ggplot(aes(x = year, y = value, color = country)) +
        geom_path()+
        labs(title = "Top 1% and top 10% personal wealth shares in France and Great Britain",
             y = "top share")+
        facet_wrap(~ percentile)+
        theme_minimal()

Full reproducibility

All the code chunks together can be found in this gist.


To leave a comment for the author, please follow the link and comment on their blog: Ilya Kashnitsky.

### Automating Summary of Surveys with RMarkdown

Tue, 11/07/2017 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

This guide shows how to automate the summary of surveys with R and R Markdown using RStudio. This is great for portions of the document that don’t change (e.g., “the survey shows substantial partisan polarization”). The motivation is really twofold: efficiency (maximize the reusabililty of code, minimize copying and pasting errors) and reproducibility (maximize the number of people and computers that can reproduce findings).

The basic setup is to write an Rmd file that will serve as a template, and then a short R script that loops over each data file (using library(knitr)). The render function then turns the Rmd into documents or slides (typically in PDF, HTML, or docx) by taking file metadata as a parameter.

There are countless ways to summarize a survey in R. This guide shows a few basics with ggplot and questionr, but focuses on the overall workflow (file management, etc.). Following the instructions here, you should be able to reproduce all four reports (and in principle, many more) despite only writing code to clean one survey. Most of the code is displayed in this document, but all code is found in either pewpoliticaltemplate.Rmd or pew_report_generator.R. All code, as well as the outputted documents, can be found here, and details on obtaining the data are found below.

Software

RStudio’s interface with library(rmarkdown) is evolving rapidly. Installing the current version of RStudio is highly recommended, particularly for the previews of the R Markdown code (this doc was created with RStudio 1.1.83). (Here is my install guide, which includes links to tutorials and cheat sheets. For somewhat more advanced survey data cleaning, click here.)

Even if you’ve knit Rmd before, your libraries may not be new enough to create parameterized reports. I recommend installing pacman, which has a convenience function p_load that smooths package installation, loading, and maintenance. I’d recommend p_load particularly if you are collaborating, say, on Dropbox.

install.packages("pacman")
library(pacman)
p_load(rmarkdown, knitr, foreign, scales, questionr, tidyverse, update = TRUE)

Remember PDF requires LaTeX (install links). By contrast, knitting to docx or HTML does not require LaTeX. Creating pptx is possible in R with library(ReporteRs), but is not discussed here.

The Data

Download the four “political surveys” from Pew Research available here (i.e., January, March, August, and October 2016). You may recall, some politics happened in 2016. (The data is free, provided you take a moment to make an account.)

• If need be, decompress each zip folder.

Three of my folders have intuitive names (Jan16, Mar16, and Oct16), but one of my folders picked up a lengthy name, http___www.people-press.org_files_datasets_Aug16. Don’t worry about that.

• Create a new folder, call it, say, automating.

• Move all four data folders into automating.

Please note that I have no affiliation (past or present) with Pew Research. I simply think that they do great work and they make it relatively hassle-free to get started with meaningful data sets.

The R Notebook (R Markdown) Template

(R Markdown ninjas can skip this section.)

In RStudio, create a new R Notebook and save it as pewpoliticaltemplate.Rmd in the automating folder you just created. This document will likely knit to HTML by default; hold down the Knit button to change it to PDF. Add fields to the header as desired. The sample header below automatically puts today’s date on the document by parsing the expression next to Date: as R code. classoption: landscape may help with wide tables. You can also specify the file that contains your bibliography in several formats, such as BibTex and EndNote (citation details).

Next add an R code chunk to pewpoliticaltemplate.Rmd to take care of background stuff like formatting. Though setting a working directory would not be needed just to knit the Rmd, the directory must be set by knitr::opts_knit\$set(root.dir = '...') to automate document prep. (setwd isn’t needed in the Rmd, but setting the working directory separately in Console is recommended if you’re still editing.)

The Play button at the top right gives a preview of the code’s output, which is handy. If some part of the analysis is very lengthy, it only needs to be run once, freeing you to tinker with graphics and the like.

– Now the default settings have been set and you don’t need to worry about suppressing warnings and so on with each code chunk. You can, of course, change them case-by-case as you like.

– Unlike in R, when setting the format options for individual code chunks (as shown above to suppress warnings before the defaults kick in), you do need to type out the words TRUE and FALSE in full.

– Unlike the template, in this doc, I’ve set the defaults to echo = TRUE and tidy = TRUE to display the R code more pleasingly.

– The setting results = "asis" is very useful for professionally formatted tables (shown below), but is not recommended for raw R output of matrices and tables. To make raw data frames display with kable by default, see here.

The Template

I find it easiest to write a fully working example and then make little changes as needed so that knitr::render() can loop over the data sets. First things first.

    library(foreign)  # provides read.spss()
    survey <- read.spss("Jan16/Jan16 public.sav", to.data.frame = TRUE)

Summary stats can easily be inserted into the text like so:
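For example, an inline R expression along these lines (the particular statistic shown is illustrative, not necessarily the one in the original template) drops a number straight into a sentence:

```markdown
This wave of the survey contains `r nrow(survey)` respondents.
```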

The template contains additional examples with survey weights (lengthier calculations should be done in blocks of code and then their result referenced with that inline style).

Here is a basic plot we might want, which reflects the survey weights. facet_grid() is used to create analogous plots for each party identification. The plot uses the slightly wonky syntax y = (..count..)/sum(..count..) to display the results as percentages rather than counts. Note that some code that cleans the data (mostly shortening labels) is omitted for brevity, but can be found here.

    PA <- ggplot(survey) + theme_minimal()
    PA <- PA + geom_bar(aes(q1, y = (..count..)/sum(..count..),
                            weight = weight, fill = q1))
    PA <- PA + facet_grid(party.clean ~ .) +
      theme(strip.text.y = element_text(angle = 45))
    PA <- PA + xlab("") + ylab("Percent of Country")
    PA <- PA + ggtitle("Presidential Approval: January 2016")
    PA <- PA + scale_y_continuous(labels = scales::percent)
    PA

Here is an example of a weighted crosstab. knitr::kable will create a table that’s professional in appearance (when knit as PDF; kable takes the style of an academic journal).

    kable(wtd.table(survey$ideo, survey$sex, survey$weight)/nrow(survey), digits = 2)

                        Male   Female
    Very conservative   0.04   0.03
    Conservative        0.14   0.13
    Moderate            0.20   0.20
    Liberal             0.08   0.09
    Very liberal        0.03   0.03
    DK*                 0.02   0.03

Suppose we want to display Presidential Approval, where the first column provides overall approval and subsequent columns are crosstabs for various factors of interest (using the cell weights). I’ve written a convenience function called Xtabs that creates this format, which is common in the survey world.

    source("https://raw.githubusercontent.com/rdrr1990/datascience101/master/automating/Xtabs.R")
    kable(Xtabs(survey, "q1", c("sex", "race"), weight = "cellweight"))

                       Overall  Male   Female  White (nH)  Black (nH)  Hispanic  Other   DK*
    Approve            45.6%    21.3%  24.2%   20.4%       9.48%       10.2%     4.97%   0.443%
    Disapprove         48.5%    25.5%  23%     39.6%       1.33%       3.95%     2.53%   1.12%
    Don't Know (VOL)   5.94%    2.67%  3.27%   3.39%       0.646%      1.14%     0.489%  0.269%

Suppose we want to do many crosstabs. The syntax survey$ideo is widely used for convenience, but survey[["ideo"]] will serve us better since it allows us to work with vectors of variable names (details from win-vector). Of the three comparisons below, the first two are identical, but the final one is not, because there is no variable "x" in the data frame survey.

    identical(survey$ideo, survey[["ideo"]])
    [1] TRUE
    x <- "ideo"
    identical(survey[[x]], survey[["ideo"]])
    [1] TRUE
    identical(survey[[x]], survey$x)
    [1] FALSE

So say we want weighted crosstabs of ideology and party ID against each of questions 20 through 29. Here is some code that will do that.

    x <- names(survey)[grep("q2[[:digit:]]", names(survey))]
    x
     [1] "q20"  "q21"  "q22a" "q22b" "q22c" "q22d" "q22e" "q22f" "q22g" "q22h"
    [11] "q22i" "q25"  "q26"  "q27"  "q28"
    y <- c("ideo", "party")
    for (i in x) {
      for (j in y) {
        cat("\nWeighted proportions for", i, "broken down by", j, "\n")
        print(kable(wtd.table(survey[[i]], survey[[j]], survey$weight)/nrow(survey),
                    digits = 2))
        cat("\n")  # break out of table formatting
      }
      cat("\\newpage")
    }

A few notes:

– This code will only work with the results = "asis" setting (shown above) that lets knitr interpret the output of print(kable()) as something to render (rather than just Markdown code to display for use elsewhere).

– Ideally one would have a csv or data.frame of the questions, and display each question's wording as the loop reaches it. In this case, the questionnaire comes as a docx, so library(docxtractr) may help.

– Rather than a nested loop, one would likely prefer to pick a question, loop over the demographic and ideological categories for the crosstabs, and then insert commentary and overview.

– The outer loop starts a new page each time the inner loop finishes, via cat("\\newpage"), which is specific to rendering as PDF. Extra line breaks \n are needed to break out of the table formatting and keep code and text separate. A different approach to page breaks is needed for docx.

The next step is to add a parameter with any variables you need. The parameters will be controlled by the R script discussed below. There is, of course, quite a bit of choice as to what is controlled by which file, but often only a handful of parameters are necessary. Add the following to the end of the header of pewpoliticaltemplate.Rmd:

    params:
      spssfile: !r 1
      surveywave: !r 2016

That creates variables params$spssfile and params$surveywave that can be controlled externally from other R sessions, and gives them default values of 1 and 2016, respectively. Setting default values smooths debugging by allowing you to continue knitting the Rmd on its own (rather than from the R script we will create in a moment). You can also click on Knit and choose Knit with Parameters to specify particular values.
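To see how the parameter steers the template, here is a self-contained sketch. In the real Rmd, `params` is supplied by render() or by the YAML defaults; here it is mocked for illustration, and the file paths are the four this project's dir() call finds.

```r
# Mock of the params object that render() supplies to the Rmd
params <- list(spssfile = 2, surveywave = "January 2016")

# The four .sav paths found in the project folder
spss.files <- c(
  "http___www.people-press.org_files_datasets_Aug16/Aug16 public.sav",
  "Jan16/Jan16 public.sav",
  "March16/March16 public.sav",
  "Oct16/Oct16 public.sav"
)

# params$spssfile selects which wave this run of the template reads
chosen <- spss.files[params$spssfile]
chosen  # "Jan16/Jan16 public.sav"
# In the template itself one would then call:
# survey <- foreign::read.spss(chosen, to.data.frame = TRUE)
```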

Now make any changes to the Rmd template that depend on the parameters. For example, in the ggplot code…

    PA <- PA + ggtitle(paste("Presidential Approval:", params$surveywave))

Notice we can get a list of all the spss files like so:

    dir(pattern = "sav", recursive = TRUE)
    [1] "http___www.people-press.org_files_datasets_Aug16/Aug16 public.sav"
    [2] "Jan16/Jan16 public.sav"
    [3] "March16/March16 public.sav"
    [4] "Oct16/Oct16 public.sav"

or in this case

    dir(pattern = "16 public.sav", recursive = TRUE)
    [1] "http___www.people-press.org_files_datasets_Aug16/Aug16 public.sav"
    [2] "Jan16/Jan16 public.sav"
    [3] "March16/March16 public.sav"
    [4] "Oct16/Oct16 public.sav"

I recommend making the pattern as specific as possible in case you or your collaborators add other spss files with similar names. To use regular expressions to specify more complicated patterns, see here.
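A hypothetical illustration of why a specific pattern matters: anchoring the match at the end of the file name excludes stray .sav files that a loose pattern would pick up (the extra file names below are invented for the example).

```r
# Candidate paths: two real survey files plus two look-alikes to exclude
candidates <- c("Jan16/Jan16 public.sav",
                "March16/March16 public.sav",
                "old/Jan15 public.sav",           # wrong year
                "scratch/notes16 public.sav.bak") # not actually a .sav file

# "\\." escapes the dot; "$" anchors the pattern at the end of the name
keep <- grep("16 public\\.sav$", candidates, value = TRUE)
keep  # only the two 2016 survey files survive
```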

Now back to editing pewpoliticaltemplate.Rmd…

Knit the file to see how it looks with these default settings; that’s it for this portion.

Automating with knitr

Now create a new R script; mine's called pew_report_generator.R. It's just a simple loop that tells render() which data set to grab, as well as the label to pass to the Rmd. Note that the labels appear in alphabetical rather than chronological order because that is the order in which the Rmd happens to find the files.

    library(pacman)
    p_load(knitr, rmarkdown, sessioninfo)

    setwd("/users/mohanty/Desktop/pewpolitical/")

    waves <- c("August 2016", "January 2016", "March 2016", "October 2016")

    for (i in 1:length(waves)) {
      render("pewpoliticaltemplate.Rmd",
             params = list(spssfile = i, surveywave = waves[i]),
             output_file = paste0("Survey Analysis ", waves[i], ".pdf"))
    }

    session <- session_info()
    save(session, file = paste0("session", format(Sys.time(), "%m%d%Y"), ".Rdata"))

That’s it. Of course, in practice you might write some code on the first survey that doesn’t work for all of them. Pew, for example, seems to have formatted the survey date differently in the last two surveys, which led me to make a few changes. But if the data are formatted fairly consistently, a one-time investment can save massive amounts of time lost to error-prone copying and pasting.

A Little Version Control

The last bit of code is not necessary, but is a convenient way to store which versions of which libraries were actually used on which version of R. If something works now but not in the future, install_version (found in library(devtools)) can be used to install the older version of particular packages.

    s <- session_info()
    s$platform
     setting  value
     version  R version 3.4.2 (2017-09-28)
     os       macOS Sierra 10.12.6
     system   x86_64, darwin15.6.0
     ui       X11
     language (EN)
     collate  en_US.UTF-8
     tz       America/Los_Angeles
     date     2017-11-06

    s$packages
     package     * version  date       source
     assertthat    0.2.0    2017-04-11 CRAN (R 3.4.0)
     backports     1.1.1    2017-09-25 CRAN (R 3.4.2)
     bindr         0.1      2016-11-13 CRAN (R 3.4.0)
     bindrcpp      0.2      2017-06-17 CRAN (R 3.4.0)
     broom         0.4.2    2017-02-13 CRAN (R 3.4.0)
     cellranger    1.1.0    2016-07-27 CRAN (R 3.4.0)
     clisymbols    1.2.0    2017-05-21 cran (@1.2.0)
     colorspace    1.3-2    2016-12-14 CRAN (R 3.4.0)
     digest        0.6.12   2017-01-27 CRAN (R 3.4.0)
     dplyr       * 0.7.4    2017-09-28 cran (@0.7.4)
     evaluate      0.10.1   2017-06-24 CRAN (R 3.4.1)
     forcats       0.2.0    2017-01-23 CRAN (R 3.4.0)
     foreign     * 0.8-69   2017-06-22 CRAN (R 3.4.2)
     formatR       1.5      2017-04-25 CRAN (R 3.4.0)
     ggplot2     * 2.2.1    2016-12-30 CRAN (R 3.4.0)
     glue          1.2.0    2017-10-29 CRAN (R 3.4.2)
     gtable        0.2.0    2016-02-26 CRAN (R 3.4.0)
     haven         1.1.0    2017-07-09 CRAN (R 3.4.1)
     highr         0.6      2016-05-09 CRAN (R 3.4.0)
     hms           0.3      2016-11-22 CRAN (R 3.4.0)
     htmltools     0.3.6    2017-04-28 CRAN (R 3.4.0)
     httpuv        1.3.5    2017-07-04 CRAN (R 3.4.1)
     httr          1.3.1    2017-08-20 cran (@1.3.1)
     jsonlite      1.5      2017-06-01 CRAN (R 3.4.0)
     knitr       * 1.17     2017-08-10 CRAN (R 3.4.1)
     labeling      0.3      2014-08-23 CRAN (R 3.4.0)
     lattice       0.20-35  2017-03-25 CRAN (R 3.4.2)
     lazyeval      0.2.1    2017-10-29 CRAN (R 3.4.2)
     lubridate     1.7.0    2017-10-29 CRAN (R 3.4.2)
     magrittr      1.5      2014-11-22 CRAN (R 3.4.0)
     mime          0.5      2016-07-07 CRAN (R 3.4.0)
     miniUI        0.1.1    2016-01-15 CRAN (R 3.4.0)
     mnormt        1.5-5    2016-10-15 CRAN (R 3.4.0)
     modelr        0.1.1    2017-07-24 CRAN (R 3.4.1)
     munsell       0.4.3    2016-02-13 CRAN (R 3.4.0)
     nlme          3.1-131  2017-02-06 CRAN (R 3.4.2)
     pacman      * 0.4.6    2017-05-14 CRAN (R 3.4.0)
     pkgconfig     2.0.1    2017-03-21 CRAN (R 3.4.0)
     plyr          1.8.4    2016-06-08 CRAN (R 3.4.0)
     psych         1.7.8    2017-09-09 CRAN (R 3.4.1)
     purrr       * 0.2.4    2017-10-18 CRAN (R 3.4.2)
     questionr   * 0.6.3    2017-11-06 local
     R6            2.2.2    2017-06-17 CRAN (R 3.4.0)
     Rcpp          0.12.13  2017-09-28 cran (@0.12.13)
     readr       * 1.1.1    2017-05-16 CRAN (R 3.4.0)
     readxl        1.0.0    2017-04-18 CRAN (R 3.4.0)
     reshape2      1.4.2    2016-10-22 CRAN (R 3.4.0)
     rlang         0.1.2    2017-08-09 CRAN (R 3.4.1)
     rmarkdown     1.6      2017-06-15 CRAN (R 3.4.0)
     rprojroot     1.2      2017-01-16 CRAN (R 3.4.0)
     rstudioapi    0.7      2017-09-07 cran (@0.7)
     rvest         0.3.2    2016-06-17 CRAN (R 3.4.0)
     scales        0.5.0    2017-08-24 cran (@0.5.0)
     sessioninfo * 1.0.0    2017-06-21 CRAN (R 3.4.1)
     shiny         1.0.5    2017-08-23 cran (@1.0.5)
     stringi       1.1.5    2017-04-07 CRAN (R 3.4.0)
     stringr       1.2.0    2017-02-18 CRAN (R 3.4.0)
     tibble      * 1.3.4    2017-08-22 cran (@1.3.4)
     tidyr       * 0.7.2    2017-10-16 CRAN (R 3.4.2)
     tidyverse   * 1.1.1    2017-01-27 CRAN (R 3.4.0)
     withr         2.0.0    2017-10-25 Github (jimhester/withr@a43df66)
     xml2          1.1.1    2017-01-24 CRAN (R 3.4.0)
     xtable        1.8-2    2016-02-05 CRAN (R 3.4.0)
     yaml          2.1.14   2016-11-12 CRAN (R 3.4.0)

Source: https://rviews.rstudio.com/2017/11/07/automating-summary-of-surveys-with-rmarkdown/


To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### A history-oriented introduction to R for Excel users

Mon, 11/06/2017 - 17:49

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While spreadsheets are fine tools for collecting and sharing data, the temptation is often there to also use them for in-depth analysis better suited to reproducible systems like R. Historian Jesse Sadler recently published the useful guide Excel vs R: A Brief Introduction to R, which provides useful advice to data analysts currently using spreadsheets on how to transition to R:

Quantitative research often begins with the humble process of counting. Historical documents are never as plentiful as a historian would wish, but counting words, material objects, court cases, etc. can lead to a better understanding of the sources and the subject under study. When beginning the process of counting, the first instinct is to open a spreadsheet. The end result might be the production of tables and charts created in the very same spreadsheet document. In this post, I want to show why this spreadsheet-centric workflow is problematic and recommend the use of a programming language such as R as an alternative for both analyzing and visualizing data.

The post provides a good overview of the pros and cons of using spreadsheets for data analysis, and then provides a useful guide — aimed at spreadsheet users — to using R for the problematic parts. It includes:

• Basics of the R command line
• An overview of the Tidyverse, a suite of R packages for data manipulation
• Working with data in R: numbers, strings and dates
• Manipulating data frames by linking operations together with the pipe operator
• Visualizing data with the ggplot2 package

The guide is built around a worked analysis of an interesting historical data set: the correspondence network from 6,600 letters written to the 16th-century Dutch diplomat Daniel van der Meulen. You can find the complete guide, including a link to download the data for the examples, at the link below.

Jesse Sadler: Excel vs R: A Brief Introduction to R


To leave a comment for the author, please follow the link and comment on their blog: Revolutions.