R bloggers
R news and tutorials contributed by hundreds of R bloggers

Rcpp now used by 1000 CRAN packages

Sat, 04/15/2017 - 23:28

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Moments ago Rcpp passed a big milestone as there are now 1000 packages on CRAN depending on it (as measured by Depends, Imports and LinkingTo, but excluding Suggests). The graph on the left depicts the growth of Rcpp usage over time.

One easy way to compute such reverse dependency counts is the tools::dependsOnPkgs() function that was just mentioned in yesterday’s R^4 blog post. Another way is to use the reverse_dependencies_with_maintainers() function from this helper scripts file on CRAN. Lastly, devtools has a revdep() function, but its default parameters are wrong for this purpose as it includes Suggests:, which you’d have to override to get the count I use here (it currently reports 1012 with this wider measure).
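By way of illustration, here is a rough sketch of the first approach; the dependencies vector mirrors the Depends/Imports/LinkingTo measure used above, and the exact count will depend on the CRAN snapshot you query.

# Reverse-dependency count for Rcpp (Depends, Imports, LinkingTo only);
# requires R >= 3.4.0 for tools::CRAN_package_db()
db <- tools::CRAN_package_db()
revdeps <- tools::dependsOnPkgs("Rcpp",
                                dependencies = c("Depends", "Imports", "LinkingTo"),
                                recursive = FALSE,
                                installed = db)
length(revdeps)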

Rcpp cleared 300 packages in November 2014. It passed 400 packages in June 2015 (when I only tweeted about it), 500 packages in late October 2015, 600 packages last March, 700 packages last July, 800 packages last October and 900 packages in early January. The chart extends to the very beginning via manually compiled data from CRANberries and checked with crandb. The next part uses manually saved entries. The core (and by far largest) part of the data set was generated semi-automatically via a short script appending updates to a small file-based backend. A list of packages using Rcpp is kept on this page.

Also displayed in the graph is the relative proportion of CRAN packages using Rcpp. The four percent hurdle was cleared just before useR! 2014, where I showed a similar graph (as two distinct graphs) in my invited talk. We passed five percent in December 2014, six percent in July 2015, seven percent just before Christmas 2015, eight percent last summer, and nine percent in mid-December 2016. Ten percent is next; we may get there during the summer.

1000 user packages is a really large number. This puts a whole lot of responsibility on us in the Rcpp team as we continue to keep Rcpp as performant and reliable as it has been.

And with that a very big Thank You! to all users and contributors of Rcpp for help, suggestions, bug reports, documentation or, of course, code.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Shiny server series part 2: running shiny on multiple ports

Sat, 04/15/2017 - 22:39

(This article was first published on Jasper Ginn's blog, and kindly contributed to R-bloggers)

This guide is part of a series on setting up your own private server running shiny apps. There are many guides with great advice on how to set up an R shiny server and related software. I try to make a comprehensive guide based in part on these resources as well as my own experience. I always aim to properly attribute information to its respective sources. If you notice an issue, please contact me.

I noticed there were some errors in the first part of this series. These are now fixed. The revised post is available here.

Part 1 of this series is available here.

In part 1 of this series, we set up shiny server on a VPS. Our setup currently looks like this:

  • Rstudio server is running on port 8787
  • Shiny server is running on port 3838
  • We have a custom domain name pointing to the DigitalOcean VPS

In this part, we’ll configure shiny server to run on two ports: one on 3838 and one on 4949 (although this choice is arbitrary and you could use any port you’d like). This way, we can set up our server to serve a publicly accessible shiny server instance as well as an instance that will be password protected. This is useful if you want to share an app with a select group of people, or just for development purposes.

Resources used for this part

This tutorial draws from Rstudio’s documentation about shiny server configuration.

Running shiny on multiple ports

These steps are fairly straightforward. First, log into your VPS, switch to the shiny user and back up the shiny configuration file in case something goes wrong:

# Log into the server
ssh root@<your-VPS-ip>
# Switch to shiny user
su shiny
# Copy the config file
sudo cp /etc/shiny-server/shiny-server.conf /etc/shiny-server/shiny-server-backup.conf

Open the configuration file:

sudo nano /etc/shiny-server/shiny-server.conf

This configuration governs the port on which shiny runs, as well as the location where it stores error logs and where shiny can find your apps.

Firstly, add the following line under ‘run_as shiny’ (denoted by ‘1’ in the image below). This ensures that shiny writes its error logs to this folder:

# Log all Shiny output to files in this directory
log_dir /var/log/shiny-server;

Then, add ‘127.0.0.1’ next to ‘listen 3838’ (denoted by ‘2’ in the image below).

Finally, add the following lines below the last ‘}’ (denoted by ‘3’ in the image below):

# Define a server that listens on port 4949
server {
  listen 4949 127.0.0.1;
  # Define a location at the base URL
  location / {
    # Host the directory of Shiny Apps stored in this directory
    site_dir /srv/private-server;
    # When a user visits the base URL rather than a particular application,
    # an index of the applications available in this directory will be shown.
    directory_index on;
  }
}

Hit Ctrl+X, then Y, then Enter to save your changes.

This configuration ensures that shiny listens on both ports 3838 and 4949. As you can see in the config file, the private shiny server does not look for shiny apps in the ‘/srv/shiny-server’ folder, but rather in the ‘/srv/private-server’ folder. We need to create this folder and give the shiny user read and write permissions:

# Navigate to the /srv/ folder
cd /srv
# Create a folder for the private shiny server
sudo mkdir private-server
# Enter the folder
cd private-server
# Give shiny read/write permissions
sudo chown -R shiny:shiny-apps .
sudo chmod g+w .
sudo chmod g+s .

Now, we just need to restart shiny server:

sudo service shiny-server restart

You are now running shiny server on both ports 3838 and 4949!
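If you want to verify that both listeners are actually up, one quick check (assuming the net-tools package is installed on your VPS; ss -plnt works similarly on newer systems) is:

# Check that shiny-server is listening on both ports
sudo netstat -plnt | grep -E '3838|4949'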

Now, we simply need to add a new route to the nginx configuration. To do this, open up the config file:

sudo nano /etc/nginx/sites-enabled/default

Then, copy-paste the following lines just below the editor route. Indent the lines so they line up with the other routes, and remember to use only spaces, no tabs.

location /private-apps/ {
  proxy_pass http://127.0.0.1:4949/;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
}

It should look like the configuration in the image below:

Hit Ctrl+X, then Y, then Enter to save your changes.

You can now access your second shiny server on http://www.<yourdomainname>.<extension>/private-apps/

In the next part, we’ll add an SSL security certificate to the server.

To leave a comment for the author, please follow the link and comment on their blog: Jasper Ginn's blog.

Encoding categorical variables: one-hot and beyond

Sat, 04/15/2017 - 18:32

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

(or: how to correctly use xgboost from R)

R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere.

For example we can see evidence of one-hot encoding in the variable names chosen by a linear regression:

dTrain <- data.frame(x= c('a','b','b', 'c'),
                     y= c(1, 2, 1, 2))
summary(lm(y~x, data= dTrain))
##
## Call:
## lm(formula = y ~ x, data = dTrain)
##
## Residuals:
##          1          2          3          4
## -2.914e-16  5.000e-01 -5.000e-01  2.637e-16
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.7071   1.414    0.392
## xb            0.5000     0.8660   0.577    0.667
## xc            1.0000     1.0000   1.000    0.500
##
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared:  0.5,   Adjusted R-squared:  -0.5
## F-statistic: 0.5 on 2 and 1 DF,  p-value: 0.7071

Much of the encoding in R is essentially based on "contrasts" implemented in stats::model.matrix(). Note: do not use base::data.matrix() or use hashing before modeling; you might get away with them (especially with tree-based methods), but they are not in general good technique, as we show below:

data.matrix(dTrain)
##      x y
## [1,] 1 1
## [2,] 2 2
## [3,] 2 1
## [4,] 3 2

stats::model.matrix() does not store its one-hot plan in a convenient manner (it can be inferred by pulling the "contrasts" attribute plus examining the column names of the first encoding, but the levels identified are not conveniently represented). When directly applying stats::model.matrix() you cannot safely assume that the same formula applied to two different data sets (say train and application or test) is using the same encoding! We demonstrate this below:

dTrain <- data.frame(x= c('a','b','c'),
                     stringsAsFactors = FALSE)
encTrain <- stats::model.matrix(~x, dTrain)
print(encTrain)
##   (Intercept) xb xc
## 1           1  0  0
## 2           1  1  0
## 3           1  0  1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"

dTest <- data.frame(x= c('b','c'),
                    stringsAsFactors = FALSE)
stats::model.matrix(~x, dTest)
##   (Intercept) xc
## 1           1  0
## 2           1  1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"

The above mal-coding can be a critical flaw when you are building a model and then later using the model on new data (be it cross-validation data, test data, or future application data). Many R users are not familiar with the above issue as encoding is hidden in model training, and how to encode new data is stored as part of the model. Python scikit-learn users coming to R often ask "where is the one-hot encoder" (as it isn’t discussed as much in R as it is in scikit-learn) and even supply a number of (low quality) one-off packages "porting one-hot encoding to R."

The main place an R user needs a proper encoder (and that is an encoder that stores its encoding plan in a conveniently re-usable form, which many of the "one-off ported from Python" packages actually fail to do) is when using a machine learning implementation that isn’t completely R-centric. One such system is xgboost which requires (as is typical of machine learning in scikit-learn) data to already be encoded as a numeric matrix (instead of a heterogeneous structure such as a data.frame). This requires explicit conversion on the part of the R user, and many R users get it wrong (fail to store the encoding plan somewhere). To make this concrete let’s work a simple example.

Let’s try the Titanic data set to see encoding in action. Note: we are not working hard on this example (as in adding extra variables derived from cabin layout, commonality of names, and other sophisticated feature transforms); we are just plugging the obvious variables into xgboost. As we said: xgboost requires a numeric matrix for its input, so unlike many R modeling methods we must manage the data encoding ourselves (instead of leaving that to R, which often hides the encoding plan in the trained model). Also note: differences observed in performance that are below the sampling noise level should not be considered significant (e.g., all the methods demonstrated here performed about the same).

We bring in our data:

# set up example data set
library("titanic")
data(titanic_train)
str(titanic_train)
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr  "male" "female" "female" "female" ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr  "" "C85" "" "C123" ...
##  $ Embarked   : chr  "S" "C" "S" "S" ...
summary(titanic_train)
##   PassengerId       Survived          Pclass          Name
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000
##
##      Sex                 Age            SibSp           Parch
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000
##                     NA's   :177
##     Ticket               Fare           Cabin             Embarked
##  Length:891         Min.   :  0.00   Length:891         Length:891
##  Class :character   1st Qu.:  7.91   Class :character   Class :character
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character
##                     Mean   : 32.20
##                     3rd Qu.: 31.00
##                     Max.   :512.33
##
outcome <- 'Survived'
target <- 1
shouldBeCategorical <- c('PassengerId', 'Pclass', 'Parch')
for(v in shouldBeCategorical) {
  titanic_train[[v]] <- as.factor(titanic_train[[v]])
}
tooDetailed <- c("Ticket", "Cabin", "Name", "PassengerId")
vars <- setdiff(colnames(titanic_train), c(outcome, tooDetailed))
dTrain <- titanic_train

And design our cross-validated modeling experiment:

library("xgboost") library("sigr") library("WVPlots") library("vtreat") set.seed(4623762) crossValPlan <- vtreat::kWayStratifiedY(nrow(dTrain), 10, dTrain, dTrain[[outcome]]) evaluateModelingProcedure <- function(xMatrix, outcomeV, crossValPlan) { preds <- rep(NA_real_, nrow(xMatrix)) for(ci in crossValPlan) { nrounds <- 1000 cv <- xgb.cv(data= xMatrix[ci$train, ], label= outcomeV[ci$train], objective= 'binary:logistic', nrounds= nrounds, verbose= 0, nfold= 5) #nrounds <- which.min(cv$evaluation_log$test_rmse_mean) # regression nrounds <- which.min(cv$evaluation_log$test_error_mean) # classification model <- xgboost(data= xMatrix[ci$train, ], label= outcomeV[ci$train], objective= 'binary:logistic', nrounds= nrounds, verbose= 0) preds[ci$app] <- predict(model, xMatrix[ci$app, ]) } preds }

Our preferred way to encode data is to use the vtreat package in the "no variables mode" shown below (differing from the powerful "y aware" modes we usually teach).

set.seed(4623762)
tplan <- vtreat::designTreatmentsZ(dTrain, vars,
                                   minFraction= 0,
                                   verbose=FALSE)
# restrict to common variable types
# see vignette('vtreatVariableTypes', package = 'vtreat') for details
sf <- tplan$scoreFrame
newvars <- sf$varName[sf$code %in% c("lev", "clean", "isBAD")]
trainVtreat <- as.matrix(vtreat::prepare(tplan, dTrain,
                                         varRestriction = newvars))
print(dim(trainVtreat))
## [1] 891  20
print(colnames(trainVtreat))
##  [1] "Pclass_lev_x.1"   "Pclass_lev_x.2"   "Pclass_lev_x.3"
##  [4] "Sex_lev_x.female" "Sex_lev_x.male"   "Age_clean"
##  [7] "Age_isBAD"        "SibSp_clean"      "Parch_lev_x.0"
## [10] "Parch_lev_x.1"    "Parch_lev_x.2"    "Parch_lev_x.3"
## [13] "Parch_lev_x.4"    "Parch_lev_x.5"    "Parch_lev_x.6"
## [16] "Fare_clean"       "Embarked_lev_x."  "Embarked_lev_x.C"
## [19] "Embarked_lev_x.Q" "Embarked_lev_x.S"
dTrain$predVtreatZ <- evaluateModelingProcedure(trainVtreat,
                                                dTrain[[outcome]]==target,
                                                crossValPlan)
sigr::permTestAUC(dTrain, 'predVtreatZ', outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.86, s.d.=0.017, p<1e-05)."
WVPlots::ROCPlot(dTrain, 'predVtreatZ', outcome, target,
                 'vtreat encoder performance')

model.matrix() can perform similar encoding when we only have a single data set.

set.seed(4623762)
f <- paste('~ 0 + ', paste(vars, collapse = ' + '))
# model matrix skips rows with NAs by default,
# get control of this through an option
oldOpt <- getOption('na.action')
options(na.action='na.pass')
trainModelMatrix <- stats::model.matrix(as.formula(f), dTrain)
# note model.matrix does not conveniently store the encoding
# plan, so you may run into difficulty if you were to encode
# new data which didn't have all the levels seen in the training
# data.
options(na.action=oldOpt)
print(dim(trainModelMatrix))
## [1] 891  16
print(colnames(trainModelMatrix))
##  [1] "Pclass1"   "Pclass2"   "Pclass3"   "Sexmale"   "Age"
##  [6] "SibSp"     "Parch1"    "Parch2"    "Parch3"    "Parch4"
## [11] "Parch5"    "Parch6"    "Fare"      "EmbarkedC" "EmbarkedQ"
## [16] "EmbarkedS"
dTrain$predModelMatrix <- evaluateModelingProcedure(trainModelMatrix,
                                                    dTrain[[outcome]]==target,
                                                    crossValPlan)
sigr::permTestAUC(dTrain, 'predModelMatrix', outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.87, s.d.=0.019, p<1e-05)."
WVPlots::ROCPlot(dTrain, 'predModelMatrix', outcome, target,
                 'model.matrix encoder performance')

The caret package also supplies encoding functionality properly split between training (caret::dummyVars()) and application (via predict()).

library("caret") ## Loading required package: lattice ## Loading required package: ggplot2 set.seed(4623762) f <- paste('~', paste(vars, collapse = ' + ')) encoder <- caret::dummyVars(as.formula(f), dTrain) trainCaret <- predict(encoder, dTrain) print(dim(trainCaret)) ## [1] 891 19 print(colnames(trainCaret)) ## [1] "Pclass.1" "Pclass.2" "Pclass.3" "Sexfemale" "Sexmale" ## [6] "Age" "SibSp" "Parch.0" "Parch.1" "Parch.2" ## [11] "Parch.3" "Parch.4" "Parch.5" "Parch.6" "Fare" ## [16] "Embarked" "EmbarkedC" "EmbarkedQ" "EmbarkedS" dTrain$predCaret <- evaluateModelingProcedure(trainCaret, dTrain[[outcome]]==target, crossValPlan) sigr::permTestAUC(dTrain, 'predCaret', outcome, target) ## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.85, s.d.=0.017, p<1e-05)." WVPlots::ROCPlot(dTrain, 'predCaret', outcome, target, 'caret encoder performance')

We usually forget to teach vtreat::designTreatmentsZ() as it is often dominated by the more powerful y-aware methods vtreat supplies (though not for this simple example). vtreat::designTreatmentsZ has a number of useful properties:

  • Does not look at the outcome values, so does not require extra care in cross-validation.
  • Saves its encoding, so can be used correctly on new data.

The above two properties are shared with caret::dummyVars(). Additional features of vtreat::designTreatmentsZ (that differ from caret::dummyVars()‘s choices) include:

  • No NA values are passed through by vtreat::prepare().
  • NA presence is added as an additional informative column.
  • A few derived columns (such as pooling of rare levels) are made available.
  • Rare dummy variables are pruned (under a user-controlled threshold) to prevent encoding explosion.
  • Novel levels (levels that occur during test or application, but not during training) are deliberately passed through as "no training level activated" by vtreat::prepare() (caret::dummyVars() considers this an error).

The vtreat y-aware methods include proper nested modeling and y-aware dimension reduction.

vtreat is designed "to always work" (always return a pure numeric data frame with no missing values). It also excels in "big data" situations where the statistics it can collect on high cardinality categorical variables can have a huge positive impact in modeling performance. In many cases vtreat works around problems that kill the analysis pipeline (such as discovering new variable levels during test or application). We teach vtreat sort of "bimodally": in both a "fire and forget" mode and an "all the details on deck" mode (suitable for formal citation). Either way, vtreat can make your modeling procedures stronger, more reliable, and easier.

All code for this article can be found here.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

Forecasting: Linear Trend and ARIMA Models Exercises (Part-2)

Sat, 04/15/2017 - 17:50

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

There are two main approaches to time series forecasting. One of them is to find persistent patterns in a time series itself, and extrapolate those patterns. Another approach is to discover how a series depends on other variables, which serve as predictors.
This set of exercises focuses on the first approach, while the second one will be considered in a later set. The present set allows you to practice applying three forecasting models:
– a naive model, which provides probably the simplest forecasting technique, but still can be useful as a benchmark for evaluating other methods,
– a linear trend model (based on a simple linear regression),
– the ARIMA model, a more sophisticated and popular model, which assumes a linear dependence of a time series on its past values and random shocks.
The exercises do not require a deep understanding of the underlying theories, and make use of the automatic model estimation functions included in the forecast package. The set also helps you practice retrieving useful data from forecasts (confidence intervals, forecast errors), and comparing the predictive accuracy of different models. The exercises are based on data on e-commerce retail sales in the USA for 1999-2016, retrieved from FRED, the Federal Reserve Bank of St. Louis database (download here). The data represent quarterly sales volumes in millions of dollars.
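If you want to get oriented before starting, the sketch below shows the general forecast-package workflow that the exercises assume; the file name and column name are placeholders rather than part of the exercise data, and the sketch is not a substitute for the linked answers.

# General workflow sketch (placeholder file/column names, not the exercise answers)
library(forecast)
sales <- read.csv("ecommerce.csv")
y <- ts(sales$value, start = c(1999, 4), frequency = 4)  # quarterly, from 1999 Q4
fc_naive <- naive(y, h = 8)                              # naive forecast
fc_lm    <- forecast(tslm(y ~ trend + season), h = 8)    # linear trend model
fc_arima <- forecast(auto.arima(y), h = 8)               # ARIMA model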

For other parts of the series follow the tag forecasting

Answers to the exercises are available here

Exercise 1
Read the data from the file, and transform it into a time series (ts) object (given that the data is quarterly and the starting period is the fourth quarter of 1999).
Plot the obtained series.

Exercise 2
Make a naive forecast for the next 8 periods using the appropriate function from the forecast package (i.e. create an object of class forecast using the function that implements the naive method of forecasting). Note that this method sets all forecast values equal to the last known value of the time series.

Exercise 3
Plot the forecast values.

Exercise 4
Make a forecast for the next 8 periods based on a linear model in two steps:
(1) create a linear regression model for the forecast using the tslm function from the forecast package (use the series as the dependent variable, trend and season as independent variables),
(2) make a forecast based on the model using the forecast function from the same package.
Plot the forecast.

Exercise 5
Retrieve forecast errors (residuals) from the linear model based forecast and save them as a separate variable.

Learn more about Forecasting in the online course Time Series Analysis and Forecasting in R. In this course you will:

  • Get a complete introduction to Forecasting
  • Work through an exponential smoothing instruction
  • And much more

Exercise 6
Make a forecast for the next 8 periods based on the ARIMA model in two steps:
(1) create a model using the auto.arima function from the forecast package,
(2) make a forecast based on the model using the forecast function from the same package.
Plot the forecast.

Exercise 7
Print the summary of the forecast based on the ARIMA model.

Exercise 8
Explore the structure of the forecast summary. Find the forecast value for the last period, and its 5% confidence interval values.

Exercise 9
Retrieve forecast errors (residuals) from the ARIMA based forecast.

Exercise 10
Use the errors from the ARIMA based forecast and the errors from the linear model based forecast to compare predictive accuracy of the two models with the Diebold-Mariano test (implemented as a function in the forecast package). Test the hypothesis that the ARIMA based forecast is more accurate than the linear model based forecast.

Related exercise sets:
  1. Multiple Regression (Part 3) Diagnostics
  2. Forecasting for small business Exercises (Part-1)
  3. 3D plotting exercises
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises.

R Weekly Bulletin Vol – IV

Sat, 04/15/2017 - 14:43

(This article was first published on R programming, and kindly contributed to R-bloggers)

This week’s R bulletin will cover topics like removing duplicate rows, finding the row number for a particular value, sorting a data frame by two columns in the same order and in different orders, and creating two tabs in an Excel workbook. We will also cover functions like Sys.time, julian, and the second and minute functions from the lubridate package. Hope you like this R weekly bulletin. Enjoy reading!

Shortcut Keys

1. To open a file in R – Ctrl+O
2. To open a new R script – Ctrl+Shift+N
3. To save an R script – Ctrl+S

Problem Solving Ideas

Remove duplicate rows in R

Consider the following data frame “df” comprising 3 rows of 1-minute OHC (open/high/close) data for a stock.

Example:

open = c(221, 221.25, 221.25)
high = c(221.75, 221.45, 221.45)
close = c(221.35, 221.2, 221.2)
df = data.frame(open, high, close)
print(df)

As can be seen from the output, the 2nd and the 3rd rows are duplicates. If we want to retain only the unique rows, we can use the duplicated function in the following manner:

df[!duplicated(df), ]

This will remove the 3rd (duplicate) row and retain the first 2 rows. If instead we want only the duplicate row, we can retrieve it in the following way:

df[duplicated(df), ]

Return row number for a particular value in a column

To find the row number for a particular value in a column of a vector/data frame, we can use the “which” function.

Example 1:

day = c("Mon", "Tue", "Wed", "Thurs", "Fri", "Sat", "Sun") row_number = which(day == "Sat") print(row_number)

[1] 6

Example 2: Consider a 2 column data frame “df” comprising of the stock symbols and their respective closing price for the day. To find the row number corresponding to the HCL stock we call the “which” function on the Ticker column with its value selected as HCL.

Ticker = c("INFY", "TCS", "HCL") ClosePrice = c(2021, 2294, 910) data = data.frame(Ticker, ClosePrice) row_number = which(data$Ticker == "HCL") print(row_number)

[1] 3

Sorting a data frame by two columns in the same order

To sort a data frame by two columns in the same order, we can use the “order” function together with the “with” function. Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. We first sort the data frame by category and then by the percentage change in price.

The order function by default sorts in an ascending manner. Hence to sort both the columns in descending order we keep the decreasing argument as TRUE.

Example – Sorting a data frame by two columns

# Create a data frame
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN")
Category = c(1, 1, 1, 2)
Percent_Change = c(2.3, -0.25, 0.5, 0.25)
df = data.frame(Ticker, Category, Percent_Change)
print(df)

# Sorting by Category column first and then the Percent_Change column:
df_sort = df[with(df, order(Category, Percent_Change, decreasing = TRUE)), ]
print(df_sort)

Sorting a data frame by two columns in different order

To sort a data frame by two columns in different order, we can use the “order” function along with the “with” function.

Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. Assume that we want to sort first in ascending order by the “Category” column, and then by the “Percent_Change” column in descending order.

The order function by default sorts in an ascending manner. Hence, to sort the “Category” column we mention it as the first variable in the order function without prepending it with any sign. To sort the “Percent_Change” column in a descending order we prepend it with a negative sign.

Example – Sorting a data frame by two columns

# Create a data frame
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN")
Category = c(1, 1, 1, 2)
Percent_Change = c(2.3, -0.25, 0.5, 0.25)
df = data.frame(Ticker, Category, Percent_Change)
print(df)

# Sort by Category column first and then the Percent_Change column:
df_sort = df[with(df, order(Category, -Percent_Change)), ]
print(df_sort)

Creating two tabs in the output excel workbook

At times we want to write and save multiple results generated by an R script to an Excel workbook on separate worksheets. To do so, we can make use of the “append” argument of the write.xlsx function. To write to an Excel file, one must first install and load the xlsx package in R.

Example: In this example, we are creating two worksheets in the “Stocks.xlsx” workbook. In the worksheet named, “Top Gainers”, we save the table_1 output, while in the “Top Losers” worksheet we save the table_2 output. To create a second worksheet in the same workbook, we keep the append argument as TRUE in the second line.

write.xlsx(table_1, "Stocks.xlsx", sheetName = "Top Gainers", append = FALSE)
write.xlsx(table_2, "Stocks.xlsx", sheetName = "Top Losers", append = TRUE)

Functions Demystified

Sys.time function

The Sys.time function gives the current date and time.

Example:

date_time = Sys.time()
print(date_time)

[1] “2017-04-15 16:25:38 IST”

The function can be used to find the time required to run a piece of code by placing it at the start and the end of the code. The difference between the start and the end will give the time taken to execute the code.
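A small illustration of this timing pattern (Sys.sleep simply stands in for the code being timed):

start_time = Sys.time()
Sys.sleep(2)  # placeholder for the code being timed
end_time = Sys.time()
print(end_time - start_time)  # prints a time difference of roughly 2 seconds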

julian function

The julian function is used to extract the Julian date, which in R (with the default origin) is the number of days since January 1, 1970. Given a date, the syntax of the function is:

julian(date)

Example:

date = as.Date("2010-03-15")
julian(date)

[1] 14683
attr(,”origin”)
[1] “1970-01-01”

Alternatively, one can use the as.integer function to get the same result:
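For example:

as.integer(as.Date("2010-03-15"))

[1] 14683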

second and minute functions

These functions are part of the lubridate package. The second function retrieves/sets the seconds component of a date-time object, while the minute function retrieves/sets the minutes component. They accept date-time objects of classes such as POSIXct, POSIXlt, Date, zoo, xts, and timeSeries.

Example:

library(lubridate)
# Retrieving the seconds. We have used the ymd_hms function to parse the given object.
x = ymd_hms("2016-06-01 12:23:45")
second(x)

[1] 45

minute function:

library(lubridate)
# Retrieving the minute from a date-time object
x = ymd_hms("2016-06-01 12:23:45")
minute(x)

[1] 23

# Retrieving the minute from a time object. We have used the hms function to parse the given object.
x = hms("15:29:06")
minute(x)

[1] 29

Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways and methods plus R functions for our readers.

 

Download the PDF Now!

To leave a comment for the author, please follow the link and comment on their blog: R programming.

Another rule of three, this one in statistics

Sat, 04/15/2017 - 04:14

(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)

SAMPLE UNTIL THE FIRST SUCCESS, PUT A CONFIDENCE INTERVAL ON P(SUCCESS)

We wrote last week of Charles Darwin’s love of the “rule of three” which, according to Stigler “is simply the mathematical proposition that if a/b = c/d, then any three of a, b, c, and d suffice to determine the fourth.”

We were surprised to learn this is called the rule of three, as we had only heard of the rule of three in comedy. We were even more surprised when a reader wrote in telling us about a rule of three in statistics. According to Wikipedia: the rule of three states that if a certain event did not occur in a sample with n subjects … the interval from 0 to 3/n is a 95% confidence interval for the rate of occurrences in the population.

It’s a heuristic. We love heuristics!
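A quick way to see where the 3 comes from: if the true rate is p, the probability of seeing zero events in n independent trials is (1 - p)^n; setting that equal to 0.05 and solving gives p = 1 - 0.05^(1/n), which is approximately -log(0.05)/n, or about 3/n. A two-line check (ours, not part of the simulation described below):

n <- 100
c(exact = 1 - 0.05^(1/n), rule_of_three = 3/n)  # approx. 0.0295 vs 0.03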

We decided to give it a test run in a little simulation. You can imagine that we’re testing for defects in products coming off of a production line. Here’s how the simulation works:

  • We test everything that comes off the line, one by one, until we come across a defect (the n+1st test).
  • We then make a confidence interval bounded by 0 and 3/n, and make note of it. 95% of such intervals should contain the true underlying probability of defect.
  • Because it’s a simulation and we know the true underlying probability of defect, we make note of whether the interval contains the true probability of defect.
  • We repeat this 10,000 times at each of the following underlying probabilities: .001, .002, and .003.

The simulation thus far assumes the testers have the patience to keep testing until they find a defect. In reality, however, they might get bored and stop testing before the first defect is found. To address this, we also simulated another condition in which the testing is stopped at n/2, halfway to the first defect. Of course, in reality one has no way of knowing when one is halfway to the first defective test, but our simulation can at least tell us what kind of confidence interval one gets if one does stop halfway.

Here’s the result on bracketing, that is, the probability that the confidence intervals contain the correct value:

Across all three levels of true underlying probability, when stopping just before the first defect, we get 95% confidence intervals. However, when we stop half way to the first defect, we get closer to 100% intervals (99.73%, 99.80%, and 99.86%, respectively).

So we know that the upper bounds of these confidence intervals fall above the true probability 95% to about 100% of the time, but where do they fall?

In the uppermost figure of this post, we see the locations of the upper bounds of the simulated confidence intervals when we stop immediately before the first defect. For convenience, we draw blue lines at the true underlying probabilities of .001 (top), .002 (middle), and .003 (bottom). When it’s a 95% confidence interval, about 95% of the upper bounds should fall to the right of the blue line, and 5% to the left. We cut the axis at .05 so you can actually see something, but realize it extends all the way to 1.0, with the heights of the bars trailing off.

For comparison, let’s look at the case in which we stop halfway to the first defect. As suggested by the bracketing probabilities, here we see almost all of the upper bounds exceed the true underlying probability. As our applied statistician reader wrote us about the rule of three, “the weakness is that in some situations it’s a very broad confidence interval.”

REFERENCE
A Look at the Rule of Three
B. D. Jovanovic and P. S. Levy
The American Statistician
Vol. 51, No. 2 (May, 1997), pp. 137-139
DOI: 10.2307/2685405
Stable URL: http://www.jstor.org/stable/2685405

R CODE FOR THOSE WHO WANT TO PLAY ALONG AT HOME

require(scales)
library(ggplot2)
library(dplyr)
levels = c(.001,.002,.003)
ITER = 10000
res_list = vector('list', ITER*length(levels))
index=1
for(true_p in levels) {
for (i in 1:ITER) {
onesam = sample(
x = c(1, 0),
size = 10*1/true_p,
prob = c(true_p, 1 - true_p),
replace = TRUE
)
cut = which.max(onesam) - 1
upper_bound_halfway = min(3 / (cut/2),1)
upper_bound_lastpossible = min(3/cut,1)
res_list[[index]] =
data.frame(
true_p = true_p,
cut = cut,
upper_bound_halfway = upper_bound_halfway,
bracketed_halfway = true_p < upper_bound_halfway,
upper_bound_lastpossible = upper_bound_lastpossible,
bracketed_lastpossible = true_p < upper_bound_lastpossible
)
index = index + 1
}
}
df = do.call('rbind', res_list)
rm(res_list)
plot_data = rbind(
df %>% group_by(true_p) %>% summarise(bracketing_probability = mean(bracketed_halfway), type="halfway"),
df %>% group_by(true_p) %>% summarise(bracketing_probability = mean(bracketed_lastpossible),type="last possible")
)
plot_data
p=ggplot(plot_data,aes(x=true_p,y=bracketing_probability,group=type,fill=type)) +
geom_bar(stat="identity",position="dodge") +
coord_cartesian(ylim=c(.5,1)) +
theme_bw() +
theme(legend.position = "bottom",
panel.grid.minor.x = element_blank()) +
labs(x="True Probability",y="Bracketing Probability")
p
ggsave(plot=p,file="bracketing.png",width=4,height=4)
plot_data2 = df %>%
dplyr::select(-bracketed_halfway,-bracketed_lastpossible) %>%
tidyr::gather(bound_type,upper_bound,c(upper_bound_halfway,upper_bound_lastpossible)) %>%
arrange(bound_type,upper_bound) %>%
mutate(bin = floor(upper_bound/.001)*.001) %>%
group_by(bound_type,true_p,bin) %>%
summarise(count = n()) %>%
ungroup()
p=ggplot(subset(plot_data2,bound_type=="upper_bound_lastpossible"),aes(x=bin+.0005,y=count)) +
geom_bar(stat="identity",width = .0005) +
geom_vline(aes(xintercept = true_p),color="blue") +
coord_cartesian(xlim = c(0,.05)) +
labs(x="Upper Bound",y="Count") +
facet_grid(true_p ~ .) +
theme_bw() +
theme(legend.position = "none")
p
ggsave(plot=p,file="upper_bound_lastpossible.png",width=4,height=4)
#repeat for upper_bound_halfway
p=ggplot(subset(plot_data2,bound_type=="upper_bound_halfway"),aes(x=bin+.0005,y=count)) +
geom_bar(stat="identity",width = .0005) +
geom_vline(aes(xintercept = true_p),color="blue") +
coord_cartesian(xlim = c(0,.05),ylim=c(0,1750)) +
labs(x="Upper Bound",y="Count") +
facet_grid(true_p ~ .) +
theme_bw() +
theme(legend.position = "none")
p
ggsave(plot=p,file="upper_bound_halfway.png",width=4,height=4)

The post Another rule of three, this one in statistics appeared first on Decision Science News.

To leave a comment for the author, please follow the link and comment on their blog: R – Decision Science News.

#5: Easy package information

Sat, 04/15/2017 - 02:56

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

Welcome to the fifth post in the recklessly rambling R rants series, or R4 for short.

The third post showed an easy way to follow R development by monitoring (curated) changes on the NEWS file for the development version r-devel. As a concrete example, I mentioned that it has shown a nice new function (tools::CRAN_package_db()) coming up in R 3.4.0. Today we will build on that.

Consider the following short snippet:

library(data.table)

getPkgInfo <- function() {
    if (exists("CRAN_package_db", where = asNamespace("tools"))) {
        dat <- tools::CRAN_package_db()
    } else {
        tf <- tempfile()
        download.file("https://cloud.r-project.org/src/contrib/PACKAGES.rds",
                      tf, quiet=TRUE)
        dat <- readRDS(tf)              # r-devel can now readRDS off a URL too
    }
    dat <- as.data.frame(dat)
    setDT(dat)
    dat
}

It defines a simple function getPkgInfo() as a wrapper around said new function from R 3.4.0, ie tools::CRAN_package_db(), and a fallback alternative using a tempfile (in the automagically cleaned R temp directory) and an explicit download and read of the underlying RDS file. As an aside, just this week the r-devel NEWS told us that such readRDS() operations can now read directly from URL connection. Very nice—as RDS is a fantastic file format when you are working in R.

Anyway, back to the RDS file! The snippet above returns a data.table object with as many rows as there are packages on CRAN, and basically all their (parsed !!) DESCRIPTION info and then some. A gold mine!
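To get a quick feel for what came back, here are a couple of illustrative data.table queries; the columns are the parsed DESCRIPTION fields, and the exact counts will of course vary with the snapshot in use.

dat <- getPkgInfo()
dat[, .N]                                # total number of CRAN packages
dat[1:5, .(Package, Version, License)]   # a few of the parsed DESCRIPTION fields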

Consider this to see how many packages have a dependency (in the sense of Depends, Imports or LinkingTo, but not Suggests because Suggests != Depends) on Rcpp:

R> dat <- getPkgInfo()
R> rcppRevDepInd <- as.integer(tools::dependsOnPkgs("Rcpp", recursive=FALSE, installed=dat))
R> length(rcppRevDepInd)
[1] 998
R>

So exciting—we will hit 1000 within days! But let’s do some more analysis:

R> dat[ rcppRevDepInd, RcppRevDep := TRUE]  # set to TRUE for given set
R> dat[ RcppRevDep==TRUE, 1:2]
           Package Version
  1:      ABCoptim  0.14.0
  2: AbsFilterGSEA     1.5
  3:           acc   1.3.3
  4: accelerometry   2.2.5
  5:      acebayes   1.3.4
 ---
994:        yakmoR   0.1.1
995:  yCrypticRNAs  0.99.2
996:         yuima   1.5.9
997:           zic     0.9
998:       ziphsmm   1.0.4
R>

Here we index the reverse dependency using the vector we had just computed, and then that new variable to subset the data.table object. Given the aforementioned parsed information from all the DESCRIPTION files, we can learn more:

R> ## likely false entries
R> dat[ RcppRevDep==TRUE, ][NeedsCompilation!="yes", c(1:2,4)]
            Package Version Depends
 1:         baitmet   1.0.0 Rcpp, erah (>= 1.0.5)
 2:           bea.R   1.0.1 R (>= 3.2.1), data.table
 3:            brms   1.6.0 R (>= 3.2.0), Rcpp (>= 0.12.0), ggplot2 (>= 2.0.0), methods
 4: classifierplots   1.3.3 R (>= 3.1), ggplot2 (>= 2.2), data.table (>= 1.10),
 5:           ctsem   2.3.1 R (>= 3.2.0), OpenMx (>= 2.3.0), Rcpp
 6:        DeLorean   1.2.4 R (>= 3.0.2), Rcpp (>= 0.12.0)
 7:            erah   1.0.5 R (>= 2.10), Rcpp
 8:             GxM     1.1 NA
 9:             hmi   0.6.3 R (>= 3.0.0)
10:        humarray     1.1 R (>= 3.2), NCmisc (>= 1.1.4), IRanges (>= 1.22.10),\nGenomicRanges (>= 1.16.4)
11:         iNextPD   0.3.2 R (>= 3.1.2)
12:          joinXL   1.0.1 R (>= 3.3.1)
13:            mafs   0.0.2 NA
14:            mlxR   3.1.0 R (>= 3.0.1), ggplot2
15:    RmixmodCombi     1.0 R(>= 3.0.2), Rmixmod(>= 2.0.1), Rcpp(>= 0.8.0), methods,\ngraphics
16:             rrr   1.0.0 R (>= 3.2.0)
17:        UncerIn2     2.0 R (>= 3.0.0), sp, RandomFields, automap, fields, gstat
R>

There are a full seventeen packages which claim to depend on Rcpp while not having any compiled code of their own. That is likely false, but I keep them in my counts, however reluctantly. A CRAN-declared Depends: is a Depends:, after all.

Another nice thing to look at is the total number of packages that declare that they need compilation:

R> ## number of packages with compiled code
R> dat[ , .(N=.N), by=NeedsCompilation]
   NeedsCompilation    N
1:               no 7625
2:              yes 2832
3:               No    1
R>

Isn’t that awesome? It is 2832 out of (currently) 10458, or about 27.1%. Just over one in four. Now the 998 for Rcpp look even better, as they are about 35% of all such packages. In other words, a little over one third of all packages with compiled code (which may be legacy C, Fortran or C++) use Rcpp. Wow.

Before closing, one shoutout to Dirk Schumacher, whose thankr package, which I made the center of the last post, is now on CRAN. A mighty fine and slim micropackage without external dependencies. Neat.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.

Gender Roles with Text Mining and N-grams

Sat, 04/15/2017 - 02:00

(This article was first published on data science ish, and kindly contributed to R-bloggers)

Today is the one year anniversary of the janeaustenr package’s appearance on CRAN, its cranniversary, if you will. I think it’s time for more Jane Austen here on my blog.


I saw this paper by Matthew Jockers and Gabi Kirilloff a number of months ago and the ideas in it have been knocking around in my head ever since. The authors of that paper used text mining to examine a corpus of 19th century novels and explore how gendered pronouns (he/she/him/her) are associated with different verbs. These authors used the Stanford CoreNLP library to parse dependencies in sentences and find which verbs are connected to which pronouns; I have been thinking about how to apply a different approach to this question using tidy data principles and n-grams. Let’s see what we can do!

Jane Austen and n-grams

An n-gram is a contiguous series of n words from a text; for example, a bigram is a pair of words, with n = 2. If we want to find out which verbs an author is more likely to pair with the pronoun “she” than with “he”, we can analyze bigrams. Let’s use unnest_tokens from the tidytext package to identify all the bigrams in the 6 completed, published novels of Jane Austen and transform this to a tidy dataset.

library(tidyverse)
library(tidytext)
library(janeaustenr)

austen_bigrams <- austen_books() %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2)

austen_bigrams
## # A tibble: 725,048 × 2
##                   book          bigram
##                 <fctr>           <chr>
## 1  Sense & Sensibility       sense and
## 2  Sense & Sensibility and sensibility
## 3  Sense & Sensibility  sensibility by
## 4  Sense & Sensibility         by jane
## 5  Sense & Sensibility     jane austen
## 6  Sense & Sensibility     austen 1811
## 7  Sense & Sensibility    1811 chapter
## 8  Sense & Sensibility       chapter 1
## 9  Sense & Sensibility           1 the
## 10 Sense & Sensibility      the family
## # ... with 725,038 more rows

That is all the bigrams from Jane Austen’s works, but we only want the ones that start with “he” or “she”. Jane Austen wrote in the third person, so this is a good example set of texts for this question. The original paper used dependency parsing of sentences and included other pronouns like “her” and “him”, but let’s just look for bigrams that start with “she” and “he”. We will get some adverbs and modifiers and such as the second word in the bigram, but mostly verbs, the main thing we are interested in.

pronouns <- c("he", "she") bigram_counts <- austen_bigrams %>% count(book, bigram, sort = TRUE) %>% ungroup() %>% separate(bigram, c("word1", "word2"), sep = " ") %>% filter(word1 %in% pronouns) %>% count(word1, word2, wt = n, sort = TRUE) %>% rename(total = nn) bigram_counts ## # A tibble: 1,571 × 3 ## word1 word2 total ## <chr> <chr> <int> ## 1 she had 1472 ## 2 she was 1377 ## 3 he had 1023 ## 4 he was 889 ## 5 she could 817 ## 6 he is 399 ## 7 she would 383 ## 8 she is 330 ## 9 he could 307 ## 10 he would 264 ## # ... with 1,561 more rows

There we go! These are the most common bigrams that start with “he” and “she” in Jane Austen’s works. Notice that there are more mentions of women than men here; this makes sense as Jane Austen’s novels have protagonists who are women. The most common bigrams look pretty similar between the male and female characters in Austen’s works. Let’s calculate a log odds ratio so we can find the words (hopefully mostly verbs) that exhibit the biggest differences between relative use for “she” and “he”.

word_ratios <- bigram_counts %>%
    group_by(word2) %>%
    mutate(word_total = sum(total)) %>%
    ungroup() %>%
    filter(word_total > 10) %>%
    select(-word_total) %>%
    spread(word1, total, fill = 0) %>%
    mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
    mutate(logratio = log(she / he)) %>%
    arrange(desc(logratio))

Which words have about the same likelihood of following “he” or “she” in Jane Austen’s novels?

word_ratios %>%
    arrange(abs(logratio))
## # A tibble: 164 × 4
##           word2          he          she    logratio
##           <chr>       <dbl>        <dbl>       <dbl>
## 1        always 0.001846438 0.0018956289  0.02629233
## 2         loves 0.000923219 0.0008920607 -0.03433229
## 3           too 0.000923219 0.0008920607 -0.03433229
## 4          when 0.000923219 0.0008920607 -0.03433229
## 5  acknowledged 0.001077089 0.0011150758  0.03466058
## 6      remained 0.001077089 0.0011150758  0.03466058
## 7           had 0.157562702 0.1642506690  0.04157024
## 8        paused 0.001384828 0.0014495986  0.04571041
## 9         would 0.040775504 0.0428189117  0.04889836
## 10       turned 0.003077397 0.0032337199  0.04954919
## # ... with 154 more rows

These words, like “always” and “loves”, are about as likely to come after the word “she” as the word “he”. Now let’s look at the words that exhibit the largest differences in appearing after “she” compared to “he”.

word_ratios %>%
    mutate(abslogratio = abs(logratio)) %>%
    group_by(logratio < 0) %>%
    top_n(15, abslogratio) %>%
    ungroup() %>%
    mutate(word = reorder(word2, logratio)) %>%
    ggplot(aes(word, logratio, color = logratio < 0)) +
    geom_segment(aes(x = word, xend = word,
                     y = 0, yend = logratio),
                 size = 1.1, alpha = 0.6) +
    geom_point(size = 3.5) +
    coord_flip() +
    labs(x = NULL,
         y = "Relative appearance after 'she' compared to 'he'",
         title = "Words paired with 'he' and 'she' in Jane Austen's novels",
         subtitle = "Women remember, read, and feel while men stop, take, and reply") +
    scale_color_discrete(name = "", labels = c("More 'she'", "More 'he'")) +
    scale_y_continuous(breaks = seq(-1, 2),
                       labels = c("0.5x", "Same", "2x", "4x"))

These words are the ones that are the most different in how Jane Austen used them with the pronouns “he” and “she”. Women in Austen’s novels do things like remember, read, feel, resolve, long, hear, dare, and cry. Men, on the other hand, in these novels do things like stop, take, reply, come, marry, and know. Women in Austen’s world can be funny and smart and unconventional, but she plays with these ideas within a cultural context where they act out gendered roles.

George Eliot and n-grams

Let’s look at another set of novels to see some similarities and differences. Let’s take some novels of George Eliot, another English writer (a woman) who lived and wrote several decades after Jane Austen. Let’s take Middlemarch (MY FAVE), Silas Marner, and The Mill on the Floss.

library(gutenbergr)

eliot <- gutenberg_download(c(145, 550, 6688),
                            mirror = "http://mirrors.xmission.com/gutenberg/")

We now have the texts downloaded from Project Gutenberg. We can use the same approach as above and calculate the log odds ratios for each word that comes after “he” and “she” in these novels of George Eliot.

eliot_ratios <- eliot %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE) %>%
    ungroup() %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(word1 %in% pronouns) %>%
    count(word1, word2, wt = n, sort = TRUE) %>%
    rename(total = nn) %>%
    group_by(word2) %>%
    mutate(word_total = sum(total)) %>%
    ungroup() %>%
    filter(word_total > 10) %>%
    select(-word_total) %>%
    spread(word1, total, fill = 0) %>%
    mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
    mutate(logratio = log(she / he)) %>%
    arrange(desc(logratio))

What words exhibit the largest differences in their appearance after these pronouns in George Eliot’s works?

eliot_ratios %>%
    mutate(abslogratio = abs(logratio)) %>%
    group_by(logratio < 0) %>%
    top_n(15, abslogratio) %>%
    ungroup() %>%
    mutate(word = reorder(word2, logratio)) %>%
    ggplot(aes(word, logratio, color = logratio < 0)) +
    geom_segment(aes(x = word, xend = word,
                     y = 0, yend = logratio),
                 size = 1.1, alpha = 0.6) +
    geom_point(size = 3.5) +
    coord_flip() +
    labs(x = NULL,
         y = "Relative appearance after 'she' compared to 'he'",
         title = "Words paired with 'he' and 'she' in George Eliot's novels",
         subtitle = "Women read, run, and need while men leave, mean, and tell") +
    scale_color_discrete(name = "", labels = c("More 'she'", "More 'he'")) +
    scale_y_continuous(breaks = seq(-3, 3),
                       labels = c("0.125x", "0.25x", "0.5x", "Same",
                                  "2x", "4x", "8x"))

We can see some difference in word use and style here, but overall there are quite similar ideas behind the verbs for women and men in Eliot’s works as Austen’s. Women read, run, need, marry, and look while men leave, mean, tell, know, and call. The verbs associated with women are more connected to emotion or feelings while the verbs associated with men are more connected to action or speaking.

Jane Eyre and n-grams

Finally, let’s look at one more novel. The original paper found that Jane Eyre by Charlotte Brontë had its verbs switched, in that there were lots of active, non-feelings verbs associated with feminine pronouns. That Jane Eyre!

eyre <- gutenberg_download(1260, mirror = "http://mirrors.xmission.com/gutenberg/")

eyre_ratios <- eyre %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE) %>%
    ungroup() %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(word1 %in% pronouns) %>%
    count(word1, word2, wt = n, sort = TRUE) %>%
    rename(total = nn) %>%
    group_by(word2) %>%
    mutate(word_total = sum(total)) %>%
    ungroup() %>%
    filter(word_total > 5) %>%
    select(-word_total) %>%
    spread(word1, total, fill = 0) %>%
    mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
    mutate(logratio = log(she / he)) %>%
    arrange(desc(logratio))

What words exhibit the largest differences in Jane Eyre?

eyre_ratios %>%
    mutate(abslogratio = abs(logratio)) %>%
    group_by(logratio < 0) %>%
    top_n(15, abslogratio) %>%
    ungroup() %>%
    mutate(word = reorder(word2, logratio)) %>%
    ggplot(aes(word, logratio, color = logratio < 0)) +
    geom_segment(aes(x = word, xend = word,
                     y = 0, yend = logratio),
                 size = 1.1, alpha = 0.6) +
    geom_point(size = 3.5) +
    coord_flip() +
    labs(x = NULL,
         y = "Relative appearance after 'she' compared to 'he'",
         title = "Words paired with 'he' and 'she' in Jane Eyre",
         subtitle = "Women look, tell, and open while men stop, smile, and pause") +
    scale_color_discrete(name = "", labels = c("More 'she'", "More 'he'")) +
    scale_y_continuous(breaks = seq(-3, 3),
                       labels = c("0.125x", "0.25x", "0.5x", "Same",
                                  "2x", "4x", "8x"))

Indeed, the words that are more likely to appear after “she” are not particularly feelings-oriented; women in this novel do things like look, tell, open, and do. Men in Jane Eyre do things like stop, smile, pause, pursue, and stand.

The End

It is so interesting to me how these various authors understand and portray their characters’ roles and gender, and how we can see that through analyzing word choices. The R Markdown file used to make this blog post is available here. I am very happy to hear about that or other feedback and questions!

To leave a comment for the author, please follow the link and comment on their blog: data science ish.

Optimising your blog with R and Google Optimize

Sat, 04/15/2017 - 02:00

(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)

When writing a blog, sooner or later, one measures and consequently tries to improve some kind of success metric.
In my case, I have been tracking user behavior (visitors, bounce rate, pageviews, …) for slightly over a year with Google Analytics. While some posts have been more “successful” than others, it is not my goal to change the topics and headlining in order to get more clicks. Instead, one metric that I care about is whether visitors find value in the posts they read and subsequently read more of the past blog posts. In order to see if I could move this metric, I ran a small A/B test over the last two months.

50% of the blog visitors saw a new start page where the excerpt after each blog post headline was hidden.
The following image shows the difference between the test setting and the baseline setting.

What do you expect? Did the test-setting outperform the baseline setting in terms of pageviews?
As I use a very common jekyll blog theme, my assumption was that the theme authors know more about usability than me and would have optimized for reader-“engagement”.
So let’s have a look at the data. The googleAnalyticsR package allows one to conveniently load the data from GA into R.

library(googleAnalyticsR)
googleAnalyticsR::ga_auth()
account_list <- google_analytics_account_list()
ga_id <- 119848123
dff <- google_analytics_4(ga_id,
                          date_range = c("2017-02-19", "2017-04-14"),
                          metrics = c("users", "sessions", "bounceRate",
                                      "avgSessionDuration", "pageviews",
                                      "pageviewsPerSession"),
                          dimensions = c("date", "experimentVariant"),
                          anti_sample = TRUE)

With the downloaded data, I use the weighted.summarySE function to compare the two key KPIs, bounce rate and pageviews per session, in both settings (a sketch of such a helper is shown below).
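In case that helper is new to you, here is a minimal sketch of what a weighted.summarySE-style function could look like (a hypothetical dplyr-based reimplementation, not necessarily the exact function behind the figures below); it returns a weighted mean and an approximate standard error per group, which is all the plotting code needs:

library(dplyr)
# weighted group summaries: weighted mean plus an approximate standard error
weighted.summarySE <- function(data, measurevar, groupvars, weights, na.rm = FALSE) {
  data %>%
    group_by(across(all_of(groupvars))) %>%
    summarise(
      n_obs  = n(),
      w_sum  = sum(.data[[weights]], na.rm = na.rm),
      w_mean = weighted.mean(.data[[measurevar]], w = .data[[weights]], na.rm = na.rm),
      # (biased) weighted standard deviation around the weighted mean
      w_sd   = sqrt(sum(.data[[weights]] * (.data[[measurevar]] - w_mean)^2,
                        na.rm = na.rm) / w_sum),
      .groups = "drop"
    ) %>%
    mutate(value = w_mean, se = w_sd / sqrt(n_obs)) %>%
    select(all_of(groupvars), value, se, n_obs)
}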

dff$Test <- ifelse(dff$experimentVariant == 0, "Baseline", "Experiment \n Setting")
mm <- melt(dff[, c("Test", "sessions", "bounceRate", "pageviewsPerSession")],
           id.vars = c("Test", "sessions"))
dfc <- weighted.summarySE(mm, measurevar = "value",
                          groupvars = c("Test", "variable"),
                          weights = "sessions", na.rm = T)
p1 <- ggplot(dfc, aes(Test, value)) +
  geom_point() +
  geom_errorbar(aes(ymin = value - se, ymax = value + se), width = .1)
p1 + theme_economist(base_size = 16) + xlab("") + ylab("") +
  facet_wrap(~variable, scales = "free_y", ncol = 1)

Wow, what a difference; both measures are improved by ~25%. I am seriously surprised by the large effect of such a small change. While I have tested the effect of interface changes on user behavior before (among other things as part of my dissertation), these effects still come as a surprise.

As of yesterday, I stopped the experiment and permanently changed the blog-layout to be slightly more compact but 25% more successful ;).

To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (2): Recommendation as discrete choice

Fri, 04/14/2017 - 22:53

(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

In this continuation of “Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (1): Feature engineering” I will describe the application of the {ordinal}
clm() function to test a new, hybrid content-based, collaborative
filtering approach to recommender engines by fitting a class of ordinal
logistic (aka ordered logit) models to ratings
data from the MovieLens 100K dataset. All R code used in this project
can be obtained from the respective GitHub repository; the chunks of
code present in the body of the post illustrate the essential steps
only. The MovieLens 100K dataset can be obtained from the GroupLens research laboratory
of the Department of Computer Science and Engineering at the University
of Minnesota. This second part of the study relies on the R code in OrdinalRecommenders_3.R
and presents the model training, cross-validation, and the analyses.
But before we proceed to approach the recommendation problem from a
viewpoint of discrete choice modeling, let me briefly remind you of the
results of the feature engineering phase and explain what happens in OrdinalRecommenders_2.R which prepares the data frames that are passed to clm().

Enter explanatory variables

In the feature engineering phase (OrdinalRecommenders_1.R)
we have produced four – two user-user and two item-item – matrices. Of
these four matrices, two are proximity matrices with Jaccard distances
derived from binarized content-based features: for users, we have used
the binary-featured vectors that describe each user by what movies he or
she has or has not rated, while for items we have derived the proximity
matrix from the binarized information on movie genre and the decade of
the production. In addition, we have two Pearson correlation matrices
derived from the ratings matrix directly, one describing the user-user
similarity and the other the similarity between the items. The latter two
matrices represent the memory-based component of the model, while the
former carry content-based information. In OrdinalRecommenders_2.R, these four matrices are:

  • usersDistance – user-user Jaccard distances,
  • UserUserSim – user-user Pearson correlations,
  • itemsDistance – item-item Jaccard distances, and
  • ItemItemSim – item-item Pearson correlations.

The OrdinalRecommenders_2.R script essentially performs only the following operations over these matrices: for each user-item rating present in ratingsData (the full ratings matrix),

  • from the user-user proximity/similarity matrix,
    select the 100 users who have rated the item under consideration and
    who are closest to the user under consideration;
  • place them in increasing order of distance/decreasing order of similarity from the current user;
  • copy their respective ratings of the item under consideration.

For the item-item matrices, the procedure is similar:

  • from the item-item proximity/similarity matrix,
    select 100 items that the user under consideration has rated and that
    are closest to the item under consideration;
  • place them in increasing order of distance/decreasing order of similarity from the current item;
  • copy the ratings given by the user under consideration to those neighbouring items (a minimal R sketch of this selection step is shown just below).
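To make the selection step concrete, here is a minimal sketch for a single user-item pair (my own illustration, not the actual code from OrdinalRecommenders_2.R; userItemRatings and userSimilarity are hypothetical stand-ins for the real objects):

# take the ratings of the item from the k users most similar to the target
# user who have rated it; for a distance matrix, order increasing instead
# userItemRatings: users x items matrix with NA for missing ratings
# userSimilarity:  users x users similarity matrix
neighbourRatings <- function(userItemRatings, userSimilarity, user, item, k = 100) {
  raters <- setdiff(which(!is.na(userItemRatings[, item])), user)
  if (length(raters) == 0) return(rep(NA_real_, k))
  ord <- raters[order(userSimilarity[user, raters], decreasing = TRUE)]
  unname(userItemRatings[ord, item][1:k])   # pads with NA when fewer than k raters
}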

Note that this procedure of selecting the nearest
neighbours’ ratings produces missing data almost by necessity: a
particular user doesn’t have to have 100 ratings of the items similar to
the item under consideration, and for a particular user and item we
don’t necessarily find 100 similar users’ ratings of the same item. This
would be reflected in a decrease of the sample size as one builds
progressively more and more feature-encompassing ordered logit models.
In order to avoid this problem, I have first reduced the sample size so
as to keep only complete observations for the most feature-encompassing
model that I have planned to test: the one relying on 30 nearest
neighbors from all four matrices (i.e., encompassing 4 * 30 = 120
regressors, plus the four intercepts needed for the ordered logit over a
5-point rating scale); 87,237 observations from the dataset were kept.

For the sake of completeness, I have also copied the
respective distance and similarity measures from the matrices; while the
first 100 columns of the output matrices present ratings only, the
following 100 columns are the respective distance/similarity measures.
The latter will not be used in the remainder of the study. The resulting
four matrices are:

  • proxUsersRatingsFrame – for each
    user-item combination in the data set, these are the ratings of the item
    under consideration from the 100 closest neighbours in the user-user
    content-based proximity matrix;
  • simUsersRatingsFrame – for each
    user-item combination in the data set, these are the ratings of the item
    under consideration from the 100 closest neighbours in the user-user
    memory-based similarity matrix;
  • proxItemsRatingsFrame – for each
    user-item combination in the data set, these are the ratings from the
    user under consideration of the 100 closest neighbours in the item-item
    content-based proximity matrix; and
  • simItemsRatingsFrame – for
    each user-item combination in the data set, these are the ratings from
    the user under consideration of the 100 closest neighbours in the
    item-item memory-based similarity matrix.

Of course it can be done faster than for-looping over
100,000 ratings. Next, the strategy. The idea is to
progressively select a larger numSims – the number of nearest neighbours taken from proxUsersRatingsFrame, simUsersRatingsFrame, proxItemsRatingsFrame, and simItemsRatingsFrame – and combine them with the ratings from ratingsData. For example, if numSims == 5, we obtain a predictive model Rating ~ 4 matrices * 5 regressors, which is 20 regressors in total; if numSims == 10, we obtain a predictive model Rating ~ 4 matrices * 10 regressors, which is 40 regressors in total, etc.

Enter Ordered Logit Model

The idea to think of the recommendation problem as
discrete choice is certainly not new; look at the very introductory
lines of [1] where the Netflix competition problem was used to motivate
the introduction of ordered regressions in discrete choice. We assume
that each user has an unobservable, continuous utility function that
maps onto his or her item preferences, and of which we learn only by
observing the discrete information that is captured by his or her
ratings on a 5-point scale. We also assume that the mapping of the
underlying utility on the categories of the rating scale is controlled
by threshold parameters, of which we fix the first to have a value of
zero, and estimate the thresholds for 1→2, 2→3, 3→4, and 4→5 scale
adjustments; these play a role of category-specific intercepts, and are
usually considered as nuisance parameters in ordered logistic
regression. In terms of the unobserved user’s utility for items, these
threshold parameters “dissect” the hypothesized continuous utility scale
into fragments that map onto the discrete rating responses (see Chapter
3, Chapter 4.8 in [1]). Additionally, we estimate a vector of regression
coefficients, β, one for each explanatory variable in the model: in our case, the respective ratings from the numSims
closest neighbours from the similarity and proximity matrices as
described in the previous section. In the ordered logit setting – and
with the useful convention of writing minus in front of the xiTβ term used in {ordinal} (see [2]) – we have:

logit( P(Yi ≤ j) ) = θj − xiTβ,   j = 1, …, J − 1,

and given that

P(Yi ≤ j) = P(Yi = 1) + P(Yi = 2) + … + P(Yi = j),

the logit is cumulative; θj stands for the j-th response
category-specific threshold, while i is the index of each observation in
the dataset. Vectors x and β play the usual role. The interpretation of
the regression coefficients in β becomes clear after realizing that the
cumulative odds ratio takes the form of:

[ P(Yi ≥ j | xi + 1) / P(Yi < j | xi + 1) ] / [ P(Yi ≥ j | xi) / P(Yi < j | xi) ] = exp(β)

and is independent of the response category j, so that for a unit change
in x the cumulative odds of P(Yi ≥ j) increase by a multiplicative
factor of exp(β).
Back to the recommendation problem: by progressively adding more
nearest neighbours’ ratings from the user-user and item-item proximity
and similarity matrices, and tracking the behavior of regression
coefficients in predictions from the respective ordered logit models,
one can figure out the relative importance of the memory-based
information (from the similarity matrices) and the content-based
information (from proximity matrices).
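As a quick, hedged illustration of the exp(β) interpretation above, one can fit a small cumulative link model on the wine data set that ships with {ordinal} (a toy example, not the MovieLens data) and exponentiate the estimated coefficients:

library(ordinal)
data(wine)                   # bitterness ratings on an ordered 1-5 scale
toyFit <- clm(rating ~ temp + contact, data = wine)
summary(toyFit)              # four thresholds (1|2 ... 4|5) plus two beta coefficients
exp(coef(toyFit))            # the beta entries are cumulative odds ratios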

I have varied the numSims (the number of nearest neighbours used) as seq(5, 30, by = 5)
to develop six ordinal models with 20, 40, 60, 80, 100, and 120
regressors respectively, and performed a 10-fold cross-validation over
87,237 user-item ratings from MovieLens 100K (OrdinalRecommenders_3.R):

### --- 10-fold cross-validation for each:
### --- numSims <- seq(5, 30, by = 5)
numSims <- seq(5, 30, by = 5)
meanRMSE <- numeric(length(numSims))
totalN <- dim(ratingsData)[1]
n <- numeric()
ct <- 0
## -- Prepare modelFrame:
# - select variables so to match the size needed for the most encompassing clm() model:
f1 <- select(proxUsersRatingsFrame,
            starts_with('proxUsersRatings_')[1:numSims[length(numSims)]])
f2 <- select(simUsersRatingsFrame,
            starts_with('simUsersRatings_')[1:numSims[length(numSims)]])
f3 <- select(proxItemsRatingsFrame,
            starts_with('proxItemsRatings_')[1:numSims[length(numSims)]])
f4 <- select(simItemsRatingsFrame,
            starts_with('simItemsRatings_')[1:numSims[length(numSims)]])
# - modelFrame:
mFrame <- cbind(f1, f2, f3, f4, ratingsData$Rating)
colnames(mFrame)[dim(mFrame)[2]] <- 'Rating'
# - Keep complete observations only;
# - to match the size needed for the most encompassing clm() model:
mFrame <- mFrame[complete.cases(mFrame), ]
# - store sample size:
n <- dim(mFrame)[1]
# - Rating as ordered factor for clm():
mFrame$Rating <- factor(mFrame$Rating, ordered = T)
# - clean up a bit:
rm('f1', 'f2', 'f3', 'f4'); gc()
## -- 10-fold cross-validation
set.seed(10071974)
# - folds:
foldSize <- round(length(mFrame$Rating)/10)
foldRem <- length(mFrame$Rating) - 10*foldSize
foldSizes <- rep(foldSize, 9)
foldSizes[10] <- foldSize + foldRem
foldInx <- numeric()
for (i in 1:length(foldSizes)) {
 foldInx <- append(foldInx, rep(i,foldSizes[i]))
}
foldInx <- sample(foldInx)
# CV loop:
for (k in numSims) {
 
 ## -- loop counter
 ct <- ct + 1
 
 ## -- report
 print(paste0("Ordinal Logistic Regression w. ",
              k, " nearest neighbours running:"))
 
 ### --- select k neighbours:
 modelFrame <- mFrame[, c(1:k, 31:(30+k), 61:(60+k), 91:(90+k))]
 modelFrame$Rating <- mFrame$Rating
 
 # - model for the whole data set (no CV):
 mFitAll <- clm(Rating ~ .,
                data = modelFrame)
 saveRDS(mFitAll, paste0("OrdinalModel_NoCV_", k, "Feats.Rds"))
 
 RMSE <- numeric(10)
 for (i in 1:10) {
   # - train and test data sets
   trainFrame <- modelFrame[which(foldInx != i), ]
   testFrame <- modelFrame[which(foldInx == i), ]
   # - model
   mFit <- clm(Rating ~ .,
               data = trainFrame)
   
   # - predict test set:
   predictions <- predict(mFit, newdata = testFrame,
                          type = "class")$fit
   
   # - RMSE:
   dataPoints <- length(testFrame$Rating)
   RMSE[i] <-
     sum((as.numeric(predictions)-as.numeric(testFrame$Rating))^2)/dataPoints
   print(paste0(i, "-th fold, RMSE:", RMSE[i]))
 }
 
 ## -- store mean RMSE over 10 folds:
 meanRMSE[ct] <- mean(RMSE)
 print(meanRMSE[ct])
 
 ## -- clean up:
 rm('testFrame', 'trainFrame', 'modelFrame'); gc()
}

The outer loop varies k in numSims,
the number of nearest neighbours selected, then prepares the dataset,
picks ten folds for cross-validation, and fits the ordered logit to the
whole dataset before the cross-validation; the inner loop performs a
10-fold cross-validation and stores the RMSE from each test fold; the
average RMSE from ten folds for each k in numSims is recorded. This procedure is later modified to test models that work with proximity or similarity matrices exclusively.

Results

The following figure presents the average RMSE from
10-fold CV after selecting 5, 10, 15, 20, 25, and 30 nearest neighbours
from the user-user and item-item proximity and similarity matrices. Once
again, the similarity matrices represent the memory-based information,
while the proximity matrices carry the content-based information. The
four different sources of features are termed “feature categories” in
the graph.

Figure 1. The average RMSE from the 10-fold cross-validations from clm()
with 5, 10, 15, 20, 25, and 30 nearest neighbours.

The model with 20 features only
 and exactly 24 parameters (4 intercepts were estimated plus the
coefficients for 5 ratings from four matrices) achieves an average RMSE
of around .52. For comparison, in a recent 10-fold cross-validation over MovieLens 100K with {recommenderlab}
the CF algorithms achieved an RMSE of .991 with 100 nearest neighbours
in user-based CF, and an RMSE of .940 with 100 nearest neighbours in
item-based CF: our approach roughly halves the error with a
fivefold reduction in the number of features used. With 120 features (30
nearest neighbours from all four matrices) and 124 parameters, the
model achieves an RMSE of approximately .43. With these results, the new
approach surpasses many recommender engines that were tested on
MovieLens 100K in the previous years; for example, compare the results
presented here – taking into consideration the fact that the model with
20 features only achieves an RMSE of .517 on the whole dataset (i.e.
without cross-validation) – with the results
from the application of several different approaches presented by the
Recommender Systems project of the Carleton College, Northfield, MN.

What about the relative importance of memory and
content-based information in the prediction of user ratings? The
following figure presents the median values and range of exp(β) for
progressively more complex ordinal models that were tested here:

Figure 2. The median exp(β) from the clm() ordered logit models estimated
without cross-validation and encompassing 5, 10, 15, 20, 25, and 30 nearest neighbours from the memory-based and content-based matrices.

As is readily observed, the memory-based features (simItemsCoeff for item-item similarity neighbours and simUsersCoeff
for user-user similarity neighbours in the graph legend) weigh more
heavily in the ratings prediction than the content-based features (proxItemsCoeff for the item-item proximity neighbours, and proxUsersCoeff
for the user-user proximity neighbours). I have performed another round
of the experiment by fitting ordinal models with the information from
the memory-based, similarity matrices only; Figure 3 presents the
average RMSE obtained from 10-fold CVs of these models:

Figure 3. The average RMSE from the 10-fold cross-validations from clm()
with 5, 10, 15, 20, 25, and 30 nearest neighbours for models with memory-based
information only.

The results seem to indicate almost no change in the
RMSEs across the models (but note how the model starts overfitting when
more than 25 features per matrix are used). Could it be that the
content-based information – encoded by the proximity matrices – is not a
necessary ingredient in my predictive model at all? No. But we will
certainly not use RMSE to support this conclusion:

modelFiles <- list.files()
modelRes <- matrix(rep(0,6*7), nrow = 6, ncol = 7)
numSims <- seq(5, 30, by = 5)
ct = 0
for (k in numSims) {
 ct = ct + 1
 fileNameSimProx <- modelFiles[which(grepl(paste0("CV_",k,"F"), modelFiles, fixed = T))]
 modelSimProx <- readRDS(fileNameSimProx)
 fileNameSim <- modelFiles[which(grepl(paste0("Sim_",k,"F"), modelFiles, fixed = T))]
 modelSim <- readRDS(fileNameSim)
 testRes <- anova(modelSimProx, modelSim)
 modelRes[ct, ] <- c(testRes$no.par, testRes$AIC, testRes$LR.stat[2],
                     testRes$df[2], testRes$`Pr(>Chisq)`[2])
}
modelRes <- as.data.frame(modelRes)
colnames(modelRes) <- c('Parameters_Model1', 'Parameters_Model2',
                       'AIC_Model1', 'AIC_Model2',
                       'LRTest','DF','Pr_TypeIErr')
write.csv(modelRes, 'ModelSelection.csv')

The results of the likelihood ratio tests,
together with the values of the Akaike Information Criterion (AIC) for
all models are provided in Table 1; a comparison between the AIC values
are provided in Figure 4. I have compared the models that encompass both
the top neighbours from the memory and the content-based matrices, on
one, with the models encompassing only information from the memory-based
matrices, on the other hand. As you can see, the AIC is consistently
lower for the models that incorporate both the content and the
memory-based information (with the significant LRTs testifying that the
contribution of content-based information cannot be rejected).

Figure 4. Model selection: AIC for ordered logit models
with 5, 10, 15, 20, 25, and 30 nearest neighbours: memory-based
information only (Sim) vs. memory-based and content-based models (Sim+Prox).

Summary

Some say that comparing model-based recommendation
systems with the CF approaches is not fair in the first place. True.
However, achieving a reduction in error of almost 50% by using 80% fewer
features is telling. I think the result has two important stories to
tell, in fact. The first one is that the comparison is really not fair
:-). The second story is, however, much more important. It is not about
the application of the relevant model; since the development of discrete
choice models in economics, many behavioral scientists who see a Likert-type
response scale probably tend to think of some appropriate ordinal
model for the problem at hand. It is about the formulation of the problem: as ever in science, the crucial question turns out to be the choice of the most relevant description of the problem; in cognitive and data science, it is the question of what representation enables its solution by the simplest possible means of mathematical modeling.

References

[1] Greene, W. & Hensher, D. (2010). Modeling Ordered Choices. Cambridge University Press, 2010.

[2] Christensen, R. H. B. (June 28, 2015). Analysis
of ordinal data with cumulative link models – estimation with the
R-package ordinal. Accessed on March 03, 2017, from The Comprehensive R Archive Network. URL: https://cran.r-project.org/web/packages/ordinal/vignettes/clm_intro.pdf

Goran S. Milovanović, PhD
Data Science Consultant, SmartCat

To leave a comment for the author, please follow the link and comment on their blog: The Exactness of Mind. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (1): Feature engineering

Fri, 04/14/2017 - 22:38

(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)

I will use {ordinal} clm() (and other cool R packages such as {text2vec}
as well) here to develop a hybrid content-based, collaborative
filtering, and (obviously) model-based approach to solve the
recommendation problem on the MovieLens 100K dataset in R. All R code
used in this project can be obtained from the respective GitHub repository; the chunks of code present in the body of the post illustrate the essential steps only. The MovieLens 100K dataset can be obtained from the GroupLens research laboratory
of the Department of Computer Science and Engineering at the University
of Minnesota. The first part of the study introduces the new approach
and refers to the feature engineering steps that are performed by the OrdinalRecommenders_1.R script (found on GitHub). The second part, to be published soon, relies on the R code in  OrdinalRecommenders_3.R and presents the model training, cross-validation, and analyses steps. The  OrdinalRecommenders_2.R
script encompasses some tireless for-looping in R (a bad habit indeed)
across the dataset only in order to place the information from the
dataset in the format needed for the modeling phase. The study aims at
(a) demonstrating the improvement in predicted ratings for
recommendations on a well-known dataset, and (b) shedding light on
the importance of various types of information in the work of
recommendation engines. Consequently, the code is not suited for use in
production; additional optimizations are straightforward, simple, and
necessary as well.

Introduction

Avoiding the formal exposition of the problem
altogether, the recommendation problem in the context of contemporary
online markets can be expressed in the following way: given (A) the
information on the user’s past ratings of various items (i.e. the  user-item ratings;
for example, the ratings on a 5-point Likert type scale of the movies
or TV shows that she or he has watched in the past, the ratings on
similar scales of the items purchased, and similar), (B) the information
on the user-item ratings of other users, and (C) the attributes of users
and/or items (c.f. the user’s gender, geographical location, age,
occupation, and similar; the genre of the movie, product type, and
similar for items), provide a prediction of the user’s rating of a novel, previously unassessed item.
The predictions are then used to select the novel items (or lists of
items) that a particular user would – by assumption – enjoy to assess,
and eventually purchase, in the future. The cognitive systems used for
making such predictions are known as recommendation engines, or recommender systems, and are widely used nowadays across the Internet business.

It is generally recognized that recommendation engines can be grouped in two broad categories: (1) content-based systems, or (2) collaborative filtering (CF) systems, with the latter additionally described as (1.1) neighborhood or (1.2) model-based methods [1]. Content-based
systems rely on a typical description of items over feature
vectors, and then recommend novel items to users by computing some
similarity metric between them and the items that the user has already
rated. Collaborative filtering systems, on the
other hand, rely on the assumption that the covariations between the
ratings (a) from different users over the same set of items, or (b) to
different items from the same set of users, implicitly carry the
information on item and user attributes. The information about the true
user and item attributes are thus latent in CF. In the neighborhood (or memory) variant of CF, these approaches use the user-item ratings directly and proceed by aggregating the ratings of similar users (user-based CF) and/or similar items (item-based CF)
in order to provide an estimate of the user’s rating of a novel item,
typically weighting the known ratings of similar users/items by the
respective proximity/similarity measures. In the model-based
variant of CF, the system tries to discover the structure of the
latent factors from the observed behavioral measures (i.e. known
user-item ratings), and then a myriad of machine learning algorithms and
statistical approaches can be used to train a predictive model for the
ratings of novel user-item combinations.
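For illustration only, here is a minimal sketch of the user-based neighborhood idea just described (a hypothetical helper, not code from this project): the unknown rating is estimated as the similarity-weighted average of the ratings that the k most similar users gave to the item.

# ratings: users x items matrix with NA for unknown ratings
# userSim: users x users similarity matrix (e.g. Pearson correlations)
predictRatingUserCF <- function(ratings, userSim, user, item, k = 10) {
  raters <- setdiff(which(!is.na(ratings[, item])), user)
  if (length(raters) == 0) return(NA_real_)
  top <- raters[order(userSim[user, raters], decreasing = TRUE)]
  top <- top[seq_len(min(k, length(top)))]
  w   <- userSim[user, top]
  sum(w * ratings[top, item]) / sum(abs(w))   # similarity-weighted average
}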

I will here introduce a new hybrid approach to
recommender systems with the following characteristics. First, the
system is both content-based and CF. The content-based component of the
system encompasses two matrices: the user-user and the item-item
proximity matrices, both obtained from applying the relevant distance
metric over a set of features that characterize users and items,
respectively. The CF component of the system relies on the typical
user-user and item-item similarity matrices computed from the known,
past user-item ratings, providing for a memory component of the
recommender. Second, the system is model-based in the sense that it
combines the information from these four matrices in an ordinal logistic
regression model to predict the ratings of novel items on a 5-point
Likert scale. The underlying logic of this approach is to try to express
the recommendation problem as a discrete choice problem, where the
alternatives are provided by ordinal information from a 5-point Likert
scale, while the decision makers and the choice context are described by
a set of content-based and memory-based features, all referring to some
degree of closeness between the unknown user/item combination with the
known user/item in two spaces: the attribute (feature) space and the
neighborhood (memory) space. Because the CF approaches rely on
users-user (or item-item) proximity, the question of how many users (or
items) that are similar to the context of the present, novel user-item
rating should be used in prediction arises naturally. By expressing the
recommender problem as a regression problem essentially, we become able
to judge the significance of additional user-user and/or item-item
similarity information from the change in regression coefficients
obtained by progressively adding more and more information to nested
ordinal logistic models and training them. Thus the new approach also
provides an excellent opportunity for a comparison between
content-based information and memory-based information (the latter
being typically used in CF): which one is more important? Given the
assumption that the former is implicitly present in the latter, i.e. that
it presents latent information in the CF approach, does the usage of
manifest memory-based information on user-item ratings completely
eliminate the need for a content-based description? If not, how should
the two be combined? I will demonstrate how the present approach
resolves this dilemma while resulting in highly improved recommendations
on a well-known and widely used MovieLens dataset, at the same time
reducing drastically the number of features that are needed to build the
relevant prediction model.

Motivation

Consider the following user-item ratings matrix, encompassing 10 items (columns) and three users (rows); the values correspond to the user1, user2, and user3 vectors used in the code below, with "-" marking a missing rating:

        i1   i2   i3   i4   i5   i6   i7   i8   i9   i10
user1    5    -    -    4    3    -    1    -    -    -
user2    4    -    3    3    2    -    -    1    2    4
user3    3    -    1    5    4    -    -    2    5    4

In this rating matrix, the ratings of user1 and user2
have a maximal Pearson correlation of 1; however, the ratings of user2
and user3 have a linear correlation of .085, sharing only around 0.7% of
the variance. Upon a closer look, user1 and user2 have only three
common ratings (for items i1, i4, and i5), while user2 and user3 have
rated exactly the same items (i1, i3, i4, i5, i8, i9, and i10). Imagine
that the items rated by user2 and user3 are all Sci-Fi movies; we can
imagine these two users chatting and challenging each other’s views on
contemporary Sci-Fi production on online forums or fan conferences,
right? They do not have to have perfectly similar tastes, but in
fact they are similar: simply because they both enjoy Sci-Fi. The Jaccard distance
measures the proximity of two vectors A and B, both given over binary
features (e.g. present vs. not present), by computing the ratio of (a)
the difference between the union of A and B and the intersection of A
and B to (b) the union of A and B; it ranges from zero to one. The Jaccard similarity
coefficient is given by one minus the Jaccard distance and can be
viewed as the proportion of common features that the two vectors share
among the features present in either of them; we will use the
{proxy} dist() function to compute these from our ratings matrix:

### --- Illustration
user1 <- c(5, NA, NA, 4, 3, NA, 1, NA, NA, NA)
user2 <- c(4, NA, 3, 3, 2, NA, NA, 1, 2, 4)
# -- User1 and User2 have a high Pearson Correlation Coefficient:
cor(user1, user2, use = "pairwise.complete.obs")
user3 <- c(3, NA, 1, 5, 4, NA, NA, 2, 5, 4)
# -- User2 and User3 have a low Pearson Correlation Coefficient:
cor(user2, user3, use = "pairwise.complete.obs")
# -- User2 and User3 have a high Jaccard similarity
# -- i.e. 1 minus the Jaccard distance between them:
library(proxy)
as.numeric(1 - dist(t(user2), t(user3), method = "jaccard"))
as.numeric(1 - dist(t(user1), t(user2), method = "jaccard"))
as.numeric(1 - dist(t(user1), t(user3), method = "jaccard"))

User2 and User3 who have provided the ratings for the
same items exactly have a Jaccard similarity index of 1; they both have
a Jaccard similarity index of .375 with User1. Then who is the user
more similar to User2 – the one whose ratings could be used to approximate
the experiences of User2 with previously unseen items: (a) User1, with
whom he has a memory-based Pearson correlation of 1 on three shared
items, or (b) User3, with whom he has a Pearson correlation of only
.085, but with whom he had shared the same experiences and thus achieved
a maximum level of Jaccard similarity?

The value of the linear correlation coefficient (or
cosine similarity, or any other similar metric) will not preserve the
information on the number of shared items; that information is implicitly
present in the Type I Error test (for example), not in the extent of
covariance itself. The Jaccard distance, on the other hand, neglects the
covariances from the rating scales altogether, preserving only the
information about the extent of the shared ratings. If we think about
the items rated by a particular user, neglecting the particular rating
values, we understand that we can describe some aspects of the user’s
taste by binarizing his or her ratings; we can know, for example, the
frequency distribution of the item types that he or she has decided to
assess at all. If we additionally compute a distance or similarity
matrix over that information for all users, we express the content-based
approach in terms of the attributes that describe a particular use by
its distance or similarity from other users in terms of the very same
attributes – in a content-based description space. We can add the
avaialable demographic information to the binary vectors before
computing the similarities too and thus enrich the representation. And
we can do the same for items, of course. As we already have the
similarity matrices from linear correlations between the paired users’
ratings at our disposal, we can combine the information from these two
approaches – the content-based and the memory, or neighborhood-based –
and try to predict the unobserved user-item ratings from a unified
model.

The details of feature engineering along these lines in R are provided in the respective section below.

Dataset

The MovieLens 100K dataset [2] comprises 100,000
ratings (Likert type 1 to 5 scale) from 943 users on 1682 movies;
additionally, there are at least 20 movies rated per user. Some
demographic information for the users is present – age, gender,
occupation, zip – as well as the genre and the release dates for movies.
The dataset can be downloaded from the GroupLens research laboratory
of the Department of Computer Science and Engineering at the University
of Minnesota. Historically, it is among the most popular datasets used
to test the new approaches to the development of recommendation engines.
MovieLens 100K can also be obtained from Kaggle and Datahub. Our OrdinalRecommenders_1.R
script reads in the components of this dataset and computes the desired
user-user and item-item distance and similarity matrices following some
data reshaping.

Feature engineering

Part 1.A of the  OrdinalRecommenders_1.R
script is used merely to read the original MovieLens 100K dataset
files, rename the columns present there, and save as .csv files; you
will need to run it too in order to use the code from part 1.B and the
remaining scripts. In Part 1.B we begin to operate over the following
three data.frames: (1) the ratingsData, which comprises the ratings on
5-point scales for all user-item pairs from the dataset, (2) usersData,
which comprises demographic information on users (Gender, Age,
Occupation, Zip-code), and (3) moviesData, which comprises all item
information (Title, Release Date, and Genre). Part 1.B of this R script
performs the essential feature engineering operations in order to
produce the following four matrices:

Matrix P1 – the Item-Item Proximity Matrix:
the Jaccard distances between the items, computed from the binarized
features encompassing release decade (derived from the original release
year information, provided in the movie Title column) and movie genre.
This matrix plays the role of the content-based representation of items. The dist2() function from {text2vec} is used on a sparse Matrix class to compute the Jaccard distances here:

### --- Compute moviesDistance w. Jaccard {text2vec} from movie genres
moviesData <- moviesData %>% separate(col = Date,
                                     into = c('Day', 'Month','Year'),
                                     sep = "-")
moviesData$Day <- NULL
moviesData$Month <- NULL
moviesData$Year <- as.numeric(moviesData$Year)
range(moviesData$Year)
# - that would be: [1] 1922 1998
# - Introduce Movie Decade in place of Year:
decadeBoundary <- seq(1920, 2000, by = 10)
moviesData$Year <- sapply(moviesData$Year, function(x) {
 wL <- x < decadeBoundary
 wU <- x >= decadeBoundary
 if (sum(wL) == length(decadeBoundary))  {
   return(1)
 } else if (sum(wU) == length(decadeBoundary)) {
   decadeBoundary[length(decadeBoundary)]
 } else {
   decadeBoundary[max(which(wL-wU == -1))]
 }
})
# - Match moviesData$Year with ratingsData:
mD <- moviesData %>%
 select(MovieID, Year)
ratingsData <- merge(ratingsData, mD,
                    by = 'MovieID')
# - Movie Year (now Decade) as binary:
moviesData <- moviesData %>%
 spread(key = Year,
        value = Year,
        fill = 0,
        sep = "_")
# - compute moviesDistance:
moviesDistance <- moviesData[, 3:ncol(moviesData)]
w <- which(moviesDistance > 0, arr.ind = T)
moviesDistance[w] <- 1
moviesDistance <- dist2(Matrix(as.matrix(moviesData[, 4:ncol(moviesData)])),
                       method = "jaccard")
moviesDistance <- as.matrix(moviesDistance)
rm(moviesData); gc()
# - save objects and clear:
numMovies <- length(unique(ratingsData$MovieID))
write_csv(as.data.frame(moviesDistance),
         path = paste0(getwd(),'/moviesDistance.csv'),
         append = F, col_names = T)
rm(moviesDistance); gc()

Matrix P2 – the User-User Proximity Matrix:
the Jaccard distances between the users, computed from the binarized
features representing only whether a particular user has or has not rated a
particular item. This matrix plays the role of the content-based representation of users:

### --- produce binary User-Item Matrix (who rated what only):
userItemMat <- matrix(rep(0, dim(usersData)[1]*numMovies),
                     nrow = dim(usersData)[1],
                     ncol = numMovies)
userItemMat[as.matrix(ratingsData[c('UserID', 'MovieID')])] <- 1
rm('w', 'ratingsData', 'usersData'); gc()
### --- Compute userDistance w. Jaccard {text2vec}
userItemMat <- Matrix(userItemMat)
usersDistance <- dist2(userItemMat,
                      method = "jaccard")
rm(userItemMat); gc()
usersDistance <- as.matrix(usersDistance)
write_csv(as.data.frame(usersDistance),
         path = paste0(getwd(),'/usersDistance.csv'),
         append = F, col_names = T)
rm(usersDistance); gc()

Note that I have not used the user demographics
available in MovieLens 100K. In this project, I will not use them at
all, but the general rationale is: because I am going after an ordinal
logistic regression model, I can use this information as regressors
there. Also, as I have computed the user-user distance by looking at
what movie ratings they have in common, presumably capturing some
information on shared tastes, I could have done the same for the items,
by inspecting what movies have been rated by the same users. However, a
hunch told me that I might end up using redundant feature
representations in the same model; I’ve decided to describe the movie
similarities from content-based features directly (genre and the decade
of production), and user similarities from content-based features that
encompass the similarity in tastes (the common movie ratings among
them).

The following two similarity matrices are those
typically used in user and item-based CF; they comprise Pearson
correlation coefficients computed directly from the ratings matrix for
1681 items (note: one movie was ‘unknown’, so I have removed it from the
dataset and cleaned up from the ratings matrix accordingly) and 943
users. Given the moderate size of the desired correlation matrices, I
have used the {base} cor() function on pairwise complete observations to
compute them. For digging out correlation or cosine matrices from
problems larger than the MovieLens 100K, you might want to take a look
at the functions provided by my colleague Stefan Nikolić in his “Recommender Systems: Matrix operations for fast calculation of similarities” (supporting the computations for CF in an order of magnitude less time than the popular {recommenderlab} R package). The code produces the remaining two matrices:

Matrix S1 – the User-User Similarity Matrix:
the pairwise complete Pearson correlation coefficients between the
users’ ratings from the ratings matrix directly. This matrix plays the
role of the memory-based representation of users:

### --- Compute User-User and Item-Item Ratings Similarity Matrices
ratingsData <- read_csv("ratingsData_Model.csv",
                       col_names = T)
ratingsData$X1 <- NULL
# - User-Item Ratings Matrix
ratingsData$Timestamp <- NULL
ratingsData <- ratingsData %>%
 spread(key = MovieID,
        value = Rating,
        sep = "_") %>%
 arrange(UserID)
# - Pearson Correlations: User-User Sim Matrix
UserUserSim <- ratingsData %>%
 select(starts_with("Movie"))
UserUserSim <- t(UserUserSim)
UserUserSim <- cor(UserUserSim,
                  use = 'pairwise.complete.obs')
UserUserSim <- as.data.frame(UserUserSim)
write_csv(UserUserSim,
         path = paste0(getwd(),'/UserUserSim.csv'),
         append = F, col_names = T)
rm(UserUserSim); gc()

Matrix S2 – the Item-Item Similarity Matrix:
the pairwise complete Pearson correlation coefficients between the
items’ ratings from the ratings matrix directly. This matrix plays the
role of the memory-based representation of items:

# - Pearson Correlations: Item-Item Sim Matrix
ItemItemSim <- ratingsData %>%
 select(starts_with("Movie"))
rm(ratingsData); gc()
ItemItemSim <- cor(ItemItemSim,
                  use = 'pairwise.complete.obs')
ItemItemSim <- as.data.frame(as.matrix(ItemItemSim))
write_csv(ItemItemSim,
         path = paste0(getwd(),'/ItemItemSim.csv'),
         append = F, col_names = T)
rm(ItemItemSim); gc()

Summary

The idea is to combine the content-based with
memory-based information in a unified model to solve the recommendation
problem for online recommender systems. I have used the Jaccard
distances to describe the content-based proximity between users, based
on what items they have rated or not; on the other hand, the Jaccard
distances were also used to describe the proximity of items in a
content-based space derived from the “essentialistic” item features like
movie genres and the decade of their production. Additionally, the
typical Pearson correlation similarity matrices – computed directly from
the ratings matrix – were produced for both users and items. The two
user-user and item-item matrices share the same respective
dimensionality. In the next step, the idea is to use the neighborhoods
of different size from these matrices (say, take p similar users from the content-based proximity and the memory-based user-user similarity matrices, and q
similar items from the content-based proximity and the memory-based
item-item matrices) as regressors in an ordinal logistic model to
predict the ordinal 5-point ratings of novel user-item combinations. By
inspecting the regression coefficients from these predictive models
while varying the size of the neighborhood sets of users and items whose
ratings will be used as regressors, I hope to be able to provide some
conclusion on the relative importance of content-based and memory-based
information for building efficient recommendation engines. We’ll see.
But before skipping to  OrdinalRecommenders_3.R to play with the beautiful {ordinal} models, you might wish to run  OrdinalRecommenders_2.R
that will prepare the first 100 neighbours from the proximity and
similarity matrices for you. As I told you, there’s a for loop there –
and a bad one.

References

[1] Desrosiers, C. & Karypis, G. (2010). A comprehensive survey of neighborhood-based recommendation methods. In Ricci, F., Rokach, L., Shapira, B. & Kantor, P. B. (Eds), Recommender Systems Handbook. Springer: US. DOI:10.1007/978-0-387-85820-3.

[2] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872.

Goran S. Milovanović, PhD
Data Science Consultant, SmartCat

To leave a comment for the author, please follow the link and comment on their blog: The Exactness of Mind. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Modeling data with functional programming, Part I

Fri, 04/14/2017 - 22:26

(This article was first published on R – Cartesian Faith, and kindly contributed to R-bloggers)

The latest draft of my book is available. This will be my last pre-publication update, as I’m in the process of editing the remainder of the book. That said, the first part is packed with examples and provides a solid foundation on its own. I’m including the preface below to whet people’s appetite for the complete book when it is published.

Preface

This book is about programming. Not just any programming, but programming for data science and numerical systems. This type of programming usually starts as a mathematical modeling problem that needs to be translated into computer code. With functional programming, the same reasoning used for the mathematical model can be used for the software model. This not only reduces the impedance mismatch between model and code, it also makes code easier to understand, maintain, change, and reuse. Functional programming is the conceptual force that binds these two models together. Most books that cover numerical and/or quantitative methods focus primarily on the mathematics of the model. Once this model is established, the computational algorithms are presented without fanfare as imperative, step-by-step algorithms. These detailed steps are designed for machines. It is often difficult to reason about the algorithm in a way that can meaningfully leverage the properties of the mathematical model. This is a shame because mathematical models are often quite beautiful and elegant yet are transformed into ugly and cumbersome software. This unfortunate outcome is exacerbated by the real world, which is messy: data does not always behave as desired; data sources change; computational power is not always as great as we wish; reporting and validation workflows complicate model implementations. Most theoretical books ignore the practical aspects of working in the field. The need for a theoretical and structured bridge from the quantitative methods to programming has grown as data science and the computational sciences become more pervasive.

The goal is to re-establish the intimate relationship between mathematics and computer science. By the end of the book, readers will be able to write computer programs that clearly convey the underlying mathematical model, while being easy to modify and maintain. This is possible in R by leveraging the functional programming features built into the language. Functional programming has a long history within academia and its popularity in industry has recently been rising. Two prominent reasons for this upswing are the growing computational needs commensurate with large data sets and the data-centric computational model that is consistent with the analytical model. The superior modularity of functional programs isolates data management from model development, which keeps the model clean and reduces development overhead. Derived from a formal mathematical language, functional programs can be reasoned about with a level of clarity and certainty not possible in other programming paradigms. Hence the same level of rigor applied to the quantitative model can be applied to the software model.

Divided into three parts, the book first establishes foundations that use the world of mathematics as an introduction to functional programming concepts (Chapter 2). Topics discussed include basic concepts in set theory, statistics, linear algebra, and calculus. As core tools for data scientists, this material should be accessible to most practitioners, graduate students, and even upper-class undergraduates. The idea is to show you that you already know most of the concepts used in functional programming. Writing code using functional programming extends this knowledge to operations beyond the mathematical model. This chapter also shows counterexamples using other programming styles, some in other languages. The point isn’t to necessarily criticize other implementation approaches, but rather to make the differences in styles tangible to the reader. After establishing some initial familiarity, Chapter 3 dives into functions, detailing the various properties and behaviors they have. Some important features include higher-order functions, first-class functions, and closures. This chapter gives you a formal vocabulary for talking about functional programs in R. This distinction is important as other functional languages have plenty more terms and theory ignored in this book. Again, the goal is to show how to leverage functional programming ideas to be a better data scientist. Finally, Chapter 4 reviews the various packages in R that provide functionality related to functional programming. These packages include some built-in implementations, paradigms for parallel computing, a subset of the so-called tidyverse, and my own lambda.r package, which offers a more comprehensive approach to writing functional programs.

While the first part provides a working overview of functional programming in R, Part II takes it a step further. This part is for readers that want to exploit functional programming principles to their fullest. Many topics in Part I reference deeper discussions in Part II. This canon begins by exploring the nature of vectorization in Chapter 5. Often taken for granted, we look at how colloquial vectorization conflates a few concepts. This chapter unravels the concepts, showing you what can be done and what to expect with each type of vectorization. Three primary higher-order functions map, fold, and filter follow. Chapter 6 shows how the concept of map appears throughout R, particularly in the apply family of functions. I show how to reason about data structures with respect to the ordinals (or index) of the data structure. This approach can simplify code by enabling the separation of data structures, so long as an explicit ordinal mapping exists. While fold is more fundamental than map, its use is less frequent in R. Discussed in Chapter 7, fold provides a common structure for repeated function application. Optimization and many iterative methods, as well as stochastic systems make heavy use of repeated function application, but this is usually implemented as a loop. With fold, the same function used to implement a single step can be used for multiple steps, reducing code complexity and making it easier to test. The final function of the canon is filter, which creates subsets using a predicate. This concept is so integral to R that the notation is deeply integrated into the language. Chapter 8 shows how the native R syntax is tied to this higher-order function, which can be useful when data structures are more complex. Understanding this connection also simplifies porting code to or from R when the other language doesn’t have native syntax for these operations.
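To give a flavour of this canon, base R already ships with these higher-order functions (a toy illustration of mine, not an excerpt from the book):

# map: apply a function to each element
squares <- Map(function(x) x^2, 1:5)              # list(1, 4, 9, 16, 25)
# fold: repeated application of a two-argument function
total <- Reduce(`+`, 1:5)                         # 15
# filter: keep the elements that satisfy a predicate
evens <- Filter(function(x) x %% 2 == 0, 1:10)    # 2, 4, 6, 8, 10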

Programming languages can’t do much without data structures. Chapter 9 shows how to use native R data structures to implement numerous algorithms as well as emulate other data structures. For example, lists can be used to emulate trees, while environments can emulate hash tables.
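As a minimal illustration of the latter point (my own example, not the book’s):

# environments hash their names by default, so they behave like hash tables
h <- new.env(hash = TRUE)
assign("alice", 42, envir = h)                 # insert
h$bob <- 7                                     # insert via $
get("alice", envir = h)                        # lookup -> 42
exists("carol", envir = h, inherits = FALSE)   # membership test -> FALSE
rm("bob", envir = h)                           # delete
ls(h)                                          # keys currently stored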

The last part focuses on applications and advanced topics. Part III can act as a reference for implementing various algorithms in a functional style. This provides readers with tangible examples of using functional programming concepts for algorithm development. This part begins by proposing a simple model development pipeline in Chapter 11. The intention is to provide some reusable structure that can be tailored to each reader’s needs. For example, the process of backtesting is often implemented in reverse. Loops become integral to the implementation, which makes it difficult to test individual update steps in an algorithm. This chapter also shows some basic organizational tricks to simplify model development. A handful of machine learning models, such as random forest, are also presented in Chapter 11. Remarkably, many of these algorithms can be implemented in fewer than 30 lines of code when using functional programming techniques. Optimization methods, such as Newton’s method, linear programming, and dynamic programming, follow in Chapter 13. These methods are usually iterative and therefore benefit from functional programming. State-based systems are explored in Chapter 12. Ranging from iterative function systems to context-free grammars, state is central to many models and simulations. This chapter shows how functional programming can simplify these models. This part also discusses two case studies (Chapter 14 and Chapter 15) for implementing more complete systems. In essence, Part III shows the reader how to apply the concepts presented in the book to real-world problems.

Each chapter presents a number of exercises to test what you’ve learned. By the end of the book, you will know how to leverage functional programming to improve your life as a data scientist. Not only will you be able to quickly and easily implement mathematical ideas, you’ll be able to incrementally change your exploratory model into a production model, possibly scaling to multiple cores for big data. Doing so also facilitates repeatable research since others can review and modify your work more easily.

Rowe – Modeling data with functional programming (part i)

To leave a comment for the author, please follow the link and comment on their blog: R – Cartesian Faith. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Community workshops

Fri, 04/14/2017 - 20:00

(This article was first published on R – Locke Data, and kindly contributed to R-bloggers)

Following on from when we announced the availability of our community workshops, we’ve got three in the next three months that folks can attend.

May 19th – Data science project in a day

We’ll be in Kiev, Ukraine, doing a whole data science project in a day. This is intended to give people a little bit of code, process, and critical thinking along the whole data science workflow. This will enable folks to see how it hangs together and decide where and how much they want to invest their learning in future.

Tickets for this workshop are currently ~£60 so even if people flew over and stayed a couple of nights, it’d still be pretty darn cheap!

More Info

June 26th – Introduction to R

We’ll be in Newcastle, UK, doing an introduction to R. This focuses on data analysis skills, working with databases, and writing reusable code. It’s also split into two halves so folks can come along for only half the day.

Tickets for this workshop are £50 for a half day or £90 for the whole day, with discounts available to certain groups.

More Info

July 15th – Introduction to R

We’ll be in Manchester, UK, doing an introduction to R. This focuses on data analysis skills, working with databases, and writing reusable code.

Tickets for this workshop are £125 before the 12th of June.

More Info

Community workshops

If you’d like us to do a community workshop for your group, get in touch. You can also catch up with us at one of the upcoming events we’re presenting at.

The post Community workshops appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.

To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

R is for Archaeology: A report on the 2017 Society of American Archaeology meeting

Fri, 04/14/2017 - 18:00

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Ben Marwick, Associate Professor of Archaeology, University of Washington and Senior Research Scientist, University of Wollongong

The Society of American Archaeology (SAA) is one of the largest professional organisations for archaeologists in the world, and just concluded its annual meeting in Vancouver, BC at the end of March. The R language has been a part of this meeting for more than a decade, with occasional citations of R Core in the posters, and more recently, the distinctive ggplot2 graphics appearing infrequently on posters and slides. However, among the few archaeologists that have heard of R, it has a reputation for being difficult to learn and use, idiosyncratic, and only suitable for highly specialized analyses. Generally, archaeology students are raised on Excel and SPSS. This year, a few of us thought it was time to administer some first aid to R's reputation among archaeologists and generally broaden awareness of this wonderful tool. We developed a plan for this year's SAA meeting to show our colleagues that R is not too hard to learn, it is useful for almost anything that involves numbers, and it has lots of fun and cool people that use it to get their research done quicker and easier.

Our plan had three main elements. The first element was the début of two new SAA Interest Groups. The Open Science Interest Group (OSIG) was directly inspired by Andrew MacDonald's work founding the ESA Open Science section, with the OSIG being approved by the SAA Board this year. It aims to promote the use of preprints (e.g. SocArXiv), open data (e.g. tDAR, Open Context), and open methods (e.g. R and GitHub). The OSIG recently released a manifesto describing these aims in more detail. At this SAA meeting we also saw the first appearance of the Quantitative Archaeology Interest Group, which has a strong focus on supporting the use of R for archaeological research. The appearance of these two groups shows the rest of the archaeological community that there is now a substantial group of R users among academic and professional archaeologists, and they are keen to get organised so they can more effectively help others who are learning R. Some of us in these interest groups were also participants in fora and discussants in sessions throughout the conference, and so had opportunities to tell our colleagues, for example, that it would be ideal if R scripts were available for certain interesting new analytical methods, or that R code should be submitted when manuscripts are submitted for publication.

The second element of our plan was a normal conference session titled 'Archaeological Science Using R'. This was a two hour session of nine presentations by academic and professional archaeologists that were live code demonstrations of innovative uses of R to solve archaeological research problems. We collected R markdown files and data files from the presenters before the conference, and tested them extensively to ensure they'd work perfectly during the presentations. We also made a few editorial changes to speed things up a bit, for example using readr::read_csv instead of read.csv. We were told in advance by the conference organisers that we couldn't count on good internet access, so we also had to ensure that the code demos worked offline. On the day, the live-coding presentations went very well, with no-one crashing and burning, and some presenters even doing some off-script code improvisation to answer questions from the audience. At the start of the session we announced the release of our online book containing the full text of all contributions, including code, data and narrative text, which is online here. We could only do this thanks to the bookdown package, which allowed us to quickly combine the R markdown files into a single, easily readable website. I think this might be a new record for the time from an SAA conference session to a public release of an edited volume. The online book also uses Matthew Salganik's Open Review Toolkit to collect feedback while we're preparing this for publication as an edited volume by Springer (go ahead and leave us some feedback!). There was a lot of enthusiastic chatter later in the conference about a weird new kind of session where people were demoing R code instead of showing slides. We took this as an indicator of success, and received several requests for it to be a recurring event in future meetings.

The third element of our plan was a three-hour training workshop during the conference to introduce archaeologists to R for data analysis and visualization. Using pedagogical techniques from Software Carpentry (i.e. sticky notes, live coding and lots of exercises), Matt Harris and I got people using RStudio (and discovering the miracle of tab-complete) and modern R packages such as readxl, dplyr, tidyr, ggplot2. At the end of three hours we found that our room wasn't booked for anything, so the students requested a further hour of Q&A, which led to demonstrations of knitr, plotly, mapview, sf, some more advanced ggplot2, and a little git. Despite being located in the Vancouver Hilton, this was another low-bandwidth situation (which we were warned about in advance), so we loaded all the packages to the students' computers from USB sticks. In this case we downloaded package binaries for both Windows and OSX, put them on the USB sticks before the workshop, and had the students run a little bit of R code that used install.packages() to install the binaries to the .libPaths() location (for Windows) or untar'd the binaries to that location (for OSX). That worked perfectly, and seemed to be a very quick and lightweight method to get packages and their dependencies to all our students without using the internet. Getting the students started by running this bit of code was also a nice way to orient them to the RStudio layout, since they were seeing that for the first time.
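
A minimal sketch of that offline-install idea (the paths and folder layout here are hypothetical, not the exact code used in the workshop):

# Windows: install pre-downloaded binary packages (.zip files) from the USB stick.
zips <- list.files("E:/win-binaries", pattern = "\\.zip$", full.names = TRUE)
install.packages(zips, repos = NULL, type = "win.binary", lib = .libPaths()[1])

# OSX: the CRAN binaries are tarballs of installed packages, so untar them into the library.
tgzs <- list.files("/Volumes/USB/mac-binaries", pattern = "\\.tgz$", full.names = TRUE)
for (f in tgzs) untar(f, exdir = .libPaths()[1])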

This workshop was a first for the SAA, and was a huge success. Much of this is due to our sponsors who helped us pay for the venue hire (which was surprisingly expensive!). We got some major support from Microsoft Data Science User Group program (ed note: email msdsug@microsoft.com for info about the program) and Open Context, as well as cool stickers and swag for the students from RStudio, rOpenSci, and the Centre for Open Science. We used the stickers like tiny certificates of accomplishment, for example when our students produced their first plot, we handed out the ggplot2 stickers as a little reward.

Given the positive reception of our workshop, forum and interest groups, our feeling is that archaeologists are generally receptive to new tools for working with data, perhaps more so now than in the past (i.e. pre-tidyverse). Younger researchers seem especially motivated to learn R because they may have heard of it, but not had a chance to learn it because their degree program doesn't offer it. If you are a researcher in a field where R (or any programming language) is only rarely used by your colleagues, now might be a good time to organise a rehabilitation of R's reputation in your field. Our strategy of interest groups, code demos in a conference session, and a short training workshop during the meeting is one that we would recommend, and we imagine will transfer easily to many other disciplines. We're happy to share more details with anyone who wants to try!

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Data Science for Operational Excellence Exercises (Part-2)

Fri, 04/14/2017 - 17:45

(This article was first published on R-exercises, and kindly contributed to R-bloggers)

Network problems are everywhere. We can easily find instances in logistics, telecom, and project management, among others. In order to attack these problems using linear programming, we need to go beyond the assignment and transportation problems that we saw in part I. Our goal here is to expand the set of problems we can solve using the lpSolve and igraph R packages. For that, we will formulate the transportation problem using the more generic lp function.
Answers to the exercises are available here. If you obtained a different (correct) answer than those
listed on the solutions page, please feel free to post your answer as a comment on that page.
Please install and load the packages lpSolve and igraph before starting the exercises.
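
As a rough sketch of what a call to lp looks like (toy numbers only, not the exercise data), the pieces you will build in the exercises are an objective vector, a constraint matrix, a vector of constraint directions, and a right-hand-side vector:

library(lpSolve)
# Maximize x1 + 9*x2 + x3 subject to
#   x1 + 2*x2 + 3*x3 <= 9
#   3*x1 + 2*x2 + 2*x3 <= 15
f.obj <- c(1, 9, 1)
f.con <- matrix(c(1, 2, 3,
                  3, 2, 2), nrow = 2, byrow = TRUE)
f.dir <- c("<=", "<=")
f.rhs <- c(9, 15)
sol <- lp("max", f.obj, f.con, f.dir, f.rhs)
sol$solution   # optimal values of the decision variables
sol$objval     # optimal objective function value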


Exercise 1
Load the libraries lpSolve and igraph, and learn how to use the lp function by replicating the example in ?lp.

Exercise 2
We want to rewrite the transport problem stated in part I, but this time using the lp function. First, load the data used in the part I exercise.

Exercise 3
Run this transportation problem one more time using lp.transport. Check the objective function and decision variable values.

Exercise 4
Rearrange the data into the format required by the objective function for use with the lp function.

Exercise 5
Construct the binary matrix to model the DEMAND constraints. Observe that the variables should be in the same order as defined in the objective function.

Learn more about Network Analysis in the online course Statistics with R – Advanced Level. In this course you will learn how to:

  • Run a cluster analysis
  • Perform multidimensional scaling
  • And much more

Exercise 6
Forget about the OFFER constraints for a while. Run the lp function with only the DEMAND constraints and see what happens.

Exercise 7
Construct the binary matrix to model the OFFER constraints. Observe that the variables should be in the same order as defined in the objective function.

Exercise 8
Bind both the offer and demand constraints to construct the matrix f.con and the vectors f.dir and f.rhs.

Exercise 9
Now, solve the problem using the lp function. Find the solution for each variable and the objective function value. Check whether they match the values from lp.transport.

Exercise 10
Rename the rows to represent factories and the columns to represent depots. Create a graph using graph_from_incidence_matrix based on the optimum values of the decision variables.

Related exercise sets:
  1. Data Hacking with RDSTK 3
  2. Building Shiny App exercises part 4
  3. Descriptive Analytics-Part 6: Interactive dashboard ( 2/2)
  4. Explore all our (>1000) R exercises
  5. Find an R course using our R Course Finder directory

To leave a comment for the author, please follow the link and comment on their blog: R-exercises. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Modified Age Bias Plot

Fri, 04/14/2017 - 07:00

(This article was first published on fishR Blog, and kindly contributed to R-bloggers)

Original Age Bias Plot

Campana et al. (1995) introduced the “age bias plot” to visually assess potential differences in paired age estimates (e.g., between two structures such as scales and otoliths, between two readers, or between one reader at two times). One set of age estimates serves as “reference” ages in the age bias plot. The reference ages are usually the age estimates thought to be most accurate (i.e., the most accurate structure, “best” reader), but could be from the first reading if two readings from the same reader are made. The reference estimates form the x-axis on the age-bias plot. The mean and 95% confidence interval for the nonreference age estimates computed at each value of the reference age estimates are plotted to form the age bias plot. A 1:1 line that represents agreement between the two age estimates is usually included. Confidence intervals that do not capture this “agreement line” suggest a difference in the two age estimates at that reference age. An example age bias plot is below.

The age bias plot above was constructed with the code below. Briefly, the foundational calculations for the age bias plot are constructed with ageBias(), where the variable with the reference age estimates follows the tilde. The age bias plot is then constructed by submitting the ageBias() results to plot(). The col.CIsig=, pch.mean.sig=, sfrac=, lty.agree=, and col.agree= arguments are used here to modify default settings for these arguments so that the resultant age bias plot most closely resembles that described by Campana et al. (1995). These arguments are discussed further in the next section.

library(FSA)       # provides ageBias(), plot() and WhitefishLC
data(WhitefishLC)  # example data
ab1 <- ageBias(scaleC~otolithC,data=WhitefishLC,
               nref.lab="Scale Age",ref.lab="Otolith Age")
plot(ab1,col.CIsig="black",pch.mean.sig=19,sfrac=0.01,
     lty.agree=1,col.agree="black")

Making a “Cleaner” Age Bias Plot

I found the age bias plot to be useful for detecting systematic differences in estimated ages between two sets of readings, but I also found the plot to be “clunky” with some data sets. Thus, I modified the original design in several ways. The first set of changes were simply to make the plot “cleaner” and easier to interpret. Specific changes I made are described below.

  • Used a 1-sample t-test to determine if mean nonreference age (i.e., y-axis) differed significantly from the reference age for each reference age (i.e., x-axis). In other words, for example, does the mean nonreference age at a reference age of 3 differ from 3? This test was repeated for each reference age and the resultant p-values were corrected for multiple comparisons. Reference ages for which a significant difference was detected were plotted by default with an open symbol and a different color. The color and symbol for the ages where a significant difference was detected are controlled by col.CIsig= and pch.mean.sig=, respectively.
  • Confidence intervals for reference ages with small sample sizes can be very wide, which can cause poor scaling of the y-axis on the age bias plot. The min.n.CI= argument in ageBias() sets a sample size threshold for when confidence intervals are constructed. The default for this argument is 3 (i.e., a confidence interval will be constructed if the sample size is at least 3).
  • Made the agreement line a lighter gray and dashed so that it can be seen but is less bold. The type and color of the agreement line are controlled by lty.agree= and col.agree=, respectively.
  • Removed the “caps” on the ends of the confidence intervals to reduce clutter. The length of the confidence interval ends is controlled by sfrac=.

These modifications are the default settings in ageBias() and plot().

plot(ab1)

Plotting Differences on the Y-Axis

Muir et al. (2008) were the first (to my knowledge) to modify the age bias plot by using the difference between the reference ages and the mean nonreference ages on the y-axis. I modified this concept by first computing the difference between the nonreference and reference ages (nonreference-reference) for each individual and then computing the mean of those differences for each reference age. With this modification, the mean difference between nonreference and reference ages is plotted against the reference ages. The “agreement line” is now a horizontal line at 0 on this plot. This modified age bias plot is constructed by including difference=TRUE in ageBias().

plot(ab1,difference=TRUE)

Showing Individual Variability

I often wanted to have a feel for the individual variability underlying the age bias plot. A faint gray line that stretches from the minimum to the maximum nonreference age for each reference age is plotted behind each confidence interval by including show.range=TRUE.

plot(ab1,difference=TRUE,show.range=TRUE)

Alternatively, individual points (for paired age estimates) are included with show.pts=TRUE. There tends to be considerable overplotting of individual points because of the discrete nature of age data. To make individual points more obvious, a transparent color can be used for each point such that more overlapping points will appear darker. The level of transparency is controlled by including an integer in transparency=, with values further from 1 being more transparent.

plot(ab1,difference=TRUE,show.pts=TRUE)
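
For instance, a slightly more transparent version of the previous plot could be requested as below; the value 4 is arbitrary, and, per the description above, values further from 1 give more transparency.

plot(ab1,difference=TRUE,show.pts=TRUE,transparency=4)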

Illustrating Sample Size

I also often want to know the sample sizes underlying the age bias plot. In particular, it is useful to know the number of nonreference age estimates that contributed to each mean and confidence interval. My first attempt at providing this information on the plot was to simply print the values on the plot (usually above the plot or just above the x-axis). The sample sizes can be added to the plot with show.n=TRUE.

plot(ab1,difference=TRUE,show.n=TRUE)

These values, however, are often either so crowded or so small as to be of little utility. Recently I added the ability to add marginal histograms to the age bias plot. For example, a histogram of the sample sizes in the previous plot can be added by using xHist=TRUE.

plot(ab1,difference=TRUE,xHist=TRUE)

The same can be added for the nonreference ages with yHist=TRUE. This plot has the added advantage of showing the distribution of differences in age estimates, with the bar at a difference of zero representing the amount of perfect agreement between the sets of age estimates.

plot(ab1,difference=TRUE,xHist=TRUE,yHist=TRUE)

The same plot, but with a marginal histogram of the nonreference age estimates rather than the difference in age estimates is also nice for showing the age distributions for both sets of age estimates.

plot(ab1,xHist=TRUE,yHist=TRUE)

My Preference

My current preference for an age bias plot is to use the differences in age estimates, plot both marginal histograms (I like to see the distributions for the reference age estimates and the difference in age estimates), show the individual points, and remove the coloring for significantly different ages (though I like the difference in symbols). An example of my preferred plot is shown below for ages estimated from fin rays by two readers.

abOO <- ageBias(finray2~finray1,data=WhitefishLC,
                nref.lab="Reader 2 Age",ref.lab="Reader 1 Age")
plot(abOO,difference=TRUE,xHist=TRUE,yHist=TRUE,
     col.CIsig="black",show.pts=TRUE)

A Couple of Details

The modifications of the age bias plot described here are available in the development version of the FSA package, but not yet in the stable CRAN version. They will appear in v0.8.13.

Several of the options illustrated above can be modified with other arguments to ageBias(). See the documentation for further details (i.e., use ?ageBias).

The plots in this post used the following modifications of the default base graphing parameters. These changes make narrower margins around the plot, move the axis labels closer to the axes, and reduce the size of the axis tick marks relative to the default values.

par(mar=c(3.5,3.5,1,1),mgp=c(2.1,0.4,0),tcl=-0.2)

Final Thoughts

Please let me know what you think about my preferred age bias plot. In general, I like it and feel that it is more informative than other age bias plots. However, I am not quite satisfied with how I separated the “axes” labels for the main age bias plot and the marginal histograms. It looks “hacky” to me.

In addition, I thought about putting a horizontal line on the top marginal histogram that shows the cutoff for when a confidence interval is calculated. I also thought about highlighting the “zero bar” in the right marginal histogram when difference=TRUE to further highlight the amount of agreement between the age estimates. In the end, I did not make these modifications because they would seem to add clutter or draw too much attention.

Let me know if you have other ideas for how these age bias plots could be modified to be more informative.

References

To leave a comment for the author, please follow the link and comment on their blog: fishR Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RcppArmadillo 0.7.800.2.0

Fri, 04/14/2017 - 04:14

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new RcppArmadillo version 0.7.800.2.0 is now on CRAN.

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to that of Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 318 other packages on CRAN — an increase of 20 just since the last CRAN release of 0.7.600.1.0 in December!

Changes in this release relative to the previous CRAN release are as follows:

Changes in RcppArmadillo version 0.7.800.2.0 (2017-04-12)
  • Upgraded to Armadillo release 7.800.2 (Rogue State Redux)

    • The Armadillo license changed to Apache License 2.0
  • The DESCRIPTION file now mentions the Apache License 2.0, as well as the former MPL2 license used for earlier releases.

  • A new file init.c was added with calls to R_registerRoutines() and R_useDynamicSymbols()

  • Symbol registration is enabled in useDynLib

  • The fastLm example was updated

Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

March ’17 New Package Picks

Fri, 04/14/2017 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Two hundred and sixteen new packages were added to CRAN in March. The following are my picks for the Top Forty, organized into five categories: Bioscience, Data, Data Science, Statistics and Utilities.

Bioscience
  • BioInstaller v0.0.3: Provides tools to install and download massive bioinformatics analysis software and databases, such as NGS analysis tools with their required databases and/or reference genomes.

  • DSAIDE v0.4.0: Provides a collection of Shiny Apps that allow users to simulate and explore infectious disease transmission. The vignette will get you started.

  • treespace v1.0.0: Provides tools for the exploration of distributions of phylogenetic trees. This package includes a Shiny interface which can be started from R. There are vignettes for Dengue trees, Transmission trees, and for exploring landscapes of phylogenetic trees. The following plot from the vignette shows clusters of similar trees.

library(treespace)
data(woodmiceTrees)
wm.res <- treespace(woodmiceTrees,nf=3)
wm.groves <- findGroves(wm.res, nclust=6)
plotGrovesD3(wm.groves)

Data
  • cropdatape v1.0.0: Provides access to data from the Agricultural Ministry of Peru.

Data Science
  • anomalyDetection v0.1.1: Implements procedures to aid in detecting network log anomalies. The vignette provides examples.

  • kerasR v0.4.1: Provides an interface to the Keras Deep Learning Library, which provides specifications for describing dense neural networks, convolution neural networks (CNN), and recurrent neural networks (RNN) running on top of either TensorFlow or Theano. The vignette contains examples.

  • modeval v0.1.2: Allows users to easily compare multiple classification models built with caret functions for small data sets. The vignette provides examples.

  • supc v0.1: Implements the self-updating process clustering algorithms proposed in Shiu and Chen. The vignette contains examples.

  • tensorflow v0.7: Provides an interface to TensorFlow, an open-source software library for numerical computation using data flow graphs.

Statistics
  • frailtyEM v0.5.4: Contains functions for fitting shared frailty models with a semi-parametric baseline hazard using the Expectation-Maximization algorithm. The vignette explains the math.

  • FRK v0.1.1: Provides functions to build, fit and predict spatial random effects, fixed rank kriging models with large datasets. The vignette introduces the theory and shows some examples.

  • hmi v0.6-3: Allows users to build single-level and multilevel imputation models using functions provided, or functions from the mice and MCMCglmm packages. There is a vignette.

  • margins v0.3.0: Ports Stata’s margins command for calculating marginal (or partial) effects. There is an introduction, a vignette on the Technical Implementation Details, and a comparison with the Stata Command.

  • mlrMBO v1.0.0: Provides a toolbox for Bayesian, model-based optimization. There is an introduction and vignettes for Mixed Space Optimization, Noisy Optimization, and Parallelization.

  • MonteCarlo v1.0.0: Simplifies Monte Carlo simulation studies by providing functions that automatically set up loops to run over parameter grids, parallelize the computations, and generate output in LaTeX tables. The vignette shows how to use it.

  • RankingProject v0.1.1: Provides functions to generate plots and tables for comparing independently sampled populations. There is an introduction and a vignette that reproduces the figures from “A Primer on Visualizations for Comparing Populations, Including the Issue of Overlapping Confidence Intervals” by Wright, Klein, and Wieczorek (2017, The American Statistician, in press).

  • rjmcmc v0.2.2: Provides functions to perform reversible-jump MCMC with the restriction introduced by Barker & Link. The vignette shows how to calculate posterior probabilities from the MCMC output.

Utilities
  • canvasXpress v1.5.2: Enables creation of visualizations using CanvasXpress, a stand-alone JavaScript library for reproducible research. The vignette shows how to get started.

  • collapsibleTree v0.1.4: Provides functions to build interactive Reingold-Tilford tree diagrams created using D3.js, where every node can be expanded and collapsed by clicking on it. There are some examples on the GitHub site.

library(Polychrome)
pal2 <- alphabet.colors(26)
rancurves(pal2)

  • RApiDatetime v0.0.3: Provides a C-level API to allow packages to access C-level R date and datetime code.

  • Rcssplot v0.2.0.0: Provides tools to style plots with cascading style sheets. The vignette shows how.

  • reticulate v0.7: Implements an R interface to Python modules, classes, and functions. When calling into Python, R data types are automatically converted to their equivalent Python types. When values are returned from Python to R, they are converted back to R types. There is an overview and a vignette describing arrays in R and Python. A short sketch of calling Python from R appears after this list.

  • shinyWidgets v0.2.0: Provides custom input widgets for Shiny apps. See the README for examples.

  • shinyjqui v0.1.0: An extension to shiny that brings interactions and animation effects from the jQuery UI library. There is an introduction and a vignette with examples.

  • valaddin v0.1.0: Provides tools to transform functions into functions with input validation checks, in a manner suitable for both programmatic and interactive use. The vignette shows how.
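
As a tiny sketch of the reticulate entry above (it assumes a working Python installation; py_math is just a made-up name):

library(reticulate)
py_math <- import("math")   # attach a Python module as an R object
py_math$sqrt(2)             # R numeric in, Python float out, converted back to R: 1.4142...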

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Choosing R packages for mixed effects modelling based on the car you drive

Fri, 04/14/2017 - 02:00

There are many roads you can take to fit a mixed effects model
(sometimes termed hierarchical models) in R. There are numerous
packages that each deploy different engines to fit mixed effects models.

In this driveability review, I look at some of the dominant packages for
mixed effects models. The focus of this review is on the practicality of
each package, rather than their numerical accuracy. Information on
numerical accuracy can be found in the academic literature; see the
bibliography below. I will look at each package’s handling, speed and
adaptability for tackling different types of questions.

Summary of the packages’ strengths and weaknesses. Speeds are relative
to lme4 in a test case problem; see below for details.

  • Relative speed: lme4 = 1; rjags = 100; INLA = 10; RStan = 60
  • Handling (ease of coding) [1]: lme4 = Straightforward; rjags = Hard; INLA = Moderate; RStan = Hard
  • User manual: lme4 = Accessible; rjags = Brief and technical; INLA = Very technical [2]; RStan = Highly accessible
  • Online documentation: lme4 = Extensive; rjags = Extensive; INLA = Nascent; RStan = Nascent
  • Support from developers [3]: lme4 = Good; rjags = Good; INLA = Outstanding; RStan = Good
  • Testing for accuracy: lme4 = Extensive; rjags = Extensive for some types of models; INLA = Nascent; RStan = Nascent
  • Adaptability (types of models you can fit): lme4 = Limited; rjags = Your imagination is the limit; INLA = Extensive; RStan = Almost as much as JAGS
  • Best for: lme4 = Simple problems, quick implementation; rjags = Customising complex models; INLA = Spatio-temporal models; RStan = Developing new types of models
  • How you will look driving it: lme4 = Traditional, but elegant; rjags = You like to take the scenic route; INLA = Edgy and slightly mysterious; RStan = Hanging with the cool kids
  1. but see below for some add-on packages which make life easier for
    STAN and JAGS
  2. = lots of equations
  3. In my experience. But, heck these packages are all free and all give better user support than many
    paid services!

The first is
lme4,
meaning linear mixed effects models with S4 classes. lme4 is like an
older model sports-car – fast, respectable, well known and able to
handle common types of questions. lme4 uses maximum likelihood methods
for estimation, so interpret its statistics as a
frequentist.

The next three packages are Bayesian techniques.

The second package is
rjags,
which is an interface to the “Just Another Gibbs
Sampler”
program. rjags is the
Land-Rover of packages for mixed effects models: well tested, able to
tackle just about any terrain, but slow and prone to breaking down
(usually a user error, rather than the package itself).

My take on reviewing R packages for Generalised Linear Mixed Effects. Clockwise from top left: The `lme4` package is fast but a bit older, like an early noughties Ferrari; the JAGS program goes anywhere like a Range Rover, but breaks often and is slow; the Stan language is modern and efficient like a hybrid BMW and INLA is fast and modern but requires technical expertise like a Tesla.

The third package is INLA, standing for
“Integrated Nested Laplace Approximation”. INLA is like an electric
sports car: it is a modern technique that is lightning fast, but there
has been limited testing to date and you better understand what is under
the hood before you attempt to drive it.

The final package is
RStan,
which is the R interface to the Stan modelling
language
. RStan is like BMW’s hybrid,
combining computational speed with flexibility and with exceptional
documentation and support.

Now, let’s look at each package’s computational speed, handling (ease of
use), driver support (documentation, existing knowledge and forums) and
flexibility.

Skip to the end for a bibliography of examples and some cool packages
that build on these tools.

A note about philosophy

The frequentist and Bayesian approaches I review here come from
fundamentally different statistical
philosophies
.
Here I will tackle the approaches pragmatically, looking at estimation
speed and so on; however, you might exclude one or the other a priori
based on your bent towards Bayesian or frequentist thinking.

Speed

I simulated a simple hierarchical data-set to test each of the models.
The script is available here. The
data-set has 100 binary measurements. There is one fixed covariate
(continuous) and one random effect with five groups. The linear
predictor was transformed to binomial probabilities using the logit
function. For the Bayesian approaches, slightly different priors were
used for each package, depending on what was available. See the script
for more details on priors.
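
As a hedged sketch of how such a data set could be simulated (this is not the author’s script, which is linked above; names and coefficient values are made up):

set.seed(1)
n <- 100
x <- rnorm(n)                                 # continuous fixed covariate
g <- factor(rep(1:5, length.out = n))         # random-effect grouping with five groups
u <- rnorm(5, sd = 1)                         # group-level deviations
eta <- -0.5 + 1 * x + u[g]                    # linear predictor
y <- rbinom(n, size = 1, prob = plogis(eta))  # binary outcomes via the logit link
dat <- data.frame(y, x, g)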

For each package I fit the same linear model with binomial errors and a
logit link. To fit the models I used the functions glmer() from
lme4, inla() from INLA, custom code for rjags, and
stan_glmer() from the package
rstanarm
to fit the Stan model (you can also write your own custom code for
RStan, but for convenience I used the canned version of mixed effects
models).
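
As a minimal sketch of the lme4 version of that call (using the toy data frame dat simulated above; the other packages have their own syntax):

library(lme4)
m_lme4 <- glmer(y ~ x + (1 | g), data = dat, family = binomial(link = "logit"))
summary(m_lme4)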

Computational speeds for fitting a binomial Generalised Linear Model to
100 samples with one random and one fixed effect. Relative times are
relative to lme4

Measurement        lme4    rjags   INLA   RStan
Speed (seconds)    0.063   6.2     0.59   3.8
Speed (relative)   1       99      9.4    60

So lme4 is by far the fastest, clocking in at around six hundredths of a
second (on my 3GHz Intel i7 MacBook Pro). INLA comes in second, ten
times slower than lme4, followed an order of magnitude later by
RStan, and finally rjags came plodding along behind, about 100 times
slower than lme4.

If you run the code I have provided you will see that all packages come
up with similar estimates for the model parameters.

Handling – ease of use

lme4 leads the pack when it comes to ease of use. If you have your
data sorted, then you can go from zero to fitted model in one line of
code. INLA is similar; however, you should generally not rely on
default priors but specify your own, which requires additional code and
thinking.

Use of RStan and rjags depends on how you use them. You can write
your own model code, which can lead to quite lengthy scripts, but also
gives you greater number of choices for how you design your model.
However, there are also many packages that provide wrapper’s to RStan
and rjags that will fit common types of models (like mixed effects
models). These tend to be as simple as INLA’s functions, remembering
that you should think carefully about prior
choice
.

If you are going to write your own models, you will need to get good at
debugging, so there is an extra overhead there.

Model checking is another important consideration.

For users familiar with GLMs, lme4 probably provides the most familiar
support for model checking. INLA can be more difficult to use, because
it requires the user to hand-code many model checking routines. INLA,
rjags and RStan also require the user to carefully check that the
fitting algorithms have performed properly.

Each of these packages use different types of algorithms, so the
statistics you will use to check algorithm performance are different.
See each package’s documentation for further advice.

Driver support – documentation and support

lme4 and RStan have the highest quality user-manuals. Both are well
written and provide numerous examples that users will find helpful.

The user manual for rjags and JAGS is somewhat brief, but the user can
easily find many helpful examples on the web, and these packages are
covered by numerous books. So if you are willing to broaden your search
you should have no trouble finding help.

RStan and INLA are both relatively new, so fewer examples can be
found on the web. While RStan has an excellent manual for both the
mathsy and non-mathsy types, documentation of INLA can be difficult to
follow for novice users, because much of it is equations.

Web support and chat groups are also an important aspect of model
development. Help for all packages can be found on sites like
Stackoverflow, or you
can post your own new questions there.

Personally, I have found that support for INLA, through their
forum
,
is exceptional. On several occasions I have posted questions for which I
could not find documentation and have generally received a response from
INLA’s developers before the end of the working day. This is amazing,
because it is better than you would expect from many paid services.

I have not participated in forums for the other packages, so I can’t add
comments here.

One final note on support is that lme4 and JAGS are both pretty well
studied approaches to modelling and there are numerous academic papers
dealing with their biases and accuracy. INLA and RStan are
relatively new on the scene, so for new and different types of problems
you might want to run your own simulation study to check that your model
is performing as expected.

Adaptability – ability to tackle different types of problems

Your imagination is the limit with rjags – take it on well driven
routes, like a mixed effects model, or off-road on new adventures with
Bayesian structural equation modelling – it can do it all.

rjags is followed closely by RStan. RStan has a few limitations,
but can basically do anything JAGS can do, and often faster (e.g. this
ecological
study
).
RStan will probably eventually replace rjags, but to date rjags
persists because of the extensive range of well documented examples
users can build on.

INLA comes in third with a diverse range of built-in likelihoods and
model functions (you can even ask the developers to add more and they
might do it!). INLA is becoming particularly popular for modelling
spatial and temporal auto-correlation, partly due to its speed. You can
do these types of models with RStan and rjags but computations might
be impossibly slow.

lme4 is the least flexible of the packages – though there are some options
to customise its models. The similar nlme package also provides a
range of tools for fitting random effects for spatial and temporal
autocorrelation.

There are however, some clever tricks you can do with lme4 to fit a
broader range of models. For instance, you can include a random effect
where every sample is a separate group in a Poisson GLM to get a Poisson
with extra variation.
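
A hedged sketch of that observation-level random effect trick (counts_df, n_events and x are hypothetical names for a data frame with a count response and a covariate):

library(lme4)
counts_df$obs <- factor(seq_len(nrow(counts_df)))   # one group per observation
m_olre <- glmer(n_events ~ x + (1 | obs), data = counts_df, family = poisson)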

Conclusion

That’s all. I hope this blog helps you make a good choice before
investing in learning a new tool for mixed effects models. Let me
know
how you go and if you
found this useful.

Cool applications of mixed effects models in R

rstanarm for mixed effects models coded like glm() but using Stan for estimation.

glmmBUGS for mixed effects models coded like glm() but using JAGS for estimation.

A new function in the mgcv package (links to
pdf)
for
auto-writing code for Bayesian Generalised Additive Models.

A new study showing how you can
fit spatial models with barriers.

Fit Bayesian Latent Variable
models
for
analysing multivariate data (e.g. ecological communities).

The GLMM in
R FAQ

Bibliography

Unless otherwise indicated, these resources are open access.

Stan user manual.

JAGS
user manual.

Introduction to lme4
(pdf)
.

INLA project page.

INLA
forum.

Mixed Effects Models and Extensions in Ecology with
R
my go-to book for
theory and code for fitting likelihood based mixed effects models (e.g.
with lme4). You will have to buy it.

A good example for how to design a simulation study to test your
models

Example for INLA for spatial
models

in ecology.

A review of mixed effects models in fisheries
science

(good for other disciplines too).

NIMBLE another R package for Bayesian
modelling.

glmmADMB another R
package giving greater flexibility for fitting using maximum likelihood.

dplyr 0.6.0 coming soon!

Fri, 04/14/2017 - 00:46

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

I’m planning to submit dplyr 0.6.0 to CRAN on May 11 (in four weeks’ time). In preparation, I’d like to announce that the release candidate, dplyr 0.5.0.9002, is now available. I would really appreciate it if you’d try it out and report any problems. This will ensure that the official release has as few bugs as possible.

Installation

Install the pre-release version with:

# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

If you discover any problems, please file a minimal reprex on GitHub. You can roll back to the released version with:

install.packages("dplyr")

Features

dplyr 0.6.0 is a major release including over 100 bug fixes and improvements. There are three big changes that I want to touch on here:

  • Databases
  • Improved encoding support (particularly for CJK on windows)
  • Tidyeval, a new framework for programming with dplyr

You can see a complete list of changes in the draft release notes.

Databases

Almost all database related code has been moved out of dplyr and into a new package, dbplyr. This makes dplyr simpler, and will make it easier to release fixes for bugs that only affect databases.

To install the development version of dbplyr so you can try it out, run:

devtools::install_github("hadley/dbplyr")

There’s one major change, as well as a whole heap of bug fixes and minor improvements. It is now no longer necessary to create a remote “src”. Instead you can work directly with the database connection returned by DBI, reflecting the robustness of the DBI ecosystem. Thanks largely to the work of Kirill Muller (funded by the R Consortium) DBI backends are now much more consistent, comprehensive, and easier to use. That means that there’s no longer a need for a layer between you and DBI.

You can continue to use src_mysql(), src_postgres(), and src_sqlite() (which still live in dplyr), but I recommend a new style that makes the connection to DBI more clear:

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)
#> [1] TRUE
iris2 <- tbl(con, "iris")
iris2
#> Source:   table<iris> [?? x 5]
#> Database: sqlite 3.11.1 [:memory:]
#>
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with more rows

This is particularly useful if you want to perform non-SELECT queries as you can do whatever you want with DBI::dbGetQuery() and DBI::dbExecute().
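
For instance, continuing with the connection created above (the SQL statements here are just illustrations):

# Run a statement that returns no result set:
DBI::dbExecute(con, "CREATE INDEX idx_species ON iris (Species)")

# Run an arbitrary query and collect the result as a data frame:
DBI::dbGetQuery(con, "SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species")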

If you’ve implemented a database backend for dplyr, please read the backend news to see what’s changed from your perspective (not much). If you want to ensure your package works with both the current and previous version of dplyr, see wrap_dbplyr_obj() for helpers.

Character encoding

We have done a lot of work to ensure that dplyr works with encodings other than Latin1 on Windows. This is most likely to affect you if you work with data that contains Chinese, Japanese, or Korean (CJK) characters. dplyr should now just work with such data.

Tidyeval

dplyr has a new approach to non-standard evaluation (NSE) called tidyeval. Tidyeval is described in detail in a new vignette about programming with dplyr but, in brief, it gives you the ability to interpolate values in contexts where dplyr usually works with expressions:

my_var <- quo(homeworld)

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
#> # A tibble: 49 × 3
#>         homeworld   height  mass
#>             <chr>    <dbl> <dbl>
#> 1        Alderaan 176.3333  64.0
#> 2     Aleen Minor  79.0000  15.0
#> 3          Bespin 175.0000  79.0
#> 4      Bestine IV 180.0000 110.0
#> 5  Cato Neimoidia 191.0000  90.0
#> 6           Cerea 198.0000  82.0
#> 7        Champala 196.0000   NaN
#> 8       Chandrila 150.0000   NaN
#> 9    Concord Dawn 183.0000  79.0
#> 10       Corellia 175.0000  78.5
#> # ... with 39 more rows

This will make it much easier to eliminate copy-and-pasted dplyr code by extracting repeated code into a function.
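
For example, here is a sketch of pulling the snippet above into a reusable function (summarise_by() is a made-up name; the caller supplies a quoted variable):

summarise_by <- function(df, group_var) {
  df %>%
    group_by(!!group_var) %>%
    summarise_at(vars(height:mass), mean, na.rm = TRUE)
}

summarise_by(starwars, quo(homeworld))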

This also means that the underscored version of each main verb (filter_(), select_(), etc.) is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
