Rcpp now used by 1000 CRAN packages
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
Moments ago Rcpp passed a big milestone as there are now 1000 packages on CRAN depending on it (as measured by Depends, Imports and LinkingTo, but excluding Suggests). The graph on the left depicts the growth of Rcpp usage over time.
One easy way to compute such reverse dependency counts is the tools::dependsOnPkgs() function that was just mentioned in yesterday's R^4 blog post. Another way is to use the reverse_dependencies_with_maintainers() function from this helper scripts file on CRAN. Lastly, devtools has a function revdep(), but it has the wrong default parameters as it includes Suggests:, which you'd have to override to get the count used here (it currently reports 1012 under this wider measure).
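As a sketch of one such count (ours, not from the post; it needs network access to a CRAN mirror, and the exact number changes daily), tools::package_dependencies() with reverse = TRUE does the job directly:

```r
# Count CRAN packages depending on Rcpp via Depends, Imports or LinkingTo
# (excluding Suggests). Requires network access; the count changes daily.
db <- available.packages(repos = "https://cloud.r-project.org")
rev <- tools::package_dependencies("Rcpp", db = db,
                                   which = c("Depends", "Imports", "LinkingTo"),
                                   reverse = TRUE)
length(rev[["Rcpp"]])
```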
Rcpp cleared 300 packages in November 2014. It passed 400 packages in June 2015 (when I only tweeted about it), 500 packages in late October 2015, 600 packages last March, 700 packages last July, 800 packages last October and 900 packages early January. The chart extends to the very beginning via manually compiled data from CRANberries and checked with crandb. The next part uses manually saved entries. The core (and by far largest) part of the data set was generated semi-automatically via a short script appending updates to a small file-based backend. A list of packages using Rcpp is kept on this page.
Also displayed in the graph is the relative proportion of CRAN packages using Rcpp. The four percent hurdle was cleared just before useR! 2014 where I showed a similar graph (as two distinct graphs) in my invited talk. We passed five percent in December of 2014, six percent in July of 2015, seven percent just before Christmas 2015, eight percent last summer, and nine percent in mid-December 2016. Ten percent is next; we may get there during the summer.
1000 user packages is a really large number. This puts a whole lot of responsibility on us in the Rcpp team as we continue working to keep Rcpp as performant and reliable as it has been.
And with that a very big Thank You! to all users and contributors of Rcpp for help, suggestions, bug reports, documentation or, of course, code.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Shiny server series part 2: running shiny on multiple ports
(This article was first published on Jasper Ginn's blog, and kindly contributed to R-bloggers)
This guide is part of a series on setting up your own private server running shiny apps. There are many guides with great advice on how to set up an R shiny server and related software. I try to make a comprehensive guide based in part on these resources as well as my own experiences. I always aim to properly attribute information to their respective sources. If you notice an issue, please contact me.
I noticed there were some errors in the first part of this series. These are now fixed. The revised post is available here.
Part 1 of this series is available here.
In part 1 of this series, we set up shiny server on a VPS. Our setup currently looks like this:
- RStudio Server is running on port 8787
- Shiny Server is running on port 3838
- We have a custom domain name pointing to the DigitalOcean VPS
In this part, we'll configure shiny server to run on two ports: 3838 and 4949 (the second choice is arbitrary; you could use any port you'd like). This way, we can set up our server to serve a publicly accessible shiny server instance as well as an instance that will be password protected. This is useful if you want to share an app with a select group of people, or just for development purposes.
Resources used for this part

This tutorial draws from RStudio's documentation about shiny server configuration.
Running shiny on multiple ports

These steps are fairly straightforward. First, log into your VPS, switch to the shiny user and back up the shiny configuration file in case something goes wrong:
```
# Log into the server
ssh root@<your-VPS-ip>
# Switch to shiny user
su shiny
# Copy the config file
sudo cp /etc/shiny-server/shiny-server.conf /etc/shiny-server/shiny-server-backup.conf
```

Open the configuration file:
```
sudo nano /etc/shiny-server/shiny-server.conf
```

This configuration governs the port on which shiny runs, as well as the location where it stores error logs and where shiny can find your apps.
First, add the following line under 'run_as shiny' (denoted by '1' in the image below). This ensures that shiny writes its error logs to this folder:
```
# Log all Shiny output to files in this directory
log_dir /var/log/shiny-server;
```

Then, add '127.0.0.1' next to 'listen 3838' (denoted by '2' in the image below).
Finally, add the following lines below the last ‘}’ (denoted by ‘3’ in the image below):
```
# Define a server that listens on port 4949
server {
  listen 4949 127.0.0.1;

  # Define a location at the base URL
  location / {
    # Host the directory of Shiny Apps stored in this directory
    site_dir /srv/private-server;

    # When a user visits the base URL rather than a particular application,
    # an index of the applications available in this directory will be shown.
    directory_index on;
  }
}
```

Hit Ctrl+X, then Y, then Enter to save your changes.
This configuration ensures that shiny listens on both ports 3838 and 4949. As you can see in the config file, the private shiny server does not look for shiny apps in the '/srv/shiny-server' folder, but rather in the '/srv/private-server' folder. We need to create this folder and give the shiny user read and write permissions:
```
# Navigate to the /srv/ folder
cd /srv
# Create a folder for the private shiny server
sudo mkdir private-server
# Enter the folder
cd private-server
# Give shiny read/write permissions
sudo chown -R shiny:shiny-apps .
sudo chmod g+w .
sudo chmod g+s .
```

Now, we just need to restart shiny server:
```
sudo service shiny-server restart
```

You are now running shiny server on both ports 3838 and 4949!
Now, we simply need to add a new route to the nginx configuration. To do this, open up the config file:
```
sudo nano /etc/nginx/sites-enabled/default
```

Then, copy-paste the following lines just below the editor route. Indent the lines so they line up with the other routes, and remember to only use spaces, no tabs.
```
location /private-apps/ {
  proxy_pass http://127.0.0.1:4949/;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
}
```

It should look like the configuration in the image below:
Hit Ctrl+X, then Y, then Enter to save your changes.
You can now access your second shiny server at http://www.<your-domain-name>.<extension>/private-apps/
In the next part, we’ll add an SSL security certificate to the server.
Encoding categorical variables: one-hot and beyond
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
(or: how to correctly use xgboost from R)

R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can't point to it as it is everywhere.
For example, we can see evidence of one-hot encoding in the variable names chosen by a linear regression:
```r
dTrain <- data.frame(x = c('a', 'b', 'b', 'c'),
                     y = c(1, 2, 1, 2))
summary(lm(y ~ x, data = dTrain))
## 
## Call:
## lm(formula = y ~ x, data = dTrain)
## 
## Residuals:
##          1          2          3          4 
##  2.914e-16  5.000e-01 -5.000e-01  2.637e-16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.0000     0.7071   1.414    0.392
## xb            0.5000     0.8660   0.577    0.667
## xc            1.0000     1.0000   1.000    0.500
## 
## Residual standard error: 0.7071 on 1 degrees of freedom
## Multiple R-squared:  0.5,  Adjusted R-squared:  -0.5 
## F-statistic: 0.5 on 2 and 1 DF,  p-value: 0.7071
```

Much of the encoding in R is essentially based on "contrasts" implemented in stats::model.matrix(). Note: do not use base::data.matrix() or use hashing before modeling; you might get away with them (especially with tree-based methods), but they are not in general good technique, as we show below:
```r
data.matrix(dTrain)
##      x y
## [1,] 1 1
## [2,] 2 2
## [3,] 2 1
## [4,] 3 2
```

stats::model.matrix() does not store its one-hot plan in a convenient manner (it can be inferred by pulling the "contrasts" attribute plus examining the column names of the first encoding, but the levels identified are not conveniently represented). When directly applying stats::model.matrix() you cannot safely assume the same formula applied to two different data sets (say train and application or test) is using the same encoding! We demonstrate this below:
```r
dTrain <- data.frame(x = c('a', 'b', 'c'),
                     stringsAsFactors = FALSE)
encTrain <- stats::model.matrix(~ x, dTrain)
print(encTrain)
##   (Intercept) xb xc
## 1           1  0  0
## 2           1  1  0
## 3           1  0  1
## attr(,"assign")
## [1] 0 1 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"

dTest <- data.frame(x = c('b', 'c'),
                    stringsAsFactors = FALSE)
stats::model.matrix(~ x, dTest)
##   (Intercept) xc
## 1           1  0
## 2           1  1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$x
## [1] "contr.treatment"
```

The above malcoding can be a critical flaw when you are building a model and then later using the model on new data (be it cross-validation data, test data, or future application data). Many R users are not familiar with the above issue, as encoding is hidden in model training and how to encode new data is stored as part of the model. Python scikit-learn users coming to R often ask "where is the one-hot encoder?" (as it isn't discussed as much in R as it is in scikit-learn) and even supply a number of (low-quality) one-off packages "porting one-hot encoding to R."
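One lightweight repair (our sketch, not from the original post) is to treat the training levels themselves as the encoding plan: save them, then pin them onto new data with factor() before calling stats::model.matrix(), so both encodings emit the same columns:

```r
# Pin factor levels from the training data so model.matrix() emits
# the same columns for the training and test encodings.
dTrain <- data.frame(x = c('a', 'b', 'c'), stringsAsFactors = FALSE)
dTest  <- data.frame(x = c('b', 'c'),      stringsAsFactors = FALSE)

lvls <- sort(unique(dTrain$x))            # the minimal "encoding plan": the levels
dTest$x <- factor(dTest$x, levels = lvls) # impose training levels on new data

encTest <- stats::model.matrix(~ x, dTest)
colnames(encTest)                          # now matches training: "(Intercept)" "xb" "xc"
```

This fixes the missing-column problem, but factor() maps genuinely novel levels to NA rather than handling them gracefully, which is part of why a real encoder such as the ones discussed below is preferable.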
The main place an R user needs a proper encoder (and that is an encoder that stores its encoding plan in a conveniently reusable form, which many of the "one-off ported from Python" packages actually fail to do) is when using a machine learning implementation that isn't completely R-centric. One such system is xgboost, which requires (as is typical of machine learning in scikit-learn) data to already be encoded as a numeric matrix (instead of a heterogeneous structure such as a data.frame). This requires explicit conversion on the part of the R user, and many R users get it wrong (fail to store the encoding plan somewhere). To make this concrete let's work a simple example.
Let's try the Titanic data set to see encoding in action. Note: we are not working hard on this example (as in adding extra variables derived from cabin layout, commonality of names, and other sophisticated feature transforms), just plugging the obvious variables into xgboost. As we said: xgboost requires a numeric matrix for its input, so unlike many R modeling methods we must manage the data encoding ourselves (instead of leaving that to R, which often hides the encoding plan in the trained model). Also note: differences observed in performance that are below the sampling noise level should not be considered significant (e.g., all the methods demonstrated here performed about the same).
We bring in our data:
```r
# set up example data set
library("titanic")
data(titanic_train)
str(titanic_train)
## 'data.frame': 891 obs. of 12 variables:
##  $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int 0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int 3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
##  $ Sex        : chr "male" "female" "female" "female" ...
##  $ Age        : num 22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int 1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int 0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
##  $ Fare       : num 7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : chr "" "C85" "" "C123" ...
##  $ Embarked   : chr "S" "C" "S" "S" ...

summary(titanic_train)
##   PassengerId       Survived          Pclass          Name
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000
##
##      Sex                 Age            SibSp           Parch
##  Length:891         Min.   : 0.42   Min.   :0.000   Min.   :0.0000
##  Class :character   1st Qu.:20.12   1st Qu.:0.000   1st Qu.:0.0000
##  Mode  :character   Median :28.00   Median :0.000   Median :0.0000
##                     Mean   :29.70   Mean   :0.523   Mean   :0.3816
##                     3rd Qu.:38.00   3rd Qu.:1.000   3rd Qu.:0.0000
##                     Max.   :80.00   Max.   :8.000   Max.   :6.0000
##                     NA's   :177
##     Ticket               Fare           Cabin             Embarked
##  Length:891         Min.   :  0.00   Length:891         Length:891
##  Class :character   1st Qu.:  7.91   Class :character   Class :character
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character
##                     Mean   : 32.20
##                     3rd Qu.: 31.00
##                     Max.   :512.33

outcome <- 'Survived'
target <- 1
shouldBeCategorical <- c('PassengerId', 'Pclass', 'Parch')
for(v in shouldBeCategorical) {
  titanic_train[[v]] <- as.factor(titanic_train[[v]])
}
tooDetailed <- c("Ticket", "Cabin", "Name", "PassengerId")
vars <- setdiff(colnames(titanic_train), c(outcome, tooDetailed))
dTrain <- titanic_train
```

And design our cross-validated modeling experiment:
```r
library("xgboost")
library("sigr")
library("WVPlots")
library("vtreat")

set.seed(4623762)
crossValPlan <- vtreat::kWayStratifiedY(nrow(dTrain), 10,
                                        dTrain, dTrain[[outcome]])

evaluateModelingProcedure <- function(xMatrix, outcomeV, crossValPlan) {
  preds <- rep(NA_real_, nrow(xMatrix))
  for(ci in crossValPlan) {
    nrounds <- 1000
    cv <- xgb.cv(data = xMatrix[ci$train, ],
                 label = outcomeV[ci$train],
                 objective = 'binary:logistic',
                 nrounds = nrounds,
                 verbose = 0,
                 nfold = 5)
    # nrounds <- which.min(cv$evaluation_log$test_rmse_mean)  # regression
    nrounds <- which.min(cv$evaluation_log$test_error_mean)   # classification
    model <- xgboost(data = xMatrix[ci$train, ],
                     label = outcomeV[ci$train],
                     objective = 'binary:logistic',
                     nrounds = nrounds,
                     verbose = 0)
    preds[ci$app] <- predict(model, xMatrix[ci$app, ])
  }
  preds
}
```

Our preferred way to encode data is to use the vtreat package in the "no variables mode" shown below (differing from the powerful "y-aware" modes we usually teach).
```r
set.seed(4623762)
tplan <- vtreat::designTreatmentsZ(dTrain, vars,
                                   minFraction = 0,
                                   verbose = FALSE)
# restrict to common variable types
# see vignette('vtreatVariableTypes', package = 'vtreat') for details
sf <- tplan$scoreFrame
newvars <- sf$varName[sf$code %in% c("lev", "clean", "isBAD")]
trainVtreat <- as.matrix(vtreat::prepare(tplan, dTrain,
                                         varRestriction = newvars))
print(dim(trainVtreat))
## [1] 891  20
print(colnames(trainVtreat))
##  [1] "Pclass_lev_x.1"    "Pclass_lev_x.2"    "Pclass_lev_x.3"
##  [4] "Sex_lev_x.female"  "Sex_lev_x.male"    "Age_clean"
##  [7] "Age_isBAD"         "SibSp_clean"       "Parch_lev_x.0"
## [10] "Parch_lev_x.1"     "Parch_lev_x.2"     "Parch_lev_x.3"
## [13] "Parch_lev_x.4"     "Parch_lev_x.5"     "Parch_lev_x.6"
## [16] "Fare_clean"        "Embarked_lev_x."   "Embarked_lev_x.C"
## [19] "Embarked_lev_x.Q"  "Embarked_lev_x.S"
dTrain$predVtreatZ <- evaluateModelingProcedure(trainVtreat,
                                                dTrain[[outcome]] == target,
                                                crossValPlan)
sigr::permTestAUC(dTrain, 'predVtreatZ', outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.86, s.d.=0.017, p<1e-05)."
WVPlots::ROCPlot(dTrain, 'predVtreatZ', outcome, target,
                 'vtreat encoder performance')
```

stats::model.matrix() can perform similar encoding when we only have a single data set.
```r
set.seed(4623762)
f <- paste('~ 0 + ', paste(vars, collapse = ' + '))
# model.matrix() skips rows with NAs by default;
# get control of this through an option
oldOpt <- getOption('na.action')
options(na.action = 'na.pass')
trainModelMatrix <- stats::model.matrix(as.formula(f), dTrain)
# note: model.matrix() does not conveniently store the encoding
# plan, so you may run into difficulty if you were to encode
# new data which didn't have all the levels seen in the training data.
options(na.action = oldOpt)
print(dim(trainModelMatrix))
## [1] 891  16
print(colnames(trainModelMatrix))
##  [1] "Pclass1"   "Pclass2"   "Pclass3"   "Sexmale"   "Age"
##  [6] "SibSp"     "Parch1"    "Parch2"    "Parch3"    "Parch4"
## [11] "Parch5"    "Parch6"    "Fare"      "EmbarkedC" "EmbarkedQ"
## [16] "EmbarkedS"
dTrain$predModelMatrix <- evaluateModelingProcedure(trainModelMatrix,
                                                    dTrain[[outcome]] == target,
                                                    crossValPlan)
sigr::permTestAUC(dTrain, 'predModelMatrix', outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.87, s.d.=0.019, p<1e-05)."
WVPlots::ROCPlot(dTrain, 'predModelMatrix', outcome, target,
                 'model.matrix encoder performance')
```

The caret package also supplies an encoding functionality properly split between training (caret::dummyVars()) and application (called predict()).
```r
library("caret")
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(4623762)
f <- paste('~', paste(vars, collapse = ' + '))
encoder <- caret::dummyVars(as.formula(f), dTrain)
trainCaret <- predict(encoder, dTrain)
print(dim(trainCaret))
## [1] 891  19
print(colnames(trainCaret))
##  [1] "Pclass.1"  "Pclass.2"  "Pclass.3"  "Sexfemale" "Sexmale"
##  [6] "Age"       "SibSp"     "Parch.0"   "Parch.1"   "Parch.2"
## [11] "Parch.3"   "Parch.4"   "Parch.5"   "Parch.6"   "Fare"
## [16] "Embarked"  "EmbarkedC" "EmbarkedQ" "EmbarkedS"
dTrain$predCaret <- evaluateModelingProcedure(trainCaret,
                                              dTrain[[outcome]] == target,
                                              crossValPlan)
sigr::permTestAUC(dTrain, 'predCaret', outcome, target)
## [1] "AUC test alt. hyp. AUC>AUC(permuted): (AUC=0.85, s.d.=0.017, p<1e-05)."
WVPlots::ROCPlot(dTrain, 'predCaret', outcome, target,
                 'caret encoder performance')
```

We usually forget to teach vtreat::designTreatmentsZ() as it is often dominated by the more powerful y-aware methods vtreat supplies (though not for this simple example). vtreat::designTreatmentsZ() has a number of useful properties:
- Does not look at the outcome values, so does not require extra care in cross-validation.
- Saves its encoding, so can be used correctly on new data.
The above two properties are shared with caret::dummyVars(). Additional features of vtreat::designTreatmentsZ (that differ from caret::dummyVars()‘s choices) include:
- No NA values are passed through by vtreat::prepare().
- NA presence is added as an additional informative column.
- A few derived columns (such as pooling of rare levels) are made available.
- Rare dummy variables are pruned (under a user-controlled threshold) to prevent encoding explosion.
- Novel levels (levels that occur during test or application, but not during training) are deliberately passed through as "no training level activated" by vtreat::prepare() (caret::dummyVars() considers this an error).
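That novel-level behavior can be seen in a tiny sketch (ours, with made-up data; the exact encoded column names depend on the vtreat version):

```r
library(vtreat)

dTrainSmall <- data.frame(x = c('a', 'a', 'b'), stringsAsFactors = FALSE)
dTestSmall  <- data.frame(x = c('b', 'c'),     stringsAsFactors = FALSE)  # 'c' is novel

tplanSmall <- designTreatmentsZ(dTrainSmall, 'x',
                                minFraction = 0, verbose = FALSE)
# prepare() encodes the novel level 'c' as "no training level activated"
# (all training-level indicator columns zero) instead of throwing an error.
prepare(tplanSmall, dTestSmall)
```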
The vtreat y-aware methods include proper nested modeling and y-aware dimension reduction.
vtreat is designed "to always work" (always return a pure numeric data frame with no missing values). It also excels in "big data" situations where the statistics it can collect on high-cardinality categorical variables can have a huge positive impact on modeling performance. In many cases vtreat works around problems that kill the analysis pipeline (such as discovering new variable levels during test or application). We teach vtreat sort of "bimodally", in both a "fire and forget" mode and an "all the details on deck" mode (suitable for formal citation). Either way vtreat can make your modeling procedures stronger, more reliable, and easier.
All code for this article can be found here.
Forecasting: Linear Trend and ARIMA Models Exercises (Part-2)
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
There are two main approaches to time series forecasting. One of them is to find persistent patterns in a time series itself, and extrapolate those patterns. Another approach is to discover how a series depends on other variables, which serve as predictors.
This set of exercises focuses on the first approach, while the second one will be considered in a later set. The present set lets you practice applying three forecasting models:
– a naive model, which provides probably the simplest forecasting technique, but still can be useful as a benchmark for evaluating other methods,
– a linear trend model (based on a simple linear regression),
– the ARIMA model, a more sophisticated and popular model, which assumes a linear dependence of a time series on its past values and random shocks.
The exercises do not require a deep understanding of the underlying theories, and make use of the automatic model estimation functions included in the forecast package. The set also helps you practice retrieving useful data from forecasts (confidence intervals, forecast errors), and comparing the predictive accuracy of different models. The exercises are based on data on e-commerce retail sales in the USA for 1999–2016, retrieved from FRED, the Federal Reserve Bank of St. Louis database (download here). The data represent quarterly sales volumes in millions of dollars.
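As a preview of how the forecast-package functions used in these exercises fit together (our sketch on simulated quarterly data, not the exercise answers), the whole workflow looks roughly like this:

```r
library(forecast)

# Simulated quarterly series standing in for the FRED sales data:
# 68 quarters starting in Q4 1999
set.seed(1)
y <- ts(cumsum(rnorm(68, mean = 2)) + 100,
        start = c(1999, 4), frequency = 4)

fc_naive <- naive(y, h = 8)            # naive forecast: repeats the last value
fit_lm   <- tslm(y ~ trend + season)   # linear trend plus seasonal dummies
fc_lm    <- forecast(fit_lm, h = 8)
fit_ar   <- auto.arima(y)              # automatic ARIMA model selection
fc_ar    <- forecast(fit_ar, h = 8)

# Diebold-Mariano test: is the ARIMA forecast more accurate than the linear one?
dm.test(residuals(fc_ar), residuals(fc_lm), alternative = "less")
```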
For other parts of the series follow the tag forecasting
Answers to the exercises are available here
Exercise 1
Read the data from the file, and transform it into a time series (ts) object (given that the data is quarterly and the starting period is the fourth quarter of 1999).
Plot the obtained series.
Exercise 2
Make a naive forecast for the next 8 periods using the appropriate function from the forecast package (i.e. create an object of the class forecast using the function that implements the naive method of forecasting) (Note that this method sets all forecast values equal to the last known time series value).
Exercise 3
Plot the forecast values.
Exercise 4
Make a forecast for the next 8 periods based on a linear model in two steps:
(1) create a linear regression model for the forecast using the tslm function from the forecast package (use the series as the dependent variable, trend and season as independent variables),
(2) make a forecast based on the model using the forecast function from the same package.
Plot the forecast.
Exercise 5
Retrieve forecast errors (residuals) from the linear model based forecast and save them as a separate variable.
Exercise 6
Make a forecast for the next 8 periods based on the ARIMA model in two steps:
(1) create a model using the auto.arima function from the forecast package,
(2) make a forecast based on the model using the forecast function from the same package.
Plot the forecast.
Exercise 7
Print the summary of the forecast based on the ARIMA model.
Exercise 8
Explore the structure of the forecast summary. Find the forecast value for the last period, and its 95% confidence interval values.
Exercise 9
Retrieve forecast errors (residuals) from the ARIMA based forecast.
Exercise 10
Use the errors from the ARIMA based forecast and the errors from the linear model based forecast to compare the predictive accuracy of the two models with the Diebold-Mariano test (implemented as a function in the forecast package). Test the hypothesis that the ARIMA based forecast is more accurate than the linear model based forecast.
- Multiple Regression (Part 3) Diagnostics
- Forecasting for small business Exercises (Part-1)
- 3D plotting exercises
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory
R Weekly Bulletin Vol – IV
(This article was first published on R programming, and kindly contributed to R-bloggers)
This week's R bulletin will cover topics like removing duplicate rows, finding the row number for a value, sorting a data frame by two columns in the same order, sorting a data frame by two columns in different orders, and creating two tabs in an Excel workbook. We will also cover functions like Sys.time, julian, and the second & minute functions from the lubridate package. Hope you like this R weekly bulletin. Enjoy reading!
Shortcut Keys

1. To open a file in R – Ctrl+O
2. To open a new R script – Ctrl+Shift+N
3. To save an R script – Ctrl+S
Consider the following data frame "df" comprising 3 rows of 1-minute OHC (open-high-close) data of a stock.
Example:
```r
open = c(221, 221.25, 221.25)
high = c(221.75, 221.45, 221.45)
close = c(221.35, 221.2, 221.2)
df = data.frame(open, high, close)
print(df)
```

As can be seen from the output, the 2nd and the 3rd rows are duplicates. If we want to retain only the unique rows, we can use the duplicated function in the following manner:
```r
df[!duplicated(df), ]
```

This will remove the 3rd (duplicate) row and retain the first 2 rows. If instead we want only the duplicate rows, we can do so in the following way:
```r
df[duplicated(df), ]
```

Return row number for a particular value in a column

To find the row number for a particular value in a column of a vector/data frame, we can use the "which" function.
Example 1:
```r
day = c("Mon", "Tue", "Wed", "Thurs", "Fri", "Sat", "Sun")
row_number = which(day == "Sat")
print(row_number)
[1] 6
```
Example 2: Consider a 2-column data frame "df" comprising stock symbols and their respective closing prices for the day. To find the row number corresponding to the HCL stock, we call the "which" function on the Ticker column with its value selected as HCL.
```r
Ticker = c("INFY", "TCS", "HCL")
ClosePrice = c(2021, 2294, 910)
data = data.frame(Ticker, ClosePrice)
row_number = which(data$Ticker == "HCL")
print(row_number)
[1] 3
```
Sorting a data frame by two columns in the same order

To sort a data frame by two columns in the same order, we can use the "order" function together with the "with" function. Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. We first sort the data frame by category, and then by the percentage change in price.
The order function by default sorts in an ascending manner. Hence to sort both the columns in descending order we keep the decreasing argument as TRUE.
Example – Sorting a data frame by two columns
```r
# Create a data frame
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN")
Category = c(1, 1, 1, 2)
Percent_Change = c(2.3, 0.25, 0.5, 0.25)
df = data.frame(Ticker, Category, Percent_Change)
print(df)

# Sorting by the Category column first and then the Percent_Change column:
df_sort = df[with(df, order(Category, Percent_Change, decreasing = TRUE)), ]
print(df_sort)
```

Sorting a data frame by two columns in different orders

To sort a data frame by two columns in different orders, we can use the "order" function along with the "with" function.
Consider a data frame comprising stock symbols, their categorization, and the percentage change in price. Assume that we want to sort first in ascending order by the "Category" column, and then by the "Percent_Change" column in descending order.
The order function by default sorts in an ascending manner. Hence, to sort the “Category” column we mention it as the first variable in the order function without prepending it with any sign. To sort the “Percent_Change” column in a descending order we prepend it with a negative sign.
Example – Sorting a data frame by two columns
```r
# Create a data frame
Ticker = c("INFY", "TCS", "HCLTECH", "SBIN")
Category = c(1, 1, 1, 2)
Percent_Change = c(2.3, 0.25, 0.5, 0.25)
df = data.frame(Ticker, Category, Percent_Change)
print(df)

# Sort by the Category column first (ascending) and then by the
# Percent_Change column (descending, via the negative sign):
df_sort = df[with(df, order(Category, -Percent_Change)), ]
print(df_sort)
```

Creating two tabs in the output Excel workbook

At times we want to write & save the multiple results generated after running our R script in an Excel workbook on separate worksheets. To do so, we can make use of the "append" argument in the write.xlsx function. To write to an Excel file, one must first install and load the xlsx package in R.
Example: In this example, we are creating two worksheets in the "Stocks.xlsx" workbook. In the worksheet named "Top Gainers" we save the table_1 output, while in the "Top Losers" worksheet we save the table_2 output. To create a second worksheet in the same workbook, we keep the append argument as TRUE in the second line.
```r
write.xlsx(table_1, "Stocks.xlsx", sheetName = "Top Gainers", append = FALSE)
write.xlsx(table_2, "Stocks.xlsx", sheetName = "Top Losers", append = TRUE)
```

Functions Demystified

Sys.time function

The Sys.time function gives the current date and time.
Example:
```r
date_time = Sys.time()
print(date_time)
[1] "2017-04-15 16:25:38 IST"
```

The function can be used to find the time required to run a piece of code by placing a call at the start and another at the end of the code. The difference between the two gives the time taken to execute the code.
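For instance (a minimal sketch):

```r
# Time a code section by differencing two Sys.time() calls
start_time <- Sys.time()
result <- sum(sqrt(1:1e6))       # the code being timed
end_time <- Sys.time()
print(end_time - start_time)     # a 'difftime' object, e.g. "Time difference of ... secs"
```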
julian function

The julian function is used to extract the Julian date, which is the number of days since January 1, 1970. Given a date, the syntax of the function is:

```r
julian(date)
```
Example:
```r
date = as.Date("2010-03-15")
julian(date)
[1] 14683
attr(,"origin")
[1] "1970-01-01"
```
Alternatively, one can use the as.integer function to get the same result.
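For example, with the same date as above:

```r
# as.integer() on a Date gives the same count of days since 1970-01-01
date <- as.Date("2010-03-15")
as.integer(date)   # [1] 14683
```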
second and minute functions

These functions are part of the lubridate package. The second function retrieves/sets the second component of a date-time object, while the minute function retrieves/sets the minute component. They work with date-time objects belonging to the POSIXct, POSIXlt, Date, zoo, xts, and timeSeries classes.
Example:
```r
library(lubridate)
# Retrieving the seconds. We have used the ymd_hms function to parse
# the given object.
x = ymd_hms("2016-06-01 12:23:45")
second(x)
[1] 45
```
minute function:
```r
library(lubridate)
# Retrieving the minute from a date-time object
x = ymd_hms("2016-06-01 12:23:45")
minute(x)
[1] 23
```
```r
# Retrieving the minute from a time object. We have used the hms
# function to parse the given object.
x = hms("15:29:06")
minute(x)
[1] 29
```
Next Step

We hope you liked this bulletin. In the next weekly bulletin, we will list more interesting ways, methods, and R functions for our readers.
Another rule of three, this one in statistics
(This article was first published on R – Decision Science News, and kindly contributed to R-bloggers)
SAMPLE UNTIL THE FIRST SUCCESS, PUT A CONFIDENCE INTERVAL ON P(SUCCESS)
We wrote last week of Charles Darwin’s love of the “rule of three” which, according to Stigler “is simply the mathematical proposition that if a/b = c/d, then any three of a, b, c, and d suffice to determine the fourth.”
We were surprised to learn this is called the rule of three, as we had only heard of the rule of three in comedy. We were even more surprised when a reader wrote in telling us about a rule of three in statistics. According to Wikipedia: the rule of three states that if a certain event did not occur in a sample with n subjects … the interval from 0 to 3/n is a 95% confidence interval for the rate of occurrences in the population.
It’s a heuristic. We love heuristics!
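As a quick sanity check (not part of the original post), the rule-of-three bound can be compared against the exact one-sided 95% upper limit for zero observed events, which solves (1 − p)^n = 0.05:

```r
# For zero events in n trials, the exact 95% upper bound is p = 1 - 0.05^(1/n);
# the rule of three approximates this by 3/n.
n <- 100
exact <- 1 - 0.05^(1/n)  # ~0.0295
rule  <- 3 / n           # 0.03
abs(exact - rule)        # the heuristic is within ~0.0005 here
```

The approximation comes from ln(0.05) ≈ −3, which is why the numerator is 3.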
We decided to give it a test run in a little simulation. You can imagine that we’re testing for defects in products coming off of a production line. Here’s how the simulation works:
- We test everything that comes off the line, one by one, until we come across a defect (the n+1st test).
- We then make a confidence interval bounded by 0 and 3/n and make note of it. 95% of such intervals should contain the true underlying probability of defect.
- Because it’s a simulation and we know the true underlying probability of defect, we make note of whether the interval contains the true probability of defect.
- We repeat this 10,000 times at each of the following underlying probabilities: .001, .002, and .003.
The simulation thus far assumes the testers have the patience to keep testing until they find a defect. In reality, however, they might get bored and stop testing before the first defect is found. To address this, we also simulated another condition in which the testing is stopped at n/2, halfway to the first defect. Of course, in reality, one has no way of knowing when one is halfway to the first defective test, but our simulation can at least let us know what kind of confidence interval one will get if one does indeed stop halfway.
Here’s the result on bracketing, that is, the probability that the confidence intervals contain the correct value:
Across all three levels of true underlying probability, when stopping just before the first defect, we get 95% confidence intervals. However, when we stop halfway to the first defect, we get closer to 100% intervals (99.73%, 99.80%, and 99.86%, respectively).
So we know that the upper bounds of these confidence intervals fall above the true probability 95% to about 100% of the time, but where do they fall?
In the uppermost figure of this post, we see the locations of the upper bounds of the simulated confidence intervals when we stop immediately before the first defect. For convenience, we draw blue lines at the true underlying probabilities of .001 (top), .002 (middle), and .003 (bottom). When it’s a 95% confidence interval, about 95% of the upper bounds should fall to the right of the blue line, and 5% to the left. We cut the axis at .05 so you can actually see something, but realize it extends all the way to 1.0, with the heights of the bars trailing off.
For comparison, let’s look at the case in which we stop halfway to the first defect. As suggested by the bracketing probabilities, here we see almost all of the upper bounds exceed the true underlying probability. As our applied statistician reader wrote us about the rule of three, “the weakness is that in some situations it’s a very broad confidence interval.”
REFERENCE
A Look at the Rule of Three
B. D. Jovanovic and P. S. Levy
The American Statistician
Vol. 51, No. 2 (May, 1997), pp. 137-139
DOI: 10.2307/2685405
Stable URL: http://www.jstor.org/stable/2685405
R CODE FOR THOSE WHO WANT TO PLAY ALONG AT HOME
require(scales)
library(ggplot2)
library(dplyr)
levels = c(.001,.002,.003)
ITER = 10000
res_list = vector('list', ITER*length(levels))
index=1
for(true_p in levels) {
for (i in 1:ITER) {
onesam = sample(
x = c(1, 0),
size = 10*1/true_p,
prob = c(true_p, 1 - true_p),
replace = TRUE
)
cut = which.max(onesam) - 1
upper_bound_halfway = min(3 / (cut/2),1)
upper_bound_lastpossible = min(3/cut,1)
res_list[[index]] =
data.frame(
true_p = true_p,
cut = cut,
upper_bound_halfway = upper_bound_halfway,
bracketed_halfway = true_p < upper_bound_halfway,
upper_bound_lastpossible = upper_bound_lastpossible,
bracketed_lastpossible = true_p < upper_bound_lastpossible
)
index = index + 1
}
}
df = do.call('rbind', res_list)
rm(res_list)
plot_data = rbind(
df %>% group_by(true_p) %>% summarise(bracketing_probability = mean(bracketed_halfway),type="halfway"),
df %>% group_by(true_p) %>% summarise(bracketing_probability = mean(bracketed_lastpossible),type="last possible")
)
plot_data
p=ggplot(plot_data,aes(x=true_p,y=bracketing_probability,group=type,fill=type)) +
geom_bar(stat="identity",position="dodge") +
coord_cartesian(ylim=c(.5,1)) +
theme_bw() +
theme(legend.position = "bottom",
panel.grid.minor.x = element_blank()) +
labs(x="True Probability",y="Bracketing Probability")
p
ggsave(plot=p,file="bracketing.png",width=4,height=4)
plot_data2 = df %>%
dplyr::select(-bracketed_halfway, -bracketed_lastpossible) %>%
tidyr::gather(bound_type,upper_bound,c(upper_bound_halfway,upper_bound_lastpossible)) %>%
arrange(bound_type,upper_bound) %>%
mutate(bin = floor(upper_bound/.001)*.001) %>%
group_by(bound_type,true_p,bin) %>%
summarise(count = n()) %>%
ungroup()
p=ggplot(subset(plot_data2,bound_type=="upper_bound_lastpossible"),aes(x=bin+.0005,y=count)) +
geom_bar(stat="identity",width = .0005) +
geom_vline(aes(xintercept = true_p),color="blue") +
coord_cartesian(xlim = c(0,.05)) +
labs(x="Upper Bound",y="Count") +
facet_grid(true_p ~ .) +
theme_bw() +
theme(legend.position = "none")
p
ggsave(plot=p,file="upper_bound_lastpossible.png",width=4,height=4)
#repeat for upper_bound_halfway
p=ggplot(subset(plot_data2,bound_type=="upper_bound_halfway"),aes(x=bin+.0005,y=count)) +
geom_bar(stat="identity",width = .0005) +
geom_vline(aes(xintercept = true_p),color="blue") +
coord_cartesian(xlim = c(0,.05),ylim=c(0,1750)) +
labs(x="Upper Bound",y="Count") +
facet_grid(true_p ~ .) +
theme_bw() +
theme(legend.position = "none")
p
ggsave(plot=p,file="upper_bound_halfway.png",width=4,height=4)
The post Another rule of three, this one in statistics appeared first on Decision Science News.
To leave a comment for the author, please follow the link and comment on their blog: R – Decision Science News.
#5: Easy package information
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
Welcome to the fifth post in the recklessly rambling R rants series, or R^4 for short.
The third post showed an easy way to follow R development by monitoring (curated) changes on the NEWS file for the development version, r-devel. As a concrete example, I mentioned that it has shown a nice new function (tools::CRAN_package_db()) coming up in R 3.4.0. Today we will build on that.
Consider the following short snippet:
library(data.table)

getPkgInfo <- function() {
    if (exists("CRAN_package_db", where = asNamespace("tools"))) {  # new in R 3.4.0
        dat <- tools::CRAN_package_db()
    } else {
        tf <- tempfile()
        download.file("https://cloud.r-project.org/src/contrib/PACKAGES.rds", tf, quiet=TRUE)
        dat <- readRDS(tf)          # r-devel can now readRDS off a URL too
    }
    dat <- as.data.frame(dat)
    setDT(dat)
    dat
}

It defines a simple function getPkgInfo() as a wrapper around said new function from R 3.4.0, i.e. tools::CRAN_package_db(), and a fallback alternative using a tempfile (in the automagically cleaned R temp directory) and an explicit download and read of the underlying RDS file. As an aside, just this week the r-devel NEWS told us that such readRDS() operations can now read directly from a URL connection. Very nice, as RDS is a fantastic file format when you are working in R.
Anyway, back to the RDS file! The snippet above returns a data.table object with as many rows as there are packages on CRAN, and basically all their (parsed !!) DESCRIPTION info and then some. A gold mine!
Consider this to see how many packages have a dependency (in the sense of Depends, Imports or LinkingTo, but not Suggests because Suggests != Depends) on Rcpp:
R> dat <- getPkgInfo()
R> rcppRevDepInd <- as.integer(tools::dependsOnPkgs("Rcpp", recursive=FALSE, installed=dat))
R> length(rcppRevDepInd)
[1] 998
R>

So exciting—we will hit 1000 within days! But let’s do some more analysis:
R> dat[ rcppRevDepInd, RcppRevDep := TRUE]  # set to TRUE for given set
R> dat[ RcppRevDep==TRUE, 1:2]
           Package Version
  1:      ABCoptim  0.14.0
  2: AbsFilterGSEA     1.5
  3:           acc   1.3.3
  4: accelerometry   2.2.5
  5:      acebayes   1.3.4
 ---
994:        yakmoR   0.1.1
995:  yCrypticRNAs  0.99.2
996:         yuima   1.5.9
997:           zic     0.9
998:       ziphsmm   1.0.4
R>

Here we index the reverse dependency using the vector we had just computed, and then use that new variable to subset the data.table object. Given the aforementioned parsed information from all the DESCRIPTION files, we can learn more:
R> ## likely false entries
R> dat[ RcppRevDep==TRUE, ][NeedsCompilation!="yes", c(1:2,4)]
            Package Version                                                      Depends
 1:         baitmet   1.0.0                                        Rcpp, erah (>= 1.0.5)
 2:           bea.R   1.0.1                                   R (>= 3.2.1), data.table
 3:            brms   1.6.0 R (>= 3.2.0), Rcpp (>= 0.12.0), ggplot2 (>= 2.0.0), methods
 4: classifierplots   1.3.3          R (>= 3.1), ggplot2 (>= 2.2), data.table (>= 1.10),
 5:           ctsem   2.3.1                      R (>= 3.2.0), OpenMx (>= 2.3.0), Rcpp
 6:        DeLorean   1.2.4                              R (>= 3.0.2), Rcpp (>= 0.12.0)
 7:            erah   1.0.5                                          R (>= 2.10), Rcpp
 8:             GxM     1.1                                                           NA
 9:             hmi   0.6.3                                                 R (>= 3.0.0)
10:        humarray     1.1 R (>= 3.2), NCmisc (>= 1.1.4), IRanges (>= 1.22.10),\nGenomicRanges (>= 1.16.4)
11:         iNextPD   0.3.2                                                 R (>= 3.1.2)
12:          joinXL   1.0.1                                                 R (>= 3.3.1)
13:            mafs   0.0.2                                                           NA
14:            mlxR   3.1.0                                        R (>= 3.0.1), ggplot2
15:    RmixmodCombi     1.0 R(>= 3.0.2), Rmixmod(>= 2.0.1), Rcpp(>= 0.8.0), methods,\ngraphics
16:             rrr   1.0.0                                                 R (>= 3.2.0)
17:        UncerIn2     2.0       R (>= 3.0.0), sp, RandomFields, automap, fields, gstat
R>

There are a full seventeen packages which claim to depend on Rcpp while not having any compiled code of their own. That is likely false—but I keep them in my counts, however reluctantly. A CRAN-declared Depends: is a Depends:, after all.
Another nice thing to look at is the total number of packages that declare that they need compilation:
R> ## number of packages with compiled code
R> dat[ , .(N=.N), by=NeedsCompilation]
   NeedsCompilation    N
1:               no 7625
2:              yes 2832
3:               No    1
R>

Isn’t that awesome? It is 2832 out of (currently) 10458, or about 27.1%. Just over one in four. Now the 998 for Rcpp look even better as they are about 35% of all such packages. In other words, a little over one third of all packages with compiled code (which may be legacy C, Fortran or C++) use Rcpp. Wow.
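The stray "No" row shows that NeedsCompilation is free text from DESCRIPTION and is not case-normalized; a hedged sketch with a toy data frame (not the actual CRAN data) shows how lower-casing before tabulating merges such entries:

```r
# NeedsCompilation is free text in DESCRIPTION, so "no" and "No" both occur;
# normalizing case before tabulating merges them (toy data, not CRAN's)
pkgs <- data.frame(NeedsCompilation = c("no", "yes", "No", "yes", "no"))
table(tolower(pkgs$NeedsCompilation))
#  no yes
#   3   2
```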
Before closing, one shoutout to Dirk Schumacher, whose thankr package (which I made the center of the last post) is now on CRAN: a mighty fine and slim micro-package without external dependencies. Neat.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.
Gender Roles with Text Mining and N-grams
(This article was first published on data science ish, and kindly contributed to R-bloggers)
Today is the one year anniversary of the janeaustenr package’s appearance on CRAN, its cranniversary, if you will. I think it’s time for more Jane Austen here on my blog.
I saw this paper by Matthew Jockers and Gabi Kirilloff a number of months ago and the ideas in it have been knocking around in my head ever since. The authors of that paper used text mining to examine a corpus of 19th century novels and explore how gendered pronouns (he/she/him/her) are associated with different verbs. These authors used the Stanford CoreNLP library to parse dependencies in sentences and find which verbs are connected to which pronouns; I have been thinking about how to apply a different approach to this question using tidy data principles and n-grams. Let’s see what we can do!
Jane Austen and n-grams
An n-gram is a contiguous series of n words from a text; for example, a bigram is a pair of words, with n = 2. If we want to find out which verbs an author is more likely to pair with the pronoun “she” than with “he”, we can analyze bigrams. Let’s use unnest_tokens from the tidytext package to identify all the bigrams in the 6 completed, published novels of Jane Austen and transform this into a tidy dataset.
library(tidyverse)
library(tidytext)
library(janeaustenr)

austen_bigrams <- austen_books() %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2)

austen_bigrams

## # A tibble: 725,048 × 2
##                   book          bigram
##                 <fctr>           <chr>
## 1  Sense & Sensibility       sense and
## 2  Sense & Sensibility and sensibility
## 3  Sense & Sensibility  sensibility by
## 4  Sense & Sensibility         by jane
## 5  Sense & Sensibility     jane austen
## 6  Sense & Sensibility     austen 1811
## 7  Sense & Sensibility    1811 chapter
## 8  Sense & Sensibility       chapter 1
## 9  Sense & Sensibility           1 the
## 10 Sense & Sensibility      the family
## # ... with 725,038 more rows

That is all the bigrams from Jane Austen’s works, but we only want the ones that start with “he” or “she”. Jane Austen wrote in the third person, so this is a good example set of texts for this question. The original paper used dependency parsing of sentences and included other pronouns like “her” and “him”, but let’s just look for bigrams that start with “she” and “he”. We will get some adverbs and modifiers and such as the second word in the bigram, but mostly verbs, the main thing we are interested in.
pronouns <- c("he", "she")

bigram_counts <- austen_bigrams %>%
    count(book, bigram, sort = TRUE) %>%
    ungroup() %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(word1 %in% pronouns) %>%
    count(word1, word2, wt = n, sort = TRUE) %>%
    rename(total = nn)

bigram_counts

## # A tibble: 1,571 × 3
##    word1 word2 total
##    <chr> <chr> <int>
## 1    she   had  1472
## 2    she   was  1377
## 3     he   had  1023
## 4     he   was   889
## 5    she could   817
## 6     he    is   399
## 7    she would   383
## 8    she    is   330
## 9     he could   307
## 10    he would   264
## # ... with 1,561 more rows

There we go! These are the most common bigrams that start with “he” and “she” in Jane Austen’s works. Notice that there are more mentions of women than men here; this makes sense as Jane Austen’s novels have protagonists who are women. The most common bigrams look pretty similar between the male and female characters in Austen’s works. Let’s calculate a log odds ratio so we can find the words (hopefully mostly verbs) that exhibit the biggest differences between relative use for “she” and “he”.
word_ratios <- bigram_counts %>%
    group_by(word2) %>%
    mutate(word_total = sum(total)) %>%
    ungroup() %>%
    filter(word_total > 10) %>%
    select(-word_total) %>%
    spread(word1, total, fill = 0) %>%
    mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
    mutate(logratio = log(she / he)) %>%
    arrange(desc(logratio))

Which words have about the same likelihood of following “he” or “she” in Jane Austen’s novels?
word_ratios %>%
    arrange(abs(logratio))

## # A tibble: 164 × 4
##           word2          he          she    logratio
##           <chr>       <dbl>        <dbl>       <dbl>
## 1        always 0.001846438 0.0018956289  0.02629233
## 2         loves 0.000923219 0.0008920607 -0.03433229
## 3           too 0.000923219 0.0008920607 -0.03433229
## 4          when 0.000923219 0.0008920607 -0.03433229
## 5  acknowledged 0.001077089 0.0011150758  0.03466058
## 6      remained 0.001077089 0.0011150758  0.03466058
## 7           had 0.157562702 0.1642506690  0.04157024
## 8        paused 0.001384828 0.0014495986  0.04571041
## 9         would 0.040775504 0.0428189117  0.04889836
## 10       turned 0.003077397 0.0032337199  0.04954919
## # ... with 154 more rows

These words, like “always” and “loves”, are about as likely to come after the word “she” as the word “he”. Now let’s look at the words that exhibit the largest differences in appearing after “she” compared to “he”.
word_ratios %>%
    mutate(abslogratio = abs(logratio)) %>%
    group_by(logratio < 0) %>%
    top_n(15, abslogratio) %>%
    ungroup() %>%
    mutate(word = reorder(word2, logratio)) %>%
    ggplot(aes(word, logratio, color = logratio < 0)) +
    geom_segment(aes(x = word, xend = word, y = 0, yend = logratio),
                 size = 1.1, alpha = 0.6) +
    geom_point(size = 3.5) +
    coord_flip() +
    labs(x = NULL,
         y = "Relative appearance after 'she' compared to 'he'",
         title = "Words paired with 'he' and 'she' in Jane Austen's novels",
         subtitle = "Women remember, read, and feel while men stop, take, and reply") +
    scale_color_discrete(name = "", labels = c("More 'she'", "More 'he'")) +
    scale_y_continuous(breaks = seq(-1, 2),
                       labels = c("0.5x", "Same", "2x", "4x"))

These words are the ones that are the most different in how Jane Austen used them with the pronouns “he” and “she”. Women in Austen’s novels do things like remember, read, feel, resolve, long, hear, dare, and cry. Men, on the other hand, in these novels do things like stop, take, reply, come, marry, and know. Women in Austen’s world can be funny and smart and unconventional, but she plays with these ideas within a cultural context where they act out gendered roles.
George Eliot and n-grams
Let’s look at another set of novels to see some similarities and differences. Let’s take some novels of George Eliot, another English writer (a woman) who lived and wrote several decades after Jane Austen. Let’s take Middlemarch (MY FAVE), Silas Marner, and The Mill on the Floss.
library(gutenbergr)

eliot <- gutenberg_download(c(145, 550, 6688),
                            mirror = "http://mirrors.xmission.com/gutenberg/")
eliot_ratios <- eliot %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE) %>%
    ungroup() %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(word1 %in% pronouns) %>%
    count(word1, word2, wt = n, sort = TRUE) %>%
    rename(total = nn) %>%
    group_by(word2) %>%
    mutate(word_total = sum(total)) %>%
    ungroup() %>%
    filter(word_total > 10) %>%
    select(-word_total) %>%
    spread(word1, total, fill = 0) %>%
    mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
    mutate(logratio = log(she / he)) %>%
    arrange(desc(logratio))

What words exhibit the largest differences in their appearance after these pronouns in George Eliot’s works?
eliot_ratios %>%
    mutate(abslogratio = abs(logratio)) %>%
    group_by(logratio < 0) %>%
    top_n(15, abslogratio) %>%
    ungroup() %>%
    mutate(word = reorder(word2, logratio)) %>%
    ggplot(aes(word, logratio, color = logratio < 0)) +
    geom_segment(aes(x = word, xend = word, y = 0, yend = logratio),
                 size = 1.1, alpha = 0.6) +
    geom_point(size = 3.5) +
    coord_flip() +
    labs(x = NULL,
         y = "Relative appearance after 'she' compared to 'he'",
         title = "Words paired with 'he' and 'she' in George Eliot's novels",
         subtitle = "Women read, run, and need while men leave, mean, and tell") +
    scale_color_discrete(name = "", labels = c("More 'she'", "More 'he'")) +
    scale_y_continuous(breaks = seq(-3, 3),
                       labels = c("0.125x", "0.25x", "0.5x", "Same", "2x", "4x", "8x"))

We can see some difference in word use and style here, but overall there are quite similar ideas behind the verbs for women and men in Eliot’s works as in Austen’s. Women read, run, need, marry, and look while men leave, mean, tell, know, and call. The verbs associated with women are more connected to emotion or feelings while the verbs associated with men are more connected to action or speaking.
Jane Eyre and n-grams
Finally, let’s look at one more novel. The original paper found that Jane Eyre by Charlotte Brontë had its verbs switched, in that there were lots of active, non-feelings verbs associated with feminine pronouns. That Jane Eyre!
eyre <- gutenberg_download(1260, mirror = "http://mirrors.xmission.com/gutenberg/")

eyre_ratios <- eyre %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE) %>%
    ungroup() %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(word1 %in% pronouns) %>%
    count(word1, word2, wt = n, sort = TRUE) %>%
    rename(total = nn) %>%
    group_by(word2) %>%
    mutate(word_total = sum(total)) %>%
    ungroup() %>%
    filter(word_total > 5) %>%
    select(-word_total) %>%
    spread(word1, total, fill = 0) %>%
    mutate_if(is.numeric, funs((. + 1) / sum(. + 1))) %>%
    mutate(logratio = log(she / he)) %>%
    arrange(desc(logratio))
eyre_ratios %>%
    mutate(abslogratio = abs(logratio)) %>%
    group_by(logratio < 0) %>%
    top_n(15, abslogratio) %>%
    ungroup() %>%
    mutate(word = reorder(word2, logratio)) %>%
    ggplot(aes(word, logratio, color = logratio < 0)) +
    geom_segment(aes(x = word, xend = word, y = 0, yend = logratio),
                 size = 1.1, alpha = 0.6) +
    geom_point(size = 3.5) +
    coord_flip() +
    labs(x = NULL,
         y = "Relative appearance after 'she' compared to 'he'",
         title = "Words paired with 'he' and 'she' in Jane Eyre",
         subtitle = "Women look, tell, and open while men stop, smile, and pause") +
    scale_color_discrete(name = "", labels = c("More 'she'", "More 'he'")) +
    scale_y_continuous(breaks = seq(-3, 3),
                       labels = c("0.125x", "0.25x", "0.5x", "Same", "2x", "4x", "8x"))

Indeed, the words that are more likely to appear after “she” are not particularly feelings-oriented; women in this novel do things like look, tell, open, and do. Men in Jane Eyre do things like stop, smile, pause, pursue, and stand.
The End
It is so interesting to me how these various authors understand and portray their characters’ roles and gender, and how we can see that through analyzing word choices. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback and questions!
To leave a comment for the author, please follow the link and comment on their blog: data science ish.
Optimising your blog with R and Google Optimize
(This article was first published on Florian Teschner, and kindly contributed to R-bloggers)
When writing a blog, sooner or later, one measures and consequently tries to improve some kind of success metric.
In my case, I have been tracking user behavior (visitors, bounce rate, pageviews, …) for slightly over a year with Google Analytics. While some posts have been more “successful” than others, it is not my goal to change the topics and headlines in order to get more clicks. Instead, one metric that I care about is that visitors find value in the stuff they read and subsequently read more of the past blog posts. In order to see if I can tweak this metric, I ran a small A/B test over the last two months.
50% of the blog visitors saw a new start page where the excerpt after the blog post headline was hidden.
The following image shows the difference between the test setting and the baseline setting.
What do you expect? Did the test setting outperform the baseline setting in terms of pageviews?
As I use a very common Jekyll blog theme, my assumption was that the theme authors know more about usability than me and would have optimized for reader “engagement”.
So let’s have a look at the data. The googleAnalyticsR package allows one to conveniently load the data from GA into R.
With the downloaded data, I use the fresh weighted.summarySE function to compare the two key KPIs, bounce rate and pageviews per session, in both settings.
dff$Test <- ifelse(dff$experimentVariant == 0, "Baseline", "Experiment \n Setting")
mm <- melt(dff[, c("Test", "sessions", "bounceRate", "pageviewsPerSession")],
           id.vars = c("Test", "sessions"))
dfc <- weighted.summarySE(mm, measurevar="value",
                          groupvars=c("Test", "variable"),
                          weights = "sessions", na.rm=T)
p1 <- ggplot(dfc, aes(Test, value)) +
    geom_point() +
    geom_errorbar(aes(ymin=value-se, ymax=value+se), width=.1)
p1 + theme_economist(base_size = 16) + xlab("") + ylab("") +
    facet_wrap(~variable, scales="free_y", ncol = 1)

Wow, what a difference; both measures are improved by ~25%. I am seriously surprised by the large effect of such a small change. While I have tested the effect of interface changes on user behavior before (among others, as part of my dissertation), these effects are still a surprise.
As of yesterday, I stopped the experiment and permanently changed the blog layout to be slightly more compact but 25% more successful ;).
To leave a comment for the author, please follow the link and comment on their blog: Florian Teschner.
Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (2): Recommendation as discrete choice
(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)
In this continuation of “Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (1): Feature engineering” I will describe the application of the {ordinal} clm() function to test a new, hybrid content-based, collaborative filtering approach to recommender engines by fitting a class of ordinal logistic (aka ordered logit) models to ratings data from the MovieLens 100K dataset. All R code used in this project can be obtained from the respective GitHub repository; the chunks of code present in the body of the post illustrate the essential steps only. The MovieLens 100K dataset can be obtained from the GroupLens research laboratory of the Department of Computer Science and Engineering at the University of Minnesota. This second part of the study relies on the R code in OrdinalRecommenders_3.R and presents the model training, cross-validation, and the analyses. But before we proceed to approach the recommendation problem from a viewpoint of discrete choice modeling, let me briefly remind you of the results of the feature engineering phase and explain what happens in OrdinalRecommenders_2.R, which prepares the data frames that are passed to clm().
Enter explanatory variables
In the feature engineering phase (OrdinalRecommender_1.R) we have produced four matrices: two user-user and two item-item. Of these four matrices, two are proximity matrices with Jaccard distances derived from binarized content-based features: for users, we have used the binary-featured vectors that describe each user by what movies he or she has or has not rated, while for items we have derived the proximity matrix from the binarized information on movie genre and the decade of production. In addition, we have two Pearson correlation matrices derived from the ratings matrix directly, one describing the user-user similarity and the other the similarity between the items. The latter two matrices present the memory-based component of the model, while the former carry content-based information. In OrdinalRecommenders_2.R, these four matrices are:
- usersDistance – user-user Jaccard distances,
- UserUserSim – user-user Pearson correlations,
- itemsDistance – item-item Jaccard distances, and
- ItemItemSim – item-item Pearson correlations.
The OrdinalRecommenders_2.R script essentially performs only the following operations over these matrices: for each user-item rating present in ratingsData (the full ratings matrix),
- from the user-user proximity/similarity matrix, select the 100 users who have rated the item under consideration and who are closest to the user under consideration;
- place them in increasing order of distance/decreasing order of similarity from the current user;
- copy their respective ratings of the item under consideration.
For the item-item matrices, the procedure is similar:
- from the item-item proximity/similarity matrix, select the 100 items that the user under consideration has rated and that are closest to the item under consideration;
- place them in increasing order of distance/decreasing order of similarity from the current item;
- copy the user’s respective ratings of those items.
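The selection step above might be sketched as follows with toy objects; the object and function names (ratings, userDist, neighbourRatings) are hypothetical and not taken from the post’s repository:

```r
# Sketch of neighbour selection for the user-user case (toy data, hypothetical names):
# for one user and one item, take the k nearest raters of that item and
# copy their ratings in order of proximity.
set.seed(1)
nUsers <- 6; nItems <- 4
ratings  <- matrix(sample(c(NA, 1:5), nUsers * nItems, replace = TRUE),
                   nrow = nUsers)                       # user x item ratings, NA = unrated
userDist <- as.matrix(dist(matrix(runif(nUsers * 2), nrow = nUsers)))  # user-user distances

neighbourRatings <- function(user, item, k) {
  raters <- setdiff(which(!is.na(ratings[, item])), user)  # users who rated the item
  raters <- raters[order(userDist[user, raters])]          # closest raters first
  ratings[head(raters, k), item]                           # their ratings, in proximity order
}

neighbourRatings(user = 1, item = 2, k = 3)
```

Note that fewer than k ratings may come back, which is exactly the missing-data issue discussed below.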
Note that this procedure of selecting the nearest neighbours’ ratings produces missing data almost by necessity: a particular user doesn’t have to have 100 ratings of the items similar to the item under consideration, and for a particular user and item we don’t necessarily find 100 similar users’ ratings of the same item. This would reflect in a decrease of the sample size as one builds progressively more and more feature-encompassing ordered logit models. In order to avoid this problem, I have first reduced the sample size so as to keep only complete observations for the most feature-encompassing model that I have planned to test: the one relying on 30 nearest neighbors from all four matrices (i.e., encompassing 4 * 30 = 120 regressors, plus the four intercepts needed for the ordered logit over a 5-point rating scale); 87,237 observations from the dataset were kept.
For the sake of completeness, I have also copied the respective distance and similarity measures from the matrices; while the first 100 columns of the output matrices present ratings only, the following 100 columns are the respective distance/similarity measures. The latter will not be used in the remainder of the study. The resulting four matrices are:
- proxUsersRatingsFrame – for each user-item combination in the data set, these are the ratings of the item under consideration from the 100 closest neighbours in the user-user content-based proximity matrix;
- simUsersRatingsFrame – for each user-item combination in the data set, these are the ratings of the item under consideration from the 100 closest neighbours in the user-user memory-based similarity matrix;
- proxItemsRatingsFrame – for each user-item combination in the data set, these are the ratings from the user under consideration of the 100 closest neighbours in the item-item content-based proximity matrix; and
- simItemsRatingsFrame – for each user-item combination in the data set, these are the ratings from the user under consideration of the 100 closest neighbours in the item-item memory-based similarity matrix.
Of course it can be done faster than for-looping over 100,000 ratings. Next, the strategy. The strategy here is to progressively select a larger number of numSims – the number of nearest neighbours taken from proxUsersRatingsFrame, simUsersRatingsFrame, proxItemsRatingsFrame, and simItemsRatingsFrame – and pull them together with the ratings from ratingsData. For example, if numSims == 5, we obtain a predictive model Rating ~ 4 matrices * 5 regressors, which is 20 regressors in total; if numSims == 10, we obtain a predictive model Rating ~ 4 matrices * 10 regressors, which is 40 regressors in total, etc.
Enter Ordered Logit Model
The idea to think of the recommendation problem as
discrete choice is certainly not new; look at the very introductory
lines of [1] where the Netflix competition problem was used to motivate
the introduction of ordered regressions in discrete choice. We assume
that each user has an unobservable, continuous utility function that
maps onto his or her item preferences, and of which we learn only by
observing the discrete information that is captured by his or her
ratings on a 5-point scale. We also assume that the mapping of the underlying utility onto the categories of the rating scale is controlled by threshold parameters, of which we fix the first to a value of zero, and estimate the thresholds for the 1→2, 2→3, 3→4, and 4→5 scale adjustments; these play the role of category-specific intercepts and are usually treated as nuisance parameters in ordered logistic regression. In terms of the user's unobserved utility for items, these threshold parameters "dissect" the hypothesized continuous utility scale into fragments that map onto the discrete rating responses (see Chapters 3 and 4.8 in [1]). Additionally, we estimate a vector of regression
coefficients, β, one for each explanatory variable in the model: in our case, the respective ratings from the numSims
closest neighbours from the similarity and proximity matrices as
described in the previous section. In the ordered logit setting – and
with the useful convention of writing a minus in front of the xiTβ term used in {ordinal} (see [2]) – we have:

logit(P(Yi ≤ j)) = θj − xiTβ, for j = 1, …, J − 1,

and given that

P(Yi ≤ j) = P(Yi = 1) + P(Yi = 2) + … + P(Yi = j),

the logit is cumulative; θj stands for the j-th response-category-specific threshold, while i is the index of each observation in the dataset. Vectors x and β play the usual role. The interpretation of the regression coefficients in β becomes clear after realizing that the cumulative odds ratio takes the form:

[P(Yi > j | xi + 1) / P(Yi ≤ j | xi + 1)] / [P(Yi > j | xi) / P(Yi ≤ j | xi)] = exp(β),

which is independent of the response category j, so that for a unit change in x the cumulative odds of P(Yi ≥ j) increase by a multiplicative factor of exp(β).
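The j-independence of the cumulative odds ratio can be checked numerically in a few lines of base R; the threshold and coefficient values below are made up for illustration, and the snippet mimics the minus-convention of {ordinal} without requiring the package:

```r
# Numerical check of the cumulative logit convention: logit(P(Y <= j)) = theta_j - x * beta
# (theta and beta values are made up for illustration)
theta <- c(-1, 0, 1.2, 2.5)  # thresholds for a 5-point response scale
beta  <- 0.8                 # a single regression coefficient

pLE    <- function(x) plogis(theta - x * beta)  # P(Y <= j | x), j = 1..4
oddsGT <- function(x) (1 - pLE(x)) / pLE(x)     # cumulative odds of Y > j

# For a unit change in x, the cumulative odds ratio equals exp(beta),
# identically for every threshold j:
oddsGT(1) / oddsGT(0)  # four identical values, all equal to exp(0.8)
exp(beta)
```

Whatever threshold j one picks, a unit increase in x multiplies the odds of observing a higher rating by the same exp(β).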
Back to the recommendation problem: by progressively adding more nearest neighbours' ratings from the user-user and item-item proximity and similarity matrices, and tracking the behavior of the regression coefficients in predictions from the respective ordered logit models, one can figure out the relative importance of the memory-based information (from the similarity matrices) and the content-based information (from the proximity matrices).
I have varied numSims (the number of nearest neighbours used) as seq(5, 30, by = 5) to develop six ordinal models with 20, 40, 60, 80, 100, and 120 regressors respectively, and performed a 10-fold cross-validation over 87,237 user-item ratings from MovieLens 100K (OrdinalRecommenders_3.R):
### - numSims <- seq(5, 30, by = 5)
numSims <- seq(5, 30, by = 5)
meanRMSE <- numeric(length(numSims))
totalN <- dim(ratingsData)[1]
n <- numeric()
ct <- 0
## - Prepare modelFrame:
# - select variables so to match the size needed for the most encompassing clm() model:
f1 <- select(proxUsersRatingsFrame,
             starts_with('proxUsersRatings_')[1:numSims[length(numSims)]])
f2 <- select(simUsersRatingsFrame,
             starts_with('simUsersRatings_')[1:numSims[length(numSims)]])
f3 <- select(proxItemsRatingsFrame,
             starts_with('proxItemsRatings_')[1:numSims[length(numSims)]])
f4 <- select(simItemsRatingsFrame,
             starts_with('simItemsRatings_')[1:numSims[length(numSims)]])
# - modelFrame:
mFrame <- cbind(f1, f2, f3, f4, ratingsData$Rating)
colnames(mFrame)[dim(mFrame)[2]] <- 'Rating'
# - Keep complete observations only;
# - to match the size needed for the most encompassing clm() model:
mFrame <- mFrame[complete.cases(mFrame), ]
# - store sample size:
n <- dim(mFrame)[1]
# - Rating as ordered factor for clm():
mFrame$Rating <- factor(mFrame$Rating, ordered = T)
# - clean up a bit:
rm('f1', 'f2', 'f3', 'f4'); gc()
## - 10-fold cross-validation
set.seed(10071974)
# - folds:
foldSize <- round(length(mFrame$Rating)/10)
foldRem <- length(mFrame$Rating) - 10*foldSize
foldSizes <- rep(foldSize, 9)
foldSizes[10] <- foldSize + foldRem
foldInx <- numeric()
for (i in 1:length(foldSizes)) {
  foldInx <- append(foldInx, rep(i, foldSizes[i]))
}
foldInx <- sample(foldInx)
# CV loop:
for (k in numSims) {
  ## - loop counter
  ct <- ct + 1
  ## - report
  print(paste0("Ordinal Logistic Regression w. ",
               k, " nearest neighbours running:"))
  ### - select k neighbours:
  modelFrame <- mFrame[, c(1:k, 31:(30+k), 61:(60+k), 91:(90+k))]
  modelFrame$Rating <- mFrame$Rating
  # - model for the whole data set (no CV):
  mFitAll <- clm(Rating ~ .,
                 data = modelFrame)
  saveRDS(mFitAll, paste0("OrdinalModel_NoCV_", k, "Feats.Rds"))
  RMSE <- numeric(10)
  for (i in 1:10) {
    # - train and test data sets
    trainFrame <- modelFrame[which(foldInx != i), ]
    testFrame <- modelFrame[which(foldInx == i), ]
    # - model
    mFit <- clm(Rating ~ .,
                data = trainFrame)
    # - predict test set:
    predictions <- predict(mFit, newdata = testFrame,
                           type = "class")$fit
    # - RMSE:
    dataPoints <- length(testFrame$Rating)
    RMSE[i] <-
      sum((as.numeric(predictions) - as.numeric(testFrame$Rating))^2)/dataPoints
    print(paste0(i, "-th fold, RMSE: ", RMSE[i]))
  }
  ## - store mean RMSE over 10 folds:
  meanRMSE[ct] <- mean(RMSE)
  print(meanRMSE[ct])
  ## - clean up:
  rm('testFrame', 'trainFrame', 'modelFrame'); gc()
}
The outer loop varies k in numSims, the number of nearest neighbours selected, then prepares the dataset, picks ten folds for cross-validation, and fits the ordered logit to the whole dataset before the cross-validation; the inner loop performs a 10-fold cross-validation and stores the RMSE from each test fold; the average RMSE across the ten folds for each k in numSims is recorded. This procedure is later modified to test models that work with the proximity or similarity matrices exclusively.
Results
The following figure presents the average RMSE from 10-fold CV after selecting 5, 10, 15, 20, 25, and 30 nearest neighbours from the user-user and item-item proximity and similarity matrices. Once again, the similarity matrices represent the memory-based information, while the proximity matrices carry the content-based information. The four different sources of features are termed "feature categories" in the graph.
Figure 1. The average RMSE from the 10-fold cross-validations from clm() with 5, 10, 15, 20, 25, and 30 nearest neighbours.
The model with only 20 features and exactly 24 parameters (4 intercepts plus the coefficients for the 5 ratings from each of the four matrices) achieves an average RMSE of around .52. For comparison, in a recent 10-fold cross-validation over MovieLens 100K with {recommenderlab}, the CF algorithms achieved an RMSE of .991 with 100 nearest neighbours in user-based CF, and an RMSE of .940 with 100 nearest neighbours in item-based CF: our approach almost halves the error with a five-fold reduction in the number of features used. With 120 features (30 nearest neighbours from all four matrices) and 124 parameters, the model achieves an RMSE of approximately .43. With these results, the new approach surpasses many recommender engines tested on MovieLens 100K in previous years; for example, compare the results presented here – taking into account that the model with only 20 features achieves an RMSE of .517 on the whole dataset (i.e. without cross-validation) – with the results from the application of several different approaches presented by the Recommender Systems project of Carleton College, Northfield, MN.
What about the relative importance of memory-based and content-based information in the prediction of user ratings? The following figure presents the median values and range of exp(β) for the progressively more complex ordinal models tested here:
Figure 2. The median exp(β) from the clm() ordered logit models estimated without cross-validation and encompassing 5, 10, 15, 20, 25, and 30 nearest neighbours from the memory-based and content-based matrices.
As is readily observed, the memory-based features (simItemsCoeff for item-item similarity neighbours, and simUsersCoeff for user-user similarity neighbours in the graph legend) weigh more heavily in the ratings prediction than the content-based features (proxItemsCoeff for the item-item proximity neighbours, and proxUsersCoeff for the user-user proximity neighbours). I have performed another round of the experiment by fitting ordinal models with the information from the memory-based similarity matrices only; Figure 3 presents the average RMSE obtained from 10-fold CVs of these models:
Figure 3. The average RMSE from the 10-fold cross-validations from clm() with 5, 10, 15, 20, 25, and 30 nearest neighbours for models with memory-based information only.
The results seem to indicate almost no change in the RMSEs across the models (but note how the model starts overfitting when more than 25 features per matrix are used). Could it be that the content-based information – encoded by the proximity matrices – is not a necessary ingredient of my predictive model at all? No. But we will certainly not use RMSE to support this conclusion:
modelRes <- matrix(rep(0, 6*7), nrow = 6, ncol = 7)
numSims <- seq(5, 30, by = 5)
ct <- 0
for (k in numSims) {
  ct <- ct + 1
  fileNameSimProx <- modelFiles[which(grepl(paste0("CV_", k, "F"), modelFiles, fixed = T))]
  modelSimProx <- readRDS(fileNameSimProx)
  fileNameSim <- modelFiles[which(grepl(paste0("Sim_", k, "F"), modelFiles, fixed = T))]
  modelSim <- readRDS(fileNameSim)
  testRes <- anova(modelSimProx, modelSim)
  modelRes[ct, ] <- c(testRes$no.par, testRes$AIC, testRes$LR.stat[2],
                      testRes$df[2], testRes$`Pr(>Chisq)`[2])
}
modelRes <- as.data.frame(modelRes)
colnames(modelRes) <- c('Parameters_Model1', 'Parameters_Model2',
                        'AIC_Model1', 'AIC_Model2',
                        'LRTest', 'DF', 'Pr_TypeIErr')
write.csv(modelRes, 'ModelSelection.csv')
The results of the likelihood ratio tests, together with the values of the Akaike Information Criterion (AIC) for all models, are provided in Table 1; a comparison of the AIC values is provided in Figure 4. I have compared the models that encompass the top neighbours from both the memory-based and the content-based matrices, on one hand, with the models encompassing only information from the memory-based matrices, on the other. As you can see, the AIC is consistently lower for the models that incorporate both the content-based and the memory-based information (with the significant LRTs testifying that the contribution of content-based information cannot be rejected).
Figure 4. Model selection: AIC for ordered logit models with 5, 10, 15, 20, 25, and 30 nearest neighbours: memory-based information only (Sim) vs. memory-based and content-based models (Sim+Prox).
Summary
Some say that comparing model-based recommendation systems with the CF approaches is not fair in the first place. True. However, achieving a reduction in error of almost 50% while using 80% fewer features is telling. I think the result has two important stories to tell, in fact. The first one is that the comparison is really not fair :). The second story is, however, much more important. It is not about the application of the relevant model; ever since the development of discrete choice models in economics, most behavioral scientists who see a Likert-type response scale probably tend to think of some appropriate ordinal model for the problem at hand. It is about the formulation of the problem: as ever in science, the crucial question turns out to be the choice of the most relevant description of the problem; in cognitive and data science, the question of what representation enables its solution by the simplest possible means of mathematical modeling.
References
[1] Greene, W. & Hensher, D. (2010). Modeling Ordered Choices. Cambridge University Press.
[2] Christensen, R. H. B. (June 28, 2015). Analysis of ordinal data with cumulative link models – estimation with the R package ordinal. Accessed on March 03, 2017, from The Comprehensive R Archive Network. URL: https://cran.r-project.org/web/packages/ordinal/vignettes/clm_intro.pdf
Goran S. Milovanović, PhD
Data Science Consultant, SmartCat
To leave a comment for the author, please follow the link and comment on their blog: The Exactness of Mind. R-bloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping), statistics (regression, PCA, time series, trading) and more...
Hybrid content-based and collaborative filtering recommendations with {ordinal} logistic regression (1): Feature engineering
(This article was first published on The Exactness of Mind, and kindly contributed to R-bloggers)
I will use {ordinal} clm() (and other cool R packages such as {text2vec} as well) here to develop a hybrid content-based, collaborative filtering, and (obviously) model-based approach to solve the recommendation problem on the MovieLens 100K dataset in R. All R code used in this project can be obtained from the respective GitHub repository; the chunks of code present in the body of the post illustrate the essential steps only. The MovieLens 100K dataset can be obtained from the GroupLens research laboratory of the Department of Computer Science and Engineering at the University of Minnesota. The first part of the study introduces the new approach and refers to the feature engineering steps that are performed by the OrdinalRecommenders_1.R script (found on GitHub). The second part, to be published soon, relies on the R code in OrdinalRecommenders_3.R and presents the model training, cross-validation, and analysis steps. The OrdinalRecommenders_2.R script encompasses some tireless for-looping in R (a bad habit indeed) across the dataset, only in order to place the information from the dataset in the format needed for the modeling phase. The study aims at (a) demonstrating the improvement in predicted ratings for recommending on a well-known dataset, and (b) attempting to shed light on the importance of various types of information in the work of recommendation engines. Consequently, the code is not suited for use in production; additional optimizations are straightforward, simple, and necessary as well.
Introduction
Avoiding the formal exposition of the problem altogether, the recommendation problem in the context of contemporary online markets can be expressed in the following way: given (A) the information on the user's past ratings of various items (i.e. the user-item ratings; for example, the ratings on a 5-point Likert-type scale of the movies or TV shows that she or he has watched in the past, the ratings on similar scales of the items purchased, and similar), (B) the information on the user-item ratings of other users, and (C) the attributes of users and/or items (cf. the user's gender, geographical location, age, occupation, and similar; the genre of the movie, product type, and similar for items), provide a prediction of the user's rating of a novel, previously unassessed item. The predictions are then used to select the novel items (or lists of items) that a particular user would – by assumption – enjoy assessing, and eventually purchase, in the future. The cognitive systems used for making such predictions are known as recommendation engines, or recommender systems, and are widely used nowadays across the Internet business.
It is generally recognized that recommendation engines can be grouped in two broad categories: (1) content-based systems, or (2) collaborative filtering (CF) systems, with the latter additionally described as (2.1) neighborhood or (2.2) model-based methods [1]. Content-based systems rely on a typical description of items over feature vectors, and then recommend novel items to users by computing some similarity metric between them and the items that the user has already rated. Collaborative filtering systems, on the other hand, rely on the assumption that the covariations between the ratings (a) from different users over the same set of items, or (b) to different items from the same set of users, implicitly carry the information on item and user attributes. The information about the true user and item attributes is thus latent in CF. In the neighborhood (or memory) variant of CF, these approaches use the user-item ratings directly and proceed by aggregating the ratings of similar users (user-based CF) and/or similar items (item-based CF) in order to provide an estimate of the user's rating of a novel item, typically weighting the known ratings of similar users/items by the respective proximity/similarity measures. In the model-based variant of CF, the system tries to discover the structure of the latent factors from the observed behavioral measures (i.e. known user-item ratings), and then a myriad of machine learning algorithms and statistical approaches can be used to train a predictive model for the novel user-item ratings of novel items.
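As a toy illustration of the neighborhood idea (a generic sketch, not the approach developed in this post), a user-based CF prediction by similarity-weighted averaging might look as follows; predictRating is a hypothetical helper written for this example only:

```r
# Hypothetical sketch: user-based neighborhood CF predicts an unknown rating
# as the similarity-weighted average of the neighbours' known ratings of the
# target item; neighbours who have not rated the item are skipped.
predictRating <- function(neighbourRatings, similarities) {
  keep <- !is.na(neighbourRatings)
  sum(neighbourRatings[keep] * similarities[keep]) / sum(similarities[keep])
}
# Four neighbours, one of whom has not rated the item:
predictRating(c(5, 4, NA, 3), c(0.9, 0.7, 0.5, 0.2))  # ~4.39
```

The more similar a neighbour, the more his or her rating pulls the prediction, which is exactly the weighting scheme described above.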
I will here introduce a new hybrid approach to recommender systems with the following characteristics. First, the system is both content-based and CF. The content-based component of the system encompasses two matrices: the user-user and the item-item proximity matrices, both obtained by applying the relevant distance metric over a set of features that characterize users and items, respectively. The CF component of the system relies on the typical user-user and item-item similarity matrices computed from the known, past user-item ratings, providing for a memory component of the recommender. Second, the system is model-based in the sense that it combines the information from these four matrices in an ordinal logistic regression model to predict the ratings of novel items on a 5-point Likert scale. The underlying logic of this approach is to try to express the recommendation problem as a discrete choice problem, where the alternatives are provided by ordinal information from a 5-point Likert scale, while the decision makers and the choice context are described by a set of content-based and memory-based features, all referring to some degree of closeness between the unknown user/item combination and the known user/item combinations in two spaces: the attribute (feature) space and the neighborhood (memory) space. Because the CF approaches rely on user-user (or item-item) proximity, the question of how many users (or items) similar to the context of the present, novel user-item rating should be used in prediction arises naturally. By expressing the recommender problem as, essentially, a regression problem, we become able to judge the significance of additional user-user and/or item-item similarity information from the change in regression coefficients obtained by progressively adding more and more information to nested ordinal logistic models and training them. Thus the new approach also provides an excellent opportunity for a comparison between content-based information and memory-based information (the latter being typically used in CF): which one is more important? Given the assumption that the former is implicitly present in the latter, i.e. that it presents latent information in the CF approach, does the usage of manifest memory-based information on user-item ratings completely eliminate the need for a content-based description? If not, how should the two be combined? I will demonstrate how the present approach resolves this dilemma while resulting in highly improved recommendations on a well-known and widely used MovieLens dataset, at the same time drastically reducing the number of features needed to build the relevant prediction model.
Motivation
Consider the following user-item ratings matrix, encompassing 10 items (columns) and three users (rows):
In this rating matrix, the ratings of user1 and user2 have a maximal Pearson correlation of 1; however, the ratings of user2 and user3 have a linear correlation of .085, sharing only about 0.7% of variance (r² ≈ .007). Upon a closer look, user1 and user2 have only three common ratings (for items i1, i4, and i5), while user2 and user3 have rated exactly the same items (i1, i3, i4, i5, i8, i9, and i10). Imagine that the items rated by user2 and user3 are all Sci-Fi movies; we can imagine these two users chatting and challenging each other's views on the contemporary Sci-Fi production on online forums or fan conferences, right? They do not have to have perfectly similar tastes, but in fact they are similar: simply because they both enjoy Sci-Fi. The Jaccard distance measures the proximity of two vectors A and B, both given over binary features (e.g. present vs. not present), by computing the ratio of (a) the difference between the union of A and B and the intersection of A and B to (b) the union of A and B; it ranges from zero to one and can be viewed as the proportion of features present in either vector that the two vectors do not share. The Jaccard similarity coefficient is given by one minus the Jaccard distance; we will use the {proxy} dist() function to compute these from our ratings matrix:
user1 <- c(5, NA, NA, 4, 3, NA, 1, NA, NA, NA)
user2 <- c(4, NA, 3, 3, 2, NA, NA, 1, 2, 4)
# - User1 and User2 have a high Pearson Correlation Coefficient:
cor(user1, user2, use = "pairwise.complete.obs")
user3 <- c(3, NA, 1, 5, 4, NA, NA, 2, 5, 4)
# - User2 and User3 have a low Pearson Correlation Coefficient:
cor(user2, user3, use = "pairwise.complete.obs")
# - User2 and User3 have a high Jaccard similarity
# - i.e. 1 minus the Jaccard distance between them:
library(proxy)
as.numeric(1 - dist(t(user2), t(user3), method = "jaccard"))
as.numeric(1 - dist(t(user1), t(user2), method = "jaccard"))
as.numeric(1 - dist(t(user1), t(user3), method = "jaccard"))
User2 and User3, who have provided ratings for exactly the same items, have a Jaccard similarity index of 1; they both have a Jaccard similarity index of .375 with User1. Then who is the user similar to User2 – the one whose ratings could be used to approximate the experiences of User2 with previously unseen items: (a) User1, with whom he has a memory-based Pearson correlation of 1 on three shared items, or (b) User3, with whom he has a Pearson correlation of only .085, but with whom he has shared the same experiences and thus achieved a maximum level of Jaccard similarity?
The value of the linear correlation coefficient (or cosine similarity, or any other similar metric) will not preserve the information on the number of shared items; that information is implicitly present in the Type I error test (for example), not in the extent of covariance itself. The Jaccard distance, on the other hand, neglects the covariances from the rating scales altogether, preserving only the information about the extent of the shared ratings. If we think about the items rated by a particular user, neglecting the particular rating values, we understand that we can describe some aspects of the user's taste by binarizing his or her ratings; we can know, for example, the frequency distribution of the item types that he or she has decided to assess at all. If we additionally compute a distance or similarity matrix over that information for all users, we express the content-based approach in terms of the attributes that describe a particular user by his or her distance or similarity from other users in terms of the very same attributes – in a content-based description space. We can also add the available demographic information to the binary vectors before computing the similarities, and thus enrich the representation. And we can do the same for items, of course. As we already have the similarity matrices from linear correlations between the paired users' ratings at our disposal, we can combine the information from these two approaches – the content-based and the memory-, or neighborhood-based – and try to predict the unobserved user-item ratings from a unified model.
The details of feature engineering along these lines in R are provided in the respective section below.
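To make the binarize-then-Jaccard idea fully concrete on the toy users from above, here is a minimal base-R sketch; jaccardDist is a hypothetical helper written for this example only, computing the distance directly from its definition:

```r
# Hypothetical helper: Jaccard distance between two users' rating vectors,
# computed from the definition after binarizing (rated vs. not rated):
# 1 - |intersection| / |union| of the rated-item sets.
jaccardDist <- function(a, b) {
  a <- !is.na(a); b <- !is.na(b)
  1 - sum(a & b) / sum(a | b)
}
user1 <- c(5, NA, NA, 4, 3, NA, 1, NA, NA, NA)
user2 <- c(4, NA, 3, 3, 2, NA, NA, 1, 2, 4)
user3 <- c(3, NA, 1, 5, 4, NA, NA, 2, 5, 4)
jaccardDist(user2, user3)      # 0: identical rating footprints
1 - jaccardDist(user1, user2)  # Jaccard similarity of .375
```

The particular rating values play no role here; only the pattern of which items were rated at all enters the computation.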
Dataset
The MovieLens 100K dataset [2] comprises 100,000 ratings (on a Likert-type 1-to-5 scale) from 943 users on 1682 movies; additionally, there are at least 20 movies rated per user. Some demographic information for the users is present – age, gender, occupation, zip code – as well as the genre and the release dates for movies. The dataset can be downloaded from the GroupLens research laboratory of the Department of Computer Science and Engineering at the University of Minnesota. Historically, it is among the most popular datasets used to test new approaches to the development of recommendation engines. MovieLens 100K can also be obtained from Kaggle and Datahub. Our OrdinalRecommenders_1.R script reads in the components of this dataset and computes the desired user-user and item-item distance and similarity matrices following some data reshaping.
Feature engineering
Part 1.A of the OrdinalRecommenders_1.R script is used merely to read the original MovieLens 100K dataset files, rename the columns present there, and save them as .csv files; you will need to run it too in order to use the code from Part 1.B and the remaining scripts. In Part 1.B we begin to operate over the following three data.frames: (1) ratingsData, which comprises the ratings on 5-point scales for all user-item pairs from the dataset, (2) usersData, which comprises demographic information on users (Gender, Age, Occupation, Zip code), and (3) moviesData, which comprises all item information (Title, Release Date, and Genre). Part 1.B of this R script performs the essential feature engineering operations in order to produce the following four matrices:
Matrix P1 – the Item-Item Proximity Matrix: the Jaccard distances between the items, computed from the binarized features encompassing release decade (derived from the original release year information, provided in the movie Title column) and movie genre. This matrix plays the role of the content-based representation of items. The dist2() function from {text2vec} is used on a sparse Matrix class to compute the Jaccard distances here:
moviesData <- moviesData %>% separate(col = Date,
                                      into = c('Day', 'Month', 'Year'),
                                      sep = "-")
moviesData$Day <- NULL
moviesData$Month <- NULL
moviesData$Year <- as.numeric(moviesData$Year)
range(moviesData$Year)
# - that would be: [1] 1922 1998
# - Introduce Movie Decade in place of Year:
decadeBoundary <- seq(1920, 2000, by = 10)
moviesData$Year <- sapply(moviesData$Year, function(x) {
  wL <- x < decadeBoundary
  wU <- x >= decadeBoundary
  if (sum(wL) == length(decadeBoundary)) {
    return(-1)
  } else if (sum(wU) == length(decadeBoundary)) {
    decadeBoundary[length(decadeBoundary)]
  } else {
    decadeBoundary[max(which(wL - wU == -1))]
  }
})
# - Match moviesData$Year with ratingsData:
mD <- moviesData %>%
  select(MovieID, Year)
ratingsData <- merge(ratingsData, mD,
                     by = 'MovieID')
# - Movie Year (now Decade) as binary:
moviesData <- moviesData %>%
  spread(key = Year,
         value = Year,
         fill = 0,
         sep = "_")
# - compute moviesDistance:
moviesDistance <- moviesData[, 3:ncol(moviesData)]
w <- which(moviesDistance > 0, arr.ind = T)
moviesDistance[w] <- 1
moviesDistance <- dist2(Matrix(as.matrix(moviesData[, 4:ncol(moviesData)])),
                        method = "jaccard")
moviesDistance <- as.matrix(moviesDistance)
rm(moviesData); gc()
# - save objects and clear:
numMovies <- length(unique(ratingsData$MovieID))
write_csv(as.data.frame(moviesDistance),
          path = paste0(getwd(), '/moviesDistance.csv'),
          append = F, col_names = T)
rm(moviesDistance); gc()
Matrix P2 – the User-User Proximity Matrix: the Jaccard distances between the users, computed from the binarized features representing only whether a particular user has or has not rated a particular item. This matrix plays the role of the content-based representation of users:
userItemMat <- matrix(rep(0, dim(usersData)[1]*numMovies),
                      nrow = dim(usersData)[1],
                      ncol = numMovies)
userItemMat[as.matrix(ratingsData[c('UserID', 'MovieID')])] <- 1
rm('w', 'ratingsData', 'usersData'); gc()
### - Compute userDistance w. Jaccard {text2vec}
userItemMat <- Matrix(userItemMat)
usersDistance <- dist2(userItemMat,
                       method = "jaccard")
rm(userItemMat); gc()
usersDistance <- as.matrix(usersDistance)
write_csv(as.data.frame(usersDistance),
          path = paste0(getwd(), '/usersDistance.csv'),
          append = F, col_names = T)
rm(usersDistance); gc()
Note that I have not used the user demographics available in MovieLens 100K. In this project, I will not use them at all, but the general rationale is: because I am going after an ordinal logistic regression model, I can use this information as regressors there. Also, as I have computed the user-user distance by looking at what movie ratings the users have in common, presumably capturing some information on shared tastes, I could have done the same for the items, by inspecting what movies have been rated by the same users. However, a hunch told me that I might end up using redundant feature representations in the same model; I have decided to describe the movie similarities from content-based features directly (genre and the decade of production), and the user similarities from content-based features that encompass the similarity in tastes (the common movie ratings among them).
The following two similarity matrices are those typically used in user- and item-based CF; they comprise Pearson correlation coefficients computed directly from the ratings matrix for 1681 items (note: one movie was 'unknown', so I have removed it from the dataset and cleaned up the ratings matrix accordingly) and 943 users. Given the moderate size of the desired correlation matrices, I have used the {base} cor() function on pairwise complete observations to compute them. For digging out correlation or cosine matrices from problems larger than MovieLens 100K, you might want to take a look at the functions provided by my colleague Stefan Nikolić in his "Recommender Systems: Matrix operations for fast calculation of similarities" (supporting the computations for CF in an order of magnitude less time than the popular {recommenderlab} R package). The code produces the remaining two matrices:
Matrix S1 – the User-User Similarity Matrix: the pairwise complete Pearson correlation coefficients between the users' ratings, taken from the ratings matrix directly. This matrix plays the role of the memory-based representation of users:
ratingsData <- read_csv("ratingsData_Model.csv",
                        col_names = T)
ratingsData$X1 <- NULL
# - User-Item Ratings Matrix
ratingsData$Timestamp <- NULL
ratingsData <- ratingsData %>%
  spread(key = MovieID,
         value = Rating,
         sep = "_") %>%
  arrange(UserID)
# - Pearson Correlations: User-User Sim Matrix
UserUserSim <- ratingsData %>%
  select(starts_with("Movie"))
UserUserSim <- t(UserUserSim)
UserUserSim <- cor(UserUserSim,
                   use = 'pairwise.complete.obs')
UserUserSim <- as.data.frame(UserUserSim)
write_csv(UserUserSim,
          path = paste0(getwd(), '/UserUserSim.csv'),
          append = F, col_names = T)
rm(UserUserSim); gc()
Matrix S2 – the Item-Item Similarity Matrix: the pairwise complete Pearson correlation coefficients between the items' ratings, taken from the ratings matrix directly. This matrix plays the role of the memory-based representation of items:
ItemItemSim <- ratingsData %>%
  select(starts_with("Movie"))
rm(ratingsData); gc()
ItemItemSim <- cor(ItemItemSim,
                   use = 'pairwise.complete.obs')
ItemItemSim <- as.data.frame(as.matrix(ItemItemSim))
write_csv(ItemItemSim,
          path = paste0(getwd(), '/ItemItemSim.csv'),
          append = F, col_names = T)
rm(ItemItemSim); gc()
Summary
The idea is to combine the content-based with
memory-based information in a unified model to solve the recommendation
problem for online recommender systems. I have used the Jaccard
distances to describe the content-based proximity between users, based
on what items they have rated or not; on the other hand, the Jaccard
distances were also used to describe the proximity of items in a
content-based space derived from the "essentialistic" item features like
movie genres and the decade of their production. Additionally, the
typical Pearson correlation similarity matrices – computed directly from
the ratings matrix – were produced for both users and items. The two
user-user and item-item matrices share the same respective
dimensionality. In the next step, the idea is to use neighborhoods
of different sizes from these matrices (say, take p similar users from the content-based proximity and the memory-based user-user similarity matrices, and q
similar items from the content-based proximity and the memory-based
item-item matrices) as regressors in an ordinal logistic model to
predict the ordinal 5-point ratings of novel user-item combinations. By
inspecting the regression coefficients from these predictive models
while varying the size of the neighborhood sets of users and items whose
ratings will be used as regressors, I hope to be able to provide some
conclusion on the relative importance of content-based and memory-based
information for building efficient recommendation engines. We'll see.
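The Jaccard distance between users can be computed in base R with dist(..., method = "binary"), which returns exactly 1 minus the Jaccard similarity of the binary rated/not-rated profiles. A minimal sketch on toy data (the real computation runs on the MovieLens rating indicators):

```r
# 1 = user rated the movie, 0 = did not (toy data)
rated <- matrix(c(1, 1, 0, 0, 1,
                  1, 0, 1, 0, 1,
                  0, 1, 1, 1, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("user", 1:3), paste0("movie", 1:5)))
# "binary" distance = 1 - |intersection| / |union| of the rated-item sets
userJaccard <- as.matrix(dist(rated, method = "binary"))
userJaccard["user1", "user2"]  # shared {1,5} of union {1,2,3,5} -> 0.5
```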
But before skipping to OrdinalRecommenders_3.R to play with the beautiful {ordinal} models, you might wish to run OrdinalRecommenders_2.R
that will prepare the first 100 neighbours from the proximity and
similarity matrices for you. As I told you, there’s a for loop there –
and a bad one.
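As a preview of the modeling step, here is a hedged sketch of fitting a proportional odds (ordinal logistic) model with MASS::polr, a close relative of the cumulative link models in {ordinal}. Everything here is simulated: the column names (user_nn1, item_nn1, and so on) merely stand in for the neighbour ratings that OrdinalRecommenders_2.R actually prepares.

```r
library(MASS)  # polr(): proportional odds logistic regression

set.seed(42)
n <- 500
# Simulated ratings from the p = 2 nearest users and q = 2 nearest items
nn <- data.frame(user_nn1 = sample(1:5, n, replace = TRUE),
                 user_nn2 = sample(1:5, n, replace = TRUE),
                 item_nn1 = sample(1:5, n, replace = TRUE),
                 item_nn2 = sample(1:5, n, replace = TRUE))
# Fake 5-point target driven mostly by the nearest user and nearest item
latent <- with(nn, 0.8 * user_nn1 + 0.5 * item_nn1 + rnorm(n))
nn$rating <- cut(latent, breaks = quantile(latent, probs = 0:5 / 5),
                 include.lowest = TRUE, labels = 1:5, ordered_result = TRUE)
fit <- polr(rating ~ user_nn1 + user_nn2 + item_nn1 + item_nn2,
            data = nn, Hess = TRUE)
coef(fit)  # relative weight of user-based vs item-based regressors
```

Inspecting coef(fit) while varying p and q is exactly the comparison of content-based and memory-based information described above.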
Goran S. Milovanović, PhD
Data Science Consultant, SmartCat
To leave a comment for the author, please follow the link and comment on their blog: The Exactness of Mind. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Modeling data with functional programming, Part I
(This article was first published on R – Cartesian Faith, and kindly contributed to Rbloggers)
The latest draft of my book is available. This will be my last prepublication update, as I’m in the process of editing the remainder of the book. That said, the first part is packed with examples and provides a solid foundation on its own. I’m including the preface below to whet people’s appetite for the complete book when it is published.
Preface
This book is about programming. Not just any programming, but programming for data science and numerical systems. This type of programming usually starts as a mathematical modeling problem that needs to be translated into computer code. With functional programming, the same reasoning used for the mathematical model can be used for the software model. This not only reduces the impedance mismatch between model and code, it also makes code easier to understand, maintain, change, and reuse. Functional programming is the conceptual force that binds these two models together. Most books that cover numerical and/or quantitative methods focus primarily on the mathematics of the model. Once this model is established the computational algorithms are presented without fanfare as imperative, step-by-step algorithms. These detailed steps are designed for machines. It is often difficult to reason about the algorithm in a way that can meaningfully leverage the properties of the mathematical model. This is a shame because mathematical models are often quite beautiful and elegant yet are transformed into ugly and cumbersome software. This unfortunate outcome is exacerbated by the real world, which is messy: data does not always behave as desired; data sources change; computational power is not always as great as we wish; reporting and validation workflows complicate model implementations. Most theoretical books ignore the practical aspects of working in the field. The need for a theoretical and structured bridge from the quantitative methods to programming has grown as data science and the computational sciences become more pervasive.
The goal is to reestablish the intimate relationship between mathematics and computer science. By the end of the book, readers will be able to write computer programs that clearly convey the underlying mathematical model, while being easy to modify and maintain. This is possible in R by leveraging the functional programming features built into the language. Functional programming has a long history within academia and its popularity in industry has recently been rising. Two prominent reasons for this upswing are the growing computational needs commensurate with large data sets and the datacentric computational model that is consistent with the analytical model. The superior modularity of functional programs isolates data management from model development, which keeps the model clean and reduces development overhead. Derived from a formal mathematical language, functional programs can be reasoned about with a level of clarity and certainty not possible in other programming paradigms. Hence the same level of rigor applied to the quantitative model can be applied to the software model.
Divided into three parts, foundations are first established that use the world of mathematics as an introduction to functional programming concepts (Chapter 2). Topics discussed include basic concepts in set theory, statistics, linear algebra, and calculus. As core tools for data scientists, this material should be accessible to most practitioners, graduate students, and even upper-class undergraduates. The idea is to show you that you already know most of the concepts used in functional programming. Writing code using functional programming extends this knowledge to operations beyond the mathematical model. This chapter also shows counterexamples using other programming styles, some in other languages. The point isn't necessarily to criticize other implementation approaches, but rather to make the differences in styles tangible to the reader. After establishing some initial familiarity, Chapter 3 dives into functions, detailing the various properties and behaviors they have. Some important features include higher-order functions, first-class functions, and closures. This chapter gives you a formal vocabulary for talking about functional programs in R. This distinction is important as other functional languages have plenty more terms and theory ignored in this book. Again, the goal is to leverage functional programming ideas to be a better data scientist. Finally, Chapter 4 reviews the various packages in R that provide functionality related to functional programming. These packages include some built-in implementations, paradigms for parallel computing, a subset of the so-called tidyverse, and my own lambda.r package, which offers a more comprehensive approach to writing functional programs.
While the first part provides a working overview of functional programming in R, Part II takes it a step further. This part is for readers who want to exploit functional programming principles to their fullest. Many topics in Part I reference deeper discussions in Part II. This canon begins by exploring the nature of vectorization in Chapter 5. Often taken for granted, colloquial "vectorization" conflates a few concepts. This chapter unravels those concepts, showing you what can be done and what to expect with each type of vectorization. Three primary higher-order functions, map, fold, and filter, follow. Chapter 6 shows how the concept of map appears throughout R, particularly in the apply family of functions. I show how to reason about data structures with respect to the ordinals (or indices) of the data structure. This approach can simplify code by enabling the separation of data structures, so long as an explicit ordinal mapping exists. While fold is more fundamental than map, its use is less frequent in R. Discussed in Chapter 7, fold provides a common structure for repeated function application. Optimization and many iterative methods, as well as stochastic systems, make heavy use of repeated function application, but this is usually implemented as a loop. With fold, the same function used to implement a single step can be used for multiple steps, reducing code complexity and making it easier to test. The final function of the canon is filter, which creates subsets using a predicate. This concept is so integral to R that the notation is deeply integrated into the language. Chapter 8 shows how the native R syntax is tied to this higher-order function, which can be useful when data structures are more complex. Understanding this connection also simplifies porting code to or from R when the other language doesn't have native syntax for these operations.
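For readers new to the vocabulary, the three canonical higher-order functions map, fold, and filter all ship with base R (this generic illustration is mine, not an excerpt from the book):

```r
xs <- 1:10
# map: apply a function elementwise
squares <- Map(function(x) x^2, xs)
# fold: repeatedly apply a binary function, accumulating a single result
total <- Reduce(`+`, xs)
# filter: keep the elements satisfying a predicate
evens <- Filter(function(x) x %% 2 == 0, xs)
# The native subsetting syntax xs[xs %% 2 == 0] is equivalent to the Filter call.
```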
Programming languages can’t do much without data structures. Chapter 9 shows how to use native R data structures to implement numerous algorithms as well as emulate other data structures. For example, lists can be used to emulate trees, while environments can emulate hash tables.
The last part focuses on applications and advanced topics. Part III can act as a reference for implementing various algorithms in a functional style. This provides readers with tangible examples of using functional programming concepts for algorithm development. This part begins by proposing a simple model development pipeline in Chapter 11. The intention is to provide some reusable structure that can be tailored to each reader's needs. For example, the process of backtesting is often implemented in reverse. Loops become integral to the implementation, which makes it difficult to test individual update steps in an algorithm. This chapter also shows some basic organizational tricks to simplify model development. A handful of machine learning models, such as random forest, are also presented in Chapter 11. Remarkably, many of these algorithms can be implemented in fewer than 30 lines of code when using functional programming techniques. Optimization methods, such as Newton's method, linear programming, and dynamic programming, follow in Chapter 13. These methods are usually iterative and therefore benefit from functional programming. State-based systems are explored in Chapter 12. Ranging from iterative function systems to context-free grammars, state is central to many models and simulations. This chapter shows how functional programming can simplify these models. This part also discusses two case studies (Chapter 14 and Chapter 15) for implementing more complete systems. In essence, Part III shows the reader how to apply the concepts presented in the book to real-world problems.
Each chapter presents a number of exercises to test what you’ve learned. By the end of the book, you will know how to leverage functional program ming to improve your life as a data scientist. Not only will you be able to quickly and easily implement mathematical ideas, you’ll be able to incrementally change your exploratory model into a production model, possibly scaling to multiple cores for big data. Doing so also facilitates repeatable research since others can review and modify your work more easily.
Rowe – Modeling data with functional programming (part i)
To leave a comment for the author, please follow the link and comment on their blog: R – Cartesian Faith.
Community workshops
(This article was first published on R – Locke Data, and kindly contributed to Rbloggers)
Following on from when we announced the availability of our community workshops, we’ve got three in the next three months that folks can attend.
May 19th – Data science project in a day
We'll be in Kiev, Ukraine, doing a whole data science project in a day. This is intended to give people a little bit of code, process, and critical thinking along the whole data science workflow. This will enable folks to see how it hangs together and decide where and how much they want to invest in their learning in future.
Tickets for this workshop are currently ~£60 so even if people flew over and stayed a couple of nights, it’d still be pretty darn cheap!
June 26th – Introduction to R
We'll be in Newcastle, UK, doing an introduction to R. This focuses on data analysis skills, working with databases, and writing reusable code. It's also split into two halves so folks can come along for only half the day.
Tickets for this workshop are £50 for a half day or £90 for the whole day, with discounts available to certain groups.
July 15th – Introduction to R
We'll be in Manchester, UK, doing an introduction to R. This focuses on data analysis skills, working with databases, and writing reusable code.
Tickets for this workshop are £125 before the 12th of June.
Community workshops
If you'd like us to do a community workshop for your group, get in touch. You can also catch up with us at one of the upcoming events we're presenting at.
The post Community workshops appeared first on Locke Data. Locke Data are a data science consultancy aimed at helping organisations get ready and get started with data science.
To leave a comment for the author, please follow the link and comment on their blog: R – Locke Data.
R is for Archaeology: A report on the 2017 Society of American Archaeology meeting
(This article was first published on Revolutions, and kindly contributed to Rbloggers)
by Ben Marwick, Associate Professor of Archaeology, University of Washington and Senior Research Scientist, University of Wollongong
The Society of American Archaeology (SAA) is one of the largest professional organisations for archaeologists in the world, and just concluded its annual meeting in Vancouver, BC at the end of March. The R language has been a part of this meeting for more than a decade, with occasional citations of R Core in the posters, and more recently, the distinctive ggplot2 graphics appearing infrequently on posters and slides. However, among the few archaeologists that have heard of R, it has a reputation for being difficult to learn and use, idiosyncratic, and only suitable for highly specialized analyses. Generally, archaeology students are raised on Excel and SPSS. This year, a few of us thought it was time to administer some first aid to R's reputation among archaeologists and generally broaden awareness of this wonderful tool. We developed a plan for this year's SAA meeting to show our colleagues that R is not too hard to learn, it is useful for almost anything that involves numbers, and it has lots of fun and cool people that use it to get their research done quicker and easier.
Our plan had three main elements. The first element was the début of two new SAA Interest Groups. The Open Science Interest Group (OSIG) was directly inspired by Andrew MacDonald's work founding the ESA Open Science section, with the OSIG being approved by the SAA Board this year. It aims to promote the use of preprints (e.g. SocArXiv), open data (e.g. tDAR, Open Context), and open methods (e.g. R and GitHub). The OSIG recently released a manifesto describing these aims in more detail. At this SAA meeting we also saw the first appearance of the Quantitative Archaeology Interest Group, which has a strong focus on supporting the use of R for archaeological research. The appearance of these two groups shows the rest of the archaeological community that there is now a substantial group of R users among academic and professional archaeologists, and they are keen to get organised so they can more effectively help others who are learning R. Some of us in these interest groups were also participants in fora and discussants in sessions throughout the conference, and so had opportunities to tell our colleagues, for example, that it would be ideal if R scripts were available for certain interesting new analytical methods, or that R code should be submitted when manuscripts are submitted for publication.
The second element of our plan was a normal conference session titled 'Archaeological Science Using R'. This was a two hour session of nine presentations by academic and professional archaeologists that were live code demonstrations of innovative uses of R to solve archaeological research problems. We collected R markdown files and data files from the presenters before the conference, and tested them extensively to ensure they'd work perfectly during the presentations. We also made a few editorial changes to speed things up a bit, for example using readr::read_csv instead of read.csv. We were told in advance by the conference organisers that we couldn't count on good internet access, so we also had to ensure that the code demos worked offline. On the day, the live-coding presentations went very well, with no one crashing and burning, and some presenters even doing some off-script code improvisation to answer questions from the audience. At the start of the session we announced the release of our online book containing the full text of all contributions, including code, data and narrative text, which is online here. We could only do this thanks to the bookdown package, which allowed us to quickly combine the R markdown files into a single, easily readable website. I think this might be a new record for the time from an SAA conference session to a public release of an edited volume. The online book also uses Matthew Salganik's Open Review Toolkit to collect feedback while we're preparing this for publication as an edited volume by Springer (go ahead and leave us some feedback!). There was a lot of enthusiastic chatter later in the conference about a weird new kind of session where people were demoing R code instead of showing slides. We took this as an indicator of success, and received several requests for it to be a recurring event in future meetings.
The third element of our plan was a three hour training workshop during the conference to introduce archaeologists to R for data analysis and visualization. Using pedagogical techniques from Software Carpentry (i.e. sticky notes, live coding and lots of exercises), Matt Harris and I got people using RStudio (and discovering the miracle of tab-complete) and modern R packages such as readxl, dplyr, tidyr, ggplot2. At the end of three hours we found that our room wasn't booked for anything, so the students requested a further hour of Q&A, which led to demonstrations of knitr, plotly, mapview, sf, some more advanced ggplot2, and a little git. Despite being located in the Vancouver Hilton, this was another low-bandwidth situation (which we were warned about in advance), so we loaded all the packages to the students' computers from USB sticks. In this case we downloaded package binaries for both Windows and OSX, put them on the USB sticks before the workshop, and had the students run a little bit of R code that used install.packages() to install the binaries to the .libPaths() location (for Windows) or untar'd the binaries to that location (for OSX). That worked perfectly, and seemed to be a very quick and lightweight method to get packages and their dependencies to all our students without using the internet. Getting the students started by running this bit of code was also a nice way to orient them to the RStudio layout, since they were seeing that for the first time.
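For anyone wanting to replicate the USB-stick approach, a rough sketch is below. The function names and file layout are my assumptions, not the exact workshop script; the key idea is install.packages(..., repos = NULL) for the Windows .zip binaries and untar() into the library path for the OSX .tgz binaries.

```r
# Decide how a binary package file should be installed (assumed naming scheme)
pkg_type <- function(f) {
  if (grepl("\\.zip$", f)) "win.binary" else "mac.binary"
}

# Install every binary package found in pkg_dir (e.g. the USB stick)
install_from_usb <- function(pkg_dir, lib = .libPaths()[1]) {
  files <- list.files(pkg_dir, pattern = "\\.(zip|tgz)$", full.names = TRUE)
  for (f in files) {
    if (pkg_type(f) == "win.binary") {
      install.packages(f, repos = NULL, type = "win.binary", lib = lib)
    } else {
      untar(f, exdir = lib)  # OSX binaries are plain gzipped tarballs
    }
  }
  invisible(files)
}
```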
This workshop was a first for the SAA, and was a huge success. Much of this is due to our sponsors who helped us pay for the venue hire (which was surprisingly expensive!). We got some major support from Microsoft Data Science User Group program (ed note: email msdsug@microsoft.com for info about the program) and Open Context, as well as cool stickers and swag for the students from RStudio, rOpenSci, and the Centre for Open Science. We used the stickers like tiny certificates of accomplishment, for example when our students produced their first plot, we handed out the ggplot2 stickers as a little reward.
Given the positive reception of our workshop, forum and interest groups, our feeling is that archaeologists are generally receptive to new tools for working with data, perhaps more so now than in the past (i.e. pre-tidyverse). Younger researchers seem especially motivated to learn R because they may have heard of it, but not had a chance to learn it because their degree program doesn't offer it. If you are a researcher in a field where R (or any programming language) is only rarely used by your colleagues, now might be a good time to organise a rehabilitation of R's reputation in your field. Our strategy of interest groups, code demos in a conference session, and a short training workshop during the meeting is one that we would recommend, and we imagine will transfer easily to many other disciplines. We're happy to share more details with anyone who wants to try!
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Data Science for Operational Excellence Exercises (Part2)
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
Network problems are everywhere. We can easily find instances in logistics, telecom, and project management, among others. In order to attack these problems using linear programming, we need to go beyond the assignment and transportation problems that we saw in part I. Our goal here is to expand the set of problems we can solve using the lpSolve and igraph R packages. For that, we will formulate the transportation problem using the more generic function lp.
Answers to the exercises are available here. If you obtained a different (correct) answer than those
listed on the solutions page, please feel free to post your answer as a comment on that page.
Please install and load the packages lpSolve and igraph before starting the exercises.
Exercise 1
Load the libraries lpSolve and igraph and learn how to use the lp function by replicating the example in ?lp.
Exercise 2
We want to rewrite the transport problem stated in part I, but this time using lp function. First, load the data used in part I exercise.
Exercise 3
Run this transportation problem one more time using lp.transport. Check the objective function and decision variables values.
Exercise 4
Rearrange the data into the format required for the objective function used by the lp function.
Exercise 5
Construct the binary matrix to model the DEMAND constraints. Observe that the variables should be in the same order defined in the objective function.
Exercise 6
Forget about the OFFER constraints for a while. Run the lp function only with the DEMAND constraints and see what happens.
Exercise 7
Construct the binary matrix to model the OFFER constraints. Observe that the variables should be in the same order defined in the objective function.
Exercise 8
Bind both the offer and demand constraints to construct the constraint matrix f.con and the vectors f.dir and f.rhs.
Exercise 9
Now, solve the problem using the lp function. Find the solution for each variable and the objective function value. Check if it matches the values from lp.transport.
Exercise 10
Rename the rows to represent factories and the columns to represent depots. Create a graph using graph_from_incidence_matrix based on the decision variables' optimal values.
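As a reference point for the matrix-building exercises (4 through 8), here is a toy 2-factory by 3-depot set-up with made-up costs; the real exercises use the data from part I, so treat this only as an illustration of the shapes involved.

```r
# Costs: rows = factories, columns = depots (made-up numbers)
costs <- matrix(c(4, 6, 9,
                  5, 3, 7), nrow = 2, byrow = TRUE)
# Objective: one cost per decision variable, ordered depot-within-factory
f.obj <- as.vector(t(costs))
# Demand constraints: one row per depot, selecting that depot's variable
# from every factory (same variable order as the objective)
dem <- t(sapply(1:3, function(j) as.vector(t((col(costs) == j) * 1))))
# Offer constraints: one row per factory, summing that factory's variables
off <- t(sapply(1:2, function(i) as.vector(t((row(costs) == i) * 1))))
f.con <- rbind(dem, off)
f.dir <- c(rep(">=", 3), rep("<=", 2))
f.rhs <- c(40, 50, 30, 70, 60)  # depot demands, then factory capacities
# lpSolve::lp("min", f.obj, f.con, f.dir, f.rhs) would then solve it
```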
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
Modified Age Bias Plot
(This article was first published on fishR Blog, and kindly contributed to Rbloggers)
Original Age Bias Plot
Campana et al. (1995) introduced the "age bias plot" to visually assess potential differences in paired age estimates (e.g., between two structures such as scales and otoliths, between two readers, or between one reader at two times). One set of age estimates serves as "reference" ages in the age bias plot. The reference ages are usually the age estimates thought to be most accurate (i.e., the most accurate structure, the "best" reader), but could be from the first reading if two readings from the same reader are made. The reference estimates form the x-axis on the age bias plot. The mean and 95% confidence interval for the nonreference age estimates computed at each value of the reference age estimates are plotted to form the age bias plot. A 1:1 line that represents agreement between the two age estimates is usually included. Confidence intervals that do not capture this "agreement line" suggest a difference in the two age estimates at that reference age. An example age bias plot is below.
The age bias plot above was constructed with the code below. Briefly, the foundational calculations for the age bias plot are constructed with ageBias(), where the variable with the reference age estimates follows the tilde. The age bias plot is then constructed by submitting the ageBias() results to plot(). The col.CIsig=, pch.mean.sig=, sfrac=, lty.agree=, and col.agree= arguments are used here to modify default settings for these arguments so that the resultant age bias plot most closely resembles that described by Campana et al. (1995). These arguments are discussed further in the next section.
library(FSA)       # provides ageBias(), plot(), and the WhitefishLC data
data(WhitefishLC)  # example data
ab1 <- ageBias(scaleC~otolithC,data=WhitefishLC,
               nref.lab="Scale Age",ref.lab="Otolith Age")
plot(ab1,col.CIsig="black",pch.mean.sig=19,sfrac=0.01,
     lty.agree=1,col.agree="black")
Making a "Cleaner" Age Bias Plot
I found the age bias plot to be useful for detecting systematic differences in estimated ages between two sets of readings, but I also found the plot to be "clunky" with some data sets. Thus, I modified the original design in several ways. The first set of changes were simply to make the plot "cleaner" and easier to interpret. Specific changes I made are described below.
 Used a 1-sample t-test to determine if the mean nonreference age (i.e., y-axis) differed significantly from the reference age for each reference age (i.e., x-axis). In other words, for example, does the mean nonreference age at a reference age of 3 differ from 3? This test was repeated for each reference age and the resultant p-values were corrected for multiple comparisons. Reference ages for which a significant difference was detected were plotted by default with an open symbol and a different color. The symbol and color for the ages where a significant difference was detected are controlled by pch.mean.sig= and col.CIsig=, respectively.
 Confidence intervals for reference ages with small sample sizes can be very wide, which can cause poor scaling of the yaxis on the age bias plot. The min.n.CI= argument in ageBias() sets a sample size threshold for when confidence intervals are constructed. The default for this argument is 3 (i.e., a confidence interval will be constructed if the sample size is at least 3).
 Made the agreement line a lighter gray and dashed so that it can be seen but is less bold. The type and color of the agreement line are controlled by lty.agree= and col.agree=, respectively.
 Removed the “caps” on the ends of the confidence intervals to reduce clutter. The length of the confidence interval ends are controlled by sfrac=.
These modifications are the default settings in ageBias() and plot().
plot(ab1)
Plotting Differences on the Y-Axis
Muir et al. (2008) were the first (to my knowledge) to modify the age bias plot by using the difference between the reference ages and the mean nonreference ages on the y-axis. I modified this concept by first computing the difference between the nonreference and reference ages (nonreference minus reference) for each individual and then computing the mean of those differences for each reference age. With this modification, the mean difference between nonreference and reference ages is plotted against the reference ages. The "agreement line" is now a horizontal line at 0 on this plot. This modified age bias plot is constructed by including difference=TRUE in plot().
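The difference-based summary described here boils down to a simple computation; below is a base-R sketch on simulated ages (toy data, not the FSA internals).

```r
set.seed(1)
ref  <- rep(1:5, each = 20)                              # hypothetical reference ages
nonr <- ref + sample(-1:1, length(ref), replace = TRUE)  # hypothetical nonreference ages
d    <- nonr - ref                                       # nonreference minus reference
# For each reference age: n, mean difference, and a 95% confidence interval
bias <- do.call(rbind, lapply(split(d, ref), function(x) {
  half <- qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
  data.frame(n = length(x), mean.diff = mean(x),
             lwr = mean(x) - half, upr = mean(x) + half)
}))
bias  # intervals not covering 0 suggest bias at that reference age
```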
plot(ab1,difference=TRUE)
Showing Individual Variability
I often wanted to have a feel for the individual variability underlying the age bias plot. A faint gray line that stretches from the minimum to the maximum nonreference age for each reference age is plotted behind each confidence interval by including show.range=TRUE.
plot(ab1,difference=TRUE,show.range=TRUE)
Alternatively, individual points (for paired age estimates) are included with show.pts=TRUE. There tends to be considerable overplotting of individual points because of the discrete nature of age data. To make individual points more obvious, a transparent color can be used for each point such that more overlapping points will appear darker. The level of transparency is controlled by including an integer in transparency=, with values further from 1 being more transparent.
plot(ab1,difference=TRUE,show.pts=TRUE)
Illustrating Sample Size
I also often want to know the sample sizes underlying the age bias plot. In particular, it is useful to know the number of nonreference age estimates that contributed to each mean and confidence interval. My first attempt at providing this information on the plot was to simply print the values on the plot (usually above the plot or just above the x-axis). The sample sizes can be added to the plot with show.n=TRUE.
plot(ab1,difference=TRUE,show.n=TRUE)
These values, however, are often either so crowded or so small as to be of little utility. Recently I added the ability to add marginal histograms to the age bias plot. For example, a histogram of the sample sizes in the previous plot can be added by using xHist=TRUE.
plot(ab1,difference=TRUE,xHist=TRUE)
The same can be added for the nonreference ages with yHist=TRUE. This plot has the added advantage of showing the distribution of differences in age estimates, with the bar at a difference of zero representing the amount of perfect agreement between the sets of age estimates.
plot(ab1,difference=TRUE,xHist=TRUE,yHist=TRUE)
The same plot, but with a marginal histogram of the nonreference age estimates rather than the difference in age estimates, is also nice for showing the age distributions for both sets of age estimates.
plot(ab1,xHist=TRUE,yHist=TRUE)
My Preference
My current preference for an age bias plot is to use the differences in age estimates, to plot both marginal histograms (I like to see the distributions for the reference age estimates and the difference in age estimates), to show the individual points, and to remove the coloring for significantly different ages (though I like the difference in symbols). An example of my preferred plot is shown below for ages estimated from fin rays for two readers.
abOO <- ageBias(finray2~finray1,data=WhitefishLC,
                nref.lab="Reader 2 Age",ref.lab="Reader 1 Age")
plot(abOO,difference=TRUE,xHist=TRUE,yHist=TRUE,
     col.CIsig="black",show.pts=TRUE)
A Couple of Details
The modifications of the age bias plot described here are available in the development version of the FSA package, but not yet in the stable CRAN version. They will appear in v0.8.13.
Several of the options illustrated above can be modified with other arguments to ageBias(). See the documentation for further details (i.e., use ?ageBias).
The plots in this post used the following modifications of the default base graphing parameters. These changes make narrower margins around the plot, move the axis labels closer to the axes, and reduce the size of the axis tick marks relative to the default values.
par(mar=c(3.5,3.5,1,1),mgp=c(2.1,0.4,0),tcl=-0.2)
Final Thoughts
Please let me know what you think about my preferred age bias plot. In general, I like it and feel that it is more informative than other age bias plots. However, I am not quite satisfied with how I separated the “axes” labels for the main age bias plot and the marginal histograms. It looks “hacky” to me.
In addition, I thought about putting a horizontal line on the top marginal histogram that shows the cutoff for when a confidence interval is calculated. I also thought about highlighting the “zero bar” in the right marginal histogram when difference=TRUE to further highlight the amount of agreement between the age estimates. In the end, I did not make these modifications because they would seem to add clutter or draw too much attention.
Let me know if you have other ideas for how these age bias plots could be modified to be more informative.
References
Campana, S.E., M.C. Annand, and J.I. McMillan. 1995. Graphical and statistical methods for determining the consistency of age determinations. Transactions of the American Fisheries Society 124:131-138.

Muir, A.M., M.P. Ebener, J.X. He, and J.E. Johnson. 2008. A comparison of the scale and otolith methods of age estimation for lake whitefish in Lake Huron. North American Journal of Fisheries Management 28:625-635.

Ogle, D.H. 2015. Introductory Fisheries Analyses with R. CRC Press.
To leave a comment for the author, please follow the link and comment on their blog: fishR Blog. Rbloggers.com offers daily email updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
RcppArmadillo 0.7.800.2.0
(This article was first published on Thinking inside the box, and kindly contributed to R-bloggers)
A new RcppArmadillo version 0.7.800.2.0 is now on CRAN.
Armadillo is a powerful and expressive C++ template library for linear algebra, aiming towards a good balance between speed and ease of use with a syntax deliberately close to Matlab. RcppArmadillo integrates this library with the R environment and language, and is widely used by (currently) 318 other packages on CRAN, an increase of 20 just since the last CRAN release of 0.7.600.1.0 in December!
Changes in this release relative to the previous CRAN release are as follows:
Changes in RcppArmadillo version 0.7.800.2.0 (2017-04-12)

- Upgraded to Armadillo release 7.800.2 (Rogue State Redux)
  - The Armadillo license changed to Apache License 2.0
- The DESCRIPTION file now mentions the Apache License 2.0, as well as the former MPL-2 license used for earlier releases.
- A new file init.c was added with calls to R_registerRoutines() and R_useDynamicSymbols()
- Symbol registration is enabled in useDynLib
- The fastLm example was updated
Courtesy of CRANberries, there is a diffstat report. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.
This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive reaggregation in third-party for-profit settings.
To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.
March ’17 New Package Picks
(This article was first published on R Views, and kindly contributed to Rbloggers)
Two hundred and sixteen new packages were added to CRAN in March. The following are my picks for the Top Forty, organized into five categories: Bioscience, Data, Data Science, Statistics and Utilities.
### Bioscience
BioInstaller v0.0.3: Provides tools to install and download bioinformatics analysis software and databases, such as NGS analysis tools with their required databases and/or reference genomes.

DSAIDE v0.4.0: Provides a collection of Shiny Apps that allow users to simulate and explore infectious disease transmission. The vignette will get you started.

treespace v1.0.0: Provides tools for the exploration of distributions of phylogenetic trees. This package includes a Shiny interface which can be started from R. There are vignettes for Dengue trees, Transmission trees, and for exploring landscapes of phylogenetic trees. The following plot from the vignette shows clusters of similar trees.
[Interactive scatterD3 plot omitted: clusters of similar trees plotted on Axis 1 vs. Axis 2, colored by group.]
### Data
* cropdatape v1.0.0: Provides access to data from the Agricultural Ministry of Peru.

rdataretriever v1.0.0: Provides R access to the Data Retriever command line interface that automates the tasks of finding, downloading, and cleaning public datasets.

rnaturalearth v0.1.0: Provides functions to fetch Natural Earth map data. There is an introduction and a vignette on working with countries.

ropercenter v0.1.0: Provides registered users with programmatic, reproducible access to data sets from The Roper Center for Public Opinion Research, which maintains the largest archive of public opinion data in existence. The vignette shows how to get started.

statsgrokse v0.1.4: Provides an API for the server containing Wikipedia Page View Statistics for the years 2008 to 2015.

ukds v0.1.0: Enables reproducible, programmatic retrieval of datasets from the UK Data Service. The vignette shows how to setup and use it.

### Data Science
anomalyDetection v0.1.1: Implements procedures to aid in detecting network log anomalies. The vignette provides examples.

kerasR v0.4.1: Provides an interface to the Keras Deep Learning Library, which provides specifications for describing dense neural networks, convolution neural networks (CNN), and recurrent neural networks (RNN) running on top of either TensorFlow or Theano. The vignette contains examples.

modeval v0.1.2: Allows users to easily compare multiple classification models built with caret functions for small data sets. The vignette provides examples.

supc v0.1: Implements the self-updating process clustering algorithms proposed in Shiu and Chen. The vignette contains examples.

tensorflow v0.7: Provides an interface to TensorFlow, an open-source software library for numerical computation using data flow graphs.

### Statistics
frailtyEM v0.5.4: Contains functions for fitting shared frailty models with a semiparametric baseline hazard using the Expectation-Maximization algorithm. The vignette explains the math.

FRK v0.1.1: Provides functions to build, fit and predict spatial random effects (fixed rank kriging) models with large datasets. The vignette introduces the theory and shows some examples.

hmi v0.63: Allows users to build single-level and multilevel imputation models using functions provided, or functions from the mice and MCMCglmm packages. There is a vignette.

margins v0.3.0: Ports Stata’s margins command for calculating marginal (or partial) effects. There is an introduction, a vignette on the Technical Implementation Details, and a comparison with the Stata Command.

mlrMBO v1.0.0: Provides a toolbox for Bayesian, model-based optimization. There is an introduction and vignettes for Mixed Space Optimization, Noisy Optimization, and Parallelization.

MonteCarlo v1.0.0: Simplifies Monte Carlo simulation studies by providing functions that automatically set up loops to run over parameter grids, parallelize the computations, and generate output in LaTeX tables. The vignette shows how to use it.

RankingProject v0.1.1: Provides functions to generate plots and tables for comparing independently sampled populations. There is an introduction and a vignette that reproduces the figures from “A Primer on Visualizations for Comparing Populations, Including the Issue of Overlapping Confidence Intervals” by Wright, Klein, and Wieczorek (2017, The American Statistician, in press).

rjmcmc v0.2.2: Provides functions to perform reversible-jump MCMC with the restriction introduced by Barker &amp; Link. The vignette shows how to calculate posterior probabilities from the MCMC output.

### Utilities
canvasXpress v1.5.2: Enables creation of visualizations using CanvasXpress, a standalone JavaScript library for reproducible research. The vignette shows how to get started.

collapsibleTree v0.1.4: Provides functions to build interactive Reingold-Tilford tree diagrams created using D3.js, where every node can be expanded and collapsed by clicking on it. There are some examples on the GitHub site.

cronR v0.3.0: Allows users to schedule R scripts and processes on Unix/Linux systems, while taskscheduleR allows the same from Windows. See the cronR vignette and the taskscheduleR vignette.

doctr v0.2.0: Provides tools to help check the consistency and quality of a dataset. There are vignettes for Comparing Tables, Dataset Diagnostics, and EDA Automation.

geojsonR v1.0.1: Provides functions to process GeoJson objects. The vignette gives the details.

ggedit v0.2.1: Provides an interactive layer and theme aesthetic editor for ggplot2. The ggedit gitbook explains how to use it.

implyr v0.1.0: Provides a SQL backend to dplyr for Apache Impala.

officer v0.1.0: Provides functions to manipulate Microsoft Word and Microsoft PowerPoint documents. This is a companion package to flextable.

flextable v.1.0: Includes functions to create tables in Word, PowerPoint, and HTML documents. There is an overview, and vignettes for explaining objects and table layout.

pivottabler v0.1.0: Allows users to create complex pivot tables and pivot tables with irregular layouts in R. There are multiple vignettes, including an introduction, a styling guide, a Shiny example and explanations of calculations and data groups.

Polychrome v0.8.2: Provides tools for creating, viewing, and assessing qualitative palettes with many (20-30 or more) colors. There is a colorful vignette.

RApiDatetime v0.0.3: Provides a C-level API to allow packages to access C-level R date and datetime code.

Rcssplot v0.2.0.0: Provides tools to style plots with cascading style sheets. The vignette shows how.

reticulate v0.7: Implements an R interface to Python modules, classes, and functions. When calling into Python, R data types are automatically converted to their equivalent Python types. When values are returned from Python to R, they are converted back to R types. There is an overview and a vignette describing arrays in R and Python.

shinyWidgets v0.2.0: Provides custom input widgets for Shiny apps. See the README for examples.

shinyjqui v0.1.0: An extension to shiny that brings interactions and animation effects from the jQuery UI library. There is an introduction and a vignette with examples.

valaddin v0.1.0: Provides tools to transform functions into functions with input validation checks, in a manner suitable for both programmatic and interactive use. The vignette shows how.
To leave a comment for the author, please follow the link and comment on their blog: R Views.
Choosing R packages for mixed effects modelling based on the car you drive
There are many roads you can take to fit a mixed effects model
(sometimes termed a hierarchical model) in R. There are numerous
packages that each deploy different engines to fit mixed effects models.
In this driveability review, I look at some of the dominant packages for
mixed effects models. The focus of this review is on the practicality of
each package, rather than their numerical accuracy. Information on
numerical accuracy can be found in the academic literature; see the
bibliography below. I will look at each package’s handling, speed and
adaptability for tackling different types of questions.
Summary of the packages’ strengths and weaknesses. Speeds are relative
to lme4 in a test case problem; see below for details. (Table notes:
but see below for some add-on packages which make life easier for Stan
and JAGS; “lots of equations”; “in my experience”. But, heck, these
packages are all free and all give better user support than many paid
services!)
The first is
lme4,
meaning linear mixed effects models with S4 classes. lme4 is like an
older model sportscar – fast, respectable, well known and able to
handle common types of questions. lme4 uses maximum likelihood methods
for estimation, so interpret its statistics as a
frequentist.
The next three packages are Bayesian techniques.
The second package is
rjags,
which is an interface to the “Just Another Gibbs
Sampler” program. rjags is the
Land Rover of packages for mixed effects models: well tested, able to
tackle just about any terrain, but slow and prone to breaking down
(usually a user error, rather than the package itself).
My take on reviewing R packages for Generalised Linear Mixed Effects. Clockwise from top left: The `lme4` package is fast but a bit older, like an early noughties Ferrari; the JAGS program goes anywhere like a Range Rover, but breaks often and is slow; the Stan language is modern and efficient like a hybrid BMW and INLA is fast and modern but requires technical expertise like a Tesla.
The third package is INLA, standing for
“Integrated Nested Laplace Approximation”. INLA is like an electric
sports car: it is a modern technique that is lightning fast, but there
has been limited testing to date and you better understand what is under
the hood before you attempt to drive it.
The final package is
RStan,
which is the R interface to the Stan modelling
language. RStan is like BMW’s hybrid,
combining computational speed with flexibility and with exceptional
documentation and support.
Now, let’s look at each package’s computational speed, handling (ease of
use), driver support (documentation, existing knowledge and forums) and
flexibility.
Skip to the end for a bibliography of examples and some cool packages
that build on these tools.
The frequentist and Bayesian approaches I review here come from
fundamentally different statistical
philosophies.
Here I will tackle the approaches pragmatically, looking at estimation
speed and so on; however, you might exclude one or the other a priori
based on your bent towards Bayesian or frequentist thinking.
I simulated a simple hierarchical dataset to test each of the models.
The script is available here. The
dataset has 100 binary measurements. There is one fixed covariate
(continuous) and one random effect with five groups. The linear
predictor was transformed to binomial probabilities using the logit
function. For the Bayesian approaches, slightly different priors were
used for each package, depending on what was available. See the script
for more details on priors.
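As a rough sketch (not the author’s linked script; all parameter values here are invented for illustration), such a dataset can be simulated along these lines:

```r
# Simulate 100 binary measurements with one continuous fixed covariate
# and one random effect with five groups, linked through a logit function.
set.seed(42)
n <- 100
ngroups <- 5
group <- factor(rep(seq_len(ngroups), each = n / ngroups))
x <- rnorm(n)                          # continuous fixed covariate
grp_eff <- rnorm(ngroups, sd = 1)      # random group intercepts
eta <- 0.5 + 1.2 * x + grp_eff[group]  # linear predictor (made-up coefficients)
p <- plogis(eta)                       # inverse-logit to binomial probabilities
y <- rbinom(n, size = 1, prob = p)     # binary response
dat <- data.frame(y, x, group)
```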
For each package I fit the same linear model with binomial errors and a
logit link. To fit the models I used the functions glmer() from
lme4, inla() from INLA, custom code for rjags, and
stan_glmer() from the package
rstanarm
to fit the Stan model (you can also write your own custom code for
RStan, but for convenience I used the canned version of mixed effects
models).
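For the lme4 case, the call looks roughly like this (a sketch assuming a data frame with binary response `y`, covariate `x`, and grouping factor `group`, simulated inline here; the commented lines show the analogous rstanarm call):

```r
library(lme4)

# toy data standing in for the simulated dataset described above
set.seed(1)
dat <- data.frame(x = rnorm(100), group = factor(rep(1:5, each = 20)))
dat$y <- rbinom(100, 1, plogis(0.5 * dat$x + rnorm(5)[dat$group]))

# binomial GLMM with a logit link: one fixed slope, one random intercept
fit <- glmer(y ~ x + (1 | group), data = dat,
             family = binomial(link = "logit"))
summary(fit)

# the rstanarm equivalent uses the same formula interface:
# library(rstanarm)
# fit_stan <- stan_glmer(y ~ x + (1 | group), data = dat, family = binomial)
```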
Computational speeds for fitting a binomial Generalised Linear Model to
100 samples with one random and one fixed effect. Times are relative
to lme4.
So lme4 is by far the fastest, clocking in at around 6 tenths of a
second (on my 3GHz Intel i7 Macbook Pro). INLA comes in second, ten
times slower than lme4, followed an order of magnitude later by
RStan; finally, rjags came plodding along behind, about 100 times
slower than lme4.
If you run the code I have provided you will see that all packages come
up with similar estimates for the model parameters.
lme4 leads the pack when it comes to ease of use. If you have your
data sorted, then you can go from zero to fitted model in one line of
code. INLA is similar, however you should generally not rely on
default priors but specify your own, which requires additional code and
thinking.
Use of RStan and rjags depends on how you use them. You can write
your own model code, which can lead to quite lengthy scripts, but also
gives you a greater number of choices for how you design your model.
However, there are also many packages that provide wrappers for RStan
and rjags that will fit common types of models (like mixed effects
models). These tend to be as simple as INLA’s functions, remembering
that you should think carefully about prior
choice.
If you are going to write your own models, you will need to get good at
debugging, so there is an extra overhead there.
Model checking is another important consideration.
For users familiar with GLMs, lme4 probably provides the most familiar
support for model checking. INLA can be more difficult to use, because
it requires the user to hand-code many model checking routines. INLA,
rjags and RStan also require the user to carefully check that the
fitting algorithms have performed properly.
Each of these packages use different types of algorithms, so the
statistics you will use to check algorithm performance are different.
See each package’s documentation for further advice.
lme4 and RStan have the highest quality user manuals. Both are well
written and provide numerous examples that users will find helpful.
The user manual for rjags and JAGS is somewhat brief, but the user can
easily find many helpful examples on the web, and these packages are
covered by numerous books. So if you are willing to broaden your search
you should have no trouble finding help.
RStan and INLA are both relatively new, so fewer examples can be
found on the web. While RStan has an excellent manual for both the
mathsy and non-mathsy types, documentation of INLA can be difficult to
follow for novice users, because much of it is equations.
Web support and chat groups are also an important aspect of model
development. Help for all packages can be found on sites like
Stackoverflow, or you
can post your own new questions there.
Personally, I have found that support for INLA, through their
forum,
is exceptional. On several occasions I have posted questions for which I
could not find documentation and have generally received a response from
INLA’s developers before the end of the working day. This is amazing,
because it is better than you would expect from many paid services.
I have not participated in forums for the other packages, so I can’t add
comments here.
One final note on support is that lme4 and JAGS are both pretty well
studied approaches to modelling and there are numerous academic papers
dealing with their biases and accuracy. INLA and RStan are
relatively new on the scene, so for new and different types of problems
you might want to run your own simulation study to check that your model
is performing as expected.
Your imagination is the limit with rjags – take it on well driven
routes, like a mixed effects model, or off-road on new adventures with
Bayesian structural equation modelling – it can do it all.
rjags is followed closely by RStan. RStan has a few limitations,
but can basically do anything JAGS can do, and often faster (e.g. this
ecological
study).
RStan will probably eventually replace rjags, but to date rjags
persists because of the extensive range of well documented examples
users can build on.
INLA comes in third with a diverse range of built-in likelihoods and
model functions (you can even ask the developers to add more and they
might do it!). INLA is becoming particularly popular for modelling
spatial and temporal autocorrelation, partly due to its speed. You can
do these types of models with RStan and rjags but computations might
be impossibly slow.
lme4 is the least flexible of the packages, though there are some
options to customise its models. The similar nlme package also provides a
range of tools for fitting random effects for spatial and temporal
autocorrelation.
There are, however, some clever tricks you can do with lme4 to fit a
broader range of models. For instance, you can include a random effect
where every sample is a separate group in a Poisson GLM to get a Poisson
with extra variation.
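The observation-level random effect trick reads roughly like this in lme4 (a sketch on simulated overdispersed counts; parameter values are invented):

```r
library(lme4)

set.seed(2)
n <- 200
x <- rnorm(n)
# counts with extra log-normal noise, i.e. more variance than Poisson allows
y <- rpois(n, lambda = exp(0.5 + 0.8 * x + rnorm(n, sd = 0.7)))
dat <- data.frame(y, x, obs = factor(seq_len(n)))

# every observation gets its own group, so the random intercept
# absorbs the extra-Poisson variation
fit_olre <- glmer(y ~ x + (1 | obs), data = dat, family = poisson)
summary(fit_olre)
```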
That’s all. I hope this blog helps you make a good choice before
investing in learning a new tool for mixed effects models. Let me
know how you go and if you
found this useful.
rstanarm for mixed effects models coded like glm() but using Stan for estimation.
glmmBUGS for mixed effects models coded like glm() but using JAGS for estimation.
A new function in the mgcv package (links to
pdf) for
autowriting code for Bayesian Generalised Additive Models.
A new study showing how you can
fit spatial models with barriers.
Fit Bayesian Latent Variable
models for
analysing multivariate data (e.g. ecological communities).
Unless otherwise indicated, these resources are open access.
Stan user manual.
JAGS
user manual.
INLA project page.
INLA
forum.
Mixed Effects Models and Extensions in Ecology with
R, my go-to book for
theory and code for fitting likelihood based mixed effects models (e.g.
with lme4). You will have to buy it.
A good example for how to design a simulation study to test your
models
Example for INLA for spatial
models
in ecology.
A review of mixed effects models in fisheries
science
(good for other disciplines too).
NIMBLE another R package for Bayesian
modelling.
glmmADMB another R
package giving greater flexibility for fitting using maximum likelihood.
dplyr 0.6.0 coming soon!
(This article was first published on RStudio Blog, and kindly contributed to Rbloggers)
I’m planning to submit dplyr 0.6.0 to CRAN on May 11 (in four weeks’ time). In preparation, I’d like to announce that the release candidate, dplyr 0.5.0.9002, is now available. I would really appreciate it if you’d try it out and report any problems. This will ensure that the official release has as few bugs as possible.
Installation
Install the pre-release version with:
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")
If you discover any problems, please file a minimal reprex on GitHub. You can roll back to the released version with:
install.packages("dplyr")
Features
dplyr 0.6.0 is a major release including over 100 bug fixes and improvements. There are three big changes that I want to touch on here:
 Databases
 Improved encoding support (particularly for CJK on Windows)
 Tidyeval, a new framework for programming with dplyr
You can see a complete list of changes in the draft release notes.
Databases
Almost all database-related code has been moved out of dplyr and into a new package, dbplyr. This makes dplyr simpler, and will make it easier to release fixes for bugs that only affect databases.
To install the development version of dbplyr so you can try it out, run:
devtools::install_github("hadley/dbplyr")
There’s one major change, as well as a whole heap of bug fixes and minor improvements. It is no longer necessary to create a remote “src”. Instead you can work directly with the database connection returned by DBI, reflecting the robustness of the DBI ecosystem. Thanks largely to the work of Kirill Müller (funded by the R Consortium), DBI backends are now much more consistent, comprehensive, and easier to use. That means there’s no longer a need for a layer between you and DBI.
You can continue to use src_mysql(), src_postgres(), and src_sqlite() (which still live in dplyr), but I recommend a new style that makes the connection to DBI more clear:
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)
#> [1] TRUE
iris2 <- tbl(con, "iris")
iris2
#> Source:   table<iris> [?? x 5]
#> Database: sqlite 3.11.1 [:memory:]
#>
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#> 1           5.1         3.5          1.4         0.2  setosa
#> 2           4.9         3.0          1.4         0.2  setosa
#> 3           4.7         3.2          1.3         0.2  setosa
#> 4           4.6         3.1          1.5         0.2  setosa
#> 5           5.0         3.6          1.4         0.2  setosa
#> 6           5.4         3.9          1.7         0.4  setosa
#> 7           4.6         3.4          1.4         0.3  setosa
#> 8           5.0         3.4          1.5         0.2  setosa
#> 9           4.4         2.9          1.4         0.2  setosa
#> 10          4.9         3.1          1.5         0.1  setosa
#> # ... with more rows
This is particularly useful if you want to perform non-SELECT queries, as you can do whatever you want with DBI::dbGetQuery() and DBI::dbExecute().
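A minimal sketch of running plain SQL through that same kind of DBI connection (assuming RSQLite is installed; the index name here is made up):

```r
# Sketch: arbitrary SQL against the DBI connection dplyr now works with.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "iris", iris)

# dbExecute() for statements that return no result set
DBI::dbExecute(con, "CREATE INDEX idx_species ON iris (Species)")

# dbGetQuery() for ad-hoc SELECTs alongside your dplyr pipelines
counts <- DBI::dbGetQuery(
  con, "SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species"
)
counts

DBI::dbDisconnect(con)
```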
If you’ve implemented a database backend for dplyr, please read the backend news to see what’s changed from your perspective (not much). If you want to ensure your package works with both the current and previous version of dplyr, see wrap_dbplyr_obj() for helpers.
Character encoding
We have done a lot of work to ensure that dplyr works with encodings other than Latin-1 on Windows. This is most likely to affect you if you work with data that contains Chinese, Japanese, or Korean (CJK) characters. dplyr should now just work with such data.
Tidyeval
dplyr has a new approach to non-standard evaluation (NSE) called tidyeval. Tidyeval is described in detail in a new vignette about programming with dplyr but, in brief, it gives you the ability to interpolate values in contexts where dplyr usually works with expressions:
my_var <- quo(homeworld)
starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
#> # A tibble: 49 × 3
#>          homeworld   height  mass
#>              <chr>    <dbl> <dbl>
#> 1         Alderaan 176.3333  64.0
#> 2      Aleen Minor  79.0000  15.0
#> 3           Bespin 175.0000  79.0
#> 4       Bestine IV 180.0000 110.0
#> 5   Cato Neimoidia 191.0000  90.0
#> 6            Cerea 198.0000  82.0
#> 7         Champala 196.0000   NaN
#> 8        Chandrila 150.0000   NaN
#> 9     Concord Dawn 183.0000  79.0
#> 10        Corellia 175.0000  78.5
#> # ... with 39 more rows
This will make it much easier to eliminate copy-and-pasted dplyr code by extracting repeated code into a function.
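For instance, following the approach described in the programming vignette, repeated pipeline code can be wrapped in a function by capturing the caller’s expressions with enquo() and splicing them back in with !! (a sketch; mean_by() is a hypothetical helper, not part of dplyr):

```r
library(dplyr)

# Hypothetical helper: group by one column, average another.
# enquo() captures what the caller typed; !! splices it back in.
mean_by <- function(df, group_var, measure_var) {
  group_var <- enquo(group_var)
  measure_var <- enquo(measure_var)
  df %>%
    group_by(!!group_var) %>%
    summarise(mean = mean(!!measure_var, na.rm = TRUE))
}

res <- mean_by(starwars, homeworld, height)
res
```

The same helper then works for any grouping/measure pair, e.g. `mean_by(starwars, species, mass)`.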
This also means that the underscored version of each main verb (filter_(), select_(), etc.) is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).
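A sketch of what that migration looks like in practice (the object names here are my own):

```r
library(dplyr)

# Old style (deprecated):
#   mtcars %>% filter_(~ cyl > 4) %>% select_("mpg", "cyl")

# New style: plain verbs for interactive use...
res1 <- mtcars %>% filter(cyl > 4) %>% select(mpg, cyl)

# ...or tidyeval when the column arrives in a variable
var <- quo(cyl)
res2 <- mtcars %>% filter(!!var > 4) %>% select(mpg, cyl)

identical(res1, res2)
```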