R-bloggers
R news and tutorials contributed by hundreds of R bloggers

Making Your .C Less NOTEworthy

Fri, 01/12/2018 - 06:00

(This article was first published on MeanMean, and kindly contributed to R-bloggers)


If you are a package maintainer, you may have noticed the following new NOTE in your R CMD check results:

Found no calls to: ‘R_registerRoutines’, ‘R_useDynamicSymbols’

If you are using Rcpp, you can easily fix this by refreshing the auto-generated function registration. However, if you have a lot of C code that uses the .C() interface, then you need to make a few changes by hand. In this blog post, I’ll use my updated meanShiftR package as an example of how to quickly fix your code.

The first thing you need to do is to examine your C code and determine your parameter types for each of the functions you call from R using .C(). In meanShiftR I have two C functions that I call from R. The first function R_meanShift has a function prototype that looks like this:

/* mean shift function prototype */
void R_meanShift(
  double * query,             /* data to query for, in row major form */
  double * train,             /* reference data, in row major form */
  int * assignment,           /* assignment */
  int * queryRowPtr,          /* number of rows of query data */
  int * trainRowPtr,          /* number of rows of reference data */
  int * queryColPtr,          /* number of columns of query and reference */
  int * nNeighborsPtr,        /* number of neighbors */
  int * iterationsPtr,        /* number of iterations */
  double * bandwidth,         /* bandwidth */
  double * alphaPtr,          /* ms to newton adjustment parameter */
  double * epsilonPtr,        /* min l2 dist for convergence */
  double * epsilonClusterPtr, /* min l2 dist for clustering */
  int * kernelEnumPtr,        /* kernel type */
  int * algorithmEnumPtr,     /* algorithm type */
  int * intParameters,        /* kernel and alg dependent parameters */
  double * dblParameters      /* kernel and alg dependent parameters */
);

The second function, R_knn, has a function prototype that looks like this:

/* R_knn function prototype */
void R_knn(
  double * queryPoints,  /* points to query for */
  double * x,            /* data to reference for the query (what we build our tree from) */
  int * xnrowPtr,        /* number of rows of x */
  int * nrowPtr,         /* number of rows of query points */
  int * ncolPtr,         /* number of columns for both queryPoints and x */
  double * kDist,        /* distance vector (we return this) */
  int * indexInt,        /* index of neighbors (we return this) */
  int * kPtr,            /* number of nearest neighbors */
  double * weight,       /* used for weighted distance calculations */
  int * leafSizePtr,     /* number of nodes in the k-d tree leaf nodes */
  double * maxDist       /* maximum distance l2^2 to look for neighbors */
);

With this information, I can register these functions in four pretty painless steps:
(1) Include the R_ext/Rdynload.h header file.
(2) Create an R_NativePrimitiveArgType array for each C function that you call with .C().
(3) Register the C functions as an R_CMethodDef array.
(4) Create a function that is run when the package's shared object is loaded; in this function we register our C functions using R_registerRoutines.

To detail this process, I have created a separate C file (init.c) that carries out each of these steps. A dedicated file isn't strictly necessary, but it can simplify things if the C functions you call from R have prototypes defined in separate files. The complete example is included below; I will then go over each of the steps.

/* init.c */
#include <stdlib.h>
#include <stdio.h>

#include "R.h"
#include "Rinternals.h"
#include "Rmath.h"

/*************************************/
/* STEP: 1. This header file is new  */
/*************************************/
#include <R_ext/Rdynload.h>

#include "kdtree.h"
#include "meanShift.h"

#if defined _OPENMP
#include <omp.h>
#endif

/*************************************/
/* Register SO's                     */
/*************************************/

/****************************************************************/
/* STEP: 2. Create R_NativePrimitiveArgTypes for each function  */
/****************************************************************/
static R_NativePrimitiveArgType R_meanShift_t[] = {
  REALSXP, REALSXP, INTSXP, INTSXP, INTSXP, INTSXP, INTSXP, INTSXP,
  REALSXP, REALSXP, REALSXP, REALSXP, INTSXP, INTSXP, INTSXP, REALSXP
};

static R_NativePrimitiveArgType R_knn_t[] = {
  REALSXP, REALSXP, INTSXP, INTSXP, INTSXP, REALSXP,
  INTSXP, INTSXP, REALSXP, INTSXP, REALSXP
};

/********************************************************************************/
/* STEP: 3. This is where we register our C functions as an R_CMethodDef array  */
/********************************************************************************/
static const R_CMethodDef cMethods[] = {
  {"R_meanShift", (DL_FUNC) &R_meanShift, 16, R_meanShift_t},
  {"R_knn",       (DL_FUNC) &R_knn,       11, R_knn_t},
  {NULL, NULL, 0, NULL}
};

/****************************************************************/
/* STEP: 4. This is our boilerplate registration of our C code  */
/****************************************************************/
void R_init_myLib(DllInfo *info)
{
  R_registerRoutines(info, cMethods, NULL, NULL, NULL);
  R_useDynamicSymbols(info, TRUE);
}

We will start at step two, since step one was just including the header file.
Step two involves creating an array of enumerated types for each C function we call from R, where the enumerated types are the standard R SEXP enumerations:
INTSXP for integers,
REALSXP for doubles,
STRSXP for character vectors, etc.
Each of these types is detailed in the R Internals manual.
Each array is declared with type R_NativePrimitiveArgType, and we will use it in the next step.

In step three we register our functions, providing the function name, a pointer to the function, the number of parameters, and the enumerated-type array from the last step.
We register them through an array of registration records, where the record for a function named foo with enumerated array foo_t and three parameters would look like:

{"foo", (DL_FUNC) &foo, 3, foo_t}

This array of records is NULL-terminated with the entry
{NULL, NULL, 0, NULL}.
The array type we use is R_CMethodDef, and we need to make sure we register all of our C functions in this array. The integer in each record must match the number of arguments of the corresponding C function.

Finally, in step four, we put it all together in a function that registers our C functions when the shared object is loaded.

void R_init_myLib(DllInfo *info)
{
  R_registerRoutines(info, cMethods, NULL, NULL, NULL);
  R_useDynamicSymbols(info, TRUE);
}
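One detail worth noting: the name of this initialization function must be R_init_ followed by the name of the shared object, which for a package is normally the package name (so for meanShiftR it would be R_init_meanShiftR). On the R side, the registered routines are then made available through the useDynLib() directive in the package NAMESPACE and called with .C(). The sketch below is hypothetical; it keeps the generic myLib/foo names from the example above and assumes foo takes three double arguments, which is an assumption for illustration only and not part of the original post.

# In the package NAMESPACE:
useDynLib(myLib, .registration = TRUE)

# Hypothetical R wrapper around the registered routine "foo",
# assuming foo_t = {REALSXP, REALSXP, REALSXP} (three double arguments)
foo_wrapper <- function(x, y) {
  out <- .C("foo",
            as.double(x),
            as.double(y),
            result = double(length(x)),
            PACKAGE = "myLib")
  out$result
}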

That’s it! Now update your packages!



How to implement neural networks in R

Fri, 01/12/2018 - 01:33

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

If you've ever wondered how neural networks work behind the scenes, check out this guide to implementing neural networks from scratch in R, by David Selby. You may be surprised how, with just a little linear algebra and a few R functions, you can train a classifier that separates the red dots from the blue dots in a complex pattern like this:

David also includes some elegant R code that implements neural networks using R6 classes. For a similar implementation using base R functions, you may also want to check out this guide to implementing neural networks in R by Ilia Karmanov.

Tea & Stats: Building a neural network from scratch in R



RStudio Connect v1.5.12

Fri, 01/12/2018 - 01:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re pleased to announce RStudio Connect 1.5.12. This release includes support for viewing historical content, per-application timeout settings, and important improvements and bug fixes.

Historical Content

RStudio Connect now retains and displays historical content. By selecting the content’s history, viewers can easily navigate, compare, and email prior versions of content. Historical content is especially valuable for scheduled reports. Previously published documents, plots, and custom versions of parameterized reports are also saved. Administrators can control how much history is saved by specifying a maximum age and/or a maximum number of historical versions.

Timeout Settings

Timeout settings can be customized for specific Shiny applications or Plumber APIs. These settings allow publishers to optimize timeouts for specific content. For example, a live-updating dashboard might be kept open without expecting user input, while a resource-intensive, interactive app might be more aggressively shut down when idle. Idle Timeout, Initial Timeout, Connection Timeout, and Read Timeout can all be customized.

Along with this improvement, be aware of a BREAKING CHANGE. The Applications.ConnectionTimeout and Applications.ReadTimeout settings, which specify server default timeouts for all content, have been deprecated in favor of Scheduler.ConnectionTimeout and Scheduler.ReadTimeout.

Other Improvements
  • A new security option, “Web Sudo Mode”, is enabled by default to require users to re-enter their password prior to performing sensitive actions like altering users, altering API keys, and linking RStudio to Connect.
  • The usermanager command line interface (CLI) can be used to update user first and last name, email, and username in addition to user role. User attributes that are managed by your external authentication provider will continue to be managed externally, but the CLI can be used by administrators to complete other fields in user profiles.
  • The Connect dashboard will show administrators and publishers a warning as license expiration nears.
  • BREAKING CHANGE The RequireExternalUsernames option deprecated in 1.5.10 has been removed.
  • Known Issue After installing RStudio Connect 1.5.12, previously deployed content may incorrectly display an error message. Refreshing the browser will fix the error.

You can see the full release notes for RStudio Connect 1.5.12 here.

Upgrade Planning

There are no special precautions to be aware of when upgrading from v1.5.10 apart from the breaking changes and known issue listed above and in the release notes. You can expect the installation and startup of v1.5.12 to be complete in under a minute.

If you’re upgrading from a release older than v1.5.10, be sure to consider the “Upgrade Planning” notes from the intervening releases, as well.

If you haven’t yet had a chance to download and try RStudio Connect, we encourage you to do so. RStudio Connect is the best way to share all the work that you do in R (Shiny apps, R Markdown documents, plots, dashboards, Plumber APIs, etc.) with collaborators, colleagues, or customers.

You can find more details or download a 45-day evaluation of the product at https://www.rstudio.com/products/connect/. Additional resources can be found below.



Fitting a TensorFlow Linear Classifier with tfestimators

Fri, 01/12/2018 - 01:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

In a recent post, I mentioned three avenues for working with TensorFlow from R:
  • The keras package, which uses the Keras API for building scalable, deep learning models
  • The tfestimators package, which wraps Google’s Estimators API for fitting models with pre-built estimators
  • The tensorflow package, which provides an interface to Google’s low-level TensorFlow API

In this post, Edgar and I use the linear_classifier() function, one of six pre-built models currently in the tfestimators package, to train a linear classifier using data from the titanic package.

library(tfestimators)
library(tensorflow)
library(tidyverse)
library(titanic)

The titanic_train data set contains 12 fields of information on 891 passengers from the Titanic. First, we load the data, split it into training and test sets, and have a look at it.

titanic_set <- titanic_train %>% filter(!is.na(Age))

# Split the data into training and test data sets
indices <- sample(1:nrow(titanic_set), size = 0.80 * nrow(titanic_set))
train <- titanic_set[indices, ]
test  <- titanic_set[-indices, ]

glimpse(titanic_set)
## Observations: 714
## Variables: 12
## $ PassengerId 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ Survived    0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0,...
## $ Pclass      3, 1, 3, 1, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 3,...
## $ Name        "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex         "male", "female", "female", "female", "male", "mal...
## $ Age         22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20, 39, ...
## $ SibSp       1, 1, 0, 1, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 1,...
## $ Parch       0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0,...
## $ Ticket      "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare        7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 51.8625,...
## $ Cabin       "", "C85", "", "C123", "", "E46", "", "", "", "G6"...
## $ Embarked    "S", "C", "S", "S", "S", "S", "S", "S", "C", "S", ...

Notice that both Sex and Embarked are character variables. We would like to make both of these categorical variables for the analysis. We can do this “on the fly” by using the tfestimators::feature_columns() function to get the data into the shape expected for an input Tensor. Category levels are set by passing a list to the vocabulary_list argument. The Pclass variable is passed as a numeric feature, so no further action is required.

cols <- feature_columns(
  column_categorical_with_vocabulary_list("Sex", vocabulary_list = list("male", "female")),
  column_categorical_with_vocabulary_list("Embarked", vocabulary_list = list("S", "C", "Q", "")),
  column_numeric("Pclass")
)

So far, no real processing has taken place. The data have not yet been evaluated by R or loaded into TensorFlow. Our first interaction with TensorFlow begins when we use the linear_classifier() function to build the TensorFlow model object for a linear model.

model <- linear_classifier(feature_columns = cols)

Now, we use the tfestimators::input_fn() to get the data into TensorFlow and define the model itself. The following helper function sets up the predictive variables and response variable for a model to predict survival from knowing a passenger’s sex, ticket class, and port of embarkation.

titanic_input_fn <- function(data) {
  input_fn(data,
           features = c("Sex", "Pclass", "Embarked"),
           response = "Survived")
}

tfestimators::train() uses the helper function to fit and train the model on the training set constructed above.

train(model, titanic_input_fn(train))

The tensorflow::evaluate() function evaluates the model’s performance.

model_eval <- evaluate(model, titanic_input_fn(test))

glimpse(model_eval)
## Observations: 1
## Variables: 9
## $ loss                  40.2544
## $ accuracy_baseline     0.5874126
## $ global_step           5
## $ auc                   0.8096247
## $ `prediction/mean`     0.3557937
## $ `label/mean`          0.4125874
## $ average_loss          0.5629987
## $ auc_precision_recall  0.8102072
## $ accuracy              0.7132867

It’s not a great model, by any means, but an AUC of about 0.81 isn’t bad for a first try. We will use R’s familiar predict() function to make some predictions with the test data set. Notice that these data need to be wrapped in titanic_input_fn(), just as we did for the training data above.

model_predict <- predict(model, titanic_input_fn(test))

The following code unpacks the list containing the prediction results.

res <- data.frame(matrix(unlist(model_predict[[1]]), ncol = 2, byrow = TRUE),
                  unlist(model_predict[[2]]),
                  unlist(model_predict[[3]]),
                  unlist(model_predict[[4]]),
                  unlist(model_predict[[5]]))

names(res) <- c("Prob Survive", "Prob Perish", names(model_predict)[2:5])
options(digits = 3)
head(res)
##   Prob Survive Prob Perish  logits classes class_ids logistic
## 1        0.380       0.620  0.4899       1         1    0.620
## 2        0.509       0.491 -0.0373       0         0    0.491
## 3        0.380       0.620  0.4899       1         1    0.620
## 4        0.509       0.491 -0.0373       0         0    0.491
## 5        0.781       0.219 -1.2697       0         0    0.219
## 6        0.735       0.265 -1.0180       0         0    0.265

Before finishing up, we note that TensorFlow writes quite a bit of information to disk:

list.files(model$estimator$model_dir)
##  [1] "checkpoint"                        "eval"
##  [3] "graph.pbtxt"                       "logs"
##  [5] "model.ckpt-1.data-00000-of-00001"  "model.ckpt-1.index"
##  [7] "model.ckpt-1.meta"                 "model.ckpt-5.data-00000-of-00001"
##  [9] "model.ckpt-5.index"                "model.ckpt-5.meta"

Finally, we use the TensorBoard visualization tool to look at the data flow graph and other aspects of the model.

To see all of this, point your browser to the address returned by the following command.

tensorboard(model$estimator$model_dir, action = "start")
## Started TensorBoard at http://127.0.0.1:5503




Deep Learning from first principles in Python, R and Octave – Part 2

Thu, 01/11/2018 - 15:45

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

“What does the world outside your head really ‘look’ like? Not only is there no color, there’s also no sound: the compression and expansion of air is picked up by the ears, and turned into electrical signals. The brain then presents these signals to us as mellifluous tones and swishes and clatters and jangles. Reality is also odorless: there’s no such thing as smell outside our brains. Molecules floating through the air bind to receptors in our nose and are interpreted as different smells by our brain. The real world is not full of rich sensory events; instead, our brains light up the world with their own sensuality.”
“The Brain: The Story of You” by David Eagleman

The world is Maya, illusory. The ultimate reality, the Brahman, is all-pervading and all-permeating, which is colourless, odourless, tasteless, nameless and formless
Bhagavad Gita

1. Introduction

This post is a follow-up to my earlier post Deep Learning from first principles in Python, R and Octave – Part 1. In the first part, I implemented Logistic Regression, in vectorized Python, R and Octave, with a wannabe Neural Network (a Neural Network with no hidden layers). In this second part, I implement a regular, but somewhat primitive, Neural Network (a Neural Network with just 1 hidden layer). The 2nd part implements classification of manually created datasets, where the different clusters of the 2 classes are not linearly separable.

Neural Networks perform really well in learning all sorts of non-linear boundaries between classes. Initially, logistic regression is used to perform the classification and the decision boundary is plotted; vanilla logistic regression performs quite poorly. Using SVMs with a radial basis kernel would have performed much better in creating non-linear boundaries. To see R and Python implementations of SVMs, take a look at my post Practical Machine Learning with R and Python – Part 4.

You could also check out my book on Amazon Practical Machine Learning with R and Python – Machine Learning in Stereo,  in which I implement several Machine Learning algorithms on regression and classification, along with other necessary metrics that are used in Machine Learning.

You can clone and fork this R Markdown file along with the vectorized implementations of the 3 layer Neural Network for Python, R and Octave from Github DeepLearning-Part2

2. The 3 layer Neural Network

A simple representation of a 3 layer Neural Network (NN) with 1 hidden layer is shown below.

In the above Neural Network, there are 2 input features at the input layer, 3 hidden units at the hidden layer, and 1 unit at the output layer, since it deals with binary classification. The activation at the hidden layer can be tanh, sigmoid, ReLU, etc. At the output layer the activation is a sigmoid to handle binary classification.

The forward propagation equations are as follows. For the hidden layer (the superscript indicates layer 1):

\( z^{[1]} = W^{[1]}x + b^{[1]} \)

\( a^{[1]} = \tanh(z^{[1]}) \)

Also, for the output layer (the superscript indicates layer 2):

\( z^{[2]} = W^{[2]}a^{[1]} + b^{[2]} \)

Hence

\( a^{[2]} = \sigma(z^{[2]}) = \frac{1}{1 + e^{-z^{[2]}}} \)

Vectorized over all \( m \) training samples, these equations can be written as

\( Z^{[1]} = W^{[1]}X + b^{[1]}, \quad A^{[1]} = \tanh(Z^{[1]}) \)

\( Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]}) \)



I) Some important results (a memory refresher!)

\( \sinh(x) = \frac{e^{x} - e^{-x}}{2} \) and \( \cosh(x) = \frac{e^{x} + e^{-x}}{2} \) -(a)

Using (a) we can show that \( \frac{d}{dx}\sinh(x) = \cosh(x) \) and \( \frac{d}{dx}\cosh(x) = \sinh(x) \) -(b)

Now \( \tanh(x) = \frac{\sinh(x)}{\cosh(x)} \) -(c)

Since (c) is a quotient, using the quotient rule and the derivatives of \( \sinh(x) \) and \( \cosh(x) \) from (b) above we get

\( \frac{d}{dx}\tanh(x) = \frac{\cosh^{2}(x) - \sinh^{2}(x)}{\cosh^{2}(x)} = 1 - \tanh^{2}(x) \)

Since the sigmoid is \( \sigma(z) = \frac{1}{1 + e^{-z}} \), a similar calculation gives

\( \frac{d}{dz}\sigma(z) = \sigma(z)(1 - \sigma(z)) \) -(d)

II) Derivatives

The loss for binary classification is the cross-entropy loss

\( L = -\left( y \log a^{[2]} + (1 - y)\log(1 - a^{[2]}) \right) \)

Since \( \frac{\partial L}{\partial a^{[2]}} = -\frac{y}{a^{[2]}} + \frac{1 - y}{1 - a^{[2]}} \), therefore, using (d) above (see Part 1),

\( \frac{\partial L}{\partial z^{[2]}} = a^{[2]} - y \)

and, from the derivative of tanh above, for the hidden layer

\( \frac{\partial a^{[1]}}{\partial z^{[1]}} = 1 - (a^{[1]})^{2} \)


III) Back propagation
Using the derivatives from II) we can derive the following results using the Chain Rule (here \( \ast \) denotes elementwise multiplication)

\( \frac{\partial L}{\partial W^{[2]}} = (a^{[2]} - y)\, a^{[1]T} \) -(A)

\( \frac{\partial L}{\partial b^{[2]}} = a^{[2]} - y \) -(B)

\( \frac{\partial L}{\partial z^{[1]}} = W^{[2]T}(a^{[2]} - y) \ast (1 - (a^{[1]})^{2}) \)

\( \frac{\partial L}{\partial W^{[1]}} = \frac{\partial L}{\partial z^{[1]}}\, x^{T} \) -(C)

\( \frac{\partial L}{\partial b^{[1]}} = \frac{\partial L}{\partial z^{[1]}} \) -(D)

IV) Gradient Descent
The key computations in the backward cycle are the gradient-descent updates
\( W^{[1]} := W^{[1]} - \alpha \frac{\partial L}{\partial W^{[1]}} \) – From (C)
\( b^{[1]} := b^{[1]} - \alpha \frac{\partial L}{\partial b^{[1]}} \) – From (D)
\( W^{[2]} := W^{[2]} - \alpha \frac{\partial L}{\partial W^{[2]}} \) – From (A)
\( b^{[2]} := b^{[2]} - \alpha \frac{\partial L}{\partial b^{[2]}} \) – From (B)

where \( \alpha \) is the learning rate. The weights and biases (W1, b1, W2, b2) are updated at each iteration, thus minimizing the loss/cost.

These derivations can be represented pictorially using the computation graph (from the book Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville).
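To make sections III) and IV) concrete, here is a minimal R sketch of a single gradient-descent iteration for this network, assuming a tanh hidden layer, a sigmoid output unit, one training example per column of X, and Y as a 1 x m row vector of 0/1 labels. The function and variable names are illustrative only and are not taken from the author's computeNN() implementation.

sigmoid <- function(z) 1 / (1 + exp(-z))

nn_step <- function(X, Y, W1, b1, W2, b2, learningRate = 0.5) {
  m  <- ncol(X)                            # number of training examples
  # Forward propagation
  Z1 <- sweep(W1 %*% X, 1, b1, "+")        # layer 1 pre-activation
  A1 <- tanh(Z1)
  Z2 <- sweep(W2 %*% A1, 1, b2, "+")       # layer 2 pre-activation
  A2 <- sigmoid(Z2)
  # Backward propagation (results (A)-(D) above, averaged over m examples)
  dZ2 <- A2 - Y
  dW2 <- dZ2 %*% t(A1) / m
  db2 <- rowSums(dZ2) / m
  dZ1 <- (t(W2) %*% dZ2) * (1 - A1^2)
  dW1 <- dZ1 %*% t(X) / m
  db1 <- rowSums(dZ1) / m
  # Gradient-descent update
  list(W1 = W1 - learningRate * dW1, b1 = b1 - learningRate * db1,
       W2 = W2 - learningRate * dW2, b2 = b2 - learningRate * db2)
}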

3. Manually create a data set that is not linearly separable

Initially I create a dataset with 2 classes which has around 9 clusters that cannot be separated by linear boundaries. Note: This data set is saved as data.csv and is used for the R and Octave Neural networks to see how they perform on the same dataset.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
import sklearn
import sklearn.datasets

colors=['black','gold']
cmap = matplotlib.colors.ListedColormap(colors)
X, y = make_blobs(n_samples = 400, n_features = 2, centers = 7,
                  cluster_std = 1.3, random_state = 4)
#Create 2 classes
y=y.reshape(400,1)
y = y % 2
#Plot the figure
plt.figure()
plt.title('Non-linearly separable classes')
plt.scatter(X[:,0], X[:,1], c=y, marker= 'o', s=50,cmap=cmap)
plt.savefig('fig1.png', bbox_inches='tight')

4. Logistic Regression

On the dataset created above, classification with logistic regression is performed and the decision boundary is plotted. It can be seen that logistic regression performs quite poorly.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
import sklearn
import sklearn.datasets
#from DLfunctions import plot_decision_boundary
execfile("./DLfunctions.py") # Since import does not work in Rmd!!!

colors=['black','gold']
cmap = matplotlib.colors.ListedColormap(colors)
X, y = make_blobs(n_samples = 400, n_features = 2, centers = 7,
                  cluster_std = 1.3, random_state = 4)
#Create 2 classes
y=y.reshape(400,1)
y = y % 2

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV();
clf.fit(X, y);

# Plot the decision boundary for logistic regression
plot_decision_boundary_n(lambda x: clf.predict(x), X.T, y.T,"fig2.png")

5. The 3 layer Neural Network in Python (vectorized)

The vectorized implementation is included below. Note that in the case of Python, a learning rate of 0.5 and 4 hidden units perform very well.

## Random data set with 9 clusters
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
from sklearn.datasets import make_classification, make_blobs
execfile("./DLfunctions.py") # Since import does not work in Rmd!!!

X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9,
                    cluster_std = 1.3, random_state = 4)
#Create 2 classes
Y1=Y1.reshape(400,1)
Y1 = Y1 % 2
X2=X1.T
Y2=Y1.T

#Perform gradient descent
parameters,costs = computeNN(X2, Y2, numHidden = 4, learningRate=0.5,
                             numIterations = 10000)
plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(4),str(0.5),"fig3.png")
## Cost after iteration 0: 0.692669
## Cost after iteration 1000: 0.246650
## Cost after iteration 2000: 0.227801
## Cost after iteration 3000: 0.226809
## Cost after iteration 4000: 0.226518
## Cost after iteration 5000: 0.226331
## Cost after iteration 6000: 0.226194
## Cost after iteration 7000: 0.226085
## Cost after iteration 8000: 0.225994
## Cost after iteration 9000: 0.225915

6. The 3 layer Neural Network in R (vectorized)

For this, the dataset created by Python is saved to see how R performs on the same data. The vectorized implementation of a Neural Network was just a little more interesting in R, as R does not have a package similar to numpy. While numpy handles broadcasting implicitly, in R I had to use the sweep() command to broadcast. The implementation is included below. Note that since the initialization with random weights is slightly different, R performs best with a learning rate of 0.1 and with 6 hidden units.
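As a tiny stand-alone illustration of the broadcasting involved (not code from the post itself), sweep() adds a per-row bias vector to every column of a matrix, which is what numpy's implicit broadcasting does in the Python version:

Z <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix
b <- c(10, 20)               # one bias value per row
sweep(Z, 1, b, "+")          # adds b to every column of Z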

source("DLfunctions2_1.R") z <- as.matrix(read.csv("data.csv",header=FALSE)) # x <- z[,1:2] y <- z[,3] x1 <- t(x) y1 <- t(y) #Perform gradient descent nn <-computeNN(x1, y1, 6, learningRate=0.1,numIterations=10000) # Good ## [1] 0.7075341 ## [1] 0.2606695 ## [1] 0.2198039 ## [1] 0.2091238 ## [1] 0.211146 ## [1] 0.2108461 ## [1] 0.2105351 ## [1] 0.210211 ## [1] 0.2099104 ## [1] 0.2096437 ## [1] 0.209409 plotDecisionBoundary(z,nn,6,0.1)

 

7. The 3 layer Neural Network in Octave (vectorized)

This uses the same dataset that was generated using Python code.
source("DL-function2.m")
data=csvread("data.csv");
X=data(:,1:2);
Y=data(:,3);
# Make sure that the model parameters are correct. Take the transpose of X & Y
#Perform gradient descent
[W1,b1,W2,b2,costs]= computeNN(X', Y',4, learningRate=0.5, numIterations = 10000);

8a. Performance for different learning rates (Python)

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
from sklearn.datasets import make_classification, make_blobs
execfile("./DLfunctions.py") # Since import does not work in Rmd!!!

# Create data
X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9,
                    cluster_std = 1.3, random_state = 4)
#Create 2 classes
Y1=Y1.reshape(400,1)
Y1 = Y1 % 2
X2=X1.T
Y2=Y1.T

# Create a list of learning rates
learningRate=[0.5,1.2,3.0]
df=pd.DataFrame()
#Compute costs for each learning rate
for lr in learningRate:
    parameters,costs = computeNN(X2, Y2, numHidden = 4, learningRate=lr,
                                 numIterations = 10000)
    print(costs)
    df1=pd.DataFrame(costs)
    df=pd.concat([df,df1],axis=1)
#Set the iterations
iterations=[0,1000,2000,3000,4000,5000,6000,7000,8000,9000]
#Create data frame and set index
df1=df.set_index([iterations])
df1.columns=[0.5,1.2,3.0]
fig=df1.plot()
fig=plt.title("Cost vs No of Iterations for different learning rates")
plt.savefig('fig4.png', bbox_inches='tight')

8b. Performance for different hidden units (Python)

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sklearn.linear_model
import pandas as pd
from sklearn.datasets import make_classification, make_blobs
execfile("./DLfunctions.py") # Since import does not work in Rmd!!!

#Create data set
X1, Y1 = make_blobs(n_samples = 400, n_features = 2, centers = 9,
                    cluster_std = 1.3, random_state = 4)
#Create 2 classes
Y1=Y1.reshape(400,1)
Y1 = Y1 % 2
X2=X1.T
Y2=Y1.T

# Make a list of hidden units
numHidden=[3,5,7]
df=pd.DataFrame()
#Compute costs for different hidden units
for numHid in numHidden:
    parameters,costs = computeNN(X2, Y2, numHidden = numHid, learningRate=1.2,
                                 numIterations = 10000)
    print(costs)
    df1=pd.DataFrame(costs)
    df=pd.concat([df,df1],axis=1)
#Set the iterations
iterations=[0,1000,2000,3000,4000,5000,6000,7000,8000,9000]
#Set index
df1=df.set_index([iterations])
df1.columns=[3,5,7]
#Plot
fig=df1.plot()
fig=plt.title("Cost vs No of Iterations for different no of hidden units")
plt.savefig('fig5.png', bbox_inches='tight')

9a. Performance for different learning rates (R)

source("DLfunctions2_1.R")
# Read data
z <- as.matrix(read.csv("data.csv",header=FALSE)) #
x <- z[,1:2]
y <- z[,3]
x1 <- t(x)
y1 <- t(y)
#Loop through learning rates and compute costs
learningRate <-c(0.1,1.2,3.0)
df <- NULL
for(i in seq_along(learningRate)){
    nn <- computeNN(x1, y1, 6, learningRate=learningRate[i],numIterations=10000)
    cost <- nn$costs
    df <- cbind(df,cost)
}
#Create dataframe
df <- data.frame(df)
iterations=seq(0,10000,by=1000)
df <- cbind(iterations,df)
names(df) <- c("iterations","0.1","1.2","3.0")
library(reshape2)
df1 <- melt(df,id="iterations")  # Melt the data
#Plot
ggplot(df1) + geom_line(aes(x=iterations,y=value,colour=variable),size=1) +
    xlab("Iterations") + ylab('Cost') +
    ggtitle("Cost vs No iterations for different learning rates")

9b. Performance for different hidden units (R)

source("DLfunctions2_1.R")
# Loop through Num hidden units
numHidden <-c(4,6,9)
df <- NULL
for(i in seq_along(numHidden)){
    nn <- computeNN(x1, y1, numHidden[i], learningRate=0.1,numIterations=10000)
    cost <- nn$costs
    df <- cbind(df,cost)
}
df <- data.frame(df)
iterations=seq(0,10000,by=1000)
df <- cbind(iterations,df)
names(df) <- c("iterations","4","6","9")
library(reshape2)
# Melt
df1 <- melt(df,id="iterations")
# Plot
ggplot(df1) + geom_line(aes(x=iterations,y=value,colour=variable),size=1) +
    xlab("Iterations") + ylab('Cost') +
    ggtitle("Cost vs No iterations for different number of hidden units")

10a. Performance of the Neural Network for different learning rates (Octave)

source("DL-function2.m")
plotLRCostVsIterations()
print -djpg figa.jpg

10b. Performance of the Neural Network for different number of hidden units (Octave)

source("DL-function2.m")
plotHiddenCostVsIterations()
print -djpg figa.jpg

11. Turning the heat on the Neural Network

In this 2nd part I create a central region of positives and the outside region as negatives. The points are generated using the equation of a circle \( (x - a)^{2} + (y - b)^{2} = R^{2} \). How does the 3 layer Neural Network perform on this? Here's a look! Note: The same dataset is also used for the R and Octave Neural Network constructions.

12. Manually creating a circular central region

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification, make_blobs
from matplotlib.colors import ListedColormap
import sklearn
import sklearn.datasets

colors=['black','gold']
cmap = matplotlib.colors.ListedColormap(colors)
x1=np.random.uniform(0,10,800).reshape(800,1)
x2=np.random.uniform(0,10,800).reshape(800,1)
X=np.append(x1,x2,axis=1)
X.shape

# Create (x-a)^2 + (y-b)^2 = R^2
# Create a subset of values where the squared distance is <= 6. Perform ravel() to flatten this vector
a=(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2) <= 6).ravel()
Y=a.reshape(800,1)

cmap = matplotlib.colors.ListedColormap(colors)
plt.figure()
plt.title('Non-linearly separable classes')
plt.scatter(X[:,0], X[:,1], c=Y, marker= 'o', s=15,cmap=cmap)
plt.savefig('fig6.png', bbox_inches='tight')

13a. Decision boundary with hidden units=4 and learning rate = 2.2 (Python)

With the above hyperparameters the decision boundary is triangular.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model
execfile("./DLfunctions.py")

x1=np.random.uniform(0,10,800).reshape(800,1)
x2=np.random.uniform(0,10,800).reshape(800,1)
X=np.append(x1,x2,axis=1)
X.shape
# Create a subset of values where the squared distance is <= 6. Perform ravel() to flatten this vector
a=(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2) <= 6).ravel()
Y=a.reshape(800,1)
X2=X.T
Y2=Y.T

parameters,costs = computeNN(X2, Y2, numHidden = 4, learningRate=2.2,
                             numIterations = 10000)
plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(4),str(2.2),"fig7.png")
## Cost after iteration 0: 0.692836
## Cost after iteration 1000: 0.331052
## Cost after iteration 2000: 0.326428
## Cost after iteration 3000: 0.474887
## Cost after iteration 4000: 0.247989
## Cost after iteration 5000: 0.218009
## Cost after iteration 6000: 0.201034
## Cost after iteration 7000: 0.197030
## Cost after iteration 8000: 0.193507
## Cost after iteration 9000: 0.191949

13b. Decision boundary with hidden units=12 and learning rate = 2.2 (Python)

With the above hyperparameters the decision boundary is triangular.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors
import sklearn.linear_model
execfile("./DLfunctions.py")

x1=np.random.uniform(0,10,800).reshape(800,1)
x2=np.random.uniform(0,10,800).reshape(800,1)
X=np.append(x1,x2,axis=1)
X.shape
# Create a subset of values where the squared distance is <= 6. Perform ravel() to flatten this vector
a=(np.power(X[:,0]-5,2) + np.power(X[:,1]-5,2) <= 6).ravel()
Y=a.reshape(800,1)
X2=X.T
Y2=Y.T

parameters,costs = computeNN(X2, Y2, numHidden = 12, learningRate=2.2,
                             numIterations = 10000)
plot_decision_boundary(lambda x: predict(parameters, x.T), X2, Y2,str(12),str(2.2),"fig8.png")
## Cost after iteration 0: 0.693291
## Cost after iteration 1000: 0.383318
## Cost after iteration 2000: 0.298807
## Cost after iteration 3000: 0.251735
## Cost after iteration 4000: 0.177843
## Cost after iteration 5000: 0.130414
## Cost after iteration 6000: 0.152400
## Cost after iteration 7000: 0.065359
## Cost after iteration 8000: 0.050921
## Cost after iteration 9000: 0.039719

14a. Decision boundary with hidden units=9 and learning rate = 0.5 (R)

With 9 hidden units and a learning rate of 0.5, the decision boundary in R also has a triangular shape.

source("DLfunctions2_1.R") z <- as.matrix(read.csv("data1.csv",header=FALSE)) # N x <- z[,1:2] y <- z[,3] x1 <- t(x) y1 <- t(y) nn <-computeNN(x1, y1, 9, learningRate=0.5,numIterations=10000) # Triangular ## [1] 0.8398838 ## [1] 0.3303621 ## [1] 0.3127731 ## [1] 0.3012791 ## [1] 0.3305543 ## [1] 0.3303964 ## [1] 0.2334615 ## [1] 0.1920771 ## [1] 0.2341225 ## [1] 0.2188118 ## [1] 0.2082687 plotDecisionBoundary(z,nn,6,0.1)

14b. Decision boundary with hidden units=8 and learning rate = 0.1 (R)

source("DLfunctions2_1.R")
z <- as.matrix(read.csv("data1.csv",header=FALSE)) # N
x <- z[,1:2]
y <- z[,3]
x1 <- t(x)
y1 <- t(y)
nn <-computeNN(x1, y1, 8, learningRate=0.1,numIterations=10000) # Hemisphere
## [1] 0.7273279
## [1] 0.3169335
## [1] 0.2378464
## [1] 0.1688635
## [1] 0.1368466
## [1] 0.120664
## [1] 0.111211
## [1] 0.1043362
## [1] 0.09800573
## [1] 0.09126161
## [1] 0.0840379
plotDecisionBoundary(z,nn,8,0.1)

15a. Decision boundary with hidden units=12 and learning rate = 1.5 (Octave)

source("DL-function2.m")
data=csvread("data1.csv");
X=data(:,1:2);
Y=data(:,3);
# Make sure that the model parameters are correct. Take the transpose of X & Y
[W1,b1,W2,b2,costs]= computeNN(X', Y',12, learningRate=1.5, numIterations = 10000);
plotDecisionBoundary(data, W1,b1,W2,b2)
print -djpg fige.jpg

Conclusion: This post implemented a 3 layer Neural Network to create non-linear boundaries while performing classification. Clearly the Neural Network performs very well, and its behavior can be tuned by varying the number of hidden units and the learning rate.

To be continued…
Watch this space!!

References
1. Deep Learning Specialization
2. Neural Networks for Machine Learning
3. Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
4. Neural Networks: The mechanics of backpropagation
5. Machine Learning

Also see
1. My book ‘Practical Machine Learning with R and Python’ on Amazon
2. GooglyPlus: yorkr analyzes IPL players, teams, matches with plots and tables
3. The 3rd paperback & kindle editions of my books on Cricket, now on Amazon
4. Exploring Quantum Gate operations with QCSimulator
5. Simulating a Web Joint in Android
6. My travels through the realms of Data Science, Machine Learning, Deep Learning and (AI)
7. Presentation on Wireless Technologies – Part 1

To see all posts check Index of posts



Introducing editheme: Palettes and graphics matching your RStudio editor

Thu, 01/11/2018 - 11:37

(This article was first published on R_EN – Piece of K, and kindly contributed to R-bloggers)

I use RStudio every day at work. And every day my eyes say thank you to the developers for implementing themes support for the editor, and more recently a complete dark skin for the GUI. Dark themes are not only important for your health and the health of the planet. They are also absolutely essential to look cool in the office and make your colleagues believe you’re working on very complicated projects.

However, one little thing could dispel the illusion in a second: an ugly, flashy graphic with a white background in your plot pane.

To solve this serious issue of my daily life I wrote editheme, a tiny package providing color palettes and functions to create graphics matching the RStudio editor. The following animated screenshot is worth a thousand words.

Read more about the package on its Github page.



Last day: Data science courses in R (/python/etc.) for $11 at Udemy (New Year sale)

Thu, 01/11/2018 - 07:44

Udemy is offering readers of R-bloggers access to its global online learning marketplace for only $10 per course! This deal (offering over 50%-90% discount) is for hundreds of their courses – including many R-Programming, data science, machine learning etc.

Click here to browse ALL (R and non-R) courses

Advanced R courses: 

Introductory R courses: 

Top data science courses (not in R): 

 


R jumps to 8th position in TIOBE language rankings

Thu, 01/11/2018 - 02:47

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

The R language surged to 8th place in the 2017 TIOBE language rankings, up 8 places from a year before. Fellow data science language Python also saw an increase in rankings, taking the 4th spot (one ahead of its January 2016 ranking).

TIOBE ranks programming languages according to their search engine rankings, and R has been steadily climbing since the rankings began:

You can find the current TIOBE language rankings, updated monthly, at the link below.

TIOBE: TIOBE Index 



Bayesian Binomial Test in R

Thu, 01/11/2018 - 01:00

(This article was first published on Silent Spring Institute Developer Blog, and kindly contributed to R-bloggers)

Summary: in this post, I implement an R function for computing \( P(\theta_1 > \theta_2) \), where \( \theta_1 \) and \( \theta_2 \) are beta-distributed random variables. This is useful for estimating the probability that one binomial proportion is greater than another.

I am working on a project in which I need to compare two binomial proportions to see if one is likely to be greater than the other. I wanted to do this in a Bayesian framework; fortunately, estimating binomial proportions is one of the first subjects taught to introduce Bayesian statistics (and statistics in general, for that matter) so the method is well established. Rasmus Bååth has a very nice article describing the Bayesian binomial test, and an estimation approach using JAGS.

I made the problem even simpler by using the fact that the beta distribution is the conjugate prior to the binomial distribution. That is, if the prior is \( \mathrm{beta}(\alpha, \beta) \) distributed, then the posterior after observing \( k \) successes in \( n \) trials is

\mathrm{beta}(\alpha + k, \beta + n - k)

If you want to start with a noninformative prior, then you can use the beta distribution with \( \alpha = \beta = 1\), which collapses to the uniform distribution. This makes the posterior

\mathrm{beta}(1 + k, 1 + n - k)
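As a quick illustrative sketch (not part of the original post), if we observe k = 5 successes in n = 10 trials under the uniform prior, the posterior is beta(6, 6), and we can inspect it directly with R's built-in beta functions:

k <- 5; n <- 10
# Posterior under a beta(1, 1) prior: beta(1 + k, 1 + n - k) = beta(6, 6)
curve(dbeta(x, 1 + k, 1 + n - k), from = 0, to = 1,
      xlab = expression(theta), ylab = "Posterior density")
qbeta(c(0.025, 0.975), 1 + k, 1 + n - k)   # 95% credible interval for theta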

What I am interested in is comparing two binomial proportions, \( \theta_1 \) and \( \theta_2 \). I want to calculate

P(\theta_1>\theta_2)

We observe \( k_1 \) successes out of \( n_1 \) trials and estimate a posterior distribution for \( \theta_1 \), and \( k_2 \) successes out of \( n_2 \) trials for \( \theta_2 \). Then the probability in question is

P(\mathrm{beta}(1 + k_1, 1 + n_1 - k_1) > \mathrm{beta}(1 + k_2, 1 + n_2 - k_2))

So far, so textbook. At this point, the simplest way to get an estimate is to draw two sets of random numbers, distributed by the two respective beta distributions, and see how many times the first random number is greater than the second:

set.seed(1)
k1 <- 5   # Five successes
k2 <- 1   # One success
n1 <- 10  # Ten trials
n2 <- 10  # Ten trials

b1 <- rbeta(10000, 1 + k1, 1 + n1 - k1)
b2 <- rbeta(10000, 1 + k2, 1 + n2 - k2)

# Probability estimate of P(theta_1 > theta_2)
sum(b1 > b2) / 10000
## [1] 0.9688

This works fine, but it would be nice to have an exact solution. I found several papers describing different approaches.

Nadarajah and Kotz (2007) derived a functional form for the distribution of \( \theta_1 - \theta_2 \) (and Chen and Luo fixed a typo in their work in 2011). This approach is implemented in the tolerance package via the functions ddiffprop, pdiffprop, qdiffprop, and rdiffprop. The a1 and a2 parameters let you adjust the priors, where a1 = a2 = 1 is a uniform prior for both proportions.

library(tolerance)

# Probability that the first proportion (5 successes out of 10 trials)
# is greater than the second (1 success out of 10 trials)
pdiffprop(0, k1 = 5, k2 = 1, n1 = 10, n2 = 10,
          a1 = 1, a2 = 1, lower.tail = FALSE)
## [1] 0.9682663

The functions work well, but I found that they stop working for large values of \( n_1 \) or \( n_2 \). This is because several of the terms in the density function use the gamma function, and the R implementation returns infinity and throws a warning for any input larger than 171.
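The overflow is easy to demonstrate directly (a quick check, not part of the original post): gamma() is still finite at 171 but overflows from 172 onwards, because 171! exceeds the largest representable double (about 1.8e308).

gamma(171)  # approximately 7.26e306, still representable
gamma(172)  # Inf, with an overflow warning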

The second approach I found is described in Raineri et al. 2014, in which they derived a recursive formula for \( P(\theta_1 > \theta_2) \). I unrolled the recursion so it could handle large \( n \) without overflowing the stack, and implemented it in R. The first two functions, h and g, come directly from Raineri et al. and pdiff provides a wrapper for estimating differences in binomial proportions.

h <- function(a, b, c, d) {
  exp(lbeta(a + c, b + d) - (lbeta(a, b) + lbeta(c, d)))
}

# probability that beta(a, b) > beta(c, d)
g <- function(a, b, c, d) {
  stopifnot(a >= 1 && b >= 1 && c >= 1 && d >= 1)
  accum <- 0.5
  while (a != c || b != d) {
    if (a > c) {
      accum <- accum + h(a - 1, b, c, d) / (a - 1)
      a <- a - 1
    } else if (b > d) {
      accum <- accum - h(a, b - 1, c, d) / (b - 1)
      b <- b - 1
    } else if (c > a) {
      accum <- accum - h(a, b, c - 1, d) / (c - 1)
      c <- c - 1
    } else if (d > b) {
      accum <- accum + h(a, b, c, d - 1) / (d - 1)
      d <- d - 1
    }
  }
  accum
}

pdiff <- function(k1, k2, n1, n2, a1 = 1, a2 = 1) {
  g(1 + k1, a1 + n1 - k1, 1 + k2, a2 + n2 - k2)
}

# Probability that the first proportion (5 successes out of 10 trials)
# is greater than the second (1 success out of 10 trials)
pdiff(k1 = 5, k2 = 1, n1 = 10, n2 = 10)
## [1] 0.9682663

It can also handle large values for \( n_1 \) and \( n_2 \):

pdiff(k1 = 420, k2 = 400, n1 = 1000, n2 = 1000)
## [1] 0.8182686

The results from the simulation, tolerance package, and this new function all agree, which is a good sign. I ended up using the pdiff function for my project.

A baseball example

Let’s apply this to a simple example. Suppose we want to estimate an “ability score” for every baseball team, so we can see which teams are better than others. We model each team’s ability score as a binomial random variable representing the team’s proportion of wins to total games. By feeding in each team’s win and loss record for a season, we’ll estimate a posterior distribution that we can use to rank pairs of teams by how likely they are to have different ability scores.

First, we filter for the team records from 2016:

library(tidyverse)
library(Lahman)
library(knitr)

teams <- Teams %>%
  filter(yearID == 2016) %>%
  select(teamID, W, L) %>%
  mutate(teamID = as.character(teamID),
         P = paste0(round(W / (W + L) * 100), "%"))

And generate a dataframe that has the win and loss numbers from every pair of teams:

teams_compared <- expand.grid(teams$teamID, teams$teamID, stringsAsFactors = FALSE) %>%
  rename(team1 = Var1, team2 = Var2) %>%
  filter(team1 != team2) %>%
  left_join(teams, by = c(team1 = "teamID")) %>%
  left_join(teams, by = c(team2 = "teamID"), suffix = c("_1", "_2"))

head(teams_compared)
##   team1 team2 W_1 L_1 P_1 W_2 L_2 P_2
## 1   ATL   ARI  68  93 42%  69  93 43%
## 2   BAL   ARI  89  73 55%  69  93 43%
## 3   BOS   ARI  93  69 57%  69  93 43%
## 4   CHA   ARI  78  84 48%  69  93 43%
## 5   CHN   ARI 103  58 64%  69  93 43%
## 6   CIN   ARI  68  94 42%  69  93 43%

Now, we can use our pdiff function to calculate the probability that team 1’s win percentage is greater than team 2’s win percentage.

teams_compared <- teams_compared %>%
  mutate(p_1_greater = pmap(list(k1 = W_1, k2 = W_2,
                                 n1 = W_1 + L_1, n2 = W_2 + L_2),
                            pdiff)) %>%
  unnest(p_1_greater)

Let’s see which teams are most likely to have a higher win percentage than other teams:

teams_compared %>%
  filter(P_1 > P_2) %>%
  arrange(-p_1_greater) %>%
  head()
##   team1 team2 W_1 L_1 P_1 W_2 L_2 P_2 p_1_greater
## 1   CHN   MIN 103  58 64%  59 103 36%   0.9999997
## 2   TEX   MIN  95  67 59%  59 103 36%   0.9999694
## 3   WAS   MIN  95  67 59%  59 103 36%   0.9999694
## 4   CHN   CIN 103  58 64%  68  94 42%   0.9999631
## 5   CHN   SDN 103  58 64%  68  94 42%   0.9999631
## 6   CHN   TBA 103  58 64%  68  94 42%   0.9999631

The Cubs (CHN), Rangers, and Nationals all have extremely high posterior probabilities of having a better win percentage than the Minnesota Twins.

Let’s see which teams are most likely to be the same:

teams_compared %>%
  filter(P_1 > P_2) %>%
  arrange(p_1_greater) %>%
  head()
##   team1 team2 W_1 L_1 P_1 W_2 L_2 P_2 p_1_greater
## 1   TEX   CLE  95  67 59%  94  67 58%   0.5186491
## 2   WAS   CLE  95  67 59%  94  67 58%   0.5186491
## 3   NYN   DET  87  75 54%  86  75 53%   0.5206036
## 4   SFN   DET  87  75 54%  86  75 53%   0.5206036
## 5   ARI   ATL  69  93 43%  68  93 42%   0.5257282
## 6   OAK   ATL  69  93 43%  68  93 42%   0.5257282

Texas and Cleveland had very similar records, so the posterior probability that Texas was better than Cleveland is only 52%.



Deep Learning With Keras To Predict Customer Churn

Thu, 01/11/2018 - 01:00

This is guest post contributed by Matt Dancho, CEO of Business Science. The post was originally published on the Business Science blog.

Customer churn is a problem that all companies need to monitor, especially those that depend on subscription-based revenue streams. The simple fact is that most organizations have data that can be used to target these individuals and to understand the key drivers of churn. We now have Keras for Deep Learning available in R (yes, in R!!), and in this post we use it to build a model that predicted customer churn with 82% accuracy.

We’re super excited for this article because we are using the new keras package to produce an Artificial Neural Network (ANN) model on the IBM Watson Telco Customer Churn Data Set! As with most business problems, it’s equally important to explain what features drive the model, which is why we’ll use the lime package for explainability. We cross-checked the LIME results with a Correlation Analysis using the corrr package.

In addition, we use three new packages to assist with Machine Learning (ML): recipes for preprocessing, rsample for sampling data, and yardstick for model metrics. These are relatively new additions to CRAN developed by Max Kuhn at RStudio (creator of the caret package). It seems that R is quickly developing ML tools that rival Python. Good news if you’re interested in applying Deep Learning in R! We are, so let’s get going!!

Customer Churn: Hurts Sales, Hurts Company

Customer churn refers to the situation when a customer ends their relationship with a company, and it’s a costly problem. Customers are the fuel that powers a business. Loss of customers impacts sales. Further, it’s much more difficult and costly to gain new customers than it is to retain existing customers. As a result, organizations need to focus on reducing customer churn.

The good news is that machine learning can help. For many businesses that offer subscription based services, it’s critical to both predict customer churn and explain what features relate to customer churn. Older techniques such as logistic regression can be less accurate than newer techniques such as deep learning, which is why we are going to show you how to model an ANN in R with the keras package.

Churn Modeling With Artificial Neural Networks (Keras)

Artificial Neural Networks (ANN) are now a staple within the sub-field of Machine Learning called Deep Learning. Deep learning algorithms can be vastly superior to traditional regression and classification methods (e.g. linear and logistic regression) because of the ability to model interactions between features that would otherwise go undetected. The challenge becomes explainability, which is often needed to support the business case. The good news is we get the best of both worlds with keras and lime.

IBM Watson Dataset (Where We Got The Data)

The dataset used for this tutorial is the IBM Watson Telco Dataset. According to IBM, the business challenge is…

A telecommunications company [Telco] is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why.

The dataset includes information about:

  • Customers who left within the last month: The column is called Churn
  • Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Customer account information: how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  • Demographic info about customers: gender, age range, and if they have partners and dependents
Deep Learning With Keras (What We Did With The Data)

In this example we show you how to use keras to develop a sophisticated and highly accurate deep learning model in R. We walk you through the preprocessing steps, investing time into how to format the data for Keras. We inspect the various classification metrics, and show that an un-tuned ANN model can easily get 82% accuracy on the unseen data. Here’s the deep learning training history visualization.
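The full model specification is in the original article; as a rough, hypothetical sketch of what a small multilayer perceptron for this kind of tabular churn data looks like with the R keras API, a definition might resemble the following (the layer sizes and the assumed number of predictors are illustrative placeholders, not the article's actual architecture or feature count).

library(keras)

n_features <- 35   # placeholder for the number of preprocessed predictors

model_sketch <- keras_model_sequential() %>%
  layer_dense(units = 16, activation = "relu", input_shape = n_features) %>%
  layer_dropout(rate = 0.1) %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dropout(rate = 0.1) %>%
  layer_dense(units = 1, activation = "sigmoid") %>%   # binary churn outcome
  compile(
    optimizer = "adam",
    loss      = "binary_crossentropy",
    metrics   = "accuracy"
  )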

We have some fun with preprocessing the data (yes, preprocessing can actually be fun and easy!). We use the new recipes package to simplify the preprocessing workflow.

We end by showing you how to explain the ANN with the lime package. Neural networks used to be frowned upon because of their “black box” nature, meaning that these sophisticated models (ANNs are highly accurate) are difficult to explain using traditional methods. Not any more with LIME! Here’s the feature importance visualization.

We also cross-checked the LIME results with a Correlation Analysis using the corrr package. Here’s the correlation visualization.

We even built a Shiny Application with a Customer Scorecard to monitor customer churn risk and to make recommendations on how to improve customer health! Feel free to take it for a spin.

Credits

We saw that just last week the same Telco customer churn dataset was used in the article, Predict Customer Churn – Logistic Regression, Decision Tree and Random Forest. We thought the article was excellent.

This article takes a different approach with Keras, LIME, Correlation Analysis, and a few other cutting edge packages. We encourage the readers to check out both articles because, although the problem is the same, both solutions are beneficial to those learning data science and advanced modeling.

Prerequisites

We use the following libraries in this tutorial:

Install the following packages with install.packages().

pkgs <- c("keras", "lime", "tidyquant", "rsample", "recipes", "yardstick", "corrr") install.packages(pkgs) Load Libraries

Load the libraries.

# Load libraries library(keras) library(lime) library(tidyquant) library(rsample) library(recipes) library(yardstick) library(corrr)

If you have not previously run Keras in R, you will need to install Keras using the install_keras() function.

# Install Keras if you have not installed before install_keras() Import Data

Download the IBM Watson Telco Data Set here. Next, use read_csv() to import the data into a nice tidy data frame. We use the glimpse() function to quickly inspect the data. We have the target “Churn” and all other variables are potential predictors. The raw data set needs to be cleaned and preprocessed for ML.

churn_data_raw <- read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv") glimpse(churn_data_raw) Observations: 7,043 Variables: 21 $ customerID "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "77... $ gender "Female", "Male", "Male", "Male", "Female", "... $ SeniorCitizen 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... $ Partner "Yes", "No", "No", "No", "No", "No", "No", "N... $ Dependents "No", "No", "No", "No", "No", "No", "Yes", "N... $ tenure 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 5... $ PhoneService "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"... $ MultipleLines "No phone service", "No", "No", "No phone ser... $ InternetService "DSL", "DSL", "DSL", "DSL", "Fiber optic", "F... $ OnlineSecurity "No", "Yes", "Yes", "Yes", "No", "No", "No", ... $ OnlineBackup "Yes", "No", "Yes", "No", "No", "No", "Yes", ... $ DeviceProtection "No", "Yes", "No", "Yes", "No", "Yes", "No", ... $ TechSupport "No", "No", "No", "Yes", "No", "No", "No", "N... $ StreamingTV "No", "No", "No", "No", "No", "Yes", "Yes", "... $ StreamingMovies "No", "No", "No", "No", "No", "Yes", "No", "N... $ Contract "Month-to-month", "One year", "Month-to-month... $ PaperlessBilling "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"... $ PaymentMethod "Electronic check", "Mailed check", "Mailed c... $ MonthlyCharges 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.... $ TotalCharges 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.... $ Churn "No", "No", "Yes", "No", "Yes", "Yes", "No", ... Preprocess Data

We’ll go through a few steps to preprocess the data for ML. First, we “prune” the data, which is nothing more than removing unnecessary columns and rows. Then we split into training and testing sets. After that we explore the training set to uncover transformations that will be needed for deep learning. We save the best for last. We end by preprocessing the data with the new recipes package.

Prune The Data

The data has a few columns and rows we’d like to remove:

  • The “customerID” column is a unique identifier for each observation that isn’t needed for modeling. We can de-select this column.
  • The data has 11 NA values all in the “TotalCharges” column. Because it’s such a small percentage of the total population (99.8% complete cases), we can drop these observations with the drop_na() function from tidyr. Note that these may be customers that have not yet been charged, and therefore an alternative is to replace with zero or -99 to segregate this population from the rest.
  • My preference is to have the target in the first column, so we'll include a final select() operation to do so.

We’ll perform the cleaning operation with one tidyverse pipe (%>%) chain.

# Remove unnecessary data churn_data_tbl <- churn_data_raw %>% select(-customerID) %>% drop_na() %>% select(Churn, everything()) glimpse(churn_data_tbl) Observations: 7,032 Variables: 20 $ Churn "No", "No", "Yes", "No", "Yes", "Yes", "No", ... $ gender "Female", "Male", "Male", "Male", "Female", "... $ SeniorCitizen 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... $ Partner "Yes", "No", "No", "No", "No", "No", "No", "N... $ Dependents "No", "No", "No", "No", "No", "No", "Yes", "N... $ tenure 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 5... $ PhoneService "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"... $ MultipleLines "No phone service", "No", "No", "No phone ser... $ InternetService "DSL", "DSL", "DSL", "DSL", "Fiber optic", "F... $ OnlineSecurity "No", "Yes", "Yes", "Yes", "No", "No", "No", ... $ OnlineBackup "Yes", "No", "Yes", "No", "No", "No", "Yes", ... $ DeviceProtection "No", "Yes", "No", "Yes", "No", "Yes", "No", ... $ TechSupport "No", "No", "No", "Yes", "No", "No", "No", "N... $ StreamingTV "No", "No", "No", "No", "No", "Yes", "Yes", "... $ StreamingMovies "No", "No", "No", "No", "No", "Yes", "No", "N... $ Contract "Month-to-month", "One year", "Month-to-month... $ PaperlessBilling "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"... $ PaymentMethod "Electronic check", "Mailed check", "Mailed c... $ MonthlyCharges 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.... $ TotalCharges 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.. Split Into Train/Test Sets

We have a new package, rsample, which is very useful for sampling methods. It has the initial_split() function for splitting data sets into training and testing sets. The return is a special rsplit object.

# Split test/training sets set.seed(100) train_test_split <- initial_split(churn_data_tbl, prop = 0.8) train_test_split <5626/1406/7032>

We can retrieve our training and testing sets using training() and testing() functions.

# Retrieve train and test sets train_tbl <- training(train_test_split) test_tbl <- testing(train_test_split) Exploration: What Transformation Steps Are Needed For ML?

This phase of the analysis is often called exploratory analysis, but basically we are trying to answer the question, “What steps are needed to prepare for ML?” The key concept is knowing what transformations are needed to run the algorithm most effectively. Artificial Neural Networks are best when the data is one-hot encoded, scaled and centered. In addition, other transformations may be beneficial as well to make relationships easier for the algorithm to identify. A full exploratory analysis is not practical in this article. With that said we’ll cover a few tips on transformations that can help as they relate to this dataset. In the next section, we will implement the preprocessing techniques.

Discretize The “tenure” Feature

Numeric features like age, years worked, or length of time in a position can be used to generalize a group (or cohort). We see this in marketing a lot (think "millennials", which identifies a group born in a certain timeframe). The "tenure" feature falls into this category of numeric features that can be discretized into groups.

We can split tenure into six cohorts that divide up the user base in roughly one-year (12-month) increments. This should help the ML algorithm detect whether a group is more or less susceptible to customer churn.
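As a rough preview of the idea (the actual binning is handled by step_discretize() in the recipes workflow below), base R's cut() can show what six roughly equal-width tenure cohorts look like. This is only an illustrative sketch:

# Illustrative sketch only: preview six tenure cohorts with base R's cut()
tenure_cohorts <- cut(train_tbl$tenure, breaks = 6)
table(tenure_cohorts)  # counts of customers per cohort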

Transform The “TotalCharges” Feature

What we don’t like to see is when a lot of observations are bunched within a small part of the range.

We can use a log transformation to even out the data into more of a normal distribution. It’s not perfect, but it’s quick and easy to get our data spread out a bit more.

Pro Tip: A quick test is to see if the log transformation increases the magnitude of the correlation between “TotalCharges” and “Churn”. We’ll use a few dplyr operations along with the corrr package to perform a quick correlation.

  • correlate(): Performs tidy correlations on numeric data
  • focus(): Similar to select(). Takes columns and focuses on only the rows/columns of importance.
  • fashion(): Makes the formatting aesthetically easier to read.
# Determine if log transformation improves correlation # between TotalCharges and Churn train_tbl %>% select(Churn, TotalCharges) %>% mutate( Churn = Churn %>% as.factor() %>% as.numeric(), LogTotalCharges = log(TotalCharges) ) %>% correlate() %>% focus(Churn) %>% fashion() rowname Churn 1 TotalCharges -.20 2 LogTotalCharges -.25

The correlation between "Churn" and "LogTotalCharges" is greater in magnitude, indicating the log transformation should improve the accuracy of the ANN model we build. Therefore, we should perform the log transformation.

One-Hot Encoding

One-hot encoding is the process of converting categorical data to sparse data, which has columns of only zeros and ones (this is also called creating "dummy variables" or a "design matrix"). All non-numeric data will need to be converted to dummy variables. This is simple for binary Yes/No data because we can simply convert to 1's and 0's. It becomes slightly more complicated with multiple categories, which requires creating new columns of 1's and 0's for each category (actually one fewer than the number of categories). We have four features that are multi-category: Contract, Internet Service, Multiple Lines, and Payment Method.
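To make the "one fewer column" point concrete, here is a minimal sketch with base R's model.matrix() on a toy factor (the example values are made up):

# Toy sketch: dummy-encoding a three-level factor produces two indicator columns
contract_toy <- factor(c("Month-to-month", "One year", "Two year", "One year"))
model.matrix(~ contract_toy)
# The first level ("Month-to-month") becomes the baseline captured by the intercept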

Feature Scaling

ANNs typically train faster and often with higher accuracy when the features are scaled and/or normalized (i.e. centered and scaled, also known as standardized). Because ANNs use gradient descent, weights tend to update faster with standardized features. According to Sebastian Raschka, an expert in the field of Deep Learning, several examples where feature scaling is important are:

  • k-nearest neighbors with a Euclidean distance measure, if you want all features to contribute equally
  • k-means (see k-nearest neighbors)
  • logistic regression, SVMs, perceptrons, neural networks etc. if you are using gradient descent/ascent-based optimization, otherwise some weights will update much faster than others
  • linear discriminant analysis, principal component analysis, kernel principal component analysis since you want to find directions of maximizing the variance (under the constraints that those directions/eigenvectors/principal components are orthogonal); you want to have features on the same scale since you’d emphasize variables on “larger measurement scales” more. There are many more cases than I can possibly list here … I always recommend you to think about the algorithm and what it’s doing, and then it typically becomes obvious whether we want to scale your features or not.

The interested reader can read Sebastian Raschka’s article for a full discussion on the scaling/normalization topic. Pro Tip: When in doubt, standardize the data.
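As a quick sketch of what centering and scaling does (the recipes steps below handle this for us automatically), base R's scale() subtracts the mean and divides by the standard deviation:

# Sketch: standardize one numeric feature so it has mean 0 and standard deviation 1
monthly_scaled <- as.numeric(scale(train_tbl$MonthlyCharges))
round(c(mean = mean(monthly_scaled), sd = sd(monthly_scaled)), 2)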

Preprocessing With Recipes

Let's implement the preprocessing steps/transformations uncovered during our exploration. Max Kuhn (creator of caret) has been putting some work into R's ML tooling lately, and the payoff is beginning to take shape. A new package, recipes, makes creating ML data preprocessing workflows a breeze! It takes a little getting used to, but I've found that it really helps manage the preprocessing steps. We'll go over the nitty gritty as it applies to this problem.

Step 1: Create A Recipe

A “recipe” is nothing more than a series of steps you would like to perform on the training, testing and/or validation sets. Think of preprocessing data like baking a cake (I’m not a baker but stay with me). The recipe is our steps to make the cake. It doesn’t do anything other than create the playbook for baking.

We use the recipe() function to implement our preprocessing steps. The function takes a familiar formula argument, such as Churn ~ ., meaning "Churn" is the outcome (aka response or target) and all other features are predictors. The function also takes a data argument, which supplies a template data set (here the training set) so the recipe knows which variables and types the steps will be applied to during baking (next).

A recipe is not very useful until we add “steps”, which are used to transform the data during baking. The package contains a number of useful “step functions” that can be applied. The entire list of Step Functions can be viewed here. For our model, we use:

  1. step_discretize() with options = list(cuts = 6) to cut the continuous "tenure" variable (length of time as a customer) into six groups, placing customers into cohorts.
  2. step_log() to log transform “TotalCharges”.
  3. step_dummy() to one-hot encode the categorical data. Note that this adds columns of one/zero for categorical data with three or more categories.
  4. step_center() to mean-center the data.
  5. step_scale() to scale the data.

The last step is to prepare the recipe with the prep() function. This step is used to “estimate the required parameters from a training set that can later be applied to other data sets”. This is important for centering and scaling and other functions that use parameters defined from the training set.

Here’s how simple it is to implement the preprocessing steps that we went over!

# Create recipe rec_obj <- recipe(Churn ~ ., data = train_tbl) %>% step_discretize(tenure, options = list(cuts = 6)) %>% step_log(TotalCharges) %>% step_dummy(all_nominal(), -all_outcomes()) %>% step_center(all_predictors(), -all_outcomes()) %>% step_scale(all_predictors(), -all_outcomes()) %>% prep(data = train_tbl)

We can print the recipe object if we ever forget what steps were used to prepare the data. Pro Tip: We can save the recipe object as an RDS file using saveRDS(), and then use it to bake() (discussed next) future raw data into ML-ready data in production!
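A minimal sketch of that production pattern (the file name and the new_customers_raw data frame are hypothetical):

# Save the prepared recipe once...
saveRDS(rec_obj, file = "rec_obj.rds")
# ...then reload it later and bake new raw data into ML-ready features
rec_obj_prod <- readRDS("rec_obj.rds")
# new_customers_raw is a hypothetical data frame with the same raw columns as the training data
# x_new <- bake(rec_obj_prod, newdata = new_customers_raw)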

# Print the recipe object rec_obj Data Recipe Inputs: role #variables outcome 1 predictor 19 Training data contained 5626 data points and no missing data. Steps: Dummy variables from tenure [trained] Log transformation on TotalCharges [trained] Dummy variables from ~gender, ~Partner, ... [trained] Centering for SeniorCitizen, ... [trained] Scaling for SeniorCitizen, ... [trained] Step 2: Baking With Your Recipe

Now for the fun part! We can apply the "recipe" to any data set with the bake() function, and it processes the data following our recipe steps. We'll apply it to our training and testing data to convert them from raw data into machine learning datasets. Check out the training set with glimpse(). Now that's an ML-ready dataset prepared for ANN modeling!

# Predictors x_train_tbl <- bake(rec_obj, newdata = train_tbl) %>% select(-Churn) x_test_tbl <- bake(rec_obj, newdata = test_tbl) %>% select(-Churn) glimpse(x_train_tbl) Observations: 5,626 Variables: 35 $ SeniorCitizen -0.4351959, -0.4351... $ MonthlyCharges -1.1575972, -0.2601... $ TotalCharges -2.275819130, 0.389... $ gender_Male -1.0016900, 0.99813... $ Partner_Yes 1.0262054, -0.97429... $ Dependents_Yes -0.6507747, -0.6507... $ tenure_bin1 2.1677790, -0.46121... $ tenure_bin2 -0.4389453, -0.4389... $ tenure_bin3 -0.4481273, -0.4481... $ tenure_bin4 -0.4509837, 2.21698... $ tenure_bin5 -0.4498419, -0.4498... $ tenure_bin6 -0.4337508, -0.4337... $ PhoneService_Yes -3.0407367, 0.32880... $ MultipleLines_No.phone.service 3.0407367, -0.32880... $ MultipleLines_Yes -0.8571364, -0.8571... $ InternetService_Fiber.optic -0.8884255, -0.8884... $ InternetService_No -0.5272627, -0.5272... $ OnlineSecurity_No.internet.service -0.5272627, -0.5272... $ OnlineSecurity_Yes -0.6369654, 1.56966... $ OnlineBackup_No.internet.service -0.5272627, -0.5272... $ OnlineBackup_Yes 1.3771987, -0.72598... $ DeviceProtection_No.internet.service -0.5272627, -0.5272... $ DeviceProtection_Yes -0.7259826, 1.37719... $ TechSupport_No.internet.service -0.5272627, -0.5272... $ TechSupport_Yes -0.6358628, -0.6358... $ StreamingTV_No.internet.service -0.5272627, -0.5272... $ StreamingTV_Yes -0.7917326, -0.7917... $ StreamingMovies_No.internet.service -0.5272627, -0.5272... $ StreamingMovies_Yes -0.797388, -0.79738... $ Contract_One.year -0.5156834, 1.93882... $ Contract_Two.year -0.5618358, -0.5618... $ PaperlessBilling_Yes 0.8330334, -1.20021... $ PaymentMethod_Credit.card..automatic. -0.5231315, -0.5231... $ PaymentMethod_Electronic.check 1.4154085, -0.70638... $ PaymentMethod_Mailed.check -0.5517013, 1.81225... Step 3: Don’t Forget The Target

One last step, we need to store the actual values (truth) as y_train_vec and y_test_vec, which are needed for modeling our ANN. We convert to a series of numeric ones and zeros which can be accepted by the Keras ANN modeling functions. We add “vec” to the name so we can easily remember the class of the object (it’s easy to get confused when working with tibbles, vectors, and matrix data types).

# Response variables for training and testing sets y_train_vec <- ifelse(pull(train_tbl, Churn) == "Yes", 1, 0) y_test_vec <- ifelse(pull(test_tbl, Churn) == "Yes", 1, 0) Model Customer Churn With Keras (Deep Learning)

This is super exciting!! Finally, Deep Learning with Keras in R! The team at RStudio has done fantastic work recently to create the keras package, which implements Keras in R. Very cool!

Background On Artificial Neural Networks

For those unfamiliar with Neural Networks (and those that need a refresher), read this article. It’s very comprehensive, and you’ll leave with a general understanding of the types of deep learning and how they work.

Source: Xenon Stack

Deep Learning has been available in R for some time, but the primary packages used in the wild have not been (this includes Keras, TensorFlow, Theano, etc., which are all Python libraries). It's worth mentioning that a number of other Deep Learning packages exist in R including h2o, mxnet, and others. The interested reader can check out this blog post for a comparison of deep learning packages in R.

Building A Deep Learning Model

We're going to build a special class of ANN called a Multi-Layer Perceptron (MLP). MLPs are one of the simplest forms of deep learning, but they are both highly accurate and serve as a jumping-off point for more complex algorithms. MLPs are quite versatile as they can be used for regression, binary classification, and multi-class classification (and are typically quite good at classification problems).

We'll build a three-layer MLP with Keras. Let's walk through the steps before we implement them in R.

  1. Initialize a sequential model: The first step is to initialize a sequential model with keras_model_sequential(), which is the beginning of our Keras model. The sequential model is composed of a linear stack of layers.

  2. Apply layers to the sequential model: Layers consist of the input layer, hidden layers and an output layer. The input layer is the data and, provided it's formatted correctly, there's nothing more to discuss. The hidden layers and output layer are what control the ANN's inner workings.

    • Hidden Layers: Hidden layers form the neural network nodes that enable non-linear activation using weights. The hidden layers are created using layer_dense(). We’ll add two hidden layers. We’ll apply units = 16, which is the number of nodes. We’ll select kernel_initializer = "uniform" and activation = "relu" for both layers. The first layer needs to have the input_shape = 35, which is the number of columns in the training set. Key Point: While we are arbitrarily selecting the number of hidden layers, units, kernel initializers and activation functions, these parameters can be optimized through a process called hyperparameter tuning that is discussed in Next Steps.

    • Dropout Layers: Dropout layers are used to control overfitting. During training, dropout randomly sets a fraction of the layer's units to zero, which keeps the network from relying too heavily on any one node. We use the layer_dropout() function to add two dropout layers with rate = 0.10, which drops 10% of the units on each training update.

    • Output Layer: The output layer specifies the shape of the output and how the learned information is combined into a prediction. The output layer is applied using layer_dense(). For binary outcomes, the shape should be units = 1. For multi-class classification, the units should correspond to the number of classes. We set kernel_initializer = "uniform" and activation = "sigmoid" (common for binary classification).

  3. Compile the model: The last step is to compile the model with compile(). We’ll use optimizer = "adam", which is one of the most popular optimization algorithms. We select loss = "binary_crossentropy" since this is a binary classification problem. We’ll select metrics = c("accuracy") to be evaluated during training and testing. Key Point: The optimizer is often included in the tuning process.

Let’s codify the discussion above to build our Keras MLP-flavored ANN model.

# Building our Artificial Neural Network model_keras <- keras_model_sequential() model_keras %>% # First hidden layer layer_dense( units = 16, kernel_initializer = "uniform", activation = "relu", input_shape = ncol(x_train_tbl)) %>% # Dropout to prevent overfitting layer_dropout(rate = 0.1) %>% # Second hidden layer layer_dense( units = 16, kernel_initializer = "uniform", activation = "relu") %>% # Dropout to prevent overfitting layer_dropout(rate = 0.1) %>% # Output layer layer_dense( units = 1, kernel_initializer = "uniform", activation = "sigmoid") %>% # Compile ANN compile( optimizer = 'adam', loss = 'binary_crossentropy', metrics = c('accuracy') ) keras_model Model ___________________________________________________________________________________________________ Layer (type) Output Shape Param # =================================================================================================== dense_1 (Dense) (None, 16) 576 ___________________________________________________________________________________________________ dropout_1 (Dropout) (None, 16) 0 ___________________________________________________________________________________________________ dense_2 (Dense) (None, 16) 272 ___________________________________________________________________________________________________ dropout_2 (Dropout) (None, 16) 0 ___________________________________________________________________________________________________ dense_3 (Dense) (None, 1) 17 =================================================================================================== Total params: 865 Trainable params: 865 Non-trainable params: 0 ___________________________________________________________________________________________________

We use the fit() function to run the ANN on our training data. The object is our model, and x and y are our training data in matrix and numeric vector forms, respectively. The batch_size = 50 sets the number of samples per gradient update within each epoch. We set epochs = 35 to control the number of training cycles. Typically we want to keep the batch size high since this decreases the error within each training cycle (epoch). We also want epochs to be large, which is important in visualizing the training history (discussed below). We set validation_split = 0.30 to hold out 30% of the training data for model validation, which helps us monitor for overfitting. The training process should complete in 15 seconds or so.

# Fit the keras model to the training data history <- fit( object = model_keras, x = as.matrix(x_train_tbl), y = y_train_vec, batch_size = 50, epochs = 35, validation_split = 0.30 )

We can inspect the training history. We want to make sure there is minimal difference between the validation accuracy and the training accuracy.

# Print a summary of the training history print(history) Trained on 3,938 samples, validated on 1,688 samples (batch_size=50, epochs=35) Final epoch (plot to see history): val_loss: 0.4215 val_acc: 0.8057 loss: 0.399 acc: 0.8101

We can visualize the Keras training history using the plot() function. What we want to see is the validation accuracy and loss leveling off, which means the model has completed training. We see that there is some divergence between training loss/accuracy and validation loss/accuracy. This indicates we could possibly stop training at an earlier epoch. Pro Tip: Only use enough epochs to get a high validation accuracy. Once the validation accuracy curve begins to flatten or decrease, it's time to stop training.
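If you would rather automate that rule of thumb, keras supports early stopping through a callback. The sketch below is not part of this tutorial's workflow; it simply shows how a callback could be added to fit():

# Sketch: stop training once validation loss stops improving for 5 consecutive epochs
early_stop <- callback_early_stopping(monitor = "val_loss", patience = 5)
history_es <- fit(
    object           = model_keras,
    x                = as.matrix(x_train_tbl),
    y                = y_train_vec,
    batch_size       = 50,
    epochs           = 35,
    validation_split = 0.30,
    callbacks        = list(early_stop)
)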

# Plot the training/validation history of our Keras model plot(history) Making Predictions

We’ve got a good model based on the validation accuracy. Now let’s make some predictions from our keras model on the test data set, which was unseen during modeling (we use this for the true performance assessment). We have two functions to generate predictions:

  • predict_classes(): Generates class values as a matrix of ones and zeros. Since we are dealing with binary classification, we’ll convert the output to a vector.
  • predict_proba(): Generates the class probabilities as a numeric matrix indicating the probability of being a class. Again, we convert to a numeric vector because there is only one column output.
# Predicted Class yhat_keras_class_vec <- predict_classes(object = model_keras, x = as.matrix(x_test_tbl)) %>% as.vector() # Predicted Class Probability yhat_keras_prob_vec <- predict_proba(object = model_keras, x = as.matrix(x_test_tbl)) %>% as.vector() Inspect Performance With Yardstick

The yardstick package has a collection of handy functions for measuring performance of machine learning models. We’ll overview some metrics we can use to understand the performance of our model.

First, let’s get the data formatted for yardstick. We create a data frame with the truth (actual values as factors), estimate (predicted values as factors), and the class probability (probability of yes as numeric). We use the fct_recode() function from the forcats package to assist with recoding as Yes/No values.

# Format test data and predictions for yardstick metrics estimates_keras_tbl <- tibble( truth = as.factor(y_test_vec) %>% fct_recode(yes = "1", no = "0"), estimate = as.factor(yhat_keras_class_vec) %>% fct_recode(yes = "1", no = "0"), class_prob = yhat_keras_prob_vec ) estimates_keras_tbl # A tibble: 1,406 x 3 truth estimate class_prob 1 yes no 0.328355074 2 yes yes 0.633630514 3 no no 0.004589651 4 no no 0.007402068 5 no no 0.049968336 6 no no 0.116824441 7 no yes 0.775479317 8 no no 0.492996633 9 no no 0.011550998 10 no no 0.004276015 # ... with 1,396 more rows

Now that we have the data formatted, we can take advantage of the yardstick package. The only other thing we need to do is to set options(yardstick.event_first = FALSE). As pointed out by ad1729 in GitHub Issue 13, the default is to classify 0 as the positive class instead of 1.

options(yardstick.event_first = FALSE) Confusion Table

We can use the conf_mat() function to get the confusion table. We see that the model was by no means perfect, but it did a decent job of identifying customers likely to churn.

# Confusion Table estimates_keras_tbl %>% conf_mat(truth, estimate) Truth Prediction no yes no 950 161 yes 99 196 Accuracy

We can use the metrics() function to get an accuracy measurement from the test set. We are getting roughly 82% accuracy.

# Accuracy estimates_keras_tbl %>% metrics(truth, estimate) # A tibble: 1 x 1 accuracy 1 0.8150782 AUC

We can also get the ROC Area Under the Curve (AUC) measurement. AUC is often a good metric used to compare different classifiers and to compare to randomly guessing (AUC_random = 0.50). Our model has AUC = 0.85, which is much better than randomly guessing. Tuning and testing different classification algorithms may yield even better results.

# AUC estimates_keras_tbl %>% roc_auc(truth, class_prob) [1] 0.8523951 Precision And Recall

Precision tells us, when the model predicts "yes", how often it is actually "yes". Recall (also known as sensitivity or the true positive rate) tells us, when the actual value is "yes", how often the model is correct. We can get precision() and recall() measurements using yardstick.

# Precision tibble( precision = estimates_keras_tbl %>% precision(truth, estimate), recall = estimates_keras_tbl %>% recall(truth, estimate) ) # A tibble: 1 x 2 precision recall 1 0.6644068 0.5490196

Precision and recall are very important to the business case: The organization is concerned with balancing the cost of targeting and retaining customers at risk of leaving with the cost of inadvertently targeting customers that are not planning to leave (and potentially decreasing revenue from this group). The threshold above which to predict Churn = "Yes" can be adjusted to optimize for the business problem. This becomes a Customer Lifetime Value optimization problem that is discussed further in Next Steps.
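As a simple sketch of that idea, we could re-classify the test set at a lower cutoff (the 0.30 value here is arbitrary, chosen only for illustration):

# Sketch: lower the churn threshold from 0.50 to 0.30 to flag more at-risk customers
yhat_threshold_030 <- ifelse(yhat_keras_prob_vec >= 0.30, "yes", "no")
table(truth = estimates_keras_tbl$truth, estimate_030 = yhat_threshold_030)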

F1 Score

We can also get the F1-score, which (with beta = 1) is the harmonic mean of precision and recall. Machine learning classifier thresholds are often adjusted to maximize the F1-score. However, this is often not the optimal solution to the business problem.
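We can verify the harmonic-mean relationship by hand using the precision and recall reported above:

# F1 as the harmonic mean of the precision and recall computed earlier
precision <- 0.6644068
recall    <- 0.5490196
2 * precision * recall / (precision + recall)  # ~0.601, matching f_meas() below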

# F1-Statistic estimates_keras_tbl %>% f_meas(truth, estimate, beta = 1) [1] 0.601227 Explain The Model With LIME

LIME stands for Local Interpretable Model-agnostic Explanations, and is a method for explaining black-box machine learning model classifiers. For those new to LIME, this YouTube video does a really nice job explaining how LIME helps to identify feature importance with black box machine learning models (e.g. deep learning, stacked ensembles, random forest).


Setup

The lime package implements LIME in R. One thing to note is that it's not set up out-of-the-box to work with keras. The good news is that with a few functions we can get everything working properly. We'll need to make two custom functions:

  • model_type: Used to tell lime what type of model we are dealing with. It could be classification, regression, survival, etc.

  • predict_model: Used to allow lime to perform predictions that its algorithm can interpret.

The first thing we need to do is identify the class of our model object. We do this with the class() function.

class(model_keras) [1] "keras.models.Sequential" [2] "keras.engine.training.Model" [3] "keras.engine.topology.Container" [4] "keras.engine.topology.Layer" [5] "python.builtin.object"

Next we create our model_type() function. Its only input is x, the keras model. The function simply returns "classification", which tells LIME we are classifying.

# Setup lime::model_type() function for keras model_type.keras.models.Sequential <- function(x, ...) { "classification" }

Now we can create our predict_model() function, which wraps keras::predict_proba(). The trick here is to realize that its inputs must be x (a model), newdata (a data frame object, this is important), and type (which is not used here but can be used to switch the output type). The output is also a little tricky because it must be in the format of probabilities by classification (this is important; shown next).

# Setup lime::predict_model() function for keras predict_model.keras.models.Sequential <- function(x, newdata, type, ...) { pred <- predict_proba(object = x, x = as.matrix(newdata)) data.frame(Yes = pred, No = 1 - pred) }

Run this next script to see what the output looks like and to test our predict_model() function. Notice that the output is a probability for each class. It must be in this form for model_type = "classification".

# Test our predict_model() function predict_model(x = model_keras, newdata = x_test_tbl, type = 'raw') %>% tibble::as_tibble() # A tibble: 1,406 x 2 Yes No 1 0.328355074 0.6716449 2 0.633630514 0.3663695 3 0.004589651 0.9954103 4 0.007402068 0.9925979 5 0.049968336 0.9500317 6 0.116824441 0.8831756 7 0.775479317 0.2245207 8 0.492996633 0.5070034 9 0.011550998 0.9884490 10 0.004276015 0.9957240 # ... with 1,396 more rows

Now the fun part: we create an explainer using the lime() function. Just pass the training set of predictors (the target, "Churn", is already excluded from x_train_tbl). The input must be a data frame, which is OK since our predict_model() function converts it to the matrix form keras expects. Set model = model_keras, our Keras model, and bin_continuous = FALSE. We could tell the algorithm to bin continuous variables, but this may not make sense for categorical numeric data that we didn't change to factors.

# Run lime() on training set explainer <- lime::lime( x = x_train_tbl, model = model_keras, bin_continuous = FALSE )

Now we run the explain() function, which returns our explanation. This can take a minute to run so we limit it to just the first ten rows of the test data set. We set n_labels = 1 because we care about explaining a single class. Setting n_features = 4 returns the top four features that are critical to each case. Finally, setting kernel_width = 0.5 allows us to increase the “model_r2” value by shrinking the localized evaluation.

# Run explain() on explainer explanation <- lime::explain( x_test_tbl[1:10, ], explainer = explainer, n_labels = 1, n_features = 4, kernel_width = 0.5 ) Feature Importance Visualization

The payoff for the work we put in using LIME is this feature importance plot. This allows us to visualize each of the first ten cases (observations) from the test data. The top four features for each case are shown. Note that they are not the same for each case. The green bars mean that the feature supports the model conclusion, and the red bars contradict. A few important features based on frequency in first ten cases:

  • Tenure (7 cases)
  • Senior Citizen (5 cases)
  • Online Security (4 cases)
plot_features(explanation) + labs(title = "LIME Feature Importance Visualization", subtitle = "Hold Out (Test) Set, First 10 Cases Shown")

Another excellent visualization can be performed using plot_explanations(), which produces a faceted heatmap of all case/label/feature combinations. It's a more condensed version of plot_features(), but we need to be careful because it does not provide exact statistics and it makes it harder to investigate binned features (notice that "tenure" would not be identified as a contributor even though it shows up as a top feature in 7 of 10 cases).

plot_explanations(explanation) + labs(title = "LIME Feature Importance Heatmap", subtitle = "Hold Out (Test) Set, First 10 Cases Shown") Check Explanations With Correlation Analysis

One thing we need to be careful about with the LIME visualization is that we are only examining a sample of the data, in our case the first 10 test observations. Therefore, we are gaining a very localized understanding of how the ANN works. However, we also want to know, from a global perspective, what drives feature importance.

We can perform a correlation analysis on the training set as well to help glean what features correlate globally to “Churn”. We’ll use the corrr package, which performs tidy correlations with the function correlate(). We can get the correlations as follows.

# Feature correlations to Churn corrr_analysis <- x_train_tbl %>% mutate(Churn = y_train_vec) %>% correlate() %>% focus(Churn) %>% rename(feature = rowname) %>% arrange(abs(Churn)) %>% mutate(feature = as_factor(feature)) corrr_analysis # A tibble: 35 x 2 feature Churn 1 gender_Male -0.006690899 2 tenure_bin3 -0.009557165 3 MultipleLines_No.phone.service -0.016950072 4 PhoneService_Yes 0.016950072 5 MultipleLines_Yes 0.032103354 6 StreamingTV_Yes 0.066192594 7 StreamingMovies_Yes 0.067643871 8 DeviceProtection_Yes -0.073301197 9 tenure_bin4 -0.073371838 10 PaymentMethod_Mailed.check -0.080451164 # ... with 25 more rows

The correlation visualization helps in distinguishing which features are relevant to Churn.

# Correlation visualization corrr_analysis %>% ggplot(aes(x = Churn, y = fct_reorder(feature, desc(Churn)))) + geom_point() + # Positive Correlations - Contribute to churn geom_segment(aes(xend = 0, yend = feature), color = palette_light()[[2]], data = corrr_analysis %>% filter(Churn > 0)) + geom_point(color = palette_light()[[2]], data = corrr_analysis %>% filter(Churn > 0)) + # Negative Correlations - Prevent churn geom_segment(aes(xend = 0, yend = feature), color = palette_light()[[1]], data = corrr_analysis %>% filter(Churn < 0)) + geom_point(color = palette_light()[[1]], data = corrr_analysis %>% filter(Churn < 0)) + # Vertical lines geom_vline(xintercept = 0, color = palette_light()[[5]], size = 1, linetype = 2) + geom_vline(xintercept = -0.25, color = palette_light()[[5]], size = 1, linetype = 2) + geom_vline(xintercept = 0.25, color = palette_light()[[5]], size = 1, linetype = 2) + # Aesthetics theme_tq() + labs(title = "Churn Correlation Analysis", subtitle = "Positive Correlations (contribute to churn), Negative Correlations (prevent churn)", y = "Feature Importance")

The correlation analysis helps us quickly identify features that the LIME analysis may be missing. We can see that the following features are highly correlated (magnitude > 0.25):

Increases Likelihood of Churn (Red): – Tenure = Bin 1 (<12 Months) – Internet Service = “Fiber Optic” – Payment Method = “Electronic Check”

Decreases Likelihood of Churn (Blue): – Contract = "Two Year" – Total Charges (Note that this may be a byproduct of additional services such as Online Security)

Feature Investigation

We can investigate features that are most frequent in the LIME feature importance visualization along with those that the correlation analysis shows an above normal magnitude. We’ll investigate:

  • Tenure (7/10 LIME Cases, Highly Correlated)
  • Contract (Highly Correlated)
  • Internet Service (Highly Correlated)
  • Payment Method (Highly Correlated)
  • Senior Citizen (5/10 LIME Cases)
  • Online Security (4/10 LIME Cases)
Tenure (7/10 LIME Cases, Highly Correlated)

LIME cases indicate that the ANN model is using this feature frequently and high correlation agrees that this is important. Investigating the feature distribution, it appears that customers with lower tenure (bin 1) are more likely to leave. Opportunity: Target customers with less than 12 month tenure.

Contract (Highly Correlated)

While LIME did not indicate this as a primary feature in the first 10 cases, the feature is clearly correlated with those electing to stay. Customers with one and two year contracts are much less likely to churn. Opportunity: Offer promotion to switch to long term contracts.

Internet Service (Highly Correlated)

While LIME did not indicate this as a primary feature in the first 10 cases, the feature is clearly correlated with churn. Customers with fiber optic service are more likely to churn while those with no internet service are less likely to churn. Improvement Area: Customers may be dissatisfied with fiber optic service.

Payment Method (Highly Correlated)

While LIME did not indicate this as a primary feature in the first 10 cases, the feature is clearly correlated with churn. Customers paying by electronic check are more likely to leave. Opportunity: Offer customers a promotion to switch to automatic payments.

Senior Citizen (5/10 LIME Cases)

Senior citizen appeared in several of the LIME cases, indicating it was important to the ANN for the 10 samples. However, it was not highly correlated to Churn, which may indicate that the ANN is using it in a more sophisticated manner (e.g. as an interaction). It's difficult to say that senior citizens are more likely to leave, but non-senior citizens appear less at risk of churning. Opportunity: Target users in the lower age demographic.

Online Security (4/10 LIME Cases)

Customers that did not sign up for online security were more likely to leave while customers with no internet service or online security were less likely to leave. Opportunity: Promote online security and other packages that increase retention rates.

Next Steps: Business Science University

We’ve just scratched the surface with the solution to this problem, but unfortunately there’s only so much ground we can cover in an article. Here are a few next steps that I’m pleased to announce will be covered in a Business Science University course coming in 2018!

Customer Lifetime Value

Your organization needs to see the financial benefit, so always tie your analysis to sales, profitability or ROI. Customer Lifetime Value (CLV) is a methodology that ties business profitability to the retention rate. While we did not implement the CLV methodology herein, a full customer churn analysis would tie churn to a classification cutoff (threshold) optimization to maximize the CLV with the predictive ANN model.

The simplified CLV model is:

\[
CLV = GC \times \frac{1}{1 + d - r}
\]

Where,

  • GC is the gross contribution per customer
  • d is the annual discount rate
  • r is the retention rate
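As a purely illustrative sketch (all numbers below are made up), the simplified CLV formula is straightforward to compute:

# Hypothetical inputs: $500 gross contribution, 10% discount rate, 80% retention rate
GC <- 500
d  <- 0.10
r  <- 0.80
CLV <- GC * 1 / (1 + d - r)
CLV  # about 1667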
ANN Performance Evaluation and Improvement

The ANN model we built is good, but it could be better. How we understand our model accuracy and improve on it is through the combination of two techniques:

  • K-Fold Cross Validation: Used to obtain bounds for accuracy estimates.
  • Hyper Parameter Tuning: Used to improve model performance by searching for the best parameters possible.

We need to implement K-Fold Cross Validation and Hyper Parameter Tuning if we want a best-in-class model.
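As a sketch of the first technique, the rsample package can generate the folds; the modeling loop over each fold is omitted here:

# Sketch: create 10-fold cross-validation splits of the training data
cv_folds <- vfold_cv(train_tbl, v = 10)
cv_folds
# analysis(cv_folds$splits[[1]]) and assessment(cv_folds$splits[[1]])
# retrieve the data for the first fold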

Distributing Analytics

It’s critical to communicate data science insights to decision makers in the organization. Most decision makers in organizations are not data scientists, but these individuals make important decisions on a day-to-day basis. The Shiny application below includes a Customer Scorecard to monitor customer health (risk of churn).

Business Science University

You’re probably wondering why we are going into so much detail on next steps. We are happy to announce a new project for 2018: Business Science University, an online school dedicated to helping data science learners.

Benefits to learners:

  • Build your own online GitHub portfolio of data science projects to market your skills to future employers!
  • Learn real-world applications in People Analytics (HR), Customer Analytics, Marketing Analytics, Social Media Analytics, Text Mining and Natural Language Processing (NLP), Financial and Time Series Analytics, and more!
  • Use advanced machine learning techniques for both high accuracy modeling and explaining features that have an effect on the outcome!
  • Create ML-powered web-applications that can be distributed throughout an organization, enabling non-data scientists to benefit from algorithms in a user-friendly way!

Enrollment is open, so please sign up for special perks. Just go to Business Science University and select enroll.

Conclusions

Customer churn is a costly problem. The good news is that machine learning can solve churn problems, making the organization more profitable in the process. In this article, we saw how Deep Learning can be used to predict customer churn. We built an ANN model using the new keras package that achieved 82% predictive accuracy (without tuning)! We used three new machine learning packages to help with preprocessing and measuring performance: recipes, rsample and yardstick. Finally, we used lime to explain the Deep Learning model, something that was traditionally impossible! We checked the LIME results with a Correlation Analysis, which brought to light other features to investigate. For the IBM Telco dataset, tenure, contract type, internet service type, payment method, senior citizen status, and online security status were useful in diagnosing customer churn. We hope you enjoyed this article!

About Business Science

Business Science specializes in “ROI-driven data science”. Our focus is machine learning and data science in business and financial applications. We help businesses that seek to add this competitive advantage but may not have the resources currently to implement predictive analytics. Business Science can help to expand into predictive analytics while executing on ROI generating projects. Visit the Business Science website or contact us to learn more!


Direct forecast X Recursive forecast

Wed, 01/10/2018 - 23:32

(This article was first published on R – insightR, and kindly contributed to R-bloggers)

By Gabriel Vasconcelos

When dealing with forecasting models there is an issue that generates a lot of confusion, which is the difference between direct and recursive forecasts. I believe most people are more used to recursive forecasts because they are the first we learn when studying ARIMA models.

Suppose you want to forecast the variable \(y_t\), \(h\) steps ahead, using only past information on \(y_t\), and consider the two equations below:

\[
y_{t+1} = \phi_0 + \phi_1 y_t + \varepsilon_{t+1}
\]

\[
y_{t+h} = \phi_{0,h} + \phi_{1,h} y_t + \varepsilon_{t+h}
\]

The first equation is an AR(1) and the second equation is a more general version where we estimate a model directly at the forecasting horizon we want. Now let's see how the one step ahead forecast would be for each model. In the first equation we would have \(\hat{y}_{t+1} = \hat{\phi}_0 + \hat{\phi}_1 y_t\) and in the second equation we must set \(h = 1\) to obtain the same result.

For two steps ahead things start to change. In the first equation we have:

\[
\hat{y}_{t+2} = \hat{\phi}_0 + \hat{\phi}_1 \hat{y}_{t+1} = \hat{\phi}_0 + \hat{\phi}_1 (\hat{\phi}_0 + \hat{\phi}_1 y_t)
\]

and in the second:

\[
\hat{y}_{t+2} = \hat{\phi}_{0,2} + \hat{\phi}_{1,2} y_t
\]

Note that there is an extra number 2 subscript in the coefficients of the equation above. It indicates that both \(\phi_{0,2}\) and \(\phi_{1,2}\) will depend on the choice of \(h\). Now we can generalize both cases to any \(h\) in order to have:

\[
\hat{y}_{t+h} = \hat{\phi}_0 + \hat{\phi}_1 \hat{y}_{t+h-1} \quad \text{and} \quad \hat{y}_{t+h} = \hat{\phi}_{0,h} + \hat{\phi}_{1,h} y_t
\]

The first case is called recursive forecast and the second case is called direct forecast. In the recursive forecast we only need to estimate one model and use its coefficients to iterate on the forecasting horizon until we have the horizon we want. In the direct forecast we need to estimate one different model for each forecasting horizon but we do not need to iterate the forecast. The first out-of-sample prediction of the direct forecast will be already on the desired horizon.

Multivariate Problems

Now suppose we want to forecast \(y_t\) using past information on \(y_t\) and \(x_t\). The recursive model would be:

\[
y_{t+1} = \phi_0 + \phi_1 y_t + \beta_1 x_t + \varepsilon_{t+1}
\]

The one step ahead forecast would be:

\[
\hat{y}_{t+1} = \hat{\phi}_0 + \hat{\phi}_1 y_t + \hat{\beta}_1 x_t
\]

The two step ahead forecast would be:

\[
\hat{y}_{t+2} = \hat{\phi}_0 + \hat{\phi}_1 \hat{y}_{t+1} + \hat{\beta}_1 \hat{x}_{t+1}
\]

Now we have a problem. In the two steps ahead forecast we can just replace \(\hat{y}_{t+1}\) by the one step ahead equation, just like we did in the univariate case. However, we do not have a way to obtain \(\hat{x}_{t+1}\). In fact, if we use other variables in recursive forecasts we must also forecast these variables, in a Vector Autoregressive (VAR) framework for example. In the direct case nothing changes: we could just estimate an equation for \(y_{t+h}\) on \(y_t\) and \(x_t\).

Example

In this example we are going to forecast the Brazilian inflation using past information of the inflation, the industrial production and the unemployment rate. The recursive model will be a VAR(3) and the direct model will be a simple regression with three lags of each variable.

#library devtools #install_github(gabrielrvsc/HDeconometrics) library(HDeconometrics) library(reshape2) library(ggplot2) # = Load data = # data("BRinf") data = BRinf[ , c(1, 12, 14)] colnames(data) = c("INF", "IP", "U") Recursive Forecast

First we are going to estimate the recursive forecasts for 1 to 24 steps (months) ahead. The VAR will forecast all variables but we are only interested in the inflation. The plot below shows that the forecast converges very fast to the yellow line, which is the unconditional mean of the inflation in the training set. Recursive forecasts using AR or VAR on stationary and stable data will always converge to the unconditional mean unless we include more features in the model such as exogenous variables.

# = 24 out-of-sample (test) observations = # train = data[1:132, ] test = data[-c(1:132), ] # = Estimate model and compute forecasts = # VAR = HDvar(train, p = 3) recursive = predict(VAR, 24) df = data.frame(date = as.Date(rownames(data)), INF = data[,"INF"], fitted=c(rep(NA, 3), fitted(VAR)[ ,1], rep(NA, 24)), forecast = c(rep(NA,132), recursive[, 1])) # = Plot = # dfm = melt(df,id.vars = "date") ggplot(data = dfm) + geom_line(aes(x = date, y = value, color = variable))+ geom_hline(yintercept = mean(train[ ,1]), linetype = 2,color = "yellow")

Direct Forecast

In direct forecasts we will need to estimate 24 models for the inflation to obtain the 24 forecasts. We must arrange the data in the right way for the model to estimate the regression on the right lags. This is where most people get confused. Let's look at a univariate example using the function embed to arrange the data (click here for more details on the function). I created a variable y that is just the sequence from 1 to 10. The embed function was used to generate a matrix with \(y_t\) and its first three lags (we lost three observations because of the lags). The one step ahead model is a regression of the yt column on the remaining columns.

y=1:10 lags=embed(y,4) colnames(lags)=c("yt","yt-1","yt-2","yt-3") lags ## yt yt-1 yt-2 yt-3 ## [1,] 4 3 2 1 ## [2,] 5 4 3 2 ## [3,] 6 5 4 3 ## [4,] 7 6 5 4 ## [5,] 8 7 6 5 ## [6,] 9 8 7 6 ## [7,] 10 9 8 7

If we want to run a model for two steps ahead we must remove the first observation of the yt column and the last observation of the lagged columns:

lags = cbind(lags[-1, 1], lags[-nrow(lags), -1]) colnames(lags) = c("yt", "yt-2", "yt-3", "yt-4") lags ## yt yt-2 yt-3 yt-4 ## [1,] 5 3 2 1 ## [2,] 6 4 3 2 ## [3,] 7 5 4 3 ## [4,] 8 6 5 4 ## [5,] 9 7 6 5 ## [6,] 10 8 7 6

Now we just run a regression of the yt column on the other columns. For \(h = 3\) we must repeat the same procedure, and for a general \(h\) we must remove the first \(h - 1\) observations of the first column and the last \(h - 1\) observations of the remaining columns.
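A small helper function makes that general-h bookkeeping explicit. It assumes the original lag matrix produced by embed(y, 4) above; the function name is just for illustration:

# Sketch: align an embed() lag matrix for an h-steps-ahead direct regression
shift_for_h <- function(lag_mat, h) {
    if (h == 1) return(lag_mat)
    cbind(lag_mat[-(1:(h - 1)), 1],               # drop the first h - 1 rows of the target column
          lag_mat[1:(nrow(lag_mat) - h + 1), -1]) # drop the last h - 1 rows of the lag columns
}
shift_for_h(embed(1:10, 4), 3)  # design matrix for a three-steps-ahead model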

The embed function also works for matrices with multiple columns. I will use it on the data to make it ready for the model. The code below runs the direct forecast for the forecasting horizons 1 to 24. The plot does not have a fitted line because there is one fitted model for each horizon. You can see that the direct forecasting is considerably different from the recursive case. It does not converge to the unconditional mean.

# = Create matrix with lags = # X = embed(data, 4) train = X[1:129, ] test = X[-c(1:129), ] ytrain = train[ ,1] # = The Xtest is the same for all horizons = # # = The last three observations (lags) of the train for each variable = # Xtest = train[nrow(train), 1:9] # = Remove the first three columns of the train = # # = They are the three variables in t = # Xtrain = train[ ,-c(1:3)] ytest = test[, 1] # = Run 24 models and forecasts = # direct = c() for(i in 1:24){ model = lm(ytrain ~ Xtrain) # = Run regression = # direct[i] = c(1, Xtest) %*% coef(model) # = Calculate Forecast = # ytrain = ytrain[-1] # = Remove first observation of yt = # Xtrain = Xtrain[-nrow(Xtrain), ] # = Remove last observation of the other variables = # } # = plot = # df = data.frame(data=as.Date(rownames(data)), INF = data[ ,"INF"], forecast = c(rep(NA, 132), direct)) dfm = melt(df,id.vars = "data") ggplot(data = dfm) + geom_line(aes(x = data, y = value, color = variable)) + geom_hline(yintercept = mean(train[ ,1]), linetype=2, color="yellow")

As mentioned before, the one step ahead forecast is the same in both cases:

print(direct[1]) ## [1] 0.7843985 print(recursive[1]) ## [1] 0.7843985 Conclusion
  • Recursive forecast:
    • Single model for all horizons,
    • must iterate the forecast using the coefficients for horizons different from one,
    • Forecasts converge to the unconditional mean for long horizons.
  • Direct forecast:
    • One model for each horizon,
    • No need for iteration,
    • Forecasts do not converge to the unconditional mean,
    • Must be careful to arrange the data.

 


To leave a comment for the author, please follow the link and comment on their blog: R – insightR. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

SW10 digs deep

Wed, 01/10/2018 - 18:45

(This article was first published on R – thinkr, and kindly contributed to R-bloggers)

Responding to a weak property market

In December I looked at how recent events have shaped the property market in London SW10. If short-distance moves are off the table in the current climate, how are property owners responding? When sales are weak, are planning applications in the ascendancy? I applied data science techniques to Royal Borough of Kensington and Chelsea (RBKC) planning data to find out.

Property transactions evaporated with the Financial Crisis. The Government “stamped” on the green shoots of recovery with penalising duties on moves. And the uncertainty surrounding Brexit hasn’t helped.

Property development offers an alternative way to add space. Owners unable to sell would want to consider their options, engage consultants, and secure planning permission. So one could reasonably expect the data to reflect a delayed response. And that’s what we see when plotting sales versus planning applications.

From 2001 to 2017 in SW10 there were circa 6,800 property transactions, and 6,400 decided planning applications. The numbers are surprisingly balanced in both their totals and their shape over time. When sales are down, attention does appear to shift to development.

Is adding space the motivation?

So owners are responding to the weaker market by doing something to their properties. But what? Given the state of the market, is adding space a key theme?

Planning applications, at least on the RBKC planning search portal, are not naturally categorised by type of change. So, as a starting point, I used a “bag of words” data science technique to explore the application proposal descriptions for word frequency.

I stripped out stop words, converted proposal descriptions to lower case, and stemmed the remainder. This ensured words like “Windows” and “windows”, as well as “terrace” and “terracing”, were treated as one.

The resultant word frequency plot includes the “what”, for example, “extension”, as well as the “where” and “how”. From this I could pick out a reasonable set of broad “what” themes, such as extensions and basements. If, for example, a proposal for an extension also mentioned windows, then I counted this once only as the bigger theme.

When sales are down, dig down

With a reasonable set of themes, I could now visualise how the number of planning applications has changed over time by type of development. Adding space, down, sideways or up, does appear to have driven much of the uptick.

Finding space though in dense and tightly-regulated conservation areas is challenging. So it is perhaps unsurprising that “down” is sometimes the only, albeit more expensive, way to go.

The majority of applications, across all themes, are decided favourably. Often, the planning review process will drive the need for the applicant to make proposal changes, or submit additional information. In such cases, the applicant withdraws the proposal, updates, and goes again.

By applying a linear regression model to each of the themes, we can better see which types of planning applications have grown more steeply in recent years.

The most significant change has been in the volume of applications related to trees. However, it is worth noting, in this particular case, that RBKC adopted a Trees and Development Supplementary Planning Document on the 20 April 2010.

We can more clearly see that adding space is a key motivation for the growth in planning applications against the backdrop of the weaker property market. In contrast, it does seem reasonable that owners undertake roofing works more out of necessity. So this type of development might be a little more immune to the headwinds and tailwinds of the property market.

Square-footage is king

An entirely separate and interesting topic of exploration might be the net economic effect of the shift from sales to development. One could anticipate negative effects on one ecosystem, and positive effects on another. For example, bad for estate agents and removal firms, but good for builders and planning consultants.

But to return to the opening question, in response to a weak property market, it would seem that SW10 is digging deep, both into its pocket and the ground, to find more living space. “Square-footage is king”, as central London estate agents like to say, so surely it’s a good long-term investment?

R toolkit

  • purrr: map_df
  • rvest: read_html; html_nodes; html_text; html_table; html_attr
  • SPARQL: SPARQL
  • dplyr: bind_rows; filter; mutate; mutate_at; if_else; select; group_by; summarise; arrange
  • tidyr: spread; nest; unnest
  • broom: tidy
  • stringr: str_c; str_detect
  • lubridate: as_datetime; date; year; dmy
  • tibble: data_frame
  • tm: removeWords; stemDocument
  • qdap: freq_terms
  • ggplot2: geom_bar; geom_area; facet_wrap; geom_col; coord_flip
  • ggthemes: theme_economist

The code may be viewed here.

Citations / Attributions

R Development Core Team (2008). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

Planning application data web-scraped from the Planning search pages with the kind permission of the Royal Borough of Kensington and Chelsea.

The post SW10 digs deep appeared first on thinkr.


Building a Daily Bitcoin Price Tracker with Coindeskr and Shiny in R

Wed, 01/10/2018 - 15:00

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

Let's admit it: the whole world has been going crazy over Bitcoin. Bitcoin (BTC), the first cryptocurrency (in fact, the first digital currency to solve the double-spend problem), introduced by Satoshi Nakamoto, has become bigger than many well-established firms (and even a few countries). So a lot of Bitcoin enthusiasts and investors are looking to keep track of its daily price to better read the market and make moves accordingly.

This tutorial will help an R user build their own daily Bitcoin price tracker using three packages: shiny, coindeskr, and dygraphs.

coindeskr helps us access the Coindesk API to extract the Bitcoin Price Index, including historic Bitcoin prices, powered by Coindesk.

Shiny Structure and Script Naming

Every Shiny app contains two parts: the UI and the server. This post follows the single-file layout, in which the whole app lives in one file, app.R, that defines both the ui and server objects.

Start a new R script, app.R, in a new folder (with a name of your choice) and proceed with the following code.

Installation and Loading

#install.packages('shiny')
#install.packages('coindeskr')
#install.packages('dygraphs')
library(shiny) #To build the shiny App
library(coindeskr) #R-Package connecting to Coindesk API
library(dygraphs) #For interactive Time-series graphs

To start, let us load the required packages – shiny, coindeskr, and dygraphs – as in the code above. If they are not already available on your machine, please install them before loading.

Extract Last 31 days Bitcoin Price

Let us use coindeskr’s handy function get_last31days_price() to extract Bitcoin’s USD Price for the last 31 days and store the output in a dataframe (last31).

last31 <- get_last31days_price()

UI Elements to be displayed in the app

A Bitcoin price tracker should not only display a graph of the last month's prices but also give us some highlights. Hence, let us also display the minimum and maximum price of Bitcoin over the period, along with the dates on which they occurred.

UI should contain the following:

* Title of the App
* Minimum Bitcoin Price and its equivalent Date
* Maximum Bitcoin Price and its equivalent Date
* Interactive Time-Series Graph to see the trend of Bitcoin Price for the last 31 days

While the first three elements can be coded with just basic shiny functions, the last – the interactive time-series graph – requires the dygraphOutput() function to display a dygraph object.

ui <- shinyUI(
  fluidPage(
    titlePanel('Bitcoin USD Price for Last 31 days'),
    mainPanel(
      h3('Minimum'),
      h3(htmlOutput('minprice')),
      h3('Maximum'),
      h3(htmlOutput('maxprice')),
      dygraphOutput("btcprice")
    )
  ))

Server code of the Shiny App

This is where we extract information from the dataframe last31 (created earlier) and feed it into the respective output IDs – minprice, maxprice, and btcprice – defined in the ui part of the code. The minimum and maximum Bitcoin prices are extracted with min() and max(), while which.min() and which.max() return the row names of the minimum and maximum price values; those row names contain the corresponding dates. As mentioned earlier, the dygraphs package provides the renderDygraph() function to display the interactive (HTML) time-series graph created with dygraph().

server <- function(input,output){
  output$minprice <- renderText(
    paste('Price : $', min(last31),
          'Date :', rownames(last31)[which.min(last31$Price)]))
  output$maxprice <- renderText(
    paste('Price : $', max(last31),
          'Date :', rownames(last31)[which.max(last31$Price)]))
  output$btcprice <- renderDygraph(
    dygraph(data = last31, main = "Bitcoin USD Price for Last 31 days") %>%
      dyHighlight(highlightCircleSize = 5,
                  highlightSeriesBackgroundAlpha = 0.2,
                  hideOnMouseOut = FALSE,
                  highlightSeriesOpts = list(strokeWidth = 3)) %>%
      dyRangeSelector())
}

Running the Bitcoin Price Tracker Shiny App:

If your code is ready, the Shiny app can be run (as usual) using the Run App button at the top right of the source pane in RStudio. If your code is not yet ready, you can use the following code to run the app directly from GitHub (provided the required packages – shiny, coindeskr, dygraphs – are already installed).

shiny::runGitHub('amrrs/Bitcoin_price_tracker_Daily')

Screenshot of the Bitcoin Price Tracker Shiny App:

Thus, your daily Bitcoin price tracker is ready! The code used above is available on my GitHub, and if you are interested in Shiny, you can learn more from DataCamp's Building Web Applications in R with Shiny course.

    Related Post

    1. Getting started with dplyr in R using Titanic Dataset
    2. How to apply Monte Carlo simulation to forecast Stock prices using Python
    3. Analysing iOS App Store iTunes Reviews in R
    4. Handling ‘Happy’ vs ‘Not Happy’: Better sentiment analysis with sentimentr in R
    5. Creating Reporting Template with Glue in R

    The Brazilian post office and R

    Wed, 01/10/2018 - 09:00

    (This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

    While I would like to post a success story about the Brazilian post office using R to analyze its internal data, the real story is the use of R’s logo for its Registered mail :). Have a look:

My best guess is that someone searched for the letter R in Google Images and used the first hit as the symbol for registered mail. My second best guess is that someone really likes R and has a particular sense of humour.

    Courtesy of Rodrigo Hartmann in R Brasil – Programadores.


    R Work Areas. Standardize and Automate.

    Wed, 01/10/2018 - 08:37

    (This article was first published on some real numbers, and kindly contributed to R-bloggers)

    Before beginning work on a new data science project I like to do the following:

    1. Get my work area ready by creating an R Project for use with the RStudio IDE.

    2. Organize my work area by creating a series of directories to store my project inputs and outputs. I create ‘data’ (raw data), ‘src’ (R code), ‘reports’ (markdown documents etc) and ‘documentation’ (help files) directories.

    3. Set up tracking of my work by Initializing a Git Repo.

    4. Take measures to avoid tracking sensitive information (such as data or passwords) by adding certain file names or extensions to the .gitignore file.

You can of course achieve the desired result using the RStudio IDE GUI, but I have found it handy to automate the process using a shell script. Because I use Windows, I execute this script using the Git BASH emulator. If you have a Mac or Linux machine, just use the terminal.

    Steps:

    1. Navigate to the directory you want to designate as your work area and run

    bash project_setup projectname

    where “projectname” is a name of your choosing.

    2. Open the freshly generated R Project in RStudio. This will create your .Rproj.user directory.

    3. Start work!

You can see my script below; just modify it to suit your own requirements. Notice I have set the R project options ‘Restore Workspace’, ‘Save Workspace’ and ‘Always Save History’ to an explicit ‘No’, increased the ‘Number of Spaces for Tab’ to 4, and prevented the tracking of .csv data files.
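The shell script itself isn't reproduced here; as a rough R-flavoured sketch of steps 1–4 (the directory names and ignore patterns come from the list above, while the .Rproj fields and everything else are illustrative assumptions):

# sketch: run from the parent directory, with Git available on the PATH
setup_project <- function(projectname) {
  dir.create(projectname)
  for (d in c("data", "src", "reports", "documentation")) {
    dir.create(file.path(projectname, d))
  }
  # minimal .Rproj file; RStudio fills in the remaining options on first open
  writeLines(c("Version: 1.0", "RestoreWorkspace: No",
               "SaveWorkspace: No", "AlwaysSaveHistory: No",
               "NumSpacesForTab: 4"),
             file.path(projectname, paste0(projectname, ".Rproj")))
  # avoid tracking data or other sensitive files
  writeLines(c("*.csv", ".Rproj.user", ".Rhistory", ".RData"),
             file.path(projectname, ".gitignore"))
  system(paste("git -C", shQuote(projectname), "init"))
}

setup_project("myproject")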


    How to implement Random Forests in R

    Wed, 01/10/2018 - 06:52

    (This article was first published on R-posts.com, and kindly contributed to R-bloggers)

Imagine you were to buy a car. Would you just go to a store and buy the first one you see? No, right? You usually consult a few people around you, take their opinions, add your own research, and then make the final decision. Let's take a simpler scenario: whenever you go to a movie, don't you ask your friends for reviews first (unless, of course, it stars one of your favourite actors)?

Have you ever wondered why we ask multiple people for their opinions or reviews before going to a movie, buying a car, or planning a holiday? It's because one person's review may be biased by her preferences; when we ask multiple people, we try to remove the bias that a single person may introduce. One person may strongly dislike a specific destination because of her experience there, while ten other people may have a strong preference for the same destination because their experience was wonderful. From this we can infer that the one person was an exceptional case and her experience may have been a one-off.

Another example, which I am sure all of us have encountered, is interviews at any company or college. We often have to go through multiple rounds of interviews. Even though the questions asked in the various rounds are similar, if not the same, companies still go for it. The reason is that they want the views of multiple recruitment leaders. If multiple leaders are zeroing in on a candidate, then the likelihood of her turning out to be a good hire is high.

    In the world of analytics and data science, this is called ‘ensembling’. Ensembling is a “type of supervised learning technique where multiple models are trained on a training dataset and their individual outputs are combined by some rule to derive the final output.”

    Let’s break the above definition and look at it step by step.

When we say multiple models are trained on a dataset, this can mean the same model with different hyperparameters or entirely different models trained on the training data. The training observations may differ slightly due to sampling; however, the overall population remains the same.

“Outputs are combined by some rule” – there can be multiple rules by which outputs are combined. The most common ones are the average (for numerical output) and the vote (for categorical output). When different models give us numerical outputs, we can simply take the average of all the outputs and use that as the result. In the case of categorical output, we can use the vote – the output occurring the maximum number of times is the final output. There are other, more complex methods of combining outputs, but they are out of the scope of this article.
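As a toy illustration of those two combination rules (the numbers are made up for the example):

# three models predict a numerical outcome: combine by averaging
num_preds <- c(m1 = 4.2, m2 = 3.9, m3 = 4.4)
mean(num_preds)
# 4.166667

# three models predict a class label: combine by majority vote
cls_preds <- c("acc", "unacc", "acc")
names(which.max(table(cls_preds)))
# "acc"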

Random Forest is one such very powerful ensembling machine learning algorithm, which works by creating multiple decision trees and then combining their outputs. A decision tree is a classification model that works on the concept of information gain at every node. At each node, the tree evaluates candidate splits of the data points and checks the information gain of each; it splits where the information gain is maximum, and repeats this process until the nodes are exhausted or there is no further information gain. Decision trees are very simple and easy-to-understand models; however, they have very low predictive power. In fact, they are called weak learners.

Random Forest builds on these weak learners. It combines the outputs of multiple decision trees and then comes up with its own output. Random Forest works on the same principle as decision trees; however, it does not use all the data points and variables in each of its trees. It randomly samples data points and variables for each tree that it creates and then combines the outputs at the end. This removes the bias that a single decision tree might introduce into the system, and it improves the predictive power significantly. We will see this in the next section, when we take a sample dataset and compare the accuracy of Random Forest and a decision tree.

Now, let's take a small case study and try to implement multiple Random Forest models with different hyperparameters, and compare one of the Random Forest models with a decision tree model. (I am sure you will agree with me on this – even without implementing the models, we can say intuitively that Random Forest will give better results than a decision tree.) The dataset is taken from the UCI website and can be found at this link. The data contain 7 variables – six explanatory (BuyingPrice, Maintenance, NumDoors, NumPersons, BootSpace, Safety) and one response variable (Condition). The variables are self-explanatory and refer to the attributes of cars, with the response variable being car acceptability. All the variables are categorical and have 3–4 factor levels each.

    Let’s start the R code implementation and predict the car acceptability based on explanatory variables.

# Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/car/
install.packages("randomForest")
library(randomForest)

# Load the dataset and explore
data1 <- read.csv(file.choose(), header = TRUE)
head(data1)
str(data1)
summary(data1)

> head(data1)
  BuyingPrice Maintenance NumDoors NumPersons BootSpace Safety Condition
1       vhigh       vhigh        2          2     small    low     unacc
2       vhigh       vhigh        2          2     small    med     unacc
3       vhigh       vhigh        2          2     small   high     unacc
4       vhigh       vhigh        2          2       med    low     unacc
5       vhigh       vhigh        2          2       med    med     unacc
6       vhigh       vhigh        2          2       med   high     unacc

> str(data1)
'data.frame': 1728 obs. of 7 variables:
 $ BuyingPrice: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ Maintenance: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ NumDoors   : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
 $ NumPersons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
 $ BootSpace  : Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
 $ Safety     : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
 $ Condition  : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...

> summary(data1)
 BuyingPrice Maintenance  NumDoors   NumPersons BootSpace   Safety    Condition
 high :432   high :432   2    :432  2   :576   big  :576  high:576  acc  : 384
 low  :432   low  :432   3    :432  4   :576   med  :576  low :576  good :  69
 med  :432   med  :432   4    :432  more:576   small:576  med :576  unacc:1210
 vhigh:432   vhigh:432   5more:432                                  vgood:  65

Now, we will split the dataset into train and validation sets in a 70:30 ratio. We could also create a test dataset, but for the time being we will just keep the train and validation sets.

# Split into Train and Validation sets
# Training Set : Validation Set = 70 : 30 (random)
set.seed(100)
train <- sample(nrow(data1), 0.7*nrow(data1), replace = FALSE)
TrainSet <- data1[train,]
ValidSet <- data1[-train,]
summary(TrainSet)
summary(ValidSet)

> summary(TrainSet)
 BuyingPrice Maintenance  NumDoors   NumPersons BootSpace   Safety    Condition
 high :313   high :287   2    :305  2   :406   big  :416  high:396  acc  :264
 low  :292   low  :317   3    :300  4   :399   med  :383  low :412  good : 52
 med  :305   med  :303   4    :295  more:404   small:410  med :401  unacc:856
 vhigh:299   vhigh:302   5more:309                                  vgood: 37

> summary(ValidSet)
 BuyingPrice Maintenance  NumDoors   NumPersons BootSpace   Safety    Condition
 high :119   high :145   2    :127  2   :170   big  :160  high:180  acc  :120
 low  :140   low  :115   3    :132  4   :177   med  :193  low :164  good : 17
 med  :127   med  :129   4    :137  more:172   small:166  med :175  unacc:354
 vhigh:133   vhigh:130   5more:123                                  vgood: 28

Now, we will create a Random Forest model with default parameters and then fine-tune it by changing mtry. We can tune a random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each split (mtry). According to the randomForest package description:

    Ntree: Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.

    Mtry: Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)



# Create a Random Forest model with default parameters
model1 <- randomForest(Condition ~ ., data = TrainSet, importance = TRUE)
model1

> model1
Call:
 randomForest(formula = Condition ~ ., data = TrainSet, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 3.64%
Confusion matrix:
      acc good unacc vgood class.error
acc   253    7     4     0  0.04166667
good    3   44     1     4  0.15384615
unacc  18    1   837     0  0.02219626
vgood   6    0     0    31  0.16216216

By default, the number of trees is 500, and the number of variables tried at each split is 2 in this case. The OOB estimate of the error rate is 3.64%.

# Fine tuning parameters of Random Forest model
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2

> model2
Call:
 randomForest(formula = Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 6

        OOB estimate of error rate: 2.32%
Confusion matrix:
      acc good unacc vgood class.error
acc   254    4     6     0  0.03787879
good    3   47     1     1  0.09615385
unacc  10    1   845     0  0.01285047
vgood   1    1     0    35  0.05405405

When we increase mtry from 2 to 6, the OOB error rate falls from 3.64% to 2.32%. We will now predict on the training dataset first and then on the validation dataset.

# Predicting on train set
predTrain <- predict(model2, TrainSet, type = "class")
# Checking classification accuracy
table(predTrain, TrainSet$Condition)

> table(predTrain, TrainSet$Condition)
predTrain acc good unacc vgood
    acc   264    0     0     0
    good    0   52     0     0
    unacc   0    0   856     0
    vgood   0    0     0    37

# Predicting on Validation set
predValid <- predict(model2, ValidSet, type = "class")
# Checking classification accuracy
mean(predValid == ValidSet$Condition)
table(predValid, ValidSet$Condition)

> mean(predValid == ValidSet$Condition)
[1] 0.9884393
> table(predValid, ValidSet$Condition)
predValid acc good unacc vgood
    acc   117    0     2     0
    good    1   16     0     0
    unacc   1    0   352     0
    vgood   1    1     0    28

On the training dataset there is zero misclassification; on the validation dataset, 6 data points are misclassified and the accuracy is 98.84%. We can also use the importance() function to check important variables: the output below shows the mean decrease in accuracy for each of the variables.

# To check important variables
importance(model2)
varImpPlot(model2)

> importance(model2)
                  acc     good     unacc    vgood MeanDecreaseAccuracy MeanDecreaseGini
BuyingPrice 143.90534 80.38431 101.06518 66.75835            188.10368         71.15110
Maintenance 130.61956 77.28036  98.23423 43.18839            171.86195         90.08217
NumDoors     32.20910 16.14126  34.46697 19.06670             49.35935         32.45190
NumPersons  142.90425 51.76713 178.96850 49.06676            214.55381        125.13812
BootSpace    85.36372 60.34130  74.32042 50.24880            132.20780         72.22591
Safety      179.91767 93.56347 207.03434 90.73874            275.92450        149.74474
> varImpPlot(model2)

Now, we will use a ‘for’ loop to check different values of mtry.

# Using For loop to identify the right mtry for model
a = c()
i = 5
for (i in 3:8) {
  model3 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = i, importance = TRUE)
  predValid <- predict(model3, ValidSet, type = "class")
  a[i-2] = mean(predValid == ValidSet$Condition)
}
a
plot(3:8, a)

> a
[1] 0.9749518 0.9884393 0.9845857 0.9884393 0.9884393 0.9903661
> plot(3:8, a)

From the above graph, we can see that the accuracy decreased when mtry was increased from 4 to 5 and then increased again when mtry was changed from 5 to 6. The maximum accuracy is at mtry equal to 8.
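As an alternative to the manual loop, the randomForest package also provides tuneRF(), which searches over mtry using the OOB error rather than a separate validation set (a sketch; the step factor and improvement threshold here are illustrative choices, and the predictors are selected by dropping the Condition column):

# search mtry by OOB error instead of a manual validation loop
set.seed(100)
tuned <- tuneRF(x = TrainSet[, names(TrainSet) != "Condition"],
                y = TrainSet$Condition,
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01,
                trace = TRUE, plot = TRUE)
tuned   # matrix of mtry values and their OOB errors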

Now that we have seen the implementation of Random Forest and examined the variable importance of the model, let's compare it with a decision tree and see how decision trees fare against random forests.

# Compare with Decision Tree
install.packages("rpart")
install.packages("caret")
install.packages("e1071")
library(rpart)
library(caret)
library(e1071)

# We will compare model 1 of Random Forest with Decision Tree model
model_dt = train(Condition ~ ., data = TrainSet, method = "rpart")
model_dt_1 = predict(model_dt, data = TrainSet)
table(model_dt_1, TrainSet$Condition)
mean(model_dt_1 == TrainSet$Condition)

> table(model_dt_1, TrainSet$Condition)
model_dt_1 acc good unacc vgood
     acc   241   52   132    37
     good    0    0     0     0
     unacc  23    0   724     0
     vgood   0    0     0     0
> mean(model_dt_1 == TrainSet$Condition)
[1] 0.7981803

On the training dataset, the accuracy is around 79.8% and there is a lot of misclassification. Now let's look at the validation dataset.

# Running on Validation Set
model_dt_vs = predict(model_dt, newdata = ValidSet)
table(model_dt_vs, ValidSet$Condition)
mean(model_dt_vs == ValidSet$Condition)

> table(model_dt_vs, ValidSet$Condition)
model_dt_vs acc good unacc vgood
      acc   107   17    58    28
      good    0    0     0     0
      unacc  13    0   296     0
      vgood   0    0     0     0
> mean(model_dt_vs == ValidSet$Condition)
[1] 0.7764933

The accuracy on the validation dataset decreases further, to 77.6%.

The above comparison shows the true power of ensembling and the advantage of using Random Forest over decision trees. Though Random Forest has its own inherent limitations (such as the number of factor levels a categorical variable can have), it is still one of the best models for classification. It is easy to use and tune compared to some more complex models, and still provides a good level of accuracy in business scenarios. You can also compare Random Forest with other models and see how it fares against other techniques. Happy Random Foresting!!

    Author Bio:

    This article was contributed by Perceptive Analytics. Chaitanya Sagar, Prudhvi Potuganti and Saneesh Veetil contributed to this article.

    Perceptive Analytics provides data analytics, data visualization, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.


    R Interface to Google CloudML

    Wed, 01/10/2018 - 01:00

    (This article was first published on TensorFlow for R, and kindly contributed to R-bloggers)

    Overview

    We are excited to announce the availability of the cloudml package, which provides an R interface to Google Cloud Machine Learning Engine. CloudML provides a number of services including:

    • Scalable training of models built with the keras, tfestimators, and tensorflow R packages.

    • On-demand access to training on GPUs, including the new Tesla P100 GPUs from NVIDIA®.

    • Hyperparameter tuning to optimize key attributes of model architectures in order to maximize predictive accuracy.

    • Deployment of trained models to the Google global prediction platform that can support thousands of users and TBs of data.

    Training with CloudML

    Once you’ve configured your system to publish to CloudML, training a model is as straightforward as calling the cloudml_train() function:

    library(cloudml) cloudml_train("train.R")

    CloudML provides a variety of GPU configurations, which can be easily selected when calling cloudml_train(). For example, the following would train the same model as above but with a Tesla K80 GPU:

    cloudml_train("train.R", master_type = "standard_gpu")

    To train using a Tesla P100 GPU you would specify "standard_p100":

    cloudml_train("train.R", master_type = "standard_p100")

    When training completes, the job is collected and a training run report is displayed:

    Learning More

    Check out the cloudml package documentation to get started with training and deploying models on CloudML.

    You can also find out more about the various capabilities of CloudML in these articles:

    • Training with CloudML goes into additional depth on managing training jobs and their output.

    • Hyperparameter Tuning explores how you can improve the performance of your models by running many trials with distinct hyperparameters (e.g. number and size of layers) to determine their optimal values.

    • Google Cloud Storage provides information on copying data between your local machine and Google Storage and also describes how to use data within Google Storage during training.

    • Deploying Models describes how to deploy trained models and generate predictions from them.


    TWiMLAI talk 88 sketchnotes: Using Deep Learning and Google Street View to Estimate Demographics with Timnit Gebru

    Wed, 01/10/2018 - 01:00

    (This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

    These are my sketchnotes taken from the “This week in Machine Learning & AI” podcast number 88 about Using Deep Learning and Google Street View to Estimate Demographics with Timnit Gebru:

    Sketchnotes from TWiMLAI talk #88: Using Deep Learning and Google Street View to Estimate Demographics with Timnit Gebru


    tidytext 0.1.6

    Wed, 01/10/2018 - 01:00

    (This article was first published on Rstats on Julia Silge, and kindly contributed to R-bloggers)

    I am pleased to announce that tidytext 0.1.6 is now on CRAN!

    Most of this release, as well as the 0.1.5 release which I did not blog about, was for maintenance: updates to align with API changes in tidytext's dependencies, and bug fixes. I just spent a good chunk of effort getting tidytext to pass R CMD check on older versions of R, despite the fact that some of the packages in tidytext's Suggests require recent versions of R. FUN TIMES. I was glad to get it working, though, because I know that we have users, some teaching on university campuses, etc., who are constrained to older versions of R in various environments.

    There are some more interesting updates. For example, did you know about the new-ish stopwords package? This package provides access to stopword lists from multiple sources in multiple languages. If you would like to access these in a list data structure, go to the original package. But if you like your text tidy, I GOT YOU.

    library(tidytext) get_stopwords() ## # A tibble: 175 x 2 ## word lexicon ## ## 1 i snowball ## 2 me snowball ## 3 my snowball ## 4 myself snowball ## 5 we snowball ## 6 our snowball ## 7 ours snowball ## 8 ourselves snowball ## 9 you snowball ## 10 your snowball ## # ... with 165 more rows get_stopwords(source = "smart") ## # A tibble: 571 x 2 ## word lexicon ## ## 1 a smart ## 2 a's smart ## 3 able smart ## 4 about smart ## 5 above smart ## 6 according smart ## 7 accordingly smart ## 8 across smart ## 9 actually smart ## 10 after smart ## # ... with 561 more rows get_stopwords(language = "ru") ## # A tibble: 159 x 2 ## word lexicon ## ## 1 и snowball ## 2 в snowball ## 3 во snowball ## 4 не snowball ## 5 что snowball ## 6 он snowball ## 7 на snowball ## 8 я snowball ## 9 с snowball ## 10 со snowball ## # ... with 149 more rows get_stopwords(language = "it") ## # A tibble: 279 x 2 ## word lexicon ## ## 1 ad snowball ## 2 al snowball ## 3 allo snowball ## 4 ai snowball ## 5 agli snowball ## 6 all snowball ## 7 agl snowball ## 8 alla snowball ## 9 alle snowball ## 10 con snowball ## # ... with 269 more rows

    This allows users to implement text mining tasks using tidy data principles that have been difficult before now. What if we would like to find the most common words in, say, Rainer Maria Rilke’s work, but in the original German?

    library(gutenbergr) library(tidyverse) raw_rilke <- gutenberg_download(c(24288, 33863, 2188, 34521), meta_fields = "title") %>% mutate(text = iconv(text, from = "latin-9", to = "UTF-8")) tidy_rilke <- raw_rilke %>% unnest_tokens(word, text) %>% count(title, word, sort = TRUE) %>% anti_join(get_stopwords(language = "de")) tidy_rilke ## # A tibble: 18,698 x 3 ## title word n ## ## 1 Die Aufzeichnungen des Malte Laurids Brigge immer 160 ## 2 Die Aufzeichnungen des Malte Laurids Brigge ganz 155 ## 3 Die Aufzeichnungen des Malte Laurids Brigge mehr 140 ## 4 Die Aufzeichnungen des Malte Laurids Brigge konnte 132 ## 5 Die Aufzeichnungen des Malte Laurids Brigge kam 123 ## 6 Die Aufzeichnungen des Malte Laurids Brigge zeit 120 ## 7 Die Aufzeichnungen des Malte Laurids Brigge schon 119 ## 8 Die Aufzeichnungen des Malte Laurids Brigge sah 101 ## 9 Die Aufzeichnungen des Malte Laurids Brigge hätte 97 ## 10 Die Aufzeichnungen des Malte Laurids Brigge wäre 95 ## # ... with 18,688 more rows tidy_rilke %>% group_by(title) %>% top_n(12) %>% ungroup %>% mutate(word = reorder(word, n), title = factor(title, levels = c("Das Stunden-Buch", "Das Buch der Bilder", "Neue Gedichte", "Die Aufzeichnungen des Malte Laurids Brigge"))) %>% group_by(title, word) %>% arrange(desc(n)) %>% ungroup() %>% mutate(word = factor(paste(word, title, sep = "__"), levels = rev(paste(word, title, sep = "__")))) %>% ggplot(aes(word, n, fill = title)) + geom_col(alpha = 0.8, show.legend = FALSE) + coord_flip() + facet_wrap(~title, scales = "free") + scale_y_continuous(expand = c(0,0)) + scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) + labs(x = NULL, y = "Number of uses in each book", title = "Word use in the poetry of Rainer Maria Wilke", subtitle = "The most common words after stopword removal")

    The first three works here are poetry (The Book of Hours, The Book of Images, and New Poems) while the last is a book of prose (The Notebooks of Malte Laurids Brigge). We can see the different themes and word use here, even just by counting up word frequencies. Now, if I actually spoke German fluently, I know this would mean more to me, but even to my English-speaking eyes, we can see meaningful trends. These are all still quite common words (the Snowball stopword lists are not terribly large) but some of these works are more religious (God, life) and some more focused on narrating events, and so forth.

    Another addition in this release is a dataset of negators, modals, and adverbs (only in English). These are words that can affect sentiment analysis, either by intensifying words or negating them.

nma_words %>%
  count(modifier)
## # A tibble: 3 x 2
##   modifier     n
##
## 1   adverb    22
## 2    modal     7
## 3  negator    15

    You can read more from Saif Mohammad about how these kinds of words can affect sentiment analysis. One of the reasons that tidy data principles are so well suited to text mining is that you can interrogate sentiment scores and get at questions like these quite naturally. I talk about this in my DataCamp course, and also you can read about this in our book, in the chapter on n-grams and the case study on Usenet messages.

    For example, we can ask which words in Jane Austen’s novels are more likely to appear after these adverbs?

    library(janeaustenr) adverbs <- nma_words %>% filter(modifier == "adverb") %>% pull(word) austen_bigrams <- austen_books() %>% unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% count(bigram, sort = TRUE) %>% separate(bigram, c("word1", "word2"), sep = " ") austen_bigrams %>% filter(word1 %in% adverbs) %>% count(word1, word2, wt = n, sort = TRUE) %>% inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>% mutate(contribution = score * nn) %>% group_by(word1) %>% filter(n() > 10) %>% top_n(10, abs(contribution)) %>% ungroup() %>% mutate(word2 = reorder(paste(word2, word1, sep = "__"), contribution)) %>% ggplot(aes(word2, contribution, fill = contribution > 0)) + geom_col(alpha = 0.8, show.legend = FALSE) + facet_wrap(~ word1, scales = "free", nrow = 3) + scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) + coord_flip() + labs(x = NULL, y = "Sentiment score * # of occurrences", title = "Words preceded by adverbs in Jane Austen's novels", subtitle = "Things are rather distressing but most agreeable")

    Gosh, I love this A LOT because you can see really common Jane Austen word patterns here. Some people are extremely agreeable, but sometimes you can’t help but be highly incensed. I am particularly fond of this kind of text mining.

    To see any more details of how to use tidytext functions, you can check out the documentation, vignettes, and news for tidytext at our package website. Let me know if you have questions!


    Quantitative Story Telling with Shiny: Gender Bias in Syllabi

    Wed, 01/10/2018 - 01:00

    (This article was first published on Computational Social Science, and kindly contributed to R-bloggers)

    LSE IR Gender and Diversity Project

    Two shinydashboard posts in a row, that’s a first. As I mentioned on Twitter, I’m not really this productive; rather, the apps had been on the proverbial shelf for a while and I’m just releasing them now. In fact, this is one of my earlier works: quantifying the gender imbalance as it manifests itself in the LSE International Relations (IR) reading lists. You can access the app here.

    This is a much larger project that I got involved in during its second year, so I'm just visualising other people's hard work. How recently I joined was clearly on display when I amused my audience by saying cross-sectional feminism instead of intersectional. Are you a statistician or what? Baby steps.

    In a nutshell, about twenty or so PhD candidates at the department manually (!) scraped the reading lists of 43 courses that were on offer during the 2015-2016 academic year, resulting in a dataset containing 12,358 non-unique publications. Of those, 2,574 involve at least one female author, while 9,784 feature at least one male author. Moreover, 81% of the syllabi are written exclusively by male scholars.

    Recently, I have been working on a talk proposal about (Shiny) design, so I will let some of those guidelines dictate the structure for the rest of this post. I will touch upon three main themes: i) how to emphasise contrast and make a point out of it, ii) how to unpack the whole and present disaggregated data, and iii) how to design useful interactivity that connects with your intended audience. As usual, there will be sprinkles of random thoughts and semi-relevant sidetracking in between.

    On that note, this project used to be (still?) called the Gender and Diversity Project. You may have noticed that I struck through the latter bit in the title. Well, when I ran the classification models – predicting binary outcomes as male/female, diverse/not diverse etc. – two things stuck out. First, if you do not subsample, the sheer male dominance of the field will result in lazy models that are >90% accurate: it will always predict male, and will be correct most of the time (especially when you subset the data). Naturally, this is uninformative.

    However, the second point eclipses the first: when you try to predict diversity, well, you can’t. Because even though women are severely under-represented (leading to lazy algorithms), everyone is white. You can’t even make a cross-section joke anymore. Another unintended side effect of this is that if you rely on an API like genderize, you don’t have to worry about whether it will work well on, say, an Indian name. Feel free to make your own inferences about the state of the discipline.

    Visualising Contrast

    On that uplifting note, let’s move on to the first theme: contrast. What I mean by contrast here is striking difference, or difference in juxtaposition; not the graphical design contrast/hue/saturation. Given the subject matter, I thought a basic comparison over time stratified by gender will do nicely. It would not be surprising to see an increase in the number of included works by female authors as time progresses. However, we don’t know whether that (if exists) is an independent effect or a general one. We can set up a hover plot with ggplot2 to illustrate this point: the first (static) plot only shows the female author subset, and upon hovering/clicking, it reveals the second plot that displays the whole data. Like the previous post, I’m only providing unevaluated (read: motivational) code here; you can always fork the functional code on GitHub:

#Only the hover part of the Shiny code
fluidRow(
  column(width = 6,
         plotOutput("plot1", height = 400,
                    hover = hoverOpts(id = "plot1_hover",
                                      nullOutside = FALSE))),
  column(width = 6,
         conditionalPanel(
           condition = "input.plot1_hover != null",
           plotOutput("plot2", height = 400))))
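The server side of the hover pattern isn't shown above; a minimal sketch (the readings data frame, its columns, and the histogram aesthetics are placeholders rather than the project's actual code) might look like:

library(shiny)
library(ggplot2)

#Server side: plot2 is only drawn once plot1 has been hovered
server <- function(input, output) {
  output$plot1 <- renderPlot({
    ggplot(dplyr::filter(readings, gender == "Female"), aes(year)) +
      geom_histogram(binwidth = 5)
  })
  output$plot2 <- renderPlot({
    req(input$plot1_hover)   # mirror the conditionalPanel condition
    ggplot(readings, aes(year, fill = gender)) +
      geom_histogram(binwidth = 5, position = "identity", alpha = 0.6)
  })
}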

    On the left panel, we see what we expected: works by female authors see a surge after 1990. The drop-off at the end is probably indicative of the lag present in publication date and the time needed to make it into a reading list. However, when we plot both genders on the right panel, we realise the trend is universal – male authors also get included more and more.

    Our illustration demonstrates two separate effects. First, there is absolute improvement over time; in syllabi, the number of publications by female authors tripled in the last three decades. Second, there is comparatively little relative progress in the same timeframe. Any statement more precise than that will need to involve statistics (cough, we may have a manuscript under review).

    One thing we may not have accomplished with the above is the clear communication of year-to-year ratios. We could have used stacked bar charts rather than histograms, but I wanted to divide the workload – don’t put all your tricks into one plot (as they are more likely to break). dygraphs is a powerful library and will serve our needs well with its interactivity. plotly can achieve similar results as well, but I used it as an example in my previous post so let’s go for diversity (ha):

library(dygraphs)

dygraph(data = authors) %>%
  dyOptions(fillGraph = TRUE, fillAlpha = 0.1, panEdgeFraction = 0.1,
            drawPoints = TRUE, strokeWidth = 1.2, drawGapEdgePoints = TRUE,
            drawGrid = FALSE, mobileDisableYTouch = TRUE) %>%
  dyLimit(.2, color = "black") %>%
  dyLegend(width = 400, hideOnMouseOut = FALSE) %>%
  dyAxis("y", label = "Percentage of All Readings",
         valueRange = c(.01, 1.005), rangePad = 1) %>%
  dyAxis("x", label = "Date of Publication") %>%
  dySeries("V2", label = "Female Author Ratio", color = "#1fbfc3",
           stepPlot = TRUE) %>%
  dySeries("V3", label = "Male Author Ratio", color = "#f5766f",
           stepPlot = TRUE, fillGraph = TRUE)

    Now, we can see that the majority of publication-years include less than 20% female authors, indicated by the dashed line. In the live app, users can hover and the legend on the top right corner will update to display the ratios for both genders. Unsurprisingly, we observe a similar trend after 1990; the relative improvement is about double: pre-1990, the female author ratio averages around 10%, while post-1990 it’s about 20%. I believe I have already made a sarcastic remark about the state of the discipline.

    Disaggregating Content

    Okay, so we have all these crude yearly statistics and plotted them to the best of our ability. What next? We need to go deeper ala Inception. This part is naturally governed by the richness of your data. In our case, we have some publication characteristics (year, type, number of authors, author gender) and data on independent courses. Let’s illustrate them both.

    Publication characteristics easily lend themselves to segmented, colour-coded graphics. There are multiple libraries that you can utilise in R. I will go with sunburstR for no other reason than that I liked the graphics so much I made it my website favicon. It's also featured in my first blog post. We all have our favourites.

    library(htmlwidgets) library(sunburstR) library(RColorBrewer) blues <- c(brewer.pal(9, "Blues")) reds <- c(brewer.pal(9, "Reds")) #Sunburst using static patch data (code at the end makes sure the legend is on by default; use with htmlwidgets) output$sb <- renderSunburst({ htmlwidgets::onRender( sunburst(patch, count = TRUE, legend = list(w = 150, h = 25, s = 5, t = 25), breadcrumb = list(w = 0, h = 25, s = 5, t = 10), colors = c("", blues[1:8], reds[7:2]), legendOrder = c("1960", "1970", "1980", "1990", "2000", "2010", "Book", "Article", "OtherPublisher", "TopPublisher", "CoAuthored", "SingleAuthored", "MaleCoAuthor", "FemaleCoAuthor"), withD3 = TRUE), " function(el,x){ d3.select(el).select('.sunburst-togglelegend').property('checked', true); d3.select(el).select('.sunburst-legend').style('visibility', ''); } " ) })

    You will need to wrangle your data to create sequences that sunburstR can plot, which is usually achieved with multiple group_by arguments with dplyr. The JavaScript code at the end renders a click so that the legend is on by default. You need to load htmlwidgets first and make sure you pass withD3 = TRUE as an argument during the call. Users can navigate the dial inside-out and the selection descriptives will be displayed in the middle, both as percentages and as a raw count.
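That wrangling step might look roughly like the following sketch, where the input data frame and its columns are assumptions; sunburstR expects one column of hyphen-delimited sequence strings and one column of counts:

library(dplyr)
library(sunburstR)

# assumed input: one row per reading, with decade, type, publisher,
# authorship and co-author gender already coded as labels
patch <- readings %>%
  group_by(decade, type, publisher, authorship, coauthor_gender) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  mutate(sequence = paste(decade, type, publisher,
                          authorship, coauthor_gender, sep = "-")) %>%
  select(sequence, count)

# patch can now be passed to sunburst(patch, count = TRUE)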

    Moving on to the course breakdown, I was inspired by the bokeh tutorial featuring the periodic table. We first cluster the courses into subfields of the discipline. Then I manually (painfully) arranged the courses, because all my automated attempts resulted in not-so-elegant outputs. Luckily, I happen to have a large whiteboard in my room (you don't?), so I just drew the coordinate matrix there and then copied the values over. I'm sure you will find a way, too.

    library(rbokeh) #Bokeh using static course data output$bokeh <- renderRbokeh({ figure(title = "", tools = c("pan", "wheel_zoom", "reset", "hover", "save"), font = "Roboto Condensed", ylim = as.character(1:6), xlim = as.character(0:14), xgrid = FALSE, ygrid = FALSE, xaxes = FALSE, yaxes = FALSE, height = 400, width = 1050, h_symmetry = TRUE, v_symmetry = TRUE, toolbar_location = "right") %>% #Create cluster boxes as indicators ly_crect(xcor, ycor, data = indicator, width = 2.95, height = .95, fill_color = colors, line_color = "#252525", fill_alpha = .6, hover = list(Subfield, Courses)) %>% ly_text(symx, ycor, text = clusters, data = indicator, font = "Roboto Condensed", font_style = "normal", font_size = "14pt", align = "center", baseline = "middle") %>% #Create centered rectangles ly_crect(xcor, ycor, data = course, width = .95, height = .95, fill_color = color, line_color = "#252525", fill_alpha = .6, hover = list(Convener, Readings)) %>% #F/M ratio ly_text(symx, ycor, text = Ratio, data = course, font = "Roboto Condensed", font_style = "bold", font_size = "14pt", align = "left", baseline = "middle") %>% #Core course indicator ly_text(symx2, numbery, text = Core, data = course, font = "Roboto Condensed", font_style = "bold", font_size = "6pt", align = "left", baseline = "middle") %>% #Course level ly_text(symx, massy, text = Level, data = course, font = "Roboto Condensed", font_size = "6pt", align = "left", baseline = "middle") %>% theme_title(text_font = "Roboto Condensed", text_font_size = "16pt", background_fill_color = "#ecf0f5", text_font_style = "normal") %>% theme_plot(background_fill_color = "#ecf0f5", border_fill_color = "#ecf0f5", outline_line_alpha = 0) })

    We want to convey two things at a glance with the course breakdown: which subfields feature more readings by female authors, and the level of dispersion within the clusters. Thus, I sorted the clusters from the lowest overall ratio to the highest, as well as dividing the courses into five categories. I removed both axes as they would be more confusing given the layout. In the app, you can hover over the boxes to reveal additional info: for courses, the gender and rank of the convener and the total number of readings included; for clusters, the full name of the subfield, so that you can find out what IPE actually means.

    Impactful Interactivity

    Finally, sometimes you want to give the users finer control over the end result. In the first couple of plots, they can see yearly ratios, but cannot do much more than that. One way of achieving this is to give them control over creating yearly dummies, which are then plotted as a binary outcome for whether a given mark is reached or not. For example, you can manipulate the dataframe to create a dummy for whether a year's F/M ratio is 20% or greater. I will not embed the code here, as it is just a ggplot with slider inputs, but the output would look like the following:
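Not the app's actual code, but the dummy construction described above could be sketched along these lines, assuming a yearly_ratios data frame (year, female_ratio) and a sliderInput with id "threshold" in the UI:

#Reactive yearly dummy for "at least X% female-authored readings"
dummyData <- reactive({
  yearly_ratios %>%
    dplyr::mutate(meets_mark = female_ratio >= input$threshold / 100)
})

output$dummyPlot <- renderPlot({
  ggplot(dummyData(), aes(year, as.numeric(meets_mark))) +
    geom_col() +
    labs(y = "Year meets the selected F/M ratio")
})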

    In addition, we can also give them control over visualising the co-authorship patterns. We have three variables of interest for each entry: the total number of authors, the number of female authors, and the number of male authors. I calculate the circle radii using the square root, as otherwise you would only see a huge M (if you include single-authored work in the mix) and nothing else. We transform all three into sliders and emulate the CRAN package shiny app:

devtools::install_github("jcheng5/bubbles")
library(bubbles)

#Bubbles using reactive coData()
output$bubbles <- renderBubbles({
  if (nrow(coData()) == 0)
    return()
  bubbles(sqrt(coData()$n), coData()$AutGen,
          key = coData()$AutGen, tooltip = coData()$n,
          color = "#222d32", textColor = "white")
})

    Last but not least, as this whole project is about publications, it wouldn't be right to gloss over the publishers themselves. Again, we are looking at the publishers from a single-year snapshot of the LSE IR department, so it wouldn't be prudent to generalise beyond that. At least, this is what reviewer 2 told us. Using DT, it's a breeze to create interactive tables. In Shiny, call renderDataTable via DT (i.e. DT::renderDataTable):

library(DT)

#selectedData will update based on slider settings
DT::renderDataTable(selectedData())
options(DT.options = list(
  pageLength = 10,
  order = list(list(4, "desc")),
  class = "hover",
  language = list(search = "Enter Publisher Name:",
                  info = "Displaying _START_ to _END_ of _TOTAL_ publishers")))

    We see that the more ‘prestigious’ the publisher is, the closer it gets to the ‘status quo’ of a 20% F/M ratio. However, if you decrease the total number of publications to ten (the lowest allowed), you will find that there are publishers that go beyond 50%. Well, two of them, anyway.

