### Stouffer’s meta-analysis with weight and direction effect in R

(This article was first published on ** R – Fabio Marroni's Blog**, and kindly contributed to R-bloggers)

I often have to perform meta-analysis of past experiments for which the only info I have is the fold change and the p-value (of any measure you may imagine: species richness, gene expression, depth of coverage, plates of carbonara eaten in 5 minutes, everything).

The hardest thing for me to find out was how to take the direction of the changes into account. E.g. if on Monday I eat 10 carbonara more than my brother (p=0.01), on Tuesday 10 more (p=0.01), on Wednesday 5 more (p=0.1), on Thursday 10 less (p=0.01), on Friday 10 less (p=0.01) and on Saturday 5 less (p=0.1), a standard Stouffer meta-analysis would return a highly significant p-value, completely disregarding the fact that the significant changes were half in my favor and half in my brother's.

How can I take into account the information on the direction?

I had no idea how to proceed, and I wasn’t able to find any reference (I am not an expert in meta-analysis), but then I was lucky enough to stumble upon the manual page of the software package METAINTER for performing meta-analysis.

The authors described there a method to perform Stouffer meta-analysis accounting for the direction of the effects. Their explanation was so clear that it was easy for me to write a simple function in R: I paste it below.

```r
signed.Stouffer.meta <- function(p, w, sign) {
  # p is a vector of p-values, sign gives the direction of each effect
  if (missing(w)) {
    w <- rep(1, length(p)) / length(p)
  } else {
    if (length(w) != length(p)) stop("Length of p and w must equal!")
  }
  Zi <- qnorm(p / 2, lower.tail = FALSE)
  Zi[sign < 0] <- -Zi[sign < 0]
  Z <- sum(w * Zi) / sqrt(sum(w^2))
  p.val <- 2 * (1 - pnorm(abs(Z)))
  return(c(Z = Z, p.value = p.val))
}
```
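A quick sanity check with the carbonara example from above: three significant days in each direction should cancel out, instead of stacking up into a spuriously significant combined p-value.

```r
# signed Stouffer meta-analysis (same function as above, repeated so this
# snippet is self-contained)
signed.Stouffer.meta <- function(p, w, sign) {
  if (missing(w)) w <- rep(1, length(p)) / length(p)
  Zi <- qnorm(p / 2, lower.tail = FALSE)
  Zi[sign < 0] <- -Zi[sign < 0]
  Z <- sum(w * Zi) / sqrt(sum(w^2))
  c(Z = Z, p.value = 2 * (1 - pnorm(abs(Z))))
}

# the carbonara week: Mon/Tue/Wed in my favor, Thu/Fri/Sat in my brother's
p    <- c(0.01, 0.01, 0.1, 0.01, 0.01, 0.1)
sign <- c(1, 1, 1, -1, -1, -1)
res  <- signed.Stouffer.meta(p, sign = sign)
res
# Z is 0 and the combined p-value is 1: the opposite effects cancel exactly
```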
To **leave a comment** for the author, please follow the link and comment on their blog: ** R – Fabio Marroni's Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### grconvertX and grconvertY

(This article was first published on ** R – Insights of a PhD**, and kindly contributed to R-bloggers)

These two functions are unbelievably useful for positioning graphical elements (text, axes, labels, …) in R. They allow one to convert coordinates between various different formats. For instance, you can convert your user coordinate (say 5 where x ranges from 0 to 200) to normalized device coordinates (proportional distance across the device) and vice versa. Very useful for positioning panel labels…

`grconvertX(.1, "npc", "user")` returns the x coordinate that is 10% across the plot region. So with that and the equivalent y function, you can place your labels in exactly the same position on every panel and plot, e.g.

```r
plot(rnorm(20), rnorm(20))
text(grconvertX(.1, "npc", "user"), grconvertY(.9, "npc", "user"), "a)")
```

To see the difference the `from` argument makes, run the following…

```r
text(grconvertX(.1, "npc", "user"), grconvertY(.9, "npc", "user"), "npc")
text(grconvertX(.1, "ndc", "user"), grconvertY(.9, "ndc", "user"), "ndc",
     xpd = NA)  # xpd = NA allows drawing near the outer margin
```

It’s also very useful for positioning axis labels from boxplot when the labels are too long…
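A sketch of that boxplot use case, with hypothetical labels: the labels are rotated and anchored a fixed proportional distance below the plot region via grconvertY, so they sit in the same place regardless of the y-axis range.

```r
# long category labels under a boxplot, placed 5% below the plot region
labs <- c("a very long label", "another long label", "yet another one")
boxplot(list(rnorm(10), rnorm(10), rnorm(10)), xaxt = "n")
text(x = 1:3,
     y = grconvertY(-0.05, "npc", "user"),  # 5% below the plot region
     labels = labs,
     srt = 45, adj = 1,  # rotate and right-align
     xpd = NA)           # allow drawing outside the plot region
```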

Hope that’s helpful to someone…


### Create your Machine Learning library from scratch with R ! (2/5) – PCA

(This article was first published on ** Enhance Data Science**, and kindly contributed to R-bloggers)

This is the second post of the “Create your Machine Learning library from scratch with R !” series. Today, we will see how you can implement **Principal components analysis (PCA)** using only the linear algebra available in R. Previously, we managed to implement linear regression and logistic regression from scratch, and next time we will deal with K nearest neighbors (KNN).

PCA is a dimensionality reduction method which seeks the directions that explain most of the variance in the dataset. From a mathematical standpoint, PCA is just a change of coordinates to represent the points in a more appropriate basis. Keeping just a few of these coordinates is enough to explain an important part of the variance in the dataset.

**The mathematics of PCA**

Let $x_1, \ldots, x_n$ be the observations of our dataset; the points are in $\mathbb{R}^p$. We assume that they are centered and of unit variance. We denote by $X$ the $n \times p$ matrix of observations.

Then $X^T X$ can be diagonalized and has real and non-negative eigenvalues (it is a symmetric positive semi-definite matrix).

We denote $\lambda_1 > \lambda_2 > \ldots > \lambda_p$ its ordered eigenvalues and $v_1, \ldots, v_p$ the associated eigenvectors. It can be shown that $\sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{p} \lambda_i$ is the cumulative share of variance explained by $v_1, \ldots, v_k$.

It can also be shown that $(v_1, \ldots, v_k)$ is the orthonormal basis of size $k$ which explains the most variance.

**This is exactly what we wanted! We have a smaller basis which explains as much variance as possible!**

The implementation in R has three steps:

- We center the data and divide by the standard deviations. Our data now comply with the PCA hypothesis.
- We diagonalise $X^T X$ and store the eigenvectors and eigenvalues.
- The cumulative variance is computed, and the number of eigenvectors required to reach the variance threshold is stored. We only keep these first eigenvectors.
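The original my_pca code block did not survive extraction, so here is a minimal sketch consistent with the three steps above; the list element names (center, std, transform) and the signature are assumptions chosen to match the predict.my_pca function and the calls that follow.

```r
# sketch of a my_pca constructor following the three steps above
my_pca <- function(x, variance_explained = 1, center = TRUE, scale = TRUE) {
  center_vec <- if (center) colMeans(x) else rep(0, ncol(x))
  x_std <- sweep(x, 2, center_vec)                       # step 1: center
  std_vec <- if (scale) apply(x_std, 2, sd) else rep(1, ncol(x))
  x_std <- x_std %*% diag(1 / std_vec)                   # step 1: scale
  eig <- eigen(t(x_std) %*% x_std)                       # step 2: diagonalise X'X
  cum_var <- cumsum(eig$values) / sum(eig$values)        # step 3: cumulative variance
  k <- which(cum_var >= variance_explained - 1e-8)[1]    # eigenvectors needed
  structure(list(center    = center_vec,
                 std       = std_vec,
                 transform = eig$vectors[, 1:k, drop = FALSE]),
            class = "my_pca")
}
```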

Now that we have the transformation matrix, we can perform the projection on the new basis.

```r
predict.my_pca <- function(pca, x, ...) {
  ## Centering
  x_std <- sweep(x, 2, pca[['center']])
  ## Standardization
  x_std <- x_std %*% diag(1 / pca[['std']])
  return(x_std %*% pca[['transform']])
}
```

The function applies the change of basis formula and a projection on the principal components.

**Plot the PCA projection**

Using the predict function, we can now plot the projection of the observations on the two main components. As in part 1, we use the Iris dataset.

```r
library(ggplot2)
pca1 <- my_pca(as.matrix(iris[, 1:4]), 1, scale = TRUE, center = TRUE)
projected <- predict(pca1, as.matrix(iris[, 1:4]))
ggplot() + geom_point(aes(x = projected[, 1], y = projected[, 2], color = iris[, 5]))
```

*Figure: projection of the Iris dataset on the two main principal components.*

**Comparison with the FactoMineR implementation**

We can now compare our implementation with the standard FactoMineR implementation of Principal Component Analysis.

```r
library(FactoMineR)
pca_stats <- PCA(as.matrix(iris[, 1:4]))
projected_stats <- predict(pca_stats, as.matrix(iris[, 1:4]))$coord[, 1:2]
ggplot(data = iris) +
  geom_point(aes(x = projected_stats[, 1], y = -projected_stats[, 2], color = Species)) +
  xlab('PC1') + ylab('PC2') +
  ggtitle('Iris dataset projected on the two main PCs (FactoMineR)')
```

When running this, you should get a plot very similar to the previous one, which is a sanity check of our implementation.

*Figure: projection of the Iris dataset using the FactoMineR implementation.*

Thanks for reading! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.


The post Create your Machine Learning library from scratch with R ! (2/5) – PCA appeared first on Enhance Data Science.


### DALEX @ eRum 2018

(This article was first published on ** English – SmarterPoland.pl**, and kindly contributed to R-bloggers)

DALEX invasion has started with the workshop and talk @ eRum 2018.

Find workshop materials at DALEX: Descriptive mAchine Learning EXplanations. Tools for exploration, validation and explanation of complex machine learning models (thanks Mateusz Staniak for having the second part of the workshop).

And my presentation Show me your model 2.0! (thanks go to the whole MI2DataLab for contributions and Malgorzata Pawlak for great stickers).


### Time Series of the World, Unite!

(This article was first published on ** R – usefulr**, and kindly contributed to R-bloggers)

The R ecosystem knows a ridiculous number of time series classes. So, I decided to create a new universal standard that finally covers everyone’s use case… Ok, just kidding!

tsbox, now freshly on CRAN, provides a set of tools that are agnostic towards existing time series classes. It is built around a set of converters, which convert time series stored as ts, xts, data.frame, data.table, tibble, zoo, tsibble or timeSeries to each other.

To install the stable version from CRAN:

```r
install.packages("tsbox")
```

To get an idea how easy it is to switch from one class to another, consider this:

```r
library(tsbox)
x.ts         <- ts_c(mdeaths, fdeaths)
x.xts        <- ts_xts(x.ts)
x.df         <- ts_df(x.xts)
x.tbl        <- ts_tbl(x.df)
x.dt         <- ts_dt(x.tbl)
x.zoo        <- ts_zoo(x.dt)
x.tsibble    <- ts_tsibble(x.zoo)
x.timeSeries <- ts_timeSeries(x.tsibble)
```

We jump from good old ts objects to xts, store our time series in various data frames and convert them to some highly specialized time series formats.

**tsbox is class-agnostic**

Because these converters work nicely, we can use them to make functions class-agnostic. If a class-agnostic function works for one class, it works for all:

```r
ts_scale(x.ts)
ts_scale(x.xts)
ts_scale(x.df)
ts_scale(x.dt)
ts_scale(x.tbl)
```

ts_scale normalizes one or multiple series by subtracting the mean and dividing by the standard deviation. It works like a ‘generic’ function: you can apply it to any time series object, and it will return an object of the same class as its input.

So, whether we want to smooth, scale, differentiate, chain-link, forecast, regularize or seasonally adjust a series, we can use the same commands on whatever time series we have at hand. tsbox offers a comprehensive toolkit for the basics of time series manipulation. Here are some additional operations:

```r
ts_pc(x.ts)                 # percentage change rates
ts_forecast(x.xts)          # forecast, by exponential smoothing
ts_seas(x.df)               # seasonal adjustment, by X-13
ts_frequency(x.dt, "year")  # convert to annual frequency
ts_span(x.tbl, "-1 year")   # limit time span to final year
```

**tsbox is frequency-agnostic**

There are many more. Because they all start with ts_, you can use auto-complete to see what’s around. Most conveniently, there is a time series plot function that works for all classes and frequencies:

```r
ts_plot(
  `Airline Passengers` = AirPassengers,
  `Lynx trappings` = ts_df(lynx),
  `Deaths from Lung Diseases` = ts_xts(fdeaths),
  title = "Airlines, trappings, and deaths",
  subtitle = "Monthly passengers, annual trappings, monthly deaths"
)
```

There is also a version that uses ggplot2 and has the same syntax.

**Time series in data frames**

You may have wondered why we treated data frames as a time series class. The spread of dplyr and data.table has given data frames a boost and made them one of the most popular data structures in R. So, storing time series in a data frame is an obvious consequence. And even if you don’t intend to keep time series in data frames, this is still the format in which you import and export your data. tsbox makes it easy to switch from data frames to time series and back.

**Make existing functions class-agnostic**

tsbox includes tools to make existing functions class-agnostic. To do so, the ts_ function can be used to wrap any function that works with time series. For a function that works on "ts" objects, this is as simple as:

```r
ts_rowsums <- ts_(rowSums)
ts_rowsums(ts_c(mdeaths, fdeaths))
```

Note that ts_ returns a function, which can be used with or without a name.

In case you are wondering, tsbox uses data.table as a backend, and makes use of its incredibly efficient reshaping facilities, its joins and rolling joins. And thanks to anytime, tsbox will be able to recognize almost any date format without manual intervention.

So, enjoy some relief in R’s time series class struggle.

Website: www.tsbox.help


### Getting into the Rhythm: the euRovision sweepstake

(This article was first published on ** Mango Solutions**, and kindly contributed to R-bloggers)

*Laura Swales, Marketing and Events Assistant*

Another month, another sweepstake to raise money for the Bath Cats & Dogs home!

This time, we picked the Eurovision song contest as our sweepstake of choice. After enjoying my first experience of using R to randomise the names for the previous sweepstake I decided to give it another go, but with a few tweaks.

**Soundcheck**

During my first attempt in R, issues arose when I had been (innocently!) allocated the favourite horse to win. I had no way to prove that the R code had made the selection, as my work was not reproducible.

So with the cries of “cheater!” and “fix!” still ringing in my ears, we started by setting a seed. This meant that if someone else were to replicate my code they would get the same results, thereby removing the dark smudge against my good name.

At random I selected the number 6 at which to set my seed.

```r
set.seed(6)
```

I next compiled my lists of people and Eurovision countries and assigned them to corresponding objects.

```r
people_list <- c(
  "Andy M", "Adam", "Laura", "Rachel", "Owen", "Yvi", "Karis", "Toby",
  "Jen", "Matty G", "Tatiana", "Amanda", "Chrissy", "Lisa", "Lisa",
  "Ben", "Ben", "Robert", "Toby", "Matt A", "Lynn", "Ruth", "Julian",
  "Karina", "Colin", "Colin")

countries_list <- c(
  "Albania", "Australia", "Austria", "Bulgaria", "Cyprus", "Czech Rep",
  "Denmark", "Estonia", "Finland", "France", "Germany", "Hungary",
  "Ireland", "Israel", "Italy", "Lithuania", "Moldova", "Norway",
  "Portugal", "Serbia", "Slovenia", "Spain", "Sweden", "The Netherlands",
  "Ukraine", "United Kingdom")
```

Once I had the lists stored in objects, I followed the same steps as in my previous attempt: I put both objects into a data frame and used the sample function to jumble up the names.

```r
assign_countries <- data.frame(people = people_list,
                               countries = sample(countries_list))
```

Task complete!

Fate had delivered me Denmark, who were nowhere near the favourites at the point of selection. I sighed with relief knowing that I had no chance of winning again and that perhaps maybe now I could start to re-build my reputation as an honest co-worker…

**Encore**

Before I finished my latest foray into R, we decided to create a reusable function for creating sweepstakes in R.

I was talked down from picking the name SweepstakeizzleR and decided upon the slightly more sensible sweepR.

I entered the desired workings of the function, which followed from the above work in R.

```r
sweepR <- function(a, b, seed = 1234) {
  set.seed(seed)
  data.frame(a, sample(b))
}
```

Once done, I could use my newly created function to complete the work I had done before, but in a much timelier fashion.

```r
sweepR(people_list, countries_list)
```

My very first function worked! Using a function like sweepR allows me to reliably reproduce the procedures I need for whatever task I’m working on. In this case it enabled me to create a properly random sweepstake mix of names and entries.
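Because sweepR sets the seed internally, repeating a call with the same inputs gives an identical draw, which is exactly the reproducibility that was missing from the first sweepstake. A small check with toy inputs (hypothetical, not the real people/countries lists):

```r
# sweepR as defined above, repeated so this snippet is self-contained
sweepR <- function(a, b, seed = 1234) {
  set.seed(seed)
  data.frame(a, sample(b))
}

first  <- sweepR(letters[1:5], 1:5)
second <- sweepR(letters[1:5], 1:5)
identical(first, second)  # TRUE: same seed, same assignment
```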

**WinneR**

With great relief, Israel won Eurovision and I was very happy to hand over the prize to Amanda.

I really enjoyed learning a little more about R and how I can create functions to streamline my work. Hopefully another reason will come up for me to learn even more soon!


### The Pleasing Ratio Project

(This article was first published on ** R – Fronkonstin**, and kindly contributed to R-bloggers)

Music is a world within itself, with a language we all understand (Sir Duke, Stevie Wonder)

This serious man on the left is Gustav Theodor Fechner, a German philosopher, physicist and experimental psychologist who lived between 1801 and 1887. To be honest, I know almost nothing of his life or work except one thing: in the 1860s he did a *thought-provoking* experiment. It seems interesting to me for two important reasons: he called into question something widely established, and he obtained experimental data by himself.

Fechner’s experiment was simple: he presented just ten rectangles to 82 students. He then asked each of them to choose the most pleasing one, and obtained revealing discoveries which I will not explain here, since that would cause *bias* in my experiment. You can find more information about the original experiment here.

I have done a project inspired by Fechner’s, which I call *The Pleasing Ratio Project*. Once you enter the App, you will see two rectangles. Both have the same area; they vary only in their *length-to-width* ratios. You will then be asked to select the one that seems most pleasing to you. If you have doubts, just choose *I’m not sure*. You can do it as many times as you want (all responses are completely anonymous). Every *game* confronts a pair of ratios, which can vary from 1 to 3.5. In the *Results* section you will find the percentage of winning games for each ratio. The one with the highest percentage will in the future be officially named *The Most Pleasing Ratio of the World* by myself.

Although my experiment is absolutely inspired by Fechner’s, there are some differences: I can explore a bigger set of ratios by doing an A/B test, and I also introduce the option *I’m not sure*. This makes this version a bit richer.

The experiment has also some interesting technical features:

- the use of the shinydashboard package to arrange the App
- the use of the shinyjs package to add JavaScript that refreshes the page when the user chooses to play again
- saving votes to a text file
- reading it back to visualize the results

Will I obtain the same results as Fechner? This is a living project whose results will change over time, so you can check it regularly.

Go to The Pleasing Ratio Project!

The code of the project is available in GitHub. Thanks a lot for your collaboration!


### Workaround for tidyr::spread with duplicate row identifiers

(This article was first published on ** Digital Age Economist on Digital Age Economist**, and kindly contributed to R-bloggers)

The tidyverse API is one of the easiest APIs in R to follow, thanks to the concept of grammar-based code. But every now and again you come across an error that boggles your mind; my personal one is the error spread delivers when I have no index for my data: Error: Duplicate identifiers for rows (2, 3), (1, 4, 5).... I am sure we have all seen this at some time or other.

Well, let’s reproduce this error:

```r
library(dplyr)
library(tidyr)

# example data (reconstructed from the printed tibble below)
df <- tibble(age    = c(21, 17, 32, 29, 15),
             gender = c("Male", "Female", "Female", "Male", "Male"))
df
## # A tibble: 5 x 2
##     age gender
##   <dbl> <chr>
## 1    21 Male
## 2    17 Female
## 3    32 Female
## 4    29 Male
## 5    15 Male

# Spread
df %>% spread(gender, age)
## Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)
```

To anyone doing the analysis, the output that he/she expects is quite trivial; it should be:

```r
## # A tibble: 3 x 2
##   Female  Male
##    <dbl> <dbl>
## 1     17    21
## 2     32    29
## 3     NA    15
```

But given that there is no index for spread to use, and nothing stopping the 2nd and 3rd values for Male from collapsing into a single row, spread decides it’s better to throw an error. There are multiple questions on Stack Overflow asking how to spread the df without getting the error. Here are some links:

And the list goes on…

**The solution**

Luckily I stumbled across a really nice issue on the tidyr GitHub repo. The user, markdly, opened the issue, but in the end added a nice workaround for this kind of problem. The idea is to add an index using group_by and row_number, and simply remove that column again after widening the data.

```r
df <- df %>%
  group_by(gender) %>%
  mutate(grouped_id = row_number())
df
## # A tibble: 5 x 3
## # Groups: gender [2]
##     age gender grouped_id
##   <dbl> <chr>       <int>
## 1    21 Male            1
## 2    17 Female          1
## 3    32 Female          2
## 4    29 Male            2
## 5    15 Male            3

# Now spread
df %>% spread(gender, age) %>% select(-grouped_id)
## # A tibble: 3 x 2
##   Female  Male
##    <dbl> <dbl>
## 1     17    21
## 2     32    29
## 3     NA    15
```

Giving spread an index this way ensures no duplicate row identifiers are introduced and the dataset can stay rectangular.

I hope that this little gem of an insight into working with spread has saved you some time!


### CHAID and R – When you need explanation – May 15, 2018

(This article was first published on ** Chuck Powell**, and kindly contributed to R-bloggers)

A modern data scientist using R has access to an almost bewildering number of tools, libraries and algorithms to analyze the data. In my next two posts I’m going to focus on an in-depth visit with CHAID (chi-square automatic interaction detection). The title should give you a hint for why I think CHAID is a good “tool” for your analytical toolbox. There are lots of tools that can help you predict or classify, but CHAID is especially good at helping you explain to **any audience** how the model arrives at its prediction or classification. It’s also incredibly robust from a statistical perspective, making almost no assumptions about your data for distribution or normality. I’ll try and elaborate on that as we work the example.

You can get a very brief summary of CHAID from Wikipedia, and mentions of it are scattered about in places like Analytics Vidhya or Data Flair. If you prefer a more scholarly bent, the original article can be found in places like JSTOR. As the name implies it is fundamentally based on the venerable chi-square test, and while not the most powerful (in terms of detecting the smallest possible differences) or the fastest, it really is easy to manage and, more importantly, to tell the story after using it.

Compared to some other techniques it’s also quite simple to use, as I hope you’ll agree by the end of these posts. To showcase it we’re going to be using a dataset that comes to us from the IBM Watson Project and comes packaged with the rsample library. It’s a very practical and understandable dataset, and a great use case for a tree-based algorithm.

Imagine yourself in a fictional company faced with the task of trying to figure out which employees you are going to “lose”, a.k.a. attrition or turnover. There’s a steep cost involved in keeping good employees, and training and on-boarding can be expensive. Being able to predict attrition even a little bit better would save you lots of money and make the company better, especially if you can understand exactly what you have to “watch out” for that might indicate the person is a high risk to leave.

If you’ve never used CHAID before you may also not have partykit. CHAID isn’t on CRAN, but I have commented out the install command below. You’ll also get a variety of messages, none of which is relevant to this example, so I’ve suppressed them.

Let’s load up the attrition dataset and take a look at the variables we have.

Okay, we have data on 1,470 employees. We have 30 potential predictor or independent variables and the all-important attrition variable, which gives us a yes or no answer to the question of whether or not the employee left. Our goal is to build the most accurate predictive model we can that is also simple (parsimonious) and explainable. The predictors we have seem to be the sorts of data we might have on hand in our HR files and, thank goodness, are labelled in a way that makes them pretty self-explanatory.

The CHAID library in R requires that any variables that we enter as predictors be either nominal or ordinal variables (see ?CHAID::chaid), which in R speak means we have to get them in as either factor or ordered factor. The str command shows we have a bunch of variables which are of type integer. As it turns out, moving from integer to factor is simple in terms of code but has to be thoughtful for substantive reasons. So let’s see how things break down.

Hmmmm, 15 factors and 16 integers. Let’s explore further. Of the variables that are integers, how many of them have a small number of values (a.k.a. levels) and can therefore be simply and easily converted to true factors? We’ll use a dplyr pipe to see how many have 5 or fewer levels and 10 or fewer levels.
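The original code block for this step did not survive the page extraction. As a base-R equivalent of the dplyr pipe described above (run here on a hypothetical stand-in data frame, since the rsample attrition data may not be loaded), the counting looks like this:

```r
# stand-in data frame: one high-cardinality integer, two low-cardinality
# integers, and one non-integer column
df <- data.frame(a = 1L:20L,
                 b = rep(1L:4L, 5),
                 c = rep(1L:10L, 2),
                 d = letters[1:20])

is_int   <- sapply(df, is.integer)
n_levels <- sapply(df[is_int], function(x) length(unique(x)))

sum(n_levels <= 5)   # integer columns with 5 or fewer distinct values
sum(n_levels <= 10)  # integer columns with 10 or fewer distinct values
```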

2 and 4 respectively. We can be pretty confident that converting these from integer to factor won’t lose much information. It’s simple to run a mutate operation across the 4 we have identified, but probably more elegant to make it a mutate_if. That way, if in the future we decide we like 4 or 7 or 122 as our criterion for the change, we only have to change one number. The “if” variation is also less to type and less likely to lead to a manual mistake.
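The conversion itself (the mutate_if step, whose code block was also lost) can be sketched in base R; the cutoff of 10 distinct values and the stand-in data frame are assumptions for illustration.

```r
# stand-in data frame: convert every integer column with 10 or fewer
# distinct values to a factor, leave everything else alone
df <- data.frame(a = 1L:20L, b = rep(1L:4L, 5), c = rep(1L:10L, 2))

few_levels <- function(x) is.integer(x) && length(unique(x)) <= 10
df[] <- lapply(df, function(x) if (few_levels(x)) factor(x) else x)

sapply(df, class)  # b and c are now factors; a stays integer
```

Changing the criterion means editing the single `10` in `few_levels`, which is exactly the convenience the mutate_if version buys you.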

As you look at the results, this is a good time to remind you that CHAID is “non-parametric”, which means that we don’t have to worry about how the distribution (normality) looks nor make any assumptions about the variance. We are assuming that the predictors are independent of one another, but that is true of every statistical test, and this is a robust procedure. So for now, let’s simply ignore all the variables that are still integers. I promise we’ll come back and deal with them later. But for now I’m eager to actually use CHAID and do some predicting.

We’re also going to defer the issue of “over-fitting” and how to most wisely use the data we have. We’re simply going to build a first model using all 1,470 cases and the 18 factors we have available, trying to predict attrition. We’ll create a new dataframe called newattrit (how original, right?).

The chaid command accepts two pieces of information in its simplest case: a formula like outcome ~ predictors, and a dataframe. We’re going to make use of the ~ . shortcut on the right-hand side, with attrition on the left and newattrit as our dataframe.

About 6 seconds later (at least on my Mac) we’ll have a solution that we can print and plot.

**I’m going to output all the plots in a smaller size for the benefit
of you the readers. I’m doing that via RMarkdown and it won’t happen
automatically for you if you download and use the code. I’ll initially
be using, fig.height=10, fig.width=20, dpi=90, out.width=“900px”**

What does CHAID do? Straight from the help pages: “Select the predictor that has the smallest adjusted p-value (i.e., most significant). If this adjusted p-value is less than or equal to a user-specified alpha-level alpha4, split the node using this predictor. Else, do not split and the node is considered as a terminal node.” So it will take our 18 predictors and test each one against our outcome variable, attrition. The one with the lowest p-value (a proxy for most predictive) will “anchor” our decision tree. It will then repeat this process of splitting until more splits fail to yield *significant* results. I’m way over-simplifying, of course, but you get the idea. The end result will be a series of terminal nodes (think of them as “prediction buckets” that each hold a group of employees who all meet the same criteria and who we think will either attrit or not). Let’s run it.

I happen to be a visual learner and prefer the plot to the print, but they are obviously reporting the same information, so use them as you see fit. As you can see, the very first split it decides on is overtime, yes or no. I’ve run the chi-square test so that you can see the p-value is indeed very small (0.00000000000000022).

So the algorithm has decided that the most predictive way to divide our sample of employees is into 20 terminal nodes or buckets. Each one of the nodes represents a distinct set of predictors. Take a minute to look at node 19. Every person there shares the following characteristics:

- [2] OverTime in No
- [15] StockOptionLevel in 1, 2, 3
- [17] EnvironmentSatisfaction in Medium, High, Very_High
- [19] Department in Research_Development: No

There are n = 314 people in this group; our prediction is No, they will not attrit, and we were "wrong" err = 3.2% of the time. That's some useful information. To quote an old Star Wars movie, "These are not the droids you're looking for…". In other words, this is not a group we should be overly worried about losing, and we can say that with pretty high confidence.

For contrast let’s look at node #23:

- [20] OverTime in Yes
- [21] JobLevel in 1
- [22] StockOptionLevel in 0, 3
- [23] JobSatisfaction in Low, Medium, High:

Here there are n = 61 staff; we predict they will leave (Yes) and we get it wrong err = 26.2% of the time. It's a little worrisome that we're not as accurate here, but this is a group that bears watching, or intervention, if we want to retain them.

Some other things to note. Because the predictors are treated as categorical, we get splits like the one for node 22, where 0 and 3 are on one side and 1, 2 on the other. The number of people in any node can be quite variable. Finally, notice that a variable can occur at different levels of the model, as StockOptionLevel does!

On the plot side of things there are a few key options you can adjust to make things easier to read. The next blocks of code show how to adjust some of them, such as adding a title, reducing the font size, using "simple" mode, and changing colors.

Next let's look into varying the parameters chaid uses to build the model. chaid_control (not surprisingly) controls the behavior of the model building. When you check the documentation at ?chaid_control you can see the list of 8 parameters you can adjust. We already ran the default settings implicitly when we built chaidattrit1; let's look at three others:

- minsplit – Number of observations in a split response at which no further split is desired.
- minprob – Minimum frequency of observations in terminal nodes.
- maxheight – Maximum height for the tree.

We'll use those, but for our fourth model we'll simply require a higher significance level for alpha2 and alpha4.
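A sketch of how those settings might be passed, assuming the CHAID package (from R-Forge) and the newattrit dataframe built earlier; the parameter values here are illustrative, not the ones used in the post.

```r
# Illustrative chaid_control() settings (values are assumptions, not the
# post's actual choices).
library(CHAID)
ctrl2 <- chaid_control(minsplit = 200, minprob = 0.05)
ctrl3 <- chaid_control(maxheight = 3)
ctrl4 <- chaid_control(alpha2 = 0.01, alpha4 = 0.01)
chaidattrit2 <- chaid(Attrition ~ ., data = newattrit, control = ctrl2)
chaidattrit3 <- chaid(Attrition ~ ., data = newattrit, control = ctrl3)
chaidattrit4 <- chaid(Attrition ~ ., data = newattrit, control = ctrl4)
```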

Let me call your attention to chaidattrit3 for a minute to highlight two important things. First, it is a good picture of what we'd get for an answer if we asked which are the most important predictors, which variables we should focus on. An important technical detail has emerged as well: notice, when you look at inner node #3, that there is no technical reason why a node has to have a *binary* split in chaid. As this example clearly shows, node #3 leads to a three-way split, nodes #4–6.

So the obvious question is: which model is best? IMHO the joy of CHAID is in giving you a clear picture of what you would predict given the data, and why. Then of course there is the usual problem every data scientist has: I have what I think is a great model; how well will it generalize to new data, whether that's next year's attrition numbers for the same company or data from a different company?

But it's time to talk about accuracy and all the related ideas, so on with the show…

When all is said and done, we built a model called chaidattrit1 to predict or classify the 1,470 staff members. It seems reasonable, then, that we can get these predictions back from the model for all 1,470 people and see how we did compared to the data we have about whether they actually attrited. The print and plot commands summarize that for us at the terminal-node level with an error rate, but all in all, which of our four models is best?

The first step is to get the predictions for each model and put them somewhere. For that we'll use the predict command. If you inspect the object you create (in my case with a head command) you'll see it's a vector of factors where the attribute names are set to the terminal node each prediction is associated with. So pmodel1 <- predict(chaidattrit1) puts our predictions from the first model we built in a nice orderly fashion. On the other side, newattrit$Attrition has the actual outcome of whether each employee departed or not.

What we want is a comparison of how well we did. How often did we get it right or wrong? Turns out what we need is called a confusion matrix. The caret package has a function called confusionMatrix that will give us what we want, nicely formatted and printed.
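Before reaching for caret, it may help to see the core idea in base R; the predicted and actual labels below are made up for illustration.

```r
# A confusion matrix is just a cross-tabulation of predicted vs actual
# labels; accuracy is the proportion on the diagonal.
predicted <- factor(c("No", "No", "Yes", "No", "Yes", "No", "No", "Yes"))
actual    <- factor(c("No", "Yes", "Yes", "No", "No",  "No", "No", "Yes"))
cm <- table(Predicted = predicted, Actual = actual)
cm
accuracy <- sum(diag(cm)) / sum(cm)
accuracy  # 0.75
```

caret's confusionMatrix adds Kappa, sensitivity, specificity, and the other statistics discussed below on top of exactly this table.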

There's a nice short summary of what is produced at this url: Confusion Matrix, so I won't even try to repeat that material. I'll just run the appropriate commands. Later we'll revisit this topic to be more efficient; for now I want to focus on the results.

There we have it: four matrices, one for each of the models we made with the different control parameters. It helpfully provides not just Accuracy but also other common measures you may be interested in. I won't review them all; that's why I provided the link to a detailed description of all the measures. Before we leave the topic for a bit, however, I do want to highlight a way you can use the purrr package to make your life a lot easier. A special thanks to Steven at MungeX-3D for his recent post on purrr, which got me thinking about it.

We have 4 models so far (with more to come) and the nice neat output from caret, but honestly, comparing values across the 4 models involves way too much scrolling back and forth right now. Let's use purrr to create a nice neat dataframe. purrr's map command is like lapply from base R, designed to apply some operation or function to a list of objects. So what we'll do is as follows:

- Create a named list called modellist to point to our four existing models (perhaps at a later date we'll start even earlier in our modelling process).
- It's a named list so we can name each model (for now with the accurate but uninteresting name Modelx).
- Pass the list using map to the predict function to generate our predictions.
- Pipe %>% those results to the confusionMatrix function with map.
- Pipe %>% the confusion matrix results to map_dfr. The results of confusionMatrix are actually a list of six items. The ones we want to capture are in $overall and $byClass. We grab them, transpose them, make them into a dataframe, and then bind the two dataframes together so everything is neatly packaged. The .id = "ModelNumb" tells map_dfr to add an identifying column to the dataframe, populated with the name of the list item we passed in modellist. The object CHAIDresults therefore contains everything we might want to use to compare models in one neat dataframe.
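The steps above can be mimicked in base R, with lapply standing in for map and toy prediction vectors standing in for the fitted models (all names and numbers here are invented, just to show the shape of the final comparison dataframe):

```r
# Two fake "model predictions" against one actual outcome vector; each
# list element plays the role of predict(chaidattritN).
actual <- factor(rep(c("No", "Yes"), times = c(8, 2)))
modellist <- list(
  Model1 = factor(rep(c("No", "Yes"), times = c(9, 1)), levels = levels(actual)),
  Model2 = factor(rep(c("No", "Yes"), times = c(6, 4)), levels = levels(actual))
)
results <- lapply(modellist, function(pred) {
  cm <- table(pred, actual)
  data.frame(Accuracy = sum(diag(cm)) / sum(cm))
})
CHAIDresults <- do.call(rbind, results)           # one row per model
CHAIDresults$ModelNumb <- rownames(CHAIDresults)  # the .id column map_dfr adds
CHAIDresults
```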

The kable call is simply for your reading convenience; it makes things a little easier to read than a traditional print call.

| Measure | Model1 | Model2 | Model3 | Model4 |
|---|---|---|---|---|
| Accuracy | 0.8707483 | 0.8455782 | 0.8571429 | 0.8619048 |
| Kappa | 0.4191632 | 0.3529603 | 0.2936476 | 0.3676334 |
| AccuracyLower | 0.8525159 | 0.8260781 | 0.8382017 | 0.8432050 |
| AccuracyUpper | 0.8874842 | 0.8636860 | 0.8746440 | 0.8791447 |
| AccuracyNull | 0.9095238 | 0.8857143 | 0.9394558 | 0.9156463 |
| AccuracyPValue | 0.9999996 | 0.9999985 | 1.0000000 | 1.0000000 |
| McnemarPValue | 0.0e+00 | 6.4e-06 | 0.0e+00 | 0.0e+00 |
| Sensitivity | 0.8900524 | 0.8863287 | 0.8703838 | 0.8826152 |
| Specificity | 0.6766917 | 0.5297619 | 0.6516854 | 0.6370968 |
| Pos Pred Value | 0.9651257 | 0.9359286 | 0.9748581 | 0.9635036 |
| Neg Pred Value | 0.3797468 | 0.3755274 | 0.2447257 | 0.3333333 |
| Precision | 0.9651257 | 0.9359286 | 0.9748581 | 0.9635036 |
| Recall | 0.8900524 | 0.8863287 | 0.8703838 | 0.8826152 |
| F1 | 0.9260700 | 0.9104536 | 0.9196634 | 0.9212873 |
| Prevalence | 0.9095238 | 0.8857143 | 0.9394558 | 0.9156463 |
| Detection Rate | 0.8095238 | 0.7850340 | 0.8176871 | 0.8081633 |
| Detection Prevalence | 0.8387755 | 0.8387755 | 0.8387755 | 0.8387755 |
| Balanced Accuracy | 0.7833720 | 0.7080453 | 0.7610346 | 0.7598560 |

One other thing I'll mention in passing is that the partykit package offers a way of assessing the relative importance of the variables in the model via the varimp command. We'll come back to this concept of variable importance later, but for now here's a simple example of text and plot output.
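A sketch of what that might look like, assuming the chaidattrit1 model from earlier and that partykit's varimp() method applies to it:

```r
# Variable importance for the fitted tree (assumes chaidattrit1 exists).
library(partykit)
imp <- sort(varimp(chaidattrit1), decreasing = TRUE)
imp                                        # text output
barplot(imp, las = 2, ylab = "Importance") # quick plot
```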

But before we go much farther we should probably circle back and make use of all those variables that were coded as integers, which we conveniently ignored when building our first four models. Let's bring them into our model-building activities and see what they can add to our understanding. As a first step, let's use ggplot2 to take a look at their distributions with a density plot.

Well, other than Age, very few of those variables appear to have especially normal distributions. That's okay; we're going to wind up cutting them into factors anyway. The only question is: what are the best cut-points to use? In base R, the cut function defaults to equal intervals (distances along the x axis). You can also specify your own cutpoints and your own labels, as shown below.
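For instance, a minimal cut() illustration with toy values (the ages, breakpoints, and labels below are invented, not the post's):

```r
# Default: equal-width intervals along the x axis.
ages <- c(18, 22, 35, 41, 48, 53, 60)
cut(ages, breaks = 3)  # three equal-width intervals

# Custom cutpoints with custom labels.
grp <- cut(ages,
           breaks = c(0, 30, 50, Inf),
           labels = c("young", "mid", "senior"))
grp
```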

ggplot2 has three helper functions I prefer to use: cut_interval, cut_number, and cut_width. cut_interval makes n groups with equal range, cut_number makes n groups with (approximately) equal numbers of observations, and cut_width makes groups of a fixed, specified width. As we think about moving the numeric variables into factors, any of these might be a viable alternative.
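A quick toy comparison of the three helpers (assuming ggplot2 is installed); note how the one skewed value, 100, affects each scheme differently:

```r
# Compare ggplot2's cutting helpers on the same skewed vector.
library(ggplot2)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
table(cut_interval(x, n = 2))   # equal range: most values land in one bin
table(cut_number(x, n = 2))     # equal counts: five values per bin
table(cut_width(x, width = 50)) # fixed-width bins
```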

For the sake of our current example, let's say I would like to focus on groups of more or less equal size, which means I need to apply cut_number to each of the 12 variables under discussion. I'm not enamored of running the function 12 times, though, so I would prefer to wrap it in a mutate_if statement: if the variable is numeric, then apply cut_number with n = 5.

The problem is that cut_number will error out if it doesn't think there are enough values to produce the bins you requested. So…

A little sleuthing reveals that there is one variable among the 12 that has too few distinct values for cut_number to work: YearsSinceLastPromotion. Let's try what we would like, but explicitly select out that variable.

Yes, that appears to be it. So let's manually cut it into 4 groups and then apply the 5-group code to the other 11 variables. Once we have accomplished that, we can run the same newattrit <- attrition %>% select_if(is.factor) we ran earlier to produce a newattrit dataframe we can work with.
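The whole recoding step can be sketched on a toy dataframe (the columns and cutpoints below are stand-ins, not the post's actual data or breaks; the post's era used funs(), but the formula style shown works on current dplyr):

```r
library(dplyr)
library(ggplot2)
# Toy stand-in for the attrition data: one "sparse" integer column (too
# few distinct values for cut_number) and one well-populated one.
df <- data.frame(
  YearsSinceLastPromotion = rep(c(0, 1, 3, 7), times = c(6, 5, 5, 4)),
  MonthlyIncome = seq(1000, 20000, length.out = 20)
)
df <- df %>%
  # Manually cut the sparse column into 4 labeled groups...
  mutate(YearsSinceLastPromotion = cut(
    YearsSinceLastPromotion,
    breaks = c(-1, 0.9, 1.9, 5.9, Inf),
    labels = c("Less than 1", "1", "2 to 5", "More than 5")
  )) %>%
  # ...then cut every remaining numeric column into 5 equal-count bins.
  mutate_if(is.numeric, ~ cut_number(.x, n = 5))
str(df)  # both columns are now factors
```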

Now we have newattrit with all 30 predictor variables. We will simply repeat the process we used earlier to develop 4 new models.

As we did earlier, we'll also repeat the steps necessary to build a table of results.

| Measure | Model1 | Model2 | Model3 | Model4 | Model5 | Model6 | Model7 | Model8 |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.8707483 | 0.8455782 | 0.8571429 | 0.8619048 | 0.8775510 | 0.8442177 | 0.8571429 | 0.8639456 |
| Kappa | 0.4191632 | 0.3529603 | 0.2936476 | 0.3676334 | 0.4451365 | 0.3317731 | 0.2936476 | 0.3808988 |
| AccuracyLower | 0.8525159 | 0.8260781 | 0.8382017 | 0.8432050 | 0.8596959 | 0.8246542 | 0.8382017 | 0.8453515 |
| AccuracyUpper | 0.8874842 | 0.8636860 | 0.8746440 | 0.8791447 | 0.8938814 | 0.8623944 | 0.8746440 | 0.8810715 |
| AccuracyNull | 0.9095238 | 0.8857143 | 0.9394558 | 0.9156463 | 0.9122449 | 0.8938776 | 0.9394558 | 0.9136054 |
| AccuracyPValue | 0.9999996 | 0.9999985 | 1.0000000 | 1.0000000 | 0.9999968 | 1.0000000 | 1.0000000 | 1.0000000 |
| McnemarPValue | 0.0e+00 | 6.4e-06 | 0.0e+00 | 0.0e+00 | 0.0e+00 | 1.0e-07 | 0.0e+00 | 0.0e+00 |
| Sensitivity | 0.8900524 | 0.8863287 | 0.8703838 | 0.8826152 | 0.8926174 | 0.8820396 | 0.8703838 | 0.8845867 |
| Specificity | 0.6766917 | 0.5297619 | 0.6516854 | 0.6370968 | 0.7209302 | 0.5256410 | 0.6516854 | 0.6456693 |
| Pos Pred Value | 0.9651257 | 0.9359286 | 0.9748581 | 0.9635036 | 0.9708029 | 0.9399838 | 0.9748581 | 0.9635036 |
| Neg Pred Value | 0.3797468 | 0.3755274 | 0.2447257 | 0.3333333 | 0.3924051 | 0.3459916 | 0.2447257 | 0.3459916 |
| Precision | 0.9651257 | 0.9359286 | 0.9748581 | 0.9635036 | 0.9708029 | 0.9399838 | 0.9748581 | 0.9635036 |
| Recall | 0.8900524 | 0.8863287 | 0.8703838 | 0.8826152 | 0.8926174 | 0.8820396 | 0.8703838 | 0.8845867 |
| F1 | 0.9260700 | 0.9104536 | 0.9196634 | 0.9212873 | 0.9300699 | 0.9100903 | 0.9196634 | 0.9223602 |
| Prevalence | 0.9095238 | 0.8857143 | 0.9394558 | 0.9156463 | 0.9122449 | 0.8938776 | 0.9394558 | 0.9136054 |
| Detection Rate | 0.8095238 | 0.7850340 | 0.8176871 | 0.8081633 | 0.8142857 | 0.7884354 | 0.8176871 | 0.8081633 |
| Detection Prevalence | 0.8387755 | 0.8387755 | 0.8387755 | 0.8387755 | 0.8387755 | 0.8387755 | 0.8387755 | 0.8387755 |
| Balanced Accuracy | 0.7833720 | 0.7080453 | 0.7610346 | 0.7598560 | 0.8067738 | 0.7038403 | 0.7610346 | 0.7651280 |

You can clearly see that Overtime remains the first cut in our tree structure, but now other variables have started to influence our model as well, such as how long employees have worked for us and their age. You can see from the table that model #5 is apparently the most accurate now. Not by a huge amount, but apparently these numeric variables we ignored on the first pass do matter, at least to some degree.

I'm not going to dwell on the current results too much; they are simply an example, and in my next post I'd like to spend some time on over-fitting and cross-validation.

I hope you've found this useful. I am always open to comments, corrections and suggestions.

Chuck (ibecav at gmail dot com)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Chuck Powell**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### fruit: Image-Based Systems of Linear Equations (Numeric)

(This article was first published on ** R/exams**, and kindly contributed to R-bloggers)

Exercise template for solving a system of three linear equations (numeric answer) with a problem description based on shuffled images.

**Name:**fruit

**Type:**num

**Related:**fruit2

**Description:**A system of three linear equations has to be solved and the solution has to be entered into a fourth equation. However, the system is not defined through a verbal description or mathematical notation but through images (clip art of tropical fruits). The problem can be interpreted as prices of three fruits (banana, orange, pineapple) and corresponding fruit baskets with different combinations of fruits. Images are stored in Base64 encoding within the exercise files and embedded dynamically into the output. PDFs are best generated from the Rnw version, HTML is best generated with pandoc from either the Rmd version (where pandoc is used by default) or the Rnw version (where ttm is used by default, but pandoc can easily be used as well).
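Under the hood the exercise is an ordinary 3×3 linear system, which R solves with solve(). The fruit prices and basket totals below are invented for illustration, not the template's random numbers:

```r
# Each row of A is one basket (columns: banana, orange, pineapple);
# b holds the basket totals. Basket 1: 2 bananas + 1 orange = 4, etc.
A <- matrix(c(2, 1, 0,
              0, 1, 2,
              1, 0, 1), nrow = 3, byrow = TRUE)
b <- c(4, 8, 4)
prices <- solve(A, b)
prices                     # price per fruit: 1, 2, 3
sum(prices * c(1, 1, 1))   # the "fourth equation": one of each fruit = 6
```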

**Solution feedback:**Yes

**Randomization:**Random numbers, shuffled graphics

**Mathematical notation:**Yes

**Verbatim R input/output:**No

**Images:**Yes

**Other supplements:**No

**Template:**fruit.Rnw fruit.Rmd

**Raw:**(1 random version) fruit.tex fruit.md

**PDF:**

**HTML:**

*(Note that the HTML output contains mathematical equations in MathML. It is displayed by browsers with MathML support like Firefox or Safari – but not Chrome.)*

**Demo code:**

To **leave a comment** for the author, please follow the link and comment on their blog: ** R/exams**.

### fruit2: Image-Based Systems of Linear Equations (Single-Choice)

(This article was first published on ** R/exams**, and kindly contributed to R-bloggers)

Exercise template for solving a system of three linear equations (single-choice) with a problem description based on shuffled images.

**Name:**fruit2

**Type:**schoice

**Related:**fruit

**Description:**A system of three linear equations has to be solved and the solution has to be entered into a fourth equation. However, the system is not defined through a verbal description or mathematical notation but through images (clip art of tropical fruits). The problem can be interpreted as prices of three fruits (banana, orange, pineapple) and corresponding fruit baskets with different combinations of fruits. Images are stored in Base64 encoding within the exercise files and embedded dynamically into the output. A set of five answer alternatives is generated based on two potential mistakes and two random solutions from a suitable range. PDFs are best generated from the Rnw version, HTML is best generated with pandoc from either the Rmd version (where pandoc is used by default) or the Rnw version (where ttm is used by default, but pandoc can easily be used as well).

**Solution feedback:**Yes

**Randomization:**Random numbers, shuffled graphics

**Mathematical notation:**Yes

**Verbatim R input/output:**No

**Images:**Yes

**Other supplements:**No

**Template:**fruit2.Rnw fruit2.Rmd

**Raw:**(1 random version) fruit2.tex fruit2.md

**PDF:**

**HTML:**

*(Note that the HTML output contains mathematical equations in MathML. It is displayed by browsers with MathML support like Firefox or Safari – but not Chrome.)*

**Demo code:**

To **leave a comment** for the author, please follow the link and comment on their blog: ** R/exams**.

### Early draft of our “Feature Engineering and Selection” book

(This article was first published on ** Blog - Applied Predictive Modeling**, and kindly contributed to R-bloggers)

Kjell and I are writing another book on predictive modeling, this time focused on all the things that you can do with predictors. It’s about 60% done and we’d love to get feedback. You can take a look at http://feat.engineering and provide feedback at https://github.com/topepo/FES/issues.

The current TOC is:

- Introduction
- Illustrative Example: Predicting Risk of Ischemic Stroke
- A Review of the Predictive Modeling Process
- Exploratory Visualizations
- Encoding Categorical Predictors
- Engineering Numeric Predictors
- *Detecting Interaction Effects* (these later chapters are not finished yet)
- *Flattening Profile Data*
- *Handling Missing Data*
- *Feature Engineering Without Overfitting*
- *Feature Selection*

To **leave a comment** for the author, please follow the link and comment on their blog: ** Blog - Applied Predictive Modeling**.

### Generative Assessment Creation

(This article was first published on ** Rstats – OUseful.Info, the blog…**, and kindly contributed to R-bloggers)

It’s coming round to that time of year where we have to create the assessment material for courses with an October start date. In many cases, we reuse question forms from previous presentations but change the specific details. If a question is suitably defined, then large parts of this process could be automated.

In the OU, automated question / answer option randomisation is used to provide iCMAs (interactive computer marked assessments) via the student VLE using OpenMark. As well as purely text based questions, questions can include tables or images as part of the question.

One way of supporting such question types is to manually create a set of answer options, perhaps with linked media assets, and then allow randomisation of them.

Another way is to define the question in a *generative* way so that the correct and incorrect answers are automatically generated. *(This seems to be one of those use cases for why ‘everyone should learn to code’;-)*

Pinching screenshots from an (old?) OpenMark tutorial, we can see how a dynamically generated question might be defined. For example, create a set of variables:

and then generate a templated question, and student feedback generator, around them:

Packages also exist for creating generative questions/answers more generally. For example, the R exams package allows you to define question/answer templates in Rmd and then generate questions and solutions in a variety of output document formats.
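A minimal sketch of that workflow, assuming the exams package is installed and an Rmd exercise template (the "fruit" template from r-exams.org is used here as a stand-in) is in the working directory:

```r
# Generate randomized versions of one templated exercise.
library(exams)
set.seed(1)
exams2html("fruit.Rmd", n = 3)  # three randomized HTML versions
exams2pdf("fruit.Rmd", n = 3)   # the same three exercises as PDFs
```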

You can also write templates that include the creation of graphical assets such as charts:

Via my feeds over the weekend, I noticed that this package now also supports the creation of more general diagrams created from a TikZ diagram template. For example, logic diagrams:

*(You can see more exam templates here: www.r-exams.org/templates.)*

As I’m still on a “we can do everything in Jupyter” kick, one of the things I’ve explored is various IPython/notebook magics that support diagram creation. At the moment, these are just generic magics that allow you to write TikZ diagrams, for example, that make use of various TikZ packages:

One the to do list is to create some example magics that template different question types.

I’m not sure if OpenCreate is following a similar model? (I seem to have lost access permissions again…)

FWIW, I’ve also started looking at my show’n’tell notebooks again, trying to get them working in Azure notebooks. (OU staff should be able to log in to notebooks.azure.com using OUCU@open.ac.uk credentials.) For the moment, I’m depositing them at *https://notebooks.azure.com/OUsefulInfo/libraries/gettingstarted*, although some tidying may happen at some point. There are also several more basic demo notebooks I need to put together (e.g. on creating charts and using interactive widgets, digital humanities demos, R demos and (if they work!) polyglot R and Python notebook demos, etc.). To use the notebooks interactively, log in and clone the library into your own user space.

To **leave a comment** for the author, please follow the link and comment on their blog: ** Rstats – OUseful.Info, the blog…**.

### sparklyr 0.8

(This article was first published on ** RStudio Blog**, and kindly contributed to R-bloggers)

We’re pleased to announce that sparklyr 0.8 is now available on CRAN! Sparklyr provides an R interface to Apache Spark. It supports dplyr syntax for working with Spark DataFrames and exposes the full range of machine learning algorithms available in Spark ML. You can also learn more about Apache Spark and sparklyr at spark.rstudio.com and the sparklyr webinar series. In this version, we added support for Spark 2.3, Livy 0.5, and various enhancements and bugfixes. For this post, we’d like to highlight a new feature from Spark 2.3 and introduce the mleap and graphframes extensions.

**Parallel Cross-Validation**

Spark 2.3 supports parallelism in hyperparameter tuning. In other words, instead of training each model specification serially, you can now train them in parallel. This can be enabled by setting the parallelism parameter in ml_cross_validator() or ml_train_validation_split(). Here’s an example:

```r
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3.0")
iris_tbl <- sdf_copy_to(sc, iris)

# Define the pipeline
labels <- c("setosa", "versicolor", "virginica")

pipeline <- ml_pipeline(sc) %>%
  ft_vector_assembler(
    c("Sepal_Width", "Sepal_Length", "Petal_Width", "Petal_Length"),
    "features"
  ) %>%
  ft_string_indexer_model("Species", "label", labels = labels) %>%
  ml_logistic_regression()

# Specify hyperparameter grid
grid <- list(
  logistic = list(
    elastic_net_param = c(0.25, 0.75),
    reg_param = c(1e-3, 1e-4)
  )
)

# Create the cross validator object
cv <- ml_cross_validator(
  sc, estimator = pipeline, estimator_param_maps = grid,
  evaluator = ml_multiclass_classification_evaluator(sc),
  num_folds = 3, parallelism = 4
)

# Train the models
cv_model <- ml_fit(cv, iris_tbl)
```

Once the models are trained, you can inspect the performance results by using the newly available helper function ml_validation_metrics():

```r
ml_validation_metrics(cv_model)
##       f1 elastic_net_param_1 reg_param_1
## 1 0.9506                0.25       1e-03
## 2 0.9384                0.75       1e-03
## 3 0.9384                0.25       1e-04
## 4 0.9569                0.75       1e-04

spark_disconnect(sc)
```

**Pipelines in Production**

Earlier this year, we announced support for ML Pipelines in sparklyr, and discussed how one can persist models onto disk. While that workflow is appropriate for batch scoring of large datasets, we also wanted to enable real-time, low-latency scoring using pipelines developed with sparklyr. To enable this, we’ve developed the mleap package, available on CRAN, which provides an interface to the MLeap open source project.

MLeap allows you to use your Spark pipelines in any Java-enabled device or service. This works by serializing Spark pipelines which can later be loaded into the Java Virtual Machine (JVM) for scoring without requiring a Spark cluster. This means that software engineers can take Spark pipelines exported with sparklyr and easily embed them in web, desktop or mobile applications.

To get started, simply grab the package from CRAN and install the necessary dependencies:

```r
install.packages("mleap")
library(mleap)
install_maven()
install_mleap()
```

Then, build a pipeline as usual:

```r
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.2.0")
mtcars_tbl <- sdf_copy_to(sc, mtcars)

# Create a pipeline and fit it
pipeline <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg")

pipeline_model <- ml_fit(pipeline, mtcars_tbl)
```

Once we have the pipeline model, we can export it via ml_write_bundle():

```r
# Export model
model_path <- file.path(tempdir(), "mtcars_model.zip")
transformed_tbl <- ml_transform(pipeline_model, mtcars_tbl)
ml_write_bundle(pipeline_model, transformed_tbl, model_path)

spark_disconnect(sc)
```

At this point, we’re ready to use mtcars_model.zip in other applications. Notice that the following code does not require Spark:

```r
# Import model
model <- mleap_load_bundle(model_path)

# Create a data frame to be scored
newdata <- tibble::tribble(
  ~qsec, ~hp,  ~wt,
   16.2, 101, 2.68,
   18.1,  99, 3.08
)

# Transform the data frame
transformed_df <- mleap_transform(model, newdata)
dplyr::glimpse(transformed_df)
## Observations: 2
## Variables: 6
## $ qsec       16.2, 18.1
## $ hp         101, 99
## $ wt         2.68, 3.08
## $ big_hp     1, 0
## $ features   [[[1, 2.68, 16.2], [3]], [[0, 3.08, 18.1], [3]]]
## $ prediction 21.07, 22.37
```

Notice that MLeap requires Spark 2.0 to 2.2. You can find additional details in the production pipelines guide.

**Graph Analysis**

The other extension we’d like to highlight is graphframes, which provides an interface to the GraphFrames Spark package. GraphFrames allows us to run graph algorithms at scale using a DataFrame-based API.

Let’s see graphframes in action through a quick example, where we analyze the relationships among packages on CRAN.

```r
library(graphframes)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.1.0")

# Grab list of CRAN packages and their dependencies
available_packages <- available.packages(
  contrib.url("https://cloud.r-project.org/")
) %>%
  `[`(, c("Package", "Depends", "Imports")) %>%
  as_tibble() %>%
  transmute(
    package = Package,
    dependencies = paste(Depends, Imports, sep = ",") %>%
      gsub("\\n|\\s+", "", .)
  )

# Copy data to Spark
packages_tbl <- sdf_copy_to(sc, available_packages, overwrite = TRUE)

# Create a tidy table of dependencies, which define the edges of our graph
edges_tbl <- packages_tbl %>%
  mutate(
    dependencies = dependencies %>%
      regexp_replace("\\\\(([^)]+)\\\\)", "")
  ) %>%
  ft_regex_tokenizer(
    "dependencies", "dependencies_vector",
    pattern = "(\\s+)?,(\\s+)?", to_lower_case = FALSE
  ) %>%
  transmute(
    src = package,
    dst = explode(dependencies_vector)
  ) %>%
  filter(!dst %in% c("R", "NA"))
```

Once we have an edges table, we can easily create a GraphFrame object by calling gf_graphframe() and running PageRank:

```r
# Create a GraphFrame object
g <- gf_graphframe(edges = edges_tbl)

# Run the PageRank algorithm
pagerank <- gf_pagerank(g, tol = 0.01)

pagerank %>%
  gf_vertices() %>%
  arrange(desc(pagerank))
## # Source:     table [?? x 2]
## # Database:   spark_connection
## # Ordered by: desc(pagerank)
##    id        pagerank
##  1 methods       259.
##  2 stats         209.
##  3 utils         194.
##  4 Rcpp          109.
##  5 graphics      104.
##  6 grDevices      60.0
##  7 MASS          53.7
##  8 lattice       34.7
##  9 Matrix        33.3
## 10 grid          32.1
## # ... with more rows
```

We can also collect a sample of the graph locally for visualization:

```r
library(gh)
library(visNetwork)

list_repos <- function(username) {
  gh("/users/:username/repos", username = username) %>%
    vapply("[[", "", "name")
}

rlib_repos <- list_repos("r-lib")
tidyverse_repos <- list_repos("tidyverse")

base_packages <- installed.packages() %>%
  as_tibble() %>%
  filter(Priority == "base") %>%
  pull(Package)

top_packages <- pagerank %>%
  gf_vertices() %>%
  arrange(desc(pagerank)) %>%
  head(75) %>%
  pull(id)

edges_local <- g %>%
  gf_edges() %>%
  filter(src %in% !!top_packages && dst %in% !!top_packages) %>%
  rename(from = src, to = dst) %>%
  collect()

vertices_local <- g %>%
  gf_vertices() %>%
  filter(id %in% top_packages) %>%
  mutate(
    group = case_when(
      id %in% !!rlib_repos ~ "r-lib",
      id %in% !!tidyverse_repos ~ "tidyverse",
      id %in% !!base_packages ~ "base",
      TRUE ~ "other"
    ),
    title = id
  ) %>%
  collect()

visNetwork(vertices_local, edges_local, width = "100%") %>%
  visEdges(arrows = "to")
```

*(The interactive visNetwork graph of the top CRAN packages, colored by base/r-lib/tidyverse/other group, is rendered here in the original post.)*
methods","utils","methods","bitops","plyr","Rcpp","stringr","graphics","grDevices","stats","utils","htmltools","knitr","jsonlite","shiny","magrittr","methods","RColorBrewer","plyr","Rcpp","R6","methods","utils","mime","jsonlite","digest","htmltools","R6","tools","methods","utils","stats","graphics","grDevices","lattice","grid","tools","utils","stats","glue","magrittr","stringi","graphics","Matrix","methods","splines","stats","utils","crayon","methods","rlang","utils","dplyr","glue","magrittr","purrr","Rcpp","rlang","stringi","tibble","methods","utils","Rcpp","stats","utils","graphics","grDevices","lattice"]},"nodesToDataframe":true,"edgesToDataframe":true,"options":{"width":"100%","height":"100%","nodes":{"shape":"dot"},"manipulation":{"enabled":false},"edges":{"arrows":"to"}},"groups":["tidyverse","other","base","r-lib"],"width":"100%","height":null,"idselection":{"enabled":false},"byselection":{"enabled":false},"main":null,"submain":null,"footer":null,"background":"rgba(0, 0, 0, 0)"},"evals":[],"jsHooks":[]}

spark_disconnect(sc)

Notice that GraphFrames currently supports Spark 2.0 and 2.1. You can find additional details in the graph analysis guide.

To **leave a comment** for the author, please follow the link and comment on their blog: ** RStudio Blog**.
R-bloggers.com offers **daily e-mail updates** about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Sketchnotes from TWiML&AI: Adversarial Attacks Against Reinforcement Learning Agents with Ian Goodfellow & Sandy Huang

(This article was first published on ** Shirin's playgRound**, and kindly contributed to R-bloggers)

These are my sketchnotes for Sam Charrington’s podcast **This Week in Machine Learning and AI** about *Adversarial Attacks Against Reinforcement Learning Agents with Ian Goodfellow & Sandy Huang*:

You can listen to the podcast here.

In this episode, I’m joined by Ian Goodfellow, Staff Research Scientist at Google Brain and Sandy Huang, Phd Student in the EECS department at UC Berkeley, to discuss their work on the paper Adversarial Attacks on Neural Network Policies. If you’re a regular listener here you’ve probably heard of adversarial attacks, and have seen examples of deep learning based object detectors that can be fooled into thinking that, for example, a giraffe is actually a school bus, by injecting some imperceptible noise into the image. Well, Sandy and Ian’s paper sits at the intersection of adversarial attacks and reinforcement learning, another area we’ve discussed quite a bit on the podcast. In their paper, they describe how adversarial attacks can also be effective at targeting neural network policies in reinforcement learning. Sandy gives us an overview of the paper, including how changing a single pixel value can throw off performance of a model trained to play Atari games. We also cover a lot of interesting topics relating to adversarial attacks and RL individually, and some related areas such as hierarchical reward functions and transfer learning. This was a great conversation that I’m really excited to bring to you! https://twimlai.com/twiml-talk-119-adversarial-attacks-reinforcement-learning-agents-ian-goodfellow-sandy-huang/

To **leave a comment** for the author, please follow the link and comment on their blog: ** Shirin's playgRound**.

### Is non-inferiority on par with superiority?

(This article was first published on ** ouR data generation**, and kindly contributed to R-bloggers)

It is grant season around here (actually, it is pretty much always grant season), which means another series of problems to tackle. Even with the most straightforward study designs, there is almost always some interesting twist, or an approach that presents a subtle issue or two. In this case, the investigator wants to compare two interventions, but doesn’t feel the need to show that one is better than the other. He just wants to see if the newer intervention is *not inferior* to the more established intervention.

The shift from a superiority trial to a non-inferiority trial leads to a fundamental shift in the hypothesis testing framework. In the more traditional superiority trial, where we want to see if an intervention is an improvement over another intervention, we can set up the hypothesis test with null and alternative hypotheses based on the difference of the intervention proportions \(p_{old}\) and \(p_{new}\) (under the assumption of a binary outcome):

\[

\begin{aligned}

H_0: p_{new} - p_{old} &\le 0 \\

H_A: p_{new} - p_{old} &> 0

\end{aligned}

\] In this context, if we reject the null hypothesis that the difference in proportions is less than zero, we conclude that the new intervention is an improvement over the old one, at least for the population under study. (A crucial element of the test is the \(\alpha\)-level that determines the Type 1 error, the probability of rejecting \(H_0\) when \(H_0\) is actually true. If we use \(\alpha = 0.025\), then that is analogous to doing a two-sided test with \(\alpha = .05\) and hypotheses \(H_0: p_{new} - p_{old} = 0\) and \(H_A: p_{new} - p_{old} \ne 0\).)
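The correspondence between a one-sided test at \(\alpha = 0.025\) and a two-sided test at \(\alpha = 0.05\) is easy to verify numerically. This base-R sketch (not from the original post) checks it over a grid of z-statistics:

```r
# A one-sided test at alpha = 0.025 rejects exactly when the two-sided test
# at alpha = 0.05 rejects in the same direction.
z <- seq(-4, 4, by = 0.01)
p_one <- pnorm(z, lower.tail = FALSE)           # one-sided p-value for H_A: diff > 0
p_two <- 2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-value
all((p_one < 0.025) == (p_two < 0.05 & z > 0))  # TRUE
```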

In the case of a non-inferiority trial, we add a little twist. Really, we subtract a little twist. In this case the hypotheses are:

\[

\begin{aligned}

H_0: p_{new} - p_{old} &\le -\Delta \\

H_A: p_{new} - p_{old} &> -\Delta

\end{aligned}

\]

where \(\Delta\) is some threshold that sets the non-inferiority bounds. Clearly, if \(\Delta = 0\) then this is equivalent to a superiority test. However, for any other \(\Delta\), there is a bit of a cushion so that the new intervention will still be considered *non-inferior* even if we observe a *lower* proportion for the new intervention compared to the older intervention.

As long as the confidence interval around the observed estimate for the difference in proportions does not cross the \(-\Delta\) threshold, we can conclude the new intervention is non-inferior. If we construct a 95% confidence interval, this procedure will have a Type 1 error rate \(\alpha = 0.025\), and a 90% CI will yield an \(\alpha = 0.05\). (I will demonstrate this with a simulation.)
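As a concrete sketch of that decision rule (base R only; the counts and the margin \(\Delta = 0.10\) below are made up for illustration):

```r
# Non-inferiority decision from two observed proportions (illustrative counts).
delta <- 0.10                        # non-inferiority margin (arbitrary here)
x <- c(130, 138)                     # successes: new, old
n <- c(500, 500)
ci <- prop.test(x, n, correct = FALSE, conf.level = 0.95)$conf.int
# Conclude non-inferiority if the whole CI for p_new - p_old lies above -delta:
ci[1] > -delta                       # lower bound ~ -0.07, so TRUE here
```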

The following figures show how different confidence intervals imply different conclusions. I’ve added an equivalence trial here as well, but won’t discuss it in detail except to say that in this situation we would conclude that the two interventions are *equivalent* if the confidence interval falls between \(-\Delta\) and \(\Delta\). The bottom interval crosses the non-inferiority threshold, so it is considered inconclusive. The second interval from the top crosses zero, but does not cross the non-inferiority threshold, so we conclude that the new intervention is at least as effective as the old one. And the top interval excludes zero, so we conclude that the new intervention is an improvement:

This next figure highlights the key challenge of the non-inferiority trial: where do we set \(\Delta\)? By shifting the threshold towards zero in this example (and not changing anything else), we can no longer conclude non-inferiority. But the superiority test is not affected, and never will be. The comparison for a superiority test is made relative to zero only, and has nothing to do with \(\Delta\). So, unless there is a principled reason for selecting \(\Delta\), the process (and conclusions) can feel a little arbitrary. (Check out this interactive post for a really cool way to explore some of these issues.)

Type 1 error rate

To calculate the Type 1 error rate, we generate data under the null hypothesis, or in this case on the rightmost boundary of the null hypothesis, since it is a composite hypothesis. First, let’s generate one data set:

library(simstudy)
library(magrittr)
library(broom)

set.seed(319281)

def <- defDataAdd(varname = "y", formula = "0.30 - 0.15*rx", dist = "binary")

DT <- genData(1000) %>% trtAssign(dtName = ., grpName = "rx")
DT <- addColumns(def, DT)
DT

## id rx y
## 1: 1 0 0
## 2: 2 1 0
## 3: 3 1 0
## 4: 4 0 0
## 5: 5 1 0
## ---
## 996: 996 0 1
## 997: 997 0 0
## 998: 998 1 0
## 999: 999 0 0
## 1000: 1000 0 0

And we can estimate a confidence interval for the difference between the two means:

props <- DT[, .(success = sum(y), n = .N), keyby = rx]
setorder(props, -rx)

round(tidy(prop.test(props$success, props$n,
                     correct = FALSE, conf.level = 0.95))[, -c(5, 8, 9)], 3)

## estimate1 estimate2 statistic p.value conf.low conf.high
## 1 0.142 0.276 27.154 0 -0.184 -0.084

If we generate 1000 data sets in the same way, we can count the number of occurrences where we would incorrectly reject the null hypothesis (i.e. commit a Type 1 error):

powerRet <- function(nPerGrp, level, effect, d = NULL) {

  Form <- genFormula(c(0.30, -effect), c("rx"))
  def <- defDataAdd(varname = "y", formula = Form, dist = "binary")

  DT <- genData(nPerGrp*2) %>% trtAssign(dtName = ., grpName = "rx")

  iter <- 1000
  ci <- data.table()

  # generate 1000 data sets and store results each time in "ci"
  for (i in 1:iter) {
    dx <- addColumns(def, DT)
    props <- dx[, .(success = sum(y), n = .N), keyby = rx]
    setorder(props, -rx)
    ptest <- prop.test(props$success, props$n, correct = FALSE, conf.level = level)
    ci <- rbind(ci, data.table(t(ptest$conf.int),
                               diff = ptest$estimate[1] - ptest$estimate[2]))
  }

  setorder(ci, V1)
  ci[, i := 1:.N]

  # for sample size calculation at 80% power
  if (is.null(d)) d <- ci[i == .2*.N, V1]
  ci[, d := d]

  # determine if interval crosses threshold
  ci[, nullTrue := (V1 <= d)]

  return(ci[])
}

Using 95% CIs, we expect 2.5% of the intervals to lie to the right of the non-inferiority threshold. That is, 2.5% of the time we would reject the null hypothesis when we shouldn’t:

ci <- powerRet(nPerGrp = 500, level = 0.95, effect = 0.15, d = -0.15)
formattable::percent(ci[, mean(!(nullTrue))], 1)

## [1] 2.4%

And using 90% CIs, we expect 5% of the intervals to lie to the right of the threshold:

ci <- powerRet(nPerGrp = 500, level = 0.90, effect = 0.15, d = -0.15)
formattable::percent(ci[, mean(!(nullTrue))], 1)

## [1] 5.1%

Sample size estimates

If we do not expect the effect sizes to be different across interventions, it seems reasonable to find the sample size under this assumption of no effect. Assuming we want to set \(\alpha = 0.025\), we generate many data sets and estimate the 95% confidence interval for each one. The power is merely the proportion of these confidence intervals that lie entirely to the right of \(-\Delta\).

But how should we set \(\Delta\)? I’d propose that for each candidate sample size level, we find \(-\Delta\) such that 80% of the simulated confidence intervals lie to the right of some value, where 80% is the desired power of the test (i.e., given that there is no treatment effect, 80% of the (hypothetical) experiments we conduct will lead us to conclude that the new treatment is *non-inferior* to the old treatment).
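A stripped-down version of that idea (a normal-approximation simulation in base R, avoiding the simstudy machinery used above) picks \(-\Delta\) as the 20th percentile of simulated lower confidence bounds under no true effect:

```r
# Simulate lower 95% CI bounds under no effect; choose -Delta so that 80%
# of intervals lie entirely to its right (i.e., 80% power).
set.seed(1)
p <- 0.30; n <- 500                  # per-group success probability and size
lower <- replicate(1000, {
  p1 <- rbinom(1, n, p) / n          # new intervention, no true effect
  p0 <- rbinom(1, n, p) / n          # old intervention
  (p1 - p0) - qnorm(0.975) * sqrt(p1*(1 - p1)/n + p0*(1 - p0)/n)
})
Delta <- -quantile(lower, 0.20)      # threshold achieving 80% power
mean(lower > -Delta)                 # ~0.80 by construction
```

Increasing n shrinks the intervals, which pulls this Delta toward zero, matching the point made below about larger samples supporting more compelling non-inferiority statements.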

It is clear that increasing the sample size reduces the width of the 95% confidence intervals. As a result, the non-inferiority threshold based on 80% power is shifted closer towards zero when sample size increases. This implies that a larger sample size allows us to make a more compelling statement about non-inferiority.

Unfortunately, not all non-inferiority statements are alike. If we set \(\Delta\) too large, we may expand the bounds of non-inferiority beyond a reasonable, justifiable point. Given that there is no actual constraint on what \(\Delta\) can be, I would say that the non-inferiority test is somewhat more problematic than its closely related cousin, the superiority test, where \(\Delta\) is in effect fixed at zero. But, if we take this approach, where we identify \(\Delta\) that satisfies the desired power, we can make a principled decision about whether or not the threshold is within reasonable bounds.

To **leave a comment** for the author, please follow the link and comment on their blog: ** ouR data generation**.

### CRAN Release of R/exams 2.3-1

(This article was first published on ** R/exams**, and kindly contributed to R-bloggers)

New minor release of the R/exams package to CRAN, containing a wide range of smaller improvements and bug fixes. Notable new features include a dedicated OpenOLAT interface, and a convenience function facilitating the use of TikZ-based graphics.

Version 2.3-1 of the one-for-all exams generator R/exams has been published on the Comprehensive R Archive Network at https://CRAN.R-project.org/package=exams. Over the next few days this will propagate to other CRAN mirrors along with Windows binary packages. The development version of the package is now version 2.3-2 on http://R-Forge.R-project.org/forum/?group_id=1337.

New features

- Added new interface exams2openolat() for the open-source OpenOLAT learning management system. This is only a convenience wrapper to exams2qti12() or exams2qti21() with some dedicated tweaks for optimizing MathJax output for OpenOLAT.
- New function include_tikz() that facilitates compiling standalone TikZ figures into a range of output formats, especially PNG and SVG (for HTML-based output). This is useful when including TikZ in R/Markdown exercises or when converting R/LaTeX exercises to HTML. Two examples have been added to the package that illustrate the capabilities of include_tikz(): automaton, logic. A dedicated blog post is also planned.

- Following the blog post on Written R/exams around the World, several users have been kind enough to add language support for: Croatian (hr.dcf, contributed by Krunoslav Juraić), Danish (da.dcf, contributed by Tue Vissing Jensen and Jakob Messner), Slovak (sk.dcf, contributed by Peter Fabsic), Swiss German (gsw.dcf, contributed by Reto Stauffer), and Turkish (tr.dcf, contributed by Emrah Er). Furthermore, Portuguese has been distinguished into pt-PT.dcf (Portuguese Portuguese) vs. pt-BR.dcf (Brazilian Portuguese), with pt.dcf defaulting to the former (contributed by Thomas Dellinger).
- After setting a random seed exams2nops() and exams2pdf() now yield the same random versions of the exercises. Previously, this was not the case because exams2nops() internally generates a single random trial exam first for a couple of sanity checks. Now, the .GlobalEnv$.Random.seed is restored after generating the trial exam.
- Fixed the support for nsamp argument in exams2nops(). Furthermore, current limitations of exams2nops() are pointed out more clearly in error messages and edge cases caught.
- Allow varying points within a certain exercise in nops_eval().

- In exams2html() and other interfaces based on make_exercise_transform_html() the option base64 = TRUE now uses Base64 encoding for all file extensions (known to the package) whereas base64 = NULL only encodes image files (previous default behavior).
- Bug fixes and improvements in HTML transformers:
- Only ="file.ext" (with =") for supplementary files embedded into HTML is replaced now by the corresponding Base64-encoded version.
- href="file.ext" is replaced by href="file.ext" download="file.ext" prior to Base64 replacement to ensure that the file name is preserved for the browser/downloader.
- alt="file.ext" and download="file.ext" are preserved without the Base64-encoded version of file.ext.

- Include further file URIs for Base64 supplements, in particular .sav for SPSS data files.
- In exams2blackboard(..., base64 = FALSE, ...) the base64 = FALSE was erroneously ignored. No matter how base64 was specified, essentially base64 = TRUE was used. It is honored again now.

- \exshuffle{n} can now also be used for schoice exercises with more than one TRUE answer. In a first step, only one of the TRUE answers is selected and then n-1 items from the FALSE answers.
- Function include_supplement(..., dir = "foo") – without full path to "foo" – now also works if "foo" is not a local sub-directory but a sub-directory to the exercise directory edir (if specified).
- Enable passing of envir argument from exams2html() to xweave() in case of R/Markdown (.Rmd) exercises.
- When using exams2html(..., mathjax = TRUE) for testing purposes, mathjax.rstudio.com is used now rather than cdn.mathjax.org which is currently redirecting and will eventually be shut down completely.
- Added support for \tightlist (as produced by pandoc) in all current LaTeX templates as well as exams2nops().

- Fixed a bug in stresstest_exercise() where the “rank” (previously called “order”) of the correct solution was computed incorrectly. Additional enhancements in plots and labels.
- Fixed a bug for tex2image(..., tikz = TRUE) where erroneously \usetikzlibrary{TRUE} was included. Also tex2image(..., Sweave = TRUE) (the default) did not run properly on Windows, fixed now.
- Better warnings if \exshuffle{} could not be honored due to a lack of sufficiently many (suitable) answer alternatives.
- Bug fix in CSV export of exams2arsnova(). Recent ARSnova versions use “mc” (rather than “MC”) and “abcd” (rather than “SC”) to code multiple-choice and single-choice questions, respectively.

To **leave a comment** for the author, please follow the link and comment on their blog: ** R/exams**.

### Statistics Sunday: Taylor Swift vs. Lorde – Analyzing Song Lyrics

(This article was first published on ** Deeply Trivial**, and kindly contributed to R-bloggers)


Statistics Sunday

Last week, I showed how to tokenize text. Today I’ll use those functions to do some text analysis of one of my favorite types of text: song lyrics. Plus, this is a great opportunity to demonstrate a new R package I discovered: geniusR, which will download lyrics from Genius.

There are two packages – geniusR and geniusr – which will do this. I played with both and found geniusR easier to use. Neither is perfect, but what is perfect, anyway?

To install geniusR, you’ll use a different method than usual – you’ll need to install the package devtools, then call the install_github function to download the R package directly from GitHub.

install.packages("devtools")
devtools::install_github("josiahparry/geniusR")

## Downloading GitHub repo josiahparry/geniusR@master

## from URL https://api.github.com/repos/josiahparry/geniusR/zipball/master

## Installing geniusR

## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \

## --no-environ --no-save --no-restore --quiet CMD INSTALL \

## '/private/var/folders/85/9ygtlz0s4nxbmx3kgkvbs5g80000gn/T/Rtmpl3bwRx/devtools33c73e3f989/JosiahParry-geniusR-5907d82' \

## --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' \

## --install-tests

##

Now you’ll want to load geniusR and tidyverse so we can work with our data.

library(geniusR)
library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ggplot2 2.2.1 purrr 0.2.4

## tibble 1.4.2 dplyr 0.7.4

## tidyr 0.8.0 stringr 1.3.0

## readr 1.1.1 forcats 0.3.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──

## dplyr::filter() masks stats::filter()

## dplyr::lag() masks stats::lag()

For today’s demonstration, I’ll be working with data from two artists I love: Taylor Swift and Lorde. Both dropped new albums last year, Reputation and Melodrama, respectively, and both, though similar in age and friends with each other, have very different writing and musical styles.

geniusR has a function genius_album that will download lyrics from an entire album, labeling it by track.

swift_lyrics <- genius_album(artist="Taylor Swift", album="Reputation")

## Joining, by = c("track_title", "track_n", "track_url")

lorde_lyrics <- genius_album(artist="Lorde", album="Melodrama")

## Joining, by = c("track_title", "track_n", "track_url")

Now we want to tokenize our datasets, remove stop words, and count word frequency – this code should look familiar, except this time, I’m combining them using the pipeline symbol (%>%) from the tidyverse, which allows you to string together multiple functions without having to nest them.

library(tidytext)

tidy_swift <- swift_lyrics %>%

unnest_tokens(word,lyric) %>%

anti_join(stop_words) %>%

count(word, sort=TRUE)

## Joining, by = "word"

head(tidy_swift)

## # A tibble: 6 x 2

## word n

##

## 1 call 46

## 2 wanna 37

## 3 ooh 35

## 4 ha 34

## 5 ah 33

## 6 time 32

tidy_lorde <- lorde_lyrics %>%

unnest_tokens(word,lyric) %>%

anti_join(stop_words) %>%

count(word, sort=TRUE)

## Joining, by = "word"

head(tidy_lorde)

## # A tibble: 6 x 2

## word n

##

## 1 boom 40

## 2 love 26

## 3 shit 24

## 4 dynamite 22

## 5 homemade 22

## 6 light 22

Looking at the top 6 words for each, it doesn’t look like there will be a lot of overlap. But let’s explore that, shall we? Lorde’s album is 3 tracks shorter than Taylor Swift’s. To make sure our word comparisons are meaningful, I’ll create new variables that take into account the total number of words, so each word metric will be a proportion, allowing for direct comparisons. And because I’ll be joining the datasets, I’ll be sure to label these new columns by artist name.

tidy_swift <- tidy_swift %>%
rename(swift_n = n) %>%

mutate(swift_prop = swift_n/sum(swift_n))

tidy_lorde <- tidy_lorde %>%

rename(lorde_n = n) %>%

mutate(lorde_prop = lorde_n/sum(lorde_n))

There are multiple types of joins available in the tidyverse. I used an anti_join to remove stop words. Today, I want to use a full_join, because I want my final dataset to retain all words from both artists. When one dataset contributes a word not found in the other artist’s set, it will fill those variables in with missing values.
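As a minimal illustration of that NA-filling behavior (toy counts, using base R's merge() with all = TRUE as a stand-in for the full join):

```r
# Toy word counts; "love" appears in both, the other words in only one set.
a <- data.frame(word = c("call", "love"), swift_n = c(46, 18))
b <- data.frame(word = c("love", "boom"), lorde_n = c(26, 40))
m <- merge(a, b, by = "word", all = TRUE)   # full (outer) join
m
# "call" gets NA for lorde_n, "boom" gets NA for swift_n; "love" matches both
```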

compare_words <- tidy_swift %>%
full_join(tidy_lorde, by = "word")

summary(compare_words)

## word swift_n swift_prop lorde_n

## Length:957 Min. : 1.000 Min. :0.00050 Min. : 1.0

## Class :character 1st Qu.: 1.000 1st Qu.:0.00050 1st Qu.: 1.0

## Mode :character Median : 1.000 Median :0.00050 Median : 1.0

## Mean : 3.021 Mean :0.00152 Mean : 2.9

## 3rd Qu.: 3.000 3rd Qu.:0.00151 3rd Qu.: 3.0

## Max. :46.000 Max. :0.02321 Max. :40.0

## NA's :301 NA's :301 NA's :508

## lorde_prop

## Min. :0.0008

## 1st Qu.:0.0008

## Median :0.0008

## Mean :0.0022

## 3rd Qu.:0.0023

## Max. :0.0307

## NA's :508

The final dataset contains 957 tokens (unique words), and the NAs tell us how many words are present in only one artist’s corpus. Lorde uses 301 words that Taylor Swift does not, and Taylor Swift uses 508 words that Lorde does not. That leaves 148 words on which they overlap.
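The overlap count follows directly from the NA tallies in the summary() output:

```r
total <- 957          # unique words across both corpora
only_lorde <- 301     # rows with NA for Swift's counts
only_swift <- 508     # rows with NA for Lorde's counts
total - only_lorde - only_swift   # 148 shared words
```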

There are many things we could do with these data, but let’s visualize words and proportions, with one artist on the x-axis and the other on the y-axis.

ggplot(compare_words, aes(x=swift_prop, y=lorde_prop)) +
geom_abline() +

geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +

labs(y="Lorde", x="Taylor Swift") + theme_classic()

## Warning: Removed 809 rows containing missing values (geom_text).

The warning lets me know there are 809 rows with missing values – those are the words only present in one artist’s corpus. Words that fall on or near the line are used at similar rates between artists. Words above the line are used more by Lorde than Taylor Swift, and words below the line are used more by Taylor Swift than Lorde. This tells us that, for instance, Lorde uses “love,” “light,” and, yes, “shit,” more than Swift, while Swift uses “call,” “wanna,” and “hands” more than Lorde. They use words like “waiting,” “heart,” and “dreams” at similar rates. Rates are low overall, but if you look at the max values for the proportion variables, Swift’s most common word only accounts for about 2.3% of her total words; Lorde’s most common word only accounts for about 3.1% of her total words.

This highlights why it’s important to remove stop words for these types of analyses; otherwise, our datasets and chart would be full of words like “the,” “a”, and “and.”

Next Statistics Sunday, we’ll take a look at sentiment analysis!

To **leave a comment** for the author, please follow the link and comment on their blog: ** Deeply Trivial**.

### Tips for Ellipse Summary Plot

(This article was first published on ** R – ЯтомизоnoR**, and kindly contributed to R-bloggers)

I privately received some questions and reply here, because the answers may also help others, including me.

1. How to specify size

With plot axis parameters.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], xlim=c(4,8), ylim=c(2,5))

2. How to specify color

With the plot color parameter.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], col=c('cyan', 'orange', 'magenta'))

3. How to give names

Using the builtin iris data.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')])

Digging deeper about iris data

> str(iris)

'data.frame': 150 obs. of 5 variables:

$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The column Species is used in both the x and y data. It is used to give the name of each category.

Example

Using fivenum instead of the default ninenum.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], col=c('cyan', 'orange', 'magenta'), SUMMARY=fivenum)

This produces the plot shown above.

Below may help you to know the values on each axis. Here, for fivenum, the 3rd value is an (x, y) pair giving the median of each category.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], SUMMARY=fivenum, plot=FALSE)

$setosa

x y

1 4.3 2.3

2 4.8 3.2

3 5.0 3.4

4 5.2 3.7

5 5.8 4.4

$versicolor

x y

1 4.9 2.0

2 5.6 2.5

3 5.9 2.8

4 6.3 3.0

5 7.0 3.4

$virginica

x y

1 4.9 2.2

2 6.2 2.8

3 6.5 3.0

4 6.9 3.2

5 7.9 3.8

To **leave a comment** for the author, please follow the link and comment on their blog: ** R – ЯтомизоnoR**.

### automaton: Interpretation of Automaton Diagrams (Using TikZ)

(This article was first published on ** R/exams**, and kindly contributed to R-bloggers)

Exercise template for assessing the interpretation of an automaton diagram (drawn with TikZ) based on randomly generated input sequences.

**Name:** automaton

**Type:** mchoice

**Related:** logic

**Description:** An automaton diagram with four states A-D is drawn with TikZ and is to be interpreted, where A is always the initial state and one state is randomly picked as the accepting state. The acceptance of five binary 0/1 input sequences has to be assessed, with approximately a quarter of all sequences being accepted. Depending on the exams2xyz() interface, the TikZ graphic can be rendered in PNG, SVG, or directly by LaTeX.

**Solution feedback:** Yes

**Randomization:** Random numbers, text blocks, and graphics

**Mathematical notation:** No

**Verbatim R input/output:** No

**Images:** Yes

**Other supplements:** No

**Template:** automaton.Rnw automaton.Rmd

**Raw:** (1 random version) automaton.tex automaton.md

**PDF:**

**HTML:**

**Demo code:**

To **leave a comment** for the author, please follow the link and comment on their blog: ** R/exams**.