R news and tutorials contributed by hundreds of R bloggers
Updated: 7 hours 50 min ago

### Stouffer’s meta-analysis with weight and direction effect in R

Tue, 05/15/2018 - 21:24

(This article was first published on R – Fabio Marroni's Blog, and kindly contributed to R-bloggers)

I often have to perform meta-analyses of past experiments for which the only information I have is the fold change and the p-value (of any measure you may imagine: species richness, gene expression, depth of coverage, plates of carbonara eaten in 5 minutes, anything).
The hardest thing for me was finding out how to take the direction of the changes into account. E.g. if on Monday I eat 10 carbonara more than my brother (p=0.01), on Tuesday 10 more (p=0.01), on Wednesday 5 more (p=0.1), on Thursday 10 fewer (p=0.01), on Friday 10 fewer (p=0.01) and on Saturday 5 fewer (p=0.1), a standard Stouffer meta-analysis would return a highly significant p-value, completely disregarding the fact that half of the significant changes were in my favor and half in my brother's.
How can I take into account the information on the direction?
I had no idea how to proceed, and I wasn't able to find any reference (I am not an expert in meta-analysis), but then I was lucky enough to stumble upon the manual page of METAINTER, a software package for performing meta-analysis.

The authors describe there a method to perform Stouffer meta-analysis accounting for the direction of the effects. Their explanation was so clear that it was easy for me to write a simple function in R, which I paste below.

```r
signed.Stouffer.meta <- function(p, w, sign) {
  # p: vector of (two-sided) p-values
  # w: optional weights; defaults to equal weights
  # sign: direction of each effect (positive or negative)
  if (missing(w)) {
    w <- rep(1, length(p)) / length(p)
  } else if (length(w) != length(p)) {
    stop("Length of p and w must be equal!")
  }
  if (missing(sign)) sign <- rep(1, length(p))  # no directions given: treat all as positive
  Zi <- qnorm(p / 2, lower.tail = FALSE)  # convert two-sided p-values to |Z|
  Zi[sign < 0] <- -Zi[sign < 0]           # restore the direction of each effect
  Z <- sum(w * Zi) / sqrt(sum(w^2))
  p.val <- 2 * (1 - pnorm(abs(Z)))
  return(c(Z = Z, p.value = p.val))
}
```
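To see the direction handling in action, here is the carbonara example from above, using the p-values and signs from the story (the function is restated so the snippet is self-contained):

```r
signed.Stouffer.meta <- function(p, w, sign) {
  if (missing(w)) {
    w <- rep(1, length(p)) / length(p)
  } else if (length(w) != length(p)) {
    stop("Length of p and w must be equal!")
  }
  if (missing(sign)) sign <- rep(1, length(p))
  Zi <- qnorm(p / 2, lower.tail = FALSE)
  Zi[sign < 0] <- -Zi[sign < 0]
  Z <- sum(w * Zi) / sqrt(sum(w^2))
  c(Z = Z, p.value = 2 * (1 - pnorm(abs(Z))))
}

# Mon..Sat: +10 (p=.01), +10 (.01), +5 (.1), -10 (.01), -10 (.01), -5 (.1)
p    <- c(0.01, 0.01, 0.1, 0.01, 0.01, 0.1)
sign <- c(1, 1, 1, -1, -1, -1)
signed.Stouffer.meta(p, sign = sign)
# Z = 0, p.value = 1: the opposite directions cancel out exactly, as they should
```

A plain Stouffer combination of the same six p-values would instead be highly significant, which is exactly the problem described above.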

To leave a comment for the author, please follow the link and comment on their blog: R – Fabio Marroni's Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### grconvertX and grconvertY

Tue, 05/15/2018 - 18:46

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

These two functions are unbelievably useful for positioning graphical elements (text, axes, labels, …) in R. They allow one to convert coordinates between different coordinate systems. For instance, you can convert a user coordinate (say 5, where x ranges from 0 to 200) to normalized device coordinates (proportional distance across the device) and vice versa. Very useful for positioning panel labels…

```r
grconvertX(.1, "npc", "user")
```

returns the x coordinate that is 10% across the plot region. So with that and the equivalent y function, you can place your labels in exactly the same position on every panel and plot… e.g.

```r
plot(rnorm(20), rnorm(20))
text(grconvertX(.1, "npc", "user"), grconvertY(.9, "npc", "user"), "a)")
```

To show the effect of the `from` argument, run the following…

```r
text(grconvertX(.1, "npc", "user"), grconvertY(.9, "npc", "user"), "npc")
text(grconvertX(.1, "ndc", "user"), grconvertY(.9, "ndc", "user"), "ndc",
     xpd = NA)  # the text will be near the outer margin
```

It’s also very useful for positioning axis labels from boxplot when the labels are too long…
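A sketch of that idea: rotate long group labels under a boxplot by converting an npc coordinate just below the plot region into user coordinates (the long labels here are made up; `pdf(NULL)` is only there so the sketch runs non-interactively):

```r
pdf(NULL)                                      # null device; use your own device interactively
par(mar = c(8, 4, 2, 2))                       # widen the bottom margin for long labels
b <- boxplot(count ~ spray, data = InsectSprays, xaxt = "n")
labs <- paste("insect spray group", b$names)   # stand-in long labels
# place rotated labels slightly below the plot region, in user coordinates
text(seq_along(b$names), grconvertY(-0.05, "npc", "user"),
     labels = labs, srt = 45, adj = 1, xpd = NA)
dev.off()
```

`xpd = NA` lets the labels be drawn outside the plot region, into the margin we widened.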


To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD.

### Create your Machine Learning library from scratch with R ! (2/5) – PCA

Tue, 05/15/2018 - 18:07

(This article was first published on Enhance Data Science, and kindly contributed to R-bloggers)

This is the second post of the “Create your Machine Learning library from scratch with R !” series. Today, we will see how you can implement principal components analysis (PCA) using only the linear algebra available in R. Previously, we managed to implement linear regression and logistic regression from scratch, and next time we will deal with K nearest neighbors (KNN).

Principal components analysis

PCA is a dimensionality reduction method which seeks the vectors that explain most of the variance in the dataset. From a mathematical standpoint, PCA is just a change of coordinates to represent the points in a more appropriate basis. Keeping only a few of these coordinates is enough to explain an important part of the variance in the dataset.

The mathematics of PCA

Let $x_1, \dots, x_n$ be the observations of our dataset; the points are in $\mathbb{R}^p$. We assume that they are centered and of unit variance. We denote $X$ the $n \times p$ matrix of observations.
Then $X^T X$ can be diagonalized and has real and positive eigenvalues (it is a symmetric positive definite matrix).
We denote $\lambda_1 > \lambda_2 > \dots > \lambda_p$ its ordered eigenvalues and $v_1, \dots, v_p$ the associated eigenvectors. It can be shown that $\sum_{i=1}^{k} \lambda_i \big/ \sum_{i=1}^{p} \lambda_i$ is the cumulative share of variance explained by $v_1, \dots, v_k$.
It can also be shown that $(v_1, \dots, v_k)$ is the orthonormal basis of size $k$ which explains the most variance.

This is exactly what we wanted! We have a smaller basis which explains as much variance as possible!
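The variance decomposition above is easy to check numerically: diagonalizing $X^T X$ for the (centered, reduced) Iris measurements and cumulating the eigenvalue shares reproduces the variance shares reported by R's built-in `prcomp`:

```r
x <- scale(as.matrix(iris[, 1:4]))       # centered, unit-variance observations
e <- eigen(crossprod(x))                 # diagonalize X'X
explained <- cumsum(e$values) / sum(e$values)
round(explained, 3)                      # cumulative share of variance per component
```

The eigenvalues of $X^T X$ are $(n-1)$ times the variances along each principal axis, so the ratios match `prcomp`'s exactly.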

PCA in R

The implementation in R has three steps:

1. We center the data and divide by the standard deviations. Our data now comply with the PCA hypothesis.
2. We diagonalize $X^T X$ and store the eigenvectors and eigenvalues.
3. The cumulative variance is computed, and the number of eigenvectors required to reach the variance threshold is stored. We only keep those first eigenvectors.
```r
### PCA
my_pca <- function(x, variance_explained = 0.9, center = TRUE, scale = TRUE) {
  my_pca <- list()
  ## Compute the mean of each variable
  if (center) {
    my_pca[['center']] <- colMeans(x)
  } else {
    ## Otherwise, we set the mean to 0
    my_pca[['center']] <- rep(0, dim(x)[2])
  }
  ## Compute the standard deviation of each variable
  if (scale) {
    my_pca[['std']] <- apply(x, 2, sd)
  } else {
    ## Otherwise, we set the sd to 1
    my_pca[['std']] <- rep(1, dim(x)[2])
  }
  ## Normalization
  ## Centering
  x_std <- sweep(x, 2, my_pca[['center']])
  ## Standardization
  x_std <- x_std %*% diag(1 / my_pca[['std']])
  ## Covariance matrix
  eigen_cov <- eigen(crossprod(x_std, x_std))
  ## Computing the cumulative variance
  my_pca[['cumulative_variance']] <- cumsum(eigen_cov[['values']])
  ## Number of required components
  my_pca[['n_components']] <-
    sum((my_pca[['cumulative_variance']] / sum(eigen_cov[['values']])) < variance_explained) + 1
  ## Selection of the principal components
  my_pca[['transform']] <- eigen_cov[['vectors']][, 1:my_pca[['n_components']]]
  attr(my_pca, "class") <- "my_pca"
  return(my_pca)
}
```

Now that we have the transformation matrix, we can perform the projection on the new basis.

```r
predict.my_pca <- function(pca, x, ...) {
  ## Centering
  x_std <- sweep(x, 2, pca[['center']])
  ## Standardization
  x_std <- x_std %*% diag(1 / pca[['std']])
  return(x_std %*% pca[['transform']])
}
```

The function applies the change of basis formula and projects onto the principal components.

Plot the PCA projection

Using the predict function, we can now plot the projection of the observations on the two main components. As in part 1, we use the Iris dataset.

```r
library(ggplot2)
pca1 <- my_pca(as.matrix(iris[, 1:4]), 1, scale = TRUE, center = TRUE)
projected <- predict(pca1, as.matrix(iris[, 1:4]))
ggplot() + geom_point(aes(x = projected[, 1], y = projected[, 2], color = iris[, 5]))
```

Projection of the Iris dataset on the two main principal components

Comparison with the FactoMineR implementation

We can now compare our implementation with the standard FactoMineR implementation of Principal Component Analysis.

```r
library(FactoMineR)
pca_stats <- PCA(as.matrix(iris[, 1:4]))
projected_stats <- predict(pca_stats, as.matrix(iris[, 1:4]))$coord[, 1:2]
ggplot(data = iris) +
  geom_point(aes(x = projected_stats[, 1], y = -projected_stats[, 2], color = Species)) +
  xlab('PC1') + ylab('PC2') +
  ggtitle('Iris dataset projected on the two main PCs (FactoMineR)')
```

When running this, you should get a plot very similar to the previous one. This ensures the sanity of our implementation.

Projection of the Iris dataset using the FactoMineR implementation

Thanks for reading! To find more posts on Machine Learning, Python and R, you can follow us on Facebook or Twitter.

The post Create your Machine Learning library from scratch with R ! (2/5) – PCA appeared first on Enhance Data Science.

To leave a comment for the author, please follow the link and comment on their blog: Enhance Data Science.

### DALEX @ eRum 2018

Tue, 05/15/2018 - 16:55

(This article was first published on English – SmarterPoland.pl, and kindly contributed to R-bloggers)

The DALEX invasion has started with the workshop and talk @ eRum 2018. Find the workshop materials at DALEX: Descriptive mAchine Learning EXplanations. Tools for exploration, validation and explanation of complex machine learning models (thanks to Mateusz Staniak for running the second part of the workshop). And my presentation Show me your model 2.0! (thanks go to the whole MI2DataLab for contributions and Malgorzata Pawlak for great stickers).

To leave a comment for the author, please follow the link and comment on their blog: English – SmarterPoland.pl.

### Time Series of the World, Unite!

Tue, 05/15/2018 - 12:00

(This article was first published on R – usefulr, and kindly contributed to R-bloggers)

The R ecosystem knows a ridiculous number of time series classes. So, I decided to create a new universal standard that finally covers everyone’s use case… Ok, just kidding! tsbox, now freshly on CRAN, provides a set of tools that are agnostic towards existing time series classes. It is built around a set of converters, which convert time series stored as ts, xts, data.frame, data.table, tibble, zoo, tsibble or timeSeries to each other.

To install the stable version from CRAN:

```r
install.packages("tsbox")
```

To get an idea how easy it is to switch from one class to another, consider this:

```r
library(tsbox)
x.ts <- ts_c(mdeaths, fdeaths)
x.xts <- ts_xts(x.ts)
x.df <- ts_df(x.xts)
x.tbl <- ts_tbl(x.df)
x.dt <- ts_dt(x.tbl)
x.zoo <- ts_zoo(x.dt)
x.tsibble <- ts_tsibble(x.zoo)
x.timeSeries <- ts_timeSeries(x.tsibble)
```

We jump from good old ts objects to xts, store our time series in various data frames and convert them to some highly specialized time series formats.
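Under the hood, each converter boils down to a transformation between a time series representation and a tabular one. A minimal base-R sketch of a ts-to-data.frame conversion, roughly the shape of thing a converter like `ts_df` produces (this is an illustration, not tsbox's actual implementation):

```r
# convert a quarterly ts to a long data.frame of (time, value) pairs
x <- ts(c(1.2, 2.4, 3.1, 4.8), start = c(2018, 1), frequency = 4)
df <- data.frame(time  = as.numeric(time(x)),   # 2018.00, 2018.25, ...
                 value = as.numeric(x))
df
```

The hard part tsbox solves is doing this, and the inverse, consistently for eight different classes and arbitrary frequencies.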
tsbox is class-agnostic

Because these converters work nicely, we can use them to make functions class-agnostic. If a class-agnostic function works for one class, it works for all:

```r
ts_scale(x.ts)
ts_scale(x.xts)
ts_scale(x.df)
ts_scale(x.dt)
ts_scale(x.tbl)
```

ts_scale normalizes one or multiple series by subtracting the mean and dividing by the standard deviation. It works like a ‘generic’ function: you can apply it to any time series object, and it will return an object of the same class as its input. So, whether we want to smooth, scale, differentiate, chain-link, forecast, regularize or seasonally adjust a series, we can use the same commands on whatever time series is at hand. tsbox offers a comprehensive toolkit for the basics of time series manipulation. Here are some additional operations:

```r
ts_pc(x.ts)                 # percentage change rates
ts_forecast(x.xts)          # forecast, by exponential smoothing
ts_seas(x.df)               # seasonal adjustment, by X-13
ts_frequency(x.dt, "year")  # convert to annual frequency
ts_span(x.tbl, "-1 year")   # limit time span to final year
```

tsbox is frequency-agnostic

There are many more. Because they all start with ts_, you can use auto-complete to see what’s around. Most conveniently, there is a time series plot function that works for all classes and frequencies:

```r
ts_plot(
  `Airline Passengers` = AirPassengers,
  `Lynx trappings` = ts_df(lynx),
  `Deaths from Lung Diseases` = ts_xts(fdeaths),
  title = "Airlines, trappings, and deaths",
  subtitle = "Monthly passengers, annual trappings, monthly deaths"
)
```

There is also a version that uses ggplot2 and has the same syntax.

Time series in data frames

You may have wondered why we treated data frames as a time series class. The spread of dplyr and data.table has given data frames a boost and made them one of the most popular data structures in R. So, storing time series in a data frame is an obvious consequence.
And even if you don’t intend to keep time series in data frames, this is still the format in which you import and export your data. tsbox makes it easy to switch from data frames to time series and back.

Make existing functions class-agnostic

tsbox includes tools to make existing functions class-agnostic. To do so, the ts_ function can be used to wrap any function that works with time series. For a function that works on "ts" objects, this is as simple as:

```r
ts_rowsums <- ts_(rowSums)
ts_rowsums(ts_c(mdeaths, fdeaths))
```

Note that ts_ returns a function, which can be used with or without a name. In case you are wondering, tsbox uses data.table as a backend, and makes use of its incredibly efficient reshaping facilities, its joins and rolling joins. And thanks to anytime, tsbox is able to recognize almost any date format without manual intervention. So, enjoy some relief in R’s time series class struggle. Website: www.tsbox.help

To leave a comment for the author, please follow the link and comment on their blog: R – usefulr.

### Getting into the Rhythm: the euRovision sweepstake

Tue, 05/15/2018 - 11:51

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

Laura Swales, Marketing and Events Assistant

Another month, another sweepstake to raise money for the Bath Cats & Dogs home!
This time, we picked the Eurovision song contest as our sweepstake of choice. After enjoying my first experience of using R to randomise the names for the previous sweepstake, I decided to give it another go, but with a few tweaks.

Soundcheck

During my first attempt in R, issues arose when I had been (innocently!) allocated the favourite horse to win. I had no way to prove that the R code had made the selection, as my work was not reproducible. So, with the cries of “cheater!” and “fix!” still ringing in my ears, we started by setting a seed. This means that if someone else were to replicate my code they would get the same results, thereby removing the dark smudge against my good name. At random I selected the number 6 at which to set my seed.

```r
set.seed(6)
```

I next compiled my lists of people and Eurovision countries and assigned them to objects.

```r
people_list <- c("Andy M", "Adam", "Laura", "Rachel", "Owen", "Yvi", "Karis",
                 "Toby", "Jen", "Matty G", "Tatiana", "Amanda", "Chrissy",
                 "Lisa", "Lisa", "Ben", "Ben", "Robert", "Toby", "Matt A",
                 "Lynn", "Ruth", "Julian", "Karina", "Colin", "Colin")

countries_list <- c("Albania", "Australia", "Austria", "Bulgaria", "Cyprus",
                    "Czech Rep", "Denmark", "Estonia", "Finland", "France",
                    "Germany", "Hungary", "Ireland", "Israel", "Italy",
                    "Lithuania", "Moldova", "Norway", "Portugal", "Serbia",
                    "Slovenia", "Spain", "Sweden", "The Netherlands",
                    "Ukraine", "United Kingdom")
```

Once I had the lists assigned to objects, I followed the same steps as my previous attempt in R. I put both objects into a data frame and used the sample function to jumble up the names.

```r
assign_countries <- data.frame(people = people_list,
                               countries = sample(countries_list))
```

Task complete! Fate had delivered me Denmark, who were nowhere near the favourites at the point of selection.
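The reproducibility claim is easy to verify: with the same seed, `sample` always produces the same shuffle. A minimal sketch with made-up names:

```r
# same seed, same draw: anyone re-running the code gets the same assignment
set.seed(6)
a <- sample(c("Israel", "Denmark", "Cyprus"))
set.seed(6)
b <- sample(c("Israel", "Denmark", "Cyprus"))
identical(a, b)  # TRUE
```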
I sighed with relief, knowing that I had no chance of winning again and that perhaps now I could start to rebuild my reputation as an honest co-worker…

Encore

Before I finished my latest foray into R, we decided to create a function for creating sweepstakes in R. I was talked down from picking the name SweepstakeizzleR and decided upon the slightly more sensible sweepR. I entered the desired workings of the function, which followed from the work above.

```r
sweepR <- function(a, b, seed = 1234) {
  set.seed(seed)
  data.frame(a, sample(b))
}
```

Once done, I could use my newly created function to complete the work I had done before, but in a much timelier fashion.

```r
sweepR(people_list, countries_list)
```

My very first function worked! Using a function like sweepR allows me to reliably reproduce the procedures I need for whatever task I’m working on. In this case it has enabled me to create a successfully random sweepstake mix of names and entries.

WinneR

To my great relief, Israel won Eurovision and I was very happy to hand over the prize to Amanda. I really enjoyed learning a little more about R and how I can create functions to streamline my work. Hopefully another reason will come up for me to learn even more soon!

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions.
### The Pleasing Ratio Project

Tue, 05/15/2018 - 10:08

(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)

Music is a world within itself, with a language we all understand (Sir Duke, Stevie Wonder)

This serious man on the left is Gustav Theodor Fechner, a German philosopher, physicist and experimental psychologist who lived between 1801 and 1887. To be honest, I know almost nothing of his life or work except one thing: in the 1860s he carried out a thought-provoking experiment. It seems interesting to me for two important reasons: he called into question something widely established, and he obtained experimental data by himself.

Fechner’s experiment was simple: he presented just ten rectangles to 82 students. Then he asked each of them to choose the most pleasing one, and obtained revealing discoveries that I will not explain here, since that would bias my experiment. You can find more information about the original experiment here.

I have done a project inspired by Fechner’s that I called The Pleasing Ratio Project. Once you enter the App, you will see two rectangles. Both of them have the same area; they only vary in their length-to-width ratios. You will then be asked to select the one that seems most pleasing to you. If you have doubts, just choose I’m not sure. You can do it as many times as you want (all responses are completely anonymous). Every game confronts a pair of ratios, which can vary from 1 to 3.5. In the Results section you will find the percentage of winning games for each ratio. The one with the highest percentage will officially be named The Most Pleasing Ratio of the World by myself.

Although my experiment is absolutely inspired by Fechner’s, there are some differences. I can explore a bigger set of ratios by doing an A/B test, and I also introduce the option I’m not sure. This makes this one a bit richer.
The experiment also has some interesting technical features:

• the use of the shinydashboard package to arrange the App
• the use of the shinyjs package to add JavaScript to refresh the page when the user chooses to play again
• saving votes in a text file
• reading it to visualize the results

Will I obtain the same results as Fechner? This is a living project whose results will change over time, so you can check it regularly. Go to The Pleasing Ratio Project! The code of the project is available on GitHub. Thanks a lot for your collaboration!

To leave a comment for the author, please follow the link and comment on their blog: R – Fronkonstin.

### Workaround for tidyr::spread with duplicate row identifiers

Tue, 05/15/2018 - 02:00

(This article was first published on Digital Age Economist on Digital Age Economist, and kindly contributed to R-bloggers)

The problem

The tidyverse API is one of the easiest APIs in R to follow, thanks to the concept of grammar-based code. But every now and again you come across an error that boggles your mind. My personal one is the error spread delivers when I have no index for my data:

Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)

I am sure we have all seen this at some time or other.
Well, let’s reproduce this error:

```r
library(dplyr)
library(tidyr)

df <- tibble(age    = c(21, 17, 32, 29, 15),
             gender = c("Male", "Female", "Female", "Male", "Male"))
df
## # A tibble: 5 x 2
##     age gender
##   <dbl> <chr>
## 1    21 Male
## 2    17 Female
## 3    32 Female
## 4    29 Male
## 5    15 Male

# Spread
df %>% spread(gender, age)
## Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)
```

To anyone doing the analysis, the output that he/she expects is quite trivial. The output should be:

```r
## # A tibble: 3 x 2
##   Female  Male
## *  <dbl> <dbl>
## 1     17    21
## 2     32    29
## 3     NA    15
```

But given that there is no index for spread to use, and nothing stopping the 2nd and 3rd values for Male collapsing to a single row, spread decides it's better to throw an error. There are multiple questions on Stack Overflow asking how to spread the df without getting the warning. And the list goes on…

The solution

Luckily I stumbled across a really nice issue on the tidyr GitHub repo. The user, markdly, opened the issue, and in the end added a nice workaround for this kind of problem. The idea is to add an index using group_by and row_number, and simply to remove the column after widening the data.

```r
df <- df %>%
  group_by(gender) %>%
  mutate(grouped_id = row_number())
df
## # A tibble: 5 x 3
## # Groups:   gender [2]
##     age gender grouped_id
##   <dbl> <chr>       <int>
## 1    21 Male            1
## 2    17 Female          1
## 3    32 Female          2
## 4    29 Male            2
## 5    15 Male            3

# Now spread
df %>%
  spread(gender, age) %>%
  select(-grouped_id)
## # A tibble: 3 x 2
##   Female  Male
##    <dbl> <dbl>
## 1     17    21
## 2     32    29
## 3     NA    15
```

Giving spread an index in this way ensures no duplicate row identifiers are introduced, and the dataset stays rectangular. I hope this little gem of an insight into working with spread has saved you some time!
To leave a comment for the author, please follow the link and comment on their blog: Digital Age Economist on Digital Age Economist.

### CHAID and R – When you need explanation – May 15, 2018

Tue, 05/15/2018 - 02:00

(This article was first published on Chuck Powell, and kindly contributed to R-bloggers)

A modern data scientist using R has access to an almost bewildering number of tools, libraries and algorithms to analyze the data. In my next two posts I’m going to focus on an in-depth visit with CHAID (chi-square automatic interaction detection). The title should give you a hint as to why I think CHAID is a good “tool” for your analytical toolbox. There are lots of tools that can help you predict or classify, but CHAID is especially good at helping you explain to any audience how the model arrives at its prediction or classification. It’s also incredibly robust from a statistical perspective, making almost no assumptions about your data's distribution or normality. I’ll try to elaborate on that as we work the example. You can get a very brief summary of CHAID from Wikipedia, and mentions of it are scattered about in places like Analytics Vidhya or Data Flair. If you prefer a more scholarly bent, the original article can be found in places like JSTOR.
As the name implies, it is fundamentally based on the venerable chi-square test, and while not the most powerful (in terms of detecting the smallest possible differences) or the fastest, it really is easy to manage and, more importantly, to tell the story afterwards. Compared to some other techniques it’s also quite simple to use, as I hope you’ll agree by the end of these posts.

To showcase it, we’re going to use a dataset that comes to us from the IBM Watson Project and is packaged with the rsample library. It’s a very practical and understandable dataset, and a great use case for a tree-based algorithm. Imagine yourself in a fictional company faced with the task of trying to figure out which employees you are going to “lose”, a.k.a. attrition or turnover. There’s a steep cost involved in keeping good employees, and training and on-boarding can be expensive. Being able to predict attrition even a little bit better would save you lots of money and make the company better, especially if you can understand exactly what you have to “watch out” for that might indicate the person is a high risk to leave.

Setup and library loading

If you’ve never used CHAID before, you may also not have partykit. CHAID isn’t on CRAN, but I have commented out the install command below. You’ll also get a variety of messages, none of which is relevant to this example, so I’ve suppressed them.

```r
# install.packages("partykit")
# install.packages("CHAID", repos = "http://R-Forge.R-project.org")
require(rsample)  # for dataset and splitting; also loads broom and tidyr
require(dplyr)
require(ggplot2)
theme_set(theme_bw())  # set theme
require(CHAID)
require(purrr)
require(caret)
```

Predicting attrition in a fictional company

Let’s load up the attrition dataset and take a look at the variables we have.

```r
# data(attrition)
str(attrition)
## 'data.frame': 1470 obs. of 31 variables:
##  $ Age                     : int 41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int 1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int 94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int 2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int 8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int 11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : int 0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int 8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int 0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int 6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int 4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int 5 7 0 0 2 6 0 0 8 7 ...
```

Okay, we have data on 1,470 employees. We have 30 potential predictor or
independent variables and the all-important attrition variable, which
gives us a yes or no answer to the question of whether or not the
employee left. Our task is to build the most accurate predictive model we can
that is also simple (parsimonious) and explainable. The predictors we
have seem to be the sorts of data we might have on hand in our HR files
and, thank goodness, are labelled in a way that makes them pretty
self-explanatory.

The CHAID library in R requires that any variables we enter as
predictors be either nominal or ordinal variables (see ?CHAID::chaid),
which in R speak means we have to get them in as either factor or
ordered factor. The str command shows we have a bunch of variables
which are of type integer. As it turns out, moving from integer to
factor is simple in terms of code but has to be thoughtful for
substantive reasons. So let's see how things break down.
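As a minimal sketch of that distinction (using a made-up toy column, not one of the attrition variables), the factor() call separates nominal from ordinal via the ordered argument:

```r
# Toy sketch: converting an integer rating column to nominal vs ordinal.
# The data frame and column names here are invented for illustration.
df <- data.frame(rating = c(1L, 3L, 2L, 3L, 1L))

# Nominal: categories with no inherent order
df$rating_nom <- factor(df$rating)

# Ordinal: ordered = TRUE yields the Ord.factor type chaid() can use
df$rating_ord <- factor(df$rating, levels = 1:3,
                        labels = c("Low", "Medium", "High"),
                        ordered = TRUE)
str(df)
```

The str output will show rating_nom as Factor and rating_ord as Ord.factor, mirroring the two variable types in the attrition data above.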

attrition %>%
  select_if(is.factor) %>%
  ncol

## [1] 15

attrition %>%
  select_if(is.numeric) %>%
  ncol

## [1] 16

Hmmmm, 15 factors and 16 integers. Let's explore further. Of the
variables that are integers, how many have a small number of distinct
values (a.k.a. levels) and can therefore be simply and easily converted
to true factors? We'll use a dplyr pipe to see how many have 5 or
fewer levels and how many have 10 or fewer levels.

attrition %>%
  select_if(function(col) length(unique(col)) <= 5 & is.integer(col)) %>%
  head

##   JobLevel StockOptionLevel
## 1        2                0
## 2        2                1
## 4        1                0
## 5        1                0
## 7        1                1
## 8        1                0

attrition %>%
  select_if(function(col) length(unique(col)) <= 10 & is.integer(col)) %>%
  head

##   JobLevel NumCompaniesWorked StockOptionLevel TrainingTimesLastYear
## 1        2                  8                0                     0
## 2        2                  1                1                     3
## 4        1                  6                0                     3
## 5        1                  1                0                     3
## 7        1                  9                1                     3
## 8        1                  0                0                     2

2 and 4 respectively. We can be pretty confident that converting these
from integer to factor won't lose much information. It's simple to run a
mutate operation across the 4 we have identified. It's probably more
elegant, though, to make it a mutate_if. That way, if in the future we
decide we like 4 or 7 or 122 as our criterion for the change, we only
have to change one number. The "if" variation is also less to type and
less prone to manual mistakes.

attrition %>% mutate( JobLevel = factor(JobLevel), NumCompaniesWorked = factor(NumCompaniesWorked), StockOptionLevel = factor(StockOptionLevel), TrainingTimesLastYear = factor(TrainingTimesLastYear) ) %>% str ## 'data.frame': 1470 obs. of 31 variables: ## $Age : int 41 49 37 33 27 32 59 30 38 36 ... ##$ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ... ## $BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ... ##$ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ... ## $Department : Factor w/ 3 levels "Human_Resources",..: 3 2 2 2 2 2 2 2 2 2 ... ##$ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ... ## $Education : Ord.factor w/ 5 levels "Below_College"<..: 2 1 2 4 1 2 3 1 3 3 ... ##$ EducationField : Factor w/ 6 levels "Human_Resources",..: 2 2 5 2 4 2 4 2 2 4 ... ## $EnvironmentSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 2 3 4 4 1 4 3 4 4 3 ... ##$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ... ## $HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ... ##$ JobInvolvement : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 2 3 3 3 4 3 2 3 ... ## $JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 2 1 1 1 1 1 1 3 2 ... ##$ JobRole : Factor w/ 9 levels "Healthcare_Representative",..: 8 7 3 7 3 3 3 3 5 1 ... ## $JobSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 3 3 2 4 1 3 3 3 ... ##$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ... ## $MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ... ##$ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ... ## $NumCompaniesWorked : Factor w/ 10 levels "0","1","2","3",..: 9 2 7 2 10 1 5 2 1 7 ... ##$ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ... ## $PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ... ##$ PerformanceRating : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ... 
## $RelationshipSatisfaction: Ord.factor w/ 4 levels "Low"<"Medium"<..: 1 4 2 3 4 3 1 2 2 2 ... ##$ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 2 1 4 2 1 3 ... ## $TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ... ##$ TrainingTimesLastYear : Factor w/ 7 levels "0","1","2","3",..: 1 4 4 4 4 3 4 3 3 4 ... ## $WorkLifeBalance : Ord.factor w/ 4 levels "Bad"<"Good"<"Better"<..: 1 3 3 3 3 2 2 3 3 2 ... ##$ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ... ## $YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ... ##$ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ... ## $YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ... attrition <- attrition %>% mutate_if(function(col) length(unique(col)) <= 10 & is.integer(col), as.factor) summary(attrition) ## Age Attrition BusinessTravel DailyRate ## Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0 ## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0 ## Median :36.00 Travel_Rarely :1043 Median : 802.0 ## Mean :36.92 Mean : 802.5 ## 3rd Qu.:43.00 3rd Qu.:1157.0 ## Max. :60.00 Max. :1499.0 ## ## Department DistanceFromHome Education ## Human_Resources : 63 Min. : 1.000 Below_College:170 ## Research_Development:961 1st Qu.: 2.000 College :282 ## Sales :446 Median : 7.000 Bachelor :572 ## Mean : 9.193 Master :398 ## 3rd Qu.:14.000 Doctor : 48 ## Max. :29.000 ## ## EducationField EnvironmentSatisfaction Gender ## Human_Resources : 27 Low :284 Female:588 ## Life_Sciences :606 Medium :287 Male :882 ## Marketing :159 High :453 ## Medical :464 Very_High:446 ## Other : 82 ## Technical_Degree:132 ## ## HourlyRate JobInvolvement JobLevel ## Min. : 30.00 Low : 83 1:543 ## 1st Qu.: 48.00 Medium :375 2:534 ## Median : 66.00 High :868 3:218 ## Mean : 65.89 Very_High:144 4:106 ## 3rd Qu.: 83.75 5: 69 ## Max. 
:100.00 ## ## JobRole JobSatisfaction MaritalStatus ## Sales_Executive :326 Low :289 Divorced:327 ## Research_Scientist :292 Medium :280 Married :673 ## Laboratory_Technician :259 High :442 Single :470 ## Manufacturing_Director :145 Very_High:459 ## Healthcare_Representative:131 ## Manager :102 ## (Other) :215 ## MonthlyIncome MonthlyRate NumCompaniesWorked OverTime ## Min. : 1009 Min. : 2094 1 :521 No :1054 ## 1st Qu.: 2911 1st Qu.: 8047 0 :197 Yes: 416 ## Median : 4919 Median :14236 3 :159 ## Mean : 6503 Mean :14313 2 :146 ## 3rd Qu.: 8379 3rd Qu.:20462 4 :139 ## Max. :19999 Max. :26999 7 : 74 ## (Other):234 ## PercentSalaryHike PerformanceRating RelationshipSatisfaction ## Min. :11.00 Low : 0 Low :276 ## 1st Qu.:12.00 Good : 0 Medium :303 ## Median :14.00 Excellent :1244 High :459 ## Mean :15.21 Outstanding: 226 Very_High:432 ## 3rd Qu.:18.00 ## Max. :25.00 ## ## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance ## 0:631 Min. : 0.00 0: 54 Bad : 80 ## 1:596 1st Qu.: 6.00 1: 71 Good :344 ## 2:158 Median :10.00 2:547 Better:893 ## 3: 85 Mean :11.28 3:491 Best :153 ## 3rd Qu.:15.00 4:123 ## Max. :40.00 5:119 ## 6: 65 ## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion ## Min. : 0.000 Min. : 0.000 Min. : 0.000 ## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000 ## Median : 5.000 Median : 3.000 Median : 1.000 ## Mean : 7.008 Mean : 4.229 Mean : 2.188 ## 3rd Qu.: 9.000 3rd Qu.: 7.000 3rd Qu.: 3.000 ## Max. :40.000 Max. :18.000 Max. :15.000 ## ## YearsWithCurrManager ## Min. : 0.000 ## 1st Qu.: 2.000 ## Median : 3.000 ## Mean : 4.123 ## 3rd Qu.: 7.000 ## Max. :17.000 ## As you look at the results this is a good time to remind you that CHAID is “non parametric” which means that we don’t have to worry about how the distribution (normality) looks nor make any assumptions about the variance. We are assuming that the predictors are independent of one another, but that is true of every statistical test and this is a robust procedure. 
So for now, let’s simply ignore all the variables that are still integers. I promise we’ll come back and deal with them later. But for now I’m eager to actually use CHAID and do some predicting. We’re also going to defer and address the issue of “over-fitting” and how to most wisely use the data we have. We’re simply going to build a first model using all 1,470 cases, the 18 factors we have available to predict with and we are trying to predict attrition. We’ll create a new dataframe called newattrit (how original right?). newattrit <- attrition %>% select_if(is.factor) dim(newattrit) ## [1] 1470 19 The chaid command accepts two pieces of information in it’s simplest case, a formula like outcome ~ predictors and a dataframe. We’re going to make use of the ~ . shortcut on the right hand side and add attrition on the left and newattrit as our dataframe. About 6 seconds later (at least on my Mac) we’ll have a solution that we can print and plot. I’m going to output all the plots in a smaller size for the benefit of you the readers. I’m doing that via RMarkdown and it won’t happen automatically for you if you download and use the code. I’ll initially be using, fig.height=10, fig.width=20, dpi=90, out.width=“900px” What does CHAID do? Straight from the help pages “Select the predictor that has the smallest adjusted p-value (i.e., most significant). If this adjusted p-value is less than or equal to a user-specified alpha-level alpha4, split the node using this predictor. Else, do not split and the node is considered as a terminal node.” So it will take our 18 predictors and test each one against our outcome variable – attrition. The one with the lowest p value (a proxy for is most predictive) will “anchor” our decision tree. It will then repeat this process of splitting until more splits fail to yield significant results. I’m way over-simplifying, of course, but you get the idea. 
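In spirit (and leaving aside CHAID's category-merging and p-value-adjustment machinery), that selection step amounts to running a chi-square test of each predictor against the outcome and picking the smallest p-value. A toy sketch with invented data, not the attrition variables:

```r
# Toy data: two invented categorical predictors and a binary outcome
set.seed(42)
toy <- data.frame(
  outcome = factor(sample(c("No", "Yes"), 100, replace = TRUE)),
  pred1   = factor(sample(c("A", "B"), 100, replace = TRUE)),
  pred2   = factor(sample(c("X", "Y"), 100, replace = TRUE))
)

# p-value of each predictor vs the outcome; the smallest would "anchor" the split
pvals <- sapply(c("pred1", "pred2"), function(v)
  chisq.test(table(toy[[v]], toy$outcome))$p.value)
sort(pvals)
```

Again, this is only the flavor of the algorithm; the real implementation also merges similar categories and applies Bonferroni-style adjustments before comparing p-values.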
The end result will be a series of terminal nodes (think of them as “prediction buckets” that have a group of employees who all meet the same criteria who we think will either attrit or not attrit). Let’s run it. # demonstrate a full model using chaid with defaults chaidattrit1 <- chaid(Attrition ~ ., data = newattrit) print(chaidattrit1) ## ## Model formula: ## Attrition ~ BusinessTravel + Department + Education + EducationField + ## EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + ## JobRole + JobSatisfaction + MaritalStatus + NumCompaniesWorked + ## OverTime + PerformanceRating + RelationshipSatisfaction + ## StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance ## ## Fitted party: ## [1] root ## | [2] OverTime in No ## | | [3] StockOptionLevel in 0 ## | | | [4] JobSatisfaction in Low ## | | | | [5] RelationshipSatisfaction in Low, Medium, High: No (n = 56, err = 42.9%) ## | | | | [6] RelationshipSatisfaction in Very_High: No (n = 28, err = 7.1%) ## | | | [7] JobSatisfaction in Medium, High ## | | | | [8] JobInvolvement in Low: Yes (n = 12, err = 41.7%) ## | | | | [9] JobInvolvement in Medium, High, Very_High ## | | | | | [10] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 181, err = 9.9%) ## | | | | | [11] BusinessTravel in Travel_Frequently ## | | | | | | [12] RelationshipSatisfaction in Low: Yes (n = 8, err = 25.0%) ## | | | | | | [13] RelationshipSatisfaction in Medium, High, Very_High: No (n = 30, err = 16.7%) ## | | | [14] JobSatisfaction in Very_High: No (n = 134, err = 7.5%) ## | | [15] StockOptionLevel in 1, 2, 3 ## | | | [16] EnvironmentSatisfaction in Low: No (n = 127, err = 11.0%) ## | | | [17] EnvironmentSatisfaction in Medium, High, Very_High ## | | | | [18] Department in Human_Resources, Sales: No (n = 164, err = 8.5%) ## | | | | [19] Department in Research_Development: No (n = 314, err = 3.2%) ## | [20] OverTime in Yes ## | | [21] JobLevel in 1 ## | | | [22] StockOptionLevel in 0, 3 ## | | | | [23] JobSatisfaction in Low, 
Medium, High: Yes (n = 61, err = 26.2%) ## | | | | [24] JobSatisfaction in Very_High: No (n = 28, err = 46.4%) ## | | | [25] StockOptionLevel in 1, 2 ## | | | | [26] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 50, err = 26.0%) ## | | | | [27] BusinessTravel in Travel_Frequently: Yes (n = 17, err = 35.3%) ## | | [28] JobLevel in 2, 3, 4, 5 ## | | | [29] MaritalStatus in Divorced, Married ## | | | | [30] EnvironmentSatisfaction in Low, Medium: No (n = 60, err = 20.0%) ## | | | | [31] EnvironmentSatisfaction in High, Very_High ## | | | | | [32] TrainingTimesLastYear in 0, 6: No (n = 10, err = 40.0%) ## | | | | | [33] TrainingTimesLastYear in 1, 2, 3, 4, 5 ## | | | | | | [34] EnvironmentSatisfaction in Low, Medium, High: No (n = 57, err = 0.0%) ## | | | | | | [35] EnvironmentSatisfaction in Very_High: No (n = 61, err = 6.6%) ## | | | [36] MaritalStatus in Single ## | | | | [37] Department in Human_Resources, Research_Development: No (n = 37, err = 10.8%) ## | | | | [38] Department in Sales: Yes (n = 35, err = 40.0%) ## ## Number of inner nodes: 18 ## Number of terminal nodes: 20 plot(chaidattrit1) chisq.test(newattrit$Attrition, newattrit$OverTime) ## ## Pearson's Chi-squared test with Yates' continuity correction ## ## data: newattrit$Attrition and newattrit$OverTime ## X-squared = 87.564, df = 1, p-value < 2.2e-16 I happen to be a visual learner and prefer the plot to the print but they are obviously reporting the same information so use them as you see fit. As you can see the very first split it decides on is overtime yes or no. I’ve run the chi-square test so that you can see the p value is indeed very small (0.00000000000000022). So the algorithm has decided that the most predictive way to divide our sample of employees is into 20 terminal nodes or buckets. Each one of the nodes represents a distinct set of predictors. Take a minute to look at node 19. Every person there shares the following characteristics. 
• [2] OverTime in No • [15] StockOptionLevel in 1, 2, 3 • [17] EnvironmentSatisfaction in Medium, High, Very_High • [19] Department in Research_Development: No There are n = 314 in this group, our prediction is that No they will not attrit and we were “wrong” err = 3.2%. That’s some useful information. To quote an old Star Wars movie “These are not the droids you’re looking for…”. In other words, this is not a group we should be overly worried about losing and we can say that with pretty high confidence. For contrast let’s look at node #23: • [20] OverTime in Yes • [21] JobLevel in 1 • [22] StockOptionLevel in 0, 3 • [23] JobSatisfaction in Low, Medium, High: Where there are n = 61 staff, we predict they will leave Yes and we get it wrong err = 26.2% of the time. A little worrisome that we’re not as accurate but this is a group that bears watching or intervention if we want to retain them. Some other things to note. Because the predictors are considered categorical we will get splits like we do for node 22, where 0 and 3 are on one side and 1, 2 is on the other. The number of people in any node can be quite variable. Finally, notice that a variable can occur at different levels of the model like StockOptionLevel does! On the plot side of things there are a few key options you can adjust to make things easier to read. The next blocks of code show you how to adjust some key options such as adding a title, reducing the font size, using “simple” mode, and changing colors. # digress for plotting plot(chaidattrit1, type = "simple") plot( chaidattrit1, main = "Testing Graphical Options", gp = gpar(fontsize = 8), type = "simple" ) plot( chaidattrit1, main = "Testing More Graphical Options", gp = gpar( col = "blue", lty = "solid", lwd = 3, fontsize = 10 ) ) Exercising some control Next let’s look into varying the parameters chaid uses to build the model. chaid_control (not surprisingly) controls the behavior of the model building. 
When you check the documentation at ?chaid_control you can see the list of 8 parameters you can adjust. We’ve already run the default settings implicitly when we built chaidattrit1 let’s look at three others. • minsplit – Number of observations in splitted response at which no further split is desired. • minprob – Minimum frequency of observations in terminal nodes. • maxheight – Maximum height for the tree. We’ll use those but our fourth model we’ll simply require a higher significance level for alpha2 and alpha4. ctrl <- chaid_control(minsplit = 200, minprob = 0.05) ctrl # notice the rest of the list is there at the default value ##$alpha2 ## [1] 0.05 ## ## $alpha3 ## [1] -1 ## ##$alpha4 ## [1] 0.05 ## ## $minsplit ## [1] 200 ## ##$minbucket ## [1] 7 ## ## $minprob ## [1] 0.05 ## ##$stump ## [1] FALSE ## ## $maxheight ## [1] -1 ## ## attr(,"class") ## [1] "chaid_control" chaidattrit2 <- chaid(Attrition ~ ., data = newattrit, control = ctrl) print(chaidattrit2) ## ## Model formula: ## Attrition ~ BusinessTravel + Department + Education + EducationField + ## EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + ## JobRole + JobSatisfaction + MaritalStatus + NumCompaniesWorked + ## OverTime + PerformanceRating + RelationshipSatisfaction + ## StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance ## ## Fitted party: ## [1] root ## | [2] OverTime in No ## | | [3] StockOptionLevel in 0 ## | | | [4] JobSatisfaction in Low: No (n = 84, err = 31.0%) ## | | | [5] JobSatisfaction in Medium, High ## | | | | [6] JobInvolvement in Low: Yes (n = 12, err = 41.7%) ## | | | | [7] JobInvolvement in Medium, High, Very_High ## | | | | | [8] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 181, err = 9.9%) ## | | | | | [9] BusinessTravel in Travel_Frequently: No (n = 38, err = 28.9%) ## | | | [10] JobSatisfaction in Very_High: No (n = 134, err = 7.5%) ## | | [11] StockOptionLevel in 1, 2, 3 ## | | | [12] EnvironmentSatisfaction in Low: No (n = 127, err = 11.0%) ## | | 
| [13] EnvironmentSatisfaction in Medium, High, Very_High ## | | | | [14] Department in Human_Resources, Sales: No (n = 164, err = 8.5%) ## | | | | [15] Department in Research_Development: No (n = 314, err = 3.2%) ## | [16] OverTime in Yes ## | | [17] JobLevel in 1: Yes (n = 156, err = 47.4%) ## | | [18] JobLevel in 2, 3, 4, 5 ## | | | [19] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%) ## | | | [20] MaritalStatus in Single: No (n = 72, err = 34.7%) ## ## Number of inner nodes: 9 ## Number of terminal nodes: 11 plot( chaidattrit2, main = "minsplit = 200, minprob = 0.05", gp = gpar( col = "blue", lty = "solid", lwd = 3 ) ) ctrl <- chaid_control(maxheight = 3) chaidattrit3 <- chaid(Attrition ~ ., data = newattrit, control = ctrl) print(chaidattrit3) ## ## Model formula: ## Attrition ~ BusinessTravel + Department + Education + EducationField + ## EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + ## JobRole + JobSatisfaction + MaritalStatus + NumCompaniesWorked + ## OverTime + PerformanceRating + RelationshipSatisfaction + ## StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance ## ## Fitted party: ## [1] root ## | [2] OverTime in No ## | | [3] StockOptionLevel in 0 ## | | | [4] JobSatisfaction in Low: No (n = 84, err = 31.0%) ## | | | [5] JobSatisfaction in Medium, High: No (n = 231, err = 15.6%) ## | | | [6] JobSatisfaction in Very_High: No (n = 134, err = 7.5%) ## | | [7] StockOptionLevel in 1, 2, 3 ## | | | [8] EnvironmentSatisfaction in Low: No (n = 127, err = 11.0%) ## | | | [9] EnvironmentSatisfaction in Medium, High, Very_High: No (n = 478, err = 5.0%) ## | [10] OverTime in Yes ## | | [11] JobLevel in 1 ## | | | [12] StockOptionLevel in 0, 3: Yes (n = 89, err = 34.8%) ## | | | [13] StockOptionLevel in 1, 2: No (n = 67, err = 35.8%) ## | | [14] JobLevel in 2, 3, 4, 5 ## | | | [15] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%) ## | | | [16] MaritalStatus in Single: No (n = 72, err = 34.7%) ## ## Number of inner 
nodes: 7 ## Number of terminal nodes: 9 plot( chaidattrit3, main = "maxheight = 3", gp = gpar( col = "blue", lty = "solid", lwd = 3 ) ) ctrl <- chaid_control(alpha2 = .01, alpha4 = .01) chaidattrit4 <- chaid(Attrition ~ ., data = newattrit, control = ctrl) print(chaidattrit4) ## ## Model formula: ## Attrition ~ BusinessTravel + Department + Education + EducationField + ## EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + ## JobRole + JobSatisfaction + MaritalStatus + NumCompaniesWorked + ## OverTime + PerformanceRating + RelationshipSatisfaction + ## StockOptionLevel + TrainingTimesLastYear + WorkLifeBalance ## ## Fitted party: ## [1] root ## | [2] OverTime in No ## | | [3] StockOptionLevel in 0 ## | | | [4] JobSatisfaction in Low ## | | | | [5] RelationshipSatisfaction in Low, Medium, High: No (n = 56, err = 42.9%) ## | | | | [6] RelationshipSatisfaction in Very_High: No (n = 28, err = 7.1%) ## | | | [7] JobSatisfaction in Medium, High, Very_High ## | | | | [8] JobInvolvement in Low: No (n = 20, err = 45.0%) ## | | | | [9] JobInvolvement in Medium, High, Very_High ## | | | | | [10] JobLevel in 1: No (n = 139, err = 18.0%) ## | | | | | [11] JobLevel in 2, 3, 4, 5: No (n = 206, err = 5.8%) ## | | [12] StockOptionLevel in 1, 2, 3: No (n = 605, err = 6.3%) ## | [13] OverTime in Yes ## | | [14] JobLevel in 1 ## | | | [15] StockOptionLevel in 0, 3: Yes (n = 89, err = 34.8%) ## | | | [16] StockOptionLevel in 1, 2: No (n = 67, err = 35.8%) ## | | [17] JobLevel in 2, 3, 4, 5 ## | | | [18] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%) ## | | | [19] MaritalStatus in Single ## | | | | [20] Department in Human_Resources, Research_Development: No (n = 37, err = 10.8%) ## | | | | [21] Department in Sales: Yes (n = 35, err = 40.0%) ## ## Number of inner nodes: 10 ## Number of terminal nodes: 11 plot( chaidattrit4, main = "alpha2 = .01, alpha4 = .01", gp = gpar( col = "blue", lty = "solid", lwd = 3 ) ) Let me call your attention to chaidattrit3 for a 
minute to highlight two important things. First it is a good picture of what we get for answer if we were to ask a question about what are the most important predictors, what variables should we focus on. An important technical detail has emerged as well. Notice that when you look at inner node #3 that there is no technical reason why a node has to have a binary split in chaid. As this example clearly shows node#3 leads to a three way split that is nodes #4-6. How good is our model? So the obvious question is which model is best? IMHO the joy of CHAID is in giving you a clear picture of what you would predict given the data and why. Then of course there is the usual problem every data scientist has, which is, I have what I think is a great model. How well will it generalize to new data? Whether that’s next years attrition numbers for the same company or say data from a different company. But it’s time to talk about accuracy and all the related ideas, so on with the show… When it’s all said and done we built a model called chaidattrit1 to be able to predict or classify the 1,470 staff members. Seems reasonable then that we can get back these predictions from the model for all 1,470 people and see how we did compared to the data we have about whether they attrited or not. The print and plot commands sort of summarize that for us at the terminal node level with an error rate but all in all which of our four models is best? The first step is to get the predictions for each model and put them somewhere. For that we’ll use the predict command. If you inspect the object you create (in my case with a head command) you’ll see it’s a vector of factors where the attribute names is set to be the terminal node the prediction is associated with. So pmodel1 <- predict(chaidattrit1) puts our predictions using the first model we built in a nice orderly fashion. On the other side newattrit$Attrition
has the actual outcome of whether the employee departed or not.

What we want is a comparison of how well we did. How often did we get it
right or wrong? Turns out what we need is called a confusion matrix. The
caret package has a function called confusionMatrix that will give
us what we want, nicely formatted and printed.
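Under the hood, a confusion matrix is just a cross-tabulation of predictions against reality. A minimal base-R sketch (with invented vectors, not the real model output) looks like this:

```r
# Invented predictions and outcomes, just to show the shape of the table
pred   <- factor(c("No", "No", "Yes", "No", "Yes"), levels = c("No", "Yes"))
actual <- factor(c("No", "Yes", "Yes", "No", "No"), levels = c("No", "Yes"))
cm <- table(Prediction = pred, Reference = actual)
cm

# Accuracy is the proportion of cases on the diagonal (correct predictions)
sum(diag(cm)) / sum(cm)
```

confusionMatrix reports these same counts plus the derived statistics (sensitivity, specificity, kappa, and so on).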

There's a nice short summary of what is produced at the linked Confusion
Matrix page, so I won't even try to repeat that material. I'll just run
the appropriate commands. Later we'll revisit this topic to be more
efficient. For now I want to focus on the results.

# digress how accurate were we pmodel1 <- predict(chaidattrit1) head(pmodel1) ## 38 19 23 23 16 14 ## Yes No Yes Yes No No ## Levels: No Yes pmodel2 <- predict(chaidattrit2) pmodel3 <- predict(chaidattrit3) pmodel4 <- predict(chaidattrit4) confusionMatrix(pmodel1, newattrit$Attrition) ## Confusion Matrix and Statistics ## ## Reference ## Prediction No Yes ## No 1190 147 ## Yes 43 90 ## ## Accuracy : 0.8707 ## 95% CI : (0.8525, 0.8875) ## No Information Rate : 0.8388 ## P-Value [Acc > NIR] : 0.0003553 ## ## Kappa : 0.4192 ## Mcnemar's Test P-Value : 7.874e-14 ## ## Sensitivity : 0.9651 ## Specificity : 0.3797 ## Pos Pred Value : 0.8901 ## Neg Pred Value : 0.6767 ## Prevalence : 0.8388 ## Detection Rate : 0.8095 ## Detection Prevalence : 0.9095 ## Balanced Accuracy : 0.6724 ## ## 'Positive' Class : No ## confusionMatrix(pmodel2, newattrit$Attrition) ## Confusion Matrix and Statistics ## ## Reference ## Prediction No Yes ## No 1154 148 ## Yes 79 89 ## ## Accuracy : 0.8456 ## 95% CI : (0.8261, 0.8637) ## No Information Rate : 0.8388 ## P-Value [Acc > NIR] : 0.2516 ## ## Kappa : 0.353 ## Mcnemar's Test P-Value : 6.382e-06 ## ## Sensitivity : 0.9359 ## Specificity : 0.3755 ## Pos Pred Value : 0.8863 ## Neg Pred Value : 0.5298 ## Prevalence : 0.8388 ## Detection Rate : 0.7850 ## Detection Prevalence : 0.8857 ## Balanced Accuracy : 0.6557 ## ## 'Positive' Class : No ## confusionMatrix(pmodel3, newattrit$Attrition) ## Confusion Matrix and Statistics ## ## Reference ## Prediction No Yes ## No 1202 179 ## Yes 31 58 ## ## Accuracy : 0.8571 ## 95% CI : (0.8382, 0.8746) ## No Information Rate : 0.8388 ## P-Value [Acc > NIR] : 0.02864 ## ## Kappa : 0.2936 ## Mcnemar's Test P-Value : < 2e-16 ## ## Sensitivity : 0.9749 ## Specificity : 0.2447 ## Pos Pred Value : 0.8704 ## Neg Pred Value : 0.6517 ## Prevalence : 0.8388 ## Detection Rate : 0.8177 ## Detection Prevalence : 0.9395 ## Balanced Accuracy : 0.6098 ## ## 'Positive' Class : No ## confusionMatrix(pmodel4, newattrit$Attrition) 
## Confusion Matrix and Statistics ## ## Reference ## Prediction No Yes ## No 1188 158 ## Yes 45 79 ## ## Accuracy : 0.8619 ## 95% CI : (0.8432, 0.8791) ## No Information Rate : 0.8388 ## P-Value [Acc > NIR] : 0.007845 ## ## Kappa : 0.3676 ## Mcnemar's Test P-Value : 3.815e-15 ## ## Sensitivity : 0.9635 ## Specificity : 0.3333 ## Pos Pred Value : 0.8826 ## Neg Pred Value : 0.6371 ## Prevalence : 0.8388 ## Detection Rate : 0.8082 ## Detection Prevalence : 0.9156 ## Balanced Accuracy : 0.6484 ## ## 'Positive' Class : No ##

There we have it: four matrices, one for each of the models we made with
the different control parameters. It helpfully provides not just
Accuracy but also other common measures you may be interested in. I
won't review them all; that's why I provided the link to a detailed
description of all the measures. Before we leave the topic for a bit,
however, I do want to highlight a way you can use the purrr package
to make your life a lot easier. A special thanks to Steven at
MungeX-3D
for his recent post on purrr,
which got me thinking about it.

We have 4 models so far (with more to come). We have the nice neat output
from caret, but honestly, comparing values across the 4 models involves
way too much scrolling back and forth right now. Let's use purrr to
create a nice neat dataframe. purrr's map command is like lapply
from base R, designed to apply some operation or function to a list of
objects. So what we'll do is as follows:

1. Create a named list called modellist to point to our four existing
models (perhaps at a later date we'll start even earlier in our
modelling process).
2. It's a named list so we can name each model (for now with the
accurate but uninteresting name Modelx).
3. Pass the list using map to the predict function to generate our
predictions.
4. Pipe %>% those results to the confusionMatrix function with
map.
5. Pipe %>% the confusion matrix results to map_dfr. The results of
confusionMatrix are actually a list of six items. The ones we want
to capture are in $overall and $byClass. We grab them, transpose
them, and make them into dataframes, then bind the two dataframes
together so everything is neatly packaged. The .id = "ModelNumb"
argument tells map_dfr to add an identifying column to the dataframe,
populated with the name of the list item we passed in modellist.
Therefore the object CHAIDResults contains everything we might want
to use to compare models in one neat dataframe.
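In miniature, the map()/map_dfr() pattern those steps describe looks like this (using toy vectors instead of the fitted chaid models so the sketch stands alone):

```r
library(purrr)

# Toy stand-ins for the fitted models; any named list works the same way
modellist <- list(Model1 = 1:3, Model2 = 4:6)

# Apply a function to each element, then row-bind the results into one
# dataframe with an identifying column taken from the list names
map(modellist, ~ summary(.x)) %>%
  map_dfr(~ as.data.frame(t(.x)), .id = "ModelNumb")
```

The result is one row per list element, with ModelNumb holding the names Model1 and Model2, which is exactly the shape we want for side-by-side model comparison.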

The kable call is simply for your reading convenience. It makes the
resulting table a good deal easier to read.

library(kableExtra) modellist <- list(Model1 = chaidattrit1, Model2 = chaidattrit2, Model3 = chaidattrit3, Model4 = chaidattrit4) CHAIDResults <- map(modellist, ~ predict(.x)) %>% map(~ confusionMatrix(newattrit$Attrition, .x)) %>% map_dfr(~ cbind(as.data.frame(t(.x$overall)),as.data.frame(t(.x$byClass))), .id = "ModelNumb") kable(CHAIDResults, "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 9) ModelNumb Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy Model1 0.8707483 0.4191632 0.8525159 0.8874842 0.9095238 0.9999996 0.0e+00 0.8900524 0.6766917 0.9651257 0.3797468 0.9651257 0.8900524 0.9260700 0.9095238 0.8095238 0.8387755 0.7833720 Model2 0.8455782 0.3529603 0.8260781 0.8636860 0.8857143 0.9999985 6.4e-06 0.8863287 0.5297619 0.9359286 0.3755274 0.9359286 0.8863287 0.9104536 0.8857143 0.7850340 0.8387755 0.7080453 Model3 0.8571429 0.2936476 0.8382017 0.8746440 0.9394558 1.0000000 0.0e+00 0.8703838 0.6516854 0.9748581 0.2447257 0.9748581 0.8703838 0.9196634 0.9394558 0.8176871 0.8387755 0.7610346 Model4 0.8619048 0.3676334 0.8432050 0.8791447 0.9156463 1.0000000 0.0e+00 0.8826152 0.6370968 0.9635036 0.3333333 0.9635036 0.8826152 0.9212873 0.9156463 0.8081633 0.8387755 0.7598560 One other thing I’ll mention in passing is that the partykit package offers a way of assessing the relative importance of the variables in the model via the varimp command. We’ll come back to this concept of variable importance later but for now a simple example of text and plot output. 
sort(varimp(chaidattrit1), decreasing = TRUE) ## JobLevel OverTime EnvironmentSatisfaction ## 0.142756888 0.114384725 0.071069051 ## StockOptionLevel MaritalStatus JobSatisfaction ## 0.058726463 0.030332565 0.029157845 ## TrainingTimesLastYear RelationshipSatisfaction Department ## 0.025637743 0.015700750 0.013815233 ## BusinessTravel JobInvolvement ## 0.009906245 0.009205317 plot(sort(varimp(chaidattrit1), decreasing = TRUE)) What about those other variables? But before we go much farther we should probably circle back and make use of all those variables that were coded as integers that we conveniently ignored in building our first four models. Let’s bring them into our model building activities and see what they can add to our understanding. As a first step let’s use ggplot2 and take a look at their distribution using a density plot. # Turning numeric variables into factors ## what do they look like attrition %>% select_if(is.numeric) %>% gather(metric, value) %>% ggplot(aes(value, fill = metric)) + geom_density(show.legend = FALSE) + facet_wrap( ~ metric, scales = "free") Well other than Age very few of those variables appear to have especially normal distributions. That’s okay we’re going to wind up cutting them up into factors anyway. The only question is what are the best cut-points to use? In base R the cut function default is equal intervals (distances along the x axis). You can also specify your own cutpoints and your own labels as shown below. table(cut(attrition$YearsWithCurrManager, breaks = 5)) ## ## (-0.017,3.4] (3.4,6.8] (6.8,10.2] (10.2,13.6] (13.6,17] ## 825 158 414 54 19 table(attrition$YearsSinceLastPromotion) ## ## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ## 581 357 159 52 61 45 32 76 18 17 6 24 10 10 9 13 table(cut( attrition$YearsSinceLastPromotion, breaks = c(-1, 0.9, 1.9, 2.9, 30), labels = c("Less than 1", "1", "2", "More than 2") )) ## ## Less than 1 1 2 More than 2 ## 581 357 159 373

ggplot2 has three helper functions I prefer to use: cut_interval,
cut_number, and cut_width. cut_interval makes n groups with equal
range, cut_number makes n groups with (approximately) equal numbers of
observations, and cut_width makes groups of a fixed, specified width.
As we think about moving the numeric variables into factors any of these
might be a viable alternative.

# cut_interval makes n groups with equal range
table(cut_interval(attrition$YearsWithCurrManager, n = 5))
##
##     [0,3.4]   (3.4,6.8]  (6.8,10.2] (10.2,13.6]   (13.6,17]
##         825         158         414          54          19

# cut_number makes n groups with (approximately) equal numbers of observations
table(cut_number(attrition$YearsWithCurrManager, n = 5))
##
##  [0,1]  (1,2]  (2,4]  (4,7] (7,17]
##    339    344    240    276    271

# cut_width makes groups of width width
table(cut_width(attrition$YearsWithCurrManager, width = 2))
##
##  [-1,1]   (1,3]   (3,5]   (5,7]   (7,9]  (9,11] (11,13] (13,15] (15,17]
##     339     486     129     245     171      49      32      10       9

For the sake of our current example, let’s say that I would like to focus on groups of more or less equal size, which means I would need to apply cut_number to each of the 12 variables under discussion. I’m not enamored of running the function 12 times, though, so I would prefer to wrap it in a mutate_if statement: if the variable is numeric, then apply cut_number with n = 5. The problem is that cut_number will error out if it doesn’t think there are enough values to produce the bins you requested. So…

cut_number(attrition$YearsWithCurrManager, n = 6)
# Error: Insufficient data values to produce 6 bins.

cut_number(attrition$YearsSinceLastPromotion, n = 4)
# Error: Insufficient data values to produce 4 bins.

attrition %>%
  mutate_if(is.numeric, funs(cut_number(., n = 5)))
# Error in mutate_impl(.data, dots) :
#   Evaluation error: Insufficient data values to produce 5 bins..

A little sleuthing reveals that there is one variable among the 12 that has too few distinct values for the cut_number function to work. That variable is YearsSinceLastPromotion. Let’s try what we would like but explicitly select out that variable.
attrition %>% select(-YearsSinceLastPromotion) %>% mutate_if(is.numeric, funs(cut_number(., n=5))) %>% head ## Age Attrition BusinessTravel DailyRate ## 1 (38,45] Yes Travel_Rarely (942,1.22e+03] ## 2 (45,60] No Travel_Frequently [102,392] ## 3 (34,38] Yes Travel_Rarely (1.22e+03,1.5e+03] ## 4 (29,34] No Travel_Frequently (1.22e+03,1.5e+03] ## 5 [18,29] No Travel_Rarely (392,656] ## 6 (29,34] No Travel_Frequently (942,1.22e+03] ## Department DistanceFromHome Education EducationField ## 1 Sales [1,2] College Life_Sciences ## 2 Research_Development (5,9] Below_College Life_Sciences ## 3 Research_Development [1,2] College Other ## 4 Research_Development (2,5] Master Life_Sciences ## 5 Research_Development [1,2] Below_College Medical ## 6 Research_Development [1,2] College Life_Sciences ## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel ## 1 Medium Female (87,100] High 2 ## 2 High Male (59,73] Medium 2 ## 3 Very_High Male (87,100] Medium 1 ## 4 Very_High Female (45,59] High 1 ## 5 Low Male [30,45] High 1 ## 6 Very_High Male (73,87] High 1 ## JobRole JobSatisfaction MaritalStatus MonthlyIncome ## 1 Sales_Executive Very_High Single (5.74e+03,9.86e+03] ## 2 Research_Scientist Medium Married (4.23e+03,5.74e+03] ## 3 Laboratory_Technician High Single [1.01e+03,2.7e+03] ## 4 Research_Scientist High Married (2.7e+03,4.23e+03] ## 5 Laboratory_Technician Medium Married (2.7e+03,4.23e+03] ## 6 Laboratory_Technician Very_High Single (2.7e+03,4.23e+03] ## MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike ## 1 (1.67e+04,2.17e+04] 8 Yes [11,12] ## 2 (2.17e+04,2.7e+04] 1 No (19,25] ## 3 [2.09e+03,6.89e+03] 6 Yes (13,15] ## 4 (2.17e+04,2.7e+04] 1 Yes [11,12] ## 5 (1.18e+04,1.67e+04] 9 No [11,12] ## 6 (1.18e+04,1.67e+04] 0 No (12,13] ## PerformanceRating RelationshipSatisfaction StockOptionLevel ## 1 Excellent Low 0 ## 2 Outstanding Very_High 1 ## 3 Excellent Medium 0 ## 4 Excellent High 0 ## 5 Excellent Very_High 1 ## 6 Excellent High 0 ## TotalWorkingYears 
TrainingTimesLastYear WorkLifeBalance YearsAtCompany ## 1 (5,8] 0 Bad (5,7] ## 2 (8,10] 3 Better (7,10] ## 3 (5,8] 3 Better [0,2] ## 4 (5,8] 3 Better (7,10] ## 5 (5,8] 3 Better [0,2] ## 6 (5,8] 2 Good (5,7] ## YearsInCurrentRole YearsWithCurrManager ## 1 (2,4] (4,7] ## 2 (4,7] (4,7] ## 3 [0,1] [0,1] ## 4 (4,7] [0,1] ## 5 (1,2] (1,2] ## 6 (4,7] (4,7] Yes that appears to be it. So let’s manually cut it into 4 groups and then apply the 5 grouping code to the other 11 variables. Once we have accomplished that we can run the same newattrit <- attrition %>% select_if(is.factor) we ran earlier to produce a newattrit dataframe we can work with. attrition$YearsSinceLastPromotion <- cut( attrition$YearsSinceLastPromotion, breaks = c(-1, 0.9, 1.9, 2.9, 30), labels = c("Less than 1", "1", "2", "More than 2") ) attrition <- attrition %>% mutate_if(is.numeric, funs(cut_number(., n=5))) summary(attrition) ## Age Attrition BusinessTravel ## [18,29]:326 No :1233 Non-Travel : 150 ## (29,34]:325 Yes: 237 Travel_Frequently: 277 ## (34,38]:255 Travel_Rarely :1043 ## (38,45]:291 ## (45,60]:273 ## ## ## DailyRate Department DistanceFromHome ## [102,392] :294 Human_Resources : 63 [1,2] :419 ## (392,656] :294 Research_Development:961 (2,5] :213 ## (656,942] :294 Sales :446 (5,9] :308 ## (942,1.22e+03] :294 (9,17] :253 ## (1.22e+03,1.5e+03]:294 (17,29]:277 ## ## ## Education EducationField EnvironmentSatisfaction ## Below_College:170 Human_Resources : 27 Low :284 ## College :282 Life_Sciences :606 Medium :287 ## Bachelor :572 Marketing :159 High :453 ## Master :398 Medical :464 Very_High:446 ## Doctor : 48 Other : 82 ## Technical_Degree:132 ## ## Gender HourlyRate JobInvolvement JobLevel ## Female:588 [30,45] :306 Low : 83 1:543 ## Male :882 (45,59] :298 Medium :375 2:534 ## (59,73] :280 High :868 3:218 ## (73,87] :312 Very_High:144 4:106 ## (87,100]:274 5: 69 ## ## ## JobRole JobSatisfaction MaritalStatus ## Sales_Executive :326 Low :289 Divorced:327 ## Research_Scientist :292 Medium :280 
Married :673 ## Laboratory_Technician :259 High :442 Single :470 ## Manufacturing_Director :145 Very_High:459 ## Healthcare_Representative:131 ## Manager :102 ## (Other) :215 ## MonthlyIncome MonthlyRate NumCompaniesWorked ## [1.01e+03,2.7e+03] :294 [2.09e+03,6.89e+03]:294 1 :521 ## (2.7e+03,4.23e+03] :294 (6.89e+03,1.18e+04]:294 0 :197 ## (4.23e+03,5.74e+03]:294 (1.18e+04,1.67e+04]:294 3 :159 ## (5.74e+03,9.86e+03]:294 (1.67e+04,2.17e+04]:294 2 :146 ## (9.86e+03,2e+04] :294 (2.17e+04,2.7e+04] :294 4 :139 ## 7 : 74 ## (Other):234 ## OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction ## No :1054 [11,12]:408 Low : 0 Low :276 ## Yes: 416 (12,13]:209 Good : 0 Medium :303 ## (13,15]:302 Excellent :1244 High :459 ## (15,19]:325 Outstanding: 226 Very_High:432 ## (19,25]:226 ## ## ## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance ## 0:631 [0,5] :316 0: 54 Bad : 80 ## 1:596 (5,8] :309 1: 71 Good :344 ## 2:158 (8,10] :298 2:547 Better:893 ## 3: 85 (10,17]:261 3:491 Best :153 ## (17,40]:286 4:123 ## 5:119 ## 6: 65 ## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion ## [0,2] :342 [0,1] :301 Less than 1:581 ## (2,5] :434 (1,2] :372 1 :357 ## (5,7] :166 (2,4] :239 2 :159 ## (7,10] :282 (4,7] :295 More than 2:373 ## (10,40]:246 (7,18]:263 ## ## ## YearsWithCurrManager ## [0,1] :339 ## (1,2] :344 ## (2,4] :240 ## (4,7] :276 ## (7,17]:271 ## ## newattrit <- attrition %>% select_if(is.factor) dim(newattrit) ## [1] 1470 31 Now we have newattrit with all 30 predictor variables. We will simply repeat the process we used earlier to develop 4 new models. 
# Repeat to produce models 5-8 chaidattrit5 <- chaid(Attrition ~ ., data = newattrit) print(chaidattrit5) ## ## Model formula: ## Attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + ## Education + EducationField + EnvironmentSatisfaction + Gender + ## HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + ## MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + ## OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + ## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + ## WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + ## YearsWithCurrManager ## ## Fitted party: ## [1] root ## | [2] OverTime in No ## | | [3] YearsAtCompany in [0,2] ## | | | [4] Age in [18,29], (29,34] ## | | | | [5] StockOptionLevel in 0 ## | | | | | [6] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 56, err = 41.1%) ## | | | | | [7] BusinessTravel in Travel_Frequently: Yes (n = 10, err = 10.0%) ## | | | | [8] StockOptionLevel in 1, 2, 3: No (n = 63, err = 15.9%) ## | | | [9] Age in (34,38], (38,45], (45,60] ## | | | | [10] WorkLifeBalance in Bad: No (n = 4, err = 50.0%) ## | | | | [11] WorkLifeBalance in Good, Better, Best ## | | | | | [12] EducationField in Human_Resources, Life_Sciences, Marketing, Medical: No (n = 92, err = 2.2%) ## | | | | | [13] EducationField in Other, Technical_Degree: No (n = 13, err = 23.1%) ## | | [14] YearsAtCompany in (2,5], (5,7], (7,10], (10,40] ## | | | [15] WorkLifeBalance in Bad: No (n = 45, err = 22.2%) ## | | | [16] WorkLifeBalance in Good, Better, Best ## | | | | [17] JobSatisfaction in Low ## | | | | | [18] StockOptionLevel in 0 ## | | | | | | [19] RelationshipSatisfaction in Low: Yes (n = 11, err = 45.5%) ## | | | | | | [20] RelationshipSatisfaction in Medium: No (n = 12, err = 8.3%) ## | | | | | | [21] RelationshipSatisfaction in High: No (n = 17, err = 47.1%) ## | | | | | | [22] RelationshipSatisfaction in Very_High: No (n = 20, 
err = 0.0%) ## | | | | | [23] StockOptionLevel in 1, 2, 3: No (n = 93, err = 4.3%) ## | | | | [24] JobSatisfaction in Medium, High, Very_High ## | | | | | [25] Age in [18,29], (29,34], (34,38], (38,45] ## | | | | | | [26] BusinessTravel in Non-Travel, Travel_Rarely ## | | | | | | | [27] JobInvolvement in Low: No (n = 25, err = 12.0%) ## | | | | | | | [28] JobInvolvement in Medium, High, Very_High ## | | | | | | | | [29] RelationshipSatisfaction in Low: No (n = 81, err = 3.7%) ## | | | | | | | | [30] RelationshipSatisfaction in Medium, High: No (n = 198, err = 0.0%) ## | | | | | | | | [31] RelationshipSatisfaction in Very_High ## | | | | | | | | | [32] DistanceFromHome in [1,2], (2,5], (5,9], (17,29]: No (n = 92, err = 2.2%) ## | | | | | | | | | [33] DistanceFromHome in (9,17]: No (n = 13, err = 23.1%) ## | | | | | | [34] BusinessTravel in Travel_Frequently: No (n = 95, err = 8.4%) ## | | | | | [35] Age in (45,60] ## | | | | | | [36] JobSatisfaction in Low, Medium, High ## | | | | | | | [37] TotalWorkingYears in [0,5], (5,8], (8,10], (17,40]: No (n = 57, err = 0.0%) ## | | | | | | | [38] TotalWorkingYears in (10,17]: No (n = 14, err = 28.6%) ## | | | | | | [39] JobSatisfaction in Very_High: No (n = 43, err = 20.9%) ## | [40] OverTime in Yes ## | | [41] JobLevel in 1 ## | | | [42] StockOptionLevel in 0, 3 ## | | | | [43] DistanceFromHome in [1,2], (2,5] ## | | | | | [44] EnvironmentSatisfaction in Low: Yes (n = 12, err = 16.7%) ## | | | | | [45] EnvironmentSatisfaction in Medium, High, Very_High: No (n = 33, err = 36.4%) ## | | | | [46] DistanceFromHome in (5,9], (9,17], (17,29]: Yes (n = 44, err = 18.2%) ## | | | [47] StockOptionLevel in 1, 2 ## | | | | [48] BusinessTravel in Non-Travel, Travel_Rarely: No (n = 50, err = 26.0%) ## | | | | [49] BusinessTravel in Travel_Frequently: Yes (n = 17, err = 35.3%) ## | | [50] JobLevel in 2, 3, 4, 5 ## | | | [51] MaritalStatus in Divorced, Married ## | | | | [52] EnvironmentSatisfaction in Low, Medium: No (n = 60, err = 20.0%) 
## | | | | [53] EnvironmentSatisfaction in High, Very_High ## | | | | | [54] TrainingTimesLastYear in 0, 6: No (n = 10, err = 40.0%) ## | | | | | [55] TrainingTimesLastYear in 1, 2, 3, 4, 5 ## | | | | | | [56] YearsInCurrentRole in [0,1], (1,2]: No (n = 36, err = 11.1%) ## | | | | | | [57] YearsInCurrentRole in (2,4], (4,7], (7,18]: No (n = 82, err = 0.0%) ## | | | [58] MaritalStatus in Single ## | | | | [59] Department in Human_Resources, Research_Development: No (n = 37, err = 10.8%) ## | | | | [60] Department in Sales: Yes (n = 35, err = 40.0%) ## ## Number of inner nodes: 28 ## Number of terminal nodes: 32 plot( chaidattrit5, main = "Default control sliced numerics", gp = gpar( col = "blue", lty = "solid", lwd = 3, fontsize = 8 ) ) ctrl <- chaid_control(minsplit = 200, minprob = 0.05) chaidattrit6 <- chaid(Attrition ~ ., data = newattrit, control = ctrl) print(chaidattrit6) ## ## Model formula: ## Attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + ## Education + EducationField + EnvironmentSatisfaction + Gender + ## HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + ## MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + ## OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + ## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + ## WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + ## YearsWithCurrManager ## ## Fitted party: ## [1] root ## | [2] OverTime in No ## | | [3] YearsAtCompany in [0,2] ## | | | [4] Age in [18,29], (29,34]: No (n = 129, err = 32.6%) ## | | | [5] Age in (34,38], (38,45], (45,60]: No (n = 109, err = 6.4%) ## | | [6] YearsAtCompany in (2,5], (5,7], (7,10], (10,40] ## | | | [7] WorkLifeBalance in Bad: No (n = 45, err = 22.2%) ## | | | [8] WorkLifeBalance in Good, Better, Best ## | | | | [9] JobSatisfaction in Low: No (n = 153, err = 12.4%) ## | | | | [10] JobSatisfaction in Medium, High, Very_High ## | | | | | [11] 
Age in [18,29], (29,34], (34,38], (38,45] ## | | | | | | [12] BusinessTravel in Non-Travel, Travel_Rarely ## | | | | | | | [13] JobInvolvement in Low: No (n = 25, err = 12.0%) ## | | | | | | | [14] JobInvolvement in Medium, High, Very_High ## | | | | | | | | [15] RelationshipSatisfaction in Low: No (n = 81, err = 3.7%) ## | | | | | | | | [16] RelationshipSatisfaction in Medium, High: No (n = 198, err = 0.0%) ## | | | | | | | | [17] RelationshipSatisfaction in Very_High: No (n = 105, err = 4.8%) ## | | | | | | [18] BusinessTravel in Travel_Frequently: No (n = 95, err = 8.4%) ## | | | | | [19] Age in (45,60]: No (n = 114, err = 11.4%) ## | [20] OverTime in Yes ## | | [21] JobLevel in 1: Yes (n = 156, err = 47.4%) ## | | [22] JobLevel in 2, 3, 4, 5 ## | | | [23] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%) ## | | | [24] MaritalStatus in Single: No (n = 72, err = 34.7%) ## ## Number of inner nodes: 11 ## Number of terminal nodes: 13 plot( chaidattrit6, main = "minsplit = 200, minprob = 0.05", gp = gpar( col = "blue", lty = "solid", lwd = 3, fontsize = 8 ) ) ctrl <- chaid_control(maxheight = 3) chaidattrit7 <- chaid(Attrition ~ ., data = newattrit, control = ctrl) print(chaidattrit7) ## ## Model formula: ## Attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + ## Education + EducationField + EnvironmentSatisfaction + Gender + ## HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + ## MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + ## OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + ## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + ## WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + ## YearsWithCurrManager ## ## Fitted party: ## [1] root ## | [2] OverTime in No ## | | [3] YearsAtCompany in [0,2] ## | | | [4] Age in [18,29], (29,34]: No (n = 129, err = 32.6%) ## | | | [5] Age in (34,38], (38,45], (45,60]: No (n = 109, 
err = 6.4%) ## | | [6] YearsAtCompany in (2,5], (5,7], (7,10], (10,40] ## | | | [7] WorkLifeBalance in Bad: No (n = 45, err = 22.2%) ## | | | [8] WorkLifeBalance in Good, Better, Best: No (n = 771, err = 6.6%) ## | [9] OverTime in Yes ## | | [10] JobLevel in 1 ## | | | [11] StockOptionLevel in 0, 3: Yes (n = 89, err = 34.8%) ## | | | [12] StockOptionLevel in 1, 2: No (n = 67, err = 35.8%) ## | | [13] JobLevel in 2, 3, 4, 5 ## | | | [14] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%) ## | | | [15] MaritalStatus in Single: No (n = 72, err = 34.7%) ## ## Number of inner nodes: 7 ## Number of terminal nodes: 8 plot( chaidattrit7, main = "maxheight = 3", gp = gpar( col = "blue", lty = "solid", lwd = 3, fontsize = 8 ) ) ctrl <- chaid_control(alpha2 = .01, alpha4 = .01) chaidattrit8 <- chaid(Attrition ~ ., data = newattrit, control = ctrl) print(chaidattrit8) ## ## Model formula: ## Attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + ## Education + EducationField + EnvironmentSatisfaction + Gender + ## HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + ## MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + ## OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + ## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + ## WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + ## YearsWithCurrManager ## ## Fitted party: ## [1] root ## | [2] OverTime in No ## | | [3] YearsAtCompany in [0,2] ## | | | [4] Age in [18,29], (29,34] ## | | | | [5] StockOptionLevel in 0: No (n = 66, err = 48.5%) ## | | | | [6] StockOptionLevel in 1, 2, 3: No (n = 63, err = 15.9%) ## | | | [7] Age in (34,38], (38,45], (45,60] ## | | | | [8] WorkLifeBalance in Bad: No (n = 4, err = 50.0%) ## | | | | [9] WorkLifeBalance in Good, Better, Best: No (n = 105, err = 4.8%) ## | | [10] YearsAtCompany in (2,5], (5,7], (7,10], (10,40] ## | | | [11] WorkLifeBalance in Bad: No (n 
= 45, err = 22.2%) ## | | | [12] WorkLifeBalance in Good, Better, Best ## | | | | [13] JobSatisfaction in Low ## | | | | | [14] JobRole in Healthcare_Representative, Human_Resources, Laboratory_Technician, Manager, Manufacturing_Director, Research_Director, Research_Scientist, Sales_Executive ## | | | | | | [15] StockOptionLevel in 0: No (n = 58, err = 22.4%) ## | | | | | | [16] StockOptionLevel in 1, 2, 3: No (n = 92, err = 3.3%) ## | | | | | [17] JobRole in Sales_Representative: Yes (n = 3, err = 0.0%) ## | | | | [18] JobSatisfaction in Medium, High, Very_High: No (n = 618, err = 5.2%) ## | [19] OverTime in Yes ## | | [20] JobLevel in 1 ## | | | [21] StockOptionLevel in 0, 3: Yes (n = 89, err = 34.8%) ## | | | [22] StockOptionLevel in 1, 2: No (n = 67, err = 35.8%) ## | | [23] JobLevel in 2, 3, 4, 5 ## | | | [24] MaritalStatus in Divorced, Married: No (n = 188, err = 10.6%) ## | | | [25] MaritalStatus in Single ## | | | | [26] Department in Human_Resources, Research_Development: No (n = 37, err = 10.8%) ## | | | | [27] Department in Sales: Yes (n = 35, err = 40.0%) ## ## Number of inner nodes: 13 ## Number of terminal nodes: 14 plot( chaidattrit8, main = "alpha2 = .01, alpha4 = .01", gp = gpar( col = "blue", lty = "solid", lwd = 3, fontsize = 8 ) ) As we did earlier we’ll also repeat the steps necessary to build a table of results. modellist <- list(Model1 = chaidattrit1, Model2 = chaidattrit2, Model3 = chaidattrit3, Model4 = chaidattrit4, Model5 = chaidattrit5, Model6 = chaidattrit6, Model7 = chaidattrit7, Model8 = chaidattrit8) CHAIDResults <- map(modellist, ~ predict(.x)) %>% map(~ confusionMatrix(newattrit$Attrition, .x)) %>% map_dfr(~ cbind(as.data.frame(t(.x$overall)),as.data.frame(t(.x$byClass))), .id = "ModelNumb") kable(CHAIDResults, "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), font_size = 10)

ModelNumb Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull AccuracyPValue McnemarPValue Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Model1 0.8707483 0.4191632 0.8525159 0.8874842 0.9095238 0.9999996 0.0e+00 0.8900524 0.6766917 0.9651257 0.3797468 0.9651257 0.8900524 0.9260700 0.9095238 0.8095238 0.8387755 0.7833720
Model2 0.8455782 0.3529603 0.8260781 0.8636860 0.8857143 0.9999985 6.4e-06 0.8863287 0.5297619 0.9359286 0.3755274 0.9359286 0.8863287 0.9104536 0.8857143 0.7850340 0.8387755 0.7080453
Model3 0.8571429 0.2936476 0.8382017 0.8746440 0.9394558 1.0000000 0.0e+00 0.8703838 0.6516854 0.9748581 0.2447257 0.9748581 0.8703838 0.9196634 0.9394558 0.8176871 0.8387755 0.7610346
Model4 0.8619048 0.3676334 0.8432050 0.8791447 0.9156463 1.0000000 0.0e+00 0.8826152 0.6370968 0.9635036 0.3333333 0.9635036 0.8826152 0.9212873 0.9156463 0.8081633 0.8387755 0.7598560
Model5 0.8775510 0.4451365 0.8596959 0.8938814 0.9122449 0.9999968 0.0e+00 0.8926174 0.7209302 0.9708029 0.3924051 0.9708029 0.8926174 0.9300699 0.9122449 0.8142857 0.8387755 0.8067738
Model6 0.8442177 0.3317731 0.8246542 0.8623944 0.8938776 1.0000000 1.0e-07 0.8820396 0.5256410 0.9399838 0.3459916 0.9399838 0.8820396 0.9100903 0.8938776 0.7884354 0.8387755 0.7038403
Model7 0.8571429 0.2936476 0.8382017 0.8746440 0.9394558 1.0000000 0.0e+00 0.8703838 0.6516854 0.9748581 0.2447257 0.9748581 0.8703838 0.9196634 0.9394558 0.8176871 0.8387755 0.7610346
Model8 0.8639456 0.3808988 0.8453515 0.8810715 0.9136054 1.0000000 0.0e+00 0.8845867 0.6456693 0.9635036 0.3459916 0.9635036 0.8845867 0.9223602 0.9136054 0.8081633 0.8387755 0.7651280

You can clearly see that OverTime remains the first cut in our tree
structure, but other variables have now started to influence our model
as well, such as how long employees have worked for us and their age. The
table shows that model #5 is now the most accurate. Not by a huge
amount, but the numeric variables we ignored at first pass clearly do
matter, at least to some degree.
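If you'd rather not eyeball the table, a quick sketch (assuming the CHAIDResults data frame built above and dplyr loaded) ranks the models programmatically:

```r
library(dplyr)

# Rank the candidate models by overall accuracy; CHAIDResults is the
# data frame assembled above from each model's confusionMatrix output.
CHAIDResults %>%
  arrange(desc(Accuracy)) %>%
  select(ModelNumb, Accuracy, Kappa)
```

With the eight models above, Model5 sorts to the top.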

Not done yet

I’m not going to dwell on the current results too much, since they are
simply an example. In my next post I’d like to spend some time on
over-fitting and cross-validation.
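As a tiny preview (a sketch only, using the caret and CHAID functions already loaded above; the proper treatment comes in the next post), a simple holdout split looks like:

```r
library(caret)

# Sketch: hold out 30% of the data, fit on the rest, and assess the
# model on the unseen portion instead of on the training data.
set.seed(2018)
idx   <- createDataPartition(newattrit$Attrition, p = 0.7, list = FALSE)
train <- newattrit[idx, ]
test  <- newattrit[-idx, ]

holdout_model <- chaid(Attrition ~ ., data = train)
confusionMatrix(predict(holdout_model, newdata = test), test$Attrition)
```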

I hope you’ve found this useful. I am always open to comments,
corrections and suggestions.

Chuck (ibecav at gmail dot com)



To leave a comment for the author, please follow the link and comment on their blog: Chuck Powell. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### fruit: Image-Based Systems of Linear Equations (Numeric)

Tue, 05/15/2018 - 00:00

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for solving a system of three linear equations (numeric answer) with a problem description based on shuffled images.

Name: fruit Type: num Related: fruit2

Description: A system of three linear equations has to be solved and the solution has to be entered into a fourth equation. However, the system is not defined through a verbal description or mathematical notation but through images (clip art of tropical fruits). The problem can be interpreted as prices of three fruits (banana, orange, pineapple) and corresponding fruit baskets with different combinations of fruits. Images are stored in Base64 encoding within the exercise files and embedded dynamically into the output. PDFs are best generated from the Rnw version; HTML is best generated with pandoc from either the Rmd version (where pandoc is used by default) or the Rnw version (where ttm is used by default, but pandoc can easily be used as well).

Solution feedback: Yes
Randomization: Random numbers, shuffled graphics
Mathematical notation: Yes
Verbatim R input/output: No
Images: Yes
Other supplements: No
Template: fruit.Rnw fruit.Rmd
Raw: (1 random version) fruit.tex fruit.md
PDF: HTML:

(Note that the HTML output contains mathematical equations in MathML. It is displayed by browsers with MathML support like Firefox or Safari – but not Chrome.)
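Since the template is randomised, the concrete quantities and prices differ in every generated version; the underlying task is simply a 3×3 linear system. A minimal sketch of the computation in base R, with made-up numbers rather than the template's actual values:

```r
# Illustrative only: suppose three fruit baskets give these equations in
# the unknown prices b (banana), o (orange), p (pineapple):
#   2*b + 1*o + 1*p = 11
#   1*b + 3*o + 2*p = 19
#   3*b + 2*o + 4*p = 28
A <- rbind(c(2, 1, 1),
           c(1, 3, 2),
           c(3, 2, 4))
y <- c(11, 19, 28)

prices <- solve(A, y)  # solve the linear system
prices
# -> 2 3 4
```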

Demo code:

library("exams")

set.seed(1090)
exams2html("fruit.Rnw")
set.seed(1090)
exams2pdf("fruit.Rnw")

set.seed(1090)
exams2html("fruit.Rmd")
set.seed(1090)
exams2pdf("fruit.Rmd")


### fruit2: Image-Based Systems of Linear Equations (Single-Choice)

Tue, 05/15/2018 - 00:00

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for solving a system of three linear equations (single-choice) with a problem description based on shuffled images.

Name: fruit2 Type: schoice Related: fruit

Description: A system of three linear equations has to be solved and the solution has to be entered into a fourth equation. However, the system is not defined through a verbal description or mathematical notation but through images (clip art of tropical fruits). The problem can be interpreted as prices of three fruits (banana, orange, pineapple) and corresponding fruit baskets with different combinations of fruits. Images are stored in Base64 encoding within the exercise files and embedded dynamically into the output. A set of five answer alternatives is generated based on two potential mistakes and two random solutions from a suitable range. PDFs are best generated from the Rnw version; HTML is best generated with pandoc from either the Rmd version (where pandoc is used by default) or the Rnw version (where ttm is used by default, but pandoc can easily be used as well).

Solution feedback: Yes
Randomization: Random numbers, shuffled graphics
Mathematical notation: Yes
Verbatim R input/output: No
Images: Yes
Other supplements: No
Template: fruit2.Rnw fruit2.Rmd
Raw: (1 random version) fruit2.tex fruit2.md
PDF: HTML:

(Note that the HTML output contains mathematical equations in MathML. It is displayed by browsers with MathML support like Firefox or Safari – but not Chrome.)

Demo code:

library("exams")

set.seed(1090)
exams2html("fruit2.Rnw")
set.seed(1090)
exams2pdf("fruit2.Rnw")

set.seed(1090)
exams2html("fruit2.Rmd")
set.seed(1090)
exams2pdf("fruit2.Rmd")


### Early draft of our “Feature Engineering and Selection” book

Mon, 05/14/2018 - 15:29

(This article was first published on Blog - Applied Predictive Modeling, and kindly contributed to R-bloggers)

Kjell and I are writing another book on predictive modeling, this time focused on all the things that you can do with predictors. It’s about 60% done and we’d love to get feedback. You can take a look at http://feat.engineering and provide feedback at https://github.com/topepo/FES/issues.

The current TOC is:

1. Introduction
2. Illustrative Example: Predicting Risk of Ischemic Stroke
3. A Review of the Predictive Modeling Process
4. Exploratory Visualizations
5. Encoding Categorical Predictors
6. Engineering Numeric Predictors
7. Detecting Interaction Effects (these later chapters are not finished yet)
8. Flattening Profile Data
9. Handling Missing Data
10. Feature Engineering Without Overfitting
11. Feature Selection


### Generative Assessment Creation

Mon, 05/14/2018 - 13:00

(This article was first published on Rstats – OUseful.Info, the blog…, and kindly contributed to R-bloggers)

It’s coming round to that time of year where we have to create the assessment material for courses with an October start date. In many cases, we reuse question forms from previous presentations but change the specific details. If a question is suitably defined, then large parts of this process could be automated.

In the OU, automated question / answer option randomisation is used to provide iCMAs (interactive computer marked assessments) via the student VLE using OpenMark. As well as purely text based questions, questions can include tables or images as part of the question.

One way of supporting such question types is to manually create a set of answer options, perhaps with linked media assets, and then allow randomisation of them.

Another way is to define the question in a generative way so that the correct and incorrect answers are automatically generated. (This seems to be one of those use cases for why ‘everyone should learn to code’ ;-)

Pinching screenshots from an (old?) OpenMark tutorial, we can see how a dynamically generated question might be defined. For example, create a set of variables:

and then generate a templated question, and student feedback generator, around them:

Packages also exist for creating generative questions/answers more generally. For example, the R exams package allows you to define question/answer templates in Rmd and then generate questions and solutions in a variety of output document formats.
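For instance, a sketch of generating several randomised versions of a template in different output formats (exams2pdf()/exams2html() and the n argument are part of the exams package; the template file names here assume the fruit exercises featured earlier in this digest are available locally):

```r
library("exams")

# Generate three randomised versions of two template exercises,
# once as PDF and once as HTML (n controls how many versions).
set.seed(1090)
exams2pdf(c("fruit.Rmd", "fruit2.Rmd"), n = 3)
set.seed(1090)
exams2html("fruit.Rmd", n = 3)
```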

You can also write templates that include the creation of graphical assets such as charts:

Via my feeds over the weekend, I noticed that this package now also supports the creation of more general diagrams created from a TikZ diagram template. For example, logic diagrams:

(You can see more exam templates here: www.r-exams.org/templates.)

As I’m still on a “we can do everything in Jupyter” kick, one of the things I’ve explored is various IPython/notebook magics that support diagram creation. At the moment, these are just generic magics that allow you to write TikZ diagrams, for example, that make use of various TikZ packages:

One the to do list is to create some example magics that template different question types.

I’m not sure if OpenCreate is following a similar model? (I seem to have lost access permissions again…)

FWIW, I’ve also started looking at my show’n’tell notebooks again, trying to get them working in Azure notebooks. (OU staff should be able to log in to notebooks.azure.com using OUCU@open.ac.uk credentials.) For the moment, I’m depositing them at https://notebooks.azure.com/OUsefulInfo/libraries/gettingstarted, although some tidying may happen at some point. There are also several more basic demo notebooks I need to put together (e.g. on creating charts and using interactive widgets, digital humanities demos, R demos and (if they work!) polyglot R and python notebook demos, etc.). To use the notebooks interactively, log in and clone the library into your own user space.


To leave a comment for the author, please follow the link and comment on their blog: Rstats – OUseful.Info, the blog…. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### sparklyr 0.8

Mon, 05/14/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re pleased to announce that sparklyr 0.8 is now available on CRAN! Sparklyr provides an R interface to Apache Spark. It supports dplyr syntax for working with Spark DataFrames and exposes the full range of machine learning algorithms available in Spark ML. You can also learn more about Apache Spark and sparklyr at spark.rstudio.com and the sparklyr webinar series. In this version, we added support for Spark 2.3 and Livy 0.5, along with various enhancements and bug fixes. For this post, we’d like to highlight a new feature from Spark 2.3 and introduce the mleap and graphframes extensions.

Parallel Cross-Validation

Spark 2.3 supports parallelism in hyperparameter tuning. In other words, instead of training each model specification serially, you can now train them in parallel. This can be enabled by setting the parallelism parameter in ml_cross_validator() or ml_train_validation_split(). Here’s an example:

```r
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3.0")
iris_tbl <- sdf_copy_to(sc, iris)

# Define the pipeline
labels <- c("setosa", "versicolor", "virginica")
pipeline <- ml_pipeline(sc) %>%
  ft_vector_assembler(
    c("Sepal_Width", "Sepal_Length", "Petal_Width", "Petal_Length"),
    "features"
  ) %>%
  ft_string_indexer_model("Species", "label", labels = labels) %>%
  ml_logistic_regression()

# Specify hyperparameter grid
grid <- list(
  logistic = list(
    elastic_net_param = c(0.25, 0.75),
    reg_param = c(1e-3, 1e-4)
  )
)

# Create the cross validator object
cv <- ml_cross_validator(
  sc, estimator = pipeline, estimator_param_maps = grid,
  evaluator = ml_multiclass_classification_evaluator(sc),
  num_folds = 3, parallelism = 4
)

# Train the models
cv_model <- ml_fit(cv, iris_tbl)
```

Once the models are trained, you can inspect the performance results by using the newly available helper function ml_validation_metrics():

```r
ml_validation_metrics(cv_model)
##       f1 elastic_net_param_1 reg_param_1
## 1 0.9506                0.25       1e-03
## 2 0.9384                0.75       1e-03
## 3 0.9384                0.25       1e-04
## 4 0.9569                0.75       1e-04

spark_disconnect(sc)
```

Pipelines in Production

Earlier this year, we announced support for ML Pipelines in sparklyr, and discussed how one can persist models onto disk. While that workflow is appropriate for batch scoring of large datasets, we also wanted to enable real-time, low-latency scoring using pipelines developed with sparklyr. To enable this, we’ve developed the mleap package, available on CRAN, which provides an interface to the MLeap open source project.

MLeap allows you to use your Spark pipelines in any Java-enabled device or service. This works by serializing Spark pipelines which can later be loaded into the Java Virtual Machine (JVM) for scoring without requiring a Spark cluster. This means that software engineers can take Spark pipelines exported with sparklyr and easily embed them in web, desktop or mobile applications.

To get started, simply grab the package from CRAN and install the necessary dependencies:

```r
install.packages("mleap")
library(mleap)
install_maven()
install_mleap()
```

Then, build a pipeline as usual:

```r
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.2.0")
mtcars_tbl <- sdf_copy_to(sc, mtcars)

# Create a pipeline and fit it
pipeline <- ml_pipeline(sc) %>%
  ft_binarizer("hp", "big_hp", threshold = 100) %>%
  ft_vector_assembler(c("big_hp", "wt", "qsec"), "features") %>%
  ml_gbt_regressor(label_col = "mpg")

pipeline_model <- ml_fit(pipeline, mtcars_tbl)
```

Once we have the pipeline model, we can export it via ml_write_bundle():

```r
# Export model
model_path <- file.path(tempdir(), "mtcars_model.zip")
transformed_tbl <- ml_transform(pipeline_model, mtcars_tbl)
ml_write_bundle(pipeline_model, transformed_tbl, model_path)

spark_disconnect(sc)
```

At this point, we’re ready to use mtcars_model.zip in other applications. Notice that the following code does not require Spark:

```r
# Import model
model <- mleap_load_bundle(model_path)

# Create a data frame to be scored
newdata <- tibble::tribble(
  ~qsec, ~hp, ~wt,
  16.2,  101, 2.68,
  18.1,   99, 3.08
)

# Transform the data frame
transformed_df <- mleap_transform(model, newdata)
dplyr::glimpse(transformed_df)
## Observations: 2
## Variables: 6
## $ qsec       <dbl> 16.2, 18.1
## $ hp         <dbl> 101, 99
## $ wt         <dbl> 2.68, 3.08
## $ big_hp     <dbl> 1, 0
## $ features   <list> [[[1, 2.68, 16.2], [3]], [[0, 3.08, 18.1], [3]]]
## $ prediction <dbl> 21.07, 22.37
```

Notice that MLeap requires Spark 2.0 to 2.2. You can find additional details in the production pipelines guide.

Graph Analysis

The other extension we’d like to highlight is graphframes, which provides an interface to the GraphFrames Spark package. GraphFrames allows us to run graph algorithms at scale using a DataFrame-based API.

Let’s see graphframes in action through a quick example, where we analyze the relationships among packages on CRAN.

```r
library(graphframes)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.1.0")

# Grab list of CRAN packages and their dependencies
available_packages <- available.packages(
  contrib.url("https://cloud.r-project.org/")
) %>%
  `[`(, c("Package", "Depends", "Imports")) %>%
  as_tibble() %>%
  transmute(
    package = Package,
    dependencies = paste(Depends, Imports, sep = ",") %>%
      gsub("\\n|\\s+", "", .)
  )

# Copy data to Spark
packages_tbl <- sdf_copy_to(sc, available_packages, overwrite = TRUE)

# Create a tidy table of dependencies, which define the edges of our graph;
# the regex strips parenthesized version constraints, e.g. "Rcpp(>=0.12)"
edges_tbl <- packages_tbl %>%
  mutate(
    dependencies = dependencies %>%
      regexp_replace("\\\\(([^)]+)\\\\)", "")
  ) %>%
  ft_regex_tokenizer(
    "dependencies", "dependencies_vector",
    pattern = "(\\s+)?,(\\s+)?", to_lower_case = FALSE
  ) %>%
  transmute(
    src = package,
    dst = explode(dependencies_vector)
  ) %>%
  filter(!dst %in% c("R", "NA"))
```

Once we have an edges table, we can easily create a GraphFrame object by calling gf_graphframe() and running PageRank:

```r
# Create a GraphFrame object
g <- gf_graphframe(edges = edges_tbl)

# Run the PageRank algorithm
pagerank <- gf_pagerank(g, tol = 0.01)

pagerank %>%
  gf_vertices() %>%
  arrange(desc(pagerank))
## # Source:     table [?? x 2]
## # Database:   spark_connection
## # Ordered by: desc(pagerank)
##    id        pagerank
##    <chr>        <dbl>
##  1 methods       259.
##  2 stats         209.
##  3 utils         194.
##  4 Rcpp          109.
##  5 graphics      104.
##  6 grDevices      60.0
##  7 MASS           53.7
##  8 lattice       34.7
##  9 Matrix        33.3
## 10 grid          32.1
## # ... with more rows
```

We can also collect a sample of the graph locally for visualization:

```r
library(gh)
library(visNetwork)

list_repos <- function(username) {
  gh("/users/:username/repos", username = username) %>%
    vapply("[[", "", "name")
}

rlib_repos <- list_repos("r-lib")
tidyverse_repos <- list_repos("tidyverse")

base_packages <- installed.packages() %>%
  as_tibble() %>%
  filter(Priority == "base") %>%
  pull(Package)

top_packages <- pagerank %>%
  gf_vertices() %>%
  arrange(desc(pagerank)) %>%
  head(75) %>%
  pull(id)

edges_local <- g %>%
  gf_edges() %>%
  filter(src %in% !!top_packages && dst %in% !!top_packages) %>%
  rename(from = src, to = dst) %>%
  collect()

vertices_local <- g %>%
  gf_vertices() %>%
  filter(id %in% top_packages) %>%
  mutate(
    group = case_when(
      id %in% !!rlib_repos ~ "r-lib",
      id %in% !!tidyverse_repos ~ "tidyverse",
      id %in% !!base_packages ~ "base",
      TRUE ~ "other"
    ),
    title = id
  ) %>%
  collect()

visNetwork(vertices_local, edges_local, width = "100%") %>%
  visEdges(arrows = "to")
```

```r
spark_disconnect(sc)
```

Notice that GraphFrames currently supports Spark 2.0 and 2.1. You can find additional details in the graph analysis guide.


To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog.

### Sketchnotes from TWiML&AI: Adversarial Attacks Against Reinforcement Learning Agents with Ian Goodfellow & Sandy Huang

Mon, 05/14/2018 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

These are my sketchnotes for Sam Charrington’s podcast This Week in Machine Learning and AI about Adversarial Attacks Against Reinforcement Learning Agents with Ian Goodfellow & Sandy Huang:

Sketchnotes from TWiMLAI talk: Adversarial Attacks Against Reinforcement Learning Agents with Ian Goodfellow & Sandy Huang

You can listen to the podcast here.

In this episode, I’m joined by Ian Goodfellow, Staff Research Scientist at Google Brain and Sandy Huang, Phd Student in the EECS department at UC Berkeley, to discuss their work on the paper Adversarial Attacks on Neural Network Policies. If you’re a regular listener here you’ve probably heard of adversarial attacks, and have seen examples of deep learning based object detectors that can be fooled into thinking that, for example, a giraffe is actually a school bus, by injecting some imperceptible noise into the image. Well, Sandy and Ian’s paper sits at the intersection of adversarial attacks and reinforcement learning, another area we’ve discussed quite a bit on the podcast. In their paper, they describe how adversarial attacks can also be effective at targeting neural network policies in reinforcement learning. Sandy gives us an overview of the paper, including how changing a single pixel value can throw off performance of a model trained to play Atari games. We also cover a lot of interesting topics relating to adversarial attacks and RL individually, and some related areas such as hierarchical reward functions and transfer learning. This was a great conversation that I’m really excited to bring to you! https://twimlai.com/twiml-talk-119-adversarial-attacks-reinforcement-learning-agents-ian-goodfellow-sandy-huang/


To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound.

### Is non-inferiority on par with superiority?

Mon, 05/14/2018 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

It is grant season around here (actually, it is pretty much always grant season), which means another series of problems to tackle. Even with the most straightforward study designs, there is almost always some interesting twist, or an approach that presents a subtle issue or two. In this case, the investigator wants to compare two interventions, but doesn’t feel the need to show that one is better than the other. He just wants to see if the newer intervention is not inferior to the more established intervention.

The shift from a superiority trial to a non-inferiority trial leads to a fundamental shift in the hypothesis testing framework. In the more traditional superiority trial, where we want to see if an intervention is an improvement over another intervention, we can set up the hypothesis test with null and alternative hypotheses based on the difference of the intervention proportions $$p_{old}$$ and $$p_{new}$$ (under the assumption of a binary outcome):

\begin{aligned}
H_0: p_{new} - p_{old} &\le 0 \\
H_A: p_{new} - p_{old} &> 0
\end{aligned}

In this context, if we reject the null hypothesis that the difference in proportions is less than zero, we conclude that the new intervention is an improvement over the old one, at least for the population under study. (A crucial element of the test is the $$\alpha$$-level that determines the Type 1 error, the probability of rejecting $$H_0$$ when $$H_0$$ is actually true. If we use $$\alpha = 0.025$$, then that is analogous to doing a two-sided test with $$\alpha = .05$$ and hypotheses $$H_0: p_{new} - p_{old} = 0$$ and $$H_A: p_{new} - p_{old} \ne 0$$.)

In the case of a non-inferiority trial, we add a little twist. Really, we subtract a little twist. In this case the hypotheses are:

\begin{aligned}
H_0: p_{new} - p_{old} &\le -\Delta \\
H_A: p_{new} - p_{old} &> -\Delta
\end{aligned}

where $$\Delta$$ is some threshold that sets the non-inferiority bounds. Clearly, if $$\Delta = 0$$ then this is equivalent to a superiority test. However, for any other $$\Delta$$, there is a bit of a cushion so that the new intervention will still be considered non-inferior even if we observe a lower proportion for the new intervention compared to the older intervention.

As long as the confidence interval around the observed estimate for the difference in proportions does not cross the $$-\Delta$$ threshold, we can conclude the new intervention is non-inferior. If we construct a 95% confidence interval, this procedure will have a Type 1 error rate $$\alpha = 0.025$$, and a 90% CI will yield an $$\alpha = 0.05$$. (I will demonstrate this with a simulation.)
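To make the decision rule concrete, here is a small made-up illustration — the counts and the margin $$\Delta = 0.10$$ below are invented for the sketch, not taken from the post:

```r
# Hypothetical data: 500 subjects per arm; Delta = 0.10 is an assumed margin
old_success <- 130; old_n <- 500   # observed p_old = 0.26
new_success <- 120; new_n <- 500   # observed p_new = 0.24
Delta <- 0.10

# 95% CI for p_new - p_old, giving a one-sided alpha of 0.025
ptest <- prop.test(c(new_success, old_success), c(new_n, old_n),
                   correct = FALSE, conf.level = 0.95)

# Declare non-inferiority if the entire CI lies above -Delta
ptest$conf.int[1] > -Delta
# -> TRUE here: the lower bound is about -0.074, which clears the -0.10 threshold
```

With a 90% interval instead, the same check would correspond to a one-sided $$\alpha = 0.05$$.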

The following figures show how different confidence intervals imply different conclusions. (I’ve added an equivalence trial here as well, but won’t discuss it in detail except to say that in this situation we would conclude that the two interventions are equivalent if the confidence interval falls between $$-\Delta$$ and $$\Delta$$.) The bottom interval crosses the non-inferiority threshold, so is considered inconclusive. The second interval from the top crosses zero, but does not cross the non-inferiority threshold, so we conclude that the new intervention is at least as effective as the old one. And the top interval excludes zero, so we conclude that the new intervention is an improvement:

This next figure highlights the key challenge of the non-inferiority trial: where do we set $$\Delta$$? By shifting the threshold towards zero in this example (and not changing anything else), we can no longer conclude non-inferiority. But the superiority test is not affected, and never will be. The comparison for a superiority test is made relative to zero only, and has nothing to do with $$\Delta$$. So, unless there is a principled reason for selecting $$\Delta$$, the process (and conclusions) can feel a little arbitrary. (Check out this interactive post for a really cool way to explore some of these issues.)

Type 1 error rate

To calculate the Type 1 error rate, we generate data under the null hypothesis, or in this case on the rightmost boundary of the null hypothesis since it is a composite hypothesis. First, let’s generate one data set:

```r
library(simstudy)   # provides defDataAdd(), genData(), trtAssign(), addColumns()
library(magrittr)
library(broom)

set.seed(319281)

def <- defDataAdd(varname = "y", formula = "0.30 - 0.15*rx", dist = "binary")

DT <- genData(1000) %>% trtAssign(dtName = ., grpName = "rx")
DT <- addColumns(def, DT)
DT
##         id rx y
##    1:    1  0 0
##    2:    2  1 0
##    3:    3  1 0
##    4:    4  0 0
##    5:    5  1 0
##   ---
##  996:  996  0 1
##  997:  997  0 0
##  998:  998  1 0
##  999:  999  0 0
## 1000: 1000  0 0
```

And we can estimate a confidence interval for the difference between the two means:

```r
props <- DT[, .(success = sum(y), n = .N), keyby = rx]
setorder(props, -rx)

round(tidy(prop.test(props$success, props$n,
                     correct = FALSE, conf.level = 0.95))[, -c(5, 8, 9)], 3)
##   estimate1 estimate2 statistic p.value conf.low conf.high
## 1     0.142     0.276    27.154       0   -0.184    -0.084
```

If we generate 1000 data sets in the same way, we can count the number of occurrences where we would incorrectly reject the null hypothesis (i.e., commit a Type 1 error):

```r
powerRet <- function(nPerGrp, level, effect, d = NULL) {

  Form <- genFormula(c(0.30, -effect), c("rx"))
  def <- defDataAdd(varname = "y", formula = Form, dist = "binary")

  DT <- genData(nPerGrp * 2) %>% trtAssign(dtName = ., grpName = "rx")

  iter <- 1000
  ci <- data.table()

  # generate 1000 data sets and store results each time in "ci"
  for (i in 1:iter) {
    dx <- addColumns(def, DT)
    props <- dx[, .(success = sum(y), n = .N), keyby = rx]
    setorder(props, -rx)
    ptest <- prop.test(props$success, props$n, correct = FALSE, conf.level = level)
    ci <- rbind(ci, data.table(t(ptest$conf.int),
                               diff = ptest$estimate[1] - ptest$estimate[2]))
  }

  setorder(ci, V1)
  ci[, i := 1:.N]

  # for sample size calculation at 80% power
  if (is.null(d)) d <- ci[i == .2 * .N, V1]
  ci[, d := d]

  # determine if interval crosses threshold
  ci[, nullTrue := (V1 <= d)]

  return(ci[])
}
```

Using 95% CIs, we expect 2.5% of the intervals to lie to the right of the non-inferiority threshold. That is, 2.5% of the time we would reject the null hypothesis when we shouldn’t:

```r
ci <- powerRet(nPerGrp = 500, level = 0.95, effect = 0.15, d = -0.15)
formattable::percent(ci[, mean(!(nullTrue))], 1)
## [1] 2.4%
```

And using 90% CIs, we expect 5% of the intervals to lie to the right of the threshold:

```r
ci <- powerRet(nPerGrp = 500, level = 0.90, effect = 0.15, d = -0.15)
formattable::percent(ci[, mean(!(nullTrue))], 1)
## [1] 5.1%
```

Sample size estimates

If we do not expect the effect sizes to be different across interventions, it seems reasonable to find the sample size under this assumption of no effect. Assuming we want to set $$\alpha = 0.025$$, we generate many data sets and estimate the 95% confidence interval for each one. The power is merely the proportion of these confidence intervals that lie entirely to the right of $$-\Delta$$. But how should we set $$\Delta$$?
I’d propose that for each candidate sample size level, we find $$-\Delta$$ such that 80% of the simulated confidence intervals lie entirely to the right of it, where 80% is the desired power of the test (i.e., given that there is no treatment effect, 80% of the (hypothetical) experiments we conduct will lead us to conclude that the new treatment is non-inferior to the old treatment).

```r
ci <- powerRet(nPerGrp = 200, level = 0.95, effect = 0)
p1 <- plotCIs(ci, 200, 0.95)

ci <- powerRet(nPerGrp = 500, level = 0.95, effect = 0)
p2 <- plotCIs(ci, 500, 0.95)

library(gridExtra)
grid.arrange(p1, p2, nrow = 1,
             bottom = "difference in proportion", left = "iterations")
```

It is clear that increasing the sample size reduces the width of the 95% confidence intervals. As a result, the non-inferiority threshold based on 80% power is shifted closer towards zero when sample size increases. This implies that a larger sample size allows us to make a more compelling statement about non-inferiority.

Unfortunately, not all non-inferiority statements are alike. If we set $$\Delta$$ too large, we may expand the bounds of non-inferiority beyond a reasonable, justifiable point. Given that there is no actual constraint on what $$\Delta$$ can be, I would say that the non-inferiority test is somewhat more problematic than its closely related cousin, the superiority test, where $$\Delta$$ is in effect fixed at zero. But if we take this approach, where we identify the $$\Delta$$ that satisfies the desired power, we can make a principled decision about whether or not the threshold is within reasonable bounds.

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.
### CRAN Release of R/exams 2.3-1

Mon, 05/14/2018 - 00:00

(This article was first published on R/exams, and kindly contributed to R-bloggers)

New minor release of the R/exams package to CRAN, containing a wide range of smaller improvements and bug fixes. Notable new features include a dedicated OpenOLAT interface and a convenience function facilitating the use of TikZ-based graphics.

Version 2.3-1 of the one-for-all exams generator R/exams has been published on the Comprehensive R Archive Network at https://CRAN.R-project.org/package=exams. In the next days this will propagate to other CRAN mirrors along with Windows binary packages. The development version of the package is now version 2.3-2 on http://R-Forge.R-project.org/forum/?group_id=1337.

New features

• Added new interface exams2openolat() for the open-source OpenOLAT learning management system. This is only a convenience wrapper to exams2qti12() or exams2qti21() with some dedicated tweaks for optimizing MathJax output for OpenOLAT.
• New function include_tikz() that facilitates compiling standalone TikZ figures into a range of output formats, especially PNG and SVG (for HTML-based output). This is useful when including TikZ in R/Markdown exercises or when converting R/LaTeX exercises to HTML. Two examples have been added to the package that illustrate the capabilities of include_tikz(): automaton, logic. A dedicated blog post is also planned.
Written exams (NOPS)

• Following the blog post on Written R/exams around the World, several users have been kind enough to add language support for: Croatian (hr.dcf, contributed by Krunoslav Juraić), Danish (da.dcf, contributed by Tue Vissing Jensen and Jakob Messner), Slovak (sk.dcf, contributed by Peter Fabsic), Swiss German (gsw.dcf, contributed by Reto Stauffer), and Turkish (tr.dcf, contributed by Emrah Er). Furthermore, Portuguese has been distinguished into pt-PT.dcf (Portuguese Portuguese) vs. pt-BR.dcf (Brazilian Portuguese), with pt.dcf defaulting to the former (contributed by Thomas Dellinger).
• After setting a random seed, exams2nops() and exams2pdf() now yield the same random versions of the exercises. Previously this was not the case because exams2nops() internally generates a single random trial exam first for a couple of sanity checks. Now the .GlobalEnv$.Random.seed is restored after generating the trial exam.
• Fixed the support for nsamp argument in exams2nops(). Furthermore, current limitations of exams2nops() are pointed out more clearly in error messages and edge cases caught.
• Allow varying points within a certain exercise in nops_eval().
HTML output and Base64-encoded supplements
• In exams2html() and other interfaces based on make_exercise_transform_html() the option base64 = TRUE now uses Base64 encoding for all file extensions (known to the package) whereas base64 = NULL only encodes image files (previous default behavior).
• Bug fixes and improvements in HTML transformers:
• Only ="file.ext" (i.e., attribute values with a leading =") for supplementary files embedded into HTML is replaced now by the corresponding Base64-encoded version.
• href="file.ext" is replaced by href="file.ext" download="file.ext" prior to Base64 replacement, to assure that the file name is preserved for the browser/downloader.
• alt="file.ext" and download="file.ext" attributes are preserved as-is rather than being replaced by the Base64-encoded version of file.ext.
• Include further file URIs for Base64 supplements, in particular .sav for SPSS data files.
• In exams2blackboard(..., base64 = FALSE, ...) the base64 = FALSE was erroneously ignored: no matter how base64 was specified, essentially base64 = TRUE was used. It is honored again now.
Extensions
• \exshuffle{} can now also be used for schoice exercises with more than one TRUE answer. In a first step only one of the TRUE answers is selected, and then one fewer than the requested number of items is drawn from the FALSE answers.
• Function include_supplement(..., dir = "foo") – without full path to "foo" – now also works if "foo" is not a local sub-directory but a sub-directory to the exercise directory edir (if specified).
• Enable passing of envir argument from exams2html() to xweave() in case of R/Markdown (.Rmd) exercises.
• When using exams2html(..., mathjax = TRUE) for testing purposes, mathjax.rstudio.com is used now rather than cdn.mathjax.org which is currently redirecting and will eventually be shut down completely.
• Added support for \tightlist (as produced by pandoc) in all current LaTeX templates as well as exams2nops().
Bug fixes
• Fixed a bug in stresstest_exercise() where the “rank” (previously called “order”) of the correct solution was computed incorrectly. Additional enhancements in plots and labels.
• Fixed a bug for tex2image(..., tikz = TRUE) where erroneously \usetikzlibrary{TRUE} was included. Also tex2image(..., Sweave = TRUE) (the default) did not run properly on Windows, fixed now.
• Better warnings if \exshuffle{} could not be honored due to a lack of sufficiently many (suitable) answer alternatives.
• Bug fix in CSV export of exams2arsnova(). Recent ARSnova versions use “mc” (rather than “MC”) and “abcd” (rather than “SC”) to code multiple-choice and single-choice questions, respectively.

To leave a comment for the author, please follow the link and comment on their blog: R/exams.

### Statistics Sunday: Taylor Swift vs. Lorde – Analyzing Song Lyrics

Sun, 05/13/2018 - 19:12

(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)


Last week, I showed how to tokenize text. Today I’ll use those functions to do some text analysis of one of my favorite types of text: song lyrics. Plus, this is a great opportunity to demonstrate a new R package I discovered: geniusR, which will download lyrics from Genius.

There are two packages – geniusR and geniusr – which will do this. I played with both and found geniusR easier to use. Neither is perfect, but what is perfect, anyway?

To install geniusR, you’ll use a different method than usual – you’ll need to install the package devtools, then call the install_github function to download the R package directly from GitHub.

install.packages("devtools")
devtools::install_github("josiahparry/geniusR")
## from URL https://api.github.com/repos/josiahparry/geniusR/zipball/master
## Installing geniusR
## '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \
## --no-environ --no-save --no-restore --quiet CMD INSTALL \
## '/private/var/folders/85/9ygtlz0s4nxbmx3kgkvbs5g80000gn/T/Rtmpl3bwRx/devtools33c73e3f989/JosiahParry-geniusR-5907d82' \
## --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' \
## --install-tests
##

Now you’ll want to load geniusR and tidyverse so we can work with our data.

library(geniusR)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ggplot2 2.2.1 purrr 0.2.4
## tibble 1.4.2 dplyr 0.7.4
## tidyr 0.8.0 stringr 1.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──

For today’s demonstration, I’ll be working with data from two artists I love: Taylor Swift and Lorde. Both dropped new albums last year, Reputation and Melodrama, respectively, and both, though similar in age and friends with each other, have very different writing and musical styles.

geniusR has a function genius_album that will download lyrics from an entire album, labeling it by track.

swift_lyrics <- genius_album(artist="Taylor Swift", album="Reputation")
## Joining, by = c("track_title", "track_n", "track_url")
lorde_lyrics <- genius_album(artist="Lorde", album="Melodrama")
## Joining, by = c("track_title", "track_n", "track_url")

Now we want to tokenize our datasets, remove stop words, and count word frequency – this code should look familiar, except this time, I’m combining them using the pipeline symbol (%>%) from the tidyverse, which allows you to string together multiple functions without having to nest them.

library(tidytext)
tidy_swift <- swift_lyrics %>%
unnest_tokens(word,lyric) %>%
anti_join(stop_words) %>%
count(word, sort=TRUE)
## Joining, by = "word"
head(tidy_swift)
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 call 46
## 2 wanna 37
## 3 ooh 35
## 4 ha 34
## 5 ah 33
## 6 time 32
tidy_lorde <- lorde_lyrics %>%
unnest_tokens(word,lyric) %>%
anti_join(stop_words) %>%
count(word, sort=TRUE)
## Joining, by = "word"
head(tidy_lorde)
## # A tibble: 6 x 2
## word n
## <chr> <int>
## 1 boom 40
## 2 love 26
## 3 shit 24
## 4 dynamite 22
## 6 light 22

Looking at the top 6 words for each, it doesn’t look like there will be a lot of overlap. But let’s explore that, shall we? Lorde’s album is 3 tracks shorter than Taylor Swift’s. To make sure our word comparisons are meaningful, I’ll create new variables that take into account the total number of words, so each word metric will be a proportion, allowing for direct comparisons. And because I’ll be joining the datasets, I’ll be sure to label these new columns by artist name.

tidy_swift <- tidy_swift %>%
rename(swift_n = n) %>%
mutate(swift_prop = swift_n/sum(swift_n))

tidy_lorde <- tidy_lorde %>%
rename(lorde_n = n) %>%
mutate(lorde_prop = lorde_n/sum(lorde_n))

There are multiple types of joins available in the tidyverse. I used an anti_join to remove stop words. Today, I want to use a full_join, because I want my final dataset to retain all words from both artists. When one dataset contributes a word not found in the other artist’s set, it will fill those variables in with missing values.

compare_words <- tidy_swift %>%
full_join(tidy_lorde, by = "word")

summary(compare_words)
## word swift_n swift_prop lorde_n
## Length:957 Min. : 1.000 Min. :0.00050 Min. : 1.0
## Class :character 1st Qu.: 1.000 1st Qu.:0.00050 1st Qu.: 1.0
## Mode :character Median : 1.000 Median :0.00050 Median : 1.0
## Mean : 3.021 Mean :0.00152 Mean : 2.9
## 3rd Qu.: 3.000 3rd Qu.:0.00151 3rd Qu.: 3.0
## Max. :46.000 Max. :0.02321 Max. :40.0
## NA's :301 NA's :301 NA's :508
## lorde_prop
## Min. :0.0008
## 1st Qu.:0.0008
## Median :0.0008
## Mean :0.0022
## 3rd Qu.:0.0023
## Max. :0.0307
## NA's :508

The final dataset contains 957 tokens – unique words – and the NAs tell how many words are only present in one artist’s corpus. Lorde uses 301 words Taylor Swift does not, and Taylor Swift uses 508 words that Lorde does not. That leaves 148 words on which they overlap.

There are many things we could do with these data, but let’s visualize words and proportions, with one artist on the x-axis and the other on the y-axis.

ggplot(compare_words, aes(x=swift_prop, y=lorde_prop)) +
geom_abline() +
geom_text(aes(label=word), check_overlap=TRUE, vjust=1.5) +
labs(y="Lorde", x="Taylor Swift") + theme_classic()
## Warning: Removed 809 rows containing missing values (geom_text).

The warning lets me know there are 809 rows with missing values – those are the words only present in one artist’s corpus. Words that fall on or near the line are used at similar rates between artists. Words above the line are used more by Lorde than Taylor Swift, and words below the line are used more by Taylor Swift than Lorde. This tells us that, for instance, Lorde uses “love,” “light,” and, yes, “shit,” more than Swift, while Swift uses “call,” “wanna,” and “hands” more than Lorde. They use words like “waiting,” “heart,” and “dreams” at similar rates. Rates are low overall, but if you look at the max values for the proportion variables, Swift’s most common word only accounts for about 2.3% of her total words; Lorde’s most common word only accounts for about 3.1% of her total words.

This highlights why it’s important to remove stop words for these types of analyses; otherwise, our datasets and chart would be full of words like “the,” “a”, and “and.”
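The anti_join used earlier does exactly this filtering; a toy four-word corpus shows the effect (hypothetical words, using the stop_words lexicon that ships with tidytext):

```r
library(dplyr)
library(tidytext)  # supplies the stop_words data frame used by anti_join

# anti_join drops every row whose word appears in stop_words,
# leaving only the content words
toy <- tibble::tibble(word = c("the", "love", "and", "light"))
kept <- toy %>% anti_join(stop_words, by = "word")
kept$word
# "the" and "and" are removed; "love" and "light" survive
```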

Next Statistics Sunday, we’ll take a look at sentiment analysis!


To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

### Tips for Ellipse Summary Plot

Sun, 05/13/2018 - 02:05

(This article was first published on R – ЯтомизоnoR, and kindly contributed to R-bloggers)

I received some questions privately and reply to them here, because the answers may also help others, including me.

1. How to specify size

With plot axis parameters.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], xlim=c(4,8), ylim=c(2,5))

2. How to specify color

With plot color parameter.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], col=c('cyan', 'orange', 'magenta'))

3. How to give names

Using builtin iris data.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')])

> str(iris)
'data.frame': 150 obs. of 5 variables:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The column Species is used in both the x and y data; it gives the name of each category.

4. Example

Using fivenum instead of the default ninenum:

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], col=c('cyan', 'orange', 'magenta'), SUMMARY=fivenum)

This draws the plot shown above. The output below may help you relate the values to each axis. Here, for fivenum, the 3rd (x, y) pair of each category is its median.

> ellipseplot(iris[,c('Species', 'Sepal.Length')], iris[,c('Species', 'Sepal.Width')], SUMMARY=fivenum, plot=FALSE)
$setosa
    x   y
1 4.3 2.3
2 4.8 3.2
3 5.0 3.4
4 5.2 3.7
5 5.8 4.4

$versicolor
    x   y
1 4.9 2.0
2 5.6 2.5
3 5.9 2.8
4 6.3 3.0
5 7.0 3.4

$virginica
x y
1 4.9 2.2
2 6.2 2.8
3 6.5 3.0
4 6.9 3.2
5 7.9 3.8
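As a sanity check, fivenum() is base R (Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum), and applying it to the setosa sepal lengths of the builtin iris data reproduces the x column of $setosa above:

```r
# Five-number summary of setosa sepal lengths from the builtin iris data;
# matches the x column of $setosa in the plot=FALSE output
fivenum(iris$Sepal.Length[iris$Species == "setosa"])
# [1] 4.3 4.8 5.0 5.2 5.8
```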


To leave a comment for the author, please follow the link and comment on their blog: R – ЯтомизоnoR.

### automaton: Interpretation of Automaton Diagrams (Using TikZ)

Sun, 05/13/2018 - 00:00

(This article was first published on R/exams, and kindly contributed to R-bloggers)

Exercise template for assessing the interpretation of an automaton diagram (drawn with TikZ) based on randomly generated input sequences.

Name: automaton
Type: mchoice
Related: logic

Description: An automaton diagram with four states A-D is drawn with TikZ and is to be interpreted, where A is always the initial state and one state is randomly picked as the accepting state. The acceptance of five binary 0/1 input sequences has to be assessed, with approximately a quarter of all sequences being accepted. Depending on the exams2xyz() interface the TikZ graphic can be rendered in PNG, SVG, or directly by LaTeX.
Solution feedback: Yes
Randomization: Random numbers, text blocks, and graphics
Mathematical notation: No
Verbatim R input/output: No
Images: Yes
Other supplements: No
Template: automaton.Rnw automaton.Rmd
Raw: (1 random version) automaton.tex automaton.md

Demo code:

library("exams")

set.seed(1090)
exams2html("automaton.Rnw")
set.seed(1090)
exams2pdf("automaton.Rnw")

set.seed(1090)
exams2html("automaton.Rmd")
set.seed(1090)
exams2pdf("automaton.Rmd")

To leave a comment for the author, please follow the link and comment on their blog: R/exams.