Subscribe to R bloggers feed R bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 5 hours 13 min ago

Benchmarking a SSD drive in reading and writing files with R

Fri, 06/29/2018 - 09:00

(This article was first published on Marcelo S. Perlin, and kindly contributed to R-bloggers)

I recently bought a new computer for home and it came with two drives,
one HDD and other SSD. The later is used for the OS and the former for
all of my files. From all computers I had, both home and work, this is
definitely the fastest. While some of the merits are due to the newer
CPUS and RAM, the SSD drive can make all the difference in file
operations.

My research usually deals with large files from financial markets. Being
efficient in reading those files is key to my productivity. Given that,
I was very curious in understanding how much I would benefit in speed
when reading/writing files in my SSD drive instead of the HDD. For that,
I wrote a simple function that will time a particular operation. The
function will take as input the number of rows in the data (1..Inf), the
type of function used to save the file (rds, csv, fst) and the
type of drive (HDD or SSD). See next.

bench.fct <- function(N = 2500000, type.file = 'rds', type.hd = 'HDD') { # Function for timing read and write operations # # INPUT: N - Number of rows in dataframe to be read and write # type.file - format of output file (rds, csv, fst) # type.hd - where to save (hdd or ssd) # # OUTPUT: A dataframe with results require(tidyverse) require(fst) my.df <- data_frame(x = runif(N), char.vec = sample(letters, size = N, replace = TRUE)) path.file <- switch(type.hd, 'SSD' = '~', 'HDD' = '/mnt/HDD/') my.file <- file.path(path.file, switch (type.file, 'rds-base' = 'temp_rds.rds', 'rds-readr' = 'temp_rds.rds', 'fst' = 'temp_fst.fst', 'csv-readr' = 'temp_csv.csv', 'csv-base' = 'temp_csv.csv')) if (type.file == 'rds-base') { time.write <- system.time(saveRDS(my.df, my.file, compress = FALSE)) time.read <- system.time(readRDS(my.file)) } else if (type.file == 'rds-readr') { time.write <- system.time(write_rds(x = my.df, path = my.file, compress = 'none')) time.read <- system.time(read_rds(path = my.file )) } else if (type.file == 'fst') { time.write <- system.time(write.fst(x = my.df, path = my.file)) time.read <- system.time(read_fst(my.file)) } else if (type.file == 'csv-readr') { time.write <- system.time(write_csv(x = my.df, path = my.file)) time.read <- system.time(read_csv(file = my.file, col_types = cols(x = col_double(), char.vec = col_character()))) } else if (type.file == 'csv-base') { time.write <- system.time(write.csv(x = my.df, file = my.file)) time.read <- system.time(read.csv(file = my.file)) } # clean up file.remove(my.file) # save output df.out <- data_frame(type.file = type.file, type.hd = type.hd, N = N, type.time = c('write', 'read'), times = c(time.write[3], time.read[3])) return(df.out) }

Now that we have my function, its time to use it for all combinations
between number of rows, the formats of the file and type of drive:

library(purrr) df.grid <- expand.grid(N = seq(1, 500000, by = 50000), type.file = c('rds-readr', 'rds-base', 'fst', 'csv-readr', 'csv-base'), type.hd = c('HDD', 'SSD'), stringsAsFactors = F) l.out <- pmap(list(N = df.grid$N, type.file = df.grid$type.file, type.hd = df.grid$type.hd), .f = bench.fct) df.res <- do.call(what = bind_rows, args = l.out)

Lets check the result in a nice plot:

library(ggplot2) p <- ggplot(df.res, aes(x = N, y = times, linetype = type.hd)) + geom_line() + facet_grid(type.file ~ type.time) print(p)

As you can see, the csv-base format is messing with the y axis. Let’s
remove it for better visualization:

library(ggplot2) p <- ggplot(filter(df.res, !(type.file %in% c('csv-base'))), aes(x = N, y = times, linetype = type.hd)) + geom_line() + facet_grid(type.file ~ type.time) print(p)

When it comes to the file format, we learn:

  • By far, the fst format is the best. It takes less time to read
    and write than the others. However, it’s probably unfair to compare
    it to csv and rds as it uses many of the 16 cores of my
    computer.

  • readr is a great package for writing and reading csv files.
    You can see a large difference of time from using the base
    functions. This is likely due to the use of low level functions to
    write and read the text files.

  • When using the rds format, the base function do not differ much
    from the readr functions
    .

As for the effect of using SSD, its clear that it DOES NOT effect
the time of reading and writing. The differences between using HDD and
SSD looks like noise. Seeking to provide a more robust analysis, let’s
formally test this hypothesis using a simple t-test for the means:

tab <- df.res %>% group_by(type.file, type.time) %>% summarise(mean.HDD = mean(times[type.hd == 'HDD']), mean.SSD = mean(times[type.hd == 'SSD']), p.value = t.test(times[type.hd == 'SSD'], times[type.hd == 'HDD'])$p.value) print(tab) ## # A tibble: 10 x 5 ## # Groups: type.file [?] ## type.file type.time mean.HDD mean.SSD p.value ## ## 1 csv-base read 0.554 0.463 0.605 ## 2 csv-base write 0.405 0.405 0.997 ## 3 csv-readr read 0.142 0.126 0.687 ## 4 csv-readr write 0.0711 0.0706 0.982 ## 5 fst read 0.015 0.0084 0.0584 ## 6 fst write 0.00900 0.00910 0.964 ## 7 rds-base read 0.0321 0.0303 0.848 ## 8 rds-base write 0.0253 0.025 0.969 ## 9 rds-readr read 0.0323 0.0304 0.845 ## 10 rds-readr write 0.0251 0.0247 0.957

As we can see, the null hypothesis of equal means easily fails to be
rejected for almost all types of files and operations at 10%. The
exception was for the fst format in a reading operation. In other
words, statistically, it does not make any difference in time from using
SSD or HDD to read or write files in different formats.

I am very surprised by this result. Independently of the type of format,
I expected a large difference as SSD drives are much faster within an
OS. Am I missing something? Is this due to the OS being in the SSD? What
you guys think?

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Marcelo S. Perlin. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Code for Workshop: Introduction to Machine Learning with R

Fri, 06/29/2018 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

These are the slides from my workshop: Introduction to Machine Learning with R which I gave at the University of Heidelberg, Germany on June 28th 2018. The entire code accompanying the workshop can be found below the video.

The workshop covered the basics of machine learning. With an example dataset I went through a standard machine learning workflow in R with the packages caret and h2o:

  • reading in data
  • exploratory data analysis
  • missingness
  • feature engineering
  • training and test split
  • model training with Random Forests, Gradient Boosting, Neural Nets, etc.
  • hyperparameter tuning


Workshop – Introduction to Machine Learning with R from Shirin Glander

Setup

All analyses are done in R using RStudio. For detailed session information including R version, operating system and package versions, see the sessionInfo() output at the end of this document.

All figures are produced with ggplot2.

  • libraries
library(tidyverse) # for tidy data analysis library(readr) # for fast reading of input files library(mice) # mice package for Multivariate Imputation by Chained Equations (MICE)

Data preparation The dataset

The dataset I am using in these example analyses, is the Breast Cancer Wisconsin (Diagnostic) Dataset. The data was downloaded from the UC Irvine Machine Learning Repository.

The first dataset looks at the predictor classes:

  • malignant or
  • benign breast mass.

The features characterise cell nucleus properties and were generated from image analysis of fine needle aspirates (FNA) of breast masses:

  • Sample ID (code number)
  • Clump thickness
  • Uniformity of cell size
  • Uniformity of cell shape
  • Marginal adhesion
  • Single epithelial cell size
  • Number of bare nuclei
  • Bland chromatin
  • Number of normal nuclei
  • Mitosis
  • Classes, i.e. diagnosis
bc_data <- read_delim("/Users/shiringlander/Documents/Github/intro_to_ml_workshop/intro_to_ml_uni_heidelberg/datasets/breast-cancer-wisconsin.data.txt", delim = ",", col_names = c("sample_code_number", "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape", "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitosis", "classes")) %>% mutate(bare_nuclei = as.numeric(bare_nuclei), classes = ifelse(classes == "2", "benign", ifelse(classes == "4", "malignant", NA))) summary(bc_data) ## sample_code_number clump_thickness uniformity_of_cell_size ## Min. : 61634 Min. : 1.000 Min. : 1.000 ## 1st Qu.: 870688 1st Qu.: 2.000 1st Qu.: 1.000 ## Median : 1171710 Median : 4.000 Median : 1.000 ## Mean : 1071704 Mean : 4.418 Mean : 3.134 ## 3rd Qu.: 1238298 3rd Qu.: 6.000 3rd Qu.: 5.000 ## Max. :13454352 Max. :10.000 Max. :10.000 ## ## uniformity_of_cell_shape marginal_adhesion single_epithelial_cell_size ## Min. : 1.000 Min. : 1.000 Min. : 1.000 ## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 2.000 ## Median : 1.000 Median : 1.000 Median : 2.000 ## Mean : 3.207 Mean : 2.807 Mean : 3.216 ## 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 4.000 ## Max. :10.000 Max. :10.000 Max. :10.000 ## ## bare_nuclei bland_chromatin normal_nucleoli mitosis ## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000 ## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000 ## Median : 1.000 Median : 3.000 Median : 1.000 Median : 1.000 ## Mean : 3.545 Mean : 3.438 Mean : 2.867 Mean : 1.589 ## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 4.000 3rd Qu.: 1.000 ## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000 ## NA's :16 ## classes ## Length:699 ## Class :character ## Mode :character ## ## ## ##

Missing data # how many NAs are in the data md.pattern(bc_data, plot = FALSE) ## sample_code_number clump_thickness uniformity_of_cell_size ## 683 1 1 1 ## 16 1 1 1 ## 0 0 0 ## uniformity_of_cell_shape marginal_adhesion single_epithelial_cell_size ## 683 1 1 1 ## 16 1 1 1 ## 0 0 0 ## bland_chromatin normal_nucleoli mitosis classes bare_nuclei ## 683 1 1 1 1 1 0 ## 16 1 1 1 1 0 1 ## 0 0 0 0 16 16 bc_data <- bc_data %>% drop_na() %>% select(classes, everything(), -sample_code_number) head(bc_data) ## # A tibble: 6 x 10 ## classes clump_thickness uniformity_of_cell_si… uniformity_of_cell_sha… ## ## 1 benign 5 1 1 ## 2 benign 5 4 4 ## 3 benign 3 1 1 ## 4 benign 6 8 8 ## 5 benign 4 1 1 ## 6 malignant 8 10 10 ## # ... with 6 more variables: marginal_adhesion , ## # single_epithelial_cell_size , bare_nuclei , ## # bland_chromatin , normal_nucleoli , mitosis

Missing values can be imputed with the mice package.

More info and tutorial with code: https://shirinsplayground.netlify.com/2018/04/flu_prediction/

Data exploration
  • Response variable for classification
ggplot(bc_data, aes(x = classes, fill = classes)) + geom_bar()

More info on dealing with unbalanced classes: https://shiring.github.io/machine_learning/2017/04/02/unbalanced

  • Response variable for regression
ggplot(bc_data, aes(x = clump_thickness)) + geom_histogram(bins = 10)

  • Features
gather(bc_data, x, y, clump_thickness:mitosis) %>% ggplot(aes(x = y, color = classes, fill = classes)) + geom_density(alpha = 0.3) + facet_wrap( ~ x, scales = "free", ncol = 3)

  • Correlation graphs
co_mat_benign <- filter(bc_data, classes == "benign") %>% select(-1) %>% cor() co_mat_malignant <- filter(bc_data, classes == "malignant") %>% select(-1) %>% cor() library(igraph) g_benign <- graph.adjacency(co_mat_benign, weighted = TRUE, diag = FALSE, mode = "upper") g_malignant <- graph.adjacency(co_mat_malignant, weighted = TRUE, diag = FALSE, mode = "upper") # http://kateto.net/networks-r-igraph cut.off_b <- mean(E(g_benign)$weight) cut.off_m <- mean(E(g_malignant)$weight) g_benign_2 <- delete_edges(g_benign, E(g_benign)[weight < cut.off_b]) g_malignant_2 <- delete_edges(g_malignant, E(g_malignant)[weight < cut.off_m]) c_g_benign_2 <- cluster_fast_greedy(g_benign_2) c_g_malignant_2 <- cluster_fast_greedy(g_malignant_2) par(mfrow = c(1,2)) plot(c_g_benign_2, g_benign_2, vertex.size = colSums(co_mat_benign) * 10, vertex.frame.color = NA, vertex.label.color = "black", vertex.label.cex = 0.8, edge.width = E(g_benign_2)$weight * 15, layout = layout_with_fr(g_benign_2), main = "Benign tumors") plot(c_g_malignant_2, g_malignant_2, vertex.size = colSums(co_mat_malignant) * 10, vertex.frame.color = NA, vertex.label.color = "black", vertex.label.cex = 0.8, edge.width = E(g_malignant_2)$weight * 15, layout = layout_with_fr(g_malignant_2), main = "Malignant tumors")

Principal Component Analysis library(ellipse) # perform pca and extract scores pcaOutput <- prcomp(as.matrix(bc_data[, -1]), scale = TRUE, center = TRUE) pcaOutput2 <- as.data.frame(pcaOutput$x) # define groups for plotting pcaOutput2$groups <- bc_data$classes centroids <- aggregate(cbind(PC1, PC2) ~ groups, pcaOutput2, mean) conf.rgn <- do.call(rbind, lapply(unique(pcaOutput2$groups), function(t) data.frame(groups = as.character(t), ellipse(cov(pcaOutput2[pcaOutput2$groups == t, 1:2]), centre = as.matrix(centroids[centroids$groups == t, 2:3]), level = 0.95), stringsAsFactors = FALSE))) ggplot(data = pcaOutput2, aes(x = PC1, y = PC2, group = groups, color = groups)) + geom_polygon(data = conf.rgn, aes(fill = groups), alpha = 0.2) + geom_point(size = 2, alpha = 0.6) + labs(color = "", fill = "")

Multidimensional Scaling select(bc_data, -1) %>% dist() %>% cmdscale %>% as.data.frame() %>% mutate(group = bc_data$classes) %>% ggplot(aes(x = V1, y = V2, color = group)) + geom_point()

t-SNE dimensionality reduction library(tsne) select(bc_data, -1) %>% dist() %>% tsne() %>% as.data.frame() %>% mutate(group = bc_data$classes) %>% ggplot(aes(x = V1, y = V2, color = group)) + geom_point()

Machine Learning packages for R caret # configure multicore library(doParallel) cl <- makeCluster(detectCores()) registerDoParallel(cl) library(caret)

Training, validation and test data set.seed(42) index <- createDataPartition(bc_data$classes, p = 0.7, list = FALSE) train_data <- bc_data[index, ] test_data <- bc_data[-index, ] bind_rows(data.frame(group = "train", train_data), data.frame(group = "test", test_data)) %>% gather(x, y, clump_thickness:mitosis) %>% ggplot(aes(x = y, color = group, fill = group)) + geom_density(alpha = 0.3) + facet_wrap( ~ x, scales = "free", ncol = 3)

Regression set.seed(42) model_glm <- caret::train(clump_thickness ~ ., data = train_data, method = "glm", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE)) model_glm ## Generalized Linear Model ## ## 479 samples ## 9 predictor ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 432, 431, 431, 431, ... ## Resampling results: ## ## RMSE Rsquared MAE ## 1.972314 0.5254215 1.648832 predictions <- predict(model_glm, test_data) # model_glm$finalModel$linear.predictors == model_glm$finalModel$fitted.values data.frame(residuals = resid(model_glm), predictors = model_glm$finalModel$linear.predictors) %>% ggplot(aes(x = predictors, y = residuals)) + geom_jitter() + geom_smooth(method = "lm")

# y == train_data$clump_thickness data.frame(residuals = resid(model_glm), y = model_glm$finalModel$y) %>% ggplot(aes(x = y, y = residuals)) + geom_jitter() + geom_smooth(method = "lm")

data.frame(actual = test_data$clump_thickness, predicted = predictions) %>% ggplot(aes(x = actual, y = predicted)) + geom_jitter() + geom_smooth(method = "lm")

Classification Decision trees

rpart

library(rpart) library(rpart.plot) set.seed(42) fit <- rpart(classes ~ ., data = train_data, method = "class", control = rpart.control(xval = 10, minbucket = 2, cp = 0), parms = list(split = "information")) rpart.plot(fit, extra = 100)

Random Forests

Random Forests predictions are based on the generation of multiple classification trees. They can be used for both, classification and regression tasks. Here, I show a classification task.

set.seed(42) model_rf <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3, savePredictions = TRUE, verboseIter = FALSE))

When you specify savePredictions = TRUE, you can access the cross-validation resuls with model_rf$pred.

model_rf ## Random Forest ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 2 0.9776753 0.9513499 ## 5 0.9757957 0.9469999 ## 9 0.9714200 0.9370285 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 2. model_rf$finalModel$confusion ## benign malignant class.error ## benign 304 7 0.02250804 ## malignant 5 163 0.02976190 Dealing with unbalanced data

Luckily, caret makes it very easy to incorporate over- and under-sampling techniques with cross-validation resampling. We can simply add the sampling option to our trainControl and choose down for under- (also called down-) sampling. The rest stays the same as with our original model.

set.seed(42) model_rf_down <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE, sampling = "down")) model_rf_down ## Random Forest ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Addtional sampling using down-sampling prior to pre-processing ## ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 2 0.9797503 0.9563138 ## 5 0.9741198 0.9438326 ## 9 0.9699578 0.9346310 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 2.

Feature Importance imp <- model_rf$finalModel$importance imp[order(imp, decreasing = TRUE), ] ## uniformity_of_cell_size uniformity_of_cell_shape ## 43.936945 39.840595 ## bare_nuclei bland_chromatin ## 33.820345 31.984813 ## normal_nucleoli single_epithelial_cell_size ## 21.686039 17.761202 ## clump_thickness marginal_adhesion ## 16.318817 9.518437 ## mitosis ## 2.220633 # estimate variable importance importance <- varImp(model_rf, scale = TRUE) plot(importance)

  • predicting test data
confusionMatrix(predict(model_rf, test_data), as.factor(test_data$classes)) ## Confusion Matrix and Statistics ## ## Reference ## Prediction benign malignant ## benign 128 4 ## malignant 5 67 ## ## Accuracy : 0.9559 ## 95% CI : (0.9179, 0.9796) ## No Information Rate : 0.652 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 0.9031 ## Mcnemar's Test P-Value : 1 ## ## Sensitivity : 0.9624 ## Specificity : 0.9437 ## Pos Pred Value : 0.9697 ## Neg Pred Value : 0.9306 ## Prevalence : 0.6520 ## Detection Rate : 0.6275 ## Detection Prevalence : 0.6471 ## Balanced Accuracy : 0.9530 ## ## 'Positive' Class : benign ## results <- data.frame(actual = test_data$classes, predict(model_rf, test_data, type = "prob")) results$prediction <- ifelse(results$benign > 0.5, "benign", ifelse(results$malignant > 0.5, "malignant", NA)) results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE) ggplot(results, aes(x = prediction, fill = correct)) + geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)

Extreme gradient boosting trees

Extreme gradient boosting (XGBoost) is a faster and improved implementation of gradient boosting for supervised learning.

“XGBoost uses a more regularized model formalization to control over-fitting, which gives it better performance.” Tianqi Chen, developer of xgboost

XGBoost is a tree ensemble model, which means the sum of predictions from a set of classification and regression trees (CART). In that, XGBoost is similar to Random Forests but it uses a different approach to model training. Can be used for classification and regression tasks. Here, I show a classification task.

set.seed(42) model_xgb <- caret::train(classes ~ ., data = train_data, method = "xgbTree", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3, savePredictions = TRUE, verboseIter = FALSE)) model_xgb ## eXtreme Gradient Boosting ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Resampling results across tuning parameters: ## ## eta max_depth colsample_bytree subsample nrounds Accuracy ## 0.3 1 0.6 0.50 50 0.9567788 ## 0.3 1 0.6 0.50 100 0.9544912 ## 0.3 1 0.6 0.50 150 0.9513572 ## 0.3 1 0.6 0.75 50 0.9576164 ## 0.3 1 0.6 0.75 100 0.9536448 ## 0.3 1 0.6 0.75 150 0.9525987 ## 0.3 1 0.6 1.00 50 0.9559409 ## 0.3 1 0.6 1.00 100 0.9555242 ## 0.3 1 0.6 1.00 150 0.9551031 ## 0.3 1 0.8 0.50 50 0.9718588 ## 0.3 1 0.8 0.50 100 0.9720583 ## 0.3 1 0.8 0.50 150 0.9699879 ## 0.3 1 0.8 0.75 50 0.9726964 ## 0.3 1 0.8 0.75 100 0.9724664 ## 0.3 1 0.8 0.75 150 0.9705868 ## 0.3 1 0.8 1.00 50 0.9714202 ## 0.3 1 0.8 1.00 100 0.9710035 ## 0.3 1 0.8 1.00 150 0.9705866 ## 0.3 2 0.6 0.50 50 0.9559448 ## 0.3 2 0.6 0.50 100 0.9565397 ## 0.3 2 0.6 0.50 150 0.9555063 ## 0.3 2 0.6 0.75 50 0.9530150 ## 0.3 2 0.6 0.75 100 0.9550985 ## 0.3 2 0.6 0.75 150 0.9551070 ## 0.3 2 0.6 1.00 50 0.9532320 ## 0.3 2 0.6 1.00 100 0.9551072 ## 0.3 2 0.6 1.00 150 0.9557237 ## 0.3 2 0.8 0.50 50 0.9720583 ## 0.3 2 0.8 0.50 100 0.9735166 ## 0.3 2 0.8 0.50 150 0.9720540 ## 0.3 2 0.8 0.75 50 0.9722494 ## 0.3 2 0.8 0.75 100 0.9726703 ## 0.3 2 0.8 0.75 150 0.9716374 ## 0.3 2 0.8 1.00 50 0.9716327 ## 0.3 2 0.8 1.00 100 0.9724622 ## 0.3 2 0.8 1.00 150 0.9718416 ## 0.3 3 0.6 0.50 50 0.9548905 ## 0.3 3 0.6 0.50 100 0.9557237 ## 0.3 3 0.6 0.50 150 0.9555198 ## 0.3 3 0.6 0.75 50 0.9561404 ## 0.3 3 0.6 0.75 100 0.9546820 ## 0.3 3 0.6 0.75 150 0.9552982 ## 0.3 3 0.6 1.00 50 0.9577983 ## 0.3 3 0.6 1.00 100 0.9573819 ## 0.3 3 0.6 1.00 150 0.9567655 ## 0.3 3 0.8 0.50 50 0.9733131 ## 0.3 3 0.8 0.50 100 0.9728829 ## 0.3 3 0.8 0.50 150 0.9718499 ## 0.3 3 0.8 0.75 50 0.9751879 ## 0.3 3 0.8 0.75 100 0.9743546 ## 0.3 3 0.8 0.75 150 0.9735212 ## 0.3 3 0.8 1.00 50 0.9743372 ## 0.3 3 0.8 1.00 100 0.9737122 ## 0.3 3 0.8 1.00 150 0.9743461 ## 0.4 1 0.6 0.50 50 0.9548861 ## 0.4 1 0.6 0.50 100 0.9528290 ## 0.4 1 0.6 0.50 150 0.9498772 ## 0.4 1 0.6 0.75 50 0.9557239 ## 0.4 1 0.6 0.75 100 0.9513529 ## 0.4 1 0.6 0.75 150 0.9492779 ## 0.4 1 0.6 1.00 50 0.9559365 ## 0.4 1 0.6 1.00 100 0.9551031 ## 0.4 1 0.6 1.00 150 0.9536361 ## 0.4 1 0.8 0.50 50 0.9710164 ## 0.4 1 0.8 0.50 100 0.9697577 ## 0.4 1 0.8 0.50 150 0.9687074 ## 0.4 1 0.8 0.75 50 0.9710122 ## 0.4 1 0.8 0.75 100 0.9707996 ## 0.4 1 0.8 0.75 150 0.9691455 ## 0.4 1 0.8 1.00 50 0.9705911 ## 0.4 1 0.8 1.00 100 0.9697446 ## 0.4 1 0.8 1.00 150 0.9697576 ## 0.4 2 0.6 0.50 50 0.9544866 ## 0.4 2 0.6 0.50 100 0.9542694 ## 0.4 2 0.6 0.50 150 0.9536357 ## 0.4 2 0.6 0.75 50 0.9540611 ## 0.4 2 0.6 0.75 100 0.9542694 ## 0.4 2 0.6 0.75 150 0.9549033 ## 0.4 2 0.6 1.00 50 0.9540653 ## 0.4 2 0.6 1.00 100 0.9555239 ## 0.4 2 0.6 1.00 150 0.9546818 ## 0.4 2 0.8 0.50 50 0.9720670 ## 0.4 2 0.8 0.50 100 0.9695629 ## 0.4 2 0.8 0.50 150 0.9702006 ## 0.4 2 0.8 0.75 50 0.9722627 ## 0.4 2 0.8 0.75 100 0.9720500 ## 0.4 2 0.8 0.75 150 0.9716289 ## 0.4 2 0.8 1.00 50 0.9726705 ## 0.4 2 0.8 1.00 100 0.9708042 ## 0.4 2 0.8 1.00 150 0.9708129 ## 0.4 3 0.6 0.50 50 0.9555150 ## 0.4 3 0.6 0.50 100 0.9553021 ## 0.4 3 0.6 0.50 150 0.9548943 ## 0.4 3 0.6 0.75 50 0.9555281 ## 0.4 3 0.6 0.75 100 0.9563662 ## 0.4 3 0.6 0.75 150 0.9555324 ## 0.4 3 0.6 1.00 50 0.9575900 ## 0.4 3 0.6 1.00 100 0.9571735 ## 0.4 3 0.6 1.00 150 0.9559104 ## 0.4 3 0.8 0.50 50 0.9737255 ## 0.4 3 0.8 0.50 100 0.9745501 ## 0.4 3 0.8 0.50 150 0.9730874 ## 0.4 3 0.8 0.75 50 0.9747539 ## 0.4 3 0.8 0.75 100 0.9724664 ## 0.4 3 0.8 0.75 150 0.9720498 ## 0.4 3 0.8 1.00 50 0.9747539 ## 0.4 3 0.8 1.00 100 0.9749624 ## 0.4 3 0.8 1.00 150 0.9734996 ## Kappa ## 0.9050828 ## 0.8999999 ## 0.8930637 ## 0.9067208 ## 0.8982284 ## 0.8959903 ## 0.9028825 ## 0.9022543 ## 0.9014018 ## 0.9382467 ## 0.9386326 ## 0.9340573 ## 0.9400323 ## 0.9395968 ## 0.9353783 ## 0.9372262 ## 0.9362148 ## 0.9353247 ## 0.9032270 ## 0.9047203 ## 0.9024465 ## 0.8968511 ## 0.9015282 ## 0.9016169 ## 0.8971329 ## 0.9015111 ## 0.9028614 ## 0.9387022 ## 0.9419143 ## 0.9387792 ## 0.9391933 ## 0.9401872 ## 0.9379714 ## 0.9377309 ## 0.9397601 ## 0.9384827 ## 0.9008861 ## 0.9029797 ## 0.9024531 ## 0.9037859 ## 0.9004226 ## 0.9019909 ## 0.9074584 ## 0.9064701 ## 0.9051441 ## 0.9414031 ## 0.9405025 ## 0.9380734 ## 0.9456856 ## 0.9438986 ## 0.9419994 ## 0.9438642 ## 0.9426000 ## 0.9439780 ## 0.9007223 ## 0.8964381 ## 0.8897615 ## 0.9027951 ## 0.8931520 ## 0.8886910 ## 0.9030461 ## 0.9014362 ## 0.8982364 ## 0.9363059 ## 0.9334254 ## 0.9311383 ## 0.9361883 ## 0.9357131 ## 0.9320657 ## 0.9353688 ## 0.9333607 ## 0.9334467 ## 0.8999756 ## 0.8997888 ## 0.8983861 ## 0.8991356 ## 0.8998960 ## 0.9013529 ## 0.8990428 ## 0.9023340 ## 0.9004889 ## 0.9387165 ## 0.9332663 ## 0.9345567 ## 0.9393855 ## 0.9389455 ## 0.9380863 ## 0.9401366 ## 0.9361847 ## 0.9361724 ## 0.9021263 ## 0.9017938 ## 0.9010613 ## 0.9025263 ## 0.9043436 ## 0.9024744 ## 0.9069828 ## 0.9059579 ## 0.9031829 ## 0.9424523 ## 0.9442537 ## 0.9410193 ## 0.9447486 ## 0.9397683 ## 0.9388701 ## 0.9449064 ## 0.9454375 ## 0.9422358 ## ## Tuning parameter 'gamma' was held constant at a value of 0 ## ## Tuning parameter 'min_child_weight' was held constant at a value of 1 ## Accuracy was used to select the optimal model using the largest value. ## The final values used for the model were nrounds = 50, max_depth = 3, ## eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 ## and subsample = 0.75.

  • Feature Importance
importance <- varImp(model_xgb, scale = TRUE) plot(importance)

  • predicting test data
confusionMatrix(predict(model_xgb, test_data), as.factor(test_data$classes)) ## Confusion Matrix and Statistics ## ## Reference ## Prediction benign malignant ## benign 128 3 ## malignant 5 68 ## ## Accuracy : 0.9608 ## 95% CI : (0.9242, 0.9829) ## No Information Rate : 0.652 ## P-Value [Acc > NIR] : <2e-16 ## ## Kappa : 0.9142 ## Mcnemar's Test P-Value : 0.7237 ## ## Sensitivity : 0.9624 ## Specificity : 0.9577 ## Pos Pred Value : 0.9771 ## Neg Pred Value : 0.9315 ## Prevalence : 0.6520 ## Detection Rate : 0.6275 ## Detection Prevalence : 0.6422 ## Balanced Accuracy : 0.9601 ## ## 'Positive' Class : benign ## results <- data.frame(actual = test_data$classes, predict(model_xgb, test_data, type = "prob")) results$prediction <- ifelse(results$benign > 0.5, "benign", ifelse(results$malignant > 0.5, "malignant", NA)) results$correct <- ifelse(results$actual == results$prediction, TRUE, FALSE) ggplot(results, aes(x = prediction, fill = correct)) + geom_bar(position = "dodge")

ggplot(results, aes(x = prediction, y = benign, color = correct, shape = correct)) + geom_jitter(size = 3, alpha = 0.6)

Available models in caret

https://topepo.github.io/caret/available-models.html

Feature Selection

Performing feature selection on the whole dataset would lead to prediction bias, we therefore need to run the whole modeling process on the training data alone!

  • Correlation

Correlations between all features are calculated and visualised with the corrplot package. I am then removing all features with a correlation higher than 0.7, keeping the feature with the lower mean.

library(corrplot) # calculate correlation matrix corMatMy <- cor(train_data[, -1]) corrplot(corMatMy, order = "hclust")

#Apply correlation filter at 0.70, highlyCor <- colnames(train_data[, -1])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)] ## Compare row 2 and column 3 with corr 0.908 ## Means: 0.709 vs 0.594 so flagging column 2 ## Compare row 3 and column 7 with corr 0.749 ## Means: 0.67 vs 0.569 so flagging column 3 ## All correlations <= 0.7 # which variables are flagged for removal? highlyCor ## [1] "uniformity_of_cell_size" "uniformity_of_cell_shape" #then we remove these variables train_data_cor <- train_data[, which(!colnames(train_data) %in% highlyCor)]

  • Recursive Feature Elimination (RFE)

Another way to choose features is with Recursive Feature Elimination. RFE uses a Random Forest algorithm to test combinations of features and rate each with an accuracy score. The combination with the highest score is usually preferential.

set.seed(7) results_rfe <- rfe(x = train_data[, -1], y = as.factor(train_data$classes), sizes = c(1:9), rfeControl = rfeControl(functions = rfFuncs, method = "cv", number = 10)) # chosen features predictors(results_rfe) ## [1] "bare_nuclei" "clump_thickness" ## [3] "uniformity_of_cell_size" "uniformity_of_cell_shape" ## [5] "bland_chromatin" "normal_nucleoli" ## [7] "marginal_adhesion" "single_epithelial_cell_size" train_data_rfe <- train_data[, c(1, which(colnames(train_data) %in% predictors(results_rfe)))]

  • Genetic Algorithm (GA)

The Genetic Algorithm (GA) has been developed based on evolutionary principles of natural selection: It aims to optimize a population of individuals with a given set of genotypes by modeling selection over time. In each generation (i.e. iteration), each individual’s fitness is calculated based on their genotypes. Then, the fittest individuals are chosen to produce the next generation. This subsequent generation of individuals will have genotypes resulting from (re-) combinations of the parental alleles. These new genotypes will again determine each individual’s fitness. This selection process is iterated for a specified number of generations and (ideally) leads to fixation of the fittest alleles in the gene pool.

This concept of optimization can be applied to non-evolutionary models as well, like feature selection processes in machine learning.

set.seed(27) model_ga <- gafs(x = train_data[, -1], y = as.factor(train_data$classes), iters = 10, # generations of algorithm popSize = 10, # population size for each generation levels = c("malignant", "benign"), gafsControl = gafsControl(functions = rfGA, # Assess fitness with RF method = "cv", # 10 fold cross validation genParallel = TRUE, # Use parallel programming allowParallel = TRUE)) plot(model_ga) # Plot mean fitness (AUC) by generation

train_data_ga <- train_data[, c(1, which(colnames(train_data) %in% model_ga$ga$final))]

Hyperparameter tuning with caret
  • Cartesian Grid

  • mtry: Number of variables randomly sampled as candidates at each split.

set.seed(42) grid <- expand.grid(mtry = c(1:10)) model_rf_tune_man <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE), tuneGrid = grid) model_rf_tune_man ## Random Forest ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 1 0.9785044 0.9532161 ## 2 0.9772586 0.9504377 ## 3 0.9774625 0.9508246 ## 4 0.9766333 0.9488778 ## 5 0.9753789 0.9460274 ## 6 0.9737078 0.9422613 ## 7 0.9730957 0.9408547 ## 8 0.9714155 0.9371611 ## 9 0.9718280 0.9380578 ## 10 0.9718280 0.9380135 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 1. plot(model_rf_tune_man)

  • Random Search
set.seed(42) model_rf_tune_auto <- caret::train(classes ~ ., data = train_data, method = "rf", preProcess = c("scale", "center"), trControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = TRUE, verboseIter = FALSE, search = "random"), tuneGrid = grid, tuneLength = 15) model_rf_tune_auto ## Random Forest ## ## 479 samples ## 9 predictor ## 2 classes: 'benign', 'malignant' ## ## Pre-processing: scaled (9), centered (9) ## Resampling: Cross-Validated (10 fold, repeated 10 times) ## Summary of sample sizes: 432, 431, 431, 431, 431, 431, ... ## Resampling results across tuning parameters: ## ## mtry Accuracy Kappa ## 1 0.9785044 0.9532161 ## 2 0.9772586 0.9504377 ## 3 0.9774625 0.9508246 ## 4 0.9766333 0.9488778 ## 5 0.9753789 0.9460274 ## 6 0.9737078 0.9422613 ## 7 0.9730957 0.9408547 ## 8 0.9714155 0.9371611 ## 9 0.9718280 0.9380578 ## 10 0.9718280 0.9380135 ## ## Accuracy was used to select the optimal model using the largest value. ## The final value used for the model was mtry = 1. plot(model_rf_tune_auto)

Grid search with h2o

The R package h2o provides a convenient interface to H2O, which is an open-source machine learning and deep learning platform. H2O distributes a wide range of common machine learning algorithms for classification, regression and deep learning.

library(h2o) h2o.init(nthreads = -1) ## Connection successful! ## ## R is connected to the H2O cluster: ## H2O cluster uptime: 26 minutes 45 seconds ## H2O cluster timezone: Europe/Berlin ## H2O data parsing timezone: UTC ## H2O cluster version: 3.20.0.2 ## H2O cluster version age: 13 days ## H2O cluster name: H2O_started_from_R_shiringlander_jrj894 ## H2O cluster total nodes: 1 ## H2O cluster total memory: 3.24 GB ## H2O cluster total cores: 8 ## H2O cluster allowed cores: 8 ## H2O cluster healthy: TRUE ## H2O Connection ip: localhost ## H2O Connection port: 54321 ## H2O Connection proxy: NA ## H2O Internal Security: FALSE ## H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4 ## R Version: R version 3.5.0 (2018-04-23) h2o.no_progress() bc_data_hf <- as.h2o(bc_data) h2o.describe(bc_data_hf) %>% gather(x, y, Zeros:Sigma) %>% mutate(group = ifelse(x %in% c("Min", "Max", "Mean"), "min, mean, max", ifelse(x %in% c("NegInf", "PosInf"), "Inf", "sigma, zeros"))) %>% ggplot(aes(x = Label, y = as.numeric(y), color = x)) + geom_point(size = 4, alpha = 0.6) + scale_color_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + facet_grid(group ~ ., scales = "free") + labs(x = "Feature", y = "Value", color = "")

library(reshape2) # for melting bc_data_hf[, 1] <- h2o.asfactor(bc_data_hf[, 1]) cor <- h2o.cor(bc_data_hf) rownames(cor) <- colnames(cor) melt(cor) %>% mutate(Var2 = rep(rownames(cor), nrow(cor))) %>% mutate(Var2 = factor(Var2, levels = colnames(cor))) %>% mutate(variable = factor(variable, levels = colnames(cor))) %>% ggplot(aes(x = variable, y = Var2, fill = value)) + geom_tile(width = 0.9, height = 0.9) + scale_fill_gradient2(low = "white", high = "red", name = "Cor.") + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) + labs(x = "", y = "")

Training, validation and test data splits <- h2o.splitFrame(bc_data_hf, ratios = c(0.7, 0.15), seed = 1) train <- splits[[1]] valid <- splits[[2]] test <- splits[[3]] response <- "classes" features <- setdiff(colnames(train), response) summary(as.factor(train$classes), exact_quantiles = TRUE) ## classes ## benign :313 ## malignant:167 summary(as.factor(valid$classes), exact_quantiles = TRUE) ## classes ## benign :64 ## malignant:38 summary(as.factor(test$classes), exact_quantiles = TRUE) ## classes ## benign :67 ## malignant:34 pca <- h2o.prcomp(training_frame = train, x = features, validation_frame = valid, transform = "NORMALIZE", impute_missing = TRUE, k = 3, seed = 42) eigenvec <- as.data.frame(pca@model$eigenvectors) eigenvec$label <- features library(ggrepel) ggplot(eigenvec, aes(x = pc1, y = pc2, label = label)) + geom_point(color = "navy", alpha = 0.7) + geom_text_repel()

Classification Random Forest hyper_params <- list( ntrees = c(25, 50, 75, 100), max_depth = c(10, 20, 30), min_rows = c(1, 3, 5) ) search_criteria <- list( strategy = "RandomDiscrete", max_models = 50, max_runtime_secs = 360, stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 0.0005, seed = 42 ) rf_grid <- h2o.grid(algorithm = "randomForest", # h2o.randomForest, # alternatively h2o.gbm # for Gradient boosting trees x = features, y = response, grid_id = "rf_grid", training_frame = train, validation_frame = valid, nfolds = 25, fold_assignment = "Stratified", hyper_params = hyper_params, search_criteria = search_criteria, seed = 42 ) # performance metrics where smaller is better -> order with decreasing = FALSE sort_options_1 <- c("mean_per_class_error", "mse", "err", "logloss") for (sort_by_1 in sort_options_1) { grid <- h2o.getGrid("rf_grid", sort_by = sort_by_1, decreasing = FALSE) model_ids <- grid@model_ids best_model <- h2o.getModel(model_ids[[1]]) h2o.saveModel(best_model, path="models", force = TRUE) } # performance metrics where bigger is better -> order with decreasing = TRUE sort_options_2 <- c("auc", "precision", "accuracy", "recall", "specificity") for (sort_by_2 in sort_options_2) { grid <- h2o.getGrid("rf_grid", sort_by = sort_by_2, decreasing = TRUE) model_ids <- grid@model_ids best_model <- h2o.getModel(model_ids[[1]]) h2o.saveModel(best_model, path = "models", force = TRUE) } files <- list.files(path = "/Users/shiringlander/Documents/Github/intro_to_ml_workshop/intro_to_ml_uni_heidelberg/models") rf_models <- files[grep("rf_grid_model", files)] for (model_id in rf_models) { path <- paste0("/Users/shiringlander/Documents/Github/intro_to_ml_workshop/intro_to_ml_uni_heidelberg", "/models/", model_id) best_model <- h2o.loadModel(path) mse_auc_test <- data.frame(model_id = model_id, mse = h2o.mse(h2o.performance(best_model, test)), auc = h2o.auc(h2o.performance(best_model, test))) if (model_id == rf_models[[1]]) { mse_auc_test_comb <- mse_auc_test } else { mse_auc_test_comb <- rbind(mse_auc_test_comb, mse_auc_test) } } mse_auc_test_comb %>% gather(x, y, mse:auc) %>% ggplot(aes(x = model_id, y = y, fill = model_id)) + facet_grid(x ~ ., scales = "free") + geom_bar(stat = "identity", alpha = 0.8, position = "dodge") + scale_fill_brewer(palette = "Set1") + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1), plot.margin = unit(c(0.5, 0, 0, 1.5), "cm")) + labs(x = "", y = "value", fill = "")

for (model_id in rf_models) { best_model <- h2o.getModel(model_id) finalRf_predictions <- data.frame(model_id = rep(best_model@model_id, nrow(test)), actual = as.vector(test$classes), as.data.frame(h2o.predict(object = best_model, newdata = test))) finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no") finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", ifelse(finalRf_predictions$malignant > 0.8, "malignant", "uncertain")) finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no")) if (model_id == rf_models[[1]]) { finalRf_predictions_comb <- finalRf_predictions } else { finalRf_predictions_comb <- rbind(finalRf_predictions_comb, finalRf_predictions) } } finalRf_predictions_comb %>% ggplot(aes(x = actual, fill = accurate)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + facet_wrap(~ model_id, ncol = 2) + labs(fill = "Were\npredictions\naccurate?", title = "Default predictions")

finalRf_predictions_comb %>% subset(accurate_stringent != "na") %>% ggplot(aes(x = actual, fill = accurate_stringent)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + facet_wrap(~ model_id, ncol = 2) + labs(fill = "Were\npredictions\naccurate?", title = "Stringent predictions")

rf_model <- h2o.loadModel("/Users/shiringlander/Documents/Github/intro_to_ml_workshop/intro_to_ml_uni_heidelberg/models/rf_grid_model_0") h2o.varimp_plot(rf_model)

#h2o.varimp(rf_model) h2o.mean_per_class_error(rf_model, train = TRUE, valid = TRUE, xval = TRUE) ## train valid xval ## 0.02196246 0.02343750 0.02515735 h2o.confusionMatrix(rf_model, valid = TRUE) ## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.533333333333333: ## benign malignant Error Rate ## benign 61 3 0.046875 =3/64 ## malignant 0 38 0.000000 =0/38 ## Totals 61 41 0.029412 =3/102 plot(rf_model, timestep = "number_of_trees", metric = "classification_error")

plot(rf_model, timestep = "number_of_trees", metric = "logloss")

plot(rf_model, timestep = "number_of_trees", metric = "AUC")

plot(rf_model, timestep = "number_of_trees", metric = "rmse")

h2o.auc(rf_model, train = TRUE) ## [1] 0.9907214 h2o.auc(rf_model, valid = TRUE) ## [1] 0.9829359 h2o.auc(rf_model, xval = TRUE) ## [1] 0.9903005 perf <- h2o.performance(rf_model, test) perf ## H2OBinomialMetrics: drf ## ## MSE: 0.03258482 ## RMSE: 0.1805127 ## LogLoss: 0.1072519 ## Mean Per-Class Error: 0.02985075 ## AUC: 0.9916594 ## Gini: 0.9833187 ## ## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold: ## benign malignant Error Rate ## benign 63 4 0.059701 =4/67 ## malignant 0 34 0.000000 =0/34 ## Totals 63 38 0.039604 =4/101 ## ## Maximum Metrics: Maximum metrics at their respective thresholds ## metric threshold value idx ## 1 max f1 0.306667 0.944444 18 ## 2 max f2 0.306667 0.977011 18 ## 3 max f0point5 0.720000 0.933735 13 ## 4 max accuracy 0.533333 0.960396 16 ## 5 max precision 1.000000 1.000000 0 ## 6 max recall 0.306667 1.000000 18 ## 7 max specificity 1.000000 1.000000 0 ## 8 max absolute_mcc 0.306667 0.917235 18 ## 9 max min_per_class_accuracy 0.533333 0.955224 16 ## 10 max mean_per_class_accuracy 0.306667 0.970149 18 ## ## Gains/Lift Table: Extract with `h2o.gainsLift(, )` or `h2o.gainsLift(, valid=, xval=)` plot(perf)

perf@metrics$thresholds_and_metric_scores %>% ggplot(aes(x = fpr, y = tpr)) + geom_point() + geom_line() + geom_abline(slope = 1, intercept = 0) + labs(x = "False Positive Rate", y = "True Positive Rate")

h2o.logloss(perf) ## [1] 0.1072519 h2o.mse(perf) ## [1] 0.03258482 h2o.auc(perf) ## [1] 0.9916594 head(h2o.metric(perf)) ## Metrics for Thresholds: Binomial metrics as a function of classification thresholds ## threshold f1 f2 f0point5 accuracy precision recall ## 1 1.000000 0.583333 0.466667 0.777778 0.801980 1.000000 0.411765 ## 2 0.986667 0.666667 0.555556 0.833333 0.831683 1.000000 0.500000 ## 3 0.973333 0.716981 0.612903 0.863636 0.851485 1.000000 0.558824 ## 4 0.960000 0.740741 0.641026 0.877193 0.861386 1.000000 0.588235 ## 5 0.946667 0.763636 0.668790 0.889831 0.871287 1.000000 0.617647 ## 6 0.920000 0.807018 0.723270 0.912698 0.891089 1.000000 0.676471 ## specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy ## 1 1.000000 0.563122 0.411765 0.705882 ## 2 1.000000 0.631514 0.500000 0.750000 ## 3 1.000000 0.675722 0.558824 0.779412 ## 4 1.000000 0.697542 0.588235 0.794118 ## 5 1.000000 0.719221 0.617647 0.808824 ## 6 1.000000 0.762280 0.676471 0.838235 ## tns fns fps tps tnr fnr fpr tpr idx ## 1 67 20 0 14 1.000000 0.588235 0.000000 0.411765 0 ## 2 67 17 0 17 1.000000 0.500000 0.000000 0.500000 1 ## 3 67 15 0 19 1.000000 0.441176 0.000000 0.558824 2 ## 4 67 14 0 20 1.000000 0.411765 0.000000 0.588235 3 ## 5 67 13 0 21 1.000000 0.382353 0.000000 0.617647 4 ## 6 67 11 0 23 1.000000 0.323529 0.000000 0.676471 5 finalRf_predictions <- data.frame(actual = as.vector(test$classes), as.data.frame(h2o.predict(object = rf_model, newdata = test))) finalRf_predictions$accurate <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict, "yes", "no") finalRf_predictions$predict_stringent <- ifelse(finalRf_predictions$benign > 0.8, "benign", ifelse(finalRf_predictions$malignant > 0.8, "malignant", "uncertain")) finalRf_predictions$accurate_stringent <- ifelse(finalRf_predictions$actual == finalRf_predictions$predict_stringent, "yes", ifelse(finalRf_predictions$predict_stringent == "uncertain", "na", "no")) finalRf_predictions %>% group_by(actual, predict) %>% dplyr::summarise(n = n()) ## # A tibble: 4 x 3 ## # Groups: actual [?] ## actual predict n ## ## 1 benign benign 64 ## 2 benign malignant 3 ## 3 malignant benign 1 ## 4 malignant malignant 33 finalRf_predictions %>% group_by(actual, predict_stringent) %>% dplyr::summarise(n = n()) ## # A tibble: 5 x 3 ## # Groups: actual [?] ## actual predict_stringent n ## ## 1 benign benign 62 ## 2 benign malignant 2 ## 3 benign uncertain 3 ## 4 malignant malignant 29 ## 5 malignant uncertain 5 finalRf_predictions %>% ggplot(aes(x = actual, fill = accurate)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + labs(fill = "Were\npredictions\naccurate?", title = "Default predictions")

finalRf_predictions %>% subset(accurate_stringent != "na") %>% ggplot(aes(x = actual, fill = accurate_stringent)) + geom_bar(position = "dodge") + scale_fill_brewer(palette = "Set1") + labs(fill = "Were\npredictions\naccurate?", title = "Stringent predictions")

df <- finalRf_predictions[, c(1, 3, 4)] thresholds <- seq(from = 0, to = 1, by = 0.1) prop_table <- data.frame(threshold = thresholds, prop_true_b = NA, prop_true_m = NA) for (threshold in thresholds) { pred <- ifelse(df$benign > threshold, "benign", "malignant") pred_t <- ifelse(pred == df$actual, TRUE, FALSE) group <- data.frame(df, "pred" = pred_t) %>% group_by(actual, pred) %>% dplyr::summarise(n = n()) group_b <- filter(group, actual == "benign") prop_b <- sum(filter(group_b, pred == TRUE)$n) / sum(group_b$n) prop_table[prop_table$threshold == threshold, "prop_true_b"] <- prop_b group_m <- filter(group, actual == "malignant") prop_m <- sum(filter(group_m, pred == TRUE)$n) / sum(group_m$n) prop_table[prop_table$threshold == threshold, "prop_true_m"] <- prop_m } prop_table %>% gather(x, y, prop_true_b:prop_true_m) %>% ggplot(aes(x = threshold, y = y, color = x)) + geom_point() + geom_line() + scale_color_brewer(palette = "Set1") + labs(y = "proportion of true predictions", color = "b: benign cases\nm: malignant cases")

If you are interested in more machine learning posts, check out the category listing for machine_learning on my blog – https://shirinsplayground.netlify.com/categories/#posts-list-machine-learninghttps://shiring.github.io/categories.html#machine_learning-ref

stopCluster(cl) h2o.shutdown() ## Are you sure you want to shutdown the H2O instance running at http://localhost:54321/ (Y/N)? sessionInfo() ## R version 3.5.0 (2018-04-23) ## Platform: x86_64-apple-darwin15.6.0 (64-bit) ## Running under: macOS High Sierra 10.13.5 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib ## ## locale: ## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8 ## ## attached base packages: ## [1] parallel stats graphics grDevices utils datasets methods ## [8] base ## ## other attached packages: ## [1] ggrepel_0.8.0 reshape2_1.4.3 h2o_3.20.0.2 ## [4] corrplot_0.84 caret_6.0-80 doParallel_1.0.11 ## [7] iterators_1.0.9 foreach_1.4.4 ellipse_0.4.1 ## [10] igraph_1.2.1 bindrcpp_0.2.2 mice_3.1.0 ## [13] lattice_0.20-35 forcats_0.3.0 stringr_1.3.1 ## [16] dplyr_0.7.5 purrr_0.2.5 readr_1.1.1 ## [19] tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 ## [22] tidyverse_1.2.1 ## ## loaded via a namespace (and not attached): ## [1] minqa_1.2.4 colorspace_1.3-2 class_7.3-14 ## [4] rprojroot_1.3-2 pls_2.6-0 rstudioapi_0.7 ## [7] DRR_0.0.3 prodlim_2018.04.18 lubridate_1.7.4 ## [10] xml2_1.2.0 codetools_0.2-15 splines_3.5.0 ## [13] mnormt_1.5-5 robustbase_0.93-1 knitr_1.20 ## [16] RcppRoll_0.3.0 jsonlite_1.5 nloptr_1.0.4 ## [19] broom_0.4.4 ddalpha_1.3.4 kernlab_0.9-26 ## [22] sfsmisc_1.1-2 compiler_3.5.0 httr_1.3.1 ## [25] backports_1.1.2 assertthat_0.2.0 Matrix_1.2-14 ## [28] lazyeval_0.2.1 cli_1.0.0 htmltools_0.3.6 ## [31] tools_3.5.0 gtable_0.2.0 glue_1.2.0 ## [34] Rcpp_0.12.17 cellranger_1.1.0 nlme_3.1-137 ## [37] blogdown_0.6 psych_1.8.4 timeDate_3043.102 ## [40] xfun_0.2 gower_0.1.2 lme4_1.1-17 ## [43] rvest_0.3.2 pan_1.4 DEoptimR_1.0-8 ## [46] MASS_7.3-50 scales_0.5.0 ipred_0.9-6 ## [49] hms_0.4.2 RColorBrewer_1.1-2 yaml_2.1.19 ## [52] rpart_4.1-13 stringi_1.2.3 randomForest_4.6-14 ## [55] e1071_1.6-8 lava_1.6.1 geometry_0.3-6 ## [58] bitops_1.0-6 rlang_0.2.1 pkgconfig_2.0.1 ## [61] evaluate_0.10.1 bindr_0.1.1 recipes_0.1.3 ## [64] labeling_0.3 CVST_0.2-2 tidyselect_0.2.4 ## [67] plyr_1.8.4 magrittr_1.5 bookdown_0.7 ## [70] R6_2.2.2 mitml_0.3-5 dimRed_0.1.0 ## [73] pillar_1.2.3 haven_1.1.1 foreign_0.8-70 ## [76] withr_2.1.2 RCurl_1.95-4.10 survival_2.42-3 ## [79] abind_1.4-5 nnet_7.3-12 modelr_0.1.2 ## [82] crayon_1.3.4 jomo_2.6-2 xgboost_0.71.2 ## [85] utf8_1.1.4 rmarkdown_1.10 grid_3.5.0 ## [88] readxl_1.1.0 data.table_1.11.4 ModelMetrics_1.1.0 ## [91] digest_0.6.15 stats4_3.5.0 munsell_0.5.0 ## [94] magic_1.5-8 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Beeswarms instead of histograms

Thu, 06/28/2018 - 22:19

(This article was first published on R – Insights of a PhD, and kindly contributed to R-bloggers)

Histograms are good, density plots are also good. Violin and bean plots too. Recently I had someone ask for a plot where you could see each individual point along a continuum, give the points specific colours based on a second variable (similar to the figure), which deviates somewhat from the typical density type plots. Apparently, they’re called beeplots or beeswarms. And there’s a way to make them in R (of course, there’s probably more than one… ggplot2??).

Here’s one way (slightly modified from the packages help files)…

library(beeswarm) # assuming you've installed it ;) data(breast) beeswarm(time_survival ~ ER, data = breast, pch = 16, pwcol = 1 + as.numeric(event_survival), xlab = "", ylab = "Follow-up time (months)", horizontal = TRUE, labels = c("ER neg", "ER pos"), method = "c") legend("topright", legend = c("Yes", "No"), title = "Censored", pch = 16, col = 1:2)

Horizontal is optional of course…

Feel free to comment if you know of other approaches.

See here for more examples.

Hope that helps someone

 

 

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Insights of a PhD. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Choroplethr v3.6.2 is now on CRAN

Thu, 06/28/2018 - 00:33

(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)

A new version of Choroplethr is now on CRAN! You can get it by typing the following from the R Console:

install.packages("choroplethr")

I encourage all Choroplethr users to install this new version!

Overview

A new version of ggplot2 will be released shortly. This new version of ggplot2 introduces many new features. These new features, however, come at a cost: some packages which use ggplot2 (such as Choroplethr) are not compatible with this new version.

In fact, if you try to draw a map with the new version of ggplot2 and the old version of Choroplethr, you might get an error. This new version of Chorplethr (v3.6.2) fixes those issues.

Technical Details

Choroplethr was one of (if not the) first R package to create maps of the US where Alaska and Hawaii are rendered as insets. Here’s an example map, along with code:

library(choroplethr) data(df_pop_state) state_choropleth(df_pop_state)

The new version of ggplot2 was causing Choroplethr to crash whenever it attempted to render the above map. In order to understand the underlying issue, it is helpful to first understand how Choroplethr creates the above map.

Insets and ?annotation_custom

The most important thing to notice about the above map is this: it is inaccurate.

Alaska and Hawaii are not really located where they appear. And they are not drawn to scale, either.

Here is how the 50 US states “really” look:

library(choroplethrMaps) library(ggplot2) data(state.map) ggplot(state.map, aes(long, lat, group=group)) + geom_polygon()

In order to render Alaska and Hawaii as insets, Choroplethr:

  1. First pre-renders the Continental US, Alaska and Hawaii as separate ggplot objects.
  2. Then affixes the Alaska and Hawaii ggplot objects to the continental US object using ggplot2’s ggplotGrob and annotation_custom functions.
Insets and Themes

The two images differ in more than just the placement of Alaska and Hawaii, though. The first map also has a clear background, whereas the second map has a grey background and tick marks for longitude and latitude. Also, Choroplethr uses a single legend for all three maps (the legends for Alaska and Hawaii are suppressed).

Traditionally Choroplethr has implemented these aesthetic tweaks in the functions theme_clean (for the Continental US) and theme_inset (for Alaska and Hawaii). The code for these functions originally came from Section 13.19 (“Making a Map with a Clean Background”) of Winston Chang’s R Graphics Cookbook. Here’s what that code used to look like:

theme_clean = function() { theme( axis.title = element_blank(), axis.text = element_blank(), panel.background = element_blank(), panel.grid = element_blank(), axis.ticks.length = unit(0, "cm"), panel.spacing = unit(0, "lines"), plot.margin = unit(c(0, 0, 0, 0), "lines"), complete = TRUE ) }, theme_inset = function() { theme( legend.position = "none", axis.title = element_blank(), axis.text = element_blank(), panel.background = element_blank(), panel.grid = element_blank(), axis.ticks.length = unit(0, "cm"), panel.spacing = unit(0, "lines"), plot.margin = unit(c(0, 0, 0, 0), "lines"), complete = TRUE ) }

The Bug

The bug was happening when Choroplethr was attempting to affix the map of Alaska, with theme_inset applied to it, to the map of the Continental US:

# omit code for creating continental.ggplot ret = continental.ggplot # omit code for creating alaska.ggplot # add alaska.ggplot to ret using ggplotGrob and annotation_custom alaska.grob = ggplotGrob(alaska.ggplot) ret = ret + annotation_custom(grobTree(alaska.grob), xmin=-125, xmax=-110, ymin=22.5, ymax=30)

The last line in the above snippet was causing Choroplethr to crash. It appears that the intersection of theme_insetggplotGrob and annotation_custom was causing a problem.

The Fix

It appears that ggplot2 has a new function for creating a clear background on a map that does not cause these problems. That function is theme_void. Modifying theme_clean and theme_inset to use theme_void solved the problem. Here is the new version of those functions:

theme_clean = function() { ggplot2::theme_void() } theme_inset = function() { ggplot2::theme_void() %+replace% ggplot2::theme(legend.position = "none") }

The post Choroplethr v3.6.2 is now on CRAN appeared first on AriLamstein.com.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The Financial Times and BBC use R for publication graphics

Wed, 06/27/2018 - 23:56

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

While graphics guru Edward Tufte recently claimed that "R coders and users just can't do words on graphics and typography" and need additonal tools to make graphics that aren't "clunky", data journalists at major publications beg to differ. The BBC has been creating graphics "purely in R" for some time, with a typography style matching that of the BBC website. Senior BBC Data Journalist Christine Jeavans offers several examples, including this chart of life expectancy differences between men and women:

… and this chart on gender pay gaps at large British banks:

Meanwhile, the chart below was made for the Financial Times using just R and the ggplot2 package, "down to the custom FT font and the white bar in the top left", according to data journalist John Burn-Murdoch.

There are also entire collections devoted to recreating Tufte's own visualizations in R, presumably meeting his typography standards. Tufte later clarified saying "Problem is not code, problem is published practice", which is true of any programming environment, which is why it was strange that he'd call out R in particular.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Finalfit now in CRAN

Wed, 06/27/2018 - 21:56

(This article was first published on R – DataSurg, and kindly contributed to R-bloggers)

Your favourite package for getting model outputs directly into publication ready tables is now available on CRAN. They make you work for it! Thank you to all that helped. The development version will continue to be available from github.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – DataSurg. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

New Course: Marketing Analytics in R

Wed, 06/27/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Course Description

Here is a link to our new R course.

This is your chance to dive into the worlds of marketing and business analytics using R. Day by day, there are a multitude of decisions that companies have to face. With the help of statistical models, you’re going to be able to support the business decision-making process based on data, not your gut feeling. Let us show you what a great impact statistical modeling can have on the performance of businesses. You’re going to learn about and apply strategies to communicate your results and help them make a difference.

Chapter 1: Modeling Customer Lifetime Value with Linear Regression

How can you decide which customers are most valuable for your business? Learn how to model the customer lifetime value using linear regression.

Chapter 2: Logistic Regression for Churn Prevention

Predicting if a customer will leave your business, or churn, is important for targeting valuable customers and retaining those who are at risk. Learn how to model customer churn using logistic regression.

Chapter 3: Modeling Time to Reorder with Survival Analysis

Learn how to model the time to an event using survival analysis. This could be the time until next order or until a person churns.

Chapter 4: Reducing Dimensionality with Principal Component Analysis

Learn how to reduce the number of variables in your data using principal component analysis. Not only does this help to get a better understanding of your data. PCA also enables you to condense information to single indices and to solve multicollinearity problems in a regression analysis with many intercorrelated variables.

If you are interested in learning more about marketing and data science, check out this tutorial for Python, Data Science for Search Engine Marketing.

The following R courses are prerequisites to take Marketing Analytics in R: Statistical Modeling.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

New Project: A Visual History of Nobel Prize Winners

Wed, 06/27/2018 - 18:00

(This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers)

Project Description

The Nobel Prize is perhaps the worlds most well known scientific award. Every year it is given to scientists and scholars in chemistry, literature, physics, medicine, economics, and peace. The first Nobel Prize was handed out in 1901, and at that time the prize was Eurocentric and male-focused, but nowadays it’s not biased in any way. Surely, right?

Well, let’s find out! In this Project, you get to explore patterns and trends in over 100 years worth of Nobel Prize winners. What characteristics do the prize winners have? Which country gets it most often? And has anybody gotten it twice? It’s up to you to figure this out.

Before taking on this Project, we recommend that you have completed Introduction to the Tidyverse.

The dataset used in this Project was retrieved from The Nobel Foundation on Kaggle.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: DataCamp Community - r programming. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Debug Java in Bio7

Wed, 06/27/2018 - 14:33

(This article was first published on R – Bio7 Website, and kindly contributed to R-bloggers)

27.06.2018

In Bio7 R and Java code can be easily combined in the Bio7 platform. For instance to create Graphical User Interfaces in Java (with SWT or JavaFX), transfer image pixel/selection data from ImageJ and finally doing the analysis parts in R code/scripts which can be called from within Java (an easy to use API on top of Rserve in Bio7 is available, see, e.g., Groovy examples here).

However sometimes it is necessary to debug the Java language as part of such combined work. In Bio7 (based on Eclipse RCP) the very powerful Java debugging tools from Eclipse are available by default.

In addition in Bio7 it is possible to debug dynamically compiled Java classes in the same Bio7 process (using Java like a scripting language is a special feature of Bio7).

The video at the bottom of this page demonstrates the use of a remote debugging connection on the same computer (localhost connection) to debug dynamically compiled/executed Java code.

Debugging a Java process which has already started (Bio7 in this case) is only possible if Bio7 is started beforehand with the following Java arguments inside a shell to start a debug server connection:

Windows (started from the current directory = Bio7 installation directory):
Bio7.exe -vmargs -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000

Mac (given the full path to the Bio7 executable):
/Applications/Bio7.app/Contents/MacOS/Bio7 -vmargs -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000

Linux (started from the current directory = Bio7 installation directory):
./Bio7 -vmargs -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=8000

Furthermore the compiler debug information option has to be enabled in the Bio7 Java preferences (for the dynamic compilation process) and a seperate debug connection has to be started for this Java class (with the available Eclipse debug configuration dialog).

In the video below I (hopefully) demonstrate how you can (graphically) debug a simple cellular automata simulation like the Game of Life within a running Bio7 (Java) process.

" data-embed-play="Play VideoThis video will be embedded from Youtube. The apply."> Please activate JavaScript to view this video.
Video-Link:
https://www.youtube.com/watch?v=vKAGBGuQKlI var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Bio7 Website. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Game of Friendship Paradox

Wed, 06/27/2018 - 09:13

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

In the introduction of my course next week, I will (briefly) mention networks, and I wanted to provide some illustration of the Friendship Paradox. On network of thrones (discussed in Beveridge and Shan (2016)), there is a dataset with the network of characters in Game of Thrones. The word “friend” might be abusive here, but let’s continue to call connected nodes “friends”. The friendship paradox states that

People on average have fewer friends than their friends

This was discussed in Feld (1991) for instance, or Zuckerman & Jost (2001). Let’s try to see what it means here. First, let us get a copy of the dataset

1 2 3 4 download.file("https://www.macalester.edu/~abeverid/data/stormofswords.csv","got.csv") GoT=read.csv("got.csv") library(networkD3) simpleNetwork(GoT[,1:2])

Because it is difficult for me to incorporate some d3js script in the blog, I will illustrate with a more basic graph,

Consider a vertex in the undirected graph (with classical graph notations), and let denote the number of edges touching it (i.e. has friends). The average number of friends of a random person in the graph is The average number of friends that a typical friend has is
But
\sum_{v\in V} \left(\frac{1}{d(v)}\sum_{v’\in E_v} d(v’)\right)=\sum_{v,v’ \in G} \left(

\frac{d(v’)}{d(v)}+\frac{d(v)}{d(v’)}\right)
Thus,
Note that this can be related to the variance decomposition i.e.(Jensen inequality). But let us get back to our network. The list of nodes is

1 2 M=(rbind(as.matrix(GoT[,1:2]),as.matrix(GoT[,2:1]))) nodes=unique(M[,1])

and we each of them, we can get the list of friends, and the number of friends

1 2 friends = function(x) as.character(M[which(M[,1]==x),2]) nb_friends = Vectorize(function(x) length(friends(x)))

as well as the number of friends friends have, and the average number of friends

1 2 friends_of_friends = function(y) (Vectorize(function(x) length(friends(x)))(friends(y))) nb_friends_of_friends = Vectorize(function(x) mean(friends_of_friends(x)))

We can look at the density of the number of friends, for a random node,

1 2 3 4 5 6 Nb = nb_friends(nodes) Nb2 = nb_friends_of_friends(nodes) hist(Nb,breaks=0:40,col=rgb(1,0,0,.2),border="white",probability = TRUE) hist(Nb2,breaks=0:40,col=rgb(0,0,1,.2),border="white",probability = TRUE,add=TRUE) lines(density(Nb),col="red",lwd=2) lines(density(Nb2),col="blue",lwd=2)


and we can also compute the averages, just to check

1 2 3 4 mean(Nb) [1] 6.579439 mean(Nb2) [1] 13.94243

So, indeed, people on average have fewer friends than their friends.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

One R Package a Day

Wed, 06/27/2018 - 08:00

(This article was first published on stevenmortimer.com, and kindly contributed to R-bloggers)

Discover one R package a day by following the @RLangPackage account on Twitter! Inspired by One R Tip Day (@RLangTip) I created a Twitter bot that will feature one package with its description and GitHub URL each day. The R programming language (referred to as #rstats on Twitter) now has over 12,600 packages. Tweeting one each day would take more than 34 years to get through them all and that doesn’t even consider the rapid growth of new packages as shown in this post.

The purpose of the @RLangPackage Twitter account is to feature the diversity of packages that R has to offer. Most users have heard of ggplot2 and dplyr, but there are thousands of other great packages waiting to be discovered. In order to highlight those I took a list of all packages on CRAN and filtered it down to ones with anywhere between 5,000 and 1,000,000 CRAN downloads. I then took a subset of that with anywhere between 10 and 1,000 stars on GitHub. I decided to focus on these mildly popular packages with source code hosted on GitHub so that I can embed the link in the tweet and promote packages with code that people have already started to explore. After following those subsetting rules, one package is selected at random after narrowing to a list of packages that haven’t already been tweeted. I am hosting the bot on Heroku and using Heroku Scheduler to send out a tweet each day at 10:30am UTC or 6:30am Eastern Time. Below the credits and resources is the sloppily written Python that’s currently being hosted on Heroku and executed.

Credits

I was found a couple different blog posts on creating a Twitter bot with R, namely here and here, but they didn’t involve deploying on Heroku or another cloud service. However, there were plenty of Python based Twitter bot tutorials, so I took full advantage of them and went the Python route. Below is the host of resources I considered while figuring out how to deploy the app, what Twitter package to use, and some basic Python syntax which, embarrassingly, I should know by now. I think most users will gravitate towards the tweepy Python library but I found that it throws up an error when using it with Conda 3.6. The issue is documented on GitHub https://github.com/tweepy/tweepy/issues/894.

Resources

Full Script # script.py from os import environ from os.path import join, dirname from dotenv import load_dotenv from re import sub import pandas from TwitterAPI import TwitterAPI, TwitterPager # create .env file path dotenv_path = join(dirname(__file__), '.env') # load file from the path load_dotenv(dotenv_path) if __name__ == "__main__": # connect to api api = TwitterAPI(consumer_key=environ['TWITTER_CONSUMER_KEY'], consumer_secret=environ['TWITTER_CONSUMER_SECRET'], access_token_key=environ['TWITTER_ACCESS_TOKEN'], access_token_secret=environ['TWITTER_ACCESS_TOKEN_SECRET']) # scrape all prior tweets to check which packages I've already tweeted about SCREEN_NAME = 'RLangPackage' pager = TwitterPager(api, 'statuses/user_timeline', {'screen_name': SCREEN_NAME, 'count': 100}) # parse out the package name that occurs before the hyphen at the beginning previous_pks = [] for item in pager.get_iterator(wait=3.5): if 'text' in item: this_pkg = sub("^(\w+) - (.*)", "\\1", item['text']) previous_pks.append(this_pkg) # convert the package names to a dataframe prev_df = pandas.DataFrame({'name': previous_pks}) prev_df.set_index('name') # load the data I've compiled on R packages url = "https://raw.githubusercontent.com/StevenMMortimer/one-r-package-a-day/master/r-package-star-download-data.csv" all_df = pandas.read_csv(url) all_df.set_index('name') # do an "anti join" to throw away previously tweeted rows all_df = pandas.merge(all_df, prev_df, how='outer', indicator=True) all_df = all_df[all_df['_merge'] == 'left_only'] # focus on packages in middle ground of downloads and stars filtered_df = all_df[all_df['stars'].notnull()] filtered_df = filtered_df[filtered_df['stars'].between(10,1000)] filtered_df = filtered_df[filtered_df['downloads'].notnull()] filtered_df = filtered_df[filtered_df['downloads'].between(5000, 1000000)] # randomly select one of the remaining rows selected_pkg = filtered_df.sample(1) # pull out the name and description to see if we need to # truncate because of Twitters 280 character limit prepped_name = selected_pkg.iloc[0]['name'] prepped_desc = sub('\s+', ' ', selected_pkg.iloc[0]['description']) name_len = len(prepped_name) desc_len = len(prepped_desc) # 280 minus 3 for " - ", then minus 23 because links are counted as such, # then minus 9 for the " #rstats " hashtag if desc_len <= (280-3-23-9-name_len): prepped_desc = prepped_desc[0:desc_len] else: prepped_desc = prepped_desc[0:(280-6-23-9-name_len)] + "..." # cobble together the tweet text TWEET_TEXT = prepped_name + " - " + prepped_desc + " #rstats " + selected_pkg.iloc[0]['github_url'] print(TWEET_TEXT) # tweet it out to the world! r = api.request('statuses/update', {'status': TWEET_TEXT}) print('SUCCESS' if r.status_code == 200 else 'PROBLEM: ' + r.text) var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: stevenmortimer.com. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Text-to-speech with R

Wed, 06/27/2018 - 02:00

(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)

Computers started talking to us! They do this with so called Text-to-Speech (TTS) systems. With neural nets, deep learning and lots of training data, these systems have gotten a whole lot better in recent years. In some cases, they are so good that you can’t distinguish between human and machine voice.

In one of our recent codecentric.AI videos, we compared different Text-to-Speech systems (the video is in German, though – but the text snippets and their voice recordings we show in the video are a mix of German and English). In this video, we had a small contest between Polly, Alexa, Siri And Co to find out who best speaks different tongue twisters.

Here, I want to find out what’s possible with R and Text-to-Speech packages.

How does TTS work?

Challenges for good TTS systems are the complexity of the human language: we intone words differently, depending on where they are in a sentence, what we want to convey with that sentence, how our mood is, and so on. AI-based TTS systems can take phonemes and intonation into account.

There are different ways to artificially produce speech. A very important method is Unit Selection synthesis. With this method, text is first normalized and divided into smaller entities that represent sentences, syllables, words, phonemes, etc. The structure (e.g. the pronunciation) of these entities is then learned in context. We call this part Natural Language Processing (NLP). Usually, these learned segments are stored in a database (either as human voice recordings or synthetically generated) that can be searched to find suitable speech parts (Unit Selection). This search is often done with decision trees, neural nets or Hidden-Markov-Models.

If the speech has been generated by a computer, this is called formant synthesis. It offers more flexibility because the collection of words isn’t limited to what has been pre-recorded by a human. Even imaginary or new words can easily be produced and the voices can be readily exchanged. Until recently, this synthetic voice did not sound anything like a human recorded voice; you could definitely hear that it was “fake”. Most of the TTS systems today still suffer from this, but this is in the process of changing: there are already a few artificial TTS systems that do sound very human.

What TTS systems are there?

We already find TTS systems in many digital devices, like computers, smart phones, etc. Most of the “big players” offer TTS-as-a-service, but there are also many “smaller” and free programs for TTS. Many can be downloaded as software or used from a web browser or as an API. Here is an incomplete list:

Text-to-Speech in R

The only package for TTS I found was Rtts. It doesn’t seem very comprehensive but it does the job of converting text to speech. The only API that works right now is **ITRI (http://tts.itri.org.tw)**. And it only supports English and Chinese.

Let’s try it out!

library(Rtts) ## Lade nötiges Paket: RCurl ## Lade nötiges Paket: bitops

Here, I’ll be using a quote from DOUGLAS ADAMS’ THE HITCHHIKER’S GUIDE TO THE GALAXY:

content <- "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."

The main TTS function is tts_ITRI() and I’m going to loop over the different voice options.

speakers = c("Bruce", "Theresa", "Angela", "MCHEN_Bruce", "MCHEN_Joddess", "ENG_Bob", "ENG_Alice", "ENG_Tracy") lapply(speakers, function(x) tts_ITRI(content, speaker = x, destfile = paste0("audio_tts_", x, ".mp3")))

I uploaded the results to Soundcloud for you to hear: – audio-tts-bruceaudio-tts-theresaaudio-tts-angelaaudio-tts-mchen-bruceaudio-tts-mchen-joddessaudio-tts-eng-bobaudio-tts-eng-aliceaudio-tts-eng-tracy

As you can hear, it sounds quite wonky. There are many better alternatives out there, but most of them aren’t free and/or can’t be used (as easily) from R. Noam Ross tried IBM Watson’s TTS API in this post, which would be a very good solution. Or you could access the Google Cloud API from within R.

The most convenient solution for me was to use eSpeak from the command line. The output sounds relatively good, it is free and offers many languages and voices with lots of parameters to tweak. This is how you would produce audio from text with eSpeak:

  • English US
espeak -v english-us -s 150 -w '/Users/shiringlander/Documents/Github/audio_tts_espeak_en_us.wav' "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."
  • just for fun: English Scottish
espeak -v en-scottish -s 150 -w '/Users/shiringlander/Documents/Github/audio_tts_espeak_en-scottish.wav' "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."
  • even funnier: German
espeak -v german -s 150 -w '/Users/shiringlander/Documents/Github/audio_tts_espeak_german.wav' "A common mistake that people make when trying to design something completely foolproof is to underestimate the ingenuity of complete fools."

The playlist contains all audio files I generated in this post.

sessionInfo() ## R version 3.5.0 (2018-04-23) ## Platform: x86_64-apple-darwin15.6.0 (64-bit) ## Running under: macOS High Sierra 10.13.5 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib ## ## locale: ## [1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] Rtts_0.3.3 RCurl_1.95-4.10 bitops_1.0-6 ## ## loaded via a namespace (and not attached): ## [1] Rcpp_0.12.17 bookdown_0.7 digest_0.6.15 rprojroot_1.3-2 ## [5] backports_1.1.2 magrittr_1.5 evaluate_0.10.1 blogdown_0.6 ## [9] stringi_1.2.3 rmarkdown_1.10 tools_3.5.0 stringr_1.3.1 ## [13] xfun_0.2 yaml_2.1.19 compiler_3.5.0 htmltools_0.3.6 ## [17] knitr_1.20 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Shirin's playgRound. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Bayesian GANs [#2]

Wed, 06/27/2018 - 00:18

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

As an illustration of the lack of convergence of the Gibbs sampler applied to the two “conditionals” defined in the Bayesian GANs paper discussed yesterday, I took the simplest possible example of a Normal mean generative model (one parameter) with a logistic discriminator (one parameter) and implemented the scheme (during an ISBA 2018 session). With flat priors on both parameters. And a Normal random walk as Metropolis-Hastings proposal. As expected, since there is no stationary distribution associated with the Markov chain, simulated chains do not exhibit a stationary pattern,

And they eventually reach an overflow error or a trapping state as the log-likelihood gets approximately to zero (red curve).

Too bad I missed the talk by Shakir Mohammed yesterday, being stuck on the Edinburgh by-pass at rush hour!, as I would have loved to hear his views about this rather essential issue…

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

New R package flatxml: working with XML files as R dataframes

Tue, 06/26/2018 - 23:20

(This article was first published on Topics in R, and kindly contributed to R-bloggers)

The world is flat The new R package flatxml provides functions to easily deal with XML files. When parsing an XML document fxml_importXMLFlat produces a special dataframe that is ‘flat’ by its very nature but contains all necessary information about the hierarchical structure of the underlying XML document (for details on the dataframe see the reference for the fxml_importXMLFlat function). flatxml offers a set of functions to work with this dataframe. Apart from representing the XML document in a dataframe structure, there is yet another way in which flatxml relates to dataframes: the fxml_toDataFrame function can be used to extract data from an XML document into a dataframe, e.g. to work on the data with statistical functions. Because in this case there is no need to represent the XML document structure as such (it’s all about the data contained in the document), there is no representation of the hierarchical structure of the document any more, it’s just a normal dataframe. Each XML element, for example Here is some text has certain characteristics that can be accessed via the flatxml interface functions, after an XML document has been imported with \fxml_importXMLFlat. These characteristics are:

  • value: The (text) value of the element, “Here is some text” in the example above
  • attributes: The XML attributes of the element, attribute with its value “some value” in the example above
  • children: The elements on the next lower hierarchical level
  • parent: The element of the next higher hierarchical level, i.e. the element to which the current element is a child
  • siblings: The elements on the same hierarchical level as the current element

Structure of the flatxml interface The flatxml interface to access these characteristics follows a simple logic: For each of the characteristics there are typically three functions available:

  • fxml_has…(): Determines if the current XML element has (at least one instance of) the characteristic
  • fxml_num…(): Returns the number of the characteristics of the current XML (e.g. the number of children elements
  • fxml_get…(): Returns (the IDs of) the respective characteristics of the current XML element (e.g. the children of the current element)

Learn more For more information on the flatxml package please go to http://www.zuckarelli.de/flatxml/index.html.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Topics in R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

May 2018: “Top 40” New Packages

Tue, 06/26/2018 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

While looking over the 215 or so new packages that made it to CRAN in May, I was delighted to find several packages devoted to subjects a little bit out of the ordinary; for instance, bioacoustics analyzes audio recordings, freegroup looks at some abstract mathematics, RQEntangle computes quantum entanglement, stemmatology analyzes textual musical traditions, and treedater estimates clock rates for evolutionary models. I take this as evidence that R is expanding beyond its traditional strongholds of statistics and finance as people in other fields with serious analytic and computational requirements become familiar with the language. And, when I see a package from a philologist and scholar of “Ancient and Medieval Worlds”, I am persuaded to think that R is making a unique contribution to computational literacy.

Below are my “Top 40” package picks for May 2018, organized into the following categories: Computational Methods, Data, Data Science, Finance, Mathematics, Music, Science, Statistics, Time Series, Utilities and Visualization.

Computational Methods

dqrng v0.0.4: Provides fast random number generators with good statistical properties, including the 64-bit variant of the Mersenne-Twister, pcg64, and Xoroshiro128 and Xoroshiro256. There is an Introduction and a vignette on Parallel Usage.

optimParallel v.7-2: Provides a parallel versions of the gradient-based stats::optim() function. The vignette is informative.

Data

childesr v0.1.0: Implements an interface to CHILDES, an open repository for transcripts of parent-child interaction. There is a vignette.

PetfindeR v1.1.3: Is a wrapper of the Petfinder API that implements methods for interacting with and extracting data from the Petfinder database, one of the largest online, searchable databases of adoptable animals and animal welfare organizations across North America. See the Getting Started Guide: part1 and part2.

Data Science

catch v1.0: Provides functions to perform classification and variable selection on high-dimensional tensors (multi-dimensional arrays) after adjusting for additional covariates (scalar or vectors) as CATCH model in Pan, Mai and Zhang (2018).

SemiSupervised v1.0: Implements several safe graph-based semi-supervised learning algorithms. For technical details, refer to Culp and Ryan (2013), Ryan and Culp (2015) and the package vignette.

spFSR v1.0.0: Offers functions to perform feature selection and ranking via simultaneous perturbation stochastic approximation (SPSA-FSR) based on works by Aksakalli and Malekipirbazari (2015) and Yenice et al. (2018). See the Introduction.

Finance

PortfolioAnalytics v1.1.0: Provides functions for portfolio analysis, including numerical methods for portfolio optimization. There is an Introduction and vignettes on Optimization, Custom Moment and Objective Functions, and Portfolio Optimization with CVaR Budgets.

sparseIndexTracking v0.1.0: Provides functions to compute sparse portfolios for financial index tracking, i.e., joint selection of a subset of the assets that compose the index and computation of their relative weights (capital allocation) based on the paper Benidis et al. (2018). The vignette shows how to design a portfolio to track an index.

Mathematics

freegroup v1.0: Provides functions to elements of the free group, including inversion, multiplication by a scalar, group-theoretic power operation, and Tietze forms. See the vignette for details.

ODEsensitivity v1.1.1: Provides functions to perform sensitivity analysis for ordinary differential equation (ode) models using the ODEnetwork package, which simulates a network of second-order ODEs. See [Weber et al. (2018)(https://eldorado.tu-dortmund.de/handle/2003/36875)] for details, and the vignette to get started.

Music

stemmatoloty v0.3.1: Allows users to explore and analyze the genealogy of textual or musical traditions from their variants, with various stemmatological methods, mainly the disagreement-based algorithms suggested by Camps and Cafiero (2015). The vignette provides details.

Science

bioacoustics v0.1.2: Provides functions to analyze audio recordings and automatically extract animal vocalizations. Contains all the necessary tools to process audio recordings of various formats (e.g., WAV, WAC, MP3, ZC), filter noisy files, display audio signals, and detect and extract automatically acoustic features for further analysis such as classification. The vignette provides an example.

epiphy v0.3.4: Provides a toolbox for analyzing plant disease epidemics and a common framework for plant disease intensity data recorded over time and/or space. There is a vignette on Definitions and Relationships between Parameters and another with Examples.

treedater v0.2.0: Offers functions for estimating times of common ancestry and molecular clock rates of evolution using a variety of evolutionary models. See Volz and Frost (2017). The vignette provides an example using the Influenza H3N2 virus.

RQEntangle v0.1.0: Provides functions to compute the Schmidt decomposition of bipartite quantum systems, discrete or continuous, and their respective entanglement metrics. See Ekert and Knight (1995) and the vignettes Entanglement in Coupled Harmonics and Entanglement between Two Coupled Two-Level Systems for details.

Statistics

glmmboot v0.1.2: Provides two functions to perform bootstrap resampling for most models that update() works for. BootGlmm() performs block resampling if random effects are present, and case resampling if not; BootCI() converts output from bootstrap model runs into confidence intervals and p-values. See Humphrey and Swingley (2018) for the details and the vignette to get started.

glmmEP v1.0-1: Allows users to solve Generalized Linear Mixed Model Analysis via Expectation Propagation. In this version, the random effects can be any reasonable dimension. However, only probit mixed models with one level of nesting are supported. See the methodology in Hall et al. (2018), and the user manual in the vignette.

groupedSurv v1.0.1: Provides Rcpp-based functions to compute the efficient score statistics for grouped time-to-event data (Prentice and Gloeckler (1978)), with the optional inclusion of baseline covariates. The vignette gives an example.

modeldb v0.1.0: Provides functions to fit models inside databases with dplyr backends. There are vignettes showing how to implement linear regression and kmeans models.

MLZ v0.1.1: Provides estimation functions and diagnostic tools for mean length-based total mortality estimators based on Gedamke and Hoenig (2006). There is a vignette.

NFWdist v0.1.0: Provides density, distribution function, quantile function, and random generation for the 3D Navarro, Frenk & White (NFW) profile. For details see Robotham & Howlett (2018) and the vignette.

survBootOutliers v1.0: Offers three new concordance-based methods for outlier detection in a survival context. The methodology is described in two papers by Pinto J., Carvalho A. and Vinga S.: paper1 and paper2. The vignette provides an introduction.

vinereg v0.3.0: Implements D-vine quantile regression models with parametric or non-parametric pair-copulas. See Kraus and Czado (2017) and Schallhorn et al. (2017). There is a vignette showing how to use the package, and another covering an Analysis of bike rental data.

Time Series

ASSA v1.0: Provides functions to model and decompose time series into principal components using singular spectrum analysis. See de Carvalho and Rua (2017) and de Carvalho et al (2012).

DTWBI v1.0: Provides functions to impute large gaps within time series based on Dynamic Time Warping methods. See Phan et al. (2017). The website has examples.

permutes v0.1: Uses permutation testing (Maris & Oostenveld (2007)) to help determine optimal windows for analyzing densely-sampled time series. The vignette provides details.

Utilities

bench v1.0.1: Provides tools to benchmark and analyze execution times for R expressions.

conflicted v0.1.0: Provides an alternative to R’s default conflict-resolution strategy for R packages.

diffdf v1.0.0: Provides functions for comparing two data frames. The vignette describes how to use the package.

pkgdown v1.1.0: Provides functions to generate a website from a source package by converting your documentation, vignettes, README, and more to HTML. See the vignette.

rtika v0.1.8: Provides functions to extract text or metadata from over a thousand file types, using Apache Tika to produce either plain-text or structured XHTML content. See the vignette.

tabulizer v0.2.2: Provides bindings for the Tabula Java library, which can extract tables from PDF documents. The tabulizerjars packages provides versioned .jar files. There is an Introduction.

shinytest v1.3.0: Enables automated testing of Shiny applications, using a headless browser.

vcr v0.1.0: A port of the Ruby gem of the same name, vcr enables users to record test suite HTTP requests and replay them during future runs. There is an Introduction and vignettes on Configuration and Request Matching

Visualization

c3 v0.2.0: Implements a wrapper (htmlwidget) for the C3.js charting library that includes all types of C3.js plots, enabling interactive web-based charts to be embedded in R Markdown documents and Shiny applications. The vignette shows basic usage.

chromoMap v0.1: Provides interactive, configurable graphics visualizations of human chromosomes, allowing users to map chromosome elements (like genes, SNPs, etc.) on the chromosome plot, and introduces a special plot, the “chromosome heatmap”, which enables visualizing data associated with chromosome elements. See the vignette.

ExPanDaR v0.2.0: Provides a shiny-based front end and a set of functions for exploratory panel data analysis. Run as a web-based app, it enables users to assess the robustness of empirical evidence without providing them access to the underlying data. There is a vignette on using the package and another on panel data exploration.

ggdistribute v1.0.1: Extends ggplot2 for plotting posterior or other types of unimodal distributions that require overlaying information about a distribution’s intervals. The vignette provides examples.

r2d3 v0.2.2: Provides a suite of tools for using the D3 library to produce dynamic, interactive data visualizations. There are vignettes on Advanced Rendering with Callbacks, R to D3 Data Conversion, CSS & JavaScript Dependencies, Learning D3, Package Development, and D3 Visualization Options.

_____='https://rviews.rstudio.com/2018/06/26/may-2018-top-40-new-packages/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Shiny 1.1.0: Scaling Shiny with async

Tue, 06/26/2018 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

This is a significant release for Shiny, with a major new feature that was nearly a year in the making: support for asynchronous operations!

Without this capability, when Shiny performs long-running calculations or tasks on behalf of one user, it stalls progress for all other Shiny users that are connected to the same process. Therefore, Shiny apps that feature long-running calculations or tasks have generally been deployed using many R processes, each serving a small number of users; this works, but is not the most efficient approach. Such applications now have an important new tool in the toolbox to improve performance under load.

Shiny async is implemented via integration with the future and promises packages. These two packages are used together:

  1. Use future to perform long-running operations in a worker process that runs in the background, leaving Shiny processes free to serve other users in the meantime. This yields much better responsiveness under load, and much more predictable latency.
  2. Use promises to handle the result of each long-running background operation back in the Shiny process, where additional processing can occur, such as further data manipulation, or displaying to the user via a reactive output.

If your app has a small number of severe performance bottlenecks, you can use this technique to get massively better responsiveness under load. For example, if the httr::GET call in this server function takes 30 seconds to complete:

server <- function(input, output, session) { r <- reactive({ httr::GET(url) %>% httr::content("parsed") }) output$plot <- renderPlot({ r() %>% ggplot(aes(speed, dist)) + geom_point() }) }

then the entire R process is stalled for those 30 seconds.

We can rewrite it asynchronously like this:

library(promises) library(future) plan(multisession) server <- function(input, output, session) { r <- reactive({ future(httr::GET(url)) %...>% httr::content("parsed") %...>% }) output$plot <- renderPlot({ r() %...>% { ggplot(., aes(speed, dist)) + geom_point() } }) }

Even if the httr::GET(url) takes 30 seconds, the r reactive executes almost instantly, and returns control to the caller. The code inside future(...) is executed in a different R process that runs in the background, and whenever its result becomes available (i.e. in 30 seconds), the right-hand side of %...>% will be executed with that result. (%...>% is called a “promise pipe”; it works similarly to a magrittr pipe that knows how to wait for and “unwrap” promises.)

If the original (synchronous) code appeared in a Shiny app, then during that 30 seconds, the R process is stuck dealing with the download and can’t respond to any requests being made by other users. But with the async version, the R process only needs to kick off the operation, and then is free to service other requests. This means other users will only have to wait milliseconds, not minutes, for the app to respond.

Case study

We’ve created a detailed case study that walks through the async conversion of a realistic example app. This app processes low-level logging data from RStudio’s CRAN mirrors, to let us explore the heaviest downloaders for each day.

To load test this example app, we launched 50 sessions of simulated load, with a 5 second delay between each launch, and directed this traffic to a single R process. We then rewrote the app to use futures and promises, and reran the load test with this async version. (The tools we used to perform the load testing are not yet publicly available, but you can refer to Sean Lopp’s talk at rstudio::conf 2018 for a preview.)

Under these conditions, the finished async version displays significantly lower (mean) response times than the original. In the table below, “HTTP traffic” refers to requests that are made during page load time, and “reactive processing” refers to the time between the browser sending a reactive input value and the server returning updated reactive outputs.

table td:first-child, table th:first-child {text-align:left !important;} table td, table th {text-align:right !important;} Response type Original Async Delta HTTP traffic 605 ms 139 ms -77% Reactive processing 10.7 sec 3.48 sec -67% Learn more

Visit the promises website to learn more, or watch my recent webinar on Shiny async.

See the full changelog for Shiny v1.1.0.

Related packages

Over the last year, we created or enhanced several other packages to support async Shiny:

  • The promises package (released 2018-04-13) mentioned above provides the actual API you’ll use to do async programming in R. We implemented this as a separate package so that other parts of the R community, not just Shiny users, can take advantage of these techniques. The promises package was inspired by the basic ideas of JavaScript promises, but also have significantly improved syntax and extensibility to make them work well with R and Shiny. Currently, promises is most useful when used with the future package by Henrik Bengtsson.
  • later (released 2017-06-25) adds a low-level feature to R that is critical to async programming: the ability to schedule R code to be executed in the future, within the same R process. You can do all sorts of cool stuff on top of this, as some people are discovering.
  • httpuv (1.4.0 released 2018-04-19) has long been the HTTP web server that Shiny, and most other web frameworks for R, sit on top of. Version 1.4.0 adds support for asynchronous handling of HTTP requests, and also adds a dedicated I/O-handling thread for greatly improved performance under load.

In the coming weeks, you can also expect updates for async compatibility to htmlwidgets, plotly, and DT. Most other HTML widgets will automatically become async compatible once htmlwidgets is updated.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Models are about what changes, and what doesn’t

Tue, 06/26/2018 - 02:00

(This article was first published on R on mages' blog, and kindly contributed to R-bloggers)

Following on from last week’s post on Principled Bayesian Workflow I want to reflect on how to motivate a model.
The purpose of most models is to understand change, and yet, considering what doesn’t change and should be kept constant can be equally important.
I will go through a couple of models in this post to illustrate this idea. The purpose of the model I want to build today is to predict how much ice cream is sold for different temperatures \((t)\).

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on mages' blog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Re-referencing factor levels to estimate standard errors when there is interaction turns out to be a really simple solution

Tue, 06/26/2018 - 02:00

(This article was first published on ouR data generation, and kindly contributed to R-bloggers)

Maybe this should be filed under topics that are so obvious that it is not worth writing about. But, I hate to let a good simulation just sit on my computer. I was recently working on a paper investigating the relationship of emotion knowledge (EK) in very young kids with academic performance a year or two later. The idea is that kids who are more emotionally intelligent might be better prepared to learn. My collaborator suspected that the relationship between EK and academics would be different for immigrant and non-immigrant children, so we agreed that this would be a key focus of the analysis.

In model terms, we would describe the relationship for each student \(i\) as;

\[ T_i = \beta_0 + \beta_1 I_i + \beta_2 EK_i + \beta_3 I_i \times EK_i + \epsilon_i,\]
where \(T\) is the academic outcome, \(I\) is an indicator for immigrant status (either 0 or 1), and \(EK\) is a continuous measure of emotion knowledge. By including the \(I \times EK\) interaction term, we allow for the possibility that the effect of emotion knowledge will be different for immigrants. In particular, if we code \(I=0\) for non-immigrant kids and \(I=1\) for immigrant kids, \(\beta_2\) represents the relationship of EK and academic performance for non-immigrant kids, and \(\beta_2 + \beta_3\) is the relationship for immigrant kids. In this case, non-immigrant kids are the reference category.

Here’s the data generation:

library(simstudy) library(broom) set.seed(87265145) def <- defData(varname = "I", formula = .4, dist = "binary") def <- defData(def, varname = "EK", formula = "0 + 0.5*I", variance = 4) def <- defData(def, varname = "T", formula = "10 + 2*I + 0.5*EK + 1.5*I*EK", variance = 4 ) dT <- genData(250, def) genFactor(dT, "I", labels = c("not Imm", "Imm")) ## id I EK T fI ## 1: 1 1 -1.9655562 5.481254 Imm ## 2: 2 1 0.9230118 16.140710 Imm ## 3: 3 0 -2.5315312 9.443148 not Imm ## 4: 4 1 0.9103722 15.691873 Imm ## 5: 5 0 -0.2126550 9.524948 not Imm ## --- ## 246: 246 0 -1.2727195 7.546245 not Imm ## 247: 247 0 -1.2025184 6.658869 not Imm ## 248: 248 0 -1.7555451 11.027569 not Imm ## 249: 249 0 2.2967681 10.439577 not Imm ## 250: 250 1 -0.3056299 11.673933 Imm

Let’s say our primary interest in this exploration is point estimates of \(\beta_2\) and \(\beta_2 + \beta_3\), along with 95% confidence intervals of the estimates. (We could have just as easily reported \(\beta_3\), but we decided the point estimates would be more intuitive to understand.) The point estimates are quite straightforward: we can estimate them directly from the estimates of \(\beta_2\) and \(\beta_3\). And the standard error (and confidence interval) for \(\beta_2\) can be read directly off of the model output table. But what about the standard error for the relationship of EK and academic performance for the immigrant kids? How do we handle that?

I’ve always done this the cumbersome way, using this definition:

\[
\begin{aligned}
se_{\beta_2 + \beta_3} &= [Var(\beta_2 + \beta_3)]^\frac{1}{2} \\
&=[Var(\beta_2) + Var(\beta_3) + 2 \times Cov(\beta_2,\beta_3)]^\frac{1}{2}
\end{aligned}
\]

In R, this is relatively easy (though maybe not super convenient) to do manually, by extracting the information from the estimated parameter variance-covariance matrix.

First, fit a linear model with an interaction term:

lm1 <- lm(T ~ fI*EK, data = dT) tidy(lm1) ## term estimate std.error statistic p.value ## 1 (Intercept) 10.161842 0.16205385 62.706574 2.651774e-153 ## 2 fIImm 1.616929 0.26419189 6.120281 3.661090e-09 ## 3 EK 0.461628 0.09252734 4.989098 1.147653e-06 ## 4 fIImm:EK 1.603680 0.13960763 11.487049 9.808529e-25

The estimate for the relationship of EK and academic performance for non-immigrant kids is 0.46 (se = 0.093). And the point estimate for the relationship for immigrant kids is \(2.06=0.46 + 1.60\)

The standard error can be calculated from the variance-covariance matrix that is derived from the linear model:

vcov(lm1) ## (Intercept) fIImm EK fIImm:EK ## (Intercept) 0.026261449 -0.026261449 -0.000611899 0.000611899 ## fIImm -0.026261449 0.069797354 0.000611899 -0.006838297 ## EK -0.000611899 0.000611899 0.008561309 -0.008561309 ## fIImm:EK 0.000611899 -0.006838297 -0.008561309 0.019490291

\[Var(\beta_2+\beta_3) = 0.0086 + 0.0195 + 2\times(-.0086) = 0.0109\]

The standard error of the estimate is \(\sqrt{0.0109} = 0.105\).

So?

OK – so maybe that isn’t really all that interesting. Why am I even talking about this? Well, in the actual study, we have a fair amount of missing data. In some cases we don’t have an EK measure, and in others we don’t have an outcome measure. And since the missingness is on the order of 15% to 20%, we decided to use multiple imputation. We used the mice package in R to impute the data, and we pooled the model estimates from the completed data sets to get our final estimates. mice is a fantastic package, but one thing that is does not supply is some sort of pooled variance-covariance matrix. Looking for a relatively quick solution, I decided to use bootstrap methods to estimate the confidence intervals.

(“Relatively quick” is itself a relative term, since bootstrapping and imputing together is not exactly a quick process – maybe something to work on. I was also not fitting standard linear models but mixed effect models. Needless to say, it took a bit of computing time to get my estimates.)

Seeking credit (and maybe some sympathy) for all of my hard work, I mentioned this laborious process to my collaborator. She told me that you can easily estimate the group specific effects merely by changing the reference group and refitting the model. I could see right away that the point estimates would be fine, but surely the standard errors would not be estimated correctly? Of course, a few simulations ensued.

First, I just changed the reference group so that \(\beta_2\) would be measuring the relationship of EK and academic performance for immigrant kids, and \(\beta_2 + \beta_3\) would represent the relationship for the non-immigrant kids. Here are the levels before the change:

head(dT$fI) ## [1] Imm Imm not Imm Imm not Imm not Imm ## Levels: not Imm Imm

And after:

dT$fI <- relevel(dT$fI, ref="Imm") head(dT$fI) ## [1] Imm Imm not Imm Imm not Imm not Imm ## Levels: Imm not Imm

And the model:

lm2 <- lm(T ~ fI*EK, data = dT) tidy(lm2) ## term estimate std.error statistic p.value ## 1 (Intercept) 11.778770 0.2086526 56.451588 8.367177e-143 ## 2 fInot Imm -1.616929 0.2641919 -6.120281 3.661090e-09 ## 3 EK 2.065308 0.1045418 19.755813 1.112574e-52 ## 4 fInot Imm:EK -1.603680 0.1396076 -11.487049 9.808529e-25

The estimate for this new \(\beta_2\) is 2.07 (se=0.105), pretty much aligned with our estimate that required a little more work. While this is not a proof by any means, I did do variations on this simulation (adding other covariates, changing the strength of association, changing sample size, changing variation, etc.) and both approaches seem to be equivalent. I even created 10000 samples to see if the coverage rates of the 95% confidence intervals were correct. They were. My collaborator was right. And I felt a little embarrassed, because it seems like something I should have known.

But …

Would this still work with missing data? Surely, things would go awry in the pooling process. So, I did one last simulation, generating the same data, but then added missingness. I imputed the missing data, fit the models, and pooled the results (including pooled 95% confidence intervals). And then I looked at the coverage rates.

First I added some missingness into the data

defM <- defMiss(varname = "EK", formula = "0.05 + 0.10*I", logit.link = FALSE) defM <- defMiss(defM, varname = "T", formula = "0.05 + 0.05*I", logit.link = FALSE) defM ## varname formula logit.link baseline monotonic ## 1: EK 0.05 + 0.10*I FALSE FALSE FALSE ## 2: T 0.05 + 0.05*I FALSE FALSE FALSE

And then I generated 500 data sets, imputed the data, and fit the models. Each iteration, I stored the final model results for both models (in one where the reference is non-immigrant and the the other where the reference group is immigrant).

library(mice) nonRes <- list() immRes <- list() set.seed(3298348) for (i in 1:500) { dT <- genData(250, def) dT <- genFactor(dT, "I", labels = c("non Imm", "Imm"), prefix = "non") dT$immI <- relevel(dT$nonI, ref = "Imm") # generate a missing data matrix missMat <- genMiss(dtName = dT, missDefs = defM, idvars = "id") # create obseverd data set dtObs <- genObs(dT, missMat, idvars = "id") dtObs <- dtObs[, .(I, EK, nonI, immI, T)] # impute the missing data (create 20 data sets for each iteration) dtImp <- mice(data = dtObs, method = 'cart', m = 20, printFlag = FALSE) # non-immgrant is the reference group estImp <- with(dtImp, lm(T ~ nonI*EK)) lm1 <- summary(pool(estImp), conf.int = TRUE) dt1 <- as.data.table(lm1) dt1[, term := rownames(lm1)] setnames(dt1, c("2.5 %", "97.5 %"), c("conf.low", "conf.high")) dt1[, iter := i] nonRes[[i]] <- dt1 # immgrant is the reference group estImp <- with(dtImp, lm(T ~ immI*EK)) lm2 <- summary(pool(estImp), conf.int = TRUE) dt2 <- as.data.table(lm2) dt2[, term := rownames(lm2)] setnames(dt2, c("2.5 %", "97.5 %"), c("conf.low", "conf.high")) dt2[, iter := i] immRes[[i]] <- dt2 } nonRes <- rbindlist(nonRes) immRes <- rbindlist(immRes)

The proportion of confidence intervals that contain the true values is pretty close to 95% for both estimates:

mean(nonRes[term == "EK", conf.low < 0.5 & conf.high > 0.5]) ## [1] 0.958 mean(immRes[term == "EK", conf.low < 2.0 & conf.high > 2.0]) ## [1] 0.948

And the estimates of the mean and standard deviations are also pretty good:

nonRes[term == "EK", .(mean = round(mean(estimate),3), obs.SD = round(sd(estimate),3), avgEst.SD = round(sqrt(mean(std.error^2)),3))] ## mean obs.SD avgEst.SD ## 1: 0.503 0.086 0.088 immRes[term == "EK", .(mean = round(mean(estimate),3), obs.SD = round(sd(estimate),3), avgEst.SD = round(sqrt(mean(std.error^2)),3))] ## mean obs.SD avgEst.SD ## 1: 1.952 0.117 0.124

Because I like to include at least one visual in a post, here is a plot of the 95% confidence intervals, with the CIs not covering the true values colored blue:

The re-reference approach seems to work quite well (in the context of this simulation, at least). My guess is the hours of bootstrapping may have been unnecessary, though I haven’t fully tested all of this out in the context of clustered data. My guess is it will turn out OK in that case as well.

Appendix: ggplot code nonEK <- nonRes[term == "EK", .(iter, ref = "Non-immigrant", estimate, conf.low, conf.high, cover = (conf.low < 0.5 & conf.high > 0.5))] immEK <- immRes[term == "EK", .(iter, ref = "Immigrant", estimate, conf.low, conf.high, cover = (conf.low < 2 & conf.high > 2))] EK <- rbindlist(list(nonEK, immEK)) vline <- data.table(xint = c(.5, 2), ref = c("Non-immigrant", "Immigrant")) ggplot(data = EK, aes(x = conf.low, xend = conf.high, y = iter, yend = iter)) + geom_segment(aes(color = cover)) + geom_vline(data=vline, aes(xintercept=xint), lty = 3) + facet_grid(.~ ref, scales = "free") + theme(panel.grid = element_blank(), axis.ticks.y = element_blank(), axis.text.y = element_blank(), axis.title.y = element_blank(), legend.position = "none") + scale_color_manual(values = c("#5c81ba","grey75")) + scale_x_continuous(expand = c(.1, 0), name = "95% CI") var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Running Python inside the RStudio Server IDE

Mon, 06/25/2018 - 18:39

(This article was first published on Mango Solutions, and kindly contributed to R-bloggers)

A great many R users will have to run some python code from time to time, and this short video from our Head of Data Engineering, Mark Sellors outlines one approach you can take that doesn’t mean leaving the RStudio IDE.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Mango Solutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Demography simulations by @ellis2013nz

Mon, 06/25/2018 - 14:00

(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)

Total fertility rate

This is my second blog post while familiarising myself with the arithmetic of demography.

In my earlier post I looked at how life expectancy is calculated. “Period life expectancy at birth” as estimated by national statistical offices is actually best thought of as a summary of death rates by age group in a particular year; the calculation is effectively “what would be the life expectancy of someone born today, if all through their life the death rates in each age bracket are the same as they are today.” That is (for quite defensible reasons), no effort is made to forecast changes in the death rates; so in a time of decreasing death rates as we have been lucky to be living through for a while now, the actual life expectancy of various cohorts as observed once they’re all dead has been more than that reported in official statistics.

It turns out that a similar calculation is done with total fertility rate, which is the average number of children born to a woman over her lifetime. It has the same problem of life expectancy. For historical periods with sufficient data, we can look at any cohort of women and calculate the average number of children directly. But at any point in time, we don’t know what is going to happen to future birth rates, and hence the total fertility rate of women alive today is unknown.

The solution adopted by demographers is the same for total fertility rate as for life expectancy. To quote Wikipedia:

total period fertility rate (TPFR) of a population is the average number of children that would be born to a woman over her lifetime if: she were to experience the exact current age-specific fertility rates (ASFRs) through her lifetime; and she were to survive from birth through the end of her reproductive life.

In other words, just like period life expectancy is a summary of today’s death rates per age group, period fertility rate is a summary of today’s rates of giving birth per age group.

Converting total fertility rate into probability of giving birth this year

Why do I care? It’s because the ultimate purpose of both of these blog posts (previously hidden, dear reader) is to build a simple simulation model with a small number of user-specified starting parameters, and total fertility rate is a logical choice for one of those parameters. It’s easy to understand (anyone can intuit that 2.1 live children per woman is going to be roughly the replacement rate) and unlike crude birth rates per 100,000 which means different things in countries with different age profiles, it is self-standardising.

My simulation model needs a vector of probabilities for each woman in my population giving birth in a particular year, and to get those probabilities I need a way of converting the total fertility rate into a probability that any woman, given her age, will have a live birth this year. In effect, I need to reverse the calculation demographers do when they estimate the fertility rate in the first place, but whereas they are aggregating detailed data (birth rates per age group) into a single aggregate number I am doing the reverse.

This means I need to make some additional assumptions. Simplest would be to allocate the same probability to any woman of reproductive age, but this is obviously unrealistic. I need a sort of curve that gives higher chance of giving birth in the 20s and 30s but non-zero before and after that peak time; and adds up to an expected value of (say) 2.1 over the whole lifetime. Obviously I could estimate that curve with real data, but I don’t have any to hand, and I wanted something that would be easy to scale up and down to different fertility rates.

Now what’s a curve with definite bounds at the beginning and end, that’s guaranteed to have its area add up to some known constant? Any bounded probability distribution will do the job, and a beta distribution in particular is perfect. Just scale it from [0, 1] to whatever we deem the beginning and ending ages of fertility, choose some values of alpha and beta we like, and multiply the densities by the fertility rate. So for an average rate of 2.2 we can get something like this:

This is the approach I use in my model.

The simulation

The idea of the simulation is that you set your assumption-controlling levers at particular points, hit run, wait a minute, and then look at some standard results. Here’s a screenshot from the end product (click on it to go to the actual tool):

The assumptions you can control include:

  • average children per woman at the beginning of the simulation period
  • male and female infant mortality rates at the beginning of the simulation period
  • death rates other than in the first year, as a multiple of death rates in France in 2015
  • the ratio of male to female births (around 107 boys are born for every 100 girls, varying by country)
  • the size and magnitude of drift upwards and downards in fertility and death rates over the period.

Having run the simulation, you get two sets of output to look at. The screenshot above shows standard results on things like population size, average age, and crude birth and death rates. The screenshot below shows population distribution by age group. This variable is usually shown in demographic pyramids but I’ve opted for straightforward statistical density charts here, because I actually think they are better to read.

With the particular settings used for the model in the screenshots, I’d chosen fairly high infant mortality, death and fertility rates, similar to those in developing countries in the twentieth century. By having all those things trend downwards, we can see the resulting steady increase in life expectancy, and an increasing but aging population for several centuries until a crucial point is reached where crude birth rates are less than death rates and the population starts to decrease.

In fact, playing around with those assumptions for this sort of result was exactly why I’d built this model. I wanted to understand (for example) how changes life expectancy can seemingly get out of sync with crude death rates for a time, during a period when the age composition of a population is changing. If you’re interested in this sort of thing, have a play!

Shining up Shiny

As always, the source code is on GitHub. There are a few small bits of polish of possible interest, including:

  • non-default fonts in a ggplot2 image in a Shiny app (surprisingly tricky, with the best approach I know of being the use of renderImage as set out in this Stack Overflow question and answer, in combination with the showtext R package);
  • loader animations – essential during a simulation like this, where the user needs to wait for some seconds if not minutes – courtesy of the shinycssloaders package;
  • Custom fonts for the web page itself via CSS stored in the www subfolder.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Pages