
Leaf Plant Classification: Statistical Learning Model – Part 2

Mon, 12/31/2018 - 01:26

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Advanced Modeling

    Tags

    1. Linear Regression
    2. Principal Component Analysis
    3. R Programming

In this post, I am going to build a statistical learning model based on the plant leaf datasets introduced in part one of this tutorial. Three datasets are available, each providing sixteen samples for each of one hundred plant species.

    The features are:

    • shape
    • texture
    • margin

Specifically, I will take advantage of Discriminant Analysis for classification purposes and I will compare my results with those reported in Table 2 of ref. [1].

The authors of ref. [1] report the accuracy achieved by KNN (proportional and weighted proportional kernel density) for every combination of the datasets in use. Their top accuracy is 96%, achieved when all three datasets are used. In the present tutorial I am going to compare my results with theirs.

Packages

suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))

Classification Models

We load the environment saved in the previous post, where the datasets are available.

    load('PlantLeafEnvironment.RData')

We define a utility function that gathers all the steps needed for model building with the caret package (ref. [3]), specifically:

    • train partition as 70% training and 30% test samples
    • train control specifying repeated cross-validation with ten folds
    • features set, all features besides species and id
    • train dataset
    • test dataset
    • linear discriminant analysis fit, with preprocessing to compensate for spatial distribution asymmetries and to center and scale values
    • prediction computation
    • confusion matrix computed on the test dataset
    • result as made of the linear discriminant analysis fit and the test dataset confusion matrix

PCA (Principal Component Analysis) is also specified as a pre-processing step. It allows for slightly better accuracy than the scenario without PCA pre-processing (see also ref. [4] for further insights about combining PCA and LDA).
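
As a side note on this choice, caret exposes the variance threshold used by its PCA pre-processing through trainControl(); the snippet below is only an illustrative sketch (not part of the model built below) showing how one could retain, say, 99% of the variance instead of the default 95%:

# Hypothetical variant: preProcOptions is forwarded to preProcess(),
# so thresh controls how much variance the retained components must explain
trControl_pca99 <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                                preProcOptions = list(thresh = 0.99))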

Among all Discriminant Analysis models available within the caret package, lda appears to be the most time-efficient while still providing good results, as we will see at the end. For details about Linear Discriminant Analysis, please see ref. [5] and [6].
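
If you want to see which discriminant-analysis flavours your caret installation knows about, a quick, purely exploratory query of its model registry is possible (the exact list depends on the installed caret version):

# Model codes whose name mentions (linear) discriminant analysis;
# "lda" is the plain MASS-based implementation used in this post
grep("lda|discrim", names(getModelInfo()), ignore.case = TRUE, value = TRUE)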

set.seed(1023)

plant_leaf_model <- function(leaf_dataset) {
  train_idx <- createDataPartition(leaf_dataset$species, p = 0.7, list = FALSE)
  trControl <- trainControl(method = "repeatedcv", number = 10, verboseIter = FALSE, repeats = 5)
  relevant_features <- setdiff(colnames(leaf_dataset), c("id", "species"))
  feat_sum <- paste(relevant_features, collapse = "+")
  frm <- as.formula(paste("species ~ ", feat_sum))
  train_set <- leaf_dataset[train_idx,]
  test_set <- leaf_dataset[-train_idx,]
  lda_fit <- train(frm, data = train_set, method = "lda", tuneLength = 10,
                   preProcess = c("BoxCox", "center", "scale", "pca"),
                   trControl = trControl, metric = 'Accuracy')
  lda_test_pred <- predict(lda_fit, test_set)
  cfm <- confusionMatrix(lda_test_pred, test_set$species)
  list(model = lda_fit, confusionMatrix = cfm)
}

Margin dataset model

    We build our model for margin dataset only.

    result <- plant_leaf_model(margin_data)

    Model fit results.

    result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 64 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: centered (64), scaled (64), principal component ## signal extraction (64) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1082, 1076, 1078, 1081, 1086, 1079, ... ## Resampling results: ## ## Accuracy Kappa ## 0.7947299 0.7924885

Test set confusion matrix.

result$confusionMatrix$overall[1:4]
##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##     0.7950000     0.7929293     0.7520681     0.8335033

We take note of the accuracy for the final report.

    margin_data_accuracy <- result$confusionMatrix$overall[1]

    Further details can be printed out by:

result$model$finalModel
varImp(result$model)

    that we do not show for brevity.
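
Should per-species test-set metrics be of interest, they are also available from the caret confusion matrix object; a minimal illustrative query (column names as produced by confusionMatrix()):

# Per-class sensitivity, specificity and balanced accuracy on the test set
head(result$confusionMatrix$byClass[, c("Sensitivity", "Specificity", "Balanced Accuracy")])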

    Shape dataset model

    We build our model for shape dataset only.

    result <- plant_leaf_model(shape_data)

    Model fit results.

    result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 64 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: Box-Cox transformation (64), centered (64), scaled ## (64), principal component signal extraction (64) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1080, 1072, 1074, 1084, 1076, 1087, ... ## Resampling results: ## ## Accuracy Kappa ## 0.5167374 0.5116926

Test set confusion matrix.

result$confusionMatrix$overall[1:4]
##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##     0.5150000     0.5101010     0.4648181     0.5649584

We take note of the accuracy for the final report.

shape_data_accuracy <- result$confusionMatrix$overall[1]

Texture dataset model

    We build our model for texture dataset only.

    result <- plant_leaf_model(texture_data)

    Model fit results.

    result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 64 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: centered (64), scaled (64), principal component ## signal extraction (64) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1078, 1078, 1086, 1076, 1076, 1082, ... ## Resampling results: ## ## Accuracy Kappa ## 0.7580786 0.7554372

Test set confusion matrix.

result$confusionMatrix$overall[1:4]
##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##     0.7575000     0.7550505     0.7124314     0.7987117

We take note of the accuracy for the final report.

texture_data_accuracy <- result$confusionMatrix$overall[1]

Margin+Shape datasets model

    We build our model for margin and shape datasets.

margin_shape_data <- left_join(margin_data, shape_data[, -which(colnames(shape_data) %in% c("species"))], by = c("id"))
result <- plant_leaf_model(margin_shape_data)

    Model fit results.

    result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 128 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: Box-Cox transformation (64), centered (128), scaled ## (128), principal component signal extraction (128) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1082, 1084, 1074, 1087, 1084, 1081, ... ## Resampling results: ## ## Accuracy Kappa ## 0.9511508 0.9506051

Test set confusion matrix.

result$confusionMatrix$overall[1:4]
##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##     0.9300000     0.9292929     0.9004175     0.9529847

We take note of the accuracy for the final report.

margin_shape_data_accuracy <- result$confusionMatrix$overall[1]

Margin+Texture datasets model

    We build our model for margin and texture datasets.

margin_texture_data <- left_join(margin_data, texture_data[, -which(colnames(texture_data) %in% c("species"))], by = c("id"))
result <- plant_leaf_model(margin_texture_data)

    Model fit results.

    result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 128 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: centered (128), scaled (128), principal component ## signal extraction (128) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1079, 1078, 1084, 1074, 1085, 1081, ... ## Resampling results: ## ## Accuracy Kappa ## 0.9666299 0.9662565

Test set confusion matrix.

result$confusionMatrix$overall[1:4]
##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##     0.9475000     0.9469697     0.9208654     0.9672116

We take note of the accuracy for the final report.

margin_texture_data_accuracy <- result$confusionMatrix$overall[1]

Shape+Texture datasets model

    We build our model for shape and texture datasets.

shape_texture_data <- left_join(shape_data, texture_data[, -which(colnames(texture_data) %in% c("species"))], by = c("id"))
result <- plant_leaf_model(shape_texture_data)

    Model fit results.

    result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 128 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: Box-Cox transformation (64), centered (128), scaled ## (128), principal component signal extraction (128) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1083, 1080, 1080, 1082, 1082, 1081, ... ## Resampling results: ## ## Accuracy Kappa ## 0.9102814 0.9092814

Test set confusion matrix.

result$confusionMatrix$overall[1:4]
##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##     0.9225000     0.9217172     0.8917971     0.9467373

We take note of the accuracy for the final report.

shape_texture_data_accuracy <- result$confusionMatrix$overall[1]

Margin+Shape+Texture datasets model

    We build our model for all three datasets.

leaves_data <- left_join(margin_data, shape_texture_data[, -which(colnames(shape_texture_data) %in% c("species"))], by = c("id"))
result <- plant_leaf_model(leaves_data)

    Model fit results.

    result$model ## Linear Discriminant Analysis ## ## 1200 samples ## 192 predictor ## 100 classes: 'Acer Campestre', 'Acer Capillipes', 'Acer Circinatum', 'Acer Mono', 'Acer Opalus', 'Acer Palmatum', 'Acer Pictum', 'Acer Platanoids', 'Acer Rubrum', 'Acer Rufinerve', 'Acer Saccharinum', 'Alnus Cordata', 'Alnus Maximowiczii', 'Alnus Rubra', 'Alnus Sieboldiana', 'Alnus Viridis', 'Arundinaria Simonii', 'Betula Austrosinensis', 'Betula Pendula', 'Callicarpa Bodinieri', 'Castanea Sativa', 'Celtis Koraiensis', 'Cercis Siliquastrum', 'Cornus Chinensis', 'Cornus Controversa', 'Cornus Macrophylla', 'Cotinus Coggygria', 'Crataegus Monogyna', 'Cytisus Battandieri', 'Eucalyptus Glaucescens', 'Eucalyptus Neglecta', 'Eucalyptus Urnigera', 'Fagus Sylvatica', 'Ginkgo Biloba', 'Ilex Aquifolium', 'Ilex Cornuta', 'Liquidambar Styraciflua', 'Liriodendron Tulipifera', 'Lithocarpus Cleistocarpus', 'Lithocarpus Edulis', 'Magnolia Heptapeta', 'Magnolia Salicifolia', 'Morus Nigra', 'Olea Europaea', 'Phildelphus', 'Populus Adenopoda', 'Populus Grandidentata', 'Populus Nigra', 'Prunus Avium', 'Prunus X Shmittii', 'Pterocarya Stenoptera', 'Quercus Afares', 'Quercus Agrifolia', 'Quercus Alnifolia', 'Quercus Brantii', 'Quercus Canariensis', 'Quercus Castaneifolia', 'Quercus Cerris', 'Quercus Chrysolepis', 'Quercus Coccifera', 'Quercus Coccinea', 'Quercus Crassifolia', 'Quercus Crassipes', 'Quercus Dolicholepis', 'Quercus Ellipsoidalis', 'Quercus Greggii', 'Quercus Hartwissiana', 'Quercus Ilex', 'Quercus Imbricaria', 'Quercus Infectoria sub', 'Quercus Kewensis', 'Quercus Nigra', 'Quercus Palustris', 'Quercus Phellos', 'Quercus Phillyraeoides', 'Quercus Pontica', 'Quercus Pubescens', 'Quercus Pyrenaica', 'Quercus Rhysophylla', 'Quercus Rubra', 'Quercus Semecarpifolia', 'Quercus Shumardii', 'Quercus Suber', 'Quercus Texana', 'Quercus Trojana', 'Quercus Variabilis', 'Quercus Vulcanica', 'Quercus x Hispanica', 'Quercus x Turneri', 'Rhododendron x Russellianum', 'Salix Fragilis', 'Salix Intergra', 'Sorbus Aria', 'Tilia Oliveri', 'Tilia Platyphyllos', 'Tilia Tomentosa', 'Ulmus Bergmanniana', 'Viburnum Tinus', 'Viburnum x Rhytidophylloides', 'Zelkova Serrata' ## ## Pre-processing: Box-Cox transformation (64), centered (192), scaled ## (192), principal component signal extraction (192) ## Resampling: Cross-Validated (10 fold, repeated 5 times) ## Summary of sample sizes: 1078, 1074, 1084, 1076, 1083, 1082, ... ## Resampling results: ## ## Accuracy Kappa ## 0.9832037 0.9830154

Test set confusion matrix.

result$confusionMatrix$overall[1:4]
##      Accuracy         Kappa AccuracyLower AccuracyUpper 
##     0.9900000     0.9898990     0.9745952     0.9972688

We take note of the accuracy for the final report.

leaf_data_accuracy <- result$confusionMatrix$overall[1]

Final Results

Finally, we gather all results in a data frame and show them for comparison. The V symbol indicates that a dataset was selected for model building.

LDA_accuracy = c(shape_data_accuracy, texture_data_accuracy, margin_data_accuracy,
                 shape_texture_data_accuracy, margin_shape_data_accuracy,
                 margin_texture_data_accuracy, leaf_data_accuracy)
LDA_accuracy <- 100*LDA_accuracy # as percentage

results <- data.frame(SHA = c("V", " ", " ", "V", "V", " ", "V"),
                      TEX = c(" ", "V", " ", "V", " ", "V", "V"),
                      MAR = c(" ", " ", "V", " ", "V", "V", "V"),
                      prop_KNN_accuracy = c(62.13, 72.94, 75.00, 86.19, 87.19, 93.38, 96.81),
                      wprop_KNN__accuracy = c(61.88, 72.75, 75.75, 86.06, 86.75, 93.31, 96.69),
                      LDA_accuracy = LDA_accuracy)

    Here are ref.[1] results compared to ours.

results
##   SHA TEX MAR prop_KNN_accuracy wprop_KNN__accuracy LDA_accuracy
## 1   V                     62.13               61.88        51.50
## 2       V                 72.94               72.75        75.75
## 3           V             75.00               75.75        79.50
## 4   V   V                 86.19               86.06        92.25
## 5   V       V             87.19               86.75        93.00
## 6       V   V             93.38               93.31        94.75
## 7   V   V   V             96.81               96.69        99.00

As we can see, with the only exception of the shape-only scenario, Linear Discriminant Analysis achieves higher accuracy than the K-Nearest-Neighbor based models of ref. [1] for every combination of the datasets in use.
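
As a visual complement (not part of the original analysis), the same comparison can be plotted with ggplot2, which is already loaded; a minimal sketch built on the results data frame above:

# Label each dataset combination and reshape to long format,
# one row per (combination, classifier) pair
results$combo <- paste0(ifelse(results$SHA == "V", "SHA+", ""),
                        ifelse(results$TEX == "V", "TEX+", ""),
                        ifelse(results$MAR == "V", "MAR+", ""))
results$combo <- sub("\\+$", "", results$combo)

plot_df <- data.frame(combo = rep(results$combo, 3),
                      classifier = rep(c("prop KNN", "wprop KNN", "LDA"), each = nrow(results)),
                      accuracy = c(results$prop_KNN_accuracy,
                                   results$wprop_KNN__accuracy,
                                   results$LDA_accuracy))

ggplot(plot_df, aes(x = combo, y = accuracy, fill = classifier)) +
  geom_col(position = "dodge") +
  labs(x = "Datasets used", y = "Accuracy (%)") +
  theme_minimal()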

    References
    1. Charles Mallah, James Cope and James Orwell, “PLANT LEAF CLASSIFICATION USING PROBABILISTIC INTEGRATION OF SHAPE, TEXTURE AND MARGIN FEATURES” link
    2. Leaf Dataset
    3. Caret package – train models by tag
    4. Combining PCA and LDA
    5. Linear Discriminant Analysis
    6. Linear Discriminant Analysis – Bit by Bit

    Related Post

    1. NYC buses: C5.0 classification with R; more than 20 minute delay?
    2. NYC buses: Cubist regression with more predictors
    3. NYC buses: simple Cubist regression
    4. Visualization of NYC bus delays with R
    5. Explaining Black-Box Machine Learning Models – Code Part 2: Text classification with LIME


    This dance, it’s like a weapon: Radiohead’s and Beck’s danceability, valence, popularity, and more from the LastFM and Spotify APIs

    Sun, 12/30/2018 - 19:18

    (This article was first published on Rcrastinate, and kindly contributed to R-bloggers)

Giddy up, giddy it up
Wanna move into a fool’s gold room
With my pulse on the animal jewels
Of the rules that you choose to use to get loose
With the luminous moves
Bored of these limits, let me get, let me get it like
Wow!

When it comes to surreal lyrics and videos, I’m always thinking of Beck. Above, I cited the beginning of the song “Wow” from his latest album “Colors”, which has received rather mixed reviews. In this post, I want to show you what I have done with Spotify’s API. We will also see how “Colors” compares to all the other albums by Beck on a range of the API’s measures. This is what we’re gonna do:

    • Set up the R packages to access the Spotify and Last.FM APIs
    • Get chart information on a specific user (in this case: me) from Last.FM
    • Get information on these top artists from Spotify
    • Visualize some of the information with ggplot2
    • Get information on all the songs by Beck and Radiohead
    • Again: Visualize this information

First, let’s set things up. If you want to download the RLastFM package (it’s not on CRAN), you can follow these instructions. If you need an account for the Spotify API, you can get one on this page. Get an account for the Last.FM API with this form.

    library(spotifyr)
    library(RLastFM)
    library(dplyr)

    # Spotify
s.id <- "< API ID >"
s.sc <- "< API Secret >"

    # LastFM
l.ky <- "< API Key >"
l.ap <- "http://ws.audioscrobbler.com/2.0/"

    Sys.setenv(SPOTIFY_CLIENT_ID = s.id)
    Sys.setenv(SPOTIFY_CLIENT_SECRET = s.sc)

    access_token <- get_spotify_access_token()

    I slightly adapted the user.getTopArtists() function in the RLastFM package because I wanted to use the “limit” parameter of the Last.FM API.

user.getTopArt2 <- function (username, period = NA, key = l.ky, parse = TRUE, limit = 20) {
  params = list(method = "user.gettopartists", user = username, 
                period = period, api_key = key, limit = limit)
  params = params[!as.logical(lapply(params, is.na))]
  ret = getForm(l.ap, .params = params)
  doc = xmlParse(ret, asText = TRUE)
  if (parse) 
    doc = RLastFM:::p.user.gettopartists(doc)
  return(doc) }

Also, I adapted the function get_artist_audio_features() from the spotifyr package to bypass the dialog that asks me to type the name of the artist in interactive sessions.

get.art.audio.feats <- function (artist_name, access_token = get_spotify_access_token()) {
  artists <- get_artists(artist_name)
  selected_artist <- artists$artist_name[1]
  artist_uri <- artists$artist_uri[artists$artist_name == 
                                     selected_artist]
  albums <- get_albums(artist_uri)
  if (nrow(albums) > 0) {
    albums <- select(albums, -c(base_album_name, base_album, 
                                num_albums, num_base_albums, album_rank))
  }
  else {
    stop(paste0("Cannot find any albums for \"", selected_artist, 
                "\" on Spotify"))
  }
  album_popularity <- get_album_popularity(albums)
  tracks <- get_album_tracks(albums)
  track_features <- get_track_audio_features(tracks)
  track_popularity <- get_track_popularity(tracks)
  tots <- albums %>% left_join(album_popularity, by = "album_uri") %>% 
    left_join(tracks, by = "album_name") %>% left_join(track_features, 
                                                       by = "track_uri") %>% left_join(track_popularity, by = "track_uri")
  return(tots) }

Now, we are getting the relevant data. First, we are accessing the current top 20 artists (of all time) of a specific Last.FM user.

top.arts <- as.data.frame(user.getTopArt2("< user name >", period = "overall", limit = 20))
    head(top.arts)
                         artist playcount                                 mbid rank
    1                  Menomena      1601 ad386705-fb8c-40ec-94d7-e690e079e979    1
    2                      Beck      1594 a8baaa41-50f1-4f63-979e-717c14979dfb    2
    3                 Radiohead      1556 a74b1b7f-71a5-4011-9441-d0b5e4122711    3
    4                 PJ Harvey      1546 e795e03d-b5d5-4a5f-834d-162cfb308a2c    4
    5                    Prince      1501 cdc0fff7-54cf-4052-a283-319b648670fd    5
    6 Nick Cave & The Bad Seeds      1243 172e1f1a-504d-4488-b053-6344ba63e6d0    6

    Now, we are accessing the “artist audio features” for each one of the top 20 artists. I’m including a Sys.sleep(5) here, because I don’t want to access Spotify’s API too frequently. With 5 seconds between each artist, we should be on the safe side – I guess 2 seconds would be enough.

    Please note that get.art.audio.feats() returns a tibble that contains one row for each of the artist’s tracks. That is why I have to aggregate the data by artist. Hence, the “audio features” of an artist are the mean values of all the tracks by this artist.

    art.info.list <- list()
    for (i in 1:nrow(top.arts)) {
  cat(i, "\n")
  new.art.info <- get.art.audio.feats(top.arts[i, "artist"])
  new.art.info$artist <- top.arts[i, "artist"]
      art.info.list[[length(art.info.list)+1]] <- new.art.info
      Sys.sleep(5)
    }
    art.info <- do.call(“rbind”, art.info.list)

    agg.art.info <- aggregate(cbind(danceability, energy, speechiness,
                                    acousticness, instrumentalness, liveness,
                                    valence, tempo, track_popularity) ~ artist,
                              data = art.info, FUN = mean)

    Now, let’s merge Last.FM’s playcount information with the information we got from Spotify’s API. We are also sorting the dataframe by my playcount on Last.FM.

agg.art.info <- merge(agg.art.info, top.arts[,c("playcount", "rank", "artist")], by = "artist", all.x = T, all.y = F)
agg.art.info <- agg.art.info[order(agg.art.info$playcount, decreasing = T),]
saveRDS(agg.art.info, file = "agg.art.info.Rds")


    What we get is this:

                          artist danceability    energy speechiness acousticness instrumentalness  liveness
    7                   Menomena    0.5397246 0.5475217  0.05414783    0.2678826       0.28483706 0.1451246
    3                       Beck    0.6043865 0.6106874  0.06503006    0.2653964       0.20943021 0.1872540
    13                 Radiohead    0.4095833 0.5755077  0.05504551    0.3131903       0.39465664 0.2026224
    11                 PJ Harvey    0.5155636 0.4578315  0.09909212    0.4328224       0.13406858 0.1739836
    12                    Prince    0.7078350 0.7295825  0.04883786    0.2464166       0.04265596 0.1405495
    8  Nick Cave & The Bad Seeds    0.4147489 0.5152961  0.05414545    0.3465942       0.07602337 0.2952866
         valence    tempo track_popularity playcount rank
    7  0.3084768 131.4496          5.84058      1601    1
    3  0.4795632 115.6165         21.64417      1594    2
    13 0.2858109 118.7648         47.62821      1556    3
    11 0.4331848 118.7181         21.32727      1546    4
    12 0.7672718 126.3280         25.81553      1501    5
    8  0.3521593 119.5002         27.47619      1243    6

    Just a few quick notes concerning the audio features given by the Spotify API (I’m citing Spotify’s webpage here):

    • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
    • energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
    • speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
    • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
    • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
    • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
    • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
    • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
    • track_popularity: The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are.
      Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past. Duplicate tracks (e.g. the same track from a single and an album) are rated independently. 
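
To get a feel for what these numbers look like in practice, here is a small, purely illustrative check on the aggregated data frame built above (assuming agg.art.info is still in the workspace), listing which of my top 20 artists scores highest on each feature:

# For each audio feature, the artist with the highest mean value among my top 20
audio_feats <- c("danceability", "energy", "speechiness", "acousticness",
                 "instrumentalness", "liveness", "valence")
sapply(audio_feats, function(f) as.character(agg.art.info$artist)[which.max(agg.art.info[[f]])])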

    I am only using a few of these features for visualizations here, namely danceability, valence, and track_popularity. Let’s start with a scatter plot. ggplot2 and the really great package “ggrepel” already do a good job here.

    library(ggplot2)
    library(ggrepel)
    library(tidyr)
    library(scales)
    library(RColorBrewer)


data <- readRDS("agg.art.info.Rds")

ggplot(data, aes(x = danceability, y = playcount, size = track_popularity, col = valence)) +
  geom_point() +
  geom_label_repel(aes(label = artist)) +
  labs(x = "Danceability", y = "Playcount", size = "Mean popularity",
       col = "Valence", title = "Popularity, Valence, and Danceability",
       subtitle = "TOP 20 Last.FM artists – with track information from Spotify") +
  scale_color_continuous(low = "darkred", high = "green") +
  scale_size(range = c(2, 6)) +
  theme_minimal()

It’s quite a lot of information in this plot. Prince – along with the not very popular Nikka Costa – is definitely the most danceable and most positive (in terms of emotional valence) of my top artists. There is no clear relationship between danceability and my playcount. Also, “Menomena”, my top artist, is really not popular on Spotify – I guess you can call that a hipster factor.

    Now, I want to have a semantic profile plot for my top 5 artists. It’s a little bit of coding work before we get this.

for (var in c("tempo", "track_popularity", "playcount")) {
  data[, paste0("norm.", var)] <- rescale(data[, var], to = c(0, 1))
}

data.long <- gather(data, key = variable, value = value, c("danceability", "energy",
                    "speechiness", "acousticness", "instrumentalness", "liveness", "valence"))

substr(data.long$variable, 1, 1) <- toupper(substr(data.long$variable, 1, 1))
data.long$artist <- factor(data.long$artist, levels = data$artist)

ggplot(data.long[data.long$rank %in% 1:5,], aes(y = value, x = variable, group = artist, col = artist)) +
  geom_line(size = 2, lineend = "round", alpha = .8) +
  scale_color_manual(values = brewer.pal(5, "Spectral")) +
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "", col = "Artist") +
  coord_flip() +
  theme_dark() + theme(axis.text.y = element_text(size = 15),
                       panel.grid.major.y = element_line(size = 1))

It’s still not optimal, but we get a good idea of the different profiles of the artists. We can read this plot horizontally to get a ranking of artists for one feature. But we can also read it vertically to get an impression of the overall distribution of features.

Now, let’s move on to a more detailed analysis of Beck and Radiohead. First, I am analyzing all the albums by Beck with their year of release on the x-axis, their popularity on the y-axis, danceability color-coded and valence coded by size.

beck <- as.data.frame(get.art.audio.feats("Beck"))
beck.alb <- aggregate(cbind(danceability, energy, speechiness,
                            acousticness, instrumentalness, liveness,
                            valence, tempo, track_popularity) ~ album_name,
                      data = beck, FUN = mean)
beck.alb <- merge(beck.alb, unique(beck[, c("album_name", "album_release_year", "album_popularity")]), by = "album_name", all.x = T, all.y = F)
beck.alb$album_release_year <- substr(beck.alb$album_release_year, 1, 4)
beck.alb$n.tracks <- sapply(beck.alb$album_name, USE.NAMES = F, FUN = function (x) table(beck$album_name)[x])
beck.alb[beck.alb$album_name == "One Foot In The Grave (Expanded Edition)", "album_name"] <- "One Foot In The Grave"
beck.alb <- beck.alb[order(beck.alb$album_release_year),]

ggplot(beck.alb, aes(x = album_release_year, y = album_popularity,
                     label = album_name, size = valence,
                     color = danceability)) +
  geom_point() +
  geom_label_repel() +
  scale_color_continuous(low = "darkred", high = "green") +
  scale_size(range = c(3, 10)) +
  labs(x = "Release year", size = "Valence", y = "Album popularity", color = "Danceability",
       title = "Albums by Beck", subtitle = "Values taken from Spotify API") +
  theme_minimal()

Another thing I wanted to show you is some kind of “album profile” or – more precisely – an answer to the question whether there are tracks on albums that are very salient in terms of track popularity. We can easily do this with a bit of transformation and facet wrapping. I am coding track popularity with both the height of the bar and the color of the bar to get an impression of the popularities of the tracks compared to the overall dataset (because the axes are allowed to vary freely for the facets).

beck %>% group_by(album_name) %>%
  mutate(track.no = 1:length(track_name)) %>%
  ungroup() %>% as.data.frame() -> beck

ggplot(beck, aes(x = factor(track.no), y = track_popularity, fill = track_popularity)) +
  geom_bar(stat = "identity") +
  labs(x = "Track", y = "Track popularity", fill = "Track popularity",
       title = "Popularity of tracks on albums by Beck",
       subtitle = "Values taken from Spotify API") +
  facet_wrap(~ album_name, scales = c("free"))

See the rather high first bar for “Mellow Gold”? That’s “Loser”, still the most popular of Beck’s tracks, with the rest of the tracks on “Mellow Gold” rather in “Guero” territory in terms of popularity. Also, “Colors” – Beck’s latest album – is rather balanced in terms of popularity.

And the same for Radiohead – only that I am excluding two albums (see the code below) and am using the tempo feature for the coloring.

radiohead <- as.data.frame(get.art.audio.feats("Radiohead"))
radiohead <- radiohead[!(radiohead$album_name %in% c("OK Computer OKNOTOK 1997 2017", "TKOL RMX 1234567")),]
radiohead.alb <- aggregate(cbind(danceability, energy, speechiness,
                                 acousticness, instrumentalness, liveness,
                                 valence, tempo, track_popularity) ~ album_name,
                           data = radiohead, FUN = mean)
radiohead.alb <- merge(radiohead.alb, unique(radiohead[, c("album_name", "album_release_year", "album_popularity")]), by = "album_name", all.x = T, all.y = F)
radiohead.alb$album_release_year <- substr(radiohead.alb$album_release_year, 1, 4)
radiohead.alb$n.tracks <- sapply(radiohead.alb$album_name, USE.NAMES = F, FUN = function (x) table(radiohead$album_name)[x])
radiohead.alb <- radiohead.alb[order(radiohead.alb$album_release_year),]

ggplot(radiohead.alb, aes(x = album_release_year, y = album_popularity,
                          label = album_name, size = valence,
                          color = tempo)) +
  geom_point() +
  geom_label_repel() +
  scale_color_continuous(low = "darkred", high = "green") +
  scale_size(range = c(3, 10)) +
  labs(x = "Release year", size = "Valence", y = "Album popularity", color = "Tempo",
       title = "Albums by Radiohead", subtitle = "Values taken from Spotify API") +
  theme_minimal()

It’s a shame how Disk 2 of “In Rainbows” compares to Disk 1, because it contains really great songs. Also, it’s interesting that “OK Computer” still does not reach the popularity of “Pablo Honey” – as the second plot suggests, this is because of the popularity of “Creep”. For the second plot, I am not double-coding track popularity with color but am using the valence feature for the color.
radiohead %>% group_by(album_name) %>%
  mutate(track.no = 1:length(track_name)) %>%
  ungroup() %>% as.data.frame() -> radiohead

ggplot(radiohead, aes(x = factor(track.no), y = track_popularity, fill = valence)) +
  geom_bar(stat = "identity") +
  labs(x = "Track", y = "Track popularity", fill = "Valence",
       title = "Popularity and valence of tracks on albums by Radiohead",
       subtitle = "Values taken from Spotify API") +
  facet_wrap(~ album_name, scales = c("free"))

Here we see that this valence feature certainly has some problems. “Fitter Happier” is one of the most positive songs by Radiohead with a value of 0.74??? I beg to differ…





    R or Python? Why not both? Using Anaconda Python within R with {reticulate}

    Sun, 12/30/2018 - 01:00

    (This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


    This short blog post illustrates how easy it is to use R and Python in the same R Notebook thanks to the
    {reticulate} package. For this to work, you might need to upgrade RStudio to the current preview version.
    Let’s start by importing {reticulate}:

    library(reticulate)

{reticulate} is an RStudio package that provides "a comprehensive set of tools for
interoperability between Python and R". With it, it is possible to call Python and use
Python libraries within an R session, or define Python chunks in R Markdown. I think that
using R Notebooks is the best way to work with Python and R; when you want to use Python,
you simply use a Python chunk:

```{python}
your python code here
```

    There’s even autocompletion for Python object methods:

    Fantastic!

However, if you wish to use Python interactively within your R session, you must start a
Python REPL with the repl_python() function. You can then do whatever you want, even access
objects from your R session, and when you exit the REPL, any object you created in Python
remains accessible in R. I think that using Python this way is a bit more involved and would
advise using R Notebooks if you need to use both languages.
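
A minimal sketch of that interactive workflow (the commented lines are what you would type at the Python >>> prompt; the object name is just for illustration):

library(reticulate)

repl_python()
# >>> squares = [x ** 2 for x in range(5)]   # created inside the Python REPL
# >>> exit                                   # leave the REPL

# Back in R: anything created in the REPL is reachable through the py object
unlist(py$squares)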

    I installed the Anaconda Python distribution to have Python on my system. To use it with {reticulate}
    I must first use the use_python() function that allows me to set which version of Python I want
    to use:

# This is an R chunk
use_python("~/miniconda3/bin/python")
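
To double-check which interpreter {reticulate} ended up binding to, an optional sanity check is:

# This is an R chunk
# Reports the Python binary, its version and the key modules reticulate discovered
py_config()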

    I can now load a dataset, still using R:

# This is an R chunk
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

    and now, to access the mtcars data frame, I simply use the r object:

# This is a Python chunk
print(r.mtcars.describe())
##              mpg        cyl        disp  ...         am       gear     carb
## count  32.000000  32.000000   32.000000  ...  32.000000  32.000000  32.0000
## mean   20.090625   6.187500  230.721875  ...   0.406250   3.687500   2.8125
## std     6.026948   1.785922  123.938694  ...   0.498991   0.737804   1.6152
## min    10.400000   4.000000   71.100000  ...   0.000000   3.000000   1.0000
## 25%    15.425000   4.000000  120.825000  ...   0.000000   3.000000   2.0000
## 50%    19.200000   6.000000  196.300000  ...   0.000000   4.000000   2.0000
## 75%    22.800000   8.000000  326.000000  ...   1.000000   4.000000   4.0000
## max    33.900000   8.000000  472.000000  ...   1.000000   5.000000   8.0000
## 
## [8 rows x 11 columns]

    .describe() is a Python Pandas DataFrame method to get summary statistics of our data. This means that
mtcars was automatically converted from an R data frame to a Pandas DataFrame! Let’s check its type:

# This is a Python chunk
print(type(r.mtcars))
## <class 'pandas.core.frame.DataFrame'>

    Let’s save the summary statistics in a variable:

# This is a Python chunk
summary_mtcars = r.mtcars.describe()

    Let’s access this from R, by using the py object:

# This is an R chunk
class(py$summary_mtcars)
## [1] "data.frame"

    Let’s try something more complex. Let’s first fit a linear model in Python, and see how R sees it:

# This is a Python chunk
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.ols('mpg ~ hp', data = r.mtcars).fit()
print(model.summary())
##                             OLS Regression Results                            
## ==============================================================================
## Dep. Variable:                    mpg   R-squared:                       0.602
## Model:                            OLS   Adj. R-squared:                  0.589
## Method:                 Least Squares   F-statistic:                     45.46
## Date:                Sun, 30 Dec 2018   Prob (F-statistic):           1.79e-07
## Time:                        00:45:07   Log-Likelihood:                -87.619
## No. Observations:                  32   AIC:                             179.2
## Df Residuals:                      30   BIC:                             182.2
## Df Model:                           1
## Covariance Type:            nonrobust
## ==============================================================================
##                  coef    std err          t      P>|t|      [0.025      0.975]
## ------------------------------------------------------------------------------
## Intercept     30.0989      1.634     18.421      0.000      26.762      33.436
## hp            -0.0682      0.010     -6.742      0.000      -0.089      -0.048
## ==============================================================================
## Omnibus:                        3.692   Durbin-Watson:                   1.134
## Prob(Omnibus):                  0.158   Jarque-Bera (JB):                2.984
## Skew:                           0.747   Prob(JB):                        0.225
## Kurtosis:                       2.935   Cond. No.                         386.
## ==============================================================================
## 
## Warnings:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

    Just for fun, I ran the linear regression with the Scikit-learn library too:

# This is a Python chunk
import numpy as np
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
x = r.mtcars[["hp"]]
y = r.mtcars[["mpg"]]
model_scikit = regressor.fit(x, y)
print(model_scikit.intercept_)
## [30.09886054]
print(model_scikit.coef_)
## [[-0.06822828]]

    Let’s access the model variable in R and see what type of object it is in R:

# This is an R chunk
model_r <- py$model
class(model_r)
## [1] "statsmodels.regression.linear_model.RegressionResultsWrapper"
## [2] "statsmodels.base.wrapper.ResultsWrapper"
## [3] "python.builtin.object"

    So because this is a custom Python object, it does not get converted into the equivalent R object.
    This is described here. However, you can still
    use Python methods from within an R chunk!

# This is an R chunk
model_r$aic
## [1] 179.2386
model_r$params
##   Intercept          hp 
## 30.09886054 -0.06822828

I must say that I am very impressed with the {reticulate} package. I think that even if you are
primarily a Python user, this is still very interesting to know in case you need a specific function
from an R package. Just write all your script inside Python Markdown chunks and then use the R
function you need from an R chunk! Of course, there is also a way to use R from Python, a Python
library called rpy2, but I am not very familiar with it. From what I read, it also seems to be
quite simple to use.

    Hope you enjoyed! If you found this blog post useful, you might want to follow
    me on twitter for blog post updates and
    buy me an espresso or paypal.me.




    Leaf Plant Classification: An Exploratory Analysis – Part 1

    Sat, 12/29/2018 - 18:06

    (This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

      Categories

      1. Getting Data

      Tags

      1. Data Management
      2. Data Visualisation
      3. Exploratory Analysis
      4. R Programming

In this post, I am going to run an exploratory analysis of the plant leaf dataset made available by the UCI Machine Learning Repository at this link. The dataset comprises sixteen samples for each of one hundred plant species. Its analysis was introduced in ref. [1]. That paper describes a method designed to work with a small training set size and possibly incomplete extraction of features.

      This motivated separate processing of three feature types:

      • shape
      • texture
      • margin

Those are then combined to provide an overall indication of the species (and associated probability). For an accurate description of those features, please see ref. [1], where the classification is implemented by a K-Nearest-Neighbor density estimator. The authors of ref. [1] report the accuracy reached by K-Nearest-Neighbor classification for every combination of the datasets in use (see Table 2 of ref. [1]).

      Packages

suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(corrplot))
suppressPackageStartupMessages(library(Hmisc))

Exploratory Analysis

      Getting Data

We can download the leaf dataset as a zip file by taking advantage of the following UCI Machine Learning URL.

      url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00241/100%20leaves%20plant%20species.zip" temp_file <- tempfile() download.file(url, temp_file)

      The files of interest are:

      margin_file <- "100 leaves plant species/data_Mar_64.txt" shape_file <- "100 leaves plant species/data_Sha_64.txt" texture_file <- "100 leaves plant species/data_Tex_64.txt"

They can be extracted as follows.

      files_to_unzip <- c(margin_file, shape_file, texture_file) unzip(temp_file, files = files_to_unzip, exdir=".", overwrite = TRUE)

      We read them as CSV files. No header is originally provided.

      margin_data <- read.csv(margin_file, header=FALSE, sep=",", stringsAsFactors = TRUE) shape_data <- read.csv(shape_file, header=FALSE, sep=",", stringsAsFactors = TRUE) texture_data <- read.csv(texture_file, header=FALSE, sep=",", stringsAsFactors = TRUE)

      We check the number of rows and columns of the resulting datasets.

      dim(margin_data) ## [1] 1600 65 dim(shape_data) ## [1] 1600 65 dim(texture_data) ## [1] 1599 65

We notice that the texture dataset has one row fewer. This issue will be fixed later on.

      sum(complete.cases(margin_data)) == nrow(margin_data) ## [1] TRUE sum(complete.cases(shape_data)) == nrow(shape_data) ## [1] TRUE sum(complete.cases(texture_data)) == nrow(texture_data) ## [1] TRUE

No NA values are present. Column naming is necessary because no header is provided.

      n_features <- ncol(margin_data) - 1 colnames(margin_data) <- c("species", paste("margin", as.character(1:n_features), sep="")) margin_data$species <- factor(margin_data$species) n_features <- ncol(shape_data) - 1 colnames(shape_data) <- c("species", paste("shape", as.character(1:n_features), sep="")) shape_data$species <- factor(shape_data$species) n_features <- ncol(texture_data) - 1 colnames(texture_data) <- c("species", paste("texture", as.character(1:n_features), sep="")) texture_data$species <- factor(texture_data$species)

      We count the number of entries for each species within each dataset.

      margin_count <- sapply(base::split(margin_data, margin_data$species), nrow) shape_count <- sapply(base::split(shape_data, shape_data$species), nrow) texture_count <- sapply(base::split(texture_data, texture_data$species), nrow)

We do this in order to identify which species is associated with the missing entry in the texture dataset.

      which(margin_count != texture_count) ## Acer Campestre ## 1 which(shape_count != texture_count) ## Acer Campestre ## 1

The missing texture entry is associated with the Acer Campestre species. We add an identifier column to all datasets to allow for merging (joining) them.

      margin_data <- mutate(margin_data, id = 1:nrow(margin_data)) shape_data <- mutate(shape_data, id = 1:nrow(shape_data)) texture_data <- mutate(texture_data, id = 1:nrow(texture_data)) Imputation

In the following, we fix the missing entry with a median-based imputation. We assume the missing entry corresponds to the 16th sample of the Acer Campestre texture data, the first plant species in our datasets. For this purpose, we build a temporary dataset made of the first 15 entries and then add a new row filled with median-imputed values. Afterwards, we “row-bind” this temporary dataset with the rest of the original texture samples.

      dd <- data.frame(matrix(nrow=1, ncol = 66)) colnames(dd) <- colnames(texture_data) dd$species <- "Acer Campestre" dd$id <- 16 temp_texture_data <- rbind(texture_data[1:15,], dd) features <- setdiff(colnames(temp_texture_data), c("species", "id")) imputed <- sapply(features, function(x) { as.numeric(impute(temp_texture_data[, x], median)[16])}) temp_texture_data[16, names(imputed)] <- imputed texture_data <- rbind(temp_texture_data, texture_data[-(1:15),]) texture_data <- mutate(texture_data, id = 1:nrow(texture_data)) dim(texture_data) ## [1] 1600 66

      Here is what we got at the end.

      str(margin_data) ## 'data.frame': 1600 obs. of 66 variables: ## $ species : Factor w/ 100 levels "Acer Campestre",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ margin1 : num 0.00391 0.00586 0.01172 0.01367 0.00781 ... ## $ margin2 : num 0.00391 0.01367 0.00195 0.01172 0.00977 ... ## $ margin3 : num 0.0273 0.0273 0.0273 0.0371 0.0273 ... ## $ margin4 : num 0.0332 0.0254 0.0449 0.0176 0.0254 ... ## $ margin5 : num 0.00781 0.01367 0.01758 0.01172 0.00195 ... ## $ margin6 : num 0.01758 0.0293 0.04297 0.08789 0.00586 ... ## $ margin7 : num 0.0234 0.0195 0.0234 0.0234 0.0156 ... ## $ margin8 : num 0.00586 0 0 0 0 ... ## $ margin9 : num 0 0.00195 0.00391 0 0.00586 ... ## $ margin10: num 0.0156 0.0215 0.0195 0.0273 0.0176 ... .... .... str(shape_data) ## 'data.frame': 1600 obs. of 66 variables: ## $ species: Factor w/ 100 levels "Acer Campestre",..: 2 2 2 2 2 2 2 2 2 2 ... ## $ shape1 : num 0.000579 0.00063 0.000616 0.000613 0.000599 ... ## $ shape2 : num 0.000609 0.000661 0.000615 0.000569 0.000552 ... ## $ shape3 : num 0.000551 0.000719 0.000606 0.000564 0.000558 ... ## $ shape4 : num 0.000554 0.000651 0.000568 0.000607 0.000569 ... ## $ shape5 : num 0.000603 0.000643 0.000558 0.000643 0.000616 ... ## $ shape6 : num 0.000614 0.00064 0.000552 0.000647 0.000639 ... ## $ shape7 : num 0.000611 0.000646 0.000551 0.000663 0.000631 ... ## $ shape8 : num 0.000611 0.000624 0.000552 0.000658 0.000634 ... ## $ shape9 : num 0.000611 0.000584 0.000531 0.000635 0.000639 ... ## $ shape10: num 0.000594 0.000546 0.00053 0.0006 0.000596 ... ... ... str(texture_data) ## 'data.frame': 1600 obs. of 66 variables: ## $ species : Factor w/ 100 levels "Acer Campestre",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ texture1 : num 0.02539 0.00488 0.01855 0.03516 0.03809 ... ## $ texture2 : num 0.0127 0.0186 0.0137 0.0234 0.0146 ... ## $ texture3 : num 0.003906 0.00293 0.00293 0.000977 0.003906 ... ## $ texture4 : num 0.004883 0 0.00293 0 0.000977 ... ## $ texture5 : num 0.0391 0.0693 0.0518 0.0615 0.0469 ... ## $ texture6 : num 0 0 0 0 0 0 0 0 0 0 ... ## $ texture7 : num 0.0176 0.0137 0.0195 0.0215 0.0225 ... ## $ texture8 : num 0.0352 0.0439 0.0352 0.0615 0.0537 ... ## $ texture9 : num 0.0234 0.0264 0.0225 0.0107 0.0195 ... ## $ texture10: num 0.013672 0 0.000977 0.001953 0.004883 ... ... ... Correlation Analysis

Since the margin, shape and texture covariates are quantitative variables, it is of interest to evaluate the correlation among such leaf features.

We do this by means of a correlation plot, shown here for the margin1 feature.

      m_l <- split(margin_data, margin_data$species) extract_feature <- function(m_l, feature) { f <- lapply(m_l, function(x) { x[,feature] }) do.call(cbind, f) } thefeature <- "margin1" m <- extract_feature(m_l, thefeature) cor_mat <- cor(m) corrplot(cor_mat, method = "circle", type="upper", tl.cex=0.3)

The correlation plot is not so easy to interpret. Therefore, we implement a procedure to filter out the significant and most relevant correlations. For this purpose, we use a helper function named flattenCorrMatrix, which can be found in ref. [3].

      flattenCorrMatrix <- function(cormat, pmat) { ut <- upper.tri(cormat) data.frame( row = rownames(cormat)[row(cormat)[ut]], column = rownames(cormat)[col(cormat)[ut]], cor =(cormat)[ut], p = pmat[ut] ) }

The following utility function extracts a given feature from one of the available datasets and reports the flattened correlation matrix, keeping only the significant correlations whose absolute value is above the threshold parameter.

      most_correlated <- function(dataset, feature, threshold) { m_l <- split(dataset, dataset$species) m <- extract_feature(m_l, feature) rcorr_m <- rcorr(m) flat_cor <- flattenCorrMatrix(rcorr_m$r, rcorr_m$P) attr(flat_cor, "variable") <- feature flat_cor <- flat_cor %>% filter(p < 0.05) %>% filter(abs(cor) > threshold) flat_cor[,-4] # getting rid of the p-value column }

Here is the correlation matrix we get for the margin_data dataset and the margin2 feature with a threshold equal to 0.7.

      corr_margin2 <- most_correlated(margin_data, "margin2", 0.7) corr_margin2 ## row column cor ## 1 Acer Capillipes Cercis Siliquastrum 0.7038700 ## 2 Cercis Siliquastrum Cornus Controversa -0.7294585 ## 3 Acer Mono Liriodendron Tulipifera -0.7032354 ## 4 Ilex Aquifolium Morus Nigra 0.7647532 ## 5 Acer Opalus Populus Grandidentata -0.7751047 ## 6 Cercis Siliquastrum Prunus Avium 0.7295616 ## 7 Cornus Chinensis Pterocarya Stenoptera 0.7345398 ## 8 Alnus Cordata Quercus Brantii 0.7257876 ## 9 Acer Campestre Quercus Phillyraeoides -0.7150759 ## 10 Acer Circinatum Quercus Rubra 0.8082904 ## 11 Cornus Controversa Quercus Semecarpifolia -0.7089530 ## 12 Castanea Sativa Quercus Variabilis 0.7299116 ## 13 Acer Rufinerve Quercus Vulcanica 0.7180856

If we want to collect all the correlation matrices for margin_data, here is how we can do it.

      margin_names <- setdiff(colnames(margin_data), c("species", "id")) margin_corr_l <- lapply(margin_names, function(x) {most_correlated(margin_data, x, 0.7)}) names(margin_corr_l) <- margin_names

Let us have a look at one of the correlation matrices in this list.

      margin_corr_l[["margin32"]] ## row column cor ## 1 Eucalyptus Urnigera Morus Nigra 0.7623514 ## 2 Eucalyptus Urnigera Olea Europaea 1.0000000 ## 3 Morus Nigra Olea Europaea 0.7623514 ## 4 Cercis Siliquastrum Quercus Agrifolia 0.7474093 ## 5 Cornus Macrophylla Quercus Brantii 0.8071712 ## 6 Populus Adenopoda Quercus Castaneifolia 0.7092994 ## 7 Quercus Coccifera Quercus Crassipes 0.7422717 ## 8 Arundinaria Simonii Quercus Ellipsoidalis 0.7625542 ## 9 Cornus Controversa Quercus Greggii 1.0000000 ## 10 Acer Mono Quercus Phellos 0.9165285 ## 11 Quercus Crassipes Quercus Semecarpifolia 0.7682705 ## 12 Cytisus Battandieri Quercus Shumardii -0.7324311 ## 13 Quercus Ilex Quercus Suber 0.8382549 ## 14 Quercus Dolicholepis Quercus Texana 0.7090909 ## 15 Quercus Pubescens Salix Intergra 0.7603719 ## 16 Ilex Aquifolium Tilia Platyphyllos 0.7939940 ## 17 Cornus Macrophylla Viburnum Tinus 0.8451543 ## 18 Betula Pendula Viburnum x Rhytidophylloides -0.7224267 ## 19 Populus Grandidentata Zelkova Serrata 0.7453458

Here we do the same for the shape and texture datasets. For the shape dataset we show the shape3 feature correlation matrix.

      shape_names <- setdiff(colnames(shape_data), c("species", "id")) shape_corr_l <- lapply(shape_names, function(x) {most_correlated(shape_data, x, 0.7)}) names(shape_corr_l) <- shape_names shape_corr_l[["shape3"]] ## row column cor ## 1 Acer Campestre Arundinaria Simonii 0.7643580 ## 2 Alnus Viridis Callicarpa Bodinieri -0.7467261 ## 3 Acer Platanoids Morus Nigra -0.7674200 ## 4 Magnolia Salicifolia Prunus Avium 0.7306309 ## 5 Prunus Avium Pterocarya Stenoptera 0.7042563 ## 6 Morus Nigra Quercus Cerris -0.7316674 ## 7 Quercus Alnifolia Quercus Chrysolepis 0.7019743 ## 8 Eucalyptus Glaucescens Quercus Crassifolia 0.7414744 ## 9 Quercus Cerris Quercus Greggii 0.7100278 ## 10 Quercus Cerris Quercus Ilex -0.8346480 ## 11 Quercus Castaneifolia Quercus Phillyraeoides 0.7305218 ## 12 Cornus Macrophylla Quercus Rubra 0.8231097 ## 13 Morus Nigra Quercus Texana 0.7232813 ## 14 Populus Nigra Quercus Trojana -0.7157611 ## 15 Quercus Alnifolia Quercus Vulcanica -0.7350224 ## 16 Quercus Chrysolepis Quercus Vulcanica -0.8157281 ## 17 Quercus Castaneifolia Ulmus Bergmanniana 0.7747396

For the texture dataset we show the texture19 feature correlation matrix.

      texture_names <- setdiff(colnames(texture_data), c("species", "id")) texture_corr_l <- lapply(texture_names, function(x) {most_correlated(texture_data, x, 0.7)}) names(texture_corr_l) <- texture_names texture_corr_l[["texture19"]] ## row column cor ## 1 Acer Palmatum Callicarpa Bodinieri -0.8069241 ## 2 Cornus Macrophylla Ilex Aquifolium 0.7243966 ## 3 Eucalyptus Glaucescens Liriodendron Tulipifera 0.7609341 ## 4 Eucalyptus Urnigera Lithocarpus Cleistocarpus 0.8353815 ## 5 Acer Mono Olea Europaea 0.8007114 ## 6 Alnus Rubra Olea Europaea 0.7121676 ## 7 Liquidambar Styraciflua Quercus Chrysolepis 0.7252803 ## 8 Magnolia Heptapeta Quercus Coccinea 0.7512588 ## 9 Magnolia Heptapeta Quercus Crassifolia 0.8270324 ## 10 Quercus Coccinea Quercus Crassifolia 0.7479191 ## 11 Lithocarpus Cleistocarpus Quercus Crassipes 0.7031255 ## 12 Cotinus Coggygria Quercus Palustris 0.7104135 ## 13 Alnus Sieboldiana Quercus Pyrenaica 0.7401955 ## 14 Ilex Cornuta Quercus Trojana 0.7432011 ## 15 Morus Nigra Quercus Trojana 0.7238492 ## 16 Quercus Greggii Quercus x Hispanica 0.7396316 ## 17 Quercus Pyrenaica Sorbus Aria 0.7345942 ## 18 Eucalyptus Neglecta Tilia Oliveri 0.7565882 ## 19 Prunus Avium Viburnum x Rhytidophylloides -0.7451209 ## 20 Alnus Viridis Zelkova Serrata -0.7052669

Furthermore, by collecting the number of rows of these correlation matrices, the most correlated features among the one hundred leaf plant species can be highlighted.

      t <- sapply(margin_corr_l, nrow) margin_c <- data.frame(feature = names(t), value = t) t <- sapply(shape_corr_l, nrow) shape_c <- data.frame(feature = names(t), value = t) t <- sapply(texture_corr_l, nrow) texture_c <- data.frame(feature = names(t), value = t) ggplot(data = margin_c, aes(x=feature, y=value, fill = feature)) + theme_bw() + theme(legend.position = "none") + geom_histogram(stat='identity') + coord_flip()

ggplot(data = shape_c, aes(x=feature, y=value, fill = feature)) + theme_bw() + theme(legend.position = "none") + geom_histogram(stat='identity') + coord_flip()

ggplot(data = texture_c, aes(x=feature, y=value, fill = feature)) + theme_bw() + theme(legend.position = "none") + geom_histogram(stat='identity') + coord_flip()

Boxplots are shown to highlight differences in features among species. For this purpose, we define the following utility function.

      species_boxplot <- function(dataset, variable) { p <- ggplot(data = dataset, aes(x = species, y = eval(parse(text=variable)), fill= species)) + theme_bw() + theme(legend.position = "none") + geom_boxplot() + ylab(parse(text=variable)) p <- p + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle(paste(variable, "among species", sep = " ")) p } Margin feature boxplot

For each margin feature, a boxplot such as the one shown below can be generated. Here is the boxplot associated with the margin1 feature.

      species_boxplot(margin_data, "margin1")

If you are interested in a summary report, you may use the following line of code.

      with(margin_data, tapply(margin1, species, summary)) Shape feature boxplot

We show the boxplot for shape features using shape20 as an example.

      species_boxplot(shape_data, "shape20")

      Texture feature boxplot

We show the boxplot for texture features using texture31 as an example.

      species_boxplot(texture_data, "texture31")

We save the current environment for further analysis.

      save.image(file='PlantLeafEnvironment.RData') References
1. Charles Mallah, James Cope and James Orwell, "Plant Leaf Classification Using Probabilistic Integration of Shape, Texture and Margin Features", link
      2. 100 Plant Leaf Dataset
      3. Correlation Matrix: a quick start guide


      Part 5: Code corrections to optimism corrected bootstrapping series

      Sat, 12/29/2018 - 10:26

      (This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers)

The truth is out there R readers, but often it is not what we have been led to believe. The previous post examined the strong positive results bias in optimism corrected bootstrapping (a method of assessing a machine learning model's predictive power) with increasing p (completely random features). Two implementations of the method were given: the first has a slight error, the second seems fine. The problem with my code was that I did not set 'replace=TRUE' in the call to the 'sample' function; the trend is still the same with the corrected code.

      Thanks to ECOQUANT for pointing out the error.

      Let’s just recap what bootstrapping is and what optimism corrected bootstrapping is before we redo the experiments:

      This is from Jason’s excellent blog (https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/), bootstrapping is:

      1. Choose a number of bootstrap samples to perform
      2. Choose a sample size
      3. For each bootstrap sample (b=1 … B)
        1. Draw a sample with replacement with the chosen size
        2. Calculate the statistic on the sample
      4. Calculate the mean of the calculated sample statistics.

The with replacement part means we have to put each individual sample back when drawing our sample in the bth bootstrap iteration. Thus, we usually have duplicate samples in our resampled data when doing bootstrapping, as the small sketch below illustrates.
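As a quick illustration (my own toy example, not from the original post), a single bootstrap resample in R typically contains repeats:

set.seed(42)
x <- 1:10
# one bootstrap resample: same size as the data, drawn with replacement
boot_sample <- sample(x, size = length(x), replace = TRUE)
sort(boot_sample)   # duplicates are expected; some original values are left out
table(boot_sample)  # counts show how often each value was drawn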

      This is the optimism corrected bootstrapping algorithm:

      1. Fit a model M to entire data S and estimate predictive ability C.
      2. Iterate from b=1…B:
        1. Take a resample from the original data, S*
        2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
        3. Use the bootstrap model M* to get predictive ability on S, C_orig
      3. Optimism O is calculated as mean(C_boot – C_orig)
      4. Calculate optimism corrected performance as C-O.

Since we use the same data in step 3 of the bootstrap to train and test the model (an information leak), we would expect increasing bias (C_orig should be too high, thus O too small) when more and more random features are added. See the previous post for more explanation of this. Another point: the optimism corrected bootstrap is usually done with a resample size of N, rather than just a fraction of N. I found the following quote to support this:

      “The small data set was repeatedly re-sampled to produce b replicated data sets, each the same size as the original. We used b = 200. The predictive model was fitted to each of
      the b replicated data sets in turn. Each fitted model was then applied both to the resampled data set from which it was generated and to the original data set.”

      Smith, Gordon CS, et al. “Correcting for optimistic prediction in small data sets.” American journal of epidemiology 180.3 (2014): 318-324.

I have tried reducing the re-sampling size, which reduces the bias somewhat, but it is still there. This makes sense due to the information leak in this method, which results in an underestimation of the optimism (O).

You're welcome to experiment with this code yourselves.

      This code can be directly copied and pasted into R to repeat the experiments.

      Experiment 1: my implementation – glmnet (lasso logistic regression)

      library(glmnet) library(pROC) library(caret) library(ggplot2) library(kimisc) ### TEST 1: bootstrap optimism with glmnet cc <- c() for (zz in seq(2,100,1)){ print(zz) ## set up test data test <- matrix(rnorm(100*zz, mean = 0, sd = 1), nrow = 100, ncol = zz, byrow = TRUE) labelsa <- as.factor(c(rep('A',50),rep('B',50))) colnames(test) <- paste('Y',seq(1,zz),sep='') row.names(test) <- paste('Z',seq(1,100),sep='') ### my own implementation of optimism corrected bootstrapping ## 1. fit model to entire test data (i.e. overfit it) orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial") preds <- predict(orig,newx=test,type='response',s=0.01) auc <- roc(labelsa,as.vector(preds)) original_auc <- as.numeric(auc$auc) ## 2. take resample of data and try to estimate optimism test2 <- cbind(test,labelsa) B <- 50 results <- matrix(ncol=2,nrow=B) for (b in seq(1,B)){ ## get the bootstrapped resampled data boot <- test2[sample(row.names(test2),100,replace=TRUE),] labelsb <- boot[,ncol(boot)] boot <- boot[,-ncol(boot)] ## use the bootstrapped model to predict its own labels bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial") preds <- predict(bootf,newx=boot,type='response',s=0.01) auc <- roc(labelsb,as.vector(preds)) boot_auc <- as.numeric(auc$auc) ## use bootstrap model to predict original labels preds <- predict(bootf,newx=test,type='response',s=0.01) auc <- roc(labelsa,as.vector(preds)) boot_original_auc <- as.numeric(auc$auc) ## add the data to results results[b,1] <- boot_auc results[b,2] <- boot_original_auc } ## calculate the optimism O <- mean(results[,1]-results[,2]) ## calculate optimism corrected measure of prediction (AUC) corrected <- original_auc-O ## cc <- c(cc,corrected) print(cc) } ## print it df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc) p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) + geom_line() + ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap') print(p1) png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm', res = 900, type = 'cairo') print(p1) dev.off()

      Here are the results with 100 samples and 50 bootstrap iterations from 2 to 100 random features from a Gaussian distribution. We are re-sampling using the original sample size (N=100).

Random features are being added iteratively on the X axis, and on the Y axis we have AUC. The AUC should be 0.5 to reflect that the data has no real predictive power, but it is highly inflated.

      Experiment 2: another implementation – glm (logistic regression)

      ## TEST2 auc.adjust <- function(data, fit, B){ fit.model <- fit data$pred.prob <- fitted(fit.model) # get overfitted AUC auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC' auc.boot <- vector (mode = "numeric", length = B) auc.orig <- vector (mode = "numeric", length = B) o <- vector (mode = "numeric", length = B) for(i in 1:B){ boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc' fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial") boot.sample$pred.prob <- fitted(fit.boot) # get bootstrapped AUC auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc # get original data boot AUC data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response") auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc # calculated optimism corrected version o[i] <- auc.boot[i] - auc.orig[i] } auc.adj <- auc.app - (sum(o)/B) return(auc.adj) } cc <- c() for (zz in seq(2,100,1)){ print(zz) ## set up test data test <- matrix(rnorm(100*zz, mean = 0, sd = 1), nrow = 100, ncol = zz, byrow = TRUE) labelsa <- as.factor(c(rep('A',50),rep('B',50))) colnames(test) <- paste('Y',seq(1,zz),sep='') row.names(test) <- paste('Z',seq(1,100),sep='') test2 <- data.frame(cbind(labelsa,test)) test2$labelsa <- as.factor(test2$labelsa) ## 1. make overfitted model model <- glm(labelsa ~., data = test2, family = "binomial") ## 2. estimate optimism and correct d <- auc.adjust(test2, model, B=200) cc <- c(cc,d) } ## print it df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc) p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) + geom_line() + ggtitle('glm - random data only gives positive result with optimism corrected bootstrap') print(p2) png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm', res = 900, type = 'cairo') print(p2) dev.off()



      Part 4: Why does bias occur in optimism corrected bootstrapping?

      Fri, 12/28/2018 - 09:39

      (This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers)

In the previous parts of the series we demonstrated a positive results bias in optimism corrected bootstrapping by simply adding random features to our labels. This problem is due to an 'information leak' in the algorithm, meaning the training and test datasets are not kept separate when estimating the optimism. Because of this, the optimism can, under some conditions, be badly underestimated. Let's analyse the code; it is pretty straightforward to understand, and then we can see where the problem originates.

      1. Fit a model M to entire data S and estimate predictive ability C. ## this part is our overfitted estimation of performance (can be AUC, accuracy, etc)
      2. Iterate from b=1…B: ## now we are doing resampling of the data to estimate the error
        1. Take a resample from the original data, S*
        2. Fit the bootstrap model M* to S* and get predictive ability, C_boot ## this will clearly give us another overfitted model performance, which is fine
        3. Use the bootstrap model M* to get predictive ability on S, C_orig ## we are using the same data (samples) used to train the model to test it, therefore it is not surprising that we have values with better performance than expected. C_orig values will be too high.
      3. Optimism O is calculated as mean(C_boot – C_orig)
      4. Calculate optimism corrected performance as C-O.

One way of correcting for this would be to change step 3 of the bootstrap: instead of testing on the original data, test on the left-out (unseen) data of the bootstrap. This way the training and test data are kept entirely separate in terms of samples, thus eliminating our bias towards inflated model performance on null datasets with high numbers of features; a minimal sketch of this variant is shown below. There is probably no point in doing this anyway when we have methods such as LOOCV and K fold cross validation.
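Here is a minimal sketch of that out-of-bag variant on null data (my illustration, not part of the original experiments; it assumes glmnet and pROC are installed and uses a fixed lambda of 0.01 purely for convenience):

library(glmnet)
library(pROC)
set.seed(123)
# purely random (null) data: 100 samples, 50 noise features
test <- matrix(rnorm(100 * 50), nrow = 100, ncol = 50)
labelsa <- factor(c(rep("A", 50), rep("B", 50)))
B <- 50
oob_auc <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample(seq_len(nrow(test)), nrow(test), replace = TRUE)  # bootstrap rows
  oob <- setdiff(seq_len(nrow(test)), idx)                        # rows never drawn this iteration
  fit <- glmnet(test[idx, ], y = labelsa[idx], alpha = 1, family = "binomial")
  preds <- predict(fit, newx = test[oob, ], type = "response", s = 0.01)
  oob_auc[b] <- as.numeric(roc(labelsa[oob], as.vector(preds))$auc)
}
mean(oob_auc)  # hovers around 0.5 on null data, as it should

Because the held-out rows never enter the bootstrap fit, there is no information leak in this variant.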

As p (features) >> N (samples) we get better and better ability to obtain good model performance using the bootstrapped data on the original data. Why? Because the original data contains the same samples as the bootstrap, and as we add more features, the greater the chance that some of them randomly correlate with our response variable. When we test the bootstrap model on the original data (plus some more samples), it retains some of this random ability to predict the real labels. This is a typical overfitting problem when we have higher numbers of features, and the procedure is faulty.
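To see how easily pure noise can be fitted when p is large, here is a tiny side check of my own (not part of the original experiments below): fitting on the full data and scoring the very same data gives an apparent AUC far above 0.5 even though every feature is random.

library(glmnet)
library(pROC)
set.seed(123)
n <- 100; p <- 200
x <- matrix(rnorm(n * p), n, p)
y <- factor(rep(c("A", "B"), each = n / 2))
# fit on all the data and evaluate on the same data: apparent performance only
fit <- glmnet(x, y, alpha = 1, family = "binomial")
preds <- predict(fit, newx = x, type = "response", s = 0.01)
as.numeric(roc(y, as.vector(preds))$auc)  # typically close to 1 despite pure noise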

Let's take another experimental look at the problem; this code can be directly copied and pasted into R to repeat the analyses and plots. We have two implementations of the method: the first by me for glmnet (lasso logistic regression), the second for glm (logistic regression) from this website (http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html). Feel free to try different machine learning algorithms and play with the parameters.

library(glmnet) library(pROC) library(caret) library(ggplot2) library(kimisc) ### TEST 1: bootstrap optimism with glmnet cc <- c() for (zz in seq(2,100,1)){ print(zz) ## set up test data test <- matrix(rnorm(100*zz, mean = 0, sd = 1), nrow = 100, ncol = zz, byrow = TRUE) labelsa <- as.factor(c(rep('A',50),rep('B',50))) colnames(test) <- paste('Y',seq(1,zz),sep='') row.names(test) <- paste('Z',seq(1,100),sep='') ### my own implementation of optimism corrected bootstrapping ## 1. fit model to entire test data (i.e. overfit it) orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial") preds <- predict(orig,newx=test,type='response',s=0.01) auc <- roc(labelsa,as.vector(preds)) original_auc <- as.numeric(auc$auc) ## 2. take resample of data and try to estimate optimism test2 <- cbind(test,labelsa) B <- 50 results <- matrix(ncol=2,nrow=B) for (b in seq(1,B)){ ## get the bootstrapped resampled data boot <- test2[sample(row.names(test2),50),] labelsb <- boot[,ncol(boot)] boot <- boot[,-ncol(boot)] ## use the bootstrapped model to predict its own labels bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial") preds <- predict(bootf,newx=boot,type='response',s=0.01) auc <- roc(labelsb,as.vector(preds)) boot_auc <- as.numeric(auc$auc) ## use bootstrap model to predict original labels preds <- predict(bootf,newx=test,type='response',s=0.01) auc <- roc(labelsa,as.vector(preds)) boot_original_auc <- as.numeric(auc$auc) ## add the data to results results[b,1] <- boot_auc results[b,2] <- boot_original_auc } ## calculate the optimism O <- mean(results[,1]-results[,2]) ## calculate optimism corrected measure of prediction (AUC) corrected <- original_auc-O ## cc <- c(cc,corrected) print(cc) } ## print it df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc) p1 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) + geom_line() + ggtitle('glmnet - random data only gives positive result with optimism corrected bootstrap') print(p1) png('glmnet_test_upto100.png', height = 15, width = 27, units = 'cm', res = 900, type = 'cairo') print(p1) dev.off()

So here are the results: as the number of noise-only features on the x axis increases, our 'corrected' estimate of AUC (on the y axis) also increases, once there are enough of them to allow that noise to randomly predict the labels. This shows the problem starts at about 40-50 features and then gets worse up to about 75+. This is with the 'glmnet' function.

Let's look at the method using glm: we find the same trend with a different implementation.

      ## TEST2 auc.adjust <- function(data, fit, B){ fit.model <- fit data$pred.prob <- fitted(fit.model) # get overfitted AUC auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC' auc.boot <- vector (mode = "numeric", length = B) auc.orig <- vector (mode = "numeric", length = B) o <- vector (mode = "numeric", length = B) for(i in 1:B){ boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc' fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial") boot.sample$pred.prob <- fitted(fit.boot) # get bootstrapped AUC auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc # get original data boot AUC data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response") auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc # calculated optimism corrected version o[i] <- auc.boot[i] - auc.orig[i] } auc.adj <- auc.app - (sum(o)/B) return(auc.adj) } cc <- c() for (zz in seq(2,100,1)){ print(zz) ## set up test data test <- matrix(rnorm(100*zz, mean = 0, sd = 1), nrow = 100, ncol = zz, byrow = TRUE) labelsa <- as.factor(c(rep('A',50),rep('B',50))) colnames(test) <- paste('Y',seq(1,zz),sep='') row.names(test) <- paste('Z',seq(1,100),sep='') test2 <- data.frame(cbind(labelsa,test)) test2$labelsa <- as.factor(test2$labelsa) ## 1. make overfitted model model <- glm(labelsa ~., data = test2, family = "binomial") ## 2. estimate optimism and correct d <- auc.adjust(test2, model, B=200) cc <- c(cc,d) } ## print it df <- data.frame(p=seq(2,100,1),optimism_corrected_boot_AUC=cc) p2 <- ggplot(df, aes(x=p, y=optimism_corrected_boot_AUC)) + geom_line() + ggtitle('glm - random data only gives positive result with optimism corrected bootstrap') print(p2) png('glmt_test_upto100.png', height = 15, width = 27, units = 'cm', res = 900, type = 'cairo') print(p2) dev.off()

So there we have it: the method has a problem and should not be used with more than about 40 features. Unfortunately, this method is currently being used for datasets with more dimensions than this (40+) because people think that, since it is a published method, it is safe; unfortunately, this is not how the world works, R readers. Remember, the system is corrupt, science and statistics are full of lies, and if in doubt, do your own tests on positive and negative controls.

      This is with the 'glm' function. Random features added on x axis, the corrected AUC on y axis.

What if you don't believe it? I mean, this is a textbook method. Well, R readers, if that is so, I suggest you code it yourself and try the code here: run the experiments on null (random data only) datasets with increasing numbers of features.

      This is the last part in the series on debunking the optimism corrected bootstrap method. I consider it, debunked.



      Using emojis as scatterplot points

      Fri, 12/28/2018 - 06:48

      (This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers)

Recently I wanted to learn how to use emojis as points in a scatterplot. It seems like the emojifont package is a popular way to do it. However, I couldn't seem to get it to work on my machine (perhaps I need to install the font manually?). The other package I found was emoGG; this post shows how to use this package. (For another example involving fire stations, see this script.)

      Let’s load the packages:

      library(ggplot2) library(emoGG)

      Let’s load a less commonly used small dataset that comes with base R called ToothGrowth:

      data("ToothGrowth") summary(ToothGrowth) # len supp dose # Min. : 4.20 OJ:30 Min. :0.500 # 1st Qu.:13.07 VC:30 1st Qu.:0.500 # Median :19.25 Median :1.000 # Mean :18.81 Mean :1.167 # 3rd Qu.:25.27 3rd Qu.:2.000 # Max. :33.90 Max. :2.000

      Here is the dataset’s description from its documentation:

      The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

      A little esoteric… but it’ll work for our purposes. Let’s make a scatterplot of len vs. dose:

      ggplot(data = ToothGrowth) + geom_point(aes(dose, len))

      To plot emojis instead of points, we use the geom_emoji layer and specify which emoji to use with the emoji option:

      ggplot(data = ToothGrowth) + geom_emoji(aes(dose, len), emoji = "1f439") + labs(x = "Dose (mg/day)", y = "Tooth length")

      We can add horizontal jitter to the points manually:

      ggplot(data = ToothGrowth) + geom_emoji(aes(dose + runif(nrow(ToothGrowth), min = -0.2, max = 0.2), len), emoji = "1f439") + labs(x = "Dose (mg/day)", y = "Tooth length")

      I wasn’t able to find an elegant way to have different emojis printed for different points. For example, this dataset contains 30 guinea pigs fed by orange juice and 30 fed by ascorbic acid: we might want different emojis to represent these two groups. The following code does not work:

      # doesn't work ggplot(data = ToothGrowth) + geom_emoji(aes(dose + runif(nrow(ToothGrowth), min = -0.2, max = 0.2), len), emoji = supp) + labs(x = "Dose (mg/day)", y = "Tooth length")

What we could do is add separate layers manually, one for each delivery group. It's cumbersome but it works…

p1 <- geom_emoji(data = subset(ToothGrowth, supp == "OJ"), aes(dose + runif(sum(ToothGrowth$supp == "OJ"), min = -0.2, max = 0.2), len), emoji = "1f34a") p2 <- geom_emoji(data = subset(ToothGrowth, supp == "VC"), aes(dose + runif(sum(ToothGrowth$supp == "VC"), min = -0.2, max = 0.2), len), emoji = "1f48a") ggplot() + p1 + p2 + labs(x = "Dose (mg/day)", y = "Tooth length")

      How did I find the emoji code to use? emoGG comes with an emoji_search function that helps us with that. For example, to find the code for the orange emoji:

emoji_search("orange") # emoji code keyword # 2184 tangerine 1f34a orange # 2235 carrot 1f955 orange # 4092 a 1f170 red-square # 4093 a 1f170 alphabet # 4094 a 1f170 letter # 4130 o 2b55 circle # 4131 o 2b55 round # 4364 ng 1f196 blue-square # 4365 ng 1f196 words # 4366 ng 1f196 shape # 4367 ng 1f196 icon # 5311 georgia 1f1ec\\U0001f1ea ge


      My R Take on Advent of Code – Day 3

      Fri, 12/28/2018 - 01:00

      (This article was first published on r-tastic, and kindly contributed to R-bloggers)

      Ho, ho, ho, Happy Chris.. New Year? Between eating the sea of fish (as the Polish tradition requires), assembling doll houses and designing a new kitchen, I finally managed to publish the third post on My R take on Advent of Code. To keep things short and sweet, here’s the original challenge:

      Each Elf has made a claim about which area of fabric would be ideal for Santa’s suit. All claims have an ID and consist of a single rectangle with edges parallel to the edges of the fabric. Each claim’s rectangle is defined as follows:
      The number of inches between the left edge of the fabric and the left edge of the rectangle.
      The number of inches between the top edge of the fabric and the top edge of the rectangle.
      The width of the rectangle in inches.
      The height of the rectangle in inches.
A claim like #123 @ 3,2: 5x4 means that claim ID 123 specifies a rectangle 3 inches from the left edge, 2 inches from the top edge, 5 inches wide, and 4 inches tall. Visually, it claims the square inches of fabric represented by # (and ignores the square inches of fabric represented by .) in the diagram below:

...........
...........
...#####...
...#####...
...#####...
...#####...
...........
...........
...........

      The problem is that many of the claims overlap, causing two or more claims to cover part of the same areas. For example, consider the following claims:

#1 @ 1,3: 4x4
#2 @ 3,1: 4x4
#3 @ 5,5: 2x2

      Visually, these claim the following areas:

........
...2222.
...2222.
.11XX22.
.11XX22.
.111133.
.111133.
........

      The four square inches marked with X are claimed by both 1 and 2. (Claim 3, while adjacent to the others, does not overlap either of them.). If the Elves all proceed with their own plans, none of them will have enough fabric. How many square inches of fabric are within two or more claims?

This is interesting! Let's load the tidyverse and have a quick look at the data:

      library(tidyverse) raw_input <- read.delim('day3-raw-input.txt', header = FALSE) head(raw_input) ## V1 ## 1 #1 @ 850,301: 23x12 ## 2 #2 @ 898,245: 15x10 ## 3 #3 @ 8,408: 12x27 ## 4 #4 @ 532,184: 16x13 ## 5 #5 @ 550,829: 11x10 ## 6 #6 @ 656,906: 13x12

      Ha! It looks like we need to first extract each dimension from the original input – easy-peasy with a little bit of regex and parse_number:

      # separate and clean dimension figures clean_input <- raw_input %>% rename(input = V1) %>% mutate(ID = str_extract(input, '#[:digit:]+'), # extract ID from_left_edge = str_extract(input, '@..?[:digit:]+\\,'), # extract right squares from_top_endge = str_extract(input, '\\,[:digit:]+\\:'), # extract top square width = str_extract(input, '[:digit:]+x'), # extract left dimension height = str_extract(input, 'x[:digit:]+')# extract right dimension ) %>% mutate_if(is.character, readr::parse_number) # extract numbers head(clean_input, 10) ## input ID from_left_edge from_top_endge width height ## 1 #1 @ 850,301: 23x12 1 850 301 23 12 ## 2 #2 @ 898,245: 15x10 2 898 245 15 10 ## 3 #3 @ 8,408: 12x27 3 8 408 12 27 ## 4 #4 @ 532,184: 16x13 4 532 184 16 13 ## 5 #5 @ 550,829: 11x10 5 550 829 11 10 ## 6 #6 @ 656,906: 13x12 6 656 906 13 12 ## 7 #7 @ 489,357: 24x23 7 489 357 24 23 ## 8 #8 @ 529,898: 12x19 8 529 898 12 19 ## 9 #9 @ 660,201: 19x28 9 660 201 19 28 ## 10 #10 @ 524,14: 21x27 10 524 14 21 27

      Now that we have the dimensions, how do we go about determining the overlap?
My idea was, for each claim, to create a series of 'coordinates' where the number before the dot indicates the position from the left edge and the number after it the position from the top edge. Overlapping squares from different claims would have the same 'coordinates'.

Right, but how do we do it in R? This is where the outer function comes in handy. Let's take the first example:

      # first example #1 @ 1,3: 4x4 from_left_edge <- 1 width <- 4 from_top_endge <- 3 height <- 4 # create a series of coordinates per set of dimensions dims <- as.vector(outer(from_left_edge + 1:width, from_top_endge + 1:height, paste, sep = '.')) sort(dims) ## [1] "2.4" "2.5" "2.6" "2.7" "3.4" "3.5" "3.6" "3.7" "4.4" "4.5" "4.6" ## [12] "4.7" "5.4" "5.5" "5.6" "5.7"

      Neat! Now, let’s wrap it up in a function and apply it to the challenge dataset:

# function that creates coordinates for squares occupied by each claim get_dimensions <- function(from_left_edge, width, from_top_endge, height ) { dims <- as.vector(outer(from_left_edge + 1:width, from_top_endge + 1:height, paste, sep = '.')) return(dims) } ## apply the function to the challenge dataset final_list <- pmap(list(from_left_edge = clean_input$from_left_edge, width = clean_input$width, from_top_endge = clean_input$from_top_endge, height = clean_input$height), get_dimensions) # a list of 'coordinates' for the first claim head(final_list[[1]]) ## [1] "851.302" "852.302" "853.302" "854.302" "855.302" "856.302"

Now, let's calculate the number of those coordinates that appear more than once in our list, which will give us the final solution to the Day 3 puzzle:

## final solution final_list %>% unlist() %>% table() %>% # get counts per coordinate as_tibble() %>% # put it in usable format filter(n > 1) %>% nrow() ## [1] 113576


      My #Best9of2018 tweets

      Fri, 12/28/2018 - 01:00

      (This article was first published on Posts on Maëlle's R blog, and kindly contributed to R-bloggers)

As 2018 nears its end, it's time for me to look back on my R/Twitter year with the same simple method as last year: let me identify and webshoot my 9 best tweets of 2018!

      Downloading and opening my Twitter data

      Like in 2017 I tweeted too much and therefore was unable to rely on
      rtweet::get_timeline() (or rtweet::get_my_timeline()) to download my
      tweets so I exported data via the Tweets tab of
      https://analytics.twitter.com/. Last year, I downloaded one file per
      quarter but somehow had to download one per month this time. It was
      monotonous but not horrible.

      library("magrittr") dir("data", full.names = TRUE) %>% purrr::map_df(readr::read_csv) %>% unique() -> my_tweets

      4196 tweets!

      Getting the top 9 tweets

      Similarly to my 2017 post, I chose the number of likes as criterion to
      define my best tweets.

      my_tweets <- dplyr::arrange(my_tweets, - likes) my_tweets <- janitor::clean_names(my_tweets) knitr::kable(my_tweets[1:9, "likes"]) likes 162 151 128 112 95 94 91 86 75 best9 <- my_tweets$tweet_permalink[1:9]

      The table above shows that my best tweets were not extremely popular.

      Webshooting and combining the tweets

      My 2017 blog post inspired Bob Rudis to contribute a tweet_shot()
      function to rtweet, relying on the webshot package, so that’s what I
      used.

      shot <- function(permalink){ rtweet::tweet_shot(permalink) %>% magick::image_crop(magick::geometry_area(width = 517, height = 517, y_off = 88)) %>% magick::image_border("salmon", "20x20") } images <- purrr::map(as.character(best9), shot)

      I improved the collage code with two tweaks:

      • I created rows then columns instead of the other way round, because
        that’s what Instagram does. I did not pay attention to this last
        year but Andrew Caines told me this in a comment.

      • I did not save the intermediary images to disk. I’ve upped my
        magick game!

      make_row <- function(i, images){ images[((i-1)*3+1):((i-1)*3+3)] %>% magick::image_join() %>% magick::image_append() } purrr::map(1:3, make_row, images = images) %>% magick::image_join() %>% magick::image_append(stack = TRUE) %>% magick::image_border("salmon", "20x20")

      Worth noting from my R year are, according to these tweets,

      What about your R year, on Twitter and elsewhere? Have a good last days
      of 2018, and a happy 2019!



      French Mortality Poster

      Thu, 12/27/2018 - 12:43

      (This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers)

      Based on the heatmaps I drew earlier this month, I made a poster of two centuries of data on mortality rates in France for males and females. It turned out reasonably well, I think. I will probably get it blown up to a nice large size and put it up on the wall. I’ve had very good results with PhD Posters for work like this over the years, by the way.

      Two centuries of French mortality rates

      Here’s a PDF version if you have access to a color printer or plotter and would like a hard copy.

      As before, the data are from mortality.org, and the plots were produced in R using ggplot. The code to make them is available on GitHub. The sidebar was added in Adobe Illustrator. And, of course, if you’d like to learn how to make, refine, and understand plots like this, consider buying my book, Data Visualization: A Practical Introduction.
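For readers who want a feel for how such a heatmap is put together, here is a toy geom_tile sketch with synthetic rates standing in for the mortality.org data (the real code, linked above on GitHub, is of course more involved):

library(ggplot2)
set.seed(1)
# synthetic stand-in: mortality rising with age, falling slowly over time
df <- expand.grid(year = 1816:2016, age = 0:100)
df$rate <- plogis(-9 + 0.08 * df$age - 0.01 * (df$year - 1816)) + runif(nrow(df), 0, 0.002)
ggplot(df, aes(x = year, y = age, fill = log(rate))) +
  geom_tile() +
  scale_fill_viridis_c(option = "A") +
  labs(title = "Toy mortality heatmap", x = "Year", y = "Age", fill = "log(rate)")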



      Part 3: Two more implementations of optimism corrected bootstrapping show shocking bias

      Thu, 12/27/2018 - 12:05

      (This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers)

Welcome to part III of debunking the optimism corrected bootstrap in high dimensions (quite high numbers of features) in the Christmas holidays. Previously we saw, with a reproducible code implementation, that this method is very biased when we have many features (50-100 or more). I suggest avoiding this method until it has been reassessed thoroughly to find out how bad this situation is with different numbers of dimensions. Yes, I know for some statisticians this is your favorite method and they tell people how their method is lacking in statistical power, but clearly this has much worse issues, at least on some data. People are currently using this method in genomic medicine, where we have high numbers of features and low numbers of samples. Just re-run the code yourself and make up your own mind if in doubt. I now have 3 implementations (excluding Caret) confirming the bias: one written by me, two by independent statisticians. Let's run some more experiments.

This time I have used a different person's implementation based on the 'glm' function, i.e. logistic regression, to show the same misleading trend occurs, i.e. positive results on purely random data. The code has been taken directly from http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html. The second implementation is from another unnamed statistician; there was a bug in their code so I had to correct it, and there may be further errors in it, so feel free to check.

If you disagree with these findings, please show your own implementation of the method in R (not using Caret), following the same experiment of increasing the number of purely-noise features to very high values with binary labels. Also, check the last two posts and re-run the code yourself before making your mind up and providing counter-arguments. I am practically certain there is a serious problem with this method. Yet again, in situations like this, don't rely on Mr X's publication from 10 years ago; use null simulated data sets with different dimensions to investigate the behavior of the method in question yourself.

There are no real predictors in this data; all of them are random draws from a normal distribution. The bias is less bad using glm than with glmnet. This method is being used (with high numbers of features) as we speak by Mr Fudge and his friends so they can get into better journals. The system is corrupt.

      You should be able to copy and paste this code directly into R to repeat the results. I can’t vouch for this code because I did not write it, but both implementations show the same result as in the last blog post with my code.

      Implementation 1. A glm experiment.

      ### example of logistic regression optimism corrected bootstrapping # source: http://cainarchaeology.weebly.com/r-function-for-optimism-adjusted-auc.html auc.adjust <- function(data, fit, B){ fit.model <- fit data$pred.prob <- fitted(fit.model) auc.app <- roc(data[,1], data$pred.prob, data=data)$auc # require 'pROC' auc.boot <- vector (mode = "numeric", length = B) auc.orig <- vector (mode = "numeric", length = B) o <- vector (mode = "numeric", length = B) for(i in 1:B){ boot.sample <- sample.rows(data, nrow(data), replace=TRUE) # require 'kimisc' fit.boot <- glm(formula(fit.model), data = boot.sample, family = "binomial") boot.sample$pred.prob <- fitted(fit.boot) auc.boot[i] <- roc(boot.sample[,1], boot.sample$pred.prob, data=boot.sample)$auc data$pred.prob.back <- predict.glm(fit.boot, newdata=data, type="response") auc.orig[i] <- roc(data[,1], data$pred.prob.back, data=data)$auc o[i] <- auc.boot[i] - auc.orig[i] } auc.adj <- auc.app - (sum(o)/B) boxplot(auc.boot, auc.orig, names=c("auc.boot", "auc.orig")) title(main=paste("Optimism-adjusted AUC", "\nn of bootstrap resamples:", B), sub=paste("auc.app (blue line)=", round(auc.app, digits=4),"\nadj.auc (red line)=", round(auc.adj, digits=4)), cex.sub=0.8) abline(h=auc.app, col="blue", lty=2) abline(h=auc.adj, col="red", lty=3) } ## generate random data xx.glmnet <- matrix(nrow=53,ncol=1) xx.glmnet <- data.frame(xx.glmnet) xx.glmnet[,1] <- c(rep(0,25),rep(1,28)) test <- matrix(rnorm(53*1000, mean = 0, sd = 1), nrow = 53, ncol = 1000, byrow = TRUE) xx.glmnet <- cbind(xx.glmnet,test) colnames(xx.glmnet) <- c('outcome',paste('Y',seq(1,1000),sep='')) ## 1. make overfitted model model <- glm(outcome ~., data = xx.glmnet, family = "binomial") ## 2. estimate optimism and correct auc.adjust(xx.glmnet, model, B=200)

      Let's have a look at the results, which agree nicely with our previous findings using a from scratch implementation of the method. So the red line is supposedly our corrected AUC, but the AUC should be 0.5 when running on random data. See previous part 1 post and part 2 post for demonstration of cross validation results on random data which give the correct result.

      Implementation 2: Another glmnet experiment

      ### example of optimism corrected bootstrapping implementation # source: an unnamed programmer library(boot) library(pROC) library(glmnet) #glmnet -all predictors penalised compare_opt_glmnet<- function(orig_data_glmnet, i){ # orig_data_glmnet = the whole data train_data_glmnet<- orig_data_glmnet[i, ] # get bootstrap model_glmnet<- model_process_glmnet(train_data_glmnet) # return a summary optimism results AUC_results_glmnet <- c( # boot against original AUC_train1=roc(orig_data_glmnet$outcome,as.vector(predict(model_glmnet,type="response", newx = model.matrix (outcome~.,orig_data_glmnet )[,-1], s = 'lambda.min')))$auc, # boot against boot AUC_train2=roc(train_data_glmnet$outcome,as.vector(predict(model_glmnet,type="response", newx = model.matrix (outcome~.,train_data_glmnet )[,-1], s = 'lambda.min')))$auc, AUC_diff=roc(train_data_glmnet$outcome,as.vector(predict(model_glmnet,type="response", newx = model.matrix (outcome~.,train_data_glmnet )[,-1], s = 'lambda.min')))$auc- roc(orig_data_glmnet$outcome,as.vector(predict(model_glmnet,type="response", newx = model.matrix (outcome~.,orig_data_glmnet )[,-1], s = 'lambda.min')))$auc ) return(AUC_results_glmnet) } model_process_glmnet<- function(the_data_glmnet){ model_final_glmnet <- cv.glmnet (model.matrix (outcome~.,the_data_glmnet )[,-1],the_data_glmnet$outcome,alpha=1,family = "binomial") return(model_final_glmnet) } # generate random data test <- matrix(rnorm(53*1000, mean = 0, sd = 1), nrow = 53, ncol = 1000, byrow = TRUE) test <- data.frame(test) labs <- factor(c(0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,1,1,1,0,1,1,0,1,0,1,0,0,0,0,0, 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)) labs <- data.frame(as.matrix(labs)) colnames(labs)[1] <- 'outcome' xx.glmnet <- cbind(labs,test) ## 1. make overfitted model tempd <- xx.glmnet[,2:ncol(xx.glmnet)] labels <- xx.glmnet[,1] overfittedmodel <- cv.glmnet(as.matrix(tempd),y=labels,alpha=1,family="binomial") lasso.prob <- predict(overfittedmodel,type="response",newx=as.matrix(tempd),s='lambda.min') pred.cligen <- prediction(lasso.prob, labels) auc <- roc(labels,as.vector(lasso.prob)) overfitted_auc <- as.numeric(auc$auc) ## 2. try to correct for optimism Repeats = 100 res_opt_glmnet <- boot(xx.glmnet, statistic = compare_opt_glmnet, R = Repeats) optimism_glmnet <- mean(res_opt_glmnet$t[ , 3]) enhanced_glmnet <- overfitted_auc - optimism_glmnet

      Here are the results of the above code on random data only (pure madness).

> enhanced_glmnet
[1] 0.8916488

There is some disagreement about how to implement this method in Caret, as Davis Vaughan disagrees with doing it the way shown here. So, until Max Kuhn gets back to me, and relying on the last two posts, I think all the evidence points to a massive positive bias in people's results when using this method with high numbers of features.

If in doubt, compare your results with LOOCV and k-fold cross validation; this is how I discovered this method is dodgy, by seeing MASSIVE differences on null data-sets where no real predictive ability should be found. These two methods are more widely used and reliable.
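For a quick sanity check of your own, a minimal sketch along these lines (my addition, assuming the random xx.glmnet data frame generated above, and using Caret with glmnet as elsewhere in this post) should return an ROC close to 0.5 on pure noise:

## hedged sketch, not part of the original post: k-fold CV on the same noise data
library(caret)
library(glmnet)
cv_data <- xx.glmnet
cv_data$outcome <- factor(make.names(cv_data$outcome))  # caret needs valid factor levels
ctrl_cv <- trainControl(method = "cv", number = 10,
                        summaryFunction = twoClassSummary, classProbs = TRUE)
cv_fit <- train(outcome ~ ., data = cv_data, method = "glmnet",
                trControl = ctrl_cv, metric = "ROC")
max(cv_fit$results$ROC)  # should hover around 0.5 on random data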

      Sorry @ Frank Harrell, I don’t have time atm to do the extra bits you suggest.
      Game on @ Davis Vaughan. Do your own implementation with the same data as I have used with glmnet and glm.

      The steps of this method, briefly, are as follows:

      1. Fit a model M to entire data S and estimate predictive ability C.
      2. Iterate from b=1…B:
        1. Take a resample from the original data, S*
        2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
        3. Use the bootstrap model M* to get predictive ability on S, C_orig
      3. Optimism O is calculated as mean(C_boot – C_orig)
      4. Calculate optimism corrected performance as C-O.

To leave a comment for the author, please follow the link and comment on their blog: R – intobioinformatics.

      Clustering the Bible

      Thu, 12/27/2018 - 10:00

      (This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers)

During this time of year there is obviously a lot of talk about the Bible. As most people know, the New Testament comprises four different Gospels written by anonymous authors 40 to 70 years after Jesus’ supposed crucifixion. Unfortunately we have lost all of the originals and only retained copies of copies of copies (and so on), which date from hundreds of years after they were written, in all kinds of different versions (renowned Biblical scholar Professor Bart Ehrman states that there are more versions of the New Testament than there are words in the New Testament). Just as a fun fact: there are many more Gospels, but only those four were included in the official Bible.

So in general it is interesting to learn a little bit more about this stuff, especially because, for better or worse, it is at the core of Western civilization. One interesting question is how the four Gospels relate to each other. To find that out you could either study ancient Greek and Christian theology for many years and decades – or you could use the stylo package and do a cluster analysis within seconds. Obviously we go for the latter…

To that end we download one version of the four Gospels from Project Gutenberg in the original ancient Greek and put it into a separate directory called ‘corpus’ (which is the name of the default directory of the package). The famous beginning of the Gospel of John (“In the beginning was the Word, and the Word was with God, and the Word was God…”) reads as follows in Ancient Greek:

      Στην αρχή ‘ταν ο λόγος κι’ ο λόγος είτανε με το
      Θεό και Θεός είταν ο λόγος. Είταν εκείνος στην αρχή
      με το Θεό. Όλα τα πάντα μέσο του έγιναν, και χωρίς
      του τίποτα δεν έγινε που γίνηκε. Μέσα του είτανε ζωή
      κι’ η ζωή ‘τανε το φως των ανθρώπων, και το φως
      μέσα στο σκοτάδι φέγγει και το σκοτάδι δεν το κυρίεψε.

      For your convenience I provide you with the curated texts here: John, Luke, Mark, Matthew.

      After that we can already call the main function of the package:

library(stylo)
## ### stylo version: 0.6.8 ###
##
## If you plan to cite this software (please do!), use the following reference:
## Eder, M., Rybicki, J. and Kestemont, M. (2016). Stylometry with R:
## a package for computational text analysis. R Journal 8(1): 107-121.
##
## To get full BibTeX entry, type: citation("stylo")

stylo(corpus.lang = "Other", encoding = "UTF-8",
      mfw.min = 200, mfw.max = 200,
      custom.graph.title = "Ancient Greek Gospels", gui = FALSE)
## using current directory...
## Performing no sampling (using entire text as sample)
## loading John.txt ...
## loading Luke.txt ...
## loading Mark.txt ...
## loading Matthew.txt ...
## slicing input text into tokens...
##
## turning words into features, e.g. char n-grams (if applicable)...
##
## Total nr. of samples in the corpus: 4
## ....
## The corpus consists of 70943 tokens
##
## processing 4 text samples
##
## combining frequencies into a table...
##
## culling @ 0 available features (words) 5000
## Calculating z-scores...
##
## Calculating classic Delta distances...
## MFW used:
## 200

##
## Function call:
## stylo(gui = FALSE, corpus.lang = "Other", encoding = "UTF-8", mfw.min = 200, mfw.max = 200, custom.graph.title = "Ancient Greek Gospels")
##
## Depending on your chosen options, some results should have been written
## into a few files; you should be able to find them in your current
## (working) directory. Usually, these include a list of words/features
## used to build a table of frequencies, the table itself, a file containing
## recent configuration, etc.
##
## Advanced users: you can pipe the results to a variable, e.g.:
## I.love.this.stuff = stylo()
## this will create a class "I.love.this.stuff" containing some presumably
## interesting stuff. The class created, you can type, e.g.:
## summary(I.love.this.stuff)
## to see which variables are stored there and how to use them.
##
## for suggestions how to cite this software, type: citation("stylo")

As you can see, the Gospels of Matthew and Luke are more similar to each other than to Mark (the oldest Gospel), while John (the last Gospel that made it into the Bible) is farthest away. This is indeed what Biblical scholars have found out by diligently studying the New Testament: Mark, Luke and Matthew are called “Synoptic Gospels” because they are so similar, while John kind of stands on its own. The shared pieces of the Synoptic Gospels can be seen here (picture from Wikipedia, Synoptic Gospels):

As you can see, Biblical scholars have found out that Luke and Matthew share about 60 to 70 percent of the text (through the so-called double tradition), whereas Mark shares about 40 percent of the text (through the triple tradition). For comparison, John only shares about 10 percent with the Synoptic Gospels. Now, this is exactly what our cluster analysis from above shows – what an impressive feat!

      By the way, there is an even simpler way to interact with the stylo package: through the GUI. Just call stylo() and the following window will open where you can play around with different arguments:

      It has never been easier to do sophisticated textual analyses, even in ancient languages one cannot understand! You might ask: how did R do that?

Well, behind the scenes R basically creates frequency tables for the words used in the different documents. It then tries to cluster the documents based on a similarity (or distance) measure. Intuitively that means that the more often authors use the same words, the closer their texts are. As you can see, R doesn’t really have to understand the words or the language used. It just compares statistical measures, which seem to be quite effective (not only) in this case. The statistical measures often give some kind of characteristic signature of an author or a source. In a way this is the quantified version of fuzzy concepts like ‘writing style’, yet R can do it much more quickly, even for long texts, than any human ever could!
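To make the intuition concrete, here is a minimal sketch of that basic recipe (my own illustration, not the stylo internals): relative frequencies of the most frequent words per document, z-scored, then hierarchically clustered on a distance matrix (classic Delta is essentially a mean Manhattan distance on these z-scores). The texts vector is a placeholder; in practice you would read the four files with readLines().

# illustrative sketch only -- 'texts' stands in for the four Gospel files
texts <- c(John = "...", Luke = "...", Mark = "...", Matthew = "...")
tokens <- lapply(texts, function(x) strsplit(tolower(x), "[^\\p{L}]+", perl = TRUE)[[1]])
vocab <- head(names(sort(table(unlist(tokens)), decreasing = TRUE)), 200)  # 200 MFW
freqs <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab)) / length(tk)))
z <- scale(freqs)                                  # z-score each word column
plot(hclust(dist(z, method = "manhattan")))        # dendrogram of the documents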

      Happy New Year… stay tuned for much more to come in


To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines.

      The Christmas Eve Selloff was a Classic Capitulation

      Thu, 12/27/2018 - 02:52

      (This article was first published on R – Quintuitive, and kindly contributed to R-bloggers)

      The selloff on Christmas eve was so bad it looked like a typical bear market capitulation. The following rally merely confirmed it.

As mentioned in the last post, at the time the correction reached 16%, at the close of December 21st, the oversold indicator was not lit. What followed was the worst Christmas Eve selloff ever. The selloff had two achievements. First, it took this bear market to a slightly deeper level than the 1998 bear market, which, based on my previous analysis, was the smallest bear market to date.

The second accomplishment of the selloff was that it looked a lot like a capitulation typical of a bear market bottom or a reversal. Running the capitulation function from the previous post confirms it. In other words, the selloff took the three relevant indexes 10% or more away from their 50-day exponential moving average.
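The capitulation function itself is in the previous post; as a rough illustration of the underlying computation, here is a hedged sketch (not the author's code) using quantmod/TTR, with the S&P 500 standing in for the three indexes:

library(quantmod)   # also loads TTR and xts
spx <- Cl(getSymbols("^GSPC", from = "2017-01-01", auto.assign = FALSE))
ema50 <- EMA(spx, n = 50)
deviation <- spx / ema50 - 1           # % distance from the 50-day EMA
capitulation_days <- deviation <= -0.10
tail(deviation[capitulation_days])     # days at least 10% below the EMA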

What’s next? No one knows. It is a bear market until proven otherwise. In the distant past, the capitulation indicator has been pretty reliable and less frequent. In the 2002 bear market, and especially in 2008, there were more signals. The indicator gave a few signals way before the pain started to ease. In 2011, the indicator gave a signal even without a bear market and was spot on – the bottom came a day later. Today’s rally certainly confirms the oversold theory.

      The post The Christmas Eve Selloff was a Classic Capitulation appeared first on Quintuitive.


To leave a comment for the author, please follow the link and comment on their blog: R – Quintuitive.

      Some fun with {gganimate}

      Thu, 12/27/2018 - 01:00

      (This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)


      In this short blog post I show you how you can use the {gganimate} package to create animations
      from {ggplot2} graphs with data from UNU-WIDER.

      WIID data

      Just before Christmas, UNU-WIDER released a new edition of their World Income Inequality Database:

      *NEW #DATA*
      We’ve just released a new version of the World Income Inequality Database.
      WIID4 includes #data from 7 new countries, now totalling 189, and reaches the year 2017. All data is freely available for download on our website: https://t.co/XFxuLvyKTC pic.twitter.com/rCf9eXN8D5

      — UNU-WIDER (@UNUWIDER) December 21, 2018

      The data is available in Excel and STATA formats, and I thought it was a great opportunity to
      release it as an R package. You can install it with:

      devtools::install_github("b-rodrigues/wiid4")

Here is a short description of the data, taken from UNU-WIDER’s website:

      “The World Income Inequality Database (WIID) presents information on income inequality for
      developed, developing, and transition countries. It provides the most comprehensive set of income
      inequality statistics available and can be downloaded for free.

      WIID4, released in December 2018, covers 189 countries (including historical entities), with over
      11,000 data points in total. With the current version, the latest observations now reach the year 2017.”

      It was also a good opportunity to play around with the {gganimate} package. This package
      makes it possible to create animations and is an extension to {ggplot2}. Read more about it
      here.

      Preparing the data

To create a smooth animation, I need a cylindrical (balanced) panel data set, meaning that for each
country in the data set there are no missing years. I also chose to focus on certain observations
only: net income, the whole population of the country (instead of just the economically
active, for instance) and the whole country itself (and not just the rural areas).
      On this link you
      can find a codebook (pdf warning), so you can understand the filters I defined below better.

      Let’s first load the packages, data and perform the necessary transformations:

library(wiid4)
library(tidyverse)
library(ggrepel)
library(gganimate)
library(brotools)

small_wiid4 <- wiid4 %>%
  mutate(eu = as.character(eu)) %>%
  mutate(eu = case_when(eu == "1" ~ "EU member state",
                        eu == "0" ~ "Non-EU member state")) %>%
  filter(resource == 1, popcovr == 1, areacovr == 1, scale == 2) %>%
  group_by(country) %>%
  group_by(country, year) %>%
  filter(quality_score == max(quality_score)) %>%
  filter(source == min(source)) %>%
  filter(!is.na(bottom5)) %>%
  group_by(country) %>%
  mutate(flag = ifelse(all(seq(2004, 2016) %in% year), 1, 0)) %>%
  filter(flag == 1, year > 2003) %>%
  mutate(year = lubridate::ymd(paste0(year, "-01-01")))

      For some country and some years, there are several sources of data with varying quality. I only
      keep the highest quality sources with:

      group_by(country, year) %>% filter(quality_score == max(quality_score)) %>%

If there are different sources of equal quality, I give priority to the sources that are the most
comparable across countries (Luxembourg Income Study, LIS data) over less comparable sources
(at least that’s my understanding of the source variable):

      filter(source == min(source)) %>%

      I then remove missing data with:

      filter(!is.na(bottom5)) %>%

      bottom5 and top5 give the share of income that is controlled by the bottom 5% and top 5%
      respectively. These are the variables that I want to plot.

      Finally I keep the years 2004 to 2016, without any interruption with the following line:

      mutate(flag = ifelse(all(seq(2004, 2016) %in% year), 1, 0)) %>% filter(flag == 1, year > 2003) %>%

      ifelse(all(seq(2004, 2016) %in% year), 1, 0)) creates a flag that equals 1 only if the years
      2004 to 2016 are present in the data without any interruption. Then I only keep the data from 2004
      on and only where the flag variable equals 1.

In the end, I was left only with European countries. It would have been interesting to have countries
from other continents, but apparently only European countries provide data on an annual basis.

      Creating the animation

      To create the animation I first started by creating a static ggplot showing what I wanted;
      a scatter plot of the income by bottom and top 5%. The size of the bubbles should be proportional
      to the GDP of the country (another variable provided in the data). Once the plot looked how I wanted
      I added the lines that are specific to {gganimate}:

      labs(title = 'Year: {frame_time}', x = 'Top 5', y = 'Bottom 5') + transition_time(year) + ease_aes('linear')

      I took this from {gganimate}’s README.

animation <- ggplot(small_wiid4) +
  geom_point(aes(y = bottom5, x = top5, colour = eu, size = log(gdp_ppp_pc_usd2011))) +
  xlim(c(10, 20)) +
  geom_label_repel(aes(y = bottom5, x = top5, label = country), hjust = 1, nudge_x = 20) +
  theme(legend.position = "bottom") +
  theme_blog() +
  scale_color_blog() +
  labs(title = 'Year: {frame_time}', x = 'Top 5', y = 'Bottom 5') +
  transition_time(year) +
  ease_aes('linear')

      I use geom_label_repel to place the countries’ labels on the right of the plot. If I don’t do
      this, the labels of the countries would be floating around and the animation would be unreadable.

      I then spent some time trying to render a nice webm instead of a gif. It took some trial and error
      and I am still not entirely satisfied with the result, but here is the code to render the animation:

animate(animation,
        renderer = ffmpeg_renderer(options = list(s = "864x480",
                                                  vcodec = "libvpx-vp9",
                                                  crf = "15",
                                                  b = "1600k",
                                                  vf = "setpts=5*PTS")))

      The option vf = "setpts=5*PTS" is important because it slows the video down, so we can actually
      see something. crf = "15" is the quality of the video (lower is better), b = "1600k" is the
      bitrate, and vcodec = "libvpx-vp9" is the codec I use. The video you saw at the top of this
      post is the result. You can also find the video here,
      and here’s a gif if all else fails:


      I would have preferred if the video was smoother, which should be possible by creating more frames.
      I did not find such an option in {gganimate}, and perhaps there is none, at least for now.

      In any case {gganimate} is pretty nice to play with, and I’ll definitely use it more!

      Update

Silly me! It turns out that the animate() function has arguments that can control the number of frames
and the duration, without needing to pass options to the renderer. I was looking at options for the
renderer only, without having read the documentation of the animate() function. It turns out that
you can pass several arguments to the animate() function; for example, here is how you
can make a GIF that lasts for 20 seconds running at 20 frames per second, pausing for 5
frames at the end and then restarting:

      animate(animation, nframes = 400, duration = 20, fps = 20, end_pause = 5, rewind = TRUE)

      I guess that you should only pass options to the renderer if you really need fine-grained control.

      This took around 2 minutes to finish. You can use the same options with the ffmpeg renderer too.
      Here is what the gif looks like:


      Much, much smoother!

      Hope you enjoyed! If you found this blog post useful, you might want to follow
      me on twitter for blog post updates and
      buy me an espresso or paypal.me.



To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

      Should Old Acquaintance be Forgot: Tidying up Mac Mail

      Thu, 12/27/2018 - 01:00

      (This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers)

As the year is closing down, why not spend some of the free time to explore your email data using R and the tidyverse? When I learned that Mac OS Mail stores its internal data in a SQLite database file I was hooked. A quick dive into your email archive might uncover some of your old acquaintances. Let’s take a peek.

      Obviously, the below is only applicable when you are a regular user of the Mail app for Mac OS. As a first step, you need to locate the file Envelope Index that tends to be located in ~/Library/Mail/V6/MailData/. Copy this file somewhere and adjust the path provided in mail_db to point to this copy. Do not work with the original file.

library(DBI)
library(tidyverse)
library(lubridate)
library(ExPanDaR)
library(ggridges)

mail_db <- "data/EI"
con <- dbConnect(RSQLite::SQLite(), mail_db)

Now you have established a database connection to your copy of Mac OS Mail’s internal data. If you receive an error message, check whether you have the required packages (including RSQLite) installed. With this established connection, we can now see what the database has in store for us.

kable(dbListTables(con), col.names = "List of Tables")

List of Tables
action_ews_messages
action_imap_messages
action_labels
action_messages
addresses
attachments
duplicates_unread_count
events
ews_copy_action_messages
ews_folders
imap_copy_action_messages
imap_labels
imap_messages
labels
last_spotlight_check_date
local_message_actions
mailbox_actions
mailboxes
messages
properties
recipients
sqlite_sequence
subjects
threads

Hey, this is an impressive list of tables. For this blog post, I am mostly interested in exploring the development of my email activity in terms of senders and receivers over time. Thus, I focus on the relations messages, addresses and recipients.

messages <- dbListFields(con, "messages")
recipients <- dbListFields(con, "recipients")
addresses <- dbListFields(con, "addresses")
max_members <- max(length(messages), length(recipients), length(addresses))
length(messages) <- max_members
length(recipients) <- max_members
length(addresses) <- max_members
df <- data.frame(messages, recipients, addresses, stringsAsFactors = FALSE)
df[is.na(df)] <- ""
kable(df)

messages                 recipients    addresses
ROWID                    ROWID         ROWID
message_id               message_id    address
document_id              type          comment
in_reply_to              address_id
remote_id                position
sender
subject_prefix
subject
date_sent
date_received
date_created
date_last_viewed
mailbox
remote_mailbox
flags
read
flagged
size
color
type
conversation_id
snippet
fuzzy_ancestor
automated_conversation
root_status
conversation_position
deleted

      OK. These relations provide enough data to play. Let’s create a table that is organized by message and contains sender and receiver info as well as sending/receiving time.

sql <- paste("SELECT messages.ROWID as message_id, date_sent,",
             "date_received, a1.address as sender_address,",
             "a1.comment as sender_comment, a2.address as recipient_address,",
             "a2.comment as recipient_comment, snippet",
             "FROM messages left join addresses AS a1 on messages.sender = a1.ROWID",
             "LEFT JOIN recipients on recipients.message_id = messages.ROWID",
             "LEFT JOIN addresses AS a2 on recipients.address_id = a2.ROWID")
res <- dbSendQuery(con, sql)
df <- dbFetch(res)
dbClearResult(res)
dbDisconnect(con)
df[,c("date_sent", "date_received")] <-
  lapply(df[,c("date_sent", "date_received")],
         function(x) as.POSIXct(x, origin = "1970-01-01"))

The resulting data frame contains all messages, including messages sent to multiple recipients (with me on the sending or receiving end). To limit the messages to the ones where I am involved, I match sender_address and recipient_address against a vector my_addresses containing my email addresses (not disclosed here for privacy reasons).
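For illustration, such a vector would simply look like this (addresses invented):

# purely illustrative -- replace with your own addresses
my_addresses <- c("jane.doe@example.com", "jdoe@work.example.org", "jane@old-isp.example.net")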

df %>%
  filter(tolower(sender_address) %in% my_addresses |
         tolower(recipient_address) %in% my_addresses) %>%
  distinct(.keep_all = TRUE) %>%
  arrange(date_received) -> emails

      In addition, I prepare a panel dataset that contains data at email address year level.

emails %>%
  filter(!tolower(sender_address) %in% my_addresses) %>%
  mutate(address = tolower(sender_address),
         year = year(date_received)) %>%
  group_by(year, address) %>%
  summarise(emails_received_from = n()) -> emails_received

emails %>%
  filter(!tolower(recipient_address) %in% my_addresses) %>%
  mutate(address = tolower(recipient_address),
         year = year(date_received)) %>%
  group_by(year, address) %>%
  summarise(emails_sent_to = n()) -> emails_sent

panel <- full_join(emails_received, emails_sent) %>%
  replace_na(list(emails_sent_to = 0, emails_received_from = 0)) %>%
  arrange(year, -emails_sent_to, -emails_received_from)

      Time for a first analysis. How does my email in- and outflow develop over the years?

panel %>%
  group_by(year) %>%
  summarise(sent_mails = sum(emails_sent_to),
            received_mails = sum(emails_received_from)) %>%
  gather(key = "var", value = "count", -year) %>%
  ggplot(aes(year, count)) +
  geom_bar(aes(fill = var), position = "dodge", stat = "identity") +
  scale_fill_discrete(name = "Direction", labels = c("Emails received", "Emails sent")) +
  xlab("Year") + ylab("Number of emails") +
  theme_minimal() +
  theme(legend.position = c(.3, .8))

      Hmm. I have no interest in extrapolating this time trend… Just for fun and giggles: When do I send emails (by time of day)?

emails %>%
  filter(sender_address %in% my_addresses) %>%
  mutate(sent_tod = hour(date_sent)) %>%
  ggplot() +
  geom_histogram(aes(sent_tod), binwidth = 1, fill = "#00376C") +
  xlab("Time of day [24 hours]") + ylab("Number of emails") +
  theme_minimal()

      No real surprises here (at least for me), besides that I seem to take a dip in the early afternoon. Does this sending behavior exhibit any interesting time trends?

emails %>%
  filter(sender_address %in% my_addresses) %>%
  mutate(sent_tod = hour(date_sent) + minute(date_sent)/60 + second(date_sent)/3600,
         year = as.factor(year(date_sent))) %>%
  ggplot(aes(x = sent_tod, y = year, fill = ..x..)) +
  geom_density_ridges_gradient(scale = 2, rel_min_height = 0.01) +
  ylab("Year") + xlab("Time of day [24 hours]") +
  theme_minimal() +
  theme(legend.position = "none")

      2002 and 2003 stand out (I was based in the U.S. for most of this period) and it seems as if I am gradually becoming more of an early starter.

But enough on this. This is supposed to be about old acquaintances to fit this special end-of-the-year mood. Let’s dive more into the personal sphere. For this, I define an “email contact” as an email address with an email exchange in a given year, meaning that I was at both the receiving and the sending end at least once.

panel %>%
  group_by(year) %>%
  filter(emails_sent_to > 0, emails_received_from > 0) %>%
  summarise(nr_email_contacts = n()) %>%
  ggplot(aes(year, nr_email_contacts)) +
  geom_bar(stat = "identity", fill = "#00376C") +
  xlab("Year") + ylab("Head count email contacts") +
  theme_minimal()

I would never have guessed that I am exchanging emails with that many people (no, I am not a spammer…, I hope). Who are my “top contacts”?

panel %>%
  filter(emails_sent_to > 0, emails_received_from > 0) %>%
  group_by(address) %>%
  summarise(email_contacts = sum(emails_sent_to) + sum(emails_received_from),
            sent_rec_ratio = sum(emails_sent_to) / sum(emails_received_from),
            emails_sent_to = sum(emails_sent_to),
            emails_received_from = sum(emails_received_from)) %>%
  arrange(-email_contacts) -> email_contacts

# head(email_contacts, 50)

I am not including this output here for obvious privacy reasons. If you are interested in following your email contacts over time, you can use my ExPanD shiny app for quick exploration.

panel %>%
  filter(emails_sent_to > 0, emails_received_from > 0) %>%
  ExPanD(cs_id = "address", ts_id = "year",
         components = c(missing_values = FALSE,
                        by_group_violin_graph = FALSE,
                        by_group_bar_graph = FALSE))

Using ExPanD, I encountered a reasonable number of my “old acquaintances” when focusing on earlier years of the panel. To do this a little bit more systematically, I prepare a linking table, linking the most prominent email contact addresses in the panel file to real people's names in an n (email addresses) to 1 (person) look-up table. This table has the name email_contacts_names and the fields address and name. Based on this data I produce a new panel sample containing only those email contacts linked to actual persons that I care about.
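Such a look-up table is not shown for privacy reasons; schematically (with invented data) it would look like this:

# purely illustrative -- several addresses map to one person
email_contacts_names <- tribble(
  ~address,                    ~name,
  "jane.doe@example.com",      "Jane Doe",
  "j.doe@oldjob.example.org",  "Jane Doe",
  "john.smith@example.com",    "John Smith"
)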

panel %>%
  left_join(email_contacts_names) %>%
  filter(!is.na(name)) %>%
  group_by(year, name) %>%
  summarise(emails_sent_to = sum(emails_sent_to),
            emails_received_from = sum(emails_received_from)) %>%
  filter(emails_sent_to > 0, emails_received_from > 0) %>%
  mutate(email_contacts = emails_sent_to + emails_received_from,
         sent_rec_ratio = emails_sent_to / emails_received_from) %>%
  arrange(year, -email_contacts) -> panel_email_contacts

      Using this data, you can now easily explore how your contacts changed over time. As one example I prepared a graph highlighting the (relative) number of email contacts with a selected group of people over time. The vector people contains the names of these people.

single_out_people <- function(df, names, relative = FALSE) {
  df$relative <- relative
  palette <- function(n) {
    hues = seq(15, 375, length = n)
    c("#B0B0B0", hcl(h = hues, l = 65, c = 100)[1:n])
  }
  df %>%
    mutate(name = ifelse(name %in% names, name, "Other")) %>%
    group_by(year) %>%
    mutate(val = ifelse(relative, email_contacts/sum(email_contacts), email_contacts)) %>%
    select(year, name, val) %>%
    group_by(year, name) %>%
    summarise(val = sum(val)) %>%
    spread(key = name, value = val, fill = 0) %>%
    gather(key = name, value = val, -year) %>%
    mutate(name = factor(name, levels = c("Other", rev(names)))) %>%
    arrange(year, name) %>%
    ggplot(aes(x = year, y = val, fill = name)) +
    geom_area() +
    scale_fill_manual(values = palette(length(names))) +
    xlab("Year") +
    ylab(ifelse(relative, "Share of annual email contacts", "Annual email contacts")) +
    theme_minimal() -> p
  if (relative) p + scale_y_continuous(labels = scales::percent) else p
}

single_out_people(panel_email_contacts, people, FALSE) + theme(legend.position = "none")

      single_out_people(panel_email_contacts, people, TRUE) + theme(legend.position="none")

The grayish area marks “Other”. With legends excluded for privacy reasons, you can follow old and new friends through time with this little display.

While the Envelope Index database seems to offer much more data that is worth exploring, I will leave it here. Happy New Year everybody and, if you are in the mood: Auld Lang Syne.

      Enjoy!


To leave a comment for the author, please follow the link and comment on their blog: An Accounting and Data Science Nerd's Corner.

      Le Monde puzzle [#1076]

      Thu, 12/27/2018 - 00:18

      (This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A cheesy Le Monde mathematical puzzle (which took me much longer to find [in the sense of locating] than to solve, as Warwick U does not get a daily delivery of the newspaper [and this is pre-Brexit!]):

      Take a round pizza (or a wheel of Gruyère) cut into seven identical slices and turn one slice upside down. If the only possibly moves are to turn three connected slices to their reverse side, how many moves at least are needed to recover the original configuration? What is the starting configuration that requires the largest number of moves?

Since there are only N=2⁷ possible configurations, a brute force exploration is achievable, starting from the perfect configuration requiring zero moves and adding all configurations found by one additional move at a time… until all configurations have been visited and all associated numbers of steps are stable. Here is my R implementation:

nztr=lengz=rep(-1,N) #length & ancestor
nztr[0+1]=lengz[0+1]=0
fundz=matrix(0,Z,Z) #Z=7
for (i in 1:Z){ #only possible moves
  fundz[i,c(i,(i+1)%%Z+Z*(i==(Z-1)),(i+2)%%Z+Z*(i==(Z-2)))]=1
  lengz[bit2int(fundz[i,])+1]=1
  nztr[bit2int(fundz[i,])+1]=0}
while (min(lengz)==-1){ #second loop omitted
  for (j in (1:N)[lengz>-1])
    for (k in 1:Z){
      m=bit2int((int2bit(j-1)+fundz[k,])%%2)+1
      if ((lengz[m]==-1)|(lengz[m]>lengz[j]+1)){
        lengz[m]=lengz[j]+1;nztr[m]=j}
    }}
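The snippet relies on two helper functions, bit2int and int2bit, that are not shown; a minimal interpretation (my assumption, not necessarily the author's code) converts between a 0/1 configuration vector and its integer encoding:

# assumed helpers: map a length-Z 0/1 vector to an integer and back (little-endian)
int2bit <- function(x, Z = 7) as.integer(intToBits(x))[1:Z]
bit2int <- function(b) sum(b * 2^(seq_along(b) - 1))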

      Which produces a path of length five returning (1,0,0,0,0,0,0) to the original state:

> nztry(2)
[1] 1 0 0 0 0 0 0
[1] 0 1 1 0 0 0 0
[1] 0 1 0 1 1 0 0
[1] 0 1 0 0 0 1 0
[1] 1 1 0 0 0 0 1
[1] 0 0 0 0 0 0 0

      and a path of length seven in the worst case:

> nztry(2^7)
[1] 1 1 1 1 1 1 1
[1] 1 1 1 1 0 0 0
[1] 1 0 0 0 0 0 0
[1] 0 1 1 0 0 0 0
[1] 0 1 0 1 1 0 0
[1] 0 1 0 0 0 1 0
[1] 1 1 0 0 0 0 1
[1] 0 0 0 0 0 0 0

      Since the R code was written for an arbitrary number Z of slices, I checked that there is no solution for Z being a multiple of 3.


To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

      Very shiny holidays!

      Wed, 12/26/2018 - 17:51

      (This article was first published on Stories by Sebastian Wolf on Medium, and kindly contributed to R-bloggers)

How could I miss programming just a little bit during the holiday season? But I didn’t want to work on something serious, so I decided to check out some ground work on R-Shiny + jQuery + CSS. The result is some nice holiday greetings inside a shiny app:

      An app to greet your family with shiny

I just googled CSS + holidays and what I found was this article on CSS for the holidays. The Pixel Art Christmas Tree from dodozhang21 really amazed me. But it was missing some shiny features. So I thought, why not make it a custom shiny output? I had already written about custom shiny inputs but never about custom shiny outputs.

Actually, the tree is kind of easy. I found that 83 rows of the CSS file account for the moving pixels of the tree as CSS keyframes. There are 5 keyframes (0%, 25%, 50%, 75%, 100% of the animation). So I read them as data.frames in R. Then I sampled X colors for the number of balls the user wants to find in the tree and replaced X of the 83 rows in the 5 data frames. Then I put everything back into HTML by:

      glue("$('head').append(\"{style_css}\");")

      where glue is the R package that will replace style_css by the CSS I created from my 5 data frames.
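As a rough sketch of how such a string can reach the browser from server code, one simple option is shinyjs::runjs(); this is a simplification of the idea, not the custom output binding the post actually builds, and style_css here is just a placeholder:

library(shiny)
library(shinyjs)
library(glue)

style_css <- "<style>.tree { color: darkgreen; }</style>"  # placeholder CSS

ui <- fluidPage(useShinyjs(), div(class = "tree", "Happy Holidays!"))

server <- function(input, output, session) {
  # append the generated CSS to the page head via jQuery
  runjs(glue("$('head').append(\"{style_css}\");"))
}

shinyApp(ui, server)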

      To see the whole code, you can go to github.

      You think this is not as easy as I thought? Then just get started with R and find the “Jamie” in your company that knows R.

      Happy Holidays!

      Very shiny holidays!



To leave a comment for the author, please follow the link and comment on their blog: Stories by Sebastian Wolf on Medium.

      Part 2: Optimism corrected bootstrapping is definitely bias, further evidence

      Wed, 12/26/2018 - 10:36

      (This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers)

Some people are very fond of the technique known as ‘optimism corrected bootstrapping’; however, this method is biased, and this becomes apparent as we increase the number of noise features to high numbers (as shown very clearly in my previous blog post). This needs exposing. I don’t have the time or the interest to do a publication on this, hence this article. Now, I have reproduced the bias with my own code.

Let’s first show the plot from last time again to recap: as the number of features (that are just noise) increases, this method gives a strongly positive AUC compared with cross validation. The response variable is random and the explanatory variables are random, so we should not get a positive result. Now, we are going to implement the method from scratch, instead of using Caret, and show the same thing.

Perhaps in lower dimensions the method is not so bad, but I suggest avoiding it in all cases, regardless of where it is published and by whom. I consider it something of a hack to get statistical power. Why? Because I have tested it myself. I am sure we could find some simulations that support its use somewhere, but LOOCV and k-fold cross validation do not have this problem. This method is infecting people's work with a positive-results bias in high dimensions, and possibly in lower dimensions as well.

After reading the last post, Frank Harrell indirectly accused the makers of the Caret software of being somehow incorrect in their code implementation, and so decided my results were ‘erroneous’. I doubted this was the case, so now we have more evidence to expose the truth. This simple statistical method is described in the text book and blog post below, for those who are interested:

      Steyerberg, Ewout W. Clinical prediction models: a practical approach to development, validation, and updating. Springer Science & Business Media, 2008.

      http://thestatsgeek.com/2014/10/04/adjusting-for-optimismoverfitting-in-measures-of-predictive-ability-using-bootstrapping/

      The steps of this method, briefly, are as follows:

      1. Fit a model M to entire data S and estimate predictive ability C.
      2. Iterate from b=1…B:
        1. Take a resample from the original data, let this be S*
        2. Fit the bootstrap model M* to S* and get predictive ability, C_boot
        3. Use the bootstrap model M* to get predictive ability on S, C_orig
      3. Optimism O is calculated as mean(C_boot – C_orig)
      4. Calculate optimism corrected performance as C-O.

      This is a general procedure (Steyerberg, 2008) and so I can use glmnet to fit the model and AUC as my measure of predictive performance. Glmnet is basically logistic regression with feature selection, so if this method for estimating ‘optimism’ is a good one, we should be able to estimate the optimism of overfitting this model using the above method, then correct for it.

      Let’s re-run the analysis with my implementation of this method and compare with Caret (which is great software). Caret agrees with my implementation as shown below in fully reproducible code.

This code should be fine to copy and paste into R. The data is random noise, yet optimism corrected bootstrapping gives a strongly positive AUC of 0.86, way above the 0.5 we would expect.

      In contrast, below I give code for Caret cross validation which gives the correct result, just test it for yourself instead of believing statistical dogma.

> corrected
[1] 0.858264

##
library(glmnet)
library(pROC)
library(caret)

## set up test data
test <- matrix(rnorm(100*1000, mean = 0, sd = 1), nrow = 100, ncol = 1000, byrow = TRUE)
labelsa <- as.factor(c(rep('A',50),rep('B',50)))
colnames(test) <- paste('Y',seq(1,1000),sep='')
row.names(test) <- paste('Z',seq(1,100),sep='')

### my own implementation of optimism corrected bootstrapping

## 1. fit model to entire test data (i.e. overfit it)
orig <- glmnet(test,y=labelsa,alpha=1,family = "binomial")
preds <- predict(orig,newx=test,type='response',s=0.01)
auc <- roc(labelsa,as.vector(preds))
original_auc <- as.numeric(auc$auc)

## 2. take resample of data and try to estimate optimism
test2 <- cbind(test,labelsa)
B <- 50
results <- matrix(ncol=2,nrow=B)
for (b in seq(1,B)){
  ## get the bootstrapped resampled data
  boot <- test2[sample(row.names(test2),50),]
  labelsb <- boot[,ncol(boot)]
  boot <- boot[,-ncol(boot)]
  ## use the bootstrapped model to predict its own labels
  bootf <- glmnet(boot,y=labelsb,alpha=1,family = "binomial")
  preds <- predict(bootf,newx=boot,type='response',s=0.01)
  auc <- roc(labelsb,as.vector(preds))
  boot_auc <- as.numeric(auc$auc)
  ## use bootstrap model to predict original labels
  preds <- predict(bootf,newx=test,type='response',s=0.01)
  auc <- roc(labelsa,as.vector(preds))
  boot_original_auc <- as.numeric(auc$auc)
  ## add the data to results
  results[b,1] <- boot_auc
  results[b,2] <- boot_original_auc
}
## calculate the optimism
O <- mean(results[,1]-results[,2])
## calculate optimism corrected measure of prediction (AUC)
corrected <- original_auc-O

### caret implementation of optimism corrected bootstrapping
test3 <- data.frame(test2)
test3$labelsa <- as.factor(test3$labelsa)
test3$labelsa <- make.names(test3$labelsa)
ctrl <- trainControl(method = 'optimism_boot',
                     summaryFunction=twoClassSummary,
                     classProbs=T,
                     savePredictions = T,
                     verboseIter = T)
fit1 <- train(as.formula( paste( 'labelsa', '~', '.' ) ),
              data=test3,
              method="glmnet",
              # preProc=c("center", "scale")
              trControl=ctrl,
              metric = "ROC")

### sanity check, using cv
ctrl <- trainControl(method = 'cv',
                     summaryFunction=twoClassSummary,
                     classProbs=T,
                     savePredictions = T,
                     verboseIter = T)
fit2 <- train(as.formula( paste( 'labelsa', '~', '.' ) ),
              data=test3,
              method="glmnet",
              # preProc=c("center", "scale")
              trControl=ctrl,
              metric = "ROC")

Here is a plot repeating the Caret results, produced with a simple loop that iteratively adds noise-only features and computes the corrected AUC for each iteration, but this time with my own implementation instead of Caret. You can try this out by sticking a for loop around the code above to calculate the 'corrected' AUC. You can see the same misleading trend as in the last post.

      (X axis = number of noise only features added to random labels)
      (Y axis = ‘optimism corrected bootstrap AUC’)

People are currently using this method because it is published and they somehow think published methods do not have fundamental problems, which is a widespread misconception.

R machine learning practitioners: do not trust methods just because some statistician says they are good in a paper or a forum; do not pick pet methods you love and so can never see the flaws in; test different methods and run your own positive and negative controls. Try to be objective.

       


To leave a comment for the author, please follow the link and comment on their blog: R – intobioinformatics.

      Finally, You Can Plot H2O Decision Trees in R

      Wed, 12/26/2018 - 06:23

      (This article was first published on novyden, and kindly contributed to R-bloggers)

Creating and plotting decision trees (like the one below) for models created in H2O will be the main objective of this post:

      Figure 1. Decision Tree Visualization in R

      Decision Trees with H2O

With release 3.22.0.1, H2O-3 (a.k.a. open source H2O or simply H2O) added support for one more algorithm to its family of tree-based methods (which already included DRF, GBM, and XGBoost): Isolation Forest (a random forest for unsupervised anomaly detection). Until recently there was no simple way to visualize model trees except the clunky (albeit reliable) method of creating a MOJO object and running a combination of Java and dot commands.

That changed in 3.22.0.1 too, with the introduction of a unified Tree API that works with any of the tree-based algorithms above. Data scientists are now able to utilize powerful visualization tools in R (or Python) without resorting to producing intermediate artifacts like MOJOs and running external utilities. Please read the article by Pavel Pscheidl, who did a superb job of explaining the H2O Tree API and its S4 classes in R, before coming back to take it a step further and visualize trees.

      The Workflow: from Data to Decision Tree

Whether you are still here or came back after reading Pavel’s excellent post, let’s set the goal straight: create a single decision tree model in H2O and visualize its tree graph. With H2O there is always a choice between using Python or R; the choice of R here will become clear when discussing its graphical and analytical capabilities later.

CART models operate on labeled data (classification and regression) and offer arguably unmatched model interpretability by means of analyzing a tree graph. In data science there is never a single way to solve a given problem, so let’s define an end-to-end logical workflow from “raw” data to a visualized decision tree:

      Figure 2. Workflow of tasks in this post

One may argue that the allocation of tasks between H2O and R could be different, but let’s follow the outlined plan for this post. The next diagram adds implementation specifics to each step:

      • R package data.table for data munging
      • H2O grid for hyper-parameter search
      • H2O GBM for modeling single decision tree algorithm
      • H2O Tree API for tree model representation
      • R package data.tree for visualization  
      Figure 3. Workflow of tasks in this post with implementation details


      Discussion of this workflow continues for the rest of this post.

      Titanic Dataset

      The famous Titanic dataset contains information about the fate of passengers of the RMS Titanic that sank after colliding with an iceberg. It regularly serves as toy data for blog exercises like this.

H2O's public S3 bucket holds the Titanic dataset readily available, and using the package data.table makes loading it into R a fast one-liner:
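The original code chunk is not included in this feed; the one-liner presumably looks something like the following (the S3 URL is the one commonly used in H2O examples and should be treated as an assumption):

library(data.table)
# assumed location of the Titanic CSV in H2O's public test-data bucket
titanic_df <- fread("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")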


      Data Engineering

Passenger features from the Titanic dataset are discussed at length online, e.g. see Predicting the Survival of Titanic Passengers and Predicting Titanic Survival using Five Algorithms. To summarize, the following features were selected and engineered for the decision tree model:

      • survived indicates if passenger survived the wreck
      • boat and body leak survival outcome and were dropped completely before modeling
      • name and cabin are too noisy as they are and only used to derive new features
      • title is parsed from name
      • cabin_type is parsed from cabin
      • family_size and family_type are derived from combination of count features sibsp (siblings+spouse) and parch (parents+children)
      • ticket and home.dest are dropped to preserve simplicity of the model
      • missing values in age and fare are imputed using target encoding (mean) over grouping by survived, sex, and embarked columns. 

      Data load and data munging steps above are implemented in R using data.table:

      Starting with H2O

Creating models with H2O requires running a server process (remote or local) and a client (the package h2o in R, available from CRAN), where the latter connects and sends commands to the former. The Tree API was introduced with release 3.22.0.1 (10/26/2018), but due to CRAN policies the h2o package usually lags several versions behind (at the time of this writing CRAN hosted version 3.20.0.8). There are two ways to work around this:

1. Install and run the package available from CRAN and use strict_version_check=FALSE inside h2o.connect() to communicate with the newer version running on the server
2. Or install the latest version of h2o available from the H2O repository, either to connect to a remote server or to both connect and run a server locally.

The Tree API is available only with the 2nd option because it requires access to new classes and functions in the h2o package (remember, I asked you to read Pavel’s blog). The code below, from the official H2O download page, shows how to download and install the latest version of the package:
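That chunk did not survive the feed extraction; the usual pattern from the download page looks roughly like this (the release repository URL changes with every version, so treat it as a placeholder):

# remove any prior CRAN version, then install the latest release from H2O's repository
if ("package:h2o" %in% search()) detach("package:h2o", unload = TRUE)
if ("h2o" %in% rownames(installed.packages())) remove.packages("h2o")
install.packages("h2o", type = "source",
                 repos = "http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")  # placeholder URL
library(h2o)
h2o.init()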

      Building Decision Tree with H2O

While H2O offers no dedicated single decision tree algorithm, there are two approaches using models that subsume it:

Choosing the GBM option requires one less line of code (no need to calculate the number of features to set mtries), so it was used for this post. Otherwise both ways result in the same decision tree, with the steps below fully reproducible using h2o.randomForest() instead of h2o.gbm().

      Decision Tree Depth

When building single decision tree models, maximum tree depth stands as the most important parameter to pick. Shallow trees tend to underfit by failing to capture important relationships in the data, producing similar trees despite varying training data (error due to high bias). On the other hand, trees grown too deep overfit by reacting to noise and slight changes in the data (error due to high variance). Tuning the H2O model parameter max_depth, which limits decision tree depth, aims at balancing the effects of bias and variance. In R, using H2O to split the data and tune the model, then visualizing the results with ggplot to look for the right value, unfolds like this:

      1. split Titanic data into training and validation sets
      2. define grid search object with parameter max_depth
      3. launch grid search on GBM models and grid object to obtain AUC values (model performance)
      4. plot grid model AUC’es vs. max_depth values to determine “inflection point” where AUC growth stops or saturates (see plot below)
      5. register tree depth value at inflection point to use in the final model

The code below implements these steps:
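The original code chunk is likewise missing from the feed; a minimal sketch of the grid search described above could look like this (the target column name "survived" and the H2OFrame titanic_hex holding the engineered data are assumptions):

# assumes titanic_hex = as.h2o(<engineered Titanic data>) with target column "survived"
titanic_hex$survived <- as.factor(titanic_hex$survived)   # classification target
splits <- h2o.splitFrame(titanic_hex, ratios = 0.8, seed = 1)
train <- splits[[1]]
valid <- splits[[2]]

depth_grid <- h2o.grid(
  algorithm = "gbm", grid_id = "depth_grid",
  x = setdiff(names(titanic_hex), "survived"), y = "survived",
  training_frame = train, validation_frame = valid,
  ntrees = 1, min_rows = 1, sample_rate = 1, col_sample_rate = 1, seed = 1,
  hyper_params = list(max_depth = 1:20)
)
grid_results <- h2o.getGrid("depth_grid", sort_by = "auc", decreasing = TRUE)

# plot validation AUC against max_depth to spot the inflection point
library(ggplot2)
grid_df <- as.data.frame(grid_results@summary_table)
grid_df$max_depth <- as.numeric(grid_df$max_depth)
grid_df$auc <- as.numeric(grid_df$auc)
ggplot(grid_df, aes(max_depth, auc)) + geom_point() + geom_line() + theme_minimal()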


and produces a chart that points to an inflection point for maximum tree depth at 5:

Figure 4. Visualization of AUC vs. maximum tree depth hyper-parameter trend
extracted from the H2O grid object after running grid search in H2O.
Marked inflection point indicates when increasing maximum tree depth
no longer improves model performance on the validation set.

Creating Decision Tree

As evident from Figure 4, the optimal decision tree depth is 5. The code below constructs a single decision tree model in H2O and then retrieves its tree representation from the GBM model with the Tree API function h2o.getModelTree(), which creates an instance of the S4 class H2OTree assigned to the variable titanicH2oTree:
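Again the original chunk is missing from the feed; a hedged sketch of what it presumably does (same assumed objects as above):

# a single-tree GBM: one tree, no row/column subsampling, depth from the grid search
titanic_1tree <- h2o.gbm(
  x = setdiff(names(titanic_hex), "survived"), y = "survived",
  training_frame = titanic_hex,
  ntrees = 1, min_rows = 1, sample_rate = 1, col_sample_rate = 1,
  max_depth = 5, seed = 1
)

# retrieve the tree structure via the Tree API
titanicH2oTree <- h2o.getModelTree(model = titanic_1tree, tree_number = 1)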

At this point all the action moves back inside R, with its unparalleled access to analytical and visualization tools. So before navigating and plotting a decision tree – the final goal of this post – let’s have a brief intro to networks in R.

      Overview of Network Analysis in R

R offers arguably the richest functionality when it comes to analyzing and visualizing network (graph, tree) objects. Before taking on the task of conquering it, spend time visiting a couple of comprehensive articles describing the vast landscape of tools and approaches available: Static and dynamic network visualization with R by Katya Ognyanova and Introduction to Network Analysis with R by Jesse Sadler.

To summarize, there are two commonly used packages to manage and analyze networks in R: network (part of the statnet family) and igraph (a family in itself). Each package implements namesake classes to represent network structures, so there is significant overlap between the two and they mask each other’s functions. The preferred approach is picking only one of the two: it appears that igraph is more common for general-purpose applications while network is preferred for social network and statistical analysis (my subjective assessment). And while researching these packages do not forget about the package intergraph, which seamlessly transforms objects between network and igraph classes. (And this analysis stopped short of expanding into the universe of R packages hosted on Bioconductor.)

When it comes to visualizing networks, the choices quickly proliferate. Both network and igraph offer graphical functions that use the R base plotting system, but it doesn’t stop there. The following packages specialize in advanced visualizations for at least one or both of the classes:

      • ggraph
      • ggnet2
      • ggnetwork
      • visNetwork
      • DiagrammeR
      • networkD3

Finally, there is the package data.tree, designed specifically to create and analyze trees in R. It fits the bill of representing and visualizing decision trees perfectly, so it became the tool of choice for this post. Still, visualizing H2O model trees could be fully reproduced with any of the network and visualization packages mentioned above.

      Visualizing H2O Trees

In the last step the decision tree for the model created by GBM moved from H2O cluster memory to an H2OTree object in R by means of the Tree API. Still, being specific to H2O, the H2OTree object now contains the necessary details about the decision tree, but not in a format understood by R packages such as data.tree.

To fill this gap, the function createDataTree(H2OTree) was created; it traverses a tree and translates it from H2OTree into data.tree, accumulating information about decision tree splits and predictions into node and edge attributes of the tree:

Finally, everything is lined up and ready for the final step of plotting the decision tree:

      • single decision tree model created in H2O
      • its structure made available in R
      • and translated to specialized data.tree for network analysis.

Styling and plotting data.tree objects is built around the rich functionality of the DiagrammeR package. For anything that goes beyond simple plotting read the documentation here, but also remember that for plotting data.tree takes advantage of:

      • hierarchical nature of tree structures
      • GraphViz attributes to style graph, node and edge properties
• and dynamic callback functions (in this example GetEdgeLabel(node), GetNodeShape(node), GetFontName(node)) to customize the tree's look and feel

      The following code will produce this moderately customized decision tree for our H2O model: 
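The version below is a sketch rather than the author's exact code: the callback bodies, and the titanicDataTree object produced by the translation step above, are illustrative assumptions.

library(data.tree)
library(DiagrammeR)

# Edge labels show the split condition stored on each child node;
# leaves are drawn as boxes and split nodes as diamonds
GetEdgeLabel <- function(node) node$description
GetNodeShape <- function(node) ifelse(node$isLeaf, "box", "diamond")
GetFontName  <- function(node) "Palatino-italic"

SetEdgeStyle(titanicDataTree, fontname = "Palatino-italic", label = GetEdgeLabel)
SetNodeStyle(titanicDataTree, fontname = GetFontName, shape = GetNodeShape)
SetGraphStyle(titanicDataTree, rankdir = "LR")

plot(titanicDataTree)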

      Figure 5. H2O Decision Tree for Titanic Model Visualized with data.tree in R

           



      Statistical Assessments of AUC

      Wed, 12/26/2018 - 04:59

      (This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers)

In scorecard development, the area under the ROC curve, also known as AUC, has been widely used to measure the performance of a risk scorecard. All else being equal, the scorecard with a higher AUC is considered more predictive than the one with a lower AUC. However, little attention has been paid to the statistical analysis of AUC itself during scorecard development.

While it might be less of a concern to rely on a simple comparison of AUC for model selection in the development stage and then pick the scorecard with the higher AUC, more attention should be paid to AUC analysis in the post-development stage. For instance, senior management would need to decide whether it is worthwhile to retire a legacy scorecard that might still be performing and launch the full-scale deployment of a new scorecard just for an increase in AUC that might not even be statistically significant. While the claim of certain business benefits can always be used as an argument in favor of the new scorecard, the justification becomes even more compelling with solid statistical evidence. What's more, the model validation analyst might also want to leverage the outcome of the AUC analysis to ensure the statistical soundness of the new scorecard.

In the example below, two logistic regressions were estimated: the model with six variables yields AUC = 0.6554 and BIC = 6,402, and the model with three variables yields AUC = 0.6429 and BIC = 6,421.

df1 <- read.csv("Downloads/credit_count.txt")
df2 <- df1[which(df1$CARDHLDR == 1), ]
y <- "DEFAULT"
x1 <- c("OWNRENT", "INCOME", "INCPER", "LOGSPEND", "AGE", "EXP_INC")
x2 <- c("MAJORDRG", "MINORDRG", "INCOME")

m1 <- glm(eval(paste(y, paste(x1, collapse = " + "), sep = " ~ ")), data = df2, family = binomial)
#              Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.749e-01  1.659e-01  -1.054 0.291683
#OWNRENT     -2.179e-01  7.686e-02  -2.835 0.004581 **
#INCOME      -2.424e-04  4.364e-05  -5.554 2.79e-08 ***
#INCPER      -1.291e-05  3.318e-06  -3.890 0.000100 ***
#LOGSPEND    -2.165e-01  2.848e-02  -7.601 2.95e-14 ***
#AGE         -8.330e-03  3.774e-03  -2.207 0.027312 *
#EXP_INC      1.340e+00  3.467e-01   3.865 0.000111 ***

BIC(m1)
# 6401.586

roc1 <- pROC::roc(response = df2$DEFAULT, predictor = fitted(m1))
# Area under the curve: 0.6554

m2 <- glm(eval(paste(y, paste(x2, collapse = " + "), sep = " ~ ")), data = df2, family = binomial)
#              Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.222e+00  9.076e-02 -13.459  < 2e-16 ***
#MAJORDRG     2.031e-01  6.921e-02   2.934  0.00335 **
#MINORDRG     1.920e-01  4.784e-02   4.013 5.99e-05 ***
#INCOME      -4.706e-04  3.919e-05 -12.007  < 2e-16 ***

BIC(m2)
# 6421.232

roc2 <- pROC::roc(response = df2$DEFAULT, predictor = fitted(m2))
# Area under the curve: 0.6429

Both the AUC and BIC statistics seem to favor the first model. However, is a roughly 2% difference in AUC significant enough to indicate a better model? Under the Null Hypothesis of no difference in AUC, three statistical tests were employed to assess the difference in AUC / ROC between the two models.

set.seed(2019)

# REFERENCE:
# A METHOD OF COMPARING THE AREAS UNDER RECEIVER OPERATING CHARACTERISTIC CURVES DERIVED FROM THE SAME CASES
# HANLEY JA, MCNEIL BJ (1983)
pROC::roc.test(roc1, roc2, method = "bootstrap", boot.n = 500, progress = "none", paired = T)
# D = 1.7164, boot.n = 500, boot.stratified = 1, p-value = 0.0861

# REFERENCE:
# COMPARING THE AREAS UNDER TWO OR MORE CORRELATED RECEIVER OPERATING CHARACTERISTIC CURVES: A NONPARAMETRIC APPROACH
# DELONG ER, DELONG DM, CLARKE-PEARSON DL (1988)
pROC::roc.test(roc1, roc2, method = "delong", paired = T)
# Z = 1.7713, p-value = 0.0765

# REFERENCE:
# A DISTRIBUTION-FREE PROCEDURE FOR COMPARING RECEIVER OPERATING CHARACTERISTIC CURVES FROM A PAIRED EXPERIMENT
# VENKATRAMAN ES, BEGG CB (1996)
pROC::roc.test(roc1, roc2, method = "venkatraman", boot.n = 500, progress = "none", paired = T)
# E = 277560, boot.n = 500, p-value = 0.074

      Based upon the above output, there is no strong statistical evidence against the Null Hypothesis.

pscl::vuong(m1, m2)
#               Vuong z-statistic             H_A  p-value
#Raw                   2.0963830 model1 > model2 0.018024
#AIC-corrected         1.8311449 model1 > model2 0.033539
#BIC-corrected         0.8684585 model1 > model2 0.192572

In addition, a Vuong test was performed, supporting no difference between the two models after correcting for the Schwarz (BIC) penalty.


