bounceR 0.1.2: Automated Feature Selection
(This article was first published on r-bloggers – STATWORX, and kindly contributed to R-bloggers)
New Features

As promised, we kept on working on our bounceR package. First, we changed the interface: users no longer have to choose a number of tuning parameters that, thanks to my somewhat cryptic documentation, sounded more complicated than they are. Inspired by H2O.ai's feature of letting users set the time they are willing to wait, instead of a number of cryptic tuning parameters, we added a similar function.
Further, we changed the source code quite a bit. Henrik Bengtsson gave a very inspiring talk on parallelization with the ingenious future package at this year's eRum conference. A couple of days later, Davis Vaughan released furrr, an incredibly smart wrapper (kudos!) on top of the no less ingenious purrr package. Davis' package combines purrr's mapping functions with future's parallelization madness. As you can tell, I am a big fan of all three packages. Inspired by these developments, we wanted to make use of them in our package, so the entire parallelization setup of bounceR now leverages furrr. This way, the parallelization is much smarter and faster, and it works seamlessly across operating systems.
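To illustrate the kind of switch this involves (a minimal sketch, not bounceR's actual internals): replacing purrr::map() with furrr::future_map() parallelizes a mapping over whatever plan() the user has set.

```r
library(future)
library(furrr)

# run the mapping on two background R sessions;
# plan(sequential) would run the same code serially
plan(multisession, workers = 2)

# future_map() is a drop-in replacement for purrr::map()
squares <- future_map(1:4, ~ .x^2)
```

The same code runs unchanged on Windows, macOS and Linux, which is what makes the plan() abstraction so convenient.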
Practical Example

So, let's see how you can use it now. Let's start by installing the package.
# you need devtools, because we are just about to get it to CRAN, though we are not that far
library(devtools)

# now you are good to go
devtools::install_github("STATWORX/bounceR")

# now you can load it like every normal package
library(bounceR)

To show how the feature selection works, we need some data, so let's simulate some with our sim_data() function.
# simulate some data
data <- sim_data(n = 100, modelvars = 10, noisevars = 300)

Now you can all imagine that with 310 features on 100 observations, building models could be a little challenging. To be able to model the target at all, you need to reduce your feature space. There are numerous ways to do so; in my last blog post I described our solution. Let's see how to use our algorithm.
# run our algorithm
selected_features <- featureSelection(data = data,
                                      target = "y",
                                      max_time = "30 mins",
                                      bootstrap = "regular",
                                      early_stopping = TRUE,
                                      parallel = TRUE)

What can you expect to get out of it? We return a list containing, of course, the optimal formula calculated by our algorithm. You also get a stability matrix, in which you can see a ranking of the features by importance. Additionally, we built in some convenient S4 methods, so you can easily access all the information you need.
Outlook

I hope I could tease you a little into checking out the package and helping us improve it further. Currently, we are developing two new algorithms for feature selection, which we will implement in the next iteration as well. I am looking forward to your comments, issues and thoughts on the package.
Cheers Guys!
About the author: Lukas Strömsdörfer. The post bounceR 0.1.2: Automated Feature Selection first appeared on STATWORX.
To leave a comment for the author, please follow the link and comment on their blog: r-bloggers – STATWORX. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Most Starred R Packages on GitHub
(This article was first published on Blog-rss on stevenmortimer.com, and kindly contributed to R-bloggers)
It seems like all the best R packages proudly use GitHub and have a README adorned with badges across the top. The recent Microsoft acquisition of GitHub got me wondering: What proportion of current R packages use GitHub? Or at least refer to it in the URL of the package description. Also, what is the relationship between the number of CRAN downloads and the number of stars on a repository? My curiosity got the best of me so I hastily wrote a script to pull the data. Click here to go straight to the full script and data included at the bottom of this post. I acknowledge there are more elegant ways to have coded this, but let’s press on.
Pulling List of Packages & their Details

CRAN provides a list and links to all current packages at https://cran.rstudio.com/src/contrib. By scraping this page I found 12,675 current R packages (non-archived). For each package I pinged its detail page using the canonical link form (e.g. https://cran.r-project.org/package=ggplot2). From there I looked at the package description fields "URL" and "BugReports" to see if either contained "github.com". It turns out that 3,718 of the packages (29.3% of the total) referenced GitHub. Below is the code snippet for retrieving the list of packages, taken from Gergely Daróczi's gist.
# getting list of the packages
pkgs <- readHTMLTable(readLines(file.path('https://cran.rstudio.com/src/contrib')),
                      which = 1, stringsAsFactors = FALSE)[,-1]

# filter out lines that aren't really packages
pkgs <- pkgs %>%
  filter(Size != "-", grepl('tar.gz$', Name)) %>%
  mutate(Name = sub('^([a-zA-Z0-9\\.]*).*', '\\1', Name))

While retrieving the package metadata I pinged the GitHub API to see if I could get the number of stars for each repository. Currently, GitHub allows 5,000 authenticated requests per hour (link), but only 3,718 packages referenced GitHub, so I could make all the requests at once. Here is the function I used: it takes a cleaned-up version of the package's URL and forms a request to the GitHub API to get star counts:
# get the star count from a clean version of the package's URL
gh_star_count <- function(url){
  stars <- tryCatch({
    this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url)
    req <- GET(this_url, gtoken)
    stop_for_status(req)
    cont <- content(req)
    cont$stargazers_count
  }, error = function(e){
    return(NA_integer_)
  })
  return(stars)
}

Analyzing the Data

Once I had all the package detail data, I found that R packages, on average, have 35.7 GitHub stars, but the median number of stars is only 6! ggplot2 has the most stars with 3,174. In my analysis I removed the xgboost, h2o, and feather packages, which point to the repositories of their implementations in many languages, not just R.
What I really found interesting was comparing CRAN downloads to GitHub repo stars. Using the cranlogs package I was able to get the total package downloads dating back to January 1, 2014. In contrast with the low star counts, the median downloads for R packages is 8,975. Combining stars and downloads data I found that the median R package has 903 downloads per star. Only 38.7% of packages had more than 10 stars, which shows how hard stars are to get even if you’ve written a great package. I’m not sure what proportion of R users frequently reference and contribute to GitHub, but it would be interesting to compare that with the high ratios of downloads to stars.
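The downloads-per-star figures above are simply downloads divided by stars. A minimal dplyr sketch with made-up numbers (not the real CRAN data) shows the shape of the computation:

```r
library(dplyr)

# hypothetical packages, illustrative numbers only
pkg_stats <- tibble(
  name      = c("pkgA", "pkgB", "pkgC"),
  downloads = c(9000, 500000, 120000),
  stars     = c(10, 250, 40)
) %>%
  mutate(downloads_per_star = downloads / stars)

# the statistic reported in the post
median(pkg_stats$downloads_per_star)
```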
There are some real outliers in the data. For example, the Rcpp package, perhaps the most downloaded package of all-time, has 15.8M downloads and only 377 stars. Similarly, Hadley’s scales package has 9.4M downloads and only 115 stars. These support/helper packages just don’t get the same star love as the headliners like ggplot2, shiny, and dplyr.
Of course, I could not help but check out the stats for some of the most prolific package authors. After parsing out individuals with the ["aut", "cre"] roles I came to the not so surprising conclusion that Hadley has the most stars of any author with 12,408 stars. In contrast, Dirk Eddelbuettel had one of the lowest star-to-download ratios. For every ~38K downloads Dirk’s repositories will receive one star. Pay no attention to the man behind the curtain since his Rcpp package underpins a whole host of packages without all the GitHub fanfare. Here is a list of popular R package authors and their stats:
| Author | Notable Packages | Downloads | Stars | Downloads Per Star |
|---|---|---|---|---|
| Hadley Wickham | ggplot2, dplyr, httr | 113,160,314 | 12,408 | 9,119.9 |
| Dirk Eddelbuettel | Rcpp, BH | 28,433,586 | 745 | 38,165.9 |
| Yihui Xie | knitr, rmarkdown, bookdown | 42,472,860 | 6,315 | 6,725.7 |
| Winston Chang | R6, shiny | 17,161,005 | 4,027 | 4,261.5 |
| Jennifer Bryan | readxl, gapminder, googlesheets | 6,055,774 | 1,714 | 3,533.1 |
| JJ Allaire | rstudioapi, reticulate, tensorflow | 8,882,553 | 2,798 | 3,174.6 |
| Jeroen Ooms | jsonlite, curl, openssl | 25,907,868 | 1,483 | 17,469.9 |
| Scott Chamberlain | geojsonio, taxize | 1,770,664 | 2,528 | 700.4 |
| Jim Hester | devtools, memoise, readr | 22,867,071 | 4,332 | 5,278.6 |
| Kirill Müller | tibble, DBI | 36,159,009 | 1,077 | 33,573.8 |

I'm sure you could create mixed models to determine the unique download to star relationship for individuals. Also, you could use other package attributes to predict stars or downloads, but I'll leave that to another curious soul. I will include tables below regarding the top 10 most downloaded, most starred, most and least downloaded per star.
Credits

Credit is due since this script borrows a couple different pieces of code and concepts. Retrieving the list of packages is from Gergely Daróczi's gist. Authenticating to GitHub was taken from one of the httr demos.
Appendix

Top 10 Most Starred Packages

| Name | Author | Downloads | Stars | Downloads Per Star |
|---|---|---|---|---|
| ggplot2 | Hadley Wickham | 13,001,703 | 3,174 | 4,096.3 |
| shiny | Winston Chang | 4,571,794 | 2,902 | 1,575.4 |
| dplyr | Hadley Wickham | 8,276,844 | 2,408 | 3,437.2 |
| devtools | Jim Hester | 5,536,730 | 1,645 | 3,365.8 |
| knitr | Yihui Xie | 7,131,564 | 1,581 | 4,510.8 |
| data.table | Matt Dowle | 6,005,795 | 1,457 | 4,122.0 |
| plotly | Carson Sievert | 1,195,880 | 1,255 | 952.9 |
| rmarkdown | Yihui Xie | 5,432,495 | 1,160 | 4,683.2 |
| tensorflow | JJ Allaire | 94,856 | 1,033 | 91.8 |
| bookdown | Yihui Xie | 126,586 | 1,009 | 125.5 |

Top 10 Most Downloaded Packages with Stars

| Name | Author | Downloads | Stars | Downloads Per Star |
|---|---|---|---|---|
| Rcpp | Dirk Eddelbuettel | 15,824,781 | 377 | 41,975.5 |
| ggplot2 | Hadley Wickham | 13,001,703 | 3,174 | 4,096.3 |
| stringr | Hadley Wickham | 11,547,828 | 268 | 43,088.9 |
| stringi | Marek Gagolewski | 11,310,113 | 122 | 92,705.8 |
| digest | Dirk Eddelbuettel with contributions by Antoine Lucas | 11,233,244 | 42 | 267,458.2 |
| plyr | Hadley Wickham | 10,340,396 | 470 | 22,000.8 |
| R6 | Winston Chang | 9,993,128 | 212 | 47,137.4 |
| reshape2 | Hadley Wickham | 9,582,245 | 173 | 55,388.7 |
| scales | Hadley Wickham | 9,380,757 | 115 | 81,571.8 |
| jsonlite | Jeroen Ooms | 9,112,790 | 176 | 51,777.2 |

Top 10 Packages by Stars per Download (frequently starred)

| Name | Author | Downloads | Stars | Downloads Per Star |
|---|---|---|---|---|
| r2d3 | Javier Luraschi | 416 | 235 | 1.77 |
| workflowr | John Blischak | 448 | 169 | 2.65 |
| goodpractice | Hannah Frick | 523 | 192 | 2.72 |
| xtensor | Johan Mabille | 2,057 | 664 | 3.10 |
| scico | Thomas Lin Pedersen | 185 | 59 | 3.14 |
| shinytest | Winston Chang | 418 | 113 | 3.70 |
| furrr | Davis Vaughan | 724 | 171 | 4.23 |
| pkgdown | Hadley Wickham | 1,589 | 332 | 4.79 |
| rtika | Sasha Goodman | 168 | 32 | 5.25 |
| mindr | Peng Zhao | 2,051 | 368 | 5.57 |

Bottom 10 Packages by Stars per Download (infrequently starred)

| Name | Author | Downloads | Stars | Downloads Per Star |
|---|---|---|---|---|
| mime | Yihui Xie | 7,398,765 | 12 | 616,563.8 |
| pkgmaker | Renaud Gaujoux | 1,228,173 | 2 | 614,086.5 |
| rngtools | Renaud Gaujoux | 1,224,959 | 2 | 612,479.5 |
| magic | Robin K. S. Hankin | 344,741 | 1 | 344,741.0 |
| gsubfn | G. Grothendieck | 675,056 | 2 | 337,528.0 |
| bindrcpp | Kirill Müller | 2,996,452 | 10 | 299,645.2 |
| plogr | Kirill Müller | 3,343,099 | 12 | 278,591.6 |
| digest | Dirk Eddelbuettel with contributions by Antoine Lucas | 11,233,244 | 42 | 267,458.2 |
| munsell | Charlotte Wickham | 7,778,712 | 31 | 250,926.2 |
| proto | Hadley Wickham | 2,593,246 | 11 | 235,749.6 |

Full Script

Below and available via gist with data at: https://gist.github.com/StevenMMortimer/1b4b626d3d91240a77f969ae04b37114
# load packages & custom functions ---------------------------------------------
library(tidyverse)
library(httr)
library(cranlogs)
library(XML)
library(ggrepel)
library(scales)
library(knitr)
library(stringr)

gh_from_url <- function(x){
  if(!grepl(',', x)){
    x <- strsplit(x, " ")[[1]]
    x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))])
  } else {
    x <- strsplit(x, ",")[[1]]
    x <- trimws(x[min(which(grepl(pattern='http://github.com|https://github.com|http://www.github.com', x, ignore.case=TRUE)))])
  }
  x <- gsub("http://", "https://", tolower(x))
  x <- gsub("www\\.github\\.com", "github.com", x)
  x <- gsub("/$", "", x)
  x <- gsub("^github.com", "https://github.com", x)
  x <- gsub("/issues", "", x)
  x <- gsub("\\.git", "", x)
  return(x)
}

aut_maintainer_from_details <- function(x){
  x <- gsub("'|\"", "", x)
  if(grepl(',', x)){
    x <- strsplit(x, "\\],")[[1]]
    aut_cre_ind <- grepl(pattern='\\[aut, cre|\\[cre, aut|\\[cre', x, ignore.case=TRUE)
    if(any(aut_cre_ind)){
      x <- x[min(which(aut_cre_ind))]
      x <- gsub("\\[aut, cre|\\[cre, aut|\\[cre", "", x)
    }
    x <- strsplit(x, ",")[[1]][1]
    x <- trimws(gsub("\\]", "", x))
    x <- trimws(gsub(" \\[aut", "", x))
  }
  return(x)
}

gh_star_count <- function(url){
  stars <- tryCatch({
    this_url <- gsub("https://github.com/", "https://api.github.com/repos/", url)
    req <- GET(this_url, gtoken)
    stop_for_status(req)
    cont <- content(req)
    cont$stargazers_count
  }, error = function(e){
    return(NA_integer_)
  })
  return(stars)
}

# authenticate to github -------------------------------------------------------
# use Hadley's key and secret
myapp <- oauth_app("github",
                   key = "56b637a5baffac62cad9",
                   secret = "8e107541ae1791259e9987d544ca568633da2ebf")
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
gtoken <- config(token = github_token)

# pull list of packages --------------------------------------------------------
# get list of currently available packages on CRAN
pkgs <- readHTMLTable(readLines(file.path('https://cran.rstudio.com/src/contrib')),
                      which = 1, stringsAsFactors = FALSE)[,-1]

# filter out lines that aren't really packages
pkgs <- pkgs %>%
  filter(Size != "-", grepl('tar.gz$', Name)) %>%
  mutate(Name = sub('^([a-zA-Z0-9\\.]*).*', '\\1', Name)) %>%
  distinct(Name, .keep_all = TRUE)

# get details for each package -------------------------------------------------
all_pkg_details <- NULL

# old fashioned looping!
# WARNING: This takes awhile to complete
for(i in 1:nrow(pkgs)){
  if(i %% 100 == 0){
    message(sprintf("Processing package #%s out of %s", i, nrow(pkgs)))
  }
  pkg_details <- readHTMLTable(readLines(file.path(sprintf('https://cran.r-project.org/package=%s', pkgs[i,]$Name))),
                               header=FALSE, which = 1, stringsAsFactors = FALSE)
  pkg_details <- pkg_details %>%
    mutate(V1 = gsub(":", "", V1)) %>%
    spread(V1, V2)
  this_url <- pkg_details$URL
  on_github <- FALSE
  this_github_url <- NA_character_
  gh_stars <- NA_integer_
  if(!is.null(this_url)){
    on_github <- grepl('http://github.com|https://github.com|http://www.github.com', this_url)
    if(on_github){
      this_github_url <- gh_from_url(this_url)
      gh_stars <- gh_star_count(this_github_url)
    } else {
      # check the BugReports URL as a backup (e.g. shiny package references GitHub this way)
      issues_on_github <- grepl('http://github.com|https://github.com|http://www.github.com', pkg_details$BugReports)
      if(length(issues_on_github) == 0 || !issues_on_github){
        this_github_url <- NA_character_
      } else {
        this_github_url <- gh_from_url(pkg_details$BugReports)
        gh_stars <- gh_star_count(this_github_url)
        on_github <- TRUE
      }
    }
  } else {
    this_url <- NA_character_
  }
  downloads <- cran_downloads(pkgs[i,]$Name, from = "2014-01-01", to = "2018-06-15")
  all_pkg_details <- rbind(all_pkg_details,
                           tibble(name = pkgs[i,]$Name,
                                  published = pkg_details$Published,
                                  author = aut_maintainer_from_details(pkg_details$Author),
                                  url = this_url,
                                  github_ind = on_github,
                                  github_url = this_github_url,
                                  downloads = sum(downloads$count),
                                  stars = gh_stars))
}

# basic summary stats ----------------------------------------------------------
# remove observations where the GitHub URL refers to a repository that
# is not specific to R and therefore might have an inflated star count
all_pkg_details_clean <- all_pkg_details %>%
  filter(!(name %in% c('xgboost', 'h2o', 'feather'))) %>%
  mutate(downloads_per_star = downloads / stars,
         downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star))

# proportion of all packages listing github
sum(all_pkg_details$github_ind)
mean(all_pkg_details$github_ind)

# proportion of packages with stars
mean(!is.na(all_pkg_details$stars))

# typical number of stars per package
mean(all_pkg_details_clean$stars, na.rm=TRUE)
median(all_pkg_details_clean$stars, na.rm=TRUE)
max(all_pkg_details_clean$stars, na.rm=TRUE)

# typical number of downloads per package
mean(all_pkg_details_clean$downloads, na.rm=TRUE)
median(all_pkg_details_clean$downloads, na.rm=TRUE)

# percent of packages over 10 stars
mean(all_pkg_details_clean$stars > 10, na.rm=TRUE)
mean(all_pkg_details_clean$downloads_per_star, na.rm=TRUE)
median(all_pkg_details_clean$downloads_per_star, na.rm=TRUE)

# stars histogram --------------------------------------------------------------
ggplot(data=all_pkg_details_clean, mapping=aes(stars)) +
  geom_histogram(aes(fill=..count..), bins=60) +
  scale_x_continuous(trans = "log1p", breaks=c(0,1,2,3,10,100,1000,3000)) +
  labs(x = "Stars", y = "Count", fill = "Count",
       caption = "Source: api.github.com as of 6/16/18") +
  ggtitle("Distribution of GitHub Stars on R Packages") +
  theme_bw() +
  theme(panel.grid.minor = element_blank(),
        plot.caption=element_text(hjust = 0))

# stars to downloads scatterplot -----------------------------------------------
plot_dat <- all_pkg_details_clean
idx_label <- which(with(plot_dat, downloads > 10000000 | stars > 1000))
plot_dat$name2 <- plot_dat$name
plot_dat$name <- ""
plot_dat$name[idx_label] <- plot_dat$name2[idx_label]
ggplot(data=plot_dat, aes(stars, downloads, label = name)) +
  geom_point(color = ifelse(plot_dat$name == "", "grey50", "red")) +
  geom_text_repel(box.padding = .5) +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(labels = comma) +
  labs(x = "GitHub Stars", y = "CRAN Downloads",
       caption = "Sources:\napi.github.com as of 6/16/18\ncranlogs as of 1/1/14 - 6/15/18") +
  ggtitle("Relationship Between CRAN Downloads and GitHub Stars") +
  theme_bw() +
  theme(plot.caption=element_text(hjust = 0))

# author stats -----------------------------------------------------------------
# summary by author
authors_detail <- all_pkg_details_clean %>%
  group_by(author) %>%
  summarize(downloads = sum(downloads, na.rm=TRUE),
            stars = sum(stars, na.rm=TRUE)) %>%
  mutate(downloads_per_star = downloads / stars,
         downloads_per_star = ifelse(!is.finite(downloads_per_star), NA_real_, downloads_per_star)) %>%
  arrange(desc(downloads))

# popular authors
pop_authors <- tibble(author = c('Hadley Wickham', 'Dirk Eddelbuettel', 'Yihui Xie',
                                 'Winston Chang', 'Jennifer Bryan', 'JJ Allaire',
                                 'Jeroen Ooms', 'Scott Chamberlain', 'Jim Hester',
                                 'Kirill Müller'),
                      notable_packages = c('ggplot2, dplyr, httr', 'Rcpp, BH',
                                           'knitr, rmarkdown, bookdown', 'R6, shiny',
                                           'readxl, gapminder, googlesheets',
                                           'rstudioapi, reticulate, tensorflow',
                                           'jsonlite, curl, openssl', 'geojsonio, taxize',
                                           'devtools, memoise, readr', 'tibble, DBI'))
author_stats <- pop_authors %>%
  inner_join(., authors_detail, by='author') %>%
  select(author, notable_packages, downloads, stars, downloads_per_star) %>%
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# single author
#all_pkg_details_clean %>% filter(author == 'Dirk Eddelbuettel') %>% arrange(desc(downloads))

# top 10 lists -----------------------------------------------------------------
# Top 10 Most Starred Packages
top_starred <- all_pkg_details_clean %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(stars)) %>%
  slice(1:10) %>%
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Top 10 Most Downloaded Packages with stars
top_downloaded <- all_pkg_details_clean %>%
  filter(!is.na(stars)) %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(downloads)) %>%
  slice(1:10) %>%
  mutate(downloads_per_star = round(downloads_per_star, 1)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Bottom 10 Packages by Downloads per Star (frequently starred)
frequently_starred <- all_pkg_details_clean %>%
  filter(downloads > 100) %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(downloads_per_star) %>%
  slice(1:10) %>%
  mutate(downloads_per_star = round(downloads_per_star, 2)) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)

# Top 10 Packages by Downloads per Star (infrequently starred)
infrequently_starred <- all_pkg_details_clean %>%
  select(name, author, downloads, stars, downloads_per_star) %>%
  arrange(desc(downloads_per_star)) %>%
  slice(1:10) %>%
  rename_all(. %>% gsub("_", " ", .) %>% str_to_title)
Sketchnotes from TWiML&AI: Practical Deep Learning with Rachel Thomas
(This article was first published on Shirin's playgRound, and kindly contributed to R-bloggers)
These are my sketchnotes for Sam Charrington’s podcast This Week in Machine Learning and AI about Practical Deep Learning with Rachel Thomas:
You can listen to the podcast here.
In this episode, I'm joined by Rachel Thomas, founder and researcher at Fast AI. If you're not familiar with Fast AI, the company offers a series of courses including Practical Deep Learning for Coders, Cutting Edge Deep Learning for Coders and Rachel's Computational Linear Algebra course. The courses are designed to make deep learning more accessible to those without the extensive math backgrounds some other courses assume. Rachel and I cover a lot of ground in this conversation, starting with the philosophy and goals behind the Fast AI courses. We also cover Fast AI's recent decision to switch their courses from Tensorflow to Pytorch, the reasons for this, and the lessons they've learned in the process. We discuss the role of the Fast AI deep learning library as well, and how it recently helped their team achieve top results on a popular industry benchmark, improving on previous results for training time and training cost by a factor of more than ten. https://twimlai.com/twiml-talk-138-practical-deep-learning-with-rachel-thomas/
How Much Money Should Machines Earn?
(This article was first published on R – Fronkonstin, and kindly contributed to R-bloggers)
Every inch of sky’s got a star
Every inch of skin’s got a scar
(Everything Now, Arcade Fire)
I think a very good way to start with R is to build an interactive visualization of some open data, because you will train many important data science skills: loading, cleaning, transforming and combining data, and producing a suitable visualization. Making it interactive will also give you an idea of the power of R, because you will realise that you can indirectly handle other programming languages such as JavaScript.
That’s precisely what I’ve done today. I combined two interesting datasets:
- The probability of computerisation of 702 detailed occupations, obtained by Carl Benedikt Frey and Michael A. Osborne from the University of Oxford, using a Gaussian process classifier and published in this paper in 2013.
- Statistics of jobs (employment, median annual wages and typical education needed for entry) from the US Bureau of Labor, available here.
Apart from using dplyr to manipulate data and highcharter to do the visualization, I used the tabulizer package to extract the table of computerisation probabilities from the PDF: it makes this task extremely easy.
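For context, tabulizer's extract_tables() returns a list of character matrices, one per table it detects; the real call would look like tabulizer::extract_tables("frey_osborne.pdf") (path hypothetical). A sketch of the cleanup step, with a mock matrix standing in for the actual extraction and illustrative values:

```r
# mock of one matrix as tabulizer::extract_tables() would return it;
# occupations and probabilities here are illustrative stand-ins
raw <- rbind(
  c("Rank", "Probability", "Occupation"),
  c("1",    "0.0028",      "Recreational Therapists"),
  c("702",  "0.99",        "Telemarketers")
)

# promote the first row to column names and coerce types
probs <- as.data.frame(raw[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(probs) <- raw[1, ]
probs$Probability <- as.numeric(probs$Probability)
```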
This is the resulting plot:
If you want to examine it in depth, here you have a full size version.
These are some of my insights (its corresponding figures are obtained directly from the dataset):
- There is a moderate negative correlation between wages and probability of computerisation.
- Around 45% of US employments are threatened by machines (have a computerisation probability higher than 80%): half of them do not require formal education to entry.
- In fact, 78% of jobs which do not require formal education to entry are threatened by machines: 0% which require a master’s degree are.
- Teachers are absolutely irreplaceable (0% are threatened by machines), but they earn 2.2% less than the average wage (unfortunately, I’m afraid this phenomenon occurs in many other countries as well).
- Don’t study to become a librarian or archivist: it seems a bad way to invest your time.
- Mathematicians will survive the machines.
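Aggregates like the 45%-threatened figure come from weighting each occupation's computerisation probability by its employment. A sketch on a hypothetical mini-dataset (not the real Bureau of Labor numbers):

```r
library(dplyr)

# made-up occupations and employment counts, purely illustrative
jobs <- tibble(
  occupation = c("Telemarketer", "Teacher", "Mathematician", "Cashier"),
  employment = c(200, 400, 50, 350),   # in thousands, invented
  prob_comp  = c(0.99, 0.01, 0.05, 0.97)
)

# share of employment in occupations with computerisation probability > 80%
threatened_share <- jobs %>%
  summarise(share = sum(employment[prob_comp > 0.8]) / sum(employment)) %>%
  pull(share)
```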
The code of this experiment is available here.
Version 0.6-11 of NIMBLE released
(This article was first published on R – NIMBLE, and kindly contributed to R-bloggers)
We’ve released the newest version of NIMBLE on CRAN and on our website.
Version 0.6-11 has important new features, notably support for Bayesian nonparametric mixture modeling, and more are on the way in the next few months.
New features include:
- support for Bayesian nonparametric mixture modeling using Dirichlet process mixtures, with specialized MCMC samplers automatically assigned in NIMBLE’s default MCMC (See Chapter 10 of the manual for details);
- additional resampling methods available with the auxiliary and bootstrap particle filters;
- user-defined filtering algorithms can be used with NIMBLE’s particle MCMC samplers;
- MCMC thinning intervals can be modified at MCMC run-time;
- both runMCMC() and nimbleMCMC() now drop burn-in samples before thinning, making their behavior consistent with each other;
- increased functionality for the ‘setSeed’ argument in nimbleMCMC() and runMCMC();
- new functionality for specifying the order in which sampler functions are executed in an MCMC; and
- invalid dynamic indexes now result in a warning and NaN values but do not cause execution to error out, allowing MCMC sampling to continue.
Please see the NEWS file in the installed package for more details.
Prediction Interval, the wider sister of Confidence Interval
(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)
In this post, I will illustrate the use of prediction intervals for the comparison of measurement methods. In the example, a new spectral method for measuring whole blood hemoglobin is compared with a reference method. But first, let's start by discussing the large difference between a confidence interval and a prediction interval.
Prediction interval versus Confidence interval

Very often a confidence interval is misinterpreted as a prediction interval, leading to unrealistically “precise” predictions. As you will see, prediction intervals (PI) resemble confidence intervals (CI), but the width of the PI is by definition larger than the width of the CI.
Let’s assume that we measure the whole blood hemoglobin concentration in a random sample of 100 persons. We obtain the estimated mean (Est_mean), limits of the confidence interval (CI_Lower and CI_Upper) and limits of the prediction interval (PI_Lower and PI_Upper): (The R-code to do this is in the next paragraph)
Est_mean CI_Lower CI_Upper PI_Lower PI_Upper
     140      138      143      113      167

A Confidence interval (CI) is an interval of good estimates of the unknown true population parameter. About a 95% confidence interval for the mean, we can state that if we would repeat our sampling process infinitely, 95% of the constructed confidence intervals would contain the true population mean. In other words, there is a 95% chance of selecting a sample such that the 95% confidence interval calculated from that sample contains the true population mean.
Interpretation of the 95% confidence interval in our example:
- The values contained in the interval [138g/L to 143g/L] are good estimates of the unknown mean whole blood hemoglobin concentration in the population. In general, if we would repeat our sampling process infinitely, 95% of such constructed confidence intervals would contain the true mean hemoglobin concentration.
A Prediction interval (PI) is an estimate of an interval in which a future observation will fall, with a certain confidence level, given the observations that were already observed. About a 95% prediction interval we can state that if we would repeat our sampling process infinitely, 95% of the constructed prediction intervals would contain the new observation.
Interpretation of the 95% prediction interval in the above example:
- Given the observed whole blood hemoglobin concentrations, the whole blood hemoglobin concentration of a new sample will be between 113g/L and 167g/L with a confidence of 95%. In general, if we would repeat our sampling process infinitely, 95% of such constructed prediction intervals would contain the new hemoglobin concentration measurement.
Remark: Very often we will read the interpretation “The whole blood hemoglobin concentration of a new sample will be between 113g/L and 167g/L with a probability of 95%.” (for example on Wikipedia). This interpretation is only correct in the theoretical situation where the parameters (true mean and standard deviation) are known.
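In that theoretical case no estimation is involved at all: with known mean μ and standard deviation σ, 95% of observations fall in μ ± 1.96σ. A quick sketch using the same parameters as the simulation in this post (mean 139, sd 14.75):

```r
mu    <- 139
sigma <- 14.75

# exact 95% interval for a new observation when the parameters are known;
# qnorm(c(0.025, 0.975)) gives roughly -1.96 and 1.96
known_pi <- round(mu + qnorm(c(0.025, 0.975)) * sigma, 1)
known_pi  # c(110.1, 167.9)
```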
Estimating a prediction interval in R

First, let's simulate some data. The sample size in the plot above was n = 100. Now, to see the effect of the sample size on the width of the confidence interval and the prediction interval, let's take a “sample” of 400 hemoglobin measurements using the same parameters:
set.seed(123)
hemoglobin <- rnorm(400, mean = 139, sd = 14.75)
df <- data.frame(hemoglobin)

Although we don't need a linear regression yet, I'd like to use the lm() function, which makes it very easy to construct a confidence interval (CI) and a prediction interval (PI). We can estimate the mean by fitting a “regression model” with an intercept only (no slope). The default confidence level is 95%.
Confidence interval:

CI <- predict(lm(df$hemoglobin ~ 1), interval = "confidence")
CI[1,]
##      fit      lwr      upr
## 139.2474 137.8425 140.6524

The CI object has a length of 400. But since there is no slope in our “model”, each row is exactly the same.
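The same interval can be computed by hand from the t-distribution; a small sketch (re-simulating the data above), which agrees with predict() because the intercept-only fit estimates nothing but the mean:

```r
set.seed(123)
hemoglobin <- rnorm(400, mean = 139, sd = 14.75)
n  <- length(hemoglobin)
m  <- mean(hemoglobin)
se <- sd(hemoglobin) / sqrt(n)  # standard error of the estimated mean
ci_manual <- m + c(-1, 1) * qt(0.975, df = n - 1) * se
ci_manual
```

So the CI is mean ± t(0.975, n-1) · s/√n: its width shrinks as n grows.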
Prediction interval:

PI <- predict(lm(df$hemoglobin ~ 1), interval = "predict")
## Warning in predict.lm(lm(df$hemoglobin ~ 1), interval = "predict"): predictions on current data refer to _future_ responses
PI[1,]
##      fit      lwr      upr
## 139.2474 111.1134 167.3815

We get a warning that “predictions on current data refer to future responses”. That's exactly what we want, so no worries there. As you see, the column names of the objects CI and PI are the same. Now, let's visualize the confidence and the prediction interval.
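The prediction interval can also be computed by hand: it adds the variance of a single new observation (s²) to the variance of the estimated mean (s²/n), giving mean ± t(0.975, n-1) · s · √(1 + 1/n). A sketch on the same simulated data:

```r
set.seed(123)
hemoglobin <- rnorm(400, mean = 139, sd = 14.75)
n <- length(hemoglobin)
m <- mean(hemoglobin)
s <- sd(hemoglobin)
# two sources of uncertainty: the estimated mean (s^2 / n)
# and the scatter of a future observation (s^2)
pi_manual <- m + c(-1, 1) * qt(0.975, df = n - 1) * s * sqrt(1 + 1/n)
pi_manual
```

The √(1 + 1/n) factor explains why the PI barely narrows with larger samples: the s² term does not shrink with n.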
The code below is not very elegant but I like the result (tips are welcome :-))
library(ggplot2)
limits_CI <- aes(x = 1.3, ymin = CI[1,2], ymax = CI[1,3])
limits_PI <- aes(x = 0.7, ymin = PI[1,2], ymax = PI[1,3])
PI_CI <- ggplot(df, aes(x = 1, y = hemoglobin)) +
  geom_jitter(width = 0.1, pch = 21, fill = "grey", alpha = 0.5) +
  geom_errorbar(limits_PI, width = 0.1, col = "#1A425C") +
  geom_point(aes(x = 0.7, y = PI[1,1]), col = "#1A425C", size = 2) +
  geom_errorbar(limits_CI, width = 0.1, col = "#8AB63F") +
  geom_point(aes(x = 1.3, y = CI[1,1]), col = "#8AB63F", size = 2) +
  scale_x_continuous(limits = c(0, 2)) +
  scale_y_continuous(limits = c(95, 190)) +
  theme_bw() +
  ylab("Hemoglobin concentration (g/L)") +
  xlab(NULL) +
  geom_text(aes(x = 0.6, y = 160, label = "Prediction\ninterval", hjust = "right", cex = 2), col = "#1A425C") +
  geom_text(aes(x = 1.4, y = 140, label = "Confidence\ninterval", hjust = "left", cex = 2), col = "#8AB63F") +
  theme(legend.position = "none", axis.text.x = element_blank(), axis.ticks.x = element_blank())
PI_CI

The width of the confidence interval is very small now that we have this large sample size (n=400). This is not surprising, as the estimated mean is the only source of uncertainty. In contrast, the width of the prediction interval is still substantial. The prediction interval has two sources of uncertainty: the estimated mean (just like the confidence interval) and the random variance of new observations.
Example: comparing a new with a reference measurement method

A prediction interval can be useful in the case where a new method should replace a standard (or reference) method.
If we can predict well enough what the measurement by the reference method would be (given the new method), then the two methods give similar information and the new method can be used.
For example, in (Tian, 2017) a new spectral method (near-infrared) to measure hemoglobin is compared with a gold standard. In contrast with the gold standard method, the new spectral method does not require reagents. Moreover, the new method is faster. We will investigate whether we can predict well enough, based on the concentration measured with the new method, what the measurement by the gold standard would be. (Note: the measured concentrations presented below are fictitious.)
library(knitr)  # for kable()
Hb <- read.table("http://rforbiostatistics.colmanstatistics.be/wp-content/uploads/2018/06/Hb.txt", header = TRUE)
kable(head(Hb))
##       New Reference
##  84.96576  87.24013
##  99.91483 103.38143
## 111.79984 116.71593
## 116.95961 116.72065
## 118.11140 113.51541
## 118.21411 121.70586
plot(Hb$New, Hb$Reference, xlab = "Hemoglobin concentration (g/L) - new method", ylab = "Hemoglobin concentration (g/L) - reference method")

Prediction interval based on linear regression

Now, let's fit a linear regression model predicting the hemoglobin concentrations measured by the reference method, based on the concentrations measured with the new method.
fit.lm <- lm(Hb$Reference ~ Hb$New)
plot(Hb$New, Hb$Reference, xlab = "Hemoglobin concentration (g/L) - new method", ylab = "Hemoglobin concentration (g/L) - reference method")
# adding the regression line:
abline(a = fit.lm$coefficients[1], b = fit.lm$coefficients[2])
# adding the identity line:
abline(a = 0, b = 1, lty = 2)

If both measurement methods corresponded exactly, the intercept would be zero and the slope would be one (the “identity line”, dotted). Now, let's calculate the confidence interval for this linear regression.
CI_ex <- predict(fit.lm, interval = "confidence")
colnames(CI_ex) <- c("fit_CI", "lwr_CI", "upr_CI")

And the prediction interval:
PI_ex <- predict(fit.lm, interval = "prediction")
## Warning in predict.lm(fit.lm, interval = "prediction"): predictions on current data refer to _future_ responses
colnames(PI_ex) <- c("fit_PI", "lwr_PI", "upr_PI")

We can combine those results in one data frame and plot both the confidence interval and the prediction interval.
Hb_results <- cbind(Hb, CI_ex, PI_ex)
kable(head(round(Hb_results), 1))
## New Reference fit_CI lwr_CI upr_CI fit_PI lwr_PI upr_PI
##  85        87     91     87     94     91     82     99

Visualizing the CI (in green) and the PI (in blue):
plot(Hb$New, Hb$Reference, xlab = "Hemoglobin concentration (g/L) - new method", ylab = "Hemoglobin concentration (g/L) - reference method")
Hb_results_s <- Hb_results[order(Hb_results$New), ]
lines(x = Hb_results_s$New, y = Hb_results_s$fit_CI)
lines(x = Hb_results_s$New, y = Hb_results_s$lwr_CI, col = "#8AB63F", lwd = 1.2)
lines(x = Hb_results_s$New, y = Hb_results_s$upr_CI, col = "#8AB63F", lwd = 1.2)
lines(x = Hb_results_s$New, y = Hb_results_s$lwr_PI, col = "#1A425C", lwd = 1.2)
lines(x = Hb_results_s$New, y = Hb_results_s$upr_PI, col = "#1A425C", lwd = 1.2)
abline(a = 0, b = 1, lty = 2)

In (Bland, Altman 2003) it is proposed to calculate the average width of this prediction interval and see whether it is acceptable. Another approach is to compare the calculated PI with an “acceptance interval”: if the PI is inside the acceptance interval for the measurement range of interest, the methods agree well enough (see Francq, 2016).
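The average PI width is straightforward to compute from the prediction-interval columns. A sketch using simulated stand-in data (the original Hb file lives online, so the numbers here are illustrative only):

```r
# Average width of the prediction interval (the Bland & Altman 2003 proposal).
# Simulated stand-in for the Hb method-comparison data:
set.seed(1)
new_method <- rnorm(30, mean = 135, sd = 15)
reference  <- new_method + rnorm(30, mean = 0, sd = 5)
fit <- lm(reference ~ new_method)
PI  <- suppressWarnings(predict(fit, interval = "prediction"))
mean_width <- mean(PI[, "upr"] - PI[, "lwr"])
mean_width  # compare this average width against what is clinically acceptable
```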
In the above example both methods have the same measurement scale (g/L), but linear regression with a prediction interval is particularly useful when the two methods of measurement have different units.
However, the method has some disadvantages:
- Prediction intervals are very sensitive to deviations from the normal distribution.
- In “standard” linear regression (ordinary least squares, OLS), measurement error is allowed for the Y-variable (here, the reference method) but not for the X-variable (the new method): the absence of error on the x-axis is one of the model's assumptions. Since we can expect some measurement error for the new method, this assumption is violated here.
In contrast to Ordinary Least Square (OLS) regression, Bivariate Least Square (BLS) regression takes into account the measurement errors of both methods (the New method and the Reference method). Interestingly, prediction intervals calculated with BLS are not affected when the axes are switched (del Rio, 2001).
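The OLS problem with errors in X can be seen in a small simulation (an illustrative sketch, not part of the original post): adding measurement noise to the predictor biases the fitted slope toward zero, which is exactly the bias errors-in-variables methods such as BLS are designed to avoid.

```r
set.seed(42)
n      <- 5000
x_true <- rnorm(n, mean = 135, sd = 10)  # true concentrations
y      <- x_true + rnorm(n, sd = 3)      # reference method: true slope is 1
x_obs  <- x_true + rnorm(n, sd = 5)      # new method, measured with error
coef(lm(y ~ x_true))[["x_true"]]         # close to 1
coef(lm(y ~ x_obs))[["x_obs"]]           # attenuated toward 100 / (100 + 25) = 0.8
```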
In 2017, a new R-package BivRegBLS was released. It offers several methods to assess the agreement in method comparison studies, including Bivariate Least Square (BLS) regression.
If the data are unreplicated but the variance of the measurement error of the methods is known, the BLS() and XY.plot() functions can be used to fit a bivariate least squares regression line and the corresponding confidence and prediction intervals.
library(BivRegBLS)
Hb.BLS = BLS(data = Hb, xcol = c("New"), ycol = c("Reference"), var.y = 10, var.x = 8, conf.level = 0.95)
XY.plot(Hb.BLS, yname = "Hemoglobin concentration (g/L) - reference method", xname = "Hemoglobin concentration (g/L) - new method", graph.int = c("CI","PI"))

Now we would like to decide whether the new method can replace the reference method. We allow the methods to differ up to a given threshold that is not clinically relevant. Based on this threshold an “acceptance interval” is created. Suppose that differences up to 10 g/L (the threshold) are not clinically relevant; then the acceptance interval can be defined as Y = X ± Δ, with Δ equal to 10. If the PI is inside the acceptance interval for the measurement range of interest, the two measurement methods can be considered interchangeable (see Francq, 2016).
The accept.int argument of the XY.plot() function allows for a visualization of this acceptance interval:
XY.plot(Hb.BLS, yname = "Hemoglobin concentration (g/L) - reference method", xname = "Hemoglobin concentration (g/L) - new method", graph.int = c("CI","PI"), accept.int = 10)

For the measurement region 120 g/L to 150 g/L, we can conclude that the difference between both methods is acceptable. If the measurement regions below 120 g/L and above 150 g/L are important, the new method cannot replace the reference method.
Regression on replicated data

In method comparison studies it is advised to create replicates (two or more measurements of the same sample with the same method). An example of such a dataset:
Hb_rep <- read.table("http://rforbiostatistics.colmanstatistics.be/wp-content/uploads/2018/06/Hb_rep.txt", header = TRUE)
kable(head(round(Hb_rep), 1))
## New_rep1 New_rep2 Ref_rep1 Ref_rep2
##       88       95       90       84

When replicates are available, the variances of the measurement errors are calculated for both the new and the reference method and used to estimate the prediction interval. Again, the BLS() and XY.plot() functions are used to estimate and plot the BLS regression line and the corresponding CI and PI.
Hb_rep.BLS = BLS(data = Hb_rep, xcol = c("New_rep1", "New_rep2"), ycol = c("Ref_rep1", "Ref_rep2"), qx = 1, qy = 1, conf.level = 0.95, pred.level = 0.95)
XY.plot(Hb_rep.BLS, yname = "Hemoglobin concentration (g/L) - reference method", xname = "Hemoglobin concentration (g/L) - new method", graph.int = c("CI","PI"), accept.int = 10)

It is clear that the prediction interval is not inside the acceptance interval here, so the new method cannot replace the reference method. A possible solution is to average the repeats. The BivRegBLS package can create prediction intervals for the mean of (two or more) future values, too! More information in this presentation (presented at useR!2017).
In the plot above, averages of the two replicates are calculated and plotted. I'd like to see the individual measurements:
plot(x = c(Hb_rep$New_rep1, Hb_rep$New_rep2), y = c(Hb_rep$Ref_rep1, Hb_rep$Ref_rep2), xlab = "Hemoglobin concentration (g/L) - new method", ylab = "Hemoglobin concentration (g/L) - reference method")
lines(x = as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y = as.numeric(Hb_rep.BLS$Pred.BLS[,2]), lwd = 2)
lines(x = as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y = as.numeric(Hb_rep.BLS$Pred.BLS[,3]), col = "#8AB63F", lwd = 2)
lines(x = as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y = as.numeric(Hb_rep.BLS$Pred.BLS[,4]), col = "#8AB63F", lwd = 2)
lines(x = as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y = as.numeric(Hb_rep.BLS$Pred.BLS[,5]), col = "#1A425C", lwd = 2)
lines(x = as.numeric(Hb_rep.BLS$Pred.BLS[,1]), y = as.numeric(Hb_rep.BLS$Pred.BLS[,6]), col = "#1A425C", lwd = 2)
abline(a = 0, b = 1, lty = 2)

Remarks

- Although not appropriate in the context of method comparison studies, Pearson correlation is still frequently used. See Bland & Altman (2003) for an explanation of why correlations are not advised.
- Methods presented in this blog post are not applicable to time series.
References

- Confidence interval and prediction interval: Kutner, M., Nachtsheim, C., Neter, J., and Li, W. (2005), Applied Linear Statistical Models. Section 2.5.
- Prediction interval for method comparison: Bland, J. M. and Altman, D. G. (2003), Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol, 22: 85-93. doi:10.1002/uog.12. Section: “Appropriate use of regression”.
- Francq, B. G. and Govaerts, B. (2016), How to regress and predict in a Bland-Altman plot? Review and contribution based on tolerance intervals and correlated-errors-in-variables models. Statist. Med., 35: 2328-2358. doi:10.1002/sim.6872.
- del Río, F. J., Riu, J. and Rius, F. X. (2001), Prediction intervals in linear regression taking into account errors on both axes. J. Chemometrics, 15: 773-788. doi:10.1002/cem.663
- Example of a method comparison study: H. Tian, M. Li, Y. Wang, D. Sheng, J. Liu, and L. Zhang, “Optical wavelength selection for portable hemoglobin determination by near-infrared spectroscopy method,” Infrared Phys. Techn 86, 98-102 (2017). doi.org/10.1016/j.infrared.2017.09.004.
- the predict() and lm() functions of R: Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Wadsworth & Brooks/Cole.
To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Interpreting machine learning models with the lime package for R
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
Many types of machine learning classifiers, not least commonly-used techniques like ensemble models and neural networks, are notoriously difficult to interpret. If the model produces a surprising label for any given case, it's difficult to answer the question, "why that label, and not one of the others?".
One approach to this dilemma is the technique known as LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is that while for highly non-linear models it's impossible to give a simple explanation of the relationship between any one variable and the predicted classes at a global level, it might be possible to assess which variables are most influential on the classification at a local level, near the neighborhood of a particular data point. A procedure for doing so is described in this 2016 paper by Ribeiro et al., and implemented in the R package lime by Thomas Lin Pedersen and Michael Benesty (a port of the Python package of the same name).
You can read about how the lime package works in the introductory vignette Understanding Lime, but this limerick by Mara Averick also sums things up nicely:
There once was a package called lime,
Whose models were simply sublime,
It gave explanations for their variations,
One observation at a time.
"One observation at a time" is the key there: given a prediction (or a collection of predictions) it will determine the variables that most support (or contradict) the predicted classification.
The lime package also works with text data: for example, you may have a model that classifies a paragraph of text as a sentiment "negative", "neutral" or "positive". In that case, lime will determine the words in that text which are most important in determining (or contradicting) the classification. The package helpfully also provides a shiny app making it easy to test out different sentences and see the local effect of the model.
To learn more about the lime algorithm and how to use the associated R package, a great place to get started is the tutorial Visualizing ML Models with LIME from the University of Cincinnati Business Analytics R Programming Guide. The lime package is available on CRAN now, and you can always find the latest version at the GitHub repository linked below.
GitHub (thomasp): lime (Local Interpretable Model-Agnostic Explanations)
R Tip: Be Wary of “…”
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
R Tip: be wary of “...“.
The following code example contains an easy error in using the R function unique().
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
unique(vec1, vec2)
# [1] "a" "b" "c"

Notice none of the novel values from vec2 are present in the result. Our mistake was: we (improperly) tried to use unique() with multiple value arguments, as one would use union(). Also notice no error or warning was signaled. We used unique() incorrectly and nothing pointed this out to us. What compounded our error was R‘s “...” function signature feature.
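For comparison, base R's union() is the two-argument set operation we actually intended:

```r
vec1 <- c("a", "b", "c")
vec2 <- c("c", "d")
union(vec1, vec2)  # the set union we were after
# [1] "a" "b" "c" "d"
```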
In this note I will talk a bit about how to defend against this kind of mistake. I am going to apply the principle that a design that makes committing mistakes more difficult (or even impossible) is a good thing, and not a sign of carelessness, laziness, or weakness. I am well aware that every time I admit to making a mistake (I have indeed made the above mistake) those who claim to never make mistakes have a laugh at my expense. Honestly I feel the reason I see more mistakes is I check a lot more.
Data science coding is often done in a rush (deadlines, wanting to see results, and so on). Instead of moving fast, let’s take the time to think a bit about programming theory using a very small concrete issue. This lets us show how one can work a bit safer (saving time in the end), without sacrificing notational power or legibility.
A confounding issue is: unique() failed to alert me of my mistake because, unique()‘s function signature (like so many R functions) includes a “...” argument. I might have been careless or in a rush, but it seems like unique was also in a rush and did not care to supply argument inspection.
In R a function that includes a “...” in its signature will accept arbitrary arguments and values in addition to the ones explicitly declared and documented. There are three primary uses of “...” in R: accepting unknown arguments that are to be passed to later functions, building variadic functions, and forcing later arguments to be bound by name (my favorite use). Unfortunately, “...” is also often used to avoid documenting arguments and turns off a lot of very useful error checking.
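The third use, forcing later arguments to be bound by name, is easy to demonstrate with a small hypothetical helper (sum_or_na is my own illustration, not from the original post): any argument declared after “...” can only be supplied by its full name.

```r
# A variadic sketch: values flow into "...", options must be named.
sum_or_na <- function(..., na.rm = FALSE) {
  sum(c(...), na.rm = na.rm)
}
sum_or_na(1, 2, NA)                # [1] NA
sum_or_na(1, 2, NA, na.rm = TRUE)  # [1] 3
sum_or_na(1, 2, NA, TRUE)          # [1] NA -- TRUE is swallowed by "...", not na.rm
```

The last call shows both the safety (na.rm cannot be set accidentally by position) and the danger (extra positional arguments are silently absorbed) of “...”.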
An example of the “accepting unknown arguments” use is lapply(). lapply() passes what it finds in “...” to whatever function it is working with. For example:
lapply(c("a", "b"), paste, "!", sep = "")
# [[1]]
# [1] "a!"
#
# [[2]]
# [1] "b!"

Notice the arguments “"!", sep = ""” were passed on to paste(). Since lapply() cannot know ahead of time what function the user will use, it uses the “...” abstraction to forward arguments. Personally I never use this form and tend to write the somewhat more explicit and verbose style shown below.
lapply(c("a", "b"), function(vi) {
  paste(vi, "!", sep = "")
})

I feel this form is more readable as the arguments are seen where they are actually used. (Note: this is a notional example; in practice we would use “paste0(c("a", "b"), "!")” to directly produce the result as a vector.)
An example of using “...” to supply a variadic interface is paste() itself.
paste("a", "b", "c")
# [1] "a b c"

Other important examples include list() and c(). In fact I like list() and c() as they only take a “...” and no other arguments. Being variadic is so important to list() and c() that it is essentially all they do. One can often separate out the variadic intent with lists or vectors, as in:
paste(c("a", "b", "c"), collapse = " ")
# [1] "a b c"

Even I don’t write code such as the above (that is too long even for me), unless the values are coming from somewhere else (such as a variable). However, with wrapr‘s reduce/expand operator we can completely separate the collection of variadic arguments from their application. The notation looks like the following:
library("wrapr")
values <- c("a", "b", "c")
values %.|% paste
# [1] "a b c"

Essentially, reduce/expand calls variadic functions with items taken from a vector or list as individual arguments (allowing one to program easily over variadic functions). %.|% is intended to place values in the “...” slot of a function (the variadic term). It is designed for a more perfect world where, when a function declares “...” in its signature, it is the only user-facing part of the signature. This is hardly ever actually the case in R, as common functions such as paste() and sum() have additional optional named arguments (which we are here leaving at their default values), whereas c() and list() are pure (take only “...“).
With a few non-standard (name capturing) and variadic value constructors one does not in fact need other functions to be name capturing or variadic. With such tools one can have these conveniences everywhere. For example we can convert our incorrect use of unique() into correct code using c().
unique(c(vec1, vec2))
# [1] "a" "b" "c" "d"

In the above code roles are kept separate: c() is collecting values and unique() is applying a calculation. We don’t need a powerful “super unique” or “super union” function; unique() is good enough if we remember to use c().
In the spirit of our earlier article on function argument names we have defined a convenience function wrapr::uniques() that enforces the use of value carrying arguments. With wrapr::uniques() if one attempts the mistake I have been discussing one immediately gets a signaling error (instead of propagating incorrect calculations forward).
library("wrapr")
uniques(c(vec1, vec2))
# [1] "a" "b" "c" "d"
uniques(vec1, vec2)
# Error: wrapr::uniques unexpected arguments: vec2

With uniques() we either get the right answer, or we get immediately stopped at the mistaken step. This is a good way to work.
Why your S3 method isn’t working
(This article was first published on That’s so Random, and kindly contributed to R-bloggers)
Throughout the last years I noticed the following happening with a number of people. One of those people was actually yours truly a few years back. Person is aware of S3 methods in R through regular use of print, plot and summary functions and decides to give it a go in own work. Creates a function that assigns a class to its output and then implements a bunch of methods to work on the class. Strangely, some of these methods appear to be working as expected, while others throw an error. After a confusing and painful debugging session, person throws hands in the air and continues working without S3 methods, which was working fine in the first place. This is a real pity, because all the person is overlooking is a very small step in the S3 chain: the generic function.
A nonworking method

So we have a function doing all kinds of complicated stuff. It outputs a list with several elements. We assign an S3 class to it before returning, so we can subsequently implement a number of methods1. Let's just make something up here.
my_func <- function(x) {
  ret <- list(dataset = x,
              d = 42,
              y = rnorm(10),
              z = c('a', 'b', 'a', 'c'))
  class(ret) <- "myS3"
  ret
}
out <- my_func(mtcars)

Perfect, now let's implement a print method. Rather than outputting the whole list, we just want to know the most vital information when printing.
print.myS3 <- function(x) {
  cat("Original dataset has", nrow(x$dataset), "rows and",
      ncol(x$dataset), "columns\n",
      "d is", x$d)
}
out
## Original dataset has 32 rows and 11 columns
##  d is 42

Ha, that is working! Now we do a mean method that gives us the mean of the y variable.
mean.myS3 <- function(x) {
  mean(x$y)
}
mean(out)
## [1] 0.2631094

Works too! And finally we do a count_letters method. It takes z from the output and counts how often each letter occurs.
count_letters.myS3 <- function(x) {
  table(x$z)
}
count_letters(out)
## Error in count_letters(out): could not find function "count_letters"

What do you mean “could not find function”? It is right there! Maybe we made a typo. Mmmm, no, it doesn’t seem so. Maybe, mmmm, let's look into this…. Half an hour, a bit of swearing and feelings of stupidity later. Pfff, let's not bother with S3, we were happy with just using functions in the first place.
Generics

Now why are print and mean working just fine, but count_letters isn't? Let's look under the hood of print and mean.
print
## function (x, ...)
## UseMethod("print")

mean
## function (x, ...)
## UseMethod("mean")

They look exactly the same! They call the UseMethod function on their own function name. Looking into the help file of UseMethod, it all of a sudden starts to make sense.
“When a function calling UseMethod(“fun”) is applied to an object with class attribute c(“first”, “second”), the system searches for a function called fun.first and, if it finds it, applies it to the object. If no such function is found a function called fun.second is tried. If no class name produces a suitable function, the function fun.default is used, if it exists, or an error results.”
So by calling print and mean on the myS3 object we were not calling the method itself. Rather, we call the general functions print and mean (the generics) and they call the function UseMethod. This function then does the method dispatch: look up the method belonging to the S3 object the function was called on. We were just lucky the print and mean generics were already in place and called our methods. However, the count_letters function indeed doesn't exist (as the error message tells us). Only the count_letters method exists, for objects of class myS3. We just learned that methods cannot be called directly, but are invoked by generics. All we need to do, thus, is build a generic for count_letters and we are all set.
count_letters <- function(x) {
  UseMethod("count_letters")
}
count_letters(out)
##
## a b c
## 2 1 1
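Once the generic is in place you can also add a fallback for objects that are not of class myS3 via a fun.default method, exactly as the UseMethod help file describes. A self-contained sketch mirroring the post's example (the default method is my own addition):

```r
# generic: dispatches on the class of its argument
count_letters <- function(x) {
  UseMethod("count_letters")
}

# method for our class: count the letters stored in the z element
count_letters.myS3 <- function(x) {
  table(x$z)
}

# default method: try to count the object itself
count_letters.default <- function(x) {
  table(x)
}

obj <- structure(list(z = c("a", "b", "a", "c")), class = "myS3")
count_letters(obj)               # dispatches to count_letters.myS3
count_letters(c("x", "y", "x"))  # falls back to count_letters.default
```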
- It is actually ill-advised to assign an S3 class directly to an output. Rather, use a constructor; see section 16.3.1 of Advanced R for the how and why.
Creating Custom Plotly Charts on Datazar
(This article was first published on R Language in Datazar Blog on Medium, and kindly contributed to R-bloggers)
Have you ever wondered how you create or use Plotly (plotly.js) on Datazar? No? Well here it is anyway.
Datazar’s live chart editor allows you to create custom D3js charts with a JavaScript and CSS editor. Using the DatazarJS library, you can call your datasets without having to worry about file paths or URLs; simply do Datazar.dataset('someDataset.csv',()=>{}).
Is your data being automatically updated via the Datazar REST API? The charts update themselves so you don’t have to worry about taking into account new data.
Including Plotly

The “Libraries” button will prompt a pop-up that allows you to include external JS libraries and CSS styles. Plotly provides a CDN link so you can use their library without saving your own copy. That’s wonderful; let’s use that and copy it into the popup.
The Code

Let’s use one of the examples Plotly was kind enough to create on their site: https://plot.ly/javascript/histograms/
Copy and paste the JS code to the Datazar JS editor and you’re done.
The JS editor on the left contains the code from the histogram example. Note this example actually generates the data on the fly, but it’s the same principle as grabbing your data using the async Dataset function.
The CSS editor is just making sure the container page keeps the same color as the Plotly chart.
Using the Chart in a Paper

To use your newly minted chart on a Paper, Datazar’s new interactive research paper, create a Paper and use the option bar on the right to include it. The chart will be on the “Visualization” tab.
The cool thing about using the Plotly chart along with the interactive Paper is that it will keep its interactivity so you can zoom in and do all the fun things Plotly allows on their charts even AFTER you publish your Paper. This means your readers can play with your charts and get a better understanding; after all, that’s the whole point.
Check out more features here: https://www.datazar.com/features
Creating Custom Plotly Charts on Datazar was originally published in Datazar Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
How to create a Flexdashboard: Exercises
(This article was first published on R-exercises, and kindly contributed to R-bloggers)
INTRODUCTION
With flexdashboard, you can easily create interactive dashboards for R. What is amazing about it is that with R Markdown, you can publish a group of related data visualizations as a dashboard.
Additionally, it supports a wide variety of components, including htmlwidgets; base, lattice, and grid graphics; tabular data; gauges and value boxes and text annotations.
It is flexible and easy to specify rows and column-based layouts. Components are intelligently re-sized to fill the browser and adapted for display on mobile devices.
In combination with Shiny, you can create a high quality dashboard with interactive visualizations.
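As a point of reference while solving the exercises below, a minimal flexdashboard document is just an R Markdown file whose output format is flexdashboard::flex_dashboard (the title, column heading, and chart code below are placeholders):

````markdown
---
title: "My Dashboard"
output: flexdashboard::flex_dashboard
---

Column
-----------------------------------------------------------------------

### Chart 1

```{r}
# chart code goes here
```
````

Charts are level-3 (###) sections nested inside Column or Row sections; flexdashboard sizes them to fill the browser.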
Before proceeding, please follow our short tutorial.
Look at the examples given and try to understand the logic behind them. Then, try to solve the exercises below by using R without looking at the answers. Then, check the solutions to check your answers.
Exercise 1
Create a new flexdashboard R Markdown file from the R console.
Exercise 2
Create the initial dashboard interface in a single column.
Exercise 3
Add the space in which you will put your first chart.
Exercise 4
Add the space in which you will put your second chart. The two charts should be stacked vertically.
Exercise 5
Add a third chart with the same logic.
Exercise 6
Transform your layout to scrolling.
Exercise 7
Display the three charts split across two columns.
Exercise 8
Change the width of these two columns.
Exercise 9
Define two rows instead of columns, the first with a single chart and the second with two charts.
Exercise 10
Change the height of these two rows.
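For orientation, here is a minimal sketch of the kind of layout exercises 9 and 10 ask for. The title, heading names, and the `data-height` values are placeholder assumptions; the chart code itself is omitted.

```markdown
---
title: "My dashboard"
output:
  flexdashboard::flex_dashboard:
    orientation: rows
---

Row {data-height=600}
-------------------------------------

### Chart 1

Row {data-height=400}
-------------------------------------

### Chart 2

### Chart 3
```

Each `###` section becomes one dashboard component, and the `data-height` attribute on each `Row` heading controls the relative height of that row.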
Related exercise sets:- Building Shiny App Exercises (part-8)
- Data visualization with googleVis exercises part 10
- Data Visualization with googleVis exercises part 2
- Explore all our (>1000) R exercises
- Find an R course using our R Course Finder directory
To leave a comment for the author, please follow the link and comment on their blog: R-exercises.
Build httr Functions Automagically from Manual Browser Requests with the middlechild Package
(This article was first published on R – rud.is, and kindly contributed to R-bloggers)
You can catch a bit of the @rOpenSci 2018 Unconference experience at home with this short-ish ‘splainer video on how to use the new middlechild package (https://github.com/ropenscilabs/middlechild) & mitmproxy to automagically create reusable httr verb functions from manual browser form interactions.
To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.
Spreadsheet Data Manipulation in R
(This article was first published on R tutorial for Spatial Statistics, and kindly contributed to R-bloggers)
Today I decided to create a new repository on GitHub where I am sharing code to do spreadsheet data manipulation in R.
The first version of the repository and R script is available here: SpreadsheetManipulation_inR
As an example I am using a csv freely available from the IRS, the US Internal Revenue Service.
https://www.irs.gov/statistics/soi-tax-stats-individual-income-tax-statistics-2015-zip-code-data-soi
This spreadsheet has around 170,000 rows and 131 columns.
Please feel free to request new functions to be added or add functions and code yourself directly on GitHub.
To leave a comment for the author, please follow the link and comment on their blog: R tutorial for Spatial Statistics.
Working with Your Facebook Data in R
(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)
How to Read in and Clean Your Facebook Data – I recently learned that you can download all of your Facebook data, so I decided to check it out and bring it into R. To access your data, go to Facebook, and click on the white down arrow in the upper-right corner. From there, select Settings, then, from the column on the left, “Your Facebook Information.” When you get the Facebook Information screen, select “View” next to “Download Your Information.” On this screen, you’ll be able to select the kind of data you want, a date range, and format. I only wanted my posts, so under “Your Information,” I deselected everything but the first item on the list, “Posts.” (Note that this will still download all photos and videos you posted, so it will be a large file.) To make it easy to bring into R, I selected JSON under Format (the other option is HTML).
After you click “Create File,” it will take a while to compile – you’ll get an email when it’s ready. You’ll need to reenter your password when you go to download the file.
The result is a Zip file, which contains folders for Posts, Photos, and Videos. Posts includes your own posts (on your and others’ timelines) as well as posts from others on your timeline. And, of course, the file needed a bit of cleaning. Here’s what I did.
Since the post data is a JSON file, I need the jsonlite package to read it.
setwd("C:/Users/slocatelli/Downloads/facebook-saralocatelli35/posts")
library(jsonlite)
FBposts <- fromJSON("your_posts.json")
This creates a large list object, with my data in a data frame. So as I did with the Taylor Swift albums, I can pull out that data frame.
myposts <- FBposts$status_updates

The resulting data frame has 5 columns: timestamp, which is in UNIX format; attachments, any photos, videos, URLs, or Facebook events attached to the post; title, which always starts with the author of the post (you or your friend who posted on your timeline) followed by the type of post; data, the text of the post; and tags, the people you tagged in the post.
First, I converted the timestamp to datetime, using the anytime package.
library(anytime)
myposts$timestamp <- anytime(myposts$timestamp)
Next, I wanted to pull out post author, so that I could easily filter the data frame to only use my own posts.
library(tidyverse)
myposts$author <- word(string = myposts$title, start = 1, end = 2, sep = fixed(" "))
Finally, I was interested in extracting URLs I shared (mostly from YouTube or my own blog) and the text of my posts, which I did with some regular expression functions and some help from Stack Overflow (here and here).
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
myposts$links <- str_extract(myposts$attachments, url_pattern)
library(qdapRegex)
myposts$posttext <- myposts$data %>%
rm_between('"','"',extract = TRUE)
There’s more cleaning I could do, but this gets me a data frame I could use for some text analysis. Let’s look at my most frequent words.
myposts$posttext <- as.character(myposts$posttext)
library(tidytext)
mypost_text <- myposts %>%
unnest_tokens(word, posttext) %>%
anti_join(stop_words)
## Joining, by = "word"
counts <- mypost_text %>%
filter(author == "Sara Locatelli") %>%
drop_na(word) %>%
count(word, sort = TRUE)
counts
## # A tibble: 9,753 x 2
## word n
##    <chr>    <int>
## 1 happy 4702
## 2 birthday 4643
## 3 today's 666
## 4 song 648
## 5 head 636
## 6 day 337
## 7 post 321
## 8 009f 287
## 9 ð 287
## 10 008e 266
## # ... with 9,743 more rows
These data include all my posts, including writing “Happy birthday” on others’ timelines. I also frequently post the song in my head when I wake up in the morning (over 600 times, it seems). If I wanted to remove those, and only include times I said happy or song outside of those posts, I’d need to apply the filter in a previous step. There are also some strange characters that I want to clean from the data before I do anything else with them. I can easily remove these characters and numbers with str_detect, but cells that contain both numbers and letters, such as “008e”, won’t be cut out with that function. So I’ll just filter them out separately.
drop_nums <- c("008a", "008e", "009a", "009c", "009f")
counts <- counts %>%
  filter(str_detect(word, "[a-z]+"),       # keep tokens that contain letters
         !str_detect(word, "^[0-9]+$"),    # drop tokens that are pure numbers
         !word %in% drop_nums)             # drop the mixed letter/number artifacts
Now I could, for instance, create a word cloud.
library(wordcloud)
counts %>%
with(wordcloud(word, n, max.words = 50))
In addition to posting for birthdays and head songs, I talk a lot about statistics, data, analysis, and my blog. I also post about beer, concerts, friends, books, and Chicago. Let’s see what happens if I mix in some sentiment analysis to my word cloud.
library(reshape2)

## Attaching package: 'reshape2'
counts %>%
inner_join(get_sentiments("bing")) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("red","blue"), max.words = 100)
## Joining, by = "word"
Once again, a few words are likely being misclassified – regression and plot are both negatively-valenced, but I imagine I’m using them in the statistical sense instead of the negative sense. I also apparently use “died” or “die” but I suspect in the context of, “I died laughing at this.” And “happy” is huge, because it includes birthday wishes as well as instances where I talk about happiness. Some additional cleaning and exploration of the data is certainly needed. But that’s enough to get started with this huge example of “me-search.”
To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.
Quantile Regression (home made)
(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)
After my series of posts on classification algorithms, it’s time to get back to R code, this time for quantile regression. Yes, I still want to get a better understanding of optimization routines in R. Before looking at quantile regression, let us compute the median, or a quantile, from a sample.
Median
Consider a sample \(\{y_1,\dots,y_n\}\). To compute the median, solve
\[\min_m \sum_{i=1}^n |y_i - m|\]
which can be solved using linear programming techniques. More precisely, this problem is equivalent to
\[\min \sum_{i=1}^n (a_i + b_i) \quad\text{subject to}\quad a_i - b_i = y_i - m,\]
with \(a_i \ge 0\) and \(b_i \ge 0\), \(i = 1,\dots,n\).
To illustrate, consider a sample from a lognormal distribution,
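The simulation chunk did not survive in this copy of the post; a minimal stand-in consistent with the later chunks (which use `n` and `y`; the sample size and seed here are assumptions) is:

```r
# assumed values; the original simulation chunk is missing
set.seed(1)
n <- 500
y <- rlnorm(n)   # sample from a lognormal distribution
```

Note that the post later also refers to this sample as `x` (in `quantile(x, tau)`); the naming is inconsistent in the original.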
For the optimization problem, we use the matrix form of the linear program, with the constraints and parameters above:
library(lpSolve)
A1 = cbind(diag(2*n), 0)
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(1, 2*n), 0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0, 2*n), y))
tail(r$solution, 1)
## [1] 1.077415

It looks like it’s working well…
Quantile
Of course, we can adapt our previous code for quantiles.
tau = .3
quantile(x, tau)
##       30%
## 0.6741586

The linear program is now
\[\min \sum_{i=1}^n \tau a_i + (1-\tau) b_i \quad\text{subject to}\quad a_i - b_i = y_i - q,\]
with \(a_i \ge 0\) and \(b_i \ge 0\), \(i = 1,\dots,n\). The R code is now
A1 = cbind(diag(2*n), 0)
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(tau, n), rep(1 - tau, n), 0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0, 2*n), y))
tail(r$solution, 1)
## [1] 0.6741586

So far so good…
Quantile Regression (simple)
Consider the following dataset, with rents of flats in a major German city, as a function of the surface, the year of construction, etc.
base = read.table("http://freakonometrics.free.fr/rent98_00.txt", header = TRUE)

The linear program for the quantile regression is now
\[\min \sum_{i=1}^n \tau a_i + (1-\tau) b_i \quad\text{subject to}\quad a_i - b_i = y_i - \mathbf{x}_i^\top\beta,\]
with \(a_i \ge 0\) and \(b_i \ge 0\). So use here
require(lpSolve)
tau = .3
n = nrow(base)
X = cbind(1, base$area)
y = base$rent_euro
A1 = cbind(diag(2*n), 0, 0)
A2 = cbind(diag(n), -diag(n), X)
r = lp("min", c(rep(tau, n), rep(1 - tau, n), 0, 0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0, 2*n), y))
tail(r$solution, 2)
## [1] 148.946864   3.289674

Of course, we can use an R function to fit that model
library(quantreg)
rq(rent_euro ~ area, tau = tau, data = base)
## Coefficients:
## (Intercept)        area
##  148.946864    3.289674

Here again, it seems to work quite well. We can use a different probability level, of course, and get a plot
plot(base$area, base$rent_euro, xlab = expression(paste("surface (", m^2, ")")),
     ylab = "rent (euros/month)", col = rgb(0, 0, 1, .4), cex = .5)
sf = 0:250
yr = r$solution[2*n+1] + r$solution[2*n+2]*sf
lines(sf, yr, lwd = 2, col = "blue")
tau = .9
r = lp("min", c(rep(tau, n), rep(1 - tau, n), 0, 0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0, 2*n), y))
tail(r$solution, 2)
## [1] 121.815505   7.865536
yr = r$solution[2*n+1] + r$solution[2*n+2]*sf
lines(sf, yr, lwd = 2, col = "blue")

Quantile Regression (multiple)
Now that we understand how to run the optimization program with one covariate, why not try with two? For instance, let us see if we can explain the rent of a flat as a (linear) function of the surface and the age of the building.
require(lpSolve)
tau = .3
n = nrow(base)
X = cbind(1, base$area, base$yearc)
y = base$rent_euro
A1 = cbind(diag(2*n), 0, 0, 0)
A2 = cbind(diag(n), -diag(n), X)
r = lp("min", c(rep(tau, n), rep(1 - tau, n), 0, 0, 0), rbind(A1, A2),
       c(rep(">=", 2*n), rep("=", n)), c(rep(0, 2*n), y))
tail(r$solution, 3)
## [1] 0.000000 3.257562 0.077501

Unfortunately, this time, it is not working well…
library(quantreg)
rq(rent_euro ~ area + yearc, tau = tau, data = base)
## Coefficients:
##  (Intercept)         area        yearc
## -5542.503252     3.978135     2.887234

Results are quite different. And actually, another technique can confirm the latter (IRLS, Iteratively Reweighted Least Squares)
eps = residuals(lm(rent_euro ~ area + yearc, data = base))
for(s in 1:500){
  reg = lm(rent_euro ~ area + yearc, data = base,
           weights = (tau*(eps > 0) + (1 - tau)*(eps < 0))/abs(eps))
  eps = residuals(reg)
}
reg$coefficients
##  (Intercept)         area        yearc
## -5484.443043     3.955134     2.857943

I could not figure out what went wrong with the linear program. Not only are the coefficients very different, but so are the predictions…
yr = r$solution[2*n+1] + r$solution[2*n+2]*base$area + r$solution[2*n+3]*base$yearc
plot(predict(reg), yr)
abline(a = 0, b = 1, lty = 2, col = "red")
It’s now time to investigate….
To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.
Detecting unconscious bias in models, with R
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
There's growing awareness that the data we collect, and in particular the variables we include as factors in our predictive models, can lead to unwanted bias in outcomes: from loan applications, to law enforcement, and in many other areas. In some instances, such bias is even directly regulated by laws like the Fair Housing Act in the US. But even if we explicitly remove "obvious" variables like sex, age or ethnicity from predictive models, unconscious bias might still be a factor in our predictions as a result of highly-correlated proxy variables that are included in our model.
As a result, we need to be aware of the biases in our model and take steps to address them. For an excellent general overview of the topic, I highly recommend watching the recent presentation by Rachel Thomas, "Analyzing and Preventing Bias in ML". And for a practical demonstration of one way you can go about detecting proxy bias in R, take a look at the vignette created by my colleague Paige Bailey for the ROpenSci conference, "Ethical Machine Learning: Spotting and Preventing Proxy Bias".
The vignette details general principles you can follow to identify proxy bias in an analysis, in the context of a case study analyzed using R. The case study considers data and a predictive model that might be used by a bank manager to determine the creditworthiness of a loan applicant. Even though race was not explicitly included in the adaptive boosting model (from the C5.0 package), the predictions are still biased by race:
That's because zipcode, a variable highly associated with race, was included in the model. Read the complete vignette linked below to see how Paige modified the model to ameliorate that bias, while still maintaining its predictive power. All of the associated R code is available in the iPython Notebook.
GitHub (ropenscilabs): Ethical Machine Learning: Spotting and Preventing Proxy Bias (Paige Bailey)
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
Mixed Models with Adaptive Gaussian Quadrature
(This article was first published on iProgn: Interactive Prediction Tools based on Joint Models, and kindly contributed to R-bloggers)
Overview
In this post, I would like to introduce my new R package GLMMadaptive for fitting mixed-effects models for non-Gaussian grouped/clustered outcomes using marginal maximum likelihood.
Admittedly, there are a number of packages available for fitting similar models, e.g., lme4, glmmsr, glmmTMB, glmmEP, and glmmML among others; more information on other available packages can also be found in the GLMM-FAQ. GLMMadaptive differs from these packages in approximating the integrals over the random effects in the definition of the marginal log-likelihood function using an adaptive Gaussian quadrature rule, while allowing for multiple correlated random effects.
An Example: Mixed Effects Logistic Regression
We illustrate the use of the package in the simple case of a mixed effects logistic regression. We start by simulating some data:
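The simulation chunk did not make it into this copy of the post; a minimal stand-in that produces a clustered binary outcome is sketched below. All names (`id`, `time`, `sex`, `y`) and parameter values here are assumptions chosen to match the model discussed next.

```r
set.seed(1234)
n <- 50   # number of subjects
K <- 8    # measurements per subject

# a longitudinal design: repeated measurements over time, one sex per subject
DF <- data.frame(id   = rep(seq_len(n), each = K),
                 time = rep(seq(0, 2, length.out = K), n),
                 sex  = rep(gl(2, n / 2, labels = c("male", "female")), each = K))

# subject-specific random intercepts
b <- rnorm(n, sd = 1.2)

# linear predictor and binary longitudinal outcome y
eta  <- -1.5 + 0.5 * DF$time + 0.8 * (DF$sex == "female") + b[DF$id]
DF$y <- rbinom(n * K, size = 1, prob = plogis(eta))
```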
We continue by fitting the mixed effects logistic regression for the longitudinal outcome y, assuming random intercepts for the random-effects part. The primary model-fitting function in the package is mixed_model(), which has four required arguments, namely,
- fixed: a formula for the fixed effects,
- random: a formula for the random effects,
- family: a family object specifying the type of response variable, and
- data: a data frame containing the variables in the previously mentioned formulas.
Assuming that the package has been loaded using library("GLMMadaptive"), the call to fit the random intercepts logistic regression is:
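The call itself is missing from this copy of the post; following the argument list just given, it would be along these lines. The formulas and the data frame `DF` refer to the simulated example and are assumptions.

```r
library("GLMMadaptive")

# toy data standing in for the missing simulation chunk (names are assumptions)
set.seed(1234)
n <- 50; K <- 8
DF <- data.frame(id   = rep(seq_len(n), each = K),
                 time = rep(seq(0, 2, length.out = K), n),
                 sex  = rep(gl(2, n / 2, labels = c("male", "female")), each = K))
b <- rnorm(n, sd = 1.2)
DF$y <- rbinom(n * K, size = 1, prob = plogis(-1.5 + 0.5 * DF$time + b[DF$id]))

# random intercepts logistic regression
fm1 <- mixed_model(fixed = y ~ sex * time, random = ~ 1 | id,
                   data = DF, family = binomial())
```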
By default, 11 quadrature points are used, but this can be adjusted using the nAGQ control argument. We extend model fm1 by also including a random slopes term; however, we assume that the covariance between the random intercepts and random slopes is zero. This is achieved by using the `||` symbol in the specification of the random argument. We fit the new model and compare it with fm1 using the anova() function that performs a likelihood ratio test:
We further extend the model by estimating the covariance between the random intercepts and random slopes, and we use 15 quadrature points for the numerical integration:
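The corresponding chunks were also lost in the export; under the same simulated-data assumption, the extensions described above (uncorrelated random slopes via `||`, a likelihood ratio test with anova(), and a full random-effects covariance fit with 15 quadrature points) might look like the following sketch.

```r
library("GLMMadaptive")

# toy data standing in for the missing simulation chunk (names are assumptions)
set.seed(1234)
n <- 50; K <- 8
DF <- data.frame(id   = rep(seq_len(n), each = K),
                 time = rep(seq(0, 2, length.out = K), n),
                 sex  = rep(gl(2, n / 2, labels = c("male", "female")), each = K))
b <- rnorm(n, sd = 1.2)
DF$y <- rbinom(n * K, size = 1, prob = plogis(-1.5 + 0.5 * DF$time + b[DF$id]))

fm1 <- mixed_model(fixed = y ~ sex * time, random = ~ 1 | id,
                   data = DF, family = binomial())

# random intercepts & slopes with zero covariance between them (note the `||`)
fm2 <- mixed_model(fixed = y ~ sex * time, random = ~ time || id,
                   data = DF, family = binomial())

# likelihood ratio test comparing the two models
anova(fm1, fm2)

# full random-effects covariance matrix, 15 quadrature points
fm3 <- mixed_model(fixed = y ~ sex * time, random = ~ time | id,
                   data = DF, family = binomial(), nAGQ = 15)
```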
The package offers a wide range of methods for standard generic functions in R applicable to regression model objects, and mixed model objects in particular (e.g., fixef(), ranef(), etc.); more information can be found in the Methods for MixMod Objects vignette. In addition, some highlights of its capabilities:
- It allows for user-defined family objects implementing not standardly available outcome distributions; more information can be found in the Custom Models vignette.
- It can calculate fixed effects coefficients with a marginal / population averaged interpretation using the function marginal_coefs().
- Function effectPlotData() calculates predictions and confidence intervals for constructing effect plots.
The development version of the package is available on the GitHub repository.
To leave a comment for the author, please follow the link and comment on their blog: iProgn: Interactive Prediction Tools based on Joint Models.
Minor updates for ggseas and Tcomp R packages by @ellis2013nz
(This article was first published on free range statistics - R, and kindly contributed to R-bloggers)
Updates
I’ve made small updates to two of my R packages on CRAN: ggseas (seasonal adjustment on the fly for ggplot2 graphics) and Tcomp (tourism forecasting competition data). Neither of the packages changes in a noticeable way for most users.
- The ggseas update is to get it ready for the coming release of ggplot2 2.3.0 scheduled for the end of June 2018. It makes changes under the hood to the mapping of variables to support “tidy evaluation” but does not change anything for users. It will still work with older versions of ggplot2.
- The Tcomp update fixes a problem where the reported length of time series was incorrect.
I’ve written more about ggseas and Tcomp elsewhere. Given they’re both being updated at once, here’s a brief demo of them in action together.
There are 1,311 series in the list tourism in Tcomp. Each element of tourism contains some limited metadata about the series (for example its length, and how long the forecast is to be for), the original training data, and the “answer” in the form of the actual observations over the forecast period. These objects are of class Mdata, introduced in Rob Hyndman’s Mcomp package, which comes with a convenient plotting method.
library(tidyverse)
library(scales)
library(ggseas)
library(Tcomp)

# default plot method for forecasting competition datasets of class Mdata
par(bty = "l", font.main = 1)
plot(tourism[["M4"]], main = "Series M4 from the tourism forecasting competition")

ggseas makes it easy to take a seasonal time series object (ie an object of class ts with frequency > 1), convert it into a data frame, and produce exploratory graphics with it. For example, here’s code to take the training data from that same forecasting competition object and compare the original with a seasonally adjusted version (using X13-SEATS-ARIMA for the seasonal adjustment) and a 12 month rolling average:
# convert a time series to a data frame
the_data <- ggseas::tsdf(tourism[["M4"]]$x)

# draw a graphic
ggplot(the_data, aes(x = x, y = y, colour = 1)) +
  geom_line(aes(colour = "Original")) +
  stat_seas(aes(colour = "Seasonally adjusted")) +
  stat_rollapplyr(aes(colour = "12 month rolling average"), width = 12) +
  scale_colour_manual(values = c("Original" = "grey50",
                                 "Seasonally adjusted" = "red",
                                 "12 month rolling average" = "blue")) +
  theme(legend.position = c(0.2, 0.8)) +
  scale_y_continuous("Unknown units\n", label = comma) +
  labs(x = "", colour = "") +
  ggtitle("Comparison of statistical summary methods in plotting a time series",
          "Monthly series 4 from the Tourism forecasting competition")

Then the ggsdc function makes it easy to look at a decomposition of a time series. This really comes into its own when we want to look at two or more time series, mapped to colour (there’s an example in the helpfile), but we can use it with a univariate time series too:
ggsdc(the_data, aes(x = x, y = y), method = "seas") +
  geom_line() +
  ggtitle("Seasonal decomposition with `X13-SEATS-ARIMA`, `seasonal` and `ggseas`",
          "Monthly series 4 from the Tourism forecasting competition") +
  scale_y_continuous("Unknown units\n", label = comma) +
  labs(x = "")

I wrote the first version of the functions that later became ggseas in 2012 when working for New Zealand’s Ministry of Business, Innovation and Employment. We had new access to large amounts of monthly electronic transactions data and needed an efficient way to explore it on our way to developing what eventually became the Monthly Regional Tourism Estimates.
Reflections on CRAN
This experience and recent twittering on Twitter have led me to reflect on the CRAN package update process. Both these updates went through really smoothly; I’m always impressed at the professionalism of the volunteers behind CRAN.
I think CRAN is hugely important and an amazing asset for the R community. It’s really important that packages be published on CRAN. I think the mapping of dependencies, and the process for checking that they all keep working together, is the key aspect here.
ggplot2 apparently has more than 2,000 reverse dependencies (ie packages that import some functionality from ggplot2), all of them maintained more or less voluntarily. When major changes are made I can’t think of any way other than CRAN (or something like it) for helping the system react and keep everything working. For instance, I found out that the new ggplot2 v2.3.0 would break ggseas when I got an automated email advising me of this, from the tests Hadley Wickham and others working on ggplot2 ran to identify problems for their reverse dependencies. The RStudio community site was useful for pointing me (and others with similar issues) in the direction of a fix, and of course one uses GitHub or similar to manage issues and code version control, but in the end we need the centralised discipline of something like CRAN as the publication point and the definitive map of the dependencies. So, a big shout out to the volunteers who maintain CRAN and do an awesome job.
Reflections on testing
When a user pointed out a bug in the Tcomp package by raising an issue on GitHub, I was pleased with myself for using genuine test-driven development. That is, I first wrote a test that failed because of the bug:
test_that("series lengths are correct", {
  expect_equal(sum(sapply(tourism, function(s){length(s$x) - s$n})), 0)
})

…and then worked on finding the problem and fixing it upstream. It’s such an effective way to work, and it means that you have a growing body of tests. Packages like Tcomp and ggseas go for years without me doing anything to them, so when I do have to go back and make a change it’s very reassuring to have a bunch of tests (plus running all the examples, and performing the checks for CRAN) to be sure that everything still works the way it’s meant to.
There aren’t enough tests in the build process of those packages at the moment; the more experienced I get, the more I think “build more tests, and build them earlier” is one of the keys to success. This isn’t just true of code-driven projects, either; I think there’s broad applicability in any professional situation, although the development and automation of tests is harder when you’re not just dealing with code.
To leave a comment for the author, please follow the link and comment on their blog: free range statistics - R.
reticulate – another step towards a multilingual and collaborative way of working
(This article was first published on eoda english R news, and kindly contributed to R-bloggers)
R, Julia, Python – today’s data scientists have the choice between numerous programming languages, each with its own strengths and weaknesses. Would it not be convenient to bring those languages together and use the individual strengths of each? The package “reticulate” for R takes a step in this direction.
The reticulate package creates a Python connection for R which enables the user to combine both languages for an optimal result. In this tutorial, which refers to the Python engine for R Markdown implemented by reticulate, we will illustrate the basic idea, functionality and usefulness of the engine.
As an in-depth tutorial with examples and a fictional use case, we include an HTML notebook which is created in R Markdown. You can find it here.
The weapon of choice – data analysis in your day-to-day work
It could all be so simple: everyone speaks the same language, borders blur and collaborations become easier than ever. What may sound a bit dramatic happens often in the everyday life of a data scientist. Even though R can show its strengths from a statistical point of view, Python, as a multi-purpose language, is becoming increasingly popular in the data science scene. Because of its multifunctionality, it is more and more frequently the programming language of choice. Therefore, it seems necessary to build a bridge between both languages and create a common space within a project.
The virtual bridge – the reticulate package
The reticulate package provides a Python connection for R. Users can execute Python code directly from the R console, load Python scripts into the R environment and make them usable for R. In addition, reticulate provides a Python engine for R Markdown, which is described in the following example in more detail.
Introduction – setup & Python chunks in R Markdown
The setup of reticulate is very simple. After installing the package as well as the preferred Python distribution (2.7.x or 3.x), you can register the Python engine:

library(reticulate)
knitr::knit_engines$set(python = reticulate::eng_python)

The code chunks then work as usual, and Python chunks can be used alongside R chunks.
For example:
```{r testChunk_R, warning = FALSE, message = FALSE}
Code
```

is an R chunk, and

```{python TestChunk_Py, warning = FALSE, message = FALSE}
Code
```

is a Python chunk. But attention: code executed in a Python chunk will not be loaded into the global environment! Access to its functions and/or variables from outside is only possible once knit is executed, which means that running the chunk alone is not enough. In our corresponding HTML notebook you will find further examples of implementing functions in different languages.
Continuation – automatic type conversion and interconnectivity
Taken together, reticulate creates an appropriate bridge between the two languages. This allows developers with different programming skills to work in the same document of a joint project, all without having to translate code. Furthermore, the strengths of both languages can be used. To blur the boundaries further, reticulate also makes it possible to access variables of one language from within a chunk of the other.
The following code extract illustrates this functionality: reticulate automatically converts types into the respective language's equivalents. For example, if the chunk chunk1_Py reads test_vector, the R type vector is converted into the equivalent Python type list.
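The extract itself is not reproduced here; a minimal reconstruction of the idea (with the chunk names mentioned in the text) might look like this in an R Markdown document:

```{r chunk1_R}
test_vector <- c("a", "b", "c")
```

```{python chunk1_Py}
# R variables are exposed to Python via the `r` object;
# the R vector arrives as a Python list
print(r.test_vector)
```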
Further type assignments can be found here.
Source_Python – the “hidden treasure” among functions
The ability to embed Python code directly into a document and use it across languages already provides a powerful tool for seamless collaboration. The hidden treasure among the functions in the reticulate package, however, is source_python(). This function reads a Python script and makes its functions and variables available to both languages within the markdown document. Its operation is illustrated in the following example:
File bubblesort.py:

```python
def bsort(numbers):
    if (not isinstance(numbers, list)
            or not all(isinstance(p, (int, float)) for p in numbers)
            or len(numbers) == 0):
        print("Warning: Input must be of type list, only numeric and have length >= 1")
        return
    for i in reversed(range(1, len(numbers))):
        for j in range(0, i):
            if numbers[j] > numbers[j + 1]:
                numbers = swap(numbers, j, j + 1)
    return numbers


def swap(numbers, id1, id2):
    temp = numbers[id1]
    numbers[id1] = numbers[id2]
    numbers[id2] = temp
    return numbers
```

Markdown document:

```{r setup_script}
source_python("bubblesort.py")
```

```{r sort_R}
bsort(c(4, 3, 2, 1))
```

Output: 1, 2, 3, 4

```{python sort_Py}
print(bsort([4, 3, 2, 1]))
```

Output: [1, 2, 3, 4]
The Python script bubblesort.py implements the bubble sort algorithm in the function bsort(numbers). This implementation is made available in the markdown document via source_python("bubblesort.py") and can then be used by both languages, as shown above.
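The same mechanism works outside of R Markdown as well; in a plain R session, source_python() exposes the script's functions directly (a minimal sketch, assuming bubblesort.py sits in the working directory):

```r
library(reticulate)
source_python("bubblesort.py")  # bsort() and swap() are now callable from R
bsort(c(4, 3, 2, 1))
```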
With this feature, reticulate removes the last remaining limitation. Developers can work entirely in their preferred development environment and merge the results into a final document. Already existing Python scripts (e.g. scripts for database connections, deep learning algorithms, etc.) can be carried over seamlessly.
Use case – text mining with preparatory work
Consider the following fictional use case: the electronics retailer Elektro-X wants to sign an advertising contract with one of the three retailers Lenavu, Ace-Ar and AS-OS. To get a more detailed picture of customer satisfaction with these retailers, Elektro-X starts a text mining project implemented in Python: shop reviews are read in and searched for mentions of the retailers' products. In addition, a function is provided for calculating an average rating over several reviews. Due to a lack of time, Elektro-X is unable to follow through with the project and hires eoda to finish it. eoda completed the analysis in R to gain access to ggplot for visualization while reusing the preparatory work done in Python: the project was transferred to R using the reticulate package, and R was then used to visualize the ratings and the sentiment analysis.
Conclusion – the future of big data
In conclusion, packages like reticulate have the potential to significantly influence the big data industry and thus shape its future. Service providers like eoda, in particular, benefit from barrier-free collaboration with their customers, as they can tackle core problems more quickly without first having to deal with language-related friction. If you would like to work more efficiently on customer projects in the future, consider language-barrier solutions such as reticulate.
Would you like to know more about multilingual and collaborative working processes? We as data science specialists can help you take the next step. Contact us for more information at www.eoda.de.
To leave a comment for the author, please follow the link and comment on their blog: eoda english R news.
R community update: announcing New Delhi useR meetup group
(This article was first published on R – Tech and Mortals, and kindly contributed to R-bloggers)
“The way to change the world is through individual responsibility and taking local action in your own community.” – Jeff Bridges

Over the past few years, R’s adoption has grown rapidly throughout the world. Much of it can be attributed to the growth of ‘data science’ as a domain, but R’s popularity exists primarily because of its amazing community and their contributions. Be it through open source development, Twitter (#rstats), discussion forums or programs such as Google Summer of Code (GSoC), there are numerous channels for beginners and practitioners to interact with each other. I’ve had the opportunity to interact with members of the R ecosystem through the previously mentioned channels, and it has been wonderful to say the least.

However, despite increasing popularity and a growing user base in India, the local community aspect has largely been missing. Even New Delhi, the nation’s capital and a startup hub, lacks a strong community. There had been efforts in the past to organize meetups and conferences, but nothing really worked out. To address this, some R practitioners from the region have started a useR meetup group. We plan to host regular meetings, bring together existing R users and help introduce the language to beginners. You can find the meetup page link here.

We’re also grateful to the R Consortium for their support to the group through the R User Group and Small Conference Support Program. The list of other groups which are part of the program can be found here, and you can find more information about the program on their website.

In addition to members, we’re also looking for speakers. Although we would prefer speakers to be present at the venue, webinars can also work out. If you have useful ideas or learnings to share with the group, or know someone else who could do the same, please reach out through mail or start a discussion on the meetup page. And do join the group if you want to be a part of this community. Here’s hoping to the community’s success.