R news and tutorials contributed by hundreds of R bloggers

### Le Monde puzzle [#1111]

Wed, 09/18/2019 - 00:19

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers.]

Another low-key arithmetic problem as Le Monde current mathematical puzzle:

Notice that there are 10 numbers less than, and coprime with, 11; 100 less than, and coprime with, 101; and 1000 less than, and coprime with, 1111. What is the smallest integer N such that the cardinal of the set of M

This is indeed a case for brute-force resolution, as in the following R code:

```r
library(numbers)

# count the integers between 1 and n-1 that are coprime with n
homanycoprim = function(n) {
  many = 1
  for (i in 2:(n - 1)) many = many + coprime(i, n)
  return(many)
}

# find the smallest n >= 1e4 with exactly 1e4 coprime integers below it
smallest = function() {
  n = 1e4
  many = homanycoprim(n)
  while (many != 1e4) many = homanycoprim(n <- n + 1)
  return(n)
}
```

which returns n = 10291 as the smallest value of N. For the other two questions, the usual integer-to-digits conversion is necessary:
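Since homanycoprim(n) is simply Euler's totient function φ(n), the counts quoted in the puzzle and the brute-force answer can be cross-checked without the numbers package. The base-R totient below, via trial-division factorization, is my own sketch, not part of the original solution:

```r
# Euler's totient phi(n): the count of integers in 1..n-1 coprime with n,
# computed from the prime factorization of n
phi <- function(n) {
  result <- n; m <- n; p <- 2
  while (p * p <= m) {
    if (m %% p == 0) {
      while (m %% p == 0) m <- m %/% p
      result <- result - result %/% p   # multiply result by (1 - 1/p)
    }
    p <- p + 1
  }
  if (m > 1) result <- result - result %/% m  # remaining prime factor
  result
}

phi(11)     # 10
phi(101)    # 100
phi(1111)   # 1000
phi(10291)  # 10000
```

The last value works out because 10291 = 41 × 251, giving φ = 40 × 250 = 10000.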

```r
# smallest and largest multiples of 1111 with all-distinct digits
smalldiff = function() {
  n = 1111; mul = 1
  while (mul < 1e6) {
    x = as.numeric(strsplit(as.character(n * mul), "")[[1]])
    while (sum(duplicated(x)) != 0) {
      mul = mul + 1
      x = as.numeric(strsplit(as.character(n * mul), "")[[1]])
    }
    print(n * mul); mul = mul + 1
  }
}
```

leading to 241,087 as the smallest and 9,875,612,340 as the largest (with 10 digits).
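Both reported answers are easy to sanity-check in base R: each must be a multiple of 1111 with no repeated digit. This check is mine, not from the original post:

```r
# TRUE if the decimal digits of x are all distinct
has_distinct_digits <- function(x) {
  d <- strsplit(as.character(x), "")[[1]]
  !any(duplicated(d))
}

241087 %% 1111 == 0 && has_distinct_digits(241087)          # TRUE
9875612340 %% 1111 == 0 && has_distinct_digits(9875612340)  # TRUE, and uses all 10 digits
```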


### The Learning Platform for Data-Driven Companies

Tue, 09/17/2019 - 15:55

[This article was first published on DataCamp Community - r programming, and kindly contributed to R-bloggers.]

The nature of work today requires continuous learning and the ability to respond appropriately to new information, including an increasing abundance of data. Companies must ensure that their employees are data fluent; those that do outperform their peers in revenue growth, market share, profitability, and customer and employee satisfaction.

Data fluency is the ability to understand data, communicate insights from that data, and use it to solve problems and make more informed decisions. Data fluency spans a spectrum of skills and proficiencies: having the appropriate level of data skills to work efficiently and effectively given the needs of each job role. According to a survey of over 300 L&D leaders from diverse industries, conducted by DataCamp with Training Industry, 84% of companies plan to invest in data fluency by 2020.

75% of companies with mature data fluency have seen an increase in individual productivity, while 68% of companies with immature data fluency report greater inefficiency in their workflows.

Faster path from data to insights and decisions

Many companies struggle with managing their data and focusing on the questions their data should answer. Wouldn’t you like to give every employee the skills and autonomy to ask better questions of data, get clearer insights, and make decisions faster? And ultimately, be able to use your time, resources, and talent more effectively?

“DataCamp has democratized data science learning across our organization. We no longer need to depend on a small number of knowledgeable people to educate others.”—Mark Adamson, Quantitative Analyst, EDF Energy

DataCamp for Business helps companies—from financial institutions and health care firms to management consultancies and governments—equip their workforce with the skills to learn from their data in an efficient and effective manner.

A modern learning experience

We provide an intuitive, personalized, and interactive approach to online learning paired with expert instruction. DataCamp makes it easy for teams to learn and retain 21st-century skills—from basic analytics to advanced topics in data science and data engineering—and gain the confidence to apply those skills in the real world right away.

“The key benefit of DataCamp is the flexibility. It’s about giving my team the ability to develop themselves and ultimately make themselves more valuable…DataCamp is helping my team to transfer some of their skills in SQL, SAS, and R into Python and what we’re building in the cloud.”—Duncan Bain, Senior Data Scientist, Scottish Power

Our platform is accessible anytime, anywhere, and on any device. It’s even optimized for mobile, so your employees can practice new skills in Python, R, and SQL on their daily commute. We break down complex concepts into short, fun exercises so that everyone can learn by doing when it’s convenient for them.

Learning tracks and specialized course collections to guide your teams

DataCamp’s content catalog includes courses and hands-on projects for everyone from beginner to expert, including courses and curated learning tracks tailored to specific teams and industries. Explore course collections in marketing or finance, plus deep-dive career tracks and skill tracks in data science and analytics.

Discover your team’s data science skill level

Gain a bird's-eye view of your team's skill level with a quick but rigorous assessment. DataCamp Signal isn't like other tests: your team will write actual code in addition to completing multiple-choice questions, and the difficulty of the assessment automatically adapts based on performance.

Signal also provides personalized course recommendations based on your employees’ strengths and skill gaps, which means they can make the most of the time they spend learning on DataCamp.

Learning at scale

DataCamp makes it easy to implement and manage a scalable learning solution for teams of any size—so you can get your team up and running quickly and they can apply their new skills faster. Monitor your teams’ engagement with simple progress tracking and advanced reporting, and easily integrate DataCamp into your learning ecosystem with Single Sign-On (SSO) and LMS integrations with Degreed, Cornerstone OnDemand, and SAP SuccessFactors. And to keep pace with your unique learning objectives, DataCamp allows you to create custom assignments and learning tracks for each of your teams.


### Retrieving google drive item shares and permissions (in R)

Tue, 09/17/2019 - 14:00

Google Drive is a great tool; specifically, we've been using G Suite (the equivalent of Google Drive for businesses) for a long time. Lately I noticed it's missing an important feature: monitoring file shares and permissions of Google Drive items across an organization is non-trivial (at least in the G Suite Basic subscription).

I wanted to get a better sense of how my files and folders are shared across users within and outside the organization. I decided to give the googledrive package a try and extract the shares and permissions of important folders I had on my account. Here’s how I did that.

Using googledrive to mass-extract shares and permissions of Google Drive items

First, I wanted to focus on specific folders which contain a lot of subfolders. My goal was to generate a tibble with the names of all sub-items (folders or files one level under the specified folders), along with all users who have access to these folders.

Here is a description of what you need to do in order to accomplish what I described, followed by the code I used.

1. Get the id (or URL) of the folder you want to retrieve data from. You can get the id by visiting the folder; the id appears in the URL, e.g., https://drive.google.com/drive/u/1/folders/.
2. Then, retrieve all items within the folder.
3. Use purrr functions (i.e., map and map_df) to iterate over the results and bring them into a tidy form.
4. (Optional) Use pivot_wider to create a tibble in a wide format where each row is an item and each column is the type of access each user has on the item.

The first function I’m using is a function I defined called get_permissions_df():

```r
library(tidyverse)
library(googledrive)  # the package used to interface with the Google API

# Function to retrieve email addresses of permissions (read/write) --------
get_permissions_df <- function(permission_list){
  map_df(permission_list, ~{
    if (!is.null(.$emailAddress)){
      tibble(user_email = .$emailAddress, user_role = .$role)
    } else {
      tibble(user_email = NA, user_role = NA)
    }
  })
}
```

The function returns a tibble with two columns, user_email and user_role, according to the users with access to the folder (access can be owner/writer/reader). Now, to actually pull the data and process it:

```r
# Read the contents of the folder -----
# Note that the first time you run this, you will be asked to log in to
# your Gmail account using a web browser.
folder_contents <- drive_ls(as_id(""))  # replace with the URL or ID of the folder

# Get a tidy form of all shares and permissions of subfolders
tidy_permissions <- folder_contents %>%
  mutate(new_creds = map(drive_resource, ~{get_permissions_df(.$permissions)})) %>%
  select(name, new_creds) %>%
  unnest(new_creds) %>%
  filter(!is.na(user_email))

# Optional - turn into a wider form where each column is a user,
# each row is a subfolder, and values are permissions of users.
wide_permissions <- tidy_permissions %>%
  distinct(name, user_email, .keep_all = T) %>%
  pivot_wider(id_cols = name, names_from = user_email,
              values_from = user_role,
              values_fill = list(user_role = "none"))
```

There you have it: tidy_permissions will hold the names of all subfolders with permissions. A folder will appear as many times as it has permissions (with the user email and the type of permission). The wide_permissions will hold a wide version in which each row is a folder and each column is a user.

Note that this works for specific folders. You can also use drive_ls() without any arguments (or with recursive = TRUE) to pull everything on the drive (or everything within all subfolders, recursively). When I did that, it took around 5-10 minutes to pull all the data and about 5 minutes to prepare it, since I have more than 100k items.

Conclusions

This post provides a method to create a concise tibble with the contents of your Google Drive items and their user permissions. You can run it either on all items in your Google Drive or on selected folders (and the sub-folders within). The method is especially useful in the context of data safety and security, when you want to make sure you are not sharing sensitive items in an undesired manner.
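As a sketch of that security use case, one can flag rows of tidy_permissions whose e-mail address falls outside the organization's domain. The data frame and the domain mycompany.com below are made up for illustration:

```r
# Mock of the tidy_permissions tibble built above (hypothetical data)
tidy_permissions <- data.frame(
  name       = c("budget", "budget", "roadmap"),
  user_email = c("me@mycompany.com", "someone@gmail.com", "me@mycompany.com"),
  user_role  = c("owner", "writer", "owner")
)

# Keep only permissions granted to accounts outside the organization's domain
external <- tidy_permissions[!grepl("@mycompany\\.com$", tidy_permissions$user_email), ]
external$name  # "budget" is shared externally
```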


### Obtaining tokens with AzureAuth inside a Shiny app

Tue, 09/17/2019 - 09:00

by Hong Ooi, senior data scientist, Microsoft Azure

As of version 1.2.0 (released to CRAN late last month), it's possible to use the AzureAuth package to log in interactively to Azure from within a Shiny web app. Because a Shiny app has separate UI and server components, some changes had to be made to the interactive authentication flows. In particular, the authorization step (logging in to Azure) has to be conducted separately from the token acquisition step.

AzureAuth now provides the build_authorization_uri function to facilitate this separation. You call this function to obtain a URI that you browse to in order to login to Azure. Once you have logged in, Azure will return an authorization code as part of a redirect, which can be captured by the browser. The code is then passed to the get_azure_token function in the auth_code argument, which then completes the task of acquiring the token.
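For illustration, here is roughly what capturing the authorization code looks like: after login, the redirect URL carries the code in its `code` query parameter, which is then handed to get_azure_token's auth_code argument as described above. In a real Shiny app you would typically use shiny::parseQueryString; the URL below is invented, and this base-R extraction is my own sketch:

```r
# A made-up redirect URL of the kind Azure sends back after login
redirect <- "https://myapp.example.com/?code=AQABAAIAAAB2&session_state=xyz"

# Strip everything up to and including '?', then split key=value pairs
query  <- sub("^[^?]*\\?", "", redirect)
params <- strsplit(strsplit(query, "&")[[1]], "=")

# Build a named vector of parameters and pull out the authorization code
auth_code <- setNames(sapply(params, `[`, 2), sapply(params, `[`, 1))[["code"]]
auth_code  # "AQABAAIAAAB2"
```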

Here is a skeleton Shiny app that demonstrates its use.

Note that this process is only necessary within a web app, and only when using an interactive authentication flow. In a normal R session, or when using the client credentials or resource owner grant flows, you can simply call get_azure_token directly.


### Hindi and Other Languages in India based on 2001 census

Tue, 09/17/2019 - 07:45

[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers.]

India is the world's largest democracy and also a highly diverse place. This is my attempt to see how "Hindi" and other languages are spoken in India.

In this post, we'll see how to collect the data for this question directly from Wikipedia, and how to visualize it, highlighting the insight.

Data

Wikipedia is a great source for data like this (languages spoken in India), and because Wikipedia serves these tables as HTML, it is quite easy to use rvest::html_table() to extract the table as a dataframe without much hassle.

```r
options(scipen = 999)

library(rvest)      # for web scraping
library(tidyverse)  # for data analysis and visualization

# the Wikipedia page URL - thanks to DuckDuckGo search
lang_url <- "https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers_in_India"

# extracting the entire content of the page
content <- read_html(lang_url)

# extracting only tables from the downloaded content
tables <- content %>% html_table(fill = TRUE)

# from the page we know it's the first table we want, so pick
# the first element from the list of tables
lang_table <- tables[[1]]

### header cleaning - exclude the first row
lang_table <- lang_table[-1, ]
lang_table
```

```
## First language speakers First language speakers
## 2 Hindi[b] 422,048,642 41.03%
## 3 English 226,449 0.02%
## 4 Bengali 83,369,769 8.10%
## 5 Telugu 74,002,856 7.19%
## 6 Marathi 71,936,894 6.99%
## 7 Tamil 60,793,814 5.91%
## 8 Urdu 51,536,111 5.01%
## 9 Kannada 37,924,011 3.69%
## 10 Gujarati 46,091,617 4.48%
## 11 Odia 33,017,446 3.21%
## 12 Malayalam 33,066,392 3.21%
## 13 Sanskrit 14,135 <0.01%
## Second languagespeakers[11] Third languagespeakers[11] Total speakers
## 2 98,207,180 31,160,696 551,416,518
## 3 86,125,221 38,993,066 125,344,736
## 4 6,637,222 1,108,088 91,115,079
## 5 9,723,626 1,266,019 84,992,501
## 6 9,546,414 2,701,498 84,184,806
## 7 4,992,253 956,335 66,742,402
## 8 6,535,489 1,007,912 59,079,512
## 9 11,455,287 1,396,428 50,775,726
## 10 3,476,355 703,989 50,271,961
## 11 3,272,151 319,525 36,609,122
## 12 499,188 195,885 33,761,465
## 13 1,234,931 3,742,223 4,991,289
## Total speakers
## 2 53.60%
## 3 12.18%
## 4 8.86%
## 5 8.26%
## 6 8.18%
## 7 6.49%
## 8 5.74%
## 9 4.94%
## 10 4.89%
## 11 3.56%
## 12 3.28%
## 13 0.49%
```

At this point, we've got the required table, but mind you, the numbers are stored as character strings, and to plot visualizations they must be numeric. We'll pick only First Language Speakers for the following sections, so we'll convert those numbers from character to numeric format.
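parse_number() from readr (loaded with the tidyverse) strips the thousands separators; the equivalent conversion in base R, shown here just to make it explicit, is a gsub followed by as.numeric:

```r
# drop the commas, then convert the string to a number
as.numeric(gsub(",", "", "422,048,642"))  # 422048642

# the same works vectorized over a whole column
as.numeric(gsub(",", "", c("83,369,769", "74,002,856")))
```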

```r
# clean-up the messed up column names
lang_table <- lang_table %>% janitor::clean_names()
lang_table[1, "x"] <- "Hindi"
lang_table
```

```
## x first_language_speakers first_language_speakers_2
## 2 Hindi 422,048,642 41.03%
## 3 English 226,449 0.02%
## 4 Bengali 83,369,769 8.10%
## 5 Telugu 74,002,856 7.19%
## 6 Marathi 71,936,894 6.99%
## 7 Tamil 60,793,814 5.91%
## 8 Urdu 51,536,111 5.01%
## 9 Kannada 37,924,011 3.69%
## 10 Gujarati 46,091,617 4.48%
## 11 Odia 33,017,446 3.21%
## 12 Malayalam 33,066,392 3.21%
## 13 Sanskrit 14,135 <0.01%
## second_languagespeakers_11 third_languagespeakers_11 total_speakers
## 2 98,207,180 31,160,696 551,416,518
## 3 86,125,221 38,993,066 125,344,736
## 4 6,637,222 1,108,088 91,115,079
## 5 9,723,626 1,266,019 84,992,501
## 6 9,546,414 2,701,498 84,184,806
## 7 4,992,253 956,335 66,742,402
## 8 6,535,489 1,007,912 59,079,512
## 9 11,455,287 1,396,428 50,775,726
## 10 3,476,355 703,989 50,271,961
## 11 3,272,151 319,525 36,609,122
## 12 499,188 195,885 33,761,465
## 13 1,234,931 3,742,223 4,991,289
## total_speakers_2
## 2 53.60%
## 3 12.18%
## 4 8.86%
## 5 8.26%
## 6 8.18%
## 7 6.49%
## 8 5.74%
## 9 4.94%
## 10 4.89%
## 11 3.56%
## 12 3.28%
## 13 0.49%
```

```r
lang_table %>%
  select(one_of("x", "first_language_speakers")) %>%
  mutate(first_language_speakers = parse_number(first_language_speakers)) -> lang_table_first

names(lang_table_first) <- c("Language", "first_language_speakers")
lang_table_first
```

```
## Language first_language_speakers
## 1 Hindi 422048642
## 2 English 226449
## 3 Bengali 83369769
## 4 Telugu 74002856
## 5 Marathi 71936894
## 6 Tamil 60793814
## 7 Urdu 51536111
## 8 Kannada 37924011
## 9 Gujarati 46091617
## 10 Odia 33017446
## 11 Malayalam 33066392
## 12 Sanskrit 14135
```

Visualization

Now that we have a categorical and a numerical variable, it's time to play with some visualization: as is typical, a bar chart.

All Languages

```r
lang_table_first %>%
  mutate(Language = fct_reorder(Language, -first_language_speakers)) %>%
  ggplot() +
  geom_bar(aes(Language, first_language_speakers), stat = "identity",
           fill = ifelse(lang_table_first$Language == 'Hindi', "#ffdd00", "#ff00ff")) +
  theme_minimal() +
  labs(title = "Most Spoken Languages",
       subtitle = "First Language in India",
       caption = "Data Source: Wikipedia - Census 2001")
```

That's a long tail, with Hindi leading the way.

Hindi & Everyone else

```r
library(viridis)

lang_table_first %>%
  mutate(Language = ifelse(Language == "Hindi", "Hindi", "non_Hindi")) %>%
  group_by(Language) %>%
  summarize(first_language_speakers = sum(first_language_speakers)) %>%
  mutate(percentage = round((first_language_speakers / sum(first_language_speakers)) * 100, 2)) %>%
  ggplot() +
  geom_bar(aes(Language, percentage, fill = Language), stat = "identity") +
  scale_fill_viridis_d(option = 'E', direction = -1) +
  scale_y_continuous(limits = c(0, 60)) +
  theme_minimal() +
  geom_label(aes(Language, percentage, label = paste0(percentage, "%"))) +
  labs(title = "Hindi vs Non_Hindi",
       subtitle = "First Spoken Language in India",
       caption = "Data: Wikipedia - Census 2001")
```

Living up to the diversity of India, a mixed (assorted) group of languages other than Hindi forms ~54%, while Hindi alone is ~46%.

Summary

Not getting into the politics of this context: in this post, we learnt how to get the data we required using rvest and analyzed it using tidyverse to generate some valuable insights on India's most spoken first languages. If you are interested to know more about R, you can check out this tutorial.
### citecorp: working with open citations

Tue, 09/17/2019 - 02:00

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

citecorp is a new R package (it hit CRAN in late August) for working with data from the OpenCitations Corpus (OCC). OpenCitations, run by David Shotton and Silvio Peroni, houses the OCC, an open repository of scholarly citation data under the very open CC0 license. The I4OC (Initiative for Open Citations) is a collaboration between many parties, with the aim of promoting "unrestricted availability of scholarly citation data". Citation data is available through Crossref, and available in R via our packages rcrossref, fulltext and crminer. Citation data is also available via the OCC, and this OCC data is now available in R through the new package citecorp.

How much citation data does the OCC have? Quoting the OpenCitations website (as of today): the OCC has ingested the references from 326,743 citing bibliographic resources, and contains information about 13,964,148 citation links to 7,565,367 cited resources.

Why citation data? Citations are the links among scholarly works (articles, books, etc.), leading to many important uses including finding related articles, calculating article impact, and even use in academic hiring decisions.

Why open citation data? Until recently, most citation data has been locked behind publisher walls. Unfortunately, many publishers see giving away citation data as losing potential profits.
Through the I4OC, many publishers have made their citation metadata public, but some of the largest publishers still have not done so: Elsevier, the American Chemical Society, and IEEE. Without all citation data being open, any work that builds on citation data has only a sub-sample of all citations; you are drawing conclusions from a rather small subset of all existing citations. Nonetheless, the currently available open citation data is an important resource, and can be amended with citation data behind paywalls for those that have access.

About citecorp and the OCC

OpenCitations created their own identifiers called Open Citation Identifiers (oci), e.g., 020010009033611182421271436182433010601-02001030701361924302723102137251614233701000005090307

You are probably not going to be using oci identifiers, but rather DOIs and/or PMIDs (PubMed identifiers) and/or PMCIDs (PubMed Central identifiers) (see the PubMed Wikipedia entry for more). See ?citecorp::oc_lookup for methods for cross-walking among identifier types.

OpenCitations has a SPARQL endpoint for querying their data; you can find that at http://opencitations.net/sparql. We do interface with the OCC SPARQL endpoint in citecorp, but we don't provide a user interface directly to it. The OCC is also available as data dumps.
Links

citecorp source code: https://github.com/ropenscilabs/citecorp
citecorp on CRAN: https://cloud.r-project.org/web/packages/citecorp/

Installation

Install from CRAN:

```r
install.packages("citecorp")
```

Development version:

```r
remotes::install_github("ropenscilabs/citecorp")
```

Load citecorp:

```r
library(citecorp)
```

Converting among identifiers

Three functions are provided for converting among different identifier types; each function gives back a data.frame containing the URL for the article, PMID, PMCID and DOI:

• oc_doi2ids
• oc_pmid2ids
• oc_pmcid2ids

```r
oc_doi2ids("10.1097/igc.0000000000000609")
#>    type                           value
#> 1 paper https://w3id.org/oc/corpus/br/1
#> 2  pmid                        26645990
#> 3 pmcid                      PMC4679344
#> 4   doi    10.1097/igc.0000000000000609

oc_pmid2ids("26645990")
#>    type                           value
#> 1 paper https://w3id.org/oc/corpus/br/1
#> 2   doi    10.1097/igc.0000000000000609
#> 3 pmcid                      PMC4679344
#> 4  pmid                        26645990

oc_pmcid2ids("PMC4679344")
#>    type                           value
#> 1 paper https://w3id.org/oc/corpus/br/1
#> 2   doi    10.1097/igc.0000000000000609
#> 3  pmid                        26645990
#> 4 pmcid                      PMC4679344
```

Under the hood, we interact with their SPARQL endpoint to do these queries.

COCI methods

A series of three more functions are meant for fetching references of, citations to, or metadata for individual scholarly works. Here, we look for data for the DOI 10.1108/jd-12-2013-0166:

Peroni, S., Dutton, A., Gray, T. and Shotton, D. (2015), "Setting our bibliographic references free: towards open citation data", Journal of Documentation, Vol. 71 No. 2, pp. 253-277.
Note: if you don't load tibble, you get normal data.frames.

```r
library(tibble)
doi <- "10.1108/jd-12-2013-0166"
```

references: the works cited within the paper

```r
oc_coci_refs(doi)
#> # A tibble: 36 x 7
#>    oci           timespan citing   creation author_sc journal_sc cited
#>  1 020010100083… P9Y2M5D  10.1108… 2015-03… no        no         10.1001/j…
#>  2 020010100083… P41Y8M   10.1108… 2015-03… no        no         10.1002/a…
#>  3 020010100083… P25Y6M   10.1108… 2015-03… no        no         10.1002/(…
#>  4 020010100083… P17Y2M   10.1108… 2015-03… no        no         10.1007/b…
#>  5 020010100083… P2Y2M3D  10.1108… 2015-03… no        no         10.1007/s…
#>  6 020010100083… P5Y8M27D 10.1108… 2015-03… no        no         10.1007/s…
#>  7 020010100083… P2Y3M    10.1108… 2015-03… no        no         10.1016/j…
#>  8 020010100083… P1Y10M   10.1108… 2015-03… no        no         10.1016/j…
#>  9 020010100083… P12Y     10.1108… 2015-03… no        no         10.1023/a…
#> 10 020010100083… P13Y10M  10.1108… 2015-03… no        no         10.1038/3…
#> # … with 26 more rows
```

citations: the works that cite the paper

```r
oc_coci_cites(doi)
#> # A tibble: 13 x 7
#>    oci             timespan citing    creation author_sc journal_sc cited
#>  1 02001010707360… P1Y4M1D  10.1177/… 2016-07… no        no         10.110…
#>  2 02007050504361… P2Y11M2… 10.7554/… 2018-03… no        no         10.110…
#>  3 02001010405360… P2Y      10.1145/… 2018     no        no         10.110…
#>  4 02001000903361… P2Y3M4D  10.1093/… 2017-06… no        no         10.110…
#>  5 02001000007360… P1Y      10.1007/… 2017     no        no         10.110…
#>  6 02003030406361… P0Y      10.3346/… 2015     no        no         10.110…
#>  7 02001000007360… P2Y9M12D 10.1007/… 2017-12… no        no         10.110…
#>  8 02003020303362… P1Y14D   10.3233/… 2016-03… no        no         10.110…
#>  9 02003020303362… P3Y5M4D  10.3233/… 2018-08… no        no         10.110…
#> 10 02001000007360… P2Y      10.1007/… 2018     no        no         10.110…
#> 11 02001010402362… P3Y4M21D 10.1142/… 2018-07… no        no         10.110…
#> 12 02001000007360… P1Y      10.1007/… 2017     no        no         10.110…
#> 13 02001000507362… P2Y4M    10.1057/… 2017-08  no        no         10.110…
```

metadata: including the ISSN, volume, title, authors, etc.
```r
oc_coci_meta(doi)
#> # A tibble: 1 x 13
#>   reference source_id volume title author source_title oa_link citation
#> 1 10.1001/… issn:002… 71     Sett… Peron… Journal Of … ""      10.1177…
#> # … with 5 more variables: page , citation_count , issue ,
#> #   year , doi
```

Use cases

There are many example use cases using the OCC already in the literature. Here are a few of those, not necessarily using R:

To do

Get in touch

Get in touch if you have any citecorp questions in the issue tracker or the rOpenSci discussion forum.

### Easyalluvial 0.2.1 released

Tue, 09/17/2019 - 02:00

[This article was first published on R on datistics, and kindly contributed to R-bloggers.]

easyalluvial allows you to build exploratory alluvial plots (sankey diagrams) with a single line of code while automatically binning numerical variables. Release 0.2.1 ensures tidyr 1.0.0 compatibility and fixes a bug around categorical variables for model response plots.

Model Response Plots with Categorical Variables

This feature had some glitches before, as edvardoss reported in this issue. If anybody else encounters glitches or inconsistencies, please report them as well.
We create a grid of all possible feature combinations and use an alluvial plot to visualise the model response. Learn more about this feature in this previous blog post.

```r
suppressPackageStartupMessages(require(tidyverse))
suppressPackageStartupMessages(require(easyalluvial))

df = titanic %>% select_if(is.factor)

set.seed(0)
m = randomForest::randomForest(Survived ~ ., df)
imp = m$importance

dspace = get_data_space(df, imp, degree = 3)
pred = predict(m, newdata = dspace, type = 'response')

p = alluvial_model_response(pred, dspace, imp, degree = 3)

grid = add_marginal_histograms(p, plot = F, data_input = df)
grid = add_imp_plot(grid = grid, p = p, data_input = df, plot = T)
```

More changes

### Simulating an open cohort stepped-wedge trial

Tue, 09/17/2019 - 02:00

In a current multi-site study, we are using a stepped-wedge design to evaluate whether improved training and protocols can reduce prescriptions of anti-psychotic medication for home hospice care patients with advanced dementia. The study is officially called the Hospice Advanced Dementia Symptom Management and Quality of Life (HAS-QOL) Stepped Wedge Trial. Unlike my previous work with stepped-wedge designs, where individuals were measured once in the course of the study, this study will collect patient outcomes from the home hospice care EHRs over time. This means that for some patients, the data collection period straddles the transition from control to intervention.

Whenever I contemplate a simulation, I first think about the general structure of the data generating process before even thinking about the outcome model. In the case of a more standard two-arm randomized trial, that structure is quite simple and doesn't require much, if any, thought. In this case, however, the overlaying of a longitudinal patient outcome process on top of a stepped-wedge design presents a bit of a challenge.

Adding to the challenge is that, in addition to being a function of site- and individual-specific characteristics/effects, the primary outcome will likely be a function of time-varying factors: in particular, certain patient-level health-related factors that might contribute to the decision to prescribe anti-psychotic medications, and the time-varying intervention status, which is determined by the stepped-wedge randomization scheme. So, the simulation needs to accommodate the generation of both types of time-varying variables.

I've developed a bare-bones simulation of sites and patients to provide a structure that I can add to at some point in the future. While this is probably a pretty rare study design (though as stepped-wedge designs become more popular, it may be less rare than I am imagining), I thought the code could provide yet another example of how to approach a potentially vexing simulation in a relatively simple way.

Data definition

The focus here is on the structure of the data, so I am not generating any outcome data. However, in addition to generating the treatment assignment, I am creating the time-varying health status, which will affect the outcome process when I get to that.

In this simulation, there will be 5 sites, each followed for 25 weeks (starting with week 0). Each week, a site will have approximately 20 new patients, so we should expect to generate around $$5 \times 25 \times 20 = 2500$$ total patients.

For each patient, we will generate a series of health status values, which range from 1 to 4, with 1 being healthiest and 4 being death. I will use a Markov chain to generate this series. Two arguments required to simulate the Markov process are the starting state (which is created in S0) and the transition matrix P, which determines the probabilities of moving from one state to another.
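To make the mechanics concrete, here is a base-R sketch (independent of the simstudy definitions that follow) of one patient's weekly path through the four health states under the same transition matrix; state 4 (death) is absorbing, so the chain stops there:

```r
# Transition matrix: row i gives the probabilities of moving from state i
P <- t(matrix(c(0.7, 0.2, 0.1, 0.0,
                0.1, 0.3, 0.4, 0.2,
                0.0, 0.1, 0.5, 0.4,
                0.0, 0.0, 0.0, 1.0), nrow = 4))

# Simulate one patient's path from starting state s0, stopping at
# the absorbing state 4 or after maxlen weeks
sim_path <- function(s0, P, maxlen = 25) {
  path <- s0
  while (path[length(path)] != 4 && length(path) < maxlen) {
    path <- c(path, sample(1:4, 1, prob = P[path[length(path)], ]))
  }
  path
}

set.seed(123)
sim_path(1, P)
```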

NPER <- 25

perDef <- defDataAdd(varname = "npatient", formula = 20, dist = "poisson")
patDef <- defDataAdd(varname = "S0", formula = "0.4;0.4;0.2", dist = "categorical")

P <- t(matrix(c(0.7, 0.2, 0.1, 0.0,
                0.1, 0.3, 0.4, 0.2,
                0.0, 0.1, 0.5, 0.4,
                0.0, 0.0, 0.0, 1.0), nrow = 4))

Data generation

The data generation process starts with the sites and then proceeds to the patient-level data. To begin, the five sites are generated (for now without any site-specific variables, but that could easily be modified in the future). Next, records for each site for each of the 25 periods (from week 0 to week 24) are generated; these site-level records include the number of patients to be generated for each site, each week:

set.seed(3837263)

dsite <- genData(5, id = "site")

dper <- addPeriods(dsite, nPeriods = NPER, idvars = "site",
                   timeid = "site.time", perName = "period")
dper <- addColumns(perDef, dper)

dper
##      site period site.time npatient
##   1:    1      0         1       17
##   2:    1      1         2       20
##   3:    1      2         3       25
##   4:    1      3         4       18
##   5:    1      4         5       23
##  ---
## 121:    5     20       121       17
## 122:    5     21       122       15
## 123:    5     22       123       16
## 124:    5     23       124       19
## 125:    5     24       125       20

Now, we assign each of the five sites to its own intervention “wave”. The first site starts at the beginning of the study, week 0. The second starts 4 weeks later at week 4, and so on, until the fifth and last site starts the intervention at week 16. (Obviously, a more realistic simulation would include many more sites, but all of this can easily be scaled up.) The intervention indicator is $$I_{ct}$$, and is set to 1 when cluster $$c$$ during week $$t$$ is in the intervention, and is 0 otherwise.
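The wave logic itself can be written directly in base R. This is only a sketch of what the indicator works out to under the design described above, not the simstudy trtStepWedge implementation:

```r
# Site c starts the intervention at week startTrt[c] = 4 * (c - 1);
# I_ct = 1 for every week t >= startTrt[c], and 0 before that.
n_sites   <- 5
n_periods <- 25
startTrt  <- (seq_len(n_sites) - 1) * 4   # weeks 0, 4, 8, 12, 16

Ict <- outer(startTrt, 0:(n_periods - 1),
             function(s, t) as.integer(t >= s))
rownames(Ict) <- paste0("site", seq_len(n_sites))
```

Site 1 is in the intervention for all 25 weeks, while site 5 only contributes 9 intervention weeks (weeks 16 through 24).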

dsw <- trtStepWedge(dper, "site", nWaves = 5, lenWaves = 4, startPer = 0,
                    perName = "period", grpName = "Ict")
dsw <- dsw[, .(site, period, startTrt, Ict)]

Here are the intervention assignments for the first two sites during the first 8 weeks.

dsw[site %in% c(1, 2) & period < 8]
##     site period startTrt Ict
##  1:    1      0        0   1
##  2:    1      1        0   1
##  3:    1      2        0   1
##  4:    1      3        0   1
##  5:    1      4        0   1
##  6:    1      5        0   1
##  7:    1      6        0   1
##  8:    1      7        0   1
##  9:    2      0        4   0
## 10:    2      1        4   0
## 11:    2      2        4   0
## 12:    2      3        4   0
## 13:    2      4        4   1
## 14:    2      5        4   1
## 15:    2      6        4   1
## 16:    2      7        4   1

To generate the patients, we start by generating the 2500 or so individual records. The single baseline factor that we include this time around is the starting health status S0.

dpat <- genCluster(dper, cLevelVar = "site.time", numIndsVar = "npatient",
                   level1ID = "id")
dpat <- addColumns(patDef, dpat)

dpat
##       site period site.time npatient   id S0
##    1:    1      0         1       17    1  2
##    2:    1      0         1       17    2  1
##    3:    1      0         1       17    3  2
##    4:    1      0         1       17    4  2
##    5:    1      0         1       17    5  1
##   ---
## 2524:    5     24       125       20 2524  3
## 2525:    5     24       125       20 2525  2
## 2526:    5     24       125       20 2526  1
## 2527:    5     24       125       20 2527  1
## 2528:    5     24       125       20 2528  1

Here is a visualization of the patients (it turns out there are 2528 of them) by site and starting point, with each point representing a patient. The color represents the intervention status: light blue is control (pre-intervention) and dark blue is intervention. Even though a patient may start in the pre-intervention period, they may actually receive services in the intervention period, as we will see further on down.

The patient health status series are generated using a Markov chain process. This particular transition matrix has an “absorbing” state, as indicated by the probability 1 in the last row of the matrix. Once a patient enters state 4, they will not transition to any other state. (In this case, state 4 is death.)

dpat <- addMarkov(dpat, transMat = P, chainLen = NPER, id = "id",
                  pername = "seq", start0lab = "S0")

dpat
##        site period site.time npatient   id S0 seq state
##     1:    1      0         1       17    1  2   1     2
##     2:    1      0         1       17    1  2   2     3
##     3:    1      0         1       17    1  2   3     3
##     4:    1      0         1       17    1  2   4     3
##     5:    1      0         1       17    1  2   5     4
##    ---
## 63196:    5     24       125       20 2528  1  21     4
## 63197:    5     24       125       20 2528  1  22     4
## 63198:    5     24       125       20 2528  1  23     4
## 63199:    5     24       125       20 2528  1  24     4
## 63200:    5     24       125       20 2528  1  25     4

Now, we aren’t interested in the periods following the one where death occurs. So, we want to trim the data.table dpat to include only those periods leading up to state 4 and the first period in which state 4 is entered. We do this first by identifying the first time a state of 4 is encountered for each individual (and if an individual never reaches state 4, then all the individual’s records are retained, and the variable .last is set to the maximum number of periods NPER, in this case 25).
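The lookup itself reduces to a one-liner per patient. Here is a base-R illustration on toy state vectors (first_absorbed is a made-up helper name; the post does the equivalent with data.table below):

```r
# For a patient's state vector, find the first period in state 4 (death);
# if state 4 is never reached, fall back to the maximum number of periods.
first_absorbed <- function(states, nper = 25) {
  i <- match(4, states)        # NA if state 4 never occurs
  if (is.na(i)) nper else i
}

first_absorbed(c(2, 3, 3, 3, 4, 4, 4))  # 5
first_absorbed(c(1, 1, 2, 1))           # 25 (never absorbed)
```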

dlast <- dpat[, .SD[state == 4][1, ], by = id][, .(id, .last = seq)]
dlast[is.na(.last), .last := NPER]

dlast
##         id .last
##    1:    1     5
##    2:    2    13
##    3:    3     2
##    4:    4     6
##    5:    5     3
##   ---
## 2524: 2524     7
## 2525: 2525     5
## 2526: 2526    19
## 2527: 2527    20
## 2528: 2528     8

Next, we use the dlast data.table to “trim” dpat. We further trim the data set so that we do not have patient-level observations that extend beyond the overall follow-up period:

dpat <- dlast[dpat][seq <= .last][, .last := NULL][]
dpat[, period := period + seq - 1]
dpat <- dpat[period < NPER]

dpat
##          id site period site.time npatient S0 seq state
##     1:    1    1      0         1       17  2   1     2
##     2:    1    1      1         1       17  2   2     3
##     3:    1    1      2         1       17  2   3     3
##     4:    1    1      3         1       17  2   4     3
##     5:    1    1      4         1       17  2   5     4
##    ---
## 12608: 2524    5     24       125       20  3   1     3
## 12609: 2525    5     24       125       20  2   1     2
## 12610: 2526    5     24       125       20  1   1     1
## 12611: 2527    5     24       125       20  1   1     1
## 12612: 2528    5     24       125       20  1   1     1

And finally, we merge the patient data with the stepped-wedge treatment assignment data to create the final data set. The individual outcomes for each week could now be generated, because we now have all the baseline and time-varying information in a single data set.

dpat <- merge(dpat, dsw, by = c("site", "period"))
setkey(dpat, id, period)
dpat <- delColumns(dpat, c("site.time", "seq", "npatient"))

dpat
##        site period   id S0 state startTrt Ict
##     1:    1      0    1  2     2        0   1
##     2:    1      1    1  2     3        0   1
##     3:    1      2    1  2     3        0   1
##     4:    1      3    1  2     3        0   1
##     5:    1      4    1  2     4        0   1
##    ---
## 12608:    5     24 2524  3     3       16   1
## 12609:    5     24 2525  2     2       16   1
## 12610:    5     24 2526  1     1       16   1
## 12611:    5     24 2527  1     1       16   1
## 12612:    5     24 2528  1     1       16   1

Here is what the individual trajectories of health state look like. In the plot, each column represents a different site, and each row represents a different starting week. For example, the fifth row represents patients who appear for the first time in week 4. Sites 1 and 2 are already in the intervention in week 4, so none of these patients will transition from control to intervention. However, patients in sites 3 through 5 enter in the pre-intervention stage in week 4, and transition into the intervention at different points, depending on the site.

The basic structure is in place, so we are ready to extend this simulation to include more covariates, random effects, and outcomes. And once we’ve done that, we can explore analytic approaches.

This study is supported by the National Institutes of Health National Institute of Aging R61AG061904. The views expressed are those of the author and do not necessarily represent the official position of the funding organizations.


### Business AI for SMB’s: A Management Take On The Challenges of Leveraging Data and AI for Sales Growth and Enhancing the Customer Experience (CX)

Mon, 09/16/2019 - 09:04

[This article was first published on R – Remix Institute, and kindly contributed to R-bloggers].

The modern business world is captivated by data science and analytics. It seems like in almost every enterprise there is talk of needing to bulk up the data science and analytics functions to solve all conceivable business problems and chart the company towards endless success. In more cash-rich operations, management often seeks to go even beyond analytics and towards vaunted ‘AI’ solutions, generally caught up in Silicon Valley hype and expecting the state of AI to be decades (or even centuries) beyond its actual current state. Whether their company is large or small, countless contemporary managers are trying to win the data science and analytics race, but often come up frustrated when their expectations don’t meet reality.

4 Common Reasons Why Data Science and Analytics Initiatives Are Slow To Gain Momentum

There are many reasons why this may be the case in any particular organization, though companies which share the struggle typically exhibit at least one of the following traits:

1. Fractured data. How profitable is a certain SKU? It depends on who you ask. Did our marketing campaign generate sufficient ROI? Perhaps according to one analysis, but not another. Many managers can relate to this exceedingly-common experience. Different departments and teams manage their own sources of data, and their staff have their own custom methods of parsing and interpreting that data, which consequently leads to different answers when senior management asks questions to gauge the health of the business or viability of a project. In short, there is no ‘single source of truth.’ This is often related to another common corporate issue:

2. Siloed departments. When a company is small, typically a startup on through the first several years, it tends to operate with a relatively-flat and collaborative organizational structure. This is partly because smaller organizations often don’t have the money to build large teams of varied functions, but also because small and growing businesses tend to be more agile; they move faster than larger organizations because they have to in order to stay solvent and capitalize on their momentum. This is how a small company outmaneuvers larger competitors so that it, too, can eventually scale and become a big company. As a company grows and additional layers of management are added, new departments created, and ownership / senior management becomes increasingly-separated from daily operations, bureaucratic structures tend to crop up, leading to silos. Departments in large organizations may rarely speak to one another or know what is happening in other parts of the company. This means that not only may departments maintain disparate sources of data; it also often means their analyses and uses of the data may be operating at cross-purposes without them even realizing. But even if there is a single repository of data and the various departments are coordinating their use, many companies find themselves running into yet another issue:

3. Biased or flawed analysis. As much as we may like to think business data is unbiased and objective, factors such as corporate politics, lack of business acumen, or simple human error can result in skewed interpretations of that data. While data itself is non-sentient and unbiased, it doesn’t actually convey ‘information’ until interpreted. What does the data mean? The job of the analyst is to answer that question and tell a story which makes the data practical to management. While human involvement in this process is necessary and vital, the more you depend on it, the greater likelihood of flawed results. In some organizations, this leads to yet another problem:

4. Mistrust in the data. This often happens in more traditional businesses or older organizations outside the tech/e-commerce sectors, where senior management may have instituted some form of analytics or data science initiative because they heard about it on CNBC, Fox Business, Harvard Business Review, etc., but the initiative never really received sufficient attention, funding, or buy-in from the bulk of the staff. Institutional memory has a certain way of determining ‘how we do things around here,’ and when the data disagrees with the existing paradigm, the data is assumed to be wrong. One of the best contemporary examples of this phenomenon is the baseball analytics field pioneered at the Oakland A’s and documented in Michael Lewis’ book, Moneyball.

Sometimes data is rightly mistrusted because of the aforementioned problems in how it is collected and presented, but this issue can occur even in organizations with good data practices. It is a dangerous attitude for management to have, because even if the data has been properly collected and structured, various departments are collaborating, and the analysts and data scientists have done a good job of avoiding biases and other skews, reliable analysis can still be discarded due to a cultural mismatch. When this happens, management may pay lip service to analytics and data science, but will continue marching onward based on their personal preferences.

What are some quick-win strategies to address these common problems?

So what exactly is the manager to do in today’s environment? Most sensible, informed business people have by now realized that data science and analytics are going to be critical components of business success from here on out, but often the aforementioned problems run deep and present enormous roadblocks. And what about startups or small businesses that don’t have these same problems, but also don’t have the budget to build a full-scale analytics or data science team?

Thankfully, the general trend of markets is deflationary; stuff gets cheaper over time as it becomes more available, and this includes analytics and data science tools just as much as consumer goods. Consequently, open source tools which solve some of these issues — such as R, Python, and RemixAutoML — are beginning to appear, allowing even entrepreneurs and small enterprises access to capabilities which even just a short number of years ago would be hard to come by for the Fortune 500.

Can you provide specifics? How can SMB’s and startups tap into this?

Let’s start with one example based on an Amazon model. If you’re looking for a company which leads in leveraging machine learning and AI to optimize sales and profits, look no further than Amazon. From AI-enabled products like Alexa, to AI-based product recommenders on Amazon’s website, to AWS machine learning software, Amazon has no shortage of uses for machine learning and AI. Even Jeff Bezos said in his 2016 Letter to Amazon Shareholders,

These big trends are not that hard to spot (they get talked and written about a lot), but they can be strangely hard for large organizations to embrace. We’re in the middle of an obvious one right now: machine learning and artificial intelligence.

At Amazon, we’ve been engaged in the practical application of machine learning for many years now. Some of this work is highly visible: our autonomous Prime Air delivery drones; the Amazon Go convenience store that uses machine vision to eliminate checkout lines; and Alexa,1 our cloud-based AI assistant. (We still struggle to keep Echo in stock, despite our best efforts. A high-quality problem, but a problem. We’re working on it.)

But much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more. Though less visible, much of the impact of machine learning will be of this type – quietly but meaningfully improving core operations.

Superficial advice like “look at how successful Amazon is, so just do what they do” won’t help you. It’s almost impossible to try. Amazon’s R&D budget alone is $22 billion (according to Statista), dwarfing even the total annual revenues of most large companies. In all likelihood, your company’s R&D budget is $0. And since many companies don’t understand the significant investment required to be a leader in machine learning and AI (like Amazon), odds are you need a more pragmatic approach that can be done on a shoestring budget. Then, if you can pick up some quick-win ROI, management may see justification for further investment in machine learning and AI initiatives.

The example we’re going to show you would be building Amazon’s “frequently bought together” product up-sell algorithm. According to McKinsey, 35% of what consumers purchase on Amazon comes from its product recommendation algorithms. Amazon has been doing product recommenders for decades, and that means decades of research and investment you won’t be able to match. However, you can build an Amazon-style ‘frequently bought together’ product recommender able to lift average order values and market basket sizes.

Example of Amazon’s “Frequently Bought Together” Algorithm

This is where you can use an open-source, automated machine learning tool (such as RemixAutoML) which allows small-to-medium sized businesses and startups to build Amazon-style ‘frequently bought together’ product recommendations with just a single line of code and very little data. Your organization may have a messy data warehouse, but the only data points needed are:

• invoice or transaction data, such as invoice or sales order numbers
• a list of item or SKU numbers purchased on each invoice or sales order

This data can easily be extracted from either your Point-Of-Sale system or your e-commerce platform. Using that data, an analytics professional can create a machine learning model with a single line of code in RemixAutoML capable of competing with even the largest big box retailers, thus lifting conversion rates, average order values, and repurchase rates while increasing market share.

Tools like RemixAutoML help overcome hurdles such as fractured data (since only a few data points are required), mistrust of data and siloed objectives (as everyone can utilize and see immediate ROI), and biased analysis (as the tool uses laws of probability and machine learning to reduce bias).

As a quick example, consider this online retail dataset from an e-commerce company in the UK:

Online Retail Data Set from a UK eCommerce Company | Source: UCI Machine Learning Repository

Again, the only two data points needed are invoice number (called InvoiceNo in this data set) and item number (called StockCode in the data set).

Running the following R code using RemixAutoML yields the end product below: a table of products frequently bought together based on highest statistical significance. This would equip the sales organization to leverage the table whenever it tries to upsell a customer with Product B given that Product A has been added to their shopping cart. This upsell could be facilitated by adding ‘frequently bought together’ recommenders on the website, in a personalized sales email, or at the Point-Of-Sale in a brick and mortar store. For more technical users, details are provided in the R code comments.

The output mirrors Amazon’s ‘frequently bought together’ algorithm. The column called StockCode_LHS means “StockCode Left-Hand Side” and is the product the customer has added to their cart. The column called StockCode_RHS means “StockCode Right-Hand Side” and is the product that is most ‘frequently bought together’ with StockCode_LHS at the highest statistically significant level.

Support means the percent of total transactions in which StockCode_LHS and StockCode_RHS appear together. Confidence means the conditional probability that StockCode_RHS is purchased given that StockCode_LHS has been added to the cart. The columns called Lift, Chi_SQ, and P_Value are all statistical significance metrics of the relationship between StockCode_LHS and StockCode_RHS. RuleRank is the ranking system that RemixAutoML uses to rank the market basket affinities for you.
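For intuition, these metrics are easy to compute by hand. Here is a base-R sketch on a toy transaction list (the item names A/B/C are invented for illustration; a real market basket model like the one below works over all item pairs):

```r
# Toy transactions: each element is the set of items on one invoice
baskets <- list(c("A", "B"), c("A", "B", "C"), c("A"), c("B", "C"), c("A", "B"))

n   <- length(baskets)
has <- function(item) vapply(baskets, function(b) item %in% b, logical(1))

# Rule "A => B": A is the left-hand side, B the right-hand side
support_AB <- sum(has("A") & has("B")) / n       # P(A and B)      -> 0.6
confidence <- support_AB / (sum(has("A")) / n)   # P(B | A)        -> 0.75
lift       <- confidence / (sum(has("B")) / n)   # P(B|A) / P(B)   -> 0.9375

c(support = support_AB, confidence = confidence, lift = lift)
```

A lift below 1, as here, would mean adding A to the cart makes B slightly less likely than its base rate, so this toy rule would rank poorly.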

Nick Gausling is a businessman and investor who has worked across multiple industries with companies both small and large. You can connect with Nick at https://www.linkedin.com/in/nick-gausling/ or www.nickgausling.com

Douglas Pestana is a data scientist with 10+ years experience in data science and machine learning at Fortune 500 and Fortune 1000 companies. He is one of the co-founders of Remix Institute.

library(data.table)
library(dplyr)
library(magrittr)
library(RemixAutoML)

# IMPORT ONLINE RETAIL DATA SET THEN CLEAN DATA SET -------
# Original Source: UCI Machine Learning Repository -
#   https://archive.ics.uci.edu/ml/datasets/online+retail
# download file from Remix Institute Box account
online_retail_data = data.table::fread(
  "https://remixinstitute.box.com/shared/static/v2c7mkkqm9eswyqbzkg5tbqqzfa1885v.csv",
  header = T, stringsAsFactors = FALSE)

# create a flag for cancelled invoices
online_retail_data$CancelledInvoiceFlag = ifelse(
  substring(online_retail_data$InvoiceNo, 1, 1) == 'C', 1, 0)

# create a flag for negative quantities
online_retail_data$NegativeQuantityFlag = ifelse(online_retail_data$Quantity < 0, 1, 0)

# remove cancelled invoices and negative quantities
online_retail_data_clean = online_retail_data %>%
  dplyr::filter(., CancelledInvoiceFlag != 1) %>%
  dplyr::filter(., NegativeQuantityFlag != 1)

# PREP DATA SET FOR MODELING -------------
# for market basket analysis models, you'll need data grouped by invoice
# (InvoiceNo) and item number (StockCode). Then you can sum up the units sold.
online_retail_data_for_model = online_retail_data_clean %>%
  dplyr::group_by(., InvoiceNo, StockCode) %>%
  dplyr::summarise(., Quantity = sum(Quantity, na.rm = TRUE))

# RUN AUTOMATED MARKET BASKET ANALYSIS (PRODUCT RECOMMENDER) IN RemixAutoML -----------
# the AutoMarketBasketModel from RemixAutoML automatically converts your data,
# runs the market basket model algorithm, and adds Chi-Square statistics for significance
market_basket_model = RemixAutoML::AutoMarketBasketModel(
  data = online_retail_data_for_model,
  OrderIDColumnName = "InvoiceNo",
  ItemIDColumnName = "StockCode"
)

# add product Description
# left-hand side products
StockCode_LHS_description = online_retail_data_clean %>%
  dplyr::select(., StockCode, Description) %>%
  dplyr::rename(., StockCode_LHS = StockCode, Description_LHS = Description) %>%
  dplyr::distinct(., StockCode_LHS, .keep_all = TRUE)

# right-hand side products
StockCode_RHS_description = online_retail_data_clean %>%
  dplyr::select(., StockCode, Description) %>%
  dplyr::rename(., StockCode_RHS = StockCode, Description_RHS = Description) %>%
  dplyr::distinct(., StockCode_RHS, .keep_all = TRUE)

# merge
market_basket_model_final = merge(market_basket_model, StockCode_RHS_description,
                                  by = 'StockCode_RHS', all.x = TRUE)
market_basket_model_final = merge(market_basket_model_final, StockCode_LHS_description,
                                  by = 'StockCode_LHS', all.x = TRUE)

# re-sort by StockCode_LHS and RuleRank
market_basket_model_final = market_basket_model_final[order(StockCode_LHS, RuleRank), ]

# view results
View(market_basket_model_final)

### Feature-based time series analysis

Mon, 09/16/2019 - 02:00

[This article was first published on R on Rob J Hyndman, and kindly contributed to R-bloggers].

In my last post, I showed how the feasts package can be used to produce various time series graphics.

The feasts package also includes functions for computing FEatures And Statistics from Time Series (hence the name). In this post I will give three examples of how these might be used.

library(tidyverse)
library(tsibble)
library(feasts)

Exploring Australian tourism data

I used this example in my talk at useR!2019 in Toulouse, and it is also the basis of a vignette in the package, and a recent blog post by Mitchell O’Hara-Wild. The data set contains domestic tourist visitor nights in Australia, disaggregated by State, Region and Purpose.

An example of a feature would be the autocorrelation function at lag 1 — it is a numerical summary capturing some aspect of the time series. Autocorrelations at other lags are also features, as are the autocorrelations of the first differenced series, or the seasonally differenced series, etc. Another example of a feature is the strength of seasonality of a time series, as measured by $$1-\text{Var}(R_t)/\text{Var}(S_t+R_t)$$ where $$S_t$$ is the seasonal component and $$R_t$$ is the remainder component in an STL decomposition. Values close to 1 indicate a highly seasonal time series, while values close to 0 indicate a time series with little seasonality.
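As an illustration of the formula, here is a base-R sketch that computes this strength-of-seasonality feature using decompose() on the built-in co2 series. (The feasts package uses an STL decomposition via feat_stl, so its values will differ slightly; this is only to show the arithmetic.)

```r
# Strength of seasonality: 1 - Var(remainder) / Var(seasonal + remainder)
seasonal_strength <- function(x) {
  dc <- decompose(x)                 # classical decomposition from base R
  r  <- dc$random                    # remainder component (NAs at the ends)
  s  <- dc$seasonal                  # seasonal component
  1 - var(r, na.rm = TRUE) / var(s + r, na.rm = TRUE)
}

ss_co2 <- seasonal_strength(co2)
ss_co2  # close to 1: atmospheric CO2 concentrations are highly seasonal
```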

The feasts package has some inbuilt feature calculation functions. For example, coef_hurst will calculate the Hurst coefficient of a time series and feat_spectral will compute the spectral entropy of a time series. To apply these to all series in the tourism data set, we can use the features function.

There are also functions which produce collections of features based on tags. For example, feature_set(tags="acf") will produce 7 features based on various autocorrelation coefficients, while feature_set(tags="stl") produces 7 features based on an STL decomposition of a time series, including the strength of seasonality mentioned above.

tourism %>% features(Trips, feature_set(tags = "stl"))
## # A tibble: 304 x 12
##    Region State Purpose trend_strength seasonal_streng… seasonal_peak_y…
##  1 Adela… Sout… Busine…          0.451            0.380                3
##  2 Adela… Sout… Holiday          0.541            0.601                1
##  3 Adela… Sout… Other            0.743            0.189                2
##  4 Adela… Sout… Visiti…          0.433            0.446                1
##  5 Adela… Sout… Busine…          0.453            0.140                3
##  6 Adela… Sout… Holiday          0.512            0.244                2
##  7 Adela… Sout… Other            0.584            0.374                2
##  8 Adela… Sout… Visiti…          0.481            0.228                0
##  9 Alice… Nort… Busine…          0.526            0.224                0
## 10 Alice… Nort… Holiday          0.377            0.827                3
## # … with 294 more rows, and 6 more variables: seasonal_trough_year,
## #   spikiness, linearity, curvature, stl_e_acf1, stl_e_acf10

We can then use these features in plots to identify which types of series are heavily trended and which are most seasonal.

tourism %>%
  features(Trips, feature_set(tags = "stl")) %>%
  ggplot(aes(x = trend_strength, y = seasonal_strength_year, col = Purpose)) +
  geom_point() +
  facet_wrap(vars(State))

Clearly, holiday series are the most seasonal, which is unsurprising. The strongest trends tend to be in Western Australia.

The most seasonal series can also be easily identified and plotted.

tourism %>%
  features(Trips, feature_set(tags = "stl")) %>%
  filter(seasonal_strength_year == max(seasonal_strength_year)) %>%
  left_join(tourism, by = c("State", "Region", "Purpose")) %>%
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line() +
  facet_grid(vars(State, Region, Purpose))

This is holiday trips to the most popular ski region of Australia.

The feasts package currently includes 42 features which can all be computed in one line like this.

# Compute features
tourism_features <- tourism %>%
  features(Trips, feature_set(pkgs = "feasts"))

Unusual time series can be identified by first doing a principal components decomposition.
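The idea can be previewed in base R with prcomp on a synthetic feature matrix. This is only a toy sketch (the matrix and the injected outlier are invented); the real computation on tourism_features follows:

```r
# 100 "series" described by 5 features each, with one deliberate outlier in
# row 1 that is shifted in every feature dimension.
set.seed(42)
feats <- matrix(rnorm(500), nrow = 100)
feats[1, ] <- feats[1, ] + 8

pc <- prcomp(feats, scale. = TRUE)   # scale features before PCA

# The outlier should carry by far the most energy in the leading components
top <- which.max(rowSums(pc$x[, 1:2]^2))
top  # row 1
```

Projecting onto the first two components compresses all 5 features into a plane where unusual series sit far from the main cloud, which is exactly how the outlying tourism series are found next.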

pcs <- tourism_features %>%
  select(-State, -Region, -Purpose) %>%
  prcomp(scale = TRUE) %>%
  augment(tourism_features)

pcs %>%
  ggplot(aes(x = .fittedPC1, y = .fittedPC2, col = Purpose)) +
  geom_point() +
  theme(aspect.ratio = 1)

We can then identify some unusual series.

pcs %>%
  filter(.fittedPC1 > 10.5) %>%
  select(Region, State, Purpose, .fittedPC1, .fittedPC2)
## # A tibble: 3 x 5
##   Region                 State             Purpose  .fittedPC1 .fittedPC2
## 1 Australia's North West Western Australia Business       16.5      -8.24
## 2 Australia's South West Western Australia Holiday        11.3       3.03
## 3 Melbourne              Victoria          Holiday        14.2      -8.24

Replicating Hyndman, Wang and Laptev (2015)

I used a similar approach to identifying outliers in my ICDM 2015 paper with Earo Wang and Nikolay Laptev. In that paper, we used hourly data on thousands of mail servers from Yahoo.

The data is stored in the tsfeatures package as an mts object. Unfortunately, mts and ts objects are particularly bad at handling hourly and other sub-daily data, as time is stored numerically and the precision is not sufficient to always work properly. So we need to do a little work to create a properly formulated tsibble.

yahoo <- tsfeatures::yahoo_data() %>%
  as_tsibble() %>%
  mutate(
    Time = hms::hms(day = trunc(index) - 1L,
                    hour = as.integer(round(24 * (index - 1)) %% 24))
  ) %>%
  as_tsibble(index = Time, key = key) %>%
  select(Time, key, value)

The key variable here identifies the particular server and measurement variable on that server. There are 1748 such time series, each with 1437 hourly observations (almost 60 days).

Now we create the features on all series, matching the original paper as closely as possible. There are a few small differences due to different choices in how features are computed, but they make little difference to the results.

The features are computed in two groups — the first group uses the original data, while the second group uses the scaled data.
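The scaling step is just standardization, subtracting the mean and dividing by the standard deviation, so that level- and spread-dependent features are comparable across servers. In base R:

```r
# scale() standardizes a series: (x - mean(x)) / sd(x)
x <- c(10, 12, 9, 14, 11)
z <- as.numeric(scale(x))

round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```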

yahoo_features <- bind_cols(
  yahoo %>% features(value, features = list(
    mean = ~ mean(., na.rm = TRUE),
    var = ~ var(., na.rm = TRUE)
  )),
  yahoo %>% features(scale(value), features = list(
    ~ feat_acf(.),
    ~ feat_spectral(.),
    ~ n_flat_spots(.),
    ~ n_crossing_points(.),
    ~ var_tiled_var(., .period = 24, .size = 24),
    ~ shift_level_max(., .period = 24, .size = 24),
    ~ shift_var_max(., .period = 24, .size = 24),
    ~ shift_kl_max(., .period = 24, .size = 48),
    ~ feat_stl(., .period = 24, s.window = "periodic", robust = TRUE)
  ))
) %>%
  rename(lumpiness = var_tiled_var) %>%
  select(key, mean, var, acf1, trend_strength, seasonal_strength_24,
         linearity, curvature, seasonal_peak_24, seasonal_trough_24,
         spectral_entropy, lumpiness, spikiness, shift_level_max,
         shift_var_max, n_flat_spots, n_crossing_points, shift_kl_max,
         shift_kl_index)

Now we can compute the principal components.

hwl_pca <- yahoo_features %>%
  select(-key) %>%
  na.omit() %>%
  prcomp(scale = TRUE) %>%
  augment(na.omit(yahoo_features))

hwl_pca %>%
  as_tibble() %>%
  ggplot(aes(x = .fittedPC1, y = .fittedPC2)) +
  geom_point()

A useful plot for identifying the outlying observations is an HDR scatterplot which highlights regions of high density, and identifies outliers as those observations in low density regions. The red dots are the 1% highest density observations, the orange dots are in the 50% region, the yellow dots are in the 99% regions, while those shown in black are the 1% most “unusual” observations (having the lowest bivariate density on this plot.) The five most outlying points are highlighted by their row numbers.

hdrcde::hdrscatterplot(hwl_pca$.fittedPC1, hwl_pca$.fittedPC2, noutliers=5)

Replicating Kang, Hyndman and Smith-Miles (2017)

Another paper that uses features is this IJF paper from 2017 in which we explored the feature space of the M3 time series data.

First we need to create several tsibbles, corresponding to the different seasonal periods in the M3 data.

m3totsibble <- function(z) {
  rbind(
    as_tsibble(z$x) %>% mutate(Type = "Training"),
    as_tsibble(z$xx) %>% mutate(Type = "Test")
  ) %>%
    mutate(
      st = z$st,
      type = z$type,
      period = z$period,
      description = z$description,
      sn = z$sn
    ) %>%
    as_tibble()
}

m3yearly <- Mcomp::M3 %>%
  subset("yearly") %>%
  purrr::map_dfr(m3totsibble) %>%
  as_tsibble(index = index, key = c(sn, period, st))

m3quarterly <- Mcomp::M3 %>%
  subset("quarterly") %>%
  purrr::map_dfr(m3totsibble) %>%
  mutate(index = yearquarter(index)) %>%
  as_tsibble(index = index, key = c(sn, period, st))

m3monthly <- Mcomp::M3 %>%
  subset("monthly") %>%
  purrr::map_dfr(m3totsibble) %>%
  mutate(index = yearmonth(index)) %>%
  as_tsibble(index = index, key = c(sn, period, st))

m3other <- Mcomp::M3 %>%
  subset("other") %>%
  purrr::map_dfr(m3totsibble) %>%
  as_tsibble(index = index, key = c(sn, period, st))

There are some bespoke features used by Kang et al (IJF 2017), so rather than use the inbuilt feasts functions, we can create our own feature calculation function.

khs_stl <- function(x, period) {
  output <- c(frequency = period, feat_spectral(x))
  lambda <- forecast::BoxCox.lambda(ts(x, frequency = period),
                                    lower = 0, upper = 1, method = "loglik")
  stlfeatures <- feat_stl(box_cox(x, lambda), .period = period,
                          s.window = "periodic", robust = TRUE)
  output <- c(output, stlfeatures, lambda = lambda)
  if (period == 1L) output <- c(output, seasonal_strength = 0)
  return(output)
}

m3_features <- bind_rows(
  m3yearly %>% features(value, features = list(~ khs_stl(., 1))),
  m3quarterly %>% features(value, features = list(~ khs_stl(., 4))) %>%
    rename(seasonal_strength = seasonal_strength_4),
  m3monthly %>% features(value, features = list(~ khs_stl(., 12))) %>%
    rename(seasonal_strength = seasonal_strength_12),
  m3other %>% features(value, features = list(~ khs_stl(., 1)))
) %>%
  select(sn, spectral_entropy, trend_strength, seasonal_strength,
         frequency, stl_e_acf1, lambda)

Finally we plot the first two principal components of the feature space.
m3_features %>%
  select(-sn) %>%
  prcomp(scale = TRUE) %>%
  augment(m3_features) %>%
  ggplot(aes(x = .fittedPC1, y = .fittedPC2)) +
  geom_point(aes(color = factor(frequency)))

To leave a comment for the author, please follow the link and comment on their blog: R on Rob J Hyndman. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics.

### Practical Data Science with R update

Mon, 09/16/2019 - 00:44

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

Just got the following note from a new reader:

Thank you for writing Practical Data Science with R. It’s challenging for me, but I am learning a lot by following your steps and entering the commands.

Wow, this is exactly what Nina Zumel and I hoped for. We wish we could make everything easy, but an appropriate amount of challenge is required for significant learning and accomplishment. Of course we try to avoid inessential problems. All of the code examples from the book can be found here (and all the data sets here). The second edition is coming out very soon. Please check it out.
To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

### Le Monde puzzle [#1110]

Mon, 09/16/2019 - 00:19

[This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers].

A low-key sorting problem as Le Monde current mathematical puzzle:

If the numbers from 1 to 67 are randomly permuted and if the sorting algorithm consists in picking a number i with a position higher than its rank i and moving it to the correct i-th position, what is the maximal number of steps to sort this set of integers when the selected integer is chosen optimally?

As the question does not clearly state what happens to the number j that stood in the i-th position, I made the assumption that the entire sequence from position i to position n is moved one position upwards (rather than having i and j exchanged). In which case my intuition was that moving the smallest moveable number was optimal, resulting in the following R code

sorite <- function(permu){
  n = length(permu)
  p = 0
  while (max(abs(permu - (1:n))) > 0){
    j = min(permu[permu < (1:n)])
    p = p + 1
    # insertion step (reconstructed): slide j back to position j;
    # unique() keeps first occurrences, dropping j's old copy
    permu = unique(c(permu[(1:n) < j], j, permu))
  }
  return(p)
}

which takes at most n-1 steps to reorder the sequence.
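The single sorting move used above (take a misplaced value j and slide it back to position j, shifting the intervening values up one slot) can be isolated as a small helper; a base-R sketch with an illustrative name:

```r
# One sorting step: insert value j at its rank j; everything between the
# insertion point and j's old slot shifts up. unique() keeps the first
# occurrence of each value, which drops j's old copy.
move_to_place <- function(permu, j) {
  n <- length(permu)
  unique(c(permu[(1:n) < j], j, permu))
}

p <- c(3, 1, 4, 2)
p <- move_to_place(p, 1)  # 1 3 4 2
p <- move_to_place(p, 2)  # 1 2 3 4
```

Each call shortens the set of misplaced values by at least one, which is why at most n-1 such moves are ever needed.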
I however checked that this insertion sort was indeed optimal through a recursive function

resorite <- function(permu){
  n = length(permu); p = 0
  while (max(abs(permu - (1:n))) > 0){
    j = cand = permu[permu < (1:n)]
    if (length(cand) == 1){
      p = p + 1
      permu = unique(c(permu[(1:n) < j], j, permu))
    } else {
      # exhaustive branch (reconstructed): recurse over every candidate
      # move and keep the smallest total number of steps
      sol = Inf
      for (i in cand){
        sol = min(sol, resorite(unique(c(permu[(1:n) < i], i, permu))))
      }
      p = p + 1 + sol
      break
    }
  }
  return(p)
}

which did confirm my intuition.

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og.

### pinp 0.0.9: Real Fix and Polish

Sun, 09/15/2019 - 15:46

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers].

Another pinp package release! pinp allows for snazzier one- or two-column Markdown-based pdf vignettes, and is now used by a few packages. A screenshot of the package vignette can be seen below. Additional screenshots are at the pinp page.

This release comes exactly one week (i.e. the minimal time to not earn a NOTE) after the hot-fix release 0.0.8, which addressed breakage on CRAN triggered by changes in TeX Live. After updating the PNAS style LaTeX macros, and avoiding the issue with an (older) custom copy of titlesec, we now have the real fix, thanks to the eagle-eyed attention of Javier Bezos. The error, as so often, was simple and ours: we had left a stray \makeatother in pinp.cls where it may have been in hiding for a while.
A very big Thank You! to Javier for spotting it, to Norbert for all his help, and to James for double-checking on PNAS. The good news in all of this is that the package is now in better shape than ever. The newer PNAS style works really well, and I went over a few of our extensions (such as papersize support for a4 as well as letter), direct on/off of a Draft watermark, a custom subtitle and more—and they all work consistently. So happy vignette or paper writing!

The NEWS entry for this release follows.

Changes in pinp version 0.0.9 (2019-09-15)

• The processing error first addressed in release 0.0.8 is now fixed by removing one stray command; many thanks to Javier Bezos.
• The hotfix of also installing titlesec.sty has been reverted.
• Processing of the ‘papersize’ and ‘watermark’ options was updated.

Courtesy of CRANberries, there is a comparison to the previous release. More information is on the pinp page. For questions or comments use the issue tracker off the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box.
### Develop Performance Benchmark with GRNN

Sun, 09/15/2019 - 05:27

[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers].

It has been mentioned in https://github.com/statcompute/GRnnet that GRNN is an ideal approach for developing performance benchmarks for a variety of risk models.

People might wonder what the purpose of performance benchmarks is and why we would even need one at all. Sometimes, a model developer has to answer questions about how well the model would perform even before completing the model. Likewise, a model validator might wonder whether the model being validated has a reasonable performance given the data used and the effort spent. As a result, a performance benchmark, which could be built with the same data sample but an alternative methodology, is called for to address the aforementioned questions.

While the performance benchmark can take various forms, including but not limited to business expectations, industry practices, or vendor products, a model-based approach should possess the following characteristics:

– Quick prototype with reasonable effort
– Comparable baseline with acceptable outcomes
– Flexible framework without strict assumptions
– Practical application to broad domains

With both empirical and conceptual advantages, GRNN is able to accommodate each of the above-mentioned requirements and thus can be considered an appropriate candidate for developing performance benchmarks for a wide variety of models. Below is an example illustrating how to use GRNN to develop a benchmark model for the logistic regression shown in https://statcompute.wordpress.com/2019/05/04/why-use-weight-of-evidence/.
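Conceptually, a GRNN is Nadaraya-Watson kernel regression: the prediction is a distance-weighted average of the training responses, with a single smoothing parameter. A minimal base-R sketch (not the GRnnet implementation; names are illustrative) of why it makes such a quick, assumption-light benchmark:

```r
# Bare-bones GRNN prediction: Gaussian-kernel-weighted average of the
# training responses; sigma is the lone tuning parameter.
grnn_predict <- function(X, y, x_new, sigma = 1) {
  d2 <- colSums((t(X) - x_new)^2)   # squared distances to x_new
  w  <- exp(-d2 / (2 * sigma^2))    # Gaussian kernel weights
  sum(w * y) / sum(w)               # weighted average response
}

X <- matrix(c(0, 0, 1, 1, 0, 1, 0, 1), ncol = 2)  # four 2-d points
y <- c(0, 1, 1, 0)
grnn_predict(X, y, c(0, 0), sigma = 0.3)  # dominated by the nearest point
grnn_predict(X, y, c(0, 0), sigma = 100)  # flattens toward mean(y)
```

With no coefficients to estimate and only sigma to tune, such a model can be stood up in minutes on the same sample as the model under review, which is exactly the "quick prototype" property listed above.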
The function grnn.margin() was also employed to explore the marginal effect of each attribute in a GRNN.

To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.

### Posts

Sun, 09/15/2019 - 02:13

[This article was first published on Random R Ramblings, and kindly contributed to R-bloggers].

### Twitter “Account Analysis” in R

Sat, 09/14/2019 - 16:11

[This article was first published on R – rud.is, and kindly contributed to R-bloggers].
This past week @propublica linked to a really spiffy resource for getting an overview of a Twitter user’s profile and activity called accountanalysis. It has a beautiful interface that works as well on mobile as it does in a real browser. It is also fully interactive and supports cross-filtering (zoom in on the timeline and the other graphs change).

It’s especially great if you’re not a coder, but if you are, @kearneymw’s {rtweet} can get you all this info and more, putting the power of R behind data frames full of tweet inanity.

While we covered quite a bit of {rtweet} ground in the 21 Recipes book, summarizing an account to the degree that accountanalysis does is not in there. To rectify this oversight, I threw together a static clone of accountanalysis that can make standalone HTML reports like this one.

It’s a fully parameterized R Markdown document, meaning you can run it as just a function call (or change the parameter and knit it by hand):

rmarkdown::render(
  input = "account-analysis.Rmd",
  params = list(
    username = "propublica"
  ),
  output_file = "~/Documents/propublica-analysis.html"
)

It will also, by default, save a date-stamped copy of the user info and retrieved timeline into the directory you generate the report from (add a prefix path to the save portion in the Rmd to store it in a better place). With all the data available, you can dig in and extract all the information you want/need.
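Inside a parameterized Rmd, the YAML params: block surfaces as a list named params. Simulated here outside knitr (the list is normally created for you when the document is rendered):

```r
# knitr builds this list from the YAML header; we fake it so the
# sketch runs standalone.
params <- list(username = "propublica")

# The document body can then derive values from the parameter, e.g.:
output_file <- sprintf("%s-analysis.html", params$username)
output_file  # "propublica-analysis.html"
```

This is what makes the render() call above repeatable for any account: only the params list changes between runs.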
FIN

You can get the Rmd at your favorite social coding service:

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

### RSwitch 1.5.0 Release Now Also Corrals RStudio Server Connections

Sat, 09/14/2019 - 15:19

[This article was first published on R – rud.is, and kindly contributed to R-bloggers].

RSwitch is a macOS menubar application that works on macOS 10.14+ and provides handy shortcuts for developing with R on macOS. Version 1.5.0 brings a reorganized menu system and the ability to manage and make connections to RStudio Server instances.

Here’s a quick peek at the new setup. All books, links, and other reference resources are under a single submenu system. If there’s a resource you’d like added, follow the links on the main RSwitch site to file PRs where you’re most comfortable.

You can also set up automatic checks and notifications for when new RStudio Dailies are available (you can still always check manually, and this check feature is off by default).

But the biggest new feature is the ability to manage and launch RStudio Server connections right from RSwitch. These RStudio Server browser connections are kept separate from your internet browsing and are one menu selection away.
RSwitch also remembers the size and position of your RStudio Server session windows, so everything should be where you want/need/expect. This is somewhat of an experimental feature, so definitely file issues if you run into any problems or would like things to work differently.

FIN

Kick the tyres, file issues or requests and, if so inclined, let me know how you’re liking RSwitch!

To leave a comment for the author, please follow the link and comment on their blog: R – rud.is.

### September 2019 Democratic Debates Added to {ggchicklet}

Sat, 09/14/2019 - 11:34

[This article was first published on R – rud.is, and kindly contributed to R-bloggers].

The latest round of the 2020 Democratic debates is over and the data from all the 2019 editions of the debates have been added to {ggchicklet}.
The structure of the debates2019 built-in dataset has changed a bit:

library(ggchicklet)
library(hrbrthemes)
library(tidyverse)

debates2019
## # A tibble: 641 x 7
##    elapsed timestamp speaker   topic   debate_date debate_group night
##  1   1.04  21:03:05  Warren    Economy 2019-09-13             1     1
##  2   1.13  21:04:29  Klobuchar Economy 2019-09-13             1     1
##  3   1.13  21:06:02  O'Rourke  Economy 2019-09-13             1     1
##  4   0.226 21:07:20  O'Rourke  Economy 2019-09-13             1     1
##  5   1.06  21:07:54  Booker    Economy 2019-09-13             1     1
##  6   0.600 21:09:08  Booker    Economy 2019-09-13             1     1
##  7   0.99  21:09:50  Warren    Economy 2019-09-13             1     1
##  8   0.872 21:11:03  Castro    Economy 2019-09-13             1     1
##  9   1.07  21:12:00  Gabbard   Economy 2019-09-13             1     1
## 10   1.11  21:13:20  de Blasio Economy 2019-09-13             1     1
## # … with 631 more rows

There are now debate_date, debate_group and night columns to make it easier to segment out or group together the debate nights. The topic names across the online JavaScript data for the June, July and September debates weren’t uniform, so they’ve been cleaned up as well:

distinct(debates2019, topic) %>%
  arrange(topic) %>%
  print(n = nrow(.))
## # A tibble: 26 x 1
##    topic
##  1 Abortion
##  2 Age
##  3 Campaign Finance Reform
##  4 Civil Rights
##  5 Climate
##  6 Closing
##  7 Economy
##  8 Education
##  9 Elections Reform
## 10 Foreign Policy
## 11 Gun Control
## 12 Healthcare
## 13 Immigration
## 14 Lead
## 15 Opening
## 16 Other
## 17 Party Strategy
## 18 Politics
## 19 Race
## 20 Resilience
## 21 Socialism
## 22 Statement
## 23 Trade
## 24 Trump
## 25 Veterans
## 26 Women's Rights

This should make it easier to compare speaker times per-topic across the debates.
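Since each row is one timed response, per-speaker totals are a single aggregation; a base-R equivalent of count(speaker, wt = elapsed) on a made-up slice of the data:

```r
# Toy stand-in for debates2019: one row per response, minutes in `elapsed`.
toy <- data.frame(
  speaker = c("Warren", "Booker", "Warren", "Castro"),
  elapsed = c(1.04, 1.06, 0.99, 0.87)
)

# Total speaking time per speaker, longest-winded first.
totals <- sort(tapply(toy$elapsed, toy$speaker, sum), decreasing = TRUE)
totals  # Warren 2.03, Booker 1.06, Castro 0.87
```

The dplyr pipelines below do the same weighted count on the real dataset before handing the totals to ggplot2.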
Here’s how to generate the chart in the featured image slot for the September debate:

debates2019 %>%
  filter(debate_group == 3) %>%
  mutate(speaker = fct_reorder(speaker, elapsed, sum, .desc = FALSE)) %>%
  mutate(topic = fct_inorder(topic)) %>%
  ggplot(aes(speaker, elapsed, group = timestamp, fill = topic)) +
  geom_chicklet(width = 0.75) +
  scale_y_continuous(
    expand = c(0, 0.0625),
    position = "right",
    breaks = seq(0, 18, 2),
    labels = c(0, sprintf("%d min.", seq(2, 18, 2))),
    limits = c(0, 18)
  ) +
  ggthemes::scale_fill_tableau("Tableau 20") +
  guides(fill = guide_legend(nrow = 2)) +
  coord_flip() +
  labs(
    x = NULL, y = NULL, fill = NULL,
    title = "How Long Each Candidate Spoke",
    subtitle = "September 2019 Democratic Debates",
    caption = "Each bar segment represents the length of a candidate’s response to a question.\nOriginal \n#rstats reproduction by @hrbrmstr"
  ) +
  theme_ipsum_rc(grid = "X") +
  theme(axis.text.x = element_text(color = "gray60", size = 10)) +
  theme(legend.position = "top")

Now that the field has been thinned a bit (yes, others are still running, but really?) we can see who has blathered the most on stage so far:

debates2019 %>%
  filter(debate_group == 3) %>%
  distinct(speaker) %>%
  left_join(debates2019) %>%
  count(speaker, wt = elapsed, sort = TRUE) %>%
  mutate(speaker = fct_inorder(speaker) %>% fct_rev()) %>%
  ggplot(aes(speaker, n)) +
  geom_col(fill = ft_cols$slate, width = 0.55) +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0.55), position = "right") +
  labs(
    x = NULL, y = "Speaking time (minutes)",
    title = "Total Speaking Time Across All 2019 Debates\nfor Those Left Standing in September"
  ) +
  theme_ipsum_es(grid = "X")

And, here’s what they’ve all blathered about:

debates2019 %>%
  filter(debate_group == 3) %>%
  distinct(speaker) %>%
  left_join(debates2019) %>%
  count(topic, wt = elapsed, sort = TRUE) %>%
  mutate(topic = fct_inorder(topic) %>% fct_rev()) %>%
  ggplot(aes(topic, n)) +
  geom_col(fill = ft_cols$slate, width = 0.55) +
  coord_flip() +
  scale_y_continuous(expand = c(0, 0.25), position = "right") +
  labs(
    x = NULL, y = "Topic time (minutes)",
    title = "Total Topic Time Across All 2019 Debates\nfor Those Left Standing in September"
  ) +
  theme_ipsum_es(grid = "X")


### ttdo 0.0.3: New package

Sat, 09/14/2019 - 01:29

[This article was first published on Thinking inside the box, and kindly contributed to R-bloggers].

A new package of mine arrived on CRAN yesterday, having been uploaded a few days prior on the weekend. It extends the most excellent (and very minimal / zero depends) unit testing package tinytest by Mark van der Loo with the very clever and well-done diffobj package by Brodie Gaslam. Mark also tweeted about it.

The package was written to address a fairly specific need. In teaching STAT 430 at Illinois, I am relying on the powerful PrairieLearn system (developed there) to provide tests, quizzes or homework. Alton and I have put together an autograder for R (which is work in progress, more on that maybe another day), and that uses this package to provide colorized differences between supplied and expected answers in case of an incorrect answer.

Now, the aspect of providing colorized diffs when tests do not evaluate to TRUE is both simple and general enough. As our approach works rather well, I decided to offer the package on CRAN as well. The small screenshot gives a simple idea; the README.md contains a larger screenshot.

The initial NEWS entries follow below.

Changes in ttdo version 0.0.3 (2019-09-08)
Changes in ttdo version 0.0.2 (2019-08-31)
• Updated defaults for format and mode to use the same options used by diffobj along with fallbacks.
Changes in ttdo version 0.0.1 (2019-08-26)
• Initial version, with thanks to both Mark and Brodie.

Please use the GitHub repo and its issues for any questions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.


### Tell Me a Story: How to Generate Textual Explanations for Predictive Models

Fri, 09/13/2019 - 21:33

TL;DR: If you are going to explain predictions for a black box model you should combine statistical charts with natural language descriptions. This combination is more powerful than SHAP/LIME/PDP/Break Down charts alone. During this summer Adam Izdebski implemented this feature for explanations generated in R with the DALEX library. How did he do it? Find out here:

Long version:
Amazing things were created during summer internships at MI2DataLab this year. One of them is the generator of natural language descriptions for DALEX explainers developed by Adam Izdebski.

Is text better than charts for explanations?
Packages from the DrWhy.AI toolbox generate lots of graphical explanations for predictive models. The available statistical charts allow us to better understand how a model works in general (the global perspective) or for a specific prediction (the local perspective).
Yet for domain experts without training in mathematics or computer science, graphical explanations may be insufficient. Charts are great for exploration and discovery, but for explanations they introduce some ambiguity. Have I read everything? Maybe I missed something?
To address this problem we introduced the describe() function, which automatically generates textual explanations for predictive models. Right now these natural language descriptions are implemented in the R packages ingredients and iBreakDown.

Insufficient interpretability
Domain experts without formal training in mathematics or computer science often find statistical explanations hard to interpret. There are various reasons for this. First of all, explanations are often displayed as complex plots without instructions. Often there is no clear narration or interpretation visible. Plots use different scales, mixing probabilities with relative changes in the model’s prediction. The order of variables may also be misleading. See for example a Break-Down plot.

The figure displays the prediction, generated with a Random Forest, that a selected passenger survived the Titanic sinking. The model’s average response on the Titanic data set (the intercept) is equal to 0.324. The model predicts that the selected passenger survived with probability 0.639. It also displays all the variables that have contributed to that prediction. Once the plot is described, it is easy to interpret, as it possesses a very clear graphical layout. However, interpreting it for the first time may be tricky.
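The Break-Down attribution is additive: variable contributions stack on top of the average response until they reach the prediction. A toy illustration; only the intercept 0.324 and the prediction 0.639 come from the figure, the individual contributions are hypothetical:

```r
intercept <- 0.324  # average model response (from the post)
contrib <- c(gender = 0.15, fare = 0.12,
             class = -0.06, other = 0.105)  # made-up attributions

prediction <- intercept + sum(contrib)
prediction  # 0.639: contributions sum exactly to the prediction
```

This additivity is what lets a textual description walk from "the average passenger" to "this passenger" one variable at a time.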

Properties of a good description
Effective communication and argumentation is a difficult craft. For that reason, we borrow from debate-winning strategies to guide the generation of persuasive textual explanations. First of all, any description should be intelligible and persuasive. We achieve this by using:

• Fixed structure: Effective communication requires a rigid structure. Thus we generate descriptions from a fixed template that always includes a proper introduction, argumentation and conclusion part. This makes the description more predictable, hence intelligible.
• Situation recognition: In order to make a description more trustworthy, we begin generating the text by identifying one of the scenarios that we are dealing with. Currently, the following scenarios are available:
  • The model prediction is significantly higher than the average model prediction. In this case, the description should convince the reader why the prediction is higher than the average.
  • The model prediction is significantly lower than the average model prediction. In this case, the description should convince the reader why the prediction is lower than the average.
  • The model prediction is close to the average. In this case, the description should convince the reader that either the variables contradict each other or the variables are insignificant.
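The scenario recognition step boils down to comparing the prediction against the average response under some tolerance; a hypothetical sketch of that dispatch (function name and threshold are illustrative, not the iBreakDown internals):

```r
# Decide which narrative template a prediction calls for.
recognise_scenario <- function(prediction, average, tol = 0.1) {
  if (prediction > average + tol) "higher than average"
  else if (prediction < average - tol) "lower than average"
  else "close to average"
}

recognise_scenario(0.639, 0.324)  # "higher than average"
recognise_scenario(0.30,  0.324)  # "close to average"
```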

Identifying what should be justified is a crucial step in generating persuasive descriptions.

Description’s template for persuasive argumentation
As noted before, to achieve clarity we generate descriptions with three separate components: an introduction, an argumentation part, and a summary.

An introduction should provide a claim. It is the basic point that an arguer wishes to make. In our case, it is the model’s prediction. Displaying additional information about the distribution of predictions helps to place it in context — is it low, high or close to the average?

An argumentation part should provide evidence and reason, which connects the evidence to the claim. In a normal setting this works like so: This particular passenger survived the catastrophe (claim) because she was a child (evidence no. 1) and children were evacuated from the ship first, as in the phrase “women and children first” (reason no. 1). What is more, the child was traveling in first class (evidence no. 2) and first-class passengers had the best cabins, which were close to the rescue boats (reason no. 2).

The tricky part is that we are not able to make up a reason automatically, as it is a matter of context and interpretation. However, what we can do is highlight the main evidence that made the model produce the claim. If a model is making its predictions for the right reason, the evidence should make sense and it should be easy for the reader to build a story connecting the evidence to the claim. If the model is displaying evidence that makes little sense, that should be a clear signal that the model may not be trustworthy.

A summary is just the rest of the justification. It states that the other pieces of evidence are of less importance and thus may be omitted. A good rule of thumb is to display the three most important pieces of evidence, so as not to make the picture too complex. We can think of the above scheme as creating relational arguments, as in guides to winning debates.
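Selecting the three most important pieces of evidence and pooling the remainder is just a sort on absolute contribution; a base-R sketch with made-up attributions:

```r
contrib <- c(gender = 0.18, fare = 0.11, class = -0.09,
             age = 0.03, embarked = -0.02, sibsp = 0.01)  # hypothetical

# The three largest |contributions| become the evidence ...
top3 <- names(sort(abs(contrib), decreasing = TRUE))[1:3]
evidence <- contrib[top3]

# ... and everything else collapses into one "other variables" term,
# as in the generated descriptions.
other <- sum(contrib[setdiff(names(contrib), top3)])
evidence
other
```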

Implementation
The logic described above is implemented in ingredients and iBreakDown packages.

For generating a description we should pass the explanation generated by ceteris_paribus(), break_down() or shap() to the describe() function.

describe(break_down_explanation)
# Random Forest predicts, that the prediction for the selected instance is 0.639 which is higher than the average.
# The most important variables that increase the prediction are gender, fare.
# The most important variable that decrease the prediction is class.
# Other variables are with less importance. The contribution of all other variables is -0.063 .

There are various parameters that control the display of the description, making it more flexible and thus suited to more applications. They include:

• generating a short version of descriptions,
• displaying predictions’ distribution details,
• generating more detailed argumentation.

While explanations generated by iBreakDown are feature-attribution explanations that aim at providing interpretable reasons for the model’s prediction, explanations generated by ingredients are rather speculative. In essence, they explain how the model’s prediction would change if we perturbed the instance being explained. For example, the ceteris_paribus() explanation explores how the prediction would change if we changed the values of a single feature while keeping the other features unchanged.
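The ceteris paribus idea is easy to sketch directly: fix one observation, sweep a single feature across a grid, hold everything else unchanged, and record the model response. With a stand-in model (not DALEX code; the logistic form is made up):

```r
# Stand-in model: survival probability rising with age, falling with fare.
model <- function(row) plogis(0.05 * row$age - 0.05 * row$fare)

instance <- data.frame(age = 30, fare = 70)

# Ceteris paribus profile for `age`: vary it, hold `fare` fixed.
grid <- seq(10, 80, by = 10)
profile <- vapply(grid, function(a) {
  row <- instance
  row$age <- a
  model(row)
}, numeric(1))
cbind(age = grid, prediction = profile)  # monotone increasing in age
```

The textual description then only has to summarize this curve: where the response is highest, where it is lowest, and where it changes direction.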

describe(ceteris_paribus_explanation, variables = "age")
# For the selected instance Random Forest predicts that , prediction is equal to 0.699.
# The highest prediction occurs for (age = 16), while the lowest for (age = 74).
# Breakpoint is identified at (age = 40).
# Average model responses are *lower* for variable values *higher* than breakpoint.

Applications and future work

Generating natural language explanations is a sensitive task, as interpretability always depends on the end user’s cognition. For this reason, experiments should be designed to assess the usefulness of the descriptions being generated. Furthermore, more vocabulary flexibility could be added to make the descriptions more human-like. Descriptions could also be integrated with a chatbot that would explain predictions interactively, using the framework described here. Finally, better discretization techniques could be used to generate better textual explanations for continuous ceteris paribus and aggregated profiles.
