R news and tutorials contributed by hundreds of R bloggers

### Moving on as Head of Solutions and AI at Draper and Dash

Fri, 09/18/2020 - 09:52

[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers.]

Well, what can I say? It has been a blast over the past year and a half. In my time at the company I have worked on a number of fascinating projects; here is a list of just some of them:

1. Data Science Platform – the culmination of many machine learning projects, undertaken separately in R and Python, which allowed our Healthcare Data Science Platform to be developed. Alongside skilled data scientists, data engineers and front-end developers, we think we have built something really special.
2. Clinical Outcome Review System (CORS) – we have built a lot into this product, and the web tool has been co-created and aligned with our core customers. CORS is a web framework that centralises ‘learning from deaths’ in one easy-to-use tool.
3. Command Centre – led the design and development of the underpinning machine learning and forecasting models embedded in the solution, and worked alongside key trusts to deploy it and tailor it to each trust’s needs. For information about this product see: https://draperanddash.com/solutions-2/command-centre/.
4. Readmission avoidance tool, stranded patient predictor, ED length of stay estimator, waiting list cohorter, an effective forecasting solution, and many other projects – primarily developing the forecasting and predictive functionality alongside our key data scientist Alfonso. These have been created as Qlik and other BI dashboards that smoothly integrate and augment the machine learning with exploratory analysis functionality in the various business intelligence solutions.
5. Data Assurance Programme – working with a number of trusts to establish our data assurance framework and build out the discovery process that heightens and improves the visibility of data issues across the organisational sphere.
6. AI and ML innovation lead – established the wider AI/ML meetup group and managed the pre-existing trust-wide AI and ML group. This involved sending regular updates to the group and making sure members were sighted on upcoming D&D products, tools and algorithms.
7. Machine Learning and Data Science Training – developed the training and led the rollout of upskilling analysts in data science and machine learning techniques, mainly using the R programming language as a vehicle for demonstrating the ‘art of the possible’.
8. Cancer analytics – led the statistical analysis of a number of key cancer measures to establish whether observed changes were statistically significant.
9. COVID-19 – SIR and SEIR model development, which led to the correct estimation of a number of epidemic curves and peaks. Additionally, I created a Shiny web tool to estimate peak demand. This was a very fun project to work on in a distressing time.

These are just a few of the projects I have been involved with and I am hugely proud of the time at the company.

I look forward to my next challenge back in the NHS as Head of Advanced Analytics for a CSU I won’t name…

What will I miss about the role?

I will miss a number of things in the role, but if I had to rank them (as a statistician, you know how much we like to quantify), they would be:

1. The people – the team at D&D are great and I have made some really good friends.
2. The challenge – every day is different: coding a Shiny app, working on a machine learning problem in R, working on NLP and deep learning in Python, consulting with clients, upskilling NHS analysts, and working across a number of different domains.
3. The learning curve – I like to continuously develop, and D&D affords that opportunity. Every day is a school day, as my professor used to say.
4. The team spirit – we are a team and that is how we deliver all our projects.
5. The coding – I love coding and I have time to do this on a daily basis – from R to Python for data science, SQL for extracting information, among others.

The only thing I won’t miss is the travel, but you can’t have it all…

Does this sound like your bag?

If you want to become the new me, then you are in luck, as D&D are looking for a replacement:

Details of the role are on the Draper and Dash main website.

A special thanks to all the guys at D&D

I have really enjoyed my time with the team and I will miss each and every one of them.

Team D&D

You guys have been great and it has been emotional.


The post Moving on as Head of Solutions and AI at Draper and Dash first appeared on R-bloggers.

### Permutations in R

Fri, 09/18/2020 - 07:58

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers.]

During the interview process for data science positions, you are likely to be asked to calculate combinations or permutations. Today we will work through an example of solving a permutation problem numerically in R.

Find all Permutations of the word baboon

Mathematically we can approach this question as follows:

$$P=\frac{n!}{n_1! n_2! n_3!…n_k!}$$

Where:

• $$n$$ is the total number of objects, i.e. letters, which in our case is 6
• $$n_1$$ is the number of objects of type 1, for example, the number of b which is 2
• $$n_2$$ is the number of objects of type 2, for example, the number of a which is 1
• $$n_3$$ is the number of objects of type 3, for example, the number of o which is 2
• $$n_4$$ is the number of objects of type 4, for example, the number of n which is 1

Hence the number of permutations is $$P=\frac{6!}{2! \times 2!} = 180$$
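As a quick sanity check, the formula can be evaluated directly in base R (a minimal sketch; the letter counts are derived from the word itself):

```r
# Multiset permutation formula for "baboon":
# 6 letters in total; "b" and "o" each appear twice, "a" and "n" once.
counts <- table(strsplit("baboon", "")[[1]])  # a=1, b=2, n=1, o=2
perms <- factorial(sum(counts)) / prod(factorial(counts))
perms
# [1] 180
```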

Return all Permutations in R

Let’s return all permutations in R

Working with combinat package

```r
library(combinat)

# Get the list of all permutations
my_list <- combinat::permn(c("b", "a", "b", "o", "o", "n"))

# Convert the list to a matrix
my_matrix <- do.call(rbind, my_list)

# Take the unique rows
my_matrix <- unique(my_matrix)

head(my_matrix)
```

```
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] "b"  "a"  "b"  "o"  "o"  "n"
[2,] "b"  "a"  "b"  "o"  "n"  "o"
[3,] "b"  "a"  "b"  "n"  "o"  "o"
[4,] "b"  "a"  "n"  "b"  "o"  "o"
[5,] "b"  "n"  "a"  "b"  "o"  "o"
[6,] "n"  "b"  "a"  "b"  "o"  "o"
```

If we want to get the number of rows of the table, which are actually our permutations:

```r
dim(my_matrix)
# [1] 180   6
```

As expected, we got 180 rows (the permutations) and 6 columns (the number of letters).

When we are in a position to get all the possible permutations, we will be able to calculate the permutations of more complicated problems. Let’s say, that the question was to calculate all the possible permutations of the word baboon but by picking 4 letters each time (instead of 6).
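One way to tackle this variant, sketched below with a small base-R permutation helper (so it does not depend on any package): every distinct 4-letter arrangement is exactly a distinct 4-letter prefix of some full 6-letter arrangement, so we can generate the full permutations and keep the unique prefixes.

```r
# Base-R generator for all permutations of a vector (one row per permutation)
perm_rows <- function(x) {
  if (length(x) == 1) return(matrix(x, 1, 1))
  do.call(rbind, lapply(seq_along(x), function(i) cbind(x[i], perm_rows(x[-i]))))
}

full <- unique(perm_rows(c("b", "a", "b", "o", "o", "n")))  # the 180 full permutations
four <- unique(full[, 1:4])  # distinct 4-letter arrangements = distinct prefixes
nrow(four)
# [1] 102
```

The same matrix could equally be built with `combinat::permn()` as above; the prefix trick works because any 4-letter arrangement can be completed into a full permutation by appending the two unused letters.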


The post Permutations in R first appeared on R-bloggers.

### Sequential satisficing

Thu, 09/17/2020 - 20:00

[This article was first published on R on OSM, and kindly contributed to R-bloggers.]

In our last post, we ran simulations on our 1,000 randomly generated return scenarios to compare the average and risk-adjusted return for satisfactory, naive, and mean-variance optimized (MVO) maximum return and maximum Sharpe ratio portfolios.1 We found that you can shoot for high returns or high risk-adjusted returns, but rarely both. Assuming no major change in the underlying average returns and risk, choosing the efficient high return or high risk-adjusted return portfolio generally leads to similar performance a majority of the time in out-of-sample simulations. What was interesting about the results was that even though we weren’t arguing that one should favor satisficing over optimizing, we found that the satisfactory portfolio generally performed well.

One area we didn’t test was the effect of sequence on mean-variance optimization. Recall we simulated 1,000 different sixty month (five year) return profiles for four potential assets—stocks, bonds, commodities (gold) and real estate. The weights we used to calculate the different portfolio returns and risk were based on data from 1987-1991 and were kept constant for all the simulations. What if the weighting system could “learn” from previous scenarios or adjust to recent information?

In this post, we’ll look at the impact of incorporating sequential information on portfolio results and compare that to the non-sequential results. To orient folks, we’ll show the initial portfolio weight simulation graph with the different portfolios and efficient frontier based on the historical data. The satisfactory, naive, MVO-derived maximum Sharpe ratio, and MVO-derived maximum return portfolios are the colored blue, black, red, and purple points, respectively.

Now we’ll take the weighting schemes and apply them to the return simulations. However, we’ll run through the returns sequentially and calculate the returns and risk-adjusted returns for the weighting scheme using the prior simulation as the basis for the allocation on the next simulation. For example, we’ll calculate the efficient frontier and run the portfolio weight algorithm (see Weighting on a friend for more details) on simulation #75. From those calculations, we’ll assign the weights for the satisfactory, maximum Sharpe ratio, and maximum efficient return portfolios to compute the returns and risk-adjusted returns on simulation #76. For the record, simulation #1 uses the same weights as those that produced the dots in the graph above.
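The loop just described can be sketched as follows (a minimal illustration, not the post’s actual code, which appears in full below; `fit_weights()` and `port_perf()` are hypothetical stand-ins for the optimizers and the return/risk calculation):

```r
# Weights learned on simulation i-1 are applied to simulation i;
# simulation 1 uses the initial (historical) weights.
run_sequential <- function(sims, init_wts, fit_weights, port_perf) {
  wts <- init_wts
  results <- vector("list", length(sims))
  for (i in seq_along(sims)) {
    if (i > 1) wts <- fit_weights(sims[[i - 1]])  # learn from the prior scenario
    results[[i]] <- port_perf(sims[[i]], wts)     # evaluate on the current one
  }
  do.call(rbind, results)
}
```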

Running the simulations, here’s the average return by weighting algorithm.

And here’s the original.

For the naive and maximum return portfolios, the average return isn’t much different on the sequential calculations than on the original constant-weight computations. The sequential Sharpe allocations are about one percentage point lower than the stable ones. However, the sequential satisfactory allocations are almost two percentage points lower than the constant allocation. In about 5% of the cases, the return simulation was not capable of achieving a satisfactory portfolio under the original constraints of not less than 7% and not more than 10% annualized return and risk. In such cases, we reverted to the original weighting. This adjustment did not affect the average return even when we excluded the “unsatisfactory” portfolios.

We assume that the performance drag is due to the specificity of the constraints amid random outcomes. The other portfolios (except for the naive one) optimize for the previous scenario and apply those optimal weights on the next portfolio in the sequence. The satisfactory portfolio chooses average weights to achieve a particular constraint and then applies those weights in sequence. Due to the randomness of the return scenarios, it seems possible that a weighting scheme that works for one scenario would produce poor results on another scenario. Of course, the satisfactory portfolios still achieved their return constraint on average.

This doesn’t exactly explain why the MVO portfolios didn’t suffer a performance drag. Our intuition is that the MVO portfolios are border cases. Hence, there probably won’t be too much dispersion in the results on average since the return simulations draw from the same return, risk, and error terms throughout. However, the satisfactory portfolio lies within the main portion of the territory, so the range of outcomes could be much larger. Clearly, this requires further investigation, but we’ll have to shelve that for now.

In the next set of graphs, we check out how each portfolio performs relative to the others. First the sequential simulation.

And the original:

There are some modest changes in the satisfactory and naive vs. MVO-derived portfolios. But the biggest change occurs in the satisfactory vs. naive portfolio—almost a 22 percentage point decline! This appears to support our conjecture above that the order of the sequence could be affecting the satisfactory portfolio’s performance. Nonetheless, sequencing does not appear to alter the performance of the naive and satisfactory portfolios relative to the MVO-derived ones in a material way. Let’s randomize the sequence to see if the results change.

For the random draw case, we randomly draw one return simulation on which to calculate our portfolio weights and then apply those weights on another randomly drawn portfolio. The draws are without replacement to ensure we don’t use the same simulation both to calculate weights and returns as well as to ensure that we use all simulations.2
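One simple way to implement draws without replacement is to shuffle the simulation indices once and pair them off, so no simulation is used for both weight estimation and performance measurement (a sketch of the idea, not the exact bookkeeping in the code below):

```r
set.seed(123)
idx <- sample(1000)              # a random permutation: each simulation appears once
fit_on  <- idx[seq(1, 999, 2)]   # simulations used to estimate weights
eval_on <- idx[seq(2, 1000, 2)]  # simulations used to measure performance
length(intersect(fit_on, eval_on))
# [1] 0
```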

The average return graph isn’t much different for the random sampling than it is for the sequential so we’ll save some space and not print it. Here is the relative performance graph.

And here is the original relative performance for ease of comparison.

With random ordering we see that the naive and satisfactory portfolios experience a modest performance improvement relative to the maximum return portfolio. Relative to the maximum Sharpe portfolio, the performance is essentially unchanged. The satisfactory portfolio again suffers a drag relative to the naive, but somewhat less severe than in the sequential case.

What are the takeaways? In both sequential and random sampling, the performance of the naive and satisfactory compared to the MVO portfolios was relatively stable. That supports our prior observation that you can shoot for high returns or high risk-adjusted returns, but usually not both. It’s also noteworthy that the maximum Sharpe ratio consistently underperforms the naive and satisfactory portfolios in all of the simulations. The team at Alpha Architect noted similar results in a much broader research project. There, part of the study found that the maximum Sharpe portfolio suffered the worst performance relative to many others, including an equal-weighted (i.e., naive) portfolio.

One issue with our random sampling test is that it was only one random sequence. To be truly robust, we’d want to run the sampling simulation more than once. But we’ll have to save that for when we rent time on a cloud GPU because we don’t think our flintstone-aged laptop will take kindly to running 1,000 simulations of 1,000 random sequences of 1,000 simulated portfolios with 1,000 optimizations and 3,000 random portfolio weightings!

Perceptive readers might quibble with how we constructed our “learning” simulations. Indeed, in both the sequential and random sample cases we calculated our allocation weights only using the prior simulation in the queue (e.g. weights derived from return simulation #595 are deployed on simulation #596 (or, for example, #793 in the random simulation)). But that “learning” could be considered relatively short term. A more realistic scenario would be to allow for cumulative learning by trying to incorporate all or a majority of past returns in the sequence. And we could get even more complicated by throwing in some weighting scheme to emphasize nearer term results. Finally, we could calculate cumulative returns over one scenario or multiple scenarios. We suspect relative performance would be similar, but you never know!

Until we experiment with these other ideas, the R and Python code are below. While we generally don’t discuss the code in detail, we have been getting more questions and constructive feedback on it of late. Thank you for that! One thing to note is that the R code to produce the simulations runs much quicker than the Python code. Part of that is likely due to how we coded the efficient frontier and maximum Sharpe and return portfolios in the different languages. But the differences are close to an order of magnitude. If anyone notices anything we could do to improve performance, please drop us an email at nbw dot osm at gmail dot com. Thanks!

R code

```r
# Built using R 3.6.2

### Load packages
suppressPackageStartupMessages({
  library(tidyquant)
  library(tidyverse)
})

### Load data
# See prior posts for how we built these data frames
df <- readRDS("port_const.rds")
dat <- readRDS("port_const_long.rds")
sym_names <- c("stock", "bond", "gold", "realt", "rfr")
sim1 <- readRDS("hist_sim_port16.rds")

### Call functions
# See prior posts for how we built these functions
source("Portfolio_simulation_functions.R")
source("Efficient_frontier.R")

## Function for calculating satisfactory weighting
port_sim_wts <- function(df, sims, cols, return_min, risk_max){
  if(ncol(df) != cols){
    stop("Columns don't match")
  }
  # Create weight matrix
  wts <- matrix(nrow = (cols-1)*sims, ncol = cols)
  count <- 1
  for(i in 1:(cols-1)){
    for(j in 1:sims){
      a <- runif((cols-i+1), 0, 1)
      b <- a/sum(a)
      c <- sample(c(b, rep(0, i-1)))
      wts[count,] <- c
      count <- count + 1
    }
  }
  # Find returns
  mean_ret <- colMeans(df)
  # Calculate covariance matrix
  cov_mat <- cov(df)
  # Calculate random portfolios
  port <- matrix(nrow = (cols-1)*sims, ncol = 2)
  for(i in 1:nrow(port)){
    port[i,1] <- as.numeric(sum(wts[i,] * mean_ret))
    port[i,2] <- as.numeric(sqrt(t(wts[i,]) %*% cov_mat %*% wts[i,]))
  }
  port <- as.data.frame(port) %>%
    `colnames<-`(c("returns", "risk"))
  port_select <- cbind(port, wts)
  port_wts <- port_select %>%
    mutate(returns = returns*12,
           risk = risk*sqrt(12)) %>%
    filter(returns >= return_min,
           risk <= risk_max) %>%
    summarise_at(vars(3:6), mean)
  port_wts
}

## Portfolio function
port_func <- function(df, wts){
  mean_ret <- colMeans(df)
  returns <- sum(mean_ret*wts)
  risk <- sqrt(t(wts) %*% cov(df) %*% wts)
  c(returns, risk)
}

## Portfolio graph
pf_graf <- function(df, nudge, multiplier, rnd, y_lab, text){
  df %>%
    gather(key, value) %>%
    ggplot(aes(reorder(key, value), value*multiplier*100)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    geom_text(aes(label = format(round(value*multiplier, rnd)*100, nsmall = 1)),
              nudge_y = nudge) +
    labs(x = "",
         y = paste(y_lab, "(%)", sep = " "),
         title = paste(text, "by portfolio", sep = " "))
}

## Create outperformance graph
perf_graf <- function(df, rnd, nudge, text){
  df %>%
    rename("Satisfactory vs. Naive" = ovs,
           "Satisfactory vs. Max" = ovr,
           "Naive vs. Max" = rve,
           "Satisfactory vs. Sharpe" = ove,
           "Naive vs. Sharpe" = sve) %>%
    gather(key, value) %>%
    ggplot(aes(reorder(key, value), value*100)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    geom_text(aes(label = format(round(value, rnd)*100, nsmall = 1)),
              nudge_y = nudge) +
    labs(x = "",
         y = "Percent (%)",
         title = paste("Frequency of outperformance:", text, sep = " "))
}

### Create weight schemes
satis_wts <- c(0.32, 0.4, 0.08, 0.2) # Calculated in previous post using port_select_func
simple_wts <- rep(0.25, 4)
eff_port <- eff_frontier_long(df[2:61, 2:5], risk_increment = 0.01)
eff_sharp_wts <- eff_port[which.max(eff_port$sharpe), 1:4] %>% as.numeric()
eff_max_wts <- eff_port[which.max(eff_port$exp_ret), 1:4] %>% as.numeric()

## Run port sim on 1987-1991 data
port_sim_1 <- port_sim_lv(df[2:61, 2:5], 1000, 4)

# Run function on the weighting schemes and one simulation
weight_list <- list(satis = satis_wts, naive = simple_wts,
                    sharp = eff_sharp_wts, max = eff_max_wts)
wts_df <- data.frame(wts = c("satis", "naive", "sharp", "max"),
                     returns = 1:4, risk = 5:8,
                     stringsAsFactors = FALSE)
for(i in 1:4){
  wts_df[i, 2:3] <- port_func(df[2:61, 2:5], weight_list[[i]])
}
wts_df$sharpe <- wts_df$returns/wts_df$risk

# Graph portfolio simulation with the weighted portfolios
port_sim_1$graph +
  geom_point(data = wts_df,
             aes(x = risk*sqrt(12)*100, y = returns*1200),
             color = c("darkblue", "black", "red", "purple"),
             size = 4) +
  geom_line(data = eff_port,
            aes(stdev*sqrt(12)*100, exp_ret*1200),
            color = 'blue', size = 1.5) +
  theme(legend.position = c(0.05, 0.8),
        legend.key.size = unit(.5, "cm"),
        legend.background = element_rect(fill = NA))

## Calculate performance with rolling frontier
set.seed(123)
eff_sharp_roll <- list()
eff_max_roll <- list()
satis_roll <- list()
for(i in 1:1000){
  if(i == 1){
    sharp_wts <- eff_sharp_wts
    max_wts <- eff_max_wts
    sat_wts <- satis_wts
  } else {
    eff_calc <- eff_frontier_long(sim1[[i-1]]$df, risk_increment = 0.01)
    sharp_wts <- eff_calc[which.max(eff_calc$sharpe), 1:4]
    max_wts <- eff_calc[which.max(eff_calc$exp_ret), 1:4]
    sat_wts <- port_sim_wts(sim1[[i-1]]$df, 1000, 4, 0.07, 0.1)
  }
  eff_sharp_roll[[i]] <- port_func(sim1[[i]]$df, as.numeric(sharp_wts))
  eff_max_roll[[i]] <- port_func(sim1[[i]]$df, as.numeric(max_wts))
  satis_roll[[i]] <- port_func(sim1[[i]]$df, as.numeric(sat_wts))
}

list_to_df <- function(list_ob){
  x <- do.call('rbind', list_ob) %>%
    as.data.frame() %>%
    `colnames<-`(c("returns", "risk")) %>%
    mutate(sharpe = returns/risk)
  x
}

roll_lists <- c("eff_sharp_roll", "eff_max_roll", "satis_roll")
for(obj in roll_lists){
  x <- list_to_df(get(obj))
  assign(obj, x)
}

# Return
# Note: simple_df, mean_pf, and ret_pf come from the previous post's code
roll_mean_pf <- data.frame(Satisfactory = mean(satis_roll[,1], na.rm = TRUE),
                           Naive = mean(simple_df[,1]),
                           Sharpe = mean(eff_sharp_roll[,1]),
                           Max = mean(eff_max_roll[,1]))

# Graph mean returns
pf_graf(roll_mean_pf, 1, 12, 2, "Return (%)", "Rolling simulation return")
pf_graf(mean_pf, 1, 12, 2, "Return (%)", "Original simulation return")

# Create relative performance df
roll_ret_pf <- data.frame(ovs = mean(satis_roll[,1] > simple_df[,1]),
                          ovr = mean(satis_roll[,1] > eff_max_roll[,1]),
                          rve = mean(simple_df[,1] > eff_max_roll[,1]),
                          ove = mean(satis_roll[,1] > eff_sharp_roll[,1]),
                          sve = mean(simple_df[,1] > eff_sharp_roll[,1]))

# Graph outperformance
perf_graf(roll_ret_pf, 2, 4, "Rolling simulation returns")
perf_graf(ret_pf, 2, 4, "Original simulation returns")

## Sampling portfolios
set.seed(123)
eff_sharp_samp <- list()
eff_max_samp <- list()
satis_samp <- list()
for(i in 1:1000){
  if(i == 1){
    sharp_wts <- eff_sharp_wts
    max_wts <- eff_max_wts
    sat_wts <- satis_wts
  } else {
    samp_1 <- sample(1000, 1) # Sample a return simulation for the weight scheme
    eff_calc <- eff_frontier_long(sim1[[samp_1]]$df, risk_increment = 0.01)
    sharp_wts <- eff_calc[which.max(eff_calc$sharpe), 1:4]
    max_wts <- eff_calc[which.max(eff_calc$exp_ret), 1:4]
    sat_wts <- port_sim_wts(sim1[[samp_1]]$df, 1000, 4, 0.07, 0.1)
  }
  samp_2 <- sample(1000, 1) # Sample a return simulation to analyze performance
  eff_sharp_samp[[i]] <- port_func(sim1[[samp_2]]$df, as.numeric(sharp_wts))
  eff_max_samp[[i]] <- port_func(sim1[[samp_2]]$df, as.numeric(max_wts))
  satis_samp[[i]] <- port_func(sim1[[samp_2]]$df, as.numeric(sat_wts))
}

samp_lists <- c("eff_sharp_samp", "eff_max_samp", "satis_samp")
for(obj in samp_lists){
  x <- list_to_df(get(obj))
  assign(obj, x)
}

# Return
samp_mean_pf <- data.frame(Satisfactory = mean(satis_samp[,1], na.rm = TRUE),
                           Naive = mean(simple_df[,1]),
                           Sharpe = mean(eff_sharp_samp[,1]),
                           Max = mean(eff_max_samp[,1]))

# Graph mean returns: NOT SHOWN
pf_graf(samp_mean_pf, 1, 12, 2, "Return (%)", "Random sampling return")
# pf_graf(mean_pf, 1, 12, 2, "Return (%)", "Original return")

# Create relative performance df
samp_ret_pf <- data.frame(ovs = mean(satis_samp[,1] > simple_df[,1]),
                          ovr = mean(satis_samp[,1] > eff_max_samp[,1]),
                          rve = mean(simple_df[,1] > eff_max_samp[,1]),
                          ove = mean(satis_samp[,1] > eff_sharp_samp[,1]),
                          sve = mean(simple_df[,1] > eff_sharp_samp[,1]))

# Graph outperformance
perf_graf(samp_ret_pf, 2, 4, "Random sample returns")
perf_graf(ret_pf, 2, 4, "Original returns")
```

Python code

# Built using Python 3.7.4 # Load libraries import pandas as pd import pandas_datareader.data as web import numpy as np import matplotlib.pyplot as plt %matplotlib inline plt.style.use('ggplot') ## Load data # Seem prior posts for how we built these data frames df = pd.read_pickle('port_const.pkl') dat = pd.read_pickle('data_port_const.pkl') port_names = ['Original','Naive', 'Sharpe', 'Max'] sim1 = pd.read_pickle('hist_sim_port16.pkl') ## Load functions part 1 # Portfolio simulation functions # NOte the Port_sim class is slightly different than previous posts, # so we're reproducing it here. # We were getting a lot of "numpy.float64 is not callable" errors # due to overlapping names on variables and functions, so we needed # to fix the code. If it still throws error, let us know if it does. ## Simulation function class Port_sim: def calc_sim(df, sims, cols): wts = np.zeros((sims, cols)) for i in range(sims): a = np.random.uniform(0,1,cols) b = a/np.sum(a) wts[i,] = b mean_ret = df.mean() port_cov = df.cov() port = np.zeros((sims, 2)) for i in range(sims): port[i,0] = np.sum(wts[i,]*mean_ret) port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T,port_cov), wts[i,])) sharpe = port[:,0]/port[:,1]*np.sqrt(12) return port, wts, sharpe def calc_sim_lv(df, sims, cols): wts = np.zeros(((cols-1)*sims, cols)) count=0 for i in range(1,cols): for j in range(sims): a = np.random.uniform(0,1,(cols-i+1)) b = a/np.sum(a) c = np.random.choice(np.concatenate((b, np.zeros(i))),cols, replace=False) wts[count,] = c count+=1 mean_ret = df.mean() port_cov = df.cov() port = np.zeros(((cols-1)*sims, 2)) for i in range(sims): port[i,0] = np.sum(wts[i,]*mean_ret) port[i,1] = np.sqrt(np.dot(np.dot(wts[i,].T,port_cov), wts[i,])) sharpe = port[:,0]/port[:,1]*np.sqrt(12) return port, wts, sharpe def graph_sim(port, sharpe): plt.figure(figsize=(14,6)) plt.scatter(port[:,1]*np.sqrt(12)*100, port[:,0]*1200, marker='.', c=sharpe, cmap='Blues') plt.colorbar(label='Sharpe ratio', orientation = 'vertical', 
shrink = 0.25) plt.title('Simulated portfolios', fontsize=20) plt.xlabel('Risk (%)') plt.ylabel('Return (%)') plt.show() # Constraint function def port_select_func(port, wts, return_min, risk_max): port_select = pd.DataFrame(np.concatenate((port, wts), axis=1)) port_select.columns = ['returns', 'risk', 1, 2, 3, 4] port_wts = port_select[(port_select['returns']*12 >= return_min) & (port_select['risk']*np.sqrt(12) <= risk_max)] port_wts = port_wts.iloc[:,2:6] port_wts = port_wts.mean(axis=0) return port_wts def port_select_graph(port_wts): plt.figure(figsize=(12,6)) key_names = {1:"Stocks", 2:"Bonds", 3:"Gold", 4:"Real estate"} lab_names = [] graf_wts = port_wts.sort_values()*100 for i in range(len(graf_wts)): name = key_names[graf_wts.index[i]] lab_names.append(name) plt.bar(lab_names, graf_wts, color='blue') plt.ylabel("Weight (%)") plt.title("Average weights for risk-return constraint", fontsize=15) for i in range(len(graf_wts)): plt.annotate(str(round(graf_wts.values[i])), xy=(lab_names[i], graf_wts.values[i]+0.5)) plt.show() ## Load functions part 2 # We should have wrapped the three different efficient frontier functions # into one class or function but ran out of time. This is probably what slows # down the simulations below. 
# Create efficient frontier function from scipy.optimize import minimize def eff_frontier(df_returns, min_ret, max_ret): n = len(df_returns.columns) def get_data(weights): weights = np.array(weights) returns = np.sum(df_returns.mean() * weights) risk = np.sqrt(np.dot(weights.T, np.dot(df_returns.cov(), weights))) sharpe = returns/risk return np.array([returns,risk,sharpe]) # Contraints def check_sum(weights): return np.sum(weights) - 1 # Rante of returns mus = np.linspace(min_ret,max_ret,21) # Function to minimize def minimize_volatility(weights): return get_data(weights)[1] # Inputs init_guess = np.repeat(1/n,n) bounds = ((0.0,1.0),) * n eff_risk = [] port_weights = [] for mu in mus: # function for return cons = ({'type':'eq','fun': check_sum}, {'type':'eq','fun': lambda w: get_data(w)[0] - mu}) result = minimize(minimize_volatility,init_guess,method='SLSQP',bounds=bounds,constraints=cons) eff_risk.append(result['fun']) port_weights.append(result.x) eff_risk = np.array(eff_risk) return mus, eff_risk, port_weights # Create max sharpe function from scipy.optimize import minimize def max_sharpe(df_returns): n = len(df_returns.columns) def get_data(weights): weights = np.array(weights) returns = np.sum(df_returns.mean() * weights) risk = np.sqrt(np.dot(weights.T, np.dot(df_returns.cov(), weights))) sharpe = returns/risk return np.array([returns,risk,sharpe]) # Function to minimize def neg_sharpe(weights): return -get_data(weights)[2] # Inputs init_guess = np.repeat(1/n,n) bounds = ((0.0,1.0),) * n # function for return constraint = {'type':'eq','fun': lambda x: np.sum(x) - 1} result = minimize(neg_sharpe, init_guess, method='SLSQP', bounds=bounds, constraints=constraint) return -result['fun'], result['x'] # Create efficient frontier function from scipy.optimize import minimize def max_ret(df_returns): n = len(df_returns.columns) def get_data(weights): weights = np.array(weights) returns = np.sum(df_returns.mean() * weights) risk = np.sqrt(np.dot(weights.T, 
np.dot(df_returns.cov(), weights))) sharpe = returns/risk return np.array([returns,risk,sharpe]) # Function to minimize def port_ret(weights): return -get_data(weights)[0] # Inputs init_guess = np.repeat(1/n,n) bounds = ((0.0,1.0),) * n # function for return constraint = {'type':'eq','fun': lambda x: np.sum(x) - 1} result = minimize(port_ret, init_guess, method='SLSQP', bounds=bounds, constraints=constraint) return -result['fun'], result['x'] ## Load functions part 3 ## Portfolio return def port_func(df, wts): mean_ret = df.mean() returns = np.sum(mean_ret * wts) risk = np.sqrt(np.dot(wts, np.dot(df.cov(), wts))) return returns, risk ## Return and relative performance graph def pf_graf(names, values, rnd, nudge, ylabs, graf_title): df = pd.DataFrame(zip(names, values), columns = ['key', 'value']) sorted = df.sort_values(by = 'value') plt.figure(figsize = (12,6)) plt.bar('key', 'value', data = sorted, color='darkblue') for i in range(len(names)): plt.annotate(str(round(sorted['value'][i], rnd)), xy = (sorted['key'][i], sorted['value'][i]+nudge)) plt.ylabel(ylabs) plt.title('{} performance by portfolio'.format(graf_title)) plt.show() ## Create portfolio np.random.seed(123) port_sim_1, wts_1, sharpe_1 = Port_sim.calc_sim(df.iloc[1:61,0:4],1000,4) # Create returns and min/max ranges df_returns = df.iloc[1:61, 0:4] min_ret = min(port_sim_1[:,0]) max_ret = max(port_sim_1[:,0]) # Find efficient portfolio eff_ret, eff_risk, eff_weights = eff_frontier(df_returns, min_ret, max_ret) eff_sharpe = eff_ret/eff_risk ## Create weight schemes satis_wts = np.array([0.32, 0.4, 0.08, 0.2]) # Calculated in previous post using port_select_func simple_wts = np.repeat(0.25, 4) eff_sharp_wts = eff_weights[np.argmax(eff_sharpe)] eff_max_wts = eff_weights[np.argmax(eff_ret)] wt_list = [satis_wts, simple_wts, eff_sharp_wts, eff_max_wts] wts_df = np.zeros([4,3]) for i in range(4): wts_df[i,:2] = port_func(df.iloc[1:61,0:4], wt_list[i]) wts_df[:,2] = wts_df[:,0]/wts_df[:,1] ## Graph portfolios 
plt.figure(figsize=(12,6)) plt.scatter(port_sim_1[:,1]*np.sqrt(12)*100, port_sim_1[:,0]*1200, marker='.', c=sharpe_1, cmap='Blues') plt.plot(eff_risk*np.sqrt(12)*100,eff_ret*1200,'b--',linewidth=2) col_code = ['blue', 'black', 'red', 'purple'] for i in range(4): plt.scatter(wts_df[i,1]*np.sqrt(12)*100, wts_df[i,0]*1200, c = col_code[i], s = 50) plt.colorbar(label='Sharpe ratio', orientation = 'vertical', shrink = 0.25) plt.title('Simulated portfolios', fontsize=20) plt.xlabel('Risk (%)') plt.ylabel('Return (%)') plt.show() ## Create simplifed satisfactory portfolio finder function def port_sim_wts(df1, sims1, cols1, ret1, risk1): pf, wt, _ = Port_sim.calc_sim(df1, sims1, cols1) port_wts = port_select_func(pf, wt, ret1, risk1) return port_wts ## Run sequential simulation np.random.seed(123) eff_sharp_roll = np.zeros([1000,3]) eff_max_roll = np.zeros([1000,3]) satis_roll = np.zeros([1000,3]) for i in range(1000): if i == 0: sharp_weights = eff_sharp_wts max_weights = eff_max_wts sat_weights = satis_wts else: _, sharp_wts = max_sharpe(sim1[i-1][0]) _, max_wts = max_ret(sim1[i-1][0]) # Running the optimizatin twice is probably slowing down the simulation sharp_weights = sharp_wts max_weights = max_wts test = port_sim_wts(sim[i-1][0], 1000, 4, 0.07,0.1) if np.isnan(test): sat_weights = satis_wts else: sat_weights = test eff_sharp_roll[i,:2] = port_func(sim1[i][0], sharp_weights) eff_max_roll[i,:2] = port_func(sim1[i][0], max_weights) satis_roll[i,:2] = port_func(sim1[i][0], sat_weights) eff_sharp_roll[:,2] = eff_sharp_roll[:,0]/eff_sharp_roll[:,1] eff_max_roll[:,2] = eff_max_roll[:,0]/eff_max_roll[:,1] satis_roll[:,2] = satis_roll[:,0]/satis_roll[:,1] # Calculate simple returns simple_df = np.zeros([1000,3]) for i in range(1000): simple_df[i,:2] = port_func(sim1[i][0], simple_wts) simple_df[:,2] = simple_df[:,0]/simple_df[:,1] ## Add simulations to list and graph roll_sim = [satis_roll, simple_df, eff_sharp_roll, eff_max_roll] port_means = [] for df in roll_sim: 
    port_means.append(np.mean(df[:, 0]) * 1200)

port_names = ['Satisfactory', 'Naive', 'Sharpe', 'Max']

# Sequential simulation
pf_graf(port_names, port_means, 1, 0.5, 'Returns (%)', 'Rolling simulation return')

# Original simulation
port_means1 = []
for df in list_df:
    port_means1.append(np.mean(df[:, 0]) * 1200)

pf_graf(port_names, port_means1, 1, 0.5, 'Returns (%)', 'Original simulation return')

## Comparison charts
# Build names for comparison chart
comp_names = []
for i in range(4):
    for j in range(i+1, 4):
        comp_names.append('{} vs. {}'.format(port_names[i], port_names[j]))

# Calculate comparison values
comp_values = []
for i in range(4):
    for j in range(i+1, 4):
        comps = np.mean(roll_sim[i][:, 0] > roll_sim[j][:, 0])
        comp_values.append(comps)

# Sequential comparisons
pf_graf(comp_names[:-1], comp_values[:-1], 2, 0.025, 'Frequency (%)', 'Rolling simulation frequency of')

# Original comparisons
comp_values1 = []
for i in range(4):
    for j in range(i+1, 4):
        comps1 = np.mean(list_df[i][:, 0] > list_df[j][:, 0])
        comp_values1.append(comps1)

pf_graf(comp_names[:-1], comp_values1[:-1], 2, 0.025, 'Frequency (%)', 'Original simulation frequency of')

## Sample simulation
from datetime import datetime

start_time = datetime.now()
np.random.seed(123)
eff_sharp_samp = np.zeros([1000, 3])
eff_max_samp = np.zeros([1000, 3])
satis_samp = np.zeros([1000, 3])
naive_samp = np.zeros([1000, 3])

for i in range(1000):
    if i == 0:
        sharp_weights = eff_sharp_wts
        max_weights = eff_max_wts
        sat_wts = satis_wts
        nav_wts = simple_wts
    else:
        samp1 = int(np.random.choice(1000, 1))
        _, sharp_wts = max_sharpe(sim1[samp1][0])
        _, max_wts = max_ret(sim1[samp1][0])
        sharp_weights = sharp_wts
        max_weights = max_wts

        test = port_sim_wts(sim1[samp1][0], 1000, 4, 0.07, 0.1)
        if np.isnan(test).any():
            sat_wts = satis_wts
        else:
            sat_wts = test

    samp2 = int(np.random.choice(1000, 1))
    eff_sharp_samp[i, :2] = port_func(sim1[samp2][0], sharp_weights)
    eff_max_samp[i, :2] = port_func(sim1[samp2][0], max_weights)
    satis_samp[i, :2] = port_func(sim1[samp2][0], sat_wts)
    naive_samp[i, :2] = port_func(sim1[samp2][0], nav_wts)

eff_sharp_samp[:, 2] = eff_sharp_samp[:, 0] / eff_sharp_samp[:, 1]
eff_max_samp[:, 2] = eff_max_samp[:, 0] / eff_max_samp[:, 1]
satis_samp[:, 2] = satis_samp[:, 0] / satis_samp[:, 1]
naive_samp[:, 2] = naive_samp[:, 0] / naive_samp[:, 1]

end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
# Duration: 0:07:19.733893

# Create sample list and graph
samp_list = [eff_sharp_samp, eff_max_samp, satis_samp, naive_samp]
port_means_samp = []
for df in samp_list:
    port_means_samp.append(np.mean(df[:, 0]) * 1200)

# Sample graph
pf_graf(port_names, port_means_samp, 1, 0.5, 'Returns (%)', 'Random sample simulation return')

# Original graph
pf_graf(port_names, port_means1, 1, 0.5, 'Returns (%)', 'Original simulation return')

# Calculate comparison values for sample simulation
comp_values_samp = []
for i in range(4):
    for j in range(i+1, 4):
        comps_samp = np.mean(samp_list[i][:, 0] > samp_list[j][:, 0])
        comp_values_samp.append(comps_samp)

# Sample graph
pf_graf(comp_names[:-1], comp_values_samp[:-1], 2, 0.025, 'Frequency (%)', 'Random sample simulation frequency of')

# Original graph
pf_graf(comp_names[:-1], comp_values1[:-1], 2, 0.025, 'Frequency (%)', 'Original simulation frequency of')
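The charts in this post scale monthly figures to annual percentages: mean return × 12 × 100 (the `* 1200` scattered through the code) and risk × √12 × 100. A minimal sketch of that conversion, with made-up sample values for illustration:

```python
from math import sqrt

def annualize(monthly_ret, monthly_risk):
    # Return * 12 * 100 (i.e. * 1200) and risk * sqrt(12) * 100,
    # as used when plotting the simulated portfolios.
    return monthly_ret * 1200, monthly_risk * sqrt(12) * 100

# Hypothetical monthly figures: 0.5% return, 4% standard deviation.
ret_pct, risk_pct = annualize(0.005, 0.04)
print(round(ret_pct, 1))   # 6.0
print(round(risk_pct, 1))  # 13.9
```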

1. Whew, what a mouthful!

2. Ensuring our scenario draws don’t match is almost overkill, since the likelihood of drawing two of the same samples in a row is about one in a million.
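A quick back-of-envelope check on that footnote, assuming 1,000 equally likely simulated scenarios: the chance that two independent draws both land on one particular scenario is (1/1000)², i.e. about one in a million, while the chance that two draws merely match each other is 1/1000.

```python
# Back-of-envelope check, assuming 1,000 equally likely scenario draws.
n = 1000
p_specific_pair = (1 / n) ** 2  # both draws hit one particular scenario
p_any_match = 1 / n             # the two draws simply match each other
print(p_specific_pair)  # about one in a million
print(p_any_match)
```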


The post Sequential satisficing first appeared on R-bloggers.

### A Frosty Deal?

Thu, 09/17/2020 - 20:00

[This article was first published on R | Quantum Jitter, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Reading news articles on the will-they-won’t-they post-Brexit trade negotiations with the EU sees days of optimism jarred by days of gloom. Do negative news articles, when one wants a positive outcome, leave a deeper impression?

I wondered if I could get a more objective view from quantitative analysis of textual data. To do this, I’m going to look at hundreds of articles published in the Guardian newspaper over the course of the year to see how trade-talk sentiment has changed week-to-week.

library(tidyverse)
library(rebus)
library(wesanderson)
library(kableExtra)
library(lubridate)
library(GuardianR)
library(quanteda)
library(scales)
library(tictoc)
library(patchwork)
library(text2vec)
library(topicmodels)
theme_set(theme_bw())
cols <- wes_palette(name = "Chevalier1")

The Withdrawal Agreement between the UK and the European Union was signed on the 24th of January 2020. I’ll import Brexit-related newspaper articles from that date.

The Guardian newspaper asks for requests to span no more than 1 month at a time. So I’ll first create a set of monthly date ranges.

dates_df <- tibble(start_date = seq(ymd("2020-01-24"), today(), by = "1 month")) %>%
  mutate(end_date = start_date + months(1) - 1)

dates_df %>%
  kable()

start_date  end_date
2020-01-24  2020-02-23
2020-02-24  2020-03-23
2020-03-24  2020-04-23
2020-04-24  2020-05-23
2020-05-24  2020-06-23
2020-06-24  2020-07-23
2020-07-24  2020-08-23
2020-08-24  2020-09-23
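For readers outside R, the same monthly windows can be sketched in plain Python — a hypothetical mirror of the `seq`/`months(1) - 1` logic above, where `month_ranges` and `add_months` are my own helper names:

```python
from datetime import date, timedelta

def add_months(d, n):
    # Step forward n calendar months, keeping the day of month
    # (safe here, since the 24th exists in every month).
    y, m = divmod(d.month - 1 + n, 12)
    return date(d.year + y, m + 1, d.day)

def month_ranges(start, end):
    # One row per month: each range runs from a start date to one day
    # before the next month's start, mirroring start_date + months(1) - 1.
    ranges = []
    s = start
    while s <= end:
        ranges.append((s, add_months(s, 1) - timedelta(days=1)))
        s = add_months(s, 1)
    return ranges

# Post date used as a fixed "today" so the output matches the table above.
rngs = month_ranges(date(2020, 1, 24), date(2020, 9, 17))
print(len(rngs))  # 8
print(rngs[0])    # (datetime.date(2020, 1, 24), datetime.date(2020, 2, 23))
```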

I’ll import the newspaper articles in monthly chunks. Note, access to the Guardian’s API requires a key which may be requested here.

tic()

article_df <- dates_df %>%
  pmap_dfr(., function(start_date, end_date) {
    Sys.sleep(1)
    get_guardian(
      "brexit",
      from.date = start_date,
      to.date = end_date,
      api.key = key
    )
  })

toc()

The data need a little cleaning, for example, to remove multi-topic articles, html tags and non-breaking spaces.

trade_df <- article_df %>%
  filter(!str_detect(id, "/live/"),
         sectionId %in% c("world", "politics", "business")) %>%
  mutate(
    body = str_remove_all(body, "<.*?>") %>% str_to_lower(),
    body = str_remove_all(body, "[^a-z0-9 .-]"),
    body = str_remove_all(body, "nbsp")
  )
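The same three cleaning steps can be sketched with Python's `re` module (`clean_body` is a hypothetical helper, not part of the post's pipeline):

```python
import re

def clean_body(text):
    # Same steps as the R pipeline: strip HTML tags, lowercase,
    # drop everything outside [a-z0-9 .-], then remove stray "nbsp".
    text = re.sub(r"<.*?>", "", text).lower()
    text = re.sub(r"[^a-z0-9 .\-]", "", text)
    return text.replace("nbsp", "")

print(clean_body("<p>Barnier&nbsp;met Frost!</p>"))  # barniermet frost
```

Note that, exactly as in the R version, removing the entity characters can glue words together ("barniermet") — acceptable noise for bag-of-words sentiment scoring.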

A corpus then gives me a collection of texts whereby each document is a newspaper article.

Although I’ve only imported articles mentioning Brexit since the Withdrawal Agreement was signed, some of these articles will not be related to trade negotiations with the EU. For example, there are on-going negotiations with many countries around the world. So, I’m going to use word embeddings to help narrow the focus to the specific context of the UK-EU trade deal.

The chief negotiator for the EU is Michel Barnier, so I’ll quantitatively identify words in close proximity to “Barnier” in the context of these Brexit news articles.

window <- 5

trade_fcm <- trade_corp %>%
  fcm(context = "window", window = window,
      count = "weighted", weights = window:1)

glove <- GlobalVectors$new(rank = 60, x_max = 10)

set.seed(42)
wv_main <- glove$fit_transform(trade_fcm, n_iter = 10)
## INFO [10:06:33.114] epoch 1, loss 0.3817
## INFO [10:06:34.959] epoch 2, loss 0.2510
## INFO [10:06:36.759] epoch 3, loss 0.2225
## INFO [10:06:38.577] epoch 4, loss 0.2021
## INFO [10:06:40.438] epoch 5, loss 0.1847
## INFO [10:06:42.303] epoch 6, loss 0.1710
## INFO [10:06:44.124] epoch 7, loss 0.1605
## INFO [10:06:45.936] epoch 8, loss 0.1524
## INFO [10:06:47.754] epoch 9, loss 0.1457
## INFO [10:06:49.594] epoch 10, loss 0.1403
wv_context <- glove$components
word_vectors <- wv_main + t(wv_context)

search_coord <- word_vectors["barnier", , drop = FALSE]

word_vectors %>%
  sim2(search_coord, method = "cosine") %>%
  as_tibble(rownames = NA) %>%
  rownames_to_column("term") %>%
  rename(similarity = 2) %>%
  arrange(desc(similarity)) %>%
  slice(1:10) %>%
  kable()

term         similarity
barnier      1.0000000
negotiator   0.7966461
michel       0.7587372
frost        0.7093119
eus          0.6728152
chief        0.6365480
brussels     0.5856139
negotiators  0.5598537
team         0.5488111
accused      0.5301669

Word embedding is a learned modelling technique placing words into a multi-dimensional vector space such that contextually-similar words may be found close by. Not surprisingly, the closest word contextually is “Michel”. And as he is the chief negotiator for the EU, we find “eu’s”, “chief”, and “negotiator” also among the most contextually-similar words. The word embeddings algorithm, through word co-occurrence, has also identified the name of Michel Barnier’s UK counterpart, David Frost. So filtering articles for “Barnier”, “Frost” and “UK-EU” should help narrow the focus.

context_df <- trade_df %>%
  filter(str_detect(body, "barnier|frost|uk-eu"))

context_corp <- context_df %>%
  corpus(docid_field = "shortUrl", text_field = "body")

I can then use quanteda’s kwic function to review the key phrases in context to ensure I’m homing in on the texts I want. Short URLs are included below so I can click on any to read the actual article as presented by The Guardian.
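The similarity ranking above is plain cosine similarity between word vectors. A toy sketch of the measure — the three vectors here are invented stand-ins, not real GloVe output:

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two vectors, the same measure sim2()
    # applies between every word vector and the "barnier" vector above.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Toy three-dimensional stand-ins -- not real embeddings.
barnier = [0.9, 0.1, 0.3]
michel = [0.8, 0.2, 0.4]
banana = [-0.1, 0.9, -0.2]

print(round(cosine(barnier, michel), 3))  # high: contextually similar
print(round(cosine(barnier, banana), 3))  # near zero or negative: unrelated
```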
set.seed(123)

context_corp %>%
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE,
    remove_numbers = TRUE
  ) %>%
  kwic(pattern = phrase(c("trade negotiation", "trade deal", "trade talks")),
       valuetype = "regex", window = 7) %>%
  as_tibble() %>%
  left_join(article_df, by = c("docname" = "shortUrl")) %>%
  slice_sample(n = 10) %>%
  select(docname, pre, keyword, post, headline) %>%
  kable()

docname | pre | keyword | post | headline
https://gu.com/p/ee3qc | ecj unless we have such a thin | trade deal | that its not worth the paper its | Brexit: Boris Johnson faces Eurotunnel test
https://gu.com/p/end82 | london a separate process to the troubled | trade talks | that got under way in london on | Irish MEP in line for EU finance role vacated due to lockdown scandal
https://gu.com/p/ezjdz | said the downsides with the eu free | trade deal | the us free trade deal and our | Brexit bill hugely damaging to UK’s reputation, says ex-ambassador
https://gu.com/p/d7d9t | people we have who have been negotiating | trade deals | forever she said while people criticise the | Brexit trade talks: EU to back Spain over Gibraltar claims
https://gu.com/p/eyzhq | played down the prospect of reaching a | trade deal | with the eu in time for december | No 10 blames EU and plays down prospects of Brexit trade deal
https://gu.com/p/ez2v6 | it will make it harder to strike | trade deals | going forward he told channel news after | Brexit: UK negotiators ‘believe brinkmanship will reboot trade talks’
https://gu.com/p/d7n4t | alignment with eu rules in any brexit | trade deal | while brussels threatened to put tariffs on | Pound falls as Boris Johnson takes tough line on EU trade deal
https://gu.com/p/dnvbj | personal rapport when communicating remotely related post-brexit | trade talks | with eu on course to fail johnson | Fears Brexit talks could collapse in June but UK still optimistic
https://gu.com/p/d94j9 | this situation and we work on a | trade deal | with them of course the united kingdom | Ursula von der Leyen mocks Boris Johnson’s stance on EU trade deal
https://gu.com/p/ezkxc | it threatens to damage british prospects of | trade deals | with the us and eu it puts | Tuesday briefing: Rancour as law-breaking bill goes forward

Quanteda provides a sentiment dictionary which, in addition to identifying positive and negative words, also finds negated negatives and negated positives such as, for example, “not effective”. For each week’s worth of articles, I’ll calculate the proportion of positive sentiments.

tic()

sent_df <- context_corp %>%
  dfm(dictionary = data_dictionary_LSD2015) %>%
  as_tibble() %>%
  left_join(context_df, by = c("doc_id" = "shortUrl")) %>%
  mutate(
    date = ceiling_date(as_date(webPublicationDate), "week"),
    pct_pos = (positive + neg_negative) /
      (positive + neg_negative + negative + neg_positive)
  )

sent_df %>%
  select(doc_id, starts_with("pos"), starts_with("neg")) %>%
  slice(1:10) %>%
  kable()

doc_id                  positive  negative  neg_positive  neg_negative
https://gu.com/p/d6qhb  40        22        0             0
https://gu.com/p/d9e9j  27        15        0             0
https://gu.com/p/d6kzd  51        27        0             1
https://gu.com/p/d6bt2  37        7         0             0
https://gu.com/p/d9vjq  13        23        0             0
https://gu.com/p/d7n8b  57        34        1             0
https://gu.com/p/d79cn  56        48        3             1
https://gu.com/p/d6t3c  28        26        0             0
https://gu.com/p/d9xtf  33        13        1             0
https://gu.com/p/d696t  15        21        1             0

summary_df <- sent_df %>%
  group_by(date) %>%
  summarise(pct_pos = mean(pct_pos), n = n())

toc()

## 0.708 sec elapsed

Plotting the changing proportion of positive sentiment over time did surprise me a little. The outcome was more balanced than I expected, which perhaps confirms the deeper impression left on me by negative articles.

The upper violin plot shows the average weight of the sentiment across multiple articles for each week. Individually the articles range from 20% to 80% positive, with discernible periods of relatively negative and relatively positive sentiment. The lower plot shows the volume of articles. As we draw closer to the crunch point, the volume appears to be picking up.
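The `pct_pos` calculation treats negated negatives as positive and negated positives as negative. In sketch form, using the first row of the counts table above as a worked example:

```python
def pct_pos(positive, negative, neg_positive, neg_negative):
    # Negated negatives ("not bad") count towards positive; negated
    # positives ("not effective") towards negative, as in the mutate() above.
    pos = positive + neg_negative
    neg = negative + neg_positive
    return pos / (pos + neg)

# First article in the table: 40 positive, 22 negative, no negations.
print(round(pct_pos(40, 22, 0, 0), 3))  # 0.645
```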
p1 <- sent_df %>%
  ggplot(aes(date, pct_pos)) +
  geom_violin(aes(group = date), alpha = 0.5, fill = cols[1]) +
  geom_line(data = summary_df, aes(date, pct_pos),
            colour = cols[1], linetype = "dashed") +
  geom_hline(yintercept = 0.5, linetype = "dotted", colour = cols[4]) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = NULL, y = "Positive Sentiment")

p2 <- summary_df %>%
  ggplot(aes(date, n)) +
  geom_line(colour = cols[1]) +
  labs(x = "Weeks", y = "Article Count", caption = "Source: Guardian Newspaper")

p1 / p2 + plot_layout(heights = c(2, 1))

Some writers exhibit more sentiment variation than others.

byline_df <- sent_df %>%
  mutate(byline = word(byline, 1, 2) %>% str_remove_all(PUNCT)) %>%
  group_by(byline, date) %>%
  summarise(pct_pos = mean(pct_pos), n = n())

top_3 <- byline_df %>%
  count(byline, sort = TRUE) %>%
  ungroup() %>%
  filter(!is.na(byline)) %>%
  slice(c(3, 2)) %>%
  pull(byline)

byline_df %>%
  filter(byline %in% top_3) %>%
  ggplot(aes(date, pct_pos, colour = byline)) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dotted", colour = cols[2]) +
  scale_y_continuous(labels = percent_format(), limits = c(0.2, 0.8)) +
  scale_colour_manual(values = cols[c(1, 4)]) +
  labs(title = "Changing Sentiment Towards a UK-EU Trade Deal",
       subtitle = "Week-to-week Since the Withdrawal Agreement",
       x = "Weeks", y = "Positive Sentiment", colour = "Byline",
       caption = "Source: Guardian Newspaper")

R Toolbox

Summarising below the packages and functions used in this post enables me to separately create a toolbox visualisation summarising the usage of packages and functions across all posts.
Package — Function
base — library[12]; c[8]; function[2]; mean[2]; set.seed[2]; conflicts[1]; cumsum[1]; is.na[1]; months[1]; search[1]; seq[1]; sum[1]; Sys.sleep[1]
dplyr — filter[8]; mutate[8]; as_tibble[4]; group_by[3]; if_else[3]; n[3]; select[3]; slice[3]; summarise[3]; tibble[3]; arrange[2]; desc[2]; left_join[2]; starts_with[2]; count[1]; pull[1]; rename[1]; slice_sample[1]; ungroup[1]
ggplot2 — aes[5]; geom_line[3]; ggplot[3]; labs[3]; geom_hline[2]; scale_y_continuous[2]; geom_violin[1]; scale_colour_manual[1]; theme_bw[1]; theme_set[1]
GuardianR — get_guardian[1]
kableExtra — kable[5]
lubridate — date[3]; as_date[1]; ceiling_date[1]; today[1]; ymd[1]
patchwork — plot_layout[1]
purrr — map[1]; map2_dfr[1]; pmap_dfr[1]; possibly[1]; set_names[1]
quanteda — corpus[2]; data_dictionary_LSD2015[1]; dfm[1]; fcm[1]; kwic[1]; phrase[1]; t[1]; tokens[1]
readr — read_lines[1]
rebus — literal[4]; lookahead[3]; whole_word[2]; ALPHA[1]; lookbehind[1]; one_or_more[1]; or[1]; PUNCT[1]
scales — percent_format[2]
stringr — str_detect[5]; str_remove_all[5]; str_c[2]; str_remove[2]; str_count[1]; str_to_lower[1]; word[1]
text2vec — sim2[1]
tibble — enframe[1]; rownames_to_column[1]
tictoc — tic[2]; toc[2]
tidyr — unnest[1]
wesanderson — wes_palette[1]

To leave a comment for the author, please follow the link and comment on their blog: R | Quantum Jitter. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The post A Frosty Deal? first appeared on R-bloggers.

### R – Risk and Compliance Survey: we need your help!

Thu, 09/17/2020 - 11:32

[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

We would like to evaluate how business restrictions impact the commercial usage of R. To do this, we would love to hear your thoughts. We have created a short survey, which should take one minute of your time. Complete the survey. Thank you!

The post R – Risk and Compliance Survey: we need your help! appeared first on Mango Solutions.

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.

The post R – Risk and Compliance Survey: we need your help! first appeared on R-bloggers.

### VirtuEARL: Speaker interview

Thu, 09/17/2020 - 11:07

[This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)
We caught up with one of our VirtuEARL speakers and EARL regulars, Jeremy Horne, on what we can expect from his VirtuEARL talk and how he thinks the R community has adapted since lockdown.

You’ve presented a few times now at EARL, what makes you come back each time?

I’ve said it for several years now – EARL has become the “must go” event of the R calendar. Not only have I presented a few times, but I’ve been fortunate enough to attend every EARL since it started in 2014, and on each occasion I’ve taken something new away to try with my clients. Indeed, the quality of the talks and use cases grows better and better each year, so I’m looking forward to some more excellent talks this year too.

How has the R community adapted since lockdown?

Like every community, it was tough in the first few months. Events got cancelled (including my very own BrightonR user group) and interaction with other R users was limited to channels like Slack. But with the realisation that we’re a long way off face-to-face meetups again, virtual user groups have been set up and done quite well – there were some fantastic presentations at LondonR last month and we’ve finally got BrightonR off the ground again too!

Why are events like EARL and R meetups important, do you think?

The networking – being able to talk to other like-minded R users and staying on the pulse of what they’re up to. This is the bit I miss terribly at the moment, as it always gets me thinking about new things that I can do within R.

What can we expect to hear about from your EARL talk?

I’m hoping you’ll take a few tips away on how you can improve your email marketing using R, and specifically text analytics – and equally, I would love to hear how you get on afterwards if you implement any of the techniques.

If you’d like to hear Jeremy’s talk and some other inspirational R-based talks, then join us at VirtuEARL, the online Enterprise Applications of the R Language Conference taking place this October.
While you’re here – please complete this short poll to help us evaluate how business restrictions impact the commercial usage of R.

The post VirtuEARL: Speaker interview appeared first on Mango Solutions.

To leave a comment for the author, please follow the link and comment on their blog: RBlog – Mango Solutions.

The post VirtuEARL: Speaker interview first appeared on R-bloggers.

### D&D’s Data Science Platform (DSP) – making healthcare analytics easier

Thu, 09/17/2020 - 06:44

[This article was first published on R Blogs – Hutsons-hacks, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

My current company Draper and Dash have been working on the Healthcare DSP for a number of months and we are now ready to launch. Find out what the HTN had to say: https://www.thehtn.co.uk/2020/09/15/new-health-data-science-platform-launches/

A word from D&D’s Chief Technology Officer:

Data Science Resilience & Collaboration in times of unprecedented disruption https://t.co/pPYTw55w4g – all powered by R and Python under the hood. It has been great to work on this project.

— Gary Hutson (@StatsGary) September 17, 2020

What it does
The DSP uses web technologies alongside healthcare models built in R and Python to combine analytics with predictive machine learning, forecasting and Natural Language Processing, among other techniques.

Find out more

To learn more about this, please register for our webinar. It has been a pleasure for my team to work so closely alongside our Chief Technology Officer, and we have really put together an exciting product. As stated, register for the webinar by clicking the above image to find out more.

To leave a comment for the author, please follow the link and comment on their blog: R Blogs – Hutsons-hacks.

The post D&D’s Data Science Platform (DSP) – making healthcare analytics easier first appeared on R-bloggers.

### Gold-Mining Week 2 (2020)

Wed, 09/16/2020 - 23:37

[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Week 2 Gold Mining and Fantasy Football Projection Roundup now available.

The post Gold-Mining Week 2 (2020) appeared first on Fantasy Football Analytics.
To leave a comment for the author, please follow the link and comment on their blog: R – Fantasy Football Analytics.

The post Gold-Mining Week 2 (2020) first appeared on R-bloggers.

### High School Swimming State-Off Tournament Championship California (1) vs. Texas (2)

Wed, 09/16/2020 - 20:00

[This article was first published on Swimming + Data Science, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

This is it folks, the big one, the finals of the State-Off Tournament, where we’ll see California (1) take on Texas (2) for high school swimming superiority. We’ll also use a t-test to determine whether or not swimmers actually swim faster in finals sessions, when the pressure is on. Oh and it’s so on!

library(SwimmeR)
library(dplyr)
library(stringr)
library(flextable)
library(ggplot2)

My flextable styling function is still working great since I made it two weeks ago. In the finals you’ve just gotta stick with what works.
flextable_style <- function(x) {
  x %>%
    flextable() %>%
    bold(part = "header") %>% # bold header
    bg(bg = "#D3D3D3", part = "header") %>% # puts gray background behind the header row
    autofit()
}

Getting Results

The suspense is killing me here! Let’s get this thing underway by grabbing results from GitHub.

California_Link <- "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/CA_States_2019.csv"
California_Results <- read.csv(url(California_Link)) %>%
  mutate(State = "CA")

Texas_Link <- "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/TX_States_2020.csv"
Texas_Results <- read.csv(url(Texas_Link)) %>%
  mutate(State = "TX", Grade = as.character(Grade))

Results <- California_Results %>%
  bind_rows(Texas_Results) %>%
  mutate(Gender = case_when(
    str_detect(Event, "Girls") == TRUE ~ "Girls",
    str_detect(Event, "Boys") == TRUE ~ "Boys"
  ))

Scoring the Meet

Results_Final <- results_score(
  results = Results,
  events = unique(Results$Event),
  meet_type = "timed_finals",
  lanes = 8,
  scoring_heats = 2,
  point_values = c(20, 17, 16, 15, 14, 13, 12, 11, 9, 7, 6, 5, 4, 3, 2, 1)
)

Scores <- Results_Final %>%
  group_by(State, Gender) %>%
  summarise(Score = sum(Points))

Scores %>%
  arrange(Gender, desc(Score)) %>%
  ungroup() %>%
  flextable_style()

State  Gender  Score
CA     Boys    1516.5
TX     Boys    808.5
CA     Girls   1454.0
TX     Girls   871.0

Scores %>%
  group_by(State) %>%
  summarise(Score = sum(Score)) %>%
  arrange(desc(Score)) %>%
  ungroup() %>%
  flextable_style()


State  Score
CA     2970.5
TX     1679.5
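The results_score() call above assigns the point_values vector by finishing place — 20 for first, 17 for second, down to 1 for sixteenth. A sketch of that mapping (`points_for_place` is a hypothetical helper name, not part of SwimmeR):

```python
# Point vector passed to results_score() above: 20 points for 1st place
# down to 1 point for 16th.
point_values = [20, 17, 16, 15, 14, 13, 12, 11, 9, 7, 6, 5, 4, 3, 2, 1]

def points_for_place(place):
    # Places are 1-indexed; finishing outside the top 16 scores nothing.
    return point_values[place - 1] if 1 <= place <= len(point_values) else 0

print(points_for_place(1))   # 20
print(points_for_place(9))   # 9
print(points_for_place(17))  # 0
```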

In an outcome that many of you probably predicted ahead of time, California (1) lives up to their number one ranking and secures the State-Off Tournament crown over Texas (2). Just because it was expected, though, doesn’t mean it’s not impressive. This meet was super fast. Let’s look at some individual performances and name our Swimmers of the Meet.

Swimmers of the Meet

Two swimmers come into this meet on Swimmer of the Meet streaks. Lillie Nordmann has won two in a row for the Texas girls, and Zoie Hartman has won two in a row for the California girls. Swimmer of the Meet criteria is still the same as it’s been for the entire State-Off. We’ll look for athletes who have won two events, thereby scoring the maximum possible forty points. In the event of a tie, where multiple athletes win two events, we’ll use All-American standards as a tiebreaker.
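Those criteria boil down to a two-key sort: average place first, then the All-American differential as the tiebreaker (more negative means further under the cut, i.e. faster). A sketch using a few values from the girls' results in this post:

```python
# (name, average place, average All-American differential); values taken
# from the girls' results in this post.
swimmers = [
    ("Tuggle, Claire", 2.0, -0.031),
    ("Hartman, Zoie", 1.0, -0.047),
    ("Lillie Nordmann", 1.0, -0.046),
]

# Sort by average place, then break ties with the AA differential
# (more negative = further under the cut = faster).
ranked = sorted(swimmers, key=lambda s: (s[1], s[2]))
print(ranked[0][0])  # Hartman, Zoie
```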

Cuts_Link <- "https://raw.githubusercontent.com/gpilgrim2670/Pilgrim_Data/master/State_Cuts.csv"
Cuts <- read.csv(url(Cuts_Link))

Cuts <- Cuts %>% # clean up Cuts
  filter(Stroke %!in% c("MR", "FR", "11 Dives")) %>% # %!in% is now included in SwimmeR
  rename(Gender = Sex) %>%
  mutate(
    Event = case_when( # match events
      (Distance == 200 & Stroke == 'Free') ~ "200 Yard Freestyle",
      (Distance == 200 & Stroke == 'IM') ~ "200 Yard IM",
      (Distance == 50 & Stroke == 'Free') ~ "50 Yard Freestyle",
      (Distance == 100 & Stroke == 'Fly') ~ "100 Yard Butterfly",
      (Distance == 100 & Stroke == 'Free') ~ "100 Yard Freestyle",
      (Distance == 500 & Stroke == 'Free') ~ "500 Yard Freestyle",
      (Distance == 100 & Stroke == 'Back') ~ "100 Yard Backstroke",
      (Distance == 100 & Stroke == 'Breast') ~ "100 Yard Breaststroke",
      TRUE ~ paste(Distance, "Yard", Stroke, sep = " ")
    ),
    Event = case_when(
      Gender == "M" ~ paste("Boys", Event, sep = " "),
      Gender == "F" ~ paste("Girls", Event, sep = " ")
    )
  )

Ind_Swimming_Results <- Results_Final %>%
  filter(str_detect(Event, "Diving|Relay") == FALSE) %>%
  # join Ind_Swimming_Results and Cuts
  left_join(Cuts %>%
              filter((Gender == "M" & Year == 2020) |
                       (Gender == "F" & Year == 2019)) %>%
              select(AAC_Cut, AA_Cut, Event),
            by = 'Event')

Swimmer_Of_Meet <- Ind_Swimming_Results %>%
  mutate(
    AA_Diff = (Finals_Time_sec - sec_format(AA_Cut)) / sec_format(AA_Cut),
    Name = str_to_title(Name)
  ) %>%
  group_by(Name) %>%
  filter(n() == 2) %>% # get swimmers that competed in two events
  summarise(
    Avg_Place = sum(Place) / 2,
    AA_Diff_Avg = round(mean(AA_Diff, na.rm = TRUE), 3),
    Gender = unique(Gender),
    State = unique(State)
  ) %>%
  arrange(Avg_Place, AA_Diff_Avg) %>%
  group_split(Gender) # split out a dataframe for boys (1) and girls (2)

Boys

```r
Swimmer_Of_Meet[[1]] %>%
  slice_head(n = 5) %>%
  select(-Gender) %>%
  ungroup() %>%
  flextable_style()
```

| Name           | Avg_Place | AA_Diff_Avg | State |
|----------------|-----------|-------------|-------|
| Hu, Ethan      | 1.0       | -0.052      | CA    |
| Dillard, Ben   | 1.5       | -0.044      | CA    |
| Saunders, Max  | 1.5       | -0.030      | CA    |
| Mefford, Colby | 1.5       | -0.027      | CA    |
| Lee, Connor    | 2.0       | -0.029      | CA    |

Ethan Hu leads a California sweep en route to winning his second Swimmer of the Meet title, following the one he earned against Georgia (8) in the first round. Ethan of course now swims for Stanford, and those of you who have been following along know that Stanford is something of a trigger for me. The great school and swim program aside, I’m genuinely confused as to why anyone wants to swim outdoors in the sneakily cold Bay Area. I’ve mentioned before that I used to live, and swim, in the Bay. I hated the cold and complained about it so much that my non-swimming, long-suffering wife was inspired to remix this Peanuts cartoon about it.

Anyway, here are Ethan’s results:

```r
Results_Final %>%
  filter(Name == "Hu, Ethan") %>%
  select(Place, Name, School, Finals_Time, Event) %>%
  arrange(desc(Event)) %>%
  ungroup() %>%
  flextable_style()
```

| Place | Name      | School     | Finals_Time | Event                   |
|-------|-----------|------------|-------------|-------------------------|
| 1     | Hu, Ethan | Harker_CCS | 1:45.44     | Boys 200 Yard IM        |
| 1     | Hu, Ethan | Harker_CCS | 45.72       | Boys 100 Yard Butterfly |

Girls

```r
Swimmer_Of_Meet[[2]] %>%
  slice_head(n = 5) %>%
  select(-Gender) %>%
  ungroup() %>%
  flextable() %>%
  bold(part = "header") %>%
  bg(bg = "#D3D3D3", part = "header") %>%
  autofit()
```

| Name            | Avg_Place | AA_Diff_Avg | State |
|-----------------|-----------|-------------|-------|
| Hartman, Zoie   | 1         | -0.047      | CA    |
| Lillie Nordmann | 1         | -0.046      | TX    |
| Kit Kat Zenick  | 1         | -0.029      | TX    |
| Tuggle, Claire  | 2         | -0.031      | CA    |
| Ristic, Ella    | 2         | -0.023      | CA    |

Zoie Hartman just beats out Lillie Nordmann to win her third Swimmer of the Meet crown – the most by any athlete in the whole State-Off Tournament! A hearty congratulations to Zoie on what I’m sure she considers the crowning achievement of her swimming career thus far. Zoie is from Danville CA, a town quite near the Bay Area, but she swims for the University of Georgia, which, unlike Stanford and the Bay in general, is nice and warm while also having a lovely indoor pool. Zoie just might be my favorite athlete in this entire meet. She’s truly the embodiment of everything the State-Off stands for – being super fast and also grousing about how Northern California is too cold for outdoor swimming. Here are her results:

```r
Results_Final %>%
  filter(Name == "Hartman, Zoie") %>%
  select(Place, Name, School, Finals_Time, Event) %>%
  arrange(desc(Event)) %>%
  ungroup() %>%
  flextable_style()
```

| Place | Name          | School          | Finals_Time | Event                        |
|-------|---------------|-----------------|-------------|------------------------------|
| 1     | Hartman, Zoie | Monte Vista_NCS | 1:55.29     | Girls 200 Yard IM            |
| 1     | Hartman, Zoie | Monte Vista_NCS | 59.92       | Girls 100 Yard Breaststroke  |

Prelims-Finals Drops

My memories of swimming in finals at prelims-finals meets are mixed. Sometimes I remember feeling really jazzed up and energized by the finals atmosphere. Other times my memories are of being tired, worn out by the slog of the previous sessions. Memory is a tricky thing though – can’t always trust it. Instead of looking inward, into the murk of my mind, let’s instead look at the data and try to answer the question – do swimmers swim faster in finals than they do in prelims? We’ll only consider individual events because it’s possible to change the composition of relay teams between prelims and finals and that’s not what we’re talking about here.

First we’ll want to calculate the difference between a swimmer’s finals time and prelims time, but let’s do it as a difference per 50 yards, so all events can be compared on an even basis.

```r
Drops <- Results %>%
  filter(
    is.na(Finals_Time) == FALSE,
    is.na(Prelims_Time) == FALSE,
    str_detect(Event, "Diving") == FALSE,
    Place <= 16 # only swimmers in places 1-16 actually swim in both prelims and finals
  ) %>%
  mutate(
    Distance = as.numeric(str_extract(Event, "\\d{1,}")), # distance in yards for each race
    Drop = sec_format(Finals_Time) - sec_format(Prelims_Time), # change in time between prelims and finals
    Per_Drop = Drop / (Distance / 50) # time dropped per 50 of the race
  )
```
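To make the per-50 normalization concrete, here is a small standalone sketch with made-up times. Note that `to_sec()` is a hypothetical stand-in for SwimmeR’s `sec_format()`, written here only so the example runs on its own:

```r
# Toy illustration of the per-50 normalization (not code from the post).
# to_sec() is a minimal stand-in for SwimmeR's sec_format(), handling
# "m:ss.hh" strings and plain seconds.
to_sec <- function(x) {
  parts <- as.numeric(strsplit(x, ":")[[1]])
  if (length(parts) == 2) parts[1] * 60 + parts[2] else parts
}

prelims <- to_sec("1:46.20") # a 200 yard race, prelims
finals  <- to_sec("1:45.44") # same swimmer, finals

drop <- finals - prelims      # negative means faster in finals
per_drop <- drop / (200 / 50) # change per 50 yards of the race
c(drop = drop, per_drop = per_drop)
```

Dividing by the number of 50s in the race is what lets a 200 and a 500, say, sit on the same scale.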

```r
Event_Drop <- Drops %>%
  filter(str_detect(Event, "Relay") == FALSE) %>%
  group_by(Event) %>%
  summarise(Drop_Avg = mean(Per_Drop, na.rm = TRUE)) %>% # average time dropped per 50, in seconds
  mutate(Event = factor(Event, levels = unique(Results$Event))) %>% # set event order
  arrange(Event) # order by event

Event_Drop %>%
  flextable_style() %>%
  set_formatter_type(fmt_double = "%.02f", # format doubles to have two decimal places
                     fmt_integer = "%.0f") %>% # format integers to have no decimal places
  autofit()
```

| Event                        | Drop_Avg |
|------------------------------|----------|
| Girls 200 Yard Freestyle     | -0.11    |
| Boys 200 Yard Freestyle      | -0.08    |
| Girls 200 Yard IM            | -0.11    |
| Boys 200 Yard IM             | -0.06    |
| Girls 50 Yard Freestyle      | -0.05    |
| Boys 50 Yard Freestyle       | -0.06    |
| Girls 100 Yard Butterfly     | -0.05    |
| Boys 100 Yard Butterfly      | -0.11    |
| Girls 100 Yard Freestyle     | -0.06    |
| Boys 100 Yard Freestyle      | -0.04    |
| Girls 500 Yard Freestyle     | 0.01     |
| Boys 500 Yard Freestyle      | -0.06    |
| Girls 100 Yard Backstroke    | -0.07    |
| Boys 100 Yard Backstroke     | 0.04     |
| Girls 100 Yard Breaststroke  | -0.03    |
| Boys 100 Yard Breaststroke   | 0.01     |

So there are some numbers. In most events the number is negative, meaning the swimmers got faster on average in finals. A few events have positive numbers, meaning swimmers got slower in finals. More clarity is required. We can use a statistical test to see whether, on average, swimmers swim faster (or slower) in finals compared to prelims. We’re going to use a t-test, which some of you may be familiar with, although there are other tests, like the Wilcoxon signed-rank test, that are arguably better suited to this purpose. To do so we need a properly phrased null hypothesis: “the average change in time per 50 yards between prelims and finals is 0 seconds”. We’ll pass 0 to mu inside t.test. It’s worth noting that for small samples it’s important to check that the samples are normally distributed, but for larger samples (n > 20-30 is a rule of thumb) it’s not. There are 1039 samples in our data set, so we don’t need to check for normality.

```r
t.test(Drops$Per_Drop, mu = 0)
```
```
## 
##  One Sample t-test
## 
## data:  Drops$Per_Drop
## t = -7.6983, df = 1038, p-value = 3.203e-14
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.07791070 -0.04626014
## sample estimates:
##   mean of x 
## -0.06208542
```

Looking at the t.test results, we see that our p-value is 3.20e-14, well below the customary threshold of 0.05, so we can reject the null hypothesis that there’s no difference between times per 50 in prelims and times per 50 in finals, and conclude that there is in fact a difference. Crucially though, the test does not tell us what that difference is. Examining the results further, we see a 95 percent confidence interval of -0.078 to -0.046, meaning we can be 95% sure the true value lies in that range. Since the entire confidence interval is negative, we can be at least 95% confident that swimmers do swim faster in finals than they do in prelims – at least at this meet. We can show this visually with a density plot, with our sample mean denoted by a blue line and our null hypothesis by a red one.

```r
Drops %>%
  ggplot(aes(x = Per_Drop)) +
  geom_density() +
  geom_vline( # mean value
    aes(xintercept = mean(Per_Drop)),
    color = "blue",
    linetype = "dotted",
    size = 1
  ) +
  geom_vline( # null hypothesis at 0
    aes(xintercept = 0),
    color = "red",
    linetype = "dotted",
    size = 1
  ) +
  theme_bw() +
  labs(x = "Seconds Dropped Per 50 Yards",
       y = "Density",
       title = "Change In Performance in Finals vs. Prelims")
```

From the plot, and from the t.test results above, it’s clear that although swimmers do drop time, it’s not much time on average – only about 0.06 seconds per 50. It might also be interesting to see how place impacts the amount of time dropped. Let’s plot that.
```r
Place_Drop <- Drops %>%
  group_by(Place, State) %>%
  summarise(Drop_Avg = mean(Per_Drop, na.rm = TRUE))

Place_Drop %>%
  ggplot() +
  geom_line(aes(x = Place, y = Drop_Avg, color = State)) +
  scale_x_continuous(breaks = seq(1, 16, 1)) +
  theme_bw()
```

There are a couple of things going on here. One is that the amount of time dropped tracks with place. Athletes who win tend to drop more time than those who come 16th. It could be that those athletes are just better, and “flip the switch” to deliver in the finals. It could also be that winning is lots of fun, and the chance to do so brings something more out of swimmers. Heck, it could be both.

We can also see that there’s a big discontinuity around 8th-9th place. Those of you familiar with swimming can probably figure out what’s going on, and can skip the rest of this paragraph, where I’m going to explain it in excruciating detail. Assuming an 8 lane pool, which is the case for both the California and Texas meets, swimmers who place between 1st and 8th in the prelims swim in the “A” final at finals. These swimmers can place no higher than 1st (of course) and, critically, no lower than 8th overall, as long as they complete the race without being disqualified. That means that if a swimmer is in 8th place mid race and doesn’t think they can catch the 7th place swimmer, there’s an incentive to just coast to the finish, because they’re guaranteed 8th regardless of their time. Similarly, swimmers who place between 9th and 16th in prelims swim in a “B” final at finals, where they can place no higher than 9th (regardless of time) and no lower than 16th. Getting 9th means winning the heat, and there’s a drive to do so. The further back one is in one’s heat, though, the harder it is to tell what’s going on and stay motivated to race those around you, and the less likely one is to win (the heat or the overall) – so there’s less drive to improve on a prelims time.

In Closing

Well, that’s it for the State-Off championship round.
California, the number one seed, has in fact prevailed. There will be one more post in the State-Off series next week, where I’ll throw all 8 teams into one giant battle royale of a meet. Be sure to come back to Swimming + Data Science for that!

```r
draw_bracket(
  teams = c(
    "California",
    "Texas",
    "Florida",
    "New York",
    "Pennsylvania",
    "Illinois",
    "Ohio",
    "Georgia"
  ),
  round_two = c("California", "Texas", "Florida", "Pennsylvania"),
  round_three = c("California", "Texas"),
  champion = "California",
  title = "Swimming + Data Science High School Swimming State-Off",
  text_size = 0.9
)
```

To leave a comment for the author, please follow the link and comment on their blog: Swimming + Data Science. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? Click here if you have a blog, or here if you don't. The post High School Swimming State-Off Tournament Championship California (1) vs. Texas (2) first appeared on R-bloggers.

### Introducing our new book, Tidy Modeling with R

Wed, 09/16/2020 - 20:00

[This article was first published on rstats | Julia Silge, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Today Max Kuhn and I are pleased to announce our new book, Tidy Modeling with R!
This book focuses on how to use tidymodels and has two main goals:

- Readers will learn how to use R and tidymodels packages to create robust and reliable models. This is a practical book, full of code examples with real datasets.
- The book encourages good methodology and statistical practice. The design of the tidymodels packages (from the software itself to documentation, training materials, and this book) is oriented toward the challenges of dealing with models in the real world and setting up everyday practitioners for success.

Tidy Modeling with R is currently a work in progress. The first eleven chapters are available as of today, and we encourage you to check them out! We will continue to add new chapters in the upcoming months, and we intend to publish a print version when the book is completed. The source code to create the book is on GitHub, and readers such as you are welcome to contribute if you wish. Guidelines for contributions are available here.

To leave a comment for the author, please follow the link and comment on their blog: rstats | Julia Silge. The post Introducing our new book, Tidy Modeling with R first appeared on R-bloggers.
### Learning Data Science with RStudio Cloud: A Student’s Perspective

Wed, 09/16/2020 - 20:00

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

On August 5, 2020, RStudio announced the general availability of RStudio Cloud, its cloud-based platform for doing, teaching, and learning data science using only a browser. We often recommend RStudio Cloud to teachers because it simplifies course setup, makes it easy to distribute and manage exercises, and helps the instructor track student progress. With COVID-19 driving more courses online this fall, RStudio has developed a number of resources for instructors to use with RStudio Cloud, including:

However, while we often hear from teachers about how RStudio Cloud has helped them, we don’t often get to hear what the experience is like from the student’s point of view. We recently had the opportunity to have an in-depth discussion with Lara Zaremba, a Master’s degree student in International Economics at Johann Wolfgang Goethe-Universität in Frankfurt, Germany. In the video below, Lara describes some of what she experienced collaborating with other students and being a mentor to others on the platform.

An earlier part of our conversation with Lara also highlighted the need for increased diversity within the R community. While Lara felt that RStudio Cloud empowered her to become successful as a student and mentor, she mentioned that she had not seen herself as a data scientist because of the stereotypes associated with “computer people”. One of the most powerful forces helping overcome those stereotypes today is R-Ladies Global (rladies.org), a world-wide organization dedicated to promoting gender diversity through meetups, code, teaching, and leadership.
We encourage all to support their work and to help everyone see themselves as full members of the R community. You can learn more about how you can participate and make our community more welcoming to all at https://rladies.org/about-us/help/.

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. The post Learning Data Science with RStudio Cloud: A Student’s Perspective first appeared on R-bloggers.

### Risk Scoring in Digital Contact Tracing Apps

Wed, 09/16/2020 - 18:00

[This article was first published on Theory meets practice..., and kindly contributed to R-bloggers]. (You can report issues about the content on this page here.)

Abstract: We attempt a mathematical description of the risk scoring algorithm used in the Google and Apple exposure notification framework (v1). This algorithm is used in many digital contact tracing apps for COVID-19, e.g., in Germany or Switzerland.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. The markdown+Rknitr source code of this blog is available under a GNU General Public License (GPL v3) license from GitHub.
Motivation

My phone runs a digital COVID-19 contact tracing app based on the Apple and Google exposure notification (GAEN) framework [Apple | Google | Wikipedia]. Since my future health might depend on it, I am interested in the risk classification algorithm of the framework: when does it warn me about a possible exposure risk? Since algorithmic transparency is a key component in gaining acceptance for digital contact tracing apps, this blog post tries to give a mathematical description of the risk scoring algorithm in the GAEN framework – as far as I have been able to understand it. The description is accompanied by a small case study implemented in R.

The GAEN framework is used by country-specific COVID-19 apps in several countries, e.g., Switzerland, Latvia, Denmark, Italy and Poland. In this post special focus is on the German Corona-Warn-App (CWA), which comes with a special commitment to transparency:

The Corona-Warn-App is an app that helps trace infection chains of SARS-CoV-2 (which can cause COVID-19) in Germany. The app is based on technologies with a decentralized approach and notifies users if they have been exposed to SARS-CoV-2. Transparency is key to both protect the app’s end-users and to encourage adoption.

This means that all code and documentation of the CWA is available under an open-source license through GitHub. A good overview of the technical foundation of the app is provided here. However, important parts of the CWA risk scoring are taken from the GAEN framework, which is less transparent. In particular I was interested in the API v1 form of the risk scoring, because this forms the basis of the CWA and also provides epidemiological flexibility through the use of the transmission risk level parameter.

The Risk Scoring

A phone using the GAEN framework scans its immediate surroundings at regular time intervals using the BLE protocol. In doing so, the phone collects anonymous identifier keys of the devices running BLE near you.
If a GAEN user is tested positive, they can voluntarily decide to upload their anonymous identifier keys to a server. At regular intervals the app on your phone downloads these keys and investigates whether there is a match between the keys you encountered and those available on the server. If there is a match, an algorithm decides, based on attributes of the contact, e.g. duration, distance and infectiousness of the potential infector, whether the user should be shown a warning about an increased risk of infection. It is this classification algorithm which I am interested in.

Introducing mathematical notation, let $$E$$ be the set of encounters recorded by your phone during the last 14 days. We assume each entry $$e\in E$$ consists of attributes containing the identifier key of the phone you met, $$\text{key}(e)$$ (aka the temporary exposure key (TEK)), the calendar day the contact happened, $$\text{date}(e)$$, the duration of the contact, $$\text{duration}(e)$$, and the average attenuation of the signal during the contact, $$\text{attenuation}(e)$$. The latter serves as a proxy for the distance between the two phones.

Furthermore, the server contains the set of uploaded keys by CWA users who tested positive for COVID-19 and who gave consent to upload their keys. Uploaded keys of test positives are also called diagnosis keys – technically, however, they are just TEKs. For simplicity we shall assume in this document that a phone uses one key per calendar day and that an upload of keys consists of the keys used on the day of upload and the previous 13 days. Let $$S$$ denote the set of uploaded keys and for $$s\in S$$ let $$\text{key}(s)$$ denote the diagnosis key, $$\text{day_valid}(s)$$ the calendar day the key was in use and $$\text{TRL}(s)$$ the transmission risk level. The latter is a numerical value for how infectious the potential infector was on $$\text{day_valid}(s)$$.
For simplicity, in what follows we shall consider only the so-called level of the duration, attenuation and transmission risk, which in the GAEN API is a discretized version of the continuous quantity, represented by an integer between 0 and 8. We shall denote by $$I$$ the subset of $$E$$ which consists of keys known to have been positively tested for COVID-19, that is, keys that have been uploaded to the server. For simplicity we shall extend the elements of $$E$$ with the additional information provided by the server about the key, i.e. we let $I = \{ (e, s) : e \in E , s \in S , \text{key}(s) = \text{key}(e) \}.$

Case Study with Aisha, Anton and Betty

We shall illustrate the above using an example inspired by the CWA documentation. In the example the CWA user Betty travels with Anton and Aisha together by bus to and from work on the 9th and the 16th of the month. Each trip lasts 10 minutes. Anton sat one meter away from Betty, while Aisha sat two meters away. Both Anton and Aisha test positive on the 20th and decide to upload their keys to the server on the 20th and 21st, respectively. We perform Betty’s risk calculation on the 20th and 21st, respectively. To make the example clearer, the original trip length of 10 minutes is changed to 11 minutes.
GAEN Configuration

Configuration of the risk scoring parameters as done for the CWA:

```r
# Transmission risk level provided at key upload - see
# https://github.com/corona-warn-app/cwa-app-android/blob/fe9128fdb28384640d0a248b7ff46701d6a04f0e/Corona-Warn-App/src/main/java/de/rki/coronawarnapp/util/ProtoFormatConverterExtensions.kt#L12
# for the Google app
trl <- structure(c(5, 6, 8, 8, 8, 5, 3, rep(1, 7)), names = 0:13)

# Breaks in dB for the attenuation - see
# https://github.com/google/exposure-notifications-internals/blob/e953a09dd3c20c04dfa29e362eef461890dac25c/exposurenotification/src/main/java/com/google/samples/exposurenotification/features/ContactTracingFeature.java#L515
attenuation_breaks <- c(0, 10, 15, 27, 33, 51, 63, 73, Inf)
attenuation_include_lowest <- TRUE

# Level for each attenuation bucket, see
# https://github.com/corona-warn-app/cwa-server/blob/master/services/distribution/src/main/resources/master-config/exposure-config.yaml
attenuation_lvl <- tibble(
  bucket = c("[0,10]", "(10,15]", "(15,27]", "(27,33]",
             "(33,51]", "(51,63]", "(63,73]", "(73,Inf]"),
  lvl = c(2, 2, 2, 2, 2, 2, 2, 0))

# Bucket limits in
# https://github.com/google/exposure-notifications-internals/blob/e953a09dd3c20c04dfa29e362eef461890dac25c/exposurenotification/src/main/java/com/google/samples/exposurenotification/features/ContactTracingFeature.java#L502
# (note that the buckets are closed to the right)
duration_breaks <- c(-Inf, 0, 5, 10, 15, 20, 25, 30, Inf)
duration_include_lowest <- FALSE

# Levels - see
# https://github.com/corona-warn-app/cwa-server/blob/f3d66840ef59e2ced10fb9b42f08a8938e76075a/services/distribution/src/main/resources/master-config/exposure-config.yaml#L50
duration_lvl <- tibble(
  bucket = c("(-Inf,0]", "(0,5]", "(5,10]", "(10,15]",
             "(15,20]", "(20,25]", "(25,30]", "(30, Inf]"),
  lvl = c(0, 0, 0, 1, 1, 1, 1, 1))

# Days parameters - see
# https://github.com/corona-warn-app/cwa-server/blob/f3d66840ef59e2ced10fb9b42f08a8938e76075a/services/distribution/src/main/resources/master-config/exposure-config.yaml#L40
days_lvl <- 5

# Time_attenuation_buckets (to be used later)
time_attenuation_bucket_breaks <- c(0, 55, 63, Inf)
time_attenuation_buckets <- levels(cut(0:100,
                                       breaks = time_attenuation_bucket_breaks,
                                       include.lowest = TRUE, right = FALSE))
```

The $$\text{days}(e)$$ parameter of an exposure $$e\in E$$ reflects the possibility to provide additional risk-relevant information about the CWA user at the day of the contact. However, the CWA currently keeps it at a fixed value of 5 and, hence, it plays little role in the risk calculations.

Part 1: Uploaded keys on the server

Anton had his CWA running since 2020-09-13, while Aisha has been running her app since 2020-09-01. For each of the two we build a tibble containing the uploaded keys at the day of transmission, i.e. on the 20th and 21st, respectively. Furthermore, we construct two tibbles mimicking the state of the diagnosis key server on 2020-09-20 and 2020-09-21, respectively.

```r
# Note: cwa_start_anton, cwa_start_aisha and the key generator rkey()
# are defined earlier in the original post.

# Upload of Anton on the 2020-09-20, where his app has been running for 8 days
anton_ndays <- min(14, as.numeric(as.Date("2020-09-20") - cwa_start_anton + 1))
anton <- tibble(key = rkey(anton_ndays),
                valid = seq(as.Date("2020-09-20"), length.out = anton_ndays, by = "-1 day"),
                trl = trl[seq_len(anton_ndays)])

# Upload of Aisha on the 2020-09-21, where her app has been running for 14 days
aisha_ndays <- min(14, as.numeric(as.Date("2020-09-20") - cwa_start_aisha + 1))
aisha <- tibble(key = rkey(aisha_ndays),
                valid = seq(as.Date("2020-09-21"), length.out = aisha_ndays, by = "-1 day"),
                trl = trl[seq_len(aisha_ndays)])

# Upload to server consists of keys for the last 13 days (note: day zero caveat
# is ignored). Hence dk consists of all keys that were active during the last 13 days.
dk20200920 <- anton %>% filter(valid >= as.Date("2020-09-20") - 13)
dk20200921 <- rbind(anton, aisha) %>% filter(valid >= as.Date("2020-09-21") - 13)

# Show diagnosis keys of 2020-09-20 (these are Anton's keys)
dk20200920
```

```
## # A tibble: 8 x 3
##   key    valid        trl
##   <chr>  <date>     <dbl>
## 1 2a0a87 2020-09-20     5
## 2 051447 2020-09-19     6
## 3 f810f4 2020-09-18     8
## 4 17a0d6 2020-09-17     8
## 5 3d12e3 2020-09-16     8
## 6 aaf590 2020-09-15     5
## 7 22b909 2020-09-14     3
## 8 1be004 2020-09-13     1
```

Part 2: Betty’s exposure history

Betty’s sighting history on the 20th would look as follows:

```
## # A tibble: 6 x 7
##   key    date       attenuation duration attenuation_lvl duration_lvl days_lvl
##   <chr>  <date>           <dbl>    <dbl>           <dbl>        <dbl>    <dbl>
## 1 3d12e3 2020-09-16          40       11               2            1        5
## 2 3d12e3 2020-09-16          40       11               2            1        5
## 3 835292 2020-09-16          60       11               2            1        5
## 4 835292 2020-09-16          60       11               2            1        5
## 5 3d8508 2020-09-09          60       11               2            1        5
## 6 3d8508 2020-09-09          60       11               2            1        5
```

Hence, Betty’s set of exposures with test positives provided by the server on the 20th is:

```r
i_set <- inner_join(eh_betty, dk20200920, by = "key")
i_set
```

```
## # A tibble: 2 x 9
##   key    date       attenuation duration attenuation_lvl duration_lvl days_lvl valid        trl
##   <chr>  <date>           <dbl>    <dbl>           <dbl>        <dbl>    <dbl> <date>     <dbl>
## 1 3d12e3 2020-09-16          40       11               2            1        5 2020-09-16     8
## 2 3d12e3 2020-09-16          40       11               2            1        5 2020-09-16     8
```

Risk scoring

The description of the risk scoring in the CWA app is centered around Fig. 12 and 13 of the solution architecture documentation. An illustrative example of how this computation looks can be found here. For further details the source code for the risk scoring in the CWA can be consulted [Android version | iOS]. However, for the Android version, which is the one I checked most closely, it is not 100% transparent (to me) how the important exposureSummary object is generated by the API.
Nevertheless, the computation of the GAEN risk scoring algorithm consists of two steps: a per-exposure risk score calculation followed by an aggregation over all exposures.

Risk Score

We compute the total risk for each exposure with a test positive, i.e. for each element $$i\in I$$, by $\text{TR}(i) = \text{duration}(i)\times \text{attenuation}(i) \times \text{TRL}(i) \times \text{days}(i),$ where duration and attenuation here denote the corresponding level values. Exposure risks with a total risk below the minimum risk score of 11 are set to zero:

```r
# Minimum risk score - see https://github.com/corona-warn-app/cwa-server/blob/f3d66840ef59e2ced10fb9b42f08a8938e76075a/services/distribution/src/main/resources/master-config/app-config.yaml#L13
minimum_risk_score <- 11

# Compute risk score and filter out those below the minimum
risk <- i_set %>%
  mutate(risk_score = attenuation_lvl * duration_lvl * trl * days_lvl) %>%
  filter(risk_score >= minimum_risk_score)
risk
## # A tibble: 2 x 10
##   key    date       attenuation duration attenuation_lvl duration_lvl days_lvl valid        trl risk_score
##   <chr>  <date>           <dbl>    <dbl>           <dbl>        <dbl>    <dbl> <date>     <dbl>      <dbl>
## 1 3d12e3 2020-09-16          40       11               2            1        5 2020-09-16     8         80
## 2 3d12e3 2020-09-16          40       11               2            1        5 2020-09-16     8         80
```

From this we can compute the so-called Normalized Total Risk Score defined as $\text{Normalized Total Risk Score} := \frac{\max_{i\in I} \text{TR}(i)}{50}.$ I’m guessing, but the value of 50 appears to be a “baseline” for the transmission risk that test positives have on the day they upload their keys (the first element of trl is 5 and, hence, $$1 \times 2 \times 5 \times 5 = 50$$).

```r
# Derive the normalized total risk score
maxTR <- max(risk$risk_score)
ntrs  <- maxTR / 50
c(maxTR, ntrs)
```

## [1] 80.0 1.6
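The numbers above can be reproduced by hand from the matched exposure in `i_set`:

```r
# By hand: Betty's matched exposure has attenuation_lvl 2, duration_lvl 1,
# trl 8 and days_lvl 5, so
TR <- 2 * 1 * 8 * 5   # total risk of the single distinct exposure: 80
TR / 50               # the normalized total risk score: 1.6
```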

In order to reflect that the accumulation of several exposures with a test positive increases your risk, a weighted summation of the time spent in three attenuation buckets (distance: far, intermediate and close) is computed.

For each class $$j=1,2,3$$ (low, mid and high) a weighted time in the class is computed as \begin{align} wt_j = w_j \times \min\left(30, \sum_{i\in I^*} \mathcal{I}\left(l_j \leq \text{attenuation}(i) < u_j \right) \times \text{duration}(i)\right), \end{align} where $$\mathcal{I}(A)$$ denotes the indicator function, which is 1 if the condition $$A$$ is fulfilled and zero otherwise. Furthermore, $$I^* = \{ i \in I : \text{TR}(i)>0\}$$, $$w_1=0, w_2=0.5$$ and $$w_3=1$$, and the limits $$(l_j, u_j)$$ of the three classes are chosen as $$[63, \infty)$$, $$[55,63)$$ and $$[0,55)$$ dB. Note that for reasons beyond my knowledge, the time in each class is capped at 30 minutes. The total time is then calculated as $\text{Total time} = \sum_{j=1}^4 wt_j,$ where $$wt_4\geq 0$$ is denoted a bucket offset value, which is currently set to 0 in the CWA. Hence, the largest possible total time is $$0\times 30 + 0.5\times 30 + 1\times 30 + 0 = 45$$ minutes.

```r
# Accumulate time in the 3 attenuation buckets
ta_buckets <- risk %>%
  mutate(time_attenuation_bucket = cut(attenuation,
                                       breaks = time_attenuation_bucket_breaks,
                                       include.lowest = TRUE, right = FALSE)) %>%
  group_by(time_attenuation_bucket) %>%
  summarise(time = sum(duration)) %>%
  # Cap at 30 minutes
  mutate(time = pmin(30, time, na.rm = TRUE))

# Define the weights - https://github.com/corona-warn-app/cwa-server/blob/f3d66840ef59e2ced10fb9b42f08a8938e76075a/services/distribution/src/main/resources/master-config/attenuation-duration.yaml#L15
weights <- tibble(time_attenuation_bucket = time_attenuation_buckets,
                  weight = c(1, 0.5, 0))
ta_buckets_w <- right_join(ta_buckets, weights, by = "time_attenuation_bucket")

# Do the weighted summation and add the bucket offset
exposure_score <- ta_buckets_w %>%
  summarise(total = sum(time * weight, na.rm = TRUE) + 0) %>%
  as.numeric()
```
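As a side note, the bucket labels that `cut()` attaches (and which the `weights` tibble must match exactly) can be inspected directly; `right = FALSE` closes the intervals on the left:

```r
# Three example attenuation values, one per bucket
# (40 dB = close, 60 dB = intermediate, 70 dB = far)
cut(c(40, 60, 70), breaks = c(0, 55, 63, Inf),
    include.lowest = TRUE, right = FALSE)
## [1] [0,55)   [55,63)  [63,Inf]
## Levels: [0,55) [55,63) [63,Inf]
```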

The combined risk score is now given as $\text{Combined Risk Score} = \text{Exposure Score } \times \text{Normalized Total Risk Score}$

The app issues a warning if the combined risk score reaches the threshold value of 15.

In code:

```r
# Compute combined risk score and classify
(combined_risk_score <- exposure_score * ntrs)
```

## [1] 35.2

```r
(alarm <- combined_risk_score >= 15)
```

## [1] TRUE

Hence, Betty receives an alarm on the 20th.

The classifier in a nutshell

To summarise, the resulting binary classifier of the CWA can be expressed mathematically as

\begin{align*} \mathcal{I}\left\{\left[\sum_{j=1}^{3} w_{j} \times \min\left(30, \sum_{i\in I^*} \mathcal{I}\left( \text{attenuation}(i) \in [l_j,u_j) \right) \times \text{duration}(i) \right)\right] \times \\ \quad\quad\quad\quad\quad \max_{i\in I} \{\text{duration}(i)\times \text{attenuation}(i) \times \text{TRL}(i) \times \text{days}(i)\} \geq 15\cdot 50 \right\}. \end{align*}
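As a sanity check of the boxed formula, here is a minimal base-R sketch (`cwa_alarm` is a hypothetical helper for this post, not CWA source code), applied to Betty's two matched sightings from the 20th:

```r
# Minimal base-R sketch of the classifier formula above; inputs are vectors
# over the matched exposures i in I.
cwa_alarm <- function(attenuation, duration, attenuation_lvl, duration_lvl, trl,
                      days_lvl = 5, min_risk = 11, threshold = 15) {
  tr   <- attenuation_lvl * duration_lvl * trl * days_lvl  # per-exposure total risk
  keep <- tr >= min_risk                                   # minimum risk score filter
  if (!any(keep)) return(FALSE)
  ntrs <- max(tr[keep]) / 50                               # normalized total risk score
  # attenuation buckets [0,55), [55,63), [63,Inf] with weights 1, 0.5, 0
  bucket <- cut(attenuation[keep], breaks = c(0, 55, 63, Inf),
                include.lowest = TRUE, right = FALSE)
  time_in_bucket <- pmin(30, vapply(split(duration[keep], bucket), sum, numeric(1)))
  exposure_score <- sum(time_in_bucket * c(1, 0.5, 0))
  exposure_score * ntrs >= threshold
}

# Betty's two matched sightings on the 20th trigger the alarm
cwa_alarm(attenuation = c(40, 40), duration = c(11, 11),
          attenuation_lvl = c(2, 2), duration_lvl = c(1, 1), trl = c(8, 8))
## [1] TRUE
```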

The formula can also be assembled into one classifier function, which we will use to repeat the risk classification based on the available server information on 2020-09-21:

```r
cwa_classifier <- function(i_set) {
  # Compute risk score and filter out those below the minimum
  risk <- i_set %>%
    mutate(risk_score = attenuation_lvl * duration_lvl * trl * days_lvl) %>%
    filter(risk_score >= minimum_risk_score)

  # Derive the normalized total risk score
  maxTR <- max(risk$risk_score)
  ntrs  <- maxTR / 50

  # Accumulate time in the 3 attenuation buckets
  ta_buckets <- risk %>%
    mutate(time_attenuation_bucket = cut(attenuation,
                                         breaks = time_attenuation_bucket_breaks,
                                         include.lowest = TRUE, right = FALSE)) %>%
    group_by(time_attenuation_bucket) %>%
    summarise(time = sum(duration)) %>%
    # Cap at 30 minutes
    mutate(time = pmin(30, time, na.rm = TRUE))

  # Define the weights - https://github.com/corona-warn-app/cwa-server/blob/f3d66840ef59e2ced10fb9b42f08a8938e76075a/services/distribution/src/main/resources/master-config/attenuation-duration.yaml#L15
  weights <- tibble(time_attenuation_bucket = time_attenuation_buckets,
                    weight = c(1, 0.5, 0))
  ta_buckets_w <- right_join(ta_buckets, weights, by = "time_attenuation_bucket")

  # Do the weighted summation and add the bucket offset
  exposure_score <- ta_buckets_w %>%
    summarise(total = sum(time * weight, na.rm = TRUE) + 0) %>%
    as.numeric()

  # Compute combined risk score and classify
  (combined_risk_score <- exposure_score * ntrs)
  (alarm <- combined_risk_score >= 15)

  # Return individual elements of the risk classification
  return(list(risk = risk, ta_buckets_w = ta_buckets_w,
              exposure_score = exposure_score, ntrs = ntrs,
              combined_risk_score = combined_risk_score, alarm = alarm))
}
```

To illustrate use of the function, we repeat the classification with the server information available on 2020-09-21.
Note that this information is slightly different from that on 2020-09-20, because now Aisha’s keys are also available:

```r
i_set <- left_join(eh_betty, dk20200921, by = "key")
cwa_classifier(i_set)
## $risk
## # A tibble: 4 x 10
##   key    date       attenuation duration attenuation_lvl duration_lvl days_lvl valid        trl risk_score
##   <chr>  <date>           <dbl>    <dbl>           <dbl>        <dbl>    <dbl> <date>     <dbl>      <dbl>
## 1 3d12e3 2020-09-16          40       11               2            1        5 2020-09-16     8         80
## 2 3d12e3 2020-09-16          40       11               2            1        5 2020-09-16     8         80
## 3 835292 2020-09-16          60       11               2            1        5 2020-09-16     5         50
## 4 835292 2020-09-16          60       11               2            1        5 2020-09-16     5         50
##
## $ta_buckets_w
## # A tibble: 3 x 3
##   time_attenuation_bucket  time weight
##   <chr>                   <dbl>  <dbl>
## 1 [0,55)                     22    1
## 2 [55,63)                    22    0.5
## 3 [63,Inf]                   NA    0
##
## $exposure_score
## [1] 33
##
## $ntrs
## [1] 1.6
##
## $combined_risk_score
## [1] 52.8
##
## $alarm
## [1] TRUE
```

In other words, and not surprisingly, we also get an alarm if Betty runs the risk scoring on the 21st. The combined risk score is now higher due to the multiple exposures.

Discussion

Despite an honest effort to understand the code and the documentation, there are still specific details which I’m unsure about. One reason is that only parts of the code of the GAEN framework v1 are public; this should be improved. Another reason is that I lack the technical skills to build and execute the CWA from source on my phone or to evaluate the GAEN snippets. This would allow me to test different scenarios and gain insight through a reverse-engineering approach. A mathematical and high-level implementation documentation, such as the one attempted in this post, is not easy to understand either, but math is a universal language with enough abstraction to avoid getting lost in the nitty-gritty details of source code.
If the source code is public, however, a ground truth can always be established; it may just take time… This post mixes the two approaches by supplementing the math with an illustrative example using R code; the source of this is available from GitHub. More details on the use of mathematics in digital contact tracing can be found in my lecture given as part of the 2020 summer course on “The Mathematics and Statistics of Infectious Disease Outbreaks” at Stockholm University [slides | video part 1 | video part 2].

Literature

1. Code snippets have recently been published [google | apple | unofficial GitHub of the apple code]. Unfortunately, most of the published code appears to cover later releases of the GAEN. Furthermore, as far as I can tell, both the Google and the Apple snippets don’t cover the full code used for the risk scoring in v1. Hence, some of my above explanations deduced from the code may fail to truly capture the state and functioning of the v1 risk scoring.
2. Approximately every 5 minutes.
3. This description is a simplification. In reality the BLE protocol consists of a series of scans in which it is recorded which keys can be found. These serial sightings of a key are then aggregated by the GAEN into exposure periods, which have a beginning and an end.
4. In practice the identifiers broadcast for a TEK change frequently during the day to preserve anonymity, but all of them can be deduced from one uploaded diagnosis key.
5. For this post we ignore the fact that the key active on the day of upload is not uploaded until the next day, but see also this issue.
To leave a comment for the author, please follow the link and comment on their blog: Theory meets practice…. The post Risk Scoring in Digital Contact Tracing Apps first appeared on R-bloggers.

### Predicting pneumonia outcomes: Results (using DataRobot API)

Tue, 09/15/2020 - 20:00

[This article was first published on R on notast, and kindly contributed to R-bloggers].

This post is supplementary material for an assignment. The assignment is part of the Augmented Machine Learning unit for a Specialised Diploma in Data Science for Business. The aim of this project is to classify whether patients with Community Acquired Pneumonia (CAP) became better after seeing a doctor or became worse despite seeing a doctor. Previously, EDA (part 1, part 2) and feature engineering were done in R. The dataset was then imported into DataRobot for the first round of basic modelling (the aim of the assignment was to use DataRobot). Next, R was used for advanced feature selection in the second round of DataRobot’s modelling. This post outlines the results after both rounds of modelling.
All models

```r
library(datarobot)
library(tidyverse)
theme_set(theme_light())

ConnectToDataRobot(endpoint = "https://app.datarobot.com/api/v2",
                   token = "API token key")
## [1] "Authentication token saved"

# Saving the relevant project ID
projects_inDR <- data.frame(name = ListProjects()$projectName,
                            ID   = ListProjects()$projectId)
project_ID <- projects_inDR %>% filter(name == "Classify CAP") %>% pull(ID)
```

After 2 runs of modelling (basic and with advanced feature selection), a total of 31 models was created.

```r
# Saving the models
List_Model3 <- ListModels(project_ID) %>% as.data.frame(simple = F)
List_Model3 %>%
  select(expandedModel, featurelistName, `Weighted AUC.crossValidation`) %>%
  arrange(-`Weighted AUC.crossValidation`) %>%
  DT::datatable(rownames = F, options = list(searchHighlight = TRUE, paging = T))
```

(The interactive table of all 31 models, sorted by cross-validated weighted AUC, is omitted here; the eXtreme Gradient Boosted Trees Classifier on the DR Reduced Features M8 featurelist had the highest cross-validated AUC of 0.92993.)

TPR and TNR

Three metrics were used to compare the models: AUC, sensitivity/true positive rate (TPR) and specificity/true negative rate (TNR). AUC was the primary metric. TPR and TNR had to be extracted separately. DataRobot reports the TPR and TNR at the threshold that maximises the F1 score (though they can be adjusted to other thresholds). There is an indirect and a direct way to extract the TPR and TNR.

1. Indirect

The TPR and TNR can be calculated indirectly from the values in the confusion matrix.
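For reference, the arithmetic of this indirect route is simple; a base-R sketch with made-up counts (not from this project):

```r
# Hypothetical 2x2 confusion matrix: rows = predicted, columns = actual
cm <- matrix(c(80, 20,    # predicted positive: 80 TP, 20 FP
               10, 90),   # predicted negative: 10 FN, 90 TN
             nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("pos", "neg"), actual = c("pos", "neg")))
TPR <- cm["pos", "pos"] / sum(cm[, "pos"])  # TP / (TP + FN)
TNR <- cm["neg", "neg"] / sum(cm[, "neg"])  # TN / (TN + FP)
round(c(TPR = TPR, TNR = TNR), 3)           # TPR = 0.889, TNR = 0.818
```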
```r
# extract all modelId
model_id <- List_Model3 %>%
  filter(!is.na(`Weighted AUC.crossValidation`)) %>%
  pull(modelId)
```

However, the API was unable to obtain the confusion matrix:

```r
(GetModel(project_ID, model_id[1]) %>% GetConfusionChart(source = DataPartition$VALIDATION))
(GetModel(project_ID, model_id[1]) %>% GetConfusionChart(source = DataPartition$CROSSVALIDATION))
(GetModel(project_ID, model_id[9]) %>% GetConfusionChart(source = DataPartition$VALIDATION))
(GetModel(project_ID, model_id[9]) %>% GetConfusionChart(source = DataPartition$CROSSVALIDATION))
## [1] "Error: (404) No confusion chart for validation"
## [1] "Error: (404) No confusion chart for validation"
## [1] "Error: (404) No confusion chart for validation"
## [1] "Error: (404) No confusion chart for validation"
```

2. Direct way

The TPR and TNR are printed directly in the Selection Summary under the ROC option in the DataRobot GUI. The same values can be accessed via the GetRocCurve function in the API.

```r
# loop to get all TNR and TPR from the ROC Curve option
# https://stackoverflow.com/questions/29402528/append-data-frames-together-in-a-for-loop
TPR_TNR <- data.frame()
for (i in model_id) {
  selection_summary <- GetModel(project_ID, i) %>%
    GetRocCurve(source = DataPartition$CROSSVALIDATION)
  temp <- selection_summary$rocPoints %>%
    filter(f1Score == max(f1Score)) %>%
    select(truePositiveRate, trueNegativeRate) %>%
    mutate(modelId = i)
  TPR_TNR <- bind_rows(TPR_TNR, temp)
}
TPR_TNR %>% head()
## # A tibble: 6 x 3
##   truePositiveRate trueNegativeRate modelId
##              <dbl>            <dbl> <chr>
## 1            0.724            0.903 5f564d6cdf91a0559ffbfb8c
## 2            0.621            0.931 5f54de5d269d39249ea468a3
## 3            0.770            0.876 5f54de5d269d39249ea468a2
## 4            0.649            0.936 5f54de5d269d39249ea468a4
## 5            0.787            0.886 5f54de5d269d39249ea468a7
## 6            0.736            0.906 5f564d880a859d0d17d859c8
```

Updating main dataframe

TPR_TNR was joined with the main dataframe, List_Model3, so that all metrics are in a common dataframe.
```r
List_Model3 <- List_Model3 %>%
  left_join(TPR_TNR, by = "modelId") %>%
  # select columns of interest
  select(modelType, expandedModel, modelId, blueprintId, featurelistName,
         featurelistId, `Weighted AUC.crossValidation`,
         TPR = truePositiveRate, TNR = trueNegativeRate) %>%
  mutate(`Weighted AUC.crossValidation` = as.double(`Weighted AUC.crossValidation`))
## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
```

Results

Performance of models

The AUC was high and similar for all models.
For this study, it was more important to identify as many patients as possible who became worse despite seeing a doctor (i.e. true positives). Identification of these patients with poor outcomes would allow better intervention to be provided to increase their chances of a better clinical evolution. Thus, models with high TPR were preferred. Though there is a trade-off between TPR and TNR, selecting models with higher TPR in this study would not compromise TNR drastically, as TNR was high and similar for all models. TNR was better than TPR for all models due to the imbalanced dataset: the models saw more negative cases during training and were thus better at identifying negative cases during validation/testing.

```r
List_Model3 %>%
  pivot_longer(cols = c(`Weighted AUC.crossValidation`, TPR, TNR), names_to = "metric") %>%
  ggplot(aes(metric, value, fill = metric)) +
  geom_boxplot(alpha = .5)
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
```

Among the models generated using advanced feature selection, the AUC was high for both the DR Reduced Features list and the top-32 mean-feature-impact list, with less variability for DR Reduced Features. DR Reduced Features had a higher average TPR but larger variation compared to the top-32 mean-feature-impact list, and a model using DR Reduced Features had the highest TPR.

```r
List_Model3 %>%
  pivot_longer(cols = c(`Weighted AUC.crossValidation`, TPR, TNR), names_to = "metric") %>%
  ggplot(aes(metric, value, fill = featurelistName)) +
  geom_boxplot(alpha = .5)
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).
```

Selecting a model for deployment

Out of the 31 models, a handful were selected to be compared against the holdout set to determine the most appropriate model for deployment. The models had to have a high AUC and a high TPR (as shared above, TPR is prioritised over TNR).

```r
deploy_all <- List_Model3 %>%
  mutate(AUC_rank = min_rank(desc(`Weighted AUC.crossValidation`)),
         TPR_rank = min_rank(desc(TPR)),  # min_rank gives ties the same ranking
         TNR_rank = min_rank(desc(TNR))) %>%
  select(modelType, featurelistName, `Weighted AUC.crossValidation`, AUC_rank,
         TPR, TPR_rank, TNR, TNR_rank, modelId, blueprintId) %>%
  mutate(across(.cols = c(`Weighted AUC.crossValidation`, TPR, TNR),
                .fns = ~round(.x, 5)))
deploy_all %>% select(-c(modelId, blueprintId))
```
```r
## # A tibble: 31 x 8
##    modelType featurelistName Weighted AUC.c~ AUC_rank   TPR TPR_rank   TNR
##    <chr>     <chr>                     <dbl>    <int> <dbl>    <int> <dbl>
##  1 eXtreme ~ DR Reduced Fea~          NA           NA NA          NA NA
##  2 Keras Sl~ DR Reduced Fea~           0.909       19  0.724      11  0.903
##  3 RuleFit ~ L1                        0.905       21  0.621      24  0.931
##  4 Keras Sl~ L1                        0.908       20  0.770       4  0.876
##  5 eXtreme ~ L1                        0.927        4  0.649      21  0.936
##  6 RandomFo~ L1                        0.919       12  0.787       3  0.886
##  7 Generali~ mean ft impact            0.923        8  0.736       6  0.906
##  8 Generali~ L1                        0.920       11  0.724      11  0.903
##  9 AVG Blen~ Multiple featu~           0.929        2  0.638      22  0.942
## 10 RandomFo~ mean ft impact            0.914       15  0.695      15  0.913
## # ... with 21 more rows, and 1 more variable: TNR_rank <int>
```
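The `min_rank(desc(.))` calls rank the models from best to worst; `rank(-x, ties.method = "min")` is the base-R equivalent, and a small illustration shows the tie behaviour noted in the code comment:

```r
x <- c(0.92, 0.95, 0.92, 0.90)   # e.g. four TPR values with one tie
rank(-x, ties.method = "min")    # tied models share the smallest rank
## [1] 2 1 2 4
```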

The 5 models were:

1. The model with the highest AUC which also had the highest TPR
2. The model with the 2nd highest AUC
3. The model with the 2nd highest TPR
4. A model with a relatively balanced high AUC and TPR (7th for AUC and 4th for TPR)
5. Another model with a relatively balanced high AUC and TPR (5th for AUC and 6th for TPR)

```r
(deploy_all %>%
   select(-c(modelId, blueprintId)) %>%
   filter(AUC_rank %in% c(1, 2, 7, 5) | TPR_rank == 2)) %>%
  DT::datatable(rownames = F, options = list(searchHighlight = TRUE, paging = T))
```

| modelType | featurelistName | Weighted AUC.crossValidation | AUC_rank | TPR | TPR_rank | TNR | TNR_rank |
|---|---|---|---|---|---|---|---|
| AVG Blender | Multiple featurelists | 0.92878 | 2 | 0.63793 | 22 | 0.94167 | 4 |
| AVG Blender | mean ft impact | 0.92505 | 6 | 0.80460 | 2 | 0.88241 | 29 |
| eXtreme Gradient Boosted Trees Classifier | DR Reduced Features M8 | 0.92993 | 1 | 0.81609 | 1 | 0.88519 | 26 |
| Light Gradient Boosted Trees Classifier with Early Stopping | mean ft impact | 0.92403 | 7 | 0.77011 | 4 | 0.88704 | 24 |
| eXtreme Gradient Boosted Trees Classifier | mean ft impact | 0.92660 | 5 | 0.73563 | 6 | 0.90463 | 19 |

```r
# model ID for the 5 candidates
deploy_modelsId <- deploy_all %>%
  filter(AUC_rank %in% c(1, 2, 7, 5) | TPR_rank == 2) %>%
  pull(modelId)
```

The holdout dataset was unlocked and the performance of these 5 models on the holdout set was compared.

```r
# extract model and holdout AUC
deploy_models <- ListModels(project_ID) %>% as.data.frame(simple = F)
deploy_models <- deploy_models %>%
  select(modelType, expandedModel, featurelistName, `Weighted AUC.holdout`,
         modelId, blueprintId) %>%
  filter(modelId %in% deploy_modelsId)

# extract holdout TPR and TNR
TPR_TNR_holdout <- data.frame()
for (i in deploy_modelsId) {
  selection_summary <- GetModel(project_ID, i) %>%
    GetRocCurve(source = DataPartition$HOLDOUT)
  temp <- selection_summary$rocPoints %>%
    filter(f1Score == max(f1Score)) %>%
    select(truePositiveRate, trueNegativeRate) %>%
    mutate(modelId = i)
  TPR_TNR_holdout <- bind_rows(TPR_TNR_holdout, temp)
}

# join all metrics
deploy_results <- left_join(deploy_models, TPR_TNR_holdout, by = "modelId") %>%
  distinct(modelId, .keep_all = T) %>%
  rename(TPR = truePositiveRate, TNR = trueNegativeRate)
```

Based on the holdout metrics, the Light Gradient Boosted Trees classifier with early stopping (Light GBT) was selected as the model for deployment. Although Light GBT did not have the highest holdout AUC, its AUC of 0.94033 was still close to the maximum AUC of 0.95695. The merits of Light GBT were its high TPR and TNR, with the smallest difference between the two. Moreover, it had the highest TPR, which is valued for this classification problem.

```r
# plot
(deploy_results %>%
   unite("model", c(modelType, featurelistName), sep = ":") %>%
   pivot_longer(cols = c(`Weighted AUC.holdout`, TPR, TNR), names_to = "metric") %>%
   ggplot(aes(model, value, colour = metric)) +
   geom_point(alpha = .7) +
   coord_flip() +
   labs(x = "", title = "Metrics of Potential Models for Deployment",
        subtitle = "tested on holdout set") +
   # https://stackoverflow.com/questions/21878974/auto-wrapping-of-labels-via-labeller-label-wrap-in-ggplot2
   scale_x_discrete(labels = scales::wrap_format(55)))
```

LightGBT

Description of the deployed Light GBT

```r
# modelId of Light GBT
LightGBT_Id <- deploy_results %>%
  filter(str_detect(modelType, "Light")) %>%
  pull(modelId)

# blueprint
LightGBT_blueprint <- GetModelBlueprintDocumentation(project_ID, LightGBT_Id) %>%
  data.frame()
```
```r
(glue::glue("Description of the deployed model:: \n{LightGBT_blueprint$description}"))
## Description of the deployed model::
## The Classifier uses the LightGBM implementation of Gradient Boosted Trees.
## LightGBM is a gradient boosting framework designed to be distributed and efficient
## with the following advantages:
## Gradient Boosting Machines (or Gradient Boosted Trees, depending on who you ask to
## explain the acronym 'GBM') are a cutting-edge algorithm for fitting extremely
## accurate predictive models. GBMs have won a number of recent predictive modeling
## competitions and are considered among many Data Scientists to be the most versatile
## and useful predictive modeling algorithm. GBMs require very little preprocessing,
## elegantly handle missing data, strike a good balance between bias and variance and
## are typically able to find complicated interaction terms, which makes them a useful
## "swiss army knife" of predictive models. GBMs are a generalization of Freund and
## Schapire's adaboost algorithm (1995) to handle arbitrary loss functions. They are
## very similar in concept to random forests, in that they fit individual decision
## trees to random re-samples of the input data, where each tree sees a boostrap
## sample of the rows of a the dataset and N arbitrarily chosen columns where N is a
## configural parameter of the model. GBMs differ from random forests in a single
## major aspect: rather than fitting the trees in parallel, the GBM fits each
## successive tree to the residual errors from all the previous trees combined. This
## is advantageous, as the model focuses each iteration on the examples that are most
## difficult to predict (and therefore most useful to get correct). Due to their
## iterative nature, GBMs are almost guaranteed to overfit the training data, given
## enough iterations. The 2 critial parameters of the algorithm; therefore, are the
## learning rate (or how fast the model fits the data) and the number of trees the
## model is allowed to fit. It is critical to cross-validate one of these 2
## parameters. Early stopping is used to determine the best number of trees where
## overfitting begins. In this manner GBMs are usually capable of squeezing every
## last bit of information out of the training set and producing the model with the
## highest possible accuracy.
```

Hyper-parameters which were tuned::

```r
### tuned values
# list within a list
para <- GetModel(project_ID, LightGBT_Id) %>% GetTuningParameters()

# option 1 to convert to df
para_df <- summary(para)

# option 2 to convert to df
#paraName <- list()
#value <- list()
#for (i in 1:18) {
#  paraName_temp <- para[["tuningParameters"]][[i]][["parameterName"]] %>% data.frame()
#  paraName[[i]] <- paraName_temp
#  value_temp <- para[["tuningParameters"]][[i]][["currentValue"]] %>% data.frame()
#  value[[i]] <- value_temp
#}
#paraName <- bind_rows(paraName)
#value_df <- data.frame()
#for (i in 1:length(value)) {
#  temp <- value[[i]] %>% data.frame()
#  value_df <- rbind(temp, value_df)
#}
#bind_cols(paraName, value_df)

### parameters description
para_describe <- LightGBT_blueprint %>%
  # append a numeric ID to parameters without one (e.g. parameters.type -> parameters.type.0)
  rename_with(.fn = ~paste0(.x, ".0"), .cols = matches("parameter(.*)[^0-9]$")) %>%
  # select only parameter variables
  select(starts_with("parameter")) %>%
  # parameters.type.0 -> type.0
  rename_with(.fn = ~str_replace(.x, "parameters.", "")) %>%
  # type.0 -> type_parameter0 (need [] to escape the full stop)
  rename_with(.fn = ~str_replace(.x, "[.]", "_parameter")) %>%
  # https://tidyr.tidyverse.org/articles/pivot.html#multiple-observations-per-row-1
  pivot_longer(cols = everything(), names_to = c(".value", "parameter"), names_sep = "_") %>%
  # parameter10 -> 10 (separate the alphabetic prefix from the number, convert to integer)
  separate(parameter, into = c(NA, "para"), sep = "[a-z]+", convert = T) %>%
  # remove parenthesised abbreviation from the name
  mutate(name = str_remove(name, "\\(.*\\)") %>% str_trim())
```

para_df contains one extra tuned parameter, the type of classification (objective); this extra parameter was ignored.

```r
anti_join(para_df, para_describe, by = "name")
## # A tibble: 1 x 4
##   name      current default constraint
##   <chr>     <chr>   <chr>   <chr>
## 1 objective binary  binary  select from: binary, multiclass, multiclassova
```

Description of the hyper-parameters and the tuned values

```r
left_join(para_describe, para_df, by = "name") %>%
  select(parameter = name, description, limits = constraint,
         default_value = default, tuned_values = current) %>%
  DT::datatable(rownames = F, options = list(searchHighlight = TRUE, paging = T))
```

| parameter | limits | default_value | tuned_values |
|---|---|---|---|
| learning_rate | 1e-07 to 100 | 0.05 | 0.05 |
| n_estimators | 1 to 250000 | 1000 | 1000 |
| num_leaves | 2 to 10000 | 16 | [2, 4, 16] |
| max_depth | 1 to 10000 or none | none | none |
| max_bin | 3 to 10000 | 255 | 255 |
| subsample_for_bin | 1 to 1e+06 | 50000 | 50000 |
| min_split_gain | 0 to 100 | 0 | 0 |
| min_child_weight | 0 to 100 | 5 | 5 |
| min_child_samples | 0 to 1000 | 10 | 10 |
| subsample | 0.01 to 1 | 1 | 1 |
| subsample_freq | 1 to 1000 or none | 1 | 1 |
| colsample_bytree | 0.04 to 1 | 0.3 | 0.3 |
| reg_alpha | 0 to 1e+06 | 0 | 0 |
| reg_lambda | 0 to 1e+06 | 0 | 0 |
| sigmoid | 1e-06 to 1000 | 1 | 1 |
| is_unbalance | select from: False, True | FALSE | FALSE |
| early_stopping_rounds | 0 to 1000 | 200 | 200 |

(The per-parameter descriptions from the interactive table are omitted. Note that only num_leaves was searched over several values; all other hyper-parameters kept their defaults.)

Reviewing deployed model

Light GBT was deployed and used to make predictions on unseen data. An unseen dataset was reserved earlier to evaluate the deployed model’s performance in the real world.

The AUC was 0.926 for the unseen data, which is close to the AUC for the holdout set (0.94), suggesting that Light GBT is neither overfitting nor underfitting the unseen data.

The frequent interlacing of the Predicted and Actual lines in the Lift Chart also suggests that Light GBT is neither underestimating nor overestimating.

The sensitivity of 0.759 was lower than the sensitivity of 0.861 achieved on the holdout set. The smaller sensitivity is likely due to the low absolute number of positive cases in the unseen data. Judged purely by counts, identifying 22/29 positive cases appears rather impressive; as a relative proportion, it is less remarkable.
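As a quick check on the arithmetic, sensitivity is simply true positives over all actual positives; using the counts quoted above:

```r
# Sensitivity = TP / (TP + FN); 22 of 29 actual positives were identified
tp <- 22
fn <- 29 - 22
sensitivity <- tp / (tp + fn)
round(sensitivity, 3)  # 0.759
```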

The predictive power of machine learning can identify patients with CAP who will deteriorate despite receiving medical treatment. According to the cumulative lift chart, when Light GBT was used to predict the unseen data, the top 20% of predictions ranked by the model’s probability of worse CAP outcomes contained almost 4 times more cases than selection without a model.

A CAP watch list targeting 20% of the patients based on Light GBT can expect to have almost 4 times more patients with poorer outcomes compared to not using a model and selecting a random sample of patients. Doctors can pay closer attention to these patients who are on the watch list as they are almost 4 times more likely to have worse clinical evolution.

According to the cumulative gain chart, the CAP watch list can expect to have almost 80% of the patients with worse outcomes using the top 20% predictions based on Light GBT.
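The lift and gain figures here came from DataRobot’s charts, but the same quantities can be sketched in R from a scored dataset (`df`, `prob` and `outcome` below are hypothetical stand-ins, not objects from this post):

```r
library(dplyr)

# Cumulative gain and lift for the top `frac` of predictions, assuming a data
# frame with a predicted-probability column `prob` and a 0/1 column `outcome`
lift_at <- function(df, frac = 0.20) {
  top  <- df %>% arrange(desc(prob)) %>% slice_head(n = ceiling(frac * nrow(df)))
  gain <- sum(top$outcome) / sum(df$outcome)    # share of all positives captured
  lift <- mean(top$outcome) / mean(df$outcome)  # vs. a random sample of patients
  c(gain = gain, lift = lift)
}
```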

Thoughts on DataRobot

Automated machine learning (AutoML) platforms like DataRobot reduce the amount of manual input in a data science project and democratize data science for less data-literate clinicians. Nonetheless, there are pitfalls to AutoML, as manual review is still required to enhance the models and validate that the variables driving the predictions make clinical sense. As seen in this series, an augmented machine learning approach that integrates domain knowledge and the context of the data when using AutoML platforms will be more appropriate, especially in healthcare, where human life is at stake.

For functions available in both the DataRobot GUI and the R API, the ease of using the mouse makes the GUI more convenient. However, without documented code, the project is harder to reproduce. Offering both the GUI and the API can also create confusion within a team about which steps were done where. Additionally, some functions are not available via the API: while there is an API function to predict on the unseen dataset, there is none to report performance on that dataset. Thus, the cumulative lift and gain charts above are screenshots.

Future work
1. Deploying the model. However, deployment is not an option with the student’s version of DataRobot.
2. Visualizing DataRobot’s results in Tableau. However, this requires the deployment option.
3. Running customized models in DataRobot. However, model registry is required but is not an option with the student’s version of DataRobot.
4. Modelling the entire project in either R’s tidymodels or python’s sklearn.

The post Predicting pneumonia outcomes: Results (using DataRobot API) first appeared on R-bloggers.

### Big Data Ignite 2020 Webinar Series

Tue, 09/15/2020 - 09:09

[This article was first published on R – Hi! I am Nagdev, and kindly contributed to R-bloggers].

Big Data Ignite (BDI) was born out of a shared vision: To foster a local center of excellence in advanced computing technologies and practice. After initial success in organizing local Meetup groups, co-founders Elliott and Tuhin realized that to achieve their goal, the scope and scale of activism would need to grow. So, in 2016, the Big Data Ignite conference was launched! In its first year, the conference attracted over 30 presenters and over 200 participants. The two-day conference was designed to provide hands-on skills in data-related tools and techniques, as well as a practical overview of the data-technology landscape and the state of the art. The first day of the conference was devoted to deep-dive workshops by major tool vendors and academics. The second day was filled with visionary keynote addresses, panel discussions, and practical breakout sessions focusing on use-cases and strategy in the areas of data analytics, data management, IoT, and cloud computing. It was a combination that worked, bringing together hands-on learning and strategic guidance from practicing experts.

Big Data Ignite 2017, 2018, and 2019 steadily built momentum. The conference grew to three days, over 50 presenters, and over 400 participants, offering an even broader variety of workshops, keynotes, and breakout sessions, with a special focus on AI.

Big Data Ignite in Uncertain Times

This year, BDI is hosting a series of webinars instead of a three-day conference. The webinars will run every Tuesday and Thursday from September 15 to October 15, 12:00 to 1:00 PM EST, with presentations already scheduled from industry leaders and academics. This is my second year at BDI, and I’m the speaker for the September 15 session, “Predictive Maintenance Pipeline Using R”.

Webinar Series

Check out the link below for all the talks scheduled in this series:
https://bigdataignite.org/

Free Registration

This year there is absolutely no cost to attend these sessions. You can register on their website at https://bigdataignite.org/register/

The post Big Data Ignite 2020 Webinar Series appeared first on Hi! I am Nagdev.


The post Big Data Ignite 2020 Webinar Series first appeared on R-bloggers.

### Why R? Webinar – Me, Myself and my Rprofile

Tue, 09/15/2020 - 04:00

This Thursday, September 17, we are launching another Why R? Webinar (whyr.pl/webinars/), entitled Me, Myself and my Rprofile.

2020-09-17 Me, Myself and my Rprofile

Details
• donate: whyr.pl/donate/
• date: every Thursday 8:00 pm UTC+2
• format: 45 minutes long talk streamed on YouTube + 10 minutes for Q&A
2020-09-24 Pause for 2020.whyr.pl Conference (Remote)

Previous talks

Sydeaka Watson Data Science for Social Justice. Video

John Blischak Reproducible research with workflowr: a framework for organizing, versioning, and sharing your data analysis projects. Video

JD Long Taking friction out of R: helping drive data science adoption in organizations. Video

Leon Eyrich Jessen In Silico Immunology: Neural Networks for Modelling Molecular Interactions using Tensorflow via Keras in R. Video

Erin Hodgess Using R with High Performance Tools on a Windows Laptop. Video

Julia Silge Understanding Word Embeddings. Video

Bernd Bischl, Florian Pfisterer and Martin Binder Pipelines and AutoML with mlr3. Video

Mateusz Zawisza + Armin Reinert Uplift modeling for marketing campaigns. Video

Erin LeDell – Scalable Automatic Machine Learning in R with H2O AutoML. Video

Ahmadou Dicko – Humanitarian Data Analysis with R. Video

Dr. Nina Zumel and Dr. John Mount from win-vector – Advanced Data Preparation for Supervised Machine Learning. Video

Lorenzo Braschi – rZYPAD: Development pipeline for R production. Video

Robin Lovelace and Jakub Nowosad (authors of Geocomputation with R) – Recent changes in R spatial and how to be ready for them. Video

Heidi Seibold, Department of Statistics, University of Munich (in collaboration with the LMU Open Science Center) – Teaching Machine Learning online. Video

Olgun Aydin (PwC Poland) – Introduction to shinyMobile. Video

Achim Zeileis (Universität Innsbruck) – R/exams: A One-for-All Exams Generator – Online Tests, Live Quizzes, and Written Exams with R. Video

Stay up to date

The post Why R? Webinar - Me, Myself and my Rprofile first appeared on R-bloggers.

### Dataviz Interview

Mon, 09/14/2020 - 21:05

[This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers].

I had a very nice chat recently about data visualization with Brian Fannin, a research actuary with the CAS. We covered a variety of topics, from R and ggplot in particular to how to think about data visualization in general, and what the dataviz community is learning from COVID. You can watch it here:


The post Dataviz Interview first appeared on R-bloggers.

### Bookdown, Not Bookout

Mon, 09/14/2020 - 20:00

[This article was first published on R, and kindly contributed to R-bloggers].

I first got to grips with the R coding language to do the statistical analysis for my dissertation, but I still did all my actual writing in more traditional word processing apps. 1 This became a pain any time I needed to copy-paste an update into a table or, even worse, change a plot and re-export to an image and re-import it into my writing environment. Going forward, I resolved to find a better way to incorporate my analysis code and writing to avoid this problem.

Bookdown is a fantastic R package that compiles Rmarkdown source documents into one big output, with code evaluated and, crucially, the ability to add cross-references to tables/plots etc. It seems to work best when producing HTML or PDF output, for which it has a whole bunch of great customisation options. This is great, as I always submit my final work in PDF. Unfortunately, due to needing to interact with lecturers and non-R-code-users for feedback on my work, I also need it to work with the Word docx file format. It can do this, but it does not have nearly as many options for customising the output. And a lot of the things you can build when producing a PDF just flat out don’t work in Word.

This has led to… some work on my part.

The most obvious solution to me seemed to simply change what my code was doing depending on the format I was exporting to. I can tweak the final PDF output to be perfect, but the docx format has to be just good enough to get the point across to whoever I’m seeking feedback from.

You can set Bookdown to export to multiple formats, and set options for each export format, using an _output.yml file. But, weirdly, I don’t actually find it easy to change code execution in your document based on what is currently being rendered by Bookdown. You’d think there’d be a way to declare is(pdf) or is(docx) and execute appropriately, right? Unfortunately not. 2

According to this Stackoverflow thread, you can determine the currently rendering format with rmarkdown::all_output_formats(knitr::current_input())[1]. I think _output.yml is supposed to be ‘shuffled’ by Rstudio so that index 1 is always the currently rendering output, but that wasn’t happening when I tested it; every time, I would just get back the first format listed in _output.yml, even if it wasn’t the one currently being rendered. So if _output.yml had bookdown::pdf_document and then bookdown::word_document in it, all I would get back is bookdown::pdf_document, even when rendering to docx. 3

I could just swap in an out various outputs as I need them, of course, but I ended up taking a step back at this point and instead considering what was going on when rendering my source files. If you’re like me, you’re using Rstudio as your IDE, and this gives you a handy ‘knit’ or ‘build’ button in your toolbar when it detects you’re in a Rmarkdown/Bookdown project. While nice, this abstraction actually ended up masking for me the underlying method. Really all these things are doing is calling bookdown::render_book, itself a wrap around rmarkdown::render. This function takes an argument for the format(s) to render when called, using the options found in _output.yml for each particular output format.
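Since rendering is just a function call, it can be invoked directly; for example (file paths as used elsewhere in this post):

```r
# Rendering without the Rstudio button: call the function the 'knit' button wraps
bookdown::render_book(here::here("index.Rmd"),
                      output_format = "bookdown::word_document2")
```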

Once I realised that rendering a book is just another function, I decided to just wrap it in my own function. I can then provide different, unique _output.yml files to each call to bookdown::render_book based on an argument I provide, using yaml and ymlthis packages to read and write the yaml files as I need them:

build_book <- function(format = "all") {
  switch(format,
    "all"  = formats <- c("bookdown::pdf_document2", "bookdown::word_document2"),
    "pdf"  = formats <- "bookdown::pdf_document2",
    "word" = formats <- "bookdown::word_document2"
  )
  for (fmt in formats) {
    if (grepl("pdf", fmt)) {
      out_yml <- yaml::read_yaml("_pdf_output.yml")
      ymlthis::use_yml_file(ymlthis::as_yml(out_yml), "_output.yml")
    }
    if (grepl("word", fmt)) {
      out_yml <- yaml::read_yaml("_word_output.yml")
      ymlthis::use_yml_file(ymlthis::as_yml(out_yml), "_output.yml")
    }
    bookdown::render_book(here::here("index.Rmd"), output_format = fmt)
    fs::file_delete("_output.yml")
  }
}

This pulls in specific yaml files from a project folder, and sets it as the _output.yml for Bookdown to find. Now, when rendering, rmarkdown::all_output_formats(knitr::current_input())[1] returns the correct file format for the current render, as it’s the only format in the output yaml file! This allowed me to put this at the start of my bookdown source files:

fmt <- rmarkdown::all_output_formats(knitr::current_input())[1]

And then test for output with:

if (!grepl("word", fmt)) {
  my_table %>%
    kableExtra::kbl(caption = "My full table caption",
                    caption.short = "Short caption for ToC",
                    booktabs = TRUE) %>%
    kableExtra::kable_styling(latex_options = c("striped", "HOLD_position", "scale_down"))
} else {
  my_table %>%
    knitr::kable(caption = "My full table caption", booktabs = TRUE)
}

This lets me, for example here, use kableExtra to further style my table for PDF/HTML output, but since kableExtra doesn’t really work for docx, just use regular ol’ knitr::kable to render it there!

Taking it further

Once I started writing my own book rendering wrapper function, the opportunities opened up in front of me. For example – one thing I like to do is build my drafts and save them into folders with today’s date, so I can put my feedback from lecturers in that same folder and keep it all together for easy reference. So I added the following to my build_book() function:

build_path <- glue::glue("book/builds/{Sys.Date()}")
bookdown_yml <- ymlthis::yml_bookdown_opts(
  .yml = ymlthis::yml_empty(),
  book_filename = "My Project",
  rmd_subdir = c("src"),
  delete_merged_file = FALSE,
  output_dir = build_path
)

Now, by creating a custom _bookdown.yml file for each call to build_book(), each build will save into a date-specific book/builds/DATE folder!

There are now tons of ways to further change how Bookdown builds your book based on arguments you provide to your own build_book, swapping in and out various Bookdown configuration files as required. Is this the most efficient way to customise your Bookdown project? Probably not, with all the creating and changing of existing yaml files, but speed isn’t really my concern here – running build_book and having to step back and wait for a minute or two for my code to run is much more preferable to the stupid amounts of time I spent correcting fiddly tables/plot/cross reference output by hand in Word before!

Word / .docx output tweaks

So now I can customise my output based on where I’m rendering, I need to fix a couple of other things in my docx output. These now need to be done by editing my reference document for Bookdown docx output. For a primer on creating and editing your reference doc, check out this article.
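As a sketch of what such an output file can contain (the file and path names here are illustrative, not taken from the post), a _word_output.yml pointing docx output at a reference doc might look like:

```yaml
# Illustrative _word_output.yml: point Bookdown's docx output at a reference doc
bookdown::word_document2:
  reference_docx: "assets/reference.docx"
  toc: false
```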

One extra thing I needed was numbered section headings, not just H1, H2 etc rendered as headings. This can’t just be set in Bookdown, but instead can be done in your reference doc by selecting your header, going to Format -> Style... -> Modify, and then finding this hidden gem of a menu in the lower left of the style formatting menu:

Then clicking Outline Numbered:

You can also click ‘customise’ here for some extra options, such as how much sub-headings should indent. Be aware you also need to do this for every heading you want numbered, so H1, H2, etc. I then combined this with build_book to choose whether to number headings in docx output:

build_book <- function(format = "word", word_num = TRUE) {
  switch(format,
    "all"  = formats <- c("bookdown::pdf_document2", "bookdown::word_document2"),
    "pdf"  = formats <- "bookdown::pdf_document2",
    "word" = formats <- "bookdown::word_document2"
  )
  for (fmt in formats) {
    if (grepl("pdf", fmt)) {
      out_yml <- yaml::read_yaml("_pdf_output.yml")
      ymlthis::use_yml_file(ymlthis::as_yml(out_yml), "_output.yml")
    }
    if (grepl("word", fmt)) {
      if (isTRUE(word_num)) {
        out_yml <- yaml::read_yaml("_word_output_numbered.yml")
      } else {
        out_yml <- yaml::read_yaml("_word_output.yml")
      }
      ymlthis::use_yml_file(ymlthis::as_yml(out_yml), "_output.yml")
    }
    bookdown::render_book(here::here("index.Rmd"), output_format = fmt)
    fs::file_delete("_output.yml")
  }
}

And just refer to the two differently formatted ref docx’s within _word_output_numbered.yml and _word_output.yml respectively. Each one gets swapped in when called upon.

Finally, I needed to put in a table of contents and lists of figures/tables at the start of the document. There really doesn’t seem to be an easy way to create ToCs in docx from Bookdown, but fortunately it’s just a quick click or two in Word to auto-generate them from document headings or existing document styles. 4 But, I wanted some blank pages at the start, with ‘Table of Contents’ etc as a heading on each one, so at least I had space for all these things set aside automatically that I didn’t need to think about.

You can include some basic latex functions in Rmarkdown, one of which is \newpage to start a new page. But, if I tried adding # Table of Contents as a header on a page at the start of the document, when I auto generate the ToC in Word this heading itself gets included in the ToC!

I ended up declaring pandoc custom style fences to give this page the style of ‘Title’ rather than ‘heading’, so it won’t be included in the auto generated ToC. I also used glue and the output detector from above to only include it in docx output:

if (grepl("word", fmt)) {
  word_toc <- glue::glue('
    \\newpage

    ::: {{custom-style="Title"}}
    Table of Contents
    :::

    \\newpage

    ::: {{custom-style="Title"}}
    List of Figures
    :::

    \\newpage

    ::: {{custom-style="Title"}}
    List of Tables
    :::

    \\newpage

    ::: {{custom-style="Title"}}
    List of Appendices
    :::
  ')
} else {
  word_toc <- NULL
}
word_toc

Remember to escape your forward slashes and your curly brackets in glue by doubling up!

Writing environment

Just as an aside, as great as Rstudio is, I wouldn’t really recommend it as the best pure writing environment. You can certainly make it a bit nicer with themes and by editing the font used, but you’re not going to get away from its original purpose as a coding environment. While fine for the coding parts of my projects, there was also going to be quite a lot of just plain ol’ prose writing, and I wanted to find somewhere a bit nicer to do those parts.

Although Rmarkdown’s .Rmd files are theoretically plain text files, I found options to be limited when looking for a writing app that can handle them. Initially I tried using VS Code, as while first and foremost a coding environment, it’s highly customisable. In particular, a solarized colour theme, an R markdown plugin, and discovering VS Code’s zen mode got me quite a long way towards a pleasant writing environment, but it still felt a little janky.

I’ve ended up settling on IA Writer as my main ‘writing’ app of choice. It handles .Rmd files without a problem, feels super smooth, and full-screen provides a nice, distraction free writing environment. It’s not particularly cheap, but you get what you pay for. As a bonus, I can use Working Copy to sync my git repo for each project to my mobile devices and continue working on them in IA Writer’s mobile apps just as if I was on desktop.

If IA Writer is a bit too steep in cost, Typora is a really nice alternative. It’s currently free while in beta, so not sure how much longer that’s going to be the case, but it also handles .Rmd files really well. I just preferred some of the library/file management options in IA myself.

On mobile, 1Writer or Textastic are also great choices for iOS, and JotterPad on Android is also great for markdown, although I haven’t been able to check it works with .Rmd files myself yet.

Reproducibility

Once I start cobbling together workflows and hacks like those I’ve detailed in this blog post, I invariably start worrying about stability and reproducibility. 5 What happens when I want to start a new project? Or if I lose or break my laptop and I need to start over again on a new device?

This is where working in code really pays off, and where having recently tried my hand at R package development comes in handy. See, now I know just enough about how R works with packages to put my custom project functions – like build_book() – in an R/ folder within my project, and have the usethis package create me a DESCRIPTION file that lists my project’s dependencies.

I can then also drop all my book building assets – like my custom reference docs for docx files, custom latex preambles 6, bibliography files, the works – into that same project, push it to GitHub, and turn that whole thing into a template repository. 7 Now, anytime I need to create a new Bookdown project, I can just clone that repo, and run devtools::load_all() to have my custom build_book function back in action!

Taking it a step further, I can use the new-ish renv package to create a renv.lock file in my template repo that I can then use to exactly re-install the package dependencies I might need for any book project. 8
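The renv workflow itself is only a few calls; a minimal sketch (run from within the project):

```r
# Sketch of the renv workflow described above
renv::init()      # create a project-local library and an initial renv.lock
renv::snapshot()  # record the exact package versions currently in use
renv::restore()   # on a fresh clone, reinstall exactly those versions
```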

As a side bonus, while reassuring myself in creating a stable and reproducible project setup, this also follows a lot of the steps recommended to turn your research into a research compendium. This means that someone else can also download the git repositories I create this way, and similarly, use my renv.lock or package dependencies in DESCRIPTION to very quickly get up and running with a similar environment to mine to test out my work for themselves. I could even take it a step further and include a Dockerfile and link to an online Rstudio server using the holepunch package to recreate it exactly!

Wrapping up

This blog post is really the result of a couple weekends of StackOverflow diving and thinking about what I really need for my next big writing project. It can be a bit overwhelming working with something with as many options as Bookdown/Rmarkdown. The sheer amount of configurations and things like the integration with Rstudio can lead you to forget that, underneath it all, is still just code you can pull apart and make do what you want with a little sideways thinking. I found a lot of questions from people trying to make Bookdown do something from within, when really a little meta-coding with a little function that just swaps out different _bookdown.yml files as needed would do the trick.

I hope this post helps a little bit for those out there in the same boat as me. Us poor souls that still have to deal with docx and Bookdown need to stick together. If you have your own Bookdown/Rmarkdown hints and tips, tweet them to me, I’d love to hear them!

1. I like Scrivener, but man, be ready to spend some time reading the manual and figuring out all the little things you can tweak.
2. There is knitr::is_latex_output() and knitr::is_html_output() but no is_word. You could work in the negative, but I wanted a way to be explicitly clear about exactly the format being rendered.
3. Maybe I was doing something wrong, but it never worked for me.
4. And nicely, table and figure captions from kable/knitr get given their own styles that can be used to generate separate indexes in Word. See this help article for more on generating lists of figures/tables in Word from document styles.
5. My dissertation project was such a house of cards by the end that I was scared to change anything in any configuration menu/file in case I broke the whole thing.
6. Now that’s a whole other rabbit hole to fall down…
7. For a fun trick, set up this template repo with exactly the folder structure you want in all your projects, and use .keep hidden files to keep otherwise empty directories synced in git. Useful for that ‘data’ folder you want to keep in the template but don’t actually have any data in yet.
8. renv is great, and near as I can tell is aiming to replicate the functionality of things like npm’s package.json file and python’s virtual environments. Highly recommend checking out, it’s worked very smoothly for me so far.

The post Bookdown, Not Bookout first appeared on R-bloggers.

### Announcing the 2020 RStudio Table Contest

Mon, 09/14/2020 - 20:00

[This article was first published on RStudio Blog, and kindly contributed to R-bloggers].

Here at RStudio, we love tables. No, not tables like the one pictured above, but data tables.
Did you know that data tables have been used since the 2nd century? They’ve really been around.

Tables are a fantastic way to communicate lists of quantitative and qualitative information. Sometimes, tables can fall very short of their potential for greatness. But that was the past: we now have some excellent R packages at our disposal to generate well-designed and functional presentation tables. And because of this renaissance of table-making in R, we’re announcing a contest: The 2020 RStudio Table Contest. It will run from September 15th to October 31st, 2020.

One thing we love about the R community is how open and generous you are in sharing the code and process you use to solve problems. This lets others learn from your experience and invites feedback to improve your work. We hope this contest encourages more sharing and helps to recognize the many outstanding ways people work with and display data with R.

Contest Judging Criteria

Tables will be judged based on technical merit, artistic design, and quality of documentation. We recognize that some tables may excel in only one category and others in more than one or all categories. Honorable mentions will be awarded with this in mind.

We are working with the maintainers of many of the R community’s most popular table-making packages, including Yihui Xie of DT, Rich Iannone of gt, Greg Lin of reactable, David Gohel of flextable, David Hugh-Jones of huxtable, and Hao Zhu of kableExtra. Many of these maintainers will help review submissions built with their packages.

Requirements

A submission must include all the code and data used to replicate your entry. This may be a fully knitted R Markdown document with code (for example, published to RPubs or shinyapps.io), a repository, or an rstudio.cloud project.

A submission can use any table-making package available in R, not just the ones mentioned above.

Submission Types – We are looking for three types of table submissions:

1. Single Table Example: This may highlight interesting structuring of content, useful and tricky features – for example, enabling interaction – or serve as an example of a common table popular in a specific field. Be sure to document your code for clarity.
2. Tutorial: It’s all about teaching us how to craft an excellent table or understand a package’s features. This may include several tables and narrative.
3. Other: For submissions that do not easily fit into one of the types above.

Category – Given that tables have different features and purposes, we’d also like you to further categorize your submission. There are four categories: static-HTML, interactive-HTML, static-print, and interactive-Shiny. Simply choose the one that best fits your table.

You can submit your entry for the contest by filling out the form at rstd.io/table-contest-2020. The form will generate a post on RStudio Community, which you can then edit further at a later date. You may make multiple entries.

The deadline for submissions is October 31st, 2020, at midnight Pacific Time.

Prizes

Grand Prize

• One year of shinyapps.io Basic plan or one year of RStudio.Cloud Premium.
• Additionally, any number of RStudio t-shirts, books, and mugs (worth up to $200).

Unfortunately, we may not be able to send t-shirts, books, or other items larger than stickers to non-US addresses for which shipping and customs costs are high.

Honorable Mention

• A good helping of hex stickers for RStudio packages plus a side of hexes for table-making packages, and other goodies.

Tables Gallery

Previous Shiny Contests have driven significant improvement to the popular Shiny Gallery and we hope this contest will spur development of a similar Tables Gallery. Winners and other participants may be invited to feature their work on such a resource.

We will announce the winners and their submissions on the RStudio blog, RStudio Community, and also on Twitter.


The post Announcing the 2020 RStudio Table Contest first appeared on R-bloggers.

### Generating probabilities for ordinal categorical data

Mon, 09/14/2020 - 20:00

[This article was first published on ouR data generation, and kindly contributed to R-bloggers.]

Over the past couple of months, I’ve been describing various aspects of the simulations that we’ve been doing to get ready for a meta-analysis of convalescent plasma treatment for hospitalized patients with COVID-19, most recently here. As I continue to do that, I want to provide motivation and code for a small but important part of the data generating process, which involves creating probabilities for ordinal categorical outcomes using a Dirichlet distribution.

Motivation

The outcome for the analysis that we will be conducting is the WHO 11-point ordinal scale for clinical improvement at 14 days, which ranges from 0 (uninfected and out of the hospital) to 10 (dead), with various stages of severity in between. We plan to use a Bayesian proportional odds model to assess the effectiveness of the therapy. Since this is a meta-analysis, we will be including these data from a collection of studies being conducted around the world.

Typically, in a proportional odds model one has to make an assumption about proportionality. In this case, while we are willing to make that assumption within specific studies, we are not willing to make that assumption across the various studies. This means we need to generate a separate set of intercepts for each study that we simulate.

In the proportional odds model, we are modeling the log-cumulative odds at a particular level. The simplest model with a single exposure/treatment covariate for a specific study or cluster $$k$$ is

$log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{xk} + \beta A,$
where $$x$$ ranges from 1 to 10, all the levels of the WHO score excluding the lowest level $$x=0$$. $$A$$ is the treatment indicator, and is $$A=1$$ for patients who receive the treatment. $$\alpha_{xk}$$ is the intercept for each study/cluster $$k$$. $$\beta$$ is interpreted as the log-odds ratio comparing the odds of the treated with the non-treated within each study. The proportionality assumption kicks in here when we note that $$\beta$$ is constant for all levels of $$x$$. In addition, in this particular model, we are assuming that the log-odds ratio is constant across studies (not something we will assume in a more complete model). We make no assumptions about how the study intercepts relate to each other.
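The proportionality assumption can be made concrete with a quick numeric sketch (this is an illustration, not code from the post; the intercept and effect values are arbitrary). At every cutoff $$x$$, the cumulative odds ratio comparing treated with non-treated is the same constant, $$e^\beta$$:

```r
# Arbitrary illustrative intercepts alpha_x for cutoffs x = 1..6, and a
# treatment effect beta on the log-odds scale
alpha <- c(-2.2, -1.5, -0.8, 0.1, 0.9, 1.7)
beta <- 0.7

# cumulative log-odds that score < x for control (A = 0) and treated (A = 1)
lodds_0 <- alpha + beta * 0
lodds_1 <- alpha + beta * 1

# the cumulative odds ratio at each cutoff is exp(beta) everywhere
round(exp(lodds_1 - lodds_0), 4)
## [1] 2.0138 2.0138 2.0138 2.0138 2.0138 2.0138
```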

To make clear what it would mean to make a stronger assumption about the odds across studies consider this model:

$log \left( \frac{P(\text{score}_{k} < x )}{P(\text{score}_{k} \ge x) } \right) = \alpha_{x} + b_k + \beta A,$

where the intercepts for each study are related, since they are defined as $$\alpha_{x} + b_k$$, and share $$\alpha_x$$ in common. If we compare the log-odds of the treated in one study $$k$$ with the log-odds of treated in another study $$j$$ (so $$A=1$$ in both cases), the log-odds ratio is $$b_j - b_k$$. The ratio is independent of $$x$$, which implies a strong proportional odds assumption across studies. In contrast, the same comparison across studies based on the first model is $$\alpha_{xj} - \alpha_{xk}$$, which is not necessarily constant across different levels of $$x$$.

This is a long way of explaining why we need to generate different sets of intercepts for each study. In short, we would like to make the more relaxed assumption that odds are not proportional across studies or clusters.

The Dirichlet distribution

In order to generate ordinal categorical data I use the genOrdCat function in the simstudy package. This function requires a set of baseline probabilities that sum to one; these probabilities map onto level-specific intercepts. There will be a distinct set of baseline probabilities for each study and I will create a data set for each study. The challenge is to be able to generate unique baseline probabilities as if I were sampling from a population of studies.

If I want to generate a single probability (i.e. a number between $$0$$ and $$1$$), a good solution is to draw a value from a beta distribution, which has two shape parameters $$\alpha$$ and $$\beta$$.

Here is a single draw from $$beta(3, 3)$$:

```r
set.seed(872837)
rbeta(1, shape1 = 3, shape2 = 3)
## [1] 0.568
```

The mean of the beta distribution is $$\alpha/(\alpha + \beta)$$ and the variance is $$\alpha\beta/\left[(\alpha+\beta)^2(\alpha + \beta + 1)\right]$$. We can reduce the variance and maintain the same mean by increasing $$\alpha$$ and $$\beta$$ by a constant factor (see addendum for a pretty picture):
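As a quick numeric check (an addition, not from the original post; the shape values 3 and 7 are arbitrary), the closed-form mean and variance can be compared against a large simulated sample:

```r
set.seed(1)
a <- 3
b <- 7
x <- rbeta(1e5, shape1 = a, shape2 = b)

# theoretical mean and variance of beta(a, b)
m_theory <- a / (a + b)
v_theory <- (a * b) / ((a + b)^2 * (a + b + 1))

c(mean(x), m_theory)  # both close to 0.3
c(var(x), v_theory)   # both close to 0.019
```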

```r
library(data.table)

d1 <- data.table(s = 1, value = rbeta(1000, shape1 = 1, shape2 = 2))
d2 <- data.table(s = 2, value = rbeta(1000, shape1 = 5, shape2 = 10))
d3 <- data.table(s = 3, value = rbeta(1000, shape1 = 100, shape2 = 200))

dd <- rbind(d1, d2, d3)
dd[, .(mean(value), sd(value)), keyby = s]

##    s    V1     V2
## 1: 1 0.338 0.2307
## 2: 2 0.336 0.1195
## 3: 3 0.333 0.0283
```

The Dirichlet distribution is a multivariate version of the beta distribution where $$K$$ values between $$0$$ and $$1$$ are generated, with the caveat that they sum to $$1$$. Instead of $$\alpha$$ and $$\beta$$, the Dirichlet is parameterized by a vector of length $$K$$

$\boldsymbol{\alpha} = \left(\alpha_1,\dots, \alpha_K\right)^T,$

where there are $$K$$ levels of the ordinal outcome. A draw from this distribution returns a vector $$\boldsymbol{p} = ( p_1, \dots, p_K)^T$$ where $$\sum_{i=1}^K p_i = 1$$ and

$E(p_k)=\frac{\alpha_k}{\sum_{i=1}^K \alpha_i}.$
A draw from a Dirichlet distribution with $$K=2$$ is actually equivalent to a draw from a beta distribution where $$\boldsymbol{\alpha} = (\alpha, \beta)^T$$. Before, I generated data from a $$beta(1, 2)$$, and now here is a draw from $$Dirichlet\left(\boldsymbol\alpha = (1,2)\right)$$ using rdirichlet from the gtools package:

```r
library(gtools)

dir <- rdirichlet(1000, alpha = c(1, 2))
head(dir)

##        [,1]  [,2]
## [1,] 0.3606 0.639
## [2,] 0.4675 0.533
## [3,] 0.2640 0.736
## [4,] 0.0711 0.929
## [5,] 0.5643 0.436
## [6,] 0.0188 0.981
```

The first column has the same distribution as the $$beta$$ distribution from before; the mean and standard deviation are close to the values estimated above:

```r
c(mean(dir[,1]), sd(dir[,1]))
## [1] 0.332 0.236
```

To ramp things up a bit, say we have $$K = 5$$ and the target mean values for each level are $$\boldsymbol{p} = \left(\frac{1}{9}, \frac{2}{9}, \frac{3}{9}, \frac{2}{9}, \frac{1}{9} \right)$$; one way to specify this is:

```r
dir_1 <- rdirichlet(1000, alpha = c(1, 2, 3, 2, 1))
head(dir_1)

##        [,1]   [,2]   [,3]   [,4]   [,5]
## [1,] 0.1710 0.6637 0.0676 0.0633 0.0343
## [2,] 0.1130 0.1150 0.2803 0.4229 0.0689
## [3,] 0.1434 0.0678 0.3316 0.1721 0.2851
## [4,] 0.0250 0.1707 0.3841 0.2490 0.1712
## [5,] 0.0633 0.3465 0.4056 0.0853 0.0993
## [6,] 0.1291 0.1510 0.3993 0.2612 0.0593
```

Here are the observed means for each $$p_k$$, pretty close to the target:

```r
apply(dir_1, 2, mean)
## [1] 0.111 0.221 0.328 0.229 0.112
```

Of course, we could generate data with a similar target $$\boldsymbol{p}$$ by multiplying $$\boldsymbol\alpha$$ by a constant $$c$$. In this case, we use $$c=10$$ and see that the average values for each $$p_k$$ are also close to the target:

```r
dir_2 <- rdirichlet(1000, alpha = c(10, 20, 30, 20, 10))
apply(dir_2, 2, mean)
## [1] 0.113 0.222 0.334 0.220 0.111
```

There is a key difference between specifying $$\boldsymbol{\alpha}$$ and $$c\boldsymbol{\alpha}$$. Just as in the beta distribution, as $$c$$ grows larger, the variation within each $$p_k$$ decreases. This will be useful when generating the study-specific probabilities if we want to explore different levels of variation.

Here are the standard deviations from the two data sets just generated:

```r
apply(dir_1, 2, sd)
## [1] 0.102 0.131 0.144 0.134 0.098

apply(dir_2, 2, sd)
## [1] 0.0333 0.0425 0.0508 0.0421 0.0333
```

Generating the baseline probabilities

A simple function that includes two key arguments – the base probabilities (which are really $$\boldsymbol{\alpha}$$) and a similarity index (which is really just the constant $$c$$) – implements these ideas to generate study-specific probabilities for each outcome level. As the similarity index increases, the variation across studies or sites decreases. The function includes an additional adjustment to ensure that the row totals sum exactly to $$1$$ and not to some value infinitesimally greater than $$1$$ as a result of rounding. Such a rounding error could cause problems for the function genOrdCat.

```r
genBaseProbs <- function(n, base, similarity, digits = 8) {

  n_levels <- length(base)
  x <- rdirichlet(n, similarity * base)

  #--- ensure that each vector of probabilities sums exactly to 1

  x <- round(floor(x * 1e8) / 1e8, digits)  # round the generated probabilities
  xpart <- x[, 1:(n_levels - 1)]            # drop the base prob of the final level
  partsum <- apply(xpart, 1, sum)           # add the values of levels 1 to K-1
  x[, n_levels] <- 1 - partsum              # base prob of level K = 1 - sum(1:[K-1])

  return(x)
}
```

In this first example, I am generating 11 values (representing base probabilities) for each of 9 studies using a relatively low similarity index, showing you the first six studies:

```r
basestudy <- genBaseProbs(
  n = 9,
  base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
  similarity = 15
)

round(head(basestudy), 3)

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
## [1,] 0.094 0.022 0.121 0.100 0.061 0.102 0.053 0.309 0.059 0.078 0.000
## [2,] 0.025 0.079 0.043 0.197 0.083 0.044 0.099 0.148 0.025 0.150 0.107
## [3,] 0.007 0.042 0.084 0.066 0.049 0.145 0.191 0.323 0.078 0.012 0.003
## [4,] 0.061 0.021 0.063 0.104 0.092 0.292 0.112 0.110 0.113 0.026 0.008
## [5,] 0.067 0.023 0.021 0.042 0.063 0.473 0.108 0.127 0.016 0.013 0.046
## [6,] 0.001 0.018 0.054 0.225 0.150 0.301 0.043 0.081 0.100 0.008 0.020
```

A great way to see the variability is a cumulative probability plot for each individual study. With a relatively low similarity index, you can generate quite a bit of variability across the studies. In order to create the plot, I need to first calculate the cumulative probabilities:

```r
library(ggplot2)
library(viridis)

cumprobs <- data.table(t(apply(basestudy, 1, cumsum)))
n_levels <- ncol(cumprobs)
cumprobs[, id := .I]

dm <- melt(cumprobs, id.vars = "id", variable.factor = TRUE)
dm[, level := factor(variable, labels = c(0:10))]

ggplot(data = dm, aes(x = level, y = value)) +
  geom_line(aes(group = id, color = id)) +
  scale_color_viridis(option = "D") +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position = "none")
```

Here is a plot of data generated using a similarity index of 150. Variation is reduced pretty dramatically:
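The rendered figure is not included in this version of the post, but the plot can be regenerated by reusing genBaseProbs and the plotting code above with the similarity argument changed to 150 (a sketch that assumes the libraries and function defined earlier are already loaded):

```r
# assumes genBaseProbs, data.table, ggplot2, viridis, and gtools from above
basestudy_150 <- genBaseProbs(
  n = 9,
  base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
  similarity = 150
)

cumprobs <- data.table(t(apply(basestudy_150, 1, cumsum)))
cumprobs[, id := .I]

dm <- melt(cumprobs, id.vars = "id", variable.factor = TRUE)
dm[, level := factor(variable, labels = c(0:10))]

ggplot(data = dm, aes(x = level, y = value)) +
  geom_line(aes(group = id, color = id)) +
  scale_color_viridis(option = "D") +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        legend.position = "none")
```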

Using base probabilities to generate ordinal data

Now that we have these base probabilities, the last step is to use them to generate ordinal outcomes. I am generating the simplest of data sets: 9 “studies”, each with 500 subjects, without any covariates or even treatment assignment. Since genOrdCat requires an adjustment variable, I am adjusting everyone by 0. (This is something I need to fix – there should be no such requirement.)

```r
library(simstudy)

d_study <- genData(9, id = "study")
d_ind <- genCluster(d_study, "study", numIndsVar = 500, "id")
d_ind[, z := 0]
d_ind

##       study   id z
##    1:     1    1 0
##    2:     1    2 0
##    3:     1    3 0
##    4:     1    4 0
##    5:     1    5 0
##   ---
## 4496:     9 4496 0
## 4497:     9 4497 0
## 4498:     9 4498 0
## 4499:     9 4499 0
## 4500:     9 4500 0
```

To generate the ordinal categorical outcome, we have to treat each study separately since they have unique baseline probabilities. This can be accomplished using lapply in the following way:

```r
basestudy <- genBaseProbs(
  n = 9,
  base = c(0.05, 0.06, 0.07, 0.11, 0.12, 0.20, 0.12, 0.09, 0.08, 0.05, 0.05),
  similarity = 50
)

list_ind <- lapply(X = 1:9, function(i) {
  b <- basestudy[i, ]
  d_x <- d_ind[study == i]
  genOrdCat(d_x, adjVar = "z", b, catVar = "ordY")
})
```

The output list_ind is a list of data tables, one for each study. For example, here is the 5th data table in the list:

```r
list_ind[[5]]

##      study   id z ordY
##   1:     5 2001 0    7
##   2:     5 2002 0    9
##   3:     5 2003 0    5
##   4:     5 2004 0    9
##   5:     5 2005 0    9
##  ---
## 496:     5 2496 0    9
## 497:     5 2497 0    4
## 498:     5 2498 0    7
## 499:     5 2499 0    5
## 500:     5 2500 0   11
```

And here is a table of proportions for each study that we can compare with the base probabilities:

```r
t(sapply(list_ind, function(x) x[, prop.table(table(ordY))]))

##          1     2     3     4     5     6     7     8     9    10    11
## [1,] 0.106 0.048 0.086 0.158 0.058 0.162 0.092 0.156 0.084 0.028 0.022
## [2,] 0.080 0.024 0.092 0.134 0.040 0.314 0.058 0.110 0.028 0.110 0.010
## [3,] 0.078 0.050 0.028 0.054 0.148 0.172 0.162 0.134 0.058 0.082 0.034
## [4,] 0.010 0.056 0.116 0.160 0.054 0.184 0.102 0.084 0.156 0.056 0.022
## [5,] 0.010 0.026 0.036 0.152 0.150 0.234 0.136 0.084 0.120 0.026 0.026
## [6,] 0.040 0.078 0.100 0.092 0.170 0.168 0.196 0.050 0.038 0.034 0.034
## [7,] 0.006 0.064 0.058 0.064 0.120 0.318 0.114 0.068 0.082 0.046 0.060
## [8,] 0.022 0.070 0.038 0.160 0.182 0.190 0.074 0.068 0.070 0.036 0.090
## [9,] 0.054 0.046 0.052 0.128 0.100 0.290 0.102 0.092 0.080 0.030 0.026
```

Of course, the best way to compare is to plot the data for each study. Here is another cumulative probability plot, this time including the observed (generated) probabilities in black over the baseline probabilities used in the data generation in red:
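The rendered figure is missing from this version of the post. A sketch that could produce a comparable comparison plot (assuming the basestudy and list_ind objects from above, and that every outcome level appears in each study so the proportion tables all have 11 entries) is:

```r
# assumes data.table and ggplot2 are loaded, and basestudy / list_ind exist
obs <- t(sapply(list_ind, function(x) x[, prop.table(table(ordY))]))
cum_obs <- t(apply(obs, 1, cumsum))        # observed cumulative proportions
cum_base <- t(apply(basestudy, 1, cumsum)) # baseline cumulative probabilities

d <- rbindlist(list(
  data.table(id = rep(1:9, each = 11), level = rep(0:10, times = 9),
             value = as.vector(t(cum_obs)), src = "observed"),
  data.table(id = rep(1:9, each = 11), level = rep(0:10, times = 9),
             value = as.vector(t(cum_base)), src = "baseline")
))

ggplot(d, aes(x = level, y = value, color = src)) +
  geom_line() +
  facet_wrap(~ id) +
  scale_color_manual(values = c(baseline = "red", observed = "black")) +
  theme(panel.grid = element_blank(), legend.position = "none")
```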

Sometime soon, I plan to incorporate something like the function genBaseProbs into simstudy to make it easier to incorporate non-proportionality assumptions into simulation studies that use ordinal categorical outcomes.

Addendum

The variance of the beta distribution (and similarly the Dirichlet distribution) decreases as $$\alpha$$ and $$\beta$$ both increase proportionally (keeping the mean constant). I’ve plotted the variance of the beta distribution for $$\alpha = 1$$ and different levels of $$\beta$$ and $$C$$. It is clear that at any level of $$\beta$$ (I’ve drawn a line at $$\beta = 1$$), the variance decreases as $$C$$ increases. It is also clear that, holding $$\alpha$$ constant, the relationship of $$\beta$$ to variance is not strictly monotonic:

```r
var_beta <- function(params) {
  a <- params[1]
  b <- params[2]
  (a * b) / ((a + b)^2 * (a + b + 1))
}

loop_b <- function(C, b) {
  V <- sapply(C, function(x) var_beta(x * c(1, b)))
  data.table(b, V, C)
}

b <- seq(.1, 25, .1)
C <- c(0.01, 0.1, 0.25, 0.5, 1, 2, 4, 10, 100)

d_var <- rbindlist(lapply(b, function(x) loop_b(C, x)))

ggplot(data = d_var, aes(x = b, y = V, group = C)) +
  geom_vline(xintercept = 1, size = .5, color = "grey80") +
  geom_line(aes(color = factor(C))) +
  scale_y_continuous(name = expression("Var beta"~(alpha==1~","~beta))) +
  scale_x_continuous(name = expression(beta)) +
  scale_color_viridis(discrete = TRUE, option = "B", name = "C") +
  theme(panel.grid = element_blank(), legend.title.align = 0.15)
```


The post Generating probabilities for ordinal categorical data first appeared on R-bloggers.

### Gold-Mining Week 1 (2020)

Mon, 09/14/2020 - 19:09

[This article was first published on R – Fantasy Football Analytics, and kindly contributed to R-bloggers.]

Week 1 Gold Mining and Fantasy Football Projection Roundup now available. (Better late than never!)

The post Gold-Mining Week 1 (2020) appeared first on Fantasy Football Analytics.
