How to interactively examine any R code – 4 ways to not just read the code, but delve into it step-by-step
(This article was first published on Jozef's Rblog, and kindly contributed to R-bloggers)
Introduction
As pointed out by the recent Read the R source post on the R-hub blog, reading the actual code, not just the documentation, is a great way to learn more about programming and implementation details. But there is one more activity that provides even more hands-on experience and understanding of code in practice.
In this post, we provide tips on how to interactively debug R code step-by-step and investigate the values of objects in the middle of function execution. We will look at doing this for both exported and non-exported functions from different packages. We will also look at interactively debugging generics and methods, using functionality provided by base R.
Contents
- Interactively examining functions with debug() and debugonce()
- Debugging non-exported functions using :::
- Conveniently debugging methods with debugcall()
- Inserting debugging code anywhere inside a function body with trace()
- References
Interactively examining functions with debug() and debugonce()
The two key functions we will use for interactive investigation of code are debug() and debugonce(). Calling debug() on a function sets a debugging flag on it. When the function is then executed, execution proceeds one step at a time, giving us the option to investigate exactly what is going on in the context of that function call, similarly to placing browser() at a certain point in our code.
Let us see a quick example:
debug(order)
order(10:1)
When running the second line, code execution stops inside order() and we can freely run the function line by line.
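Once execution stops, we are at the standard browser() prompt; base R's debugger commands apply, and we can verify the debug flag with isdebugged():

```r
# At the Browse[2]> prompt:
#   n     - evaluate the next statement, stepping over function calls
#   s     - evaluate the next statement, stepping into function calls
#   f     - finish execution of the current loop or function
#   c     - continue execution until the function returns
#   Q     - quit the debugger
#   where - print the current call stack
# Outside the browser, we can check whether a function is flagged:
debug(order)
isdebugged(order)  # TRUE
undebug(order)
isdebugged(order)  # FALSE
```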
When we no longer want to have the function flagged for debugging, call undebug():
undebug(order)
Alternatively, if we only want the function in debug mode for a single execution, we can call debugonce() on it. This approach is also safer, since there is no need to undebug() later:
debugonce(order)
order(10:1)
Debugging non-exported functions using :::
The great thing about debug() and debugonce() is that they let us interactively investigate not just the code we are currently writing, but any interpreted R function. To debug functions not exported from a package namespace, we can use :::. For example, we normally cannot access the list_rmds() function from the blogdown package, as it is not exported.
# This will not work
library(blogdown)
debugonce(list_rmds)
## Error in debugonce(list_rmds): object 'list_rmds' not found

# This will not work either
debugonce(blogdown::list_rmds)
## Error: 'list_rmds' is not an exported object from 'namespace:blogdown'
If we need to, we can still debug it by using ::: to access it in the package namespace:
# This will work
debugonce(blogdown:::list_rmds)
This is particularly useful when debugging nested calls inside package code, which tend to use unexported functions.
Conveniently debugging methods with debugcall()
Many R functions are implemented as S3 generics that call the appropriate method based on the class of the arguments. A good example of this approach is aggregate(). Looking at its code, we see that it only dispatches to the proper method based on the arguments provided:
body(stats::aggregate)
## UseMethod("aggregate")
Using debug(aggregate) would therefore not be very useful for interactive investigation, as we most likely want to look at the method that is dispatched to, to actually see what is going on.
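To find out which method will actually run, we can list the methods registered for the generic and retrieve a specific one with base R's methods() and getS3method():

```r
# Which methods exist for the aggregate() generic?
methods(aggregate)
# Retrieve a specific method, even a non-exported one:
getS3method("aggregate", "data.frame")
```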
For this purpose, we can use debugcall(), which will conveniently take us directly to the method. In the following case, it is the data.frame method of the aggregate() generic:
eval(debugcall(
  aggregate(mtcars["hp"], mtcars["carb"], FUN = mean),
  once = TRUE
))
As seen above, we can also use the once = TRUE argument to only debug the call once.
For more technical details, the reference provided by ?debugcall is a great resource. This is also true for ?debug and ?trace which I also strongly recommend reading.
Inserting debugging code anywhere inside a function body with trace()
If debugonce() and friends are not sufficient for our purposes and we want to insert advanced debugging code at different places within a function body, we can use trace() to do just that.
Imagine for example we would like to investigate a specific place in the code of the aforementioned stats::aggregate.data.frame method. First, we can explore the function body:
as.list(body(stats::aggregate.data.frame))
## [[1]]
## `{`
##
## [[2]]
## if (!is.data.frame(x)) x <- as.data.frame(x)
##
## [[3]]
## FUN <- match.fun(FUN)
##
## [[4]]
## if (NROW(x) == 0L) stop("no rows to aggregate")
##
## [[5]]
## if (NCOL(x) == 0L) {
##     x <- data.frame(x = rep(1, NROW(x)))
##     return(aggregate.data.frame(x, by, function(x) 0L)[seq_along(by)])
## }
##
## [[6]]
## if (!is.list(by)) stop("'by' must be a list")
##
## [[7]]
## if (is.null(names(by)) && length(by)) names(by) <- paste0("Group.",
##     seq_along(by)) else {
##     nam <- names(by)
##     ind <- which(!nzchar(nam))
##     names(by)[ind] <- paste0("Group.", ind)
## }
##
## [[8]]
## if (any(lengths(by) != NROW(x))) stop("arguments must have same length")
##
## [[9]]
## y <- as.data.frame(by, stringsAsFactors = FALSE)
##
## [[10]]
## keep <- complete.cases(by)
##
## [[11]]
## y <- y[keep, , drop = FALSE]
##
## [[12]]
## x <- x[keep, , drop = FALSE]
##
## [[13]]
## nrx <- NROW(x)
##
## [[14]]
## ident <- function(x) {
##     y <- as.factor(x)
##     l <- length(levels(y))
##     s <- as.character(seq_len(l))
##     n <- nchar(s)
##     levels(y) <- paste0(strrep("0", n[l] - n), s)
##     as.character(y)
## }
##
## [[15]]
## grp <- lapply(y, ident)
##
## [[16]]
## multi.y <- !drop && ncol(y)
##
## [[17]]
## if (multi.y) {
##     lev <- lapply(grp, function(e) sort(unique(e)))
##     y <- as.list(y)
##     for (i in seq_along(y)) y[[i]] <- y[[i]][match(lev[[i]],
##         grp[[i]])]
##     eGrid <- function(L) expand.grid(L, KEEP.OUT.ATTRS = FALSE,
##         stringsAsFactors = FALSE)
##     y <- eGrid(y)
## }
##
## [[18]]
## grp <- if (ncol(y)) {
##     names(grp) <- NULL
##     do.call(paste, c(rev(grp), list(sep = ".")))
## } else integer(nrx)
##
## [[19]]
## if (multi.y) {
##     lev <- as.list(eGrid(lev))
##     names(lev) <- NULL
##     lev <- do.call(paste, c(rev(lev), list(sep = ".")))
##     grp <- factor(grp, levels = lev)
## } else y <- y[match(sort(unique(grp)), grp, 0L), , drop = FALSE]
##
## [[20]]
## nry <- NROW(y)
##
## [[21]]
## z <- lapply(x, function(e) {
##     ans <- lapply(X = split(e, grp), FUN = FUN, ...)
##     if (simplify && length(len <- unique(lengths(ans))) == 1L) {
##         if (len == 1L) {
##             cl <- lapply(ans, oldClass)
##             cl1 <- cl[[1L]]
##             ans <- unlist(ans, recursive = FALSE)
##             if (!is.null(cl1) && all(vapply(cl, identical, NA,
##                 y = cl1)))
##                 class(ans) <- cl1
##         }
##         else if (len > 1L)
##             ans <- matrix(unlist(ans, recursive = FALSE), nrow = nry,
##                 ncol = len, byrow = TRUE, dimnames = if (!is.null(nms <- names(ans[[1L]])))
##                   list(NULL, nms))
##     }
##     ans
## })
##
## [[22]]
## len <- length(y)
##
## [[23]]
## for (i in seq_along(z)) y[[len + i]] <- z[[i]]
##
## [[24]]
## names(y) <- c(names(by), names(x))
##
## [[25]]
## row.names(y) <- NULL
##
## [[26]]
## y
Now we can choose a point in the function body where we would like to explore interactively. For example, the 21st element, starting with z <- lapply(x, function(e) {, may be of interest. In that case, we can call:
trace(stats::aggregate.data.frame, tracer = browser, at = 21)
## Tracing function "aggregate.data.frame" in package "stats"
## [1] "aggregate.data.frame"
And see that this has added a call to .doTrace() to the function body:
as.list(body(stats::aggregate.data.frame))[[21L]]
## {
##     .doTrace(browser(), "step 21")
##     z <- lapply(x, function(e) {
##         ans <- lapply(X = split(e, grp), FUN = FUN, ...)
##         if (simplify && length(len <- unique(lengths(ans))) ==
##             1L) {
##             if (len == 1L) {
##                 cl <- lapply(ans, oldClass)
##                 cl1 <- cl[[1L]]
##                 ans <- unlist(ans, recursive = FALSE)
##                 if (!is.null(cl1) && all(vapply(cl, identical,
##                   NA, y = cl1)))
##                   class(ans) <- cl1
##             }
##             else if (len > 1L)
##                 ans <- matrix(unlist(ans, recursive = FALSE),
##                   nrow = nry, ncol = len, byrow = TRUE, dimnames = if (!is.null(nms <- names(ans[[1L]])))
##                     list(NULL, nms))
##         }
##         ans
##     })
## }
When we now call the aggregate() function on a data.frame, execution will stop at our selected point in the data.frame method:
aggregate(mtcars["hp"], mtcars["carb"], FUN = mean)
When done debugging, use untrace() to cancel the tracing:
untrace(stats::aggregate.data.frame)
## Untracing function "aggregate.data.frame" in package "stats"
Happy investigating and debugging!
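One more trick worth knowing: the tracer passed to trace() does not have to be browser. Any quoted expression works, which is handy for non-interactive logging. A small sketch (the at = 21 position assumes the function body shown earlier):

```r
# Log a message at step 21 instead of dropping into the browser:
trace(stats::aggregate.data.frame,
      tracer = quote(cat("aggregating", NROW(x), "rows\n")),
      at = 21)
aggregate(mtcars["hp"], mtcars["carb"], FUN = mean)
untrace(stats::aggregate.data.frame)
```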
References
R documentation of the referenced functions:
- R documentation on debug(), debugonce(), etc.
- R documentation on trace(), untrace(), etc.
- R documentation on debugcall()
- Debugging chapter of Writing R Extensions
- Debugging with RStudio
- Jenny Bryan’s Accessing R Source
- R-hub blog’s Read the R source!
- Advanced R’s Debugging and Debugging, condition handling, and defensive programming chapters
To leave a comment for the author, please follow the link and comment on their blog: Jozef's Rblog. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
How do we combine errors, in biology? The delta method
(This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers)
In a recent post I showed that we can build linear combinations of model parameters (see here). For example, if we have two parameter estimates, say Q and W, with standard errors respectively equal to \(\sigma_Q\) and \(\sigma_W\), we can build a linear combination as follows:
\[Z = AQ + BW + C\]
where A, B and C are three coefficients. The standard error for this combination can be obtained as:
\[ \sigma_Z = \sqrt{ A^2 \sigma^2_Q + B^2 \sigma^2_W + 2AB \sigma_{QW} }\]
In biology, nonlinear transformations are much more frequent than linear ones, e.g. \(Z = \exp(Q + W)\) or \(Z = 1/(Q + W)\). What is the standard error of these nonlinear transformations? This is not a complex problem, but the solution may be beyond biologists with an average level of statistical proficiency. It is known as the 'delta method' and it provides so-called 'delta standard errors'. I thought it might be useful to talk about it, using very simple language and a few examples.
Example 1: getting the half-life of a herbicide
A herbicide has been shown to follow first-order degradation kinetics in soil, with a degradation rate constant \(k = -0.035\) and a standard error of \(0.00195\). What is the half-life (\(T_{1/2}\)) of this herbicide, and what is its standard error?
Every pesticide chemist knows that the half-life (\(T_{1/2}\)) is derived from the degradation rate, according to the following equation:
\[T_{1/2} = \frac{\log(0.5)}{k}\]
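This expression follows directly from the first-order kinetic model: the concentration declines exponentially, and the half-life is the time at which half of the initial amount remains:

\[ C(t) = C_0 \, e^{k t} \quad \Rightarrow \quad e^{k \, T_{1/2}} = \frac{C(T_{1/2})}{C_0} = 0.5 \quad \Rightarrow \quad T_{1/2} = \frac{\log(0.5)}{k} \]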
Therefore, the half-life for our herbicide is:
Thalf <- log(0.5)/-0.035
Thalf
## [1] 19.80421
But… what is the standard error of this half-life? There is some uncertainty around the estimate of \(k\), and it is clear that this uncertainty should propagate to the estimate of \(T_{1/2}\); unfortunately, the transformation is nonlinear and we cannot use the expression given above for linear transformations.
The basic idea
The basic idea behind the delta method is that most of the simple nonlinear functions we use in biology can be locally approximated by the tangent line through the point of interest. For example, our nonlinear half-life function is \(Y = \log(0.5)/X\) and, obviously, we are interested in the point where \(X = k = -0.035\). In the graph below, we have represented our nonlinear function (in black) and its tangent line (in red) through the above point: we can see that the approximation is fairly good in the close vicinity of \(X = -0.035\).
What is the equation of the tangent line? In general, if the nonlinear function is \(G(X)\), you may remember from high school that the slope \(m\) of the tangent line is equal to the first derivative of \(G(X)\), that is \(G'(X)\). You may also remember that the equation of a line with slope \(m\) through the point \(P(X_1, Y_1)\) is \(Y - Y_1 = m(X - X_1)\). As \(Y_1 = G(X_1)\), the tangent line has equation:
\[Y = G(X_1) + G'(X_1)(X - X_1)\]
We need the derivative!
In order to write the equation of the red line in the figure above, we consider that \(X_1 = -0.035\) and we need the first derivative of our nonlinear half-life function. I am not able to derive the expression of the first derivative for all nonlinear functions, and it was a relief to discover that R can handle this task in simple ways, e.g. by using the function D(). For our case, it is:
D(expression(log(0.5)/X), "X")
## -(log(0.5)/X^2)
Therefore, we can use this R function to calculate the slope \(m\) of the tangent line in the figure above:
X <- -0.035
m <- eval( D(expression(log(0.5)/X), "X") )
m
## [1] 565.8344
We already know that \(G(-0.035) = 19.80421\). Therefore, we can write the equation of the tangent line (red line in the graph above):
\[Y = 19.80421 + 565.8344 \, (X + 0.035)\]
that is:
\[Y = 19.80421 + 565.8344 \, X + 565.8344 \cdot 0.035 = 39.60841 + 565.8344 \, X\]
Replacing a curve with a line
Now, we have two functions:
- the original nonlinear half-life function \(Y = \log(0.5)/X\)
- a new linear function (\(Y = 39.60841 + 565.8344 \, X\)), that is a very close approximation to the previous one, at least near to the point \(X = -0.035\), which we are interested in.
Therefore, we can approximate the former with the latter! If we use the linear function, we see that the half-life is:
39.60841 + 565.8344 * -0.035
## [1] 19.80421
which is what we expected. The advantage is that we can now use the law of propagation of errors to estimate the standard error (see the first and second equation in this post):
\[ \sigma_{ \left[ 39.60841 + 565.8344 \, X \right]} = \sqrt{ 565.8344^2 \, \sigma^2_X }\]
Here we go:
sqrt( m^2 * (0.00195 ^ 2) )
## [1] 1.103377
In general…
If we have a nonlinear transformation \(G(X)\), the standard error for this transformation is approximated by knowing the first derivative \(G'(X)\) and the standard error of \(X\):
\[\sigma_{G(X)} \simeq \sqrt{ [G'(X)]^2 \, \sigma^2_X }\]
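We can sanity-check this approximation with a quick Monte Carlo simulation: draw many values of \(k\) from its (assumed Gaussian) sampling distribution, transform each one, and look at the standard deviation of the transformed values. This is only a sketch under that normality assumption:

```r
set.seed(1234)
k <- rnorm(100000, mean = -0.035, sd = 0.00195)  # simulated estimates of k
Thalf <- log(0.5)/k                              # transformed values
sd(Thalf)  # close to the delta standard error of 1.103
```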
Example 2: a back-transformed count
A paper reports that the mean number of microorganisms in a substrate, on a logarithmic scale, was \(X_1 = 5\) with standard error \(\sigma = 0.84\). It is easy to derive that the actual number of microorganisms was \(\exp(5) = 148.4132\); what is the standard error of the back-transformed mean?
The first derivative of our nonlinear function is:
D(expression(exp(X)), "X")
## exp(X)
and thus the slope of the tangent line is:
X <- 5
m <- eval( D(expression(exp(X)), "X") )
m
## [1] 148.4132
According to the formula above, the standard error for the back-transformed mean is:
sigma <- 0.84
sqrt( m^2 * sigma^2 )
## [1] 124.6671
Example 3: Selenium concentration in olive drupes
The concentration of selenium in olive drupes was found to be \(3.1 \, \mu g \,\, g^{-1}\), with a standard error of 0.8. What is the intake of selenium when eating one drupe? Please consider that one drupe weighs, on average, 3.4 g (SE = 0.31) and that selenium concentration and drupe weight have a covariance of 0.55.
The amount of selenium is easily calculated as:
X <- 3.1; W = 3.4
X * W
## [1] 10.54
Delta standard errors can be obtained by considering the partial derivatives with respect to each of the two variables:
mX <- eval( D(expression(X*W), "X") )
mW <- eval( D(expression(X*W), "W") )
and combining them as follows:
sigmaX <- 0.8; sigmaW <- 0.31; sigmaXW <- 0.55
sqrt( (mX^2)*sigmaX^2 + (mW^2)*sigmaW^2 + 2*mX*mW*sigmaXW )
## [1] 4.462726
For those of you who would like to get involved with matrix notation: we can reach the same result via matrix multiplication (see below). This may be easier when we have more than two variables to combine.
der <- matrix(c(mX, mW), 1, 2)
sigma <- matrix(c(sigmaX^2, sigmaXW, sigmaXW, sigmaW^2), 2, 2, byrow = T)
sqrt( der %*% sigma %*% t(der) )
##          [,1]
## [1,] 4.462726
The delta method with R
In R there is a shortcut function to calculate delta standard errors, available in the 'car' package. In order to use it, we need to have:
- a named vector for the variables that we have to combine
- an expression for the transformation
- a variance-covariance matrix
For the first example, we have:
obj <- c("k" = -0.035)
sigma <- matrix(c(0.00195^2), 1, 1)
library(car)
## Loading required package: carData
deltaMethod(object = obj, g="log(0.5)/k", vcov = sigma)
##            Estimate       SE    2.5 %   97.5 %
## log(0.5)/k 19.80421 1.103377 17.64163 21.96678
For the second example:
obj <- c("X1" = 5)
sigma <- matrix(c(0.84^2), 1, 1)
deltaMethod(object = obj, g="exp(X1)", vcov = sigma)
##         Estimate       SE     2.5 %   97.5 %
## exp(X1) 148.4132 124.6671 -95.92978 392.7561
For the third example:
obj <- c("X" = 3.1, "W" = 3.4)
sigma <- matrix(c(0.8^2, 0.55, 0.55, 0.31^2), 2, 2, byrow = T)
deltaMethod(object = obj, g="X * W", vcov = sigma)
##       Estimate       SE    2.5 %   97.5 %
## X * W    10.54 4.462726 1.793218 19.28678
The deltaMethod() function is very handy when used with model objects, since then we need to provide nothing but the transformation function. But that is something for another post!
However, two final notes relating to the delta method need to be pointed out here:
- the delta standard error is always approximate;
- if the original variables are Gaussian, the transformed variable usually is not Gaussian.
Thanks for reading!
To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians.
Create Animation in R : Learn by Examples
(This article was first published on ListenData, and kindly contributed to R-bloggers)
This tutorial covers various ways to create animated charts or plots using R. Animation is an important element of data visualization: animated charts are visually appealing and grab the audience's attention. Many online data-visualization tools can generate animated charts, but most of them are paid, and they also ask you to upload data to their servers, which can be a data breach if you work with real-world client data. Since R is open source, you can download it for free and create animated charts without moving data to any external server.
Simple Animation in R
Let's create dummy data for illustration. In the program below, we generate two columns containing random observations. The first column, named A, contains 50 observations ranging from 1 to 75. The second column contains the same number of observations but over a different range.
df = data.frame(A=sample(1:75, 50, replace=TRUE),B=sample(1:100, 50, replace=TRUE),
stringsAsFactors = FALSE)
gganimate package is used for animation in R. It is an extension of popular package for graphics – ggplot2 package.
library(ggplot2)
library(tidyverse)
library(gganimate)
library(directlabels)
library(png)
library(transformr)
library(grid)
ggplot(df, aes(A, B)) +
geom_line() +
transition_reveal(A) +
labs(title = 'A: {frame_along}')
geom_line() is used for creating the line chart. transition_reveal(A) lets the data gradually appear along A. frame_along gives the position that the current frame corresponds to.
What is a frame and rendering in animation?
In animation, a frame is one of the many still images that compose the complete moving picture. Rendering is the computation that produces the final output. The gganimate package renders 100 frames by default; you can change this via the nframes= parameter of the animate() function.
p = ggplot(df, aes(A, B)) +
geom_line() +
transition_reveal(A) +
labs(title = 'A: {frame_along}')
animate(p, nframes=40)
How to save animated plot in GIF format file?
You can use the anim_save(filename, plot) function to export an animated chart as a GIF file.
anim_save("basic_animation.gif", p)
Frames per Second (fps)
fps is the number of frames shown per second. You can set it with the fps parameter of the animate() function. By default, it is 10 frames per second.
animate(p, nframes=40, fps = 2)
Decreasing fps from the default of 10 slows down the animation.
How to stop the loop in animation?
Looping means continuously repeating the animation over and over again. To end the loop, use the renderer = gifski_renderer(loop = FALSE) option in the animate() function.
animate(p, renderer = gifski_renderer(loop = FALSE))
How to change the layout of the plot?
You can change the height and width of the plot by specifying the size in the animate() function.
animate(p, fps = 10, duration = 14, width = 800, height = 400)
Advanced Animation in R : Examples
Prepare Data for Example
In this example, we will create bar chart showing change in monthly sales figure of different products.
dates = paste(rep(month.abb[1:10], each=10), 2018)
df = data.frame(Product=rep(sample(LETTERS[1:10],10), 10),
Period=factor(dates, levels=unique(dates)),
Sales=sample(1:100,100, replace = TRUE))
# Ranking by Period and Sales
df = df %>%
  arrange(Period, Sales) %>%
  mutate(order = 1:n())
head(df)
  Product    Period Sales order
1       E  Jan 2018    15     1
2       H  Jan 2018    34     2
3       F  Jan 2018    42     3
4       E  Jan 2018    49     4
5       J  Jan 2018    49     5
6       C  Jan 2018    60     6
# Animation
p = df %>%
ggplot(aes(order, Sales)) +
geom_bar(stat = "identity", fill = "#ff9933") +
labs(title='Total Sales in {closest_state}', x=NULL) +
theme(plot.title = element_text(hjust = 0.5, size = 18)) +
scale_x_continuous(breaks=df$order, labels=df$Product, position = "top") +
transition_states(Period, transition_length = 1, state_length = 2) +
view_follow(fixed_y=TRUE) +
ease_aes('cubic-in-out')
animate(p, nframes=50, fps=4)
anim_save("bar_animation.gif", p)
Detailed Explanation
- transition_states() animates the plot by a categorical or discrete variable. "States" are the animation sequences that play; when a state transition is triggered, the new state's animation sequence runs. In this case, the state is the Period column. state_length is the relative length of the pause at each state; transition_length is the relative length of the transition between states.
- view_follow(fixed_y=TRUE) means y-axis would be fixed when animation is running.
- ease_aes() controls the easing of the motion; 'cubic-in-out' starts slowly, speeds up, and then decelerates.
- You can set theme using theme_set(theme_minimal())
Recently the BJP secured a majority in the Lok Sabha election; they first contested a Lok Sabha election in 1984. INC (Indian National Congress) used to be the biggest political party in India a decade ago. Here we look at the trend in the percentage of seats won by these two parties from 1984 to 2019. Source of data: Election Commission of India.
library(ggplot2)
library(tidyverse)
library(gganimate)
library(directlabels)
library(png)
library(transformr)
library(grid)
# Read Data
df = read.table(text =
" Year Perc_Seats Party
1984 0.79 INC
1989 0.38 INC
1991 0.45 INC
1996 0.27 INC
1998 0.27 INC
1999 0.22 INC
2004 0.28 INC
2009 0.4 INC
2014 0.09 INC
2019 0.1 INC
1984 0 BJP
1989 0.17 BJP
1991 0.23 BJP
1996 0.31 BJP
1998 0.35 BJP
1999 0.35 BJP
2004 0.27 BJP
2009 0.23 BJP
2014 0.52 BJP
2019 0.56 BJP
", header=TRUE)
# Set Theme
theme_set(theme_minimal())
# Plot and animate
p =
ggplot(data = df, aes(x= factor(Year), y=Perc_Seats, group=Party, colour=Party)) +
geom_line(size=2, show.legend = FALSE) +
scale_color_manual(values=c("#ff9933", "#006400")) +
scale_x_discrete(position = "top") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = 'Lok Sabha Election : % of seats won',
x = NULL, y = NULL) +
geom_text(aes(label=scales::percent(Perc_Seats, accuracy = 1),
vjust= -2), show.legend = FALSE) +
theme(plot.title = element_text(hjust = 0.5)) +
geom_dl(aes(label=Party), method="last.points") +
transition_reveal(Year) +
coord_cartesian(clip = 'off') +
ease_aes('cubic-in-out')
animate(p, fps = 10, width = 800, height = 400)
anim_save("election.gif", p)
How to save animated plot as video
Make sure ffmpeg is installed on your system before using the code below. It is available for download for all the operating systems.
animate(p, renderer = ffmpeg_renderer(), width = 800, height = 450)
anim_save("election.mp4")
About Author:
Deepanshu founded ListenData with a simple objective – Make analytics easy to understand and follow. He has over 8 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains.
Let's Get Connected: LinkedIn
To leave a comment for the author, please follow the link and comment on their blog: ListenData.
Predicting Car Battery Failure With R And H2O – Study
(This article was first published on R-Analytics, and kindly contributed to R-bloggers)
Using R and H2O Isolation Forest to predict car battery failures.
Carlos Kassab
2019-May-24
This is a study of what might be possible if car makers started using machine learning in our cars to predict failures.
# Loading libraries
suppressWarnings( suppressMessages( library( h2o ) ) )
suppressWarnings( suppressMessages( library( data.table ) ) )
suppressWarnings( suppressMessages( library( plotly ) ) )
suppressWarnings( suppressMessages( library( DT ) ) )
# Reading data file
# Data from: https://www.kaggle.com/yunlevin/levin-vehicle-telematics
dataFileName = "/Development/Analytics/AnomalyDetection/AutomovileFailurePrediction/v2.csv"
carData = fread( dataFileName, skip=0, header = TRUE )
carBatteryData = data.table( TimeStamp = carData$timeStamp
, BatteryVoltage = as.numeric( carData$battery )
)
rm(carData)
# Data cleaning, filtering and conversion
carBatteryData = na.omit( carBatteryData ) # Keeping just valid Values
# According to this article:
# https://shop.advanceautoparts.com/r/advice/car-maintenance/car-battery-voltage-range
#
# A perfect voltage ( without any devices or electronic systems plugged in )
# is between 13.7 and 14.7V.
# If the battery isn't fully charged, it will diminish to 12.4V at 75% charge,
# 12V when it's only operating at 25%, and down to 11.9V when it's completely discharged.
#
# Battery voltage while a load is connected is much lower:
# it should be somewhere between 9.5V and 10.5V.
#
# This value interval ensures that your battery can store and deliver enough
# current to start your car and power all your electronics and electric devices
# without any difficulty
carBatteryData = carBatteryData[BatteryVoltage >= 9.5] # Filtering voltages greater or equal to 9.5
carBatteryData$TimeStamp = as.POSIXct( paste0( substr(carBatteryData$TimeStamp,1,17),"00" ) )
carBatteryData = unique(carBatteryData) # Removing duplicate voltage readings
carBatteryData = carBatteryData[order(TimeStamp)]
# Splitting the data, using the last date as testing data and the rest for training.
lastDate = max( as.Date( format( carBatteryData$TimeStamp, "%Y-%m-%d" ) ) )
trainingData = carBatteryData[ as.Date( format( carBatteryData$TimeStamp, "%Y-%m-%d" ) ) != lastDate ]
testingData = carBatteryData[ as.Date( format( carBatteryData$TimeStamp, "%Y-%m-%d" ) ) == lastDate ]
################################################################################
# Creating Anomaly Detection Model
################################################################################
h2o.init( nthreads = -1, max_mem_size = "5G" ) ##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\LaranIkal\AppData\Local\Temp\Rtmp6lTw4H/h2o_LaranIkal_started_from_r.out
## C:\Users\LaranIkal\AppData\Local\Temp\Rtmp6lTw4H/h2o_LaranIkal_started_from_r.err
##
##
## Starting H2O JVM and connecting: Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 seconds 899 milliseconds
## H2O cluster timezone: America/Mexico_City
## H2O data parsing timezone: UTC
## H2O cluster version: 3.24.0.2
## H2O cluster version age: 1 month and 7 days
## H2O cluster name: H2O_started_from_R_LaranIkal_tzd452
## H2O cluster total nodes: 1
## H2O cluster total memory: 4.44 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.6.0 (2019-04-26)
h2o.no_progress() # Disable progress bars for Rmd
h2o.removeAll() # Cleans h2o cluster state.
## [1] 0
# Convert the training dataset to H2O format.
trainingData_hex = as.h2o( trainingData[,2], destination_frame = "train_hex" )
# Build an Isolation forest model
trainingModel = h2o.isolationForest( training_frame = trainingData_hex
, sample_rate = 0.1
, max_depth = 32
, ntrees = 100
)
# According to H2O doc:
# http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
#
# Isolation Forest is similar in principle to Random Forest and is built on the basis of decision trees.
# Isolation Forest creates multiple decision trees to isolate observations.
#
# Trees are split randomly. The assumption is that:
#
# IF A UNIT'S MEASUREMENTS ARE SIMILAR TO THE OTHERS',
# IT WILL TAKE MORE RANDOM SPLITS TO ISOLATE IT.
#
# The fewer splits needed, the more likely the unit is anomalous.
#
# The average number of splits is then used as a score.
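The idea above can be illustrated with a toy sketch in plain R; this is not H2O's implementation, just the intuition that an outlying value takes fewer random splits to isolate:

```r
# Toy illustration of the isolation idea (NOT H2O's implementation):
# keep picking a random split point, keep the side containing the
# target value, and count the splits needed to isolate it.
isolation_depth = function(x, target, max_depth = 50) {
  depth = 0
  while (length(x) > 1 && depth < max_depth) {
    s = runif(1, min(x), max(x))             # random split point
    x = if (target < s) x[x < s] else x[x >= s]
    depth = depth + 1
  }
  depth
}
set.seed(1)
v = c(rnorm(100, 12.5, 0.2), 9.6)  # typical voltages plus one low outlier
mean(replicate(200, isolation_depth(v, 9.6)))   # outlier: isolated in few splits
mean(replicate(200, isolation_depth(v, v[1])))  # typical value: needs more splits
```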
# Calculate score for training dataset
score <- h2o.predict( trainingModel, trainingData_hex )
result_pred <- as.vector( score$predict )
################################################################################
# Setting threshold value for anomaly detection.
################################################################################
# Setting desired threshold percentage.
threshold = .995 # Let's say 99.5% of the voltage values are correct
# Using the above threshold to get the score limit used to flag anomalous voltage readings.
scoreLimit = round( quantile( result_pred, threshold ), 4 )
################################################################################
# Get anomalous voltage readings from the testing data, using the model and the scoreLimit obtained from the training data.
################################################################################
# Convert testing data frame to H2O format.
testingDataH2O = as.h2o( testingData[,2], destination_frame = "testingData_hex" )
# Get score using training model
testingScore <- h2o.predict( trainingModel, testingDataH2O )
# Add row score at the beginning of testing dataset
testingData = cbind( RowScore = round( as.vector( testingScore$predict ), 4 ), testingData )
# Check if there are anomalous voltage readings from testing data
anomalies = testingData[ testingData$RowScore > scoreLimit, ] # An additional filter to ensure the maintenance recommendation is warranted
# If there are more than 3 anomalous voltage readings, display an alert.
if( dim( anomalies )[1] > 3 ) {
cat( "Show alert on car display: Battery got anomalous voltage readings, it is recommended to take it to service." )
plot_ly( data = anomalies
, x = ~TimeStamp
, y = ~BatteryVoltage
, type = 'scatter'
, mode = "lines"
, name = 'Anomalies') %>%
layout( yaxis = list( title = 'Battery Voltage.' )
, xaxis = list( categoryorder='trace', title = 'Date - Time.' )
)
} ## Show alert on car display: Battery got anomalous voltage readings, it is recommended to take it to service.
if( dim( anomalies )[1] > 3 ) {
DT::datatable(anomalies[,c(2,3)], rownames = FALSE )
}
TimeStamp              BatteryVoltage
2018-01-31T14:15:00Z   10.175
2018-01-31T15:29:00Z   14.88
2018-01-31T15:29:00Z   14.92
2018-01-31T15:32:00Z   10.38
2018-01-31T20:38:00Z   10.12
2018-02-01T00:50:00Z   10.43
2018-02-01T01:02:00Z   9.727
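The core of the alerting logic above is just quantile thresholding, and it can be sketched in base R alone. The scores below are simulated stand-ins for the h2o.predict() output, so the exact numbers are illustrative only.

```r
# Hedged sketch of the quantile-threshold step, with simulated scores
# standing in for the h2o.predict() output used in the post.
set.seed(42)
train_scores <- c(runif(995, 0, 0.6), runif(5, 0.8, 1))  # mostly normal readings
test_scores  <- c(runif(95, 0, 0.6), runif(5, 0.85, 1))  # a few injected anomalies

# Same idea as the post: the 99.5% quantile of training scores is the limit
score_limit <- round(quantile(train_scores, 0.995), 4)

# Readings whose score exceeds the limit are flagged
anomalous <- test_scores[test_scores > score_limit]

# An alert would fire only if more than 3 readings are anomalous
fire_alert <- length(anomalous) > 3
```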
Using this approach we may prevent failures in cars, not only for batteries but in many other cases where sensors are used.
Carlos Kassab
https://www.linkedin.com/in/carlos-kassab-48b40743/
We are using R, more information about R:
https://www.r-bloggers.com
To leave a comment for the author, please follow the link and comment on their blog: R-Analytics. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Practical Data Science with R, half off sale!
(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)
Our publisher, Manning, is running a Memorial Day sale this weekend (May 24-27, 2019), with a new offer every day.
- Fri: Half off all eBooks
- Sat: Half off all MEAPs
- Sun: Half off all pBooks and liveVideos
- Mon: Half off everything
The discount code is: wm052419au.
Many great opportunities to get Practical Data Science with R 2nd Edition at a discount!!!
Rstudio & ThinkR roadshow – June 6 – Paris
(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)
On June the 6th, 2019, Rstudio is partnering with ThinkR to offer you a one day event around “R in production”. See you in Paris!
If you’re an experienced developer or a decision-maker looking to learn more about what R and RStudio have to offer, then this event is made for you!
During the first part of the event, there will be two workshops, from 2 to 5:
- From 2.30 to 3.30: How to create a production-ready Shiny application with the {golem} package.
- From 4.00 to 5.00: Best practices for using RStudio products in production.
Following these “technical” workshops, we’re happy to welcome you for a cocktail reception.
In this second part of the evening, we’ll be talking about real-life applications of R in production: scaling scripts and Shiny applications, coding for durability, maintaining…
The doors will open at 13h30 for the workshops. You can also join us at 17h30 for the cocktail and use-cases.
The number of places is limited. Registration is free but mandatory: follow this link to register.
Full program:
The post Rstudio & ThinkR roadshow – June 6 – Paris appeared first on (en) The R Task Force.
[R]eady for Production: a Joint Event with RStudio and EODA
(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)
We’re excited to team up with EODA, an RStudio Full Service Certified Partner, to host a free data science in production event in Frankfurt, Germany, on June 13. This one-day event will be geared for data science and IT teams that want to learn how to integrate their analysis solutions with the optimal IT infrastructure.
This is a great chance to work in smaller groups with experts from EODA and RStudio on best-practice approaches to productive data-science architectures, and to see real-world solutions to deployment problems.
With sessions in English and German, the conference will start with a high-level overview of the right interaction between data science and IT, and then focus on more hands-on solutions to engineering problems such as building APIs with Plumber and deploying to RStudio Connect, using Python and SQL in the RStudio IDE, and Shiny load testing.
For more information, and to secure your spot, head to the registration page!
[R]eady for Production with EODA and RStudio
Royal Society of Biology: Introduction to Reproducible Analyses in R
(This article was first published on Emma R, and kindly contributed to R-bloggers)
Learn to experiment with R to make analyses and figures more reproducible.
If you’re in the UK and not too far from York you might be interested in a Royal Society of Biology course which forms part of the Industry Skills Certificate. More details at this link
Introduction to Reproducible Analyses in R
24 June 2019 from 10:00 until 16:00
The course is aimed at researchers at all stages of their careers interested in experimenting with R to make their analyses and figures more reproducible.
No previous coding experience is assumed and you can work on your own laptop or use a computer at the training venue, the University of York.
Spotlight on: Julia Silge, Stack Overflow
(This article was first published on RBlog – Mango Solutions, and kindly contributed to R-bloggers)
Julia Silge is joining us as one of our keynote speakers at EARL London 2019. We can’t wait to hear Julia’s full keynote, but until then she kindly answered a few questions. Julia shared with us what we can expect from her address – which will focus on how Stack Overflow uses R and their recent developer survey.
Hi Julia! Tell us about the Stack Overflow Developer Survey and your role at Stack Overflow
The Stack Overflow Developer Survey is the largest and most comprehensive survey of people who code around the world each year. This year, we had almost 90,000 respondents who shared their opinions on topics including their favourite technologies, their priorities in looking for a job, and what music they listen to while coding. I am the data scientist who works on this survey, and I am involved throughout the process from initial design to writing copy about results. We have an amazing team who works together on this project, including a project manager, designers, community managers, marketers, and developers.
My role focuses on data analysis. Before the survey was fielded, I worked with one of our UX researchers on question writing, so that our expectations for data analysis were aligned, as well as using data from previous years’ surveys and our site to choose which technologies to include this year. After the survey was fielded, I cleaned and analyzed the data, created data visualizations, and wrote the text for both our developer-facing and business-facing reports.
Why did you use R to analyse the survey?
All of our data science tooling at Stack Overflow is R-centric, but specifically, with our annual survey, we are working with a complex dataset on a tight schedule and the R ecosystem provides the fluent data analysis tools we need to deliver compelling results on time. From munging complicated raw data to creating beautiful visualizations to delivering data deliverables via an API, R is the right tool for the job for us.
Were there results from the survey this year that came as a surprise?
This is such a rich dataset to get to work with, full of interesting things to notice! One result this year that I didn’t expect ahead of time was with our question about whether a respondent eventually wanted to move from technical work into people management. We found that younger, less experienced respondents were more likely to say that they wanted to make the switch! Once I thought about it more carefully, I came to think that those more experienced folks with an interest in managing probably had already shifted careers and were not there to answer that question anymore. Another result that was a surprise to me was just how many different kinds of metal people listen to, more than I even knew existed!
Do you see the gender imbalance improving?
Although our annual survey has a broad capacity for informing useful and actionable conclusions, including about gender, our results don’t represent everyone in the developer community evenly. We know that people from marginalized groups and underrepresented groups in tech participate on Stack Overflow at lower rates than they participate in the software workforce. This means that we undersample such groups in our survey (because of how we invite respondents to the survey, mostly on our site itself). Over the past few years, we have seen incremental improvement in the proportion of responses that are from marginalized or underindexed groups such as minority genders or minority racial/ethnic groups; we are so happy to see this because we want to hear from everyone who codes, everywhere. We believe the biggest driver of this kind of positive change is and will continue to be improving the balance of who participates on Stack Overflow itself, and we are committed to making Stack Overflow a more welcoming and inclusive platform. This kind of work can be difficult and slow, but we are in it for the long haul.
What future trends might you be able to predict from the survey?
One trend we’ve seen over the past several years that I expect to continue is the normalization of salaries for data work. Several years ago, people who worked as data scientists were extreme outliers in salary. Salaries for data scientists have started to move toward the norm for software engineering work, especially if you control for education (for example, comparing a data scientist with a master’s degree to a software engineer with a master’s degree). I don’t see this as entirely bad news, because it is associated with some standardization of data science as a role and increased industry agreement about what a data scientist is, what a data engineer is, how to hire for these roles, and what career paths might look like.
Given Python’s rise again this year, do you see this continuing? How will this affect the use of R?
Python has exhibited a meteoric rise over the past several years and is the fastest-growing major programming language in the world. Python has been climbing in the ranks of our survey over the past several years, edging past first PHP, then C#, then Java this year. It currently sits just below SQL in the ranking. I have a hard time imagining that next year more developers will say they use Python than say they use SQL! You can dig this interview up next year and point out my prediction failure if I am wrong.
In terms of R and R’s future, it’s important to note that R’s use has also been growing dramatically on Stack Overflow, both absolutely and relatively. R is now a top 10 to top 15 programming language (both in questions asked and traffic). Data technologies are in general growing a lot, and there are many factors that go into an individual or an organization deciding to embrace R, or Python, or both.
Thanks Julia!
You can catch Julia and a whole host of other brilliant speakers at EARL London on 10-12 September at The Tower Hotel London.
We have discounted early bird tickets available for a limited time – please visit the EARL site for more information, we hope to see you there!
Bayesian estimation of fatality rates and accidents involving cyclists on Queensland roads
(This article was first published on R – Daniel Oehm | Gradient Descending, and kindly contributed to R-bloggers)
In my previous post I built a Shiny app mapping accidents on Queensland roads which was great at showing the problematic areas within cities and regional areas. I have added to this by estimating the fatality rate given the observed accidents and the rate of accidents involving cyclists for SA3 and SA4 areas. I have also updated the filters, making them tidier. What follows is commentary on what can be found in the Shiny app. If you want to jump straight to it, run
shiny::runGitHub("doehm/road-accidents/", "doehm", launch.browser = TRUE)
Most dangerous areas
I used a Bayesian approach to estimate the fatality rate for 2017 (the data isn’t complete for 2018) and presented it as a proportion of the number of observed road accidents. The data dates back to 2001 but it’s reasonable to use the most recent data to allow for improvements in road conditions, policies, advertising, population growth, etc which may have an underlying impact on the risk of road accidents.
To construct a prior I used years 2014-2016 at SA3 level. By taking 10,000 draws from the posterior we estimate the fatality rate distributions for each area. The bands represent the 80% and 95% prediction interval.
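The draws described above can be sketched with base R. The post's posterior_f() function (shown later) uses rgamma(shape = a + y, rate = b + n), i.e. a Gamma(a, b) prior with a Poisson count model; the counts below are invented for illustration.

```r
# Sketch of 10,000 posterior draws for one area's fatality rate, matching
# the Gamma posterior form rgamma(shape = a + y, rate = b + n) used in the post.
a <- 1.3; b <- 77   # hyperparameters (the values estimated later in the post)
y <- 12;  n <- 850  # invented fatality and accident counts for one area

set.seed(1)
draws <- rgamma(1e4, shape = a + y, rate = b + n)

# Median plus the 80% and 95% bands shown in the forest plots
bands <- quantile(draws, c(0.025, 0.1, 0.5, 0.9, 0.975))
```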
The top 3 most dangerous SA4’s for 2017 are Mackay – Isaac – Whitsunday, Wide Bay and the Queensland outback. It’s not surprising that the large majority of fatalities occurred at high speeds on highways and back roads. Having said that, a few have occurred closer to the town centers in these areas. It’s possible the road conditions are not particularly good but that’s hard to determine. The road condition filter only has 4 states, sealed (dry / wet) and unsealed (dry / wet), and doesn’t offer much more information. It appears dry conditions on sealed roads are when most accidents occur.
The 3 least dangerous areas are Brisbane North, Brisbane Inner City and Brisbane South. Again, it shouldn’t be surprising: given the speed limit is lower in the city, if accidents occur they are generally less severe. There are a lot of rear-end accidents and accidents when leaving a car park – the classic.
At the SA3 level the incident rates focus on smaller geographical areas which can highlight other features. For example, at the SA4 level the area often includes the city centers and the Hinterland regions, which tend to have different rates, often higher in the Hinterlands. This is largely due to the differing speed limits. The most dangerous areas are Burnett, Port Douglas and the Ipswich Hinterlands. The least dangerous are Chermside, Brisbane Inner West and Springfield – Redbank. Click the image to zoom in; there are a lot of SA3’s.
Most dangerous areas for cyclists
As a cyclist it’s good to know which roads to pay particular attention to. In a similar way we’ll look at the rate of road accidents involving cyclists by area. An influencing factor is how many cyclists are on the roads relative to cars and other vehicles. If the area has a strong cycling culture it’s more likely that an accident will involve a cyclist.
It’s not surprising the dense city centers with more traffic have more accidents involving cyclists. At the SA4 level it’s pretty clear that Brisbane Inner City is a dangerous place to ride a bike, particularly on Adelaide street, which you can see from the app (go to show filter > unit type involved > bicycle). The rate of accidents involving cyclists in Brisbane Inner City is significantly higher than in all other SA4’s. I’m curious to know if the growth of the CityCycle scheme (which is awesome by the way) is a contributing factor. Although, given it was introduced in 2011 and 2005-2010 saw the highest rate of growth in accidents involving cyclists, it probably isn’t the biggest factor in the high rate of accidents, though it should contribute if more people are on bikes – they’re also not the easiest bikes in the world to ride. The other areas in the top 3 are Brisbane City West and Cairns. Sheridan Street, the main drag in Cairns, is where you need to be extra vigilant.
Here Noosa tops the chart for 2017. It sits within the Sunshine Coast SA4 region, ranking 5th in the chart above, which also includes the Hinterland regions, which is why it scores lower above. Noosa has a strong cycling culture which could be influencing its high rate. With a whopping median of 18%, definitely keep your eyes peeled when cycling on those roads. Also, take care in Sherwood – Indooroopilly. Keep in mind that these areas are not necessarily where the most accidents involving cyclists occur. Rather, if there is an accident it’s more likely to have involved a cyclist relative to other areas. If you’re driving in those areas keep an eye out for cyclists on the road; perhaps there aren’t marked cycling lanes or off-road bike paths, or there are some dodgy intersections.
Ideally the rate would be adjusted for the number of cyclists on the road, however this is still a useful statistic for finding the more dangerous areas. This analysis has also been done for pedestrians involved in car accidents, which can be found in the app. Just a heads up, be careful in Fortitude Valley.
Code bits
Run the shiny app
library(shiny)
runGitHub("doehm/road-accidents/", "doehm", launch.browser = TRUE)
Load the data from Github.
library(repmis)
source_data("https://github.com/doehm/road-accidents/raw/master/data/road-accident-data.Rdata")
Estimate hyperparameters
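The estimation code that follows matches Beta-distribution moments to the sample mean and variance. As a standalone sanity check (on simulated rates, not the accident data), the same method-of-moments mapping recovers known parameters:

```r
# Standalone method-of-moments check for a Beta(alpha, beta) prior:
# mu = alpha/(alpha+beta), s2 = alpha*beta/((alpha+beta)^2*(alpha+beta+1)).
mom_beta <- function(y){
  mu <- mean(y)
  s2 <- var(y)
  g  <- mu * (1 - mu) / s2 - 1
  list(alpha = g * mu, beta = g * (1 - mu))
}

set.seed(7)
rates <- rbeta(1e5, shape1 = 2, shape2 = 60)  # simulated "fatality rates"
est <- mom_beta(rates)  # estimates should land near alpha = 2, beta = 60
```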
# get prior information from previous 3 years
# use method of moments to get alpha and beta
# ideally we would use a hierarchical model but this is suitable
hyper_params <- function(y){
  mu <- mean(y)
  s2 <- var(y)
  gamma <- mu*(1-mu)/s2-1
  alpha <- gamma*mu
  beta <- gamma*(1-mu)
  return(list(alpha = alpha, beta = beta))
}

# estimating the actual parameters
df <- accidents_raw %>%
  filter(
    Loc_Queensland_Transport_Region != "Unknown",
    Crash_Year >= 2014,
    Crash_Year <= 2016
  ) %>%
  group_by(Loc_ABS_Statistical_Area_3) %>%
  summarise(
    count = n(),
    n_fatalities = sum(Count_Casualty_Fatality)
  ) %>%
  mutate(theta = n_fatalities/(count))

hyper_params(df$theta)
Forest plots
library(tidyverse)
library(showtext)

# font
font_add_google(name = "Montserrat", family = "mont")
showtext_auto()

# SA3 forest plot of fatality rate
# set my theme and colours
theme_forest <- function(scale = 1){
  theme_minimal() +
    theme(
      legend.position = "none",
      axis.text = element_text(family = "mont", size = 16*scale),
      axis.text.x = element_text(vjust = 0.5),
      axis.title.y = element_blank(),
      axis.title.x = element_text(family = "mont", size = 16*scale),
      plot.title = element_text(family = "mont", hjust = 0.5, size = 26*scale, face = "bold"),
      plot.subtitle = element_text(family = "mont", hjust = 0.5, size = 20*scale),
      plot.caption = element_text(size = 12*scale)
    )
}

my_cols <- function(n = 16) colorRampPalette(c("darkmagenta", "turquoise"))(n)

# simulate from the posterior
posterior_f <- function(df, y, n, a = 1.3, b = 77, inflator = 100) {
  out <- data.frame()
  qs <- c(0.025, 0.1, 0.5, 0.9, 0.975)
  for(k in 1:nrow(df)){
    out <- rbind(out, inflator*rgamma(1e4, shape = a+y[k], rate = b+n[k]) %>% quantile(qs))
  }
  colnames(out) <- paste0("q", 100*qs)
  return(out)
}

# SA4
# fatalities
areas <- grep("Area", colnames(accidents_raw), value = TRUE)[3:4]
names(areas) <- c("sa3", "sa4")

fatality_fn <- function(area){
  accidents_raw %>%
    group_by_(area) %>%
    filter(
      Crash_Year == 2017
    ) %>%
    summarise(
      count = length(Count_Casualty_Total),
      n_fatalities = sum(Count_Casualty_Fatality)
    ) %>%
    bind_cols(posterior_f(df = ., y = .$n_fatalities, n = .$count)) %>%
    arrange(q50) %>%
    mutate_(area = interp(~factor(v, level = v), v = as.name(area))) %>%
    ggplot() +
    geom_segment(mapping = aes(x = q2.5, xend = q97.5, y = area, yend = area)) +
    geom_segment(mapping = aes(x = q10, xend = q90, y = area, yend = area, col = q50), size = 2) +
    geom_point(mapping = aes(x = q50, y = area), pch = 3) +
    theme_forest() +
    scale_colour_gradientn(colors = my_cols()) +
    labs(
      title = "Fatality rate given observed road accidents",
      subtitle = paste("Bayesian estimate of the fatality rate for", toupper(names(area)), "areas in 2017"),
      x = "Fatality rate (%)"
    )
}

fatality_plots <- list(sa3 = fatality_fn(areas[1]), sa4 = fatality_fn(areas[2]))
Cyclist plot
# cyclists
df <- accidents_raw %>%
  filter(
    Crash_Year >= 2014,
    Crash_Year <= 2016,
    Loc_ABS_Statistical_Area_3 != "Unknown"
  ) %>%
  group_by(Loc_ABS_Statistical_Area_3) %>%
  summarise(
    n_bicycles = sum(Count_Unit_Bicycle > 0),
    n_accidents = n()
  ) %>%
  mutate(p = n_bicycles/n_accidents) %>%
  arrange(desc(p))

# estimate hyperparameters
hyper_params(df$p)

# cyclists
cyclist_fn <- function(area){
  accidents_raw %>%
    group_by_(area) %>%
    filter(Crash_Year == 2017) %>%
    summarise(
      count = n(),
      n_bicycles = sum(Count_Unit_Bicycle > 0)
    ) %>%
    bind_cols(posterior_f(df = ., y = .$n_bicycles, n = .$count, a = 1.55, b = 25)) %>%
    arrange(q50) %>%
    mutate_(area = interp(~factor(v, level = v), v = as.name(area))) %>%
    ggplot() +
    geom_segment(mapping = aes(x = q2.5, xend = q97.5, y = area, yend = area)) +
    geom_segment(mapping = aes(x = q10, xend = q90, y = area, yend = area, col = q50), size = 2) +
    geom_point(mapping = aes(x = q50, y = area), pch = 3) +
    theme_forest() +
    scale_colour_gradientn(colors = my_cols()) +
    labs(
      title = "Rate of cyclists involved in road accidents",
      subtitle = paste("Bayesian estimate of the rate of accidents involving cyclists for", toupper(names(area)), "areas in 2017"),
      x = "Accidents involving cyclists (%)"
    )
}

cyclist_plots <- list(sa3 = cyclist_fn(areas[1]), sa4 = cyclist_fn(areas[2]))

# Brisbane inner
accidents_raw %>%
  filter(Loc_ABS_Statistical_Area_4 == "Brisbane Inner City", Crash_Year < 2018) %>%
  group_by(Crash_Year) %>%
  summarise(
    n_bikes = sum(Count_Unit_Bicycle > 0),
    n_accidents = n(),
    p_bikes = n_bikes/n_accidents
  ) %>%
  bind_cols(posterior_f(df = ., y = .$n_bikes, n = .$n_accidents, a = 1.55, b = 25)/100) %>%
  ggplot(aes(x = Crash_Year, y = q50)) +
  geom_line(col = "darkmagenta") +
  geom_point(col = "darkmagenta") +
  theme_minimal() +
  theme_forest() +
  labs(
    x = "Year",
    title = "Accidents involving cyclists - Brisbane Inner City",
    subtitle = "Accidents involving cyclists are increasing, indicative of growth in popularity of cycling - Be careful on the roads"
  )
The post Bayesian estimation of fatality rates and accidents involving cyclists on Queensland roads appeared first on Daniel Oehm | Gradient Descending.
Comparing Frequentist, Bayesian and Simulation methods and conclusions
(This article was first published on Posts on R Lover ! a programmer, and kindly contributed to R-bloggers)
So, a programmer, a frequentist, and a bayesian walk into a bar. No this post
isn’t really on the path to some politically incorrect stereotypical humor. Just
trying to make it fun and catch your attention. As the title implies, this post
is really about applying the differing viewpoints and methodologies inherent in
those approaches to statistics. To be honest I’m not even going to spend a lot
of time on the details of methods; I want to focus on the conclusions each of these
people would draw after analyzing the same data using their favored methods.
This post isn’t really so much about how they would proceed but more about what
they would conclude at the end of the analysis. I make no claim about which of
these fine people has the best way of doing things. Although I was raised a
frequentist, I am more and more enamored of bayesian methods, and while my tagline
is accurate (“R Lover !a programmer”), I will admit a love for making computers do
the hard work for me, so if that involves a little programming, I’m all for
it.
Last week I saw a nice post from Anindya
Mozumdar
on the R Bloggers feed. The topic material was
fun for me (analyzing the performance of male 100m sprinters and the fastest
man on earth), as well as exploring the concepts in Allen B. Downey’s book
Think Stats: Probability and Statistics for
Programmers, which is at
my very favorite price point, free download, and Creative Commons licensed. The
post got me interested in following up a bit and thinking things through “out
loud” as a blog post.
This post is predicated on your having read
https://www.radmuzom.com/2019/05/18/fun-with-statistics-is-usain-bolt-really-the-fastest-man-on-earth/.
Please do. I’ll wait until you come back.
Welcome back. I’m going to try to repeat as little as possible from that blog
post and just make comparisons. So to continue our little teaser… So, a
programmer, a frequentist, and a bayesian walk into a bar and start arguing
about whether Usain Bolt really is the fastest man on earth. The programmer has
told us how they would go about answering the question. The answer was:
There is only a 1.85% chance of seeing a difference as large as the observed
difference if there is actually no difference between the (median) timings of
Usain Bolt and Asafa Powell.
and was derived by counting observations across 10,000 simulations of the data
using the infer package and looking at differences between median timings. Our
null hypothesis was that there is no “real” difference between Bolt and
Powell even though our data has a median for Bolt of 9.90 and median for Powell
of 9.95. That is after all a very small difference. But our simulation allows us
to reject that null hypothesis and favor the alternative that the difference is
real.
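That simulation result can be sketched with a plain permutation test in base R; the timings below are invented stand-ins for the scraped race data, so the p-value will differ from the 1.85% in the original post.

```r
# Hedged sketch of a permutation test on the difference in medians,
# with invented timings standing in for the real race data.
set.seed(2019)
bolt   <- rnorm(50, mean = 9.90, sd = 0.08)
powell <- rnorm(50, mean = 9.95, sd = 0.08)

obs_diff <- median(bolt) - median(powell)  # negative if Bolt is faster
pooled   <- c(bolt, powell)

# Shuffle the group labels 10,000 times under the null of no difference
perm_diffs <- replicate(10000, {
  idx <- sample(length(pooled), length(bolt))
  median(pooled[idx]) - median(pooled[-idx])
})

# One-sided p-value: share of shuffles at least as extreme as observed
p_value <- mean(perm_diffs <= obs_diff)
```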
Should we be confident that we are 100% – 1.85% = 98.15% likely to be correct? NO!
as Downey notes:
For most problems, we only care about the order of magnitude: if the p-value
is smaller than 1/100, the effect is likely to be real; if it is greater than
1/10, probably not. If you think there is a difference between a 4.8%
(significant!) and 5.2% (not significant!), you are taking it too seriously.
Can we say that Bolt will win a race with Powell 98% of the time? Again, a resounding
NO! We’re 98% certain that the “true” difference in their medians is .05
seconds? NOPE.
Okay we’ve heard the programmer’s story at our little local bar. It’s time to
let our frequentist have their moment in the limelight. Technically the best
term would be Neyman-Pearson Frequentist but we’re not going to stand on
formality. It is nearly a century old and stands as an “improvement” on Fisher’s
significance testing. A nice little summary here on
Wikipedia.
I’m not here to belabor the nuances but frequentist methods are among the
oldest and arguably the most prevalent method in many fields. They are often the
first method people learned in college and sometimes the only method. They still
drive most of the published research in many fields although other methods are
taking root.
Before the frequentist can tell their tale though let’s make sure they have the
same data as in the earlier post. Let’s load all the libraries we’re going to
use and very quickly reproduce the process Anindya Mozumdar went through to
scrape and load the data. We’ll have a tibble named male_100 that contains
the requisite data and we’ll confirm that the summary for the top 6 runners’ mean
and median are identical. Note that I am suppressing messages as the libraries
load since R 3.6.0 has gotten quite chatty in this regard.
Most of the code above is simply shortened from the original post. The only
thing that is completely new is forcats::fct_reorder(male_100$runner, male_100$timing) which takes the runner factor and reorders it according to
the median by runner. This doesn’t matter for the calculations we’ll do but it
will make the plots look nicer.
One of the issues with a frequentist approach compared to a programmer’s approach
is that there are a lot of different tests you could choose. But in this case
wearing my frequentist hat there really are only two choices. A Oneway ANOVA or
the Kruskall Wallis which uses ranks and eliminates some assumptions.
This also gives me a chance to talk about a great package that supports both
frequentist and Bayesian methods and completely integrates visualizing your
data with analyzing your data, which IMHO is the only way to go. The package is
ggstatsplot. Full disclosure:
I'm a minor contributor to the package, but please know that the true hero of the
package is Indrajeet Patil. It's stable,
mature, well tested, and well maintained; try it out.
So let's assume we want to run a classic one-way ANOVA first (Welch's method, so
we don't have to assume equal variances across groups). Assuming that the
omnibus F test is significant, let's say we'd like to look at the pairwise
comparisons and adjust the p values for multiple comparison using Holm. We're a
big fan of visualizing the data by runner, and of course we'd like to plot things
like the mean and median and the number of races per runner. We'd of course like
to know our effect size (we'll stick with eta squared), and we'd like it all as elegant as
possible.
Doing this analysis using frequentist methods in R is not difficult. Heck,
I've even blogged about it myself, it's so
“easy”. The benefit of ggbetweenstats from ggstatsplot is that it pretty
much allows you to do just about everything in one command. Seamlessly mixing
the plot and the results into one output. We’re only going to scratch the
surface of all the customization possibilities.
Our conclusion is similar to that drawn by simulation. We can clearly reject the
null that all these runners have the same mean time. Using Games-Howell and
controlling for multiple comparisons with Holm, however, we can only show
support for the difference between Usain Bolt and Maurice Green. There is
insufficient evidence to reject the null for all the other possible pairings.
(You can actually tell ggbetweenstats to show the p value for all the pairings
but that gets cluttered quickly).
From a frequentist perspective there is a whole set of non-parametric tests
available for use. They typically make fewer assumptions about the data
we have and often operate on the ranks of the outcome variable
(timing) rather than the raw numbers.
The only thing we need to change in our input to the function is type = "np" and the title.
ggbetweenstats(data = male_100, x = runner, y = timing, type = "np", var.equal = FALSE, pairwise.comparisons = TRUE, partial = FALSE, effsize.type = "biased", point.jitter.height = 0, title = "Non-Parametric (Rank) testing", ggplot.component = ggplot2::scale_y_continuous(breaks = seq(9.6, 10.4, .2), limits = (c(9.6,10.4))), messages = FALSE )

Without getting overly consumed by the exact numbers, note the very similar
results for the overall test, but also that we are now more confident about
the difference between Usain Bolt and Justin Gatlin. I highlight that
because there is a common misconception that non-parametric tests are always less
powerful (sensitive) than their parametric cousins.
Much of the credit for this section goes to Danielle Navarro (bookdown translation: Emily Kothe) in Learning Statistics with R
It usually takes several lessons or even an entire semester to teach the
frequentist method, because null hypothesis testing is a very elaborate
contraption that people (well in my experience very smart undergraduate
students) find very hard to master. In contrast, the Bayesian approach to
hypothesis testing “feels” far more intuitive. Let’s apply it to our current
scenario.
We're at the bar, the three of us, wondering whether Usain Bolt is really the
fastest or whether all these individual data points really are just a random
mosaic of data noise. Both the programmer and the frequentist set the testing up
conceptually the same way: can we use the data to reject the null that all the
runners are the same? Convinced they're not all the same, they applied the same
general procedure to reject (or not) the hypothesis that any pair was the same,
for example Bolt versus Powell (for the record, I'm not related to either). They
differ in computational methods and assumptions but not in overarching method.
At the end of their machinations they have no ability to talk about how likely
(probable) it is that runner 1 will beat runner 2. Often times that’s exactly
what you really want to know. There are two hypotheses that we want to compare,
a null hypothesis h0 that all the runners run equally fast and an alternative
hypothesis h1 that they don’t. Prior to looking at the data while we’re
sitting at the bar we have no real strong belief about which hypothesis is true
(odds are 1:1 in our naive state). We have our data and we want it to inform our
thinking. Unlike frequentist statistics, Bayesian statistics allow us to talk
about the probability that the null hypothesis is true (which is a complete no
no in a frequentist context). Better yet, it allows us to calculate the
posterior probability of the null hypothesis, using Bayes’ rule and our data.
In practice, most Bayesian data analysts tend not to talk in terms of the raw
posterior probabilities. Instead, we/they tend to talk in terms of the posterior
odds ratio. Think of it like betting. Suppose, for instance, the posterior
probability of the null hypothesis is 25%, and the posterior probability of the
alternative is 75%. The alternative hypothesis h1 is three times as probable as the
null h0, so we say that the odds are 3:1 in favor of the alternative.
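In R, that betting arithmetic is a one-liner:

```r
# Posterior probabilities from the example above
p_h1 <- 0.75  # alternative hypothesis
p_h0 <- 0.25  # null hypothesis
posterior_odds <- p_h1 / p_h0
posterior_odds  # 3, i.e. odds of 3:1 in favor of the alternative
```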
At the end of the Bayesian's efforts they can make what feel like very natural
statements of interest, for example: “The evidence provided by our data
corresponds to odds of 42:1 that these runners are not all equally fast.”
Let’s try it using ggbetweenstats…
ggbetweenstats(data = male_100, x = runner, y = timing, type = "bf", var.equal = FALSE, pairwise.comparisons = TRUE, partial = FALSE, effsize.type = "biased", point.jitter.height = 0, title = "Bayesian testing", messages = FALSE )

Yikes! Not what I wanted to see in the bar. The pairwise comparisons have gone
away (we'll get them back) and, worse yet, what the heck does loge(BF10) = 2.9
mean? I hate log conversions; I was promised a real number like 42:1! Who's
Cauchy, and why is he there at 0.707?
Let's break this down. loge(BF10) = 2.9 is also exp(2.9), or about 18, so
the good news is the odds are better than 18:1 that the runners are not equally
fast. Since rounding no doubt loses some accuracy, let's use the BayesFactor
package directly and get a more accurate answer before we round:

anovaBF(timing ~ runner, data = as.data.frame(male_100), rscaleFixed = .707)

where rscaleFixed = .707 ensures we have the right Cauchy value.
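The conversion from the log scale back to odds is just exponentiation:

```r
log_bf <- 2.9      # loge(BF10) as reported by ggbetweenstats
bf <- exp(log_bf)
bf                 # about 18.2, i.e. odds of roughly 18:1 in favor of H1
```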
Okay, that's better. So to Bayesian thinking the odds are about 19:1 that the runners do not all run at the same speed, i.e. 19:1 that they run at different speeds.
Hmmm. One of the strengths/weaknesses of the Bayesian approach is that people
can have their own sense of how strong 19:1 is. I like those odds. One of the
really nice things about the Bayes factor is the numbers are inherently
meaningful. If you run the data and you compute a Bayes factor of 4, it means
that the evidence provided by your data corresponds to betting odds of 4:1 in
favor of the alternative. However, there have been some attempts to quantify the
standards of evidence that would be considered meaningful in a scientific
context. One that is widely used is from Kass and Raftery (1995). (N.B. there are others and I have deliberately selected one of the most conservative standards. See for example https://media.springernature.com/full/springer-static/image/art%3A10.1186%2Fs12888-018-1761-4/MediaObjects/12888_2018_1761_Fig1_HTML.png)
Okay we have “positive evidence” and we can quantify it, that’s good. But what
about all the pairwise comparisons? Can we take this down to all the
individual pairings? I’m on the edge of my bar stool here. What are the odds
Bolt really is faster than Powell? Can we quantify that without somehow breaking
the multiple comparisons rule?
The short answer is yes we can safely extend this methodology to incorporate
pairwise comparisons. We shouldn’t abuse the method and we should fit our model
with the best possible prior information but in general, as simulated
here,
With Bayesian inference (and the correct prior), though, this problem
disappears. Amazingly enough, you don’t have to correct Bayesian inferences for
multiple comparisons.
With that in mind let’s build a quick little function that will allow us to pass
a data source and two names and run a Bayesian t-test via BayesFactor::ttestBF
to compare two runners. ttestBF returns a lot of info in a custom object so
we’ll use the extractBF function to grab it in a format where we can pluck out
the actual BF10
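A sketch of what that helper might look like. The function name and its exact shape are my own; it assumes the BayesFactor package is installed and a data frame with runner and timing columns, like male_100:

```r
# Hypothetical helper (name is my own): run a Bayes-factor t-test comparing
# two runners and return the raw BF10 plucked from the ttestBF result object.
compare_runners <- function(df, runner1, runner2) {
  pair <- droplevels(df[df$runner %in% c(runner1, runner2), ])
  bf <- BayesFactor::ttestBF(formula = timing ~ runner, data = as.data.frame(pair))
  BayesFactor::extractBF(bf)$bf
}

# e.g. compare_runners(male_100, "Usain Bolt", "Asafa Powell")
```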
Now that we have a function we can see what the odds are that Bolt is faster
than the other 5 and print them one by one
Okay now I feel like we’re getting somewhere with our bar discussions. Should I
feel inclined to make a little wager on say who buys the next round of drinks as
a Bayesian I have some nice useful information. I’m not rejecting a null
hypothesis I’m casting the information I have as a statement of the odds I think
I have of “winning”.
But of course this isn’t the whole story so please read on…
Who's Cauchy and why does he matter?

Earlier I made light of the fact that the output from ggbetweenstats had
rCauchy = 0.707 and anovaBF uses rscaleFixed = .707. Now we need to spend
a little time actually understanding what that’s all about. Cauchy is
Augustin-Louis Cauchy and
the reason that’s relevant is that BayesFactor makes use of his distribution as
a default. I’m not even
going to try and take you into the details of the math but it is important we
have a decent understanding of what we’re doing to our data.
The BayesFactor
package
has a few built-in “default” named settings. They all have the same shape; they
only differ by their scale, denoted by r. The three named defaults are medium =
0.707, wide = 1, and ultrawide = 1.414. “Medium” is the default. The scale
controls how large, on average, the expected true effect sizes are. For a
particular scale 50% of the true effect sizes are within the interval (−r,r).
For the default scale of “medium”, 50% of the prior effect sizes are within the
range (−0.7071,0.7071). Increasing r increases the sizes of expected effects;
decreasing r decreases the size of the expected effects.
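That 50% claim is easy to check with base R's Cauchy distribution functions:

```r
r <- 0.707  # the "medium" scale
# Mass a Cauchy(location = 0, scale = r) prior places on (-r, r):
# pcauchy(r, scale = r) is exactly 0.75, so the interval holds exactly half
mass <- pcauchy(r, scale = r) - pcauchy(-r, scale = r)
mass  # 0.5, and this holds for any choice of r
```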
Let's compare it to a frequentist test we're all likely to know, the t-test
(we'll use the Welch variant). Our initial hypothesis is that Bolt's mean times
are different from Powell's mean times (two-sided); then we test the one-sided
hypothesis that Bolt is faster. Then let's go a little crazy and run it one-sided but
specify the mean difference of 0.038403 seconds faster that we “see” in our data
(mu = -0.038403).
Hopefully that last one didn't trip you up: you recognized that, by definition, if
the mean difference in our sample data is -0.038403, then the p value against that
null should be about .5 (50/50).
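The three t-tests described above would look roughly like this. This is a sketch: the data frame below is a toy stand-in with made-up times (with the real data you'd filter male_100 down to the two runners), and the levels are set so Usain Bolt is the first factor level, matching the post:

```r
# Toy stand-in for the two-runner subset (made-up times, labeled as such)
justtwo <- data.frame(
  runner = factor(rep(c("Usain Bolt", "Asafa Powell"), each = 4),
                  levels = c("Usain Bolt", "Asafa Powell")),
  timing = c(9.63, 9.69, 9.58, 9.72,   # Bolt (made-up)
             9.72, 9.77, 9.74, 9.82)   # Powell (made-up)
)

# Two-sided Welch test (var.equal = FALSE is the default): means different?
t.test(timing ~ runner, data = justtwo)
# One-sided: is Bolt faster, i.e. is his mean time smaller?
t.test(timing ~ runner, data = justtwo, alternative = "less")
# One-sided against the observed mean difference as the null value
t.test(timing ~ runner, data = justtwo, alternative = "less", mu = -0.038403)
```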
Let’s first just try different rCauchy values with ttestBF.
justtwo <- male_100 %>% filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>% droplevels %>% as.data.frame

ttestBF(formula = timing ~ runner, data = justtwo, rscale = "medium")
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 5.164791 ±0%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS

ttestBF(formula = timing ~ runner, data = justtwo, rscale = "wide")
## Bayes factor analysis
## --------------
## [1] Alt., r=1 : 4.133431 ±0%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS

ttestBF(formula = timing ~ runner, data = justtwo, rscale = .2)
## Bayes factor analysis
## --------------
## [1] Alt., r=0.2 : 6.104113 ±0%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS

Okay, the default medium returns just what we reported earlier: 5:1 odds. Going
wider gets us 4:1, and going narrower (believing the difference is smaller) takes
us to 6:1. Not huge differences, but noticeable and driven by our data.
Let's investigate directional hypotheses with ttestBF. First let's ask what the evidence is that Bolt is faster than Powell. N.B. the order is driven by factor level in the data frame, not the order in the filter command below. Also note that faster means a lower number.
justtwo <- male_100 %>% filter(runner %in% c("Usain Bolt", "Asafa Powell")) %>% droplevels %>% as.data.frame

# notice these two just return the same answer in a different order
ttestBF(formula = timing ~ runner, data = justtwo, nullInterval = c(0, Inf))
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 0

ttestBF(formula = timing ~ runner, data = justtwo, nullInterval = c(-Inf, 0))
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 -Inf

So the odds that Bolt has a bigger number, i.e. is slower than Powell, are 0.04:1,
and the converse: the odds that Bolt has a smaller timing (is faster) are 10:1.
You can feel free to put these in the order that makes the most sense to your
workflow. They're always going to be mirror images.
And yes, in some circumstances it is perfectly rational to combine the
information by dividing those odds. See this blog post for a research
scenario;
accomplishing it is trivial. Running such a snippet essentially combines
what we know in both directions of the hypothesis.
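A sketch of that combination, using the rounded directional odds reported above (treat the inputs, and therefore the result, as approximate):

```r
bf_faster <- 10    # odds that Bolt's times are smaller (he is faster)
bf_slower <- 0.04  # odds that Bolt's times are larger (he is slower)
bf_faster / bf_slower  # 250: combined odds favoring "faster" over "slower"
```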
- All three approaches yielded similar answers to our question at the bar.
- Frequentist methods have stood the test of time and are pervasive in the
literature.
- Computational methods like resampling allow us to free ourselves
from some of the restrictions and assumptions in classical hypothesis testing,
in an age when compute power is cheap.
- Bayesian methods allow us to speak in
the very human vernacular language of probabilities about our evidence.
I hope you’ve found this useful. I am always open to comments, corrections and
suggestions.
Chuck (ibecav at gmail dot com)
To leave a comment for the author, please follow the link and comment on their blog: Posts on R Lover ! a programmer.
Analysing the HIV pandemic, Part 4: Classification of lab samples
(This article was first published on R Views, and kindly contributed to R-bloggers)
Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio
Phillip (Armand) Bester is a medical scientist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa
In this post we complete our series on analysing the HIV pandemic in Africa. Previously we covered the bigger picture of HIV infection in Africa, and a pipeline for drug resistance testing of samples in the lab.
Then, in part 3 we saw that sometimes the same patient’s genotype must be repeatedly analysed in the lab, from samples taken years apart.
Let's say we genotyped a patient five years ago and we have a current genotype sequence. It should be possible to retrieve the previous sequence from a database of sequences without relying only (or at all) on identifiers. Sometimes a patient's surname changes (for example, after remarriage), or transcription errors are made, which makes finding previous samples tedious and error-prone. So instead of using patient information to look for previous samples to include, we can rather use the sequence data itself and then confirm the sequences belong to the same patient, or investigate any irregularities. If we suspect mother-to-child transmission from our analysis, we confirm this with the healthcare worker who sent the sample.
In this final part, we discuss how the inter- and intra-patient HIV genetic distances were analyzed using logistic regression to gain insights into the probability distribution of these two classes. In other words, the goal is to find a way to tell whether two genetic samples are from the same person or from two different people.
Samples from the same person can have slightly different genetic sequences, due to mutations and other errors. This is especially useful in comparing samples of genetic material from retroviruses.
Preliminary analysis

To help answer this question, we downloaded data from the Los Alamos HIV sequence database (specifically, Virus HIV-1, subtype C, genetic region POL CDS).
Each observation is the (dis)similarity distance between different samples.
library(readr)
library(dplyr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
pt_distance <- read_csv("dist_sample_10.csv.zip", col_types = "ccdccf")
head(pt_distance)
## # A tibble: 6 x 6
##   sample1                sample2                 distance sub   area  type
## 1 KI_797.67744.AB874124… KI_481.67593.AB873933.…   0.0644 B     INT   Inter
## 2 502-2794.39696.JF3202… WC3.27170.EF175209.B.U…   0.0418 B     INT   Inter
## 3 KI_882.67653.AB874186… KI_813.67589.AB874131.…   0.0347 B     INT   Inter
## 4 HTM360.13332.DQ322231… C11-2069070.63977.AB87…   0.0487 B     INT   Inter
## 5 O5598.34737.GQ372062.… LM49.4011.AF086817.B.T…   0.0360 B     INT   Inter
## 6 GKN.45901.HQ026515.B.… C11-2069083.65198.AB87…   0.0699 B     INT   Inter

Next, plot a histogram of the distance between samples. This clearly shows that the distance between samples of the same subject (intra-patient) is smaller than the distance between different subjects (inter-patient). This is not surprising.
However, from the histogram it is also clear that there is not a clear demarcation between these types. Simply eye-balling the data seems to indicate that one could use an arbitrary threshold of around 0.025 to indicate whether the sample is from the same person or different people.
pt_distance %>% mutate( type = forcats::fct_rev(type) ) %>% ggplot(aes(x = distance, fill = type)) + geom_histogram(binwidth = 0.001) + facet_grid(rows = vars(type), scales = "free_y") + scale_fill_manual(values = c("red", "blue")) + coord_cartesian(xlim = c(0, 0.1)) + ggtitle("Histogram of phylogenetic distance by type")

Modeling

Since we have two sample types (intra-patient vs inter-patient), this is a binary classification problem.
Logistic regression is a simple algorithm for binary classification, and a special case of a generalized linear model (GLM). In R, you can use the glm() function to fit a GLM, and to specify a logistic regression, use the family = binomial argument.
In this case we want to train a model with distance as the independent variable and type as the dependent variable, i.e. type ~ distance.
We train on 100,000 (n = 1e5) observations purely to reduce computation time:
pt_sample <- pt_distance %>% sample_n(1e5)
model <- glm(type ~ distance, data = pt_sample, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

(Note that sometimes the model throws a warning indicating numerical problems. This happens because the overlap between intra and inter is very small. If there is a very sharp dividing line between classes, the logistic regression algorithm has trouble converging.)
However, in this case the numerical problems don't actually cause a practical problem with the model itself.
The model summary tells us that the distance variable is highly significant (indicated by the ***):
summary(model)
##
## Call:
## glm(formula = type ~ distance, family = binomial, data = pt_sample)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4035 -0.0050 -0.0010 -0.0002  8.4904
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)    5.7887     0.1796   32.23   <2e-16 ***
## distance    -355.1454     9.3247  -38.09   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 23659.2 on 99999 degrees of freedom
## Residual deviance: 1440.5 on 99998 degrees of freedom
## AIC: 1444.5
##
## Number of Fisher Scoring iterations: 12

Now we can use the model to compute a prediction for a range of genetic distances (from 0 to 0.05) and create a plot.
newdata <- data.frame(distance = seq(0, 0.05, by = 0.001))
pred <- predict(model, newdata, type = "response")
plot_inter <- pt_sample %>% filter(distance <= 0.05, type == "Inter") %>% sample_n(2000)
plot_intra <- pt_sample %>% filter(distance <= 0.05, type == "Intra") %>% sample_n(2000)
threshold <- with(newdata, approx(pred, distance, xout = 0.5))$y
ggplot() + geom_point(data = plot_inter, aes(x = distance, y = 0), alpha = 0.05, col = "blue") + geom_point(data = plot_intra, aes(x = distance, y = 1), alpha = 0.05, col = "red") + geom_rug(data = plot_inter, aes(x = distance, y = 0), col = "blue") + geom_rug(data = plot_intra, aes(x = distance, y = 0), col = "red") + geom_line(data = newdata, aes(x = distance, y = pred)) + annotate(x = 0.005, y = 0.9, label = "Type == intra", geom = "text", col = "red") + annotate(x = 0.04, y = 0.1, label = "Type == inter", geom = "text", col = "blue") + geom_vline(xintercept = threshold, col = "grey50") + ggtitle("Model results", subtitle = "Predicted probability that Type == 'Intra'") + xlab("Phylogenetic distance") + ylab("Probability")

Logistic regression essentially fits an s-curve that indicates the probability. In this case, for small distances (lower than ~0.01) the probability of being the same person (i.e., type is intra) is almost 100%. For distances greater than 0.03 the probability of being type intra is almost zero (i.e., the model predicts type inter).
The model puts the distance threshold at approximately 0.016.
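That threshold can also be recovered directly from the fitted coefficients: the logistic curve crosses probability 0.5 where the linear predictor is zero, i.e. at -intercept/slope. Using the estimates from the summary above:

```r
# Coefficient estimates from the model summary:
# p = 0.5 where intercept + slope * distance = 0
intercept <- 5.7887
slope <- -355.1454
threshold <- -intercept / slope
threshold  # about 0.0163, matching the grey vertical line in the plot
```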
The practical value of this work

In part 2, we discussed how researchers developed an automated pipeline of phylogenetic analysis. The project was designed to run on the Raspberry Pi, a very low-cost computing device. This meant that the cost of implementation of the project is low, and the project has been implemented at the National Health Laboratory Service (NHLS) in South Africa.
In this part, we described the very simple logistic regression model that runs as part of the pipeline. In addition to the descriptive analysis, e.g., heat maps and trees (as described in part 3), this logistic regression predicts whether two samples were obtained from the same person or from two different people. This prediction helps the laboratory staff identify potential contamination of samples, or indeed match samples from people who weren't matched properly by their name and other identifying information (e.g., through spelling mistakes or name changes).
Finally, it’s interesting to note that traditionally the decision whether two samples were intra-patient or inter-patient was made on heuristics, instead of modelling. For example, a heuristic might say that if the genetic distance between two samples is less than 0.01, they should be considered a match from a single person.
Heuristics are easy to implement in the lab, but sometimes it can happen that the origin of the original heuristic gets lost. This means that it’s possible that the heuristic is no longer applicable to the sample population.
This modelling gave the researchers a tool to establish confidence intervals around predictions. In addition, it is now possible to repeat the model for many different local sample populations of interest, and thus have a tool that is better able to discriminate given the most recent data.
Conclusion

In this multi-part series on HIV in Africa we covered four topics:
- In part 1, we analysed the incidence of HIV in sub-Sahara Africa, with special mention of the effect of the wide-spread availability of anti-retroviral (ARV) drugs during 2004. Since then, there was a rapid decline in HIV infection rates in South Africa.
- In part 2, we described the PhyloPi project – a phylogenetic pipeline to analyse HIV in the lab, available for the low-cost Raspberry Pi. This work was published in the PLoS ONE journal: “PhyloPi: An affordable, purpose built phylogenetic pipeline for the HIV drug resistance testing facility”
- Then, part 3 described the biological mechanism how the HIV virus mutates, and how this can be modeled using a Markov chain, and visualized as heat maps and phylogenetic trees.
- This final part covered how we used a very simple logistic regression model to identify if two samples in the lab came from the same person or two different people.
Dear readers,
I hope that you enjoyed this series on ‘Analysing the HIV pandemic’ using R and some of the tools available as part of the tidyverse packages. Learning R provided me not only with a tool set to analyse data problems, but also a community. Being a biologist, I was not sure of the best approach for solving the problem of inter- and intra-patient genetic distances. I contacted Andrie from RStudio, and not only did he help us with this, but he was also excited about it. It was a pleasure telling you about our journey on this blog site, and a privilege doing this with experts.
Armand
To leave a comment for the author, please follow the link and comment on their blog: R Views.
MRAN snapshots, and you
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
For almost five years, the entire CRAN repository of R packages has been archived on a daily basis at MRAN. If you use CRAN snapshots from MRAN, we'd love to hear how you use them in this survey. If you're not familiar with the concept, or just want to learn more, read on.
Every day since September 17, 2014, we (Microsoft and, before the acquisition, Revolution Analytics) have archived a snapshot of the entire CRAN repository as a service to the R community. These daily snapshots have several uses:
- As a longer-term archive of binary R packages. (CRAN keeps an archive of package source versions, but binary versions of packages are kept for a limited time. CRAN keeps package binaries only for the current R version and the prior major version, and only for the latest version of the package).
- As a static CRAN repository you can use like the standard CRAN repository, but frozen in time. This means changes to CRAN packages won't affect the behavior of R scripts in the future (for better or worse). options(repos="https://cran.microsoft.com/snapshot/2017-03-15/") provides a CRAN repository that works with R 3.3.3, for example — and you can choose any date since September 17, 2014.
- The checkpoint package on CRAN provides a simple interface to these CRAN snapshots, allowing you to use a specific CRAN snapshot by specifying a date, and making it easy to manage multiple R projects, each using different snapshots.
- Microsoft R Open, Microsoft R Client, Microsoft ML Server and SQL Server ML Services all use fixed CRAN repository snapshots from MRAN by default.
- The rocker project provides container instances for historical versions of R, tied to an appropriate CRAN snapshot from MRAN suitable for the corresponding R version.
MRAN and the CRAN snapshot system was created at a time when reproducibility was an emerging concept in the R ecosystem. Now, there are several methods available to ensure that your R code works consistently, even as R and CRAN changes. Beyond virtualization and containers, you have packages like packrat and miniCRAN, RStudio's package manager, and the full suite of tools for reproducible research.
As CRAN has grown and changes to packages have become more frequent, maintaining MRAN is an increasingly resource-intensive process. We're contemplating changes, like changing the frequency of snapshots, or thinning the archive of snapshots that haven't been used. But before we do that we'd like to hear from the community first. Have you used MRAN snapshots? If so, how are you using them? How many different snapshots have you used, and how often do you change that up? Please leave your feedback at the survey link below by June 14, and we'll use the feedback we gather in our decision-making process. Responses are anonymous, and we'll summarize the responses in a future blog post. Thanks in advance!
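As a quick illustration of the checkpoint workflow mentioned above (the date is arbitrary, and this sketch assumes the positional snapshot-date argument used by older versions of the package):

```r
library(checkpoint)

# Scan the project's R scripts for package dependencies, install them
# from the MRAN snapshot of this date, and use that snapshot for the session
checkpoint("2017-03-15")
```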
To leave a comment for the author, please follow the link and comment on their blog: Revolutions.
MRAN snapshots, and you
(This article was first published on Revolutions, and kindly contributed to R-bloggers)
For almost five years, the entire CRAN repository of R packages has been archived on a daily basis at MRAN. If you use CRAN snapshots from MRAN, we'd love to hear how you use them in this survey. If you're not familiar with the concept, or just want to learn more, read on.
Every day since September 17, 2014, we (Microsoft and, before the acquisition, Revolution Analytics) have archived a snapshot of the entire CRAN repository as a service to the R community. These daily snapshots have several uses:
- As a longer-term archive of binary R packages. (CRAN keeps an archive of package source versions, but binary versions of packages are kept for a limited time. CRAN keeps package binaries only for the current R version and the prior major version, and only for the latest version of the package).
- As a static CRAN repository you can use like the standard CRAN repository, but frozen in time. This means changes to CRAN packages won't affect the behavior of R scripts in the future (for better or worse). options(repos="https://cran.microsoft.com/snapshot/2017-03-15/") provides a CRAN repository that works with R 3.3.3, for example — and you can choose any date since September 17, 2014.
- The checkpoint package on CRAN provides a simple interface to these CRAN snapshots, allowing you use a specific CRAN snapshot by specifying a date, and making it easy to manage multiple R project each using different snapshots.
- Microsoft R Open, Microsoft R Client, Microsoft ML Server and SQL Server ML Services all use fixed CRAN repository snapshots from MRAN by default.
- The rocker project provides container instances for historical versions of R, tied to an appropriate CRAN snapshot from MRAN suitable for the corresponding R version.
MRAN and the CRAN snapshot system was created at a time when reproducibility was an emerging concept in the R ecosystem. Now, there are several methods available to ensure that your R code works consistently, even as R and CRAN changes. Beyond virtualization and containers, you have packages like packrat and miniCRAN, RStudio's package manager, and the full suite of tools for reproducible research.
As CRAN has grown and changes to packages have become more frequent, maintaining MRAN is an increasingly resource-intensive process. We're contemplating changes, like changing the frequency of snapshots, or thinning the archive of snapshots that haven't been used. But before we do that we'd like to hear from the community first. Have you used MRAN snapshots? If so, how are you using them? How many different snapshots have you used, and how often do you change that up? Please leave your feedback at the survey link below by June 14, and we'll use the feedback we gather in our decision-making process. Responses are anonymous, and we'll summarize the responses in a future blog post. Thanks in advance!
To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...
Deep (learning) like Jacques Cousteau – Part 5 – Vector addition
(This article was first published on Embracing the Random | R, and kindly contributed to R-bloggers)
(TL;DR: You can add vectors that have the same number of elements.)
LaTeX and MathJax warning for those viewing my feed: please view
directly on website!
You want to know how to rhyme, you better learn how to add
It’s mathematics
Mos Def in ‘Mathematics’
Last
time,
we learnt about scalar multiplication. Let’s get to adding vectors
together!
We will follow the notation in Goodfellow et al. If our vector \boldsymbol{x} has n elements that are real numbers, then we can say that \boldsymbol{x} is an n-dimensional vector.
We can also say that \boldsymbol{x} lies in some set of all vectors
that have the same dimensions as itself. This might be a bit abstract at first,
but it’s not too bad at all.
Let’s define two vectors:
\boldsymbol{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\\ \boldsymbol{y} = \begin{bmatrix} 5 \\ 6 \end{bmatrix}
They have two elements. They are, therefore, two-dimensional vectors.
What are some other two-dimensional vectors made up of real numbers? We could have:
\boldsymbol{x'} = \begin{bmatrix} -2.5 \\ 10.35 \end{bmatrix}\\ \boldsymbol{y'} = \begin{bmatrix} 0 \\ 95.208 \end{bmatrix}
There are infinitely many vectors with two elements that we could come up
with! How can we describe this infinite set of vectors made up of real
numbers? We can say that our two-dimensional vectors made up of real
numbers lie in this set:
\mathbb{R} \times \mathbb{R}
What in the world does this mean?
Cartesian products
To understand what the above product means, let's use a simplified example. Let's define sets of integers like so:
X = \{1, 2, 3\} \\ Y = \{ 4, 5, 6 \}
What is the result of X \times Y? We can depict this operation in a
matrix:
\begin{bmatrix} (1,4) & (1,5) & (1,6) \\ (2,4) & (2,5) & (2,6) \\ (3,4) & (3,5) & (3,6) \end{bmatrix}
What we’re doing here is taking an arbitrary element from our first set,
X. We are then pairing it up with a second arbitrary element from the
set, Y. The entries in the matrix form a set of ordered pairs
(i.e. the order of elements in the pair of numbers matter) which can
be defined like so:
X \times Y = \{(x, y)\:|\:x \in X \textrm{ and } y \in Y\}
This is called the Cartesian product of two sets. Translating the
equation into plain English gives us this:
\textrm{The Cartesian product of sets } X \textrm{ and } Y \textrm{ is the set of ordered pairs } (x, y) \\ \textrm{ such that } x \textrm{ comes from the set } X \textrm{ and } y \textrm{ comes from the set } Y \textrm{.}
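As a quick R aside (my own addition, not in the original post), base R's expand.grid() enumerates exactly this set of ordered pairs:

```r
X <- c(1, 2, 3)
Y <- c(4, 5, 6)
pairs <- expand.grid(x = X, y = Y)  # every (x, y) with x from X, y from Y
nrow(pairs)  # 3 * 3 = 9 ordered pairs
```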
Now let’s replace the terminology of ordered pairs with the word
vectors. If we depict the vectors as row vectors, our matrix now
looks like this:
(the same grid as above, with each ordered pair (x, y) written as the row vector \begin{bmatrix} x & y \end{bmatrix})
Now let’s replace sets X and Y with the set of real numbers,
\mathbb{R}. The Cartesian product then becomes:
\mathbb{R} \times \mathbb{R} = \mathbb{R}^2 = \{(x, y) \:|\: x \in \mathbb{R} \textrm{ and } y \in \mathbb{R}\}
We now have a set of ordered pairs which are drawn from the Cartesian
product of the set of real numbers with itself! If we again replace
the term ‘ordered pairs’ with ‘vectors’, we now have the set
of all two-dimensional real-valued vectors. We can finally say this about the vectors we defined earlier:
\boldsymbol{x,y,x',y'} \in \mathbb{R}^2
Hooray!
Now it's not difficult to see that \mathbb{R}^n is the set formed by taking the n-ary Cartesian product of the set of real numbers with itself. We can now quite easily generalise to vectors of n dimensions.
If we have vectors of equal dimensions, we simply add the elements
of the vectors, one element at a time! The resulting vector is a vector of
the same dimensions as the vectors that were just added.
Let’s use our example vectors from before:
\boldsymbol{x} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\\ \boldsymbol{y} = \begin{bmatrix} 5 \\ 6 \end{bmatrix}
We know that \boldsymbol{x}, \boldsymbol{y} \in \mathbb{R}^2. We now know that \boldsymbol{x} + \boldsymbol{y} \in \mathbb{R}^2, too. Let's add them:
\begin{bmatrix} 1 \\ 2 \end{bmatrix} + \begin{bmatrix} 5 \\ 6 \end{bmatrix} = \begin{bmatrix} 6 \\ 8 \end{bmatrix}
Easy!
How can we add vectors in R?
Some operations on vectors in R are defined differently from their linear algebra counterparts.
When we have two numeric vectors with the same number of elements, we
can add them just like we saw above:
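The code chunk for this step did not survive in this copy of the post; a minimal sketch using the vectors from earlier:

```r
x <- c(1, 2)
y <- c(5, 6)
x + y
## [1] 6 8
```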
But what if we try to add two numeric vectors with different numbers
of elements?
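This chunk is also missing here; the values below are my own illustration of what R does:

```r
# Unequal lengths: the shorter vector is recycled.
c(1, 2) + c(1, 2, 3, 4)
## [1] 2 4 4 6

# If the longer length is not a multiple of the shorter one,
# R still recycles, but warns about fractional recycling.
c(1, 2) + c(1, 2, 3)
## [1] 2 4 4
```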
What the hell just happened? This isn’t what we have just learnt! How is it
possible to add these two vectors given the mathematical definition of
vector addition?
If we look at the help page for +, we can find our answer:
?`+`
The binary operators return vectors containing the result of the element by element operations. If involving a zero-length vector the result has length zero. Otherwise, the elements of shorter vectors are recycled as necessary (with a warning when they are recycled only fractionally).
This is what R is doing:
- For the first two elements of the longer vector, R adds the shorter vector to them in the mathematically defined way above.
- When it runs out of elements in the shorter vector, R goes back to the first element of the shorter vector and adds it to the next element of the longer vector.
- R continues like this until it runs out of elements in the longer vector.
(Man, that’s a lot of use of the word vector!)
In this way, R allows you to add vectors of unequal length. We are also told that some other operators behave in this way:
The operators are + for addition, - for subtraction, * for multiplication, / for division and ^ for exponentiation.
These aren’t our focus here so let’s move on.
What about adding scalars to vectors?
In our standard mathematical definition, we can't add scalars to vectors. But we can in R!
We simply add the scalar to our vector, one element at a time:
x <- 1
y <- c(1, 2, 3)
x + y
## [1] 2 3 4

Conclusion
We learnt how to add vectors. We also learnt a little bit more about sets. We learnt that R can behave differently to our mathematical definitions of vector addition. Let's now move onto dot products!
To leave a comment for the author, please follow the link and comment on their blog: Embracing the Random | R.
New Color Palette for R
(This article was first published on Deeply Trivial, and kindly contributed to R-bloggers)
As I was preparing some graphics for a presentation recently, I started digging into some of the different color palette options. My motivation was entirely about creating graphics that weren’t too visually overwhelming, which I found the default “rainbow” palette to be.
But as the creators of the viridis R package point out, we also need to think about how people with colorblindness might struggle with understanding graphics. If you create figures in R, I highly recommend checking it out at the link above!
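As a quick taste (my own sketch, not from the post): since R 3.6, the same colour-blind-friendly palettes also ship with base R via grDevices::hcl.colors(), so you can try viridis colours without installing anything:

```r
# Five evenly spaced viridis colours as hex codes:
pal <- grDevices::hcl.colors(5, palette = "viridis")
pal

# Drop-in replacement for rainbow(5) in base graphics:
barplot(rep(1, 5), col = pal, border = NA)
```

ggplot2 users can reach the package's scales directly via scale_fill_viridis_d() and scale_colour_viridis_c().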
To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial.
Easy quick PCA analysis in R
(This article was first published on R – intobioinformatics, and kindly contributed to R-bloggers)
Principal component analysis (PCA) is very useful for doing some basic quality control (e.g. looking for batch effects) and assessment of how the data is distributed (e.g. finding outliers). A straightforward way is to make your own wrapper function for prcomp and ggplot2; another way is to use the one that comes with M3C (https://bioconductor.org/packages/devel/bioc/html/M3C.html) or another package. M3C is a consensus clustering tool that makes some major modifications to the Monti et al. (2003) algorithm so that it behaves better; it also provides functions for data visualisation.
Let’s have a go on an example cancer microarray dataset.
# M3C loads an example dataset mydata with metadata desx
library(M3C)
# do PCA
pca(mydata, colvec = c('gold'), printres = TRUE)

So, what prcomp has done is extract the eigenvectors of the data's covariance matrix, then project the original data samples onto them via linear combinations. This yields the PC scores, which are plotted here on PC1 and PC2 (eigenvectors 1 and 2). The eigenvectors are sorted by variance, so these first two capture the most variation in the data; it might still be a good idea to look at additional PCs, but that is beyond this simple blog post and function.
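If you would rather roll your own wrapper, the projection step described above looks roughly like this in base R (random data for illustration; M3C's pca() wraps the same idea plus the ggplot2 plotting):

```r
set.seed(1)
mat <- matrix(rnorm(100 * 10), nrow = 100)  # 100 features x 10 samples
p <- prcomp(t(mat), scale. = TRUE)          # prcomp expects samples as rows
scores <- p$x[, 1:2]                        # PC1/PC2 scores, one row per sample
summary(p)$importance["Proportion of Variance", 1:2]
```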
You can see above there are no obvious outliers for removal here, which is good for cluster analysis. However, if there were outliers, we could label all samples with their names using the 'text' parameter.
library(M3C)
pca(mydata, colvec = c('gold'), printres = TRUE, text = colnames(mydata))

Now other objectives would be comparing samples with batch to make sure we do not have batch effects driving the variation in our data, and comparing with other variables such as gender and age. Since the metadata only contains tumour class, we are going to use that next to see how it is related to these PC scores.
This is a categorical variable, so the ‘scale’ parameter should be set to 3, ‘controlscale’ is set to TRUE because I want to choose the colours, and ‘labels’ parameter passes the metadata tumour class into the function. I am increasing the ‘printwidth’ from its default value because the legend is quite wide.
For more information see the function documentation using ?pca.
# first reformat meta to replace NA with Unknown
desx$class <- as.character(desx$class)
desx$class[is.na(desx$class)] <- 'Unknown'
# next do the plot
pca(mydata, colvec = c('gold','skyblue','tomato','midnightblue','grey'),
    labels = desx$class, controlscale = TRUE, scale = 3, printres = TRUE, printwidth = 25)

So, now it appears that the variation that governs these two PCs is indeed related to tumour class, which is good. But what if the variable is continuous, and we wanted to compare read mapping rate, or read duplication percentage, or age with our data? A simple change of the parameters allows this too. Let's make up a continuous variable, then add it to the plot. In this case we change the 'scale' parameter to reflect that we are using continuous data, and the spectral palette is used for the colours by default.
randomvec <- seq(1, 50)
pca(mydata, labels = randomvec, controlscale = TRUE, scale = 1, legendtitle = 'Random',
    printres = TRUE, printwidth = 25)

So, since this is a random variable, we can see it has no relationship with our data. Let's just define a custom scale now. We change 'scale' to 2, then use the 'low' and 'high' parameters to define the scale colour range.
pca(mydata, labels = randomvec, controlscale = TRUE, scale = 2, legendtitle = 'Random',
    printres = TRUE, printwidth = 25, low = 'grey', high = 'red')

Super easy, yet pretty cool. The idea here is just to minimise the amount of code hanging around for doing basic analyses like these. You can rip the code from here: https://github.com/crj32/M3C/blob/master/R/pca.R. Remember, if your features are on different scales, the data needs transforming to be comparable beforehand.
M3C is part of a series of cluster analysis tools that we are releasing, the next is ‘Spectrum’, a fast adaptive spectral clustering tool, which is already on CRAN (https://cran.r-project.org/web/packages/Spectrum/index.html). Perhaps there will be time to blog about that in the future.
For quantitative analysis of the drivers of variation in data, I recommend checking out David Watson's great function in the 'bioplotr' package called 'plot_drivers' (https://github.com/dswatson/bioplotr). This compares the PCs with categorical and continuous variables and performs univariate statistical analysis on them, producing a very nice plot.
To leave a comment for the author, please follow the link and comment on their blog: R – intobioinformatics.
Create a CLI for R with npm
(This article was first published on Colin Fay, and kindly contributed to R-bloggers)
How to build a CLI for R, with npm.
Background
This blog post was triggered by a discussion on Twitter with Martin Skarzynski, who was looking for a way to build a CLI that launches an Rscript.
Here’s a way to do this using npm.
Please note that this blog post won’t teach you how to build the command
line tool, it will quickly go over the way to create a system-wide
command line interface, using npm.
If you want to learn more about building the utility, see this
fantastic series of blog
posts
by Mark Sellors.
Now, the idea is to have a CLI, i.e. a way to launch your utility with:
$ mytool

And that, system-wide.
What you'll need
- An R script (script.R) with, for example, the following in it:
  #!/usr/bin/env Rscript --vanilla
  cli::cat_rule("yeay")
  cli::cat_bullet(Sys.time())
- npm, which you can get from there.
Create a new folder, and go inside it.
mkdir cli && cd cli
Create the R script there.
echo '#!/usr/bin/env Rscript --vanilla' > script.R
echo 'cli::cat_rule("yeay")' >> script.R
echo 'cli::cat_bullet(Sys.time())' >> script.R
Try your script to see if it works:
Rscript script.R
Now launch an npm project:
npm init -y
(You can also run it without the -y to interactively add information to the package.json.)
Now the important part: add a "bin" entry in the package.json, with the name of the bin you want to create and the path to the script, relative to the package file. Here is an example of a package.json (I removed some elements).
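The example file was lost from this copy of the post; a minimal package.json consistent with the rest of the post would look like this (the bin name clir matches the command invoked below; the other fields are npm init defaults):

```json
{
  "name": "cli",
  "version": "1.0.0",
  "bin": {
    "clir": "./script.R"
  }
}
```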
Install it globally (needs sudo rights):
sudo npm link
And, voilà! Open your terminal, and you're done!
clir
## ── yeay ────────────────────────────────────────────────
## ● 2019-05-22 22:36:29

Other way to go
- See the {littler} implementation
To leave a comment for the author, please follow the link and comment on their blog: Colin Fay.
Bug when Creating Reference Maps with Choroplethr
(This article was first published on R – AriLamstein.com, and kindly contributed to R-bloggers)
Last week the Census Bureau published a free course I created on using Choroplethr to map US Census Data. Unfortunately, a few people have reported problems when following one of the examples in the course. This post describes that issue and provides instructions for working around it.
Where does the bug occur?
The bug occurs in Module 2 Lesson 4, when creating this map:
The code I provide to create that map can be seen here:
state_choropleth(df_pop_state,
                 num_colors = 1,
                 zoom = c("california", "oregon", "washington"),
                 reference_map = TRUE)

Note that this map combines a standard choropleth (a map that uses color to indicate state population) with a reference map from Google Maps (a map that shows terrain, large cities, and so on).
What's the problem?
After I completed production of the course, Google added a requirement that you must register an API key before requesting their maps. Since Choroplethr requests reference maps without an API key, Google now refuses the request.
Behind the scenes, Choroplethr uses the ggmap package to request the maps. Here is how David Kahle, the author of ggmap, describes the issue in the help file for ggmap’s register_google function:
As of mid-2018, the Google Maps Platform requires a registered API key. While this alleviates previous burdens (e.g. query limits), it creates some challenges as well. The most immediate challenge for most R users is that ggmap functions that use Google’s services no longer function out of the box, since the user has to setup an account with Google, enable the relevant APIs, and then tell R about the user’s setup.
The help file goes into more detail, and I encourage you to read it.
Workaround
Getting the above example to work is a three-step process.
1. Get an API key from Google by visiting the Google Maps Platform. This is free, although they require that you enter a credit card.
2. Register your key with the ggmap package by typing the following:
library(ggmap)
register_google("")
3. Install the mapproj package by typing the following:
install.packages("mapproj")

Future Work
While the above steps solve the problem, I do not consider the fix to be particularly elegant. I am currently exploring other solutions, and will create a blog post if I find a better solution.
The post Bug when Creating Reference Maps with Choroplethr appeared first on AriLamstein.com.
To leave a comment for the author, please follow the link and comment on their blog: R – AriLamstein.com.
Timing hash functions with the bench package
(This article was first published on r – Jumping Rivers, and kindly contributed to R-bloggers)
This blog post has two goals
- Investigate the bench package for timing R functions
- Consequently explore the different algorithms in the digest package using bench
The digest package provides hash functions to summarise R objects. Standard algorithms are available, such as md5, crc32, sha-1, and sha-256. The key function in the package is digest(), which applies a cryptographic hash function to arbitrary R objects. By default, the object is serialized internally and hashed with the md5 algorithm. For example,
library("digest")
digest(1234)
## [1] "37c3db57937cc950b924e1dccb76f051"
digest(1234, algo = "sha256")
## [1] "01b3680722a3f3a3094c9845956b6b8eba07f0b938e6a0238ed62b8b4065b538"

The digest package is fairly popular and has a large number of reverse dependencies:
length(tools::package_dependencies("digest", reverse = TRUE)$digest)
## [1] 186

The number of available hashing algorithms has grown over the years, and as a little side project, we decided to test the speed of the various algorithms. To be clear, I'm not considering any security aspects or the potential for hash collisions, just pure speed.
Timing in R
There are numerous ways of timing R functions. A recent addition to this list is the bench package. The main function, bench::mark(), has a number of useful features over other timing functions.
To time and compare two functions, we load the relevant packages
library("bench")
library("digest")
library("tidyverse")

then we call the mark() function and compare the md5 with the sha1 hash:
value = 1234
mark(check = FALSE,
     md5 = digest(value, algo = "md5"),
     sha1 = digest(value, algo = "sha1")) %>%
  select(expression, median)
## # A tibble: 2 x 2
## expression median
##
## 1 md5 50.2µs
## 2 sha1 50.9µs

The resulting tibble object contains all the timing information. For simplicity, we've just selected the expression and the median time.
More advanced bench
Of course, it's more likely that you'll want to compare more than two things. You can compare as many function calls as you want with mark(), as we'll demonstrate in the following example. It's also likely that you'll want to compare these function calls across more than one value. For example, the digest package offers eight different algorithms, ranging from the standard md5 to the newer xxhash64 methods. To compare times, we'll generate n = 20 random character strings of length N = 10,000. This can all be wrapped up in a single press() function call from the bench package:
N = 1e4
results = suppressMessages(
  bench::press(
    value = replicate(n = 20,
                      paste0(sample(LETTERS, N, replace = TRUE), collapse = "")),
    {
      bench::mark(
        iterations = 1, check = FALSE,
        md5 = digest(value, algo = "md5"),
        sha1 = digest(value, algo = "sha1"),
        crc32 = digest(value, algo = "crc32"),
        sha256 = digest(value, algo = "sha256"),
        sha512 = digest(value, algo = "sha512"),
        xxhash32 = digest(value, algo = "xxhash32"),
        xxhash64 = digest(value, algo = "xxhash64"),
        murmur32 = digest(value, algo = "murmur32")
      )
    }
  )
)

The tibble results contains the timing results. But it's easier to work with relative timings, so we'll rescale:
rel_timings = results %>%
  unnest() %>%
  select(expression, median) %>%
  mutate(expression = names(expression)) %>%
  distinct() %>%
  mutate(median_rel = unclass(median / min(median)))

Then plot the results, ordered by median relative timing:
ggplot(rel_timings) +
  geom_boxplot(aes(x = fct_reorder(expression, median_rel), y = median_rel)) +
  theme_minimal() +
  ylab("Relative timings") + xlab(NULL) +
  ggtitle("N = 10,000") +
  coord_flip()

The sha256 algorithm is about three times slower than the xxhash32 method. However, it's worth bearing in mind that although it's relatively slower, the absolute times are very small:
rel_timings %>%
  group_by(expression) %>%
  summarise(median = median(median)) %>%
  arrange(desc(median))
## # A tibble: 8 x 2
## expression median
##
## 1 sha256 171.2µs
## 2 sha1 112.6µs
## 3 md5 109.5µs
## 4 sha512 108.4µs
## 5 crc32 91µs
## 6 xxhash64 85.1µs
## 7 murmur32 82.1µs
## 8 xxhash32 77.6µs

It's also worth seeing how the results vary with the size of the character string, N. Regardless of the value of N, the sha256 algorithm is consistently the slowest.
Conclusion
R is going the way of "tidy" data. Though it wasn't the focus of this blog post, I think the bench package is as good as any other timing package out there. Not only that, but it fits in with the whole "tidy" data thing. Two birds, one stone.
The post Timing hash functions with the bench package appeared first on Jumping Rivers.
To leave a comment for the author, please follow the link and comment on their blog: r – Jumping Rivers.