Subscribe to R bloggers feed R bloggers
R news and tutorials contributed by (750) R bloggers
Updated: 15 min 31 sec ago

RcppArmadillo 0.9.400.2.0

Tue, 04/30/2019 - 03:59

(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)

A new RcppArmadillo release based on the very recent Armadillo upstream release arrived on CRAN earlier today, and will get to Debian shortly.

Armadillo is a powerful and expressive C++ template library for linear algebra aiming towards a good balance between speed and ease of use with a syntax deliberately close to a Matlab. RcppArmadillo integrates this library with the R environment and language–and is widely used by (currently) 587 other packages on CRAN.

The (upstream-only again this time) changes are listed below:

  • Upgraded to Armadillo release 9.400.2 (Surrogate Miscreant)

    • faster cov() and cor()

    • added .as_col() and .as_row()

    • expanded .shed_rows() / .shed_cols() / .shed_slices() to remove rows/columns/slices specified in a vector

    • expanded vectorise() to handle sparse matrices

    • expanded element-wise versions of max() and min() to handle sparse matrices

    • optimised handling of sparse matrix expressions: sparse % (sparse +- scalar) and sparse / (sparse +- scalar)

    • expanded eig_sym(), chol(), expmat_sym(), logmat_sympd(), sqrtmat_sympd(), inv_sympd() to print a warning if the given matrix is not symmetric

    • more consistent detection of vector expressions

Courtesy of CRANberries, there is a diffstat report relative to previous release. More detailed information is on the RcppArmadillo page. Questions, comments etc should go to the rcpp-devel mailing list off the R-Forge page.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box . offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Could not Resist

Tue, 04/30/2019 - 02:21

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Also, Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019 is now content complete! It is deep into editing and soon into production!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Dealing with correlation in designed field experiments: part I

Tue, 04/30/2019 - 02:00

(This article was first published on R on The broken bridge between biologists and statisticians, and kindly contributed to R-bloggers)

Observations are grouped

When we have recorded two traits in different subjects, we can be interested in describing their joint variability, by using the Pearson’s correlation coefficient. That’s ok, altough we have to respect some basic assumptions (e.g. linearity) that have been detailed elsewhere (see here). Problems may arise when we need to test the hypothesis that the correlation coefficient is equal to 0. In this case, we need to make sure that all the couples of observations are taken on independent subjects.

Unfortunately, this is most often false whenever measures are taken from designed field experiments. In this case, observations may be grouped by one or more treatment/blocking factors. This has been clearly described by Bland and Altman (1994); we would like to give an example that is more closely related to plant/crop science. Think about a genotype experiment, where we compare the behaviour of several genotypes in a randomised blocks design. Usually, we do not only measure yield. We also measure other traits, such as crop height. At the end of the experiment, we might be interested in reporting the correlation between yield and height. How should we proceed? It would seem an easy task, but it is not.

Let’s assume that we have a randomised blocks design, with 27 genotypes and 3 replicates. For each plot, we recorded two traits, i.e. yield and the weight of thousand kernels (TKW). In the end, we have 81 plots and just as many couples of measures in all. We will use the dataset ‘WheatQuality.csv’, that is available on ‘gitHub’.

library(Hmisc) library(knitr) library(plyr) dataset <- read.csv("", header=T) head(dataset) ## Genotype Block Height TKW Yield Whectol ## 1 arcobaleno 1 90 44.5 64.40 83.2 ## 2 arcobaleno 2 90 42.8 60.58 82.2 ## 3 arcobaleno 3 88 42.7 59.42 83.1 ## 4 baio 1 80 40.6 51.93 81.8 ## 5 baio 2 75 42.7 51.34 81.3 ## 6 baio 3 76 41.1 47.78 81.1 How many correlations?

It may be tempting to consider the whole lot of measures and calculate the correlation coefficient between yield and TKW. This is the result:

ra <- with(dataset, rcorr(Yield, TKW) ) ra$r[1,2] #Correlation coefficient ## [1] 0.540957 ra$P[1,2] #P-level ## [1] 1.850931e-07

We observe a positive correlation, and \(r\) seems to be significantly different from 0. Therefore, we would be encouraged to conclude that plots with a high value on yield tend to have a high value on TKW, as well. Unfortunately, such a conclusion is not supported by the data.

Indeed, the test of significance is clearly invalid, as the 81 plots are not independent; they are grouped by block and genotype and we are totally neglecting these two effects. Are we sure that the same correlation holds for all genotypes/blocks? Let’s check this.

wCor <- function(x) cor(x$Yield, x$TKW) wgCors <- ddply(dataset, ~Genotype, wCor) wgCors ## Genotype V1 ## 1 arcobaleno 0.9847228 ## 2 baio 0.1611952 ## 3 claudio -0.9993872 ## 4 colorado 0.9837293 ## 5 colosseo 0.4564855 ## 6 creso -0.5910193 ## 7 duilio -0.9882330 ## 8 gianni -0.7603802 ## 9 giotto 0.9520211 ## 10 grazia 0.4980828 ## 11 iride 0.7563338 ## 12 meridano 0.1174342 ## 13 neodur 0.4805871 ## 14 orobel 0.8907754 ## 15 pietrafitta -0.9633891 ## 16 portobello 0.9210135 ## 17 portofino -0.9900764 ## 18 portorico 0.1394211 ## 19 preco 0.9007067 ## 20 quadrato -0.5840238 ## 21 sancarlo -0.6460670 ## 22 simeto -0.4051779 ## 23 solex -0.6066363 ## 24 terrabianca -0.4076416 ## 25 verdi 0.5801404 ## 26 vesuvio -0.7797493 ## 27 vitromax -0.8056514 wbCors <- ddply(dataset, ~Block, wCor) wbCors ## Block V1 ## 1 1 0.5998137 ## 2 2 0.5399990 ## 3 3 0.5370398

As for the genotypes, we have 27 correlation coefficients, ranging from -0.999 to 0.985. We only have three couples of measurements per genotype and it is clear that this is not much, to reliably estimate a correlation coefficient. However, it is enough to be suspicious about the extent of correlation between yield and TKW, as it may depend on the genotype.

On the other hand, the correlation within blocks is more constant, independent on the block and similar to the correlation between plots.

It may be interesting to get an estimate of the average within-group correlation. To this aim, we can perform two separate ANOVAs (one per trait), including all relevant effects (blocks and genotypes) and calculate the correlation between the residuals:

mod1 <- lm(Yield ~ factor(Block) + Genotype, data = dataset) mod2 <- lm(TKW ~ factor(Block) + Genotype, data = dataset) wCor <- rcorr(residuals(mod1), residuals(mod2)) wCor$r ## x y ## x 1.00000000 0.03133109 ## y 0.03133109 1.00000000 wCor$P ## x y ## x NA 0.7812693 ## y 0.7812693 NA

The average within-group correlation is very small and unsignificant. Let’s think about this correlation: residuals represent yield and TKW values for all plots, once the effects of blocks and genotypes have been removed. A high correlation of residuals would mean that, letting aside the effects of the block and genotype to which it belongs, a plot with a high value on yield also shows a high value on TKW. The existence of such correlation is clearly unsopported by our dataset.

As the next step, we could consider the means for genotypes/blocks and see whether they are correlated. Blocks and genotypes are independent and, in principle, significance testing is permitted. However, this is not recommended with block means, as three data are too few to make tests.

means <- ddply(dataset, ~Genotype, summarise, Mu1=mean(Yield), Mu2 = mean(TKW)) rgPear <- rcorr( as.matrix(means[,2:3]) ) rgPear$r ## Mu1 Mu2 ## Mu1 1.0000000 0.5855966 ## Mu2 0.5855966 1.0000000 rgPear$P ## Mu1 Mu2 ## Mu1 NA 0.00133149 ## Mu2 0.00133149 NA means <- ddply(dataset, ~Block, summarise, Mu1=mean(Yield), Mu2 = mean(TKW)) rbPear <- cor( as.matrix(means[,2:3]) ) rbPear ## Mu1 Mu2 ## Mu1 1.00000000 -0.08812544 ## Mu2 -0.08812544 1.00000000

We note that the correlation between genotype means is high and significant. On the contrary, the correlation between block means is near to 0.

And so what?

At this stage, you may be confused… Let’s try to clear the fog.

Obtaining a reliable measure of correlation from designed experiments is not obvious. Indeed, in every designed field experiment we have groups of subjects and there are several possible types of correlation:

  1. correlation between plot measurements
  2. correlation between the residuals
  3. correlation between treatment/block means

All these correlations should be investigated and used for interpretation.

  1. The ‘naive’ correlation between the plot measurements is very easily calculated, but it is grossy misleading. Indeed, it disregards the treatment/block effects and it does not permit hypotheses testing, as the subjects are not independent. In this example, looking at the ‘naive’ correlation coefficient, we would wrongly conclude that plots with high yield also have high TKW, while further analyses show that this is not true, in general. We would reasonably suggest that the ‘naive’ correlation coefficient is never used for interpretation.
  2. The correlation between the residuals is a reliable measure of joint variation, because the experimental design is adequately referenced, by removing the effects of tratments/blocks. In this example, residuals are not correlated. Further analyses show that the correlation between yield and TKW, if any, may depend on the genotype, while it does not depend on the block.
  3. The correlation between treatment/block means permits to assess whether the treatment/block effects on the two traits are correlated. In this case, while we are not allowed to conclude that yield and TKW are, in general, correlated, we can conclude that the genotypes with a high level of yield also show a high level of TKW.
Take-home message

Whenever we have data from designed field experiments, our correlation analyses should never be limited to the calculation of the ‘naive’ correlation coefficient between the observed values. This may not be meaningful! On the contrary, our interpretation should be mainly focused on the correlation between residuals and on the correlation between the effects of treatments/blocks.

An elegant and advanced method to perform sound correlation analyses on data from designed field experiments has been put forward by Piepho (2018), within the frame of mixed models. Such an approach will be described in another post.

  1. Bland, J.M., Altman, D.G., 1994. Statistics Notes: Correlation, regression, and repeated data. BMJ 308, 896–896.
  2. Piepho, H.-P., 2018. Allowing for the structure of a designed experiment when estimating and testing trait correlations. The Journal of Agricultural Science 156, 59–70.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on The broken bridge between biologists and statisticians. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Analysing the HIV pandemic, Part 1: HIV in sub-Sahara Africa

Tue, 04/30/2019 - 02:00

(This article was first published on R Views, and kindly contributed to R-bloggers)

Phillip (Armand) Bester is a medical scientist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa

Sabeehah Vawda is a pathologist, researcher, and lecturer at the Division of Virology, University of the Free State, and National Health Laboratory Service (NHLS), Bloemfontein, South Africa

Andrie de Vries is the author of “R for Dummies” and a Solutions Engineer at RStudio


The Human Immunodeficiency Virus (HIV) is the virus that causes acquired immunodeficiency syndrome (AIDS). The virus invades various immune cells, causing loss of immunity, and thus increased susceptibility to infections, including Tuberculosis and cancer. In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug resistance testing facility. In this series of blog posts we highlight the serious problem of HIV infection in sub-Saharan Africa, with special analysis of the situation in South Africa.

Stages of HIV infection

HIV infection can be divided into the three consecutive stages: acute primary infection, asymptomatic stage, and the symptomatic stage.

The first stage, acute primary infection, has symptoms very much like flu and may last for a week or two. The body reacts with an immune response, which results in the production of antibodies to fight the HIV infection. This process is called seroconversion and can last a couple of months. During this stage, although the patient is infected and the virus is spreading through the body, the patient might not test positive. This initial period of seroconversion is called ‘the window period’ and depends on the type of test used. Rapid tests are done at the point of care. This means that the test can be done at the clinic with a finger prick and the result is ready in 20 minutes. The drawback of this test is a window period of three months and a small false positive rate. The rapid test detects HIV antibodies, and because the immune system needs some time to produce sufficient antibodies to be detected, there is this window period. Most laboratories these days use fourth-generation ELISA (Enzyme-Linked Immunosorbent Assay) for HIV diagnosis and confirmation. This technique detects both HIV antibodies and antigens. Antigens are the foreign objects that the immune system recognizes as ‘non-self’; in this case, it is the viral protein p24. The advantage of this technique is a window period of only one month.

This first stage, including the window period, is then followed by the asymptomatic stage, which may last for as long as ten years. During this stage, the infected person does not experience symptoms and feels healthy. However, the virus is still replicating and destroying immune cells, especially CD4 cells. This damages the immune system and ultimately leads to stage 3 if not treated. This does not mean that people at stage 3 are doomed, but the earlier treatment starts, the better the outcome.

Stage 3 is referred to as symptomatic HIV infection or AIDS (Acquired Immune Deficiency Syndrome). At this stage, the immune system is so weak that it is not able to fight off bacterial or fungal infections that typically do not cause infections in immune competent people. These serious infections are called opportunistic infections, and have a high morbidity and mortality rate.

Transmission and epidemiology

Worldwide, approximately 36.9 million (UNAIDS) people are living with HIV.

HIV is transmitted mainly by:

  • Having unprotected sex
  • Non-sterile needles in drug use or sharing needles
  • Mother-to-child transmission during birth or breastfeeding
  • Infected blood transfusions, transplants or other medical procedures (very unlikely)

We mentioned the window period of the HIV infection as well as the asymptomatic stage. During any of the stages, it is possible to transmit the infection. The problem with the window period is an unknown HIV status or falsely assumed negative status, and during the asymptomatic stage, there is no reason for the infected person to seek medical attention. There are obviously behavioural issues in HIV transmission, and due to the long asymptomatic phase, HIV-positive status can be unknown for a long period. For these reasons, it is important that high-risk individuals do frequent HIV tests to determine their status.

Treatment for HIV infection

HIV is treatable but not (yet) curable. The good news, however, is that if a person receives antiretroviral (ARV) treatment, their viral load suppresses (viral replication stops) and the chance of transmitting HIV drastically decreases.

So 30 years into this pandemic, the big question is, why is HIV still a problem?

Not all countries adopted the use of ARVs in an equal manner. Although AZT (Zidovudine) was the first drug to be approved by the FDA in March 1987, it was soon discovered that monotherapy with only AZT was not effective for very long, as the virus developed resistance to the medicine quickly. Since then, ARVs have come a long way, and patients are placed on:

  • HAART (Highly Active Antiretroviral Treatment), or
  • cART (combination Antiretroviral Treatment), which typically consists of 3 drugs of different classes.
HIV in Africa

Let’s look at the rates of HIV infection in different African countries. The world factbook by the CIA has some HIV infection rate data.

suppressPackageStartupMessages({ library(dplyr) library(readr) library(stringr) library(tidyr) library(ggplot2) library(forcats) library(knitr) library(maptools) library(viridis) library(RColorBrewer) library(mapproj) library(broom) library(ggrepel) library(sf) }) # read the HIV data HIV_rate_2016 <- read_csv( file.path(file_path, "HIV rates.csv"), col_names = TRUE, col_types = "cd" ) # read the Africa shape file africa <- sf::st_read( file.path(file_path, "Africa_SHP/Africa.shp"), stringsAsFactors = FALSE, quiet = TRUE ) %>% rename(Country = "COUNTRY") %>% left_join(HIV_rate_2016, by = "Country") africa %>% ggplot(aes(fill = Rate)) + geom_sf() + coord_sf() + scale_fill_viridis(option = "plasma") + theme_minimal()

In the choropleth above, we see that South Africa, Botswana, Lesotho, and Swaziland seem to have the highest rates of infection. This is presented as the percentage infected, which takes into account population sizes. It is important to understand that the level of denial is indirectly proportional to the reported rate of infection. Even in this day and age, denial of stigmatized diseases is an issue.

Cleaning the data

We can also look at the burden of HIV as the number of people infected, and we might get a different picture from what we saw from the choropleth.

Here, we read in the data, and rename the columns to Country, PersCov (percentage ARV coverage), NumberOnARV (Number of patients on ARVs), and NumberInfected (Number of patients infected).

# Read csv with ARV infection dat arv_dat <- read_csv(file.path(file_path, "ARV cov 2017.csv"), col_types = "cccc", col_names = c("Country", "PersCov", "NumberOnARV", "NumberInfected"), skip = 1 ) head(arv_dat) ## # A tibble: 6 x 4 ## Country PersCov NumberOnARV NumberInfected ## ## 1 Afghanistan No data 790 No data ## 2 Albania 42 [40-44] 570 1400 [1300-1400] ## 3 Algeria 80 [75-87] 11000 14 000 [13 000-15 000] ## 4 Andorra No data No data No data ## 5 Angola 26 [22-30] 78700 310 000 [260 000-360 000] ## 6 Antigua and Barbuda No data No data No data

This data has several symptoms of being very messy:

  • Very long variable names, descriptive, but difficult to work with; this was changed during import
  • The values contain confidence intervals in brackets; this will be difficult to work with as-is
  • We might want to transform no data to NA
  • We are interested in Sub-Saharan Africa, but the data is for the whole world
# A list of Sub-Saharan countries sub_sahara <- readLines(file.path(file_path, "Sub-Saharan.txt")) clean_column <- function(x){ # Remove the ranges in brackets and convert the values to numeric x %>% str_replace_all("\\[.*?\\]", "") %>% str_replace_all("<", "") %>% str_replace_all(" ", "") %>% as.numeric() } arv_dat <- arv_dat %>% filter(Country %in% sub_sahara) %>% na_if("No data") %>% mutate_at(2:4, clean_column) head(arv_dat) ## # A tibble: 6 x 4 ## Country PersCov NumberOnARV NumberInfected ## ## 1 Angola 26 78700 310000 ## 2 Benin 55 38400 70000 ## 3 Botswana 84 318000 380000 ## 4 Burkina Faso 65 61400 94000 ## 5 Burundi 77 60100 78000 ## 6 Cameroon 49 254000 510000

We use a regular expression to get rid of all the square bracket ranges. We also remove the “<” sign and spaces within numbers, change “No data” to NA, and convert the characters to numbers. We filter out the countries we don’t want. (Note that some countries are not available in the ARV data, e.g., Swaziland and Reunion.)

Highest infected countries

Now look at the countries with the highest number of infected people of all ages.

arv_dat %>% top_n(4, wt = NumberInfected) %>% arrange(-NumberInfected) %>% kable( caption = "Countries with the highest number of HIV infections" ) Table 1: Countries with the highest number of HIV infections Country PersCov NumberOnARV NumberInfected South Africa 61 4359000 7200000 Nigeria 33 1040000 3100000 Mozambique 54 1156000 2100000 Kenya 75 1122000 1500000

We can see that South Africa has the highest number of HIV-infected people in Sub-Saharan Africa.

HIV in Southern Africa

In South Africa, the first AIDS-related death occurred in 1985. Not all patients were eligible to receive ARVs, and it was only in 2004 that ARVs became available in the public sector in South Africa. Eligibility restriction still applied, so not all HIV infected patients received treatment.

Ideally, a country would have all its HIV-infected people on treatment, but due to financial constraints, this is not always possible. In South Africa, patients were only initialized on ARVs when their CD4 counts dropped below a certain level. This threshold was initially 200 cells/mL in 2004, which was then changed to 350 cells/mL and 500 cell/mL at later intervals. These recommendations were a compromise between the availability of funds and getting ARVs to the people needing it the most. CD4 cells are a major component of the immune system; the lower the CD4 cell count the higher the chance for opportunistic infections. Thus, the idea is to support the patients who are most likely to contract an opportunistic infection.

The problem with this was that about only a third of the HIV infected people in South Africa were receiving HAART treatment. In 2017, the guidelines changed to test and treat; i.e., any newly diagnosed patient will receive HAART treatment. This is a big improvement for many reasons, but notably a lower infection rate. If a patient is taking HAART treatment and it is effective in suppressing the viral replication, the chances of the patient transmitting the virus are very close to zero.

However, these treatments are not without side effects, which in some cases causes very poor adherence to the treatment. There are numerous factors to blame here, specifically socio-economic factors and depression. There is also ignorance and the “fear of knowing”, which causes people not to know their status. Finally, human nature brings with it various other complexities, such as conspiracy theories, and religious and personal beliefs. This will be a very long post if we delve into all the issues, but the take-home message is: the situation is complicated.

ARV coverage by country

We looked at the rate of HIV infections, and also the number of people infected, in the most endemic countries. We have talked about treatment. It would be interesting to look at ARV coverage by country.

Let’s see how these countries rank by ARV coverage:

arv_dat %>% na.omit(PersCov) %>% ggplot(aes(x = reorder(Country, PersCov), y = PersCov)) + geom_point(aes(colour = NumberInfected), size = 3) + scale_colour_viridis( name = "Number of people infected", trans = "log10", option = "plasma" ) + coord_flip() + ylab("% ARV coverage") + xlab("Country") + theme_bw()

This shows that Zimbabwe, Namibia, Botswana, and Rwanda have the highest ARV coverage (above 80%). South Africa has the highest number of infections (as we saw before), and coverage of just above 60%.

Botswana rolled out their treatment program in 2002, and by mid-2005, about half of the eligible population received ARV treatment. South Africa, on the other hand, only started treatment in 2004, which we discuss later.

When talking about treatment, we should also look at the changes in mortality.

HIV related deaths

Read in the data:

hiv_mort <- read_csv(file.path(file_path, "HIV deaths.csv"), col_types = "ccccc") %>% na_if("No data") %>% mutate_at(vars(starts_with("Deaths")), clean_column) %>% filter(Country %in% sub_sahara) head(hiv_mort) ## # A tibble: 6 x 5 ## Country Deaths_2017 Deaths_2010 Deaths_2005 Deaths_2000 ## ## 1 Angola 13000 10000 7900 3900 ## 2 Benin 2500 2600 4300 2600 ## 3 Botswana 4100 5900 13000 15000 ## 4 Burkina Faso 2900 5400 12000 15000 ## 5 Burundi 1700 5400 8600 8500 ## 6 Cameroon 24000 25000 26000 17000 summary(hiv_mort) ## Country Deaths_2017 Deaths_2010 Deaths_2005 ## Length:43 Min. : 100 Min. : 100 Min. : 100 ## Class :character 1st Qu.: 1900 1st Qu.: 1975 1st Qu.: 2050 ## Mode :character Median : 4400 Median : 5400 Median : 8250 ## Mean : 15442 Mean : 23483 Mean : 33227 ## 3rd Qu.: 16250 3rd Qu.: 27250 3rd Qu.: 48250 ## Max. :150000 Max. :200000 Max. :260000 ## NA's :3 NA's :3 NA's :3 ## Deaths_2000 ## Min. : 100 ## 1st Qu.: 1150 ## Median : 6500 ## Mean : 26496 ## 3rd Qu.: 41500 ## Max. :130000 ## NA's :3

The 2017 mean for the dataset as a whole is about half of that during the early 2000s. It would be interesting to plot this data, but it will probably be too busy as it is. We can instead have a look at countries which had the most change.

hiv_mort <- hiv_mort %>% mutate( min = apply(hiv_mort[, 2:4], 1, FUN = min), max = apply(hiv_mort[, 2:4], 1, FUN = max), Change = max - min )

Next, we can create a plot of the data, and look at the top five countries with the biggest change in HIV-related mortality.

hiv_mort %>% top_n(5, wt = Change) %>% gather(Year, Deaths, Deaths_2017:Deaths_2000) %>% na.omit() %>% mutate( Year = str_replace(Year, "Deaths_", "") %>% as.numeric(), Country = fct_reorder(Country, Deaths) ) %>% ggplot(aes(x = Year, y = Deaths, color = Country)) + geom_line(size = 1) + geom_vline(xintercept = 2004, color = "black", linetype = "dotted", size = 1.5) + scale_color_viridis(option = "D", discrete = TRUE) + theme_bw() + theme(legend.position = "bottom")

Remember, we mentioned that HAART (Highly Active Antiretroviral Treatment) was introduced in 2004 in South Africa, depicted here by the black dotted line. It is easy to appreciate the dramatic effect the introduction of ARVs had in South Africa.

Although the picture above is positive, the fight is not over. The target is to get at least 90% of HIV-infected patients on treatment. Adherence to ARV regimens stays crucial not only to suppress viral replication, but also to minimize the development of drug resistance.

Infection rates

As mentioned earlier, if a patient is taking and responding to treatment, the viral load gets suppressed and the chances of transmitting the infection become very close to null. Thus, the more patients with an undetectable viral load, the lower the transmission rate.

Read the data:

new_infections <- read_csv(file.path(file_path, "Epidemic transition metrics_Trend of new HIV infections.csv"), na = "...", col_types = cols( .default = col_character(), `2017_1` = col_double() ) ) %>% select( -ends_with("_upper"), -ends_with("lower"), -ends_with("_1") ) %>% mutate_at(-1, clean_column) %>% na.omit() ## Warning: Duplicated column names deduplicated: '2017' => '2017_1' [26] new_infections %>% gather(Year, NewInfections, 2:9) %>% ggplot(aes(x = Year, y = NewInfections, color = Country)) + geom_point() + theme_classic() + theme(legend.position = "none") + xlab("Year") + ylab("Number of new infections")

This is a bit busy. Countries that are highly endemic with good ARV coverage and prevention of infection programs should have a steeper decline in the newly infected people. At first glance, it looks like some of the data points are fairly linear. Let’s go with that assumption, and apply linear regression to each country.

rates_modeled <- new_infections %>% filter(Country %in% sub_sahara) %>% na.omit() %>% gather(Year, NewInfections, 2:9) %>% mutate(Year = as.numeric(Year)) %>% group_by(Country) %>% do(tidy(lm(NewInfections ~ Year, data = .))) %>% filter(term == "Year") %>% ungroup() %>% mutate( Country = fct_reorder(Country, estimate, .desc = TRUE) ) %>% arrange(desc(estimate)) %>% select(-one_of("term", "statistic")) rates_modeled %>% head() %>% kable( caption = "Results of linear regression: Rate of new infections per year" ) Table 2: Results of linear regression: Rate of new infections per year Country estimate std.error p.value Madagascar 469.04762 12.56126 0.0000000 Côte d’Ivoire 190.47619 153.99689 0.2623441 Botswana 130.95238 92.46968 0.2064860 Mali 108.33333 23.21683 0.0034452 Congo 103.57143 16.45271 0.0007486 Eritrea 89.28571 23.05347 0.0082374 rates_modeled %>% na.omit() %>% ggplot(aes(x = Country, y = estimate, fill = p.value >= 0.05)) + geom_col() + coord_flip() + theme_bw() + ylab("Estimated change in HIV infection (people/year)")

With a quick look at the plot shown above, we can see that for most countries, a linear model fits the data with a significant p-value cutoff of 0.05.

It is important to note here that the data we have at hand is from 2010 to 2017. This shows that some countries – notably, South Africa – are on a good trajectory. Botswana, being the “Poster Child” of a good HIV treatment and prevention program, seems to have stabilized in terms of rate of infection, with a positive but insignificant estimate of the rate of infection. This could be explained by the following reasons:

  • First African country to introduce HAART, 2002
  • Progressive in terms of prevention programs
  • Looking only from 2010, we are missing the dramatic decline in infection
  • The WHO goal is to get 90% of a country’s infected people on HAART, but the last 5-7% might be the hardest to convince

We can combine the ARV and estimated rates of infection data.

arv_on_infection <- arv_dat %>% left_join(rates_modeled, by = "Country") %>% mutate(p_interpretation = if_else(p.value >= 0.05, "Significant", "Insignificant")) ## Warning: Column `Country` joining character vector and factor, coercing ## into character vector arv_on_infection %>% na.omit() %>% ggplot(aes(x = PersCov, y = estimate, shape = p_interpretation >= 0.05)) + geom_point(aes(color = NumberInfected), size = 2) + geom_text_repel(aes(label = Country), size = 3) + scale_color_gradient(high = "red", low = "blue") + theme_grey() + xlab("% ARV coverage") + ylab("Estimated change in HIV infection\n(people/year)") + ggtitle("Antiretroviral (ARV) coverage")

South Africa has the highest number of infected people, but on the positive side, has a downward trajectory of about 15000 fewer people newly infected each year. Although ARVs do play a crucial role in controlling this epidemic, it is not the only factor involved. Prevention of mother-to-child transmission has been very successful in South Africa. Awareness campaigns and education are playing a big role as well. The plot above shows our linearly modeled rates.

The laboratory, HIV diagnosis and monitoring

HIV-related laboratory tests are not the only diagnostics done in a Virology department, but in endemic countries, it accounts for the majority of tests which are done. The first HIV-related test done would be for diagnosis. This is done differently in adults than in infants. As we discussed earlier, after HIV infection, the immune system develops antibodies. We can use a field of study called serology to detect antibodies and antigens, and in most cases, an ELISA test is performed to confirm HIV seroconversion or status. Since the mother’s antibodies will be present in the infant, an ELISA will tell us the baby is positive even though not infected. Infants are diagnosed by detecting viral RNA or DNA in their blood. This is done by PCR (Polymerase Chain Reaction).

Once a patient is diagnosed as HIV-positive, the patient will be initiated on HAART, and in most cases, the viral load will be suppressed. In the South African public sector treatment program, after HAART initiation, the patient gets two six-monthly viral load tests to make sure viral replication is suppressed. To keep an eye out for trouble, a yearly viral load is done to confirm adherence and effectiveness of the treatment.

When an unsuppressed viral load is detected, action is taken and adherence counselling is performed. If this does not solve the problem, drug-resistance testing is performed to assess the resistance profile of the infection in order to adjust the ARV regimen accordingly. This is done by isolating the viral RNA, converting it to DNA, amplifying the DNA to sufficient quantities to enable sequencing of the DNA. In our laboratory, we use Sanger sequencing, but other sequencing technologies also exist.

Figure 1: HIV Genome as depicted by the Los Alamos HIV sequence database. Available at

This diagram depicts the genome of HIV. The most common targets for interfering with viral replication is located in the pol gene. Specifically:

  • prot: The viral protease. Many of the viral proteins are translated as longer polypeptides, which are then cleaved into mature proteins by the protease.

  • p51 RT: The viral reverse transcriptase: Each virion contains two copies of viral RNA. The reverse transcriptase converts the RNA to DNA.

  • p31 int: The viral integrase: This enzyme integrates the reverse transcribed viral DNA into host genomes of the infected cells, and establishes chronic infection.

Essentially, ARVs interfere with these viral enzymes by inhibiting their action:

  • Protease inhibitors prevent the maturation of viral proteins.

  • Reverse transcriptase inhibitors prevent the formation of a DNA copy of the viral genome, which then gives the integrase nothing to work with.

  • Integrase inhibitors prevent the integration of viral DNA into the host genome, which is a crucial part of replication and infection.

Combining these ARVs in clever ways results in HAART or cART. By sequencing the viral RNA, we can detect mutations that cause resistance to specific ARVs. This information is then used to adjust the ARV regimen to once again effectively suppress viral replication.

The viral reverse transcriptase has a high error rate when doing the conversion of RNA to DNA, and introduces random mutations in the viral genome. In the presence of selective pressure like ARVs, these random mutations might give advantageous phenotypic traits to the replicating virus, like drug resistance. On the other hand, if the patient is properly adhering to the treatment, the viral replication is suppressed, replication does not occur, thus mutations can’t occur.

This high rate of mutation can be used in the laboratory as one of the quality-control tools. The polymerase chain reaction is prone to contamination, so it is possible when doing these reactions that one sample might contaminate another. This will give rise to false mutations in the contaminated sample and an erroneous result to the treating clinician, thus direct negative impact on the patient.

What next?

In a recent publication in PLoS ONE, the authors described how they used affordable hardware to create a phylogenetic pipeline, tailored for the HIV drug resistance testing facility.

  • In Part 2 of this four part series, we discuss this pipeline.

  • In Part 3, we will discuss genetic distances and phylogenetics.

  • Finally, in Part 4, we will look at the application of logistic regression in analyzing inter- and intra-patient genetic distance of viral sequences.

See you in the next section!


var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RStudio 1.2 Released

Tue, 04/30/2019 - 02:00

(This article was first published on RStudio Blog, and kindly contributed to R-bloggers)

We’re excited to announce the official release of RStudio 1.2!

What’s new in RStudio 1.2?

Over a year in the making, this new release of RStudio includes dozens of new productivity enhancements and capabilities. You’ll now find RStudio a more comfortable workbench for working in SQL, Stan, Python, and D3. Testing your R code is easier, too, with integrations for shinytest and testthat. Create, and test, and publish APIs in R with Plumber. And get more done with background jobs, which let you run R scripts while you work.

Underpinning it all is a new rendering engine based on modern Web standards, so RStudio Desktop looks sharp on displays large and small, and performs better everywhere – especially if you’re using the latest Web technology in your visualizations, Shiny applications, and R Markdown documents. Don’t like how it looks now? No problem–just make your own theme.

You can read more about what’s new this release in the release notes, or our RStudio 1.2 blog series.

Thank you!

We’d like to thank the open source community for helping us to make this release possible. Many of you used the preview releases for your day to day work and gave us invaluable bug reports, ideas, and feedback. We’re grateful for your support–thank you for helping us to continue making RStudio the best workbench for data science!

All products based on RStudio have been updated. You can download the new release, RStudio 1.2.1335, here:

Download RStudio 1.2

Feedback on the new release is always welcome in the community forum. Update and let us know what you think!

Coming Soon: RStudio Pro 1.2 General Availability announcement

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: RStudio Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

ViennaR Meetup March – Full Talks Online

Tue, 04/30/2019 - 02:00

(This article was first published on Quantargo Blog, and kindly contributed to R-bloggers)

The full talks of the ViennaR March Meetup are finally online: A short Introduction to ViennaR, Laura Vana introducing R-Ladies Vienna and Hadley Wickham with a great introduction to tidy(er) data and the new functions pivot_wider() and pivot_longer(). Stay tuned for the next ViennaR Meetups!


You can download the slides of the introduction here.

Ronald Hochreiter, Mario Annau, Walter Djuric Laura Vana: R-Ladies

Laura introduced the R-Ladies Vienna – a program initiated by the R-Consortium to

achieve proportionate representation by encouraging, inspiring, and empowering the minorities currently underrepresented in the R community. R-Ladies’ primary focus, therefore, is on supporting the R enthusiasts who identify as an underrepresented minority to achieve their programming potential, by building a collaborative global network of R leaders, mentors, learners, and developers to facilitate individual and collective progress worldwide.


Visit the Meetup site to find out more about upcoming R-Ladies events including workshops covering an R Introduction and Bayesian Statistics.

Laura Vana Hadley Wickham: Tidy(er) Data

Hadley Wickham’s talk covered the tidyr package and two new functions: pivot_wide() and pivot_long() which have finally been renamed to pivot_wider() and pivot_longer() as a result of a survey, see also the posts on Twitter and Github. These function should replace gather() and spread() since they seem to be hard-to-remember for most users coming to tidyr.

Hadley Wickham

See you at one of our next Meetup/Course/Get-Together!


Your Quantargo Team

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Quantargo Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Marketing with Machine Learning: Apriori

Tue, 04/30/2019 - 01:05

(This article was first published on R – Detroit Data Lab, and kindly contributed to R-bloggers)

This is part 1 of an ongoing series, introduced in Detroit Data Lab Presents: Marketing with Machine Learning


Apriori, from the latin “a priori” means “from the earlier.” As with many of our predictions, we’re learning from the past and applying it toward the future.

It’s the “Hello World” of marketing with machine learning! The simple application is growth in sales by identifying items that are commonly purchased together as part of a set by a customer, segment, or market. Talented marketers can power a variety of strategies with insights from this, including intelligent inventory management, market entry strategies, synergy identification in mergers & acquisitions, and much more.

The Marketing with Machine Learning Starter Kit: A problem, a data specification, a model, and sample code.

First, the problem we’re solving. We all understand Amazon’s “customers like you also purchased…” so let’s branch out. In the example at the bottom, we’re looking to help shape our market strategy around a key client. This tactic would be particularly useful as part of an account-based marketing strategy, where a single client is thought of as a “market of one” for independent analysis.

We’ll pretend your business has gone to great lengths to understand where one of your customers sources all of its materials, and mapped it out accordingly. You want to understand your opportunity for expansion. If this customer purchases services X from other customers, are there other strongly correlated services? You could be missing out on an opportunity to land more business, simply by not knowing enough to ask.

We move on to the data specification. There are two best ways to receive this data. The long dataset, or “single” dataset, looks like this:

The wide dataset, or the “basket” dataset, looks like this:

CSV’s, as usual, are the typical way these are transferred. Direct access to a database or system is always preferred, since we want to tap into pre-existing data pipelines where necessary so it’s already cleaned and scrubbed of any sensitive information.

Next, the model to use. In this case, we’re interested in determining the strength of association between purchasing service X and any other services, so we’ve selected the apriori algorithm. The apriori algorithm has 3 key terms that provide an understanding of what’s going on: support, confidence, and lift.

Support is the count of how often items appear together. For example, if there are 3 purchases:

  1. Pen, paper, keyboard
  2. Computer, soda, pen
  3. Pen, marker, TV, paper

Then we say the support for pen + paper is 2, since the group occurred two times.

Confidence is essentially a measure of how much we believe product will follow, once we know product X is being purchased. In the example above, we would say that buying pen => buying paper has confidence of 66.7%, since there are 3 transactions with a pen, and 2 of them also include paper.

Lift is the trickiest one. It’s the ratio of your confidence value to your expected confidence value. It often is considered the importance of a rule. Our lift value for the example above is (2/3) / ((3/3) * (2/3)) or 1. A lift value of 1 is actually uneventful or boring, since what it tells us is that the pen is very popular, and therefore the paper purchase probably has nothing to do with the pen being in the transaction. A lift value of greater than 1 implies some significance in the two items, and warrants further investigation.

Finally, the code you can use. To keep the introduction simple, we’re using R with only a few packages, and nothing more sophisticated than a simple CSV.

Using the Detroit Business Certification Register from the City of Detroit, we’ll interpret it as a consumption map. This dataset, with its NIGP Codes, provides us with a transactional set of all categories a business is in.

## Apriori Demonstration code # Loading appropriate packages before analysis library(tidyr) library(dplyr) library(stringr) library(arules) # Read in the entire Detroit Business Register CSV dfDetBusiness <- read.csv("Detroit_Business_Certification_Register.csv") # Prepare a new data frame that puts each NIGP code in its own row, by business dfCodes <- dfDetBusiness %>% mutate(Codes = strsplit(as.character(NIGP.Codes),",")) %>% unnest(Codes) %>% select(Business.Name, Codes) # Some rows appear with a decimal place. Strip out the decimal place. dfCodes$Codes <- str_remove(dfCodes$Codes, "\\.0") # Also trim the whitespace from any codes if it exists dfCodes$Codes <- trimws(dfCodes$Codes) # Write the data write.csv(dfCodes, file = "TransactionsAll.csv", row.names = FALSE) # Import as single with arule objTransactions <- read.transactions("TransactionsAll.csv", format = "single", skip = 1, cols = c(1,2)) # Glance at the summary; does it make sense? summary(objTransactions) # Relative frequency of top 10 items across all transactions itemFrequencyPlot(objTransactions,topN=10,type="relative") # Generate the association rules; maximum 4 items in a rule assoc.rules <- apriori(objTransactions, parameter = list(supp=0.001, conf=0.8,maxlen=4)) # Check out the summary of the object summary(assoc.rules) # View the top 25 rules inspect(assoc.rules[1:25])

By plotting the frequency, reviewing the summary of the association rules, and inspecting a few of the rows, a plan can begin to emerge on what actions to take.

In this case, we find Rental Equipment (977) and Amusement & Decorations (037) always go together. Similarly, when Acoustical Tile (010) and Roofing Materials (077) are purchased, so are Elevators & Escalators (295). An odd combination!


With these insights uncovered, a marketer can begin to form a strategy. Do you develop a new product? Exit a specific market? Press on the customer to purchase more in a certain area from us, or a strategic partner? The possibilities are endless, but at least you’re armed with knowledge to make the decision!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Detroit Data Lab. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

My book ‘Cricket analytics with cricketr and cricpy’ is now on Amazon

Mon, 04/29/2019 - 14:07

(This article was first published on R – Giga thoughts …, and kindly contributed to R-bloggers)

‘Cricket analytics with cricketr and cricpy – Analytics harmony with R and Python’ is now available on Amazon in both paperback ($21.99) and kindle ($9.99/Rs 449) versions. The book includes analysis of cricketers using both my R package ‘cricketr’ and my python package ‘cricpy’ for all formats of the game namely Test, ODI and T20. Both packages use data from ESPN Cricinfo Statsguru. The paperback is available on Amazon for $21.99 and the kindle version is available for $9.99/Rs 449

Pick up your copy today!

The book includes the following chapters


Introduction 7
1. Cricket analytics with cricketr 9
1.1. Introducing cricketr! : An R package to analyze performances of cricketers 10
1.2. Taking cricketr for a spin – Part 1 48
1.2. cricketr digs the Ashes! 69
1.3. cricketr plays the ODIs! 97
1.4. cricketr adapts to the Twenty20 International! 139
1.5. Sixer – R package cricketr’s new Shiny avatar 168
1.6. Re-introducing cricketr! : An R package to analyze performances of cricketers 178
1.7. cricketr sizes up legendary All-rounders of yesteryear 233
1.8. cricketr flexes new muscles: The final analysis 277
1.9. The Clash of the Titans in Test and ODI cricket 300
1.10. Analyzing performances of cricketers using cricketr template 338
2. Cricket analytics with cricpy 352
2.1 Introducing cricpy:A python package to analyze performances of cricketers 353
2.2 Cricpy takes a swing at the ODIs 405
Analysis of Top 4 batsman 448
2.3 Cricpy takes guard for the Twenty20s 449
2.4 Analyzing batsmen and bowlers with cricpy template 490
9. Average runs against different opposing teams 493
3. Other cricket posts in R 500
3.1 Analyzing cricket’s batting legends – Through the mirage with R 500
3.2 Mirror, mirror … the best batsman of them all? 527
4. Appendix 541
Cricket analysis with Machine Learning using Octave 541
4.1 Informed choices through Machine Learning – Analyzing Kohli, Tendulkar and Dravid 542
4.2 Informed choices through Machine Learning-2 Pitting together Kumble, Kapil, Chandra 555
Further reading 569
Important Links 570

Also see
1. My book “Deep Learning from first principles” now on Amazon
2. Practical Machine Learning with R and Python – Part 1
3. Revisiting World Bank data analysis with WDI and gVisMotionChart
4. Natural language processing: What would Shakespeare say?
5. Optimal Cloud Computing
6. Pitching yorkpy … short of good length to IPL – Part 1
7. Computer Vision: Ramblings on derivatives, histograms and contours

To see all posts click Index of posts

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Giga thoughts …. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Getting a Data Science Job is not a Numbers Game!

Mon, 04/29/2019 - 12:23

(This article was first published on Learn R Programming & Build a Data Science Career | Michael Toth, and kindly contributed to R-bloggers)

My First (Non Data Science) Job Search

Let me tell you a story about my first job search. It was 2010, and data science jobs weren’t really a thing yet. I’ll get to that in a minute, but bear with me first because there’s a point to all this.

At the time, I was a junior at the University of Pennsylvania, where I was studying finance and statistics.

Every year, there was months-long on-campus recruiting season where all of the students frantically applied to secure prestigious jobs and internships from big banks and other financial companies.

And I knew I NEEDED to get one of those jobs.

But unlike a lot of the other students at Penn, I didn’t grow up in New York City. I was from a working class family in the midwest. So when I started my job search, I had no connections in the industry. Zero.

Obviously, my applications weren’t going to be fast tracked. But more importantly, I had nobody to ask questions about the different job opportunities available.

It felt like everybody else in school understood all of the different types of jobs in finance and which ones to apply to.

Not me. I had no clue. But, what the hell, I figured. I was smart. I was near the top of my class! I had a 3.8 GPA and an extremely difficult and technical course load. Somebody would hire me! And I didn’t really care what type of job I got, I just needed a job. So I applied to everything.

I applied to trading jobs. I applied to investment risk jobs. I applied to investment banking jobs. Over 100 jobs in total. And then… crickets. Most of the companies didn’t even respond to me! It was demoralizing.

I was freaking out. I needed to land a job. My first job was extremely important, and it would set me up for my entire career. I worried if I didn’t get a job now, I would have to move back to Ohio after I graduated, probably closing the door on a prestigious finance job forever.

My Big Break

Finally, after weeks of anguish, I got a big break. I had been asked to interview for one of the jobs I had applied to, a technical trading job at Allstate.

I was so excited. When I went to the interview, I met a guy named Mark. Mark was the one who had decided to interview me, and the hiring decision was ultimately his to make.

Mark and I got along well, from what I remember. He had a relatively senior role at the company, but his schooling background had been similar to mine. He’d studied finance and engineering, and he was looking for somebody smart with a strong combination of finance and technical skills.

He must have seen some potential in me, because shortly after the interview he offered me the job. Of the hundred-or-so jobs I applied to, this was the only offer I received.

What I Learned

So what’s my point? I tell you your data science job search is not a numbers game, but then I tell you how I applied to over a hundred jobs to get a single offer. What gives?

Here’s the thing: I got this job because I had the exact combination of finance and technical skills that Mark was looking for. Most of the other jobs I applied for, I never had a shot of getting, because I didn’t have the right background. All of those applications, and all of my effort in applying, were a waste.

If I had instead focused on only applying to the technical finance jobs where I had a unique advantage, I’d have had a much higher success rate.

Getting My First Data Science Job with a Strategic R Blog Post

My experience during that application process changed how I thought about job searches. Years later, when I was trying to break into data science from my finance job at BlackRock, I took a different approach.

I knew I wanted to get into data science, but with no formal training and potentially hundreds of applicants for each position, I also knew I needed to stand out. So I scoured job boards searching for jobs where I knew my skills would give me an advantage.

I was very good at financial data manipulation, something the majority of data scientists know absolutely nothing about.

So when I found a financial technology startup specialized in online lending analytics that was looking for a data scientist, I knew it was the perfect opportunity for me.

What did I do? Did I send in my application and then just hope they contacted me?


I wrote a detailed blog post analyzing historical default rates for Lending Club loans. This was exactly the type of work I’d be expected to do at the company. I probably spent 10 hours doing the research and analysis and then writing that blog post.

What do you imagine happened? They called me in for an on-site interview. And I pretty much breezed through it. The interview was primarily to assess my cultural fit, not my data science skills, because I’d already proven to them I was capable!

Avoid the Spray and Pray Approach when Applying for Data Science Jobs

When you carefully select the companies you’re going to apply to based on an alignment of their needs and your skills, you can dedicate more time to each application. That extra time is how you stand out in a pool of hundreds of job applicants for a single position. You could:

  • Write a blog post showing your ability to do the work
  • Send them a detailed list of metrics they should be tracking to improve their business
  • Analyze a relevant public dataset and tell the company how they could incorporate that data into their product

The specific thing you do will vary across companies and industries. But if you can do something to add value and differentiate yourself from the hundreds of other applications, you will vastly improve your chances of getting a job.

ALWAYS REMEMBER THIS: The job you’re applying for exists because the company has a problem that they’re trying to solve. They’re not looking for a generic person with a generic set of data analysis skills. They’re looking for a specific person that can help them solve their problems.

That doesn’t mean you need to know everything, but it does mean that you should lean into your specific strengths when deciding where to apply. If you can show the company that you can solve their problems, you become a top-5 candidate immediately.

For your next data science job search, don’t apply for a hundred jobs. Find 10 jobs where you bring unique skills to the table, and try to do something that demonstrates your unique skills to those employers. I promise you’ll find far more success with this approach.

I will help you learn the specific skills you need to work more effectively, grow your income, and improve your career.

Sign up here to receive my best tips

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Learn R Programming & Build a Data Science Career | Michael Toth. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Why use RTutor for interactive tutorials if there is RStudio’s learnr?

Mon, 04/29/2019 - 11:00

(This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers)

There are nowadays different options to create interactive R tutorials that allow users to type in code, automatically check their solution and get hints if they are stuck. Two options are RStudio’s learnr package and my RTutor package.

Obviously, RStudio has a great track record for creating and continously supporting awesome software and R packages. So unless there are compelling reasons, it probably makes sense to go with RStudio’s learnr. Below are three potential reasons to consider RTutor. The may be relevant or not for your application.

1. Larger Multi-Chunk Exercises

I created and use RTutor mainly for longer interactive problem sets in university courses and to replicate research articles in an interactive fashion. The key idea is that different code chunks are bundled into exercises and will be solved sequentially.

By default the results of earlier chunks will be availble in later chunks in the same exercise. For example, if some data set is loaded in the first chunk of an exercise, it can be used in all later chunks.

In contrast, in learnr each code chunk is independent and one has not access to the results of earlier chunks. Here is an interesting discussion on whether such access should be possible or not. JJ Allaire also states compelling counter-arguments:

The biggest downside to allowing environment sharing is that the exercises now cannot in principle be run on an “exercise farm”. The package doesn’t currently take advantage of this but it would be nice to swap this in underneath all the existing tutorials later.

A secondary downside is that users must now execute the chunks in order (obviously in an instructor-led classroom setting this isn’t such a big problem).

So it is not clear whether in the future, learnr will allow sharing of the results of previous chunks. The mentioned secondary downside that the chunks must be solved in order, can be alleviated in two ways in RTutor. First, one can set the chunk option optional=TRUE which means that chunk does not need to be solved and the results will not be available in future chunks. Second, it is by default not possible to access chunks from previous exercises and consequently one can start with a new exercise without having completely solved the previous exercise.

So I guess if you mainly construct longer exercises where chunks build upon each other, RTutor might currently be the more convenient option for you than learnr.

2. Ability to solve inside RStudio

Both learnr and RStudio allow to show and solve a problems set in a Shiny app in the web browser. The designs differ, but I would say that there are no huge differences. An advantage of learnr is that you can more flexible change the default design and e.g. easily include interactive chunks into markdown based slides.

An advantage of RTutor is that you can also solve problem sets by editing an RMarkdown file inside RStudio. RTutor creates a RStudio Addin that allows you to check your solution. You can also type hint() in the console to get a hint for the chunk you are currently working on. For R beginners a Shiny app is probably the better option. Yet, for my Master courses where students solve several RTutor problem sets, I always let students solve the problem sets inside RStudio. The advantage is simply that students also work with the tools that they can use outside the problem set: RStudio and RMarkdown files.

3. Automatic Default Hints and Solution Checkers

I tried to design RTutor with the goal in mind to minimize the need to write custom code checkers and custom hints. While it is possible to customize code checkers and hints for each chunk (and each expression inside a chunk), I find that there is often no need for it.

Consider an example chunk of an RTutor solution file that asks the user to enter a specific dplyr pipe:

#< task df = data.frame(a=rep(1:3, length=10), x=runif(10)) #> df %>% filter(a==2) %>% arrange(x)

This is essentially just a sample solution, where the code in the # block will be already shown to the user. If the user wants a hint, RTutor analyses the currently entered code and tries to adapt the hint.

For example, assume the user has entered so far no code on her own. Then the following automatically generated hint will be shown:

My solution consists of a chain of the form: df??? %>% filter??? %>% arrange??? The ??? may stand for some function arguments wrapped in () that you must figure out.

If instead the user has entered the following code

df = data.frame(a=rep(1:3, length=10), x=runif(10)) df %>% filter(a==3) %>% arrange(x)

the following automatic hint is generated

In your following chain, I can detect wrong results already after the second element 'filter': df %>% filter(a == 3) %>% !! WRONG RESULTS !! arrange(x) %>%

There is surely always scope for improvement, but personally I often stay with the defaults without customization. While I don’t have much experience with learnr my feeling from reading the documentation is that currently there is more need for customization of code checks and hints.

Other aspects Distributing and Hosting Problem Sets

Both learnr and RTutor problem sets can be easily distributed to students to be solved on their own computer or be hosted in the web on or in an RStudio Cloud project.

I let students solve the problem sets on their own computer in all of my courses. Of course, it is some work to get R, RStudio and the relevant packages installed, but students should learn to install relevant software anyways. Even though students solve the problem set on their own computer, they have to hand in the solution and it will be partly relevant for their total grade. For this purpose, RTutor allows students to generate a submission file. This will be uploaded on a course management system, like Moodle. Bulk downloading all submissions and a little script allows me then automatic grading.

In principle, you could also decide to host the problem sets on your own Shiny server. But then you have think about security. If you want to do that, I would recommend to use learnr rather than RTutor. Unless you know pretty well what your are doing security-wise, it would probably also be a good idea to look at RStudio’s commercial offerings. In learnr’s documentation it is stated:

You can use various features of RStudio Connect and Shiny Server Pro to run tutorials within a resource and/or filesystem sandbox.

I was daddling a bit with RAppArmor based secure evaluation for RTutor. Yet, probably the ideal environment would be something that spins separate Docker containers, as is described in this Datacamp blog entry. Possible, this can be nicely implement using ShinyProxy. But I have not yet looked into details.

Community size, Long term support

Ok, perhaps some of the reasons above may want you to try RTutor. But are not the larger community and the backing by RStudio strong arguments for learnr? I guess that is definitely the case. But how important these arguments are for you, depends on your preferences and your application.

I can only say that I plan to use continue using RTutor in my courses and we currently expand RTutor usage in our department. I also have regularly updated it with new features and plan to do so in future, but not neccessarily at a fast pace. And of course, you can always file a Github issue for a feature request. RTutor also has some Github stars, so also other people seem to use it.

In addition I don’t think that there is a big lock-in effect. Even if one uses RTutor now and possibly wants to switch to newer learnr versions in the future, it should not be a lot of work to convert your tutorials. In the end both are based on RMarkdown files.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Economics and R - R posts. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Templated output in R

Mon, 04/29/2019 - 02:00

(This article was first published on R on Abhijit Dasgupta, and kindly contributed to R-bloggers)

Earo Wang, who is the curator for the We are R-Ladies twitter feed this week (last week of April, 2019), had a really nice tweet about using the whisker package to create a template incorporating text and data in R. Her example created a list of tidyverse packages with descriptions.

I really liked the example, but thought that the glue package might be able to do the same thing. I used Earo’s code to generate a tibble with the package names and descriptions, and glue_data to create the templated list.

library(tidyverse) library(glue) tidy_pkgs <- tibble('pkgs' = c('ggplot2','purrr','readr','tidyr', 'dplyr','forcats','lubridate','stringr')) %>% mutate(descr = map_chr(pkgs, ~packageDescription(., fields='Description'))) glue_data(tidy_pkgs, "- [**{pkgs}**](http://{pkgs} {descr}")
  • ggplot2: A system for ‘declaratively’ creating graphics,
    based on “The Grammar of Graphics”. You provide the data, tell ‘ggplot2’
    how to map variables to aesthetics, what graphical primitives to use,
    and it takes care of the details.
  • purrr: A complete and consistent functional programming toolkit for R.
  • readr: The goal of ‘readr’ is to provide a fast and friendly way to read
    rectangular data (like ‘csv’, ‘tsv’, and ‘fwf’). It is designed to flexibly
    parse many types of data found in the wild, while still cleanly failing when
    data unexpectedly changes.
  • tidyr: An evolution of ‘reshape2’. It’s designed
    specifically for data tidying (not general reshaping or aggregating)
    and works well with ‘dplyr’ data pipelines.
  • dplyr: A fast, consistent tool for working with data frame like objects,
    both in memory and out of memory.
  • forcats: Helpers for reordering factor levels (including
    moving specified levels to front, ordering by first appearance,
    reversing, and randomly shuffling), and tools for modifying factor
    levels (including collapsing rare levels into other, ‘anonymising’,
    and manually ‘recoding’).
  • lubridate: Functions to work with date-times and time-spans: fast and user
    friendly parsing of date-time data, extraction and updating of components of
    a date-time (years, months, days, hours, minutes, and seconds), algebraic
    manipulation on date-time and time-span objects. The ‘lubridate’ package has
    a consistent and memorable syntax that makes working with dates easy and
    Parts of the ‘CCTZ’ source code, released under the Apache 2.0 License,
    are included in this package. See for more
  • stringr: A consistent, simple and easy to use set of
    wrappers around the fantastic ‘stringi’ package. All function and
    argument names (and positions) are consistent, all functions deal with
    “NA”’s and zero length vectors in the same way, and the output from
    one function is easy to feed into the input of another.
var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on Abhijit Dasgupta. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

visualising bias and unbiasedness

Mon, 04/29/2019 - 00:19

(This article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers)

A question on X validated led me to wonder at the point made by Christopher Bishop in his Pattern Recognition and Machine Learning book about the MLE of the Normal variance being biased. As it is illustrated by the above graph that opposes the true and green distribution of the data (made of two points) against the estimated and red distribution. While it is true that the MLE under-estimates the variance on average, the pictures are cartoonist caricatures in their deviance permanence across three replicas. When looking at 10⁵ replicas, rather than three, and at samples of size 10, rather than 2, the distinction between using the MLE (left) and the unbiased estimator of σ² (right).

When looking more specifically at the case n=2, the humongous variability of the density estimate completely dwarfs the bias issue:

Even when averaging over all 10⁵ replications, the difference is hard to spot (and both estimations are more dispersed than the truth!):

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Xi'an's Og. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Interact with PostGIS from R

Sun, 04/28/2019 - 23:21

(This article was first published on (en) The R Task Force, and kindly contributed to R-bloggers)

PostGIS extends capabilities of PostgreSQL database to deal with spatial data. Using PostGIS, your database supports geographic queries to be run directly in SQL. In this blog post, we will connect and interact with a PostGIS database from R, using {DBI} and {sf}.

Package {sf} and PostGIS are friends

Package {sf} is similar to PostGIS database in multiple ways:

  • R package {sf} was created to supports simple features (thus, named “sf”). “Simple features” is an ISO standard defined by the Open Geospatial Consortium (OGC) to standardize storage and access to geographical datasets. PostGIS supports objects and functions specified in the OGC “Simple Features for SQL” specification.
  • Most {sf} functions starts with st_ to follow the denomination of functions in PostGIS. For instance, function sf::st_union has its equivalent ST_Union in Postgis. If you were wondering why, now you know !
  • Spatial data with {sf} are manipulated in a SQL way. Simple features in {sf} allows to manipulate vector spatial objects using {dplyr}, which syntax is similar to SQL functions names. For instance, dplyr::select, which can be applied on {sf} objects, has its equivalent SELECT in SQL syntax.
Configuration of your session Packages

Main packages used in this post are {DBI}, {RPostgres}, {sqlpetr} and {sf}. If you want to properly install spatial packages on Ubuntu Server, you may want to refer to our post: installation of R on ubuntu and tips for spatial packages. To change from the beloved Viridis palette, this post contains some custom ThinkR colors and palettes that you will have to change to reproduce the code. The corresponding internal package was largely inspired by @drsimonj post: Creating corporate colour palettes for ggplot2

library(DBI) library(RPostgres) library(sqlpetr) library(sf) # library(rpostgis) library(dplyr) library(dbplyr) library(rnaturalearth) library(ggplot2) # ThinkR internal package for color palette and other brand identity library(thinkridentity) Verification of GDAL drivers

To be able to connect PostGIS database in R, you will need a GDAL version that supports PostGIS driver. You can verify this using sf::st_drivers().

# GDAL drivers st_drivers() %>% filter(grepl("Post", name)) ## name long_name write copy is_raster is_vector vsi ## 1 PostgreSQL PostgreSQL/PostGIS TRUE FALSE FALSE TRUE FALSE Build a Docker with Postgres and PostGIS enabled

At ThinkR, we love using Docker containers to put R in production on servers. We already used Docker to deploy our own R archive repository, or your shiny application. For reproducibility of this blog post, it is thus natural to use Docker. We build a container with a Postgres database having PostGIS enabled.

To use this Docker container in production on your server, do not forget to set a persistant drive.

docker_cmd <- "run --detach --name some-postgis --publish 5432:5432 --env POSTGRES_PASSWORD=mysecretpassword -d mdillon/postgis" system2("docker", docker_cmd, stdout = TRUE, stderr = TRUE) ## [1] "368b4733bdd333ee9c4501b44c3e6738f2cf313730fb18db8582202ccb42adf3" Connection to the PostGIS database from R

You can connect to any database using {DBI} package, provided that you have the correct drivers installed. If you use Rstudio as your IDE to develop in R, I would recommend using package {sqlpetr} to connect to your Postgres database. This package has a “connexion contract” allowing to explore your database directly in the Rstudio Connection Pane.

# With sqlpetr con <- sp_get_postgres_connection( host = "localhost", dbname = "postgres", port = 5432, user = "postgres", password = "mysecretpassword", seconds_to_test = 30, connection_tab = TRUE ) Write and read World data in PostGIS

In the following examples, we will play with a World map from package {rnaturalearth}.
Equal Earth projection is used for maps representation here. Equal Earth projection is in the last version of “proj”. If it is not available to you, you can choose another one in the code below. But please avoid Mercator.

ne_world <- rnaturalearth::ne_countries(scale = 50, returnclass = "sf") # Choose one available to you world_map_crs <- "+proj=eqearth +wktext" # Use your custom colours ne_world %>% st_transform(world_map_crs) %>% ggplot() + geom_sf(fill = thinkr_cols("warning"), colour = thinkr_cols("dark")) + theme(panel.background = element_rect(fill = thinkr_cols("primary"))) )

We can write the data in our PostGIS database directly using sf::st_write. Default parameters overwrite = FALSE, append = FALSE prevent overwriting your data if already in the database. {sf} will guess the writing method in the database.

st_write(ne_world, dsn = con, layer = "ne_world", overwrite = FALSE, append = FALSE)

Then, we can read data from the database using sf::st_read.

world_sf <- st_read(con, layer = "ne_world") Use SQL queries before loading the spatial data

When using st_read, the dataset is completely loaded in the memory of your computer. If your spatial dataset is big, you may want to query only a part of it. Classical SQL queries can be used, for instance to extract Africa only, using st_read with parameter query.

query <- paste( 'SELECT "name", "name_long", "geometry"', 'FROM "ne_world"', 'WHERE ("continent" = \'Africa\');' ) africa_sf <- st_read(con, query = query) Tips: Do you need help for your SQL query ?

PostGIS allows you to make SQL queries with geomatics functions that you can also define before reading your dataset with {sf}. However, if you are not totally comfortable with SQL but you are a ninja with {dplyr}, then you will be fine. By using dplyr::show_query, you can get the SQL syntax of your {sf} operations. Let’s take an example.
With {sf}, union of countries by continent could be written as follows:

world_sf %>% group_by(continent) %>% summarise()

Indeed, the complete {sf} code would be:

world_sf %>% group_by(continent) %>% summarise(geometry = st_union(geometry))

Now, let us connect to the Postgres database with {dplyr}. When connected to a database with {dplyr}, queries are executed directly in SQL in the database, not in the computer memory.
Then use show_query to get the translation in SQL proposed by {dplyr}.

# Read with dplyr directly in database world_tbl <- tbl(con, "ne_world") # Create query world_tbl %>% group_by(continent) %>% summarise( geometry = st_union(geometry)) %>% show_query() ## ## SELECT "continent", ST_UNION("geometry") AS "geometry" ## FROM "ne_world" ## GROUP BY "continent"

{dplyr} has not been tailored to translate spatial operations in SQL, but as you can see below, difference with the correct query is very small. Only the PostGIS function is not correctly written (ST_Union).

query <- paste( 'SELECT "continent", ST_Union(geometry) AS "geometry"', 'FROM "ne_world"', 'GROUP BY "continent";' ) union_sf <- st_read(con, query = query, quiet = TRUE) # Plot (Choose your own palette) union_sf %>% st_transform(world_map_crs) %>% ggplot() + geom_sf(aes(fill = continent), colour = thinkr_cols("dark")) + scale_fill_thinkr_d(palette = "discrete2_long", guide = FALSE) + theme(panel.background = element_rect(fill = thinkr_cols("primary_light")))

Be careful. You can not trust operations with {dplyr}/{dbplyr} on a spatial database because the ‘geometry’ column is not imported as spatial column with tbl() and you would loose it. However, you can use {dplyr} syntax to explore your dataset and all columns out of the spatial information. All operations will be realised inside the database, which assures not to overload the memory.

What about rasters ?

If you follow this blog, you know that I like rasters. Package {sf} is only dealing with spatial vector objects. It can not manage rasters in a PostGIS database. To do so, you can have a look at package {rpostgis} with functions pgGetRast() and pgWriteRast(). This package works with Rasters from package {raster}. You will need to connect using package {RPostgreSQL}, which does not have a Rstudio “connexion contract”. Note that you can have two connections to your database, one with {sqlpetr} to keep the Rstudio Connection pane, and the other with {RPostgreSQL} to connect with Raster objects.

  • Connect to the database with {RPostgreSQL}
library(rpostgis) con_raster <- RPostgreSQL::dbConnect("PostgreSQL", host = "localhost", dbname = "postgres", user = "postgres", password = "mysecretpassword", port = 5432 ) # check if the database has PostGIS pgPostGIS(con_raster) ## [1] TRUE
  • Create an empty raster and save it in the database
# From documentation basic test r <- raster::raster( nrows = 180, ncols = 360, xmn = -180, xmx = 180, ymn = -90, ymx = 90, vals = 1 ) # Write Raster in the database pgWriteRast(con_raster, name = "test", raster = r, overwrite = TRUE ) ## [1] TRUE
  • Read the raster from the database
r_read <- pgGetRast(con_raster, name = "test") print(r_read) ## class : RasterLayer ## dimensions : 180, 360, 64800 (nrow, ncol, ncell) ## resolution : 1, 1 (x, y) ## extent : -180, 180, -90, 90 (xmin, xmax, ymin, ymax) ## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 ## data source : in memory ## names : layer ## values : 1, 1 (min, max)
  • Do not forget to close the database connection after your operations
# Disconnect database RPostgreSQL::dbDisconnect(con_raster) ## [1] TRUE Disconnect database and close Docker container
  • Disconnect database connection made with {sqlpetr}
  • Stop and remove Docker container if wanted
# Disconnect database DBI::dbDisconnect(con) # Stop Docker container system2("docker", "stop some-postgis") system2("docker", "rm --force some-postgis") # Verify docker system2("docker", "ps -a") To go further Packages used in this blogpost # remotes::install_github("ThinkR-open/attachment") sort(attachment::att_from_rmd("2019-04-27_postgis.Rmd")) ## [1] "attachment" "DBI" "dbplyr" "dplyr" "ggplot2" "knitr" "raster" ## [8] "rnaturalearth" "rpostgis" "RPostgres" "RPostgreSQL" "sf" "sqlpetr" "thinkridentity"

The post Interact with PostGIS from R appeared first on (en) The R Task Force.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: (en) The R Task Force. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

RStudio’s multiple cursors rule!

Sun, 04/28/2019 - 21:30

(This article was first published on Embracing the Random | R, and kindly contributed to R-bloggers)

(TL;DR: Come on. This is pretty short. Productivity level up by harnessing the power of RStudio!)


Say we’re at work and we’ve received some data in Excel.

Let’s say that this Excel workbook contains a subset of a broader dataset. For example, maybe our Excel file contains:

  • a list of customer ID numbers that we need to look into, or
  • a list of dates where our data looks weird and we need to find out what has gone wrong.

Maybe the broader dataset is located within a database. Maybe we have a larger dataset within R itself. In any case, we want to use this subset of data to filter a broader dataset and get more info that might help us in our data analysis adventure!

What I’ve seen a lot of people do in Excel (including myself!)

Let’s proceed with a date and time related example given my recent obsession with time zones. We’ve been given some file of dates to look into and it looks like this:

We want to collapse these into quoted, comma-separated strings that we’ll pass into the c() function. This will give us a character vector. Once we have a character vector, we want to convert these into POSIXct values using the power of vectorisation!

In the past, I would have embarassingly done something like this to enclose each date in single quotes with a trailing comma at the end of each line:

But this requires far too many keystrokes! My new mechanical keyboard with CHERRY MX Brown keyswitches are rated for only 50 million keystrokes per switch. I must be frugal with my keystrokes!

Lazy is good – let’s try this instead

Let’s harness the power of RStudio.

  1. We’ll copy the values in the spreadsheet column and paste them into a new script in RStudio:

  2. If we’re using a PC keyboard, let’s hold down Alt and click and drag the cursor down the left-hand side of each row until we reach the last line. If we’re using a Mac, we’ll have to do some Googling…but I can only assume that we should use the option key instead of the Alt key!

  3. Let’s highlight all of our pasted text row-wise. Let’s hold down Alt+Shift, then press the right arrow.

  4. We’ll quote them lines using another shortcut – Shift+' (Shift and the quotation mark key):

  5. While our cursor is still across all lines of the last column, we’ll press the comma key:

  6. Once we’ve removed the annoying trailing comma on the last line, we’ve got ourselves some quoted, comma-separated strings!

Now we can pass these into the c() function and continue on with our lives. For example, we now have this:

c("26/04/2019 08:00:00", "25/04/2019 21:30:00", "25/04/2019 22:00:00", "25/04/2019 18:45:00", "25/04/2019 14:40:00", "25/04/2019 22:00:00", "25/04/2019 16:00:00") ## [1] "26/04/2019 08:00:00" "25/04/2019 21:30:00" "25/04/2019 22:00:00" ## [4] "25/04/2019 18:45:00" "25/04/2019 14:40:00" "25/04/2019 22:00:00" ## [7] "25/04/2019 16:00:00"

We assign it to a variable:

date_time_str <- c("26/04/2019 08:00:00", "25/04/2019 21:30:00", "25/04/2019 22:00:00", "25/04/2019 18:45:00", "25/04/2019 14:40:00", "25/04/2019 22:00:00", "25/04/2019 16:00:00")

We use the dmy_hms() function from lubridate to convert each element of our vector into POSIXct objects. Since we’re in Sydney, Australia, we’ll assign each the ‘Australia/Sydney’ time zone:

library(lubridate) date_time_values <- dmy_hms(date_time_str, tz='Australia/Sydney') print(date_time_values) ## [1] "2019-04-26 08:00:00 AEST" "2019-04-25 21:30:00 AEST" ## [3] "2019-04-25 22:00:00 AEST" "2019-04-25 18:45:00 AEST" ## [5] "2019-04-25 14:40:00 AEST" "2019-04-25 22:00:00 AEST" ## [7] "2019-04-25 16:00:00 AEST"


And since we’re in Sydney, Australia…

We’ll end with this:

“Bloody brilliant!”


var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Embracing the Random | R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

NYC… Dangerous or Deadly?

Sun, 04/28/2019 - 19:28

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)


Traffic collisions in New york city are all too common and can result in devastating injuries and death. Even so, New Yorkers and tourists travel around the city risking life and limb to get where they need to go. This R Shiny project is designed to give insight in to when, where and how to best to travel around the five boroughs of New York, safely. 

Follow the following links to view my Shiny app and the code on Github.  

Data & Methodology

The dataset used for this app was provided by NYC Open Data and the NYPD. Every traffic collision that occurs in NYC where the police are called is recorded in to a database with associated data regarding location, time, persons and vehicles involved, and any injuries that occured. This data set has recorded collisions for 2015 through early 2017.

This app is designed to show trends in collisions as well as give the user the ability to filter for their specific neighborhoods, commuting path and time if desired. I filtered the data to involve only collisions with an injury or death having occurred.  This provided enough data points to spot trends in collisions as well as being more relevant for the user. 

In addition to the data that is provided by the NYPD, I added extra columns to break down the collisions in to:

  • Weekday and Weekend
  • Type of transportation
  • Dangerous or Deadly


The App and Results Map

The map provides the user the ability to view groupings and trends of collisions filtered by:

  • Weekday or Weekend
  • Transportation type
  • Time of day
  • Time of year
  • Number of collisions  

The last slider provides the user the ability to “declutter” the map and more easily spot overall trends. 

Some areas are hotbeds for collisions:

  • Midtown Manhattan
  • Bridge Entrances
  • Busy Transfer points (Subway entrances leading in to vehicle traffic)

Weekend collisions are much less prevalent than weekday.

Tables and Charts

The “Tables and Charts” tab gives more insight in to more relative differences in time and location. 

Looking at collisions over the course of the day, “deadly” collisions are more clustered between the hours of 8PM and 6AM whereas “dangerous” collisions peak during morning and evening rush hour hour.   

The tables below the graph update with the user inputs and show the most frequent types of each variable. 

Many more insights and trends can be gleaned from the map, tables and plot by changing the variables and inputs. 


Future Research

I look to further my app and research in to NYC collisions by adding more interactivity to the map and looking at correlations between types of drivers, age of driver, and  seasonality. 

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Tips to Reduce the Complexity of Slide Making with Xaringan

Sun, 04/28/2019 - 18:00

(This article was first published on Yongfu's Blog, and kindly contributed to R-bloggers)

Although I’m a strong believer in R Markdown, I’m not so sure about it in the realm of slide making. After using GUI slide making tools such as Powerpoint or Google Slides for years, it is not easy to get used to making slides with markdown, since it requires additional processing in the brain – the conversion between images in the brain and the markup language to create them on the slides.
xaringan::inf_mr() greatly reduces this burden on the brain, but one serious drawback still persists. When making slides with Markdown, the source document often gets very long. I frequently found myself lost in the source document (after scrolling up and down), not knowing which part of the slide is it that I’m looking at1.
To be fair, there are great things about making slides with markdown, for example, it’s easier to manage and reuse images through URLs, and markdown is convenient for formatting code (automatic syntax highlighting).

I have been making a lot of slides with Xaringan lately. After spending hours on Xaringan, I began to have more faith in making slides with markdown because I figured out some tips to reduce the pain caused by markup langauges. The idea is simple – reduce the length and complexity of the source R Markdown document. knitr is my friend here.

Use Several Source Documents

bookdown uses several Rmd files to generate a book. The same idea can be used in Xaringan, and here are some advantages I can think of:

  • By splitting the source document into several meaningful sub-documents, you can locate particular parts of your slide faster (the so-called “chunking”).

  • Your slides can be easily reused, since you can copy a part of your slides by choosing the relevent Rmd file(s). (Not copy-and-pasting text from the parts you want in one lengthy Rmd file)

Child Document

To use several source Rmd documents to generate a single Xaringan (or any R Markdown) output, use knitr chunk option child to include other Rmd files in a Rmd document. For example, I would create one index.Rmd and several Rmd files with meaningful names (e.g., opening.Rmd, intro-github.Rmd, contact.Rmd, etc.):

my-slide/ ├── index.Rmd ├── opening.Rmd ├── intro-github.Rmd ├── contact.Rmd ├── html-table.txt ├── img/ └── addons/    ├── custom.css    └── macros.js

The chunk option child is then used in index.Rmd to include other *.Rmd:

--- title: "Introduction to R Markdown" output: xaringan::moon_reader: nature: beforeInit: ["addons/macros.js", ""] css: [default, default-fonts, addons/custom.css] --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) options(knitr.duplicate.label = 'allow') ``` ```{r child='opening.Rmd'} ``` --- ```{r child='intro-github.Rmd'} ``` --- ```{r child='contact.Rmd'} ``` Complex Markups in Independent Files

Sometimes we may want to put cool things in the slides that can only be created from HTML syntax, for example, the table below (generated from

Heuristic plausible implausible Syntax-driven plausible P600
N400 ✗ implausible P600 N400

This table looks simple, but it is generated by a 2728-character-long HTML code, which greatly increases the length and complexity of the source document. To make the source document cleaner, you can put code of these kinds in separate text files and include them into the R Markdown source document by knitr::asis_output(readLines('html-table.txt'))2 and R Markdown inline code `r `:

A table generated from []( `r knitr::asis_output(readLines('html-table.txt'))` remark.js Built-in Functionalities

remark.js actually provides useful functionalities to help reduce the complexity of the slides. READ THE DOCs, and it might save you a great deal of time. Xaringan and remark.js both provide good wiki pages, and I regret I read them too late – after I spent a lot of time copy-and-pasting and made my Rmd source document uglier.

Below are some notes extract from Xaringan & remark.js Wiki pages. They serve as quick references (for myself) and aren’t meant to be detailed. Again, READ THE DOCs!


Xaringan Configuration (remark has a more thorough documentation) is set in YAML frontmatter:

output: xaringan::moon_reader: nature: ratio: "16:10" beforeInit: ["addons/macros.js", ""] highlightLines: true highlightSpans: false navigation: scroll: false css: [default, default-fonts, addons/custom.css] yolo: false seal: true remark Special Syntax
  • name: Adding ID to a slide

    name: about ## About
    • reference with see [About](#about)
  • count

    count: false This slide will not be counted.
  • template

    name: template-slide Some content. --- template: template-slide Content appended to template-slide's content.
  • layout

  • Image with Absolute postition:

    Define macros in addons/macros.js:

    remark.macros['abs'] = function(width="30%", left="85%", top="15%", cl="") { var url = this; return ' + url + '" style="position:absolute;left:' + left + ';top:' + top + ';width:' + width + '" class="' + cl + '" />'; };

    Use it in markdown:

    ![:abs width, left, top](url) ![:abs 30%, 50%, 0%](url) src="url" style="position:absolute; width:30%; left:50%; top:0%;">
  1. This is the major drawback of markup languages compared to GUI authoring tools. Although markdown itself is designed to deal with this drawback (i.e. the too-complicated HTML syntax), the inherently complex structure of slideshows still complicates the source document of the slides writen in markdown. 

  2. Note that there is only one line in html-table.txt so readLines() returns a character vector of length 1. If there are multiple lines in the text file, one needs to use paste(readLines('html-table.txt'), collapse = '\n') to properly print out the text file. 

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Yongfu's Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

The Mrs. White probability puzzle

Sun, 04/28/2019 - 16:50

(This article was first published on R [english] – NC233, and kindly contributed to R-bloggers)

tl;dr -I don’t remember how many games of Clue I’ve played but I do remember being surprised by Mrs White being the murderer in only 2 of those games. Can you give an estimate and an upper bound for the number of games I have played?
We solve this problem by using Bayes theorem and discussing the data generation mechanism, and illustrate the solution with R.

The characters in the original game of Clue. Mrs White is third from the left on the first row (and is now retired!) Making use of external information with Bayes theorem Having been raised a frequentist, I first tried to solve this using a max likelihood method, but quickly gave up when I realized how intractable it actually was, especially for the upper bound. This is a problem where conditioning on external knowledge is key, so the most natural way to tackle it is to use Bayes theorem. This will directly yield an interpretable probability for what we’re looking for (most probable number of and uncertainty interval)   Denote an integer n>3 and: Our notations for the problem   What we want writes as a simple product of quantities that we can compute, thanks to Bayes: Bayes theorem gives us directly the probability we are looking for Notice that there is an “proportional to” sign instead of an equal. This is because the denominator is just a normalization constant, which we can take care of easily after computing the likelihood and the prior. Likelihood The likelihood indicates the odds of us observing the data (in this case, that k_Mrs_White = 2) given the value of the unknown parameter (here the number of games played). Since at the beginning of each game the murderer is chosen at uniform random between 6 characters, the number of times Mrs White ends up being the culprit can be modeled as a binomial distribution with parameters n and 1/6.   This will be easily obtained using the dbinom function, which gives directly the exact value of P(x = k), for any x and a binomial distribution of parameters n and p. Let’s first import a few useful functions that I put in our GitHub repo to save some space on this post, and set a few useful parameters: library(tidyverse) source("clue/clue_functions.R") ## Parameters k_mrs_white <- 2 # Number of times Mrs. White was the murderer prob <- 1/6 # Probability of Mrs. White being the murderer for one game Note that we can’t exactly obtain the distribution for any game from 1 to infinity, so we actually compute the distribution for 1 to 200 games (this doesn’t matter much in practice): x <- 1:200 # Reduction of the problem to a finite number of games ## Likelihood dlikelihood <- dbinom(k_mrs_white, x, prob) easy enough   Side note: when I was a student, I kept forgetting that the distribution functions existed in R and whenever I needed them I used to re-generate them using the random generation function (rbinom in this case) Prior There are a lot of possible choices for the prior but here I’m going to consider that I don’t have any real reason to believe and assume a uniform probability for any number of games between 3 and 200: dprior1 <- dunifdisc(x,3,100) plot_clue_prior(x, dprior1) Uniform prior for all number of games between 3 and 100 First posterior Using the likelihood and the prior, we can easily compute the posterior, normalize it and plot it: dposterior1 <- dlikelihood * dprior1 dposterior1 <- dposterior1 / sum(dposterior1) plot_clue_posterior(x, dposterior1) Plot of the first posterior computed We can also compute directly the estimates we’re looking for. The most probable number of games played is 11: > which.max(dposterior1) [1] 11 And there is a 95% chance that the number of games is less than 40: > threshold_val <- 0.975 > which(cumsum(dposterior1) > (threshold_val))[1] [1] 40 A more realistic data generation mechanism I find this result very unsatisfying. It doesn’t “feel” right to me that someone would be surprised by only 2 occurrences of Mrs White being guilty in such a small number of games! For example, I simulated 40 games, a number that was supposedly suspiciously high according to the model: set.seed(9) N_sim_games <- 40 sim_murderer <- runifdisc(N_sim_games, 6) plot_murderer <- ggplot(tibble(x=1:N_sim_games, y=sim_murderer), aes(y, stat(count))) + geom_histogram(aes(y =..count..), bins=6, fill="white",colour="black") + ylab("Frequency - murderer") + xlab("Character #") + scale_x_continuous(breaks = 1:6) print(plot_murderer) Simulating 40 games of Clue. Character #4 and #5 were the murderer in respectively only 2 and 3 games We observe that characters #4 and #5 are the murderers in respectively only 2 and 3 games!   In the end I think what really counts is not the likelihood that *Mrs White* was the murderer 2 times, but that the *minimum* number of times one of the characters was the culprit was 2 times!   I think it’s a cool example of a problem where just looking at the data available is not enough to make good inference – you also have to think about *how* the data was generated (in a way, it’s sort of a twist of the Monty Hall paradox, which is one of the most famous examples of problems where the data generation mechanism is critical to understand the result).   I wrote a quick and dirty function based on simulations to generate this likelihood, given a certain number of games. I saved the distribution directly in the GitHub (and called it Gumbel since it kinda looks like an extreme value problem) so that we can call it and do the same thing as above: gumbelclue_2 <- readRDS("clue/dcluegumbel_2.rds") gumbelclue_2 <- gumbelclue_2[x] dposterior_gen <- gumbelclue_2 * dprior1 dposterior_gen <- dposterior_gen / sum(dposterior_gen) plot_clue_posterior(x, dposterior_gen) Posterior with the new data generation mechanism The new posterior has the same shape but appears shifted to the right. For example N_games = 50 seems much more likely now! The estimates are now 23 for the number of games: > which.max(dposterior_gen) [1] 23 And 51 for the max bound of the uncertainty interval > threshold_val <- 0.975 > which(cumsum(dposterior_gen) > (threshold_val))[1] [1] 51

Credits for title image: Yeonsang

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R [english] – NC233. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Test your tidyness – a short quiz to check your tidyverse capabilities

Sun, 04/28/2019 - 14:00

(This article was first published on R on Adi Sarid's personal blog, and kindly contributed to R-bloggers)

Over the last month I gave a tidyverse + intro to data science corporate training in a startup in Tel-Aviv. We had two groups (beginners and intermediates), and for the last assignment of the course I was aiming for a short quiz comprised of various topics which we covered during the course, such that can also be automated easily (i.e., multiple choice questions).

I came up with the following quiz, which I thought would be nice to share here. I guess that experts should probably be able to complete this in a few minutes, intermediate/beginners would probably complete this by up to 30 minutes.

Exam instructions

The following exam contains 10 questions which spans across the different topics regaring tidyverse, and some analysis dilemmas. Each question has four options but only one correct answer. Each correct answer provides you with 10 points.

You can use any materials you want including but not limited to: cheat sheets, our course materials, stack overflow, package documentation, running code and seeing what it does.

Question 1:

When would you use an R markdown file (.Rmd) versus a script file (.R) to save your work?

  1. If I want the relative position of the file retained (so that it is easier to load files from the same directory), I will use an .Rmd file, otherwise I will use a .R file.
  2. When I want a complete documentation of my work in a report I will use a .Rmd. I will use a .R file for debugging and sourcing functions.
  3. There is no significant difference between the two formats, and they can be used for the same things interchangably.
  4. There is no benefit to using .R script files, the .Rmd format is always superior.
Question 2:

Look at the following segments of code.

# segment 1: new_data <- read.csv("myfilename.csv") # segment 2: new_data %>% group_by(some_cool_suff) %>% summarize(average = mean(avg_me, na.rm = T)) -> updated_df # segment 3: avg_var <- mean(new_data$avg_me[!], na.rm = T) # segment 4: data.frame(a = 1:10, b = letters[1:10]) %>% sample_n(3)

Which segments would you classify as tidyverse syntax?
(tidyverse syntax = code which uses functions from tidyverse packages, in which there is no function that you can replace to a tidyverse equivalent)

  1. Segment 1 and segment 3.
  2. Segment 2 and segment 4.
  3. Segment 4.
  4. Segment 2.
Question 3:

What ggplot2 geoms would you use to generate the following charts?

  1. Figure 1: not generated with ggplot2, Figure 2: geom_point.
  2. Figure 1: geom_boxplot, Figure 2: geom_line.
  3. Figure 1: geom_violin, Figure 2: geom_point.
  4. Figure 1: geom_boxplot, Figure 2: geom_point + geom_line.

Question 4:

What is the difference between the matrix and the tibble in the following?

matrix(cbind(1:10, letters[1:10], LETTERS[1:10]), ncol = 3) ## [,1] [,2] [,3] ## [1,] "1" "a" "A" ## [2,] "2" "b" "B" ## [3,] "3" "c" "C" ## [4,] "4" "d" "D" ## [5,] "5" "e" "E" ## [6,] "6" "f" "F" ## [7,] "7" "g" "G" ## [8,] "8" "h" "H" ## [9,] "9" "i" "I" ## [10,] "10" "j" "J" tibble(num = 1:10, sl = letters[1:10], cl = LETTERS[1:10]) ## # A tibble: 10 x 3 ## num sl cl ## ## 1 1 a A ## 2 2 b B ## 3 3 c C ## 4 4 d D ## 5 5 e E ## 6 6 f F ## 7 7 g G ## 8 8 h H ## 9 9 i I ## 10 10 j J
  1. The tibble has named variables (columns) and the matrix does not name the columns.
  2. The tibble retains the original data type and the matrix converts the data types.
  3. matrix is a base R function and tibble is a tidyverse function.
  4. All of the above.
Question 5:

What stringr function would you use to simplify the following code?

some_string <- c("How are you today?", "Is this test ok?", "You're already half way in!") map_chr(some_string, ~paste0(stringi::stri_wrap(., width = 5), collapse = "\n"))
  1. str_count.
  2. str_wrap.
  3. str_sub.
  4. No such function: must use a combination of a stringr and a loop (or a map function).
Question 6:

What is the difference between contains and one_of?

  1. Both are “select helpers”, one_of is used to specify strings which starts with one of the specified expressions, and contains lets you specify the variable names in “non standard evaluation” (unquoted) style.
  2. contains selects variables based on the regular expression you feed as an argument. one_of needs you to specify the variable names as strings.
  3. contains selects variables which contain the literal string you feed into it. one_of needs you to specify the variables names as strings.
  4. Both functions do the same thing with the same arguments.
Question 7:

When reshaping data with the gather function, what is the meaning of the ... argument?

  1. Specify which variables to gather by.
  2. Specify which variables not to gather by (using the “-” sign).
  3. Specify either a or b.
  4. Provide variable by which to group the resulting tibble.
Question 8:

What function would you use to get all the rows in tibble1 which are not in tibble2?

  1. setdiff(tibble1, tibble2)
  2. setdiff(tibble2, tibble1)
  3. intersect(tibble1, tibble2)
  4. semi_join(tibble1, tibble2)
Question 9:

Assume you examine the data which appears in the following scatter plot using per-axis boxplots. Would classify point A as an outlier?

  1. Yes, only accoring to the y-axis.
  2. Yes, only according to the x-axis.
  3. Yes, according to either x-axis or y-axis.
  4. No, it will not be classified as an outlier.

Question 10:

You encountered a data set in which all variables are normally distributed with an unequal variance and
unequal expectancy (mean). You wish to run a KMeans clustering to cluster the data. What would you do
as a preprocessing step?

  1. Scale and center the data using the function scale.
  2. Scale and center the data using min-max scaling and centering.
  3. Either a or b.
  4. Nothing – since the data is already normally distributed, no scaling or centering is required.
Bonus question (5 points bonus):

Did you sign up for R-Bloggers updates? (feed to receive R related news and updates)

  1. Yes (5 points bonus).
  2. No, but I’m doing it now (2.5 points bonus).
  3. No, and I don’t intend to.

P.S. I’m not getting any benefits from R-bloggers for “advertising” them, I genuinly think it’s a great source to stay updated, and improve your R capabilities.

Quiz answers

Answers available in the following gist.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on Adi Sarid's personal blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Bayes vs. the Invaders! Part Four: Convergence

Sun, 04/28/2019 - 10:38

(This article was first published on Weird Data Science, and kindly contributed to R-bloggers)

Sealed and Buried

In the previous three posts

'>1 in our series delving into the cosmic horror of UFO sightings in the US, we have descended from the deceptively warm and sunlit waters of basic linear regression, through the increasingly frigid, stygian depths of Bayesian inference, generalised linear models, and the probabilistic programming language Stan.

In this final post we will explore the implications of the murky realms in which we find ourselves, and consider the awful choices that have led us to this point. We will therefore look, with merciful brevity, at the foul truth revealed by our models, but also consider the arcane philosophies that lie sleeping beneath.

Deviant Interpretations

Our crazed wanderings through dark statistical realms have led us eventually to a varying slope, varying intercept negative binomial generalised linear model, whose selection was justified over its simpler cousins via leave-one-out cross-validation (LOO-CV). By interrogating the range of hyperparameters of this model, we could reproduce an alluringly satisfying visual display of the posterior predictive distribution across the United States:

Varying intercept and slope negative binomial GLM of UFO sightings against population. (PDF Version)

Further, our model provides us with insight into the individual per-state intercept \(\alpha\) and slope \(\beta\) parameters of the underlying linear model, demonstrating that there is variation between the rate of sightings in US states that cannot be accounted for by their ostensibly human population.

Varying slope and intercept negative binomial GLM parameter plot for UFO sightings model. (PDF Version)

Interpreting these parameters, however, is not as quite as simple as in a basic linear modeladditive models (GAMs), and further into the vast chaotic mass of machine learning approaches such as random forests and convolutional neural networks, there is a general tradeoff between the terrifying power of the modelling technique and the ability of the human mind to .'>2. Most importantly our negative binomial GLM employs a log link function to relate the linear model to the data:

y &\sim& \mathbf{NegBinomial}(\mu, \phi)\\
\log(\mu) &=& \alpha + \beta x\\
\alpha &\sim& \mathcal{N}(0, 1)\\
\beta &\sim& \mathcal{N}(0, 1)\\
\phi &\sim& \mathbf{HalfCauchy}(2)

In a basic linear regression, \(y=\alpha+\beta x\), the \(\alpha\) parameter can be interpreted as the value of \(y\) when \(x\) is 0. Increasing the value of \(x\) by 1 results in a change in the \(y\) value of \(\beta\). We have, however, been drawn far beyond such naive certainties.

The \(\alpha\) and \(\beta\) coefficients in our negative binomial GLM produce the \(\log\) of the \(y\) value: the mean of the negative binomial in our parameterisation.

With a simple rearrangement, we can being to understand the grim effects of this transformation:

_ & \log(\mu) &=& \alpha + \beta x\\
\Rightarrow &\mu &=& \operatorname{e}^{\alpha + \beta x}\\

If we set \(x=0\):

\mu_0 &=& \operatorname{e}^{\alpha}

The mean of the negative binomial when \(x\) is 0 is therefore \(\operatorname{e}^{\alpha}\). If we increase the value of \(x\) by 1:

\mu_1 &=& \operatorname{e}^{\alpha + \beta}\\
&=& \operatorname{e}^{\alpha} \operatorname{e}^{\beta}

Which, if we recall the definition of the underlying mean of our model’s negative binomial, \(\mu_0\), above, is:
$$\mu_0 \operatorname{e}^{\beta}$$

The effect of an increase in \(x\) is therefore multiplicative with a log link: each increase of \(x\) by 1 causes the mean of the negative binomial to be further multiplied by \(\operatorname{e}^{\beta}\).

Despite this insidious complexity, in many senses our naive interpretation of these values still holds true. A higher value for the \(\beta\) coefficient does mean that the rate of sightings increases more swiftly with population.

With the full, unsettling panoply of US States laid out before us, any attempt to elucidate their many and varied deviations would be overwhelming. Broadly, we can see that both slope and intercepts are generally restricted to a fairly close range, with the 50% and 95% credible intervals notably overlapping in many cases. Despite this, there are certain unavoidable abnormalities from which we cannot — must not — shrink:

  • Only Pennsylvania presents a slope (\(\beta\)) parameter that could be considered as potentially zero, if we consider its 95% credible interval. The correlation between population and number of sightings is otherwise unambiguously positive.
  • Delaware, whilst presenting a wide credible interval for its slope (\(\beta\)) parameter, stands out as suffering from the greatest rate of change in sightings as its population increases.
  • Both California and Utah, present suspiciously narrow credible intervals on their slope (\(\beta\)) parameters. The growth in sightings as the population increases therefore demonstrates a worrying consistency although, in both cases, this rate is amongst the lowest of all the states.

We can conclude, then, that while the total number of sightings in Delaware are currently low, any increase in numbers of residents there appears to possess a strange fascination for visitors from beyond the inky blackness of space. By contrast, whilst our alien observers have devoted significant resources to monitoring Utah and California, their apparent willingness to devote further effort to tracking those states’ burgeoning populations is low.

Trembling Uncertainty

One of the fundamental elements of the Bayesian approach is its willing embrace of uncertainty. The output of our eldritch inferential processes are not point estimates of the outcome, as in certain other approaches, but instead posterior predictive distributions for those outcomes. As such, if when we turn our minds to predicting new outcomes based on previously unseen data, our outcome is a distribution over possible values rather than a single estimate. Thus, at the dark heart of Bayesian inference is a belief in the truth that all uncertainty be quantified as probability distributions.

The Bayesian approach as inculcated here has a predictive bent to it. These intricate methods lend themselves to forecasting a distribution of possibilities before the future unveils itself. Here, we gain a horrifying glimpse into the emerging occurrence of alien visitations to the US as its people busy themselves about their various concerns, scrutinised and studied, perhaps almost as narrowly as a man with a microscope might scrutinise the transient creatures that swarm and multiply in a drop of water.

Unavoidable Choices

The twisted reasoning underlying this series of posts has been not only in indoctrinating others into the hideous formalities of Bayesian inference, probabilistic programming, and the arcane subtleties of the Stan programming language; but also as an exercise in exposing our own minds to their horrors. As such, there is a tentative method to the madness of some of the choices made in this series that we will now elucidate.

Perhaps the most jarring choice has been our choice to code these models in Stan directly, rather than using one of the excellent helper libraries that allow for more concise generation of the underlying Stan code. Both brms and rstanarm possess the capacity to spawn models such as ours with greater simplicity of specification and efficiency of output, due to a number of arcane tricks. As an exercise in internalising such forbidden knowledge, however, it is useful to address reality unshielded by such swaddling conveniences.

In fabricating models for more practical reasons, however, we would naturally turn to these tools unless our unspeakable demands go beyond their natural scope. As a personal choice, brms is appealing due to it more natural construction of readable per-model Stan code to be compiled. This allows for the grotesque internals of generated models to be inspected and, if required, twisted to whatever form we desire. rstanarm, by contrast, avoids per-model compilation by pre-compiling more generically applicable models, but its underlying Stan code is correspondingly more arcane for an unskilled neophyte.

The Stan models presented in previous posts have also been constructed as simply as possible and have avoided all but the most universally accepted tricks for improving speed and stabilitylog form of the probability distribution sampling statements. Both of these contribute significantly to speed and numerical stability in sampling, and are sufficiently universal that their inclusion seemed justified.'>3. Most notably, Stan presents specific functions for GLMs based on the Poisson and negative binomial distributions that apply standard link functions directly. As mentioned, we consider it more useful for personal and public indoctrination to use the basic, albeit log-form parameterisations.

Last Rites

In concluding the dark descent of this series of posts on Bayesian inference, generalised linear models, and the unearthly effects of extraterrestrial visitions on humanity, we have applied numerous esoteric techniques to identify, describe, and quantify the relationship between human population and UFO sightings. The enigmatic model constructed throughout this and the previous three entries darkly implies that, while the rate of inexplicable aerial phenomena is inextricably and positively linked to humanity’s unchecked growth, there are nonetheless unseen factors that draw our non-terrestrial visitors to certain populations more than others, and that their focus and attention is ever more acute.

This series has inevitably fallen short of a full and meaningful elucidation of the techniques of Bayesian inference and Stan. From this first step on such a path, then, interested students of the bizarre and arcane would be well advised to draw on the following esoteric resources:

Until then, watch the skies and archive your data.

Footnotes var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Weird Data Science. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Parsing and Visualizing Transcriptomics Data in Shiny

Sun, 04/28/2019 - 07:05

(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)


The genome represents the history book and instruction set of all organisms on Earth. Over the last century, genetic analysis has provided deep insight into the underpinnings of biological processes and the molecular etiology of disease. A crucial principle to understanding how genes function is that defects can arise due to mutation of the genes themselves, but also in the with the way that they are regulated and the amount of gene products they produce. Messenger RNA (mRNA) copies of DNA genes are copies in a process called “transcription”. These mRNA are then used in turn as templates to produce proteins in the cell in a process called “translation”. The number of copies of the mRNA copies transcribed from genes and the number of copies of protein translated from mRNA are referred to as the degree to which these molecules are “expressed”.

Figure 1. Brief discussion of gene expression

Expression of genes dynamically changes in response to cues from the environment and these changes coordinate cellular biochemistry, signaling and behavior. Aberrant gene expression changes are also associated with pathophysiology of many disease, for example numerous cancers, in which key genes are expressed at too high or too low of a level, which decouples certain pathways and behaviors from the normal function of the cell.

The study of transcriptional changes, known as transcriptomics, is an important component of the business model of biotech and pharmaceutical companies, often applied to the areas of synthetic biology or drug target discovery. The data analyzed in this project come from a next generation sequencing (NGS) technique called RNA sequencing (RNA-seq), which sequences and quantifies the RNA molecules in a cell.

Each of the samples analyzed in this set are from yeast, which is a very common model organism used for genetic research. Yeast are simple and inexpensive to work with, are eukaryotic cells (like the cells in our body), and have fewer genes compared to humans (a little over 6000 compared to our >20,000). Although they seem humble and simple, yeast genetics has led to the discovery of numerous crucial genetic pathways in our own cells, such as the molecules responsible for controlling and mediating cell division.

Link to the dataset on Kaggle
Github source code
App link

Data & Methods

Dataset Overview
This dataset consisted of several CSV files describing the experimental conditions of each sample, some information about the genes, and a table of RNA-seq reads (normalized in reads of the specific gene per million reads of mRNA) for each gene in each sample.

The total list of experimental condition in the sample set were extremely varied with many of them having < 3 replicate samples for the condition. To simplify presentability to the user and increase individual group data, only 4 comparisons were evaluated in this analysis:

Experimental Conditions:
• Yeast samples grown at high temperature (37 C) vs ideal growth conditions (30 C)
• Yeast samples grown at lower temperature (30 C) vs ideal growth conditions (15 C)
• Yeast samples growth with either glucose or ethanol as their carbon source
• Wildtype yeast samples or strains optimized for biofuel production

The other CSV files for intracellular localization and additional information about the genes were incomplete. This information will be filled in with web scraping of the SGD database ( at a later time and this information added to the analysis.

Data Cleaning
The columns for expression level were read in as strings and needed to be converted to numeric. Experimental conditions were converted to factors to provide categorical levels for grouping analysis.

For each data comparison below, the mean of each gene was calculated for the replica samples (experimental and control) and a 2-sample t-test for the comparison groups was carried out for the p-value of significant differences between of their distributions. Filtering, preprocessing and calculation of these data subsets were carried out ahead of time and provide for the webpage, because calculations and loading of larger datasets were initially impacting site performance. When users select the data, only the relevant expression and pre-calculated expression values are loaded for the experimental comparison of interest.

Volcano Plot
This data visualization was chosen because it is a common method of displaying gene expression data. The reason for this is because ratios of higher expression of genes (experimental/control) can range from 1 to no upper bound, whereas lower levels of comparable gene expression range from 0 to 1. This gives a skewed impression of expression change in a plot of raw expression ratios and undue weight to genes that have higher expression in the sample. The log2(experimental) – log2(control) transformation of the data provides a more symmetrical and even-weighted display of the change in genes of both higher and lower expression level. Statistical significance (t-test: p-value) are plotted as -log10(p-value), so higher values on the y-axis reflect a higher degree of significance of that gene expression. The user may select the threshold for significance of gene expression difference (fold-change of expression and the p-value for t-test comparison of experimental variable vs control), and the genes are dynamically displayed in red on the plot.

Clustering Analysis and Heatmap
The gene names from the Gene List selected by user-defined thresholds were used to filter the raw dataset to collect the expression data from each replicate for the significantly expressed genes. An expression matrix of experimental replicas were applied to hierarchical clustering and visualization by an interactive heatmap. This visualization is useful to assess the consistency of gene expression between experimental conditions and whether the primary source of variance is the experimental conditions themselves or between experimental repeats or individual strains used for the samples.

Gene List
This tab shows a panel with a list of all of the significant genes that match the user threshold criteria along with their expression values for both experimental conditions, the fold gene expression change, and p-value. Additional information about these genes will be implemented in the future. The full list of these genes may be downloaded by the user with a click of a button in the side panel.

How to Use

The user may select 4 different datasets in which RNA expression from yeast grown in a comparison condition is compared against samples from control  growth conditions. Slider bars on the left side panel allow the user to select the threshold cutoffs for significantly different genes of interest. The Gene Expression tab displays a volcano plot of mean gene fold change (experiment vs control) against statistical significance of difference of the means (p-values). Genes of interest matching the thresholds set by the user are displayed in red.

The Clustering Analysis tab, displays a heat map of all user selected genes with gene expression from each experimental sample from the 2 comparison conditions. The margins of the heat map also have a dendrogram showing hierarchical clustering of the samples based on their gene expression profiles. This enables the user to view trends in genes across samples and experimental conditions, as well as determine whether the comparison of variance between their sample replicates and experimental conditions.

In the Gene List tab lists all of the user selected genes along with their mean normalized expression level (experimental condition vs control), ratio of gene expression (experimental/control) and the t-test p-value of the difference between experimental conditions. This list may be downloaded with the click of a button in the user side panel.


Using the app, I collected the genes with a gene expression change of ≥ 2-fold and p ≤ 0.02 for each of the experimental conditions featured in the app. I manually browsed the SGD database for these significant genes to get an impression of the biological processes, pathways and functions that maybe affected in these conditions (summarized below).

Future Directions

This project was also chosen because it will be easily extensible with the webscraping and machine learning functionality. Web scraping will be used to collect additional information on each gene from publicly available genetic repositories, such as standard name, description, functional/pathway information, gene ontology terms, and sequence data. Machine learning methods will be used on these additional data to glean further biological insights into the mechanism of genetic networks under these experimental conditions.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – NYC Data Science Academy Blog. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...