Subscribe to R bloggers feed R bloggers
R news and tutorials contributed by hundreds of R bloggers
Updated: 1 day 12 min ago

Making Of: A Free API For COVID-19 Data

Wed, 04/01/2020 - 09:00

[This article was first published on r-bloggers | STATWORX, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Recently, some colleagues and I attended the 2-day COVID-19 hackathon #wirvsvirus, organized by the German government. Thereby, we’ve developed a great application for simulating COVID-19 curves based on estimations of governmental measure effectiveness (FlatCurver). As there are many COVID-related dashboards and visualizations out there, I thought that gathering the underlying data from a single point of truth would be a minor issue. However, I soon realized that there are plenty of different data sources, mostly relying on the Johns Hopkins University COVID-19 case data. At first, I thought that’s great, but at a second glance, I revised my initial thought. The JHU datasets have some quirky issues to it that makes it a bit cumbersome to prepare and analyze it:

  • weird column names including special characters
  • countries and states „in the mix“
  • wide format, quite unhandy for data analysis
  • import problems due to line break issues
  • etc.

For all of you, who have been or are working with COVID-19 time series data and want to step up your data-pipeline game, let me tell you: we have an API for that! The API uses official data from the European Centre for Disease Prevention and Control and delivers a clear and concise data structure for further processing, analysis, etc.

Overview of our COVID-19 API

Our brand new COVID-19-API brings you the latest case number time series right into your application or analysis, regardless of your development environment. For example, you can easily import the data into Python using the requests package:

import requests import json import pandas as pd # POST to API payload = {'country': 'Germany'} # or {'code': 'DE'} URL = 'https://api.statworx.com/covid' response = requests.post(url=URL, data=json.dumps(payload)) # Convert to data frame df = pd.DataFrame.from_dict(json.loads(response.text))

Or if you’re an R aficionado, use httr and jsonlite to grab the lastest data and turn it into a cool plot.

library(httr) library(dplyr) library(jsonlite) library(ggplot2) # Post to API payload <- list(code = "ALL") response <- httr::POST(url = "https://api.statworx.com/covid", body = toJSON(payload, auto_unbox = TRUE), encode = "json") # Convert to data frame content <- rawToChar(response$content) df <- data.frame(fromJSON(content)) # Make a cool plot df %>% mutate(date = as.Date(date)) %>% filter(cases_cum > 100) %>% filter(code %in% c("US", "DE", "IT", "FR", "ES")) %>% group_by(code) %>% mutate(time = 1:n()) %>% ggplot(., aes(x = time, y = cases_cum, color = code)) + xlab("Days since 100 cases") + ylab("Cumulative cases") + geom_line() + theme_minimal() Developing the API using Flask

Developing a simple web app using Python is straightforward using Flask. Flask is a web framework for Python. It allows you to create websites, web applications, etc. right from Python. Flask is widely used to develop web services and APIs. A simple Flask app looks something like this.

from flask import Flask app = Flask(__name__) @app.route('/') def handle_request(): """ This code gets executed """ return 'Your first Flask app!'

In the example above, app.route decorator defines at which URL our function should be triggered. You can specify multiple decorators to trigger different functions for each URL. You might want to check out our code in the Github repository to see how we build the API using Flask.

Deployment using Google Cloud Run

Developing the API using Flask is straightforward. However, building the infrastructure and auxiliary services around it can be challenging, depending on your specific needs. A couple of things you have to consider when deploying an API:

  • Authentification
  • Security
  • Scalability
  • Latency
  • Logging
  • Connectivity

We’ve decided to use Google Cloud Run, a container-based serverless computing framework on Google Cloud. Basically, GCR is a fully managed Kubernetes service, that allows you to deploy scalable web services or other serverless functions based on your container. This is how our Dockerfile looks like.

# Use the official image as a parent image FROM python:3.7 # Copy the file from your host to your current location COPY ./main.py /app/main.py COPY ./requirements.txt /app/requirements.txt # Set the working directory WORKDIR /app # Run the command inside your image filesystem RUN pip install -r requirements.txt # Inform Docker that the container is listening on the specified port at runtime. EXPOSE 80 # Run the specified command within the container. CMD ["python", "main.py"]

You can develop your container locally and then push it in to the container registry of your GCP project. To do so, you have to tag your local image using docker tag according to the following scheme: [HOSTNAME]/[PROJECT-ID]/[IMAGE]. The hostname is one of the following: gcr.io, us.gcr.io, eu.gcr.io, asia.gcr.io. Afterward, you can push using gcloud push, followed by your image tag. From there, you can easily connect the container to the Google Cloud Run service:

When deploying the service, you can define parameters for scaling, etc. However, this is not in scope for this post. Furthermore, GCR allows custom domain mapping to functions. That’s why we have the neat API endpoint https://api.statworx.com/covid.

Conclusion

Building and deploying a web service is easier than ever. We hope that you find our new API useful for your projects and analyses regarding COVID-19. If you have any questions or remarks, feel free to contact us or to open an issue on Github. Lastly, if you make use of our free API, please add a link to our website, https://www.statworx.com to your project. Thanks in advance and stay healthy!

Über den Autor

Sebastian Heinz

I am the founder and CEO of STATWORX. I enjoy writing about machine learning and AI, especially about neural networks and deep learning. In my spare time, I love to cook, eat and drink as well as traveling the world.

ABOUT US

STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.  

.button { background-color: #0085af; }

.x-container.width { width: 100% !important; }

.x-section { padding-top: 00px !important; padding-bottom: 80px !important; }

Der Beitrag Making Of: A Free API For COVID-19 Data erschien zuerst auf STATWORX.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: r-bloggers | STATWORX. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Crowdfight Covid-19 – Call for R-community action

Wed, 04/01/2020 - 08:49

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Dear R-community,

This is a rather unusual post for an R-blog, but unusual times may require that. I am not involved into organizing this effort except for volunteering in the same way I want you to do. So, if you aren’t aware of this initiative yet, please likewise consider subscribing to Crowdfight Covid-19.

Motivation: Most current plans to fight COVID-19 rely on the assumption that treatments and/or vaccines will be available in a few months. Delays in these treatments will have enormous consequences, both in terms of economic impact and human lives.

Aim: To put the wider scientific community at the service of COVID-19 research. There is now a huge pool of highly skilled scientists, willing to volunteer their time and expertise for this cause. This goes from virologists who don’t have the resources to get directly involved with COVID-19 to researchers in other disciplines (bioinformatics, image analysis, AI…). This is a huge resource, the “bottleneck” being coordination. The team behind the platform tries to remove that bottleneck.

The distribution of tasks has been ongoing for three days now. Among many requests less related to coding, the daily task list also contains some data wrangling and modeling tasks every now and then, and many of us probably have additional spare time right now. Examples thus far have been calls for

    1. Building webscraping tool from scratch, which should help to consolodate government measures that have been put into place across various countries at different points in time,
    2. Support for unit testing and data fetching functionality wrapped around a more domain specific R functionality (biochemistry struff)
    3. Bayesian modeling and corresponding clinical experiment design
    4. Phylogenetic tree analyis of the virus RNA

I cannot share any concrete code snippets or implementation outlines, as the mentioned examples are either not existent yet or within private repositories. 
But it’s free to subscribe, and you can also put an upper bound on your weekly availability.

Crowdfight COVID-19 | Volunteers var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Facts About Coronavirus Disease 2019 (COVID-19) in 5 Charts created with R and ggplot2

Wed, 04/01/2020 - 01:45

[This article was first published on novyden, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Coronovirus pandemic is changing our lifestyle from daily routine to near- and midterm plans, affecting relationships at home and work, adjusting our economical priorities and abilities, making us reassess value of goods and services, and arguably impacting all aspects of life. Better knowledge and understanding of the decease, its manifestations and dynamics must play critical role in assessment of current events and decisions we make. Below I compiled some useful facts about COVID-19 into 5 charts and included discussion of R and ggplot2 techniques used to create them.

At the end of 2019, a novel coronavirus was identified as the cause of a cluster of pneumonia cases in Wuhan, a city in the Hubei Province of China. It rapidly spread, resulting in an epidemic throughout China, followed by an increasing number of cases in other countries throughout the world. In February 2020, the World Health Organization designated the disease COVID-19, which stands for coronavirus disease 2019. The virus that causes COVID-19 is designated severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2); previously, it was referred to as 2019-nCoV.

Understanding of COVID-19 is evolving. This topic will discuss the epidemiology, clinical features, diagnosis, management, and prevention of COVID-19. 

         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

Though not all topics above are covered in this blog I reserve the right to publish more charts so stay tuned.

Clinical Features   Incbuation Period The incubation period for COVID-19 is thought to be within 14 days following exposure, with most cases occurring approximately four to five days after exposure [29-31].

Using data from 181 publicly reported, confirmed cases in China with identifiable exposure, one modeling study estimated that symptoms would develop in 2.5 percent of infected individuals within 2.2 days and in 97.5 percent of infected individuals within 11.5 days [32]. The median incubation period in this study was 5.1 days.

         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

Common approach to display quartiles and extreme percentiles of continuous distribution is with box plot. I chose against it for couple of reasons: a) research above had insufficient information about quartiles and b) box plots are less known outside of statistical community. Instead a gauge chart common in dashboard types of applications was used:


Implementation details in R

Dataset

Dataset will hold 6 rows with values corresponding to 5 percentiles – 0% (minimum), 2.5% and 97.5% (corresponding to 0.95 confidence interval), 50% (median), 100% (maximum) – and one more for average:

Using factor() will place gauges in order from least to greatest and additional column stext used to display a value in readable format for each gauge.

Chart

First, let’s load packages used for plotting: ggplot2, ggthemes, and scales:

Realization of gauge charts using ggplot2 I borrowed from this example with a few changes explained next:

Line by line explainer:

  • 2-4: prepare rectangles for each value . Each gauge is a pair of overlapping rectangles – one dispaying value  geom_rect() with constant one geom_rect(aes(ymax=14, ymin=0, xmax=2, xmin=1), fill =“#ece8bd“) as a background. 
  • 10: separate gauges by facets.
  • 5, 6: transform coordinate system to polar, rotate it to start at 9 pm and trim to display only upper half of gauges.
  • 9 places text label with value in the middle of each gauge.
  • 7, 8: color schema from few_pal(). 
  • 11: removing guides from the chart.
  • 12-15: title, subtitle, caption, and axis labels.
  • 16-19: customization using ggthemes package and theme().

 

Illness Severity

The spectrum of symptomatic infection ranges from mild to critical; most infections are not severe [33,35-40]. Specifically, in a report from the Chinese Center for Disease Control and Prevention that included approximately 44,500 confirmed infections with an estimation of disease severity [41]:

  ● Mild (no or mild pneumonia) was reported in 81 percent.   ● Severe disease (eg, with dyspnea, hypoxia, or >50 percent lung involvement on imaging within 24 to 48 hours) was reported in 14 percent.   ● Critical disease (eg, with respiratory failure, shock, or multiorgan dysfunction) was reported in 5 percent.   ● The overall case fatality rate was 2.3 percent; no deaths were reported among noncritical cases.

         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

Obvious choice is a bar chart consisting of 4 bars – 3 for illness severity specturm plus case fatality rate reported in the same study:


Implementation details in R

Dataset


Dataset with 4 rows and 4 columns where severity is a factor() ordered by percent, percent_label used to display values above bars, and severity_label details illness severity:

Chart

This is the case of simple bar chart using geom_bar() with state=’identity’ enhanced just with a couple of artifacts: geom_text() and annotate():

Line by line explainer:

  • 1-2: bar chart with stat=”identity” displaying 4 bars.
  • 3: placing percent labels above bars.
  • 4: displaying y-axis labels in percent format.
  • 5-6: color schema from few_pal() and custom labeling of the legend.
  • 7-8: text annotation about CFR in the middle of the chart.
  • 9-12: title, subtitle, caption, and axis labels.
  • 13-17: customization using ggthemes package and theme().

 

Clinical Manifestations  Pneumonia appears to be the most frequent serious manifestation of infection, characterized primarily by fever, cough, dyspnea, and bilateral infiltrates on chest imaging [32,36-38]. There are no specific clinical features that can yet reliably distinguish COVID-19 from other viral respiratory infections.

In a study describing 138 patients with COVID-19 pneumonia in Wuhan, the most common clinical features at the onset of illness were [38]:

  ●Fever in 99 percent   ●Fatigue in 70 percent   ●Dry cough in 59 percent   ●Anorexia in 40 percent   ●Myalgias in 35 percent   ●Dyspnea in 31 percent   ●Sputum production in 27 percent

         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

Continuing using bar chart to display clinical manifestations of COVID-19 at the onset of illness:

Implementation Details in R

Dataset

This is example of a bar chart requiring a bare minimum of information – just 2 columns with name and percent to display 7 bars:

Chart

Once again code below creates a bar chart using stat = “identity”:

Line by Line explainer:

  • 1-2: bar chart with stat=”identity” displaying 4 bars.
  • 3: displaying y-axis labels in percent format.
  • 4: color schema from few_pal().
  • 5-8: title, subtitle, caption, and axis labels.
  • 9-12: customization using ggthemes package and theme().

 

Case Fatality Rate

According to a joint World Health Organization (WHO)-China fact-finding mission, the case-fatality rate ranged from 5.8 percent in Wuhan to 0.7 percent in the rest of China [17]. Most of the fatal cases occurred in patients with advanced age or underlying medical comorbidities [20,41]. (See ‘Risk factors for severe illness’ below.)

The proportion of severe or fatal infections may vary by location. As an example, in Italy, 12 percent of all detected COVID-19 cases and 16 percent of all hospitalized patients were admitted to the intensive care unit; the estimated case fatality rate was 7.2 percent in mid-March [42,43]. In contrast, the estimated case fatality rate in mid-March in South Korea was 0.9 percent [44]. This may be related to distinct demographics of infection; in Italy, the median age of patients with infection was 64 years, whereas in Korea the median age was in the 40s.

         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD

This chart displays CFR’s by age groups based on 44672 confirmed cases in China through February 11 with overall CFR = 2.3%:

Imlementation Details in R Dataset

The data includes age, deaths, cases, and cfr computed as a ratio of last two:

Chart

This chart combines bar and line charts into single plot reflecting CFR rate dynamic over age groups and additionally reflects size of these groups using bar width:

Line by line explainer:

  • 1,2: line chart over CFR by age groups.
  • 3: horizontal dotted line representing overall case fatality  rate.
  • 1,4: bar chart with stat=”identity” displaying CFR’s for each age group with adjusted bar width based on number of cases in each group.
  • 5,6: placing text labels with explicit value and calculation of CFR for each age group.
  • 7: displaying y-axis labels in percent format.
  • 8: color schema from few_tаbleau().
  • 9-12: title, subtitle, caption, and axis labels.
  • 13-15: customization using ggthemes package and theme().
  Epidemiology Period of infectivity The interval during which an individual with COVID-19 is infectious is uncertain. Most data informing this issue are from studies evaluating viral RNA detection from respiratory and other specimens. However, detection of viral RNA does not necessarily indicate the presence of infectious virus.

Viral RNA levels appear to be higher soon after symptom onset compared with later in the illness [18]; this raises the possibility that transmission might be more likely in the earlier stage of infection, but additional data are needed to confirm this hypothesis.

The duration of viral shedding is also variable; there appears to be a wide range, which may depend on severity of illness. In one study of 21 patients with mild illness (no hypoxia), 90 percent had repeated negative viral RNA tests on nasopharyngeal swabs by 10 days after the onset of symptoms; tests were positive for longer in patients with more severe illness [19]. In another study of 137 patients who survived COVID-19, the median duration of viral RNA shedding from oropharyngeal specimens was 20 days (range of 8 to 37 days) [20].

         Coronavirus disease 2019 (COVID-19) by Kenneth McIntosh, MD
 
This chart informs of minimum, median, and maxium duration of viral shedding by infected individuals by using bars resembling time lines:

 
Imlementation Details in R Dataset

This chart will use bars to imitate time lines of period of infectivity based on research of how long individuals shedded viral RNA that identified minimum, median and maximum times:

Chart

Yet another example of a bar chart with additional hack using geom_point()‘s to display an improvised icon of SARS-CoV-2 virus:

Line by line explainer:

  • 1,2: bar chart with stat=”identity” displaying 3 very thin bars imitating time line. 
  • 3-6: overlaying 3 different point shapes with varying size to improvise virus icon
  • 7,8: text annotation about the difference between being infectious and viral RNA shedding.
  • 9: flipping x and y axis to display time line horizontally.
  • 10-13: title, subtitle, caption, and axis labels. 
  • 14-16: customization using ggthemes package and theme().

Conclusions

Most of the facts above are results of extremely young research of COVID-16 that started just little over 3 months ago. There are still many unknowns about both the virus SARS-CoV-2 and the desease. I compiled of few of them in the final chart – some will seem surprising given the wealth of knowledge science accumulated about other deseases:


References


var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: novyden. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

foreach 1.5.0 now available on CRAN

Tue, 03/31/2020 - 18:32

[This article was first published on Revolutions, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post is to announce that version 1.5.0 of the foreach package is now on CRAN. Foreach is an idiom that allows for iterating over elements in a collection, without the use of an explicit loop counter. The foreach package is now more than 10 years old, and is used by nearly 700 packages across CRAN and Bioconductor. (If you're interested in an overview of the foreach package and its history, this RStudio::conf talk by Bryan Lewis is worth watching.)

The main change is 1.5.0 is intended to help reduce errors associated with modifying global variables. Now, %dopar% loops with a sequential backend will evaluate the loop body inside a local environment. Why make that change? Let's illustrate with an example:

library(foreach) registerDoSEQ() a <- 0 foreach(i=1:10) %dopar% {a <- a + 1} a ## [1] 0

Here, the assignment inside the %dopar% is to a local copy of a, so the global variable a remains unchanged. The reason for this change is because %dopar% is intended for use in a parallel context, where modifying the global environment doesn’t make sense: the work will be taking place in different R processes, sometimes on different physical machines, possibly in the cloud. In this context there is no shared global environment to manipulate, unlike the case of a sequential backend.

Because of this, it’s almost always a mistake to modify global variables from %dopar%, even if it used to succeed. This change will hopefully reduce the incidence of programming errors where people prototype their code with a sequential backend, only to have it fail when they use it with a real (parallel) backend.

Note that the behaviour of the %do% operator, which is intended for a sequential backend, remains the same. It matches that of a regular for loop:

a <- 0 foreach(i=1:10) %do% {a <- a + 1} a ## [1] 10

If you have any questions or comments, please email me or open an issue at the GitHub repo.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Revolutions. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Blogging A to Z: The A to Z of tidyverse

Tue, 03/31/2020 - 17:06

[This article was first published on Deeply Trivial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Announcing my theme for this year’s blogging A to Z!

The tidyverse is a set of R packages for data science. The big thing about the tidyverse is making sure your data are tidy. What does that mean?

  1. Each row is an observation
  2. Each column is a variable
  3. Each cell contains only one value
When I first learned about the tidy approach, I thought, “Why is this special? Isn’t that what we should be doing?” But thinking about keeping your data tidy has really changed the way I approach my job, and has helped me solve some tricky data wrangling issues. When you really embrace this approach, merging data, creating new variables, and summarizing cases becomes much easier. And the syntax used is the tidyverse is much more intuitive than much of the code in R, making it easier to memorize many of the functions; they follow a predictable grammar, so you don’t need to constantly look things up. See you tomorrow for the first post – A is for arrange! var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Deeply Trivial. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Visualizing decision tree partition and decision boundaries

Tue, 03/31/2020 - 15:00

[This article was first published on r – paulvanderlaken.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Grant McDermott develop this new R package I had thought of: parttree

parttree includes a set of simple functions for visualizing decision tree partitions in R with ggplot2. The package is not yet on CRAN, but can be installed from GitHub using:

# install.packages("remotes") remotes::install_github("grantmcdermott/parttree")

Using the familiar ggplot2 syntax, we can simply add decision tree boundaries to a plot of our data.

In this example from his Github page, Grant trains a decision tree on the famous Titanic data using the parsnip package. And then visualizes the resulting partition / decision boundaries using the simple function geom_parttree()

library(parsnip) library(titanic) ## Just for a different data set set.seed(123) ## For consistent jitter titanic_train$Survived = as.factor(titanic_train$Survived) ## Build our tree using parsnip (but with rpart as the model engine) ti_tree = decision_tree() %>% set_engine("rpart") %>% set_mode("classification") %>% fit(Survived ~ Pclass + Age, data = titanic_train) ## Plot the data and model partitions titanic_train %>% ggplot(aes(x=Pclass, y=Age)) + geom_jitter(aes(col=Survived), alpha=0.7) + geom_parttree(data = ti_tree, aes(fill=Survived), alpha = 0.1) + theme_minimal()

Super awesome!

This visualization precisely shows where the trained decision tree thinks it should predict that the passengers of the Titanic would have survived (blue regions) or not (red), based on their age and passenger class (Pclass).

This will be super helpful if you need to explain to yourself, your team, or your stakeholders how you model works. Currently, only rpart decision trees are supported, but I am very much hoping that Grant continues building this functionality!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: r – paulvanderlaken.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Updates to R GUIs: BlueSky, jamovi, JASP, & RKWard

Tue, 03/31/2020 - 12:03

[This article was first published on R – r4stats.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Graphical User Interfaces (GUIs) for the R language help beginners get started learning R, help non-programmers get their work done, and help teams of programmers and non-programmers work together by turning code into menus and dialog boxes. There has been quite a lot of progress on R GUIs since my last post on this topic. Below I describe some of the features added to several R GUIs.

BlueSky Statistics

BlueSky Statistics has added mixed-effects linear models. Its dialog shows an improved model builder that will be rolled out to the other modeling dialogs in future releases. Other new statistical methods include quantile regression, survival analysis using both Kaplan-Meier and Cox Proportional Hazards models, Bland-Altman plots, Cohen’s Kappa, Intraclass Correlation, odds ratios and relative risk for M by 2 tables, and sixteen diagnostic measures such as sensitivity, specificity, PPV, NPV, Youden’s Index, and the like. The ability to create complex tables of statistics was added via the powerful arsenal package. Some examples of the types of tables you can create with it are shown here.

Several new dialogs have been added to the Data menu. The Compute Dummy Variables dialog creates dummy (aka indicator) variables from factors for use in modeling. That approach offers greater control over how the dummies are created than you would have when including factors directly in models.

A new Factor Levels menu item leads to many of the functions from the forcats package. They allow you to reorder factor levels by count, by occurrence in the dataset, by functions of another variable, allow you to lump low-frequency levels into a single “Other” category, and so on. These are all helpful in setting the order and nature of, for example, bars in a plot or entries in a table.

The BlueSky Data Grid now has icons that show the type of variable i.e. factor, ordered factor, string, numeric, date or logical. The Output Viewer adds icons to let you add notes to the output (not full R Markdown yet), and a trash can icon lets you delete blocks of output.

A comprehensive list of the changes to this release is located here and my updated review of it is here.

jamovi

New modules expand jamovi’s capabilities to include time-based survival analysis, Bland-Altman analysis & plots, behavioral change analysis, advanced mediation analysis, differential item analysis, and quantiles & probabilities from various continuous distributions.

jamovi’s new Flexplot module greatly expands the types of graphs it can create, letting you take a single graph type and repeat it in rows and/or columns making it easy to visualize how the data is changing across groups (called facet, panel, or lattice plots).

You can read more about Flexplot here, and my recently-updated review of jamovi is here.

JASP

The JASP package has added two major modules, machine learning, and network analysis. The machine learning module includes boosting, K-nearest neighbors, and random forests for both regression and classification problems. For regression, it also adds regularized linear regression. For clustering, it covers hierarchical, K-means, random forest, density-based, and fuzzy C-means methods. It can generate models and add predictions to your dataset, but it still cannot save models for future use. The main method it is missing is a single decision tree model. While less accurate predictors, a simple tree model can often provide insight that is lacking from other methods.

Another major addition to JASP is Network Analysis. It helps you to study the strengths of interactions among people, cell phones, etc. With so many people working from home during the Coronavirus pandemic, it would be interesting to see what this would reveal about how our patterns of working together have changed.

A really useful feature in JASP is its Data Library. It greatly speeds your ability to try out a new feature by offering a completely worked-out example including data. When trying out the network analysis feature, all I had to do was open the prepared example to see what type of data it would use. With most other data science software, you’re left to dig about in a collection of datasets looking for a good one to test a particular analysis. Nicely done!

I’ve updated my full review of JASP, which you can read here.

RKWard

The main improvement to the RKWard GUI for R is adding support for R Markdown. That makes it the second GUI to support R Markdown after R Commander. Both the jamovi and BlueSky teams are headed that way. RKWard’s new live preview feature lets you see text, graphics, and markdown as you work. A comprehensive list of new features is available here, and my full review of it is here.

Conclusion

R GUIs are gaining features at a rapid pace, quickly closing in on the capabilities of commercial data science packages such as SAS, SPSS, and Stata. I encourage R GUI users to contribute their own additions to the menus and dialog boxes of their favorite(s). The development teams are always happy to help with such contributions. To follow the progress of these and other R GUIs, subscribe to my blog, or follow me on twitter.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – r4stats.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Contagiousness of COVID-19 Part I: Improvements of Mathematical Fitting (Guest Post)

Tue, 03/31/2020 - 09:00

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Learning Machines proudly presents a guest post by Martijn Weterings from the Food and Natural Products research group of the Institute of Life Technologies at the University of Applied Sciences of Western Switzerland in Sion.

The topic of this post will be the fitting with the R-package optim. Food? That sounds like a rather unlikely match for writing a post on a blog about quantitative analysis, however, there is a surprising overlap between these disciplinary fields. For example, whether you model the transport of a flavour molecule or transport of a virus, the type of mathematical equations and the ways to treat the data are a lot similar.

This contribution will be split into two parts. In the first part, we pick up on the earlier fitting described in a previous blog-post here (see Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)?). These fits are sometimes difficult to perform. How can we analyse that difficult behaviour and how can we make further improvements? In the second part, we will see that all these efforts to make a nice performing algorithm to perform the fitting is actually not much useful for the current case. Just because we use a mathematical model, which sounds rigorous, does not mean that our conclusions/predictions are trustworthy.

These two parts will be accompanied by the R-script covid.r.

COVID-19: a data epidemic

With the outbreak of COVID-19 one thing that is certain is that never before a virus has gone so much viral on the internet. Especially, a lot of data about the spread of the virus is going around. A large amount of data is available in the form of fancy coronavirus-trackers that look like weather forecasts or overviews of sports results. Many people have started to try predicting the evolution of the epidemiological curve and along with that the reproduction number , but can this be done with this type of data?

In this blog-post, we describe the fitting of the data with the SIR model and explain the tricky parts of the fitting methodology and how we can mitigate some of the problems that we encounter.

The general problem is that the fitting-algorithm is not always finding it’s way to the best solution. Below is a graph that shows an out of the box fit of the data with the optim package (it’s the one from the previous blog post Epidemiology: How contagious is Novel Coronavirus (2019-nCoV)? ). Next to it, we show a result that is more optimal. Why did we not find this result directly with the optim package?

Why doesn’t the algorithm converge to the optimal solution?

There are two main reasons why the model is not converging well.

Early stopping of the algorithm

The first reason is that the optim algorithm (which is updating model parameters starting from an initial guess and moving towards the optimal solution) is stopping too early before it has found the right solution.

How does the optim package find a solution? The gradient methods used by the optim package find the optimum estimate by repeatedly improving the current estimate and finding a new solution with a lower residual sum of squares (RSS) each time. Gradient methods do this by computing for a small change of the parameters in which direction the RSS will change the fastest and then, in the case of the BFGS method used by the optim package, it computes (via a line search method) where in that direction the lowest value for the RSS is. This is repeated until no further improvement can be made, or when the improvement is below some desired/sufficient minimal level.

In the two images below we see how the algorithm solves stepwise the fit, for a SIR model that uses the parameters and (these parameters had been explained in the previous blog post and are repeated in this post below). The images are contour plot (lines) and surface plot (colours) for the value of the RSS as a function of the model parameters. The minimum is around and where eventually the algorithm should end.

We see in these images effects that make it difficult for the algorithm to approach the optimum quickly in few steps, or it may even get blocked before that point (also it may end up in a local optimum, which is a bit different case, although we have it here as well and there’s a local optimum with a value for ).

Computation of the gradient If the function that we use for the optimization does not provide an expression for the gradient of the function (which is needed to find the direction of movement) then the optim package will compute this manually by taking the values at nearby points.

How much nearby do these points need to be? The optim package uses the scale of the parameters for this. This scale does not always work out of the box and when it is too large then the algorithm is not making an accurate computation of the gradient.

In the image below we see this by the path taken by the algorithm is shown by the red and black arrows. The red arrows show the path when we do not fine-tune the optimization, the black path shows the path when we reduce the scale of the parameters manually. This is done with the control parameter. In the code of the file covid.R you see this in the function:

OptTemp <- optim(new, RSS2, method = "L-BFGS-B", lower = c(0,1.00001), hessian = TRUE, control = list(parscale = c(10^-4,10^-4), factr = 1))

By using parscale = c(10^-4,10^-4) we let the algorithm compute the gradient at a smaller scale (we could actually also use the ndeps parameter). In addition, we used factr = 1, which is a factor that determines the point when the algorithm stops (in this case when the improvement is less than one times the machine precision).

So by changing the parameter parscale we can often push the algorithm to get closer to the optimal solution.

A zigzag path towards the optimum may occur when the surface plot of the RSS has a sort of long stretched valley shape. Then the algorithm is computing a path that moves towards the optimum like a sort of snowboarder on a half-pipe, taking lots of movements along the axis in the direction of the curvature of the half-pipe, and much less movement along the axis downhill towards the bottom.

In the case above we had let the algorithm start at and and this was chosen on purpose for the illustration. But we do not always make such a good initial guess. In the image below we see what happens when we had chosen and as starting condition (note that image should be stretched out along the y-axis due to the different ranges of and in which case the change of the RSS is much faster/stronger in the direction left-right than the direction up-down).

The red curve, which shows the result of the algorithm without the fine-tuning, stops already after one step around where it hits the bottom of the curvature of the valley/half-pipe and is not accurately finding out that there is still a long path/gradient in the other direction. We can improve the situation by changing the parscale parameter, in which case the algorithm will more precisely determine the slope and continue it’s path (see the black arrows). But in the direction of the y-axis, it does this only in small steps, so it will only slowly converge to the optimal solution.

We can often improve this situation by changing the relative scale of the parameters, however, in this particular case, it is not easy, because of the L-shape of the ‘valley’ (see the above image). We could change the relative scales of the parameters to improve convergence in the beginning, but then the convergence at the end becomes more difficult.

Ill-conditioned problem

The second reason for the bad convergence behaviour of the algorithm is that the problem is ill-conditioned. That means that a small change of the data will have a large influence on the outcome of the parameter estimates.

In that case, the data is not very useful to differentiate between different parameters of the model. A large range of variation in the parameters can more or less explain the same data.

An example of this is in the image below, where we see that for different values of R0 we can still fit the data without much difference in the residual sum of squares (RSS). We get every time a value for around to (and the shape of the curve is not much dependent on the value of ).

This value for relates to the initial growth rate. Let’s look at the differential equations to see why variations in have so little effect on the begin of the curve. In terms of the parameters and the equations are now:

   

Here we see that, when is approximately equal to (which is the case in the beginning), then we get approximately and the beginning of the curve will be approximately exponential.

Thus, for a large range of values of , the beginning of the epidemiological curve will resemble an exponential growth that is independent of the value of . In the opposite direction: when we observe exponential growth (initially) then we can not use this observation to derive a value for .

With these ill-conditioned problems, it is often difficult to get the algorithm to converge to the minimum. This is because changes in some parameter (in our case ) will result in only a small improvement of the RSS and a large range of the parameters have more or less the same RSS.

The error/variance of the parameter estimates

So if small variations in the data occur, due to measurements errors, how much impact will this have on the estimates of the parameters? Here we show the results for two different ways to do determine this. In the file covid.R the execution of the methods will be explained in more detail.

Using an estimate of the Fisher information. We can determine an estimate for (lower bound of) the variance of the parameters by considering the Cramér-Rao bound, which states that the variance of (unbiased) parameter estimates are equal to or larger than the inverse of the Fisher Information matrix. The Fisher information is a matrix with the second-order partial derivatives of the log-likelihood function evaluated at the true parameter values.

The log-likelihood function is this thing:

We do not know this loglikelihood function and it’s dependence on the parameters and because we do not have the true parameter values and also we do not know the variance of the random error of the data points (the term in the likelihood function). But we can estimate it based on the Hessian, a matrix with the second-order partial derivatives of our objective function evaluated at our final estimate.

##################### ## ## computing variance with Hessian ## ################### ### The output of optim will store values for RSS and the hessian mod <- optim(c(0.3, 1.04), RSS2, method = "L-BFGS-B", hessian = TRUE, control = list(parscale = c(10^-4,10^-4), factr = 1)) # unbiased estimate of standard deviation # we divide by n-p # where n is the number of data points # and p is the number of estimated parameters sigma_estimate <- sqrt(mod$value/(length(Infected)-2)) # compute the inverse of the hessian # The hessian = the second order partial derivative of the objective function # in our case this is the RSS # we multiply by 1/(2 * sigma^2) covpar <- solve(1/(2*sigma_estimate^2)*mod$hessian) covpar # [,1] [,2] #[1,] 1.236666e-05 -2.349611e-07 #[2,] -2.349611e-07 9.175736K and R0 e-09 ## the variance of R0 is then approximately ## covpar[2,2]^0.5 #[1] 9.579006e-05

Using a Monte Carlo estimation. A formula to compute exactly the propagation of errors/variance in the data to the errors/variance in the estimates of the parameters is often very complex. The Hessian will only give us a lower bound (I personally find it more useful to see any potential strong correlation between parameters), and it is not so easy to implement. There is however a very blunt but effective way to get an idea of the propagation of errors and that is by performing a random simulation.

The full details of this method are explained in the covid.R file. Here we will show just the results of the simulation:

In this simulation, we simulated times new data based on a true model with parameter values and and with the variance of data points corresponding to the observed RSS of our fit. We also show in the right graph how the parameters and are distributed for the same simulation. The parameters and are strongly correlated. This results in them having a large marginal/individual error, but the values and have much less relative variation (this is why we changed the fitting parameters from and to and ).

Fitting by using the latest data and by using the analytical solution of the model

Now, we are almost at the end of this post, and we will make a new attempt to fit again the epidemiological curve, but now based on more new data.

What we do this time is make some small adaptations:

  • The data is the number of total people that have gotten sick. This is different from the (infectious) and (recovered) output of the model. We make the comparison of the modelled with the data (the total that have gone sick).
  • In this comparison, we will use a scaling factor because the reported number of infected/infectious people is an underestimation of the true value, and this latter value is what the model computes. We use two scaling factors one for before and one for after February 12 (because at that time the definition for reporting cases had been changed).
  • We make the population size a fitting variable. This will correct for the two assumptions that we have homogeneous mixing among the entire population of China and that of the population is susceptible. In addition, we make the infected people at the start a fitting variable. In this model, we will fit . There is data for a separate but it is not such an accurate variable (because the recovery and the infectious phase is not easy to define/measure/determine).

Because the computation of all these parameters is too difficult in a single optim function we solve the parameters separately in a nested way. In the most inner loop, we solve the scaling parameters (which can be done with a simple linear model), in the middle loop we solve the and with the optim function, in the outer loop we do a brute force search for the optimal starting point of .

To obtain a starting condition we use a result from Harko, Lobo and Mak 2014 (Exact analytical solutions of the Susceptible-Infected-Recovered (SIR) epidemic model and of the SIR model with equal death and birth rates) who derived expressions for , and in terms of a single differential equation. The equation below is based on their equations but expressed in slightly different terms:

   

We can solve this equation as a linear equation which gives us a good starting condition (small sidenote: using some form of differential equation is a general way of getting starting conditions, but the might be noisy, in that case, one could integrate the expression).

The further details of the computation can be found in the covid.R script. Below you see a result of the outer loop where we did a brute force search (which gives an optimum around for ) and next to it a fitted curve for the parameters , , and .

In this new fit, we get again a low reproduction number . One potential reason for this is that due to the measures that have been taken, the Chinese have been able to reduce the rate of the spread of the virus. The model is unaware of this and interprets this as a reduction that is due to immunity (decrease of susceptible people). However, only a very small fraction of the people have gained immunity (about of the population got sick if we consider ). For the virus to stop spreading at already such a low fraction of sick people it must mean that the is very low.

Thus, an estimation of the parameters, based on this type of data, is difficult. When we see a decrease in the growth rate then one or more of the following four effects play a role: (1) The number of susceptible people has decreased sufficiently to overcome the reproduction rate . This relative decrease in susceptible people happens faster when the total number of people is smaller. (2) Something has changed about the conditions, the reproduction rate is not constant in time. For instance, with respiratory infections, it is common that the transfer rates depend on weather and are higher during winter. (3) The measures that are being taken against the disease are taking effect. (4) The model is too simple with several assumptions that overestimate the effect of the initial growth rate. This growth rate is very high 25%" title="Rendered by QuickLaTeX.com" height="13" width="35" style="vertical-align: 0px;"/> per day, and we observe a doubling every three days. This means that the time between generations is very short, something that is not believed to be true. It may be likely that the increase in numbers is partially due to variable time delay in the occurrence of the symptoms as well as sampling bias.

For statisticians, it is difficult to estimate what causes the changes in the epidemic curves. We should need more detailed information in order to fill in the gaps which do not seem to go away by having just more data (and this coronavirus creates a lot of data, possibly too much). But as human beings under threat of a nasty disease, we can at least consider ourselves lucky that we have a lot of options how the disease can fade away. And we can be lucky that we see a seemingly/effective reproduction rate that is very low, and also only a fraction of the population is susceptible.

Wrap up

So now we have done all this nice mathematics and we can draw accurately a modelled curve through all our data points. But is this useful when we model the wrong data with the wrong model? The difference between statistics and mathematics is that statisticians need to look beyond the computations.

  • We need to consider what the data actually represents, how is it sampled, whether there are biases and how strongly they are gonna influence our analysis. We should actually do this ideally before we start throwing computations at the data. Or such computations will at most be exploratory analysis, but they should not start to live their own life without the data.
  • And we need to consider how good a representation our models are. We can make expressions based on the variance in the data, but the error is also determined by the bias in our models.

At the present time, COVID-19 is making an enormous impact on our lives, with an unclear effect for the future (we even do not know when the measures are gonna stop, end of April, end of May, maybe even June?). Only time will tell what the economic aftermath of this coronavirus is gonna be, and how much it’s impact will be for our health and quality of life. But one thing that we can assure ourself about is that the ominous view of an unlimited exponential growth (currently going around on social media) is not data-driven.

In this post, I have explained some mathematics about fitting. However, I would like to warn for the blunt use of these mathematical formula’s. Just because we use a mathematical model does not mean that our conclusions/predictions are trustworthy. We need to challenge the premises which are the underlying data and models. So in a next post, “Contagiousness of COVID-19 Part 2: Why the Result of Part 1 is Useless”, I hope to explain what sort of considerations about the data and the models one should take into account and make some connections with other cases where statistics went in a wrong direction.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-Bloggers – Learning Machines. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

What is a dgCMatrix object made of? (sparse matrix format in R)

Tue, 03/31/2020 - 07:44

[This article was first published on R – Statistical Odds & Ends, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve been working with sparse matrices in R recently (those created using Matrix::Matrix with the option sparse=TRUE) and found it difficult to track down documentation about what the slots in the matrix object are. This post describes the slots in a class dgCMatrix object.

(Click here for full documentation of the Matrix package (and it is a lot–like, 215 pages a lot).)

Background

It turns out that there is some documentation on dgCMatrix objects within the Matrix package. One can access it using the following code:

library(Matrix) ?`dgCMatrix-class`

According to the documentation, the dgCMatrix class

…is a class of sparse numeric matrices in the compressed, sparse, column-oriented format. In this implementation the non-zero elements in the columns are sorted into increasing row order. dgCMatrix is the “standard” class for sparse numeric matrices in the Matrix package.

An example

We’ll use a small matrix as a running example in this post:

library(Matrix) M <- Matrix(c(0, 0, 0, 2, 6, 0, -1, 5, 0, 4, 3, 0, 0, 0, 5, 0), byrow = TRUE, nrow = 4, sparse = TRUE) rownames(M) <- paste0("r", 1:4) colnames(M) <- paste0("c", 1:4) M # 4 x 4 sparse Matrix of class "dgCMatrix" # c1 c2 c3 c4 # r1 . . . 2 # r2 6 . -1 5 # r3 . 4 3 . # r4 . . 5 .

Running str on x tells us that the dgCMatrix object has 6 slots. (To learn more about slots and S4 objects, see this section from Hadley Wickham’s Advanced R.)

str(M) # Formal class 'dgCMatrix' [package "Matrix"] with 6 slots # ..@ i : int [1:7] 1 2 1 2 3 0 1 # ..@ p : int [1:5] 0 1 2 5 7 # ..@ Dim : int [1:2] 4 4 # ..@ Dimnames:List of 2 # .. ..$ : chr [1:4] "r1" "r2" "r3" "r4" # .. ..$ : chr [1:4] "c1" "c2" "c3" "c4" # ..@ x : num [1:7] 6 4 -1 3 5 2 5 # ..@ factors : list()

x, i and p

If a matrix M has nn non-zero entries, then its x slot is a vector of length nn containing all the non-zero values in the matrix. The non-zero elements in column 1 are listed first (starting from the top and ending at the bottom), followed by column 2, 3 and so on.

M # 4 x 4 sparse Matrix of class "dgCMatrix" # c1 c2 c3 c4 # r1 . . . 2 # r2 6 . -1 5 # r3 . 4 3 . # r4 . . 5 . M@x # [1] 6 4 -1 3 5 2 5 as.numeric(M)[as.numeric(M) != 0] # [1] 6 4 -1 3 5 2 5

The i slot is a vector of length nn. The kth element of M@i is the row index of the kth non-zero element (as listed in M@x). One big thing to note here is that the first row has index ZERO, unlike R’s indexing convention. In our example, the first non-zero entry, 6, is in the second row, i.e. row index 1, so the first entry of M@i is 1.

M # 4 x 4 sparse Matrix of class "dgCMatrix" # c1 c2 c3 c4 # r1 . . . 2 # r2 6 . -1 5 # r3 . 4 3 . # r4 . . 5 . M@i # [1] 1 2 1 2 3 0 1

If the matrix has nvars columns, then the p slot is a vector of length nvars + 1. If we index the columns such that the first column has index ZERO, then M@p[1] = 0, and M@p[j+2] - M@p[j+1] gives us the number of non-zero elements in column j.

In our example, when j = 2, M@p[2+2] - M@p[2+1] = 5 - 2 = 3, so there are 3 non-zero elements in column index 2 (i.e. the third column).

M # 4 x 4 sparse Matrix of class "dgCMatrix" # c1 c2 c3 c4 # r1 . . . 2 # r2 6 . -1 5 # r3 . 4 3 . # r4 . . 5 . M@p # [1] 0 1 2 5 7

With the x, i and p slots, one can reconstruct the entries of the matrix.

Dim and Dimnames

These two slots are fairly obvious. Dim is a vector of length 2, with the first and second entries denoting the number of rows and columns the matrix has respectively. Dimnames is a list of length 2: the first element being a vector of row names (if present) and the second being a vector of column names (if present).

factors

This slot is probably the most unusual of the lot, and its documentation was a bit difficult to track down. From the CRAN documentation, it looks like factors is

… [an] Object of class “list” – a list of factorizations of the matrix. Note that this is typically empty, i.e., list(), initially and is updated automagically whenever a matrix factorization is
computed.

My understanding is if we perform any matrix factorizations or decompositions on a dgCMatrix object, it stores the factorization under factors so that if asked for the factorization again, it can return the cached value instead of recomputing the factorization. Here is an example:

M@factors # list() Mlu <- lu(M) # perform triangular decomposition str(M@factors) # List of 1 # $ LU:Formal class 'sparseLU' [package "Matrix"] with 5 slots # .. ..@ L :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots # .. .. .. ..@ i : int [1:4] 0 1 2 3 # .. .. .. ..@ p : int [1:5] 0 1 2 3 4 # .. .. .. ..@ Dim : int [1:2] 4 4 # .. .. .. ..@ Dimnames:List of 2 # .. .. .. .. ..$ : chr [1:4] "r2" "r3" "r4" "r1" # .. .. .. .. ..$ : NULL # .. .. .. ..@ x : num [1:4] 1 1 1 1 # .. .. .. ..@ uplo : chr "U" # .. .. .. ..@ diag : chr "N" # .. ..@ U :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots # .. .. .. ..@ i : int [1:7] 0 1 0 1 2 0 3 # .. .. .. ..@ p : int [1:5] 0 1 2 5 7 # .. .. .. ..@ Dim : int [1:2] 4 4 # .. .. .. ..@ Dimnames:List of 2 # .. .. .. .. ..$ : NULL # .. .. .. .. ..$ : chr [1:4] "c1" "c2" "c3" "c4" # .. .. .. ..@ x : num [1:7] 6 4 -1 3 5 5 2 # .. .. .. ..@ uplo : chr "U" # .. .. .. ..@ diag : chr "N" # .. ..@ p : int [1:4] 1 2 3 0 # .. ..@ q : int [1:4] 0 1 2 3 # .. ..@ Dim: int [1:2] 4 4

Here is an example which shows that the decomposition is only performed once:

set.seed(1) M <- runif(9e6) M[sample.int(9e6, size = 8e6)] <- 0 M <- Matrix(M, nrow = 3e3, sparse = TRUE) system.time(lu(M)) # user system elapsed # 13.527 0.161 13.701 system.time(lu(M)) # user system elapsed # 0 0 0 var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Statistical Odds & Ends. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Close Encounters of the R Kind

Tue, 03/31/2020 - 02:00

[This article was first published on R Views, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Affiliation

Harrison – Center for Strategic and Budgetary Analysis, Washington DC

Cara – Department of the Air Force (Studies, Analyses, and Assessments – AF/A9), Washington DC

Disclaimer

The views expressed in this article represent the personal views of the author and are not necessarily the views of the Department of Defense (DoD) or the Department of the Air Force.

This post is an effort to condense the ‘buzz’ surrounding the explosion of open source solutions in all facets of analysis – to include those done by Military Operations Research Society ( MORS) members and those they support – by describing our experiences with the R programming language.

The impact of R in the statistical world (and by extension, data science) has been big: worthy of an entire issue of SIGNIFICANCE Magazine (RSS). Surprisingly, R is not a new computing language; modeled on S and Scheme,- the technology at the core of R is over forty years old. This longevity is, in itself, noteworthy. Additionally, fee-based and for-profit companies have begun to incorporate R with their products. While statistics is the focus of R, with the right packages – and know-how – it can also be used for a much broader spectrum, to include machine learning, optimization, and interactive web tools.

In this post, we go back and forth discussing our individual experiences.

Getting Started with R

Harrison: I got started with R in earnest shortly before retiring from the U.S. Navy in 2016. I knew that I was going to need a programming language to take with me into my next career. The reason I chose R was not particularly analytical; the languages that I had done the most work in during grad school – MATLAB and Java – were not attractive in that the first required licensing fees and the second was – to me at the time – too ‘low level’ for the type of analysis I wanted to perform. I had used SPlus in my statistics track, but never really ‘took’ to it while in school. Several tools to ‘bridge’ the gap between Excel and R had been recommended to me by a friend, including RStudio and Rcommander.

Onboard ship many years ago, I learned to eat with chopsticks by requesting that the wardroom staff stop providing me with utensils, substituting a bag of disposable chopsticks I purchased in Singapore. Turns out when deprived of other options, you can learn very fast. Learning R basics was the same; instead of silverware, it was removing the shortcuts to my usual tools on my home laptop (Excel). I simply did every task that I could from the mundane to the elegant in R.

Cara: I started dabbling with R in 2017 when I had about a year and a half left in my PhD journey, after I decided to pursue a post-doctoral government career. Sitting comfortably in academia with abundant software licenses for almost a decade, I had no reason until that point to consider abandoning my SAS discipleship (other than abhorrent graphics capability, which I bolstered with use of SigmaPlot). My military background taught me not to expect access to expensive software in a government gig, and I had a number of friends and colleagues already using R and associated tools, so I installed it on my home computer. Other than being mildly intrigued by the software version naming convention, I stubbornly clung to SAS to finish my doctoral research, however.

It was not until I became a government OR analyst in late 2018 that I really started my R journey (mostly out of necessity at this point). R does not require a license, and furthermore, is approved for use on government computers. I found R to be an easier and more natural programming language to learn coming from a SAS background than Python, Javascript, or C++ would have been.

How has using R shaped your practice?

Harrison: There is a lot of talk about how various tools perform in the sense of runtime, precision, graphics, etc. These are considerations, but they are completely eclipsed by the following: We don’t talk as a community about how much the tools we use shape our thinking. I frequently tell my colleagues that the fundamental unit in Excel is called a cell because it is your mind prison. There’s actually some truth to that. R is vectorized, so for most functions, passing an array gives an appropriate array output. When you work day-in and day-out with vectors, you stop thinking about individual operations start to think in terms of sentences. The magrittr %>% operator, which takes the expression on the left as the first argument to the function on the right, makes this possible. Analysis begins to feel more like writing sentences – or even short poems – than writing computing code.

Early in my work with R, I was told by a colleague that “R might be good but the graphics are terrible”. This was a bit of a shock, as graphics has been one of the main selling points of the language, and I didn’t want to be making seedy graphs. From that point on, I made it a point to make the best graphics I possibly could, usually – but not always – using methods and extensions found in the ggplot2 package. It is no exaggeration to say that I spend roughly 20% of my analysis time picking colors and other aesthetics for plots. If you are willing to take the time, you can get the graphics to sing; there are color schemes based on The Simpsons and Futurama, and fonts based on xkcd comics.

Cara: When I began teaching myself R – and using it daily – I thought I was merely learning the syntax of a new programming language. With the analytic capability inherent with R and the flexibility of development environments, however, it is really more of a way of thinking. Fold in the powerful (and mostly free!) resources and passionate following of analysts and data scientists, and you get an R community that I truly enjoy being a part of.

The R environment, even as a novice user, can have positive impacts on your workflow. For example, beyond syntax, my earliest explorations in R taught me that if you are going to do something more than once, write a function. I had never truly internalized that idea, even after a decade of using SAS. Another thing I learned relatively early on – get to know the dplyr package, and use it! I had been coding in R for about 6 months before I was really introduced to functions like dplyr::filter(), dplyr::select(), and dplyr::mutate(); these are powerful functions that can save a ton of code. I’ve been analyzing data for over a decade and I’ve never come across a dataset that was already in the form I needed. Prior to using the dplyr package, however, I was spending a lot of time manipulating data using no functions and a lot of lines of code. Beyond time savings, dplyr helps you think about your data more creatively. As a very basic example, dplyr::summarise() is a more powerful option than mean() used alone, especially for multiple calculations in a single data table. And once you master the Wonder Twin-esque combination of using group_by() with summarise(), you’ll be amazed at what you can (quickly) reveal through exploratory analysis. Data wrangling is (and always will be) a fact of life. The more efficiently you manipulate data, however, the more time you have to spend on the seemingly more exciting aspects of any project.

Disadvantages of R

Harrison: This piece is not a ‘sales pitch’ for R; but rather a sober consideration of what the tradeoffs an organization needs to consider when choosing an analytic platform writ large:

  1. Compatibility and Editing. Because R is a computing language, graphics built in R are not editable by non-R users, as opposed to Excel graphs. This can be a challenge in the frequent case where the reviewers are not the same people that created the plots. If you made the plot, you are going to have to be the one who does the editing, unless there is another R user who understands your particular technique in the office.

  2. No license costs do not mean that it’s free: I frequently like to say that I haven’t spent a dime on analytics software since I retired from the Navy; this is strictly true, but also misleading. I have spent considerable time learning the best practices in R over the past 4 years. An organization that is looking to make this choice needs to realize upfront that the savings in fees will be largely eaten up by extra manpower to learn how to make it work. The reward for investing the time in increasing the ability of your people to code is twofold; first, it makes them closer in touch with the actual analysis, and secondly, it allows for bespoke applications.

Cara: I work in a pretty dynamic world as a government operations research analyst (ORSA); we don’t typically have dedicated statisticians, programmers, data scientists, modelers, or data viz specialists. Most of the time, we are all functioning in some or all of those capacities. As a former engineer with a dynamic background, this suits me well. However, it also means that things change from day to day, from project to project, and as the government analytic world changes (rapidly). I do not have the flexibility to use one software package exclusively. Further, I face challenges within DoD related to systems, software, classification, and computing infrastructure that most people in academia or industry do not. In my organization, there has been a relatively recent and rapid shift in the analytic environment. We formerly leaned heavily on Excel-based regressions and descriptive statistics, usually created by a single analysts that answer a single question, and in many cases these models were not particular dynamic or scalable. We now focus on using open-source tools in a team construct, sometimes with industry partners, to create robust models that are designed to answer multiple questions from a variety of perspectives; scale easily to mirror operational requirements; fit with other models; and transition well to high performance computing environments.

The two open-source tools we (i.e., my division) currently use most for programming are R and Python. We have had success combining data analysis, statistics, and graphical models to create robust tools coded as RShiny apps. Recently, we chose to code in Python for a project that involved machine learning and high performance computing. I do not propose to debate the strengths and weaknesses of either R or Python in this forum; rather, I challenge you to consider carefully the implications of programming language choice for any project with a cradle to grave perspective.

References
  1. Getting started with R can be daunting. We recommend the following references.
    Stack Overflow. This invaluable resource is a bulletin board exchange of programming ideas and tips. The real skill required to use it effectively is knowing how to write an effective question. “I hate ggplot” or “My R code doesn’t work” are not useful; try “Could not subset closure” or “ggplot axis font size” instead.

  2. Vignettes. Well-developed R packages have vignettes, which are very useful in seeing both an explanation of the code as well as an example. Two very good references are the ggplot2 gallery and the dplyr vignette Finally, the RViews blog is a great way to keep up-to-date with practice.

  3. Books. Although I tend to acquire books with reckless abandon, the ones I actually keep and use have withstood careful consideration and have generally pegged the daily utility meter. Try R for Data Science by Wickham and Grolemund (O’Reilly Publishing 2017) and Elegant Graphics for Data Analysis by Wickham (Springer 2016); available both as print copies or electronic editions.

  4. Podcasts. For those moments in your life when you need some data science-related enrichment, the producers of DataCamp host an excellent podcast called DataFramed. Fifty-nine episodes have been recorded so far; find them on soundcloud, Spotify, YouTube, or VFR direct from the creator’s listening notes.

  5. RStudio Cheatsheets. Sometimes you need densely constructed (read: compact yet surprisingly in-depth), straightforward references. RStudio creates (and updates) these two-pagers for the most popular and versatile R packages to be great portable references for programmers – think of them as a combined dictionary and thesaurus for learning R. Fun fact: they can be downloaded in multiple languages.

  6. Forums. (1) Data Science Center of Education (DSCOE) is a CAC-enabled collaboration site that hosts data science tutorials developed by Army analysts, mostly using R, and supports a week-long R immersion course offered at Center for Army Analysis (CAA) twice a year. The DSCOE forum is managed collaboratively by the CAA, U.S. Army Cyber Command (ARCYBER), Naval Postgraduate School (NPS), and the United Stated Military Academy (USMA). Contributions are both welcome and encouraged. (2) R-bloggers, created in 2015, is an R centric forum designed to foster connection, collaboration, and resource sharing within the R community. The utility of this forum lies in its array of technical resources that will benefit both new and practiced users. (3) Data Science DC, for those in the NCR, was formed via the concatenation of numerous meetup groups – including RMeetup DC – and a major proponent of a number of events, including hackathons and the DCR conference (held annually in the fall).

_____='https://rviews.rstudio.com/2020/03/31/close-encounters-of-the-r-kind/';

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R Views. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

COVID-19 in Belgium

Tue, 03/31/2020 - 02:00

[This article was first published on R on Stats and R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Photo by Markus Spiske

Introduction

The Novel COVID-19 Coronavirus is still spreading quickly in several countries and it does not seem like it is going to stop anytime soon as the peak has not yet been reached in many countries.

Since the beginning of its expansion, a large number of scientists across the world have been analyzing this Coronavirus from different perspectives and with different technologies with the hope of coming up with a cure in order to stop its expansion and limit its impact on citizens.

Top R resources on Coronavirus

In the meantime, epidemiologists, statisticians and data scientists are working towards a better understanding of the spread of the virus in order to help governments and health agencies in taking the most optimal decisions. This led to the publication of a great deal of online resources about the virus, which I collected and organized in an article covering the top R resources on Coronavirus. This article is a collection of the best resources I’ve had the chance to discover, with a brief summary for each of them. It includes Shiny apps, dashboards, R packages, blog posts and datasets.

Publishing this collection led many readers to submit their piece of work, which made the article even more complete and more insightful for anyone interested in analyzing the virus from a quantitative perspective. Thanks to everyone who contributed and who helped me in collecting and summarizing these R resources about COVID-19!

Given my field of expertise, I am not able to help in this fight against the virus from a medical point of view. However, I still wanted to contribute as much as I could. From understanding better the disease to bringing scientists and doctors together to build something bigger and more impactful, I truly hope that this collection will, to a small extent, help to fight the pandemic.

Coronavirus dashboard for your own country

Besides receiving analyses, blog posts, R code and Shiny apps from people across the world, I realized that many people were trying to create a dashboard tracking the spread of the Coronavirus for their own country. So in addition to the collection of top R resources, I also published an article detailing the steps to follow to create a dashboard specific to a country. See how to create such dashboard in this article and an example with Belgium.

The code has been made available on GitHub and is open source so everyone can copy it and adapt it to their own country. The dashboard was intentionally kept simple so anyone with a minimum knowledge in R could easily replicate it, and advanced users could enhance it according to their needs.

Motivations, limitations and structure of the article

By seeing and organizing many R resources about COVID-19, I am fortunate enough to have read a lot of excellent analyses on the disease outbreak, the impact of different health measures, forecasts of the number of cases, projections about the length of the pandemic, hospitals capacity, etc.

Furthermore, I must admit that some countries such as China, South Korea, Italy, Spain, UK and Germany received a lot of attention as shown by the number of analyses done on these countries. However, to my knowledge and at the date of publication of this article, I am not aware of any analysis of the spread of the Coronavirus specifically for Belgium.1 The present article aims at filling that gap.

Throughout my PhD thesis in statistics, my main research interest is about survival analysis applied to cancer patients (more information in the research section of my personal website). I am not an epidemiologist and I have no extensive knowledge in modelling disease outbreaks via epidemiological models.

I usually write articles only about things I consider myself familiar with, mainly statistics and its applications in R. At the time of writing this article, I was however curious where Belgium stands regarding the spread of this virus, I wanted to play with this kind of data in R (which is new to me) and see what comes out.

In order to satisfy my curiosity while not being an expert, in this article I am going to replicate analyses done by more knowledgeable people and apply them to my country, that is, Belgium. From all the analyses I have read so far, I decided to replicate the analyses done by Tim Churches and Prof. Dr. Holger K. von Jouanne-Diedrich. This article is based on a mix of their articles which can be found here and here. They both present a very informative analysis on how to model the outbreak of the Coronavirus and show how contagious it is. Their articles also allowed me to gain an understanding of the topic and in particular an understanding of the most common epidemiological model. I strongly advise interested readers to also read their more recent articles for more advanced analyses and for an even deeper understanding of the spread of the COVID-19 pandemic.

Other more complex analyses are possible and even preferable, but I leave this to experts in this field. Note also that the following analyses take into account only the data until the date of publication of this article, so the results should not be viewed, by default, as current findings.

In the remaining of the article, we first introduce the model which will be used to analyze the Coronavirus outbreak in Belgium. We also briefly discuss and show how to compute an important epidemiological measure, the reproduction number. We then use our model to analyze the outbreak of the disease in the case where there would be no public health intervention. We conclude the article by summarizing more advanced tools and techniques that could be used to further model COVID-19 in Belgium.

Analysis of Coronavirus in Belgium A classic epidemiological model: the SIR model

Before diving into the real-life application, we first introduce the model that will be used.

There are many epidemiological models but we will use one of the simplest, the SIR model. Tim Churches’ explanation of this model and how to fit it using R is so nice, I will reproduce it here with a few minor changes.

The basic idea behind the SIR model (Susceptible – Infectious – Recovered) of communicable disease outbreaks is that there are three groups (also called compartments) of people:

  • those who are healthy but susceptible to the disease: S
  • the infectious (and thus, infected) people: I
  • people who have recovered: R

SIR model. Source: Wikipedia

To model the dynamics of the outbreak we need three differential equations to describe the rates of change in each group, parameterised by:

  • \(\beta\) which controls the transition between S and I
  • \(\gamma\) which controls the transition between I and R

Formally, this gives:

\[\frac{dS}{dt} = – \frac{\beta IS}{N}\]

\[\frac{dI}{dt} = \frac{\beta IS}{N} – \gamma I\]

\[\frac{dR}{dt} = \gamma I\]

Before fitting the SIR model to the data, the first step is to express these differential equations as an R function, with respect to time t.

SIR <- function(time, state, parameters) { par <- as.list(c(state, parameters)) with(par, { dS <- -beta * I * S / N dI <- beta * I * S / N - gamma * I dR <- gamma * I list(c(dS, dI, dR)) }) } Fitting a SIR model to the Belgium data

To fit the model to the data we need two things:

  1. a solver for these differential equations
  2. an optimiser to find the optimal values for our two unknown parameters, \(\beta\) and \(\gamma\)

The function ode() (for ordinary differential equations) from the {deSolve} R package makes solving the system of equations easy, and to find the optimal values for the parameters we wish to estimate, we can just use the optim() function built into base R.

Specifically, what we need to do is minimise the sum of the squared differences between \(I(t)\), which is the number of people in the infectious compartment \(I\) at time \(t\), and the corresponding number of cases as predicted by our model \(\hat{I}(t)\). This quantity is known as the residual sum of squares (RSS):

\[RSS(\beta, \gamma) = \sum_t \big(I(t) – \hat{I}(t) \big)^2\]

In order to fit a model to the incidence data for Belgium, we need a value N for the initial uninfected population. The population of Belgium in November 2019 was 11,515,793 people, according to Wikipedia. We will thus use N = 11515793 as the initial uninfected population.

Next, we need to create a vector with the daily cumulative incidence for Belgium, from February 4 (when our daily incidence data starts), through to March 30 (last available date at the time of publication of this article). We will then compare the predicted incidence from the SIR model fitted to these data with the actual incidence since February 4. We also need to initialise the values for N, S, I and R. Note that the daily cumulative incidence for Belgium is extracted from the {coronavirus} R package developed by Rami Krispin.

# devtools::install_github("RamiKrispin/coronavirus") library(coronavirus) data(coronavirus) `%>%` <- magrittr::`%>%` # extract the cumulative incidence df <- coronavirus %>% dplyr::filter(Country.Region == "Belgium") %>% dplyr::group_by(date, type) %>% dplyr::summarise(total = sum(cases, na.rm = TRUE)) %>% tidyr::pivot_wider( names_from = type, values_from = total ) %>% dplyr::arrange(date) %>% dplyr::ungroup() %>% dplyr::mutate(active = confirmed - death - recovered) %>% dplyr::mutate( confirmed_cum = cumsum(confirmed), death_cum = cumsum(death), recovered_cum = cumsum(recovered), active_cum = cumsum(active) ) # put the daily cumulative incidence numbers for Belgium from # Feb 4 to March 30 into a vector called Infected library(lubridate) Infected <- subset(df, date >= ymd("2020-02-04"))$confirmed_cum # Create an incrementing Day vector the same length as our # cases vector Day <- 1:(length(Infected)) # now specify initial values for N, S, I and R N <- 11515793 init <- c( S = N - Infected[1], I = Infected[1], R = 0 )

Then we need to define a function to calculate the RSS, given a set of values for \(\beta\) and \(\gamma\).

# define a function to calculate the residual sum of squares # (RSS), passing in parameters beta and gamma that are to be # optimised for the best fit to the incidence data RSS <- function(parameters) { names(parameters) <- c("beta", "gamma") out <- ode(y = init, times = Day, func = SIR, parms = parameters) fit <- out[, 3] sum((Infected - fit)^2) }

Finally, we can fit the SIR model to our data by finding the values for \(\beta\) and \(\gamma\) that minimise the residual sum of squares between the observed cumulative incidence (observed in Belgium) and the predicted cumulative incidence (predicted by our model). We also need to check that our model has converged, as indicated by the message shown below:

# now find the values of beta and gamma that give the # smallest RSS, which represents the best fit to the data. # Start with values of 0.5 for each, and constrain them to # the interval 0 to 1.0 # install.packages("deSolve") library(deSolve) Opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 1) ) # check for convergence Opt$message ## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Convergence is confirmed. Now we can examine the fitted values for \(\beta\) and \(\gamma\):

Opt_par <- setNames(Opt$par, c("beta", "gamma")) Opt_par ## beta gamma ## 0.5986056 0.4271395

Remember that \(\beta\) controls the transition between S and I (i.e., susceptible and infectious) and \(\gamma\) controls the transition between I and R (i.e., infectious and recovered). However, those values do not mean a lot but we use them to get the fitted numbers of people in each compartment of our SIR model for the dates up to March 30 that were used to fit the model, and compare those fitted values with the observed (real) data.

sir_start_date <- "2020-02-04" # time in days for predictions t <- 1:as.integer(today() - ymd(sir_start_date)) # get the fitted values from our SIR model fitted_cumulative_incidence <- data.frame(ode( y = init, times = t, func = SIR, parms = Opt_par )) # add a Date column and the observed incidence data library(dplyr) fitted_cumulative_incidence <- fitted_cumulative_incidence %>% mutate( Date = ymd(sir_start_date) + days(t - 1), Country = "Belgium", cumulative_incident_cases = Infected ) # plot the data library(ggplot2) fitted_cumulative_incidence %>% ggplot(aes(x = Date)) + geom_line(aes(y = I), colour = "red") + geom_point(aes(y = cumulative_incident_cases), colour = "blue") + labs( y = "Cumulative incidence", title = "COVID-19 fitted vs observed cumulative incidence, Belgium", subtitle = "(Red = fitted incidence from SIR model, blue = observed incidence)" ) + theme_minimal()

From the above graph we see that the number of observed confirmed cases follows, unfortunately, the number of confirmed cases expected by our model. The fact that both trends are overlapping indicates that the pandemic is clearly in an exponential phase in Belgium. More data would be needed to see whether this trend is confirmed in the long term.

The following graph is similar than the previous one, except that the y-axis is measured on a log scale. This kind of plot is called a semi-log plot or more precisely a log-linear plot because only the y-axis is transformed with a logarithm scale. Transforming the scale in log has the advantage that it is more easily readable in terms of difference between the observed and expected number of confirmed cases and it also shows how the number of observed confirmed cases differs from an exponential trend.

fitted_cumulative_incidence %>% ggplot(aes(x = Date)) + geom_line(aes(y = I), colour = "red") + geom_point(aes(y = cumulative_incident_cases), colour = "blue") + labs( y = "Cumulative incidence (log scale)", title = "COVID-19 fitted vs observed cumulative incidence, Belgium", subtitle = "(Red = fitted incidence from SIR model, blue = observed incidence)" ) + theme_minimal() + scale_y_log10()

The plot indicates that, at the beginning of the pandemic and until March 12, the number of confirmed cases stayed below what would be expected in an exponential phase. In particular, the number of confirmed cases stayed constant at 1 case from February 4 to February 29. From March 13 and until March 30, the number of confirmed cases kept increasing at a rate close to an exponential rate.

We also notice a small jump between March 12 and March 13, which may potentially indicate an error in the data collection, or a change in the testing/screening methods.

Reproduction number \(R_0\)

Our SIR model looks like a good fit to the observed cumulative incidence data in Belgium, so we can now use our fitted model to calculate the basic reproduction number \(R_0\), also referred as basic reproduction ratio.

The reproduction number gives the average number of susceptible people who are infected by each infectious person. In other words, the reproduction number refers to the number of healthy people that get infected per number of sick people. When \(R_0 > 1\) the disease starts spreading in a population, but not if \(R_0 < 1\). Usually, the larger the value of \(R_0\), the harder it is to control the epidemic.

Formally, we have:

\[R_0 = \frac{\beta}{\gamma}\]

We can compute it in R:

Opt_par ## beta gamma ## 0.5986056 0.4271395 R0 <- as.numeric(Opt_par[1] / Opt_par[2]) R0 ## [1] 1.401429

An \(R_0\) of 1.4 is slightly below the values calculated by others for COVID-19 and the \(R_0\) for SARS and MERS, which are similar diseases also caused by coronavirus. This is mainly due to the fact that the number of confirmed cases stayed constant and equal to 1 at the beginning of the pandemic.

A \(R_0\) of 1.4 means that, on average in Belgium, 1.4 persons are infected for each infected person.

Using our model to analyze the outbreak if there was no intervention

It is instructive to use our model fitted to the first 56 days of available data on confirmed cases in Belgium, to see what would happen if the outbreak were left to run its course, without public health intervention.

# time in days for predictions t <- 1:120 # get the fitted values from our SIR model fitted_cumulative_incidence <- data.frame(ode( y = init, times = t, func = SIR, parms = Opt_par )) # add a Date column and join the observed incidence data fitted_cumulative_incidence <- fitted_cumulative_incidence %>% mutate( Date = ymd(sir_start_date) + days(t - 1), Country = "Belgium", cumulative_incident_cases = I ) # plot the data fitted_cumulative_incidence %>% ggplot(aes(x = Date)) + geom_line(aes(y = I), colour = "red") + geom_line(aes(y = S), colour = "black") + geom_line(aes(y = R), colour = "green") + geom_point(aes(y = c(Infected, rep(NA, length(t) - length(Infected)))), colour = "blue" ) + scale_y_continuous(labels = scales::comma) + labs(y = "Persons", title = "COVID-19 fitted vs observed cumulative incidence, Belgium") + scale_colour_manual(name = "", values = c( red = "red", black = "black", green = "green", blue = "blue" ), labels = c( "Susceptible", "Recovered", "Observed incidence", "Infectious" )) + theme_minimal()

The same graph in log scale for the y-axis and with a legend for better readability:

# plot the data fitted_cumulative_incidence %>% ggplot(aes(x = Date)) + geom_line(aes(y = I, colour = "red")) + geom_line(aes(y = S, colour = "black")) + geom_line(aes(y = R, colour = "green")) + geom_point(aes(y = c(Infected, rep(NA, length(t) - length(Infected))), colour = "blue")) + scale_y_log10(labels = scales::comma) + labs(y = "Persons (log scale)", title = "COVID-19 fitted vs observed cumulative incidence, Belgium") + scale_colour_manual(name = "", values = c(red = "red", black = "black", green = "green", blue = "blue"), labels = c("Susceptible", "Observed incidence", "Recovered", "Infectious")) + theme_minimal()

More summary statistics

Other interesting statistics can be computed from the fit of our model. For example:

  • the date of the peak of the pandemic
  • the number of severe cases
  • the number of people in need of intensive care
  • the number of deaths
fit <- fitted_cumulative_incidence # peak of pandemic fit[fit$I == max(fit$I), c("Date", "I")] ## Date I ## 87 2020-04-30 525315.7 # severe cases max_infected <- max(fit$I) max_infected / 5 ## [1] 105063.1 # cases with need for intensive care max_infected * 0.06 ## [1] 31518.94 # deaths with supposed 0.7% fatality rate max_infected * 0.007 ## [1] 3677.21

Given these predictions, with the exact same settings and no intervention at all to limit the spread of the pandemic, the peak in Belgium is expected to be reached by the end of April. About 525,000 people would be infected by then, which translates to about 105,000 severe cases, about 32,000 persons in need of intensive care and up to 3,700 deaths (assuming a 0.7% fatality rate, as suggested by this source).

At this point, we understand why such strict containment measures and regulations are taken in Belgium!

Note that those predictions should be taken with a lot of caution. On the one hand, as mentioned above, they are based on rather unrealistic assumptions (for example, no public health interventions, fixed reproduction number \(R_0\), etc.). More advanced projections are possible with the {projections} package, among others (see this section for more information on this matter). On the other hand, we still have to be careful and strictly follow public health interventions because previous pandemics such as H1N1 and Spanish flu have shown that incredibly high numbers are not impossible!

The purpose of this article was to give an illustration of how such analyses are done in R with a simple epidemiological model. Those are the numbers our simple model produces and we hope they are wrong.

Additional considerations

As previously mentioned, the SIR model and the analyses done above are rather simplistic and may not give a true representation of the reality. In the following sections, we highlight five improvements that could be done to enhance theses analyses and lead to a better overview of the spread of the Coronavirus in Belgium.

Ascertainment rates

In the previous analyses and graphs, it is assumed that the number of confirmed cases represent all the cases that are infectious. This is far from reality as only a proportion of all cases are screened, detected and counted in the official figures. This proportion is known as the ascertainment rate.

The ascertainment rate is likely to vary during the course of an outbreak, in particular if testing and screening efforts are increased, or if detections methods are changed. Such changing ascertainment rates can be easily incorporated into the model by using a weighting function for the incidence cases.

In his first article, Tim Churches demonstrates that a fixed ascertainment rates of 20% makes little difference to the modelled outbreak with no intervention, except that it all happens a bit more quickly.

More sophisticated models

More sophisticated models could also be used to better reflect real-life transmission processes. For instance, another classical model in disease outbreak is the SEIR model. This extended model is similar to the SIR model, where S stands for Susceptible and R stands for Recovered, but the infected people are divided into two compartments:

  1. E for the Exposed/infected but asymptomatic
  2. I for the Infected and symptomatic

These models belong to the continuous-time dynamic models that assume fixed transition rates. There are other stochastic models that allow for varying transition rates depending on attributes of individuals, social networking, etc.

Modelling the epidemic trajectory using log-linear models

As noted above, the initial exponential phase of an outbreak, when shown in a log-linear plot (the y-axis on a log scale and the x-axis without transformation), appears (somewhat) linear. This suggests that we can model epidemic growth, and decay, using a simple log-linear model of the form:

\[log(y)=rt+b\]

where y is the incidence, r is the growth rate, t is the number of days since a specific point in time (typically the start of the outbreak), and b is the intercept. In this context, two log-linear models, one to the growth phase before the peak, and one to the decay phase after the peak are fitted to the epidemic (incidence cases) curve.

The doubling and halving time estimates which you very often hear in the news can be estimated from these log-linear models. Furthermore, these log-linear models can also be used on the epidemic trajectory to estimate the reproduction number \(R_0\) in the growth and decay phases of the epidemic.

The {incidence} package in R, part of the R Epidemics Consortium (RECON) suite of packages for epidemic modelling and control, makes the fitting of this kind of models very convenient.

Estimating changes in the effective reproduction number \(R_e\)

In our model, we set a reproduction number \(R_0\) and kept it constant. It would nonetheless be useful to estimate the current effective reproduction number \(R_e\) on a day-by-day basis so as to track the effectiveness of public health interventions, and possibly predict when an incidence curve will start to decrease.

The {EpiEstim} package in R can be used to estimate \(R_e\) and allow to take into consideration human travel from other geographical regions in addition to local transmission (Cori et al. 2013; Thompson et al. 2019).

More sophisticated projections

In addition to naïve predictions based on a simple SIR model, more advanced and complex projections are also possible, notably, with the {projections} package. This packages uses data on daily incidence, the serial interval and the reproduction number to simulate plausible epidemic trajectories and project future incidence.

Conclusion

This article started with (i) a description of a couple of R resources on the Coronavirus pandemic (i.e., a collection and a dashboard) that can be used as background materials and (ii) the motivations behind this article. We then detailed the most common epidemiological model, i.e. the SIR model, before actually applying it on Belgium incidence data.

This resulted in a visual comparison of the fitted and observed cumulative incidence in Belgium. It showed that the COVID-19 pandemic is clearly in an exponential phase in Belgium in terms of number of confirmed cases.

We then explained what is the reproduction number and how to compute it in R. Finally, our model was used to analyze the outbreak of the Coronavirus if there was no public health intervention at all.

Under this (probably too) simplistic scenario, the peak of the COVID-19 in Belgium is expected to be reached by the end of April, 2020, with around 525,000 infected people and about 3,700 deaths. These very alarmist naïve predictions highlight the importance of restrictive public health actions taken by governments, and the urgence for citizens to follow these health actions in order to mitigate the spread of the virus in Belgium (or at least slow it enough to allow health care systems to cope with it).

We concluded this article by describing five improvements that could be implemented to further analyze the disease outbreak.

Thanks for reading. I hope this article gave you a good understanding of the spread of the COVID-19 Coronavirus in Belgium. Feel free to use this article as a starting point for analyzing the outbreak of this disease in your own country. See also a collection of top R resources on Coronavirus to gain even further knowledge.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion. If you find a mistake or bug, you can inform me by raising an issue on GitHub. For all other requests, you can contact me here.

Get updates every time a new article is published by subscribing to this blog.

Related articles:

References

Cori, Anne, Neil M Ferguson, Christophe Fraser, and Simon Cauchemez. 2013. “A New Framework and Software to Estimate Time-Varying Reproduction Numbers During Epidemics.” American Journal of Epidemiology 178 (9). Oxford University Press: 1505–12.

Thompson, RN, JE Stockwin, RD van Gaalen, JA Polonsky, ZN Kamvar, PA Demarsh, E Dahlqwist, et al. 2019. “Improved Inference of Time-Varying Reproduction Numbers During Infectious Disease Outbreaks.” Epidemics 29. Elsevier: 100356.

  1. Feel free to let me know in the comments or by contacting me if you performed some analyses specifically for Belgium and which I could include in my article covering the top R resources on the Coronavirus.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R on Stats and R. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Can unbalanced randomization improve power?

Tue, 03/31/2020 - 02:00

[This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Of course, we’re all thinking about one thing these days, so it seems particularly inconsequential to be writing about anything that doesn’t contribute to solving or addressing in some meaningful way this pandemic crisis. But, I find that working provides a balm from reading and hearing all day about the events swirling around us, both here and afar. (I am in NYC, where things are definitely swirling.) And for me, working means blogging, at least for a few hours every couple of weeks.

I have tried in some small way to get involved with researchers who are trying to improve outcomes for patients who are showing the symptoms or test positive for COVID-19. One group that reached out to me is concerned with how patients with heart conditions will be adversely affected by the disease, and is evaluating a number of drug treatments that could improve their outcomes.

Given that we know that outcomes under usual care are not that great for heart patients, there is a desire to try to get possible treatments to as many people as possible, even in a randomized control trial. One question that came up in the design of this study was whether there would be efficiency gains by using a 1:2 randomization scheme? That is, should we randomize two patients to the experimental drug treatment for every one patient we randomize to the usual care group? In the case of a binary outcome, it appears that we will only lose efficiency if we use anything other than a 1:1 randomization.

Brief public service announcement: simstudy update

When it became clear that I needed to explore the implications of unbalanced randomization for this project, I realized that the simstudy package, which supports much of the simulations on this blog, could not readily handle anything other than 1:1 randomization. I had to quickly rectify that shortcoming. There is a new argument ratio in the trtAssign function where you can now specify any scheme for any number of treatment arms. This is available in version 1.16, which for the moment can be found only on github (kgoldfeld/simstudy).

Here is an example based on a 1:2:3 allocation. I’m not sure if that would ever be appropriate, but it shows the flexibility of the new argument. One counter-intuitive aspect of this implementation is that the balance argument is set to TRUE, indicating that the allocation to the groups will be perfect, or as close as possible to the specified ratios. If balance is FALSE, the ratios are used as relative probabilities instead.

library(simstudy) library(parallel) RNGkind("L'Ecuyer-CMRG") set.seed(16) dx <- genData(600) dx <- trtAssign(dx, nTrt = 3, balanced = TRUE, ratio = c(1,2,3), grpName = "rx") dx[, table(rx)] ## rx ## 1 2 3 ## 100 200 300 Unbalanced designs with a binary outcome

The outcome in the COVID-19 study is a composite binary outcome (at least one of a series of bad events has to occur within 30 days to be considered a failure). Here, I am considering the effect of different randomization schemes on the power of the study. We assumed in the usual care arm 40% of the patients would have a bad outcome and that the drug treatment would reduce the bad outcomes by 30% (so that 28% of the drug treatment arm would have a bad outcome).

If we generate a single data set under these assumptions, we can fit a logistic regression model to recover these parameters.

estCoef <- function(n, formula, ratio) { def <- defDataAdd(varname = "y", formula = formula, dist = "binary") dx <- genData(n) dx <- trtAssign(dx, grpName = "rx", ratio = ratio) dx <- addColumns(def, dx) coef(summary(glm(y~rx, family = binomial, data = dx))) } estCoef(n = 244*2, formula = "0.4 - 0.3 * 0.4 * rx", ratio = c(1, 1)) ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -0.4328641 0.1310474 -3.303111 0.0009561867 ## rx -0.4577476 0.1924533 -2.378487 0.0173838304

The probabilities of a bad outcome for the usual care group and drug treatment group are

c(usual = 1/(1 + exp(0.433)), drug = 1/(1+exp(0.433 + 0.458))) ## usual drug ## 0.3934102 0.2909035 Assessing power

In order to assess power, we need to generate many data sets and keep track of the p-values. The power is calculated by estimating the proportion of p-values that fall below 0.05.

Here is the analytic solution for a 1:1 ratio.

power.prop.test(p1 = .4, p2 = .7*.4, power = .80) ## ## Two-sample comparison of proportions power calculation ## ## n = 243.4411 ## p1 = 0.4 ## p2 = 0.28 ## sig.level = 0.05 ## power = 0.8 ## alternative = two.sided ## ## NOTE: n is number in *each* group

The sample size estimate based on 80% suggests we would need 244 patients per arm, or 488 total patients. If we use this estimated \(n\) in a simulation for power (using 1000 datasets), we should be close to 80%:

est.power <- function(n, ratio, p1, reduction) { formula = paste(p1, "* (1 -", reduction, "* rx)") p.val <- estCoef(n, formula, ratio)["rx", 4] return(p.val) } pvals <- unlist(mclapply(1:1000, function(x) est.power(488, c(1, 1), 0.4, 0.3))) mean(pvals < 0.05) ## [1] 0.814 The power experiment

Now we are ready to evaluate the question that motivated all of this. If we start to change the ratio from 1:1 to 1:2, to 1:3, etc., what happens to the power? And does this pattern change based on the assumptions about failure rates in the usual care arm and the expected reductions in the treatment arm? Here is the code that will allow us to explore these questions:

res <- list() for (p1 in c(.2, .3, .4, .5)) { for (r in c(.2, .25, .3)) { p2 <- (1- r) * p1 n <- ceiling(power.prop.test(p1 = p1, p2 = p2, power = .80)$n)*2 for (i in c(1:5)) { pvals <- mclapply(1:1000, function(x) est.power(n, c(1, i), p1, r)) pvals <- unlist(pvals) dres <- data.table(n, control_p = p1, pct_reduction = r, control = 1, rx = i, power = mean( pvals < .05)) res <- append(res, list(dres)) } } } res <- rbindlist(res)

Repeating the power simulation for a variety of assumptions indicates that, at least in the case of a binary outcome, using an unbalanced design does not improve the quality of the research even though it might get more patients the drug treatment:

ggplot(data = res, aes(x = rx, y = power)) + geom_line(color = "blue") + facet_grid(control_p ~ pct_reduction, labeller = label_both) + theme(panel.grid = element_blank()) + scale_x_continuous(name = "ratio of treatment to control", breaks = c(1:5), labels = paste0(c(1:5),":1")) + scale_y_continuous(limits = c(.5,.9), breaks = c(.6, .7, .8))

Continuous outcomes

In the case of binary outcomes, reducing sample size in the control group reduces our ability to efficiently estimate the proportion of events, even though we may improve estimation for the treatment group by adding patients. In the case of a continuous outcome, we may be able to benefit from a shift of patients from one group to another if the variability of responses differs across groups. In particular, arms with more variability could benefit from a larger sample. Next time, I’ll show some simulations that indicate this might be the case.

Stay well.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: ouR data generation. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Predicting 1,000,000 of COVID-19 done with R before March 18 – call for help

Mon, 03/30/2020 - 19:48

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Due to the unprecedented technological progress that has been made in years past, the current climate allows us to monitor this pandemic better than any other pandemic in the past. We will argue, however, that R was instrumental in predicting when the 1,000,000th case of COVID-19 will occur, as demonstrated here in our collaboration spread out on three continents:

https://www.sciencedirect.com/science/article/pii/S2590113320300079

Since India is currently in lockdown and the correction process is in India, it has not been concluded as of writing.

The first draft of our paper was prepared on March 18 and can be accessed here: http://www.cs.laurentian.ca/wkoczkodaj/p/?C=M;O=D
Open this link and click twice on “last modified” to see the data (the computing was done a few days earlier). 
Our heuristic developed for the prediction could not be implemented so quickly had it not been for our use of R. The function ‘nls‘ is crucial for modelling only the front incline part of the Gaussian function (also known as Gaussian). Should this pandemic not stop, or at the very least slow down, one billion cases could occur by the end of May 2020.

The entire world waits for the inflection point (https://en.wikipedia.org/wiki/Inflection_point) and if you help us, we may be able to reach this point sooner.

A few crucial R commands are: 
modE <- nls(dfcov$all ~ a * exp(bdfcov$report), data = dfcov,
start = list(a = 100, b = 0.1))
a <- summary(modE)$parameters[1]
b <- summary(modE)$parameters[2]
summary(modE)
x <- 1:m + dfcov$report[length(dfcov$report)]
modEPr <- a * exp(b
x)

Waldemar W. Koczkodaj (email: wkoczkodaj [AT] cs.laurentian.ca)
(for the entire Collaboration)

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Analyzing Remote Sensing Data using Image Segmentation

Mon, 03/30/2020 - 19:45

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post summarizes material posted as an Additional Topic to accompany the book Spatial Data Analysis in Ecology and Agriculture using R, Second Edition. The full post, together with R code and data, can be found in the Additional Topics section of the book’s website, http://psfaculty.plantsciences.ucdavis.edu/plant/sda2.htm

1. Introduction
The idea is best described with images. Here is a display in mapview of an agricultural region in California’s Central Valley. The boundary of the region is a rectangle developed from UTM coordinates, and it is interesting that the boundary of the region is slightly tilted with respect to the network of roads and field boundaries. There are several possible explanations for this, but it does not affect our presentation, so we will ignore it.
There are clearly several different cover types in the region, including fallow fields, planted fields, natural regions, and open water. Our objective is to develop a land classification of the region based on Landsat data. A common method of doing this is based on the so-called false color image. In a false color image, radiation in the green wavelengths is displayed as blue, radiation in the red wavelengths is displayed as green, and radiation in the infrared wavelengths is displayed as red. Here is a false color image of the region shown in the black rectangle.  

Because living vegetation strongly reflects radiation in the infrared region and absorbs it in the red region, the amount of vegetation contained in a pixel can be estimated by using the normalized difference vegetation index, or NDVI, defined as

where IR is the pixel intensity of the infrared band and R is the pixel intensity of the red band. Here is the NDVI of the region in the false color image above.
If you compare the images you can see that the areas in  black rectangle above that appear green have a high NDVI, the areas that appear brown have a low NDVI, and the open water in the southeast corner of the image actually has a negative NDVI, because water strongly absorbs infrared radiation. Our objective is to generate a classification of the land surface into one of five classes: dense vegetation, sparse vegetation, scrub, open land, and open water. Here is the classification generated using the method I am going to describe.

There are a number of image analysis packages available in R. I elected to use the OpenImageR and SuperpixelImageSegmentation packages, both of which were developed by L. Mouselimis (2018, 2019a). One reason for this choice is that the objects created by these packages have a very simple and transparent data structure that makes them ideal for data analysis and, in particular, for integration with the raster package.  The packages are both based on the idea of image segmentation using superpixels . According to Stutz et al. (2018), superpixels were introduced by Ren and Malik (2003). Stutz et al. define a superpixel as “a group of pixels that are similar to each other in color and other low level properties.” Here is the false color image above divided into superpixels.

The superpixels of can then be represented by groups of pixels all having the same pixel value, as shown here. This is what we will call image segmentation.

In the next section we will introduce the concepts of superpixel image segmentation and the application of these packages. In Section 3 we will discuss the use of these concepts in data analysis. The intent of image segmentation as implemented in these and other software packages is to provide support for human visualization. This can cause problems because sometimes a feature that improves the process of visualization detracts from the purpose of data analysis. In Section 4 we will see how the simple structure of the objects created by Mouselimis’s two packages can be used to advantage to make data analysis more powerful and flexible. This post is a shortened version of the Additional Topic that I described at the start of the post.

2. Introducing superpixel image segmentation
The discussion in this section is adapted from the R vignette Image Segmentation Based on Superpixels and Clustering (Mouselimis, 2019b). This vignette describes the R packages OpenImageR and SuperpixelImageSegmentation. In this post I will only discuss the OpenImageR package. It uses simple linear iterative clustering (SLIC, Achanta et al., 2010), and a modified version of this algorithm called SLICO (Yassine et al., 2018), and again for brevity I will only discuss the former. To understand how the SLIC algorithm works, and how it relates to our data analysis objective, we need a bit of review about the concept of color space. Humans perceive color as a mixture of the primary colors red, green, and blue (RGB), and thus a “model” of any given color can in principle be described as a vector in a three-dimensional vector space. The simplest such color space is one in which each component of the vector represents the intensity of one of the three primary colors. Landsat data used in land classification conforms to this model. Each band intensity is represented as an integer value between 0 and 255. The reason is that this covers the complete range of values that can be represented by a sequence of eight binary (0,1) values because 28 – 1 = 255.  For this reason it is called the eight-bit RGB color model.
There are other color spaces beyond RGB, each representing a transformation for some particular purpose of the basic RGB color space. A common such color space is the CIELAB color space, defined by the International Commission of Illumination (CIE). You can read about this color space here. In brief, quoting from this Wikipedia article, the CIELAB color space “expresses color as three values: L* for the lightness from black (0) to white (100), a* from green (−) to red (+), and b* from blue (−) to yellow (+). CIELAB was designed so that the same amount of numerical change in these values corresponds to roughly the same amount of visually perceived change.”
Each pixel in an image contains two types of information: its color, represented by a vector in a three dimensional space such as the RGB or CIELAB color space, and its location, represented by a vector in two dimensional space (its x and y coordinates). Thus the totality of information in a pixel is a vector in a five dimensional space. The SLIC algorithm performs a clustering operation similar to K-means clustering on the collection of pixels as represented in this five-dimensional space.
The SLIC algorithm measures total distance between pixels as the sum of two components, dlab, the distance in the CIELAB color space, and dxy, the distance in pixel (x,y) coordinates. Specifically, the distance Ds is computed as

where S is the grid interval of the pixels and m is a compactness parameter that allows one to emphasize or de-emphasize the spatial compactness of the superpixels. The algorithm begins with a set of K cluster centers regularly arranged in the (x,y) grid and performs K-means clustering of these centers until a predefined convergence measure is attained.
The following discussion assumes some basic familiarity with the raster package. If you are not familiar with this package, an excellent introduction is given here. After downloading the data files, you can use them to first construct RasterLayer objects and then combine them into a RasterBrick.

library(raster)
> b2 <- raster(“region_blue.grd”)
> b3 <- raster(“region_green.grd”)
> b4 <- raster(“region_red.grd”)
> b5 <- raster(“region_IR.grd”)
> region.brick <- brick(b2, b3, b4, b5)

We will need to save the number of rows and columns for present and future use.
> print(nrows <- region.brick@nrows)
[1] 187
> print(ncols <- region.brick@ncols)
[1] 329
Next we plot the region.brick object (shown above) and write it to disk.
> # False color image
> plotRGB(region.brick, r = 4, g = 3, b = 2,
+   stretch = “lin”)
> # Write the image to disk
> jpeg(“FalseColor.jpg”, width = ncols,
+   height = nrows)
> plotRGB(region.brick, r = 4, g = 3, b = 2,
+   stretch = “lin”)
> dev.off()

We will now use SLIC for image segmentation. The code here is adapted directly from the superpixels vignette.

> library(OpenImageR)
> False.Color <- readImage(“FalseColor.jpg”)
> Region.slic = superpixels(input_image =
+   False.Color, method = “slic”, superpixel = 80,
+   compactness = 30, return_slic_data = TRUE,
+   return_labels = TRUE, write_slic = “”,
+   verbose = FALSE)
Warning message:
The input data has values between 0.000000 and 1.000000. The image-data will be multiplied by the value: 255!
> imageShow(Region.slic$slic_data)


 
We will discuss the warning message in Section 3. The function imageShow() is part of the OpenImageR package.
The OpenImageR function superpixels() creates superpixels, but it does not actually create an image segmentation in the sense that we are using the term is defined in Section 1. This is done by functions in the SuperPixelImageSegmentation package (Mouselimis, 2018), whose discussion is also contained in the superpixels vignette. This package is also described in the Additional Topic. As pointed out there, however, some aspects that make the function well suited for image visualization detract from its usefulness for data analysis, so we won’t discuss it further in this post. Instead, we will move directly to data analysis using the OpenImageR package.

3. Data analysis using superpixels
To use the functions in the OpenImageR package for data analysis, we must first understand the structure of objects created by these packages. The simple, open structure of the objects generated by the function superpixels() makes it an easy matter to use these objects for our purposes. Before we begin, however, we must discuss a preliminary matter: the difference between a Raster* object and a raster object. In Section 2 we used the function raster() to generate the blue, green, red and IR band objects b2, b3, b4, and b5, and then we used the function brick() to generate the object region.brick. Let’s look at the classes of these objects.

> class(b3)
[1] “RasterLayer”
attr(,”package”)
[1] “raster”
> class(full.brick)
[1] “RasterBrick”
attr(,”package”)
[1] “raster”

Objects created by the raster packages are called Raster* objects, with a capital “R”. This may seem like a trivial detail, but it is important to keep in mind because the objects generated by the functions in the OpenImageR and SuperpixelImageSegmentation packages are raster objects that are not Raster* objects (note the deliberate lack of a courier font for the word “raster”).
In Section 2 we used the OpenImageR function readImage() to read the image data into the object False.Color for analysis. Let’s take a look at the structure of this object.

> str(False.Color)
num [1:187, 1:329, 1:3] 0.373 0.416 0.341 0.204 0.165 …

It is a three-dimensional array, which can be thought of as a box, analogous to a rectangular Rubik’s cube. The “height” and “width” of the box are the number of rows and columns respectively in the image and at each value of the height and width the “depth” is a vector whose three components are the red, green, and blue color values of that cell in the image, scaled to [0,1] by dividing the 8-bit RGB values, which range from 0 to 255, by 255. This is the source of the warning message that accompanied the output of the function superpixels(), which said that the values will be multiplied by 255. For our purposes, however, it means that we can easily analyze images consisting of transformations of these values, such as the NDVI. Since the NDVI is a ratio, it doesn’t matter whether the RGB values are normalized or not. For our case the RasterLayer object b5 of the RasterBrick False.Color is the IR band and the object b4 is the R band. Therefore we can compute the NDVI as

> NDVI.region <- (b5 – b4) / (b5 + b4)

Since NDVI is a ratio scale quantity, the theoretically best practice is to plot it using a monochromatic representation in which the brightness of the color (i.e., the color value) represents the value (Tufte, 1983). We can accomplish this using the RColorBrewer function brewer.pal().

> library(RColorBrewer)
> plot(NDVI.region, col = brewer.pal(9, “Greens”),
+   axes = TRUE, main = “Region NDVI”)
 

This generates the NDVI map shown above. The darkest areas have the highest NDVI. Let’s take a look at the structure of the RasterLayer object NDVI.region.

> str(NDVI.region)
Formal class ‘RasterLayer’ [package “raster”] with 12 slots
              *                 *                   *
..@ data    :Formal class ‘.SingleLayerData’ [package “raster”] with 13 slots
              *                 *                   *
.. .. ..@ values    : num [1:61523] 0.1214 0.1138 0.1043 0.0973 0.0883 …
              *                 *                   *
..@ ncols   : int 329
..@ nrows   : int 187

Only the relevant parts are shown, but we see that we can convert the raster data to a matrix that can be imported into our image segmentation machinery as follows. Remember that by default R constructs matrices by columns. The data in Raster* objects such as NDVI.region are stored by rows, so in converting these data to a matrix we must specify byrow = TRUE.

> NDVI.mat <- matrix(NDVI.region@data@values,
+   nrow = NDVI.region@nrows,
+   ncol = NDVI.region@ncols, byrow = TRUE)

The function imageShow() works with data that are either in the eight bit 0 – 255 range or in the [0,1] range (i.e., the range of x between and including 0 and 1). It does not, however, work with NDVI values if these values are negative. Therefore, we will scale NDVI values to [0,1].

> m0 <- min(NDVI.mat)
> m1 <- max(NDVI.mat)
> NDVI.mat1 <- (NDVI.mat – m0) / (m1 – m0)
> imageShow(NDVI.mat)

 Here is the resulting image.

The function imageShow() deals with a single layer object by producing a grayscale image. Unlike the green NDVI plot above, in this plot the lower NDVI regions are darker.
We are now ready to carry out the segmentation of the NDVI data. Since the structure of the NDVI image to be analyzed is the same as that of the false color image, we can simply create a copy of this image, fill it with the NDVI data, and run it through the superpixel image segmentation function. If an image has not been created on disk, it is also possible (see the Additional Topic) to create the input object directly from the data.

> NDVI.data <- False.Color
> NDVI.data[,,1] <- NDVI.mat1
> NDVI.data[,,2] <- NDVI.mat1
> NDVI.data[,,3] <- NDVI.mat1

In the next section we will describe how to create an image segmentation of the NDVI data and how to use cluster analysis to create a land classification.

4. Expanding the data analysis capacity of superpixels
Here is an application of the function superpixels() to the NDVI data generated in Section 3.

> NDVI.80 = superpixels(input_image = NDVI.data,
+   method = “slic”, superpixel = 80,
+   compactness = 30, return_slic_data = TRUE,
+   return_labels = TRUE, write_slic = “”,
+   verbose = FALSE)
> imageShow(Region.slic$slic_data)

Here is the result.
Here the structure of the object NDVI.80 .

> str(NDVI.80)
List of 2
$ slic_data: num [1:187, 1:329, 1:3] 95 106 87 52 42 50 63 79 71 57 …
$ labels   : num [1:187, 1:329] 0 0 0 0 0 0 0 0 0 0 …

It is a list with two elements, NDVI.80$slic_data, a three dimensional array of pixel color data (not normalized), and NDVI.80$labels, a matrix whose elements correspond to the pixels of the image. The second element’s name hints that it may contain values identifying the superpixel to which each pixel belongs. Let’s see if this is true.

> sort(unique(as.vector(NDVI.80$labels)))
[1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[25] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
[49] 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71

There are 72 unique labels. Although the call to superpixels() specified 80 superpixels, the function generated 72. We can see which pixels have the label value 0 by setting the values of all of the other pixels to [255, 255, 255], which will plot as white.

> R0 <- NDVI.80
for (i in 1:nrow(R0$label))
+    for (j in 1:ncol(R0$label))
+       if (R0$label[i,j] != 0)
+          R0$slic_data[i,j,] <- c(255,255,255)
> imageShow(R0$slic_data)


This produces the plot shown here.
The operation isolates the superpixel in the upper right corner of the image, together with the corresponding portion of the boundary. We can easily use this approach to figure out which value of NDVI.80$label corresponds to which superpixel. Now let’s deal with the boundary. A little exploration of the NDVI.80 object suggests that pixels on the boundary have all three components equal to zero. Let’s isolate and plot all such pixels by coloring all other pixels white.

> Bdry <- NDVI.80
for (i in 1:nrow(Bdry$label))
+    for (j in 1:ncol(Bdry$label))
+     if (!(Bdry$slic_data[i,j,1] == 0 &
+     Bdry$slic_data[i,j,2] == 0 &
+       Bdry$slic_data[i,j,3] == 0))
+       Bdry$slic_data[i,j,] <- c(255,255,255)
> Bdry.norm <- NormalizeObject(Bdry$slic_data)
> imageShow(Bdry$slic_data)

This shows that we have indeed identified the boundary pixels. Note that the function imageShow() displays these pixels as white with a black edge, rather than pure black.

Having done a preliminary analysis, we can organize our segmentation process into two steps. The first step will be to replace each of the superpixels generated by the OpenImageR function superpixels() with one in which each pixel has the same value, corresponding to a measure of central tendency (e.g, the mean, median , or mode) of the original superpixel. The second step will be to use the K-means unsupervised clustering procedure to organize the superpixels from step 1 into a set of clusters and give each cluster a value corresponding to a measure of central tendency of the cluster. I used the code developed to identify the labels and to generate the  boundary to construct a function make.segments() to carry out the segmentation. The first argument of make.segments() is the superpixels object, and the second is the functional measurement of central tendency. Although in this case each of the three colors of the object NDVI.80 have the same values, this may not be true for every application, so the function analyzes each color separately. Because the function is rather long, I have not included it in this post. If you want to see it, you can go to the Additional Topic linked to the book’s website.

Here is the application of the function to the object NDVI.80 with the second argument set to mean.

> NDVI.means <- make.segments(NDVI.80, mean)
> imageShow(NDVI.means$slic_data)

Here is the result.

The next step is to develop clusters that represent identifiable land cover types. In a real project the procedure would be to collect a set of ground truth data from the site, but that option is not available to us. Instead we will work with the true color rendition of the Landsat scene, shown here.
The land cover is subdivided using K-means into five types: dense crop, medium crop, scrub, open, and water.

> set.seed(123)
> NDVI.clus <-
+ kmeans(as.vector(NDVI.means$slic_data[,,1]), 5)
> vege.class <- matrix(NDVI.clus$cluster,
+    nrow = NDVI.region@nrows,
+    ncol = NDVI.region@ncols, byrow = FALSE)
> class.ras <- raster(vege.class, xmn = W,
+    xmx = E, ymn = S, ymx = N, crs =
+    CRS(“+proj=utm +zone=10 +ellps=WGS84”))

Next I used the raster function ratify() to assign descriptive factor levels to the clusters.

> class.ras <- ratify(class.ras)
> rat.class <- levels(class.ras)[[1]]
> rat.class$landcover <- c(“Water”, “Open”,
+  “Scrub”, “Med. Crop”, “Dense Crop”)
> levels(class.ras) <- rat.class
> levelplot(class.ras, margin=FALSE,
+   col.regions= c(“blue”, “tan”,
+   “lightgreen”, “green”, “darkgreen”),
+    main = “Land Cover Types”)

The result is shown in the figure at the start of the post. We can also overlay the original boundaries on top of the image. This is more easily done using plot() rather than levelplot(). The function plot() allows plots to be built up in a series of statements. The function levelplot() does not.

> NDVI.rasmns <- raster(NDVI.means$slic_data[,,1],
+   xmn = W, xmx = E, ymn = S, ymx = N,
+   crs = CRS(“+proj=utm +zone=10 +ellps=WGS84”))
> NDVI.polymns <- rasterToPolygons(NDVI.rasmns,
+   dissolve = TRUE)
> plot(class.ras, col = c(“blue”, “tan”,
+   “lightgreen”, “green”, “darkgreen”),
+   main = “Land Cover Types”, legend = FALSE)
> legend(“bottom”, legend = c(“Water”, “Open”,
+   “Scrub”, “Med. Crop”, “Dense Crop”),
+   fill = c(“blue”, “tan”, “lightgreen”, “green”,
+   “darkgreen”))
> plot(NDVI.polymns, add = TRUE)

The application of image segmentation algorithms to remotely sensed image classification is a rapidly growing field, with numerous studies appearing every year. At this point, however, there is little in the way of theory on which to base an organization of the topic. If you are interested in following up on the subject, I encourage you to explore it on the Internet.

References
Achanta, R., A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk (2010). SLIC Superpixels. Ecole Polytechnique Fedrale de Lausanne Technical Report 149300.

Appelhans, T., F. Detsch, C. Reudenbach and S. Woellauer (2017).  mapview: Interactive Viewing of Spatial Data in R. R package version 2.2.0.  https://CRAN.R-project.org/package=mapview

Bivand, R., E. Pebesma, and V. Gómez-Rubio. (2008). Applied Spatial Data Analysis with R. Springer, New York, NY.

Frey, B.J. and D. Dueck (2006). Mixture modeling by affinity propagation. Advances in Neural Information Processing Systems 18:379.

Hijmans, R. J. (2016). raster: Geographic Data Analysis and Modeling. R package version 2.5-8. https://CRAN.R-project.org/package=raster

Mouselimis, L. (2018). SuperpixelImageSegmentation: Superpixel Image Segmentation. R package version 1.0.0. https://CRAN.R-project.org/package=SuperpixelImageSegmentation

Mouselimis, L. (2019a). OpenImageR: An Image Processing Toolkit. R package version 1.1.5. https://CRAN.R-project.org/package=OpenImageR

Mouselimis, L. (2019b) Image Segmentation Based on Superpixel Images and Clustering. https://cran.r-project.org/web/packages/OpenImageR/vignettes/Image_segmentation_superpixels_clustering.html.

Ren, X., and J. Malik (2003) Learning a classification model for segmentation. International Conference on Computer Vision, 2003, 10-17.

Stutz, D., A. Hermans, and B. Leibe (2018). Superpixels: An evaluation of the state-of-the-art. Computer Vision and Image Understanding 166:1-27.

Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, Conn.

Yassine, B., P. Taylor, and A. Story (2018).  Fully automated lung segmentation from chest radiographs using SLICO superpixels. Analog Integrated Circuits and Signal Processing 95:423-428.

Zhou, B. (2013) Image segmentation using SLIC superpixels and affinity propagation clustering. International Journal of Science and Research. https://pdfs.semanticscholar.org/6533/654973054b742e725fd433265700c07b48a2.pdf

 

<

p style=”text-align: center”> 

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R Tip: How To Look Up Matrix Values Quickly

Mon, 03/30/2020 - 18:41

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R is a powerful data science language because, like Matlab, numpy, and Pandas, it exposes vectorized operations. That is, a user can perform operations on hundreds (or even billions) of cells by merely specifying the operation on the column or vector of values.

Of course, sometimes it takes a while to figure out how to do this. Please read for a great R matrix lookup problem and solution.

In R we can specify operations over vectors. For arithmetic this is easy, but some more complex operations you “need to know the trick.”

Patrick Freeman (@PTFreeman) recently asked: what is the idiomatic way to look up a bunch of values from a matrix by row and column keys? This is actually easy to do if we first expand the data matrix into RDF-triples. If our data were in this format we could merge/join it against our desired column indices.

Let’s start with an example data matrix.

# example matrix data m <- matrix(1:9, nrow = 3) row.names(m) <- c('R1' ,'R2', 'R3') colnames(m) <- c('C1', 'C2', 'C3') knitr::kable(m) C1 C2 C3 R1 1 4 7 R2 2 5 8 R3 3 6 9

And our data-frame containing the indices we want to look-up.

# row/columns we want w <- data.frame( i = c('R1', 'R2', 'R2'), j = c('C2', 'C3', 'C2')) knitr::kable(w) i j R1 C2 R2 C3 R2 C2

That is: we want to know the matrix values from [R1, C2], [R2, C3], and [R2, C2].

The trick is: how do we convert the matrix into triples? digEmAll, has a great solution to that here.

# unpack into 3-column format from: # https://stackoverflow.com/a/9913601 triples <- data.frame( i = rep(row.names(m), ncol(m)), j = rep(colnames(m), each = nrow(m)), v = as.vector(m)) knitr::kable(triples) i j v R1 C1 1 R2 C1 2 R3 C1 3 R1 C2 4 R2 C2 5 R3 C2 6 R1 C3 7 R2 C3 8 R3 C3 9

What the above code has done is: write each entry of the original matrix as a separate row with the original row and column ids landed as new columns. This data format is very useful.

The above code is worth saving as a re-usable snippet, as getting it right is a clever step.

Now we can vectorize our lookup using the merge command, which produces a new joined table where the desired values have been landed as a new column.

res <- merge(w, triples, by = c('i', 'j'), sort = FALSE) knitr::kable(res) i j v R1 C2 4 R2 C3 8 R2 C2 5

And that is it: we have used vectorized and relational concepts to look up many values from a matrix very quickly.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Why R? Webinars

Mon, 03/30/2020 - 15:00

[This article was first published on http://r-addict.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Motivated by a successful turnaround of visits of Why R? 2019 keynote talks
that uploaded to Why R? YouTube channel we decided to start Why R? Webinars series! We hope this will impact the growth of the R community.

Details
  • channel: youtube.com/c/WhyRFoundation
  • date: every Thursday 8:00 pm GMT+2 (starting April 2nd!)
  • format: one 45 minutes long talk streamed on YouTube + 10 minutes for Q&A
  • comments: ability to ask questions on YouTube as message on live chat
First talk

We are excited to present our first webinar speaker: Achim Zeileis from Universität Innsbruck.
The topic of the webinar is R/exams: A One-for-All Exams Generator – Online Tests, Live Quizzes, and Written Exams with R which is a very good suite for teachers facing the need of
the remote R teaching!

Speaker’s biogram and the abstract of the talk is available on a meetup event and on the webinar url that is also visible from youtube.com/c/WhyRFoundation YouTube channel.

Stay up to date var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: http://r-addict.com. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Screenager: screening times at bioRxiv

Mon, 03/30/2020 - 12:00

[This article was first published on Rstats – quantixed, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When a preprint is uploaded to bioRxiv, it undergoes screening before it appears online. How long does it take for Affiliates to screen preprints at bioRxiv?

tl;dr I used R to look at bioRxiv screening times. Even though bioRxiv has expanded massively, screening happens quickly (in about 24 h).

I am a bioRxiv Affiliate – one of the people who does the screening. Preprints wait in a queue to be screened. Over the years, I’ve seen the typical queue get longer and longer. In the early days the queue was maybe 10 preprints. These days it’s often over 100.

It’s a team effort and more Affiliates have been recruited over time. Yet I often wonder how we’re doing. My impression is that there are always lots of Neuroscience and Bioinformatics papers in queue. Do any subject areas get neglected? If so, should Affiliates in these areas be recruited specifically?

To look at these questions I used this wonderful R client for the bioRxiv API written by Nicholas Fraser.

To set up:

devtools::install_github("nicholasmfraser/rbiorxiv") # load packages library(rbiorxiv) library(tidyverse) library(gridExtra) # make directory for output if it doesn't exist if (dir.exists("./output")==FALSE) dir.create("./output") if (dir.exists("./output/plots")==FALSE) dir.create("./output/plots")

Use the R client to get a data frame of preprints uploaded in 2020.

data <- biorxiv_content(from = "2020-01-01", to = "2020-03-29", limit = "*", format = "df")

We only want to look at new preprints (Version 1) and not revisions, so let’s filter for that. Then, we’ll take advantage of bioRxiv’s new style DOIs to find the “submission date”.

data <- filter(data, version == 1) data$doi_date <- substr(data$doi, 9, 18) data$doi_date <- gsub("\\.", "-", data$doi_date) data$days <- as.Date(data$date) - as.Date(data$doi_date) data$category <- as.factor(data$category)

We now have a column called ‘days’ that shows the time in days from “submission” to “publication”. We will use this as a measure of screening time. Note: this is imperfect because the submission date is when an author begins uploading their preprint (they could take several days to do this) and not when it actually gets submitted to bioRxiv.

Let’s look at the screening time per subject area.

p1 <- ggplot(data, aes(x = as.numeric(days))) + geom_histogram(binwidth = 1) + xlim(NA, 6) + facet_grid(category ~ ., scales = "free") + labs(x = "Days", y= "Preprints") + theme(strip.text.y = element_text(angle = 0)) ggsave("./output/plots/screenLag.png", p1, height = 15, width = 6, dpi = 300,) Histogram of screening times per subject

I was surprised to see that, with the exception of “Scientific Communication and Education”, the screening times were pretty constant across categories.

The subject areas on bioRxiv are not equal in size. Look at the numbers on the axes for Zoology and for Neuroscience to get a feel for the difference. The histogram view conceals these differences.

Next, we can calculate the average screening time and see if the busiest categories suffer delayed screening.

df1 <- aggregate(as.numeric(data$days), list(data$category), mean) colnames(df1) <- c("category","mean_days") df2 <- count(data$category) colnames(df2) <- c("category","count") summary_df <- merge(df1,df2)

And then make some bar charts to look at the data.

p3 <- ggplot(summary_df, aes(x = category, y = mean_days)) + geom_bar(stat = "identity") + scale_x_discrete(limits = rev(levels(summary_df$category))) + labs(x = "", y = "Mean (days)") + coord_flip() p4 <- ggplot(summary_df, aes(x = category, y = count)) + geom_bar(stat = "identity") + scale_x_discrete(limits = rev(levels(summary_df$category))) + labs(x = "", y = "Preprints") + coord_flip() p5 <- grid.arrange(p3,p4, nrow = 1, ncol = 2) ggsave("./output/plots/summary.png", p5, height = 8, width = 8, dpi = 300,)

The average screening time is 1 day or less. Neuroscience, microbiology and bioinformatics (the biggest categories) have similar screening delays to less busy categories. So, assuming that Affiliates screen on the basis of expertise, the pool is either enriched for these popular areas, or those affiliates are more busy!

The longest lag is for “Scientific Communication and Education”, which is a very small category. Assuming the authors take a similar time to upload these manuscripts, I guess the Affiliates tend to screen these preprints as a lower priority. These papers do tend to be a bit different from other research papers and have separate screening criteria. Anyway, they still get screened in just over 2 days, which is still impressive.

I was pleased to see “Cell Biology” had the shortest screening time (around half a day)!

Conclusion

Even though my impression was that Bioinformatics and Neuroscience papers linger in the queue, this is not actually the case. There’s likely more of them in the queue because there are more of them, period.

The bioRxiv team have done a great job in maintaining a pool of Affiliates that can screen the huge number of preprints that are uploaded.

The post title comes from “Screenager” by Muse from their Origin of Symmetry album.

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: Rstats – quantixed. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Coronavirus : spatially smoothed decease in France

Mon, 03/30/2020 - 08:34

[This article was first published on r.iresmi.net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

# Carto décès COVID 19 France # avec lissage # sources ----------------------------------------------------------------- fichier_covid <- "donnees/covid.csv" url_donnees_covid <- "https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7" fichier_pop <- "donnees/pop.xls" # https://www.insee.fr/fr/statistiques/2012713#tableau-TCRD_004_tab1_departements url_donnees_pop <- "https://www.insee.fr/fr/statistiques/fichier/2012713/TCRD_004.xls" # Adminexpress : # https://geoservices.ign.fr/documentation/diffusion/telechargement-donnees-libres.html#admin-express # config ------------------------------------------------------------------ library(tidyverse) library(httr) library(fs) library(sf) library(readxl) library(janitor) library(tmap) # + btb, raster, fasterize, plyr #' Kernel weighted smoothing with arbitrary bounding area #' #' @param df sf object (points) #' @param field weight field in the df #' @param bandwidth kernel bandwidth (map units) #' @param resolution output grid resolution (map units) #' @param zone sf study zone (polygon) #' @param out_crs EPSG (should be an equal-area projection) #' #' @return a raster object #' @import btb, raster, fasterize, dplyr, plyr, sf lissage <- function(df, field, bandwidth, resolution, zone, out_crs = 3035) { if (st_crs(zone)$epsg != out_crs) { message("reprojecting data...") zone <- st_transform(zone, out_crs) } if (st_crs(df)$epsg != out_crs) { message("reprojecting study zone...") df <- st_transform(df, out_crs) } zone_bbox <- st_bbox(zone) # grid generation message("generating reference grid...") zone_xy <- zone %>% dplyr::select(geometry) %>% st_make_grid(cellsize = resolution, offset = c(plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor), plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor)), what = "centers") %>% st_sf() %>% st_join(zone, join = st_intersects, left = FALSE) %>% st_coordinates() %>% as_tibble() %>% dplyr::select(x = X, y = Y) # kernel message("computing kernel...") kernel <- df %>% cbind(., st_coordinates(.)) %>% st_set_geometry(NULL) %>% dplyr::select(x = X, y = Y, field) %>% btb::kernelSmoothing(dfObservations = ., sEPSG = out_crs, iCellSize = resolution, iBandwidth = bandwidth, vQuantiles = NULL, dfCentroids = zone_xy) # rasterization message("\nrasterizing...") raster::raster(xmn = plyr::round_any(zone_bbox[1] - bandwidth, resolution, f = floor), ymn = plyr::round_any(zone_bbox[2] - bandwidth, resolution, f = floor), xmx = plyr::round_any(zone_bbox[3] + bandwidth, resolution, f = ceiling), ymx = plyr::round_any(zone_bbox[4] + bandwidth, resolution, f = ceiling), resolution = resolution) %>% fasterize::fasterize(kernel, ., field = field) } # téléchargement-------------------------------------------------------------- if (!file_exists(fichier_covid) | file_info(fichier_covid)$modification_time < Sys.Date()) { GET(url_donnees_covid, progress(), write_disk(fichier_covid, overwrite = TRUE)) } if (!file_exists(fichier_pop)) { GET(url_donnees_pop, progress(), write_disk(fichier_pop)) } # données ----------------------------------------------------------------- covid <- read_csv2(fichier_covid) # adminexpress prétéléchargé dep <- read_sf("donnees/ADE_2-0_SHP_LAMB93_FR/DEPARTEMENT.shp") %>% clean_names() pop <- read_xls(fichier_pop, skip = 2) %>% clean_names() # prétraitement ----------------------------------------------------------- fr <- dep %>% st_union() %>% st_sf() %>% st_set_crs(2154) deces <- dep %>% left_join(pop, by = c("insee_dep" = "x1")) %>% left_join(covid %>% filter(jour == max(jour), sexe == 0) %>% group_by(dep) %>% summarise(deces = sum(dc, na.rm = TRUE)), by = c("insee_dep" = "dep")) %>% st_point_on_surface() %>% st_set_crs(2154) # lissage ----------------------------------------------------------------- d <- deces %>% lissage("deces", 100000, 10000, fr, 3035) p <- deces %>% lissage("x2020_p", 100000, 10000, fr, 3035) d100k <- d * 100000 / p # carto ------------------------------------------------------------------- tm_layout(title = paste("Covid19 - France métropolitaine", max(covid$jour)), legend.position = c("left", "bottom")) + tm_shape(d100k) + tm_raster(title = "décès pour 100 000 hab.\n(lissage 100 km sur grille 10 km,\n classif. kmeans)", style = "kmeans", palette = "viridis", legend.format = list(text.separator = "à moins de", digits = 0), legend.reverse = TRUE) + tm_shape(dep) + tm_borders() + tm_credits("http://r.iresmi.net/\nprojection LAEA Europe\ndonnées départementales Santé publique France\nINSEE RP 2020, IGN Adminexpress 2019") var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: r.iresmi.net. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Meet {tidycovid19}: Yet another Covid-19 related R Package

Mon, 03/30/2020 - 02:00

[This article was first published on An Accounting and Data Science Nerd's Corner, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I have decided that the world needs another Covid-19 related R package. Not sure whether you agree, but the new package facilitates the direct download of various Covid-19 related data (including data on governmental measures) directly from authoritative sources. It also provides a flexible function and accompanying shiny app to visualize the spreading of the virus. Play around with the shiny app here if you like or hang around to learn more about the package.

Why yet another package on Covid-19?

There are at least two R packages that provide data and infrastructure related to Covid-19:

  • {nCov2019}: This package has a focus on Chinese data but also contains data on other countries and regions. It contains a shiny dashboard.
  • {conronavirus}: This package provides
    the Johns Hopkins University CSSE dataset together with a dashboard

Other than the two packages above, the key objective of the {tidycovid19} package is to provide transparent access to various data sources. It does not contain any data per se. Instead, it provides functions to pull data from publicly available sources.

However, to save processing time and bandwidth, for those interested in speedy downloads it alternatively provides the option to download the data from the cached data in the Github repository (stored in the directory cached_data). The cached data will be updated daily.

If you rather want to start your own project by customizing the download code to fit your needs, I suggest that you take a look at my Github repository “tidy_covid19” (mind the underscore).
This repo presents a forkable code infrastructure for Covid 19 projects using the same data infrastructure.

The disclaimer from my previous blog post continues to apply. I am not and Epidemiologist. My motivation to develop this package is to help researchers interested in studying the spread of the virus and the effects of non-pharmaceutical interventions on the virus spread. To reach this objective, the package pulls data from various sources. T

Here are the download functions offered by the package:

  • download_jhu_csse_covid19_data(): Downloads and tidies Covid-19 data from the Johns Hopkins University CSSE Github Repo. This data has developed to a standard resource for researchers and the general audience
    interested in assessing the global spreading of the virus. The data is aggregated to the country level.
  • download_acaps_npi_data(): Downloads and tidies the Government measures dataset provided by the Assessment Capacities Project (ACAPS). These relatively new data allow researchers to study the effect of non-pharmaceutical interventions on the development of the virus.
  • download_google_trends_data(): Uses {gtrendsR} to Download and tidy Google Trends data on the search volume for the term “coronavirus” (Thank you to Yan Ouaknine for bringing up that idea!). This data can be used to assess the public attention to Covid-19 across countries (see plot below) and over time within a given country.
  • download_wbank_data(): Downloads and tidies additional country level information provided by the World Bank using the {wbstats} package. These data allow researchers to calculate per capita measures of the virus spread and to assess the association of macro-economic variables with the development of the virus.
  • download_merged_data(): Downloads all data sources and creates a merged country-day panel sample.

All functions can be called with the parameter cached = TRUE to download the cached data instead of assessing the original data sources. So, a quick way to use the data is to do the following.

# remotes::install_github("joachim-gassen/tidycovid19") suppressPackageStartupMessages({ library(tidycovid19) library(dplyr) library(ggplot2) library(ggrepel) }) merged_dta <- download_merged_data(cached = TRUE) ## Downloading cached version of merged data...done. Timestamp is 2020-03-30 06:43:14 merged_dta %>% group_by(country) %>% mutate( reported_deaths = max(deaths), soc_dist_measures = max(soc_dist) ) %>% select(country, iso3c, reported_deaths, soc_dist_measures) %>% distinct() %>% ungroup() %>% arrange(-reported_deaths) %>% head(20) -> df ggplot(df, aes(x = reported_deaths, y = soc_dist_measures)) + geom_point() + geom_label_repel(aes(label = iso3c)) + theme_minimal() + scale_x_continuous(trans='log10', labels = scales::comma) + labs(x = "Reported deaths (logarithmic scale)", y = "Number of governmental social distancing measures", annotation = "Data from JHU CSSE and ACAPS.")

Visualization

The focus of the package lies on data collection and not on visualization as there are already many great tools floating around. The function plot_covid19_spread() however, allows you to quickly visualize the spread of the virus in relation to governmental intervention measures. It is inspired by the insightful displays created by John Burn-Murdoch from the Financial Times and offers various customization options.

plot_covid19_spread(merged_dta, highlight = c("ITA", "ESP", "FRA", "DEU", "USA"), intervention = "lockdown")

Shiny App

Sorry, I could not resist. The options of the plot_covid19_spread() make the
implementation of a shiny app a little bit to tempting to pass. The command
shiny_covid19_spread() starts the app. You can check it out online if you like.



Wrapping up

This is it for the time being. I hope that the package might be helpful to those
doing Covid-19 related research. If you have suggestions and/or feedback, consider
opening an Issue on Github

Stay well and keep #FlattenTheCurve!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: An Accounting and Data Science Nerd's Corner. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Installing spatial R packages on Ubuntu

Mon, 03/30/2020 - 02:00

[This article was first published on the Geocomputation with R website, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This post explains how to quickly get key R packages for geographic research installed on Ubuntu, a popular Linux distribution.

A recent thread on the r-spatial GitHub organization alludes to many considerations when choosing a Linux set-up for work with geographic data, ranging from the choice of Linux distribution (distro) to the use of binary vs or compiled versions (binaries are faster to install). This post touches on some of these things but, its main purpose is to provide advice on getting R’s key spatial packages up-and-running on a future-proof Linux operating system (Ubuntu).

Now is an excellent time to be thinking about the topic because changes are in the pipeline and getting set-up (or preparing to get set-up) now could save hours in the future.
These imminent changes include:

  • The next major release of R (4.0.0), scheduled for the 24th April (2020-04-24)
  • The next major release of Ubuntu (20.04), a Long Term Support (LTS) version that will be used by millions of servers and research computers worldwide for years to come. Coincidentally, Ubuntu 20.04 will be released a day earlier than R 4.0.0, on 23rd April (2020-04-23).
  • Ongoing changes to the OSGeo stack on which key geographic R packages depend, as documented in r-spatial repos and a recent blog post on how recent versions of PROJ enable more precise coordinate reference system definitions.

To keep-up with these changes, this post will be updated in late April when some of the dust has settled around these changes.
However, the advice presented here should be future-proof, including information on how to upgrade Ubuntu in section 3.

There many ways of getting Ubuntu set-up for spatial R packages.
A benefit of Linux operating systems is that they offer choice and prevent ‘lock-in’.
However, the guidance in the next section should reduce set-up time and improve maintainability (with updates managed by Ubuntu) compared with other ways of doing things, especially for beginners.
If you’re planning to switch to Linux as the basis of your geographic work, this advice may be particularly useful.
(The post was written in response to people asking how to set-up R on their new Ubuntu installations.
For more on getting a computer running Ubuntu, check out companies that support open source operating systems and guides installing Ubuntu on an existing machine.

By ‘key packages’ I mean the following, which enable the majority of day-to-day geographic data processing and visualization tasks:

  • sf for reading, writing and working with a range geographic vector file formats and geometry types
  • raster, a mature package for working with geographic raster data (see the terra for an in-development replacement for raster)
  • tmap, a flexible package for making static and interactive maps

The focus is on Ubuntu because that’s what I’ve got most experience with and it is well supported by the community.
Links for installing geographic R packages on other distros are provided in a subsequent.

1. Installing spatial R packages on Ubuntu


R’s spatial packages can be installed from source on the latest version of this popular operating system, once the appropriate repository has been set-up, meaning faster install times (only a few minutes including the installation of upstream dependencies).
The following bash commands should install key geographic R packages on Ubuntu 19.10:

# add a repository that ships the latest version of R: sudo add-apt-repository ppa:marutter/rrutter3.5 # update the repositories so the software can be found: sudo apt update # install system dependencies: sudo apt install libudunits2-dev libgdal-dev libgeos-dev libproj-dev libfontconfig1-dev # binary versions of key R packages: sudo apt install r-base-dev r-cran-sf r-cran-raster r-cran-rjava

To test your installation of R has worked, try running R in an IDE such as RStudio or in the terminal by entering R.
You should be able to run the following commands without problem:

library(sf) #> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0 install.packages("tmap")

If you are using an older version of Ubuntu and don’t want to upgrade to 19.10, which will upgrade to (20.04) by the end of April 2020, see instructions at github.com/r-spatial/sf and detailed instructions on the blog rtask.thinkr.fr, which contains this additional shell command:

# for Ubuntu 18.04 sudo add-apt-repository ppa:marutter/c2d4u3.5

That adds a repository that ships hundreds of binary versions of R packages, meaning faster install times for packages (see the Binary package section of the open source book R Packages for more on binary packages).
An updated repository, called c2d4u4.0 or similar, will be available for Ubuntu 20.04 in late April.


If you have issues with the instructions in this post here, you can find a wealth of answers on site such as StackOverflow, the sf issue tracker, r-sig-geo and Debian special interest group (SIG) email lists (the latter of which provided input into this blog post, thanks to Dirk Eddelbuettel and Michael Rutter).

2. Updating R packages and upstream dependencies

Linux operating systems allow you to customize your set-up in myriad ways.
This can be enlightening but it can also be wasteful.
It’s worth considering the stability/cutting-edge continuum before diving into a particular set-up and potentially wasting time (if the previous section hasn’t already made-up your mind).


A reliable way to keep close (but not too close) to the cutting edge on the R side is simply to keep your packages up-to-date.
Running the following command (or using the Tools menu in RStudio) every week or so will ensure you have up-to-date package versions:

update.packages()

The following commands will update system dependencies including the ‘OSGeo stack’ composed of PROJ, GEOS and GDAL:

sudo apt-get update # see if things have changed sudo apt upgrade # install changes

If you want to update Ubuntu to the latest version, you can with the following command (also see instructions here):

apt-get dist-upgrade

To get more up-to-date upstream geographic libraries than provided in the default Ubuntu repositories, you can add the ubuntugis repository as follows.
This is a pre-requisite on Ubuntu 18.04 and earlier but also works with later versions (warning, adding this repository could cause complications if you already have software such as QGIS that uses a particular version of GDAL installed):

sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable sudo apt update sudo apt upgrade

That will give you more up-to-date versions of GDAL, GEOS and PROJ which may offer some performance improvements.
Note: if you do update dependencies such as GDAL you will need to re-install the relevant packages, e.g. with install.packages("sf").
You can revert that change with the following little-known command:

sudo add-apt-repository --remove ppa:ubuntugis/ubuntugis-unstable

If you also want the development versions of key R packages, e.g. to test new features and support development efforts, you can install them from GitHub, e.g. as follows:

remotes::install_github("r-spatial/sf") remotes::install_github("rspatial/raster") remotes::install_github("mtennekes/tmaptools") # required for dev version of tmap remotes::install_github("mtennekes/tmap") 3. Installing geographic R packages on other Linux operating systems

If you are in the fortunate position of switching to Linux and being able to choose the distribution that best fits your needs, it’s worth thinking about which distribution will be both user-friendly (more on that soon), performant and future-proof.
Ubuntu is a solid choice, with a large user community and repositories such as ‘ubuntugis’ providing more up-to-date versions of upstream geographic libraries such as GDAL.

QGIS is also well-supported on Ubuntu.

However, you can install R and key geographic packages on other operating systems, although it may take longer.
Useful links on installing R and geographic libraries are provided below for reference:

  • Installing R on Debian is covered on the CRAN website. Upstream dependencies such as GDAL can be installed on recent versions of Debian, such as buster, with commands such as apt-get install libgdal-dev as per instructions on the rocker/geospatial.

  • Installing R on Fedora/Red Hat is straightforward, as outlined on CRAN. GDAL and other spatial libraries can be installed from Fedora’s dnf package manager, e.g. as documented here for sf.

  • Arch Linux has a growing R community. Information on installing and setting-up R can be found on the ArchLinux wiki. Installing upstream dependencies such as GDAL on Arch is also relatively straightforward. There is also a detailed guide for installing R plus geographic packages by Patrick Schratz.

4. Geographic R packages on Docker







The Ubuntu installation instructions outlined above provide such an easy and future-proof set-up.
But if you want an even easier way to get the power of key geographic packages running on Linux, and have plenty of RAM and HD space, running R on the ‘Docker Engine’ may be an attractive option.

Advantages of using Docker include reproducibility (code will always run the same on any given image, and images can be saved), portability (Docker can run on Linux, Windows and Mac) and scalability (Docker provides a platform for scaling-up computations across multiple nodes).

For an introduction to using R/RStudio in Docker, see the Rocker project.

Using that approach, I recommend the following Docker images for using R as a basis for geographic research:

  • rocker/geospatial which contains key geographic packages, including those listed above
  • robinlovelace/geocompr which contains all the packages needed to reproduce the contents of the book, and which you can run with the following command in a shell in which Docker is installed:
docker run -e PASSWORD=yourpassword --rm -p 8787:8787 robinlovelace/geocompr

To test-out the Ubuntu 19.10 set-up recommended above I created a Dockerfile and associated image on Dockerhub that you can test-out as follows:

docker run -it robinlovelace/geocompr:ubuntu-eoan R library(sf) #> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0 library(raster) library(tmap)

The previous commands should take you to a terminal inside the docker container where you try out the Linux command line and R.
If you want to use more cutting-edge versions of the geographic libraries, you can use the ubuntu-bionic image (note the more recent version numbers, with PROJ 7.0.0 for example):

sudo docker run -it robinlovelace/geocompr:ubuntu-bionic R library(sf) #> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0

These images do not currently contain all the dependencies needed to reproduce the code in Geocomputation with R.

However, as documented in issue 476 in the geocompr GitHub repo, there is a plan to provide Docker images with this full ‘R-spatial’ stack installed, building on strong foundations such as rocker/geospatial and the ubuntugis repositories, to support different versions of GDAL and other dependencies.
We welcome any comments or tech support to help make this happen.
Suggested changes to this post are also welcome, see the source code here.

5. Fin

R is an open-source language heavily inspired by Unix/Linux so it should come as no surprise that it runs well on a variety of Linux distributions, Ubuntu (covered in this post) in particular.
The guidance in this post should get geographic R packages set-up quickly in a future-proof way.
A sensible next step is to sharpen you system administration (sysadmin) and shell coding skills, e.g. with reference to Ubuntu wiki pages and Chapter 2 of the open source book Data Science at the Command Line.

This will take time but, building on OSGeo libraries, a well set-up Linux machine is an ideal platform to install, run and develop key geographic R packages in a performant, stable and future-proof way.

Be the FOSS4G change you want to see in the world!

var vglnk = { key: '949efb41171ac6ec1bf7f206d57e90b8' }; (function(d, t) { var s = d.createElement(t); s.type = 'text/javascript'; s.async = true; s.src = '//cdn.viglink.com/api/vglnk.js'; var r = d.getElementsByTagName(t)[0]; r.parentNode.insertBefore(s, r); }(document, 'script'));

To leave a comment for the author, please follow the link and comment on their blog: the Geocomputation with R website. R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job. Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Pages