Subscribe to R-spatial feed
R Spatial software blogs and ideas 2018-10-10T16:36:39+00:00 Jekyll v3.8.4
Updated: 1 hour 28 min ago

sf 0.6-0 news

Mon, 01/08/2018 - 01:00

[view raw Rmd]

Version 0.6-0 of the sf package (an R package for handling vector geometries in R) has been released to CRAN. It contains several innovations, summarized in the NEWS file. This blog post will illustrate some of these further.

Ring directions

Consider the following two polygons:

library(sf) ## Loading required package: methods ## Linking to GEOS 3.5.1, GDAL 2.1.2, proj.4 4.9.3 p1 = rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)) p2 = 0.5 * p1 + 0.25 (pol1 = st_polygon(list(p1, p2))) ## POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0), (0.25 0.25, 0.75 0.25, 0.75 0.75, 0.25 0.75, 0.25 0.25)) (pol2 = st_polygon(list(p1, p2[5:1,]))) ## POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0), (0.25 0.25, 0.25 0.75, 0.75 0.75, 0.75 0.25, 0.25 0.25)) opar = par(mfrow = c(1, 2), mar = rep(0, 4)) plot(pol1, col = grey(.8), rule = "winding") plot(pol2, col = grey(.8), rule = "winding")

Although the simple feature standard describes that all secondary rings indicate holes, it also specifies that outer rings should be counter clockwise and inner rings (holes) clockwise. It doesn’t specify that polygons for which the hole has the same ring direction as the outer ring are invalid - and they aren’t. But how should software deal with them? In prior sf versions, plot (and ggplot) would take the winding rule, requiring holes to have the opposite direction as outer rings. This has been changed into evenodd as default, which plots both cases with holes:

library(sf) p1 = rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)) p2 = 0.5 * p1 + 0.25 (pol1 = st_polygon(list(p1, p2))) ## POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0), (0.25 0.25, 0.75 0.25, 0.75 0.75, 0.25 0.75, 0.25 0.25)) (pol2 = st_polygon(list(p1, p2[5:1,]))) ## POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0), (0.25 0.25, 0.25 0.75, 0.75 0.75, 0.75 0.25, 0.25 0.25)) opar = par(mfrow = c(1, 2), mar = rep(0, 4)) plot(pol1, col = grey(.8)) # rule = "evenodd" plot(pol2, col = grey(.8)) # rule = "evenodd"

In addition, st_sfc and st_read gained a parameter check_ring_dir, by default FALSE, which when TRUE will check every ring and revert them to counter clockwise for outer, and clockwise for inner (hole) rings. By default this is FALSE because it is an expensive operation for large datasets.

Higher-order geometry differences

This was reported here; two nice graphs are available from ?st_difference

set.seed(131) m = rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)) p = st_polygon(list(m)) n = 100 l = vector("list", n) for (i in 1:n) l[[i]] = p + 10 * runif(2) s = st_sfc(l) d = st_difference(s) # sequential differences: s1, s2-s1, s3-s2-s1, ... i = st_intersection(s) # all intersections plot(s, col = sf.colors(categorical = TRUE, alpha = .5)) title("overlapping squares")

par(mfrow = c(1, 2), mar = c(0,0,2.4,0)) plot(d, col = sf.colors(categorical = TRUE, alpha = .5)) title("non-overlapping differences") plot(i, col = sf.colors(categorical = TRUE, alpha = .5)) title("non-overlapping intersections")

Spherical geometry

All geometric operations (area, length, intersects, intersection, union etc) provided by the GEOS library assume two-dimensional coordinates. If your data have geographic (longitude-latitude) coordinates, this is may be quite OK when your area is small and close to the equator, otherwise it is not. One way out is to project the data using a suitable projection, the other is to use spherical geometry: algorithms that compute on the sphere (or, more precisely, on the spheroid. This has a number of advantages:

  • it is easy, you have no-worries about which projection to choose
  • it is always correct

however, it comes at some computational cost.

Spherical geometry functions were formerly taken from R package geosphere; with the new sf they use package lwgeom, which interfaces liblwgeom, the library that is also used by PostGIS. (and the development of which was funded by palantir).

Liblwgeom functions st_make_valid, st_geohash, st_split, which were formerly in sf have now been moved to lwgeom. Other functions in lwgeom enable the following functions to work with geographic coordinates: st_length, st_area st_distance, st_is_within_distance, st_segmentize.

Where geosphere could only compute distances between points, st_distance now computes distances between arbitrary simple feature geometries. st_distance is clearly slower when computed on a spheroid than when computed on the sphere. For point data, faster results are obtained when we assume the Earth is a sphere:

n = 2000 df = data.frame(x = runif(n), y = runif(n)) pts = st_as_sf(df, coords = c("x", "y")) system.time(x0 <- st_distance(pts)) ## user system elapsed ## 3.456 0.008 3.479 st_crs(pts) = 4326 # spheroid system.time(x1 <- st_distance(pts)) ## user system elapsed ## 5.564 0.016 5.594 st_crs(pts) = "+proj=longlat +ellps=sphere" # sphere ## Warning: st_crs<- : replacing crs does not reproject data; use st_transform ## for that system.time(x2 <- st_distance(pts)) ## user system elapsed ## 1.220 0.024 1.246 system.time(x3 <- dist(as.matrix(df))) ## user system elapsed ## 0.012 0.000 0.010 Hausdorff and Frechet distance

For two-dimensional (flat) geometries, st_distance now has the option of computing Hausdorff distances, and (if sf was linked to GEOS 3.7.0) Frechet distances.


For two-dimensional (flat) geometries, st_snap is now available; we refer to the PostGIS documentation for examples what it does.

join to largest matching feature

This feature was reported and illustrated here:

sf::st_join with “largest=TRUE” now joins to the single largest intersecting feature: #rspatial #rstats

— Edzer Pebesma (@edzerpebesma) December 3, 2017

polygon geometries have zero length

Function st_length now returns zero for non-linear geometries including polygons. For length of polygon rings, st_cast to MULTILINESTRING first.

printing coordinates now honors digits setting

Printing of geometries, as well as st_as_text now use the default digits of R:

st_point(c(1/3, 1/6)) ## POINT (0.3333333 0.1666667) options(digits = 3) st_point(c(1/3, 1/6)) ## POINT (0.333 0.167) st_as_text(st_point(c(1/3, 1/6)), digits = 16) ## [1] "POINT (0.3333333333333333 0.1666666666666667)"

Before sf 0.6, as.character was used, which used around 16 digits.

sf 0.6-0 news

Mon, 01/08/2018 - 01:00

Spatiotemporal arrays for R - blog one

Thu, 11/23/2017 - 01:00

Tidy storm trajectories

Mon, 08/28/2017 - 02:00

Setting up large scale OSM environments for R using Osmosis and PostgreSQL with PostGIS

Fri, 07/14/2017 - 02:00

Importing OpenStreetMap (OSM) data into R can sometimes be rather difficult, especially when it comes to processing large datasets. There are some packages that aim at easy integration of OSM data, including the very versatile osmar package, that allows you to scrape data directly in R via the OSM API. Packages like osmar,rgdal or sf also offer build-in functions to read the spatial data formats that OSM data comes along with.

However, these packages reach their limits when it comes to larger datasets and running the programmes on weak machines. I want to introduce an easy way to set up an environment to process large OSM datasets in R, using the Java application Osmosis and the open-source database PostgreSQL with the PostGIS extension for spatial data.

This tutorial was created using a Asus Zenbook UX32LA with a i5-4200U CPU, 8 GB RAM and a 250 GB SSD, running on Windows

  1. The data used has a size of 1.9 GB (unzipped). Under this setting, OSM data import using osmar, rgdal and sf takes up several hours, if not days, especially if you want to continue using your system. The following steps thus show a way to set up larger spatial data environments using the PostgreSQL database scheme and how to easily import and set up this data in R.
Getting OSM data

The place to extract large OSM datasets is the file dump Planet.osm, which can be found here:

Here, we can download all available OSM data or search for extracts from our area of interest. I am interested in downloading the most recent OSM data for Greater London, which for instance is provided by Geofabrik. This archive offers OSM data for predefined layers like countries, states or urban areas. The London data that I will be using in this tutorial can be found here:

I download the file greater-london-latest.osm.pbf, conatining the complete dataset for the Greater London area. Note that this file is updated regularly.

Setting up PostgreSQL and PostGIS

We now need to download and install PostgreSQL with the PostGIS extension. A detailed explanation on how to install PostgreSQL can be found here:

Make sure you note username, password and the port you use for the installation. After PostgreSQL is installed on the system, PostGIS can be added as described here:

Now, open pgAdmin to set up your database. We can create new databases by clicking on Object – Create – Database, inserting a name of your choice, e.g. London OSM.

My new database London OSM is now in place and can be prepared for data import. We have to create two extensions to our database, using a SQL script. We navigate into the new database and open the script command line by clicking on Object – CREATE Script and execute two commands:


These extensions should now show up when openng the Extensions path in our London OSM database.

Setting up Osmosis and importing the data

The tool connecting our dataset with PostgreSQL is called Osmosis. It is a command line Java application and can be used to read, write and manipulate OSM data. The latest stable version including detailed installation information for different OS can be found here: (Note that Osmosis requires the Java Runtime Environment, which can be downloaded at

If you are using Windows, you can navigate into the Osmosis installation folder, e.g. C:\Program Files (x86)\osmosis\bin\, and open osmosis.bat. Double clicking this file opens the Osmosis command line. To keep the Osmosis window open, create a shortcut to the osmosis.bat file, open its properties and add cmd /k at the beginning of the target in the shortcut tab. The Osmosis output should look like this:

We now have to prepare our PostgreSQL database for the OSM data import (courtesy of Stackexchange user zehpunktbarron). Navigate back into pgAdmin and the OSM London database and create a new script via Object – CREATE Script. Now, execute the SQL code that you find in two of the files that you created when installing Osmosis. First execute the code from [PATH_TO_OSMOSIS]\script\pgsnapshot_schema_0.6.sql and afterwards the code from [PATH_TO_OSMOSIS]\script\pgsnapshot_schema_0.6_linestring.sql.

Now, add indices to the database to better process the data. Execute the following SQL commands in the script:

  • CREATE INDEX idx_nodes_tags ON nodes USING GIN(tags);
  • CREATE INDEX idx_ways_tags ON ways USING GIN(tags);
  • CREATE INDEX idx_relations_tags ON relations USING GIN(tags);

We have now successfully prepared our database for the OSM import. Open Osmosis and run the following command to import the previously downloaded .pbf file:

"[PATH_TO_OSMOSIS]\bin\osmosis" --read-pbf file="[PATH_TO_OSM_FILE]\greater-london-latest.osm.pbf" --write-pgsql host="localhost" database="London OSM" user="YOUR_USERNAME" password="YOUR_PASSWORD"

Note that if the .pbf file is larger, this process might take a while – also depending on the specs of your system. If the data import was successful, this should give you an output that looks like this:

Accessing PostgreSQL databases in R

Our freshly imported database is now ready to be accessed via R. Connecting to the PostgreSQL database requires the R package RPostgreSQL. First, we load the PostgreSQL driver and connect to the database using our credentials:

require(RPostgreSQL) # LOAD POSTGRESQL DRIVER driver <- dbDriver("PostgreSQL") # CREATE CONNECTION TO THE POSTGRESQL DATABASE # THE CONNECTION VARIABLE WILL BE USED FOR ALL FURTHER OPERATIONS connection <- dbConnect(driver, dbname = "London OSM", host = "localhost", port = 5432, user = "YOUR_USERNAME", password = "YOUR_PASSWORD")

We can now check, whether we have successfully established a connection to our database using a simple command:

dbExistsTable(connection, "lines") ## [1] TRUE

We have now set up the environment to load OSM data into R flawlessly. Note that queries using RPostgreSQL are written in the SQL syntax. Further information on the use of the RPostgreSQL package can be found here:

Creating spatial data frames in R

In the last step of this tutorial we will explore how to put the accessed data to work and how to properly establish the geographical reference. We first load data into the R environment, using a RPostgreSQL query. The following query creates a data.frame with all available OSM point data. We use the PostGIS command ST_AsText on the wkb_geometry column to return the Well Known Text (WKT) geometries and save it in the newly created column geom. After that, we delete the now redundant wkb_geometry column.

#LOAD POINT DATA FROM OSM DATABASE points <- dbGetQuery(connection, "SELECT * , ST_AsText(wkb_geometry) AS geom from points") points$wkb_geometry <- NULL

The points data frame contains all available OSM point data, including the several different tagging schemes, which can be further explored looking at OSMs’ map features:

head(points) ## ogc_fid osm_id name barrier highway ## 1 1 1 Prime Meridian of the World <NA> <NA> ## 2 2 99941 <NA> lift_gate <NA> ## 3 3 101831 <NA> <NA> crossing ## 4 4 101833 <NA> <NA> crossing ## 5 5 101839 <NA> <NA> traffic_signals ## 6 6 101843 <NA> <NA> traffic_signals ## ref address is_in place man_made ## 1 <NA> <NA> <NA> <NA> <NA> ## 2 <NA> <NA> <NA> <NA> <NA> ## 3 <NA> <NA> <NA> <NA> <NA> ## 4 <NA> <NA> <NA> <NA> <NA> ## 5 <NA> <NA> <NA> <NA> <NA> ## 6 <NA> <NA> <NA> <NA> <NA> ## other_tags ## 1 "historic"=>"memorial","memorial"=>"stone" ## 2 <NA> ## 3 "crossing"=>"traffic_signals","crossing_ref"=>"pelican" ## 4 "crossing"=>"island" ## 5 <NA> ## 6 <NA> ## geom ## 1 POINT(-0.0014863 51.4779481) ## 2 POINT(-0.1553793 51.5231639) ## 3 POINT(-0.1470438 51.5356116) ## 4 POINT(-0.1588224 51.5350894) ## 5 POINT(-0.1526586 51.5375096) ## 6 POINT(-0.163653 51.534922)

Now, to get the geometry working, we can transform the data frame into a spatial data frame using the sf package. Note that I have to set a coordinate reference system (CRS), in this case the WGS84 projection:

#SET COORDINATE SYSTEM require(sf) points <- st_as_sf(points, wkt="geom") %>% `st_crs<-`(4326)

We can now scrape our dataset for the data we are looking for, e.g. all bicycle parking spots (see Since sf data is stored in spatial data frames, we can easily create a subset containing our desired points - e.g. using the filter function from the dplyr package and str_detect from stringr:

require(dplyr) require(stringr) #EXTRACTING ALL POINTS TAGGED 'BUS_STOP' bikepark <- points %>% filter(str_detect(other_tags, "bicycle_parking"))

Additionally to all bike parking spots, we also want to include all explicitly marked bicycle routes, as found in the lines data. These can be extracted from OSM relation data via the cycleway tag:

We can contrast cycleways from the regular road network by also selecting the most common road types (see Note that after our final PostgreSQL query, we close the connection using the dbDisconnect command in order to not overload the driver.

#LOAD RELATION DATA lines <- dbGetQuery(connection, "SELECT * , ST_AsText(wkb_geometry) AS geom from lines") lines$wkb_geometry <- NULL dbDisconnect(connection) ## [1] TRUE #SET CRS lines <- st_as_sf(lines, wkt="geom") %>% `st_crs<-`(4326) #SUBSET CYCLEWAYS cycleways <- lines %>% filter(highway=="cycleway") #SUBSET OTHER STREETS streets <- lines %>% filter(highway=="motorway" | highway=="trunk" | highway=="primary" | highway=="secondary" | highway=="tertiary")

Having created subsets for bicycle parking spots, cycleways and regular roads. We finally plot our data using ggplot2 and the geom_sf function:

require(ggplot2) #PLOTTING ALL BUS STOPS ggplot(bikepark) + geom_sf(data=streets,aes(colour="lightgrey")) + #Requires development version of ggplot2: devtools::install_github("tidyverse/ggplot2") geom_sf(data=cycleways,aes(colour="turquoise")) + geom_sf(data=bikepark,aes(colour="turquoise4"),shape=".") + coord_sf(crs = st_crs(bikepark)) + ggtitle("Biking infrastructure (parking + cycleways) in London") + scale_colour_manual("",values = c("lightgrey","turquoise","turquoise4"),labels=c("Other Roads","Cycleways","Bike Parking")) + theme_void() + theme(legend.position="bottom")

UseR! 2017 Spatial Tutorials

Thu, 06/29/2017 - 02:00

This year’s UseR! (next week Tuesday!!) will have two spatial tutorials:

1. Geospatial visualization using R (morning)

Bashkar Karambelkar posted earlier that attendees should download his docker image by Jul 1, by

docker pull bhaskarvk/rgeodataviz 2. Spatial data in R: new directions (afternoon)

My tutorial deals mostly with sf, the material (html, R markdown) is found at here:

I’m looking forward to it!

Related posts:

UseR! 2016 tutorial

Spatial indexes coming to sf

Thu, 06/22/2017 - 02:00

Spatial indexes give you fast results on spatial queries, such as finding whether pairs of geometries intersect or touch, or finding their intersection. They reduce the time to get results from quadratic in the number of geometries to linear in the number of geometries. A recent commit brings spatial indexes to sf for the binary logical predicates (intersects, touches, crosses, within, contains, contains_properly, overlaps, covers, covered_by), as well as the binary predicates that yield geometries (intersection, union, difference, sym_difference).

The spatial join function st_join using a logical predicate to join features, and aggregate or summarise using union to union aggregated feature geometries, are also affected by this speedup.


There have been attempts to use spatial planar indices, including enhancement issue sfr:76. In rgeos, GEOS STRtrees were used in rgeos/src/rgeos_poly2nb.c, which is mirrored in a modern Rcpp setting sf/src/geos.cpp, around lines 276 and 551. The STRtree is constructed by building envelopes (bounding boxes) of input entities, which are then queried for intersection with envelopes of another set of entities (in rgeos, R functions gUnarySTRtreeQuery and gBinarySTRtreeQuery). The use case was to find neighbours of all the about 90,000 US Census entities in Los Angeles, via spdep::poly2nb(), which received an argument to enter the candidate neighbours found by Unary querying the STRtree of entities by the same entities.


A simple benchmark shows the obvious: st_intersects without spatial index behaves quadratic in the number of geometries (black line), and is much faster for the case where a spatial index is created, stronger so for larger number of polygons:

The polygon datasets used are simple checker boards with square polygons (showing a nice Moiré pattern):

The black small square polygons are essentially matched to the red ones; the number of polygons along the x axis is the number of a single geometry set (black).

To show that the behaviour of intersects and intersection is indeed linear in the number of polygons, we show runtimes for both, as a function of the number of polygons (where intersection was divided by 10 for scaling purposes):


Spatial indexes are available in the GEOS library used by sf, through the functions starting with STRtree. The algorithm implements a Sort-Tile-Recursive R-tree, according to the JTS documentation described in P. Rigaux, Michel Scholl and Agnes Voisard. Spatial Databases With Application To GIS. Morgan Kaufmann, San Francisco, 2002.

The sf implementation (some commits to follow this one) excludes some binary operations. st_distance, st_relate, and st_relate_pattern, as these all need to go through all combinations, rather than a subset found by checking for overlapping bounding boxes. st_equals_exact and st_equals are excluded because they do not have an implementation for prepared geometries. st_disjoint could benefit from the search tree, but needs a dedicated own implementation.

On which argument is an index built?

The R-tree is built on the first argument (x), and used to match all geometries over the second argument (y) of binary functions. This could give runtime differences, but for instance for the dataset that triggered this development in sfr:394, we see hardly any difference:

library(sf) # Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.2 r15302 load("test_intersection.Rdata") nrow(test) # [1] 16398 nrow(hsg2) # [1] 6869 system.time(int1 <- st_intersection(test, hsg2)) # user system elapsed # 105.712 0.040 105.758 system.time(int2 <- st_intersection(hsg2, test)) # user system elapsed # 107.756 0.060 107.822 # Warning messages: # 1: attribute variables are assumed to be spatially constant throughout all geometries # 2: attribute variables are assumed to be spatially constant throughout all geometries

The resulting feature sets int1 and int2 are identical, only the order of the features (records) and of the attribute columns (variables) differs. (Runtime without index is 35 minutes, 20 times as long.)

Is the spatial index always built?

In the current implemenation it is always built of logical predicates when argument prepare = TRUE, which means by default. This made it easier to run benchmarks, and I strongly doubt anyone ever sets prepare = FALSE. This may change, to have them always built.

It would be nice to also have them on st_relate and st_relate_pattern, e.g. for rook or queen neighborhood selections (sfr:234), but this still requires some work, since two non-intersecting geometries have a predictable but not a constant relationship.

What about prepared geometries?

Prepared geometries in GEOS are essentially indexes over single geometries and not over sets of geometries; they speed things up in particular when single geometries are very complex, and only for a single geometry to single geometry comparison. The spatial indexes are indexes over collections of geometries; they make a cheap preselection based on bounding boxes before the expensive pairwise comparison takes place.

Script used

The followinig script was used to create the benchmark plots.

library(sf) sizes = c(10, 20, 50, 100, 160, 200) res = matrix(NA, length(sizes), 4) for (i in seq_along(sizes)) { g1 = st_make_grid(st_polygon(list(rbind(c(0,0),c(0,1),c(1,1),c(0,1),c(0,0)))), n = sizes[i]) * sizes[i] g2 = g1 + c(.5,.5) res[i, 1] = system.time(i1 <- st_intersects(g1, g2))[1] res[i, 2] = system.time(i2 <- st_intersects(g1, g2, prepare = FALSE))[1] res[i, 3] = system.time(i1 <- st_intersection(g1, g2))[1] res[i, 4] = identical(i1, i2) } plot(sizes^2, res[,2], type = 'b', ylab = 'time [s]', xlab = '# of polygons') lines(sizes^2, res[,1], type = 'b', col = 'red') legend("topleft", lty = c(1,1), col = c(1,2), legend = c("st_intersects without index", "st_intersects with spatial index")) plot(sizes^2, res[,3]/10, type = 'b', ylab = 'time [s]', xlab = '# of polygons') lines(sizes^2, res[,1], type = 'b', col = 'red') legend("topleft", lty = c(1,1), col = c(1,2), legend = c("st_intersection * 0.1", "st_intersects"))

mapedit - updates in 0.2.0

Fri, 06/09/2017 - 02:00

[view raw Rmd]

mapedit has progressed substantially since the introduction to mapedit post. mapedit 0.2.0 offers improvements and incorporates changes based on much appreciated feedback from the R geospatial community. mapedit is still in rapid development, but the API is stabilizing. We are targeting a CRAN release prior to useR 2017, and Tim Appelhans will demonstrate mapedit in his useR talk.

In this post, we will highlight some of the recent improvements and changes to mapedit. These updates can be categorized as

  1. better integration with [simple features(] and

  2. addition of Shiny modules.


We are moving quickly, so please install the development versions of mapview, leaflet.extras, and mapview as shown below.

devtools::install_github("r-spatial/mapview@develop") devtools::install_github("bhaskarvk/leaflet.extras") devtools::install_github("r-spatial/mapedit") Simple Features

The R geo community is radidly embracing the RConsortium-sponsored sf package, and mapedit plans to fully adopt and incorporate simple features like leaflet, mapview, geojsonio, plotly, and ggplot2. sf can greatly improve geospatial workflows in R. mapedit now returns simple features by default with editMap() and includes a new function selectFeatures() for interactive selection of simple features. Let’s take a quick look at this new functionality.

editMap returns sf

editMap() looks the same, but the output is very different.

library(mapedit) library(mapview) library(sf) crud <- editMap(mapview())

Now, since the return value is simple features and mapview added addFeatures(), we can see the drawn features with a one-liner. This collaboration greatly increases the efficiency of the editing workflow.


selectFeatures makes selecting features easy

Let’s use the sf example with North Carolina county data to give us some simple features to select with the new selectFeatures().

library(mapview) library(mapedit) library(sf) nc <- st_read(system.file("shape/nc.shp", package="sf")) selected <- selectFeatures(nc)

As before, we can now take advantage of mapview to plot our selection.


We also changed the underlying selectMap() function to use the RStudio Viewer by default. This allows us to include selectMap() in a workflow or pipeline.

library(mapview) library(mapedit) library(sf) nc <- st_read(system.file("shape/nc.shp", package="sf")) selectFeatures(nc) %>% st_union() %>% mapview()

Stay tuned for an editFeatures() equivalent.

Shiny Modules

The original editMap() and selectMap() provided useful functionality. However, they are limited to standalone application. For even greater integration in an interactive geospatial workflow, Shiny modules allow a user to incorporate edit and select in a broader application context. Let’s see a couple examples of this concept.

Select as Shiny Module

In this example, we will demonstrate analysis of the quakes data in R along with some helpful sf. The app will build a grid for selection of quakes and then plot the selection with a comparative density plot.

First we will convert the quakes to simple features.

library(sf) # make the coordinates a numeric matrix qk_mx <- data.matrix(quakes[,2:1]) # convert the coordinates to a multipoint feature qk_mp <- st_multipoint(qk_mx) # convert the multipoint feature to sf qk_sf <- st_sf(st_cast(st_sfc(qk_mp), "POINT"), quakes, crs=4326)

Now let’s use the very helpful sf::st_make_grid() function, and then filter the grid to only those that contain quakes points.

# make a grid grd <- st_set_crs(st_make_grid(qk_sf), 4326) # only keep grid polygons that contain at least one quake point grd <- grd[which(sapply(st_contains(st_sf(grd), qk_sf),length)>0)]

With our grid, we can build a Shiny app for some interactive analysis of quake magnitude.

library(mapview) library(mapedit) library(shiny) ui <- fluidPage( fluidRow( column( 6, h3("Select Grid"), # our new select module ui selectModUI("selectmap") ), column( 6, h3("Selected Quakes"), plotOutput("selectplot") ) ), fluidRow( h3("Magnitude Distribution of Selected Quakes"), plotOutput("quakestat", height=200) ) ) server <- function(input, output, session) { # our new select module g_sel <- callModule( selectMod, "selectmap", leaflet() %>% addTiles() %>% addFeatures(st_sf(grd), layerId = ~seq_len(length(grd))) ) rv <- reactiveValues(intersect=NULL, selectgrid=NULL) observe({ # the select module returns a reactive # so let's use it to find the intersection # of selected grid with quakes points gs <- g_sel() rv$selectgrid <- st_sf( grd[as.numeric(gs[which(gs$selected==TRUE),"id"])] ) if(length(rv$selectgrid) > 0) { rv$intersect <- st_intersection(rv$selectgrid, qk_sf) } else { rv$intersect <- NULL } }) output$selectplot <- renderPlot({ plot(qk_mp, col="gray") if(!is.null(rv$intersect)) { plot(rv$intersect, pch=19, col="black", add=TRUE) } plot(st_union(rv$selectgrid), add=TRUE) }) output$quakestat <- renderPlot({ plot( stats::density(qk_sf$mag), col="gray30", ylim=c(0,1.2), main = NA ) if(!is.null(rv$intersect) && nrow(rv$intersect) > 0) { lines(stats::density(rv$intersect$mag), col="red", lwd=2) } }) } shinyApp(ui, server)

Edit as Shiny Module

Since we have the quake data, we will use it to show the edit module in a simple application. Instead of the grid, let’s draw polygons to select quakes.

# run select demo for the quake data # we will need the qk_sf # to test # plot(qk_sf) library(mapedit) library(mapview) library(shiny) ui <- fluidPage( fluidRow( # edit module ui column(6, editModUI("editor")), column( 6, h3("Boxplot of Depth"), plotOutput("selectstat") ) ) ) server <- function(input, output, session) { # edit module returns sf edits <- callModule(editMod, "editor", mapview(qk_sf)@map) output$selectstat <- renderPlot({ req(edits()$finished) qk_intersect <- st_intersection(edits()$finished, qk_sf) req(nrow(qk_intersect) > 0) boxplot( list( all = as.numeric(qk_sf$depth), selected = as.numeric(qk_intersect$depth) ), xlab = "depth" ) }) } shinyApp(ui, server)

Next Steps

The progress made thus far depended entirely on feedback received. Please help us by providing feedback, ideas, and use cases. As mentioned earlier, we aim for an initial CRAN release before useR 2017 on July 4, 2017. We do not anticipate any breaking API changes before release. Rather, we plan to spend time on documentation, examples, and tests.


mapedit and many of its dependency packages are funded by the RConsortium. Thanks so much to all those who have contributed to this fantastic organization. Also, thanks to all those open source contributors in the R community.

First openEO hackaton report

Wed, 04/19/2017 - 02:00

As announced earlier,

First #OpenEO hackaton will be held @ ifgi on Mar 23-24. Let me know if you’re interested to participate!

— Edzer Pebesma ((???)) January 20, 2017

the first openEO hackaton was held a few weeks ago. A short report follows.


OpenEO is an initiative to define an open API for accessing cloud-based Earth observation data processing engines, explained in this blog post.

Five participants took part in the hackaton: Christoph Paulik from the TU Wien, and the following people from from ifgi: Meng Lu, Daniel Nuest, Marius Appel and me.

The EODC had been so kind to set us up with access to a virtual machine in their data center, and this illustrates the problem with remote sensing data is that there are many and they are big. Part of the archive we looked at was found here:

xxxxxxxx@openeo1:~$ df -h /eodc/products/ Filesystem Size Used Avail Use% Mounted on XX.XXX.XX.XXX:/products 1.4P 1.3P 159T 89% /eodc/products

How big is 1.3 petabyte? I often try to relate to the latest fixed-size digital medium we used, the CD-R:

How many? Mmm.

1.3 * 1024^5 / (700 * 1024^2) ## [1] 1994092

2 Million: a stack of 2.4 km, or 10 km when we put them in a case.


We first talked quite a while about what RESTful APIs are, how they work, their verbs (GET, PUT, POST, etc), resources, what microservices are, and how they can be standardised e.g. following the Open API initiative. Then, we tried to define a simple challenge that uses REST, and uses Earth observation data.


In the challenge, We didn’t want to work with the large amounts of data straight away. Instead, we tried to a challenge as simple as possible, namely adding two bands from a particular Sentinel-2 scene, and returning the result as a GeoTIFF.

We worked in three teams:

  • team 1 worked on a JavaScript solution, with the imagery in a SciDB backend
  • team 2 worked on a Python solution,
  • team 3 worked on a pure R solution.

After around 3-4 hours, all teams had managed to get a RESTfull service doing this:

  • team 1 had most man power, but also most work to do; the solution was in the end the fastest, because the SciDB backend is highly scalable, using 24 cores or so

  • team 2, Christoph’s Python solution uses flask for web requests, gdal from osgeo to read data, and numpy to add to images–the resulting image was not georeferenced.

  • team 3 (Edzer) used the plumber R package to set up web services, and rgdal and sp to read and write raster maps. Solution found here.

Road ahead

We discussed quite at length what it will take to realize the OpenEO ambition. Adding two bands is easy, handling large collections of scenes is not. My respect for what the people from Google Earth Engine have realized has grown, their award from ASPRS is more than deserved. Nevertheless, it is isolated, closed source, practically impossible to verify.

We also drafted service calls, discussed coordinate reference systems of tile/scene collections, and looked at covjson. And one of the most attractive ideas (for some of us): to submit your Python or R function to the imagery server, and have it work, in parallel, over the scenes, returning the requested summary.

Certain terms (scene, tile, pixel) reoccurred many times, seemingly stretching across the different used back-ends but there were slight differences - a huge challenge for an API intending to be useful to many. Discussing the API design, we struggled with the user perspective: is the user the analyst, or the developer? Who do we accommodate with the used terms?

Reproducible assignments with R-markdown

Thu, 04/13/2017 - 02:00

I tweeted earlier Today about

Evaluating 22 assignments of my MSc course that students handed in as html, and Rmd + raw data. I can reproduce each of them!

— Edzer Pebesma (@edzerpebesma) April 13, 2017

for my course Analysis of Spatio-Temporal Data in our MSc program Geoinformatics. Barry kindly answered:

@edzerpebesma It would be nice to see a write-up of how you get students to this level of competence.

— Barry Rowlingson (@geospacedman) April 13, 2017

Well, here you go!

The course

The course has a a goal to get familiar with what spatio-temporal data is, and how you can analyse it, with emphasis on using R. Analysis methods focus on spatial statistical methods (point patterns, geostatistics, lattice data) and analysis of movement data. The course consists of lectures and exercise meetings; for exercise meetings students were supposed to hand in short assignments. The larger final assignment is individual, and is like a small, reproducible research paper and forms the basis for the final grade.

Instead of (or: in addition to) working with pre-cooked examples that use data from a particular package, loaded by data(dataset), I encouraged students from the beginning to look for datasets themselves, and import them in R. This way they learn, and learn from one another

  • how datasets look like in the wild
  • how they are imported in R
  • the hoops they have to go through to get time or date variables right, e.g. by using as.POSIXct or strptime and
  • troubles that coordinates may give, e.g. their projections or degrees-minutes-seconds notation.

as well as challenges related to applying methods:

  • time series analysis does not work if you have very short time series
  • periodicities are often absent in yearly data, but present in sub-yearly (daily/weekly/monthly)
  • spatial correlation is hard to assess with a handful of observations
  • not all point-referenced datasets are meaningful input for point pattern analysis, or for geostatistical analysis
  • most point pattern software assumes Carthesian coordinates, but does not check for this

I’ve asked students to hand in assignments as R markdown files so that I can not only see what they did, but also redo what they did. I’ve asked them specifically to

  • work in a separate directory for each assignment, using lastname.assignmentNumber as directory name
  • include the R markdown file, data sets, and the output (html or pdf)
  • not set paths in the R markdown files
  • not use absolute paths to data files
  • not use install.packages in their R markdown file
  • zip this directory and submit it to the moodle system

I’ve also demonstrated for the whole group how things work when I reproduce this, and showed them where things go wrong. For reproducing, I have to

  • download the zip, and unzip it
  • go into the directory
  • double-click the R-markdown file, so rstudio starts in the right directory
  • click Knit to run the R markdown file and create the output html

For the final assignment, several students ended up with data sets larger than the maximum allowed upload size; they then gave me a link to a download link. Several students saved the result of a long computation to a .RData file, uncommented the long running section, and loaded the .RData instead, to limit run-time while writing the assignment. They warned me in case run times was long.

If anyone can think of a simpler workflow (for me, or for the students) I’d very much like to hear it!

Spring School on “Statistical analysis of hyperspectral and high-dimensional remote-sensing data using R”: a report

Sat, 03/25/2017 - 01:00

The Spring School on “Statistical analysis of hyperspectral and high-dimensional remote-sensing data using R”, held at the University of Jena, March 13-17, 2017, was organized by the GIScience group, led by Prof. Alexander Brenning and two researchers from the GIScience research group, Patrick Schratz and Dr. Jannes Münchow.

The school brought together a diverse group of 28 researchers (e.g. geoscientists, forestry, environmental studies) at different scientific levels (graduate students, PhD, postdoc, professor) from all over the world as far as Chile, Peru, Turkey, and Bosnia & Herzegowina. Overall, eight german and 16 non-german participants (20 male, 8 female) took part in this event. During five days the participants were introduced to the theoretical background of hyperspectral remote sensing data and learned in numerous hands-on sessions how to analyse and illustrate spatial data in R. The Spring School was organized within the LIFE Healthy Forest project and supported by the Michael Stifel Center Jena.

In this short blog-post I will give a quick overview of the many, many things we learned during this intense “spatial stats-and-R-week”.

Participants and organizers of the Spring School on “Statistical analysis of hyperspectral and high-dimensional remote-sensing data using R” in Jena, © H. Petschko

Day 1

On the first day of the summer school the participants obtained a theoretical introduction to hyperspectral remote-sensing data with examples focusing on the application of hyperspectral data in forest research.
Marco Peña from the Alberto Hurtado University in Chile gave a lecture on “Introduction to hyperspectral remote sensing” which brought everyone to the same level.
This very comprehensive introduction was followed by a talk on hyperspectral applications exemplified on a study on forests in the Bialowieza Forest in eastern Poland by Aneta Modzelewska from the Forest Research Institute in Raszyn.
The last talk on the first day was by Dr. Henning Buddenbaum (University of Trier) on “Hyperspectral remote sensing for measuring biochemical leaf parameters in forests”.
Dr. Buddenbaum is involved in the Science Advisory Group – Forests and Natural Ecosystems in the EnMAP mission, a German hyperspectral satellite mission aiming at monitoring and characterising the Earth’s environment globally.

Lecture by Prof. A. Brenning on “Statistical and machine learning in remote sensing”, © H. Petschko

Day 2

The second day was filled with hands-on R sessions. In a first session by Patrick Schratz we learned about his “must know” features of R, namely Rmarkdown, the apply-family and pipes.
This was followed by two session focusing on the usage of R as a GIS. Dr. Jannes Münchow, who developed the package RQGIS, an interface between R and QGIS which allows the user to access QGIS algorithms from within R.
Afterwards we were introduced to the R package mapview, by its author, Dr. Tim Appelhans. Mapview is a GIS-like interactive graphing tool that is directly accessible within RStudio (or the web browser, if you are not using RStudio). It is especially helpful if you want to quickly do a visual check whether a certain analysis has produced reasonable results.

Solving R-problems with Dr. Jannes Münchow, © H. Petschko

Day 3

The third day started with a lecture and hands-on session on “Statistical and machine learning in remote sensing” by Prof. Alexander Brenning with a focus on linear discriminant analysis, support vector machine and random forest. A short overview of these statistical modeling methods and the application in R including a comprehensive tutorial can be found here.
In the afternoon, Dr. Thomas Bocklitz presented a very different perspective in the application of spectral data analysis in histopathology. Afterwards, the participants had a chance to discuss their own research involving spatial modeling techniques or R-problem with the group and the experts from the GIScience group in Jena.

Open session during the Day3 of the Spring School to discuss research projects of the participants, © H. Petschko

Day 4

On the fourth day, Partick Schratz briefly introduced the hsdar package developed by Dr. Lukas Lehnert from University of Marburg. It can be used for processing and analysis on hyperspectral data in R.
Prof. Brenning focused in his second session further on the assessment of model accuracy (non-spatial and spatial validation methods, variable importance) using the sperrorest package and dealing with high dimensionality in linear regression.

Discussing sampling designs with Prof. A. Brenning, © H. Petschko

Introduction to parallel processing in R with Patrick Schratz, © H. Petschko

Day 5 (Thuringian Forest excursion)

On the last day, we visited a monitoring site and a site with tornado damage (see images below) from 2016 in the Thuringian Forest together with three experts from the official authority “ThüringenForst”.
In conclusion, the Spring School was a great event with many fruitful hands-on R-sessions during which the participants could learn helpful tricks in R, how to use R as a GIS and about statistical and machine learning in R. Hopefully there will be more academic “schools” like this one to follow in the future (maybe even with a thematic focus on geomorphology or natural hazards).

Tornado damage in the Thuringian Forest from September 2016 © P. Schratz

Field trip to the Thuringian Forest, © X. Tagle

Tidying feature geometries with sf

Sun, 03/19/2017 - 01:00

view raw Rmd


Spatial line and polygon data are often messy; although simple features formally follow a standard, there is no guarantee that data is clean when imported in R. This blog shows how we can identify, (de)select, or repair broken and invalid geometries. We also show how empty geometries arise, and can be dealt with. Literature on invalid polygons and correcting them is found in Ramsey (2010), Ledoux, Ohori, and Meijers (2014), Ledoux, Ohori, and Meijers (2012), and Van Oosterom, Quak, and Tijssen (2005); all these come with excelent figures illustrating the problem cases.

We see that from version 0.4-0, sf may be linked to lwgeom,

library(sf) ## Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.1 r15264

where lwgeom stands for the light-weight geometry library that powers postgis. This library is not present on CRAN, so binary packages installed from CRAN will not come with it. It is only linked to sf when it is detected during a build from source. When lwgeom is present, we will have a working version of st_make_valid, which is essentially identical to PostGIS’ ST_makeValid.

Corrup or invalid geometries?

There are two types of things that can go wrong when dealing with geometries in sf. First, a geometry can be corrupt, which is for instance the case for a LINESTRING with one point, or a POLYGON with more than zero and less than 4 points:

l0 = st_linestring(matrix(1:2,1,2)) p0 = st_polygon(list(rbind(c(0,0),c(1,1),c(0,0))))

These cases could of course be easily caught by the respective constructor functions, but they are not because we want to see what happens. Also, if we would catch them, it would not prevent us from running into them, because the majority of spatial data enters R through GDAL, and sf’s binary interface (reading well-known binary). Also, for many purposes corrupt may not be a problem, e.g. if we only want to plot them. In case we want to use them however in geometrical operations, we’ll typically see a message like:

IllegalArgumentException: Invalid number of points in LinearRing found 3 - must be 0 or >= 4

which points to GEOS not accepting a geometry as a possible geometry. Such an error message however does not point us to which geometry caused this. We could of course write a loop over all geometries to find this out, but can also use st_is_valid which returns by default NA on corrupt geometries:

l0 = st_linestring(matrix(1:2,1,2)) p0 = st_polygon(list(rbind(c(0,0),c(1,1),c(0,0)))) p = st_point(c(0,1)) # not corrupt st_is_valid(st_sfc(l0, p0, p)) ## [1] NA NA TRUE

Simple feature validity refers to a number of properties that polygons should have, such as non-self intersecting, holes being inside polygons. A number of different examples for invalid geometries are found in Ledoux, Ohori, and Meijers (2014), and were taken from their prepair github repo:

# A 'bowtie' polygon: p1 = st_as_sfc("POLYGON((0 0, 0 10, 10 0, 10 10, 0 0))") # Square with wrong orientation: p2 = st_as_sfc("POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))") # Inner ring with one edge sharing part of an edge of the outer ring: p3 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0),(5 2,5 7,10 7, 10 2, 5 2))") # Dangling edge: p4 = st_as_sfc("POLYGON((0 0, 10 0, 15 5, 10 0, 10 10, 0 10, 0 0))") # Outer ring not closed: p5 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10))") # Two adjacent inner rings: p6 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (1 1, 1 8, 3 8, 3 1, 1 1), (3 1, 3 8, 5 8, 5 1, 3 1))") # Polygon with an inner ring inside another inner ring: p7 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (2 8, 5 8, 5 2, 2 2, 2 8), (3 3, 4 3, 3 4, 3 3))") p = c(p1, p2, p3, p4, p5, p6, p7) (valid = st_is_valid(p)) ## [1] FALSE TRUE FALSE FALSE NA FALSE FALSE

Interestingly, GEOS considers p5 as corrupt (NA) and p2 as valid.

To query GEOS for the reason of invalidity, we can use the reason = TRUE argument to st_is_valid:

st_is_valid(p, reason = TRUE) ## [1] "Self-intersection[5 5]" "Valid Geometry" ## [3] "Self-intersection[10 2]" "Self-intersection[10 0]" ## [5] NA "Self-intersection[3 1]" ## [7] "Holes are nested[3 3]" Making invalid polygons valid

As mentioned above, in case sf was linked to lwgeom, which is confirmed by

sf_extSoftVersion()["lwgeom"] ## lwgeom ## "2.3.1 r15264"

not printing a NA, we can use st_make_valid to make geometries valid:

st_make_valid(p) ## Geometry set for 7 features ## geometry type: GEOMETRY ## dimension: XY ## bbox: xmin: 0 ymin: 0 xmax: 15 ymax: 10 ## epsg (SRID): NA ## proj4string: NA ## First 5 geometries: ## MULTIPOLYGON(((0 0, 0 10, 5 5, 0 0)), ((5 5, 10... ## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0)) ## GEOMETRYCOLLECTION(POLYGON((10 7, 10 2, 10 0, 0... ## GEOMETRYCOLLECTION(POLYGON((10 0, 0 0, 0 10, 10... ## POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))

A well-known “trick”, which may be your only alternative if is to buffer the geometries with zero distance:

st_buffer(p[!], 0.0) ## Geometry set for 6 features ## geometry type: POLYGON ## dimension: XY ## bbox: xmin: 0 ymin: 0 xmax: 10 ymax: 10 ## epsg (SRID): NA ## proj4string: NA ## First 5 geometries: ## POLYGON((0 0, 0 10, 5 5, 0 0)) ## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0)) ## POLYGON((0 0, 0 10, 10 10, 10 7, 5 7, 5 2, 10 2... ## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0)) ## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0), (1 1, 3 ...

but we see that, apart from the fact that this only works for non-corrupt geometries, we end up with different results.

A larger example from the prepair site is this:

x = read_sf("/home/edzer/git/prepair/data/CLC2006_2018418.geojson") st_is_valid(x) ## [1] FALSE st_is_valid(st_make_valid(x)) ## [1] TRUE plot(x, col = 'grey', axes = TRUE, graticule = TRUE)

The corresponding paper, Ledoux, Ohori, and Meijers (2012) zooms in on problematic points. The authors argue to use constrained triangulation instead of the (less documented) approach taken by lwgeom; Mike Sumner also explores this here. It builds upon RTriangle, which cannot be integrated in sf as it is distributed under license with a non-commercial clause. Ledoux uses CGAL, which would be great to have an interface to from R!

Empty geometries

Empty geometries exist, and can be thought of as zero-length vectors, data.frames without rows, or NULL values in lists: in essence, there’s place for information, but there is no information. An empty geometry arises for instance if we ask for the intersection of two non-intersecting geometries:

st_intersection(st_point(0:1), st_point(1:2)) ## GEOMETRYCOLLECTION()

In principle, we could have designed sf such that empty geometries were represented a NULL value, but the standard prescrives that every geometry type has an empty instance:

st_linestring() ## LINESTRING() st_polygon() ## POLYGON() st_point() ## POINT(NA NA)

and thus the empty geometry is typed. This guarantees clean roundtrips from a database to R back into a database: no information (on type) gets lost in case of presence of empty geometries.

How can we detect, and filter on empty geometries? We can do that with st_dimension:

lin = st_linestring(rbind(c(0,0),c(1,1))) pol = st_polygon(list(rbind(c(0,0),c(1,1),c(0,1),c(0,0)))) poi = st_point(c(1,1)) p0 = st_point() pol0 = st_polygon() st_dimension(st_sfc(lin, pol, poi, p0, pol0)) ## [1] 1 2 0 NA NA

and see that empty geometries return NA.

The standard however prescribes that an empty polygon still has dimension two, and we can override the NA convenience to get standard-compliant dimensions by

st_dimension(st_sfc(lin, pol, poi, p0, pol0), NA_if_empty = FALSE) ## [1] 1 2 0 0 2 Tidying feature geometries

When you analyse your spatial data with sf and you don’t get any warnings or error messages, all may be fine. In case you do, or your are curious, you can check for

  1. empty geometries, using any(
  2. corrupt geometries, using any(
  3. invalid geometries, using any(na.omit(st_is_valid(x)) == FALSE); in case of corrupt and/or invalid geometries,
  4. in case of invalid geometries, query the reason for invalidity by st_is_valid(x, reason = TRUE)
  5. you may be succesful in making geometries valid using st_make_valid(x) or, if st_make_valid is not supported by
  6. st_buffer(x, 0.0) on non-corrupt geometries (but beware of the bowtie example above, where st_buffer removes one half).
  7. After succesful a st_make_valid, you may want to select a particular type subset using st_is, or cast GEOMETRYCOLLECTIONS to MULTIPOLYGON by
st_make_valid(p) %>% st_cast("MULTIPOLYGON") ## Warning in st_cast.GEOMETRYCOLLECTION(X[[i]], ...): only first part of ## geometrycollection is retained ## Warning in st_cast.GEOMETRYCOLLECTION(X[[i]], ...): only first part of ## geometrycollection is retained ## Warning in st_cast.GEOMETRYCOLLECTION(X[[i]], ...): only first part of ## geometrycollection is retained ## Geometry set for 7 features ## geometry type: MULTIPOLYGON ## dimension: XY ## bbox: xmin: 0 ymin: 0 xmax: 10 ymax: 10 ## epsg (SRID): NA ## proj4string: NA ## First 5 geometries: ## MULTIPOLYGON(((0 0, 0 10, 5 5, 0 0)), ((5 5, 10... ## MULTIPOLYGON(((0 0, 0 10, 10 10, 10 0, 0 0))) ## MULTIPOLYGON(((10 7, 10 2, 10 0, 0 0, 0 10, 10 ... ## MULTIPOLYGON(((10 0, 0 0, 0 10, 10 10, 10 0))) ## MULTIPOLYGON(((0 0, 10 0, 10 10, 0 10, 0 0)))

For longer explanations about what makes a polygons invalid, do read one of the references below, all are richly illustrated

References [references]

Ledoux, Hugo, Ken Arroyo Ohori, and Martijn Meijers. 2012. “Automatically Repairing Invalid Polygons with a Constrained Triangulation.” In. Agile.

———. 2014. “A Triangulation-Based Approach to Automatically Repair GIS Polygons.” Computers & Geosciences 66: 121–31.

Ramsey, Paul. 2010. “PostGIS-Tips for Power Users.” Presentation on: FOSS4G.

Van Oosterom, Peter, Wilko Quak, and Theo Tijssen. 2005. “About Invalid, Valid and Clean Polygons.” In Developments in Spatial Data Handling, 1–16. Springer.

Spatial Modeling Using Statistical Learning Techniques

Mon, 03/13/2017 - 01:00

Geospatial data scientists often make use of a variety of statistical and machine learning techniques for spatial prediction in applications such as landslide susceptibility modeling (Goetz et al. 2015) or habitat modeling (Knudby, Brenning, and LeDrew 2010). Novel and often more flexible techniques promise improved predictive performances as they are better able to represent nonlinear relationships or higher-order interactions between predictors than less flexible linear models.

Nevertheless, this increased flexibility comes with the risk of possible over-fitting to the training data. Since nearby spatial observations often tend to be more similar than distant ones, traditional random cross-validation is unable to detect this over-fitting whenever spatial observations are close to each other (e.g. Brenning (2005)). Spatial cross-validation addresses this by resampling the data not completely randomly, but using larger spatial regions. In some cases, spatial data is grouped, e.g. in remotely-sensed land use mapping grid cells belonging to the same field share the same management procedures and cultivation history, making them more similar to each other than to pixels from other fields with the same crop type.

The sperrorest package provides a customizable toolkit for cross-validation (and bootstrap) estimation using a variety of spatial resampling schemes. More so, this toolkit can even be extended to spatio-temporal data or other complex data structures. This blog post will walk you through a simple case study, crop classification in central Chile (Peña and Brenning 2015).

Data and Packages

As a case study we will carry out a supervised classification analysis using remotely-sensed data to predict fruit-tree crop types in central Chile. This data set is a subsample of data from (Peña and Brenning 2015).

library(pacman) p_load(sperrorest) data("maipo", package = "sperrorest")

The remote-sensing predictor variables were derived from an image times series consisting of eight Landsat images acquired throughout the (southern hemisphere) growing season. The data set includes the following variables:


  • croptype: response variable (factor) with 4 levels: ground truth information


  • b[12-87]: spectral data, e.g. b82 = image date #8, spectral band #2
  • ndvi[01-08]: Normalized Difference Vegetation Index, e.g. #8 = image date #8
  • ndwi[01-08]: Normalized Difference Water Index, e.g. #8 = image date #8


  • field: field identifier (grouping variable - not to be used as predictor)
  • utmx, utmy: x/y location; not to be used as predictors

All but the first four variables of the data set are predictors; their names are used to construct a formula object:

predictors <- colnames(maipo)[5:ncol(maipo)] # Construct a formula: fo <- as.formula(paste("croptype ~", paste(predictors, collapse = "+"))) Modeling

Here we will take a look at a few classification methods with varying degrees of computational complexity and flexibility. This should give you an idea of how different models are handled by sperrorest, depending on the characteristics of their fitting and prediction methods. Please refer to (James et al. 2013) for background information on the models used here.

Linear Discriminant Analysis (LDA)

LDA is simple and fast, and often performs surprisingly well if the problem at hand is ‘linear enough’. As a start, let’s fit a model with all predictors and using all available data:

p_load(MASS) fit <- lda(fo, data = maipo)

Predict the croptype with the fitted model and calculate the misclassification error rate (MER) on the training sample:

pred <- predict(fit, newdata = maipo)$class mean(pred != maipo$croptype) ## [1] 0.0437

But remember that this result is over-optimistic because we are re-using the training sample for model evaluation. We will soon show you how to do better with cross-validation.

We can also take a look at the confusion matrix but again, this result is overly optimistic:

table(pred = pred, obs = maipo$croptype) ## obs ## pred crop1 crop2 crop3 crop4 ## crop1 1294 8 4 37 ## crop2 50 1054 4 44 ## crop3 0 0 1935 6 ## crop4 45 110 29 3093 Classification Tree

Classification and regresion trees (CART) take a completely different approach—they are based on yes/no questions in the predictor variables and can be referred to as a binary partitioning technique. Fit a model with all predictors and default settings:

p_load(rpart) fit <- rpart(fo, data = maipo) ## optional: view the classiciation tree # par(xpd = TRUE) # plot(fit) # text(fit, use.n = TRUE)

Again, predict the croptype with the fitted model and calculate the average MER:

pred <- predict(fit, newdata = maipo, type = "class") mean(pred != maipo$croptype) ## [1] 0.113

Here the predict call is slightly different. Again, we could calculate a confusion matrix.

table(pred = pred, obs = maipo$croptype) ## obs ## pred crop1 crop2 crop3 crop4 ## crop1 1204 66 0 54 ## crop2 47 871 38 123 ## crop3 38 8 1818 53 ## crop4 100 227 116 2950 RandomForest

Bagging, bundling and random forests build upon the CART technique by fitting many trees on bootstrap resamples of the original data set (Breiman 1996) (Breiman 2001) (Hothorn and Lausen 2005). They differ in that random forest also samples from the predictors, and bundling adds an ancillary classifier for improved classification. We will use the nowadays widely used randomForest() here.

p_load(randomForest) fit <- randomForest(fo, data = maipo, coob = TRUE) fit ## ## Call: ## randomForest(formula = fo, data = maipo, coob = TRUE) ## Type of random forest: classification ## Number of trees: 500 ## No. of variables tried at each split: 8 ## ## OOB estimate of error rate: 0.57% ## Confusion matrix: ## crop1 crop2 crop3 crop4 class.error ## crop1 1382 2 0 5 0.00504 ## crop2 1 1163 0 8 0.00768 ## crop3 0 0 1959 13 0.00659 ## crop4 7 5 3 3165 0.00472

Let’s take a look at the MER achieved on the training sample:

pred <- predict(fit, newdata = maipo, type = "class") mean(pred != maipo$croptype) ## [1] 0 table(pred = pred, obs = maipo$croptype) ## obs ## pred crop1 crop2 crop3 crop4 ## crop1 1389 0 0 0 ## crop2 0 1172 0 0 ## crop3 0 0 1972 0 ## crop4 0 0 0 3180

Isn’t this amazing? Only one grid cell is misclassified by the bagging classifier! Even the OOB (out-of-bag) estimate of the error rate is < 1%.
Too good to be true? We’ll see…

Cross-Validation Estimation of Predictive Performance

Of course we can’t take the MER on the training set too seriously—it is biased. But we’ve heard of cross-validation, in which disjoint subsets are used for model training and testing. Let’s use sperrorest for cross-validation.

Also, at this point we should highlight that the observations in this data set are pixels, and multiple grid cells belong to the same field. In a predictive situation, and when field boundaries are known (as is the case here), we would want to predict the same class for all grid cells that belong to the same field. Here we will use a majority filter. This filter ensures that the final predicted class type of every field is the most often predicted croptype within one field.

Linear Discriminant Analysis (LDA)

First, we need to create a wrapper predict method for LDA for sperrorest(). This is necessary in order to accomodate the majority filter, and also because class predictions from lda’s predict method are hidden in the $class component of the returned object.

lda.predfun <- function(object, newdata, fac = NULL) { p_load(nnet) majority <- function(x) { levels(x)[] } majority.filter <- function(x, fac) { for (lev in levels(fac)) { x[fac == lev] <- majority(x[fac == lev]) } x } pred <- predict(object, newdata = newdata)$class if (!is.null(fac)) pred <- majority.filter(pred, newdata[, fac]) return(pred) }

To ensure that custom predict-functions do also work with parsperrorest(), we need to define all custom functions in one step. Otherwise, the foreach() package in par.mode = 2 of parsperrorest() will throw errors because of the way how it loads functions of the parent environment.

Finally, we can run sperrorest() with a non-spatial sampling setup ( In this example we use a ‘50 repetitions - 5 folds’ setup. To make your results more independent of particular random partitioning, you may want to use 100 repetitions or even more in practice.

res.lda.nsp <- sperrorest(fo, data = maipo, coords = c("utmx","utmy"), = lda, = lda.predfun, pred.args = list(fac = "field"), =, smp.args = list(repetition = 1:50, nfold = 5), error.rep = TRUE, error.fold = TRUE, progress = FALSE) summary(res.lda.nsp$error.rep) ## mean sd median IQR ## train.error 3.40e-02 0.001 3.40e-02 0.001 ## train.accuracy 9.66e-01 0.001 9.66e-01 0.001 ## 4.69e+03 0.000 4.69e+03 0.000 ## train.count 3.09e+04 0.000 3.09e+04 0.000 ## test.error 4.00e-02 0.002 4.00e-02 0.002 ## test.accuracy 9.60e-01 0.002 9.60e-01 0.002 ## 1.17e+03 0.000 1.17e+03 0.000 ## test.count 7.71e+03 0.000 7.71e+03 0.000

To run a spatial cross-validation at the field level, we can use as the sampling function. Since we are using 5 folds, we get a coarse 80/20 split of our data. 80% will be used for training, 20% for testing our trained model.

To take a look where our training and tests sets will be partitioned on each fold, we can plot them. The red colored points represent the test set in each fold, the black colored points the training set. Note that because we plotted over 7000 points, overplotting occurs and since the red crosses are plotted after the black ones, it seems visually that way more than ~20% of red points exist than it is really the case.

resamp <-, nfold = 5, repetition = 1:1, fac = "field") plot(resamp, maipo, coords = c("utmx","utmy"))

Subsequently, we have to specify the location of the fields (fac = "field") in the prediction arguments (pred.args) and sampling arguments (smp.args) in sperrorest().

res.lda.sp <- sperrorest(fo, data = maipo, coords = c("utmx","utmy"), = lda, = lda.predfun, pred.args = list(fac = "field"), =, smp.args = list(fac = "field", repetition = 1:50, nfold = 5), error.rep = TRUE, error.fold = TRUE, benchmark = TRUE, progress = FALSE) res.lda.sp$benchmark$runtime.performance summary(res.lda.sp$error.rep) ## mean sd median IQR ## train.error 2.95e-02 0.00177 2.97e-02 0.00261 ## train.accuracy 9.70e-01 0.00177 9.70e-01 0.00261 ## 4.69e+03 0.00000 4.69e+03 0.00000 ## train.count 3.09e+04 0.00000 3.09e+04 0.00000 ## test.error 6.65e-02 0.00807 6.59e-02 0.01083 ## test.accuracy 9.33e-01 0.00807 9.34e-01 0.01083 ## 1.17e+03 0.00000 1.17e+03 0.00000 ## test.count 7.71e+03 0.00000 7.71e+03 0.00000 RandomForest

In the case of Random Forest, the customized looks as follows; it is only required because of the majority filter, without it, we could just omit the and pred.args arguments below.

rf.predfun <- function(object, newdata, fac = NULL) { p_load(nnet) majority <- function(x) { levels(x)[] } majority.filter <- function(x, fac) { for (lev in levels(fac)) { x[fac == lev] <- majority(x[fac == lev]) } x } pred <- predict(object, newdata = newdata) if (!is.null(fac)) pred <- majority.filter(pred, newdata[,fac]) return(pred) } res.rf.sp <- sperrorest(fo, data = maipo, coords = c("utmx","utmy"), = randomForest, = rf.predfun, pred.args = list(fac = "field"), =, smp.args = list(fac = "field", repetition = 1:50, nfold = 5), error.rep = TRUE, error.fold = TRUE, benchmark = TRUE, progress = 2) ## Mon Feb 27 20:56:01 2017 Repetition 1 ## Mon Feb 27 20:57:12 2017 Repetition 2 ## Mon Feb 27 20:58:20 2017 Repetition 3 ## Mon Feb 27 20:59:29 2017 Repetition 4 ## Mon Feb 27 21:00:36 2017 Repetition 5 ## Mon Feb 27 21:01:46 2017 Repetition 6 ## Mon Feb 27 21:02:55 2017 Repetition 7 ## Mon Feb 27 21:04:01 2017 Repetition 8 ## Mon Feb 27 21:05:07 2017 Repetition 9 ## Mon Feb 27 21:06:16 2017 Repetition 10 ## Mon Feb 27 21:07:23 2017 Repetition 11 ## Mon Feb 27 21:08:30 2017 Repetition 12 ## Mon Feb 27 21:09:38 2017 Repetition 13 ## Mon Feb 27 21:10:45 2017 Repetition 14 ## Mon Feb 27 21:11:53 2017 Repetition 15 ## Mon Feb 27 21:13:01 2017 Repetition 16 ## Mon Feb 27 21:14:09 2017 Repetition 17 ## Mon Feb 27 21:15:16 2017 Repetition 18 ## Mon Feb 27 21:16:23 2017 Repetition 19 ## Mon Feb 27 21:17:31 2017 Repetition 20 ## Mon Feb 27 21:18:39 2017 Repetition 21 ## Mon Feb 27 21:19:46 2017 Repetition 22 ## Mon Feb 27 21:20:53 2017 Repetition 23 ## Mon Feb 27 21:22:03 2017 Repetition 24 ## Mon Feb 27 21:23:13 2017 Repetition 25 ## Mon Feb 27 21:24:23 2017 Repetition 26 ## Mon Feb 27 21:25:32 2017 Repetition 27 ## Mon Feb 27 21:26:39 2017 Repetition 28 ## Mon Feb 27 21:27:47 2017 Repetition 29 ## Mon Feb 27 21:28:55 2017 Repetition 30 ## Mon Feb 27 21:30:03 2017 Repetition 31 ## Mon Feb 27 21:31:11 2017 Repetition 32 ## Mon Feb 27 21:32:18 2017 Repetition 33 ## Mon Feb 27 21:33:25 2017 Repetition 34 ## Mon Feb 27 21:34:33 2017 Repetition 35 ## Mon Feb 27 21:35:40 2017 Repetition 36 ## Mon Feb 27 21:36:47 2017 Repetition 37 ## Mon Feb 27 21:37:54 2017 Repetition 38 ## Mon Feb 27 21:39:02 2017 Repetition 39 ## Mon Feb 27 21:40:09 2017 Repetition 40 ## Mon Feb 27 21:41:17 2017 Repetition 41 ## Mon Feb 27 21:42:24 2017 Repetition 42 ## Mon Feb 27 21:43:31 2017 Repetition 43 ## Mon Feb 27 21:44:38 2017 Repetition 44 ## Mon Feb 27 21:45:46 2017 Repetition 45 ## Mon Feb 27 21:46:54 2017 Repetition 46 ## Mon Feb 27 21:48:01 2017 Repetition 47 ## Mon Feb 27 21:49:07 2017 Repetition 48 ## Mon Feb 27 21:50:15 2017 Repetition 49 ## Mon Feb 27 21:51:21 2017 Repetition 50 ## Mon Feb 27 21:52:27 2017 Done. res.rf.sp$benchmark$runtime.performance ## Time difference of 56.4 mins summary(res.rf.sp$error.rep$test.error) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0630 0.0827 0.0871 0.0868 0.0928 0.1100 summary(res.rf.sp$error.rep$test.accuracy) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.890 0.907 0.913 0.913 0.917 0.937

What a surprise! RandomForest classification isn’t that good after all, if we acknowledge that in ‘real life’ we wouldn’t be making predictions in situations where the class membership of other grid cells in the same field is known in the training stage. So spatial dependence does matter.

Parallelized Cross-Validation

To speed things up, we can use parsperrorest() and inspect the runtime difference. Note that we have two parallel modes to choose from!
In this example we will use both par.mode = 1 and par.mode = 2 to show you the runtime differences.

Details of argument par.mode

The advantage of par.mode = 1 compared to par.mode = 2 is its speed. However, par.mode = 1 is somewhat less stable. This means that par.mode = 1 will throws errors for some models (e.g. LDA) while par.mode = 2 does not. par.mode = 1 uses either a parallel::mclapply() (Unix) or parallel::parApply() (Windows) while par.mode = 2 is running on a foreach backend.

While traditional mclapply() approaches have the downside that no progress report can be printed, the pbapply package implemented in parsperrorest() provides progress output (showing a progress bar) and does even work cross-platform. par.mode = 1 does not show a progress bar but prints fold and/or repetition information to the console.

res.rf.sp.par1 <- parsperrorest(fo, data = maipo, coords = c("utmx","utmy"), = randomForest, = rf.predfun, pred.args = list(fac = "field"), =, smp.args = list(fac = "field", repetition = 1:50, nfold = 5), par.args = list(par.units = 4, par.mode = 1), error.rep = TRUE, error.fold = TRUE, benchmark = TRUE, progress = FALSE) res.rf.sp.par1$benchmarks$runtime.performance ## Time difference of 18 mins res.rf.sp.par2 <- parsperrorest(fo, data = maipo, coords = c("utmx","utmy"), = randomForest, = rf.predfun, pred.args = list(fac = "field"), =, smp.args = list(fac = "field", repetition = 1:50, nfold = 5), par.args = list(par.units = 4, par.mode = 2), error.rep = TRUE, error.fold = TRUE, benchmark = TRUE, progress = FALSE) res.rf.sp.par2$benchmarks$runtime.performance ## Time difference of 19.9 mins

Both the differences of the parallel methods (parsperrorest()) compared to the sequential one (sperrorest()) and between the parallel modes themselves can be seen as variance among the repetitions. If you would use more repetitions (e.g. 100), this difference would converge towards zero.

Results of par.mode = 1 summary(res.rf.sp.par1$error.rep$test.error) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0744 0.0813 0.0852 0.0862 0.0903 0.1090 summary(res.rf.sp.par1$error.rep$test.accuracy) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.891 0.910 0.915 0.914 0.919 0.926 Results of par.mode = 2 summary(res.rf.sp.par2$error.rep$test.error) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0638 0.0832 0.0874 0.0872 0.0926 0.1130 summary(res.rf.sp.par2$error.rep$test.accuracy) ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.887 0.907 0.913 0.913 0.917 0.936 Usage Advices

Given all the different sampling functions and the custom predict function (rf.predfun()) you might be a little confused which function to use for your use case.
If you want to do a “normal”, i.e. non-spatial cross-validation we recommend to use as in sperrorest() or parsperrorest().
If you want to perform a spatial cross-validation and you do not have a grouping structure like fields in our example, you should use partition.kmeans() as This creates a spatial k-means clustering within your cross-validation runs.

Most often you can simply use the generic predict() method for your model. Only in cases where you want to do prediction based on some factor level like in this vignette, you need to write your own prediction function.

Wrapper functions

Another point to be aware of is that some model functions do use a different argument naming/order than the argument of sperrorest() expects (formula + data). For example, the formula argument of glmmPQL() is named fixed instead of formula.
Also sometimes you may encounter errors when providing spatial correlation structures to your model using model.args. If this happens, also try to provide the correlation structure within a wrapper function.

my_glmmPQL <- function(formula, data) { # calculate spatial correlation structure correlation = corSpatial(value = c(5024, 0.25), form = ~ry + rx, nugget = TRUE, fixed = FALSE, type = "spherical") correlation = Initialize(correlation, data = data) # actual glmmPQL(fixed = formula, data = data, correlation = correlation) }

For all further questions, please feel free to open an issue at our Github repo.

References [references]

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2). Springer Nature: 123–40. doi:10.1007/bf00058655.

———. 2001. “Random Forests.” Machine Learning 45 (1). Springer Nature: 5–32. doi:10.1023/a:1010933404324.

Brenning, A. 2005. “Spatial Prediction Models for Landslide Hazards: Review, Comparison and Evaluation.” Natural Hazards and Earth System Science 5 (6). Copernicus GmbH: 853–62. doi:10.5194/nhess-5-853-2005.

Goetz, J.N., A. Brenning, H. Petschko, and P. Leopold. 2015. “Evaluating Machine Learning and Statistical Prediction Techniques for Landslide Susceptibility Modeling.” Computers & Geosciences 81 (August). Elsevier BV: 1–11. doi:10.1016/j.cageo.2015.04.007.

Hothorn, Torsten, and Berthold Lausen. 2005. “Bundling Classifiers by Bagging Trees.” Computational Statistics & Data Analysis 49 (4). Elsevier BV: 1068–78. doi:10.1016/j.csda.2004.06.019.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani.

  1. An Introduction to Statistical Learning. Springer New York. doi:10.1007/978-1-4614-7138-7.

Knudby, Anders, Alexander Brenning, and Ellsworth LeDrew. 2010. “New Approaches to Modelling Fishhabitat Relationships.” Ecological Modelling 221 (3). Elsevier BV: 503–11. doi:10.1016/j.ecolmodel.2009.11.008.

Peña, M.A., and A. Brenning. 2015. “Assessing Fruit-Tree Crop Classification from Landsat-8 Time Series for the Maipo Valley, Chile.” Remote Sensing of Environment 171 (December). Elsevier BV: 234–44. doi:10.1016/j.rse.2015.10.029.

mapedit - interactively edit spatial data in R

Mon, 01/30/2017 - 01:00

[view raw Rmd]

The R ecosystem offers a powerful set of packages for geospatial analysis. For a comprehensive list see the CRAN Task View: Analysis of Spatial Data. Yet, many geospatial workflows require interactivity for smooth uninterrupted completion. With new tools, such as htmlwidgets, shiny, and crosstalk, we can now inject this useful interactivity without leaving the R environment. In the first phase of the mapedit project, we have focused on experimenting and creating proof of concepts for the following three objectives:

  1. drawing, editing, and deleting features,

  2. selecting and querying of features and map regions,

  3. editing attributes.

Install mapedit

To run the code in the following discussion, please install with devtools::install_github. Please be aware that the current functionality is strictly a proof of concept, and the API will change rapidly and dramatically.

devtools::install_github("bhaskarvk/leaflet.extras") devtools::install_github("r-spatial/mapedit") Drawing, Editing, Deleting Features

We would like to set up an easy process for CRUD (create, read, update, and delete) of map features. The function edit_map demonstrates a first step toward this goal.

Proof of Concept 1 | Draw on Blank Map

To see how we might add some features, let’s start with a blank map, and then feel free to draw, edit, and delete with the Leaflet.Draw toolbar on the map. Once finished drawing simply press “Done”.

library(leaflet) library(mapedit) what_we_created <- leaflet() %>% addTiles() %>% edit_map()

edit_map returns a list with drawn, edited, deleted, and finished features as GeoJSON. In this case, if we would like to see our finished creation we can focus on what_we_created$finished. Since this is GeoJSON, the easiest way to see what we just created will be to use the addGeoJSON function from leaflet. This works well with polylines, polygons, rectangles, and points, but circles will be treated as points without some additional code. In future versions of the API it is likely that mapedit will return simple features gemometries rather than geojson by default.

leaflet() %>% addTiles() %>% addGeoJSON(what_we_created$finished) Proof of Concept 2 | Edit and Delete Existing Features

As an extension of the first proof of concept, we might like to edit and/or delete existing features. Let’s play Donald Trump for this exercise and use the border between Mexico and the United States for California and Arizona. For the sake of the example, let’s use a simplified polyline as our border. As we have promised we want to build a wall, but if we could just move the border a little in some places, we might be able to ease construction.

library(sf) # simplified border for purpose of exercise border <- st_as_sfc( "LINESTRING(-109.050197582692 31.3535554844322, -109.050197582692 31.3535554844322, -111.071681957692 31.3723176640684, -111.071681957692 31.3723176640684, -114.807033520192 32.509681296831, -114.807033520192 32.509681296831, -114.741115551442 32.750242384668, -114.741115551442 32.750242384668, -117.158107738942 32.5652527715121, -117.158107738942 32.5652527715121)" ) %>% st_set_crs(4326) # plot quickly for visual inspection plot(border)

Since we are Trump, we can do what we want, so let’s edit the line to our liking. We will use mapview for our interactive map since it by default gives us an OpenTopoMap layer and the develop branch includes preliminary simple features support. With our new border and fence, we will avoid the difficult mountains and get a little extra beachfront.

# use develop branch of mapview with simple features support # devtools::install_github("environmentalinformatics-marburg/mapview@develop") library(mapview) new_borders <- mapview(border)@map %>% edit_map("border")

Now, we can quickly inspect our new borders and then send the coordinates to the wall construction company.

leaflet() %>% addTiles() %>% fitBounds(-120, 35, -104, 25) %>% addGeoJSON(new_borders$drawn)


If you played enough with the border example, you might notice a couple of glitches and missing functionality. This is a good time for a reminder that this is alpha and intended as a proof of concept. Please provide feedback, so that we can insure a quality final product. In this case, the older version of Leaflet.Draw in RStudio Viewer has some bugs, so clicking an existing point creates a new one rather than allowing editing of that point. Also, the returned list from edit_map has no knowledge of the provided features.

Selecting Regions

The newest version of leaflet provides crosstalk support, but support is currently limited to addCircleMarkers. This functionality is enhanced by the sf use of list columns and integration with dplyr verbs. Here is a quick example with the breweries91 data from mapview.

library(crosstalk) library(mapview) library(sf) library(shiny) library(dplyr) # convert breweries91 from mapview into simple features # and add a Century column that we will use for selection brew_sf <- st_as_sf(breweries91) %>% mutate(century = floor(founded/100)*100) %>% filter(! %>% mutate(id=1:n()) pts <- SharedData$new(brew_sf, key = ~id, group = "grp1") ui <- fluidPage( fluidRow( column(4, filter_slider(id="filterselect", label="Century Founded", sharedData=pts, column=~century, step=50)), column(6, leafletOutput("leaflet1")) ), h4("Selected points"), verbatimTextOutput("selectedpoints") ) server <- function(input, output, session) { # unfortunatly create SharedData again for scope pts <- SharedData$new(brew_sf, key = ~id, group = "grp1") lf <- leaflet(pts) %>% addTiles() %>% addMarkers() not_rendered <- TRUE # hack to only draw leaflet once output$leaflet1 <- renderLeaflet({ if(req(not_rendered,cancelOutput=TRUE)) { not_rendered <- FALSE lf } }) output$selectedpoints <- renderPrint({ df <- pts$data(withSelection = TRUE) cat(nrow(df), "observation(s) selected\n\n") str(dplyr::glimpse(df)) }) } shinyApp(ui, server)

With mapedit, we would like to enhance the geospatial crosstalk integration to extend beyond leaflet::addCircleMarkers. In addition, we would like to provide an interactive interface to the geometric operations of sf, such as st_intersects(), st_difference(), and st_contains().

Proof of Concept 3

As a select/query proof of concept, assume we want to interactively select some US states for additional analysis. We will build off Bhaskar Karambelkar’s leaflet projection example using Bob Rudis albersusa package.

# use @bhaskarvk USA Albers with leaflet code # #devtools::install_github("hrbrmstr/albersusa") library(albersusa) library(sf) library(leaflet) library(mapedit) spdf <- usa_composite() %>% st_as_sf() pal <- colorNumeric( palette = "Blues", domain = spdf$pop_2014 ) bounds <- c(-125, 24 ,-75, 45) (lf <- leaflet( options= leafletOptions( worldCopyJump = FALSE, crs=leafletCRS( crsClass="L.Proj.CRS", code='EPSG:2163', proj4def='+proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs', resolutions = c(65536, 32768, 16384, 8192, 4096, 2048,1024, 512, 256, 128) ))) %>% fitBounds(bounds[1], bounds[2], bounds[3], bounds[4]) %>% setMaxBounds(bounds[1], bounds[2], bounds[3], bounds[4]) %>% addPolygons( data=spdf, weight = 1, color = "#000000", # adding group necessary for identification group = ~iso_3166_2, fillColor=~pal(pop_2014), fillOpacity=0.7, label=~stringr::str_c(name,' ', format(pop_2014, big.mark=",")), labelOptions= labelOptions(direction = 'auto')#, #highlightOptions = highlightOptions( # color='#00ff00', bringToFront = TRUE, sendToBack = TRUE) ) ) # test out select_map with albers example select_map( lf, style_false = list(weight = 1), style_true = list(weight = 4) )

The select_map() function will return a data.frame with an id/group column and a selected column. select_map() will work with nearly all leaflet overlays and offers the ability to customize the styling of selected and unselected features.

Editing Attributes

A common task in geospatial analysis involves editing or adding feature attributes. While much of this can be accomplished in the R console, an interactive UI on a reference map can often help perform this task. Mapbox’s provides a good reference point for some of the features we would like to provide in mapedit.

Proof of Concept 4

As a proof of concept, we made a Shiny app that thinly wraps a slightly modified Currently, we will have to pretend that there is a mechanism to load R feature data onto the map, since this functionality does not yet exist.

library(shiny) edited_features <- runGitHub( "", "timelyportfolio", ref="shiny" )


mapedit hopes to add useful interactivity to your geospatial workflows by leveraging powerful new functionality in R with the interactivity of HTML, JavaScript, and CSS. mapedit will be better with your feedback, requests, bug reports, use cases, and participation. We will report on progress periodically with blog posts on this site, and we will develop openly on the mapedit Github repo.

sf - plot, graticule, transform, units, cast, is

Thu, 01/12/2017 - 01:00

[view raw Rmd]

This year began with the R Consortium blog on simple features:

#rstats A new post by Edzer Pebesma reviews the status of the R Consortium’s Simple Features project:

— Joseph Rickert (@RStudioJoe) January 3, 2017

This blog post describes changes of sf 0.2-8 and upcoming 0.2-9, compared to 0.2-7, in more detail.

Direct linking to Proj.4

Since 0.2-8, sf links directly to the Proj.4 library:

library(sf) ## Linking to GEOS 3.5.0, GDAL 2.1.0, proj.4 4.9.2

before that, it would use the projection interface of GDAL, which uses Proj.4, but exposes only parts of it. The main reason for switching to Proj.4 is the ability for stronger error checking. For instance, where GDAL would interpret any unrecognized field for +datum as WGS84:

# sf 0.2-7: > st_crs("+proj=longlat +datum=NAD26") $epsg [1] NA $proj4string [1] "+proj=longlat +ellps=WGS84 +no_defs" attr(,"class") [1] "crs"

Now, with sf 0.2-8 we get a proper error in case of an unrecognized +datum field:

t = try(st_crs("+proj=longlat +datum=NAD26")) attr(t, "condition") ## <simpleError in make_crs(x): invalid crs: +proj=longlat +datum=NAD26, reason: unknown elliptical parameter name> plotting

The default plot method for sf objects (simple features with attributes, or data.frames with a simple feature geometry list-column) now plots the set of maps, one for each attribute, with automatic color scales:

nc = st_read(system.file("gpkg/nc.gpkg", package="sf"), quiet = TRUE) plot(nc)

well, that is all there is, basically. For plotting a single map, select the appropriate attribute


or only the geometry:



Package sf gained a function st_graticule to generate graticules, grids formed by lines with constant longitude or latitude. Suppose we want to project nc to the state plane, and plot it with a longitude latitude graticule in NAD27 (the original datum of nc):

nc_sp = st_transform(nc["SID79"], 32119) # NC state plane, m plot(nc_sp, graticule = st_crs(nc), axes = TRUE)

The underlying function, st_graticule, can be used directly to generate a simple object with graticules, but is rather meant to be used by plotting functions that benefit from a graticule in the background, such as plot or ggplot. The function provides the end points of graticules and the angle at which they end; an example for using Lambert equal area on the USA is found in the help page of st_graticule:

The default plotting method for simple features with longitude/latitude coordinates is the equirectangular projection, (also called geographic projection, or equidistant cylindrical (eqc) projection) which linearly maps longitude and latitude into \(x\) and \(y\), transforming \(y\) such that in the center of the map 1 km easting equals 1 km northing. This is also the default for sp::plot, sp::spplot and ggplot2::coord_quickmap. The official Proj.4 transformation for this is found here.

We can obtain e.g. a plate carrée projection (with one degree latitude equaling one degree longitude) with

caree = st_crs("+proj=eqc") plot(st_transform(nc[1], caree), graticule = st_crs(nc), axes=TRUE, lon = -84:-76)

and we see indeed that the lon/lat grid is formed of squares.

The usual R plot for nc obtained by

plot(nc[1], graticule = st_crs(nc), axes = TRUE)

corrects for latitude. The equivalent, officially projected map is obtained by using the eqc projection with the correct latitude:

mean(st_bbox(nc)[c(2,4)]) ## [1] 35.23582 eqc = st_crs("+proj=eqc +lat_ts=35.24") plot(st_transform(nc[1], eqc), graticule = st_crs(nc), axes=TRUE)

so that in the center of these (identical) maps, 1 km east equals 1 km north.

geosphere and units support

sf now uses functions in package geosphere to compute distances or areas on the sphere. This is only possible for points and not for arbitrary feature geometries:

centr = st_centroid(nc) ## Warning in st_centroid.sfc(st_geometry(x)): st_centroid does not give ## correct centroids for longitude/latitude data st_distance(centr[c(1,10)])[1,2] ## 34093.21 m

As a comparison, we can compute distances in two similar projections, each having a different measurement unit:

centr.sp = st_transform(centr, 32119) # NC state plane, m (m <- st_distance(centr.sp[c(1,10)])[1,2]) ## 34097.54 m centr.ft = st_transform(centr, 2264) # NC state plane, US feet (ft <- st_distance(centr.ft[c(1,10)])[1,2]) ## 111868.3 US_survey_foot

and we see that the units are reported, by using package units. To verify that the distances are equivalent, we can compute

ft/m ## 1 1

which does automatic unit conversion before computing the ratio. (Here, 1 1 should be read as one, unitless (with unit 1)).

For spherical distances, sf uses geosphere::distGeo. It passes on the parameters of the datum, as can be seen from

st_distance(centr[c(1,10)])[1,2] # NAD27 ## 34093.21 m st_distance(st_transform(centr, 4326)[c(1,10)])[1,2] # WGS84 ## 34094.28 m

Other measures come with units too, e.g. st_area

st_area(nc[1:5,]) ## Units: m^2 ## [1] 1137388604 611077263 1423489919 694546292 1520740530

units vectors can be coerced to numeric by

as.numeric(st_area(nc[1:5,])) ## [1] 1137388604 611077263 1423489919 694546292 1520740530 type casting

With help from Mike Sumner and Etienne Racine, we managed to get a working st_cast, which helps converting one geometry in another.

casting individual geometries (sfg)

Casting individual geometries will close polygons when needed:

st_point(c(0,1)) %>% st_cast("MULTIPOINT") ## MULTIPOINT(0 1) st_linestring(rbind(c(0,1), c(5,6))) %>% st_cast("MULTILINESTRING") ## MULTILINESTRING((0 1, 5 6)) st_linestring(rbind(c(0,0), c(1,0), c(1,1))) %>% st_cast("POLYGON") ## POLYGON((0 0, 1 0, 1 1, 0 0))

and will warn on loss of information:

st_linestring(rbind(c(0,1), c(5,6))) %>% st_cast("POINT") ## Warning in st_cast.LINESTRING(., "POINT"): point from first coordinate only ## POINT(0 1) st_multilinestring(list(matrix(1:4,2), matrix(1:6,,2))) %>% st_cast("LINESTRING") ## Warning in st_cast.MULTILINESTRING(., "LINESTRING"): keeping first ## linestring only ## LINESTRING(1 3, 2 4) casting sets of geometries (sfc)

Casting sfc objects can group or ungroup geometries:

# group: st_sfc(st_point(0:1), st_point(2:3), st_point(4:5)) %>% st_cast("MULTIPOINT", ids = c(1,1,2)) ## Geometry set for 2 features ## geometry type: MULTIPOINT ## dimension: XY ## bbox: xmin: 0 ymin: 1 xmax: 4 ymax: 5 ## epsg (SRID): NA ## proj4string: NA ## MULTIPOINT(0 1, 2 3) ## MULTIPOINT(4 5) # ungroup: st_sfc(st_multipoint(matrix(1:4,,2))) %>% st_cast("POINT") ## Geometry set for 2 features ## geometry type: POINT ## dimension: XY ## bbox: xmin: 1 ymin: 3 xmax: 2 ymax: 4 ## epsg (SRID): NA ## proj4string: NA ## POINT(1 3) ## POINT(2 4)

st_cast with no to argument will convert mixes of GEOM and MULTIGEOM to MULTIGEOM, where GEOM is POINT, LINESTRING or POLYGON, e.g.

st_sfc( st_multilinestring(list(matrix(5:8,,2))), st_linestring(matrix(1:4,2)) ) %>% st_cast() ## Geometry set for 2 features ## geometry type: MULTILINESTRING ## dimension: XY ## bbox: xmin: 1 ymin: 3 xmax: 6 ymax: 8 ## epsg (SRID): NA ## proj4string: NA ## MULTILINESTRING((5 7, 6 8)) ## MULTILINESTRING((1 3, 2 4))

or unpack geometry collections:

x <- st_sfc( st_multilinestring(list(matrix(5:8,,2))), st_point(c(2,3)) ) %>% st_cast("GEOMETRYCOLLECTION") x ## Geometry set for 2 features ## geometry type: GEOMETRYCOLLECTION ## dimension: XY ## bbox: xmin: 2 ymin: 3 xmax: 6 ymax: 8 ## epsg (SRID): NA ## proj4string: NA ## GEOMETRYCOLLECTION(MULTILINESTRING((5 7, 6 8))) ## GEOMETRYCOLLECTION(POINT(2 3)) x %>% st_cast() ## Geometry set for 2 features ## geometry type: GEOMETRY ## dimension: XY ## bbox: xmin: 2 ymin: 3 xmax: 6 ymax: 8 ## epsg (SRID): NA ## proj4string: NA ## MULTILINESTRING((5 7, 6 8)) ## POINT(2 3) casting on sf objects

The casting of sf objects works in principle identical, except that for ungrouping, attributes are repeated (and might give rise to warning messages),

# ungroup: st_sf(a = 1, geom = st_sfc(st_multipoint(matrix(1:4,,2)))) %>% st_cast("POINT") ## Warning in st_cast.sf(., "POINT"): repeating attributes for all sub- ## geometries for which they may not be constant ## Simple feature collection with 2 features and 1 field ## geometry type: POINT ## dimension: XY ## bbox: xmin: 1 ymin: 3 xmax: 2 ymax: 4 ## epsg (SRID): NA ## proj4string: NA ## c.1..1. geom ## 1 1 POINT(1 3) ## 2 1 POINT(2 4)

and for grouping, attributes are aggregated, which requires an aggregation function

# group: st_sf(a = 1:3, geom = st_sfc(st_point(0:1), st_point(2:3), st_point(4:5))) %>% st_cast("MULTIPOINT", ids = c(1,1,2), FUN = mean) ## Simple feature collection with 2 features and 2 fields ## geometry type: MULTIPOINT ## dimension: XY ## bbox: xmin: 0 ymin: 1 xmax: 4 ymax: 5 ## epsg (SRID): NA ## proj4string: NA ## a geom ## 1 1 1.5 MULTIPOINT(0 1, 2 3) ## 2 2 3 MULTIPOINT(4 5) type selection

In case we have a mix of geometry types, we can select those of a particular geometry type by the new helper function st_is. As an example we create a mix of polygons, lines and points:

g = st_makegrid(n=c(2,2), offset = c(0,0), cellsize = c(2,2)) s = st_sfc(st_polygon(list(rbind(c(1,1), c(2,1),c(2,2),c(1,2),c(1,1))))) i = st_intersection(st_sf(a=1:4, geom = g), st_sf(b = 2, geom = s)) ## Warning in st_intersection(st_sf(a = 1:4, geom = g), st_sf(b = 2, geom = ## s)): attribute variables are assumed to be spatially constant throughout ## all geometries i ## Simple feature collection with 4 features and 2 fields ## geometry type: GEOMETRY ## dimension: XY ## bbox: xmin: 1 ymin: 1 xmax: 2 ymax: 2 ## epsg (SRID): NA ## proj4string: NA ## a b geometry ## 1 1 2 POLYGON((2 2, 2 1, 1 1, 1 2... ## 2 2 2 LINESTRING(2 2, 2 1) ## 3 3 2 LINESTRING(1 2, 2 2) ## 4 4 2 POINT(2 2)

and can select using dplyr::filter, or directly using st_is:

filter(i, st_is(geometry, c("POINT"))) ## Simple feature collection with 1 feature and 2 fields ## geometry type: GEOMETRY ## dimension: XY ## bbox: xmin: 1 ymin: 1 xmax: 2 ymax: 2 ## epsg (SRID): NA ## proj4string: NA ## a b geometry ## 1 4 2 POINT(2 2) filter(i, st_is(geometry, c("POINT", "LINESTRING"))) ## Simple feature collection with 3 features and 2 fields ## geometry type: GEOMETRY ## dimension: XY ## bbox: xmin: 1 ymin: 1 xmax: 2 ymax: 2 ## epsg (SRID): NA ## proj4string: NA ## a b geometry ## 1 2 2 LINESTRING(2 2, 2 1) ## 2 3 2 LINESTRING(1 2, 2 2) ## 3 4 2 POINT(2 2) st_is(i, c("POINT", "LINESTRING")) ## [1] FALSE TRUE TRUE TRUE

OpenEO: a GDAL for Earth Observation Analytics

Tue, 11/29/2016 - 01:00

Earth observation data, or satellite imagery, is one of the richest sources to find out how our Earth is changing. The amount of Earth observation data we collect Today has become too large to analyze on a single computer. Although most of the Earth observation data is available for free, practical difficulties we are currently facing when we try to analyze it seriously constrains the potential benefits for citizens, industry, scientists, or society. How did we get here?

GIS: the 80’s

To understand the current difficulty when analyzing big Earth observation data analysis, let us look how geographic information systems (GIS) developed over the past decades. In the early days, they would be isolated structures:

where one would get things done in isolation, without any chance of verifying or comparing it with another system: these were expensive systems, hard to set up and maintain, and (with the exception of GRASS) closed databases and closed source software.

File formats: the 90’s

In the 90’s, file formats came up: several systems started supporting various file formats, and dedicated programs that would do certain file format conversions became available. This made many new things possible, such as the integration of S-Plus with Arc-Info, but to fully realize this, each software would have to implement drivers for every file format. Developing new applications or introducing a new file format are both difficult in this model.

GDAL: the 00’s

Then came GDAL! This Geospatial Data Abstraction Layer is a software library that reads and writes raster and vector data. Instead of having to write drivers for each file format, application developers needed to only write a GDAL client driver. When proposing a new file format, instead of having to convince many application developers to support it, only a GDAL driver for the new format was required to realize quick adoption. Instead of many-to-many links, only many-to-one links were needed:

R and python suddenly became strong GIS and spatial modelling tools. ArcGIS users could suddenly deal with the weird data formats from hydrologists, meteorologists, and climate scientists. Developing innovative applications and introducing new file formats became attractive.

For analyzing big Earth observation data, GDAL has its limitations, including:

  • it has weak data semantics: the data model does not include observation time or band wavelength (color), but addresses bands by dataset name and band number,
  • raster datasets cannot tell whether pixels refer to points, cells with constant value, or cells with an aggregated value; most regridding methods silently assume the least likely option (points),
  • the library cannot do much processing, meaning that clients that do the processing and use GDAL for reading and writing need to be close to where the data is, which is far away from the user’s computer.
Big Earth Observation data: the 10’s

We currently see a plethora of cloud environments that all try to solve the problem of how to effectively analyze Earth Observation data that are too large to download and process locally. It very much reminds of the isolated GIS of the 80’s, in the sense that strongly differing systems have been built with large efforts, and when solving a certain problem in one system it is practically impossible to try to solve it in another system too. Several systems work with data processing back-ends that are not open source. The following figure only shows a few systems for illustration:

Open Earth Observation Science

For open science, open data is a necessary but not sufficient condition. In order to fight the reproducibility crisis, analysis procedures need to be fully transparent and reproducible. To get there, it needs to be simple to execute a given analysis on two different data centers to verify that they yield identical results. Today, this sounds rather utopian. On the other hand, we believe that one of the reasons that data science has taken off is that the software making it possible (e.g. python, R, spark) was able to hide irrelevant details and had reached at a stage where it was transparent, stable, and robust.

Google Earth Engine is an excellent example of an interface that is simple, in the sense that users can directly address sensors and observation times and work with composites rather than having to comb through the raw data consisting of large collections of files (“scenes”). The computational back-end however is not transparent, it lets the user execute functions as far as they are provided by the API, but not inspect the source code of these function, or modify them.

How then can we make progress out of a world of currently incompatible, isolated Earth Observation data center efforts? Learning from the past, the solution might be a central interface between users who are allowed to think in high-level objects (“Landsat 7 over Western Europe, 2005-2015”) and a set of computational Earth Observation data centers that agree on supporting this interface. A GDAL for Earth Observation Analytics, so to speak. We’ll call it OpenEO.

Open EO Interface: the 20’s

The following figure shows schematically how this could look like: Users write scripts in a neutral language that addresses the data and operations, but ignores the computer architecture of the back-end. The back-end carries out the computations and returns results. It needs to be able to identify users (e.g. verify its permission to use it), but also to tell which data they provide and which computational operations they afford.

More in detail, and following the GDAL model closely, the architecture contains of

  • a client and back-end neutral set of API’s for both sides, defining which functions users can call, and which functions back-ends have to obey to,
  • a driver for each back-end that translates the neutral requirements into the platform specific offerings, or information that a certain function is not available (e.g. creating new datasets in a write-only back-end)
  • a driver for each client that binds the interfaces to the particular front-end such as R or python

With an architecture like this, it would not only become easy and attractive for users to change from one back-end to the other, but also make it much easier to choose between back-ends, because the value propositions of the back-end providers are now comparable, on paper as well as in practice.

A rough idea of the architecture is shown in this figure:

One of the drivers will interface collections of scenes in a back-end, or on the local hard drive. GDAL will remain playing an important role in back-ends if the back-end uses files in a format supported by GDAL, and in the front-end if the (small) results of big computations are fetched.

The way forward

We, as the authors of this piece, have started to work on these ideas in current activities, projects, and proposals. We are planning to submit a proposal to the EC H2020 call EO-2-2017: EO Big Data Shift. We hope that the call writers (and the reviewers) have the same problem in mind as what we explain above.

We are publishing this article because we believe this work has to be done, and only a concerted effort can realize it. When you agree and would like to participate, please get in touch with us. When successful, it will in the longer run benefit science, industry, and all of us.

Earlier blogs related to this topic

Earlier posts related to this:

Simple features now on CRAN

Wed, 11/02/2016 - 01:00

Submitting a package to CRAN is always an adventure, submitting a package with lots of external dependencies even more so. A week ago I submitted the simple features for R package to CRAN, and indeed, hell broke loose! Luckily, the people behind CRAN are extremely patient, friendly and helpful, but they let your code test on a big server farm with computers in 13 different flavors.

Of course we test code on linux and windows after every code push to github, but that feels like talking to a machine, like remotely compiling and testing. CRAN feels different: you first need to manually confirm that you did your utmost best to solve problems, and then there is a person telling you everything still remaining! Of course, this is of incredible help, and a big factor in the R community’s sustainability.

Package sf is somewhat special in that it links to GEOS and GDAL, and in particular GDAL links, depending on how it is installed, to many (77 in my case) other libraries, each with their own versions. After first submission of sf 0.2-0, I ran into the following issues with my code.

sf 0.2-0
  • I had to change all links starting with into A direct link to a units vignette on CRAN had to be removed.
  • some of the tests gave very different output, because my default testing platforms (laptop, travis) have PostGIS, and CRAN machines don’t; I changed this so that testing without PostGIS (CRAN) is now most silent
  • the tests still output differences in GDAL and GEOS versions, but that was considered OK.

That was it! The good message

Thanks, on CRAN now. Best -k

arrived! Party time! Too early. In the evening (CRAN never sleeps) an email arrived, mentioning:

This worked for my incoming checks, but just failed for the regular checks, with Error in loadNamespace(name) : there is no package called 'roxygen2' Calls: :: ... tryCatch -> tryCatchList -> tryCatchOne -> <Anonymous> Execution halted For some reason roxygen2 is not working. Is it installed? ERROR: configuration failed for package ‘sf’

with lots of helpful hints. Indeed, my package generated Rcpp files and manual pages dynamically during install; this requires Rcpp and roxygen2 to be available unconditionally and they aren’t.

So I sat down and worked on 0.2-1, to address this. Before I could do that, an email from Oxford (Brian Ripley) arrived, telling me that sf had caused some excitement in the multi-flavor server farm:

Here, it should be noted (again) that the only two NOTEs were due to the excellent work of Jeroen Ooms who compiled GDAL and many other libraries for rwinlib, and prepared sf for downloading and using them. The rest was my contribution.

In addition, an issue was raised by Dirk Eddelbuettel, telling me that his Rcpp reverse check farm had noted him that sf required GDAL 2.0 or later, but not by properly checking its version but by generating plane compile errors. The horror, the horror.

sf 0.2-1: roxygen, Rcpp, SQLITE on Solaris

sf 0.2-1 tried to address the Rcpp and roxygen2 problems: I took their generation out of the configure and scripts. I added all automatically derived files to the github repo, to get everything in sync. Worked:

Thanks, on CRAN now. [Tonight I'll know for sure ...] Best -k

… and no emails in the evening.

Also, the errors on Solaris platforms were caused by the SQLITE library not being present, hence GeoPackage not being available as a GDAL driver. As a consequence, I had to set back the examples reading a GeoPackage polygons file to one where a shapefile is read. Bummer.

Another issue was related to relying on GDAL 2.1 features without testing for it; this was rather easily solved by conditional compiling.

This gave:

meaning SOME improvement, but where do these UBSAN reports suddenly come from!?

sf 0.2-2: byte swapping, memory alignment

Over-optimistically as I am, I had commented that life is too short to do byte swapping when reading WKB. Welcome to the CRAN server farm. Solaris-Spark is big-endian. Although the R code reading WKB does read non-native endian, this would have required to rewrite all tests, so I added byte swapping to the C++ code, using a helpful SO post.

The UBSAN issues all listed something like:

UBSAN: don't assume that pointers can point anywhere to, in valid memory; wkb.cpp:185:39: runtime error: load of misaligned address 0x6150001d0e29 for type 'uint32_t', which requires 4 byte alignment 0x6150001d0e29: note: pointer points here 00 00 00 01 06 00 00 00 01 00 00 00 01 03 00 00 00 01 00 00 00 1b 00 00 00 00 00 00 a0 41 5e 54

Sounds scary? What I had done was for every coordinate use a double pointer, pointing it to the right place in the WKB byte stream, then copy its value, and move it 8 bytes. I love this *d++ expression. But you can’t do this anymore! Although the code worked on my machines, you can’t put a double pointer to any location and assume it’ll work everywhere. The solution was to memcpy the relevant bytes to a double value on the stack, and copy that into the Rcpp::NumericVector.

To be done:

All these changes have brought me here:

where you see that linux and Windows compile (all NOTEs indicate that the library is too large, which is unavoidable with GDAL) and that errors are Mac related:

r-devel-macos-x86_64-clang ** testing if installed package can be loaded Error in dyn.load(file, DLLpath = DLLpath, ...) : unable to load shared object '/Users/ripley/R/packages/tests-devel/sf.Rcheck/sf/libs/': dlopen(/Users/ripley/R/packages/tests-devel/sf.Rcheck/sf/libs/, 6): Symbol not found: _H5T_NATIVE_DOUBLE_g Referenced from: /Users/ripley/R/packages/tests-devel/sf.Rcheck/sf/libs/ Expected in: flat namespace in /Users/ripley/R/packages/tests-devel/sf.Rcheck/sf/libs/

which indicates a problem with GDAL not linking to the HDF5 library (unsolved).

The second:

r-release-osx-x86_64-mavericks checking gdal-config usability... yes configure: GDAL: 1.11.4 checking GDAL version >= 2.0.0... yes

indicates that

  • this platform still runs GDAL 1.x, so needs to be upgraded and that
  • my check for GDAL version present on the system still does not work!
CRAN flavors

CRAN flavors is a great asset that teaches you problems of all kinds at an early stage. Without it, users would have run at some stage into problems that are now caught up front. Thanks to the tremendous effort of the CRAN team!!

UPDATE, Dec 21, 2016
  • Thanks to Roger Bivand, Simon Urbanek and Brian Ripley for constructive help, the MacOSX-mavericks binary build is now on CRAN, the issue is described here