Using data.table and Rcpp to scale geo-spatial analysis with sf
The background
At the beginning of 2017 I left academia to work in industry. Luckily, I found a great job as a geo-spatial data analyst at GfK Geomarketing, where we do all sorts of spatial data analyses, including micro-geographic studies. These micro-geographic projects can be challenging because data sets can grow very large very quickly, and thus performance becomes an important issue.
The tweet
Recently, I tweeted about some success I had in combining R packages sf, data.table and Rcpp to enable speedy geo-spatial analysis and Matt Harris kindly suggested that a write up of this would be well received.
A blog post about that process would be well received!
— Matt Harris (@Md_Harris) October 26, 2017
So here we go.
The scenario
The project that stimulated my tweet aims at mapping certain administrative polygon attributes to road network linestrings. For lines located within polygons this can easily be done with a (potentially weighted) spatial join - st_join
from package sf. Frequently, however, administrative borders follow roads/streets, which means that the two sides of a road will fall into two different administrative areas. This is where the challenge begins: how can we map different attributes from both sides to one linestring?
The challenge
We need to come up with a way of extracting the correct administrative feature data for each road in our area of interest, regardless of whether the road lies entirely within one administrative area or marks the border between two. One important piece of information here is that road segments are defined as the (multi-)linestring between two intersections. This means that we can represent the road segments as points onto which we can map the polygon feature attributes. This approach, however, still leaves us with the problem of mapping data from two different administrative areas onto one road segment point whenever that point represents a road along an administrative border. The idea to solve this is to establish perpendicular points on either side of the road at the location of the point representing the road segment.
The calculation of these perpendicular points is what I referred to in my tweet and will be the scope of this blog post.
The plan of action
What are the steps necessary to calculate perpendicular points for a set of linestrings at a given location? We can break this down into four major steps:
- identify the point that represents a road segment - we’re going to take the center point (the line “centroid”)
- identify the line segment that this point falls on
- calculate the bearing (angle to projected north) of this line segment
- calculate the two points at 90 degree angles on either side of this line segment for a given distance from the line at the location of the center point
Note that the underlying assumption for all that follows is that we do our calculations in 2-D Euclidean space, hence we need projected coordinates, not longlat!
The code
Let’s tackle this. We will develop our solution using data that comes with package mapview and later test its performance on a largish road network from OpenStreetMap.
I have uploaded a few helper functions as gists to do Pythagorean calculations in the 2-D plane, in both R and C++.
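To give an idea of the geometry these helpers implement: for a segment direction vector (dx, dy) with length len = sqrt(dx^2 + dy^2), the two unit vectors perpendicular to it are (dy, -dx) / len and (-dy, dx) / len. Scaling these by the desired distance gives the offsets for the right-hand and left-hand perpendicular points. Here is a minimal, purely illustrative sketch of that idea (the gist implementations may differ in details such as sign conventions and return format):
## purely illustrative sketch, not the gist code: offsets perpendicular to a
## direction vector (dx, dy) at distance d
perp_offsets_sketch = function(dx, dy, d) {
  len = sqrt(dx^2 + dy^2)  # Pythagorean length of the direction vector
  cbind(right_x = d * dy / len, right_y = -d * dx / len,
        left_x = -d * dy / len, left_y = d * dx / len)
}
perp_offsets_sketch(3, 4, 10)
## -> right offset (8, -6), left offset (-8, 6) for a 3-4-5 direction vector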
First of all, let's load the packages and source the functions we need to do all the calculations. Note that the R versions of the functions will be in snake_case and the C++ versions in lowerCamelCase.
library(mapview)
library(sf)
library(data.table)
library(devtools)
### read and save cpp functions from github gist
cpp_functions = readLines(
"https://gist.githubusercontent.com/tim-salabim/561afd611fea2acdd14b4b6acce1bc8e/raw/32e914d26b117714039f78acb99189467eddfddd/perpPoints.cpp"
)
fname = file.path(tempdir(), "perpPoints.cpp")
writeLines(cpp_functions, fname)
### now source cpp file to get functionality
Rcpp::sourceCpp(file = fname)
### source R version of the functions directly from gist
devtools::source_gist("387c6179faf826f810938a542d1f55ca", filename = "perp_points.R")
1. R function to calculate perpendicular points
A common approach to problems like this one is to create a function that calculates the desired outcome for a single object and then apply that function iteratively to all objects as necessary. In many cases functions can be written so that they are vectorised; in our case, however, this will turn out to be a challenge. In any case, we will first create a function, written purely in R, that calculates the perpendicular points of a single linestring, so that we understand the process. We will add an additional argument called perp_dist
to supply the distance of the perpendicular points from the line in units of the projection.
st_perp_points_r = function(line, perp_dist = 10) {
## calculate line centroid coordinates
mp_crds = sf::st_coordinates(sf::st_line_sample(line, n = 1))
## identify start node of centroid segment via cumulative sum of segment lengths
crd = sf::st_coordinates(line)
cs = cumsum(calc_len(c(0, diff(crd[, "X"])),
c(0, diff(crd[, "Y"]))))
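## idx = last vertex before the half-length point, i.e. the start node of the segment containing the centroid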
mx = which((cs / max(cs)) < 0.5)
idx = mx[length(mx)]
## calculate perpendicular points
diff_x = mp_crds[, "X"] - crd[idx, "X"]
diff_y = mp_crds[, "Y"] - crd[idx, "Y"]
perp_r_offset = calc_perp_point_r(diff_x, diff_y, perp_dist)
rownames(perp_r_offset) = NULL
perp_l_offset = calc_perp_point_l(diff_x, diff_y, perp_dist)
rownames(perp_l_offset) = NULL
prx = crd[idx, "X"] + perp_r_offset[, 1]
pry = crd[idx, "Y"] + perp_r_offset[, 2]
plx = crd[idx, "X"] + perp_l_offset[, 1]
ply = crd[idx, "Y"] + perp_l_offset[, 2]
## right side
perp_r = data.frame(L1 = unique(crd[, "L1"]),
prx = prx,
pry = pry)
names(perp_r) = c("ID", "X", "Y")
perp_r$side = "R"
rownames(perp_r) = NULL
## left side
perp_l = data.frame(L1 = unique(crd[, "L1"]),
plx = plx,
ply = ply)
names(perp_l) = c("ID", "X", "Y")
perp_l$side = "L"
rownames(perp_l) = NULL
## center
m = data.frame(L1 = unique(crd[, "L1"]),
ptx = mp_crds[, "X"],
pty = mp_crds[, "Y"])
names(m) = c("ID", "X", "Y")
rownames(m) = NULL
m$side = "C"
res = rbind(m, perp_r, perp_l)
res = sf::st_as_sf(res,
coords = c("X", "Y"),
crs = sf::st_crs(line))
return(res)
}
Let’s check whether this works
lines = sf::st_cast(trails, "LINESTRING")
## Warning in st_cast.sf(trails, "LINESTRING"): repeating attributes for all
## sub-geometries for which they may not be constant
line = lines[7, ]
pp = st_perp_points_r(line, perp_dist = 10)
plot(sf::st_geometry(line))
plot(pp, add = TRUE)
## Warning in plot.sf(pp, add = TRUE): ignoring all but the first attribute
Sweet! So now we have a function to calculate perpendicular points for a given linestring. However, we want to be able to do this for objects with multiple linestrings. For this we need to loop through them line by line, as we rely on calculations based on the coordinates of the individual line vertices, and there is no obvious way of vectorising this (correct me if I’m wrong).
Let’s see how long this will take for the entire trails data set (979 linestrings when cast).
(duration_r = system.time({
pp_r = do.call(rbind, lapply(seq(nrow(lines)), function(i) {
st_perp_points_r(lines[i, ], perp_dist = 10)
}))
}))
## user system elapsed
## 17.802 0.012 17.814
17.814 seconds for about 1000 linestrings is unacceptably slow, especially given that we want to do this calculation for road networks of entire countries! It is hardly surprising that this approach is slow, as we are looping through our data line by line in R using lapply
. There is some additional overhead from converting the result of each iteration to an sf
object instead of doing this in one fell swoop after we have processed all linestrings.
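To illustrate that last point: one could stop the per-line worker before the final st_as_sf call, have it return the plain data.frame res instead (let's call that hypothetical variant st_perp_points_df), and convert to sf only once after the loop. A sketch under that assumption:
## sketch: st_perp_points_df is a hypothetical variant of st_perp_points_r that
## returns the plain data.frame 'res' instead of an sf object, so the sf
## conversion happens only once, after all linestrings have been processed
pp_df = do.call(rbind, lapply(seq(nrow(lines)), function(i) {
  st_perp_points_df(lines[i, ], perp_dist = 10)
}))
pp_sf = sf::st_as_sf(pp_df, coords = c("X", "Y"), crs = sf::st_crs(lines))
This removes the per-iteration conversion, but we are still looping through the features one by one in R.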
Thus, we need a different approach. In particular, we need to get rid of the do.call(rbind, lapply(...
approach in order to speed things up!
2. Using data.table
One well-known and well-tested remedy for such situations is to use package data.table for group-wise calculations. So let's write a function that takes not just a single linestring as input but an object with any number of linestrings, and does the looping internally.
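In case you are not familiar with data.table's by-group syntax, here is a toy example of the core pattern the function below relies on: a cumulative sum calculated per linestring ID (L1), which we then use to find the segment that contains the line centroid.
## toy example (not part of the analysis): cumulative segment lengths per
## linestring ID (L1) via data.table's by-group calculation
dt = data.table::data.table(L1 = c(1, 1, 1, 2, 2),
                            X  = c(0, 3, 3, 0, 4),
                            Y  = c(0, 0, 4, 0, 0))
dt[, cs := cumsum(sqrt(c(0, diff(X))^2 + c(0, diff(Y))^2)), by = L1]
## cs is now 0, 3, 7 for linestring 1 and 0, 4 for linestring 2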
st_perp_points_dt = function(lines, perp_dist = 10) {
mp_crds = sf::st_coordinates(sf::st_line_sample(lines, n = 1))
crd = data.table::data.table(sf::st_coordinates(lines))
data.table::setkey(crd, L1)
crd[, id := 1:nrow(crd)]
crd[, cs := cumsum(
calc_len(c(0, diff(X)), c(0, diff(Y)))
), by = data.table::rleid(L1)]
crd[, mx := max(cs), by = data.table::rleid(L1)]
crd[, fr := cs / mx, by = data.table::rleid(L1)]
tmp = crd[fr < 0.5, ]
data.table::setkey(tmp, L1)
tmp = tmp[J(unique(L1)), mult = "last"]
tmp$ptx = mp_crds[, "X"]
tmp$pty = mp_crds[, "Y"]
tmp$dx = tmp$ptx - tmp$X
tmp$dy = tmp$pty - tmp$Y
perp_r_offset = calc_perp_point_r(tmp$dx, tmp$dy, perp_dist)
perp_l_offset = calc_perp_point_l(tmp$dx, tmp$dy, perp_dist)
tmp$prx = tmp$X + perp_r_offset[, 1]
tmp$pry = tmp$Y + perp_r_offset[, 2]
tmp$plx = tmp$X + perp_l_offset[, 1]
tmp$ply = tmp$Y + perp_l_offset[, 2]
## right side
perp_r = tmp[, c("L1", "prx", "pry")]
names(perp_r) = c("ID", "X", "Y")
perp_r$side = "R"
## left side
perp_l = tmp[, c("L1", "plx", "ply")]
names(perp_l) = c("ID", "X", "Y")
perp_l$side = "L"
## center
m = tmp[, c("L1", "ptx", "pty")]
names(m) = c("ID", "X", "Y")
m$side = "C"
res = rbind(m, perp_r, perp_l)
res = sf::st_as_sf(data.table::setorder(res, ID),
coords = c("X", "Y"),
crs = sf::st_crs(lines))
return(res)
}
(duration_dt = system.time({
pp_dt = st_perp_points_dt(lines, perp_dist = 10)
}))
## user system elapsed
## 0.395 0.000 0.332
0.332 seconds is a huge improvement!
Still, why stop here? Maybe there is a way to make this even faster? Let’s take a look at the call stack to figure out where most of the time is spent when executing this function. For this we are going to use the profvis package.
profvis::profvis({
st_perp_points_dt(lines, perp_dist = 10)
Sys.sleep(0.1)
})
It turns out that about 60% of the time is spent on identifying the line centroids via st_line_sample
, and within that we see an lapply
call and some conversions using the units package. st_line_sample
is a very flexible function that allows sampling points along a line in many different ways, at pretty much any combination of distances (or intervals). In our case, however, we are only interested in one point sample at the center of the line, so maybe we can get rid of some of the sugar that provides flexibility we don’t need and optimise it for our purposes.
3. Optimising the function
The source code of st_line_sample
can be found here. It turns out we really only need 3 lines of it
for our purposes.
st_line_centroid = function(x) {
l = sf::st_length(x) * 0.5
x = sf::st_geometry(x)
sf::st_sfc(sf:::CPL_gdal_linestring_sample(x, l), crs = sf::st_crs(x))
}
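As a quick sanity check (a sketch, using the example line from above), this slimmed-down version should return the same point as st_line_sample with n = 1, up to floating point precision:
## sketch: compare the slimmed-down centroid function against st_line_sample
all.equal(
  sf::st_coordinates(st_line_centroid(line)),
  sf::st_coordinates(sf::st_line_sample(line, n = 1))
)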
So let’s put this all together and see how much performance we can gain. Note that we also use the C++ Pythagorean functions in this final optimised version. We shall call it st_perp_points
.
st_perp_points = function(lines, perp_dist = 10) {
mp_crds = sf::st_coordinates(st_line_centroid(lines))
crd = data.table::data.table(sf::st_coordinates(lines))
data.table::setkey(crd, L1)
crd[, id := 1:nrow(crd)]
crd[, cs := cumsum(
calcLen(c(0, diff_cpp(X)), c(0, diff_cpp(Y)))
), by = data.table::rleid(L1)]
crd[, mx := max(cs), by = data.table::rleid(L1)]
crd[, fr := cs / mx, by = data.table::rleid(L1)]
tmp = crd[fr < 0.5, ]
data.table::setkey(tmp, L1)
tmp = tmp[J(unique(L1)), mult = "last"]
tmp$ptx = mp_crds[, "X"]
tmp$pty = mp_crds[, "Y"]
tmp$dx = tmp$ptx - tmp$X
tmp$dy = tmp$pty - tmp$Y
perp_r_offset = calcPerpPointR(tmp$dx, tmp$dy, perp_dist)
perp_l_offset = calcPerpPointL(tmp$dx, tmp$dy, perp_dist)
tmp$prx = tmp$X + perp_r_offset[, 1]
tmp$pry = tmp$Y + perp_r_offset[, 2]
tmp$plx = tmp$X + perp_l_offset[, 1]
tmp$ply = tmp$Y + perp_l_offset[, 2]
## right side
perp_r = tmp[, c("L1", "prx", "pry")]
names(perp_r) = c("ID", "X", "Y")
perp_r$side = "R"
## left side
perp_l = tmp[, c("L1", "plx", "ply")]
names(perp_l) = c("ID", "X", "Y")
perp_l$side = "L"
## center
m = tmp[, c("L1", "ptx", "pty")]
names(m) = c("ID", "X", "Y")
m$side = "C"
## combine and convert to sf
res = rbind(m, perp_r, perp_l)
res = sf::st_as_sf(data.table::setorder(res, ID),
coords = c("X", "Y"),
crs = sf::st_crs(lines))
return(res)
}
(duration = system.time({
pp = st_perp_points(lines, perp_dist = 10)
}))
## user system elapsed
## 0.178 0.000 0.118
0.118 seconds means that by using st_line_centroid
, our slimmed-down stand-in for st_line_sample, together with the C++ helpers, we've cut computation time to roughly a third (from 0.332 to 0.118 seconds elapsed). st_line_centroid
now only accounts for about 15% of the call stack.
profvis::profvis({
st_perp_points(lines, perp_dist = 10)
Sys.sleep(0.1)
})
The tweet revisited (QED)
So what about my tweeted claim that we can process about 600k linestrings in less than 2 minutes? Well, let’s try… Note that, for acceptable running time when knitting this document, we’re going to test performance on a slightly smaller road network of about 230k roads. Still, this will give us a good enough impression of the performance. The road network data used here can be downloaded from geofabrik.de, a great resource for pre-processed OpenStreetMap data. The code chunk below will download the relevant data set into a temporary file, unzip the contents into a temporary folder, read the relevant layer and delete everything afterwards.
## download Oberfranken OSM data from geofabrik.de and unzip
url = "http://download.geofabrik.de/europe/germany/bayern/oberfranken-latest-free.shp.zip"
tmpfile = tempfile()
download.file(url, tmpfile)
unzip(tmpfile, exdir = file.path(tempdir(), "shp_files"))
unlink(tmpfile, recursive = TRUE, force = TRUE)
## read and transform roads
roads = sf::st_read(file.path(tempdir(), "shp_files", "gis.osm_roads_free_1.shp"))
## Reading layer `gis.osm_roads_free_1' from data source `/tmp/RtmpiF8SFG/shp_files/gis.osm_roads_free_1.shp' using driver `ESRI Shapefile'
## Simple feature collection with 229839 features and 10 fields
## geometry type: LINESTRING
## dimension: XY
## bbox: xmin: 10.42777 ymin: 49.58355 xmax: 12.36851 ymax: 50.53482
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
roads = sf::st_transform(roads, crs = 3068)
unlink(file.path(tempdir(), "shp_files"), recursive = TRUE, force = TRUE)
## time perp point performance
(elapsed = system.time({
st_perp_points(roads)
}))
## user system elapsed
## 18.679 0.049 18.668
18.668 seconds for roughly 230k linestrings is about the same as our first implementation took for the roughly 1k linestrings of the trails data. So we have come quite far since then.
Also, I think we can agree that we can safely say
quod erat demonstrandum
The wrap up
To wrap it up, let’s compare our methods on this largish data set. Running 3 replications will give us timings approximating my claim regarding 600k linestrings. I dare not include the slow R lapply-based
implementation, but if you are curious and have a lot of time you can simply uncomment the first line in the benchmark
call to include these timings.
1. Calculation benchmark
library(rbenchmark)
benchmark(
# r_result <<- st_perp_points_r(roads),
dt_result <<- st_perp_points_dt(roads),
op_result <<- st_perp_points(roads),
replications = 3,
columns = c("test", "replications", "elapsed", "relative")
)
## test replications elapsed relative
## 1 dt_result <<- st_perp_points_dt(roads) 3 225.457 4.021
## 2 op_result <<- st_perp_points(roads) 3 56.076 1.000
We see that the optimised function using our minimalistic st_line_centroid
is about 4 times as fast as using st_line_sample
.
Finally, we’re going to check whether the results are equal.
all.equal(dt_result, op_result)
## [1] TRUE
And we also visually inspect the result.
mapview(head(roads)) +
mapview(head(op_result, 18), zcol = "side", legend = TRUE)
Looks as expected!
2. Write benchmark
Given that this post is concerned with performance issues of geo-spatial analysis, let’s go the whole nine yards (even though I’m a big fan of SI units). There has been a rather lengthy and very enlightening discussion at https://github.com/r-spatial/sf/issues/470 about write speed comparisons between the popular shapefile format and the much more modern geopackage format, which resulted in an update of gdal_write.cpp, the function that is used under the hood in st_write
. This has tremendously increased write speed for the geopackage driver, so that it is now on par with, and for large data sets even outperforms, the shapefile driver. Let’s benchmark this for our op_result
.
fldr = tempdir()
shp_dsn = file.path(fldr, "result.shp")
gpkg_dsn = file.path(fldr, "result.gpkg")
benchmark(
sf_shp = st_write(op_result, shp_dsn, delete_dsn = TRUE, quiet = TRUE),
sf_gpkg = st_write(op_result, gpkg_dsn, delete_dsn = TRUE, quiet = TRUE),
replications = 3,
columns = c("test", "replications", "elapsed", "relative")
)
## test replications elapsed relative
## 2 sf_gpkg 3 38.155 3.245
## 1 sf_shp 3 11.758 1.000
There are many reasons to switch from shapefile to more modern geo-spatial file formats like geopackage, and write speed should no longer be holding us back from making this switch.
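As a quick sketch of how painless the switch is on the read side as well, we can read the geopackage back in (using the gpkg_dsn defined above) and check that the number of features round-trips:
## sketch: read the geopackage back in and compare the number of features
res_back = sf::st_read(gpkg_dsn, quiet = TRUE)
nrow(res_back) == nrow(op_result)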
3. Final considerations
It may seem silly to compare an lapply-based
R implementation that loops through features with data.table and C++ implementations, but I think the path we’ve taken here nicely resembles a standard approach to tackling a problem in R: implement the desired functionality in R, test whether it scales, and if it doesn’t, profile and optimise by identifying and addressing the bottlenecks.
I am in no way claiming that the presented final solution is the most performant implementation that can possibly be coded. In fact, if you can come up with an even better performing solution, I’d love to hear about it.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rbenchmark_1.0.0 devtools_1.13.3 data.table_1.10.4
## [4] sf_0.5-5 mapview_2.2.1 leaflet_1.1.0.9000
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6
## [4] stats4_3.4.2 viridisLite_0.2.0 yaml_2.1.14
## [7] base64enc_0.1-3 R.oo_1.21.0 e1071_1.6-8
## [10] withr_2.0.0 DBI_0.7 R.utils_2.5.0
## [13] sp_1.2-5 foreach_1.4.3 plyr_1.8.4
## [16] stringr_1.2.0 munsell_0.4.3 raster_2.5-8
## [19] R.methodsS3_1.7.1 htmlwidgets_0.9 codetools_0.2-15
## [22] evaluate_0.10.1 memoise_1.1.0 knitr_1.17
## [25] httpuv_1.3.5 crosstalk_1.0.0 curl_2.8.1
## [28] gdalUtils_2.0.1.7 class_7.3-14 Rcpp_0.12.13
## [31] xtable_1.8-2 udunits2_0.13 scales_0.4.1.9002
## [34] backports_1.1.0 classInt_0.1-24 satellite_1.0.0
## [37] jsonlite_1.5 webshot_0.4.1 mime_0.5
## [40] png_0.1-7 digest_0.6.12 stringi_1.1.5
## [43] shiny_1.0.4 grid_3.4.2 rprojroot_1.2
## [46] rgdal_1.2-8 tools_3.4.2 magrittr_1.5
## [49] profvis_0.3.3 rmarkdown_1.6 httr_1.3.1
## [52] iterators_1.0.8 R6_2.2.2 units_0.4-6
## [55] compiler_3.4.2