Spatial indexes give you fast results on spatial queries, such as finding whether pairs of geometries intersect or touch, or finding their intersection. They reduce the time to get results from quadratic in the number of geometries to linear in the number of geometries. A recent commit brings spatial indexes to sf for the binary logical predicates (intersects, touches, crosses, within, contains, contains_properly, overlaps, covers, covered_by, relate_pattern, equals, disjoint), as well as intersection, which returns intersecting geometries.

The spatial join function st_join uses a logical predicate to join features, and is also also affected by this speedup.

Antecedents

There have been attempts to use spatial planar indices, including enhancement issue sfr:76. In rgeos, GEOS STRtrees were used in rgeos/src/rgeos_poly2nb.c, which is mirrored in a modern Rcpp setting sf/src/geos.cpp, around lines 276 and 551. The STRtree is constructed by building envelopes (bounding boxes) of input entities, which are then queried for intersection with envelopes of another set of entities (in rgeos, R functions gUnarySTRtreeQuery and gBinarySTRtreeQuery). The use case was to find neighbours of all the about 90,000 US Census entities in Los Angeles, via spdep::poly2nb(), which received an argument to enter the candidate neighbours found by Unary querying the STRtree of entities by the same entities.

Benchmark

A simple benchmark shows the obvious: st_intersects without spatial index behaves quadratic in the number of geometries (black line), and is much faster for the case where a spatial index is created, stronger so for larger number of polygons:

first benchmark

The polygon datasets used are simple checker boards with square polygons (showing a nice Moiré pattern):

first benchmark

The black small square polygons are essentially matched to the red ones; the number of polygons along the x axis is the number of a single geometry set (black).

To show that the behaviour of intersects and intersection is indeed linear in the number of polygons, we show runtimes for both, as a function of the number of polygons (where intersection was divided by 10 for scaling purposes):

second benchmark

Implementation

Spatial indexes are available in the GEOS library used by sf, through the functions starting with STRtree. The algorithm implements a Sort-Tile-Recursive R-tree, according to the JTS documentation described in P. Rigaux, Michel Scholl and Agnes Voisard. Spatial Databases With Application To GIS. Morgan Kaufmann, San Francisco, 2002.

The sf implementation (some commits to follow this one) excludes some binary operations. st_distance, and st_relate without pattern, are excluded as these need to go through all combinations, returning a dense matrix. st_equals_exact is excluded as approximate matches are sought, without constraining the degree of approximation.

On which argument is an index built?

The R-tree is built on the first argument (x), and used to match all geometries over the second argument (y) of binary functions. This could give runtime differences, but for instance for the dataset that triggered this development in sfr:394, we see hardly any difference:

library(sf)
# Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.2 r15302
load("test_intersection.Rdata")
nrow(test)
# [1] 16398
nrow(hsg2)
# [1] 6869
system.time(int1 <- st_intersection(test, hsg2))
#    user  system elapsed 
# 105.712   0.040 105.758 
system.time(int2 <- st_intersection(hsg2, test))
#    user  system elapsed 
# 107.756   0.060 107.822 
# Warning messages:
# 1: attribute variables are assumed to be spatially constant throughout all geometries 
# 2: attribute variables are assumed to be spatially constant throughout all geometries 

The resulting feature sets int1 and int2 are identical, only the order of the features (records) and of the attribute columns (variables) differs. Runtime without index is 35 minutes, 20 times as long.

Is the spatial index always built?

In the current implemenation (version >= 0.5-1) the index is always built, except for st_distance, and st_relate without pattern. This means it is also built e.g. for rook or queen neighborhood selections (sfr:234), which use st_relate with a specified pattern.

What about prepared geometries?

Prepared geometries in GEOS are essentially indexes over single geometries and not over sets of geometries; they speed things up in particular when single geometries are very complex, and only for a single geometry to single geometry comparison. The spatial indexes are indexes over collections of geometries; they make a cheap preselection based on bounding boxes before the expensive pairwise comparison takes place.

Script used

The followinig script was used to create the benchmark plots. It no longer works; in the version where it worked, prepared = FALSE would take a branch where no trees were built, this is no longer the case.

 library(sf)
 sizes = c(10, 20, 50, 100, 160, 200)
 res = matrix(NA, length(sizes), 4)
 for (i in seq_along(sizes)) {
     g1 = st_make_grid(st_polygon(list(rbind(c(0,0),c(0,1),c(1,1),c(0,1),c(0,0)))), n = sizes[i]) * sizes[i]
     g2 = g1 + c(.5,.5)
     res[i, 1] = system.time(i1 <- st_intersects(g1, g2))[1]
     res[i, 2] = system.time(i2 <- st_intersects(g1, g2, prepare = FALSE))[1]
     res[i, 3] = system.time(i1 <- st_intersection(g1, g2))[1]
     res[i, 4] = identical(i1, i2)
 }
 plot(sizes^2, res[,2], type = 'b', ylab = 'time [s]', xlab = '# of polygons')
 lines(sizes^2, res[,1], type = 'b', col = 'red')
 legend("topleft", lty = c(1,1), col = c(1,2), legend = c("st_intersects without index", "st_intersects with spatial index"))
 plot(sizes^2, res[,3]/10, type = 'b', ylab = 'time [s]', xlab = '# of polygons')
 lines(sizes^2, res[,1], type = 'b', col = 'red')
 legend("topleft", lty = c(1,1), col = c(1,2), legend = c("st_intersection * 0.1", "st_intersects"))