Reading well-known-binary into R

This blog post describes ways to read binary simple feature data into R, and compares them.

WKB (well-known-binary) is the (ISO) standard binary serialization for simple features. You see it often printed in hexadecimal notation , e.g. in spatially extended databases such as PostGIS:

postgis=# SELECT 'POINT(1 2)'::geometry;
                  geometry                  
--------------------------------------------
 0101000000000000000000F03F0000000000000040
(1 row)

where the alternative form is the human-readable text (Well-known text) form:

postgis=# SELECT ST_AsText('POINT(1 2)'::geometry);
 st_astext  
------------
 POINT(1 2)
(1 row)

In fact, the WKB is the way databases store features in BLOBs (binary large objects). This means that, unlike well-known text, reading well-known binary involves

no loss of precision caused by text <–> binary conversion,
no conversion of data needed at all (provided the endianness is native)

As a consequence, it should be possible to do this blazingly fast. Also with R? And large data sets?

Three software scenarios

I compared three software implementations:

sf::st_as_sfc (of package sf) using C++ to read WKB
sf::st_as_sfc (of package sf) using pure R to read WKB (but C++ to compute bounding box)
wkb::readWKB (of package wkb) using pure R to read features into sp-compatible objects

Note that the results below were obtained after profiling, and implementing expensive parts in C++.

Three geometries

I created three different (sets of) simple features to compare read performance: one large and simple line, one data set with many small lines, and one multi-part line containing many sub-lines:

single LINESTRING with many points: a single LINESTRING with one million nodes (pionts) is read into a single simple feature
many LINESTRINGs with few points: half a million simple features of type LINESTRING are read, each having two nodes (points)
single MULTILINESTRING with many short lines: a single simple feature of type MULTILINESTRING is read, consisting of half a million line segments, each line segment consisting of two points.

A reproducible demo-script is found in the sf package here, and can be run by

devtools::install_github("edzer/sfr")
demo(bm_wkb)

Reported run times are in seconds, and were obtained by system.time().

single LINESTRING with many points

expression	user	system	elapsed
`sf::st_as_sfc(.)`	0.032	0.000	0.031
`sf::st_as_sfc(., pureR = TRUE)`	0.096	0.012	0.110
`wkb::readWKB(.)`	8.276	0.000	8.275

We see that for this case both sf implementations are comparable; this is due to the fact that the whole line of 16 Mb is read into R with a single readBin call: C++ can’t do this much faster.

I suspect wkb::readWKB is slower here because instead of reading a complete matrix in one step it makes a million calls to readPoint, and then merges the points read in R. This adds a few million function calls. Since only a single Line is created, not much overhead from sp can take place here.

Function calls, as John Chambers explains in Extending R, have a constant overhead of about 1000 instructions. Having lots of them may become expensive, if each of them does relatively little.

many LINESTRINGs with few points

expression	user	system	elapsed
`sf::st_as_sfc(.)`	1.244	0.000	1.243
`sf::st_as_sfc(., pureR = TRUE)`	55.004	0.056	55.063
`wkb::readWKB(.)`	257.092	0.192	257.291

Here we see a strong performance gain of the C++ implementation: all the object creation is done in C++, without R function calls. wkb::readWKB slowness may be largely due to overhead caused by sp: creating Line and Lines objects, object validation, computing bounding box.

I made the C++ and “pureR” implementations considerably faster by moving the bounding box calculation to C++. The C++ implementation was further optimized by moving the type check to C++: if a mix of types is read from a set of WKB objects, sfc will coerce them to a single type (e.g., a set of LINESTRING and MULTILINESTRING will be coerced to all MULTILINESTRING.)

single MULTILINESTRING with many short lines

expression	user	system	elapsed
`sf::st_as_sfc(.)`	0.348	0.000	0.348
`sf::st_as_sfc(., pureR = TRUE)`	24.088	0.008	24.100
`wkb::readWKB(.)`	87.072	0.004	87.074

Here we see again the cost of function calls: both “pureR” in sf and wkb::readWKB are much slower due to the many function calls; the latter also due to object management and validation in sp.

Discussion

Reading well-known binary spatial data into R can be done pretty elegantly by R, but in many scenarios can be much faster using C++. We observe speed gains up to a factor 250.