Reading well-known-binary into R
This blog post describes ways to read binary simple feature data into R, and compares them.
WKB (well-known-binary) is the (ISO) standard binary serialization for simple features. You often see it printed in hexadecimal notation, e.g. in spatially extended databases such as PostGIS:
postgis=# SELECT 'POINT(1 2)'::geometry;
                  geometry
--------------------------------------------
 0101000000000000000000F03F0000000000000040
(1 row)
where the alternative form is the human-readable text (Well-known text) form:
postgis=# SELECT ST_AsText('POINT(1 2)'::geometry);
 st_astext
------------
 POINT(1 2)
(1 row)
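To make the relation between the two forms concrete, here is a minimal base R sketch that decodes the hexadecimal string shown above by hand. It only illustrates the WKB layout of a two-dimensional point (one byte-order byte, a 4-byte type code, two 8-byte doubles); it is not how any of the packages compared below do it, and the variable names are mine.

```r
# Hex string printed by PostGIS for POINT(1 2)
hex <- "0101000000000000000000F03F0000000000000040"

# Split into two-character byte pairs and convert to a raw vector
starts <- seq(1, nchar(hex), by = 2)
wkb <- as.raw(strtoi(substring(hex, starts, starts + 1), base = 16L))

# Byte 1: byte order flag (1 = little endian, 0 = big endian)
endian <- if (wkb[1] == as.raw(1)) "little" else "big"

# Bytes 2-5: geometry type code (1 = Point)
type <- readBin(wkb[2:5], what = "integer", n = 1, size = 4, endian = endian)

# Bytes 6-21: two 8-byte doubles, x and y
xy <- readBin(wkb[6:21], what = "double", n = 2, size = 8, endian = endian)

print(type)
print(xy)
```

The 21 bytes decode to type code 1 and the coordinates 1 and 2, matching the text form.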
In fact, WKB is the form in which databases store features in BLOBs (binary large objects). This means that, unlike well-known text, reading well-known binary involves
- no loss of precision caused by text <–> binary conversion,
- no conversion of data needed at all (provided the endianness is native)
As a consequence, it should be possible to do this blazingly fast. Also with R? And large data sets?
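The first point in the list above is easy to check in base R. The sketch below sends one double through a text round trip and through a binary round trip. The choice of 7 significant digits is an assumption for illustration (it is R's default for printing); a text writer that emits 17 significant digits would not lose precision, at the cost of longer strings.

```r
x <- 1 / 3

# Text round trip, keeping 7 significant digits
x_text <- as.numeric(format(x, digits = 7))

# Binary round trip: the 8 bytes of the double are copied unchanged
x_bin <- readBin(writeBin(x, raw()), what = "double", n = 1)

print(x_text == x)          # the text round trip changed the value
print(identical(x_bin, x))  # the binary round trip did not
```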
Three software scenarios
I compared three software implementations:
- sf::st_as_sfc (of package sf), using C++ to read WKB
- sf::st_as_sfc (of package sf), using pure R to read WKB (but C++ to compute the bounding box)
- wkb::readWKB (of package wkb), using pure R to read features into sp-compatible objects
Note that the results below were obtained after profiling, and implementing expensive parts in C++.
I created three different (sets of) simple features to compare read performance: one large and simple line, one data set with many small lines, and one multi-part line containing many sub-lines:
- single LINESTRING with many points: a single LINESTRING with one million nodes (points) is read into a single simple feature
- many LINESTRINGs with few points: half a million simple features of type LINESTRING are read, each having two nodes (points)
- single MULTILINESTRING with many short lines: a single simple feature of type MULTILINESTRING is read, consisting of half a million line segments, each line segment consisting of two points.
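For readers who want to experiment without the demo script, the following base R sketch shows how WKB for the first two cases could be built by hand. It is not the code used for the benchmark: `make_linestring_wkb` is a hypothetical helper, the sizes are kept tiny, and little-endian byte order is assumed. A WKB LINESTRING consists of the byte-order byte, the type code 2, the number of points, and then the coordinates as doubles.

```r
# Hypothetical helper: encode an n x 2 coordinate matrix as a WKB LINESTRING
make_linestring_wkb <- function(m) {
  c(as.raw(1),                                               # little endian
    writeBin(2L, raw(), size = 4, endian = "little"),        # type: LineString
    writeBin(nrow(m), raw(), size = 4, endian = "little"),   # number of points
    writeBin(as.double(t(m)), raw(), size = 8, endian = "little")) # x1, y1, x2, y2, ...
}

# Case 1 in miniature: one line with many points (here 5 instead of a million)
long_line <- make_linestring_wkb(cbind(1:5, 5:1))

# Case 2 in miniature: many lines with two points each (here 3 features)
short_lines <- lapply(1:3, function(i) {
  make_linestring_wkb(matrix(c(i, i, i + 1, i + 1), ncol = 2, byrow = TRUE))
})

print(length(long_line))            # 1 + 4 + 4 + 5 * 16 bytes
print(lengths(short_lines))         # 1 + 4 + 4 + 2 * 16 bytes each
```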
A reproducible demo-script is found in the sf package here, and can be run by
Reported run times are in seconds, and were obtained by
single LINESTRING with many points
We see that for this case both sf implementations are comparable; this is due to the fact that the whole line of 16 MB is read into R with a single readBin call: C++ can’t do this much faster.
wkb::readWKB is slower here because, instead of reading a complete matrix in one step, it makes a million calls to readPoint, and then merges the points read in R. This adds a few million function calls. Since only a single Line is created, not much overhead from sp can take place here.
Function calls, as John Chambers explains in Extending R, have a constant overhead of about 1000 instructions. Having lots of them may become expensive, if each of them does relatively little.
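The difference between the two reading strategies can be sketched in base R. The code below reads the same coordinate buffer once with a single readBin call and once with one small function call per point. `read_point` is a hypothetical stand-in, not the readPoint function of package wkb, and the number of points is kept small; wrapping each variant in system.time() with a larger `n` shows how the per-call overhead adds up.

```r
n <- 1000L                                  # number of points, small for illustration
coords <- as.numeric(seq_len(2L * n))       # x1, y1, x2, y2, ...
buf <- writeBin(coords, raw(), size = 8)    # 16 bytes per point, native byte order

# (a) a single call reads all 2 * n doubles, then reshapes to an n x 2 matrix
m1 <- matrix(readBin(buf, what = "double", n = 2L * n, size = 8),
             ncol = 2, byrow = TRUE)

# (b) one call per point, merged afterwards
read_point <- function(buf, i) {            # hypothetical helper
  offset <- (i - 1L) * 16L
  readBin(buf[offset + 1:16], what = "double", n = 2L, size = 8)
}
m2 <- do.call(rbind, lapply(seq_len(n), function(i) read_point(buf, i)))

print(all(m1 == m2))                        # both strategies give the same matrix
```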
many LINESTRINGs with few points
Here we see a strong performance gain of the C++ implementation: all the object creation is done in C++, without R function calls. The slowness of wkb::readWKB may be largely due to overhead caused by sp: creating Lines objects, object validation, and computing the bounding box.
I made the C++ and “pureR” implementations considerably faster by moving the bounding box calculation to C++. The C++ implementation was further optimized by moving the type check to C++: if a mix of types is read from a set of WKB objects, sfc will coerce them to a single type (e.g., a set of LINESTRING and MULTILINESTRING will be coerced to all MULTILINESTRING.)
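What the type check and the bounding box calculation have to do can be sketched in a few lines of base R. This is only an illustration of the logic, not the sf implementation: it assumes a LINESTRING is held as a two-column matrix and a MULTILINESTRING as a list of such matrices, and `to_multi` is a hypothetical helper.

```r
# Two features of mixed type: a LINESTRING and a MULTILINESTRING
ls  <- matrix(c(0, 0, 1, 1), ncol = 2, byrow = TRUE)
mls <- list(matrix(c(2, 2, 3, 5), ncol = 2, byrow = TRUE),
            matrix(c(-1, 4, 0, 4), ncol = 2, byrow = TRUE))
features <- list(ls, mls)

# Hypothetical helper: wrap a bare matrix so that every feature
# becomes a list of matrices, i.e. a MULTILINESTRING
to_multi <- function(g) if (is.matrix(g)) list(g) else g
multi <- lapply(features, to_multi)

# Bounding box over the coordinates of all features
all_xy <- do.call(rbind, unlist(multi, recursive = FALSE))
bbox <- c(xmin = min(all_xy[, 1]), ymin = min(all_xy[, 2]),
          xmax = max(all_xy[, 1]), ymax = max(all_xy[, 2]))

print(bbox)
```

Both steps touch every feature, which is why doing them with R-level loops becomes expensive when there are half a million features.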
single MULTILINESTRING with many short lines
Here we see again the cost of function calls: both “pureR” in sf and wkb::readWKB are much slower due to the many function calls; the latter also due to object management and validation in sp.
Reading well-known binary spatial data into R can be done pretty elegantly in pure R, but in many scenarios it can be much faster using C++. We observe speed gains of up to a factor of 250.