Reading well-known-binary into R
This blog post describes ways to read binary simple feature data into R, and compares them.
WKB (well-known-binary) is the (ISO) standard binary serialization for simple features. You see it often printed in hexadecimal notation , e.g. in spatially extended databases such as PostGIS:
postgis=# SELECT 'POINT(1 2)'::geometry;
geometry
--------------------------------------------
0101000000000000000000F03F0000000000000040
(1 row)
where the alternative form is the human-readable text (Well-known text) form:
postgis=# SELECT ST_AsText('POINT(1 2)'::geometry);
st_astext
------------
POINT(1 2)
(1 row)
In fact, the WKB is the way databases store features in BLOBs (binary large objects). This means that, unlike well-known text, reading well-known binary involves
- no loss of precision caused by text <–> binary conversion,
- no conversion of data needed at all (provided the endianness is native)
As a consequence, it should be possible to do this blazingly fast. Also with R? And large data sets?
Three software scenarios
I compared three software implementations:
sf::st_as_sfc
(of package sf) using C++ to read WKBsf::st_as_sfc
(of package sf) using pure R to read WKB (but C++ to compute bounding box)wkb::readWKB
(of package wkb) using pure R to read features into sp-compatible objects
Note that the results below were obtained after profiling, and implementing expensive parts in C++.
Three geometries
I created three different (sets of) simple features to compare read performance: one large and simple line, one data set with many small lines, and one multi-part line containing many sub-lines:
- single LINESTRING with many points: a single LINESTRING with one million nodes (pionts) is read into a single simple feature
- many LINESTRINGs with few points: half a million simple features of type LINESTRING are read, each having two nodes (points)
- single MULTILINESTRING with many short lines: a single simple feature of type MULTILINESTRING is read, consisting of half a million line segments, each line segment consisting of two points.
A reproducible demo-script is found in the sf package here, and can be run by
devtools::install_github("edzer/sfr")
demo(bm_wkb)
Reported run times are in seconds, and were obtained by system.time()
.
single LINESTRING with many points
expression | user | system | elapsed |
---|---|---|---|
sf::st_as_sfc(.) |
0.032 | 0.000 | 0.031 |
sf::st_as_sfc(., pureR = TRUE) |
0.096 | 0.012 | 0.110 |
wkb::readWKB(.) |
8.276 | 0.000 | 8.275 |
We see that for this case both sf
implementations are comparable;
this is due to the fact that the whole line of 16 Mb is read into
R with a single readBin
call: C++ can’t do this much faster.
I suspect wkb::readWKB
is slower here because instead of
reading a complete matrix in one step it makes a million calls to
readPoint
, and then merges the points read in R. This adds a few
million function calls. Since only a single Line
is created,
not much overhead from sp
can take place here.
Function calls, as John Chambers explains in Extending R, have a constant overhead of about 1000 instructions. Having lots of them may become expensive, if each of them does relatively little.
many LINESTRINGs with few points
expression | user | system | elapsed |
---|---|---|---|
sf::st_as_sfc(.) |
1.244 | 0.000 | 1.243 |
sf::st_as_sfc(., pureR = TRUE) |
55.004 | 0.056 | 55.063 |
wkb::readWKB(.) |
257.092 | 0.192 | 257.291 |
Here we see a strong performance gain of the C++ implementation:
all the object creation is done in C++, without R function
calls. wkb::readWKB
slowness may be largely due to overhead caused
by sp: creating Line
and Lines
objects, object validation,
computing bounding box.
I made the C++ and “pureR” implementations considerably faster by moving the bounding box calculation to C++. The C++ implementation was further optimized by moving the type check to C++: if a mix of types is read from a set of WKB objects, sfc will coerce them to a single type (e.g., a set of LINESTRING and MULTILINESTRING will be coerced to all MULTILINESTRING.)
single MULTILINESTRING with many short lines
expression | user | system | elapsed |
---|---|---|---|
sf::st_as_sfc(.) |
0.348 | 0.000 | 0.348 |
sf::st_as_sfc(., pureR = TRUE) |
24.088 | 0.008 | 24.100 |
wkb::readWKB(.) |
87.072 | 0.004 | 87.074 |
Here we see again the cost of function calls: both “pureR” in sf
and wkb::readWKB
are much slower due to the many function calls;
the latter also due to object management and validation in sp.
Discussion
Reading well-known binary spatial data into R can be done pretty elegantly by R, but in many scenarios can be much faster using C++. We observe speed gains up to a factor 250.