This blog post describes ways to read binary simple feature data into R, and compares them.

WKB (well-known-binary) is the (ISO) standard binary serialization for simple features. You see it often printed in hexadecimal notation , e.g. in spatially extended databases such as PostGIS:

postgis=# SELECT 'POINT(1 2)'::geometry;
(1 row)

where the alternative form is the human-readable text (Well-known text) form:

postgis=# SELECT ST_AsText('POINT(1 2)'::geometry);
 POINT(1 2)
(1 row)

In fact, the WKB is the way databases store features in BLOBs (binary large objects). This means that, unlike well-known text, reading well-known binary involves

  • no loss of precision caused by text <–> binary conversion,
  • no conversion of data needed at all (provided the endianness is native)

As a consequence, it should be possible to do this blazingly fast. Also with R? And large data sets?

Three software scenarios

I compared three software implementations:

  1. sf::st_as_sfc (of package sf) using C++ to read WKB
  2. sf::st_as_sfc (of package sf) using pure R to read WKB (but C++ to compute bounding box)
  3. wkb::readWKB (of package wkb) using pure R to read features into sp-compatible objects

Note that the results below were obtained after profiling, and implementing expensive parts in C++.

Three geometries

I created three different (sets of) simple features to compare read performance: one large and simple line, one data set with many small lines, and one multi-part line containing many sub-lines:

  1. single LINESTRING with many points: a single LINESTRING with one million nodes (pionts) is read into a single simple feature
  2. many LINESTRINGs with few points: half a million simple features of type LINESTRING are read, each having two nodes (points)
  3. single MULTILINESTRING with many short lines: a single simple feature of type MULTILINESTRING is read, consisting of half a million line segments, each line segment consisting of two points.

A reproducible demo-script is found in the sf package here, and can be run by


Reported run times are in seconds, and were obtained by system.time().

single LINESTRING with many points

expression user system elapsed
sf::st_as_sfc(.) 0.032 0.000 0.031
sf::st_as_sfc(., pureR = TRUE) 0.096 0.012 0.110
wkb::readWKB(.) 8.276 0.000 8.275

We see that for this case both sf implementations are comparable; this is due to the fact that the whole line of 16 Mb is read into R with a single readBin call: C++ can’t do this much faster.

I suspect wkb::readWKB is slower here because instead of reading a complete matrix in one step it makes a million calls to readPoint, and then merges the points read in R. This adds a few million function calls. Since only a single Line is created, not much overhead from sp can take place here.

Function calls, as John Chambers explains in Extending R, have a constant overhead of about 1000 instructions. Having lots of them may become expensive, if each of them does relatively little.

many LINESTRINGs with few points

expression user system elapsed
sf::st_as_sfc(.) 1.244 0.000 1.243
sf::st_as_sfc(., pureR = TRUE) 55.004 0.056 55.063
wkb::readWKB(.) 257.092 0.192 257.291

Here we see a strong performance gain of the C++ implementation: all the object creation is done in C++, without R function calls. wkb::readWKB slowness may be largely due to overhead caused by sp: creating Line and Lines objects, object validation, computing bounding box.

I made the C++ and “pureR” implementations considerably faster by moving the bounding box calculation to C++. The C++ implementation was further optimized by moving the type check to C++: if a mix of types is read from a set of WKB objects, sfc will coerce them to a single type (e.g., a set of LINESTRING and MULTILINESTRING will be coerced to all MULTILINESTRING.)

single MULTILINESTRING with many short lines

expression user system elapsed
sf::st_as_sfc(.) 0.348 0.000 0.348
sf::st_as_sfc(., pureR = TRUE) 24.088 0.008 24.100
wkb::readWKB(.) 87.072 0.004 87.074

Here we see again the cost of function calls: both “pureR” in sf and wkb::readWKB are much slower due to the many function calls; the latter also due to object management and validation in sp.


Reading well-known binary spatial data into R can be done pretty elegantly by R, but in many scenarios can be much faster using C++. We observe speed gains up to a factor 250.