view raw Rmd

### Introduction

Spatial line and polygon data are often messy; although simple features formally follow a standard, there is no guarantee that data is clean when imported in R. This blog shows how we can identify, (de)select, or repair broken and invalid geometries. We also show how empty geometries arise, and can be dealt with. Literature on invalid polygons and correcting them is found in Ramsey (2010), Ledoux, Ohori, and Meijers (2014), Ledoux, Ohori, and Meijers (2012), and Van Oosterom, Quak, and Tijssen (2005); all these come with excelent figures illustrating the problem cases.

We see that from version 0.4-0, `sf` may be linked to `lwgeom`,

``````library(sf)

## Linking to GEOS 3.5.1, GDAL 2.1.3, proj.4 4.9.2, lwgeom 2.3.1 r15264
``````

where `lwgeom` stands for the light-weight geometry library that powers postgis. This library is not present on CRAN, so binary packages installed from CRAN will not come with it. It is only linked to `sf` when it is detected during a build from source. When `lwgeom` is present, we will have a working version of `st_make_valid`, which is essentially identical to PostGIS’ `ST_makeValid`.

### Corrup or invalid geometries?

There are two types of things that can go wrong when dealing with geometries in `sf`. First, a geometry can be corrupt, which is for instance the case for a `LINESTRING` with one point, or a `POLYGON` with more than zero and less than 4 points:

``````l0 = st_linestring(matrix(1:2,1,2))
p0 = st_polygon(list(rbind(c(0,0),c(1,1),c(0,0))))
``````

These cases could of course be easily caught by the respective constructor functions, but they are not because we want to see what happens. Also, if we would catch them, it would not prevent us from running into them, because the majority of spatial data enters R through GDAL, and `sf`’s binary interface (reading well-known binary). Also, for many purposes corrupt may not be a problem, e.g. if we only want to plot them. In case we want to use them however in geometrical operations, we’ll typically see a message like:

``````  IllegalArgumentException: Invalid number of points in LinearRing found 3 - must be 0 or >= 4
``````

which points to GEOS not accepting a geometry as a possible geometry. Such an error message however does not point us to which geometry caused this. We could of course write a loop over all geometries to find this out, but can also use `st_is_valid` which returns by default `NA` on corrupt geometries:

``````l0 = st_linestring(matrix(1:2,1,2))
p0 = st_polygon(list(rbind(c(0,0),c(1,1),c(0,0))))
p = st_point(c(0,1)) # not corrupt
st_is_valid(st_sfc(l0, p0, p))

##    NA   NA TRUE
``````

Simple feature validity refers to a number of properties that polygons should have, such as non-self intersecting, holes being inside polygons. A number of different examples for invalid geometries are found in Ledoux, Ohori, and Meijers (2014), and were taken from their prepair github repo:

``````# A 'bowtie' polygon:
p1 = st_as_sfc("POLYGON((0 0, 0 10, 10 0, 10 10, 0 0))")
# Square with wrong orientation:
p2 = st_as_sfc("POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))")
# Inner ring with one edge sharing part of an edge of the outer ring:
p3 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0),(5 2,5 7,10 7, 10 2, 5 2))")
# Dangling edge:
p4 = st_as_sfc("POLYGON((0 0, 10 0, 15 5, 10 0, 10 10, 0 10, 0 0))")
# Outer ring not closed:
p5 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10))")
p6 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (1 1, 1 8, 3 8, 3 1, 1 1), (3 1, 3 8, 5 8, 5 1, 3 1))")
# Polygon with an inner ring inside another inner ring:
p7 = st_as_sfc("POLYGON((0 0, 10 0, 10 10, 0 10, 0 0), (2 8, 5 8, 5 2, 2 2, 2 8), (3 3, 4 3, 3 4, 3 3))")
p = c(p1, p2, p3, p4, p5, p6, p7)
(valid = st_is_valid(p))

##  FALSE  TRUE FALSE FALSE    NA FALSE FALSE
``````

Interestingly, GEOS considers `p5` as corrupt (`NA`) and `p2` as valid.

To query GEOS for the reason of invalidity, we can use the `reason = TRUE` argument to `st_is_valid`:

``````st_is_valid(p, reason = TRUE)

##  "Self-intersection[5 5]"  "Valid Geometry"
##  "Self-intersection[10 2]" "Self-intersection[10 0]"
##  NA                        "Self-intersection[3 1]"
##  "Holes are nested[3 3]"
``````

### Making invalid polygons valid

As mentioned above, in case `sf` was linked to `lwgeom`, which is confirmed by

``````sf_extSoftVersion()["lwgeom"]

##         lwgeom
## "2.3.1 r15264"
``````

not printing a `NA`, we can use `st_make_valid` to make geometries valid:

``````st_make_valid(p)

## Geometry set for 7 features
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 0 ymin: 0 xmax: 15 ymax: 10
## epsg (SRID):    NA
## proj4string:    NA
## First 5 geometries:

## MULTIPOLYGON(((0 0, 0 10, 5 5, 0 0)), ((5 5, 10...

## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))

## GEOMETRYCOLLECTION(POLYGON((10 7, 10 2, 10 0, 0...

## GEOMETRYCOLLECTION(POLYGON((10 0, 0 0, 0 10, 10...

## POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))
``````

A well-known “trick”, which may be your only alternative if is to buffer the geometries with zero distance:

``````st_buffer(p[!is.na(valid)], 0.0)

## Geometry set for 6 features
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 0 ymin: 0 xmax: 10 ymax: 10
## epsg (SRID):    NA
## proj4string:    NA
## First 5 geometries:

## POLYGON((0 0, 0 10, 5 5, 0 0))

## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))

## POLYGON((0 0, 0 10, 10 10, 10 7, 5 7, 5 2, 10 2...

## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))

## POLYGON((0 0, 0 10, 10 10, 10 0, 0 0), (1 1, 3 ...
``````

but we see that, apart from the fact that this only works for non-corrupt geometries, we end up with different results.

A larger example from the prepair site is this:

``````x = read_sf("/home/edzer/git/prepair/data/CLC2006_2018418.geojson")
st_is_valid(x)

##  FALSE

st_is_valid(st_make_valid(x))

##  TRUE

plot(x, col = 'grey', axes = TRUE, graticule = TRUE)
`````` The corresponding paper, Ledoux, Ohori, and Meijers (2012) zooms in on problematic points. The authors argue to use constrained triangulation instead of the (less documented) approach taken by `lwgeom`; Mike Sumner also explores this here. It builds upon RTriangle, which cannot be integrated in `sf` as it is distributed under license with a non-commercial clause. Ledoux uses CGAL, which would be great to have an interface to from R!

### Empty geometries

Empty geometries exist, and can be thought of as zero-length vectors, `data.frame`s without rows, or `NULL` values in lists: in essence, there’s place for information, but there is no information. An empty geometry arises for instance if we ask for the intersection of two non-intersecting geometries:

``````st_intersection(st_point(0:1), st_point(1:2))

## GEOMETRYCOLLECTION()
``````

In principle, we could have designed `sf` such that empty geometries were represented a `NULL` value, but the standard prescrives that every geometry type has an empty instance:

``````st_linestring()

## LINESTRING()

st_polygon()

## POLYGON()

st_point()

## POINT(NA NA)
``````

and thus the empty geometry is typed. This guarantees clean roundtrips from a database to R back into a database: no information (on type) gets lost in case of presence of empty geometries.

How can we detect, and filter on empty geometries? We can do that with `st_dimension`:

``````lin = st_linestring(rbind(c(0,0),c(1,1)))
pol = st_polygon(list(rbind(c(0,0),c(1,1),c(0,1),c(0,0))))
poi = st_point(c(1,1))
p0 = st_point()
pol0 = st_polygon()
st_dimension(st_sfc(lin, pol, poi, p0, pol0))

##   1  2  0 NA NA
``````

and see that empty geometries return `NA`.

The standard however prescribes that an empty polygon still has dimension two, and we can override the `NA` convenience to get standard-compliant dimensions by

``````st_dimension(st_sfc(lin, pol, poi, p0, pol0), NA_if_empty = FALSE)

##  1 2 0 0 2
``````

### Tidying feature geometries

When you analyse your spatial data with `sf` and you don’t get any warnings or error messages, all may be fine. In case you do, or your are curious, you can check for

1. empty geometries, using `any(is.na(st_dimension(x)))`
2. corrupt geometries, using `any(is.na(st_is_valid(x)))`
3. invalid geometries, using `any(na.omit(st_is_valid(x)) == FALSE)`; in case of corrupt and/or invalid geometries,
4. in case of invalid geometries, query the reason for invalidity by `st_is_valid(x, reason = TRUE)`
5. you may be succesful in making geometries valid using `st_make_valid(x)` or, if `st_make_valid` is not supported by
6. `st_buffer(x, 0.0)` on non-corrupt geometries (but beware of the bowtie example above, where `st_buffer` removes one half).
7. After succesful a `st_make_valid`, you may want to select a particular type subset using `st_is`, or cast `GEOMETRYCOLLECTIONS` to `MULTIPOLYGON` by
``````st_make_valid(p) %>% st_cast("MULTIPOLYGON")

## Warning in st_cast.GEOMETRYCOLLECTION(X[[i]], ...): only first part of
## geometrycollection is retained

## Warning in st_cast.GEOMETRYCOLLECTION(X[[i]], ...): only first part of
## geometrycollection is retained

## Warning in st_cast.GEOMETRYCOLLECTION(X[[i]], ...): only first part of
## geometrycollection is retained

## Geometry set for 7 features
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 0 ymin: 0 xmax: 10 ymax: 10
## epsg (SRID):    NA
## proj4string:    NA
## First 5 geometries:

## MULTIPOLYGON(((0 0, 0 10, 5 5, 0 0)), ((5 5, 10...

## MULTIPOLYGON(((0 0, 0 10, 10 10, 10 0, 0 0)))

## MULTIPOLYGON(((10 7, 10 2, 10 0, 0 0, 0 10, 10 ...

## MULTIPOLYGON(((10 0, 0 0, 0 10, 10 10, 10 0)))

## MULTIPOLYGON(((0 0, 10 0, 10 10, 0 10, 0 0)))
``````

For longer explanations about what makes a polygons invalid, do read one of the references below, all are richly illustrated

### References [references]

Ledoux, Hugo, Ken Arroyo Ohori, and Martijn Meijers. 2012. “Automatically Repairing Invalid Polygons with a Constrained Triangulation.” In. Agile. https://3d.bk.tudelft.nl/ken/files/12_agile.pdf.

———. 2014. “A Triangulation-Based Approach to Automatically Repair GIS Polygons.” Computers & Geosciences 66: 121–31. https://pdfs.semanticscholar.org/d9ec/b32a7844b436fcd4757958e5eeca9563fcd2.pdf.

Ramsey, Paul. 2010. “PostGIS-Tips for Power Users.” Presentation on: FOSS4G. http://2010.foss4g.org/presentations/3369.pdf.

Van Oosterom, Peter, Wilko Quak, and Theo Tijssen. 2005. “About Invalid, Valid and Clean Polygons.” In Developments in Spatial Data Handling, 1–16. Springer. http://excerpts.numilog.com/books/3540267727.pdf.