10  Statistical modelling of spatial data

So far, in this book we mostly addressed the problem of describing data, e.g. by logical geometrical predicates or transformations that involved geometries, or by summary measures or by plots depicting variability in the geometry data or in feature attributes.

Statistical modelling aims at going beyond describing the data: it considers the data as a sample drawn from a population, and tries to make assessments (inference) about the population sampled from, for instance by quantifying relationships between variables, estimating population parameters, and predicting the outcome of observations that could have been taken but were not, as is the case in interpolation problems. This is usually done by adopting a model for the data, where for instance observations are decomposed as follows:

\[ \mbox{observed} = \mbox{explained} + \mbox{remainder} \tag{10.1}\]

where “explained” typically uses external variables (predictors, covariates, in machine learning confusingly also called features) that are related to the observed variable and some kind of regression model to translate into variability of the observed variable, and “remainder” is remaining variability that could not be explained. Interest may focus on the nature and magnitude or the relations between predictors and the observed variable, or in predicting new observations.

Statistical models, and sampling hinge on the concept of probability, which in typical spatial data science problems is not a force of nature but has to be assumed in one way or another. If we are faced with data that come from (spatially) random sampling and we are interested in estimating means or totals, a design-based approach that assumes randomness in the sample locations is the most straightforward analysis approach, as pointed out in more detail in Section 10.4. If observations were not sampled randomly, or if our interest is in predicting values at specific locations (mapping) a model-based approach is needed. The remaining chapters in this part deal with model-based approaches.

10.1 Mapping with non-spatial regression and ML models

Regression models or other machine learning (ML) models can be applied to spatial and spatiotemporal data just the way they are applied for predicting new observations in non-spatial problems:

  1. estimate: for a set of observations, a regression or ML model is fitted using predictor values corresponding to the observations (in ML jargon this step is also known as “train”)
  2. predict: for a new situation, known predictor values are combined with the fitted model to predict the value of the observed variable, along with a prediction error or prediction interval if possible

Objects of class sf need no special treatment, as they are data.frames. To create maps of the resulting predictions, predicted values need to be added to the sf object, as e.g. done using the nc dataset loaded as in Chapter 1:

nc |> mutate(SID = SID74/BIR74, NWB = NWBIR74/BIR74) -> nc1
lm(SID ~ NWB, nc1) |>
  predict(nc1, interval = "prediction") -> pr
bind_cols(nc, pr) |> names()
#  [1] "AREA"      "PERIMETER" "CNTY_"     "CNTY_ID"   "NAME"     
#  [6] "FIPS"      "FIPSNO"    "CRESS_ID"  "BIR74"     "SID74"    
# [11] "NWBIR74"   "BIR79"     "SID79"     "NWBIR79"   "geom"     
# [16] "fit"       "lwr"       "upr"

where we see that

  • lm estimates a linear model, and works directly on an sf object
  • the output is used for a predict model, which predicts values corresponding to the observations in nc1, the same sf object
  • predict creates three columns: fit for predicted values and lwr and upr for the 95% prediction intervals
  • these three columns have been added to the final object using bind_cols.

In general the dataset for model estimation and prediction do not have to be the same. Section 7.4.6 points out how this can be done with stars objects (essentially by going through a long data.frame representation of the datacube and converting the predicted results back, potentially in a chunked fashion).

Because many regression and ML type problems share this same structure, packages like caret (Kuhn 2022) or tidymodels (Kuhn and Wickham 2020) allow for automated evaluation and comparison over a large set of model alternatives, offering a large set of model evaluation criteria and cross validation strategies. Such cross validation approaches assume independent observations, which is often not a reasonable assumption for spatial data, for instance because of spatial correlation (Ploton et al. 2020) or because of strong spatial clustering in sample data (Meyer and Pebesma 2022), or both, and a number of R packages provide methods that are meant as replacements for naive cross validation, including spatialsample (Silge and Mahoney 2022), CAST (Meyer, Milà, and Ludwig 2022), mlr3spatial (Becker and Schratz 2022) and mlr3spatiotempcv (Schratz and Becker 2022).

Strong spatial clustering of sample can arise when sample data are composed by joining different databases, each with very different sampling density. This is often the case in global datasets (Meyer and Pebesma 2022). Another example of strong clustering arises when, for sampling ground truth points of a land cover class, polygons are digitised and points are sampled within these polygons at the resolution of pixels in satellite imagery.

Spatial correlation in the “remainder” part of the model may be decreased by adding spatial coordinates or functions of spatial coordinates to the set of predictors. This also carries a risk of over-optimistic predictions in extrapolation cases, (cross) validation, and model assessment, and is further discussed in Section 10.5.

10.2 Support and statistical modelling

Support of data (Section 1.6; Chapter 5) plays a lead role in the statistical analysis of spatial data. Methods for areal data (Chapter 14 - Chapter 17) are devised for data with area support, where the set of areas cover the entire area of interest.

By showing an extensive variable (Section 5.3.1) in a polygon choropleth map as done in Figure 1.1 one runs the risk that the information is related with the polygon size, and that the signal shown is actually the size of the polygons, in colour. For the variable population count one would divide by the polygon area to show the (intensive) variable population density, in order to create an informative map. In the analysis of health data, like disease incidences over a time period shown in Figure 1.1, rather than dividing by polygon area to get a spatial density, observations are usually converted to probabilities or incidence rates by dividing over the population size of the associated polygons. As such they are (still) associated with the polygon area but their support is the population total divided by. It is these totals that inform the (Poisson) variability used by subsequent modelling, e.g. in CAR-type models (Chapter 16).

Chapter 11 deals in principle with point support observations, but at some stage needs to acknowledge that observations have non-zero size: tree stem “points” cannot be separated distances smaller than the tree diameter. Also, points in point pattern analysis are considered in their observation window, the area for which the point dataset is exhaustive, or complete. The observation window is of influence in many of the analysis tools. If points are observed on a line network, then the observation window consists of the set of lines observed, and distances measured through this network.

Geostatistical data (Chapter 12, Chapter 13) usually start with point support observations, and may end with predictions (spatial interpolations) for unobserved point locations distributed over the area of interest, or may end in predictions for means over areas (block kriging; Section 12.5). Alternatively, observations may be aggregates over regions (Skøien et al. (2014)). In remote sensing data, pixel values are usually associated with aggregates over the pixel area. Challenges may be the filling of gaps in images, e.g. caused by cloud coverage, from neighbouring pixels both in space and time (Gerber et al. (2018), Heaton et al. (2019), Militino et al. (2019)).

When combining data with different spatial supports, e.g. polygons from administrative regions and raster layers, it is often seen that all information is “brought together” to the highest resolution, by simply extracting polygon values at pixel locations, and proceeding from there, with all the newly created “observations”. This of course bears a large risk of producing non-sensible results when analysing these “data”, and a proper downsampling strategy, possibly using simulations to cope with uncertainty, would be a better alternative. For naive users, using software that is not aware of support of values associated with areas and using software that does not warn against naive downsampling is of course not a helpful situation.

10.3 Time in predictive models

Schabenberger and Gotway (2005) already noted that in many cases, statistical analysis of spatiotemporal data proceeds either by reducing time, then working on the problem spatially (time first, space later) or reducing space, then working on the problem temporally (space first, time later). An example of the first approach is given in Chapter 12 where a dataset with a year of hourly values (detailed in Chapter 13) are reduce to station mean values (time first) after which these means are interpolated spatially (space later). Examples from the area of remote sensing are

  • Simoes et al. (2021), who use supervised machine learning and time series deep learning to segmentise pixel time series into sequences of land use (time first), and then smooth the resulting sequences of maps to remove improbable transitions in isolated pixels (space later)
  • Verbesselt et al. (2010), who use (unsupervised) structural change algorithms to find breakpoints in pixel time series (time first), which are interpreted in the context deforestation later on.

Examples of space first, time later in the area of remote sensing are any case where a single scene or scenes belonging to a single season are classified, and multi-year changes in land use or land cover are assessed by comparing time sequences of classified scenes. An example of this is Brown et al. (2022). Examples where space and time are considered jointly are the spatiotemporal interpolation in Chapter 13, and Lu et al. (2016) in the context of remote sensing.

10.4 Design-based and model-based inference

Statistical inference means the action of estimating parameters about a population from sample data. Suppose we denote the variable of interest with \(z(s)\), where \(z\) is the attribute value measured at location \(s\), and we are interested in estimating the mean value of \(z(s)\) over a domain \(D\), \[z(s)=\frac{1}{|D|} \int_{ u \in D} z(u)du,\] with \(|D|\) the area of \(D\), from sample data \(z(s_1),...,z(s_n)\).

Then, there are two possibilities to proceed: model-based, or design-based. A model-based approach considers \(z(s)\) to be a realisation of a super-population \(Z(s)\) (using capital letters to indicate random variables), and could for instance postulate a model for its spatial variability in the form of \[Z(s) = m + e(s), \ \ \mbox{E}(e(s)) = 0, \ \ \mbox{Cov(e(s))} = \Sigma(\theta)\] with \(m\) a constant mean and \(e(s)\) a residual with mean zero and covariance matrix \(\Sigma(\theta)\). This would require choosing the covariance function \(\Sigma()\) and estimating its parameters \(\theta\) from \(z(s)\), and then computing a block kriging prediction \(\hat{Z}(D)\) (Section 12.5). This approach makes no assumptions about how \(z(s)\) was sampled spatially, but of course it should allow for choosing the covariance function and estimating its parameters; inference is conditional to the validity of the postulated model.

Rather than assuming a superpopulation model, the design-based approach (JJ De Gruijter and Ter Braak 1990; Brus 2021a; Breidt, Opsomer, et al. 2017) assumes randomness in the locations, which is justified (only) when using random sampling. It requires that the sample data were obtained by probability sampling, meaning that some form of spatial random sampling was used where all elements of \(z(s)\) had a known and positive probability of being included in the sample obtained. The random process is that of sampling: \(z(s_1)\) is a realisation of the random process \(z(S_1)\), the first observation taken over repeated random sampling. Design-based estimators only need these inclusion probabilities to estimate mean values with standard errors. This means that for instance given a simple random sample, the unweighted sample mean is used to estimate the population mean, and no model parameters need to be fit.

Now the question is whether \(z(s_1)\) and \(z(s_2)\) can be expected to be correlated when \(s_1\) and \(s_2\) are close together. The question doesn’t work out as long as \(z(s_1)\) and \(z(s_2)\) are just two numbers: we need some kind of framework, random variables, that recreates this situation to form two sets of numbers for which we can consider correlation. The misconception here, as explained in Brus (2021a), is that the two are always spatially correlated but this is only the case when working under model-based approaches: \(Z(s_1)\) and \(Z(s_2)\) may well be correlated (“model-dependent”), but although in a particular random sample (realisation) \(z(s_1)\) and \(z(s_2)\) may be close in space, the corresponding random variables \(z(S_1)\) and \(z(S_2)\) considered over repeated random sampling are not close together, and are design-independent. Both situations can co-exist without contradiction, and are a consequence of choosing to work under one inference framework or the other.

The choice whether to work under a design-based or model-based framework depends on the purpose of the study and the data collection process. The model-based framework lends itself best for cases:

  • where predictions are required for individual locations, or for areas too small to be sampled
  • where the available data were not collected using a known random sampling scheme (i.e., the inclusion probabilities are unknown, or are zero over particular areas or/and times)

Design-based approaches are most suitable when:

  • observations were collected using a spatial random sampling process
  • aggregated properties of the entire sample region (or sub-region) are needed
  • estimates are required that are not sensitive to potential model misspecification, e.g. when needed for regulatory or legal purposes

In case a sampling procedure is to be planned (Jaap De Gruijter et al. 2006), some form of spatial random sampling is definitely worth considering since it opens up the possibility of following both inference frameworks.

10.5 Predictive models with coordinates

In data science projects, coordinates may be seen as features in a larger set of predictors (or features, or covariates) and treated accordingly. There are some pitfalls in doing so.

As usual when working with predictors, it is good to choose predictive methods that are not sensitive to shifts in origin or shifts in unit (scale). Assuming a two-dimensional problem, predictive models should also not be sensitive to arbitrary rotations of the \(x\)- and \(y\)- or latitude and longitude axes. For projected (2D, Cartesian) coordinates this can be assured e.g. by using polynomials of order \(n\) as \((x+y)^n\), rather than \((x)^n + (y)^n\); for a second order polynomial this involves including the term \(xy\), so that an ellipsoidal-shape trend surface does not have to be aligned with the \(x-\) or \(y-\)axis. For a GAM model with spline components, one would use a spline in two dimensions \(s(x,y)\) rather than two independent splines \(s(x)\) and \(s(y)\) that do not allow for interaction. An exception to this “rule” is when e.g. a pure latitude effect is desired, for instance to account for yearly total solar energy influx.

When the area covered by the data is large, the difference between using ellipsoidal coordinates and projected coordinates will automatically become larger, and hence choosing one of both will have an effect on predictive modelling. For very large extents, e.g. global models, polynomials or splines in latitude and longitude will not make sense as they ignore the circular nature of longitude and the coordinate singularities at the poles. Here, spherical harmonics, base functions that are continuous on the sphere with increasing spatial frequencies can replace polynomials or be used as spline base functions.

In many cases, the spatial coordinates over which samples were collected also define the space over which predictions are made, setting them apart from other features. Many simple predictive approaches, including most machine learning methods, assume sample data to be independent. When samples are collected by spatially random sampling over the spatial target area, this assumption may be justified when working under a design-based context (Brus 2021b). This context however treats the coordinate space as the variable over which we randomise, which affords predicting values for a new randomly chosen location but rules out making predictions for fixed locations; this implies that averages over areas over which samples were collected can be obtained, but not spatial interpolations. In case predictions for fixed locations are required, or in case data were not collected by spatial random sampling, a model-based approach (as taken in Chapter 12) is needed and typically some form of spatial and/or temporal autocorrelation of residuals must be assumed.

A common case is where sample data are collected opportunistically (“whatever could be found”), and are then used in a predictive framework that does not weight them. This has a consequence that the resulting model may be biased towards over-represented areas (in predictor space and/or in spatial coordinates space), and that simple (random) cross validation statistics may be over-optimistic when taken as performance measures for spatial prediction (Meyer and Pebesma 2021, 2022). Adaptive cross validation measures, e.g. from spatial cross validation may help getting more relevant measures for predictive performance.

10.6 Exercises

Use R to solve the following exercises.

  1. Following the lm example of Section 10.1 use a random forest model to predict SID values (e.g. using package randomForest), and plot the random forest predictions against observations, along with the \(x=y\) line.
  2. Create a new dataset by randomly sampling 1000 points from the nc dataset, and rerun the linear regression model of Section 10.1 on this dataset. Consider the summary of the fitted models, in particular the estimated coefficients, their standard errors, and the residual standard error. What has changed?
  3. Redo the water-land classification of section Section 7.4.6 using class::knn instead of lda, using a value of k = 5, and compare the resulting predictions with those of lda.
  4. For the linear model using nc and for the knn example of the previous exercise, add a first and a second order linear model in the spatial coordinates and compare the results (use st_centroid to obtain polygon centroids, and st_coordinates to extract the x and y coordinates in matrix form).