10  Statistical modelling of spatial data

Spatial data almost always (and everywhere) has the property that it is spatially structured: observations done close by in space tend to be more similar than observations done at larger distance from each other. This phenomenon, in the geography domain attributed to Waldo Tobler (as in “Waldo Tobler’s first law of geography”) was already noted by Fisher et al. (1937) and was a motivation for developing randomized block design in agricultural experiments: allocating treatments randomly to blocks avoids that spatial structure gets mixed up (or: confounds) with a signal caused by the treatment.

The often heard argument that spatially structured data means that the data is spatially correlated, which would exclude estimation methods that assume independent observations is false. Correlation is a property of two random variables, and there are different ways in which spatial data can be approached with random variables: either the observation locations are random (leading to design-based inference) or the observed values are random (leading to model-based inference). The next section points out the difference between these two.

10.1 Design-based and model-based inference

Statistical inference means the action of estimating parameters about a population from sample data. Suppose we denote the variable of interest with \(z(s)\), where \(z\) is the attribute value measured at location \(s\), and we are interested in estimating the mean value of \(z(s)\) over a domain \(D\), \[z(s)=\frac{1}{|D|} \int_{ u \in D} z(u)du,\] with \(|D|\) the area of \(D\), from sample data \(z(s_1),...,z(s_n)\).

Then, there are two possibilities to proceed: model-based, or design-based. A model-based approach considers \(z(s)\) to be a realisation of a superpopulation \(Z(s)\) (using capital letters to indicate random variables), and could for instance postulate a model for its spatial variability in the form of \[Z(s) = m + e(s), \ \ \mbox{E}(e(s)) = 0, \ \ \mbox{Cov(e(s))} = \Sigma(\theta)\] with \(m\) a constant mean and \(e(s)\) a residual with mean zero and covariance matrix \(\Sigma(\theta)\). This would require choosing the covariance function \(\Sigma()\) and estimating its parameters \(\theta\) from \(z(s)\), and then computing a block kriging prediction \(\hat{Z}(D)\) (Section 12.5). This approach makes no assumptions about how \(z(s)\) was sampled spatially, but of course it should allow for choosing the covariance function and estimating its parameters; inference is conditional to the validity of the postulated model.

Rather than assuming a superpopulation model, the design-based approach (JJ De Gruijter and Ter Braak 1990; Brus 2021a; Breidt, Opsomer, et al. 2017) assumes randomness in the locations, which is justified (only) when using random sampling. It requires that the sample data were obtained by probability sampling, meaning that some form of spatial random sampling was used where all elements of \(z(s)\) had a known and positive probability of being included in the sample obtained. The random process is that of sampling: \(z(s_1)\) is a realisation of the random process \(z(S_1)\), the first observation taken over repeated random sampling. Design-based estimaters only need these inclusion probabilities to estimate mean values with standard errors. This means that for instance given a simple random sample, the unweighted sample mean is used to estimate the population mean, and no model parameters need to be fit.

Now the question is whether \(z(s_1)\) and \(z(s_2)\) can be expected to be correlated when \(s_1\) and \(s_2\) are close together. The question doesn’t work out as long as \(z(s_1)\) and \(z(s_2)\) are just two numbers: we need some kind of framework, random variables, that recreates this situation to form two sets of numbers for which we can consider correlation. The misconception here, as explained in Brus (2021a), is that the two are always spatially correlated but this is only the case when working under model-based approaches: \(Z(s_1)\) and \(Z(s_2)\) may well be correlated (“model-dependent”), but although in a particular random sample (realisation) \(z(s_1)\) and \(z(s_2)\) may be close in space, the corresponding random variables \(z(S_1)\) and \(z(S_2)\) considered over repeated random sampling are not close together, and are design-independent. Both situations can co-exist without contradiction, and are a consequence of choosing to work under one inference framework or the other.

The choice whether to work under a design-based or model-based framework depends on the purpose of the study and the data collection process. The model-based framework lends itself best for cases:

  • where predictions are required for individual locations, or for areas too small to be sampled
  • where the available data were not collected using a known random sampling scheme (i.e., the inclusion probabilities are unknown, or are zero over particular areas or/and times)

Design-based approaches are most suitable when:

  • observations were collected using a spatial random sampling process
  • aggregated properties of the entire sample region (or sub-region) are needed
  • estimates are required that are not sensitive to potential model misspecification, e.g. when needed for regulatory or legal purposes

In case a sampling procedure is to be planned (Jaap De Gruijter et al. 2006), some form of spatial random sampling is definitely worth considering since it opens up the possibility of following both inference frameworks.

10.2 Predictive models with coordinates

In data science projects, coordinates may be seen as features in a larger set of predictors (or features, or covariates) and treated accordingly. There are some catches with doing so.

As usual when working with predictors, it is good to choose predictive methods that are not sensitive to shifts in origin or shifts in unit (scale). Assuming a two-dimensional problem, predictive models should also not be sensitive to arbitrary rotations of the \(x\)- and \(y\)- or latitude and longitude axes. For projected (2D, Cartesian) coordinates this can be assured e.g. by using polynomials of order \(n\) as \((x+y)^n\), rather than \((x)^n + (y)^n\); for a second order polynomial this involves including the term \(xy\), so that an ellipsoidal-shape trend surface does not have to be aligned with the \(x-\) or \(y-\)axis. For a GAM model with spline components, one would use a spline in two dimensions \(s(x,y)\) rather than two independent splines \(s(x)\) and \(s(y)\) that do not allow for interaction. An exception to this “rule” is when e.g. a pure latitude effect is desired, for instance to account for yearly total solar energy influx.

When the area covered by the data is large, the difference between using ellipsoidal coordinates and projected coordinates will automatically become larger, and hence choosing one of both will have an effect on predictive modelling. For very large extents, e.g. global models, polynomials or splines in latitude and longitude will not make sense as they ignore the circular nature of longitude and the coordinate singularities at the poles. Here, spherical harmonics, base functions that are continuous on the sphere with increasing spatial frequencies can replace polynomials or be used as spline base functions.

In many cases, the spatial coordinates over which samples were collected also define the space over which predictions are made, setting them apart from other features. Many simple predictive approaches, including most machine learning methods, assume sample data to be independent. When samples are collected by spatially random sampling over the spatial target area, this assumption may be justified when working under a design-based context (Brus 2021b). This context however treats the coordinate space as the variable over which we randomize, which affords predicting values for a new randomly chosen location but rules out making predictions for fixed locations; this implies that averages over areas over which samples were collected can be obtained, but not spatial interpolations. In case predictions for fixed locations are required, or in case data were not collected by spatial random sampling, a model-based approach (as taken in Chapter 12) is needed and typically some form of spatial and/or temporal autocorrelation of residuals must be assumed.

A common case is where sample data are collected opportunistically (“whatever could be found”), and are then used in a predictive framework that does not weigh them. This has a consequence that the resulting model may be biased towards over-represented areas (in predictor space and/or in spatial coordinates space), and that simple (random) cross validation statistics may be over-optimistic when taken as performance measures for spatial prediction (Meyer and Pebesma 2021, 2022). Adaptive cross validation measures, e.g. from spatial cross validation may help getting more relevant measures for predictive performance.

10.3 Further reading

There is a large number of papers and books available on analysing and statistical modelling of spatial and spatiotemporal data, and a very large number of R packages help doing so. Several CRAN task views try to maintain an overview of the R packages, e.g. on:

The introductions to the chapters following contain more pointers to relevant literature references. Introductions to using the integrated nested Laplace approximation (INLA) for analysing spatial data are given in Blangiardo et al. (2013), Blangiardo and Cameletti (2015), and Gómez-Rubio (2020). Krainski et al. (2018) combine the INLA approach with stochastic partial differential equations. Many of these approaches use meshes to tesselate the area, to help define a Gaussian Markov Random Field both for modelling point pattern data or geostatistical data. Spatiotemporal Bayesian modelling of change of support problems is presented in A. M. Raim et al. (2021) and Andrew M. Raim et al. (2020).