Vector Data Cubes
Summary:
Where raster data cubes refer to data cubes with raster (x- and y-, or lon- and lat-) dimensions, vector data cubes are n-D arrays that have (at least) a single spatial dimension that maps to a set of (2-D) vector geometries. This post explains what they are used for, and how they can be handled with data science and GIS software.
Vector data cubes
Vector data cubes are n-D arrays with (at least) one dimension that maps to a set of typically 2-D vector geometries (points, lines or polygons). A simple example is a time series of precipitation values for a set of stations. With 5 time steps and 3 stations this could look like
time
station 2022-08-15 2022-08-16 2022-08-17 2022-08-18 2022-08-19
A 1.03 0.62 1.47 1.88 4.23
B 2.65 2.59 1.19 1.63 4.57
C 2.08 1.72 4.33 1.58 2.62
This contains station labels (A, B, C) but not geometries; we could
encode the stations with their WKT notation POINT(x y)
, as in
time
station 2022-08-15 2022-08-16 2022-08-17 2022-08-18 2022-08-19
POINT(5 7) 1.03 0.62 1.47 1.88 4.23
POINT(1.3 4) 2.65 2.59 1.19 1.63 4.57
POINT(8 3) 2.08 1.72 4.33 1.58 2.62
A second example is a time series of RGB brightness values sampled at four locations, which gives a 4 x 5 x 3 cube that can be printed as three 4 x 5 tables:
, , color = R
time
station 2022-08-15 2022-08-16 2022-08-17 2022-08-18 2022-08-19
POINT(5 7) 74 120 102 87 107
POINT(1.3 4) 10 55 149 201 17
POINT(8 3) 175 162 83 38 29
POINT(2 6) 197 159 235 61 95
, , color = G
time
station 2022-08-15 2022-08-16 2022-08-17 2022-08-18 2022-08-19
POINT(5 7) 31 179 240 151 26
POINT(1.3 4) 152 162 179 229 53
POINT(8 3) 86 162 249 116 34
POINT(2 6) 79 55 7 148 73
, , color = B
time
station 2022-08-15 2022-08-16 2022-08-17 2022-08-18 2022-08-19
POINT(5 7) 12 79 64 244 232
POINT(1.3 4) 33 23 53 165 182
POINT(8 3) 250 229 47 85 87
POINT(2 6) 186 26 29 249 253
Where do vector data cubes come from?
Naturally, in situ sensor data, where at regular time intervals data are collected at a number of stations, are vector data cube candidates. In the Earth Observation world, sampling raster data cubes at point locations leads to vector data cubes - an example would be to sample Sentinel-5P data cubes at the locations of air quality monitoring stations, in order to compare both - S5P values and in situ sensor values.
Other applications involve time series of land use (change) values observed over time periods at fixed locations, which are input to ML models for the classification of time series of land use: as opposed to classifying land use scene by scene dynamic world ref, from observations of land use time series a better approach might be to predict land use change from observed dynamics sits book ref.
Another case where vector cubes arise is when (polygon) area statistics are calculated from raster data cube imagery, e.g. the deforested area (fraction, or ha) by year and by state or country.
Representing vector data cubes in software
In principle, any software that can handle labeled arrays (arrays with named dimensions, and labels for dimension values) can handle vector data cubes. However, the handling is rather clumsy: labels are character (string) vectors, and do not reveal
- where time, geometries, or other dimensions are involved, and
- what dimension values mean: measurement units, or reference systems for time (origin and unit in case of numeric values; time zone, calendar) or space (coordinate reference system: datum, projection parameters)
More dedicated software takes care of this, e.g. R package
stars
summarizes the above data
like this:
stars object with 3 dimensions and 1 attribute
attribute(s):
Min. 1st Qu. Median Mean 3rd Qu. Max.
brightness [cd] 7 53 98.5 118.25 179 253
dimension(s):
from to offset delta refsys point values
station 1 4 NA NA WGS 84 TRUE POINT (5 7),...,POINT (2 6)
time 1 5 2022-08-15 1 days Date NA NULL
color 1 3 NA NA NA NA R, G, B
which
- recognizes the regularity of the time dimension, and its
Date
class - adds a reference system to the station geometries, and recognizes these are points
File formats for vector data cubes
array formats
Multidimensional arrays with a vector geometry dimension can well be
saved in formats like NetCDF or Zarr. For instance a NetCDF
representation, as printed by ncdump
, would look like
netcdf a {
dimensions:
color = 3 ;
time = 5 ;
station = 4 ;
variables:
double brightness(color, time, station) ;
brightness:grid_mapping = "crs" ;
brightness:coordinates = "lat lon" ;
brightness:units = "cd" ;
char crs ;
crs:grid_mapping_name = "latitude_longitude" ;
crs:long_name = "CRS definition" ;
crs:longitude_of_prime_meridian = 0. ;
crs:semi_major_axis = 6378137. ;
crs:inverse_flattening = 298.257223563 ;
crs:spatial_ref = "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Latitude\",NORTH],AXIS[\"Longitude\",EAST],AUTHORITY[\"EPSG\",\"4326\"]]" ;
crs:crs_wkt = "GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563]],PRIMEM[\"Greenwich\",0],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Latitude\",NORTH],AXIS[\"Longitude\",EAST],AUTHORITY[\"EPSG\",\"4326\"]]" ;
double station(station) ;
double time(time) ;
time:units = "days since 1970-01-01" ;
string col(color) ;
double lon(station) ;
lon:units = "degrees_north" ;
lon:standard_name = "longitude" ;
lon:axis = "X" ;
double lat(station) ;
lat:units = "degrees_east" ;
lat:standard_name = "latitude" ;
lat:axis = "Y" ;
double geometry ;
geometry:geometry_type = "point" ;
geometry:grid_mapping = "crs" ;
// global attributes:
:Conventions = "CF-1.6" ;
data:
brightness =
74, 10, 175, 197,
120, 55, 162, 159,
102, 149, 83, 235,
87, 201, 38, 61,
107, 17, 29, 95,
31, 152, 86, 79,
179, 162, 162, 55,
240, 179, 249, 7,
151, 229, 116, 148,
26, 53, 34, 73,
12, 33, 250, 186,
79, 23, 229, 26,
64, 53, 47, 29,
244, 165, 85, 249,
232, 182, 87, 253 ;
station = 1, 2, 3, 4 ;
time = 19219, 19220, 19221, 19222, 19223 ;
col = "R", "G", "B" ;
lon = 5, 1.3, 8, 2 ;
lat = 7, 4, 3, 6 ;
}
Such files can be read and written using GDAL’s multidimensional array
API, and
transformed into other multidimansional array formats using
gdalmdimtranslate
. The utility gdalmdiminfo
can print the dimension
metadata, or the entire information (including values) as JSON.
I/O: GIS formats
The two common GIS formats (as supported by GDAL) are
- vector tables: a set of vector geometries with zero or more attributes
- raster data: a raster images with 1 or more layers
Clearly, for vector data cubes the the raster data format will not work because the array dimensions do not correspond to two spatial dimensions (x and y). For vector tables, there are essentially two options:
long table form
The long table form can be illustrated by showing six records of the array above:
station time color brightness
1 POINT (5 7) 2022-08-15 R 74 [cd]
2 POINT (1.3 4) 2022-08-15 R 10 [cd]
3 POINT (8 3) 2022-08-15 R 175 [cd]
4 POINT (2 6) 2022-08-15 R 197 [cd]
5 POINT (5 7) 2022-08-16 R 120 [cd]
6 POINT (1.3 4) 2022-08-16 R 55 [cd]
In this form, the complete set of array values ends up in a single column, and all dimensions are recycled appropriately. This is the least ambiguous form because it
- keeps dimension and variable names,
- keeps data types (like variable time being of class
Date
) - keeps the array values in a single column.
On the other hand, it replicates dimension values and can lead to very large tables.
When a table of this kind is provided by a user, it is not immediately clear
- what the (unique) dimension values are, in particular for geometries
- whether all array values are present,
- whether it contains multiple records with identical dimension values
all this needs to be sorted out before one can recreate a multidimensional array from its long table form.
wide table forms
There are different ways in which we can use the column space to distribute our array values. The most extreme would not replicate geometries, so end up with four rows and combine the other dimensions (time, color) into columns, creating column names that paste the information togehter, as in the 4 rows x 15 columns table
2022-08-15.R 2022-08-16.R 2022-08-17.R 2022-08-18.R 2022-08-19.R 2022-08-15.G
1 74 [cd] 120 [cd] 102 [cd] 87 [cd] 107 [cd] 31 [cd]
2 10 [cd] 55 [cd] 149 [cd] 201 [cd] 17 [cd] 152 [cd]
3 175 [cd] 162 [cd] 83 [cd] 38 [cd] 29 [cd] 86 [cd]
4 197 [cd] 159 [cd] 235 [cd] 61 [cd] 95 [cd] 79 [cd]
2022-08-16.G 2022-08-17.G 2022-08-18.G 2022-08-19.G 2022-08-15.B 2022-08-16.B
1 179 [cd] 240 [cd] 151 [cd] 26 [cd] 12 [cd] 79 [cd]
2 162 [cd] 179 [cd] 229 [cd] 53 [cd] 33 [cd] 23 [cd]
3 162 [cd] 249 [cd] 116 [cd] 34 [cd] 250 [cd] 229 [cd]
4 55 [cd] 7 [cd] 148 [cd] 73 [cd] 186 [cd] 26 [cd]
2022-08-17.B 2022-08-18.B 2022-08-19.B station
1 64 [cd] 244 [cd] 232 [cd] POINT (5 7)
2 53 [cd] 165 [cd] 182 [cd] POINT (1.3 4)
3 47 [cd] 85 [cd] 87 [cd] POINT (8 3)
4 29 [cd] 249 [cd] 253 [cd] POINT (2 6)
Other forms would borrow from the long form, and for instance create 5 x
4 records with all station and time combinations and have variables R
,
G
and B
, or have 5 x 3 records combining the station and color,
having the time values as column names.
The disadvantages are obvious:
- column names collated from combinations of dimension values are cludgy, and are ugly to read or process
- the dimension values that end up in column names loose there type, units and reference system
- this may lead to tables with extremely many columns, which may be not practical or break software
- software my require that column names start with a letter, increasing the name cludge
Recreating the array from wide table forms may be relatively straightforward in data science languages (Python, R) but harder in databases.
It should be noted that vector data cubes can also have line or polygon geometries. In that case, WKT representations of geometries can become very long and not useful as colunn names.
Multiple (long) table forms (database normalization)
The repetition of dimension values in the long table form would be prohibitive if individual geometries are large. One way to cope with that is to create a geometry table with the unique geometries, and put a index (foreign key) to these geometries in the long table. This can be done with all recycling dimensions, and would make recreating the original array easier.