dc.Rmd

---
title: "Raster and vector data cubes in R"
author: "Edzer Pebesma, OpenGeoHub Summer School 2023"
date: "Aug 29, 2023"
---

# Phenomena: spatially discreet or continuous?

A bit of theory.

[Stevens' 1946 "On the Theory of Scales of
Measurement"](https://www.science.org/doi/10.1126/science.103.2684.677)
says there are four measurement scales: nominal, ordinal, interval
and ratio. It is tempting to reduce this further to discrete
(nominal, ordinal) and continuous (interval, ratio), although
a count variable has a bit of both.  Are the phenomena we study
discreet or continuous?

We think of our phenomenon as $Z(s)$, property $Z$ is measured on object
$s_i$ or at location $s$.  For making progress, we need to
distinguish here whether we talk about

* (non-spatial) properties of objects or locations: $Z$
    * your blood pressure and temperature are two continuous variables on a discrete object (you)
    * your DNA (or at least, the presence of a specific gene) is discrete
* spatial continuity or discreteness: $s$
    * body temperature is continuous within your body, but does not extend it; average body temperature is a property of a spatially discrete entity
    * air temperature is continuous through the air, and is spatially continuous (in $s$), as well as a continuous property (in $Z$)
    * soil type or land use type is continuous in space $s$, but discrete in value $Z$

In space-time we have for instance:

* temperature is continuous in space, time and property
* being infected with disease $x$ is discrete in space (e.g., a person) but continuous in time.
* a remote sensing image is (semi-) continuous in space $s$ and value $Z$ (energy) but the _observation_ is discrete in time (snapshot); the energy phenomena is continuous in time

We often associate

* ***vector data*** with _spatially_ discrete phenomena ("features")
* ***raster data*** with _spatially_ continuous phenomena ("fields", coverages)

but this is always the case: e.g. POINTs can reflect

* locations of things (trees, persons, cars)
* locations for _measurements_ of spatially continuous phenomena (air temperature, air quality),

LINESTRINGs can reflect

* roads, railways or borders
* contour lines with constant values of spatially continuous elevation

and POLYGONs can reflect

* administrative regions, voting districts or forest stands
* the boundaries of of a spatially continuous discrete variable like land use or soil type

Linestring and polygon geometries are collections of points, and attributes associated with the geometry can be
associated either with

* all of the points (1 value: population count, population density, size or length)
* each of the points (soil type, land use, contour line elevation)

We say for the first relation that the attribute has
_block support_, and for the second that it has _point
support_, following the geostatistical literature. [Further
reading](https://r-spatial.org/book/05-Attributes.html).

Getting this wrong, e.g. when downscaling polygons, may be a source
of gross error in subsequent analysis. Before you analyse data,
it makes a lot of sense to not only consider the measurement scale
of your attribute $Z$, but also whether it is spatially continuous
or discrete ($s$). If it is discrete, counting, sums and spatial
densities make sense, if it is continuous averaging and interpolation
(prediction over continuous space) make sense.

```{r}
library(sf)
(pol = st_polygon(list(rbind(c(0,0), c(2,0), c(2,2), c(0,2), c(0,0)))))
(point = st_point(c(1,1)))
st_intersects(point, pol)
```
Now suppose we know two properties of the polygon: total population and
land use:
```{r}
(pol.sf = st_sf(population = 5000, land_use = "urban", geom = st_sfc(pol)))
```
If we now intersect `point` with `pol.sf`, should it retain the associated
attributes?
```{r}
st_intersection(pol.sf, point)
```
How do we get rid of this warning? By setting the attribute-geometry relationsip (agr):

```{r}
st_agr(pol.sf) = c(population = "aggregate", land_use = "constant")
st_intersection(pol.sf["land_use"], point)
```
No warning!
```{r}
st_intersection(pol.sf["population"], point)
```
Warning justified: maybe this should be an error, `population` be returned `NA`.

# Tidy data?

Recall [tidy data](https://www.jstatsoft.org/article/view/v059i10):
_each variable is a column, each observation is a row, and each
type of observational unit is a table._

```{r}
head(mtcars)
```

`sf` and `geopandas` objects belong to this category.

## Enter raster data and data cubes

[See here](https://r-spatial.org/book/06-Cubes.html)

## Limits of tables

For raster data, we can, obviously, put things in tables, in
different ways

```{r}
library(stars)
L7_ETMs |> as.data.frame() |> head()
L7_ETMs |> st_as_sf(long = TRUE, as_points = TRUE) |> head()
L7_ETMs |> st_as_sf(as_points = TRUE) |> head()
L7_ETMs |> st_as_sf(as_points = FALSE) |> head()
```

In principle, everything you can do with arrays, you can do with
tables; not vice versa. Why then would you work with arrays?

## Why/when would you stick to arrays?

If you use arrays, you have the advantage of:

	* guarantee of complete data: "empty" fields are filled with 0 (count) or NA (continuous, or nominal)
    * there is no need to take care of completeness, data are complete by construction
	* in-memory arrays have a continous memory layout, hence indexes are for free: we "know" where each data value is ("[geotransform](https://r-spatial.org/book/06-Cubes.html#regular-dimensions-gdals-geotransform)")
    * for larger datasets: arrays form an easy and useful abstraction
    * array [_operations_](https://r-spatial.org/book/06-Cubes.html#sec-dcoperations) are a powerful and comprehensive means to design and communicate analysis, and are much harder to carry out with tables


```{r}
stars:::get_geotransform(L7_ETMs)
```

# Datacube datasets stored in tables:

## long table:

Each record (row) reflects a unique combination of space (state) and time (year):

```{r}
data(Produc, package = "plm")
head(Produc)
nrow(Produc)
length(unique(Produc$state)) * length(unique(Produc$year))
# but do we acually have all combinations?
with(Produc, length(unique(paste(state, year, sep = ":"))))
```

## space-wide table:

different records (rows) correspond to different times, columns to different locations:
```{r}
data(wind, package = "gstat")
head(wind)
```

## time-wide table:

Different records (rows) correspond to different locations, columns to different times (xxx74: xxx for 1974; xxx79: xxx for 1979)
```{r}
library(sf)
system.file("gpkg/nc.gpkg", package="sf") |> read_sf() -> nc
head(nc)
```

# Creating datacubes from non-datacube data

## Foot-and-mouth disease cases

```{r}
data(fmd, package = 'stpp')
head(fmd)
data("northcumbria", package = 'stpp')
fmd.sf = st_as_sf(as.data.frame(fmd), coords = c('X', 'Y'))
n = nrow(northcumbria)
nh = st_sfc(st_polygon(list(northcumbria[c(1:n,1),])))
plot(fmd.sf, pch = 16, reset = FALSE, extent = nh, breaks = "quantile")
plot(nh, add = TRUE)
```

Create an empty datacube covering the space-time area, and count the
number of cases in each grid cell:
```{r}
library(stars)
st = st_as_stars(nh, nx = 10, ny = 10)
plot(st, reset = FALSE)
plot(nh, add = TRUE, border = 'green')
a = aggregate(fmd.sf, st_as_sf(st), FUN = length)
plot(a, main = "# of cases", reset = FALSE)
plot(fmd.sf, add = TRUE, col = 'grey')
plot(nh, border = 'green', add = TRUE)
```

See https://www.jstatsoft.org/article/view/v053i02 for a more elaborate
approach to model these data statistically.

## Hurricanes

See [hurricanes.Rmd](https://github.com/edzer/OGH23/blob/main/hurdat.Rmd), [output](hurdat.html)

## OD

See example in SDS: https://r-spatial.org/book/07-Introsf.html#sec-oddc

## `nc`: a categorical dimension

```{r}
library(sf)
nc = read_sf(system.file("gpkg/nc.gpkg", package="sf"))
m = st_set_geometry(nc, NULL)
n = as.matrix(m[c("BIR74", "SID74", "NWBIR74", "BIR79", "SID79", "NWBIR79")]) # 100 x 6
dim(n) = c(county = 100, var = 3, year = 2) # 100 x 3 x 2
dimnames(n) = list(county = nc$NAME, var = c("BIR", "SID", "NWBIR"), year = c(1974, 1979))
library(stars)
(st = st_as_stars(pop = n)) # without geometries
foo <- st |> st_set_dimensions(1, st_geometry(nc)) # with
foo
st_bbox(foo)
(x = st_as_sf(foo))
frac = function(x) x[2] / x[1]
frac2 = function(x) c(sidsr = x[2] / x[1], nwbr = x[3] / x[1])
frac2an = function(x) c(x[2] / x[1], x[3] / x[1])
st_apply(foo, c(1,3), frac)
st_apply(foo, c(1,3), frac2)
st_apply(foo, c(1,3), frac2an)
library(abind)
aperm(st_apply(foo, c(1,3), frac2), c("county", "year", "frac2"))

split(foo, 2)
split(foo, 3)

# subset vector cube:
foo[nc[1]]
```

# Vector data cubes from raster data cubes

Consider the following temperature reanalysis data, taken from [this file](https://psl.noaa.gov/repository/entry/show/PSD+Climate+Data+Repository/Public/PSD+Datasets/PSD+Gridded+Datasets/ncep.reanalysis2.derived/gaussian_grid/skt.sfc.mon.mean.nc?entryid=synth%3Ae570c8f9-ec09-4e89-93b4-babd5651e7a9%3AL25jZXAucmVhbmFseXNpczIuZGVyaXZlZC9nYXVzc2lhbl9ncmlkL3NrdC5zZmMubW9uLm1lYW4ubmM%3D&output=default.html).

We can read this NetCDF file using either of two ways:
```{r}
u = "https://psl.noaa.gov/repository/entry/get/skt.sfc.mon.mean.nc?entryid=synth%3Ae570c8f9-ec09-4e89-93b4-babd5651e7a9%3AL25jZXAucmVhbmFseXNpczIuZGVyaXZlZC9nYXVzc2lhbl9ncmlkL3NrdC5zZmMubW9uLm1lYW4ubmM%3D"
f = 'skt.sfc.mon.mean.nc'
if (!file.exists(f)) {
  download.file(u, f)
}
library(stars)
read_stars(f) # GDAL RasterLayer
read_mdim(f)  # GDAL Multidimensional Array API
```

We will continue with the first in order to have a regular grid; another approach
would have been to resample/warp the rectilinear grid.

Sampling this at two locations:
```{r}
skt = read_stars('skt.sfc.mon.mean.nc')
st_crs(skt) = 'OGC:CRS84'
pts = st_sfc(st_point(c(7, 52)), st_point(c(16.9, 52)), crs = st_crs(skt))
e = st_extract(skt, pts)
library(xts)
e.xts = as.xts(e)
colnames(e.xts) = c("Muenster", "Poznan")
plot(e.xts, legend.loc = "top")
```

Computing the means over three countries:

```{r}
ne = rnaturalearth::ne_countries(returnclass = "sf")
library(dplyr)
sel = c("PL", "DE", "ES")
ne |> filter(iso_a2 %in% sel) -> ne3
aggregate(skt, ne3, FUN=mean, na.rm = TRUE) |>
  aggregate(by = "year", FUN=mean, na.rm = TRUE) |>
  as.xts() -> a3.xts
colnames(a3.xts) = sel
plot(a3.xts, legend.loc = "top")
```