feat: pixel driller API #221

Open: mjaquiery wants to merge 15 commits into main
Conversation

@mjaquiery

The pixel driller API converts many single-band raster files into one gigantic multi-band raster file with each band tagged with its initial file name. The API will return the value for any given pixel on the `/{x}/{y}` endpoint.

The first time it is run, or anytime the source rasters are updated, the ingestion process will need to be rerun by sending a POST request to the `/reimport` endpoint with `{"code": str}` matching the `ADMIN_CODE` envvar.
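
For illustration, a minimal sketch of both calls. This assumes the service is running locally on port 5080 and that `ADMIN_CODE` is set in the environment; it is not the final client code.

```python
# Minimal sketch, not the final client: assumes the service listens on
# http://localhost:5080 and that ADMIN_CODE is set in the environment.
import os

import requests

# Trigger (re)ingestion of the source rasters; the code must match ADMIN_CODE.
requests.post(
    "http://localhost:5080/reimport",
    json={"code": os.environ["ADMIN_CODE"]},
)

# Query the band values for a single pixel, e.g. x=-77, y=18.
values = requests.get("http://localhost:5080/-77/18").json()
```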


This likely needs a few adjustments, but the core idea is there as a first try. There are simple tests for the two main functions - we could add a CI job to run them.

FWIW: I'm in no way suggesting this isn't a terrible way to do this, or even that it doesn't result in horrible data warping during reprojection. We need to investigate the data integrity in particular.

@thomas-fred commented Dec 4, 2024

Hey Matt, that was quick! :)

A few random thoughts:

  • I had assumed this would be a new API endpoint within the existing backend for some reason, but a new service also works. Any thoughts on the choice?
  • Given windowed reads, maybe it's worth trying to read the rasters one-by-one (no mega-raster) first, to see if that's acceptably fast (a rough sketch of this alternative is below the list).
  • How big will the mega-file be for Jamaica? I imagine eventually we'll want to deploy this for the global visualisation tool too. So that'll be hundreds of rasters with global coverage, all stacked into one file. That seems a bit gross. Maybe it's not? With the values resampled to a regular grid there are lots of potential alternatives though. Sharded parquet?
  • What's the response time for a few hundred bands of a single pixel with the current method?
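
The read-one-by-one idea could look roughly like the sketch below. The directory layout and filenames are made up, and the query coordinate has to be in each dataset's CRS; `rasterio`'s `sample` only reads the window around the requested point.

```python
# Rough sketch only: sample one pixel from each single-band GeoTIFF directly,
# with no combined mega-raster. Directory layout and filenames are assumptions.
from pathlib import Path

import rasterio


def sample_all(raster_dir: Path, x: float, y: float) -> dict[str, float]:
    values = {}
    for path in sorted(raster_dir.glob("*.tif")):
        with rasterio.open(path) as src:
            # sample() expects (x, y) pairs in the dataset's CRS and performs a
            # windowed read around each point.
            values[path.stem] = float(next(src.sample([(x, y)]))[0])
    return values
```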

@tomalrussell (Member)

Thanks again for the quick setup and provocation to test something else out!

  • ingesting to zarr takes a few minutes, with data stored in chunks whose layout and size can be tuned
  • this approach maintains the resolution and CRS of the source data, though it stores grid coordinates explicitly in the coordinate dimensions, rather than using the affine transform
  • it uses xarray's `ds.sel(..., method="nearest")` selector to pick the nearest pixel (sketched after this list)
  • retrieval is probably fast enough: locally I'm seeing around 60ms to get JSON back for all layers for a single point
  • the tileserver/stacks directory of zarr stores is ~820M, a little smaller than the ~990M tileserver/raster Cloud-Optimised GeoTIFFs directory for terracotta
  • I think this would sit well as an endpoint within backend - the metadata for layers and zarr datasets is all in a couple of CSVs as I've set it up now, but these are tightly coupled to the risk results (each hazard/rp/... layer that we query for an intensity value in pixel_driller now is also one for which we calculate damages and store in the postgres database). So metadata about these layers could well go in our db in a future iteration.
  • `ingest.py` could go in `etl`
  • the JSON response could be tweaked in many ways
    • do we want to use and expose the compound string keys as they are?
    • do we like a `[{'key': 'coastal__rp_10__...', 'band_data': 0.92349237}, ...]` list of objects?
    • or a big object of `{'coastal__rp_10__...': 0.92349237, ...}`?
    • other, more structured/flexible ways of passing the key metadata? Or more compact (currently around 50 kB response size)?
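
For illustration, the nearest-pixel lookup is roughly as below. The store path and the x/y dimension names are placeholders, not the actual layout.

```python
# Rough sketch of the nearest-pixel lookup against a zarr store; the store
# path and the x/y dimension names are illustrative, not the real ones.
import xarray as xr

ds = xr.open_zarr("tileserver/stacks/hazard.zarr")
point = ds.sel(x=-77.0, y=18.0, method="nearest")  # nearest grid cell to the query
values = {name: float(point[name]) for name in point.data_vars}
```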

@mjaquiery (Author)

Sounds like a great upgrade to what I came up with - I'll look properly tomorrow when I can see the code.
Agreed that there's lots to consider with sending layer metadata, including figuring out exactly what we need to transfer.

@mjaquiery (Author)

That looks like a really strong rework of the original code @tomalrussell, thanks.
@thomas-fred - the choice of a new service was that this is a fundamentally different data representation from terracotta's (or at least it might have been - I wandered through several different ideas before settling on raster bands), and that writing a small, self-contained FastAPI service is a relatively simple, extensible job that relies on tech already used in the project.

The initial solution took about 1.5 s to retrieve values across the full list of existing rasters. @tomalrussell has got that down to about 60 ms, although I don't know if that's running in a container or directly on the host system. He's also shrunk the output file size from about 7.5 GB to less than 1 GB, so, again, a massive improvement.

@thomas-fred

> the JSON response could be tweaked in many ways

I think we should be passing the FE something cleaner and easier for a non-nerd to understand than the key string. That could either be:

  1. a single 'pretty' version of the key string e.g. 'Coastal flooding, RCP4.5, return-period 10y'
  2. a structured object derived from the key string, e.g. `{'peril': 'flooding', 'sub-peril': 'coastal', 'rp': 10, 'rcp': '4.5', 'value': 0.975}`, then the FE can display more flexibly as desired (a rough sketch of such a record is at the end of this comment)

To display I'm imagining a traffic light categorisation of values, each dataset/key shaded by its classification at that location. If we go for 1), you could just natural sort the keys in the returned object and show them in the FE in a big long list with their classifications.

| Dataset | Value |
| --- | --- |
| Coastal flooding depth (m), RCP4.5, return-period 1y | 0.15 |
| Coastal flooding depth (m), RCP4.5, return-period 2y | 0.23 |
| Coastal flooding depth (m), RCP4.5, return-period 5y | 0.45 |
| Coastal flooding depth (m), RCP4.5, return-period 10y | 0.61 |
| Coastal flooding depth (m), RCP4.5, return-period 20y | 1.23 |

If we went for 2), you could build a table more easily. I'm imagining something like:

| Peril | Sub-peril | RCP | Variable | Unit | RP 1y | RP 2y | RP 5y | RP 10y | RP 20y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| flooding | coastal | 4.5 | depth | meters | 0.15 | 0.23 | 0.45 | 0.61 | 1.23 |

Of course the rasters don't all vary on the same parameters, so you'd probably have a block for each set with a common schema.
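
As a rough sketch of what one record in option (2) could look like (field names lifted from the example above, not a settled schema):

```python
# Illustrative only: a typed record per dataset/key for option (2); which
# fields exist, and their names, would depend on the agreed key schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PixelValue:
    peril: str                # e.g. "flooding"
    sub_peril: Optional[str]  # e.g. "coastal"
    rcp: Optional[str]        # e.g. "4.5"
    rp: Optional[int]         # return period in years, e.g. 10
    value: float              # pixel value at the queried location


record = PixelValue(peril="flooding", sub_peril="coastal", rcp="4.5", rp=10, value=0.975)
```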

@tomalrussell (Member)

Just to expand on this - I think something more structured, like (2), is more flexible and useful: it leaves the presentation up to the frontend without requiring it to know how to parse the arbitrary stringified compound keys.

In the irv-jamaica case here, the response would have data content like `hazard_layers.csv` without the `path` or `key` columns, and with a `value` column.

Here are some examples:

```python
import json
import pandas
import numpy


df = pandas.read_csv('https://raw.githubusercontent.com/nismod/irv-jamaica/37b83418853027de17aae07ec9a84ba00e7b1db5/etl/hazard_layers.csv')
df.drop(columns=['key', 'path'], inplace=True)
df["value"] = numpy.around(numpy.random.random((581,)), decimals=10)

# (a) This record-oriented format is simple and parses easily to list-of-objects
df.to_json("records.json", orient='records')

# (b) This record-oriented format is similar, with a light self-documenting schema
df.to_json("table.json", orient='table', index=False)

# (c) This format is compact (avoids repeating column names) but still record-oriented
#     ("data" is a nested list of "row" lists of cell values)
df.to_json("split.json", orient='split', index=False)

# (d) This column-oriented format is simple and compresses well:
#     keys are column names, values are lists of cell values
data = {}
for col in df.columns:
    data[col] = df[col].tolist()

with open("columns.json", 'w') as fh:
    json.dump(data, fh, separators=(',', ':'))
```

I think (d) `columns.json` or (a) `records.json` look preferable for simplicity and flexibility.

@thomas-fred

Hi all, I think this is ready for now. We decided to keep it as a separate service in the end.

If you run the service, ingest, and then hit the endpoint with:

```sh
curl http://localhost:5080/-77/18
```

then it should return `response.json`.

eatyourgreens and others added 4 commits, January 24, 2025 13:12:

- make sure `libgdal-dev` is installed during the Docker build.
- update VSCode settings.