feat: pixel driller API #221

Open: mjaquiery wants to merge 15 commits into main
Conversation

@mjaquiery

The pixel driller API converts many single-band raster files into one gigantic multi-band raster file with each band tagged with its initial file name. The API will return the value for any given pixel on the `/{x}/{y}` endpoint.

The first time it is run, or anytime the source rasters are updated, the ingestion process will need to be rerun by sending a POST request to the `/reimport` endpoint with `{"code": str}` matching the `ADMIN_CODE` envvar.
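
For illustration, a minimal sketch of both calls. This assumes the service is running locally on port 5080 and that `ADMIN_CODE` is set in the environment; it is not the final client code.

```python
# Minimal sketch, not the final client: assumes the service listens on
# http://localhost:5080 and that ADMIN_CODE is set in the environment.
import os

import requests

# Trigger (re)ingestion of the source rasters; the code must match ADMIN_CODE.
requests.post(
    "http://localhost:5080/reimport",
    json={"code": os.environ["ADMIN_CODE"]},
)

# Query the band values for a single pixel, e.g. x=-77, y=18.
values = requests.get("http://localhost:5080/-77/18").json()
```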


This likely needs a few adjustments, but the core idea is there as a first try. There are simple tests for the two main functions - we could add a CI job to run them.

FWIW: I'm in no way suggesting this isn't a terrible way to do this, or even that it doesn't result in horrible data warping during reprojection. We need to investigate the data integrity in particular.

@thomas-fred commented Dec 4, 2024

Hey Matt, that was quick! :)

A few random thoughts:

  • I had assumed this would be a new API endpoint within the existing backend for some reason, but a new service also works. Any thoughts on the choice?
  • Given windowed reads, maybe it's worth trying to read the rasters one-by-one (no mega-raster) first, to see if that's acceptably fast (a rough sketch of this alternative is below the list).
  • How big will the mega-file be for Jamaica? I imagine eventually we'll want to deploy this for the global visualisation tool too. So that'll be hundreds of rasters with global coverage, all stacked into one file. That seems a bit gross. Maybe it's not? With the values resampled to a regular grid there are lots of potential alternatives though. Sharded parquet?
  • What's the response time for a few hundred bands of a single pixel with the current method?
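
The read-one-by-one idea could look roughly like the sketch below. The directory layout and filenames are made up, and the query coordinate has to be in each dataset's CRS; `rasterio`'s `sample` only reads the window around the requested point.

```python
# Rough sketch only: sample one pixel from each single-band GeoTIFF directly,
# with no combined mega-raster. Directory layout and filenames are assumptions.
from pathlib import Path

import rasterio


def sample_all(raster_dir: Path, x: float, y: float) -> dict[str, float]:
    values = {}
    for path in sorted(raster_dir.glob("*.tif")):
        with rasterio.open(path) as src:
            # sample() expects (x, y) pairs in the dataset's CRS and performs a
            # windowed read around each point.
            values[path.stem] = float(next(src.sample([(x, y)]))[0])
    return values
```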

@tomalrussell (Member)

Thanks again for the quick setup and provocation to test something else out!

  • ingesting to zarr takes a few minutes, with data stored in chunks whose layout and size can be tuned
  • this approach maintains the resolution and CRS of the source data, though it stores grid coordinates explicitly in the coordinate dimensions, rather than using the affine transform
  • it uses xarray's `ds.sel(..., method="nearest")` selector to pick the nearest pixel (sketched after this list)
  • retrieval is probably fast enough: locally I'm seeing around 60ms to get JSON back for all layers for a single point
  • the tileserver/stacks directory of zarr stores is ~820M, a little smaller than the ~990M tileserver/raster Cloud-Optimised GeoTIFFs directory for terracotta
  • I think this would sit well as an endpoint within backend - the metadata for layers and zarr datasets is all in a couple of CSVs as I've set it up now, but these are tightly coupled to the risk results (each hazard/rp/... layer that we query for an intensity value in pixel_driller now is also one for which we calculate damages and store in the postgres database). So metadata about these layers could well go in our db in a future iteration.
  • `ingest.py` could go in `etl`
  • the JSON response could be tweaked in many ways
    • do we want to use and expose the compound string keys as they are?
    • do we like a `[{'key': 'coastal__rp_10__...', 'band_data': 0.92349237}, ...]` list of objects?
    • or a big object of `{'coastal__rp_10__...': 0.92349237, ...}`?
    • other, more structured/flexible ways of passing the key metadata? Or more compact (currently around 50 kB response size)?
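
For illustration, the nearest-pixel lookup is roughly as below. The store path and the x/y dimension names are placeholders, not the actual layout.

```python
# Rough sketch of the nearest-pixel lookup against a zarr store; the store
# path and the x/y dimension names are illustrative, not the real ones.
import xarray as xr

ds = xr.open_zarr("tileserver/stacks/hazard.zarr")
point = ds.sel(x=-77.0, y=18.0, method="nearest")  # nearest grid cell to the query
values = {name: float(point[name]) for name in point.data_vars}
```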

@mjaquiery (Author)

Sounds like a great upgrade to what I came up with - I'll look properly tomorrow when I can see the code.
Agreed that there's lots to consider with sending layer metadata, including figuring out exactly what we need to transfer.

@mjaquiery (Author)

That looks like a really strong rework of the original code @tomalrussell, thanks.
@thomas-fred - the choice of a new service was that this is a fundamentally different data representation from terracotta's (or at least it might have been - I wandered through several different ideas before settling on raster bands), and that writing a small, self-contained FastAPI service is a relatively simple, extensible job that relies on tech already used in the project.

The initial solution took about 1.5 s to retrieve values across the full list of existing rasters. @tomalrussell has got that down to about 60 ms, although I don't know if that's running in a container or directly on the host system. He's also shrunk the output file size from about 7.5 GB to less than 1 GB, so, again, a massive improvement.

@thomas-fred

> the JSON response could be tweaked in many ways

I think we should be passing the FE something cleaner and easier for a non-nerd to understand than the key string. That could either be:

  1. a single 'pretty' version of the key string e.g. 'Coastal flooding, RCP4.5, return-period 10y'
  2. a structured object derived from the key string, e.g. `{'peril': 'flooding', 'sub-peril': 'coastal', 'rp': 10, 'rcp': '4.5', 'value': 0.975}`, then the FE can display more flexibly as desired (a rough sketch of such a record is at the end of this comment)

To display I'm imagining a traffic light categorisation of values, each dataset/key shaded by its classification at that location. If we go for 1), you could just natural sort the keys in the returned object and show them in the FE in a big long list with their classifications.

| Dataset | Value |
| --- | --- |
| Coastal flooding depth (m), RCP4.5, return-period 1y | 0.15 |
| Coastal flooding depth (m), RCP4.5, return-period 2y | 0.23 |
| Coastal flooding depth (m), RCP4.5, return-period 5y | 0.45 |
| Coastal flooding depth (m), RCP4.5, return-period 10y | 0.61 |
| Coastal flooding depth (m), RCP4.5, return-period 20y | 1.23 |

If we went for 2), you could build a table more easily. I'm imagining something like:

| Peril | Sub-peril | RCP | Variable | Unit | RP 1y | RP 2y | RP 5y | RP 10y | RP 20y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| flooding | coastal | 4.5 | depth | meters | 0.15 | 0.23 | 0.45 | 0.61 | 1.23 |

Of course the rasters don't all vary on the same parameters, so you'd probably have a block for each set with a common schema.
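
As a rough sketch of what one record in option (2) could look like (field names lifted from the example above, not a settled schema):

```python
# Illustrative only: a typed record per dataset/key for option (2); which
# fields exist, and their names, would depend on the agreed key schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class PixelValue:
    peril: str                # e.g. "flooding"
    sub_peril: Optional[str]  # e.g. "coastal"
    rcp: Optional[str]        # e.g. "4.5"
    rp: Optional[int]         # return period in years, e.g. 10
    value: float              # pixel value at the queried location


record = PixelValue(peril="flooding", sub_peril="coastal", rcp="4.5", rp=10, value=0.975)
```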

@tomalrussell (Member)

Just to expand on this - I think something more structured, like (2), is more flexible and useful: it leaves the presentation up to the frontend without requiring it to know how to parse the arbitrary stringified compound keys.

In the irv-jamaica case here, the response would have data content like `hazard_layers.csv` without the `path` or `key` columns, and with a `value` column.

Here are some examples:

```python
import json
import pandas
import numpy


df = pandas.read_csv('https://raw.githubusercontent.com/nismod/irv-jamaica/37b83418853027de17aae07ec9a84ba00e7b1db5/etl/hazard_layers.csv')
df.drop(columns=['key', 'path'], inplace=True)
df["value"] = numpy.around(numpy.random.random((581,)), decimals=10)

# (a) This record-oriented format is simple and parses easily to list-of-objects
df.to_json("records.json", orient='records')

# (b) This record-oriented format is similar, with a light self-documenting schema
df.to_json("table.json", orient='table', index=False)

# (c) This format is compact (avoids repeating column names) but still record-oriented
#     ("data" is a nested list of "row" lists of cell values)
df.to_json("split.json", orient='split', index=False)

# (d) This column-oriented format is simple and compresses well:
#     keys are column names, values are lists of cell values
data = {}
for col in df.columns:
    data[col] = df[col].tolist()

with open("columns.json", 'w') as fh:
    json.dump(data, fh, separators=(',', ':'))
```

I think (d) `columns.json` or (a) `records.json` look preferable for simplicity and flexibility.

@thomas-fred

Hi all, I think this is ready for now. We decided to keep it as a separate service in the end.

If you run the service, ingest, and then hit the endpoint with:

```sh
curl http://localhost:5080/-77/18
```

then it should return `response.json`.

eatyourgreens and others added 4 commits, January 24, 2025 13:12:

- make sure `libgdal-dev` is installed during the Docker build.
- update VSCode settings.