feat: pixel driller API #221
Conversation
The pixel driller API converts many single-band raster files into one gigantic multi-band raster file with each band tagged with its initial file name. The API will return the value for any given pixel on the `/{x}/{y}` endpoint. The first time it is run, or anytime the source rasters are updated, the ingestion process will need to be rerun by sending a POST request to the `/reimport` endpoint with `{"code": str}` matching the `ADMIN_CODE` envvar.
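A minimal sketch of triggering re-ingestion, assuming a local deployment (the base URL and admin code here are placeholders, not values from the repository):

```python
import requests

BASE_URL = "http://localhost:8080"  # hypothetical host/port for the service

# Trigger re-ingestion; "code" must match the service's ADMIN_CODE envvar
response = requests.post(f"{BASE_URL}/reimport", json={"code": "change-me"})
response.raise_for_status()
```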
Hey Matt, that was quick! :) A few random thoughts:
Thanks again for the quick setup and provocation to test something else out!
Sounds like a great upgrade to what I came up with - I'll look properly tomorrow when I can see the code.
That looks like a really strong rework of the original code @tomalrussell, thanks. The initial solution took about 1.5 s to retrieve the full list of existing rasters; @tomalrussell has got that down to about 60 ms, although I don't know whether that's running in a container or directly on the host system. He's also shrunk the output file size from about 7.5 GB to less than 1 GB, so, again, a massive improvement.
I think we should be passing something to the FE that is cleaner and easier to understand for a non-nerd than the key string. I think that could either be:

1. the raw keys, cleaned up and natural-sorted, or
2. structured records, with each compound key parsed out into its component fields.
To display, I'm imagining a traffic-light categorisation of values, with each dataset/key shaded by its classification at that location. If we go for (1), you could just natural-sort the keys in the returned object (see the sketch below) and show them in the FE in a big long list with their classifications.
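A minimal natural-sort sketch, assuming the response is a flat object keyed by the raster key strings (the example keys are invented):

```python
import re

def natural_key(key: str):
    # Split into digit and non-digit runs so "rp_100" sorts after "rp_20"
    # rather than before it, as plain lexicographic sorting would do
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", key)]

response_data = {"coastal_rp_100": 1.2, "coastal_rp_20": 0.4}  # invented keys
sorted_keys = sorted(response_data, key=natural_key)
```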
If we went for (2), you could build a table more easily. I'm imagining something like the sketch after this paragraph. Of course the rasters don't all vary on the same parameters, so you'd probably have a block for each set with a common schema.
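For illustration only, one hypothetical shape for such a response, with one block per set of rasters that share a schema (all field names and values here are invented for the sketch):

```python
# Hypothetical structured response: one block per common schema, each block
# carrying its column names once, plus rows of values with a traffic-light class
response_sketch = {
    "blocks": [
        {
            "columns": ["hazard", "return_period", "value", "classification"],
            "rows": [
                ["coastal_flood", 20, 0.4, "green"],
                ["coastal_flood", 100, 1.2, "amber"],
            ],
        },
    ],
}
```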
Just to expand on this - I think something more structured, like (2), is more flexible and useful: it leaves the presentation up to the frontend but doesn't insist on all the knowledge required to parse out the arbitrary stringified compound keys. Here are some examples:

```python
import json

import numpy
import pandas

df = pandas.read_csv('https://raw.githubusercontent.com/nismod/irv-jamaica/37b83418853027de17aae07ec9a84ba00e7b1db5/etl/hazard_layers.csv')
df.drop(columns=['key', 'path'], inplace=True)
df["value"] = numpy.around(numpy.random.random((581,)), decimals=10)

# (a) This record-oriented format is simple and parses easily to a list of objects
df.to_json("records.json", orient='records')

# (b) This record-oriented format is similar, with a light self-documenting schema
df.to_json("table.json", orient='table', index=False)

# (c) This format is compact (avoids repeating column names) but still
# record-oriented ("data" is a nested list of "row" lists of cell values)
df.to_json("split.json", orient='split', index=False)

# (d) This column-oriented format is simple and compresses well: keys are
# column names, values are lists of cell values
data = {}
for col in df.columns:
    data[col] = df[col].tolist()

with open("columns.json", 'w') as fh:
    json.dump(data, fh, separators=(',', ':'))
```

I think (d) columns.json or (a) records.json look preferable for simplicity and flexibility.
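As a quick check that (d) round-trips cleanly, the column-oriented dict loads straight back into a DataFrame:

```python
import json
import pandas

with open("columns.json") as fh:
    restored = pandas.DataFrame(json.load(fh))
```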
Force-pushed from 36eda8f to 578dbc5.
Hi all, I think this is ready for now. We decided to keep it as a separate service in the end. If you run the service, ingest, and then hit the endpoint as sketched below, it should return response.json.
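A hedged sketch of that request (host, port and pixel coordinates are placeholders):

```python
import requests

BASE_URL = "http://localhost:8080"  # hypothetical host/port for the service
x, y = 100, 200  # placeholder pixel coordinates

# Query the value of every band at the given pixel
values = requests.get(f"{BASE_URL}/{x}/{y}").json()
```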
- make sure `libgdal-dev` is installed during the Docker build.
- update VSCode settings.
This likely needs a few adjustments, but the core idea is there as a first try. There are simple tests for the two main functions - we could add a CI job to run them.
FWIW: I'm not claiming this is a good way to do this, or even that it doesn't result in horrible data warping during reprojection. We need to investigate the data integrity in particular.