
Some questions about implementing custom Codec classes #402

Open

Cadair opened this issue Nov 30, 2022 · 16 comments

Cadair commented Nov 30, 2022

Hello 👋

@astrofrog and I are currently re-implementing how astropy handles the compression of tiled image data in its FITS library. We have implemented the compression algorithms in the form of Codec classes (without introducing a required dependency on numcodecs). The main motivation for this is to allow projects like kerchunk and zarr to use these codecs in fun ways in the future.

We have run into a few things which aren't clear to us as we have been learning how to make use of the Codec API:

  1. The numcodecs API states that "Inputs to and outputs from these methods may be any Python object exporting a contiguous buffer via the new-style buffer protocol." What objects are acceptable here? I see some codecs implemented in numcodecs returning bytes and some returning numpy arrays from their decode methods. These objects don't strictly adhere to the buffer API (although they obviously contain a contiguous buffer). How are the return types used, and what practical restrictions are there on them? We are currently returning a memoryview object from our decode methods, but would it be a compatibility issue if we returned numpy arrays instead?

  2. The signature of Codec.__init__ isn't specified as part of the ABC. I assume from this line that it would be fine for all the settings to be keyword-only arguments to the constructor?

  3. Finally, and most unclear: for our FITS quantization methods, the encode method generates metadata (offset and scale) which the decode method needs in order to successfully round-trip through the quantization. What's the best way to handle this? Can we safely mutate the instance attributes of the Codec? Will the config of the codec be serialised to the metadata (by zarr or other things) after encoding in a way which makes this reliable? Are there any other examples of this kind of pattern we could learn from? (A rough sketch of what we have in mind is below.)
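
For concreteness, here is a minimal sketch of the kind of subclass we are writing. The class name and the quantization itself are simplified placeholders, not our actual implementation:

```python
import numpy as np

from numcodecs.abc import Codec


class Quantize(Codec):  # placeholder name; real FITS quantization is more involved
    codec_id = "fits_quantize"

    def __init__(self, *, scale=1.0, offset=0.0):  # keyword-only settings (question 2)
        self.scale = scale
        self.offset = offset

    def encode(self, buf):
        # View the incoming buffer as a flat float array.
        arr = np.frombuffer(buf, dtype="float32")
        quantized = np.round((arr - self.offset) / self.scale).astype("int32")
        return quantized.tobytes()

    def decode(self, buf, out=None):
        quantized = np.frombuffer(buf, dtype="int32")
        # Question 1: is returning an ndarray (or memoryview) here acceptable,
        # or should this be bytes?
        return quantized.astype("float32") * self.scale + self.offset
```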

martindurant (Member) commented:

  1. Historically, bytes-like types (e.g., memoryview) have been used by compressor codecs, and numpy arrays for everything else. For codecs that work on arrays, the first line is usually some version of get-array-from-buffer, although the array is usually 1D at that point.

  2. Yes, the codecs should all be instantiable with kwargs only. The serialization is JSON, so there is a limit to how complex the arguments can be.

  3. This is the trickiest one! The codec instance is stored at array creation time - you normally pass the instance directly to the create methods, before any chunk data. Of course, if you already have the data, you could most easily do two passes over some part of it to get your metadata. The bitinformation codec is like this - first you analyse your data, then instantiate the codec with the resultant argument at array creation time, then encode your data. The zarr metadata is of course only JSON, so it can be updated in place easily enough - kerchunk does this kind of thing, but zarr has no API for it. (Points 2 and 3 are sketched at the end of this comment.)

I suspect, though, that you mean you have metadata that is different for each chunk. That doesn't fit into the zarr/numcodecs model at all. Kerchunk/referenceFS has been thinking about stacking some of the chunk processing directly in the storage layer, where it works chunk-wise, rather than in the zarr machinery, to allow such a thing, even though it duplicates some code and logic. The initial use case is for simple offset/scale, but where the values to use depend on the specific chunk.
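
To illustrate points 2 and 3, a sketch using numcodecs' built-in FixedScaleOffset as a stand-in for a custom codec (the file name and array parameters are arbitrary):

```python
import zarr
from numcodecs import FixedScaleOffset, get_codec

# Point 2: the config is a plain JSON-able dict built from the kwargs.
codec = FixedScaleOffset(offset=2.0, scale=10.0, dtype="f4", astype="i4")
config = codec.get_config()       # includes 'id' plus the constructor arguments
roundtripped = get_codec(config)  # rebuilt from the config alone

# Point 3: the instance is handed over once, at array creation time, and its
# config is frozen into the array metadata before any chunk data is seen.
z = zarr.open("example.zarr", mode="w", shape=(4096, 4096),
              chunks=(256, 256), dtype="f4", filters=[codec])
```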

Cadair (Author) commented Nov 30, 2022

Hi @martindurant, thanks for the answers!

We do indeed have per-chunk metadata for the quantization. Sounds like this will be a fun challenge for me later on when I try to get kerchunk/zarr to process these data.

We did think about initializing the codec after having done the analysis to get the scale and offset, which would work well for decode. I assume kerchunk/zarr can't initialise a filter chain per chunk? Initialising per-chunk codecs is something we could do inside Astropy to make this work while still presenting a Codec interface, which might be a better base to build from - roughly the sketch below.
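
Something like this hypothetical astropy-internal loop - one short-lived codec per tile, configured from that tile's row of the binary table (ZSCALE/ZZERO are the standard FITS tiled-compression columns; bintable and compressed_tiles are made-up names, and FixedScaleOffset stands in for our own codec):

```python
from numcodecs import FixedScaleOffset

tiles = []
for row, tile_bytes in zip(bintable, compressed_tiles):
    # FITS stores value = quantized * ZSCALE + ZZERO, which maps onto
    # FixedScaleOffset's decode convention (value = i / scale + offset).
    codec = FixedScaleOffset(offset=row["ZZERO"], scale=1.0 / row["ZSCALE"],
                             dtype="f4", astype="i4")
    tiles.append(codec.decode(tile_bytes))
```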

martindurant (Member) commented:

> I assume kerchunk/zarr can't initialise a filter chain per chunk

No.

There is the idea I described above, in which kerchunk could store extra metadata per chunk and use it at load time, but you may also be interested in the discussion here: I describe passing a "context" to decode, so that you know where in the overall array you are decoding. That would mean you could store the full list of arguments in the single codec instance, much closer to the current model.
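
A sketch of what that might look like - the context argument and its chunk_key field are assumptions about an API under discussion, not something numcodecs passes today:

```python
import numpy as np

from numcodecs.abc import Codec


class QuantizePerChunk(Codec):  # hypothetical, illustrating the proposal
    codec_id = "fits_quantize_per_chunk"

    def __init__(self, *, scales=None, offsets=None):
        # The full per-chunk parameter tables live in the one codec instance,
        # so they serialise into the normal array-level JSON metadata.
        self.scales = scales or {}    # e.g. {"0.0": 0.1, "0.1": 0.15, ...}
        self.offsets = offsets or {}

    def encode(self, buf):
        raise NotImplementedError("decode-only sketch")

    def decode(self, buf, out=None, context=None):
        key = context["chunk_key"]  # assumed field telling us which chunk this is
        quantized = np.frombuffer(buf, dtype="int32")
        return quantized.astype("float32") * self.scales[key] + self.offsets[key]
```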

Cadair (Author) commented Dec 1, 2022

I am going to drop generalising the problem for a moment, as I am pretty sure you are familiar enough with FITS to let me get away with it.

The medium-term goal of this work is to get to the point where I can kerchunk a tile-compressed FITS image, where each tile of the image is stored as a variable-length array in one of a few possible columns (which column can change per tile/row), and some per-tile metadata needed for decoding (scale & offset etc.) is stored in other columns.

I can imagine a situation where, if you could set parameters for a codec chain for each row of the data in the heap, you could have the column offsets for that row be inputs to the first codec. That codec could split the columns in the row, determine how to decode the data (i.e. whether it needs dequantizing), read the other per-tile metadata out of the columns, and then pass that through as inputs to the codecs later in the chain.

I can't see how you can maintain a per-tile kerchunked approach to these data without the ability to set some metadata per chunk. If you passed the whole table and a pointer to the heap you could decode the whole image, but I was hoping to have the resulting array chunked up so you can read and decode a single tile at a time.

martindurant (Member) commented:

> I can't see how you can maintain a per-tile kerchunked approach to these data without the ability to set some metadata per chunk

Exactly right. So we need to do something, whether that is keeping a list of parameters in the global metadata, or working in referenceFS to perform specific actions on a chunk in the storage layer. I am leaning towards the latter.

An interesting, similar situation arose recently with WCS parameters varying per image of a massive cube (I think you saw this in another thread, not sure where now); the set of WCS parameters could also be saved in the same parquet structure as the offset/size information. In that case the codec doesn't need it, but the xarray flexible index mechanism does. Kerchunk/referenceFS can do all of this, but it feels just a little like we're circumventing zarr's new extension mechanism, unless the whole of the kerchunk framework itself becomes an extension.

Cadair (Author) commented Dec 5, 2022

You mentioned parquet, which I don't have a lot of experience with. I was hoping I would be able to put the referenceFS data in my existing ASDF files via a JSON representation; is the parquet thing a second option alongside JSON, or something else?

Mainly out of curiosity, why do you think this stuff should be added to referenceFS?

Does zarr not have per-chunk metadata at all, even for things not related to the reading of the bytes?

joshmoore (Member) commented:

> Does zarr not have per-chunk metadata at all?

@Cadair, correct. There's nothing at the moment in the spec, though as @martindurant mentions, it's certainly conceivable to add a mechanism either via an extension or as a convention.

Cadair (Author) commented Dec 5, 2022

Interesting. As @martindurant points out, there are use cases for per-chunk coordinate information in sunpy/ndcube#222, and I obviously have per-tile compression metadata for FITS tiled image compression.

martindurant (Member) commented:

> is the parquet thing a second option alongside JSON

It could in principle be put into JSON too. I'm not quite sure how it would live inside ASDF; I'd be interested to see what you are thinking. The reason for thinking of parquet is that we have an increasing number of cases where the number of chunks is large, and JSON encoding becomes a bottleneck for opening datasets at some point. That would be exacerbated if we store more information per chunk.

> why do you think this stuff should be added to referenceFS?

It's something that we can do right away, before zarr is ready to pass chunk details to the codec (which would still require storing all the parameters in the global metadata) or grows some kind of transformer. One could argue that implementing this in referenceFS would be the equivalent of the first such transformer; but since the parameters would be stored alongside the offset/size information of a reference, implementing the decode in the same place makes sense to me.
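
For concreteness: a kerchunk reference set today maps each chunk key to a (URL, offset, length) triple, and the idea is to carry per-chunk decode parameters alongside. As Python dicts (the fourth element is hypothetical, not part of the current spec):

```python
# Current kerchunk/referenceFS form: key -> [url, offset, length]
refs_now = {
    "data/0.0": ["image.fits", 14400, 28800],
}

# Hypothetical extension: per-chunk decode parameters stored alongside,
# applied in the storage layer before zarr's codec pipeline runs.
refs_extended = {
    "data/0.0": ["image.fits", 14400, 28800, {"scale": 0.1, "offset": 2.0}],
}
```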

martindurant (Member) commented:

See also preffs, which already stores references in parquet and has the extra feature of concatenating byte ranges for keys that have more than one entry. That would open up another alternative, closer to what some of the current codecs do: storing the chunk-specific parameters in the chunk itself as a header/frame. Of course, we can't change the original chunk/tile, but we could store extra bytes for concatenation and so cheat - see the sketch below.
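
The "cheat", again as hypothetical reference entries: the logical chunk is assembled from two byte ranges, the first being a small sidecar header we write ourselves, the second the untouched original tile:

```python
# Hypothetical multi-range reference (preffs-style): the storage layer
# concatenates the ranges, so a header-aware codec sees the scale/offset
# bytes followed by the original compressed tile.
refs = {
    "data/0.0": [
        ["sidecar.bin", 0, 16],        # per-tile scale/offset we wrote ourselves
        ["image.fits", 14400, 28800],  # original tile, unmodified
    ],
}
```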

Cadair (Author) commented Dec 6, 2022

> I'm not quite sure how it would live inside ASDF; I'd be interested to see what you are thinking.

I haven't looked at it, but if you can serialise a ReferenceFS to JSON, I was going to take that, have asdf save it in its YAML metadata, then take it back out and de-serialise the ReferenceFS - roughly the sketch below. Obviously I haven't tried this, or even looked into it yet, so maybe I am missing something?
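
A minimal sketch of what I mean, assuming the reference set is a plain JSON-able dict (the file and key names are made up):

```python
import asdf
import fsspec

# A kerchunk-style reference dict (normally much larger, and including
# the .zarray/.zattrs metadata keys).
refs = {"version": 1, "refs": {"0.0": ["image.fits", 14400, 28800]}}

# Save the references inside our existing ASDF metadata tree.
af = asdf.AsdfFile({"kerchunk_refs": refs})
af.write_to("dataset.asdf")

# Later: read them back and rebuild the reference filesystem.
with asdf.open("dataset.asdf") as af:
    fs = fsspec.filesystem("reference", fo=af.tree["kerchunk_refs"])
```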

> It's something that we can do right away

Sounds good. I am not sure when I am likely to get to this phase of the project (got some parental leave coming up in the new year 👶 and trying to get the first-phase stuff done before then), but is there an issue in fsspec tracking it somewhere, so I can keep an eye on it / come back to it when I get there?

martindurant (Member) commented Dec 6, 2022

> take that, have asdf save it in its YAML metadata, then take it back out and de-serialise the ReferenceFS

I don't see that this gets you anything beyond directly storing the JSON, which is what we do now.

Cadair (Author) commented Dec 8, 2022

It means I can put it in the ASDF file we are already making and distributing to our users.

martindurant (Member) commented:

You might also find it interesting that ASDF (containing data!) is another future target for kerchunk. It is a fairly sensible format and not too dissimilar to zarr. I think astropy ought to be able to read from zarr, which would not take too much to implement; but an alternative might be to pretend that zarrs are actually ASDF, which can already be loaded. I'm not sure how much work that would be.

Cadair (Author) commented Dec 8, 2022

We (DKIST) have a weird situation where we are shipping FITS files and then making an ASDF file, a metadata-only file which lets you load a combined array (currently one chunk per FITS file) and gives you a WCS for the whole array, etc. I have currently cooked up a not particularly efficient Dask way of doing this, but I am looking to kerchunk and zarr to replace it. My thinking is that if I can serialise the ReferenceFS to the ASDF, then I can use that to make a zarr array and return that to my users on load - along the lines of the sketch below.
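
Roughly this, continuing the earlier round-trip sketch (all names are placeholders):

```python
import fsspec
import zarr

# refs was round-tripped through the ASDF file as in the earlier sketch.
fs = fsspec.filesystem("reference", fo=refs)
store = fs.get_mapper("")

# Users get a lazily-loading zarr array backed by the original FITS files.
arr = zarr.open(store, mode="r")
```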

I feel like we are driving this a little off topic; I actually have a question related to this, so I will open an issue on kerchunk. Never mind, it's already been answered.

jakirkham (Member) commented:

I wonder if a good conclusion to this discussion would be adding some documentation here to make it a bit clearer how people can plug their own compressors in.
