
DM-45371: technote describing possible cell-based coadd file layouts #1

Open · wants to merge 14 commits into base: main

Conversation

TallJimbo
Member

No description provided.

index.rst Outdated
As a result, I expect us to need to write some low-level code of our own to do subimage reads over object stores, though this is not specific to cell-based coadds: we need this for all of our image data products.
Whether to contribute more general code upstream (probably to Astropy) or focus only on reading the specific files we write ourselves is an important open question; the general solution would be very useful to the community, but its scope is unquestionably larger.

Finally, the WCS of any single cell (or the inner-cell stitched image) will be simple and can be exactly represented in FITS headers (unlike the WCSs of our single-epoch images).


To be very clear, the WCS of a single cell is simply that of the tract. The layout has only one WCS, and each cell is simply some subimage within that WCS. This feature makes the overlap pixels literally the same geometrically, which means that the images can be stitched together correctly, etc.


That's conceptually true but if the cells are stored independently of each other in the FITS file, they do technically have separate WCSes and will at least need different CRPIXn values in order to participate in the tract-global WCS correctly.

(Of course the overlap pixels don't have the same values, since there's a different input image selection for each cell.)
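
To make that concrete, the bookkeeping is just a shift of CRPIXn in an otherwise shared tract WCS; a minimal sketch, with made-up tract WCS values and cell origin:

```python
from astropy.wcs import WCS

# One tract-level TAN WCS; all numbers here are illustrative.
tract_wcs = WCS(naxis=2)
tract_wcs.wcs.ctype = ["RA---TAN", "DEC--TAN"]
tract_wcs.wcs.crval = [150.0, 2.0]          # tract center, degrees
tract_wcs.wcs.crpix = [20000.5, 20000.5]    # reference pixel in tract pixel coordinates
tract_wcs.wcs.cdelt = [-5.55e-5, 5.55e-5]   # ~0.2 arcsec/pixel

def cell_header(x0: int, y0: int):
    """FITS header for a cell whose lower-left corner is tract pixel (x0, y0)."""
    hdr = tract_wcs.to_header()
    # Same projection, scale, and CRVAL; only the reference pixel shifts so that
    # the cell's local pixel coordinates land on the same sky positions.
    hdr["CRPIX1"] -= x0
    hdr["CRPIX2"] -= y0
    return hdr

hdr = cell_header(3000, 4500)
```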

index.rst Outdated
Whether to contribute more general code upstream (probably to Astropy) or focus only on reading the specific files we write ourselves is an important open question; the general solution would be very useful to the community, but its scope is unquestionably larger.

Finally, the WCS of any single cell (or the inner-cell stitched image) will be simple and can be exactly represented in FITS headers (unlike the WCSs of our single-epoch images).
An additional FITS WCS could be used to describe the position of a single-cell image within a larger grid.


This position is simply a pixel location and the dimension of the image.


I think Jim is referring to the Rubin convention of having an "alternative WCS" (we usually use WCS A) that provides the mapping from local FITS 1-based pixel coordinates to tract-level 0-based coordinates.
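
For readers not familiar with that convention, a rough sketch of what such an alternate "A" WCS could look like (the exact keywords Rubin writes may differ; the cell origin here is made up):

```python
from astropy.io import fits

def add_grid_wcs(hdr: fits.Header, x0: int, y0: int) -> None:
    """Linear WCS 'A': local 1-based FITS pixel p maps to 0-based tract pixel x0 + (p - 1)."""
    for axis, origin in ((1, x0), (2, y0)):
        hdr[f"CTYPE{axis}A"] = "LINEAR"
        hdr[f"CUNIT{axis}A"] = "pixel"
        hdr[f"CRPIX{axis}A"] = 1.0            # the first pixel of the cell image ...
        hdr[f"CRVAL{axis}A"] = float(origin)  # ... sits at tract pixel (x0, y0), 0-based
        hdr[f"CDELT{axis}A"] = 1.0

hdr = fits.Header()
add_grid_wcs(hdr, x0=3000, y0=4500)
```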


This section will cover potential ways to lay out the image data - the 5+ planes that share a common grid, as well as the PSF images - across multiple FITS HDUs.

Binary Table With Per-Cell Rows


I think you are aware, but the FITS tile compression code fpack ends up with a binary table of compressed images. In DES, we get pretty good (lossy) compression with the tiling working per cell. We found pretty apparent issues with lossy compression when the tile compression boundaries crossed cell boundaries because the noise levels in the images change.
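
For illustration, aligning compression tiles with cells is straightforward with recent Astropy (the cell size and quantization level below are placeholders, not DES or Rubin values):

```python
import numpy as np
from astropy.io import fits

cell_size = 150                                    # illustrative cell dimension
coadd = np.zeros((3000, 3000), dtype=np.float32)   # stand-in image plane

hdu = fits.CompImageHDU(
    data=coadd,
    compression_type="RICE_1",
    tile_shape=(cell_size, cell_size),  # one compression tile per cell
    quantize_level=4.0,                 # lossy quantization step relative to the noise
)
hdu.writeto("coadd_cells.fits", overwrite=True)
```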

Member

It really makes me wonder why we aren't using HDF5 and making use of a hierarchy and native compression rather than trying to shoehorn this into FITS.

Member Author

The way I see it, we can also do HDF5, but we really can't just do HDF5, according to our requirements. Maybe there's some scheme where we only store the HDF5 and convert to FITS when streaming bytes in various places, but I'm not at all convinced that removes enough constraints from the FITS version to make it easier to implement overall.


It really makes me wonder why we aren't using HDF5 and making use of a hierarchy and native compression rather than trying to shoehorn this into FITS.

This is an interesting perspective. Given that FITS already provides a solution that generates this kind of data structure internally and has all of the support to read subsets etc., it seems like it'd be shoehorning to put this into HDF5. One would have to build all of the logic to interact with the table oneself when something already exists and is a de facto community standard.

It seems like the only issue with simply using what FITS has is that one cannot use object stores / non-POSIX I/O to read subsets of the file. That seems like the kind of use case that is specific to DM, is solvable with custom code that can be written just for DM's use, and results in reusing essentially everything FITS already offers.
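
For reference, the uncompressed object-store case is already covered by Astropy's fsspec support; a sketch with a placeholder S3 path (it is the compressed case that needs custom code):

```python
from astropy.io import fits

url = "s3://some-bucket/coadds/tract9813_patch42.fits"  # placeholder location
with fits.open(url, use_fsspec=True, fsspec_kwargs={"anon": True}) as hdul:
    # .section fetches only the byte ranges needed for this cutout rather than
    # streaming the whole (uncompressed) HDU.
    cutout = hdul[1].section[4500:4650, 3000:3150]
```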

Member

Forcing cell-based coadds into FITS when the thing you end up with is essentially unusable for ds9/firefly/astropy seems like we are stretching the usefulness of the requirement we have to use FITS. If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough. cc/ @gpdf

Member Author

Cells are one parallelization axis, sure. But they're not the only one; parallelizing over objects is another, and we'll have enough data that we can keep all the machines plenty hot parallelizing only over patches. I'd generally say that parallelizing over cells is best only if you have your own reason to do per-cell detection and hence have no choice but to deal with the problem of resolving duplicate detections later. Otherwise it's a lot cleaner to combine data at the pixel level, especially if you're starting from our main Object table's detections.

Member Author

@TallJimbo TallJimbo Jul 23, 2024

Do we have a service to do a bulk data transfer?

We don't, and "bulk" has implications that we don't want to get into. But services that transform or augment images as they are downloaded is something we've talked about for a long time.

My main concern with only providing FITS via a service is that I think there's still a lot of work defining the format of the FITS file we return. Stitching together the inner cells is not enough to make a coadd look just like a single-epoch image, at least not when you're thinking about the on-disk data model rather than a Python API. I don't want to put a lot of effort into an HDF5 format for cell-based coadds (especially not if I'm also putting a lot of effort into FITS+JSON serialization of other image data products), only to have the scope of what people want back from the FITS-transformer service creep up to near parity with what's in the HDF5.


@timj wrote:

Forcing cell-based coadds into FITS when the thing you end up with is essentially unusable for ds9/firefly/astropy seems like we are stretching the usefulness of the requirement we have to use FITS. If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough.

I think that the binary table model is a reasonable one. It solves one of the biggest problems for interoperable, non-mission-specific interpretation of FITS files with enormous numbers of HDUs: there's nothing in the FITS standard, and precious little in the "registered conventions", for having a "table of contents" extension that tells you what's in the file.

By comparison, putting the subimages in a table solves this problem for free. You can put all sorts of metadata into additional columns in the table, and a reasonable application will give a user the ability to browse by this metadata.

It's true that today if you handed a big Green Bank style table to Firefly or Aladin you might not get acceptable UX, but that's because there's been little pressure on tools in this area to catch up with what's already in the standard.

It's important to recognize that Rubin isn't the only project with this need, and cell-based coadds aren't even the only application for it within Rubin: we also have an equally important requirement to be able to create multi-epoch bulk cutout files, both for cutouts around many objects, and for cutouts around one (possibly moving) object in many epochs.

So we are among friends, as it were, in looking for standards-conformant improvements in tool behavior.
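
A minimal mock-up of that layout with Astropy, with made-up column names and cell size:

```python
import numpy as np
from astropy.io import fits

n_cells, ny, nx = 4, 150, 150                       # illustrative grid
images = np.zeros((n_cells, ny, nx), dtype=np.float32)

cols = [
    fits.Column(name="CELL_X", format="J", array=np.arange(n_cells, dtype=np.int32)),
    fits.Column(name="CELL_Y", format="J", array=np.zeros(n_cells, dtype=np.int32)),
    fits.Column(name="N_INPUTS", format="J", array=np.full(n_cells, 12, dtype=np.int32)),
    fits.Column(
        name="IMAGE",
        format=f"{nx * ny}E",      # fixed-length float32 array per row
        dim=f"({nx},{ny})",        # TDIMn restores the 2-D shape
        array=images,
    ),
]
cells_hdu = fits.BinTableHDU.from_columns(cols, name="CELLS")
```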


If we have a service to convert the cell-based thing to a usable image then that service returning FITS is good enough.

Interesting. Do we have a service to do a bulk data transfer? Such a conversion could happen in the process, since we don't want to store the trimmed/stitched image at USDF and the destination may not have Science Pipelines installed to read our custom data structure.

We have a robust framework for retrieving "related data" in the RSP. When someone queries for coadded images, they'll get a list of images, most naturally, in the way they are chunked in the Butler (e.g., currently, one row per patch in the query result).

The way the image itself is then retrieved actually allows for N different products to be returned associated with that row in the query result. So you can return the full cell-based coadd FITS file verbatim from the backing store, or you can return a stitched-together simple FITS file for the whole area it covers, or you can return a PNG preview of it, or a table of its provenance, or pretty much anything else we can think of.

These additional related products can be things that already exist statically, or things that are created on the fly by services from the data on the backing store.


... and modern IVOA-flavored tools like Firefly and Aladin already know how to follow these links and present these options to users.

index.rst Outdated
Exploded Images
---------------

TODO


The MEDS format is basically exploded images but with them all unraveled into a single 1D set of pixels. We then set the lossy compression tile boundaries to correspond to the exploded cells with one per tile.


To be clear, we store one exploded image per data type being stored (e.g., one image HDU, one mask HDU etc.) and then they all share common metadata stored in a binary table HDU in the file.


I'm not familiar with this format, @beckermr . Do you have a reference?

Member

@arunkannawadi arunkannawadi Aug 15, 2024

https://github.com/esheldon/meds/wiki/MEDS-Format

I do not know if this is up to date with what has been done in practice though.
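
Based on that wiki, reading a single cutout back out of the exploded 1-D pixel array looks roughly like this (the extension and column names are taken from the wiki and have not been checked against current files):

```python
import numpy as np
from astropy.io import fits

with fits.open("meds_like_file.fits") as hdul:        # hypothetical file
    object_data = hdul["object_data"].data            # per-cutout metadata table
    pixels = hdul["image_cutouts"].data               # all cutouts as one 1-D array

    row = object_data[5]                              # an arbitrary entry
    start = int(np.atleast_1d(row["start_row"])[0])   # offset of the first cutout
    box = int(row["box_size"])
    cutout = pixels[start : start + box * box].reshape(box, box)
```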

index.rst Outdated
Comment on lines 146 to 147
A binary table with per-cell rows is a natural fit for the fixed-schema per-cell information, especially if the image data layout already involves a binary table with per-cell rows.
But if we're embedding a JSON document in the FITS file anyway, it might make more sense to store this information in JSON as well; this will let us share code, documentation, and serialization for more complex objects with other Rubin image data products, and that includes sharing the machinery for managing schema changes and schema documentation.


Have you seen performance issues with JSON in FITS?

We do this very rarely in DES and it is a potentially lossy operation, especially with respect to floating point numbers etc.

Member Author

You do have to be very careful with the floating point numbers (NaNs and Infs especially), but the libraries we're using (mostly Pydantic) seem to be pretty good about that.

The JSON-in-FITS stuff is extremely new and we don't have any benchmarks of that specifically. But I don't think the combination raises any red flags, and we've been pretty happy with JSON serialization even in some very big data structures (not YAML, though, avoid YAML like the plague if you care about performance!).
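
For anyone curious what the JSON-in-FITS approach amounts to mechanically, a sketch (the extension name and the exact RFC-1030 convention may differ from this):

```python
import json
import numpy as np
from astropy.io import fits

# NaN/Inf are not part of the JSON spec, so they need an explicit convention
# (or a library like Pydantic that handles them) before this step.
metadata = {"tract": 9813, "patch": 42, "cell_size": 150, "n_cells": 484}
payload = np.frombuffer(json.dumps(metadata).encode("utf-8"), dtype=np.uint8).copy()

json_hdu = fits.ImageHDU(data=payload, name="METADATA_JSON")

# Reading it back:
restored = json.loads(json_hdu.data.tobytes().decode("utf-8"))
```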


My 2 cents: As an end user I would prefer a binary table if the metadata has a "flat" (non-hierarchical) structure and has FITS-compatible data types.

Using binary blobs makes data exploration/browsing more difficult for anyone who is not using DM specific "data loaders" etc.

@TallJimbo TallJimbo marked this pull request as ready for review July 22, 2024 20:52
This avoids the problem with per-HDU overheads, and it makes an address table much less important, as there are many fewer HDUs.
It also allows compression tiles that comprise multiple cells.

It does not allow us to represent the on-sky locations of cells using FITS WCS, however, and this is probably enough to rule it out.
Member

Are you referring to the integer offsets that indicate the position of the cell?

Member Author

Yes, more or less. We could have a WCS that describes one cell, but I don't think there's any way to write a FITS WCS that puts different slices of a data cube in different places.

Member

I don't think FITS-WCS can do it. I'm pretty sure AST can with switchMaps.

Member

No idea if gwcs can.


I believe there might be a way to do this with table-lookup WCSes, but they are not uniformly well supported by readers. I would have to read the FITS standard very closely to determine whether this would also require us to do all the spherical projection ourselves, which would require us to be careful about precision issues in the (piecewise linear) table lookups, especially near the coordinate poles.

I am not endorsing this idea! But if this storage scheme were the winner for other reasons, it's something we could research.

Member Author

This storage scheme definitely isn't winning for other reasons, so I don't think it's worth anyone's time to look into it further.

Co-authored-by: Arun Kannawadi  <kj.arun.kj@gmail.com>
@gpdf gpdf self-requested a review August 15, 2024 22:56
index.rst Outdated
The main problem with binary table layouts is compression support.
The latest FITS standard does include (in section 10.3) a discussion of "tiled table compression", which seems like it'd be sufficient, as long as compressing one cell at a time is enough to get a good compression ratio (this is unclear).
Unlike image tile compression, binary table tile compression doesn't support lossy compression algorithms or dithered quantization, but it would still be possible to do our own non-dithered quantization and use ``BZERO`` and ``BSCALE`` to record the transformation from integer back to floating-point.
The bigger problem is that there do not appear to be any implementations of it: there is no mention of it in either the CFITSIO or Astropy documentation (and even if an implementation does exist in, say, the Java ecosystem, we wouldn't be in a position to use it).
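
Spelled out, the non-dithered quantization described here is just the following (with an illustrative choice of step size, and ignoring NaN handling, which the FITS tiled-image scheme treats separately):

```python
import numpy as np

def quantize(plane: np.ndarray, q: float = 4.0):
    """Map a float plane onto int32 so that physical = BSCALE * stored + BZERO."""
    bscale = float(np.nanstd(plane)) / q            # step size tied to the noise level
    bzero = float(np.nanmin(plane))
    stored = np.round((plane - bzero) / bscale).astype(np.int32)
    return stored, bscale, bzero

def dequantize(stored: np.ndarray, bscale: float, bzero: float) -> np.ndarray:
    return stored.astype(np.float32) * bscale + bzero
```
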
Member Author

It appears I was wrong about this: as @gpdf discovered, both CFITSIO and at least one major Java FITS implementation (nom.tam.fits) do support table compression (CFITSIO just hasn't updated their docs to reflect this).

But having looked more closely at the table compression spec, I'm less sanguine about introducing our own quantization convention for lossy compression here, for two reasons:

  • Adding BZERO and BSCALE columns as per the Green Bank convention only works if there is only one image column in the binary table, i.e. we put each logical plane in its own binary table HDU (this seems generally true of the Green Bank convention: there's no provision for multiple image columns). This is fine, but I think it removes one of the advantages of the binary table form.

  • There'd be no way to record the "subtractive dithering" quantization (in which one adds a deterministic uniform random number before rounding to int, and subtracts it when decompressing). And while we could invent one (probably a ZDITHER0 column, by analogy with the tiled image compression header key), users would have to use our code to correctly decompress in that case. While other file layouts would require users who don't use our code to do some work, none of them would require them to implement anything as low-level as decompression. @beckermr, do you know if you used subtractive dithering in your lossy-compression configuration? It'd be the ZQUANTIZ header key.

In fact, if we decide to do binary tables with one plane per HDU to work around the first problem, we could just make them the very binary tables used to implement tiled image compression, but add extra columns (which are explicitly permitted) for e.g. a Green Bank convention WCS or other metadata. FITS libraries often provide the option to read a compressed image HDU as an image or as a binary table, and with this format we'd provide all information in a fairly FITS-native way in at least one of those two interpretations. This does mean that the tiled image compression representation would correspond to the exploded image, of course.
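
For illustration, Astropy already exposes both interpretations of a tile-compressed HDU (the file name below is a placeholder):

```python
from astropy.io import fits

# Read the HDU the normal way: transparently decompressed, an ordinary image.
with fits.open("coadd_cells.fits") as hdul:
    image = hdul[1].data

# Read the same HDU as the BINTABLE that actually lives in the file; this is
# where extra per-tile/per-cell columns would sit alongside COMPRESSED_DATA,
# ZSCALE, ZZERO, etc.
with fits.open("coadd_cells.fits", disable_image_compression=True) as hdul:
    table = hdul[1].data
    print(table.columns.names)
```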

@beckermr beckermr Aug 20, 2024

We use SUBTRACTIVE_DITHER_2 which preserves zeros exactly. You need dithering to remove biases from the lossy compression and having zeros go through unchanged has lots of nice advantages.
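
For reference, that mode can be requested through Astropy as well; quantize_method=2 maps to ZQUANTIZ = 'SUBTRACTIVE_DITHER_2' (the quantization level and seed below are placeholders):

```python
import numpy as np
from astropy.io import fits

data = np.random.default_rng(0).normal(size=(300, 300)).astype(np.float32)
hdu = fits.CompImageHDU(
    data,
    compression_type="RICE_1",
    quantize_level=4.0,
    quantize_method=2,   # SUBTRACTIVE_DITHER_2: dithered, but zeros pass through unchanged
    dither_seed=1,       # fixed seed so the lossy compression is reproducible
)
```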

@gpdf gpdf left a comment

It's a great summary of the issues. My main contribution is to note that table-compression implementations are available.

I've made a number of other comments. I'll write separately on the RFC about the proposed decision itself, which isn't addressed in this PR.

index.rst Outdated
- Astropy can do image tile compression with efficient subimage reads on POSIX filesystems, and it can do uncompressed reads (including efficient subimage reads) against both POSIX and object stores via `fsspec`, but it delegates to vendored CFITSIO code for its tile compression code and does not support the combination of these two.
Astropy cannot do subset reads of FITS binary tables, even on POSIX filesystems.

- I am not aware of any library that can do FITS binary table compression at all.

I think in fact CFITSIO (C) and nom.tam.fits (Java) both support it. It appears that the CFITSIO documentation is behind the code.

Note that CFITSIO was only recently formally released as open-source code on GitHub (there were semi-bootleg versions of it previously) and people may not be entirely familiar with the latest version.

Some links:

https://github.com/HEASARC/cfitsio/blob/56640a39c59148b5143d8dd0833cce6a88a55d8c/fitsio.h#L2069-L2070

https://github.com/HEASARC/cfitsio/blob/56640a39c59148b5143d8dd0833cce6a88a55d8c/imcompress.c#L8704


Currently lossy table compression is not supported.

The permitted algorithms for compressing BINTABLE columns are 'RICE_1', 'GZIP_1', and 'GZIP_2' (plus 'NOCOMPRESS'), which are lossless and are described in Sect. 10.4. Lossy compression could be allowed in the future once a process is defined to preserve the details of the compression.

Member Author

Fixed.


index.rst Outdated
Stitched Images
---------------

We expect most accesses to our coadd files to be uninterested in the redundant overlap pixel values - instead, most science users will be interested in stitching together the inner cells to form a mostly-seamless patch-level image.

I would say "many", perhaps, not "most", given the discussions of the value of the images for algorithmic processing vs. human inspection.

Member Author

I've made that change and added my point on forward-modeling vs. data-convolution analysis needs from the other thread.


Finally, the WCS of any single cell (or the inner-cell stitched image; they differ only in an integer offset) will be simple and can be exactly represented in FITS headers (unlike the WCSs of our single-epoch images).
An additional FITS WCS could be used to describe the position of a single-cell image within a larger grid (this is just an integer offset).
It is highly desirable that any images we store in our FITS files have both to allow third-party FITS readers to fully interpret them.

Add:
"In some cases we might be using corners of the current FITS standard (v4.00) that existing community tools do not yet fully implement. We consider this preferable to using entirely Rubin-unique formats and conventions, and we are prepared to support the community to some extent in filling out these corners (e.g., by adding support to Astropy and Firefly).

We are aware of other projects' interest in FITS representations for large numbers of small-ish images with associated metadata and will keep them informed of what we are planning."

Member Author

I'm a little uncomfortable with the parenthetical, as I don't think we'll actually have the effort to add support to at least Astropy on the early-science timescale.

While some global information will go into FITS headers (certainly the WCS and some identifiers), we do not want to assume that all global metadata can be neatly represented in the limited confines of a FITS header.
A single-row binary table is another option, but we will likely instead adopt the approach recently proposed for other Rubin image data products on RFC-1030: embedding a JSON document as a byte array in a FITS extension HDU.

A binary table with per-cell rows is a natural fit for the fixed-schema per-cell information, especially if the image data layout already involves a binary table with per-cell rows.

I would encourage storing the fixed-schema data as a table even if it is also in the JSON.

- tiled table compression is supported by CFITSIO and other readers;

- we do need lossy compression and subtractive dithering for at least some image planes.

The overall result is that the big picture is not changed much: binary tables are still not really viable, because they can't do the kind of compression we need.
Comment on lines 52 to 53
CFITSIO only permits customization of the quantization used in image tile compression via undocumented internal interfaces (which is what we use in ``afw``).
This imposes the same limitation on its Python bindings in the ``fitsio`` package.


This is not totally true, though I agree the documentation is sparse.


We just added support for some of the lossy compression seeding options to fitsio: esheldon/fitsio#411.

Excerpts from the docs are here in this github message: esheldon/fitsio#411 (comment)
