Extension proposal: multiscale arrays v0.1 #50

Closed
joshmoore opened this issue Mar 5, 2020 · 24 comments
Labels
protocol-extension Protocol extension related issue

Comments

@joshmoore
Member

joshmoore commented Mar 5, 2020

This issue has been migrated to an image.sc topic after the 2020-05-06 community discussion. Authors are still encouraged to make use of the specification in their own libraries. As the v3 extension mechanism matures, the specification will be updated and registered as appropriate. Feedback and change requests are welcome either on this repository or on image.sc.


As a first draft of support for the multiscale use-case (#23), this issue proposes an intermediate nomenclature for describing groups of Zarr arrays which are scaled down versions of one another, e.g.:

example/
├── 0    # Full-sized array
├── 1    # Scaled down 0, e.g. 0.5; for images, in the X&Y dimensions
├── 2    # Scaled down 1, ...
├── 3    # Scaled down 2, ...
└── 4    # Etc.

This layout was independently developed in a number of implementations and has since been implemented in others, including:

Using a common metadata representation across implementations:

  1. fosters a common vocabulary between existing implementations
  2. enables other implementations to reliably detect multiscale arrays
  3. permits the upgrade of v0.1 arrays to future versions of this or other extensions
  4. tests this extension for limitations against multiple use cases

A basic example of the metadata that is added to the containing Zarr group is seen here:

{
  "multiscales": [
    {
      "datasets": [
          {"path": "0"},
          {"path": "1"},
          {"path": "2"},
          {"path": "3"},
          {"path": "4"}
        ],
      "version": "0.1"
    }
     // See the detailed example below for optional metadata
  ]
}
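As a rough illustration of point 2 above (reliable detection), a reader could check a group's attributes for a well-formed "multiscales" entry. This is a minimal sketch on plain dictionaries, not part of the proposal: the helper name `find_multiscales` is hypothetical, and the attributes are parsed from an inline string rather than from a real `.zattrs` file.

```python
import json


def find_multiscales(group_attrs):
    """Return the well-formed multiscale entries found in a group's attributes."""
    found = []
    for entry in group_attrs.get("multiscales", []):
        datasets = entry.get("datasets") if isinstance(entry, dict) else None
        # A non-empty "datasets" list with a "path" per item is treated here
        # as the minimum needed to detect a series.
        if isinstance(datasets, list) and datasets and all(
            isinstance(d, dict) and "path" in d for d in datasets
        ):
            found.append(entry)
    return found


# In practice the attributes would be read from the group's .zattrs file.
attrs = json.loads("""
{
  "multiscales": [
    {"datasets": [{"path": "0"}, {"path": "1"}, {"path": "2"}], "version": "0.1"}
  ]
}
""")
paths = [d["path"] for d in find_multiscales(attrs)[0]["datasets"]]
print(paths)
```

A group without the key, or with an entry lacking "datasets", simply yields an empty list, so callers can fall back to treating the group as ordinary arrays.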

Process

An RFC process for Zarr does not yet exist. Additionally, the v3 spec is a work-in-progress. However, since the implementations listed above as well as others are already being developed, I'd propose that if a consensus can be reached here, this issue should be turned into an .rst file similar to those in the v3 branches (e.g. filters) and used as a temporary spec for defining arrays, with the understanding that this is a prototype intended to be amended and brought into the general extension mechanism as it develops.

I'd welcome any suggestions/feedback, but especially around:

  • Better terms for "multiscale" and "series"
  • The most useful enum values
  • Is this already too complicated? (Limit to one series per group?) or on the flip side:
  • Are there existing use cases that aren't supported? (Note: I'm aware of some examples like BDV's N5 format but I'd suggest they are higher-level than just "multiscale arrays".)

Deadline for a first round of comments: March 15, 2020
Deadline for a second round of comments: April 15, 2020

Detailed example

Color key (according to https://www.ietf.org/rfc/rfc2119.txt):

- MUST     : If these values are not present, the multiscale series will not be detected.
! SHOULD   : Missing values may cause issues in future versions.
+ MAY      : Optional values which can be readily omitted.
# UNPARSED : When updating between versions, no transformation will be performed on these values.

Color-coded example:

-{
-  "multiscales": [
-    {
!      "version": "0.1",
!      "name": "example",
-      "datasets": [
-        {"path": "0"},
-        {"path": "1"},
-        {"path": "2"}
-      ],
!      "type": "gaussian",
!      "metadata": {
+        "method":
#          "skimage.transform.pyramid_gaussian",
+        "version":
#          "0.16.1",
+        "args":
#          [true],
+        "kwargs":
#          {"multichannel": true}
!      }
-    }
-  ]
-}
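The color key can also be read as a small validation rule set. The sketch below is one possible interpretation, assuming the MUST and SHOULD keys are exactly those marked in the color-coded example; the function name `check_multiscale` and the message wording are hypothetical.

```python
# Keys taken from the color-coded example: "-" lines are MUST, "!" lines are SHOULD.
MUST_KEYS = ("datasets",)
SHOULD_KEYS = ("version", "name", "type", "metadata")


def check_multiscale(entry):
    """Return (errors, warnings) for one entry of the "multiscales" list."""
    errors = [f"missing MUST key: {key}" for key in MUST_KEYS if key not in entry]
    warnings = [f"missing SHOULD key: {key}" for key in SHOULD_KEYS if key not in entry]
    for i, dataset in enumerate(entry.get("datasets", [])):
        if "path" not in dataset:
            errors.append(f"datasets[{i}] has no path")
    return errors, warnings


entry = {"version": "0.1", "datasets": [{"path": "0"}, {"path": "1"}]}
errors, warnings = check_multiscale(entry)
print(errors)
print(warnings)
```

Errors correspond to series that would not be detected at all; warnings only flag values whose absence may cause issues in future versions.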

Explanation

  • Multiple multiscale series of datasets can be present in a single group.
  • By convention, the first multiscale should be chosen if all else is equal.
  • Alternatively, a multiscale can be chosen by name or, with slightly more effort, by the zarray metadata such as chunk size.
  • The paths to the arrays are ordered from largest to smallest.
  • These paths could potentially point to datasets in other groups via “../foo/0” in the future. For now, the identifiers MUST be local to the annotated group.
  • These values SHOULD (MUST?) come from the enumeration below.
  • The metadata example is taken from https://scikit-image.org/docs/dev/api/skimage.transform.html#skimage.transform.pyramid_reduce
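The selection conventions listed above could be sketched as follows. This is an illustration only: `select_multiscale` and `is_local_path` are hypothetical helpers, and the locality check (rejecting absolute paths and any ".." component) is one assumed reading of "local to the annotated group".

```python
def select_multiscale(multiscales, name=None):
    """Pick a multiscale by name, or the first one when no name is given."""
    if not multiscales:
        raise ValueError("group has no multiscales")
    if name is None:
        return multiscales[0]  # convention: first entry wins if all else is equal
    for entry in multiscales:
        if entry.get("name") == name:
            return entry
    raise KeyError(name)


def is_local_path(path):
    """Assumed rule: no absolute paths and no '..' components."""
    return not path.startswith("/") and ".." not in path.split("/")


multiscales = [
    {"name": "gaussian", "datasets": [{"path": "base"}, {"path": "G1"}]},
    {"name": "laplacian", "datasets": [{"path": "base"}, {"path": "L1"}]},
]
default = select_multiscale(multiscales)
chosen = select_multiscale(multiscales, name="laplacian")
print(default["name"], chosen["name"])
print(all(is_local_path(d["path"]) for d in chosen["datasets"]))
```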

Type enumeration:

Sample code

#!/usr/bin/env python
import argparse
import zarr
import numpy as np
from skimage import data
from skimage.transform import pyramid_gaussian, pyramid_laplacian

parser = argparse.ArgumentParser()
parser.add_argument("zarr_directory")
ns = parser.parse_args()


# 1. Setup of data and Zarr directory
base = np.tile(data.astronaut(), (2, 2, 1))

gaussian = list(
    pyramid_gaussian(base, downscale=2, max_layer=4, multichannel=True)
)

laplacian = list(
    pyramid_laplacian(base, downscale=2, max_layer=4, multichannel=True)
)

store = zarr.DirectoryStore(ns.zarr_directory)
grp = zarr.group(store)
grp.create_dataset("base", data=base)


# 2. Generate datasets
# Level 0 of each pyramid reuses the "base" array; only the downscaled
# levels are written as new arrays.
series_G = []
for level, dataset in enumerate(gaussian):
    if level == 0:
        path = "base"
    else:
        path = "G%s" % level
        grp.create_dataset(path, data=dataset)
    series_G.append({"path": path})

series_L = []
for level, dataset in enumerate(laplacian):
    if level == 0:
        path = "base"
    else:
        path = "L%s" % level
        grp.create_dataset(path, data=dataset)
    series_L.append({"path": path})


# 3. Generate metadata block
multiscales = []
for name, series in (("gaussian", series_G),
                     ("laplacian", series_L)):
    multiscale = {
      "version": "0.1",
      "name": name,
      "datasets": series,
      "type": name,
    }
    multiscales.append(multiscale)
grp.attrs["multiscales"] = multiscales

which results in a .zattrs file of the form:

{
    "multiscales": [
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "G1"
                },
                {
                    "path": "G2"
                },
                {
                    "path": "G3"
                },
                {
                    "path": "G4"
                }
            ],
            "name": "gaussian",
            "type": "gaussian",
            "version": "0.1"
        },
        {
            "datasets": [
                {
                    "path": "base"
                },
                {
                    "path": "L1"
                },
                {
                    "path": "L2"
                },
                {
                    "path": "L3"
                },
                {
                    "path": "L4"
                }
            ],
            "name": "laplacian",
            "type": "laplacian",
            "version": "0.1"
        }
    ]
}

and the following on-disk layout:

/var/folders/z5/txc_jj6x5l5cm81r56ck1n9c0000gn/T/tmp77n1ga3r.zarr
├── G1
│   ├── 0.0.0
...
│   └── 3.1.1
├── G2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── G3
│   ├── 0.0.0
│   └── 1.0.0
├── G4
│   └── 0.0.0
├── L1
│   ├── 0.0.0
...
│   └── 3.1.1
├── L2
│   ├── 0.0.0
│   ├── 0.1.0
│   ├── 1.0.0
│   └── 1.1.0
├── L3
│   ├── 0.0.0
│   └── 1.0.0
├── L4
│   └── 0.0.0
└── base
    ├── 0.0.0
...
    └── 1.1.1

9 directories, 54 files
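On the reading side, a viewer might use the ordered "datasets" list to pick the coarsest level that still satisfies a requested size. The sketch below assumes the largest-to-smallest ordering described in the explanation; `pick_level` is a hypothetical helper, and the shapes are illustrative values for a 1024x1024x3 base halved at each level (in practice they would be read from each array's `.zarray`).

```python
def pick_level(datasets, shapes, min_size):
    """Return the path of the smallest level whose first two dims are >= min_size."""
    chosen = datasets[0]["path"]  # fall back to the full-sized array
    for dataset in datasets:  # ordered from largest to smallest
        height, width = shapes[dataset["path"]][:2]
        if height >= min_size[0] and width >= min_size[1]:
            chosen = dataset["path"]
        else:
            break
    return chosen


datasets = [{"path": p} for p in ("base", "G1", "G2", "G3", "G4")]
# Illustrative shapes only, not read from disk.
shapes = {
    "base": (1024, 1024, 3),
    "G1": (512, 512, 3),
    "G2": (256, 256, 3),
    "G3": (128, 128, 3),
    "G4": (64, 64, 3),
}
print(pick_level(datasets, shapes, (300, 300)))
```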
| Revision | Source | Date | Description |
|----------|--------|------|-------------|
| 6 | External feedback on twitter and image.sc | 2020-05-06 | Remove "scale"; clarify ordering and naming |
| 5 | External bug report from @mtbc | 2020-04-21 | Fixed error in the simple example |
| 4 | #50 (comment) | 2020-04-08 | Changed "name" to "path" |
| 3 | Discussions up through #50 (comment) | 2020-04-01 | Updated naming schema |
| 2 | #50 (comment) | 2020-03-07 | Fixed typo |
| 1 | @joshmoore | 2020-03-06 | Original text from in person discussions |

Thanks to @ryan-williams, @jakirkham, @freeman-lab, @petebankhead, @jni, @sofroniewn, @chris-allan, and anyone else whose GitHub account I've forgotten for the preliminary discussions.

@manzt
Member

manzt commented Mar 5, 2020

I'm really happy to see this. We initially used a layout for storing pyramids similar to the one proposed above, and it's fantastic to see this formalized.

I'm curious about the decision to store the base array in the same group as the downsampled levels. We initially did the same, but then moved towards a structure separating the two:

└── example/
    ├── .zgroup
    ├── base
    │   ├── .zarray
    │   ├── .zattrs
    │   ├── 0.0.0
    │   └── ...etc
    └── sub-resolutions/
        ├── .zgroup
        ├── .zattrs
        ├── 01/
        │   ├── .zarray
        │   ├── 0.0.0
        │   └── ...etc
        └── 02/
            ├── .zarray
            ├── 0.0.0
            └── ...etc

as a more general "image" format in zarr. One could expect to find a "base" array and then check for the "sub-resolutions" group to determine if it is a pyramid or not. We thought this structure would allow for other types of data (e.g. segmentation) to be stored alongside the base array. Again, thanks for the work here in formalizing this!
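To make the check described above concrete, here is a minimal sketch that models the hierarchy as nested dictionaries (a stand-in for a real Zarr group, which is an assumption for illustration). The helper name `flatten_pyramid` is hypothetical, and ordering the levels by sorting their names assumes zero-padded names such as "01" and "02".

```python
def flatten_pyramid(image_group):
    """Return array paths, base first, for a base + sub-resolutions layout."""
    if "base" not in image_group:
        raise KeyError("no base array found")
    paths = ["base"]
    sub = image_group.get("sub-resolutions")
    if isinstance(sub, dict):
        # Zero-padded names sort correctly as strings.
        paths += [f"sub-resolutions/{name}" for name in sorted(sub)]
    return paths


# Nested dicts stand in for groups; the string values stand in for arrays.
pyramid = {"base": "array", "sub-resolutions": {"02": "array", "01": "array"}}
plain_image = {"base": "array"}
print(flatten_pyramid(pyramid))
print(flatten_pyramid(plain_image))
```

A result with a single path means the image has no sub-resolutions, i.e. it is not a pyramid.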

@joshmoore
Member Author

Thanks, @manzt. Let's see if there are more votes for the deeper representation. It's certainly also what I was originally thinking about in #23. The downside is that one likely needs metadata on all the datasets pointing up and down the hierarchy in order to support detection of the sequence from any scale. It's the other major design layout I can think of. (If anyone has more, those would be very welcome.)

@sofroniewn

sofroniewn commented Mar 5, 2020

@joshmoore amazing to see this kick off. A couple of short comments:

  • If looking for alternate names I'd consider multiresolution, but multiscale definitely works for me. We have been using pyramid in napari but are thinking of changing (see "Support for multiresolution (pyramid) layers besides Image layers", napari/napari#1019 (comment)), and we can try to go with whatever the majority likes.

  • One thing that has come up for me with a list of "scales" is that when you have large volumetric timeseries, where you might create a pyramid for each timepoint, some of the axes are unscaled, so you really need to look at the shapes of the arrays to do the right thing. I see that the field is optional but I wonder how much is gained from it (I'm also not opposed though, and would probably find a use for it, but wanted to put out this caveat).

  • Multiple series per group is probably good flexibility to have, say if you have two independent multiscale datasets you want to put in the same group, it lets the group abstraction remain separate from the multiscale details.

  • The concept of having base + subresolutions like @manzt proposes is intriguing to me too. Ultimately for visualization purposes I want something like a single list of arrays so I guess I find that representation a little simpler, but I can construct that from the latter representation if I know the data is multiscale and maybe it is nice to keep that a little separate. I will think on it more, curious what others say.
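The caveat about unscaled axes can be illustrated by deriving per-axis factors directly from array shapes. This is a sketch with made-up shapes for a (t, z, y, x) timeseries; `axis_scale_factors` is a hypothetical helper, not an existing API.

```python
def axis_scale_factors(base_shape, level_shape):
    """Per-axis ratio between the full-sized array and a downscaled level."""
    if len(base_shape) != len(level_shape):
        raise ValueError("arrays must have the same number of dimensions")
    return tuple(b / l for b, l in zip(base_shape, level_shape))


# Illustrative (t, z, y, x) shapes: only y and x are downscaled.
base_shape = (100, 64, 2048, 2048)
level_shape = (100, 64, 1024, 1024)
factors = axis_scale_factors(base_shape, level_shape)
scaled_axes = [i for i, f in enumerate(factors) if f != 1]
print(factors, scaled_axes)
```

Here a single scalar "scale" of 2 would be misleading, since the time and z axes are untouched.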

@d-v-b
Contributor

d-v-b commented Mar 5, 2020

Glad to see the discussion here. Some thoughts:

  • Philosophically, I'd like to suggest two constraints (both of which are satisfied by @joshmoore's proposal, but not by a lot of other existing multiscale image schemas): First, individual images should be portable -- wherever possible, images should not have metadata/attributes that indicate their role in a multiscale representation, so that they can be copied somewhere else and viewed on their own without losing context. Second, no magic dataset names like s0, s1, etc. The use of the list of datasets in @joshmoore's group attributes solves this problem.

  • Personally I'm not a fan of putting the base image at a different level of the hierarchy, since most software I've seen assumes that all the different scale levels will be elements in the same collection. @manzt you suggest that you adopted this structure in order to facilitate checking for a multiscale representation, but I think this is a job for group metadata, not hierarchy.

  • For simplicity, I would propose a restriction of one multiscale representation per group. Groups are cheap; if you want to represent 2 multiscale images, then make 2 groups. (This doesn't work for multiple multiscale representations that use the same base image, e.g. gaussian and laplacian pyramids). The use of the series group metadata in @joshmoore's proposal handles this nicely.

  • A multiscale image is a collection of images. Accordingly, the "multiscaleness" should be a group attribute that lists the images in the collection, which is how @joshmoore does it in the draft proposal. I would add some dataset-specific information to the group attributes: software that consumes multiscale images needs to know about the spatial properties of each image, and on cloud storage it can be cumbersome to query each image individually; so for convenience this image metadata could also be in the group attributes that describe the multiscale representation. I think explicitly listing the transform attributes of each image is safer than just listing "scales", as long as the transform attributes of each image are small.

Here's example metadata that implements this concept. The specifics of the "transform attributes" don't really matter -- this could be an affine transform, or something fancier. But I think the basic idea of putting the spatial information of each dataset in the group attributes is solid.

// group attributes
{
  "multiscale": {
    "version" : "0.1",
        "datasets" : ["0" : {transform attributes of 0}, 
                      "1" : {transform attributes of 1}, 
                      "2" : {transform attributes of 2}, 
                      "3" : {transform attributes of 3}, 
                      "4" : {transform attributes of 4}]
    // optional stuff
  }
}

// example transform attributes of dataset 0

"transform" : {
    "offset" : {"X" : 0, "Y" : 0, "Z" : 0},
    "scale" : {"X" : 1, "Y" : 1, "Z" : 1},
    "units" : {"X" : "nm", "Y" : "nm", "Z" : "nm"}
} 

// example transform attributes of dataset 1

"transform" : {
    "offset" : {"X" : 0.5, "Y" : 0.5, "Z" : 0},
    "scale" : {"X" : 2, "Y" : 2, "Z" : 1},
    "units" : {"X" : "nm", "Y" : "nm", "Z" : "nm"}
} 
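One way a consumer might use such transform attributes is to map an array index to a physical position. The sketch below assumes the simplest interpretation, position = offset + scale * index per axis; that formula and the helper name `index_to_physical` are assumptions for illustration, since the specifics of the transform are left open above.

```python
def index_to_physical(index, transform):
    """Map an array index to a position, assuming position = offset + scale * index."""
    return {
        axis: transform["offset"][axis] + transform["scale"][axis] * index[axis]
        for axis in index
    }


# Transform attributes of dataset 1, copied from the example above.
level1 = {
    "offset": {"X": 0.5, "Y": 0.5, "Z": 0},
    "scale": {"X": 2, "Y": 2, "Z": 1},
    "units": {"X": "nm", "Y": "nm", "Z": "nm"},
}
position = index_to_physical({"X": 3, "Y": 0, "Z": 5}, level1)
print(position)
```

Under this reading, pixel 3 along X in dataset 1 sits at 6.5, halfway between pixels 6 and 7 of dataset 0, which is what a 2x downscale in X and Y would be expected to produce.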

For posterity, I've written about this issue (as it pertains to the data our group works with) here

@manzt
Member

manzt commented Mar 5, 2020

@sofroniewn The concept of having base + subresolutions like @manzt proposes is intriguing to me too. Ultimately for visualization purposes I want something like a single list of arrays so I guess I find that representation a little simpler, but I can construct that from the latter representation if I know the data is multiscale and maybe it is nice to keep that a little separate. I will think on it more, curious what others say.

I generally have the same feelings. I'm for the simplicity of the current proposal, and I wonder if my suggestion adds an extra layer of complexity unnecessarily.

@d-v-b For simplicity, I would propose a restriction of one multiscale representation per group. Groups are cheap; if you want to represent 2 multiscale images, then make 2 groups.

Wouldn't this require copying the base image into a separate group? Perhaps I'm misunderstanding.

@d-v-b
Contributor

d-v-b commented Mar 5, 2020

Wouldn't this require copying the base image into a separate group? Perhaps I'm misunderstanding.

The base image would be in the same group with the downscaled versions. So on the file system, it would look like this:

└── example/
    ├── .zgroup
    ├── base
    │   ├── .zarray
    │   ├── .zattrs
    │   ├── 0.0.0
    │   └── ...etc
    ├── base_downscaled
    │   ├── .zarray
    │   ├── .zattrs
    │   ├── 0.0.0
    │   └── ...etc
    ...etc

@manzt
Member

manzt commented Mar 5, 2020

Apologies, I thought you were suggesting that separate groups should be created for different sampling of the same base image (e.g. gaussian and laplacian).

@d-v-b
Contributor

d-v-b commented Mar 5, 2020

@manzt
this is actually my mistake -- I was not thinking at all about the use case where the same base image is used for multiple pyramids, and I agree that copying data is not ideal. I will remove / amend the "one multiscale representation per group" part of my proposal above.

@thewtex

thewtex commented Mar 5, 2020

I would add some dataset-specific information to the group attributes: software that consumes multiscale images needs to know about the spatial properties of each image, and on cloud storage it can be cumbersome to query each image individually;

Adding to the practical importance here: the spatial position of the first pixel is shifted in subresolutions, and the physical spacing between pixels changes also. This must be accounted for during visualization or analysis when other datasets, e.g. other images or segmentations, come into play. If this metadata is readily and independently available for every subresolution, i.e. scale factors do not need to be fetched and computations made, each subresolution image can be used independently, effortlessly, and without computational overhead.
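The shift and spacing change described above can be worked out per level. The sketch below assumes that the origin marks the center of the first pixel and that each downscaled pixel averages `factor` adjacent pixels along an axis; other downsampling schemes may align differently. `level_geometry` is a hypothetical helper and the numbers are illustrative.

```python
def level_geometry(origin, spacing, factor):
    """Origin and spacing of a level downscaled by an integer factor per axis.

    Assumes pixel-centered origins and block averaging, so that
    new_spacing = factor * spacing and
    new_origin = origin + (factor - 1) / 2 * spacing.
    """
    new_spacing = tuple(s * f for s, f in zip(spacing, factor))
    new_origin = tuple(o + (f - 1) / 2 * s for o, s, f in zip(origin, spacing, factor))
    return new_origin, new_spacing


# Illustrative values: (x, y, z) spacing of 4, 4 and 40 units, z left unscaled.
new_origin, new_spacing = level_geometry(
    origin=(0.0, 0.0, 0.0), spacing=(4.0, 4.0, 40.0), factor=(2, 2, 1)
)
print(new_origin, new_spacing)
```

Storing these two tuples with each subresolution would let it be positioned against other datasets without first fetching the scale factors and repeating this computation.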

One option is to build on the model implied by storing images in the Xarray project data structures, which has Zarr support. This enables storing metadata such as the position of the first pixel, the spacing between pixels, and identification of the array dimensions, e.g., x, y, t, so that data can be used and passed through processing pipelines and visualization tools. This is helpful because it enables distributed computing via Dask and machine learning [2] via the scikit-learn API. Xarray has broad community adoption, and it is gaining more traction lately. Of course, a model that is compatible with Xarray does not require Xarray to use the data. On the other hand, Xarray coords have more flexibility than what is required for pixels sampled on a uniform rectilinear grid, and this adds a little complexity to the layout.

Generated from this example, here is what it looks like:

.
├── level_1.zarr
│   ├── rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6
│   │   ├── 0.0.0
│   │   ├── 0.0.1
....
│   │   ├── 9.9.9
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── x
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── y
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── z
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── .zattrs
│   ├── .zgroup
│   └── .zmetadata
├── level_2.zarr
│   ├── rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6
│   │   ├── 0.0.0
│   │   ├── 0.0.1
....
│   │   ├── 8.9.9
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── x
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── y
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── z
│   │   ├── 0
│   │   ├── .zarray
│   │   └── .zattrs
│   ├── .zattrs
│   ├── .zgroup
│   └── .zmetadata
....
├── rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6
│   ├── 0.0.0
│   ├── 0.0.1
...
│   ├── 9.9.9
│   ├── .zarray
│   └── .zattrs
├── x
│   ├── 0
│   ├── .zarray
│   └── .zattrs
├── y
│   ├── 0
│   ├── .zarray
│   └── .zattrs
├── z
│   ├── 0
│   ├── .zarray
│   └── .zattrs
├── .zattrs
├── .zgroup
└── .zmetadata

34 directories, 62359 files

This is the layout generated by xarray.Dataset.to_zarr. It does not mean that Xarray has to be used to read and write, but it would mean that Zarr images would be extremely easy to use via Xarray. In this case, .zmetadata is generated on each subresolution so it can be used entirely independently. Due to how Xarray/Zarr handles coords, x, y, and z are one-dimensional arrays. This results in every resolution having its own group.

The metadata looks like this:

{
    "metadata": {
        ".zattrs": {
            "_MULTISCALE_LEVELS": [
                "",
                "level_1.zarr",
                "level_2.zarr",
                "level_3.zarr",
                "level_4.zarr",
                "level_5.zarr",
                "level_6.zarr"
            ],
            "_SPATIAL_IMAGE": "rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6"
        },

        ".zgroup": {
            "zarr_format": 2
        },
        "level_1.zarr/.zattrs": {},
        "level_1.zarr/.zgroup": {
            "zarr_format": 2
        },
        "level_1.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zarray": {
            "chunks": [
                64,
                64,
                64
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 0
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                1080,
                1280,
                1280
            ],
            "zarr_format": 2
        },
        "level_1.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "z",
                "y",
                "x"
            ],
            "direction": [
                [
                    1.0,
                    0.0,
                    0.0
                ],
                [
                    0.0,
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    0.0,
                    1.0
                ]
            ],
            "units": "\u03bcm"
        },
        "level_1.zarr/x/.zarray": {
            "chunks": [
                1280
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                1280
            ],
            "zarr_format": 2
        },
        "level_1.zarr/x/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "x"
            ]
        },
        "level_1.zarr/y/.zarray": {
            "chunks": [
                1280
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                1280
            ],
            "zarr_format": 2
        },
        "level_1.zarr/y/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y"
            ]
        },
        "level_1.zarr/z/.zarray": {
            "chunks": [
                1080
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                1080
            ],
            "zarr_format": 2
        },
        "level_1.zarr/z/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "z"
            ]
        },
        "level_2.zarr/.zattrs": {},
        "level_2.zarr/.zgroup": {
            "zarr_format": 2
        },
        "level_2.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zarray": {
            "chunks": [
                64,
                64,
                64
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 0
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                540,
                640,
                640
            ],
            "zarr_format": 2
        },
        "level_2.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "z",
                "y",
                "x"
            ],
            "direction": [
                [
                    1.0,
                    0.0,
                    0.0
                ],
                [
                    0.0,
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    0.0,
                    1.0
                ]
            ],
            "units": "\u03bcm"
        },
        "level_2.zarr/x/.zarray": {
            "chunks": [
                640
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                640
            ],
            "zarr_format": 2
        },
        "level_2.zarr/x/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "x"
            ]
        },
        "level_2.zarr/y/.zarray": {
            "chunks": [
                640
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                640
            ],
            "zarr_format": 2
        },
        "level_2.zarr/y/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y"
            ]
        },
        "level_2.zarr/z/.zarray": {
            "chunks": [
                540
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                540
            ],
            "zarr_format": 2
        },
        "level_2.zarr/z/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "z"
            ]
        },
        "level_3.zarr/.zattrs": {},
        "level_3.zarr/.zgroup": {
            "zarr_format": 2
        },
        "level_3.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zarray": {
            "chunks": [
                64,
                64,
                64
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "zstd",
                "id": "blosc",
                "shuffle": 0
            },
            "dtype": "|u1",
            "fill_value": null,
            "filters": null,
            "order": "C",
            "shape": [
                270,
                320,
                320
            ],
            "zarr_format": 2
        },
        "level_3.zarr/rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "z",
                "y",
                "x"
            ],
            "direction": [
                [
                    1.0,
                    0.0,
                    0.0
                ],
                [
                    0.0,
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    0.0,
                    1.0
                ]
            ],
            "units": "\u03bcm"
        },
  [...]
        "rec20160318_191511_232p3_2cm_cont__4097im_1500ms_ML17keV_6/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "z",
                "y",
                "x"
            ],
            "direction": [
                [
                    1.0,
                    0.0,
                    0.0
                ],
                [
                    0.0,
                    1.0,
                    0.0
                ],
                [
                    0.0,
                    0.0,
                    1.0
                ]
            ],
            "units": "\u03bcm"
        },
        "x/.zarray": {
            "chunks": [
                2560
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                2560
            ],
            "zarr_format": 2
        },
        "x/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "x"
            ]
        },
        "y/.zarray": {
            "chunks": [
                2560
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                2560
            ],
            "zarr_format": 2
        },
        "y/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "y"
            ]
        },
        "z/.zarray": {
            "chunks": [
                2160
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": "NaN",
            "filters": null,
            "order": "C",
            "shape": [
                2160
            ],
            "zarr_format": 2
        },
        "z/.zattrs": {
            "_ARRAY_DIMENSIONS": [
                "z"
            ]
        }
    },
    "zarr_consolidated_format": 1
}

Here _MULTISCALE_LEVELS avoids the need to hardcode the identifiers, as suggested by @d-v-b and @manzt, but it could be renamed to multiscale, etc. _ARRAY_DIMENSIONS is the key that Xarray uses in Zarr files to identify the dims.
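As a sketch of how these keys might be consumed, the function below walks a consolidated-metadata dictionary and collects the shape of the image array at each level. It operates on a trimmed, illustrative dictionary: the image name "img" is a placeholder for the long array name above, the root shape is an assumed value, and `level_shapes` is a hypothetical helper.

```python
def level_shapes(consolidated):
    """Map each entry of _MULTISCALE_LEVELS to the shape of its image array."""
    meta = consolidated["metadata"]
    root_attrs = meta[".zattrs"]
    image = root_attrs["_SPATIAL_IMAGE"]
    shapes = {}
    for level in root_attrs["_MULTISCALE_LEVELS"]:
        # The empty string refers to the full-resolution image in the root group.
        prefix = f"{level}/" if level else ""
        shapes[level] = tuple(meta[f"{prefix}{image}/.zarray"]["shape"])
    return shapes


# Trimmed, illustrative stand-in for the .zmetadata content shown above.
consolidated = {
    "metadata": {
        ".zattrs": {
            "_MULTISCALE_LEVELS": ["", "level_1.zarr"],
            "_SPATIAL_IMAGE": "img",
        },
        "img/.zarray": {"shape": [2160, 2560, 2560]},
        "level_1.zarr/img/.zarray": {"shape": [1080, 1280, 1280]},
    },
    "zarr_consolidated_format": 1,
}
shapes = level_shapes(consolidated)
print(shapes)
```

Because everything comes from one consolidated document, no per-level request is needed to learn the pyramid's geometry.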

This example is generated with itk, but it could also just as easily be generated with scikit-image, or dask-image via [1] (work in progress) or pyimagej.

@sofroniewn

Thanks for the link to that example @thewtex! Conforming with xarray.Dataset.to_zarr where possible seems reasonable to me too.

@constantinpape, @bogovicj, @axtimwalde might also be interested in weighing in.

@jni

jni commented Mar 5, 2020

👍 to flat vs hierarchical representation. Also 👍 to "multiscale".

I also like the constraint that the sub-datasets should be openable as zarr arrays by themselves. I think @thewtex's example satisfies this. Having said this, @thewtex, the xarray model looks too complex to me compared to @joshmoore's proposed spec. It would be great if it could be stripped down to its bare essentials. I agree that it's nice to have the pixel start coordinate handy, but it can also be computed after the fact, so it should be optional I think.

Last thing, which may be out of scope, but might not be: for visualisation, it is sometimes convenient to have the same array with different chunk sizes, e.g. orthogonal planes to all axes for a 3D image. I wonder if the same data/metadata layout standard can be used in these situations.

Oh and @joshmoore

anyone else who's GitHub account I've forgotten for the preliminary discussions

whose. Regret pinging me yet? =P

@constantinpape

Great to see so much discussion on this proposal. I didn't have time to read through all of it yet, will try to catch up on the weekend.
Fyi, there is a pyramid storage format for n5 used by BigDataViewer and paintera already and I have used this format for large volume representations as well:
https://github.com/bigdataviewer/bigdataviewer-core/blob/master/BDV%20N5%20format.md

@forman

forman commented Mar 9, 2020

Great to see this moving on!

In our projects xcube and xcube-viewer image pyramids look like so:

example.levels/
├── 0.zarr    # Full-sized array
├── 1.zarr    # Level-0 X&Y dimensions divided by 2^1
├── 2.zarr    # Level-0 X&Y dimensions divided by 2^2
├── 3.zarr    # Level-0 X&Y dimensions divided by 2^3
└── 4.zarr    # Etc.

As @joshmoore mentioned, this also works without special metadata, because:

  • To make pyramids discoverable, we simply use the file extension .levels.
  • Spatial resolutions decrease by factor 2^Level.
  • The number of levels is obvious from the entries in the .levels folder.
  • Level zero can also be named 0.lnk. In this case it contains the path to the original data rather than a copy of the "pyramidized" original dataset.

(See also the xcube level CLI tool that implements this.)
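
The conventions above can be sketched in a few lines. This is not xcube's implementation; discover_levels and LEVEL_PATTERN are hypothetical names, and the sketch only inspects entry names in a local folder.

```python
import re
from pathlib import Path

# Matches entries such as "0.zarr", "3.zarr" or "0.lnk".
LEVEL_PATTERN = re.compile(r"^(\d+)\.(zarr|lnk)$")


def discover_levels(levels_dir):
    """Return {level: (entry_path, downsampling_factor)} for a ".levels" folder.

    The factor is 2**level, following the convention described above.
    """
    levels_dir = Path(levels_dir)
    if levels_dir.suffix != ".levels":
        raise ValueError(f"not a pyramid folder: {levels_dir}")
    found = {}
    for entry in levels_dir.iterdir():
        match = LEVEL_PATTERN.match(entry.name)
        if match is None:
            continue  # unrelated file or folder
        level, kind = int(match.group(1)), match.group(2)
        if kind == "lnk" and level != 0:
            continue  # the description only mentions a link for level zero
        found[level] = (entry, 2 ** level)
    return dict(sorted(found.items()))
```

Calling discover_levels("example.levels") on the layout shown above would yield levels 0 through 4 with factors 1, 2, 4, 8 and 16.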

We are looking forward to adapting our code to any commonly agreed-on Zarr "standard".

@joshmoore
Member Author

All-

Here's a quick summary from my side of discussions up to this point. Please send corrections/additions as you see fit. ~Josh

Apparent agreement

Name

The name "multiscale" seems to be generally acceptable (#50 (comment), #50 (comment))

Multiple series

Support for multiple series per group seems to be generally acceptable (e.g. #50 (comment)).

Special names

There are a few explicit votes for no special dataset names (e.g. #50 (comment)), but under "New ideas" there was one mention of group naming schemes.

Less clear

Layout

One primary decision point seems to be whether to use a deep or a flat layout:

Here I'd add that if flat is generally accepted as being the simplest approach for getting started, later revisions can always move to something more sophisticated. However, I'm pretty sure at that point we would want metadata not just at a single group level but either on multiple groups or all related datasets (or both).

Scaling information

Another key issue seems to be the scaling information. There are a range of ways that have been shown:

@sofroniewn even asked whether they are useful as they stand (#50 (comment)).

To be honest, I punted on this issue knowing that it would be harder to find consensus on it. To my mind, this could even be a second, though related, extension proposal. My reasoning for that is that it can also be used to represent the relationship between non-multiscale arrays, along the lines of @jni's "multiple chunk sizes" question below, and in the case of BDV, the relationship between the individual timepoints, etc.

My first question then would be: to what extent can the current multiscale proposal be of value without the spatial/scale/transform information?

New ideas

Explicit "name" key

@d-v-b's New proposed COSEM style from #50 (comment) uses this format:

        {"multiscale": [{"name": "base",  ...}, {"name" : "L1", ...}]}

Though this would prevent directly consuming the list (e.g. datasets = multiscale["series"][0]["datasets"]), it might provide a nice balance of extensibility, especially depending on the results of the coordinates/scales/transforms discussion.
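
To make the trade-off concrete, here is a minimal sketch of the extra step a consumer would need with the dictionary-based entries. The attribute contents are made up for the example and dataset_names is a hypothetical helper.

```python
# Hypothetical group attributes in the "name"-based style quoted above.
group_attrs = {
    "multiscale": [
        {"name": "base"},
        {"name": "L1"},
        {"name": "L2"},
    ]
}


def dataset_names(attrs):
    """Pull the dataset names out of the list of per-dataset dictionaries.

    Any additional keys on an entry (for example future spatial metadata)
    are simply ignored here.
    """
    return [entry["name"] for entry in attrs.get("multiscale", [])]


print(dataset_names(group_attrs))
```

The cost is one list comprehension; the benefit is that each entry has room for further per-dataset keys.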

Group naming

@forman showed an example from xcube in #50 (comment) in which group names were used rather than metadata to detect levels:

example.levels/

Links

@forman also showed in #50 (comment) one solution for linking: "Level zero can also be named 0.lnk. In this case it contains the path to the original data rather than a copy of the 'pyramidized' original dataset." This would likely need to be a prerequisite proposal for this one if we were to follow that route. cc: @alimanfoo

Either/or logic

In @d-v-b's COSEM writeup from #50 (comment), there is an example of either/or logic, where code would need to check in more than one location for a given piece of metadata:

 -     ├── (required) s1 (optional, unless "scales" is not a group level attribute): {"downsamplingFactors": [a, b, c]})

Multiple chunk sizes

@jni pondered in #50 (comment): "for visualisation, it is sometimes convenient to have the same array with different chunk sizes, e.g. orthogonal planes to all axes for a 3D image. I wonder if the same data/metadata layout standard can be used in these situations."


For the record, I'd currently err on the side of:

  • sticking with a flat "multiscale" object
  • without links or either/or logic
  • and without any special names,
  • while likely moving to the more flexible [{"name": "base"}] format
  • and saving coordinates for a follow-on proposal.

(whew) But opinions, as always, are very welcome.


Further CCs: @saalfeldlab @axtimwalde @tpietzsch

@d-v-b
Contributor

d-v-b commented Mar 9, 2020

My first question then would be: to what extent can the current multiscale proposal be of value without the spatial/scale/transform information?

I think there's value in the current effort, insofar as standardizing spatial metadata is a separable issue.

For a multiscale image spec, I would propose abstracting over the specific implementation of spatial metadata, e.g. by stipulating that the group multiscale attribute must contain the same spatial metadata as the collection of array attributes. This assumes as little as possible about the details of the spatial metadata (but a key assumption I'm making is that duplicating this metadata is not prohibitive).

For the record, I'd currently err on the side of:
  • sticking with a flat "multiscale" object
  • without links or either/or logic
  • and without any special names,
  • while likely moving to the more flexible [{"name": "base"}] format
  • and saving coordinates for a follow-on proposal.

These all look good to me!

@thewtex

thewtex commented Mar 9, 2020

@joshmoore outstanding summary! Thanks for leading this endeavor.

My first question then would be: to what extent can the current multiscale proposal be of value without the spatial/scale/transform information?

To correctly analyze or visualize the data as a multiscale image pyramid, some spatial/scale/transform information is required.

To:

  • Compare with image subregions
  • Handle anisotropically sampled volumes
  • Compare with segmentations stored as meshes whose node positions are in "world space" or were generated from a derived volume sampled on a different pixel sampling grid.
  • Use model-based annotations defined in "world space"
  • Effectively utilize image registration

Spacing / scale and offset / origin and/or transforms are required. Without them, these use cases are either complex and error prone (requiring provenance and computation related to source pixel grids), or not possible. This is why the majority of scientific imaging file formats have at least spacing / scale and offset / origin in some form.
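
As a small worked example of why spacing and origin matter across levels, the sketch below maps a pixel index to a "world space" position and derives the spacing and origin of a downsampled level. It assumes the origin is the center of the first pixel and that each coarse pixel is the average of a block of fine pixels; other conventions give different offsets. The function names are hypothetical.

```python
def index_to_world(index, spacing, origin):
    """World position of a pixel center: origin + index * spacing, per axis."""
    return tuple(o + i * s for i, s, o in zip(index, spacing, origin))


def downsampled_grid(spacing, origin, factors):
    """Spacing and origin of a level made by averaging blocks of pixels.

    Assumes `origin` is the center of pixel 0, so the center of the first
    coarse pixel sits in the middle of the first block of fine pixels.
    """
    new_spacing = tuple(s * f for s, f in zip(spacing, factors))
    new_origin = tuple(
        o + 0.5 * (f - 1) * s for s, o, f in zip(spacing, origin, factors)
    )
    return new_spacing, new_origin


# Anisotropic volume, downsampled by 2 in the first two axes only.
spacing0, origin0 = (0.5, 0.5, 2.0), (10.0, 20.0, 0.0)
spacing1, origin1 = downsampled_grid(spacing0, origin0, (2, 2, 1))
print(spacing1, origin1)
print(index_to_world((3, 4, 5), spacing1, origin1))
```

Without the stored spacing and origin, a reader would have to know how each level was produced in order to recover these positions.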

That said, the specs could still be split into two to keep things moving along.

@rabernat
Contributor

rabernat commented Mar 12, 2020

Thanks so much to everyone who is putting detailed thought into this complex issue. Since the discussion has mostly focused on the bioimaging side of things, I'll try to add the xarray & geospatial perspective.

  • The main precedent for "multiscale arrays" in geospatial comes from the GeoTIFF / COG format. In geospatial lingo, they are called "overviews". GDAL has good documentation on this.
  • There has been some discussion in xarray about supporting overviews (see Accessing COG overviews with read_rasterio pydata/xarray#3269), but it is not currently part of our data model, which is derived from the common data model and tied closely to netCDF.
  • However, xarray does have a very convenient utility for generating overviews: the coarsen method.
  • For climate model data, generating overviews is not trivial because the cell geometry can be non-euclidean. You need to know an area weighting factor to apply when coarse graining. It's not clear to me from the discussion above whether zarr needs to know how to actually generate these overviews, or if that is up to a third-party library.
  • Nevertheless, given the proliferation of high resolution weather and climate models, the ability to store overviews would be quite valuable, particularly for interactive visualization. For broader adoption, this concept would need to make its way into the NetCDF standard itself.
  • The bigger cells / pixels get, the more important coordinates and cell bounds become. It seems like this conversation is closely tied to the question of how to represent coordinates in zarr. As noted by @thewtex, we have already established some de-facto standards about how to do this in order to plug zarr into xarray. So these discussions need to happen in parallel.
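
To illustrate the area-weighting point, here is a minimal pure-Python sketch of coarse graining a 2D field with per-cell areas. It is not xarray's coarsen method; coarsen_weighted is a hypothetical helper and the numbers are invented.

```python
def coarsen_weighted(values, areas, factor):
    """Area-weighted mean over factor x factor blocks of a 2D list of lists."""
    rows, cols = len(values), len(values[0])
    if rows % factor or cols % factor:
        raise ValueError("grid shape must be divisible by the factor")
    coarse = []
    for r0 in range(0, rows, factor):
        coarse_row = []
        for c0 in range(0, cols, factor):
            weighted_sum = 0.0
            total_area = 0.0
            for r in range(r0, r0 + factor):
                for c in range(c0, c0 + factor):
                    weighted_sum += values[r][c] * areas[r][c]
                    total_area += areas[r][c]
            coarse_row.append(weighted_sum / total_area)
        coarse.append(coarse_row)
    return coarse


values = [[1.0, 3.0], [5.0, 7.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
unequal = [[1.0, 1.0], [3.0, 3.0]]  # second row covers three times the area

print(coarsen_weighted(values, uniform, 2))  # plain mean of the block
print(coarsen_weighted(values, unequal, 2))  # pulled toward the larger cells
```

With unequal cell areas the overview value differs from the plain mean, which is why the weighting factor has to be available to whatever generates the overviews.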

@hanslovsky

Great discussion. These are my $0.02. Largely, I agree with @joshmoore's summary in #50 (comment). Being able to open each scale level as an individual data set and not part of a pyramid is probably the most important feature and should be part of any standard that comes out of this. With this in mind, the spatial meta data (gridSpacing and origin) would need to be stored in the attributes of the individual datasets. This means either

  • duplication of the spatial meta data, or
  • no spatial meta data of the individual datasets in the multiscale group.

This also does not consider other spatial meta data like rotations. As far as I know, this is a relevant use case for @tpietzsch. If such (arbitrary) transforms should not be considered in the standard, then the question arises of how to combine this with the gridSpacing and origin. In such a scenario, I would probably set the origin to zero with appropriate shifts in downscaled levels as needed, and have the actual offset after the rotation in a global transform. But then again, each scale dataset could not be loaded individually with the correct scaling, rotation, and offset, without explicit knowledge of the pyramid.

Other than that, here are a few comments:

  • I think the bare minimum spatial information is going to be the gridSpacing and origin for each scale level. I do not have a strong opinion about nomenclature. In Paintera, it is resolution and offset, but I am ok with anything reasonable.
  • If scales are defined, they should be fully specified for all of the spatial dimensions, i.e. for 3D or 3D+channel, it would be [[sx, sy, sz], ...]. I like having the scales attribute, but the scales can be inferred from gridSpacing, so it is redundant information.
  • I prefer the format that @d-v-b proposed that stores an array of dictionaries for the datasets, e.g.
[{"name": "s0", "meta1": ...}, {"name": "s1", "meta1": ...}]

over storing multiple arrays like

{"datasets": ["s0", "s1", ...], "meta1": [...]}
  • I do like the idea of having multiple multi-scale groups within a group and specifying scale levels at arbitrary paths (relative to the group). I had not thought of that before but it sounds very intriguing. One caveat here is that it may get out of control and result in very chaotic dataset hierarchies, but that would be the responsibility of the user. I am not aware of any good restriction, yet. Considering this extension, maybe using "path" as a key instead of "name" in @d-v-b's proposal may be more descriptive and appropriate.
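
The remark that scales are redundant given gridSpacing can be sketched as follows, using the array-of-dictionaries layout with a "path" key. The spacing values are invented and infer_scales is a hypothetical helper.

```python
# Hypothetical per-dataset entries; "gridSpacing" values are made up.
datasets = [
    {"path": "s0", "gridSpacing": [4.0, 4.0, 40.0]},
    {"path": "s1", "gridSpacing": [8.0, 8.0, 40.0]},
    {"path": "s2", "gridSpacing": [16.0, 16.0, 80.0]},
]


def infer_scales(datasets):
    """Relative scale of every level with respect to the first entry.

    Each scale is the ratio of that level's grid spacing to the spacing
    of the first (full-resolution) dataset, per spatial dimension.
    """
    base = datasets[0]["gridSpacing"]
    return [
        [s / b for s, b in zip(entry["gridSpacing"], base)]
        for entry in datasets
    ]


print(infer_scales(datasets))
```

Keeping each dataset's metadata in its own dictionary also means an entry can be added or removed without having to keep several parallel arrays in step.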

I think that a common standard would be a great thing to have and help interaction between the wealth of tools that we are looking at. Paintera does not have a great standard and should update its format if a reasonable standard comes out of this (while maintaining backwards compatibility).

Disclaimer: I will start a position outside academia soon and will not be involved in developing tools in this realm after that. My comment should be regarded as food for thought and to raise concerns that may not have been considered yet. Ultimately, I will not be involved in the decision making of any specifics of this standard.

cc @igorpisarev

@joshmoore
Member Author

Apologies, all, for letting this slip into April. Hopefully everyone's managing these times well enough despite the burden of long spec threads.

I've updated the description to include the new {"name": ...} syntax and added a new deadline of April 15th for further responses.

A few points on the more recent comments:

Otherwise, it sounds like the newer comments are generally onboard with the current proposal, but let me know if I've dropped anyone's concerns.

@d-v-b
Contributor

d-v-b commented Apr 2, 2020

I like path much more than name. +1 to that.

My major concern with duplication would be keeping the two representations consistent.

This is a valid concern. Personally I don't like duplicating spatial metadata in the group -- my original conception a long time ago was for the group multiscale metadata to simply list the names/paths to the datasets that comprise the pyramid, with no additional information. But I was reminded by @axtimwalde that accessing metadata from multiple files on cloud stores can be bothersome, and this led to the idea of consolidating the array metadata at the group level. Maybe this can be addressed via the consolidated metadata functionality that has already been added to zarr: https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata.

For a spec, a way to resolve this could be to specify that, for each dataset entry in the group multiscale metadata, a path field is required but additional fields per dataset are optional. In this regime, programs that attempt to parse the multiscale group may look for consolidated metadata in the group attributes, but they should have a fallback routine that involves parsing the individual attributes of the datasets.
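
A minimal sketch of that parsing regime, under the assumption that the group attributes hold a "multiscale" list and that the caller supplies a function for reading one array's attributes. Both resolve_dataset_metadata and load_array_attrs are hypothetical names.

```python
def resolve_dataset_metadata(group_attrs, load_array_attrs):
    """Collect per-dataset metadata, preferring what the group already holds.

    `load_array_attrs(path)` is only called for entries that carry nothing
    besides the required "path", which is the fallback described above.
    """
    resolved = []
    for entry in group_attrs["multiscale"]:
        if "path" not in entry:
            raise ValueError("every dataset entry needs a 'path'")
        extra = {key: value for key, value in entry.items() if key != "path"}
        if not extra:
            # Nothing duplicated at the group level: read the array's own
            # attributes (one extra request per dataset on a cloud store).
            extra = dict(load_array_attrs(entry["path"]))
        resolved.append({"path": entry["path"], **extra})
    return resolved


# Stand-in for reading attributes from individual arrays.
array_attrs = {"s0": {"gridSpacing": [1, 1, 1]}, "s1": {"gridSpacing": [2, 2, 2]}}
group_attrs = {"multiscale": [{"path": "s0", "gridSpacing": [1, 1, 1]}, {"path": "s1"}]}
print(resolve_dataset_metadata(group_attrs, array_attrs.__getitem__))
```

In this sketch the group-level copy wins when present, so it does not address the consistency concern; a stricter reader could load both and compare them.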

@axtimwalde

What would we do if cloud storage didn't have high latency? I am similarly worried about the consolidated meta-data hack because we may store a lot of meta-data, and parsing very long JSON texts isn't particularly fast either; it also doesn't scale very well.

@jrbourbeau jrbourbeau added the protocol-extension Protocol extension related issue label Apr 6, 2020
@joshmoore
Member Author

NB: Updated description to use "path".

#50 (comment)

I had never considered a level of consolidation between none and everything, e.g. all arrays (but not groups) within a group are cached within the group metadata. It's an interesting idea, but discussing it here seems dangerous.

If we assume that consolidation is out-of-scope for this issue, I think the only question remaining is if we want optional spatial metadata at the group level, where the array metadata would take precedence. Here, I'd likely also vote for being conservative and not doing that at this point, though we could add it in the future (more easily than we could remove it).

If all agree, I'll add hopefully one last update to remove all mention of "scale" and then start collecting all the spatial ideas that we've tabled in this issue into a new one.

joshmoore added a commit to joshmoore/bioformats2raw that referenced this issue Apr 23, 2020
By listing pyramids in the group attributes which contain
the pyramids, clients can lookup the number of resolutions
without needing to know beforehand or perform a directory
listing.

see: zarr-developers/zarr-specs#50
@joshmoore
Member Author

joshmoore commented May 6, 2020

Description now updated removing use of "scale" and clarifying a few items like the ordering of the datasets which have come up recently during conversations on image.sc, twitter, etc. Thanks again to everyone for the feedback.

@joshmoore
Member Author

This issue has been migrated to image.sc after the 2020-05-06 community discussion and will be closed. Authors are still encouraged to make use of the specification in their own libraries. As the v3 extension mechanism matures, the specification will be updated and registered as appropriate. Many thanks to everyone who has participated to date. Further feedback and request changes are welcome either on this repository or on image.sc.
