[Discussion]: syntax differences due to backend differences #1702

Open

bendichter opened this issue Jun 16, 2023 · 5 comments
@bendichter
Contributor

bendichter commented Jun 16, 2023

What would you like to see added to PyNWB?

In a conversation with @satra and @oruebel at the BRAIN meeting, we discussed syntax differences in pynwb that are the result of using different backends. This has the potential to cause bugs and complications for downstream code. I am aware of three areas in PyNWB where differences appear.

  1. IO for reading and writing files: A user needs to decide between NWBHDF5IO and NWBZarrIO in order to read or write a file. The writing part doesn't seem that bad, since the user will be required to indicate the backend of choice somehow, and using a different class seems like a reasonable way to do that. However, for read you might prefer that pynwb determine the backend automatically. Otherwise, we are going to need to cover the read classes for all of the different possible backends in tutorials. IMHO it wouldn't hurt to have an NWBIO class that can automatically determine the backend. This is a relatively simple solution and would make the pynwb library more user-friendly.
    relevant issues: NWBIO #858

  2. dataset configuration on write: chunking, compression, filters, shuffling, etc. are specified slightly differently between HDF5/h5py and Zarr. You might consider creating a unifying API layer that translates into the backend-specific specs (@satra was advocating for this); however, there are enough differences in the capabilities and logic of the different approaches that this would not be straightforward. If we do create a unifying language, it will be difficult to avoid restricting access to the configuration capabilities of particular backends. For example, I'd like to use WavPack, but that is currently only available with Zarr.

  3. Dataset indexing when reading: When you read datasets from an NWB file, you get an h5py.Dataset vs. a zarr.Array. These classes behave similarly to an np.ndarray, but there are enough differences that they will likely cause bugs for any analysis script that wants to support multiple backends. For example, this code works in Zarr:

import zarr
import numpy as np

# Create a new Zarr array
zarr_array = zarr.zeros((10, 10))  # Create a 10x10 array of zeros
zarr_array[:] = np.random.rand(10, 10)  # Fill the array with random numbers

# Assume zarr_array is a 2D array, create index arrays
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 3])

# Use multidimensional indexing
subset = zarr_array[rows, cols]

# Print the subset
print(subset)
[0.43752435 0.57966441 0.86366265]

but the following analogous code in h5py does not work:

import h5py
import numpy as np

# Create a new HDF5 file and dataset
file = h5py.File('filename.hdf5', 'w')
data = np.random.rand(10, 10)  # Create a 10x10 array of random numbers
dataset = file.create_dataset('dataset', data=data)

# Assume dataset is a 2D array, create index arrays
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 3])

# Use multidimensional indexing
subset = dataset[rows, cols]

# Print the subset
print(subset)

# Remember to close the file
file.close()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [9], line 10
      7 dataset = file.create_dataset('dataset', data=data)
      9 # Use multidimensional indexing
---> 10 subset = dataset[rows, cols]
     12 # Print the subset
     13 print(subset)

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File ~/opt/miniconda3/lib/python3.9/site-packages/h5py/_hl/dataset.py:814, in Dataset.__getitem__(self, args, new_dtype)
    809     return arr
    811 # === Everything else ===================
    812 
    813 # Perform the dataspace selection.
--> 814 selection = sel.select(self.shape, args, dataset=self)
    816 if selection.nselect == 0:
    817     return numpy.zeros(selection.array_shape, dtype=new_dtype)

File ~/opt/miniconda3/lib/python3.9/site-packages/h5py/_hl/selections.py:82, in select(shape, args, dataset)
     79     space = h5s.create_simple(shape)
     80     selector = _selector.Selector(space)
---> 82 return selector.make_selection(args)

File h5py/_selector.pyx:276, in h5py._selector.Selector.make_selection()

File h5py/_selector.pyx:189, in h5py._selector.Selector.apply_args()

TypeError: Only one indexing vector or array is currently allowed for fancy indexing

There are many other subtle differences between the indexing of these different array-like classes. See the Zarr docs on fancy indexing.
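To illustrate the kind of workaround downstream code ends up writing, here is a minimal, hypothetical helper (not part of PyNWB or HDMF) that emulates two-vector fancy indexing on either backend by reading the bounding box with plain slices and applying the fancy index in memory:

import numpy as np

def pointwise_select(dataset, rows, cols):
    # Works for np.ndarray, zarr arrays, and h5py datasets alike:
    # plain slicing is supported by all three, so read the bounding
    # box and apply numpy fancy indexing on the in-memory result.
    rows = np.asarray(rows)
    cols = np.asarray(cols)
    r0, c0 = rows.min(), cols.min()
    block = np.asarray(dataset[r0:rows.max() + 1, c0:cols.max() + 1])
    return block[rows - r0, cols - c0]

This trades extra I/O for a single code path, which is exactly the kind of boilerplate we would like to spare downstream developers.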

In NWB Widgets, we are starting to run into bugs that are the result of these differences, e.g. NeurodataWithoutBorders/nwbwidgets#283. Without a coordinated response, I anticipate that these types of issues will accumulate across the full cross-section of analysis tools and backends.

Is your feature request related to a problem?

No response

What solution would you like?

There are some pretty big trade-offs here, including homogenization, backwards compatibility, and evaluation of the complexity of each solution. I think it deserves some discussion.

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

@bendichter bendichter changed the title [Discussion]: backend syntax changes [Discussion]: syntax differences due to backend differences Jun 16, 2023
@oruebel
Contributor

oruebel commented Jun 16, 2023

Thanks for the summary. I would suggest we discuss here for a bit and then create separate issues for items where we identify more specific tasks or issues.

  1. However, for read you might prefer that pynwb determines the backend automatically.

For reading data locally (or via fsspec) this should be fine. Maybe a generic open(...) function would be a good approach for this, as it would: 1) imply that this is just for reading, 2) be similar to other libraries, and 3) allow us to restrict what the function can do. open would then return an instance of NWBHDF5IO or NWBZarrIO (or whatever HDMFIO class is being used).
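To make the idea concrete, a minimal sketch of such an open function might look like this (open_nwb is hypothetical, and the NWBZarrIO import path assumes the hdmf-zarr package is installed):

import os
import h5py
from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

def open_nwb(path, mode="r"):
    # Detect the backend from the file on disk: HDF5 files are
    # identified via h5py.is_hdf5, while a local Zarr store is a
    # directory. Remote stores would need additional handling.
    if os.path.isfile(path) and h5py.is_hdf5(path):
        return NWBHDF5IO(path, mode=mode)
    if os.path.isdir(path):
        return NWBZarrIO(path, mode=mode)
    raise ValueError(f"Could not determine the NWB backend for {path!r}")

The caller would then use the returned IO object exactly as today, e.g. with open_nwb(path) as io: nwbfile = io.read().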

  1. The writing part doesn't seem that bad since the user will be required to indicate the backend of choice somehow

I agree; for write this is less useful, in large part because the user also wants to define backend-specific options on write.

3. Dataset reading: When you read datasets from an NWB file, you get h5py.Dataset vs. zarr.Array

This means we would need to wrap all arrays in a wrapper class to unify array slicing etc. across the various libraries (numpy, lists, h5py.Dataset, zarr.Array, etc.). The same approach would also be useful for the second item (unifying I/O settings) by making the wrapper class inherit from DataIO and also handle setting of compression and chunking parameters.

While this approach is attractive, I think the main reason we have not gone this route so far is that this is a big issue with the potential to effectively grow into its own library (xarray is an example of such a library). Unifying array slicing while maintaining performance is non-trivial, because Zarr and h5py impose different restrictions relative to the full numpy syntax. Unifying I/O settings is similarly problematic, because the options different libraries provide are broad; even within Zarr, options can vary depending on the backend store. Also, while we are dealing with Zarr and HDF5 right now, it is unclear how this approach would work for other potential backend stores that may not be based on chunked arrays. With something as fundamental as wrapping arrays, it is also challenging to draw a line in terms of restricting functionality, because restrictions quickly limit practical utility and lead to frustration. In principle, I think having a common array wrapper class that unifies the interface of all array types in Python would be great, but developing it would require substantial resources beyond what we have available at this time.
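For reference, even a heavily restricted version of the wrapper idea shows where the complexity comes from. The sketch below is purely hypothetical (not something HDMF provides) and only normalizes the multi-vector fancy-index case from the example above, forwarding everything else to the wrapped object:

import numpy as np

class UnifiedArray:
    # Wraps an h5py.Dataset, zarr.Array, or np.ndarray and exposes a
    # single __getitem__. Only a tuple of index vectors is translated;
    # a real wrapper would also need mixed slice/vector keys, boolean
    # masks, lazy reads, write settings, and more.
    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        if (isinstance(key, tuple)
                and all(isinstance(k, (list, np.ndarray)) for k in key)):
            idx = [np.asarray(k) for k in key]
            bbox = tuple(slice(int(k.min()), int(k.max()) + 1) for k in idx)
            block = np.asarray(self._data[bbox])
            return block[tuple(k - b.start for k, b in zip(idx, bbox))]
        return self._data[key]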

2. dataset configuration on write: chunking, compression, filters, shuffling, etc.

I agree with your comment that consistency on read is more important than on write, since users have to make custom choices on write anyway. Similar to my comments above on array wrapping, I think a full unification is challenging and outside the scope of what we can do. However, there may be a middle-ground option: a helper function (or class) of sorts that focuses on recommended and most common chunking/compression settings. Something like that could help us recommend best practices for chunking and compression and at the same time limit scope in a way that would make this doable.
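A sketch of what such a middle-ground helper could look like, building on the existing H5DataIO and ZarrDataIO wrappers (the helper itself and its particular option choices are hypothetical, not part of HDMF):

import numcodecs
from hdmf.backends.hdf5 import H5DataIO
from hdmf_zarr.utils import ZarrDataIO

def wrap_for_backend(data, backend, chunks=(1000, 64)):
    # Express one recommended chunking/compression recipe and translate
    # it to the keyword arguments each backend's DataIO class expects.
    if backend == "hdf5":
        return H5DataIO(data=data, chunks=chunks, compression="gzip",
                        compression_opts=4, shuffle=True)
    if backend == "zarr":
        compressor = numcodecs.Blosc(cname="zstd", clevel=4,
                                     shuffle=numcodecs.Blosc.SHUFFLE)
        return ZarrDataIO(data=data, chunks=chunks, compressor=compressor)
    raise ValueError(f"unknown backend: {backend}")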

@bendichter
Contributor Author

bendichter commented Jun 16, 2023

@oruebel I agree with you that defining a uniform API over the various array-like APIs is a challenging endeavor that is outside the scope of the project, but is there any other way we can get ahead of the differences in these APIs? Maybe we could offer a test suite for downstream software that lets developers test their data access patterns against all of the supported backends? Or maybe we could create documentation of best practices that work across backends?
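For example, a downstream package could run its access patterns through a parametrized test like this sketch, where make_test_nwbfile is a hypothetical fixture that writes a small NWB file with the requested backend and returns the path together with the matching IO class (the dataset name test_series is likewise made up):

import numpy as np
import pytest

@pytest.mark.parametrize("backend", ["hdf5", "zarr"])
def test_timeseries_slicing(make_test_nwbfile, backend):
    # The same access pattern is exercised against every supported backend.
    path, io_cls = make_test_nwbfile(backend)
    with io_cls(path, mode="r") as io:
        nwbfile = io.read()
        data = nwbfile.acquisition["test_series"].data
        window = np.asarray(data[0:100])  # pattern the downstream tool relies on
        assert window.shape[0] == 100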

@oruebel
Contributor

oruebel commented Jun 16, 2023

is there any other way we can get ahead of the differences in these APIs

I'm certainly open to suggestions.

Maybe we can offer a test suite for downstream software that allows them to test their data access patterns against all of the supported backends

I think that should be doable if that is useful. Do we have a test/use-case for this that we can work from?

Or maybe we could create documentation of best practices that work cross-backend?

I think that would certainly be a good start. Here we could cover both read and write best practices. Maybe a gallery tutorial would be useful. I'd be happy to make this a project for the user/developer days that I could hack on.

@satra

satra commented Jul 17, 2023

sorry for the lateness in responding to this.

i think NWB should strive for uniformity where possible, as dealing with if's is not something you want downstream developers (e.g., nwbwidgets) to have to do when carrying out basic processing operations. if you see isinstance littered throughout, that's going to be an issue.

it could simply mean there is a function as_xarray or some such that essentially provides a handle in a known entity. one may not be able to do everything with this, but it makes the data access layer uniform. similarly one could provide a similar access layer for the metadata in some known entity.
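a rough sketch of that idea (as_xarray is hypothetical, not an existing pynwb function; it loads data into memory, so a real version would want a lazy path, e.g. via dask, for large datasets):

import numpy as np
import xarray as xr

def as_xarray(dataset, dims=None):
    # Accepts whatever array-like the backend hands back
    # (h5py.Dataset, zarr array, np.ndarray, list) and returns a
    # single known entity: an xarray.DataArray.
    arr = np.asarray(dataset)
    if dims is None:
        dims = [f"dim_{i}" for i in range(arr.ndim)]
    return xr.DataArray(arr, dims=dims)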

i would be more concerned about read than write at this point. on the write side, it would be useful to abstract out a few things, and provide low level interfaces where one needs specific functionality. it may be useful to take a few use cases and write a spec for what those look like now for the two different backends, where are the places where those if's come in. and that should be a doable exercise in the tutorials for example. what if every tutorial could deal with either type of file? what would that look like?

@bendichter
Contributor Author

(1) is being handled by read_nwbfile, which is currently being developed in NWB Inspector with plans to move it to PyNWB when it is ready: NeurodataWithoutBorders/nwbinspector#388

For (3), I like Satra's idea of reading data as an XArray. Let's see if that's doable, as that would solve the uniformity problem somewhat.

@stephprince stephprince added category: proposal proposed enhancements or new features priority: medium non-critical problem and/or affecting only a small set of NWB users labels Apr 16, 2024