[Discussion]: syntax differences due to backend differences #1702
Comments
Thanks for the summary. I would suggest we discuss here for a bit and then create separate issues for items where we identify more specific tasks or issues.
For reading data locally (or via fsspec) this should be fine. Maybe a generic `NWBIO` class could handle this automatically.
I agree; for write this is less useful, in large part because a user here also wants to define different options for write.
This means we would need to wrap all arrays in a wrapper class to unify array slicing etc. between the various libraries (numpy, lists, `h5py.Dataset`, `zarr.Array`, etc.). The same approach would also be useful to go after the second item (to unify I/O settings) by making the wrapper class inherit from a common base. While this approach is attractive, I think the main reason we have not gone this route so far is that this is a big issue with the potential to effectively grow into its own library.
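To make the wrapper idea concrete, here is a minimal sketch (a hypothetical class, not an existing hdmf API) that normalizes fancy indexing across backends:

```python
import numpy as np

class DatasetWrapper:
    """Hypothetical wrapper presenting one slicing API over numpy/h5py/zarr.

    A sketch only; hdmf does not currently ship such a class, and duplicate
    indices are not handled here.
    """

    def __init__(self, data):
        self.data = data  # np.ndarray, h5py.Dataset, or zarr.Array

    def __getitem__(self, key):
        if isinstance(key, (list, np.ndarray)):
            # h5py only accepts fancy indices that are increasing and unique,
            # so sort the indices first and undo the sort on the result.
            idx = np.asarray(key)
            order = np.argsort(idx)
            selected = np.asarray(self.data[idx[order].tolist()])
            inverse = np.empty_like(order)
            inverse[order] = np.arange(len(order))
            return selected[inverse]
        return self.data[key]
```

With this, `DatasetWrapper(dset)[[5, 1, 3]]` would return elements in the requested order for all three backends.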
I agree with your comment that consistency on read is more important than on write, since users have to make custom choices on write anyway. Similar to my comments above on array wrapping, I think a full unification is challenging and outside the scope of what we can do. However, there may be a middle-ground option: a helper function (or class) for common compression options, focused on the recommended and most common chunking/compression settings. Something like that could help us recommend best practices for chunking and compression while limiting the scope in a way that makes this doable.
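As a rough sketch of such a helper (the `recommended_data_io` name is hypothetical; `H5DataIO` and `ZarrDataIO` are the existing hdmf/hdmf-zarr wrappers, though the exact keyword names should be checked against their docs):

```python
import numcodecs
from hdmf.backends.hdf5 import H5DataIO
from hdmf_zarr import ZarrDataIO

def recommended_data_io(data, backend, chunks=(1000,)):
    """Hypothetical preset helper: wrap data with sensible per-backend defaults."""
    if backend == "hdf5":
        return H5DataIO(data, chunks=chunks, compression="gzip", shuffle=True)
    if backend == "zarr":
        return ZarrDataIO(data, chunks=chunks,
                          compressor=numcodecs.Blosc(cname="zstd", clevel=5,
                                                     shuffle=numcodecs.Blosc.SHUFFLE))
    raise ValueError(f"unknown backend: {backend!r}")
```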
@oruebel I agree with you that defining a uniform API over the various array-like APIs is a challenging endeavor that is outside of the scope of the project, but is there any other way we can get ahead of the differences in these APIs? Maybe we can offer a test suite for downstream software that allows them to test their data access patterns against all of the supported backends? Or maybe we could create documentation of best practices that work cross-backend?
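For instance, a downstream test matrix could be parametrized over backends, roughly like this sketch (assumes pynwb and hdmf-zarr; `make_nwbfile` is a hypothetical user-supplied fixture):

```python
import pytest
from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

BACKENDS = {
    "hdf5": (NWBHDF5IO, "test.nwb"),
    "zarr": (NWBZarrIO, "test.nwb.zarr"),
}

@pytest.mark.parametrize("backend", BACKENDS)
def test_roundtrip_access(backend, make_nwbfile, tmp_path):
    """Write a file with each backend, then exercise the same access pattern."""
    io_cls, fname = BACKENDS[backend]
    path = str(tmp_path / fname)
    with io_cls(path, mode="w") as io:
        io.write(make_nwbfile())  # make_nwbfile: hypothetical fixture
    with io_cls(path, mode="r") as io:
        nwbfile = io.read()
        data = nwbfile.acquisition["ts"].data  # assumes a TimeSeries named "ts"
        assert data[0:10].shape == (10,)       # same slicing must work everywhere
```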
I'm certainly open to suggestions.
I think that should be doable if that is useful. Do we have a test/use-case for this that we can work off?
I think that would certainly be a good start. Here we could cover both read and write best practices. Maybe a gallery tutorial would be useful. I'd be happy to make this a project for the user/developer days that I could hack on.
Sorry for the lateness in responding to this. I think NWB should strive for uniformity where possible, as dealing with it could simply mean there is a single function to call. I would be more concerned about read than write at this point. On the write side, it would be useful to abstract out a few things and provide low-level interfaces where one needs specific functionality. It may be useful to take a few use cases and write a spec for what those look like now for the two different backends, and where the places are where those diverge.
(1) is being handled by #858. For (3), I like Satra's idea of reading data as an XArray. Let's see if that's doable, as that would solve the uniformity problem somewhat.
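A sketch of that idea (not current pynwb behavior; both `h5py.Dataset` and `zarr.Array` expose the array interface that dask can wrap):

```python
import dask.array as da
import xarray as xr

def as_xarray(dataset, dims=("time",)):
    """Sketch: expose an h5py.Dataset or zarr.Array uniformly as an xarray.DataArray."""
    # da.from_array accepts any object with .shape, .dtype, and __getitem__,
    # which both backends provide; indexing then follows dask/NumPy semantics.
    return xr.DataArray(da.from_array(dataset, chunks="auto"), dims=dims)

# e.g. arr = as_xarray(nwbfile.acquisition["ts"].data)
#      arr[[5, 1, 3]].compute()   # same fancy indexing on either backend
```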
What would you like to see added to PyNWB?
In a conversation with @satra and @oruebel at the BRAIN meeting, we discussed syntax differences in pynwb that are the result of using different backends. This has the potential to cause bugs and complications for downstream code. I am aware of three areas in PyNWB where differences appear.
1. IO for reading and writing files: A user needs to decide between `NWBHDF5IO` and `NWBZarrIO` in order to read or write a file. The writing part doesn't seem that bad, since the user will be required to indicate the backend of choice somehow, and using a different class seems like a reasonable way to do that. However, for read you might prefer that pynwb determine the backend automatically; otherwise, we are going to need to cover the read classes for all of the different possible backends in tutorials. IMHO it wouldn't hurt to have an `NWBIO` class that can automatically determine the backend. This is a relatively simple solution and would make the pynwb library more user-friendly. Relevant issues: NWBIO #858
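A minimal sketch of what such an `NWBIO` factory could look like (the detection heuristics here are assumptions, not an existing pynwb API):

```python
import os
from pynwb import NWBHDF5IO
from hdmf_zarr.nwb import NWBZarrIO

def NWBIO(path, mode="r"):
    """Sketch: choose the IO class by inspecting the target path."""
    if os.path.isdir(path):
        # Zarr stores are typically directories (assumed heuristic).
        return NWBZarrIO(path, mode=mode)
    with open(path, "rb") as f:
        signature = f.read(8)
    if signature == b"\x89HDF\r\n\x1a\n":  # the HDF5 file signature
        return NWBHDF5IO(path, mode=mode)
    raise ValueError(f"could not determine the NWB backend for {path!r}")

# usage: with NWBIO("sub-01.nwb") as io: nwbfile = io.read()
```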
2. Dataset configuration on write: Chunking, compression, filters, shuffling, etc. are specified slightly differently between HDF5/h5py and Zarr. You might consider creating a unifying API layer that translates into the different specs for each backend (@satra was advocating for this); however, there are enough differences in the capabilities and logic of the two approaches that this would not be straightforward. If we do create a unifying language, it would be difficult to avoid restricting access to the configuration capabilities specific to each backend. For example, I'd like to use WavPack, but that is currently only available with Zarr.
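For concreteness, here is the same chunked, compressed dataset written with each library directly (plain h5py and zarr calls, not the pynwb API), showing how the option spellings diverge:

```python
import h5py
import numcodecs
import numpy as np
import zarr

data = np.random.rand(100_000)

# HDF5: compression is named by string; shuffle is a separate boolean flag.
with h5py.File("example.h5", "w") as f:
    f.create_dataset("x", data=data, chunks=(10_000,),
                     compression="gzip", compression_opts=4, shuffle=True)

# Zarr: compression is a numcodecs codec object; shuffle is a codec parameter.
z = zarr.open("example.zarr", mode="w", shape=data.shape, dtype=data.dtype,
              chunks=(10_000,),
              compressor=numcodecs.Blosc(cname="zstd", clevel=5,
                                         shuffle=numcodecs.Blosc.SHUFFLE))
z[:] = data
```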
3. Dataset indexing when reading: When you read datasets from an NWB file, you get `h5py.Dataset` vs. `zarr.Array`. These classes act similarly to an `np.ndarray`, but there are enough differences that they will likely cause bugs for any analysis script that wants to work across backends. For example, certain fancy-indexing code works in Zarr but the analogous code in h5py does not.
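The original code snippets were lost from this page; the following is a plausible reconstruction of the kind of difference meant (unsorted fancy indices, which Zarr accepts since v2.7 but h5py rejects):

```python
import h5py
import numpy as np
import zarr

z = zarr.array(np.arange(10))
z[[5, 1, 3]]             # works: zarr supports NumPy-style fancy indexing

with h5py.File("tmp.h5", "w") as f:
    d = f.create_dataset("x", data=np.arange(10))
    d[[5, 1, 3]]         # TypeError: h5py requires increasing, unique indices
```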
There are many other subtle differences between the indexing of these different array-like classes. See the Zarr docs on fancy indexing.
In NWB Widgets, we are starting to run into bugs that are the result of these differences, e.g. NeurodataWithoutBorders/nwbwidgets#283. Without a proper response, I anticipate that these types of issues will accumulate across the cross-section of analysis tools and backends.
Is your feature request related to a problem?
No response
What solution would you like?
There are some pretty big trade-offs here, including homogenization, backwards compatibility, and evaluation of the complexity of each solution. I think it deserves some discussion.
Do you have any interest in helping implement the feature?
Yes.