Add ExtraDataFunctor for integration with pasha #299

philsmt · 2022-03-22T15:57:34Z

As it came up again in European-XFEL/pasha#14, here's a proof-of-concept on how to move the ExtraDataFunctor implementation to EXtra-data itself. This would allow it to take advantage of private APIs as well as test properly (TBD). So far it's almost exactly the same implementation only with unnecessary import checks removed.

As the code is only loaded conditionally, it does not add an actual dependency. The import of gen_split_slices could even be removed once SourceData has its own .split_trains() method.

Any ideas for the module name?

takluyver · 2022-03-22T16:16:27Z

Thanks. I think pasha_functor is fine for the module name - it shouldn't normally be visible in any case.

As we were just discussing, .split_trains() doesn't fully work for this, because the objects it produces don't know the index of their trains in the original object. I think it might be a useful enhancement for pasha to define a default splitting based on len(functor), so we don't need gen_split_slices everywhere. But we do also have basically the same code in read_machinery (as split_trains), so it's easy enough for this purpose.

takluyver · 2022-04-06T10:07:37Z

Do you want to add a test for the pasha integration, BTW?

philsmt · 2022-04-27T12:31:58Z

Yes. How would the tests work dependency-wise? Do we add pasha as an optional feature, or not all?

I do not much like the hack for iterating SourceData objects anymore. I've looked into supporting it properly in the class natively, but I think the proper solution for this is to move various find-data-and-chunks methods from DataCollectiontoSourceData, in particular since we now have SourceData` objects anyway at all times. So let's move this to a later release, and I will remove the support here.

philsmt · 2022-04-27T12:33:43Z

extra_data/pasha_functor.py

+        # different process than the functor was initially created in,
+        # close all file handles inherited from the parent collection to
+        # force re-opening them again in each child process.
+        if getpid() != self._parent_pid:


Do we still need this at all with the dependencies we require?

Probably not really.

It's a bit tricky to be definitive because the change is in HDF5, and h5py can be built with a wide range of HDF5 versions. EXtra-data supports h5py 2.10, and the pre-built packages of 2.10 on PyPI have HDF5 1.10.4. So I might bump the required version to >= 3.0, just to be a bit cautious. I hope very few people are still stuck on h5py 2.x.

takluyver · 2022-04-27T15:53:03Z

I would add pasha to the tests extra in setup.py (which already includes things like cloudpickle and dask).

Add ExtraDataFunctor for integration with pasha

68d1166

Use internal split_trains function

21d9ab7

takluyver added the enhancement New feature or request label Apr 6, 2022

philsmt commented Apr 27, 2022

View reviewed changes

Remove provisional SourceData iteration support in pasha functor

c8d37ef

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ExtraDataFunctor for integration with pasha #299

Add ExtraDataFunctor for integration with pasha #299

philsmt commented Mar 22, 2022

takluyver commented Mar 22, 2022

takluyver commented Apr 6, 2022

philsmt commented Apr 27, 2022

philsmt Apr 27, 2022

takluyver Apr 27, 2022

takluyver commented Apr 27, 2022

Add ExtraDataFunctor for integration with pasha #299

Are you sure you want to change the base?

Add ExtraDataFunctor for integration with pasha #299

Conversation

philsmt commented Mar 22, 2022

takluyver commented Mar 22, 2022

takluyver commented Apr 6, 2022

philsmt commented Apr 27, 2022

philsmt Apr 27, 2022

Choose a reason for hiding this comment

takluyver Apr 27, 2022

Choose a reason for hiding this comment

takluyver commented Apr 27, 2022