Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ExtraDataFunctor for integration with pasha #299

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

philsmt
Copy link
Contributor

@philsmt philsmt commented Mar 22, 2022

As it came up again in European-XFEL/pasha#14, here's a proof-of-concept on how to move the ExtraDataFunctor implementation to EXtra-data itself. This would allow it to take advantage of private APIs as well as test properly (TBD). So far it's almost exactly the same implementation only with unnecessary import checks removed.

As the code is only loaded conditionally, it does not add an actual dependency. The import of gen_split_slices could even be removed once SourceData has its own .split_trains() method.

Any ideas for the module name?

@takluyver
Copy link
Member

Thanks. I think pasha_functor is fine for the module name - it shouldn't normally be visible in any case.

As we were just discussing, .split_trains() doesn't fully work for this, because the objects it produces don't know the index of their trains in the original object. I think it might be a useful enhancement for pasha to define a default splitting based on len(functor), so we don't need gen_split_slices everywhere. But we do also have basically the same code in read_machinery (as split_trains), so it's easy enough for this purpose.

@takluyver takluyver added the enhancement New feature or request label Apr 6, 2022
@takluyver
Copy link
Member

Do you want to add a test for the pasha integration, BTW?

@philsmt
Copy link
Contributor Author

philsmt commented Apr 27, 2022

Yes. How would the tests work dependency-wise? Do we add pasha as an optional feature, or not all?

I do not much like the hack for iterating SourceData objects anymore. I've looked into supporting it properly in the class natively, but I think the proper solution for this is to move various find-data-and-chunks methods from DataCollectiontoSourceData, in particular since we now have SourceData` objects anyway at all times. So let's move this to a later release, and I will remove the support here.

# different process than the functor was initially created in,
# close all file handles inherited from the parent collection to
# force re-opening them again in each child process.
if getpid() != self._parent_pid:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this at all with the dependencies we require?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not really.

It's a bit tricky to be definitive because the change is in HDF5, and h5py can be built with a wide range of HDF5 versions. EXtra-data supports h5py 2.10, and the pre-built packages of 2.10 on PyPI have HDF5 1.10.4. So I might bump the required version to >= 3.0, just to be a bit cautious. I hope very few people are still stuck on h5py 2.x.

@takluyver
Copy link
Member

I would add pasha to the tests extra in setup.py (which already includes things like cloudpickle and dask).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants