
Cache entry point lookups #6124

Merged (18 commits) Oct 10, 2023
Conversation

danielhollas
Collaborator

In #6091 we have sped up aiida.orm import significantly by delaying the call to get_type_string_from_class. However, we did not solve the underlying issue -- the reason why this call is so expensive is because it essentially involves a quadruply-nested for loop.

Here I cache the results of importlib_metadata.entry_points.select() calls, which should get rid of the two inner loops coming from importlib_metadata code. This change shaves off around 25ms from verdi invocation, but should help throughout the code base. I am happy to do further benchmarking if needed (what should I benchmark?)

cc @sphuber

Member

@ltalirz ltalirz left a comment


Thanks a lot @danielhollas !

Introducing caches always increases complexity, but in this particular case I can't really come up with a scenario where this would hurt us (eps() is already being cached, so this should just be saving computation time).

So to me this looks fine.

tests/plugins/test_entry_point.py
def eps():
    return _eps()


@functools.lru_cache(maxsize=100)
Member


Interesting. So do I understand correctly that eps() is cached, nothing is being read from disk here, and processing the few entry points in memory is still taking 25 ms?

Can you please document this function to explain the quadruple loop problem here, and why this additional cache is necessary?

Collaborator Author


Yes, I'll write a comment. You can look at the cProfile info on the forum. https://aiida.discourse.group/t/why-is-aiidalab-base-widget-import-so-slow/32/16?u=danielhollas

Together with the fact that we were calling this during each Node class build (in a metaclass), this completely explained why the aiida.orm import was so slow. See #6091

Contributor

@sphuber sphuber left a comment


Like @ltalirz, I am a bit hesitant whether this could introduce some subtle bugs related to cache invalidation, but I couldn't come up with one for now. So I think we can add this. Just have one comment.

def eps():
    return _eps()


@functools.lru_cache(maxsize=100)
Contributor


Are we worried about the size of the cache? I think the number of different calls to eps_select should be reasonable, not exceeding the order of 100. So wouldn't we be better off with lru_cache(maxsize=None), i.e. an unbounded cache, which will be faster? Not sure how much faster it will be compared to an LRU cache of max 100 items, though; it might be negligible.

Collaborator Author


I am actually worried. I wouldn't be surprised if the number of calls was bigger than 100, especially if plugins are installed, since in some functions we're essentially iterating over all existing entry points when looking for the entry point for a given class. I'll take a closer look and do some more benchmarking.

Member

@ltalirz ltalirz Sep 22, 2023


Are we doing this iteration or is it importlib internally?

Are we trying to find an entry point for a class without knowing the name of the entry point?

Contributor


since in some functions we're essentially iterating over all existing entry points when looking for the entry point for a given class. I'll take a closer look and do some more benchmarking.

These don't call this eps_select function though, do they? The cache here simply applies to the number of combinations of arguments with which it is called. Since it just has group and name, it should just be the list of all (group, name) tuples with which the function is called. This should, reasonably, not be much larger than the entry points that exist.
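To illustrate the point with a toy stand-in (not the real eps_select): an lru_cache is keyed on the argument tuple, so the cache holds exactly one entry per distinct (group, name) combination with which the function is called.

```python
import functools

body_runs = []


@functools.lru_cache(maxsize=None)
def select(group, name=None):
    # The cache key is simply the (group, name) argument tuple.
    body_runs.append((group, name))
    return f'{group}:{name}'


select('aiida.calculations', 'core.arithmetic.add')
select('aiida.calculations', 'core.arithmetic.add')  # cache hit, body not re-run
select('aiida.workflows', 'core.arithmetic.add')     # new key, body runs again

assert len(body_runs) == 2
assert select.cache_info().hits == 1
```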

aiida/manage/configuration/config.py
tests/plugins/test_entry_point.py
@danielhollas
Collaborator Author

danielhollas commented Oct 2, 2023

I looked a bit more carefully at the implementation of the get_entry_point_from_class.
In this function, we essentially need to iterate over all existing entry points (in my environment, there are 360 of them!).
And we're doing this iteration very inefficiently, because we first grab the groups, and then filter entry points in each group. But this filtering is done using EntryPoints.select() which itself iterates over all the entry points.

Instead, we can simply iterate over all the entry points directly, see the last commit.

Here are some timings, using a non-existing class to measure the worst-case scenario:
(note that I have disabled cache for get_entry_point_from_class for these timings)

Main branch

In [1]: from aiida.plugins.entry_point import get_entry_point_from_class, eps
In [2]: p = eps() #warmup cache
In [3]: %timeit get_entry_point_from_class('invalid', 'invalid')
21.4 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Caching eps.select() on this branch

In [1]: from aiida.plugins.entry_point import get_entry_point_from_class, eps
In [2]: p = eps(); get_entry_point_from_class('invalid', 'invalid') # warmup caches
In [12]: %timeit  get_entry_point_from_class('invalid', 'invalid')
329 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Iterating over all entry points directly

In [1]: from aiida.plugins.entry_point import get_entry_point_from_class, eps
In [2]: p = eps()
In [5]: %timeit get_entry_point_from_class('invalid', 'invalid')
288 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So the last approach clearly wins, and perhaps we could even remove the cache on top of this function.

I'll do a bit more investigation to see if it still makes sense to cache eps.select() since it is called from other functions as well.

EDIT: The last commit does not work, need to investigate more...

@sphuber
Contributor

sphuber commented Oct 2, 2023

EDIT: The last commit does not work, need to investigate more..

The problem is that get_entry_points returns a list of EntryPoint whereas the importlib_metadata.entry_points function returns a list of strings. So you would have to load the entry point for each string in order to be able to get the module and class name. Not sure what the impact of having to load each entry point is going to be for the performance and whether the implementation will then still be faster than simply using get_entry_points.

@danielhollas
Collaborator Author

danielhollas commented Oct 2, 2023

@sphuber yes. The reason why this worked for me locally is that I had importlib_metadata==5.2.0 instead of importlib_metadata==4.13. So if we bumped the version this new approach would work. I am not sure if there is a reason why we are at 4.13?

EDIT: To clarify in 5.2.0 entry_points() returns the EntryPoints instance that has all the necessary info and can be directly iterated over.

EDIT: Updating to 5.2.0 would require at the very least updating the EntryPointManager pytest fixture.

@sphuber
Contributor

sphuber commented Oct 6, 2023

EDIT: Updating to 5.2.0 would require at the very least updating the EntryPointManager pytest fixture.

This should be straightforward. The eps has just changed from a nested mapping to a flat list, so instead of having add_entry_point treat it like a dict, we should simply append the entry point to the list. @danielhollas want me to resolve this so we can wrap up this PR?

@danielhollas
Collaborator Author

danielhollas commented Oct 6, 2023

@sphuber I should have clarified my hesitation here: I was not sure if you'd be okay with upgrading importlib_metadata, because I wasn't sure whether downstream packages might depend on this package as well. If that is not a concern, I'll go ahead and finish this today or over the weekend.

The eps has just changed from a nested mapping to a flat list, so instead of having add_entry_point treat it like a dict, we should simply append the entry point to the list. @danielhollas want me to resolve this so we can wrap up this PR?

Indeed, but I stumbled at first because EntryPoints is in fact a subclass of tuple, not a list, hence we'll probably need to recreate it instead of modifying it in place.

EDIT: I've checked aiida-quantumespresso and it depends on importlib_resources, not importlib_metadata (I always mix these two up). I don't know of a reason why other plugins should depend on importlib_metadata, since aiida-core provides useful abstractions over it.

@sphuber
Contributor

sphuber commented Oct 6, 2023

No, I think it is fine for us to update the minimum requirement of importlib_metadata. Please feel free to go ahead with this.

Contributor

@sphuber sphuber left a comment


Thanks @danielhollas. The implementation looks good; just some minor docstring requests and then we can merge.

aiida/manage/tests/pytest_fixtures.py
aiida/manage/tests/pytest_fixtures.py
aiida/plugins/entry_point.py
@ltalirz
Member

ltalirz commented Oct 9, 2023

I'll do a bit more investigation to see if it still makes sense to cache eps.select() since it is called from other functions as well.

Just checking whether this is then indeed still necessary or whether we can reduce the caching

@danielhollas
Collaborator Author

danielhollas commented Oct 10, 2023

Just checking whether this is then indeed still necessary or whether we can reduce the caching

So some timings: tl;dr: entry_points.select() is a fairly expensive operation and we should probably cache it since aiida uses entry points extensively:

# Without cache
In [7]: %timeit eps.select(group='invalid', name='invalid')
363 µs ± 4.13 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit eps.select(group='console_scripts', name='verdi')
376 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# with cache
%timeit eps_select(group='console_scripts', name='verdi')
201 ns ± 0.886 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

The get_entry_point_from_class now takes on the order of hundreds of microseconds, so at least an order of magnitude faster than before, but still three orders of magnitude slower than a cache hit, so removing that cache might have some unintended consequences in some cases.
We can nonetheless achieve a further speedup by pre-sorting the EntryPoints by group name so that aiida. groups come first. This can give large speedups in some cases.

# Best case scenario, first item in sorted `EntryPoints`
In [64]: %timeit get_entry_point_from_class_uncached('aiida.calculations.arithmetic.add', 'ArithmeticAddCalculation')
2.4 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# Last item in aiida groups in sorted `EntryPoints`
In [65]: %timeit get_entry_point_from_class_uncached('aiida.workflows.arithmetic.add_multiply', 'add_multiply')
83.4 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# Iterating through all entry points (243 in this case)
%timeit get_entry_point_from_class_uncached('invalid', 'invalid')
231 µs ± 416 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each) 

The pre-sorting itself takes ~30 µs, so it's worth it in most cases:

 %timeit EntryPoints(sorted(eps_unsorted, key=lambda x: x.group))
37.2 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Collaborator Author

@danielhollas danielhollas left a comment


@sphuber thank you for the review. I've modified the docstrings, and added the EntryPoints pre-sorting, as described above.

environment.yml
aiida/orm/autogroup.py
@sphuber sphuber merged commit 12cc930 into aiidateam:main Oct 10, 2023
32 checks passed
@danielhollas danielhollas deleted the cache-entry-lookups branch October 10, 2023 08:21
@ltalirz
Member

ltalirz commented Oct 10, 2023

Thanks @danielhollas!
