Cache entry point lookups #6124
Conversation
Thanks a lot @danielhollas! Introducing a cache always increases complexity, but in this particular case I can't really come up with a scenario where this would hurt us (`eps()` is already being cached, so this should just be saving computational time), so to me this looks fine.
```python
def eps():
    return _eps()
```

```python
@functools.lru_cache(maxsize=100)
```
Interesting. So do I understand correctly that `eps()` is cached, nothing is being read from disk here, and processing the few entry points in memory is still taking 25 ms?

Can you please document this function to explain the quadruple-loop problem here, and why this additional cache is necessary?
Yes, I'll write a comment. You can look at the cProfile info on the forum: https://aiida.discourse.group/t/why-is-aiidalab-base-widget-import-so-slow/32/16?u=danielhollas

Together with the fact that we were calling this during each Node class build (in a metaclass), this completely explained why the `aiida.orm` import was so slow. See #6091.
Like @ltalirz, I am a bit hesitant about whether this could introduce some subtle bugs related to cache invalidation, but I couldn't come up with one for now, so I think we can add this. I just have one comment.
Are we worried about the size of the cache? I think the number of different calls to `eps_select` should be reasonable, not exceeding the order of 100. So wouldn't we be better off with `lru_cache(maxsize=None)`, i.e. `functools.cache`, which will be faster? I'm not sure how much faster it will be compared to an LRU cache of max 100 items; it might be negligible.
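For illustration, the difference boils down to this (a toy sketch, not aiida code): `functools.cache` is exactly `lru_cache(maxsize=None)`, and the unbounded variant skips the recency bookkeeping that a bounded LRU cache performs on every hit:

```python
import functools


@functools.lru_cache(maxsize=100)
def bounded(x):
    return x * x  # placeholder for an expensive lookup


@functools.cache  # equivalent to lru_cache(maxsize=None), Python >= 3.9
def unbounded(x):
    return x * x


bounded(3); bounded(3)
unbounded(3); unbounded(3)

# Both report one hit, but only the bounded cache tracks recency/evictions.
assert bounded.cache_info().hits == 1 and bounded.cache_info().maxsize == 100
assert unbounded.cache_info().hits == 1 and unbounded.cache_info().maxsize is None
```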
I am actually worried. I wouldn't be surprised if the number of calls was bigger than 100, especially if plugins are installed, since in some functions we're essentially iterating over all existing entry points when looking for the entry point for a given class. I'll take a closer look and do some more benchmarking.
Are we doing this iteration ourselves, or is it importlib doing it internally? Are we trying to find an entry point for a class without knowing the name of the entry point?
> since in some functions we're essentially iterating over all existing entry points when looking for the entry point for a given class. I'll take a closer look and do some more benchmarking.

These don't call this `eps_select` function though, do they? The cache here simply applies to the number of combinations of arguments with which it is called. Since it just has `group` and `name`, it should just be the list of all `(group, name)` tuples with which the function is called. This should, reasonably, not be much larger than the number of entry points that exist.
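This bound can be checked directly: the number of possible cache keys is the number of distinct `(group, name)` pairs. A quick sketch against the stdlib `importlib.metadata` (walking per-distribution entry points for portability across Python versions), counting whatever happens to be installed:

```python
from importlib.metadata import distributions

# Each memoized call is keyed by its (group, name) arguments, so the cache
# can never hold more entries than there are distinct pairs to ask about.
pairs = {(ep.group, ep.name) for dist in distributions() for ep in dist.entry_points}
print(f'{len(pairs)} distinct (group, name) pairs in this environment')
```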
I looked a bit more carefully at the implementation of the …. Instead, we can simply iterate over all the entry points directly; see the last commit. Here are some timings, using a non-existing class to measure the worst-case scenario:

**Main branch**

```python
In [1]: from aiida.plugins.entry_point import get_entry_point_from_class, eps

In [2]: p = eps()  # warmup cache

In [3]: %timeit get_entry_point_from_class('invalid', 'invalid')
21.4 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

**Caching eps.select() on this branch**

```python
In [1]: from aiida.plugins.entry_point import get_entry_point_from_class, eps

In [2]: p = eps(); get_entry_point_from_class('invalid', 'invalid')  # warmup caches

In [12]: %timeit get_entry_point_from_class('invalid', 'invalid')
329 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

**Iterating over all entry points directly**

```python
In [1]: from aiida.plugins.entry_point import get_entry_point_from_class, eps

In [2]: p = eps()

In [5]: %timeit get_entry_point_from_class('invalid', 'invalid')
288 µs ± 1.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

So the last approach clearly wins, and perhaps we could even remove the cache on top of this function. I'll do a bit more investigation to see if it still makes sense to cache ….

EDIT: The last commit does not work, need to investigate more...
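The "iterate directly" variant can be sketched roughly like this (a hypothetical helper using the stdlib `importlib.metadata`; aiida's real implementation differs): one linear pass over all installed entry points, instead of one `select()` per group, each of which would internally re-scan the whole collection:

```python
from importlib.metadata import distributions


def find_entry_point_for_class(module_name, class_name):
    """Single linear scan over all installed entry points (hypothetical helper).

    An entry point's ``value`` is the ``'module:attr'`` string it points at,
    so a class can be located by matching that string directly.
    """
    target = f'{module_name}:{class_name}'
    for dist in distributions():
        for ep in dist.entry_points:
            if ep.value == target:
                return ep
    return None


# Worst case: a non-existing class forces a full scan and returns None.
assert find_entry_point_for_class('invalid', 'invalid') is None
```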
The problem is that ….
@sphuber yes. The reason why this worked for me locally is that I had ….

EDIT: To clarify, in 5.2.0 ….

EDIT: Updating ….
Branch updated: a47009a → 0705701 (Compare)
This should be straightforward. The ….
@sphuber So I should have clarified my hesitation here: I was not sure if you'd be okay with upgrading ….
Indeed, but I stumbled at first because ….

EDIT: I've checked ….
No, I think it is fine for us to update the minimum requirement of ….
Thanks @danielhollas. Implementation looks good; just some minor docstring requests and then we can merge.
Just checking whether this is then indeed still necessary, or whether we can reduce the caching.
So some timings. tl;dr: `entry_points.select()` is a fairly expensive operation and we should probably cache it, since aiida uses entry points extensively:

```python
# Without cache
In [7]: %timeit eps.select(group='invalid', name='invalid')
363 µs ± 4.13 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [8]: %timeit eps.select(group='console_scripts', name='verdi')
376 µs ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# With cache
%timeit eps_select(group='console_scripts', name='verdi')
201 ns ± 0.886 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

The …:

```python
# Best case scenario, first item in sorted `EntryPoints`
In [64]: %timeit get_entry_point_from_class_uncached('aiida.calculations.arithmetic.add', 'ArithmeticAddCalculation')
2.4 µs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# Last item in aiida groups in sorted `EntryPoints`
In [65]: %timeit get_entry_point_from_class_uncached('aiida.workflows.arithmetic.add_multiply', 'add_multiply')
83.4 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# Iterating through all entry points (243 in this case)
%timeit get_entry_point_from_class_uncached('invalid', 'invalid')
231 µs ± 416 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

The pre-sorting itself takes ~30 µs, so it's worth it in most cases:

```python
%timeit EntryPoints(sorted(eps_unsorted, key=lambda x: x.group))
37.2 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
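One way a collection pre-sorted by group can pay off (a hypothetical sketch with the stdlib `importlib.metadata`, not aiida's implementation): after sorting once, each group occupies a contiguous slice, so a group lookup becomes a binary search instead of a full scan per `select(group=...)` call:

```python
from bisect import bisect_left, bisect_right
from importlib.metadata import distributions

# Sort all installed entry points once, keyed by group.
eps_sorted = sorted(
    (ep for dist in distributions() for ep in dist.entry_points),
    key=lambda ep: ep.group,
)
group_keys = [ep.group for ep in eps_sorted]


def select_group(group):
    """Return the contiguous slice of entry points for one group via bisect."""
    lo, hi = bisect_left(group_keys, group), bisect_right(group_keys, group)
    return eps_sorted[lo:hi]


# A missing group yields an empty slice; a real group yields only its members.
assert select_group('no_such_group') == []
assert all(ep.group == 'console_scripts' for ep in select_group('console_scripts'))
```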
@sphuber thank you for the review. I've modified the docstrings and added the `EntryPoints` pre-sorting, as described above.
Thanks @danielhollas!
In #6091 we sped up the `aiida.orm` import significantly by delaying the call to `get_type_string_from_class`. However, we did not solve the underlying issue: the reason why this call is so expensive is that it essentially involves a quadruply-nested for loop.

Here I cache the results of `importlib_metadata.entry_points.select()` calls, which should get rid of the two inner loops coming from `importlib_metadata` code. This change shaves off around 25 ms from `verdi` invocation, but should help throughout the code base. I am happy to do further benchmarking if needed (what should I benchmark?)

cc @sphuber