More speedup via mtime-based caching. #274
Conversation
4e3e3ba to d55e743
Thanks for taking a stab at this. It seems to be proving tricky. I guess let me know if you think you have a solution for the failing tests, but also, I have some other concerns about the approach. It undoes a lot of the separation of concerns that was previously created and breaks some intentional protections around the API. It's a good start for some inspiration, though. May I recommend submitting a separate PR with an additional test in the benchmark environment that demonstrates the current performance?
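For illustration, a benchmark of repeated lookups could be sketched roughly like this with `timeit` (a minimal sketch, not the project's actual benchmark suite; here the stdlib `importlib.metadata` stands in for the package under review, and whichever distribution happens to be installed first is used as the lookup target):

```python
import timeit
from importlib.metadata import distribution, distributions

# Assumption: at least one package is installed in this environment.
# Pick any installed distribution name to benchmark against.
name = next(iter(distributions())).metadata["Name"]

# Time repeated lookups of that same distribution; with mtime-based
# caching, repeated calls should be much cheaper than the first one.
elapsed = timeit.timeit(lambda: distribution(name), number=100)
print(f"100 lookups of {name!r} took {elapsed:.3f}s")
```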
```
for child in self.children()
if name.matches(child, self.base)
)

def update_cache(self):
```
I'm not too happy that a lot of essential behavior ("matches", "is_egg", "prepared computations") has been inlined into an "update_cache" function.
That's because there's no more notion of "prepared computations" (because we just parse everything and store the normalized results into the cache). "Matches" is basically just a dict lookup now (in `search()`).

I guess at best I could have a `def is_info(self, low): return low.endswith(...)` and `def is_egg(self, low): return self.base_is_egg and low == "egg-info"`. I don't think adding these levels of indirection helps legibility, but that's up to you. I can't think of further factoring, though.
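For concreteness, helpers of the shape sketched above might look like the following (a hypothetical standalone sketch; the real code would make these methods on the path class, and the exact suffixes checked are assumptions here):

```python
# Hypothetical standalone versions of the helpers discussed above.
# "low" is assumed to be an already-lowercased directory entry name.
def is_info(low: str) -> bool:
    # A metadata directory name ends in ".dist-info" or ".egg-info".
    return low.endswith((".dist-info", ".egg-info"))

def is_egg(low: str, base_is_egg: bool) -> bool:
    # Inside a ".egg" directory, the metadata folder is named "egg-info".
    return base_is_egg and low == "egg-info"
```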
```
self.update_cache()
if prepared.name:
    infos = self.infos.get(prepared.normalized, [])
    yield from map(self.joinpath, infos)
```
I'm seeing a lot of repetition and branching. I'd want to see this unified.
I can write something like

```python
def search(self, prepared):
    self.update_cache()
    infos_and_eggs = itertools.chain(
        self.infos.get(prepared.normalized, []) if prepared.name else
        itertools.chain.from_iterable(self.infos.values()),
        self.eggs.get(prepared.legacy_normalized, []) if prepared.name else
        itertools.chain.from_iterable(self.eggs.values()))
    yield from map(self.joinpath, infos_and_eggs)
```

(or lift the `if prepared.name` out), but that seems less legible IMHO...
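The "lift the `if prepared.name` out" alternative could look roughly like this self-contained sketch (dummy dict caches and a free function stand in for the real attributes and method here — all names are illustrative):

```python
import itertools

# Stand-in caches mapping a normalized name to directory entries
# (the real code keeps these on the path object).
infos = {"pip": ["pip-23.0.dist-info"]}
eggs = {"pip": ["pip.egg-info"]}

def search(name=None):
    # Decide once whether we look up a known name or enumerate
    # everything, instead of repeating that branch per cache.
    if name is not None:
        matches = itertools.chain(infos.get(name, []), eggs.get(name, []))
    else:
        matches = itertools.chain(
            itertools.chain.from_iterable(infos.values()),
            itertools.chain.from_iterable(eggs.values()),
        )
    yield from matches
```

Whether the single up-front branch or the per-cache conditional reads better is exactly the judgment call being discussed.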
```
def __init__(self, name):
    self.name = name
    if name is None:
        return
    self.normalized = self.normalize(name)
    self.exact_matches = [
```
Where did the behavior for exact matches go?
`low.rpartition(".")[0].partition("-")[0]` handles both exact and inexact matches (for exact (versionless) matches, `.partition("-")[0]` will be a no-op).
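The claim is easy to check in isolation (the filenames below are illustrative examples, not taken from the test suite):

```python
def stem(low: str) -> str:
    # Strip the final ".<suffix>" (e.g. ".dist-info"/".egg-info" tail),
    # then drop any "-<version>" part; for versionless names the
    # partition on "-" finds nothing and is a no-op.
    return low.rpartition(".")[0].partition("-")[0]

print(stem("pip-23.0.dist-info"))  # versioned: "-23.0" is dropped
print(stem("pip.dist-info"))       # versionless: partition is a no-op
print(stem("pip.egg-info"))
```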
See #279 for the separate benchmark. I agree the implementation is a bit messy; I'll work on it...
bd737d3 to 9b450a8
I rebased this on top of #279, to show the more than 10x speedup in the cached lookup case. (In fact even the uncached case seems a bit (~10%) faster now, although I don't know if that's measurement noise.)

As a side point, one simplification that may be possible would be to fold normalization and legacy normalization (which is a subset of "plain" normalization) together. Previously, they had to be separate, because we took a queried name and had to normalize it to a form that would match the filename (after lowercasing), but now, both the filename and the queried name get normalized. The only bad side effect would be if e.g. there's both … Thoughts?
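To illustrate the "subset" relationship between the two schemes, here is a rough sketch of the two normalizations (an approximation for illustration, not the library's exact code): the plain scheme collapses runs of `-`, `_`, and `.`, while the legacy scheme only lowercases and folds `-` to `_`, so the two agree on simple names but diverge on dotted ones.

```python
import re

def normalize(name: str) -> str:
    # PEP 503-style normalization, with "-" then folded to "_" as in
    # metadata directory names (a sketch, not the exact library code).
    return re.sub(r"[-_.]+", "-", name).lower().replace("-", "_")

def legacy_normalize(name: str) -> str:
    # The older scheme only lowercases and folds "-" to "_".
    return name.lower().replace("-", "_")

# Simple names agree under both schemes...
print(normalize("Foo-Bar"), legacy_normalize("Foo-Bar"))
# ...but dotted names do not, which is where folding the two together
# could change behavior.
print(normalize("zope.interface"), legacy_normalize("zope.interface"))
```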
Caching based on mtime is similar to the one done on importlib's FileFinder. Locally, on a large-ish environment, this speeds up repeated calls to `distribution("pip")` ~10x.
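The invalidation idea borrowed from `FileFinder` can be reduced to a minimal sketch (this is an illustration of the technique, not the PR's actual code — the class and attribute names here are made up):

```python
import os

class CachedListing:
    """Cache a directory listing, invalidated whenever the directory's
    mtime changes -- the same idea importlib's FileFinder uses."""

    def __init__(self, path):
        self.path = path
        self._mtime = -1.0   # sentinel: forces a rebuild on first call
        self._entries = []

    def entries(self):
        try:
            mtime = os.stat(self.path).st_mtime
        except OSError:
            # Directory vanished or is unreadable: nothing to list.
            return []
        if mtime != self._mtime:
            # Directory changed (or first call): rebuild the cache.
            self._entries = os.listdir(self.path)
            self._mtime = mtime
        return self._entries
```

Repeated calls pay only for one `stat` instead of a full listing and parse, which is where the ~10x figure for repeated `distribution("pip")` calls comes from.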
Kindly bumping this. (I believe I've replied to all your comments.)

Thanks for the bump. I haven't had a chance to look at the changes, but I appreciate the thoughtful responses to the critique and I'll do my best to give this a fair review soon.
f7d5365 to f787075
In #290, I've added a couple of refactorings that largely address the misgivings I have about this change. Unfortunately, it doesn't seem to be having the desired effect (performance tests indicate a degradation in performance), so I need to investigate more.
Fix ResourceWarning due to unclosed file resource.