More speedup via mtime-based caching. #274

Merged: 2 commits into python:main on Mar 27, 2021

Conversation

@anntzer (Contributor) commented Jan 10, 2021

The caching based on mtime is similar to the one done by importlib's FileFinder.

Locally, on a large-ish environment, this speeds up repeated calls to
distribution("pip") ~10x.
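The invalidation idea, as a standalone sketch (class and attribute names here are illustrative only, not the PR's actual code): remember the directory's mtime next to the scan results and re-list the directory only when the mtime changes.

    import os

    class _MtimeCache:
        """Re-list a directory only when its mtime changes (the same
        invalidation idea used by importlib's FileFinder)."""

        def __init__(self, root):
            self.root = root
            self._mtime = None
            self._entries = []

        def entries(self):
            try:
                mtime = os.stat(self.root).st_mtime
            except OSError:
                # Missing or unreadable directory: nothing to cache.
                self._mtime = None
                return []
            if mtime != self._mtime:
                # First call, or the directory changed: re-scan and remember the mtime.
                self._entries = os.listdir(self.root)
                self._mtime = mtime
            return self._entries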

@anntzer force-pushed the fscache branch 4 times, most recently from 4e3e3ba to d55e743 (January 11, 2021)
@jaraco (Member) left a comment:

Thanks for taking a stab at this. It seems to be proving tricky. I guess let me know if you think you have a solution for the failing tests, but also, I have some other concerns about the approach. It undoes a lot of the separation of concerns that was previously created and breaks some intentional protections around the API. It's a good start for some inspiration, though. May I recommend submitting a separate PR with an additional test in the benchmark environment that demonstrates the current performance?
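For a rough idea of what such a benchmark measures, a quick ad-hoc timing (this is not the project's benchmark environment, just an illustrative snippet; it assumes pip is installed in the environment):

    import timeit
    import importlib_metadata

    # Warm up once, then time repeated lookups; repeated calls are exactly
    # the case the mtime cache is meant to speed up.
    importlib_metadata.distribution("pip")
    per_call = timeit.timeit(
        'importlib_metadata.distribution("pip")', globals=globals(), number=100
    ) / 100
    print(f"{per_call * 1e3:.2f} ms per distribution() lookup")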

        for child in self.children()
        if name.matches(child, self.base)
    )

    def update_cache(self):
@jaraco (Member):

I'm not too happy that a lot of essential behavior ("matches", "is_egg", "prepared computations") has been inlined into an "update_cache" function.

@anntzer (Contributor, Author):

That's because there's no more notion of "prepared computations" (because we just parse everything and store the normalized results into the cache). "Matches" is basically just a dict lookup now (in search()).

I guess at best I could have a `def is_info(self, low): return low.endswith(...)` and a `def is_egg(self, low): return self.base_is_egg and low == "egg-info"`. I don't think adding these levels of indirection helps legibility, but that's up to you.

I can't think of further factoring, though.

        self.update_cache()
        if prepared.name:
            infos = self.infos.get(prepared.normalized, [])
            yield from map(self.joinpath, infos)
@jaraco (Member):

I'm seeing a lot of repetition and branching. I'd want to see this unified.

@anntzer (Contributor, Author):

I can write something like

    def search(self, prepared):
        self.update_cache()
        infos_and_eggs = itertools.chain(
            self.infos.get(prepared.normalized, []) if prepared.name else
            itertools.chain.from_iterable(self.infos.values()),
            self.eggs.get(prepared.legacy_normalized, []) if prepared.name else
            itertools.chain.from_iterable(self.eggs.values()))
        yield from map(self.joinpath, infos_and_eggs)

(or lift the `if prepared.name` out), but that seems less legible IMHO...


    def __init__(self, name):
        self.name = name
        if name is None:
            return
        self.normalized = self.normalize(name)
        self.exact_matches = [
@jaraco (Member):

Where did the behavior for exact matches go?

@anntzer (Contributor, Author):

`low.rpartition(".")[0].partition("-")[0]` handles both exact and inexact matches (for exact (versionless) matches, `.partition("-")[0]` will be a no-op).
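For example (an illustration of the expression above, not code from the PR):

    for fname in ("pip.egg-info", "pip-21.0.dist-info", "pip-21.0.egg-info"):
        low = fname.lower()
        # Drop the suffix, then drop any "-<version>" tail; for a
        # versionless (exact) name the partition("-") step is a no-op.
        print(low.rpartition(".")[0].partition("-")[0])  # -> "pip" every time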

@anntzer (Contributor, Author) commented Jan 23, 2021

See #279 for the separate benchmark. I agree the implementation is a bit messy, I'll work on it...
At least I fixed the tests for now; I had a problem with invalidation.

@anntzer force-pushed the fscache branch 4 times, most recently from bd737d3 to 9b450a8 (January 24, 2021)
@anntzer (Contributor, Author) commented Jan 24, 2021

I rebased this on top of #279, to show the more than 10x speedup in the cached lookup case. (In fact even the uncached case seems a bit (~10%) faster now, although I don't know if that's measurement noise.)
The failure on 3.6/windows is spurious, arising from a spotty download...


As a side point, one simplification that may be possible would be to fold normalization and legacy_normalization (which is a subset of "plain" normalization) together. Previously, they had to be separate, because we took a queried name and had to normalize it to a form that would match the filename (after lowercasing); but now, both the filename and the queried name get normalized. The only bad side effect would be if e.g. there are both foo.bar.egg and foo_bar.egg; these would now be considered equivalent (again, this can only occur with eggs, not with {dist,egg}-infos, which already normalize them together).
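To illustrate the collision this folding would introduce (the normalization functions here are simplified stand-ins, not the library's exact implementations):

    import re

    def normalize(name):
        # "Plain" normalization: runs of "-", "_", "." all collapse together.
        return re.sub(r"[-_.]+", "_", name).lower()

    def legacy_normalize(name):
        # Legacy (egg) normalization: "." stays significant.
        return re.sub(r"[-_]+", "_", name).lower()

    # Folding the two together would make a foo.bar egg and a foo_bar egg collide:
    print(normalize("foo.bar") == normalize("foo_bar"))                 # True
    print(legacy_normalize("foo.bar") == legacy_normalize("foo_bar"))   # False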

Thoughts?

@anntzer (Contributor, Author) commented Feb 21, 2021

Kindly bumping this. (I believe I've replied to all your comments.)
The CI benchmark still shows a >10x speedup for cached lookups vs uncached lookups.

@jaraco (Member) commented Feb 23, 2021

Thanks for the bump. I haven't had a chance to look at the changes, but I appreciate the thoughtful responses to the critique and I'll do my best to give this a fair review soon.

@jaraco (Member) commented Mar 8, 2021

In #290, I've added a couple of refactorings that largely address the misgivings I have about this change. Unfortunately, it doesn't seem to be having the desired effect (the performance tests indicate a regression), so I need to investigate more.

@jaraco merged commit f787075 into python:main on Mar 27, 2021
@anntzer deleted the fscache branch on March 27, 2021
jaraco added a commit that referenced this pull request on Dec 21, 2023:
Fix ResourceWarning due to unclosed file resource.