More speedup via mtime-based caching. #274

Merged: 2 commits into python:main on Mar 27, 2021

Conversation

@anntzer (Contributor) commented Jan 10, 2021

The caching based on mtime is similar to the one done by importlib's FileFinder.

Locally, on a large-ish environment, this speeds up repeated calls to
distribution("pip") ~10x.
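The invalidation idea, as a standalone sketch (class and attribute names here are illustrative only, not the PR's actual code): remember the directory's mtime next to the scan results and re-list the directory only when the mtime changes.

    import os

    class _MtimeCache:
        """Re-list a directory only when its mtime changes (the same
        invalidation idea used by importlib's FileFinder)."""

        def __init__(self, root):
            self.root = root
            self._mtime = None
            self._entries = []

        def entries(self):
            try:
                mtime = os.stat(self.root).st_mtime
            except OSError:
                # Missing or unreadable directory: nothing to cache.
                self._mtime = None
                return []
            if mtime != self._mtime:
                # First call, or the directory changed: re-scan and remember the mtime.
                self._entries = os.listdir(self.root)
                self._mtime = mtime
            return self._entries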

@anntzer force-pushed the fscache branch 4 times, most recently from 4e3e3ba to d55e743 (January 11, 2021)
@jaraco (Member) left a comment:

Thanks for taking a stab at this. It seems to be proving tricky. I guess let me know if you think you have a solution for the failing tests, but also, I have some other concerns about the approach. It undoes a lot of the separation of concerns that was previously created and breaks some intentional protections around the API. It's a good start for some inspiration, though. May I recommend submitting a separate PR with an additional test in the benchmark environment that demonstrates the current performance?
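For a rough idea of what such a benchmark measures, a quick ad-hoc timing (this is not the project's benchmark environment, just an illustrative snippet; it assumes pip is installed in the environment):

    import timeit
    import importlib_metadata

    # Warm up once, then time repeated lookups; repeated calls are exactly
    # the case the mtime cache is meant to speed up.
    importlib_metadata.distribution("pip")
    per_call = timeit.timeit(
        'importlib_metadata.distribution("pip")', globals=globals(), number=100
    ) / 100
    print(f"{per_call * 1e3:.2f} ms per distribution() lookup")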

        for child in self.children()
        if name.matches(child, self.base)
    )

    def update_cache(self):
@jaraco (Member):

I'm not too happy that a lot of essential behavior ("matches", "is_egg", "prepared computations") has been inlined into an "update_cache" function.

@anntzer (Contributor, Author):

That's because there's no more notion of "prepared computations" (because we just parse everything and store the normalized results into the cache). "Matches" is basically just a dict lookup now (in search()).

I guess at best I could have a `def is_info(self, low): return low.endswith(...)` and a `def is_egg(self, low): return self.base_is_egg and low == "egg-info"`. I don't think adding these levels of indirection helps legibility, but that's up to you.

I can't think of further factoring, though.

        self.update_cache()
        if prepared.name:
            infos = self.infos.get(prepared.normalized, [])
            yield from map(self.joinpath, infos)
@jaraco (Member):

I'm seeing a lot of repetition and branching. I'd want to see this unified.

@anntzer (Contributor, Author):

I can write something like

    def search(self, prepared):
        self.update_cache()
        infos_and_eggs = itertools.chain(
            self.infos.get(prepared.normalized, []) if prepared.name else
            itertools.chain.from_iterable(self.infos.values()),
            self.eggs.get(prepared.legacy_normalized, []) if prepared.name else
            itertools.chain.from_iterable(self.eggs.values()))
        yield from map(self.joinpath, infos_and_eggs)

(or lift the `if prepared.name` out), but that seems less legible IMHO...


    def __init__(self, name):
        self.name = name
        if name is None:
            return
        self.normalized = self.normalize(name)
        self.exact_matches = [
@jaraco (Member):

Where did the behavior for exact matches go?

@anntzer (Contributor, Author):

`low.rpartition(".")[0].partition("-")[0]` handles both exact and inexact matches (for exact (versionless) matches, `.partition("-")[0]` will be a no-op).
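For example (an illustration of the expression above, not code from the PR):

    for fname in ("pip.egg-info", "pip-21.0.dist-info", "pip-21.0.egg-info"):
        low = fname.lower()
        # Drop the suffix, then drop any "-<version>" tail; for a
        # versionless (exact) name the partition("-") step is a no-op.
        print(low.rpartition(".")[0].partition("-")[0])  # -> "pip" every time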

@anntzer (Contributor, Author) commented Jan 23, 2021

See #279 for the separate benchmark. I agree the implementation is a bit messy, I'll work on it...
At least I fixed the tests for now; I had a problem with invalidation.

@anntzer force-pushed the fscache branch 4 times, most recently from bd737d3 to 9b450a8 (January 24, 2021)
@anntzer (Contributor, Author) commented Jan 24, 2021

I rebased this on top of #279, to show the more than 10x speedup in the cached lookup case. (In fact even the uncached case seems a bit (~10%) faster now, although I don't know if that's measurement noise.)
The failure on 3.6/windows is spurious, arising from a spotty download...


As a side point, one simplification that may be possible would be to fold normalization and legacy_normalization (which is a subset of "plain" normalization) together. Previously, they had to be separate, because we took a queried name and had to normalize it to a form that would match the filename (after lowercasing); but now, both the filename and the queried name get normalized. The only bad side effect would be if e.g. there are both foo.bar.egg and foo_bar.egg; these would now be considered equivalent (again, this can only occur with eggs, not with {dist,egg}-infos, which already normalize them together).
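To illustrate the collision this folding would introduce (the normalization functions here are simplified stand-ins, not the library's exact implementations):

    import re

    def normalize(name):
        # "Plain" normalization: runs of "-", "_", "." all collapse together.
        return re.sub(r"[-_.]+", "_", name).lower()

    def legacy_normalize(name):
        # Legacy (egg) normalization: "." stays significant.
        return re.sub(r"[-_]+", "_", name).lower()

    # Folding the two together would make a foo.bar egg and a foo_bar egg collide:
    print(normalize("foo.bar") == normalize("foo_bar"))                 # True
    print(legacy_normalize("foo.bar") == legacy_normalize("foo_bar"))   # False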

Thoughts?

@anntzer (Contributor, Author) commented Feb 21, 2021

Kindly bumping this. (I believe I've replied to all your comments.)
The CI benchmark still shows a >10x speedup for cached lookups vs uncached lookups.

@jaraco (Member) commented Feb 23, 2021

Thanks for the bump. I haven't had a chance to look at the changes, but I appreciate the thoughtful responses to the critique and I'll do my best to give this a fair review soon.

@jaraco (Member) commented Mar 8, 2021

In #290, I've added a couple of refactorings that largely address the misgivings I have about this change. Unfortunately, it doesn't seem to be having the desired effect (the performance tests indicate a regression), so I need to investigate more.

@jaraco merged commit f787075 into python:main on Mar 27, 2021
@anntzer deleted the fscache branch on March 27, 2021
jaraco added a commit that referenced this pull request on Dec 21, 2023:
Fix ResourceWarning due to unclosed file resource.