
Feature: Persistent caching of inference results #1145

Open
joshua-cannon-techlabs opened this issue Aug 23, 2021 · 8 comments
Labels
Enhancement ✨ Improvement to a component High effort 🏋 Difficult solution or problem to solve

Comments

@joshua-cannon-techlabs

Steps to reproduce

For fairly large repositories, using inference can be incredibly slow. Running pylint with a rule which leverages inference can take multiple seconds, while running a rule without inference is almost immediate.

From profiling, most of the time is spent traversing the file's dependencies and parsing them as well. This information is cached (see astroid.manager.AstroidManager.brain); however, the cache is in-memory and therefore doesn't persist between runs. This is likely because astroid classes can't be easily serialized (pickling them is unfortunately not a drop-in solution).

Ideally, there would be an option to enable persistent caching that is file-timestamp aware. If a particular module hasn't been changed, astroid could pull from the cache (exactly like Python's caching of bytecode).
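For illustration, here's a minimal sketch of the kind of mtime-keyed disk cache I mean. The cache directory and the pickled payload are assumptions (not existing astroid options), and the hard, unsolved part is producing a serializable payload from astroid's node classes in the first place:

```python
# Hypothetical sketch of a timestamp-aware on-disk cache, analogous to
# CPython's .pyc handling. The cache location and payload format are
# assumptions; astroid nodes are not currently picklable, which is the
# crux of this feature request.
import os
import pickle
from pathlib import Path

CACHE_DIR = Path(".astroid_cache")  # assumed location, not a real astroid setting


def _cache_file(source_path: str) -> Path:
    return CACHE_DIR / (source_path.replace(os.sep, "_") + ".cache")


def load_cached(source_path: str):
    """Return the cached payload if the source file is unchanged, else None."""
    cache_file = _cache_file(source_path)
    if not cache_file.exists():
        return None
    cached_mtime, payload = pickle.loads(cache_file.read_bytes())
    if cached_mtime != os.stat(source_path).st_mtime_ns:
        return None  # source changed since it was cached
    return payload


def store_cached(source_path: str, payload) -> None:
    """Persist (mtime, payload) so later runs can skip redoing the work."""
    CACHE_DIR.mkdir(exist_ok=True)
    _cache_file(source_path).write_bytes(
        pickle.dumps((os.stat(source_path).st_mtime_ns, payload))
    )
```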

Of course, a lot of thought has to be put into caching, but the benefits would be on the order of dev-years if not dev-decades IMO.

Current behavior

  • Run pylint on a single file in a large repo with assert-on-tuple as the only rule: Maybe ~0.5s on a bad day
  • Run pylint on the same file with abstract-class-instantiated as the only rule: Roughly 7.4s
  • Run pylint twice in the same python process with abstract-class-instantiated as the only rule: Also roughly 7.4s

Expected behavior

Running pylint multiple times on the same file ideally doesn't take the same amount of time each time.

python -c "from astroid import __pkginfo__; print(__pkginfo__.version)" output

2.4.2

@joshua-cannon-techlabs
Author

FWIW, I'm fully aware that this is no easy undertaking. However, it should at least be documented as a feature request and discussed 😄

@Pierre-Sassoulas Pierre-Sassoulas added Enhancement ✨ Improvement to a component High effort 🏋 Difficult solution or problem to solve labels Aug 23, 2021
@joshua-cannon-techlabs
Author

joshua-cannon-techlabs commented Aug 23, 2021

I'll also add that, AFAICT, mypy runs quickly on the same codebase. I assume, since it has a cache-dir flag, that it is caching post-parsed information.

@Pierre-Sassoulas
Member

Thank you for opening the issue, I agree with everything you said. I would add that making astroid's performance better would save gigawatts of electricity. What I'm saying is that AWS should sponsor this issue 😄

@joshua-cannon-techlabs
Author

In the meantime, I think I'll play around with running pylint as part of a daemon to get the performance improvement on dev boxes (since it'll keep the manager's brain in-memory). Then, in the daemon process, evict modules from the brain when files change.
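A rough sketch of that eviction idea, assuming astroid's internal `MANAGER.astroid_cache` dict (an implementation detail whose name may differ between versions) and a simple mtime poll instead of a real file watcher:

```python
# Illustrative only: evict modules from astroid's in-memory cache when their
# source file changes, so a long-lived daemon doesn't serve stale trees.
# astroid.MANAGER.astroid_cache is an astroid internal and may change
# between versions.
import os
import time

import astroid

_mtimes = {}  # source path -> mtime seen when the module was last trusted


def evict_stale_modules() -> None:
    for modname, module in list(astroid.MANAGER.astroid_cache.items()):
        path = getattr(module, "file", None)
        if not path or not os.path.exists(path):
            continue
        mtime = os.stat(path).st_mtime_ns
        if _mtimes.setdefault(path, mtime) != mtime:
            astroid.MANAGER.astroid_cache.pop(modname, None)
            _mtimes[path] = mtime


if __name__ == "__main__":
    # In a real daemon this would run between lint requests, or be driven
    # by a file-watching library instead of a polling loop.
    while True:
        evict_stale_modules()
        time.sleep(1.0)
```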

@thejcannon
Contributor

I've been playing around with this in my free time and am getting close to a really good proof-of-concept/MVP.

Hopefully by next weekend I'll have a PR to show!

@cdce8p
Member

cdce8p commented Nov 6, 2021

Cross-posting it here to preserve it for later discussions.

--
I'm not convinced we'll see any noticeable improvement with this approach. As far as I'm aware, most of the time is spent during the actual inference. Parsing the AST, by contrast, is fairly quick; so much so that I would bet we won't see measurable changes when constructing it from a cache instead.

A more useful approach IMO would be to try to cache the inference results for each file, although it still needs to be investigated to what extent that is possible.

Originally posted by @cdce8p in #1194 (review)
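To make that concrete, here is a sketch of what per-file inference caching might look like, keyed on path and mtime. Since astroid nodes don't pickle, only a serializable summary (here, qualified names of inferred values) is stored; whether such a summary is rich enough for pylint's checkers is exactly the open question. The cache directory is an assumption, not an existing option.

```python
# Sketch only: cache a serializable summary of inference results per file.
import hashlib
import os
import pickle
from pathlib import Path

import astroid
from astroid import nodes

CACHE_DIR = Path(".astroid_inference_cache")  # assumed, not a real astroid option


def infer_calls(path: str) -> dict:
    """Map each call expression in `path` to the qualified names it infers to."""
    key = f"{path}:{os.stat(path).st_mtime_ns}".encode()
    cache_file = CACHE_DIR / (hashlib.sha1(key).hexdigest() + ".pkl")
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())

    module = astroid.MANAGER.ast_from_file(path)
    results = {}
    for call in module.nodes_of_class(nodes.Call):
        inferred = []
        try:
            for value in call.func.infer():
                if value is not astroid.Uninferable:
                    # Prefer a qualified name; fall back to the source text.
                    inferred.append(getattr(value, "qname", value.as_string)())
        except astroid.InferenceError:
            pass
        results[call.as_string()] = inferred

    CACHE_DIR.mkdir(exist_ok=True)
    cache_file.write_bytes(pickle.dumps(results))
    return results
```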

@thejcannon
Contributor

(Perhaps the title should be edited to reflect the actual work needing done)

Cross-posting my reply:

I agree that caching the AST parsing by itself wouldn't be sufficient (which is why there isn't any code that actually persists the AST to disk yet). I assumed this would be the first step that would eventually lead to inference caching (since the inference results would need to be persistable).

@cdce8p cdce8p changed the title Feature: Persistent caching of modules Feature: Persistent caching of inference results Nov 6, 2021
@doublethefish

doublethefish commented Nov 22, 2021

@joshua-cannon-techlabs said:

From profiling, most of the time taken is traversing the file's dependencies and parsing them as well.

@cdce8p said:

As far as I'm aware most time is spend during the actual inference. Parsing the ast tree by contrast is fairly quick.

This sounds like we need some quantification, which tox -e profile_against_external or tox -e benchmark in pylint should give us. Here is a profile graph from running pylint against numpy: profile-numpy-2021-11-22.zip

My experience (when introducing pylint-dev/pylint#3498 and #3473) was more in line with @joshua-cannon-techlabs's, in that there's a lot of time spent in upfront file-work, which seems to be multiplied by the number of jobs being run. This suggests that under -j N astroid does the same work N times instead of just once. I'm running tox -e profile_against_external and will confirm. Update: the SVG from the numpy profile is attached; I've not yet looked at it myself. profile-numpy-2021-11-22.zip

So, the implication is that a potential fast win is to generate the cache/AST on the main thread and distribute that (as explicitly read-only) to the sub-processes. An even faster win might be to move the multiprocessing to the astroid loop in check_files(), but that's probably out of scope for now.
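One way to approximate that "build once, share read-only" idea without solving serialization would be fork-based copy-on-write sharing: prime the astroid cache in the parent, then fork the workers so they inherit it. A sketch under that assumption (this is not how pylint's -j mode currently works, and the file list is a placeholder):

```python
# Sketch of priming astroid's cache once and relying on fork()'s
# copy-on-write semantics so workers don't repeat the parsing work.
# Requires the "fork" start method (Linux/macOS); not a drop-in change
# to pylint's existing -j implementation.
import multiprocessing as mp

import astroid


def prime_cache(paths):
    """Parse every file once in the parent so the brain is already warm."""
    for path in paths:
        astroid.MANAGER.ast_from_file(path)


def check_one(path):
    # Forked workers see the modules already parsed by the parent in
    # astroid.MANAGER.astroid_cache, so this lookup should be cheap.
    module = astroid.MANAGER.ast_from_file(path)
    return path, len(module.body)


if __name__ == "__main__":
    files = ["a.py", "b.py", "c.py"]  # placeholder file list
    prime_cache(files)
    ctx = mp.get_context("fork")  # fork is what makes the sharing work
    with ctx.Pool(processes=4) as pool:
        print(pool.map(check_one, files))
```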

That said, I very much agree with @cdce8p that a simple cache isn't the be-all and end-all. The cache has to match the use-case, which in this case would be both of the following:

  1. how do we reduce the amount of inference work [re]done?
  2. how do we get that inference work done faster?

My AST know-how is rusty, but IIRC you should be able to generate a dependency tree that can be dirtied based on changes to the tree. A partial solution to (1) is then to make the work to be done contingent on that dirty state; a rough sketch is below. The majority of checkers could adhere to that dirty state, but some wouldn't, I imagine (edge cases etc.).

You should then also get, for free, the ability to better multi-thread tasks that operate on that tree, which partially solves (2).
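A toy version of that dirty-tracking dependency tree, using mtimes for change detection (real import resolution and persisting the graph between runs are left out):

```python
# Illustrative only: a module is considered dirty (needs re-checking) if it,
# or anything it transitively imports, changed on disk since the last run.
import os
from collections import defaultdict


class DependencyGraph:
    def __init__(self):
        self.imports = defaultdict(set)  # module path -> paths it imports
        self.last_seen = {}              # module path -> mtime at last clean check

    def add_edge(self, importer: str, imported: str) -> None:
        self.imports[importer].add(imported)

    def _changed(self, path: str) -> bool:
        return self.last_seen.get(path) != os.stat(path).st_mtime_ns

    def is_dirty(self, path: str, _visiting=None) -> bool:
        """Dirty if the file itself or any transitive dependency changed."""
        _visiting = _visiting if _visiting is not None else set()
        if path in _visiting:  # guard against import cycles
            return False
        _visiting.add(path)
        if self._changed(path):
            return True
        return any(self.is_dirty(dep, _visiting) for dep in self.imports[path])

    def mark_clean(self, path: str) -> None:
        self.last_seen[path] = os.stat(path).st_mtime_ns
```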

Considering similar-checker, we could/should use the cache directly rather than create yet another copy of the data in memory.
