
Feature: Persistent caching of inference results #1145

Open
joshua-cannon-techlabs opened this issue Aug 23, 2021 · 8 comments
Labels
Enhancement ✨ Improvement to a component High effort 🏋 Difficult solution or problem to solve

Comments

@joshua-cannon-techlabs

Steps to reproduce

For fairly large repositories, using inference can be incredibly slow. Running pylint with a rule which leverages inference can take multiple seconds, while running a rule without inference is almost immediate.

From profiling, most of the time is spent traversing the file's dependencies and parsing them as well. This information is cached (see astroid.manager.AstroidManager.brain); however, the cache is in-memory and therefore doesn't persist between runs. This is likely because astroid classes can't be easily serialized (pickling them is unfortunately not a drop-in solution).

Ideally, there would be an option to enable persistent caching that is file-timestamp aware. If a particular module hasn't been changed, astroid could pull from the cache (exactly like Python's caching of bytecode).
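For illustration, here's a minimal sketch of the kind of mtime-keyed disk cache I mean. The cache directory and the pickled payload are assumptions (not existing astroid options), and the hard, unsolved part is producing a serializable payload from astroid's node classes in the first place:

```python
# Hypothetical sketch of a timestamp-aware on-disk cache, analogous to
# CPython's .pyc handling. The cache location and payload format are
# assumptions; astroid nodes are not currently picklable, which is the
# crux of this feature request.
import os
import pickle
from pathlib import Path

CACHE_DIR = Path(".astroid_cache")  # assumed location, not a real astroid setting


def _cache_file(source_path: str) -> Path:
    return CACHE_DIR / (source_path.replace(os.sep, "_") + ".cache")


def load_cached(source_path: str):
    """Return the cached payload if the source file is unchanged, else None."""
    cache_file = _cache_file(source_path)
    if not cache_file.exists():
        return None
    cached_mtime, payload = pickle.loads(cache_file.read_bytes())
    if cached_mtime != os.stat(source_path).st_mtime_ns:
        return None  # source changed since it was cached
    return payload


def store_cached(source_path: str, payload) -> None:
    """Persist (mtime, payload) so later runs can skip redoing the work."""
    CACHE_DIR.mkdir(exist_ok=True)
    _cache_file(source_path).write_bytes(
        pickle.dumps((os.stat(source_path).st_mtime_ns, payload))
    )
```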

Of course, a lot of thought has to be put into caching, but the benefits would be on the order of dev-years if not dev-decades IMO.

Current behavior

  • Run pylint on a single file in a large repo with assert-on-tuple as the only rule: Maybe ~0.5s on a bad day
  • Run pylint on the same file with abstract-class-instantiated as the only rule: Roughly 7.4s
  • Run pylint twice in the same python process with abstract-class-instantiated as the only rule: Also roughly 7.4s

Expected behavior

Running pylint multiple times on the same file ideally doesn't take the same amount of time each time.

python -c "from astroid import __pkginfo__; print(__pkginfo__.version)" output

2.4.2

@joshua-cannon-techlabs
Author

FWIW, I'm fully aware that this is no easy undertaking. However, it should at least be documented as a feature request and discussed 😄

@Pierre-Sassoulas Pierre-Sassoulas added Enhancement ✨ Improvement to a component High effort 🏋 Difficult solution or problem to solve labels Aug 23, 2021
@joshua-cannon-techlabs
Author

joshua-cannon-techlabs commented Aug 23, 2021

I'll also add that, AFAICT, mypy runs quickly on the same codebase. I assume, since it has a cache-dir flag, that it is caching post-parsed information.

@Pierre-Sassoulas
Member

Thank you for opening the issue, I agree with everything you said. I would add that making astroid's performance better would save gigawatts of electricity. What I'm saying is that AWS should sponsor this issue 😄

@joshua-cannon-techlabs
Author

In the meantime, I think I'll play around with running pylint as part of a daemon to get the performance improvement on dev boxes (since it'll keep the manager's brain in-memory). Then, in the daemon process, evict modules from the brain when files change.
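A rough sketch of that eviction idea, assuming astroid's internal `MANAGER.astroid_cache` dict (an implementation detail whose name may differ between versions) and a simple mtime poll instead of a real file watcher:

```python
# Illustrative only: evict modules from astroid's in-memory cache when their
# source file changes, so a long-lived daemon doesn't serve stale trees.
# astroid.MANAGER.astroid_cache is an astroid internal and may change
# between versions.
import os
import time

import astroid

_mtimes = {}  # source path -> mtime seen when the module was last trusted


def evict_stale_modules() -> None:
    for modname, module in list(astroid.MANAGER.astroid_cache.items()):
        path = getattr(module, "file", None)
        if not path or not os.path.exists(path):
            continue
        mtime = os.stat(path).st_mtime_ns
        if _mtimes.setdefault(path, mtime) != mtime:
            astroid.MANAGER.astroid_cache.pop(modname, None)
            _mtimes[path] = mtime


if __name__ == "__main__":
    # In a real daemon this would run between lint requests, or be driven
    # by a file-watching library instead of a polling loop.
    while True:
        evict_stale_modules()
        time.sleep(1.0)
```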

@thejcannon
Contributor

I've been playing around with this in my free time and am getting close to a really good proof-of-concept/MVP.

Hopefully by next weekend I'll have a PR to show!

@cdce8p
Member

cdce8p commented Nov 6, 2021

Cross-posting it here to preserve it for later discussions.

--
I'm not convinced we'll see any noticeable improvement with this approach. As far as I'm aware, most of the time is spent during the actual inference. Parsing the AST, by contrast, is fairly quick; so much so that I would bet we won't see measurable changes when constructing it from a cache instead.

A more useful approach IMO would be to try to cache the inference results for each file, although it still needs to be investigated to what extent that is possible.

Originally posted by @cdce8p in #1194 (review)
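To make that concrete, here is a sketch of what per-file inference caching might look like, keyed on path and mtime. Since astroid nodes don't pickle, only a serializable summary (here, qualified names of inferred values) is stored; whether such a summary is rich enough for pylint's checkers is exactly the open question. The cache directory is an assumption, not an existing option.

```python
# Sketch only: cache a serializable summary of inference results per file.
import hashlib
import os
import pickle
from pathlib import Path

import astroid
from astroid import nodes

CACHE_DIR = Path(".astroid_inference_cache")  # assumed, not a real astroid option


def infer_calls(path: str) -> dict:
    """Map each call expression in `path` to the qualified names it infers to."""
    key = f"{path}:{os.stat(path).st_mtime_ns}".encode()
    cache_file = CACHE_DIR / (hashlib.sha1(key).hexdigest() + ".pkl")
    if cache_file.exists():
        return pickle.loads(cache_file.read_bytes())

    module = astroid.MANAGER.ast_from_file(path)
    results = {}
    for call in module.nodes_of_class(nodes.Call):
        inferred = []
        try:
            for value in call.func.infer():
                if value is not astroid.Uninferable:
                    # Prefer a qualified name; fall back to the source text.
                    inferred.append(getattr(value, "qname", value.as_string)())
        except astroid.InferenceError:
            pass
        results[call.as_string()] = inferred

    CACHE_DIR.mkdir(exist_ok=True)
    cache_file.write_bytes(pickle.dumps(results))
    return results
```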

@thejcannon
Contributor

(Perhaps the title should be edited to reflect the actual work needing done)

Cross-posting my reply:

I agree that caching the AST parsing by itself wouldn't be sufficient (which is why there isn't any code that actually persists the AST to disk yet). I assumed this would be the first step that would eventually lead to inference caching (since the inference results would need to be persistable).

@cdce8p cdce8p changed the title Feature: Persistent caching of modules Feature: Persistent caching of inference results Nov 6, 2021
@doublethefish

doublethefish commented Nov 22, 2021

@joshua-cannon-techlabs said:

From profiling, most of the time taken is traversing the file's dependencies and parsing them as well.

@cdce8p said:

As far as I'm aware most time is spend during the actual inference. Parsing the ast tree by contrast is fairly quick.

This sounds like we need some quantification, which tox -e profile_against_external or tox -e benchmark in pylint should give us. Here is a profile graph from running pylint against numpy: profile-numpy-2021-11-22.zip

My experience (when introducing pylint-dev/pylint#3498 and #3473) was more in line with @joshua-cannon-techlabs's, in that there's a lot of time spent in upfront file-work, which seems to be multiplied by the number of jobs being run. This suggests that under -j N astroid does the same work N times instead of just once. I'm running tox -e profile_against_external and will confirm. Update: the SVG from the numpy profile is attached; I've not yet looked at it myself. profile-numpy-2021-11-22.zip

So, the implication is that a potential fast win is to generate the cache/AST on the main thread and distribute that (as explicitly read-only) to the sub-processes. An even faster win might be to move the multiprocessing to the astroid loop in check_files(), but that's probably out of scope for now.
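One way to approximate that "build once, share read-only" idea without solving serialization would be fork-based copy-on-write sharing: prime the astroid cache in the parent, then fork the workers so they inherit it. A sketch under that assumption (this is not how pylint's -j mode currently works, and the file list is a placeholder):

```python
# Sketch of priming astroid's cache once and relying on fork()'s
# copy-on-write semantics so workers don't repeat the parsing work.
# Requires the "fork" start method (Linux/macOS); not a drop-in change
# to pylint's existing -j implementation.
import multiprocessing as mp

import astroid


def prime_cache(paths):
    """Parse every file once in the parent so the brain is already warm."""
    for path in paths:
        astroid.MANAGER.ast_from_file(path)


def check_one(path):
    # Forked workers see the modules already parsed by the parent in
    # astroid.MANAGER.astroid_cache, so this lookup should be cheap.
    module = astroid.MANAGER.ast_from_file(path)
    return path, len(module.body)


if __name__ == "__main__":
    files = ["a.py", "b.py", "c.py"]  # placeholder file list
    prime_cache(files)
    ctx = mp.get_context("fork")  # fork is what makes the sharing work
    with ctx.Pool(processes=4) as pool:
        print(pool.map(check_one, files))
```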

That said, I very much agree with @cdce8p that a simple cache isn't the be-all and end-all. The cache has to match the use-case, which in this case would be both of the following:

  1. how do we reduce the amount of inference work [re]done?
  2. how do we get that inference work done faster?

My AST know-how is rusty, but IIRC you should be able to generate a dependency tree that can be dirtied based on changes to the tree. A partial solution to (1) is then to make the work to be done contingent on that dirty state; a rough sketch is below. The majority of checkers could adhere to that dirty state, but some wouldn't, I imagine (edge cases etc.).

You should then also get, for free, the ability to better multi-thread tasks that operate on that tree, which partially solves (2).
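A toy version of that dirty-tracking dependency tree, using mtimes for change detection (real import resolution and persisting the graph between runs are left out):

```python
# Illustrative only: a module is considered dirty (needs re-checking) if it,
# or anything it transitively imports, changed on disk since the last run.
import os
from collections import defaultdict


class DependencyGraph:
    def __init__(self):
        self.imports = defaultdict(set)  # module path -> paths it imports
        self.last_seen = {}              # module path -> mtime at last clean check

    def add_edge(self, importer: str, imported: str) -> None:
        self.imports[importer].add(imported)

    def _changed(self, path: str) -> bool:
        return self.last_seen.get(path) != os.stat(path).st_mtime_ns

    def is_dirty(self, path: str, _visiting=None) -> bool:
        """Dirty if the file itself or any transitive dependency changed."""
        _visiting = _visiting if _visiting is not None else set()
        if path in _visiting:  # guard against import cycles
            return False
        _visiting.add(path)
        if self._changed(path):
            return True
        return any(self.is_dirty(dep, _visiting) for dep in self.imports[path])

    def mark_clean(self, path: str) -> None:
        self.last_seen[path] = os.stat(path).st_mtime_ns
```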

Considering similar-checker, we could/should use the cache directly rather than create yet another copy of the data in memory.
