-
-
Notifications
You must be signed in to change notification settings - Fork 278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DO NOT MERGE - WIP: Add ability to serialize/deserialze astroid.Module (and friends) #1194
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all tests pass right now. Latest errors are from inference not working (and since inference is quite the beasty, I thought I'd publish this PR as-is for initial feedback)
tests/conftest.py
Outdated
|
||
def pytest_addoption(parser): | ||
parser.addoption( | ||
"--test-roundtrip-persistence", type=bool, default=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually, I think we'll add a tox section to re-run all tests with this enabled. For now it's just enabled so I can test easily.
I'll use this PR to update progress and dump thoughts as I work through the remaining problems. The current test is failing AFAICT due to |
…nto persistence
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for opening this. Sorry I don't have time for an in depth review right now. I agree with the need to improve the caching though. Could you be more specific about the advantages of JSON compared to pickle and about what make the persistance necessary ? Is it for very big code base or low ram machine where the cache would be bigger than ram ? Also could you add typing on newly added functions, please ?
@@ -3045,3 +3103,12 @@ def _get_assign_nodes(self): | |||
child_node._get_assign_nodes() for child_node in self.body | |||
) | |||
return list(itertools.chain.from_iterable(children_assign_nodes)) | |||
|
|||
def __dump__(self, dumper): | |||
data = super().__dump__(dumper) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a cultural reference to dumb and dumber ;) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha no, but we can pretend it is.
I originally was passing the refmap around but thought that was kind of dirty, so wanted to pass a function that did the dumping and remembered the remap, so "dumper" was the first word that came to mind 😂
@Pierre-Sassoulas I got tests to pass. I think the next step is I'll break out this monolothic change into smaller ones. |
2232b52
to
90c3417
Compare
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced we'll see any noticeable improvement with this approach. As far as I'm aware most time is spend during the actual inference. Parsing the ast tree by contrast is fairly quick. So much so that I might bet we won't see measurable changes when constructing it from cache instead.
A more useful approach IMO would be to try and cache the inference results for each file although it still needs to be investigated to what extend that is possible.
I agree the AST parsing caching itself wouldn't be sufficient (hence why there isn't any code that actually persists the AST to disk yet). I assumed this would be the first step that would eventually lead to inference caching (since the inference results would need to be persistable). Closing, though as I won't have much free time anymore to walk this along. It can be used as a starting point for that future effort though 🎉 |
Steps
Description
This PR adds the ability to serialize and deserialize
astroid.Module
instances (and by association, every other kind of instance it may reference) to-and-from JSON.Why persistence?
Persisting processed modules is useful primarily when parsing a large number of files, or files with a lot of interdependencies. So long as desrializing form the cache is quicker than parsing the files, it's a net win.
(I'll do some profiling with
pylint
on our codebase of about 1k files when this is more polished)Why JSON, why not YAML, XML, or
pickle
?JSON was chosen as it is compact (as opposed to yml or xml or similar) and we can choose what and how to store information. Although pickling might be simpler in terms of complexity, it is inherently insecure as a bad actor only needs to infect the
astroid
pickled data on disk to infect the next parse. Using JSON, since we control the data, the actor would have to infect the cache, and the Python environment the next parse runs in.And since we aren't inventing anything new here, it should also be stated
mypy
's cache is also JSON 😉Changes
The bulk of the relevant changes have been localized to the
_persistence
module, which has been specially implemented in order to reduce the amount of changes to all of the relevantastroid
classes. Additional changes have been made in order to either serialize/deserialize in a special way or to enable the default way.A future change will have the changes to the module cache to leverage this new feature.
Testing
TODO
Type of Changes
Related Issue
Related #1145