Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes some bugs when using dump_session() with byref=True #463

Merged
merged 18 commits into from
May 1, 2022

Conversation

leogama
Copy link
Contributor

@leogama leogama commented Apr 22, 2022

Here are the proposed fixes for bugs reported in #462.

You can review the changes already, but please don't merge it yet as I'm still tracking a strange third bug in dump_session(). And I spotted yet another possible bug, but need to test it first. I'll probably append more commits in the next days.

I'm working on this feature for it to be used as the cache mechanism for the reticulate R package —knitr's Python engine. It's mostly functional by now. (Related discussion: yihui/knitr#1505)

By the way, could you clarify a thing for me? Why is it that, even when the byref parameter is set to True, after the imported objects are identified and stored in the __dill_imported object, the Pickler is called with the _byref attribute set to False. What's the rationale? Is it because the session is saved as a whole module and then pickle would try to store everything inside it by name reference?

Fixes #462

… modules

When no objects are found to be imported from external modules, `_stash_modules()` should return `main_module` unmodified, and not the pair list of objects created from it (`original`).
…was imported

Just calling `import multiprocessing` creates a `'__mp_main__'` entry in `sys.modules` that is simply a reference to the `__main__` module. In the old version of the test for module membership, objects in the global scope are wrongly attributed to this virtual `__mp_main__` module. And therefore `load_session()` fails. Testing for object identity instead of module name resolves the issue.
@mmckerns
Copy link
Member

Yes, if we didn't set _byref=False while trying to store the main module, it will attempt to use name reference for modules, etc.

The PR looks good thus far. FYI, I expect that we will create a new release this coming week.

@leogama
Copy link
Contributor Author

leogama commented Apr 24, 2022

The PR looks good thus far. FYI, I expect that we will create a new release this coming week.

This are great news! It would be awesome to have dump_session(byref=True) working fine in this next release. I will definitely need it to be able to propose changes in the (maybe) two R packages related to the usage of dill as a cache engine for Python in R Markdown documents. Do you mind to include this PR in the 0.3.5 milestones? I'm happy if any number of these bug fixes are included, and will continue to work on this feature in a new PR after the release.

@mmckerns
Copy link
Member

if we can finish the PR, it will be included in the release

…e in `dump_session(byref=TRUE)`

Currently, `dump_session(byref=True)` misses some imported objects. For example:

- If the session had a statement `import numpy as np`, it may find a reference to the `numpy` named
as `np` in some internal module listed in `sys.resources`.  But if the module was imported with a
non-canonical name, like `import numpy as nump`, it won't find it at all. Mapping the objects by id
in `modmap` solves the issue.  Note that just types of objects usually imported under an alias must
be looked up by id, otherwise common objects like singletons may be wrongly attributed to a module,
and such reference in the module could change to a different object depending on its initialization
and state.

- If a object in the global scope is a top level module, like `math`, again `save_session` may find
a reference to it in another module and it works. But if this module isn't referenced anywhere else,
it won't be found because the function only looks for objects inside the `sys.resources` modules and
not for the modules themselves.

This commit introduces two new attributes to session modules saved by reference:
- `__dill_imported_as`: a list with (module name, object name, object alias in session)
- `__dill_imported_top_level`: a list with (module name, module alias in session)

I did it this way for forwards (complete) and backwards (partial) compatibility.

Oh, and I got rid of that nasty `exec()` call in `_restore_modules()`! ;)
Fixes RecursionWarning error where a function defined in the top level of the module being saved with
`dump_session(byref=True)`, of which "globals" is a reference to the original module's `__dict__`,
makes dill to try to save this instead of the modified module's `__dict__`, triggering recursion.
@leogama leogama marked this pull request as ready for review April 26, 2022 22:17
@leogama
Copy link
Contributor Author

leogama commented Apr 26, 2022

Hello there! My PR is ready for review. I rushed to solve all the problems I've found using dump_session(), so it could be included in the next release. The overall result seems to be satisfactory and working in both python2 and python3 (3.8 on my machine). I still don't understand fully the function saving logic, but contrary to my first belief it didn't need to be modified.

Do we need some unit tests? How can I provide them? Should it just reproduce the kinds of usage that were failing before these modification? I'm new to testing.

@mmckerns
Copy link
Member

mmckerns commented Apr 26, 2022

That's great. I'll review it tonight. Yes, dump_session sorely needs unit tests. Most of dill has good unit test coverage, however dump_session may have none or at least very very few. Testing is done in dill.tests. You are free to add a module in dill.tests, if one doesn't exist that tests dump_session. Then, you should include yourself as an author of the file, as an additional line to the standard file header.

@leogama
Copy link
Contributor Author

leogama commented Apr 26, 2022

Thank you! I'll take a look at the tests.

@mmckerns
Copy link
Member

tests have to pass for the currently supported versions of python which are: (python 2.7, 3.7-3.10; PyPy 2.7, 3.7-3.9) and should also pass for python 3.11. We will drop support for python 2.7 / PyPy 2.7 after this upcoming release.

@mmckerns
Copy link
Member

I did some random testing -- building a class and an instance of a class, then dumping a session (with and without byref), then loading, and getting the same results either way. Also, testing that things like a = Bar(); a.__class__ is Bar work before and after the load_session, regardless of byref. Things seem to work (eyeroll -- testing needed) for all supported versions of python.

@mmckerns mmckerns self-requested a review April 26, 2022 23:31
Copy link
Member

@mmckerns mmckerns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Would be nice to have some tests, but can't complain as there are currently none.

@leogama
Copy link
Contributor Author

leogama commented Apr 28, 2022

Added some tests. The Travis build is failing because of a warning raised in a assert statement before the proper tests. It's not critical...

@leogama
Copy link
Contributor Author

leogama commented Apr 28, 2022

Travis doesn't like warnings. Can I substitute it by a print to stderr or a call to log()? Otherwise I will just comment out the offending lines.

@mmckerns
Copy link
Member

mmckerns commented Apr 28, 2022

If the raising of a warning is what you want to test, then you can catch it in a try/except block, and assert True on except. If this isn't the case, you can always use an if block to skip the line on travis, or something else. It looks like you are just trying to ensure that the modules you will test are not already imported. Could you not instead delete them from sys.modules if they are already loaded?

@leogama
Copy link
Contributor Author

leogama commented Apr 28, 2022

I added the line to check if the modules that I choose to import in the tests weren't already loaded by python or dill. My assumption is that import mod may have different side effects than from mod import obj concerning the import of the module mod, its inclusion in sys.modules or any internal, implementation dependent, details. But I don't know if this precaution is necessary, and the list of default loaded modules varies from one python version to another... The actual tests are passing, this was just to catch an eventual false negative. I'm happy to suppress this line.

@leogama
Copy link
Contributor Author

leogama commented Apr 28, 2022

Actually, I remove all the non "pre-loaded" modules from sys.modules before calling load_session(), so if one of the modules was already there from the beginning, there's no way to test if load_session() successfully imported it. And I thought that removing a module loaded by dill or python itself from sys.modules could cause problems, but maybe not.

@mmckerns
Copy link
Member

mmckerns commented Apr 28, 2022

I would think that if you delete a module from sys.modules and from __main__, working in the test environment you control... then the module should not be available for __main__ where you are testing until you load_session. However, maybe what matters is that regardless of what's in the environment to begin with, the environment is as expected after load_session is called.

@leogama
Copy link
Contributor Author

leogama commented Apr 28, 2022

I've removed the header "pre-test" and adapted the code. Travis builds are failing with Python 3.7 and Python 2.7. This last one "is allowed to fail", and all the tests pass with Python 2.7 on my machine though (Ubuntu Focal). The error is the same: the xml.sax module was not included in sys.modules after load_session(), but the sax module object was there. It occurred in the testing of dump_session(byref=False), and didn't reach the byref=True case.

This is an artificial test setting however, if you start to think. The use case of load_session() is in a different session, not in the same one, and definitely not after tweaking sys.modules in this way. Maybe we should save the session to a file and test for correct module loading in a following, separate, test file?

tests/test_session_0.py Outdated Show resolved Hide resolved
Copy link
Member

@mmckerns mmckerns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the changes to _dill.py -- the tests need a few changes, however are generally good. please address the comments left inline

tests/test_session_0.py Outdated Show resolved Hide resolved
@leogama
Copy link
Contributor Author

leogama commented May 1, 2022

Answered all the comments. All tests passing. Ready to go.

Copy link
Member

@mmckerns mmckerns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mmckerns mmckerns merged commit 23c4745 into uqfoundation:master May 1, 2022
@leogama leogama deleted the fix-dump-session branch May 1, 2022 00:45
@leogama leogama restored the fix-dump-session branch May 4, 2022 14:27
mmckerns added a commit that referenced this pull request May 6, 2022
@mmckerns mmckerns added this to the dill-0.3.5 milestone May 19, 2022
@mmckerns
Copy link
Member

mmckerns commented Jun 3, 2022

@leogama: In case the OP doesn't open a dill GitHub issue, you may want to ponder this: https://stackoverflow.com/questions/72486294.

@leogama
Copy link
Contributor Author

leogama commented Jun 3, 2022

@leogama: In case the OP doesn't open a dill GitHub issue, you may want to ponder this: https://stackoverflow.com/questions/72486294.

@mmckerns, take a look at my answer there. It is half a bug and half misdesign/misuse. It may need a decision about what to do.

@mmckerns
Copy link
Member

mmckerns commented Jun 3, 2022

yep, that's my thought exactly

@leogama
Copy link
Contributor Author

leogama commented Jun 3, 2022

I've already made up my mind. Restoring the whole namespace of a module to a different module (with a different name!) is a really bad idea, as it overwrites the second module's attributes like __name__, __path__ and __spec__ and creates orphaned classes and functions which dill itself can't deal with. I think load_session() should just keep the main parameter for backward compatibility and ignore it from now on.

The only doubt I have is whether _import_module() should support modules created at runtime if it fails to import. Should this work?:

>>> import dill, types
>>> mod = types.ModuleType('runtime')
>>> mod.var = 'spam'
>>> dill.dump_session(main=mod)

In other session:

>>> import dill
>>> mod = dill.load_session()
>>> print(mod.var)
spam

Edit

I feel it is not so hard to do the same as above without changing _import_module, if someone needs it:

>>> import dill, sys, types
>>> sys.modules['runtime'] = types.ModuleType('runtime')
>>> mod = dill.load_session()  # it could return the module anyway
>>> print(mod.var)
spam

@mmckerns
Copy link
Member

mmckerns commented Jun 4, 2022

I'm pretty sure that I agree with you. I also think it'd be good if modules created at runtime (as in your example) worked.

@mmckerns
Copy link
Member

mmckerns commented Jun 4, 2022

@leogama: the above issue is opened here: #498

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

load_session() fails when dump_session() is used with byref=True
2 participants