-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plugin: cache hook with diskcache
#684
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, some comments
@@ -57,3 +58,20 @@ def create_temporary_module(*functions: Callable, module_name: str = None) -> Mo | |||
setattr(module, fn_name, fn) | |||
sys.modules[module_name] = module | |||
return module | |||
|
|||
|
|||
def module_from_source(source: str) -> ModuleType: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this function, the CacheHook
doesn't work in notebooks because it won't be able to use inspect
to get the source code and create a hash.
The alternative would be to use the functions's __code__
attribute, but this would dependent on the Python version. It might not be an issue because the cache's pickling is already Python version dependent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this used? Think I'm missing it... But that makes sense I think. You're talking about temporary modules, right?
With temporary modules you'd have to special-case them, go through every function and use inspect
on them.
hamilton/graph_utils.py
Outdated
return hashlib.sha256(source.encode()).hexdigest() | ||
|
||
|
||
def remove_docs_and_comments(source: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might not be worth removing comments TBH -- can you at least add some comments here as to what you're doing? Code is a bit cryptic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth adding a test for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Behavior is tested via test_graph_utils::
test_different_hash_docstring
and test_different_hash_comment
. They check both if "the hashes are unequal when not stripping docs and comments" and "the hashes are equal when stripping docs and comments".
The tests would be improved by not relying on graph_utils.hash_source_code()
though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add comments to the function
I think the value of removing docstring/comments will depend on the workflow for the CI / CLI code diff (which rely on the same hashing).
If we marked as "changed" a node that had a changed docstring or added #comment AND all downstream nodes, the amount of false positive will make the feature pointless IMHO
I think it would be valuable to add this as a core feature of Hamilton in the future. I could see a To avoid introducing It could also enable "watermarking" for light orchestration such as retries or backfills. Also probably a good basis for Burr snapshotting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, don't want to overthink it, we can always change it up later if we need to and this works.
Just fix the 3.8 tests and merge!
This hook uses the diskcache to cache node execution on disk. The cache key is a tuple of the function's
(source code, input a, ..., input n)
. It is like memoization but it considers the function's source code in addition to inputs.diskcache has great features to:
Disk
implementations to change the serialization protocol (e.g., pickle, JSON)Changes
hamilton.plugins.h_diskcache
which contains a caching hook and cache management functionsexamples/cache_hook
to demonstrate how to useh_diskcache
setup.py
to includesf-hamilton[diskcache]
diskcache
torequirements-docs.txt
to run testsgraph_utils
and corresponding tests.How I tested this
Notes / TODO
h_diskcache
with notebooks for interactive/iterative developmentinspect.getsource
and will return distinct hashes for changes to docstrings, newlines and comments.Checklist