Refactor XarrayZarrRecipe to be serialization-friendly #160

Merged (23 commits) on Jun 25, 2021

Conversation

@TomAugspurger (Contributor) commented Jun 16, 2021

This is (another) refactor of XarrayZarrRecipe to resolve some of the memory issues we've seen when executing large jobs (pangeo-forge/pangeo-forge-azure-bakery#10, #151).

It builds on #153, which adds to_dask and to_prefect methods (that should probably be done at the base recipe level).

It looks like a large diff, but it's primarily just moving code from methods on XarrayZarrRecipe to top-level functions, and forwarding arguments appropriately. This eliminates self from the functions sent to workers. More explanation at https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160/files#diff-e12c886cc124886c5cfa5313d760a36c39649af9da845077c663e6feab8487b5R685-R693.
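
To make the pattern concrete, here is a rough, hypothetical sketch (the names and argument lists stand in for the real ones in the diff): each former method becomes a module-level function whose inputs are passed explicitly, and the recipe's attributes are forwarded once when the graph is built, so no task closes over self.

from functools import partial


def cache_input(input_key, cache_inputs, input_cache, file_pattern):
    # Formerly something like XarrayZarrRecipe._cache_input(self, input_key); all state
    # now arrives as explicit arguments, so only these objects travel with the task.
    ...


def compile_cache_input(recipe):
    # Forward the recipe's attributes at graph-construction time; the resulting partial
    # (not the whole recipe) is what Dask/Prefect serialize and ship to workers.
    return partial(
        cache_input,
        cache_inputs=recipe.cache_inputs,
        input_cache=recipe.input_cache,
        file_pattern=recipe.file_pattern,
    )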

The main outstanding task right now is ensuring that the metadata_cache is being handled properly. I need to better understand where that object is supposed to live (on the client or workers?) and who is supposed to be able to write to it (do workers write to it? Is it expected to be global so that writes from a worker process are seen by the client?)

  • Move the new compilation methods to BaseRecipe. This would also help clarify the API questions (finalize_target vs _finalize_target, which are required to be implemented where, etc.)
  • Remove the to_pipelines method
  • Implement to_function method, which is equivalent to the existing PythonExecutor
  • Update executor docs (started in 9c39bf7)
  • Rerun all notebooks in docs/tutorials to verify no end-user API changes are needed.
  • Verify metadata caching isn't broken

Closes #116.

@rabernat (Contributor)

Thanks for working on this, Tom!

> It builds on #153,

Does it directly build on #153? Or is it a new implementation of similar ideas?

This is a pretty big refactoring, so I'd like to understand the motivation more clearly; specifically, why is this needed on top of #153?

Is the basic issue that we cannot use any methods (which take self as an argument) at all without embedding the large Recipe objects in every task, and that you therefore need to essentially rewrite everything in functional form?

xarray_open_kwargs: dict,
delete_input_encoding: bool,
process_input: Optional[Callable[[xr.Dataset, str], xr.Dataset]],
metadata_cache: Optional[MetadataTarget],
@rabernat (Contributor):

I feel like these long argument blocks may hurt maintainability. There is so much room for programmer error when passing long lists of arguments through the stack. If we go this route, perhaps we want to enclose some of the arguments in a Config object?
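
For illustration, a minimal sketch of what that could look like (OpenInputConfig and the field set are hypothetical, not part of this PR): the shared arguments are grouped into one small, typed object that gets threaded through the stack instead of several separate parameters.

from dataclasses import dataclass
from typing import Callable, Optional

import xarray as xr


@dataclass(frozen=True)
class OpenInputConfig:
    xarray_open_kwargs: dict
    delete_input_encoding: bool
    process_input: Optional[Callable[[xr.Dataset, str], xr.Dataset]]
    metadata_cache: Optional["MetadataTarget"]  # forward reference; the real import is omitted in this sketch


def open_input(input_key, config: OpenInputConfig):
    # One argument to forward instead of four; mypy still checks the field types.
    ...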

@TomAugspurger (Contributor, Author):

Agreed! Though I think the biggest risk of programmer error is right now / during future major refactors. So it comes down to simplicity of the code (not having to look up what this "FooConfig" thing is) vs. typing all these arguments out. I'm 51% in favor of writing things out explicitly like this, but am happy to go with a Config / Options style object.

I will say that mypy has been helpful here. It caught a couple issues before running the tests.

@TomAugspurger (Contributor, Author)
> Does it directly build on #153? Or is it a new implementation of similar ideas?

New implementation of similar ideas.

> Is the basic issue that we cannot use any methods (which take self as an argument) at all without embedding the large Recipe objects in every task, and that you therefore need to essentially rewrite everything in functional form?

Essentially, yes. That's why I fear #153 alone won't fix it, since https://github.com/pangeo-forge/pangeo-forge-recipes/pull/153/files#diff-a78cae0f25369a56a98f5a65392472c337c3df4dd99398759babf5071dc2032eR109-R112 captures the recipe object in the scope of the delayed function.
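
A small, hypothetical illustration of that capture (not the actual #153 code): a function defined inside a method closes over self, so serializing it, e.g. with cloudpickle as Dask and Prefect do, drags the entire recipe instance along, however large it is.

import cloudpickle


class Recipe:
    def __init__(self):
        self.inputs = list(range(10))
        self.big_state = bytearray(10_000_000)  # stand-in for caches, targets, metadata, ...

    def to_delayed(self):
        def cache_input(input_key):
            # `self` is captured by this closure, so the whole Recipe rides along with it.
            return self.inputs[input_key]

        return cache_input


fn = Recipe().to_delayed()
print(len(cloudpickle.dumps(fn)))  # roughly 10 MB: the closure's payload is the full instance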

cache_input_task = task(self._cache_input, name="cache_input")
prepare_target_task = task(self._prepare_target, name="prepare_target")
store_chunk_task = task(self._store_chunk, name="store_chunk")
finalize_target_task = task(self._finalize_target, name="finalize_target")
@rabernat (Contributor):

Here you are using _finalize_target rather than finalize_target, etc.

Is there any point in maintaining the finalize_target methods if we are going to remove the existing to_pipelines method in BaseRecipe?

@TomAugspurger (Contributor, Author):

I think having the top-level property .finalize_target is still really nice for users who are developing / debugging recipes. We want to keep that, while still allowing Dask and Prefect to access the (partially applied) function itself, so they can wrap it in a Task.
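
As a hedged usage sketch of the two consumers described here (hypothetical names; Prefect 1.x API, matching the imports in the diff): the public property stays callable for interactive debugging, while executors wrap the same partially applied function in a Task.

from functools import partial

from prefect import Flow, task


def finalize_target(target, consolidate_zarr):
    print("finalizing", target, "consolidate:", consolidate_zarr)


class DummyRecipe:
    def __init__(self, target, consolidate_zarr=True):
        self.target = target
        self.consolidate_zarr = consolidate_zarr

    @property
    def finalize_target(self):
        return partial(finalize_target, target=self.target, consolidate_zarr=self.consolidate_zarr)


recipe = DummyRecipe("memory://store.zarr")
recipe.finalize_target()  # debugging path: users can still call it by hand

with Flow("recipe") as flow:  # executor path: wrap the same partial in a Prefect Task
    finalize_target_task = task(recipe.finalize_target, name="finalize_target")()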

@rabernat (Contributor) commented Jun 16, 2021

Ok, so I think I am on board with this as hopefully the right solution to the serializability problems.

Here is a partial checklist of some things that IMO would need to be done before we merge this:

  • Move the new compilation methods to BaseRecipe. This would also help clarify the API questions (finalize_target vs _finalize_target, which are required to be implemented where, etc.)
  • Remove the to_pipelines method
  • Implement to_function method, which is equivalent to the existing PythonExecutor
  • Update executor docs (started in 9c39bf7)
  • Rerun all notebooks in docs/tutorials to verify no end-user API changes are needed.

Happy to help with some of these by pushing directly to this branch.

Thanks again Tom for taking the time to sort out the core issues.

cache_input,
cache_inputs=self.cache_inputs,
input_cache=self.input_cache,
file_pattern=self.file_pattern,
@rabernat (Contributor):

I am having a hard time understanding why these partial functions are "small" in a serialization sense, since each function in the graph essentially contains a functools.partial-wrapped version of all of the same attributes that are part of the recipe class? In particular, file_pattern and targets are all fairly complex / large objects themselves, which are now being curried into a single-argument function. How is that any better than having a method attached to a class?

@TomAugspurger (Contributor, Author):

Just noting that my initial attempts to answer this question have failed. My hunch is that it's simply easier to serialize these functions than it is to serialize the dataclass, but I'd like to have a stronger justification than that. Pointing to the workflow in pangeo-forge/pangeo-forge-azure-bakery#10 is some evidence, but it's pretty indirect.

I'll keep trying to come up with a clear answer.

@rabernat (Contributor):

To be clear, having a forensic understanding of this is interesting academically but far less important than actually shipping code that works. 😁 So given limited time, I would focus on the checklist above (rather than digging deeper).

Let me know if you want help on any aspects here. I have stopped working on #153 in the meantime.

@TomAugspurger (Contributor, Author):

I'll take care of the first few items around to_pipelines() and will look into the outstanding metadata caching.

@TomAugspurger (Contributor, Author):

I updated the docs and notebooks. I had some trouble running the notebook, but I think it was an issue with my local internet connection.

@review-notebook-app (bot): Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@rabernat (Contributor) left a review comment:

Is it intentional to override the to_dask and to_prefect methods from BaseRecipe in XarrayZarrRecipe?

        yield k

    def to_prefect(self):
        """Compile the recipe to a Prefect.Flow object."""
        from prefect import Flow, task, unmapped
@rabernat (Contributor):

Isn't this now implemented twice? You also have this in BaseRecipe, no?

        return xr.open_zarr(target_mapper)

        for i, input_key in enumerate(self.iter_inputs()):
            dsk[(f"cache_input-{token}", i)] = (self._cache_input, input_key)
        dsk[f"checkpoint_0-{token}"] = (lambda *args: None, list(dsk))
@rabernat (Contributor):

Same, this is also implemented in BaseRecipe, no?

@TomAugspurger (Contributor, Author) commented Jun 17, 2021

> Is it intentional to override the to_dask and to_prefect methods from BaseRecipe in XarrayZarrRecipe?

Fixed in 6d190ff, which removes the to_* methods from XarrayZarrRecipe and moves them to the base.

That also removes the @closure decorator and the underscore versions of the properties (_prepare_target, etc.), since they aren't needed anymore.

I'm going to test this commit out on the full dataset in https://github.com/pangeo-forge/pangeo-forge-azure-bakery again.

@TomAugspurger (Contributor, Author)

Well, I feel a bit silly. The difficulty in answering https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160/files/e70d52662875f8835cc180cb289fc4e6d4445e4a#diff-e12c886cc124886c5cfa5313d760a36c39649af9da845077c663e6feab8487b5 spurred some more investigation into what was actually taking up so much space when serialized.

It turns out that most of the size was in the FilePattern class. Maybe this isn't too surprising, but it did surprise me that the size was actually in the function pattern_from_file_sequence, rather than in the class itself:

def pattern_from_file_sequence(file_list, concat_dim, nitems_per_file=None):
    """Convenience function for creating a FilePattern from a list of files."""
    keys = list(range(len(file_list)))
    concat = ConcatDim(name=concat_dim, keys=keys, nitems_per_file=nitems_per_file)

    def format_function(**kwargs):
        return file_list[kwargs[concat_dim]]

    return FilePattern(format_function, concat)
Then it was obvious: in https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/8c7c40183d3a0eedbaab7dac483287b8abf54c46/flow_test/oisst_recipe.py#L100-L106 we create a list with 14,000 (largeish) strings. I'm still not sure why, but apparently multiple instances of that list were being created when the function was deserialized (maybe because it's referenced via a closure rather than a top-level function?).

The alternative is to construct the FilePattern "manually": https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/c70246d209a529ba921b73ed779bfbb57e9178a6/flow_test/oisst_recipe.py#L96-L110. That just requires a function and a range object, which is much smaller.
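
A hedged sketch of the two constructions being compared (the URL template and date range are illustrative, not the actual bakery code, and the imports assume the pangeo_forge_recipes.patterns module):

import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, pattern_from_file_sequence

start = pd.Timestamp("1981-09-01")
dates = pd.date_range(start, "2021-01-05", freq="D")

# Convenience path: format_function closes over ~14,000 Python strings, all of which are
# serialized along with the pattern.
urls = [f"https://example.com/oisst/{d:%Y%m}/oisst-avhrr.{d:%Y%m%d}.nc" for d in dates]
pattern_big = pattern_from_file_sequence(urls, "time", nitems_per_file=1)


# Manual path: the format function rebuilds each URL from its integer key, so there is
# very little to serialize beyond the function itself and a sequence of integer keys.
def make_url(time):
    day = start + pd.Timedelta(days=time)
    return f"https://example.com/oisst/{day:%Y%m}/oisst-avhrr.{day:%Y%m%d}.nc"


pattern_small = FilePattern(
    make_url,
    ConcatDim(name="time", keys=list(range(len(dates))), nitems_per_file=1),
)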

For some reason, in my debugging I had turned cache_input off, which led me to conclude that this branch was helping. I'll now test the alternative FilePattern against pangeo-forge-recipes master, to see whether this branch is even necessary.

@rabernat (Contributor)

> It turns out that most of the size was in the FilePattern class. Maybe this isn't too surprising, but it did surprise me that the size was actually in the function pattern_from_file_sequence, rather than in the class itself.

👍 This is definitely expected from my POV. Perhaps we should deprecate that function.

@TomAugspurger (Contributor, Author)

> Perhaps we should deprecate that function.

I think that's worthwhile. It is convenient, but I worry that this will bite us again as we scale users' recipes to large jobs. And I hope that manually constructing the FilePattern isn't too much more difficult.


FWIW, when I ran the flow using the fixed FilePattern construction on pangeo-forge-recipes master, my scheduler was OOM Killed. It does seem like this branch is doing something.

@rabernat (Contributor)

The big question I have is whether this PR is still significantly better than #153 in terms of serialization. (Assuming the more efficient FilePattern approach is used in both cases.)

@TomAugspurger (Contributor, Author)

> The big question I have is whether this PR is still significantly better than #153 in terms of serialization.

It seems like it. I ran a flow that just sleeps for the cache_input step with pangeo-forge-recipes master, #153, and this PR. #153 still had memory issues on the scheduler, while this PR didn't.

pangeo-forge/pangeo-forge-azure-bakery#10 (comment) has the results.

@rabernat (Contributor) left a review comment:

I went through this again and noticed that all of the global metadata references have been commented out. Can you explain why global metadata is more difficult with this new functional syntax?

Is there something I can do to help here?

# if cache_metadata:
# # if nitems_per_input is not constant, we need to cache this info
# recipe_meta = {"input_sequence_lens": input_sequence_lens}
# return recipe_meta
@rabernat (Contributor):

What's going on here? I don't follow your comment.

Comment on lines 108 to 109
# TODO(METADATA): set
metadata_cache[_input_metadata_fname(input_key)] = input_metadata
@rabernat (Contributor):

Is there some reason metadata requires different treatment here?

Comment on lines 163 to 166
# TODO(Tom): Handle metadata caching here
# else:
# global_metadata = metadata_cache[_GLOBAL_METADATA_KEY]
# input_sequence_lens = global_metadata["input_sequence_lens"]
@rabernat (Contributor):

Just flagging that metadata stuff has been commented out.

@TomAugspurger (Contributor, Author)

Fixing the metadata stuff is my (hopefully) last TODO. I'll dig into it today, but maybe you can answer this easily: where do the reads and writes to the global metadata store happen, in the client process or the worker processes?

@rabernat (Contributor) commented Jun 24, 2021

Worker processes.

The client knows nothing about the execution state.

@TomAugspurger (Contributor, Author)

Sounds good. So it's assumed that the metadata cache is globally readable/writable, like a blob-storage file system. In that case, I think my last commit fixes things, but the tests were passing without it. It's probably worth adding some kind of check to the tests to verify that.


I'm running through the notebooks now, and at least some are failing with TypeError: 'FSSpecTarget' object does not support item assignment. I'm guessing they're failing on master too. The targets should be a MetadataTarget rather than an FSSpecTarget. Maybe related to #133.

I'll update the notebooks to use MetadataTarget.

@TomAugspurger (Contributor, Author)

The notebooks caught an issue: I had removed XarrayZarrRecipe.open_input and XarrayZarrRecipe.open_chunk. These are useful helpers for debugging a recipe, so I restored them and added a basic test to ensure they're present.

The notebooks are now updated to use a MetadataTarget and are all passing. I think this should be good to go.

@rabernat (Contributor) left a review comment:

This is fantastic. Thanks so much Tom for all your hard work on this!

Could you just update the release notes? I think at this point our next release is 0.4, so we probably need to add a new section.

@TomAugspurger (Contributor, Author)

Done in f57e36a.

@rabernat merged commit e1ef575 into pangeo-forge:master on Jun 25, 2021
@rabernat (Contributor)

@TomAugspurger, can you think of a test we could add to guard against regressions related to serialization? Relevant for big PRs like #166 which touch a lot of different pieces...

@TomAugspurger (Contributor, Author)

Nothing really comes to mind :/ The typical way to check is to serialize the object and then check the size of the bytestring. But IIRC we found that these objects weren't all that large.
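
One possibility, sketched here with hypothetical fixture and attribute names and an arbitrary size budget: serialize the callables that actually travel to workers and assert the payload stays under a threshold, so a regression that starts dragging the full recipe (or a large FilePattern) into tasks shows up as a failing test.

import cloudpickle


def test_recipe_tasks_serialize_small():
    recipe = make_test_recipe()  # hypothetical fixture building a small XarrayZarrRecipe
    for fn in (recipe.cache_input, recipe.prepare_target, recipe.store_chunk, recipe.finalize_target):
        assert len(cloudpickle.dumps(fn)) < 100_000  # budget is arbitrary; tune per recipe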

Linked issue: Improve serializability of Recipe and related classes