Refactor XarrayZarrRecipe to be serialization-friendly #160

Merged (23 commits) on Jun 25, 2021

Conversation

@TomAugspurger (Contributor) commented Jun 16, 2021

This is (another) refactor of XarrayZarrRecipe to resolve some of the memory issues we've seen when executing large jobs (pangeo-forge/pangeo-forge-azure-bakery#10, #151).

It builds on #153, which adds to_dask and to_prefect methods (that should probably be done at the base recipe level).

It looks like a large diff, but it's primarily just moving code from methods on XarrayZarrRecipe to top-level functions, and forwarding arguments appropriately. This eliminates self from the functions sent to workers. More explanation at https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160/files#diff-e12c886cc124886c5cfa5313d760a36c39649af9da845077c663e6feab8487b5R685-R693.
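
To make the pattern concrete, here is a rough, hypothetical sketch (the names and argument lists stand in for the real ones in the diff): each former method becomes a module-level function whose inputs are passed explicitly, and the recipe's attributes are forwarded once when the graph is built, so no task closes over self.

from functools import partial


def cache_input(input_key, cache_inputs, input_cache, file_pattern):
    # Formerly something like XarrayZarrRecipe._cache_input(self, input_key); all state
    # now arrives as explicit arguments, so only these objects travel with the task.
    ...


def compile_cache_input(recipe):
    # Forward the recipe's attributes at graph-construction time; the resulting partial
    # (not the whole recipe) is what Dask/Prefect serialize and ship to workers.
    return partial(
        cache_input,
        cache_inputs=recipe.cache_inputs,
        input_cache=recipe.input_cache,
        file_pattern=recipe.file_pattern,
    )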

The main outstanding task right now is ensuring that the metadata_cache is being handled properly. I need to better understand where that object is supposed to live (on the client or workers?) and who is supposed to be able to write to it (do workers write to it? Is it expected to be global so that writes from a worker process are seen by the client?)

  • Move the new compilation methods to BaseRecipe. This would also help clarify the API questions (finalize_target vs _finalize_target, which are required to be implemented where, etc.)
  • Remove the to_pipelines method
  • Implement to_function method, which is equivalent to the existing PythonExecutor
  • Update executor docs (started in 9c39bf7)
  • Rerun all notebooks in docs/tutorials to verify no end-user API changes are needed.
  • Verify metadata caching isn't broken

Closes #116.

@rabernat (Contributor)

Thanks for working on this, Tom!

> It builds on #153,

Does it directly build on #153? Or is it a new implementation of similar ideas?

This is a pretty big refactoring, so I'd like to understand the motivation more clearly; specifically, why is this needed on top of #153?

Is the basic issue that we cannot use any methods (which take self as an argument) at all without embedding the large Recipe objects in every task, and that you therefore need to essentially rewrite everything in functional form?

xarray_open_kwargs: dict,
delete_input_encoding: bool,
process_input: Optional[Callable[[xr.Dataset, str], xr.Dataset]],
metadata_cache: Optional[MetadataTarget],
@rabernat (Contributor):

I feel like these long argument blocks may hurt maintainability. There is so much room for programmer error when passing long lists of arguments through the stack. If we go this route, perhaps we want to enclose some of the arguments in a Config object?
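
For illustration, a minimal sketch of what that could look like (OpenInputConfig and the field set are hypothetical, not part of this PR): the shared arguments are grouped into one small, typed object that gets threaded through the stack instead of several separate parameters.

from dataclasses import dataclass
from typing import Callable, Optional

import xarray as xr


@dataclass(frozen=True)
class OpenInputConfig:
    xarray_open_kwargs: dict
    delete_input_encoding: bool
    process_input: Optional[Callable[[xr.Dataset, str], xr.Dataset]]
    metadata_cache: Optional["MetadataTarget"]  # forward reference; the real import is omitted in this sketch


def open_input(input_key, config: OpenInputConfig):
    # One argument to forward instead of four; mypy still checks the field types.
    ...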

@TomAugspurger (Contributor, Author):

Agreed! Though I think the biggest risk of programmer error is right now / during future major refactors. So it comes down to simplicity of the code (not having to look up what this "FooConfig" thing is) vs. typing all these arguments out. I'm 51% in favor of writing things out explicitly like this, but am happy to go with a Config / Options style object.

I will say that mypy has been helpful here. It caught a couple issues before running the tests.

@TomAugspurger (Contributor, Author)
> Does it directly build on #153? Or is it a new implementation of similar ideas?

New implementation of similar ideas.

> Is the basic issue that we cannot use any methods (which take self as an argument) at all without embedding the large Recipe objects in every task, and that you therefore need to essentially rewrite everything in functional form?

Essentially, yes. That's why I fear #153 alone won't fix it, since https://github.com/pangeo-forge/pangeo-forge-recipes/pull/153/files#diff-a78cae0f25369a56a98f5a65392472c337c3df4dd99398759babf5071dc2032eR109-R112 captures the recipe object in the scope of the delayed function.
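
A small, hypothetical illustration of that capture (not the actual #153 code): a function defined inside a method closes over self, so serializing it, e.g. with cloudpickle as Dask and Prefect do, drags the entire recipe instance along, however large it is.

import cloudpickle


class Recipe:
    def __init__(self):
        self.inputs = list(range(10))
        self.big_state = bytearray(10_000_000)  # stand-in for caches, targets, metadata, ...

    def to_delayed(self):
        def cache_input(input_key):
            # `self` is captured by this closure, so the whole Recipe rides along with it.
            return self.inputs[input_key]

        return cache_input


fn = Recipe().to_delayed()
print(len(cloudpickle.dumps(fn)))  # roughly 10 MB: the closure's payload is the full instance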

cache_input_task = task(self._cache_input, name="cache_input")
prepare_target_task = task(self._prepare_target, name="prepare_target")
store_chunk_task = task(self._store_chunk, name="store_chunk")
finalize_target_task = task(self._finalize_target, name="finalize_target")
@rabernat (Contributor):

Here you are using _finalize_target rather than finalize_target, etc.

Is there any point in maintaining the finalize_target methods if we are going to remove the existing to_pipelines method in BaseRecipe?

@TomAugspurger (Contributor, Author):

I think having the top-level property .finalize_target is still really nice for users who are developing / debugging recipes. We want to keep that, while still allowing Dask and Prefect to access the (partially applied) function itself, so they can wrap it in a Task.
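
As a hedged usage sketch of the two consumers described here (hypothetical names; Prefect 1.x API, matching the imports in the diff): the public property stays callable for interactive debugging, while executors wrap the same partially applied function in a Task.

from functools import partial

from prefect import Flow, task


def finalize_target(target, consolidate_zarr):
    print("finalizing", target, "consolidate:", consolidate_zarr)


class DummyRecipe:
    def __init__(self, target, consolidate_zarr=True):
        self.target = target
        self.consolidate_zarr = consolidate_zarr

    @property
    def finalize_target(self):
        return partial(finalize_target, target=self.target, consolidate_zarr=self.consolidate_zarr)


recipe = DummyRecipe("memory://store.zarr")
recipe.finalize_target()  # debugging path: users can still call it by hand

with Flow("recipe") as flow:  # executor path: wrap the same partial in a Prefect Task
    finalize_target_task = task(recipe.finalize_target, name="finalize_target")()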

@rabernat (Contributor) commented Jun 16, 2021

Ok, so I think I am on board with this as hopefully the right solution to the serializability problems.

Here is a partial checklist of some things that IMO would need to be done before we merge this:

  • Move the new compilation methods to BaseRecipe. This would also help clarify the API questions (finalize_target vs _finalize_target, which are required to be implemented where, etc.)
  • Remove the to_pipelines method
  • Implement to_function method, which is equivalent to the existing PythonExecutor
  • Update executor docs (started in 9c39bf7)
  • Rerun all notebooks in docs/tutorials to verify no end-user API changes are needed.

Happy to help with some of these by pushing directly to this branch.

Thanks again Tom for taking the time to sort out the core issues.

cache_input,
cache_inputs=self.cache_inputs,
input_cache=self.input_cache,
file_pattern=self.file_pattern,
@rabernat (Contributor):

I am having a hard time understanding why these partial functions are "small" in a serialization sense, since each function in the graph essentially contains a functools.partial-wrapped version of all of the same attributes that are part of the recipe class? In particular, file_pattern and targets are all fairly complex / large objects themselves, which are now being curried into a single-argument function. How is that any better than having a method attached to a class?

@TomAugspurger (Contributor, Author):

Just noting that my initial attempts to answer this question have failed. My hunch is that it's simply easier to serialize these functions than it is to serialize the dataclass, but I'd like to have a stronger justification than that. Pointing to the workflow in pangeo-forge/pangeo-forge-azure-bakery#10 is some evidence, but it's pretty indirect.

I'll keep trying to come up with a clear answer.

@rabernat (Contributor):

To be clear, having a forensic understanding of this is interesting academically but far less important than actually shipping code that works. 😁 So given limited time, I would focus on the checklist above (rather than digging deeper).

Let me know if you want help on any aspects here. I have stopped working on #153 in the meantime.

@TomAugspurger (Contributor, Author):

I'll take care of the first few items around to_pipelines() and will look into the outstanding metadata caching.

@TomAugspurger (Contributor, Author):

I updated the docs and notebooks. I had some trouble running the notebook, but I think it was an issue with my local internet connection.

@review-notebook-app (bot): Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@rabernat (Contributor) left a review comment:

Is it intentional to override the to_dask and to_prefect methods from BaseRecipe in XarrayZarrRecipe?

        yield k

    def to_prefect(self):
        """Compile the recipe to a Prefect.Flow object."""
        from prefect import Flow, task, unmapped
@rabernat (Contributor):

Isn't this now implemented twice? You also have this in BaseRecipe, no?

        return xr.open_zarr(target_mapper)

        for i, input_key in enumerate(self.iter_inputs()):
            dsk[(f"cache_input-{token}", i)] = (self._cache_input, input_key)
        dsk[f"checkpoint_0-{token}"] = (lambda *args: None, list(dsk))
@rabernat (Contributor):

Same, this is also implemented in BaseRecipe, no?

@TomAugspurger (Contributor, Author) commented Jun 17, 2021

> Is it intentional to override the to_dask and to_prefect methods from BaseRecipe in XarrayZarrRecipe?

Fixed in 6d190ff, which removes the to_* methods from XarrayZarrRecipe and moves them to the base.

That also removes the @closure decorator and the underscore versions of the properties (_prepare_target, etc.), since they aren't needed anymore.

I'm going to test this commit out on the full dataset in https://github.com/pangeo-forge/pangeo-forge-azure-bakery again.

@TomAugspurger (Contributor, Author)

Well, I feel a bit silly. The difficulty in answering https://github.com/pangeo-forge/pangeo-forge-recipes/pull/160/files/e70d52662875f8835cc180cb289fc4e6d4445e4a#diff-e12c886cc124886c5cfa5313d760a36c39649af9da845077c663e6feab8487b5 spurred some more investigation into what was actually taking up so much space when serialized.

It turns out that most of the size was in the FilePattern class. Maybe this isn't too surprising, but it did surprise me that the size was actually in the function pattern_from_file_sequence, rather than in the class itself:

def pattern_from_file_sequence(file_list, concat_dim, nitems_per_file=None):
    """Convenience function for creating a FilePattern from a list of files."""
    keys = list(range(len(file_list)))
    concat = ConcatDim(name=concat_dim, keys=keys, nitems_per_file=nitems_per_file)

    def format_function(**kwargs):
        return file_list[kwargs[concat_dim]]

    return FilePattern(format_function, concat)
Then it was obvious: in https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/8c7c40183d3a0eedbaab7dac483287b8abf54c46/flow_test/oisst_recipe.py#L100-L106 we create a list with 14,000 (largeish) strings. I'm still not sure why, but apparently multiple instances of that list were being created when the function was deserialized (maybe because it's referenced via a closure rather than a top-level function?).

The alternative is to construct the FilePattern "manually": https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/c70246d209a529ba921b73ed779bfbb57e9178a6/flow_test/oisst_recipe.py#L96-L110. That just requires a function and a range object, which is much smaller.
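
A hedged sketch of the two constructions being compared (the URL template and date range are illustrative, not the actual bakery code, and the imports assume the pangeo_forge_recipes.patterns module):

import pandas as pd

from pangeo_forge_recipes.patterns import ConcatDim, FilePattern, pattern_from_file_sequence

start = pd.Timestamp("1981-09-01")
dates = pd.date_range(start, "2021-01-05", freq="D")

# Convenience path: format_function closes over ~14,000 Python strings, all of which are
# serialized along with the pattern.
urls = [f"https://example.com/oisst/{d:%Y%m}/oisst-avhrr.{d:%Y%m%d}.nc" for d in dates]
pattern_big = pattern_from_file_sequence(urls, "time", nitems_per_file=1)


# Manual path: the format function rebuilds each URL from its integer key, so there is
# very little to serialize beyond the function itself and a sequence of integer keys.
def make_url(time):
    day = start + pd.Timedelta(days=time)
    return f"https://example.com/oisst/{day:%Y%m}/oisst-avhrr.{day:%Y%m%d}.nc"


pattern_small = FilePattern(
    make_url,
    ConcatDim(name="time", keys=list(range(len(dates))), nitems_per_file=1),
)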

For some reason, in my debugging I had turned cache_input off, which led me to conclude that this branch was helping. I'll now test the alternative FilePattern against pangeo-forge-recipes master, to see whether this branch is even necessary.

@rabernat (Contributor)

> It turns out that most of the size was in the FilePattern class. Maybe this isn't too surprising, but it did surprise me that the size was actually in the function pattern_from_file_sequence, rather than in the class itself.

👍 This is definitely expected from my POV. Perhaps we should deprecate that function.

@TomAugspurger (Contributor, Author)

> Perhaps we should deprecate that function.

I think that's worthwhile. It is convenient, but I worry that this will bite us again as we scale users' recipes to large jobs. And I hope that manually constructing the FilePattern isn't too much more difficult.


FWIW, when I ran the flow using the fixed FilePattern construction on pangeo-forge-recipes master, my scheduler was OOM Killed. It does seem like this branch is doing something.

@rabernat (Contributor)

The big question I have is whether this PR is still significantly better than #153 in terms of serialization. (Assuming the more efficient FilePattern approach is used in both cases.)

@TomAugspurger (Contributor, Author)

> The big question I have is whether this PR is still significantly better than #153 in terms of serialization.

It seems like it. I ran a flow that just sleeps for the cache_input step with pangeo-forge-recipes master, #153, and this PR. #153 still had memory issues on the scheduler, while this PR didn't.

pangeo-forge/pangeo-forge-azure-bakery#10 (comment) has the results.

@rabernat (Contributor) left a review comment:

I went through this again and noticed that all of the global metadata references have been commented out. Can you explain why global metadata is more difficult with this new functional syntax?

Is there something I can do to help here?

# if cache_metadata:
# # if nitems_per_input is not constant, we need to cache this info
# recipe_meta = {"input_sequence_lens": input_sequence_lens}
# return recipe_meta
@rabernat (Contributor):

What's going on here? I don't follow your comment.

Comment on lines 108 to 109
# TODO(METADATA): set
metadata_cache[_input_metadata_fname(input_key)] = input_metadata
@rabernat (Contributor):

Is there some reason metadata requires different treatment here?

Comment on lines 163 to 166
# TODO(Tom): Handle metadata caching here
# else:
# global_metadata = metadata_cache[_GLOBAL_METADATA_KEY]
# input_sequence_lens = global_metadata["input_sequence_lens"]
@rabernat (Contributor):

Just flagging that metadata stuff has been commented out.

@TomAugspurger (Contributor, Author)

Fixing the metadata stuff is my (hopefully) last TODO. I'll dig into it today, but maybe you can answer this easily: where do the reads and writes to the global metadata store happen, in the client process or the worker processes?

@rabernat (Contributor) commented Jun 24, 2021

Worker processes.

The client knows nothing about the execution state.

@TomAugspurger (Contributor, Author)

Sounds good. So it's assumed that the metadata cache is globally readable/writable, like a blob-storage file system. In that case, I think my last commit fixes things, but the tests were passing without it. It's probably worth adding some kind of check to the tests to verify that.


I'm running through the notebooks now, and at least some are failing with TypeError: 'FSSpecTarget' object does not support item assignment. I'm guessing they're failing on master too. The targets should be a MetadataTarget rather than an FSSpecTarget. Maybe related to #133.

I'll update the notebooks to use MetadataTarget.

@TomAugspurger (Contributor, Author)

The notebooks caught an issue: I had removed XarrayZarrRecipe.open_input and XarrayZarrRecipe.open_chunk. These are useful helpers for debugging a recipe, so I restored them and added a basic test to ensure they're present.

The notebooks are now updated to use a MetadataTarget and are all passing. I think this should be good to go.

@rabernat (Contributor) left a review comment:

This is fantastic. Thanks so much Tom for all your hard work on this!

Could you just update the release notes? I think at this point our next release is 0.4, so we probably need to add a new section.

@TomAugspurger (Contributor, Author)

Done in f57e36a.

@rabernat merged commit e1ef575 into pangeo-forge:master on Jun 25, 2021
@rabernat (Contributor)

@TomAugspurger, can you think of a test we could add to guard against regressions related to serialization? Relevant for big PRs like #166 which touch a lot of different pieces...

@TomAugspurger (Contributor, Author)

Nothing really comes to mind :/ The typical way to check is to serialize the object and then check the size of the bytestring. But IIRC we found that these objects weren't all that large.
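
One possibility, sketched here with hypothetical fixture and attribute names and an arbitrary size budget: serialize the callables that actually travel to workers and assert the payload stays under a threshold, so a regression that starts dragging the full recipe (or a large FilePattern) into tasks shows up as a failing test.

import cloudpickle


def test_recipe_tasks_serialize_small():
    recipe = make_test_recipe()  # hypothetical fixture building a small XarrayZarrRecipe
    for fn in (recipe.cache_input, recipe.prepare_target, recipe.store_chunk, recipe.finalize_target):
        assert len(cloudpickle.dumps(fn)) < 100_000  # budget is arbitrary; tune per recipe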

Linked issue: Improve serializability of Recipe and related classes