
KilledWorker error in the deployment #5

Open
paigem opened this issue Jun 8, 2022 · 3 comments


paigem commented Jun 8, 2022

It looks like the most recent recipe run failed again, this time with a KilledWorker error. This appears to be a different failure mode from the one we saw in this previous recipe run, which did not raise an explicit error but only processed a single chunk in time.

@cisaacstern Any thoughts on how to get past a KilledWorker error?

@cisaacstern
Member

@paigem, is it possible that any of the variables added in #4 are fields which, even when divided into 5 subsets

subset_inputs = {"time": 5}  # 5 subsets per year, each with a time dimension of length 73 (5 x 73 = 365)

are still > 800 MB in size?

For example, if any of the added variables is >= 5 GB in size, then with our current subset_inputs each subset will be >= 1 GB, which could easily kill a worker.

If this is the case, the fix would be to increase subset_inputs to a divisor of the time dimension large enough to make each subset <= about 800 MB; a rough sketch of that arithmetic follows below.
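To make the sizing arithmetic concrete, here is a minimal sketch (the 5 GB variable size and 800 MB ceiling are illustrative assumptions, and it ignores whether the subset count divides 365 evenly):

# Minimal sketch: find the smallest subset count that keeps each
# subset under a target size. All numbers here are illustrative.
variable_size_mb = 5 * 1024  # hypothetical 5 GB variable
target_mb = 800              # rough per-subset ceiling discussed above

n_subsets = 1
while variable_size_mb / n_subsets > target_mb:
    n_subsets += 1
print(n_subsets, variable_size_mb / n_subsets)  # 7 subsets of ~731 MB each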

Also, just noting that these troublesome infrastructure concerns are among the many problems that the work scoped in pangeo-forge/pangeo-forge-recipes#256 should solve once complete.


paigem commented Jun 9, 2022

Thanks @cisaacstern for explaining this.

Each individual netCDF file (1 year of daily output) is just under 1 GB. Since I'm stringing together 9 years of data, the full time series for each variable would indeed be roughly 9 GB, which works out to > 800 MB per chunk.

I decided to go with subset_inputs=5 and time chunks of length 73, since 5 is one of the few divisors of 365 (365 = 5 x 73, so its only divisors are 1, 5, 73, and 365). If I increase subset_inputs to, say, 12 (for roughly one-month segments), this would yield time chunks of unequal size. Would that still work?

@cisaacstern
Member

this would yield time chunks of unequal size. Would that still work?

Yes, pangeo-forge-recipes should understand how to put the remainder into the last chunk. Let's try that and see what happens. Could you make a PR proposing this change? Thanks for your patience as we work this out, Paige.
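For intuition, here is a rough sketch of that remainder handling (illustrative only; not the actual pangeo-forge-recipes implementation):

# Sketch: split 365 daily time steps into 12 subsets, folding the
# remainder into the final chunk. Illustrative only; not the
# actual pangeo-forge-recipes code.
n_time, n_subsets = 365, 12
base, rem = divmod(n_time, n_subsets)  # base=30, rem=5
chunk_lengths = [base] * (n_subsets - 1) + [base + rem]
print(chunk_lengths)  # [30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 35]
assert sum(chunk_lengths) == n_time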
