
KilledWorker error in the deployment #5

Open
paigem opened this issue Jun 8, 2022 · 3 comments


paigem commented Jun 8, 2022

It looks like the most recent recipe run failed again, this time with a KilledWorker error. This appears to be a different failure mode from the one we saw in this previous recipe run, which did not raise an explicit error but only processed a single chunk in time.

@cisaacstern Any thoughts on how to get past a KilledWorker error?

@cisaacstern
Member

@paigem, is it possible that any of the variables added in #4 are fields which, even when divided into 5 subsets

subset_inputs = {"time": 5}  # 5 subsets per year, each with a time dimension of length 73 (5 x 73 = 365)

are still > 800 MB in size?

For example, if any of the added variables is >= 5 GB in size, then with our current subset_inputs each subset will be >= 1 GB, which could easily kill a worker.

If this is the case, the fix would be to increase subset_inputs to a divisor of the time dimension large enough to make each subset <= about 800 MB; a rough sketch of that arithmetic follows below.
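To make the sizing arithmetic concrete, here is a minimal sketch (the 5 GB variable size and 800 MB ceiling are illustrative assumptions, and it ignores whether the subset count divides 365 evenly):

# Minimal sketch: find the smallest subset count that keeps each
# subset under a target size. All numbers here are illustrative.
variable_size_mb = 5 * 1024  # hypothetical 5 GB variable
target_mb = 800              # rough per-subset ceiling discussed above

n_subsets = 1
while variable_size_mb / n_subsets > target_mb:
    n_subsets += 1
print(n_subsets, variable_size_mb / n_subsets)  # 7 subsets of ~731 MB each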

Also, just noting that these troublesome infrastructure concerns are among the many problems that the work scoped in pangeo-forge/pangeo-forge-recipes#256 should solve once complete.


paigem commented Jun 9, 2022

Thanks @cisaacstern for explaining this.

Each individual netCDF file (1 year of daily output) is just under 1 GB. Since I'm stringing together 9 years of data, the full time series for each variable would indeed be roughly 9 GB, which works out to > 800 MB per chunk.

I decided to go with subset_inputs=5 and time chunks of length 73, since 5 is one of the few divisors of 365 (365 = 5 x 73, so its only divisors are 1, 5, 73, and 365). If I increase subset_inputs to, say, 12 (for roughly one-month segments), this would yield time chunks of unequal size. Would that still work?

@cisaacstern
Member

this would yield time chunks of unequal size. Would that still work?

Yes, pangeo-forge-recipes should understand how to put the remainder into the last chunk. Let's try that and see what happens. Could you make a PR proposing this change? Thanks for your patience as we work this out, Paige.
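For intuition, here is a rough sketch of that remainder handling (illustrative only; not the actual pangeo-forge-recipes implementation):

# Sketch: split 365 daily time steps into 12 subsets, folding the
# remainder into the final chunk. Illustrative only; not the
# actual pangeo-forge-recipes code.
n_time, n_subsets = 365, 12
base, rem = divmod(n_time, n_subsets)  # base=30, rem=5
chunk_lengths = [base] * (n_subsets - 1) + [base + rem]
print(chunk_lengths)  # [30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 35]
assert sum(chunk_lengths) == n_time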
