large EC-Earth data fails at CMIP6 cleaning due to OOM error #574

Closed
emileten opened this issue Feb 16, 2022 · 3 comments · Fixed by #580
emileten commented Feb 16, 2022

Blocks progress on #263 and #266.

Workflow: https://argo.cildc6.org/archived-workflows/default/28a83ec8-998f-4d61-96f1-73f57387d3e7

Look at the standardize_gcm step in cleaning: every single retry failed due to OOM errors.

I took the input of one of these failed pods and reproduced the OOM on JupyterHub with a 48 GB server, which is the resource limit specified for this pod in our argo workflow.

In standardize_gcm we load the data in memory. EC-Earth3 pr is 256 * 512 * time, so higher resolution than the other models, but that is still only ~16 GB for the future data. The problem, I think, is that some operations in standardize_gcm make memory usage blow up well beyond that.
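
For context, a quick back-of-envelope check of that ~16 GB figure (the time length and dtype below are assumptions for daily data over 2015-2100, not values read from the dataset):

```python
# Back-of-envelope estimate only; time length and dtype are assumptions,
# not numbers taken from the failed run.
nlat, nlon = 256, 512
ntime = 86 * 365        # ~daily steps for 2015-2100, ignoring calendar details
bytes_per_value = 4     # float32
size_gb = nlat * nlon * ntime * bytes_per_value / 1e9
print(f"~{size_gb:.0f} GB")  # ~16 GB for a single fully loaded variable
```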

Another model from this family, with a lower resolution, running here https://argo.cildc6.org/workflows/default/e2e-ec-earth3-veg-lr-pr-t8stn?tab=workflow&nodeId=e2e-ec-earth3-veg-lr-pr-t8stn-1904431442, nearly crashed at the same step for the same reason but survived thanks to retries.

@emileten emileten added the bug label Feb 16, 2022
@emileten emileten changed the title large EC-Earth3 pr data fails at CMIP6 cleaning due to OOM error large EC-Earth3 pr and EC-Earth3-Veg data fails at CMIP6 cleaning due to OOM error Feb 16, 2022
emileten commented Feb 16, 2022

An additional detail. In standardize_cmip6, the two culprits are:

  1. The precip unit conversion: ds_cleaned['pr'] * 24 * 60 * 60
  2. xclim_remove_leapdays(ds_cleaned)
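
Roughly what those two steps look like when the dataset is already fully loaded (a simplified sketch; the function below is a stand-in, not the actual dodola code, and convert_calendar stands in for xclim_remove_leapdays):

```python
# Simplified sketch, assuming ds_cleaned is an eagerly loaded xarray.Dataset
# (no dask chunks), which is what the service hands to the core code here.
import xarray as xr
from xclim.core.calendar import convert_calendar  # stand-in for xclim_remove_leapdays

def culprit_steps(ds_cleaned: xr.Dataset) -> xr.Dataset:
    # 1. Unit conversion: the multiplication allocates a second full-size
    #    copy of "pr" while the original is still referenced.
    ds_cleaned["pr"] = ds_cleaned["pr"] * 24 * 60 * 60
    # 2. Leapday removal: building the noleap version materializes roughly
    #    another full-size array in memory on top of the input.
    return convert_calendar(ds_cleaned, target="noleap")
```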

If we're willing to spend time on this, I see only one acceptable option: split the standardize_cmip6 step so that argo works on a few spatial chunks. We'd also avoid changing anything in dodola. standardize_cmip6 is spatially independent, so that would be fine.
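
To illustrate why spatial independence makes that split valid, a minimal sketch (purely illustrative; the slab count and the standardize_gcm signature are simplified, and in the actual workflow the slabs would be separate argo steps writing separate outputs rather than a Python loop):

```python
# Illustration only: because the cleaning is spatially independent, running it
# on latitude slabs and concatenating gives the same result as one big run.
import xarray as xr
from dodola.core import standardize_gcm  # signature simplified below

def clean_in_slabs(ds: xr.Dataset, n_slabs: int = 4) -> xr.Dataset:
    edges = [round(i * ds.sizes["lat"] / n_slabs) for i in range(n_slabs + 1)]
    pieces = [
        standardize_gcm(ds.isel(lat=slice(lo, hi)))  # existing cleaning, per slab
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
    return xr.concat(pieces, dim="lat")
```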

Two other options won't work: increasing the resource limits, or restructuring standardize_cmip6 in dodola. The former won't help because the problem is too severe; the latter has the same issue and would also require a lot of rewriting.

Note that fixing this issue would let us bring in data from 4 models of this consortium.

[Edit: updated some information and clarified]

emileten commented Feb 16, 2022

Oh ok. I think I understand better what happened here.

  • I was puzzled half an hour ago by the fact that some precip EC models had already passed the cleaning stage in the past. In fact, I moved the models' card back on the project board -- @brews it seems you had run the precip cleaning steps for these models just fine, for example in this workflow: https://argo.cildc6.org/archived-workflows/default/769b4e04-61e5-4efc-a409-ba480909a292. Why is it complaining about memory now? Answer below.
  • It was using dodola 0.8.0. dodola 0.8.0 was not yet loading the data into memory in dodola.services.standardize_cmip6 before running dodola.core.standardize_gcm. We introduced that loading later, as a necessary patch to be able to convert from 360-day calendars (which requires data that is not chunked across time), in this PR: Fix 360 day calendar conversion chunk errors from dodola.services.clean_cmip6 dodola#151
  • We didn't realize it, but this PR broke the EC precip cleaning step.

The only step requiring the absence of temporal chunks is the 360-day calendar conversion, though. Therefore, I'm suggesting we move the data loading to that specific spot in the code. It's a very easy change and it restores the backward compatibility broken by that PR. The only downside is that it introduces chunking concerns into dodola.core. We already have some there, though...
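
A minimal sketch of the suggested change, assuming the conversion lives in a dodola.core helper (the name, signature, and the exact xclim call are illustrative, not the real function):

```python
# Sketch only: load the data right before the 360-day calendar conversion,
# instead of loading the whole dataset up front in dodola.services.
import xarray as xr
from xclim.core.calendar import convert_calendar

def convert_360day_calendar(ds: xr.Dataset) -> xr.Dataset:
    # The conversion cannot handle data chunked across time, so load here,
    # and only for datasets that actually come on a 360_day calendar.
    ds = ds.load()
    return convert_calendar(ds, target="standard", align_on="random")
```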

@emileten emileten changed the title large EC-Earth3 pr and EC-Earth3-Veg data fails at CMIP6 cleaning due to OOM error large EC-Earth data fails at CMIP6 cleaning due to OOM error Feb 17, 2022
@emileten

As I expected, two additional EC-Earth models failed due to this (EC-Earth3-AerChem and EC-Earth3-CC).
