large EC-Earth data fails at CMIP6 cleaning due to OOM error #574

Closed
emileten opened this issue Feb 16, 2022 · 3 comments · Fixed by #580
emileten commented Feb 16, 2022

Blocks progress on #263 and #266.

Workflow: https://argo.cildc6.org/archived-workflows/default/28a83ec8-998f-4d61-96f1-73f57387d3e7

Look at the standardize_gcm step in cleaning: every single retry failed due to OOM errors.

I took the input of one of these failed pods and reproduced the OOM on JupyterHub with a 48 GB server, which is the resource limit specified for this pod in our argo workflow.

In standardize_gcm we load the data in memory. EC-Earth3 pr is 256 * 512 * time, so higher resolution than the other models, but that is still only ~16 GB for the future data. The problem, I think, is that some operations in standardize_gcm make memory usage blow up well beyond that.
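
For context, a quick back-of-envelope check of that ~16 GB figure (the time length and dtype below are assumptions for daily data over 2015-2100, not values read from the dataset):

```python
# Back-of-envelope estimate only; time length and dtype are assumptions,
# not numbers taken from the failed run.
nlat, nlon = 256, 512
ntime = 86 * 365        # ~daily steps for 2015-2100, ignoring calendar details
bytes_per_value = 4     # float32
size_gb = nlat * nlon * ntime * bytes_per_value / 1e9
print(f"~{size_gb:.0f} GB")  # ~16 GB for a single fully loaded variable
```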

Another model from this family, with a lower resolution, running here https://argo.cildc6.org/workflows/default/e2e-ec-earth3-veg-lr-pr-t8stn?tab=workflow&nodeId=e2e-ec-earth3-veg-lr-pr-t8stn-1904431442, nearly crashed at the same step for the same reason but survived thanks to retries.

@emileten emileten added the bug label Feb 16, 2022
@emileten emileten changed the title large EC-Earth3 pr data fails at CMIP6 cleaning due to OOM error large EC-Earth3 pr and EC-Earth3-Veg data fails at CMIP6 cleaning due to OOM error Feb 16, 2022
emileten commented Feb 16, 2022

An additional detail. In standardize_cmip6, the two culprits are:

  1. The precip unit conversion: ds_cleaned['pr'] * 24 * 60 * 60
  2. xclim_remove_leapdays(ds_cleaned)
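
Roughly what those two steps look like when the dataset is already fully loaded (a simplified sketch; the function below is a stand-in, not the actual dodola code, and convert_calendar stands in for xclim_remove_leapdays):

```python
# Simplified sketch, assuming ds_cleaned is an eagerly loaded xarray.Dataset
# (no dask chunks), which is what the service hands to the core code here.
import xarray as xr
from xclim.core.calendar import convert_calendar  # stand-in for xclim_remove_leapdays

def culprit_steps(ds_cleaned: xr.Dataset) -> xr.Dataset:
    # 1. Unit conversion: the multiplication allocates a second full-size
    #    copy of "pr" while the original is still referenced.
    ds_cleaned["pr"] = ds_cleaned["pr"] * 24 * 60 * 60
    # 2. Leapday removal: building the noleap version materializes roughly
    #    another full-size array in memory on top of the input.
    return convert_calendar(ds_cleaned, target="noleap")
```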

If we're willing to spend time on this, I see only one acceptable option: split the standardize_cmip6 step so that argo works on a few spatial chunks. We'd also avoid changing anything in dodola. standardize_cmip6 is spatially independent, so that would be fine.
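
To illustrate why spatial independence makes that split valid, a minimal sketch (purely illustrative; the slab count and the standardize_gcm signature are simplified, and in the actual workflow the slabs would be separate argo steps writing separate outputs rather than a Python loop):

```python
# Illustration only: because the cleaning is spatially independent, running it
# on latitude slabs and concatenating gives the same result as one big run.
import xarray as xr
from dodola.core import standardize_gcm  # signature simplified below

def clean_in_slabs(ds: xr.Dataset, n_slabs: int = 4) -> xr.Dataset:
    edges = [round(i * ds.sizes["lat"] / n_slabs) for i in range(n_slabs + 1)]
    pieces = [
        standardize_gcm(ds.isel(lat=slice(lo, hi)))  # existing cleaning, per slab
        for lo, hi in zip(edges[:-1], edges[1:])
    ]
    return xr.concat(pieces, dim="lat")
```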

Two other options won't work: increasing the resource limits, or restructuring standardize_cmip6 in dodola. The former won't help because the problem is too severe; the latter has the same issue and would also require a lot of rewriting.

Note that fixing this issue would let us bring in data from 4 models of this consortium.

[Edit: updated some information and clarified]

emileten commented Feb 16, 2022

Oh ok. I think I understand better what happened here.

  • I was puzzled half an hour ago by the fact that some precip EC models had already passed the cleaning stage in the past. In fact, I moved the models' card back on the project board -- @brews it seems you had run the precip cleaning steps for these models just fine, for example in this workflow: https://argo.cildc6.org/archived-workflows/default/769b4e04-61e5-4efc-a409-ba480909a292. Why is it complaining about memory now? Answer below.
  • It was using dodola 0.8.0. dodola 0.8.0 was not yet loading the data into memory in dodola.services.standardize_cmip6 before running dodola.core.standardize_gcm. We introduced that loading later, as a necessary patch to be able to convert from 360-day calendars (which requires data that is not chunked across time), in this PR: Fix 360 day calendar conversion chunk errors from dodola.services.clean_cmip6 dodola#151
  • We didn't realize it, but this PR broke the EC precip cleaning step.

The only step requiring the absence of temporal chunks is the 360-day calendar conversion, though. Therefore, I'm suggesting we move the data loading to that specific spot in the code. It's a very easy change and it restores the backward compatibility broken by that PR. The only downside is that it introduces chunking concerns into dodola.core. We already have some there, though...
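
A minimal sketch of the suggested change, assuming the conversion lives in a dodola.core helper (the name, signature, and the exact xclim call are illustrative, not the real function):

```python
# Sketch only: load the data right before the 360-day calendar conversion,
# instead of loading the whole dataset up front in dodola.services.
import xarray as xr
from xclim.core.calendar import convert_calendar

def convert_360day_calendar(ds: xr.Dataset) -> xr.Dataset:
    # The conversion cannot handle data chunked across time, so load here,
    # and only for datasets that actually come on a 360_day calendar.
    ds = ds.load()
    return convert_calendar(ds, target="standard", align_on="random")
```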

@emileten emileten changed the title large EC-Earth3 pr and EC-Earth3-Veg data fails at CMIP6 cleaning due to OOM error large EC-Earth data fails at CMIP6 cleaning due to OOM error Feb 17, 2022
@emileten

As I expected, two additional EC-Earth models failed due to this (EC-Earth3-AerChem and EC-Earth3-CC).
