large EC-Earth data fails at CMIP6 cleaning due to OOM error #574
An additional detail. In
If we're willing to spend time on this, I see only one acceptable option: split the
Two other options that won't work are: increasing the resource limits, or restructuring
Note that fixing this issue would allow us to let in data from 4 models of this consortium. [Edit: updated some information and clarified]
Oh ok. I think I understand better what happened here.
The only step that requires the absence of temporal chunks is the 360-day calendar conversion, though. Therefore, I am suggesting we move the data loading to that specific location in the code. It's a super easy change and it fixes the backward compatibility of that breaking PR. The only downside is that it introduces chunking concerns in
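A minimal, library-free sketch of that ordering (every name below is a hypothetical stand-in for the real pipeline): keep the data lazy through the chunk-friendly cleaning steps and materialize it only right before the calendar conversion.

```python
def load(lazy_chunks):
    # Stand-in for ds.load(): materialize lazy chunks into one array.
    return [x for chunk in lazy_chunks for x in chunk]

def convert_360_day_calendar(data):
    # Stand-in for the real conversion; it needs the full time axis,
    # so it is the one step that cannot run on temporal chunks.
    return data

def clean_step(chunk):
    # Stand-in for an earlier cleaning step that works chunk-by-chunk.
    return chunk

def standardize_gcm(lazy_chunks):
    lazy_chunks = [clean_step(c) for c in lazy_chunks]  # stays lazy
    data = load(lazy_chunks)  # load as late as possible...
    return convert_360_day_calendar(data)  # ...for the one eager step

print(standardize_gcm([[1, 2], [3, 4], [5]]))  # [1, 2, 3, 4, 5]
```

The point is only the ordering: everything before the conversion keeps the low per-chunk memory footprint, and the full-array cost is paid exactly once, where it is unavoidable.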
As I expected, two additional
Blocks progress on #263 and #266.
Workflow: https://argo.cildc6.org/archived-workflows/default/28a83ec8-998f-4d61-96f1-73f57387d3e7
One can look at the `standardize_gcm` step in cleaning: each and every retry failed due to OOM errors. I picked the input of one of these failed pods and reproduced the OOM on JupyterHub with a 48 GB server, which is the resource limit specified for this pod in our Argo workflow.
In `standardize_gcm` we load the data into memory. EC-Earth3 `pr` is 256 × 512 × time, so higher resolution than other models, but that's still only ~16 GB for the future data. The problem is that we have operations in `standardize_gcm` that make the memory usage blow up, I think.
Another model of this family, with a lower resolution, that is running here:
https://argo.cildc6.org/workflows/default/e2e-ec-earth3-veg-lr-pr-t8stn?tab=workflow&nodeId=e2e-ec-earth3-veg-lr-pr-t8stn-1904431442
nearly crashed at the same steps for the same reason, but survived thanks to retries.
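For reference, the ~16 GB figure checks out with a quick back-of-envelope calculation (the 86-year daily future period, 2015–2100, is an assumption for illustration):

```python
# 256 x 512 grid, daily float32 "pr" over a hypothetical 86-year
# (2015-2100) future period, ignoring leap days.
n_lat, n_lon = 256, 512
n_time = 86 * 365  # ~31,390 daily time steps
itemsize = 4       # bytes per float32 value

nbytes = n_lat * n_lon * n_time * itemsize
print(f"{nbytes / 1024**3:.1f} GiB")  # 15.3 GiB
```

At ~15–16 GiB for the raw array, a roughly 3× blow-up from intermediate copies already approaches the 48 GB pod limit, which is consistent with the OOM behaviour described above.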