
HadGEM3-GC31-MM tasmin, tasmax, pr e2e runs fail at cleaning due to OOM error #589

Closed

emileten opened this issue Feb 24, 2022 · 10 comments

@emileten
Contributor

emileten commented Feb 24, 2022

Workflows:

https://argo.cildc6.org/workflows/default/e2e-hadgem3-gc31-mm-tasmax-c9v8q?tab=workflow
https://argo.cildc6.org/workflows/default/e2e-hadgem3-gc31-mm-pr-d8v6x?tab=workflow

This model has a particularly high resolution, with 324 latitude bands and 432 longitude bands, versus 144 and 192 respectively for its low-resolution equivalent HadGEM3-GC31-LL, which we ran successfully (roughly five times as many grid cells).

Blocks progress on #586, #587, #225

@emileten
Contributor Author

This model has a slightly higher resolution than the EC-Earth high-resolution models, which caused similar issues in #574. We solved those, though.

@emileten
Copy link
Contributor Author

Ah, I see. HadGEM3-GC31-MM uses a 360-day calendar, and in that corner case we load the data into memory in dodola.core.standardize_gcm. This is the cause of this issue.
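
Roughly, the corner case has this shape (a minimal sketch, not dodola's actual standardize_gcm code; the path, target calendar, xclim call, and interpolation step are assumptions about what a 360-day conversion generally looks like):

```python
import numpy as np
import xarray as xr
from xclim.core.calendar import convert_calendar, get_calendar

# Hypothetical sketch -- not dodola.core.standardize_gcm itself.
ds = xr.open_zarr("gs://<bucket>/timesliced.zarr")  # placeholder path

if get_calendar(ds.time) == "360_day":
    # Converting a 360-day calendar to a standard one inserts missing days
    # that then get filled along time. Doing that lazily blows up the dask
    # graph, so the data gets loaded eagerly first -- which is what OOMs at
    # HadGEM3-GC31-MM's 324 x 432 grid.
    ds = ds.load()
    ds = convert_calendar(ds, "noleap", align_on="random", missing=np.nan)
    ds = ds.interpolate_na(dim="time", method="linear")
```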

@emileten
Contributor Author

emileten commented Feb 24, 2022

@brews I think, if we're willing to solve that, the cheapest way is to rechunk the data before and after the standardize-cmip6 step instead of loading it into memory (this requires small changes both in dodola and here in this repo).

What do you think? Tagging you in particular because you had an opinion on this in ClimateImpactLab/dodola#150. What I am thinking of now, in contrast to what I suggested back then, is not to rechunk within dodola's standardize_gcm function but in separate workflow steps, before and after -- like we do in other parts of the workflow -- using dodola.cli.rechunk.
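
Concretely, something like this ordering (a rough sketch of the proposal, reusing the existing dodola rechunk service; the paths and chunk sizes below are placeholders, not a tested configuration):

```python
from dodola import services

# Sketch of the proposed step ordering; paths and chunk sizes are placeholders.

# 1. Rechunk the raw, timesliced store into small spatial chunks so the
#    standardize step doesn't need the whole grid in memory at once.
services.rechunk(
    "gs://<scratch>/timesliced.zarr",
    target_chunks={"time": -1, "lat": 10, "lon": 10},
    out="gs://<scratch>/rechunked-for-standardize.zarr",
)

# 2. Run the standardize-cmip6 step on the rechunked store.
# 3. Rechunk again afterwards into whatever layout the downstream cleaning
#    steps expect, instead of letting standardize_gcm load everything eagerly.
```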

@brews
Member

brews commented Feb 24, 2022

@emileten Hmmm... as you likely know, the hard thing here is that we clean and standardize the raw data so that it's in good enough shape (i.e. consistent enough) for us to do things like rechunking without failing on issues like unexpected variable/coord/dim names, etc. And it wasn't a problem until now because the standardizing step was relatively cheap.

The other thing is that I think rechunking is often done after 1x1 regridding... so the data is a conveniently small, standardized size that fits in memory when it happens — making it a relatively fast and reliable operation.

Have you just run the rechunk workflowtemplate on the raw HadGEM3-GC31-MM? Can you get it in the needed chunks without OOM errors? (I realize I don't even know HadGEM3-GC31-MM's native size on disk or resolution.)

@emileten
Contributor Author

@brews thanks! I haven't. That's a good suggestion, let me at least try that and see how it goes.

@emileten
Contributor Author

emileten commented Feb 24, 2022

Hm, yes, indeed it's not completely straightforward. I need to play a bit with these spatial bounds just like we do here, but unfortunately before 'standardizing' the data...

Getting this error in this workflow:

INFO:dodola.services:Starting dodola service rechunk with args=('gs://scratch-170cd6ec/78defc15-47bd-4a60-b720-8731e7ec7bcc/e2e-hadgem3-gc31-mm-tasmax-c9v8q-1268986208/timesliced.zarr',), kwargs={'target_chunks': {'time': -1, 'lat': 10, 'lon': 10}, 'out': 'gs://scratch-170cd6ec/d51f3b4b-4677-4ee0-b43d-7b101969987b/rechunk-9rwzg-3636228586/rechunked.zarr'})
INFO:dodola.repository:Read gs://scratch-170cd6ec/78defc15-47bd-4a60-b720-8731e7ec7bcc/e2e-hadgem3-gc31-mm-tasmax-c9v8q-1268986208/timesliced.zarr
Traceback (most recent call last):
  File "/opt/conda/bin/dodola", line 33, in <module>
    sys.exit(load_entry_point('dodola', 'console_scripts', 'dodola')())
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/opt/dodola/dodola/cli.py", line 543, in rechunk
    services.rechunk(
  File "/opt/dodola/dodola/services.py", line 31, in service_logger
    func(*args, **kwargs)
  File "/opt/dodola/dodola/services.py", line 583, in rechunk
    storage.write(out, ds)
  File "/opt/dodola/dodola/repository.py", line 69, in write
    x.to_zarr(url_or_path, mode="w", compute=True)
  File "/opt/conda/lib/python3.9/site-packages/xarray/core/dataset.py", line 2031, in to_zarr
    return to_zarr(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/api.py", line 1414, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/api.py", line 1124, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 555, in store
    self.set_variables(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 602, in set_variables
    encoding = extract_zarr_variable_encoding(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 246, in extract_zarr_variable_encoding
    chunks = _determine_zarr_chunks(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 174, in _determine_zarr_chunks
    raise NotImplementedError(
NotImplementedError: Specified zarr chunks encoding['chunks']=(432, 2) for variable named 'lon_bnds' would overlap multiple dask chunks ((10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 2), (2,)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
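
For reference, the error's own suggestion points at a minimal workaround: clear the stale chunk encoding on the bounds variables before writing. A hedged sketch, not what our workflow actually does, with placeholder paths:

```python
import xarray as xr

# Workaround sketch; paths are placeholders.
ds = xr.open_zarr("gs://<scratch>/timesliced.zarr")
ds = ds.chunk({"time": -1, "lat": 10, "lon": 10})

# The raw store carries encoding["chunks"] = (432, 2) on lon_bnds (and
# similarly on the other bounds variables), which conflicts with the new
# dask chunking at write time. Dropping the stale encoding lets xarray
# derive the zarr chunks from the dask chunks instead.
for v in ("lat_bnds", "lon_bnds", "time_bnds"):
    if v in ds:
        ds[v].encoding.pop("chunks", None)

ds.to_zarr("gs://<scratch>/rechunked.zarr", mode="w")
```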

@brews
Member

brews commented Feb 24, 2022

Yeah, and I'm pretty certain that this would error from other raw GCM input, too.

You know more about the 360-day calendar conversion implementation than I do, @emileten. Do you feel like this particular conversion is something that we might be able to make chunk friendly or do you feel like this is too much of a pain? (...I might have already asked you this for another issue...)

@emileten
Contributor Author

@brews yes, I think it would be a lot of work.

We decided to abandon these models.

@brews
Member

brews commented Feb 24, 2022

Thanks, @emileten. I also tried bumping the container memory for this standardizing step up from ~40GiB to 68GiB and it still gets an OOM error, so a small memory bump wasn't a quick fix, either.

@emileten
Contributor Author

Yeah, this is a lot of data and we're doing various things with it, in particular these xclim operations.
