
HadGEM3-GC31-MM tasmin, tasmax, pr e2e runs fail at cleaning due to OOM error #589

Closed

emileten opened this issue Feb 24, 2022 · 10 comments

@emileten
Contributor

emileten commented Feb 24, 2022

Workflows:

https://argo.cildc6.org/workflows/default/e2e-hadgem3-gc31-mm-tasmax-c9v8q?tab=workflow
https://argo.cildc6.org/workflows/default/e2e-hadgem3-gc31-mm-pr-d8v6x?tab=workflow

This model has a particularly high resolution, with 324 latitude bands and 432 longitude bands, versus 144 and 192 respectively for its low-resolution equivalent HadGEM3-GC31-LL, which we ran successfully (roughly five times as many grid cells).

Blocks progress on #586, #587, #225

@emileten
Contributor Author

This model has a slightly higher resolution than the EC-Earth high-resolution models, which caused similar issues in #574. We solved those, though.

@emileten
Copy link
Contributor Author

Ah, I see. HadGEM3-GC31-MM uses a 360-day calendar, and in that corner case we load the data into memory in dodola.core.standardize_gcm. This is the cause of this issue.
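
Roughly, the corner case has this shape (a minimal sketch, not dodola's actual standardize_gcm code; the path, target calendar, xclim call, and interpolation step are assumptions about what a 360-day conversion generally looks like):

```python
import numpy as np
import xarray as xr
from xclim.core.calendar import convert_calendar, get_calendar

# Hypothetical sketch -- not dodola.core.standardize_gcm itself.
ds = xr.open_zarr("gs://<bucket>/timesliced.zarr")  # placeholder path

if get_calendar(ds.time) == "360_day":
    # Converting a 360-day calendar to a standard one inserts missing days
    # that then get filled along time. Doing that lazily blows up the dask
    # graph, so the data gets loaded eagerly first -- which is what OOMs at
    # HadGEM3-GC31-MM's 324 x 432 grid.
    ds = ds.load()
    ds = convert_calendar(ds, "noleap", align_on="random", missing=np.nan)
    ds = ds.interpolate_na(dim="time", method="linear")
```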

@emileten
Contributor Author

emileten commented Feb 24, 2022

@brews I think, if we're willing to solve that, the cheapest way is to rechunk the data before and after the standardize-cmip6 step instead of loading it into memory (this requires small changes both in dodola and here in this repo).

What do you think? Tagging you in particular because you had an opinion on this in ClimateImpactLab/dodola#150. What I am thinking of now, in contrast to what I suggested back then, is not to rechunk within dodola's standardize_gcm function but in separate workflow steps, before and after -- like we do in other parts of the workflow -- using dodola.cli.rechunk.
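
Concretely, something like this ordering (a rough sketch of the proposal, reusing the existing dodola rechunk service; the paths and chunk sizes below are placeholders, not a tested configuration):

```python
from dodola import services

# Sketch of the proposed step ordering; paths and chunk sizes are placeholders.

# 1. Rechunk the raw, timesliced store into small spatial chunks so the
#    standardize step doesn't need the whole grid in memory at once.
services.rechunk(
    "gs://<scratch>/timesliced.zarr",
    target_chunks={"time": -1, "lat": 10, "lon": 10},
    out="gs://<scratch>/rechunked-for-standardize.zarr",
)

# 2. Run the standardize-cmip6 step on the rechunked store.
# 3. Rechunk again afterwards into whatever layout the downstream cleaning
#    steps expect, instead of letting standardize_gcm load everything eagerly.
```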

@brews
Member

brews commented Feb 24, 2022

@emileten Hmmm... as you likely know, the hard thing here is that we clean and standardize the raw data so that it's in good enough shape (i.e. consistent enough) for us to do things like rechunking without failing on issues like unexpected variable/coord/dim names, etc. And it wasn't a problem until now because the standardizing step was relatively cheap.

The other thing is that I think rechunking is often done after 1x1 regridding... so the data is a conveniently small, standardized size that fits in memory when it happens — making it a relatively fast and reliable operation.

Have you just run the rechunk workflowtemplate on the raw HadGEM3-GC31-MM? Can you get it in the needed chunks without OOM errors? (I realize I don't even know HadGEM3-GC31-MM's native size on disk or resolution.)

@emileten
Contributor Author

@brews thanks! I haven't. That's a good suggestion, let me at least try that and see how it goes.

@emileten
Contributor Author

emileten commented Feb 24, 2022

Hm, yes, indeed it's not completely straightforward. I need to play a bit with these spatial bounds just like we do here, but unfortunately before 'standardizing' the data...

Getting this error in this workflow:

INFO:dodola.services:Starting dodola service rechunk with args=('gs://scratch-170cd6ec/78defc15-47bd-4a60-b720-8731e7ec7bcc/e2e-hadgem3-gc31-mm-tasmax-c9v8q-1268986208/timesliced.zarr',), kwargs={'target_chunks': {'time': -1, 'lat': 10, 'lon': 10}, 'out': 'gs://scratch-170cd6ec/d51f3b4b-4677-4ee0-b43d-7b101969987b/rechunk-9rwzg-3636228586/rechunked.zarr'})
INFO:dodola.repository:Read gs://scratch-170cd6ec/78defc15-47bd-4a60-b720-8731e7ec7bcc/e2e-hadgem3-gc31-mm-tasmax-c9v8q-1268986208/timesliced.zarr
Traceback (most recent call last):
  File "/opt/conda/bin/dodola", line 33, in <module>
    sys.exit(load_entry_point('dodola', 'console_scripts', 'dodola')())
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/opt/dodola/dodola/cli.py", line 543, in rechunk
    services.rechunk(
  File "/opt/dodola/dodola/services.py", line 31, in service_logger
    func(*args, **kwargs)
  File "/opt/dodola/dodola/services.py", line 583, in rechunk
    storage.write(out, ds)
  File "/opt/dodola/dodola/repository.py", line 69, in write
    x.to_zarr(url_or_path, mode="w", compute=True)
  File "/opt/conda/lib/python3.9/site-packages/xarray/core/dataset.py", line 2031, in to_zarr
    return to_zarr(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/api.py", line 1414, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/api.py", line 1124, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 555, in store
    self.set_variables(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 602, in set_variables
    encoding = extract_zarr_variable_encoding(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 246, in extract_zarr_variable_encoding
    chunks = _determine_zarr_chunks(
  File "/opt/conda/lib/python3.9/site-packages/xarray/backends/zarr.py", line 174, in _determine_zarr_chunks
    raise NotImplementedError(
NotImplementedError: Specified zarr chunks encoding['chunks']=(432, 2) for variable named 'lon_bnds' would overlap multiple dask chunks ((10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 2), (2,)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
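
For reference, the error's own suggestion points at a minimal workaround: clear the stale chunk encoding on the bounds variables before writing. A hedged sketch, not what our workflow actually does, with placeholder paths:

```python
import xarray as xr

# Workaround sketch; paths are placeholders.
ds = xr.open_zarr("gs://<scratch>/timesliced.zarr")
ds = ds.chunk({"time": -1, "lat": 10, "lon": 10})

# The raw store carries encoding["chunks"] = (432, 2) on lon_bnds (and
# similarly on the other bounds variables), which conflicts with the new
# dask chunking at write time. Dropping the stale encoding lets xarray
# derive the zarr chunks from the dask chunks instead.
for v in ("lat_bnds", "lon_bnds", "time_bnds"):
    if v in ds:
        ds[v].encoding.pop("chunks", None)

ds.to_zarr("gs://<scratch>/rechunked.zarr", mode="w")
```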

@brews
Member

brews commented Feb 24, 2022

Yeah, and I'm pretty certain that this would error from other raw GCM input, too.

You know more about the 360-day calendar conversion implementation than I do, @emileten. Do you feel like this particular conversion is something that we might be able to make chunk friendly or do you feel like this is too much of a pain? (...I might have already asked you this for another issue...)

@emileten
Contributor Author

@brews yes, I think it would be a lot of work.

We decided to abandon these models.

@brews
Member

brews commented Feb 24, 2022

Thanks, @emileten. I also tried bumping the container memory for this standardizing step up from ~40GiB to 68GiB and it still gets an OOM error, so a small memory bump wasn't a quick fix, either.

@emileten
Contributor Author

Yeah, this is a lot of data and we're doing various things with it, in particular these xclim operations.
