
rechunk along time dim before cal conversion #150

Conversation

@emileten (Contributor) commented Dec 9, 2021

Solving #149, from ClimateImpactLab/downscaleCMIP6#436.

Interpolation along the time dimension was impossible because in production our datasets are chunked along time in standardize_gcm.

I changed xclim_convert_360day_calendar_interpolate so that it gathers the time chunks into a single chunk when interpolation is requested. It doesn't rechunk back afterwards; standardize_gcm does that anyway.

The other option was to .load() all the data into memory. My reasoning was that this computation doesn't justify loading everything into memory. But let me know if my judgment is off, @brews.

I manually tested that this solves the issue with the following snippet on my local machine. @brews, it seems to me we're not accounting for chunking in any of our automated tests. Did I miss something? [EDIT] I think I might have missed something that would have caught this: a version of the function in services, interacting with the repository?

import numpy as np
import dask.distributed
import xarray as xr
import pytest

cluster = dask.distributed.LocalCluster()
client = dask.distributed.Client(cluster)

# Build a dataset with a NaN to interpolate over.
np_da = np.random.rand(1000, 1000)
np_da[100, 100] = np.nan
xr_da = xr.DataArray(np_da, coords={'dim1': range(1000), 'dim2': range(1000)}, dims=['dim1', 'dim2'])
xr_ds = xr.Dataset({'var': xr_da})

# Interpolation fails when the data is chunked along the interpolation dimension.
xr_ds_chunked = xr_ds.chunk(chunks={'dim1': 10, 'dim2': 10})
with pytest.raises(ValueError):
    xr_ds_chunked.interpolate_na('dim1')

# Load into memory, then rechunk so dim1 is a single chunk; interpolation now works.
xr_ds_chunked.load()
xr_ds_rechunked = xr_ds_chunked.chunk(chunks={'dim1': -1})
xr_ds_rechunked_interp = xr_ds_rechunked.interpolate_na('dim1')
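The change inside xclim_convert_360day_calendar_interpolate can be sketched roughly like this (a minimal illustration with a made-up variable name and dataset, not the actual dodola code):

```python
import numpy as np
import xarray as xr

# Illustrative dataset chunked along time, mimicking standardize_gcm output.
data = np.arange(100, dtype=float)
data[50] = np.nan  # a gap that interpolation should fill
ds = xr.Dataset({"var": ("time", data)}, coords={"time": np.arange(100)})
ds = ds.chunk({"time": 10})

# Gather all time chunks into a single chunk before interpolating;
# interpolate_na raises ValueError on data chunked along "time".
ds = ds.chunk({"time": -1})
filled = ds.interpolate_na(dim="time")
```

Rechunking with `{"time": -1}` keeps the data lazy (dask-backed) while making interpolation along time possible.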

@emileten emileten requested a review from brews December 9, 2021 13:38
@emileten emileten self-assigned this Dec 9, 2021
@brews (Member) commented Dec 9, 2021

@emileten Cool, thanks for this work. I see you're suggesting the fix for this is that we need time to be in a single chunk.

We've generally not forced chunking in services and core, and instead forced people to chunk explicitly before running whatever they need (because rechunking can be very, very expensive -- more expensive than our downscaling and bias-correction methods)... So we manually rechunk before regridding, QDM bias correction, etc. What's more, rechunking in this step means we might cause chunking problems in a later step if we're not careful. It usually requires a bit of tuning to get things right.

Before we merge this PR, let me see if I can get around this by manually rechunking before this step... I'm not sure how well this will work because the data at this point has not yet been through standardizing...

@brews (Member) commented Dec 9, 2021

@emileten Okay, based on your solution, I have another approach in #151 that I think solves the problem while staying consistent with the other dodola services (i.e. not rechunking).

So, we don't do rechunking as a side effect of other dodola services (except the rechunking service itself), but some of the dodola.services have used .load() to pull the entire dataset into memory (converting it from a dask array to a numpy array). And I think this will preserve the original chunking when the data is saved back to a zarr store, so we won't need to tweak other stages.
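The .load()-based alternative can be sketched like this (hypothetical minimal example; the zarr path and encoding below are assumptions for illustration, not the actual #151 code):

```python
import numpy as np
import xarray as xr

# Illustrative dask-backed dataset with a gap along time.
data = np.arange(100, dtype=float)
data[10] = np.nan
ds = xr.Dataset({"var": ("time", data)}, coords={"time": np.arange(100)})
ds = ds.chunk({"time": 10})

# .load() converts the dask arrays to in-memory numpy arrays,
# so interpolation no longer cares about chunk boundaries.
ds = ds.load()
filled = ds.interpolate_na(dim="time")

# Writing back to zarr with an explicit chunk encoding can preserve
# the original chunking (illustrative; path and encoding are assumed):
# filled.to_zarr("out.zarr", encoding={"var": {"chunks": (10,)}})
```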

@brews (Member) commented Dec 9, 2021

@emileten I'm closing this and marking it "duplicate" because we have an alternative solution in PR #151. Thanks for submitting this though, it was important to get past this bug.

@brews brews closed this Dec 9, 2021
@brews brews added duplicate This issue or pull request already exists invalid This doesn't seem right labels Dec 9, 2021
Linked issues: xclim_convert_360day_calendar_interpolate does not support data chunked across time