
[Bug]: decode_times=True as default triggers error when open netcdf with no time dimension #396

Closed
lee1043 opened this issue Dec 10, 2022 · 14 comments · Fixed by #409
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@lee1043
Collaborator

lee1043 commented Dec 10, 2022

What happened?

Opened a 2-D (lat/lon) field stored in a netCDF file using xcdat. An error is triggered unless the decode_times=False option is given.

[Screenshot: KeyError traceback from xc.open_dataset; full log output below]

What did you expect to happen?

Because the dataset does not have a time dimension and has only lat/lon dimensions, I was expecting decode_times not to be activated even though it is the default option.

Minimal Complete Verifiable Example

import xcdat as xc
target_grid = xc.open_dataset('target_grid.nc')

Relevant log output

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[30], line 2
      1 import xcdat as xc
----> 2 target_grid = xc.open_dataset('target_grid.nc')

File ~/.conda/envs/pcmdi_metrics_dev_12_20221208/lib/python3.9/site-packages/xcdat/dataset.py:103, in open_dataset(path, data_var, add_bounds, decode_times, center_times, lon_orient, **kwargs)
    100 ds = xr.open_dataset(path, decode_times=False, **kwargs)  # type: ignore
    102 if decode_times:
--> 103     ds = decode_time(ds)
    105 ds = _postprocess_dataset(ds, data_var, center_times, add_bounds, lon_orient)
    107 return ds

File ~/.conda/envs/pcmdi_metrics_dev_12_20221208/lib/python3.9/site-packages/xcdat/dataset.py:308, in decode_time(dataset)
    305 coord_keys = _get_all_coord_keys(ds, "T")
    307 if len(coord_keys) == 0:
--> 308     raise KeyError(
    309         "Unable to map to time coordinates in this dataset to perform decoding. "
    310         "Make sure that the time coordinates have the CF 'axis' or 'standard_name' "
    311         "attribute set (e.g., ds['time'].attrs['axis'] = 'T' or "
    312         "ds['time'].attrs['standard_name'] = 'time'), and try decoding again. "
    313     )
    315 for key in coord_keys:
    316     coords = ds[key].copy()

KeyError: "Unable to map to time coordinates in this dataset to perform decoding. Make sure that the time coordinates have the CF 'axis' or 'standard_name' attribute set (e.g., ds['time'].attrs['axis'] = 'T' or ds['time'].attrs['standard_name'] = 'time'), and try decoding again. "

Anything else we need to know?

Example input included here (uncompress to use):
target_grid.nc.zip

Environment

xcdat 0.4.0

@lee1043 lee1043 added the type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. label Dec 10, 2022
@pochedls
Collaborator

@lee1043 – do you have a dev environment set up? I'm wondering if you could fix this by adding some logic in open_dataset/open_mfdataset to set decode_times=False if there is no time axis present in the dataset?

@tomvothecoder
Collaborator

Thanks @lee1043.

_has_cf_compliant_time() checks for the existence of CF compliant/non-CF compliant time units. If neither exists, it returns None.

xcdat/xcdat/dataset.py

Lines 461 to 507 in 36a3539

def _has_cf_compliant_time(paths: Paths) -> Optional[bool]:
    """Checks if a dataset has time coordinates with CF compliant units.

    If the dataset does not contain a time dimension, None is returned.
    Otherwise, the units attribute is extracted from the time coordinates to
    determine whether it is CF or non-CF compliant.

    Parameters
    ----------
    paths : Union[str, pathlib.Path, List[str], List[pathlib.Path], \
        List[List[str]], List[List[pathlib.Path]]]
        Either a file (``"file.nc"``), a string glob in the form
        ``"path/to/my/files/*.nc"``, or an explicit list of files to open.
        Paths can be given as strings or as pathlib Paths. If concatenation
        along more than one dimension is desired, then ``paths`` must be a
        nested list-of-lists (see ``combine_nested`` for details). (A string
        glob will be expanded to a 1-dimensional list.)

    Returns
    -------
    Optional[bool]
        None if the time dimension does not exist, True if CF compliant, or
        False if non-CF compliant.

    Notes
    -----
    This function only checks one file for multi-file datasets to optimize
    performance because it is slower to combine all files and then check for
    CF compliance.
    """
    first_path = _get_first_path(paths)
    ds = xr.open_dataset(first_path, decode_times=False)  # type: ignore

    if ds.cf.dims.get("T") is None:
        return None

    time = ds.cf["T"]

    # If the time units attr cannot be split, it is not cf_compliant.
    try:
        units = _split_time_units_attr(time.attrs.get("units"))[0]
    except ValueError:
        return False

    cf_compliant = units not in NON_CF_TIME_UNITS

    return cf_compliant

If _has_cf_compliant_time() returns None or True, we use xr.open_dataset(decode_times=True...).

xcdat/xcdat/dataset.py

Lines 99 to 105 in 36a3539

cf_compliant_time: Optional[bool] = _has_cf_compliant_time(path)
# xCDAT attempts to decode non-CF compliant time coordinates.
if cf_compliant_time is False:
ds = xr.open_dataset(path, decode_times=False, **kwargs) # type: ignore
ds = decode_non_cf_time(ds)
else:
ds = xr.open_dataset(path, decode_times=True, **kwargs) # type: ignore

I think by removing this check in 0.4.0, we assume time dimension(s) must exist when decode_times=True. We'll need to add a similar (but simpler) check for the time dimension.

Example:

def _has_time_dim(ds):
    coord_keys = _get_all_coord_keys(ds, "T")

    if len(coord_keys) == 0:
        return False

    return True

@lee1043
Collaborator Author

lee1043 commented Dec 12, 2022

@pochedls yes, I do have a dev setup and I can give it a shot. However, some strategic thought might be needed here.

open_dataset calls decode_time, which then checks whether a time coordinate is included. If not, it raises an error.

Then the question is how to make the code know whether (1) the time coordinate was intentionally omitted, or (2) the dataset is supposed to have time but somehow does not, which would justify raising the error message.

I am thinking about how that can be decided in the code. If anyone has a good idea, I'd welcome hearing it.

@tomvothecoder do you have any thoughts?

@lee1043
Collaborator Author

lee1043 commented Dec 12, 2022

@tomvothecoder our postings crossed! Thanks for adding your comment. I am studying it.

@tomvothecoder
Collaborator

tomvothecoder commented Dec 12, 2022

Of course @lee1043! Thanks for taking a look into it.

decode_time() is also a standalone API so we need to keep the error checking.

We just have to implement something like _has_time_dim() in open_dataset() to prevent decode_time() from being called if the time dimension doesn't exist.

Example:

def open_dataset():
    ...
    if decode_times and _has_time_dim(ds):
        ds = decode_time(ds)

This logic also needs to be implemented in open_mfdataset().

@lee1043
Collaborator Author

lee1043 commented Dec 12, 2022

@tomvothecoder Thanks! But what if the loaded dataset is actually a time series but has no standard time axis? If so, wouldn't checking _has_time_dim before calling decode_time prevent the error message about the standard name from being raised?

@tomvothecoder
Collaborator

tomvothecoder commented Dec 12, 2022

@tomvothecoder Thanks! But what if the loaded dataset is actually a time series but has no standard time axis? If so, wouldn't checking _has_time_dim before calling decode_time prevent the error message about the standard name from being raised?

Hmmm I think I see what you're saying in this case.

_has_time_dim() returns False if the dataset has a time dimension without CF metadata (e.g., standard_name, axis).

We would need to figure out how to handle cases where datasets have time dims without CF metadata.
We should check how xarray handles this, since it also defaults to decode_times=True. If I recall correctly, if xarray can't find the time dimension then it won't decode and does not raise an error/warning.

@lee1043
Collaborator Author

lee1043 commented Dec 12, 2022

@tomvothecoder I agree. Opening the file I attached above via xarray didn't return any time-related error. I will investigate how xarray checks for that.

@pochedls
Collaborator

@lee1043 - I thought this might be an easy two-line fix... I guess it is more complicated. Thanks for looking into this issue!

@lee1043
Collaborator Author

lee1043 commented Dec 13, 2022

We would need to figure out how to handle cases where datasets have time dims without CF metadata.
We should check how xarray handles this, since it also defaults to decode_times=True. If I recall correctly, if xarray can't find the time dimension then it won't decode and does not raise an error/warning.

In xarray, decode_times is False (actually None) by default (described here), so it just loads the dataset with the given coordinates unless the user specifies decode_times=True.

I think the complexity here is in distinguishing between a dataset completely without a time dimension and a dataset with an incorrect time dimension, which is kind of obvious to people but not to a computer. Still thinking about what would be a simple and efficient way of doing it...

@tomvothecoder
Collaborator

tomvothecoder commented Dec 13, 2022

In xarray, decode_times is False (actually None) by default (described here), so it just loads the dataset with the given coordinates unless the user specifies decode_times=True.

Thanks for finding decode_times=None in xarray, my mistake!

At one point I recall removing decode_times=True because xarray defaults to None and I preferred explicit usage of this function parameter. The team had a discussion about adding it back because xcdat temporal APIs require decoded time.

The questions we might want to investigate:

  1. How should we handle decode_times=True in open_dataset()/open_mfdataset() if the dataset has time dimension(s) without CF metadata?
    • Currently, we can't map to these time dimensions since CF metadata is missing. As a result, no specific warning/error can be raised if decode_times=True even though a time dimension does exist.
  2. Should we consider decode_times=None like xarray?
    • In the current version, xcdat temporal APIs are now robust enough to catch when times are not decoded. In particular, the temporal accessor class throws an error with a suggestion for the user to make sure a time dimension exists with CF metadata and to use the decode_time() API manually.
    • The user has more control over API usage since they have to be explicit about the settings they pass.
    • Notes from the 1-18-23 meeting from Steve: Try decode_times=None and, if None, detect the T axis and attempt decoding (check if xarray does this too). Can be overridden with decode_times=False.

I'll take some more time to think about this too.

@tomvothecoder
Collaborator

tomvothecoder commented Jan 24, 2023

I think the most reasonable solution is to set decode_times=None by default with no additional logic for implicit decoding.

Behaviors:

  • If decode_times=None or decode_times=False, no time decoding operations are performed
  • If user passes decode_times=True:
    1. Dataset does not have time coordinates at all -> raises KeyError (xarray normally does nothing in this case)
    2. Dataset time coordinates cannot be mapped to (missing CF attributes such as standard_name or axis) -> raises KeyError
    3. Dataset has time coordinates with CF attributes -> decodes time coordinates

Benefits of this approach:

Drawbacks of this approach:

  • No automatic decoding of time coordinates that have CF attributes
  • If time coordinates exist without CF metadata, users have to open the dataset with decode_times=None or decode_times=False, add the CF attributes to the time coordinates, then call decode_time() before calling xCDAT temporal operations.
    • xCDAT temporal averaging APIs will inform the user that they need decoded time coordinates first.
    • This is also the normal workflow for xarray, which provides decode_cf() to handle decoding time.

This approach would be considered a breaking change.
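The three behaviors above can be sketched with a toy decode_time and dict-based stand-in datasets (hypothetical code, not the actual xcdat implementation):

```python
def decode_time(ds):
    """Toy decoder: raises KeyError when no mappable time coordinates exist."""
    if "time" not in ds or ds["time"].get("axis") != "T":
        raise KeyError(
            "Unable to map to time coordinates in this dataset to perform decoding."
        )
    ds = dict(ds)  # shallow copy so the input is left untouched
    ds["time"] = dict(ds["time"], decoded=True)
    return ds

def open_dataset_sketch(ds, decode_times=None):
    if decode_times:  # None and False both skip decoding
        ds = decode_time(ds)
    return ds

grid_only = {"lat": {"axis": "Y"}, "lon": {"axis": "X"}}
timeseries = {"time": {"axis": "T"}, "lat": {"axis": "Y"}}

open_dataset_sketch(grid_only)                      # default: no decoding, no error
open_dataset_sketch(timeseries, decode_times=True)  # case 3: decodes
try:
    open_dataset_sketch(grid_only, decode_times=True)  # case 1: raises KeyError
except KeyError:
    pass
```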

@pochedls
Collaborator

I propose an alternate approach:

  • decode_times still defaults as True (no breaking change)
  • In open_dataset and open_mfdataset we check if a time coordinate exists. If no (detectable) time coordinate exists, we modify decode_times to None and print a warning
  • Modify the documentation to say that decode_times defaults to True unless no time coordinate is detected, in which case it is set to None

This approach would allow us to keep the existing API / syntax, is likely what users want to happen in 99% of situations, and won't throw an inconvenient error (but we're not silencing it, we're avoiding it!). Also, xarray isn't meant to be a climate-specific tool, so it makes sense for it to have None by default. For xCDAT, I think the situation is different – we want to readily detect and handle time as appropriate.

@tomvothecoder
Collaborator

tomvothecoder commented Jan 24, 2023

I propose an alternate approach:

You make some great points and I like this alternative approach. Basically, instead of raising a KeyError, we raise a logger warning with the KeyError message. This involves minimal effort and no API changes.

If we expect time coordinates to be in most datasets, we should also expect this warning to be rarely raised.

I just opened PR #409 using this approach.
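A rough sketch of that warn-instead-of-raise fallback (hypothetical; not the actual PR code), reusing a toy decoder that raises KeyError when no time coordinates can be mapped:

```python
import logging

logger = logging.getLogger("xcdat_sketch")

def decode_time(ds):
    """Toy decoder: raises KeyError when no time coordinates can be mapped."""
    if "time" not in ds:
        raise KeyError("Unable to map to time coordinates in this dataset.")
    return dict(ds, time_decoded=True)

def open_dataset_sketch(ds, decode_times=True):
    """Keep decode_times=True as the default, but warn instead of raising."""
    if decode_times:
        try:
            ds = decode_time(ds)
        except KeyError as err:
            logger.warning(str(err))  # surface the problem without failing
    return ds

grid_only = {"lat": [0.0], "lon": [0.0]}
timeseries = {"time": [0, 1, 2], "lat": [0.0]}
print(open_dataset_sketch(grid_only).get("time_decoded"))   # None (warned, not raised)
print(open_dataset_sketch(timeseries).get("time_decoded"))  # True
```

With this shape, a grid-only file like the one attached above opens cleanly under the default settings, while datasets with mappable time coordinates are still decoded.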
