Support for N axis dimensions mapped to N coordinates #343
Conversation
Codecov Report — Base: 100.00% // Head: 100.00% // No change to project coverage 👍

@@ Coverage Diff @@
##            main     #343   +/- ##
=========================================
  Coverage  100.00%  100.00%
=========================================
  Files          13       13
  Lines        1202     1242   +40
=========================================
+ Hits         1202     1242   +40
My initial self-review of the PR.
Hi @pochedls, this PR is ready for review. We were spot-on with our estimate of about a month to get this ready.
Let me know if you would like me to tag others for review since this is a big PR, and I will also do a self-review. I marked areas of interest under this comment if you'd like to look over the implementation details.
Some helpful review resources:
- Python script using the sample IPSL and E3SM datasets from the GH issues, sanity check: https://gist.github.com/tomvothecoder/7ae3b9ef34a4c9e8257eb71d9cb51866
- Follow the order of Phases of the Algorithm Design in the feature spec document to streamline the review process.
Thanks!
I am still reviewing, but I keep bumping into one major issue:
I think this issue is related to a singleton coordinate. If I open with xarray, things work okay.
docs/index.rst (outdated)

* Optional selection of single data variable to keep in the Dataset (bounds are also
  kept if they exist)
I'm happily surprised this is re-implemented – looking forward to checking this out.
This specific option has been available in open_dataset()/open_mfdataset(), but not documented in the features list:

Lines 51 to 53 in 36a3539

    data_var: Optional[str], optional
        The key of the non-bounds data variable to keep in the Dataset,
        alongside any existing bounds data variables, by default None.
I think you're referring to xcdat guessing the data variable to use in single variable datasets.
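To make the `data_var` behavior discussed above concrete, here is a minimal sketch of the described filtering rule (keep one non-bounds variable plus any existing bounds variables). The function name, the plain-dict "dataset", and the `bounds_map` helper are illustrative stand-ins, not xcdat's actual implementation.

```python
def keep_data_var(variables, bounds_map, data_var):
    """Keep ``data_var`` plus all bounds variables, dropping other data vars.

    ``variables`` maps variable name -> data; ``bounds_map`` maps coordinate
    name -> bounds variable name (mimicking the CF ``bounds`` attribute).
    """
    if data_var not in variables:
        raise KeyError(f"{data_var!r} not found in dataset")
    keep = {data_var} | set(bounds_map.values())
    return {k: v for k, v in variables.items() if k in keep}

ds_vars = {"tas": [280.0], "pr": [1.2], "lat_bnds": [[-90, 90]], "time_bnds": [[0, 1]]}
bounds = {"lat": "lat_bnds", "time": "time_bnds"}
subset = keep_data_var(ds_vars, bounds, "tas")
# "tas" and both bounds variables are kept; "pr" is dropped.
```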
Yes – the results should be the same!
@pochedls great, thanks for confirming. I will work on this tomorrow or early next week.
    keys = keys + obj.cf.coordinates[cf_attrs["coordinate"]]
except KeyError:
    pass
if CF_ATTR_MAP['T']['coordinate'] in obj.coords.keys():
    keys = keys + [CF_ATTR_MAP[axis]['coordinate']]

This would fix this issue. I think this logic is reasonable, and it seems like all unit tests still pass.
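The combined logic (try the cf_xarray-interpreted coordinates first, then fall back to checking the raw coords directly, as suggested above) can be sketched in isolation. `CF_ATTR_MAP` below is a tiny stand-in for xcdat's map, and the dict arguments mimic `obj.cf.coordinates` and `obj.coords`; this is not the actual xcdat code.

```python
# Stand-in for xcdat's axis-to-CF-attribute map (abbreviated).
CF_ATTR_MAP = {"T": {"coordinate": "time"}, "X": {"coordinate": "longitude"}}

def collect_keys(cf_coordinates, coords, axis):
    """Collect coordinate names for ``axis``.

    ``cf_coordinates`` mimics ``obj.cf.coordinates`` (raises KeyError when
    cf_xarray could not interpret the axis); ``coords`` mimics ``obj.coords``.
    """
    keys = []
    try:
        # Preferred path: names interpreted by cf_xarray from CF attributes.
        keys = keys + cf_coordinates[CF_ATTR_MAP[axis]["coordinate"]]
    except KeyError:
        # Fallback from the review suggestion: check the raw coords directly.
        if CF_ATTR_MAP[axis]["coordinate"] in coords:
            keys = keys + [CF_ATTR_MAP[axis]["coordinate"]]
    return keys
```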
@tomvothecoder - PR looks great. The lazy loading functionality you added goes above and beyond and is a really nice addition. Overall, I like how you cleaned up the open_*dataset and decode_time functionality.
I've verified that the current PR (with my one suggestion) passes all tests, resolves the linked issues/bugs, and spatial averaging and regridding look good.
I have one suggestion, which we can discuss if you're not a fan.
@tomvothecoder @pochedls - I've verified that the current PR does not affect the temporal and spatial averaging operations. My statistics (correlation, RMSE, standard deviation, etc.) that include spatial and/or temporal averages in their calculations were not changed by this PR.
xcdat/axis.py (outdated)

# A dictionary that maps common variable names to coordinate variables. This
# map is used as a fall-back when coordinate variables don't have CF attributes
# set for ``cf_xarray`` to interpret using `CF_ATTR_MAP`.
VAR_NAME_MAP: Dict[CFAxisKey, List[str]] = {
    "X": ["longitude", "lon"],
    "Y": ["latitude", "lat"],
    "T": ["time"],
    "Z": coordinate_criteria["Z"]["standard_name"],
}
This mapping table addresses the issue here #343 (comment).
I used cf_xarray's coordinate_criteria["Z"]
because there are way too many to list ourselves. Let me know if there is a shorter static list we might want to use instead.
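The name-based fallback can be illustrated with a self-contained sketch: when a coordinate variable lacks CF attributes, guess its axis from common variable names. The map mirrors the `VAR_NAME_MAP` shown in the diff, except that the "Z" list here is a short illustrative subset standing in for cf_xarray's `coordinate_criteria["Z"]["standard_name"]`; the `guess_axis` helper is hypothetical.

```python
VAR_NAME_MAP = {
    "X": ["longitude", "lon"],
    "Y": ["latitude", "lat"],
    "T": ["time"],
    "Z": ["height", "depth", "altitude"],  # abbreviated subset for the example
}

def guess_axis(var_name):
    """Return the CF axis key for a common variable name, or None."""
    for axis, names in VAR_NAME_MAP.items():
        if var_name.lower() in names:
            return axis
    return None
```

For example, `guess_axis("lat")` resolves to `"Y"` even if the variable has no `axis` or `standard_name` attribute set.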
Final self-review code updates

Overview:
- `xcdat` APIs previously expected a 1:1 map between axis dimensions and coordinates. For example, the `time` axis dimension would be mapped to the `time` coord var. However, datasets can have N dimensions mapped to N coordinate variables.
- For example, datasets that have N dims mapped to N coord vars include E3SM, which might have a "Z" axis with the dimensions and coords "ilev" and "lev".

Feature Updates:
- Update bounds accessor methods to operate on the dataset or a variable within the dataset using the new `var_key` kwarg
- Update `add_missing_bounds()` to loop over all axes and their coordinate vars and attempt to add bounds for each coordinate var if they do not exist
- Update the decoding function (`decode_time()`) to decode CF and non-CF compliant time coordinates as `cftime` objects
- Update the dataset longitude swapping function (`swap_lon_axis()`) to loop over longitude coords and their bounds to swap them
- Update spatial and temporal accessor class methods to refer to the dimension coordinate variable on the `data_var` being operated on, rather than the parent dataset

Bug Fixes:
- Fix `add_bounds()` not ignoring 0-dim singleton coords in addition to 1-dim singleton coords

Refactor:
- Refactor `open_dataset()` and `open_mfdataset()` to remove unnecessary preprocessing functions
- Rename `get_axis_coord()` to `get_dim_coords()` and `get_axis_dim()` to `get_dim_keys()`
- Update fixtures in `fixtures.py` to use `cftime` objects and add `decode_times` to `generate_dataset()`
- Extract utility function `_if_multidim_dask_array_then_load()` and reuse in several places in the codebase, usually when manipulating bounds
- This improves runtime performance, since `.values` performs type conversions to `np.array()`
- Refactor `_get_cftime_coords()` to use `xr.coding.times.decode_cf_datetime()` for decoding CF compliant time units
- Update `_is_decoded()` to check `.encoding` attributes rather than the object types in the array for performance purposes, since accessing `.values` can involve type conversion
- Refactor imports and fix docstrings
- This is required for cases where time units are different between files in multi-file datasets, but the offsets are the same. More info in the `_preprocess()` docstring
- Add raise ValueError for decoding datasets with time coords without CF attrs set
- Add `decode_time()` test for units in days
- … to interpret common var names
- Update tests
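The `swap_lon_axis()` update listed above can be illustrated with a minimal sketch of the underlying coordinate transform: mapping longitudes from [0, 360) to [-180, 180) and re-sorting, with the same transform applied to bounds. The `to_180` helper is an illustrative stand-in, not xcdat's implementation.

```python
def to_180(lon):
    """Map a longitude in [0, 360) to the equivalent value in [-180, 180)."""
    return ((lon + 180.0) % 360.0) - 180.0

lons = [0.0, 90.0, 180.0, 270.0]
# After swapping, the coords must be re-sorted so the axis stays monotonic;
# bounds would be transformed and reordered the same way.
swapped = sorted(to_180(x) for x in lons)
# swapped == [-180.0, -90.0, 0.0, 90.0]
```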
Nit-picking updates
@tomvothecoder – this is an awesome PR – thanks for all of your work on it the past 2 months – it is a great advance for xcdat!
Excellent work @tomvothecoder and @pochedls. Thank you for your hard work on this PR.
Description
Related feature specification document: https://docs.google.com/document/d/1BnwNGzb-_pSUFzH5513x99Yv2IOanZlTtA7RvIF7hJA/edit#heading=h.disypbipz1ic
Linked issue:
- `swap_lon_axis()` renames data variables (non-bounds) #378

Overview:
- `xcdat` APIs previously expected a 1:1 map between dimensions and coordinates. For example, the `time` dimension would be mapped to the `time` coord var. However, datasets can have N dimensions mapped to N coordinate variables.

Feature Updates:
- Update bounds accessor methods to operate on the dataset or a variable within the dataset using the new `var_key` kwarg
- Update `add_missing_bounds()` to loop over all axes and their coordinate vars and attempt to add bounds for each coordinate var if they do not exist
- Update the decoding function (`decode_time()`) to decode CF and non-CF compliant time coordinates as `cftime` objects
- Update the dataset longitude swapping function (`swap_lon_axis()`) to loop over longitude coords and their bounds to swap them
- Update spatial and temporal accessor class methods to refer to the dimension coordinate variable on the `data_var` being operated on, rather than the parent dataset

Bug Fixes:
- Fix `add_bounds()` not ignoring 0-dim singleton coords in addition to 1-dim singleton coords

Refactor:
- Refactor `open_dataset()` and `open_mfdataset()` to remove legacy private functions for checking CF compliance before attempting to decode (`decode_time()` now checks for CF compliance)
- Rename `get_axis_coord()` to `get_dim_coords()` and `get_axis_dim()` to `get_dim_keys()` for clarity
- Update fixtures in `fixtures.py` to use `cftime` objects and add `decode_times` to `generate_dataset()`
- Extract utility function `_if_multidim_dask_array_then_load()` and reuse in several places in the codebase, usually when manipulating bounds

Checklist

If applicable:
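As a closing illustration of the PR's core change (one axis owning several coordinate variables, such as an E3SM "Z" axis with both "ilev" and "lev"), here is a minimal sketch. The function name and the plain-dict mapping are hypothetical stand-ins for xcdat's renamed `get_dim_coords()`, not its actual implementation.

```python
def get_dim_coords_sketch(axis, coords_by_axis):
    """Return ALL coordinate names for an axis, not just a single one."""
    matches = coords_by_axis.get(axis, [])
    if not matches:
        raise KeyError(f"No coordinates found for the {axis!r} axis")
    return matches

# An E3SM-like layout: the Z axis maps to two coordinate variables.
e3sm_like = {"Z": ["ilev", "lev"], "T": ["time"]}
# get_dim_coords_sketch("Z", e3sm_like) returns ["ilev", "lev"]
```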