Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add full support for xarray decode_coords #282

Closed
alexamici opened this issue Apr 6, 2021 · 11 comments · Fixed by #284
Closed

Add full support for xarray decode_coords #282

alexamici opened this issue Apr 6, 2021 · 11 comments · Fixed by #284
Labels
proposal Idea for a new feature.

Comments

@alexamici
Copy link
Contributor

xarray 0.17.0 has introduced a way to optionally keep most auxiliary variables associated with a main variable, including grid_mapping, inside the corresponding DataArray: pydata/xarray#2844

See: http://xarray.pydata.org/en/stable/weather-climate.html?highlight=decode_coords#related-variables

Current behaviour of open_rasterio is to always add the spacial_ref coordinate with CF attributes, so it is similar to always assuming decode_coords="all" (similar but not identical as the grid_mapping key / value pair is in DataArray.attrs instead of DataArray.encoding).

Fully supporting decode_coords while attempting to minimise change means that open_rasterio would return a Dataset for decode_coords="coordinates" and continue returning a DataArray for decode_coords="all" that would be the default.

I would find such an interface very confusing, but open_rasterio already returns different types for different arguments so I'm not sure.

Alternatively we could add a new interface that always return a Dataset. That would also make adding a plugin for the new backend API or xarray very easy (see #281).

@alexamici alexamici added the proposal Idea for a new feature. label Apr 6, 2021
@snowman2
Copy link
Member

snowman2 commented Apr 6, 2021

The first line of pydata/xarray#2844 gives the motivation for the change: "I prefer having these as coordinates rather than data variables." The current behavior of rioxarray currently handles the grid_mapping as the PR above would want.

The way I interpret the reason for the PR is that historically, the grid_mapping (which is named spatial_ref by default in rioxarray) wasn't always read into the coords by xarray because it wasn't listed in coordinates. The reason for the change in the PR was to provide a backwards-compatible mechanism to load the variable in the grid_mapping into the coordinates if it wasn't already listed in the coordinates attribute by default. From what I can tell, the purpose of the change does not seem to be to remove the grid_mapping from the coordinates if you don't want it there as you are suggesting.

Side note: I am noticing changes that rioxarray definitely needs to make with that change:

  1. Store grid_mapping in encoding instead of attrs
  2. Look for grid_mapping in encoding and attrs

@snowman2
Copy link
Member

snowman2 commented Apr 6, 2021

Another element to this is that when reading in geotiff/png/jpeg files, it does not store the coordinate attribute on the data variables as that is something more for the netCDF format. For those files supporting decode_coords wouldn't make much sense.

@alexamici
Copy link
Contributor Author

The reason for the change in the PR was to provide a backwards-compatible mechanism to load the variable in the grid_mapping into the coordinates if it wasn't already listed in the coordinates attribute by default.

@snowman2 coming form the netCDF / CF world (nowadays, I'm a long time GeoTIFF user) I don't see it that way.

  1. the coordinates attribute is not part of the xarray in-memory data model and is never present in attrs as far as I'm aware. It is a netCDF attribute that is only read at load time and written at save time to keep track of what variables contain "coordinate" data following the CF Conventions. All xarray handling of the coordinates on-disk netCDF attribute look very solid to me and there's no reason to add the spatial_ref auxiliary variable to it.
  2. what we are interested in is what variable are linked to each DataArray.coords (typically during the xr.decode_cf operation) as that is the data that you can use when working with DataArrays. That is what Read grid mapping and bounds as coords pydata/xarray#2844 is all about. Note that the PR uses extra care to ensure that the coordinates netCDF on-disk attribute doesn't get polluted with the non-coordinate variable because that would be a violation of the CF Conventions, see: https://github.com/pydata/xarray/blob/d2582c2f8811a3bd527d47c945b1cccd4983a1d3/xarray/conventions.py#L707
  3. IMO the simplest implementation is to add the spatial_ref as a normal variable and apply xr.decode_cf that already handles all the cases.

@snowman2
Copy link
Member

snowman2 commented Apr 6, 2021

The main change you want is to make sure the on-disk representation is CF-compliant, correct?

If so, then if rioxarray only stores grid_mapping in encoding and not attrs, to_netcdf should handle the rest. I have verified this already with a test that spatial_ref is not listed in the coordinates in this scenario.

	double green(time, y, x) ;
		green:_FillValue = NaN ;
		green:nodata = 0LL ;
		string green:units = "DN", "DN" ;
		green:scale_factor = 1. ;
		green:add_offset = 0. ;
		green:coordinates = "" ;
		green:grid_mapping = "spatial_ref" ;

Though coordinates is an empty string. Probably something that should be updated with a fix in xarray.

@alexamici
Copy link
Contributor Author

The main change you want is to make sure the on-disk representation is CF-compliant, correct?

yes

If so, then if rioxarray only stores grid_mapping in encoding and not attrs, to_netcdf should handle the rest.

yes

(OK, I also would love to optionally have bands as variables, but I will survive doing the .to_dataset("band").rename(...) by myself)

@alexamici
Copy link
Contributor Author

BTW, the coordinate = "" ; is aesthetically annoying, but passes the CF checkers, so I don't care for now.

@snowman2
Copy link
Member

snowman2 commented Apr 6, 2021

I also would love to optionally have bands as variables, but I will survive doing the .to_dataset("band").rename(...) by myself

Please open a new issue to discuss. Probably a new argument band_as_variable or something to enable that.

@snowman2
Copy link
Member

snowman2 commented Apr 6, 2021

the coordinate = "" ; is aesthetically annoying, but passes the CF checkers, so I don't care for now.

Yep. Sounds like a good issue to raise at xarray

@alexamici
Copy link
Contributor Author

pydata/xarray#5121

@alexamici
Copy link
Contributor Author

alexamici commented Apr 6, 2021

@snowman2 as a final bit.

GdalRawBackend is a super minimal plugin that uses only rasterio and pyproj to support spatial_ref and mask_and_scale:

https://github.com/bopen/xarray-gdal/blob/main/xarray_gdal/xarray_plugin.py#L52

The only serious feature missing is lazy loading of chunked data.

It is not intended for release, I wanted just to explore what can be delegated to xr.decode_cf and as you see, it is a lot, e.g. encoding is all managed there. It also supports drop_variables.

@snowman2
Copy link
Member

snowman2 commented Apr 6, 2021

Nice. I like default_name="__bands__"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal Idea for a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants