Exploratory proposal: convert dascore to a Pangeo/xarray project #27
Replies: 2 comments
-
I spent a few hours on this and found the xarray backend machinery convoluted and confusing, and I wasn't able to get lazy loading to work with dask yet. I am wondering if, without additional abstraction, this will make it too hard for people to add DAS backends, as I really don't want to write them all... Also, even with lazy loading, it seems the entire coordinate arrays need to be loaded into memory and only the data arrays are lazy. For nearly square data this isn't a big deal, but for DAS data, where the time axis is easily orders of magnitude longer than the distance axis, it comes at a much greater cost. Moreover, we will find it difficult to maintain metadata consistency if all the …
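To put rough numbers on the coordinate cost, here is a back-of-the-envelope sketch. The shapes and byte widths (8-byte coordinate values, 4-byte samples) are illustrative assumptions, not measurements from real DAS files, and `coord_fraction` is a hypothetical helper.

```python
def coord_fraction(n_time: int, n_dist: int,
                   coord_bytes: int = 8, data_bytes: int = 4) -> float:
    """Coordinate bytes as a fraction of data bytes for a 2D (time, distance) array."""
    coords = (n_time + n_dist) * coord_bytes  # one 1D coordinate per axis
    data = n_time * n_dist * data_bytes       # the full 2D data array
    return coords / data


# Two arrays with the same number of samples (10**8), assumed shapes.
square = coord_fraction(10_000, 10_000)   # nearly square data
das_like = coord_fraction(1_000_000, 100)  # long time axis, few channels

print(f"square:   {square:.6f}")
print(f"DAS-like: {das_like:.6f}")
```

For a fixed number of samples, the eagerly loaded coordinates take a larger share of the total as the array becomes more elongated, because the fraction is roughly proportional to one over the length of the shorter axis.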
-
I think this ship has long since sailed at this point.
-
It may be in our best interest to make dascore a Pangeo-affiliated project. This discussion lays out some of the advantages and disadvantages of the proposal.
Motivation
The Pangeo project is an ecosystem of Python packages used for geoscience (especially climate science). Provided we use the correct data structures (discussed later), dascore can benefit from the large-scale data processing tools of that ecosystem.
The Pangeo community is very active, and a recent effort to improve ETL workflows has been funded by the NSF under the name Pangeo Forge.
Disadvantages
We would need to remove the custom DAS classes (`Patch` and `Spool`) and operate on the xarray objects `DataArray` and `Dataset`. This would break existing code, but the migration path should be straightforward: operations currently called on dascore's own classes would instead be reached through an accessor namespace on the xarray objects, using either 'dc' or 'das'.
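As a rough illustration of the accessor idea, the sketch below registers a 'das' namespace with `xarray.register_dataarray_accessor`. The `DasAccessor` class and its `remove_mean` method are hypothetical names invented for this example and are not part of dascore.

```python
import numpy as np
import xarray as xr


@xr.register_dataarray_accessor("das")
class DasAccessor:
    """Hypothetical 'das' namespace attached to every DataArray."""

    def __init__(self, data_array: xr.DataArray):
        self._obj = data_array

    def remove_mean(self, dim: str = "time") -> xr.DataArray:
        """Subtract the mean along one dimension (illustrative method only)."""
        return self._obj - self._obj.mean(dim)


# Synthetic (distance, time) data standing in for a DAS recording.
data = xr.DataArray(
    np.random.default_rng(0).normal(size=(4, 1000)),
    dims=("distance", "time"),
    coords={"distance": np.arange(4) * 1.0, "time": np.arange(1000) / 100.0},
)

# DAS-specific functionality is reached through the accessor namespace.
out = data.das.remove_mean("time")
```

The result is still a plain `DataArray`, so it can be passed on to any other xarray-based tool without conversion.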
We would also need to rewrite the dascore file parsers as xarray backends but, at first glance, this seems straightforward.
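For a sense of what a parser rewritten as a backend might involve, here is a minimal sketch built on xarray's `BackendEntrypoint`. The `.npz` layout, the `NpzDasBackend` class, and the `strain_rate` variable name are all made up for the example. It reads everything eagerly, so it does not address lazy loading.

```python
import os
import tempfile

import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


class NpzDasBackend(BackendEntrypoint):
    """Hypothetical backend for a made-up '.npz' DAS layout."""

    open_dataset_parameters = ("filename_or_obj", "drop_variables")

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # Eager read: data and coordinates are all loaded into memory.
        with np.load(filename_or_obj) as npz:
            data, time, distance = npz["data"], npz["time"], npz["distance"]
        ds = xr.Dataset(
            {"strain_rate": (("distance", "time"), data)},
            coords={"distance": distance, "time": time},
        )
        if drop_variables:
            ds = ds.drop_vars(drop_variables, errors="ignore")
        return ds

    def guess_can_open(self, filename_or_obj):
        return str(filename_or_obj).endswith(".npz")


# Write a small synthetic file, then open it through the backend class.
path = os.path.join(tempfile.mkdtemp(), "example.npz")
np.savez(
    path,
    data=np.zeros((3, 50), dtype="float32"),
    time=np.arange(50) / 10.0,
    distance=np.arange(3) * 2.0,
)
ds = xr.open_dataset(path, engine=NpzDasBackend)
```

Passing the class via `engine=` avoids having to register an entry point while experimenting. A packaged backend would normally be registered under the `xarray.backends` entry-point group instead.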
Personally, I find the xarray API functional enough, but it can be hard for beginners to grok, especially if they aren't familiar with pandas. We can mitigate this by providing simpler functions in the das namespace that do 90% of what most users need; users would only have to learn the xarray API to go beyond that point.
There was some discussion of using ray rather than dask, or jax rather than numpy, in dascore. If we adopt this proposal, these paths will effectively be closed unless the Pangeo packages go this route.
Next steps
I am going to spend some time working on a branch that implements the outlined changes, then report back here.