Porting NetCDF I/O to Julia #3

Closed
sethaxen opened this issue Jul 29, 2022 · 9 comments · Fixed by #4
Labels
blocked enhancement New feature or request

Comments

@sethaxen
Member

sethaxen commented Jul 29, 2022

from_netcdf and to_netcdf being ported to Julia is an essential step to InferenceData being truly stand-alone (see arviz-devs/ArviZ.jl#207).

Two Julia packages provide NetCDF I/O: NCDatasets.jl and NetCDF.jl. Both are wrappers around NetCDF_jll.jl, which is a BinaryBuilder-generated binary for the NetCDF C library.

Currently there are some issues with the NetCDF_jll binary on Windows that would need to be resolved for us to use either of these packages. Descriptions of the issues can be found here:

@sethaxen sethaxen added enhancement New feature or request blocked labels Jul 29, 2022
@visr

visr commented Jul 29, 2022

The most pressing issues with NetCDF_jll will hopefully go away with JuliaPackaging/Yggdrasil#5251.

But yeah, due to these issues I've also thought about porting parts of netCDF to Julia, though I never attempted anything. Do you need netCDF 3 or 4?

netCDF 3 is simpler; for instance, SciPy implements it in 1000 lines of code: https://github.com/scipy/scipy/blob/v1.8.1/scipy/io/_netcdf.py

netCDF 4 is based on HDF5 and is much more complex. There is, however, https://github.com/JuliaIO/JLD2.jl, which supports quite a good subset of HDF5. It could be interesting to see how much is needed to use JLD2 to do (a subset of) netCDF 4 I/O.

A pure-Julia alternative that already exists is https://github.com/JuliaIO/Zarr.jl/.
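To make the netCDF 3 versus netCDF 4 distinction concrete, here is a minimal sketch in plain Julia that guesses a file's container format from its leading bytes. The function name `netcdf_kind` and the returned symbols are made up for illustration; the byte values are the classic netCDF magic numbers ("CDF" followed by a version byte) and the 8-byte HDF5 signature. It only inspects offset 0, so an HDF5 file whose superblock sits behind a user block would be reported as unknown.

```julia
# Hypothetical helper, not part of any package mentioned in this thread.
# The HDF5 signature is "\x89HDF\r\n\x1a\n"; netCDF 4 files are HDF5 files.
const HDF5_SIGNATURE = UInt8[0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a]

function netcdf_kind(io::IO)
    header = read(io, 8)  # returns fewer than 8 bytes if the stream is shorter
    if length(header) >= 4 && header[1:3] == UInt8[0x43, 0x44, 0x46]  # "CDF"
        version = header[4]
        version == 0x01 && return :netcdf3_classic
        version == 0x02 && return :netcdf3_64bit_offset
        version == 0x05 && return :netcdf3_64bit_data
        return :unknown
    elseif header == HDF5_SIGNATURE
        return :hdf5  # could be netCDF 4, but also any other HDF5 file
    end
    return :unknown
end

# Convenience method: open the file and inspect its first bytes.
netcdf_kind(path::AbstractString) = open(netcdf_kind, path)
```

A result of `:hdf5` only says that the container is HDF5; deciding whether it is a valid netCDF 4 file would still require reading the HDF5 structure itself, which is the hard part discussed above.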

@sethaxen
Member Author

> The most pressing issues with NetCDF_jll will hopefully go away with JuliaPackaging/Yggdrasil#5251.

Awesome! 🤞

> But yeah, due to these issues I've also thought about porting parts of netCDF to julia, though never attempting anything. Do you need netCDF 3 or 4?

As I understand it, we need netCDF 4, since our InferenceData type is a collection of groups, each of which could be stored independently in a different netCDF 3 file but which can only be stored together in netCDF 4. We'd definitely be interested in a pure Julia implementation if it were fully functional.

> 4 is based on HDF5, and is much more complex. Though there is https://github.com/JuliaIO/JLD2.jl which has quite a good subset of HDF5. It could be interesting to see how much is needed to use JLD2 to do (a subset of) netCDF 4 I/O.

That would be interesting. This line in the JLD2 docs concerns me:

> JLD2 is likely to be incapable of reading files created or modified by other HDF5 implementations

We use netCDF for standardized serialization and archival of outputs, which should be readable and usable across languages. It's therefore essential that we be able to read netCDF 4 files created by the Python ArviZ package (written to netCDF 4 using xarray) and that we write files that Python ArviZ can read.

> A pure julia alternative that already exists is https://github.com/JuliaIO/Zarr.jl/.

Cool, might be a nice alternative to provide in addition to netCDF. Can it only write single arrays, or also hierarchical data structures containing such arrays? For example, at our lowest level we have multidimensional arrays with named dimensions, a higher level ties them together into groups with shared named dimensions, and an even higher level ties the groups together.
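The three levels described here can be sketched in plain Julia as follows. All type and function names (`NamedDimArray`, `Group`, `dim_lengths`) are hypothetical and are not the actual InferenceObjects.jl types; the sketch is only meant to show what "groups with shared named dimensions" implies for a storage format, namely that every variable in a group must agree on the length of each dimension name.

```julia
# Hypothetical types for illustration only; not the InferenceObjects.jl API.

# Lowest level: a multidimensional array with one name per dimension.
struct NamedDimArray{T,N}
    data::Array{T,N}
    dims::NTuple{N,Symbol}
end

# Middle level: a group of variables that share named dimensions.
struct Group
    variables::Dict{Symbol,NamedDimArray}
end

# Collect the length of every named dimension in a group and throw
# if two variables disagree about the length of the same dimension.
function dim_lengths(g::Group)
    lengths = Dict{Symbol,Int}()
    for (name, var) in g.variables
        for (dim, len) in zip(var.dims, size(var.data))
            if get!(lengths, dim, len) != len
                throw(DimensionMismatch("dimension $dim has inconsistent length in variable $name"))
            end
        end
    end
    return lengths
end

posterior = Group(Dict{Symbol,NamedDimArray}(
    :mu => NamedDimArray(randn(4, 100), (:chain, :draw)),
    :theta => NamedDimArray(randn(4, 100, 8), (:chain, :draw, :school)),
))

# Highest level: a collection of named groups.
idata = Dict{Symbol,Group}(:posterior => posterior)
```

A format that can hold this structure in one file needs named dimensions per group plus some notion of nested groups, which is what netCDF 4 groups provide.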

@visr

visr commented Jul 29, 2022

Note that being able to read netCDF4 files written by Python ArviZ/xarray is already a much smaller subset than "any valid HDF5 file found out in the wild". And I would expect netCDF files written through JLD2 to be fine for xarray to read. But still, it would need testing. I spoke to one of the JLD2 devs on Slack a while back and I think they thought it was worth a shot for sure.

Zarr indeed handles groups and shared named dimensions as well. There is even something called NCZarr: https://www.unidata.ucar.edu/blogs/developer/en/entry/overview-of-zarr-support-in.

@sethaxen
Member Author

It seems that, with the current functionality of JLD2, netCDF files written by ArviZ are not readable.

julia> using ArviZ, JLD2

julia> idata = load_arviz_data("centered_eight");

julia> to_netcdf(idata, "tmp.jld2")
"tmp.jld2"

julia> jldopen("tmp.jld2")
ERROR: ArgumentError: "/home/sethaxen/projects/ArviZ.jl/tmp.jld2" is not a JLD2 file

I'd have to look more into how JLD2 writes its data to test the other direction.

@visr

visr commented Jul 29, 2022

Ha yeah I tried commenting out the header checks, and jldopen worked, but it seems the JLDFile struct it created is empty. Perhaps better to make an issue on JLD2 if you think it's worth pursuing.

@Alexander-Barth

Here is a proof of concept of a pure Julia NetCDF 3 reader based on pupynere (also in SciPy):

https://github.com/Alexander-Barth/NetCDF3/blob/main/test/runtests.jl#L42

Surprisingly, for reading a whole variable, Julia is faster than the C library (about 4 times faster).

@visr

visr commented Aug 1, 2022

Haha that is awesome. I can delete my mostly empty NetCDF3 that I made yesterday and clone this!

@Alexander-Barth

Haha, I don't know, mine is mostly empty too :-) In any case, NetCDF3 can be put on JuliaGeo or JuliaIO too (now or later).

pupynere/SciPy does not allow reading or writing a subset of a NetCDF variable, which might not be so trivial to do (but is still doable).

Another "format" in NetCDF that is quite common is OPeNDAP (DAP2 and DAP4 are supported in libnetcdf, but I never came across a DAP4 server).

@sethaxen
Member Author

It seems that most or all of the linked issues will be resolved in the next few days. So we can begin work on a native Julia from_netcdf function. A few notes:

  • NCDatasets.jl seems to work better for this than NetCDF.jl. With the latter I was not even able to get group names.
  • Python ArviZ defaults to lazy loading of NetCDF. This seems to be reasonably fast in Julia, but perhaps it requires that we leave the file open? It also seems that the eltype of the resulting AbstractArray types is a union with Missing, which could be problematic. On the flip side, eager loading seems very slow (~30s for the radon dataset).
  • The function should live in InferenceObjects.jl and be conditionally loaded with Requires (alternatively live in its own package InferenceObjectsNetCDF.jl). I'll move this issue to InferenceObjects.
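As a rough illustration of the Missing eltype concern in the second note, the sketch below narrows an array whose element type is a union with Missing to its non-missing element type when no value is actually missing. The helper name `strip_missing_eltype` is made up for this example and uses only Base Julia. Note that it materializes a plain `Array`, so applying it to a lazily loaded variable would force an eager read.

```julia
# Hypothetical helper, for illustration only.
# If `A` contains no `missing` values, return a copy with the narrower element
# type (e.g. Float64 instead of Union{Missing,Float64}); otherwise return `A`.
function strip_missing_eltype(A::AbstractArray)
    any(ismissing, A) && return A
    T = nonmissingtype(eltype(A))
    return convert(Array{T}, A)
end

A = Union{Missing,Float64}[1.0 2.0; 3.0 4.0]
B = strip_missing_eltype(A)   # element type is now Float64

C = Union{Missing,Float64}[1.0, missing]
D = strip_missing_eltype(C)   # returned unchanged, since a value is missing
```

Whether narrowing should happen automatically in `from_netcdf` or be left to the user is a design choice; the check itself costs a full pass over the data.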

@sethaxen sethaxen transferred this issue from arviz-devs/ArviZ.jl Aug 18, 2022
@sethaxen sethaxen mentioned this issue Aug 20, 2022