a netcdf file has NC_STRING variables with fill value turned off (nc_inq_var_fill() function returns no_fill as 1) #2612

Open
krisfed opened this issue Feb 8, 2023 · 4 comments

Comments

@krisfed

krisfed commented Feb 8, 2023

Hello!

I was hoping to get clarification about whether it is supported to "turn off" fill value for NC_STRING variables.

We recently came across a file that had two NC_STRING variables with the fill value turned off (nc_inq_var_fill() called on those variables returns no_fill as 1). I was under the assumption that this should not be possible (e.g. see #727 and Unidata/netcdf4-python#331), but I am not sure it is officially documented. We tried to get more info about how this file was created; apparently it was created with netcdf-python using this code:

var = nc_outfile.createVariable('img_pair_info', 'U1', (), fill_value=None)

From my (limited) understanding of netcdf-python, setting fill_value=None does not turn off the fill value, but just sets it to the default fill value. When we use netcdf-python and create "U1" or "double" variables with fill_value=None, nc_inq_var_fill() still returns no_fill as 0 (the fill value is still turned on). From the netcdf4-python doc, it seems the way to turn off the fill value is to use fill_value=False. When we create "U1" or "double" variables with fill_value=False, nc_inq_var_fill() returns no_fill as 1 for the double variable (so the fill value is successfully turned off), but for the "U1"/string variable it still returns no_fill as 0 (the fill value is still on). This agrees with our understanding that turning off the fill value for NC_STRING variables is unsupported, or at least broken (and results in a no-op in netcdf-python, as described in Unidata/netcdf4-python#331 (comment)).

When I try to create an NC_STRING variable with the fill value "turned off" using netcdf-c, I get a -36 error code (NC_EINVAL - "NetCDF: Invalid argument"), which is also consistent with #727 (comment).

We were not able to get any more details about how this file was created and how the fill value for NC_STRING variables was "turned off".

We also saw that ncdump from v.4.7.0 could read this file, but ncdump from v.4.9.0 could not (Unknown file format)...

It would be great to know the following:

  • is "turning off" fill value for NC_STRING variables officially supported?
  • how was it possible for this file to be created (or does it mean it got corrupted somehow)? Is there any way for us to create a similar file for our own testing (either with netcdf-c or netcdf-python or some other way)?
@krisfed
Author

krisfed commented Feb 14, 2023

We got a bit more info about how the file was created. The following versions of libraries and packages were used:

libnetcdf 4.8.1
netcdf4 1.6.2
hdf4 4.2.15
hdf5 1.12.2
python 3.9.15

Interestingly, the newer pipeline now creates a file that does NOT have the fill value turned off. The newer pipeline uses the same versions, the only difference being that HDF5 is now built with SZip support using libaec.

There doesn't seem to be a difference in the python code generating the file that would explain why the fill value is NOT turned off with the newer pipeline.

I am not sure if we will get to the bottom of this, but it would at least be good to have some official confirmation of the expected behavior when turning off the fill value for NC_STRING variables.

Thanks!

@edwardhartnett
Contributor

Turning off fill values for netcdf-4 files is not really helpful, and is only supported for backward compatibility with classic formats.

The idea of turning off fill values is this: when creating a classic file, instead of setting each value to something, the library can just assign some disk space for the variable and leave it untouched, so it contains whatever was already there. Reading it back at that point would return random values, but we count on the user to later write all of the data. We are just skipping the step of writing a fill value everywhere and then having those fill values overwritten by real data.
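
To make the saving concrete, here is a minimal toy model in Python, not netCDF code. The function name `write_variable`, the list standing in for disk space, and the fill value -999 are all assumptions made for illustration. It only counts how many element writes happen with and without a prefill pass when the user ends up writing every value anyway.

```python
def write_variable(data, prefill, fill_value=-999):
    """Toy model of creating and fully writing a classic-format variable."""
    # None stands in for whatever bytes were already on disk.
    disk = [None] * len(data)
    element_writes = 0

    if prefill:
        # Fill mode: initialize every element before any real data arrives.
        for i in range(len(disk)):
            disk[i] = fill_value
            element_writes += 1

    # The user then writes every element, which is what no-fill mode relies on.
    for i, value in enumerate(data):
        disk[i] = value
        element_writes += 1

    return disk, element_writes
```

In this model both modes end with identical contents, but the prefill path touches every element twice.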

But HDF5 does not work this way. In HDF5, disk space is not allocated for chunks that are not written. So turning the fill value on for such variables does not add any disk activity. HDF5 does not write chunks for data until it needs to. So if you define a big variable, and then don't write any data to it, there will actually be no disk space allocated, and no fill values will be written in any case. If you then try to read the data, the HDF5 library will pretend that it is there, and that it is full of fill values. So turning off fill values for HDF5 data does not actually accomplish anything.
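
The behavior described here can be sketched with a small toy model in Python. This is not how HDF5 is implemented; the class name `LazyChunkedArray`, the dictionary standing in for allocated disk space, and the fill value are all assumptions for illustration. It shows only the idea that unwritten chunks take no space and still read back as the fill value.

```python
class LazyChunkedArray:
    """Toy 1-D chunked array whose chunks exist only after a first write."""

    def __init__(self, length, chunk_size, fill_value):
        if length <= 0 or chunk_size <= 0:
            raise ValueError("length and chunk_size must be positive")
        self.length = length
        self.chunk_size = chunk_size
        self.fill_value = fill_value
        self.chunks = {}  # chunk index -> list of values (stand-in for disk space)

    def _locate(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        return divmod(index, self.chunk_size)

    def write(self, index, value):
        chunk_id, offset = self._locate(index)
        if chunk_id not in self.chunks:
            # Space is allocated only now, when the chunk is first needed.
            self.chunks[chunk_id] = [self.fill_value] * self.chunk_size
        self.chunks[chunk_id][offset] = value

    def read(self, index):
        chunk_id, offset = self._locate(index)
        chunk = self.chunks.get(chunk_id)
        # A missing chunk is reported as if it were full of fill values.
        return self.fill_value if chunk is None else chunk[offset]

    def allocated_chunks(self):
        return len(self.chunks)
```

For example, creating `LazyChunkedArray(1000, 100, -999)` allocates nothing, reading any index returns -999, and a single `write` allocates exactly one chunk.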

@krisfed
Author

krisfed commented Feb 15, 2023

Thank you for the detailed explanation @edwardhartnett !

I guess I was asking more about what is "allowed" with netCDF APIs (both C and python), i.e. what the official netCDF position is about fill values for NC_STRING variables.

From #727 I assumed turning fill values off was not supported (or just not possible) for NC_STRING variables (although I don't think it is documented). So it was a surprise to come across a file which had the fill value turned off for NC_STRING variables.

Because of this wrong assumption I ended up running nc_free_string on the pointer returned by nc_inq_var_fill without first checking the returned no_fill value. This causes a crash when no_fill is 1 - which makes sense, in this case there is no fill value to return, so nothing meaningful to make the pointer point to.

void* p;
char* stringFillValue;
p = &stringFillValue; // p is char**
int no_fill; // 1 if fill value is turned OFF, 0 if fill value is turned ON
status = nc_inq_var_fill(ncid, varid, &no_fill, p);

status = nc_free_string(1, static_cast<char**>(p));  // crash when no_fill=1, i.e. when there is no fill value
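
A guarded version of the same logic, checking no_fill before freeing anything, might look like the sketch below. It is written in Python with ctypes so that it can be exercised without compiling. `read_string_fill` is a hypothetical helper name, and `lib` is assumed to be a ctypes handle to libnetcdf (for example from `ctypes.CDLL`, with a library path that depends on the system). Only nc_inq_var_fill and nc_free_string are real netCDF-C calls.

```python
import ctypes

NC_NOERR = 0  # netCDF-C success status


def read_string_fill(lib, ncid, varid):
    """Return an NC_STRING variable's fill value, or None if fill is off.

    `lib` is assumed to expose nc_inq_var_fill and nc_free_string with
    their netCDF-C signatures (hypothetical setup, not shown here).
    """
    no_fill = ctypes.c_int(0)
    fill_ptr = ctypes.c_char_p()  # starts out NULL rather than uninitialized

    status = lib.nc_inq_var_fill(
        ncid, varid, ctypes.byref(no_fill), ctypes.byref(fill_ptr)
    )
    if status != NC_NOERR:
        raise RuntimeError(f"nc_inq_var_fill failed with status {status}")

    # With no_fill == 1 there is no fill value, so there is nothing to free.
    if no_fill.value == 1 or fill_ptr.value is None:
        return None

    # Copy the text out before handing the C string back to the library.
    value = fill_ptr.value.decode("utf-8")
    lib.nc_free_string(ctypes.c_size_t(1), ctypes.byref(fill_ptr))
    return value
```

The same two guards carry over to C or C++: initialize the pointer to NULL before the call, and call nc_free_string only when no_fill is 0.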

@krisfed
Author

krisfed commented Oct 10, 2024

Hi @edwardhartnett ,

Sorry to raise this question again, but we have noticed an increased number of such files in recent months. I think their source is the European Centre for Medium-Range Weather Forecasts Climate Data Store (ECMWF CDS): https://cds.climate.copernicus.eu/ (the new, just recently released system). Specifically, when downloading data there is an (experimental) option to convert GRIB files to netCDF, and it seems to produce files where NC_STRING variables have disabled fill values (nc_inq_var_fill returns no_fill=1).

Could you comment on this? Would you consider such files well-formed (if there is no way to turn off fill values for NC_STRING variables using netCDF library)?

2 participants