Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_FillValue attribute creation error for NC_GLOBAL with netCDF 4.5.x? #458

Closed
czender opened this issue Aug 9, 2017 · 66 comments
Closed

Comments

@czender
Copy link
Contributor

czender commented Aug 9, 2017

netCDF library behavior of _FillValue seems to have changed in a backwardly incompatible way. With the 20170804 netCDF snapshot this command fails to create a global _FillValue attribute in a netCDF3 file, whereas the same command succeeds with netCDF 4.4.x and earlier versions. Is this new behavior intended?

zender@skyglow:~$ ncatted -O -a _FillValue,global,c,l,222 ~/nco/data/in.nc ~/foo.nc
nco_err_exit(): ERROR Short NCO-generated message (usually name of function that triggered error): nco_put_att()
nco_err_exit(): ERROR Error code is -50. Translation into English with nc_strerror(-50) is "NetCDF: Action prohibited on NC_GLOBAL varid"
nco_err_exit(): ERROR NCO will now exit with system call exit(EXIT_FAILURE)
@WardF
Copy link
Member

WardF commented Aug 9, 2017

Thanks I will track this down!

@czender
Copy link
Contributor Author

czender commented Aug 10, 2017

Thank you. FYI 4.5.x also causes a less clear regression with ncpdq, which now fails to pack variables that have a _FillValue attribute. This is a major regression, since people do this all the time. I will try to narrow down what it happening further when I have time.

@DennisHeimbigner
Copy link
Collaborator

DennisHeimbigner commented Aug 10, 2017

Do I have this right? If in.nc was an empty netcdf file, the ncatted command
would be equivalent to this?
ncgen -3 in.cdl
where in.cdl would be:
netcdf in {
// global attributes:
int64 :_Fillvalue = 222LL ;
}

@czender
Copy link
Contributor Author

czender commented Aug 10, 2017

Yes, except the _FillValue would be a plain int not int64.

@czender
Copy link
Contributor Author

czender commented Aug 11, 2017

I now understand that this error is due to #388. FWIW it has always been (or seemed) legal to have a _FillValue defined for a NC_GLOBAL, and NCO supported this. A consequence is that there may be files in the wild with _FillValue defined for root and groups. #388 may make copying/manipulating such files difficult, since an error code will be returned. Does this backwards incompatibility merit revisiting #388?

@WardF
Copy link
Member

WardF commented Aug 11, 2017

@czender Absolutely worth revisiting and a discussion; the files can still be read correctly? We need to avoid breaking the backwards compatibility, and depending on the actual harm done we may just revert #388 and instead update the documentation. @DennisHeimbigner any thoughts? We can discuss this at our meeting if not here.

@czender
Copy link
Contributor Author

czender commented Aug 11, 2017

I patched the latest NCO snapshot to emit a warning when it receives an NC_EGLOBAL error from nc_put_att(NC_GLOBAL,"_FilllValue"), but otherwise proceed normally. This helps to mitigate the issue.

@DennisHeimbigner
Copy link
Collaborator

In order to keep #388
and to usably support back-compatibilty, I think we need to
change from reporting an error to reporting a warning somehow.
If the file is being generated with ncgen, then ncgen can report the
warning. If it is via the library, then our only option is to use logging
to report the warning. Unfortunately, logging is netcdf-4 only.
In any case, attempts to write such a global _FillValue could be silently
ignored.
BTW, maybe we should consider extending logging support to all parts
of netcdf library and not just netcdf-4. Otherwise we have not way to
report warnings.

@WardF
Copy link
Member

WardF commented Aug 11, 2017

I don't think silently ignoring is a good idea; if data isn't written it should be reported somehow, or else we set ourselves up for future problems when people think they are doing one thing but are really doing another. Maybe we could have a hidden attribute in the file that is used to store warning messages, etc? There's probably a problem with that, I haven't really thought it through yet.

@DennisHeimbigner
Copy link
Collaborator

It occurs to me to ask: did a global _FillValue actually get used in defining
the fillvalue for any variable? If so, then I think we have to keep that behavior.
If it was effectively unused, then we can "ignore" in some sense that global
fillvalue.

@czender
Copy link
Contributor Author

czender commented Aug 15, 2017

I'm not sure if Dennis' question was for me so let me clarify that currently/formerly people could have groups with a _FillValue with no error. Some of them may have workflows that cause their variables to "inherit" the _FillValue of the group they are in. I don't know of any specific examples.

@DennisHeimbigner
Copy link
Collaborator

I guess I was trying to figure out what behavior needs to be preserved.
If we allow setting a global _FillValue, but otherwise ignore it, does that
preserve back compatibility?

@czender
Copy link
Contributor Author

czender commented Aug 15, 2017

NCO exposes the full attribute API, i.e., creating, deleting, copying, renaming attributes. If the full attribute API is allowed to continue to work with _FillValue on NC_GLOBAL then nothing changes from the user's point of view, so that would be back-compatible. Eliminating any of those could potentially cause grief to some users.

@WardF
Copy link
Member

WardF commented Aug 15, 2017

NCO is a vital suite of tools; I'm convinced that we need to find a solution (such as Dennis suggests) that does not break the tools or user workflows which adhere to what has become a de-facto standard. As far as I can tell, the global _FillValue attribute isn't used for anything internally, and as such can be ignored by the library. My approach for the next rc3/full release would be (absent a reason to do something else): updating the documentation to reflect this and reverting the previous pull request that prevents setting a global _FillValue attribute.

@DennisHeimbigner
Copy link
Collaborator

Sounds right to me.

@WardF
Copy link
Member

WardF commented Aug 15, 2017

I've created a temporary branch, reverse-388. Does this fix the issue you're seeing re: ncpdq @czender ? I need to do more to understand the underlying issue but I want to make sure I'm looking in the right place to start dissecting the code.

@czender
Copy link
Contributor Author

czender commented Aug 15, 2017

There is another, though possibly related, issue caused by changed treatment of _FillValue in 4.5.x that is, IMHO, both less straightforward and a bigger deal than the NC_GLOBAL issue discussed here (in #458) and in #388. I'm uncertain precisely what netCDF change caused this (possibly #383 and/or #384 ???). As background, NCO allows the _FillValue type to be other than the variable type, and ncpdq does not try to change the _FillValue type to the packed type. 4.5.x no longer allows this, and now emits NC_EBADTYPE (-45): "NetCDF: Not a valid data type or _FillValue type mismatch" from ncpdq's nc_put_var() that writes the packed data to the variable that already has a _FillValue (of a different type) defined. Many people (including me) rely on this feature, although it is not a best practice (I wrote ncpdq in 2004, when missing_value was still the norm, and possibly before _FillValue best practices were defined). So, I consider this issue is partly due to an NCO design flaw (being lazy and not always making typeof(_FillValue) == typeof(variable)). Still, the current algorithm works pretty well when _FillValue is in the range of the packed type (otherwise ncpdq does print a warning). Modifying ncpdq is possible, but could take months. In the meantime, many workflows with ncpdq will break if/when it's linked to 4.5.x. I'm unsure how to proceed.

@WardF
Copy link
Member

WardF commented Aug 15, 2017

My first thought off the top of my head is to back out the change implemented re: different types and instead phase it in slowly while flagging the behavior as deprecated, and work together towards not locking this behavior out altogether until NCO has been updated. Although depending on the scope of the problems it could cause, I am hesitant to do something that breaks widespread workflows built around behavior that was permitted, even if it was documented as not a best practice. Even if it's not a best practice, unless it was explicitly forbidden, i dont know how easily we can/should change the behavior at this point. It will merit more discussion once I'm done with this symposium, but I am inclined to balance the damage caused by enforcing this vs. the hypothetical damage prevented. Let me refer back to those other pull requests when I can, to refamiliarize myself with the motivation for making the changes.

@DennisHeimbigner
Copy link
Collaborator

Can you elaborate on what you mean by the term "packed"?

@wkliao
Copy link
Contributor

wkliao commented Aug 20, 2017

Just created a pull request #464
It may silence the NC_EBADTYPE error for ncpdq.

@czender
Copy link
Contributor Author

czender commented Aug 20, 2017

Yes, apparently the NC_NOFILL on line 763 prevents ncpdq packing mismatched _FillValue from failing with netCDF < 4.5.x. When I remove that line it fails with NC_EBADTYPE. Thanks for digging-in to this. I have not tested your patch, though it seems promising.

@WardF
Copy link
Member

WardF commented Aug 21, 2017

Thank you very much @wkliao for providing your insight and the patch; thank you also @czender for working with us to figure this out, and reporting it in the first place! I've just tested #464 with the provided test, and so long as ERANGE_FILL is not enabled (default behavior), the test passes. @czender, will this work for you? I suppose there is a possibility of issues being reported if somebody goes out of their way to turn on ERANGE_FILL, but since it's not the default behavior I feel comfortable calling this fixed. I've pushed a new branch, wkliao-ghpull464; once the CI tests finish, I will merge it back into v4.5.0-release-branch and from there it will propagate back into master.

@czender
Copy link
Contributor Author

czender commented Aug 21, 2017

Please ping me when @wkliao 's patch is in master and I'll test. Assuming it passes, then this is a fine solution for ncpdq users. Thank you Wei-Keng, for producing this so quickly.

@WardF
Copy link
Member

WardF commented Aug 23, 2017

@czender This is in master, let me know if it works for you and if so I'll close this out and move on to the other cdf5-related issues.

@czender
Copy link
Contributor Author

czender commented Aug 24, 2017

This thread discusses two distinct changes in _FillValue behavior in 4.5.x, each of which broke one NCO regression test that passed with 4.4.x and earlier libraries. Today's master still causes an error when attempting to create a _FillValue for NC_GLOBAL. I mentioned that I patched NCO to continue (not abort) despite this error:

zender@skyglow:~/nco$ ncatted -O -a _FillValue,global,c,l,222 ~/nco/data/in.nc ~/foo.nc
WARNING: nco_put_att() received NC_EGLOBAL error writing attribute "_FillValue" to metadata for group "/". netCDF 4.5.x forbids this with the _FillValue attribute, though earlier versions allow it. Proceeding normally without writing _FillValue attribute...

Personally I can live with the current behavior because I don't use/need this feature. As discussed above, Unidata needs to decide whether to allow this feature to work become some users may rely on it, and the new behavior would reduce backwards compatibility for them.

As for the second issue, which was diagnosed as causing ncpdq to fail, the current snapshot does fix that issue (thanks to @wkliao for the patch). This will prevent massive breakage in many workflows so this is an important patch to keep. In the future I would like to alter ncpdq so it does not rely on this patch, but that would be a big project that I don't have time for right now.

@WardF
Copy link
Member

WardF commented Aug 24, 2017

I think we are going to allow it; it has been allowed this long and has become permissible, so I think the documentation needs to be updated. I thought I had reverted this behavior already but apparently not. I will look into it.

Actually, yes I've reverted it in a branch but have not met with @DennisHeimbigner yet to discuss/confirm. And yes, thank you @wkliao for the patch with ncpdq.

@czender - NCO is a very important tool for our community, and breaking it is not something I want to do. I've built a pretty robust automated regression/continuous integration framework based on docker, but the weakness is that it's automated; if a test fails and returns 0, it will not catch it. I'd like to figure out a way to catch some of these netcdf tests that only run via make test and not make check. Do you have any ideas? Worst case scenario would be to write a string parser for the test output and use that, but it's unlikely I'd get to that in the near future. Could some of these netcdf specific tests be moved into 'make check', maybe? Or would that be a departure from your development design? @

@wkliao
Copy link
Contributor

wkliao commented Aug 24, 2017

Please note that as early as version 3.3 released in 1997, checking type mismatch has already appeared in libsrc/putget.m4. This checking happens during nc_enddef and when fill mode is on.

 126         /*
 127          * Set up fill value
 128          */
 129         attrpp = NC_findattr(&varp->attrs, _FillValue);
 130         if( attrpp != NULL )
 131         {
 132                 /* User defined fill value */
 133                 if( (*attrpp)->type != varp->type || (*attrpp)->nelems != 1 )
 134                 {
 135                         return NC_EBADTYPE;
 136                 }

@DennisHeimbigner
Copy link
Collaborator

I agree that we should continue to allow _FillValue with NC_GLOBAL.
We just need to note that it is (for now) ignored by the netcdf library.

@wkliao
Copy link
Contributor

wkliao commented Aug 24, 2017

FYI. I found the following statement from netcdf man page of v 3.3, libsrc/netcdf.3

798 By default, the netCDF interface sets the values of
799 all newly-defined variables of finite length (i.e. those that do not have
800 an unlimited, dimension) to the type-dependent fill-value associated with each
801 variable.  This is done when \fBnc_endef(\|)\fR
802 is called.  The
803 fill-value for a variable may be changed from the default value by
804 defining the attribute `\fB_FillValue\fR' for the variable.  This
805 attribute must have the same type as the variable and be of length one.

@czender
Copy link
Contributor Author

czender commented Oct 4, 2017

FYI, this command fails with 4.5.0-rc3 and succeeds in 4.4.x. I thought Unidata intended to continue to let users write _FillValue as group metadata in 4.5.0. The current release candidate does not allow that.

zender@skyglow:~$ ncks --lbr
Linked to netCDF library version 4.5.1-development compiled Oct  1 2017 16:21:13
zender@skyglow:~$ ncatted -O -a _FillValue,global,c,l,222 ~/nco/data/in.nc ~/foo.nc
WARNING: nco_put_att() received NC_EGLOBAL error writing attribute "_FillValue" to metadata for group "/". netCDF 4.5.x forbids this with the _FillValue attribute, though earlier versions allow it. Proceeding normally without writing _FillValue attribute...
zender@aerosol:~$ ncks --lbr
Linked to netCDF library version 4.4.1.1 compiled Jul 15 2017 09:29:02
zender@aerosol:~$ ncatted -O -a _FillValue,global,c,l,222 ~/nco/data/in.nc ~/foo.nc
zender@aerosol:~$ 

@WardF
Copy link
Member

WardF commented Oct 4, 2017

That as the intent, and I thought I had reversed the pull request disallowing this. I even thought I had added a test for this. Let me go double-check. This should be allowed.

@DennisHeimbigner
Copy link
Collaborator

This may also require a change to ncgen; I will investigate separately.

@czender
Copy link
Contributor Author

czender commented Oct 4, 2017

Please re-read my message of August 24. Nothing seems to have changed since then.

@WardF
Copy link
Member

WardF commented Oct 5, 2017

Thanks @czender we will get this in!

@WardF
Copy link
Member

WardF commented Oct 5, 2017

Ok, looking at this we had corrected the initial behavior using an ncpdq-based test. Looking at this ncatted issue to see what's going on.

@DennisHeimbigner
Copy link
Collaborator

It looks like there is a check in libdispatch/dattput.c
in (at least) each of the nc_put_att_XXX routines,
where XXX is some type (e.g. int or float or...).
You should be able to find all of them by looking for
occurrences of NC_EGLOBAL.

@WardF
Copy link
Member

WardF commented Oct 6, 2017

I've got this taken care of and am adjusting a test. I'll roll this in soon today and get ball rolling on the 4.5.0 full release.

@WardF
Copy link
Member

WardF commented Oct 6, 2017 via email

@WardF WardF modified the milestones: 4.5.0, 4.5.1 Oct 16, 2017
@WardF
Copy link
Member

WardF commented Oct 16, 2017

I believe we have this addressed for 4.5.0. Tagging to revisit for 4.5.1.

@WardF
Copy link
Member

WardF commented Nov 26, 2018

It appears this is fixed, closing out as part of issue cleanup, will reopen if incorrect.

@WardF WardF closed this as completed Nov 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants