
Clarify what action to take when unpacking an int #374

Closed
mankoff opened this issue Aug 1, 2022 · 15 comments · Fixed by #456
Labels
defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors

Comments

@mankoff

mankoff commented Aug 1, 2022

In Chapter 8. Reduction of Dataset Size section 8.1. Packed Data on line 18 https://github.com/cf-convention/cf-conventions/blob/main/ch08.adoc#L18 it states,

It is not advised to unpack an int into a float as there is a potential precision loss.

This is a warning about what not to do, but does not provide a suggestion of what should be done. Does this mean unpacked data should be type int? Or type double? I suggest re-wording to provide positive advice (if there is a simple thing that should be done), rather than negative advice. The 'not advised' is appropriate language if there are many possible correct actions that could be taken.

If someone can offer clarity, I will issue a pull request.
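To make the ambiguity concrete, here is a minimal sketch (editor's illustration with hypothetical values, not text from the convention): the unpacking formula is `unpacked = packed * scale_factor + add_offset`, and for int-packed data the sentence in question only says which type *not* to unpack to.

```python
# Illustrative sketch of the ambiguity raised in this issue.
# All variable names and values here are hypothetical examples.
import numpy as np

packed = np.array([1200, 3400, 5600], dtype=np.int32)  # int-packed values
scale_factor = np.float32(0.01)
add_offset = np.float32(273.15)

# The choice the convention warns against: unpack to float.
unpacked_f32 = packed.astype(np.float32) * scale_factor + add_offset

# An alternative it does not explicitly recommend: unpack to double.
unpacked_f64 = (packed.astype(np.float64) * np.float64(scale_factor)
                + np.float64(add_offset))

print(unpacked_f32.dtype, unpacked_f64.dtype)  # float32 float64
```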

@mankoff added the typo label Aug 1, 2022
@JonathanGregory
Contributor

Dear Ken @mankoff

Thanks for raising this. I suppose the concern is that an int (four-byte integer) has up to 10 significant figures, which is too many for a float. Do you and others agree that is what it means? double would be OK as far as that is concerned. But is this sentence recommending that float should be packed as byte or short but not int? I suppose this would be reasonable, since float and int are both four bytes so there wouldn't be any compression if you pack float as int.
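Jonathan's precision concern can be demonstrated directly (an editor's sketch, not part of the thread): the smallest positive integer that float32 cannot represent exactly is 2^24 + 1 = 16777217, which is well inside the int32 range, so round-tripping such an int through float silently loses information while double preserves it.

```python
# Demonstrates why unpacking an int into a float risks precision loss:
# a 4-byte int can hold integers a 4-byte float cannot represent exactly.
import numpy as np

packed = np.int32(16_777_217)     # 2**24 + 1, well within int32 range

as_float32 = np.float32(packed)   # float32 has a 24-bit significand
as_float64 = np.float64(packed)   # float64 has a 53-bit significand

print(int(as_float32))   # 16777216 -- silently rounded
print(int(as_float64))   # 16777217 -- preserved exactly
```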

I'm changing the label of this issue to defect, as I think it's a point that's unclear in the convention, more serious than a typo.

Cheers

Jonathan

@JonathanGregory added the defect label and removed the typo label Aug 2, 2022
@mankoff
Author

mankoff commented Aug 2, 2022

This sentence is about unpacking. Elsewhere packing says that packed data should be type byte, short, or int.

@JonathanGregory
Contributor

Does anyone know what it's getting at, then?

@mankoff
Author

mankoff commented Aug 8, 2022

I would suggest re-writing section 8 from having one subsection Packed Data to two sub-sections Packing Data and Unpacking Data.

Currently under Packed Data is:

If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data is assumed to be of the same data type as the packed data. However, if the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.

I would rewrite this

  • Using RFC 2119 keywords (MAY, SHOULD, MUST, MUST NOT, etc.)
  • If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data SHOULD be of the same data type as the packed data, BUT WITH 4 (8?) ADDITIONAL BITS TO CAPTURE OVERFLOW
  • If the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. THE UNPACKED DATA SHOULD BE THE HIGHEST PRECISION DATA TYPE OF THE THREE, WITH X ADDITIONAL BITS TO CAPTURE OVERFLOW.

The last two sentences:

An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.

Should be in a section on Packing Data, which will have different requirements from unpacking data. And as in the OP, it should be rewritten for clarity and to reduce ambiguity.

@JonathanGregory
Contributor

Dear Ken @mankoff

The preceding text says, "If the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int." I find this quite hard to understand, but I would say it covers both packing and unpacking. My interpretation is this:

You may pack float or double into byte, short or int. The attributes scale_factor and add_offset must be of the same type, either float or double, and the data will be unpacked to this type.

Then comes the sentence you commented on, "It is not advised to unpack an int into a float as there is a potential precision loss." I believe this is because int has more significant figures than float. It could be the case that the original data is double and was packed to int. In that case, unpacking it as float would lose precision. But you would only unpack to float if the scale_factor and add_offset were float, according to the previous rule. By choosing that type, the data-writer caused the problem. To avoid the problem, we could change the packing rules as follows:

You may pack float into byte or short, and double into byte, short or int. The attributes scale_factor and add_offset must be of the same type as the data before packing, either float or double, and the data will be unpacked to this type.

That is, it's not allowed to pack float into int. Why would you want to do that anyway, since they're the same size?

If the data was written (erroneously) as int and the attributes are float, the data should be unpacked to double, not float. I suppose that's the positive recommendation you were asking for.

The text seems to allow the possibility that the scale_factor and add_offset attributes might be of the same data type as the variable. This might be done if they were of integer type, so you could pack larger numbers than could be stored in that type. For example, you could store large multiples of 128 in a byte variable using a scale_factor of 128. Is that the idea? Is there a reason why one might do this with floating-point types?
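The integer scale_factor idea described above can be sketched as follows (an editor's illustration; values are hypothetical): a byte variable with scale_factor of 128 represents large multiples of 128 that do not fit in a byte themselves.

```python
# Sketch of an integer scale_factor on an integer variable: the packed
# byte values stand for multiples of 128 far beyond the byte range.
import numpy as np

packed = np.array([1, 2, 100], dtype=np.int8)   # byte range is -128..127
scale_factor = 128                               # integer attribute, same type family

unpacked = packed.astype(np.int32) * scale_factor
print(unpacked)   # values 128, 256, 12800 -- far beyond what a byte can hold
```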

Best wishes

Jonathan

@kmuehlbauer

Coming from #427. I agree with rephrasing as suggested, but with an enhancement (the parenthetical below about unsigned types).

You may pack float into byte or short, and double into byte, short or int (or their unsigned counterparts). The attributes scale_factor and add_offset must be of the same type as the data before packing, either float or double, and the data will be unpacked to this type.

@JonathanGregory
Contributor

Dear Kai @kmuehlbauer

Yes, you're right, we should mention the unsigned types, to accord with issue #427, which asks that 8.1 should be consistent with the list of data types in 2.2.

I propose that we should replace most of the second paragraph of 8.1, namely the following text:

If the scale_factor and add_offset attributes are of the same data type as the associated variable, the unpacked data is assumed to be of the same data type as the packed data. However, if the scale_factor and add_offset attributes are of a different data type from the variable (containing the packed data) then the unpacked data should match the type of these attributes, which must both be of type float or both be of type double. An additional restriction in this case is that the variable containing the packed data must be of type byte, short or int. It is not advised to unpack an int into a float as there is a potential precision loss.

with this text:

When packed data is written, the scale_factor and add_offset attributes must be of the same type as the unpacked data, which must be either float or double. Data of type float may be packed into byte, unsigned byte, short or unsigned short. Data of type double may be packed into byte, unsigned byte, short, unsigned short, int or unsigned int.

When packed data is read, it should be unpacked to the type of the scale_factor and add_offset attributes, which must have the same type if both are present, except that packed data of type int or unsigned int should be unpacked to double if the scale_factor and add_offset attributes are float, to avoid a potential loss of precision.
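The proposed unpacking rule can be sketched as a small type-mapping function (an editor's illustration of the text above, not an official implementation; the function name is hypothetical):

```python
# Sketch of the proposed rule: unpack to the attributes' type, except
# widen int/unsigned-int packed data with float attributes to double.
import numpy as np

def unpacked_dtype(packed_dtype, attr_dtype):
    """Return the dtype the data would be unpacked to under the proposed rule."""
    packed_dtype = np.dtype(packed_dtype)
    attr_dtype = np.dtype(attr_dtype)
    int_types = (np.dtype(np.int32), np.dtype(np.uint32))
    if packed_dtype in int_types and attr_dtype == np.dtype(np.float32):
        return np.dtype(np.float64)   # avoid losing int precision in float
    return attr_dtype

print(unpacked_dtype(np.int16, np.float32))   # float32
print(unpacked_dtype(np.int32, np.float32))   # float64
print(unpacked_dtype(np.int32, np.float64))   # float64
```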

This text does not permit data to be packed into the same data type. The existing text seems to permit it, but why would you do it? If anyone has a use-case, we should consider it. Like the existing text, the new text does not permit integer data to be packed. However, if existing data is presented which does either of these things, it can still be unpacked with the rules given, so the new text does not invalidate any existing data. int64 is not a possible type for packed data because it is not shorter than any floating-point type.

Is this OK?

Best wishes

Jonathan

@kmuehlbauer

Dear Jonathan @JonathanGregory,

thanks for the quick reply. I have no use-case for packing integer into integer or floating point into floating point either. But I have seen almost any weird combination of differing dtypes of scale_factor, add_offset, _FillValue etc.

So it's good to have a guide to decode mismatched files.

@mankoff You have already been deep into this matter, would this new text apply to your use-cases too?

@mankoff
Author

mankoff commented Apr 27, 2023

Yes, this text is clearer. However, I've changed jobs and am no longer working on the same issue; I'm losing track of the depth I had on this last year.

@JonathanGregory I apologize for not answering your #374 (comment) from 9 Aug 2022. I think the response there is that if you find it "quite hard to understand", what hope do the rest of us have? I agree with your suggested change to the packing rules. That is, to me, clear and specific.

As for the more recent comment, your suggested paragraph on packing reads well. I suggest a change to unpacking. Currently,

When packed data is read, it should be unpacked to the type of the scale_factor and add_offset attributes, which must have the same type if both are present, except that packed data of type int or unsigned int should be unpacked to double if the scale_factor and add_offset attributes are float, to avoid a potential loss of precision.

According to the packing guidelines

float may (MUST?) be packed into byte, unsigned byte, short, or unsigned short.

So (if properly packed), you cannot have packed data as int or unsigned int with scale_factor and/or add_offset as float. They would be double, so it would unpack to double.

If we are adding recommendations about how to treat packed data that does not follow the spec, I suggest 1) a new paragraph (possibly a new section) and 2) there may be more scenarios than just this one described here, so we should strive to be more complete, and possibly more generic, in our recommended error handling. Maybe something about taking the largest data type of the unpacked data and then increasing by one data type or n bits.

@mankoff
Author

mankoff commented Apr 27, 2023

So (if properly packed), you cannot have packed data as int or unsigned int with scale_factor and/or add_offset as float. They would be double, so it would unpack to double.

Sorry to not be more explicit. This means everything including and after the except is redundant.

@JonathanGregory
Contributor

Dear Ken @mankoff and Kai @kmuehlbauer

Thanks for your comments. I think Ken is right to prefer "must", and I believe we all agree on

When packed data is written, the scale_factor and add_offset attributes must be of the same type as the unpacked data, which must be either float or double. Data of type float must be packed into one of these types: byte, unsigned byte, short, unsigned short. Data of type double must be packed into one of these types: byte, unsigned byte, short, unsigned short, int, unsigned int.

When packed data is read, it should be unpacked to the type of the scale_factor and add_offset attributes, which must have the same type if both are present.

I included the further sentence, about how to unpack int if the attributes are float, even though this doesn't conform to the packing rules, following Ken's original question in this issue. The sentence which Ken asked about in the existing text provides guidance for a situation of that kind. I agree that this isn't complete.

We have a choice. We could say nothing at all about how to deal with non-conforming data, or we could try to be complete. It would be fine to say nothing, insofar as CF is a convention for writing netCDF data and interpreting CF-conforming netCDF data, but it's not so helpful and may lead to questions like Ken's. We could work out comprehensive guidance, but is that appropriate for the CF standard? What do you think, Ken and Kai, and anyone else?

If we don't want to give detailed guidance, maybe we could just say, "We suggest that packed data which does not conform to the rules of this section regarding the types of the data variable and attributes should be unpacked to double type, in order to minimise the risk of loss of precision."
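That fallback suggestion could be sketched as follows (a hypothetical helper written by the editor, with a deliberately simplified and incomplete conformance check, just to show the unpack-to-double fallback; it is not normative):

```python
# Sketch of the suggested fallback: if the packed variable and its
# attributes do not match the rules of the section, unpack to double
# to minimise the risk of loss of precision.
import numpy as np

def safe_unpack(packed, scale_factor, add_offset):
    sf = np.asarray(scale_factor)
    ao = np.asarray(add_offset)
    # Simplified, incomplete conformance check -- for illustration only.
    conforming = (sf.dtype == ao.dtype
                  and sf.dtype in (np.dtype("float32"), np.dtype("float64"))
                  and packed.dtype != sf.dtype)
    target = sf.dtype if conforming else np.dtype("float64")
    return packed.astype(target) * target.type(sf) + target.type(ao)
```

For example, int16 data with matching float32 attributes unpacks to float32, while a mismatched file falls back to float64.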

Cheers

Jonathan

@mankoff
Author

mankoff commented Apr 28, 2023

I agree that giving guidance for non-conformant data is a good thing. Being complete and specific may be too high effort at this point. It is easier to be complete if we offer more general and vague advice. I agree with your suggested sentence:

We suggest that packed data which does not conform to the rules of this section regarding the types of the data variable and attributes should be unpacked to double type, in order to minimise the risk of loss of precision.

There may be more efficient ways to unpack non-conformant data, but the advice above is simple and implementable by anyone who implements an API.

@JonathanGregory
Contributor

Dear all

I have created this pull request to implement the changes shown in #374 (comment) and #374 (comment) in Section 8.1, and accordingly in the conformance document. In the conformance document I have deleted the single recommendation, which was the reason for this issue being raised. The new rules are more explicit and we have inserted a sentence of guidance about unpacking non-conformant data.

Because this is a defect ticket and it is much more than three weeks since the changes were agreed, I believe this change can be implemented immediately. Please could someone check that I have got the pull request right and merge it if so? Thanks.

In view of their valuable work on this issue, Ken Mankoff and Kai Mühlbauer should be added to the list of contributors to the CF conventions. Thanks to both.

Jonathan

@JonathanGregory added the new contributor label Sep 18, 2023
@mankoff
Author

mankoff commented Sep 18, 2023

As stated on the PR #456 I think this is a good change. Thank you.

@larsbarring
Contributor

With this strong support from the original enhancement proposers I am happy to merge this PR. Many thanks to all involved!
