Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summing up empty data #1001

Closed
Poshi opened this issue Mar 23, 2022 · 8 comments · Fixed by #1370
Closed

Summing up empty data #1001

Poshi opened this issue Mar 23, 2022 · 8 comments · Fixed by #1370
Labels

Comments

@Poshi
Copy link
Contributor

Poshi commented Mar 23, 2022

I found a small lack of documentation and an unexpected result while adding data.

I'm using an input in homogeneous CSV format, so all data is either empty or present and defined, but never absent. It could be something like:

x,y,z
3,6,
2,,7
2,,
1,5,8

I'm trying to perform the addition of these three fields, expecting mlr to interpret the empty fields as zeroes, like if they were absent, to minimize keystrokes (following the policy of simplifying the life to the user with non-present data). These are the results:

$ mlr --csv --from input.csv put '$sum = $x + $y + $z'
x,y,z,sum
3,6,,
2,,7,
2,,,
1,5,8,14

Only when all the data is non_empty the addition is performed. From the documentation I can see that this is expected. The reasoning behind is that conceptually, absent and empty are different things and have different meanings, but when you are in a homogeneous CSV context where absents are not possible, we hit this issue. In my specific case, the columns are file sizes, and the empty ones are for files that does not exist (so they are absent), but the only possible representation is be leaving the blank in the file.
Well, until this point it is just food for though (decisions and implementation has already been done so...).

But I also identified a point in the documentation that can be easily expanded and will help clarify this point:
I saw that in reference-main-null-data help page, the table at the end (arithmetic rules) does not contain an entry for empty data, only for absents and errors. If the behavior must be different from absent, it should be included here too.

@johnkerl
Copy link
Owner

@Poshi this absolutely sounds like a miss, perhaps in the Go port.

I absolutely treated empties as zeroes, documented that, regression-tested for it.

Let me see what I missed in the port.

@johnkerl
Copy link
Owner

@Poshi I stand corrected. Miller 5 -> 6 (C -> Go) didn't change behavior here. What I was recalling above as being "absolutely" true was the behavior for stats1 -- which was (and is) inconsistent not from Miller 5 to Miller 6, but rather, between stats1 and +.

$ mlr5 --c2p --from input.csv put '$sum = $x + $y + $z'
x y z sum
3 6 - -
2 - 7 -
2 - - -
1 5 8 14

$ mlr --c2p --from input.csv put '$sum = $x + $y + $z'
x y z sum
3 6 - -
2 - 7 -
2 - - -
1 5 8 14

$ mlr5 --c2p --from input.csv stats1 -a sum -f x,y,z
x_sum y_sum z_sum
8     11    15

$ mlr --c2p --from input.csv stats1 -a sum -f x,y,z
x_sum y_sum z_sum
8     11    15

We really should have similar semantics between stats1 and +, and I'm amazed (and saddened) I didn't notice this before.

This would be a slightly backward-incompatible change although (for strict semantic versioning) I would hesitate to call this Miller 7.0.0 -- we can put this in Miller 6.3.0.

@johnkerl
Copy link
Owner

johnkerl commented Mar 24, 2022

Note mlr step behaves like mlr stats1:

$ mlr --c2p --from input.csv step -a rsum -f x,y,z
x y z x_rsum y_rsum z_rsum
3 6 - 3      6      -
2 - 7 5      -      7
2 - - 7      -      -
1 5 8 8      11     15

@johnkerl
Copy link
Owner

johnkerl commented Mar 24, 2022

Design:

https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling

In that wording (which goes back a long time) I clearly intended empty plus number is number. So the code (C and Go) and the docs are consistent. The only thing that's inconsistent (and very much so) is how this compares columnwise summations like stats1 or step.

Workaround:

$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'
x y z sum
3 6 - 9
2 - 7 9
2 - - 2
1 5 8 14

@aborruso
Copy link
Contributor

$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'

@johnkerl what can I read to understand the meaning of ???0

Thank you

@Poshi
Copy link
Contributor Author

Poshi commented Mar 24, 2022

Design:

https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling

In that wording (which goes back a long time) I clearly intended empty plus number is number. So the code (C and Go) and the docs are consistent. The only thing that's inconsistent (and very much so) is how this compares columnwise summations like stats1 or step.

I disagree. In that page, you literally say:

Empty values are explicit in the data so they should explicitly affect accumulations: mlr put '@sum += $x' should accumulate numeric x values into the sum but an empty x, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use the is_not_null function as follows: mlr put 'is_not_null($x) { @sum += $x }'

"an empty x should make the sum non-numeric"... is just opposite of your words in this message "empty plus number is number".

The point is that it make sense that an empty value should affect operations and an absent not. But in homogeneous CSV files, you cannot distinguish empty from absent :-(

@johnkerl
Copy link
Owner

Thanks @Poshi ! You're right, you found another inconsistency.

To be clear though -- I very much think this change needs to be made; I'm in agreement. Thanks for helping me correctly catalog various things written in various places over the years -- helping to find the doc bits (as well as the code, obviously) that all need to be changed to make this happen.

@johnkerl
Copy link
Owner

$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'

@johnkerl what can I read to understand the meaning of ???0

Thank you

@aborruso
https://miller.readthedocs.io/en/latest/reference-dsl-builtin-functions/index.html#_30

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants