Summing up empty data #1001

Poshi · 2022-03-23T12:22:18Z

I found a small lack of documentation and an unexpected result while adding data.

I'm using an input in homogeneous CSV format, so all data is either empty or present and defined, but never absent. It could be something like:

x,y,z
3,6,
2,,7
2,,
1,5,8

I'm trying to perform the addition of these three fields, expecting mlr to interpret the empty fields as zeroes, like if they were absent, to minimize keystrokes (following the policy of simplifying the life to the user with non-present data). These are the results:

$ mlr --csv --from input.csv put '$sum = $x + $y + $z'
x,y,z,sum
3,6,,
2,,7,
2,,,
1,5,8,14

Only when all the data is non_empty the addition is performed. From the documentation I can see that this is expected. The reasoning behind is that conceptually, absent and empty are different things and have different meanings, but when you are in a homogeneous CSV context where absents are not possible, we hit this issue. In my specific case, the columns are file sizes, and the empty ones are for files that does not exist (so they are absent), but the only possible representation is be leaving the blank in the file.
Well, until this point it is just food for though (decisions and implementation has already been done so...).

But I also identified a point in the documentation that can be easily expanded and will help clarify this point:
I saw that in reference-main-null-data help page, the table at the end (arithmetic rules) does not contain an entry for empty data, only for absents and errors. If the behavior must be different from absent, it should be included here too.

The text was updated successfully, but these errors were encountered:

johnkerl · 2022-03-23T22:50:41Z

@Poshi this absolutely sounds like a miss, perhaps in the Go port.

I absolutely treated empties as zeroes, documented that, regression-tested for it.

Let me see what I missed in the port.

johnkerl · 2022-03-24T02:31:35Z

@Poshi I stand corrected. Miller 5 -> 6 (C -> Go) didn't change behavior here. What I was recalling above as being "absolutely" true was the behavior for stats1 -- which was (and is) inconsistent not from Miller 5 to Miller 6, but rather, between stats1 and +.

$ mlr5 --c2p --from input.csv put '$sum = $x + $y + $z'
x y z sum
3 6 - -
2 - 7 -
2 - - -
1 5 8 14

$ mlr --c2p --from input.csv put '$sum = $x + $y + $z'
x y z sum
3 6 - -
2 - 7 -
2 - - -
1 5 8 14

$ mlr5 --c2p --from input.csv stats1 -a sum -f x,y,z
x_sum y_sum z_sum
8     11    15

$ mlr --c2p --from input.csv stats1 -a sum -f x,y,z
x_sum y_sum z_sum
8     11    15

We really should have similar semantics between stats1 and +, and I'm amazed (and saddened) I didn't notice this before.

This would be a slightly backward-incompatible change although (for strict semantic versioning) I would hesitate to call this Miller 7.0.0 -- we can put this in Miller 6.3.0.

johnkerl · 2022-03-24T11:43:34Z

Note mlr step behaves like mlr stats1:

$ mlr --c2p --from input.csv step -a rsum -f x,y,z
x y z x_rsum y_rsum z_rsum
3 6 - 3      6      -
2 - 7 5      -      7
2 - - 7      -      -
1 5 8 8      11     15

johnkerl · 2022-03-24T11:46:40Z

Design:

https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling

In that wording (which goes back a long time) I clearly intended empty plus number is number. So the code (C and Go) and the docs are consistent. The only thing that's inconsistent (and very much so) is how this compares columnwise summations like stats1 or step.

Workaround:

$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'
x y z sum
3 6 - 9
2 - 7 9
2 - - 2
1 5 8 14

aborruso · 2022-03-24T11:47:02Z

$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'

@johnkerl what can I read to understand the meaning of ???0

Thank you

Poshi · 2022-03-24T12:44:11Z

Design:

https://miller.readthedocs.io/en/latest/reference-main-null-data/#rules-for-null-handling

In that wording (which goes back a long time) I clearly intended empty plus number is number. So the code (C and Go) and the docs are consistent. The only thing that's inconsistent (and very much so) is how this compares columnwise summations like stats1 or step.

I disagree. In that page, you literally say:

Empty values are explicit in the data so they should explicitly affect accumulations: mlr put '@sum += $x' should accumulate numeric x values into the sum but an empty x, when encountered in the input data stream, should make the sum non-numeric. To work around this you can use the is_not_null function as follows: mlr put 'is_not_null($x) { @sum += $x }'

"an empty x should make the sum non-numeric"... is just opposite of your words in this message "empty plus number is number".

The point is that it make sense that an empty value should affect operations and an absent not. But in homogeneous CSV files, you cannot distinguish empty from absent :-(

johnkerl · 2022-03-24T13:49:05Z

Thanks @Poshi ! You're right, you found another inconsistency.

To be clear though -- I very much think this change needs to be made; I'm in agreement. Thanks for helping me correctly catalog various things written in various places over the years -- helping to find the doc bits (as well as the code, obviously) that all need to be changed to make this happen.

johnkerl · 2022-03-27T15:28:32Z

$ mlr --c2p --from input.csv put '$sum = ($x???0) + ($y???0) + ($z???0)'

@johnkerl what can I read to understand the meaning of ???0

Thank you

@aborruso
https://miller.readthedocs.io/en/latest/reference-dsl-builtin-functions/index.html#_30

johnkerl added bug active labels Mar 23, 2022

johnkerl removed the active label Jan 1, 2023

johnkerl added the on deck label Aug 20, 2023

johnkerl mentioned this issue Aug 27, 2023

Summing up empty data #1370

Merged

johnkerl added active and removed on deck labels Aug 27, 2023

johnkerl closed this as completed in #1370 Aug 27, 2023

johnkerl removed the active label Aug 27, 2023

johnkerl mentioned this issue Aug 31, 2023

Treat empty like absent in + - * #1371

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Summing up empty data #1001

Summing up empty data #1001

Poshi commented Mar 23, 2022

johnkerl commented Mar 23, 2022

johnkerl commented Mar 24, 2022

johnkerl commented Mar 24, 2022 •

edited

Loading

johnkerl commented Mar 24, 2022 •

edited

Loading

aborruso commented Mar 24, 2022

Poshi commented Mar 24, 2022

johnkerl commented Mar 24, 2022

johnkerl commented Mar 27, 2022

Summing up empty data #1001

Summing up empty data #1001

Comments

Poshi commented Mar 23, 2022

johnkerl commented Mar 23, 2022

johnkerl commented Mar 24, 2022

johnkerl commented Mar 24, 2022 • edited Loading

johnkerl commented Mar 24, 2022 • edited Loading

aborruso commented Mar 24, 2022

Poshi commented Mar 24, 2022

johnkerl commented Mar 24, 2022

johnkerl commented Mar 27, 2022

johnkerl commented Mar 24, 2022 •

edited

Loading

johnkerl commented Mar 24, 2022 •

edited

Loading