Improve numeric matching #179

timbray · 2024-08-28T21:38:34Z

What is your idea?

For Quamina, a couple of folks figured out how to represent the whole range of 64-bit float values in 64 big-endian bits, then encode them in base128, then discard certain suffixes. Check out numbits.go: https://github.com/timbray/quamina/blob/main/numbits.go

So you get a smaller representation of numeric field values, and no more subsetting of the numbers that can be matched.
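For readers following along, here is a minimal Java sketch of the first step, the order-preserving mapping from a double to 64 bits. It is not the code in numbits.go or in Ruler; the class and method names (`OrderedDoubleBits`, `toOrderedBits`) are hypothetical, and it assumes the usual sign-flip trick: flip only the sign bit for non-negative values, flip every bit for negative ones, so that unsigned comparison of the results matches numeric order.

```java
public class OrderedDoubleBits {
    /**
     * Hypothetical helper: maps a double to a long whose UNSIGNED ordering
     * matches the numeric ordering of the original doubles.
     */
    static long toOrderedBits(double d) {
        long bits = Double.doubleToLongBits(d);
        // Arithmetic shift gives all ones for negatives and zero otherwise;
        // OR-ing in the sign bit means non-negatives get only that bit flipped.
        long mask = (bits >> 63) | Long.MIN_VALUE;
        return bits ^ mask;
    }

    /** Compares two doubles via their ordered bit patterns. */
    static int compareViaBits(double a, double b) {
        return Long.compareUnsigned(toOrderedBits(a), toOrderedBits(b));
    }

    public static void main(String[] args) {
        double[] samples = {-1e300, -1.0, -0.0, 0.0, 5e-324, 1.0, Double.MAX_VALUE};
        for (int i = 0; i + 1 < samples.length; i++) {
            // Each neighbouring pair should compare as strictly increasing.
            System.out.println(samples[i] + " < " + samples[i + 1] + " : "
                    + (compareViaBits(samples[i], samples[i + 1]) < 0));
        }
    }
}
```

One consequence of this mapping is that -0.0 and 0.0 end up as distinct, adjacent values, which a matcher may or may not want to treat as equal.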

I can't see any reason this wouldn't work in Ruler.

Would you be willing to make the change?

I seem to recall that Rishi just wired in a similar flavor of change, so I suggest this one would be easy for him.

baldawar · 2024-08-29T19:44:49Z

oh that's neat!

This should be doable within Ruler, though I won't be able to pick it up for a few weeks due to an internal launch.

Placing down links of the files I'd expect we'd need to touch to enforce this d04e3f0#diff-58bfacbbf2f6f6e26165ed131f0cae9667cf41d642d57b1a154ca97236507bce.

To keep codebases roughly similar, I'll try to imitate numbits.go as much as Java lets me.

timbray · 2024-08-29T19:52:10Z

BTW I wrote a blog about it at https://www.tbray.org/ongoing/When/202x/2024/08/28/Q-Numbers-2 and Arne Hoffman has promised to write an explanation of bit-masking voodoo, will put a pointer in here when I see it. Off the top of my head, I don't think Java should get in the way, although I'm not sure there's an equivalent of Go's math.Float64bits(f). It’s a little weird because the byte values are between 0 and 127 inclusive, a lot of which are not printable characters even though they are valid UTF-8. In Quamina we do a little extra work to shorten the 10-byte results where possible but I think that's going to screw up the horrible Range model-building logic; shouldn't make it impossible but it will have to be modified. If it were me I might decide to leave it at 10 bytes just to avoid that work.
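To make the "leave it at 10 bytes" option concrete, here is a rough Java sketch of a fixed-width base-128 encoding. It is an illustration under assumptions, not Quamina's or Ruler's actual code: the names (`Base128Sketch`, `encode`, `compare`) are hypothetical, the sign-flip mapping is the common trick rather than a copy of numbits.go, and no trailing bytes are trimmed. Since 64 bits do not divide evenly by 7, the first byte carries only the top bit and the remaining nine carry 7 bits each.

```java
public class Base128Sketch {
    static final int ENCODED_LENGTH = 10; // ceil(64 / 7)

    /** Hypothetical helper: order-preserving bit pattern for a double. */
    static long toOrderedBits(double d) {
        long bits = Double.doubleToLongBits(d);
        long mask = (bits >> 63) | Long.MIN_VALUE;
        return bits ^ mask;
    }

    /** Encodes a double as 10 big-endian bytes, each in the range 0..127. */
    static byte[] encode(double d) {
        long ordered = toOrderedBits(d);
        byte[] out = new byte[ENCODED_LENGTH];
        // Fill from the least significant 7-bit group backwards.
        for (int i = ENCODED_LENGTH - 1; i >= 0; i--) {
            out[i] = (byte) (ordered & 0x7F);
            ordered >>>= 7; // unsigned shift so no sign bits leak in
        }
        return out;
    }

    /** Lexicographic comparison; safe with signed bytes since all are 0..127. */
    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < ENCODED_LENGTH; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] low = encode(-273.15);
        byte[] high = encode(98.6);
        System.out.println(java.util.Arrays.toString(low));
        System.out.println(java.util.Arrays.toString(high));
        System.out.println("low sorts before high: " + (compare(low, high) < 0));
    }
}
```

Keeping every value at the same width means byte-wise comparison matches numeric order directly, which is the property range matching relies on.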

baldawar · 2024-08-29T20:04:51Z

Thanks Tim. A cursory check points me to Double.doubleToLongBits and related functions. I haven't had the time to explore how these methods behave across various scenarios, but hopefully it's good enough for Ruler's needs.
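As a starting point for that exploration, a small sketch of what `Double.doubleToLongBits` reports for a few edge cases. The class and helper names (`DoubleBitsEdgeCases`, `hex`) are made up for illustration. The behaviour worth noting is that `doubleToLongBits` collapses every NaN to the single canonical pattern `0x7ff8000000000000`, whereas `Double.doubleToRawLongBits` does not canonicalize.

```java
public class DoubleBitsEdgeCases {
    /** Hypothetical helper: 16-digit hex view of the bits Java reports for a double. */
    static String hex(double d) {
        return String.format("%016x", Double.doubleToLongBits(d));
    }

    public static void main(String[] args) {
        System.out.println("0.0       " + hex(0.0));
        System.out.println("-0.0      " + hex(-0.0));   // differs from 0.0 only in the sign bit
        System.out.println("1.0       " + hex(1.0));
        System.out.println("MIN_VALUE " + hex(Double.MIN_VALUE));
        System.out.println("+Infinity " + hex(Double.POSITIVE_INFINITY));
        System.out.println("NaN       " + hex(Double.NaN)); // canonical NaN pattern
    }
}
```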

This change follows the guidance from #179 on using a 10-byte base-128 encoded format for numbers, similar to how Quamina does it. I didn't see any performance implications from supporting the new range, but had to fix a bunch of tests. I will be changing the numbers we use for testing to better exercise the new range before merging. During debugging, I found it challenging to make sense of the numbers, so I've also added a helper method in ComparableNumbers and modified toString() methods in a few places.

timbray added the enhancement New feature or request label Aug 28, 2024

baldawar self-assigned this Sep 11, 2024

This was referenced Sep 18, 2024

[WIP] Numbits #187

Closed

Improve Numeric matching to support full range of float64 #188

Merged

baldawar mentioned this issue Sep 26, 2024

Minor updates to ComparableNumbers based on PR feedback. #190

Merged

baldawar closed this as completed Nov 8, 2024
