Improve numeric matching #179

Closed
timbray opened this issue Aug 28, 2024 · 3 comments

@timbray
Collaborator

timbray commented Aug 28, 2024

What is your idea?

For Quamina, a couple of folks figured out how to represent the whole range of 64-bit float values in 64 big-endian bits, then encode them in base-128, then discard certain suffixes. Check out numbits.go: https://github.com/timbray/quamina/blob/main/numbits.go

So you get a smaller representation of numeric field values, and no more subsetting of the numbers that can be matched.

I can't see any reason this wouldn't work in Ruler.
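
To make the idea concrete, here is a minimal Java sketch of the two steps described above: turning a double into 64 bits whose unsigned order matches numeric order, then writing those bits as ten base-128 digits. It is an illustration under assumptions, not the code in numbits.go or in Ruler; the class and method names (`NumbitsSketch`, `toOrderedBits`, `toBase128`, `encode`) are hypothetical, and the suffix-discarding step is left out.

```java
import java.util.Arrays;

final class NumbitsSketch {
    // 64 bits split into 7-bit digits needs ceil(64 / 7) = 10 digits.
    static final int ENCODED_LENGTH = 10;

    // Map a double to a long whose unsigned order matches numeric order.
    static long toOrderedBits(double d) {
        long bits = Double.doubleToLongBits(d);
        // The arithmetic shift yields all ones for negative inputs and zero otherwise,
        // so negative values get every bit flipped and the rest get only the sign bit flipped.
        long mask = (bits >> 63) | Long.MIN_VALUE;
        return bits ^ mask;
    }

    // Write the 64 bits as ten base-128 digits, most significant digit first.
    static byte[] toBase128(long ordered) {
        byte[] out = new byte[ENCODED_LENGTH];
        for (int i = ENCODED_LENGTH - 1; i >= 0; i--) {
            out[i] = (byte) (ordered & 0x7F); // each digit is in 0..127
            ordered >>>= 7;                   // unsigned shift, no sign extension
        }
        return out;
    }

    static byte[] encode(double d) {
        return toBase128(toOrderedBits(d));
    }

    public static void main(String[] args) {
        double[] samples = {-1e300, -2.5, 0.0, 1.0, 3.5, 1e300};
        for (double d : samples) {
            System.out.println(d + " -> " + Arrays.toString(encode(d)));
        }
    }
}
```

Because every digit fits in 0..127, comparing two encodings byte by byte gives the same answer as comparing the original numbers. One caveat of this sketch: -0.0 encodes to the value just below 0.0 rather than to the same bytes.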

Would you be willing to make the change?

I seem to recall that Rishi just wired in a similar flavor of change, so I suggest this one would be easy for him.

@timbray timbray added the enhancement New feature or request label Aug 28, 2024
@baldawar
Collaborator

oh that's neat!

This should be doable within Ruler, though I won't be able to pick it up for a few weeks due to an internal launch.

Noting the files I'd expect we'd need to touch to implement this: d04e3f0#diff-58bfacbbf2f6f6e26165ed131f0cae9667cf41d642d57b1a154ca97236507bce.

To keep codebases roughly similar, I'll try to imitate numbits.go as much as Java lets me.

@timbray
Collaborator Author

timbray commented Aug 29, 2024

BTW I wrote a blog about it at https://www.tbray.org/ongoing/When/202x/2024/08/28/Q-Numbers-2 and Arne Hoffman has promised to write an explanation of bit-masking voodoo, will put a pointer in here when I see it. Off the top of my head, I don't think Java should get in the way, although I'm not sure there's an equivalent of Go's math.Float64bits(f). It’s a little weird because the byte values are between 0 and 127 inclusive, a lot of which are not printable characters even though they are valid UTF-8. In Quamina we do a little extra work to shorten the 10-byte results where possible but I think that's going to screw up the horrible Range model-building logic; shouldn't make it impossible but it will have to be modified. If it were me I might decide to leave it at 10 bytes just to avoid that work.
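
As a rough illustration of the shortening step mentioned above, the sketch below drops trailing zero digits from the 10-byte form and relies on plain lexicographic comparison (a proper prefix sorts first) to keep the ordering intact. This is an assumption about how such trimming could work, not a copy of Quamina's code; `TrimmedNumbitsSketch`, `encodeFixed`, and `trimTrailingZeros` are hypothetical names.

```java
import java.util.Arrays;

final class TrimmedNumbitsSketch {
    // Fixed-width form: order-preserving bits written as ten base-128 digits (0..127).
    static byte[] encodeFixed(double d) {
        long bits = Double.doubleToLongBits(d);
        long ordered = bits ^ ((bits >> 63) | Long.MIN_VALUE);
        byte[] out = new byte[10];
        for (int i = 9; i >= 0; i--) {
            out[i] = (byte) (ordered & 0x7F);
            ordered >>>= 7;
        }
        return out;
    }

    // Hypothetical shortening step: drop trailing zero digits.
    // Zero is the smallest digit, so a trimmed value still sorts before
    // any longer value that shares its prefix.
    static byte[] trimTrailingZeros(byte[] fixed) {
        int end = fixed.length;
        while (end > 0 && fixed[end - 1] == 0) {
            end--;
        }
        return Arrays.copyOf(fixed, end);
    }

    public static void main(String[] args) {
        double[] samples = {-2.5, 0.0, 1.0, 3.5, 0.1};
        for (double d : samples) {
            byte[] trimmed = trimTrailingZeros(encodeFixed(d));
            System.out.println(d + " -> " + trimmed.length + " bytes " + Arrays.toString(trimmed));
        }
    }
}
```

Values with few significant mantissa bits shrink a lot, while something like 0.1 stays at or near the full 10 bytes. The trade-off is that encoded values no longer share a fixed width, so any logic that walks two bounds digit by digit has to cope with keys of different lengths; staying at 10 bytes avoids that.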

@baldawar
Copy link
Collaborator

Thanks Tim. A cursory check points me to Double.doubleToLongBits and related functions. I haven't had the time to explore how these methods behave across various scenarios, but hopefully they are good enough for Ruler's needs.
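
For anyone exploring those methods, a small sketch of the edge cases may help. It prints the bit patterns that `Double.doubleToLongBits` returns for a few inputs; the class name `DoubleBitsSketch` and the helper `hex` are hypothetical. One documented difference worth knowing: `doubleToLongBits` collapses every NaN to a single canonical pattern, whereas `Double.doubleToRawLongBits` does not, which likely makes the raw variant the closer analogue of Go's `math.Float64bits`.

```java
final class DoubleBitsSketch {
    // Hex view of the bit pattern from doubleToLongBits, zero-padded to 16 digits.
    static String hex(double d) {
        return String.format("%016x", Double.doubleToLongBits(d));
    }

    public static void main(String[] args) {
        System.out.println(hex(1.0));        // 3ff0000000000000
        System.out.println(hex(0.0));        // 0000000000000000
        System.out.println(hex(-0.0));       // 8000000000000000 (distinct from 0.0)
        System.out.println(hex(Double.NaN)); // 7ff8000000000000

        // A NaN carrying a non-default payload is collapsed to the canonical NaN
        // by doubleToLongBits; the raw variant does not canonicalize.
        double oddNaN = Double.longBitsToDouble(0x7ff8000000000123L);
        System.out.println(hex(oddNaN));     // 7ff8000000000000
        System.out.println(String.format("%016x", Double.doubleToRawLongBits(oddNaN)));

        // Non-NaN values round-trip exactly.
        double x = 3.5;
        System.out.println(Double.longBitsToDouble(Double.doubleToLongBits(x)) == x); // true
    }
}
```

Since NaN cannot appear in JSON numbers, the two methods should agree on every value a JSON event can carry; the -0.0 case is the one that may still need a decision.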

@baldawar baldawar self-assigned this Sep 11, 2024
baldawar added a commit that referenced this issue Sep 19, 2024
This change follows the guidance from #179 on using a 10-byte base-128 encoded format for numbers, similar to how Quamina does it.

Didn't see any performance implications of supporting the new range, but had to fix a bunch of tests. I will be changing the numbers we use for testing to better test the new range of numbers before merging.

During debugging, I found it challenging to make sense of the numbers, so I've also added a helper method in ComparableNumbers and modified toString() methods in a few places.
@baldawar baldawar closed this as completed Nov 8, 2024