-
-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shrink bloom filters when serializing partitions #1172
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm unsure the approach taken here will lead to a complete solution, see the other comment.
bb70f24
to
d2bd1ee
Compare
So 50-80% reduction in size? |
d2bd1ee
to
b26b66b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty good already.
The one thing I'm missing is an in-process measurement of the synopsis sizes. The overall size of the meta index should probably be reported in vast status
anyway.
f65da53
to
de33222
Compare
2aad26d
to
1d09499
Compare
5c41d8b
to
6ef3a77
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
When creating a new address synopsis structure, it is sized for the worst-case scenario of every event in the synopsis having a unique ip address. When finalizing a synopsis, the exact number of unique ips is known and the minimal size for the synopsis can be computed, so that's what we do.
The `bloom_filter_parameters` class has no input validation, so users have to rely on the `evaluate` function to see if a given combination of parameters is valid. However, this function can die with an assertion failure if some parameters were set incorrectly, creating a catch-22 situation. Additionally, this meant that these checks were not performed in release builds with assertions disabled, potentially leading to bloom filters with nonsensical settings.
ea443e7
to
0211bd7
Compare
Use a std::set over a std::vector + bloom_filter combination, since the latter confused multiple independent reviewers. This also has the benefit of allowing for stricter checks on where and how the buffered_address_synopsis is used, and we don't have to rely on the `type` member being annotated with stringified bloom filter parameters. Also, pass the actor handle of the index together with the persist atom, as opposed to when spawning a partition actor.
0211bd7
to
5c42cde
Compare
📔 Description
Shrink bloom filters when storing partitions.
Running this on the 5GiB suricata dataset and comparing against a CI build from latest master there seems to be a reduction of 10-15 MiB per partition, so a significant chunk of the 19 MiB we expect for synopsis bloom filters in total.
Looking at the address synopsis in particular: (for reference, all of these were exactly
1.2MiB
before the changes in this PR)📝 Checklist
🎯 Review Instructions