Shrink bloom filters when serializing partitions #1172

lava · 2020-11-16T16:44:24Z

📔 Description

Shrink bloom filters when storing partitions.

Running this on the 5GiB suricata dataset and comparing against a CI build from latest master there seems to be a reduction of 10-15 MiB per partition, so a significant chunk of the 19 MiB we expect for synopsis bloom filters in total.

vast-bloomresize$ ls -lh
total 759M
-rw-r--r-- 1 benno benno 92M Nov 20 01:01 0fc5ac1b-0bfd-4912-bece-521a77338370
-rw-r--r-- 1 benno benno 96M Nov 20 01:01 20f6569e-9833-4c86-bef0-370095a7b4ac
-rw-r--r-- 1 benno benno 18M Nov 20 01:03 3ad75efb-434c-44c0-86a8-2302728e6f80
-rw-r--r-- 1 benno benno 93M Nov 20 00:59 7c12d1b4-ef11-4bcb-92ba-e0f300ccb01c
-rw-r--r-- 1 benno benno 98M Nov 20 01:00 afb6ad5d-4f1b-4534-b328-9a8ad037c204
-rw-r--r-- 1 benno benno 91M Nov 20 01:02 c065e7a5-c936-4c47-9a93-08f85ed3ab98
-rw-r--r-- 1 benno benno 91M Nov 20 01:00 ca76cbc7-3b74-40f8-b5c5-a74dc93bed5c
-rw-r--r-- 1 benno benno 95M Nov 20 01:03 e543a4c2-39c1-4e80-8721-320678b2fff5
-rw-r--r-- 1 benno benno 90M Nov 20 00:59 ea1d7114-b65d-458f-a7e4-2d1f108a48ea
-rw-r--r-- 1 benno benno 720 Nov 20 01:03 index.bin


vast-master$ ls -lh
total 906M
-rw-r--r-- 1 benno benno 109M Nov 19 17:26 1b819c77-0ed6-4f7e-b357-e192081bbbcf
-rw-r--r-- 1 benno benno 111M Nov 19 17:28 27cda834-a790-406a-833e-e979a27f776e
-rw-r--r-- 1 benno benno 110M Nov 19 17:28 2e182fe7-461d-40b8-bc80-dc5bede6120f
-rw-r--r-- 1 benno benno 106M Nov 19 17:26 42aaad1b-d490-414f-a1ba-d3eb1604d201
-rw-r--r-- 1 benno benno 108M Nov 19 17:29 6304aef2-a6ec-462b-af39-b9d07408dac7
-rw-r--r-- 1 benno benno  34M Nov 19 17:29 7019a0e7-14bb-44d0-a900-f3549ca2fba2
-rw-r--r-- 1 benno benno 107M Nov 19 17:25 7f3cd1bb-3add-416b-95d4-e1e42198f335
-rw-r--r-- 1 benno benno 112M Nov 19 17:27 8a6c9d05-fa4f-486e-a9c6-6b9552227693
-rw-r--r-- 1 benno benno 113M Nov 19 17:29 b1f06f77-459e-4487-9b90-ff950a6244d2
-rw-r--r-- 1 benno benno  720 Nov 19 17:29 index.bin

Looking at the address synopsis in particular: (for reference, all of these were exactly 1.2MiB before the changes in this PR)

$ ~/src/vast/build-release/bin/lsvast -h ea1d7114-b65d-458f-a7e4-2d1f108a48ea | grep "ip: opaque_synopsis"
    suricata.smtp.dest_ip: opaque_synopsis (569 B)
    suricata.smtp.src_ip: opaque_synopsis (313 B)
    suricata.ftp.src_ip: opaque_synopsis (377 B)
    suricata.dns.dest_ip: opaque_synopsis (48.1 KiB)
    suricata.http.src_ip: opaque_synopsis (377 B)
    suricata.fileinfo.src_ip: opaque_synopsis (28.0 KiB)
    suricata.http.dest_ip: opaque_synopsis (42.7 KiB)
    suricata.fileinfo.dest_ip: opaque_synopsis (14.8 KiB)
    suricata.flow.dest_ip: opaque_synopsis (76.9 KiB)
    suricata.flow.src_ip: opaque_synopsis (76.5 KiB)
    suricata.ftp.dest_ip: opaque_synopsis (1.5 KiB)
    suricata.dns.src_ip: opaque_synopsis (377 B)
    suricata.tls.src_ip: opaque_synopsis (377 B)
    suricata.tls.dest_ip: opaque_synopsis (9.6 KiB)

📝 Checklist

All user-facing changes have changelog entries.
The changes are reflected on docs.tenzir.com/vast, if necessary.
The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

tobim

I'm unsure the approach taken here will lead to a complete solution, see the other comment.

libvast/vast/qualified_record_field.hpp

libvast/vast/system/partition.hpp

mavam · 2020-11-17T19:21:44Z

[..] there seems to be a reduction of 10-15 MiB per partition, so a significant chunk of the 19 MiB [..]

So 50-80% reduction in size?

tobim

This is looking pretty good already.
The one thing I'm missing is an in-process measurement of the synopsis sizes. The overall size of the meta index should probably be reported in vast status anyway.

libvast/src/system/index.cpp

libvast/vast/fwd.hpp

libvast/vast/address_synopsis.hpp

libvast/vast/system/partition.hpp

libvast/vast/address_synopsis.hpp

tobim

Nice work!

When creating a new address synopsis structure, it is sized for the worst-case scenario of every event in the synopsis having a unique ip address. When finalizing a synopsis, the exact number of unique ips is known and the minimal size for the synopsis can be computed, so that's what we do.

The `bloom_filter_parameters` class has no input validation, so users have to rely on the `evaluate` function to see if a given combination of parameters is valid. However, this function can die with an assertion failure if some parameters were set incorrectly, creating a catch-22 situation. Additionally, this meant that these checks were not performed in release builds with assertions disabled, potentially leading to bloom filters with nonsensical settings.

Use a std::set over a std::vector + bloom_filter combination, since the latter confused multiple independent reviewers. This also has the benefit of allowing for stricter checks on where and how the buffered_address_synopsis is used, and we don't have to rely on the `type` member being annotated with stringified bloom filter parameters. Also, pass the actor handle of the index together with the persist atom, as opposed to when spawning a partition actor.

tobim reviewed Nov 16, 2020

View reviewed changes

libvast/vast/qualified_record_field.hpp Show resolved Hide resolved

libvast/vast/system/partition.hpp Outdated Show resolved Hide resolved

mavam added the performance Improvements or regressions of performance label Nov 17, 2020

lava force-pushed the topic/bloom-filter-sizing branch 3 times, most recently from bb70f24 to d2bd1ee Compare November 17, 2020 17:55

lava force-pushed the topic/bloom-filter-sizing branch from d2bd1ee to b26b66b Compare November 18, 2020 12:14

tobim requested changes Nov 18, 2020

View reviewed changes

libvast/src/system/index.cpp Outdated Show resolved Hide resolved

libvast/vast/fwd.hpp Outdated Show resolved Hide resolved

libvast/vast/address_synopsis.hpp Outdated Show resolved Hide resolved

lava force-pushed the topic/bloom-filter-sizing branch 2 times, most recently from f65da53 to de33222 Compare November 19, 2020 14:59

lava marked this pull request as ready for review November 19, 2020 15:57

lava force-pushed the topic/bloom-filter-sizing branch 2 times, most recently from 2aad26d to 1d09499 Compare November 20, 2020 09:13

tobim reviewed Nov 20, 2020

View reviewed changes

libvast/vast/system/partition.hpp Outdated Show resolved Hide resolved

tobim reviewed Nov 20, 2020

View reviewed changes

libvast/vast/system/partition.hpp Show resolved Hide resolved

tobim reviewed Nov 20, 2020

View reviewed changes

libvast/vast/address_synopsis.hpp Outdated Show resolved Hide resolved

tobim reviewed Nov 20, 2020

View reviewed changes

libvast/vast/address_synopsis.hpp Show resolved Hide resolved

lava force-pushed the topic/bloom-filter-sizing branch from 5c41d8b to 6ef3a77 Compare November 20, 2020 14:48

tobim approved these changes Nov 20, 2020

View reviewed changes

lava added 2 commits November 20, 2020 16:14

lava force-pushed the topic/bloom-filter-sizing branch 2 times, most recently from ea443e7 to 0211bd7 Compare November 20, 2020 15:41

lava force-pushed the topic/bloom-filter-sizing branch from 0211bd7 to 5c42cde Compare November 20, 2020 16:27

lava merged commit 5078b9f into master Nov 20, 2020

lava deleted the topic/bloom-filter-sizing branch November 20, 2020 17:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shrink bloom filters when serializing partitions #1172

Shrink bloom filters when serializing partitions #1172

lava commented Nov 16, 2020 •

edited

Loading

tobim left a comment

mavam commented Nov 17, 2020

tobim left a comment

tobim left a comment

Shrink bloom filters when serializing partitions #1172

Shrink bloom filters when serializing partitions #1172

Conversation

lava commented Nov 16, 2020 • edited Loading

📔 Description

📝 Checklist

🎯 Review Instructions

tobim left a comment

Choose a reason for hiding this comment

mavam commented Nov 17, 2020

tobim left a comment

Choose a reason for hiding this comment

tobim left a comment

Choose a reason for hiding this comment

lava commented Nov 16, 2020 •

edited

Loading