how does the compression work? #5
Just FYI, his 20-minute talk "Improved CRL compression with structured linear functions" at RWC 2022 discusses this some.
So you could use a Huffman code, and you would get the usual cost: say, up to 4 bits per codeword. However, you don't have to constrain the outputs of these linear maps; they will instead give you (depending on the hashes and whatever) a pseudorandom answer, which is an option not present in most applications of Huffman codes. So you can also consider a non-power-of-two number of encodings, in which case the cost might come out to something like 1.75 bits per value. It turns out that it's always optimal for every encoding except one to have a power-of-two size, so it's almost like a Huffman code. I'm not sure how it compares to Huffman in general, but non-power-of-two sizes are definitely a win for cases like CRLite, where the ratio is something like 1% revoked vs. 99% not revoked, and a Huffman code would be wasteful because it spends at least one bit to encode each value.
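For later readers, here is a quick back-of-the-envelope check of why a plain Huffman code is wasteful at a CRLite-like 1% / 99% split. This is my own throwaway sketch, not the library's actual cost model:

```rust
fn main() {
    let p_revoked = 0.01_f64;
    let p_ok = 1.0 - p_revoked;

    // Shannon entropy: the best any encoder can do, in bits per key.
    let entropy = -(p_revoked * p_revoked.log2() + p_ok * p_ok.log2());

    // A plain Huffman code must spend at least one whole bit per key,
    // because every codeword length is an integer >= 1.
    let huffman_cost = 1.0_f64;

    println!("entropy : {:.4} bits/key", entropy);        // ~0.0808
    println!("huffman : {:.4} bits/key", huffman_cost);   // 1.0000
    println!("waste   : {:.1}x", huffman_cost / entropy); // ~12.4x
}
```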
That talk was fascinating. Thank you for sharing and presenting, respectively. Here's some of what I got out of it.

The Huffman/arithmetic encoding discussed above reduces the arbitrary case down to a series of exact set membership tests. If our values have a skewed enough distribution, then the first layer covers most keys cheaply. Going a little further, the second layer here is just another case of the exact set membership tests, this time over a smaller key set.

And while I was writing this up, you replied. So next on my to-do list is to read your comment!
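To make "a series of exact set membership tests" concrete, here is roughly how I picture the query side. Everything here is invented for illustration (plain `HashSet`s stand in for the compressed per-level structures); this is not the library's actual API:

```rust
use std::collections::HashSet;

// levels[d] answers "is bit d of this key's codeword a 1?" for exactly the
// keys whose codeword is at least d+1 bits long.
struct LayeredCode {
    levels: Vec<HashSet<u64>>,
}

impl LayeredCode {
    // Walk the prefix code one bit at a time, asking one exact membership
    // question per level, until the accumulated bits form a complete codeword.
    fn query(&self, key: u64, decode: impl Fn(&[bool]) -> Option<u32>) -> Option<u32> {
        let mut bits = Vec::new();
        for level in &self.levels {
            bits.push(level.contains(&key));
            if let Some(value) = decode(&bits) {
                return Some(value);
            }
        }
        None
    }
}

fn main() {
    // Toy prefix code: "0" => value 0 (common case), "10" => 1, "11" => 2.
    let code = LayeredCode {
        levels: vec![
            HashSet::from([42u64]), // keys whose first codeword bit is 1
            HashSet::new(),         // keys whose second codeword bit is 1
        ],
    };
    let decode = |bits: &[bool]| -> Option<u32> {
        match bits {
            [false] => Some(0),
            [true, false] => Some(1),
            [true, true] => Some(2),
            _ => None, // prefix is not yet a complete codeword
        }
    };
    assert_eq!(code.query(7, &decode), Some(0));  // stops after one test
    assert_eq!(code.query(42, &decode), Some(1)); // needs the second layer
}
```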
That is a very cool algorithm for reusing the accidental values. How do you construct the encoded keys such that you have a high probability of these happening?
That's very interesting. Why does that end up happening? I would have guessed you want, as much as possible, the number of symbols that map to the same
If I understand Huffman encodings correctly, a symbol that represents > 50% of the data is guaranteed to get a length-one codeword. So that ratio should automatically give us the most flexibility.
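For what it's worth, here's a tiny sanity check of that claim (my own throwaway code, nothing from this repo): it computes Huffman codeword lengths for an example distribution where one symbol carries 60% of the mass, and that symbol does come out with a 1-bit codeword.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn huffman_lengths(freqs: &[u64]) -> Vec<u32> {
    // Each heap entry: (subtree frequency, indices of the symbols in that subtree).
    let mut heap: BinaryHeap<Reverse<(u64, Vec<usize>)>> = freqs
        .iter()
        .enumerate()
        .map(|(i, &f)| Reverse((f, vec![i])))
        .collect();
    let mut lengths = vec![0u32; freqs.len()];
    while heap.len() > 1 {
        // Merge the two least frequent subtrees; every symbol inside them
        // ends up one level deeper in the code tree.
        let Reverse((fa, a)) = heap.pop().unwrap();
        let Reverse((fb, b)) = heap.pop().unwrap();
        for &i in a.iter().chain(b.iter()) {
            lengths[i] += 1;
        }
        let mut merged = a;
        merged.extend(b);
        heap.push(Reverse((fa + fb, merged)));
    }
    lengths
}

fn main() {
    // Symbol 0 carries 60% of the mass, so it should get a 1-bit codeword.
    println!("{:?}", huffman_lengths(&[60, 20, 10, 10])); // [1, 2, 3, 3]
}
```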
Thinking about this overnight. The math I did yesterday isn't quite helpful. In the regimes where I thought a
Should cost only
You are absolutely correct! Both about my mistake and its implications.
Very nice talk! In the table it says ribbon filters have 20% or 7% space overhead, but this isn't really correct. The simplest standard ribbon filter with the smash technique, as described in the ribbon filter paper, easily achieves 1% space overhead. Also, query time is about 2.5x that of binary fuse filters in my tests.
Ah, I'd missed that the smash technique fixes the overhead. Binary fuse filters are probably just better than either Ribbon or Frayed Ribbon, right? Faster to construct, faster to query, and IIRC they have a proof of correctness? |
I was initially using binary fuse filters; they are very fast, but they have too much space overhead (20%-7%). In my use case it is better to use more CPU but have a lower false positive rate, and since I'm only limited by memory, lower space overhead is better. I also want to port this library to Zig if I can understand it eventually 😁. It looks like it would be better than a ribbon filter. Smash fixes the overhead with small key sets under 100k in my tests, but it is not so great with big sets like 1M keys. Bumped ribbon seems like it would be able to scale to bigger sets, so I'll try to test that as well for my use case.
Hm, I'd thought there was a parameter set for binary fuse that achieved only a few % overhead. Maybe not tho.
Bumped ribbon filters are good, but IIRC they only work as approxsets, not as general functions. Another possibility for larger sets is sharding: just break the approxset or function into subsets of ~100k keys using a hash function, and make a separate map for each subset.
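In case it helps with planning the port, the sharding idea could look something like this. The names here are hypothetical, not this library's API:

```rust
// Route a key to a shard with any decent hash; a cheap multiplicative mix
// is enough for illustration.
fn shard_of(key: u64, num_shards: usize) -> usize {
    let h = key.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    (h % num_shards as u64) as usize
}

// Split a large key set into roughly `target_shard_size`-key shards.
// Each shard would then get its own approxset/function built over it, and a
// query for `key` only consults the map at shards[shard_of(key, num_shards)].
fn build_shards(keys: &[u64], target_shard_size: usize) -> Vec<Vec<u64>> {
    let num_shards = (keys.len() / target_shard_size).max(1);
    let mut shards = vec![Vec::new(); num_shards];
    for &k in keys {
        shards[shard_of(k, num_shards)].push(k);
    }
    shards
}

fn main() {
    let keys: Vec<u64> = (0..1_000_000).collect();
    let shards = build_shards(&keys, 100_000);
    println!("{} shards, first has {} keys", shards.len(), shards[0].len());
}
```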
Thanks for the tip, that might work best! Bumped ribbon says it works for approxsets, but the code is more complicated, so I left it for now. I'll probably just port this library to Zig when I come back to it.
After extensively reading the code, my best guess was:
and you responded with:
I would love to hear more about how you analyze the cost model, and how you took advantage of getting the correct answer by accident. But I don't want to sidetrack that issue any more than I already have, so I thought I would open a separate issue for you to expound on your insights, if you're willing to.