
how does the compression work? #5

Open · Eh2406 opened this issue Nov 5, 2024 · 12 comments


Eh2406 commented Nov 5, 2024

After extensively reading the code, my best guess was:

I think the algorithm (compressed_map) is using "arithmetic coding" (or some homegrown alternative). But I can explain it better using Huffman encoding.

The size of a probabilistic map is (some constant C) * (the number of keys we want to guarantee we get the correct answer for) * (the size of each value), if we ignore taking advantage of "accidentally getting the correct answer". At first blush it seems like we're stuck. The size of each value is basically fixed, as is the number of keys we care about getting the correct answer for. We could deal with different bits of the value in different structures, but our formula is linear, so that doesn't get us anything: it takes the same amount of space to say we have N structures that each encode 1 bit as it does to say we have 1 structure that encodes N bits.

That's where Huffman encoding comes in. We encode the values according to the Huffman algorithm. Then we process things one bit at a time. The first data structure encodes the first bit and needs to be correct for every key. So it takes quite a bit of space, though it avoids being a disastrous amount because it's only one bit wide. We can do this again for the next bit, but this time we can skip any keys whose Huffman encoding has already reached a leaf node. We don't care what value is returned by the structure for that bit of that key, because a reader processing that key will already have found a value. With every additional bit we have fewer keys to process, and therefore a smaller data structure.
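
As a rough sketch of that per-bit layering (hypothetical Rust, not the compressed_map API; the codewords and key counts below are made up), each layer only has to be correct for the keys whose codeword hasn't ended yet:

```rust
// Given how many keys map to each value and a prefix-free code for the values,
// count how many keys each per-bit layer still has to be correct for. Keys
// whose codeword has already ended are "don't care" in later layers, so each
// layer shrinks.
fn keys_per_layer(value_counts: &[(usize, &str)]) -> Vec<usize> {
    let max_len = value_counts
        .iter()
        .map(|(_, code)| code.len())
        .max()
        .unwrap_or(0);
    (0..max_len)
        .map(|bit| {
            value_counts
                .iter()
                .filter(|(_, code)| code.len() > bit)
                .map(|(count, _)| count)
                .sum()
        })
        .collect()
}

fn main() {
    // Made-up skewed distribution: 90 keys -> "0", 7 keys -> "10", 3 keys -> "11".
    let counts = [(90, "0"), (7, "10"), (3, "11")];
    // The first layer must be right for all 100 keys; the second layer only
    // for the 10 keys whose codeword is two bits long.
    println!("{:?}", keys_per_layer(&counts)); // [100, 10]
}
```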

and you responded with:

> My compressed_map crate can be thought of as more or less the same idea as a Huffman code plus a BinaryFuse-like matrix map, much as @Eh2406 suggests. It's not quite a Huffman code because the possibility of getting the correct answer "by accident" changes the cost model, but it's the same idea.

I would love to hear more about how you analyze the cost model, and how you took advantage of getting the correct answer by accident. But I don't want to sidetrack that issue anymore than I already have. So I thought I would open a separate issue for you to expound on your insights, if you're willing to.


burdges commented Nov 6, 2024

Just fyi, his 20 min talk "Improved CRL compression with structured linear functions" at RWC 2022 discusses this some

@bitwiseshiftleft (Owner)

So you could use a Huffman code and you would get the usual cost: like if you have up to 4 bits in the code, and foo has an encoding 11??, then for each key that maps to foo you have to constrain the output value to exactly 11 for the first two maps (or two-bit outputs from a single CompressedRandomMap), and the other two you don't care about so you don't constrain them. Each constraint costs 1+epsilon bits, where epsilon is the overhead of the underlying CompressedRandomMap. So this example costs 2(1+epsilon) bits for each key that maps to foo.

However, you don't have to constrain the outputs of these linear maps, and they will instead give you (depending on the hashes and whatever) a pseudorandom answer, which is an option not present in most applications of Huffman codes. So you can also consider a non-power-of-two size: foo might instead have an encoding {1011, 11??}. Encoding proceeds from the right, and you don't constrain the value of the foos in the rightmost two bits, because foo has valid encodings no matter what those two bits are (as far as I can tell this is optimal). Then some of the foo values will happen to land on 11, since it's pseudorandom. For those values, the left two bits can be either 10 or 11 and they will decode correctly, so you encode them as 1?, costing one bit. For the other ones, you must set both of the first two bits to get the right encoding, costing two bits.

So the cost in this case would be 1.75 bits per foo, because approximately 1/4 of them cost 1 bit and the other 3/4 of them cost 2 bits. You encode from right to left because the other direction would cost more: first bit always costs you, the second doesn't, and then half the time you are in the 10 case and have to spend 2 more bits to finish out the 1011 branch, for a total cost of 2 bits per foo.
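
As a sanity check of that 1.75-bit figure, here is a small sketch (hypothetical Rust, not library code) that enumerates the four possible values of the unconstrained low bits for the {1011, 11??} example and counts how many high bits must be pinned in each case:

```rust
// foo's valid codewords are 1011 plus anything matching 11??. The two
// low-order bits are left unconstrained, so they come out pseudorandom;
// depending on where they land, either one or two high-order bits must be set.
fn constrained_high_bits(low_bits: u8) -> u32 {
    // A 4-bit codeword decodes to foo if it is 0b1011 or its high bits are 0b11.
    let is_valid = |code: u8| code == 0b1011 || (code >> 2) == 0b11;
    // How many high-bit patterns still decode to foo given these low bits?
    let valid_highs = (0..4u8).filter(|&hi| is_valid(hi << 2 | low_bits)).count() as u32;
    // The valid high-bit patterns here always form a prefix subtree, so
    // pinning 2 - log2(valid) of the high bits is enough.
    2 - valid_highs.ilog2()
}

fn main() {
    let total: u32 = (0..4u8).map(constrained_high_bits).sum();
    // The low bits land on 11 a quarter of the time (one constrained bit);
    // the other three quarters need both high bits, so the average is 1.75.
    println!("expected constrained bits per foo = {}", total as f64 / 4.0);
}
```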

It turns out that it's always optimal for every encoding except for one to be a power-of-two size. So it's almost like a Huffman code. I'm not sure how it compares to Huffman in general, but non-power-of-two sizes are definitely a win for cases like CRLite where the ratios are like 1% revoked vs 99% not, and a Huffman code would be wasteful because it spends at least one bit to encode each value.


Eh2406 commented Nov 6, 2024

That talk was fascinating. Thank you for sharing and presenting respectively. Here's some of what I got out of it.

The Huffman/arithmetic encoding discussed above reduces the arbitrary case down to a series of exact set-membership tests. CompressedRandomMap can store this in (some constant C) * (the number of keys we want to guarantee we get the correct answer for) * (the size of each value). We can ignore C, because everything I'm thinking about is linear in C, and it's 1.001 for frayed ribbon according to the presentation. Since a set-membership test has one-bit values, this takes one bit per key.

If our values have a skewed enough distribution, then we can use an ApproxSet to do better. Without loss of generality, I'm going to assume that 1 is the less common value, and that 1 occurs a portion P of the time. We can use an ApproxSet first and then a full CompressedRandomMap only on the keys included in the ApproxSet. This 2-level scheme will take (2^epsilon)*keys*P for the ApproxSet and P*keys + keys*(1-P)/(2^epsilon) for the CompressedRandomMap. In practice epsilon has to be an integer for the value == hash schemes, and should be a power of two to avoid unaligned reads. Playing with that formula in Excel, this strategy is worthwhile for P < 0.2 with epsilon = 1. epsilon = 2 beats that out for P < 0.1111. epsilon = 3 beats that out for P < 0.0154. This is a really cool insight. Thank you again.

Going a little further, the second layer here is just another case of the exact set membership tests, this time with P*keys + keys*(1-P)/(2^epsilon) keys and a new P of (2^epsilon) * P / ((2^epsilon) * P - P + 1). This new P is generally > 0.2, so CompressedRandomMap is the best choice. But if the original P < 0.055, the new P ends up small enough to be worth adding a third layer!
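
As a sketch of that recursion (hypothetical Rust helper, not the crate's API), the survivor count and the new P can be computed directly from the formulas above:

```rust
// After filtering through an ApproxSet with false-positive rate 2^-epsilon,
// the surviving keys are the P*keys true members plus (1-P)*keys/2^epsilon
// false positives, and the fraction of survivors whose value is 1 is the
// "new P" for the next layer.
fn next_layer(keys: f64, p: f64, eps: u32) -> (f64, f64) {
    let fp_rate = 1.0 / (1u64 << eps) as f64;
    let survivors = p * keys + (1.0 - p) * keys * fp_rate;
    let new_p = p * keys / survivors; // = 2^eps * P / (2^eps * P - P + 1)
    (survivors, new_p)
}

fn main() {
    // Example: one million keys, 5% of them with value 1, epsilon = 1.
    let (survivors, new_p) = next_layer(1_000_000.0, 0.05, 1);
    println!("survivors = {survivors}, new P = {new_p:.3}");
    // Prints roughly 525000 survivors and a new P of about 0.095, which is
    // what you would feed back in to decide whether another layer pays off.
}
```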

And while I was writing this up, you replied. So next on my to do list is to read your comment!


Eh2406 commented Nov 6, 2024

That is a very cool algorithm for reusing the accidental values. How do you construct the encoding such that you have a high probability of these accidental matches happening?

> It turns out that it's always optimal for every encoding except for one to be a power-of-two size.

That's very interesting. Why does that end up happening? I would have guessed you want, as much as possible, the number of symbols that map to the same foo to be proportional to the number of foos in the data sets.

> I'm not sure how it compares to Huffman in general, but non-power-of-two sizes are definitely a win for cases like CRLite where the ratios are like 1% revoked vs 99% not, and a Huffman code would be wasteful because it spends at least one bit to encode each value.

If I understand Huffman encodings correctly, a symbol that represents > 50% of the data is guaranteed to get a length-one code. So that ratio should automatically give us the most flexibility: 1?????. Even so, if that symbol completely dominates the distribution, then we end up in the skewed case where an ApproxSet+CompressedRandomMap fits well.


Eh2406 commented Nov 7, 2024

Thinking about this overnight, the math I did yesterday isn't quite helpful. In the regimes where I thought an ApproxSet with epsilon >= 2 was the smallest option, two or more layers of ApproxSet with epsilon = 1 are better. Furthermore, a 1-bit ApproxSet is the same as a column in your construction.

@bitwiseshiftleft (Owner)

> (2^epsilon)*keys*P for the ApproxSet and P*keys + keys*(1-P)/(2^epsilon)

Should cost only epsilon*keys*P, which I think changes it back to two ApproxSets in front costing the same as one.
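
A quick numeric check of that (idealized arithmetic, assuming independent hashes per layer; not library code): with the filter cost taken as epsilon*keys*P, one ApproxSet with epsilon = 2 and two stacked ApproxSets with epsilon = 1 cost the same and leave the same survivors for the final CompressedRandomMap:

```rust
// Stack one or more ApproxSet layers in front of the final map and track the
// total filter cost (epsilon bits per stored member per layer) plus how many
// keys survive the filtering.
fn filter_cost_and_survivors(keys: f64, p: f64, eps_per_layer: &[u32]) -> (f64, f64) {
    let members = p * keys; // keys with the rare value, kept by every layer
    let mut non_members = (1.0 - p) * keys;
    let mut cost_bits = 0.0;
    for &eps in eps_per_layer {
        cost_bits += eps as f64 * members; // epsilon bits per stored member
        non_members /= (1u64 << eps) as f64; // false-positive rate 2^-epsilon
    }
    (cost_bits, members + non_members)
}

fn main() {
    let (keys, p) = (1_000_000.0, 0.01);
    println!("{:?}", filter_cost_and_survivors(keys, p, &[2])); // one eps = 2 layer
    println!("{:?}", filter_cost_and_survivors(keys, p, &[1, 1])); // two eps = 1 layers
    // Both print (20000.0, 257500.0): same filter cost, same survivors.
}
```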


Eh2406 commented Nov 11, 2024

You are absolutely correct! Both about my mistake and its implications.

@ozgrakkurt

> Just fyi, his 20 min talk "Improved CRL compression with structured linear functions" at RWC 2022 discusses this some

Very nice talk! In the table it says ribbon filters have 20% or 7% space overhead, but this isn't really correct. The simplest standard ribbon filter with the smash technique, as described in the ribbon filter paper, easily achieves 1% space overhead. Also, query time is about 2.5x that of binary fuse filters in my tests.

@bitwiseshiftleft (Owner)

Ah, I'd missed that the smash technique fixes the overhead.

Binary fuse filters are probably just better than either Ribbon or Frayed Ribbon, right? Faster to construct, faster to query, and IIRC they have a proof of correctness?

@ozgrakkurt

> Ah, I'd missed that the smash technique fixes the overhead.
>
> Binary fuse filters are probably just better than either Ribbon or Frayed Ribbon, right? Faster to construct, faster to query, and IIRC they have a proof of correctness?

I was initially using binary fuse filters; they are very fast but they have too much space overhead (20%-7%). In my use case it is better to use more CPU but have a lower false-positive rate; it is only limited by memory, so lower space overhead is better.

Also want to port this library to Zig if I can understand it eventually 😁. It looks like it would be better than a ribbon filter.

Smash fixes the overhead with small key sets (under 100k) in my tests, but it is not so great with big sets like 1M keys. Bumped ribbon seems like it would be able to scale to bigger sets, so I'll try to test that as well for my use case.

@bitwiseshiftleft (Owner)

> I was initially using binary fuse filters; they are very fast but they have too much space overhead (20%-7%). In my use case it is better to use more CPU but have a lower false-positive rate; it is only limited by memory, so lower space overhead is better.

Hm, I'd thought there was a parameter set for binary fuse that achieved only a few % overhead. Maybe not tho.

> Also want to port this library to Zig if I can understand it eventually 😁. It looks like it would be better than a ribbon filter.
>
> Smash fixes the overhead with small key sets (under 100k) in my tests, but it is not so great with big sets like 1M keys. Bumped ribbon seems like it would be able to scale to bigger sets, so I'll try to test that as well for my use case.

Bumped ribbon filters are good but IIRC they only work as approxsets, not as general functions.

Another possibility for larger sets is sharding. Just break the approxset or function into subsets of ~100k keys, using a hash function, and make a separate map for each subset.
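
A minimal sketch of that sharding approach (hypothetical Rust, not the compressed_map API; the hasher and the ~100k target size are just placeholders):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hash each key to one of num_shards buckets.
fn shard_of<K: Hash>(key: &K, num_shards: u64) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() % num_shards) as usize
}

// Split the key set into shards of roughly target_shard_size keys each.
fn shard_keys<K: Hash>(keys: Vec<K>, target_shard_size: usize) -> Vec<Vec<K>> {
    let num_shards = (keys.len() / target_shard_size).max(1) as u64;
    let mut shards: Vec<Vec<K>> = (0..num_shards).map(|_| Vec::new()).collect();
    for key in keys {
        let s = shard_of(&key, num_shards);
        shards[s].push(key);
    }
    // Each shards[i] would then get its own approxset/function; a query hashes
    // the key the same way and consults only that shard's structure.
    shards
}

fn main() {
    let keys: Vec<u64> = (0..1_000_000).collect();
    let shards = shard_keys(keys, 100_000);
    println!("{} shards, first has {} keys", shards.len(), shards[0].len());
}
```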

@ozgrakkurt

> > I was initially using binary fuse filters; they are very fast but they have too much space overhead (20%-7%). In my use case it is better to use more CPU but have a lower false-positive rate; it is only limited by memory, so lower space overhead is better.
>
> Hm, I'd thought there was a parameter set for binary fuse that achieved only a few % overhead. Maybe not tho.
>
> > Also want to port this library to Zig if I can understand it eventually 😁. It looks like it would be better than a ribbon filter.
> >
> > Smash fixes the overhead with small key sets (under 100k) in my tests, but it is not so great with big sets like 1M keys. Bumped ribbon seems like it would be able to scale to bigger sets, so I'll try to test that as well for my use case.
>
> Bumped ribbon filters are good but IIRC they only work as approxsets, not as general functions.
>
> Another possibility for larger sets is sharding. Just break the approxset or function into subsets of ~100k keys, using a hash function, and make a separate map for each subset.

Thanks for the tip, that might work best!

Bumped ribbon says it works for approxsets, but the code is more complicated so I left it for now. I'll probably just port this library to Zig when I come back to it.
