Edge case fail in all Huffman Encoding implementations. #659
Comments
These are the ones we know are broken so far (I or another maintainer have personally tested them), and those that are untested:
Here is a first stab at fixing the Python implementation:
diff --git a/contents/huffman_encoding/code/python/huffman.py b/contents/huffman_encoding/code/python/huffman.py
index 1e1cc54..75700cd 100644
--- a/contents/huffman_encoding/code/python/huffman.py
+++ b/contents/huffman_encoding/code/python/huffman.py
@@ -11,6 +11,9 @@ def build_huffman_tree(message):
frequencies = Counter(message)
trees = frequencies.most_common()
+ if len(trees) == 1:
+ return [trees[0][0]]
+
# while there is more than one tree
while len(trees) > 1:
@@ -37,6 +40,9 @@ def build_codebook(tree, code=''):
codebook = []
+ if len(tree) == 1:
+ return [(tree[0], '0')]
+
# split the tree
left_tree, right_tree = tree
I figure it's ok to not handle the case where the input is empty. Or should that be handled?
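For context, here is a minimal, self-contained sketch of what the patched Python code does end to end. The single-symbol guards follow the diff above; the merge loop is a simplified stand-in for the chapter's implementation, not the repo's exact code:

```python
from collections import Counter

def build_huffman_tree(message):
    frequencies = Counter(message)
    trees = frequencies.most_common()

    # Degenerate case: only one distinct symbol, return a one-leaf tree.
    if len(trees) == 1:
        return [trees[0][0]]

    # Repeatedly merge the two lowest-weight trees into one node.
    while len(trees) > 1:
        trees.sort(key=lambda t: t[1])
        (t1, w1), (t2, w2) = trees[0], trees[1]
        trees = trees[2:] + [((t1, t2), w1 + w2)]
    return trees[0][0]

def build_codebook(tree, code=''):
    # Degenerate one-leaf tree: give the single symbol the 1-bit code '0'.
    if len(tree) == 1:
        return [(tree[0], '0')]

    left_tree, right_tree = tree
    codebook = []
    for subtree, bit in ((left_tree, '0'), (right_tree, '1')):
        if isinstance(subtree, tuple):
            codebook += build_codebook(subtree, code + bit)
        else:
            codebook.append((subtree, code + bit))
    return codebook
```

With this, `'aaaa'` yields the codebook `[('a', '0')]`, so the encoded bitstring would be `'0000'` rather than an error or an empty output.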
Right, this should catch it; I was wondering about that. I'll fix the ones I can soon!
I don't like the solution of using 1 bit to encode the lone character, mostly because it makes the Kotlin implementation more complex than I feel it needs to be.
I think that, for the edge case, the empty string isn't suitable, since it doesn't give us the full description of the text. And I also think that representing the only character with a
The way I see it: an encoding edge case where the codebook indicates "this string has only the letter A" and the bitstring instead encodes the number of characters would be the optimal thing to do. But the algorithmically correct implementation is for the codebook to say "the character A corresponds to a 0" and for the bitstring to contain as many zeroes as the original string has A's, as that is internally consistent with how this algorithm handles all other cases.
That's actually not correct. Most implementations (e.g. DEFLATE) don't have a marker for the end of the bitstring; instead, they store the length of the uncompressed data. Because we are storing the bits in a string, we had an implicit end marker (the end of the string). IMHO the correct encoding result is the empty string, but the decode function should somehow be told what the target length is. This can be done, for example, by making the encoder return an
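That suggestion might be sketched in Python like this. The `(bits, length)` return value and the tuple-based tree representation are assumptions for illustration, not the repo's actual code:

```python
def encode(message, codebook):
    # Return the bitstring plus the original length, DEFLATE-style,
    # so the decoder knows when to stop without an end marker.
    bits = ''.join(codebook[ch] for ch in message)
    return bits, len(message)

def decode(tree, bits, length):
    # Tree representation assumed here: internal nodes are 2-tuples
    # (left, right), leaves are single characters, and the degenerate
    # one-symbol tree is just the character itself.
    out, node, i = [], tree, 0
    while len(out) < length:
        if isinstance(node, tuple):            # internal node: consume one bit
            node = node[0] if bits[i] == '0' else node[1]
            i += 1
        else:                                  # leaf: emit symbol, restart at root
            out.append(node)
            node = tree
    return ''.join(out)
```

With this scheme, a one-symbol message really can encode to the empty string: `decode('a', '', 6)` reproduces `'aaaaaa'`, because the stored length, not the bitstring, carries the information.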
Yes, because those implementations operate down at the metal, where there needs to be some way to determine when to stop reading and interpreting bits, and as such either encode the length upfront or include an "end of string" token just like any other kind of string. For the version of the algorithm being discussed in this book, those extra details can and maybe should be mentioned in the chapter text, but I don't think this kind of handling should be done in the actual code just because it's often done elsewhere, since these implementations are meant to be illustrative of the main principle at work without any optimizations. Since the text does mention that the algorithm "guarantees the smallest possible bit length", that may be a good reason to implement this kind of optimization, though I am unconvinced, since any means to actually achieve that in our example implementations will involve either encoding the length or encoding the "end of string" token, both of which involve very low-level concepts that distract from the point of the chapter and may confuse a reader. For the simplified version of the algorithm that is actually described in the chapter, the correct behavior would be:
That's incorrect; the algorithm as described would encode "aaaaaa" as the empty string. Honestly, I would just explicitly disallow those strings in the implementation.
Then the algorithm as described has a bug in it, and we should figure out whether to just disallow all strings that would cause that bug, or to fix it, and if so, how.
In any case, I sent Leios the original paper the algorithm originated in to read, so we'll see if there are any answers to this there. |
On the one hand, fixing those kinds of strings in Python/Coconut is pretty easy. You can see how I did it in the Coconut Huffman encoding PR (trying not to brag, sorry if it seems like it), where I used the solution that @thecnoNSMB proposed. So my point is that the fix is done on my part, but it's not really realistic to require it everywhere.
This issue is also there in my Racket implementation (#823). The input The way I am doing it, the Huffman tree for
I took a quick look and this issue is also there in the Haskell implementation. The My guess is that this is also due to the same reason as mine: leaves and non-leaf nodes are treated differently. So it might be possible to fix it in one of the two ways I suggested above.
There is also a small typo in the Haskell implementation. |
If you want to update the Haskell implementation, I think that would be very much appreciated! Thanks for bumping this thread, btw! I apparently didn't have the code fixed in Julia either, so I created a new PR (#828) to do this. I used an if statement to deal with a tree composed of only one leaf, but maybe the branch approach would have been better? I am happy with either.
I was eavesdropping on this conversation (I made the Haskell implementation). Thank you for spotting the bug; here is a possible fix:

decode :: (Ord a) => Tree a -> [Bit] -> [a]
decode (Leaf i a) _ = replicate i a
decode t msg = path t msg
  where ...

In this version, encoding "aaa" gets you
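For readers following along in Python, the same guard can be sketched like this. The `Leaf` type with its `count` field is an assumption mirroring the Haskell `Leaf i a` constructor; in this sketch, ordinary trees use plain characters as leaves:

```python
from collections import namedtuple

# Hypothetical leaf carrying its frequency, like Haskell's `Leaf i a`.
Leaf = namedtuple('Leaf', ['count', 'symbol'])

def decode(tree, bits):
    # Degenerate single-leaf tree: ignore the bits entirely and
    # repeat the symbol `count` times, as in the Haskell fix above.
    if isinstance(tree, Leaf):
        return tree.symbol * tree.count
    # Otherwise walk the tree bit by bit, restarting at each leaf.
    out, node = [], tree
    for b in bits:
        node = node[0] if b == '0' else node[1]
        if not isinstance(node, tuple):
            out.append(node)
            node = tree
    return ''.join(out)
```

Note that this stores the repetition count in the tree itself, which is the same design choice as the Haskell version: the bitstring for a one-symbol message can then legitimately be empty.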
@jiegillet go ahead! I don't really know Haskell anyway xD. Though I'm still concerned whether it should be an empty string or
I mean, Kroppeb is right here. In actual implementations, it makes sense to keep track of the length of the data. This should be mentioned in the chapter (and we should explicitly link to DEFLATE), but I feel it is better to just return the 0's for now, because it's easier to reason about (it's also what I was taught in data compression, so I think it is the more common approach for math folks). I have been wanting to talk about the Vitter algorithm for a while as an extension to this one, and I think that is a good place to introduce the more technical aspects.
I think returning an empty sequence makes more sense, because the algorithm is all about compression. And
The problem is that, following the Huffman tree building algorithm naively, I think you get Now that I think about it, the weights in the codebook building are arbitrary and technically decorrelated from the string to encode, except if you build the codebook with that string (which, although it is the case here, normally never happens). In all common cases, if the codebook had, say, the entry ['a', '10'], the string of 100 'a's would encode to "10" * 100, which is actually longer in characters than the input sequence. I don't think sending an empty sequence makes sense, as that should probably be reserved for the encoding of the empty string.
Yeah, there are actually a lot of weird details in the code right now that should be fixed for efficiency. In many languages, we are not even using bits, but chars acting as bits. I think at the time, Julia's bit types had just changed and I couldn't find documentation on how to use them properly anymore, so I just used strings. We really should be using actual bits; I'll see about fixing that in the Julia PR as well. To make sure I am following, we are thinking about returning either:
It seems to me that 2 is only more efficient in the case where we have a small number (less than 8, 16, 32, 64... whatever the precision of the Int is) of the same character repeated, which means we are paying attention to an edge-case that would be very difficult to actually see in practice, right? |
That's not exactly it.
The reasoning behind 2 is that if the codebook is always generated with the message to encode, Still, @Amaras convinced me and I now think solution 1 is the best. But this is a pathological case; any decision is fine, so let's pick one and run with it. @leios, I think if you make the call, people will follow. I really don't think we need to worry over bits vs characters for the bitstring; the algorithm is the important part, we are not building libraries here.
And I forgot to remove the link again, sorry |
I need to learn to stop saying "This fixes #..." when it only fixes 1 part of the full issue... |
Whenever any (or most) of the Huffman encoding implementations receive a string consisting of a single character repeated multiple times (think 'aaaaaaaaaaaaa', or 'bbbb'), the functions fail and give either errors or weird empty output. Here are some examples in Python, Java, and Lua respectively. I assume this has to be fixed, because nowhere in the chapter does it state that it shouldn't work on one-character strings.