Edge case fail in all Huffman Encoding implementations. #659

Open
Trashtalk217 opened this issue Dec 20, 2019 · 26 comments · Fixed by #828
Labels
Implementation Edit: This provides an edit to an algorithm implementation. (Code and maybe md files are edited.)
Problem: This is a problem in the archive or an implementation.

Comments

@Trashtalk217
Contributor

Whenever any (or most) of the Huffman encoding implementations receive a string consisting of a single character repeated multiple times (think 'aaaaaaaaaaaaa' or 'bbbb'), the functions fail and either give errors or produce weird empty output. Here are some examples from Python, Java, and Lua respectively. I assume this has to be fixed, because nowhere in the chapter does it state that the algorithm shouldn't work on one-character strings.
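A minimal sketch of the failure mode (hypothetical code, not the archive's exact implementation): with a single distinct character, the Huffman tree collapses to one bare leaf, the usual recursion assigns it the empty code, and the encoded output becomes an empty string.

def build_codebook(tree, code=""):
    # Leaves are bare characters; internal nodes are (left, right) pairs.
    if isinstance(tree, str):
        return {tree: code}  # a lone leaf gets code == "" here
    left, right = tree
    codebook = build_codebook(left, code + "0")
    codebook.update(build_codebook(right, code + "1"))
    return codebook

tree = "a"  # the tree built from "aaaa" collapses to a single leaf
codebook = build_codebook(tree)
print(codebook)                              # {'a': ''}
print("".join(codebook[c] for c in "aaaa"))  # ''  <- the "weird empty output"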

@c252
Member

c252 commented Dec 20, 2019

These are the ones we know are broken so far (I or another maintainer have personally tested them):

  • C
  • C++
  • C#
  • Go
  • Java
  • Javascript
  • Julia
  • Lua
  • Python
  • Rust

and those that are untested:

  • x86-64 Assembly
  • Clojure
  • Haskell
  • Scala

@c252 c252 added the Implementation Edit This provides an edit to an algorithm implementation. (Code and maybe md files are edited.) label Dec 20, 2019
@june128 june128 added the Problem This is a problem in the archive or an implementation. label Dec 21, 2019
@berquist
Member

Here is a first stab at fixing the Python implementation:

diff --git a/contents/huffman_encoding/code/python/huffman.py b/contents/huffman_encoding/code/python/huffman.py
index 1e1cc54..75700cd 100644
--- a/contents/huffman_encoding/code/python/huffman.py
+++ b/contents/huffman_encoding/code/python/huffman.py
@@ -11,6 +11,9 @@ def build_huffman_tree(message):
     frequencies = Counter(message)
     trees = frequencies.most_common()
 
+    if len(trees) == 1:
+        return [trees[0][0]]
+
     # while there is more than one tree
     while len(trees) > 1:
 
@@ -37,6 +40,9 @@ def build_codebook(tree, code=''):
 
     codebook = []
 
+    if len(tree) == 1:
+        return [(tree[0], '0')]
+
     # split the tree
     left_tree, right_tree = tree
 

I figure it's ok to not handle the case where the input is empty. Or should that be handled?
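A quick check of the patched edge case (a sketch; it assumes the rest of contents/huffman_encoding/code/python/huffman.py keeps the signatures shown in the diff):

from huffman import build_huffman_tree, build_codebook

tree = build_huffman_tree("aaaa")
print(tree)                  # ['a'], the new single-leaf special case
print(build_codebook(tree))  # [('a', '0')]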

@leios
Member

leios commented Dec 21, 2019

Right, this should catch it. I'm wondering about that too. I'll fix the ones I can soon!

@Kroppeb

Kroppeb commented Jan 28, 2020

https://stackoverflow.com/questions/22429854/huffman-code-for-a-single-character

@Kroppeb

Kroppeb commented Jan 28, 2020

I don't like the solution of using 1 bit to encode the character, mostly because it makes the Kotlin implementation more complex than I feel it needs to be.

@rishvic
Contributor

rishvic commented Mar 1, 2020

I think that, for the edge case, the empty string isn't suitable, since it doesn't give us the full description of the text. I also think that representing the only character with a 1 or a 0 is wasteful, since we are not using the other character at our disposal. So I think the best way to go about this is: if there is only one character in the text, the Huffman code becomes the binary representation of the number of occurrences of that single character in the given text (for simplicity, I think the representation should be in little-endian order). Then, while decoding, if we see that the code book has only one entry (a single root node containing the character), we simply interpret the encoded text as the number of occurrences of the character in the code book. Although this increases the complexity of the encoding and decoding functions, this method ensures (see the sketch after this list):

  1. A complete description of the text
  2. The most efficient encoding of the given text using only 1s and 0s
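A sketch of this proposal (the helper names are hypothetical): when the text contains a single distinct character, the "encoded" bitstring is the occurrence count written in little-endian binary.

def encode_single(char, count):
    bits = ""
    while count:
        bits += str(count & 1)  # least significant bit first
        count >>= 1
    return bits

def decode_single(char, bits):
    count = sum(int(bit) << i for i, bit in enumerate(bits))
    return char * count

assert encode_single("a", 13) == "1011"  # 13 = 0b1101, written LSB first
assert decode_single("a", "1011") == "a" * 13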

@thecnoNSMB
Contributor

The way I see it, while an encoding edge case where the codebook indicates "this string has only the letter A" and the bitstring instead encodes the number of characters would be the optimal thing to do, the algorithmically correct implementation would instead be for the codebook to say "the character A corresponds to a 0" and for the bitstring to contain as many zeroes as the original string has A's, as that would be internally consistent with how this algorithm handles all other cases.

@Kroppeb

Kroppeb commented Aug 6, 2020

> The way I see it, while an encoding edge case where the codebook indicates "this string has only the letter A" and the bitstring instead encodes the number of characters would be the optimal thing to do, the algorithmically correct implementation would instead be for the codebook to say "the character A corresponds to a 0" and for the bitstring to contain as many zeroes as the original string has A's, as that would be internally consistent with how this algorithm handles all other cases.

That's actually not correct. Most implementations (e.g. DEFLATE) don't have a marker for the end of the bitstring; instead, they store the length of the uncompressed data. Because we are storing the bits in a string, we had an implicit end marker (the end of the string). IMHO the correct encoding result is the empty string, but the decode function should somehow be told what the target length is. This can be done, for example, by making the encoder return an Encoded object that stores the length and the bits.
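A sketch of this idea (the Encoded name comes from the comment above; everything else is an assumption): store the uncompressed length alongside the bits, so a single-character message can encode to zero bits unambiguously.

from dataclasses import dataclass

@dataclass
class Encoded:
    length: int  # number of symbols in the original message
    bits: str    # the Huffman bitstring; empty for a one-symbol alphabet

def decode_single(codebook, encoded):
    # One-entry codebook: the only symbol, repeated `length` times.
    (char,) = codebook
    return char * encoded.length

print(decode_single({"a": ""}, Encoded(length=5, bits="")))  # aaaaa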

@thecnoNSMB
Contributor

Yes, because those implementations operate down at the metal, where there needs to be some way to determine when to stop reading and interpreting bits, so they either encode the length up front or include an "end of string" token just like any other kind of string. For the version of the algorithm discussed in this book, those extra details can and maybe should be mentioned in the chapter text, but I don't think this kind of handling should be done in the actual code just because it's often done elsewhere; these implementations are meant to illustrate the main principle at work, without optimizations. Since the text does mention that the algorithm "guarantees the smallest possible bit length", that may be a good reason to implement this kind of optimization, though I am unconvinced: any means of actually achieving that in our example implementations will involve either encoding the length or encoding an "end of string" token, both of which involve very low-level concepts that distract from the point of the chapter and may confuse a reader.

For the simplified version of the algorithm that is actually described in the chapter, the correct behavior would be (see the sketch after this list):

  • An empty string should be encoded with an empty codebook and an empty bitstring,
  • a should be encoded with a codebook indicating that "a" is 0 (for example) and the bitstring should be 0,
  • and aaaaa should be encoded with the same codebook and the bitstring should be 00000.
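In code, that behavior would look something like this (a sketch with hypothetical names, not the archive's implementation; the general multi-character case is elided):

from collections import Counter

def encode(message):
    freqs = Counter(message)
    if not freqs:        # empty string: empty codebook, empty bitstring
        return {}, ""
    if len(freqs) == 1:  # single distinct character: give it the code '0'
        char, count = freqs.most_common(1)[0]
        return {char: "0"}, "0" * count
    raise NotImplementedError("general case: the usual Huffman algorithm")

assert encode("") == ({}, "")
assert encode("a") == ({"a": "0"}, "0")
assert encode("aaaaa") == ({"a": "0"}, "00000")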

@Kroppeb

Kroppeb commented Aug 6, 2020

> For the simplified version of the algorithm that is actually described in the chapter, the correct behavior would be:
>
>   • An empty string should be encoded with an empty codebook and an empty bitstring,
>   • a should be encoded with a codebook indicating that "a" is 0 (for example) and the bitstring should be 0,
>   • and aaaaa should be encoded with the same codebook and the bitstring should be 00000.

That's incorrect; the algorithm as described would encode "aaaaaa" as the empty string. Honestly, I would just explicitly disallow those strings in the implementation.

@thecnoNSMB
Contributor

> That's incorrect; the algorithm as described would encode "aaaaaa" as the empty string. Honestly, I would just explicitly disallow those strings in the implementation.

Then the algorithm as described has a bug in it, and we should figure out whether to just disallow all strings that would cause that bug, or to fix it, and if so, how.

@thecnoNSMB
Contributor

In any case, I sent Leios the original paper in which the algorithm originated, so we'll see if there are any answers to this in there.

@Amaras
Member

Amaras commented Aug 6, 2020

On the one hand, fixing this for those kinds of strings in Python/Coconut is pretty easy. You can see how I did it in the Coconut Huffman encoding PR (not trying to brag, sorry if it comes across that way), where I used the solution that @thecnoNSMB proposed.
The paper does not really give a feel for the edge cases, though.
Realistically, either you already have some kind of Huffman tree or code book lying around, or you don't really need to build one from single-character (repeated or not) or empty-string inputs... If the only kinds of messages you can send are "no message" and "here are x characters", do you really need to go through the trouble of building a tree or a code book? Just use (possibly repeated) single digits!

So my point is that the fix is done on my part, but having to do it is not really realistic.

@lazyprop
Contributor

This issue is also present in my Racket implementation (#823). The input "bbbbb" returns an empty string as the encoded message. I could fix this to return 00000.

The way I am doing it, the Huffman tree for "bbbbb" consists of just one leaf node corresponding to the symbol b. I see two ways to fix it (a sketch of the first follows the list):

  1. Make the Huffman tree for such an input consist of a branch whose left child is the leaf.
  2. Add a simple if condition to deal with such a tree.
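A sketch of option 1, in Python rather than Racket (all names are hypothetical): wrapping a lone leaf in a branch lets the ordinary codebook walk assign it the code '0'.

from collections import namedtuple

Leaf = namedtuple("Leaf", "char")
Branch = namedtuple("Branch", "left right")

def build_codebook(tree, code=""):
    if isinstance(tree, Leaf):
        return {tree.char: code}
    book = {}
    if tree.left is not None:
        book.update(build_codebook(tree.left, code + "0"))
    if tree.right is not None:
        book.update(build_codebook(tree.right, code + "1"))
    return book

tree = Leaf("b")            # the tree built from "bbbbb"
if isinstance(tree, Leaf):  # the proposed wrap
    tree = Branch(left=tree, right=None)
print(build_codebook(tree))  # {'b': '0'}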

@lazyprop
Contributor

I took a quick look, and this issue is also present in the Haskell implementation. The encoding procedure returns an empty string and the decoding procedure returns an infinite list of b's.

My guess is that this happens for the same reason as in mine: leaves and non-leaf nodes are treated differently. So it might be possible to fix it in one of the two ways I suggested above.

@lazyprop
Contributor

There is also a small typo in the Haskell implementation. encoding is spelled as endoding.

@leios
Member

leios commented Aug 18, 2021

If you want to update the Haskell implementation, I think that would be very much appreciated!

Thanks for bumping this thread, btw! I apparently didn't have the code fixed in Julia either, so I created a new PR (#828) to do it. I used an if statement to deal with a tree composed of only 1 leaf, but maybe the branch approach would have been better? I am happy with either.

@jiegillet
Member

jiegillet commented Aug 18, 2021

I was eavesdropping on this conversation (I made the Haskell implementation). Thank you for spotting the bug; here is a possible fix:

decode :: (Ord a) => Tree a -> [Bit] -> [a]
decode (Leaf i a) _ = replicate i a
decode t msg = path t msg
  where ...

In this version, encoding "aaa" gets you (Leaf 3 'a', []), basically an empty encoded message, since in this case the whole message is encoded in the tree.
Should I make a PR? Or do you want to include it in yours @lazyprop?

@lazyprop
Contributor

@jiegillet go ahead! I don't really know Haskell anyway xD. Though I'm still not sure whether it should be an empty string or "0000". The discussion that happened on this thread last year didn't end in an agreement. What do you think @leios?

@leios
Member

leios commented Aug 18, 2021

> The way I see it, while an encoding edge case where the codebook indicates "this string has only the letter A" and the bitstring instead encodes the number of characters would be the optimal thing to do, the algorithmically correct implementation would instead be for the codebook to say "the character A corresponds to a 0" and for the bitstring to contain as many zeroes as the original string has A's, as that would be internally consistent with how this algorithm handles all other cases.
>
> That's actually not correct. Most implementations (e.g. DEFLATE) don't have a marker for the end of the bitstring; instead, they store the length of the uncompressed data. Because we are storing the bits in a string, we had an implicit end marker (the end of the string). IMHO the correct encoding result is the empty string, but the decode function should somehow be told what the target length is. This can be done, for example, by making the encoder return an Encoded object that stores the length and the bits.

I mean, Kroppeb is right here. In actual implementations, it makes sense to keep track of the length of the data... This should be mentioned in the chapter (and we should also explicitly link to DEFLATE), but I feel it is better to just return the 0's for now because it's easier to reason about (it's also what I was taught in data compression, so I think it is the more common approach for math folks).

I have been wanting to talk about the Vitter algorithm for a while as an extension to this one and think that is a good place to introduce the more technical aspects.

@jiegillet
Member

I think returning an empty sequence makes more sense, because the algorithm is all about compression, and (Leaf 'a' 100, []) is a heck of a lot more compressed than (Leaf 'a' 100, "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000")

@Amaras
Member

Amaras commented Aug 18, 2021

> I think returning an empty sequence makes more sense, because the algorithm is all about compression, and (Leaf 'a' 100, []) is a heck of a lot more compressed than (Leaf 'a' 100, "0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000")

The problem is that, following the Huffman tree building algorithm naively, I think you get (Leaf 'a', '0'), with which the 100 'a's are transformed into 100 '0's.
The other problem is that you still need to (consistently) signal the end of your string.
So either you say that a one-element codebook means the encoded string holds the binary representation of the number of elements, or you need to return the whole string of 0s no matter what.

Now that I think about it, the weights used while building the codebook are arbitrary and technically decorrelated from the string to encode, unless you build the codebook with that string (which, although it is the case here, normally never happens). In all common cases, if the codebook contained, say, the entry ['a', '10'], the string of 100 'a's would become "10"*100, which is actually longer, in characters, than the input sequence.
However, the thing to remember is that while the encoded sequence of characters is just as long or longer, it should be thought of as a sequence of bits. And as a sequence of bits, a one-bit code gives a 7-fold or 8-fold compression (assuming ASCII or a comparable character encoding for most Western languages).
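Worked numbers for that paragraph (assuming 8-bit characters in the input):

message = "a" * 100
input_bits = 8 * len(message)            # 800 bits uncompressed
two_bit_code = len("10") * len(message)  # 200 bits with code '10': 4-fold
one_bit_code = len("0") * len(message)   # 100 bits with code '0': 8-fold
print(input_bits, two_bit_code, one_bit_code)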

I don't think sending an empty sequence makes sense, as it should probably be reserved for the encoding of the empty string.

@leios
Member

leios commented Aug 18, 2021

Yeah, there are actually a lot of weird details in the code right now that should be fixed for efficiency. In many languages, we are not even using bits, but chars acting as bits. I think at the time, Julia's bit types had just changed and I couldn't find documentation on how to use them properly, so I just used strings. We really should be using actual bits. I'll see about fixing that in the Julia PR as well.

To make sure I am following, we are thinking about returning either:

  1. "000...000"
  2. ("c", n) where c is a character and n is the number of characters in the string

It seems to me that 2 is only more efficient in the case where we have a large number (more than 8, 16, 32, 64... whatever the precision of the Int is) of the same character repeated, which means we are paying attention to an edge case that would be very difficult to actually see in practice, right?
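A rough cost comparison of the two options (assuming a fixed-width 64-bit count for option 2); option 2 only wins once the run length exceeds the integer width:

for n in (10, 64, 1000):
    cost_zeros = n    # option 1: one '0' bit per repeated character
    cost_count = 64   # option 2: fixed-width character count
    winner = "count" if cost_count < cost_zeros else "zeros"
    print(f"n={n}: zeros={cost_zeros} bits, count={cost_count} bits -> {winner}")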

@jiegillet
Member

jiegillet commented Aug 19, 2021

> To make sure I am following, we are thinking about returning either:
>
>   1. "000...000"
>   2. ("c", n) where c is a character and n is the number of characters in the string

That's not exactly it.
I'm thinking of returning either:

  1. "000...000" + the codebook Leaf 'a' 100
  2. "" + the codebook Leaf 'a' 100

The reasoning behind 2 is that if the codebook is always generated with the message to encode, Leaf 'a' 100 is enough to reconstruct the message, since we know there is only one character and we know the frequency. However, @Amaras pointed out that in a situation where the codebook is not sent with the message (maybe there is a predefined codebook for the English language or something?), or where the frequencies are renormalized, you need to send the 0's. To be fair, that is also a bit of a stretch, because there is no way that people would use a predefined codebook of one character :)

Still, @Amaras convinced me and I now think solution 1 is the best. But this is a pathological case; any decision is fine, so let's pick one and run with it. @leios, I think if you make the call, people will follow.

I really don't think we need to worry about bits vs. characters for the bitstring; the algorithm is the important part, and we are not building libraries here.

@Amaras Amaras reopened this Oct 23, 2021
@Amaras
Member

Amaras commented Oct 23, 2021

And I forgot to remove the link again, sorry

@leios
Member

leios commented Oct 23, 2021

I need to learn to stop saying "This fixes #..." when it only fixes 1 part of the full issue...
