KeyError when serializing a doc object after adding a new entity label #514
Added a fix for this, but the situation's pretty messy. The serializer expects a list of attribute frequencies so that it can build a Huffman tree, which means it needs to know which entity labels are available and how common they are. Once the Huffman trees are built, they can't be modified without changing the encoding. The result is that if you serialize some documents, add an entity label, and then serialize some more, the two sets of documents won't be consistently encoded. So uh... don't do that :p

I suggest adding your custom entity labels as soon as possible after loading the pipeline. That's probably the best way to work around the brittleness here until the underlying design improves. The serializer is probably rather over-engineered.
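The failure mode described above can be sketched without spaCy at all: build a Huffman code table from a frequency list, then try to encode a symbol that wasn't in that list. This toy implementation (plain Python, not spaCy's actual code; the label names and frequencies are made up) shows both problems: the unknown label raises a `KeyError`, and rebuilding the table with the new label changes the codes assigned to existing labels, so old and new serializations disagree.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code table (symbol -> bit string) from frequencies."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def encode(codes, symbols):
    # An unseen symbol has no code: this is the KeyError from the issue.
    return "".join(codes[s] for s in symbols)

# Table frozen from the label frequencies known at pipeline-load time.
codes = huffman_codes({"PERSON": 50, "ORG": 30, "GPE": 20})
print(encode(codes, ["PERSON", "ORG"]))  # 011

try:
    encode(codes, ["ANIMAL"])  # label added after the table was built
except KeyError as err:
    print("KeyError:", err)

# Rebuilding with the new label changes codes for *existing* labels,
# so documents serialized before and after are inconsistently encoded.
new_codes = huffman_codes({"PERSON": 50, "ORG": 30, "GPE": 20, "ANIMAL": 5})
print(codes["GPE"], "->", new_codes["GPE"])  # 10 -> 101
```

Adding all custom labels before serializing anything (the workaround suggested above) means a single table is used for every document, which is why it avoids the inconsistency.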
Got it. Thanks!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
I'm trying to add new entity labels and add new entity spans accordingly. However, this results in a KeyError when using doc.to_bytes(). Minimal code example below:
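The original snippet was not preserved in this copy of the thread. A hypothetical reconstruction along the lines the reporter describes, using the 2016-era spaCy 1.x API (the label name `ANIMAL`, the sentence, and the span indices are all assumptions for illustration):

```python
# Hypothetical reconstruction -- the original example was lost from this
# copy of the thread. Approximate spaCy 1.x (2016) API.
import spacy

nlp = spacy.load('en')            # pipeline loaded *before* the label exists
doc = nlp(u'I saw a unicorn in the garden.')

# Introduce a label the pipeline did not know about at load time,
# and attach an entity span that uses it.
label_id = doc.vocab.strings[u'ANIMAL']
doc.ents = [(label_id, 3, 4)]

# The serializer's Huffman tables were frozen without 'ANIMAL',
# so looking up the new label fails.
doc.to_bytes()                    # raises KeyError
```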
Output: