[MRG] Fix for #814 #1176
Conversation
@jayantj This looks good, but should we wait to test this on the new pre-trained models first and then ask Lev to merge? Though I feel it shouldn't create any difference, as you have already added a test for non-ascii words.
Agree with @anmol01gulati. Does it work on https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md, @jayantj?
The previous behaviour was not good; how did that pass code review and unit tests? :( Does fastText always output utf8? Is the encoding really fixed?
@tmylk use better issue/PR titles, so search engines can index them and users find them when they hit the same error.
ping @jayantj: status on the questions above?
The fastText readme says that the training file should contain `UTF-8 encoded text`.

Looking at the fastText code and experimenting with different encodings though, fastText doesn't really decode the bytes in the training file, and simply treats the input file as a stream of bytes. The only place it makes use of the values inside the bytes is while tokenizing the text, where it checks each byte against the whitespace characters (` `, `\n`, `\r`, `\t`, `\v`, `\f`, `\0`). So, encodings which assign the same values to these characters as ascii (utf8, for example) would produce sane output.

I can think of two ways to handle this -

1. Assuming a fixed utf8 encoding while decoding vocab words
2. Accepting an `encoding` parameter from the user and decoding vocab words with it
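For illustration, here is a minimal Python sketch of that byte-level tokenization. The `read_word` helper below is hypothetical; it only mirrors the logic of fastText's `Dictionary::readWord` and is not part of the gensim or fastText API:

```python
# Hypothetical sketch of fastText's byte-level tokenization (mirrors the
# logic of Dictionary::readWord in the C++ source; not an actual API).
import io

WHITESPACE = {b' ', b'\n', b'\r', b'\t', b'\v', b'\f', b'\0'}

def read_word(stream):
    """Read one whitespace-delimited token from a binary stream, as raw bytes."""
    word = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:                         # EOF
            return bytes(word) if word else None
        if byte in WHITESPACE:
            if word:                         # token complete
                return bytes(word)
            continue                         # skip leading whitespace
        word += byte                         # bytes >= 0x80 are never decoded

# Any ascii-compatible encoding tokenizes identically:
stream = io.BytesIO("héllo wörld".encode("utf8"))
print(read_word(stream))                     # b'h\xc3\xa9llo'
```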
Any thoughts?
Re: the file header, I'll create a single PR with the updated header and the fix for the above issue, whichever option we go with.
Option 2 sounds like the most flexible.
The current code attempts to decode each character of a vocab word individually as ascii.
Instead, the code is changed to do the following: characters are read as raw bytes first, then decoded as utf-8 once the entire word has been read.
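Roughly, the change amounts to the following (a minimal sketch, assuming a single terminator byte per word; `read_vocab_word` and the terminator value are illustrative, not the actual gensim wrapper code):

```python
# Illustrative sketch of the fix (not the exact gensim code; the terminator
# byte for vocab words in the binary format is assumed here).
def read_vocab_word(stream, encoding='utf-8', terminator=b' '):
    word_bytes = bytearray()
    while True:
        byte = stream.read(1)
        if byte == terminator or not byte:
            break
        word_bytes += byte
    # Old approach: byte.decode('ascii') per character, which raises
    # UnicodeDecodeError for any byte >= 0x80, i.e. for multi-byte utf-8 words.
    # New approach: decode the accumulated bytes once the word is complete.
    return word_bytes.decode(encoding)
```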
Also includes a test fastText model with non-ascii vocab words, and a check that vector lookup works for them.
Haven't been able to test the newly released FastText vectors for wiki in various languages since the models are large-ish (~10 GB).
Related issue reported in #814.