{load|intersect}_word2vec_format: allow non-strict unicode error handling #466

Merged
merged 1 commit into piskvorky:develop
Sep 24, 2015

Conversation

gojomo
Collaborator

@gojomo gojomo commented Sep 24, 2015

A word2vec.c-format file might not contain perfectly legal unicode encodings, and a reader might want to load it anyway. This adds a parameter to load_word2vec_format & intersect_word2vec_format, default 'strict', that is passed to the utils.to_unicode() method as its errors parameter. With one of the more tolerant alternatives – ignore or replace – bad unicode no longer causes a fatal ValueError. Instead, the offending char(s) in the problematic string are ignored or replaced, and loading continues.
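
To make the three error modes concrete, here is a minimal sketch of what passing the value through to the decoder does. The to_unicode function below is a hypothetical stand-in written for illustration, not gensim's actual utils.to_unicode(); it only assumes that the errors value ends up in Python's built-in bytes.decode(). In released gensim the new keyword appears to be named unicode_errors, but this thread does not spell the name out, so treat it as an assumption.

```python
# Hypothetical stand-in for utils.to_unicode(); not gensim's actual code.
def to_unicode(raw, encoding="utf8", errors="strict"):
    """Decode bytes to str, passing `errors` straight to bytes.decode()."""
    if isinstance(raw, str):
        return raw
    return raw.decode(encoding, errors=errors)


# "caf" followed by a truncated two-byte UTF-8 sequence
# (0xC3 with no continuation byte), as a damaged vocabulary entry might look.
bad_word = b"caf\xc3"

ignored = to_unicode(bad_word, errors="ignore")    # offending byte is dropped
replaced = to_unicode(bad_word, errors="replace")  # offending byte becomes U+FFFD

try:
    to_unicode(bad_word)  # default 'strict'
    strict_failed = False
except ValueError:  # UnicodeDecodeError is a subclass of ValueError
    strict_failed = True
```

With 'ignore' the word silently becomes a different token, which may collide with an existing vocabulary entry; 'replace' at least leaves a visible marker.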

piskvorky added a commit that referenced this pull request Sep 24, 2015
{load|intersect}_word2vec_format: allow non-strict unicode error handling
@piskvorky piskvorky merged commit 022bb30 into piskvorky:develop Sep 24, 2015
@piskvorky
Owner

@tmylk I don't see this (important) change in the CHANGELOG. Did you miss it in the last release notes?

If so, can you (or @gojomo ) add it now? We'll announce it as part of the next release (0.12.4).

@gojomo
Collaborator Author

gojomo commented Nov 20, 2015

Perhaps even 'ignore' should be the default, or we could catch the error and, before re-throwing, include a message pointing to the option?

@piskvorky
Owner

Hmm, yes, a clearer error message may help. Plus a re-throw. Certainly not ignore.
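
The "clearer message plus re-throw" idea discussed above could look roughly like the sketch below. The helper decode_word and the wording of the hint are hypothetical, and the keyword name unicode_errors is assumed; this is not the code gensim ended up with, only one way to keep 'strict' as the default while pointing users at the option.

```python
def decode_word(raw, encoding="utf8", unicode_errors="strict"):
    """Decode one vocabulary entry; on failure, re-raise with a hint.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode(encoding, errors=unicode_errors)
    except UnicodeDecodeError as err:
        # Keep the original exception type and position info,
        # but extend the reason with a pointer to the tolerant modes.
        hint = (err.reason +
                "; pass unicode_errors='ignore' or 'replace' to load anyway")
        raise UnicodeDecodeError(
            err.encoding, err.object, err.start, err.end, hint
        ) from err


# Usage: strict mode still fails, but the message now names the option.
try:
    decode_word(b"caf\xc3")
    message = None
except UnicodeDecodeError as err:
    message = str(err)
```

Re-raising the same exception type means existing callers that catch ValueError or UnicodeDecodeError keep working.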

@rsuwaileh

rsuwaileh commented Jan 2, 2017

How can we use this parameter of load_word2vec_format that skips or replaces the chars with different encoding? I can't find this in documentation!

@tmylk
Contributor

tmylk commented Jan 2, 2017

@ReemSuwaileh This question is better suited for the mailing list.

The accepted values are ‘strict’, ‘replace’ (inserts U+FFFD, ‘REPLACEMENT CHARACTER’), or ‘ignore’; see their description in the Python documentation.
