{load|intersect}_word2vec_format: allow non-strict unicode error handling #466
A word2vec.c-format file might not have perfectly legal unicode encodings. A reader might want to load it anyway. This adds a parameter to `load_word2vec_format` and `intersect_word2vec_format`, default `'strict'`, that is passed to the `utils.to_unicode()` method as its `errors` parameter. Choosing one of the more tolerant alternatives, `ignore` or `replace`, means bad unicode won't cause a fatal ValueError. Instead, the offending char(s) in the problematic string are ignored or replaced, allowing progress to continue.
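
To illustrate the three error-handling modes, here is a minimal sketch using Python's built-in `bytes.decode()` rather than the gensim loader itself. The helper `decode_word` is hypothetical and only stands in for the decoding step; it is not the code in this PR, and the sample bytes are made up for the example.

```python
# Illustration only: plain bytes.decode(), not gensim's loader.
# 0xE9 is 'é' in Latin-1 but an incomplete sequence in UTF-8.
raw = b"caf\xe9"


def decode_word(data, errors="strict"):
    """Hypothetical stand-in for the decoding step; forwards `errors`."""
    return data.decode("utf-8", errors=errors)


# 'strict' (the default) raises on the bad byte.
try:
    decode_word(raw)
    strict_failed = False
except ValueError:  # UnicodeDecodeError is a subclass of ValueError
    strict_failed = True

# 'ignore' drops the offending byte; 'replace' substitutes U+FFFD.
ignored = decode_word(raw, errors="ignore")
replaced = decode_word(raw, errors="replace")

print(strict_failed, repr(ignored), repr(replaced))
```

With `ignore` the word silently loses a character, while `replace` keeps a visible marker of where the bad byte was, which may be easier to spot in the loaded vocabulary.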