{load|intersect}_word2vec_format: allow non-strict unicode error handling #466

gojomo · 2015-09-24T02:34:05Z

A word2vec.c-format file might not contain perfectly legal unicode encodings, and a reader might want to load it anyway. This adds a parameter to load_word2vec_format and intersect_word2vec_format, defaulting to 'strict', that is passed to utils.to_unicode() as its errors argument. With one of the more tolerant alternatives, 'ignore' or 'replace', bad unicode no longer causes a fatal ValueError. Instead, the offending char(s) are dropped or replaced in the problematic string, and loading continues.
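To illustrate the idea, here is a minimal sketch of a text-format reader that forwards an error-handling option to the decode step. It is not gensim's code: `load_text_vectors`, the local `to_unicode` stand-in, the `unicode_errors` keyword name, and the sample bytes are all assumptions made for this example.

```python
import io


def to_unicode(text, encoding="utf8", errors="strict"):
    """Stand-in for a to_unicode helper: decode bytes, pass str through."""
    if isinstance(text, str):
        return text
    return text.decode(encoding, errors=errors)


def load_text_vectors(stream, unicode_errors="strict"):
    """Read a word2vec.c-style text stream (hypothetical helper).

    The first line holds 'vocab_size vector_size'; each following line
    holds a word and its vector components, separated by spaces.
    """
    header = to_unicode(stream.readline())
    vocab_size, vector_size = (int(x) for x in header.split())
    vectors = {}
    for _ in range(vocab_size):
        parts = stream.readline().rstrip(b"\n").split(b" ")
        # Only the word needs tolerant decoding; the numbers are plain ASCII.
        word = to_unicode(parts[0], errors=unicode_errors)
        values = [float(x.decode("ascii")) for x in parts[1:]]
        if len(values) != vector_size:
            raise ValueError("unexpected vector length for %r" % word)
        vectors[word] = values
    return vectors


# Two vectors; the second word ends in a byte that is not valid UTF-8.
SAMPLE = b"2 2\nok 0.1 0.2\nbad\xff 0.3 0.4\n"

if __name__ == "__main__":
    print(sorted(load_text_vectors(io.BytesIO(SAMPLE), unicode_errors="ignore")))
```

With the default 'strict', the same call fails on the second word, which is the behaviour this change leaves in place unless the caller opts out.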


piskvorky · 2015-11-20T09:57:53Z

@tmylk I don't see this (important) change in the CHANGELOG. Did you miss it in the last release notes?

If so, can you (or @gojomo) add it now? We'll announce it as part of the next release (0.12.4).

gojomo · 2015-11-20T18:41:43Z

Perhaps 'ignore' should even be the default? Or we could catch the error and, before re-throwing, add a message pointing to the option.

piskvorky · 2015-11-21T06:41:11Z

Hmm, yes, a clearer error message may help. Plus a re-throw. Certainly not ignore.
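A rough sketch of that "clearer message plus re-throw" idea, assuming the option is exposed under the name `unicode_errors`; `decode_word` is a hypothetical helper for illustration, not something taken from the gensim code base.

```python
def decode_word(raw, unicode_errors="strict"):
    """Decode one vocabulary entry, adding a hint if strict decoding fails."""
    try:
        return raw.decode("utf8", errors=unicode_errors)
    except UnicodeDecodeError as err:
        # Re-raise the same exception type so existing handlers still match,
        # but extend the reason with a pointer to the tolerant settings.
        hint = "; consider unicode_errors='ignore' or 'replace'"
        raise UnicodeDecodeError(
            err.encoding, err.object, err.start, err.end, err.reason + hint
        ) from err


if __name__ == "__main__":
    try:
        decode_word(b"caf\xe9")  # Latin-1 bytes, not valid UTF-8
    except UnicodeDecodeError as err:
        print(err)
```

The default stays 'strict', so corrupt input is still reported; the caller just learns about the option at the moment it becomes relevant.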

rsuwaileh · 2017-01-02T21:25:54Z

How can we use this parameter of load_word2vec_format to skip or replace chars with a different encoding? I can't find it in the documentation!

tmylk · 2017-01-02T21:41:35Z

@ReemSuwaileh This question is better suited for the mailing list.

The accepted values are ‘strict’, ‘replace’ (add U+FFFD, ‘REPLACEMENT CHARACTER’), or ‘ignore’; see the description of these error handlers in the Python documentation.
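For reference, the three handlers behave as follows on plain Python `bytes.decode`; the sample bytes are made up for the demonstration.

```python
# Latin-1 encoded text: the byte 0xEF is not valid at this position in UTF-8.
raw = b"na\xefve"

# 'ignore' drops the bad byte, 'replace' substitutes U+FFFD.
for handler in ("ignore", "replace"):
    print(handler, repr(raw.decode("utf8", errors=handler)))

# 'strict' (the default) raises instead of returning a string.
try:
    raw.decode("utf8", errors="strict")
except UnicodeDecodeError as err:
    print("strict raised:", err.reason)
```

Assuming the keyword is named `unicode_errors`, a gensim call would then look something like `load_word2vec_format(path, binary=True, unicode_errors='ignore')`; check the signature in your installed version before relying on it.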

{load|intersect}_word2vec_format: allow non-strict unicode error handling

a8a8f21

piskvorky added a commit that referenced this pull request Sep 24, 2015

Merge pull request #466 from gojomo/unicode_err_tolerance

022bb30

{load|intersect}_word2vec_format: allow non-strict unicode error handling

piskvorky merged commit 022bb30 into piskvorky:develop Sep 24, 2015

gojomo mentioned this pull request Oct 22, 2015

Loading c word2vec text models and encoding errors #496

Closed

piskvorky mentioned this pull request Nov 20, 2015

ValueError: string size must be a multiple of element size for "demo-phrases.sh" trained vector-phrases.bin model file in C #394

Closed

nick-magnini mentioned this pull request Dec 23, 2015

'utf-8' decode error when loading a word2vec module #567

Closed
