Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added encoding parameter to TextDirectoryCorpus #3317

Merged
merged 1 commit into from
Apr 15, 2022
Merged

Added encoding parameter to TextDirectoryCorpus #3317

merged 1 commit into from
Apr 15, 2022

Conversation

Sandman-Ren
Copy link
Contributor

@Sandman-Ren Sandman-Ren commented Apr 1, 2022

Fixes #3316.

Added encoding='utf-8' keyword argument to TextDirectoryCorpus. Used smart_open to replace built-in open.
TextCorpus does not seem to have the problem. Nothing is changed for TextCorpus.

@piskvorky piskvorky changed the title Fix #3316 Added encoding parameter to TextDirectoryCorpus Apr 2, 2022
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks! Let's wait for the CI tests to finish, otherwise LGTM.

Is there any risk of backward compatibility issues? Do people save/load these objects to disk = new encoding parameter missing in old saves?

@Sandman-Ren
Copy link
Contributor Author

Sandman-Ren commented Apr 2, 2022

Looks good, thanks! Let's wait for the CI tests to finish, otherwise LGTM.

Is there any risk of backward compatibility issues? Do people save/load these objects to disk = new encoding parameter missing in old saves?

I've tried saving the TextDirectoryCorpus using TextDirectoryCorpus.save() in the modified code and load it back using TextDirectoryCorpus.load() using code on the develop branch and did not get any errors.

Considering the current state of TextDirectoryCorpus where a windows user (like me) cannot handle some utf-8 characters, I think (think...) there should not be any issues with existing code except for the following case:

User 1 saves the TextDirectoryCorpus with the encoding optional parameter
User 2 (an unfortunate Windows user) loads the saved TextDirectoryCorpus, using a version of the code without the optional encoding parameter.
User 2 will receive a UniCodeDecodeError as seen in #3316 when they perform any operation that involves TextDirectoryCorpus.getstream()

The best solution I can think of overall is probably for gensim to check the operating system and set the 'PYTHONUTF8' environment variable (if possible).

@piskvorky
Copy link
Owner

I think it's fine now, thanks. Let's wait for @mpenkov 's review and we can merge :)

@piskvorky piskvorky requested a review from mpenkov April 2, 2022 08:39
@piskvorky piskvorky added this to the Next release milestone Apr 2, 2022
@mpenkov mpenkov merged commit edaeee9 into piskvorky:develop Apr 15, 2022
@mpenkov
Copy link
Collaborator

mpenkov commented Apr 15, 2022

LGTM! Thank you @Sandman-Ren !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

corpora.TextDirectoryCorpus fails on utf-8 encoded files on windows
3 participants