corpora.TextDirectoryCorpus fails on utf-8 encoded files on windows #3316

Closed
Sandman-Ren opened this issue Apr 1, 2022 · 2 comments · Fixed by #3317
Labels
bug, difficulty easy, impact MEDIUM, reach LOW

Comments

@Sandman-Ren
Contributor

Problem description

What are you trying to achieve? What is the expected result? What are you seeing instead?

I have a directory of UTF-8 encoded plain-text files (scraped Reddit submission selftext, i.e. the text of Reddit posts). I wanted to create a corpus using gensim.corpora.TextDirectoryCorpus(<dir_of_scraped_plaintexts>). I expected this to run without error and return a working corpus. Instead I see a UnicodeDecodeError (not to be confused with the UnicodeDecodeError in FAQ Q10).

Stack Trace

---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13728\103931668.py in <module>
1 # load all selftext into gensim
2 all_selftext_dir = Path.cwd() / 'data/all_selftexts'
----> 3 corpus = gensim.corpora.TextDirectoryCorpus(str(all_selftext_dir))

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in __init__(self, input, dictionary, metadata, min_depth, max_depth, pattern, exclude_pattern, lines_are_documents, **kwargs)
433 self.exclude_pattern = exclude_pattern
434 self.lines_are_documents = lines_are_documents
--> 435 super(TextDirectoryCorpus, self).__init__(input, dictionary, metadata, **kwargs)
436
437 @property

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in __init__(self, input, dictionary, metadata, character_filters, tokenizer, token_filters)
181 self.length = None
182 self.dictionary = None
--> 183 self.init_dictionary(dictionary)
184
185 def init_dictionary(self, dictionary):

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in init_dictionary(self, dictionary)
203 metadata_setting = self.metadata
204 self.metadata = False
--> 205 self.dictionary.add_documents(self.get_texts())
206 self.metadata = metadata_setting
207 else:

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\dictionary.py in add_documents(self, documents, prune_at)
192
193 """
--> 194 for docno, document in enumerate(documents):
195 # log progress & run a regular check for pruning, once every 10k docs
196 if docno % 10000 == 0:

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in get_texts(self)
312 yield self.preprocess_text(line), (lineno,)
313 else:
--> 314 for line in lines:
315 yield self.preprocess_text(line)
316

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in getstream(self)
521 except Exception as e:
522 print(path)
--> 523 raise e
524 num_texts += 1
525

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in getstream(self)
518 else:
519 try:
--> 520 yield f.read().strip()
521 except Exception as e:
522 print(path)

D:\Work\Anaconda\envs\cs37\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1897: character maps to <undefined>

Steps/code/corpus to reproduce

Platform specific: this error is reproducible on platforms where locale.getpreferredencoding() == 'cp1252', i.e. only on some Windows machines.

Consider this file: encoding_err_txt.txt
Place the above file in an empty directory, then run:

gensim.corpora.TextDirectoryCorpus(<path_to_dir>)
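The decoding failure itself can be shown without gensim or the attached file. The following is only a minimal sketch: the sample string and the helper name try_decode are made up for illustration, and the codec is passed explicitly so the behaviour is the same on any platform.

```python
# U+201D (right double quotation mark) is encoded in UTF-8 as the bytes E2 80 9D.
# Byte 0x9d has no mapping in cp1252, which is the byte reported in the traceback.
text = "He said \u201chello\u201d"
raw = text.encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short error description if decoding fails.

    Hypothetical helper, used only for this illustration.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        bad_byte = data[exc.start]
        return f"{type(exc).__name__}: byte {bad_byte:#x} at position {exc.start}"


print(try_decode(raw, "utf-8"))   # decodes cleanly
print(try_decode(raw, "cp1252"))  # reports the undecodable byte
```

Any UTF-8 text containing a character whose encoding includes a byte that cp1252 leaves undefined should fail the same way.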

Versions

>>> import platform; print(platform.platform())
Windows-10-10.0.19041-SP0
>>> import sys; print("Python", sys.version)
Python 3.7.10 (default, Feb 26 2021, 13:06:18) [MSC v.1916 64 bit (AMD64)]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.21.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.3
>>> import gensim; print("gensim", gensim.__version__)
gensim 4.1.2
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

Additional Note

This issue seems to be caused by gensim.corpora.textcorpus.py:513, where TextDirectoryCorpus.getstream() calls the Python built-in open() without an encoding= argument. Python then defaults to

locale.getpreferredencoding(False)

as the encoding for reading the file. Unfortunately, on the aforementioned platform that call returns cp1252, which cannot decode some UTF-8 byte sequences (byte 0x9d in the traceback above, for example, has no cp1252 mapping).
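The difference between relying on the default and passing encoding= can be sketched with a temporary file. This is an illustration only: the file name and contents are invented, and the cp1252 default of the affected machines is simulated by passing encoding="cp1252" explicitly, so the snippet behaves the same everywhere.

```python
import locale
import tempfile
from pathlib import Path

# The encoding open() falls back to when encoding= is omitted (platform dependent).
print(locale.getpreferredencoding(False))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "post.txt"  # hypothetical sample file
    path.write_bytes("curly \u201cquotes\u201d".encode("utf-8"))

    # Explicit encoding: the result does not depend on the platform locale.
    with open(path, encoding="utf-8") as f:
        explicit = f.read()

    # Simulate the cp1252 default seen on the affected Windows machines.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        simulated_default_failed = False
    except UnicodeDecodeError:
        simulated_default_failed = True

print(explicit)
print(simulated_default_failed)
```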

Workarounds

Python 3.7 and later: a UTF-8 mode was added, in which Python reads the environment variable PYTHONUTF8 and sets sys.flags.utf8_mode. If sys.flags.utf8_mode == 1, then locale.getpreferredencoding(False) == "UTF-8" and TextDirectoryCorpus is able to load the file.
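One way to check the workaround is to start a child interpreter with PYTHONUTF8=1 and inspect what it reports. This is a sketch under the assumption that the child is not started with options that ignore environment variables; the variable names are illustrative. The exact spelling of the reported encoding name may differ between Python versions, so it is compared case-insensitively.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: print the UTF-8 mode flag and the
# preferred encoding that open() would use by default.
probe = "import sys, locale; print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"

# Copy the current environment and enable UTF-8 mode for the child only.
env = dict(os.environ, PYTHONUTF8="1")

result = subprocess.run(
    [sys.executable, "-c", probe],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
flag, encoding = result.stdout.split()
print(flag, encoding)
print(encoding.lower() == "utf-8")
```

The same mode can also be enabled per invocation with the -X utf8 command-line option.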

I spent a few hours tinkering around and reading up on some resources (for example Python's UTF-8 mode; changing the locale to change the preferred encoding did not work) before discovering the above workaround.

Overall I think this is an easy issue to fix (perhaps by adding an encoding='utf-8' default keyword argument to TextDirectoryCorpus.__init__(...) and passing self.encoding to the open() call at gensim.corpora.textcorpus.py:513), and it does not look like it will break anything. It should greatly increase the usability of TextDirectoryCorpus on Windows platforms.
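The shape of the proposed change can be sketched outside gensim. The function iter_texts below is hypothetical and is not gensim's implementation; it only illustrates a directory reader that takes an encoding keyword with a UTF-8 default and forwards it to open().

```python
import os
import tempfile


def iter_texts(directory, encoding="utf-8"):
    """Yield the stripped contents of every file under `directory`.

    Hypothetical helper for illustration; the encoding is always passed
    to open() instead of relying on the platform default.
    """
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), "rt", encoding=encoding) as f:
                yield f.read().strip()


# Usage: write one UTF-8 file into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write("na\u00efve \u201ctext\u201d\n")
    texts = list(iter_texts(tmp))

print(texts)
```

Keeping the encoding as a keyword argument leaves room for callers whose files are not UTF-8 to pass a different codec.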

Thanks ☺️

@piskvorky
Owner

piskvorky commented Apr 1, 2022

You're right. This seems to be a bug introduced in #1459.

An explicit TextCorpus(encoding) parameter will be great, with a utf8 default – can you open a PR please?

And also check all the other open() calls in that module. I see there are several places that open in text mode but lack encoding.

While at it, could you also replace open by smart_open.open? Because smart_open.open is 100% compatible with the built-in open, but will also allow users to input compressed files transparently (saves disk space, especially with large text files).

Many thanks!

@Sandman-Ren
Contributor Author

Thanks for the reply. I read about smart_open at some point but have not really tried it. Also it's my first time creating a PR for an open source project. I'll spend some time reading up on the docs for smart_open and contributing guidelines a bit and will do a PR 😄
