corpora.TextDirectoryCorpus fails on utf-8 encoded files on windows #3316
Labels
bug
Issue described a bug
difficulty easy
Easy issue: required small fix
impact MEDIUM
Big annoyance for affected users
reach LOW
Affects only niche use-case users
Problem description
What are you trying to achieve? What is the expected result? What are you seeing instead?
I have a directory of UTF-8 encoded plain-text files (scraped Reddit submission selftext, i.e. the text of Reddit posts). I wanted to create a corpus using gensim.corpora.TextDirectoryCorpus(<dir_of_scraped_plaintexts>). I expected this to run without error and return a working corpus. Instead I see a UnicodeDecodeError (not to be confused with the UnicodeDecodeError in FAQ Q10).
Stack Trace
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13728\103931668.py in <module>
1 # load all selftext into gensim
2 all_selftext_dir = Path.cwd() / 'data/all_selftexts'
----> 3 corpus = gensim.corpora.TextDirectoryCorpus(str(all_selftext_dir))
D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in __init__(self, input, dictionary, metadata, min_depth, max_depth, pattern, exclude_pattern, lines_are_documents, **kwargs)
433 self.exclude_pattern = exclude_pattern
434 self.lines_are_documents = lines_are_documents
--> 435 super(TextDirectoryCorpus, self).__init__(input, dictionary, metadata, **kwargs)
436
437 @property
D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in __init__(self, input, dictionary, metadata, character_filters, tokenizer, token_filters)
181 self.length = None
182 self.dictionary = None
--> 183 self.init_dictionary(dictionary)
184
185 def init_dictionary(self, dictionary):
D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in init_dictionary(self, dictionary)
203 metadata_setting = self.metadata
204 self.metadata = False
--> 205 self.dictionary.add_documents(self.get_texts())
206 self.metadata = metadata_setting
207 else:
D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\dictionary.py in add_documents(self, documents, prune_at)
192
193 """
--> 194 for docno, document in enumerate(documents):
195 # log progress & run a regular check for pruning, once every 10k docs
196 if docno % 10000 == 0:
D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in get_texts(self)
312 yield self.preprocess_text(line), (lineno,)
313 else:
--> 314 for line in lines:
315 yield self.preprocess_text(line)
316
D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in getstream(self)
521 except Exception as e:
522 print(path)
--> 523 raise e
524 num_texts += 1
525
D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in getstream(self)
518 else:
519 try:
--> 520 yield f.read().strip()
521 except Exception as e:
522 print(path)
D:\Work\Anaconda\envs\cs37\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1897: character maps to <undefined>
Steps/code/corpus to reproduce
Platform specific: this error is reproducible on platforms where
locale.getpreferredencoding() == 'cp1252'
i.e. it is reproducible only on some Windows machines.
Consider this file: encoding_err_txt.txt
Place the above file in an empty directory, then run gensim.corpora.TextDirectoryCorpus(<path_to_that_directory>).
Versions
Additional Note
This issue seems to be caused by gensim.corpora.textcorpus.py:513, where TextDirectoryCorpus.getstream() uses the Python builtin open() without specifying an encoding= argument. Python then defaults to
locale.getpreferredencoding(False)
as the encoding for reading the file. Unfortunately, on the aforementioned platform, that call returns cp1252, which cannot decode some UTF-8 byte sequences.
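The decoding failure can be shown without gensim. The following is a minimal sketch using only the standard library; the sample text and the helper names read_with and demo are assumptions for illustration, not the attached file or gensim code. It writes UTF-8 bytes containing U+201D (whose encoding ends in byte 0x9d, which cp1252 leaves undefined) and reads them back with each encoding passed explicitly, which mimics what open() does on an affected machine versus with encoding='utf-8'.

```python
import tempfile
from pathlib import Path


def read_with(path, encoding):
    """Read a whole file with an explicit encoding, like open(..., encoding=...)."""
    with open(path, "rt", encoding=encoding) as f:
        return f.read().strip()


def demo():
    # Hypothetical sample text: U+201C / U+201D are curly quotes.
    # In UTF-8, U+201D is the byte sequence E2 80 9D.
    text = "a \u201cquoted\u201d word"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        path.write_bytes(text.encode("utf-8"))

        # Explicit UTF-8 round-trips the text.
        utf8_matches = read_with(path, "utf-8") == text

        # cp1252 has no character at 0x9d, so decoding raises.
        try:
            read_with(path, "cp1252")
            cp1252_failed = False
        except UnicodeDecodeError:
            cp1252_failed = True
    return utf8_matches, cp1252_failed


if __name__ == "__main__":
    print(demo())
```

Passing the encoding explicitly makes the result independent of the machine's locale, which is the point of the fix suggested in this issue.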
Workarounds
Python 3.7 and later: Python added a UTF-8 mode in which it reads the environment variable PYTHONUTF8 and sets sys.flags.utf8_mode accordingly. If sys.flags.utf8_mode == 1, then locale.getpreferredencoding(False) == "UTF-8" and TextDirectoryCorpus is able to load the file.
I spent a few hours tinkering and reading up on resources (for example, Python's UTF-8 mode; changing the locale to change the preferred encoding did not work) before discovering the above workaround.
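To confirm that the workaround is active, one option is to start a child interpreter with PYTHONUTF8=1 and print what it sees. This is a rough sketch assuming Python 3.7 or later; probe_utf8_mode is a hypothetical helper name. Depending on the Python version, the reported encoding may be spelled "UTF-8" or "utf-8", so compare it case-insensitively.

```python
import os
import subprocess
import sys


def probe_utf8_mode(flag_value):
    """Run a child interpreter with PYTHONUTF8 set and report its settings."""
    child_code = (
        "import locale, sys; "
        "print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"
    )
    # Copy the current environment and override only PYTHONUTF8.
    env = dict(os.environ, PYTHONUTF8=flag_value)
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    mode, encoding = result.stdout.split()
    return int(mode), encoding


if __name__ == "__main__":
    print(probe_utf8_mode("1"))
```

The variable has to be set before the interpreter starts (for example in the shell or the kernel's environment); changing os.environ inside an already running session does not switch the mode.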
Overall I think this is an easy issue to fix (perhaps by adding an encoding='utf-8' default keyword argument to TextDirectoryCorpus.__init__(...) and passing self.encoding to the open() call at gensim.corpora.textcorpus.py:513), and the change does not look like it will break anything. It should greatly increase the usability of TextDirectoryCorpus on Windows platforms.
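The shape of the suggested change could look roughly like the sketch below. DirectoryTextReader is a hypothetical, gensim-free stand-in written only to illustrate the idea; it is not gensim's implementation, and its method names merely echo those in the traceback. The only point is that the encoding is stored once in __init__ and passed to every open() call.

```python
import os
import tempfile


class DirectoryTextReader:
    """Hypothetical stand-in for a directory-backed corpus; not gensim code."""

    def __init__(self, input_dir, lines_are_documents=False, encoding="utf-8"):
        self.input_dir = input_dir
        self.lines_are_documents = lines_are_documents
        self.encoding = encoding  # stored once, reused for every file

    def iter_filepaths(self):
        """Yield every file path under input_dir."""
        for dirpath, _dirnames, filenames in os.walk(self.input_dir):
            for name in sorted(filenames):
                yield os.path.join(dirpath, name)

    def getstream(self):
        """Yield documents, decoding each file with self.encoding."""
        for path in self.iter_filepaths():
            # Explicit encoding= avoids the locale-dependent default.
            with open(path, "rt", encoding=self.encoding) as f:
                if self.lines_are_documents:
                    for line in f:
                        yield line.strip()
                else:
                    yield f.read().strip()


def demo_reader():
    """Write one UTF-8 file into a temp directory and read it back."""
    text = "a \u201cquoted\u201d word"
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "doc.txt"), "wb") as f:
            f.write(text.encode("utf-8"))
        return list(DirectoryTextReader(tmp).getstream())


if __name__ == "__main__":
    print(demo_reader())
```

Keeping encoding as a keyword argument with a default lets users with files in another encoding override it, while the default no longer depends on the platform.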
Thanks☺️