corpora.TextDirectoryCorpus fails on utf-8 encoded files on windows #3316

Sandman-Ren · 2022-04-01T09:04:48Z

Problem description

What are you trying to achieve? What is the expected result? What are you seeing instead?

I have a directory of UTF-8 encoded plain-text files (scraped Reddit submission selftext, i.e. the text of Reddit posts). I wanted to create a corpus using gensim.corpora.TextDirectoryCorpus(<dir_of_scraped_plaintexts>). I expected this to run without error and return a working corpus. Instead I see a UnicodeDecodeError (not to be confused with the UnicodeDecodeError in FAQ Q10):

Stack Trace

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_13728\103931668.py in <module>
1 # load all selftext into gensim
2 all_selftext_dir = Path.cwd() / 'data/all_selftexts'
----> 3 corpus = gensim.corpora.TextDirectoryCorpus(str(all_selftext_dir))

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in __init__(self, input, dictionary, metadata, min_depth, max_depth, pattern, exclude_pattern, lines_are_documents, **kwargs)
433 self.exclude_pattern = exclude_pattern
434 self.lines_are_documents = lines_are_documents
--> 435 super(TextDirectoryCorpus, self).__init__(input, dictionary, metadata, **kwargs)
436
437 @property

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in __init__(self, input, dictionary, metadata, character_filters, tokenizer, token_filters)
181 self.length = None
182 self.dictionary = None
--> 183 self.init_dictionary(dictionary)
184
185 def init_dictionary(self, dictionary):

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in init_dictionary(self, dictionary)
203 metadata_setting = self.metadata
204 self.metadata = False
--> 205 self.dictionary.add_documents(self.get_texts())
206 self.metadata = metadata_setting
207 else:

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\dictionary.py in add_documents(self, documents, prune_at)
192
193 """
--> 194 for docno, document in enumerate(documents):
195 # log progress & run a regular check for pruning, once every 10k docs
196 if docno % 10000 == 0:

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in get_texts(self)
312 yield self.preprocess_text(line), (lineno,)
313 else:
--> 314 for line in lines:
315 yield self.preprocess_text(line)
316

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in getstream(self)
521 except Exception as e:
522 print(path)
--> 523 raise e
524 num_texts += 1
525

D:\Work\Anaconda\envs\cs37\lib\site-packages\gensim\corpora\textcorpus.py in getstream(self)
518 else:
519 try:
--> 520 yield f.read().strip()
521 except Exception as e:
522 print(path)

D:\Work\Anaconda\envs\cs37\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1897: character maps to <undefined>

Steps/code/corpus to reproduce

Platform specific: this error is reproducible on platforms where locale.getpreferredencoding() == 'cp1252', i.e. only on some Windows machines.

Consider this file: encoding_err_txt.txt
Place the above file in an empty directory, then run:

gensim.corpora.TextDirectoryCorpus(<path_to_dir>)

Versions

>>> import platform; print(platform.platform())
Windows-10-10.0.19041-SP0
>>> import sys; print("Python", sys.version)
Python 3.7.10 (default, Feb 26 2021, 13:06:18) [MSC v.1916 64 bit (AMD64)]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.21.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.7.3
>>> import gensim; print("gensim", gensim.__version__)
gensim 4.1.2
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

Additional Note

This issue seems to be caused by gensim.corpora.textcorpus.py:513, where TextDirectoryCorpus.getstream() uses the Python builtin open() without specifying an encoding= argument. Python then defaults to using

locale.getpreferredencoding(False)

as the encoding for reading the file. Unfortunately, on the aforementioned platform, this call returns cp1252, which cannot decode some UTF-8 byte sequences.
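To illustrate why cp1252 trips over UTF-8 input, here is a minimal sketch. It assumes the offending character is U+201D (right double quotation mark), whose UTF-8 encoding ends in the byte 0x9d; the actual character in my file may be a different one. The helper name try_decode is made up for this example.

```python
# Assumption: U+201D stands in for whatever character is in the real file.
# Its UTF-8 encoding is the three bytes E2 80 9D.
utf8_bytes = "\u201d".encode("utf-8")


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or a short description of the decode failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the codec could not map.
        return f"{type(err).__name__}: byte {data[err.start]:#x} at position {err.start}"


print(utf8_bytes)                               # the raw UTF-8 bytes
print(repr(try_decode(utf8_bytes, "utf-8")))    # decodes cleanly as UTF-8
print(try_decode(utf8_bytes, "cp1252"))         # fails: 0x9d is unmapped in cp1252
```

The first two bytes happen to have cp1252 mappings, so the decoder only fails once it reaches 0x9d, which matches the byte named in the stack trace.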

Workarounds

Python 3.7 and later: Python 3.7 added a UTF-8 mode, in which Python reads the environment variable PYTHONUTF8 and sets sys.flags.utf8_mode accordingly. If sys.flags.utf8_mode == 1, then locale.getpreferredencoding(False) == "UTF-8" and TextDirectoryCorpus is able to load the file.
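A quick way to check the effect of this workaround is to start a child interpreter with PYTHONUTF8 set and ask it what it would use by default. This is only a sketch; probe_child and CHILD_CODE are names invented for this example, and the exact spelling of the returned encoding name ("UTF-8" vs "utf-8") may differ between Python versions.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: report the UTF-8 mode flag and the
# encoding that open() falls back to when no encoding= argument is given.
CHILD_CODE = (
    "import locale, sys; "
    "print(sys.flags.utf8_mode, locale.getpreferredencoding(False))"
)


def probe_child(utf8_mode: bool):
    """Run a child Python with PYTHONUTF8 set to 1 or 0 and return (flag, encoding)."""
    env = dict(os.environ)
    env["PYTHONUTF8"] = "1" if utf8_mode else "0"
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    flag, encoding = result.stdout.split()
    return int(flag), encoding


print("PYTHONUTF8=1:", probe_child(True))
print("PYTHONUTF8=0:", probe_child(False))
```

On an affected Windows machine I would expect the second line to report cp1252, but that depends on the system locale.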

I spent a few hours tinkering and reading up on resources (for example, Python's UTF-8 mode) before discovering the above workaround. Changing the locale to change the preferred encoding did not work.

Overall I think this is an easy issue to fix (perhaps by adding an encoding='utf-8' default keyword argument to TextDirectoryCorpus.__init__(...) and passing self.encoding to the open() call at gensim.corpora.textcorpus.py:513), and it does not look like it will break anything. It should greatly increase the usability of TextDirectoryCorpus on Windows platforms.
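The shape of the fix I have in mind is roughly the following. This is a standalone sketch, not gensim's code: iter_file_texts is a hypothetical helper that only mimics the idea of walking a directory and reading each file with an explicit encoding instead of the locale default.

```python
import os
import tempfile


def iter_file_texts(top_dir, encoding="utf-8"):
    """Yield the stripped contents of every file under top_dir.

    Hypothetical helper for illustration; the explicit encoding= argument
    is the point, so the result no longer depends on the platform locale.
    """
    for dirpath, _dirnames, filenames in os.walk(top_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rt", encoding=encoding) as f:
                yield f.read().strip()


# Usage: write a UTF-8 file containing curly quotes, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "post.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("She said \u201chello\u201d\n")
    texts = list(iter_file_texts(tmp))

print(len(texts), "document(s) read")
```

Keeping encoding as a keyword argument with a 'utf-8' default would still let users with legacy files pass a different codec.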

Thanks ☺️

piskvorky · 2022-04-01T09:26:46Z

You're right. This seems to be a bug introduced in #1459.

An explicit TextCorpus(encoding) parameter will be great, with a utf8 default – can you open a PR please?

And also check all the other open() calls in that module. I see there are several places that open in text mode but lack encoding.

While at it, could you also replace open by smart_open.open? Because smart_open.open is 100% compatible with the built-in open, but will also allow users to input compressed files transparently (saves disk space, especially with large text files).

Many thanks!

Sandman-Ren · 2022-04-01T09:32:53Z

Thanks for the reply. I read about smart_open at some point but have not really tried it. Also it's my first time creating a PR for an open source project. I'll spend some time reading up on the docs for smart_open and contributing guidelines a bit and will do a PR 😄

piskvorky added the bug (Issue described a bug), difficulty easy (Easy issue: required small fix), impact MEDIUM (Big annoyance for affected users), and reach LOW (Affects only niche use-case users) labels on Apr 1, 2022

Sandman-Ren mentioned this issue Apr 1, 2022

Added encoding parameter to TextDirectoryCorpus #3317

Merged

mpenkov closed this as completed in #3317 Apr 15, 2022
