Delay consumption of new files #46

bavarialogy · 2020-11-25T21:17:28Z

I might be a little too nervous but I'm just too excited. I just tested direct scanning into the consumption directory, which leads to the following warning in the log:

`11/25/20, 10:02 PM WARNING Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!

11/25/20, 10:02 PM INFO Consuming doc20201125222904.pdf`

The stdout log of the webserver container shows the following:
`22:02:09 [Q] INFO Enqueued 1
22:02:09 [Q] INFO Process-1:1 processing [doc20201125222904.pdf]
Consuming doc20201125222904.pdf
Parser: RasterisedDocumentParser based on mime type application/pdf
Generating thumbnail for doc20201125222904.pdf...
Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -trim /usr/src/paperless/src/../consume/doc20201125222904.pdf[0] /tmp/paperless/paperless-3vtrk65p/convert.png
**** Error: Cannot find a 'startxref' anywhere in the file.
Output may be incorrect.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.

Requested FirstPage is greater than the number of pages in the file: 0
No pages will be processed (FirstPage > LastPage).
convert-im6.q16: no images defined /tmp/paperless/paperless-3vtrk65p/convert.png' @ error/convert.c/ConvertImageCommand/3258.
Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
**** Error: Cannot find a 'startxref' anywhere in the file.
Output may be incorrect.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.
No pages will be processed (FirstPage > LastPage).
Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -trim /tmp/paperless/paperless-3vtrk65p/gs_out.png /tmp/paperless/paperless-3vtrk65p/convert.png
convert-im6.q16: unable to open image /tmp/paperless/paperless-3vtrk65p/gs_out.png': No such file or directory @ error/blob.c/OpenBlob/2874.
convert-im6.q16: no images defined `/tmp/paperless/paperless-3vtrk65p/convert.png' @ error/convert.c/ConvertImageCommand/3258.
Deleting directory /tmp/paperless/paperless-3vtrk65p
22:02:09 [Q] ERROR Failed [doc20201125222904.pdf] - Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png'] : Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 49, in get_thumbnail
logging_group=self.logging_group)
File "/usr/src/paperless/src/documents/parsers.py", line 107, in run_convert
raise ParseError("Convert failed at {}".format(args))
documents.parsers.ParseError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/usr/src/paperless/src/../consume/doc20201125222904.pdf[0]', '/tmp/paperless/paperless-3vtrk65p/convert.png']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 132, in try_consume_file
thumbnail = document_parser.get_optimised_thumbnail()
File "/usr/src/paperless/src/documents/parsers.py", line 168, in get_optimised_thumbnail
return self.optimise_thumbnail(self.get_thumbnail())
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 73, in get_thumbnail
logging_group=self.logging_group)
File "/usr/src/paperless/src/documents/parsers.py", line 107, in run_convert
raise ParseError("Convert failed at {}".format(args))
documents.parsers.ParseError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 68, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 138, in try_consume_file
raise ConsumerError(e)
documents.consumer.ConsumerError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png']

`

Is it possible to delay the consumption of new files while they're being written to avoid this?
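
The `startxref` and trailer errors in the log above are what a PDF reader reports when the end of the file is missing. As a rough illustration only (this is not how paperless decides when to consume a file), a pre-check could look at the tail of the file for the `startxref` keyword and the `%%EOF` marker that a finished PDF normally ends with. The function name `pdf_looks_complete` and the `tail_bytes` parameter are made up for this sketch.

```python
import os


def pdf_looks_complete(path, tail_bytes=1024):
    """Heuristic check: does the end of the file look like a finished PDF?

    Hypothetical helper, not part of paperless. A complete PDF normally ends
    with a 'startxref' entry followed by an '%%EOF' marker.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Only the last few bytes matter, so skip the rest of the file.
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    return b"startxref" in tail and b"%%EOF" in tail
```

This is only a heuristic: a file that passes may still be damaged in other ways, and a PDF saved with incremental updates can contain `%%EOF` more than once.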

jonaswinkler · 2020-11-25T21:26:49Z

If you scan the file somewhere else and move it into the consumption folder afterwards, does that work? That would mean the consumer is trying to read the file while it's still being written, resulting in the error about no pages in the PDF.
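
The "scan elsewhere, then move" test could be scripted roughly as follows. This is a minimal sketch that assumes the staging directory and the consumption directory are on the same filesystem, so that the rename is a single step and the consumer never sees a half-written file. The function name `deliver` and both directory arguments are placeholders.

```python
import os


def deliver(staged_path, consume_dir):
    """Move a fully written file into the consumption directory.

    Hypothetical helper. os.replace is a single rename when both paths are
    on the same filesystem; across filesystems it raises OSError instead.
    """
    target = os.path.join(consume_dir, os.path.basename(staged_path))
    os.replace(staged_path, target)
    return target
```

For example, the scanner would write to a staging directory and a script would call `deliver(staged_file, consume_dir)` once the scan has finished.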

bavarialogy · 2020-11-26T07:12:44Z

It's not consuming any PDF anymore (not even new ones) now. I also restarted the containers just to be sure. The following exception occurs:

`tuple index out of range : Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 173, in try_consume_file
classifier=classifier
File "/usr/local/lib/python3.7/site-packages/django/dispatch/dispatcher.py", line 179, in send
for receiver in self._live_receivers(sender)
File "/usr/local/lib/python3.7/site-packages/django/dispatch/dispatcher.py", line 179, in <listcomp>
for receiver in self._live_receivers(sender)
File "/usr/src/paperless/src/documents/signals/handlers.py", line 127, in set_tags
matched_tags = matching.match_tags(document.content, classifier)
File "/usr/src/paperless/src/documents/matching.py", line 36, in match_tags
predicted_tag_ids = classifier.predict_tags(document_content)
File "/usr/src/paperless/src/documents/classifier.py", line 224, in predict_tags
tags_ids = self.tags_binarizer.inverse_transform(y)[0]
File "/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/label.py", line 1017, in inverse_transform
if yt.shape[1] != len(self.classes):
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 68, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 187, in try_consume_file
raise ConsumerError(e)
documents.consumer.ConsumerError: tuple index out of range`

jayme-github · 2020-11-26T11:26:35Z

> If you scan the file somewhere else and move it into the consumption folder afterwards, does that work? That would mean the consumer is trying to read the file while it's still being written, resulting in the error about no pages in the PDF.

This is actually a pretty common issue when using watchdog as it does not (and will not) expose close_write events (gorakhargosh/watchdog#184). Especially with larger files (in my case from a Brother document scanner) one will run into this with almost every document.

> It's not consuming any PDF anymore (not even new ones) now. I also restarted the containers just to be sure. The following exception occurs:

That's probably unrelated, I just created #47 for that.

totti4ever · 2020-11-29T06:31:29Z

> This is actually a pretty common issue when using watchdog as it does not (and will not) expose close_write events (gorakhargosh/watchdog#184). Especially with larger files (in my case from a Brother document scanner) one will run into this with almost every document.

@jayme-github, how did you solve this? I am running a while loop that checks whether the file size is still increasing. It is still error-prone: sometimes (with large documents or network issues) my scanner apparently caches more data before writing, though most of the time less. I guess there are smarter ways? :-)
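
A size-polling loop of the kind described here might look like the sketch below. The function name `wait_until_stable` and the parameters `interval`, `stable_checks` and `timeout` are illustrative and not taken from totti4ever's actual script. As noted above, the approach stays a heuristic: if the scanner pauses for longer than `interval * stable_checks`, the file is reported as finished too early.

```python
import os
import time


def wait_until_stable(path, interval=1.0, stable_checks=3, timeout=60.0):
    """Poll the file size until it stops changing.

    Returns True once the size is non-zero and was identical on
    `stable_checks` consecutive re-checks, or False if `timeout` seconds
    pass first. Hypothetical helper for illustration.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    unchanged = 0
    while time.monotonic() < deadline:
        size = os.path.getsize(path)
        if size == last_size and size > 0:
            unchanged += 1
            if unchanged >= stable_checks:
                return True
        else:
            # Size changed (or file is still empty): start counting again.
            unchanged = 0
            last_size = size
        time.sleep(interval)
    return False
```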

jayme-github · 2020-11-29T15:09:08Z

> @jayme-github, how did you solve this? I am running a while loop that checks whether the file size is still increasing. It is still error-prone: sometimes (with large documents or network issues) my scanner apparently caches more data before writing, though most of the time less. I guess there are smarter ways? :-)

I switched to inotify ;-)

totti4ever · 2020-11-29T18:02:45Z

Hmmm and then just use close_write? Damn, I should use that instead of create/move, too
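
The thread does not say which inotify tool or binding was used, so the following is only a sketch of the idea: it calls the Linux `inotify_init` and `inotify_add_watch` functions through the standard-library `ctypes` module and reports `IN_CLOSE_WRITE` (writer closed the file) and `IN_MOVED_TO` (file moved into the directory) events. The helper names `open_watch` and `read_events` are invented here, and the code is Linux-only.

```python
import ctypes
import ctypes.util
import os
import struct

IN_CLOSE_WRITE = 0x00000008  # a file opened for writing was closed
IN_MOVED_TO = 0x00000080     # a file was moved into the watched directory
_EVENT_HEADER = struct.Struct("iIII")  # wd, mask, cookie, name length

# If find_library returns None, CDLL(None) falls back to the symbols
# already loaded into the process, which include libc.
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.inotify_add_watch.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_uint32]


def open_watch(directory):
    """Return an inotify file descriptor watching `directory` (hypothetical helper)."""
    fd = _libc.inotify_init()
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    mask = IN_CLOSE_WRITE | IN_MOVED_TO
    if _libc.inotify_add_watch(fd, os.fsencode(directory), mask) < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, os.strerror(err))
    return fd


def read_events(fd):
    """Block until events arrive; return a list of (filename, mask) tuples."""
    buf = os.read(fd, 4096)
    events = []
    offset = 0
    while offset < len(buf):
        _wd, mask, _cookie, name_len = _EVENT_HEADER.unpack_from(buf, offset)
        offset += _EVENT_HEADER.size
        # The name is NUL-padded to name_len bytes.
        name = buf[offset:offset + name_len].split(b"\0", 1)[0]
        offset += name_len
        events.append((os.fsdecode(name), mask))
    return events
```

A consumer loop would call `read_events(fd)` repeatedly and hand each reported file to the consumer, since by then the writer has either closed the file or moved it into place.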

jonaswinkler added the Back end and bug labels Nov 25, 2020

jonaswinkler added this to the 1.0 milestone Nov 25, 2020

jonaswinkler closed this as completed in 7539069 Nov 27, 2020

mweimerskirch pushed a commit to mweimerskirch/paperless-ng that referenced this issue Feb 16, 2022

Merge pull request jonaswinkler#46 from a-waider/patch-1

657ed21

Update setup.rst
