This repository has been archived by the owner on Feb 16, 2023. It is now read-only.

Delay consumption of new files #46

Closed
bavarialogy opened this issue Nov 25, 2020 · 6 comments
Comments

@bavarialogy

I might be a little too nervous but I'm just too excited. I just tested direct scanning into the consumption directory, which leads to the following warning in the log:

`11/25/20, 10:02 PM WARNING Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!

11/25/20, 10:02 PM INFO Consuming doc20201125222904.pdf`

The stdout log of the webserver container shows the following:
`22:02:09 [Q] INFO Enqueued 1
22:02:09 [Q] INFO Process-1:1 processing [doc20201125222904.pdf]
Consuming doc20201125222904.pdf
Parser: RasterisedDocumentParser based on mime type application/pdf
Generating thumbnail for doc20201125222904.pdf...
Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -trim /usr/src/paperless/src/../consume/doc20201125222904.pdf[0] /tmp/paperless/paperless-3vtrk65p/convert.png
**** Error: Cannot find a 'startxref' anywhere in the file.
Output may be incorrect.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.

Requested FirstPage is greater than the number of pages in the file: 0
No pages will be processed (FirstPage > LastPage).
convert-im6.q16: no images defined `/tmp/paperless/paperless-3vtrk65p/convert.png' @ error/convert.c/ConvertImageCommand/3258.
Thumbnail generation with ImageMagick failed, falling back to ghostscript. Check your /etc/ImageMagick-x/policy.xml!
**** Error: Cannot find a 'startxref' anywhere in the file.
Output may be incorrect.
**** Error: An error occurred while reading an XREF table.
**** The file has been damaged. This may have been caused
**** by a problem while converting or transfering the file.
**** Ghostscript will attempt to recover the data.
**** However, the output may be incorrect.
**** Error: Trailer dictionary not found.
Output may be incorrect.
No pages will be processed (FirstPage > LastPage).
Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -trim /tmp/paperless/paperless-3vtrk65p/gs_out.png /tmp/paperless/paperless-3vtrk65p/convert.png
convert-im6.q16: unable to open image `/tmp/paperless/paperless-3vtrk65p/gs_out.png': No such file or directory @ error/blob.c/OpenBlob/2874.
convert-im6.q16: no images defined `/tmp/paperless/paperless-3vtrk65p/convert.png' @ error/convert.c/ConvertImageCommand/3258.
Deleting directory /tmp/paperless/paperless-3vtrk65p
22:02:09 [Q] ERROR Failed [doc20201125222904.pdf] - Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png'] : Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 49, in get_thumbnail
logging_group=self.logging_group)
File "/usr/src/paperless/src/documents/parsers.py", line 107, in run_convert
raise ParseError("Convert failed at {}".format(args))
documents.parsers.ParseError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/usr/src/paperless/src/../consume/doc20201125222904.pdf[0]', '/tmp/paperless/paperless-3vtrk65p/convert.png']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 132, in try_consume_file
thumbnail = document_parser.get_optimised_thumbnail()
File "/usr/src/paperless/src/documents/parsers.py", line 168, in get_optimised_thumbnail
return self.optimise_thumbnail(self.get_thumbnail())
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 73, in get_thumbnail
logging_group=self.logging_group)
File "/usr/src/paperless/src/documents/parsers.py", line 107, in run_convert
raise ParseError("Convert failed at {}".format(args))
documents.parsers.ParseError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 68, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 138, in try_consume_file
raise ConsumerError(e)
documents.consumer.ConsumerError: Convert failed at ['convert', '-density', '300', '-scale', '500x5000>', '-alpha', 'remove', '-strip', '-trim', '/tmp/paperless/paperless-3vtrk65p/gs_out.png', '/tmp/paperless/paperless-3vtrk65p/convert.png']

`

Is it possible to delay the consumption of new files while they're being written to avoid this?

@jonaswinkler
Owner

If you scan the file somewhere else and move it into the consumption folder afterwards, does that work? If it does, that would mean the consumer is trying to read the file while it's still being written, which would explain the error about no pages in the PDF.
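
The "scan somewhere else, then move it in" approach can be sketched as follows. This is only an illustration, not code from paperless: the function name `deliver` and the `staging_dir` parameter are hypothetical, and it assumes the staging directory sits on the same filesystem as the consumption folder, since `os.replace` is only an atomic rename in that case.

```python
import os
from pathlib import Path


def deliver(data: bytes, name: str, staging_dir, consume_dir) -> Path:
    """Write a file completely in staging_dir, then rename it into consume_dir.

    Assumption: both directories are on the same filesystem, so the rename
    is atomic and a watcher never sees a half-written file.
    """
    staged = Path(staging_dir) / name
    with open(staged, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before moving

    target = Path(consume_dir) / name
    os.replace(staged, target)  # raises OSError if the directories are on different filesystems
    return target
```

With this pattern the file only ever appears in the consumption folder in its finished state, so there is nothing to delay.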

@jonaswinkler added the "Back end" and "bug" labels on Nov 25, 2020
@jonaswinkler added this to the 1.0 milestone on Nov 25, 2020
@bavarialogy
Author

It's not consuming any PDF anymore (not even new ones) now. I also restarted the containers just to be sure. The following exception occurs:

`tuple index out of range : Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 173, in try_consume_file
classifier=classifier
File "/usr/local/lib/python3.7/site-packages/django/dispatch/dispatcher.py", line 179, in send
for receiver in self._live_receivers(sender)
File "/usr/local/lib/python3.7/site-packages/django/dispatch/dispatcher.py", line 179, in <listcomp>
for receiver in self._live_receivers(sender)
File "/usr/src/paperless/src/documents/signals/handlers.py", line 127, in set_tags
matched_tags = matching.match_tags(document.content, classifier)
File "/usr/src/paperless/src/documents/matching.py", line 36, in match_tags
predicted_tag_ids = classifier.predict_tags(document_content)
File "/usr/src/paperless/src/documents/classifier.py", line 224, in predict_tags
tags_ids = self.tags_binarizer.inverse_transform(y)[0]
File "/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/label.py", line 1017, in inverse_transform
if yt.shape[1] != len(self.classes_):
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 68, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 187, in try_consume_file
raise ConsumerError(e)
documents.consumer.ConsumerError: tuple index out of range`

@jayme-github
Contributor

> If you scan the file somewhere else and move it into the consumption folder afterwards, does that work? That would mean the consumer is trying to read the file while it's still being written, resulting in the error about no pages in the PDF.

This is actually a pretty common issue when using watchdog as it does not (and will not) expose close_write events (gorakhargosh/watchdog#184). Especially with larger files (in my case from a Brother document scanner) one will run into this with almost every document.

> It's not consuming any PDF anymore (not even new ones) now. I also restarted the containers just to be sure. The following exception occurs:

That's probably unrelated, I just created #47 for that.

@totti4ever

> This is actually a pretty common issue when using watchdog as it does not (and will not) expose close_write events (gorakhargosh/watchdog#184). Especially with larger files (in my case from a Brother document scanner) one will run into this with almost every document.

@jayme-github, how did you solve this? I am running a while loop, checking if file_size still increases. It is still prone to errors as sometimes (large documents or network issues) my scanner apparently caches more, most of the time less. I guess there are smarter ways? :-)
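
For reference, the "poll until the size stops growing" workaround described above might look roughly like this. It is a sketch, not the commenter's actual script: the function name `wait_until_stable` and all parameter defaults are assumptions. As noted above, it remains a heuristic, because a writer that pauses for longer than `interval * stable_checks` will still look finished.

```python
import os
import time


def wait_until_stable(path, interval=1.0, stable_checks=3, timeout=60.0):
    """Return True once size and mtime are unchanged for stable_checks polls in a row.

    Returns False if the file never settles before the timeout expires.
    """
    deadline = time.monotonic() + timeout
    last = None
    unchanged = 0
    while time.monotonic() < deadline:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            # File not there (yet): start over.
            last, unchanged = None, 0
        else:
            current = (st.st_size, st.st_mtime_ns)
            if current == last and st.st_size > 0:
                unchanged += 1
                if unchanged >= stable_checks:
                    return True
            else:
                last, unchanged = current, 0
        time.sleep(interval)
    return False
```

Raising `stable_checks` or `interval` trades latency for fewer false positives, but no setting fully rules out a slow writer.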

@jayme-github
Contributor

> @jayme-github, how did you solve this? I am running a while loop, checking if file_size still increases. It is still prone to errors as sometimes (large documents or network issues) my scanner apparently caches more, most of the time less. I guess there are smarter ways? :-)

I switched to inotify ;-)

@totti4ever

Hmmm and then just use close_write? Damn, I should use that instead of create/move, too
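
To illustrate what reacting to close_write means in practice, here is a dependency-free, Linux-only sketch that talks to inotify through `ctypes`. The thread does not say which inotify binding was used, so this is an assumption for illustration; in a real setup a wrapper library would likely be more convenient. The helper names `open_watch` and `read_events` are hypothetical.

```python
import ctypes
import ctypes.util
import os
import struct

# Event masks from <sys/inotify.h>.
IN_CLOSE_WRITE = 0x00000008  # a file opened for writing was closed
IN_MOVED_TO = 0x00000080     # a file was moved into the watched directory

# struct inotify_event header: int wd; uint32 mask, cookie, len.
_EVENT_HEADER = struct.Struct("iIII")

# If find_library returns None, CDLL(None) still exposes libc symbols on Linux.
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.inotify_add_watch.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_uint32]


def open_watch(directory):
    """Start watching a directory for finished files; returns the inotify fd."""
    fd = _libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    wd = _libc.inotify_add_watch(
        fd, os.fsencode(directory), IN_CLOSE_WRITE | IN_MOVED_TO
    )
    if wd < 0:
        err = ctypes.get_errno()
        os.close(fd)
        raise OSError(err, "inotify_add_watch failed")
    return fd


def read_events(fd):
    """Block until events arrive, then return a list of (filename, mask) tuples."""
    data = os.read(fd, 4096)
    events = []
    offset = 0
    while offset < len(data):
        _wd, mask, _cookie, name_len = _EVENT_HEADER.unpack_from(data, offset)
        offset += _EVENT_HEADER.size
        name = data[offset:offset + name_len].rstrip(b"\0")  # name is NUL-padded
        offset += name_len
        events.append((os.fsdecode(name), mask))
    return events
```

Because the watch ignores create and modify events, a file written in place is reported only once its writer closes it, and a file moved in from elsewhere is reported via `IN_MOVED_TO`. The caller is responsible for closing the returned descriptor with `os.close`.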

mweimerskirch pushed a commit to mweimerskirch/paperless-ng that referenced this issue Feb 16, 2022