BUG: Extracted JPEG data seems to end prematurely #2266
Labels
is-bug
From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
is-regression
Regression introduced as a side-effect of another change
workflow-images
From a user's perspective, image handling is the affected feature/workflow
I'm not 100% sure this is a PyPDF issue, though I suspect it is a regression introduced since 3.14.0: this never used to happen in my application, and now it happens quite frequently even though both the calling code and the PyTesseract package are unchanged. There's at least a small chance the problem lies in the underlying Tesseract binary.
Environment
$ python -m platform
3.11.5
$ python -c "import pypdf;print(pypdf._debug_versions)"
3.16.4
Code + PDF
The code is here, in particular these lines, where a PIL.Image object is extracted from the PDF:
These lines produce a PIL.Image object that is passed to PyTesseract here:
PyTesseract then fails with this:
PDF file
You can use the PDF file in tests.
FTX Claim SC30 01072023101624File595287144.pdf
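To help narrow down whether the extracted bytes themselves are cut short, here is a minimal sketch of a check that could be run against this file. It is not taken from the linked code. The helper names `jpeg_looks_truncated` and `find_truncated_jpegs` are hypothetical, and the check is only a heuristic: it assumes that a complete JPEG stream starts with the SOI marker (`FF D8`) and ends with the EOI marker (`FF D9`), ignoring trailing padding. The pypdf calls used (`PdfReader`, `page.images`, and the `name` and `data` attributes of each image) are assumed to behave as in recent pypdf releases.

```python
JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def jpeg_looks_truncated(data: bytes) -> bool:
    """Return True if `data` starts like a JPEG but does not end with EOI.

    Heuristic only: trailing NUL and whitespace bytes are ignored, and a
    stream that ends with EOI can still be damaged somewhere in the middle.
    """
    if not data.startswith(JPEG_SOI):
        raise ValueError("not a JPEG stream (missing SOI marker)")
    return not data.rstrip(b"\x00\r\n\t ").endswith(JPEG_EOI)


def find_truncated_jpegs(pdf_path: str) -> list[str]:
    """List the images in a PDF whose JPEG data appears to end early."""
    # Deferred import so the byte-level check above works without pypdf.
    from pypdf import PdfReader

    suspicious = []
    reader = PdfReader(pdf_path)
    for page_number, page in enumerate(reader.pages, start=1):
        for image in page.images:
            # Only JPEG-encoded images are relevant for this check.
            if image.data.startswith(JPEG_SOI) and jpeg_looks_truncated(image.data):
                suspicious.append(f"page {page_number}: {image.name}")
    return suspicious
```

Running `find_truncated_jpegs` on the attached PDF under pypdf 3.13.x and again under 3.16.4 would show whether the image bytes differ between versions. If no image is flagged under either version, the Tesseract binary becomes the more likely culprit.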
Traceback