BUG: Extracted JPEG data seems to end prematurely #2266

michelcrypt4d4mus · 2023-10-24T20:40:58Z

I'm not 100% sure this is a PyPDF issue, but I suspect it is a regression introduced since 3.14.0: it never used to happen in my application and now happens quite frequently, even though both the calling code and the PyTesseract package are unchanged. There is at least a small chance the problem lies in the underlying Tesseract binary.

Environment

$ python -m platform
3.11.5

$ python -c "import pypdf;print(pypdf._debug_versions)"
3.16.4

Code + PDF

The code is here. In particular, these lines extract a PIL.Image object from the PDF:

for image_number, image in enumerate(page.images, start=1):
    image_obj = Image.open(io.BytesIO(image.data))

The resulting PIL.Image object is then passed to PyTesseract here:

text = pytesseract.image_to_string(image)

PyTesseract then fails with this:

TesseractError: (1, 'Corrupt JPEG data: premature end of data segment Error in pixReadStreamJpeg: read error at scanline 2206; nwarn = 1 Error in pixReadStreamJpeg: bad data Error in pixReadStream: jpeg: no pix returned Error in pixRead: pix not read Error during processing.')
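One way to narrow down whether the bytes coming out of the PDF are already cut short, before Tesseract ever sees them, is to check for the JPEG end-of-image marker. The snippet below is only a rough sketch: `looks_truncated_jpeg` is a hypothetical helper, not part of pypdf, Pillow, or PyTesseract, and it is a heuristic. Some decoders tolerate a missing marker, and a stream that ends with the marker can still be damaged elsewhere.

```python
JPEG_SOI = b"\xff\xd8"  # start-of-image marker, first two bytes of a JPEG
JPEG_EOI = b"\xff\xd9"  # end-of-image marker, normally the last two bytes


def looks_truncated_jpeg(data: bytes) -> bool:
    """Heuristic: True if data starts like a JPEG but has no trailing EOI marker.

    Hypothetical helper for illustration only.
    """
    if not data.startswith(JPEG_SOI):
        raise ValueError("not JPEG data (missing SOI marker)")
    # Assumption: some writers pad the stream with whitespace or NUL bytes
    # after EOI, so ignore that padding before checking the final marker.
    return not data.rstrip(b"\x00\r\n\t ").endswith(JPEG_EOI)
```

In the loop above this could be called as `looks_truncated_jpeg(image.data)` for images that are stored as JPEG, to see whether the failing images are the ones missing the marker.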

PDF file

You can use the PDF file in tests.
FTX Claim SC30 01072023101624File595287144.pdf

Traceback

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/uzor/workspace/clown_sort/clown_sort/files/image_file.py:123 in extract_text     │
│                                                                                                  │
│   120 │   │   text = None                                                                        │
│   121 │   │                                                                                      │
│   122 │   │   try:                                                                               │
│ ❱ 123 │   │   │   text = pytesseract.image_to_string(image)                                      │
│   124 │   │   except pytesseract.pytesseract.TesseractError as e:                                │
│   125 │   │   │   console.print_exception()                                                      │
│   126 │   │   │   console.print(warning_text(f"Tesseract OCR failure '{image_name}'! No OCR te   │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:423 in image_to_string                               │
│                                                                                                  │
│   420 │   """                                                                                    │
│   421 │   args = [image, 'txt', lang, config, nice, timeout]                                     │
│   422 │                                                                                          │
│ ❱ 423 │   return {                                                                               │
│   424 │   │   Output.BYTES: lambda: run_and_get_output(*(args + [True])),                        │
│   425 │   │   Output.DICT: lambda: {'text': run_and_get_output(*args)},                          │
│   426 │   │   Output.STRING: lambda: run_and_get_output(*args),                                  │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:426 in <lambda>                                      │
│                                                                                                  │
│   423 │   return {                                                                               │
│   424 │   │   Output.BYTES: lambda: run_and_get_output(*(args + [True])),                        │
│   425 │   │   Output.DICT: lambda: {'text': run_and_get_output(*args)},                          │
│ ❱ 426 │   │   Output.STRING: lambda: run_and_get_output(*args),                                  │
│   427 │   }[output_type]()                                                                       │
│   428                                                                                            │
│   429                                                                                            │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:288 in run_and_get_output                            │
│                                                                                                  │
│   285 │   │   │   'timeout': timeout,                                                            │
│   286 │   │   }                                                                                  │
│   287 │   │                                                                                      │
│ ❱ 288 │   │   run_tesseract(**kwargs)                                                            │
│   289 │   │   filename = f"{kwargs['output_filename_base']}{extsep}{extension}"                  │
│   290 │   │   with open(filename, 'rb') as output_file:                                          │
│   291 │   │   │   if return_bytes:                                                               │
│                                                                                                  │
│ /Users/uzor/Library/Caches/pypoetry/virtualenvs/clown-sort-BrYcfkKs-py3.11/lib/python3. │
│ 11/site-packages/pytesseract/pytesseract.py:264 in run_tesseract                                 │
│                                                                                                  │
│   261 │                                                                                          │
│   262 │   with timeout_manager(proc, timeout) as error_string:                                   │
│   263 │   │   if proc.returncode:                                                                │
│ ❱ 264 │   │   │   raise TesseractError(proc.returncode, get_errors(error_string))                │
│   265                                                                                            │
│   266                                                                                            │
│   267 def run_and_get_output(                                                                    │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TesseractError: (1, 'Corrupt JPEG data: premature end of data segment Error in pixReadStreamJpeg: read error at scanline 2206; nwarn = 1 Error in pixReadStreamJpeg: bad data Error in pixReadStream: jpeg: no pix returned Error in pixRead: pix not read Error during processing.')


michelcrypt4d4mus · 2023-10-24T20:44:35Z

When I downgrade to 3.14.0 this issue goes away so I think it can be confirmed as a regression. Here's a file that was failing in 3.16.4 but working fine in 3.14.0 (also usable for tests):
FTX Claim Skybridge Capital 30062023113350File971325116.pdf

pubpub-zz · 2024-04-13T10:29:57Z

images from FTX Claim SC30 01072023101624File595287144.pdf :
iss2266a_images.zip

pubpub-zz · 2024-04-13T10:42:45Z

images from FTX.Claim.Skybridge.Capital.30062023113350File971325116.pdf
iss2266b_images.zip

pubpub-zz · 2024-04-13T10:43:41Z

@michelcrypt4d4mus
Can you please indicate exactly which images used to fail?
Checking every image in the tests would be too time-consuming.


michelcrypt4d4mus mentioned this issue Oct 24, 2023

"OSError: encoder error -2 when writing image file" while enumerating images #2265

Closed

MartinThoma added workflow-images From a users perspective, image handling is the affected feature/workflow is-regression Regression introduced as a side-effect of another change labels Oct 28, 2023

MartinThoma changed the title ~~Extracted JPEG data seems to end prematurely~~ BUG: Extracted JPEG data seems to end prematurely Oct 28, 2023

MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Oct 28, 2023

michelcrypt4d4mus mentioned this issue Apr 11, 2024

ROB: Cope with some issues in pillow #2595

Merged

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Apr 13, 2024

add test for py-pdf#2266

41d18b9

to cover py-pdf#2266

stefan6419846 closed this as completed in #2595 Apr 16, 2024

stefan6419846 closed this as completed in b171422 Apr 16, 2024
