'not enough image data' exception from PIL #2343

Closed
brianpow opened this issue Dec 15, 2023 · 2 comments · Fixed by #2591
Labels
Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-images: From a users perspective, image handling is the affected feature/workflow

Comments

@brianpow

I am trying to extract images from PDF files; however, PIL occasionally raises a 'not enough image data' exception when handling certain PDFs. The files look correct in Atril Document Viewer, and extraction works when using pdfimages from poppler-utils.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.5.0-kali3-amd64-x86_64-with-glibc2.37

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==3.17.2, crypt_provider=('cryptography', '38.0.4'), PIL=10.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader
import sys

for filename in sys.argv[1:]:
    reader = PdfReader(filename)
    for i, page in enumerate(reader.pages):
        for j, image in enumerate(page.images):
            print("Writing %d-%d: %s (%d)..." % (i, j, image.name, len(image.data)))
            with open(image.name, "wb") as fp:
                fp.write(image.data)

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

test2_P038-038.pdf

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/home/user/pypdf/pypdf_test.py", line 7, in <module>
    for j, image in enumerate(page.images):
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/_page.py", line 2727, in __iter__
    yield self[i]
          ~~~~^^^
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/_page.py", line 2723, in __getitem__
    return self.get_function(lst[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/_page.py", line 557, in _get_image
    imgd = _xobj_to_image(cast(DictionaryObject, xobjs[id]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/filters.py", line 785, in _xobj_to_image
    img, image_format, extension, _ = _handle_flate(
                                      ^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/pypdf/_xobj_image_helpers.py", line 172, in _handle_flate
    img = Image.frombytes(mode, size, data)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 2952, in frombytes
    im.frombytes(data, decoder_name, args)
  File "/usr/lib/python3/dist-packages/PIL/Image.py", line 805, in frombytes
    raise ValueError(msg)
ValueError: not enough image data
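
Pillow's `Image.frombytes` raises this error when the buffer is shorter than the requested mode and size require. A rough way to check for such a mismatch, assuming raw uncompressed samples with each row padded to a whole byte, is to compute the byte count from the width, height, bits per component, and number of components. `expected_raw_size` is a hypothetical helper, not a pypdf or Pillow function.

```python
def expected_raw_size(width: int, height: int,
                      bits_per_component: int, components: int) -> int:
    """Return the number of bytes a raw image buffer should contain.

    Assumes every row is padded up to a whole number of bytes.
    """
    bits_per_row = width * components * bits_per_component
    bytes_per_row = (bits_per_row + 7) // 8  # round up to a full byte
    return bytes_per_row * height


if __name__ == "__main__":
    # A 10x10 image with one 8-bit component (grayscale-like layout).
    print(expected_raw_size(10, 10, 8, 1))
    # The same dimensions read as three 8-bit components (RGB-like layout).
    print(expected_raw_size(10, 10, 8, 3))
    # A 10x10 bilevel image: 10 bits per row are padded to 2 bytes.
    print(expected_raw_size(10, 10, 1, 1))
```

If a one-component, 8-bit image is decoded with a three-component mode, the buffer holds only a third of the bytes that mode needs, which is one plausible way to end up with this error.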
@MartinThoma added the workflow-images, is-bug, and Has MCVE labels on Dec 16, 2023
@pubpub-zz
Collaborator

The issue is with the first image:
img1

@pubpub-zz
Collaborator

I also ran some tests with
https://github.com/py-pdf/pypdf/files/13946477/panda.pdf
The image in question is r.pages[8].images[9] (a small black image):
Im0

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Apr 8, 2024
closes py-pdf#2343:
1st case : image with images in 1 byte encoding with Separation colorspace

2nd case: similar + \n to be ignored at the end of the image data
stefan6419846 pushed a commit that referenced this issue Apr 10, 2024
Closes #2343:
1st case : image with images in 1 byte encoding with Separation color space

2nd case: similar + \n to be ignored at the end of the image data
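
The second case in the commit message can be illustrated with a sketch that drops a single trailing newline when the buffer is exactly one byte longer than the image needs. This is an assumption about the general idea, not the code from the fix; `strip_trailing_newline` and its parameters are hypothetical.

```python
def strip_trailing_newline(data: bytes, expected_size: int) -> bytes:
    """Drop one trailing newline if it is the only surplus byte.

    Hypothetical helper: expected_size is the byte count the image
    dimensions and sample format call for.
    """
    if len(data) == expected_size + 1 and data.endswith(b"\n"):
        return data[:-1]
    # Leave the data alone otherwise: a final 0x0A may be a real sample.
    return data


if __name__ == "__main__":
    padded = b"\x00" * 4 + b"\n"
    print(len(strip_trailing_newline(padded, 4)))
```

Checking the length first avoids removing a final byte that happens to equal `\n` but is genuine sample data.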