Handling of missing newlines before endstream marker #2523

Closed
stefan6419846 opened this issue Mar 15, 2024 · 3 comments · Fixed by #2526
Labels
is-robustness-issue: From a user's perspective, this is about robustness
PdfReader: The PdfReader component is affected

Comments

@stefan6419846
Collaborator

I just stumbled upon some odd PDF files (generated by Microsoft Word for Microsoft 365 in 2022) in which the newline before the endstream marker is missing.

Neither Ghostscript nor Poppler seems to accept this, while pdf.js does. For this reason, I am not sure whether this is something we want to, or should, fix on our side.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader


reader = PdfReader('file.pdf')
for page in reader.pages:
    print(page)
    for key in page.images.keys():
        print(key)
        print(page.images[key])

I have no public reproducer for this, but in theory it should be rather easy to reproduce with any crafted PDF file which contains a snippet like this:

Öìendstream
endobj
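
For anyone who wants to craft such a fragment, here is a minimal sketch. It only builds the raw bytes of a single stream object, with or without an end-of-line before endstream. It does not produce a complete PDF file, and the helper name `build_stream_object` is hypothetical. Whether this fragment triggers the error once embedded in a full file has not been verified here.

```python
def build_stream_object(payload: bytes, newline_before_endstream: bool) -> bytes:
    """Build the raw bytes of one stream object (hypothetical helper).

    Assumption: object number 4 and a direct integer /Length, mirroring the
    object quoted in this issue. This is only a fragment, not a full PDF.
    """
    eol = b"\n" if newline_before_endstream else b""
    header = b"4 0 obj\n<</Length %d>>\nstream\n" % len(payload)
    return header + payload + eol + b"endstream\nendobj\n"


# b"\xd6\xec" are the two bytes shown as "Öì" in the snippet above (Latin-1).
malformed = build_stream_object(b"\xd6\xec", newline_before_endstream=False)
well_formed = build_stream_object(b"\xd6\xec", newline_before_endstream=True)
```

The only difference between the two results is the single end-of-line byte before the endstream keyword.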

Traceback

This is the complete traceback I see (line numbers might be slightly off because of local debugging changes):

Traceback (most recent call last):
  File "/home/stefan/tmp/run.py", line 24, in <module>
    for key in page.images.keys():
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 2397, in keys
    return self.ids_function()
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 443, in _get_ids_image
    content = self._get_contents_as_bytes() or b""
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in _get_contents_as_bytes
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in <genexpr>
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_base.py", line 284, in get_object
    return self.pdf.get_object(self)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_reader.py", line 1296, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1194, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 499, in read_from_stream
    data["__streamdata__"] = read_unsized_from_stream(stream, pdf)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 393, in read_unsized_from_stream
    raise PdfReadError(
pypdf.errors.PdfReadError: Unable to find 'endstream' marker for obj starting at 807.
stefan6419846 added the PdfReader and is-robustness-issue labels Mar 15, 2024
@pubpub-zz
Collaborator

@stefan6419846
Normally the stream length should be provided in /Length. Can you check whether your document provides this information?

@stefan6419846
Collaborator Author

Yes, the xobject has a length:

4 0 obj
<</Length 7584/Filter/FlateDecode>>
stream
...
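
A rough way to check whether a declared /Length actually matches the data is sketched below. It works on the raw bytes of one object and is not part of pypdf. The helper name `length_is_consistent` is hypothetical, and it assumes /Length is a direct integer (not an indirect reference such as `5 0 R`).

```python
import re


def length_is_consistent(obj_bytes: bytes) -> bool:
    """Return True if 'endstream' follows exactly /Length bytes of data.

    Hypothetical helper working on the raw bytes of a single object.
    An optional end-of-line between the data and 'endstream' is tolerated.
    """
    length_match = re.search(rb"/Length\s+(\d+)", obj_bytes)
    # The stream keyword must be followed by CRLF or LF before the data starts.
    start_match = re.search(rb"stream(\r\n|\n)", obj_bytes)
    if length_match is None or start_match is None:
        return False
    declared = int(length_match.group(1))
    tail = obj_bytes[start_match.end() + declared:]
    return tail.lstrip(b"\r\n").startswith(b"endstream")


ok = length_is_consistent(b"4 0 obj\n<</Length 5>>\nstream\nhelloendstream\nendobj\n")
bad = length_is_consistent(b"4 0 obj\n<</Length 7584>>\nstream\nhelloendstream\nendobj\n")
```

In this sketch, `ok` is true because endstream starts right after the five declared bytes, while `bad` is false because the declared length points past the end of the data.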

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Mar 17, 2024
fixes py-pdf#2523
situation met:
* length field is not correct
* xref may contain unordered stream data
* xref contains some free entries (i.e. does not contain stream offset)
@pubpub-zz
Collaborator

After analyzing the file: the /Length value is not valid, and the xref table contained data (some free entries, possibly in mixed-up order) that prevented the software from properly calculating the window used to extract the stream. I've built a PR to fix it.
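
To illustrate the idea of the window calculation (this is not the code from the PR), here is a minimal sketch. It assumes the xref table is available as a mapping from object number to byte offset, with free entries represented as `None`. That representation and the helper name `stream_window_end` are assumptions made for illustration only.

```python
from typing import Dict, Optional


def stream_window_end(
    xref: Dict[int, Optional[int]], obj_offset: int, file_size: int
) -> int:
    """Return the offset where the search window for an object's stream ends.

    Hypothetical helper: the window ends at the closest in-use offset after
    the object, regardless of the order of the xref entries. Free entries
    (None) carry no usable offset and are skipped.
    """
    later_offsets = [
        offset
        for offset in xref.values()
        if offset is not None and offset > obj_offset
    ]
    # Fall back to the end of the file if no later object exists.
    return min(later_offsets, default=file_size)


# Unordered entries with one free entry (object 2); offsets are made up,
# except 807, which is the object offset from the traceback.
xref = {1: 15, 2: None, 4: 807, 3: 9000, 5: 120}
window_end = stream_window_end(xref, obj_offset=807, file_size=12000)
```

Taking the minimum over all later offsets, instead of relying on the next entry in the table, is what makes the sketch independent of entry order.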

stefan6419846 pushed a commit that referenced this issue Mar 18, 2024
Fixes #2523.

Situation met:

* Length field is not correct
* xref may contain unordered stream data
* xref contains some free entries (i.e. does not contain stream offset)
stefan6419846 pushed a commit that referenced this issue Mar 24, 2024
Fixes #2523

Situation met:
* length field is not correct
* xref may contain unordered stream data
* xref contains some free entries (i.e. does not contain stream offset)