Handling of missing newlines before `endstream` marker #2523

stefan6419846 · 2024-03-15T12:09:16Z

I just stumbled upon some odd PDF files (generated by Microsoft Word for Microsoft 365 in 2022) in which the newline before the `endstream` marker is missing.

Neither Ghostscript nor Poppler seems to accept such files, while pdf.js does. For this reason, I am not sure whether this is something we want to, or should, fix on our side.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.14.21-150400.24.100-default-x86_64-with-glibc2.31

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.1.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader


reader = PdfReader('file.pdf')
for page in reader.pages:
    print(page)
    for key in page.images.keys():
        print(key)
        print(page.images[key])

I have no public reproducer for this, but in theory it should be rather easy to reproduce with any crafted PDF file whose stream ends with a snippet like this:

Öìendstream
endobj
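
To illustrate why such a snippet can trip up a parser, here is a minimal sketch (not pypdf's actual code). It compares a strict search that requires an end-of-line before `endstream` with a tolerant one that does not. The byte values `\xd6\xec` are an assumption: they are the Latin-1 encoding of the `Öì` shown above, and `find_stream_end`, `STRICT` and `TOLERANT` are hypothetical names.

```python
import re
from typing import Optional, Pattern

# Hypothetical stream bodies: identical except for the newline before `endstream`.
WELL_FORMED = b"stream\n\xd6\xec\nendstream\nendobj\n"
MISSING_EOL = b"stream\n\xd6\xecendstream\nendobj\n"

# Strict: an end-of-line marker (CRLF, CR or LF) must precede the keyword.
STRICT = re.compile(rb"(?:\r\n|\r|\n)endstream")
# Tolerant: accept the keyword wherever it appears.
TOLERANT = re.compile(rb"endstream")


def find_stream_end(data: bytes, pattern: Pattern[bytes]) -> Optional[int]:
    """Return the offset where the match starts, or None if nothing matches."""
    match = pattern.search(data)
    return match.start() if match else None


print(find_stream_end(WELL_FORMED, STRICT))    # strict search works on a well-formed stream
print(find_stream_end(MISSING_EOL, STRICT))    # strict search finds nothing here
print(find_stream_end(MISSING_EOL, TOLERANT))  # tolerant search still finds the keyword
```

The tolerant variant has a cost: binary stream data may itself contain the bytes `endstream`, so a parser that uses it would probably want an additional sanity check.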

Traceback

This is the complete traceback I see (line numbers might be slightly off because of local debugging changes):

Traceback (most recent call last):
  File "/home/stefan/tmp/run.py", line 24, in <module>
    for key in page.images.keys():
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 2397, in keys
    return self.ids_function()
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 443, in _get_ids_image
    content = self._get_contents_as_bytes() or b""
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in _get_contents_as_bytes
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_page.py", line 854, in <genexpr>
    return b"".join(x.get_object().get_data() for x in obj)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_base.py", line 284, in get_object
    return self.pdf.get_object(self)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/_reader.py", line 1296, in get_object
    retval = read_object(self.stream, self)  # type: ignore
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 1194, in read_object
    return DictionaryObject.read_from_stream(stream, pdf, forced_encoding)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 499, in read_from_stream
    data["__streamdata__"] = read_unsized_from_stream(stream, pdf)
  File "/home/stefan/tmp/venv/lib/python3.9/site-packages/pypdf/generic/_data_structures.py", line 393, in read_unsized_from_stream
    raise PdfReadError(
pypdf.errors.PdfReadError: Unable to find 'endstream' marker for obj starting at 807.

pubpub-zz · 2024-03-15T16:53:41Z

@stefan6419846
Normally, the stream length should be provided in /Length. Can you check whether your document provides this information?

stefan6419846 · 2024-03-15T16:59:17Z

Yes, the xobject has a length:

4 0 obj
<</Length 7584/Filter/FlateDecode>>
stream
...
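
One way to check whether a declared /Length is plausible is to jump to the end of the declared data and see whether `endstream` follows. The following is a minimal sketch under that assumption; `length_is_plausible` is a hypothetical helper, not part of pypdf, and the sample object is made up for illustration.

```python
import re


def length_is_plausible(data: bytes, stream_start: int, length: int) -> bool:
    """Check whether `endstream` follows the declared stream data.

    `stream_start` is the offset of the first data byte after the
    `stream` keyword and its end-of-line marker.
    """
    end = stream_start + length
    if end > len(data):
        # The declared length points past the end of the file.
        return False
    # Allow an optional end-of-line marker before the keyword.
    return re.match(rb"[\r\n]*endstream", data[end:end + 16]) is not None


# Made-up object whose real stream data is 3 bytes long.
sample = b"4 0 obj\n<</Length 3>>\nstream\nabc\nendstream\nendobj\n"
start = sample.index(b"stream\n") + len(b"stream\n")

print(length_is_plausible(sample, start, 3))     # declared length matches the data
print(length_is_plausible(sample, start, 7584))  # declared length is far too large
```

If the check fails, a reader cannot trust /Length and has to locate the end of the stream some other way.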

Fixes py-pdf#2523. Situations met:
* the length field is not correct
* the xref table may contain stream data that is not in order
* the xref table contains some free entries (i.e. entries without a stream offset)

pubpub-zz · 2024-03-17T09:59:29Z

After analyzing the file: the length value is not valid, and the xref table contained data (some free entries, possibly out of order) that prevented the software from correctly calculating the window from which to extract the stream. I've built a PR to fix it.
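
The idea of computing such a window from the xref table can be sketched as follows. This is not the code from the PR, only an illustration under assumptions: the xref table is modelled as a mapping from object number to byte offset, free entries are represented as `None`, and `stream_window_end` is a hypothetical name. Apart from the offset 807 taken from the traceback, all values are made up.

```python
from bisect import bisect_right
from typing import Dict, Optional


def stream_window_end(
    xref: Dict[int, Optional[int]], obj_offset: int, file_size: int
) -> int:
    """Return the offset at which the search window for an object ends.

    The window ends at the next in-use object offset after `obj_offset`,
    or at the end of the file if there is none.
    """
    # Drop free entries (no offset) and sort, since the table may be unordered.
    offsets = sorted(off for off in xref.values() if off is not None)
    index = bisect_right(offsets, obj_offset)
    return offsets[index] if index < len(offsets) else file_size


# Unordered table with one free entry (object 2).
xref = {1: 900, 2: None, 3: 15, 4: 807, 5: 2000}

print(stream_window_end(xref, 807, 5000))   # window ends at the next object
print(stream_window_end(xref, 2000, 5000))  # last object: window ends at end of file
```

Sorting and filtering first means the result does not depend on the order of the entries or on the presence of free ones.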

stefan6419846 added the PdfReader (the PdfReader component is affected) and is-robustness-issue (from a user's perspective, this is about robustness) labels Mar 15, 2024

pubpub-zz mentioned this issue Mar 17, 2024

ROB: Robustify stream object extraction #2526

Merged

stefan6419846 closed this as completed in #2526 Mar 18, 2024

stefan6419846 pushed a commit that referenced this issue Mar 18, 2024

FIX: robustify stream extraction (#2526)

bbbc9dd

Fixes #2523. Situations met:
* the length field is not correct
* the xref table may contain stream data that is not in order
* the xref table contains some free entries (i.e. entries without a stream offset)
