Error "Stream has ended unexpectedly" on getDocumentInfo with certain pdf file(s) #1288

OzzieIsaacs · 2022-08-27T13:42:22Z

I'm the guy from here; I followed the call and am still having issues with an encrypted PDF. I'm trying to extract metadata from this file. The advantage over PyPDF3 is that with PyPDF2 the cover can be extracted from the problematic files without any problem.
The file can be opened in a "normal" PDF reader application, and at least some of the metadata is visible there.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-46-generic-x86_64-with-glibc2.35

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

PyCryptodome 3.15.0 is also installed.

Code + PDF

(Assuming the PDF mentioned below has been downloaded and renamed to encrypt.pdf)

from PyPDF2 import PdfFileReader
with open('encrypt.pdf', 'rb') as f:
    pdf_file = PdfFileReader(f)
    doc_info = pdf_file.getDocumentInfo()

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

https://cloud.3dissue.net/24308/24333/24567/65779/Position_4.21-211104-DE-web-20211203082446.pdf

I'm not the owner/creator of the PDF, so I recommend not using it for automated tests.

Traceback

This is the complete Traceback I see:

    doc_info = pdf_file.getDocumentInfo()
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_reader.py", line 339, in getDocumentInfo
    return self.metadata
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_reader.py", line 327, in metadata
    obj = self.trailer[TK.INFO]
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/generic/_data_structures.py", line 150, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/generic/_base.py", line 163, in get_object
    obj = self.pdf.get_object(self)
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1121, in get_object
    retval = self._get_object_from_stream(indirect_reference)  # type: ignore
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1072, in _get_object_from_stream
    objnum = NumberObject.read_from_stream(stream_data)
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/generic/_base.py", line 296, in read_from_stream
    num = read_until_regex(stream, NumberObject.NumberPattern)
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_utils.py", line 158, in read_until_regex
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly


pubpub-zz · 2022-09-01T10:59:11Z

@exiledkingcc / @MartinThoma
A first analysis of this issue shows that the ObjStm object (623, 0) returns the content b"" although its _data holds some data. The same is observed with object (624, 0), which should contain object (1378, 0); object 623 contains the first page.

The object's get_data() returns an empty string (without any error or warning???) because the decompression fails.
The PDF can be read successfully in Acrobat Reader / SumatraPDF / pdfminer.six.
I suspect an issue in the decryption; I have not worked on that part yet.

Your opinion/help would be welcome.
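
To illustrate why a decryption problem can surface as empty content rather than as an error, here is a minimal sketch using only the standard library. It is not PyPDF2's code: xor_cipher is a toy stand-in for the real PDF encryption, and lenient_flate_decode is a hypothetical decoder that swallows decompression errors, which is an assumption about the behavior described above.

```python
import zlib


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher (hypothetical stand-in for the real PDF encryption)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def lenient_flate_decode(data: bytes) -> bytes:
    """Hypothetical decoder that hides decompression failures by returning b""."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return b""


plain = b"object stream payload " * 4
# The stream is compressed first and encrypted afterwards.
stored = xor_cipher(zlib.compress(plain), b"right-key")

# Decrypting with the correct key restores a valid zlib stream.
good = lenient_flate_decode(xor_cipher(stored, b"right-key"))
# Decrypting with a wrong key yields garbage that zlib rejects.
bad = lenient_flate_decode(xor_cipher(stored, b"wrong-key"))

print(good == plain)
print(bad)
```

With the wrong key the stored bytes are still there, but the decoded content is empty, which matches the symptom of non-empty _data and an empty get_data() result.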

exiledkingcc · 2022-09-03T09:37:16Z

I have made a pull request for this.

The specification says: "To understand the algorithm below, it is necessary to treat the O and U strings in the Encrypt dictionary as made up of three sections. The first 32 bytes are a hash value (explained below). The next 8 bytes are called the Validation Salt. The final 8 bytes are called the Key Salt." So /U and /O should each be 48 bytes long, but in the PDF file that triggers #1288, /O is 127 bytes long. The redundant data is all zeros. Fixes #1288
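
The layout quoted from the specification can be sketched as follows. This is only an illustration of the 32 + 8 + 8 byte split and of dropping the redundant trailing data, not the code from the pull request; PasswordEntry and split_password_entry are hypothetical names, and the sample bytes are made up.

```python
from typing import NamedTuple

HASH_LEN = 32             # first 32 bytes: hash value
VALIDATION_SALT_LEN = 8   # next 8 bytes: Validation Salt
KEY_SALT_LEN = 8          # final 8 bytes: Key Salt
ENTRY_LEN = HASH_LEN + VALIDATION_SALT_LEN + KEY_SALT_LEN  # 48 bytes in total


class PasswordEntry(NamedTuple):
    hash_value: bytes
    validation_salt: bytes
    key_salt: bytes


def split_password_entry(raw: bytes) -> PasswordEntry:
    """Split an /O or /U string into its three sections.

    Anything beyond the first 48 bytes is treated as redundant and ignored.
    """
    if len(raw) < ENTRY_LEN:
        raise ValueError(f"expected at least {ENTRY_LEN} bytes, got {len(raw)}")
    data = raw[:ENTRY_LEN]
    salt_start = HASH_LEN
    key_salt_start = HASH_LEN + VALIDATION_SALT_LEN
    return PasswordEntry(
        hash_value=data[:salt_start],
        validation_salt=data[salt_start:key_salt_start],
        key_salt=data[key_salt_start:],
    )


# A made-up 127-byte value: 48 meaningful bytes followed by zero padding.
padded = bytes(range(48)) + b"\x00" * 79
entry = split_password_entry(padded)
print(len(entry.hash_value), len(entry.validation_salt), len(entry.key_salt))
```

Truncating before splitting keeps the Key Salt at bytes 40 to 47 even when the string carries extra zeros, instead of taking it from the end of the padded value.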

MartinThoma · 2022-09-03T12:10:41Z

The fix is in main and will be released to PyPI tomorrow at the latest.

Thank you everybody 🙏

exiledkingcc mentioned this issue Sep 3, 2022

fix if u_value contains redundant data #1317

Merged

MartinThoma closed this as completed in #1317 Sep 3, 2022
