Error "Stream has ended unexpectedly" on getDocumentInfo with certain pdf file(s) #1288

Closed
OzzieIsaacs opened this issue Aug 27, 2022 · 3 comments · Fixed by #1317

Comments

@OzzieIsaacs

OzzieIsaacs commented Aug 27, 2022

I'm the reporter from here; I followed the call and am still having issues with an encrypted PDF. I'm trying to extract metadata from this file. The advantage of PyPDF2 over PyPDF3 is that the cover can be extracted from the problematic files without problems.
The file can be opened in a "normal" PDF reader application, and at least some of the metadata is visible there.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.15.0-46-generic-x86_64-with-glibc2.35

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

PyCryptodome 3.15.0 is also installed.

Code + PDF

(with the PDF linked below downloaded and renamed to encrypt.pdf)

from PyPDF2 import PdfFileReader
with open('encrypt.pdf', 'rb') as f:
    pdf_file = PdfFileReader(f)
    doc_info = pdf_file.getDocumentInfo()

Share here the PDF file(s) that cause the issue. The smaller they are, the
better. Let us know if we may add them to our tests!

https://cloud.3dissue.net/24308/24333/24567/65779/Position_4.21-211104-DE-web-20211203082446.pdf

I'm not the owner/creator of the PDF, so I recommend not using it for automated tests.

Traceback

This is the complete Traceback I see:

    doc_info = pdf_file.getDocumentInfo()
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_reader.py", line 339, in getDocumentInfo
    return self.metadata
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_reader.py", line 327, in metadata
    obj = self.trailer[TK.INFO]
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/generic/_data_structures.py", line 150, in __getitem__
    return dict.__getitem__(self, key).get_object()
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/generic/_base.py", line 163, in get_object
    obj = self.pdf.get_object(self)
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1121, in get_object
    retval = self._get_object_from_stream(indirect_reference)  # type: ignore
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_reader.py", line 1072, in _get_object_from_stream
    objnum = NumberObject.read_from_stream(stream_data)
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/generic/_base.py", line 296, in read_from_stream
    num = read_until_regex(stream, NumberObject.NumberPattern)
  File "/home/ozzie/Development/calibre-web/venv/lib/python3.10/site-packages/PyPDF2/_utils.py", line 158, in read_until_regex
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)
PyPDF2.errors.PdfStreamError: Stream has ended unexpectedly

@pubpub-zz
Collaborator

@exiledkingcc / @MartinThoma
A first analysis of this issue shows that the ObjStm object (623, 0) returns the content b"" although its _data holds some data. The same is seen with object (624, 0), which should contain object (1378, 0); object 623 contains the first page.

The object's get_data() returns an empty string, without any error or warning (???), because the decompression fails.
The PDF can be read successfully in Acrobat Reader / SumatraPDF / pdfminer.six.
I suspect an issue in the decryption; I have not worked on that part yet.
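To illustrate why a decryption problem could surface as the empty result described above, here is a minimal sketch using only the standard-library zlib module. It is not PyPDF2's code: `decompress_lenient` is a hypothetical stand-in for any decoder that swallows decompression errors, and the XOR-scrambled bytes only simulate data decrypted with a wrong key.

```python
import zlib


def decompress_strict(data: bytes) -> bytes:
    """Decompress zlib data and let any failure propagate to the caller."""
    return zlib.decompress(data)


def decompress_lenient(data: bytes) -> bytes:
    """Hypothetical lenient decoder: hides the failure and returns b""."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return b""


# A valid compressed payload, as an object stream would hold after
# correct decryption.
good = zlib.compress(b"1378 0 ")

# Assumption for illustration: a wrong key yields bytes that are no longer
# a valid zlib stream. XOR scrambling stands in for that here.
bad = bytes(b ^ 0x5A for b in good)

print(decompress_strict(good))   # the original payload
print(decompress_lenient(bad))   # empty bytes, with no error raised
```

With the strict variant, the scrambled input raises `zlib.error` immediately, which would point at the decryption step instead of failing later while parsing an empty stream.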

Your opinion/help will be welcomed.

@exiledkingcc
Contributor

I have made a pull request for this.

MartinThoma pushed a commit that referenced this issue Sep 3, 2022
The specification says:

To understand the algorithm below, it is necessary to treat the O and U strings in the Encrypt dictionary
as made up of three sections. The first 32 bytes are a hash value (explained below). The next 8 bytes are
called the Validation Salt. The final 8 bytes are called the Key Salt.

So /U and /O should each be 48 bytes long, but in the PDF file that triggers #1288, /O is 127 bytes long. The extra bytes are all zeros.
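
The layout quoted from the specification can be sketched as follows. This is an illustration under stated assumptions, not the code from the pull request: `PasswordEntry`, `split_password_entry`, and `ENTRY_LENGTH` are hypothetical names, and the sample value is synthetic data padded to 127 bytes to mimic the oversized /O entry.

```python
from typing import NamedTuple

# 32-byte hash + 8-byte validation salt + 8-byte key salt, per the quoted text
ENTRY_LENGTH = 48


class PasswordEntry(NamedTuple):
    hash_value: bytes
    validation_salt: bytes
    key_salt: bytes


def split_password_entry(raw: bytes) -> PasswordEntry:
    """Split an /O or /U string into its three sections.

    Anything past the first 48 bytes is ignored, so an entry padded
    with trailing zeros is handled like a well-formed one.
    """
    if len(raw) < ENTRY_LENGTH:
        raise ValueError(f"expected at least {ENTRY_LENGTH} bytes, got {len(raw)}")
    entry = raw[:ENTRY_LENGTH]
    return PasswordEntry(
        hash_value=entry[:32],
        validation_salt=entry[32:40],
        key_salt=entry[40:48],
    )


# Synthetic example: 48 meaningful bytes followed by 79 zero bytes (127 total).
padded_o = bytes(range(48)) + b"\x00" * 79
entry = split_password_entry(padded_o)
print(len(entry.hash_value), len(entry.validation_salt), len(entry.key_salt))
```

Truncating before slicing keeps the three sections at their specified offsets regardless of how much padding a PDF producer appended.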

Fixes #1288
@MartinThoma
Member

The fix is in main and will be released to PyPI tomorrow at the latest.

Thank you everybody 🙏
