Infinite loop while reading metadata #1329

Closed
MartinThoma opened this issue Sep 7, 2022 · 6 comments · Fixed by #1331
Labels
  • Has MCVE: A minimal, complete and verifiable example helps a lot to debug / understand feature requests
  • help wanted: We appreciate help everywhere - this one might be an easy start!
  • is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
  • nf-performance: Non-functional change: Performance
  • nf-security: Non-functional change: Security

Comments

@MartinThoma
Member

MartinThoma commented Sep 7, 2022

When I try to read the metadata of Effective Java 3rd Edition by Joshua Bloch.pdf, it takes an extremely long time. It might even be an infinite loop.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-125-generic-x86_64-with-debian-bullseye-sid

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.5

Code + PDF

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfReader

reader = PdfReader("Effective Java 3rd Edition by Joshua Bloch.pdf")
metadata = reader.metadata

The PDF: Effective Java 3rd Edition by Joshua Bloch.pdf
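Because the call may never return on the affected version, one way to reproduce this safely is to run the snippet in a child process with a timeout. The following is only a sketch: `run_snippet_with_timeout` is a hypothetical helper (not part of PyPDF2), and any timeout value is an arbitrary choice.

```python
import subprocess
import sys


def run_snippet_with_timeout(code: str, timeout_s: float) -> str:
    """Run `code` in a fresh interpreter and report how it ended.

    Hypothetical helper for reproducing hangs; returns "ok", "error",
    or "timeout".
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising TimeoutExpired,
        # so no runaway process is left behind.
        return "timeout"
    return "ok" if result.returncode == 0 else "error"
```

Passing the two-line example above as `code` (joined with `;` or a newline) with, say, `timeout_s=10` would be expected to yield "timeout" on PyPDF2==2.10.5, going by the description in this issue.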

Affected

  • NOT affected: PyPDF2<=2.10.4 (throws an exception)
  • NOT affected: PyPDF2>=2.10.6 (reads data properly)
  • IS AFFECTED: PyPDF2==2.10.5
  • See GHSA-hm9v-vj3r-r55m
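The version list above can be turned into a small check, for example to warn users who have the affected release pinned. This is a minimal sketch that assumes plain `X.Y.Z` version strings without pre-release suffixes; `parse_version` and `expected_behavior` are hypothetical names, and the returned labels simply restate the list.

```python
from typing import Tuple

# The only release the list above marks as affected.
AFFECTED = (2, 10, 5)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain 'X.Y.Z' string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def expected_behavior(version: str) -> str:
    """Map a PyPDF2 version to the behavior reported in this issue."""
    parsed = parse_version(version)
    if parsed < AFFECTED:
        return "throws an exception"
    if parsed == AFFECTED:
        return "hangs"
    return "reads data properly"
```

Comparing tuples of ints avoids the pitfall of comparing version strings lexically, where "2.9.0" would sort after "2.10.5".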
@MartinThoma
Member Author

Interestingly, extracting the text works just fine:

import time

# `reader` is the PdfReader created in the example above
text = ""
for i, page in enumerate(reader.pages, start=1):
    t0 = time.time()
    text += page.extract_text()
    t1 = time.time()
    print(f"Reading page {i}: {t1-t0:.2f} seconds")

@MartinThoma
Member Author

It's the line obj = self.trailer[TK.INFO] that takes so long
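One way to locate a line like this when a call appears to hang is the standard-library `faulthandler` module, which can dump the current traceback after a delay without stopping the program. The sketch below is an assumption about how one might do that, not how the line was actually found; `call_with_traceback_dump` is a hypothetical helper.

```python
import faulthandler
import sys
from typing import Any, Callable, Optional


def call_with_traceback_dump(
    func: Callable[[], Any],
    after_s: float,
    file: Optional[Any] = None,
) -> Any:
    """Call func(); if it is still running after `after_s` seconds,
    write the tracebacks of all threads to `file` and keep running.

    `file` must be a real file with a file descriptor (default: stderr).
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(after_s, exit=False, file=target)
    try:
        return func()
    finally:
        # Disarm the watchdog if func() finished (or raised) in time.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping `lambda: reader.metadata` this way would print the stack the reader is stuck in once the delay has passed; interrupting the hung process afterwards is still up to the caller.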

@MartinThoma
Member Author

MartinThoma commented Sep 7, 2022

generic._data_structures.read_object seems to be the issue:

def read_object(
    stream: StreamType,
    pdf: Any,  # PdfReader
    forced_encoding: Union[None, str, List[str], Dict[int, str]] = None,
) -> Union[PdfObject, int, str, ContentStream]:
    tok = stream.read(1)
    stream.seek(-1, 1)  # reset to start
    idx = ObjectPrefix.find(tok)
    if ...:  # branches for the other object prefixes are elided in this excerpt
        ...
    elif idx == 7:
        # comment
        while tok not in (b"\r", b"\n"):
            tok = stream.read(1)
            # Prevents an infinite loop by raising an error if the stream is at
            # the EOF
            if len(tok) <= 0:
                raise PdfStreamError("File ended unexpectedly.")
        tok = read_non_whitespace(stream)
        stream.seek(-1, 1)
        return read_object(stream, pdf, forced_encoding)
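The EOF guard in the comment branch matters because `stream.read(1)` returns `b""` at the end of the stream, and `b""` is never equal to `b"\r"` or `b"\n"`. A stripped-down sketch of that loop, runnable on its own, is shown here. `skip_comment` and `StreamEndedError` are hypothetical stand-ins for illustration and are not PyPDF2 names.

```python
import io


class StreamEndedError(Exception):
    """Stand-in for PdfStreamError so the sketch has no dependencies."""


def skip_comment(stream: io.BytesIO) -> None:
    """Advance `stream` past a '%' comment, including its line ending."""
    tok = stream.read(1)
    while tok not in (b"\r", b"\n"):
        tok = stream.read(1)
        # At EOF, read() keeps returning b"", which never matches a line
        # ending; without this check the loop would spin forever.
        if len(tok) == 0:
            raise StreamEndedError("File ended unexpectedly.")


stream = io.BytesIO(b"% a comment\n42")
skip_comment(stream)
remaining = stream.read()  # what is left after the comment line
```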

@MartinThoma MartinThoma changed the title Extremely long time for reading metadata (potential infinite loop) Infinite Loop while reading metadata Sep 7, 2022
@MartinThoma MartinThoma changed the title Infinite Loop while reading metadata Infinite loop while reading metadata Sep 7, 2022
@MartinThoma MartinThoma added help wanted We appreciate help everywhere - this one might be an easy start! nf-performance Non-functional change: Performance nf-security Non-functional change: Security Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF labels Sep 7, 2022
@MartinThoma
Member Author

@pubpub-zz In case you have some time, I would say this is a pretty important issue.

@pubpub-zz
Collaborator

pubpub-zz commented Sep 7, 2022

Got it!
The file is corrupt within /Info, which leads to the infinite loop. The fix copes with this situation.
It also fixes a warning on cmap encoding.

Some other possible issues were detected, and fixes for them are included.

My concern is about the test: the test PDF file does not respect the licensing, so I would rather not refer to it for testing.

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 7, 2022
fixes py-pdf#1329
* prevent loop within dictionaries where objects do not respect the standard
* fix cmap warnings due to "numbered" characters ( #2d instead of -)
* apply unnumbering to NameObject
* add _get_indirect_object for debug/dev purposes
* add some missing seeks (no issue reported yet)
MartinThoma pushed a commit that referenced this issue Sep 9, 2022
Fixes #1329

* Prevent loop within dictionaries caused by objects not respecting the PDF standard
* Fix cmap warnings due to "numbered" characters ( #2d instead of -)
* Apply unnumbering to NameObject
* Add _get_indirect_object for debugging and development
* Add some missing seeks (no issue reported yet)
@MartinThoma
Member Author

The fix was in PyPDF2/generic/_data_structures.py:

The explicit elif tok in b"0123456789+-.": branch and the final

    else:
        raise PdfReadError("...")

are what broke the infinite loop.
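The general pattern is that a parser should accept only the start bytes it knows how to handle and raise on everything else, so that a corrupt byte cannot leave it retrying at the same offset. The toy tokenizer below illustrates that idea under simplifying assumptions: it only understands numbers and whitespace, and `read_number`, `read_all_tokens` and `ParseError` are hypothetical names rather than PyPDF2 code.

```python
import io
from typing import List

NUMBER_START = b"0123456789+-."


class ParseError(Exception):
    """Stand-in for PdfReadError so the sketch has no dependencies."""


def read_number(stream: io.BytesIO) -> bytes:
    """Read one number token, or raise if the next byte cannot start one."""
    tok = stream.read(1)
    # The emptiness check comes first: b"" in <bytes> is always True.
    if tok != b"" and tok in NUMBER_START:
        digits = tok
        nxt = stream.read(1)
        while nxt != b"" and nxt in b"0123456789.":
            digits += nxt
            nxt = stream.read(1)
        if nxt != b"":
            stream.seek(-1, 1)  # un-read the byte that ended the number
        return digits
    else:
        # Explicit fallback: fail loudly instead of leaving the stream
        # at a byte that no branch will ever consume.
        raise ParseError(f"Invalid token {tok!r}")


def read_all_tokens(data: bytes) -> List[bytes]:
    """Split `data` into number tokens, skipping whitespace."""
    stream = io.BytesIO(data)
    tokens = []
    while True:
        tok = stream.read(1)
        if tok == b"":
            return tokens
        if tok.isspace():
            continue
        stream.seek(-1, 1)
        tokens.append(read_number(stream))
```

With the fallback in place, input such as `b"12 @ 3"` fails with `ParseError` at the `@` instead of stalling.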

@py-pdf py-pdf deleted a comment from pubpub-zz Jun 30, 2023