Infinite loop while reading metadata #1329

MartinThoma · 2022-09-07T16:40:05Z

When I try to read the metadata of Effective Java 3rd Edition by Joshua Bloch.pdf, it takes an extremely long time. It might even be an infinite loop.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-125-generic-x86_64-with-debian-bullseye-sid

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.5

Code + PDF

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfReader

reader = PdfReader("Effective Java 3rd Edition by Joshua Bloch.pdf")
metadata = reader.metadata

The PDF: Effective Java 3rd Edition by Joshua Bloch.pdf
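When reproducing a suspected hang like this one, it can help to bound the call so the script itself does not hang. The following is only a sketch: `finishes_within` is a hypothetical helper, not part of PyPDF2, and it uses a daemon thread, so a truly endless call keeps running in the background until the process exits.

```python
import threading
import time


def finishes_within(func, timeout_s):
    """Run func() in a daemon thread and return (finished, result).

    Hypothetical helper for illustration. If func() raises, the thread
    ends without storing a result and (True, None) is returned.
    """
    outcome = {}

    def worker():
        outcome["result"] = func()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    # If the thread is still alive, func() did not return in time.
    if thread.is_alive():
        return False, None
    return True, outcome.get("result")


# Stand-ins for a fast call and a call that blocks for a while.
print(finishes_within(lambda: "ok", 1.0))
print(finishes_within(lambda: time.sleep(0.5), 0.05))
```

With PyPDF2 installed, the example above could be wrapped as `finishes_within(lambda: PdfReader(path).metadata, 10)`, where `path` points to the PDF.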

Affected

NOT AFFECTED: PyPDF2<=2.10.4 (throws an exception)
NOT AFFECTED: PyPDF2>=2.10.6 (reads data properly)
IS AFFECTED: PyPDF2==2.10.5
See GHSA-hm9v-vj3r-r55m
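A quick way to check an installed version against the list above might look like this. It is a minimal sketch that assumes plain `X.Y.Z` version strings; `parse_version` and `is_affected` are hypothetical helper names, not PyPDF2 functions.

```python
def parse_version(version: str) -> tuple:
    """Turn '2.10.5' into (2, 10, 5). Suffixes such as 'rc1' are not handled."""
    return tuple(int(part) for part in version.split("."))


# Assumption taken from the "Affected" list: only 2.10.5 hangs.
AFFECTED = (2, 10, 5)


def is_affected(version: str) -> bool:
    """Return True if the given version string is the affected release."""
    return parse_version(version) == AFFECTED


for v in ("2.10.4", "2.10.5", "2.10.6"):
    print(v, is_affected(v))
```

In practice one would pass `PyPDF2.__version__` to `is_affected`.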

MartinThoma · 2022-09-07T16:40:57Z

Interestingly, extracting the text works just fine:

import time

# `reader` is the PdfReader from the example above
text = ""
for i, page in enumerate(reader.pages, start=1):
    t0 = time.time()
    text += page.extract_text()
    t1 = time.time()
    print(f"Reading page {i}: {t1-t0:.2f} seconds")

MartinThoma · 2022-09-07T16:43:25Z

It's the line obj = self.trailer[TK.INFO] that takes so long.

MartinThoma · 2022-09-07T16:49:11Z

generic._data_structures.read_object seems to be the issue:

def read_object(
    stream: StreamType,
    pdf: Any,  # PdfReader
    forced_encoding: Union[None, str, List[str], Dict[int, str]] = None,
) -> Union[PdfObject, int, str, ContentStream]:
    tok = stream.read(1)
    stream.seek(-1, 1)  # reset to start
    idx = ObjectPrefix.find(tok)
    if ....
    elif idx == 7:
        # comment
        while tok not in (b"\r", b"\n"):
            tok = stream.read(1)
            # Prevents an infinite loop by raising an error if the stream is at
            # the EOF
            if len(tok) <= 0:
                raise PdfStreamError("File ended unexpectedly.")
        tok = read_non_whitespace(stream)
        stream.seek(-1, 1)
        return read_object(stream, pdf, forced_encoding)
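The comment branch quoted above relies on an EOF check to terminate: `stream.read(1)` returns `b""` at the end of the stream, and `b""` is never equal to a line ending. The pattern can be tried in isolation with the sketch below. It is a simplified illustration, not the PyPDF2 code; `skip_comment` and `StreamEndedError` are hypothetical names.

```python
import io


class StreamEndedError(Exception):
    """Hypothetical stand-in for PdfStreamError."""


def skip_comment(stream):
    """Consume bytes up to and including the next CR or LF."""
    tok = stream.read(1)
    while tok not in (b"\r", b"\n"):
        # read(1) returns b"" at EOF; without this check the loop never ends.
        if len(tok) == 0:
            raise StreamEndedError("File ended unexpectedly.")
        tok = stream.read(1)


stream = io.BytesIO(b"% a comment\n42")
skip_comment(stream)
print(stream.read())  # the bytes that follow the comment line

try:
    skip_comment(io.BytesIO(b"% comment without a line ending"))
except StreamEndedError as exc:
    print("stopped:", exc)
```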

MartinThoma · 2022-09-07T16:52:10Z

@pubpub-zz In case you have some time, I would say this is a pretty important issue.

pubpub-zz · 2022-09-07T21:58:57Z

Got it!
The file is corrupt within /Info, which leads to the infinite loop. The fix copes with this situation.
It also fixes a warning on cmap encoding.

Some other possible issues were detected and addressed.

My concern is about the test: the test PDF file does not respect licensing, so I would rather not reference it for testing.

Fixes #1329
* Prevent loop within dictionaries caused by objects not respecting the PDF standard
* Fix cmap warnings due to "numbered" characters (#2d instead of -)
* Apply unnumbering to NameObject
* Add _get_indirect_object for debugging and development
* Add some missing seeks (no issue reported yet)

MartinThoma · 2023-06-30T05:20:43Z

The fix was in PyPDF2/generic/_data_structures.py:

The elif tok in b"0123456789+-.": branch and the final

    else:
        raise PdfReadError("...")

branch broke the infinite loop.
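The general idea, as far as this thread describes it, is that a reader should handle the prefixes it knows explicitly and raise on anything else instead of retrying on the same byte. The toy dispatcher below illustrates that idea only. It is not the code from #1331, and `read_toy_object`, `ParseError`, and the two-type grammar are assumptions made for the example.

```python
import io

NUMBER_START = b"0123456789+-."
DELIMITERS = b" \t\r\n/"


class ParseError(Exception):
    """Hypothetical stand-in for PdfReadError."""


def _read_until_delimiter(stream):
    """Collect bytes until a delimiter or EOF; leave the delimiter unread."""
    out = b""
    while True:
        ch = stream.read(1)
        if ch == b"":
            return out
        if ch in DELIMITERS:
            stream.seek(-1, 1)  # step back so the next call sees the delimiter
            return out
        out += ch


def read_toy_object(stream):
    """Read one toy object: a number or a /Name. Reject everything else."""
    tok = stream.read(1)
    if tok == b"":
        raise ParseError("File ended unexpectedly.")
    if tok == b"/":
        return ("name", tok + _read_until_delimiter(stream))
    elif tok in NUMBER_START:
        return ("number", tok + _read_until_delimiter(stream))
    else:
        # Unknown prefix: fail loudly rather than looping on the same byte.
        raise ParseError(f"Invalid token {tok!r} at offset {stream.tell() - 1}")


print(read_toy_object(io.BytesIO(b"12.5 /X")))
print(read_toy_object(io.BytesIO(b"/Info 1 0 R")))
try:
    read_toy_object(io.BytesIO(b"?garbage"))
except ParseError as exc:
    print("rejected:", exc)
```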

MartinThoma mentioned this issue Sep 7, 2022

ValueError: invalid literal for int() with base 10: b'' #1270

Closed

MartinThoma changed the title ~~Extremely long time for reading metadata (potential infinite loop)~~ Infinite Loop while reading metadata Sep 7, 2022

MartinThoma changed the title ~~Infinite Loop while reading metadata~~ Infinite loop while reading metadata Sep 7, 2022

pubpub-zz mentioned this issue Sep 7, 2022

ROB: Fix infinite loop due to Invalid object #1331

Merged

MartinThoma closed this as completed in #1331 Sep 9, 2022

py-pdf deleted a comment from pubpub-zz Jun 30, 2023
