Infinite loop while reading metadata #1329
Comments
Interestingly, extracting the text works just fine:

    import time

    # `reader` is an already-constructed PdfReader for the PDF in question
    text = ""
    for i, page in enumerate(reader.pages, start=1):
        t0 = time.time()
        text += page.extract_text()
        t1 = time.time()
        print(f"Reading page {i}: {t1-t0:.2f} seconds")
It's the line

    def read_object(
        stream: StreamType,
        pdf: Any,  # PdfReader
        forced_encoding: Union[None, str, List[str], Dict[int, str]] = None,
    ) -> Union[PdfObject, int, str, ContentStream]:
        tok = stream.read(1)
        stream.seek(-1, 1)  # reset to start
        idx = ObjectPrefix.find(tok)
        if ....
        elif idx == 7:
            # comment
            while tok not in (b"\r", b"\n"):
                tok = stream.read(1)
                # Prevents an infinite loop by raising an error if the stream is at
                # the EOF
                if len(tok) <= 0:
                    raise PdfStreamError("File ended unexpectedly.")
            tok = read_non_whitespace(stream)
            stream.seek(-1, 1)
            return read_object(stream, pdf, forced_encoding)
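To see why the EOF check in the comment branch matters, here is a minimal sketch of the same loop in isolation. It is not PyPDF2's code: `skip_comment` and `StreamEndedError` are hypothetical names made up for this illustration, and the stream is an in-memory `io.BytesIO` rather than a real PDF.

```python
import io
from typing import BinaryIO


class StreamEndedError(Exception):
    """Hypothetical stand-in for PyPDF2's PdfStreamError."""


def skip_comment(stream: BinaryIO) -> None:
    """Advance `stream` past a '%' comment, including its line ending."""
    tok = stream.read(1)
    while tok not in (b"\r", b"\n"):
        tok = stream.read(1)
        # At EOF, read(1) keeps returning b"". Since b"" is never a line
        # ending, the loop would spin forever without this check.
        if len(tok) == 0:
            raise StreamEndedError("File ended unexpectedly.")


# A comment that ends with a newline is skipped normally.
terminated = io.BytesIO(b"% a comment\n42")
skip_comment(terminated)
remaining = terminated.read()

# A comment that runs into EOF raises instead of looping.
unterminated = io.BytesIO(b"% no line ending")
try:
    skip_comment(unterminated)
    hit_eof = False
except StreamEndedError:
    hit_eof = True
```

Any read loop that waits for a specific byte has the same failure mode at EOF, so a guard like this is worth having wherever a parser scans for a delimiter.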
@pubpub-zz In case you have some time, I would say this is a pretty important issue.
Got it! I detected some other possible issues along the way and included fixes for them. My concern is about the test: the test PDF file does not respect licensing, so I would rather not reference it for testing.
Fixes #1329
* Prevent loop within dictionaries caused by objects not respecting the PDF standard
* Fix cmap warnings due to "numbered" characters (#2d instead of -)
* Apply unnumbering to NameObject
* Add _get_indirect_object for debugging and development
* Add some missing seeks (no issue reported yet)
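For context on the "numbered" characters: in PDF name syntax, a `#` followed by two hexadecimal digits encodes a single byte, so `#2d` stands for `-`. The sketch below shows one way such escapes could be decoded. The function name `unnumber_name` is an assumption made for this illustration, and the code is not taken from the PyPDF2 fix.

```python
import re

# Matches '#' followed by exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def unnumber_name(raw: bytes) -> bytes:
    """Replace each #xx escape in a PDF name with the byte it encodes."""
    return _HEX_ESCAPE.sub(
        lambda m: bytes([int(m.group(1).decode("ascii"), 16)]),
        raw,
    )


decoded = unnumber_name(b"/Times#2dRoman")  # b"/Times-Roman"
```

A `#` that is not followed by two hex digits is left untouched here; a stricter parser might prefer to treat that as malformed input.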
The fix was in The
broke the infinite loop
When I try to read the metadata of Effective Java 3rd Edition by Joshua Bloch.pdf, it takes an extremely long time. It might even be an infinite loop.
Environment
Which environment were you using when you encountered the problem?
    $ python -m platform
    Linux-5.4.0-125-generic-x86_64-with-debian-bullseye-sid
    $ python -c "import PyPDF2;print(PyPDF2.__version__)"
    2.10.5
Code + PDF
This is a minimal, complete example that shows the issue:
The PDF: Effective Java 3rd Edition by Joshua Bloch.pdf
Affected