Missing root object raising: 'NoneType' object has no attribute 'get_object' (different from #1295 & #1689) #2806

Closed
BertrandBordage opened this issue Aug 21, 2024 · 1 comment · Fixed by #2808
Labels
is-uncaught-exception: Use this label only for issues caused by broken PDF documents that cannot be recovered.
PdfReader: The PdfReader component is affected.

Comments

@BertrandBordage
Contributor

As I was processing client PDFs with pypdf, one of them triggered a cryptic error (traceback below).
Ideally, pypdf should raise a PdfReadError (or another subclass of PyPdfError) if that file is really impossible to parse.

Environment

$ python -m platform
Linux-5.15.0-118-generic-x86_64-with-glibc2.35

$ python --version
Python 3.11.9

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=9.5.0

Code

This is a minimal code example that shows the issue:

from pypdf import PdfReader
reader = PdfReader('client file (might be broken).pdf')
list(reader.pages)

I cannot share the PDF as it might contain sensitive client data.

Two messages/warnings are displayed before the traceback, though:
[screenshot of the two messages, not reproduced in this text]

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python3.11/site-packages/pypdf/_page.py", line 2227, in __len__
    return self.length_function()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/pypdf/_doc_common.py", line 353, in get_num_pages
    self._flatten()
  File "/lib/python3.11/site-packages/pypdf/_doc_common.py", line 1101, in _flatten
    catalog = self.root_object
              ^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/pypdf/_reader.py", line 191, in root_object
    return cast(DictionaryObject, self.trailer[TK.ROOT].get_object())
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get_object'
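
Until pypdf reports this case with its own exception type, a caller could translate the AttributeError shown in the traceback into an application-level error. The sketch below is only an illustration: `BrokenPdfError` and `count_pages` are hypothetical names, and `open_reader` is assumed to be any callable that returns an object with a `.pages` sequence, such as `pypdf.PdfReader`. Catching AttributeError this broadly can also hide unrelated bugs, so it is best kept as a temporary workaround.

```python
from typing import Any, Callable


class BrokenPdfError(Exception):
    """Hypothetical caller-side error for files that cannot be parsed."""


def count_pages(path: str, open_reader: Callable[[str], Any]) -> int:
    """Return the page count, or raise BrokenPdfError for an unreadable file.

    open_reader is any callable returning an object with a .pages sequence,
    for example pypdf.PdfReader (an assumption of this sketch).
    """
    try:
        reader = open_reader(path)
        # len(reader.pages) is the call that fails in the traceback above.
        return len(reader.pages)
    except AttributeError as exc:
        # pypdf 4.3.1 surfaces the missing root object as AttributeError;
        # re-raise it as a domain error and keep the original as the cause.
        raise BrokenPdfError(f"Could not read {path!r}: {exc}") from exc
```

With pypdf installed, this would be called as `count_pages('client file (might be broken).pdf', PdfReader)`.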

Additional info

The PDF might be corrupted, as I am unable to open it with Evince, which shows this error: Failed to read the document catalog.
xpdf is also showing various errors when reading the file:

Syntax Error: Couldn't find trailer dictionary
Syntax Error: Invalid XRef entry 493
Internal Error: xref num 493 not found but needed, try to reconstruct<0a>
Syntax Error: Invalid XRef entry 493
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Catalog object is wrong type (null)
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Invalid XRef entry 493
Internal Error: xref num 493 not found but needed, try to reconstruct<0a>
Syntax Error: Invalid XRef entry 493
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Catalog object is wrong type (null)
Syntax Error: Couldn't read page catalog
@stefan6419846
Collaborator

Thanks for the report. Judging from the stack trace and the third-party logs, this PDF file simply appears to be broken, as the (essential) root object apparently cannot be found.

Feel free to open a PR to convert this into an appropriate exception.
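
One way such a conversion could look is a guard that checks the trailer's root entry before calling `get_object()` on it. The following is a minimal sketch and not pypdf's code: `PdfReadError` and `IndirectRef` are local stand-ins so the example runs without pypdf, `resolve_root` is a hypothetical helper, and the error message is made up for illustration.

```python
from typing import Any, Dict, Optional


class PdfReadError(Exception):
    """Local stand-in for pypdf's PdfReadError, so the sketch runs on its own."""


class IndirectRef:
    """Hypothetical stand-in for a reference that resolves to an object."""

    def __init__(self, target: Dict[str, Any]) -> None:
        self._target = target

    def get_object(self) -> Dict[str, Any]:
        return self._target


def resolve_root(trailer: Dict[str, Optional[IndirectRef]]) -> Dict[str, Any]:
    """Resolve the document catalog, failing clearly if it is missing."""
    root = trailer.get("/Root")
    if root is None:
        # Covers both a missing /Root key and a /Root entry that is None,
        # which is the situation behind the AttributeError in this issue.
        raise PdfReadError("Cannot find root object in the PDF trailer")
    return root.get_object()
```

The point of the guard is that a broken file then fails with an error callers already expect to catch, rather than with an AttributeError from an internal attribute lookup.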

@stefan6419846 changed the title from "AttributeError: 'NoneType' object has no attribute 'get_object' (different from #1295 & #1689)" to "Missing root object raising: 'NoneType' object has no attribute 'get_object' (different from #1295 & #1689)" on Aug 22, 2024
@stefan6419846 added the PdfReader and is-uncaught-exception labels on Aug 22, 2024