TypeError: 'NoneType' object is not iterable #1279

Closed
DL6ER opened this issue Aug 26, 2022 · 10 comments · Fixed by #1298 or #1297

Comments

@DL6ER

DL6ER commented Aug 26, 2022

See #1269 for further details.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

import PyPDF2
with open("pca_var.pdf", "rb") as f:
  pdfreader = PyPDF2.PdfFileReader(f, strict=False)
  metadata = pdfreader.metadata.copy() if pdfreader.metadata is not None else {}

PDF used above: pca_var.pdf

Traceback

This is the complete Traceback I see:

Object 26 0 not defined.

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    metadata = pdfreader.metadata.copy() if pdfreader.metadata is not None else {}
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 329, in metadata
    retval.update(obj)  # type: ignore
TypeError: 'NoneType' object is not iterable
@pubpub-zz
Collaborator

pubpub-zz commented Aug 26, 2022

The file you've referenced is not readable. When a failure occurs, please make sure you can open the file and check its information with Acrobat Reader, to confirm whether the problem lies in the file or in PyPDF2.

@DL6ER
Author

DL6ER commented Aug 26, 2022

I plan to use PyPDF2 to analyze a large number (> 100,000) of PDF files automatically. I am reporting this because PyPDF2 does not catch the problem and raise a PyPDF2.errors.PdfReadError; instead, it fails internally. If this is the expected behavior for invalid PDF files, I can wrap the entire call in a try/except block that ignores the error. However, with that approach I may miss PDF files that are only slightly broken and from which the text could still be extracted.

Can you recommend any automated method for checking a PDF for validity before loading it into PyPDF2 if this is a requirement?
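One cheap pre-check, sketched here only as an illustration of the idea in the question above: look for the structural markers every PDF is expected to carry (a `%PDF-` header near the start, and `startxref` plus `%%EOF` near the end). The function name `quick_pdf_check` and the 1024-byte search window are assumptions made for this sketch, not something PyPDF2 provides. Passing this check does not mean PyPDF2 can parse the file, and failing it does not mean the file is unrecoverable.

```python
def quick_pdf_check(data: bytes, window: int = 1024) -> list:
    """Return a list of structural markers that appear to be missing.

    This is a rough heuristic, not a validator: it only looks for
    well-known keywords close to the start and end of the file.
    """
    problems = []
    # The header is expected at (or very near) the start of the file.
    if b"%PDF-" not in data[:window]:
        problems.append("no %PDF- header near the start")
    # The cross-reference offset and end-of-file marker sit at the end.
    tail = data[-window:]
    if b"startxref" not in tail:
        problems.append("no startxref keyword near the end")
    if b"%%EOF" not in tail:
        problems.append("no %%EOF marker near the end")
    return problems
```

Files flagged by a check like this could be routed to a separate review queue rather than skipped outright, since a lenient parser may still recover content from them.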

@pubpub-zz
Collaborator

@DL6ER
In any case, the try/except you are proposing will be required, as you may always hit an exception that is not handled. Instead of ignoring such exceptions, you should build a list of files with issues, preferably capturing the stack trace where each exception occurs, and then review them manually. At the least, you should have a look at #1143.
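A minimal sketch of the triage loop described above, assuming the reader is passed in as a callable so the loop itself has no dependency on PyPDF2. The names `triage` and `read_metadata_pypdf2` are hypothetical; the second function mirrors the snippet from the report and assumes a PyPDF2 2.x install.

```python
import traceback


def triage(paths, read_metadata):
    """Call read_metadata(path) for each path.

    Returns (ok, failed): ok maps path -> result, failed maps
    path -> formatted stack trace for later manual review.
    """
    ok, failed = {}, {}
    for path in paths:
        try:
            ok[path] = read_metadata(path)
        except Exception:
            # Deliberately broad: any internal failure is recorded
            # together with its full traceback instead of being dropped.
            failed[path] = traceback.format_exc()
    return ok, failed


def read_metadata_pypdf2(path):
    """Assumed reader based on the code in the report (PyPDF2 2.x)."""
    import PyPDF2  # imported lazily so triage() works without it

    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f, strict=False)
        metadata = reader.metadata
        return dict(metadata) if metadata is not None else {}
```

Calling `triage(list_of_paths, read_metadata_pypdf2)` would then give one dictionary of extracted metadata and one of stack traces, which can be grouped by their last line to see which failure modes are most common.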

@DL6ER
Author

DL6ER commented Aug 27, 2022

@pubpub-zz I already have such a try/except block in place and store PDF files with issues for manual review in a separate place. I could easily open yet another bunch of issues with assertion errors and some more ValueErrors and AttributeError: 'NoneType' object has no attribute 'get_object'.

My question is: Do you know what the goal is for PyPDF2?
Am I meant to maintain a manually compiled list of exceptions that I simply accept as being caused by broken input files? Or should PyPDF2 rather raise a PyPDF2.errors.PdfReadError for all of these, a single exception that I can straightforwardly ignore on the grounds that "this is a broken file"? (I guess the latter.)
I'd have no issues with reporting newly found issues every now and then to this repo. I see my project - streaming hundreds of thousands of PDF files found in random places - as a good way to make PyPDF2 more robust - if this is wanted.

I'll definitely open another issue ticket in a moment as I found a small PDF (19 pages, < 1MB) which does not throw an exception but rather keeps PyPDF2 spinning forever at 100% CPU.
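For the "spinning forever" case, a try/except alone does not help, because no exception is ever raised. One possible guard, sketched here under assumptions, is to process each file in a child interpreter with a time limit. The helper name `run_isolated`, the `WORKER` snippet, and the 30-second default are all hypothetical; the worker assumes PyPDF2 2.x is installed in the same environment.

```python
import subprocess
import sys

# Hypothetical worker: opens one PDF and touches its metadata.
WORKER = (
    "import sys, PyPDF2\n"
    "reader = PyPDF2.PdfReader(sys.argv[1], strict=False)\n"
    "print(reader.metadata)\n"
)


def run_isolated(path, timeout_s=30.0, worker=WORKER):
    """Run `worker` on `path` in a child Python process.

    Returns "ok" if the child exits with status 0, "error" if it exits
    with a non-zero status, and "timeout" if it exceeds timeout_s
    (subprocess.run kills the child in that case).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", worker, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if proc.returncode == 0 else "error"
```

Starting one interpreter per file is slow for hundreds of thousands of files, so in practice this might be reserved for a second pass over files that stalled a faster in-process run.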

@MartinThoma
Member

[Do] I have to maintain a manually compiled list of exceptions that I simply accept as being caused by broken input files or should PyPDF2 rather raise a PyPDF2.errors.PdfReadError for all these things which is one exception which I can straightforwardly ignore and say "this is a broken file so I'll just ignore it"?

I haven't completely made my mind up on this one. I'd be happy if people participated in the discussion here: #1210
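While that question is open, a caller can get the "one exception to catch" behavior on their own side. The sketch below is an assumption about how that might look, not PyPDF2 behavior: `BrokenPdfError` and `call_normalized` are hypothetical names, and the tuple of wrapped types is simply the set of exceptions mentioned in this thread.

```python
class BrokenPdfError(Exception):
    """Hypothetical caller-side exception; not part of PyPDF2."""


# Exception types reported in this thread as escaping from the library.
INTERNAL_FAILURES = (TypeError, ValueError, AttributeError, AssertionError)


def call_normalized(func, *args, **kwargs):
    """Call func and re-raise known internal failures as BrokenPdfError.

    The original exception stays available as __cause__, so nothing is
    lost for later debugging or bug reports.
    """
    try:
        return func(*args, **kwargs)
    except INTERNAL_FAILURES as exc:
        raise BrokenPdfError(f"{type(exc).__name__}: {exc}") from exc
```

The trade-off is the one raised in the thread: wrapping makes bulk processing simpler, but it can also hide genuine library bugs unless the chained cause is still logged somewhere.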

@DL6ER
Author

DL6ER commented Aug 28, 2022

I found two more PDF files triggering the exact same traceback. Sorry for the bad example above - these files are readable.

@ediamondscience
Contributor

ediamondscience commented Aug 28, 2022

@DL6ER I'm trying to sort out what your test program should be returning for all three pdfs. I'm using the pdfinfo command that comes with xpdf for Linux.

I'm seeing that your first file has an xref table that's corrupted and is missing its trailer.

The second has a bunch of xref entries that I can't parse. The trailer is intact but contains nothing in the metadata that gets returned by the DocumentInformation() class.

The third seems to be relatively intact short of one xref entry, and has Google listed as the creator, which is the only thing that should be returned by the DocumentInformation() class.

If that's all correct, here's what I'm thinking should happen when you run your test code for each file:

File#1:
- Raises PdfReader exception, the metadata is unreadable.

File#2:
- Returns empty set. Metadata is (mostly) intact and readable, but blank.

File#3:
- Returns {'Creator' : 'Google'} and nothing else.

Is this what you had in mind when you wrote your test code?

edit:
After looking at the raw text contained in each file, I have a good idea of why all three should return an error. File#1 has no trailer, File#2's trailer appears to have an extra character in its declaration, causing PyPDF2 to not pick it up, and File#3's trailer has been corrupted so that its info section isn't picked up. I think all three should throw a PdfReadError exception for now. We also need to look into the different ways a document's trailer can be set up and handle each one individually, with more verbose exceptions for when we cannot read the trailer.

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 29, 2022
fixes  py-pdf#1279 / Status_v1_Reviewers-Guide.pdf
@pubpub-zz
Collaborator

With Acrobat Reader, the only file I've been able to read is file#3. The PR has been adjusted to read the pages and the metadata.

@DL6ER
Author

DL6ER commented Aug 30, 2022

@ediamondscience The best approach in strict=False mode seems to be to extract as much as possible. However, it's fine for me to see an exception being raised when the metadata is corrupted as I can understand this approach, too. Especially if we cannot be sure that we are extracting meaningful metadata because something is corrupted.
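The "extract as much as possible" idea can also be approximated on the caller's side by running each extraction step independently, so that unreadable metadata does not prevent reading the pages. This is only a sketch: `best_effort` and `pdf_steps` are hypothetical helpers, and `pdf_steps` assumes a PyPDF2 2.x reader object exposing `metadata`, `pages`, and `extract_text()`.

```python
def best_effort(steps):
    """Run each named step on its own; one failure does not stop the rest.

    steps maps a name to a zero-argument callable. Returns
    (results, errors), where errors maps name -> "ExcType: message".
    """
    results, errors = {}, {}
    for name, step in steps.items():
        try:
            results[name] = step()
        except Exception as exc:
            errors[name] = f"{type(exc).__name__}: {exc}"
    return results, errors


def pdf_steps(reader):
    """Assumed extraction steps for a PyPDF2 2.x reader object."""
    return {
        "metadata": lambda: reader.metadata,
        "page_count": lambda: len(reader.pages),
        "text": lambda: [page.extract_text() for page in reader.pages],
    }
```

With a reader opened in `strict=False` mode, `best_effort(pdf_steps(reader))` would return whatever could be read plus a per-step error message, which leaves the decision about trusting partially corrupted metadata to the caller.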

MartinThoma pushed a commit that referenced this issue Aug 31, 2022
Added PdfReadError in cases where trailer is absent or can't be read.

Closes #1279
@MartinThoma
Member

The improvement found by @ediamondscience was just merged to main and will be released to PyPI this week.

MartinThoma pushed a commit that referenced this issue Sep 3, 2022
Fixes #1273
Fixes #1279
Fixes #1292
Fixes #1294
Fixes #1295

ROB: Cope with xref starting on \r\n
ROB: Escaped octal code followed by decimal int
ROB: Cope with some corrupted entries in xref table
ROB: Extend xref autorepair cases