TypeError: 'NoneType' object is not iterable #1279

DL6ER · 2022-08-26T08:00:28Z

See #1269 for further details.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.4.0-122-generic-x86_64-with-glibc2.29

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

import PyPDF2
with open("pca_var.pdf", "rb") as f:
  pdfreader = PyPDF2.PdfFileReader(f, strict=False)
  metadata = pdfreader.metadata.copy() if pdfreader.metadata is not None else {}

PDF used above: pca_var.pdf

Traceback

This is the complete Traceback I see:

Object 26 0 not defined.

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    metadata = pdfreader.metadata.copy() if pdfreader.metadata is not None else {}
  File "/usr/local/lib/python3.8/dist-packages/PyPDF2/_reader.py", line 329, in metadata
    retval.update(obj)  # type: ignore
TypeError: 'NoneType' object is not iterable
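
The last frame of the traceback can be reproduced without any PDF: dict.update raises this TypeError whenever it is handed None. A minimal sketch follows. Note that merge_info is a hypothetical helper for illustration only, not PyPDF2's actual code, and the guard it shows is just one possible way to turn the failure into an explicit error.

```python
from typing import Optional


def merge_info(info: Optional[dict]) -> dict:
    """Copy an info dictionary, failing loudly when it is missing.

    Hypothetical helper: it only mirrors the shape of the failing call.
    """
    if info is None:
        # Without this guard, retval.update(None) raises the TypeError
        # shown in the traceback.
        raise ValueError("document information dictionary is missing")
    retval: dict = {}
    retval.update(info)
    return retval


# Reproduce the unguarded failure from the traceback.
try:
    {}.update(None)  # type: ignore[arg-type]
except TypeError as exc:
    unguarded_message = str(exc)
```

Raising a dedicated error at the guard is what lets a caller tell "the input is broken" apart from "the library has a bug".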

pubpub-zz · 2022-08-26T08:58:06Z

The file you've referenced is not readable. Please make sure you can open your files and inspect their information in Acrobat Reader when a failure occurs, to confirm whether the problem lies in the file or in PyPDF2.

DL6ER · 2022-08-26T11:25:21Z

I plan to use PyPDF2 to analyze a large number (> 100,000) of PDF files automatically. I'm reporting this because PyPDF2 does not catch the problem (by raising a PyPDF2.errors.PdfReadError) but instead fails internally. If this is the expected behavior for invalid PDF files, I can wrap the entire call in a try/except block and ignore the error. However, with that approach I may miss PDF files that are slightly broken but from which the text could still be extracted.

Can you recommend any automated method for checking a PDF for validity before loading it into PyPDF2 if this is a requirement?

pubpub-zz · 2022-08-27T07:56:38Z

@DL6ER
In any case, the try/except you are proposing will be required, as you may always run into an unhandled exception. Instead of ignoring such files, you should build a list of the files with issues, preferably capturing the stack trace of each exception, and then review them manually. At the least, you should have a look at #1143.
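
A minimal sketch of the bookkeeping suggested above, assuming the per-file work is passed in as a callable. The names scan_files and read_one are hypothetical and not part of PyPDF2; the point is only that each failure is recorded with its stack trace instead of being silently ignored.

```python
import traceback
from typing import Callable, Dict, Iterable, List, Tuple


def scan_files(
    paths: Iterable[str],
    read_one: Callable[[str], dict],
) -> Tuple[Dict[str, dict], List[Tuple[str, str, str]]]:
    """Run read_one on every path and collect failures for later review."""
    results: Dict[str, dict] = {}
    failures: List[Tuple[str, str, str]] = []
    for path in paths:
        try:
            results[path] = read_one(path)
        except Exception as exc:
            # Deliberately broad: any failure is kept, together with the
            # exception type and the full stack trace, for manual review.
            failures.append((path, type(exc).__name__, traceback.format_exc()))
    return results, failures
```

With PyPDF2 installed, read_one could be a small function that opens the file, builds a reader with strict=False and returns a copy of its metadata, as in the example at the top of this issue.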

DL6ER · 2022-08-27T09:35:32Z

@pubpub-zz I already have such a try/except block in place and store PDF files with issues for manual review in a separate place. I could easily open yet another bunch of issues with assertion errors and some more ValueErrors and AttributeError: 'NoneType' object has no attribute 'get_object'.

My question is: Do you know what the goal is for PyPDF2?
Am I meant to maintain a manually compiled list of exceptions that I simply accept as being caused by broken input files? Or should PyPDF2 rather raise a PyPDF2.errors.PdfReadError for all of these, giving me one exception that I can straightforwardly handle by saying "this is a broken file, so I'll just ignore it"? (I guess the latter.)
I'd have no issues with reporting newly found issues to this repo every now and then. I see my project - streaming hundreds of thousands of PDF files found in random places - as a good way to make PyPDF2 more robust, if this is wanted.
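
The two options can be contrasted with a small caller-side sketch. It assumes a hand-maintained tuple of exception types that are treated as "broken input" and folds them into one error. BrokenPdfError, INTERNAL_ERRORS and read_or_flag are hypothetical names, not PyPDF2 API. If the library raised a single PdfReadError itself, this wrapper would be unnecessary.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class BrokenPdfError(Exception):
    """Hypothetical caller-side error meaning 'treat this file as broken'."""


# The manually compiled list: exception types seen so far on bad input.
INTERNAL_ERRORS = (TypeError, ValueError, AttributeError, AssertionError)


def read_or_flag(path: str, read_one: Callable[[str], T]) -> T:
    """Call read_one(path); convert listed internal errors into one type."""
    try:
        return read_one(path)
    except INTERNAL_ERRORS as exc:
        # Chain the original exception so the real cause stays visible.
        raise BrokenPdfError(f"{path}: {type(exc).__name__}: {exc}") from exc
```

Any exception outside the tuple still propagates unchanged, so genuinely new failure modes are not hidden.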

I'll definitely open another issue ticket in a moment as I found a small PDF (19 pages, < 1MB) which does not throw an exception but rather keeps PyPDF2 spinning forever at 100% CPU.

MartinThoma · 2022-08-27T11:23:40Z

[Do] I have to maintain a manually compiled list of exceptions that I simply accept as being caused by broken input files or should PyPDF2 rather raise a PyPDF2.errors.PdfReadError for all these things which is one exception which I can straightforwardly ignore and say "this is a broken file so I'll just ignore it"?

I haven't completely made my mind up on this one. I'd be happy if people participated in the discussion here: #1210

DL6ER · 2022-08-28T13:31:55Z

I found two more PDF files triggering the exact same traceback. Sorry for the bad example above - these files are readable.

ediamondscience · 2022-08-28T22:35:40Z

@DL6ER I'm trying to sort out what your test program should be returning for all three pdfs. I'm using the pdfinfo command that comes with xpdf for Linux.

I'm seeing that your first file has an xref table that's corrupted and is missing its trailer.

The second has a bunch of xref entries that I can't parse. The trailer is intact but contains nothing in the metadata that gets returned by the DocumentInformation() class.

The third seems to be relatively intact short of one xref entry, and has Google listed as the creator, which is the only thing that should be returned by the DocumentInformation() class.

If that's all correct, here's what I'm thinking should happen when you run your test code for each file:

File#1:
- Raises PdfReader exception, the metadata is unreadable.

File#2:
- Returns empty set. Metadata is (mostly) intact and readable, but blank.

File#3:
- Returns {'Creator' : 'Google'} and nothing else.

Is this what you had in mind when you wrote your test code?

edit:
After looking at the raw text contained in each file, I have a good idea of why all three should return an error. File#1 has no trailer, File#2's trailer appears to have an extra character in its declaration, causing PyPDF2 to not pick it up, and File#3's trailer has been corrupted so that its info section isn't picked up. I think all three should throw a PdfReadError exception for now. We also need to look into the different ways a document's trailer can be set up and handle them individually, with more verbose exceptions for when we cannot read the trailer.
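
A rough sketch of how the trailer problems described above could be spotted in the raw bytes before a file is handed to a parser. This is a heuristic under stated assumptions, not a validity check and not how PyPDF2 locates the trailer. The function names are hypothetical, and files that use cross-reference streams (PDF 1.5 and later) legitimately contain no trailer keyword, so they would be flagged here as well.

```python
def quick_structure_check(data: bytes) -> dict:
    """Report which basic structural markers are present in raw PDF bytes."""
    tail = data[-1024:]  # the end-of-file markers are expected near the end
    return {
        "header": data.startswith(b"%PDF-"),
        # Only classic cross-reference tables are followed by this keyword.
        "trailer_keyword": b"trailer" in data,
        "startxref": b"startxref" in tail,
        "eof_marker": b"%%EOF" in tail,
    }


def looks_suspicious(data: bytes) -> bool:
    """True if any marker is missing; a hint for review, not a verdict."""
    return not all(quick_structure_check(data).values())
```

A file flagged this way may still be readable, so the result is better used to sort files into a review pile than to skip them.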

pubpub-zz · 2022-08-29T11:35:05Z

With Acrobat Reader, the only file I've been able to read is file#3. The PR has been adjusted to read the pages and the metadata.

DL6ER · 2022-08-30T16:02:16Z

@ediamondscience The best approach in strict=False mode seems to be to extract as much as possible. However, it's fine for me to see an exception being raised when the metadata is corrupted as I can understand this approach, too. Especially if we cannot be sure that we are extracting meaningful metadata because something is corrupted.

MartinThoma · 2022-08-31T04:41:05Z

The improvement found by @ediamondscience was just merged to main and will be released to PyPI this week.

Fixes #1273
Fixes #1279
Fixes #1292
Fixes #1294
Fixes #1295

ROB: Cope with xref starting on \r\n
ROB: Escaped octal code followed by decimal int
ROB: Cope with some corrupted entries in xref table
ROB: Extend xref autorepair cases

ediamondscience mentioned this issue Aug 27, 2022

Found & Fixed None type values for TK.INFO keys in self.trailer (_reader.py) #1284

Closed

ediamondscience mentioned this issue Aug 29, 2022

MAINT: Throw PdfReadError if Trailer can't be read #1298

Merged

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Aug 29, 2022

ROB : cope with xref starting on \r\n

147b69e

fixes py-pdf#1279 / Status_v1_Reviewers-Guide.pdf

MartinThoma closed this as completed in #1298 Aug 31, 2022

MartinThoma pushed a commit that referenced this issue Aug 31, 2022

MAINT: Throw PdfReadError if Trailer can't be read (#1298)

5c76c8f

Added PdfReadError in cases where trailer is absent or can't be read. Closes #1279

MartinThoma mentioned this issue Sep 2, 2022

ENH: Process XRefStm #1297

Merged
