Bunch of files cannot be extract #1143

Lightup1 · 2022-07-21T14:30:20Z

After testing thousands of files, we have collected (in a 7-Zip archive) a set of files for which extract_text fails on specific pages.

Environment

$ python -m platform
Windows-10-10.0.22000-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.6.0 #main

Code + PDF

This is a minimal, complete example that shows the issue:

import csv

import PyPDF2

# To prepare a minimal reproducible version,
# just reuse the extract_text_error.csv file to get the file list.
with open("extract_text_error.csv", "rt", encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    filelist = [row[0] for row in reader]
filelist = list(set(filelist))  # drop duplicate paths

# IMPORTANT: remember to change the absolute paths in filelist.
for path_pdf_name in filelist:
    pdfReader = PyPDF2.PdfReader(path_pdf_name)
    text = ""
    for i, page in enumerate(pdfReader.pages):
        try:
            text += page.extract_text()
        except Exception:
            # Record the file and the 0-based index of the page that failed.
            with open("extract_text_error1.csv", "a", encoding="utf-8") as file:
                file.write("%s,%s\n" % (path_pdf_name, i))
            continue

pdf zip sharing: https://drive.google.com/file/d/1Ado46PEN_GYUhSI0zlSujX0722FAgk27/view?usp=sharing
extract_text_error.csv
The file names in this CSV are not in sequential order because I extracted the files locally with multiprocessing.

pubpub-zz · 2022-07-21T18:03:02Z

what a list! 😁

@Lightup1
Could you do me a favor to ease the analysis: add the exception message in a third column?

            except Exception:
                import traceback
                # Third column: the full traceback, with double quotes replaced so the quoted CSV field stays intact.
                with open("extract_text_error1.csv","a",encoding='utf-8') as file:
                    file.write('%s,%s,"%s"\n' % (path_pdf_name,i,traceback.format_exc().replace('"',"'")))
                continue

Thanks
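As a variation on the logging snippet above, here is a minimal sketch that lets the standard csv module do the quoting instead of replacing double quotes by hand. The function name extract_with_log and its parameters are hypothetical and not part of PyPDF2; the sketch only assumes that each page object has an extract_text() method, as in the reproduction script.

```python
import csv
import io
import traceback
from typing import Iterable, List


def extract_with_log(path: str, pages: Iterable, writer) -> str:
    """Concatenate page text; log (path, page index, traceback) for failures."""
    parts: List[str] = []
    for i, page in enumerate(pages):
        try:
            parts.append(page.extract_text())
        except Exception:
            # csv.writer quotes embedded commas, quotes, and newlines itself,
            # so the traceback can be written unmodified.
            writer.writerow([path, i, traceback.format_exc()])
    return "".join(parts)
```

With PyPDF2 this could be called with PyPDF2.PdfReader(path).pages and a csv.writer wrapped around a file opened with newline="" and encoding="utf-8"; the page index written to the log is 0-based.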

Lightup1 · 2022-07-22T01:42:49Z

Hi @pubpub-zz, here is the CSV with the error messages:
extract_text_error.csv

pubpub-zz · 2022-07-23T20:47:52Z

@Lightup1,
The issue should be fixed by #1155; however, I did not check the whole set of test files. Waiting for your feedback.

See #1143 Co-authored-by: Martin Thoma <info@martin-thoma.de>

Lightup1 · 2022-07-24T11:01:44Z

Hi @pubpub-zz ,
I've tried to extract the text using PyPDF2 2.8.0.
There are now only a few files whose text cannot be extracted.
Here is the CSV file:
extract_text_error_2.8.0.csv

pubpub-zz · 2022-07-24T14:32:05Z

Two types of errors have been found:
(99.?%) /DecodeParms is an empty list. This is accepted by the standard; the issue is in the default value handling. Example file:
2015年年度报告_pb_decode_pg0.pdf

(1 file) The last one is an issue in the PDF itself: the resources section is missing on page 110 (counting from 1). The page cannot be decoded even by Acrobat Reader DC, so the catch-all exception handling you are using is the right approach.
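To illustrate the first case, here is a minimal sketch of default value handling that tolerates an empty /DecodeParms list. It is not the code from the actual fix: the helper names first_decode_parms and predictor are hypothetical, and a plain dict stands in for a stream dictionary. The only fact it relies on is that /Predictor defaults to 1 when no parameters are given.

```python
from typing import Any, Dict, List, Union

# Hypothetical stand-in for a stream dictionary; not PyPDF2's real data model.
StreamDict = Dict[str, Any]


def first_decode_parms(stream: StreamDict) -> Dict[str, Any]:
    """Return the parameters for the first filter, or {} when none are usable."""
    parms: Union[None, Dict[str, Any], List[Any]] = stream.get("/DecodeParms")
    if isinstance(parms, list):
        # An empty list means "no parameters": fall back to the defaults.
        parms = parms[0] if parms else None
    return parms if isinstance(parms, dict) else {}


def predictor(stream: StreamDict) -> int:
    """Return /Predictor, defaulting to 1 (no prediction) when absent."""
    return int(first_decode_parms(stream).get("/Predictor", 1))
```

The point of the design is that a missing key, an empty list, and a null entry all collapse to the same empty dictionary, so every later lookup can use its documented default.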

See #1143, 2nd part

pubpub-zz · 2022-07-24T21:23:26Z

@Lightup1
All (but one) should now be solved. Can you confirm?

Lightup1 · 2022-07-25T01:24:53Z

@pubpub-zz
I've confirmed that, on the current main branch, only this one file still has an error, on page 110 (counting from 1).

Traceback (most recent call last):
  File 'E:\pdf2txt\test\pdf2txt-1.0.py', line 35, in dir_pdf2txt
    text+=page.extract_text()
  File 'C:\Users\Baiyi Yu\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_page.py', line 1433, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File 'C:\Users\Baiyi Yu\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_page.py', line 1132, in _extract_text
    resources_dict = cast(DictionaryObject, obj['/Resources'])
  File 'C:\Users\Baiyi Yu\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic.py', line 685, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/Resources'

2015年年度报告.pdf
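Since the traceback shows the failure is a plain dictionary lookup of '/Resources', such pages can be located up front. The sketch below assumes only that each page behaves like a mapping, as the dict.__getitem__ frame in the traceback suggests; the function name pages_missing_resources is hypothetical and not part of PyPDF2.

```python
from typing import Iterable, List, Mapping


def pages_missing_resources(pages: Iterable[Mapping]) -> List[int]:
    """Return the 1-based numbers of pages that have no /Resources entry."""
    return [
        number
        for number, page in enumerate(pages, start=1)
        if "/Resources" not in page
    ]
```

Numbering starts at 1 here to match the "page 110" wording used in this thread, whereas the CSV log in the reproduction script uses 0-based indexes.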

MartinThoma · 2022-07-25T05:52:12Z

Wow, you're both awesome 🚀 thank you for taking care of this @pubpub-zz and thanks for reporting the issues / responding quickly @Lightup1 🙏

I will make a release to PyPI with the fixes soon :-)

pubpub-zz · 2022-07-26T16:57:41Z

A new release has been issued. @Lightup1, OK to close this issue?

Lightup1 · 2022-07-27T00:49:29Z

@pubpub-zz It’s okay!

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 23, 2022

ROB : cope with utf16 character for space calculation (py-pdf#1143)

e8ff310

MartinThoma mentioned this issue Jul 24, 2022

ROB: Cope with utf16 character for space calculation #1155

Merged

MartinThoma added a commit that referenced this issue Jul 24, 2022

ROB: Cope with utf16 character for space calculation (#1155)

35bec40

See #1143 Co-authored-by: Martin Thoma <info@martin-thoma.de>

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 24, 2022

ROB : Cope with empty DecodeParams (py-pdf#1143 2nd part)

8eabd41

pubpub-zz mentioned this issue Jul 24, 2022

ROB: Cope with empty DecodeParams #1165

Merged

MartinThoma pushed a commit that referenced this issue Jul 24, 2022

ROB: Cope with empty DecodeParams (#1165)

0b27287

See #1143, 2nd part

Lightup1 closed this as completed Jul 27, 2022

pubpub-zz mentioned this issue Aug 27, 2022

TypeError: 'NoneType' object is not iterable #1279

Closed
