Bunch of files cannot be extract #1143

Closed
Lightup1 opened this issue Jul 21, 2022 · 10 comments

@Lightup1

Lightup1 commented Jul 21, 2022

After testing thousands of files, we zipped (with 7-Zip) a set of files for which extract_text fails on specific pages.

Environment

$ python -m platform
Windows-10-10.0.22000-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.6.0 #main

Code + PDF

This is a minimal, complete example that shows the issue:

import csv

import PyPDF2

# To prepare a minimal reproducible version, just reuse the
# extract_text_error.csv file to get the list of failing files.
with open("extract_text_error.csv", "rt", encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile)
    filelist = [row[0] for row in reader]
filelist = list(set(filelist))  # drop duplicate paths

######################
### IMPORTANT: remember to change the absolute paths in filelist
for path_pdf_name in filelist:
    pdfReader = PyPDF2.PdfReader(path_pdf_name)
    text = ""
    for i, page in enumerate(pdfReader.pages):
        try:
            text += page.extract_text()
        except Exception:
            # Record the file path and the 0-based index of the failing page.
            with open("extract_text_error1.csv", "a", encoding="utf-8") as file:
                file.write("%s,%s\n" % (path_pdf_name, i))
            continue

pdf zip sharing: https://drive.google.com/file/d/1Ado46PEN_GYUhSI0zlSujX0722FAgk27/view?usp=sharing
extract_text_error.csv
The file names in this CSV are not in sequential order because I extracted the files locally with multiprocessing.

@pubpub-zz
Collaborator

pubpub-zz commented Jul 21, 2022

what a list! 😁

@Lightup1
Could you do me a favor to ease the analysis: add the exception message in a third column?

            except Exception as e:
                import traceback
                with open("extract_text_error1.csv","a",encoding='utf-8') as file:
                    file.write('%s,%s,"%s"\n' % (path_pdf_name,i,traceback.format_exc().replace('"',"'")))
                continue

thanks
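
As a possible alternative to building the CSV line by hand, the sketch below lets Python's csv module do the quoting, so quotes, commas, and newlines inside the traceback do not need to be replaced. The helper name log_failure and its parameters are hypothetical and only for illustration; they are not part of PyPDF2.

```python
import csv
import traceback


def log_failure(csv_path, pdf_path, page_index, exc):
    """Append one row (file, 0-based page index, traceback text) to csv_path."""
    tb_text = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    # newline="" lets the csv module control line endings; the writer quotes
    # any field that contains commas, double quotes, or newlines.
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([pdf_path, page_index, tb_text])
```

It would be called from the except block as log_failure("extract_text_error1.csv", path_pdf_name, i, e), and the resulting file can be read back with csv.reader.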

@Lightup1
Author

Hi @pubpub-zz here is the csv with error message:
extract_text_error.csv

@pubpub-zz
Collaborator

@Lightup1,
The issue should be fixed by #1155; however, I did not check the whole set of test files. Waiting for your feedback.

MartinThoma added a commit that referenced this issue Jul 24, 2022
See #1143

Co-authored-by: Martin Thoma <info@martin-thoma.de>
@Lightup1
Author

Hi @pubpub-zz ,
I tried extracting text with PyPDF2 2.8.0.
Only a few files still fail text extraction.
Here is the CSV file:
extract_text_error_2.8.0.csv

@pubpub-zz
Collaborator

Two types of errors have been found:
(99.?%) /DecodeParms is an empty list: this is accepted by the standard, so the issue is in the default value handling. Example file:
2015年年度报告_pb_decode_pg0.pdf

(1 file) The last one is an issue in the PDF itself: the resources section is missing on page 110 (counting from 1). The page cannot be decoded even by Acrobat Reader DC, so the overall exception handling you are using is the right approach.
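
To illustrate the first error type, here is a minimal sketch of what tolerant default handling for the decode parameters could look like: an empty list, a missing value, or null entries are all treated as "no parameters" for each filter. The function normalize_decode_parms is hypothetical and written for this thread; it is not PyPDF2's actual implementation, and the fix in the linked commits may look different.

```python
def normalize_decode_parms(filters, decode_parms):
    """Return one parameter dict per filter, filling gaps with empty dicts.

    Hypothetical helper for illustration; not PyPDF2's actual code.
    """
    # A stream may name a single filter or a list of filters.
    if not isinstance(filters, (list, tuple)):
        filters = [filters]
    # The parameters may be absent, a single dict, or a list.
    if decode_parms is None:
        decode_parms = []
    elif isinstance(decode_parms, dict):
        decode_parms = [decode_parms]
    # Pad a short or empty list so indexing by filter position never fails.
    padded = list(decode_parms) + [None] * (len(filters) - len(decode_parms))
    return [p if isinstance(p, dict) else {} for p in padded[: len(filters)]]
```

With this shape, a decoder can always look up the parameters for the n-th filter and fall back to its defaults when the dict is empty.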

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jul 24, 2022
MartinThoma pushed a commit that referenced this issue Jul 24, 2022
@pubpub-zz
Collaborator

@Lightup1
All (but one) should now be solved. Can you confirm?

@Lightup1
Author

Lightup1 commented Jul 25, 2022

@pubpub-zz
I've confirmed that, on the current main branch, only this one file still raises an error, on page 110 (counting from 1).

Traceback (most recent call last):
  File 'E:\pdf2txt\test\pdf2txt-1.0.py', line 35, in dir_pdf2txt
    text+=page.extract_text()
  File 'C:\Users\Baiyi Yu\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_page.py', line 1433, in extract_text
    return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
  File 'C:\Users\Baiyi Yu\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_page.py', line 1132, in _extract_text
    resources_dict = cast(DictionaryObject, obj['/Resources'])
  File 'C:\Users\Baiyi Yu\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\generic.py', line 685, in __getitem__
    return dict.__getitem__(self, key).get_object()
KeyError: '/Resources'

2015年年度报告.pdf
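
For a page like this one, which is broken in the PDF itself, per-page exception handling keeps the rest of the document usable. The sketch below is one way to collect the text plus a list of failing pages, reported with 1-based numbers as in this thread. The function name extract_all is hypothetical; the only assumption about the page objects is that they expose extract_text(), as in the code above.

```python
def extract_all(pages):
    """Return (text, failures) for an iterable of page objects.

    failures holds (1-based page number, error description) tuples.
    """
    texts = []
    failures = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except Exception as exc:
            # Keep going: one undecodable page should not lose the others.
            failures.append((index + 1, repr(exc)))
    return "".join(texts), failures
```

Usage would be text, failures = extract_all(pdfReader.pages), after which failures can be written to the error CSV.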

@MartinThoma
Member

MartinThoma commented Jul 25, 2022

Wow, you're both awesome 🚀 thank you for taking care of this @pubpub-zz and thanks for reporting the issues / responding quickly @Lightup1 🙏

I will make a release to pypi with the fixes soon :-)

@pubpub-zz
Collaborator

A new release has been issued. @Lightup1, OK to close this issue?

@Lightup1
Author

@pubpub-zz It’s okay!
