Bunch of files cannot be extract #1143
Comments
what a list! 😁 @Lightup1
thanks
Hi @pubpub-zz, here is the CSV with the error messages:
See #1143 Co-authored-by: Martin Thoma <info@martin-thoma.de>
Hi @pubpub-zz,
Two types of errors have been found. The last one (1 file) is an issue in the PDF itself: the resources section is missing on page 110 (page numbering starting at 1), and the page cannot be decoded even by Acrobat Reader DC. The overall exception handling you are using is the right approach.
Wow, you're both awesome 🚀 Thank you for taking care of this @pubpub-zz, and thanks for reporting the issues and responding quickly @Lightup1 🙏 I will make a release to PyPI with the fixes soon :-)
A new release has been issued. @Lightup1, OK to close this issue?
@pubpub-zz It’s okay!
After testing thousands of files, we 7-Zipped a bunch of files for which extract_text fails on specific pages.
Environment
Code + PDF
This is a minimal, complete example that shows the issue:
pdf zip sharing: https://drive.google.com/file/d/1Ado46PEN_GYUhSI0zlSujX0722FAgk27/view?usp=sharing
extract_text_error.csv
The entries in this file are not in file-name order because I extracted the files locally with multiprocessing.
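For anyone who wants to build a similar report, here is a rough sketch of per-page exception handling that records which pages fail. It is not the script used for this issue. The helper names (`collect_page_errors`, `scan_pdf`, `format_report`) and the CSV columns are made up for illustration, and it assumes a PyPDF2 version that provides `PdfReader`, `reader.pages` and `page.extract_text()`.

```python
import csv
import io


def collect_page_errors(pages, extract):
    """Call extract(page) on every page; return (1-based page number, message) per failure."""
    errors = []
    for number, page in enumerate(pages, start=1):
        try:
            extract(page)
        except Exception as exc:  # broad on purpose: one bad page must not stop the scan
            errors.append((number, f"{type(exc).__name__}: {exc}"))
    return errors


def scan_pdf(path):
    """Return rows (path, page number, error message) for one PDF (hypothetical helper)."""
    # Imported lazily so the generic helpers above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    try:
        pages = list(PdfReader(path).pages)
    except Exception as exc:
        # Page 0 marks a failure to open or parse the file as a whole.
        return [(path, 0, f"{type(exc).__name__}: {exc}")]
    failures = collect_page_errors(pages, lambda page: page.extract_text())
    return [(path, number, message) for number, message in failures]


def format_report(rows):
    """Render rows as CSV text, sorted by file name and then page number."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file", "page", "error"])
    writer.writerows(sorted(rows))
    return buffer.getvalue()
```

Because `scan_pdf` is a top-level function, it can be handed to `multiprocessing.Pool.imap_unordered`. The sort in `format_report` then puts the rows back in file-name order regardless of which worker finishes first.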