-
Notifications
You must be signed in to change notification settings - Fork 462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update io/pdf.py
to new pypdfium2 API
#944
Conversation
CI currently fails because of the >= 2.0.0 requirement, considering that this version is not actually released yet. |
@mara004 Thanks a lot for this update 👍 |
Other changes: * Removed the file existence check as pypdfium2 internally does that already. * Added a password argument to open encrypted PDFs.
Version 2.0.0 is released now. I've made a force push to re-trigger CI. |
# Rasterise pages to PIL images with pypdfium2 and convert to numpy ndarrays | ||
return [np.asarray(img) for img, _ in pdfium.render_pdf_topil(file, scale=scale, **kwargs)] | ||
pdf = pdfium.PdfDocument(file, password=password) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets use with
statement :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that's not available yet. PdfDocument currently doesn't implement __enter__()
and __exit__()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean that the document is closed at the end of the execution of PdfDocument
constructor?
If not, how can we do this? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you can see in #953, PdfDocument
now provides the context manager API, so you can use any instance of this class in a with
block, meaning that .close()
will be called automatically at context manager exit. Does this answer your question?
@mara004 thanks only a few minor things left 👍 |
Codecov Report
@@ Coverage Diff @@
## main #944 +/- ##
==========================================
+ Coverage 94.71% 94.72% +0.01%
==========================================
Files 134 134
Lines 5501 5501
==========================================
+ Hits 5210 5211 +1
+ Misses 291 290 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
to resolve flake8 warnings
I guess the one remaining test failure is likely unrelated / erroneous? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot :)
Thank you for reviewing/merging! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @mara004 :)
Just a few comments & questions out of curiosity!
def read_pdf( | ||
file: AbstractFile, | ||
scale: float = 2, | ||
password: Optional[str] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick note for upcoming modifications: I would suggest to use Union[..., None]
rather than Optional[...]
cf. https://fastapi.tiangolo.com/python-types/#using-union-or-optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the term "Optional" is confusing for required parameters, but as password
is actually optional, I wouldn't worry about that here. As sphinx-autodoc currently renders Union[..., None]
to Optional[...]
implicitly, I guess that Optional
might even be more common at the moment.
What about keeping it as-is for now and switching to str | None
once you require Python >=3.10 ?
# Rasterise pages to PIL images with pypdfium2 and convert to numpy ndarrays | ||
return [np.asarray(img) for img, _ in pdfium.render_pdf_topil(file, scale=scale, **kwargs)] | ||
pdf = pdfium.PdfDocument(file, password=password) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean that the document is closed at the end of the execution of PdfDocument
constructor?
If not, how can we do this? :)
This patch would make doctr compatible with pypdfium2 >= 2.0.0, which will be released soon.
Now that pypdfium2 supports extracting text and locating images, it would also be possible to add back some features that were removed with #829.