Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update io/pdf.py to new pypdfium2 API #944

Merged
merged 3 commits into from
Jun 9, 2022

Conversation

mara004
Copy link
Contributor

@mara004 mara004 commented Jun 8, 2022

This patch would make doctr compatible with pypdfium2 >= 2.0.0, which will be released soon.
Now that pypdfium2 supports extracting text and locating images, it would also be possible to add back some features that were removed with #829.

@mara004
Copy link
Contributor Author

mara004 commented Jun 8, 2022

CI currently fails because of the >= 2.0.0 requirement, considering that this version is not actually released yet.

@mara004 mara004 marked this pull request as draft June 8, 2022 17:59
@felixdittrich92
Copy link
Contributor

@mara004 Thanks a lot for this update 👍

@felixdittrich92 felixdittrich92 added module: io Related to doctr.io type: misc Miscellaneous labels Jun 8, 2022
Other changes:
* Removed the file existence check as pypdfium2 internally does that
already.
* Added a password argument to open encrypted PDFs.
@mara004 mara004 marked this pull request as ready for review June 9, 2022 12:44
@mara004
Copy link
Contributor Author

mara004 commented Jun 9, 2022

Version 2.0.0 is released now. I've made a force push to re-trigger CI.

@felixdittrich92 felixdittrich92 self-assigned this Jun 9, 2022
# Rasterise pages to PIL images with pypdfium2 and convert to numpy ndarrays
return [np.asarray(img) for img, _ in pdfium.render_pdf_topil(file, scale=scale, **kwargs)]
pdf = pdfium.PdfDocument(file, password=password)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use with statement :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that's not available yet. PdfDocument currently doesn't implement __enter__() and __exit__().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean that the document is closed at the end of the execution of PdfDocument constructor?
If not, how can we do this? :)

Copy link
Contributor Author

@mara004 mara004 Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see in #953, PdfDocument now provides the context manager API, so you can use any instance of this class in a with block, meaning that .close() will be called automatically at context manager exit. Does this answer your question?

doctr/io/pdf.py Show resolved Hide resolved
doctr/io/pdf.py Outdated Show resolved Hide resolved
@felixdittrich92
Copy link
Contributor

@mara004 thanks only a few minor things left 👍

@codecov
Copy link

codecov bot commented Jun 9, 2022

Codecov Report

Merging #944 (f5c95a5) into main (f0337d1) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #944      +/-   ##
==========================================
+ Coverage   94.71%   94.72%   +0.01%     
==========================================
  Files         134      134              
  Lines        5501     5501              
==========================================
+ Hits         5210     5211       +1     
+ Misses        291      290       -1     
Flag Coverage Δ
unittests 94.72% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
doctr/io/pdf.py 93.33% <100.00%> (ø)
doctr/transforms/modules/base.py 94.59% <0.00%> (ø)
doctr/transforms/functional/base.py 97.10% <0.00%> (+1.44%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f0337d1...f5c95a5. Read the comment docs.

doctr/io/pdf.py Outdated Show resolved Hide resolved
to resolve flake8 warnings
@mara004
Copy link
Contributor Author

mara004 commented Jun 9, 2022

I guess the one remaining test failure is likely unrelated / erroneous?
https://github.com/mindee/doctr/runs/6815814273?check_suite_focus=true

Copy link
Contributor

@felixdittrich92 felixdittrich92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot :)

@felixdittrich92 felixdittrich92 merged commit 4caeee3 into mindee:main Jun 9, 2022
@mara004
Copy link
Contributor Author

mara004 commented Jun 9, 2022

Thank you for reviewing/merging!

@mara004 mara004 deleted the pypdfium_v2api branch June 9, 2022 16:38
Copy link
Collaborator

@frgfm frgfm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @mara004 :)

Just a few comments & questions out of curiosity!

def read_pdf(
file: AbstractFile,
scale: float = 2,
password: Optional[str] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick note for upcoming modifications: I would suggest to use Union[..., None] rather than Optional[...]
cf. https://fastapi.tiangolo.com/python-types/#using-union-or-optional

Copy link
Contributor Author

@mara004 mara004 Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the term "Optional" is confusing for required parameters, but as password is actually optional, I wouldn't worry about that here. As sphinx-autodoc currently renders Union[..., None] to Optional[...] implicitly, I guess that Optional might even be more common at the moment.
What about keeping it as-is for now and switching to str | None once you require Python >=3.10 ?

# Rasterise pages to PIL images with pypdfium2 and convert to numpy ndarrays
return [np.asarray(img) for img, _ in pdfium.render_pdf_topil(file, scale=scale, **kwargs)]
pdf = pdfium.PdfDocument(file, password=password)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean that the document is closed at the end of the execution of PdfDocument constructor?
If not, how can we do this? :)

@felixdittrich92 felixdittrich92 added this to the 0.5.2 milestone Jun 29, 2022
@felixdittrich92 felixdittrich92 modified the milestones: 0.5.2, 0.6.0 Sep 26, 2022
@felixdittrich92 felixdittrich92 mentioned this pull request Sep 26, 2022
85 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: io Related to doctr.io type: misc Miscellaneous
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants