refactor: Switched from PyMuPDF to pypdfium2 #829
Thanks for that! I wanted to do it quickly but I am running out of time 😅
Are you sure keeping PyMuPDF for the tests is not an issue?
For PyPI downloads, I'm positive it isn't, since the package doesn't ship the test folder. But just to make sure that the entire codebase is rid of it, I changed the tests to use Pillow + the docTR synthesize function to mock PDFs 👍
Codecov Report
@@ Coverage Diff @@
## main #829 +/- ##
==========================================
- Coverage 95.97% 95.95% -0.02%
==========================================
Files 131 131
Lines 4988 4993 +5
==========================================
+ Hits 4787 4791 +4
- Misses 201 202 +1
Also, it seems that pypdfium2 isn't supported by some versions of Python (cf. the docker job) 🤔 But we'll handle this in another PR.
Thanks! It is cool to get rid of pymupdf
thanks!
Hey @fg-mindee 👋
I see that the text extraction and image localisation code was removed with this change. PDFium provides capabilities to extract text from PDFs, but I don't have a support model for it yet. I'm not sure if PDFium can locate images, but I will look it up.
Thanks a lot @mara004, we're quite excited about how thorough you are in the development of this project :)
I'm currently working on the new text extraction helper class in pypdfium2-team/pypdfium2#110. If you have any suggestions/requests about the API, please let me know.
I have just merged the new support models into main: pypdfium2-team/pypdfium2@bbc2438
@mara004 that's great! Do you have a minimal snippet example, please? So that I can integrate the feature into docTR 😁
I guess something like this:

Images

    import pypdfium2 as pdfium

    # Creates a list of image bounding boxes of left, right, bottom and top in PDF canvas units
    pdf = pdfium.PdfDocument(filepath)
    page = pdf.get_page(index)
    images = []
    for obj in page.get_objects():
        if obj.get_type() == pdfium.FPDF_PAGEOBJ_IMAGE:
            images.append(obj.get_pos())
    # Close the page before the document it belongs to
    for g in (page, pdf):
        g.close()

Text

    import pypdfium2 as pdfium

    # Creates a list of pairs of bounding boxes and their text content
    pdf = pdfium.PdfDocument(filepath)
    page = pdf.get_page(index)
    textpage = page.get_textpage()
    text_boxes = []
    for bbox in textpage.get_rectboxes():
        text_boxes.append((bbox, textpage.get_text(*bbox)))
    # Close child objects before their parents
    for g in (textpage, page, pdf):
        g.close()
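For the docTR integration, boxes in PDF canvas units would probably need to be converted to relative coordinates with a top-left origin. Here is a minimal sketch of that conversion. It assumes each box is ordered (left, bottom, right, top) in canvas units with the origin at the bottom-left of the page, and that the page width and height are known. If the library returns a different order, reorder the values first. `to_relative_box` is a hypothetical helper for illustration only, not part of pypdfium2 or docTR.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def to_relative_box(box: Box, page_width: float, page_height: float) -> Box:
    """Convert a PDF-canvas box to relative, top-left-origin coordinates.

    Assumed input order: (left, bottom, right, top), origin at bottom-left.
    Output order: (xmin, ymin, xmax, ymax), each in [0, 1] for boxes that
    lie inside the page, origin at top-left.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    left, bottom, right, top = box
    xmin = left / page_width
    xmax = right / page_width
    # Flip the vertical axis: the top edge in PDF space becomes the smaller y
    ymin = 1 - top / page_height
    ymax = 1 - bottom / page_height
    return (xmin, ymin, xmax, ymax)
```

For example, on a 200 x 400 page, a box covering the lower-left quarter, `(0, 0, 100, 200)`, maps to `(0.0, 0.5, 0.5, 1.0)`.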
FYI, I have since made a few changes to the text API in pypdfium2-team/pypdfium2@3ac10be and adapted the above example.
I just released pypdfium2 1.10.0, which contains the new support models.
Can we get people like you as "customer service" for all OSS libraries haha? 😁
I'm glad to help OSS projects as far as my limited knowledge permits :).
The API rewrite I mentioned in the other thread should be finished now, and I plan to release version 2.0.0 of pypdfium2 soon (there's a pre-release already). I have updated the above examples again and will submit a PR for the rendering code.
This PR introduces the following modifications:
Closes #486
Any feedback is welcome!