[14.0][IMP] account_invoice_import_simple_pdf: use Tesseract-OCR if available #935

Closed
wants to merge 3 commits into from
@@ -54,7 +54,16 @@
 pages = []
 doc = fitz.open(fileobj.name)
 for page in doc:
-    pages.append(page.get_text())
+    # Check if Tessdata is available for OCR
+    tessdata = fitz.get_tessdata()
[Codecov / codecov/patch] Added line #L58 of account_invoice_import_simple_pdf/wizard/account_invoice_import.py was not covered by tests.
+    # Perform OCR if Tessdata is available, otherwise use regular text extraction
+    textpage = (
[Codecov / codecov/patch] Added line #L60 of account_invoice_import_simple_pdf/wizard/account_invoice_import.py was not covered by tests.
+        page.get_textpage_ocr(full=False, tessdata=tessdata)
+        if tessdata
+        else page.get_textpage()
+    )
+    # Append the extracted text to the pages list
+    pages.append(page.get_text(textpage=textpage))
[Codecov / codecov/patch] Added line #L66 of account_invoice_import_simple_pdf/wizard/account_invoice_import.py was not covered by tests.
 res = {
     "all": "\n\n".join(pages),
     "first": pages and pages[0] or "",
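Since Codecov flags the added lines as untested, one possible way to cover both branches without installing Tesseract is to pull the loop into a small helper and exercise it with stub pages. The sketch below is only an illustration under assumptions: `extract_pages_text` and `StubPage` are hypothetical names that do not appear in this PR, the returned dict contains only the two keys visible in the hunk, and the tessdata path is a placeholder. The stub mimics the three PyMuPDF page methods the diff calls (`get_textpage_ocr`, `get_textpage`, `get_text`).

```python
def extract_pages_text(doc, tessdata):
    """Return page texts, using OCR only when `tessdata` is truthy.

    `doc` is any iterable of page-like objects exposing the PyMuPDF page
    methods used in the diff. Hypothetical helper, not part of the PR.
    """
    pages = []
    for page in doc:
        if tessdata:
            # OCR branch: full=False only OCRs the parts without readable text
            textpage = page.get_textpage_ocr(full=False, tessdata=tessdata)
        else:
            # Fallback branch: regular text extraction
            textpage = page.get_textpage()
        pages.append(page.get_text(textpage=textpage))
    # Only the two keys visible in the hunk are reproduced here
    return {"all": "\n\n".join(pages), "first": pages[0] if pages else ""}


class StubPage:
    """Hypothetical stand-in for a PyMuPDF page, for tests without Tesseract."""

    def __init__(self, plain, ocr):
        self.plain = plain
        self.ocr = ocr

    def get_textpage_ocr(self, full=False, tessdata=None):
        return ("ocr", self.ocr)

    def get_textpage(self):
        return ("plain", self.plain)

    def get_text(self, textpage=None):
        # Return the text carried by whichever textpage was built
        return textpage[1]


doc = [
    StubPage("Invoice 1", "Invoice 1 (scanned)"),
    StubPage("Page 2", "Page 2 (scanned)"),
]
with_ocr = extract_pages_text(doc, "/placeholder/tessdata")  # OCR branch
without_ocr = extract_pages_text(doc, False)  # fallback branch
```

Passing `tessdata` in as an argument also means the lookup happens once per document instead of once per page, which may be worth considering for the loop in the diff.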