Extracted text does not match what is displayed #3925

Closed
rcx986635 opened this issue Oct 8, 2024 · 1 comment
Labels
not a bug (not a bug / user error / unable to reproduce)

Comments

@rcx986635

Description of the bug

I use this code to extract text:

import pymupdf
doc = pymupdf.Document('a.pdf', filetype='pdf')
out_text = doc[0].get_text()

Then I got '!"#$%&'()*+,-\n2023研究工作汇报\n汇\n报\n人\n:\n刘\n璘\n'

But the expected result is '工业安全大数据联合研究中心\n2023研究工作汇报\n汇\n报\n人\n:\n刘\n璘\n'

Here is the original PDF:
研究中心工作汇报_page_0.pdf
Is there a setting I should change? @JorjMcKie Thank you for your reply.

How to reproduce the bug

I used pip to install PyMuPDF 1.24.7

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.8

@JorjMcKie added the "not a bug" (not a bug / user error / unable to reproduce) label Oct 8, 2024
@JorjMcKie
Collaborator

The file link does not work.
But this is presumably not a bug:
Confirm via page.get_text(flags=0). If you see � symbols in the output, then the font's back-translation table (glyph to Unicode) is missing or incomplete. Setting flags=0 prevents MuPDF from trying to heal this by substituting a Unicode value derived from the glyph number.
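For anyone who wants to run that check, here is a minimal sketch. It assumes the file is named a.pdf as in the report; the helper names count_unmapped and compare_extraction are made up for illustration and are not part of PyMuPDF. It extracts the page once with the default flags and once with flags=0, then counts the U+FFFD replacement characters in each result.

```python
import sys

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, shown as the "�" symbol


def count_unmapped(text: str) -> int:
    """Count the U+FFFD replacement characters in an extracted string."""
    return text.count(REPLACEMENT_CHAR)


def compare_extraction(path: str, page_number: int = 0) -> dict:
    """Extract one page with default flags and with flags=0 (hypothetical helper)."""
    import pymupdf  # imported here so count_unmapped works without PyMuPDF installed

    doc = pymupdf.open(path)
    try:
        page = doc[page_number]
        default_text = page.get_text()      # default extraction flags
        raw_text = page.get_text(flags=0)   # all optional extraction flags switched off
    finally:
        doc.close()
    return {
        "default_text": default_text,
        "raw_text": raw_text,
        "unmapped_in_raw": count_unmapped(raw_text),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    result = compare_extraction(sys.argv[1])
    print(repr(result["default_text"]))
    print(repr(result["raw_text"]))
    print("U+FFFD count with flags=0:", result["unmapped_in_raw"])
```

Run it as a script with the PDF path as the only argument, for example a.pdf. A non-zero count with flags=0 would point to the font mapping problem described above rather than to an extraction setting.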

I expect my assumption is correct, so I am going to close the issue.
