Extracted text mismatch the vision #3925

rcx986635 · 2024-10-08T13:29:08Z

Description of the bug

I use this code to extract text:

import pymupdf
doc = pymupdf.Document('a.pdf', filetype='pdf')
out_text = doc[0].get_text()

Then I got '!"#$%&'()*+,-\n2023研究工作汇报\n汇\n报\n人\n：\n刘\n璘\n'

But the expected result is '工业安全大数据联合研究中心\n2023研究工作汇报\n汇\n报\n人\n：\n刘\n璘\n'

Here is the original pdf:
研究中心工作汇报_page_0.pdf
Is there any config I should change? @JorjMcKie Thank you for your reply.

How to reproduce the bug

I used pip to install PyMuPDF 1.24.7

PyMuPDF version

1.24.7

Operating system

Linux

Python version

3.8

JorjMcKie · 2024-10-08T14:01:20Z

The file link does not work.
But presumably not a bug:
Confirm via page.get_text(flags=0). If you see � symbols in the output, then the font's back-translation table (glyph to Unicode) is missing or incomplete. Setting flags=0 prevents MuPDF's attempt to heal this by providing a Unicode value derived from the glyph number.
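The check suggested above could be scripted roughly as follows. This is only a sketch: the helper names `count_unmapped` and `check_page` are hypothetical, the file name `a.pdf` is taken from the report, and the code assumes a PyMuPDF version that can be imported as `pymupdf`. It counts U+FFFD replacement characters in the `flags=0` output and returns the default extraction alongside for comparison.

```python
REPLACEMENT_CHAR = "\ufffd"  # the "�" symbol, used where no Unicode mapping is known


def count_unmapped(text: str) -> int:
    """Count replacement characters in extracted text."""
    return text.count(REPLACEMENT_CHAR)


def check_page(path: str, page_number: int = 0):
    """Return (number of unmapped characters with flags=0, default extraction)."""
    import pymupdf  # imported lazily so count_unmapped works without PyMuPDF

    doc = pymupdf.open(path)
    try:
        page = doc[page_number]
        raw = page.get_text(flags=0)  # no healing of missing Unicode values
        default = page.get_text()     # default flags, as used in the report
    finally:
        doc.close()
    return count_unmapped(raw), default


if __name__ == "__main__":
    unmapped, default_text = check_page("a.pdf")
    print(f"unmapped characters with flags=0: {unmapped}")
    print(repr(default_text))
```

A non-zero count would support the explanation that the font lacks a usable glyph-to-Unicode mapping, in which case the extracted characters cannot be expected to match what is rendered.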

I suppose my assumption is correct, so I am going to close the issue.

JorjMcKie added the "not a bug" label (not a bug / user error / unable to reproduce) Oct 8, 2024

JorjMcKie closed this as completed Oct 8, 2024

JorjMcKie mentioned this issue Oct 8, 2024

Extracted text mismatch the vision #3924

Closed
