The file link does not work.
But presumably not a bug:
Confirm via page.get_text(flags=0). If you see � symbols in the output, then the font's back-translation table (glyph to Unicode) is missing or incomplete. Setting flags=0 prevents MuPDF from trying to heal this by providing a Unicode derived from the glyph number.
I suppose my assumption is correct, so going to close the issue.
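For anyone who wants to run the check described above, here is a minimal sketch. It assumes the file is named a.pdf as in the report; the helper names unmapped_ratio and check_page are made up for illustration and are not part of PyMuPDF. It extracts one page with flags=0 and reports what share of the visible characters came back as U+FFFD (�).

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, shown as � when a glyph has no Unicode mapping


def unmapped_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def check_page(path: str, page_number: int = 0) -> float:
    """Extract one page with flags=0 and report the share of unmapped glyphs."""
    import pymupdf  # imported here so unmapped_ratio() works without PyMuPDF

    doc = pymupdf.open(path)
    try:
        # flags=0 switches off all extraction flags, as suggested above
        raw = doc[page_number].get_text(flags=0)
    finally:
        doc.close()
    return unmapped_ratio(raw)


if __name__ == "__main__":
    # Hypothetical invocation; "a.pdf" is the file name used in the report.
    print(f"unmapped share on page 0: {check_page('a.pdf'):.1%}")
```

A clearly non-zero share would point to a missing or incomplete mapping in the font rather than to a problem in the extraction code.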
Description of the bug
I use the following code to extract text:
import pymupdf
doc = pymupdf.Document('a.pdf', filetype='pdf')
out_text = doc[0].get_text()
Then I got '!"#$%&'()*+,-\n2023研究工作汇报\n汇\n报\n人\n:\n刘\n璘\n'
But the expected result is '工业安全大数据联合研究中心\n2023研究工作汇报\n汇\n报\n人\n:\n刘\n璘\n'
Here is the original pdf:
研究中心工作汇报_page_0.pdf
Is there any configuration I should change? @JorjMcKie Thank you for your reply.
How to reproduce the bug
I used pip to install PyMuPDF 1.24.7
PyMuPDF version
1.24.7
Operating system
Linux
Python version
3.8