DOC: Change extract-text.md example codes from using cm to tm #2432
base: main
This is my first pull request ever. Any comments?
Added by @stefan6419846 Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
I personally do not recommend extracting text using cm: we've observed many cases where the actual text and ordering are not valid.
@pubpub-zz This is an example in our docs that has stopped working correctly, maybe due to internal changes, or maybe it never worked at all. If anything, we should explain the new layout mode alongside the "classic" mode in the docs. But that can be part of another PR: our examples should work and yield correct results to avoid frustration.
By the way, in pypdf version 3 (I guess) documentation it uses
This was changed by @pubpub-zz in bcd85c4.
Hm, so maybe the example PDF was the problem? If that change was made to solve a problem, then to validate this assumption I should test it on many PDFs. I will do that later today because I have an assignment to do 😅.
Did you have a chance to further check this already?
Hey @etern4l-white are there any updates? 😇
Should fix #2881 as well.
As recommended in #2881 (comment), shouldn't we propose to use
Does this have any effect on the text size? According to the quote in #2881 (comment), the text size is somehow affected by the multiplied matrix.
TL;DR
Fixes #2431
Changed cm (current_matrix) to tm (text matrix).
Problem
In the extract-text documentation here, the example code snippets do not produce the correct output.
For example, the first code snippet should output the text of the table of contents, but it outputs nothing. The second code snippet is supposed to convert page 4 of the PDF to an SVG, including the text, but it only outputs empty fields.
Reason
The text coordinates should be read from the text matrix (tm) instead of the current matrix (cm).
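To make the distinction concrete: in PDF, both matrices are six-number affine transforms [a, b, c, d, e, f], and a text position is found by concatenating the text matrix with the current matrix. The following is a minimal plain-Python sketch of that concatenation under the PDF row-vector convention. The helper name `concat` and the sample values are made up for illustration; this is not pypdf's internal code.

```python
def concat(m1, m2):
    """Concatenate two PDF-style affine matrices [a, b, c, d, e, f] (m1 applied first)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


IDENTITY = [1, 0, 0, 1, 0, 0]

# Hypothetical values: a text run placed at (72, 400) in text space.
tm = [1, 0, 0, 1, 72, 400]

# With an identity current matrix, tm alone already gives the page position.
unscaled = concat(tm, IDENTITY)

# With a scaling current matrix, both the position and the effective
# text scale (elements a and d) change.
cm = [2, 0, 0, 2, 10, 20]
scaled = concat(tm, cm)
```

In this sketch, the translation components sit at indices 4 and 5 of the result. When cm is the identity, they equal tm[4] and tm[5], which may be why reading tm directly works for the example PDF. It also suggests why the multiplied matrix can affect the apparent text size.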
Solution/update
Changed cm to tm in the code snippets.
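A sketch of what the updated pattern could look like, assuming pypdf's `visitor_text` callback receives `(text, cm, tm, font_dict, font_size)`. The helper names `make_body_visitor` and `extract_body`, and the y bounds 50 and 720, are illustrative choices, not necessarily the exact snippet in the docs.

```python
def make_body_visitor(parts, y_min=50, y_max=720):
    """Build a visitor_text callback that keeps text whose y position is inside (y_min, y_max)."""

    def visitor_body(text, cm, tm, font_dict, font_size):
        # tm is the text matrix [a, b, c, d, e, f]; index 5 is the vertical offset.
        y = tm[5]
        if y_min < y < y_max:
            parts.append(text)

    return visitor_body


def extract_body(pdf_path, page_index):
    """Extract the text of one page, skipping header and footer regions."""
    # Imported lazily so the visitor above can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    parts = []
    reader.pages[page_index].extract_text(visitor_text=make_body_visitor(parts))
    return "".join(parts)
```

Calling `extract_body("example.pdf", 3)` (with a placeholder file name) would return the text of the fourth page between the two bounds. The bounds depend on the page layout and need adjusting per document.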