Transparent text selection #149

Closed
workcomplete opened this issue Sep 15, 2022 · 9 comments
Labels
documentation Improvements or additions to documentation workaround Temporary label for when something needs to be captured in documentation from closed tickets

Comments

@workcomplete

workcomplete commented Sep 15, 2022

I'll preface this by saying I am not sure if this is an issue with pdf-tools specifically, vs a more general issue with tesseract OCR/poppler. I originally posted this in discussions, but I don't think many people are using this feature atm.

When I select text (i.e. click and drag the mouse over a region of text) in pdf-tools, the text selection is not transparent. For pdf's that are created digitally this is fine, but for pdf's that are scanned and OCR'd with tesseract, the selected text becomes hidden behind the selection (see images below). Is there a way to change this behavior without re-OCRing the pdf with something like adobe acrobat (i.e. is it possible to make the text selection transparent in pdf-tools)?

What selected text looks like in a pdf generated from a website:
Screenshot from 2022-09-08 20-27-51

What selected text looks like in a scanned book that has been OCR'd with tesseract:

Screenshot from 2022-09-08 19-07-11

OCR text can still be copied and pasted and looks fine when markup is applied.

Screenshot from 2022-09-08 19-07-55

Originally posted by @workcomplete in #147

If I understand this post on stack exchange, poppler has already implemented transparent text selection?

https://tex.stackexchange.com/questions/565909/invisible-text-even-when-selected-in-evince

This same issue with pdf-tools (under previous maintainer) is mentioned here

https://gitlab.freedesktop.org/poppler/poppler/-/issues/157

@orgtre
Contributor

orgtre commented Sep 15, 2022

Ok, I can reproduce this using poppler 22.08 with the linn.pdf attached to the poppler issue report linked above. Still, highlighted text displays correctly. In Evince the behavior is exactly the same as in pdf-tools (non-transparent when selected, but transparent when highlighted). In several other non-poppler-based pdf readers the selections are transparent, which to me indicates that this is a poppler issue (despite the fact that the issue linked above is closed). Not sure if there is something we can do on the pdf-tools side.

@workcomplete
Author

OK well I found a workaround (below), but I agree that this seems to be a poppler issue of not handling 3 Tr as it is purported to in the chosen answer to the stackexchange question above.

I followed the instructions here up to step 2, where, instead of replacing 3 Tr with 1 Tr, I replaced 3 Tr with 7 Tr. This makes the text visible when selected (see image below). The remaining steps are not necessary. The downside is that I download many books from online resources, and 80% of the time they are already OCR'd with tesseract, so needing to do this for every text is a bit painful.

Screenshot from 2022-09-15 11-40-20
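
For context, Tr is the PDF operator that sets the text rendering mode: mode 3 draws the text invisibly (neither filled nor stroked), while mode 7 adds the text to the clipping path. The edit described above is a plain textual substitution in the uncompressed content streams. Below is a minimal Emacs Lisp sketch of that substitution on a string. The function name my-pdf-tr-3-to-7 and the sample stream are made up for illustration, and, like the manual edit, it does not verify that each "3 Tr" really is an operator rather than some other byte sequence.

```elisp
(defun my-pdf-tr-3-to-7 (stream)
  "Return STREAM with every literal \"3 Tr\" replaced by \"7 Tr\".
STREAM is assumed to be uncompressed PDF content-stream text, such
as qpdf writes in QDF mode.  Hypothetical helper for illustration;
not part of pdf-tools or qpdf.el."
  ;; Match case-sensitively.  The replacement has the same length as
  ;; the match, so the length of the stream does not change.
  (let ((case-fold-search nil))
    (replace-regexp-in-string "3 Tr" "7 Tr" stream t t)))

;; Example on a made-up content-stream fragment:
(my-pdf-tr-3-to-7 "BT 3 Tr /F1 12 Tf (hidden OCR text) Tj ET")
;; => "BT 7 Tr /F1 12 Tf (hidden OCR text) Tj ET"
```

Because "3 Tr" and "7 Tr" have the same length, the stream lengths recorded in the file stay valid, which may be why the remaining steps of the linked instructions turned out to be unnecessary.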

@orgtre
Contributor

orgtre commented Sep 15, 2022

This is a very useful discovery and should probably be communicated somewhere else too!

To make the first step slightly easier one could use my qpdf transient wrapper qpdf.el. In fact the whole procedure can be automated with the following command:

(defun my-fix-pdf-selection ()
  "Replace pdf with one where selection shows transparently."
  (interactive)
  (unless (equal (file-name-extension (buffer-file-name)) "pdf")
    (error "Buffer should visit a pdf file."))
  (unless (equal major-mode 'pdf-view-mode)
    (pdf-view-mode))
  ;; save file in QDF-mode
  (qpdf-run (list
             (concat "--infile="
                     (buffer-file-name))
             "--qdf --object-streams=disable"
             "--replace-input"))
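  ;; do replacements on the raw QDF text
  (text-mode)
  (read-only-mode -1)
  ;; start at the top so that no occurrence before point is missed
  (goto-char (point-min))
  (while (re-search-forward "3 Tr" nil t)
    (replace-match "7 Tr" nil nil))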
  (save-buffer)
  (pdf-view-mode))

This still depends on qpdf.el, but with a few extra lines of code one could avoid that. Note that this overwrites the pdf file visited in the buffer from which it is run! To avoid this, replace "--replace-input" with (concat "--outfile=" (file-truename (read-file-name "Outfile: "))). A caveat is that in the pdf's I tested, selecting is substantially slower after running my-fix-pdf-selection, and running fix-qdf doesn't help.
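
As a rough idea of what a version without qpdf.el could look like, here is a sketch that calls the qpdf executable directly and writes to a separate output file, leaving the input untouched. The function name my-fix-pdf-selection-copy is made up, the sketch assumes qpdf is on the PATH, and it does not refresh any pdf-view buffer, so it is not a drop-in replacement for the command above.

```elisp
(defun my-fix-pdf-selection-copy (infile outfile)
  "Write to OUTFILE a QDF copy of INFILE with \"3 Tr\" replaced by \"7 Tr\".
Sketch that calls the qpdf executable directly instead of going
through qpdf.el.  INFILE itself is left untouched."
  (interactive
   (list (buffer-file-name)
         (read-file-name "Outfile: ")))
  (let ((in (expand-file-name infile))
        (out (expand-file-name outfile)))
    ;; `call-process' runs qpdf without a shell and passes each
    ;; argument separately, so file names need no quoting.
    (let ((status (call-process "qpdf" nil nil nil
                                "--qdf" "--object-streams=disable"
                                in out)))
      ;; qpdf exits with 0 on success and 3 on success with warnings.
      (unless (memq status '(0 3))
        (error "qpdf exited with status %s" status)))
    ;; Edit the raw bytes of OUTFILE in a unibyte temporary buffer.
    (with-temp-buffer
      (set-buffer-multibyte nil)
      (insert-file-contents-literally out)
      (goto-char (point-min))
      (let ((case-fold-search nil))
        (while (search-forward "3 Tr" nil t)
          (replace-match "7 Tr" t t)))
      (let ((coding-system-for-write 'no-conversion))
        (write-region (point-min) (point-max) out)))
    out))
```

Passing the arguments as a list to call-process sidesteps shell quoting altogether, at the cost of losing whatever extra handling qpdf.el does after qpdf finishes.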

@workcomplete workcomplete reopened this Sep 19, 2022
@workcomplete
Author

workcomplete commented Sep 19, 2022

Great, thanks for sharing your package and function.

Reopening as it could be helpful if your function was added to the wiki, or at the very least if this could be added to known issues.

One small issue with your function above: when tested it only worked on file names that do not contain spaces. Not a major issue for me as I use zotfile/zotero to rename my pdfs.

And FWIW I have not noticed a difference in text selection lag between "fixed" and unfixed pdfs; both are quite slow for me (compared to other PDF readers), as noted in #87 (comment)

@orgtre
Contributor

orgtre commented Sep 19, 2022

Thanks for testing. I fixed the issue with spaces in file names, plus added the problem and workaround to known problems in the README. If I understand correctly, once the pull request is merged, pdftools.wiki (which just mirrors the README) will rebuild automatically to reflect the changes. I tried to work around the qpdf.el dependency, but to make it behave correctly in all cases, basically the whole qpdf--default-run-after-function is needed, so I decided to keep the dependency after all.

@workcomplete
Author

workcomplete commented Oct 3, 2022

Thanks again for your work on this! FYI the function recently stopped working on my setup; it appears to be inserting a trailing '' in the qpdf command.

Minibuffer contents after running my-fix-pdf-selection:

call: qpdf '/home/user/Zotero/storage/Z4JZ82SF/some_file.pdf' --qdf --object-streams=disable --replace-input ''

When I copy this into a terminal window and remove the trailing '' the command runs as expected.

@vedang
Owner

vedang commented Oct 3, 2022

@workcomplete : ^ This might be a bug introduced in the latest commit on qpdf: orgtre/qpdf.el@6debdce cc: @orgtre . I'm saying this after a cursory look at the linked commit, which adds quotes around outfile (and in the command above there is no outfile argument). I might be wrong, I haven't actually tested the code.

I am going ahead and merging the PR which explains this as a known problem (which will also update the wiki).

Thank you!

@vedang
Owner

vedang commented Oct 3, 2022

closed via aec8ecd

@vedang vedang closed this as completed Oct 3, 2022
@vedang vedang added workaround Temporary label for when something needs to be captured in documentation from closed tickets documentation Improvements or additions to documentation labels Oct 3, 2022
vedang added a commit to vedang/qpdf.el that referenced this issue Oct 3, 2022
This change fixes the bug pointed out in vedang/pdf-tools#149.

When `--replace-input` is used and no `outfile` is provided, an empty
quote (`''`) is inserted into the call to `qpdf`. This change guards
against that.
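
To illustrate the kind of guard the commit message describes: on GNU/Linux, (shell-quote-argument "") evaluates to '', so quoting an unset or empty outfile is one plausible way to end up with that stray argument. The sketch below only appends the outfile when it is a non-empty string. The function my-qpdf-command-string is hypothetical and is not the actual qpdf.el code.

```elisp
(defun my-qpdf-command-string (infile options &optional outfile)
  "Return a shell command string that runs qpdf on INFILE with OPTIONS.
OPTIONS is a list of option strings.  OUTFILE is appended only
when it is a non-empty string.  Hypothetical helper for
illustration; not the actual qpdf.el code."
  (let ((args (append (list infile)
                      options
                      ;; Guard: skip a nil or empty OUTFILE so that no
                      ;; quoted empty argument ends up in the command.
                      (and (stringp outfile)
                           (not (equal outfile ""))
                           (list outfile)))))
    ;; Quote every argument separately, so that file names with
    ;; spaces survive the trip through the shell.
    (mapconcat #'shell-quote-argument (cons "qpdf" args) " ")))

;; With an empty outfile the command ends at the last option:
(my-qpdf-command-string "/tmp/some file.pdf"
                        '("--qdf" "--replace-input")
                        "")
```

Quoting each argument on its own also covers the earlier problem with spaces in file names.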
@orgtre
Contributor

orgtre commented Oct 3, 2022

I merged vedang's fix. Thanks both of you!
