Add support for extracting URIs from page link actions. #68

Closed
reyjexter opened this issue Feb 10, 2023 · 12 comments

@reyjexter

What's the correct way to extract links from a page? There appears to be an implementation for handling FPDFLink in bindings.rs already, but it's not being used anywhere.

Thanks

@ajrcarey
Owner

ajrcarey commented Feb 10, 2023

Hi @reyjexter, thank you for raising the issue.

Pdfium defines two types of links, so let's just make sure we're talking about the same thing :)

  1. FPDF_LINK, a link normally attached to an annotation, although these can be placed freely on a page as well
  2. FPDF_PAGELINK, a collection of clickable links automatically created by Pdfium as a courtesy when it sees any text strings that look like URLs inside the text objects on a page

I assume we're talking about the first type of link, not the second. You're right that the bindings to the FPDF_* functions related to links have already been added to bindings.rs. They're not hooked up to the high level interface yet, however.

Looking at the functions provided by Pdfium, it seems to me that FPDFLink_Enumerate() lets us iterate over all the links attached to a page. It should be relatively easy to create a collection in the PdfPages struct that provides access to this. I'm away from home at the moment, but I will look into adding this for you early next week when I return.
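As a rough sketch of the idea above, a cursor-style enumeration call can be wrapped in a Rust iterator. Everything here is a stand-in: `MockPage`, `mock_link_enumerate`, `LinkIter`, and `collect_links` are hypothetical names, and the mock only imitates the general shape of FPDFLink_Enumerate() (a position that is advanced on each call, and a boolean that turns false when the links run out). It is not the pdfium-render implementation.

```rust
// Hypothetical stand-in for a page: just a list of link targets.
struct MockPage {
    links: Vec<&'static str>,
}

// Mock with a cursor-style shape: writes the next link into `out`,
// advances `start_pos`, and returns false once no links remain.
// This is NOT a binding to Pdfium; it only imitates the calling pattern.
fn mock_link_enumerate(
    page: &MockPage,
    start_pos: &mut usize,
    out: &mut Option<&'static str>,
) -> bool {
    match page.links.get(*start_pos) {
        Some(link) => {
            *out = Some(*link);
            *start_pos += 1;
            true
        }
        None => false,
    }
}

// Iterator that hides the cursor from callers.
struct LinkIter<'a> {
    page: &'a MockPage,
    pos: usize,
}

impl<'a> Iterator for LinkIter<'a> {
    type Item = &'static str;

    fn next(&mut self) -> Option<Self::Item> {
        let mut out = None;
        if mock_link_enumerate(self.page, &mut self.pos, &mut out) {
            out
        } else {
            None
        }
    }
}

// Collects every link on the page by driving the iterator to exhaustion.
fn collect_links(page: &MockPage) -> Vec<&'static str> {
    LinkIter { page, pos: 0 }.collect()
}
```

Calling `collect_links(&page)` on a `MockPage` yields the links in page order. Keeping the cursor inside the iterator means callers never have to manage the position themselves.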

@ajrcarey
Owner

PS I'm not sure I have a good sample document that has a bunch of links in it, so if you have some sample PDF files you can share, please do attach them to this issue.

@reyjexter
Author

reyjexter commented Feb 10, 2023

Thanks for the explanation of the difference between the two. I think it is indeed the first one.

I'm not particularly familiar with Pdfium yet, but here's an example PDF:

example.pdf

Note: I made edits and uploaded the file here instead.

@reyjexter
Author

I made some changes in a forked repo, in this commit, which doesn't use FPDFLink_Enumerate and does work for what we need, but I don't know whether this is correct:

master...reyjexter:pdfium-render:master

This only adds an extra function to the existing PdfPageLinkAnnotation module.

@ajrcarey ajrcarey self-assigned this Feb 18, 2023
@ajrcarey ajrcarey changed the title from "Extract page link" to "Add support for extracting URIs from page link actions." Feb 18, 2023
ajrcarey pushed a commit that referenced this issue Feb 18, 2023
@ajrcarey
Owner

ajrcarey commented Feb 18, 2023

Yes, I think that's good. Just be aware that the URI text is actually 7-bit ASCII rather than UTF-16LE; no doubt your code works, but with a variety of string types supported by Pdfium, it may be worth updating your comments so as not to create confusion in future.
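To illustrate why the encoding matters, here is a minimal sketch that decodes the same kind of byte buffer both ways. The helper names `decode_ascii` and `decode_utf16le` are hypothetical, and the sketch assumes a NUL-terminated buffer; it does not call Pdfium or pdfium-render.

```rust
/// Decodes a NUL-terminated 7-bit ASCII buffer.
/// Returns None if any byte before the terminator is outside the ASCII range.
fn decode_ascii(buf: &[u8]) -> Option<String> {
    // Stop at the first NUL, or use the whole buffer if there is none.
    let end = buf.iter().position(|&b| b == 0).unwrap_or(buf.len());
    let bytes = &buf[..end];
    if bytes.is_ascii() {
        Some(bytes.iter().map(|&b| b as char).collect())
    } else {
        None
    }
}

/// Decodes a NUL-terminated UTF-16LE buffer (two bytes per code unit).
/// Returns None if the code units are not valid UTF-16.
fn decode_utf16le(buf: &[u8]) -> Option<String> {
    let units: Vec<u16> = buf
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .take_while(|&unit| unit != 0)
        .collect();
    String::from_utf16(&units).ok()
}
```

Feeding ASCII bytes to `decode_utf16le` pairs adjacent characters into single code units, so the result is not the original text. That is the sort of confusion a misleading comment about the string type could cause later.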

Because the FPDF_LINK and FPDF_ACTION types can be used in a variety of different places, I have taken a slightly different approach and broken these out into separate structs in the pdfium-render interface. Additionally, I have fleshed out the object hierarchy to support all action types, and added the ability to iterate over all links in the page using the new PdfPage::links() collection. It is not quite as succinct as your approach, but allows for supporting all link and action types in the future, not just URI actions.

An example using the functionality is in examples/links.rs. This example demonstrates retrieving the URI path, so it should address your use-case. That said, if your fork works for you, you may as well stick with it :)

@reyjexter
Author

reyjexter commented Feb 20, 2023

Thanks for this, as well as the feedback and pointers. We will definitely be using the library released from this main repository.

Is examples/links.rs supposed to work on WebAssembly? I tried updating our code to the version on the master branch and it's not returning the links in the example PDF. The one difference in what we are doing is that we pass a Blob instead of a URL, though annotations and text are read correctly. Here's the demo application I created:

https://github.com/reyjexter/pdfium-render-wasm/blob/master/src/lib.rs

Also, with these new functions, how do you find the link(s) associated with an annotation? What we are trying to create is a struct of annotation text and link, and return that information to the browser together.

Thanks again!

@ajrcarey
Owner

ajrcarey commented Feb 20, 2023

Yes, all functionality should work on WASM except where explicitly marked otherwise. Let me see if I can reproduce the problem.

To retrieve the link associated with a link annotation, you first get the annotation from PdfPage::annotations(), unwrap it as a link annotation using PdfPageAnnotation::as_link_annotation(), and from there you should be able to retrieve the link itself using PdfPageLinkAnnotation::link().

So, for example, something like:

for annotation in page.annotations().iter() {
    // Only link annotations carry a link; skip every other annotation type.
    if let Some(link_annotation) = annotation.as_link_annotation() {
        if let Some(link) = link_annotation.link() {
            // ... do something with the link ...
        }
    }
}

The text content (if any) inside the annotation can be retrieved for all annotation types (not just link annotations) using PdfPageAnnotationCommon::content(). This is implemented for all PdfPageAnnotation types.
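For the "struct of annotation text and link" use case, the same nested pattern can collect pairs into a list. The sketch below uses stand-in types only: `MockAnnotation`, `AnnotatedLink`, `link_uri()`, and `collect_annotated_links` are hypothetical and are not part of pdfium-render. The mock `as_link_annotation()` merely mirrors the Option-returning shape described above.

```rust
/// The pair we want to hand back to the caller (hypothetical type).
#[derive(Debug, PartialEq)]
struct AnnotatedLink {
    text: String,
    uri: String,
}

/// Hypothetical stand-in for a page annotation; not a pdfium-render type.
struct MockAnnotation {
    content: Option<String>,
    is_link: bool,
    link_uri: Option<String>,
}

impl MockAnnotation {
    /// Returns Some only when this annotation is a link annotation.
    fn as_link_annotation(&self) -> Option<&MockAnnotation> {
        if self.is_link {
            Some(self)
        } else {
            None
        }
    }

    /// Returns the URI of the attached link, if there is one.
    fn link_uri(&self) -> Option<&str> {
        self.link_uri.as_deref()
    }
}

/// Walks all annotations and keeps those that are links with a URI,
/// pairing each URI with the annotation's text content (empty if absent).
fn collect_annotated_links(annotations: &[MockAnnotation]) -> Vec<AnnotatedLink> {
    let mut out = Vec::new();
    for annotation in annotations {
        if let Some(link_annotation) = annotation.as_link_annotation() {
            if let Some(uri) = link_annotation.link_uri() {
                out.push(AnnotatedLink {
                    text: annotation.content.clone().unwrap_or_default(),
                    uri: uri.to_string(),
                });
            }
        }
    }
    out
}
```

Non-link annotations and link annotations without a URI are skipped, so the result only contains entries where both halves of the pair make sense.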

@reyjexter
Author

reyjexter commented Feb 20, 2023

The code to extract the link from an annotation works well. Thanks!

ajrcarey pushed a commit that referenced this issue Feb 20, 2023
@ajrcarey
Owner

I can reproduce your WASM problem. This was due to a small bug in memory allocation in the WASM implementation of FPDFLink_Enumerate(). I have fixed this and updated the WASM example code at examples/wasm.rs so it outputs the links in the test document to the console.

@ajrcarey
Owner

Made small adjustments to documentation. Bumped crate version in Cargo.toml to 0.7.31.

@reyjexter
Author

Thanks, and I can confirm this works correctly now.

@ajrcarey
Owner

Excellent. Publishing to crates.io as release 0.7.31.
