Add support for extracting URIs from page link actions. #68

reyjexter · 2023-02-10T16:13:34Z

What's the correct way to extract links from a page? There appears to be an implementation for handling FPDFLink in bindings.rs already, but it's not being used anywhere.

Thanks

ajrcarey · 2023-02-10T16:51:32Z

Hi @reyjexter, thank you for raising the issue.

Pdfium defines two types of links, so let's just make sure we're talking about the same thing :)

FPDF_LINK, a link normally attached to an annotation, although these can be placed freely on a page as well
FPDF_PAGELINK, a collection of clickable links automatically created by Pdfium as a courtesy when it sees any text strings that look like URLs inside the text objects on a page

I assume we're talking about the first type of link, not the second. You're right that the bindings to the FPDF_* functions related to links have already been added to bindings.rs. They're not hooked up to the high level interface yet, however.

Looking at the functions provided by Pdfium, it seems to me that FPDFLink_Enumerate() lets us iterate over all the links attached to a page. It should be relatively easy to create a collection in the PdfPages struct that provides access to this. I'm away from home at the moment, but I will look into adding this for you early next week when I return.

ajrcarey · 2023-02-10T16:52:37Z

PS I'm not sure I have a good sample document that has a bunch of links in it, so if you have some sample PDF files you can share, please do attach them to this issue.

reyjexter · 2023-02-10T17:35:41Z

Thanks for explaining the difference between the two. I think it is indeed the first one.

I'm not particularly familiar with Pdfium yet, but here's an example PDF:

example.pdf

Note: I made edits and uploaded the file here instead.

reyjexter · 2023-02-14T15:04:59Z

I made some changes in a forked repo, in the commit below. It doesn't use FPDFLink_Enumerate and it works for what we need, but I don't know whether it is correct:

master...reyjexter:pdfium-render:master

This only adds an extra function to the existing PdfPageLinkAnnotation module.

ajrcarey · 2023-02-18T21:40:32Z

Yes, I think that's good. Just be aware that the URI text is actually 7-bit ASCII rather than UTF-16LE; no doubt your code works, but with a variety of string types supported by Pdfium, it may be worth updating your comments so as not to create confusion in future.
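To illustrate the encoding point, here is a minimal sketch of the difference between decoding a NUL-terminated 7-bit ASCII buffer and a UTF-16LE buffer. The function names `decode_ascii_uri` and `decode_utf16le` are hypothetical helpers for illustration only; they are not part of pdfium-render or Pdfium, and the sketch assumes the buffer has already been filled by the relevant Pdfium call.

```rust
/// Hypothetical helper: decodes a NUL-terminated 7-bit ASCII buffer into a String.
/// Returns None if any byte before the terminator is outside the ASCII range.
fn decode_ascii_uri(buffer: &[u8]) -> Option<String> {
    // Stop at the first NUL terminator, or use the whole buffer if there is none.
    let end = buffer.iter().position(|&b| b == 0).unwrap_or(buffer.len());
    let bytes = &buffer[..end];

    if bytes.is_ascii() {
        // ASCII is a subset of UTF-8, so this conversion succeeds for ASCII input.
        String::from_utf8(bytes.to_vec()).ok()
    } else {
        None
    }
}

/// Hypothetical helper: decodes a UTF-16LE buffer, stopping at a zero code unit.
/// Shown only for comparison with the ASCII case above.
fn decode_utf16le(buffer: &[u8]) -> Option<String> {
    let units: Vec<u16> = buffer
        .chunks_exact(2) // a trailing odd byte, if any, is ignored
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .take_while(|&unit| unit != 0)
        .collect();

    String::from_utf16(&units).ok()
}
```

Passing an ASCII buffer to the UTF-16LE decoder pairs up adjacent bytes into single code units, so the result is not the original URI. That is why it helps for comments to name the encoding each buffer actually uses.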

Because the FPDF_LINK and FPDF_ACTION types can be used in a variety of different places, I have taken a slightly different approach and broken these out into separate structs in the pdfium-render interface. Additionally, I have fleshed out the object hierarchy to support all action types, and added the ability to iterate over all links in the page using the new PdfPage::links() collection. It is not quite as succinct as your approach, but allows for supporting all link and action types in the future, not just URI actions.

An example using the functionality is in examples/links.rs. This example demonstrates retrieving the URI path, so it should address your use-case. That said, if your fork works for you, you may as well stick with it :)

reyjexter · 2023-02-20T14:29:18Z

Thanks for this, and for the feedback and pointers. We will definitely be using the library as released from this main repository.

Is examples/links.rs supposed to work on WebAssembly? I tried updating our code to the version on the master repo, and it's not returning the links in the example PDF. The difference in what we are doing is that we pass a Blob instead of a URL, although annotations and text are being read correctly. Here's the demo application I created:

https://github.com/reyjexter/pdfium-render-wasm/blob/master/src/lib.rs

Also, with these new functions, how do you find the link(s) associated with an annotation? What we are trying to create is a struct holding the annotation text and link, so that we can return that information to the browser together.

Thanks again!

ajrcarey · 2023-02-20T14:38:52Z

Yes, all functionality should work on WASM except where explicitly marked otherwise. Let me see if I can reproduce the problem.

To retrieve the link associated with a link annotation, you first get the annotation from PdfPage::annotations(), unwrap it as a link annotation using PdfPageAnnotation::as_link_annotation(), and from there you should be able to retrieve the link itself using PdfPageLinkAnnotation::link().

So, for example, something like:

for annotation in page.annotations().iter() {
    // Only link annotations can carry a link; skip every other annotation type.
    if let Some(annotation) = annotation.as_link_annotation() {
        if let Some(link) = annotation.link() {
            // ... do something with the link ...
        }
    }
}

The text content (if any) inside the annotation can be retrieved for all annotation types (not just link annotations) using PdfPageAnnotationCommon::content(). This is implemented for all PdfPageAnnotation types.

reyjexter · 2023-02-20T14:56:16Z

The code to extract the link from an annotation works well. Thanks!

ajrcarey · 2023-02-20T19:46:57Z

I can reproduce your WASM problem. This was due to a small bug in memory allocation in the WASM implementation of FPDFLink_Enumerate(). I have fixed this and updated the WASM example code at examples/wasm.rs so it outputs the links in the test document to the console.

ajrcarey · 2023-02-20T19:47:26Z

Made small adjustments to documentation. Bumped crate version in Cargo.toml to 0.7.31.

reyjexter · 2023-02-22T11:53:18Z

Thanks and I can also confirm this works correctly now.

ajrcarey · 2023-02-22T11:57:08Z

Excellent. Publishing to crates.io as release 0.7.31.

ajrcarey self-assigned this Feb 18, 2023

ajrcarey changed the title ~~Extract page link~~ Add support for extracting URIs from page link actions. Feb 18, 2023

ajrcarey pushed a commit that referenced this issue Feb 18, 2023

Progressing #68

66cbfd7

ajrcarey mentioned this issue Feb 18, 2023

Standardise lifetime and reference handling for child objects inside PdfDocument. #47

Closed

ajrcarey pushed a commit that referenced this issue Feb 20, 2023

Progressing #68

77c7169

ajrcarey closed this as completed Feb 22, 2023

ajrcarey mentioned this issue Mar 10, 2023

Improve performance enumerating page links. #75

Closed
