mets:file URL handling: keep remote links #323

Closed
bertsky opened this issue Sep 24, 2019 · 7 comments

Comments

@bertsky
Collaborator

bertsky commented Sep 24, 2019

Currently, workspaces offer two options. We can keep images on the remote side by using http URLs in mets:file/mets:FLocat/@xlink:href, which means they have to be downloaded again and again during processing. Or we can get local filesystem copies with relative paths by cloning with download=True or by bagging and spilling, but then the source information is lost forever.

When processing is finished and I want to make my workspace public, I currently have to upload my shiny new results in addition to the original images, which I might not even have the rights to publish myself. It would be much better if the original remote URLs were used again for that, even if I used local copies in between.

METS-XML allows that: mets:FLocat is declared with maxOccurs="unbounded" within mets:file, with the following documented semantics:

> The file element provides access to content files for a METS object. A file element may contain one or more FLocat elements, which provide pointers to a content file, and/or an FContent element, which wraps an encoded version of the file. Note that ALL FLocat and FContent elements underneath a single file element should identify/contain identical copies of a single file.

So why don't we keep 2 FLocat elements in that case, one relative path for local processing and one remote URL for provenance/bookkeeping? When making results public, the local copies could be disposed of again, e.g. when bagging with --manifestation-depth=partial.
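
A minimal sketch of what such a mets:file could look like, and how a consumer might pick the right pointer. The attribute values used to tell the two FLocat elements apart (LOCTYPE="URL" for the remote link, LOCTYPE="OTHER" with OTHERLOCTYPE="FILE" for the local copy), the file ID, the paths, and the helper name `locations` are assumptions made for illustration; the proposal itself does not fix them.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical mets:file carrying both a remote URL and a local relative path.
METS_FILE = """
<mets:file xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink"
           ID="OCR-D-IMG_0001" MIMETYPE="image/tiff">
  <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/images/0001.tif"/>
  <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-IMG/0001.tif"/>
</mets:file>
"""


def locations(file_el):
    """Return (remote_url, local_path) for a mets:file; either may be None."""
    remote = local = None
    for flocat in file_el.findall("mets:FLocat", NS):
        href = flocat.get(HREF)
        if flocat.get("LOCTYPE") == "URL":
            remote = href  # provenance / bookkeeping
        else:
            local = href   # assumed to be the local working copy
    return remote, local


file_el = ET.fromstring(METS_FILE.strip())
remote_url, local_path = locations(file_el)
```

During processing a tool would prefer `local_path` and fall back to downloading `remote_url` only when no local copy is recorded.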

@bertsky
Collaborator Author

bertsky commented Sep 24, 2019

Oh, BTW, this would also offer a chance to write the original remote URL into PAGE's imageFilename again when publishing/persisting.
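
As a rough sketch of that idea: given a mapping from local paths to their original URLs (which the second FLocat would provide), a publishing step could rewrite Page/@imageFilename. The PAGE namespace version, the sample paths, and the function name `restore_image_filename` are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Assumed PAGE namespace version; adjust to the one your files use.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

PAGE_XML = f"""
<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="OCR-D-IMG/0001.tif" imageWidth="2000" imageHeight="3000"/>
</PcGts>
"""


def restore_image_filename(pcgts, local_to_remote):
    """Replace a local imageFilename with its remote URL, if one is known.

    Returns True when the attribute was changed.
    """
    page = pcgts.find(f"{{{PAGE_NS}}}Page")
    if page is None:
        return False
    remote = local_to_remote.get(page.get("imageFilename"))
    if remote is None:
        return False  # no remote origin recorded, leave as-is
    page.set("imageFilename", remote)
    return True


pcgts = ET.fromstring(PAGE_XML.strip())
changed = restore_image_filename(
    pcgts, {"OCR-D-IMG/0001.tif": "https://example.org/images/0001.tif"}
)
```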

@kba
Member

kba commented Oct 9, 2019

That's a great proposal and would also be an option to keep @imageFilename and @xlink:href in check.

The idea behind the local_filename property was much the same: To have a local copy of a potentially-remote URL. Another idea was to use multiple FLocat as you propose but there were reasons why we decided against it. @maria-federbusch @cneud @tboenig I cannot seem to find the discussion in the issues in core or spec, do you remember where we documented this? IIRC (and I might not), there was a limitation in Goobi/Kitodo or maybe in the ZVDD METS Profile to use only one FLocat?

Apart from that, I'm open to the idea, but it will take some time because we have to change file handling in a few places for this (much like your AlternativeImage work, with additional checks and new possible points of failure in the logic).

@kba
Member

kba commented Oct 9, 2019

Yes, the ZVDD guidelines are pretty restrictive. FLocat isn't repeatable. But they also say @xlink:href must be a URL, a rule I tried to defend for the longest time but which we no longer abide by. So maybe repeated FLocat would be less intrusive than changing xlink:href in a destructive way as we do now...

@kba
Member

kba commented Oct 9, 2019

We could also implement the local_filename stuff as additional FLocat as you propose and have a processor that strips the METS down to ZVDD requirements.
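
Such a stripping step might look roughly like the following sketch. It assumes the remote pointer is marked LOCTYPE="URL" and everything else is a local copy; the function name `strip_local_flocats` and the sample file groups are made up for the example and are not taken from any existing processor.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
FILE = f"{{{METS_NS}}}file"
FLOCAT = f"{{{METS_NS}}}FLocat"
HREF = f"{{{XLINK_NS}}}href"

SAMPLE = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/images/0001.tif"/>
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-OCR">
      <mets:file ID="OCR_0001" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-OCR/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def strip_local_flocats(root):
    """Remove non-URL FLocat elements from files that also have a URL FLocat.

    Files with only a local FLocat (e.g. new results) are left untouched.
    Returns the number of FLocat elements removed.
    """
    removed = 0
    # Materialize the iterator first, since we mutate the tree while looping.
    for file_el in list(root.iter(FILE)):
        flocats = file_el.findall(FLOCAT)
        urls = [f for f in flocats if f.get("LOCTYPE") == "URL"]
        if not urls:
            continue
        for flocat in flocats:
            if flocat not in urls:
                file_el.remove(flocat)
                removed += 1
    return removed


root = ET.fromstring(SAMPLE.strip())
removed = strip_local_flocats(root)
```

After this pass every file that had a remote origin carries exactly one FLocat again. Files that exist only locally still need an institution-specific publishing step before the METS would satisfy a URL-only profile.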

@bertsky
Collaborator Author

bertsky commented Oct 9, 2019

> We could also implement the local_filename stuff as additional FLocat as you propose and have a processor that strips the METS down to ZVDD requirements.

Sounds good to me. Stripping down or publishing non-persistable parts (and probably ingesting provenance data) would always be one necessary last processor (and probably an institution-specific one), right?

@bertsky
Collaborator Author

bertsky commented Oct 9, 2020

Should be revisited now that the OLA-HD client has arrived.

@kba
Member

kba commented Nov 20, 2023

This has since been implemented in #1079 and released in v2.54.0.

kba closed this as completed Nov 20, 2023