Feature: validate file formats against format specification #63
Comments
veraPDF provides two "apps": a Java GUI and a CLI. From the screenshot it looks like the Java GUI requires a windowing system (e.g. X11) and is built for human interaction, so for our purposes the CLI seems like the best fit.

I'd rather not install veraPDF in our Enduro worker container and then call it from the Enduro (Go) worker. While a local installation may be the simplest way to implement veraPDF validation, it bloats our worker container with the Java runtime (JRE) and the veraPDF code, and it doesn't follow the Docker best practice of decoupling applications.

I experimented today with running the veraPDF CLI Docker image using a Tilt button in my local development environment. The veraPDF Docker image builds the CLI app, then runs it immediately: it takes a positional argument that is the path to a PDF, writes the validation results to STDOUT, and exits. For our purposes this doesn't work very well because (a) the container only runs for as long as it takes the verapdf command to run and (b) we can't directly run the verapdf executable from within an enduro-worker container.

After Googling and checking ChatGPT, there are a few different ways to run veraPDF in a separate container from an Enduro worker:
We discussed this in our sync time this morning and decided to just add the veraPDF CLI and its dependencies to the preprocessing-worker container. This solution is the simplest to implement and will be the most similar to deploying the veraPDF CLI in SFA's environment.
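For reference, here is a minimal sketch of what calling a locally installed veraPDF CLI from a Go worker could look like. It is not taken from the actual implementation: the names `validatePDF` and `parseCompliance` are hypothetical, and the sketch assumes that the default output on STDOUT is an XML report whose `validationReport` elements carry an `isCompliant` attribute, and that a non-zero exit code can simply mean "non-compliant". Both assumptions should be checked against the installed veraPDF version.

```go
package main

import (
	"bytes"
	"context"
	"encoding/xml"
	"errors"
	"fmt"
	"os/exec"
)

// report mirrors only the parts of the veraPDF XML report this sketch reads.
// Element and attribute names are assumptions about the report layout.
type report struct {
	Jobs []struct {
		ValidationReport struct {
			IsCompliant bool `xml:"isCompliant,attr"`
		} `xml:"validationReport"`
	} `xml:"jobs>job"`
}

// parseCompliance returns true only if the report contains at least one job
// and every job is marked compliant. A job without a validationReport
// element is treated as non-compliant.
func parseCompliance(data []byte) (bool, error) {
	var r report
	if err := xml.Unmarshal(data, &r); err != nil {
		return false, fmt.Errorf("decode report: %w", err)
	}
	if len(r.Jobs) == 0 {
		return false, errors.New("report contains no jobs")
	}
	for _, j := range r.Jobs {
		if !j.ValidationReport.IsCompliant {
			return false, nil
		}
	}
	return true, nil
}

// validatePDF runs the CLI with a single positional argument (the PDF path)
// and interprets the report written to STDOUT.
func validatePDF(ctx context.Context, bin, path string) (bool, error) {
	var stdout, stderr bytes.Buffer
	cmd := exec.CommandContext(ctx, bin, path)
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr

	err := cmd.Run()
	var exitErr *exec.ExitError
	if err != nil && !errors.As(err, &exitErr) {
		// The binary could not be started at all (e.g. not installed).
		return false, fmt.Errorf("run %s: %w", bin, err)
	}
	// Assumption: a non-zero exit may only signal a non-compliant file,
	// so the report is still parsed in that case.
	ok, perr := parseCompliance(stdout.Bytes())
	if perr != nil {
		return false, fmt.Errorf("parse report (stderr: %q): %w", stderr.String(), perr)
	}
	return ok, nil
}
```

A worker activity could call `validatePDF(ctx, "verapdf", path)` for each PDF in the SIP and fail the transfer when it returns false. Keeping the parsing separate from the process call makes the report handling testable without a Java runtime.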
PDF/A validation with veraPDF is added by 458d34b.
This is looking good. When a SIP containing PDFs is transferred, the Validate SIP Files activity is checking their conformance using the default command in veraPDF (auto-detect flavour). (This means that transfers with PDFs are currently failing because the PDF/As are nonconformant, which is an expected result.) It's worth noting that a
Just noticed that there isn't a PREMIS event for this validation, but I'll file another issue about it!
Is your feature request related to a problem? Please describe.
SFA accepts a limited number of formats (enforced through the use of a format allow list), and requires files in those formats to be well-formed and valid according to the format specification. Currently, staff are responsible for checking format validity manually, using the KOST-Val tool suite. SFA staff currently check the following formats using KOST-Val:
They also validate SIARD (fmt/161, fmt/1196, fmt/1777) using a custom script. Other formats on their allow list that could be validated include WAV, Matroska, and MPEG-4.
The manual check is time-consuming for staff, only carried out in some cases, and the result is not captured for preservation.
Describe the solution you'd like
Implement an Ingest activity to automate this check. The activity should always validate designated formats against their specification and should also write a PREMIS event to the premis.xml file.
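To make the PREMIS requirement a bit more concrete, here is a rough sketch of how a validation result could be serialized as a PREMIS event using Go's `encoding/xml`. It assumes the PREMIS 3 namespace and the standard event elements; the type and function names (`premisEvent`, `newValidationEvent`, `marshalEvent`) and the "pass"/"fail" outcome values are hypothetical, and linking agent and object identifiers are left out for brevity.

```go
package main

import (
	"encoding/xml"
	"time"
)

// premisEvent models a small subset of a PREMIS 3 event.
// Linking agent/object identifiers are omitted in this sketch.
type premisEvent struct {
	XMLName     xml.Name `xml:"http://www.loc.gov/premis/v3 event"`
	IDType      string   `xml:"eventIdentifier>eventIdentifierType"`
	IDValue     string   `xml:"eventIdentifier>eventIdentifierValue"`
	Type        string   `xml:"eventType"`
	DateTime    string   `xml:"eventDateTime"`
	Detail      string   `xml:"eventDetailInformation>eventDetail"`
	Outcome     string   `xml:"eventOutcomeInformation>eventOutcome"`
	OutcomeNote string   `xml:"eventOutcomeInformation>eventOutcomeDetail>eventOutcomeDetailNote"`
}

// newValidationEvent builds a "validation" event for one file.
// id is a caller-supplied identifier (for example a UUID), detail names the
// tool that ran, and note carries any tool output worth preserving.
func newValidationEvent(id, detail, note string, compliant bool) premisEvent {
	outcome := "fail" // hypothetical outcome vocabulary
	if compliant {
		outcome = "pass"
	}
	return premisEvent{
		IDType:      "UUID",
		IDValue:     id,
		Type:        "validation",
		DateTime:    time.Now().UTC().Format(time.RFC3339),
		Detail:      detail,
		Outcome:     outcome,
		OutcomeNote: note,
	}
}

// marshalEvent renders the event as indented XML, ready to be merged into
// premis.xml by whatever code owns that file.
func marshalEvent(ev premisEvent) ([]byte, error) {
	return xml.MarshalIndent(ev, "", "  ")
}
```

How the event is attached to the right object in premis.xml is a separate question that this sketch does not try to answer.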
Approximately 90% of SFA's corpus is PDF/A. In the interest of getting the best possible validation results for the most files, I'd suggest that we implement veraPDF first.
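One way to start with veraPDF and leave room for other validators later is a small lookup from PRONOM identifier to validator. The sketch below is only an illustration: the `Validator` interface, `stubValidator`, `newRegistry`, and `validatorFor` are hypothetical names, the SIARD identifiers come from this issue, and the two PDF/A identifiers (fmt/95, fmt/354) are my assumption for PDF/A-1a and PDF/A-1b and should be checked against PRONOM and the allow list.

```go
package main

import "fmt"

// Validator checks a single file against its format specification.
type Validator interface {
	Name() string
	Validate(path string) (bool, error)
}

// stubValidator stands in for a real tool wrapper in this sketch.
type stubValidator struct{ name string }

func (s stubValidator) Name() string { return s.name }

// Validate is intentionally unimplemented: a real wrapper would call the tool.
func (s stubValidator) Validate(path string) (bool, error) {
	return false, fmt.Errorf("%s: validation of %s is not implemented in this sketch", s.name, path)
}

// newRegistry maps PRONOM identifiers to the validator responsible for them.
func newRegistry() map[string]Validator {
	vera := stubValidator{name: "veraPDF"}
	siard := stubValidator{name: "SIARD script"}
	return map[string]Validator{
		// Assumed PDF/A identifiers; extend to cover the full allow list.
		"fmt/95":  vera,
		"fmt/354": vera,
		// SIARD identifiers as listed in this issue.
		"fmt/161":  siard,
		"fmt/1196": siard,
		"fmt/1777": siard,
	}
}

// validatorFor returns the validator for a PRONOM identifier, if one is registered.
func validatorFor(reg map[string]Validator, puid string) (Validator, bool) {
	v, ok := reg[puid]
	return v, ok
}
```

Files whose identifier has no registered validator would simply skip this activity, so a later iteration (for example JHOVE) only needs new map entries.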
Describe alternatives you've considered
Use KOST-Val (also a Java application; it also contains some licensed tools, such as pdfaPilot)
Based on preliminary research (see board notes here) it seems like JHOVE can validate TIFF and JP2, as well as WAV, which may also make it a good option for an initial proof of concept, or for a second iteration of this feature.
Additional context