Feature: validate file formats against format specification #63

Closed
sallain opened this issue Oct 17, 2024 · 5 comments

@sallain
Contributor

sallain commented Oct 17, 2024

Is your feature request related to a problem? Please describe.

SFA accepts a limited number of formats (enforced through a format allow list) and requires files in those formats to be well-formed and valid according to the format specification. Currently, staff are responsible for validating formats manually, using the KOST-Val tool suite. SFA staff currently check the following formats using KOST-Val:

  • TIFF (fmt/353)
  • PDF/A (fmt/95, fmt/354, fmt/476, fmt/477, fmt/478)
  • JP2 (x-fmt/392)

They also validate SIARD (fmt/161, fmt/1196, fmt/1777) using a custom script. Other formats on their allow list that could be validated include WAV, Matroska, and MPEG-4.

The manual check is time-consuming for staff, only carried out in some cases, and the result is not captured for preservation.

Describe the solution you'd like

Implement an Ingest activity to automate this check. The activity should always validate designated formats against their specification and should also write a PREMIS event to the premis.xml file.
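As a rough illustration of what the PREMIS side of this could look like, here is a minimal Go sketch that serializes a validation event. The element names follow the PREMIS 3 event structure, but everything else is an assumption made for the example: the `premisEvent` type, the `newValidationEvent` helper, the identifier value, and the "valid"/"invalid" outcome strings are hypothetical, and the way Enduro actually writes to premis.xml is not shown in this thread.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// premisNS is the PREMIS 3 namespace.
const premisNS = "http://www.loc.gov/premis/v3"

// premisEvent is a hypothetical, trimmed-down mapping of a PREMIS event.
// Only a handful of the elements the schema allows are modelled here.
type premisEvent struct {
	XMLName  xml.Name `xml:"event"`
	Xmlns    string   `xml:"xmlns,attr"`
	IDType   string   `xml:"eventIdentifier>eventIdentifierType"`
	IDValue  string   `xml:"eventIdentifier>eventIdentifierValue"`
	Type     string   `xml:"eventType"`
	DateTime string   `xml:"eventDateTime"`
	Detail   string   `xml:"eventDetailInformation>eventDetail"`
	Outcome  string   `xml:"eventOutcomeInformation>eventOutcome"`
}

// newValidationEvent builds an event for one validated file.
// The outcome vocabulary ("valid"/"invalid") is an assumption.
func newValidationEvent(id, when, tool string, valid bool) premisEvent {
	outcome := "valid"
	if !valid {
		outcome = "invalid"
	}
	return premisEvent{
		Xmlns:    premisNS,
		IDType:   "UUID",
		IDValue:  id,
		Type:     "validation",
		DateTime: when,
		Detail:   "format validation using " + tool,
		Outcome:  outcome,
	}
}

func main() {
	// Placeholder identifier and timestamp, for illustration only.
	e := newValidationEvent("hypothetical-id-0001", "2025-01-10T12:00:00Z", "veraPDF", false)
	out, err := xml.MarshalIndent(e, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A real implementation would also need to link the event to the validated object and to an agent describing the tool, which this sketch leaves out.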

Approximately 90% of SFA's corpus is PDF/A. In the interest of getting the best possible validation results for the most files, I'd suggest that we implement veraPDF first.

Describe alternatives you've considered

Use KOST-Val (also a Java application; it also bundles some licensed tools, such as pdfaPilot).

Based on preliminary research (see board notes here) it seems like JHOVE can validate TIFF and JP2, as well as WAV, which may also make it a good option for an initial proof of concept, or for a second iteration of this feature.

Additional context

@djjuhasz
Contributor

djjuhasz commented Nov 20, 2024

veraPDF provides two "apps": a Java GUI and a CLI. From the screenshot it looks like the Java GUI requires a windowing system (e.g. X11) and is built for human interaction. For our purposes the CLI seems like the best fit.

I'd rather not install veraPDF in our Enduro worker container and then call it from the Enduro (Go) worker. While a local installation may be the simplest way to implement veraPDF validation, it bloats our worker container with the Java RE and the veraPDF code, and doesn't follow the Docker best practice of decoupling applications.

I experimented today with running the veraPDF CLI Docker image using a Tilt button in my local development environment. The veraPDF Docker image builds the CLI app, then runs it immediately, taking a positional argument that is the path to a PDF, then outputs the validation results to STDOUT and exits. For our purposes this doesn't work very well because (a) the container only runs for as long as it takes the verapdf command to run, and (b) we can't directly run the verapdf executable from within an enduro-worker container.

After Googling and checking ChatGPT, I found a few different ways to run veraPDF in a separate container from an Enduro worker:

  1. Use the Kubernetes API to run verapdf as a Job. This isn't a good solution because it's fairly complicated and makes Enduro dependent on Kubernetes, which won't work in the SFA deployment environment.
  2. Run an SSH server in the verapdf container, and call the executable remotely using SSH. Using SSH seems overly complicated for our needs though, and it's difficult to limit the authorization to just running the verapdf executable.
  3. Create a veraPDF service container with an HTTP REST server (written in Go, or whatever) and trigger the verapdf validation with a REST request. We could share a disk volume between the Enduro worker and the verapdf container to avoid having to send the PDF data with the REST request. This container could be reused for other apps (e.g. Archivematica) and is scalable and independent. The downside of this solution is the extra work required to set up the web server and to create and maintain the image.
  4. ChatGPT suggested a couple of more complex protocols to use for communicating with a service container including using a message queue (e.g. Redis), or gRPC. These each have their own advantages, but are even more work than an HTTP service container solution.
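
To make option 3 more concrete, here is a minimal Go sketch of what such a service container's handler might look like. It is only an illustration under stated assumptions: the `/validate` route, the JSON request shape, the `/data` mount point, and the `validateFunc` hook are all hypothetical, and the hook is stubbed out rather than calling verapdf. The request carries a path relative to the shared volume, so no PDF bytes travel over HTTP.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"path/filepath"
	"strings"
)

// validateFunc is a hypothetical hook; a real service would run verapdf here.
type validateFunc func(absPath string) (report string, err error)

// resolveSharedPath joins rel onto root and rejects paths that escape root,
// so a request cannot point the validator outside the shared volume.
func resolveSharedPath(root, rel string) (string, error) {
	p := filepath.Join(root, rel)
	r, err := filepath.Rel(root, p)
	if err != nil || r == ".." || strings.HasPrefix(r, ".."+string(filepath.Separator)) {
		return "", errors.New("path escapes shared volume")
	}
	return p, nil
}

// newHandler exposes POST /validate with a body like {"path": "sip/a.pdf"}.
func newHandler(root string, validate validateFunc) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/validate", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		var req struct {
			Path string `json:"path"`
		}
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "bad request body", http.StatusBadRequest)
			return
		}
		abs, err := resolveSharedPath(root, req.Path)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		report, err := validate(abs)
		if err != nil {
			// The validator itself failed to run; report it as a server error.
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]string{"path": req.Path, "report": report})
	})
	return mux
}

func main() {
	// Stub validator so the sketch runs without veraPDF installed.
	fake := func(abs string) (string, error) { return "validated " + abs, nil }
	h := newHandler("/data", fake)

	// Exercise the handler in-process instead of opening a port.
	req := httptest.NewRequest(http.MethodPost, "/validate", strings.NewReader(`{"path":"sip/a.pdf"}`))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	fmt.Println(rec.Code, strings.TrimSpace(rec.Body.String()))
}
```

In a deployed container, `main` would call `http.ListenAndServe` with this handler instead of the in-process request, and the Enduro worker would act as the HTTP client.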

@djjuhasz
Contributor

We discussed this in our sync time this morning and decided to just add the veraPDF CLI and its dependencies to the preprocessing-worker container. This solution is the simplest to implement and will be the most similar to deploying the veraPDF CLI in SFA's environment.
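
For reference, calling a locally installed CLI from a Go worker could look roughly like the sketch below. This is not the code from the actual change; `runValidator`, `cliResult`, and `errToolUnavailable` are hypothetical names, and the command is invoked with only the PDF path as a positional argument. How the exit code and output should be interpreted is left open here and would need to be checked against the veraPDF documentation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// errToolUnavailable marks the case where the binary cannot be found at all,
// which is different from the tool running and reporting problems.
var errToolUnavailable = errors.New("validator binary not available")

// cliResult holds the raw output; parsing the report is out of scope here.
type cliResult struct {
	Stdout   []byte
	ExitCode int
}

// runValidator runs bin with pdfPath as its only argument.
func runValidator(bin, pdfPath string, timeoutSec int) (cliResult, error) {
	resolved, err := exec.LookPath(bin)
	if err != nil {
		return cliResult{}, fmt.Errorf("%w: %v", errToolUnavailable, err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutSec)*time.Second)
	defer cancel()

	out, err := exec.CommandContext(ctx, resolved, pdfPath).Output()
	var exitErr *exec.ExitError
	switch {
	case err == nil:
		return cliResult{Stdout: out, ExitCode: 0}, nil
	case errors.As(err, &exitErr):
		// The process started but exited non-zero (or was killed on timeout,
		// in which case ExitCode is -1). The caller decides what that means.
		return cliResult{Stdout: out, ExitCode: exitErr.ExitCode()}, nil
	default:
		return cliResult{}, fmt.Errorf("running %s: %w", bin, err)
	}
}

// isToolUnavailable reports whether err means the binary was not found.
func isToolUnavailable(err error) bool {
	return errors.Is(err, errToolUnavailable)
}

func main() {
	res, err := runValidator("verapdf", "example.pdf", 60)
	switch {
	case isToolUnavailable(err):
		fmt.Println("veraPDF is not installed in this container")
	case err != nil:
		fmt.Println("could not run veraPDF:", err)
	default:
		fmt.Printf("exit code %d, %d bytes of report\n", res.ExitCode, len(res.Stdout))
	}
}
```

Keeping "binary not found" as a distinct error means a missing tool is never silently treated as a clean validation result.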

@djjuhasz
Contributor

djjuhasz commented Dec 4, 2024

PDF/A validation with veraPDF is added by 458d34b.

@djjuhasz djjuhasz moved this from ⏳ In Progress to 🧐 QA in Enduro Dec 4, 2024
@sallain
Contributor Author

sallain commented Jan 10, 2025

This is looking good. When a SIP containing PDFs is transferred, the Validate SIP Files activity is checking their conformance using the default command in veraPDF (auto-detect flavour). (This means that transfers with PDFs are currently failing because the PDF/As are nonconformant, which is an expected result.)

It's worth noting that a "Failures":null result, which is interpreted as a success, can mean one of three very different things: 1) that there are no PDF/As present, 2) that there are PDF/As present and they are conformant, or 3) that veraPDF is not running. In all cases, Enduro displays the message No invalid files found. This feels prone to issues in the future.
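
One way to keep those three cases apart is to carry an explicit outcome alongside the failure list, rather than inferring success from an empty list. The Go sketch below is hypothetical: the `validationRun` fields, the `classify` function, and the message strings are made up for illustration and do not reflect Enduro's current types.

```go
package main

import "fmt"

// outcome distinguishes the states that a nil failure list alone cannot.
type outcome int

const (
	outcomeNoApplicableFiles outcome = iota // nothing to validate in the SIP
	outcomeAllValid                         // files were checked and passed
	outcomeHasFailures                      // at least one file failed
	outcomeToolUnavailable                  // the validator did not run
)

// validationRun is a hypothetical summary of one validation activity.
type validationRun struct {
	ToolAvailable bool
	FilesChecked  int
	Failures      []string
}

// classify checks tool availability first so that a validator that never
// ran cannot be mistaken for a clean result.
func classify(r validationRun) outcome {
	switch {
	case !r.ToolAvailable:
		return outcomeToolUnavailable
	case len(r.Failures) > 0:
		return outcomeHasFailures
	case r.FilesChecked == 0:
		return outcomeNoApplicableFiles
	default:
		return outcomeAllValid
	}
}

// message returns an illustrative, state-specific status line.
func message(o outcome) string {
	switch o {
	case outcomeToolUnavailable:
		return "Validation could not be performed: validator unavailable"
	case outcomeHasFailures:
		return "Invalid files found"
	case outcomeNoApplicableFiles:
		return "No files of a validated format were present"
	default:
		return "All validated files are conformant"
	}
}

func main() {
	runs := []validationRun{
		{ToolAvailable: true, FilesChecked: 0},
		{ToolAvailable: true, FilesChecked: 3},
		{ToolAvailable: false},
	}
	for _, r := range runs {
		fmt.Println(message(classify(r)))
	}
}
```

All three runs in `main` have a nil failure list, yet each produces a different message.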

@sallain sallain closed this as completed Jan 10, 2025
@github-project-automation github-project-automation bot moved this from 🧐 QA to 🎉 Done in Enduro Jan 10, 2025
@sallain
Contributor Author

sallain commented Jan 11, 2025

Just noticed that there isn't a PREMIS event for this validation, but I'll file another issue about it!
