Add VeraPDF validation of PDF/A files #92
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main      #92      +/-   ##
==========================================
+ Coverage   60.12%   61.22%   +1.10%
==========================================
  Files          29       32       +3
  Lines        1996     2174     +178
==========================================
+ Hits         1200     1331     +131
- Misses        702      742      +40
- Partials       94      101       +7
I've tested the VeraPDF validation in my local dev environment and it works, but all of the PDF/A files in our sample SIP fail validation:

[
  {
    "Failures": [
      "invalid PDF/A: \"content/content/d0003/d0004/p0006.pdf\"",
      "invalid PDF/A: \"content/content/d0003/d0004/p0007.pdf\"",
      "invalid PDF/A: \"content/content/d0001/p0001.pdf\"",
      "invalid PDF/A: \"content/content/d0001/p0002.pdf\"",
      "invalid PDF/A: \"content/content/d0001/p0003.pdf\"",
      "invalid PDF/A: \"content/content/d0002/p0004.pdf\"",
      "invalid PDF/A: \"content/content/d0002/p0005.pdf\""
    ]
  }
]

VeraPDF is also quite slow: it took almost 5 seconds to validate the 7 PDF/A files listed above. 😮
I've done some testing, and almost all of the time used by VeraPDF was startup overhead (probably JRE startup time). When I pass VeraPDF a directory path instead of individual file paths, it's much faster (just under 1 second in total for the 7 test PDFs), and it reports about 17 milliseconds to validate each PDF.
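The difference between the two invocation styles described above can be sketched in Go. This is a minimal illustration, not the code in this PR: the `verapdf` binary name, the helper names `perFileCommands` and `batchCommand`, and the example paths are assumptions, and no veraPDF flags are used because none are mentioned in this thread. The commands are only built here, not run.

```go
package main

import (
	"fmt"
	"os/exec"
)

// verapdfBin is an assumed binary name; the real path depends on the deployment.
const verapdfBin = "verapdf"

// perFileCommands builds one veraPDF invocation per file, so the
// process startup cost is paid once for every file.
func perFileCommands(paths []string) []*exec.Cmd {
	cmds := make([]*exec.Cmd, 0, len(paths))
	for _, p := range paths {
		cmds = append(cmds, exec.Command(verapdfBin, p))
	}
	return cmds
}

// batchCommand builds a single invocation for a whole directory, so the
// process startup cost is paid only once.
func batchCommand(dir string) *exec.Cmd {
	return exec.Command(verapdfBin, dir)
}

func main() {
	files := []string{
		"content/content/d0001/p0001.pdf",
		"content/content/d0001/p0002.pdf",
	}
	fmt.Println("per-file invocations:", len(perFileCommands(files)))
	fmt.Println("batch invocation args:", batchCommand("content/content/d0001").Args[1:])
}
```

With per-file invocation the number of processes grows with the number of PDFs, while the batch form stays at one process regardless of how many files the directory holds.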
f1380d3 to b31a18a
Really nice, thanks @djjuhasz! LGTM, just a few minor details.
- Add verapdf and the JRE to the worker Docker image
- Add a validate_files activity to identify SIP file formats then validate the file formats for which we have a validator
- Copy siegfried_embed and the format Identifier interface from https://github.com/artefactual-sdps/temporal-activities
- Add the fvalidate package and Validator interface
- Add a veraPDF implementation of the Validator interface
- Run veraPDF in "batch" mode to minimize startup overheads
- Add file validation configuration to config file
- Add processing events for file validation success and failure
- Add veraPDF binary path to kube dev overlay
cf09a0a to 87dc8d9
@djjuhasz, I thought I added a comment about this file. Is it needed?
Oops, no. I'll delete it.
Add the VeraPDF Java CLI application to the Enduro worker container and use it to validate SIP files identified as a PDF/A format.
Refs #63.