Add VeraPDF validation of PDF/A files #92

Merged: djjuhasz merged 1 commit into main from dev/issue-63-verapdf-validation on Dec 3, 2024

Conversation

djjuhasz
Contributor

@djjuhasz djjuhasz commented Nov 27, 2024

Add the VeraPDF Java CLI application to the Enduro worker container and use it to validate SIP files identified as PDF/A.

Refs #63.

codecov bot commented Nov 27, 2024

Codecov Report

Attention: Patch coverage is 73.59551% with 47 lines in your changes missing coverage. Please review.

Project coverage is 61.22%. Comparing base (451bb29) to head (87dc8d9).
Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/fvalidate/verapdf_validator.go | 55.55% | 16 Missing ⚠️ |
| internal/activities/validate_files.go | 81.81% | 8 Missing and 4 partials ⚠️ |
| internal/fformat/siegfried_embed.go | 80.48% | 6 Missing and 2 partials ⚠️ |
| cmd/worker/workercmd/cmd.go | 0.00% | 7 Missing ⚠️ |
| internal/workflow/preprocessing.go | 85.71% | 3 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #92      +/-   ##
==========================================
+ Coverage   60.12%   61.22%   +1.10%     
==========================================
  Files          29       32       +3     
  Lines        1996     2174     +178     
==========================================
+ Hits         1200     1331     +131     
- Misses        702      742      +40     
- Partials       94      101       +7     

@djjuhasz
Contributor Author

I've tested the VeraPDF validation in my local dev environment and it works, but all of the PDF/A files in our sample SIP fail validation:

[
  {
    "Failures": [
      "invalid PDF/A: \"content/content/d0003/d0004/p0006.pdf\"",
      "invalid PDF/A: \"content/content/d0003/d0004/p0007.pdf\"",
      "invalid PDF/A: \"content/content/d0001/p0001.pdf\"",
      "invalid PDF/A: \"content/content/d0001/p0002.pdf\"",
      "invalid PDF/A: \"content/content/d0001/p0003.pdf\"",
      "invalid PDF/A: \"content/content/d0002/p0004.pdf\"",
      "invalid PDF/A: \"content/content/d0002/p0005.pdf\""
    ]
  }
]

VeraPDF is also quite slow - it took almost 5 seconds to validate the 7 PDF/A files listed above. 😮

@djjuhasz
Contributor Author

I've done some testing and almost all of the time used by VeraPDF was startup overhead (probably JRE startup time). When I pass VeraPDF a directory path instead of individual file paths it's much faster (just under 1 second for the 7 test PDFs) and it reports about 17 milliseconds to validate each PDF.
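
A minimal sketch of the difference described above, in Go: one process start per file versus a single start for a whole directory. The `runner` type, the function names, the `verapdf` binary name, and the `--recurse` flag are assumptions made for illustration, not the code merged in this PR. A real integration would also need to decide how to treat a non-zero exit status, which this sketch simply reports as an error.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runner abstracts process execution so the two strategies can be compared
// without veraPDF or a JRE installed. (Hypothetical helper type.)
type runner func(name string, args ...string) ([]byte, error)

// execRunner runs a real process and returns its standard output.
func execRunner(name string, args ...string) ([]byte, error) {
	return exec.Command(name, args...).Output()
}

// validateEach starts the CLI once per file, so the startup cost is paid
// for every file. It returns the number of process starts.
func validateEach(run runner, bin string, files []string) (int, error) {
	starts := 0
	for _, f := range files {
		starts++
		if _, err := run(bin, f); err != nil {
			return starts, fmt.Errorf("validate %q: %w", f, err)
		}
	}
	return starts, nil
}

// validateDir starts the CLI once for a whole directory ("batch" mode).
// The "--recurse" flag is an assumption; check the CLI help for your version.
func validateDir(run runner, bin, dir string) (int, error) {
	if _, err := run(bin, "--recurse", dir); err != nil {
		return 1, fmt.Errorf("validate %q: %w", dir, err)
	}
	return 1, nil
}

func main() {
	// A fake runner stands in for the real process in this demonstration;
	// pass execRunner instead to invoke an installed binary.
	fake := func(name string, args ...string) ([]byte, error) {
		return nil, nil
	}
	files := []string{"p0001.pdf", "p0002.pdf", "p0003.pdf"}

	n, _ := validateEach(fake, "verapdf", files)
	fmt.Println("per-file process starts:", n)

	n, _ = validateDir(fake, "verapdf", "content")
	fmt.Println("batch process starts:", n)
}
```

With startup dominating the total time, the cost grows with the number of process starts rather than the number of files, which is why the directory form is so much faster for the same 7 PDFs.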

djjuhasz force-pushed the dev/issue-63-verapdf-validation branch 6 times, most recently from f1380d3 to b31a18a, on November 29, 2024 02:06
Contributor

@jraddaoui jraddaoui left a comment

Really nice, thanks @djjuhasz! LGTM, just a few minor details.

- Add verapdf and the JRE to the worker Docker image
- Add a validate_files activity to identify SIP file formats then
  validate the file formats for which we have a validator
- Copy siegfried_embed and the format Identifier interface from
  https://github.com/artefactual-sdps/temporal-activities
- Add the fvalidate package and Validator interface
- Add a veraPDF implementation of the Validator interface
- Run veraPDF in "batch" mode to minimize startup overheads
- Add file validation configuration to config file
- Add processing events for file validation success and failure
- Add veraPDF binary path to kube dev overlay
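
The identify-then-validate flow in the list above could look roughly like the following sketch. The `Validator` method set, the `stubPDFA` type, and `validateFiles` are hypothetical stand-ins, not the actual `fvalidate` or `validate_files` code from this PR. The format IDs are illustrative PRONOM-style identifiers, and `notes.txt` is a made-up file used only to show a format with no registered validator.

```go
package main

import (
	"fmt"
	"sort"
)

// Validator is a hypothetical stand-in for the fvalidate Validator interface;
// the real method set in the PR may differ.
type Validator interface {
	// FormatIDs lists the format identifiers this validator handles.
	FormatIDs() []string
	// Validate returns a non-empty message when the file at path is invalid.
	Validate(path string) (string, error)
}

// stubPDFA pretends to validate PDF/A files; it flags the paths listed in bad.
type stubPDFA struct{ bad map[string]bool }

func (s stubPDFA) FormatIDs() []string { return []string{"fmt/354"} }

func (s stubPDFA) Validate(path string) (string, error) {
	if s.bad[path] {
		return fmt.Sprintf("invalid PDF/A: %q", path), nil
	}
	return "", nil
}

// validateFiles takes a path-to-format-ID map (the result of the
// identification step) and runs each file through the validator registered
// for its format. Files with no matching validator are skipped.
func validateFiles(formats map[string]string, validators []Validator) ([]string, error) {
	byFormat := make(map[string]Validator)
	for _, v := range validators {
		for _, id := range v.FormatIDs() {
			byFormat[id] = v
		}
	}

	// Sort the paths so the failure list has a stable order.
	paths := make([]string, 0, len(formats))
	for p := range formats {
		paths = append(paths, p)
	}
	sort.Strings(paths)

	var failures []string
	for _, p := range paths {
		v, ok := byFormat[formats[p]]
		if !ok {
			continue // no validator for this format
		}
		msg, err := v.Validate(p)
		if err != nil {
			return nil, fmt.Errorf("validate %q: %w", p, err)
		}
		if msg != "" {
			failures = append(failures, msg)
		}
	}
	return failures, nil
}

func main() {
	formats := map[string]string{
		"content/content/d0001/p0001.pdf": "fmt/354",
		"content/content/d0001/notes.txt": "x-fmt/111", // hypothetical file
	}
	v := stubPDFA{bad: map[string]bool{"content/content/d0001/p0001.pdf": true}}

	failures, err := validateFiles(formats, []Validator{v})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, f := range failures {
		fmt.Println(f)
	}
}
```

Keying validators by format ID keeps the activity independent of any one tool, so another validator can be registered later without changing the loop.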

djjuhasz force-pushed the dev/issue-63-verapdf-validation branch from cf09a0a to 87dc8d9 on December 3, 2024 22:04
djjuhasz merged commit 458d34b into main on Dec 3, 2024
9 checks passed
djjuhasz deleted the dev/issue-63-verapdf-validation branch on December 3, 2024 23:01
}
}
}
}
Contributor

@djjuhasz, I thought I added a comment about this file. Is it needed?

Contributor Author

Oops, no. I'll delete it.

2 participants