Problem: PREMIS files are not validated #951

Closed
sallain opened this issue May 31, 2024 · 7 comments

Comments

@sallain
Collaborator

sallain commented May 31, 2024

Is your feature request related to a problem? Please describe.

Whether generated by Enduro (through a child workflow) or included in a SIP, PREMIS XML files should be validated before the package is sent to preservation. Archivematica/a3m can parse a PREMIS file to add the file's events to the AIP METS, but this happens quite late in the AM/a3m workflow. Validating the PREMIS file beforehand should help avoid errors at that late stage.

Describe the solution you'd like

Add a new activity to validate the premis.xml file against the PREMIS v3 schema before sending to AM/a3m, ensuring that it's well-formed and valid.
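As a rough illustration of the "well-formed" half of this check, here is a minimal Python sketch that parses the file and confirms the root element is in the PREMIS v3 namespace. It is only a cheap pre-check, not schema validation: Python's standard library has no XSD validator, so validating against the PREMIS v3 schema would still need an external tool. The function name `precheck_premis` is hypothetical and not part of Enduro.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by PREMIS v3 documents.
PREMIS_V3_NS = "http://www.loc.gov/premis/v3"


def precheck_premis(xml_text):
    """Return a list of problems; an empty list means the pre-checks passed.

    Hypothetical helper: checks well-formedness and the root element only.
    It does NOT validate the document against the PREMIS v3 XSD.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser rejects anything that is not well-formed XML.
        return [f"not well-formed: {err}"]

    problems = []
    # ElementTree expands namespaced tags to "{namespace-uri}localname".
    if root.tag != f"{{{PREMIS_V3_NS}}}premis":
        problems.append(f"unexpected root element: {root.tag}")
    return problems
```

A file that fails this pre-check can be rejected immediately, without running the heavier schema validation step.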

PREMIS files generated by Enduro child workflows should always be validated. A PREMIS file included in a transfer may have been validated in advance, so it might not be necessary to validate these. A reasonable approach might be to validate any PREMIS file in the SIP's metadata directory, regardless of origin, as this is the file that will be picked up by Archivematica/a3m.

Describe alternatives you've considered

None

Additional context

@sallain
Collaborator Author

sallain commented May 31, 2024

Note: I've only listed validating against the schema as a first iteration. Other checks might include:

  • Ensuring that each object listed in the PREMIS file exists in the transfer
  • Ensuring that identifiers (for objects, events, and agents) are unique, and replacing them with unique IDs if needed
  • Ensuring that there are no extra agents
  • and possibly others!
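
Two of the checks listed above (objects exist in the transfer, identifiers are unique) could be sketched roughly as follows in Python. This assumes that each `premis:object` records its path in `premis:originalName` and that the caller supplies the transfer's file paths as strings; both are assumptions for illustration, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {"premis": "http://www.loc.gov/premis/v3"}


def duplicate_identifiers(root):
    """Map "object"/"event"/"agent" to the (type, value) pairs seen more than once."""
    duplicates = {}
    for kind in ("object", "event", "agent"):
        pairs = []
        # e.g. premis:object/premis:objectIdentifier
        for ident in root.findall(f".//premis:{kind}/premis:{kind}Identifier", NS):
            id_type = ident.findtext(f"premis:{kind}IdentifierType", "", NS)
            id_value = ident.findtext(f"premis:{kind}IdentifierValue", "", NS)
            pairs.append((id_type, id_value))
        repeated = sorted(p for p, n in Counter(pairs).items() if n > 1)
        if repeated:
            duplicates[kind] = repeated
    return duplicates


def missing_objects(root, transfer_paths):
    """List originalName values that do not match any path in the transfer.

    Assumption: originalName holds the object's path relative to the
    transfer root. Objects without an originalName are skipped.
    """
    present = set(transfer_paths)
    names = [
        obj.findtext("premis:originalName", "", NS)
        for obj in root.findall(".//premis:object", NS)
    ]
    return sorted(name for name in names if name and name not in present)
```

Both functions only report problems. Replacing duplicate identifiers with new unique IDs, as suggested in the list, would be a separate step.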

@sallain sallain added this to Enduro May 31, 2024
@sallain sallain moved this to 👍 Ready in Enduro May 31, 2024
@fiver-watson
Contributor

Note also that this will be used repeatedly by any Enduro user performing custom ingest activities that might generate PREMIS, and by anyone submitting their own PREMIS files with a SIP. For this reason, it should ideally be implemented as a reusable Temporal activity rather than a client-specific child workflow.

@fiver-watson
Contributor

@mcantelon also, as discussed in the meeting today:

Let's make this a general "Validate XML" task for its first pass, that can accept both a file to validate and a schema file to use for the validation as inputs.
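
To make the shape of such a general task concrete, here is a hedged Python sketch of a function that takes an XML path and a schema path as inputs and returns any validation failures. It is not the implementation from the linked PR. It assumes the `xmllint` command-line tool is installed and does the actual XSD validation; the names `ValidateXMLParams`, `ValidateXMLResult`, and `validate_xml` are hypothetical.

```python
import subprocess
from dataclasses import dataclass, field
from typing import List


@dataclass
class ValidateXMLParams:
    xml_path: str  # file to validate
    xsd_path: str  # schema to validate against


@dataclass
class ValidateXMLResult:
    valid: bool
    failures: List[str] = field(default_factory=list)


def _run_xmllint(args):
    # Assumption: xmllint is on PATH; FileNotFoundError is raised otherwise.
    return subprocess.run(args, capture_output=True, text=True, check=False)


def validate_xml(params, run=_run_xmllint):
    """Validate params.xml_path against params.xsd_path using xmllint.

    `run` is injectable so the function can be exercised without xmllint.
    """
    args = ["xmllint", "--noout", "--schema", params.xsd_path, params.xml_path]
    proc = run(args)
    if proc.returncode == 0:
        return ValidateXMLResult(valid=True)
    # xmllint reports validation errors on stderr, one per line.
    failures = [line for line in proc.stderr.splitlines() if line.strip()]
    return ValidateXMLResult(valid=False, failures=failures)
```

Because the schema is an input rather than a hard-coded PREMIS path, the same task could validate other XML files in a SIP by passing a different XSD.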

@mcantelon mcantelon self-assigned this Jun 25, 2024
@sallain sallain moved this from 👍 Ready to ⏳ In Progress in Enduro Jul 9, 2024
@mcantelon
Contributor

PR for CR: artefactual-sdps/temporal-activities#21

@jraddaoui
Collaborator

There are some comments about this issue in artefactual-sdps/preprocessing-sfa#22 (comment).

@sallain
Collaborator Author

sallain commented Nov 20, 2024

PREMIS file looks good and it's being properly parsed into the METS. I think we can finally put this issue to bed!

@sallain sallain closed this as completed Nov 20, 2024
@github-project-automation github-project-automation bot moved this from 🧐 QA to 🎉 Done in Enduro Nov 20, 2024