Feature: capture PREMIS events for pre-ingest validation tasks #19

Closed
sallain opened this issue May 30, 2024 · 7 comments

Comments

@sallain
Contributor

sallain commented May 30, 2024

Is your feature request related to a problem? Please describe.

SFA SIPs will have some custom ingest validation tasks in their workflow that will: 

  • Validate the transfer structure
  • Validate the metadata files included in the transfer
  • Check the file formats included against a controlled list

At present, none of these pre-ingest validation tasks are generating PREMIS events. This card aims to change that where possible.

In some cases it would be best to create PREMIS events at the package level (such as for the transfer structure validation). Since AM can't do that right now, we will focus only on those events we can add at the file level. For now, we will focus on generating a validation event for each file in a package once it has been checked against the allowed file formats list during the ingest validation phase.

Describe the solution you'd like

Generate file-level PREMIS events where possible, and include them as part of a new, well-formed premis.xml file.

The first candidate is file format validation.

  • Generate one PREMIS event for each object as it is checked against the allowed list. Sample PREMIS event here
  • Capture validation events in a single premis.xml file (see comments below for an annotated PREMIS file that you can use as a template)
  • Place the newly-generated premis.xml file in the metadata directory
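
As a rough illustration of the three steps above, here is a minimal Python sketch that builds one object and one validation event per file. It is not the workflow's actual implementation. The function names (`q`, `add_identifier`, `build_premis`) are hypothetical, the PREMIS v3 namespace and the use of linkingEventIdentifier on each object are assumptions, and the objects are deliberately simplified (no objectCharacteristics), so the output is well-formed XML but not schema-complete PREMIS.

```python
import uuid
import xml.etree.ElementTree as ET

# Assumption: the premis.xml uses the PREMIS v3 namespace.
PREMIS_NS = "http://www.loc.gov/premis/v3"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
ET.register_namespace("premis", PREMIS_NS)
ET.register_namespace("xsi", XSI_NS)


def q(tag):
    """Return the namespace-qualified PREMIS tag name."""
    return f"{{{PREMIS_NS}}}{tag}"


def add_identifier(parent, prefix, value):
    """Append a <prefix>Identifier element holding a UUID type/value pair."""
    ident = ET.SubElement(parent, q(f"{prefix}Identifier"))
    ET.SubElement(ident, q(f"{prefix}IdentifierType")).text = "UUID"
    ET.SubElement(ident, q(f"{prefix}IdentifierValue")).text = value
    return ident


def build_premis(file_paths, event_datetime):
    """Build a PREMIS tree with one object and one validation event per file."""
    root = ET.Element(q("premis"), {"version": "3.0"})
    events = []
    for path in file_paths:
        # Each file gets its own event ID, so no two objects share an event.
        event_id = str(uuid.uuid4())

        obj = ET.SubElement(root, q("object"), {f"{{{XSI_NS}}}type": "premis:file"})
        add_identifier(obj, "object", str(uuid.uuid4()))
        ET.SubElement(obj, q("originalName")).text = path
        add_identifier(obj, "linkingEvent", event_id)

        event = ET.Element(q("event"))
        add_identifier(event, "event", event_id)
        ET.SubElement(event, q("eventType")).text = "validation"
        ET.SubElement(event, q("eventDateTime")).text = event_datetime
        events.append(event)

    # Events are appended after all objects.
    root.extend(events)
    return root
```

The resulting tree could then be written into the metadata directory with something like `ET.ElementTree(root).write("metadata/premis.xml", encoding="utf-8", xml_declaration=True)`, where the path is only an example.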

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

Previously, we were generating a premis.xml file by combining individual PREMIS files found within the content directory. This work is being undone by #18.

@sallain
Contributor Author

sallain commented May 30, 2024

Annotated PREMIS file for multiple objects:
premis-annotated-multi.zip

@sallain
Contributor Author

sallain commented May 30, 2024

In the sample PREMIS event, @fiver-watson provided a generic linkingAgentIdentifierValue of <premis:linkingAgentIdentifierValue>https://github.com/artefactual-sdps/preprocessing-base</premis:linkingAgentIdentifierValue> on the principle that this kind of validation is likely to be universally useful. However, since it's currently implemented for SFA only through the child workflow, I think I would recommend pointing to this repo as the Agent (that is, the child workflow is the agent).

@sallain sallain added this to Enduro May 30, 2024
@sallain sallain moved this to 👍 Ready in Enduro May 30, 2024
@sallain sallain moved this from 👍 Ready to ⏳ In Progress in Enduro Jun 3, 2024
@sallain sallain moved this from ⏳ In Progress to 🧐 QA in Enduro Jun 28, 2024
@sallain
Contributor Author

sallain commented Jun 28, 2024

The structure of the PREMIS file looks good. However, I'm seeing two issues:

  1. Archivematica isn't able to load the events from the premis.xml due to incorrect associations between events and objects.


Looking at the PREMIS file that's generated, I can see that there are five objects in the package. I can see that there are six format validation events (there should be five, I think - not sure what's going on there). However, all of the objects are linked to just one of those events, rather than each object being linked to a separate event.

A similar issue happens with the structure validation and metadata validation events, except in those cases there is only one of each event. There needs to be one event for each object (even though that doesn't make sense, I know!)
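
One way to spot this kind of mismatch before handing the file to Archivematica is a quick check of the object-to-event links. The sketch below is only an illustration: `check_event_links` is a hypothetical helper, and it assumes the PREMIS v3 namespace and that objects reference events through linkingEventIdentifierValue. It reports event IDs referenced by more than one object and events that no object references.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the premis.xml uses the PREMIS v3 namespace.
NS = {"premis": "http://www.loc.gov/premis/v3"}


def check_event_links(premis_xml):
    """Return (shared, unlinked) event IDs found in a PREMIS XML string.

    shared:   event IDs referenced by more than one object
    unlinked: event IDs that no object references
    """
    root = ET.fromstring(premis_xml)
    linked = [
        el.text
        for el in root.findall(
            "premis:object/premis:linkingEventIdentifier"
            "/premis:linkingEventIdentifierValue",
            NS,
        )
    ]
    event_ids = [
        el.text
        for el in root.findall(
            "premis:event/premis:eventIdentifier/premis:eventIdentifierValue",
            NS,
        )
    ]
    counts = Counter(linked)
    shared = sorted(eid for eid, n in counts.items() if n > 1)
    unlinked = sorted(set(event_ids) - set(linked))
    return shared, unlinked
```

For a file where every object has its own events, both lists should come back empty.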

  2. The value for <premis:eventType> needs to adhere to the PREMIS data dictionary, and the eventDetail and eventOutcomeDetailNote should provide more information. The correct values are:

  • eventType: validateStructure SHOULD BE validation
    • eventDetail: name="Validate SIP structure" (NOTE: this is the name of the Enduro activity)
    • eventOutcomeDetailNote: SIP structure identified: VecteurAIP. SIP structure matches validation criteria. (NOTE: this is a combination of the outcomes from the Identify SIP structure and Validate SIP structure activities - I'm not sure if just directly pulling that information is the easiest way to do it, but it provides all the info needed)
  • eventType: validateFileFormats SHOULD BE validation
    • eventDetail: name="Validate SIP file formats"
    • eventOutcomeDetailNote: Format allowed (NOTE: will always be allowed because a disallowed format will cause the transfer to fail)
  • eventType: validateMetadata SHOULD BE validation
    • eventDetail: name="Validate SIP metadata"
    • eventOutcomeDetailNote: Metadata validation successful (NOTE: will always be successful because invalid metadata will cause the transfer to fail)
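
The corrections listed above can also be expressed as data, which may be easier to compare against the generated file. This is a sketch only: `CORRECTED_EVENTS` and `correct_event` are hypothetical names, and events are modelled as plain dictionaries rather than whatever structure the workflow actually uses. The values are copied from the list above.

```python
# Keys are the event types currently being emitted; values are the
# corrected fields requested in this comment.
CORRECTED_EVENTS = {
    "validateStructure": {
        "eventType": "validation",
        "eventDetail": 'name="Validate SIP structure"',
        "eventOutcomeDetailNote": (
            "SIP structure identified: VecteurAIP. "
            "SIP structure matches validation criteria."
        ),
    },
    "validateFileFormats": {
        "eventType": "validation",
        "eventDetail": 'name="Validate SIP file formats"',
        "eventOutcomeDetailNote": "Format allowed",
    },
    "validateMetadata": {
        "eventType": "validation",
        "eventDetail": 'name="Validate SIP metadata"',
        "eventOutcomeDetailNote": "Metadata validation successful",
    },
}


def correct_event(event):
    """Return a copy of event with the corrected fields applied, if any."""
    fix = CORRECTED_EVENTS.get(event.get("eventType"))
    return {**event, **fix} if fix else dict(event)
```

Events whose type is not in the table pass through unchanged, so the helper is safe to run over every event in a file.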

Let me know if a mock-up of the premis.xml would be helpful.

@mcantelon
Contributor

PR ready for CR: #31

@sallain sallain moved this from 🧐 QA to ⏳ In Progress in Enduro Jul 9, 2024
mcantelon added a commit that referenced this issue Jul 10, 2024
* Removed redundant PREMIS event for validate file formats
* Correct use of improper PREMIS event types
* Allowed PREMIS event details and PREMIS event outcome detail to be
  specified rather than generated
* Add PREMIS event outcome detail specification to PREMIS events in
  workflow and validate file formats activity
* Generate error when attempting to add a PREMIS event to a
  non-existent PREMIS object
* Fixed issue with validate formats activity
* Removed unneeded code in tests of add PREMIS event activity
@mcantelon mcantelon moved this from ⏳ In Progress to 🧐 QA in Enduro Jul 10, 2024
@sallain
Contributor Author

sallain commented Jul 22, 2024

@mcantelon Archivematica is throwing up the following error - I don't really understand what it means!

'UUID' object has no attribute 'replace'
Traceback (most recent call last):
  File "/usr/lib/archivematica/MCPClient/client/job.py", line 142, in JobContext
    yield
  File "/usr/lib/archivematica/MCPClient/clientScripts/load_premis_events_from_xml.py", line 848, in call
    job.set_status(main(job))
  File "/usr/lib/archivematica/MCPClient/clientScripts/load_premis_events_from_xml.py", line 839, in main
    save_events(valid_events, file_queryset, job.pyprint)
  File "/usr/lib/archivematica/MCPClient/clientScripts/load_premis_events_from_xml.py", line 695, in save_events
    event["event_id"] = ensure_event_id_is_uuid(event["event_id"], printfn)
  File "/usr/lib/archivematica/MCPClient/clientScripts/load_premis_events_from_xml.py", line 670, in ensure_event_id_is_uuid
    uuid.UUID(event_id, version=4)
  File "/usr/lib64/python3.9/uuid.py", line 174, in __init__
    hex = hex.replace('urn:', '').replace('uuid:', '')
AttributeError: 'UUID' object has no attribute 'replace'
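
Reading the traceback, the last frames suggest that `uuid.UUID()` was called with a value that is already a `uuid.UUID` instance rather than a string: the constructor tries to call `.replace()` on its first argument, which only strings have. Where that instance comes from is not visible in the traceback. The sketch below only reproduces the failure and shows one defensive pattern. `passing_uuid_object_fails` and `ensure_uuid_string` are hypothetical helpers, not Archivematica functions.

```python
import uuid


def passing_uuid_object_fails(value):
    """Return True if uuid.UUID() rejects value as its hex argument."""
    try:
        uuid.UUID(value, version=4)
    except (AttributeError, TypeError):
        # On Python 3.9 this is the AttributeError shown in the traceback.
        return True
    return False


def ensure_uuid_string(event_id):
    """Return event_id as a canonical UUID string.

    Accepts either a str or a uuid.UUID instance. Converting to str first
    avoids handing a UUID object to the uuid.UUID() constructor.
    """
    return str(uuid.UUID(str(event_id), version=4))
```

With this pattern, both `ensure_uuid_string(uuid.uuid4())` and `ensure_uuid_string(str(uuid.uuid4()))` return a plain string.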

mcantelon added a commit that referenced this issue Aug 19, 2024
* Fixed PREMIS event recording to add one event per file
* Updated PREMIS event recording and tests to work with updated
  transfer structure
* Return subpaths within transfer, instead of absolute paths, in file
  format validator list of failures
mcantelon added a commit that referenced this issue Aug 19, 2024
* Renamed event adding functions
* Fixed PREMIS event recording to add one event per file
* Updated PREMIS event recording and tests to work with updated
  transfer structure
* Return subpaths within transfer, instead of absolute paths, in file
  format validator list of failures
mcantelon added a commit that referenced this issue Aug 19, 2024
* Renamed event adding functions
* Fixed PREMIS event recording to add one event per file
* Updated PREMIS event recording and tests to work with updated
  transfer structure
@mcantelon
Contributor

PR to fix issues: #19

mcantelon added a commit that referenced this issue Aug 20, 2024
* Renamed event adding functions
* Fixed PREMIS event recording to add one event per file
* Updated PREMIS event recording and tests to work with updated
  transfer structure
@mcantelon
Contributor

Fix merged! 🤞

mcantelon added a commit that referenced this issue Aug 28, 2024
Added support for born digital SIPs to PREMIS event recording.
mcantelon added a commit that referenced this issue Aug 28, 2024
Added support for born digital SIPs to PREMIS event recording.

[skip-codecov]
jraddaoui added a commit that referenced this issue Aug 28, 2024
Co-authored-by: José Raddaoui Marín <raddaouimarin@gmail.com>
@sallain sallain closed this as completed Oct 30, 2024