
Added Computational Workflow metadata and related documentation #8812

Merged — 4 commits merged into IQSS:develop from adaybujeda:8639-computational-workflow-metadata on Jul 29, 2022

Conversation

@abujeda (Contributor) commented Jun 23, 2022

What this PR does / why we need it:
Add Computational Workflow metadata to Dataverse and related documentation on how to use it.

Computational Workflow: a computational / data-driven workflow, designed to compose and execute a series of computational or data manipulation steps in a scientific application.

Which issue(s) this PR closes:
Closes #8639

Special notes for your reviewer:
None

Suggestions on how to test this:
For new installations, the metadata block would have been added automatically to the database. That is no longer the case: this is now an experimental feature.
Please add the new metadata block to the Dataverse installation (a quick verification sketch follows these steps):
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/computational_workflow.tsv -H "Content-type: text/tab-separated-values"

Create a new dataverse and add computational workflow metadata into it.
Create a dataset within this dataverse and populate the computational workflow metadata fields.

There are only 3 metadata fields. None of them are compulsory.
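As a quick sanity check after loading the block, the standard metadata blocks API should show it (a sketch; it assumes the block id is computationalworkflow, as used in the TSV):

# Sketch: confirm the block is present after loading
curl http://localhost:8080/api/metadatablocks/computationalworkflow
# Or list all loaded blocks and look for it
curl http://localhost:8080/api/metadatablocks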

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No

Is there a release notes update needed for this change?:
Yes. Added to the PR

Additional documentation:
None

@pdurbin (Member) commented Jul 18, 2022

@adaybujeda the Jenkins tests didn't run so I merged the latest from develop to kick off another build.

I'm still in the process of reviewing this PR but I thought I'd add a couple screenshots from a quick play:

[Screenshot: showing the new block]

[Screenshot: showing the facet]

One quick bit of feedback is that "External Code Repository URL" should be a clickable URL. Please stay tuned for more!

@poikilotherm we spoke briefly about this PR today. Would you like to add a review as well?

@coveralls commented Jul 18, 2022

Coverage Status

Coverage remained the same at 19.756% when pulling b1bb1d3 on adaybujeda:8639-computational-workflow-metadata into ba0183d on IQSS:develop.

@abujeda (Contributor, Author) commented Jul 19, 2022

Thanks @pdurbin.
Regarding the clickable URL for the External Code Repository URL field, is this achieved by updating the display format to <a href="#VALUE" target="_blank" rel="noopener">#VALUE</a>?

@poikilotherm (Contributor) left a comment


While I generally like the idea of having the option to filter by metadata type, and workflows are unlikely to suffer as much from the mixed-content problems of a dataset, I would very much appreciate making and declaring this an experimental feature.

Also: could the same metadata fields appearing in different schemas cause a problem? (See comment inline.)

@@ -261,6 +261,9 @@
<field name="cleaningOperations" type="text_en" multiValued="false" stored="true" indexed="true"/>
<field name="collectionMode" type="text_en" multiValued="true" stored="true" indexed="true"/>
<field name="collectorTraining" type="text_en" multiValued="false" stored="true" indexed="true"/>
<field name="workflowType" type="text_en" multiValued="true" stored="true" indexed="true"/>
<field name="codeRepository" type="text_en" multiValued="true" stored="true" indexed="true"/>
A Contributor commented:

question: this is obviously coming from https://schema.org/codeRepository, getting used within the Workflow schema (below) and within the to-be-done CodeMeta schema (#7877). AFAIK this could be the first time we might add the same metadata field in two different schemas. Are we confident this will not cause trouble?

A Member replied:

The name field in the TSV files must be globally unique in an instance, and this is now enforced.
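For illustration, one way to spot such a collision ahead of time would be to look for duplicate field declarations in the Solr schema (a sketch; the schema.xml path under conf/solr/ is an assumption and depends on the Solr version directory in your checkout):

# Sketch: flag any Solr field names declared more than once in schema.xml
grep -o '<field name="[^"]*"' conf/solr/*/schema.xml | sort | uniq -d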

Comment on lines +1 to +6
## Adding Computational Workflow Metadata
The new Computational Workflow metadata block will allow depositors to effectively tag datasets as computational workflows.

To add the new metadata block, follow the instructions in the user guide: <https://guides.dataverse.org/en/latest/admin/metadatacustomization.html>

The location of the new metadata block tsv file is: `dataverse/scripts/api/data/metadatablocks/computational_workflow.tsv`
@poikilotherm (Contributor) commented Jul 19, 2022

suggestion: May I request that we declare this experimental, as is usually done with new features like this?

@abujeda (Contributor, Author) replied:

I am not sure it should be. Is this something you can make a decision on @jggautier?

Or @scolapasta / @pdurbin , should this feature be labeled as experimental?

A Contributor replied:

@poikilotherm, the effect of labelling this experimental would be to let installations know that it's likely that it will change (more likely than other changes in the update) as more work is done to support the publication of workflows and software. Is that the reasoning?

Since this is considered an MVP, I agree, and the group that worked on this metadata block shares the concerns you raise in your comments in this PR (and in your talks during the Dataverse Community Meeting). I would want to confirm with Mahmood that this be labelled as experimental. Does anyone know Mahmood's GitHub username?

A Member replied:

It seems to be @mmshad.

As @scolapasta and @jggautier know, we just talked about this at standup. That is, should this feature be enabled for all new installations or not? We're going to talk about this a bit more during our sprint planning meeting in a few hours.

As Gustavo pointed out, for embargo (which isn't on by default) we turned it on for the demo server to play around with it. (I'm not even sure if embargo is enabled in Harvard Dataverse or not.) So we could try this... enable the workflow block on the demo server first.

If we do flag the workflow block as experimental and not enabled by default, we should probably remove it from the appendix of the User Guide, or at least add a note saying that your admin needs to enable it manually.

@mmshad if you happen to know any Harvard Dataverse users that would like to try out this workflow block, please let us know.


@pdurbin, I can ask a few labs to start uploading their workflows. My team can use it as well.

Comment on lines 160 to 172
BagIt Support
-------------

BagIt is a set of hierarchical file system conventions designed to support disk-based storage and network transfer of arbitrary digital content. It offers several benefits such as integration with digital libraries, easy implementation, and transfer validation. See `the Wikipedia article <https://en.wikipedia.org/wiki/BagIt>`__ for more information.

If the repository you are using has enabled BagIt file handling, when uploading BagIt files the repository will validate the checksum values listed in each BagIt’s manifest file against the uploaded files and generate errors about any mismatches. The repository will identify a certain number of errors, such as the first five errors in each BagIt file, before reporting the errors.

|bagit-image1|

You can fix the errors and reupload the BagIt files.

For information on how to enable and configure the BagIt file handler see the :ref:`installation guide <BagIt File Handler>`

A Contributor commented:

question: Why is this here? This PR should be about workflow support. Is this some rogue commit that landed in the PR?

@abujeda (Contributor, Author) replied:

For objective 2 of the DataCommons project, the project manager wanted to bundle all the changes into a single unit for user testing. When discussing with @scolapasta what the best approach would be to integrate all of the objective 2 changes into Dataverse, we decided to split them into a set of PRs.

The documentation for objective 2 as a whole was bundled into this change, which is why BagIt appears here.

@pdurbin is it OK to leave the BagIt documentation in this PR or do you want me to move it to a new one?
As far as objective 2 is concerned, we need all the changes.

@@ -7,3 +7,5 @@ curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @da
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/astrophysics.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/biomedical.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/journals.tsv -H "Content-type: text/tab-separated-values"
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/computational_workflow.tsv -H "Content-type: text/tab-separated-values"
A Contributor commented:

suggestion: Can we please make this not a default for new installations, and instead add it as an experimental feature?

There is a lot more discussion coming about how data, code, and workflows should be handled, and the results might change/replace/alter the approach outlined here. It would be nice not to have new installations add this by default.

@abujeda (Contributor, Author) replied:

I wasn't sure myself when creating the PR so I asked @scolapasta about it. If I recall correctly, he said it was OK to add it in the default setup for new installations as you still need to add it to your collection manually.

@pdurbin (Member) commented Jul 19, 2022

updating the display format to:
<a href="#VALUE" target="_blank" rel="noopener">#VALUE</a>, right?

@adaybujeda yes exactly. You can look at keywordVocabularyURI as an example. It looks like you already have the type (url) and watermark set correctly.
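For illustration, the change would presumably be to set that field's displayFormat column in computational_workflow.tsv to the anchor markup quoted above and then re-load the block with the same admin endpoint (a sketch, assuming the block name is unchanged so the load acts as an update):

# Sketch: re-load the updated TSV so the new displayFormat takes effect
curl http://localhost:8080/api/admin/datasetfield/load -X POST --data-binary @data/metadatablocks/computational_workflow.tsv -H "Content-type: text/tab-separated-values"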

I'm not actually sure what "termURI" should be. You have "computationalworkflow" for all three fields but https://guides.dataverse.org/en/5.11/admin/metadatacustomization.html suggests this could be a schema.org or purl.org URL. @qqmyers and @jggautier know more about this.

@jggautier (Contributor) commented:

In the TSV file at https://github.com/IQSS/dataverse/blob/8b467f710051c26950f38cdc9bf5eaffb5f76d0f/scripts/api/data/metadatablocks/computational_workflow.tsv, I don't see any URIs in the termURI field. I see "computationalworkflow" in the metadatablock_id column:

[Screenshot of the TSV file]

@pdurbin (Member) commented Jul 20, 2022

We just had a sprint planning meeting where a bunch of us discussed this issue and PR: @lenwiz @scolapasta @mreekie @landreev @sekmiller @kcondon @qqmyers @donsizemore @siacus

Here are the decisions we made:

  • Overall, we'd like to treat this Computational Workflow metadata block as experimental. This will prevent this metadata block from getting into common usage, at least for now. This means (@adaybujeda, I can make these changes):
    • In the appendix, we'll carve out an area to list experimental blocks
    • In the installer, we won't load this block.
    • In the release notes, we'll explain how to load it.
  • In terms of deployment, post release (probably 5.12):
    • We'll deploy the block to https://demo.dataverse.org and possibly ask @mmshad to ask a few labs to try it. (We understand that people may have already given input. Also, we'd need to make sure they understand that it's a test server without real DOIs or persistence.)
    • Assuming we're happy with demo, we'll deploy the block to Harvard Dataverse.
  • Going forward, we can probably advertise other experimental blocks in the appendix:
    • CodeMeta
    • 3D Object
    • DarwinCore
    • ... ?
  • We should continue to work on the process for how experimental blocks become real.

@pdurbin (Member) commented Jul 20, 2022

@adaybujeda I just made some suggestions in a pull request: abujeda#1

I'm happy to discuss! Or feel free to merge if you like the direction! Thanks!

@abujeda (Contributor, Author) commented Jul 20, 2022

@pdurbin I have merged the PR.

@jggautier @mmshad, Philip made some changes to the documentation, see previous comments for more background information regarding the changes.

This commit has a summary of the changes in case you want to review:
6d75a63

@jggautier (Contributor) commented Jul 21, 2022

Hi all. I'd like to move the "(see .tsv version)" link closer to what it's describing. So instead of:

Computational Workflow Metadata: adapted from Bioschemas Computational Workflow Profile, version 1.0 and Codemeta (see .tsv version).

It would be:

Computational Workflow Metadata (see .tsv version): adapted from Bioschemas Computational Workflow Profile, version 1.0 and Codemeta.

Otherwise, I think it looks like the link is pointing to the TSV version of Codemeta. The same is true for the other metadatablocks described in the section.

Or maybe an alternative is to make the "See .tsv version" link its own sentence:

Computational Workflow Metadata: adapted from Bioschemas Computational Workflow Profile, version 1.0 and Codemeta. See .tsv version.

What do you think?

@abujeda (Contributor, Author) commented Jul 22, 2022

Both options look good, @jggautier.

Let me know which one you would like me to implement.

@poikilotherm poikilotherm mentioned this pull request Jul 22, 2022
@jggautier (Contributor) commented:

Thanks for considering, @adaybujeda. I just made and committed the changes. They're relatively small but I figured it would be better to make them now instead of me opening a new GitHub issue about it.

@pdurbin (Member) left a comment

Now that the "Computational Workflow" block is flagged as experimental and not loaded by default, I'm ready for this to go to QA. Approved.

@pdurbin pdurbin removed their assignment Jul 25, 2022
abujeda and others added 4 commits July 25, 2022 15:13
Other small changes:

- Carve out space for advertising experimental metadata blocks.
- CodePlex is extinct. Remove and reorder source hosting.
- Tweaks to BagIt docs.
Moved "see .tsv version" link closer to the name of the metadatablock it's describing
@abujeda abujeda force-pushed the 8639-computational-workflow-metadata branch from 29adad1 to b1bb1d3 on July 25, 2022 14:13
@abujeda (Contributor, Author) commented Jul 25, 2022

Thanks @pdurbin. I have rebased from develop. Ready for QA.

@philippconzett (Contributor) commented:

If file-level DOI is activated, would workflows get the value "Workflow" in the field resourceTypeGeneral of the DataCite Metadata Schema? See Table 7: Description of resourceTypeGeneral on page 48 of the DataCite Metadata Schema 4.4, available at https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf.
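For context, in DataCite 4.4 XML that value would appear roughly as follows (illustrative only; this is what the schema allows, not what Dataverse currently emits for these datasets):

<resourceType resourceTypeGeneral="Workflow">Computational Workflow</resourceType>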

@abujeda (Contributor, Author) commented Jul 26, 2022

As far as I know, no, they wouldn't. We haven't updated DataCite to understand the new metadata block.

@kcondon kcondon self-assigned this Jul 29, 2022
@kcondon kcondon merged commit 4605ecf into IQSS:develop Jul 29, 2022
@scolapasta scolapasta added HDC: 2 Harvard Data Commons Obj. 2 HDC Harvard Data Commons labels Aug 1, 2022
@pdurbin pdurbin added this to the 5.12 milestone Aug 2, 2022
@mreekie mreekie added the NIH OTA: 1.3.1 | Support software metadata label Dec 15, 2022
@mreekie mreekie added pm.GREI-d-1.3.1 NIH, yr1, aim3, task1: Support software metadata pm.GREI-d-1.3.2 NIH, yr1, aim3, task2: R & D phase biomedical workflows support labels Mar 20, 2023
Linked issue: Feature Request/Idea: Add Computational Workflow Metadata and Related Docs (#8639)