Spike: What work has already been done towards support for controlled vocabularies for metadata fields #8571

Closed
4 tasks done
mreekie opened this issue Apr 4, 2022 · 5 comments
Labels
Feature: Controlled Vocabulary Includes both Internal and external controlled vocabularies NIH OTA DC Grant: The Harvard Dataverse repository: A generalist repository integrated with a Data Commons NIH OTA: 1.2.1 2 | 1.2.1 | Design and implement integration with controlled vocabularies | 5 prdOwnThis is an it... pm.GREI-d-1.2.1 NIH, yr1, aim2, task1: Design and implement integration with controlled voc

Comments

@mreekie

mreekie commented Apr 4, 2022

This is in support of:

The first step is to figure out what has already been done by the Dataverse team and by the community towards this aim. The focus here is on the general area of controlled vocabularies, as opposed to specific biomedical vocabularies.

For example:

And then to figure out what the next steps are.

Def of done

As completely as is reasonably possible in a 2 week period (sprint):

  • Search out previous related work that has been done by the Harvard Dataverse team

  • Search out previous work done within the community

  • demonstration of what is found to be implemented already in dataverse.

    • This is a configuration item on dataverse
  • Define what's next

    • do we have enough information to describe how to get from here to implementing this feature
    • Or what do we need to do next to get additional information/context.

Aim 2:

Increase support for biomedical and cross-domain metadata standards and controlled vocabularies

One of the useful characteristics of the Dataverse open-source software is its extensive support for metadata standards and additional custom metadata. The standards currently supported include the Data Documentation Initiative (DDI), Dublin Core, DataCite, and Schema.org.

In particular, DDI makes a Dataverse repository interoperable even at the variable/attribute level since it supports variable descriptive and statistical metadata. This allows data exploration and analysis tools to integrate easily with the repository and discovery engines to find variable information.

In this project, we propose to

  1. expand DDI support to include the recently released DDI-Cross-Domain Integration (DDI-CDI) schema,
  2. build on existing support for biomedical-related standards relevant to NIH-funded research cases, following the recommendations from https://fairsharing.org/,
  3. expand descriptive and citation metadata to support funding information and related fields, and
  4. integrate with external services to enable the support of controlled vocabularies for any metadata field, based on standardized, widely used data dictionaries. The HMS Research Data Management group will participate in the development of these standards and vocabularies for biomedical datasets, working directly with research laboratories.

Related documents

@mreekie mreekie changed the title Spike: What work has the dataverse team already done towards controlled vocabularies Spike: What work has already been done towards controlled vocabularies Apr 4, 2022
@mreekie mreekie added the NIH OTA DC Grant: The Harvard Dataverse repository: A generalist repository integrated with a Data Commons label Apr 4, 2022
@mreekie mreekie changed the title Spike: What work has already been done towards controlled vocabularies Spike: What work has already been done towards support for controlled vocabularies for any metadata field Apr 4, 2022
@mreekie mreekie changed the title Spike: What work has already been done towards support for controlled vocabularies for any metadata field Spike: What work has already been done towards support for biomedical controlled vocabularies metadata fields Apr 4, 2022
@mreekie mreekie changed the title Spike: What work has already been done towards support for biomedical controlled vocabularies metadata fields Spike: What work has already been done towards support for biomedical controlled vocabularies for metadata fields Apr 7, 2022
@mreekie mreekie changed the title Spike: What work has already been done towards support for biomedical controlled vocabularies for metadata fields Spike: What work has already been done towards support for controlled vocabularies for metadata fields Apr 7, 2022
@mreekie mreekie added the Medium label Apr 7, 2022
@mreekie mreekie added the pm.Len label Apr 25, 2022
@landreev landreev self-assigned this Apr 26, 2022
@landreev
Contributor

landreev commented May 3, 2022

Brief summary:

There's a good chance that Dataverse offers sufficient support for controlled vocabularies to achieve the goals of the NIH grant without much development/coding work being necessary. This means that most of the work needed will concern defining the actual metadata standards and the Controlled Vocabulary Values (CVVs). Some JavaScript coding may be necessary if we end up using the External Vocabulary mechanism for importing the CVVs on the Dataverse side.

Support for Controlled Vocabularies in the Dataverse software

In order to use CVVs in Dataverse metadata fields the CV needs to be defined and imported in one of 2 supported ways:

  1. It can be defined as part of a Metadata Block. This has been part of the core functionality of the application from the get go. We define and publish a proprietary file format for encoding metadata fields (with the provision for specifying that a field supports a CV and defining the allowed CVVs), and an API for importing and updating these definitions. See the Metadata Customization guide for more details.
  2. Support for External Vocabulary Services was added in 2021.
    The important PRs in question:

The grant document linked and quoted in the description of the spike above says explicitly "integrate with external services". This implies that we will be using solution 2 above to integrate with CVVs from "standardized, widely used data dictionaries". However, we should keep in mind that the same integration may also be achievable with the standard built-in mechanism 1: creating a Metadata Block (or expanding the existing Biomedical block) and defining the CVVs as part of it, perhaps with a scripted solution for retrieving the dictionaries from external sources and encoding them as standard Dataverse block definition files. Which approach fits better depends on the specifics of the dictionaries and definitions in question: how large the vocabulary is, in what format it is served remotely, and how often we should expect it to change. (There's some discussion of this in the Metadata Customization guide mentioned above. If using the External Service solution is a fixed decision that has already been made, we can skip this step.)
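To make the "scripted solution" idea concrete, here is a minimal sketch of encoding an externally retrieved term list as controlled vocabulary rows for a block definition file. The column layout (`#controlledVocabulary`, `DatasetField`, `Value`, `identifier`, `displayOrder`, with data rows leaving the first column empty) follows the layout of the distributed citation.tsv as I understand it and should be checked against the Metadata Customization guide. The field name `exampleField` and the terms are placeholders, not real vocabulary entries.

```python
import csv
import io


def cvv_rows(field_name, terms):
    """Build the rows of a #controlledVocabulary section.

    `terms` is an iterable of (value, identifier) pairs, e.g. as parsed
    from an external dictionary. Column names are an assumption based on
    the layout of the distributed citation.tsv.
    """
    rows = [["#controlledVocabulary", "DatasetField", "Value",
             "identifier", "displayOrder"]]
    for order, (value, identifier) in enumerate(terms):
        # Data rows leave the first (section marker) column empty.
        rows.append(["", field_name, value, identifier, str(order)])
    return rows


def to_tsv(rows):
    """Serialize rows as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder terms standing in for an externally retrieved dictionary.
terms = [("Term A", "A01"), ("Term B", "A02")]
tsv_text = to_tsv(cvv_rows("exampleField", terms))
print(tsv_text)
```

A script like this would only cover the vocabulary section; the `#metadataBlock` and `#datasetField` sections of the file would still need to be authored by hand or generated separately.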

As far as "what's next" is concerned, the most logical next step appears to be this (again, quoting item 4 under "Aim 2" in the grant description):

The HMS Research Data Management group will participate in the development of these standards and vocabularies for biomedical datasets, working directly with research laboratories.

I.e., we need a better idea of the actual metadata specifications that we will need to support, and/or how these definitions will be served to us externally. Support for External Vocabulary Services is implemented in part by supplying JavaScript code that interfaces with the remote provider and assists with populating the metadata edit forms in the Dataverse UI. Scripts that support SKOSMOS and ORCID are provided as standard in the dedicated repository. Scripts supporting additional protocols may be available in the same repository, supplied by the Dataverse dev. community. If Dataverse needs to integrate with remote CVs served via a protocol not yet supported, more custom scripts will need to be developed. Hence figuring out these details is the next logical step of the effort.
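For readers unfamiliar with what such a script has to do, the data flow can be sketched roughly as follows. This is an illustration in Python, not the actual mechanism: the real scripts are JavaScript running in the browser. The service URL, the `query` parameter, and the response shape (`items` with `id` and `name`) are all hypothetical and would differ per provider, which is exactly why a new protocol may need a new script.

```python
import json
from urllib.parse import urlencode


def build_query_url(service_url, text):
    """Build the search URL for the text typed into the metadata field."""
    return service_url + "?" + urlencode({"query": text})


def parse_candidates(payload):
    """Turn a JSON response into (label, uri) pairs.

    Assumed (hypothetical) shape: {"items": [{"id": ..., "name": ...}]}.
    """
    data = json.loads(payload)
    return [(item["name"], item["id"]) for item in data.get("items", [])]


def lookup(service_url, text, fetch):
    """Query the remote vocabulary; `fetch` maps a URL to a response body."""
    return parse_candidates(fetch(build_query_url(service_url, text)))


def fake_fetch(url):
    """Stand-in for an HTTP call so the sketch runs offline."""
    return json.dumps({"items": [
        {"id": "https://vocab.example.org/term/1", "name": "Example Term"},
    ]})


results = lookup("https://vocab.example.org/search", "example term", fake_fetch)
print(results)
```

The label is what the user would see in the edit form, and the URI is what would be stored as the field value; supporting a different provider mostly means changing how the URL is built and how the response is parsed.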

@landreev
Contributor

landreev commented May 5, 2022

For the "demonstration of what is found to be implemented already in dataverse" item on the checkbox list (that I missed earlier):

  1. A very simple example demonstrating an External Vocabulary Service integrated with a Dataverse installation to supply Controlled Vocabulary Values for a metadata field:

    Go to https://demo.dataverse.org.
    This installation is configured to query Research Organization Registry (ROR) for a list of institutions to be used as a controlled vocabulary for the authorAffiliation metadata field. This field is part of our standard/basic metadata block "citation", but it's a free-text (non-CV) field in the default configuration. Thanks to @pdurbin for suggesting this example. Support for ROR is a new contribution from a community member: Authors and Affiliations Lookup (KU Leuven).
    Create a dataset, or select a dataset that you are authorized to edit.
    In the Edit Metadata form, note the search icon next to the "Author Affiliation" text box:

    Screen Shot 2022-05-04 at 7 25 10 PM

    (optionally) type something in it:

    Screen Shot 2022-05-04 at 7 25 31 PM

    ... now click on the search icon - the UI will present you with a list of the available CV values:

    Screen Shot 2022-05-04 at 7 25 49 PM

  2. Also, a quick demonstration of what locally-defined CVs look like in the core Dataverse functionality:

    (I'm assuming the demo is potentially for somebody not yet familiar with the application)
    (since this is base functionality, this will work on any Dataverse instance)

    On the same Edit Metadata form, click on the "Language" field.
    The list of the defined CVVs is presented as a pulldown menu:

    Screen Shot 2022-05-04 at 7 27 41 PM

    These values are defined in the Metadata Block that we distribute (as citation.tsv). When we get a request to add some extra known languages to the list, we add them to the tsv, publish the updated file and tell all the installations to refresh their block definitions (usually as part of a Dataverse version release).
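As a rough sketch of the "refresh their block definitions" step, the snippet below prepares (without sending) the upload of a block TSV to an installation. The endpoint path `/api/admin/datasetfield/load` and the content type are taken from the Metadata Customization guide as I recall it and should be verified there; the base URL is an assumption, since the admin API is normally reachable only from localhost.

```python
import urllib.request


def build_load_request(base_url, tsv_bytes):
    """Prepare a POST that uploads a metadata block definition.

    The endpoint path is an assumption to be checked against the
    Metadata Customization guide for the installed Dataverse version.
    """
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/admin/datasetfield/load",
        data=tsv_bytes,
        headers={"Content-type": "text/tab-separated-values"},
        method="POST",
    )


# Placeholder payload; in practice this would be the bytes of citation.tsv.
req = build_load_request("http://localhost:8080", b"#metadataBlock\tname\n")
print(req.get_method(), req.full_url)

# To actually send it on a machine running Dataverse:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```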

(Thinking about it, this list of valid language names from the last example would also be a prime use case for an External Service, for example by pulling it directly from the ISO 639 definition page. This would eliminate us as the middleman having to maintain/replicate the list in our own block distribution... But this is of course entirely outside the scope of this spike issue.)

@landreev landreev removed their assignment May 5, 2022
@scolapasta
Contributor

Thanks @landreev, this is a great explanation of what we have there and confirmed ways to demo. I checked with @lenwiz and confirmed we can close this spike now.

@landreev
Contributor

landreev commented May 6, 2022

@scolapasta OK, great. I left the checkboxes, under the "definition of done", un-checked; I wanted a reviewer to do the clicking, if they were satisfied with what I wrote. I can do it now; not that it matters much, just being thorough/ocd.

The only thing I could think of adding to the above: there are some subtle differences in behavior between locally imported and external CVs (a different meaning of "controlled", really). This could be an area where some dev. effort would be needed if, for example, we were to use the external model for the GREI vocabularies but wanted them to function 1:1 like fixed CVs defined in metadata blocks. But, again, we need to know more about the actual metadata standards and vocabularies we will be working with in order to discuss that.

@scolapasta
Contributor

@landreev I went ahead and checked them.

@mreekie mreekie added the Feature: Controlled Vocabulary Includes both Internal and external controlled vocabularies label May 9, 2022
@mreekie mreekie added the NIH OTA: 1.5.1 collection: 5 | 1.5.1 | Standardize download metrics for the Harvard Dataverse repository... label Oct 6, 2022
@mreekie mreekie added NIH OTA: 1.2.1 2 | 1.2.1 | Design and implement integration with controlled vocabularies | 5 prdOwnThis is an it... and removed NIH OTA: 1.5.1 collection: 5 | 1.5.1 | Standardize download metrics for the Harvard Dataverse repository... labels Oct 18, 2022
@mreekie mreekie added the pm.GREI-d-1.2.1 NIH, yr1, aim2, task1: Design and implement integration with controlled voc label Mar 20, 2023