
Convert specification to schema format #540

Closed
tsalo opened this issue Jul 23, 2020 · 16 comments
Labels
formatting (Aesthetics and formatting of the spec) · infrastructure · schema (Issues related to the YAML schema representation of the specification. Patch version release.)

Comments

@tsalo
Member

tsalo commented Jul 23, 2020

A long-term goal of the specification could be to make almost all of its content into a machine-readable schema, to facilitate automated use of the specification in other packages (e.g., pybids and bids-validator), as well as to propagate small changes across the specification.

This is related to #423, #466, and #475. In #423, @dbkeator discusses work in extracting schema-like information from the specification, converting the terms to JSON-LD format, and linking BIDS terms to similar terms in other ontologies (see BIDS_Terms). Many elements of this work should also be incorporated into the actual specification, as it explicitly defines associations between terms within the specification. This should, in turn, make extracting relevant information from BIDS for other efforts, like BIDS_Terms, much easier. Initial work toward doing this conversion with the YAML format and specifically limited to the entity table has been done in #475.

Here are some initial goals for the conversion:

  • Distinguish different object types, and organize them into different folders: entities, modalities, datatypes, suffixes, extensions, and metadata fields, at minimum.
    • In cases where order is important (e.g., entities), organize or label the objects in a manner conducive to this.
  • Define top level files and associated data folders (e.g., sourcedata/).
  • The inheritance principle, somehow.
  • Explicitly require JSON files.
  • Define each of the objects, possibly in its own file, with the following fields:
    • Name
    • Definition
    • Format/allowed values (for entities and metadata fields)
    • Mutually exclusive objects (at least for metadata fields)
    • Citation (at least for modalities)
  • Link associated object types:
    • Required and optional metadata for each datatype.
    • Required and optional columns for tabular files.
    • Required and optional entities for each datatype, broken down by groups of suffixes.
    • Required and optional suffixes for each datatype, broken down by groups of suffixes.
  • Code to automatically compile the Markdown and PDF versions of the schema.
    • Objects should either have their own pages or their definitions should be duplicated in any sections where they're applicable. For example, the run entity is currently defined once, under Anatomy imaging data under Magnetic Resonance Imaging, even though the entity applies to many other datatypes, and is generally referenced or briefly defined in those other datatypes' sections, without a link to the main definition.
    • The code-formatted templates should also be compilable.
  • Code to minimally validate the specification, ensuring that all files for a given object type have the required fields. In cases where those fields are supposed to have specific values, the validation should check that those fields are correct in all files.
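To make the per-object files and the minimal-validation goal concrete, here is a sketch of what a per-object YAML definition and a required-field check might look like. Everything here is hypothetical: the file layout, field names, and values are invented for illustration and are not an agreed format.

```python
# Hypothetical sketch: a per-object YAML definition plus a minimal
# required-field check. The schema layout and field names are invented
# for illustration, not an agreed BIDS format. Assumes PyYAML.
import yaml

# What a file like objects/entities/run.yaml might contain (hypothetical).
RUN_ENTITY_YAML = """
name: run
entity: run
description: >
  An index denoting one of several acquisitions of the same modality
  with identical parameters.
format: index
"""

REQUIRED_FIELDS = {"name", "description", "format"}

def check_required_fields(yaml_text, required=REQUIRED_FIELDS):
    """Return the set of required fields missing from an object definition."""
    obj = yaml.safe_load(yaml_text)
    return required - set(obj)

missing = check_required_fields(RUN_ENTITY_YAML)
print(sorted(missing))  # -> [] (an empty list means the definition passes)
```

A real validator would additionally check field values (e.g. that `format` is one of an allowed set), but even this field-presence check would catch most copy-paste omissions across object files.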

I think that the only sections of the specification that wouldn't be described in the schema would be the Appendix pages, introduction pages/sections, Common Principles, and specific examples.

Open questions:

  • JSON-LD vs. YAML
    • I think JSON-LD is more standard for this type of thing, but I personally find it harder to work with than YAML.
  • Duplication vs. modularization
    • In [INFRA] Convert entity table to yaml #475, we've leaned toward minimizing duplication and placing information in larger files, in order to make it easier to make changes across files. When there's automated validation, this may be less of an issue since developers will be aware when they introduce breaking changes. Generally in schemas, it seems like each object gets its own file.
  • Conversion of past versions of the specification
    • @yarikoptic created bids-schema, where the schema can be made available across versions, so it's possible to convert old versions of the specification to the new schema format, although it would be a lot of work.
@tsalo tsalo added formatting Aesthetics and formatting of the spec infrastructure labels Jul 23, 2020
@yarikoptic
Collaborator

A small clarification: "JSON-LD vs. YAML" is a bit of an off comparison, since JSON-LD is JSON + LD conventions for key names etc., and YAML (1.2) is a superset of JSON (so any valid JSON is also valid YAML, but not necessarily vice versa). So it is possible to pretty much have "YAML-LD", which would remain a more human-accessible version of JSON-LD. See e.g. the YAML used for "working toward JSON-LD" by @satra within https://github.com/dandi/schema/blob/master/terms/AccessRequirements.yaml. We would just need to verify (test) that we can still do a YAML -> JSON -> YAML round trip, i.e. that no features of YAML which are not in JSON were introduced.
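The round-trip check described here is easy to automate; a minimal sketch (assuming PyYAML) might be:

```python
# Minimal sketch of the YAML -> JSON -> YAML round-trip test: if the
# YAML source uses only JSON-compatible features, the data should
# survive the trip unchanged. Assumes PyYAML.
import json
import yaml

def round_trips(yaml_text):
    """True if the YAML document survives YAML -> JSON -> YAML intact."""
    data = yaml.safe_load(yaml_text)
    as_json = json.dumps(data)       # YAML -> JSON
    back = yaml.safe_load(as_json)   # JSON is valid YAML, so this parses too
    return back == data

print(round_trips("entity: run\nformat: index\n"))  # -> True
```

A test like this run in CI would flag any contribution that introduces YAML-only features (anchors, custom tags, non-string keys) into the schema sources.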

So AFAIK we do not really need to "use JSON" while working on the schema (in our "sources") -- we can convert to JSON-LD for the "published" document. Our "sources" could remain as modular and duplication-free as we like; we just need to introduce "provisions" for encoding the graph structure of LD and ensure that we can produce proper JSON-LD from it.

Having said all that, @satra et al. are releasing JSON-LDs (and other serializations) while, I believe, working directly within "JSON-LD" sources, also modularized into files (but without hierarchy); e.g. see https://github.com/ReproNim/reproschema/blob/master/terms/multipleChoice . To my very personal liking there is a bit of duplication there which we could avoid... ;-)

@satra
Collaborator

satra commented Jul 24, 2020

> JSON-LD is JSON + LD conventions for key names

i should clarify this is only true for the compact form of JSON-LD. the expanded form has its own specific syntax.

@yarikoptic
Collaborator

> JSON-LD is JSON + LD conventions for key names
>
> i should clarify this is only true for the compact form of JSON-LD. the expanded form has its own specific syntax.

so is it no longer valid JSON?

@satra
Collaborator

satra commented Jul 24, 2020

it is valid JSON, since JSON is always the underlying format, but the validation schema for the compact form and the expanded forms would be different. valid JSON doesn't mean valid data :)
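To illustrate this point with a toy example (the statement and its expansion below are invented for illustration, using a schema.org term): both the compact and expanded forms parse as plain JSON, but their shapes differ, so each would need its own validation schema.

```python
# Toy illustration: the compact and expanded JSON-LD forms of the same
# statement are both valid JSON, but their shapes differ, so a JSON
# schema written for one would not validate the other.
import json

compact = '''{
  "@context": {"name": "http://schema.org/name"},
  "name": "run"
}'''

expanded = '''[
  {"http://schema.org/name": [{"@value": "run"}]}
]'''

c = json.loads(compact)    # parses fine: a JSON object
e = json.loads(expanded)   # parses fine: a JSON array of node objects
print(type(c).__name__, type(e).__name__)  # -> dict list
```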

@yarikoptic
Collaborator

sure, good... I was simply pointing out that any (compact or not) JSON-LD is a valid JSON (and thus a valid YAML). What we would need to do for validation of the schema itself is another aspect.

@tsalo
Member Author

tsalo commented Jul 26, 2020

I guess I really don't know enough about JSON-LD (or even just LD). The decision I was trying to describe is generally between an established but highly technical specification with a steep learning curve (i.e., JSON-LD) and a customized, readable, but highly idiosyncratic specification (e.g., something like what's being done in #475). If we do choose to follow established ontology structures like JSON-LD, we'll need some sort of translation layer between everyday contributors and the schema, such as maintainers with specific training. On the other hand, if we use ad hoc structures, then they'll be much more readable, but also not easily interoperable with other ontologies.

Given that I have very limited experience with JSON-LD (mostly looking at NIMADS and NIDM and being confused), I hope that I'm not misrepresenting the tradeoff. Please correct me if either of you see a problem with it.

@sappelhoff - On an unrelated note, since there are currently several issues and one PR dedicated to the conversion of the specification to a schema, could we create a new Project board to manage them?

@satra
Collaborator

satra commented Jul 26, 2020

@tsalo - i think there are perhaps several components being conflated together, so instead of directly thinking about json/jsonld/ld, let's consider the components and why we may want to represent them in some schema or other structured format.

I am also writing this partly to think this through a bit more openly. It may also be helpful for anyone new to bids to consider the potential places for contribution and ramifications that any change should consider. This builds on the very nice set of areas that @tsalo lays out in the original post (so please read: #540 (comment) before this).

  1. Vocabulary or listing of valid bids terms. This is part of the goal of the nidm-terms project. The intent here is to accumulate possible terms in the bids world in some structured manner. This is not about where these terms are used or what relations these terms may have. It is simply saying these terms exist and here is how i can describe them better through potential contextualization both within bids and to other ontologies (cogpo, cog atlas, and others). Thus creating a universal vocabulary across BIDS. So any new person adding something would simply need to add a term in a common place and then use it as necessary in the specification.

  2. Structural and value constraints
    a. BIDS provides structural constraints via its hierarchy. This translates to applicable patterns at different levels. For this situation it is more about validation. One could simply take a BIDS dataset walk (i.e. a tree), turn it into a JSON document and run a json-schema against it for validation. I'm not sure at this point, whether or not a JSON schema can successfully validate this object.
    b. Structure of files. JSON and TSV files should be parseable as such.
    c. Contents of JSON files. These are JSON themselves, therefore ideally there should be a schema to validate against.
    d. Contents of TSV files. This requires an interaction between the TSV file and an associated "validation schema" of some kind in the related JSON file that describes the columns.
    e. Other files. BIDS at present includes a host of other data files. Those should be validated at least at a structural level.

  3. Reasons for structuring
    a. Generate documentation that highlights the structure. I believe this is what this started with: the entity table. But more generally this could be used to relate pieces of BIDS across modalities, datatypes, etc.
    b. Provide new contributors an easy place to add things without violating other pieces of the specification. And perhaps for contributors to see what the current constraints are.
    c. Simplify software maintenance. Extend the bids validator/pybids to support more things over time and build it from a common model that supports the specification and the validation.
    d. Validation. To validate contents or structure of hierarchy automatically.
    e. Search. At the end of the day search can be seen through multiple lenses. Free form text search or more faceted search based on structured representations.
    f. Linking to other things. While many things in BIDS are constrained, several things are not. Some of the ongoing work tries to help link arbitrary ad hoc names to less ad hoc concepts. This does impact search directly, but also helps consolidate knowledge at a given time.

  4. Implementation technologies.
    There are many ways to create validators (JSON schema, SHACL, XML Schema, and others). There are many tools that support developing constraint models with validators (attrs, traitlets/traits, pydantic, sqlalchemy, pyshacl, pyld, and others). And of course one can create custom validators. Unfortunately, because trees and graphs are different structures, one may have to adopt one or more validation schemes depending on the lens through which you look at your object. This is partly where JSON-LD kind of comes in as a potentially happy medium. Under certain circumstances you can validate using JSON schema and be confident that a validation through SHACL would give the same result. Or vice versa, you can validate via SHACL and know that it will also pass JSON schema validation. This happy medium then also allows usage in different settings (ORM databases, graph databases) for different applications. There may be other constraints that many JSON validators may not be able to support. And that comes about when trying to implement the schema in some form.
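As a toy illustration of point 2a, one could walk a dataset listing and check each path against allowed filename patterns. Everything below is a hypothetical, drastically simplified stand-in for real BIDS rules; the patterns and the example listing are invented for illustration.

```python
# Toy illustration of structural validation (point 2a): walk a dataset
# listing and check each path against allowed filename patterns. The
# patterns here are drastically simplified, hypothetical stand-ins for
# the real BIDS rules.
import re

PATTERNS = [
    re.compile(r"^dataset_description\.json$"),
    re.compile(r"^participants\.tsv$"),
    re.compile(
        r"^sub-[0-9a-zA-Z]+/anat/"
        r"sub-[0-9a-zA-Z]+(_run-[0-9]+)?_T1w\.(nii\.gz|json)$"
    ),
]

def invalid_paths(paths):
    """Return the paths that match none of the allowed patterns."""
    return [p for p in paths if not any(rx.match(p) for rx in PATTERNS)]

listing = [
    "dataset_description.json",
    "participants.tsv",
    "sub-01/anat/sub-01_run-1_T1w.nii.gz",
    "sub-01/anat/sub-01_t1w.nii.gz",  # wrong case: should be flagged
]
print(invalid_paths(listing))  # -> ['sub-01/anat/sub-01_t1w.nii.gz']
```

If those patterns were generated from the schema rather than hand-written, the documentation, the validator, and downstream tools like pybids could all be derived from the same source, which is the point of goal 3c above.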

@sappelhoff
Member

> @sappelhoff - On an unrelated note, since there are currently several issues and one PR dedicated to the conversion of the specification to a schema, could we create a new Project board to manage them?

Hi @tsalo I just increased your scope of permissions for this repository. Can you go ahead and try to make a project now? If it doesn't work, I'll look into the permissions again.

I think it's a great idea to create a project board for this!

@tsalo
Member Author

tsalo commented Jul 27, 2020

It worked. Thanks!

@tsalo
Member Author

tsalo commented Jul 27, 2020

@satra You expressed all of my goals better than I did! As well as noted several more that I hadn't thought of.

Do you have any thoughts on where we should go from here? It seems like we need to decide on the schema of choice for each of the elements you describe in "Structural and value constraints". Should we open a separate issue for each, or try to find one solution that works for all of them? I assume we could commit to technology (e.g., JSON-LD or SHACL) and then break it down from there?

Perhaps we could ping some of the other folks who work with schemata more than me for their thoughts on the best technology for this case. Most everything under "Implementation technologies" went over my head, so I doubt I'll be able to contribute to that decision.

@satra
Collaborator

satra commented Aug 5, 2020

@tsalo - it may be good to have an online chat with interested parties to discuss some of these things?

@tsalo
Member Author

tsalo commented Aug 5, 2020

@satra that sounds like a good plan!

@tsalo
Member Author

tsalo commented Aug 10, 2020

@satra Who would be interested? Based on a recent BIDS maintainers call, I was thinking @yarikoptic, @dbkeator, @dnkennedy, @rwblair, and possibly Maryann Martone (not sure if Dr. Martone has a GitHub account).

EDIT: Oh and @nqueder! Apologies for the oversight.

@dbkeator

I'm interested....

@tsalo
Member Author

tsalo commented Sep 11, 2020

@yarikoptic and I joined a couple of NIDM-Terms calls with most of the interested folks where we've made solid progress on the format of the schema, and since there is now a project board for the conversion I think we can close this issue in favor of more focused issues. Any objections?

@sappelhoff
Member

+1 for smaller, targeted, actionable issues that are connected into a coherent whole via the project board

@tsalo tsalo closed this as completed Sep 12, 2020
@tsalo tsalo added the schema Issues related to the YAML schema representation of the specification. Patch version release. label Sep 23, 2020