
Convert specification to schema format #540

Closed
tsalo opened this issue Jul 23, 2020 · 16 comments
Labels
formatting (Aesthetics and formatting of the spec) · infrastructure · schema (Issues related to the YAML schema representation of the specification. Patch version release.)

Comments

@tsalo
Member

tsalo commented Jul 23, 2020

A long-term goal of the specification could be to make almost all of its content into a machine-readable schema, to facilitate automated use of the specification in other packages (e.g., pybids and bids-validator), as well as to propagate small changes across the specification.

This is related to #423, #466, and #475. In #423, @dbkeator discusses work in extracting schema-like information from the specification, converting the terms to JSON-LD format, and linking BIDS terms to similar terms in other ontologies (see BIDS_Terms). Many elements of this work should also be incorporated into the actual specification, as it explicitly defines associations between terms within the specification. This should, in turn, make extracting relevant information from BIDS for other efforts, like BIDS_Terms, much easier. Initial work toward doing this conversion with the YAML format and specifically limited to the entity table has been done in #475.

Here are some initial goals for the conversion:

  • Distinguish different object types, and organize them into different folders: entities, modalities, datatypes, suffixes, extensions, and metadata fields, at minimum.
    • In cases where order is important (e.g., entities), organize or label the objects in a manner conducive to this.
  • Define top level files and associated data folders (e.g., sourcedata/).
  • The inheritance principle, somehow.
  • Explicitly require JSON files.
  • Define each of the objects, possibly in its own file, with the following fields:
    • Name
    • Definition
    • Format/allowed values (for entities and metadata fields)
    • Mutually exclusive objects (at least for metadata fields)
    • Citation (at least for modalities)
  • Link associated object types:
    • Required and optional metadata for each datatype.
    • Required and optional columns for tabular files.
    • Required and optional entities for each datatype, broken down by groups of suffixes.
    • Required and optional suffixes for each datatype, broken down by groups of suffixes.
  • Code to automatically compile the Markdown and PDF versions of the schema.
    • Objects should either have their own pages or their definitions should be duplicated in any sections where they're applicable. For example, the run entity is currently defined once, under Anatomy imaging data under Magnetic Resonance Imaging, even though the entity applies to many other datatypes, and is generally referenced or briefly defined in those other datatypes' sections, without a link to the main definition.
    • The code-formatted templates should also be compilable.
  • Code to minimally validate the specification, ensuring that all files for a given object type have the required fields. In cases where those fields are supposed to have specific values, the validation should check that those fields are correct in all files.
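To make the per-object files and the minimal-validation goal concrete, here is a sketch of what a per-object YAML definition and a required-field check might look like. Everything here is hypothetical: the file layout, field names, and values are invented for illustration and are not an agreed format.

```python
# Hypothetical sketch: a per-object YAML definition plus a minimal
# required-field check. The schema layout and field names are invented
# for illustration, not an agreed BIDS format. Assumes PyYAML.
import yaml

# What a file like objects/entities/run.yaml might contain (hypothetical).
RUN_ENTITY_YAML = """
name: run
entity: run
description: >
  An index denoting one of several acquisitions of the same modality
  with identical parameters.
format: index
"""

REQUIRED_FIELDS = {"name", "description", "format"}

def check_required_fields(yaml_text, required=REQUIRED_FIELDS):
    """Return the set of required fields missing from an object definition."""
    obj = yaml.safe_load(yaml_text)
    return required - set(obj)

missing = check_required_fields(RUN_ENTITY_YAML)
print(sorted(missing))  # -> [] (an empty list means the definition passes)
```

A real validator would additionally check field values (e.g. that `format` is one of an allowed set), but even this field-presence check would catch most copy-paste omissions across object files.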

I think that the only sections of the specification that wouldn't be described in the schema would be the Appendix pages, introduction pages/sections, Common Principles, and specific examples.

Open questions:

  • JSON-LD vs. YAML
    • I think JSON-LD is more standard for this type of thing, but I personally find it harder to work with than YAML.
  • Duplication vs. modularization
    • In [INFRA] Convert entity table to yaml #475, we've leaned toward minimizing duplication and placing information in larger files, in order to make it easier to make changes across files. When there's automated validation, this may be less of an issue since developers will be aware when they introduce breaking changes. Generally in schemas, it seems like each object gets its own file.
  • Conversion of past versions of the specification
    • @yarikoptic created bids-schema, where the schema can be made available across versions, so it's possible to convert old versions of the specification to the new schema format, although it would be a lot of work.
@tsalo tsalo added formatting Aesthetics and formatting of the spec infrastructure labels Jul 23, 2020
@yarikoptic
Collaborator

A small clarification: "JSON-LD vs. YAML" is a bit of an off comparison, since JSON-LD is JSON + LD conventions for key names etc., and YAML (1.2) is a superset of JSON (so any valid JSON is also valid YAML, but not necessarily vice versa). So it is possible to pretty much have "YAML-LD", which would remain a more human-accessible version of JSON-LD. See e.g. the YAML used for "working toward JSON-LD" by @satra within https://github.com/dandi/schema/blob/master/terms/AccessRequirements.yaml. We would just need to verify (test) that we can still do a YAML -> JSON -> YAML round trip, i.e. that no features of YAML which are not in JSON were introduced.
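The round-trip check described here is easy to automate; a minimal sketch (assuming PyYAML) might be:

```python
# Minimal sketch of the YAML -> JSON -> YAML round-trip test: if the
# YAML source uses only JSON-compatible features, the data should
# survive the trip unchanged. Assumes PyYAML.
import json
import yaml

def round_trips(yaml_text):
    """True if the YAML document survives YAML -> JSON -> YAML intact."""
    data = yaml.safe_load(yaml_text)
    as_json = json.dumps(data)       # YAML -> JSON
    back = yaml.safe_load(as_json)   # JSON is valid YAML, so this parses too
    return back == data

print(round_trips("entity: run\nformat: index\n"))  # -> True
```

A test like this run in CI would flag any contribution that introduces YAML-only features (anchors, custom tags, non-string keys) into the schema sources.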

So AFAIK we do not really need to "use JSON" while working on the schema (in our "sources") -- we can convert to JSON-LD for the "published" document. Our "sources" could remain as modular and duplication-free as we like; we just need to introduce "provisions" for encoding the graph structure of LD and ensure that we can produce proper JSON-LD from it.

Having said all that, @satra et al. are releasing JSON-LDs (and other serializations) while, I believe, working directly within "JSON-LD" sources, also modularized into files (but without hierarchy); e.g. see https://github.com/ReproNim/reproschema/blob/master/terms/multipleChoice . To my very personal liking there is a bit of duplication there which we could avoid... ;-)

@satra
Collaborator

satra commented Jul 24, 2020

> JSON-LD is JSON + LD conventions for key names

i should clarify this is only true for the compact form of JSON-LD. the expanded form has its own specific syntax.

@yarikoptic
Collaborator

> JSON-LD is JSON + LD conventions for key names
>
> i should clarify this is only true for the compact form of JSON-LD. the expanded form has its own specific syntax.

so is it no longer valid JSON?

@satra
Collaborator

satra commented Jul 24, 2020

it is valid JSON, since JSON is always the underlying format, but the validation schema for the compact form and the expanded forms would be different. valid JSON doesn't mean valid data :)
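To illustrate this point with a toy example (the statement and its expansion below are invented for illustration, using a schema.org term): both the compact and expanded forms parse as plain JSON, but their shapes differ, so each would need its own validation schema.

```python
# Toy illustration: the compact and expanded JSON-LD forms of the same
# statement are both valid JSON, but their shapes differ, so a JSON
# schema written for one would not validate the other.
import json

compact = '''{
  "@context": {"name": "http://schema.org/name"},
  "name": "run"
}'''

expanded = '''[
  {"http://schema.org/name": [{"@value": "run"}]}
]'''

c = json.loads(compact)    # parses fine: a JSON object
e = json.loads(expanded)   # parses fine: a JSON array of node objects
print(type(c).__name__, type(e).__name__)  # -> dict list
```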

@yarikoptic
Collaborator

sure, good... I was simply pointing out that any (compact or not) JSON-LD is a valid JSON (and thus a valid YAML). What we would need to do for validation of the schema itself is another aspect.

@tsalo
Member Author

tsalo commented Jul 26, 2020

I guess I really don't know enough about JSON-LD (or even just LD). The decision I was trying to describe is generally between an established but highly technical specification with a steep learning curve (i.e., JSON-LD) and a customized, readable, but highly idiosyncratic specification (e.g., something like what's being done in #475). If we do choose to follow established ontology structures like JSON-LD, we'll need some sort of translation layer between everyday contributors and the schema, such as maintainers with specific training. On the other hand, if we use ad hoc structures, then they'll be much more readable, but also not easily interoperable with other ontologies.

Given that I have very limited experience with JSON-LD (mostly looking at NIMADS and NIDM and being confused), I hope that I'm not misrepresenting the tradeoff. Please correct me if either of you see a problem with it.

@sappelhoff - On an unrelated note, since there are currently several issues and one PR dedicated to the conversion of the specification to a schema, could we create a new Project board to manage them?

@satra
Collaborator

satra commented Jul 26, 2020

@tsalo - i think there are perhaps several components being conflated together, so instead of directly thinking about json/jsonld/ld, let's consider the components and why we may want to represent them in some schema or other structured format.

I am also writing this partly to think this through a bit more openly. It may also be helpful for anyone new to bids to consider the potential places for contribution and ramifications that any change should consider. This builds on the very nice set of areas that @tsalo lays out in the original post (so please read: #540 (comment) before this).

  1. Vocabulary or listing of valid bids terms. This is part of the goal of the nidm-terms project. The intent here is to accumulate possible terms in the bids world in some structured manner. This is not about where these terms are used or what relations these terms may have. It is simply saying these terms exist and here is how i can describe them better through potential contextualization both within bids and to other ontologies (cogpo, cog atlas, and others). Thus creating a universal vocabulary across BIDS. So any new person adding something would simply need to add a term in a common place and then use it as necessary in the specification.

  2. Structural and value constraints
    a. BIDS provides structural constraints via its hierarchy. This translates to applicable patterns at different levels. For this situation it is more about validation. One could simply take a BIDS dataset walk (i.e. a tree), turn it into a JSON document and run a json-schema against it for validation. I'm not sure at this point, whether or not a JSON schema can successfully validate this object.
    b. Structure of files. JSON and TSV files should be parseable as such.
    c. Contents of JSON files. These are JSON themselves, therefore ideally there should be a schema to validate against.
    d. Contents of TSV files. This requires an interaction between the TSV file and an associated "validation schema" of some kind in the related JSON file that describes the columns.
    e. Other files. BIDS at present includes a host of other data files. Those should be validated at least at a structural level.

  3. Reasons for structuring
    a. Generate documentation that highlights the structure. I believe this is what this started with: the entity table. But more generally this could be used to relate pieces of BIDS across modalities, datatypes, etc.
    b. Provide new contributors an easy place to add things without violating other pieces of the specification. And perhaps for contributors to see what the current constraints are.
    c. Simplify software maintenance. Extend the bids validator/pybids to support more things over time and build it from a common model that supports the specification and the validation.
    d. Validation. To validate contents or structure of hierarchy automatically.
    e. Search. At the end of the day search can be seen through multiple lenses. Free form text search or more faceted search based on structured representations.
    f. Linking to other things. While many things in BIDS are constrained, several things are not. Some of the ongoing work tries to help link arbitrary ad hoc names to less ad hoc concepts. This does impact search directly, but also helps consolidate knowledge at a given time.

  4. Implementation technologies.
    There are many ways to create validators (JSON schema, SHACL, XML Schema, and others). There are many tools that support developing constraint models with validators (attrs, traitlets/traits, pydantic, sqlalchemy, pyshacl, pyld, and others). And of course one can create custom validators. Unfortunately, because trees and graphs are different structures, one may have to adopt one or more validation schemes depending on the lens through which you look at your object. This is partly where JSON-LD kind of comes in as a potentially happy medium. Under certain circumstances you can validate using JSON schema and be confident that a validation through SHACL would give the same result. Or vice versa, you can validate via SHACL and know that it will also pass JSON schema validation. This happy medium then also allows usage in different settings (ORM databases, graph databases) for different applications. There may be other constraints that many JSON validators may not be able to support. And that comes about when trying to implement the schema in some form.
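As a toy illustration of point 2a, one could walk a dataset listing and check each path against allowed filename patterns. Everything below is a hypothetical, drastically simplified stand-in for real BIDS rules; the patterns and the example listing are invented for illustration.

```python
# Toy illustration of structural validation (point 2a): walk a dataset
# listing and check each path against allowed filename patterns. The
# patterns here are drastically simplified, hypothetical stand-ins for
# the real BIDS rules.
import re

PATTERNS = [
    re.compile(r"^dataset_description\.json$"),
    re.compile(r"^participants\.tsv$"),
    re.compile(
        r"^sub-[0-9a-zA-Z]+/anat/"
        r"sub-[0-9a-zA-Z]+(_run-[0-9]+)?_T1w\.(nii\.gz|json)$"
    ),
]

def invalid_paths(paths):
    """Return the paths that match none of the allowed patterns."""
    return [p for p in paths if not any(rx.match(p) for rx in PATTERNS)]

listing = [
    "dataset_description.json",
    "participants.tsv",
    "sub-01/anat/sub-01_run-1_T1w.nii.gz",
    "sub-01/anat/sub-01_t1w.nii.gz",  # wrong case: should be flagged
]
print(invalid_paths(listing))  # -> ['sub-01/anat/sub-01_t1w.nii.gz']
```

If those patterns were generated from the schema rather than hand-written, the documentation, the validator, and downstream tools like pybids could all be derived from the same source, which is the point of goal 3c above.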

@sappelhoff
Member

> @sappelhoff - On an unrelated note, since there are currently several issues and one PR dedicated to the conversion of the specification to a schema, could we create a new Project board to manage them?

Hi @tsalo I just increased your scope of permissions for this repository. Can you go ahead and try to make a project now? If it doesn't work, I'll look into the permissions again.

I think it's a great idea to create a project board for this!

@tsalo
Member Author

tsalo commented Jul 27, 2020

It worked. Thanks!

@tsalo
Member Author

tsalo commented Jul 27, 2020

@satra You expressed all of my goals better than I did! As well as noted several more that I hadn't thought of.

Do you have any thoughts on where we should go from here? It seems like we need to decide on the schema of choice for each of the elements you describe in "Structural and value constraints". Should we open a separate issue for each, or try to find one solution that works for all of them? I assume we could commit to technology (e.g., JSON-LD or SHACL) and then break it down from there?

Perhaps we could ping some of the other folks who work with schemata more than me for their thoughts on the best technology for this case. Most everything under "Implementation technologies" went over my head, so I doubt I'll be able to contribute to that decision.

@satra
Collaborator

satra commented Aug 5, 2020

@tsalo - it may be good to have an online chat with interested parties to discuss some of these things?

@tsalo
Member Author

tsalo commented Aug 5, 2020

@satra that sounds like a good plan!

@tsalo
Member Author

tsalo commented Aug 10, 2020

@satra Who would be interested? Based on a recent BIDS maintainers call, I was thinking @yarikoptic, @dbkeator, @dnkennedy, @rwblair, and possibly Maryann Martone (not sure if Dr. Martone has a GitHub account).

EDIT: Oh and @nqueder! Apologies for the oversight.

@dbkeator

I'm interested....

@tsalo
Member Author

tsalo commented Sep 11, 2020

@yarikoptic and I joined a couple of NIDM-Terms calls with most of the interested folks where we've made solid progress on the format of the schema, and since there is now a project board for the conversion I think we can close this issue in favor of more focused issues. Any objections?

@sappelhoff
Member

+1 for smaller, targeted, actionable issues that are connected into a coherent whole via the project board

@tsalo tsalo closed this as completed Sep 12, 2020
@tsalo tsalo added the schema Issues related to the YAML schema representation of the specification. Patch version release. label Sep 23, 2020