ISA light #97

Closed
HLWeil opened this issue Feb 15, 2024 · 6 comments
Comments

@HLWeil
Member

HLWeil commented Feb 15, 2024

Currently, the investigation file contains registry information about studies and assays. In ISA-Tab this is necessary for two reasons:

  1. Studies and Assays (not being xlsx files) cannot contain contextualizing information themselves.
  2. Study and Assay files could be placed anywhere, so they have to be explicitly referenced.

Neither reason applies in the ARC: 1) the ISA-XLSX files carry their own contextualizing information in an additional metadata sheet, and 2) the ARC is a structured container, so the location of Assay and Study files is explicit.

Instead, this registration now causes two problems in the context of the ARC:

  1. Whenever there are changes in studies or assays, the investigation file has to be updated. This can cause user pain because of difficulties in xlsx merging in git.
  2. Duplicate information between the different files, making parsing unnecessarily complicated.

As a solution, we propose ISA light:

  • The investigation file contains only top-level metadata and no study information
  • The study file contains only study information and no assay information
  • Study and Assay registration in the ARC is implicit: they are registered by being placed in the ARC
  • Assay registration in the study is implicit: Assay processes reference Study processes
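
To illustrate the third point, here is a minimal sketch of how a tool could derive the registrations from the folder structure alone. It assumes the layout `studies/<id>/isa.study.xlsx` and `assays/<id>/isa.assay.xlsx`; the function name `discover_registrations` and the identifiers in the demo are made up for illustration and are not taken from any existing tool.

```python
import tempfile
from pathlib import Path


def discover_registrations(arc_root):
    """List the study and assay identifiers implied by the folder layout."""
    root = Path(arc_root)
    # Assumed layout: studies/<id>/isa.study.xlsx and assays/<id>/isa.assay.xlsx
    studies = sorted(p.parent.name for p in root.glob("studies/*/isa.study.xlsx"))
    assays = sorted(p.parent.name for p in root.glob("assays/*/isa.assay.xlsx"))
    return {"studies": studies, "assays": assays}


# Demo on a throwaway ARC skeleton with made-up identifiers.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("studies/growth/isa.study.xlsx", "assays/proteomics/isa.assay.xlsx"):
        path = Path(tmp, rel)
        path.parent.mkdir(parents=True)
        path.touch()
    found = discover_registrations(tmp)

print(found)
```

With this approach nothing has to be written back to the investigation file when a study or assay folder is added or removed.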

I would suggest having ISA-light as an option in the ISA-XLSX specification. The ARC specification would then explicitly implement ISA-light, making it non-optional (with implicit backwards compatibility in the tools).

@Freymaurer @JonasLukasczyk @Brilator @muehlhaus @chgarth

PS: This is already being tested out in the ARCitect

@Freymaurer
Contributor

> Assay registration in the study is implicit by Assay processes referencing Study processes

Can you give a more detailed example of this?

@HLWeil
Member Author

HLWeil commented Feb 15, 2024

> > Assay registration in the study is implicit by Assay processes referencing Study processes
>
> Can you give a more detailed example of this?

IMO we don't need to explicitly state which assay is part of which study. This can be inferred from the processes in the assay taking outputs of the study processes as inputs.
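
A rough sketch of that inference, assuming the process inputs and outputs have already been read from the study and assay tables into plain sets of names. The function `infer_assay_studies` and all identifiers in the example data are hypothetical.

```python
def infer_assay_studies(study_outputs, assay_inputs):
    """Map each assay to the studies whose process outputs it consumes.

    study_outputs: dict of study id -> set of output names (e.g. samples)
    assay_inputs:  dict of assay id -> set of input names
    """
    links = {}
    for assay, inputs in assay_inputs.items():
        # An assay belongs to a study if at least one input is a study output.
        links[assay] = sorted(
            study for study, outputs in study_outputs.items() if inputs & outputs
        )
    return links


# Made-up example data.
study_outputs = {"growth": {"sample_1", "sample_2"}, "field_trial": {"leaf_A"}}
assay_inputs = {
    "proteomics": {"sample_1"},
    "imaging": {"leaf_A", "sample_2"},
    "orphan": {"unrelated_sample"},
}
links = infer_assay_studies(study_outputs, assay_inputs)
print(links)
```

An assay whose inputs match no study output simply ends up with an empty list, so it stays visible instead of being dropped silently.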

@eik-dahms

From my point of view the ISA-TAB format already differs slightly from the tab format (see ISA-TAB), not only formally but also in terms of the new concepts being introduced, which are necessary to make the format work in the ARC environment. Now you are proposing a third format, which is not that different - but still - different. I think that would have more caveats than it solves the problems stated here.

In general I agree with the problems you stated:

> 1. Whenever there are changes in studies or assays, the investigation file has to be updated. This can cause user pain because of difficulties in xlsx merging in git.

Yes, and this would be so much easier with the ISA-TAB format.

> 2. Duplicate information between the different files, making parsing unnecessarily complicated.

Here I really see a problem in usability and development. For developers, because keeping this data in sync introduces a lot of overhead, bloats the code, and is a potential source of errors. For users, because without a program like ARCitect this has to be done manually, which can be tedious, prone to human error, and confusing (why duplicate entries?).
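
To make the sync overhead concrete, here is a minimal sketch of the kind of check a tool needs as long as the registry is duplicated. It assumes the identifiers registered in the investigation file and the identifiers found in the ARC folders are already available as lists; `registry_drift` and the example identifiers are hypothetical.

```python
def registry_drift(registered, on_disk):
    """Compare identifiers listed in the investigation file with those on disk."""
    registered, on_disk = set(registered), set(on_disk)
    return {
        # Present as a folder but never registered in the investigation file.
        "missing_from_investigation": sorted(on_disk - registered),
        # Still registered although the folder no longer exists.
        "stale_in_investigation": sorted(registered - on_disk),
    }


# Made-up identifiers for illustration.
drift = registry_drift(
    registered=["growth", "old_study"],
    on_disk=["growth", "field_trial"],
)
print(drift)
```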

And although I think these points should be improved, I really do not like the solution of an additional format, firstly because this has, again, consequences for users, data stewards and developers.

Users

In a perfect world there would be GUI tools that make it very convenient for users to work on their ARC. I know the tools have improved a lot and they are very helpful. But as of now, users still have to interact with their ARCs in less convenient ways, which, depending on the user's background, can be a rather confusing experience. Adding another format adds information to sources and the knowledge base, along with potential usability issues and error sources. Just because it makes sense from a developer's view, it does not have to make sense for a user. Another scenario: a user takes another ARC as the basis for a new ARC and, depending on which format definition they look at, runs into trouble.

Data Stewardship/ Teaching

Given the nature of this project, it is impossible to have a finished definition of the ARC format from the start. As a consequence we will have to update what we teach students from time to time, and I think it would be beneficial to keep these changes as small as possible. Introducing another format not only changes something, it also adds a possibility. That means we either leave it unexplained to keep things simple, or explain it and make things more complicated. The ARC itself is not complicated to teach, but its strength, being quite flexible within its given constraints, can sometimes be a bit confusing, especially as we are all still learning how to do this. To exactly this flexibility we would add another ISA format, which in turn increases the number of ways things can be done. At least for first-time ARC users it should rather be: this is the structure, this is the file format, and this is how to use it.

Development

  • Implementing this, even as an optional construct, would require ensuring that mapping works in all allowed combinations. All internally and externally developed tools would need to adapt to the possibility of ISA having two ways to work.
  • I am not sure if decoupling is a good idea. As you said, if you know what an ARC is you know where Study/Assay files are and yes that means one can assume these files are there and omit referring data in the Investigation file. This all works fine if you completely work in the ARCverse using ARCverse tools and/or know what an ARC is. If you now look at this from a "blind" perspective and imagine you try to determine the content of an ARC only by its investigation file, you lose access to these decoupled files. (idk, imagine in the future when there are millions of ARCs: do you want to have a short look at the isa files to know what's in there, or do you want to search all folders of each ARC separately?)

Again, I think the problems you mention are valid and there needs to be a solution, but I don't think that this ISA light format should be the way to go.

@Freymaurer
Contributor

> > Whenever there are changes in studies or assays, the investigation file has to be updated. This can cause user pain because of difficulties in xlsx merging in git.
>
> Yes, and this would be so much easier with the ISA-TAB format.

This is a valid point: git diffing is further complicated by our use of the XLSX format. But we decided on XLSX for its other benefits, such as multiple sheets and a clearer visual representation. Viewing big tables in ISA-TAB becomes a mess.

> From my point of view the ISA-TAB format already differs slightly from the tab format.

Well, we are not using ISA-TAB but ISA-XLSX, which already differs quite a lot from ISA-TAB but still implements the ISA abstract model, making conversion to ISA-JSON and ISA-TAB comparatively straightforward. There are many reasons for creating our own ISA-XLSX specification, e.g. improved usage of controlled vocabularies, self-contained data containers (assays, studies) and the ability to add features necessary for a FAIR depiction of a full research cycle (#93).

> Now you are proposing a third format, which is not that different - but still - different. I think that would have more caveats than it solves the problems stated here.

We are proposing a variant to the ISA-XLSX format (you can call it a reduced version of ISA-XLSX) which may only be used in the ARC context with less information than a full ISA-XLSX format. As stated above, ISA-XLSX has its own specification. Users of the ARC don't have to learn three formats now, ISA-XLSX suffices. Knowledge about the other implementations of the ISA abstract model becomes relevant only for active, manual interoperation between different ecosystems.

> I am not sure if decoupling is a good idea. As you said, if you know what an ARC is you know where Study/Assay files are and yes that means one can assume these files are there and omit referring data in the Investigation file. This all works fine if you completely work in the ARCverse using ARCverse tools and/or know what an ARC is. If you now look at this from a "blind" perspective and imagine you try to determine the content of an ARC only by its investigation file, you lose access to these decoupled files. (idk, imagine in the future when there are millions of ARCs: do you want to have a short look at the isa files to know what's in there, or do you want to search all folders of each ARC separately?)

I think this matters more to users than to developers, as we can expect developers to read a specification. But you have a valid point regarding exploration: the investigation file would no longer contain information about its studies. This would reduce discoverability of 2nd level metadata (e.g. which protocol is part of which study) from the investigation ISA-XLSX file alone. Discoverability of 1st level metadata (which study is part of this ARC) is not affected though, as looking into the studies folder suffices.


To sum up, look at it from the following perspective: because ISA-XLSX is designed to work as an alternative to ISA-TAB and ISA-JSON, it must contain some information to be convertible into these formats. But in the ARC, a lot of this information is handled implicitly and therefore we do not write it out. This makes the format simpler, as there is less duplication and less room for error (as you stated). We do not need to teach full ISA-XLSX in the ARC context or even explain that there is a difference between ISA-XLSX light and ISA-XLSX. 95% of all ARC users will never require full ISA-XLSX.

@HLWeil
Member Author

HLWeil commented Mar 8, 2024

Hey @eik-dahms, after some consideration, maybe there was a little bit of confusion caused by me framing this as ISA light. This is not meant to be a new format or anything.
I'd be happy if you'd take a look into my approach to solve this in a low friction way: #101

@HLWeil
Member Author

HLWeil commented Apr 2, 2024

Closed by #101

@HLWeil HLWeil closed this as completed Apr 2, 2024
@kMutagene kMutagene added this to the ARC-specification v2.0.0 milestone Jun 6, 2024