New tool: Required publication references #236

Open
ewels opened this issue Jan 2, 2019 · 23 comments
Labels
command line tools Anything to do with the cli interfaces

Comments

@ewels
Member

ewels commented Jan 2, 2019

It would be nice to make it easier for people to know what should be referenced if they use a pipeline in a manuscript. For example, nf-core references <pipeline-name> could return a list of the references that you need to add into your paper. (alt names: nf-core refs, nf-core bib..?)

Different flags could give different output formats, but perhaps the default could be prose text. For example:

Data was processed using nf-core/rnaseq [pipeline DOI, nf-core paper]. This pipeline is built using nextflow [nextflow paper] and uses the following tools: FastQC (Quality control of raw data) [ref], TrimGalore! (Trimming of adapter sequence contamination) [ref], STAR (Alignment of RNA-seq reads to the reference genome) [ref] …etc

Need to think about where and how to capture this information in the pipeline files. For example, a simple YAML file could work nicely:

- tools:
    - fastqc:
        - name: FastQC
        - description: Quality control of raw data
        - ref: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
    - trimgalore:
        - name: Trim Galore!
        - description: Trimming of adapter sequence contamination
        - ref:
            - https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
            - 10.14806/ej.17.1.200
    - star:
        - name: STAR
        - description: Alignment of RNA-seq reads to the reference genome
        - ref: 10.1093/bioinformatics/bts635

Requirements:

  • Should handle either DOI or URL (DOI preferable where available)
  • Should be able to handle multiple references per tool
    • Alternatively, force one per tool and instead list multiple tools? eg. have Cutadapt in its own entry above.
  • Name and reference should be mandatory
  • Additional text per tool should be as short as possible

Output options could be:

  • List of references alone
  • List of tool names and references
  • Full prose text
  • Prose text without additional tool descriptions
  • Option to give references in different formats, with a DOI lookup
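A minimal sketch of how the default prose output could be assembled from records shaped like the YAML above. The names `Tool`, `format_ref` and `format_prose` are hypothetical and not part of nf-core/tools, and the bracketed placeholders stand in for the pipeline, nf-core and Nextflow references that the proposal leaves open.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    """One tool entry: name and at least one reference are mandatory."""
    name: str
    refs: list
    description: str = ""

    def __post_init__(self):
        if not self.name or not self.refs:
            raise ValueError("name and at least one reference are required")


def format_ref(ref):
    """Render a URL as-is and treat anything else as a DOI."""
    if ref.startswith(("http://", "https://")):
        return ref
    return "doi:" + ref


def format_prose(pipeline, tools, with_descriptions=True):
    """Build the prose paragraph; tool descriptions can be switched off."""
    parts = []
    for tool in tools:
        text = tool.name
        if with_descriptions and tool.description:
            text += f" ({tool.description})"
        text += " [" + "; ".join(format_ref(r) for r in tool.refs) + "]"
        parts.append(text)
    return (
        f"Data was processed using {pipeline} [pipeline DOI, nf-core paper]. "
        "This pipeline is built using Nextflow [Nextflow paper] "
        "and uses the following tools: " + ", ".join(parts) + "."
    )


tools = [
    Tool("FastQC", ["https://www.bioinformatics.babraham.ac.uk/projects/fastqc/"],
         "Quality control of raw data"),
    Tool("STAR", ["10.1093/bioinformatics/bts635"],
         "Alignment of RNA-seq reads to the reference genome"),
]
print(format_prose("nf-core/rnaseq", tools))
```

Passing `with_descriptions=False` gives the "prose text without additional tool descriptions" variant, and a tool with several references simply lists them inside one pair of brackets.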

The nextflow and nf-core references can be hardcoded. The workflow DOI can be lifted from README.md I guess. Or could potentially be added as a new workflow.metadata variable?

Thoughts / feedback?

Phil

@ewels ewels added the command line tools Anything to do with the cli interfaces label Jan 2, 2019
@maxulysse
Member

I would also add the version of each tool.

@drpatelh
Member

drpatelh commented Jan 2, 2019

It might be good to host a central database (e.g. yaml) of tools and their associated information. This can then be used to parse the conda yaml to create a tool specific publication description that would be linked by release to the pipeline. It would be much neater to just reference the pipeline in papers (if morally possible) - with a sentence pointing to the pipeline for all the tool-specific citations. I've often been asked to trim down text and a decision may need to be made as to which tools you cite... I generally provide a short description of the tool, version, reference and pubmed id. Maybe we can provide this as a file that gets bundled with the pipeline that can be linked on the pipeline home page?

@sven1103
Member

sven1103 commented Jan 2, 2019

Hm, I was just thinking that we get this information for free over the Anaconda API, right?

For example:
https://api.anaconda.org/package/bioconda/samtools

Although package maintainers do not always fill in every field (which is bad!).

So instead of having another yaml file, we could use the environment.yml. If a package does not provide a description, it might be good practise to contact the package maintainer to do so?

@maxulysse
Member

We might want to add extra information, like an actual publication or DOI for the pipeline

@sven1103
Member

sven1103 commented Jan 2, 2019

hm, i see. There is no such thing as a tool registry with DOI and publication URIs, right? Maybe we need this...

@ewels
Member Author

ewels commented Jan 2, 2019

It might be good to host a central database (e.g. yaml) of tools and their associated information.

I see where you're going with this, however I quite like that all pipelines are totally self-sufficient currently. Especially if this will be used within tool execution, as many users run offline.

It would be much neater to just reference the pipeline in papers (if morally possible)

I don't think that it is morally good to do this. If people decide that they need to do this then that can be on their shoulders, but I don't think that we should help them.

I generally provide a short description of the tool, version, reference and pubmed id.

Yes - this is basically the information that I was thinking of listing (though DOI instead of pubmed). A table with this information would be a nice output option too though..

Maybe we can provide this as a file that gets bundled with the pipeline that can be linked on the pipeline home page?

Yes, that could be very nice actually. We have an ACKNOWLEDGMENTS.txt file that we deliver with all data from our centre to try to help people to mention us in their paper. The pipelines could do the same here, so that it's obviously alongside the results files when the pipeline runs.

@ewels
Member Author

ewels commented Jan 2, 2019

Hm, I was just thinking that we get this information for free over the Anaconda API, right?

Not really - we're already using this for the nf-core licences command, but it doesn't have any info about publications that I'm aware of.. It's specifically the DOI / publication reference that I'm thinking of here.

Tying the names in with environment.yml and potentially using the descriptions would be a nice idea though 👍 The summary field where available should contain this. It will not describe how it's used in the pipeline though, so not as good as a specific string.
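A rough sketch of pulling the `summary` field from the Anaconda API for a package named in environment.yml. It assumes that the package record is a JSON object with `name` and `summary` keys; the helper names `fetch_package` and `describe` are hypothetical. Only the offline part is exercised here, with a hand-written record.

```python
import json
import urllib.request

ANACONDA_API = "https://api.anaconda.org/package/{channel}/{name}"


def fetch_package(name, channel="bioconda"):
    """Download the package record (needs network access)."""
    url = ANACONDA_API.format(channel=channel, name=name)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def describe(record, fallback="(no description provided)"):
    """Return 'name: summary', tolerating a missing or empty summary."""
    summary = (record.get("summary") or "").strip()
    return f"{record.get('name', 'unknown')}: {summary or fallback}"


# Offline illustration with a hand-written record; real payloads carry more keys.
example = {"name": "samtools", "summary": ""}
print(describe(example))
```

The fallback string makes missing summaries visible in the output, which is the case where contacting the package maintainer would be worthwhile.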

@sven1103
Member

sven1103 commented Jan 2, 2019

Maybe should activate this discussion again: nextflow-io/nextflow#866

Tools and parameters that are used in Nextflow should be described in a structured way, so humans and machines can work with them.

I also see the tools metadata such as URI, URL, description and parameters there combined... Just brain-storming here.

@drpatelh
Member

drpatelh commented Jan 2, 2019

How about tool-specific parameters? e.g. if you aren't using the defaults. I generally provide these as a double-quoted string for full traceability and reproducibility. Would it be enough to have these defined within main.nf, bearing in mind that these may also change between releases?

@ewels
Member Author

ewels commented Jan 2, 2019

Yes, I wondered about putting this kind of information alongside the parameter schema described in that issue. However, parameters and tool metadata are distinct, so it may not make sense. For example, it could break parsing by the general form-building tools discussed on that thread. A section of nextflow.config dedicated to describing tools could work though, especially alongside the feature request for parsing tool version numbers at run time. Any thoughts @pditommaso?

How about tool-specific parameters

This is getting a bit off-topic now 😅 But yes, I think having them defined in main.nf is enough - this file is tagged with each release so it's easy to find again. They're also in the trace and reports that are saved with the results. Personally, I think it improves code readability if they're in main.nf alongside the command template, instead of being held separately in a different location.

@pditommaso

IMO maintaining a separate annotation file does not work because very easily it gets out of sync with the actual tools used in the pipeline script.

Ideally this info should be inferred during process execution nextflow-io/nextflow#879. Alternatively we could add an annotation in the module/process definition nextflow-io/nextflow#984.

Otherwise the best approximation could be the Conda environment file, though, if I'm understanding correctly, the problem is that it does not include the citation/paper DOI, right? Not sure, but I think that using the tool name and version it should be possible to infer the related metadata from biotools.

Pinging @bgruening and @ypriverol who should know about the state of the art of bioconda/containers/biotools interoperability.

@bgruening

@ewels @pditommaso we actually do include identifiers in conda, see here: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/multiqc/meta.yaml#L137

This means you can infer this from the conda package or bio.tools. A DOI can, and should, be added to the conda package as well.

Does this answer your question?

@sven1103
Member

sven1103 commented Jan 3, 2019

Uh, this is actually very nice.

Just checked the API request for fasttree:
https://bio.tools/api/tool/fasttree

Seems that we get the information we need from it, so no need to have an additional file.
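As an illustration of that lookup, the sketch below extracts publication DOIs from a bio.tools entry. It assumes that the API accepts a `format=json` query parameter and that the entry carries a `publication` list whose items may hold a `doi` key; both should be checked against the bio.tools API documentation. The function names are hypothetical and the sample record is hand-written with a placeholder DOI.

```python
import json
import urllib.request


def fetch_biotools(tool_id):
    """Download a bio.tools entry as JSON (needs network access)."""
    url = f"https://bio.tools/api/tool/{tool_id}?format=json"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def publication_dois(entry):
    """Collect unique DOIs from the entry's publication list, skipping blanks."""
    dois = []
    for pub in entry.get("publication") or []:
        doi = pub.get("doi")
        if doi and doi not in dois:
            dois.append(doi)
    return dois


# Hand-written stand-in for a response; the DOI is a placeholder, not a real lookup.
sample = {"name": "ExampleTool", "publication": [{"doi": "10.1234/example"}, {"doi": None}]}
print(publication_dois(sample))
```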

@ewels
Member Author

ewels commented Jan 3, 2019

Fantastic - this is great news! Many thanks @bgruening - I didn't know that this lookup existed.
However, it looks like the identifier isn't given in the Anaconda API 😞 https://api.anaconda.org/package/bioconda/multiqc

Any ideas on how we can best fetch this information? If we can it would be great to use this method. If we want, we could even get the linter to warn if the biotools identifier is missing.

A DOI can, and should, be added to the conda package as well.

Also under the identifiers section, as done here I guess? Cool! I'll add this to the MultiQC recipe.

@bgruening

Any ideas on how we can best fetch this information? If we can it would be great to use this method. If we want, we could even get the linter to warn if the biotools identifier is missing.

Short answer is that it's part of the tarball and therefore part of the installation, afaik.
Long answer is that we are working on a central service (bio.tools) to make this all way easier and also independent of conda ... so a unified interface to pkgs and containers.

Also under the identifiers section, as done here I guess? Cool! I'll add this to the MultiQC recipe.

Yes :)

@ewels
Member Author

ewels commented Jan 3, 2019

ok cool, thanks!

Then I wonder if the best bet is to just try pinging the biotools API with the conda package name if it's in the bioconda channel. I guess that the two will essentially always be the same.. This won't match up versions and could in some weird edge cases give the wrong information, so not ideal. But I don't really fancy downloading and extracting all software just for this fast little utility command.
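A sketch of that name-matching step, assuming dependency strings in the usual `channel::name=version` form from environment.yml. The helper names are hypothetical, the pinned versions are illustrative, and only `=` pins are handled. The version is kept alongside the guessed identifier so that it can be reported even though the bio.tools lookup itself is not version-aware.

```python
def parse_dependency(spec):
    """Split 'channel::name=version[=build]'; channel and version are optional."""
    channel = None
    if "::" in spec:
        channel, spec = spec.split("::", 1)
    name, _, rest = spec.partition("=")
    # Drop a second '=' (exact pin) and any trailing build string.
    version = rest.lstrip("=").split("=")[0].strip()
    return {"channel": channel, "name": name.strip(), "version": version or None}


def biotools_guesses(dependencies, channels):
    """Guess bio.tools IDs for packages that may come from bioconda."""
    guesses = []
    for spec in dependencies:
        dep = parse_dependency(spec)
        explicit = dep["channel"] == "bioconda"
        # Without a channel prefix the package might still come from bioconda.
        implicit = dep["channel"] is None and "bioconda" in channels
        if explicit or implicit:
            guesses.append((dep["name"].lower(), dep["version"]))
    return guesses


deps = ["conda-forge::openjdk=8.0.144", "fastqc=0.11.8", "bioconda::star=2.6.1d"]
print(biotools_guesses(deps, channels=["conda-forge", "bioconda"]))
```

Unprefixed packages are only guesses: they may resolve from another channel, which is one of the edge cases where the lookup could return the wrong information.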

@ewels
Member Author

ewels commented Jan 3, 2019

..could also just grab the raw bioconda meta.yaml directly from GitHub and parse the identifiers from that. But again, it will be tricky to match up versions, so not a whole lot better I guess.

@ewels ewels closed this as completed Jan 3, 2019
@ewels ewels reopened this Jan 3, 2019
@bgruening

This depends on whether you always have internet access during the workflow run. I guess querying the API is ok. I suppose digging the information out of conda is also easy, as it should already be available locally.

@ewels
Member Author

ewels commented Jan 3, 2019

Ah true, there are two different use cases here. I was thinking primarily about a new nf-core references cli tool which would run totally separately from the workflow.

For using the data within a workflow run (eg. saving it to an ACKNOWLEDGMENTS.txt file), I think we need all of the data locally because so many people run without internet access. How would we go about finding the information from a local conda install? I've had a quick dig around but haven't found the meta.yaml file yet.

@ewels
Member Author

ewels commented Jan 3, 2019

..but we'd still need an internet connection for bio.tools. I think that this needs to be a separate cli tool. If we want the output as a results file with the pipeline then this should probably be a static file which is saved separately I think. If we want automation, the lint tool could check that it exists and is up to date (maybe on --release only for the latter).

@bgruening

Have a look at miniconda3/pkgs/samtools-1.8-3/info/recipe/meta.yaml
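A sketch of reading the identifiers from such a local recipe file without any network access. It assumes that identifiers appear as list items of the form `- biotools:<id>` or `- doi:<doi>`, as in the bioconda recipe linked earlier. The regex scan is a stand-in for a proper YAML parser and matches such lines anywhere in the file; the function names are hypothetical and the sample DOI is a placeholder.

```python
import re
from pathlib import Path

# Matches list items such as "- biotools:multiqc" or "- doi:10.1234/example".
IDENTIFIER = re.compile(r"^\s*-\s*['\"]?(biotools|doi):([^'\"\s]+)", re.IGNORECASE)


def read_identifiers(meta_text):
    """Return {'biotools': [...], 'doi': [...]} found in recipe text."""
    found = {"biotools": [], "doi": []}
    for line in meta_text.splitlines():
        match = IDENTIFIER.match(line)
        if match:
            found[match.group(1).lower()].append(match.group(2))
    return found


def identifiers_for(pkg_dir):
    """Read info/recipe/meta.yaml below an extracted conda package directory."""
    meta = Path(pkg_dir) / "info" / "recipe" / "meta.yaml"
    return read_identifiers(meta.read_text(encoding="utf-8"))


# Hand-written sample; the DOI is a placeholder.
sample = """
extra:
  identifiers:
    - biotools:multiqc
    - doi:10.1234/example
"""
print(read_identifiers(sample))
```

Calling `identifiers_for("miniconda3/pkgs/samtools-1.8-3")` would read the file mentioned above, provided the package has been extracted at that location.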

@ewels
Member Author

ewels commented Mar 18, 2021

This issue is getting much more manageable with DSL2 modules, where we have a meta file for each tool that includes DOI 🎉 (typically taken from Bioconda).

This could potentially be used both for a command line tool but also within pipelines as the meta file should be bundled within each pipeline.

@jfy133
Member

jfy133 commented Jun 19, 2023

Following on from: #2326 (which starts providing a framework to insert this into a MultiQC report):

@maxulysse and @mashehu have both said that we should automate this even more, and that it should be possible via the DOIs in the meta.yml.

From @maxulysse a conceptual plan:

  • Adding BibTeX to meta.yml.
  • Ditch the get software version module and refactor the versions channel into a map like tools:versions+modules.
    • @mashehu suggests do similar channel/mag as with versions here
  • Auto-generate a nice versions HTML in pure Groovy based on the versions from the map (versions field)
  • Auto generate citations based on the map (modules field) and parse modules to get citations if available (edited)

Initial problems I see:

  • Do we really want to load meta.yml into memory/process for every module we execute?
    • What about including an ext.arg in the module that holds that information and export that in a similar way to versions.yml?
  • How to get BibTeX information for all modules
    • Should be quite straightforward as a one-time thing using the Crossref API or similar
    • Would need to add functionality from tools to somehow pull this in when a DOI is added to meta.yml
  • How to format citations from BibTeX stuff in Nextflow?
    • There are a few Java libraries at least for this: JBibTeX and something from JabRef
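For the one-time BibTeX retrieval, one possible route is DOI content negotiation: requesting `https://doi.org/<DOI>` with an `Accept: application/x-bibtex` header. This is a swapped-in alternative to calling the Crossref API directly, and it is written in Python rather than Groovy purely for illustration. The function names are hypothetical, and the example only builds the request without sending it.

```python
import urllib.parse
import urllib.request


def bibtex_request(doi):
    """Build (but do not send) a request asking doi.org for BibTeX."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/x-bibtex"})


def fetch_bibtex(doi):
    """Resolve the DOI and return the BibTeX entry as text (needs network access)."""
    with urllib.request.urlopen(bibtex_request(doi), timeout=10) as response:
        return response.read().decode("utf-8")


# DOI taken from the STAR entry in the YAML example at the top of the thread.
request = bibtex_request("10.1093/bioinformatics/bts635")
print(request.full_url, request.get_header("Accept"))
```

The returned text could be stored next to the DOI in meta.yml, so that pipelines would not need network access at run time.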

7 participants