nf-core/references is a bioinformatics pipeline that builds references for multiple use cases.
It is primarily designed to build references for common organisms and store them on AWS iGenomes.
From a FASTA file, it can build the following references:
- Bowtie1 index
- Bowtie2 index
- BWA-MEM index
- BWA-MEM2 index
- DRAGMAP hashtable
- Fasta dictionary (with GATK4)
- Fasta fai (with SAMtools)
- Fasta sizes (with SAMtools)
- Fasta intervals bed (with GATK4)
- MSIsensor-pro list
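For orientation, the two SAMtools-derived files in the list above can be reproduced by hand. The sketch below is not the pipeline's own code: the function names `index_fasta` and `fai_to_sizes` are hypothetical, and taking the sizes file from the first two columns of the `.fai` index is one common approach, assumed here rather than confirmed from the pipeline.

```shell
# Hypothetical helpers, not part of nf-core/references.

# Build a FASTA index next to the input; requires samtools on PATH.
index_fasta() {
    samtools faidx "$1"    # writes "$1.fai"
}

# Print a two-column table (sequence name, length) from a .fai index.
# The first two tab-separated columns of a .fai file are name and length.
fai_to_sizes() {
    cut -f1,2 "$1"
}
```

A typical call would be `index_fasta genome.fa && fai_to_sizes genome.fa.fai > genome.fa.sizes`, where `genome.fa` is a placeholder file name.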
With an additional annotation file describing the genes (either GFF3 or GTF), it will be able to build the following references:
- GTF (from GFF3 with GFFREAD)
- HISAT2 index
- Kallisto index
- RSEM index
- Salmon index
- Splice sites (with HISAT2)
- STAR index
- Transcript fasta (with RSEM)
Given a VCF file, it will compress it (if it is not already compressed) and index it with tabix.
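Done by hand, the VCF step amounts to "compress only if needed, then index". The sketch below assumes `bgzip` and `tabix` from HTSlib are on PATH. The names `is_gzipped` and `prepare_vcf` are hypothetical, and the magic-byte check is an illustrative way to detect compression, not necessarily how the pipeline decides.

```shell
# Hypothetical helpers, not part of nf-core/references.

# Succeed if the file starts with the gzip magic bytes 1f 8b.
# bgzip output shares this prefix, but so does plain gzip, which tabix
# cannot index, so this check is necessary but not sufficient.
is_gzipped() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Compress with bgzip only when needed, then build a tabix index.
prepare_vcf() {
    local vcf="$1"
    if ! is_gzipped "$vcf"; then
        bgzip "$vcf"         # replaces "$vcf" with "$vcf.gz"
        vcf="$vcf.gz"
    fi
    tabix -p vcf "$vcf"      # writes "$vcf.tbi"
}
```

Calling `prepare_vcf variants.vcf` (a placeholder file name) would leave `variants.vcf.gz` and `variants.vcf.gz.tbi` in place.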
Assets are stored in references-assets.
Note
If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.
asset.yml:
- genome: GRCh38_chr21
  fasta: "https://raw.githubusercontent.com/nf-core/test-datasets/references/references/GRCh38_chr21/GRCh38_chr21.fa"
  gtf: "https://raw.githubusercontent.com/nf-core/test-datasets/references/references/GRCh38_chr21/GRCh38_chr21.gtf"
  source_version: "CUSTOM"
  readme: "https://raw.githubusercontent.com/nf-core/test-datasets/references/references/GRCh38_chr21/README.md"
  source: "nf-core/references"
  source_vcf: "GATK_BUNDLE"
  species: "Homo_sapiens"
  vcf: "https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/genome/vcf/dbsnp_146.hg38.vcf.gz"
Each line represents a source for building a reference, a reference already built, or metadata.
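Because many of these lines point at remote files, it can help to list the URLs in an asset file before launching anything. The helper below is a hypothetical sketch, not a pipeline feature. It assumes each URL value is double-quoted, as in the example entry, and it matches text line by line rather than parsing YAML.

```shell
# Hypothetical helper, not part of nf-core/references.

# Print "key<TAB>url" for every line of an asset file whose value is a
# double-quoted URL. Lines without "://" (plain metadata) are skipped.
list_asset_urls() {
    awk -F'"' '/:\/\// {
        key = $1
        gsub(/[-: \t]/, "", key)   # strip list dash, colon and whitespace
        print key "\t" $2
    }' "$1"
}
```

Running `list_asset_urls asset.yml` on the entry above would print one line each for `fasta`, `gtf`, `readme` and `vcf`. Passing the second column to `curl --silent --fail --head` is one way to catch a broken link early.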
Now, you can run the pipeline using:
nextflow run nf-core/references \
   -profile <docker/singularity/.../institute> \
   --input asset.yml \
   --outdir <OUTDIR>
Warning
Please provide pipeline parameters via the CLI or Nextflow -params-file
option. Custom config files including those provided by the -c
Nextflow option can be used to provide any configuration except for parameters; see docs.
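As an illustration of the `-params-file` route, the sketch below writes the two parameters from the run command (`input` and `outdir`) to a YAML file and prints the matching launch command. The file name `params.yml`, the `results` output directory and the `docker` profile are placeholder choices, not requirements.

```shell
# Write pipeline parameters to a YAML params file.
# Keys mirror the --input and --outdir CLI options; values are placeholders.
cat > params.yml <<'EOF'
input: "asset.yml"
outdir: "results"
EOF

# Print the launch command instead of running it, so it can be reviewed first.
echo "nextflow run nf-core/references -profile docker -params-file params.yml"
```

Keeping parameters in a file like this leaves `-c` config files free for settings that are not parameters, such as resource limits.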
For more details and further functionality, please refer to the usage documentation and the parameter documentation.
To see the results of an example test run with a full-size dataset, refer to the results tab on the nf-core website pipeline page. For more details about the output files and reports, please refer to the output documentation.
nf-core/references was originally written by Maxime U Garcia, Edmund Miller, and Phil Ewels.
We thank the following people for their extensive assistance in the development of this pipeline:
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack #references
channel (you can join with this invite).
- Have Docker and Nextflow installed
nextflow run main.nf
- We could use the glob and if you just drop a fasta in s3 bucket it'll get picked up and new resources built
- Could take this a step further and make it a little config file that has the fasta, gtf, genome_size etc.
- How do we avoid rebuilding? Ideally we should build once on a new minor release of an aligner/reference. IMO kinda low priority because the main cost is going to be egress, not compute.
- How much effort is too much effort?
- Should it be as easy as adding a file on s3?
- No that shouldn't be a requirement, should be able to link to a reference externally (A "source of truth" ie an FTP link), and the workflow will build the references
- So like mulled biocontainers, just make a PR to the samplesheet and boom new reference in the s3 bucket if it's approved?
PoC for v1.0:
- Replace aws-igenomes
- bwa, bowtie2, star, bismark need to be built
- fasta, gtf, bed12, mito_name, macs_gsize, blacklist, copied over
Other nice things to have:
- Building our test-datasets
- Down-sampling to create a unified genomics test dataset (thinking about viralintegration/rnaseq/wgs) and spiking in test cases of interest (specific variants, for example)
If you use nf-core/references for your analysis, please cite it using the following doi: 10.5281/zenodo.14576225
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.