dessimozlab/nf-oma-browser-build is a pipeline for building an OMA Browser instance from an OMA (Orthologous MAtrix) analysis. The pipeline converts the output of either a production OMA run or a FastOMA run into the HDF5 files needed to run an omabrowser web server. It also integrates additional data such as GO annotations, domain annotations, and cross-references to UniProt and RefSeq.
The first part of the pipeline depends on the input type (production OMA or FastOMA). The later steps are common to both input types.
- extract genomes in dataset from Matrix file
- extract relevant data such as proteins, loci, etc. from the genome databases
- convert Matrix, extract splicing information
TODO: implement and write...
- convert HOGs, sequences into HDF5 database, build suffix index and kmer-lookup table (in subworkflow `IMPORT_HDF5`)
- import domain annotations if available
- import cross-references from UniProt and RefSeq (subworkflow `GENERATE_XREFS`)
- import GO annotations and Ontology
At the end of the run, the pipeline writes to the `outputDir` (default `results/`) the files needed to load a docker-compose managed omabrowser instance.
All parameters are described by running the workflow with `--help`: `nextflow run . --help`
| Parameter | Description | Type | Default |
|---|---|---|---|
| `hog_orthoxml` | Hierarchical orthologous groups (HOGs) in orthoxml format | string | |
| `matrix_file` | OMA Groups file | string | |
| `pairwise_orthologs_folder` | Pairwise Orthologs (only by Standard OMA pipeline) | string | |
| `genomes_dir` | Folder containing genomes | string | |
| `known_domains` | Folder containing known domain assignments files | string | |
| `cath_names_path` | File containing CATH domain descriptions | string | http://download.cathdb.info/cath/releases/latest-release/cath-classification-data/cath-names.txt |
| `pfam_names_path` | File containing Pfam descriptions | string | https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.clans.tsv.gz |
| `xref_uniprot_swissprot` | UniProtKB/SwissProt annotation in text format | string | https://ftp.ebi.ac.uk/pub/databases/uniprot/knowledgebase/uniprot_sprot.dat.gz |
| `xref_uniprot_trembl` | UniProtKB/TrEMBL annotations in text format | string | `/dev/null` |
| `xref_refseq` | Folder containing RefSeq gbff files | string | |
| `go_obo` | Gene Ontology OBO file | string | http://purl.obolibrary.org/obo/go/go-basic.obo |
| `go_gaf` | Gene Ontology annotations (GAF format) | string | https://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz |
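
As a rough illustration of how these parameters could be supplied in one place, the Python sketch below collects user-chosen values into a JSON file that Nextflow can read through its `-params-file` option. The helper names (`build_params`, `write_params_file`) and the file name `params.json` are hypothetical and not part of the pipeline. The name check only covers the parameters listed in the table above, which may not be exhaustive.

```python
import json
from pathlib import Path

# Parameter names taken from the table above. This set is only as complete
# as that table; `nextflow run . --help` remains the authoritative list.
KNOWN_PARAMS = {
    "hog_orthoxml", "matrix_file", "pairwise_orthologs_folder",
    "genomes_dir", "known_domains", "cath_names_path", "pfam_names_path",
    "xref_uniprot_swissprot", "xref_uniprot_trembl", "xref_refseq",
    "go_obo", "go_gaf",
}


def build_params(overrides):
    """Return a copy of `overrides` after guarding against misspelled names.

    Parameters that are not overridden are left out, so the pipeline's
    own defaults apply to them.
    """
    unknown = sorted(set(overrides) - KNOWN_PARAMS)
    if unknown:
        raise ValueError(f"parameter(s) not listed in the table: {unknown}")
    return dict(overrides)


def write_params_file(overrides, path="params.json"):
    """Write the checked parameters as JSON and return the written path."""
    params = build_params(overrides)
    out = Path(path)
    out.write_text(json.dumps(params, indent=2, sort_keys=True) + "\n")
    return out
```

A file written this way could then be passed to the workflow with `nextflow run . -params-file params.json`.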
| Parameter | Path / Value from OMA analysis |
|---|---|
| `hog_orthoxml` | HOG in orthoxml format from `$DARWIN_OMADATA_PATH/HOGs/`, usually latest one |
| `matrix_file` | OMA Groups Matrix file, same run as HOG, located in `$DARWIN_OMADATA_PATH/Matrix`; use the `_merged` version |
| `genomes_dir` | Corresponds to `$DARWIN_GENOMES_PATH`. Must contain `Summaries.drw` and all the databases in the subfolders |
| `pairwise_orthologs_folder` | Base directory where the `.orth.tsv.gz` files have been created. Usually corresponds to `$DARWIN_OMA_SCRATCH_PATH/Phase4/`. If not specified, no VPairs will be imported and dotplot won't work |
| `known_domains` | Folder with processed known domain annotations. This will likely change in the future. For now, these files have to be generated outside the pipeline |
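
The mapping in this table can be sketched in Python as follows. This is an illustration only: the function names `params_from_darwin_env` and `has_summaries` are hypothetical, the HOG file name is supplied by the caller because the table does not fix it, and the matrix file is passed as a full path so that nothing is assumed about how the `_merged` file is named.

```python
from pathlib import Path


def params_from_darwin_env(env, hog_name, matrix_file):
    """Derive pipeline parameters from the DARWIN_* variables of an OMA run.

    env         -- mapping such as os.environ
    hog_name    -- name of the orthoxml file inside $DARWIN_OMADATA_PATH/HOGs/
    matrix_file -- full path to the `_merged` Matrix file (chosen by the caller)
    """
    params = {
        "hog_orthoxml": str(Path(env["DARWIN_OMADATA_PATH"]) / "HOGs" / hog_name),
        "matrix_file": str(matrix_file),
        "genomes_dir": env["DARWIN_GENOMES_PATH"],
    }
    # The pairwise orthologs folder is optional. Per the table, leaving it
    # out means no VPairs are imported and the dotplot will not work.
    scratch = env.get("DARWIN_OMA_SCRATCH_PATH")
    if scratch:
        params["pairwise_orthologs_folder"] = str(Path(scratch) / "Phase4")
    return params


def has_summaries(genomes_dir):
    """Check the one requirement the table states for genomes_dir."""
    return (Path(genomes_dir) / "Summaries.drw").is_file()
```

Calling `params_from_darwin_env(os.environ, ...)` after importing `os` would read the variables from the current shell environment. `known_domains` is left out on purpose, since those files are generated outside the pipeline.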