Input Files

Francoise Thibaud-Nissen edited this page May 23, 2023 · 23 revisions

To annotate a genome with PGAP using the instructions in Quick Start, you need:

  • The binomial name or the genus of the sequenced organism. This identifier must be valid in NCBI Taxonomy (see Taxonomy information for how to find out if the name is valid). If you are unsure about the organism assignment, provide your best guess and use the flags --taxcheck and --auto-correct-tax, so that the organism provided is verified by average nucleotide identity (ANI) and corrected as needed before the annotation is run.
  • The sequence of the genome in fasta format.

You can provide this information in two ways, depending on your needs:

Annotation for your own use

Convenient if you do not intend to submit the annotated genome to GenBank and do not need metadata added to the annotated genome and genes.
Pass the path to the genome assembly sequence file (-g <fasta>) and the organism name (-s '<organism_name>') to the pgap.py script:

$ ./pgap.py [optional flags] -r|n -o <results> -g <fasta> -s '<organism_name>'

Annotation for submission to GenBank

In order to produce submission-ready annotation, at a minimum, contact info and authors need to be provided so they can be incorporated into the annotated files. For this use case, prepare three files:

  • The genome assembly sequence
  • The metadata YAML file
  • The generic YAML file

Read below to learn what these files should contain.

Then run:

$ ./pgap.py [optional flags] -r|n -o <results> <generic.yaml>

Genome assembly sequence file

The sequences constituting the genome assembly should be provided in a fasta file.
The genome assembly size (measured as the count of bases in the input fasta sequences ignoring Ns) must be within the reasonable range expected for the organism. If the size range is not known for the genus species, the minimum and maximum size allowed are 15 Kb and 100 Mb respectively.
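
As a rough illustration of the size rule above, the following sketch counts the bases in a fasta file while ignoring Ns, then compares the total to the default bounds. The function names are hypothetical, and treating 1 Kb as 1,000 bases is an assumption. PGAP performs its own check, which may differ in detail.

```python
# Default bounds from the text; 1 Kb = 1,000 bases is an assumption.
MIN_SIZE = 15_000
MAX_SIZE = 100_000_000


def assembly_size(fasta_text: str) -> int:
    """Count bases across all sequences, skipping definition lines and N/n."""
    total = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += sum(1 for base in line if base not in "Nn")
    return total


def size_in_default_range(fasta_text: str) -> bool:
    """True if the N-free size falls within the default 15 Kb to 100 Mb range."""
    return MIN_SIZE <= assembly_size(fasta_text) <= MAX_SIZE
```

To check a file, read it first, for example with `size_in_default_range(open(path).read())`. If the organism has a known expected size range, the narrower range applies instead of these defaults.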

Definition lines

Each sequence in the file must have a definition line that begins with '>' followed by a unique identifier (SeqID), e.g. >contig001 or >contig002.
The SeqIDs must:

  1. Be less than 50 characters long
  2. Only include letters, digits, hyphens (-), underscores (_), periods (.), colons (:), asterisks (*), and number signs (#).
  3. Be unique within a genome
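
The three SeqID rules can be checked before running the pipeline with a sketch like the one below. The helper names are hypothetical, and "letters" is assumed here to mean ASCII letters. This is not the validation code PGAP itself uses.

```python
import re
from collections import Counter

# Allowed characters per the rules above; ASCII letters are an assumption.
ALLOWED = re.compile(r"[A-Za-z0-9\-_.:*#]+")


def seqid_of(defline: str) -> str:
    """Return the SeqID: the text between '>' and the first whitespace."""
    parts = defline[1:].split(maxsplit=1)
    return parts[0] if parts else ""


def check_seqids(deflines):
    """Return a list of problems; an empty list means all three rules pass."""
    seqids = [seqid_of(d) for d in deflines]
    problems = []
    for seqid, count in Counter(seqids).items():
        if len(seqid) >= 50:
            problems.append(f"{seqid}: 50 characters or longer")
        if not ALLOWED.fullmatch(seqid):
            problems.append(f"{seqid}: empty or contains a disallowed character")
        if count > 1:
            problems.append(f"{seqid}: used {count} times")
    return problems
```

Pass it every line of the fasta file that starts with '>'. Tag value pairs after the SeqID are ignored by this check.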

⚠️ If some sequences in the assembled genome are known to be from plasmids and others from the chromosome, include the location in the definition line of each plasmid sequence with the tag value pair [location=plasmid], after the SeqID or another tag value pair and a space. In addition, the name of the plasmid can be provided as [plasmid-name=<value>]. For example:
>contig001 [location = plasmid] [plasmid-name = pABC01]
indicates that the sequence contig001 is part of the plasmid pABC01. Sequences that do not have the location modifier are assumed to be chromosomal ([location = chromosome]). Note that the location can be set globally in the Metadata YAML file (see below) if ALL sequences are chromosomal or all sequences are plasmid.

⚠️ If some sequences in the assembled genome are circular and others linear, include the topology in the definition line of each circular sequence with the tag value pair [topology=circular], after the SeqID or another tag value pair and a space. For example:
>contig001 [location = plasmid] [plasmid-name = pABC01] [topology=circular]
indicates that the sequence contig001 is the circular plasmid pABC01. Sequences that do not have the topology tag value pair are assumed to be linear ([topology=linear]). Note that the topology can be set in the Metadata YAML file if all sequences are linear or all sequences are circular.
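
To preview how a definition line with tag value pairs would be read, the sketch below extracts the SeqID and the tags, then fills in the defaults described above (chromosome, linear). The function name and the regular expression are illustrative assumptions, not PGAP's parser. It accepts both spacings shown in the examples ([location=plasmid] and [location = plasmid]), and it does not account for values set globally in the Metadata YAML file.

```python
import re

# Matches [key=value] with optional spaces around the key, '=' and value.
TAG_PATTERN = re.compile(r"\[\s*([^\]=\s]+)\s*=\s*([^\]]*?)\s*\]")


def parse_defline(defline: str):
    """Return (seqid, tags) for a fasta definition line starting with '>'."""
    seqid = defline[1:].split(maxsplit=1)[0]
    tags = dict(TAG_PATTERN.findall(defline))
    # Defaults stated in the text for sequences without these tags.
    tags.setdefault("location", "chromosome")
    tags.setdefault("topology", "linear")
    return seqid, tags
```

For example, `parse_defline(">contig002")` yields the SeqID `contig002` with the location `chromosome` and the topology `linear`.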

Sequences

  • All sequences must be 200 nucleotides or more.
  • There should be no N at the beginning or end of each sequence.
  • No sequence should be all Ns.
  • Stretches of 10 Ns or more will be considered gaps of known length.
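
The sequence rules above can be sketched as two small checks: one that reports rule violations for a single sequence, and one that lists the runs of 10 or more Ns that would be treated as gaps. The function names and messages are hypothetical. PGAP's own validation may report these conditions differently.

```python
import re


def sequence_problems(seq: str):
    """Return a list of rule violations for one sequence (empty if none)."""
    upper = seq.upper()
    problems = []
    if len(upper) < 200:
        problems.append("shorter than 200 nucleotides")
    if upper and set(upper) == {"N"}:
        problems.append("all Ns")
    elif upper.startswith("N") or upper.endswith("N"):
        problems.append("N at beginning or end")
    return problems


def gaps(seq: str):
    """Return (0-based start, length) for each run of 10 or more Ns."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(r"N{10,}", seq.upper())]
```

Each sequence should be passed as a single string with line breaks removed, so that runs of Ns spanning several fasta lines are counted as one stretch.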

Metadata files

The attributes of the genome assembly must be provided as two files in YAML format.
⚠️ Note that YAML is a standard, human-readable format, but it is unforgiving and care is needed to write valid YAML:

  • For indentation, use spaces rather than tabs.
  • We recommend using a text editor that supports YAML validation, such as VS Code with the YAML extension, or other Supporting editors.
  • The schema to validate the YAML is available from the Schema Store under pgap_yaml_input_reader. Supporting editors may use it automatically.
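
If no YAML-aware editor is at hand, tab indentation can be caught with a few lines of Python. This is a minimal sketch with a hypothetical function name. It only looks for tabs in leading whitespace and does not otherwise validate the YAML.

```python
def tab_indented_lines(yaml_text: str):
    """Return 1-based numbers of lines whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad
```

An empty list means no line is indented with a tab.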

Generic YAML file

Provides the necessary information for running the pipeline.
All fields are required.

  • fasta - Path to the genome assembly fasta file (see above, Genome assembly sequence file)
  • submol - Path to the file specifying the origin of the genome assembly (see below)

Example

fasta: 
    class: File
    location: Ecoli1_genomic.fna
submol:
    class: File
    location: E_coli1_submol.yaml

Metadata YAML file (submol)

submol, in the generic YAML file described above, points to another YAML file in the same directory, the metadata file. This file should preferably be named submol*.yaml so it may be recognized by your YAML editor.

The information that can be provided in the file is described below. Note that the only required field is the organism name, but additional information must be included if you plan to submit the results to GenBank.

Currently the metadata in the submol.yaml file only supports the 7-bit ASCII subset of Unicode. We are working on supporting the full UTF-8 character set.
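
Because only 7-bit ASCII is supported, it can help to locate accented or other non-ASCII characters in the metadata file before running PGAP. The sketch below uses a hypothetical function name and reports the position of each offending character.

```python
def non_ascii_characters(text: str):
    """Return (line, column, character) for each non-ASCII character, 1-based."""
    found = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, char in enumerate(line, start=1):
            if ord(char) > 127:
                found.append((line_no, col_no, char))
    return found
```

An empty result means the text is plain 7-bit ASCII. Read the file with `encoding="utf-8"` so that such characters are decoded correctly before they are reported.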

  • topology - optional. Topology of the sequences included in the fasta file. Possible values are linear or circular. Circular means that the first base in the sequence is adjacent to the last base. Please provide the topology in the metadata YAML file only if it is applicable to ALL sequences in the fasta file. If some sequences in the assembled genome are circular and others linear, include the topology in the definition line of each sequence in the fasta file with the tag value pair [topology=circular] or [topology=linear], after the SeqID and a space (e.g. >seq1 [topology=circular]). If the topology is provided in neither the metadata YAML nor the fasta file, the sequences will be assumed to be linear.
  • location - optional. Location in the cell of the sequences in the fasta file. Possible values are plasmid or chromosome. Similarly to topology, please provide the location in the metadata YAML file only if it is applicable to ALL sequences in the fasta file. If some sequences in the assembled genome are from the chromosome and others are from plasmids, include the location in the definition line of each sequence in the fasta file with the tag value pair [location=plasmid] or [location=chromosome], after the SeqID and a space (e.g. >seq1 [location=plasmid]). If the location is provided in neither the metadata YAML nor the fasta file, the sequences will be assumed to be chromosomal.
  • organism
    genus_species - binomial name or, if the species is unknown, genus for the sequenced organism. This identifier must be valid in NCBI Taxonomy (see Taxonomy information for how to find out if the name is valid).
    strain - optional. Strain of the sequenced organism
  • contact_info - optional, but include it if you intend to submit to GenBank. The main contact for this genome assembly
    last_name - Last name
    first_name - First name
    email - Email address
    organization - Organization or consortium submitting the genome assembly
    department - Department or division submitting the genome assembly
    phone - optional. Phone number
    fax - optional. Fax number
    street - Street address
    city - City
    state - State or region
    postal_code - Postal code
    country - Country
  • authors - optional, but include if intending to submit to GenBank. Author(s) of the genome assembly. Authors can be different from the contact.
    last_name - Last name
    first_name - First name
    middle_initial - optional. First letter of middle name.
  • consortium - optional. Name of the project that generated the genome assembly
  • comment - optional. Free text comment about the genome assembly. Appears in the COMMENT section of each GenBank sequence record.
  • bioproject - optional. BioProject ID (PRJXX) for the project, if available
  • biosample - optional. BioSample ID (SAMXXX) for the sequenced sample, if available
  • locus_tag_prefix - optional. Prefix of one to nine letters used for naming genes on this genome assembly. If an official locus tag prefix has already been reserved from an INSDC organization (GenBank, ENA or DDBJ) for the given BioSample and BioProject pair, provide it here. Otherwise, provide a string of your choice. If no value is provided, the prefix 'pgaptmp' will be used. See more details in this Note about locus tags.
  • sra - optional. Sequence reads used to build the assembly
    accession - Sequence Read Archive (SRA) accession for the run (with SRR, ERR or DRR prefix)
  • publications - optional. Publication describing the genome assembly
    publication.status - can be only one of: "published", "in-press", "unpublished"
    pmid - PubMed ID for the publication

Example:

topology: 'circular'
location: 'chromosome'
organism:
    genus_species: 'Escherichia coli'
    strain: 'my_strain'
contact_info:
    last_name: 'Doe'
    first_name: 'Jane'
    email: 'jane_doe@gmail.com'
    organization: 'NIH'
    department: 'NCBI'
    phone: '301-555-0245'
    fax: '301-555-1234'
    street: '9000 Rockville Pike'
    city: 'Bethesda'
    state: 'MD'
    postal_code: '20850'
    country: 'USA'
authors:
    - author:  
        last_name: 'Doe'    
        first_name: 'Jane'
        middle_initial: 'A'
    - author:  
        last_name: 'Doe'    
        first_name: 'John'
consortium: 'E. coli genome group'
bioproject: 'PRJ9999999'
biosample: 'SAMN99999999'      
locus_tag_prefix: 'pgaptmp'
sra:
    - accession: 'SRR9999999'
    - accession: 'ERR9999999'
publications:
    - publication:
        pmid: 29112715

Important! Do not leave any field empty. If you have no value for an optional field, omit the field entirely.

Taxonomy information

To find out whether your organism of interest is registered in NCBI Taxonomy:

  1. Go to NCBI Taxonomy
  2. Enter the organism name in the search box and press Search
  3. Click on the result
  4. Verify that the rank is 'genus' or more specific

Note about locus tags

  • You can run PGAP with the locus tag prefix (LTP) of your choice, whether or not you plan to submit the annotated genome to GenBank.
  • If you plan to submit to GenBank and want the final locus tags in the PGAP output, register the BioProject and then the BioSample at https://submit.ncbi.nlm.nih.gov/subs/ PRIOR to running PGAP, and provide the returned BioProject, BioSample and LTP in the input YAML file.
  • If you run PGAP with an arbitrarily chosen LTP and later decide to submit the PGAP-annotated genome to GenBank, the LTP will be automatically changed to the one assigned to the BioProject:BioSample pair during processing of the genome.