
Genome annotation in NGB

kupryashova edited this page Feb 6, 2018 · 6 revisions

Thesaurus:

User-created annotation (UCA) - a GFF file that is displayed in the NGB GUI and can be edited through it.
Feature - any Sequence Ontology annotation term with defined coordinates (exon, transcript, gene, pseudogene, repeat, transposon, miRNA, ncRNA, tRNA, rRNA, mRNA, splice sites etc.).
Automatically generated annotations (auto-ann) - .fasta (if coordinates are present in the header), .gff3, .gtf, .bed.
GFF3 description: http://gmod.org/wiki/GFF3

UCA track represents features:

Basic features - gene, exon, mRNA, miRNA, ncRNA, tRNA, rRNA, pseudogene, pseudogenic transcript, repeat, transposon.
Also marked - readthrough stop codons, exons lacking CDS, start codons, stop codons, CDS (if start and stop codons are set).
CDS - a contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.

Simplified SO terms scheme:

More details and definitions are here: http://www.sequenceontology.org/browser/current_svn/term/SO:0000316

  • Any RNA has 1 parent (gene or pseudogene).
  • RNAs and pseudogenic_transcript features are parents of the exons, splice sites, CDS etc. that they include.
  • The gene or pseudogene shown in the UCA track displays all the exons of its children, and its borders correspond to the outer borders of the outermost exons.
  • A duplicated feature has the same parent as the original one (except: if a gene is duplicated, an mRNA is created; if a pseudogene is duplicated, a pseudogenic_transcript is created).
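The parent/child rules above can be illustrated with a small sketch. The `Feature` class and the `gene_borders` helper are hypothetical names chosen for this example, not part of NGB; coordinates are assumed to be 1-based and inclusive, as in GFF3.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Feature:
    """Hypothetical in-memory feature; not an NGB class."""
    type: str
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int
    parent: Optional["Feature"] = field(default=None, repr=False, compare=False)
    children: List["Feature"] = field(default_factory=list)

    def add_child(self, child: "Feature") -> "Feature":
        # Every RNA has exactly one parent; exons hang off the RNA.
        child.parent = self
        self.children.append(child)
        return child


def exons_of(feature: Feature) -> List[Feature]:
    """Collect every exon below a feature, at any depth."""
    found = []
    for child in feature.children:
        if child.type == "exon":
            found.append(child)
        found.extend(exons_of(child))
    return found


def gene_borders(gene: Feature):
    """Gene borders are the outer borders of the outermost child exons."""
    exons = exons_of(gene)
    return min(e.start for e in exons), max(e.end for e in exons)


gene = Feature("gene", 0, 0)
mrna1 = gene.add_child(Feature("mRNA", 100, 500))
mrna1.add_child(Feature("exon", 100, 200))
mrna1.add_child(Feature("exon", 400, 500))
mrna2 = gene.add_child(Feature("mRNA", 150, 650))
mrna2.add_child(Feature("exon", 150, 200))
mrna2.add_child(Feature("exon", 600, 650))

gene.start, gene.end = gene_borders(gene)
print(gene.start, gene.end)  # 100 650
```

Recomputing the gene borders from the exons, rather than storing them independently, keeps the parent consistent after any border change in a child.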

Use cases scheme
A use case is available only after the user has completed a use case at the level above. For example, a user can add a feature to the UCA only once the UCA track has been created. Note that “Undo/Redo actions” is available only after the “Get history” use case, and “Merge features” is available only when there are at least 2 basic features to merge.

Rules:

  • All changes of annotation via GUI are represented in the resulting UCA gff
  • 1 UCA for 1 reference
  • If CDS is not specified for a region dragged from the auto-annotation or for a newly created feature, it is calculated automatically.
  • If a transcript contains more than 1 exon, canonical and noncanonical splice sites are set automatically.
  • If borders are changed, the nearest features change to fit the new border, and CDS and splice sites are recalculated automatically.
  • CDS is calculated in the exons of a pseudogenic_transcript or mRNA (the transcribed RNA is the reverse complement of the template DNA strand, so it matches the coding strand with U in place of T) whenever border coordinates change, a readthrough stop codon is set, or an exon is added to or deleted from the RNA.

Start RNA codon - AUG
Stop RNA codons - UAA, UAG and UGA
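As a rough illustration of the automatic CDS calculation, the sketch below scans a spliced RNA sequence from a start codon, codon by codon, to the first in-frame stop codon. The function name `find_cds` and the choice of the first AUG as the default start are assumptions made for this example; the page does not specify how NGB picks the start.

```python
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}


def find_cds(rna, start=None):
    """Return the CDS as a 0-based half-open (begin, end) interval, or None.

    The CDS includes both the start codon and the stop codon.
    Assumption: if no start is given, the first AUG in the sequence is used.
    """
    rna = rna.upper()
    begin = rna.find(START) if start is None else start
    if begin < 0:
        return None
    # Walk in frame from the start codon until a stop codon is met.
    for i in range(begin, len(rna) - 2, 3):
        if rna[i:i + 3] in STOPS:
            return begin, i + 3
    return None


rna = "GGAUGAAACCCUAGGG"
begin, end = find_cds(rna)
print(rna[begin:end])  # AUGAAACCCUAG
```

Note that the out-of-frame "UGA" overlapping the start codon is skipped, because only in-frame codons are tested.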

Use cases

Create UCA track
Description: User creates a UCA track for manual genome annotation.
Preconditions: reference genome sequence (.fasta) is uploaded to NGB
Main scenario (User creates a new empty track):

  1. User selects the reference
  2. “Create UCA” option appears
  3. User selects the option
  4. NGB prompts the user to set a name for the UCA (default: <reference_name>_ann)
  5. User names a track
  6. New empty .gff track appears in Browser and in Dataset panels. It is marked as UCA

Exception:
A user cannot create more than one UCA for one reference. An attempt to do so produces a notification.

Alternative scenario (user marks existing gff as UCA)
Precondition: Auto annotation is uploaded to NGB (.gtf).

  1. User selects the auto-ann gff.
  2. “Mark as UCA” option appears
  3. User selects the option
  4. NGB prompts the user to set a name for the UCA (default: <reference_name>_ann)
  5. User names a track
  6. Track is renamed and marked as UCA

When the user drags and drops features from the auto-annotation, their coordinates (the corresponding lines in the GFF file) are copied, taking reference changes into account (see the “Change reference sequence” use case below). If a gene with the corresponding name already exists in the UCA, only the transcripts are copied.

Add feature to UCA
Main scenario:
Preconditions: UCA track is open. Reference is uploaded
Description: User creates new feature in the UCA track.

  1. User clicks at some coordinate in the UCA track
  2. “Add feature” option appears
  3. User selects the option
  4. User chooses type of feature, fills coordinates, name, comments.
  5. Feature is represented in UCA

Alternative scenario:
Preconditions: UCA track is open and auto-ann is open. Reference is uploaded
Description: User drags and drops basic features annotation from auto-ann .gff track to UCA track.

  1. User selects the feature at annotation track and drags it to the UCA track
  2. UCA shows the new feature together with its downstream SO features as specified in the auto-annotation

Extensions:

  1. a. User selects several distinct features in the GUI at once (Ctrl+click)
  2. a. UCA represents the new features

Change borders of the basic feature in UCA
Preconditions: UCA track is open, at least one feature is present.
Description: User moves the borders of the feature left and right with a mouse
Main scenario:

  1. User selects feature with a click
  2. User clicks on the border and moves it with the mouse.
  3. UCA represents modified coordinates of the feature

TBD: the parent feature also changes. If the changed exon belongs to a duplicated transcript and is identical between parent and child, a notification appears: “Make new exon?” If Yes, a new exon appears in the GFF. Otherwise, both the child’s and the parent’s exons change.

Features in the GFF and in the GUI whose borders overlap the changed border also change their border coordinates. CDS and splice sites are recalculated.

Change type of the feature in UCA
Preconditions: UCA track is open, at least one feature is present.
Description: User changes type of the feature (SO term) in the UCA track.
Main scenario:

  1. User selects feature (basic features except exon).
  2. The menu appears.
  3. User selects option to “change type” of the feature
  4. A list of available types appears (basic features except exon).
  5. User selects a desired type.
  6. UCA represents the change.

Corresponding changes occur in the GFF file for the edited region:

  • For a transposon, only the full coordinates and the feature type are stored.
  • In other cases, the feature type is gene, and the GFF line corresponding to an RNA is changed to the RNA type, or to pseudogenic_transcript in the case of a pseudogene.
  • Transposons and repeats don’t have corresponding genes.
  • All lines in the GFF file that are not downstream terms in the Sequence Ontology hierarchy are removed from the file.

http://www.sequenceontology.org/browser/current_svn/term/SO:0000655

Change reference sequence
Preconditions: UCA track is open, at least one feature is present here.
Description: user can alter the reference sequence at the reference track to get modified RNA and protein sequences and altered genomic coordinates in the UCA gff track.
Main scenario:

  1. User selects a distinct nucleotide in the reference track
  2. Menu appears
  3. User selects type of change
  4. Form specifying change appears
  5. User fills the form.
  6. Change is represented at reference track and in the comments field of the overlapping features

The original reference sequence is not changed. Form for different changes should contain:

Substitution fields:

  • (+) strand - new nucleotide(s) at the selected position
  • (-) strand - new nucleotide(s) at the selected position
  • Comments - user's comments

User fills in the + or - strand; the empty field is auto-filled according to the complementarity rule.

Deletion fields:

  • Length - the number of deleted nucleotides (from the selected position towards the 3' end of the + strand)
  • Comments - user's comments

Insertion fields:

  • (+) strand - new nucleotide(s) at the selected position
  • (-) strand - new nucleotide(s) at the selected position
  • Comments - user's comments

Insertion occurs from the selected nucleotide towards the 3' end of the + strand. User fills in the + or - strand; the empty field is auto-filled according to the complementarity rule.
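The three change forms can be sketched as pure functions that return a modified copy, which matches the rule that the original reference sequence is not changed. All function names here are hypothetical. Two assumptions are made where the page is not explicit: the (-) strand field is the position-by-position complement of the (+) strand field, and inserted nucleotides begin at the selected position, pushing the original nucleotide towards the 3' end.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def fill_other_strand(plus=None, minus=None):
    """Auto-fill the empty strand field from the filled one.

    Assumption: the fields are aligned position by position, so the
    other strand is the plain complement (not the reverse complement).
    """
    if plus is None and minus is None:
        raise ValueError("fill in at least one strand")
    if plus is None:
        plus = minus.upper().translate(COMPLEMENT)
    if minus is None:
        minus = plus.upper().translate(COMPLEMENT)
    return plus.upper(), minus.upper()


def substitute(ref, pos, plus):
    """Replace len(plus) nucleotides starting at 1-based position pos."""
    i = pos - 1
    return ref[:i] + plus + ref[i + len(plus):]


def delete(ref, pos, length):
    """Delete `length` nucleotides from pos towards the 3' end of the + strand."""
    i = pos - 1
    return ref[:i] + ref[i + length:]


def insert(ref, pos, plus):
    """Insert nucleotides so that they begin at 1-based position pos."""
    i = pos - 1
    return ref[:i] + plus + ref[i:]


reference = "ACGTACGT"
plus, minus = fill_other_strand(minus="AA")  # user filled only the (-) field
print(plus)                                  # TT
print(insert(reference, 3, plus))            # ACTTGTACGT
print(reference)                             # ACGTACGT (original is untouched)
```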

Set readthrough stop codon
Preconditions: UCA track is open, basic feature with exon is present.
Description: User can mark a stop codon as readthrough. The resulting CDS, RNA and peptide will lengthen.
Main scenario:

  1. User selects a mRNA or gene or pseudogenic_transcript (in case of pseudogene).
  2. Menu appears
  3. User selects Set readthrough stop codon
  4. Readthrough stop codon is marked. CDS lengthens up to the next stop codon
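A minimal sketch of the lengthening step: once the current stop codon is marked as readthrough, the scan continues in the same frame until the next stop codon. The name `extend_through_stop` is hypothetical, and returning `None` when no further in-frame stop exists is an assumption, since the page does not describe that case.

```python
STOPS = {"UAA", "UAG", "UGA"}


def extend_through_stop(rna, cds_begin, cds_end):
    """Extend a CDS past its readthrough stop codon.

    cds_begin/cds_end are 0-based half-open; cds_end points just past
    the stop codon that is being marked as readthrough.
    Returns the new (begin, end), or None if no later in-frame stop exists.
    """
    for i in range(cds_end, len(rna) - 2, 3):
        if rna[i:i + 3] in STOPS:
            return cds_begin, i + 3
    return None


rna = "AUGAAAUAGCCCUAAGG"
# The original CDS is AUGAAAUAG, i.e. the interval (0, 9).
print(extend_through_stop(rna, 0, 9))  # (0, 15)
```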

Delete features in UCA
Preconditions: UCA track is open, at least one feature is present.
Description: User can delete features from the UCA
Main scenario:

  1. User selects region or particular exon
  2. Menu appears
  3. User selects “Delete”
  4. Feature disappears from the UCA track. All child features also disappear

Add intron into exon in UCA
Preconditions: UCA track is open, at least one feature with exon is present.
Description: User splits exon into two by making intron in-between.
Main scenario:

  1. User selects particular exon
  2. Menu appears
  3. User selects “Make intron”
  4. NGB finds the nearest canonical splice sites (5’-…exon]GT/AG[exon…-3’).
  5. Intron in the corresponding exon appears

Exception:
If NGB cannot find a set of canonical splice sites within the selected exon, a warning box appears.

Alternative scenario:

  1. User selects particular exon
  2. Menu appears
  3. User selects “Split”
  4. 1-nucleotide intron in the middle of the original exon appears
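Both scenarios can be sketched as follows. The page does not say what the splice sites should be "nearest" to, so this example assumes they are nearest to the clicked position: the closest GT at or before the click is the donor, and the closest AG after it is the acceptor. Function names are hypothetical, and intervals are 0-based half-open for simplicity.

```python
def make_intron(seq, exon_start, exon_end, click):
    """Split an exon around the canonical GT...AG pair nearest to `click`.

    Returns the two resulting exons as half-open intervals, or None if no
    canonical pair fits inside the exon (the warning case).
    """
    # Donor "GT": closest one starting at or before the click, leaving
    # at least one exonic base on its left.
    donor = seq.rfind("GT", exon_start + 1, min(click + 2, exon_end))
    if donor < 0:
        return None
    # Acceptor "AG": closest one after the donor and not before the click,
    # leaving at least one exonic base on its right.
    acceptor = seq.find("AG", max(donor + 2, click), exon_end - 1)
    if acceptor < 0:
        return None
    return [(exon_start, donor), (acceptor + 2, exon_end)]


def split_exon(exon_start, exon_end):
    """Alternative scenario: a 1-nucleotide intron in the middle of the exon."""
    mid = (exon_start + exon_end) // 2
    return [(exon_start, mid), (mid + 1, exon_end)]


seq = "AAAAGTCCCCAGTTTT"
print(make_intron(seq, 0, len(seq), click=8))  # [(0, 4), (12, 16)]
print(split_exon(0, 9))                        # [(0, 4), (5, 9)]
```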

Merge several features in UCA
Preconditions: UCA track is open, at least 2 features are present.
Description: 2 or more features become one.
Main scenario:

  1. User selects 2 or more regions
  2. Menu appears
  3. User selects “Merge”
  4. Regions get merged (if at least one of the merged features has an exon at a given coordinate, that coordinate is exonic in the result)

Exception:
If the selected features are of different types, NGB cannot merge them and a warning box appears.

Alternative scenario:

  1. User selects 2 or more exons in one basic feature
  2. Menu appears
  3. User selects “Merge”
  4. Exons get merged and the intron between them disappears. If there is another exon between the selected ones, it is merged as well
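The exon-merging scenario can be sketched as an interval operation: the merged exon spans from the leftmost selected start to the rightmost selected end, which also absorbs any exon lying in between. The function name `merge_exons` is hypothetical; coordinates are 1-based and inclusive.

```python
def merge_exons(exons, selected):
    """Merge the selected exons of one feature into a single exon.

    exons    -- all exons of the feature as (start, end) tuples
    selected -- the exons chosen by the user (two or more of `exons`)
    """
    lo = min(start for start, _ in selected)
    hi = max(end for _, end in selected)
    # Exons fully outside the merged span are kept as they are; exons
    # inside it (selected or lying in between) are absorbed.
    kept = [(s, e) for s, e in exons if e < lo or s > hi]
    return sorted(kept + [(lo, hi)])


exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
print(merge_exons(exons, [(100, 200), (500, 600)]))
# [(100, 600), (700, 800)]
```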

Duplicate region in UCA
Preconditions: UCA track is open, at least one feature is present.
Main scenario:

  1. User selects a basic feature
  2. Menu appears
  3. User selects “Duplicate”
  4. An identical feature appears (with the number in the transcript name changed). If an exon was selected, a transcript containing only the selected exon appears

Move feature to the opposite strand
Preconditions: UCA track is open, at least one basic feature is present.
Main scenario:

  1. User selects a feature
  2. Menu appears
  3. User selects “Move to opposite strand”
  4. Region is represented at opposite strand.

Exception: Transposon and repeat don’t have strand specification so the strand cannot be changed

Set translation start, end.
Preconditions: UCA track is open, at least one feature is present.
Main scenario:

  1. User selects a coordinate in the mRNA or pseudogenic_transcript feature of UCA
  2. Menu appears
  3. User selects “Set translation start” (or “Set translation end”)
  4. Translation start (or end) is represented at the region

Set longest ORF
Preconditions: UCA track is open, at least one mRNA feature is present.
Main scenario:

  1. User selects a feature
  2. Menu appears
  3. User selects “Set longest ORF”
  4. NGB calculates longest ORF
  5. Longest ORF is represented at the region
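One common way to compute a longest ORF is to scan all three reading frames of the spliced mRNA for start-to-stop stretches and keep the longest one. The sketch below follows that approach; it is an assumption about the calculation, not a description of NGB's implementation, and `longest_orf` is a hypothetical name.

```python
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}


def longest_orf(rna):
    """Return the longest ORF as a 0-based half-open (begin, end), or None.

    Only the three forward frames of the mRNA are scanned. An ORF runs
    from the first AUG seen in a frame to the next in-frame stop codon,
    both included.
    """
    best = None
    for frame in range(3):
        begin = None
        for i in range(frame, len(rna) - 2, 3):
            codon = rna[i:i + 3]
            if begin is None and codon == START:
                begin = i
            elif begin is not None and codon in STOPS:
                if best is None or (i + 3 - begin) > (best[1] - best[0]):
                    best = (begin, i + 3)
                begin = None  # look for the next ORF in this frame
    return best


rna = "AUGUAACAUGAAACCCUAG"
begin, end = longest_orf(rna)
print(rna[begin:end])  # AUGAAACCCUAG
```

In this example frame 0 holds a short ORF (AUGUAA) and frame 1 holds the longer one, so the result comes from frame 1.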

Search features in UCA
Preconditions: reference is uploaded, corresponding UCA has at least one annotated feature
Main scenario:

  1. User selects the UCA track in the datasets tab
  2. Menu appears
  3. User selects “Search”
  4. A form appears with the “reference” field pre-filled, showing all the annotations
  5. User can search by annotation name, by reference, or by date of last modification
  6. User clicks on the desired annotation
  7. Annotation opens in NGB with appropriate reference

The “Search other annotations” option allows searching by the same parameters in annotations for other references.
UCA feature attribute fields: date of last modification, name, reference, length.

Edit information about region in UCA
Preconditions: UCA track is open, at least one feature is present here.
Main scenario:

  1. User selects feature.
  2. The menu appears
  3. User selects “Edit information”
  4. A form appears
  5. User edits form
  6. Edits are saved in column 9 of the UCA GFF3.

“Edit information” form fields: Type of region, Name, Description, Database references, Attributes, PubMed IDs, Gene ontology IDs, Comments.
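Column 9 of GFF3 holds `tag=value` pairs separated by `;`, with multiple values separated by `,` and reserved characters percent-encoded. A minimal sketch of serializing form values into that column is shown below. The mapping of form fields to tags (Description to `Note`, PubMed IDs to `Dbxref`) is an assumption for illustration, the IDs are made-up placeholders, and the function names are hypothetical.

```python
# Characters that must be percent-encoded inside GFF3 column 9 values.
RESERVED = {
    "%": "%25", ";": "%3B", "=": "%3D", "&": "%26", ",": "%2C",
    "\t": "%09", "\n": "%0A", "\r": "%0D",
}


def escape(value):
    """Percent-encode reserved characters, one character at a time."""
    return "".join(RESERVED.get(ch, ch) for ch in str(value))


def format_attributes(attrs):
    """Serialize a dict of attributes into a GFF3 column 9 string.

    A list or tuple value becomes a comma-separated multi-value attribute.
    """
    parts = []
    for tag, value in attrs.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        parts.append(f"{tag}=" + ",".join(escape(v) for v in values))
    return ";".join(parts)


form = {
    "ID": "gene0001",                           # placeholder ID
    "Name": "example_gene",                     # form field: Name
    "Note": "kinase; putative",                 # form field: Description (assumed tag)
    "Dbxref": ["PMID:0000001", "PMID:0000002"], # form field: PubMed IDs (placeholders)
}
print(format_attributes(form))
# ID=gene0001;Name=example_gene;Note=kinase%3B putative;Dbxref=PMID:0000001,PMID:0000002
```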

Get history of the region in UCA
Preconditions: UCA track is open, at least one feature is present here.
Main scenario:

  1. User selects feature.
  2. The menu appears
  3. User selects “History”
  4. A form with the name and date of each modification appears

Names of modifications: changed border, moved to opposite strand, created (somehow added to UCA), changed type, set readthrough stop codon, set stop codon, set start codon, set longest CDS, added intron, merged exons, merged features.

Undo/Redo actions from the history
Preconditions: UCA track is open, region is chosen, history is shown.
Main scenario:

  1. User selects the version to make current.
  2. The information for the selected version is shown in the track. By default, the last version is the current one
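One possible data structure for this behaviour is a list of versions plus a pointer to the current one; undo and redo are then both just "select a version". The class below is a hypothetical sketch, not NGB code. It assumes that every recorded modification is appended to the end of the list and becomes current, and that each entry stores the modification name and date shown in the history form.

```python
from datetime import datetime, timezone


class FeatureHistory:
    """Hypothetical per-feature history with a selectable current version."""

    def __init__(self, state):
        self.versions = []
        self.current = -1
        self.record("created", state)

    def record(self, name, state):
        """Store a new version; by default the last version is current."""
        self.versions.append({
            "name": name,                        # e.g. "changed border"
            "date": datetime.now(timezone.utc),  # shown in the history form
            "state": dict(state),                # copy, so later edits don't leak in
        })
        self.current = len(self.versions) - 1

    def select(self, index):
        """Undo or redo: make the chosen version the current one."""
        if not 0 <= index < len(self.versions):
            raise IndexError("no such version")
        self.current = index
        return self.state()

    def state(self):
        return self.versions[self.current]["state"]


history = FeatureHistory({"start": 100, "end": 200})
history.record("changed border", {"start": 100, "end": 250})
print(history.state()["end"])    # 250
print(history.select(0)["end"])  # 200 (undo)
print(history.select(1)["end"])  # 250 (redo)
```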

Get GFF or fasta file
Preconditions: UCA track is present in the datasets tab
Main scenario:

  1. User selects the UCA track in the datasets tab
  2. Menu appears
  3. User selects “get fasta” or “get gff”
  4. The appropriate file is downloaded

FASTA can be downloaded as AA (CDS), DNA, DNA (CDS) or RNA sequence.

Alternative scenario:
Preconditions: UCA track is open

  1. User selects region at reference bar
  2. User selects “get sequence” (a) or “get GFF” (b) from the menu
  3. The appropriate form is shown
  4. User copies the text information

When a region is specified to get GFF, all features that overlap it are extracted, together with their parents and “sisters”.
If an amino acid sequence is exported, the CDSs of the region are translated.
How exported fasta and gff headers are constructed: TBD.
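The region export rule can be sketched as follows. This example reads "parents and sisters" as the complete hierarchy of any overlapping feature, so that whole gene models are exported; that reading is an assumption. Features are plain dicts with hypothetical keys, and coordinates are 1-based and inclusive.

```python
def features_in_region(features, start, end):
    """Return every feature whose hierarchy overlaps [start, end]."""
    by_id = {f["id"]: f for f in features}

    def root_of(feature):
        # Walk up the Parent links to the top-level feature.
        while feature["parent"] is not None:
            feature = by_id[feature["parent"]]
        return feature["id"]

    # Top-level features that have at least one overlapping member.
    roots = {
        root_of(f) for f in features
        if f["start"] <= end and f["end"] >= start
    }
    # Export the whole hierarchy, including non-overlapping siblings.
    return [f for f in features if root_of(f) in roots]


features = [
    {"id": "g1", "parent": None, "type": "gene", "start": 100, "end": 650},
    {"id": "m1", "parent": "g1", "type": "mRNA", "start": 100, "end": 500},
    {"id": "e1", "parent": "m1", "type": "exon", "start": 100, "end": 200},
    {"id": "e2", "parent": "m1", "type": "exon", "start": 400, "end": 500},
    {"id": "g2", "parent": None, "type": "gene", "start": 1000, "end": 1200},
]
print([f["id"] for f in features_in_region(features, 150, 180)])
# ['g1', 'm1', 'e1', 'e2']
```

Exon e2 does not overlap the region but is still exported, because its sister e1 does.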

View and edit comments for UCA
Preconditions: UCA track is present in the datasets tab
Main scenario:

  1. User selects the UCA track in the datasets tab
  2. Menu appears
  3. User selects “Comments”
  4. An editable box of comments appears

Mockups: UCA menu, feature's menu

Additional features needed (in descending order of priority):

  1. BLAT search is needed to search homologous regions to annotate
  2. Vertical line showing exact coordinate.
  3. Next transcript hotkey. Next ORF hotkey for all gff tracks (whether ORFs are represented in gff tracks in NGB is TBD)
  4. NGB automatically drops repeats to the bottom of the UCA track
  5. Bars showing the amino acid sequence: 3 for each strand (all reading frames)
  6. Exons in the mRNA that are not in the CDS are highlighted

Use cases availability for features

Use case | gene | exon | mRNA | miRNA | ncRNA | tRNA | rRNA | pseudogene | pseudogenic transcript (PT) | repeat | transposon
Create new | + | - | - | - | - | - | - | + | - | + | +
Duplicate | +, mRNA added | +, mRNA added | + | + | + | + | + | +, PT added | + | + | +
Change border | + | + | + | + | + | + | + | + | + | + | +
Change type | + + +
Change ref | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA
Set readthrough stop codon | - + - + - -
Delete | +
Add intron into exon | - +
Merge features | - | + | + | + | + | + | + | - | + | + | +
Duplicate feature | +, mRNA appears +, mRNA appears + +, PT appears -
Move to opposite strand | + +, with corresponding RNA or gene + + - -
Set translation start-end | - + - - - - - + - -
Set longest ORF | - - - + - -
Search features | + | - | + | + | + | + | + | + | + | + | +
Edit information | + | - | + | + | + | + | + | + | + | + | +
Get history, Undo/Redo actions | + - + + +
Get GFF and fasta | + | +, gene’s gff | +, gene’s gff | +, gene’s gff | +, gene’s gff | +, gene’s gff | +, gene’s gff | + | +, pseudogene’s or gene’s gff | + | +
View and edit comments for UCA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA