Genome annotation in NGB
Thesaurus:
- User-created annotation (UCA): a GFF file that is displayed in the NGB GUI and can be edited through it.
- Feature: any Sequence Ontology annotation term with defined coordinates (exon, transcript, gene, pseudogene, repeat, transposon, miRNA, ncRNA, tRNA, rRNA, mRNA, splice sites, etc.).
- Automatically generated user annotations: .fasta (if coordinates are present in the header), .gff3, .gtf, .bed.
GFF3 description: http://gmod.org/wiki/GFF3
UCA track represents features:
- Basic features: gene, exon, mRNA, miRNA, ncRNA, tRNA, rRNA, pseudogene, pseudogenic transcript, repeat, transposon.
- Also marked: readthrough stop codons, CDS leaking exons, start codons, stop codons, CDS (if both start and stop codons are set).
CDS: a contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon.
Simplified SO terms scheme:
More details and definitions are here: http://www.sequenceontology.org/browser/current_svn/term/SO:0000316
- Any RNA has 1 parent (gene or pseudogene).
- RNAs and pseudogenic_transcript features are parents to the exons, splice sites, CDS, etc. that they include.
- The gene or pseudogene representation on the UCA track shows all exons of its children, and its borders correspond to the outer borders of the outermost exons.
- Duplicated features have the same parent as the original (exception: if a gene is duplicated, an mRNA is created; if a pseudogene is duplicated, a pseudogenic_transcript is created).
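The border rule for genes and pseudogenes can be sketched as follows. This is a minimal illustration, not NGB code: the `Exon` type and the `gene_bounds` function are hypothetical names, and coordinates are assumed to be 1-based and inclusive, as in GFF3.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int


def gene_bounds(transcripts: dict[str, list[Exon]]) -> tuple[int, int]:
    """Return (start, end) spanning the outermost exons of all child transcripts."""
    exons = [exon for child in transcripts.values() for exon in child]
    if not exons:
        raise ValueError("gene has no exons")
    return min(e.start for e in exons), max(e.end for e in exons)


# Two hypothetical transcripts of the same gene.
bounds = gene_bounds({
    "mRNA1": [Exon(100, 200), Exon(300, 400)],
    "mRNA2": [Exon(150, 200), Exon(300, 450)],
})
```

Here `bounds` is `(100, 450)`: the start of the leftmost exon and the end of the rightmost one.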
Use cases scheme
A use case is available only after the user has completed a use case from the level above. For example, a user can add a feature to a UCA only after the UCA track has been created.
Note that “Undo/Redo actions” is available only after the “Get history” use case.
The “Merge features” case is available only when there are at least 2 basic features to merge.
Rules:
- All annotation changes made via the GUI are reflected in the resulting UCA gff.
- One UCA per reference.
- If CDS is not specified for a region dragged from the auto-annotation or for a newly created feature, it is calculated automatically.
- If the transcript contains more than one exon, canonical and noncanonical splice sites are set automatically.
- If borders are changed, the nearest features change to fit the border, and CDS and splice sites are recalculated automatically.
- CDS is calculated over the exons of a pseudogenic_transcript or mRNA (the transcribed RNA is the reverse complement of the template DNA strand, so it matches the coding strand with U in place of T). It is recalculated whenever border coordinates change, a readthrough stop codon is set, or an exon is added to or deleted from the RNA.
Start RNA codon: AUG
Stop RNA codons: UAA, UAG and UGA
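A minimal sketch of automatic CDS calculation using the codons listed above. The function name `find_cds` is hypothetical, and the choice of the first AUG as the start is an assumption; the specification does not say which start codon NGB should prefer.

```python
from typing import Optional

START_CODON = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_cds(rna: str) -> Optional[tuple[int, int]]:
    """Return the 0-based, half-open (start, end) of the CDS in a spliced RNA.

    Assumption: the CDS starts at the first AUG and ends at the first
    in-frame stop codon, which is included in the CDS.
    """
    rna = rna.upper()
    start = rna.find(START_CODON)
    if start == -1:
        return None
    for pos in range(start + 3, len(rna) - 2, 3):
        if rna[pos:pos + 3] in STOP_CODONS:
            return start, pos + 3
    return None  # no in-frame stop codon downstream of the start


cds = find_cds("GGAUGAAACCCUAAGG")
```

For the example sequence `cds` is `(2, 14)`, that is, the substring `AUGAAACCCUAA`.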
Use cases
Create UCA track
Description: User creates a UCA track for manual genome annotation.
Preconditions: reference genome sequence (.fasta) is uploaded to NGB
Main scenario (User creates a new empty track):
- User selects the reference
- “Create UCA” option appears
- User selects the option
- NGB prompts for a UCA name (default: <reference_name>_ann)
- User names a track
- New empty .gff track appears in Browser and in Dataset panels. It is marked as UCA
Exception:
- A user cannot create more than one UCA for a single reference. An attempt to do so produces a notification.
Alternative scenario (user marks existing gff as UCA)
Precondition: Auto annotation is uploaded to NGB (.gtf).
- User selects the auto-ann gff.
- “Mark as UCA” option appears
- User selects the option
- NGB prompts for a UCA name (default: <reference_name>_ann)
- User names a track
- Track is renamed and marked as UCA
When the user drags and drops features from the auto-annotation, the corresponding lines in the GFF file are copied, with coordinates adjusted for reference changes (see the “Change reference sequence” use case below). If a gene with the same name already exists in the UCA, only the transcripts are copied.
Add feature to UCA
Main scenario:
Preconditions: UCA track is open. Reference is uploaded
Description: User creates new feature in the UCA track.
- User clicks at some coordinate in the UCA track
- “Add feature” option appears
- User selects the option
- User chooses type of feature, fills coordinates, name, comments.
- Feature is represented in UCA
Alternative scenario:
Preconditions: UCA track is open and auto-ann is open. Reference is uploaded
Description: User drags and drops basic features annotation from auto-ann .gff track to UCA track.
- User selects the feature on the annotation track and drags it to the UCA track
- UCA shows the new feature together with its downstream SO features specified in the auto-annotation
Extensions:
- a. User selects several distinct features in the GUI at once (“Ctrl” + click)
- a. UCA represents the new features
Change borders of the basic feature in UCA
Preconditions: UCA track is open, at least one feature is present.
Description: User moves the borders of the feature left and right with a mouse
Main scenario:
- User selects feature with a click
- User clicks on the border and moves it with the mouse.
- UCA represents modified coordinates of the feature
TBD - Parent feature also changes. If an exon within a duplicated transcript, identical between parent and child, is changed, a notification appears: “Make new exon?” If Yes, a new exon appears in the gff. Otherwise both the child's and the parent's exons change.
Features in the gff and in the GUI that overlap the changed border also change their border coordinates. CDS and splice sites are recalculated.
Change type of the feature in UCA
Preconditions: UCA track is open, at least one feature is present.
Description: User changes type of the feature (SO term) in the UCA track.
Main scenario:
- User selects feature (basic features except exon).
- The menu appears.
- User selects option to “change type” of the feature
- A list of available types appears (basic features except exon).
- User selects a desired type.
- UCA represents the change.
The GFF file changes in the edited region as follows. For a transposon, only the full coordinates and the feature type are stored. In other cases the feature type is gene, and the gff line corresponding to the RNA is changed to the RNA type, or to pseudogenic_transcript in the case of a pseudogene. Transposons and repeats don’t have corresponding genes. Moreover, all lines in the GFF file that are not downstream terms in the Sequence Ontology hierarchy are removed from the file. http://www.sequenceontology.org/browser/current_svn/term/SO:0000655
Change reference sequence
Preconditions: UCA track is open, at least one feature is present here.
Description: user can alter the reference sequence at the reference track to get modified RNA and protein sequences and altered genomic coordinates in the UCA gff track.
Main scenario:
- User selects a distinct nucleotide in the reference track
- Menu appears
- User selects type of change
- Form specifying change appears
- User fills the form.
- Change is represented at reference track and in the comments field of the overlapping features
The original reference sequence is not changed.
The form for each type of change should contain:
Substitution fields
(+) strand – new nucleotide(s) at the selected position
(-) strand – new nucleotide(s) at the selected position.
Comments – user's comments
User fills in either the (+) or the (-) strand; the empty field is filled automatically by the complementarity rule.
Deletion fields
Length – the number of deleted nucleotides (from the selected position towards the 3′ end of the + strand)
Comments – user's comments
Insertion fields
(+) strand – new nucleotide(s) at the selected position.
(-) strand – new nucleotide(s) at the selected position
Comments – user's comments
Insertion occurs from the selected nucleotide towards the 3′ end of the + strand.
User fills in either the (+) or the (-) strand; the empty field is filled automatically by the complementarity rule.
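The form logic can be sketched as below. Everything here is illustrative: `fill_strands` and `apply_change` are hypothetical names, the (-) strand field is assumed to be aligned position by position with the (+) strand (plain complement, not reverse complement), and inserted bases are assumed to begin at the selected position. The change is applied to a copy, so the original reference stays untouched.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def fill_strands(plus: str = "", minus: str = "") -> tuple[str, str]:
    """Auto-fill whichever strand field the user left empty."""
    plus, minus = plus.upper(), minus.upper()
    if plus and not minus:
        minus = plus.translate(COMPLEMENT)
    elif minus and not plus:
        plus = minus.translate(COMPLEMENT)
    return plus, minus


def apply_change(ref: str, pos: int, kind: str, plus: str = "", length: int = 0) -> str:
    """Return a modified copy of the + strand; pos is 1-based."""
    i = pos - 1
    if kind == "substitution":
        return ref[:i] + plus + ref[i + len(plus):]
    if kind == "deletion":
        return ref[:i] + ref[i + length:]
    if kind == "insertion":
        # Assumption: new bases start at the selected position.
        return ref[:i] + plus + ref[i:]
    raise ValueError(f"unknown change type: {kind}")


reference = "AACCGGTT"
substituted = apply_change(reference, 3, "substitution", plus="T")  # "AATCGGTT"
deleted = apply_change(reference, 3, "deletion", length=2)          # "AAGGTT"
inserted = apply_change(reference, 3, "insertion", plus="TT")       # "AATTCCGGTT"
```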
Set readthrough stop codon
Preconditions: UCA track is open, basic feature with exon is present.
Description: User can mark a stop codon as readthrough. The resulting CDS, RNA and peptide will lengthen.
Main scenario:
- User selects an mRNA, a gene, or a pseudogenic_transcript (in case of pseudogene).
- Menu appears
- User selects Set readthrough stop codon
- Readthrough stop codon is marked. CDS lengthens up to the next stop codon
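A minimal sketch of the CDS extension. The function name `extend_past_stop` is hypothetical, and raising an error when no downstream stop codon exists is an assumption, since the specification does not describe that case.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def extend_past_stop(rna: str, cds_end: int) -> int:
    """Return the new CDS end after marking the current stop codon as readthrough.

    cds_end is the 0-based, exclusive end of the current CDS, which is
    assumed to finish with a stop codon. The search continues in the same
    reading frame until the next stop codon, which is included.
    """
    rna = rna.upper()
    for pos in range(cds_end, len(rna) - 2, 3):
        if rna[pos:pos + 3] in STOP_CODONS:
            return pos + 3
    raise ValueError("no downstream in-frame stop codon")


# AUG AAA UAG | CCC GGG UGA: the original CDS ends at 9, the extended one at 18.
new_end = extend_past_stop("AUGAAAUAGCCCGGGUGAAA", 9)
```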
Delete features in UCA
Preconditions: UCA track is open, at least one feature is present.
Description: User can delete features from the UCA
Main scenario:
- User selects region or particular exon
- Menu appears
- User selects “Delete”
- Feature disappears from the UCA track. All child features also disappear
Add intron into exon in UCA
Preconditions: UCA track is open, at least one feature with exon is present.
Description: User splits exon into two by making intron in-between.
Main scenario:
- User selects particular exon
- Menu appears
- User selects “Make intron”
- NGB finds the nearest canonical splice sites (5’-…exon]GT/AG[exon…-3’).
- Intron in the corresponding exon appears
Exception:
If NGB cannot find a pair of canonical splice sites within the selected exon, a warning box appears.
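One possible reading of "nearest canonical splice sites" is sketched below. It is an assumption, not the specified behaviour: the intron is taken to enclose the clicked position, starting at the last GT before the click and ending at the first AG after it. The function name `find_intron` is hypothetical, and the exon is assumed to lie on the + strand.

```python
from typing import Optional


def find_intron(exon_seq: str, click: int) -> Optional[tuple[int, int]]:
    """Return the 0-based, half-open (start, end) of a GT...AG intron.

    click is the 0-based offset of the selected position inside the exon.
    Returns None when no canonical pair leaves exon sequence on both
    sides, which corresponds to the warning case.
    """
    seq = exon_seq.upper()
    donor = seq.rfind("GT", 0, click + 1)  # last GT ending at or before the click
    acceptor = seq.find("AG", click)       # first AG starting at or after the click
    if donor <= 0 or acceptor == -1 or acceptor + 2 >= len(seq):
        return None
    return donor, acceptor + 2


intron = find_intron("ATGGCCGTAAACCCAGGCTTAA", 10)
```

For the example, `intron` is `(6, 16)`, covering `GTAAACCCAG`; the flanking `ATGGCC` and `GCTTAA` become the two new exons.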
Alternative scenario
- User selects particular exon
- Menu appears
- User selects “Split”
- 1-nucleotide intron in the middle of the original exon appears
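The “Split” scenario reduces to simple coordinate arithmetic. A minimal sketch, assuming 1-based inclusive coordinates and a hypothetical helper `split_exon`; rounding the midpoint down is an arbitrary choice for exons of even length.

```python
def split_exon(start: int, end: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Split an exon into two around a 1-nucleotide intron at its midpoint."""
    if end - start + 1 < 3:
        raise ValueError("exon is too short to split")
    mid = (start + end) // 2  # position of the 1-nt intron
    return (start, mid - 1), (mid + 1, end)


left, right = split_exon(100, 200)
```

With these inputs the intron sits at position 150, giving exons `(100, 149)` and `(151, 200)`.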
Merge several features in UCA
Preconditions: UCA track is open, at least 2 features are present.
Description: 2 or more features become one
Main scenario:
- User selects 2 or more regions
- Menu appears
- User selects “Merge”
- Regions get merged (if there is at least one exon at the particular coordinate, the resulting coordinate will be exon)
Exception:
If the selected features are of different types, NGB cannot merge them and a warning box appears.
Alternative scenario:
- User selects 2 or more exons in one basic feature
- Menu appears
- User selects “Merge”
- Exons get merged and the introns between them disappear. Any exon lying between the selected ones is merged as well
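The exon-merging scenario can be sketched as follows. `merge_exons` is a hypothetical helper; exons are assumed to be sorted by start, non-overlapping, and given as 1-based inclusive `(start, end)` pairs.

```python
def merge_exons(exons: list[tuple[int, int]], selected: list[int]) -> list[tuple[int, int]]:
    """Merge the selected exons (given by index) and everything between them."""
    if len(selected) < 2:
        raise ValueError("select at least two exons to merge")
    first, last = min(selected), max(selected)
    merged = (exons[first][0], exons[last][1])
    return exons[:first] + [merged] + exons[last + 1:]


# Selecting the first and third exon also absorbs the second one.
result = merge_exons([(1, 10), (20, 30), (40, 50), (60, 70)], [0, 2])
```

Here `result` is `[(1, 50), (60, 70)]`.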
Duplicate region in UCA
Preconditions: UCA track is open, at least one feature is present.
Main scenario:
- User selects a basic feature
- Menu appears
- User selects “Duplicate”
- An identical feature appears (with the number in the transcript name changed). If an exon was selected, a transcript containing only the selected exon appears
Move feature to the opposite strand
Preconditions: UCA track is open, at least one basic feature is present.
Main scenario:
- User selects a feature
- Menu appears
- User selects “Move to opposite strand”
- Region is represented at opposite strand.
Exception: Transposon and repeat don’t have strand specification so the strand cannot be changed
Set translation start, end.
Preconditions: UCA track is open, at least one feature is present.
Main scenario:
- User selects a coordinate in the mRNA or pseudogenic_transcript feature of UCA
- Menu appears
- User selects “set translation start” (or “Set translation end”)
- Translation start (or end) is represented at the region
Set longest ORF
Preconditions: UCA track is open, at least one mRNA feature is present.
Main scenario:
- User selects a feature
- Menu appears
- User selects “Set longest ORF”
- NGB calculates longest ORF
- Longest ORF is represented at the region
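A minimal sketch of a longest-ORF search. It assumes the ORF must start with AUG and end with an in-frame stop codon, and that only the three reading frames of the transcript itself are scanned; `longest_orf` is a hypothetical name.

```python
from typing import Optional

STOP_CODONS = {"UAA", "UAG", "UGA"}


def longest_orf(rna: str) -> Optional[tuple[int, int]]:
    """Return the 0-based, half-open (start, end) of the longest AUG...stop ORF."""
    rna = rna.upper()
    best = None
    for frame in range(3):
        start = None
        for pos in range(frame, len(rna) - 2, 3):
            codon = rna[pos:pos + 3]
            if start is None and codon == "AUG":
                start = pos  # first start codon of a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if best is None or pos + 3 - start > best[1] - best[0]:
                    best = (start, pos + 3)
                start = None  # look for the next ORF in this frame
    return best


orf = longest_orf("AUGAAAUAGCAUGCCCGGGAAAUGA")
```

The example contains a 9-nucleotide ORF in frame 0 and a 15-nucleotide ORF in frame 1, so `orf` is `(10, 25)`.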
Search features in UCA
Preconditions: reference is uploaded, corresponding UCA has at least one annotated feature
Main scenario:
- User selects the UCA track in the datasets tab
- Menu appears
- User selects “Search”
- A form appears with the “reference” field pre-filled, showing all the annotations
- User can search by annotation name, by reference, or by date of last modification
- User clicks on the desired annotation
- Annotation opens in NGB with the appropriate reference
The “Search other annotations” option searches the specified parameters in annotations for other references.
UCA feature attributes fields: date of last modification, name, reference, length
Edit information about region in UCA
Preconditions: UCA track is open, at least one feature is present here.
Main scenario:
- User selects feature.
- The menu appears
- User selects “Edit information”
- A form appears
- User edits form
- Edits are saved in column 9 of the UCA GFF3.
“Edit information” form fields: Type of region, Name, Description, Database references, Attributes, PubMed IDs, Gene ontology IDs, Comments.
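Column 9 of GFF3 holds `tag=value` pairs separated by `;`, with multiple values separated by `,` and reserved characters percent-encoded. The sketch below shows one way the form fields could be serialized. The mapping of form fields to the GFF3 tags `Name`, `Dbxref`, `Ontology_term` and `Note` is an assumption, `format_attributes` is a hypothetical helper, and all identifiers in the example are placeholders.

```python
from urllib.parse import quote


def format_attributes(attrs: dict[str, list[str]]) -> str:
    """Serialize attributes into a GFF3 column 9 string."""
    parts = []
    for tag, values in attrs.items():
        # Percent-encode reserved characters such as ';', '=', '&' and ','.
        escaped = ",".join(quote(value, safe=" :/") for value in values)
        parts.append(f"{tag}={escaped}")
    return ";".join(parts)


column9 = format_attributes({
    "ID": ["gene0001"],
    "Name": ["abcA"],
    "Dbxref": ["PMID:12345", "EXAMPLE:0001"],
    "Ontology_term": ["GO:0000001"],
    "Note": ["checked; ok"],
})
```

The semicolon inside the comment is encoded as `%3B`, so it cannot be confused with the attribute separator.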
Get history of the region in UCA
Preconditions: UCA track is open, at least one feature is present here.
Main scenario:
- User selects feature.
- The menu appears
- User selects “History”
- A form appears listing the name and date of each modification
Names of modifications: changed border, moved to opposite strand, created (somehow added to UCA), changed type, set readthrough stop codon, set stop codon, set start codon, set longest CDS, added intron, merged exons, merged features.
Undo/Redo actions from the history
Preconditions: UCA track is open, region is chosen, history is shown
Main success scenario:
- User selects the version to make current.
- The selected version is shown in the track
By default, the latest version is the current one.
Get GFF or fasta file
Preconditions: UCA track is present in the datasets tab
Main scenario:
- User selects the UCA track in the datasets tab
- Menu appears
- User selects “get fasta” or “get gff”
- The appropriate file is downloaded
User can export fasta as AA (CDS), DNA, DNA (CDS), or RNA
Alternative: Preconditions: UCA track is open
- User selects region at reference bar
- a. User selects “get sequence” from menu
- b. User selects “get GFF” from menu
- The appropriate form is shown
- User copies text information
When a region is specified for GFF export, all features that overlap it are extracted, together with their parents and “sisters”.
If an amino acid sequence is exported, the CDSs of the region are translated.
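The region export can be sketched as follows. The interpretation is an assumption: any overlap pulls in the whole feature family, that is, the top-level ancestor and all of its descendants. `Feature` and `extract_region` are hypothetical names, and coordinates are 1-based inclusive.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    id: str
    parent: Optional[str]  # ID of the parent feature, None for top level
    start: int
    end: int


def extract_region(features: list[Feature], start: int, end: int) -> list[str]:
    """Return IDs of features overlapping [start, end], plus parents and sisters."""
    by_id = {f.id: f for f in features}

    def root(feature: Feature) -> str:
        while feature.parent is not None:
            feature = by_id[feature.parent]
        return feature.id

    hit_roots = {root(f) for f in features if f.start <= end and f.end >= start}
    return [f.id for f in features if root(f) in hit_roots]


features = [
    Feature("gene1", None, 100, 500),
    Feature("mRNA1", "gene1", 100, 500),
    Feature("exon1", "mRNA1", 100, 200),
    Feature("exon2", "mRNA1", 400, 500),
    Feature("gene2", None, 900, 1000),
]
exported = extract_region(features, 150, 180)
```

The region 150-180 overlaps only `exon1` and its ancestors, yet `exon2` is exported too because it is a sister of `exon1`; `gene2` is left out.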
How exported fasta and gff headers are constructed – TBD
View and edit comments for UCA
Preconditions: UCA track is present in the datasets tab
Main scenario:
- User selects the UCA track in the datasets tab
- Menu appears
- User selects “Comments”
- An editable box of comments appears
Mockups:
Additional required features (listed in descending priority):
- BLAT search is needed to search homologous regions to annotate
- Vertical line showing exact coordinate.
- Next transcript hotkey. Next ORF hotkey for all gff tracks (is the ORF represented in gff tracks in NGB? TBD)
- NGB automatically drops repeats to the bottom of UCA track
- Bar showing the amino acid sequence: 3 for each strand (all reading frames)
- Exons in the mRNA that are not in the CDS are highlighted
Use cases availability for features
Use case | gene | exon | mRNA | miRNA | ncRNA | tRNA | rRNA | pseudogene | pseudogenic transcript (PT) | repeat | transposon |
---|---|---|---|---|---|---|---|---|---|---|---|
Create new | + | - | - | - | - | - | - | + | - | + | + |
Duplicate | +, mRNA added | +, mRNA added | + | + | + | + | + | +, PT added | + | + | + |
Change border | + | + | + | + | + | + | + | + | + | + | + |
Change type | + | - | + | + | + | + | + | + | + | + | + |
Change ref | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Set read through stop codon | - | + | - | - | - | - | - | - | + | - | - |
Delete | + | + | + | + | + | + | + | + | + | + | + |
Add intron into exon | - | + | - | - | - | - | - | - | - | - | - |
Merge features | - | + | + | + | + | + | + | - | + | + | + |
Duplicate feature | +, mRNA appears | +, mRNA appears | + | + | + | + | + | +, PT appears | + | - | - |
Move to opposite strand | + | +, with corresponding RNA or gene | + | + | + | + | + | + | + | - | - |
Set translation start-end | - | + | - | - | - | - | - | - | + | - | - |
Set longest ORF | - | - | + | - | - | - | - | - | + | - | - |
Search features | + | - | + | + | + | + | + | + | + | + | + |
Edit information | + | - | + | + | + | + | + | + | + | + | + |
Get history – UndoRedo actions | + | - | + | + | + | + | + | + | + | + | + |
Get GFF and fasta | + | +, gene’s gff | +, gene’s gff | +, gene’s gff | +, gene’s gff | +, gene’s gff | +, gene’s gff | + | +, pseudogene’s or gene’s gff | + | + |
View and edit comments for UCA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |