FIELDS.md

Bystro Annotation Field Description

Italicized fields are custom Bystro fields. All others are sourced as described.


General output information:

Missing data in the annotation is marked by '!'

Multiple values for a single annotated position are separated by ';'

Multiple positions on a single annotation line (occurs with indels only) are separated by '|'

Annotated output data is ordered in the same way as the original file.

Reserved characters:

  • "!" ";" "|" "/"
  • "/" Will be used in a future release to denote overlapping data from a single track
    • For instance if 2 different dbSNP records overlap, which often occurs with indels, or when two refSeq transcripts overlap at the same position
    • Currently such sites are compressed to ";", but this loses information when a 1:1 relationship does not exist between a track's fields
      • For instance, dbSNP.alleles are in the form Major;Minor1;Minor2 and dbSNP.name may or may not be a single value, regardless of the number of minor alleles
      • When multiple dbSNP rows overlap, we store each field at that position in a 1D array, which loses the relationship between dbSNP.alleles and dbSNP.name
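The delimiter rules above can be sketched as a small helper. This is only an illustration: `parse_value` is a hypothetical name, not part of Bystro, and it assumes a single output cell has already been isolated as a string. It splits on "|" (positions) first, then on ";" (values), and maps "!" to `None`.

```python
from typing import List, Optional

MISSING = "!"          # missing data
VALUE_DELIM = ";"      # multiple values at one position
POSITION_DELIM = "|"   # multiple positions on one line (indels only)


def parse_value(cell: str) -> List[List[Optional[str]]]:
    """Split one output cell into positions, then values; '!' becomes None."""
    return [
        [None if value == MISSING else value
         for value in position.split(VALUE_DELIM)]
        for position in cell.split(POSITION_DELIM)
    ]
```

The result is always a list of positions, each holding a list of values, so a simple SNP field and an indel field spanning several positions can be handled by the same downstream code.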

Input fields

Sourced from the input file, or calculated based on input fields

chrom - chromosome

pos - genomic position

type - the type of variant

  • VCF format types: SNP, INS, DEL, MULTIALLELIC
  • SNP format types: SNP, INS, DEL, MULTIALLELIC, DENOVO_*

discordant - does the input file's reference allele differ from Bystro's genome assembly? (1 if yes, 0 otherwise)

trTv - is the site a transition (1), transversion (2), or neither (0)?
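As a rough illustration of the trTv encoding, the sketch below uses the standard definitions: a transition swaps two purines (A, G) or two pyrimidines (C, T), and a transversion swaps a purine for a pyrimidine. The function name `tr_tv` is hypothetical, and treating every non-single-base change (such as an indel) as "neither" is an assumption of this sketch.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def tr_tv(ref: str, alt: str) -> int:
    """Return 1 for a transition, 2 for a transversion, 0 otherwise."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    # Anything that is not a single-base substitution counts as "neither"
    if ref not in bases or alt not in bases or ref == alt:
        return 0
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return 1 if same_class else 2
```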

alt - the alternate/nonreference allele

  • VCF multiallelics are split, one line each

heterozygotes - all samples that are heterozygotes for the alternate allele

homozygotes - all samples that are homozygotes for the alternate allele

missingGenos - all samples that have at least one '.' (VCF) or 'N' (SNP) genotype call.

  • Note: No samples are dropped

Multiallelic variants are always decomposed into bi-allelic variants on separate lines, and given the type MULTIALLELIC

  • Heterozygotes/Homozygotes are called based on the number of copies of the allele on a given decomposed variant's line
    • For instance, if the variant is pos:1 alt:A,C ref:T and Sample1 is 1/2, then line 1 is pos:1 alt:A ref:T hets:Sample1 and line 2 is pos:1 alt:C ref:T hets:Sample1
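The decomposition can be sketched as follows, assuming VCF-style genotype strings in which allele index 1 is the first alt, 2 the second, and so on. The `decompose` function and its dictionary layout are hypothetical and are not Bystro's implementation. Reporting a sample with any "." call only under missingGenos is a choice made for this sketch, not a documented behavior.

```python
from typing import Dict, List


def decompose(alts: List[str], genotypes: Dict[str, str]) -> List[dict]:
    """Emit one record per alt allele, with het/hom/missing sample lists."""
    records = []
    for index, alt in enumerate(alts, start=1):  # VCF alt indices start at 1
        record = {"alt": alt, "heterozygotes": [],
                  "homozygotes": [], "missingGenos": []}
        for sample, genotype in genotypes.items():
            calls = genotype.replace("|", "/").split("/")
            if "." in calls:
                record["missingGenos"].append(sample)
                continue
            copies = calls.count(str(index))
            if copies == 0:
                continue  # sample does not carry this alt allele
            key = "homozygotes" if copies == len(calls) else "heterozygotes"
            record[key].append(sample)
        records.append(record)
    return records
```

Under these assumptions, a sample with genotype 1/2 at a site with alt A,C carries one copy of each alt allele, so it is listed as a heterozygote on both decomposed lines, while a 1/1 sample is a homozygote on the first line only.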

Reference Assembly

Sourced from UCSC

ref - the reference allele

  • e.g. Human (hg38, hg19), Mouse (mm10, mm9), Fly (dm6), C. elegans (ce11), etc.

refSeq (FAQ)

Sourced from UCSC refGene (schema) and kgXref (schema)

All overlapping RefSeq transcripts are annotated (no prioritization, all possible values are reported)

refSeq.siteType - the effect the alt allele has on this transcript.

  • Possible types: intronic, exonic, UTR3, UTR5, spliceAcceptor, spliceDonor, ncRNA, intergenic
  • This is the only field that will have a value when a site is intergenic

refSeq.exonicAlleleFunction - The coding effect of the variant

  • Possible values: synonymous, nonSynonymous, indel-nonFrameshift, indel-frameshift, stopGain, stopLoss, startLoss

refSeq.refCodon - the codon based on in silico transcription of the reference assembly

refSeq.altCodon - the in silico transcribed codon after modification by the alt allele

refSeq.refAminoAcid - the amino acid based on in silico translation of the transcript

refSeq.altAminoAcid - the in silico translated amino acid after modification by the alt allele

refSeq.codonPosition - the site's position within the codon (1, 2, 3)

refSeq.codonNumber - the codon number within the transcript
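To make the relationship between the codon fields concrete, here is a minimal sketch for a single-base substitution. It assumes the coding sequence is already oriented in the transcript's direction, that `offset` is a 0-based index into it, and that codon numbering is 1-based. `annotate_codon` is a hypothetical helper, not Bystro's code, and the returned amino acids are one-letter codes (with "*" for stop), which may differ from how Bystro encodes them.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def annotate_codon(coding_seq: str, offset: int, alt: str) -> dict:
    """Describe a single-base substitution at 0-based `offset`."""
    codon_number = offset // 3 + 1     # 1-based codon index (assumed)
    codon_position = offset % 3 + 1    # 1, 2, or 3
    start = (codon_number - 1) * 3
    ref_codon = coding_seq[start:start + 3]
    alt_codon = (ref_codon[:codon_position - 1] + alt
                 + ref_codon[codon_position:])
    return {
        "codonNumber": codon_number,
        "codonPosition": codon_position,
        "refCodon": ref_codon,
        "altCodon": alt_codon,
        "refAminoAcid": CODON_TABLE[ref_codon],
        "altAminoAcid": CODON_TABLE[alt_codon],
    }
```

For example, `annotate_codon("ATGGCCTAA", 4, "T")` places the change at position 2 of codon 2, turning GCC (A) into GTC (V).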

refSeq.strand - the strand the transcript is on, positive (+) or negative (-)

refSeq.kgID - UCSC's Known Genes ID

refSeq.mRNA - mRNA ID, the transcript ID starting with NM_

refSeq.spID - UniProt protein accession number

refSeq.spDisplayID - UniProt display ID

refSeq.protAcc - NCBI protein accession number

refSeq.description - long form description of the RefSeq transcript

refSeq.rfamAcc - Rfam accession number

refSeq.name - RefSeq transcript ID

refSeq.name2 - RefSeq gene name


refSeq.nearest

The nearest transcript(s), upstream or downstream, for every position in the genome

refSeq.nearest.name - the RefSeq transcript ID of the nearest transcript(s)

refSeq.nearest.name2 - the RefSeq gene name of the nearest transcript(s)


refSeq.clinvar

Alleles found in Clinvar that are larger than 32bp and overlap a refSeq transcript

We report these separately because large alleles are less likely to be relevant to small SNPs and indels

Clinvar variants are reported based on position and do not necessarily correspond to the input file's alleles at the same position

refSeq.clinvar.alleleID - unique Clinvar identifier

refSeq.clinvar.phenotypeList - associated phenotypes

refSeq.clinvar.clinicalSignificance - designation of significance (e.g. benign, pathogenic) from clinical reports

refSeq.clinvar.type - the variant type (e.g. single nucleotide variant)

refSeq.clinvar.origin - origin tissue for the clinical sample in which the variant was identified (not always provided)

refSeq.clinvar.numberSubmitters - total number of submissions of the Clinvar variant

refSeq.clinvar.reviewStatus - level of interpretation of the variant provided

  • Such as "reviewed by expert panel"

refSeq.clinvar.chromStart - chromosome start site for the clinvar record

refSeq.clinvar.chromEnd - chromosome end site for the clinvar record


Genome-wide variant scores

Predictions of conservation, evolution, and deleteriousness

phastCons - a conservation score that includes neighboring bases

phyloP - a conservation score that does not include neighboring bases

cadd - a score for the deleteriousness of a variant


dbSNP (FAQ)

The largest database of genetic variation

dbSNP variants up to 32 bases in length are reported

dbSNP variants are reported based on position and do not necessarily correspond to the input file's alleles at the same position

dbSNP.name - SNP name, usually "rs" followed by a number

dbSNP.strand - strand orientation (+/-)

dbSNP.observed - observed SNP alleles at this position (+/- for indels)

dbSNP.class - variant type; includes single, insertion, and deletion

dbSNP.func - site type for the SNP name

dbSNP.alleles - SNP alleles in the dbSNP database

dbSNP.alleleNs - chromosome sample counts

dbSNP.alleleFreqs - major and minor allele frequencies
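As a sketch of how the allele fields might be combined, the helper below pairs each entry of dbSNP.alleles with the entry at the same index in dbSNP.alleleFreqs. That the two fields are parallel, ";"-separated lists is an assumption here, and it can only hold when a single dbSNP record covers the position (see the note on overlapping records under Reserved characters). `allele_frequencies` is a hypothetical name, and it returns `None` whenever the pairing cannot be trusted.

```python
from typing import Dict, Optional


def allele_frequencies(alleles: str, freqs: str) -> Optional[Dict[str, float]]:
    """Pair dbSNP.alleles with dbSNP.alleleFreqs when their lengths match."""
    allele_list = alleles.split(";")
    freq_list = freqs.split(";")
    # '!' marks missing data, so no reliable pairing is possible
    if "!" in allele_list or "!" in freq_list:
        return None
    # Unequal lengths suggest overlapping records or partial data
    if len(allele_list) != len(freq_list):
        return None
    return dict(zip(allele_list, map(float, freq_list)))
```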


Clinvar (FAQ)

Clinically-reported human variants (hg38 and hg19 only)

Clinvar variants up to 32 bases in length are reported

Clinvar variants are reported based on position and do not necessarily correspond to the input file's alleles at the same position

clinvar.alleleID - unique clinvar identifier for a particular variant

clinvar.phenotypeList - list of associated phenotypes for variants at this position, including indels up to 32bp in size

clinvar.clinicalSignificance - designation of significance for a variant (e.g. benign, pathogenic) from a clinical report

clinvar.Type - type of variant (e.g. single nucleotide variant)

clinvar.Origin - origin tissue for clinical sample (not always provided)

clinvar.numberSubmitters - total number of submissions in clinvar overlapping this position, including indels up to 32bp in size

clinvar.reviewStatus - level of interpretation of the variant provided

clinvar.referenceAllele - reference allele for this position in clinvar

clinvar.alternateAllele - alternate allele(s) for this position seen in clinvar