Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Made significant updates to Section 3, INFO keys used for structural … #231

Closed
wants to merge 3 commits into from

Conversation

thefferon
Copy link

…variants

@thefferon
Copy link
Author

I am suggesting extensive changes to the list of reserved SVTYPEs. These changes are needed to help standardize the way structural variants are reported in VCF, and to facilitate their exchange between groups.

I also added two new INFO tags, TDLOC and CIVALUE.

@cyenyxe cyenyxe added the vcf label Aug 9, 2017
VCFv4.3.tex Outdated
@@ -5,6 +5,7 @@
\usepackage{lscape}
\usepackage[margin=0.75in]{geometry}
\usepackage[pdfborder={0 0 0}]{hyperref}
%\usepackage{showframe}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove commented (unused) packages.

VCFv4.3.tex Outdated

SVTYPE is required for all structural variants. The following reserved values should be used whenever possible; indented values are subtypes that should be used if evidence supports them (see examples in Section 5.3, Encoding Structural Variants):\vspace{0.3cm}\\
\setlength{\parindent}{0.3cm}
\indent INS: Insertion relative to the reference\\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following would be better formatted as nested itemize instead of multiple indentation levels (example).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

VCFv4.3.tex Outdated

\footnotesize
\begin{verbatim}
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIVALUE,Number=,Type=Float,Description="Degree of certainty expressed by CIPOS/CIEND: `1' indicates 100% certainty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Percentage characters need to be escaped or LaTeX will interpret them as the beginning of a comment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

VCFv4.3.tex Outdated
@@ -547,6 +594,8 @@ \section{INFO keys used for structural variants}

\footnotesize
\begin{verbatim}
##INFO=<ID=TDLOC,Number=1,Type=String,Description="Position and orientation of tandem duplication; must be one of
(Direct, Inverted-downstream, or Inverted-upstream)">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep a single field description in one line whenever possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not able to keep this to one line, nor to wrap it, with the current text. I will try working up an alternative that fits on one line.

\footnotesize
\begin{verbatim}
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
\end{verbatim}
\normalsize
For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate.
For precise variants, END is POS + length of REF allele - 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please state what would be the preferred action in the case of imprecise variants, otherwise it will be open to interpretation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to remove the IMPRECISE tag and all references to it from the spec because:

  • what exactly is meant by "imprecise" variants is not, and has never been, clearly defined in the spec (I would also like to see the phrase "best estimate" removed, since it introduces artificial and somewhat arbitrary data points)
  • the tag is redundant with CIPOS and CIEND; I have yet to see an example where adding the 'IMPRECISE' tag has added any informational value to a VCF record

@cyenyxe , what do you think is the best approach to take from here?

Copy link
Contributor

@d-cameron d-cameron Aug 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm (in GRIDSS) currently using the IMPRECISE/CIPOS fields to differentiate between variants in which the actual breakpoint position is not known due to lack of coverage, and a variant which the full breakpoint sequence is known but contains a micro-homology thus there is inherit ambiguity in the variant call.

HOMLEN/HOMSEQ don't tell you where the homology is relative to the called position. This information is critical when performing base-pair accurate analysis and the only way to recover it from HOMLEN/HOMSEQ is to recompute it from the reference genome. In the case where the sample differs from the reference, this may not be possible.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding @d-cameron 's comments on HOMSEQ/HOMLEN, if a left-align convention is enforced in the spec, this would at least remove some of the ambiguity. For example, imagine a deletion where the reference haplotype is LEFT_FLANK + HOM + UNIQUE_DEL + HOM + RIGHT_FLANK, and the alt haplotype is LEFT_FLANK HOM + RIGHT_FLANK, the deleted sequence is then any contiguous substring of the sequence HOM + UNIQUE_DEL + HOM with a size of len(HOM) + len(UNIQUE_DEL). Sticking to the left-aligning convention would put the POS at the end of LEFT_FLANK. Now depending on if other annotations exist (e.g. providing CIGARs), the exact haplotypes may be constructed to a certain degree close to the real alt haplotype. And yes, haplotype reconstruction/comparison is what ultimately matters IMHO.

The above comment would lead to a point that the IMPRECISE flag is useful: for the deletion example mentioned above, we do have the exact haplotype (if we have assembled across the deletion breakpoints or a read that is long enough to span it), therefor it is "precise", in the sense that we know its length. The ambiguity is how to represent it. But this is not the same as "we don't know the exact haplotype and therefore not even knowing where to put it exactly; all we know is an SV event occurred within this provided boundary, and the evidence doesn't allow us to pin down the position."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SHuang-Broad nicely clarifies the distinction between (A) not having enough sequence information in the vicinity of the breakpoint and (B) not being able to identify the precise breakpoint (despite having complete sequence information) due to a stretch of homology. (A) is an information problem (ie not enough data), while (B) is a methodology problem that cannot be solved no matter how many times you sequence across the breakpoint (because it is already information-complete). For this reason (B) can be addressed only by (1) acknowledging the intractability of identifying a single coordinate in all variants with sequence homologies, or (2) enforcing a convention such as the left-justification one @SHuang-Broad proposes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other words, any intelligent aligner, when presented with the following ... should recognize – and encode in its output, in some form – that there are four equally valid interpretations that explain the difference between the sequences.

Alternatively, any intelligent variant caller would calculate any micro-homology intervals around the split read/indel positions reported by the aligner. GRIDSS, Socrates and Pindel already do this. Given that callers can't even trust aligners to write SAM files that actually make sense, my preference is to keep this in the VCF specifications and not spill over into the SAM as it's not absolutely necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it your position, then, that the IMPRECISE flag should be applied in, and only in, the "latter" case - i.e. in which the variant lacks sequence across the breakpoint?

Correct.

The only edge cases I can think of are:

  • when an RP caller has sufficient coverage that there is only one breakpoint position possible. One could argue that these calls are precise. I (weakly) prefer these to have IMPRECISE and CIPOS=0,0 for consistency with your definition of precise calls requiring breakpoint sequence.

  • when the full untemplated inserted sequence is not known. e.g: a ~300bp insertion with soft clip/breakpoint assembly support from both sides, spanning RP, but a fragmented breakpoint assembly. The full sequence is not known, but the breakpoint positions are exact. Should this be flagged as IMPRECISE? I'm inclined to say no since the breakend positions are in fact precise. The current VCF specs can actually represent such a call at all:

REF: ACGT------------------------------------CCGG
ALT: ACGTTTCAAA?????(50-100bp)????????TTTCCAACCGG

I guess you could put Ns in there, but using standard VCF fields, there's no way to say that there's somewhere between 50 and 100 Ns. Come to think of it, this is just a special case of an insertion of unknown length, something that the standard VCF fields don't support, merely a single SVLEN value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit in both these examples is the introduction, by the caller/aligner, of a bias; they have already selected one solution from among several equally valid solutions; and this affects downstream interpretation of the data.

If the caller reports a POS and CIPOS, then it is reporting a set of solutions. If the downstream interpretation ignores most of these solutions then the issue is with the downstream interpretation, not what the caller reported. Just to reiterate: any tool that interprets VCF calls MUST be able to handle multiple different representations of the same variant. The only way that VCF readers can get away from this is if there is some sort of normalisation (similar to the left-alignment of SNV/indels). The only way I can see to normalise all SV events is to turn everything into left-aligned BND^, at which point, why have SVTYPE and any non-BND symbolic alleles at all?

^ This brings us back to the interpretation of SVTYPE. Is it a shorthand for the equivalent BND representation, or is it actually making a claim about the event type? Is a DEL caller claiming a CN reduction in the 'deleted' region? (for most callers, the answer is no). This is why I proposed the SVCLASS field. SVTYPE remains a restricted set of short-hand notations, and SVCLASS makes a claim about the mutation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As expressed in previous comments, when it comes to SV, what really matters, IMHO, are the haplotypes. Ideally, the tool that made the call set should provide the alt haplotype (ref is easy) explicitly when feasible (e.g. for complex SV) or provide a tool for doing this. So downstream analysis could use such information easily, for either querying the VCF to see if a variant is in close vicinity of a target region, or for making sense of what happened here.

Now regarding the concerns for the "bias" of the aligner-given position, I don't think it is practical to require that from a general purpose aligner and special purpose aligner may provide no guarantee in these (homologous) regions either, because if we think about it, it is an optimization problem that also depends on the parameters sent to the aligner. But this should not stop us from doing the work: a good caller and downstream analysis in the SV world should be aware of this phenomenon and, echoing what has been said before, allow ambiguity and focus on the haplotype.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a different opinion on the example insertion with unsure length that @d-cameron gave above.

It's getting clearer and clearer to me that when I refer to IMPRECISE variants, I am talking about "exact haplotype unknown", which is not the same as "exact place cannot be determined". In the "insertion with unknown length" example above, the exact alt sequence cannot be determined, hence the alignment could NOT be pinned down either (what if the missing sequence can be aligned to the left or right flanking sequences?), making the alignment given in the example a bit suspicious.

Still evaluating effect of these changes on pdf layout...
Copy link
Contributor

@d-cameron d-cameron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this change weakens the specifications. The VCF specifications should be a common file format in which multiple programs can unambiguous interpret and parse variants. As it stands these changes would make it no longer possible to unambiguously convert a VCF SV record to another format such as BEDPE.


\footnotesize
\begin{verbatim}
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
\end{verbatim}
\normalsize
This key can be derived from the REF/ALT fields but is useful for filtering. The reserved values must be used for the types listed below:
SVTYPE is required for all structural variants. The following reserved values should be used whenever possible; indented values are subtypes that should be used if evidence supports them (see examples in Section 5.3, Encoding Structural Variants):\vspace{0.3cm}\\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"should be used whenever possible" is a massive weakening from "must be used". This weakening makes makes this field useless for downstream analysis programs because there's no requirement that the values actually be one of these. Any value would be spec-compliant.

Practical applications of the restrictive usage of this field include my StructuralVariantAnnotation R package in which I take most popular callers, and (in theory) any VCFv4.3 compliant SV callers and convert them to a standard breakend notation so variant calls written using different notations can actually be compared. With this change, program B will no longer be able to complain that the problem is in the non-compliant output of program A. If two programs don't work with each other then the specs need to be able to be used to say which of the programs is incorrect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Point taken. I will re-write this to clarify. However I still see some ambiguity in the original text:

This key can be derived from the REF/ALT fields but is useful for filtering. The reserved values must be used for the types listed below:

To take these two sentences in reverse order:

The reserved values must be used for the types listed below:

One reading of this leaves the door open for SVTYPE to be assigned any value, so long as the author feels their variant is not one of "the types listed below." If the intention is instead, as I believe, to establish a closed list of reserved types, such that all VCF SV records require the SVTYPE field, and only the listed values are allowed (and no colons w/ subtypes, etc.) then this sentence needs to change.

This key can be derived from the REF/ALT fields but is useful for filtering.

This seems to imply the SVTYPE field is redundant with, and secondary to, REF and ALT – and may therefore be optional. Again, I propose this be rewritten or removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to imply the SVTYPE field is redundant with, and secondary to, REF and ALT – and may therefore be optional

You raise a very good point. I've been using SVTYPE, but s1.4.5 does indeed define a restrictive set of first-level SV ALT alleles. Although the spec does allow other ALT alleles, it is clear that only ALT alleles in the list are to be considered SVs^.

^ Section 1.4.5 needs the word "imprecise" removed for this to actually be true.

Copy link

@SHuang-Broad SHuang-Broad Sep 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be convenient for a human to have access for such information (SVTYPE) if the caller decides to output BND records for all variants.
Convenience is a virtue.

\item [\null] INS: Insertion relative to the reference
\begin{itemize}
\item [\null] MEI: Mobile element insertion relative to the reference
\begin{itemize}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this list to be maintained? Do we have any ME ontologies we can reference so our list is more exhaustive?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this PR I included mobile element types from data published by 1000 Genomes: Alu, LINE1, SVA, and HERV. I added 'RET' (for 'retrotransposon') because I am aware of a specific need to accommodate it.

For now I think it is sufficient to use the types in this list; additional types can be added, and a formal ontology consulted, if and when the need to do so becomes more obvious.

\end{itemize}
\item [\null] DEL: Deletion relative to the reference
\begin{itemize}
\item [\null] -MEI: Absence of a mobile element present in the reference
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer DEL:MEI over DEL:-MEI

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of colons to construct nested strings of types and subtypes is only described in the spec for the ALT field (symbolic ALT alleles). I have never seen this notation used in the SVTYPE field. Have you?

My intention with the indented list of subtypes in this PR is that only one value for SVTYPE should be used; the granularity of the choice of term depends on the evidence present in the data. For example, if an insertion is observed but nothing definitive is known (yet) about its content, SVTYPE=INS. However, if the same insertion is known to be a mobile element of some kind (family not specified), then SVTYPE=MEI. If it is known that the insertion is a LINE1 element, then SVTYPE=LINE1 (perhaps L1 would be better).

Thus the nesting would be implied. I proposed this schema mainly because it is consistent with the seemingly widely adopted practice (which ignores guidance provided by the current spec), of indicating mobile element insertions by using SVTYPE values such as ALU, LINE1, and SVA.

In keeping with this "one-word SVTYPE" controlled vocabulary, I proposed "-ALU", for example, to represent a so-called Alu-element deletion (more accurately described as the absence of an Alu insertion which is present in the reference genome). For an up-to-date treatment of mobile element variation, please see https://www.ncbi.nlm.nih.gov/pubmed/28855259.

I wanted to make the above points regardless of the mechanism that is eventually chosen to identify event types (like your proposed SVCLASS).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of colons to construct nested strings of types and subtypes is only described in the spec for the ALT field (symbolic ALT alleles). I have never seen this notation used in the SVTYPE field. Have you?

Both Pindel and manta use <DUP:TANDEM> alt alleles, with pindel using SVTYPE=DUP:TANDEM notation (manta just calls SVTYPE=DUP).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d-cameron is that the only example you are aware of, or was it intended as one example that is part of an argument that colons should be used in SVTYPE?

ALT=<DUP:TANDEM> adheres to the existing spec's guidelines for the use of symbolic alternate alleles. Pindel's use of SVTYPE=DUP:TANDEM does not, and it seems like an effort to address exactly the sort of deficiency in the spec that I aim to improve with this PR. Even though SVTYPE=DUP is "legal" and SVTYPE=DUP:TANDEM is not, IMO the latter is better in that it imparts crucial information about the variant which the former does not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thefferon the example was just a comment on current usage. I'm of the opinion that colons should be used if the values are an open set, but are not necessary if the set is closed (since the hierarchy can be defined in the spec)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping the parent (hierarchical) information in the ALT helps grep-ing

\end{itemize}

TRA (Translocation): A VCF translocation record describes the juxtaposition of two DNA segments from disparate genomic locations (from the same or from different chromosomes). It must include two genomic coordinates, each accompanied by an orientation (see STRAND below). A VCF translocation record only describes the immediate molecular context of the novel adjacency generated by the translocation event. It should not be confused with the traditional cytogenetic notion of translocation, which is understood to imply that the remainder of the chromosome arms beyond the specified breakpoints are also involved.\vspace{0.3cm}\\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already covered by the breakend notation. I strongly disagree with introducing yet another notation for arbitrary genomic rearrangements. BND should be used.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to push back on this. While BND is, I believe, capable of explicitly describing almost any structural event – including variants normally described using the SVTYPEs of DEL, INS, INV, etc. – in practice it is unwieldy and complicated to implement. If BND can describe all events, why have any other SVTYPEs at all?
In my view, TRA serves the same purpose as all the other non-BND SVTYPEs: it is a brief-but-complete way to express commonly encountered variation events (most/all of which could also be described much less intuitively using only BND). Where BND shines is in its ability to capture complex events that cannot be described using other SVTYPEs; and perhaps, by virtue of this universal applicability, BND may be simpler for parsers to accommodate than a host of SVTYPEs, each with its own rules/idiosyncracies.
TRA has the following advantages over BND for describing any novel adjacency:

  • it clearly identifies the positions and orientations of both joined breakends in an intuitive way
  • is supported by at least one major SV caller (DELLY)
  • it is in fairly wide usage for describing translocations in VCF (whereas BND is probably less so, at least in the VCF I have encountered)

@d-cameron , do you agree that TRA serves the same sort of purpose as all the other, non-BND SVTYPEs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is in fairly wide usage for describing translocations in VCF (whereas BND is probably less so, at least in the VCF I have encountered)
is supported by at least one major SV caller (DELLY)

DELLY changed from TRA to BND notation back in March this year so as to be VCF compliant (see dellytools/delly#59 ). Do you know which other callers write TRA records? TIGRA uses CTX and CHR2 with a similar meaning to DELLY's TRA. As far as I can tell, LUMPY, GRIDSS, manta, and CREST are all compliant. Pindel deviates from the current specs with its use of RPL (replacement) for del/ins events.

in practice it is unwieldy and complicated to implement. If BND can describe all events, why have any other SVTYPEs at all?

I ask myself the same question. I believe it is because VCF was designed for SNVs and evolved over time to support single-loci events then finally arbitrarily complex structural variants through the use of BND notation. VCF is ill-suited to representing SVs but it's still a whole lot better than every tool writing their own custom tsv/csv format.

If you're handling SVs in general, you can't get away from having some events in BND notation. I agree it is unwieldy and complicated, but if the analysis/software already has to deal with BND events, adding additional event types on top of this just makes processing SVs even more messy and complicated.

@d-cameron , do you agree that TRA serves the same sort of purpose as all the other, non-BND SVTYPEs?

Adding it to the specifications would serve the same purpose as all the other SVTYPEs (except CNV, although even those can now be represented, albeit extremely unintuitively, using breakend notation): as a short-hand to an equivalent set of BND records. The key difference between the existing types and TRA is that all the existing types are all single-loci events.

As far as I can tell, the proposed TRA notation is just an alternate representation of information contained in a BND record. Why not just use a BND record? How is the specifications enhanced by requiring all implementing program to support both BND and TRA notations?

Note that BND records were intentionally defined to be paired and have a record at both locations so the events involving any given loci could be found by querying that loci. Yes this is messy, but programs already have to deal with this for BND events and it would be inconsistent if the proposed TRA records were not also paired.


\footnotesize
\begin{verbatim}
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
\end{verbatim}
\normalsize
This key can be derived from the REF/ALT fields but is useful for filtering. The reserved values must be used for the types listed below:
SVTYPE is required for all structural variants. The following reserved values should be used whenever possible; indented values are subtypes that should be used if evidence supports them (see examples in Section 5.3, Encoding Structural Variants):\vspace{0.3cm}\\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BND is missing from this list. This appears to be an oversight in the current specs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - I will add BND in a separate edit.

\item [\null] DISDUP: Dispersed duplication
\end{itemize}
\item [\null] DELINS: Deletion-insertion (replacement of reference sequence with sequence of different content or length)
\item [\null] OTHER: Complex variant not covered elsewhere
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't these be written as multiple VCF records with the same EVENT?

My preference is to just include simple variants in this field and use BND notation with EVENT to describe complex events.

OTHER is a catch-all that is going to break any program trying to be spec-compliant and I don't think belongs in these specifications.

Copy link
Contributor

@d-cameron d-cameron Aug 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fundamentally, all SVs should be able to be decomposed into an equivalent set of BND VCF records. If SVTYPE is to be expanded, I strongly recommend that every type (and subtype that has a different decomposition than the parent) has an example in the VCF specifications that has the event in both short-hand format, and the equivalent BND notation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a symbolic alt allele for complex event has its value, because being easily interpretable is a virtue, as long as we have a healthy suspicion on the provided interpretation, which is what the subtypes proposed here are trying to do.

On the other hand, this proposal of OTHER is too ambiguous. It has almost no power of restraining what can be put here. It should at least be mentioned that for a record to be classified as "complex", at least two types of events each would be associated with a distinct 1st level alt alle must be involved.

And a third point I would like to make: adding these subtypes is "unfair" to the other imaginable dozens (if not hundreds) subtypes.

Now regarding @d-cameron 's comments about equivalent BND records connected with their EVENT annotation, I agree there should be a tool providing such functionality. But, IMHO, for complex variant, what we should do is to signal that a series of simple event is connected together is a non-trivial way so a region of the reference is rearranged. A caller should output (best estimate of) this region's boundary in a as compact as possible fashion, and preferably output the reference and alternative haplotypes. This would fulfill the following needs:

  • the affected genomic locations are specified, this is beneficial as compared to a series of BND records which must be parsed all together to clearly define the boundaries that cluster these events together;
  • event could be easily interpreted (though one should not blindly trust the interpretation);
  • the concrete sequence is available for downstream analysis (genotyping, re-interpretation) and these downstream analysis doesn't need to painstakingly reconstruct the haplotype.

\item [\null] INV: Inversion relative to the reference
\item [\null] DUP: Duplication
\begin{itemize}
\item [\null] TANDUP: Tandem duplication (one of Direct, Inverted-upstream, or Inverted-downstream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this proposing Direct, Inverted-upstream, or Inverted-downstream are subsubtype?

\item [\null] STR: Short tandem repeat variation
\item [\null] INV: Inversion relative to the reference
\item [\null] DUP: Duplication
\begin{itemize}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These subtypes change the meaning of the event considerably and I would not consider them subtypes. VCF is intended to be machine-readable. Converting a DUP record to BEDPE format should not require completely different special casing code for each type of the duplication.

This change is an implicit change of the definition of DUP from a single breakpoint tandem duplication event to something more general. Unfortunately, this more general definition, although more closely matching the biological meaning, is under-defined and I don't think is an appropriate change for the file format. That said, the definition of DUP should indeed be clarified. My code assumes a single breakpoint non-inverted tandem duplication but other programs might not be using this definition.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering how you are proposing to change the 1st level ALT allele INS (used to be "Insertion of novel sequence relative to the reference", now proposed to be "Insertion relative to the reference"), do you intend DUP be part of INS under the new interpretation?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, I sensed we are moving into complex SV land: duplications IMO means a string exist in the reference is repeated—not necessarily exactly—and placed in close vicinity or dispersed. Even for the case of close vicinity, one could imagine micro insertions (novel, mobile, inverted) in between the repeated units on the alt allele.

So I do agree that DUP should be more clearly defined.

\indent NS = number of samples with data\\
\indent REF = the reference copy number\\
\indent RL = total length of the STR tract in the reference\\
\indent RU = repeat motif as it appears on the forward strand in the reference\vspace{0.3cm}\\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this differ from HOMSEQ?

@@ -547,6 +606,7 @@ \section{INFO keys used for structural variants}

\footnotesize
\begin{verbatim}
##INFO=<ID=TDLOC,Number=1,Type=String,Description="Tandem duplication type (must be: direct, inverted-upstream, or inverted-downstream)">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I don't think this belongs in the specs. BND already covers this.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ID doesn't seem to convey the meaning: LOC would, IMO, signal location, but this to me means a subtype involving not only location but also strand.

Copy link

@SHuang-Broad SHuang-Broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some concerns voiced.

@@ -503,41 +503,100 @@ \subsection{VCF tag naming conventions}

\section{INFO keys used for structural variants}
\label{sv-info-keys}
The following INFO keys are reserved for encoding structural variants. In general, when these keys are used by imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).
The following INFO keys are reserved for encoding structural variants. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewording suggestions:

For keys reflecting properties of alt alleles (e.g. SVLEN), the number of values for such keys must equal to, or a multiple of, the count of distinct alt alleles (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).

\footnotesize
\begin{verbatim}
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
\end{verbatim}
\normalsize
For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate.
For precise variants, END is POS + length of REF allele - 1.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding @d-cameron 's comments on HOMSEQ/HOMLEN, if a left-align convention is enforced in the spec, this would at least remove some of the ambiguity. For example, imagine a deletion where the reference haplotype is LEFT_FLANK + HOM + UNIQUE_DEL + HOM + RIGHT_FLANK, and the alt haplotype is LEFT_FLANK HOM + RIGHT_FLANK, the deleted sequence is then any contiguous substring of the sequence HOM + UNIQUE_DEL + HOM with a size of len(HOM) + len(UNIQUE_DEL). Sticking to the left-aligning convention would put the POS at the end of LEFT_FLANK. Now depending on if other annotations exist (e.g. providing CIGARs), the exact haplotypes may be constructed to a certain degree close to the real alt haplotype. And yes, haplotype reconstruction/comparison is what ultimately matters IMHO.

The above comment would lead to a point that the IMPRECISE flag is useful: for the deletion example mentioned above, we do have the exact haplotype (if we have assembled across the deletion breakpoints or a read that is long enough to span it), therefor it is "precise", in the sense that we know its length. The ambiguity is how to represent it. But this is not the same as "we don't know the exact haplotype and therefore not even knowing where to put it exactly; all we know is an SV event occurred within this provided boundary, and the evidence doesn't allow us to pin down the position."


\footnotesize
\begin{verbatim}
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
\end{verbatim}
\normalsize
This key can be derived from the REF/ALT fields but is useful for filtering. The reserved values must be used for the types listed below:
SVTYPE is required for all structural variants. The following reserved values should be used whenever possible; indented values are subtypes that should be used if evidence supports them (see examples in Section 5.3, Encoding Structural Variants):\vspace{0.3cm}\\

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding see examples in Section 5.3, Encoding Structural Variants , I prefer not referring to other sections in a "hard-coded" way. Please use cross-referencing available in TeX.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.


\footnotesize
\begin{verbatim}
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIVALUE,Number=,Type=Float,Description="Degree of certainty expressed by CIPOS/CIEND: `1' indicates 100\% certainty

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand why a default value should be in a specification/standard.

@@ -547,6 +606,7 @@ \section{INFO keys used for structural variants}

\footnotesize
\begin{verbatim}
##INFO=<ID=TDLOC,Number=1,Type=String,Description="Tandem duplication type (must be: direct, inverted-upstream, or inverted-downstream)">

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ID doesn't seem to convey the meaning: LOC would, IMO, signal location, but this to me means a subtype involving not only location but also strand.

\end{itemize}
\item [\null] -RET: Absence of a retrotransposon present in the reference
\end{itemize}
\item [\null] TRA: Translocation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Translocations are hard to reliably predict, so should we refrain from adding it to the spec, considering that BND type already exist to cover such cases?

\item [\null] STR: Short tandem repeat variation
\item [\null] INV: Inversion relative to the reference
\item [\null] DUP: Duplication
\begin{itemize}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering how you are proposing to change the 1st level ALT allele INS (used to be "Insertion of novel sequence relative to the reference", now proposed to be "Insertion relative to the reference"), do you intend DUP be part of INS under the new interpretation?

\item [\null] STR: Short tandem repeat variation
\item [\null] INV: Inversion relative to the reference
\item [\null] DUP: Duplication
\begin{itemize}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, I sensed we are moving into complex SV land: duplications IMO means a string exist in the reference is repeated—not necessarily exactly—and placed in close vicinity or dispersed. Even for the case of close vicinity, one could imagine micro insertions (novel, mobile, inverted) in between the repeated units on the alt allele.

So I do agree that DUP should be more clearly defined.

\indent REF = the reference copy number\\
\indent RL = total length of the STR tract in the reference\\
\indent RU = repeat motif as it appears on the forward strand in the reference\vspace{0.3cm}\\
\indent More information on representing STRs in VCF can be found at http://lobstr.teamerlich.org/file-formats.html\vspace{0.3cm}\\

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think a specification/standard should refer to other documentation for information that itself should clarify.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SHuang-Broad
Agreed – the reference was intended more as a citation / acknowledgment than as a resource for more information – though I agree the spec needs clarification, if the term is kept.
That leaves us with two options:

  1. Provide enough detail in the spec, including an example, so the reference to lobSTR is not needed
  2. Remove STR as a SVTYPE (or SVSUBTYPE / SVSUB) since STRs can be handled by VCF as non-SV records.
    Right now I am leaning toward removing STR from the PR.
    Thoughts?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the 2nd, so to keep the spec as short as possible.

\item [\null] DISDUP: Dispersed duplication
\end{itemize}
\item [\null] DELINS: Deletion-insertion (replacement of reference sequence with sequence of different content or length)
\item [\null] OTHER: Complex variant not covered elsewhere

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that a symbolic alt allele for complex event has its value, because being easily interpretable is a virtue, as long as we have a healthy suspicion on the provided interpretation, which is what the subtypes proposed here are trying to do.

On the other hand, this proposal of OTHER is too ambiguous. It has almost no power of restraining what can be put here. It should at least be mentioned that for a record to be classified as "complex", at least two types of events each would be associated with a distinct 1st level alt alle must be involved.

And a third point I would like to make: adding these subtypes is "unfair" to the other imaginable dozens (if not hundreds) subtypes.

Now regarding @d-cameron 's comments about equivalent BND records connected with their EVENT annotation, I agree there should be a tool providing such functionality. But, IMHO, for complex variant, what we should do is to signal that a series of simple event is connected together is a non-trivial way so a region of the reference is rearranged. A caller should output (best estimate of) this region's boundary in a as compact as possible fashion, and preferably output the reference and alternative haplotypes. This would fulfill the following needs:

  • the affected genomic locations are specified, this is beneficial as compared to a series of BND records which must be parsed all together to clearly define the boundaries that cluster these events together;
  • event could be easily interpreted (though one should not blindly trust the interpretation);
  • the concrete sequence is available for downstream analysis (genotyping, re-interpretation) and these downstream analysis doesn't need to painstakingly reconstruct the haplotype.

@d-cameron
Copy link
Contributor

d-cameron commented Aug 14, 2017

@SHuang-Broad left-alignment is not possible for BND variants as, depending on the breakend orientations, left-aligning one breakend results in the other breakend being right-aligned.

@d-cameron
Copy link
Contributor

This PR really exposes the dual interpretations of SVTYPE. SVTYPE can be considered a convenience field that allows simple structural variants to be represent using a single VCF record instead of having to use the more unwieldy BND notation. Alternatively, SVTYPE can be considered to be a prediction of a specific type of mutational event. Whilst these interpretations have quite a bit of overlap, there a subtle differences.

As an example, take the variants called by Pindel and DELLY:

SVTYPE=DEL is called for all events in which the orientation of the breakends is consistent with a deletion event. Deletions are called irrespective of whether there is any change in copy number in the 'deleted' region and this approach implies the first interpretation.

SVTYPE=INV is called for all events for which a single breakpoint in an inversion-like orientation. Pindel and DELLY call inversions even when only one of the two required breakpoints actually exist in the input. The predictive nature of these calls imply the second interpretation.

In my work with VCF SVs, I have interpreted SVTYPE as a shorthand format for the equivalent BND events. As a file format intended to be unambiguous, it must be the former interpretation (even if it also the latter). The restricted list of top-level event types allowed under VCFv4.3 are consistent with the first interpretation. This PR changes the meaning of the SVTYPE field to the latter.

To introduce the type of changes requested in the PR, I am in favour of SVTYPE retaining it's current strict definition (we can even remove the annoying subtypes) and introducing a new field(s) containing the level of detail requested in this PR. This has the added bonus that the shorthand notation becomes much less important and making BND calls much less painful when performing downstream analysis. I for one would find it extremely useful to be able to classify the more complex events. Something along the lines of SVTYPE=BND;SVCLASS=DUP:DISDUP;EVENT=mycduplication, or SVTYPE=DEL;SVCLASS=DUP:ProcessedPseudoGene;EVENT=processeddup1^, would be really useful to have when performing downstream analysis.

^processed pseudo genes look like duplication with lots of deletions in the duplicated region, being able to call the deletions as DEL, but also being able to indicate that they are part of a more complex event is very important - just look at the number of supposedly germline intron deletions in the variant databases that are actually processed pseudogene insertions.

@SHuang-Broad
Copy link

@d-cameron
"
@SHuang-Broad left-alignment is not possible for BND variants as, depending on the breakend orientations, left-aligning one breakend results in the other breakend being right-aligned.
"
In that case we could always left align the upstream/left record, right?

@d-cameron
Copy link
Contributor

@SHuang-Broad We could indeed left-align the left record for events with a known micro-homology. I'm not in favour of left-aligning imprecise variants as it prevents callers from calling the most likely position within the region of uncertainty.

@SHuang-Broad
Copy link

@d-cameron , oh yes, for imprecise variants the boundary should be the best estimate, unlike the microhomology case in precise variants where no evidence exists favoring any particular point location.

@d-cameron
Copy link
Contributor

@thefferon, does my SVCLASS proposal work for your purposes? Given our current lack of detailed knowledge of mechanisms of SV events (e.g. http://www.biorxiv.org/content/early/2017/08/27/181339), having a fixed SVTYPE list to ensure unambiguous parsing, and an open-ended SVCLASS field prevents us having to push out a new spec version every time a new mutational process is identified, but also enables sophisticated downstream analysis with minimal VCF parsing.

@thefferon
Copy link
Author

thefferon commented Sep 13, 2017

@d-cameron and @SHuang-Broad,

@d-cameron, you described two possible "interpretations" of SVTYPE:

1 - as a shorthand for the equivalent BND representation
2 - as actually making a claim about the event type

My interpretation has always been of the second type, and maybe it will help you to know that I created this PR from that standpoint. I am less familiar with the computational dependencies that constrain the existing SVTYPE controlled vocabulary; and so I am very grateful to have your input.

Please go into a bit more detail about what you mean by the first interpretation.

@d-cameron, I think some implementation of your 'SVCLASS' idea could work. I think it will be best if we agree on a mechanism for encoding variant types (e.g., SVCLASS, other, or no change) before engaging in further debate of individual SVTYPE values (e.g., TRA, DUP, subtypes).

Here is a bit of background on why I generated this pull request:

As curator at dbVar, I must interpret each new strucutral variation VCF submission and fit it to a common data model; there are as many 'flavors' of VCF as there are submitters, and many VCF authors do not adhere to the guidelines in the existing spec. There may be many reasons for this, including that VCF is suboptimal for describing SV to begin with, and that the spec is decidedly vague on many details that are critical to the success of VCF as a means of sharing SV data. My intention with this PR was to address those areas of the spec I felt could easily be improved – for example, the use of more definitions (and improvements to existing ones), and solid examples that clearly demonstrate the preferred way for each variant type to be encoded. I hope this would result in more consistency of usage, and a dramatic improvement in the success and ease with which VCF generated by different sources can be shared and integrated.

I have been granted a limited amount of time to resolve the VCF issues that will lead to the most significant and helpful improvements. Therefore I propose a deadline of 30 September 2017 for us to come to agreement on the most important issues we can. I hope I have given you a better understanding of where I am coming from in creating this PR.

Going forward, please consider:

  • helping to decide whether to change the way different types of variants are encoded in VCF – e.g., using SVTYPE, ALT, and perhaps SVCLASS
  • if we agree to make these changes, determine which mechanism works best and will not break others' code
  • once a mechanism is agreed upon, write up examples of different kinds of variants to include in the spec (e.g., @d-cameron, perhaps you could largely handle the BND representations for each example)

Finally, if you have encountered problems working with SV-VCF, or always wished there were a solution to some particular problem(s) with it, please describe them here (or in a parallel PR) so we can address them now and improve the spec on the whole.

Thanks,
Tim

@thefferon
Copy link
Author

@d-cameron and @SHuang-Broad ,

Finally, if you have encountered problems working with SV-VCF, or always wished there were a solution to some particular problem(s) with it, please describe them here (or in a parallel PR) so we can address them now and improve the spec on the whole.

Another way to put this is: Do you agree with my assertion that there is a need for the spec to be improved, with respect to describing the various SV types? Or do you think it is fine the way it is? What do you think are the most important changes it would need in order to become more useful – that would help provide a standard for the way in which different groups / callers ought to encode the variety of types observed?

@thefferon
Copy link
Author

@d-cameron and @SHuang-Broad,

Please comment on the following proposal for handling SV subtypes in VCF. If you agree I will write it up and add it to this PR.

@d-cameron, you argued for keeping the current SVTYPE controlled vocabulary, and also I believe for enforcing it (by which I mean, making it clear in the spec that these are the only allowed values for SVTYPE, anything else being invalid). Those values are:

DEL
INS
DUP
INV
CNV
BND

Furthermore, you argued that SVTYPE should (must?) be interpreted not as an assertion about the type of biological event, but rather as shorthand for a BND representation, and that this is required for compatibility with downstream analysis tools. (If at any point I misrepresent anyone's position, please correct me.)

FIRST:

Is it your position that SVTYPE should be interpreted this way, and so retain the current restricted set of types, to avoid breaking others' code?

If you answer is yes, then I accept, but argue we still need a mechanism for VCF authors to specify SV subtypes (like "Alu insertion", "Processed pseudogene insertion", "Absence of an L1 insertion that is present in the reference", etc.). You proposed "SVCLASS" for this; I think a more appropriate term would be "SVSUBTYPE" (or "SVSUB"), a name which implies its value is restricted by the value of SVTYPE (dependencies would be spelled out in the spec - e.g., if SVTYPE="INS", then SVSUB, if provided, must be one of MEI, ALU, L1, HERV, SVA, RET, or PSG - all of which represent types of insertion that prior VCF authors have represented with custom SVTYPE values). There would be similar restrictions for other SVTYPES.

If you think the above makes sense, I will draw up a specific section for your review.

SECOND:

I have argued, and still maintain, that BND is not really an SVTYPE – rather it is a means by which any SV can be described (a point with which I think you agree, based on your SVTYPE interpretation #1, that SVTYPEs are shorthand for BND representations). I therefore propose that BND is NOT a SVTYPE, and that we choose some other way of indicating when a record is a BND representation (or maybe one is not needed, because it is evident from the format of the corresponding ALT value).

THIRD:

I believe there are other problems with the SVTYPE controlled vocabulary (CV), some of which might be addressed by improving the definitions of existing terms, others that may require new terms. To be more specific:

  1. DUP is a very broad term, open to many interpretations; a variety of assumptions are made about its meaning; a stricter definition is warranted.
  2. CNV is similarly open to several, equally plausible, interpretations. If it is to be used consistently and is to have any meaning, those interpretations have to be differentiated in the supporting text.
  3. There is no term in the current CV to represent a deletion-insertion (i.e. the replacement of one sequence with another). @d-cameron you suggested RPL (for replacement). I like this; does RPL fit into the framework of the SVTYPE CV, in your opinions?
  4. If you agree, per above, that BND is not a SVTYPE, then we need a SVTYPE value for variants in which two genomic loci, which are not normally seen together, are joined (some would call this a 'translocation'). I assert that 'translocation' has connotations that presuppose knowledge of what sequence is present beyond the observed adjacency, and that a better term would be something like ADJ (for adjacency). If this doesn't sit well, I would argue for the already-discussed TRA as be an acceptable alternative.

Thoughts?

@d-cameron
Copy link
Contributor

d-cameron commented Sep 18, 2017

@thefferon I feel like we're making some progress here :)

Is it your position that SVTYPE should be interpreted this way, and so retain the current restricted set of types, to avoid breaking others' code?

It is.

Do you agree with my assertion that there is a need for the spec to be improved, with respect to describing the various SV types? Or do you think it is fine the way it is?

The current spec is sufficient as a technical format for defining breakpoints and CNVs.

we still need a mechanism for VCF authors to specify SV subtypes (like "Alu insertion", "Processed pseudogene insertion", "Absence of an L1 insertion that is present in the reference", etc.).

Having had some time to think about what you're wanting to do, I feel that we want to be even more general, and I propose the EVENTTYPE field: This can be an open field, with a controlled vocab for known event types (i.e. a superset of your proposed SVTYPE)

Some good examples are:

Translocations:

chr1	1000	.	T	<DEL>	.	.	SVTYPE=DEL;END=2000;SVLEN=-1000;EVENT=sv1;EVENTTYPE=TRA
chr2	5000	.	T	<INS>	.	.	SVTYPE=INS;SVLEN=1000;EVENT=sv1;EVENTTYPE=TRA

It's nice a human readable and pretty clear what is going on biologically (although this example loses the translocation orientation information, it is available if the tool uses BND format to write the event).

Processed pseudo-gene (e.g. there's plenty of exon deletions in dbVar that are actually germline processed pseudo-gene insertions that are not in the reference genome; I don't trust any gene that has adjacent-exon deletions in dbVar to actually have these deletions in the actual gene)

chr1	1000	.	T	<DEL>	.	.	SVTYPE=DEL;END=2000;SVLEN=-1000;EVENT=sv2;EVENTTYPE=PROCESSED_PSEUDO_GENE_INSERTION
chr1	2500	.	T	<DEL>	.	.	SVTYPE=DEL;END=3000;SVLEN=-500;EVENT=sv2;EVENTTYPE=PROCESSED_PSEUDO_GENE_INSERTION
chr1	4000	.	T	<DEL>	.	.	SVTYPE=DEL;END=4500;SVLEN=-500;EVENT=sv2;EVENTTYPE=PROCESSED_PSEUDO_GENE_INSERTION

The thing I really like about EVENTTYPE is that is works with tandem repeats:

chr2	5000	.	T	<INS>	.	.	SVTYPE=INS;SVLEN=12;EVENTTYPE=STR;EVENT=repeat_expansion_in_sv_format
chr2	5000	.	T	TACACACACACAC	.	.	EVENTTYPE=STR;EVENT=repeat_expansion_in_indel_format

If you care about the interpretation of the variants, then being able to handle STRs in both formats is a really useful.

Proposal:

  • Deprecate SVTYPE as it is redundant with the ALT allele. Definition remains unchanged.
  • Deprecate nested types from SV ALT (e.g. <DUP:TANDEM> becomes <DUP>).
  • Explicitly state that SVTYPE records represent BND/CNV shorthand notation and variants should be biologically interpreted according to their EVENTTYPE
  • Add EVENTTYPE field
    • can/should include definitions for complex events such as kataegis, chromothripis, chromoplexy, telomere shortening/lengthening, fold-back inversions, and so on
  • (Optional) include a mechanism/recommendation/workflow that tool developers can use if they want to get a new term into the offical EVENTTYPE vocab.

@d-cameron
Copy link
Contributor

Finally, if you have encountered problems working with SV-VCF, or always wished there were a solution to some particular problem(s) with it, please describe them here (or in a parallel PR) so we can address them now and improve the spec on the whole.

  • Most of VCF ambiguities #89 has been addressed in some form; there's still some outstanding but I'm not sure if they're outside scope of these updates (I presume this is VCFv4.4, yes?)

  • The fact that we can't represent tandem repeat microhomologies correctly is problematic (i.e. there is no CISTRIDE or equivalent).

  • Per allele field encoding #133 per-allele INFO fields

  • Per allele field encoding #133 list-of-list field notation, although I'm not sure how widespread usage/support that would get

  • SVTYPE description for nested values and multi-allelic variants #142 can't represent more than on SV ALT allele (problematic for promiscuous breakpoints)

  • A explanatory note on SVLEN explicitly stating that inversions are of SVLEN=0 may help tool developers to get this correct (SVLEN description wrong for INV variants #139)

  • No CIPOS for remote breakend when using BND notation. This means tools need to perform random access to get the required information from the other side. I (GRIDSS) use CIRPOS.

  • VCF can't encode CIPOS constraints: e.g. I know the event is 100-120bp long, but there is a 50bp window it can occur in.

  • There is no formal grammar for the specifications and parsing is ambiguous (e.g. hg38 HLA contigs in BND notation) . I'm happy to contribute one (I was in the process of making one before getting stuck by the VCF ambiguities #89 issues).

  • Similar to my SAM strict proposal, it would be good to either a) constrain VCF to require semantically valid files, or b) add a VCF strict specifications that requires semantically valid data. E.g.: CIPOS interval actually includes the called POS position, SV ALT allele matches SVTYPE, END and SVLEN match, both sides of BND exist in the file, both BND records contain the same information, and so on. Again, I'm happy do the work for this.

@thefferon
Copy link
Author

@d-cameron , @SHuang-Broad , and @cyenyxe ,

I want to begin wrapping up this pull request. It occurs to me the most efficient way to determine which changes are in scope and which are not may be a video chat.

Please reply to this post, stating:

  • your willingness to participate in a video call, and
  • your time zone (so I can begin trying to schedule it)

Thanks!

@d-cameron
Copy link
Contributor

More than happy to have a meeting. My timezone is AEST, but I'm very flexible and would like to move this forward (I just got off a 1-2am GA4GH file format teleconference call).

@cyenyxe
Copy link
Member

cyenyxe commented Sep 22, 2017

I am also fine with joining the video chat. My time zone is BST.

@SHuang-Broad
Copy link

Sorry guys that I've been busy implementing functionalities for complex SV's, taking the discussions here into account.
I would love to have a conference call on this. My timezone is UTC−04:00 rightnow.

@thefferon
Copy link
Author

Please consult the following "schedule" and indicate when you would prefer to have a call (my initial preference is indicated in yellow). The schedule starts with 3pm @d-cameron 's time Monday, followed by BST (U.K.) and EST (U.S. Eastern). At least I think the times are right. Monday is just an example; we could have it any day of the week.

VCF Call Schedule.pdf

@SHuang-Broad
Copy link

Please consider my preference same as Tim's. Thanks!

@SHuang-Broad
Copy link

SHuang-Broad commented Sep 22, 2017

@thefferon
Regarding the three points you make

FIRST:

I believe the colon-separated way is cleaner, and provides a balance between convenience and restriction. But don't insist.

SECOND:

Agree. BND is a breakpoint, any type of variant could have any (appropriate) number of BND records.

THREE:

  1. agree.
  2. agree.
  3. agree, though SUBS for substitution is what I initially thought of.
  4. I prefer not to bring back the TRA value, and SVTYPE for such breakpoints should be left as "undetermined" because we don't know the "meaning of these events": it could be a duplication of a region and inserted somewhere else, or simply that region is "move" to other places.

@d-cameron
for the translocation example you gave, the current spec says for EVENT: "ID of event associated to breakend". And it would be better if in the insertion record there's an annotation pointing to the mapping of the inserted sequence (to the deletion record).

I agree that SVTYPE conveys (almost) redundant information, but I feel that improving up on it makes more sense, e.g. allowing what would go into the EVENTTYPE be in SVTYPE.


And I also like @thefferon's proposal for bringing in the OTHER for complex variants, though I prefer the name CPX and no subtypes need to be specified in the spec because it introduces bias favoring what people already know as complex but when it is complex, there's almost no limit how complex it can be (though providing an interpretation would be good, preferably with an annotation).

@d-cameron
Copy link
Contributor

d-cameron commented Sep 23, 2017

I'm travelling next week so some time between 11:00pm and 01:00am AEST would be best for me. Tuesday or Wednesday AEST?

@d-cameron
Copy link
Contributor

@SHuang-Broad the translocation example would be written in a lossless manner as:

chr1	1000	sv1_lower1	T	]chr2:5000]T	.	.	SVTYPE=BND;PARID=sv1_upper1;EVENT=sv1;EVENTTYPE=TRA
chr2	5000	sv1_upper1	T	T[chr1:1000]	.	.	SVTYPE=BND;PARID=sv1_lower1;EVENT=sv1;EVENTTYPE=TRA
chr1	3000	sv2_lower2	T	T[chr2:6000]	.	.	SVTYPE=BND;PARID=sv2_upper2;EVENT=sv1;EVENTTYPE=TRA
chr2	6000	sv2_upper2	T	]chr1:3000]T	.	.	SVTYPE=BND;PARID=sv2_lower2;EVENT=sv1;EVENTTYPE=TRA

@thefferon
Copy link
Author

Call schedule:

Please sign off if you can make the following:

@d-cameron : Thurs. 12:00am AEST
@cyenyxe : Wed. 1500 BST
@SHuang-Broad : Wed. 9:00am EST

If anyone has had a very positive video call experience using a particular application or service, please mention it and I will use it if possible. Otherwise I will try Webex, GoToMeeting, or zoom.us. I have found audio to be choppy on Google Hangout. I have no experience with Skype for international multi-calls.

Thanks,
Tim

@cyenyxe
Copy link
Member

cyenyxe commented Sep 25, 2017

That time slot works for me. From those 3 apps, I am most familiar with Zoom.

@thefferon
Copy link
Author

I took a stab at a an agenda for the call. Please update as needed, using Track Changes "on" in Word. (Doc can be migrated to Google Docs or similar if it would be easier.)

Agenda – VCF Pull Request.docx

@SHuang-Broad
Copy link

The proposed time works for me. Thanks Tim!

@SHuang-Broad
Copy link

@d-cameron I feel like we shouldn't be arguing about the detail at the moment but I'm confused by the BND equivalent of the translocation, why is chr2:6000 involved? I thought it would be chr2:5001, since there's an insertion on chr2 at that point by the symb alt allele representation.

@SHuang-Broad
Copy link

Would you mind having @cwhelan, who is our team lead on developing SV tools, and @xuefzhao, joining us?

@thefferon
Copy link
Author

@SHuang-Broad , by all means! The more experienced minds on this, the better. Will they both be joining you physically, or should I send them call-in details separately (I'd need addresses)?

Incidentally, it appears that, if we are to use Zoom, there is a 40-minute time limit on calls with more than 2 participants on the free plan, which is all I have access to. Our options are:

  1. someone else, with the ability to hold longer calls, sets up and initiates the call
  2. I schedule two consecutive 40-minute calls (with a pee-break in the middle)
  3. we use a different service

I'm OK with any of the above.

@SHuang-Broad
Copy link

@thefferon . Thanks!

I believe assuming we three not in the same physical location is a safer bet. What's your preferred way of getting their info for setting up the call? I assume posting email address here publicly may not make everyone comfortable.

Regarding which service we use, I am OK with either. And having a break in the middle is a great idea!

@thefferon
Copy link
Author

thefferon commented Sep 25, 2017

@SHuang-Broad and I had a successful test video call earlier today using Zoom. I am replacing the test call information that was here with tomorrow's call-in details:

Tim Hefferon is inviting you to a scheduled Zoom meeting. 

Topic: VCF spec PR #231
Time: Sep 27, 2017 9:00 AM Eastern Standard Time (US and Canada)

Join from PC, Mac, Linux, iOS or Android: https://zoom.us/j/535584891

Or iPhone one-tap :
    US: +16699006833,,535584891#  or +16468769923,,535584891# 
Or Telephone:
    Dial(for higher quality, dial a number based on your current location):
        US: +1 669 900 6833  or +1 646 876 9923 
    Meeting ID: 535 584 891
    International numbers available: https://zoom.us/zoomconference?m=PkFQhOIJ9Za9pJsoeSYja5RxbcdOB6_O

I will be updating the call Agenda, so please check back periodically between now and then.

Hope to see you all on the call!

–Tim

@SHuang-Broad
Copy link

@thefferon thanks! Will try the test call on time.

@thefferon
Copy link
Author

@d-cameron ,

I have the feeling much of the discussion on the call will center around the distinction you made between the two interpretations of SVTYPE. Would you be willing to provide an explanation of the first interpretation, preferably with examples and mentioning dependencies, toward the beginning of the call? I think that would help frame the rest of the discussion.

@d-cameron
Copy link
Contributor

d-cameron commented Sep 25, 2017

@thefferon I can do that. Fundamentally, my concerns are about ensuring the unambiguous machine interpretation of variants does not get any more broken than it current is.

@SHuang-Broad Yes, it should be 5001. To be fully lossless, it would need either a) the 1000bp of inserted sequenced in the ALT allele, or BNDs for chr2:5000->chr1:1000 and chr1:1999->chr2:5001

@thefferon
Copy link
Author

Hi All,

For the sake of having as productive a call as possible, I'd like to make the following suggestions (if you have feedback you think should be heard before - or at the beginning of - the call, please speak up; otherwise we can review them at the end if there is time):

  1. dbVar has limited resources to contribute toward developing the VCF spec; we also have a need to square away a few specific issues. I am adding a section to the agenda for today's call, listing several issues I would like us to resolve today if at all possible.
  2. I suggest we hold subsequent calls (once a quarter?) until everyone is satisfied that we have come up with workable solutions to these and any important related issues.
  3. To summarize which changes are still in scope for the PR and which I have dropped, based on the excellent feedback and discussion we've had so far.

In scope:
SVTYPE, ALT, EVENT, EVENTTYPE, etc.
SVTYPEs: BND, TRA, RPL
IMPRECISE
CIPOS, CIEND, CIVALUE

Dropped (for now at least):
SVTYPEs: OTHER, STR, TANDUP, DISDUP, DEL_DUP, DEL_INV, DUP_INV, DEL_DUP_INV
INFO Tag: TDLOC

Please review the agenda prior to the call, in case I or anyone else has posted updates.

I look forward to seeing everyone tomorrow!

–Tim

P.S. I see we're at 111 Comments on this PR; what is the record?

@thefferon
Copy link
Author

Updated Agenda:

Agenda – VCF Pull Request v2.docx

@thefferon
Copy link
Author

I closed the chat window from our conference call earlier today before saving the email addresses everyone posted. I've contacted the support team for the Zoom app, but they cannot retrieve the info.

@SHuang-Broad I will contact you via email for Chris' and Xuefang's addresses; @d-cameron I am missing yours. Please send it to t>i>m>h>e>f>f(e(r(0(n( at g>m>a>i>l>.>c>o>m>.

Thanks!

@thefferon
Copy link
Author

@d-cameron , please provide the information requested in my previous message.

Thanks!

@thefferon
Copy link
Author

Please provide your approval and/or comments on the Confluence Page I sent around recently via email.

Let's wrap this up!

Thanks,
Tim

bringing my fork up to date
@thefferon thefferon closed this Nov 28, 2017
@thefferon
Copy link
Author

I am closing this PR and will open a new one with a limited subset of the changes discussed here.

thefferon added a commit to thefferon/hts-specs that referenced this pull request May 1, 2018
Still evaluating effect of these changes on pdf layout...
thefferon added a commit to thefferon/hts-specs that referenced this pull request May 12, 2018
cyenyxe pushed a commit to cyenyxe/hts-specs that referenced this pull request Sep 22, 2018
d-cameron pushed a commit to d-cameron/hts-specs that referenced this pull request Dec 18, 2019
commit ecd40f0
Author: Tim Hefferon <theffero@nih.gov>
Date:   Mon Dec 18 13:43:12 2017 -0500

    editorial improvements

commit a1b6ad4
Merge: 986e835 7aa24be
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Fri Dec 15 12:41:29 2017 -0500

    Merge pull request samtools#5 from samtools/master

    Fix RFC3986 encoding for "%"

commit 986e835
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:48:21 2017 -0500

    Update VCFv4.3.tex

commit c598717
Merge: 46042a8 4c18a2c
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:41:20 2017 -0500

    Merge pull request samtools#4 from thefferon/revert-whitespace

    Update VCFv4.3.tex

commit 4c18a2c
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:39:55 2017 -0500

    Update VCFv4.3.tex

commit 46042a8
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Dec 14 10:31:48 2017 -0500

    Add files via upload

commit 1d3c079
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 13 14:35:38 2017 -0500

    cleanup samtools#1

commit 6d30b09
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 13 14:33:39 2017 -0500

    incorporated @d-cameron's changes from PR samtools#266

commit 1de4ac4
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 13 14:25:41 2017 -0500

    backed out whitespace changes

commit 406e9da
Merge: 489b696 ce1e750
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 6 09:11:52 2017 -0500

    Merge pull request samtools#3 from thefferon/thefferon-patch-1

    Add files via upload

commit ce1e750
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Dec 6 09:11:20 2017 -0500

    Add files via upload

commit 489b696
Merge: 527e04b 2f915a8
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Tue Dec 5 15:54:37 2017 -0500

    Merge pull request samtools#2 from samtools/master

    bringing my fork up to date

commit 527e04b
Merge: 7c5259c 85a0fef
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Thu Nov 16 16:21:31 2017 -0500

    Merge pull request samtools#1 from samtools/master

    bringing my fork up to date

commit 7c5259c
Author: Tim Hefferon <thefferon@users.noreply.github.com>
Date:   Wed Aug 9 16:25:24 2017 -0400

    First go at changes requested in pull request samtools#231

    Still evaluating effect of these changes on pdf layout...

commit cf2ffa9
Author: Tim Hefferon <theffero@ncbi.nlm.nih.gov>
Date:   Tue Aug 8 11:22:22 2017 -0400

    Made significant updates to Section 3, INFO keys used for structural variants

# Conflicts:
#	VCFv4.3.tex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants