-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Made significant updates to Section 3, INFO keys used for structural … #231
Conversation
I am suggesting extensive changes to the list of reserved SVTYPEs. These changes are needed to help standardize the way structural variants are reported in VCF, and to facilitate their exchange between groups. I also added two new INFO tags, TDLOC and CIVALUE. |
VCFv4.3.tex
Outdated
@@ -5,6 +5,7 @@ | |||
\usepackage{lscape} | |||
\usepackage[margin=0.75in]{geometry} | |||
\usepackage[pdfborder={0 0 0}]{hyperref} | |||
%\usepackage{showframe} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented (unused) packages.
VCFv4.3.tex
Outdated
|
||
SVTYPE is required for all structural variants. The following reserved values should be used whenever possible; indented values are subtypes that should be used if evidence supports them (see examples in Section 5.3, Encoding Structural Variants):\vspace{0.3cm}\\ | ||
\setlength{\parindent}{0.3cm} | ||
\indent INS: Insertion relative to the reference\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following would be better formatted as nested itemize
instead of multiple indentation levels (example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
VCFv4.3.tex
Outdated
|
||
\footnotesize | ||
\begin{verbatim} | ||
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants"> | ||
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants"> | ||
##INFO=<ID=CIVALUE,Number=,Type=Float,Description="Degree of certainty expressed by CIPOS/CIEND: `1' indicates 100% certainty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Percentage characters need to be escaped or LaTeX will interpret them as the beginning of a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
VCFv4.3.tex
Outdated
@@ -547,6 +594,8 @@ \section{INFO keys used for structural variants} | |||
|
|||
\footnotesize | |||
\begin{verbatim} | |||
##INFO=<ID=TDLOC,Number=1,Type=String,Description="Position and orientation of tandem duplication; must be one of | |||
(Direct, Inverted-downstream, or Inverted-upstream)"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep a single field description in one line whenever possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not able to keep this to one line, nor to wrap it, with the current text. I will try working up an alternative that fits on one line.
\footnotesize | ||
\begin{verbatim} | ||
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation"> | ||
##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation"> | ||
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record"> | ||
\end{verbatim} | ||
\normalsize | ||
For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate. | ||
For precise variants, END is POS + length of REF allele - 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please state what would be the preferred action in the case of imprecise variants, otherwise it will be open to interpretation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to remove the IMPRECISE tag and all references to it from the spec because:
- what exactly is meant by "imprecise" variants is not, and has never been, clearly defined in the spec (I would also like to see the phrase "best estimate" removed, since it introduces artificial and somewhat arbitrary data points)
- the tag is redundant with CIPOS and CIEND; I have yet to see an example where adding the 'IMPRECISE' tag has added any informational value to a VCF record
@cyenyxe , what do you think is the best approach to take from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm (in GRIDSS) currently using the IMPRECISE/CIPOS fields to differentiate between variants in which the actual breakpoint position is not known due to lack of coverage, and a variant which the full breakpoint sequence is known but contains a micro-homology thus there is inherit ambiguity in the variant call.
HOMLEN/HOMSEQ don't tell you where the homology is relative to the called position. This information is critical when performing base-pair accurate analysis and the only way to recover it from HOMLEN/HOMSEQ is to recompute it from the reference genome. In the case where the sample differs from the reference, this may not be possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding @d-cameron 's comments on HOMSEQ/HOMLEN, if a left-align convention is enforced in the spec, this would at least remove some of the ambiguity. For example, imagine a deletion where the reference haplotype is LEFT_FLANK + HOM + UNIQUE_DEL + HOM + RIGHT_FLANK
, and the alt haplotype is LEFT_FLANK HOM + RIGHT_FLANK
, the deleted sequence is then any contiguous substring of the sequence HOM + UNIQUE_DEL + HOM
with a size of len(HOM) + len(UNIQUE_DEL)
. Sticking to the left-aligning convention would put the POS
at the end of LEFT_FLANK
. Now depending on if other annotations exist (e.g. providing CIGARs), the exact haplotypes may be constructed to a certain degree close to the real alt haplotype. And yes, haplotype reconstruction/comparison is what ultimately matters IMHO.
The above comment would lead to a point that the IMPRECISE
flag is useful: for the deletion example mentioned above, we do have the exact haplotype (if we have assembled across the deletion breakpoints or a read that is long enough to span it), therefor it is "precise", in the sense that we know its length. The ambiguity is how to represent it. But this is not the same as "we don't know the exact haplotype and therefore not even knowing where to put it exactly; all we know is an SV event occurred within this provided boundary, and the evidence doesn't allow us to pin down the position."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SHuang-Broad nicely clarifies the distinction between (A) not having enough sequence information in the vicinity of the breakpoint and (B) not being able to identify the precise breakpoint (despite having complete sequence information) due to a stretch of homology. (A) is an information problem (ie not enough data), while (B) is a methodology problem that cannot be solved no matter how many times you sequence across the breakpoint (because it is already information-complete). For this reason (B) can be addressed only by (1) acknowledging the intractability of identifying a single coordinate in all variants with sequence homologies, or (2) enforcing a convention such as the left-justification one @SHuang-Broad proposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, any intelligent aligner, when presented with the following ... should recognize – and encode in its output, in some form – that there are four equally valid interpretations that explain the difference between the sequences.
Alternatively, any intelligent variant caller would calculate any micro-homology intervals around the split read/indel positions reported by the aligner. GRIDSS, Socrates and Pindel already do this. Given that callers can't even trust aligners to write SAM files that actually make sense, my preference is to keep this in the VCF specifications and not spill over into the SAM as it's not absolutely necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it your position, then, that the IMPRECISE flag should be applied in, and only in, the "latter" case - i.e. in which the variant lacks sequence across the breakpoint?
Correct.
The only edge cases I can think of are:
-
when an RP caller has sufficient coverage that there is only one breakpoint position possible. One could argue that these calls are precise. I (weakly) prefer these to have
IMPRECISE
andCIPOS=0,0
for consistency with your definition of precise calls requiring breakpoint sequence. -
when the full untemplated inserted sequence is not known. e.g: a ~300bp insertion with soft clip/breakpoint assembly support from both sides, spanning RP, but a fragmented breakpoint assembly. The full sequence is not known, but the breakpoint positions are exact. Should this be flagged as IMPRECISE? I'm inclined to say no since the breakend positions are in fact precise. The current VCF specs can actually represent such a call at all:
REF: ACGT------------------------------------CCGG
ALT: ACGTTTCAAA?????(50-100bp)????????TTTCCAACCGG
I guess you could put Ns in there, but using standard VCF fields, there's no way to say that there's somewhere between 50 and 100 Ns. Come to think of it, this is just a special case of an insertion of unknown length, something that the standard VCF fields don't support, merely a single SVLEN value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implicit in both these examples is the introduction, by the caller/aligner, of a bias; they have already selected one solution from among several equally valid solutions; and this affects downstream interpretation of the data.
If the caller reports a POS and CIPOS, then it is reporting a set of solutions. If the downstream interpretation ignores most of these solutions then the issue is with the downstream interpretation, not what the caller reported. Just to reiterate: any tool that interprets VCF calls MUST be able to handle multiple different representations of the same variant. The only way that VCF readers can get away from this is if there is some sort of normalisation (similar to the left-alignment of SNV/indels). The only way I can see to normalise all SV events is to turn everything into left-aligned BND^, at which point, why have SVTYPE and any non-BND symbolic alleles at all?
^ This brings us back to the interpretation of SVTYPE. Is it a shorthand for the equivalent BND representation, or is it actually making a claim about the event type? Is a DEL caller claiming a CN reduction in the 'deleted' region? (for most callers, the answer is no). This is why I proposed the SVCLASS field. SVTYPE remains a restricted set of short-hand notations, and SVCLASS makes a claim about the mutation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As expressed in previous comments, when it comes to SV, what really matters, IMHO, are the haplotypes. Ideally, the tool that made the call set should provide the alt haplotype (ref is easy) explicitly when feasible (e.g. for complex SV) or provide a tool for doing this. So downstream analysis could use such information easily, for either querying the VCF to see if a variant is in close vicinity of a target region, or for making sense of what happened here.
Now regarding the concerns for the "bias" of the aligner-given position, I don't think it is practical to require that from a general purpose aligner and special purpose aligner may provide no guarantee in these (homologous) regions either, because if we think about it, it is an optimization problem that also depends on the parameters sent to the aligner. But this should not stop us from doing the work: a good caller and downstream analysis in the SV world should be aware of this phenomenon and, echoing what has been said before, allow ambiguity and focus on the haplotype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a different opinion on the example insertion with unsure length that @d-cameron gave above.
It's getting clearer and clearer to me that when I refer to IMPRECISE
variants, I am talking about "exact haplotype unknown", which is not the same as "exact place cannot be determined". In the "insertion with unknown length" example above, the exact alt sequence cannot be determined, hence the alignment could NOT be pinned down either (what if the missing sequence can be aligned to the left or right flanking sequences?), making the alignment given in the example a bit suspicious.
Still evaluating effect of these changes on pdf layout...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this change weakens the specifications. The VCF specifications should be a common file format in which multiple programs can unambiguous interpret and parse variants. As it stands these changes would make it no longer possible to unambiguously convert a VCF SV record to another format such as BEDPE.
|
||
\footnotesize | ||
\begin{verbatim} | ||
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> | ||
\end{verbatim} | ||
\normalsize | ||
This key can be derived from the REF/ALT fields but is useful for filtering. The reserved values must be used for the types listed below: | ||
SVTYPE is required for all structural variants. The following reserved values should be used whenever possible; indented values are subtypes that should be used if evidence supports them (see examples in Section 5.3, Encoding Structural Variants):\vspace{0.3cm}\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"should be used whenever possible" is a massive weakening from "must be used". This weakening makes makes this field useless for downstream analysis programs because there's no requirement that the values actually be one of these. Any value would be spec-compliant.
Practical applications of the restrictive usage of this field include my StructuralVariantAnnotation R package in which I take most popular callers, and (in theory) any VCFv4.3 compliant SV callers and convert them to a standard breakend notation so variant calls written using different notations can actually be compared. With this change, program B will no longer be able to complain that the problem is in the non-compliant output of program A. If two programs don't work with each other then the specs need to be able to be used to say which of the programs is incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Point taken. I will re-write this to clarify. However I still see some ambiguity in the original text:
This key can be derived from the REF/ALT fields but is useful for filtering. The reserved values must be used for the types listed below:
To take these two sentences in reverse order:
The reserved values must be used for the types listed below:
One reading of this leaves the door open for SVTYPE to be assigned any value, so long as the author feels their variant is not one of "the types listed below." If the intention is instead, as I believe, to establish a closed list of reserved types, such that all VCF SV records require the SVTYPE field, and only the listed values are allowed (and no colons w/ subtypes, etc.) then this sentence needs to change.
This key can be derived from the REF/ALT fields but is useful for filtering.
This seems to imply the SVTYPE field is redundant with, and secondary to, REF and ALT – and may therefore be optional. Again, I propose this be rewritten or removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to imply the SVTYPE field is redundant with, and secondary to, REF and ALT – and may therefore be optional
You raise a very good point. I've been using SVTYPE, but s1.4.5 does indeed define a restrictive set of first-level SV ALT alleles. Although the spec does allow other ALT alleles, it is clear that only ALT alleles in the list are to be considered SVs^.
^ Section 1.4.5 needs the word "imprecise" removed for this to actually be true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be convenient for a human to have access for such information (SVTYPE) if the caller decides to output BND records for all variants.
Convenience is a virtue.
\item [\null] INS: Insertion relative to the reference | ||
\begin{itemize} | ||
\item [\null] MEI: Mobile element insertion relative to the reference | ||
\begin{itemize} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this list to be maintained? Do we have any ME ontologies we can reference so our list is more exhaustive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this PR I included mobile element types from data published by 1000 Genomes: Alu, LINE1, SVA, and HERV. I added 'RET' (for 'retrotransposon') because I am aware of a specific need to accommodate it.
For now I think it is sufficient to use the types in this list; additional types can be added, and a formal ontology consulted, if and when the need to do so becomes more obvious.
\end{itemize} | ||
\item [\null] DEL: Deletion relative to the reference | ||
\begin{itemize} | ||
\item [\null] -MEI: Absence of a mobile element present in the reference |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer DEL:MEI over DEL:-MEI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of colons to construct nested strings of types and subtypes is only described in the spec for the ALT field (symbolic ALT alleles). I have never seen this notation used in the SVTYPE field. Have you?
My intention with the indented list of subtypes in this PR is that only one value for SVTYPE should be used; the granularity of the choice of term depends on the evidence present in the data. For example, if an insertion is observed but nothing definitive is known (yet) about its content, SVTYPE=INS. However, if the same insertion is known to be a mobile element of some kind (family not specified), then SVTYPE=MEI. If it is known that the insertion is a LINE1 element, then SVTYPE=LINE1 (perhaps L1 would be better).
Thus the nesting would be implied. I proposed this schema mainly because it is consistent with the seemingly widely adopted practice (which ignores guidance provided by the current spec), of indicating mobile element insertions by using SVTYPE values such as ALU, LINE1, and SVA.
In keeping with this "one-word SVTYPE" controlled vocabulary, I proposed "-ALU", for example, to represent a so-called Alu-element deletion (more accurately described as the absence of an Alu insertion which is present in the reference genome). For an up-to-date treatment of mobile element variation, please see https://www.ncbi.nlm.nih.gov/pubmed/28855259.
I wanted to make the above points regardless of the mechanism that is eventually chosen to identify event types (like your proposed SVCLASS).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of colons to construct nested strings of types and subtypes is only described in the spec for the ALT field (symbolic ALT alleles). I have never seen this notation used in the SVTYPE field. Have you?
Both Pindel and manta use <DUP:TANDEM>
alt alleles, with pindel using SVTYPE=DUP:TANDEM
notation (manta just calls SVTYPE=DUP
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@d-cameron is that the only example you are aware of, or was it intended as one example that is part of an argument that colons should be used in SVTYPE?
ALT=<DUP:TANDEM>
adheres to the existing spec's guidelines for the use of symbolic alternate alleles. Pindel's use of SVTYPE=DUP:TANDEM
does not, and it seems like an effort to address exactly the sort of deficiency in the spec that I aim to improve with this PR. Even though SVTYPE=DUP
is "legal" and SVTYPE=DUP:TANDEM
is not, IMO the latter is better in that it imparts crucial information about the variant which the former does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thefferon the example was just a comment on current usage. I'm of the opinion that colons should be used if the values are an open set, but are not necessary if the set is closed (since the hierarchy can be defined in the spec)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the parent (hierarchical) information in the ALT helps grep-ing
\end{itemize} | ||
|
||
TRA (Translocation): A VCF translocation record describes the juxtaposition of two DNA segments from disparate genomic locations (from the same or from different chromosomes). It must include two genomic coordinates, each accompanied by an orientation (see STRAND below). A VCF translocation record only describes the immediate molecular context of the novel adjacency generated by the translocation event. It should not be confused with the traditional cytogenetic notion of translocation, which is understood to imply that the remainder of the chromosome arms beyond the specified breakpoints are also involved.\vspace{0.3cm}\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already covered by the breakend notation. I strongly disagree with introducing yet another notation for arbitrary genomic rearrangements. BND should be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to push back on this. While BND is, I believe, capable of explicitly describing almost any structural event – including variants normally described using the SVTYPEs of DEL, INS, INV, etc. – in practice it is unwieldy and complicated to implement. If BND can describe all events, why have any other SVTYPEs at all?
In my view, TRA serves the same purpose as all the other non-BND SVTYPEs: it is a brief-but-complete way to express commonly encountered variation events (most/all of which could also be described much less intuitively using only BND). Where BND shines is in its ability to capture complex events that cannot be described using other SVTYPEs; and perhaps, by virtue of this universal applicability, BND may be simpler for parsers to accommodate than a host of SVTYPEs, each with its own rules/idiosyncracies.
TRA has the following advantages over BND for describing any novel adjacency:
- it clearly identifies the positions and orientations of both joined breakends in an intuitive way
- is supported by at least one major SV caller (DELLY)
- it is in fairly wide usage for describing translocations in VCF (whereas BND is probably less so, at least in the VCF I have encountered)
@d-cameron , do you agree that TRA serves the same sort of purpose as all the other, non-BND SVTYPEs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is in fairly wide usage for describing translocations in VCF (whereas BND is probably less so, at least in the VCF I have encountered)
is supported by at least one major SV caller (DELLY)
DELLY changed from TRA to BND notation back in March this year so as to be VCF compliant (see dellytools/delly#59 ). Do you know which other callers write TRA records? TIGRA uses CTX and CHR2 with a similar meaning to DELLY's TRA. As far as I can tell, LUMPY, GRIDSS, manta, and CREST are all compliant. Pindel deviates from the current specs with its use of RPL (replacement) for del/ins events.
in practice it is unwieldy and complicated to implement. If BND can describe all events, why have any other SVTYPEs at all?
I ask myself the same question. I believe it is because VCF was designed for SNVs and evolved over time to support single-loci events then finally arbitrarily complex structural variants through the use of BND notation. VCF is ill-suited to representing SVs but it's still a whole lot better than every tool writing their own custom tsv/csv format.
If you're handling SVs in general, you can't get away from having some events in BND notation. I agree it is unwieldy and complicated, but if the analysis/software already has to deal with BND events, adding additional event types on top of this just makes processing SVs even more messy and complicated.
@d-cameron , do you agree that TRA serves the same sort of purpose as all the other, non-BND SVTYPEs?
Adding it to the specifications would serve the same purpose as all the other SVTYPEs (except CNV, although even those can now be represented, albeit extremely unintuitively, using breakend notation): as a short-hand to an equivalent set of BND records. The key difference between the existing types and TRA is that all the existing types are all single-loci events.
As far as I can tell, the proposed TRA notation is just an alternate representation of information contained in a BND record. Why not just use a BND record? How is the specifications enhanced by requiring all implementing program to support both BND and TRA notations?
Note that BND records were intentionally defined to be paired and have a record at both locations so the events involving any given loci could be found by querying that loci. Yes this is messy, but programs already have to deal with this for BND events and it would be inconsistent if the proposed TRA records were not also paired.
|
||
\footnotesize | ||
\begin{verbatim} | ||
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> | ||
\end{verbatim} | ||
\normalsize | ||
This key can be derived from the REF/ALT fields but is useful for filtering. The reserved values must be used for the types listed below: | ||
SVTYPE is required for all structural variants. The following reserved values should be used whenever possible; indented values are subtypes that should be used if evidence supports them (see examples in Section 5.3, Encoding Structural Variants):\vspace{0.3cm}\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BND is missing from this list. This appears to be an oversight in the current specs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - I will add BND in a separate edit.
\item [\null] DISDUP: Dispersed duplication | ||
\end{itemize} | ||
\item [\null] DELINS: Deletion-insertion (replacement of reference sequence with sequence of different content or length) | ||
\item [\null] OTHER: Complex variant not covered elsewhere |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't these be written as multiple VCF records with the same EVENT?
My preference is to just include simple variants in this field and use BND notation with EVENT to describe complex events.
OTHER is a catch-all that is going to break any program trying to be spec-compliant and I don't think belongs in these specifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fundamentally, all SVs should be able to be decomposed into an equivalent set of BND VCF records. If SVTYPE is to be expanded, I strongly recommend that every type (and subtype that has a different decomposition than the parent) has an example in the VCF specifications that has the event in both short-hand format, and the equivalent BND notation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that a symbolic alt allele for complex event has its value, because being easily interpretable is a virtue, as long as we have a healthy suspicion on the provided interpretation, which is what the subtypes proposed here are trying to do.
On the other hand, this proposal of OTHER
is too ambiguous. It has almost no power of restraining what can be put here. It should at least be mentioned that for a record to be classified as "complex", at least two types of events each would be associated with a distinct 1st level alt alle must be involved.
And a third point I would like to make: adding these subtypes is "unfair" to the other imaginable dozens (if not hundreds) subtypes.
Now regarding @d-cameron 's comments about equivalent BND records connected with their EVENT
annotation, I agree there should be a tool providing such functionality. But, IMHO, for complex variant, what we should do is to signal that a series of simple event is connected together is a non-trivial way so a region of the reference is rearranged. A caller should output (best estimate of) this region's boundary in a as compact as possible fashion, and preferably output the reference and alternative haplotypes. This would fulfill the following needs:
- the affected genomic locations are specified, this is beneficial as compared to a series of BND records which must be parsed all together to clearly define the boundaries that cluster these events together;
- event could be easily interpreted (though one should not blindly trust the interpretation);
- the concrete sequence is available for downstream analysis (genotyping, re-interpretation) and these downstream analysis doesn't need to painstakingly reconstruct the haplotype.
\item [\null] INV: Inversion relative to the reference | ||
\item [\null] DUP: Duplication | ||
\begin{itemize} | ||
\item [\null] TANDUP: Tandem duplication (one of Direct, Inverted-upstream, or Inverted-downstream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this proposing Direct, Inverted-upstream, or Inverted-downstream are subsubtype?
\item [\null] STR: Short tandem repeat variation | ||
\item [\null] INV: Inversion relative to the reference | ||
\item [\null] DUP: Duplication | ||
\begin{itemize} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These subtypes change the meaning of the event considerably and I would not consider them subtypes. VCF is intended to be machine-readable. Converting a DUP record to BEDPE format should not require completely different special casing code for each type of the duplication.
This change is an implicit change of the definition of DUP from a single breakpoint tandem duplication event to something more general. Unfortunately, this more general definition, although more closely matching the biological meaning, is under-defined and I don't think is an appropriate change for the file format. That said, the definition of DUP should indeed be clarified. My code assumes a single breakpoint non-inverted tandem duplication but other programs might not be using this definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering how you are proposing to change the 1st level ALT allele INS (used to be "Insertion of novel sequence relative to the reference", now proposed to be "Insertion relative to the reference"), do you intend DUP be part of INS under the new interpretation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, I sensed we are moving into complex SV land: duplications IMO means a string exist in the reference is repeated—not necessarily exactly—and placed in close vicinity or dispersed. Even for the case of close vicinity, one could imagine micro insertions (novel, mobile, inverted) in between the repeated units on the alt allele.
So I do agree that DUP should be more clearly defined.
\indent NS = number of samples with data\\ | ||
\indent REF = the reference copy number\\ | ||
\indent RL = total length of the STR tract in the reference\\ | ||
\indent RU = repeat motif as it appears on the forward strand in the reference\vspace{0.3cm}\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this differ from HOMSEQ?
@@ -547,6 +606,7 @@ \section{INFO keys used for structural variants} | |||
|
|||
\footnotesize | |||
\begin{verbatim} | |||
##INFO=<ID=TDLOC,Number=1,Type=String,Description="Tandem duplication type (must be: direct, inverted-upstream, or inverted-downstream)"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I don't think this belongs in the specs. BND already covers this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ID doesn't seem to convey the meaning: LOC would, IMO, signal location, but this to me means a subtype involving not only location but also strand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some concerns voiced.
@@ -503,41 +503,100 @@ \subsection{VCF tag naming conventions} | |||
|
|||
\section{INFO keys used for structural variants} | |||
\label{sv-info-keys} | |||
The following INFO keys are reserved for encoding structural variants. In general, when these keys are used by imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles). | |||
The following INFO keys are reserved for encoding structural variants. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewording suggestions:
For keys reflecting properties of alt alleles (e.g. SVLEN), the number of values for such keys must equal to, or a multiple of, the count of distinct alt alleles (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles).
\footnotesize | ||
\begin{verbatim} | ||
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation"> | ||
##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation"> | ||
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record"> | ||
\end{verbatim} | ||
\normalsize | ||
For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate. | ||
For precise variants, END is POS + length of REF allele - 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding @d-cameron 's comments on HOMSEQ/HOMLEN, if a left-align convention is enforced in the spec, this would at least remove some of the ambiguity. For example, imagine a deletion where the reference haplotype is LEFT_FLANK + HOM + UNIQUE_DEL + HOM + RIGHT_FLANK
, and the alt haplotype is LEFT_FLANK HOM + RIGHT_FLANK
, the deleted sequence is then any contiguous substring of the sequence HOM + UNIQUE_DEL + HOM
with a size of len(HOM) + len(UNIQUE_DEL)
. Sticking to the left-aligning convention would put the POS
at the end of LEFT_FLANK
. Now depending on if other annotations exist (e.g. providing CIGARs), the exact haplotypes may be constructed to a certain degree close to the real alt haplotype. And yes, haplotype reconstruction/comparison is what ultimately matters IMHO.
The above comment would lead to a point that the IMPRECISE
flag is useful: for the deletion example mentioned above, we do have the exact haplotype (if we have assembled across the deletion breakpoints or a read that is long enough to span it), therefor it is "precise", in the sense that we know its length. The ambiguity is how to represent it. But this is not the same as "we don't know the exact haplotype and therefore not even knowing where to put it exactly; all we know is an SV event occurred within this provided boundary, and the evidence doesn't allow us to pin down the position."
|
||
\footnotesize | ||
\begin{verbatim} | ||
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> | ||
\end{verbatim} | ||
\normalsize | ||
This key can be derived from the REF/ALT fields but is useful for filtering. The reserved values must be used for the types listed below: | ||
SVTYPE is required for all structural variants. The following reserved values should be used whenever possible; indented values are subtypes that should be used if evidence supports them (see examples in Section 5.3, Encoding Structural Variants):\vspace{0.3cm}\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding see examples in Section 5.3, Encoding Structural Variants , I prefer not referring to other sections in a "hard-coded" way. Please use cross-referencing available in TeX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
|
||
\footnotesize | ||
\begin{verbatim} | ||
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants"> | ||
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants"> | ||
##INFO=<ID=CIVALUE,Number=,Type=Float,Description="Degree of certainty expressed by CIPOS/CIEND: `1' indicates 100\% certainty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand why a default value should be in a specification/standard.
@@ -547,6 +606,7 @@ \section{INFO keys used for structural variants} | |||
|
|||
\footnotesize | |||
\begin{verbatim} | |||
##INFO=<ID=TDLOC,Number=1,Type=String,Description="Tandem duplication type (must be: direct, inverted-upstream, or inverted-downstream)"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ID doesn't seem to convey the meaning: LOC would, IMO, signal location, but this to me means a subtype involving not only location but also strand.
\end{itemize} | ||
\item [\null] -RET: Absence of a retrotransposon present in the reference | ||
\end{itemize} | ||
\item [\null] TRA: Translocation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Translocations are hard to reliably predict, so should we refrain from adding it to the spec, considering that BND type already exist to cover such cases?
\item [\null] STR: Short tandem repeat variation | ||
\item [\null] INV: Inversion relative to the reference | ||
\item [\null] DUP: Duplication | ||
\begin{itemize} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering how you are proposing to change the 1st level ALT allele INS (used to be "Insertion of novel sequence relative to the reference", now proposed to be "Insertion relative to the reference"), do you intend DUP be part of INS under the new interpretation?
\item [\null] STR: Short tandem repeat variation | ||
\item [\null] INV: Inversion relative to the reference | ||
\item [\null] DUP: Duplication | ||
\begin{itemize} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, I sensed we are moving into complex SV land: duplications IMO means a string exist in the reference is repeated—not necessarily exactly—and placed in close vicinity or dispersed. Even for the case of close vicinity, one could imagine micro insertions (novel, mobile, inverted) in between the repeated units on the alt allele.
So I do agree that DUP should be more clearly defined.
\indent REF = the reference copy number\\ | ||
\indent RL = total length of the STR tract in the reference\\ | ||
\indent RU = repeat motif as it appears on the forward strand in the reference\vspace{0.3cm}\\ | ||
\indent More information on representing STRs in VCF can be found at http://lobstr.teamerlich.org/file-formats.html\vspace{0.3cm}\\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a specification/standard should refer to other documentation for information that itself should clarify.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SHuang-Broad
Agreed – the reference was intended more as a citation / acknowledgment than as a resource for more information – though I agree the spec needs clarification, if the term is kept.
That leaves us with two options:
- Provide enough detail in the spec, including an example, so the reference to lobSTR is not needed
- Remove STR as a SVTYPE (or SVSUBTYPE / SVSUB) since STRs can be handled by VCF as non-SV records.
Right now I am leaning toward removing STR from the PR.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the 2nd, so to keep the spec as short as possible.
\item [\null] DISDUP: Dispersed duplication | ||
\end{itemize} | ||
\item [\null] DELINS: Deletion-insertion (replacement of reference sequence with sequence of different content or length) | ||
\item [\null] OTHER: Complex variant not covered elsewhere |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that a symbolic alt allele for complex event has its value, because being easily interpretable is a virtue, as long as we have a healthy suspicion on the provided interpretation, which is what the subtypes proposed here are trying to do.
On the other hand, this proposal of OTHER
is too ambiguous. It has almost no power of restraining what can be put here. It should at least be mentioned that for a record to be classified as "complex", at least two types of events each would be associated with a distinct 1st level alt alle must be involved.
And a third point I would like to make: adding these subtypes is "unfair" to the other imaginable dozens (if not hundreds) subtypes.
Now regarding @d-cameron 's comments about equivalent BND records connected with their EVENT
annotation, I agree there should be a tool providing such functionality. But, IMHO, for complex variant, what we should do is to signal that a series of simple event is connected together is a non-trivial way so a region of the reference is rearranged. A caller should output (best estimate of) this region's boundary in a as compact as possible fashion, and preferably output the reference and alternative haplotypes. This would fulfill the following needs:
- the affected genomic locations are specified, this is beneficial as compared to a series of BND records which must be parsed all together to clearly define the boundaries that cluster these events together;
- event could be easily interpreted (though one should not blindly trust the interpretation);
- the concrete sequence is available for downstream analysis (genotyping, re-interpretation) and these downstream analysis doesn't need to painstakingly reconstruct the haplotype.
@SHuang-Broad left-alignment is not possible for BND variants as, depending on the breakend orientations, left-aligning one breakend results in the other breakend being right-aligned. |
This PR really exposes the dual interpretations of SVTYPE. SVTYPE can be considered a convenience field that allows simple structural variants to be represent using a single VCF record instead of having to use the more unwieldy BND notation. Alternatively, SVTYPE can be considered to be a prediction of a specific type of mutational event. Whilst these interpretations have quite a bit of overlap, there a subtle differences. As an example, take the variants called by Pindel and DELLY: SVTYPE=DEL is called for all events in which the orientation of the breakends is consistent with a deletion event. Deletions are called irrespective of whether there is any change in copy number in the 'deleted' region and this approach implies the first interpretation. SVTYPE=INV is called for all events for which a single breakpoint in an inversion-like orientation. Pindel and DELLY call inversions even when only one of the two required breakpoints actually exist in the input. The predictive nature of these calls imply the second interpretation. In my work with VCF SVs, I have interpreted SVTYPE as a shorthand format for the equivalent BND events. As a file format intended to be unambiguous, it must be the former interpretation (even if it also the latter). The restricted list of top-level event types allowed under VCFv4.3 are consistent with the first interpretation. This PR changes the meaning of the SVTYPE field to the latter. To introduce the type of changes requested in the PR, I am in favour of SVTYPE retaining it's current strict definition (we can even remove the annoying subtypes) and introducing a new field(s) containing the level of detail requested in this PR. This has the added bonus that the shorthand notation becomes much less important and making BND calls much less painful when performing downstream analysis. I for one would find it extremely useful to be able to classify the more complex events. Something along the lines of ^processed pseudo genes look like duplication with lots of deletions in the duplicated region, being able to call the deletions as DEL, but also being able to indicate that they are part of a more complex event is very important - just look at the number of supposedly germline intron deletions in the variant databases that are actually processed pseudogene insertions. |
@d-cameron |
@SHuang-Broad We could indeed left-align the left record for events with a known micro-homology. I'm not in favour of left-aligning imprecise variants as it prevents callers from calling the most likely position within the region of uncertainty. |
@d-cameron , oh yes, for imprecise variants the boundary should be the best estimate, unlike the microhomology case in precise variants where no evidence exists favoring any particular point location. |
@thefferon, does my SVCLASS proposal work for your purposes? Given our current lack of detailed knowledge of mechanisms of SV events (e.g. http://www.biorxiv.org/content/early/2017/08/27/181339), having a fixed SVTYPE list to ensure unambiguous parsing, and an open-ended SVCLASS field prevents us having to push out a new spec version every time a new mutational process is identified, but also enables sophisticated downstream analysis with minimal VCF parsing. |
@d-cameron and @SHuang-Broad, @d-cameron, you described two possible "interpretations" of SVTYPE: 1 - as a shorthand for the equivalent BND representation My interpretation has always been of the second type, and maybe it will help you to know that I created this PR from that standpoint. I am less familiar with the computational dependencies that constrain the existing SVTYPE controlled vocabulary; and so I am very grateful to have your input. Please go into a bit more detail about what you mean by the first interpretation. @d-cameron, I think some implementation of your 'SVCLASS' idea could work. I think it will be best if we agree on a mechanism for encoding variant types (e.g., SVCLASS, other, or no change) before engaging in further debate of individual SVTYPE values (e.g., TRA, DUP, subtypes). Here is a bit of background on why I generated this pull request: As curator at dbVar, I must interpret each new strucutral variation VCF submission and fit it to a common data model; there are as many 'flavors' of VCF as there are submitters, and many VCF authors do not adhere to the guidelines in the existing spec. There may be many reasons for this, including that VCF is suboptimal for describing SV to begin with, and that the spec is decidedly vague on many details that are critical to the success of VCF as a means of sharing SV data. My intention with this PR was to address those areas of the spec I felt could easily be improved – for example, the use of more definitions (and improvements to existing ones), and solid examples that clearly demonstrate the preferred way for each variant type to be encoded. I hope this would result in more consistency of usage, and a dramatic improvement in the success and ease with which VCF generated by different sources can be shared and integrated. I have been granted a limited amount of time to resolve the VCF issues that will lead to the most significant and helpful improvements. Therefore I propose a deadline of 30 September 2017 for us to come to agreement on the most important issues we can. I hope I have given you a better understanding of where I am coming from in creating this PR. Going forward, please consider:
Finally, if you have encountered problems working with SV-VCF, or always wished there were a solution to some particular problem(s) with it, please describe them here (or in a parallel PR) so we can address them now and improve the spec on the whole. Thanks, |
@d-cameron and @SHuang-Broad ,
Another way to put this is: Do you agree with my assertion that there is a need for the spec to be improved, with respect to describing the various SV types? Or do you think it is fine the way it is? What do you think are the most important changes it would need in order to become more useful – that would help provide a standard for the way in which different groups / callers ought to encode the variety of types observed? |
@d-cameron and @SHuang-Broad, Please comment on the following proposal for handling SV subtypes in VCF. If you agree I will write it up and add it to this PR. @d-cameron, you argued for keeping the current SVTYPE controlled vocabulary, and also I believe for enforcing it (by which I mean, making it clear in the spec that these are the only allowed values for SVTYPE, anything else being invalid). Those values are: DEL Furthermore, you argued that SVTYPE should (must?) be interpreted not as an assertion about the type of biological event, but rather as shorthand for a BND representation, and that this is required for compatibility with downstream analysis tools. (If at any point I misrepresent anyone's position, please correct me.) FIRST: Is it your position that SVTYPE should be interpreted this way, and so retain the current restricted set of types, to avoid breaking others' code? If you answer is yes, then I accept, but argue we still need a mechanism for VCF authors to specify SV subtypes (like "Alu insertion", "Processed pseudogene insertion", "Absence of an L1 insertion that is present in the reference", etc.). You proposed "SVCLASS" for this; I think a more appropriate term would be "SVSUBTYPE" (or "SVSUB"), a name which implies its value is restricted by the value of SVTYPE (dependencies would be spelled out in the spec - e.g., if SVTYPE="INS", then SVSUB, if provided, must be one of MEI, ALU, L1, HERV, SVA, RET, or PSG - all of which represent types of insertion that prior VCF authors have represented with custom SVTYPE values). There would be similar restrictions for other SVTYPES. If you think the above makes sense, I will draw up a specific section for your review. SECOND: I have argued, and still maintain, that BND is not really an SVTYPE – rather it is a means by which any SV can be described (a point with which I think you agree, based on your SVTYPE interpretation #1, that SVTYPEs are shorthand for BND representations). I therefore propose that BND is NOT a SVTYPE, and that we choose some other way of indicating when a record is a BND representation (or maybe one is not needed, because it is evident from the format of the corresponding ALT value). THIRD: I believe there are other problems with the SVTYPE controlled vocabulary (CV), some of which might be addressed by improving the definitions of existing terms, others that may require new terms. To be more specific:
Thoughts? |
@thefferon I feel like we're making some progress here :)
It is.
The current spec is sufficient as a technical format for defining breakpoints and CNVs.
Having had some time to think about what you're wanting to do, I feel that we want to be even more general, and I propose the Some good examples are: Translocations:
It's nice a human readable and pretty clear what is going on biologically (although this example loses the translocation orientation information, it is available if the tool uses BND format to write the event). Processed pseudo-gene (e.g. there's plenty of exon deletions in dbVar that are actually germline processed pseudo-gene insertions that are not in the reference genome; I don't trust any gene that has adjacent-exon deletions in dbVar to actually have these deletions in the actual gene)
The thing I really like about
If you care about the interpretation of the variants, then being able to handle STRs in both formats is a really useful. Proposal:
|
|
@d-cameron , @SHuang-Broad , and @cyenyxe , I want to begin wrapping up this pull request. It occurs to me the most efficient way to determine which changes are in scope and which are not may be a video chat. Please reply to this post, stating:
Thanks! |
More than happy to have a meeting. My timezone is AEST, but I'm very flexible and would like to move this forward (I just got off a 1-2am GA4GH file format teleconference call). |
I am also fine with joining the video chat. My time zone is BST. |
Sorry guys that I've been busy implementing functionalities for complex SV's, taking the discussions here into account. |
Please consult the following "schedule" and indicate when you would prefer to have a call (my initial preference is indicated in yellow). The schedule starts with 3pm @d-cameron 's time Monday, followed by BST (U.K.) and EST (U.S. Eastern). At least I think the times are right. Monday is just an example; we could have it any day of the week. |
Please consider my preference same as Tim's. Thanks! |
@thefferon FIRST: I believe the colon-separated way is cleaner, and provides a balance between convenience and restriction. But don't insist. SECOND: Agree. BND is a breakpoint, any type of variant could have any (appropriate) number of BND records. THREE:
@d-cameron I agree that SVTYPE conveys (almost) redundant information, but I feel that improving up on it makes more sense, e.g. allowing what would go into the EVENTTYPE be in SVTYPE. And I also like @thefferon's proposal for bringing in the |
I'm travelling next week so some time between 11:00pm and 01:00am AEST would be best for me. Tuesday or Wednesday AEST? |
@SHuang-Broad the translocation example would be written in a lossless manner as:
|
Call schedule: Please sign off if you can make the following: @d-cameron : Thurs. 12:00am AEST If anyone has had a very positive video call experience using a particular application or service, please mention it and I will use it if possible. Otherwise I will try Webex, GoToMeeting, or zoom.us. I have found audio to be choppy on Google Hangout. I have no experience with Skype for international multi-calls. Thanks, |
That time slot works for me. From those 3 apps, I am most familiar with Zoom. |
I took a stab at a an agenda for the call. Please update as needed, using Track Changes "on" in Word. (Doc can be migrated to Google Docs or similar if it would be easier.) |
The proposed time works for me. Thanks Tim! |
@d-cameron I feel like we shouldn't be arguing about the detail at the moment but I'm confused by the BND equivalent of the translocation, why is |
@SHuang-Broad , by all means! The more experienced minds on this, the better. Will they both be joining you physically, or should I send them call-in details separately (I'd need addresses)? Incidentally, it appears that, if we are to use Zoom, there is a 40-minute time limit on calls with more than 2 participants on the free plan, which is all I have access to. Our options are:
I'm OK with any of the above. |
@thefferon . Thanks! I believe assuming we three not in the same physical location is a safer bet. What's your preferred way of getting their info for setting up the call? I assume posting email address here publicly may not make everyone comfortable. Regarding which service we use, I am OK with either. And having a break in the middle is a great idea! |
@SHuang-Broad and I had a successful test video call earlier today using Zoom. I am replacing the test call information that was here with tomorrow's call-in details:
I will be updating the call Agenda, so please check back periodically between now and then. Hope to see you all on the call! –Tim |
@thefferon thanks! Will try the test call on time. |
I have the feeling much of the discussion on the call will center around the distinction you made between the two interpretations of SVTYPE. Would you be willing to provide an explanation of the first interpretation, preferably with examples and mentioning dependencies, toward the beginning of the call? I think that would help frame the rest of the discussion. |
@thefferon I can do that. Fundamentally, my concerns are about ensuring the unambiguous machine interpretation of variants does not get any more broken than it current is. @SHuang-Broad Yes, it should be 5001. To be fully lossless, it would need either a) the 1000bp of inserted sequenced in the ALT allele, or BNDs for chr2:5000->chr1:1000 and chr1:1999->chr2:5001 |
Hi All, For the sake of having as productive a call as possible, I'd like to make the following suggestions (if you have feedback you think should be heard before - or at the beginning of - the call, please speak up; otherwise we can review them at the end if there is time):
In scope: Dropped (for now at least): Please review the agenda prior to the call, in case I or anyone else has posted updates. I look forward to seeing everyone tomorrow! –Tim P.S. I see we're at 111 Comments on this PR; what is the record? |
Updated Agenda: |
I closed the chat window from our conference call earlier today before saving the email addresses everyone posted. I've contacted the support team for the Zoom app, but they cannot retrieve the info. @SHuang-Broad I will contact you via email for Chris' and Xuefang's addresses; @d-cameron I am missing yours. Please send it to t>i>m>h>e>f>f(e(r(0(n( at g>m>a>i>l>.>c>o>m>. Thanks! |
@d-cameron , please provide the information requested in my previous message. Thanks! |
Please provide your approval and/or comments on the Confluence Page I sent around recently via email. Let's wrap this up! Thanks, |
bringing my fork up to date
I am closing this PR and will open a new one with a limited subset of the changes discussed here. |
Still evaluating effect of these changes on pdf layout...
Add files via upload
commit ecd40f0 Author: Tim Hefferon <theffero@nih.gov> Date: Mon Dec 18 13:43:12 2017 -0500 editorial improvements commit a1b6ad4 Merge: 986e835 7aa24be Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Fri Dec 15 12:41:29 2017 -0500 Merge pull request samtools#5 from samtools/master Fix RFC3986 encoding for "%" commit 986e835 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:48:21 2017 -0500 Update VCFv4.3.tex commit c598717 Merge: 46042a8 4c18a2c Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:41:20 2017 -0500 Merge pull request samtools#4 from thefferon/revert-whitespace Update VCFv4.3.tex commit 4c18a2c Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:39:55 2017 -0500 Update VCFv4.3.tex commit 46042a8 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Dec 14 10:31:48 2017 -0500 Add files via upload commit 1d3c079 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 13 14:35:38 2017 -0500 cleanup samtools#1 commit 6d30b09 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 13 14:33:39 2017 -0500 incorporated @d-cameron's changes from PR samtools#266 commit 1de4ac4 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 13 14:25:41 2017 -0500 backed out whitespace changes commit 406e9da Merge: 489b696 ce1e750 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 6 09:11:52 2017 -0500 Merge pull request samtools#3 from thefferon/thefferon-patch-1 Add files via upload commit ce1e750 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Dec 6 09:11:20 2017 -0500 Add files via upload commit 489b696 Merge: 527e04b 2f915a8 Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Tue Dec 5 15:54:37 2017 -0500 Merge pull request samtools#2 from samtools/master bringing my fork up to date commit 527e04b Merge: 7c5259c 85a0fef Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Thu Nov 16 16:21:31 2017 -0500 Merge pull request samtools#1 from samtools/master bringing my fork up to date commit 7c5259c Author: Tim Hefferon <thefferon@users.noreply.github.com> Date: Wed Aug 9 16:25:24 2017 -0400 First go at changes requested in pull request samtools#231 Still evaluating effect of these changes on pdf layout... commit cf2ffa9 Author: Tim Hefferon <theffero@ncbi.nlm.nih.gov> Date: Tue Aug 8 11:22:22 2017 -0400 Made significant updates to Section 3, INFO keys used for structural variants # Conflicts: # VCFv4.3.tex
…variants