add documentation for VariantContext.getStart() regarding telomeric events #1369

Merged
merged 4 commits into from
Jul 1, 2019

Conversation

SHuang-Broad
Contributor

@SHuang-Broad SHuang-Broad commented May 17, 2019

Description

The VCF spec allows the POS column to take the value 0 (or chromosome length + 1) when the event is at a telomere.

Currently the documentation for public int VariantContext.getStart() says it returns a 1-based value, which suggests that 0 is an invalid value.

This PR intends to clarify that, by adding documentation.
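To make the two boundary values concrete, here is a minimal, self-contained sketch of how calling code might classify a POS value before using it as a reference coordinate. `PosClassifier`, `Kind`, and `classify` are hypothetical names for illustration only and are not part of htsjdk; the sketch also assumes the caller already knows the contig length N.

```java
/** Hypothetical helper, not part of htsjdk: classifies a VCF POS value. */
final class PosClassifier {
    private PosClassifier() {}

    /** Where a POS value falls relative to a contig of length N. */
    enum Kind { BEFORE_CONTIG_START, WITHIN_CONTIG, AFTER_CONTIG_END, INVALID }

    /**
     * Classifies a POS value against a contig of the given length.
     *
     * @param pos          value as a getStart()-style accessor would return it.
     * @param contigLength number of bases in the contig (N); must be positive.
     * @return never {@code null}.
     */
    static Kind classify(final int pos, final int contigLength) {
        if (contigLength <= 0) {
            throw new IllegalArgumentException("contig length must be positive");
        }
        if (pos == 0) {
            return Kind.BEFORE_CONTIG_START;      // telomeric event at the start
        }
        if (pos == (long) contigLength + 1) {
            return Kind.AFTER_CONTIG_END;         // telomeric event at the end (N + 1)
        }
        if (pos >= 1 && pos <= contigLength) {
            return Kind.WITHIN_CONTIG;            // ordinary 1-based position
        }
        return Kind.INVALID;                      // negative or beyond N + 1
    }

    public static void main(final String[] args) {
        System.out.println(classify(0, 1000));    // BEFORE_CONTIG_START
        System.out.println(classify(1, 1000));    // WITHIN_CONTIG
        System.out.println(classify(1001, 1000)); // AFTER_CONTIG_END
        System.out.println(classify(-1, 1000));   // INVALID
    }
}
```

Code that uses the position to index into a reference sequence would treat only `WITHIN_CONTIG` as safe for direct lookup.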

Checklist

(Documentation change, most of the following does not apply)

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@codecov-io

codecov-io commented May 17, 2019

Codecov Report

Merging #1369 into master will decrease coverage by 0.009%.
The diff coverage is n/a.

@@              Coverage Diff               @@
##             master     #1369       +/-   ##
==============================================
- Coverage     67.85%   67.841%   -0.009%     
+ Complexity     8283      8282        -1     
==============================================
  Files           564       564               
  Lines         33695     33695               
  Branches       5650      5650               
==============================================
- Hits          22862     22859        -3     
- Misses         8653      8655        +2     
- Partials       2180      2181        +1
Impacted Files                                          Coverage Δ                  Complexity Δ
.../htsjdk/variant/variantcontext/VariantContext.java   77.714% <ø> (ø)             246 <0> (ø) ⬇️
src/main/java/htsjdk/samtools/BAMFileReader.java        67.847% <0%> (-0.817%)      51% <0%> (-1%)

@@ -1664,7 +1664,15 @@ public String getContig() {

/**
* @return 1-based inclusive start position of the Variant
Contributor

Since you are here, I would reformat the javadoc like this:

/**
 * Summary sentence.
 * <p>
 *   Explanation of what it does, without losing time on the details of particular edge cases.
 * </p>
 * <p>
 *   edge case-1
 * </p>
 * <p>
 *   edge case-2
 * </p>
 *
 * @return the text above should have made clear what is returned... here you report on the possible range of values very briefly.
 */

Contributor

For example in this case something like this:

/**
 * Returns the (start) position of the variant.
 * <p>
 *   Main blah blah ... 1-based ... blah blah.
 * </p>
 * <p>
 *   For telomeres blah blah can be 0 or N+1 blah blah
 * </p>
 * @return 0 or greater, never a negative number.
 */

Contributor

Here I'm just putting emphasis on the summary sentence (the first one, ending with a period) and the statement in the [at]return tag; the details in the section in between are up to you (that is why I added those "blah" "blah").

That said, I think you should be more concise. Programmers don't want to spend too much time understanding what may happen in 0.1% of cases. For example, I would not include details on what is said or depicted in the spec; simply refer to it for further reading.

@SHuang-Broad
Contributor Author

@vruano Thanks for taking a look!
Please take a look at the re-formatted doc.

@vruano
Contributor

vruano commented May 19, 2019

In my view it is just too much text; think about the 99.5% of people who don't care about telomeres. But I'd rather you get another opinion, as perhaps I'm just being too pedantic.

About the summary and the [at]return annotation, I mean something like this:

/**
 * Returns the position for this variant.
 *
 * @return 0 or greater.
 */

The summary sentence states "what it does" very briefly, whereas the [at]return just gives information about the possible range of values, without semantics (those go into the summary and the rest of the javadoc). For example, if the return type were Object, then it is often either @return never {@code null} or @return might be {@code null}. Sometimes, if the method is known to return this (e.g. the StringBuilder append methods), I also say @return this builder (in this case it is implied that it is never null).
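As an illustration of that convention, the sketch below applies it to three methods: a short summary sentence for the semantics, and an [at]return tag that only states the range of possible values. `RecordBuilder` and its members are made-up names for this example and do not correspond to any htsjdk class.

```java
/** Hypothetical builder used only to illustrate the javadoc convention. */
final class RecordBuilder {
    private int position;
    private String id;

    /**
     * Sets the position of the record.
     *
     * @param position 0 or greater.
     * @return this builder.
     */
    RecordBuilder position(final int position) {
        if (position < 0) {
            throw new IllegalArgumentException("position must be 0 or greater");
        }
        this.position = position;
        return this;
    }

    /**
     * Returns the position of the record.
     *
     * @return 0 or greater.
     */
    int getPosition() {
        return position;
    }

    /**
     * Returns the identifier of the record.
     *
     * @return might be {@code null}.
     */
    String getId() {
        return id;
    }

    public static void main(final String[] args) {
        final RecordBuilder builder = new RecordBuilder().position(42);
        System.out.println(builder.getPosition()); // 42
        System.out.println(builder.getId());       // null
    }
}
```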

@vruano
Contributor

vruano commented May 19, 2019

I would add <strong>warning</strong>. People who work at this level with VCF data are supposed to be aware of this.

@SHuang-Broad
Contributor Author

I agree with you that the summary was too long.
So I shrunk it further and added a warning to the new note about telomere.

@@ -1663,8 +1663,11 @@ public String getContig() {
}

/**
* Returns 1-based inclusive start position of the variant, 0 or greater.
* See below for explanation on "0".
* Returns 1-based inclusive start position of the variant.
Contributor

I would drop the "start" since this is in fact controversial, despite that it is the method's name.

Contributor

Also, would people know what "1-based inclusive" means? Perhaps that should be removed from here and/or explained in another "<p>" block... but this is not the only place where we use 1-based indexes, so do you even need to explain it here? I guess it won't hurt.

Contributor Author

I would drop the "start" since this is in fact controversial, despite that it is the method's name.

Why is "start" controversial?

I don't quite understand the rest of the comments

* Returns 1-based inclusive start position of the variant.
*
* <p>
* INDEL events usually start on the first unaltered reference base before the INDEL.
Contributor

This is a bit confusing. Does an INDEL event actually start on the base before the INDEL? Not really: the indel starts where it starts. Perhaps you mean to say that it is reported one base before.
Also, perhaps you should try to be more general here. Instead of talking about INDELs alone, you could say something like:
"Notice that for some types of variant events the actual start position may not be this value (e.g. deletions are reported on the base before the first base deleted)."

That way you are not giving the impression that it may only happen with deletions or indels.
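A small worked example may help show the difference between the reported position and the first altered base. The sketch below is only an illustration: `IndelStart` and `firstPositionAfterSharedPrefix` are hypothetical names, not htsjdk API, and it assumes simple literal REF/ALT alleles (no symbolic alleles, no records at POS 0).

```java
/** Hypothetical helper, not part of htsjdk. */
final class IndelStart {
    private IndelStart() {}

    /**
     * Returns the first reference position that follows the bases shared by
     * REF and ALT at their left end.
     *
     * @param pos 1-based reported position of the record.
     * @param ref reference allele as plain bases.
     * @param alt alternate allele as plain bases.
     * @return {@code pos} or greater.
     */
    static int firstPositionAfterSharedPrefix(final int pos, final String ref, final String alt) {
        final int limit = Math.min(ref.length(), alt.length());
        int shared = 0;
        // Count the leading bases that REF and ALT have in common.
        while (shared < limit && ref.charAt(shared) == alt.charAt(shared)) {
            shared++;
        }
        return pos + shared;
    }

    public static void main(final String[] args) {
        // Deletion of "CG": POS=100, REF=ACG, ALT=A. The record is anchored on
        // base 100, but the first deleted base is 101.
        System.out.println(firstPositionAfterSharedPrefix(100, "ACG", "A")); // 101
        // SNP: nothing is shared, so the reported position is the altered base.
        System.out.println(firstPositionAfterSharedPrefix(100, "A", "T"));   // 100
    }
}
```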

Contributor Author

This is directly copied from the old comment.

But I agree that it can be more general.

* Note also that the VCF spec allows 0 and N + 1 for POS field for telomeric event,
* where N is the length of the chromosome.
* The "0" value returned should be interpreted as telomere, and does not violate the above "1-based" comment.
* It's the responsibility of code generating such variants to make sure {@code start} is populated correctly.
Contributor

It is unfortunate that the spec talks about "telomeres" only. It assumes that CHROM (and thus the reference) can only contain full, non-circular chromosomes... total BS (e.g. the MT chromosome in humans, or nearly every contig in unfinished genomic references).
Yes, CHROM as a name wasn't a good choice to start with, but we don't need to keep up that mistake. Notice that the API uses "contig" instead of "chr" or "chromosome"; it still assumes too much, but it is closer to the truth.

So I would refrain from making it seem as if it can only be telomeres.

What about something like:

"This property can take on "0" and "N+1" (where N is the last base in the enclosing contig) when this variant record refers to events that happen before or at the beginning, or after or at the end, of the enclosing contig."

Contributor Author

I hope this PR does not turn into a rant about the spec itself. I can only work within the current spec for this PR.
So, since the latest spec talks about telomeres, I'm happy to use the single word "telomere" to avoid such a long sentence.

Contributor

It won't, I leave it up to you at this point.

@vruano
Contributor

vruano commented May 20, 2019

Sorry, perhaps I went too far giving examples... you can borrow mine, but you can/should use your own words.

*
* <p>
* INDEL events usually start on the first unaltered reference base before the INDEL.
* </p>
Contributor

</p> markup is unnecessary as the <p> tag closes the previous open paragraph tag.

Member

@lbergelson lbergelson left a comment

I think the comments are valid, but I don't think they have to be addressed as part of this change; this just mentions the existence of telomere and N+1 positions as possibilities.

@lbergelson lbergelson merged commit 9aa81ed into samtools:master Jul 1, 2019