-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only compute BIN on BAM write and on index building #1258
Conversation
…M and during index creation.
…in BAMRecordCodec.
Codecov Report
@@ Coverage Diff @@
## master #1258 +/- ##
===============================================
+ Coverage 69.366% 69.417% +0.051%
- Complexity 8157 8201 +44
===============================================
Files 548 548
Lines 32706 32842 +136
Branches 5523 5556 +33
===============================================
+ Hits 22687 22798 +111
- Misses 7799 7822 +23
- Partials 2220 2222 +2
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this. in particular, given the long-cigar shinanigans, I like not over-using setCigar and getCigar in the encode method....
@@ -294,9 +293,6 @@ public void testBAMThroughCRAMRoundTrip() throws IOException, NoSuchAlgorithmExc | |||
// retrieve all records from the bam and reset the indexing bins to keep comparisons with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just removed the comment as all that line now does is load the records from a file and that didn't seem comment-worthy.
@@ -327,9 +323,6 @@ public void testBAMThroughCRAMRoundTripViaPath() throws IOException, NoSuchAlgor | |||
// retrieve all records from the bam and reset the indexing bins to keep comparisons with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
void setIndexingBin(final Integer mIndexingBin) { | ||
this.mIndexingBin = mIndexingBin; | ||
} | ||
|
||
/** | ||
* Does not change state of this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update comment to reflect the fact that there is a possible side-effect due to the need to parse the cigar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@lbergelson case to cast a glance? |
@lbergelson I'd be very grateful if you could let me know, in the next few days, whether you'd like to review this PR or if you're happy enough with @yfarjoun's review. Thanks! |
@tfenne Sorry! I will review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfenne This seems pretty reasonable to me. A few minor comments.
@@ -108,18 +108,20 @@ public void encode(final SAMRecord alignment) { | |||
final int readLength = alignment.getReadLength(); | |||
|
|||
// if cigar is too long, put into CG tag and replace with sentinel value | |||
if (alignment.getCigarLength() > BAMRecord.MAX_CIGAR_OPERATORS) { | |||
final Cigar cigarToWrite; | |||
final boolean cigarSwitcharoo = alignment.getCigar().numCigarElements() > BAMRecord.MAX_CIGAR_OPERATORS; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the change from getCigarLength -> getCigar().numCigarElements()
for a reason? If not I might keep getCigarLength
. If it is then it needs a comment because it's not obvious to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The short answer is that some of the tests fail if I leave it as getCigarLength()
and they all pass if I just call getCigar().numCigarElements()
. I'm honestly not 100% sure why and that code has gotten horribly complicated with all the long cigar support. I'll add a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That scares the hell out of me... those seem like they should be identical..
if (getReferenceName() != null) { | ||
LOG.warn(String.format("%s reference length is too large for BAM bin field. %s record bin field value is incorrect.", | ||
getReferenceName(), getReadName())); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think keeping the reference length is too long for bam warning is a good idea. It's still problem you would want to know about if you're validating your bam I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually my primary reason for the PR in the first place! With this check here, unless you set validation stringency to silent, you get a warning logged per record when reading. So I'm getting millions of warnings!
My take is that it's perfectly valid SAM/BAM to have records mapped to references that are longer than BAI can handle, but it's problematic when you try to index with BAI. That's why I removed the warnings, and made sure it was a clear failure when building a BAI. If I leave this in, then anyone using either unindexed BAMs or CSI-indexed BAMs will see millions of unnecessary warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh.. of course.. that makes sense. Maybe there's a way we can have ValidateSam warn exactly once if it encounters reads like that but not have it happen in the general case of just processing a sam file with validation on.
* @deprecated because the method does the exact opposite of what it says. Use the correctly named | ||
* isReferenceSequenceIncompatibleWithBAI() instead. | ||
*/ | ||
@Deprecated public static boolean isReferenceSequenceCompatibleWithBAI(final SAMSequenceRecord sequence) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh god, that's horrible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed. I was really confused initially.
return (binNumber == null ? rec.computeIndexingBin() : binNumber); | ||
|
||
} | ||
public Integer getIndexingBin() { return rec.computeIndexingBin(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format over lines, e.g.
public Integer getIndexingBin() {
return rec.computeIndexingBin();
}
alignment.setAttribute(CG.name(), cigarEncoding); | ||
cigarToWrite = makeSentinelCigar(alignment.getCigar()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format else
on previous line
} else {
final boolean tooLarge = sequence != null && SAMUtils.isReferenceSequenceIncompatibleWithBAI(sequence); | ||
if (!isReferenceSizeWarningShowed & tooLarge && rec.getValidationStringency() != ValidationStringency.SILENT) { | ||
LOG.warn("Reference length is too large for BAM bin field."); | ||
LOG.warn("Reads on references longer than " + GenomicIndexUtil.BIN_GENOMIC_SPAN + "bp will have bin set to 0."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing space in warning
LOG.warn("Reads on references longer than " + GenomicIndexUtil.BIN_GENOMIC_SPAN + "bp will have bin set to 0."); | |
LOG.warn("Reads on references longer than " + GenomicIndexUtil.BIN_GENOMIC_SPAN + " bp will have bin set to 0."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is intentional. It's normal to write 1234bp
vs. 1234 bp
.
void setIndexingBin(final Integer mIndexingBin) { | ||
this.mIndexingBin = mIndexingBin; | ||
} | ||
@Deprecated() public int computeIndexingBinIfAbsent(final SAMRecord alignment) { return alignment.computeIndexingBin(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
format across lines
…der is too long for BAI indexing.
Co-Authored-By: tfenne <tfenne@tfenne.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Thank you for adding the warning.
Description
Prior to this change if you had a SAM or BAM file with records mapped to references > 512mb, SAMRecord would log a warning per record mapped to those contigs!! It would also blindly compute an invalid BIN value and store it.
This change removes the stored
indexingBin
field inSAMRecord
and has it computed only when needed - i.e. when writing a BAM file inBAMRecordCodec
and during index building. InBAMRecordCoded
it will warn once when the first read mapped to a long contig is found, and will write a bin value of0
, the same as we do for unmapped reads. When index building it will throw an exception since BAI cannot support reads mapped beyond 512mb.I ran into a lot of issues with how this interacted with the long cigar code. @yfarjoun I ended up resolving those by modifying the code in
BAMRecordCodec
slightly. Instead of callingsetCigar()
with the sentinal cigar, it creates a local variable calledcigarToWrite
and assigns either the real cigar or the sentinal cigar as appropriate. I also added a line at the end of theencode()
to remove theCG
tag from the record post-encoding, so that a pass throughBAMRecordCodec
now leaves the record largely untouched.Checklist