Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only compute BIN on BAM write and on index building #1258

Merged
merged 6 commits into from
Jan 16, 2019

Conversation

tfenne
Copy link
Member

@tfenne tfenne commented Jan 10, 2019

Description

Prior to this change if you had a SAM or BAM file with records mapped to references > 512mb, SAMRecord would log a warning per record mapped to those contigs!! It would also blindly compute an invalid BIN value and store it.

This change removes the stored indexingBin field in SAMRecord and has it computed only when needed - i.e. when writing a BAM file in BAMRecordCodec and during index building. In BAMRecordCoded it will warn once when the first read mapped to a long contig is found, and will write a bin value of 0, the same as we do for unmapped reads. When index building it will throw an exception since BAI cannot support reads mapped beyond 512mb.

I ran into a lot of issues with how this interacted with the long cigar code. @yfarjoun I ended up resolving those by modifying the code in BAMRecordCodec slightly. Instead of calling setCigar() with the sentinal cigar, it creates a local variable called cigarToWrite and assigns either the real cigar or the sentinal cigar as appropriate. I also added a line at the end of the encode() to remove the CG tag from the record post-encoding, so that a pass through BAMRecordCodec now leaves the record largely untouched.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@tfenne tfenne self-assigned this Jan 10, 2019
@tfenne tfenne requested review from yfarjoun and lbergelson January 10, 2019 15:52
@codecov-io
Copy link

codecov-io commented Jan 10, 2019

Codecov Report

Merging #1258 into master will increase coverage by 0.051%.
The diff coverage is 70.968%.

@@               Coverage Diff               @@
##              master     #1258       +/-   ##
===============================================
+ Coverage     69.366%   69.417%   +0.051%     
- Complexity      8157      8201       +44     
===============================================
  Files            548       548               
  Lines          32706     32842      +136     
  Branches        5523      5556       +33     
===============================================
+ Hits           22687     22798      +111     
- Misses          7799      7822       +23     
- Partials        2220      2222        +2
Impacted Files Coverage Δ Complexity Δ
src/main/java/htsjdk/samtools/BAMRecord.java 83.14% <ø> (+1.637%) 53 <0> (+2) ⬆️
src/main/java/htsjdk/samtools/SAMUtils.java 59.5% <0%> (-0.149%) 127 <0> (ø)
src/main/java/htsjdk/samtools/BAMIndexer.java 76.531% <100%> (-0.237%) 17 <0> (ø)
.../main/java/htsjdk/samtools/SAMValidationError.java 96.117% <100%> (+0.038%) 9 <0> (ø) ⬇️
...rc/main/java/htsjdk/samtools/SamFileValidator.java 84.211% <20%> (-0.743%) 82 <1> (+1)
src/main/java/htsjdk/samtools/SAMRecord.java 67.271% <40%> (-0.524%) 300 <2> (-15)
src/main/java/htsjdk/samtools/BAMRecordCodec.java 79.839% <94.118%> (+0.847%) 25 <0> (+4) ⬆️
src/main/java/htsjdk/samtools/cram/CRAIIndex.java 89.362% <0%> (+1.126%) 43% <0%> (+15%) ⬆️
...java/htsjdk/samtools/cram/structure/Container.java 83.871% <0%> (+2.621%) 14% <0%> (+7%) ⬆️
... and 3 more

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. in particular, given the long-cigar shinanigans, I like not over-using setCigar and getCigar in the encode method....

src/main/java/htsjdk/samtools/SAMRecord.java Show resolved Hide resolved
src/main/java/htsjdk/samtools/SAMUtils.java Show resolved Hide resolved
@@ -294,9 +293,6 @@ public void testBAMThroughCRAMRoundTrip() throws IOException, NoSuchAlgorithmExc
// retrieve all records from the bam and reset the indexing bins to keep comparisons with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the comment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just removed the comment as all that line now does is load the records from a file and that didn't seem comment-worthy.

@@ -327,9 +323,6 @@ public void testBAMThroughCRAMRoundTripViaPath() throws IOException, NoSuchAlgor
// retrieve all records from the bam and reset the indexing bins to keep comparisons with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

src/main/java/htsjdk/samtools/SAMRecord.java Show resolved Hide resolved
void setIndexingBin(final Integer mIndexingBin) {
this.mIndexingBin = mIndexingBin;
}

/**
* Does not change state of this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment to reflect the fact that there is a possible side-effect due to the need to parse the cigar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@yfarjoun
Copy link
Contributor

@lbergelson case to cast a glance?

@tfenne
Copy link
Member Author

tfenne commented Jan 12, 2019

@lbergelson I'd be very grateful if you could let me know, in the next few days, whether you'd like to review this PR or if you're happy enough with @yfarjoun's review. Thanks!

@lbergelson
Copy link
Member

@tfenne Sorry! I will review.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfenne This seems pretty reasonable to me. A few minor comments.

@@ -108,18 +108,20 @@ public void encode(final SAMRecord alignment) {
final int readLength = alignment.getReadLength();

// if cigar is too long, put into CG tag and replace with sentinel value
if (alignment.getCigarLength() > BAMRecord.MAX_CIGAR_OPERATORS) {
final Cigar cigarToWrite;
final boolean cigarSwitcharoo = alignment.getCigar().numCigarElements() > BAMRecord.MAX_CIGAR_OPERATORS;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the change from getCigarLength -> getCigar().numCigarElements() for a reason? If not I might keep getCigarLength. If it is then it needs a comment because it's not obvious to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short answer is that some of the tests fail if I leave it as getCigarLength() and they all pass if I just call getCigar().numCigarElements(). I'm honestly not 100% sure why and that code has gotten horribly complicated with all the long cigar support. I'll add a comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That scares the hell out of me... those seem like they should be identical..

if (getReferenceName() != null) {
LOG.warn(String.format("%s reference length is too large for BAM bin field. %s record bin field value is incorrect.",
getReferenceName(), getReadName()));
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think keeping the reference length is too long for bam warning is a good idea. It's still problem you would want to know about if you're validating your bam I think.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually my primary reason for the PR in the first place! With this check here, unless you set validation stringency to silent, you get a warning logged per record when reading. So I'm getting millions of warnings!
My take is that it's perfectly valid SAM/BAM to have records mapped to references that are longer than BAI can handle, but it's problematic when you try to index with BAI. That's why I removed the warnings, and made sure it was a clear failure when building a BAI. If I leave this in, then anyone using either unindexed BAMs or CSI-indexed BAMs will see millions of unnecessary warnings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh.. of course.. that makes sense. Maybe there's a way we can have ValidateSam warn exactly once if it encounters reads like that but not have it happen in the general case of just processing a sam file with validation on.

* @deprecated because the method does the exact opposite of what it says. Use the correctly named
* isReferenceSequenceIncompatibleWithBAI() instead.
*/
@Deprecated public static boolean isReferenceSequenceCompatibleWithBAI(final SAMSequenceRecord sequence) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh god, that's horrible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I was really confused initially.

return (binNumber == null ? rec.computeIndexingBin() : binNumber);

}
public Integer getIndexingBin() { return rec.computeIndexingBin(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format over lines, e.g.

public Integer getIndexingBin() {
    return rec.computeIndexingBin();
}

alignment.setAttribute(CG.name(), cigarEncoding);
cigarToWrite = makeSentinelCigar(alignment.getCigar());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format else on previous line

} else {

src/main/java/htsjdk/samtools/BAMRecordCodec.java Outdated Show resolved Hide resolved
final boolean tooLarge = sequence != null && SAMUtils.isReferenceSequenceIncompatibleWithBAI(sequence);
if (!isReferenceSizeWarningShowed & tooLarge && rec.getValidationStringency() != ValidationStringency.SILENT) {
LOG.warn("Reference length is too large for BAM bin field.");
LOG.warn("Reads on references longer than " + GenomicIndexUtil.BIN_GENOMIC_SPAN + "bp will have bin set to 0.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space in warning

Suggested change
LOG.warn("Reads on references longer than " + GenomicIndexUtil.BIN_GENOMIC_SPAN + "bp will have bin set to 0.");
LOG.warn("Reads on references longer than " + GenomicIndexUtil.BIN_GENOMIC_SPAN + " bp will have bin set to 0.");

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is intentional. It's normal to write 1234bp vs. 1234 bp.

void setIndexingBin(final Integer mIndexingBin) {
this.mIndexingBin = mIndexingBin;
}
@Deprecated() public int computeIndexingBinIfAbsent(final SAMRecord alignment) { return alignment.computeIndexingBin(); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format across lines

src/main/java/htsjdk/samtools/SAMRecord.java Outdated Show resolved Hide resolved
src/test/java/htsjdk/samtools/BAMCigarOverflowTest.java Outdated Show resolved Hide resolved
tfenne and others added 2 commits January 16, 2019 08:58
Co-Authored-By: tfenne <tfenne@tfenne.com>
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thank you for adding the warning.

@tfenne tfenne merged commit 15ec7da into master Jan 16, 2019
@tfenne tfenne deleted the tf_maybe_dont_compute_bin branch January 16, 2019 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants