Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove caching of alleles in AbstractVCFCodec #1282

Merged
merged 5 commits into from
Feb 26, 2019

Conversation

lbergelson
Copy link
Member

This replaces #1264 and responds to my own comments.

@lbergelson lbergelson requested a review from yfarjoun February 13, 2019 21:55
@lbergelson lbergelson mentioned this pull request Feb 13, 2019
5 tasks
@codecov-io
Copy link

codecov-io commented Feb 13, 2019

Codecov Report

Merging #1282 into master will increase coverage by 0.439%.
The diff coverage is 90%.

@@              Coverage Diff               @@
##             master     #1282       +/-   ##
==============================================
+ Coverage     67.49%   67.929%   +0.439%     
- Complexity     8150      8483      +333     
==============================================
  Files           558       561        +3     
  Lines         33365     34324      +959     
  Branches       5608      5848      +240     
==============================================
+ Hits          22518     23316      +798     
- Misses         8658      8796      +138     
- Partials       2189      2212       +23
Impacted Files Coverage Δ Complexity Δ
...main/java/htsjdk/variant/vcf/AbstractVCFCodec.java 73.765% <100%> (+0.759%) 86 <0> (-4) ⬇️
...ain/java/htsjdk/variant/variantcontext/Allele.java 77.703% <86.667%> (+0.497%) 95 <9> (+7) ⬆️
...c/main/java/htsjdk/samtools/util/SnappyLoader.java 78.571% <0%> (-2.198%) 9% <0%> (ø)
src/main/java/htsjdk/samtools/SamReader.java 81.818% <0%> (-0.94%) 0% <0%> (ø)
...va/htsjdk/samtools/SAMSequenceDictionaryCodec.java 86.957% <0%> (-0.543%) 8% <0%> (+2%)
src/main/java/htsjdk/samtools/util/CigarUtil.java 30.303% <0%> (-0.466%) 14% <0%> (ø)
...java/htsjdk/samtools/cram/ref/ReferenceSource.java 46.364% <0%> (-0.425%) 19% <0%> (ø)
...htsjdk/tribble/readers/LongLineBufferedReader.java 29.944% <0%> (-0.342%) 15% <0%> (ø)
...ain/java/htsjdk/samtools/util/MergingIterator.java 85.417% <0%> (-0.298%) 18% <0%> (+7%)
...ava/htsjdk/samtools/CRAMContainerStreamWriter.java 70.612% <0%> (-0.289%) 54% <0%> (ø)
... and 35 more

@lbergelson lbergelson force-pushed the lb_vcfcodec_memory_leak branch from ba32f98 to a9065a3 Compare February 14, 2019 16:03
@lbergelson lbergelson removed the request for review from yfarjoun February 14, 2019 18:27
@lbergelson lbergelson force-pushed the lb_vcfcodec_memory_leak branch from a9065a3 to 5d993aa Compare February 15, 2019 19:16
@lbergelson
Copy link
Member Author

@pshapiro4broad Could you review this branch?

Copy link
Contributor

@pshapiro4broad pshapiro4broad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at the code, and seeing what it's trying to do, I wonder if there's a better way that can avoid this problem in the first place. The built in method String.intern() creates a canonical copy (aka an interned copy) of a String. This has no memory leak downsides as the String intern table has an upper limit and will GC strings as necessary. If we have a test that can exercise this code (which I assume we had at one point, when this string caching code was written) it would be interesting to run it using String.intern() and see if that's sufficient here.

src/main/java/htsjdk/variant/variantcontext/Allele.java Outdated Show resolved Hide resolved
public static boolean wouldBeBreakpoint(final byte[] bases) {
if ( bases.length <= 1 )
return false;
else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary else after return

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if ( bases.length <= 1 )
return false;
else {
for (int i = 0; i < bases.length; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can use java foreach here

Suggested change
for (int i = 0; i < bases.length; i++) {
for (byte base : bases) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been finding noticeable runtime costs for using foreach in hotspots (seemingly because it instantiates a new iterator each time which is much more costly than a fast for loop). Since Alleles are created in a lot of places I figured I'd stick with the for loop even though it's gross... That said, I haven't profiled this particular instance in any sensible way so I can't prove that this is any faster.

if ( bases.length <= 1 )
return false;
else {
return bases[0] == '.' || bases[bases.length - 1] == '.';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to use constants instead of the magic characters > < [ ] .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've extracted some... I can't decide if it's an improvement or not... You can take a look when I push the changes.

builder.log10PError(parseQual(parts[5]));

final List<String> filters = parseFilters(getCachedString(parts[6]));
if ( filters != null ) builder.filters(new HashSet<String>(filters));
if ( filters != null ) builder.filters(new HashSet<>(filters));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may as well format the if across lines and use { } if you're changing this line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -577,7 +577,7 @@ private static void checkAllele(String allele, boolean isRef, int lineNo) {
System.err.println(String.format("Allele detected with length %d exceeding max size %d at approximately line %d, likely resulting in degraded VCF processing performance", allele.length(), MAX_ALLELE_SIZE_BEFORE_WARNING, lineNo));
}

if ( isSymbolicAllele(allele) ) {
if (Allele.wouldBeSymbolicAllele(allele.getBytes())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is done before calling the Allele constructor, but it appears to be ensuring that we create a valid Allele. Is there a reason why this code isn't part of the Allele constructor? We would have to wrap the constructor call in a try/catch to get a message printed out with the VCF line number, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's a big redundant section of allele checking code here that I believe is redundant with the constructor. I wanted to make this as small a change set as possible though. I started a follow up branch that cleans up some of this stuff but I figured I'd get this one in before that.

* @param bases bases representing an allele
* @return true if the bases represent a symbolic allele in breakpoint notation, (ex: G]17:198982] or ]13:123456]T )
*/
public static boolean wouldBeBreakpoint(final byte[] bases) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of would here confuses me. If this is a breakpoint allele string, then why not is?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's weird, but it matches the existing methods. I think changing 1 of them is worse than having this weird convention. We could change all of them though...

Copy link
Contributor

@pshapiro4broad pshapiro4broad Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok, I didn't see the rest of them.

@@ -277,4 +277,38 @@ public void testExtend() {
Assert.assertEquals("ATCGA", Allele.extend(Allele.create("AT"), "CGA".getBytes()).toString());
Assert.assertEquals("ATCGA", Allele.extend(Allele.create("ATC"), "GA".getBytes()).toString());
}
@Test
public void testWouldBeSymbolic() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could rewrite these tests using a testng data provider

Suggested change
public void testWouldBeSymbolic() {
public void testWouldBeSymbolic(String allele, boolean expectedSymbolic) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I didn't do all the tests in this file though, just the bottom ones..

pshapiro4broad and others added 3 commits February 21, 2019 15:00
spacing fix

Co-Authored-By: lbergelson <louisb@broadinstitute.org>
@lbergelson lbergelson merged commit a6c5837 into master Feb 26, 2019
@lbergelson lbergelson deleted the lb_vcfcodec_memory_leak branch February 26, 2019 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants