Fixed memory leak in VCFCodec #1264
Conversation
Also performed a minor refactor to remove the duplicated symbolic allele detection code in Allele and VCFCodec.
Codecov Report
@@ Coverage Diff @@
## master #1264 +/- ##
===============================================
+ Coverage 69.353% 69.354% +0.001%
- Complexity 8301 8307 +6
===============================================
Files 555 555
Lines 33116 33127 +11
Branches 5572 5575 +3
===============================================
+ Hits 22967 22975 +8
- Misses 7890 7894 +4
+ Partials 2259 2258 -1
Alternative approaches:
@d-cameron We had a similar proposal a while ago (#845) which fell apart because no one could agree about the right thing to do, so in the end we did nothing. However, looking at this more, it seems to me that the caching/interning of the allele strings is entirely pointless. Each string just gets passed into a call to Allele.create(), which then re-encodes the string into bytes. I can't tell exactly, because the string encoding is complex and varies by locale, but it doesn't look to me like there is any sort of String -> byte cache in the encoding stage. Maybe the right thing to do is to not bother caching any of the allele strings. Then, in addition, we could create a cache of String -> Allele, which would be used to reduce the byte[] overhead from many duplicate alleles.
@yfarjoun Can you take a look and see if you agree with my interpretation of what's happening? How did we not see that before when we looked at this?
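The String -> Allele cache proposed above could look roughly like the sketch below. This is a hypothetical illustration, not htsjdk code: AlleleCacheSketch and MAX_CACHED_LENGTH are invented names, and the length cutoff is an assumed policy for keeping multi-kilobase SV alleles out of the cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: instead of interning raw allele strings (which get
// re-encoded to byte[] anyway), intern the resulting byte[] representations,
// and only for short, common alleles (A, T, AT, ...) so that huge SV allele
// strings are never retained for the lifetime of the codec.
public class AlleleCacheSketch {
    private static final int MAX_CACHED_LENGTH = 10; // assumed cutoff

    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    public byte[] basesFor(final String alleleString) {
        if (alleleString.length() > MAX_CACHED_LENGTH) {
            // Too long to be worth deduplicating: encode without retaining.
            return alleleString.getBytes();
        }
        // Duplicate short alleles share one byte[] instance.
        return cache.computeIfAbsent(alleleString, String::getBytes);
    }

    public int cachedCount() {
        return cache.size();
    }
}
```

The key property is that memory held by the cache is bounded by the number of distinct short alleles, not by the size or number of SV alleles streamed through the codec.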
I agree, @lbergelson. However, I would like to note that the internedString map is also used for filters, reference names, and genotypes. For those objects, the current solution might be OK.
@d-cameron I think we should just remove the allele caching instead of making it conditional. Thanks for deduplicating some of the other code here. I have a few nitpicky comments about javadoc etc., then it should be good to merge.
 * @param bases bases representing an allele
 * @return true if the bases represent a symbolic allele in breakpoint notation
 */
public static boolean wouldBeBreakpoint(final byte[] bases) {
Could you provide an example of what these look like in the javadoc? Lots of people never work with SVs and don't know what these are.
        return false;
    else {
        for (int i = 0; i < bases.length; i++) {
            if (bases[i] == '[' || bases[i] == ']') {
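To make the requested javadoc examples concrete, here is a self-contained sketch of the breakpoint check with example strings from VCF breakpoint notation. It mirrors the scan-for-brackets logic shown in the diff but is illustrative only; BreakpointSketch is not the actual htsjdk class.

```java
// Sketch of breakpoint detection. In VCF, a breakpoint (paired breakend)
// ALT allele contains '[' or ']', e.g. "G]17:198982]" or "]13:123456]T".
// Plain base alleles like "ACGT" contain neither.
public class BreakpointSketch {
    public static boolean wouldBeBreakpoint(final byte[] bases) {
        if (bases.length <= 1) {
            return false;
        }
        for (int i = 0; i < bases.length; i++) {
            final byte base = bases[i]; // read once, per the micro-optimization below
            if (base == '[' || base == ']') {
                return true;
            }
        }
        return false;
    }
}
```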
Micro optimization to avoid checking the array twice.
-            if (bases[i] == '[' || bases[i] == ']') {
+            final byte base = bases[i];
+            if (base == '[' || base == ']') {
}
/**
 * @param bases bases representing an allele
 * @return true if the bases represent a symbolic allele in single breakend notation
again, could you add an example of one of these here?
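For the single-breakend case, the requested examples could look like the sketch below. This is an illustrative stand-in, not the htsjdk implementation; SingleBreakendSketch is an invented name, and the leading/trailing-dot rule is taken from the VCF single-breakend notation.

```java
// Sketch of single-breakend detection. In VCF, a single breakend ALT allele
// starts or ends with '.', e.g. ".A" (breakend before the base) or
// "G." (breakend after the base). A lone "." is the missing-value marker,
// not a breakend, so length-1 inputs are rejected.
public class SingleBreakendSketch {
    public static boolean wouldBeSingleBreakend(final byte[] bases) {
        if (bases.length <= 1) {
            return false;
        }
        return bases[0] == '.' || bases[bases.length - 1] == '.';
    }
}
```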
@@ -333,7 +335,13 @@ else if ( parts[2].equals(VCFConstants.EMPTY_ID_FIELD) )
    builder.id(parts[2]);

final String ref = getCachedString(parts[3].toUpperCase());
It turns out the caching here is completely pointless. Just remove it from here.
- final String ref = getCachedString(parts[3].toUpperCase());
+ final String ref = parts[3].toUpperCase();
@@ -333,7 +335,13 @@ else if ( parts[2].equals(VCFConstants.EMPTY_ID_FIELD) )
    builder.id(parts[2]);

final String ref = getCachedString(parts[3].toUpperCase());
- final String alts = getCachedString(parts[4]);
+ String alts = parts[4];
Same here, just remove the caching. It was a good idea to only cache short alleles, but it turns out it's broken anyway, so let's just remove it.
@@ -300,10 +300,36 @@ public static boolean wouldBeSymbolicAllele(final byte[] bases) {
    if ( bases.length <= 1 )
Could you update the javadoc here to mention breakends like you did below?
@@ -421,6 +447,10 @@ public static Allele create(final Allele allele, final boolean ignoreRefState) {
    // Returns true if this Allele is symbolic (i.e. no well-defined base sequence)
    public boolean isSymbolic() { return isSymbolic; }

    public boolean isBreakpoint() { return wouldBeBreakpoint(bases); }
javadoc here
@@ -421,6 +447,10 @@ public static Allele create(final Allele allele, final boolean ignoreRefState) {
    // Returns true if this Allele is symbolic (i.e. no well-defined base sequence)
    public boolean isSymbolic() { return isSymbolic; }

    public boolean isBreakpoint() { return wouldBeBreakpoint(bases); }

    public boolean isSingleBreakend() { return wouldBeSingleBreakend(bases); }
likewise
@@ -277,4 +277,34 @@ public void testExtend() {
    Assert.assertEquals("ATCGA", Allele.extend(Allele.create("AT"), "CGA".getBytes()).toString());
    Assert.assertEquals("ATCGA", Allele.extend(Allele.create("ATC"), "GA".getBytes()).toString());
}
@Test
public void testWouldBeSymbolic() {
Thank you for adding these tests!
Hi @d-cameron, are you interested in continuing to work on this branch, or would you like us to take it over for you?
@lbergelson Happy for you to take it over. The change is relatively small, but the code review will be much faster if you only have to do it internally and not bounce back and forth to Australia for each revision. I have a few other projects on the go, and since I already have this patch packaged into GRIDSS, it's less urgent than the other work. Thanks.
Closing this because it's been replaced by #1282.
Description
VCFCodec caches previously seen allele strings, which I presume is so that the millions of identical allele strings (e.g. A, AA, AT, T) are interned in the string HashMap<>. Unfortunately, when parsing large VCFs with SV alleles, this causes a memory leak.
Since each allele string is retained until the VCFCodec is GCed, streaming a VCF containing SVs (e.g. just counting records) results in a continual increase in memory usage even when the VCF records are no longer used. For SVs, the allele string can easily be thousands of bytes in size.
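The leak pattern described above can be reduced to a minimal illustration (not htsjdk code; LeakyInternDemo and its members are invented names): a codec-scoped map from each seen string to itself retains every distinct allele string for the codec's lifetime, so memory grows with the number of distinct alleles streamed, not with the records currently in use.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal model of the leak: an unbounded intern map owned by a long-lived
// codec object. Every distinct string ever passed in is strongly retained
// until the owning object itself is garbage collected.
public class LeakyInternDemo {
    private final Map<String, String> internedStrings = new HashMap<>();

    public String getCachedString(final String s) {
        // First sighting stores the string forever; later sightings reuse it.
        return internedStrings.computeIfAbsent(s, k -> k);
    }

    public int retainedEntries() {
        return internedStrings.size();
    }
}
```

Streaming many distinct SV alleles through one instance makes the map grow without bound, which is exactly the behavior this PR removes for alleles.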
Changes:
Checklist