
Fixed memory leak in VCFCodec #1264

Closed
wants to merge 1 commit into from

Conversation

@d-cameron (Contributor) commented Jan 23, 2019

Description

VCFCodec caches previously seen allele strings, presumably so that the millions of identical short allele strings (e.g. A, AA, AT, T) are interned via the codec's string HashMap<>. Unfortunately, when parsing a large VCF with SV alleles, this causes a memory leak.

Since each allele string is retained until the VCFCodec is GCed, streaming a VCF containing SVs (e.g. just counting records) results in a continual increase in memory usage even when the VCF records are no longer used. For SVs, the allele string can easily be thousands of bytes in size.

Changes:

  • Exclude breakpoint and single breakend allele strings from the cache.
  • Exclude ALT allele strings containing multiple alleles from the cache (String.split() gets called on these anyway, so there's no point in caching them at all).
  • I also added a safety check to exclude long alleles from the cache, since they're very likely to be unique and would not benefit from caching at all.
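The cache guard described above might look roughly like the following. This is a hypothetical sketch with illustrative names (`AlleleCacheGuard`, `MAX_CACHEABLE_LENGTH`), not the actual htsjdk change:

```java
// Hypothetical sketch of the conditional caching described in this PR.
// Names and the length threshold are illustrative assumptions.
final class AlleleCacheGuard {
    // Long alleles are likely unique, so a small limit keeps the cache bounded.
    static final int MAX_CACHEABLE_LENGTH = 20; // assumed threshold

    static boolean shouldCache(final String allele) {
        if (allele.length() > MAX_CACHEABLE_LENGTH) return false;  // long SV alleles
        if (allele.indexOf(',') >= 0) return false;                // multi-allelic ALT field
        if (allele.indexOf('[') >= 0 || allele.indexOf(']') >= 0) return false; // breakpoint notation
        if (allele.startsWith(".") || allele.endsWith(".")) return false;       // single breakend notation
        return true;
    }
}
```

Short common alleles still get cached, while breakend and long SV alleles bypass the map entirely.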

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@d-cameron (Contributor, Author) commented Jan 23, 2019

Also performed a minor refactor to remove the duplicated symbolic allele detection code in Allele and VCFCodec.

@codecov-io commented Jan 23, 2019

Codecov Report

Merging #1264 into master will increase coverage by 0.001%.
The diff coverage is 83.333%.

@@               Coverage Diff               @@
##              master     #1264       +/-   ##
===============================================
+ Coverage     69.353%   69.354%   +0.001%     
- Complexity      8301      8307        +6     
===============================================
  Files            555       555               
  Lines          33116     33127       +11     
  Branches        5572      5575        +3     
===============================================
+ Hits           22967     22975        +8     
- Misses          7890      7894        +4     
+ Partials        2259      2258        -1
Impacted Files                                           Coverage Δ                    Complexity Δ
...main/java/htsjdk/variant/vcf/AbstractVCFCodec.java    73.62% <75%> (+0.614%)        90 <1> (ø)
...ain/java/htsjdk/variant/variantcontext/Allele.java    77.551% <85.714%> (+0.345%)   95 <9> (+7)
...samtools/util/AsyncBlockCompressedInputStream.java    72% <0%> (-4%)                12% <0%> (-1%)

@d-cameron (Contributor, Author)
Alternative approaches:

  • Just remove the HashMap entirely and have a fixed-size cache of the small alleles you're likely to encounter. 680 Allele records cover all A/C/G/T ref and alt alleles up to 4bp in length; 2,728 cover up to 5bp. A fixed static cache works better when you're opening multiple files, as the existing cache is per-codec anyway.

  • String.intern() the short strings

  • Replace the map with a WeakHashMap<String, WeakReference<String>> so it doesn't leak
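The third alternative above can be sketched as follows. This is an illustrative implementation of a weak interning map, not code from this PR; once no live VCF record references a cached string, both the key and value become collectable:

```java
// Sketch of alternative 3: a weak interning map so cached allele strings
// can be reclaimed by GC once nothing else references them.
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

final class WeakInterner {
    // WeakHashMap keys are weakly held; the value is also a WeakReference
    // so the map entry does not strongly retain the interned string.
    private final Map<String, WeakReference<String>> cache = new WeakHashMap<>();

    synchronized String intern(final String s) {
        final WeakReference<String> ref = cache.get(s);
        final String cached = (ref == null) ? null : ref.get();
        if (cached != null) {
            return cached; // return the canonical instance
        }
        cache.put(s, new WeakReference<>(s));
        return s;
    }
}
```

Equal strings interned through the same instance collapse to one canonical object, but the map no longer pins them for the lifetime of the codec.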

@lbergelson (Member)
@d-cameron We had a similar proposal a while ago (#845) which fell apart because no one could agree on the right thing to do, so in the end we did nothing.

However, looking at this more, it looks to me like the caching / interning of the allele strings is entirely pointless. The string just gets passed into a call to Allele.create(), which then re-encodes it into bytes. I can't tell exactly, because the string encoding is complex and varies by locale, but it doesn't look to me like there is any sort of string -> byte cache in the encoding stage. It looks like maybe the right thing to do would be to not bother caching any of the allele strings. Then, in addition, we could create a cache of String -> Allele which would be used to reduce byte[] overhead from many duplicate AA alleles...

@yfarjoun Can you take a look and see if you agree with my interpretation of what's happening? How did we not see that before when we looked at this?
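The String -> Allele cache idea above might look something like this. It is a hypothetical sketch (the `Allele` class here is a minimal stand-in for htsjdk's, and the length threshold is an assumption); caching the final Allele deduplicates the byte[] payloads directly instead of interning an intermediate string:

```java
// Sketch of a String -> Allele cache (hypothetical; not htsjdk code).
// Only short, non-reference alleles are cached, since ref status is part
// of Allele identity and long SV alleles are unlikely to repeat.
import java.util.concurrent.ConcurrentHashMap;

final class AlleleCache {
    static final class Allele {
        final byte[] bases;
        final boolean isRef;
        Allele(final String bases, final boolean isRef) {
            this.bases = bases.getBytes();
            this.isRef = isRef;
        }
    }

    private static final ConcurrentHashMap<String, Allele> CACHE = new ConcurrentHashMap<>();
    private static final int MAX_CACHEABLE_LENGTH = 5; // assumed: covers common short alleles

    static Allele create(final String bases, final boolean isRef) {
        if (isRef || bases.length() > MAX_CACHEABLE_LENGTH) {
            return new Allele(bases, isRef); // don't cache ref or long alleles
        }
        return CACHE.computeIfAbsent(bases, b -> new Allele(b, false));
    }
}
```

With this shape, millions of duplicate short ALT alleles share one Allele instance (and one byte[]), while ref and SV alleles are allocated per call.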

@yfarjoun (Contributor)
I agree, @lbergelson. However, I would like to note that the internedString map is also used for filters, reference names, and genotypes. For those objects, the current solution might be OK.

@lbergelson (Member) left a comment

@d-cameron I think we should just remove the allele caching instead of making it conditional. Thanks for deduplicating some of the other code here. I have a few nitpicky comments about javadoc etc, then it should be good to merge.

* @param bases bases representing an allele
* @return true if the bases represent a symbolic allele in breakpoint notation
*/
public static boolean wouldBeBreakpoint(final byte[] bases) {

Could you provide an example of what these look like in the javadoc, lots of people never work with SV's and don't know what these are.
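The requested javadoc examples could use the bracket notation from the VCF spec, e.g. ALT = `A[chr2:321682[` or `G]17:198982]`. A sketch of the method with such examples added (illustrative only, not the actual htsjdk implementation):

```java
// Sketch: breakpoint detection with javadoc examples as requested.
final class BreakpointCheck {
    /**
     * @param bases bases representing an allele
     * @return true if the bases represent a symbolic allele in breakpoint
     *         notation, e.g. {@code A[chr2:321682[} or {@code G]17:198982]}
     */
    static boolean wouldBeBreakpoint(final byte[] bases) {
        for (final byte base : bases) {
            if (base == '[' || base == ']') {
                return true;
            }
        }
        return false;
    }
}
```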

return false;
else {
for (int i = 0; i < bases.length; i++) {
if (bases[i] == '[' || bases[i] == ']') {

Micro optimization to avoid checking the array twice.

Suggested change
if (bases[i] == '[' || bases[i] == ']') {
final byte base = bases[i];
if (base == '[' || base == ']') {

}
/**
* @param bases bases representing an allele
* @return true if the bases represent a symbolic allele in single breakend notation

again, could you add an example of one of these here?
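For the javadoc here, single breakends in VCF carry a `.` on one side of the bases, e.g. ALT = `.G` or `G.`. A sketch of what the check and its example might look like (assumed names; not the actual htsjdk implementation):

```java
// Sketch: single breakend detection, e.g. ALT = ".G" (joined before the
// base) or "G." (joined after), per VCF breakend notation.
final class SingleBreakendCheck {
    /**
     * @param bases bases representing an allele
     * @return true if the bases represent a symbolic allele in single
     *         breakend notation, e.g. {@code .G} or {@code G.}
     */
    static boolean wouldBeSingleBreakend(final byte[] bases) {
        return bases.length > 1
                && (bases[0] == '.' || bases[bases.length - 1] == '.');
    }
}
```

Note the length check excludes the bare `.` missing-allele marker.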

@@ -333,7 +335,13 @@ else if ( parts[2].equals(VCFConstants.EMPTY_ID_FIELD) )
builder.id(parts[2]);

final String ref = getCachedString(parts[3].toUpperCase());

It turns out the caching here is completely pointless. Just remove it from here.

Suggested change
final String ref = getCachedString(parts[3].toUpperCase());
final String ref = parts[3].toUpperCase();

@@ -333,7 +335,13 @@ else if ( parts[2].equals(VCFConstants.EMPTY_ID_FIELD) )
builder.id(parts[2]);

final String ref = getCachedString(parts[3].toUpperCase());
final String alts = getCachedString(parts[4]);
String alts = parts[4];

Same here, just remove the caching. It was a good idea to only cache short alleles, but it turns out the caching is broken anyway, so let's just remove it.

@@ -300,10 +300,36 @@ public static boolean wouldBeSymbolicAllele(final byte[] bases) {
if ( bases.length <= 1 )

Could you update the javadoc here to mention breakends like you did below?

@@ -421,6 +447,10 @@ public static Allele create(final Allele allele, final boolean ignoreRefState) {
// Returns true if this Allele is symbolic (i.e. no well-defined base sequence)
public boolean isSymbolic() { return isSymbolic; }

public boolean isBreakpoint() { return wouldBeBreakpoint(bases); }

javadoc here

@@ -421,6 +447,10 @@ public static Allele create(final Allele allele, final boolean ignoreRefState) {
// Returns true if this Allele is symbolic (i.e. no well-defined base sequence)
public boolean isSymbolic() { return isSymbolic; }

public boolean isBreakpoint() { return wouldBeBreakpoint(bases); }

public boolean isSingleBreakend() { return wouldBeSingleBreakend(bases); }

likewise

@@ -277,4 +277,34 @@ public void testExtend() {
Assert.assertEquals("ATCGA", Allele.extend(Allele.create("AT"), "CGA".getBytes()).toString());
Assert.assertEquals("ATCGA", Allele.extend(Allele.create("ATC"), "GA".getBytes()).toString());
}
@Test
public void testWouldBeSymbolic() {

Thank you for adding these tests!

@lbergelson (Member)

Hi @d-cameron,

Are you interested in continuing to work on this branch or would you like us to take it over for you?

@d-cameron (Contributor, Author)

@lbergelson happy for you to take it over. The change is relatively small, but the code review will be much faster if you only have to do it internally and not bounce back and forth to Australia for each revision. I have a few other projects on the go, and since I already have this patch packaged into GRIDSS, it's less urgent than the other work.

Thanks

@lbergelson (Member)

Closing this because it's been replaced by #1282.

@lbergelson closed this Feb 13, 2019