-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove caching of alleles in AbstractVCFCodec #1282
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1282 +/- ##
==============================================
+ Coverage 67.49% 67.929% +0.439%
- Complexity 8150 8483 +333
==============================================
Files 558 561 +3
Lines 33365 34324 +959
Branches 5608 5848 +240
==============================================
+ Hits 22518 23316 +798
- Misses 8658 8796 +138
- Partials 2189 2212 +23
|
ba32f98
to
a9065a3
Compare
a9065a3
to
5d993aa
Compare
@pshapiro4broad Could you review this branch? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking at the code, and seeing what it's trying to do, I wonder if there's a better way that can avoid this problem in the first place. The built in method String.intern()
creates a canonical copy (aka an interned copy) of a String. This has no memory leak downsides as the String intern table has an upper limit and will GC strings as necessary. If we have a test that can exercise this code (which I assume we had at one point, when this string caching code was written) it would be interesting to run it using String.intern()
and see if that's sufficient here.
public static boolean wouldBeBreakpoint(final byte[] bases) { | ||
if ( bases.length <= 1 ) | ||
return false; | ||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary else
after return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
if ( bases.length <= 1 ) | ||
return false; | ||
else { | ||
for (int i = 0; i < bases.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can use java foreach here
for (int i = 0; i < bases.length; i++) { | |
for (byte base : bases) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've been finding noticeable runtime costs for using foreach in hotspots (seemingly because it instantiates a new iterator each time which is much more costly than a fast for loop). Since Alleles are created in a lot of places I figured I'd stick with the for loop even though it's gross... That said, I haven't profiled this particular instance in any sensible way so I can't prove that this is any faster.
if ( bases.length <= 1 ) | ||
return false; | ||
else { | ||
return bases[0] == '.' || bases[bases.length - 1] == '.'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to use constants instead of the magic characters >
<
[
]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've extracted some... I can't decide if it's an improvement or not... You can take a look when I push the changes.
builder.log10PError(parseQual(parts[5])); | ||
|
||
final List<String> filters = parseFilters(getCachedString(parts[6])); | ||
if ( filters != null ) builder.filters(new HashSet<String>(filters)); | ||
if ( filters != null ) builder.filters(new HashSet<>(filters)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may as well format the if
across lines and use {
}
if you're changing this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -577,7 +577,7 @@ private static void checkAllele(String allele, boolean isRef, int lineNo) { | |||
System.err.println(String.format("Allele detected with length %d exceeding max size %d at approximately line %d, likely resulting in degraded VCF processing performance", allele.length(), MAX_ALLELE_SIZE_BEFORE_WARNING, lineNo)); | |||
} | |||
|
|||
if ( isSymbolicAllele(allele) ) { | |||
if (Allele.wouldBeSymbolicAllele(allele.getBytes())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is done before calling the Allele
constructor, but it appears to be ensuring that we create a valid Allele. Is there a reason why this code isn't part of the Allele
constructor? We would have to wrap the constructor call in a try/catch to get a message printed out with the VCF line number, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there's a big redundant section of allele checking code here that I believe is redundant with the constructor. I wanted to make this as small a change set as possible though. I started a follow up branch that cleans up some of this stuff but I figured I'd get this one in before that.
* @param bases bases representing an allele | ||
* @return true if the bases represent a symbolic allele in breakpoint notation, (ex: G]17:198982] or ]13:123456]T ) | ||
*/ | ||
public static boolean wouldBeBreakpoint(final byte[] bases) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of would
here confuses me. If this is a breakpoint allele string, then why not is
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's weird, but it matches the existing methods. I think changing 1 of them is worse than having this weird convention. We could change all of them though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's ok, I didn't see the rest of them.
@@ -277,4 +277,38 @@ public void testExtend() { | |||
Assert.assertEquals("ATCGA", Allele.extend(Allele.create("AT"), "CGA".getBytes()).toString()); | |||
Assert.assertEquals("ATCGA", Allele.extend(Allele.create("ATC"), "GA".getBytes()).toString()); | |||
} | |||
@Test | |||
public void testWouldBeSymbolic() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could rewrite these tests using a testng data provider
public void testWouldBeSymbolic() { | |
public void testWouldBeSymbolic(String allele, boolean expectedSymbolic) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, I didn't do all the tests in this file though, just the bottom ones..
spacing fix Co-Authored-By: lbergelson <louisb@broadinstitute.org>
This replaces #1264 and responds to my own comments.