-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
making changes to reduce size of giant interval lists #1309
Conversation
@samuelklee Simple changes to reduce the in memory size of interval lists. |
Codecov Report
@@ Coverage Diff @@
## master #1309 +/- ##
===============================================
+ Coverage 67.728% 68.724% +0.996%
- Complexity 8225 8855 +630
===============================================
Files 560 562 +2
Lines 33493 35503 +2010
Branches 5637 6165 +528
===============================================
+ Hits 22684 24399 +1715
- Misses 8632 8835 +203
- Partials 2177 2269 +92
|
Thanks for making these changes! I think the per-contig OverlapDetector you suggested was enough to bring down memory in CollectReadCounts, so I didn't see any change there. However, I do see a substantial improvement in ModelSegments, which still uses global OverlapDetectors. I was able to run with ~10M bins and <8GB memory, while master required ~10GB. Runtime also actually went down slightly from 18 minutes to 17 minutes, but that might be within the noise. |
if (alreadyThere != null) { | ||
alreadyThere.add(object); | ||
tree.put(start, end, alreadyThere); | ||
if( alreadyThere.size() == 1){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should be implemented within the IntervalTree class as a computeIfPresent
method...
alreadyThere.add(object); | ||
tree.put(start, end, alreadyThere); | ||
if( alreadyThere.size() == 1){ | ||
Set<T> mutableSet = new HashSet<>(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm. better find out if you got a mutable one and just add the new element if so...no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean with an instanceof check? As it's written it should be immutable always if it's exactly size 1 and mutable otherwise. It's a bit awkward but I thought maybe relying on the size was better than relying on the class. I can change that though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some changes requested...thanks for this.
@yfarjoun Could you take a look. The implementation in interval tree is slightly more complicated because it has to deal with arbitrary sentinel values instead of just null, and I added some tests because I realized that the conditions I cared about didn't look well covered. |
* @param start interval start position | ||
* @param end interval end position | ||
* @param value value to merge into the tree, must not be equal to the sentinel value | ||
* @param remappingFunction a function that merges the new value with the existing value for the same start and end position |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note that if function returns null the region will be left empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
* @param end interval end position | ||
* @param value value to merge into the tree, must not be equal to the sentinel value | ||
* @param remappingFunction a function that merges the new value with the existing value for the same start and end position | ||
* @return the updated value that is now stored in the tree or sentinel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now
is not clear enough...(does it refer to before or after the application of the method?)
perhaps the updated value that is eventually stored in the tree or the sentinel value if nothing ends up being stored
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with a slight modification:
the updated value that is stored in the tree after the completion of this merge operation, this will be the sentinel value if nothing ended up being stored
``
} | ||
} | ||
tree.merge(start, end, | ||
Collections.singleton(object), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wierd spacing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -56,6 +60,8 @@ public void clear() | |||
mRoot = null; | |||
} | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra nls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
// Sets of size 1 are immutable SingletonSets so we have to make a new | ||
// mutable one to add to | ||
if (oldValue.size() == 1) { | ||
Set<T> newMutableSet = new HashSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can replace entire if statement with:
final Set<T> newMutableSet = oldValue.size() == 1 ? new HashSet<>() : oldValue;
newMutableSet.addAll(oldValue);
newMutableSet.addAll(newValue);
return newMutableSet;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, clever, it's a set so you can unconditionally add the same thing over again...
final V alreadyPresent = put(start, end, value); | ||
if (!Objects.equals(alreadyPresent, mSentinel)) { | ||
final V newComputedValue = remappingFunction.apply(value, alreadyPresent); | ||
if(Objects.equals(newComputedValue, mSentinel)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice. I agree that it's a bit more complicated, but I think that the result is a little nicer...I hope you agree...
@yfarjoun It's an improvement. Thanks for the review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
@yfarjoun Thanks for the reviews! |
Description
Two separate changes which should reduce the in-memory size of IntervalList and OverlapDetector.
1: Cache the last seen contig name when loading IntervalLists and reuse the same String to avoid keeping many duplicate copies of "chr1" in memory. This should have very minor overhead since interval files are sorted it should be about as effective as a more complicated caching scheme.
Checklist