-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update GatherResult
to no longer store MinHashes
#2950
Comments
note connection to suggestions at bottom of #416 |
I have questions - are the dataclasses caching results, or recomputing them multiple times?? -- and the people demand answers! |
OK, #2962 tackles this for just |
looks like |
…#2962) This is kind of a patch-fix for #2950 for `sourmash gather` specifically. This PR changes `sourmash gather` and `sourmash multigather` so that they no longer store any `GatherResult` objects, thus decreasing memory usage substantially. The solution is hacky at several levels, including storing a CSV file in memory rather than writing it progressively. But I think it's an important fix to get in, since `gather` is one of our main use cases and it's causing people some problems (including me) :(. The PR also changes `--save-matches` so that it writes out sketches as they are encountered. This breaks semantic versioning a little bit because the target file for `--save-matches` is opened before any matches are found, and thus may be empty and may also overwrite files unnecessarily. Ultimately, a better fix is needed - probably one that changes up the dataclasses so that they don't store MinHashes - but such a fix is beyond me at the moment. ## benchmarking with latest @ e2c199f: 645 MB ``` Command being timed: "sourmash gather /home/ctbrown/transfer/SRR606249.trim.k31.sig.gz /home/ctbrown/transfer/podar-ref.zip -o xxx.csv" User time (seconds): 48.51 System time (seconds): 1.15 Percent of CPU this job got: 99% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.91 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 644900 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 156 Minor (reclaiming a frame) page faults: 254494 Voluntary context switches: 2412 Involuntary context switches: 2749 Swaps: 0 File system inputs: 31488 File system outputs: 64 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ``` with this branch: 215 MB ``` Command being timed: "sourmash gather /home/ctbrown/transfer/SRR606249.trim.k31.sig.gz /home/ctbrown/transfer/podar-ref.zip -o xxx.csv" User time (seconds): 43.38 System time (seconds): 0.89 Percent of CPU this job got: 97% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:45.58 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 215560 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 773 Minor (reclaiming a frame) page faults: 148722 Voluntary context switches: 3884 Involuntary context switches: 6174 Swaps: 0 File system inputs: 151648 File system outputs: 160 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ```
#2962 addresses the memory usage, but not the underlying problem. From the PR:
|
gather
code is very memory intensive; switch back to streaming?GatherResult
to no longer store MinHashes
over in https://github.com/ctb/2024-calc-full-gather/ I have implemented a simple script that takes fastgather output (from https://github.com/sourmash-bio/sourmash_plugin_branchwater/) and turns it into full gather output without redoing the searches - it literally just trusts the rank and match information from fastgather completely, and calculates all the stats.
this was easier than I expected because of the very nice
GatherResult
refactoring that @bluegenes did a while back in #1955!however it also revealed that #1955 probably added significantly to the memory footprint of gather, because the
GatherResult
dataclasses keep sketches in memory and they are retained throughout the full gather process.I figured this out when I noticed that my
calc-full-gather
script was running out of memory in the same way thatgather
was running out of memory, and in ctb/2024-calc-full-gather@a09215e I fixed it by discarding theGatherResult
objects after each result. It's now nice and low memory (if not exactly fast ;) - see #2943.I am also wondering if perhaps
PrefetchResult
has the same problem inprefetch
?We should fix the gather code in sourmash to be lower memory.
We probably need to do some kind of regression testing that tracks memory usage and the like, too.
viz sourmash-bio/sourmash_plugin_branchwater#187, #2943.
The text was updated successfully, but these errors were encountered: