MRG: fix gather memory usage issue by not accumulating `GatherResult` (#2962)

This is kind of a patch-fix for #2950 for `sourmash gather` specifically.

This PR changes `sourmash gather` and `sourmash multigather` so that they no longer store any `GatherResult` objects, thus decreasing memory usage substantially.

The solution is hacky at several levels, including storing a CSV file in memory rather than writing it progressively. But I think it's an important fix to get in, since `gather` is one of our main use cases and it's causing people some problems (including me) :(.

The PR also changes `--save-matches` so that it writes out sketches as they are encountered. This breaks semantic versioning a little bit because the target file for `--save-matches` is opened before any matches are found, and thus may be empty and may also overwrite files unnecessarily.

Ultimately, a better fix is needed - probably one that changes up the dataclasses so that they don't store MinHashes - but such a fix is beyond me at the moment.

## benchmarking

with latest @ e2c199f: 645 MB

```
Command being timed: "sourmash gather /home/ctbrown/transfer/SRR606249.trim.k31.sig.gz /home/ctbrown/transfer/podar-ref.zip -o xxx.csv"
User time (seconds): 48.51
System time (seconds): 1.15
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.91
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 644900
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 156
Minor (reclaiming a frame) page faults: 254494
Voluntary context switches: 2412
Involuntary context switches: 2749
Swaps: 0
File system inputs: 31488
File system outputs: 64
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

with this branch: 215 MB

```
Command being timed: "sourmash gather /home/ctbrown/transfer/SRR606249.trim.k31.sig.gz /home/ctbrown/transfer/podar-ref.zip -o xxx.csv"
User time (seconds): 43.38
System time (seconds): 0.89
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:45.58
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 215560
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 773
Minor (reclaiming a frame) page faults: 148722
Voluntary context switches: 3884
Involuntary context switches: 6174
Swaps: 0
File system inputs: 151648
File system outputs: 160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
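
## sketch of the approach

For illustration only, here is a minimal sketch of the general idea: serialize each result to a CSV row in an in-memory buffer as soon as it is produced, instead of keeping the result objects around. This is not the code in this PR. `FakeResult`, `iter_results`, and `results_to_csv_text` are hypothetical names, and the `payload` list merely stands in for whatever large data a real result object holds.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Hypothetical stand-in for a heavy result object (not sourmash's GatherResult)."""
    name: str
    f_match: float
    # Placeholder for large per-result data, e.g. a stored sketch.
    payload: list = field(default_factory=lambda: [0] * 10_000)


def iter_results(n):
    """Yield results one at a time so only one heavy object is alive at once."""
    for i in range(n):
        yield FakeResult(name=f"genome_{i}", f_match=1.0 / (i + 1))


def results_to_csv_text(results):
    """Write each result to an in-memory CSV immediately; keep only the text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "f_match"])
    writer.writeheader()
    n_rows = 0
    for r in results:
        # Only the small, printable fields are retained; the result object
        # itself is not appended to any list.
        writer.writerow({"name": r.name, "f_match": f"{r.f_match:.3f}"})
        n_rows += 1
    return buf.getvalue(), n_rows


csv_text, n_rows = results_to_csv_text(iter_results(3))
```

The buffer still grows with the number of rows, but each row is a short string rather than an object holding sketch data, which is the trade-off described above. Writing rows directly to the output file would avoid even that buffer.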