Speed up filtering by not processing entire file #789
@trvrb Separately from the issue of speeding up the underlying
The slowest part of augur filter is reading/writing sequence data. If you have multiple subsampling calls, the faster way to get the same output is to output only strain names for each subsampling call and then make a final call to augur filter to extract sequences for the lists of strains you want. One example of this is in the data prep guide for "select contextual data for your region of interest". Each metadata-only (or metadata + sequence index) call should take only a couple of minutes.

We can rewrite the ncov workflow to extract sequences for subsampled data only when priorities need to be calculated. Faster still would be to use metadata-based priorities where we calculate Jaccard distances between strains using pre-calculated mutations (from Nextclade or GISAID). I have implemented a prototype of this in ncov as "jaccard-priority". With a little more testing, we could get to metadata-only subsampling soon.

We could speed up the sequence extraction by using something other than (Bio)Python, like samtools faidx or UCSC's faSomeRecords. There are times when we could short-circuit the loop through the metadata or the sequences, but this approach won't currently work when we have force-included strains that have to be found in both metadata and sequences.
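To make the metadata-based priorities idea concrete, here is a minimal sketch of Jaccard-distance priorities, assuming each strain already has a precomputed set of mutations (e.g. from a Nextclade substitutions column). The function names and data shapes are illustrative; the actual "jaccard-priority" prototype in ncov may differ.

```python
# Hypothetical sketch of metadata-based priorities via Jaccard distance.
# Assumes mutation sets are precomputed per strain; not the actual ncov code.

def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance = 1 - |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def priorities(focal: dict, contextual: dict) -> dict:
    """Score each contextual strain by its closest focal strain
    (negative distance, so higher priority = genetically closer)."""
    return {
        name: max(-jaccard_distance(muts, fmuts) for fmuts in focal.values())
        for name, muts in contextual.items()
    }

focal = {"focal/1": {"C241T", "A23403G"}}
contextual = {
    "bg/1": {"C241T", "A23403G", "G28881A"},  # shares 2 of its 3 mutations
    "bg/2": {"T13195C"},                      # shares none
}
scores = priorities(focal, contextual)
assert scores["bg/1"] > scores["bg/2"]
```

Because this needs only a mutations column, it avoids touching sequence data entirely.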
After doing a bit of testing using a dataset with 160k sequences I found on S3, I am seeing some evidence that metadata IO is slower than sequence IO. This metadata-only call takes ~45s:
While the comparable metadata + sequence call takes ~52s:
This means sequence IO only accounts for ~13% (7/52) of runtime. I'm going to dig into where this slowness is coming from and see if it's possible to speed up. Let me know if there's a better dataset I can test with.
@victorlin Not sure that comparison is telling the whole story because it's only outputting ~10 sequences (due to …)
@tsibley good point, it can be broken down to:

1. metadata read
2. metadata write
3. sequence read
4. sequence write
My example above does a poor job of testing (2) and (4) since output is small. But it highlights the differences between (1) and (3). For @trvrb's example which also has small output, I'm positing that (1) is the bottleneck instead of (3). There are probably improvements to be made across the board though. @huddlej's suggestion of replacing sequence extraction with samtools/faSomeRecords focuses on (3), but I don't think that's the issue here.
Thank you for digging into this, @victorlin! You and @tsibley both make good points. Benchmarking this command is tricky since apparently minor changes to the input arguments or data formats can have a large effect on runtime. I just reviewed my notes from testing augur filter back in September with a large ncov dataset and noticed that my conclusion about metadata vs. sequence time was based on filter commands that don't do any grouping. The group-by logic requires a second pass through the metadata, so it takes longer. I'll re-run my analysis from September and post the corresponding code here, so you can verify my results.
@victorlin Here are my notes from September with updated runtimes for the latest data as of today.

Setup environment and data
Compare subsampling performance for augur filter
Edit: I forgot to note that we've discussed changing back from using
@victorlin Good way to break it down.
The reasoning for (1) vs (3) dominating the runtime isn't clear to me. Could you explain? And ultimately, the best thing to do here is actually profile the code.
Thanks so much for digging in here @victorlin, @huddlej and @tsibley. This doesn't need to be the only use case, but you can quickly replicate the use case that I had. From this branch you can run locally as:
It will take a few minutes to "sanitize" sequences, but then you'll get to the 12 augur filter calls.
@tsibley The metadata-only call took 45s while metadata+sequence call took 52s. So the additional sequence IO there had little effect on run time. However, @huddlej's output says otherwise (2m metadata vs. 15m metadata+sequences), probably since it's a different dataset and not using
Agreed. A few years back I used cProfile for this. If anyone has other suggestions, I'd love to see what's being used for profiling nowadays. EDIT: I'm having fun with cProfile + SnakeViz 🙂
I just ran the identical command as in the original issue, but dropping all sequence references, so:
And again confirming the timing of the original command
So 25 minutes with just metadata and 40 minutes with sequences. I would assume that with
@trvrb the current implementation is at lines 894 to 934 in d553f5e.
Mainly, line 897 is inefficient usage of a pandas DataFrame (confirmed with profiler). I'm looking to see if we can leverage builtin pandas utilities (e.g. …)
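As an illustration of the kind of inefficiency being discussed (this is not the actual augur code), building groups by iterating over DataFrame rows is much slower than a single builtin `groupby` call, though both produce the same result:

```python
# Illustrative only: row-wise iteration vs. a vectorized pandas groupby.
import pandas as pd

metadata = pd.DataFrame({
    "strain": ["A", "B", "C", "D"],
    "country": ["USA", "USA", "Peru", "Peru"],
    "month": ["2021-06", "2021-07", "2021-06", "2021-06"],
})

# Slow pattern: build groups by iterating each row (O(n) Python-level loop).
groups_slow = {}
for _, row in metadata.iterrows():
    groups_slow.setdefault((row["country"], row["month"]), []).append(row["strain"])

# Fast pattern: one builtin groupby call, staying in vectorized pandas code.
groups_fast = metadata.groupby(["country", "month"])["strain"].apply(list).to_dict()

assert groups_slow == groups_fast
```

On small frames the difference is negligible, but row-wise iteration scales badly on the multi-million-row metadata files discussed here.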
Nice find, @victorlin! I'd love to see how you used SnakeViz with cProfile to track this down. I'm also curious how much we can reduce runtime from 25 minutes by switching to Pandas-based logic.
This is the necessary change for profiling: 0f357be

I ran this command for 160k sequences and uploaded the profile result file to gist:

```sh
time augur filter \
    --metadata data/160k/metadata.tsv \
    --group-by country year month \
    --subsample-max-sequences 10 \
    --output-strains filter_strains.txt
# 83.76s user 0.52s system 99% cpu 1:24.29 total
```

If you want to look at profile results for yourself:
You will see something like this. Most of it is pandas internal methods, but it's clear that pandas
Note: I also experimented with a couple other ideas to improve metadata read performance:
Both are separate from the refactor to use pandas functions, so I'll look into them later.
First performance improvement PR is ready for review. See PR comment for detailed profiling results, but in summary it reduces the run time of
I just ran @trvrb's example with/without the PR and with profiling. Everything took much less than 25m each, maybe due to environment/machine differences, but it's still surprising.
Thanks for doing this before/after analysis, @victorlin. I'm trying to recreate the same results locally (the slower of the two is running right now). In addition to the slow inner loop issue you've sped up in #794, there is another potentially easy optimization: filter currently calculates groups for strains twice; once in the first pass and again in the second pass. We should think about an efficient way to cache the results from the first pass to use in the second. For example, running Trevor's example command with Augur 13.0.3 now, it took 5+ minutes to calculate groups the first time and now it's doing it again.
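The caching idea can be sketched as follows. This is a hypothetical simplification of the two-pass subsampling, with made-up group keys (country, year-month); the real augur logic is more involved.

```python
# Hypothetical sketch: compute each strain's group once in the first pass
# and reuse the cached result in the second pass, instead of recomputing.
def group_key(record: dict) -> tuple:
    # Illustrative group key: (country, year-month).
    return (record["country"], record["date"][:7])

def two_pass_subsample(records: list, per_group: int) -> list:
    # First pass: assign groups once and cache the assignments.
    cache = {rec["strain"]: group_key(rec) for rec in records}

    # Second pass: reuse cached keys rather than re-deriving them per strain.
    chosen, counts = [], {}
    for rec in records:
        key = cache[rec["strain"]]
        if counts.get(key, 0) < per_group:
            counts[key] = counts.get(key, 0) + 1
            chosen.append(rec["strain"])
    return chosen

records = [
    {"strain": "A", "country": "USA", "date": "2021-06-01"},
    {"strain": "B", "country": "USA", "date": "2021-06-15"},
    {"strain": "C", "country": "Peru", "date": "2021-06-02"},
]
assert two_pass_subsample(records, per_group=1) == ["A", "C"]
```

The trade-off is memory: the cache holds one key per strain, which is usually far cheaper than a second multi-minute pass over the metadata.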
Improvements to `augur.filter.get_groups_for_subsampling`, motivated by #789: - Optimize with builtin pandas functions instead of inefficiently iterating through each row. - Change logic to raise `FilterException` when `--group-by year month` is used without a `date` column (prep work to fix #795) - Add tests in new file: tests/test_filter_groupby.py
I'm going to tackle this next, since it seems like another inefficient usage of a pandas DataFrame (lines 1519 to 1524 in af23a7b; mainly line 1522):
And this is the only reference for the second pass of
@victorlin Rad! I've also used py-spy in the past for some Python profiling.
Amazing progress here @victorlin. Really appreciate the speed ups.
I got similar run times to Trevor with the same command:
Updated numbers on my machine for @trvrb's example of 3m sequences with the new PR.
With these PRs plus the two smaller improvements mentioned in #789 (comment), I think this covers all the low-hanging fruit for metadata read performance improvement without any major changes to how we process the data. Based on profiling results (prof file here), the bottleneck is now reading the metadata file in its entirety. I'm not sure if it's even possible to do
The current code must go through the file entirely to determine the number of distinct observed groups (e.g. …)
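To spell out why this is hard to avoid: even when the metadata is streamed with bounded memory, discovering the full set of distinct groups still requires one complete pass over the file, because a new group can appear on the last line. A minimal sketch (the column names are illustrative):

```python
# Sketch: streaming the metadata keeps memory bounded, but the set of
# distinct groups is only known after the full pass, and that set is what
# the sequences-per-group target is computed from.
import csv
import io

# Stand-in for a large metadata.tsv file.
metadata_tsv = io.StringIO(
    "strain\tcountry\tdate\n"
    "A\tUSA\t2021-06-01\n"
    "B\tPeru\t2021-06-02\n"
    "C\tUSA\t2021-07-01\n"
)

groups = set()
for row in csv.DictReader(metadata_tsv, delimiter="\t"):
    groups.add((row["country"], row["date"][:7]))  # (country, year-month)

# Only now do we know how many groups exist.
assert len(groups) == 3
```

Any early-stopping scheme therefore has to either know the groups up front or accept an approximation.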
Don't know, but it's worth noting that "reading lines in a random order" is not an entirely straightforward process either (and not that commonly done) and comes with its own perf cost of seek time if the disk isn't an SSD (less and less common, but still out there, even in the cloud).
A few more ideas, borrowed from this blog post:
Right, this is an inherent limit of processing data from a text file. Has a database-oriented model been considered? For example:
The initial loading+indexing will be computationally expensive, but it's a one-time step which should improve performance of subsequent operations.
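One possible shape of that database-oriented idea, sketched with SQLite's standard library bindings. This is purely illustrative of the load-once/index-once/query-many pattern, not how Augur works today; table and column names are made up.

```python
# Illustrative sketch: load metadata into SQLite once, index the columns
# used for filtering/grouping, then let indexed queries do the work.
import sqlite3

conn = sqlite3.connect(":memory:")  # a real tool would use a file on disk
conn.execute("CREATE TABLE metadata (strain TEXT, country TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [("A", "USA", "2021-06-01"),
     ("B", "Peru", "2021-06-02"),
     ("C", "USA", "2021-07-01")],
)
# One-time indexing cost, amortized across subsequent filter operations.
conn.execute("CREATE INDEX idx_country_date ON metadata (country, date)")

# A filter like `country == USA and date < 2021-07-01` becomes an
# index-backed query instead of a full scan of a text file.
rows = conn.execute(
    "SELECT strain FROM metadata WHERE country = ? AND date < ?",
    ("USA", "2021-07-01"),
).fetchall()
assert [r[0] for r in rows] == ["A"]
```

ISO 8601 dates sort correctly as strings, which is why the `date < ?` comparison works here without a date type.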
At least @corneliusroemer has considered this 😃 https://github.com/corneliusroemer/fasta_zstd_sqlite Will work with him to prototype this into Augur.
Yes, of course. :-) We've had previous conversations around this on Slack, GitHub, and in person, including at least @huddlej, @joverlee521, @corneliusroemer, and myself (and probably others). Searching for "sqlite" on Slack is probably the best way to find the myriad of threads. We haven't summarized any of those conversations, AFAIK. So my gentle suggestion before diving into prototyping something into Augur is to catch up on the relevant prior context here and to also sketch out a design doc for how this would work within all the parts of Augur (to identify the hard spots and decision points earlier than later, solicit and incorporate design feedback, etc).
@tsibley Nice! I'll start with going through all the context on Slack and writing up a separate GitHub issue so we can continue centralized discussion there.
Closing this issue since we have #1575 to keep track of the follow-up work.

To close out the original idea:
As mentioned in #789 (comment), this isn't trivial. The closest option I know of is to use
Context
I've been running builds locally to get context surrounding 21K viruses and I'd like to again flag `augur filter` subsampling as a major performance bottleneck. Here, I'm starting from already aligned and filtered data: https://github.com/nextstrain/ncov/blob/master/nextstrain_profiles/nextstrain-gisaid/builds.yaml#L23. I just wanted a context of ~200 background viruses from across the pandemic to include in a 21K focal build. I thinned down `max_sequences` for `africa_early`, etc. Doing this results in 12 calls that look like:

Each of these calls takes 25 minutes on my laptop. It's not so bad on the cluster where they can be parallelized, but a laptop is rough, with ~5 hour runtime for subsampling (and many users will be in this situation).
Feels especially crazy given that I just want a random 10 (!) entries from this table that fall into appropriate buckets.
Description
Would there be a way to read lines in random order, trying to fill subsampling buckets, and just stop once 10 entries have been used? We shouldn't have to find every entry with `region=Asia` and `date<2021-07-24` to collect a random 10 distributed across "country year month" that fit the bill. It should be possible to process just a small fraction of the file rather than needing to process the entire file.
cc @huddlej @corneliusroemer