Speed up filtering by not processing entire file #789

Closed
trvrb opened this issue Nov 28, 2021 · 30 comments
Labels
enhancement New feature or request

Comments

@trvrb
Member

trvrb commented Nov 28, 2021

Context

I've been running builds locally to get context surrounding 21K viruses and I'd like to again flag augur filter subsampling as a major performance bottleneck. Here, I'm starting from already aligned and filtered data: https://github.com/nextstrain/ncov/blob/master/nextstrain_profiles/nextstrain-gisaid/builds.yaml#L23. I just wanted a context of ~200 background viruses from across the pandemic to include in a 21K focal build. I thinned down max_sequences for africa_early, etc... Doing this results in 12 calls that look like:

augur filter \
  --sequences nextstrain-ncov-private/filtered.fasta.xz \
  --metadata results/sanitized_metadata_gisaid.tsv.xz \
  --sequence-index results/combined_sequence_index.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --max-date 2021-07-24 \
  --exclude-where 'region!=Asia' \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --probabilistic-sampling \
  --output results/global/sample-asia_early.fasta \
  --output-strains results/global/sample-asia_early.txt

Each of these calls takes 25 minutes on my laptop. It's not so bad on the cluster, where they can be parallelized, but on a laptop it's rough, with ~5 hours of runtime for subsampling (and many users will be in this situation).

Feels especially crazy given that I just want a random 10 (!) entries from this table that fall into appropriate buckets.

Description

Would there be a way to read lines in random order trying to fill subsampling buckets and just stop once 10 entries have been used? We shouldn't have to find every entry with region=Asia and date<2021-07-24 to collect a random 10 distributed across "country year month" that fit the bill. You should be able to process just a small fraction of the file rather than needing to process the entire file.
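
To make the idea concrete, here's a rough sketch (a hypothetical helper, not augur code) of filling buckets and bailing out early; the catch, discussed later in this thread, is that visiting rows in a truly random order isn't free for a plain TSV:

import csv
import random

def sample_with_early_stop(metadata_path, target=10, group_cols=("country",), max_date="2021-07-24"):
    """Sketch: keep filling per-group buckets and stop as soon as `target` rows are kept."""
    with open(metadata_path, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    random.shuffle(rows)  # in practice this already requires reading the whole file

    kept, buckets = [], {}
    for row in rows:
        if row.get("region") != "Asia" or row.get("date", "") > max_date:
            continue
        group = tuple(row.get(column, "unknown") for column in group_cols)
        buckets.setdefault(group, []).append(row["strain"])
        kept.append(row["strain"])
        if len(kept) >= target:
            break  # early exit: the rest of the file is never examined
    return kept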

cc @huddlej @corneliusroemer

@trvrb added the enhancement New feature or request label Nov 28, 2021
@tsibley
Member

tsibley commented Nov 29, 2021

@trvrb Separately from the issue of speeding up the underlying augur filter, why can't you run these calls in parallel on your laptop too?

@huddlej
Contributor

huddlej commented Nov 29, 2021

The slowest part of augur filter is reading/writing sequence data. If you have multiple subsampling calls, the faster way to get the same output is to only output strain names for each subsampling call and then make a final call to augur filter to extract sequences for the lists of strains you want. One example of this is in the data prep guide for “select contextual data for your region of interest”. Each metadata-only (or metadata + sequence index) call should only take a couple of minutes.

We can rewrite the ncov workflow to optionally extract sequences for subsampled data only when priorities need to be calculated. Faster still would be to use metadata-based priorities where we calculate Jaccard distances between strains using pre-calculated mutations (from Nextclade or GISAID). I have implemented a prototype of this in ncov as "jaccard-priority". With a little more testing, we could get to metadata-only subsampling soon.

We could speed up the sequence extraction by using something other than (Bio)Python like samtools faidx or UCSC’s faSomeRecords.
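
For a rough sense of what indexed FASTA access looks like from Python, here's a sketch using pysam (which wraps htslib's faidx); note the .xz-compressed inputs above can't be random-accessed this way, so the FASTA would need to be uncompressed or bgzip-compressed first:

import pysam

# Assumes a .fai index has been built once, e.g. with `samtools faidx sequences.fasta`.
wanted = ["Wuhan/Hu-1/2019"]  # e.g. strains kept by a metadata-only filter call

fasta = pysam.FastaFile("sequences.fasta")
for strain in wanted:
    print(f">{strain}")
    print(fasta.fetch(strain))
fasta.close()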

There are some times when we could short-circuit the loop through the metadata or the sequences, but this approach won't currently work when we have force-included strains that have to be found in metadata and sequences.

@victorlin
Member

The slowest part of augur filter is reading/writing sequence data.

After doing a bit of testing using a dataset with 160k sequences I found on S3, I am seeing some evidence that metadata IO is slower than sequence IO.

This metadata-only call takes ~45s:

time augur filter \
  --metadata data/metadata.tsv \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --output-strains filter_strains.txt

Sampling at 0.018930879516601565 per group.
160278 strains were dropped during filtering
	1979 were dropped during grouping due to ambiguous month information
	158299 of these were dropped because of subsampling criteria
11 strains passed all filters
augur filter --metadata data/metadata.tsv --group-by country year month  10    44.56s user 0.43s system 99% cpu 45.000 total

While the comparable metadata + sequence call takes ~52s:

time augur filter \
  --sequences data/sequences.fasta \
  --sequence-index data/sequence_index.tsv \
  --metadata data/metadata.tsv \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --output-sequences data/filter_sequences.fasta

160293 strains were dropped during filtering
	16 had no metadata
	16 had no sequence data
	1979 were dropped during grouping due to ambiguous month information
	158282 of these were dropped because of subsampling criteria
12 strains passed all filters
augur filter --sequences data/sequences.fasta --sequence-index  --metadata     51.15s user 1.20s system 99% cpu 52.485 total

This means sequence IO only accounts for ~13% (7/52) of runtime.

I'm going to dig into where this slowness is coming from and see if it's possible to speed up.

Let me know if there's a better dataset I can test with.

@victorlin self-assigned this Nov 30, 2021
@tsibley
Member

tsibley commented Nov 30, 2021

@victorlin Not sure that comparison is telling the whole story because it's only outputting ~10 sequences (due to --subsample-max-sequences) instead of a more realistic number. Look at it this way, for example: the second command took an extra 7 seconds to output just 12 sequences. If that's steady, then as you ask for more sequences to be output, the sequence IO will account for a greater and greater portion of the total runtime.

@victorlin
Member

victorlin commented Nov 30, 2021

@tsibley good point, it can be broken down to:

  1. metadata read
  2. metadata write
  3. sequence read
  4. sequence write

My example above does a poor job of testing (2) and (4) since output is small. But it highlights the differences between (1) and (3).

For @trvrb's example which also has small output, I'm positing that (1) is the bottleneck instead of (3). There are probably improvements to be made across the board though. @huddlej's suggestion of replacing sequence extraction with samtools/faSomeRecords focuses on (3) but I don't think that's the issue here.

@huddlej
Contributor

huddlej commented Dec 1, 2021

Thank you for digging into this, @victorlin! You and @tsibley both make good points. Benchmarking this command is tricky since apparently minor changes to the input arguments or data formats can have a large effect on runtime.

I just reviewed my notes from testing augur filter back in September with a large ncov dataset and noticed that my conclusion about metadata vs. sequence time was based on filter commands that don't do any grouping. The group-by logic requires a second pass through the metadata, so it takes longer. I'll re-run my analysis from September and post the corresponding code here, so you can verify my results.

@huddlej
Contributor

huddlej commented Dec 1, 2021

@victorlin Here are my notes from September with updated runtimes for the latest data as of today.

Setup environment and data

# Create an environment with the tools we need.
mamba create -n ncov -c conda-forge -c bioconda \
  augur==13.0.2 \
  samtools \
  tabix \
  awscli

# Activate the environment.
conda activate ncov

# Download complete set of open aligned, filtered sequences.
curl -O https://data.nextstrain.org/files/ncov/open/filtered.fasta.xz

# Download complete open metadata.
curl -O https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz

# Index sequences.
# This took 34 minutes.
augur index \
  --sequences filtered.fasta.xz \
  --output sequence_index.tsv

# Compress index (forgot to do this above).
# This took 7 seconds.
gzip sequence_index.tsv

Compare subsampling performance for augur filter

# Select all metadata records matching a specific query.
# This took 2 minutes.
augur filter \
  --metadata metadata.tsv.gz \
  --query "division == 'Nebraska'" \
  --output-metadata nebraska_metadata.tsv \
  --output-strains nebraska_strains.txt

2490241 strains were dropped during filtering
	2490241 of these were filtered out by the query: "division == 'Nebraska'"
3530 strains passed all filters

# Select metadata and sequences matching a specific query.
# This took ~16 minutes.
augur filter \
  --metadata metadata.tsv.gz \
  --sequences filtered.fasta.xz \
  --sequence-index sequence_index.tsv.gz \
  --query "division == 'Nebraska'" \
  --output-metadata nebraska_metadata.tsv \
  --output-strains nebraska_strains.txt \
  --output-sequences nebraska_sequences.fasta

2490317 strains were dropped during filtering
	227031 had no sequence data
	2263286 of these were filtered out by the query: "division == 'Nebraska'"
3454 strains passed all filters

# Select 10 random metadata records matching a specific query.
# This took 2 minutes.
augur filter \
  --metadata metadata.tsv.gz \
  --query "division == 'Nebraska'" \
  --subsample-max-sequences 10 \
  --output-metadata nebraska_random_metadata.tsv \
  --output-strains nebraska_random_strains.txt

2493761 strains were dropped during filtering
	2490241 of these were filtered out by the query: "division == 'Nebraska'"
	3520 of these were dropped because of subsampling criteria
10 strains passed all filters

# Select 10 random metadata and sequences for a query.
# This took 15 minutes.
augur filter \
  --metadata metadata.tsv.gz \
  --sequences filtered.fasta.xz \
  --sequence-index sequence_index.tsv.gz \
  --query "division == 'Nebraska'" \
  --subsample-max-sequences 10 \
  --output-metadata nebraska_random_metadata.tsv \
  --output-strains nebraska_random_strains.txt \
  --output-sequences nebraska_random_sequences.fasta

2493761 strains were dropped during filtering
	227031 had no sequence data
	2263286 of these were filtered out by the query: "division == 'Nebraska'"
	3444 of these were dropped because of subsampling criteria
10 strains passed all filters

# Select specific sequences with prefiltered metadata
# and strain list.
# This took 16 minutes.
augur filter \
  --metadata nebraska_metadata.tsv \
  --sequences filtered.fasta.xz \
  --exclude-all \
  --include nebraska_strains.txt \
  --output-sequences nebraska_sequences.fasta

2263286 strains were dropped during filtering
	2263286 had no metadata
	3454 of these were dropped by `--exclude-all`
	3454 strains were added back because they were in nebraska_strains.txt
3454 strains passed all filters

Edit: I forgot to note that we've discussed switching the pandas parser back from engine="python" to engine="c". When I change to the C parser, the initial metadata-only filters (both one- and two-pass) run in 1 minute instead of 2.

@tsibley
Member

tsibley commented Dec 1, 2021

@victorlin Good way to break it down.

For @trvrb's example which also has small output, I'm positing that (1) is the bottleneck instead of (3). There are probably improvements to be made across the board though. @huddlej's suggestion of replacing sequence extraction with samtools/faSomeRecords focuses on (3) but I don't think that's the issue here.

The reasoning for (1) vs (3) dominating the runtime isn't clear to me. Could you explain?

And ultimately, the best thing to do here is actually profile the code.

@trvrb
Member Author

trvrb commented Dec 1, 2021

Thanks so much for digging in here @victorlin, @huddlej and @tsibley. This doesn't need to be the only use case, but you can quickly replicate the one I hit via the filter-test branch I just pushed up here: https://github.com/nextstrain/ncov/tree/filter-test

From this branch you can run locally as:

snakemake -j 1 -p --profile nextstrain_profiles/nextstrain-gisaid auspice/ncov_gisaid_global.json

It will take a few minutes to "sanitize" sequences, but you'll get to the 12 augur filter calls via the subsample rule pretty rapidly.

@victorlin
Member

victorlin commented Dec 1, 2021

The reasoning for (1) vs (3) dominating the runtime isn't clear to me. Could you explain?

@tsibley The metadata-only call took 45s while metadata+sequence call took 52s. So the additional sequence IO there had little effect on run time.

However, @huddlej's output says otherwise (2m metadata vs. 15m metadata+sequences), probably since it's a different dataset and not using --group-by. I'm running it locally to see if I can reproduce similar numbers.

And ultimately, the best thing to do here is actually profile the code.

Agreed. A few years back I used cProfile for this. If anyone has other suggestions I'd love to see what's being used for profiling nowadays.

EDIT: I'm having fun with cProfile + SnakeViz 🙂

@trvrb
Member Author

trvrb commented Dec 2, 2021

I just ran the identical command as in the original issue, but dropping all sequence references, so:

# 25 minutes on my laptop
time augur filter \
  --metadata results/sanitized_metadata_gisaid.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --min-date 2021-07-28 \
  --exclude-where 'region!=Europe' \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --probabilistic-sampling \
  --output-strains results/global/sample-europe_late.txt

And again confirming the timing of the original command

# 40 minutes on my laptop
time augur filter \
  --sequences nextstrain-ncov-private/filtered.fasta.xz \
  --metadata results/sanitized_metadata_gisaid.tsv.xz \
  --sequence-index results/combined_sequence_index.tsv.xz \
  --include defaults/include.txt \
  --exclude defaults/exclude.txt \
  --min-date 2021-07-28 \
  --exclude-where 'region!=Europe' \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --probabilistic-sampling \
  --output results/global/sample-europe_late.fasta \
  --output-strains results/global/sample-europe_late.txt

So 25 minutes with just metadata and 40 minutes with sequences. I would assume that --group-by or other logic is causing some poor scaling behavior (in addition to the overhead from sequence IO).

@victorlin
Member

victorlin commented Dec 2, 2021

@trvrb the current implementation of --group-by is computationally expensive, notably augur.filter.get_groups_for_subsampling, which loops through every valid sequence in the metadata file (post-filter) and determines its group:

augur/augur/filter.py

Lines 894 to 934 in d553f5e

for strain in strains:
    skip_strain = False
    group = []
    m = metadata.loc[strain].to_dict()
    # collect group specifiers
    for c in groups:
        if c == "_dummy":
            group.append(c)
        elif c in m:
            group.append(m[c])
        elif c in ['month', 'year'] and 'date' in m:
            try:
                year = int(m["date"].split('-')[0])
            except:
                skipped_strains.append({
                    "strain": strain,
                    "filter": "skip_group_by_with_ambiguous_year",
                    "kwargs": "",
                })
                skip_strain = True
                break
            if c == 'month':
                try:
                    month = int(m["date"].split('-')[1])
                except:
                    skipped_strains.append({
                        "strain": strain,
                        "filter": "skip_group_by_with_ambiguous_month",
                        "kwargs": "",
                    })
                    skip_strain = True
                    break
                group.append((year, month))
            else:
                group.append(year)
        else:
            group.append('unknown')
    if not skip_strain:
        group_by_strain[strain] = tuple(group)

Mainly, line 897 (`m = metadata.loc[strain].to_dict()`) is an inefficient usage of a pandas DataFrame (confirmed with the profiler). I'm looking to see if we can leverage built-in pandas utilities (e.g. pandas.DataFrame.groupby) to achieve what's being done here.
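
For illustration only (not the actual change): a vectorized version could look roughly like this, assuming the metadata has `date` and `country` columns and is indexed by strain name, and simplifying away the (year, month) pairing and "unknown" fallback:

import pandas as pd

def get_groups_vectorized(metadata: pd.DataFrame, group_cols=("country", "year", "month")):
    """Sketch: derive group keys column-wise instead of calling metadata.loc[strain] per row."""
    df = metadata.copy()
    # Split the date column once; missing parts become NaN instead of raising per row.
    date_parts = df["date"].str.split("-", expand=True)
    df["year"] = pd.to_numeric(date_parts[0], errors="coerce")
    df["month"] = pd.to_numeric(date_parts[1], errors="coerce") if date_parts.shape[1] > 1 else float("nan")
    # Mirror the skip logic above: drop strains with ambiguous year/month.
    df = df.dropna(subset=[c for c in ("year", "month") if c in group_cols])
    # One group tuple per strain, without a per-strain DataFrame lookup.
    return dict(zip(df.index, df[list(group_cols)].itertuples(index=False, name=None)))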

@huddlej
Contributor

huddlej commented Dec 2, 2021

Nice find, @victorlin! I'd love to see how you used SnakeViz with cProfile to track this down. I'm also curious how much we can reduce runtime from 25 minutes by switching to Pandas-based logic.

@victorlin
Member

victorlin commented Dec 2, 2021

This is the necessary change for profiling: 0f357be
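
For anyone who wants to reproduce this kind of profiling without the linked commit, the general pattern is just to wrap the call you care about with cProfile and dump the stats to a file SnakeViz can open (the `work()` function below is a stand-in, not augur code):

import cProfile
import pstats

def work():
    # Stand-in for the code being profiled (e.g. the augur filter entry point).
    return sum(i * i for i in range(1_000_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

profiler.dump_stats("filter.prof")  # then: snakeviz filter.prof
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # or inspect in the terminal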

I ran this command for 160k sequences and uploaded the profile result file to gist:

# 83.76s user 0.52s system 99% cpu 1:24.29 total
time augur filter \
  --metadata data/160k/metadata.tsv \
  --group-by country year month \
  --subsample-max-sequences 10 \
  --output-strains filter_strains.txt

If you want to look at profile results for yourself:

  1. Download 160k-sequences-current.prof.
  2. Go to https://nejc.saje.info/pstats-viewer.html. (SnakeViz web server hosted by a nice person on the internet)
  3. Select Python 3.
  4. Open downloaded file.

You will see something like this. Most of it is pandas internal methods, but it's clear that pandas __getitem__ in augur.filter.get_groups_for_subsampling and somewhere directly in augur.filter.run accounts for most of the run time. Unfortunately line 897 isn't in the call stack – I did some line removals locally to determine that one.

snakeviz 160k-sequences-current.prof

@victorlin
Member

Note: I also experimented with a couple other ideas to improve metadata read performance:

  1. Use pd.read_csv(sep = "\t", engine="c") as @huddlej mentioned above.
    This improved things slightly: 45s -> 42s in my example.
  2. Read a subset of the columns with pd.read_csv(usecols=["strain", "date", "country"]).
    This improved things significantly: 45s -> 27s in my example.

Both are separate from the refactor to use pandas functions, so I'll look into them later.
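
In case it's useful for reproducing, both tweaks are just arguments to pd.read_csv() (the column list below is illustrative; a real call would derive it from the requested filters):

import pandas as pd

metadata = pd.read_csv(
    "data/metadata.tsv",
    sep="\t",
    engine="c",                             # idea 1: the faster C parser
    usecols=["strain", "date", "country"],  # idea 2: only parse the columns the filter needs
)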

@victorlin
Member

victorlin commented Dec 3, 2021

First performance improvement PR is ready for review. See PR comment for detailed profiling results, but in summary it reduces the run time of augur.filter.get_groups_for_subsampling from 55s down to 5s.

@victorlin
Member

victorlin commented Dec 3, 2021

I just ran @trvrb's example with/without the PR and with/without profiling. Everything took much less than 25m each, maybe due to environment/machine differences, but the difference is still surprising.

  • Without PR changes:
    • Non-profile run time: 8m 40s
    • Profile run time: 13m 20s
      • get_groups_for_subsampling: 6m 19s
  • With PR changes:
    • Non-profile run time: 5m 48s
    • Profile run time: 7m 43s
      • get_groups_for_subsampling: 41s

@huddlej
Contributor

huddlej commented Dec 3, 2021

Thanks for doing this before/after analysis, @victorlin. I'm trying to recreate the same results locally (the slower of the two is running right now).

In addition to the slow inner loop issue you've sped up in #794, there is another potentially easy optimization: filter currently calculates groups for strains twice, once in the first pass and again in the second pass. We should think about an efficient way to cache the results from the first pass to use in the second. For example, running Trevor's example command with Augur 13.0.3 now, it took 5+ minutes to calculate groups the first time and now it's doing it again.
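
A minimal sketch of the caching idea (simplified stand-in code, not the actual filter internals): compute the strain-to-group mapping once and reuse it in the second pass instead of re-deriving it from metadata.

from collections import defaultdict

def assign_groups(rows, group_cols):
    """First pass: compute each strain's group key once and keep the mapping around."""
    return {
        row["strain"]: tuple(row.get(column, "unknown") for column in group_cols)
        for row in rows
    }

rows = [
    {"strain": "A/1", "country": "Kenya", "year": "2021"},
    {"strain": "B/2", "country": "Peru", "year": "2021"},
]
group_by_strain = assign_groups(rows, ("country", "year"))

# Second pass: reuse the cached mapping rather than recomputing groups.
queues_by_group = defaultdict(list)
for strain, group in group_by_strain.items():
    queues_by_group[group].append(strain)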

victorlin added a commit that referenced this issue Dec 3, 2021
Improvements to `augur.filter.get_groups_for_subsampling`, motivated by #789:

- Optimize with builtin pandas functions instead of inefficiently iterating through each row.
- Change logic to raise `FilterException` when `--group-by year month` is used without a `date` column (prep work to fix #795)
- Add tests in new file: tests/test_filter_groupby.py
@victorlin
Member

We should think about an efficient way to cache the results from the first pass to use in the second

I'm going to tackle this next, since it seems like another inefficient usage of a pandas DataFrame (line 1522):

augur/augur/filter.py

Lines 1519 to 1524 in af23a7b

for strain in sorted(group_by_strain.keys()):
    group = group_by_strain[strain]
    queues_by_group[group].add(
        metadata.loc[strain],
        priorities[strain],
    )

And this is the only reference for the second pass of get_groups_for_subsampling.

@tsibley
Member

tsibley commented Dec 4, 2021

@victorlin Rad! I've also used py-spy in the past for some Python profiling.

@trvrb
Member Author

trvrb commented Dec 4, 2021

Amazing progress here @victorlin. Really appreciate the speed ups.

@huddlej
Contributor

huddlej commented Dec 4, 2021

I got similar run times to Trevor with the same command:

  • Augur 13.0.3: 24 minutes
  • PR 794: 17 minutes
  • PR 794 and pandas C parser: 10 minutes

@victorlin
Member

victorlin commented Dec 10, 2021

Updated numbers on my machine for @trvrb's example of 3m sequences with the new PR.

With these PRs plus the two smaller improvements mentioned in #789 (comment), I think this covers all the low-hanging fruit for metadata read performance improvement without any major changes to how we process the data.

Based on profiling results (prof file here), the bottleneck is now reading the metadata file in its entirety. I'm not sure if it's even possible to do --probabilistic-sampling without processing the entire file, per @trvrb's original suggestion:

We shouldn't have to find every entry with region=Asia and date<2021-07-24 to collect a random 10 distributed across "country year month" that fit the bill. You should be able to process just a small fraction of the file rather than needing to process the entire file.

The current code must go through the file entirely to determine the number of distinct observed groups (e.g. (Asia, 2020-01), (Asia, 2020-03), (Asia, 2021-07), ...) then use Poisson sampling to determine the ~10 sequences to output. Is there a way to guarantee a randomly distributed sample by reading lines in random order?
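
For anyone following along, here's a rough sketch of the probabilistic step (not the exact augur implementation): the per-group rate depends on the total number of observed groups, which is only known after the full scan.

import numpy as np

def probabilistic_sample(group_by_strain, max_sequences=10, seed=0):
    """Sketch: draw a Poisson count per group, so a small target still covers many groups."""
    rng = np.random.default_rng(seed)
    strains_by_group = {}
    for strain, group in group_by_strain.items():
        strains_by_group.setdefault(group, []).append(strain)

    rate = max_sequences / len(strains_by_group)  # e.g. "Sampling at 0.0189... per group"
    sampled = []
    for strains in strains_by_group.values():
        take = min(rng.poisson(rate), len(strains))
        sampled.extend(rng.choice(strains, size=take, replace=False))
    return sampled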

@tsibley
Member

tsibley commented Dec 14, 2021

Is there a way to guarantee a randomly distributed sample by reading lines in random order?

Don't know, but it's worth noting that "reading lines in a random order" is not an entirely straightforward process either (and not that commonly done), and it comes with its own performance cost of seek time if the disk isn't an SSD (less and less common, but still out there, even in the cloud).

@victorlin
Member

A few more ideas, borrowed from this blog post:

  1. Specify dtypes: low hanging fruit similar to other parameter changes to pd.read_csv().
  2. Process chunks in parallel using multiprocessing: this has potential for substantial speedup, though need to experiment if we can compute the same data structures using multiprocessing shared state.
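
A rough sketch of what (1) and (2) could look like together (chunk size, dtypes, and the per-chunk group counting below are all guesses, not a worked-out design):

import pandas as pd
from collections import Counter

group_counts = Counter()
chunks = pd.read_csv(
    "data/metadata.tsv",
    sep="\t",
    dtype={"strain": "string", "country": "category", "date": "string"},  # idea 1: explicit dtypes
    chunksize=100_000,                                                     # idea 2: process in chunks
)
for chunk in chunks:
    # This per-chunk body is what a multiprocessing pool could run in parallel,
    # with the Counter reduction happening at the end.
    group_counts.update(zip(chunk["country"], chunk["date"].str.slice(0, 7)))  # (country, YYYY-MM)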

@victorlin
Member

"reading lines in a random order" is not an entirely straightforward process either

Right, this is an inherent limit of processing data from a text file.

Has a database-oriented model been considered? For example:

  1. Store metadata and sequences into a database (e.g. SQLite) using some new command augur load metadata.tsv sequences.fasta
  2. Compute some indexes.
  3. All augur subcommands connect to and update the database accordingly.

The initial loading+indexing will be computationally expensive, but it's a one-time step which should improve performance of subsequent operations.
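
A minimal sketch of what steps 1–2 could look like with sqlite3 + pandas (the `augur load` name, table name, and indexed columns are all hypothetical here):

import sqlite3
import pandas as pd

def load_metadata(metadata_path: str, db_path: str = "augur.sqlite") -> None:
    """Sketch of a hypothetical `augur load` step: one-time import plus indexes."""
    connection = sqlite3.connect(db_path)
    metadata = pd.read_csv(metadata_path, sep="\t", dtype="string")
    metadata.to_sql("metadata", connection, if_exists="replace", index=False)

    # Indexes make later filter/group-by operations cheap compared with rescanning a TSV.
    connection.execute("CREATE INDEX IF NOT EXISTS idx_metadata_strain ON metadata (strain)")
    connection.execute("CREATE INDEX IF NOT EXISTS idx_metadata_region_date ON metadata (region, date)")
    connection.commit()
    connection.close()

# Later subcommands could then query instead of re-reading the file, e.g.
# SELECT strain FROM metadata WHERE region = 'Asia' AND date <= '2021-07-24'
# ORDER BY RANDOM() LIMIT 10;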

@victorlin
Member

Has a database-oriented model been considered?

At least @corneliusroemer has considered this 😃 https://github.com/corneliusroemer/fasta_zstd_sqlite

Will work with him to prototype this into Augur.

@tsibley
Member

tsibley commented Dec 16, 2021

@victorlin

Has a database-oriented model been considered?

Yes, of course. :-) We've had previous conversations around this on Slack, GitHub, and in person, including at least @huddlej, @joverlee521, @corneliusroemer, and myself (and probably others). Searching for "sqlite" on Slack is probably the best way to find the myriad of threads. We haven't summarized any of those conversations, AFAIK.

So my gentle suggestion before diving into prototyping something into Augur is to catch up on the relevant prior context here and to also sketch out a design doc for how this would work within all the parts of Augur (to identify the hard spots and decision points earlier rather than later, solicit and incorporate design feedback, etc.).

@victorlin
Member

@tsibley Nice! I'll start with going through all the context on Slack and writing up a separate GitHub issue so we can continue centralized discussion there.

@victorlin
Member

victorlin commented Aug 15, 2022

Closing this issue since we have #1575 to keep track of augur filter speed improvements while processing the entire file. We also have #866 to continue the larger database discussion.

To close out the original idea:

Would there be a way to read lines in random order trying to fill subsampling buckets and just stop once 10 entries have been used?

As mentioned in #789 (comment), this isn't trivial. The closest option I know of is to use seek() on a file object to start reading from a specific byte position. However, that isn't practical in our case: each line has a different byte length, so it isn't possible to find the exact byte position of an arbitrary line without first reading everything before it, even if only to record line lengths.
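
To make that concrete, random access to lines only works after building a byte-offset index, which itself takes a full pass over the file:

import random

def build_line_offsets(path):
    """One full pass to record where each line starts; only then can seek() jump to a random line."""
    offsets = []
    with open(path, "rb") as handle:
        offset = 0
        for line in handle:
            offsets.append(offset)
            offset += len(line)
    return offsets

offsets = build_line_offsets("data/metadata.tsv")
with open("data/metadata.tsv", "rb") as handle:
    handle.seek(random.choice(offsets[1:]))  # offsets[0] is the header line
    print(handle.readline().decode())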

@victorlin closed this as not planned Aug 15, 2022