
filter: Split filtering and subsampling #1432

Draft
victorlin wants to merge 5 commits into base: master
Conversation

victorlin (Member) commented Mar 7, 2024

Description of proposed changes

This is motivated by the idea that a future, more sophisticated subsampling command would benefit from exposing modularized subsampling functions from the existing augur filter code. That is much easier if subsampling is decoupled from the other parts of augur filter, namely the parts that coordinate strain lists and other variables across different chunks of metadata.

With the recent optimizations to reduce memory usage, it might be feasible to do everything in one pass of metadata, with all rows loaded into memory at once.
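
As a rough illustration of the kind of modularized function this decoupling would enable (the subsample name, signature, and even-allocation strategy below are hypothetical, not code from this PR), assuming all metadata fits in memory at once:

  import pandas as pd

  def subsample(metadata: pd.DataFrame, group_by: list, max_sequences: int) -> pd.Index:
      """Return the index of rows to keep, sampled evenly across groups.

      Sketch only: assumes the full metadata fits in memory at once.
      """
      n_groups = metadata.groupby(group_by, observed=True).ngroups
      per_group = max(1, max_sequences // n_groups)
      kept = (
          metadata
          .groupby(group_by, observed=True, group_keys=False)
          .apply(lambda group: group.sample(n=min(len(group), per_group)))
      )
      return kept.index

A caller could then do something like metadata.loc[subsample(metadata, ["month"], 100)] without knowing anything about chunk coordination.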


Commit message excerpts:

- This ensures that no output files are written on error.
- This is more memory efficient for columns with many duplicate values, which can be expected in most use cases.¹
- This simplifies the process and allows for more portable subsampling functions.

¹ <https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/scale.html#use-efficient-datatypes>
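
The memory optimization in the second excerpt refers to pandas' efficient datatypes (see the footnote); the core idea is converting repetitive string columns to the categorical dtype, which stores each unique value once plus small integer codes. A minimal illustration, with a hypothetical country column:

  import pandas as pd

  metadata = pd.read_csv("metadata.tsv", sep="\t")

  # Columns like region or country repeat a small set of values across many
  # rows; the categorical dtype stores one copy of each unique value.
  before = metadata["country"].memory_usage(deep=True)
  metadata["country"] = metadata["country"].astype("category")
  after = metadata["country"].memory_usage(deep=True)
  print(f"{before:,} -> {after:,} bytes")
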
victorlin self-assigned this Mar 7, 2024
victorlin (Member, Author) commented

52fbad6 passes all Cram tests but fails on the ncov trial run:

  File "/nextstrain/augur/augur/io/metadata.py", line 155, in read_metadata
    return pd.read_csv(
  …
OSError: [Errno 12] Cannot allocate memory

This means that given its current and growing size, the SARS-CoV-2 dataset is impractical to load into memory all at once despite optimizations such as loading a subset of columns and 8d7206c. The only solution for more modular and portable functions in the codebase is an on-disk approach, which Pandas does not support. There is a spectrum of alternatives, which I think can be divided into two categories:

  1. Pandas-like alternative such as Dask. Unsure how portable the existing Pandas logic is to Dask, but ideally this would be close to a library swap with minimal code change.
  2. Database file approach such as SQLite, started in filter: Rewrite using SQLite3 #1242. It's essentially a rewrite of augur filter, requiring extensive testing. The other downside is that this will require some form of Pandas roundtripping to continue supporting the widely used Pandas-based queries in --query; a rough sketch of that roundtripping follows below.
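
A rough sketch of that roundtripping, assuming the metadata already lives in a SQLite table named "metadata" (all names here are hypothetical):

  import sqlite3
  import pandas as pd

  # Hypothetical setup: metadata has already been loaded into SQLite.
  connection = sqlite3.connect("metadata.db")

  # Stream rows back into pandas in chunks so existing pandas-syntax --query
  # expressions can still be evaluated with DataFrame.query().
  pandas_query = "(country == 'New Zealand') and (date > '2023-01-01')"
  kept_chunks = [
      chunk.query(pandas_query)
      for chunk in pd.read_sql("SELECT * FROM metadata", connection, chunksize=100_000)
  ]
  filtered = pd.concat(kept_chunks)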

I think it's reasonable to explore (1) on top of current changes in this PR to see if it's a viable option. Even if the end goal is (2), (1) would provide a good stepping stone in the direction of the database approach.
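
For a sense of what that library swap might look like (a sketch only, not tested against the real augur code), dask.dataframe mirrors much of the pandas API while reading data lazily in partitions:

  import dask.dataframe as dd

  # Partitions are read lazily from disk, so the whole file never needs to
  # be in memory at once.
  metadata = dd.read_csv("metadata.tsv", sep="\t", dtype=str, blocksize="64MB")

  # Much of the pandas API carries over unchanged; .compute() materializes
  # the (small) result.
  counts = metadata.groupby("country").size().compute()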

huddlej (Contributor) commented Mar 8, 2024

> The only solution for more modular and portable functions in the codebase is an on-disk approach, which Pandas does not support.

I agree with this, @victorlin, and I'm sorry I didn't notice this PR direction earlier since I probably could have saved you some effort here.

The original idea that motivated this PR's work depends unrealistically on keeping all data in memory, but augur filter couldn't load all available SARS-CoV-2 data in memory as of June 2021. At that time, @tsibley and I discussed some of the options you've mentioned above, including Dask and SQLite databases. Even though we both preferred a database solution, we opted for the current chunked iterator approach because it was simpler to implement under the time pressure to keep the ncov workflow running. Now that we have a functional (albeit slow) filter command, we have more freedom to push in a more experimental direction like the database approach you've led. If a database-backed augur filter is a key step toward supporting flexible subsampling, it makes sense to me to devote more effort as a team toward that work.

It also seems like the current augur subsample prototype could be useful immediately even if it isn't as fast or otherwise optimal as we'd like. I think this might be @jameshadfield's feeling about the situation, too. I could imagine that the path to better and faster subsampling could be to 1) decide on the minimal viable augur subsample interface that will work for us now, 2) push forward database-backed augur filter to simplify and speed up internal subsampling logic, and 3) refine the internal interfaces between subsample and filter as needed.

victorlin (Member, Author) commented Mar 8, 2024

@huddlej no worries at all, thanks for chiming in! I agree that the external interface is more important. I opened this PR because the thought of ditching the chunked approach keeps coming up in my mind while thinking about the internal implementation of augur subsample, and I had wondered if there were any non-database approaches. Now it's clear what the options are, and I don't intend to spend much more effort here.

When we get back to the work on speeding up internal subsampling logic, some experimentation with Dask may be worthwhile to check if it's a viable alternative to a complete rewrite with SQLite (or at least used in place of the chunked Pandas querying currently used in #1242).

EDIT: Dumped some Dask experimentation into a commit: 946d93f

jameshadfield (Member) commented

> It also seems like the current augur subsample prototype could be useful immediately even if it isn't as fast or otherwise optimal as we'd like. I think this might be @jameshadfield's feeling about the situation, too. I could imagine that the path to better and faster subsampling could be to 1) decide on the minimal viable augur subsample interface that will work for us now, 2) push forward database-backed augur filter to simplify and speed up internal subsampling logic, and 3) refine the internal interfaces between subsample and filter as needed.

Yes -- augur subsample can work just as prototyped and it'll be fast enough for probably all non-nCoV datasets. We can then iterate on the implementation details and performance over time. I had hoped that augur filter could conditionally load everything into memory or parse it in chunks, but reading this PR it seems that's going to be more difficult than I thought.

trvrb (Member) commented Apr 12, 2024

How far could we get with just using tsv-utils? tsv-sample supports random and weighted sampling. I've been amazed at the performance of tsv-utils when working with ncov metadata.

Simple and weighted random sampling use reservoir sampling and only need to hold the specified sample size (--n|num) in memory.
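
To make that quoted claim concrete: reservoir sampling (Algorithm R) holds only the k sampled items in memory, however long the input stream. A minimal Python sketch of the idea, independent of tsv-utils' actual implementation:

  import random

  def reservoir_sample(stream, k):
      """Uniform random sample of k items from a stream of unknown length,
      holding only k items in memory (Algorithm R)."""
      reservoir = []
      for i, item in enumerate(stream):
          if i < k:
              reservoir.append(item)
          else:
              # Keep each later item with probability k / (i + 1).
              j = random.randint(0, i)
              if j < k:
                  reservoir[j] = item
      return reservoir

  # e.g. sample 100 data rows from a TSV without loading the whole file:
  with open("metadata.tsv") as lines:
      header = next(lines)
      sample = reservoir_sample(lines, 100)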

victorlin (Member, Author) commented

I just took tsv-sample for a spin on ncov's 100k subsampled dataset. There are small gaps, such as the lack of compressed file support, that we can easily work around with a wrapper. However, there are also things augur filter can do which tsv-sample cannot, for example uniform sampling over time:

augur filter \
  --metadata metadata.tsv.xz \
  --group-by month \
  --subsample-max-sequences 100 \
  --output-metadata filtered.tsv
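
The grouping logic behind --group-by month looks roughly like this in pandas (a sketch of the idea only, not augur's actual implementation; the date column and month bucketing are assumptions):

  import pandas as pd

  metadata = pd.read_csv("metadata.tsv", sep="\t")

  # Bucket rows by calendar month, then draw roughly the same number from
  # each bucket, so heavily sampled months don't dominate the result.
  month = pd.to_datetime(metadata["date"], errors="coerce").dt.to_period("M")
  per_month = max(1, 100 // month.nunique())
  sampled = (
      metadata
      .groupby(month, group_keys=False)
      .apply(lambda g: g.sample(n=min(len(g), per_month)))
  )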

The weighted sampling available in tsv-sample requires a column of per-row weights, which is analogous to augur filter --priorities and not quite the same as the weighted sampling we are talking about in #1318 (I believe this was one of @huddlej's initial points of confusion).
