
filter: Split filtering and subsampling #1432

Draft
victorlin wants to merge 5 commits into base: master
Conversation

victorlin (Member) commented Mar 7, 2024

Description of proposed changes

This is motivated by the idea that a future, more sophisticated subsampling command would benefit from exposing modularized subsampling functions from the existing augur filter code. That is much easier if subsampling is decoupled from the other parts of augur filter, namely the parts that coordinate strain lists and other variables across different chunks of metadata.

With the recent optimizations to reduce memory usage, it might be feasible to do everything in one pass of metadata, with all rows loaded into memory at once.
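
As a rough illustration of the kind of modularized function this decoupling would enable (the subsample name, signature, and even-allocation strategy below are hypothetical, not code from this PR), assuming all metadata fits in memory at once:

  import pandas as pd

  def subsample(metadata: pd.DataFrame, group_by: list, max_sequences: int) -> pd.Index:
      """Return the index of rows to keep, sampled evenly across groups.

      Sketch only: assumes the full metadata fits in memory at once.
      """
      n_groups = metadata.groupby(group_by, observed=True).ngroups
      per_group = max(1, max_sequences // n_groups)
      kept = (
          metadata
          .groupby(group_by, observed=True, group_keys=False)
          .apply(lambda group: group.sample(n=min(len(group), per_group)))
      )
      return kept.index

A caller could then do something like metadata.loc[subsample(metadata, ["month"], 100)] without knowing anything about chunk coordination.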


Commit message excerpts:

- This ensures that no output files are written on error.
- This is more memory efficient for columns with many duplicate values, which can be expected in most use cases.¹
- This simplifies the process and allows for more portable subsampling functions.

¹ <https://pandas.pydata.org/pandas-docs/version/1.5/user_guide/scale.html#use-efficient-datatypes>
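
The memory optimization in the second excerpt refers to pandas' efficient datatypes (see the footnote); the core idea is converting repetitive string columns to the categorical dtype, which stores each unique value once plus small integer codes. A minimal illustration, with a hypothetical country column:

  import pandas as pd

  metadata = pd.read_csv("metadata.tsv", sep="\t")

  # Columns like region or country repeat a small set of values across many
  # rows; the categorical dtype stores one copy of each unique value.
  before = metadata["country"].memory_usage(deep=True)
  metadata["country"] = metadata["country"].astype("category")
  after = metadata["country"].memory_usage(deep=True)
  print(f"{before:,} -> {after:,} bytes")
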
victorlin self-assigned this Mar 7, 2024
victorlin (Member, Author) commented

52fbad6 passes all Cram tests but fails on the ncov trial run:

  File "/nextstrain/augur/augur/io/metadata.py", line 155, in read_metadata
    return pd.read_csv(
  …
OSError: [Errno 12] Cannot allocate memory

This means that given its current and growing size, the SARS-CoV-2 dataset is impractical to load into memory all at once despite optimizations such as loading a subset of columns and 8d7206c. The only solution for more modular and portable functions in the codebase is an on-disk approach, which Pandas does not support. There is a spectrum of alternatives, which I think can be divided into two categories:

  1. Pandas-like alternative such as Dask. Unsure how portable the existing Pandas logic is to Dask, but ideally this would be close to a library swap with minimal code change.
  2. Database file approach such as SQLite, started in filter: Rewrite using SQLite3 #1242. It's essentially a rewrite of augur filter, requiring extensive testing. The other downside is that this will require some form of Pandas roundtripping to continue supporting the widely used Pandas-based queries in --query; a rough sketch of that roundtripping follows below.
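
A rough sketch of that roundtripping, assuming the metadata already lives in a SQLite table named "metadata" (all names here are hypothetical):

  import sqlite3
  import pandas as pd

  # Hypothetical setup: metadata has already been loaded into SQLite.
  connection = sqlite3.connect("metadata.db")

  # Stream rows back into pandas in chunks so existing pandas-syntax --query
  # expressions can still be evaluated with DataFrame.query().
  pandas_query = "(country == 'New Zealand') and (date > '2023-01-01')"
  kept_chunks = [
      chunk.query(pandas_query)
      for chunk in pd.read_sql("SELECT * FROM metadata", connection, chunksize=100_000)
  ]
  filtered = pd.concat(kept_chunks)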

I think it's reasonable to explore (1) on top of current changes in this PR to see if it's a viable option. Even if the end goal is (2), (1) would provide a good stepping stone in the direction of the database approach.
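
For a sense of what that library swap might look like (a sketch only, not tested against the real augur code), dask.dataframe mirrors much of the pandas API while reading data lazily in partitions:

  import dask.dataframe as dd

  # Partitions are read lazily from disk, so the whole file never needs to
  # be in memory at once.
  metadata = dd.read_csv("metadata.tsv", sep="\t", dtype=str, blocksize="64MB")

  # Much of the pandas API carries over unchanged; .compute() materializes
  # the (small) result.
  counts = metadata.groupby("country").size().compute()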

huddlej (Contributor) commented Mar 8, 2024

> The only solution for more modular and portable functions in the codebase is an on-disk approach, which Pandas does not support.

I agree with this, @victorlin, and I'm sorry I didn't notice this PR direction earlier since I probably could have saved you some effort here.

The original idea that motivated this PR's work depends unrealistically on keeping all data in memory, but augur filter couldn't load all available SARS-CoV-2 data in memory as of June 2021. At that time, @tsibley and I discussed some of the options you've mentioned above, including Dask and SQLite databases. Even though we both preferred a database solution, we opted for the current chunked iterator approach because it was simpler to implement under the time pressure to keep the ncov workflow running. Now that we have a functional (albeit slow) filter command, we have more freedom to push in a more experimental direction like the database approach you've led. If a database-backed augur filter is a key step toward supporting flexible subsampling, it makes sense to me to devote more effort as a team toward that work.

It also seems like the current augur subsample prototype could be useful immediately even if it isn't as fast or otherwise optimal as we'd like. I think this might be @jameshadfield's feeling about the situation, too. I could imagine that the path to better and faster subsampling could be to 1) decide on the minimal viable augur subsample interface that will work for us now, 2) push forward database-backed augur filter to simplify and speed up internal subsampling logic, and 3) refine the internal interfaces between subsample and filter as needed.

victorlin (Member, Author) commented Mar 8, 2024

@huddlej no worries at all, thanks for chiming in! I agree that the external interface is more important. I opened this PR because the thought of ditching the chunked approach keeps coming up in my mind while thinking about the internal implementation of augur subsample, and I had wondered if there were any non-database approaches. Now it's clear what the options are, and I don't intend to spend much more effort here.

When we get back to the work on speeding up internal subsampling logic, some experimentation with Dask may be worthwhile to check if it's a viable alternative to a complete rewrite with SQLite (or at least used in place of the chunked Pandas querying currently used in #1242).

EDIT: Dumped some Dask experimentation into a commit: 946d93f

jameshadfield (Member) commented

> It also seems like the current augur subsample prototype could be useful immediately even if it isn't as fast or otherwise optimal as we'd like. I think this might be @jameshadfield's feeling about the situation, too. I could imagine that the path to better and faster subsampling could be to 1) decide on the minimal viable augur subsample interface that will work for us now, 2) push forward database-backed augur filter to simplify and speed up internal subsampling logic, and 3) refine the internal interfaces between subsample and filter as needed.

Yes -- augur subsample can work just as prototyped and it'll be fast enough for probably all non-nCoV datasets. We can then iterate on the implementation details and performance over time. I had hoped that augur filter could conditionally load everything into memory or parse it in chunks, but reading this PR it seems that's going to be more difficult than I thought.

trvrb (Member) commented Apr 12, 2024

How far could we get with just using tsv-utils? tsv-sample supports random and weighted sampling. I've been amazed at the performance of tsv-utils when working with ncov metadata.

Simple and weighted random sampling use reservoir sampling and only need to hold the specified sample size (--n|num) in memory.
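
To make that quoted claim concrete: reservoir sampling (Algorithm R) holds only the k sampled items in memory, however long the input stream. A minimal Python sketch of the idea, independent of tsv-utils' actual implementation:

  import random

  def reservoir_sample(stream, k):
      """Uniform random sample of k items from a stream of unknown length,
      holding only k items in memory (Algorithm R)."""
      reservoir = []
      for i, item in enumerate(stream):
          if i < k:
              reservoir.append(item)
          else:
              # Keep each later item with probability k / (i + 1).
              j = random.randint(0, i)
              if j < k:
                  reservoir[j] = item
      return reservoir

  # e.g. sample 100 data rows from a TSV without loading the whole file:
  with open("metadata.tsv") as lines:
      header = next(lines)
      sample = reservoir_sample(lines, 100)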

victorlin (Member, Author) commented

I just took tsv-sample for a spin on ncov's 100k subsampled dataset. There are small gaps, such as the lack of compressed file support, that we can easily work around with a wrapper. However, there are also things augur filter can do which tsv-sample cannot, for example uniform sampling over time:

augur filter \
  --metadata metadata.tsv.xz \
  --group-by month \
  --subsample-max-sequences 100 \
  --output-metadata filtered.tsv
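
The grouping logic behind --group-by month looks roughly like this in pandas (a sketch of the idea only, not augur's actual implementation; the date column and month bucketing are assumptions):

  import pandas as pd

  metadata = pd.read_csv("metadata.tsv", sep="\t")

  # Bucket rows by calendar month, then draw roughly the same number from
  # each bucket, so heavily sampled months don't dominate the result.
  month = pd.to_datetime(metadata["date"], errors="coerce").dt.to_period("M")
  per_month = max(1, 100 // month.nunique())
  sampled = (
      metadata
      .groupby(month, group_keys=False)
      .apply(lambda g: g.sample(n=min(len(g), per_month)))
  )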

The weighted sampling available in tsv-sample requires a column of per-row weights, which is analogous to augur filter --priorities and not quite the same as the weighted sampling we are talking about in #1318 (I believe this was one of @huddlej's initial points of confusion).
