Commit

🚧 Add docs for weighted sampling
victorlin committed Aug 16, 2024
1 parent 249cb0a commit 872bddc
Showing 1 changed file with 30 additions and 2 deletions.
32 changes: 30 additions & 2 deletions src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -149,9 +149,37 @@ total sequences:
--output-sequences subsampled_sequences.fasta \
--output-metadata subsampled_metadata.tsv
``augur filter`` will automatically determine a value for
By default, ``augur filter`` will automatically determine a value for
``--sequences-per-group`` based on the number of available groups and sample
uniformly.
uniformly. This can be customized further with ``--group-by-weights``, which
allows different target sizes per group. For example, to target twice as many
sequences from Asia as from any other region, first create a file
``weights.tsv``:

.. code-block::

    region weight
    Asia 2
    default 1
    ...

The format specifications are described in ``augur filter`` docs for
``--group-by-weights``.
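
One way to create the file from a shell is sketched below. It assumes the
columns are tab-separated (suggested by the ``.tsv`` extension, not stated
here) and writes only the rows shown above. The ``augur filter`` docs remain
the authoritative source for the format.

```shell
# Write weights.tsv with literal tab separators.
# Assumption: columns are tab-separated, as the .tsv extension suggests.
printf 'region\tweight\n' >  weights.tsv
printf 'Asia\t2\n'        >> weights.tsv
printf 'default\t1\n'     >> weights.tsv

# Sanity check: every row should have exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad + 0 }' weights.tsv \
  && echo "weights.tsv looks well-formed"
```

Using ``printf`` with ``\t`` avoids editors that silently convert tabs to
spaces.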

Add the option by using ``--group-by-weights weights.tsv`` in the command:

.. code-block:: bash

    augur filter \
      --sequences data/sequences.fasta \
      --metadata data/metadata.tsv \
      --min-date 2012 \
      --exclude exclude.txt \
      --group-by region year month \
      --group-by-weights weights.tsv \
      --subsample-max-sequences 100 \
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv

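
To check how the weights affected the result, the output metadata can be
tallied per region. This is a minimal sketch and not part of ``augur``:
``count_by_region`` is a hypothetical helper, and it assumes the metadata is
tab-separated with a header row that contains a ``region`` column.

```shell
# Hypothetical helper: count rows per region in a metadata TSV.
# Assumes a tab-separated file whose header row has a "region" column.
count_by_region() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "region") col = i; next }
    col     { n[$col]++ }
    END     { for (r in n) print r "\t" n[r] }
  ' "$1" | sort
}
```

For example, ``count_by_region subsampled_metadata.tsv`` prints one line per
region with the number of sequences kept for it.
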
.. note::
