diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
index 2e1e747e..bf0c90ad 100644
--- a/src/guides/bioinformatics/filtering-and-subsampling.rst
+++ b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -149,9 +149,37 @@ total sequences:
       --output-sequences subsampled_sequences.fasta \
       --output-metadata subsampled_metadata.tsv
 
-``augur filter`` will automatically determine a value for
+By default, ``augur filter`` will automatically determine a value for
 ``--sequences-per-group`` based on the number of available groups and sample
-uniformly.
+uniformly. This can be customized further with ``--group-by-weights``, which
+allows different target sizes per group. For example, target twice the number of
+sequences from Asia compared to other regions. First, create a file
+``weights.tsv``:
+
+.. code-block::
+
+    region	weight
+    Asia	2
+    default	1
+    ...
+
+The format specifications are described in ``augur filter`` docs for
+``--group-by-weights``.
+
+Add the option by using ``--group-by-weights weights.tsv`` in the command:
+
+.. code-block:: bash
+
+    augur filter \
+      --sequences data/sequences.fasta \
+      --metadata data/metadata.tsv \
+      --min-date 2012 \
+      --exclude exclude.txt \
+      --group-by region year month \
+      --group-by-weights weights.tsv \
+      --subsample-max-sequences 100 \
+      --output-sequences subsampled_sequences.fasta \
+      --output-metadata subsampled_metadata.tsv
 
 .. note::