🚧 Add docs for weighted sampling

nextstrain · Aug 16, 2024 · 872bddc
1 parent 249cb0a
commit 872bddc
Showing 1 changed file with 30 additions and 2 deletions.
diff --git a/src/guides/bioinformatics/filtering-and-subsampling.rst b/src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -149,9 +149,37 @@ total sequences:
      --output-sequences subsampled_sequences.fasta \
      --output-metadata subsampled_metadata.tsv
 
-``augur filter`` will automatically determine a value for
+By default, ``augur filter`` will automatically determine a value for
 ``--sequences-per-group`` based on the number of available groups and sample
-uniformly.
+uniformly. This can be customized further with ``--group-by-weights``, which
+allows different target sizes per group. For example, to target twice as many
+sequences from Asia as from any other region, first create a file
+``weights.tsv``:
+
+.. code-block::
+
+   region	weight
+   Asia	2
+   default	1
+   ...
+
+The file format is described in the ``augur filter`` documentation for
+``--group-by-weights``.
+
+Then pass the file to the command with ``--group-by-weights weights.tsv``:
+
+.. code-block:: bash
+
+   augur filter \
+     --sequences data/sequences.fasta \
+     --metadata data/metadata.tsv \
+     --min-date 2012 \
+     --exclude exclude.txt \
+     --group-by region year month \
+     --group-by-weights weights.tsv \
+     --subsample-max-sequences 100 \
+     --output-sequences subsampled_sequences.fasta \
+     --output-metadata subsampled_metadata.tsv
 
 .. note::