Merge pull request #728 from nextstrain/low-memory-sanitize-metadata
Sanitize metadata in chunks
huddlej authored Oct 6, 2021
2 parents a02fa1b + 48a6f97 commit 6313a0c
Showing 10 changed files with 500 additions and 202 deletions.
11 changes: 10 additions & 1 deletion defaults/parameters.yaml
@@ -11,6 +11,16 @@ strip_strain_prefixes:
  - SARS-CoV-2/

sanitize_metadata:
  metadata_id_columns:
    - strain
    - name
    - Virus name
  database_id_columns:
    - "Accession ID"
    - gisaid_epi_isl
    - genbank_accession
  error_on_duplicate_strains: false
  parse_location_field: Location
  rename_fields:
    - "Virus name=strain"
    - "Type=type"
@@ -36,7 +46,6 @@ sanitize_metadata:
    - "Is low coverage?=is_low_coverage"
    - "N-Content=n_content"
    - "GC-Content=gc_content"
  parse_location_field: Location

reference_node_name: "USA/WA1/2020"

3 changes: 2 additions & 1 deletion docs/src/reference/change_log.md
@@ -5,7 +5,8 @@ We also use this change log to document new features that maintain backward comp

## New features since last version update

- 6 October 2021: Update clades with `21I (Delta)` and `21J (Delta)` viruses. These are subclades within `21A (Delta)`. Based on mutations they should have largely Delta-like phenotypes, although additional ORF1a mutations in `21J (Delta)` appear to confer higher fitness.
- 6 October 2021: Add three configuration parameters to control the metadata sanitizer step of the workflow. These parameters allow users to specify the metadata columns to use for strain names (`metadata_id_columns`) and to resolve duplicate records with database ids (`database_id_columns`). The new `error_on_duplicate_strains` parameter allows users to ask the workflow to exit with an error when any duplicates appear in the metadata. [See the configuration reference for more details](https://docs.nextstrain.org/projects/ncov/en/latest/reference/configuration.html#sanitize-metadata). ([#728](https://github.com/nextstrain/ncov/pull/728))
- 6 October 2021: Update clades with `21I (Delta)` and `21J (Delta)` viruses. These are subclades within `21A (Delta)`. Based on mutations they should have largely Delta-like phenotypes, although additional ORF1a mutations in `21J (Delta)` appear to confer higher fitness.

## v9 (6 October 2021)

27 changes: 26 additions & 1 deletion docs/src/reference/configuration.md
@@ -505,7 +505,32 @@ Valid attributes for list entries in `inputs` are provided below.

## sanitize_metadata
* type: object
* description: Parameters to configure how to sanitize metadata to a Nextstrain-compatible format.
* description: Parameters to configure how to sanitize metadata to a Nextstrain-compatible format. The sanitize metadata script performs four steps, in this order: it resolves duplicate records using database ids, parses a GISAID-style location field into Nextstrain-style location fields, strips prefixes from strain names, and renames fields.
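
The per-record steps described above (location parsing, prefix stripping, then renaming) can be sketched roughly as follows. This is an illustration only, not the workflow's actual script: the function names, the choice to keep the original location field, and the use of `?` for missing location levels are all assumptions made for this example. Deduplication is omitted here because it operates across records rather than on a single record.

```python
# Hypothetical sketch of the per-record sanitize steps; names are illustrative.
LOCATION_FIELDS = ("region", "country", "division", "location")


def parse_location(record, location_field):
    """Split a GISAID-style "A / B / C" value into Nextstrain-style fields."""
    record = dict(record)
    if location_field in record:
        parts = [part.strip() for part in record[location_field].split("/")]
        # Assumption: pad missing levels with "?" so all four fields exist.
        parts += ["?"] * (len(LOCATION_FIELDS) - len(parts))
        record.update(zip(LOCATION_FIELDS, parts))
    return record


def strip_prefixes(strain, prefixes):
    """Remove the first matching prefix from a strain name."""
    for prefix in prefixes:
        if strain.startswith(prefix):
            return strain[len(prefix):]
    return strain


def rename_fields(record, rules):
    """Apply "old=new" rules such as "Virus name=strain" to the field names."""
    mapping = dict(rule.split("=", 1) for rule in rules)
    return {mapping.get(name, name): value for name, value in record.items()}


def sanitize_record(record, id_column, location_field, prefixes, rules):
    """Run the steps in the documented order: location, prefixes, renaming."""
    record = parse_location(record, location_field)
    record[id_column] = strip_prefixes(record[id_column], prefixes)
    return rename_fields(record, rules)


example = sanitize_record(
    {"Virus name": "SARS-CoV-2/USA/WA1/2020",
     "Location": "North America / USA / Washington"},
    id_column="Virus name",
    location_field="Location",
    prefixes=["SARS-CoV-2/"],
    rules=["Virus name=strain"],
)
print(example["strain"], example["region"], example["division"])
```

Because renaming runs last, the earlier steps refer to the original column names (`Virus name`, `Location`) rather than the renamed ones.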

### metadata_id_columns
* type: array
* description: A list of valid strain name columns in the metadata. The sanitize metadata script will attempt to use the first of these columns that exists in the metadata. It will exit with an error if none of the columns exist.
* default:
```yaml
- strain
- name
- "Virus name"
```
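
A minimal sketch of the column selection described above, assuming the metadata header is available as a list of column names. The function name `find_id_column` is hypothetical and is not taken from the workflow's script.

```python
# Hypothetical helper: pick the first configured id column found in the header.
DEFAULT_METADATA_ID_COLUMNS = ("strain", "name", "Virus name")


def find_id_column(header, metadata_id_columns=DEFAULT_METADATA_ID_COLUMNS):
    """Return the first configured strain name column that exists in `header`."""
    for column in metadata_id_columns:
        if column in header:
            return column
    # Mirrors the documented behavior of failing when no column matches.
    raise ValueError(
        f"None of the id columns {list(metadata_id_columns)} were found "
        f"in the metadata columns {list(header)}."
    )


print(find_id_column(["Virus name", "Accession ID", "Location"]))
```

The order of the configured list matters: a file containing both `name` and `strain` resolves to `strain` with the defaults, because `strain` is listed first.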

### database_id_columns
* type: array
* description: A list of columns representing external database ids for metadata records. These unique ids represent a snapshot of data at a specific time for a given strain name. The sanitize metadata script resolves duplicate metadata records for the same strain name by selecting the record with the latest database id. Multiple database id columns allow the script to resolve duplicates when one or more columns have ambiguous values (e.g., "?"). Deduplication occurs before renaming of columns, so the default values include GISAID's own "Accession ID" as well as Nextstrain-style database ids.
* default:
```yaml
- "Accession ID"
- gisaid_epi_isl
- genbank_accession
```
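
One way to picture the duplicate resolution described above is the sketch below. It is an assumption-laden illustration, not the script's implementation: the function names are hypothetical, "latest" is approximated by plain string comparison of the id values, and ambiguous values (`?` or empty) are treated as sorting lowest. String comparison only orders ids correctly when they have the same length, so a real implementation may compare them differently.

```python
# Hypothetical sketch: keep one record per strain, preferring the latest ids.
AMBIGUOUS_VALUES = {"", "?"}


def database_key(record, database_id_columns):
    """Build a comparable key from the database id columns of one record."""
    return tuple(
        "" if record.get(column, "") in AMBIGUOUS_VALUES else record[column]
        for column in database_id_columns
    )


def resolve_duplicates(records, id_column, database_id_columns):
    """Return one record per strain, chosen by the highest database id key."""
    latest = {}
    for record in records:
        strain = record[id_column]
        key = database_key(record, database_id_columns)
        if strain not in latest or key > latest[strain][0]:
            latest[strain] = (key, record)
    return [record for _, record in latest.values()]


records = [
    {"strain": "A", "gisaid_epi_isl": "EPI_ISL_100", "genbank_accession": "?"},
    {"strain": "A", "gisaid_epi_isl": "EPI_ISL_200", "genbank_accession": "?"},
    {"strain": "B", "gisaid_epi_isl": "?", "genbank_accession": "MW000001"},
]
columns = ["Accession ID", "gisaid_epi_isl", "genbank_accession"]
for record in resolve_duplicates(records, "strain", columns):
    print(record["strain"], record["gisaid_epi_isl"], record["genbank_accession"])
```

Columns that are missing from a record (here `Accession ID`) contribute an empty value to the key, which is why listing several candidate id columns is harmless when only some of them are present.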

### error_on_duplicate_strains
* type: boolean
* description: Exit the sanitize metadata script with an error when any strains have multiple records in the metadata. The script writes a list of all duplicate strains to a file named like `<input>.duplicates.txt`, which users can review to address unexpected duplicates.
* default: `false`
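
The behavior of this flag could look something like the following sketch. The function names are hypothetical, and writing the duplicates file only when the flag is enabled is an assumption made for this example; the description above confirms only the file naming pattern and the error exit.

```python
# Hypothetical sketch of the error_on_duplicate_strains check.
from collections import Counter


def find_duplicates(strains):
    """Return the sorted strain names that appear more than once."""
    counts = Counter(strains)
    return sorted(strain for strain, count in counts.items() if count > 1)


def check_duplicates(strains, metadata_path, error_on_duplicate_strains=False):
    """Report duplicate strains and optionally exit with an error."""
    duplicates = find_duplicates(strains)
    if duplicates and error_on_duplicate_strains:
        # File name follows the documented `<input>.duplicates.txt` pattern.
        duplicates_path = str(metadata_path) + ".duplicates.txt"
        with open(duplicates_path, "w", encoding="utf-8") as handle:
            handle.write("\n".join(duplicates) + "\n")
        raise SystemExit(
            f"ERROR: {len(duplicates)} strains have duplicate records. "
            f"See '{duplicates_path}' for details."
        )
    return duplicates


# With the default (false), duplicates are reported but do not stop the run.
print(check_duplicates(["A", "B", "A"], "metadata.tsv"))
```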

### parse_location_field
* type: string