
Sanitize metadata in chunks #728

Merged: 11 commits merged into master from low-memory-sanitize-metadata on Oct 6, 2021
Conversation

@huddlej (Contributor) commented on Sep 22, 2021

Description of proposed changes

Instead of loading all metadata into memory to perform transformation and filtering associated with sanitizing logic, process metadata in chunks that easily fit in memory.

To support this low-memory implementation, this PR changes the approach to resolving duplicate records. We now make a first pass through the metadata to store strain ids and corresponding database ids for each record in a temporary file on disk. We make a second pass to filter duplicate records, transform the remaining records, and stream the results to disk.
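
As a rough sketch of this two-pass approach (the column names, the temporary flat-file id map, and "highest accession wins" tie-breaking here are illustrative assumptions, not the PR's exact code):

```python
import csv
import os
import tempfile

import pandas as pd

def sanitize_in_chunks(metadata_path, output_path, id_column="strain",
                       accession_column="accession", chunk_size=100_000):
    # First pass: write each record's strain name and accession to a
    # temporary on-disk map instead of holding all metadata in memory.
    map_fd, map_path = tempfile.mkstemp(suffix=".tsv")
    with os.fdopen(map_fd, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        for chunk in pd.read_csv(metadata_path, sep="\t", chunksize=chunk_size):
            writer.writerows(chunk[[id_column, accession_column]].values)

    # Resolve duplicates: keep one winning accession per strain
    # (here, simply the maximum).
    id_map = pd.read_csv(map_path, sep="\t", names=[id_column, accession_column])
    latest = id_map.groupby(id_column)[accession_column].max()

    # Second pass: filter each chunk to the winning accessions and
    # stream the results to disk.
    first_chunk = True
    for chunk in pd.read_csv(metadata_path, sep="\t", chunksize=chunk_size):
        keep = chunk[accession_column] == chunk[id_column].map(latest)
        chunk[keep].to_csv(output_path, sep="\t", index=False,
                           header=first_chunk, mode="w" if first_chunk else "a")
        first_chunk = False

    os.remove(map_path)
```

Because each `read_csv` call yields chunks of bounded size, peak memory scales with the chunk size and the id map rather than with the full metadata file.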

One minor interface change (improvement) from this refactoring is the creation of a separate file containing the list of all duplicate strains when the user has requested an error on duplicates. This change allows the user to bioinformatically address duplicates after the fact in a way that printing the list of strains to stderr did not easily allow.

Note that one side effect of the two-pass approach to sanitizing metadata is that reading metadata from tarballs no longer worked as expected. This functionality was fixed by writing the requested single file from a given tarball to a temporary file on disk instead of streaming the contents of that file as an io.BufferedReader.
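
A minimal sketch of that fix, assuming a helper along these lines (the function name is hypothetical):

```python
import shutil
import tarfile
import tempfile

def extract_member_to_path(tar_path, member_name):
    """Write the requested member of a tar archive to a temporary file on
    disk and return its path, instead of handing back the io.BufferedReader
    from tarfile.extractfile. A real file path supports the multiple read
    passes the sanitize script now requires.
    """
    with tarfile.open(tar_path) as tar:
        member = tar.extractfile(member_name)  # io.BufferedReader
        with tempfile.NamedTemporaryFile(delete=False) as temp:
            shutil.copyfileobj(member, temp)
            return temp.name
```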

Fixes #697

Development plan

  • Move hard-coded strain field names to defaults for a user argument that parallels augur filter’s --metadata-id-columns.
  • Define a function to check for and return all available accession fields.
  • Move logic for renaming fields into a function and call it from earlier in the script.
  • Move logic for stripping prefixes into a function with tests.
  • Use augur.io.read_metadata to read metadata in an iterator.
  • Loop through the metadata in the first pass, writing out the strain name (i.e., the index column of the data frame) and all available accession fields to a database.
  • Loop through the metadata in the second pass:
    • Select the latest accession for each strain in the current batch from the name/id mapping database.
    • Filter the current batch data frame to the latest accessions.
    • Apply column and value changes to the filtered batch data frame.
    • Write the filtered and transformed batch data frame to disk.
  • Delete the name/id mapping database.

Testing

This PR refactors top-level code into its own functions with doctests. It also adds cram-based functional tests that were used for test-driven development of the new low-memory functionality and to confirm that the user interface remains stable.

Run doctests:

python3 -m doctest scripts/sanitize_metadata.py
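
As an illustration of the doctest style, the refactored helpers might look like this sketch (function names and exact behavior are assumptions, not the script's actual code):

```python
def parse_new_column_names(renames):
    """Parse 'old=new' pairs from the command line into a rename mapping.

    >>> parse_new_column_names(["Virus name=strain", "Type=type"])
    {'Virus name': 'strain', 'Type': 'type'}
    """
    # Split on the first "=" only, so new names may not contain "=" but
    # old names are unrestricted.
    return dict(pair.split("=", 1) for pair in renames)

def strip_prefixes(strain, prefixes):
    """Remove the first matching prefix from a strain name.

    >>> strip_prefixes("hCoV-19/Wuhan/WH01/2019", ["hCoV-19/"])
    'Wuhan/WH01/2019'
    >>> strip_prefixes("Wuhan/WH01/2019", ["hCoV-19/"])
    'Wuhan/WH01/2019'
    """
    for prefix in prefixes:
        if strain.startswith(prefix):
            return strain[len(prefix):]
    return strain
```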

Run functional tests:

cram --shell=/bin/bash tests/sanitize-metadata.t

Run sanitize script on full GISAID metadata.

python3 scripts/sanitize_metadata.py \
  --metadata gisaid_metadata.tsv.gz \
  --metadata-id-columns 'Virus name' \
  --database-id-columns 'Accession ID' \
  --parse-location-field Location \
  --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content \
  --strip-prefixes hCoV-19/ \
  --output results/sanitized_metadata_gisaid.tsv.gz

Running this script on the 3,492,105 records from a recent full GISAID download took ~10 minutes (~1 minute for the first pass to build the id map, ~9 minutes for the second pass) and used ~330 MB of memory at peak. The current implementation could be sped up substantially by tuning the duplicate-filtering logic (e.g., querying a sqlite database indexed by strain name instead of searching a plain text file for each chunk in the second pass) and by using pandas's C-based TSV parser instead of the Python-based parser. The latter change would require a breaking change to the Augur API.
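
The suggested sqlite speedup could look roughly like this (table and column names are illustrative, not part of this PR):

```python
import sqlite3

def build_id_map(path="id_map.sqlite"):
    # Index the strain/accession map by strain so the second pass can
    # resolve duplicates with one indexed query per chunk instead of
    # scanning a flat file.
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS id_map (strain TEXT, accession TEXT)")
    db.execute("CREATE INDEX IF NOT EXISTS idx_strain ON id_map (strain)")
    return db

def latest_accessions(db, strains):
    """Return a mapping of each requested strain to its latest accession."""
    if not strains:
        return {}
    placeholders = ",".join("?" for _ in strains)
    query = (
        "SELECT strain, MAX(accession) FROM id_map "
        f"WHERE strain IN ({placeholders}) GROUP BY strain"
    )
    return dict(db.execute(query, list(strains)))
```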

Release checklist

This pull request introduces the following features that should be noted in the change log:

  • Enable configuration of metadata strain id and database id columns in the build config
  • Create a separate text file named like <input>.duplicates.txt containing a list of duplicate strains (one per line) when the user requests --error-on-duplicate-strains.

Remaining tasks are:

  • Resolve issues reading from tarball file handles with augur.io.read_metadata
  • Add config parameter to allow toggling --error-on-duplicate-strains
  • Document new config parameters
  • Update change log to reflect new features

@huddlej force-pushed the low-memory-sanitize-metadata branch from b5b9d7f to c1901eb on September 22, 2021 18:42
@huddlej marked this pull request as ready for review on September 22, 2021 18:47
@huddlej force-pushed the low-memory-sanitize-metadata branch 2 times, most recently from ec9d3fe to f2f98e4 on October 6, 2021 18:07
huddlej added 11 commits on October 6, 2021 11:08

Promote hardcoded metadata and database id columns to workflow config
parameters and command line arguments for the sanitize metadata script.
This change allows us to encode less information about the metadata in
the sanitize script, allows the user to set their own values, and also
sets us up to call Augur's `io.read_metadata` function which requires
the metadata id columns.

Move logic to parse mapping of old to new column names and strip
prefixes into their own functions with tests. This refactoring
simplifies the code in the main body of the sanitizer script.

Instead of loading all metadata into memory to perform transformation
and filtering associated with sanitizing logic, process metadata in
chunks that easily fit in memory.

To support this low-memory implementation, this commit changes the
approach to resolving duplicate records. We now make a first pass
through the metadata to store strain ids and corresponding database ids
for each record in a temporary file on disk. We make a second pass to
filter duplicate records, transform the remaining records, and stream
the results to disk.

This commit adds cram-based functional tests that were used for
test-driven development of the new low-memory functionality and to
confirm that the user interface remains stable.

One minor interface change (improvement) from this refactoring is the
creation of a separate file containing the list of all duplicate strains
when the user has requested an error on duplicates. This change allows
the user to bioinformatically address duplicates after the fact in a way
that printing the list of strains to stderr did not easily allow.

Instead of streaming metadata and sequences from a tarball with the
`tarfile.extractfile` method, write the contents of the requested
internal file to a temporary external file. This change allows
downstream processes to read from a file path which is both a more
consistent interface for libraries than the `io.BufferedReader` object
returned by the `tarfile` module and an interface that supports the
multiple passes required by the sanitize metadata script.

Adds a config parameter for sanitizing metadata that allows users to
ask the workflow to throw an error when duplicates are found in the
metadata. This parameter is set to `false` by default.

Adds three new config parameters to the configuration reference.

Reverts order of default id columns to our previous canonical order
preferring Nextstrain-style columns ("strain" and "name") when they are
available over the GISAID-style column ("Virus name").

Correct the order that sanitize metadata operations appear in the
script's help, workflow config, and configuration guide. Update the
configuration guide to include an explicit list of these operations and
to clarify that deduplication happens before renaming of fields.

Simplify the logic for extracting files from tar archives by directly
extracting the contents with `tar.extractfile` instead of writing the
contents line by line with different file modes.

Testing of full open and GISAID metadata required 700 and 600 MB of
memory at peak use, respectively, so we set the requested memory for
this job to 2000 MB for a little extra buffer.
@huddlej force-pushed the low-memory-sanitize-metadata branch from f2f98e4 to 48a6f97 on October 6, 2021 18:10
@huddlej merged commit 6313a0c into master on Oct 6, 2021
@huddlej deleted the low-memory-sanitize-metadata branch on October 6, 2021 18:32
Successfully merging this pull request may close these issues.

Do not load all metadata into memory during sanitize step