
Sanitize metadata in chunks #728

Merged: 11 commits merged into master from low-memory-sanitize-metadata on Oct 6, 2021
Conversation

@huddlej (Contributor) commented on Sep 22, 2021

Description of proposed changes

Instead of loading all metadata into memory to perform transformation and filtering associated with sanitizing logic, process metadata in chunks that easily fit in memory.

To support this low-memory implementation, this PR changes the approach to resolving duplicate records. We now make a first pass through the metadata to store strain ids and corresponding database ids for each record in a temporary file on disk. We make a second pass to filter duplicate records, transform the remaining records, and stream the results to disk.
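
As a rough sketch of this two-pass approach (the column names, the temporary flat-file id map, and "highest accession wins" tie-breaking here are illustrative assumptions, not the PR's exact code):

```python
import csv
import os
import tempfile

import pandas as pd

def sanitize_in_chunks(metadata_path, output_path, id_column="strain",
                       accession_column="accession", chunk_size=100_000):
    # First pass: write each record's strain name and accession to a
    # temporary on-disk map instead of holding all metadata in memory.
    map_fd, map_path = tempfile.mkstemp(suffix=".tsv")
    with os.fdopen(map_fd, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        for chunk in pd.read_csv(metadata_path, sep="\t", chunksize=chunk_size):
            writer.writerows(chunk[[id_column, accession_column]].values)

    # Resolve duplicates: keep one winning accession per strain
    # (here, simply the maximum).
    id_map = pd.read_csv(map_path, sep="\t", names=[id_column, accession_column])
    latest = id_map.groupby(id_column)[accession_column].max()

    # Second pass: filter each chunk to the winning accessions and
    # stream the results to disk.
    first_chunk = True
    for chunk in pd.read_csv(metadata_path, sep="\t", chunksize=chunk_size):
        keep = chunk[accession_column] == chunk[id_column].map(latest)
        chunk[keep].to_csv(output_path, sep="\t", index=False,
                           header=first_chunk, mode="w" if first_chunk else "a")
        first_chunk = False

    os.remove(map_path)
```

Because each `read_csv` call yields chunks of bounded size, peak memory scales with the chunk size and the id map rather than with the full metadata file.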

One minor interface change (improvement) from this refactoring is the creation of a separate file containing the list of all duplicate strains when the user has requested an error on duplicates. This change allows the user to bioinformatically address duplicates after the fact in a way that printing the list of strains to stderr did not easily allow.

Note that one side effect of the two-pass approach to sanitizing metadata is that reading metadata from tarballs no longer worked as expected. This functionality was fixed by writing the requested single file from a given tarball to a temporary file on disk instead of streaming the contents of that file as an io.BufferedReader.
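
A minimal sketch of that fix, assuming a helper along these lines (the function name is hypothetical):

```python
import shutil
import tarfile
import tempfile

def extract_member_to_path(tar_path, member_name):
    """Write the requested member of a tar archive to a temporary file on
    disk and return its path, instead of handing back the io.BufferedReader
    from tarfile.extractfile. A real file path supports the multiple read
    passes the sanitize script now requires.
    """
    with tarfile.open(tar_path) as tar:
        member = tar.extractfile(member_name)  # io.BufferedReader
        with tempfile.NamedTemporaryFile(delete=False) as temp:
            shutil.copyfileobj(member, temp)
            return temp.name
```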

Fixes #697

Development plan

  • Move hard-coded strain field names to defaults for a user argument that parallels augur filter’s --metadata-id-columns.
  • Define a function to check for and return all available accession fields.
  • Move logic for renaming fields into a function and call it from earlier in the script.
  • Move logic for stripping prefixes into a function with tests.
  • Use augur.io.read_metadata to read metadata in an iterator.
  • Loop through the metadata in the first pass, writing out the strain name (i.e., the index column of the data frame) and all available accession fields to a database.
  • Loop through the metadata in the second pass:
    • Select the latest accession for each strain in the current batch from the name/id mapping database.
    • Filter the current batch data frame to the latest accessions.
    • Apply column and value changes to the filtered batch data frame.
    • Write the filtered and transformed batch data frame to disk.
  • Delete the name/id mapping database.

Testing

This PR refactors top-level code into its own functions with doctests. It also adds cram-based functional tests that were used for test-driven development of the new low-memory functionality and to confirm that the user interface remains stable.

Run doctests:

python3 -m doctest scripts/sanitize_metadata.py
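
As an illustration of the doctest style, the refactored helpers might look like this sketch (function names and exact behavior are assumptions, not the script's actual code):

```python
def parse_new_column_names(renames):
    """Parse 'old=new' pairs from the command line into a rename mapping.

    >>> parse_new_column_names(["Virus name=strain", "Type=type"])
    {'Virus name': 'strain', 'Type': 'type'}
    """
    # Split on the first "=" only, so new names may not contain "=" but
    # old names are unrestricted.
    return dict(pair.split("=", 1) for pair in renames)

def strip_prefixes(strain, prefixes):
    """Remove the first matching prefix from a strain name.

    >>> strip_prefixes("hCoV-19/Wuhan/WH01/2019", ["hCoV-19/"])
    'Wuhan/WH01/2019'
    >>> strip_prefixes("Wuhan/WH01/2019", ["hCoV-19/"])
    'Wuhan/WH01/2019'
    """
    for prefix in prefixes:
        if strain.startswith(prefix):
            return strain[len(prefix):]
    return strain
```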

Run functional tests:

cram --shell=/bin/bash tests/sanitize-metadata.t

Run sanitize script on full GISAID metadata.

python3 scripts/sanitize_metadata.py \
  --metadata gisaid_metadata.tsv.gz \
  --metadata-id-columns 'Virus name' \
  --database-id-columns 'Accession ID' \
  --parse-location-field Location \
  --rename-fields 'Virus name=strain' Type=type 'Accession ID=gisaid_epi_isl' 'Collection date=date' 'Additional location information=additional_location_information' 'Sequence length=length' Host=host 'Patient age=patient_age' Gender=sex Clade=GISAID_clade 'Pango lineage=pango_lineage' Lineage=pango_lineage 'Pangolin version=pangolin_version' Variant=variant 'AA Substitutions=aa_substitutions' aaSubstitutions=aa_substitutions 'Submission date=date_submitted' 'Is reference?=is_reference' 'Is complete?=is_complete' 'Is high coverage?=is_high_coverage' 'Is low coverage?=is_low_coverage' N-Content=n_content GC-Content=gc_content \
  --strip-prefixes hCoV-19/ \
  --output results/sanitized_metadata_gisaid.tsv.gz

Running this script on the 3,492,105 records from a recent full GISAID download took ~10 minutes (~1 minute for the first pass to build the id map, ~9 minutes for the second pass) and used ~330 MB of memory at peak. The current implementation could be sped up substantially by tuning the duplicate-filtering logic (e.g., querying a sqlite database indexed by strain name instead of searching a plain text file for each chunk in the second pass) and by using pandas's C-based TSV parser instead of the Python-based parser. The latter change would require a breaking change to the Augur API.
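
The suggested sqlite speedup could look roughly like this (table and column names are illustrative, not part of this PR):

```python
import sqlite3

def build_id_map(path="id_map.sqlite"):
    # Index the strain/accession map by strain so the second pass can
    # resolve duplicates with one indexed query per chunk instead of
    # scanning a flat file.
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS id_map (strain TEXT, accession TEXT)")
    db.execute("CREATE INDEX IF NOT EXISTS idx_strain ON id_map (strain)")
    return db

def latest_accessions(db, strains):
    """Return a mapping of each requested strain to its latest accession."""
    if not strains:
        return {}
    placeholders = ",".join("?" for _ in strains)
    query = (
        "SELECT strain, MAX(accession) FROM id_map "
        f"WHERE strain IN ({placeholders}) GROUP BY strain"
    )
    return dict(db.execute(query, list(strains)))
```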

Release checklist

This pull request introduces the following features that should be noted in the change log:

  • Enable configuration of metadata strain id and database id columns in the build config
  • Create a separate text file named like <input>.duplicates.txt containing a list of duplicate strains (one per line) when the user requests --error-on-duplicate-strains.

Remaining tasks are:

  • Resolve issues reading from tarball file handles with augur.io.read_metadata
  • Add config parameter to allow toggling --error-on-duplicate-strains
  • Document new config parameters
  • Update change log to reflect new features

@huddlej force-pushed the low-memory-sanitize-metadata branch from b5b9d7f to c1901eb on September 22, 2021 18:42
@huddlej marked this pull request as ready for review on September 22, 2021 18:47
@huddlej force-pushed the low-memory-sanitize-metadata branch 2 times, most recently from ec9d3fe to f2f98e4 on October 6, 2021 18:07
huddlej added 11 commits on October 6, 2021 11:08

Promote hardcoded metadata and database id columns to workflow config
parameters and command line arguments for the sanitize metadata script.
This change allows us to encode less information about the metadata in
the sanitize script, allows the user to set their own values, and also
sets us up to call Augur's `io.read_metadata` function which requires
the metadata id columns.

Move logic to parse mapping of old to new column names and strip
prefixes into their own functions with tests. This refactoring
simplifies the code in the main body of the sanitizer script.

Instead of loading all metadata into memory to perform transformation
and filtering associated with sanitizing logic, process metadata in
chunks that easily fit in memory.

To support this low-memory implementation, this commit changes the
approach to resolving duplicate records. We now make a first pass
through the metadata to store strain ids and corresponding database ids
for each record in a temporary file on disk. We make a second pass to
filter duplicate records, transform the remaining records, and stream
the results to disk.

This commit adds cram-based functional tests that were used for
test-driven development of the new low-memory functionality and to
confirm that the user interface remains stable.

One minor interface change (improvement) from this refactoring is the
creation of a separate file containing the list of all duplicate strains
when the user has requested an error on duplicates. This change allows
the user to bioinformatically address duplicates after the fact in a way
that printing the list of strains to stderr did not easily allow.

Instead of streaming metadata and sequences from a tarball with the
`tarfile.extractfile` method, write the contents of the requested
internal file to a temporary external file. This change allows
downstream processes to read from a file path which is both a more
consistent interface for libraries than the `io.BufferedReader` object
returned by the `tarfile` module and an interface that supports the
multiple passes required by the sanitize metadata script.

Adds a config parameter for sanitizing metadata that allows users to
ask the workflow to throw an error when duplicates are found in the
metadata. This parameter is set to `false` by default.

Adds three new config parameters to the configuration reference.

Reverts order of default id columns to our previous canonical order
preferring Nextstrain-style columns ("strain" and "name") when they are
available over the GISAID-style column ("Virus name").

Correct the order that sanitize metadata operations appear in the
script's help, workflow config, and configuration guide. Update the
configuration guide to include an explicit list of these operations and
to clarify that deduplication happens before renaming of fields.

Simplify the logic for extracting files from tar archives by directly
extracting the contents with `tar.extractfile` instead of writing the
contents line by line with different file modes.

Testing of full open and GISAID metadata required 700 and 600 MB of
memory at peak use, respectively, so we set the requested memory for
this job to 2000 MB for a little extra buffer.
@huddlej force-pushed the low-memory-sanitize-metadata branch from f2f98e4 to 48a6f97 on October 6, 2021 18:10
@huddlej merged commit 6313a0c into master on Oct 6, 2021
@huddlej deleted the low-memory-sanitize-metadata branch on October 6, 2021 18:32
Successfully merging this pull request may close these issues.

Do not load all metadata into memory during sanitize step