Sanitize metadata in chunks #728
Merged
Conversation
huddlej force-pushed the `low-memory-sanitize-metadata` branch from `b5b9d7f` to `c1901eb` on September 22, 2021 18:42.
tsibley reviewed on Sep 28, 2021.
huddlej force-pushed the `low-memory-sanitize-metadata` branch 2 times, most recently from `ec9d3fe` to `f2f98e4` on October 6, 2021 18:07.
Promote hardcoded metadata and database id columns to workflow config parameters and command line arguments for the sanitize metadata script. This change allows us to encode less information about the metadata in the sanitize script, allows the user to set their own values, and also sets us up to call Augur's `io.read_metadata` function which requires the metadata id columns.
Move logic to parse mapping of old to new column names and strip prefixes into their own functions with tests. This refactoring simplifies the code in the main body of the sanitizer script.
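As a rough illustration of what such helper functions might look like (the function names, the `old=new` rule syntax, and the prefix list below are assumptions made for this sketch, not necessarily the script's actual interface):

```python
def parse_new_column_names(renames):
    """Parse a list of renaming rules in "old=new" format into a dict
    mapping old column names to new ones.

    >>> parse_new_column_names(["Virus name=strain", "Type=type"])
    {'Virus name': 'strain', 'Type': 'type'}
    """
    pairs = (rule.split("=", 1) for rule in renames if "=" in rule)
    return {old: new for old, new in pairs}


def strip_prefixes(strain, prefixes):
    """Strip the first matching prefix from a strain name.

    >>> strip_prefixes("hCoV-19/Wuhan/WH01/2019", ["hCoV-19/", "SARS-CoV-2/"])
    'Wuhan/WH01/2019'
    """
    for prefix in prefixes:
        if strain.startswith(prefix):
            return strain[len(prefix):]
    return strain
```

Factoring the logic out like this keeps the main body of the sanitizer script to orchestration and makes each transformation testable with doctests.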
Instead of loading all metadata into memory to perform transformation and filtering associated with sanitizing logic, process metadata in chunks that easily fit in memory.

To support this low-memory implementation, this commit changes the approach to resolving duplicate records. We now make a first pass through the metadata to store strain ids and corresponding database ids for each record in a temporary file on disk. We make a second pass to filter duplicate records, transform the remaining records, and stream the results to disk.

This commit adds cram-based functional tests that were used for test-driven development of the new low-memory functionality and to confirm that the user interface remains stable.

One minor interface change (improvement) from this refactoring is the creation of a separate file containing the list of all duplicate strains when the user has requested an error on duplicates. This change allows the user to bioinformatically address duplicates after the fact in a way that printing the list of strains to stderr did not easily allow.
Instead of streaming metadata and sequences from a tarball with the `tarfile.extractfile` method, write the contents of the requested internal file to a temporary external file. This change allows downstream processes to read from a file path which is both a more consistent interface for libraries than the `io.BufferedReader` object returned by the `tarfile` module and an interface that supports the multiple passes required by the sanitize metadata script.
Adds a config parameter for sanitizing metadata that allows users to ask the workflow to throw an error when duplicates are found in the metadata. This parameter is set to `false` by default.
Adds three new config parameters to the configuration reference.
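In the workflow config, these parameters might look roughly like the following (the key names and values here are illustrative assumptions; consult the configuration reference for the actual schema):

```yaml
sanitize_metadata:
  # Possible id columns to identify each record, in order of preference.
  metadata_id_columns:
    - strain
    - name
    - "Virus name"
  # Whether to fail loudly when duplicate strains are found.
  error_on_duplicate_strains: false
```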
Revert the order of default id columns to our previous canonical order, preferring Nextstrain-style columns ("strain" and "name") when they are available over the GISAID-style column ("Virus name").
Correct the order that sanitize metadata operations appear in the script's help, workflow config, and configuration guide. Update the configuration guide to include an explicit list of these operations and to clarify that deduplication happens before renaming of fields.
Simplify the logic for extracting files from tar archives by directly extracting the contents with `tar.extractfile` instead of writing the contents line by line with different file modes.
Testing of full open and GISAID metadata required 700 and 600 MB of memory at peak use, respectively, so we set the requested memory for this job to 2000 MB for a little extra buffer.
huddlej force-pushed the `low-memory-sanitize-metadata` branch from `f2f98e4` to `48a6f97` on October 6, 2021 18:10.
Description of proposed changes
Instead of loading all metadata into memory to perform transformation and filtering associated with sanitizing logic, process metadata in chunks that easily fit in memory.
To support this low-memory implementation, this PR changes the approach to resolving duplicate records. We now make a first pass through the metadata to store strain ids and corresponding database ids for each record in a temporary file on disk. We make a second pass to filter duplicate records, transform the remaining records, and stream the results to disk.
One minor interface change (improvement) from this refactoring is the creation of a separate file containing the list of all duplicate strains when the user has requested an error on duplicates. This change allows the user to bioinformatically address duplicates after the fact in a way that printing the list of strains to stderr did not easily allow.
Note that one side effect of the two-pass approach to sanitizing metadata is that reading metadata from tarballs no longer worked as expected. This functionality was fixed by writing the requested single file from a given tarball to a temporary file on disk instead of streaming the contents of that file as an `io.BufferedReader`.

Fixes #697
Development plan

- Add the `--metadata-id-columns` argument.
- Use `augur.io.read_metadata` to read metadata in an iterator.

Testing
This PR refactors top-level code into its own functions with doctests. It also adds cram-based functional tests that were used for test-driven development of the new low-memory functionality and to confirm that the user interface remains stable.
Run doctests:
Run functional tests:
Run sanitize script on full GISAID metadata.
Running this script on the 3,492,105 records from a recent full GISAID download took ~10 minutes (~1 minute for the first pass to build the id map, ~9 minutes for the second pass) and used ~330 MB of memory at peak use. The current implementation could be sped up substantially by tuning the duplicate filter logic (e.g., querying a sqlite database indexed by strain name instead of searching a plain text file for each chunk in the second pass) and by using the C-based pandas parser for TSV instead of the Python-based parser. This latter change requires a breaking change to the Augur API.
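The sqlite idea mentioned above could be sketched like this (a hypothetical illustration, not code from the PR): build a table keyed by strain during the first pass, so the second pass performs an indexed lookup per record instead of scanning a plain text file per chunk.

```python
import sqlite3

def build_id_index(records, db_path=":memory:"):
    """Index the highest database id per strain in sqlite.

    `records` yields (strain, database_id) pairs from the first pass.
    The primary key on strain makes second-pass lookups indexed.
    """
    connection = sqlite3.connect(db_path)
    connection.execute(
        "CREATE TABLE ids (strain TEXT PRIMARY KEY, database_id TEXT)"
    )
    for strain, database_id in records:
        # Upsert: keep the highest database id seen for each strain.
        connection.execute(
            """INSERT INTO ids VALUES (?, ?)
               ON CONFLICT (strain) DO UPDATE SET database_id = excluded.database_id
               WHERE excluded.database_id > ids.database_id""",
            (strain, database_id),
        )
    connection.commit()
    return connection

def is_latest(connection, strain, database_id):
    """Check whether a record matches the stored (latest) database id."""
    row = connection.execute(
        "SELECT database_id FROM ids WHERE strain = ?", (strain,)
    ).fetchone()
    return row is not None and row[0] == database_id
```

Because sqlite keeps the table on disk when given a file path instead of `:memory:`, this approach preserves the low-memory property while replacing the per-chunk text scan with O(log n) lookups.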
Release checklist
This pull request introduces the following features that should be noted in the change log:

- A new file, `<input>.duplicates.txt`, containing a list of duplicate strains (one per line) when the user requests `--error-on-duplicate-strains`.

Remaining tasks are:

- `augur.io.read_metadata`
- `--error-on-duplicate-strains`