Support vanilla GISAID downloads #701

Merged 3 commits on Aug 12, 2021
1 change: 1 addition & 0 deletions defaults/parameters.yaml
@@ -23,6 +23,7 @@ sanitize_metadata:
- "Gender=sex"
- "Clade=GISAID_clade"
- "Pango lineage=pango_lineage"
- "Lineage=pango_lineage"
- "Pangolin version=pangolin_version"
- "Variant=variant"
- "AA Substitutions=aa_substitutions"
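
Each rule in this list has the form `old name=new name`. As a rough illustration of how rules like these could be applied to the header of a metadata table, here is a minimal sketch. The helpers `parse_rename_rules` and `rename_header` are hypothetical names, not functions from `scripts/sanitize_metadata.py`, and the actual script may process these rules differently.

```python
def parse_rename_rules(rules):
    """Turn ["Old=new", ...] into {"Old": "new", ...} (hypothetical helper)."""
    mapping = {}
    for rule in rules:
        # Split on the first "=" only, so the new name may itself contain "=".
        old_name, new_name = rule.split("=", 1)
        mapping[old_name] = new_name
    return mapping


def rename_header(header, rules):
    """Rename the columns that have a rule and leave all others unchanged."""
    mapping = parse_rename_rules(rules)
    return [mapping.get(column, column) for column in header]


rules = [
    "Clade=GISAID_clade",
    "Pango lineage=pango_lineage",
    "Lineage=pango_lineage",
]

# Hypothetical header that uses "Lineage" instead of "Pango lineage".
header = ["Virus name", "Lineage", "Clade"]
renamed = rename_header(header, rules)
print(renamed)
```

With both `Pango lineage` and `Lineage` mapped to the same target, a file that uses either column name ends up with a `pango_lineage` column.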
29 changes: 29 additions & 0 deletions docs/data-prep.md
@@ -146,6 +146,10 @@ If you need to download more records, constrain your search results to smaller w
</p>

Select the "Download" button in the bottom right of the search results.
There are two options to download data from GISAID, both of which we describe below.

#### Option 1: Download "Input for the Augur pipeline"

From the resulting "Download" window, select "Input for the Augur pipeline" as the download format.

![GISAID search download window showing "Input for the Augur pipeline" option](images/gisaid-search-download-window.png)
@@ -193,6 +197,31 @@ inputs:
sequences: data/gisaid_washington_sequences.fasta.xz
```

#### Option 2: Download "Sequences" and "Patient status metadata"

Alternatively, you can download sequences and metadata as two separate uncompressed files.
First, select "Sequences (FASTA)" as the download format.
Check the box to replace spaces with underscores, since spaces are not allowed in FASTA record names.
Select the "Download" button and save the resulting file to the `data/` directory with a descriptive name like `gisaid_washington_sequences.fasta`.

![GISAID search download window showing "Sequences (FASTA)" option](images/gisaid-search-download-window-sequences.png)

From the search results interface, select the "Download" button in the bottom right again.
Select "Patient status metadata" as the download format.
Select the "Download" button and save the file to `data/` with a descriptive name like `gisaid_washington_metadata.tsv`.

![GISAID search download window showing "Patient status metadata" option](images/gisaid-search-download-window-metadata.png)

You can use these files as inputs for the workflow like so:

```yaml
# Define inputs for the workflow.
inputs:
  - name: washington
    metadata: data/gisaid_washington_metadata.tsv
    sequences: data/gisaid_washington_sequences.fasta
```
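
Because the sequences and metadata arrive as two separate files under this option, it can be useful to confirm that every sequence has a matching metadata record before running the workflow. The following is a minimal sketch of such a check, not part of the workflow itself. It assumes that the record name is the text after `>` up to the first whitespace, that the metadata is tab-delimited, and that the name column is called `Virus name`; all three are assumptions to verify against your own downloads.

```python
import csv
import io


def fasta_names(handle):
    """Collect record names: the text after ">" up to the first whitespace."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def metadata_names(handle, name_column):
    """Collect names from a tab-delimited table, with spaces replaced by underscores."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[name_column].replace(" ", "_") for row in reader}


# Tiny in-memory stand-ins for the two downloaded files (hypothetical records).
fasta = io.StringIO(">hCoV-19/USA/WA_1/2021\nACGT\n>hCoV-19/USA/WA_2/2021\nACGT\n")
table = io.StringIO("Virus name\tLineage\nhCoV-19/USA/WA 1/2021\tB.1\n")

# Names present in the FASTA but absent from the metadata.
missing = fasta_names(fasta) - metadata_names(table, "Virus name")
print(sorted(missing))
```

To run the same check on real downloads, replace the `io.StringIO` objects with file handles opened on the files saved in `data/`.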

### Download contextual data for your region of interest

Next, select the "Downloads" link from the EpiCoV navigation bar.
6 changes: 3 additions & 3 deletions scripts/sanitize_metadata.py
@@ -259,9 +259,9 @@ def resolve_duplicates(metadata, strain_field, error_on_duplicates=False):
)
)

- # Remove whitespaces from strain names since they are not allowed in FASTA
- # record names.
- metadata[strain_field] = metadata[strain_field].str.replace(" ", "")
+ # Replace whitespaces from strain names with underscores to match GISAID's
+ # convention since whitespaces are not allowed in FASTA record names.
+ metadata[strain_field] = metadata[strain_field].str.replace(" ", "_")

# Check for duplicates and try to resolve these by default.
try:
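
To make the effect of this change concrete, here is a small standalone sketch of the replacement on a toy table. The strain names are made up for illustration, and `regex=False` is added here only to make the literal replacement explicit; it is not part of the diff above.

```python
import pandas as pd

strain_field = "strain"

# Hypothetical strain names; the first one contains a space.
metadata = pd.DataFrame(
    {strain_field: ["hCoV-19/USA/WA 123/2021", "hCoV-19/USA/WA-456/2021"]}
)

# Replace each space with an underscore as a literal (non-regex) substitution.
metadata[strain_field] = metadata[strain_field].str.replace(" ", "_", regex=False)

print(metadata[strain_field].tolist())
```

Under the previous behavior the first name would have become `hCoV-19/USA/WA123/2021`, which would not match a FASTA record named `hCoV-19/USA/WA_123/2021`.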