Download from SRA

Sam Minot edited this page Mar 19, 2020 · 5 revisions

It is often useful to download metagenomic microbiome data from the NCBI Sequence Read Archive (SRA). To automate this download in a way that makes it easy to analyze the resulting FASTQ datasets with geneshot, we provide a script called download_sra.nf.

nextflow run Golob-Minot/geneshot/download_sra.nf <ARGUMENTS>

NOTE:   This script expects paired-end FASTQ data, and will not download any other type

Required Arguments:
  --accession           Accession for NCBI BioProject to download
  --output              Folder to write output files

Output Files:

All output files will be written to the --output folder. This includes one or two
FASTQ files per Run as well as a `$BIOPROJECT.metadata.csv` file listing all of the files which
were downloaded, as well as the metadata describing those samples within NCBI.

The `$BIOPROJECT.metadata.csv` file will also include the metadata recorded for this set of Runs
within the SRA database. The columns for this file may not be formatted nicely,
but they do match the structure of the data within the SRA API.

Here is an example of how you can run this script:

nextflow run Golob-Minot/geneshot/download_sra.nf \
    --accession PRJNA541981 \
    --output output_download_1

That command will:

  1. Download the forward and reverse reads for each Run from SRA in FASTQ(.GZ) format.
  2. Create a manifest CSV (PRJNA541981.metadata.csv) describing which FASTQ.GZ files correspond to which Runs.
  3. Save all of those files to the folder specified by --output.
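
Once the download has finished, it can be worth confirming that every Run has both read files before starting an analysis. The Python sketch below is one way to do that. It assumes, hypothetically, that the files are named `<Run>_1.fastq.gz` and `<Run>_2.fastq.gz`; the naming actually used by download_sra.nf may differ, so adjust the pattern to match what you find in the --output folder.

```python
import re
from collections import defaultdict
from pathlib import Path

# ASSUMPTION: files are named <Run>_1.fastq.gz / <Run>_2.fastq.gz.
# This pattern is illustrative and is not taken from download_sra.nf.
FASTQ_PATTERN = re.compile(r"^(?P<run>.+)_(?P<mate>[12])\.fastq(\.gz)?$")


def group_reads_by_run(filenames):
    """Map each Run accession to the set of mate numbers ('1', '2') found."""
    mates = defaultdict(set)
    for name in filenames:
        match = FASTQ_PATTERN.match(Path(name).name)
        if match:
            mates[match.group("run")].add(match.group("mate"))
    return dict(mates)


def unpaired_runs(filenames):
    """Return the sorted list of Runs missing a forward or reverse file."""
    grouped = group_reads_by_run(filenames)
    return sorted(run for run, found in grouped.items() if found != {"1", "2"})


if __name__ == "__main__":
    # Folder name taken from the example command above.
    output_dir = Path("output_download_1")
    names = [p.name for p in output_dir.glob("*.fastq*")] if output_dir.is_dir() else []
    print("Runs without both read files:", unpaired_runs(names))
```

An empty list means every Run matched by the pattern has both a forward and a reverse file; files that do not match the pattern are ignored.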

We hope that the PRJNA541981.metadata.csv file created by this script can be used directly as the manifest for analysis with geneshot. However, the metadata may not be in an ideal format, so the manifest will probably need to be edited to include the information required for biological analysis of this dataset.
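
Edits to the manifest can be made reproducibly with a short script rather than by hand. The sketch below uses only the Python standard library to report any expected columns that are missing from a manifest and to write a copy with one extra metadata column. The required column names (`specimen`, `R1`, `R2`) and the helper names are assumptions for illustration; check the geneshot documentation and your own metadata.csv for the real column names.

```python
import csv

# ASSUMPTION: columns geneshot is expected to need; verify against its docs.
REQUIRED_COLUMNS = ("specimen", "R1", "R2")


def missing_columns(manifest_path, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the manifest header."""
    with open(manifest_path, newline="") as handle:
        header = csv.DictReader(handle).fieldnames or []
    return [name for name in required if name not in header]


def add_column(manifest_path, out_path, key_column, new_column, values):
    """Write a copy of the manifest with `new_column` filled from `values`.

    `values` maps the entry in `key_column` to the new value; rows without
    a match get an empty string.
    """
    with open(manifest_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        if new_column not in fieldnames:
            fieldnames.append(new_column)
        rows = [{**row, new_column: values.get(row[key_column], "")} for row in reader]
    with open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

For example, calling `missing_columns("PRJNA541981.metadata.csv")` lists whichever of the assumed columns are absent, and `add_column` can then attach a study-specific field (say, a hypothetical `treatment` column) keyed on whichever column identifies each sample. Writing to a new file leaves the downloaded manifest untouched.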