Documentation and description of AWS iGenomes S3 resource.

ewels/AWS-iGenomes

AWS-iGenomes

Common reference genomes hosted on AWS S3

Download script & command builder: https://ewels.github.io/AWS-iGenomes/

Introduction

In NGS bioinformatics, a typical analysis run involves aligning raw DNA sequencing reads against a known reference genome. A different reference is needed for every species, and many species have several references to choose from. Each tool then builds its own indices against these references. As such, one analysis run typically requires a number of different files. For example: the raw underlying DNA sequence, annotation (GTF files) and index files for use with the chosen alignment tool.

These files are quite large and take time to generate. Downloading and building them for each AWS run often takes a significant proportion of the total run time and resources, which is very wasteful. To help with this, we have created an AWS S3 bucket containing the illumina iGenomes references, with a few additional indices for extra tools on top of this base dataset. The iGenomes initiative aims to collect and standardise a number of common species, references and tool indices.

This data is hosted in an S3 bucket (~5TB) and crucially is uncompressed (unlike the .tar.gz files held on the illumina iGenomes FTP servers). AWS runs can pull just the required files to their local file storage before running. This has the advantage of being faster, cheaper and more reproducible.

Download Script

To make usage easier, this repository contains a script (aws-igenomes.sh) which can sync the AWS-iGenomes references for you. It requires the AWS command line tools to be installed and configured with authentication. Required references can be supplied on the command line or given through prompts when running the script.

This repository is hosted using GitHub pages, so the script can be run in a single command as follows:

curl -fsSL https://ewels.github.io/AWS-iGenomes/aws-igenomes.sh | bash

For more details, see https://ewels.github.io/AWS-iGenomes/

Command Builder

If you'd prefer to just get a sync command for the files you need, you can use the web-based command builder that's available at https://ewels.github.io/AWS-iGenomes/

Instructions

Bucket details

The details of the S3 bucket are as follows:

  • Bucket Name: ngi-igenomes
  • Bucket ARN: arn:aws:s3:::ngi-igenomes
  • Region: EU (Ireland)
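
As a quick check that the bucket is reachable, something like the following sketch may help. It assumes the igenomes/ top-level prefix used in the sync example in this README, the eu-west-1 region code for EU (Ireland), and that unsigned requests are accepted while the bucket remains open access. The helper name list_igenomes is hypothetical.

```shell
# Hypothetical helper: list the contents of a prefix inside the bucket.
# $1 = optional path below igenomes/, e.g. "Homo_sapiens/"
list_igenomes() {
    # --no-sign-request skips credential signing (only works while the bucket is public)
    aws s3 ls "s3://ngi-igenomes/igenomes/$1" --no-sign-request --region eu-west-1
}

# Example calls (need the AWS CLI and network access):
# list_igenomes
# list_igenomes Homo_sapiens/
```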

Description of Files

A full list of available files can be seen in ngi-igenomes_file_manifest.txt
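
If you have a local copy of the manifest, a small filter can narrow it down to one reference build. This is only a sketch: it assumes the manifest lists one path per line and that each path contains igenomes/<Species>/<Source>/<Build>/, as in the sync example in this README. The function name manifest_for_build is made up for illustration.

```shell
# Hypothetical helper: print manifest lines belonging to one reference build.
# $1 = manifest file, $2 = species, $3 = source, $4 = build
manifest_for_build() {
    # -F treats the pattern as a fixed string, so dots in build names are literal
    grep -F "igenomes/$2/$3/$4/" "$1"
}

# Example call, assuming the manifest has been saved next to this script:
# manifest_for_build ngi-igenomes_file_manifest.txt Homo_sapiens Ensembl GRCh37
```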

The following species have reference builds available:

  • Arabidopsis thaliana
  • Bacillus cereus ATCC 10987
  • Bacillus subtilis 168
  • Bos taurus
  • Caenorhabditis elegans
  • Canis familiaris
  • Danio rerio
  • Drosophila melanogaster
  • Enterobacteriophage lambda
  • Equus caballus
  • Escherichia coli K 12 DH10B
  • Escherichia coli K 12 MG1655
  • Gallus gallus
  • Glycine max
  • Homo sapiens
  • Macaca mulatta
  • Mus musculus
  • Mycobacterium tuberculosis H37RV
  • Oryza sativa japonica
  • Pan troglodytes
  • PhiX
  • Pseudomonas aeruginosa PAO1
  • Rattus norvegicus
  • Rhodobacter sphaeroides 2.4.1
  • Saccharomyces cerevisiae
  • Schizosaccharomyces pombe
  • Sorangium cellulosum So ce 56
  • Sorghum bicolor
  • Staphylococcus aureus NCTC 8325
  • Sus scrofa
  • Zea mays

Most of these species then have references from multiple sources and builds. For example, Mus musculus has the following:

  • Ensembl
    • GRCm38, NCBIM37
  • NCBI
    • build37.1, build37.2, GRCm38
  • UCSC
    • mm10, mm9
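
The species, source and build appear to map directly onto the bucket paths, judging by the sync example in this README (igenomes/Homo_sapiens/Ensembl/GRCh37/...). A minimal sketch of building such a prefix follows. The underscore-for-space convention is inferred from that single example, and igenomes_prefix is a hypothetical helper name, so check the result against the file manifest before relying on it.

```shell
# Hypothetical helper: build the S3 prefix for a species / source / build.
# $1 = species (spaces allowed), $2 = source, $3 = build
igenomes_prefix() {
    # Assumption: spaces in species names become underscores in bucket paths
    species=$(printf '%s' "$1" | tr ' ' '_')
    printf 's3://ngi-igenomes/igenomes/%s/%s/%s/\n' "$species" "$2" "$3"
}

igenomes_prefix "Mus musculus" Ensembl GRCm38
```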

Within each reference build, the following resources are typically available (with a few exceptions):

  • Gene annotation in GTF and BED format
  • Sequence FASTA files:
    • Whole genome files
    • Separate chromosomes
    • Abundant sequences
  • Alignment indices for the following tools:
  • For some genomes:

An additional special case is the GATK bundles, which are available for Homo sapiens (b37, hg19, hg38, GRCh37 and GRCh38).

See Data origin below for more details of how these files were generated.

Costs, billing and authentication

The S3 bucket is currently set to be completely open access (there were problems with the previous Requester Pays policy). This will remain the case until the credits awarded to fund this project from Amazon run out or expire (hopefully stable for some time yet).

Note that, if possible, it's best for us if you run in the same region as this S3 bucket (eu-west-1, Ireland). Then there should be no data transfer fees and the resource should stay around for longer. From the EC2 FAQ:

There is no Data Transfer charge between two Amazon Web Services within the same region (i.e. between Amazon EC2 US West and another AWS service in the US West). Data transferred between AWS services in different regions will be charged as Internet Data Transfer on both sides of the transfer.

Basic Usage

How you use this resource largely depends on how you're using AWS. Very generally however, you can retrieve your required data by using the AWS Command Line Interface.

For example, using the aws s3 sync command:

aws s3 sync s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/ ./my_refs/

If the aws tool isn't installed, probably the easiest way to get it is using pip:

pip install --upgrade --user awscli

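While the bucket is open access, adding the --no-sign-request flag should let these commands run without credentials. If that stops being the case, remember that you must configure the tool with some kind of AWS authentication to access the contents of the S3 bucket.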

For more information and help, see the AWS CLI user guide.
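
As a sketch of unauthenticated use, the sync command can be wrapped so that it always passes --no-sign-request (the flag mentioned in the changelog) and accepts extra options such as --dryrun to preview the transfer. The wrapper name sync_igenomes is hypothetical, and whether unsigned requests keep working depends on the bucket staying open access.

```shell
# Hypothetical wrapper around "aws s3 sync" for the public bucket.
# $1 = S3 prefix, $2 = local directory, remaining args = extra aws options
sync_igenomes() {
    src="$1"
    dest="$2"
    shift 2
    # Extra options (for example --dryrun) are passed through unchanged
    aws s3 sync "$src" "$dest" --no-sign-request "$@"
}

# Preview what would be downloaded, then run the real sync:
# sync_igenomes s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/ ./my_refs/ --dryrun
# sync_igenomes s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Sequence/STARIndex/ ./my_refs/
```

Running with --dryrun first is a cheap way to confirm the prefix is correct, since some indices are large.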

Usage with Nextflow

Nextflow is a powerful workflow manager allowing the creation of bioinformatics analysis pipelines. It was created to help the transition from traditional academic HPC systems to cloud computing. As such, it has extensive built-in support for a number of AWS features. One such feature is native integration with s3. This means that you can specify paths to required reference files in your pipeline which are stored in s3 and Nextflow will automatically retrieve them.

The repository contains an example Nextflow config file containing common paths and a suggested usage example: nextflow.config

For an example of this in action, see our NGI-RNAseq pipeline. The aws profile config contains s3 paths and our regular HPC config contains comparable regular file paths. This allows us to run the pipeline on either our HPC system or AWS with the same command and no extra setup.

Data origin

This resource is based on the illumina iGenomes references. These were downloaded and unpacked in April 2016.

After unpacking, references were added for STAR, Bismark and BED12. A new reference directory was created for each reference and the index built (see commands below).

A full list of available files can be seen in this repository: ngi-igenomes_file_manifest.txt

STAR

module load star/2.5.1b

STAR --runMode genomeGenerate --runThreadN 8 --genomeDir ./ --genomeFastaFiles genome.fa --sjdbGTFfile genes.gtf --sjdbOverhang 100

(If no GTF file was available, --sjdbGTFfile genes.gtf --sjdbOverhang 100 was not specified.)
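
The GTF-dependent behaviour described above could be expressed roughly as follows. This is an illustrative sketch and not the script that was actually used (see the UPPMAX repository for that). The function name star_index_cmd is hypothetical, and it only prints the command for review instead of running STAR.

```shell
# Hypothetical helper: print the STAR index command, adding the annotation
# options only when the GTF file exists.
# $1 = genome FASTA, $2 = GTF file (may be missing)
star_index_cmd() {
    cmd="STAR --runMode genomeGenerate --runThreadN 8 --genomeDir ./ --genomeFastaFiles $1"
    if [ -f "$2" ]; then
        # Splice junction options are only meaningful with an annotation
        cmd="$cmd --sjdbGTFfile $2 --sjdbOverhang 100"
    fi
    printf '%s\n' "$cmd"
}

star_index_cmd genome.fa genes.gtf
```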

Bismark

module load bowtie/1.1.2
module load bowtie2/2.2.6
module load bismark/0.14.5

bismark_genome_preparation ./
bismark_genome_preparation --bowtie2 ./

BED12

BED12 files were generated using the gtf2bed tool from ea-utils.

More details: STAR / Bismark / BED12

Files for the STAR, Bismark and BED12 additions were kindly generated by the UPPMAX team. Full details and the exact scripts used for this can be found at github.com/UPPMAX/bio-data.

GATK Bundles

The GATK Resource Bundles for builds b37, hg19 and hg38 were downloaded from the Broad FTP server on 2017-05-25. For more information about their contents, please see this article.

Please note that b37/CEUTrio.HiSeq.WGS.b37.NA12878.bam and associated files are not included. This file is ~355GB and, with the download rate limiting on the Broad FTP server, it was going to take nearly a year to transfer.

Mouse Bundles

The Mouse Genome Project data was added to allow for the usage of GRCm38 data with the Sarek pipeline. This data was simply downloaded from the MGP FTP and additional files were created.

These files were added to AWS-iGenomes in November 2019.

Downloaded Files

These included the dbSNP SNP files with index and the dbSNP Indel files with corresponding index.

mgp.v5.merged.snps_all.dbSNP142.vcf.gz
mgp.v5.merged.snps_all.dbSNP142.vcf.gz.tbi
mgp.v5.merged.indels.dbSNP142.normed.vcf.gz
mgp.v5.merged.indels.dbSNP142.normed.vcf.gz.tbi

Genome BED

While the annotation folder contains a BED file for gene annotation, no intervals BED or interval list, as required for running GATK, was available. A BED file was created from the genome.fa.fai of GRCm38 as follows:

awk -v FS='\t' -v OFS='\t' '{ print $1, "0", $2 }' genome.fa.fai > wgs_calling_regions.grcm38.bed

Then we created an interval_list file using this command:

gatk BedToIntervalList --INPUT References/GRCm38_calling_list.bed --OUTPUT References/GRCm38_calling_list.list --SEQUENCE_DICTIONARY References/genome.dict
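
To illustrate what the awk step above produces, here is a small self-contained sketch that runs the same awk program on a made-up two-contig index. The contig names and lengths are invented for the example. A .fai file holds the sequence name in column 1 and its length in column 2, so each contig becomes one BED interval spanning its full length.

```shell
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)

# Two made-up .fai records: name, length, offset, bases per line, bytes per line.
printf 'chr1\t1000\t6\t60\t61\nchr2\t500\t1030\t60\t61\n' > "$workdir/example.fa.fai"

# Same awk program as above: contig name, start 0, end = contig length.
awk -v FS='\t' -v OFS='\t' '{ print $1, "0", $2 }' "$workdir/example.fa.fai" > "$workdir/example.bed"

cat "$workdir/example.bed"
```

The resulting file has one tab-separated line per contig (name, 0, length), which is the shape BedToIntervalList expects as input alongside the sequence dictionary.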

The Future

AWS-iGenomes is now an AWS Open Data Resource (see https://registry.opendata.aws/aws-igenomes/). AWS has agreed to host up to 8TB of data for the AWS-iGenomes dataset until at least 28th October 2022. The resource has been renewed once so far and I hope that it will continue to be renewed for the foreseeable future.

If you have any questions please get in touch with Phil Ewels (phil.ewels@scilifelab.se, @ewels) or create an issue on this repository.

Changelog

Version v0.3 (dev)

  • Made a web interface for generating aws s3 sync commands (not everyone likes random command line scripts..)
  • Now that Amazon are taking the cost of the hosting, everything is fully public
    • Added --no-sign-request to the commands so that they work without authentication
  • Added new GRCh37 and GRCh38 builds for GATK
    • Different to the existing hg18 and hg19 builds only in that the file organisation is cleaner and consistent with the rest of iGenomes (old builds left for backwards-compatibility)
    • Contain new indexes for BWA. More to be added in the future.

Version v0.2 - 2016-05-25

  • Added GATK bundles b37, hg19 and hg38 from the Broad FTP download
  • Minor download script updates

Version v0.1 - 2016-05-23

Initial release. Repository created with a file list of the iGenomes resource, with added BED12, STAR and Bismark indices. Download bash script written and basic website created at https://ewels.github.io/AWS-iGenomes/

Credits

The iGenomes resource was created by illumina. All credit for the collection and standardisation of this data should go to them!

This S3 resource was set up and documented by Phil Ewels (@ewels). The additional references not found in the base iGenomes resource were created with the help of Wesley Schaal (@wschaal) - a system administrator at UPPMAX (Uppsala Multidisciplinary Center for Advanced Computational Science).

The resource was initially developed for use at the National Genomics Infrastructure at SciLifeLab in Stockholm, Sweden.

