Recovery of deleted deep sequencing data sheds more light on the early Wuhan SARS-CoV-2 epidemic

This repository contains the analysis of SARS-CoV-2 deep sequencing data recovered from the deleted BioProject PRJNA612766. The analysis corresponds to the work described in this pre-print.

Running the analysis

The analysis is almost fully automated by the Snakemake pipeline in Snakefile. The configuration for the analysis is in config.yaml. Note that the pipeline is somewhat convoluted and performs a variety of steps only tangentially related to the paper for this study. The reason is that the study began simply as an effort to validate the analyses in the joint WHO-China report on COVID-19 origins, but its goal gradually shifted upon the discovery of the deleted data set. For this reason, some vestigial parts of the code and analysis structure remain.

The only required manual step is to download existing coronavirus sequences from GISAID, which must be done manually after creating a GISAID account, since GISAID's data-sharing terms prevent redistribution of their sequences. For each of the two accession lists, download both the *.metadata.tsv.xz and *.fasta.xz files:

  • for the accessions in data/gisaid_sequences_through_Feb2020/accessions.txt, place the files in data/gisaid_sequences_through_Feb2020/
  • for the accessions in data/comparator_genomes_gisaid/accessions.txt, place the files in data/comparator_genomes_gisaid/
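After downloading, the two subdirectories should look roughly as follows (the *.xz file names are placeholders; the actual names depend on what GISAID calls the downloaded archives):

data/gisaid_sequences_through_Feb2020/
    accessions.txt
    <download>.metadata.tsv.xz
    <download>.fasta.xz
data/comparator_genomes_gisaid/
    accessions.txt
    <download>.metadata.tsv.xz
    <download>.fasta.xz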

After downloading these sequences and ensuring you have installed conda, build the main conda environment for the pipeline with:

conda env create -f environment.yml

Then activate the conda environment with:

conda activate SARS-CoV-2_PRJNA612766
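If activation worked, snakemake should now be on your PATH. A quick sanity check (assuming the environment pins snakemake, which the pipeline requires):

snakemake --version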

You can then run the entire analysis with:

snakemake -j 1 --use-conda

Note that you need the --use-conda flag because one of the rules in Snakefile uses a separate environment, as specified in environment_ete3.yml.
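Before committing compute time, you can preview the jobs the pipeline will schedule with Snakemake's standard dry-run flag (a suggested check, not a required step):

snakemake -n --use-conda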

The above command runs the snakemake pipeline using just one computing core. If you want to use more cores, adjust the value passed to -j appropriately. If you have access to a computing cluster, you can distribute the run across the cluster; for the Fred Hutch computing cluster, that can be done using cluster.yaml by running the pipeline with the commands in run_Hutch_cluster.bash.
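For example, to run locally on 16 cores instead of one (pick a value appropriate for your machine):

snakemake -j 16 --use-conda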

Input data, results, etc.

The input data needed for the analysis are all available in the ./data/ subdirectory, which contains a README describing the files therein.

The results of running the pipeline are placed in the ./results/ subdirectory. Most of these results are not tracked in this GitHub repo, but some key files are tracked, as described in the Methods of the paper associated with this study.

The code used to process the Excel supplementary table of accessions from project PRJNA612766 to generate the information found in config.yaml is in ./manual_analyses/PRJNA612766/.

Paper

The LaTeX source for the paper and its figures is in the ./paper/ subdirectory.
