peptides

Currently, the only script in this repo (download_domain.py) downloads sequences from NCBI RefSeq in a form that is compatible with Kraken2.

This is a workaround for the problems with rsync, wget and ftp (which seem to be related to the newer NCBI file structure) that are reported in the Kraken2 issues here and here; none of the fixes listed there helped me.

I have based this on the scripts at this repo (which crashed before downloading all bacteria for me) to make a single script. It downloads any domain you choose, either complete genomes only or all genomes, as either protein sequences or nucleotide (DNA) sequences. Given the appropriate options, it can also download only the human genome from the vertebrate_mammalian section. It skips anything that has already been downloaded, and it adds the NCBI taxid to each sequence ID so that the output is compatible with Kraken2.
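For reference, Kraken2 picks up a taxid from a sequence ID that contains `kraken:taxid|<taxid>`. The following is a minimal sketch of that header rewrite, not the code in download_domain.py. The function names `add_kraken_taxid` and `tag_fasta_lines` are hypothetical, and it uses only the standard library rather than Biopython.

```python
def add_kraken_taxid(header, taxid):
    """Return a FASTA header line with a Kraken2 taxid tag added to the sequence ID.

    Hypothetical helper for illustration; download_domain.py may do this differently.
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    if "kraken:taxid|" in header:
        return header  # already tagged, leave untouched

    # Keep whatever line ending the header came with.
    body = header.rstrip("\r\n")
    ending = header[len(body):]

    # The sequence ID is everything up to the first space; the rest is the description.
    seq_id, sep, description = body[1:].partition(" ")
    return f">{seq_id}|kraken:taxid|{taxid}{sep}{description}{ending}"


def tag_fasta_lines(lines, taxid):
    """Yield FASTA lines with every header tagged; sequence lines pass through unchanged."""
    for line in lines:
        yield add_kraken_taxid(line, taxid) if line.startswith(">") else line
```

Because already-tagged headers are returned unchanged, running the rewrite twice over the same file should not stack tags.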

It also writes a log file that lists each sequence it has downloaded and added Kraken taxids to, as well as any files it had a problem with at some point (search for "Didn't" in the text file to find these).
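If the log is long, the search for "Didn't" can be scripted. This is a small sketch that assumes nothing about the log format beyond the marker text mentioned above; the function name `find_problem_lines` is hypothetical.

```python
def find_problem_lines(log_lines, marker="Didn't"):
    """Return (line_number, text) pairs for every log line containing the marker.

    `log_lines` can be any iterable of strings, including an open file handle.
    """
    return [
        (number, line.rstrip("\r\n"))
        for number, line in enumerate(log_lines, start=1)
        if marker in line
    ]
```

To use it, open the log file written by the script and pass the file handle to `find_problem_lines`, then print the returned pairs.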

Please feel free to email me with any questions.

It requires the following additional packages (as well as the standard os and argparse modules):

pandas

conda install pandas

biopython

conda install -c conda-forge biopython

Example usage

Example 1 - this will download the protein sequences of complete bacterial reference genomes:

python download_domain.py --domain bacteria --complete True --ext protein

Example 2 - this will download the nucleotide sequences of the reference human genome:

python download_domain.py --domain vertebrate_mammalian --complete True --ext dna --human True

Example 3 - this will download the nucleotide sequences of all fungal reference genomes (including scaffolds, contigs, etc.):

python download_domain.py --domain fungi --complete False --ext dna

One further note: the human genome is listed as Chromosome rather than Complete. I have made it so that the genome is still downloaded when the complete flag is set, but this is something to be aware of.
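To illustrate the special case, here is a rough sketch of the selection logic. It assumes the filtering is done on the `assembly_level` and `taxid` columns of NCBI's assembly_summary.txt, which is a guess about how download_domain.py works rather than a description of it. The function name `keep_assembly` is hypothetical, and rows are shown as plain dicts rather than a pandas DataFrame.

```python
HUMAN_TAXID = "9606"  # NCBI taxid for Homo sapiens


def keep_assembly(row, complete_only, human_only=False):
    """Decide whether one assembly_summary.txt row should be downloaded.

    `row` is a dict with at least the 'taxid' and 'assembly_level' columns.
    """
    is_human = str(row["taxid"]) == HUMAN_TAXID
    if human_only and not is_human:
        return False
    if not complete_only:
        return True  # all assembly levels are accepted
    level = row["assembly_level"]
    if level == "Complete Genome":
        return True
    # The human reference is listed as "Chromosome", so let it through anyway.
    return is_human and level == "Chromosome"
```

With `complete_only=True`, a human row at the Chromosome level is kept, while a non-human Chromosome-level row is skipped.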
