Building the TMvec search database
Structural similarity search requires a protein search database. This tutorial shows how to construct a queryable protein sequence database. Before reading on, keep in mind that prebuilt databases for CATHS100 and SwissProt are already available, so if you only want to search against those, you don't need to build anything.
But if you do want to build your own custom database, then keep reading.
First, make sure that you have downloaded the TMvec model checkpoint and its parameter file:
wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model.ckpt
wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model_params.json
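If you rerun the setup often, it can be convenient to skip files that are already on disk. The sketch below is one way to do that; `fetch_if_missing` is a hypothetical helper name, not part of TMvec, and it assumes that a non-empty local file with the same name is a complete download.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TMvec): download a URL into the current
# directory unless a non-empty file with the same name is already there.
fetch_if_missing() {
    url="$1"
    name="${url##*/}"   # file name = last path component of the URL
    if [ -s "$name" ]; then
        echo "skip: $name already present"
    else
        wget "$url"
    fi
}

# Intended calls, left commented so that nothing is downloaded until you opt in:
# fetch_if_missing https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model.ckpt
# fetch_if_missing https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model_params.json
```

The size check does not detect a truncated download, so delete the file and fetch it again if loading the checkpoint fails.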
Then, assuming that you have a FASTA file of proteins called bagel.fa, you can build your own database called bagel_database with TMvec as follows:
tmvec-build-database \
--input-fasta bagel.fa \
--tm-vec-model tm_vec_cath_model.ckpt \
--tm-vec-config-path tm_vec_cath_model_params.json \
--device 'gpu' \
--output bagel_database
Specifying --device gpu speeds up the encoding step, but it only works if you have followed the GPU install instructions. Once the database is built, you should see two files, bagel_database.meta and bagel_database.db. These files can be used for scalable remote homology search.
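A quick way to confirm that the build produced both files is to test for them by prefix. This is a minimal sketch: `check_database_files` is a hypothetical helper, and it assumes only what is stated above, namely that the outputs are named after the --output prefix with the suffixes .meta and .db.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TMvec): report whether <prefix>.meta and
# <prefix>.db both exist and are non-empty. Returns 0 only if both are present.
check_database_files() {
    prefix="$1"
    status=0
    for suffix in meta db; do
        if [ -s "${prefix}.${suffix}" ]; then
            echo "ok: ${prefix}.${suffix}"
        else
            echo "missing or empty: ${prefix}.${suffix}"
            status=1
        fi
    done
    return "$status"
}
```

Call it with the same value you passed to --output, for example `check_database_files bagel_database`. It only checks that the files are there and non-empty, not that their contents are valid.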
If you need to generate alignments, you'll also need to construct a FASTA index. The index lets sequences be looked up directly by name in a FASTA file. It can be built as follows:
build-fasta-index \
--fasta bagel.fa \
--faidx bagel.fai
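Because the index looks sequences up by name, repeated names in the FASTA file are worth catching before you build it. The sketch below lists any name that occurs more than once. It assumes that a record's name is the header text up to the first whitespace, which is a common FASTA indexing convention but is not confirmed here for build-fasta-index; `duplicate_fasta_names` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every sequence name that appears more than once.
# Assumption: the name is the header text up to the first whitespace, without '>'.
duplicate_fasta_names() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d
}

# Toy input written to a temporary file so the sketch is self-contained.
toy="$(mktemp)"
printf '>a desc\nMK\n>b\nGS\n>a other\nLE\n' > "$toy"
duplicate_fasta_names "$toy"   # prints: a
rm -f "$toy"
```

An empty result means every name is unique under that assumption.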
If you want to speed up database construction using your GPU, make sure to specify the --device flag. You can either pass --device gpu or specify a device number (e.g. --device 0).
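In scripts that run on machines with and without a GPU, the flag can be added conditionally. The following sketch assumes an NVIDIA GPU and uses `nvidia-smi -L` to see whether one is visible; `device_args` is a hypothetical helper. It prints nothing when no GPU is found, so the command falls back to its default behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit "--device gpu" when nvidia-smi lists at least one
# GPU, and emit nothing otherwise. This checks driver visibility only; it does
# not verify that the GPU install instructions were followed.
device_args() {
    if command -v nvidia-smi >/dev/null 2>&1 &&
       nvidia-smi -L 2>/dev/null | grep -q '^GPU '; then
        echo "--device gpu"
    fi
}

device_args
```

To use it, append an unquoted `$(device_args)` to the tmvec-build-database command so that the flag and its value are passed as two separate arguments.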
By default, the entire ProtTrans model is downloaded. This can be time-consuming, especially if you are going to build multiple databases. To avoid this step, the --protrans-model option lets you specify a pre-downloaded ProtTrans model.