Building the TMvec search database

To perform structural similarity search, you first need to build a protein search database. This tutorial shows how to construct a queryable protein sequence database. Keep in mind that prebuilt databases for CATHS100 and SwissProt are already available, so if you only want to search against those, you do not need to build anything.

If you want to build your own custom database, keep reading.

First, make sure you have downloaded the TMvec model files:

wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model.ckpt
wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model_params.json

Then, assuming you have a FASTA file of proteins called bagel.fa, you can build your own database called bagel_database with TMvec as follows:

tmvec-build-database \
    --input-fasta bagel.fa \
    --tm-vec-model tm_vec_cath_model.ckpt \
    --tm-vec-config-path tm_vec_cath_model_params.json \
    --device 'gpu' \
    --output bagel_database

Setting --device to 'gpu' speeds up the encoding step. This option only works if you have followed the GPU install instructions. Once the database is built, you should see two files, bagel_database.meta and bagel_database.db. These files can be used for scalable remote homology search.
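
As a quick sanity check after the build, the sketch below confirms that both output files exist and are non-empty. The helper name check_tmvec_db is hypothetical and not part of TMvec; it only assumes the .meta and .db suffixes described above, appended to the value passed to --output.

```shell
# Hypothetical helper (not part of TMvec): verify that <prefix>.meta and
# <prefix>.db both exist and are non-empty.
check_tmvec_db() {
    prefix="$1"
    status=0
    for ext in meta db; do
        if [ ! -s "${prefix}.${ext}" ]; then
            echo "missing or empty: ${prefix}.${ext}" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage, with the --output value from the build command:
#   check_tmvec_db bagel_database && echo "database files found"
```

The function returns a non-zero status if either file is missing, so it can gate later steps in a script.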

If you need to generate alignments, you will also need to build a FASTA index, which allows sequences to be retrieved directly by name from a FASTA file. The index can be built as follows:

build-fasta-index \
    --fasta bagel.fa \
    --faidx bagel.fai
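
Because the index retrieves sequences by name, it can help to check the FASTA file for repeated names before indexing. The following is a minimal sketch using awk; the function names are hypothetical, and it assumes the sequence name is the header text after '>' up to the first whitespace, which may differ from how build-fasta-index parses headers.

```shell
# Print each sequence name that appears more than once in a FASTA file.
# Assumption: the name is the first whitespace-delimited token after '>'.
fasta_duplicate_names() {
    awk '/^>/ { name = substr($1, 2); if (seen[name]++ == 1) print name }' "$1"
}

# Count the records in a FASTA file (one header line per record).
fasta_record_count() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   fasta_record_count bagel.fa
#   fasta_duplicate_names bagel.fa
```

An empty result from fasta_duplicate_names means every name is unique under that assumption.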

Important considerations

To speed up database construction with a GPU, make sure to specify the --device flag. You can pass either --device gpu or a device number (e.g. --device 0).

By default, the entire ProtTrans model is downloaded each time. This can be time consuming, especially if you are going to build multiple databases. To avoid this step, the --protrans-model option lets you specify a pre-downloaded ProtTrans model.
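
When building several databases, one way to reuse a single downloaded model is to wrap the build command in a small function. The sketch below is illustrative only: build_db, PROTRANS_DIR, and DRY_RUN are hypothetical names, the model path is a placeholder, and it assumes that --protrans-model accepts a path to a locally saved model, which should be checked against the tool's help output.

```shell
# Placeholder location of a pre-downloaded model (assumption: the flag takes a path).
PROTRANS_DIR="${PROTRANS_DIR:-/path/to/protrans_model}"

# Build a database named <fasta basename>_database from one FASTA file.
# With DRY_RUN=1 the command is printed instead of executed.
build_db() {
    fasta="$1"
    prefix="${fasta%.*}_database"
    set -- tmvec-build-database \
        --input-fasta "$fasta" \
        --tm-vec-model tm_vec_cath_model.ckpt \
        --tm-vec-config-path tm_vec_cath_model_params.json \
        --protrans-model "$PROTRANS_DIR" \
        --output "$prefix"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: for f in *.fa; do build_db "$f"; done
```

Running once with DRY_RUN=1 is a cheap way to inspect the generated commands before starting a long build.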
