Before running the script on Google Colab, download the following reduced databases, which were built from the public data available on PhageScope (https://phagescope.deepomics.org/download), and place them in the top level of your Google Drive (My Drive). The script requires all of them to run:
- https://drive.google.com/file/d/1g3gbYUp_GDmMT587LC0j6Y9qTF2tzv0m/view?usp=sharing (database with infection protein sequences)
- https://drive.google.com/file/d/19FDHLSgKgBYkqxSmAcvqv1MSfMPU6SRm/view?usp=sharing
- https://drive.google.com/file/d/1_mnUoMlvNHkzPRMWISDB_mCEk80XAa4u/view?usp=sharing
- https://drive.google.com/file/d/1-2PQ30-_ssvHJYHPmgqwOOUYz-wReaWX/view?usp=sharing
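A quick check before running the script can confirm that the downloaded files are where the code expects them. The sketch below is only an illustration: the helper names `mount_drive` and `find_missing` are hypothetical, and the file names must be supplied by you, since they depend on what the downloads are called in your Drive. It assumes the standard Colab mount point `/content/drive`, where My Drive appears as `MyDrive`.

```python
from pathlib import Path


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab and return the My Drive path.

    Returns None outside Colab, where the google.colab module is unavailable.
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return None
    drive.mount(mount_point)
    return Path(mount_point) / "MyDrive"


def find_missing(drive_root, filenames):
    """Return the names from `filenames` that are not files directly under `drive_root`."""
    root = Path(drive_root)
    return [name for name in filenames if not (root / name).is_file()]
```

In a Colab cell, one might call `find_missing(mount_drive(), [...])` with the four downloaded file names and stop early if the returned list is not empty.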
Abstract: This project focuses on developing a script to find bacteriophages (phages) targeting specific bacterial receptors, with an emphasis on phage antireceptors. The approach combines two strategies: detecting antireceptors in prophages within bacterial genomes, and proposing candidate phages from related bacterial species. When possible, the two strategies cross-validate each other to improve prediction accuracy. PhageScope and PHASTEST serve as the data sources.
Background and Motivation: Phages can be used for precision-targeted gene delivery to bacteria, but accurately identifying or designing phages for specific bacterial receptors remains a challenge. This project aims to improve phage prediction and design by focusing on interactions between phage antireceptors and bacterial receptors, and by leveraging data from closely related bacterial species. By combining and cross-validating two complementary approaches (prophage antireceptor identification and receptor similarity across species), we seek to create more reliable models for phage therapy and gene delivery applications.
Dataset Summary: Prophages are searched for directly through the PHASTEST API. Phages of bacterial relatives are searched for in reduced databases assembled from the data available on the PhageScope website (more information in “Data and Code Availability”). These databases are organized to support three fast lookups: the phage IDs associated with a given bacterial host, the IDs of all proteins with the “infection” ontology for those phages, and the sequences of those proteins. Building them required top-to-bottom merging of the different databases referenced by PHASTEST, filtering, and removal of unnecessary fields.
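The lookup chain described above (host to phage IDs, phage IDs to infection protein IDs, protein IDs to sequences) can be sketched with plain Python dictionaries. This is a toy illustration, not the project's actual code: the field names (`phage_id`, `host`, `protein_id`, `ontology`), the helper functions, and all rows and sequences are invented for the example, and the real reduced databases may be stored and indexed differently.

```python
from collections import defaultdict


def merge_and_trim(tables, keep_fields):
    """Stack tables top-to-bottom and keep only the listed fields in each row."""
    return [{f: row[f] for f in keep_fields} for table in tables for row in table]


def index_by(rows, key_field, value_field):
    """Build a key -> list-of-values index so each lookup is a single dict access."""
    index = defaultdict(list)
    for row in rows:
        index[row[key_field]].append(row[value_field])
    return index


# Toy source tables (hypothetical rows); "length" plays the role of an unneeded field.
source_a = [{"phage_id": "phage_A", "host": "Escherichia coli", "length": 100}]
source_b = [
    {"phage_id": "phage_B", "host": "Escherichia coli", "length": 200},
    {"phage_id": "phage_C", "host": "Bacillus subtilis", "length": 300},
]
phage_hosts = merge_and_trim([source_a, source_b], ["phage_id", "host"])

annotations = [
    {"phage_id": "phage_A", "protein_id": "prot_1", "ontology": "infection"},
    {"phage_id": "phage_A", "protein_id": "prot_2", "ontology": "lysis"},
    {"phage_id": "phage_B", "protein_id": "prot_3", "ontology": "infection"},
    {"phage_id": "phage_C", "protein_id": "prot_4", "ontology": "infection"},
]
sequences = {
    "prot_1": "MKTAYIAK",
    "prot_2": "MSLLNQ",
    "prot_3": "MKTAYLAK",
    "prot_4": "MGGWVR",
}

phages_by_host = index_by(phage_hosts, "host", "phage_id")
infection_by_phage = index_by(
    [row for row in annotations if row["ontology"] == "infection"],
    "phage_id",
    "protein_id",
)


def infection_sequences_for_host(host):
    """Chain the three lookups: host -> phage IDs -> infection protein IDs -> sequences."""
    result = {}
    for phage_id in phages_by_host.get(host, []):
        for protein_id in infection_by_phage.get(phage_id, []):
            result[protein_id] = sequences[protein_id]
    return result
```

Filtering on the ontology once, when the index is built, keeps each later search to a handful of dictionary accesses, which is the kind of speed-up the reduced databases are meant to provide.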
Method Description: The script identifies bacteriophages (phages) that target specific bacterial hosts by focusing on proteins involved in the infection process, with an emphasis on cross-validating phage candidates. It operates as follows:
- Host Identification and Phage Targeting: The script prompts for the name or ID of a target bacterial host and retrieves phages known to target that host. If the target host lacks matching phages, the script expands the search to closely related species within the same genus, family, or order, favoring those with the shortest phylogenetic distance.
- Protein Sequence Extraction: Once candidate phages are identified, the script retrieves the sequences of their infection-related proteins, including phage antireceptors and other proteins essential to binding and host recognition.
- Prophage Detection and Protein Extraction: The script then searches for prophages within the genome of the target bacterial host and extracts sequences of prophage proteins that are similarly involved in infection.
- Cross-Validation of Sequence Homology: Finally, the script compares the infection-related protein sequences from both phages and prophages to validate host-target predictions. This dual set of protein sequences enables cross-validation, enhancing prediction reliability by confirming infection potential from both the phage’s and host’s perspectives.
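Two of the steps above, widening the host search and comparing sequences, can be sketched as follows. This is an illustration under stated assumptions, not the script's actual implementation: taxonomic rank order stands in for phylogenetic distance, `difflib.SequenceMatcher` from the Python standard library stands in for a proper homology search (an alignment tool would be the usual choice for real data), and the function names, data layout, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Ranks ordered from narrowest to widest; used here as a proxy for phylogenetic distance.
RANKS = ["species", "genus", "family", "order"]


def find_phages_with_fallback(target, taxonomy, phages_by_host):
    """Return (rank, phage_ids) for the narrowest rank at which any phage is found.

    `taxonomy` maps a host name to a dict with one entry per rank in RANKS.
    `phages_by_host` maps a host name to a list of phage IDs.
    """
    target_lineage = taxonomy[target]
    for rank in RANKS:
        hits = sorted({
            phage
            for host, lineage in taxonomy.items()
            if lineage[rank] == target_lineage[rank]
            for phage in phages_by_host.get(host, [])
        })
        if hits:
            return rank, hits
    return None, []


def similarity(seq_a, seq_b):
    """Crude similarity in [0, 1]; a stand-in for an alignment-based score."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def cross_validate(phage_seqs, prophage_seqs, threshold=0.8):
    """List (phage_protein, prophage_protein, score) pairs at or above the threshold."""
    matches = []
    for phage_id, phage_seq in phage_seqs.items():
        for prophage_id, prophage_seq in prophage_seqs.items():
            score = similarity(phage_seq, prophage_seq)
            if score >= threshold:
                matches.append((phage_id, prophage_id, score))
    return sorted(matches, key=lambda match: match[2], reverse=True)
```

A phage protein that also has a close match among the host's prophage proteins is supported by both lines of evidence, which is the cross-validation idea described in the last step.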