Commit

Update README.md
memgonzales committed Apr 23, 2024
1 parent 03232c2 commit 6c2a934
Showing 1 changed file with 1 addition and 5 deletions.
6 changes: 1 addition & 5 deletions README.md
@@ -70,23 +70,19 @@ python3 -m pip install -r requirements.txt
python3 phiembed.py --input <input_filename> --output <output_filename>
```

Arguments:

- `input_filename` is the filename of the FASTA file containing the receptor-binding protein sequences.
- `output_filename` is the filename of the file to which the results of running PHIEmbed will be written.
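
Before running the script, it can help to check that the input file parses as FASTA. The following is a minimal sketch using only the Python standard library; `read_fasta` and the sample records are hypothetical and are not part of PHIEmbed, and the sketch assumes plain multi-line FASTA with `>` header lines.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a dict mapping each header to its sequence.

    Hypothetical helper for illustration; not part of PHIEmbed.
    """
    records = {}
    header = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if header in records:
                raise ValueError(f"duplicate header: {header}")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may be wrapped over several lines; join them.
            records[header] += line
    return records


# Made-up sample input standing in for a real FASTA file.
example = io.StringIO(">rbp_1\nMKT\nAAG\n>rbp_2\nMSL\n")
sequences = read_fasta(example)
```

For a real file, the same function can be called on `open(input_filename)`.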

Each row in the results file contains two comma-separated values: a host genus and the predicted class probability. The rows are sorted in order of decreasing class probability. Hence, the first row in the results file corresponds to the top-ranked prediction.
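
The results file described above can be read back with the standard `csv` module. This is a minimal sketch, assuming the file has no header row; `read_predictions`, `top_k`, and the sample genera and probabilities are hypothetical and do not come from an actual PHIEmbed run.

```python
import csv
import io


def read_predictions(handle):
    """Read (host genus, class probability) pairs in file order."""
    predictions = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        genus = row[0].strip()
        probability = float(row[1])
        predictions.append((genus, probability))
    return predictions


def top_k(predictions, k=3):
    """Return the k most probable genera, re-sorting defensively."""
    return sorted(predictions, key=lambda p: p[1], reverse=True)[:k]


# Made-up sample contents standing in for a real results file.
sample = io.StringIO("Escherichia,0.62\nSalmonella,0.21\nKlebsiella,0.09\n")
preds = read_predictions(sample)

# Rows are sorted by decreasing probability, so the first row is the
# top-ranked prediction.
best_genus, best_prob = preds[0]
```

Because the rows are already sorted, `preds[0]` is enough for the top-ranked host; `top_k` only matters if the file may have been edited or concatenated afterwards.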

Under the hood, this script first converts each sequence into a protein embedding using ProtT5 (the top-performing protein language model based on our experiments) and then passes the embedding to a random forest classifier trained on our _entire_ dataset.

### Training PHIEmbed

```
python3 train.py --input <training_dataset>
```

Argument:

- `training_dataset` is the filename of the training dataset.

The training dataset should be formatted as a CSV file (without a header row). Each row corresponds to a training sample. The first column is for the protein IDs, the second column is for the host genera, and the next 1,024 columns are for the components of the ProtT5 embeddings.
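
The layout above (protein ID, host genus, then 1,024 embedding components, with no header row) can be checked before training. The following is a minimal sketch using only the standard library; `read_training_rows` and the synthetic row are hypothetical, and `train.py` may parse the file differently.

```python
import csv
import io

EMBEDDING_DIM = 1024  # number of ProtT5 embedding components per sample


def read_training_rows(handle, embedding_dim=EMBEDDING_DIM):
    """Split each CSV row into protein ID, host genus, and embedding.

    Raises ValueError if a row does not have 2 + embedding_dim columns.
    """
    expected = 2 + embedding_dim
    ids, genera, embeddings = [], [], []
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != expected:
            raise ValueError(
                f"row {line_no}: expected {expected} columns, got {len(row)}"
            )
        ids.append(row[0])
        genera.append(row[1])
        embeddings.append([float(value) for value in row[2:]])
    return ids, genera, embeddings


# Synthetic single-row dataset with an all-zero embedding, for illustration.
demo_row = ["protein_1", "Escherichia"] + ["0.0"] * EMBEDDING_DIM
demo = io.StringIO(",".join(demo_row) + "\n")
ids, genera, embeddings = read_training_rows(demo)
```

Failing early on a wrong column count gives a clearer error than a shape mismatch raised later during training.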