Where to find the exact fasta files used for model building #9
hamidghaedi asked this question in Q&A
Replies: 1 comment
Hi @hamidghaedi, we filter out some sequences and keep only unique sequences; see Line 60 in 6c463b1. To get the exact set used for training, you can run the process() function and write its output to a file.
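As a rough illustration of the "filter, keep unique, write to a file" idea, here is a minimal sketch. It is not the repository's process() function: the helper names `dedupe_and_filter` and `write_fasta` are hypothetical, and the example filter (dropping sequences that contain `X`) is only an assumption, since the actual filtering criteria are not stated in this thread.

```python
def dedupe_and_filter(records, keep=lambda header, seq: True):
    """Return records that pass `keep`, keeping only the first
    occurrence of each distinct sequence.

    `records` is an iterable of (header, sequence) pairs.
    `keep` is a hypothetical predicate; the real criteria are unknown here.
    """
    seen = set()
    kept = []
    for header, seq in records:
        if not keep(header, seq):
            continue  # filtered out by the (assumed) criterion
        if seq in seen:
            continue  # duplicate sequence, already kept once
        seen.add(seq)
        kept.append((header, seq))
    return kept


def write_fasta(records, path, width=60):
    """Write (header, sequence) pairs to `path` in FASTA format."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n")
            # Wrap long sequences at `width` characters per line.
            for i in range(0, len(seq), width):
                out.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    # Toy records, not real data.
    toy = [("a", "MKT"), ("b", "MKT"), ("c", "MXT"), ("d", "MKA")]
    kept = dedupe_and_filter(toy, keep=lambda h, s: "X" not in s)
    print(len(kept), "records kept")
```

Calling `write_fasta(kept, "filtered_unique.fa")` afterwards would save the retained records; the output filename is just a placeholder.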
Hi there,
I really enjoyed reading the paper and would love to read it again.
On page 3 of the paper's supplementary document, I read: "We trained an amino acid residue-level language model on a total of 44,851 unique influenza A hemagglutinin (HA) amino acid sequences observed in animal hosts from 1908 through 2019." I assumed this was the number of sequences in the "data/influenza/ird_influenzaA_HA_allspecies.fa" file, but it is not: ird_influenzaA_HA_allspecies.fa contains 94,560 sequences. The same is true for the other viruses: the number of records in the provided fasta files does not match the numbers given in the supplementary file.
Would you please let me know where to look to see the exact fasta files you have used to train the models?
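For anyone wanting to check the discrepancy themselves, a small sketch like the following counts both the total records and the distinct sequences in a FASTA file. The function names `read_fasta` and `count_records` are hypothetical helpers written for this example, and it only measures duplication; it does not reproduce whatever additional filtering the authors applied, so the unique count need not equal 44,851.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_records(path):
    """Return (total number of records, number of distinct sequences)."""
    total = 0
    unique = set()
    for _, seq in read_fasta(path):
        total += 1
        unique.add(seq)
    return total, len(unique)


if __name__ == "__main__":
    # Path taken from the question above; skipped if the file is absent.
    fasta = Path("data/influenza/ird_influenzaA_HA_allspecies.fa")
    if fasta.exists():
        total, unique = count_records(fasta)
        print(f"{total} records, {unique} unique sequences")
```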