-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Converting sequence into indices of characters #1917
Conversation
This reverts commit 69e15b2.
I've generated the documentation, conducted local testing for the new functions, and everything looks good at this stage. Additionally, I had a face-to-face meeting with @qiyunzhu to delve into the rationale behind the proposed method and its improved speed compared to alternative approaches. |
Hi @qiyunzhu I think the proposed change makes a lot of sense -- indeed it is standard to map characters to indices for these types of algorithms. I strongly recommend looking into tokenizers in existing protein language models. If you are able to support naive support for Protrans or ESM2 it would substantially open up use cases for this functionality. In the current documentation, it isn't clear how one would specify the indices. The indices (aka tokenizers) in these models tends to be arbitrary, so it is highly advantageous to be able to swap out new indices. We have some of this functionality built into deepblast already. See here, here and here to get started. |
Hi @mortonjt This is a briliant idea! Although I am not familiar with it, I guess a |
Done. Note that you are basically implementing a tokenizer in this pull request -- you just didn't use that particular name. |
Let's merge this one and look into tokenization (and its reverse process) as @mortonjt suggested. |
This PR mainly implements one thing: converting a sequence into a vector of indices of characters. This is useful for matching characters in a sequence with indices in a substitution matrix, which is essential for efficient sequence alignment. With this, accessing the substitution score of two characters can be as simple as
submat[i, j]
.The conversion step may not be a bottlenecking step in the entire sequence alignment process. However, it could still cost some time, as a naive solution would be O(nk), in which n is the sequence length and k is the alphabet size, as compared to the dynamic programming algorithm which is O(n2). Therefore, I did some careful optimization. In the present implementation (see
_alphabet_to_hashes
and_indices_in_alphabet_ascii
), the entire conversion is achieved with:In which
seq._bytes
is a vector of ASCII code points, which is the native data structure underlying allskbio.Sequence
(sub)classes (other formats such as string requires conversion).submat._char_hash
is a pre-computed hash table, in which buckets are all possible ASCII code points (0 to 127), and values are indices of characters in the substitution matrix. Therefore, this hash table has a fixed shape of (128,). The output is a 1D array ofuint8
data type, which is most efficient in memory space.The reason not to directly use ASCII code points as indices is because in this scenario, a substitution matrix will occupy 128 * 128 * 8 = 128 KB memory space, which may be too much, especially when there are multiple of them (before we implement lazy loading). It could also prohibit the use of generalized alphabets (currently
SubstitutionMatrix
supports generalized alphabets).At present
skbio.Sequence
only supports ASCII characters. There is no way to generate a sequence with Unicode, extended ASCII (>127) or non-character values. This makes the optimization feasible and safe. However, I also implemented a generalized solution (_indices_in_alphabet
) which relies on dictionary query. It could be useful in the future when generalized sequence is implemented.Meanwhile, I implemented
_indices_in_observed
to convert sequences into vectors of indices in observed unique characters in them. This may be useful for sequence alignment based on edits (i.e., match, mismatch and gap).These features are flexible with the sequence type, including nucleotide, amino acid, or any arbitray grammared or ungrammared.
To better support these features, I added a
wildcard_char
attribute to the grammared sequence class. This character may beN
for nucleotides orX
for amino acids. The rationale is that some substitution matrices may not contain all available degenerate characters. This happens in many real matrices. For example,ACGTN
cannot handleR
orS
. The feature enables replacing such characters with the wildcard character.There could be more usages of the wildcard character. For example, if one wants to initiate a DNA sequence of a given length, they can do something like
DNA.full(10)
, and they getNNNNNNNNNN
. This can be useful in multiple applications (such as filling gaps between contigs).To ensure backward compatibility,
wildcard_char
is not yet an abstract property. Otherwise one cannot instantiate a custom sequence without setting this property. However, it may become one in a future version (requesting your opinion).The new feature is also capable of handling gaps. They automatically locate gaps, and mask them in the output vector, which then becomes a
np.ma.MaskedArray
. This is useful because gaps are allowed in skbio's sequences, and they need to be treated during alignment.A minor change is that I removed
pprint
from docstring examples as it is no longer necessary for modern Python dictionaries, which are always ordered.For record, there are alternative approaches to
_indices_in_alphabet
, such as the following. However, they are slower than the present versions in my tests on real-world sequence data.Method 1: Extract unique characters before finding indices.
Method 2: Use
np.searchsorted
NumPy (alphabet is a sorted array of characters)Please complete the following checklist:
I have read the guidelines in CONTRIBUTING.md.
I have documented all public-facing changes in CHANGELOG.md.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the
licenses
directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied. It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.This pull request does not include code, documentation, or other content derived from external source(s).
Note: REVIEWING.md may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.