Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converting sequence into indices of characters #1917

Merged
merged 9 commits into from
Feb 1, 2024

Conversation

qiyunzhu
Copy link
Contributor

This PR mainly implements one thing: converting a sequence into a vector of indices of characters. This is useful for matching characters in a sequence with indices in a substitution matrix, which is essential for efficient sequence alignment. With this, accessing the substitution score of two characters can be as simple as submat[i, j].

The conversion step may not be a bottlenecking step in the entire sequence alignment process. However, it could still cost some time, as a naive solution would be O(nk), in which n is the sequence length and k is the alphabet size, as compared to the dynamic programming algorithm which is O(n2). Therefore, I did some careful optimization. In the present implementation (see _alphabet_to_hashes and _indices_in_alphabet_ascii), the entire conversion is achieved with:

submat._char_hash[seq._bytes]

In which seq._bytes is a vector of ASCII code points, which is the native data structure underlying all skbio.Sequence (sub)classes (other formats such as string requires conversion). submat._char_hash is a pre-computed hash table, in which buckets are all possible ASCII code points (0 to 127), and values are indices of characters in the substitution matrix. Therefore, this hash table has a fixed shape of (128,). The output is a 1D array of uint8 data type, which is most efficient in memory space.


The reason not to directly use ASCII code points as indices is because in this scenario, a substitution matrix will occupy 128 * 128 * 8 = 128 KB memory space, which may be too much, especially when there are multiple of them (before we implement lazy loading). It could also prohibit the use of generalized alphabets (currently SubstitutionMatrix supports generalized alphabets).

At present skbio.Sequence only supports ASCII characters. There is no way to generate a sequence with Unicode, extended ASCII (>127) or non-character values. This makes the optimization feasible and safe. However, I also implemented a generalized solution (_indices_in_alphabet) which relies on dictionary query. It could be useful in the future when generalized sequence is implemented.

Meanwhile, I implemented _indices_in_observed to convert sequences into vectors of indices in observed unique characters in them. This may be useful for sequence alignment based on edits (i.e., match, mismatch and gap).

These features are flexible with the sequence type, including nucleotide, amino acid, or any arbitray grammared or ungrammared.


To better support these features, I added a wildcard_char attribute to the grammared sequence class. This character may be N for nucleotides or X for amino acids. The rationale is that some substitution matrices may not contain all available degenerate characters. This happens in many real matrices. For example, ACGTN cannot handle R or S. The feature enables replacing such characters with the wildcard character.

There could be more usages of the wildcard character. For example, if one wants to initiate a DNA sequence of a given length, they can do something like DNA.full(10), and they get NNNNNNNNNN. This can be useful in multiple applications (such as filling gaps between contigs).

To ensure backward compatibility, wildcard_char is not yet an abstract property. Otherwise one cannot instantiate a custom sequence without setting this property. However, it may become one in a future version (requesting your opinion).


The new feature is also capable of handling gaps. They automatically locate gaps, and mask them in the output vector, which then becomes a np.ma.MaskedArray. This is useful because gaps are allowed in skbio's sequences, and they need to be treated during alignment.


A minor change is that I removed pprint from docstring examples as it is no longer necessary for modern Python dictionaries, which are always ordered.


For record, there are alternative approaches to _indices_in_alphabet, such as the following. However, they are slower than the present versions in my tests on real-world sequence data.

Method 1: Extract unique characters before finding indices.

uniq, index = np.unique(seq, return_inverse=True)
pos = list(map(alphabet.get, uniq))
absence = alphabet[absence]
pos = [absence if x is None else x for x in pos]
return np.array(pos)[index]

Method 2: Use np.searchsorted NumPy (alphabet is a sorted array of characters)

pos = np.searchsorted(alphabet, seq)
last = len(alphabet) - 1
pos[pos > last] = last
absent = alphabet[pos] != seq
return np.where(absent, absence, pos)

Please complete the following checklist:

  • I have read the guidelines in CONTRIBUTING.md.

  • I have documented all public-facing changes in CHANGELOG.md.

  • This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied. It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.

  • This pull request does not include code, documentation, or other content derived from external source(s).

Note: REVIEWING.md may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

@qiyunzhu qiyunzhu requested review from wasade and mortonjt January 17, 2024 00:44
@qiyunzhu
Copy link
Contributor Author

Hi @wasade @mortonjt Would either of you be interested in taking a look? Thanks!

@mataton
Copy link
Collaborator

mataton commented Jan 23, 2024

I've generated the documentation, conducted local testing for the new functions, and everything looks good at this stage. Additionally, I had a face-to-face meeting with @qiyunzhu to delve into the rationale behind the proposed method and its improved speed compared to alternative approaches.

@qiyunzhu
Copy link
Contributor Author

@mataton Thank you!

@wasade and @mortonjt Does either of you want to take a look? Or shall we just move forward? Thanks!

@mortonjt
Copy link
Collaborator

Hi @qiyunzhu I think the proposed change makes a lot of sense -- indeed it is standard to map characters to indices for these types of algorithms.

I strongly recommend looking into tokenizers in existing protein language models. If you are able to support naive support for Protrans or ESM2 it would substantially open up use cases for this functionality. In the current documentation, it isn't clear how one would specify the indices. The indices (aka tokenizers) in these models tends to be arbitrary, so it is highly advantageous to be able to swap out new indices.

We have some of this functionality built into deepblast already. See here, here and here to get started.

@qiyunzhu
Copy link
Contributor Author

Hi @mortonjt This is a briliant idea! Although I am not familiar with it, I guess a Sequence.tokenize method may be a low-hanging fruit to implement, and opens lots of new possibilities. How about you create an issue to suggest this change?

@mortonjt
Copy link
Collaborator

Done. Note that you are basically implementing a tokenizer in this pull request -- you just didn't use that particular name.
In a future PR, it would also be helpful to enable backwards transforms : indices -> alphabet, as well as alphabet -> indices (which is what is implemented here)

@qiyunzhu
Copy link
Contributor Author

qiyunzhu commented Feb 1, 2024

Let's merge this one and look into tokenization (and its reverse process) as @mortonjt suggested.

@mataton mataton merged commit 3071b7c into scikit-bio:master Feb 1, 2024
22 checks passed
@qiyunzhu qiyunzhu deleted the align branch February 1, 2024 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants