Converting sequence into indices of characters #1917

qiyunzhu · 2024-01-17T00:44:09Z

This PR mainly implements one thing: converting a sequence into a vector of indices of characters. This is useful for matching characters in a sequence with indices in a substitution matrix, which is essential for efficient sequence alignment. With this, accessing the substitution score of two characters can be as simple as submat[i, j].
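As an illustration of why index vectors are convenient, the sketch below scores characters with plain NumPy indexing. The toy match/mismatch matrix and the variable names are assumptions made for this example and are not taken from the PR or from skbio's SubstitutionMatrix.

```python
import numpy as np

# Toy match/mismatch matrix over the alphabet "ACGT" (illustrative values).
submat = np.full((4, 4), -1)
np.fill_diagonal(submat, 2)

query = np.array([0, 1, 2, 3])  # "ACGT" expressed as indices
target = np.array([0, 2, 2])    # "AGG" expressed as indices

# Score of a single pair of characters (A vs. G).
score_ag = submat[0, 2]

# Scores of every query character against every target character,
# obtained in one step by broadcasting the two index vectors.
pairwise = submat[query[:, None], target[None, :]]
```

The resulting `pairwise` array has one row per query position and one column per target position, which is the layout a dynamic programming alignment would consume.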

The conversion step may not be a bottlenecking step in the entire sequence alignment process. However, it could still cost some time, as a naive solution would be O(nk), in which n is the sequence length and k is the alphabet size, as compared to the dynamic programming algorithm which is O(n²). Therefore, I did some careful optimization. In the present implementation (see _alphabet_to_hashes and _indices_in_alphabet_ascii), the entire conversion is achieved with:

submat._char_hash[seq._bytes]

Here, seq._bytes is a vector of ASCII code points, which is the native data structure underlying all skbio.Sequence (sub)classes (other formats, such as strings, require conversion). submat._char_hash is a pre-computed hash table in which the buckets are all possible ASCII code points (0 to 127) and the values are the indices of characters in the substitution matrix. Therefore, this hash table has a fixed shape of (128,). The output is a 1D array of uint8 data type, which is the most memory-efficient choice.
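A minimal sketch of this kind of lookup, written with plain NumPy rather than scikit-bio's private helpers. The function name `build_char_hash` and the use of 255 as a marker for characters outside the alphabet are assumptions for illustration only.

```python
import numpy as np

def build_char_hash(alphabet, absent=255):
    """Return a (128,) uint8 table mapping ASCII code point -> index in alphabet.

    Hypothetical helper; code points not in the alphabet map to `absent`.
    """
    table = np.full(128, absent, dtype=np.uint8)
    codes = np.frombuffer(alphabet.encode("ascii"), dtype=np.uint8)
    table[codes] = np.arange(len(alphabet), dtype=np.uint8)
    return table

char_hash = build_char_hash("ACGT")

# Stand-in for seq._bytes: the sequence as a vector of ASCII code points.
seq_bytes = np.frombuffer(b"GATTACA", dtype=np.uint8)

# The whole conversion is a single fancy-indexing operation.
indices = char_hash[seq_bytes]
```

Because the table is built once per alphabet, each conversion costs one array lookup per character regardless of the alphabet size.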

The reason not to directly use ASCII code points as indices is because in this scenario, a substitution matrix will occupy 128 * 128 * 8 = 128 KB memory space, which may be too much, especially when there are multiple of them (before we implement lazy loading). It could also prohibit the use of generalized alphabets (currently SubstitutionMatrix supports generalized alphabets).

At present, skbio.Sequence only supports ASCII characters. There is no way to generate a sequence with Unicode, extended ASCII (>127) or non-character values. This makes the optimization feasible and safe. However, I also implemented a generalized solution (_indices_in_alphabet) which relies on dictionary lookup. It could be useful in the future when generalized sequences are implemented.
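For comparison, a dictionary-based conversion that does not depend on ASCII could look roughly like the sketch below. The function name `indices_by_dict` and its `wildcard` parameter are hypothetical and do not reflect the signature of _indices_in_alphabet.

```python
import numpy as np

def indices_by_dict(seq, alphabet, wildcard=None):
    """Map each element of seq to its position in alphabet via a dict.

    Hypothetical helper. If `wildcard` is given, elements absent from the
    alphabet take the wildcard's index; otherwise they raise a ValueError.
    """
    index = {char: i for i, char in enumerate(alphabet)}
    if wildcard is not None:
        fallback = index[wildcard]
        return np.array([index.get(c, fallback) for c in seq], dtype=np.intp)
    try:
        return np.array([index[c] for c in seq], dtype=np.intp)
    except KeyError as err:
        raise ValueError(f"Element {err} is not in the alphabet.") from None

# Works for any hashable elements, including non-ASCII characters.
greek = indices_by_dict("γαβα", ["α", "β", "γ"])
with_wildcard = indices_by_dict("ACRT", "ACGTN", wildcard="N")
```

This costs one dictionary lookup per element, so it is more general but slower than indexing into a fixed-size table.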

Meanwhile, I implemented _indices_in_observed to convert sequences into vectors of indices in observed unique characters in them. This may be useful for sequence alignment based on edits (i.e., match, mismatch and gap).
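One way to obtain indices into the observed characters is `np.unique` with `return_inverse=True`. The sketch below shows that idea only; it is an assumption about the general approach and not necessarily how _indices_in_observed is written.

```python
import numpy as np

# Stand-in for seq._bytes: the sequence as a vector of ASCII code points.
seq_bytes = np.frombuffer(b"GATTACA", dtype=np.uint8)

# `observed` holds the sorted unique code points; `indices` maps every
# position of the sequence to its character's position in `observed`.
observed, indices = np.unique(seq_bytes, return_inverse=True)

observed_chars = observed.tobytes().decode("ascii")
```

Two sequences converted against a shared set of observed characters can then be compared position by position to classify matches and mismatches.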

These features work with any sequence type, including nucleotide, amino acid, or any arbitrary grammared or ungrammared sequence.

To better support these features, I added a wildcard_char attribute to the grammared sequence class. This character may be N for nucleotides or X for amino acids. The rationale is that some substitution matrices may not contain all available degenerate characters. This happens in many real matrices. For example, a matrix defined over ACGTN cannot handle R or S. The feature enables replacing such characters with the wildcard character.
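The wildcard replacement can be folded into the lookup table itself, as in the sketch below: every code point outside the alphabet is pre-filled with the wildcard's index. The variable names are illustrative assumptions, and the snippet does not use the wildcard_char attribute directly.

```python
import numpy as np

alphabet = "ACGTN"
wildcard = "N"

# Pre-fill the table with the wildcard's index, then overwrite the entries
# of the characters that the alphabet actually contains.
table = np.full(128, alphabet.index(wildcard), dtype=np.uint8)
codes = np.frombuffer(alphabet.encode("ascii"), dtype=np.uint8)
table[codes] = np.arange(len(alphabet), dtype=np.uint8)

# R and S are degenerate characters missing from the alphabet.
seq_bytes = np.frombuffer(b"ACRGTS", dtype=np.uint8)
indices = table[seq_bytes]
```

With this layout, R and S resolve to the same row and column as N without any extra pass over the sequence.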

There could be more usages of the wildcard character. For example, if one wants to initiate a DNA sequence of a given length, they can do something like DNA.full(10), and they get NNNNNNNNNN. This can be useful in multiple applications (such as filling gaps between contigs).

To ensure backward compatibility, wildcard_char is not yet an abstract property. Otherwise one cannot instantiate a custom sequence without setting this property. However, it may become one in a future version (requesting your opinion).

The new features are also capable of handling gaps. They automatically locate gaps and mask them in the output vector, which then becomes a np.ma.MaskedArray. This is useful because gaps are allowed in skbio's sequences, and they need to be handled during alignment.
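A rough sketch of gap masking with NumPy's masked arrays follows. The gap characters `-` and `.` and the way the mask is computed are assumptions for illustration; the PR's own functions may locate gaps differently.

```python
import numpy as np

# Lookup table for the alphabet "ACGT"; unlisted code points default to 0,
# which is harmless here because gap positions are masked afterwards.
table = np.zeros(128, dtype=np.uint8)
codes = np.frombuffer(b"ACGT", dtype=np.uint8)
table[codes] = np.arange(4, dtype=np.uint8)

gap_codes = np.frombuffer(b"-.", dtype=np.uint8)
seq_bytes = np.frombuffer(b"AC--GT.A", dtype=np.uint8)

# Locate the gaps, then attach them as a mask to the index vector.
is_gap = np.isin(seq_bytes, gap_codes)
indices = np.ma.masked_array(table[seq_bytes], mask=is_gap)

# Indices of the non-gap positions only.
ungapped = indices.compressed()
```

Downstream code can either skip masked positions or call `compressed()` to drop them, while the original alignment coordinates stay recoverable from the mask.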

A minor change is that I removed pprint from docstring examples as it is no longer necessary for modern Python dictionaries, which are always ordered.

For the record, there are alternative approaches to _indices_in_alphabet, such as the following. However, they were slower than the present versions in my tests on real-world sequence data.

Method 1: Extract unique characters before finding indices.

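import numpy as np

def indices_via_unique(seq, alphabet, absence):
    # seq: array of characters; alphabet: dict of character -> index;
    # absence: character whose index replaces characters not in alphabet
    uniq, index = np.unique(seq, return_inverse=True)
    pos = list(map(alphabet.get, uniq))
    absence = alphabet[absence]
    pos = [absence if x is None else x for x in pos]
    return np.array(pos)[index]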

Method 2: Use np.searchsorted (alphabet is a sorted NumPy array of characters).

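import numpy as np

def indices_via_searchsorted(seq, alphabet, absence):
    # seq: array of characters; alphabet: sorted array of characters;
    # absence: index assigned to characters not in alphabet
    pos = np.searchsorted(alphabet, seq)
    last = len(alphabet) - 1
    pos[pos > last] = last  # clip positions that fall past the end
    absent = alphabet[pos] != seq
    return np.where(absent, absence, pos)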

Please complete the following checklist:

I have read the guidelines in CONTRIBUTING.md.
I have documented all public-facing changes in CHANGELOG.md.
This pull request includes code, documentation, or other content derived from external source(s). If this is the case, ensure the external source's license is compatible with scikit-bio's license. Include the license in the licenses directory and add a comment in the code giving proper attribution. Ensure any other requirements set forth by the license and/or author are satisfied. It is your responsibility to disclose code, documentation, or other content derived from external source(s). If you have questions about whether something can be included in the project or how to give proper attribution, include those questions in your pull request and a reviewer will assist you.
This pull request does not include code, documentation, or other content derived from external source(s).

Note: REVIEWING.md may also be helpful to see some of the things code reviewers will be verifying when reviewing your pull request.

This reverts commit 69e15b2.

qiyunzhu · 2024-01-17T00:45:46Z

Hi @wasade @mortonjt Would either of you be interested in taking a look? Thanks!

mataton · 2024-01-23T22:34:10Z

I've generated the documentation, conducted local testing for the new functions, and everything looks good at this stage. Additionally, I had a face-to-face meeting with @qiyunzhu to delve into the rationale behind the proposed method and its improved speed compared to alternative approaches.

qiyunzhu · 2024-01-24T15:25:36Z

@mataton Thank you!

@wasade and @mortonjt Does either of you want to take a look? Or shall we just move forward? Thanks!

mortonjt · 2024-01-24T15:38:20Z

Hi @qiyunzhu I think the proposed change makes a lot of sense -- indeed it is standard to map characters to indices for these types of algorithms.

I strongly recommend looking into tokenizers in existing protein language models. If you are able to provide native support for ProtTrans or ESM2, it would substantially open up use cases for this functionality. In the current documentation, it isn't clear how one would specify the indices. The indices (aka tokenizers) in these models tend to be arbitrary, so it is highly advantageous to be able to swap in new indices.

We have some of this functionality built into deepblast already. See here, here and here to get started.

qiyunzhu · 2024-01-24T18:07:19Z

Hi @mortonjt This is a brilliant idea! Although I am not familiar with it, I guess a Sequence.tokenize method may be low-hanging fruit to implement, and it could open lots of new possibilities. How about you create an issue to suggest this change?

mortonjt · 2024-01-24T18:36:40Z

Done. Note that you are basically implementing a tokenizer in this pull request -- you just didn't use that particular name.
In a future PR, it would also be helpful to enable the backward transform (indices -> alphabet) in addition to the forward transform (alphabet -> indices), which is what is implemented here.
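The backward transform mentioned here can be sketched as indexing into an array of the alphabet's code points. This is only an illustration of the idea with assumed variable names, not an existing scikit-bio API.

```python
import numpy as np

# The alphabet as a vector of ASCII code points, in index order.
alphabet = np.frombuffer(b"ACGT", dtype=np.uint8)

# An index vector such as the forward transform would produce.
indices = np.array([2, 0, 3, 3, 0, 1, 0], dtype=np.uint8)

# Backward transform: look up each index in the alphabet, then decode.
decoded = alphabet[indices].tobytes().decode("ascii")
```

Swapping in a different `alphabet` array is enough to decode indices produced under another ordering, which is the flexibility that arbitrary tokenizer vocabularies would need.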

qiyunzhu · 2024-02-01T20:02:29Z

Let's merge this one and look into tokenization (and its reverse process) as @mortonjt suggested.

qiyunzhu added 8 commits December 21, 2023 10:11

updated URL and doc build

69e15b2

Revert "updated URL and doc build"

631ed95

This reverts commit 69e15b2.

Merge branch 'master' of https://github.com/biocore/scikit-bio

34bfef2

Merge branch 'master' of https://github.com/biocore/scikit-bio

d4558cb

added _get_alphabet_index

8879369

added _make_alphabet_and_index

67e2fc0

added sequence to indices

62d831b

updated changelog

f23f15e

qiyunzhu requested review from wasade and mortonjt January 17, 2024 00:44

fixing linting

5a59384

mortonjt mentioned this pull request Jan 24, 2024

Tokenizer compatibility with existing aligners / language models #1920

Closed

mataton merged commit 3071b7c into scikit-bio:master Feb 1, 2024
22 checks passed

qiyunzhu deleted the align branch February 1, 2024 20:06
