Peptide Matching Problem, Gao lab bioinformatics subgroup discussion

Topic for discussion: search sequence in fasta Problem description: A fasta file typically contains thousands of protein sequence, and how to quickly match millions or even billions of peptide sequences to a fasta file is a critical task in proteomics.

Goal: minimize the runtime of the following function

function search_pep_seq(peptide_seq_list, long_combined_seq) -> peptide_seq_pos_list [(start, end), (start, end), ...]

Example: input: search_pep_seq(['abc', 'def', 'efgh', 'xyz', 'xxx'], 'abcdefgh') -> [(0, 2), (3, 5), (4, 7), (-1, 1)], with non-exist seq, use -1 as start, end can be any and will be ignored.

The example.7z file, once extracted, will give you an example.json file, which contains three keys, use the first two as your input and check with the expected output in results, time it and see if you can beat the standard solution.

{'peptide_list': peptide_list, 'long_seq': long_seq, 'results': positions}

Example standard solution:

import ahocorasick

def build_automaton(peptide_list):
    peptide_list = list(set(peptide_list))
    A = ahocorasick.Automaton()
    for idx, peptide in enumerate(peptide_list):
        A.add_word(peptide, (idx, peptide))
    A.make_automaton()
    return A

def find_peptide_positions(long_seq, automaton):
    positions = defaultdict(list)
    for end_index, (idx, peptide) in automaton.iter(long_seq):
        start_index = end_index - len(peptide) + 1
        result = (start_index, end_index)
        positions[peptide].append(result)
    return positions
not_found_peptides = [peptide for peptide in peptide_list if peptide not in positions]

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
LICENSE		LICENSE
README.md		README.md
example.7z		example.7z

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Peptide Matching Problem, Gao lab bioinformatics subgroup discussion

About

Releases

Packages

License

Gaolaboratory/peptide_match

Folders and files

Latest commit

History

Repository files navigation

Peptide Matching Problem, Gao lab bioinformatics subgroup discussion

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages