add spectrumAI #70

Merged
merged 22 commits into from
Oct 17, 2022

Conversation

DongdongdongW
Collaborator

No description provided.

@ypriverol ypriverol linked an issue Aug 23, 2022 that may be closed by this pull request
pypgatk/spectrumAI/validate_peptides.py (outdated, resolved)
pypgatk/tests/pypgatk_tests.py (outdated, resolved)
pypgatk/commands/get_subpos.py (outdated, resolved)
@DongdongdongW
Collaborator Author

validate_peptides supports two tasks: computing the position of the variant amino acid within each variant peptide, and validating variant peptides with SpectrumAI.
Get position: pypgatk validate_peptides --input_psm_table xxx --input_fasta xxx --output_psm_table xxx
'--input_psm_table' is the PSM table for which positions are to be computed.
'--input_fasta' is the protein sequence file used for quantification.
'--output_psm_table' is the name of the output file.
SpectrumAI: pypgatk validate_peptides --mzml_path xxx --infile_name xxx --outfile_name xxx
or: pypgatk validate_peptides --mzml_files xxx --infile_name xxx --outfile_name xxx
'--mzml_path' is the path to the mzML files referenced in the PSM table.
'--mzml_files' lists the mzML files referenced in the PSM table (each given with its location; multiple files are separated by ',').
'--infile_name' is the PSM table to run SpectrumAI on. It must contain a 'position' column, which can be obtained with the previous command.
'--outfile_name' is the name of the output file.

@husensofteng
Member

Thanks for the great work. I agree with @ypriverol that it would be better to have one command for both processes. To avoid recalculating the variant position, we can skip that step when the position column already exists in the input_psm_table file.

Also, regarding the mzml_path, maybe it is better to change to mzmls_base_path since input_psm_table usually contains PSMs from multiple mzML files and the file names are written in one of the columns.
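
A minimal sketch of both suggestions, not the code in this PR: it skips the position step when a 'position' column is already present, and resolves each mzML file name from the table against a base directory. The helper names and the 'mzml_file' column name are hypothetical; only 'position', input_psm_table and mzmls_base_path come from the discussion above.

```python
import csv
import io
import os


def has_position_column(psm_table, delimiter="\t"):
    """Return True if the header row of the PSM table has a 'position' column."""
    reader = csv.reader(psm_table, delimiter=delimiter)
    header = next(reader, [])
    return "position" in header


def resolve_mzml_paths(rows, mzmls_base_path, file_column="mzml_file"):
    """Map each distinct mzML file name in the table to a path under the base directory."""
    names = sorted({row[file_column] for row in rows})
    return {name: os.path.join(mzmls_base_path, name) for name in names}


# Tiny in-memory table standing in for input_psm_table (made-up rows).
table = (
    "sequence\tmzml_file\tposition\n"
    "PEPTIDEK\trun1.mzML\t3\n"
    "PEPTLDEK\trun2.mzML\t5\n"
)
skip_position_step = has_position_column(io.StringIO(table))
rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))
paths = resolve_mzml_paths(rows, "/data/mzml")
```

With this layout the position step runs only when `skip_position_step` is false, and one base path is enough for a table that mixes PSMs from several mzML files.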

@DongdongdongW
Collaborator Author

Thank you. At present both tasks are under one command, but they are still two separate processes. Do you mean we should merge them into one process?
Also, mzml_path can already be the path to many mzML files. If necessary, I can rename mzml_path to mzmls_base_path.

@ypriverol
Member

No, @DongdongdongW, having only one command is fine for now. The only pending task is to support mzTab.

@DongdongdongW
Collaborator Author

Quoting @ypriverol: "No, @DongdongdongW, only one command is needed now. The only pending task is to support mzTab."

got it

add mztab class-fdr and modified the code that validates peptides
@husensofteng
Member

Regarding replacing BLAST for identifying the variant position, we discussed the following with @ypriverol:
We can avoid BLAST by implementing a function that identifies the proteins each peptide overlaps:

  1. Non-canonical peptides: each peptide should be compared to the canonical protein sequences, and only those with exactly one mismatch need to be checked by SpectrumAI. There is no need to check peptides with more than one mismatch, since two or more amino acid differences make them quite different from the canonical sequences.
  2. Mutated peptides: each peptide should be compared with all sequences, and those with a mismatch should be further checked by SpectrumAI.
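
The comparison described in the two points above could be sketched as a sliding-window mismatch count. This is an illustrative brute-force version, not the implementation in this PR; `mismatch_positions` and `classify_peptide` are hypothetical names and the sequences are made up.

```python
def mismatch_positions(peptide, window):
    """Return the 1-based positions where two equal-length strings differ."""
    return [i + 1 for i, (a, b) in enumerate(zip(peptide, window)) if a != b]


def classify_peptide(peptide, proteins):
    """Compare a peptide against every same-length window of every protein.

    Returns ("exact", None) if the peptide occurs unchanged,
    ("single", pos) with the 1-based position of the first single-mismatch
    hit found, and ("other", None) if every window differs in two or more
    residues (or no window fits).
    """
    n = len(peptide)
    single = None
    for seq in proteins:
        for start in range(len(seq) - n + 1):
            diffs = mismatch_positions(peptide, seq[start:start + n])
            if not diffs:
                return ("exact", None)
            if len(diffs) == 1 and single is None:
                single = diffs[0]
    return ("single", single) if single is not None else ("other", None)


# Made-up protein sequence for illustration only.
proteins = ["MKTAYIAKQRQISFVK"]
exact_hit = classify_peptide("AYIAKQR", proteins)
single_hit = classify_peptide("AYIVKQR", proteins)
no_hit = classify_peptide("GGGGGGG", proteins)
```

Under this sketch only peptides classified as "single" would be passed on to SpectrumAI, and the returned position could fill the 'position' column. The cost grows with peptide length times total sequence length, so a real implementation would likely need an index.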

@DongdongdongW
Collaborator Author

Do you mean we should use our own method to compare peptides and sequences? @husensofteng

@husensofteng
Member

husensofteng commented Sep 26, 2022

Yes, if we can have an efficient implementation. ahocorasick is good for exact matches, though I am not sure about its usability for single mismatches.
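
One way exact matching could be stretched to cover a single mismatch is the pigeonhole principle: if a peptide of length two or more matches a window with at most one substitution, at least one of its two halves matches exactly. The sketch below uses `str.find` for the exact step to stay dependency-free; an Aho-Corasick automaton built over the halves could in principle take its place. The function names are hypothetical, and this is not the approach merged in this PR.

```python
def find_all(text, pattern):
    """Yield every start index of pattern in text, overlapping hits included."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def hits_with_one_mismatch(peptide, protein):
    """Return sorted (start, mismatch_count) hits with at most one substitution."""
    n = len(peptide)
    half = n // 2
    # Two seeds: the left half and the right half, each with its peptide offset.
    seeds = [(0, peptide[:half]), (half, peptide[half:])]
    hits = {}
    for offset, seed in seeds:
        if not seed:
            continue
        for seed_start in find_all(protein, seed):
            start = seed_start - offset
            # Skip windows that would run off either end of the protein.
            if start < 0 or start + n > len(protein):
                continue
            window = protein[start:start + n]
            mismatches = sum(a != b for a, b in zip(peptide, window))
            if mismatches <= 1:
                hits[start] = mismatches
    return sorted(hits.items())


# Made-up protein sequence for illustration only.
protein = "MKTAYIAKQRQISFVK"
exact = hits_with_one_mismatch("AYIAKQR", protein)
one_off = hits_with_one_mismatch("AYIVKQR", protein)
none = hits_with_one_mismatch("AYLVKQR", protein)
```

Only windows located by an exact seed hit are verified, so the full comparison runs on a small number of candidates instead of every window.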

@ypriverol ypriverol self-requested a review October 17, 2022 12:51
pypgatk/commands/validate_peptides.py (outdated, resolved)
def test_get_subpos(self):
    runner = CliRunner()
    result = runner.invoke(cli,
                           ['get_subpos', '--input_psm_table', 'testdata/MFA380.tsv',
Member

@DongdongdongW, where is the MFA380.tsv file?

@ypriverol ypriverol merged commit b60cef1 into bigbio:spectrumAI Oct 17, 2022
Development

Successfully merging this pull request may close these issues.

implement spectrumAI in pypgatk
3 participants