Create Library Files python script - missing step #140

Closed
justinjjvanderhooft opened this issue Apr 23, 2022 · 10 comments

@justinjjvanderhooft

"7_create_library_files.py"

There is a small bug: the folder path_library needs to be created, e.g. by including:
if not(os.path.isdir(path_library)): os.makedirs(path_library)
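A minimal sketch of the same guard, assuming path_library is meant to be a directory. The helper name ensure_library_dir and the temporary demo path are hypothetical and are not part of the script. With exist_ok=True, os.makedirs creates missing parent folders and does nothing when the folder already exists, so no separate isdir check is needed:

```python
import os
import tempfile


def ensure_library_dir(path_library: str) -> str:
    """Create path_library (including parents) if it is missing and return it.

    Hypothetical helper for illustration; not part of 7_create_library_files.py.
    """
    os.makedirs(path_library, exist_ok=True)
    return path_library


# Demo in a throwaway location; a real run would use the actual library folder.
demo_root = tempfile.mkdtemp()
path_library = os.path.join(demo_root, "ms2query_library")
ensure_library_dir(path_library)
ensure_library_dir(path_library)  # calling it again is harmless
```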

@niekdejonge
Collaborator

Sorry, I missed this issue at the time. Thanks for pointing it out.

The issue is actually slightly different: the directory is created (if needed), but the function expects a base file name rather than a directory.
So something like:
C:\Niek\ms2query_library\gnps_12_15_21
It will then create three files named:
C:\Niek\ms2query_library\gnps_12_15_21_ms2ds_embeddings.pickle
C:\Niek\ms2query_library\gnps_12_15_21_s2v_embeddings.pickle
C:\Niek\ms2query_library\gnps_12_15_21.sqlite
So instead of a directory, we expect the base file name, to which the different suffixes and extensions are appended.
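The naming scheme described above can be sketched as follows. The function library_file_names is a hypothetical illustration, not the ms2query implementation. It only mirrors the three file names listed in this comment:

```python
def library_file_names(base_file_name: str) -> dict:
    """Return the three file names derived from a base file name.

    Hypothetical helper: it reproduces the suffixes listed in this comment
    and is not taken from the ms2query source code.
    """
    return {
        "ms2ds_embeddings": base_file_name + "_ms2ds_embeddings.pickle",
        "s2v_embeddings": base_file_name + "_s2v_embeddings.pickle",
        "sqlite": base_file_name + ".sqlite",
    }


names = library_file_names(r"C:\Niek\ms2query_library\gnps_12_15_21")
for file_name in names.values():
    print(file_name)
```

Because the suffixes are appended directly to the string, the last path component acts as a file name prefix and not as a folder.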

However, I agree that this is not intuitive for the user, so I will change this to specifying a directory, in which the expected files are then created.

Additionally, with #146 it has become a lot easier to create new library files for your own data: this is now possible with just a few lines of code, without needing to run all the notebooks.

@guikool

guikool commented Sep 27, 2022

Hi,
I am posting my experience in this issue since my problem seems related:
When I use my own library spectra, the workflow runs fine, but no files are created at the final step and no error message is shown.

library_creator = LibraryFilesCreator(library_spectra,
                                      output_directory="/content/drive/MyDrive/neg_librarylrsv092022/Alllrsv_neg_",  # For instance "data/library_data/all_GNPS_positive_mode_"
                                      ion_mode="negative",
                                      ms2ds_model_file_name="/content/drive/MyDrive/neg_librarylrsv092022/ms2ds_model_GNPS_15_12_2021.hdf5",  # The file location of the ms2ds model
                                      s2v_model_file_name="/content/drive/MyDrive/neg_librarylrsv092022/spec2vec_model_GNPS_15_12_2021.model", )  # The file location of the s2v model
library_creator.clean_up_smiles_inchi_and_inchikeys(do_pubchem_lookup=False)
library_creator.clean_peaks_and_normalise_intensities_spectra()
library_creator.remove_not_fully_annotated_spectra()
library_creator.calculate_tanimoto_scores()
library_creator.create_all_library_files()
Applying default filters to spectra:  98%|█████████▊| 502041/512922 [03:02<00:04, 2410.99it/s]

2022-09-27 10:51:24,681:WARNING:matchms:add_precursor_mz:197,9867 can't be converted to float.

WARNING:matchms:197,9867 can't be converted to float.

2022-09-27 10:51:24,686:WARNING:matchms:add_precursor_mz:No precursor_mz found in metadata.

WARNING:matchms:No precursor_mz found in metadata.
Applying default filters to spectra: 100%|██████████| 512922/512922 [03:08<00:00, 2720.55it/s]
Selecting negative mode spectra: 100%|██████████| 512922/512922 [00:00<00:00, 692109.85it/s]

From 512922 spectra, 373 are removed since they are not in negative mode

Cleaning metadata library spectra:  98%|█████████▊| 502143/512549 [21:06<00:18, 552.89it/s]

2022-09-27 11:12:37,212:WARNING:matchms:add_parent_mass:Missing precursor m/z to derive parent mass.

WARNING:matchms:Missing precursor m/z to derive parent mass.
Cleaning metadata library spectra: 100%|██████████| 512549/512549 [21:22<00:00, 399.50it/s]
Cleaning and filtering peaks library spectra: 100%|██████████| 512549/512549 [03:01<00:00, 2818.53it/s]

From 494673 spectra, 0 are removed since they are not fully annotated

Calculating fingerprints for tanimoto scores: 100%|██████████| 168039/168039 [06:13<00:00, 450.08it/s]

@guikool

guikool commented Sep 27, 2022

The result was an empty directory:
/content/drive/MyDrive/neg_librarylrsv092022/Alllrsv_neg_

@niekdejonge
Collaborator

Hi,
Thanks for the clear overview of the problem.
Did the program finish by itself, or did you stop it before it finished completely? After the fingerprints are determined, the scores are calculated for 400,000 spectra, but this step does not print a progress bar, which might make it seem like the program has finished while it is still calculating. I will add a progress bar in the next release, so it is clear that the program is still running.

The loading bars I see (for a small test set) are:

Cleaning metadata library spectra: 100%|██████████| 100/100 [00:00<00:00, 417.27it/s]
Cleaning and filtering peaks library spectra: 100%|██████████| 100/100 [00:00<00:00, 3533.38it/s]
Calculating fingerprints for tanimoto scores: 0%| | 0/61 [00:00<?, ?it/s]
From 100 spectra, 0 are removed since they are not fully annotated
Calculating fingerprints for tanimoto scores: 100%|██████████| 61/61 [00:00<00:00, 201.11it/s]
Adding spectra to sqlite table: 100it [00:00, ?it/s]
Adding inchikey14s to sqlite table: 100%|██████████| 61/61 [00:00<00:00, 3908.00it/s]
Converting Spectrum to Spectrum_document: 100%|██████████| 100/100 [00:00<00:00, 3194.88it/s]
Calculating embeddings: 100it [00:00, 2133.04it/s]
Spectrum binning: 100%|██████████| 100/100 [00:00<00:00, 6381.50it/s]
Create BinnedSpectrum instances: 100%|██████████| 100/100 [00:00<?, ?it/s]
Calculating vectors of reference spectrums: 0%| | 0/100 [00:00<?, ?it/s]
Calculating vectors of reference spectrums: 100%|██████████| 100/100 [00:02<00:00, 39.87it/s]

Does waiting longer solve the issue for you?

@niekdejonge
Collaborator

I added a printed message: "Calculating Tanimoto scores".
Showing a progress bar for this step as well would be better, but that is complex to do with the current implementation of matchms. An issue was created in matchms to address this problem; it might be implemented in the future.

@guikool

guikool commented Sep 27, 2022

Thanks for your prompt reply.
I used a Google Colab notebook. It stops just after the Tanimoto fingerprint and score calculation. I don't see the other progress bars shown in your response (adding spectra to sqlite...).
Are there any spectral metadata requirements to complete the process?
I can share the notebook with you if you want to test it.

Here is a sample of my msp file:

NAME: Actinorhodin
PRECURSORMZ: 629.083
spectrumid: CHMPS387
PRECURSORTYPE: M-H
INCHIKEY: MGFJRQUGYNFFDQ-WYUUTHIRSA-N
SMILES: C[C@H]1OC@HCC2=C(O)C3=C(O)C=C(C(O)=C3C(O)=C12)C1=CC(=O)C2=C(C1=O)C(=O)C1=C(CC@@HO[C@@h]1C)C2=O
INCHI: InChI=1S/C32H26O14/c1-9-21-15(3-11(45-9)5-19(35)36)29(41)23-17(33)7-13(27(39)25(23)31(21)43)14-8-18(34)24-26(28(14)40)32(44)22-10(2)46-12(6-20(37)38)4-16(22)30(24)42/h7-12,33,39,41,43H,3-6H2,1-2H3,(H,35,36)(H,37,38)/t9-,10-,11+,12+/m1/s1
RETENTIONTIME: CCS
IONMODE: Negative
INSTRUMENT: qTof
INSTRUMENTTYPE: DI-ESI-QTOF
COMPOUNDCLASS:
ADDUCTIONNAME:
LINKS:
SOURCEDB: ALL_GNPS.msp
ORIGIN: GNPS
COLLISIONENERGY:
Molecular Formula: C32H26O14
Molar Mass: 634.5416772826582
Num Peaks: 149
197.930786 17.0
197.931046 17.0
197.931305 17.0
197.931564 17.0
...

@niekdejonge
Collaborator

I now notice you have quite a lot of unique InChIKeys: 168039. Is this an in-house library, and does that number of unique InChIKeys match your expectations?

This many unique InChIKeys might cause memory issues in Google Colab. I tested it with up to about 20,000 unique InChIKeys. 168039 is substantially more, and since the Tanimoto score is calculated between each pair of InChIKeys, the number of scores grows with the square of the number of InChIKeys.
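To give a feel for the quadratic growth, here is a rough back-of-the-envelope sketch. It assumes the scores are held as a dense square matrix with 8 bytes per score (float64). That is an assumption for illustration only; the actual storage used by ms2query may differ. The function name tanimoto_matrix_gib is hypothetical:

```python
def tanimoto_matrix_gib(n_inchikeys: int, bytes_per_score: int = 8) -> float:
    """Estimate the size in GiB of a dense n x n score matrix.

    Assumes one score per pair of InChIKeys and 8 bytes per score (float64).
    Both are assumptions, not confirmed details of ms2query.
    """
    return n_inchikeys ** 2 * bytes_per_score / 1024 ** 3


tested = tanimoto_matrix_gib(20_000)     # library size that was tested
reported = tanimoto_matrix_gib(168_039)  # library size reported in this thread
print(f"20000 InChIKeys:  {tested:.1f} GiB")
print(f"168039 InChIKeys: {reported:.1f} GiB")
print(f"growth factor:    {reported / tested:.1f}x")
```

Under these assumptions, the tested size needs roughly 3 GiB, while 168039 InChIKeys would need roughly 210 GiB, about 70 times more for about 8.4 times as many InChIKeys.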

I think this makes Google Colab crash. It might still be possible to run this on a server with more memory available than Google Colab.

Could you try the workflow with a smaller spectrum file (e.g. 100 spectra), to make sure the workflow works well in Google Colab?
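One way to produce such a smaller test file is sketched below, using only the standard library. It assumes that records in the msp file are separated by blank lines, which is common but not guaranteed for every msp export. The function names iter_msp_records and write_msp_subset are hypothetical:

```python
from itertools import islice


def iter_msp_records(lines):
    """Yield msp records as lists of lines, splitting on blank lines.

    Assumes blank lines separate records (an assumption about the file layout).
    """
    record = []
    for line in lines:
        if line.strip():
            record.append(line.rstrip("\n"))
        elif record:
            yield record
            record = []
    if record:  # last record when the file does not end with a blank line
        yield record


def write_msp_subset(source_path, target_path, n_spectra=100):
    """Copy the first n_spectra records to target_path; return how many were written."""
    count = 0
    with open(source_path, encoding="utf-8") as source, \
            open(target_path, "w", encoding="utf-8") as target:
        for record in islice(iter_msp_records(source), n_spectra):
            target.write("\n".join(record) + "\n\n")
            count += 1
    return count
```

For example, write_msp_subset("full_library.msp", "subset_100.msp", n_spectra=100) would write the first 100 records (both file names are placeholders). Since the file is read as a stream, the full library is never loaded into memory.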

If this is indeed the issue, I could look at some improvements to reduce the memory footprint of generating the Tanimoto scores.

@guikool

guikool commented Sep 28, 2022

You're probably right.
I'm waiting for access to a server to test this and will post feedback as soon as possible.

@guikool

guikool commented Sep 28, 2022

I confirm, it works on Colab for 10000 spectra!!
I'll use a bigger server to process my entire library.
Thanks for your help.

@niekdejonge
Collaborator

Great! I will also make a less memory-intensive implementation. This can be discussed further in #150.
