Create Library Files python script - missing step #140

justinjjvanderhooft · 2022-04-23T15:58:01Z

"7_create_library_files.py"

There is a small bug: the folder path_library needs to be created first, e.g. by including:
if not(os.path.isdir(path_library)): os.makedirs(path_library)
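
The same check can be written in a single call, as a minimal sketch. `path_library` is the variable name used in the script; the value assigned here is only a placeholder for illustration.

```python
import os
import tempfile

# Placeholder location for illustration; in the script, path_library is set by the user.
path_library = os.path.join(tempfile.gettempdir(), "ms2query_library_example")

# exist_ok=True makes the call a no-op when the directory already exists,
# so no separate os.path.isdir() check is needed.
os.makedirs(path_library, exist_ok=True)
```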

niekdejonge · 2022-09-23T13:06:38Z

Sorry, I missed this issue at the time. Thanks for pointing it out.

The issue is actually slightly different: the directory is created (if needed), but the function does not expect a directory. Instead, it expects a base file name, so something like:
C:\Niek\ms2query_library\gnps_12_15_21
It will then create three files named:
C:\Niek\ms2query_library\gnps_12_15_21_ms2ds_embeddings.pickle
C:\Niek\ms2query_library\gnps_12_15_21_s2v_embeddings.pickle
C:\Niek\ms2query_library\gnps_12_15_21.sqlite
So instead of a directory, we expect the base file name, to which the different suffixes and extensions are appended.
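
The naming scheme described above can be sketched as follows. The suffixes are copied from the file names listed in this comment; `expected_library_files` is a hypothetical helper written for illustration, not a function from MS2Query.

```python
import os

# Suffixes taken from the three file names listed above.
SUFFIXES = ("_ms2ds_embeddings.pickle", "_s2v_embeddings.pickle", ".sqlite")


def expected_library_files(base_file_name):
    """Return the three file names derived from a base file name (hypothetical helper)."""
    return [base_file_name + suffix for suffix in SUFFIXES]


# The base is a path plus a file name stem, not a directory.
base = os.path.join("ms2query_library", "gnps_12_15_21")
for file_name in expected_library_files(base):
    print(file_name)
```

Note that a base ending in a path separator or an underscore would simply get the suffixes appended to it as well, which is why passing a directory gives unexpected results.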

However, I agree that this is not intuitive for the user, so I will change this to specifying the directory and creating the expected files there.

Additionally, #146 makes it a lot easier to create new library files for your own data: it is now possible to do this with just a few lines of code, without needing to run all the notebooks.

guikool · 2022-09-27T11:50:08Z

Hi,
I am posting my experience in this issue since my problem seems related.
After trying to use my own library spectra, the workflow runs fine, but no files are created at the final step, and no error message is shown.

library_creator = LibraryFilesCreator(library_spectra,
                                      output_directory="/content/drive/MyDrive/neg_librarylrsv092022/Alllrsv_neg_",  # For instance "data/library_data/all_GNPS_positive_mode_"
                                      ion_mode="negative",
                                      ms2ds_model_file_name="/content/drive/MyDrive/neg_librarylrsv092022/ms2ds_model_GNPS_15_12_2021.hdf5",  # The file location of the ms2ds model
                                      s2v_model_file_name="/content/drive/MyDrive/neg_librarylrsv092022/spec2vec_model_GNPS_15_12_2021.model", )  # The file location of the s2v model
library_creator.clean_up_smiles_inchi_and_inchikeys(do_pubchem_lookup=False)
library_creator.clean_peaks_and_normalise_intensities_spectra()
library_creator.remove_not_fully_annotated_spectra()
library_creator.calculate_tanimoto_scores()
library_creator.create_all_library_files()
Applying default filters to spectra:  98%|█████████▊| 502041/512922 [03:02<00:04, 2410.99it/s]

2022-09-27 10:51:24,681:WARNING:matchms:add_precursor_mz:197,9867 can't be converted to float.

WARNING:matchms:197,9867 can't be converted to float.

2022-09-27 10:51:24,686:WARNING:matchms:add_precursor_mz:No precursor_mz found in metadata.

WARNING:matchms:No precursor_mz found in metadata.
Applying default filters to spectra: 100%|██████████| 512922/512922 [03:08<00:00, 2720.55it/s]
Selecting negative mode spectra: 100%|██████████| 512922/512922 [00:00<00:00, 692109.85it/s]

From 512922 spectra, 373 are removed since they are not in negative mode

Cleaning metadata library spectra:  98%|█████████▊| 502143/512549 [21:06<00:18, 552.89it/s]

2022-09-27 11:12:37,212:WARNING:matchms:add_parent_mass:Missing precursor m/z to derive parent mass.

WARNING:matchms:Missing precursor m/z to derive parent mass.
Cleaning metadata library spectra: 100%|██████████| 512549/512549 [21:22<00:00, 399.50it/s]
Cleaning and filtering peaks library spectra: 100%|██████████| 512549/512549 [03:01<00:00, 2818.53it/s]

From 494673 spectra, 0 are removed since they are not fully annotated

Calculating fingerprints for tanimoto scores: 100%|██████████| 168039/168039 [06:13<00:00, 450.08it/s]
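
The `can't be converted to float` warning in the log above looks like it comes from a precursor m/z written with a decimal comma (`197,9867`). This is an assumption based only on the log line. A minimal sketch of normalising such values before they reach the library creation step; `parse_mz` is a hypothetical helper, not part of matchms or MS2Query.

```python
def parse_mz(value):
    """Convert an m/z string to float, accepting a decimal comma (hypothetical helper)."""
    # Assumes the comma is a decimal separator, never a thousands separator.
    return float(str(value).strip().replace(",", "."))


print(parse_mz("197,9867"))
```

This would not handle values that also use thousands separators, so it is only suitable when the input is known to use the comma as the decimal mark.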

guikool · 2022-09-27T12:09:21Z

The result was an empty directory:
/content/drive/MyDrive/neg_librarylrsv092022/Alllrsv_neg_

niekdejonge · 2022-09-27T13:14:35Z

Hi,
Thanks for the clear overview of the problem.
Did the program finish by itself, or did you stop it before it finished completely? After the fingerprints are determined, the scores are calculated for 400,000 spectra, but no progress bar is printed for this step. This might make it seem like the program is finished while it is still calculating. I will add a progress bar in the next release, so it is clear that the program is still running.

The loading bars I see (for a small test set) are:

Cleaning metadata library spectra: 100%|██████████| 100/100 [00:00<00:00, 417.27it/s]
Cleaning and filtering peaks library spectra: 100%|██████████| 100/100 [00:00<00:00, 3533.38it/s]
Calculating fingerprints for tanimoto scores: 0%| | 0/61 [00:00<?, ?it/s]From 100 spectra, 0 are removed since they are not fully annotated
Calculating fingerprints for tanimoto scores: 100%|██████████| 61/61 [00:00<00:00, 201.11it/s]
Adding spectra to sqlite table: 100it [00:00, ?it/s]
Adding inchikey14s to sqlite table: 100%|██████████| 61/61 [00:00<00:00, 3908.00it/s]
Converting Spectrum to Spectrum_document: 100%|██████████| 100/100 [00:00<00:00, 3194.88it/s]
Calculating embeddings: 100it [00:00, 2133.04it/s]
Spectrum binning: 100%|██████████| 100/100 [00:00<00:00, 6381.50it/s]
Create BinnedSpectrum instances: 100%|██████████| 100/100 [00:00<?, ?it/s]
Calculating vectors of reference spectrums: 0%| | 0/100 [00:00<?, ?it/s]
Calculating vectors of reference spectrums: 100%|██████████| 100/100 [00:02<00:00, 39.87it/s]

Does waiting longer solve the issue for you?

niekdejonge · 2022-09-27T13:32:10Z

I added a printed message: "Calculating Tanimoto scores".
Showing a progress bar for this step as well would be better, but that is complex to implement with the current implementation of matchms. An issue was created in matchms to address this problem; it might be implemented in the future.

guikool · 2022-09-27T14:49:08Z

Thanks for your prompt reply.
I used a Google Colab notebook. It is interrupted just after the Tanimoto fingerprint and score calculation. I don't see the other progress bars you show in your response (Adding spectra to sqlite table...).
Are there any spectral metadata requirements to complete the process?
I can share the notebook with you if you want to test it.

Here is a sample of my msp file:

NAME: Actinorhodin
PRECURSORMZ: 629.083
spectrumid: CHMPS387
PRECURSORTYPE: M-H
INCHIKEY: MGFJRQUGYNFFDQ-WYUUTHIRSA-N
SMILES: C[C@H]1OC@HCC2=C(O)C3=C(O)C=C(C(O)=C3C(O)=C12)C1=CC(=O)C2=C(C1=O)C(=O)C1=C(CC@@HO[C@@h]1C)C2=O
INCHI: InChI=1S/C32H26O14/c1-9-21-15(3-11(45-9)5-19(35)36)29(41)23-17(33)7-13(27(39)25(23)31(21)43)14-8-18(34)24-26(28(14)40)32(44)22-10(2)46-12(6-20(37)38)4-16(22)30(24)42/h7-12,33,39,41,43H,3-6H2,1-2H3,(H,35,36)(H,37,38)/t9-,10-,11+,12+/m1/s1
RETENTIONTIME: CCS
IONMODE: Negative
INSTRUMENT: qTof
INSTRUMENTTYPE: DI-ESI-QTOF
COMPOUNDCLASS:
ADDUCTIONNAME:
LINKS:
SOURCEDB: ALL_GNPS.msp
ORIGIN: GNPS
COLLISIONENERGY:
Molecular Formula: C32H26O14
Molar Mass: 634.5416772826582
Num Peaks: 149
197.930786 17.0
197.931046 17.0
197.931305 17.0
197.931564 17.0
...

niekdejonge · 2022-09-28T08:17:24Z

I now notice you have quite a lot of unique InChIKeys: 168039. Is this an in-house library, and does that number of unique InChIKeys match your expectations?

This increase in unique InChIKeys might result in memory issues in Google Colab. I tested it for up to about 20,000 unique InChIKeys. 168039 is substantially more, and since the Tanimoto score is calculated between each pair of InChIKeys, the number of scores grows with the square of the number of InChIKeys.

I think this makes Google Colab crash. It might still be possible to run this on a server with more memory available than Google Colab.
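
A rough back-of-the-envelope estimate illustrates the quadratic growth. This is a sketch that assumes the scores are held as a dense square matrix of 8-byte floats; how MS2Query and matchms actually store them may differ.

```python
def score_matrix_gib(n_inchikeys, bytes_per_score=8):
    """Size in GiB of a dense n x n score matrix (8 bytes per score assumes float64)."""
    return n_inchikeys ** 2 * bytes_per_score / 1024 ** 3


for n in (20_000, 168_039):
    print(f"{n} InChIKeys: about {score_matrix_gib(n):.1f} GiB")
```

Under these assumptions, 20,000 InChIKeys need roughly 3 GiB, while 168039 need roughly 210 GiB, far more than a typical Colab session offers.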

Could you maybe try the workflow with a smaller spectrum file (e.g. 100 spectra), to make sure the workflow works well in Google Colab?

If this is indeed the issue, I could have a look at some improvements to reduce the memory footprint of the generation of the Tanimoto scores.

guikool · 2022-09-28T08:29:25Z

You're probably right.
I'm waiting for access to a server to test and will post feedback asap.

guikool · 2022-09-28T08:59:41Z

I confirm, it works on Colab for 10000 spectra!!
I'll use a bigger server to process my entire library.
Thanks for your help.

niekdejonge · 2022-09-28T09:01:41Z

Great! I will also make a less memory-intensive implementation. This can be discussed further in #150.
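
One general way to lower peak memory is to compute the scores one row at a time and keep only what is needed from each row. The sketch below illustrates that idea in plain Python; it is not the matchms or MS2Query implementation, and the fingerprints, the function names, and the choice to keep only the top-scoring neighbours are all assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def tanimoto_rows(fingerprints):
    """Yield (row_index, scores) one row at a time instead of building the full matrix."""
    for i, fingerprint in enumerate(fingerprints):
        yield i, [tanimoto(fingerprint, other) for other in fingerprints]


def top_n_per_row(fingerprints, n=2):
    """Keep only the n highest-scoring other fingerprints for each row."""
    result = {}
    for i, scores in tanimoto_rows(fingerprints):
        ranked = sorted(((s, j) for j, s in enumerate(scores) if j != i), reverse=True)
        result[i] = [(j, s) for s, j in ranked[:n]]
    return result


# Toy fingerprints, made up for illustration.
fingerprints = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}, {7, 8}]
print(top_n_per_row(fingerprints, n=1))
```

Only one row of scores exists in memory at any time, so peak memory grows linearly with the number of fingerprints; the total computation time is still quadratic.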

niekdejonge mentioned this issue Sep 23, 2022

Improve library files creation #149

Merged

niekdejonge closed this as completed Sep 28, 2022
