Memory footprint when creating large Tanimoto score files #150

Closed
niekdejonge opened this issue Sep 28, 2022 · 12 comments


@niekdejonge
Collaborator

Currently, a full matrix of Tanimoto scores is generated. However, only the top 10 Tanimoto scores are needed for MS2Query.

Suggested change:
Do not store the entire matrix of Tanimoto scores; instead, store only the top 10 Tanimoto scores and pass these to the SQLite file generator.
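
A minimal sketch of that idea (illustration only, not MS2Query's actual code; it assumes the fingerprints are available as a 0/1 numpy array and uses a hypothetical helper name):

import numpy as np

def top_k_tanimoto(fingerprints, k=10):
    # For each fingerprint keep only the k highest Tanimoto scores.
    # fingerprints: array of shape (n_compounds, n_bits) with 0/1 entries.
    fp = fingerprints.astype(np.int32)
    bits = fp.sum(axis=1)                       # |A| for every fingerprint
    top_idx = np.empty((fp.shape[0], k), dtype=np.int64)
    top_scores = np.empty((fp.shape[0], k))
    for i in range(fp.shape[0]):                # one row at a time, never the full n x n matrix
        intersection = fp @ fp[i]               # |A AND B| against all fingerprints
        union = bits + bits[i] - intersection   # |A OR B|
        scores = intersection / union
        idx = np.argpartition(scores, -k)[-k:]  # unsorted top k (includes the self match of 1.0)
        order = np.argsort(scores[idx])[::-1]
        top_idx[i] = idx[order]
        top_scores[i] = scores[idx][order]
    return top_idx, top_scores

Only these top scores (and their indices) would then be passed to the SQLite file generator.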

@niekdejonge
Collaborator Author

Will be solved with #151

@niekdejonge
Collaborator Author

@guikool
I released a new version that should be much less memory intensive during the Tanimoto score calculation.
It is probably best to use this version; please let me know if it still gives issues.

@guikool

guikool commented Sep 28, 2022

Launch started on 500K spectra in Colab...

@guikool

guikool commented Sep 28, 2022

Oops, crash.
I definitely need a cluster...

AttributeError                            Traceback (most recent call last)

<ipython-input-7-a084f5e74829> in <module>
      7 library_creator.clean_peaks_and_normalise_intensities_spectra()
      8 library_creator.remove_not_fully_annotated_spectra()
----> 9 library_creator.calculate_tanimoto_scores()
     10 library_creator.create_all_library_files()

AttributeError: 'LibraryFilesCreator' object has no attribute 'calculate_tanimoto_scores'

@guikool

guikool commented Sep 28, 2022

One way to limit the size of the square Tanimoto matrix is perhaps to keep only scores above a given threshold (0.7?).
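
A small sketch of what that could look like (illustration only, not MS2Query code; scores is assumed to be a dense numpy block of Tanimoto scores):

import numpy as np
from scipy import sparse

def sparsify_scores(scores, threshold=0.7):
    # Keep only Tanimoto scores >= threshold, stored as a sparse matrix.
    mask = scores >= threshold
    rows, cols = np.nonzero(mask)
    return sparse.coo_matrix((scores[mask], (rows, cols)), shape=scores.shape)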

@niekdejonge
Collaborator Author

You can remove the step library_creator.calculate_tanimoto_scores().
This was changed in the new version: the Tanimoto scores are now calculated automatically in create_all_library_files().

We actually only need a fraction of the Tanimoto scores, so the memory footprint in this version should be reduced a lot (even more than by only keeping scores above 0.7).
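
For reference, the updated call sequence would then look like this (method names taken from the traceback above; the exact API may differ between versions):

library_creator.clean_peaks_and_normalise_intensities_spectra()
library_creator.remove_not_fully_annotated_spectra()
library_creator.create_all_library_files()  # Tanimoto scores are now computed inside this step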

@guikool

guikool commented Sep 28, 2022

I didn't notice the script change.
It seems to work, but it is not feasible on Google Colab because of the estimated calculation time:
Calculating Tanimoto scores: 1%| | 846/168039 [37:47<124:30:01, 2.68s/it]
Still, I'll run it on a stronger machine and let you know the results.

@guikool

guikool commented Jan 13, 2023

Dear Niek, I just benchmarked the latest version of MS2Query for library creation. On a computer with 32 GB of memory, it terminates with the following error:

tanimoto_scores = jaccard_similarity_matrix(fingerprints_1, fingerprints_2)
MemoryError: Allocation failed (probably too large).

I have access to a 256 GB workstation and will give it a try, but perhaps there is something to optimize in this part.
Best regards,
G.

@niekdejonge
Collaborator Author

Thanks for letting us know.
This step indeed creates a large matrix (number of unique InChIKeys squared), which can cause memory issues. However, I have never had issues with this before. How many unique InChIKeys did you have in your training spectra?

It is hard for me to change this, since this step is not needed for MS2Query itself but for training MS2Deepscore. I had a quick look at whether it could easily be changed, but it is not straightforward. I will open an issue in MS2Deepscore about this, so it might be changed in the future.
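
For reference, one generic way to reduce the peak memory would be to compute the matrix in row blocks (illustration only, not the current MS2Deepscore code; assumes 0/1 numpy fingerprints):

import numpy as np

def jaccard_in_blocks(fingerprints, block_size=1000):
    # Yield (row_slice, scores) blocks of the Jaccard/Tanimoto matrix
    # instead of allocating the full n x n array at once.
    fp = fingerprints.astype(np.int32)
    bits = fp.sum(axis=1)                        # |A| for every fingerprint
    for start in range(0, fp.shape[0], block_size):
        stop = min(start + block_size, fp.shape[0])
        intersection = fp[start:stop] @ fp.T     # |A AND B| for this block of rows
        union = bits[start:stop, None] + bits[None, :] - intersection
        yield slice(start, stop), intersection / union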

I hope it works on the 256 GB workstation.

@guikool

guikool commented Jan 13, 2023

I've encountered another issue on the workstation, but it is related to the Python install.
For the record, I'm trying the library creation without a model on 500K spectra...
I'll give it a try on the university cluster and let you know.

@guikool

guikool commented Jan 17, 2023

I finally removed all in-silico spectra from my in-house library and now work with fewer than 50K unique InChIKeys. No problems so far; library creation works really well and fast.
In the results.csv, although the score from the model is important, it could also be useful to have the dot product between the cropped spectra of the library analog and the experimental MS/MS query.
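
A minimal sketch of such a score (illustration only, a simple binned dot product, not MS2Query's implementation; the function name and bin width are placeholders):

import numpy as np

def binned_dot_product(mz_a, intensities_a, mz_b, intensities_b, bin_width=0.05, max_mz=1000.0):
    # Crude normalized dot product between two peak lists, binned on m/z.
    bins = np.arange(0.0, max_mz + bin_width, bin_width)
    vec_a, _ = np.histogram(mz_a, bins=bins, weights=intensities_a)
    vec_b, _ = np.histogram(mz_b, bins=bins, weights=intensities_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(vec_a @ vec_b / norm) if norm > 0 else 0.0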

@niekdejonge
Collaborator Author

Great to hear that it works well now!
Thanks for the suggestion to add the dot product. This might indeed be a useful addition. However, my concern is that it might confuse some users about which score they should trust.
I will create a separate issue for this, to discuss whether we want to add it to the results.
