Memory footprint when creating large Tanimoto score files #150

Closed
niekdejonge opened this issue Sep 28, 2022 · 12 comments


@niekdejonge
Collaborator

Currently, a full matrix of Tanimoto scores is generated. However, only the top 10 Tanimoto scores are needed for MS2Query.

Suggested change:
Do not store the entire matrix of Tanimoto scores; instead, store only the top 10 Tanimoto scores and pass these to the SQLite file generator.
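
A minimal sketch of that idea (illustration only, not MS2Query's actual code; it assumes the fingerprints are available as a 0/1 numpy array and uses a hypothetical helper name):

import numpy as np

def top_k_tanimoto(fingerprints, k=10):
    # For each fingerprint keep only the k highest Tanimoto scores.
    # fingerprints: array of shape (n_compounds, n_bits) with 0/1 entries.
    fp = fingerprints.astype(np.int32)
    bits = fp.sum(axis=1)                       # |A| for every fingerprint
    top_idx = np.empty((fp.shape[0], k), dtype=np.int64)
    top_scores = np.empty((fp.shape[0], k))
    for i in range(fp.shape[0]):                # one row at a time, never the full n x n matrix
        intersection = fp @ fp[i]               # |A AND B| against all fingerprints
        union = bits + bits[i] - intersection   # |A OR B|
        scores = intersection / union
        idx = np.argpartition(scores, -k)[-k:]  # unsorted top k (includes the self match of 1.0)
        order = np.argsort(scores[idx])[::-1]
        top_idx[i] = idx[order]
        top_scores[i] = scores[idx][order]
    return top_idx, top_scores

Only these top scores (and their indices) would then be passed to the SQLite file generator.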

@niekdejonge
Collaborator Author

Will be solved with #151

@niekdejonge
Collaborator Author

@guikool
I released a new version that should be much less memory intensive during the Tanimoto score calculation.
It is probably best to use this version; please let me know if it still gives issues.

@guikool

guikool commented Sep 28, 2022

Launch started on 500K spectra in Colab...

@guikool

guikool commented Sep 28, 2022

Oops, crash.
I definitely need a cluster...

AttributeError                            Traceback (most recent call last)

<ipython-input-7-a084f5e74829> in <module>
      7 library_creator.clean_peaks_and_normalise_intensities_spectra()
      8 library_creator.remove_not_fully_annotated_spectra()
----> 9 library_creator.calculate_tanimoto_scores()
     10 library_creator.create_all_library_files()

AttributeError: 'LibraryFilesCreator' object has no attribute 'calculate_tanimoto_scores'

@guikool

guikool commented Sep 28, 2022

One way to limit the size of the square Tanimoto matrix is perhaps to keep only scores above a given threshold (0.7?).
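
A small sketch of what that could look like (illustration only, not MS2Query code; scores is assumed to be a dense numpy block of Tanimoto scores):

import numpy as np
from scipy import sparse

def sparsify_scores(scores, threshold=0.7):
    # Keep only Tanimoto scores >= threshold, stored as a sparse matrix.
    mask = scores >= threshold
    rows, cols = np.nonzero(mask)
    return sparse.coo_matrix((scores[mask], (rows, cols)), shape=scores.shape)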

@niekdejonge
Collaborator Author

You can remove the step library_creator.calculate_tanimoto_scores().
This was changed in the new version: the Tanimoto scores are now calculated automatically in create_all_library_files().

We actually only need a fraction of the Tanimoto scores, so the memory footprint in this version should be reduced a lot (even more than by only keeping scores above 0.7).
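
For reference, the updated call sequence would then look like this (method names taken from the traceback above; the exact API may differ between versions):

library_creator.clean_peaks_and_normalise_intensities_spectra()
library_creator.remove_not_fully_annotated_spectra()
library_creator.create_all_library_files()  # Tanimoto scores are now computed inside this step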

@guikool

guikool commented Sep 28, 2022

I didn't notice the script change.
It seems to work, but it is not feasible on Google Colab because of the estimated calculation time:
Calculating Tanimoto scores: 1%| | 846/168039 [37:47<124:30:01, 2.68s/it]
Still, I'll run it on a stronger machine and let you know the results.

@guikool

guikool commented Jan 13, 2023

Dear Niek, I just benchmarked the latest version of MS2Query for library creation. On a computer with 32 GB of memory, it terminates with the following error:

tanimoto_scores = jaccard_similarity_matrix(fingerprints_1, fingerprints_2)
MemoryError: Allocation failed (probably too large).

I have access to a 256 GB workstation and will give it a try, but perhaps there is something to optimize in this part.
Best regards,
G.

@niekdejonge
Collaborator Author

Thanks for letting us know.
This step indeed creates a large matrix (number of unique InChIKeys squared), which can cause memory issues. However, I have never had issues with this before. How many unique InChIKeys did you have in your training spectra?

It is hard for me to change this, since this step is not needed for MS2Query itself but for training MS2Deepscore. I had a quick look at whether it could easily be changed, but it is not straightforward. I will open an issue in MS2Deepscore about this, so it might be changed in the future.
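
For reference, one generic way to reduce the peak memory would be to compute the matrix in row blocks (illustration only, not the current MS2Deepscore code; assumes 0/1 numpy fingerprints):

import numpy as np

def jaccard_in_blocks(fingerprints, block_size=1000):
    # Yield (row_slice, scores) blocks of the Jaccard/Tanimoto matrix
    # instead of allocating the full n x n array at once.
    fp = fingerprints.astype(np.int32)
    bits = fp.sum(axis=1)                        # |A| for every fingerprint
    for start in range(0, fp.shape[0], block_size):
        stop = min(start + block_size, fp.shape[0])
        intersection = fp[start:stop] @ fp.T     # |A AND B| for this block of rows
        union = bits[start:stop, None] + bits[None, :] - intersection
        yield slice(start, stop), intersection / union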

I hope it works on the 256 GB workstation.

@guikool

guikool commented Jan 13, 2023

I've encountered another issue on the workstation, but it is related to the Python install.
For the record, I'm trying the library creation without a model on 500K spectra...
I'll give it a try on the university cluster and let you know.

@guikool

guikool commented Jan 17, 2023

I finally removed all in-silico spectra from my in-house library and now work with fewer than 50K unique InChIKeys. No problems so far; library creation works really well and fast.
In the results.csv, although the score from the model is important, it could also be useful to have the dot product between the cropped spectra of the library analog and the experimental MS/MS query.
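
A minimal sketch of such a score (illustration only, a simple binned dot product, not MS2Query's implementation; the function name and bin width are placeholders):

import numpy as np

def binned_dot_product(mz_a, intensities_a, mz_b, intensities_b, bin_width=0.05, max_mz=1000.0):
    # Crude normalized dot product between two peak lists, binned on m/z.
    bins = np.arange(0.0, max_mz + bin_width, bin_width)
    vec_a, _ = np.histogram(mz_a, bins=bins, weights=intensities_a)
    vec_b, _ = np.histogram(mz_b, bins=bins, weights=intensities_b)
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(vec_a @ vec_b / norm) if norm > 0 else 0.0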

@niekdejonge
Collaborator Author

Great to hear that it works well now!
Thanks for the suggestion to add the dot product. This might indeed be a useful addition. However, my concern is that it might confuse some users about which score they should trust.
I will create a separate issue for this, to discuss whether we want to add it to the results.
