Improve generator performance #107
We ran some cProfile tests, and the generator is not that bad after all... but it still might be worth some more optimization.
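A minimal profiling harness along these lines can reproduce such a measurement. This is a generic sketch, not the actual test setup from the project; `profile_call` and `dummy_generator_step` are hypothetical names:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and print the 10 most expensive calls."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
    print(stream.getvalue())
    return result


def dummy_generator_step():
    # Stand-in for one batch-generation step of the data generator.
    return sum(i * i for i in range(10_000))


value = profile_call(dummy_generator_step)
```

Sorting by cumulative time makes it easy to see whether the generator or the model training dominates the run.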
@florian-huber A possible solution might be: Still, it is good to know that the jaccard_similarity_matrix is already causing memory issues for some people when a high number of unique InChIKeys is used.
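To put rough numbers on that memory issue: a dense all-vs-all score matrix grows quadratically with the number of unique InChIKeys. A back-of-the-envelope sketch, assuming float64 storage (the function name and sizes are illustrative, not from the project):

```python
def dense_matrix_gib(n_keys: int, bytes_per_value: int = 8) -> float:
    """GiB needed to hold a dense n_keys x n_keys score matrix."""
    return n_keys ** 2 * bytes_per_value / 1024 ** 3


# ~20,000 unique keys still fits in RAM; ten times as many does not:
print(f"{dense_matrix_gib(20_000):.1f} GiB")   # about 3 GiB
print(f"{dense_matrix_gib(200_000):.1f} GiB")  # about 298 GiB
```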
Yes, that is indeed a problem we didn't have with our initial training set (roughly 20,000 vs 20,000 pairs)!
Your suggestion: compute Tanimoto scores on the fly. I think your suggestion could work. My main concern would be that the data is extremely biased towards low Tanimoto scores.
Maybe an alternative: precompute a subset of scores. Because for much larger arrays we would anyway not expect to probe all pairs during training.
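For reference, computing a Tanimoto score on the fly from binary fingerprints is cheap per pair; the cost question is how many pairs get probed. A minimal sketch with made-up fingerprints (not MS2DeepScore's actual code):

```python
import numpy as np


def tanimoto(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    """Tanimoto (Jaccard) similarity of two binary fingerprint vectors."""
    intersection = np.logical_and(fp_a, fp_b).sum()
    union = np.logical_or(fp_a, fp_b).sum()
    return float(intersection / union) if union else 0.0


fp_a = np.array([1, 1, 0, 1, 0], dtype=bool)
fp_b = np.array([1, 0, 0, 1, 1], dtype=bool)
print(tanimoto(fp_a, fp_b))  # 2 shared bits / 4 set bits = 0.5
```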
I like your second suggestion! It could be part of our planned MS2DeepScore improvements, and/or make a good student project, as we would also need to find out what amount of high-scoring pairs is "sufficient" for decent model performance.
Yes, that could work. Great that the sparse matrices are now implemented :) Instead of needing to check whether a bin is over- or under-represented, we could also just continue saving pairs until there are enough for each bin, e.g. 50 matches.
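The "keep saving until each bin is full" idea could look roughly like this. This is a sketch with made-up scores and bin edges; `collect_pairs_per_bin` is a hypothetical helper, not the project's implementation:

```python
import numpy as np


def collect_pairs_per_bin(scores, bins, pairs_per_bin=50,
                          max_draws=1_000_000, rng=None):
    """Draw random (i, j) pairs and file each into its Tanimoto bin,
    stopping once every bin holds pairs_per_bin pairs (or max_draws)."""
    rng = rng or np.random.default_rng()
    n = scores.shape[0]
    collected = {b: [] for b in range(len(bins) - 1)}
    for _ in range(max_draws):
        if all(len(v) >= pairs_per_bin for v in collected.values()):
            break
        i, j = int(rng.integers(n)), int(rng.integers(n))
        # Find the bin so that bins[b] <= score < bins[b + 1].
        b = int(np.searchsorted(bins, scores[i, j], side="right")) - 1
        b = min(max(b, 0), len(bins) - 2)  # clamp edge case score == 1.0
        if len(collected[b]) < pairs_per_bin:
            collected[b].append((i, j))
    return collected
```

The `max_draws` cap matters in practice: with real, low-score-biased data the high-Tanimoto bins may never fill, so the loop needs a way to give up.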
This was now realized in #145.
This is now fixed with the new, much faster generator in #168.
Currently, the data generator has to pick random pairs of spectra during training (and testing) that fall into specific Tanimoto score bins.
This is computationally very expensive because, for each data point, a full array of Tanimoto scores needs to be searched. In my experience the actual deep learning training might only take 1-5% of the total computation time!
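A rough sketch of the difference between scanning a full score row on every sample and precomputing a per-bin candidate index once up front. All names and array sizes here are illustrative, not taken from the project:

```python
import numpy as np

rng = np.random.default_rng(0)
n_compounds = 500
scores = rng.random((n_compounds, n_compounds))  # stand-in Tanimoto matrix
bins = np.linspace(0.0, 1.0, 11)                 # ten equal-width score bins


def pick_pair_naive(anchor, target_bin):
    """Scan the anchor's full score row on every call (the slow path)."""
    row = scores[anchor]
    lo, hi = bins[target_bin], bins[target_bin + 1]
    candidates = np.where((row >= lo) & (row < hi))[0]
    return (anchor, int(rng.choice(candidates))) if candidates.size else None


# Build the per-bin candidate index once; afterwards each sample
# reduces to a dict lookup plus one random choice.
bin_index = {
    (i, b): np.where((scores[i] >= bins[b]) & (scores[i] < bins[b + 1]))[0]
    for i in range(n_compounds)
    for b in range(len(bins) - 1)
}


def pick_pair_indexed(anchor, target_bin):
    candidates = bin_index[(anchor, target_bin)]
    return (anchor, int(rng.choice(candidates))) if candidates.size else None
```

The trade-off is memory for speed: the index stores every row's candidates per bin, which is exactly where the sparse-matrix discussion above comes in.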
One strategy could be: