Improve generator performance #107

Closed
florian-huber opened this issue Oct 17, 2022 · 7 comments

@florian-huber
Contributor

Currently, during training (and testing), the data generator has to pick random pairs of spectra that fall into specific Tanimoto score bins.
This is computationally very expensive, because a full array of Tanimoto scores needs to be searched for each data point. In my experience, the actual deep learning training might take only 1-5% of the total computation time!

One strategy could be:

  • Pre-pick the pairs before training, e.g. combine hundreds of pairs covering all Tanimoto bin ranges (0-0.1, 0.1-0.2, ..., 0.9-1.0).

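
A minimal sketch of the pre-picking idea, assuming the Tanimoto scores are already available as a symmetric matrix. The names `bin_index` and `prepick_pairs` and the toy values are hypothetical and not taken from the MS2DeepScore code base; the point is only that the matrix is scanned once up front instead of once per data point.

```python
import random
from collections import defaultdict

def bin_index(score, n_bins=10):
    """Map a score in [0, 1] to a bin index; a score of 1.0 goes into the last bin."""
    return min(int(score * n_bins), n_bins - 1)

def prepick_pairs(scores, pairs_per_bin, n_bins=10, seed=0):
    """Scan the upper triangle once and keep up to `pairs_per_bin` random pairs per bin."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            by_bin[bin_index(scores[i][j], n_bins)].append((i, j))
    picked = {}
    for b, pairs in by_bin.items():
        rng.shuffle(pairs)
        picked[b] = pairs[:pairs_per_bin]
    return picked

# Toy symmetric score matrix (made-up values, for illustration only).
toy_scores = [
    [1.0, 0.05, 0.95, 0.42],
    [0.05, 1.0, 0.07, 0.45],
    [0.95, 0.07, 1.0, 0.03],
    [0.42, 0.45, 0.03, 1.0],
]
picked = prepick_pairs(toy_scores, pairs_per_bin=2)
```

During training, the generator would then only draw from `picked` rather than search the score array again. Bins that contain no pairs are simply absent from the result.
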
@florian-huber
Contributor Author

We ran some cProfile tests, and the generator is not that bad after all... still might be worth some more optimization.

@niekdejonge
Collaborator

@florian-huber
In addition, creating a Tanimoto score matrix using the matchms jaccard_similarity_matrix can cause memory issues when a large number of inchikeys is used. See the last comments in: iomega/ms2query#150.

A possible solution might be:
Instead of precalculating all Tanimoto scores, it might be possible to calculate them on the spot from the fingerprints until a score within the bin range is found. As library size increases, this might be faster and more memory-efficient than precalculating a very large Tanimoto score matrix and searching it. I do not know the cost and frequency of the individual steps, though, so it might also slow things down considerably; I am not sure this would actually be a good solution.
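
A minimal sketch of the on-the-spot idea, assuming fingerprints are represented as Python sets of on-bit indices (the real fingerprints are more likely bit vectors). The function names, the `max_tries` guard, and the toy fingerprints are hypothetical and only serve the illustration.

```python
import random

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) score of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def find_partner(fingerprints, query, low, high, rng, max_tries=1000):
    """Draw random candidates until one scores in [low, high) against `query`.

    Returns (partner_index, n_tries), or (None, max_tries) if nothing was found.
    The upper bound is exclusive, so the top bin would need `high` slightly above 1.0.
    """
    candidates = [i for i in range(len(fingerprints)) if i != query]
    for n_tries in range(1, max_tries + 1):
        j = rng.choice(candidates)
        if low <= tanimoto(fingerprints[query], fingerprints[j]) < high:
            return j, n_tries
    return None, max_tries

# Toy fingerprints (made-up bit sets, for illustration only).
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {1, 2, 3, 4, 5}]
partner, tries = find_partner(fps, query=0, low=0.8, high=0.9, rng=random.Random(0))
```

No score matrix is stored at all, so memory stays flat. The cost moves into `n_tries`, which can grow large for sparsely populated bins.
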

Still, it is good to know that the jaccard_similarity_matrix is already causing memory issues for some people when a large number of unique inchikeys is used.

@florian-huber
Contributor Author

Yes, that is indeed a problem we didn't have with our initial training set (roughly 20,000 vs 20,000 pairs)!

Your suggestion: Compute Tanimoto scores on the fly

I think your suggestion could work. My main concern is that the data is extremely biased towards low Tanimoto scores.
If the data generator starts looking for a high Tanimoto pair (say 0.8 - 0.9), it might have to compute scores for thousands of pairs before finding one that fits.
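
To put a rough number on this concern: if candidate pairs are drawn independently and a fraction p of them falls into the target bin, the number of draws until the first hit follows a geometric distribution with mean 1/p. The fraction used below is a made-up illustration, not a measurement from the actual training data.

```python
def expected_draws(fraction_in_bin):
    """Mean number of independent random draws until the first pair lands in the bin."""
    return 1.0 / fraction_in_bin

def miss_probability(fraction_in_bin, n_draws):
    """Probability that none of `n_draws` independent draws lands in the bin."""
    return (1.0 - fraction_in_bin) ** n_draws

p_high = 0.001  # hypothetical share of pairs scoring 0.8-0.9
print(expected_draws(p_high))          # about 1000 draws on average
print(miss_probability(p_high, 1000))  # chance of still having no hit after 1000 draws
```
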

Maybe an alternative: Precompute subset of scores

For much larger arrays, we would not expect to probe all pairs during training anyway.
So it may be enough to compute a subset of scores that is randomized and sufficiently large for model training.
We could store these scores in a sparse matrix and fill it following a simple rule.
E.g.

  • If score belongs to under-represented class, then add.
  • If score belongs to over-represented class, then skip.
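
The add/skip rule above could be sketched as follows. This is an illustration under stated assumptions, not the implementation that later landed in the repository: a plain dict keyed by `(i, j)` stands in for the sparse matrix, fingerprints are sets of on-bit indices, and "over-represented" is defined (arbitrarily) as holding `slack` more scores than the emptiest bin. All names are hypothetical.

```python
import random

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) score of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def fill_balanced(fingerprints, n_draws, n_bins=10, slack=2, seed=0):
    """Sample random pairs and store a score only while its bin is not over-represented."""
    rng = random.Random(seed)
    sparse_scores = {}  # (i, j) -> score; a dict standing in for a sparse matrix
    counts = [0] * n_bins
    for _ in range(n_draws):
        i, j = sorted(rng.sample(range(len(fingerprints)), 2))
        if (i, j) in sparse_scores:
            continue  # pair already stored
        score = tanimoto(fingerprints[i], fingerprints[j])
        b = min(int(score * n_bins), n_bins - 1)
        if counts[b] - min(counts) >= slack:
            continue  # over-represented bin: skip
        sparse_scores[(i, j)] = score
        counts[b] += 1
    return sparse_scores, counts

# Toy data: 30 random fingerprints with 8 of 20 bits set (illustration only).
data_rng = random.Random(1)
fps = [set(data_rng.sample(range(20), 8)) for _ in range(30)]
sparse_scores, counts = fill_balanced(fps, n_draws=500)
```

One consequence of this particular rule: as long as some bin stays empty, every other bin is capped at `slack` entries, so the bias towards low scores still limits how fast the structure fills.
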

@justinjjvanderhooft

I like your second suggestion! It could be part of our planned MS2DeepScore improvements, and/or make a good student project, as we would also need to find out how many high-scoring pairs are "sufficient" for decent model performance.

@niekdejonge
Collaborator

Yes, that could work. Great that the sparse matrices are now implemented :)

Instead of checking whether a bin is over- or under-represented, we could also just keep saving scores until each bin has enough of them, e.g. 50 matches.
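
A sketch of this simpler stopping rule, assuming scored pairs arrive as a stream of `(pair, score)` tuples. Capping each bin at `target` entries is an assumption added here to bound memory; the function name and the toy stream are hypothetical.

```python
def collect_until_full(scored_pairs, target=50, n_bins=10):
    """Keep scored pairs per bin and stop consuming once every bin holds `target` pairs."""
    kept = [[] for _ in range(n_bins)]
    for pair, score in scored_pairs:
        b = min(int(score * n_bins), n_bins - 1)
        if len(kept[b]) < target:
            kept[b].append((pair, score))
        if all(len(bin_pairs) >= target for bin_pairs in kept):
            break  # every bin is full
    return kept

# Toy stream with two bins and a target of two pairs per bin (made-up scores).
stream = [((0, 1), 0.1), ((0, 2), 0.2), ((0, 3), 0.3),
          ((1, 2), 0.9), ((1, 3), 0.8), ((2, 3), 0.7)]
kept = collect_until_full(stream, target=2, n_bins=2)
```

If the stream ends before all bins are full, the function returns whatever was collected, so callers would still need to check the bin sizes.
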

@florian-huber
Contributor Author

florian-huber commented Aug 15, 2023

This was now realized in #145.

@florian-huber
Contributor Author

This is now fixed with the new, much faster generator in #168.
