Improve generator performance #107

florian-huber · 2022-10-17T12:38:37Z

Currently, the data generator has to pick random pairs of spectra during training (and testing) that fall into specific Tanimoto score bins.
This is computationally very expensive because a full array of Tanimoto scores needs to be searched for each data point. In my experience, the deep learning training itself might account for only 1-5% of the total computation time!

One strategy could be:

Pre-pick the actual pairs before training! E.g. collect hundreds of pairs covering all Tanimoto bin ranges (0-0.1, 0.1-0.2, ..., 0.9-1.0).
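A minimal sketch of what such pre-picking could look like, assuming a precomputed, symmetric score matrix given as a list of lists. The names `prepick_pairs` and `bin_index` and all parameters are hypothetical and not part of the existing generator.

```python
import random
from collections import defaultdict


def bin_index(score, n_bins=10):
    """Map a score in [0, 1] to a bin index; a score of 1.0 goes into the last bin."""
    return min(int(score * n_bins), n_bins - 1)


def prepick_pairs(scores, n_bins=10, pairs_per_bin=100, seed=0):
    """Pick up to `pairs_per_bin` random index pairs (i, j), i < j, for every bin.

    `scores` is assumed to be a square, symmetric matrix of Tanimoto scores.
    """
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    n = len(scores)
    # Visit every unordered pair once and group the pairs by score bin.
    for i in range(n):
        for j in range(i + 1, n):
            by_bin[bin_index(scores[i][j], n_bins)].append((i, j))
    # Sample without replacement; sparsely populated bins return fewer pairs.
    return {
        b: rng.sample(by_bin[b], min(pairs_per_bin, len(by_bin[b])))
        for b in range(n_bins)
    }
```

The binning pass runs once up front, so during training the generator would only need to draw from the stored lists instead of searching the score array for each data point.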

florian-huber · 2022-11-29T19:26:41Z

We ran some cProfile tests, and the generator is not that bad after all... still might be worth some more optimization.

niekdejonge · 2023-01-13T14:27:39Z

@florian-huber
In addition, creating a Tanimoto score matrix using the matchms jaccard_similarity_matrix can cause memory issues when a large number of inchikeys is used. See the last comments in: iomega/ms2query#150.

A possible solution might be:
Instead of pre-calculating all the Tanimoto scores, it might be possible to calculate them on the spot from the fingerprints until a Tanimoto score within the bin range is found. With library size increasing, this might be faster and more memory efficient than precalculating a very large Tanimoto score matrix and searching this matrix. I do not really know the speed and frequency of the different steps involved, so it might also slow things down a lot; I am not sure this would actually be a good solution.
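As a rough illustration of this idea, the sketch below draws random candidates and scores them on the spot until one falls into the requested bin. It assumes fingerprints are stored as sets of on-bit indices; `tanimoto`, `find_partner`, and `max_tries` are hypothetical names, not existing MS2DeepScore or matchms functions.

```python
import random


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) score of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def find_partner(fingerprints, query_idx, low, high, rng, max_tries=1000):
    """Draw random candidates until one scores within [low, high).

    Returns (candidate_index, score), or None if all `max_tries` draws miss.
    """
    for _ in range(max_tries):
        candidate = rng.randrange(len(fingerprints))
        if candidate == query_idx:
            continue  # never pair a compound with itself
        score = tanimoto(fingerprints[query_idx], fingerprints[candidate])
        if low <= score < high:
            return candidate, score
    return None
```

No score matrix is kept in memory here. The cost moves to the number of draws needed per pair, which is capped by `max_tries` so a sparsely populated bin cannot stall the generator.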

Still, it is good to know that the jaccard_similarity_matrix is already causing memory issues for some people when a large number of unique inchikeys is used.

florian-huber · 2023-01-13T14:35:45Z

Yes, that is indeed a problem we didn't have with our initial training set (roughly 20,000 vs 20,000 pairs)!

Your suggestion: Compute Tanimoto scores on the fly

I think your suggestion could work. My main concern would be that the data is extremely biased towards low Tanimoto scores.
If the data generator starts looking for a high Tanimoto pair (say 0.8 - 0.9), then it might have to compute thousands of pairs before finding one that fits.
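To put a rough number on this concern: if a fraction $p$ of all pairs falls into the target bin and candidates are drawn independently and uniformly at random, the number of draws until the first hit follows a geometric distribution. The value of $p$ used below is purely illustrative and not measured on any dataset.

```latex
\mathbb{E}[\text{draws}] = \frac{1}{p},
\qquad
p = 0.001 \;\Rightarrow\; \mathbb{E}[\text{draws}] = 1000
```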

Maybe an alternative: Precompute subset of scores

For much larger arrays we would not expect to probe all pairs during training anyway.
So it might be enough to compute a subset of scores that is randomized and sufficiently large for model training.
We could use a sparse matrix for that and fill it with scores following a certain logic, e.g.:

If score belongs to under-represented class, then add.
If score belongs to over-represented class, then skip.
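One possible reading of this add/skip rule, sketched under several assumptions: a plain dictionary keyed by `(i, j)` stands in for the sparse matrix, fingerprints are sets of on-bit indices, and a bin counts as over-represented once it holds more scores than the current mean over all bins. That threshold and the names `fill_sparse_scores`, `bin_index`, and `tanimoto` are hypothetical choices.

```python
import random
from collections import Counter


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) score of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def bin_index(score, n_bins=10):
    """Map a score in [0, 1] to a bin index; a score of 1.0 goes into the last bin."""
    return min(int(score * n_bins), n_bins - 1)


def fill_sparse_scores(fingerprints, n_draws, n_bins=10, seed=0):
    """Score `n_draws` random pairs and store only those from under-represented bins."""
    rng = random.Random(seed)
    sparse_scores = {}  # (i, j) -> score; stand-in for a sparse matrix
    counts = Counter()  # number of stored scores per bin
    n = len(fingerprints)
    for _ in range(n_draws):
        i, j = sorted(rng.sample(range(n), 2))
        if (i, j) in sparse_scores:
            continue  # pair already stored
        score = tanimoto(fingerprints[i], fingerprints[j])
        b = bin_index(score, n_bins)
        mean_count = len(sparse_scores) / n_bins
        if counts[b] <= mean_count:
            # Under-represented bin: add the score.
            sparse_scores[(i, j)] = score
            counts[b] += 1
        # Otherwise the bin is over-represented and the score is skipped.
    return sparse_scores, counts
```

With this rule no bin can exceed the mean bin count by more than one stored score, so frequent low-score pairs are mostly skipped once their bins run ahead of the rarer high-score bins.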

justinjjvanderhooft · 2023-01-13T14:39:18Z

I like your second suggestion! It could be part of our planned MS2DeepScore improvements, and/or make a good student project, as we would also need to find out what amount of high scoring pairs is "sufficient" for a decent model performance....

niekdejonge · 2023-01-13T16:16:48Z

Yes, that could work. Great that the sparse matrices are now implemented :)

Instead of checking whether a bin is over- or under-represented, we could also just keep saving scores until each bin holds enough matches, e.g. 50.
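A minimal sketch of this quota-based variant, again assuming fingerprints stored as sets of on-bit indices. The function `collect_until_full` and the `max_draws` budget are hypothetical; the budget is only there so the loop ends even if a rare bin never fills.

```python
import random


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) score of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def bin_index(score, n_bins=10):
    """Map a score in [0, 1] to a bin index; a score of 1.0 goes into the last bin."""
    return min(int(score * n_bins), n_bins - 1)


def collect_until_full(fingerprints, target_per_bin=50, n_bins=10,
                       max_draws=100_000, seed=0):
    """Score random pairs until every bin holds `target_per_bin` pairs.

    Stops early once all bins are full, or after `max_draws` draws.
    Returns {bin_index: {(i, j): score}}.
    """
    rng = random.Random(seed)
    pairs_by_bin = {b: {} for b in range(n_bins)}
    n = len(fingerprints)
    for _ in range(max_draws):
        if all(len(bucket) >= target_per_bin for bucket in pairs_by_bin.values()):
            break  # every bin has reached its quota
        i, j = sorted(rng.sample(range(n), 2))
        score = tanimoto(fingerprints[i], fingerprints[j])
        bucket = pairs_by_bin[bin_index(score, n_bins)]
        if len(bucket) < target_per_bin:
            bucket[(i, j)] = score  # a repeated pair just overwrites itself
    return pairs_by_bin
```

Compared with an over/under-representation check, this only needs one counter per bin, at the price of discarding every score that lands in a bin that is already full.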

florian-huber · 2023-08-15T12:46:52Z

This was now realized in #145.

florian-huber · 2024-01-23T10:52:05Z

This is now fixed with the new, much faster generator in #168.

florian-huber added the performance label Oct 17, 2022

florian-huber mentioned this issue Aug 15, 2023

New pair generation #145

Merged

florian-huber closed this as completed Jan 23, 2024
