
Improve inchikey pair selection and data generators #232

Open
wants to merge 76 commits into
base: main

niekdejonge
Collaborator

@niekdejonge niekdejonge commented Aug 29, 2024

Refactoring of the InchikeyPairSelection and the data generators.

Changes made:

  • Made DataGenerator and SelectedCompoundPairs less intertwined. SelectedCompoundPairs now only contains the inchikey pairs, and DataGenerator only adds the spectrum picking and tensorization.
  • Fixed the bug where only one spectrum per inchikey was used during training.
  • Sampling of inchikeys is more balanced. The algorithm now works as follows:
    • First, Tanimoto scores are collected. Per Tanimoto score bin, up to 100 pairs are selected for each inchikey (if available). The cap of 100 pairs per inchikey keeps the algorithm scalable and avoids memory issues: this way memory scales linearly instead of quadratically.
    • Pairs are converted from a matrix to a list of pairs of the form (inchikey1, inchikey2, score). Any inverted duplicates, e.g. (inchikey2, inchikey1, score), are removed.
    • We count the number of available pairs per bin. The number of pairs in the bin with the fewest pairs is used as the number of pairs per bin, to ensure a good balance over the different bins.
    • We select pairs bin by bin, starting with the bin with the fewest pairs and ending with the bin with the most pairs.
    • We track the frequency of each inchikey in all pairs over all bins. Both the first and the second inchikey in each pair are counted, which was not the case before.
    • When picking pairs, we start with the inchikey with the lowest frequency that is available for that Tanimoto bin. We select all available pairs for this inchikey in this bin. There can be multiple pairs available, e.g. [(inchikey5, inchikey3, 0.9), (inchikey5, inchikey8, 0.92), (inchikey5, inchikey6, 0.91)]. From these pairs we pick the pair where the second inchikey is the least frequent.
    • This results in a list of pairs with a perfect balance over the different bins and an almost perfect balance of unique inchikeys.
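
For reviewers, the selection steps above can be sketched roughly as follows. This is a simplified illustration and not the code in this PR: the function names (`deduplicate_pairs`, `assign_to_bins`, `select_balanced_pairs`) are hypothetical, the cap of 100 pairs per inchikey is left out, ties are broken alphabetically to keep the output deterministic, and a pair counts as belonging to an inchikey whether that inchikey is stored first or second. The bin check assumes a pair falls in a bin when `low < score <= high`, which is why the first bin starts below 0.

```python
from collections import Counter, defaultdict


def deduplicate_pairs(pairs):
    """Keep one entry per unordered inchikey pair, dropping inverted duplicates."""
    unique = {}
    for key_1, key_2, score in pairs:
        unique.setdefault(tuple(sorted((key_1, key_2))), score)
    return [(key_1, key_2, score) for (key_1, key_2), score in unique.items()]


def assign_to_bins(pairs, bins):
    """Group pairs by score bin (assumed rule: low < score <= high)."""
    binned = defaultdict(list)
    for pair in pairs:
        for index, (low, high) in enumerate(bins):
            if low < pair[2] <= high:
                binned[index].append(pair)
                break
    return binned


def select_balanced_pairs(pairs, bins):
    """Select the same number of pairs per bin, preferring rarely used inchikeys."""
    binned = assign_to_bins(deduplicate_pairs(pairs), bins)
    # The smallest bin determines how many pairs every bin contributes.
    pairs_per_bin = min(len(binned[index]) for index in range(len(bins)))
    frequency = Counter()  # counts both inchikeys of every selected pair
    selected = []
    # Fill the bins from the one with the fewest pairs to the one with the most.
    for index in sorted(range(len(bins)), key=lambda i: len(binned[i])):
        available = list(binned[index])
        for _ in range(pairs_per_bin):
            # Least frequent inchikey that still has a pair in this bin.
            keys = {key for pair in available for key in pair[:2]}
            rarest = min(keys, key=lambda key: (frequency[key], key))

            def partner(pair):
                return pair[1] if pair[0] == rarest else pair[0]

            # Among its pairs, take the one whose other inchikey is least frequent.
            candidates = [pair for pair in available if rarest in pair[:2]]
            best = min(candidates, key=lambda pair: (frequency[partner(pair)], partner(pair)))
            available.remove(best)
            selected.append(best)
            frequency.update(best[:2])
    return selected


# Toy data: five unique pairs plus one inverted duplicate, spread over two bins.
example_pairs = [
    ("A", "B", 0.1), ("B", "A", 0.1), ("A", "C", 0.2),
    ("B", "C", 0.9), ("C", "D", 0.8), ("A", "D", 0.95),
]
example_bins = [(-0.01, 0.5), (0.5, 1.0)]
selected = select_balanced_pairs(example_pairs, example_bins)
print(selected)
```

In this toy example the low-score bin has only two unique pairs, so the high-score bin also contributes two pairs and its third pair is left out.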

Additional changes:

  • I have improved the tests. We now test the balance of inchikeys, the balance of scores, and that we actually select all spectra (and not just one per inchikey).
  • Before, an epoch would be exactly nr_of_inchikeys long, even if that did not fit the batch size. This could result in the final batch being smaller than batch_size. Now the final batch also uses the full batch size, resulting in an epoch that takes a few more pairs than nr_of_inchikeys.
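
The new epoch length can be illustrated with a small calculation. This is a sketch under the assumption that the number of batches is rounded up so that every inchikey is still covered; `pairs_per_epoch` is a hypothetical helper and not a function in this PR.

```python
import math


def pairs_per_epoch(nr_of_inchikeys, batch_size):
    """Epoch length when every batch, including the last one, is full."""
    # Assumption: round up, so an epoch never covers fewer pairs than nr_of_inchikeys.
    nr_of_batches = math.ceil(nr_of_inchikeys / batch_size)
    return batch_size * nr_of_batches


# 1000 inchikeys with batch size 32: 32 full batches instead of 31 full batches plus one of 8.
print(pairs_per_epoch(1000, 32))
```

When nr_of_inchikeys is already a multiple of the batch size, the epoch length is unchanged.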

sonarcloud bot commented Aug 29, 2024

… batch_size * nr_of_batches instead of nr_of_inchikeys (before the last batch, would be smaller than batch size, to exactly fit the nr_of_inchikeys)
…t second inchikey of the pair for better balanced selection.
…and <= bin[1]. Since we now require the bins to start at a value smaller than 0 (e.g. -0.01)
2 participants