Improve inchikey pair selection and data generators #232

niekdejonge · 2024-08-29T10:08:29Z

Refactoring of the InchikeyPairSelection and datagenerators.

Changes made:

Made DataGenerator and SelectedCompoundPairs less intertwined. SelectedCompoundPairs now only contains the inchikey pairs, and the DataGenerator only adds the spectrum picking and tensorization.
Fixed the bug that only one spectrum per inchikey was used during training.
Sampling of inchikeys is more balanced. The algorithm now works like this:
- First, Tanimoto scores are collected. Per Tanimoto score bin, 100 pairs are selected for each inchikey (if available). The limit of 100 pairs per inchikey keeps the algorithm scalable and avoids memory issues: this way memory scales linearly instead of quadratically.
- Pairs are converted from a matrix to a list of pairs in the form (inchikey1, inchikey2, score). Any inverted duplicates are removed (e.g. (inchikey2, inchikey1, score)).
- We count the available number of pairs per bin. The number of pairs in the bin with the fewest pairs is used as the number of pairs per bin, to ensure a good balance over the different bins.
- We start selecting pairs per bin from the bin with the lowest number of pairs to the bin with the highest number of pairs.
- We track the frequency of each inchikey in all pairs over all bins. Both the first and the second inchikey in each pair are counted, which was not the case before.
- When picking pairs, we start with the inchikey with the lowest frequency that is available for that Tanimoto bin. We collect all available pairs for this inchikey in this bin. There can be multiple pairs available, e.g. [(inchikey5, inchikey3, 0.9), (inchikey5, inchikey8, 0.92), (inchikey5, inchikey6, 0.91)]. From these pairs we pick the pair where the second inchikey is the least frequent.
- This results in a list of pairs that has a perfect balance over the different bins and an almost perfect balance of unique inchikeys.
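
The steps above can be illustrated with a small sketch. This is not the code from this PR. The function names (`dedupe_pairs`, `other_key`, `select_balanced_pairs_sketch`) and the example data are made up for illustration. The sketch assumes pairs are already grouped per score bin, and that after removing inverted duplicates an inchikey may appear in either position of a pair. Ties are broken alphabetically to keep the result deterministic.

```python
from collections import Counter


def dedupe_pairs(pairs):
    """Drop inverted duplicates, e.g. (b, a, score) when (a, b, score) was already seen."""
    seen = set()
    unique = []
    for key1, key2, score in pairs:
        canonical = tuple(sorted((key1, key2)))
        if canonical not in seen:
            seen.add(canonical)
            unique.append((key1, key2, score))
    return unique


def other_key(pair, key):
    """Return the inchikey in the pair that is not `key` (or `key` itself for a self-pair)."""
    return pair[1] if pair[0] == key else pair[0]


def select_balanced_pairs_sketch(pairs_per_bin):
    """pairs_per_bin: one list of (inchikey1, inchikey2, score) tuples per score bin."""
    available = [dedupe_pairs(pairs) for pairs in pairs_per_bin]
    # The smallest bin determines how many pairs are taken from every bin.
    nr_per_bin = min(len(pairs) for pairs in available)
    counts = Counter()  # frequency of each inchikey over all selected pairs
    selected = []
    # Visit bins from the fewest to the most available pairs.
    for bin_pairs in sorted(available, key=len):
        remaining = list(bin_pairs)
        for _ in range(nr_per_bin):
            # Least frequent inchikey that still has a pair in this bin.
            candidates = {key for pair in remaining for key in pair[:2]}
            least = min(candidates, key=lambda k: (counts[k], k))
            options = [p for p in remaining if least in p[:2]]
            # Among its pairs, take the one whose partner is least frequent.
            best = min(
                options,
                key=lambda p: (counts[other_key(p, least)], other_key(p, least)),
            )
            remaining.remove(best)
            selected.append(best)
            counts[best[0]] += 1
            counts[best[1]] += 1
    return selected


example_bins = [
    [("A", "B", 0.1), ("B", "A", 0.1), ("A", "C", 0.2), ("C", "D", 0.15)],
    [("A", "D", 0.9), ("B", "C", 0.95)],
]
selected = select_balanced_pairs_sketch(example_bins)
inchikey_counts = Counter(key for pair in selected for key in pair[:2])
```

In this toy example each bin contributes the same number of pairs, and `inchikey_counts` shows how often each inchikey was used.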

Additional changes:

I have improved the tests. We now test the balance of inchikeys, the balance of scores, and that we actually select all spectra (and not just one per inchikey).
Before, an epoch was exactly nr_of_inchikeys long, even if that did not fit the batch size. This could result in the final batch being smaller than batch_size. Now the final batch also uses the full batch_size, resulting in an epoch that takes a few more pairs than nr_of_inchikeys.
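
As a small illustration of the new epoch length: the helper below is hypothetical and not taken from the PR. It assumes the number of batches is rounded up so that every inchikey still fits in an epoch, which matches "a few more pairs than the nr_of_inchikeys" but is not confirmed by the description.

```python
import math


def epoch_size_sketch(nr_of_inchikeys, batch_size):
    """Return (nr_of_batches, pairs_per_epoch) when every batch is full."""
    # Assumption: round up, so the last batch is padded to batch_size
    # instead of being cut short.
    nr_of_batches = math.ceil(nr_of_inchikeys / batch_size)
    return nr_of_batches, nr_of_batches * batch_size


nr_of_batches, pairs_per_epoch = epoch_size_sketch(nr_of_inchikeys=10, batch_size=4)
```

With 10 inchikeys and a batch size of 4, the old behaviour would give batches of 4, 4 and 2, while under this assumption the epoch has 3 full batches.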

sonarcloud · 2024-08-29T10:14:24Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

niekdejonge added 3 commits August 29, 2024 10:34

Add test for issue

0819cf7

Skipp cases where no pair is available

11325ad

Rename to current_batch_index

9a7b518

niekdejonge added 26 commits August 29, 2024 13:30

Add shuffling option to SelectedCompoundNames.generator

6191e15

Update tests to test shuffling option of SelectedCompoundNames.generator

c23290c

Use CompoundPairSelector.generator() in DataGeneratorPytorch, results…

560fb6a

… in fixed batch length

Update test_DataGeneratorPytorch, since the length of an epoch is now…

227fd01

… batch_size * nr_of_batches instead of nr_of_inchikeys (before the last batch, would be smaller than batch size, to exactly fit the nr_of_inchikeys)

linting

363e6da

Undo removal of StopIteration and fix actual linting issue by adding …

5726542

…try except loop

Separate nr_of_batches from length

40cc133

Added test for equal inchikey distribution (which currently fails)

4c93afb

Improve documentation and code readability of spectrum_pair_selection.py

ba71d7f

Rename spectrum_pair_selection to inchikey_pair_selection.py

d7b6eaf

Add new methods for balanced spectrum pair selection

d525805

Add SelectedInchikeyPairs

7e263c1

Fix bug selecting only one spectrum per inchikey (test still needs to…

a48155c

… be added)

Use inchikeys instead of indexes in available_inchikey_counts

b761419

Remove spectrums as output from select_compound_pairs_wrapper in test

48b21de

Make test for inchikey balance less strict

64b0477

Switch order of generator output

88c574d

Remove SelectedCompoundPairs and replace tests with tests for Selecte…

b065144

…dInchikeyPairs

Fix test model training (still expected spectra as output from select…

0a93443

… compound pairs wrapper)

Remove outdated test and add test for balanced inchikeys for selectin…

d92b466

…g_inchikey_pairs

Add methods for calculating inchikey counts to SelectedInchikeyPairs …

9c22063

…(needed for previous commit)

Only select each pair once in convert_selected_pairs_matrix (remove r…

1836d8c

…eversed pairs)

Change inchikey pair selection algorithm to also select least frequen…

38c6be2

…t second inchikey of the pair for better balanced selection.

Move create test spectra

3fe3e2b

Add test for balanced score when selecting inchikey pairs

ce2273f

Add test to check that there are no repeating pairs in when selecting…

a6abf48

… inchikey pairs

niekdejonge added 29 commits September 17, 2024 10:15

Change the inchikey_pair_selection to always select bins by > bin[0] …

ec7acb9

…and <= bin[1]. Since we now require the bins to start at a value smaller than 0 (e.g. -0.01)

Don't validate settings when loading the model (to reduce unnecessary…

4338a58

… backwards compatibility issues)

Update bins in tests to start at -0.01

454b204

Remove the sorting bases on lowest number of pairs and instead use th…

9770d01

…e order given as input.

Restructure tests in test_inchikey_pair_selection.py and add tests fo…

d70fe99

…r oversampling

Add save as json

510eedc

Add description to tqdm

ed4d7c3

Add saving train_generator pairs as json in train_ms2ds_model

0e9cde0

Remove select one spectrum per inchikey function, since redundant

74ac106

Fix create_test_spectra.py

9d00498

Fix ValidationLossCalculator.py for equal multiple spectra per inchik…

f103696

…ey sampling

Fix tests that expected 2 spectra per inchikey

0599a7f

Optimized speed by working with numpy matrixes

04faf50

Remove the need for convert_selected_pairs_matrix in tests

828fae8

Remove unused import

1648799

Remove unused functions

c222380

Improve progress bars.

77e7526

Add progress bar when loading in spectra

bfe25e2

Remove unused conversion to coo arrays and adjust tests that still re…

a67e5b0

…lied on this

Rename SelectedInchikeyPairs to InchikeyPairGenerator

66aefb8

move inchikeyPairGenerator to data_generators.py

c404243

Rename DataGeneratorPytorch to SpectrumPairGenerator

1aad8bc

reformatting file

3b95b32

Reordering function order and adding docstrings

d20e23a

Add docstring to select_balanced_pairs

fbd8bcc

Change output of select compound pairs wrapper to list of pairs inste…

57252fb

…ad of InchikeyPairGenerator

Fixing prospector warnings

7022b47

Remove unreliable test.

3d3fe5b

Fix sonarcloud issue

4970a41

niekdejonge requested a review from florian-huber September 25, 2024 12:15
