Change creating sqlite file function #63

niekdejonge · 2021-02-16T16:57:07Z

Currently, creating an SQLite file requires three files:

A .npy file with all Tanimoto scores
A pickled file with all spectra
A CSV file with the spectrum names in the order corresponding to the .npy file

It might be better to change this so that only a pickled file with spectra is needed, with the Tanimoto scores calculated inside this function.

This makes it possible to easily integrate filtering/processing of spectra in this function and makes the function easier to understand.

The downside is that calculating Tanimoto scores takes quite a long time.

Currently, Tanimoto scores are often already calculated for new datasets, but I am not sure whether they are used anywhere besides MS2Query. If they are not used for other applications, we can integrate the Tanimoto score calculation into the create SQLite database function, which makes the code easier to understand and less susceptible to errors.
If the Tanimoto scores are always calculated beforehand, we could keep the same structure (to save loading time), but in that case the filtering/processing of the spectra should be done before calculating the Tanimoto scores and before loading into create_sqlite_wrapper.

niekdejonge · 2021-02-16T17:02:12Z

A third, easy solution is to still integrate the preselection in make_sqlite_wrapper and simply add all Tanimoto scores of the unprocessed dataset to the SQLite file. In this case, Tanimoto scores may be stored for a few InChIKeys that are not needed, because the corresponding spectra have been removed. This would not cause any problems further downstream, but it feels a bit less clean.

niekdejonge · 2021-02-17T15:52:14Z

Decided together with Florian that this will be changed into the following solution:

The Tanimoto scores will be stored in a pd.DataFrame with the InChIKeys as both columns and index (instead of the current .npy file for the Tanimoto scores and .csv file for the order of the InChIKeys).
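As a rough illustration of this format, the sketch below wraps an existing score matrix and its ordered list of InChIKeys into a single labelled DataFrame. The function name `scores_to_dataframe` and the placeholder InChIKeys are hypothetical and are not part of the MS2Query code base.

```python
import numpy as np
import pandas as pd


def scores_to_dataframe(scores, inchikeys):
    """Combine a square score matrix and its InChIKey order into one DataFrame.

    Hypothetical helper: `scores` is the matrix as loaded from the .npy file,
    `inchikeys` is the list of keys in the same order as the rows/columns.
    """
    scores = np.asarray(scores)
    n = len(inchikeys)
    if scores.shape != (n, n):
        raise ValueError(
            f"Expected a {n}x{n} matrix for {n} InChIKeys, got shape {scores.shape}"
        )
    if len(set(inchikeys)) != n:
        raise ValueError("InChIKeys must be unique to be used as index and columns")
    return pd.DataFrame(scores, index=list(inchikeys), columns=list(inchikeys))


# Small made-up example with two placeholder InChIKeys (first 14 characters).
example = scores_to_dataframe(
    np.array([[1.0, 0.25], [0.25, 1.0]]),
    ["AAAAAAAAAAAAAA", "BBBBBBBBBBBBBB"],
)
```

With this layout a score can be looked up by label, for example `example.loc["AAAAAAAAAAAAAA", "BBBBBBBBBBBBBB"]`, so a separate file with the InChIKey order is no longer needed.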

The create_sqlite_wrapper function will have an optional argument for supplying a pickled pd.DataFrame with the Tanimoto scores. If this file is not provided, the function will automatically calculate the Tanimoto scores for all provided spectra.
If the file is provided, it will be used instead, which saves a lot of time.
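A minimal sketch of this optional-argument behaviour is shown below. It is not the actual create_sqlite_wrapper implementation: the function name `load_or_calculate_tanimoto_scores`, its parameters, and the `calculate_scores` callback are assumptions made for illustration only.

```python
from typing import Callable, List, Optional

import pandas as pd


def load_or_calculate_tanimoto_scores(
    inchikeys: List[str],
    calculate_scores: Callable[[List[str]], pd.DataFrame],
    tanimoto_scores_file: Optional[str] = None,
) -> pd.DataFrame:
    """Return Tanimoto scores, loading them from a pickle when one is supplied.

    Hypothetical helper: `calculate_scores` stands in for whatever function
    computes the scores when no precomputed file is given.
    """
    if tanimoto_scores_file is not None:
        # Fast path: reuse the precomputed, pickled DataFrame.
        scores = pd.read_pickle(tanimoto_scores_file)
        if not isinstance(scores, pd.DataFrame):
            raise TypeError("The Tanimoto score file should contain a pd.DataFrame")
    else:
        # Slow path: compute the scores for the provided InChIKeys.
        scores = calculate_scores(inchikeys)

    # Every requested InChIKey needs a row in the score table.
    missing = set(inchikeys) - set(scores.index)
    if missing:
        raise ValueError(f"No Tanimoto scores found for {len(missing)} InChIKeys")
    return scores
```

Passing the calculation in as a callback keeps the sketch self-contained. In the real function the calculation would more likely be called directly.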

To do:

Write a wrapper that calculates the Tanimoto scores and outputs a pd.DataFrame.
There are already functions for this, but their output is currently a .npy file and a CSV file with the order of the InChIKeys.
Add an if statement that checks whether a Tanimoto score file (pickled DataFrame) is supplied. If so, it is loaded and used; otherwise the scores are calculated with the wrapper mentioned above.
The functions that load data into SQLite currently expect file names; this should be changed to a pd.DataFrame or a list of InChIKeys.
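The first to-do item could look roughly like the sketch below, which computes pairwise Tanimoto (Jaccard) scores from binary fingerprints with plain NumPy and returns them as a labelled DataFrame. It assumes the fingerprints are already available as a 0/1 matrix with one row per InChIKey. How the fingerprints are generated is not shown, and the function name `tanimoto_scores_dataframe` is hypothetical rather than one of the existing functions.

```python
import numpy as np
import pandas as pd


def tanimoto_scores_dataframe(fingerprints, inchikeys) -> pd.DataFrame:
    """Pairwise Tanimoto scores for binary fingerprints as a labelled DataFrame.

    Hypothetical wrapper: `fingerprints` is a 2D array of 0/1 values with one
    row per InChIKey, in the same order as `inchikeys`.
    """
    fp = np.asarray(fingerprints, dtype=bool).astype(np.int64)
    if fp.ndim != 2 or fp.shape[0] != len(inchikeys):
        raise ValueError("Expected one fingerprint row per InChIKey")

    # Shared "on" bits for every pair of fingerprints.
    intersection = fp @ fp.T
    bit_counts = fp.sum(axis=1)
    # |A or B| = |A| + |B| - |A and B|
    union = bit_counts[:, None] + bit_counts[None, :] - intersection

    # Assumption: pairs of all-zero fingerprints get a score of 0.0, so the
    # output never contains NaN or None.
    scores = np.zeros(intersection.shape, dtype=float)
    np.divide(intersection, union, out=scores, where=union > 0)

    return pd.DataFrame(scores, index=list(inchikeys), columns=list(inchikeys))
```

The full matrix grows quadratically with the number of InChIKeys, which is consistent with the calculation being slow for large datasets.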

niekdejonge · 2021-02-24T09:13:37Z

To do:
Check for and remove null values in the Tanimoto scores.

When rewriting the function that creates the Tanimoto scores, one extra feature should be added. The current similarities_AllInchikeys14_daylight2048_jaccard.npy file contains some null values (at least at row 7819), which cause problems further down the line. The function that creates the Tanimoto matrix should never output null values. The problem is currently patched by removing None values when they are selected (in get_tanimoto_for_spectrum_ids in select_dat_for_training_nn.py); this patch should be removed once the check is implemented in the function that creates the Tanimoto matrix.
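One possible way to enforce this is a validation step that runs before the scores are stored, sketched below. The function name `assert_no_missing_scores` is hypothetical, and the sketch assumes the scores are held in a pd.DataFrame indexed by InChIKey.

```python
import pandas as pd


def assert_no_missing_scores(scores: pd.DataFrame) -> None:
    """Raise if the Tanimoto score table contains any None or NaN value.

    Hypothetical check, meant to run before the scores are written to SQLite.
    """
    missing = scores.isna()  # True for both None and NaN entries
    if missing.to_numpy().any():
        bad_rows = list(scores.index[missing.any(axis=1)])
        raise ValueError(
            f"Tanimoto scores contain missing values in {len(bad_rows)} row(s), "
            f"for example: {bad_rows[:5]}"
        )
```

Failing early at creation time means the downstream patch that filters None values on selection would no longer be needed.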

niekdejonge · 2021-10-28T09:05:13Z

This is currently not a priority. The following things discussed above could still be implemented:

Make providing a Tanimoto score file optional. When no file is provided, MS2Query will use a function to create the Tanimoto scores.
Double-check whether the None values are still an issue.

niekdejonge · 2022-09-23T13:16:42Z

This was solved in #146

niekdejonge added the code structure and enhancement labels and removed the code structure label on Oct 28, 2021

niekdejonge closed this as completed Sep 23, 2022
