Change creating sqlite file function #63

Closed
niekdejonge opened this issue Feb 16, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@niekdejonge
Collaborator

Currently, three files are needed to create an sqlite file:

  • .npy file with all tanimoto scores
  • .pickled file with all spectra
  • csv file with the spectra names in the order corresponding to the .npy file

It might be better to change this so that only a pickled file with spectra is needed, by calculating the tanimoto scores inside this function.

This makes it easy to integrate filtering/processing of spectra into this function and makes the function easier to understand.

The downside is that calculating tanimoto scores takes quite a long time.

Currently, tanimoto scores are often already calculated for new datasets, but I am not sure whether they are used outside MS2Query. If they are not used by other applications, we can integrate the tanimoto score calculation into the create sqlite database function, which makes the code easier to understand and less susceptible to errors.
If the tanimoto scores are always calculated beforehand, we could keep the current structure (to save loading time), but in that case the filtering/processing of the spectra should be done before calculating the tanimoto scores and before loading into create_sqlite_wrapper.

@niekdejonge
Collaborator Author

A third, easy solution is to still integrate the preselection in make_sqlite_wrapper and simply add all tanimoto scores of the unprocessed dataset to sqlite. In this case, tanimoto scores may be stored for a few inchikeys that are not needed, because the corresponding spectra have been removed. This would not cause any problems further downstream, but feels a bit less clean.

@niekdejonge
Collaborator Author

Decided together with Florian that this will be changed into the following solution:

The tanimoto scores will be stored in a pd.DataFrame with the inchikeys as both columns and index (instead of the current .npy file for the tanimoto scores and .csv file for the order of the inchikeys).
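A minimal sketch of what this conversion could look like, assuming the existing .npy matrix is square and the csv lists the inchikeys in the same order. The function name `scores_to_dataframe` and the file names in the usage comment are hypothetical, not part of the current code base.

```python
import numpy as np
import pandas as pd


def scores_to_dataframe(scores, inchikeys):
    """Combine a square score matrix and its ordered inchikeys into one DataFrame.

    Hypothetical helper: `scores` is the matrix currently stored as .npy and
    `inchikeys` is the ordered list currently stored as csv.
    """
    scores = np.asarray(scores, dtype=float)
    inchikeys = list(inchikeys)
    if scores.shape != (len(inchikeys), len(inchikeys)):
        raise ValueError("Score matrix shape does not match the number of inchikeys")
    if len(set(inchikeys)) != len(inchikeys):
        raise ValueError("Inchikeys must be unique to be used as index and columns")
    # The same labels are used for rows and columns, so a score can be looked
    # up by inchikey pair instead of by position.
    return pd.DataFrame(scores, index=inchikeys, columns=inchikeys)


# Possible usage (file names are placeholders):
# scores = np.load("tanimoto_scores.npy")
# inchikeys = pd.read_csv("inchikey_order.csv", header=None)[0].tolist()
# scores_to_dataframe(scores, inchikeys).to_pickle("tanimoto_scores.pickle")
```

Storing the labels together with the scores removes the need to keep a separate file in sync with the matrix order.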

The create_sqlite_wrapper function will have an optional argument to supply a pickled pd.DataFrame with the tanimoto scores. If this file is not provided, it will automatically run a function that calculates the tanimoto scores for all provided spectra.
If the file is provided, it will use this file instead, thereby saving a lot of time.

To do:

  • Write a wrapper that calculates the tanimoto scores and outputs a pd.DataFrame.
    There are already functions for this, but the output is currently a .npy file and a csv file with the order of the inchikeys.
  • Add an if statement that checks whether a tanimoto score file (pickled DataFrame) is supplied. If so, it is loaded and used; otherwise the scores are calculated with the above-mentioned wrapper.
  • The load-into-sqlite functions currently expect file names; this should be changed to a pd.DataFrame or a list of inchikeys.
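The first two to-do items could be sketched roughly as follows. This is only an illustration: the function names, the `tanimoto_scores_file` argument and the input format (a dict from inchikey to binary fingerprint) are assumptions, and how fingerprints are derived from the spectra is left out. Scoring two all-zero fingerprints as 0 is also an assumption.

```python
from typing import Dict, Optional, Sequence

import numpy as np
import pandas as pd


def calculate_tanimoto_scores(fingerprints: Dict[str, Sequence[int]]) -> pd.DataFrame:
    """Return all-vs-all tanimoto (jaccard) scores for binary fingerprints."""
    inchikeys = list(fingerprints)
    bits = np.array([fingerprints[key] for key in inchikeys], dtype=bool).astype(int)
    # Number of bits set in both fingerprints of each pair.
    intersection = bits @ bits.T
    bit_counts = bits.sum(axis=1)
    # |A or B| = |A| + |B| - |A and B|
    union = bit_counts[:, None] + bit_counts[None, :] - intersection
    # Pairs with an empty union (two all-zero fingerprints) keep the score 0.0.
    scores = np.divide(intersection, union,
                       out=np.zeros(union.shape, dtype=float),
                       where=union > 0)
    return pd.DataFrame(scores, index=inchikeys, columns=inchikeys)


def get_tanimoto_scores(fingerprints: Dict[str, Sequence[int]],
                        tanimoto_scores_file: Optional[str] = None) -> pd.DataFrame:
    """Load precalculated scores if a pickled DataFrame is given, else calculate them."""
    if tanimoto_scores_file is not None:
        return pd.read_pickle(tanimoto_scores_file)
    return calculate_tanimoto_scores(fingerprints)
```

The sqlite loading step can then take the returned DataFrame directly, which covers the third item without passing file names around.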

@niekdejonge
Collaborator Author

To do:
Check for/remove null values in the tanimoto scores.

When rewriting the function for creating tanimoto scores, one extra feature should be added. In the current similarities_AllInchikeys14_daylight2048_jaccard.npy file there are some null values (at least at row 7819), which cause problems further down the line. It should not be possible for the function creating the tanimoto matrix to output null values in the table. The problem is currently patched by removing None values when they are selected (in get_tanimoto_for_spectrum_ids in select_dat_for_training_nn.py); this patch should be removed once the check is implemented in the function creating the tanimoto matrix.
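One way such a check could look, as a sketch that assumes the scores are held in a pd.DataFrame indexed by inchikey. The function name `assert_no_null_scores` is hypothetical.

```python
import numpy as np
import pandas as pd


def assert_no_null_scores(scores: pd.DataFrame) -> None:
    """Raise a ValueError if the score table contains any None/NaN value."""
    # isnull() flags both NaN and None entries.
    null_rows = scores.index[scores.isnull().any(axis=1)].tolist()
    if null_rows:
        raise ValueError(
            f"Tanimoto scores contain null values in {len(null_rows)} row(s), "
            f"e.g. {null_rows[:5]}"
        )
```

Calling this right after the matrix is created would make the failure surface at creation time instead of when scores are selected later.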

@niekdejonge
Collaborator Author

This is currently not a priority. The following things discussed above could still be implemented:

  • Make providing a tanimoto score file optional. When it is not provided, MS2Query will use a function to create the tanimoto scores.
  • Double-check whether the None values are still an issue.

@niekdejonge added the "code structure" and "enhancement" labels and removed the "code structure" label on Oct 28, 2021
@niekdejonge
Collaborator Author

This was solved in #146
