Change creating sqlite file function #63
Comments
A third, easy solution is to keep the preselection integrated in make_sqlite_wrapper and simply add all tanimoto scores of the unprocessed dataset to the sqlite file. In that case, tanimoto scores may be stored for a few inchikeys that are not needed, because the corresponding spectra were removed. This would not cause any problems further downstream, but feels a bit less clean.
Decided together with Florian to change this to the following solution:
The tanimoto scores will be stored in a pd.DataFrame with the inchikeys as both columns and index (instead of the current .npy file for the tanimoto scores and .csv file for the order of the inchikeys).
The create_sqlite_wrapper function will get an optional argument for supplying a pickled pd.DataFrame with the tanimoto scores. If this file is not provided, it will automatically run a function that calculates the tanimoto scores for all provided spectra.
To do:
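The optional-argument design decided on above could look roughly like the sketch below. It is only an illustration: the helper name `load_or_calculate_tanimoto_scores`, its parameters, and the `calculate_scores` callback are hypothetical and not taken from the ms2query code. Subsetting the matrix to the requested inchikeys is also an assumption.

```python
from typing import Callable, Optional, Sequence

import pandas as pd


def load_or_calculate_tanimoto_scores(
    inchikeys: Sequence[str],
    calculate_scores: Callable[[Sequence[str]], pd.DataFrame],
    tanimoto_scores_pickle: Optional[str] = None,
) -> pd.DataFrame:
    """Return a square score matrix with inchikeys as index and columns.

    Hypothetical helper; names and arguments are illustrative only.
    """
    if tanimoto_scores_pickle is not None:
        # A precomputed matrix was supplied: load it to save calculation time.
        scores = pd.read_pickle(tanimoto_scores_pickle)
    else:
        # No file supplied: fall back to calculating the scores.
        scores = calculate_scores(inchikeys)

    keys = list(inchikeys)
    missing = set(keys) - set(scores.index) | set(keys) - set(scores.columns)
    if missing:
        raise ValueError(f"No tanimoto scores for inchikeys: {sorted(missing)}")

    # Keep only the inchikeys that are actually needed (assumed behaviour).
    return scores.loc[keys, keys]
```

With a pickled matrix that covers more inchikeys than needed, the extra rows and columns are dropped instead of being written to the sqlite file.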
To do: when rewriting the function that creates the tanimoto scores, one extra feature should be added. The current similarities_AllInchikeys14_daylight2048_jaccard.npy file contains some null values (at least in row 7819), which cause problems further down the line. The function creating the tanimoto matrix should never output null values. The problem is currently patched by removing None values when they are selected (in select_dat_for_training_nn.py, get_tanimoto_for_spectrum_ids); this patch should be removed once the check is implemented in the creation of the tanimoto matrix.
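One way to enforce this is a check that runs right after the matrix is created and fails loudly instead of passing null values on. This is a minimal sketch, assuming the scores are held in a pd.DataFrame; the function name `assert_no_missing_scores` is hypothetical.

```python
import pandas as pd


def assert_no_missing_scores(scores: pd.DataFrame) -> None:
    """Raise ValueError if the score matrix contains NaN or None values.

    Hypothetical check; not part of the existing code.
    """
    missing = scores.isna()  # True where a value is NaN or None
    if missing.to_numpy().any():
        # Report which rows are affected so the cause can be traced.
        bad_rows = scores.index[missing.any(axis=1).to_numpy()].tolist()
        raise ValueError(
            f"Tanimoto matrix has missing values in {len(bad_rows)} row(s), "
            f"for example: {bad_rows[:5]}"
        )
```

Failing at creation time means the None filtering downstream would no longer be needed.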
This is currently not a priority. The following things discussed above could still be implemented:
This was solved in #146
Currently, 3 files are needed to create an sqlite file:
It might be better to change this so that only a pickled file with spectra is needed, by calculating the tanimoto scores inside this function.
This would make it easy to integrate filtering/processing of spectra in this function and would make the function easier to understand.
The downside is that calculating tanimoto scores takes quite a long time.
Currently, tanimoto scores are often already calculated for new datasets, but I am not sure whether they are used outside ms2query. If they are not used for other applications, we can integrate calculating the tanimoto scores into the create sqlite database function, to make the code easier to understand and less susceptible to errors.
If the tanimoto scores are always calculated beforehand, we could keep the same structure (to save loading time), but in that case the filtering/processing of the spectra should be done before calculating the tanimoto scores and before loading into create_sqlite_wrapper.
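The ordering argument above (filter first, then calculate scores) can be sketched as follows. This is an illustration under assumptions: spectra are represented as plain dicts with an "inchikey" entry rather than real spectrum objects, the function name is hypothetical, and keying on the first 14 characters of the inchikey is inferred from the "Inchikeys14" part of the existing file name.

```python
from typing import Callable, Dict, Iterable, List

# Stand-in for a real spectrum object (assumption for this sketch).
Spectrum = Dict[str, object]


def unique_inchikeys_after_filtering(
    spectra: Iterable[Spectrum],
    keep: Callable[[Spectrum], bool],
) -> List[str]:
    """Filter spectra first, then collect the inchikeys that still occur.

    Scores only need to be calculated for the returned inchikeys, so no
    scores are stored for spectra that were removed.
    """
    kept = [spectrum for spectrum in spectra if keep(spectrum)]
    # Use the first 14 characters of each inchikey (assumed key format).
    return sorted({str(spectrum["inchikey"])[:14] for spectrum in kept})
```

Calculating the scores only for this list avoids the leftover inchikeys that appear when scores are taken from the unprocessed dataset.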