Also add QM energies with 'default' OpenFF compute spec? #39

Closed
jchodera opened this issue Jul 31, 2022 · 12 comments
Labels
question Further information is requested

Comments

@jchodera
Member

jchodera commented Jul 31, 2022

@peastman: I just realized that the dataset we generated only used the QM level of theory used for OpenMM SPICE, which would mean the data is not useful to the OpenFF folks because it is not compatible with the default OpenFF compute spec (B3LYP-D3BJ/DZVP). We included both levels of theory for this recent RNA dataset so the dataset would be compatible with both OpenMM SPICE and OpenFF datasets, and it looks like the OpenFF level of theory is much less expensive.

Would it be OK to have @pavankum add the OpenFF compute spec to the SPICE QCArchive dataset so we end up with both sets of QM data on QCArchive? We can still primarily distribute the more expensive QM data in our HDF5 distributions, but having both would enable multiple applications:

  • OpenFF could use SPICE as a test set, or to expand its coverage of chemical space
  • We could experiment with cases where we only use the expensive level of theory on a limited subdomain
  • Multitask models that predict both levels of theory may have other advantages
  • We may ultimately find the OpenFF level of theory sufficient for our purposes when expanding to other parts of chemical space
jchodera added the "question" (Further information is requested) label Jul 31, 2022
@peastman
Member

peastman commented Aug 1, 2022

It's fine if you want to compute the same conformations at a lower level of theory. But let's be careful about calling the result "SPICE". We don't want to do anything that might create confusion, or lead someone to get the low accuracy results thinking they're getting the high accuracy ones.

At some point you might want to consider updating to a better level of theory for OpenFF. B3LYP is pretty dated at this point. There are newer functionals that provide better accuracy at the same cost.

@tmarkland
Member

There is probably a good naming convention one could use for SPICE configurations computed at a different level of theory (maybe OpenFF has already adopted a particular one), e.g. SPICE(B3LYP-D3BJ/DZVP) or SPICE@B3LYP-D3BJ/DZVP. Plain "SPICE" would refer to the current level of theory, and the names with brackets or @ would denote the same configurations computed a different way.

@peastman
Member

peastman commented Aug 1, 2022

The risk with that is that someone would see a reference to it somewhere and come away thinking, "SPICE uses a cheap, inaccurate level of theory." It would have a high risk of causing confusion.

@jchodera
Member Author

jchodera commented Aug 2, 2022

I definitely agree we want to avoid confusion!

We can give the other levels of theory a less prominent role in the manuscript (or even name them SPICE-lite, etc), and control what we put in the HDF5 files we make available for download and how we name them, which will be the primary way people interact with the dataset.

If they access it through the QCPortal, they will see there are multiple levels of theory attached---it would be impossible for them to conclude there is only one low level of theory present.

Practically, it would also be a huge pain, a significant waste of space, and rather awkward to try to correlate data between datasets if other levels of theory were generated as entirely separate groups of datasets in QCArchive.

Does this make sense? Or am I missing some other failure mode of concern?

@peastman
Member

peastman commented Aug 2, 2022

I'm not familiar with how QCArchive handles this sort of thing. If it allows a single dataset to provide multiple levels of theory for each sample, and for all of them to be enumerated through the API, that seems reasonable. As long as we can make sure the higher accuracy one is what people get by default if they don't explicitly specify a level of theory. In practice I expect very few people to access it directly through the API.

@pavankum
Collaborator

pavankum commented Aug 2, 2022

> I'm not familiar with how QCArchive handles this sort of thing. If it allows a single dataset to provide multiple levels of theory for each sample, and for all of them to be enumerated through the API, that seems reasonable. As long as we can make sure the higher accuracy one is what people get by default if they don't explicitly specify a level of theory. In practice I expect very few people to access it directly through the API.

Yeah, access is through explicit specification of the theory level, as in the corresponding line of the downloader script. We can completely avoid mentioning other QC specs if we choose to, and whoever wants to work with the other spec can download it of their own volition.

If models from the second spec turn out to be close in accuracy to spice_default, then it would be helpful to do much larger molecules with the second spec, as John mentioned.

> In practice I expect very few people to access it directly through the API.

I agree.
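
To make the "explicit specification, high accuracy by default" behavior discussed above concrete, here is a minimal sketch in plain Python. It does not use QCPortal or the actual downloader script; the `records` layout, the spec label strings, and the `select_energies` helper are all hypothetical, and the energies are placeholders rather than real data.

```python
from typing import Dict, Optional

# Hypothetical labels for the two levels of theory named in this thread.
SPICE_SPEC = "wb97m-d3bj/def2-tzvppd"
OPENFF_SPEC = "b3lyp-d3bj/dzvp"

# Toy stand-in for a dataset whose entries carry results at several specs.
# Energies are placeholder numbers, not real calculations.
records: Dict[str, Dict[str, float]] = {
    "mol-0": {SPICE_SPEC: -1.0, OPENFF_SPEC: -0.9},
    "mol-1": {SPICE_SPEC: -2.0},  # e.g. the second-spec calculation errored
}


def select_energies(
    data: Dict[str, Dict[str, float]], spec: Optional[str] = None
) -> Dict[str, float]:
    """Return {entry: energy} for one spec.

    If no spec is given, fall back to the SPICE level of theory, so the
    lower level is only ever returned when it is requested explicitly.
    Entries without a result at the chosen spec are skipped.
    """
    chosen = SPICE_SPEC if spec is None else spec
    return {
        name: by_spec[chosen]
        for name, by_spec in data.items()
        if chosen in by_spec
    }


default_energies = select_energies(records)
openff_energies = select_energies(records, OPENFF_SPEC)
```

With this shape, a user who never mentions a spec only ever sees the higher level of theory, and entries missing a result at the requested spec are simply left out.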

@peastman
Member

peastman commented Aug 2, 2022

That sounds like a good plan.

@jchodera
Member Author

jchodera commented Aug 2, 2022

> I'm not familiar with how QCArchive handles this sort of thing. If it allows a single dataset to provide multiple levels of theory for each sample, and for all of them to be enumerated through the API, that seems reasonable. As long as we can make sure the higher accuracy one is what people get by default if they don't explicitly specify a level of theory. In practice I expect very few people to access it directly through the API.

I think our primary user group will be downloading the HDF5 files we control, or via the downloader we provide.

But QCArchive also has a great QCPortal API, and its support for bulk downloads is improving. For now, it's still a great way to explore datasets. Check out this example, which shows how to access a reaction dataset and browse which levels of theory and molecules are available.

@jchodera
Member Author

@pavankum is running this for us now!
https://github.com/openforcefield/qca-dataset-submission/pulls?q=is%3Apr+label%3Acompute-openff-spice+

It looks like essentially everything is complete (except for some errored calculations).

@jchodera
Member Author

jchodera commented Oct 11, 2022 via email

@peastman
Member

Let me emphasize once again: SPICE is computed at ωB97M-D3BJ/def2-TZVPPD. Any computations performed at any other level of theory are not SPICE. They are a different dataset that needs to have a different name and must never be referred to as "SPICE", "SPICE-lite", or anything similar. Anything else will create confusion. If the current organization of the data on QCArchive creates confusion, then the data organization needs to be fixed.

@giadefa
Member

giadefa commented Oct 11, 2022 via email


5 participants