Also add QM energies with 'default' OpenFF compute spec? #39

jchodera · 2022-07-31T05:49:58Z

@peastman: I just realized that the dataset we generated only used the QM level of theory used for OpenMM SPICE, which would mean the data is not useful to the OpenFF folks because it is not compatible with the default OpenFF compute spec (B3LYP-D3BJ/DZVP). We included both levels of theory for this recent RNA dataset so the dataset would be compatible with both OpenMM SPICE and OpenFF datasets, and it looks like the OpenFF level of theory is much less expensive.

Would it be OK to have @pavankum add the OpenFF compute spec to the SPICE QCArchive dataset so we end up with both sets of QM data on QCArchive? We can still primarily distribute the more expensive QM data in our HDF5 distributions, but having both would enable multiple applications:

- OpenFF could use SPICE as a test set, or to expand its coverage of chemical space
- We could experiment with cases where we only use the expensive level of theory on a limited subdomain
- Multitask models that predict both levels of theory may have other advantages
- We may ultimately find the OpenFF level of theory sufficient for our purposes when expanding to other parts of chemical space


peastman · 2022-08-01T00:31:37Z

It's fine if you want to compute the same conformations at a lower level of theory. But let's be careful about calling the result "SPICE". We don't want to do anything that might create confusion, or lead someone to get the low accuracy results thinking they're getting the high accuracy ones.

At some point you might want to consider updating to a better level of theory for OpenFF. B3LYP is pretty dated at this point. There are newer functionals that provide better accuracy at the same cost.

tmarkland · 2022-08-01T00:36:32Z

There is probably a good naming convention one could use for SPICE configurations but at a different level of theory (maybe OpenFF already has adopted a particular one) e.g. SPICE(B3LYP-D3BJ/DZVP) or SPICE@B3LYP-D3BJ/DZVP etc. where SPICE would refer to the current level of theory and the ones with brackets or @ would denote the same configurations computed a different way.

peastman · 2022-08-01T15:04:57Z

The risk with that is that someone would see a reference to it somewhere and come away thinking, "SPICE uses a cheap, inaccurate level of theory." It would have a high risk of causing confusion.

jchodera · 2022-08-02T15:33:27Z

I definitely agree we want to avoid confusion!

We can give the other levels of theory a less prominent role in the manuscript (or even name them SPICE-lite, etc), and control what we put in the HDF5 files we make available for download and how we name them, which will be the primary way people interact with the dataset.

If they access it through the QCPortal, they will see there are multiple levels of theory attached---it would be impossible for them to conclude there is only one low level of theory present.

Practically, it would also be a huge pain, a significant waste of space, and rather awkward to try to correlate data between datasets if other levels of theory were generated as entirely separate groups of datasets in QCArchive.

Does this make sense? Or am I missing some other failure mode of concern?

peastman · 2022-08-02T19:01:39Z

I'm not familiar with how QCArchive handles this sort of thing. If it allows a single dataset to provide multiple levels of theory for each sample, and for all of them to be enumerated through the API, that seems reasonable. As long as we can make sure the higher accuracy one is what people get by default if they don't explicitly specify a level of theory. In practice I expect very few people to access it directly through the API.

pavankum · 2022-08-02T20:30:18Z

> I'm not familiar with how QCArchive handles this sort of thing. If it allows a single dataset to provide multiple levels of theory for each sample, and for all of them to be enumerated through the API, that seems reasonable. As long as we can make sure the higher accuracy one is what people get by default if they don't explicitly specify a level of theory. In practice I expect very few people to access it directly through the API.

Yeah, access is through an explicit specification of the theory level, as in the line here in the downloader script. We can completely avoid mentioning other QC specs if we choose to, and whoever wants to work with the other spec can download it of their own volition.

If models from the second spec are much closer in accuracy to spice_default, then it would be helpful to do much larger molecules with the second spec, as John mentioned.

> In practice I expect very few people to access it directly through the API.

I agree.
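The access pattern discussed above (several levels of theory attached to one dataset, all of them enumerable, with the high-accuracy one returned unless another is requested explicitly) can be sketched roughly as follows. This is a toy in-memory stand-in, not the QCPortal API or the actual downloader script: the class `MultiSpecDataset`, its methods, and the label `openff_default` are hypothetical, and the energies are placeholder numbers. Only the name `spice_default` comes from the thread.

```python
from dataclasses import dataclass, field

# "spice_default" is the spec name used in the thread; "openff_default" is a
# hypothetical label for the B3LYP-D3BJ/DZVP spec, chosen for illustration.
SPICE_SPEC = "spice_default"
OPENFF_SPEC = "openff_default"


@dataclass
class MultiSpecDataset:
    """Toy stand-in for one dataset holding several levels of theory per entry."""

    # energies[entry_name][spec_name] -> energy (placeholder values, hartree)
    energies: dict = field(default_factory=dict)

    def add(self, entry, spec, energy):
        """Attach an energy computed at `spec` to `entry`."""
        self.energies.setdefault(entry, {})[spec] = energy

    def list_specs(self):
        """Enumerate every spec attached to at least one entry."""
        return sorted({s for per_entry in self.energies.values() for s in per_entry})

    def get_energy(self, entry, spec=SPICE_SPEC):
        """Return one energy; the high-accuracy spec is used unless another is named."""
        try:
            return self.energies[entry][spec]
        except KeyError:
            raise KeyError(f"no record for entry={entry!r} at spec={spec!r}") from None


ds = MultiSpecDataset()
ds.add("mol-0", SPICE_SPEC, -1.0)   # placeholder energy
ds.add("mol-0", OPENFF_SPEC, -2.0)  # placeholder energy

default_energy = ds.get_energy("mol-0")                   # high-accuracy spec
openff_energy = ds.get_energy("mol-0", spec=OPENFF_SPEC)  # explicit opt-in
```

The point of the default argument is that a caller has to name the cheaper spec to get it; omitting the argument cannot silently return the lower level of theory.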

peastman · 2022-08-02T20:32:28Z

That sounds like a good plan.

jchodera · 2022-08-02T20:39:17Z

> I'm not familiar with how QCArchive handles this sort of thing. If it allows a single dataset to provide multiple levels of theory for each sample, and for all of them to be enumerated through the API, that seems reasonable. As long as we can make sure the higher accuracy one is what people get by default if they don't explicitly specify a level of theory. In practice I expect very few people to access it directly through the API.

I think our primary user group will be downloading the HDF5 files we control, or via the downloader we provide.

But QCArchive has a great QCPortal API that is improving its support for bulk downloads. Even now, it's a great way to explore datasets. Check out this example, which shows how to access a reaction dataset and browse which levels of theory and molecules are available.

jchodera · 2022-09-23T15:04:11Z

@pavankum is running this for us now!
https://github.com/openforcefield/qca-dataset-submission/pulls?q=is%3Apr+label%3Acompute-openff-spice+

It looks like essentially everything is complete (except for some errored calculations).

jchodera · 2022-10-11T07:35:59Z

I definitely agree we want to avoid confusion! Practically, we can give the other levels of theory a less prominent role in the manuscript (or even name them SPICE-lite, etc), and control what we put in the HDF5 files we make available for download and how we name them, which will be the primary way people interact with the dataset. If they access it through the QCPortal, they will see there are multiple levels of theory attached---it would be impossible for them to conclude there is only one low level of theory present. Practically, it would also be a huge pain, a big waste of space, and very awkward to try to correlate data between datasets if other levels of theory were generated as entirely separate groups of datasets in QCArchive. Does this make sense? Or am I missing some other failure mode of concern?

peastman · 2022-10-11T15:42:21Z

Let me emphasize once again: SPICE is computed at ωB97M-D3BJ/def2-TZVPPD. Any computations performed at any other level of theory are not SPICE. They are a different dataset that needs to have a different name and must never be referred to as "SPICE", "SPICE-lite", or anything similar. Anything else will create confusion. If the current organization of the data on QCArchive creates confusion, then the data organization needs to be fixed.
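As a rough illustration of the naming rule stated above, an export step could refuse to attach the SPICE name to anything computed at another level of theory. This is only a sketch: the function names `is_spice_level` and `export_name` and the example alternative name are hypothetical, not part of any existing downloader. The method and basis are the ones given in the comment above.

```python
# Level of theory stated in the thread for SPICE.
SPICE_LEVEL = ("ωB97M-D3BJ", "def2-TZVPPD")


def is_spice_level(method, basis):
    """True only if (method, basis) matches the SPICE level of theory."""
    wanted = tuple(part.casefold() for part in SPICE_LEVEL)
    return (method.casefold(), basis.casefold()) == wanted


def export_name(method, basis, other_name):
    """Pick the name under which data at (method, basis) may be distributed.

    Returns "SPICE" for the SPICE level of theory. For any other level,
    returns `other_name`, rejecting names that contain "spice".
    """
    if is_spice_level(method, basis):
        return "SPICE"
    if "spice" in other_name.casefold():
        raise ValueError(
            f"{method}/{basis} is not the SPICE level of theory; "
            f"{other_name!r} must not reuse the SPICE name"
        )
    return other_name


# Hypothetical name for the lower-level data, for illustration only.
name = export_name("B3LYP-D3BJ", "DZVP", "openff-default-conformations")
```

A check like this catches "SPICE-lite"-style names at export time rather than relying on everyone remembering the convention.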

giadefa · 2022-10-11T17:56:06Z

There is a similar situation with MD17. It has been computed at two different levels of theory and it is often confusing in papers at what level a benchmark is done.


jchodera added the "question" label Jul 31, 2022

jchodera closed this as completed Sep 23, 2022

peastman mentioned this issue Sep 28, 2022

downloader.py sets method and basis non-deterministically #44

Closed

peastman mentioned this issue Oct 6, 2022

openff-default spec downloader #51

Closed
