Cannot reproduce wB97M-D3(BJ) energies and forces #50

Closed · pablo-unzueta opened this issue Oct 5, 2022 · 14 comments

@pablo-unzueta commented Oct 5, 2022

Hello OpenMM team,

I'm trying to reproduce some of the energies in the dataset.

Input file used:

memory 1000 mb

molecule h2o {
  O        0.07854020       0.65225506      -0.06068829
  H       -0.31700531       0.93845546       1.56071150
  H       -0.78358060      -0.85430729      -0.76194274
}


set basis def2-TZVPPD
energy('WB97M-D3BJ')
gradient('WB97M-D3BJ')

Energy from Calculation: -76.1624479876510065
Energy from SPICE.hdf5: -76.42840576171875

One caveat is that I am using Psi4-1.6.1 and not 1.4.1. But I wouldn't expect a 0.3ish hartree difference.

Pulled from the SPICE.hdf5 file obtained with downloader.py and written out as an ASE-style xyz file:

3
Properties=species:S:1:pos:R:3:forces:R:3 energy=-76.42840576171875 pbc="F F F"
O        0.07854020       0.65225506      -0.06068829      -0.02045737       0.02555173       0.10522837
H       -0.31700531       0.93845546       1.56071150       0.02771311      -0.00678600      -0.08722682
H       -0.78358060      -0.85430729      -0.76194274      -0.00725587      -0.01876622      -0.01800219

I found a previous issue about wB97M-D3(BJ) in Psi4 1.4.1; I don't know if it is relevant here: psi4/psi4#2279

I'm currently installing Psi4 version 1.4.1 to rerun my test.

Can you point me in any other directions? Thanks

@pablo-unzueta (Author)

Just a follow-up: I tried the same calculation with version 1.4.1 and the energy was -76.1624480295512996

@pavankum (Collaborator) commented Oct 6, 2022

Hi @pablo-unzueta, thank you for raising the issue. We did use wcombine=False in our calculations, as you noted in the Psi4 issue; if you add set wcombine false to your input you get -76.1624479879508129 Ha, so there is not much difference from your number in this case.

The main reason for the mismatch in energies is that the coordinates were not converted from Bohr to Angstrom; the conformations in SPICE.hdf5 are stored in Bohr. If I convert the coordinates you posted to Angstrom, I get -76.4731342821588811 Ha, which is the same as the value in SPICE.hdf5.
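
As a minimal sketch (not code from the repo), this is one way to read a conformation and do the Bohr-to-Angstrom conversion before building the Psi4 input. The group key 'water' and the element map are assumptions here, so adjust them to the keys actually present in your file:

import h5py

BOHR_TO_ANGSTROM = 0.52917721067   # any standard value of the conversion factor is fine here
SYMBOLS = {1: 'H', 8: 'O'}         # extend for other elements as needed

with h5py.File('SPICE.hdf5', 'r') as f:
    group = f['water']                         # placeholder group name
    numbers = group['atomic_numbers'][:]       # atomic numbers of the atoms
    conformation = group['conformations'][0]   # first conformation, stored in bohr
    coords = conformation * BOHR_TO_ANGSTROM   # convert to angstrom for the Psi4 input

print('molecule h2o {')
for z, (x, y, z2) in zip(numbers, coords):
    print(f'  {SYMBOLS[int(z)]}  {x:14.8f} {y:14.8f} {z2:14.8f}')
print('}')

Feeding the converted geometry into the input you posted (plus set wcombine false) should reproduce the stored energy.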

Also, for water the energies (in hartree) I see in the hdf5 file are

[-76.47314,
 -76.47893,
 -76.47752,
 -76.47264,
 -76.474144,
 -76.47805,
 -76.478226,
 -76.4784,
 -76.46969,
 -76.47316,
 -76.47886,
 -76.473694,
 -76.47529,
 -76.47065,
 -76.474556,
 -76.47413,
 -76.47509,
 -76.475624,
 -76.4757,
 -76.476036,
 -76.478455,
 -76.47198,
 -76.476944,
 -76.47644,
 -76.477036,
 -76.47788,
 -76.4784,
 -76.477135,
 -76.47543,
 -76.478546,
 -76.478836,
 -76.47762,
 -76.47699,
 -76.47901,
 -76.47935,
 -76.47802,
 -76.47666,
 -76.47942,
 -76.47783,
 -76.47891,
 -76.47949,
 -76.478386,
 -76.4763,
 -76.47962,
 -76.47897,
 -76.47839,
 -76.478004,
 -76.478546,
 -76.47865,
 -76.47522]

I couldn't find -76.42840576171875. Could you please share the code you used to access it? I might be missing something here.

@pablo-unzueta (Author) commented Oct 12, 2022

Hi @pavankum,

I started working on this again today. I think the issue comes from the downloader.py script. When I use it, the energies I get for water (before writing them out as ASE files) are the following:

[-76.428406 -76.43536 -76.434044 -76.42908 -76.42975 -76.43357 -76.433205 -76.433464 -76.42748 -76.42969 -76.43557 -76.4307 -76.43093 -76.42836 -76.43234 -76.42874 -76.432236 -76.432465 -76.4319 -76.433235 -76.43428 -76.42969 -76.4324 -76.433914 -76.433334 -76.43449 -76.43445 -76.43406 -76.43279 -76.434616 -76.43502 -76.434425 -76.434296 -76.43554 -76.4353 -76.43447 -76.433815 -76.435585 -76.43474 -76.434906 -76.43576 -76.435234 -76.432976 -76.4358 -76.43504 -76.43486 -76.43459 -76.43516 -76.434944 -76.43096 ]

However, if I download the SPICE.hdf5 file directly from the releases page, I reproduce what you posted:
[-76.47314 -76.47893 -76.47752 -76.47264 -76.474144 -76.47805 -76.478226 -76.4784 -76.46969 -76.47316 -76.47886 -76.473694 -76.47529 -76.47065 -76.474556 -76.47413 -76.47509 -76.475624 -76.4757 -76.476036 -76.478455 -76.47198 -76.476944 -76.47644 -76.477036 -76.47788 -76.4784 -76.477135 -76.47543 -76.478546 -76.478836 -76.47762 -76.47699 -76.47901 -76.47935 -76.47802 -76.47666 -76.47942 -76.47783 -76.47891 -76.47949 -76.478386 -76.4763 -76.47962 -76.47897 -76.47839 -76.478004 -76.478546 -76.47865 -76.47522 ]

Attached is the downloader.py script I used. Please let me know if I goofed something :)

from qcportal import FractalClient
from collections import defaultdict
from rdkit import Chem
import numpy as np
import h5py
import yaml

# Units for a variety of fields that can be downloaded.

units = {'dft_total_energy': 'hartree',
         'dft_total_gradient': 'hartree/bohr',
         'mbis_charges': 'elementary_charge',
         'mbis_dipoles': 'elementary_charge*bohr',
         'mbis_quadrupoles': 'elementary_charge*bohr^2',
         'mbis_octupoles': 'elementary_charge*bohr^3',
         'scf_dipole': 'elementary_charge*bohr',
         'scf_quadrupole': 'elementary_charge*bohr^2'}

# Reference energies computed with Psi4 1.5 wB97M-D3BJ/def2-TZVPPD.

atom_energy = {'Br': {-1: -2574.2451510945853, 0: -2574.1167240829964},
               'C': {-1: -37.91424135791358, 0: -37.87264507233593, 1: -37.45349214963933},
               'Ca': {2: -676.9528465198214},
               'Cl': {-1: -460.3350243496703, 0: -460.1988762285739},
               'F': {-1: -99.91298732343974, 0: -99.78611622985483},
               'H': {-1: -0.5027370838721259, 0: -0.4987605100487531, 1: 0.0},
               'I': {-1: -297.8813829975981, 0: -297.76228914445625},
               'K': {1: -599.8025677513111},
               'Li': {1: -7.285254714046546},
               'Mg': {2: -199.2688420040449},
               'N': {-1: -54.602291095426494, 0: -54.62327513368922, 1: -54.08594142587869},
               'Na': {1: -162.11366478783253},
               'O': {-1: -75.17101657391741, 0: -75.11317840410095, 1: -74.60241514396725},
               'P': {0: -341.3059197024934, 1: -340.9258392474849},
               'S': {-1: -398.2405387031612, 0: -398.1599636677874, 1: -397.7746615977658}}
default_charge = {}
for symbol in atom_energy:
    energies = [(energy, charge) for charge, energy in atom_energy[symbol].items()]
    default_charge[symbol] = sorted(energies)[0][1]

# Given a SMILES string, compute the reference energy of the atoms when fully separated.
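# For water, for example, the greedy charge assignment below ends with every atom neutral,
# so the result is atom_energy['O'][0] + 2*atom_energy['H'][0], about -76.1107 hartree.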

def compute_reference_energy(smiles):
    rdmol = Chem.MolFromSmiles(smiles, sanitize=False)
    total_charge = sum(atom.GetFormalCharge() for atom in rdmol.GetAtoms())
    symbol = [atom.GetSymbol() for atom in rdmol.GetAtoms()]
    charge = [default_charge[s] for s in symbol]
    delta = np.sign(total_charge-sum(charge))
    while delta != 0:
        best_index = -1
        best_energy = None
        for i in range(len(symbol)):
            s = symbol[i]
            e = atom_energy[s]
            new_charge = charge[i]+delta
            if new_charge in e:
                if best_index == -1 or e[new_charge]-e[charge[i]] < best_energy:
                    best_index = i
                    best_energy = e[new_charge]-e[charge[i]]
        charge[best_index] += delta
        delta = np.sign(total_charge - sum(charge))
    return sum(atom_energy[s][c] for s, c in zip(symbol, charge))

# Process the configuration file and download data.
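# The expected config.yaml looks something like this (the subset name is a placeholder):
#   subsets:
#     - <QCArchive collection name>
#   values:
#     - DFT TOTAL ENERGY
#     - DFT TOTAL GRADIENT
#     - MBIS CHARGES
#   max_force: 1.0   # optional, hartree/bohr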

with open('config.yaml') as input:
    config = yaml.safe_load(input.read())
if 'max_force' in config:
    max_force = float(config['max_force'])
else:
    max_force = None
client = FractalClient()
outputfile = h5py.File('SPICE.hdf5', 'w')
for subset in config['subsets']:
    # Download the next subset.

    print('Processing', subset)
    ds = client.get_collection('Dataset', subset)
    all_molecules = ds.get_molecules()
    spec = ds.list_records().iloc[0].to_dict()
    recs = ds.get_records(method=spec['method'], basis=spec['basis'], program=spec['program'], keywords=spec['keywords'])
    recs_by_name = defaultdict(list)
    mols_by_name = defaultdict(list)
    for i in range(len(recs)):
        rec = recs.iloc[i].record
        if rec is not None and rec.status == 'COMPLETE':
            index = recs.index[i]
            name = index[:index.rfind('-')]
            recs_by_name[name].append(rec)
            mols_by_name[name].append(all_molecules.loc[index][0])

    # Add the data to the HDF5 file.

    for name in recs_by_name:
        group_recs = recs_by_name[name]
        molecules = mols_by_name[name]
        qcvars = [r.extras['qcvars'] for r in group_recs]
        smiles = molecules[0].extras['canonical_isomeric_explicit_hydrogen_mapped_smiles']
        ref_energy = compute_reference_energy(smiles)
        name = name.replace('/', '')  # Remove stereochemistry markers that h5py interprets as path separators
        group = outputfile.create_group(name)
        group.create_dataset('subset', data=[subset], dtype=h5py.string_dtype())
        group.create_dataset('smiles', data=[smiles], dtype=h5py.string_dtype())
        group.create_dataset("atomic_numbers", data=molecules[0].atomic_numbers, dtype=np.int16)
        if max_force is not None:
            force = np.array([vars['DFT TOTAL GRADIENT'] for vars in qcvars])
            samples = [i for i in range(len(molecules)) if np.max(np.abs(force[i])) <= max_force]
            molecules = [molecules[i] for i in samples]
            qcvars = [qcvars[i] for i in samples]
        ds = group.create_dataset('conformations', data=np.array([m.geometry for m in molecules]), dtype=np.float32)
        ds.attrs['units'] = 'bohr'
        ds = group.create_dataset('formation_energy', data=np.array([vars['DFT TOTAL ENERGY']-ref_energy for vars in qcvars]), dtype=np.float32)
        ds.attrs['units'] = 'hartree'
        for value in config['values']:
            key = value.lower().replace(' ', '_')
            try:
                ds = group.create_dataset(key, data=np.array([vars[value] for vars in qcvars]), dtype=np.float32)
                if key in units:
                    ds.attrs['units'] = units[key]
            except KeyError:
                pass

@peastman (Member)

I think you're running into #44. Make sure you have the latest version of the downloader script.
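
For context, the script posted above takes ds.list_records().iloc[0] as the record specification, which stops being unique once a second level of theory lives in the same dataset. A rough sketch of the kind of workaround, not necessarily the exact code in the updated downloader, is to filter for the intended method and basis explicitly:

# Rough sketch, not the actual fixed downloader code: select the wB97M-D3BJ /
# def2-TZVPPD specification explicitly instead of taking the first listed one.
specs = ds.list_records()
match = specs[(specs['method'].str.lower() == 'wb97m-d3bj') &
              (specs['basis'].str.lower() == 'def2-tzvppd')]
spec = match.iloc[0].to_dict()
recs = ds.get_records(method=spec['method'], basis=spec['basis'],
                      program=spec['program'], keywords=spec['keywords'])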

We're going to be suffering from that error for a long time. :( In retrospect, it was a big mistake to put the lower level of theory into the same dataset on QCArchive. Is there a way we can move it out or delete it? Working around it in the downloader is clearly insufficient, because there are still people with the old version of the script on their computers.

@pablo-unzueta (Author)

Thanks everyone! I'll close this issue for now.

@pavankum (Collaborator)

@peastman the next release of qcfractal (expected late December) may allow data deletion within a set. Right now I think the dataset collections are immutable, or we would need Ben's help to modify the database at the root level.

@peastman (Member)

We can't afford to wait until late December. People are downloading it right now. Can we escalate this with the QCArchive team?

@dotsdl commented Oct 12, 2022

@peastman putting the lower level of theory into the same dataset wasn't the wrong move here; though it is possible to create a new dataset for each additional level of theory / compute spec you want to run, it quickly becomes even more of a zoo to navigate.

Can we quantify how bad a problem this is first? The new release of this repo with the downloader update has only been out two weeks, so it's not clear to me that performing surgery on the database it's targeting is warranted.

Also: we'll likely need another update to the downloader script once the next release of QCFractal is deployed as the public QCArchive, scheduled for December, as @pavankum mentioned. The client interface will feature ease-of-use improvements and better performance that the downloader will benefit from.

Perhaps instead of a downloader script, an archival service such as Zenodo or OSF is a better approach for versioning and dataset stability? That would also mint DOIs and satisfy journal requirements. QCArchive, despite its name, is currently more of a compute engine with attached storage, and versioned exports are more appropriate for archival.

@peastman (Member)

> Can we quantify how bad a problem this is first?

About as bad as it gets. It means that at this moment there may be people running the downloader and getting corrupt versions of the dataset. And there's no way for them to tell it's corrupt, and those corrupt dataset files may sit around on their computers for years and get used to train models that they publish years from now. So very, very bad.

> Perhaps instead of a downloader script, an archival service such as Zenodo or OSF is a better approach for versioning and dataset stability?

It's too late for any change like that to fix the problem. The downloader script is out there. People have already cloned the original version of the repository. They've read the preprint, which instructs them to use the downloader. So any other distribution mechanism we create in the future won't prevent people from running the downloader that's already out there and getting wrong results. The only possible fix is to make sure that the original version of the downloader produces correct results (as it did when it was originally written, before the alternate data was added to QCArchive).

@dotsdl commented Oct 12, 2022

@peastman I'm talking with @bennybp on what might be possible here. Will update on what we conclude.

@dotsdl commented Oct 12, 2022

@peastman I sent a DM to both you and @jchodera in the OpenMM slack to set up a call with myself and @bennybp to resolve the archival question, which impacts this issue here. Can you respond yes/no and I'll go ahead with scheduling?

@jchodera (Member)

> About as bad as it gets. It means that at this moment there may be people running the downloader and getting corrupt versions of the dataset. And there's no way for them to tell it's corrupt, and those corrupt dataset files may sit around on their computers for years and get used to train models that they publish years from now. So very, very bad.

Can you help me understand the degree of concern, @peastman?

There was a bug in the unpublished SPICE 1.0 and 1.1 downloader scripts (but not in the downloadable artifact that we generated) that resulted in some download attempts retrieving a different level of theory. The 1.1.1 version referenced in the manuscript fixed the nondeterminism bug a week after the manuscript was posted to arXiv.

The repo has only been forked twice, by @peastman and @pavankum, so we don't even need to worry about forks that are out of sync.

In order to be impacted, someone would have to have

  • decided to use the downloader instead of the prepared archive
  • cloned an early version with the bug within the first week of the manuscript being posted to arXiv
  • never looked at the repo issues or releases page after 28 Sep 2022
  • never subsequently pulled the latest version
  • provided we update the arXiv manuscript (and any revision) to note the SPICE 1.1.1 version of the downloader (and potentially the alternative dataset), never looked at the publication or the latest version of the arXiv manuscript to cite it before publishing themselves

It's hard for me to understand how there can be that large a number of people impacted by this issue, or how it could result in erroneous publications given how unlikely it is for someone to make it all the way to publication with the wrong version of the dataset.

We can certainly tweet about the fix to reach the same audience that saw the original tweet if this is a concern, which should further minimize the issue. And if QCPortal ends up being able to make the default choice deterministic, that problem will end up fully closed.

People are using OpenMM with known bugs in it right now. But we don't try to erase historical versions of OpenMM just because there were bugs in it. We do our best to disseminate the bugfix so that our user community knows about it.

@peastman (Member)

> decided to use the downloader instead of the prepared archive

Correct. But that's not an unusual thing, given that a lot of data fields are only available through the downloader.

> cloned an early version with the bug within the first week of the manuscript being posted to arXiv

The time window is much larger than that. The 1.0 release (which included the downloader script) was created on July 12. The first data at the lower level of theory was added on Aug. 21. The 1.1.1 workaround was not added until Sept. 28. That's 2.5 months during which someone could have cloned the repository and gotten a version without the workaround. You're assuming the preprint is the only way someone could have learned about it. But given how much we've all been talking about this project and how many people we've pointed to the repository, that isn't a safe assumption.

So this affects anyone who cloned the repository on or before Sept. 28, and ran (or will run in the future) the downloader script any time after Aug. 21.

> never looked at the repo issues or releases page after 28 Sep 2022

Correct. But then, why would they? When I download a piece of software or dataset I want to use, I don't usually keep checking their issue tracker afterward unless I specifically discover a problem.

> never subsequently pulled the latest version

Correct. Again, why would they? Unless they were checking back, they would have no way of knowing there was a newer version.

> It's hard for me to understand how there can be that large a number of people impacted by this issue

We definitely know of at least one person who encountered the problem, because he opened this issue. How many others are there? We have no idea. But since one person encountered it, it's likely other people have (or will) too.

> And if QCPortal ends up being able to make the default choice deterministic, that problem will end up fully closed.

That would be an acceptable solution. But again, this is urgent. People may be downloading corrupt datasets at this moment. We can't wait months for a fix.

@jchodera (Member)

> The time window is much larger than that. The 1.0 release (which included the downloader script) was created on July 12. The first data at the lower level of theory was added on Aug. 21. The 1.1.1 workaround was not added until Sept. 28. That's 2.5 months during which someone could have cloned the repository and gotten a version without the workaround. You're assuming the preprint is the only way someone could have learned about it. But given how much we've all been talking about this project and how many people we've pointed to the repository, that isn't a safe assumption.

We did very little to publicly advertise this repository until the preprint was posted, because there was insufficient information to understand what it contained until the manuscript was available. And the repo has only been cloned by 11 people in the last two weeks, after considerable public advertising.

> Correct. But then, why would they? When I download a piece of software or dataset I want to use, I don't usually keep checking their issue tracker afterward unless I specifically discover a problem.

To find the citation associated with the dataset. Citation information was only added on Sep 26, just two days before the 1.1.1 fix was released. To publish with the incorrect results, someone would have had to discover the repo, clone it prior to Sep 28, and access the README during the 48 hours when the citation was posted but the updated release had not yet been made. Given the cloning rate of 0.8 clones/day even after the manuscript was posted, this seems extremely unlikely.

What if we publicly announce the 1.1.1 bugfix on Twitter (from the OpenMM account, with @giadefa and myself retweeting) and in the README? That seems like it would reach nearly all of the likely audience here.
