
ASE databases incompatible with current fine-tuning tutorial #629

Closed · gunnarpsu opened this issue Feb 26, 2024 · 10 comments

@gunnarpsu
I’ve recently been using the OCP framework to perform ethylene adsorption calculations and was able to get GemNet-OC (OC20+OC22) to predict the adsorption energies within a reasonable margin of error, but I would now like to fine-tune the model for these systems. I worked through the oxides fine-tuning tutorial in the OCP tutorials repo to make sure everything was in working order before continuing, but got the following error in the training output once training began with main.py:

2024-02-26 08:10:49 (INFO): Loading dataset: lmdb
Traceback (most recent call last):
  File "C:\Users\gls5443\Desktop\ocp-main\main.py", line 92, in <module>
    Runner()(config)
  File "C:\Users\gls5443\Desktop\ocp-main\main.py", line 30, in __call__
    with new_trainer_context(args=args, config=config) as ctx:
  File "c:\Users\gls5443\AppData\Local\miniconda3\envs\ocp_new1\lib\contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\common\utils.py", line 1012, in new_trainer_context
    trainer = trainer_cls(
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\trainers\ocp_trainer.py", line 92, in __init__
    super().__init__(
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\trainers\base_trainer.py", line 176, in __init__
    self.load()
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\trainers\base_trainer.py", line 194, in load
    self.load_datasets()
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\trainers\base_trainer.py", line 279, in load_datasets
    self.train_dataset = registry.get_dataset_class(
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\datasets\lmdb_dataset.py", line 89, in __init__
    self.env = self.connect_db(self.path)
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\datasets\lmdb_dataset.py", line 167, in connect_db
    env = lmdb.open(
lmdb.InvalidError: train.db: MDB_INVALID: File is not an LMDB file

I followed the traceback and think I’ve narrowed it down to ocpmodels\trainers\base_trainer.py, which does not recognize ASE databases and instead defaults to LMDB (lines 279-280: `self.train_dataset = registry.get_dataset_class(self.config["dataset"].get("format", "lmdb"))`). Could this be an artifact of the “new” config.yml format that I occasionally see log messages about converting to?
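For anyone following along, the lookup in question can be sketched roughly like this (a minimal stand-in for `registry.get_dataset_class`, with illustrative class names, not the actual OCP registry code):

```python
# Minimal sketch of the dataset-format lookup in base_trainer.py.
# The mapping and class names are illustrative placeholders.
DATASET_CLASSES = {"lmdb": "LmdbDataset", "ase_db": "AseDBDataset"}

def get_dataset_class(dataset_config):
    # If "format" is missing, this silently falls back to "lmdb",
    # which is why an ASE .db file ends up in lmdb.open() and fails
    # with MDB_INVALID rather than raising a clearer config error.
    fmt = dataset_config.get("format", "lmdb")
    return DATASET_CLASSES[fmt]

print(get_dataset_class({"src": "train.db"}))                      # falls back to lmdb
print(get_dataset_class({"src": "train.db", "format": "ase_db"}))  # explicit ase_db
```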

Are you aware of a currently working version of the fine-tuning example code that supports ASE databases, or alternatively, is there example code for fine-tuning with LMDB databases? All the LMDB code I could find revolves around training from scratch on OC’s datasets rather than continuing from a pre-trained checkpoint.

For reference, the following packages are currently being used:

numba: 0.58.1
numpy: 1.26.3
ase: 3.22.1
e3nn: 0.4.4
pymatgen: 2023.5.10
torch: 1.13.0+cu116
torch.version.cuda: 11.6
torch.cuda: is_available: True
  __CUDNN VERSION: 8302
  __Number CUDA Devices: 1
  __CUDA Device Name: NVIDIA GeForce RTX 4070
  __CUDA Device Total Memory [GB]: 12.878086144
torch geometric: 2.4.0
Platform: Windows-10-10.0.22631-SP0
  Processor: AMD64 Family 25 Model 33 Stepping 2, AuthenticAMD
  Virtual memory: svmem(total=68641931264, available=54806425600, percent=20.2, used=13835505664, free=54806425600)
  Swap memory: sswap(total=4294967296, used=0, free=4294967296, percent=0.0, sin=0, sout=0)
  Disk usage: sdiskusage(total=1999619747840, used=257166356480, free=1742453391360, percent=12.9)
@emsunshine
Contributor

I think you are correct that this was an oversight when converting to the new trainer/configs. The new location for dataset format makes more sense but is not backwards compatible. You should be able to get around this error by adding "format":"ase_db" to the dataset config.
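Concretely, that means adding one line per split in the dataset config; a minimal sketch (paths are placeholders from the example above, and the rest of the config is omitted):

```yaml
dataset:
  train:
    format: ase_db   # without this line, the trainer falls back to lmdb
    src: train.db
  val:
    format: ase_db
    src: val.db
```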

emsunshine added a commit that referenced this issue Feb 26, 2024
@gunnarpsu
Author

That did the trick regarding that part - thank you!
However, I now receive the following error in the output after the model loads:

2024-02-26 09:22:56 (INFO): Loading dataset: ase_db
2024-02-26 09:22:56 (INFO): Batch balancing is disabled for single GPU training.
2024-02-26 09:22:56 (INFO): Batch balancing is disabled for single GPU training.
2024-02-26 09:22:56 (INFO): Batch balancing is disabled for single GPU training.
2024-02-26 09:22:56 (INFO): Loading model: gemnet_oc
C:\Users\gls5443\Desktop\ocp-main\ocpmodels\datasets\ase_datasets.py:108: UserWarning: Supplied sid is not numeric (or missing). Using dataset indices instead.
  warnings.warn(
2024-02-26 09:22:59 (INFO): Loaded GemNetOC with 38864438 parameters.
2024-02-26 09:22:59 (WARNING): Model gradient logging to tensorboard not yet supported.
2024-02-26 09:22:59 (WARNING): Using `weight_decay` from `optim` instead of `optim.optimizer_params`.Please update your config to use `optim.optimizer_params.weight_decay`.`optim.weight_decay` will soon be deprecated.
2024-02-26 09:23:00 (INFO): Loading checkpoint from: gnoc_oc22_oc20_all_s2ef.pt
C:\Users\gls5443\Desktop\ocp-main\ocpmodels\datasets\ase_datasets.py:108: UserWarning: Supplied sid is not numeric (or missing). Using dataset indices instead.
  warnings.warn(
C:\Users\gls5443\Desktop\ocp-main\ocpmodels\datasets\ase_datasets.py:108: UserWarning: Supplied sid is not numeric (or missing). Using dataset indices instead.
  warnings.warn(
Traceback (most recent call last):
  File "C:\Users\gls5443\Desktop\ocp-main\main.py", line 92, in <module>
    Runner()(config)
  File "C:\Users\gls5443\Desktop\ocp-main\main.py", line 36, in __call__
    self.task.run()
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\tasks\task.py", line 51, in run
    self.trainer.train(
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\trainers\ocp_trainer.py", line 158, in train
    loss = self._compute_loss(out, batch)
  File "C:\Users\gls5443\Desktop\ocp-main\ocpmodels\trainers\ocp_trainer.py", line 317, in _compute_loss
    target = batch[target_name]
  File "c:\Users\gls5443\AppData\Local\miniconda3\envs\ocp_new1\lib\site-packages\torch_geometric\data\batch.py", line 175, in __getitem__
    return super().__getitem__(idx)
  File "c:\Users\gls5443\AppData\Local\miniconda3\envs\ocp_new1\lib\site-packages\torch_geometric\data\data.py", line 498, in __getitem__
    return self._store[key]
  File "c:\Users\gls5443\AppData\Local\miniconda3\envs\ocp_new1\lib\site-packages\torch_geometric\data\storage.py", line 111, in __getitem__
    return self._mapping[key]
KeyError: 'energy'

Are there additional tags I need to supply to the config for it to parse the databases?

@emsunshine
Contributor

Thanks for flagging this. The new trainer renamed the targets from y and force to energy and forces, respectively, but the ASE datasets were not updated to reflect this. Until the datasets are updated, you should be able to work around it by adding the following to the dataset config:

key_mapping:
    y: energy
    force: forces

Referencing these lines from the new example config
https://github.com/Open-Catalyst-Project/ocp/blob/394e9bad7780a05d3371f52550c1f92c47a61ce3/configs/ocp_example.yml#L20
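The effect of `key_mapping` can be sketched as a simple rename pass over each sample's target keys (an illustrative sketch, not the actual dataset implementation):

```python
def apply_key_mapping(sample, key_mapping):
    # Rename target keys (e.g. y -> energy, force -> forces) so the
    # new trainer finds the names it expects on each data object;
    # keys not listed in the mapping pass through unchanged.
    return {key_mapping.get(k, k): v for k, v in sample.items()}

sample = {"y": -1.23, "force": [[0.0, 0.1, 0.0]], "pos": [[0.0, 0.0, 0.0]]}
mapped = apply_key_mapping(sample, {"y": "energy", "force": "forces"})
print(sorted(mapped))  # ['energy', 'forces', 'pos']
```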

@gunnarpsu
Author

Unfortunately, it still throws that error. For reference, here is the config.yml currently in use:

amp: true
checkpoint: ./gnoc_oc22_oc20_all_s2ef.pt
dataset:
  test:
    a2g_args:
      r_energy: false
      r_forces: false
    format: ase_db
    key_mapping:
      force: forces
      y: energy
    src: test.db
  train:
    a2g_args:
      r_energy: true
      r_forces: true
    format: ase_db
    key_mapping:
      force: forces
      y: energy
    src: train.db
  val:
    a2g_args:
      r_energy: true
      r_forces: true
    format: ase_db
    key_mapping:
      force: forces
      y: energy
    src: val.db
eval_metrics:
  metrics:
    energy:
    - mae
    forces:
    - forcesx_mae
    - forcesy_mae
    - forcesz_mae
    - mae
    - cosine_similarity
    - magnitude_error
    misc:
    - energy_forces_within_threshold
  primary_metric: forces_mae
gpus: 1
loss_fns:
- energy:
    coefficient: 1
    fn: mae
- forces:
    coefficient: 1
    fn: l2mae
model:
  activation: silu
  atom_edge_interaction: true
  atom_interaction: true
  cbf:
    name: spherical_harmonics
  cutoff: 12.0
  cutoff_aeaint: 12.0
  cutoff_aint: 12.0
  cutoff_qint: 12.0
  direct_forces: true
  edge_atom_interaction: true
  emb_size_aint_in: 64
  emb_size_aint_out: 64
  emb_size_atom: 256
  emb_size_cbf: 16
  emb_size_edge: 512
  emb_size_quad_in: 32
  emb_size_quad_out: 32
  emb_size_rbf: 16
  emb_size_sbf: 32
  emb_size_trip_in: 64
  emb_size_trip_out: 64
  enforce_max_neighbors_strictly: false
  envelope:
    exponent: 5
    name: polynomial
  extensive: true
  forces_coupled: false
  max_neighbors: 30
  max_neighbors_aeaint: 20
  max_neighbors_aint: 1000
  max_neighbors_qint: 8
  name: gemnet_oc
  num_after_skip: 2
  num_atom: 3
  num_atom_emb_layers: 2
  num_before_skip: 2
  num_blocks: 4
  num_concat: 1
  num_global_out_layers: 2
  num_output_afteratom: 3
  num_radial: 128
  num_spherical: 7
  otf_graph: true
  output_init: HeOrthogonal
  qint_tags:
  - 1
  - 2
  quad_interaction: true
  rbf:
    name: gaussian
  regress_forces: true
  sbf:
    name: legendre_outer
noddp: false
optim:
  batch_size: 10
  clip_grad_norm: 10
  ema_decay: 0.999
  energy_coefficient: 1
  eval_batch_size: 10
  eval_every: 1
  factor: 0.8
  force_coefficient: 1
  load_balancing: atoms
  loss_energy: mae
  lr_initial: 0.0005
  max_epochs: 10
  mode: min
  num_workers: 2
  optimizer: AdamW
  optimizer_params:
    amsgrad: true
  patience: 3
  scheduler: ReduceLROnPlateau
  weight_decay: 0
outputs:
  energy:
    level: system
  forces:
    eval_on_free_atoms: true
    level: atom
    train_on_free_atoms: true
task:
  dataset: ase_db
trainer: forces

@mshuaibii
Collaborator

The `key_mapping` functionality has not hit main yet; it currently lives in PR #622.

@lbluque, is there an update on what's blocking this?

In the meantime, you can check out that branch if you would like to use it before we land it on main.

@lbluque
Collaborator

lbluque commented Feb 26, 2024

Nothing holding it up. This should be ready to merge, unless @mshuaibii or @emsunshine have any further suggestions.

@gunnarpsu
Author

Hello,
I now have a two-part problem: one part I fixed (though the fix may need to be committed in a future branch), and the other I am unable to solve.

First, I think that line 1018 of ocpmodels/common/utils.py may need to change from
loss_fns=config.get("loss_functions", {}),
to
loss_fns=config.get("loss_fns", {}),
as that is the only way for the configs to be read properly without throwing NotImplementedError.
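The reason this mismatch is easy to miss is that `dict.get` with a default never raises: a config written with a `loss_fns:` section simply yields the empty default when the code asks for `loss_functions`, and the failure only surfaces later. An illustrative sketch:

```python
# Config uses the key "loss_fns", as in the YAML above.
config = {"loss_fns": [{"energy": {"fn": "mae"}}]}

# The trainer-side lookup asks for a different key than the config uses,
# so it silently falls back to the empty default instead of erroring here.
wrong = config.get("loss_functions", {})
right = config.get("loss_fns", {})

print(wrong)  # {} -- downstream code then has no loss functions to build
print(right)
```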

While the branch from #622, which was recommended for using the ASE databases, does enable the first inference step, it quickly runs into a second error:

2024-02-26 14:25:48 (INFO): Loading dataset: ase_db
2024-02-26 14:25:49 (INFO): Batch balancing is disabled for single GPU training.
2024-02-26 14:25:49 (INFO): Batch balancing is disabled for single GPU training.
2024-02-26 14:25:49 (INFO): Batch balancing is disabled for single GPU training.
2024-02-26 14:25:49 (INFO): Loading model: gemnet_oc
2024-02-26 14:25:51 (INFO): Loaded GemNetOC with 38864438 parameters.
2024-02-26 14:25:51 (WARNING): Model gradient logging to tensorboard not yet supported.
2024-02-26 14:25:51 (WARNING): Using `weight_decay` from `optim` instead of `optim.optimizer_params`.Please update your config to use `optim.optimizer_params.weight_decay`.`optim.weight_decay` will soon be deprecated.
2024-02-26 14:25:52 (INFO): Loading checkpoint from: gnoc_oc22_oc20_all_s2ef.pt
2024-02-26 14:25:58 (INFO): Evaluating on val.

device 0:   0%|          | 0/3 [00:00<?, ?it/s]
device 0:  33%|███▎      | 1/3 [00:03<00:06,  3.38s/it]
device 0:  67%|██████▋   | 2/3 [00:03<00:01,  1.48s/it]
device 0: 100%|██████████| 3/3 [00:03<00:00,  1.15it/s]
device 0: 100%|██████████| 3/3 [00:04<00:00,  1.37s/it]
2024-02-26 14:26:02 (INFO): energy_mae: 2.9834, forcesx_mae: 0.0080, forcesy_mae: 0.0130, forcesz_mae: 0.0073, forces_mae: 0.0094, forces_cosine_similarity: 0.1755, forces_magnitude_error: 0.0144, energy_forces_within_threshold: 0.0000, loss: 3.0039, epoch: 0.0417
2024-02-26 14:26:02 (INFO): Predicting on test.

device 0:   0%|          | 0/3 [00:00<?, ?it/s]
device 0:  33%|███▎      | 1/3 [00:03<00:06,  3.41s/it]
device 0:  67%|██████▋   | 2/3 [00:03<00:01,  1.46s/it]
device 0: 100%|██████████| 3/3 [00:03<00:00,  1.18it/s]
device 0: 100%|██████████| 3/3 [00:04<00:00,  1.35s/it]
Traceback (most recent call last):
  File "C:\Users\gls5443\Desktop\ocp-ase_data_updates\main.py", line 92, in <module>
    Runner()(config)
  File "C:\Users\gls5443\Desktop\ocp-ase_data_updates\main.py", line 36, in __call__
    self.task.run()
  File "C:\Users\gls5443\Desktop\ocp-ase_data_updates\ocpmodels\tasks\task.py", line 51, in run
    self.trainer.train(
  File "C:\Users\gls5443\Desktop\ocp-ase_data_updates\ocpmodels\trainers\ocp_trainer.py", line 215, in train
    self.update_best(
  File "C:\Users\gls5443\Desktop\ocp-ase_data_updates\ocpmodels\trainers\base_trainer.py", line 706, in update_best
    self.predict(
  File "c:\Users\gls5443\AppData\Local\miniconda3\envs\ocp_new1\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\gls5443\Desktop\ocp-ase_data_updates\ocpmodels\trainers\ocp_trainer.py", line 528, in predict
    predictions[key] = np.array(predictions[key])
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (29,) + inhomogeneous part.
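This ValueError comes from calling `np.array` on ragged per-batch outputs: each batch covers a different number of atoms, so the collected force arrays cannot form one homogeneous ndarray (the PR that landed later includes "fix concatenating predictions" and "correctly stack predictions" commits for exactly this). A minimal sketch of the shape problem and a concatenation-based fix, assuming NumPy:

```python
import numpy as np

# Each batch predicts forces for a different number of atoms, so the
# collected per-batch (n_atoms, 3) arrays are ragged.
batch_forces = [np.zeros((3, 3)), np.zeros((5, 3))]

# np.array(batch_forces) is the kind of call that fails in predict();
# concatenating along the atom axis yields one homogeneous array instead.
stacked = np.concatenate(batch_forces, axis=0)
print(stacked.shape)  # (8, 3)
```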

@emsunshine
Contributor

I was able to get the fine-tuning tutorial working with the changes from these two PRs: Open-Catalyst-Project/tutorial#4 and #630. You can try these branches to see if they solve the problem.


This issue has been marked as stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Mar 29, 2024
github-merge-queue bot pushed a commit that referenced this issue Apr 1, 2024
* minor cleanup of lmbddatabase

* ase dataset compat for unified trainer and cleanup

* typo in docstring

* key_mapping docstring

* add stress to atoms_to_graphs.py and test

* allow adding target properties in atoms.info

* test using generic tensor property in ase_datasets

* minor docstring/comments

* handle stress in voigt notation in metadata guesser

* handle scalar generic values in a2g

* clean up ase dataset unit tests

* allow .aselmdb extensions

* fix minor bugs in lmdb database and update tests

* make connect_db staticmethod

* remove redundant methods and make some private

* allow a list of paths in AseDBdataset

* remove sprinkled print statement

* remove deprecated transform kwarg

* fix doctring typo

* rename keys function

* fix missing comma in tests

* set default r_edges in a2g in AseDatasets to false

* simple unit-test for good measure

* call _get_row directly

* [wip] allow string sids

* raise a helpful error if AseAtomsAdaptor not available

* remove db extension in filepaths

* set logger to info level when trying to read non db files, remove print

* set logging.debug to avoid saturating logs

* Update documentation for dataset config changes

This PR is intended to address #629

* Update atoms_to_graphs.py

* Update test_ase_datasets.py

* Update test_ase_datasets.py

* Update test_atoms_to_graphs.py

* Update test_atoms_to_graphs.py

* case for explicit a2g_args None values

* Update update_config()

* Update utils.py

* Update utils.py

* Update ocp_trainer.py

More helpful warning for debug mode

* Update ocp_trainer.py

* Update ocp_trainer.py

* Update TRAIN.md

* fix concatenating predictions

* check if keys exist in atoms.info

* Update test_ase_datasets.py

* use list() to cast all batch.sid/fid

* correctly stack predictions

* raise error on empty datasets

* raise ValueError instead of exception

* code cleanup

* rename get_atoms object -> get_atoms for brevity

* revert to raise keyerror when data_keys are missing

* cast tensors to list using tolist and vstack relaxation pos

* remove r_energy, r_forces, r_stress and r_data_keys from test_dataset w use_train_settings

* fix test_dataset key

* fix test_dataset key!

* revert to not setting a2g_args dataset keys

* fix debug predict logic

* support numpy 1.26

* fix numpy version

* revert write_pos

* no list casting on batch lists

* pretty logging

---------

Co-authored-by: Ethan Sunshine <93541000+emsunshine@users.noreply.github.com>
Co-authored-by: Muhammed Shuaibi <mushuaibi@gmail.com>
@lbluque
Collaborator

lbluque commented Apr 3, 2024

This should be fixed in #622. Closing.

@lbluque lbluque closed this as completed Apr 3, 2024
levineds pushed a commit that referenced this issue Jul 11, 2024
beomseok-kang pushed a commit to beomseok-kang/fairchem that referenced this issue Jan 27, 2025