
Add support for Pymatgen StructureGraph with different local_env strategies #18

Closed
wants to merge 45 commits

Conversation

Andrew-S-Rosen

@Andrew-S-Rosen Andrew-S-Rosen commented Mar 23, 2020

Currently, the crystal graph is generated using a simple constant cutoff radius, which is not necessarily ideal. It would be great if cgcnn could use Pymatgen's pymatgen.analysis.graphs.StructureGraph class to generate the crystal graph via one of its several pymatgen.analysis.local_env strategies.

This PR adds a new argument to CIFData in data.py to allow the use of Pymatgen's StructureGraph. One can now pass an nn_method argument to CIFData, which must be the name (as a string) of a valid pymatgen.analysis.local_env.NearNeighbors object -- for example, nn_method='CrystalNN'. The new nn_method argument respects the user-specified max_num_nbr and radius: the number of neighbors will never exceed max_num_nbr, and no neighbor will be farther than radius from the central atom.
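As a rough illustration of the neighbor-capping behavior described above, here is a minimal sketch. The helper name cap_neighbors and the (index, distance) pair representation are assumptions for illustration only, not the actual data.py code:

```python
# Hypothetical sketch of how max_num_nbr and radius constrain a neighbor
# list returned by a NearNeighbors strategy; not the actual data.py code.
def cap_neighbors(nbrs, max_num_nbr, radius):
    """Keep the closest max_num_nbr neighbors, none farther than radius."""
    within = [(i, d) for i, d in nbrs if d <= radius]
    within.sort(key=lambda pair: pair[1])  # sort by distance
    return within[:max_num_nbr]

nbrs = [(0, 2.1), (1, 3.5), (2, 1.2), (3, 9.0)]
print(cap_neighbors(nbrs, max_num_nbr=2, radius=8.0))  # [(2, 1.2), (0, 2.1)]
```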

This PR is built on PR #16 by @CompRhys, which allows for the removal of zero padding. That is particularly important here because different local environments will generally have different numbers of neighbors.

@CompRhys

CompRhys commented Mar 24, 2020

As the codebase already depends on Pymatgen, and Pymatgen is unlikely to go anywhere, I think these are helpful improvements that increase the functionality of the code and make great use of #16.

I am not sure whether @txie-93 actively maintains this codebase, so I think for the PR to be merged the pre-trained models would need to be re-trained -- it would also be interesting to see what effect zero-padding had on model performance. Is this feasible?

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

@CompRhys Thanks for the comment! I would be more than happy to do this, but unfortunately I'm still debugging one remaining issue with my implementation. To the best of my knowledge, everything is working correctly with the StructureGraph (it's just a different way of generating the variable all_nbrs), but when training the model, the start of each epoch takes far longer with nn_method='CrystalNN' than with the original cutoff radius approach (i.e. nn_method=None) -- I'm talking like 10x slower. When testing on ~10 CIFs, I did not notice any difference in performance and everything seemed to work appropriately, but when I scaled it up to 1,000 or 10,000 CIFs the performance was quite bad, despite the fact that the number of neighbors is much lower when using the StructureGraph. I'm a bit perplexed because the variables in data.py seem fine, and I made sure there are very few modified lines of code compared to your PR (primarily lines 353-360). If/when I figure this out, I'd be happy to investigate the effect of padding and re-train the models.

That being said, I don't think the PR will be accepted automatically -- it says only those with write access can merge it, so presumably it would only go in if @txie-93 were to approve it.

Context:
With nn_method='CrystalNN' on a set of ~14,000 CIFs:
Epoch: [0][0/45] Time 1081.625 (1081.625) Data 1078.277 (1078.277) Loss 1.1046 (1.1046) MAE 0.962 (0.962)
Epoch: [0][10/45] Time 2.203 (112.008) Data 0.000 (109.472) Loss 0.9041 (0.9997) MAE 0.903 (0.934)
Epoch: [0][20/45] Time 2.644 (59.835) Data 0.000 (57.343) Loss 0.7171 (0.9069) MAE 0.765 (0.881)
Epoch: [0][30/45] Time 23.268 (74.369) Data 21.728 (71.944) Loss 0.5786 (0.8007) MAE 0.687 (0.816)
Epoch: [0][40/45] Time 1.229 (57.123) Data 0.000 (54.981) Loss 0.6453 (0.7380) MAE 0.725 (0.778)
Validation: [0/6] Time 968.401 (968.401) Loss 0.6168 (0.6168) MAE 0.727 (0.727)

With nn_method=None on the same set:
Epoch: [0][0/45] Time 108.573 (108.573) Data 101.745 (101.745) Loss 1.0877 (1.0877) MAE 0.997 (0.997)
Epoch: [0][10/45] Time 5.779 (15.200) Data 0.000 (9.287) Loss 0.7675 (0.9516) MAE 0.844 (0.917)
Epoch: [0][20/45] Time 5.862 (10.771) Data 0.000 (4.865) Loss 0.7384 (0.8276) MAE 0.794 (0.840)
Epoch: [0][30/45] Time 6.061 (9.256) Data 0.000 (3.341) Loss 0.6094 (0.7505) MAE 0.658 (0.794)
Epoch: [0][40/45] Time 6.039 (8.456) Data 0.000 (2.531) Loss 0.4387 (0.6933) MAE 0.582 (0.757)
Validation: [0/6] Time 50.250 (50.250) Loss 0.5987 (0.5987) MAE 0.730 (0.730)

Perhaps I just need to tweak some of the settings. Hopefully that's all it is.

@CompRhys

CompRhys commented Mar 24, 2020

@arosen93 I would guess it's an issue with caching in the dataloaders? Have you run it through a profiler?
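For reference, a minimal way to time where a data-loading step spends its time is Python's built-in cProfile. The helper name profile_call is a placeholder, not part of cgcnn:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, top=10):
    """Run fn(*args) under cProfile and print the top cumulative-time entries."""
    pr = cProfile.Profile()
    result = pr.runcall(fn, *args)
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(top)
    print(buf.getvalue())
    return result

# e.g. profile one full pass over a dataloader:
# profile_call(lambda: [batch for batch in train_loader])
```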

The reason for re-training the models is that, if I were @txie-93, I wouldn't merge extensions to the code unless the PR kept the same functionality and the default model was as described in his original paper (not that zero-padding is mentioned in the paper). This is why my PR was a request for comment [RFC]: I wanted to know whether he would in principle merge the changes before re-training the models.

@Andrew-S-Rosen
Author

@CompRhys That might be it. Thanks for the suggestion. I haven't run it through a profiler (admittedly, I have not used a profiler before) but can investigate further on that.

Noted about re-training the models. Yes, it'd definitely be good to make sure things are kept the same there.

@CompRhys

@arosen93 I would expect using Pymatgen to get the neighbours to just be straight-up slower than using a fixed number of neighbours, so there may not be any issue with your code; it may just be the penalty you have to pay. If caching works correctly, you will only pay this cost once, on the first epoch, as the data sets here are typically small enough to hold everything in RAM, I would think.

(removing zero-padding probably also slows the code down vs dense multiplications but I never tested this)

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

@CompRhys -- Yes, the Pymatgen nearest-neighbor searching is definitely slower, so there's an up-front penalty for that. In my tests, though, the start of every new epoch is extremely slow -- it's not just epoch 0, sadly. I'm running the profiler right now to see if that leads to any ideas.

I did some quick tests on the removal of zero-padding that you implemented, by setting radius small enough that max_num_nbr isn't hit for all the materials. With nn_method=None, the code did not slow down by much at all without zero-padding, but I don't have hard numbers at the moment.

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

@CompRhys -- After running it through the profiler, I believe you are right. The issue is in the dataloader, specifically the call to __getitem__ in data.py, which I believe is causing a slowdown not only at the beginning of the first epoch (as would be expected) but at the beginning of every epoch, in both training and validation. I specify 28 workers to load the data, each core having 4.5 GB of memory for a total of ~128 GB. I agree it should be able to store everything, though. I will have to figure out how to fix that. Thanks for pointing me in what I think is the right direction.

@pgg1610

pgg1610 commented Mar 24, 2020

If I understand correctly, changing the method from the default get_all_neighbors to VoronoiNN causes a significant slow-down in the algorithm. However, what is interesting is that atom_fea and neighbor_fea for the training, validation, and test sets have already been created through the dataloader before the main nn.Module is called, so I was under the impression that, once collated into their respective tensors, these methods are independent of the CGCNN. Another thing to try would be the VoronoiConnectivity submodule from pymatgen.analysis.structure_analyzer, which is at the heart of the chem_env module. That might reduce function calls.

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

Hey @pgg1610! Yeah, all the Pymatgen nearest-neighbor methods are slow (compared to a simple call to get_all_neighbors), but this is a price I'd potentially be willing to pay if it's just a one-time cost before running the neural network. Even if it takes, in the extreme case, 2 s per function call (i.e. per material), you can get 10,000 structure graphs in ~12 minutes when parallelized over something like 28 workers.

Technically, the dataloader does load the data before calling the neural network, but the data stored in CIFData is not necessarily kept in memory permanently. You can see in data.py that there is a __getitem__ function, so if you have dataset = CIFData('data/...'), the information in dataset[0] or dataset[1] (and so on) is kept in memory only until the RAM is used up. Should the RAM fill up, the contents of dataset[i] have to be recalculated the next time it is indexed. I think this is what @CompRhys is referring to -- the caching of the loaded structures. I'm not positive that's the issue. Still debugging. I may look into other routes to generate the graphs via Pymatgen, but the StructureGraph tool is obviously well-suited for a CGCNN.
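For what it's worth, the per-item caching pattern described above can be sketched with a toy stand-in for CIFData (ToyDataset and its builds counter are invented for illustration; the real class builds a crystal graph in __getitem__):

```python
import functools

class ToyDataset:
    """Toy stand-in for CIFData: each item is built once per process,
    then served from an in-memory cache on subsequent accesses."""
    def __init__(self, n):
        self.n = n
        self.builds = 0  # counts how many times items were built from scratch

    def __len__(self):
        return self.n

    @functools.lru_cache(maxsize=None)  # cache lives per worker process
    def __getitem__(self, idx):
        self.builds += 1  # expensive graph construction would happen here
        return 2 * idx

ds = ToyDataset(5)
ds[3]; ds[3]; ds[3]
print(ds.builds)  # prints 1: later accesses hit the cache
```

Note that with multiple dataloader workers each process holds its own cache, so a cache that fits in one process's share of RAM may not fit when memory is divided among many workers.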

Example using nn_method='CrystalNN' (note: indexing starts at 1)
Epoch: [0][1/45] Time 1671.479 (1671.479)
Epoch: [0][2/45] Time 2.593 (837.036)
...
Epoch: [0][45/45] Time 0.688 (77.424)
Validation: [1/6] Time 1636.080 (1636.080)
Validation: [2/6] Time 0.610 (818.345)
...
Validation: [6/6] Time 0.317 (277.829)
Epoch: [1][1/45] Time 1655.360 (1655.360)
Epoch: [1][2/45] Time 10.167 (832.763)
...
Epoch: [1][45/45] Time 0.647 (77.788)
Validation: [1/6] Time 1405.139 (1405.139)
Validation: [2/6] Time 195.614 (800.377)
...
Validation: [6/6] Time 0.326 (274.190)

Example using get_all_neighbors:
Epoch: [0][1/45] Time 31.781 (31.781)
Epoch: [0][2/45] Time 9.049 (20.415)
...
Validation: [1/6] Time 22.292 (22.292)
Validation: [2/6] Time 3.009 (12.650)
...
Validation: [6/6] Time 1.642 (6.025)

Epoch: [1][1/45] Time 31.617 (31.617)
Epoch: [1][2/45] Time 10.802 (21.210)
...
Epoch: [1][45/45] Time 3.181 (7.514)
Validation: [1/6] Time 22.242 (22.242)
Validation: [2/6] Time 2.979 (12.610)
...
Validation: [6/6] Time 1.586 (6.034)

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

@CompRhys -- I'm starting to think that's along the lines of what the issue is. As you can see from the log file I posted before, even when using get_all_neighbors (and I have confirmed this with the "original" cgcnn code as well), there is a spike in the runtime at the start of each epoch when workers=28. In contrast, when using workers=0, the time for each iteration of a given epoch is roughly constant. The log files are attached below:
get_all_neighbors.txt
get_all_neighbors_1cpu.txt
nn_method.txt

The corresponding profiler results are also attached and indicate that calls to the dataloader take up a lot of the time -- more so than the forward/backward calls to the neural network in the case of nn_method='CrystalNN': profiler_results.zip.

I know this isn't your problem to fix, so I appreciate the suggestions you've provided. Hopefully I am close to finding the fix.

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 25, 2020

@CompRhys and @pgg1610:

I want to thank you again for taking the time to help me troubleshoot this. I ended up figuring it out. As @CompRhys correctly surmised from the beginning, the issue was entirely due to the crystal graphs being flushed from the in-memory cache. It turns out I did indeed have memory overload, particularly when the job is split over several workers where each worker has only a few GB of memory. This caused all the feature data to be dumped from memory and regenerated at the start of each new epoch, when all the training data is needed again.

This issue has nothing to do with my implementation of nn_method and is present even in the original code by @txie-93 (though it is likely not observed in most use-cases). I only noticed it because Pymatgen's StructureGraph methods are slower than a simple radial cutoff distance, making the problem more noticeable when it does occur. However, to the best of my understanding it has nothing to do with this PR, and everything is working well both with the removal of zero-padding in PR #16 and with the StructureGraph features here.

Should anyone run into this problem in the future, there are a few ways to tackle it. The first is simply to request more memory if possible. Another is to use a full node but fewer CPU workers, so each worker has more memory; of course, this slows down the initial file I/O fairly substantially. A third option, which I am exploring, is saving all the data returned by CIFData to a folder. Then, if the memory gets flushed, the code can read in these saved PyTorch files rather than regenerating the crystal graph from scratch. There is a file I/O cost for reading the files, but the idea is that it hopefully costs less than generating the graph data again.
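That last disk-caching idea can be sketched as follows. Plain pickle stands in here for the saved PyTorch files, and load_or_build / build_fn are hypothetical names, not the PR's actual implementation:

```python
import os
import pickle

def load_or_build(idx, cache_dir, build_fn):
    """Return item idx from the on-disk cache if present; otherwise
    build it, save it for future epochs, and return it."""
    path = os.path.join(cache_dir, f"{idx}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = build_fn(idx)  # expensive, e.g. CrystalNN graph construction
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data
```

The trade-off is exactly as described: rereading a pickled graph costs some I/O, but typically far less than rerunning the nearest-neighbor analysis from the CIF.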

Andrew Rosen added 24 commits March 25, 2020 00:59
New flag `save_torch` in `CIFData` allows crystal graphs to be saved as .pkl files, so that if the memory is flushed, the code can read directly from the .pkl files rather than regenerating the graphs. The `clean_torch` flag will remove all existing .pkl files before the run begins if set to True.
Previously, predict.py computed the MAE on the entire data folder. Now, predict.py computes the MAE on the train, val, and test set separately and outputs 3 csv files
Return train and val only if specified by user
Use --train-val-test in predict.py to get train/val/test stats
Notify user if .json files are being read
@Andrew-S-Rosen
Author

I'm closing this PR because I ended up making a bunch of commits unrelated to the original PR (apologies...). I can make a new branch of just the relevant commits if ever needed in the future.
