
Add support for Pymatgen StructureGraph with different local_env strategies #18

Closed
wants to merge 45 commits

Conversation

Andrew-S-Rosen

@Andrew-S-Rosen Andrew-S-Rosen commented Mar 23, 2020

Currently, the crystal graph is generated using a simple constant cutoff radius, which is not necessarily ideal. It would be great if cgcnn could use Pymatgen's pymatgen.analysis.graphs.StructureGraph class to generate the crystal graph via one of its several pymatgen.analysis.local_env strategies.

This PR adds a new argument to CIFData in data.py to allow the use of Pymatgen's StructureGraph. One can now pass an nn_method argument to CIFData, which must be the name (as a string) of a valid pymatgen.analysis.local_env.NearNeighbors object -- for example, nn_method='CrystalNN'. The new nn_method argument respects the user-specified max_num_nbr and radius: the number of neighbors will never exceed max_num_nbr, and no neighbor will be farther than radius from the central atom.
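As a rough illustration of the neighbor-capping behavior described above, here is a minimal sketch. The helper name cap_neighbors and the (index, distance) pair representation are assumptions for illustration only, not the actual data.py code:

```python
# Hypothetical sketch of how max_num_nbr and radius constrain a neighbor
# list returned by a NearNeighbors strategy; not the actual data.py code.
def cap_neighbors(nbrs, max_num_nbr, radius):
    """Keep the closest max_num_nbr neighbors, none farther than radius."""
    within = [(i, d) for i, d in nbrs if d <= radius]
    within.sort(key=lambda pair: pair[1])  # sort by distance
    return within[:max_num_nbr]

nbrs = [(0, 2.1), (1, 3.5), (2, 1.2), (3, 9.0)]
print(cap_neighbors(nbrs, max_num_nbr=2, radius=8.0))  # [(2, 1.2), (0, 2.1)]
```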

This PR is built on PR #16 by @CompRhys, which allows for the removal of zero padding. That is particularly important here because different local environments will generally have different numbers of neighbors.

@CompRhys

CompRhys commented Mar 24, 2020

As the codebase already depends on Pymatgen, and Pymatgen is unlikely to go anywhere, I think these are helpful improvements that increase the functionality of the code and make great use of #16.

I am not sure whether @txie-93 actively maintains this codebase, so I think for the PR to be merged the pre-trained models would need to be re-trained -- it would also be interesting to see what effect zero-padding had on model performance. Is this feasible?

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

@CompRhys Thanks for the comment! I would be more than happy to do this, but unfortunately I'm still debugging one remaining issue with my implementation. To the best of my knowledge, everything is working correctly with the StructureGraph (it's just a different way of generating the variable all_nbrs), but when training the model, the start of each epoch takes far longer with nn_method='CrystalNN' than with the original cutoff radius approach (i.e. nn_method=None) -- I'm talking like 10x slower. When testing on ~10 CIFs, I did not notice any difference in performance and everything seemed to work appropriately, but when I scaled it up to 1,000 or 10,000 CIFs the performance was quite bad, despite the fact that the number of neighbors is much lower when using the StructureGraph. I'm a bit perplexed because the variables in data.py seem fine, and I made sure there are very few modified lines of code compared to your PR (primarily lines 353-360). If/when I figure this out, I'd be happy to investigate the effect of padding and re-train the models.

That being said, I don't think the PR will be accepted automatically -- it says only those with write access can merge it, so presumably it would only go in if @txie-93 were to approve it.

Context:
With nn_method='CrystalNN' on a set of ~14,000 CIFs:
Epoch: [0][0/45] Time 1081.625 (1081.625) Data 1078.277 (1078.277) Loss 1.1046 (1.1046) MAE 0.962 (0.962)
Epoch: [0][10/45] Time 2.203 (112.008) Data 0.000 (109.472) Loss 0.9041 (0.9997) MAE 0.903 (0.934)
Epoch: [0][20/45] Time 2.644 (59.835) Data 0.000 (57.343) Loss 0.7171 (0.9069) MAE 0.765 (0.881)
Epoch: [0][30/45] Time 23.268 (74.369) Data 21.728 (71.944) Loss 0.5786 (0.8007) MAE 0.687 (0.816)
Epoch: [0][40/45] Time 1.229 (57.123) Data 0.000 (54.981) Loss 0.6453 (0.7380) MAE 0.725 (0.778)
Validation: [0/6] Time 968.401 (968.401) Loss 0.6168 (0.6168) MAE 0.727 (0.727)

With nn_method=None on the same set:
Epoch: [0][0/45] Time 108.573 (108.573) Data 101.745 (101.745) Loss 1.0877 (1.0877) MAE 0.997 (0.997)
Epoch: [0][10/45] Time 5.779 (15.200) Data 0.000 (9.287) Loss 0.7675 (0.9516) MAE 0.844 (0.917)
Epoch: [0][20/45] Time 5.862 (10.771) Data 0.000 (4.865) Loss 0.7384 (0.8276) MAE 0.794 (0.840)
Epoch: [0][30/45] Time 6.061 (9.256) Data 0.000 (3.341) Loss 0.6094 (0.7505) MAE 0.658 (0.794)
Epoch: [0][40/45] Time 6.039 (8.456) Data 0.000 (2.531) Loss 0.4387 (0.6933) MAE 0.582 (0.757)
Validation: [0/6] Time 50.250 (50.250) Loss 0.5987 (0.5987) MAE 0.730 (0.730)

Perhaps I just need to tweak some of the settings. Hopefully that's all it is.

@CompRhys

CompRhys commented Mar 24, 2020

@arosen93 I would guess it's an issue with caching in the dataloaders? Have you run it through a profiler?
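For reference, a minimal way to time where a data-loading step spends its time is Python's built-in cProfile. The helper name profile_call is a placeholder, not part of cgcnn:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, top=10):
    """Run fn(*args) under cProfile and print the top cumulative-time entries."""
    pr = cProfile.Profile()
    result = pr.runcall(fn, *args)
    buf = io.StringIO()
    pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(top)
    print(buf.getvalue())
    return result

# e.g. profile one full pass over a dataloader:
# profile_call(lambda: [batch for batch in train_loader])
```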

The reason for re-training the models is that, if I were @txie-93, I wouldn't merge extensions to the code unless the PR kept the same functionality and the default model was as described in his original paper (not that zero-padding is mentioned in the paper). This is why my PR was a request for comment [RFC]: I wanted to know whether he would in principle merge the changes before re-training the models.

@Andrew-S-Rosen
Author

@CompRhys That might be it. Thanks for the suggestion. I haven't run it through a profiler (admittedly, I have not used a profiler before) but can investigate further on that.

Noted about re-training the models. Yes, it'd definitely be good to make sure things are kept the same there.

@CompRhys

@arosen93 I would expect using Pymatgen to get the neighbours to just be straight-up slower than using a fixed number of neighbours, so there may not be any issue with your code; it may just be the penalty you have to pay. If caching works correctly, you will only pay this cost once, on the first epoch, as the data sets here are typically small enough to hold everything in RAM, I would think.

(removing zero-padding probably also slows the code down vs dense multiplications but I never tested this)

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

@CompRhys -- Yes, the Pymatgen nearest-neighbor searching is definitely slower, so there's an up-front penalty for that. In my tests, though, the start of every new epoch is extremely slow -- it's not just epoch 0, sadly. I'm running the profiler right now to see if that leads to any ideas.

I did some quick tests on the removal of zero-padding that you implemented, by setting radius small enough that max_num_nbr isn't hit for all the materials. With nn_method=None, the code did not slow down by much at all without zero-padding, but I don't have hard numbers at the moment.

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

@CompRhys -- After running it through the profiler, I believe you are right. The issue is in the dataloader, specifically the call to __getitem__ in data.py, which I believe is causing a slowdown not only at the beginning of the first epoch (as would be expected) but at the beginning of every epoch, in both training and validation. I specify 28 workers to load the data, each core having 4.5 GB of memory for a total of ~128 GB. I agree it should be able to store everything, though. I will have to figure out how to fix that. Thanks for pointing me in what I think is the right direction.

@pgg1610

pgg1610 commented Mar 24, 2020

If I understand correctly, changing the method from the default get_all_neighbors to VoronoiNN causes a significant slow-down in the algorithm. However, what is interesting is that atom_fea and neighbor_fea for the training, validation, and test sets have already been created through the dataloader before the main nn.Module is called, so I was under the impression that, once collated into their respective tensors, these methods are independent of the CGCNN. Another thing to try would be the VoronoiConnectivity submodule from pymatgen.analysis.structure_analyzer, which is at the heart of the chem_env module. That might reduce function calls.

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

Hey @pgg1610! Yeah, all the Pymatgen nearest-neighbor methods are slow (compared to a simple call to get_all_neighbors), but this is a price I'd potentially be willing to pay if it's just a one-time cost before running the neural network. Even if it takes, in the extreme case, 2 s per function call (i.e. per material), you can get 10,000 structure graphs in ~12 minutes when parallelized over something like 28 workers.

Technically, the dataloader does load the data before calling the neural network, but the data stored in CIFData is not necessarily kept in memory permanently. You can see in data.py that there is a __getitem__ function, so if you have dataset = CIFData('data/...'), the information in dataset[0] or dataset[1] (and so on) is kept in memory only until the RAM is used up. Should the RAM fill up, the contents of dataset[i] have to be recalculated the next time it is indexed. I think this is what @CompRhys is referring to -- the caching of the loaded structures. I'm not positive that's the issue. Still debugging. I may look into other routes to generate the graphs via Pymatgen, but the StructureGraph tool is obviously well-suited for a CGCNN.
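For what it's worth, the per-item caching pattern described above can be sketched with a toy stand-in for CIFData (ToyDataset and its builds counter are invented for illustration; the real class builds a crystal graph in __getitem__):

```python
import functools

class ToyDataset:
    """Toy stand-in for CIFData: each item is built once per process,
    then served from an in-memory cache on subsequent accesses."""
    def __init__(self, n):
        self.n = n
        self.builds = 0  # counts how many times items were built from scratch

    def __len__(self):
        return self.n

    @functools.lru_cache(maxsize=None)  # cache lives per worker process
    def __getitem__(self, idx):
        self.builds += 1  # expensive graph construction would happen here
        return 2 * idx

ds = ToyDataset(5)
ds[3]; ds[3]; ds[3]
print(ds.builds)  # prints 1: later accesses hit the cache
```

Note that with multiple dataloader workers each process holds its own cache, so a cache that fits in one process's share of RAM may not fit when memory is divided among many workers.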

Example using nn_method='CrystalNN' (note: indexing starts at 1)
Epoch: [0][1/45] Time 1671.479 (1671.479)
Epoch: [0][2/45] Time 2.593 (837.036)
...
Epoch: [0][45/45] Time 0.688 (77.424)
Validation: [1/6] Time 1636.080 (1636.080)
Validation: [2/6] Time 0.610 (818.345)
...
Validation: [6/6] Time 0.317 (277.829)
Epoch: [1][1/45] Time 1655.360 (1655.360)
Epoch: [1][2/45] Time 10.167 (832.763)
...
Epoch: [1][45/45] Time 0.647 (77.788)
Validation: [1/6] Time 1405.139 (1405.139)
Validation: [2/6] Time 195.614 (800.377)
...
Validation: [6/6] Time 0.326 (274.190)

Example using get_all_neighbors:
Epoch: [0][1/45] Time 31.781 (31.781)
Epoch: [0][2/45] Time 9.049 (20.415)
...
Validation: [1/6] Time 22.292 (22.292)
Validation: [2/6] Time 3.009 (12.650)
...
Validation: [6/6] Time 1.642 (6.025)

Epoch: [1][1/45] Time 31.617 (31.617)
Epoch: [1][2/45] Time 10.802 (21.210)
...
Epoch: [1][45/45] Time 3.181 (7.514)
Validation: [1/6] Time 22.242 (22.242)
Validation: [2/6] Time 2.979 (12.610)
...
Validation: [6/6] Time 1.586 (6.034)

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 24, 2020

@CompRhys -- I'm starting to think that's along the lines of what the issue is. As you can see from the log file I posted before, even when using get_all_neighbors (and I have confirmed this with the "original" cgcnn code as well), there is a spike in the runtime at the start of each epoch when workers=28. In contrast, when using workers=0, the time for each iteration of a given epoch is roughly constant. The log files are attached below:
get_all_neighbors.txt
get_all_neighbors_1cpu.txt
nn_method.txt

The corresponding profiler results are also attached and indicate that calls to the dataloader take up a lot of the time -- more so than the forward/backward calls to the neural network in the case of nn_method='CrystalNN': profiler_results.zip.

I know this isn't your problem to fix, so I appreciate the suggestions you've provided. Hopefully I am close to finding the fix.

@Andrew-S-Rosen
Author

Andrew-S-Rosen commented Mar 25, 2020

@CompRhys and @pgg1610:

I want to thank you again for taking the time to help me troubleshoot this. I ended up figuring it out. As @CompRhys correctly surmised from the beginning, the issue was entirely due to the crystal graphs being flushed from the in-memory cache. It turns out I did indeed have memory overload, particularly when the job is split over several workers where each worker has only a few GB of memory. This caused all the feature data to be dumped from memory and regenerated at the start of each new epoch, when all the training data is needed again.

This issue has nothing to do with my implementation of nn_method and is present even in the original code by @txie-93 (though it is likely not observed in most use-cases). I only noticed it because Pymatgen's StructureGraph methods are slower than a simple radial cutoff distance, making the problem more noticeable when it does occur. However, to the best of my understanding it has nothing to do with this PR, and everything is working well both with the removal of zero-padding in PR #16 and with the StructureGraph features here.

Should anyone run into this problem in the future, there are a few ways to tackle it. The first is simply to request more memory if possible. Another is to use a full node but fewer CPU workers, so each worker has more memory; of course, this slows down the initial file I/O fairly substantially. A third option, which I am exploring, is saving all the data returned by CIFData to a folder. Then, if the memory gets flushed, the code can read in these saved PyTorch files rather than regenerating the crystal graph from scratch. There is a file I/O cost for reading the files, but the idea is that it hopefully costs less than generating the graph data again.
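That last disk-caching idea can be sketched as follows. Plain pickle stands in here for the saved PyTorch files, and load_or_build / build_fn are hypothetical names, not the PR's actual implementation:

```python
import os
import pickle

def load_or_build(idx, cache_dir, build_fn):
    """Return item idx from the on-disk cache if present; otherwise
    build it, save it for future epochs, and return it."""
    path = os.path.join(cache_dir, f"{idx}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    data = build_fn(idx)  # expensive, e.g. CrystalNN graph construction
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data
```

The trade-off is exactly as described: rereading a pickled graph costs some I/O, but typically far less than rerunning the nearest-neighbor analysis from the CIF.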

Andrew Rosen added 24 commits March 25, 2020 00:59
New flag `save_torch` in `CIFData` allows crystal graphs to be saved as .pkl files, so that if the memory is flushed, the code can read directly from the .pkl files rather than regenerating the graphs. The `clean_torch` flag will remove all existing .pkl files before the run begins if set to True.
Previously, predict.py computed the MAE on the entire data folder. Now, predict.py computes the MAE on the train, val, and test set separately and outputs 3 csv files
Return train and val only if specified by user
Use --train-val-test in predict.py to get train/val/test stats
Notify user if .json files are being read
@Andrew-S-Rosen
Author

I'm closing this PR because I ended up making a bunch of commits unrelated to the original PR (apologies...). I can make a new branch of just the relevant commits if ever needed in the future.
