Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add LRU cache to structure matcher #4036

Merged
merged 17 commits into from
Sep 6, 2024

Conversation

kbuma
Copy link
Contributor

@kbuma kbuma commented Sep 3, 2024

Summary

This adds an LRU cache to the structure matcher, specifically to the StructureMatcher._get_reduced_structure function.

This suggested improvement is motivated by the performance of the AflowPrototypeMatcher._match_prototype function (and in turn the robocrys builder that uses it). We believe this will speed up that calculation by ~288x since the sometimes expensive calculation in StructureMatcher._get_reduced_structure will only need to be performed once per structure rather than the 288 times it is performed now.

@DanielYang59
Copy link
Contributor

DanielYang59 commented Sep 4, 2024

Beautiful! That's an impressive performance gain!

I guess we should also be aware of the potential memory leakage issue for lru_cache when applied on methods (I believe a reference to the cls would also be held in the case of class methods, but I could be wrong here):

Using the functools.lru_cache and functools.cache decorators on methods can lead to memory leaks, as the global cache will retain a reference to the instance, preventing it from being garbage collected.

Also here is a more detailed explanation from Anthony Sottile .

I guess we could just make it a staticmethod to avoid this unless there's some reason to make it classmethod that I'm not aware of?

@shyuep
Copy link
Member

shyuep commented Sep 4, 2024

I would move the method out of the class and just make it a method. This will resolve the memory leak issue (though I don't think it is a significant problem in our case).

@shyuep
Copy link
Member

shyuep commented Sep 4, 2024

I would be very wary of the current hash function. That looks like an extremely expensive hash function. Furthermore, depending on transformations, this hash violates some basic rules on hashes. I would argue it is better to use the formula as a hash.

@esoteric-ephemera
Copy link
Contributor

@shyuep, it could be expensive but my concern about using the formula is uniqueness - thinking about polymorphs, displacements, different magnetic configs

For transformations, shouldn't the hash change to reflect that the underlying object has changed? Maybe this is my misunderstanding of how hashes should apply to mutables

My thinking is: if I run a transformation on a structure to displace a site, the original hash should no longer apply to the transformed structure, right?

@shyuep
Copy link
Member

shyuep commented Sep 4, 2024

@esoteric-ephemera The rule for hashes is that they do not have to be unique. Ultimately, the __eq__ function is what must be unique for different objects. A valid but bad hash function would be return 42. This is a common implementation in simple "Hello World" type tutorials for many languages.

See https://www.geeksforgeeks.org/what-are-hash-functions-and-how-to-choose-a-good-hash-function

Efficiency and minimizing collisions is the key. But collisions need not be 0 (that is what __eq__ is for). For a structure, the formula is a pretty good hash function - equal structures definitely have the same formula, and formula is cheap to compute. There will be collisions but they should be relatively minimal (polymorphs basically).

@kbuma
Copy link
Contributor Author

kbuma commented Sep 4, 2024

@shyuep @esoteric-ephemera An alternative to adding a hash function to Structure that would work for this use case is to create an IStructure copy in the class method and call the new cached function with that copy. It would be something like this:

changing the class method to

@classmethod
def _get_reduced_structure(cls, struct: Structure, primitive_cell: bool = True, niggli: bool = True) -> Structure:
    immutable_struct = IStructure.from_dict(struct.as_dict())
    return _get_reduced_structure(immutable_struct, primitive_cell, niggli)

and the new function to

@lru_cache(maxsize=LRU_CACHE_SIZE)
def _get_reduced_structure(
  struct: IStructure, primitive_cell: bool, niggli: bool
) -> Structure:
  """Helper method to find a reduced structure."""
  if niggli:
      struct = struct.get_reduced_structure(reduction_algo="niggli")
  if primitive_cell:
      struct = struct.get_primitive_structure()
  return Structure.from_dict(struct.as_dict())

@esoteric-ephemera
Copy link
Contributor

I think we have a good solution now that doesn't require hashing Structure, and the failing ruff stuff is unrelated to this PR -should be ready for review

@shyuep shyuep merged commit 51ea7de into materialsproject:master Sep 6, 2024
42 of 43 checks passed
@shyuep
Copy link
Member

shyuep commented Sep 6, 2024

Merged. Thanks.

@esoteric-ephemera
Copy link
Contributor

Could we push a release relatively soon (or a release candidate)? Would help with builder development on our side

@janosh janosh added performance Some functionality is too slow or regressed analysis Concerning pymatgen.analysis labels Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analysis Concerning pymatgen.analysis performance Some functionality is too slow or regressed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants