
Mapping attribution of tokens between encoder and decoder #75

Merged · 8 commits · Feb 16, 2022

Conversation

whitead (Contributor) commented on Nov 24, 2021

Fixes issue #48. Following @alstonlo's idea, the attribution is stored in the graph. I decided to make this a WIP since there are multiple participants in that issue. Atoms/bonds are attributed to a list of tokens (e.g., a branch token and an atom token); a rough sketch of the expected output shape follows the checklist below.

  • Add attribution container to graph
  • Attribute SELFIES when building graph
  • Map attributions to SMILES when consuming graph
  • Attribute SMILES when building graph
  • Map attributions to SELFIES when consuming graph
  • Docs (?)
  • Release notes
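
For illustration, here is a rough sketch of the kind of output shape this produces, pairing each output token with the input tokens it came from (the specific tokens and indices below are made up, not taken from the actual encoder):

```python
# Illustrative attribution shape only; the indices and tokens are invented
# for demonstration and are not output of the actual encoder/decoder.
attribution = [
    ("C", [(0, "[C]")]),                    # output atom <- single input token
    ("P", [(2, "[Branch1]"), (4, "[P]")]),  # branch atom <- branch + atom tokens
]

for out_token, sources in attribution:
    print(out_token, "<-", sources)
```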

whitead changed the title from "[WIP] Mapping attribution of tokens between encoder and decoder" to "Mapping attribution of tokens between encoder and decoder" on Nov 28, 2021
whitead (Contributor, Author) commented on Nov 28, 2021

@MarioKrenn6240 and @alstonlo: ready for review.

alstonlo (Collaborator) commented on Dec 9, 2021

Hi @whitead,

Thank you for this PR! I will get to reviewing this as soon as possible (likely next week). I apologize for the delay!

alstonlo (Collaborator) commented

Hi @whitead,

Thank you for the PR, and apologies once again for the delay! I do have some minor suggestions:

  • The attributions in MolecularGraph are implemented as a dictionary from Union[Atom, DirectedBond] to a list. This may cause some issues because Atom and DirectedBond are not hashable classes. One potential solution (other than making the classes hashable) could be to store the attributions within each atom and bond object (see the sketch after this list).
  • In _derive_mol_from_symbols, the attribution stack is added to via a call of the form list + [new_entry]. If possible, I suggest using list.append(new_entry) instead, because list + [new_entry] copies the entire list and can be inefficient.
  • In the README example, I noticed that the P in the output SMILES C1NC(P)CC1 was attributed to a [Branch] character. Was there a reason the index symbols were not also attributed here?
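
For illustration, a minimal sketch of the second option from the first bullet, storing the attribution on the objects themselves rather than keying a dictionary on them (the class and field names here are assumptions, not the PR's actual code):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Atom:
    element: str
    # Attribution kept directly on the object, sidestepping the need to use
    # (unhashable) Atom instances as dictionary keys. Field name and tuple
    # shape are illustrative only.
    attribution: List[Tuple[int, str]] = field(default_factory=list)

a = Atom("P")
a.attribution.append((4, "[P]"))  # append in place rather than list + [entry]
print(a)
```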

One more involved extension that I propose is creating a separate AttributionMap (or EncoderAttribution and DecoderAttribution) object, which would be returned instead of the List[Tuple[str, List[Tuple[int, str]]]] type that is currently returned (a rough sketch follows the list below). I have a couple of reasons for this suggestion:

  1. It would be more flexible and extensible than returning the raw list. If we wanted to return more information in the attribution map or change its format, then we would have to rewrite the function signature of encoder and decoder each time, which breaks backwards compatibility.
  2. We can cache more computationally expensive or messy logic within the AttributionMap class. For example, instead of maintaining both indices and symbols, the attribution map can take in indices (and the original SMILES or SELFIES) and deduce the corresponding symbol. This has the benefit of keeping encoder and decoder fast and simple.
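
A rough sketch of what such a wrapper could look like, assuming hypothetical names (AttributionMap, symbols_for) that are not part of the current PR:

```python
from typing import List, Tuple

class AttributionMap:
    """Hypothetical wrapper around the raw attribution data.

    Stores only token indices plus the original token list and resolves
    symbols lazily, so encoder/decoder can stay simple and the return type
    can evolve without breaking their signatures.
    """

    def __init__(self, input_tokens: List[str],
                 links: List[Tuple[int, List[int]]]):
        self._tokens = input_tokens   # tokens of the original string
        self._links = links           # (output index, contributing input indices)

    def symbols_for(self, output_index: int) -> List[str]:
        # Deduce the contributing symbols on demand instead of storing them.
        for out_idx, in_indices in self._links:
            if out_idx == output_index:
                return [self._tokens[i] for i in in_indices]
        return []

amap = AttributionMap(["[C]", "[Branch1]", "[C]", "[P]"], [(3, [1, 3])])
print(amap.symbols_for(3))  # ['[Branch1]', '[P]']
```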

jannisborn (Contributor) commented

Hi @alstonlo, is there any update on this issue?
@whitead has done almost the entire job here, and the remaining items don't seem to be strictly related to the feature itself; they seem to be more of a refactoring driven by package style choices.

whitead (Contributor, Author) commented on Feb 8, 2022

Yes, it's been a while since I looked at this.

  • Hashing: get_attribution should be called with the same molecular graph, so the default class hash should be fine unless you foresee a situation where the Atom/DirectedBond being queried for attribution is not from the same graph. In my opinion that is not possible, and in that case the function should return None, as it does in the current implementation (see the sketch after this list).
  • The copy is intentional: we want a new list rather than appending in place. Maybe I'm missing your point, though?
  • For the token attribution, I added more details to the README to clear that up.
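
For illustration, a small sketch of the point about the default (identity-based) hash; the class and dictionary below are stand-ins, not the PR's get_attribution implementation:

```python
# Default object hashing is identity-based, so instances from the same graph
# work fine as dictionary keys, and querying with an instance from a
# different graph simply yields None.
class Atom:
    def __init__(self, element):
        self.element = element

attributions = {}
a = Atom("C")
attributions[a] = [(0, "[C]")]    # same instance -> lookup works by identity

other = Atom("C")                 # equal-looking atom from another graph
print(attributions.get(other))    # None, mirroring get_attribution's behaviour
```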

The broader proposed changes sound like a good refactor. Could we merge this first, though, since there are current user needs, and explore efficiency or more complex attribution as the use cases arise? This PR satisfies my current needs for attribution.

MarioKrenn6240 (Collaborator) commented on Feb 10, 2022

Sorry this took so long; it will be resolved in a few days. (We had, and still have, long discussions about this and the other PRs internally, which took a long time.) Thanks again, @whitead! :)

Short summary: our concern (after some timing tests) for this and the other PRs was that we don't want to affect the behaviour of the main code when the attribution map is not required. So we were thinking about the simplest ways to modify the PR so that it does not directly impact the encoder and decoder (for example, a flag or a separate function call). If you have any concrete ideas on how to do this, please let us know.
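
For illustration, a minimal sketch of the flag idea; the function, keyword name, and return shape here are assumptions for demonstration, not the real selfies API:

```python
from typing import List, Tuple

def decode(selfies_tokens: List[str], attribute: bool = False):
    # The attribution bookkeeping runs only when requested, so the default
    # call path is unaffected. All names here are illustrative.
    smiles_tokens: List[str] = []
    attribution: List[Tuple[str, List[Tuple[int, str]]]] = []
    for i, token in enumerate(selfies_tokens):
        out = token.strip("[]")              # stand-in for the real translation
        smiles_tokens.append(out)
        if attribute:                        # opt-in: record token provenance
            attribution.append((out, [(i, token)]))
    smiles = "".join(smiles_tokens)
    return (smiles, attribution) if attribute else smiles

# Default behaviour unchanged; the flag opts in to the extra output.
print(decode(["[C]", "[O]", "[C]"]))
print(decode(["[C]", "[O]", "[C]"], attribute=True))
```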

whitead (Contributor, Author) commented on Feb 14, 2022

Per @MarioKrenn6240's comment, he and the SELFIES developers did some testing and found this PR to be too slow. I recorded the timings from the test code with pytest. I had trouble measuring the speed change because it is so small and sensitive to filesystem activity. With repeated testing I found a roughly 1% increase, which I believe is marginal, but I understand that it is an important consideration for the package, so I spent time improving the speed. The PR is now 0.3% faster than master.

Master

15.00s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
9.55s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.32s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.05s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
5.97s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

PR (1.3% slower than master) as of 4c99c47

15.19s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
10.33s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.97s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.59s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
6.30s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

Improvement 1 (1.1% slower than master): Removing SinglyLinkedList code (not part of PR) to improve speed

15.17s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
10.10s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
9.48s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
5.98s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

Improvement 2 (0.2% slower than master): Flag for decoder

15.03s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
9.94s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
9.54s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
6.02s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

Improvement 3 (0.3% faster than master): Flag for mol graph

14.96s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
9.69s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.54s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.34s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
5.87s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

MarioKrenn6240 (Collaborator) commented

Wow, thanks @whitead, it's impressive that you added a new feature and made the code faster at the same time :). We will merge it; v2.1.0 will be announced in a bit (we want to add the other two PRs too, and just need to do a few small checks).

MarioKrenn6240 merged commit 20aa4b3 into aspuru-guzik-group:master on Feb 16, 2022
whitead deleted the issue-48 branch on Feb 17, 2022
MarioKrenn6240 added a commit referencing this pull request on Feb 21, 2022: Attribution map improvements (#75)