
Mapping attribution of tokens between encoder and decoder #75

Merged · 8 commits · Feb 16, 2022

Conversation

whitead (Contributor) commented on Nov 24, 2021

Fixes issue #48. Following @alstonlo's idea, the attribution is stored in the graph. I decided to make this a WIP since there are multiple participants in that issue. Atoms/bonds are attributed to a list of tokens (e.g., a branch token and an atom token); a rough sketch of the expected output shape follows the checklist below.

  • Add attribution container to graph
  • Attribute SELFIES when building graph
  • Map attributions to SMILES when consuming graph
  • Attribute SMILES when building graph
  • Map attributions to SELFIES when consuming graph
  • Docs (?)
  • Release notes
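
For illustration, here is a rough sketch of the kind of output shape this produces, pairing each output token with the input tokens it came from (the specific tokens and indices below are made up, not taken from the actual encoder):

```python
# Illustrative attribution shape only; the indices and tokens are invented
# for demonstration and are not output of the actual encoder/decoder.
attribution = [
    ("C", [(0, "[C]")]),                    # output atom <- single input token
    ("P", [(2, "[Branch1]"), (4, "[P]")]),  # branch atom <- branch + atom tokens
]

for out_token, sources in attribution:
    print(out_token, "<-", sources)
```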

whitead changed the title from "[WIP] Mapping attribution of tokens between encoder and decoder" to "Mapping attribution of tokens between encoder and decoder" on Nov 28, 2021
whitead (Contributor, Author) commented on Nov 28, 2021

@MarioKrenn6240 and @alstonlo: ready for review.

alstonlo (Collaborator) commented on Dec 9, 2021

Hi @whitead,

Thank you for this PR! I will get to reviewing this as soon as possible (likely next week). I apologize for the delay!

alstonlo (Collaborator) commented

Hi @whitead,

Thank you for the PR, and apologies once again for the delay! I do have some minor suggestions:

  • The attributions in MolecularGraph are implemented as a dictionary from Union[Atom, DirectedBond] to a list. This may cause some issues because Atom and DirectedBond are not hashable classes. One potential solution (other than making the classes hashable) could be to store the attributions within each atom and bond object (see the sketch after this list).
  • In _derive_mol_from_symbols, the attribution stack is added to via a call of the form list + [new_entry]. If possible, I suggest using list.append(new_entry) instead, because list + [new_entry] copies the entire list and can be inefficient.
  • In the README example, I noticed that the P in the output SMILES C1NC(P)CC1 was attributed to a [Branch] character. Was there a reason the index symbols were not also attributed here?
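
For illustration, a minimal sketch of the second option from the first bullet, storing the attribution on the objects themselves rather than keying a dictionary on them (the class and field names here are assumptions, not the PR's actual code):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Atom:
    element: str
    # Attribution kept directly on the object, sidestepping the need to use
    # (unhashable) Atom instances as dictionary keys. Field name and tuple
    # shape are illustrative only.
    attribution: List[Tuple[int, str]] = field(default_factory=list)

a = Atom("P")
a.attribution.append((4, "[P]"))  # append in place rather than list + [entry]
print(a)
```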

One more involved extension that I propose is creating a separate AttributionMap (or EncoderAttribution and DecoderAttribution) object, which would be returned instead of the List[Tuple[str, List[Tuple[int, str]]]] type that is currently returned (a rough sketch follows the list below). I have a couple of reasons for this suggestion:

  1. It would be more flexible and extensible than returning the raw list. If we wanted to return more information in the attribution map or change its format, then we would have to rewrite the function signature of encoder and decoder each time, which breaks backwards compatibility.
  2. We can cache more computationally expensive or messy logic within the AttributionMap class. For example, instead of maintaining both indices and symbols, the attribution map can take in indices (and the original SMILES or SELFIES) and deduce the corresponding symbol. This has the benefit of keeping encoder and decoder fast and simple.
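
A rough sketch of what such a wrapper could look like, assuming hypothetical names (AttributionMap, symbols_for) that are not part of the current PR:

```python
from typing import List, Tuple

class AttributionMap:
    """Hypothetical wrapper around the raw attribution data.

    Stores only token indices plus the original token list and resolves
    symbols lazily, so encoder/decoder can stay simple and the return type
    can evolve without breaking their signatures.
    """

    def __init__(self, input_tokens: List[str],
                 links: List[Tuple[int, List[int]]]):
        self._tokens = input_tokens   # tokens of the original string
        self._links = links           # (output index, contributing input indices)

    def symbols_for(self, output_index: int) -> List[str]:
        # Deduce the contributing symbols on demand instead of storing them.
        for out_idx, in_indices in self._links:
            if out_idx == output_index:
                return [self._tokens[i] for i in in_indices]
        return []

amap = AttributionMap(["[C]", "[Branch1]", "[C]", "[P]"], [(3, [1, 3])])
print(amap.symbols_for(3))  # ['[Branch1]', '[P]']
```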

jannisborn (Contributor) commented

Hi @alstonlo, is there any update on this issue?
@whitead has done almost the entire job here, and the remaining items don't seem to be strictly related to the feature itself; they seem to be more of a refactoring driven by package style choices.

whitead (Contributor, Author) commented on Feb 8, 2022

Yes, it's been a while since I looked at this.

  • Hashing: get_attribution should be called with the same molecular graph, so the default class hash should be fine unless you foresee a situation where the Atom/DirectedBond being queried for attribution is not from the same graph. In my opinion that is not possible, and in that case the function should return None, as it does in the current implementation (see the sketch after this list).
  • The copy is intentional: we want a new list rather than appending in place. Maybe I'm missing your point, though?
  • For the token attribution, I added more details to the README to clear that up.
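
For illustration, a small sketch of the point about the default (identity-based) hash; the class and dictionary below are stand-ins, not the PR's get_attribution implementation:

```python
# Default object hashing is identity-based, so instances from the same graph
# work fine as dictionary keys, and querying with an instance from a
# different graph simply yields None.
class Atom:
    def __init__(self, element):
        self.element = element

attributions = {}
a = Atom("C")
attributions[a] = [(0, "[C]")]    # same instance -> lookup works by identity

other = Atom("C")                 # equal-looking atom from another graph
print(attributions.get(other))    # None, mirroring get_attribution's behaviour
```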

The broader proposed changes sound like a good refactor. Could we merge this first, though, since there are current user needs, and explore efficiency or more complex attribution as the use cases arise? This PR satisfies my current needs for attribution.

MarioKrenn6240 (Collaborator) commented on Feb 10, 2022

Sorry this took so long; it will be resolved in a few days. (We had, and still have, long discussions about this and the other PRs internally, which took a long time.) Thanks again, @whitead! :)

Short summary: our concern (after some timing tests) for this and the other PRs was that we don't want to affect the behaviour of the main code when the attribution map is not required. So we were thinking about the simplest ways to modify the PR so that it does not directly impact the encoder and decoder (for example, a flag or a separate function call). If you have any concrete ideas on how to do this, please let us know.
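
For illustration, a minimal sketch of the flag idea; the function, keyword name, and return shape here are assumptions for demonstration, not the real selfies API:

```python
from typing import List, Tuple

def decode(selfies_tokens: List[str], attribute: bool = False):
    # The attribution bookkeeping runs only when requested, so the default
    # call path is unaffected. All names here are illustrative.
    smiles_tokens: List[str] = []
    attribution: List[Tuple[str, List[Tuple[int, str]]]] = []
    for i, token in enumerate(selfies_tokens):
        out = token.strip("[]")              # stand-in for the real translation
        smiles_tokens.append(out)
        if attribute:                        # opt-in: record token provenance
            attribution.append((out, [(i, token)]))
    smiles = "".join(smiles_tokens)
    return (smiles, attribution) if attribute else smiles

# Default behaviour unchanged; the flag opts in to the extra output.
print(decode(["[C]", "[O]", "[C]"]))
print(decode(["[C]", "[O]", "[C]"], attribute=True))
```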

whitead (Contributor, Author) commented on Feb 14, 2022

Per @MarioKrenn6240's comment, he and the SELFIES developers did some testing and found this PR to be too slow. I recorded the timings from the test code with pytest. I had trouble measuring the speed change because it is so small and sensitive to filesystem activity. With repeated testing I found a roughly 1% increase, which I believe is marginal, but I understand that it is an important consideration for the package, so I spent time improving the speed. The PR is now 0.3% faster than master.

Master

15.00s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
9.55s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.32s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.05s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
5.97s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

PR (1.3% slower than master) as of 4c99c47

15.19s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
10.33s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.97s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.59s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
6.30s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

Improvement 1 (1.1% slower than master): Removing SinglyLinkedList code (not part of PR) to improve speed

15.17s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
10.10s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
9.48s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
5.98s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

Improvement 2 (0.2% slower than master): Flag for decoder

15.03s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
9.94s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.62s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
9.54s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
6.02s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

Improvement 3 (0.3% faster than master): Flag for mol graph

14.96s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path1]
9.69s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path9]
9.54s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path11]
9.34s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path3]
5.87s call     tests/test_on_datasets.py::test_roundtrip_translation[test_path14]

MarioKrenn6240 (Collaborator) commented

Wow, thanks @whitead, it's impressive that you added a new feature and made the code faster at the same time :). We will merge it; v2.1.0 will be announced in a bit (we want to add the other two PRs too, and just need to do a few small checks).

MarioKrenn6240 merged commit 20aa4b3 into aspuru-guzik-group:master on Feb 16, 2022
whitead deleted the issue-48 branch on Feb 17, 2022
MarioKrenn6240 added a commit referencing this pull request on Feb 21, 2022: Attribution map improvements (#75)