Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attribution map improvements (#75) #78

Merged
merged 3 commits into from
Feb 21, 2022
Merged

Attribution map improvements (#75) #78

merged 3 commits into from
Feb 21, 2022

Conversation

jannisborn
Copy link
Contributor

@jannisborn jannisborn commented Feb 19, 2022

Context:
Related to the issue #48, solved by @whitead in #75, I am addressing minor inconsistencies that I detailed here. PR ready for review.

Content:
Adjusting how the index is build in the attribution maps.

  1. In the encoder attributions, we now start counting from 0, just like for the decoder.
  2. Encoder does not skip anymore over bond tokens

Example:

smiles = 'C=CO'
selfie, attribution = encoder(smiles, attribute=True)
print(selfie)
for a in attribution:
    print(a)

Old result:

[C][=C][O]
('[C]', [(1, 'C')])
('[=C]', [(2, 'C')])
('[O]', [(3, 'O')])

Behavior new:

[C][=C][O]
('[C]', [(0, 'C')])
('[=C]', [(2, 'C')])
('[O]', [(3, 'O')])

Closes #48

@whitead
Copy link
Contributor

whitead commented Feb 19, 2022

Thanks for catching this! Maybe you could add a short unit test based on your example here so that we can keep track of this behavior?

@jannisborn
Copy link
Contributor Author

Thanks for the quick feedback @whitead. I expanded your test to guarantee the indices are correct 👍🏼

On a separate note, I was a bit surprised about your example from the unittest: The order of the selfies tokens in the attribution map does not correspond to their oder in the returned string. This does not seem right to me. I verified with a checkout to an older commit, this behavior is not related to my PR (the indices here correspond to my latest changes, but the ordering is identical in both cases). Any thoughts about this?

smiles:  C1([O-])C=CC=C1Cl
selfies:  [C][Branch1][C][O-1][C][=C][C][=C][Ring1][=Branch1][Cl]
('[C]', [(0, 'C')])
('[O-1]', [(3, '[O-]')])
('[Branch1]', [(3, '[O-]')])
('[C]', [(3, '[O-]')])
('[C]', [(5, 'C')])
('[=C]', [(7, 'C')])
('[C]', [(8, 'C')])
('[=C]', [(10, 'C')])
('[Ring1]', None)
('[=Branch1]', None)
('[Cl]', [(12, 'Cl')])

I used this code:

import selfies as sf

smiles = "C1([O-])C=CC=C1Cl"
s, am = sf.encoder(smiles, attribute=True)
print('smiles: ', smiles)
print('selfies: ', s)
for a in am:
    print(a)

@MarioKrenn6240 MarioKrenn6240 merged commit fdb1789 into aspuru-guzik-group:master Feb 21, 2022
@whitead
Copy link
Contributor

whitead commented Feb 22, 2022

@jannisborn Thanks - That is intended behavior - encoder attribution is ordered by input SMILES token. I can see now that would be relatively useless though since it's non-trivial to align them. I'll open an issue to add indices on SELFIES tokens and/or sort them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: Retrieve mapping betwewen SMILES and SELFIES tokens
3 participants