Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

one question about the canonicalize_products.py #10

Closed
teslacool opened this issue Mar 16, 2021 · 5 comments
Closed

one question about the canonicalize_products.py #10

teslacool opened this issue Mar 16, 2021 · 5 comments

Comments

@teslacool
Copy link

@chaoyan1037 If I understand correctly, the SMILES in train.csv before processed by the canonicalize_product.py is canonicalized by RDKit through smiles=Chem.MolToSmiles(mol, canonical=True). So, what is the purpose to permute the atommapnumber in this canonicalize_product.py and is there any reference work?

@chaoyan1037
Copy link
Contributor

@teslacool It is a good catch. We found there was an information leak within the USPTO dataset itself. The order of the atoms within the product SMILES may indicate the reaction atoms. To be more specific, we found that for most USPTO products, the first atoms of the product SMILES are usually reaction atoms. You may look into the reaction atoms yourself.
So we use canonicalize_products.py to rearrange the atom order to be the same as the canonical atom order, hoping to remove the potential information leak.

@teslacool
Copy link
Author

Ok. thanks for your answer.

@chaoyan1037
Copy link
Contributor

We have an important update of our method. Please refer to the readme for more details.

@YanjingLiLi
Copy link

For this leakage, does it only appear in your developed model, or it's actually a general problem for all the models using this dataset?

@najwalb
Copy link

najwalb commented Sep 23, 2023

@YanjingLiLi I believe it's a general problem, it was mentioned by other methods later as well: https://openreview.net/pdf?id=SnONpXZ_uQ_

Methods using atom-mapping information shld be particularly careful with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants